Embedding Layer Size Rule

muellerzr · July 17, 2019, 10:03pm

Do we have any documentation as to why the rule of

min(600, round(1.6 * n_cat ** .56))

works? Or any papers that lead to this rule? I won’t @ jeremy here unless it’s necessary, but I’d rather get one of my biggest ‘black boxes’ answered if possible.

Thanks!

ste · July 17, 2019, 10:07pm

Empirical values - see this similar post:

muellerzr · July 17, 2019, 10:12pm

Thanks @ste, I was reading that; my main question is about the 1.6 and the .56. I understand that they are empirical values, but their origin and meaning are lost on me. Is there a basic statistics formula behind them? I thought they might have been derived from standard deviations and means in some particular way. I ask because the embedding rule that she mentions is outdated compared to the new embedding rule fastai uses. Following that older rule, a variable with a cardinality of, say, 200 should get a size of 100, but instead we get a size of 31.
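For reference, here is a minimal sketch of the rule as quoted in this thread, written as a standalone helper. The function name `embedding_size_rule` is just an illustrative choice here and is not imported from fastai.

```python
def embedding_size_rule(n_cat: int) -> int:
    """Rule of thumb quoted in this thread: min(600, round(1.6 * n_cat ** 0.56))."""
    return min(600, round(1.6 * n_cat ** 0.56))


# A cardinality of 200 gives 31, as described above;
# very large cardinalities hit the cap of 600.
for n_cat in (2, 200, 100_000):
    print(n_cat, embedding_size_rule(n_cat))
```

The cap only starts to matter for very high cardinalities; for small and medium ones the power term decides the size.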

Thanks!

ste · July 17, 2019, 10:28pm

AFAIK, the formula is a kind of “rule of thumb” to have your embedding size not too small and not too big.
I usually take it as a default and increase/decrease the value according to the relative improvement of the model.

muellerzr · July 17, 2019, 10:30pm

I usually do as well, but the people I do research with at my university have little experience with neural nets and are extremely fascinated by the fastai practices, so I’m more or less a translator between them all. Hopefully I can flesh out where that came from, or someone can provide some input, as explaining to a professor that it’s a black box backed by experience isn’t very helpful.

muellerzr · July 17, 2019, 11:30pm

I don’t want to jump the gun but I’ve looked into many papers and dug through Jeremy’s twitter feed without luck, I’m yet to find an explanation for it… @jeremy? I know you are a busy man, but I cannot seem to wrap my head around where this experience came from. Is it documented somewhere?

Thank you,

Zach

muellerzr · July 18, 2019, 4:17pm

Okay, good news! The first question has been answered: why 0.25? A Google Developers blog post has it here: Introducing TensorFlow Feature Columns - Google Developers Blog

"Why is the embedding vector size 3 in our example? Well, the following “formula” provides a general rule of thumb about the number of embedding dimensions:

embedding_dimensions = number_of_categories**0.25
That is, the embedding vector dimension should be the 4th root of the number of categories. Since our vocabulary size in this example is 81, the recommended number of dimensions is 3:

3 = 81**0.25
Note that this is just a general guideline; you can set the number of embedding dimensions as you please."
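The arithmetic in that quote can be checked with a short sketch. The helper name `fourth_root_rule` is hypothetical, and rounding to an integer is an assumption made here so that the result is usable as a dimension count.

```python
def fourth_root_rule(n_categories: int) -> int:
    """Guideline from the quoted blog post: dimensions = number_of_categories ** 0.25."""
    return round(n_categories ** 0.25)


print(fourth_root_rule(81))   # the blog's example: vocabulary of 81 -> 3 dimensions
print(fourth_root_rule(200))  # a cardinality of 200 -> 4 dimensions
```

For a cardinality of 200 this guideline gives 4, far below the 31 produced by the rule with the .56 exponent.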

The cut off of 600 I believe comes from Word2Vec itself: Word2vec - Wikipedia

“Quality of word embedding increases with higher dimensionality. But after reaching some point, marginal gain will diminish.[1] Typically, the dimensionality of the vectors is set to be between 100 and 1,000.”

And then the 1.6 may just be some factor that, as Jeremy loves to state, “just works”. I could also believe it is there to add dimensions for non-dictionary categorical variables, so that their sizes end up somewhere in that appropriate range.

jeremy · July 19, 2019, 4:36pm

I just fitted a line to the empirical values in Excel. There’s no theory behind it. Sorry!

jeremy · July 19, 2019, 4:36pm

That sounds way too low to me!

muellerzr · July 19, 2019, 4:46pm

Thanks for the reply Jeremy! I had realized that my statement above was incorrect, as you used **.56 not **.25. So if we want to generate our own embedding sizes following the ‘rule of thumb’, try fitting a line to the empirical values and go from there? Interesting thought… I will try this for myself!
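If someone wants to try this, one possible approach (an assumption on my part, not necessarily what was done in Excel) is to fit `size = a * n_cat ** b` as a straight line in log-log space. The sketch below uses only the standard library. The sample points are placeholders generated from the rule itself, not real experimental results, so the fit simply recovers 1.6 and 0.56; swap in your own (cardinality, best size) pairs.

```python
import math


def fit_power_law(n_cats, sizes):
    """Least-squares fit of size = a * n_cat ** b, done as a line in log-log space."""
    xs = [math.log(n) for n in n_cats]
    ys = [math.log(s) for s in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                         # slope of the line = exponent
    a = math.exp(mean_y - b * mean_x)     # intercept of the line = log(a)
    return a, b


# PLACEHOLDER data: generated from the rule itself, NOT real measurements.
n_cats = [5, 20, 100, 500, 2000]
sizes = [1.6 * n ** 0.56 for n in n_cats]

a, b = fit_power_law(n_cats, sizes)
print(f"a = {a:.2f}, b = {b:.2f}")  # a = 1.60, b = 0.56
```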

Thank you very much for the insight, and apologies for my ‘over researching’!

muellerzr · July 19, 2019, 4:47pm

Where does the ‘600’ maximum come from then? Is it related to Word2Vec? Or is it just a maximum that kind of ‘worked’?

Jess · July 19, 2019, 6:43pm

I actually love that! Great example of “opportunistic development” in action.

haj_mammad · March 6, 2020, 12:08pm

Hey everyone…
I was just studying and thinking about my M.S. thesis, and I came up with the idea that if we want to choose an embedding size for a certain number of points, we need to think about the density of the points in the embedding space.
Let me explain what I am thinking with an example: imagine we have 100 points and we embed them in a two-dimensional space. (One can imagine a 10 by 10 square with one point in each little unit square.) If we increase the dimensionality of the space from two to three, how many points can we put in that space? (Again, one can imagine a 10 by 10 by 10 cube with one point in each 1 * 1 * 1 cube.)
The answer is intuitively 1000. So let's wrap it up with a mathematical formula:
If we want to maintain the density in our embedding space, the fraction below must be a constant value.

log(# of points) / (# of dimensions) = constant (which we set according to our belief about how the embedding will be used)

If this constant value is a large number, it means we have a dense embedding space and embedded entities are densely close to each other. It helps when we know we have many categories that are similar to each other and we want our algorithms to embed them in similar positions. (So we allow such high density)
If this constant value is a small number, it means our embedding space is quite sparse and entities are more likely to be away from each other. It can help when we know most of the categories are semantically different and we want to allow our algorithms to embed them quite dis-similarly. (So we let the density be low)
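The density argument can be sketched in a few lines. The function names below are hypothetical, and base-10 logarithms are an assumption chosen to match the 10-per-axis grid in the example; a different base only rescales the constant.

```python
import math


def density_constant(n_points: int, n_dims: int) -> float:
    """log10(# of points) / (# of dimensions), the ratio to be held constant."""
    return math.log10(n_points) / n_dims


def dims_for_density(n_points: int, constant: float) -> float:
    """Solve log10(n_points) / n_dims = constant for n_dims."""
    return math.log10(n_points) / constant


# 100 points on a 10 x 10 grid in 2 dimensions fix the constant.
c = density_constant(100, 2)

# Keeping the same constant, 1000 points need a 10 x 10 x 10 cube.
print(c, dims_for_density(1000, c))
```

Holding the constant fixed, every tenfold increase in the number of points adds one dimension, which is the logarithmic growth argued for here.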
I have come to the conclusion that the formula mentioned above:

min(600, round(1.6 * n_cat ** .56))
or more generally
a * n_cat ** b
is not the best choice. Instead, we should use:
a * log(n_cat)
and I think having only one hyper-parameter to tune is another advantage.
But why was nobody concerned about the form of the formula above, while all the discussion was about the constant values used in it?
My answer to that question is simple: because the two formulas look quite similar when we plot them and compare their shapes. (Except for the beginning of the log curve, which won’t be used since n_cat is never less than 2.)
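To compare the two shapes numerically, here is a small sketch that prints both formulas side by side. The log rule needs a value for `a`, which is not given above, so as an assumption it is calibrated to agree with the power rule at a cardinality of 100 (`CALIBRATION_N_CAT` is an arbitrary, hypothetical choice).

```python
import math


def power_rule(n_cat: int, a: float = 1.6, b: float = 0.56) -> float:
    """a * n_cat ** b, with the constants quoted in this thread (no rounding or cap)."""
    return a * n_cat ** b


def log_rule(n_cat: int, a: float) -> float:
    """a * log(n_cat), the alternative proposed above (natural log)."""
    return a * math.log(n_cat)


# ASSUMPTION: pick 'a' so that both rules agree at one arbitrary cardinality.
CALIBRATION_N_CAT = 100
a_log = power_rule(CALIBRATION_N_CAT) / math.log(CALIBRATION_N_CAT)

for n_cat in (2, 10, 100, 1_000, 10_000):
    print(f"{n_cat:>6}  power={power_rule(n_cat):8.2f}  log={log_rule(n_cat, a_log):8.2f}")
```

How close the two curves look depends on the calibration point and on the range of cardinalities considered; for large enough n_cat a power law always grows faster than a logarithm.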

Everything I discussed here is my personal opinion, and I’ll be happy to hear your (@muellerzr, @ste) thoughts on it.

muellerzr · March 6, 2020, 12:59pm

That’s because Jeremy did his own (many) investigations and experiments on the Rossmann dataset and found that it worked better. For your experiments, attempt to do so on the Rossmann dataset (as we have a baseline).

(Also, if you have more questions, ping me; I’ve spoken to Jeremy on this topic.)

haj_mammad · March 6, 2020, 3:01pm

Thanks for your fast response @muellerzr !
I read the data description of that dataset. As far as I understand, it is a single dataset with only a few categorical fields, so one can’t simply extend conclusions about the size of embedding vectors from this single dataset to an overall rule [of thumb].
I should also mention again that the two functions are quite similar if you plot them (I did). So if Jeremy found the best embedding sizes for this dataset (which is just a dozen points), we can fit both functions to those points and then evaluate each of them by how close its outputs are to the best values. But this comparison may not be enough for us to decide on a general rule for picking embedding sizes across all problems and all datasets.
A better experiment would be to evaluate both ideas on a dataset with many categorical fields (all of which need to be embedded); CTR prediction datasets (Criteo, Avazu, etc.) are quite suitable for this comparison.
I would also be delighted if you shared the experiment results on the Rossmann dataset.

muellerzr · March 6, 2020, 4:47pm

You should use the feature-engineered one we use in the course. This will give you baselines, and all the feature-engineered columns turn into categorical ones, leaving us with ~29 categorical variables.

haj_mammad · March 6, 2020, 5:24pm

Sorry, I’m not familiar with the course. I just wanted to find some article or website to cite in my thesis and I ended up here. But 29 is still a small number compared with CTR prediction datasets.
I was thinking about a global rule for embedding size.