Let’s say that you have a categorical variable with cardinality N.
If you 1-hot-encode this variable, you’re essentially transforming it to a set of points in N-dimensional space. This is a suitable representation for a neural net. However, points in that space are subjects to a constraint: they appear only in the corners of an N-dimensional cube, so “almost all” area in that space is not used at all.
Now, if we let each category be an arbitrary point in the N-dimensional space, we can potentially use all available area. Moreover, with the previous constraint removed, and since an embedding is a linear learnable layer, the net can move these points to whatever places that yield the smaller loss at the end.
I don’t know for sure, but it looks like the reason why the
max(50, c//2) rule works well in practice is because an embedding space with
c//2 dimensions is a very large space, compared to 1-hot-encoded space with
c dimensions, so it’ll be more that enough to learn meaningful relationships. The maximum cap of 50 is probably because each dimension “exponentially increases” the “representation power”, so going beyond 50 is an overkill, except when the embedding has to learn really rich representations, like a language model. By saying “exponentially increases” I draw an analogy to discrete spaces, however I don’t know how it properly translates to continuous spaces.