Categorical Entity Embedding

Can anyone share whether fastai (PyTorch) does one-hot encoding under the hood when creating an embedding of a categorical variable?

I noticed in the code for Rossmann that we tell the embedding layer how many values there are for a given categorical, but we pass in the original column, not a DataFrame with the columns one-hot encoded.

Unless I’m misunderstanding how categorical embeddings are created: we take the dot product of a one-hot vector for a given categorical value with the embedding matrix, which gives us the weight vector associated with that value. That then pipes into the next fully connected layer, where it’s combined with the continuous variables, etc.

No, it does not do one-hot encoding; it passes the index of the categorical variable directly to an embedding layer. When we create an embedding, we pass it the number of categories and a vector size, and it acts as a lookup table from indices to vectors (which is equivalent to a dot product with a one-hot encoded vector, but computationally more efficient).
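Here’s a minimal PyTorch sketch (the category count and embedding size are made up for illustration) showing that the index lookup and the one-hot dot product give the same result:

```python
import torch
import torch.nn as nn

# Hypothetical example: a categorical variable with 5 possible values,
# embedded into 3-dimensional vectors.
n_categories, emb_size = 5, 3
emb = nn.Embedding(n_categories, emb_size)

idx = torch.tensor([2])          # raw integer index, no one-hot needed
lookup = emb(idx)                # direct lookup into the weight table

one_hot = torch.zeros(1, n_categories)
one_hot[0, 2] = 1.0
matmul = one_hot @ emb.weight    # dot product with a one-hot vector

print(torch.allclose(lookup, matmul))  # True: the two are equivalent
```

So the embedding layer is doing exactly the computation you described, it just skips materializing the one-hot vector and reads the row out of the weight matrix directly.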
