I’ve been working through the part 1 videos, really great material!

One thing I was curious about is the embedding approach from the Rossmann discussion. My understanding is that there is no nonlinearity between the embedding matrices and the first dense layer (see Figure 1 in the Entity Embeddings of Categorical Variables paper).

We can think of the embeddings as the blocks of a big sparse block matrix multiply, so their interaction with the dense layer is really just a matrix multiply, and I think the two could be collapsed into a single layer. Is it true that, from the network’s perspective, these might as well be fused into a single dense layer? In other words, if I took two networks that differed only in using embedding matrices versus plain one-hot encodings, would the embedding version actually give a better result or a more powerful network?
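To make the question concrete, here is a minimal NumPy sketch of the collapse being described. The sizes (two categorical variables with 7 and 4 levels, embedding widths 3 and 2, a 5-unit dense layer) and all names are made up for illustration; nothing here comes from the course notebooks or the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: two categorical variables with 7 and 4 levels,
# embedded into 3 and 2 dimensions, feeding a 5-unit dense layer.
cardinalities = [7, 4]
embed_dims = [3, 2]
hidden = 5

# One embedding matrix per variable, shape (levels, embed_dim).
embeddings = [rng.normal(size=(n, d)) for n, d in zip(cardinalities, embed_dims)]
# First dense layer acting on the concatenated embeddings.
W = rng.normal(size=(sum(embed_dims), hidden))
b = rng.normal(size=hidden)

def embed_then_dense(idxs):
    """Look up each embedding, concatenate, apply the dense layer (pre-activation)."""
    x = np.concatenate([E[i] for E, i in zip(embeddings, idxs)])
    return x @ W + b

# Fused weights: multiply each embedding matrix by its matching row block
# of W, then stack the results into shape (sum(levels), hidden).
offsets = np.cumsum([0] + embed_dims)
W_fused = np.vstack([
    E @ W[offsets[k]:offsets[k + 1]] for k, E in enumerate(embeddings)
])

def onehot_then_dense(idxs):
    """Concatenated one-hot input through a single dense layer (pre-activation)."""
    x = np.concatenate([np.eye(n)[i] for n, i in zip(cardinalities, idxs)])
    return x @ W_fused + b

sample = (5, 2)
print(np.allclose(embed_then_dense(sample), onehot_then_dense(sample)))
```

So any embedding-plus-dense pair can be rewritten as one dense layer on one-hot inputs. The reverse does not hold in general: the factored form limits each variable's block of the fused matrix to rank at most its embedding width, so it has fewer parameters whenever the embedding is narrower than both the number of levels and the hidden layer.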

I get that they are nice for other reasons, e.g. interpreting features or using them in feature engineering; I’m just curious about them in the context of neural nets!

Embeddings are just a programming trick that is mathematically equivalent to one-hot encoding followed by a fully connected layer (with no bias or activation).

If you have a lot of values to one-hot encode and you want to embed them in a high-dimensional space (word embeddings can have length 300, for example, and vocabularies of 20k to 30k entries are not particularly large), you end up with a very tall and wide weight matrix that the one-hot encoding has to be multiplied by.

Instead, a lookup is performed using the index of the embedded entity, and a vector of trainable weights is returned. (I believe this can be likened to a lookup into a dict versus performing the expensive matrix multiply.)
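A quick NumPy sketch of that equivalence, with a deliberately small made-up table (1000 x 8 rather than 20k-30k x 300) so the one-hot matrix stays cheap to build. The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, much smaller than a real vocabulary.
vocab_size, embed_dim = 1000, 8
E = rng.normal(size=(vocab_size, embed_dim))   # the trainable weight table
idx = np.array([3, 17, 999])                   # entities to embed

# Route 1: build one-hot rows and multiply by the full weight matrix.
one_hot = np.zeros((len(idx), vocab_size))
one_hot[np.arange(len(idx)), idx] = 1.0
via_matmul = one_hot @ E

# Route 2: index straight into the table (what an embedding layer does).
via_lookup = E[idx]

print(np.allclose(via_matmul, via_lookup))
```

Both routes return the same rows. The lookup just skips multiplying by all the zeros, which is where the savings come from as the vocabulary grows.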

Has anyone tried sticking a nonlinearity between the embedding matrices and the first fully connected layer? I wonder if you’d get better embeddings as they’re no longer linearly mixed with the first real layer in the net.
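For anyone who wants to experiment with this, here is a rough NumPy sketch of the forward pass with a ReLU placed directly after the lookup. The sizes, the choice of ReLU, and the function names are assumptions for illustration, not something taken from the paper or the course code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: one categorical variable with 6 levels,
# a 3-dimensional embedding, and a 4-unit first dense layer.
n_levels, embed_dim, hidden = 6, 3, 4
E = rng.normal(size=(n_levels, embed_dim))
W = rng.normal(size=(embed_dim, hidden))
b = rng.normal(size=hidden)

def relu(x):
    return np.maximum(x, 0.0)

def forward(idx):
    """Embedding lookup -> ReLU -> dense layer -> ReLU."""
    return relu(relu(E[idx]) @ W + b)

# Applying the activation to the whole table up front gives the same rows.
E_activated = relu(E)

def forward_pretransformed(idx):
    """Lookup into the already-activated table -> dense layer -> ReLU."""
    return relu(E_activated[idx] @ W + b)

print(all(np.allclose(forward(i), forward_pretransformed(i)) for i in range(n_levels)))
```

One thing the sketch makes visible: an elementwise activation applied to a looked-up row is the same as looking up a row of the activated table. For a single embedding, then, the extra nonlinearity mostly restricts the values the embedding can take (non-negative for ReLU) rather than adding a new layer of computation. Whether that changes the learned embeddings in a useful way would still need to be tested empirically.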