I’ve been working through the Part 1 videos; really great material!
One thing I was curious about is the embedding approach as seen in the Rossmann discussion. My understanding is that there is no nonlinearity between the embedding matrices and the first dense layer (see Figure 1 in the Entity Embeddings of Categorical Variables paper).
We can think of the embedding lookups as the blocks in a big sparse block matrix multiply, so their interaction with the dense layer is really just a matrix multiply, and I think the two could be collapsed into a single layer. Is it true that, from the network’s perspective, they might as well be fused into one dense layer? In other words, if I trained two networks that differed only in using embedding matrices vs. one-hot encodings, would I actually get a better result / a more powerful network from the embeddings?
I get that embeddings are nice for other reasons (e.g. interpreting features, use in feature engineering); I’m just curious about them in the context of the neural net itself!