Can anyone please share some insight into why we use linear activations when we add an embedding layer for structured data? Why not wrap the embedding output in a non-linear activation before concatenating it with the continuous features?
The only advantage I can see to passing it through a non-linearity is that it might put the embedding values on a scale that better aligns with the other ordinal features, and/or scale them better relative to the randomly initialized weights in the hidden layers. My guess is it would work about as well, but I don't see why it would do better except by chance. Also, I'd be cautious with ReLU, since it would zero out any embedding weights that were learned to be negative. That might not be a bad thing, acting as a sparsity-inducing form of regularization a bit like L1, but it feels weird.
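To make the discussion concrete, here's a minimal PyTorch sketch of the usual pattern: the embedding lookup feeds the hidden layers directly (a linear/identity "activation"), and the non-linearity in question would go at the commented line. All names, sizes, and the architecture here are made up for illustration.

```python
import torch
import torch.nn as nn

class TabularNet(nn.Module):
    """Hypothetical tabular model: one categorical feature (embedded)
    concatenated with continuous features."""
    def __init__(self, n_categories=10, emb_dim=4, n_cont=3, hidden=16):
        super().__init__()
        self.emb = nn.Embedding(n_categories, emb_dim)
        self.hidden = nn.Linear(emb_dim + n_cont, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x_cat, x_cont):
        e = self.emb(x_cat)        # plain lookup: no activation applied
        # e = torch.relu(e)        # <- the non-linearity in question; note
        #                               ReLU would zero out any negative
        #                               embedding values learned in training
        x = torch.cat([e, x_cont], dim=1)
        return self.out(torch.relu(self.hidden(x)))

net = TabularNet()
x_cat = torch.tensor([1, 2])       # batch of 2 category indices
x_cont = torch.randn(2, 3)         # batch of 2 rows, 3 continuous features
y = net(x_cat, x_cont)
print(y.shape)                     # torch.Size([2, 1])
```

Seen this way, the embedding is just a learned linear map from a one-hot encoding, so the first hidden layer already composes a linear transform with it; wrapping the lookup in an extra activation mostly changes the scale and sign of what reaches that layer.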