I went through the Rossmann notebook and the lecture. It is fascinating that a simple architecture (not deep and without any feature engineering) can do well on tabular data. But I am not able to intuitively understand how entity embeddings help a NN learn better. What is so special about them when dealing with tabular data? Thank you for your time!
@muellerzr, I get the idea a bit, but it is still not clear to me. Here is exactly what I am confused about. Let us say that a dataset has two independent variables, one categorical (day of the week (DOW)) and one continuous (area of a shop (A)). The dependent variable is the dollar sales (S). Now, the categorical embedding would hold random values to begin with, so there is no relationship between the embeddings for Saturday and Sunday. On the other hand, the continuous variable already contains meaningful information for the NN. In the process of training the NN to predict the sales correctly, the embeddings for Saturday and Sunday could get pushed closer to each other, and finally be very close in a 7-dimensional space. Then, at inference time, given a shop area (A), the predicted sales for Saturday and Sunday will be similar. Is that the idea? If we didn’t have the embedding layers, couldn’t we have learnt weights that lead to similar results?
If all we had to work with was the area of the shop and the day of the week, then yes, it possibly could (the day would still be one-hot encoded, but on a basic level, sure). This is also not specific to NNs. Random Forests also use the same ideas. (Their new book goes into this, and the new course should touch on it more too.)
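To make the one-hot point concrete, here is a minimal sketch in plain Python (no fastai or PyTorch) showing that looking up a row in an embedding table gives the same numbers as multiplying a one-hot vector by that table, i.e. a linear layer without a bias. The names `emb`, `embedding_lookup`, and `one_hot_linear`, and the width `EMB_DIM = 4`, are made up for illustration and are not taken from the Rossmann notebook.

```python
import random

random.seed(0)

N_DAYS = 7    # categories: Monday..Sunday
EMB_DIM = 4   # embedding width, chosen arbitrarily for this sketch

# Embedding table: one row of EMB_DIM weights per day, random to begin with.
emb = [[random.gauss(0.0, 1.0) for _ in range(EMB_DIM)] for _ in range(N_DAYS)]


def embedding_lookup(day):
    """What an embedding layer does: return row `day` of the table."""
    return list(emb[day])


def one_hot(day):
    """One-hot vector of length N_DAYS with a 1.0 at position `day`."""
    return [1.0 if i == day else 0.0 for i in range(N_DAYS)]


def one_hot_linear(day):
    """A bias-free linear layer applied to the one-hot vector: x @ W."""
    x = one_hot(day)
    return [sum(x[i] * emb[i][j] for i in range(N_DAYS)) for j in range(EMB_DIM)]


SATURDAY = 5  # Monday = 0
lookup_out = embedding_lookup(SATURDAY)
linear_out = one_hot_linear(SATURDAY)

# The two vectors match element by element.
print(all(abs(a - b) < 1e-12 for a, b in zip(lookup_out, linear_out)))
```

So in terms of what the model can represent, the two are the same thing. The lookup just skips the multiplications by zero, and the rows of the table are what get nudged during training.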
Would it be beneficial to try to make an embedding out of these continuous variables? Instead of each being treated individually, they would all be grouped into one (multi-dimensional) “date” embedding that “learns” the group of continuous variables.
I ask because I am working on a time-series problem and wanted to see what fast.ai could do. I typically lag variables and wanted to have multiple lags of my binary variable. If I have n binary variables that each represent a single lag, could those not be grouped? Even though this is not categorical in the sense of being one-hot encoded, I would think that an embedding made out of these might capture frequency (or something else, idk) a little better than multiple binary inputs.
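One possible reading of this grouping idea, as a sketch only: pack the n lagged binary values into a single integer code per time step, so the group becomes one categorical variable with up to 2**n levels that could then index an embedding table. The function name `lag_codes` and the sample series `flags` are hypothetical, and whether this actually beats feeding the lags in as separate binary inputs is an open question that would need testing on the real data.

```python
def lag_codes(series, n_lags):
    """Pack the previous `n_lags` binary values into one integer per time step.

    The most recent lag ends up in the highest bit. Positions that do not
    yet have a full history of `n_lags` values get None.
    """
    codes = []
    for t in range(len(series)):
        if t < n_lags:
            codes.append(None)  # not enough history yet
            continue
        code = 0
        for k in range(1, n_lags + 1):  # k = 1 is the most recent lag
            code = (code << 1) | series[t - k]
        codes.append(code)
    return codes


flags = [1, 0, 0, 1, 1, 0, 1]       # made-up binary series
codes = lag_codes(flags, n_lags=3)  # each code is in range(2 ** 3)
print(codes)
```

Note that the number of categories doubles with every extra lag, so with many lags most codes would be seen rarely or never, and their embedding rows would stay close to their random starting values.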