Hi everyone, I’m working on using the entity embeddings of the neural net to improve the random forest results. This is all in the chapter 09_tabular notebook with the Blue Book for Bulldozers dataset.
The first stumbling block: I don’t quite understand the dimensions of the embeddings. Every categorical variable should get its own embedding layer. This seems right:
embeds = list(learn.model.embeds.parameters())
len(embeds)
returns 13, as does len(cat_nn).
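As a sanity check on what each element of embeds actually is: as far as I can tell, each one is the weight matrix of an nn.Embedding, whose shape is (num_embeddings, embedding_dim). A minimal standalone example (the 73/18 numbers are just taken from my first variable below):

```python
import torch.nn as nn

# Toy embedding layer: 73 category levels, each mapped to an 18-dim vector.
emb = nn.Embedding(num_embeddings=73, embedding_dim=18)

# The learnable weight matrix has one row per category level.
print(emb.weight.shape)  # torch.Size([73, 18])
```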
Now my understanding was that the first dimension of each embedding layer equals the number of levels of the corresponding variable, while the second dimension is determined by a heuristic that works well in practice.
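For the second dimension, I believe the heuristic is fastai’s emb_sz_rule (in fastai.tabular.model, if I’m reading the source right), which does reproduce the numbers I see:

```python
# fastai's default embedding-size heuristic, as I understand it:
# roughly 1.6 * n_cat**0.56, capped at 600.
def emb_sz_rule(n_cat: int) -> int:
    return min(600, round(1.6 * n_cat ** 0.56))

# Matches the second dims in my output, e.g. 73 -> 18 and 5242 -> 194.
print(emb_sz_rule(73), emb_sz_rule(5242))  # 18 194
```

So the second dimension seems fine; it’s the first dimension that puzzles me.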
However, these numbers don’t match.
for col, emb in zip(cat_nn, embeds):
    print(emb.shape, df_nn_final[col].nunique())
This gives the following output:
torch.Size([73, 18]) 73
torch.Size([7, 5]) 6
torch.Size([3, 3]) 2
torch.Size([75, 18]) 74
torch.Size([4, 3]) 3
torch.Size([5242, 194]) 5281
torch.Size([178, 29]) 177
torch.Size([5060, 190]) 5059
torch.Size([7, 5]) 6
torch.Size([13, 7]) 12
torch.Size([7, 5]) 6
torch.Size([5, 4]) 4
torch.Size([18, 8]) 17
Where does the mismatch come from? Am I perhaps using the wrong dataframe for the comparison, or do I have a misconception about how the embeddings work?
Thank you!