Lesson 7 - Official topic

johannesstutz · November 3, 2020, 11:43am

Hi everyone, I’m working on using the entity embeddings of the neural net to improve random forest results. This is all in the chapter 09_tabular notebook with the bulldozer bluebook dataset.

The first stumbling block: I don’t quite get the dimensions of the embeddings. Every categorical variable should gets its own embedding layer. This seems right:

embeds = list(learn.model.embeds.parameters())

len(embeds) as well as len(cat_nn) is 13.

Now my understanding was that the first dimension of the embedding layer is equal to the number of levels for the variable. The other dimension is determined by a heuristic that works well in practice.

However, these numbers don’t match.

for i in range(len(cat_nn)):
    print(embeds[i].shape, df_nn_final[cat_nn[i]].nunique())

Gives following result:

torch.Size([73, 18]) 73
torch.Size([7, 5]) 6
torch.Size([3, 3]) 2
torch.Size([75, 18]) 74
torch.Size([4, 3]) 3
torch.Size([5242, 194]) 5281
torch.Size([178, 29]) 177
torch.Size([5060, 190]) 5059
torch.Size([7, 5]) 6
torch.Size([13, 7]) 12
torch.Size([7, 5]) 6
torch.Size([5, 4]) 4
torch.Size([18, 8]) 17

Where does the mismatch come from? Am I maybe using the wrong dataframes or do I have a wrong conception about embeddings?

Thank you!