Understanding the relationship between categorical embeddings and network complexity

In the lecture on using embeddings for the categorical data in the Rossmann dataset, we use a two-layer network, with 1000 and 500 activations in the two hidden layers.
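For concreteness, the shape of the model I mean is roughly this (a minimal PyTorch sketch, not the actual notebook code; the cardinalities and embedding sizes at the bottom are placeholders):

```python
import torch
import torch.nn as nn

class EmbeddingMLP(nn.Module):
    """Entity embeddings for the categorical columns, concatenated with the
    continuous columns, followed by two fully connected layers (1000, 500)."""
    def __init__(self, emb_szs, n_cont, out_sz=1):
        super().__init__()
        # emb_szs: list of (cardinality, embedding_dim) pairs, one per categorical column
        self.embeds = nn.ModuleList([nn.Embedding(c, d) for c, d in emb_szs])
        n_emb = sum(d for _, d in emb_szs)
        self.layers = nn.Sequential(
            nn.Linear(n_emb + n_cont, 1000), nn.ReLU(),
            nn.Linear(1000, 500), nn.ReLU(),
            nn.Linear(500, out_sz),
        )

    def forward(self, x_cat, x_cont):
        # x_cat: (batch, n_categorical) integer codes; x_cont: (batch, n_cont) floats
        x = torch.cat([e(x_cat[:, i]) for i, e in enumerate(self.embeds)], dim=1)
        return self.layers(torch.cat([x, x_cont], dim=1))

# Placeholder sizes, just to show the shape of the call:
model = EmbeddingMLP(emb_szs=[(1116, 50), (8, 4), (4, 3)], n_cont=16)
```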

Although it's only a loose rule of thumb, I've seen many people advise using a network width anywhere from 1-3x the number of features.

So is the reason for using such a complex network (given the dataset) to compensate for the embeddings? If we're using many categorical features with high cardinality, does the network complexity need to increase substantially? Or does the width of the network have no relationship with the embeddings?
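To make the question concrete: after the embedding lookups, the first dense layer only sees the concatenated embedding vectors (plus any continuous columns), so the effective feature count is the sum of the embedding widths rather than the raw cardinalities. A quick back-of-the-envelope with made-up cardinalities and an assumed "half the cardinality, capped at 50" sizing heuristic:

```python
# Hypothetical cardinalities for a handful of categorical columns.
cardinalities = {"store": 1115, "day_of_week": 7, "promo": 2, "state": 16}

# One common sizing heuristic (seen in older fastai code): half the
# cardinality, capped at 50. Treat this as an assumption, not gospel.
emb_szs = {name: min(50, (c + 1) // 2) for name, c in cardinalities.items()}

n_onehot = sum(cardinalities.values())   # input width if one-hot encoded
n_embed = sum(emb_szs.values())          # input width after embeddings
print(emb_szs)             # {'store': 50, 'day_of_week': 4, 'promo': 1, 'state': 8}
print(n_onehot, n_embed)   # 1140 vs 63
```

Even with a high-cardinality column like store, the embedded input here is only ~63 wide, which is what makes the jump to 1000 activations feel large relative to a 1-3x rule.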

I've tested such a dataset on both relatively simple and complex architectures, and there does seem to be some preference for the more complex networks, but many simple architectures performed just as well, so I wanted to get some understanding/intuition (and maybe find out whether I'm wasting my time testing complex architectures).

Thanks!


My understanding is that this Rossmann notebook is essentially a replication of the Cheng Guo and Felix Berkhahn “Entity Embeddings of Categorical Variables” paper:

“In this experiment we use both one-hot encoding and entity embedding to represent input features of neural networks. We use two fully connected layers (1000 and 500 neurons respectively) on top of either the embedding layer or directly on top of the one-hot encoding layer.”

You can find the original authors’ Kaggle competition code here.

Unfortunately, I can’t seem to find an explanation of how they actually arrived at these 1000 and 500 figures, but FYI Jeremy used 500 and 250 instead for his (less complex?) bulldozer model in the “Machine Learning for Coders” course.

My guess is that you may just have to establish both the optimal number of layers and the number of activations in each of these layers via experimentation, as indeed you have already been doing!?
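Something like the following loop is what I have in mind; `train_and_eval` is just a stand-in for whatever training routine you are already using, and the candidate sizes are arbitrary:

```python
# Candidate hidden-layer configurations to compare; purely illustrative.
candidates = [
    [64], [256], [1000],
    [500, 250], [1000, 500], [2000, 1000],
]

results = {}
for layers in candidates:
    # train_and_eval is a hypothetical helper: build the model with these
    # hidden sizes, train it with a fixed schedule, and return the best
    # validation error it reached.
    val_err = train_and_eval(hidden_sizes=layers)
    results[tuple(layers)] = val_err

best = min(results, key=results.get)
print(f"best architecture: {best}, validation error: {results[best]:.4f}")
```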


Thanks for the link!

I did a grid search from:

  • one layer, 16x2 … to
  • two layers, 4112x2056

And got the following results:

[Plot: architectures with the lowest validation error]

[Plot: architectures with the highest validation error]

I don't really see any trends besides the fact that using two layers doesn't converge that well… Are there any trends/patterns you can see with regard to complexity?

The “lowest validation error” is the lowest error after 7 epochs (with 3 restarts and a cycle mult of 2).
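In other words, cycle lengths of 1, 2, and 4 epochs (1 + 2 + 4 = 7). In plain PyTorch terms that schedule is roughly cosine annealing with warm restarts; a sketch with a stand-in model, just to show the schedule I mean:

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)                      # stand-in model
opt = optim.SGD(model.parameters(), lr=1e-2)

# 3 cycles with a length multiplier of 2 -> cycles of 1, 2, 4 epochs = 7 epochs.
sched = optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=1, T_mult=2)

for epoch in range(7):
    # ... one training pass over the data would go here ...
    sched.step()  # advances the restart schedule once per epoch
```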

Thanks again!