In lesson 4, the concept of embeddings is presented for categorical data. Jeremy states the values for the embedding columns are randomly initialized. It is never explained how the embeddings are subsequently modified from that point.

In an effort to learn on my own, I found Rachel’s article where she mentions as an aside (in parentheses!) that the values of the categorical variables are learned as the network is trained. That is a major concept, not intuitive, and strangely (to me at least) never addressed.

My understanding currently is that the weights of the neural net are trained based on the data presented. The embeddings are the data. How and where does the system update the actual data values in addition to weights? I have trouble even understanding how that could work.

Could anyone explain this, or direct me to a resource that explains it clearly and in detail?

The short answer is that the embedding layer for the categorical variable is also “learned” during training, just like any other weights in the network. I would take a look at lessons 11 and 12 of the Machine Learning course Jeremy just launched on YouTube - they go pretty deep into embeddings as well.

Thanks for the quick response, but it does not address my question. What I want to understand is the mechanism for modifying the embeddings at the same time as the neural net weights are modified. I believe that it works, but do not understand it.

In my (obviously) incorrect mental model, I am assuming that the objective function is calculated by a series of operations involving the data and the weights through each layer. Backprop modifies the weights, the data is static. How can the data change too? I just can’t envision how this works.

For each categorical variable, the embedding is a randomly initialised matrix whose values are updated during training according to how they affect the loss. Within this matrix, think of an embedding as a representation of one value of the categorical variable. Each distinct category (e.g. 2018 in the Year column) gets a single row, which starts out as random numbers; the matrix has as many rows as there are distinct years in the dataset, and the number of columns is the embedding size you choose. A category’s meaning is spread across all the columns of its row rather than tied to any single one, which is why these are called distributed representations.

Representing each year this way does not mean the years in the dataset are replaced; the embedding is just a representation of them. Embeddings are learnable parameters, but they don’t replace the categorical variables in the dataset. Instead of sending the categorical variables themselves as one-hot encoded values, we send embeddings as part of the model, so that it can update those values based on the behaviour of every other value in the dataset…

Jeremy explained, and I have seen in other presentations, that the random initialization changes over time so that the embeddings take on some sort of meaning that they like to plot in two dimensions for demonstration purposes. OK, cool.

What I do not understand is how a neural net can update data and weights at the same time. When back propagation happens, it updates the weights based on the gradient of the mini batch. When and how could the data be updated?

@arkerpay to your point about one-hot representations being suboptimal: I think it’s important to understand that an embedding layer is mathematically equivalent to feeding a one-hot encoding into a linear layer; it’s just more efficient.

If you take a one-hot encoded row vector [0 … 0 1 0 … 0], with the 1 in position i, and multiply a matrix by it from the left, the math of matrix multiplication means you pluck out the i-th row of the matrix. That row is the embedding of category i.

This works, but as you mentioned, it’s suboptimal. Every row vector we feed in is one-hot encoded, so there really isn’t any need to literally do that matrix multiplication; we’re always just plucking out some particular row. That’s all an embedding layer does: it plucks out the right row without actually doing a matrix multiplication.
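Here is a minimal NumPy sketch of that equivalence. The sizes, the seed, and the names `W`, `one_hot`, `via_matmul` and `via_lookup` are all made up for illustration; a real embedding layer stores a matrix like `W` and only ever does the indexing step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories, emb_dim = 5, 3                   # hypothetical sizes
W = rng.normal(size=(n_categories, emb_dim))   # randomly initialised embedding matrix

i = 2                                          # index of the category we look up
one_hot = np.zeros(n_categories)
one_hot[i] = 1.0                               # [0, 0, 1, 0, 0]

via_matmul = one_hot @ W   # full matrix multiplication with the one-hot row vector
via_lookup = W[i]          # plain row lookup, which is all an embedding layer does

print(np.allclose(via_matmul, via_lookup))     # prints True
```

Both routes give the same vector; the lookup just skips all the multiplications by zero.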

As for what happens when you train an embedding: think of the embedding layer as a linear layer without any bias or activation function. The data (the one-hot encodings) don’t change; only the weights in the linear layer (aka the embeddings) change. That is, the embeddings are the weights.
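To make that concrete, here is a small sketch of one gradient step on such a linear layer, written by hand in NumPy. The target vector, the squared-error loss and the learning rate are assumptions chosen only to have something to differentiate. The point is that the gradient with respect to `W` is zero everywhere except the row selected by the one-hot input, and that the input `x` itself is never touched.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))            # embedding matrix = weights of a bias-free linear layer
x = np.array([0.0, 1.0, 0.0, 0.0])     # one-hot input for category 1 (the data; never modified)
target = np.array([1.0, -1.0])         # made-up target, for illustration only

def loss_fn(weights):
    """Squared error between the selected embedding and the target."""
    e = x @ weights
    return 0.5 * np.sum((e - target) ** 2)

e = x @ W                              # forward pass: picks out row 1 of W
grad_W = np.outer(x, e - target)       # dL/dW via the chain rule; only row 1 is non-zero

W_before = W.copy()
loss_before = loss_fn(W)
W -= 0.1 * grad_W                      # one SGD step on the weights
loss_after = loss_fn(W)

# Rows of W that actually changed after the update
changed_rows = np.flatnonzero(np.any(W != W_before, axis=1))
print(changed_rows, loss_after < loss_before)
```

Only the row for the category that was fed in moves, and it moves in the direction that lowers the loss. That is exactly what happens to any other weight during backprop.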

But I still don’t get it. How can we update some of our inputs (I mean these embeddings) and leave others unchanged? A simple code example might clear this up.

Now that I finally understand this, I will try to explain it. I’d appreciate anyone else stepping in if I get this wrong.

The embeddings, which we think of as data, are actually weights in the neural net that are trained in the same manner as all other weights. If you think of category levels as rows and the number of dimensions as columns, you have a matrix of weights for a single categorical variable.

For each training example, only the row that corresponds to the appropriate category level is sent to the next neural net layer. The “embedding matrix” is used as a lookup table. Conceptually, it is as if each categorical variable is stored as a one-hot representation, and the entry that equals ‘1’ activates the weights in the appropriate row of the embedding matrix.

For efficiency, this is not how it is implemented. The one-hot representation is replaced by an integer index that selects the corresponding row of weights to send to the next layer.
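As a rough sketch of the whole picture, here is a tiny hand-written network in NumPy: an embedding matrix `E` looked up by integer index, followed by a single linear output. Everything here (the toy data, the sizes, the learning rate, the names `E`, `w`, `b`) is an assumption for illustration, not how fastai or PyTorch implement it internally. The embedding rows and the next-layer weights are updated in the same loop, by the same rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories, emb_dim = 4, 2
E = rng.normal(scale=0.1, size=(n_categories, emb_dim))  # embedding matrix (weights)
w = rng.normal(scale=0.1, size=emb_dim)                  # weights of the next layer
b = 0.0                                                  # bias of the next layer

idx = np.array([0, 1, 2, 0, 1, 2])            # the data: integer category codes (3 never appears)
y = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])  # toy targets, one value per category

E_start = E.copy()
lr = 0.1
losses = []
for step in range(500):
    emb = E[idx]                      # lookup: one row of E per training example
    y_hat = emb @ w + b               # forward pass through the next layer
    r = y_hat - y                     # residuals
    losses.append(0.5 * np.mean(r ** 2))

    # Backprop: gradients for the next layer...
    grad_w = emb.T @ r / len(y)
    grad_b = r.mean()
    # ...and for the embedding rows that were looked up (accumulated per row)
    grad_E = np.zeros_like(E)
    np.add.at(grad_E, idx, np.outer(r, w) / len(y))

    # Same SGD update for every parameter, embeddings included
    E -= lr * grad_E
    w -= lr * grad_w
    b -= lr * grad_b

print(losses[0], losses[-1])
print(np.array_equal(E[3], E_start[3]))  # row for the unseen category is untouched
```

The integer codes in `idx` are the data and never change. The rows of `E` are weights: the ones that get looked up receive gradients and move, and the row for a category that never appears stays at its random initial values.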

In this type of implementation, you would not train the embeddings separately from the rest of the network. If you have a similar application for the same set of categorical variables, you could reuse the embedding weights as pretrained embeddings. At least, I assume so. This is where a more knowledgeable person could step in.

When using many high-cardinality categorical variables, does the number of neurons in the net need to be increased as well? Or is there no relationship?

There is a general relationship between cardinality and embedding count. If you think of a decision tree, you may need to fork many times to divide a large data set into homogeneous groups. At the same time, it may be that the large data set only has two or three meaningful splits, and additional subtleties aren’t relevant to your decisions. So you want your initial model to have the ability to find those additional splits, but subsequent versions may trim the embedding count.
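Assuming “embedding count” here means the number of columns per category, one starting point I recall from the course’s Rossmann notebook is half the cardinality, capped at 50. The sketch below just encodes that rule of thumb; the function name and the example cardinalities are hypothetical, and it is a heuristic to start from, not a law.

```python
def embedding_size(cardinality: int, max_size: int = 50) -> int:
    """Rule-of-thumb embedding width: half the cardinality (rounded up), capped at max_size."""
    return min(max_size, (cardinality + 1) // 2)

# Hypothetical cardinalities, for illustration only
cardinalities = {"DayOfWeek": 7, "Year": 3, "Store": 1115}
sizes = {name: embedding_size(c) for name, c in cardinalities.items()}
print(sizes)  # {'DayOfWeek': 4, 'Year': 2, 'Store': 50}
```

You would then trim or widen from there based on validation results.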

In my work, I’ve set my first neuron layer wider than the feature layer with expanded embeddings, but there may be cases where forcing the first layer to distill the inputs works as well.

With regard to “forcing the first layer to distill the inputs works as well”, wouldn’t the neurons quickly become saturated with high-cardinality features? I feel like in that situation, it may force the network to rely on continuous features as the embeddings wouldn’t be given a chance to update to provide enough spatial information and would therefore be less meaningful/valuable (as determined by the network).

It would actually be really interesting to visualize the embeddings of high-cardinality features after near-optimal training in a “simple” vs “complex” network to see if there is indeed an effect.