Lesson 8: confusion about understanding embeddings

Hi all,

I am struggling to deeply understand embeddings.

In collaborative-filtering-deep-dive.ipynb it is defined like this:

> Embedding: Multiplying by a one-hot-encoded matrix, using the computational shortcut that it can be implemented by simply indexing directly. This is quite a fancy word for a very simple concept. The thing that you multiply the one-hot-encoded matrix by (or, using the computational shortcut, index into directly) is called the embedding matrix.

  1. My first problem is that I cannot see or understand what one-hot encoding has to do with embeddings.

I understand that
user_factors.t() @ one_hot_3 is the same as user_factors[3]

but I cannot see one-hot encoding anywhere later in the lesson, not even when Jeremy builds his own embedding module from scratch.

He creates embeddings just by defining and calling this function:

def create_params(size):
    # a trainable parameter tensor of the given size, initialised from N(0, 0.01)
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

...
self.user_factors = create_params([n_users, n_factors])
...
  2. I also don't have a clear understanding of how embeddings are used in tabular models as categorical embeddings.
    E.g. the embedding size for Titanic's Sex feature is (3, 3).
    Sex has 2 unique values, so why does it need 3?
    Also, Pclass has 3 unique values but the embedding size is (4, 3). Why 4?
    It seems it is calculated as unique_values + 1, but why?

Thank you for the help!

Hi @teamtom, here are my 2 cents on this.

For me, embeddings are all about encoding your data. In the application of collaborative filtering we are looking for a representation / encoding of our users and our movies. The case for movies and users is identical, so let’s consider movies here:

Let’s say that we want to use 4D embeddings of our movies (meaning that each movie will be represented by 4 numbers e.g. [ 0.4, -0.2, 0.9, 0.7]) and we have a total of 1000 movies. All this data is thus stored in a 1000 x 4 weight matrix, holding all the encodings of our movies. Each movie is thus associated with one row in this matrix.

To access the embedding for any movie, we can thus index into the matrix. If we want to get the embedding for the movie that is associated with row 2, we can do:

import torch

weight_matrix = torch.randn(1000, 4)
weight_matrix[2, :]

which shows, for example:

> tensor([-1.1338, -0.7682, -0.4630,  0.5829])

Mathematically, indexing in is exactly the same as multiplying this weight_matrix by a one-hot encoded vector (one_hot below is fastai's helper):

one_hot(2, 1000).float() @ weight_matrix
> tensor([-1.1338, -0.7682, -0.4630,  0.5829])
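
If you want to verify this equivalence yourself without fastai, here is a minimal sketch in plain PyTorch (torch.nn.functional.one_hot does the encoding; weight_matrix is just the example matrix from above):

import torch
import torch.nn.functional as F

weight_matrix = torch.randn(1000, 4)

# direct row indexing
by_index = weight_matrix[2, :]

# multiplying a one-hot vector by the weight matrix picks out the same row
one_hot_vec = F.one_hot(torch.tensor(2), num_classes=1000).float()
by_matmul = one_hot_vec @ weight_matrix

assert torch.allclose(by_index, by_matmul)

So the one-hot story is not just a metaphor: indexing is literally a computational shortcut for that matrix product.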

Also, on the second line of the forward method you can see how he indexes into the weight matrix called movie_factors to get the embeddings for the movies in his data (x):

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors[x[:,0]]    # index into the user embedding matrix
        movies = self.movie_factors[x[:,1]]  # index into the movie embedding matrix
        res = (users*movies).sum(dim=1)      # dot product per sample
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)
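
To make the indexing concrete, here is a hand-traced sketch of that forward pass for a single sample. The sizes are made up, create_params is the function from above, and sigmoid_range is written out explicitly (fastai defines it as sigmoid(x) * (hi - lo) + lo):

import torch
from torch import nn

def create_params(size):
    # trainable parameter tensor, initialised from N(0, 0.01)
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

def sigmoid_range(x, lo, hi):
    # squash x into the interval (lo, hi)
    return torch.sigmoid(x) * (hi - lo) + lo

n_users, n_movies, n_factors = 10, 20, 5
user_factors  = create_params([n_users, n_factors])
movie_factors = create_params([n_movies, n_factors])
user_bias  = create_params([n_users])
movie_bias = create_params([n_movies])

x = torch.tensor([[3, 7]])         # one sample: user 3 rated movie 7
users  = user_factors[x[:, 0]]     # row 3 of user_factors, shape (1, 5)
movies = movie_factors[x[:, 1]]    # row 7 of movie_factors, shape (1, 5)
res = (users * movies).sum(dim=1)  # dot product per sample, shape (1,)
res = res + user_bias[x[:, 0]] + movie_bias[x[:, 1]]
pred = sigmoid_range(res, 0, 5.5)  # prediction squeezed into (0, 5.5)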

For your second question: that's all a design decision. You can use a 4D movie embedding as we did above, but also a 10D one. If you ask me, it doesn't make sense to use a 3D embedding for a variable such as sex which can only take on 2 values. There you would probably directly use a binary 0 or 1, or perhaps a 2D embedding, but certainly not a 3D one.


Thank you for the answer!

First:
I know that code block (class DotProductBias) from collaborative-filtering-deep-dive.ipynb and understand how it works.

My confusion comes from not knowing whether mentioning one-hot encoding is just a metaphor, or whether one-hot encoding is the basis that the embedding mechanism is built upon:

> Mathematically, indexing in is exactly the same as multiplying this weight_matrix by a one-hot encoded vector.

Second:
I used the get_emb_sz() function to calculate embedding sizes, and for Titanic's Sex it returned 3 (unique values: 2), and for Pclass it returned 4 (unique values: 3).

get_emb_sz(dls.train_ds)
> [(3, 3), (4, 3)]

I am just curious why. This seems intentional, but it is not clear to me; I feel as if I have missed something.

And a third (a new one :wink:):
When we use deep learning for collaborative filtering and pass concatenated embeddings through linear layers, what is the input? The embedding matrices?
How does backpropagation work in this case?
How can the embedding matrices' trainable parameters be updated?
Sorry if it is a dumb question.

Thank you!

  1. Not sure if you still have a question here, but if you do: I'd suggest writing out on paper a one-hot encoded vector times some small weight matrix and confirming that the product is the same as row indexing into the weight matrix.

  2. I'm not familiar with this piece of the code base, but from the source code you can see what's going on extremely well; see the sketch after this list.

  3. The input is always our (training) data, the x argument in the forward method from the class printed above. In that method you can see exactly what's going on. The x matrix will be of size batch_size × features (in this case: 2). It's easiest to consider a batch size of one, so just one training sample.

So we have our (user, movie) training sample, and the first thing is to get the embeddings by indexing into the respective matrices. Next we multiply the values of both embeddings together and take the sum (basically a dot product). We then add the bias for the respective user and movie we are training on, and clamp the output between two values with the sigmoid_range function. That's the output that goes to our loss function.
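
On point 2: here is what I take away from reading the fastai source, so please double-check me. The Categorify transform reserves an extra #na# class for missing or previously unseen values, which is where the +1 in the cardinality comes from, and the embedding width is then picked by a small heuristic called emb_sz_rule, roughly:

def emb_sz_rule(n_cat):
    # fastai's heuristic: width grows slowly with cardinality, capped at 600
    return min(600, round(1.6 * n_cat**0.56))

print(emb_sz_rule(3))  # Sex:    2 unique values + 1 for #na# -> width 3, so (3, 3)
print(emb_sz_rule(4))  # Pclass: 3 unique values + 1 for #na# -> width 3, so (4, 3)

That reproduces the [(3, 3), (4, 3)] you saw from get_emb_sz(dls.train_ds).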

Thank you for your answer!

In 3) you explained class DotProductBias(Module), which is mostly clear to me.
My question is about the CollabNN class (see below).

So when we use deep learning for collaborative filtering and pass concatenated embeddings through linear layers, what is the input? The embedding matrices?
How does backpropagation work in this case?
How can the embedding matrices' trainable parameters be updated?

Thank you!

class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range
        
    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])  # one embedding row per sample
        x = self.layers(torch.cat(embs, dim=1))                     # concatenate, then pass through the MLP
        return sigmoid_range(x, *self.y_range)

It's almost exactly the same, except that in the CollabNN class the network is a bit different: we concatenate the embeddings of the user and the movie and pass them to a linear layer, followed by a ReLU and then another linear layer, followed by the sigmoid range. But the principle stays the same. We pass in our data, x and y. From the x data we first get our embeddings and then pass them through the rest of the network. During training we learn the parameters of the embedding matrices and the linear layers. Perhaps it makes sense if you make a toy example and follow step by step what happens during the forward pass; I've sketched one below.
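
Here is a minimal toy example along those lines, using plain PyTorch's nn.Embedding instead of fastai's wrapper (the sizes are made up, and I dropped sigmoid_range to keep it short). It also speaks to your backpropagation question: the embedding rows are trainable parameters, not fixed inputs, so autograd records the lookup and the gradient of the loss flows back into exactly the rows that were used:

import torch
from torch import nn

n_users, n_items, n_factors = 10, 20, 4
user_factors = nn.Embedding(n_users, n_factors)
item_factors = nn.Embedding(n_items, n_factors)
layers = nn.Sequential(
    nn.Linear(n_factors * 2, 8),
    nn.ReLU(),
    nn.Linear(8, 1))

x = torch.tensor([[3, 7]])  # one sample: user 3, item 7
y = torch.tensor([[4.0]])   # the rating we want to predict

embs = user_factors(x[:, 0]), item_factors(x[:, 1])
pred = layers(torch.cat(embs, dim=1))
loss = ((pred - y) ** 2).mean()
loss.backward()

# only the rows that were looked up receive a gradient
print(user_factors.weight.grad[3])  # generally non-zero: the optimizer will update this row
print(user_factors.weight.grad[2])  # all zeros: this row was not used in this batch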


Thank you!

So if the embedding matrices' rows are passed as inputs to the rest of the network, how are they updated?

> During training we learn the parameters of the embedding matrices…

Inputs don't change in NNs; weights and biases do, at least in the linear layers.
I know that the embedding matrices' elements are trainable parameters.
So are they passed to the linear layers without change, or is there some special calculation happening inside the embedding layers?

Sorry, I am still confused; what am I missing?

I am very grateful for your help!