Lesson 8: confusion about understanding embeddings

Hi all,

I am struggling to deeply understand embeddings.

In collaborative-filtering-deep-dive.ipynb it is defined like this:

> Embedding: Multiplying by a one-hot-encoded matrix, using the computational shortcut that it can be implemented by simply indexing directly. This is quite a fancy word for a very simple concept. The thing that you multiply the one-hot-encoded matrix by (or, using the computational shortcut, index into directly) is called the embedding matrix.

  1. My first problem is that I cannot see or understand what one-hot encoding has to do with embeddings.

I understand that
user_factors.t() @ one_hot_3 is the same as user_factors[3]

but I cannot see one-hot encoding anywhere later in the lesson, not even when Jeremy builds his own embedding module from scratch.

He creates embeddings just by defining and calling this function:

def create_params(size):
    # a trainable parameter tensor of the given size, initialised from N(0, 0.01)
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

...
self.user_factors = create_params([n_users, n_factors])
...
  2. I also don't have a clear understanding of how embeddings are used in tabular models as categorical embeddings.
    E.g. the embedding size for Titanic's Sex feature is (3, 3).
    Sex has 2 unique values, so why does it need 3?
    Also, Pclass has 3 unique values but the embedding size is (4, 3). Why 4?
    It seems it is calculated as unique_values + 1, but why?

Thank you for the help!

Hi @teamtom, here are my 2 cents on this.

For me, embeddings are all about encoding your data. In the application of collaborative filtering we are looking for a representation / encoding of our users and our movies. The case for movies and users is identical, so let’s consider movies here:

Let’s say that we want to use 4D embeddings of our movies (meaning that each movie will be represented by 4 numbers e.g. [ 0.4, -0.2, 0.9, 0.7]) and we have a total of 1000 movies. All this data is thus stored in a 1000 x 4 weight matrix, holding all the encodings of our movies. Each movie is thus associated with one row in this matrix.

To access the embedding for any movie, we can thus index into the matrix. If we want to get the embedding for the movie that is associated with row 2, we can do:

import torch

weight_matrix = torch.randn(1000, 4)
weight_matrix[2, :]

which shows, for example:

> tensor([-1.1338, -0.7682, -0.4630,  0.5829])

Mathematically, indexing in is exactly the same as multiplying this weight_matrix by a one-hot encoded vector (one_hot below is fastai's helper):

one_hot(2, 1000).float() @ weight_matrix
> tensor([-1.1338, -0.7682, -0.4630,  0.5829])
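
If you want to verify this equivalence yourself without fastai, here is a minimal sketch in plain PyTorch (torch.nn.functional.one_hot does the encoding; weight_matrix is just the example matrix from above):

import torch
import torch.nn.functional as F

weight_matrix = torch.randn(1000, 4)

# direct row indexing
by_index = weight_matrix[2, :]

# multiplying a one-hot vector by the weight matrix picks out the same row
one_hot_vec = F.one_hot(torch.tensor(2), num_classes=1000).float()
by_matmul = one_hot_vec @ weight_matrix

assert torch.allclose(by_index, by_matmul)

So the one-hot story is not just a metaphor: indexing is literally a computational shortcut for that matrix product.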

Also, on the second line of the forward method you can see how he indexes into the weight matrix called movie_factors to get the embeddings for the movies in his data (x):

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors[x[:,0]]    # index into the user embedding matrix
        movies = self.movie_factors[x[:,1]]  # index into the movie embedding matrix
        res = (users*movies).sum(dim=1)      # dot product per sample
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)
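
To make the indexing concrete, here is a hand-traced sketch of that forward pass for a single sample. The sizes are made up, create_params is the function from above, and sigmoid_range is written out explicitly (fastai defines it as sigmoid(x) * (hi - lo) + lo):

import torch
from torch import nn

def create_params(size):
    # trainable parameter tensor, initialised from N(0, 0.01)
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

def sigmoid_range(x, lo, hi):
    # squash x into the interval (lo, hi)
    return torch.sigmoid(x) * (hi - lo) + lo

n_users, n_movies, n_factors = 10, 20, 5
user_factors  = create_params([n_users, n_factors])
movie_factors = create_params([n_movies, n_factors])
user_bias  = create_params([n_users])
movie_bias = create_params([n_movies])

x = torch.tensor([[3, 7]])         # one sample: user 3 rated movie 7
users  = user_factors[x[:, 0]]     # row 3 of user_factors, shape (1, 5)
movies = movie_factors[x[:, 1]]    # row 7 of movie_factors, shape (1, 5)
res = (users * movies).sum(dim=1)  # dot product per sample, shape (1,)
res = res + user_bias[x[:, 0]] + movie_bias[x[:, 1]]
pred = sigmoid_range(res, 0, 5.5)  # prediction squeezed into (0, 5.5)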

For your second question: that's all a design decision. You can use a 4D movie embedding as we did above, but also a 10D one. If you ask me, it doesn't make sense to use a 3D embedding for a variable such as sex which can only take on 2 values. There you would probably directly use a binary 0 or 1, or perhaps a 2D embedding, but certainly not a 3D one.


Thank you for the answer!

First:
I know that code block (class DotProductBias) from collaborative-filtering-deep-dive.ipynb and understand how it works.

My confusion comes from not knowing whether mentioning one-hot encoding is just a metaphor, or whether one-hot encoding is the basis that the embedding mechanism is built upon:

> Mathematically, indexing in is exactly the same as multiplying this weight_matrix by a one-hot encoded vector.

Second:
I used the get_emb_sz() function to calculate embedding sizes, and for Titanic's Sex it returned 3 (unique values: 2), and for Pclass it returned 4 (unique values: 3).

get_emb_sz(dls.train_ds)
> [(3, 3), (4, 3)]

I am just curious why. This seems intentional, but it is not clear to me; I feel as if I have missed something.

And a third (a new one :wink:):
When we use deep learning for collaborative filtering and pass concatenated embeddings through linear layers, what is the input? The embedding matrices?
How does backpropagation work in this case?
How can the embedding matrices' trainable parameters be updated?
Sorry if it is a dumb question.

Thank you!

  1. Not sure if you still have a question here, but if you do: I'd suggest writing out on paper a one-hot encoded vector times some small weight matrix and confirming that the product is the same as row indexing into the weight matrix.

  2. I'm not familiar with this piece of the code base, but from the source code you can see what's going on extremely well; see the sketch after this list.

  3. The input is always our (training) data, the x argument in the forward method from the class printed above. In that method you can see exactly what's going on. The x matrix will be of size batch_size × features (in this case: 2). It's easiest to consider a batch size of one, so just one training sample.

So we have our (user, movie) training sample, and the first thing is to get the embeddings by indexing into the respective matrices. Next we multiply the values of both embeddings together and take the sum (basically a dot product). We then add the bias for the respective user and movie we are training on, and clamp the output between two values with the sigmoid_range function. That's the output that goes to our loss function.
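
On point 2: here is what I take away from reading the fastai source, so please double-check me. The Categorify transform reserves an extra #na# class for missing or previously unseen values, which is where the +1 in the cardinality comes from, and the embedding width is then picked by a small heuristic called emb_sz_rule, roughly:

def emb_sz_rule(n_cat):
    # fastai's heuristic: width grows slowly with cardinality, capped at 600
    return min(600, round(1.6 * n_cat**0.56))

print(emb_sz_rule(3))  # Sex:    2 unique values + 1 for #na# -> width 3, so (3, 3)
print(emb_sz_rule(4))  # Pclass: 3 unique values + 1 for #na# -> width 3, so (4, 3)

That reproduces the [(3, 3), (4, 3)] you saw from get_emb_sz(dls.train_ds).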

Thank you for your answer!

In 3) you explained class DotProductBias(Module), which is mostly clear to me.
My question is about the CollabNN class (see below).

So when we use deep learning for collaborative filtering and pass concatenated embeddings through linear layers, what is the input? The embedding matrices?
How does backpropagation work in this case?
How can the embedding matrices' trainable parameters be updated?

Thank you!

class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range
        
    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])  # one embedding row per sample
        x = self.layers(torch.cat(embs, dim=1))                     # concatenate, then pass through the MLP
        return sigmoid_range(x, *self.y_range)

It's almost exactly the same, except that in the CollabNN class the network is a bit different: we concatenate the embeddings of the user and the movie and pass them to a linear layer, followed by a ReLU and then another linear layer, followed by the sigmoid range. But the principle stays the same. We pass in our data, x and y. From the x data we first get our embeddings and then pass them through the rest of the network. During training we learn the parameters of the embedding matrices and the linear layers. Perhaps it makes sense if you make a toy example and follow step by step what happens during the forward pass; I've sketched one below.
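
Here is a minimal toy example along those lines, using plain PyTorch's nn.Embedding instead of fastai's wrapper (the sizes are made up, and I dropped sigmoid_range to keep it short). It also speaks to your backpropagation question: the embedding rows are trainable parameters, not fixed inputs, so autograd records the lookup and the gradient of the loss flows back into exactly the rows that were used:

import torch
from torch import nn

n_users, n_items, n_factors = 10, 20, 4
user_factors = nn.Embedding(n_users, n_factors)
item_factors = nn.Embedding(n_items, n_factors)
layers = nn.Sequential(
    nn.Linear(n_factors * 2, 8),
    nn.ReLU(),
    nn.Linear(8, 1))

x = torch.tensor([[3, 7]])  # one sample: user 3, item 7
y = torch.tensor([[4.0]])   # the rating we want to predict

embs = user_factors(x[:, 0]), item_factors(x[:, 1])
pred = layers(torch.cat(embs, dim=1))
loss = ((pred - y) ** 2).mean()
loss.backward()

# only the rows that were looked up receive a gradient
print(user_factors.weight.grad[3])  # generally non-zero: the optimizer will update this row
print(user_factors.weight.grad[2])  # all zeros: this row was not used in this batch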


Thank you!

So if the embedding matrices' rows are passed as inputs to the rest of the network, how are they updated?

> During training we learn the parameters of the embedding matrices…

Inputs don't change in NNs; weights and biases do, at least in the linear layers.
I know that the embedding matrices' elements are trainable parameters.
So are they passed to the linear layers without change, or is there some special calculation happening inside the embedding layers?

Sorry, I am still confused; what am I missing?

I am very grateful for your help!