Wiki: Lesson 5

Scratch everything I said above.

I’m keeping my post above so that folks who are confused by thinking about PCA the way I was can see where my assumptions about how it works went wrong. I spent this past weekend really digging into PCA … how it works, where it is useful, and what objectives it can be applied to. I include a list of resources below that I found helpful, but let’s get to answering your two questions first:

Why do we choose to fit the transpose of the movie embedding rather than the embedding itself?

The answer lies in part by asking ourselves two questions:

  1. “What does the data represent?”
  2. “What is the problem we are attempting to solve?”

The answer to the first question is that our (3000, 50) matrix represents 3,000 movie reviews by 50 things the model has learned about movie reviews that make them meaningful for language modeling and classification. We don’t know what each of these 50 embedding values represents, but we do know that whatever they are, they provide a good representation of each movie review because they have proven useful in the classification task.

The answer to the second question is that we are trying to reduce these 50 things to 3 things and then figure out how related each of the 3k movie reviews is to each of these 3 things, so we can infer what the 3 things represent. We are asking, “How can we cluster these 50 different things learned into 3 big areas?” (Notice that we are NOT trying to reduce the dimensionality of each movie review. If we were, we wouldn’t want to transpose the embedding matrix and we wouldn’t care about the components.)

So …

By transposing the matrix, our examples become the 50 things each of our embeddings has learned about the movie reviews, and the features become the 3k movie reviews. So when we call fit on the transposed matrix, we are asking PCA to essentially figure out how much each review plays a part in each of the 3 learned principal components. That is exactly what we want!

Why do we take the components from the fit PCA rather than transforming the matrix?

Because they represent the eigenvector for each principal component.

What is an ‘eigenvector’ and why do we care about this?

Simply put, the eigenvector represents how many parts each feature (the 3k reviews) plays in composing the PC. The higher the number, the more important the review is for that PC, and therefore, the more representative of what the PC means.

So look at the dimension of .components_ and you’ll notice that it is (3,3000). The first row is the eigenvector for PC 1 … and each value in there tells you how much each review played a part in learning that PC. Thus, if you sort the values in the first row in descending order, the reviews at the top are those most strongly correlated with PC 1. That information can then be used in turn to infer what PC 1 means.
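To make the ranking step concrete, here is a minimal sketch. It assumes random data in place of the real (3000, 50) embedding matrix and uses a plain NumPy SVD as a stand-in for the fitted PCA object; the names movie_emb, components, and top_idx are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the real embedding matrix: 3,000 reviews x 50 learned values.
movie_emb = rng.normal(size=(3000, 50))

# Treat the 50 embedding columns as examples and the 3,000 reviews as features.
X = movie_emb.T                      # shape (50, 3000)
X_centered = X - X.mean(axis=0)      # PCA centers each feature first

# Rows of Vt are the principal directions, ordered by explained variance.
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:3]                  # shape (3, 3000), playing the role of .components_

# Indices of the 10 reviews with the largest weights on PC 1, in descending order.
top_idx = np.argsort(components[0])[::-1][:10]
print(components.shape, top_idx)
```

Keep in mind that the sign of each eigenvector is arbitrary, so the most negative entries of a row can be just as informative as the most positive ones.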

Grab a cold one and think on all this and I guarantee it will start to make sense

Helpful resources:


I have a question about Pytorch’s RNN segment. When we are doing the manual version, for concatenation, we just create a linear layer of n_fac + n_hidden and do a torch.cat of the hidden state and the embedding of the input, like so: inp = torch.cat((h, self.e(c)), dim=1).

I would like to do something similar using the Pytorch’s RNN module. I created an RNN layer of dimension self.rnn = nn.RNN(n_fac + n_hidden, n_hidden). But I’m not sure how to match the dimension during concatenation as the hidden state is a rank 3 tensor. Any ideas?

This was helpful in wrapping my head around things. I get what’s going on with the linear algebra now.

I’m still left with the question of why we used PCA in this particular fashion instead of using it to reduce the matrix weights into 3 dimensions and plotting those.

I did a quick notebook looking at each method. The clustering results, while not identical, are very very similar. So I guess this is one of those things where it doesn’t really matter either way?

I was able to figure out how to do it. I’m posting it here in case others are looking for it, and to see if anyone else has done it in a more efficient way.

Here is the module:

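import torch
import torch.nn as nn
import torch.nn.functional as F

class CharConcatRNN(nn.Module):
    def __init__(self, vocab_size, n_hidden, n_fac):
        super().__init__()
        self.n_hidden = n_hidden
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac + n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)

    def forward(self, cs):
        bs = cs.shape[0]
        # (bs, seq_len) -> (seq_len, bs), the layout nn.RNN expects by default
        cs = cs.transpose(0, 1).contiguous()
        h = cs.new_zeros((1, bs, self.n_hidden), dtype=torch.float)
        # Expand h along the sequence dimension so it matches self.e(cs)
        expand_h = [cs.shape[0] // h.shape[0]] + [-1] * (len(h.shape) - 1)
        inp = torch.cat((self.e(cs), h.expand(*expand_h)), dim=-1)
        outp, h = self.rnn(inp, h)
        return F.log_softmax(self.l_out(outp[-1]), dim=-1)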

This is using Pytorch 0.4. There are 3 lines of code here that are important.

  1. The key is to manually broadcast the h tensor to match the shape of self.e(cs) so that we can concatenate it on the 3rd dimension (which means the other two dimensions have to match). I got this trick from this stackoverflow question.
  2. In order to do the manual broadcasting, I had to transpose the first two dimensions of cs (I copied that code from the custom loss function nll_seq_loss).
  3. I used the method new_zeros to initialize h. According to the documentation, the new zero-valued tensor is put on the same device as the source, which means that if the module resides on the GPU, h will get initialized on the GPU (and similarly for the CPU). If we used just torch.zeros, we would have to add .cuda() to it to get it on the GPU.
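The shape bookkeeping in point 1 can be checked in isolation. This is a small sketch with made-up sizes (seq_len, bs, n_fac, and n_hidden are chosen arbitrarily), and emb is a random stand-in for self.e(cs).

```python
import torch

seq_len, bs, n_fac, n_hidden = 8, 4, 5, 6   # arbitrary illustrative sizes

emb = torch.randn(seq_len, bs, n_fac)       # stand-in for self.e(cs)
h = emb.new_zeros((1, bs, n_hidden))        # same device and dtype as emb

# expand() repeats the size-1 leading dimension without copying memory;
# -1 means "leave this dimension unchanged".
h_wide = h.expand(seq_len, -1, -1)          # (seq_len, bs, n_hidden)

# The first two dimensions now match, so we can concatenate on the last one.
inp = torch.cat((emb, h_wide), dim=-1)      # (seq_len, bs, n_fac + n_hidden)
print(inp.shape)
```

Because expand() returns a view rather than a copy, the broadcast itself is cheap; the memory is only materialized by torch.cat.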


They are related but not the same thing.

The eigenvector for each PC is what tells us how important each feature is to the PC, and what we are particularly interested to know is which reviews are more important for each PC. This vector is made available to us via the .components_ property.

So consider what happens when you run this code:
pca = PCA(n_components=3).fit(movie_emb)  # from sklearn.decomposition import PCA
x = pca.transform(movie_emb)
eigens = pca.components_

x is a (3000,3) dimensional matrix where each value represents the transformed “embedding” value required to reduce the dimensionality of the data from 50 to 3. eigens is a (3,50) dimensional matrix and tells us how important each embedding column is to each of the 3 principal components.

But we aren’t interested in ranking the importance of the 50 embedding columns to each PC … we want to rank the importance of each movie review with respect to each PC.

Consider this now …
pca = PCA(n_components=3).fit(movie_emb.T)
x = pca.transform(movie_emb.T)
eigens = pca.components_

x is a (50,3) dimensional matrix where each value represents the transformed “movie review” value required to reduce the dimensionality from 3000 to 3. eigens is a (3,3000) dimensional matrix and tells us how important each movie review is to each of the principal components.

That is what we want.

If we simply wanted to project the 3,000 reviews into a 3-dimensional space, then simply using x from the first approach would be sufficient. But as we are here interested in how important each movie review is to each PC, the latter approach seems correct.
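A quick way to check the shapes of the two approaches side by side. This sketch assumes random data and a small hypothetical helper, pca_fit_transform, built on NumPy’s SVD and standing in for a PCA fitted with 3 components.

```python
import numpy as np

def pca_fit_transform(X, n_components=3):
    """Hypothetical helper: return (projected data, components) via SVD."""
    Xc = X - X.mean(axis=0)                      # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, components

movie_emb = np.random.default_rng(0).normal(size=(3000, 50))

# Approach 1: reviews are the examples, embedding columns are the features.
x1, eigens1 = pca_fit_transform(movie_emb)
print(x1.shape, eigens1.shape)   # (3000, 3) (3, 50)

# Approach 2: embedding columns are the examples, reviews are the features.
x2, eigens2 = pca_fit_transform(movie_emb.T)
print(x2.shape, eigens2.shape)   # (50, 3) (3, 3000)
```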

See also:

TL;DR Using approach #2, we are reducing the dimensionality of the 3000 reviews to 3 and then using the .components_ property to find out how important each of the 3000 movie reviews is to each PC.

Good question. That would just mean each individual piece of information (user_id, genre, movie_title) has its own embedding matrix that is initialized randomly. So for every movie, you would look up the embedding matrix of each piece of info, get their respective vectors, concatenate them and pass them through a linear layer.

This layer would then be tweaked and fine-tuned with back-propagation, and its output would represent a “movie embedding”. Repeat the same for users.
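One possible sketch of that idea, for the movie side only. The class name MovieFeatureEmbedding, the two fields used, and all sizes are hypothetical; the post does not specify them.

```python
import torch
import torch.nn as nn

class MovieFeatureEmbedding(nn.Module):
    """Hypothetical sketch: one embedding table per categorical field."""
    def __init__(self, n_genres, n_titles, emb_dim=8, out_dim=16):
        super().__init__()
        self.genre = nn.Embedding(n_genres, emb_dim)   # randomly initialized
        self.title = nn.Embedding(n_titles, emb_dim)   # randomly initialized
        self.lin = nn.Linear(2 * emb_dim, out_dim)

    def forward(self, genre_idx, title_idx):
        # Look up each field, concatenate the vectors, pass through a linear layer.
        x = torch.cat((self.genre(genre_idx), self.title(title_idx)), dim=1)
        return self.lin(x)

model = MovieFeatureEmbedding(n_genres=20, n_titles=1000)
movie_vec = model(torch.tensor([3, 7]), torch.tensor([42, 999]))
print(movie_vec.shape)
```

The embedding tables and the linear layer are all ordinary parameters, so back-propagation updates them together.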

This is due to Jeremy rerunning the model after he made a change later on, to explain something, which “messed up” this part. If you run your own unmodified version, you won’t see this happening. In other words we NEVER let them overfit!

As for how much they should be different it is all in the numbers themselves, that is: you want to think whether the results you are getting are “significantly” different. For accuracy problems sometimes it is easy to tell, are these differences due to the fact that we are identifying correctly just one or two more samples?

In the part for DIY Embeddings, can anyone explain why there’s a conts not used in the def forward?

class EmbeddingDot(nn.Module):
    def __init__(self, n_users, n_movies):
    def forward(self, cats, conts):
        users,movies = cats[:,0],cats[:,1] # this line did not use conts
        u,m = self.u(users),self.m(movies)
        ret = (u*m).sum(1)
        return ret.view(ret.size()[0],1)

cats are for categorical variables, conts for continuous ones. For the collaborative filtering model, we only have two independent variables: user ID and movie ID. Neither of them is continuous.

Now, you might wonder why put it into the function definition? Just so it works with other parts of the code that we are re-using.
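For reference, a self-contained sketch of such a model. The __init__ body is not shown in the question, so the embedding layers and the n_factors argument here are assumptions made only so that the example runs; conts is accepted and ignored.

```python
import torch
import torch.nn as nn

class EmbeddingDotSketch(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50):  # n_factors is assumed
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)

    def forward(self, cats, conts=None):
        # conts is unused: both inputs (user ID, movie ID) are categorical.
        users, movies = cats[:, 0], cats[:, 1]
        u, m = self.u(users), self.m(movies)
        return (u * m).sum(1).view(-1, 1)

model = EmbeddingDotSketch(n_users=10, n_movies=20)
cats = torch.tensor([[0, 5], [3, 19]])
empty_conts = torch.zeros(2, 0)      # placeholder: there are no continuous columns
pred = model(cats, empty_conts)
print(pred.shape)
```

Whatever is passed as conts has no effect on the prediction; the argument only keeps the call signature compatible with code that always supplies both.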

In lesson 5 there is a Microsoft Excel spreadsheet. How do I add more movies and more users to the spreadsheet?

I don’t think what you are saying about performing PCA on the transposed matrix makes sense.

By performing PCA on the transposed data we are not getting the importance of each movie review for each PC. The reason is that movie_emb just has a unique movieId and its embedding; there is no information about ratings per se in the matrix, so I am not sure how we are getting ratings into the picture here.

You can easily confirm that movie_emb consists of unique movieIds by checking whether len(np.unique(topMovieIdx)) == len(topMovieIdx)

If we have to get the ratings into picture, we will have to have multiple rows for a particular movie!

Where do I mention anything about the movie “ratings”???

PCA is being used here to break down the dimensionality of a vector so that, through observation, we can infer what a particular embedding means. We aren’t looking at the ratings per se except to restrict our dataset to the top 3,000 rated movies.

What does “importance of each movie review” mean in this statement? As movie reviews are represented by ratings, I may have confused the two, but I would like to get more clarity here.

As for this 0.5 padding, I think it is a great example of @jeremy at his best. When he tells the story, everything looks so natural and straightforward. But from time to time these kinds of magic constants pop up. He says it could be arbitrary. In any case, if we need a number here we will have to choose one, he says. Let it be 3, or 0.5, or whatever, it is not important, he says. And voilà, we’ve surpassed the state of the art (again). But when you try to reimplement it yourself, you quickly realize that any other value leads to worse results. And you have no idea how this exact one was chosen, or how to choose another one for your own specific task. And I think this is exactly the reason why @jeremy was a Kaggle champion so many times: because of his ability to discover such magic numbers. :)

Talking about new users and new movies, we are mostly talking about the cold start problem. To avoid repeating myself, here is the link to an answer to a similar question from the previous year’s (2017) version of the course.
Making predictions with collaborative filtering

Thanks Lator


At 1:23:00 in this lesson, when you say nu+nm (written in dark green), should it be nu_factors + nm_factors, which here happen to be the same value?
Thank you very much!

I’m having trouble understanding how we update the elements of embedding matrices with batch gradient descent.

Let’s say I have randomly initialized an embedding matrix for day of week as a 7x4 matrix. I then use SGD to update my weights and this embedding matrix (updating my NN weights after every training example). My first training example has a Tuesday, so I replace that Tuesday with the corresponding embedding vector and feed it to my NN; then, during backprop, I allow the embedding vector of Tuesday to be updated as well.

Then how do I update the embedding matrix if I want to use batch gradient descent? Batch GD only updates the weights after a certain number of training examples have been fed to my NN, say after 7 training examples, each with a different day of week. After backprop, which embedding vectors would be updated?

yes, that’s true, already discussed and answered above in this thread Wiki: Lesson 5

From the perspective of an embedding matrix it doesn’t matter if the embeddings are updated after each training example (online learning) or after the minibatch (sgd). It’s no different from how layer weights are updated.
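To see this concretely, here is a small sketch with a made-up minibatch (the index-to-day mapping, with 1 as Tuesday and 3 as Thursday, is an assumption). The gradient of the embedding matrix is accumulated over the whole batch, and only the rows that were looked up receive a nonzero gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(7, 4)              # 7 days of the week, 4 values each
days = torch.tensor([1, 1, 3])        # a minibatch: Tuesday twice, Thursday once

# Any scalar loss will do for the illustration.
loss = emb(days).sum()
loss.backward()

# Only the rows that appeared in the batch have a nonzero gradient;
# a row that appeared twice accumulates both contributions.
print(emb.weight.grad)
touched = emb.weight.grad.abs().sum(dim=1) > 0
print(touched.nonzero().flatten().tolist())
```

With a plain SGD step, rows with zero gradient are left unchanged, so after a batch containing all 7 days every row of the 7x4 matrix would be updated once.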