Wiki: Lesson 5

(Rachel Thomas) #1

<<< Wiki: Lesson 4 | Wiki: Lesson 6 >>>

Lesson resources

Links to more info

Other datasets available

Video timeline

  • 00:00:01 Review of students’ articles and work

  • 00:07:45 Starting the 2nd half of the course: what’s next?
    MovieLens dataset: build an effective collaborative filtering model from scratch

  • 00:12:15 Why a matrix factorization and not a neural net?
    Using Excel Solver for Gradient Descent (‘GRG Nonlinear’)

  • 00:23:15 What are the negative values for ‘movieid’ & ‘userid’, and more student questions

  • 00:26:00 Collaborative filtering notebook, ‘n_factors=’, ‘CollabFilterDataset.from_csv’

  • 00:34:05 Dot Product example in PyTorch, module ‘DotProduct()’

  • 00:41:45 Class ‘EmbeddingDot()’

  • 00:47:05 Kaiming He Initialization (via DeepGrid),
    sticking an underscore ‘_’ in PyTorch, ‘ColumnarModelData.from_data_frame()’, ‘optim.SGD()’

  • Pause

  • 00:58:30 ‘fit()’ in ‘’ walk-through

  • 01:00:30 Improving the MovieLens model in Excel again,
    adding a constant for movies and users called “a bias”

  • 01:02:30 Function ‘get_emb(ni, nf)’ and Class ‘EmbeddingDotBias(nn.Module)’, ‘.squeeze()’ for broadcasting in PyTorch

  • 01:06:45 Squashing the ratings between 1 and 5 with a Sigmoid function

  • 01:12:30 What happened in the Netflix prize, looking at ‘’ module and ‘get_learner()’

  • 01:17:15 Creating a Neural Net version “of all this”, using the ‘movielens_emb’ tab in our Excel file, the “Mini net” section in ‘lesson5-movielens.ipynb’

  • 01:33:15 What is happening inside the “Training Loop”, what the optimizer ‘optim.SGD()’ and ‘momentum=’ do, spreadsheet ‘graddesc.xlsm’ basic tab

  • 01:41:15 “You don’t need to learn how to calculate derivatives & integrals, but you need to learn how to think about them spatially”, the ‘chain rule’, ‘Jacobian’ & ‘Hessian’

  • 01:53:45 Spreadsheet ‘Momentum’ tab

  • 01:59:05 Spreadsheet ‘Adam’ tab

  • 02:12:01 Beyond Dropout: ‘Weight-decay’ or L2 regularization

Lesson 5 In-Class Discussion

I was getting an error running a section of the code in the movielens notebook. A git pull fixed it, though. Just a hint in case anybody else is experiencing similar problems.

(Scott C) #4

Does anybody know the section of the machine learning course where broadcasting is discussed?


I’m not sure, but it’s a common numpy/Python concept. If you just Google “Python broadcasting” you can find lots of resources.


Sharing my notes for this lesson.

Collaborative filtering:
(Pretty similar to the lesson notebook, but with some added screenshots/explanations)

Gradient descent/optimization techniques:

I wasn’t sure how much of the optimization techniques were going to be covered in future lessons, so I ended up doing quite a bit of background study on my own (never a bad thing of course).

I found the Excel method of working through the algorithms works very well for getting an understanding. Though I did have one question that I was never able to resolve (Question about graddesc.xlsm).

Here are some links I found useful for some of the “lower level” stuff.

Derivatives of multivariable functions (talks about what partial derivatives are and the gradient, and has some GREAT visualizations showing the gradient as the direction of steepest ascent, the Jacobian, etc.)

This course is REALLY good – it’s done by the same guy who did this very popular neural network series on YouTube:

Anyway, the course is:

It seems like a lot but the videos are fairly short. This lesson in particular is helpful:

Backprop (calculating derivatives) as a circuit: (video version of the Stanford notes linked above)

RMSProp:


Weight decay/L2 regularization:

(Scott C) #7

This is also a good explanation of backpropagation:

(Pranav Kanade) #8

Hi there,
I am trying to solve the “predict the happiness” problem mentioned above by @rachel.

Here is what I did:

  1. Taking a very simple approach, I converted the column of comments to a word count.
  2. Next, I converted everything to categorical numeric values.
  3. Then I tried to build a simple feed-forward neural net using PyTorch.
  4. Hence the input tensor has dimensions 3 × 1 and the output 2 × 1 (happy or not).
  • If I run the following code for training, it throws an error:
net = Net(input_size, hidden_size, num_classes)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    for i, (param, res) in enumerate(zip(Dataset, lab)):
        # convert the python list to torch variable
        data = torch.FloatTensor(param)
        res = torch.FloatTensor(res)

        data = Variable(data)
        label = Variable(res)
        # forward + backward + optimize
        outputs = net(data)
        loss = criterion(outputs, label)      # here I get error
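        # note: nn.CrossEntropyLoss expects 'outputs' of shape (N, C) and
        # 'label' to be a LongTensor of class indices (shape (N,)),
        # not a one-hot FloatTensor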

Result of printing input (data), output (outputs) and expected (label):

[torch.FloatTensor of size 3]

[torch.FloatTensor of size 2]

[torch.FloatTensor of size 2]

RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)

Any kind of help would be appreciated …!!

(heisenburgzero) #9

I have a question regarding RMSProp. In the lecture, it was mentioned that the square of the gradient represents the variance of the gradient, so if the learning rate is divided by it, the oscillations in the weights get dampened. But is it also true that this will prevent learning quickly at steep gradients? If the gradients are all in the same direction, they will still produce a high RMS value.
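For context, this is roughly the update rule I have in mind (a sketch; the variable names and constants here are illustrative, not taken from the lecture spreadsheet):

# RMSProp update for a single weight (illustrative sketch)
lr, beta, eps = 0.01, 0.9, 1e-8   # learning rate, smoothing factor, small constant

def rmsprop_step(w, grad, ms):
    ms = beta * ms + (1 - beta) * grad ** 2    # running average of the squared gradient
    w = w - lr * grad / (ms ** 0.5 + eps)      # step size divided by RMS of the gradient
    return w, ms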

(Scott C) #10

Answering my previous question, this CS231n Numpy Tutorial gives a brief but good explanation of broadcasting.
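For anyone else looking, a tiny example of what broadcasting does:

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3)
b = np.array([10, 20, 30])   # shape (3,)

# b is broadcast across each row of a, as if it were tiled to shape (2, 3)
print(a + b)
# [[11 22 33]
#  [14 25 36]]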

(Jorge Barrios) #11

What is “” expressing?

I see it’s an array with values related to the validation set, but different to predictions and ground truths.

(Minh Nguyen) #12


I have a doubt regarding embedding vectors. My simple understanding of neural networks is that they try to tune the weights to optimize a cost, based on X-y values given by observations. Since we are trying to tune embeddings, I suppose they belong to the ‘weight’ class. However, if that is correct, what are the X-y values given by a ranking in the collaborative filtering case? I know y is the ranking value, but X in this case is simply a userID and a movieID; how can the network optimize embeddings based on these two simple values? Thanks for enlightening me :smile:

(Bart Fish) #13

I was going through this notebook again to review it and experienced this problem; it was because I had a Conda environment for Python 2 selected on my machine. Change the kernel type to Python 3 and you’ll be good to go.

(Dave Luo) #14

Hi @nminhptnk,

Your thinking is close to the mark in that embeddings contain weights to be learned. The next step is understanding that embeddings are 1) indices (also described as a look-up table) that link each unique userID and movieID to 2) a higher-dimensional space (vectors of learnable weights) that can better represent the complexity of each userID and movieID:

Instead of trying to learn directly:

userID, movieID -> y

With embeddings, you’re learning:

userID, movieID -> look up the specific n-dimensional vector representing each userID, movieID based on their indices -> update vectors -> y

The number of dimensions in embeddings is set by the hyperparameter n_factors, aka number of latent factors. E.g. if n_factors = 5:

  • the embeddings for movieID “1” could learn to be [0.1, 0.2, 0.95, -0.36, 0.04]
  • the embeddings for movieID “2” would be a different set of 5 learned values like [-0.3, 1.32, 0.35, 0.96, -0.12].
  • The same is true for embeddings representing each userID.
  • These numbers are randomly initialized to start and learned/updated through training to represent the “latent” qualities of each movie and user.

Movies or users that are more semantically similar to each other (i.e. Oscar-winning biopics or users who really like Denis Villeneuve-directed movies) will have embedding vectors that are “closer” to each other in high-dimensional space. This similarity as proximity can be interpreted and visualized using dimensionality-reduction techniques like PCA as shown in lesson 6 (lesson video time-marked here).
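To make the lookup concrete, here is a minimal standalone PyTorch sketch of the embedding lookup plus the dot product used in the lesson’s DotProduct model (the sizes and names are illustrative; this is not the notebook’s fastai code):

import torch
import torch.nn as nn

n_users, n_movies, n_factors = 671, 9066, 5   # illustrative sizes

class DotProductModel(nn.Module):
    def __init__(self):
        super().__init__()
        # one learnable n_factors-dimensional vector per user index / movie index
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)

    def forward(self, users, movies):
        # look up each row's vectors by index, then take the per-row dot product
        return (self.u(users) * self.m(movies)).sum(1)

model = DotProductModel()
users = torch.LongTensor([0, 1, 2])    # contiguous user indices
movies = torch.LongTensor([3, 3, 7])   # contiguous movie indices
preds = model(users, movies)           # one predicted rating per (user, movie) pair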

(Jorge Barrios) #15

What function(s) should I call to get predictions for a (userId, movieId) pair?

Isn’t there something like:

ranking = predict(userId, movieId)

A code sample would be greatly appreciated.

Thanks in advance!

(Minh Nguyen) #16
  1. Thank you @daveluo, I understand it better now. So basically it still follows the principle of linear matrix multiplication; it is just that the input layer is in one-hot encoded form, so the matrix multiplication simplifies to a look-up operation. I guess when backpropagation flows backward, the gradient dg/dw for this layer is simply 1.

  2. Say, hypothetically, that we want to combine these categorical variables with some data like pictures; I suppose we should concatenate this embedding vector at the fully connected layers in order to match dimensions, shouldn’t we?

(Dave Luo) #17

Yes, your 1st point is correct in that we’re doing the equivalent of a matrix product between a one-hot encoded vector and the embedding matrix. I found it useful to rewatch the part of lesson 4 where Jeremy explains the matrix algebra behind embeddings, time-marked here:
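For a quick sanity check that the matrix product with a one-hot vector really is just a row lookup (illustrative numbers):

import numpy as np

emb = np.random.randn(5, 3)               # embedding matrix: 5 movies x 3 latent factors
one_hot = np.array([0., 0., 1., 0., 0.])  # one-hot vector selecting movie index 2

# matrix product with a one-hot vector == picking out row 2 of the embedding matrix
assert np.allclose(one_hot @ emb, emb[2])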

Re: your 2nd point, yeah, it sounds like that should work: concatenate embeddings with the other inputs or features the way we would generally handle multiple inputs. I haven’t tried it personally, but I found this relevant discussion, which suggests it’s pretty straightforward:
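Roughly, the concatenation could look something like this (the shapes and names are made up for illustration):

import torch
import torch.nn as nn

emb = nn.Embedding(100, 8)              # 100 categories -> 8-dim embedding vectors
img_features = torch.randn(4, 512)      # e.g. CNN features for a batch of 4 images
cat_idx = torch.LongTensor([5, 17, 3, 99])

x = torch.cat([img_features, emb(cat_idx)], dim=1)   # shape (4, 512 + 8)
out = nn.Linear(512 + 8, 2)(x)          # pass the combined features through FC layers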