Wiki: Lesson 5

jcatanza · May 14, 2018, 10:42pm

At the end of Lesson 5 we learned that collaborative filtering can approximate the results of a user - movie rating database with astonishingly low error. But this seems to be a dead end – all we’ve really done is reproduce the results of a giant movie rating survey – we haven’t gained any new knowledge.

For our result to be practical, we would want to be able to predict how a new user who is not in our database would rate a new movie that is not in our database.

How could we do this? If the user and movie embeddings were based on a set of features, we could compare their embeddings to their nearest neighbors in the training set and use some sort of averaging to predict how any new user would rank any new movie. But there aren’t enough user and movie features.

So I am a bit perplexed here as to how these results can be used practically. Can anyone weigh in with some additional insight?

Just for fun, I modified the EmbeddingNet class to incorporate an embedding for movie genres. As Jeremy predicted, this didn’t improve the score.

I think the really amazing takeaway from this part of Lesson 5 is the simplicity of the matrix factorization technique that allows us to decompose a huge MxN rectangular matrix into the product of an MxJ matrix and a JxN matrix, for arbitrary integer J! This means solving for M*N*J^2 weights given only M*N datapoints!

pierreguillou · May 19, 2018, 10:12pm

Check your understanding of the lesson 5

<<< Check your understanding of the lesson 4 | Check your understanding of the lesson 6 >>>

(original post in portuguese)

Hi guys,

I did watch again the video of the lesson 5 (part 1) to get the whole image and I took notes of the vocabulary used by @jeremy.

Let’s play ! OK ?
Can you give a definition / a url / an explanation for all the followings terms and expressions ?

If yes, you are done with the 5th lesson !!!

PS : you do not want to test yourself or you want to check your answers ? Go to the blog post “Deep Learning 2: Part 1 Lesson 5” of @hiromi : " super travail !!! "

Structured Deep Learning : not a lot of paper on Deep Learning for structured data with comparaison to computer visionand language natural
Towards Data Science
Kaggle competition : Plant seedings Classification
this course starts the 2nd half of parte 1 (let’s dive into the source code) : the first half was about understanding the concepts, knowing best pratices and running the code by going through aplications (notebooks); this one is about the code to write with a high level of description
Goal of the lesson : create a collaborative filtering model from scratch (notebook : lesson5-movielens.ipynb)
Movielens dataset is a list of ratings
we use userid and movieid (categorical variables) and rating (independant variable) (we do not use here timestamp)
we get the users that watch the most movies and the movies most watched
in the beginning of the course, we are not going to build a Neural Network but a collaborative filtering model.
we use pandas in the jupyter notebook in order to create a crosstab table of the 15 users they give the most ratings vs the movies which were the most rated
Then, we copy/call this table of numbers atuais in Excel.
** functions to know : pd.read_csv, groupby(), sort_values(), join, crosstab()
** We copy/paste the stucture of the table and put ratings numbers by random (how ? each rating is the dot product of 2 vectors : one that qualifies a user and the other that qualifies a movie. The initial values of these 2 vectores are taken by random. When there is not a true rating, we put zero as the prevision).
** Then, we create an error cell that computes the root-mean-square error (RMSE) which is square root of the mean of the error square).
** This is not a neural net but a single matrix multiplication between 2 matrixes (one of the users and one of the movies)
** In Excel, we can do Gradient Descent : go to Data >> Solver >> Objective function (the cell with the RMSE) : cells to change + MIN (using GRG NonLinear which is Gradient Descent method)
** As this is not a Deep Neural Network (there is no hidden layer), we call this shallow learning.
** We do here a matrix decomposition (probabilistic matrix factorization)
** The numbers for each movie and for each user are called latent factors do vector de embeddings. The gradient descent tries to find these numbers.
** how do decide the dimensionality of our embedding matrix ? No idea. We have to try things and this have to represent the true complexity of the system but not too big (avoid overfitting, avoid time consuming for computation)
** the negative value in the embedding matrix represents the oposite (ie, I do not like)
** if you have a new user, you must retrain your model but we will see that later
Back to the jupyter notebook
** we use get_cv_idxs() to get our validation set
** wd means weight decay (L2 regularisation)
** n_factores : size of our embedding matrix
** our data model is cf = CollabFilterDataset.from_csv()
** our learn model is learn = cf.get_learner() with an optimizer which is optim.Adam
** learn.fit(lr, wd=wd, cycle_len = 1, cycle_mult=2)
** the error is the MSE (mean squared error), not the RMSE, then we need to take the root
** that’s all : the fastai library allows us to get a better validation loss in 3 lines of codes (cf, learn, learn.fit) than the actual benchmark
** Let’s try now to build the Collaborative Filtering from scratch using pytorch
** we can create a torch Tensor in pytorch by using capital T : T([1.,2],[3, 4])
** The multiplication of 2 torch Tensor is a element wise multiplication
we are going to build a layer (our custom neural net layer or custom pytorch layer) = a pytorch module
** And then we can instantiate a model as a pytorch module, use it as a function that we can compose with very conveniently (take the derivative for example)
** to create a pytorch module, we need first to create a pytorch class in which you return the calculated value in a special method called forward
** in a neural net, when you calculate the next activations, it is called the forward pass : it is doing a forward calculation (the gradient is called the backward calculation but we do not have to define that as pytorch does it automatically)
** first thing to do is to get a continuous index of userid and movieid to avoid a huge embedding matrix (we use for that the unique() method and the creation of dictionary)
** each time we want to pass our new number of users, movies (we call them states), we need a constructor for our class (this is a special method def __init__)
** 2 other things to get a full pytorch layer : we inherit of the nn.Module class to get all cool staff from pytorch and we need to call the super class constructor (when we create our own constructor : super().__init__())
Then, we need to give some behavior and we do that by storing somethings in it.
** we create self.u which is an embedding layer : self.u = nn.Embedding(n_users,n_factors), same thing with movies
** we need now to initialize by random our embedding matrices but with small numbers
** the embedding matrix is not a tensor, it is a variable (a variable is a tensor and it does automatic diferentiation)
** then to get the tensor, we use the data attribute
** uniform_ does operate in the same tensor (fill in the matrix)
** finally, we create the forward method by grabbing the embeddings vector for the user and the movie (minibatch of them : this is done autmatically by pytorch : DON’T DO A FOR LOOP because it does not use GPU), and return the dot vector multiplication
** Then, we can write our 3 lines of codes : data with the fastai library, our pytorch module (our model) that we initiate with our EmbeddingDot class, and finally we can fit our model by using the pytorch way
Biais
** we need to add a constant for each user and one for each movie to take account the fact that for example the user always gives a high rating and that a movie is liked by everyone because these are biais : they hide the true diferences.
** Then, we modificate our pytorch module to take account the biais.
** we use broadcasting to add a matrix and a vector (squeeze())
** then, we use a sigmoid function to put all calculations between 1 and 5 (it is not common but help)
** all the functions in pytorch are availables in capital F (F.sigmoid)
** we must precise cuda() as we don’t use a learner from fastai
** One remark : we do not do exactly matrix factorization
** before the Netflix prize, this matrix factorization had actually already been invented but nobody noticed and in the first year of the Netflix price, someone wrote this really famous blog post where they basically said “eh just use it” (2009 by BellKor’s Pragmatic Chaos team)
let’s create a neural net version of this
** A one embedding is exactly the same as doing a one hot encoding.
** An embedding is a matrix product
** the only reason it exists, it is because it is an optimization : it is a computational performance thing for a particular kind of matrix multiplier
** Our neural net will take in the entry a concatenation of the 2 embeddings vectores : this is an embedding net
** We start with 2 linear layers (then the first one is an hidden layer) and the second one has only one output as we want a single number (we use nn.Linear()). These layers are Fully Connected Layers.
** In the forward method, we grab the data (users and movies) and create the embeddings vectors, we concatenate theses vectors with torch.cat(), we add dropout, we add relu on activations of the layer 1 (F.relu), and activation function after the layer 2 (F.sigmoid())
** Then, we create our data model, our learn object and we fit this learn object with the MSE function (F.mse_loss)
** Point important : we do not need to get the same size of latent factors in the embeddings vetores of user and movie (for example, the embedding vector of the movies can have latent factors for genre and duration for example besides the n_factors shared with the user embedding vector)
Let’s use graddesc.xlsm to implement Gradient descent in excel
** errb1 : finding the derivative through fine diferencing
** derivative of the cost function is how the dependent variable (loss function) changes when the independant variable (intercept or slope) changes
** Jacobian and Hessian matrix
** Chain rule
** mini batch de size 1 = online gradient descent
** problem : it takes time and more, we can see that the error function goes down the same way : it means we can go faster. This is momentum
momemtum is a linear interpoletion between our derivative of the error function (small number) and the ones calculated before : keep doing the way we did before and upgrade a little bit
** everyone uses momentum
** More one point : in momemtum, the learning rate does not change
Adam
** We use SGD with momentum by default in the fastai library but we can now use Adam with weight decay in Fastai (Adam-W)
** Adam has 2 parts : one uses the momemtum of the gradient and the other part uses the momentum of the gradient square
** we use a lot the linear interpolation in DL papers
** if there is a lot of variance of the gradients, the number that divises the learning rate (the square root of the moving average of our squared gradient) will be high and than, the learning rate general is low
** ADAM is finally an adaptative learning rate (but there is only one learning rate)
L2 or weight decay
** when you have huge neural network, lots of parameters, more parameters than data points : then, regularization is important (like dropout)
** we take our loss function and add an aditional piece to that (square of the weights)
** the loss function wants to get the weights small
** if you have a huge weight decay, the gradient descent will keep your parameters to zero : it will never overfit
** if you then decrease the weight decay, some parameters will rise but the ones useless will stay to zero (proche de zero)
** when there are a lot of variation, we end up decreasing the amount of weight decay (and the oposite is true)
ADAMW
** penalize paremeters with weight very high unless their gradient varies a lot : but we do not want that
** so in ADAMW we do not mix weight decay with ADAM
** majority of models uses dropout and weight decay

ZachL · May 23, 2018, 3:38am

I am trying to understand the arithmetic behind the choice of weight initialization that Jeremy uses (around the 47 minute mark). He uses a uniform random variable on [0,.05]. What is the arithmetic to arrive at that number? It seems to me that, for example, the maximum of the interval should be weight s.t. max score = (weight^2)*num_factors since that is the dot product of the embedding matrices. What am I missing?

brobot · May 29, 2018, 10:25pm

I recreated the whole thing in Keras and found that embeddings are, in fact, meaningful (but their meaning is not immediately obvious). With 100-dim embeddings, here’s how kNN search works for me:

edit: this is from 1m dataset

KDT50 = sklearn.neighbors.KDTree(movie_emb.iloc[:, 3:-3])
def similar50(movieIds, n = 15):
  mov = movie_emb[movie_emb.movieId.isin(movieIds)]
  _, sims = KDT50.query(mov.iloc[:, 3:-3], k = n)
return sims

## find similar to Golden Eye
movie_emb.iloc[similar50([10], 20)[0], :4]

	movieId	title	genres	bias
      9	10	        GoldenEye (1995)	Action|Adventure|Thriller	0.221678
1273	1722	Tomorrow Never Dies (1997)	Action|Romance|Thriller	0.170194
2272	2990	Licence to Kill (1989)	Action	0.018498
2271	2989	For Your Eyes Only (1981)	Action	0.096108
2345	3082	World Is Not Enough, The (1999)	Action|Thriller	0.248690
2757	3639	Man with the Golden Gun, The (1974)	Action	0.219623
1774	2376	View to a Kill, A (1985)	Action	0.117994
1241	1672	Rainmaker, The (1997)	Drama	0.263498
1577	2115	Indiana Jones and the Temple of Doom (1984)	Action|Adventure	0.381587
1512	2045	Far Off Place, A (1993)	Adventure|Children's|Drama|Romance	0.076758
1059	1401	Ghosts of Mississippi (1996)	Drama	0.198065
1861	2470	Crocodile Dundee (1986)	Adventure|Comedy	0.257988
334	 408	        8 Seconds (1994)	Drama	0.056840
2607	3441	Red Dawn (1984)	Action|War	0.228135
1037	1375	Star Trek III: The Search for Spock (1984)	Action|Adventure|Sci-Fi	0.165070
2275	2993	Thunderball (1965)	Action	0.169414
2756	3638	Moonraker (1979)	Action|Romance|Sci-Fi	0.147252
2196	2889	Mystery, Alaska (1999)	Comedy	0.302645
2755	3635	Spy Who Loved Me, The (1977)	Action	0.193987
208	257	Just Cause (1995)	Mystery|Thriller	-0.069102

yurket · May 31, 2018, 9:20pm

While watching a lecture I spotted that the results of fitting models (fastai first model, dot product model, and mini net model) are overfitted. For instance: lesson5_q
As I understand, generally we want to keep train and validation losses close to each other. So my question is
why we let them differ here and at which situations can we let them differ, and by how much? Thanks!

MohammedSunasra · June 9, 2018, 3:05pm

What does those validation indexes suppose to mean in Collaborative filtering context? Does one index mean we will predict the rating for all the movies of that particular user?

In classification or regression cases for structured data set, a data point in validation set means we’ll predict a particular number for that data point. Now sure how that would work in this scenario.
Any help would be appreciated.

Thanks!

MohammedSunasra · June 10, 2018, 7:56am

I am also looking for something familiar. Did you have any luck on this?

tmcanty · June 12, 2018, 2:38am

Collaborative Filtering from Scratch: Runtime Error during fit. (line 41)
“RuntimeError: input and target shapes do not match: input [64], target [64 x 1] at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THCUNN/generic/MSECriterion.cu:15”

I have encountered a runtime error that I am so far not able to debug.
At this point I am just using the code as-is to try to understand this section.
I re-downloaded the notebook and I have updated the conda environment, but I still get the error.

Does anyone have any insights on this?

xtermz · June 12, 2018, 6:27am

I’m getting the same error. If I had to guess, it has to do with the dimensionality of the target (y) and the input (predictions) vectors.

y.shape prints (100004,), so I think somewhere where the batch is being created the target is being chunked into a [64 x 1] vector instead of a [64] vector

tmcanty · June 13, 2018, 2:10am

Haven’t found anything yet, trying to map it out to understand and troubleshoot… Lots of learning to do along the way so its slow going.

Tom

bellomatic · June 13, 2018, 12:55pm

I’m stuck on this too… I guess it must be a new bug as it seems that its only people trying this recently who have this problem.

tmcanty · June 14, 2018, 2:05am

Figured it out!

The EmbeddingDot module method forward returns a 1-dim Tensor [ batchsize] instead of a 2d [batchsize , 1]
return (u*m).sum(1) #with a length of batch size
you can convert this to a 2 dim tensor by updating the EmbeddingDot Class method code

original : return (u*m).sum(1)

new : out1 = (u*m).sum(1)
return out1.view(len(out1),1)

view is a pytorch tensor method to allowing to rearrange the shape of the tensor.
Use the length of the original tensor as dim0 and 1 as the dim1

xtermz · June 14, 2018, 2:44am

@tmcanty Nice! Looks like we reached the same conclusion

class EmbeddingDot(nn.Module):
    def __init__(self, n_users, n_movies):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)
        self.u.weight.data.uniform_(0,0.05)
        self.m.weight.data.uniform_(0,0.05)
        
    def forward(self, cats, conts):
        users,movies = cats[:,0],cats[:,1]
        u,m = self.u(users),self.m(movies)
        ret = (u*m).sum(1)
        return ret.view(ret.size()[0],1)

xtermz · June 14, 2018, 4:29am

Note you’ll need the same fix for subsequent class definitions:

class EmbeddingDotBias(nn.Module):
    def __init__(self, n_users, n_movies):
        super().__init__()
        (self.u, self.m, self.ub, self.mb) = [get_emb(*o) for o in [
            (n_users, n_factors), (n_movies, n_factors), (n_users,1), (n_movies,1)
        ]]
        
    def forward(self, cats, conts):
        users,movies = cats[:,0],cats[:,1]
        um = (self.u(users)* self.m(movies)).sum(1)
        res = um + self.ub(users).squeeze() + self.mb(movies).squeeze()
        res = F.sigmoid(res) * (max_rating-min_rating) + min_rating
        return res.view(res.size()[0],1)

seungwooson94 · June 14, 2018, 6:14pm

I have a question about using more than 2 features (UserID, MovieID) but also like Movie Genres, etc. Jeremy does mention a hint about it towards the end that we can additionally concatenate latent vectors of movie genres and other features in addition to those of userID, movieID. How do we create those additional latent vectors when collab filtering allows us to crosstab userID against the movieID? If anyone tried this, could you share code please?

seungwooson94 · June 14, 2018, 6:21pm

Hey could you share your code on how you incorporated an embedding for movie genres??

abhishekmishra · June 16, 2018, 2:32pm

Thanks.This helped me to run the notebook.
Alternatively we can use res.view(-1,1) as well.

andrey · June 17, 2018, 7:07pm

Hello. Could you tell if there’s a mistake in a lecture here: https://youtu.be/J99NV9Cr75I?t=1h23m4s

I’m a little confused when Jeremy says that we’re going to multiply our concatenation of user and movie lookups by a matrix that has nu + mu number of rows. My understanding is that it should be nf + mf instead, no? Because when we take extracts from U and M each has a length of nf and mf accordingly and after concatenating them we get nf + mf.

Thanks,
Andrey

sayko · June 18, 2018, 4:52am

yes, I think you are correct. Here is the same diagram re-draw by @hiromi

rshamsy · June 18, 2018, 4:58pm

Just realized this got answered in the previous post. Going to delete.