Lesson 4/5 Movielens question - Dealing with new users/movies/rankings?

I’m playing around with the movielens data and the model that we create in lesson 4/5 and trying to transform it into a working system. I have a number of questions…

At 32:00 in the Lesson 5 video, @jeremy says that to create recommendations for an existing user we can iterate through the movies they haven’t watched and predict their rating for one movie at a time. I’m wondering whether it would make more sense to multiply the user’s latent factors by all of the movie factors in a single matrix-vector product. For the NN implementation, is there a similar approach, or do we have to iterate because the input to the model is user-movie pairs?
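To make the "single product" idea concrete, here is a minimal NumPy sketch. It assumes a plain dot-product-plus-bias model; the arrays are random stand-ins for trained parameters, and `score_all_movies` and `recommend` are hypothetical helper names, not part of any library. Any monotonic squashing of the score into the rating range is left out, since it would not change the ordering.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 5, 8, 3

# Random stand-ins for trained parameters (assumption: dot product + bias model).
user_factors = rng.normal(size=(n_users, n_factors))
movie_factors = rng.normal(size=(n_movies, n_factors))
user_bias = rng.normal(size=n_users)
movie_bias = rng.normal(size=n_movies)

def score_all_movies(user_id):
    """Score every movie for one user with a single matrix-vector product."""
    return movie_factors @ user_factors[user_id] + movie_bias + user_bias[user_id]

def recommend(user_id, seen, k=3):
    """Return the indices of the k best-scoring movies the user has not seen."""
    scores = score_all_movies(user_id)
    scores[list(seen)] = -np.inf  # mask out movies already watched
    return np.argsort(-scores)[:k]

top = recommend(0, seen={1, 4})
```

For a network that takes user-movie pairs, the analogous move would presumably be to build one batch that pairs the user id with every candidate movie id and run a single forward pass, rather than looping one pair at a time.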

I’m also wondering about the practicalities of keeping the system up to date and initializing new users and new movies. We can’t take the inverse of the dot product to get the user’s latent factors because the problem is intractable, but with an NN is there a way to get new user vectors? Can we take the mean latent factors, or is that meaningless? I know we can somewhat bypass this by just looking at the movie bias for the first few movies and recommending popular movies, and I’m guessing that’s what most recommender systems do.

What about new movies? How do you recommend a movie that no one has watched? Or do you just refrain from recommending it until it has built up a number of ratings?

Finally, and probably most importantly, when do we update the model? If a single user rates a single movie, we just want to update that user’s and that movie’s latent factors. Is there a way to do this without retraining the whole model? Or do you simply wait for a number of updates and then retrain? I feel like this could dramatically affect the new user experience, as they’re starting from a sparse matrix and rating movies should have a big impact on their vector.

I can’t see retraining the model after every user action, so there must be some other way to update the user and movie latent factors independently, but I can’t think of how to do that. Has anyone thought this through, and how did you solve this issue?

So it sounds like the solution is to do exactly what I talked about, which is great.

What about the model updating? When a single user rates a single movie are we updating the whole model or is there a way to just update that user and movie? Do we just feed that single change in and do a micro batch of that single update as our epoch with a really low learning rate?
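One way to picture the "micro batch of a single update" is a single SGD step that touches only the two embedding rows involved. This is a sketch under assumptions: a bias-free dot-product model with squared error, random stand-in factors, and a hypothetical `sgd_update` helper. It is not how the course model is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for trained factors (4 users, 6 movies, 3 factors each).
user_factors = rng.normal(scale=0.1, size=(4, 3))
movie_factors = rng.normal(scale=0.1, size=(6, 3))

def sgd_update(user_id, movie_id, rating, lr=0.01):
    """One SGD step on 0.5 * (u.m - rating)**2, changing only two rows."""
    u = user_factors[user_id].copy()
    m = movie_factors[movie_id].copy()
    err = u @ m - rating
    user_factors[user_id] -= lr * err * m    # gradient with respect to u
    movie_factors[movie_id] -= lr * err * u  # gradient with respect to m
    return err

users_before = user_factors.copy()
err_before = sgd_update(1, 2, rating=4.0)
err_after = user_factors[1] @ movie_factors[2] - 4.0
```

Every other user and movie row is left untouched, which is the property asked about here. Whether repeated single-example steps stay stable over time is a separate question.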

And if not is the system unresponsive to user updates?

I’m curious about how you solved this if you’ve figured it out?

I’d be inclined to wait for a few ratings to come in, and put them through as a batch.

I wonder if you can bridge the time between re-fitting the model by using a similarity based lookup.

Like if you re-train the model every hour then if a new user comes in just after you last trained the model, maybe you can just compute the cosine similarity between that user and every other user (fast if you use a sparse implementation) - and then just use the factors for the looked-up user in place of the new user.
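A rough sketch of that lookup, assuming a small dense ratings matrix for readability (a real system would likely use a sparse one); the data, the factor values and the `borrow_factors` helper are all made up for illustration.

```python
import numpy as np

# Rows are existing users, columns are movies, 0 means "not rated" (toy data).
ratings = np.array([
    [5, 4, 0, 0, 1],
    [0, 0, 5, 4, 0],
    [2, 5, 0, 3, 0],
], dtype=float)

rng = np.random.default_rng(0)
user_factors = rng.normal(size=(3, 2))  # stand-in for trained user factors

def borrow_factors(new_user_ratings):
    """Find the most cosine-similar existing user and reuse their factors."""
    norms = np.linalg.norm(ratings, axis=1) * np.linalg.norm(new_user_ratings)
    sims = ratings @ new_user_ratings / np.maximum(norms, 1e-12)
    nearest = int(np.argmax(sims))
    return nearest, user_factors[nearest].copy()

new_user = np.array([5, 5, 0, 0, 0], dtype=float)
nearest, factors = borrow_factors(new_user)
```

The borrowed vector is only a placeholder until the next re-fit gives the new user factors of their own.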

I suppose you’d want to avoid assuming the new user is too much like the existing user when you have many more reviews for the existing user. Perhaps you could incorporate the timestamps and store not just a user’s full set of reviews but also their first review. That duplicates the data and increases the size of the computation, but you’d get a direct lookup of a single user to use for new users without needing to recompute the model in between…

Seems like it is the recommended approach if you have data about the movies and/or the users.

“In recommender systems, the cold start problem is often reduced by adopting a hybrid approach between content-based matching and collaborative filtering. New items (which have not yet received any ratings from the community) would be assigned a rating automatically, based on the ratings assigned by the community to other similar items. Item similarity would be determined according to the items’ content-based characteristics.[1]”

Perhaps we could also do a ‘matching’ between the new movie’s known features and those of movies already rated: find the most similar rated movie, and then assign its latent factor vector to the new movie.
Once the new movie reaches a threshold number of ratings, we could include it and re-train the model.
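As a sketch of that matching step, assuming the known features are multi-hot genre flags and using cosine similarity as the (swapped-in) notion of "most similar". The feature table, the factor values, `MIN_RATINGS` and both helper names are hypothetical.

```python
import numpy as np

# Hypothetical genre flags per rated movie: action, comedy, drama, sci-fi.
rated_features = np.array([
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
], dtype=float)
# Stand-in latent factors for the same three rated movies.
rated_factors = np.array([[0.9, -0.2], [-0.5, 0.7], [-0.1, 0.4]])

MIN_RATINGS = 20  # hypothetical threshold before the movie joins re-training

def cold_start_factors(new_features):
    """Copy the factors of the rated movie whose features are most similar."""
    norms = np.linalg.norm(rated_features, axis=1) * np.linalg.norm(new_features)
    sims = rated_features @ new_features / np.maximum(norms, 1e-12)
    return rated_factors[int(np.argmax(sims))].copy()

def ready_for_retraining(n_ratings):
    """True once the new movie has enough ratings to be trained normally."""
    return n_ratings >= MIN_RATINGS

new_movie = np.array([1, 0, 1, 1], dtype=float)  # action, drama, sci-fi
borrowed = cold_start_factors(new_movie)
```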

Similar approach for new users, although it might be more difficult to get known features for them.

@Even Just wondering if you have figured out an efficient way to re-train when a new set of data arrives?
Generalizing beyond recommendation, what is your approach to keeping a model updated when new data arrives daily? Do you append the new data to the existing dataset each time and retrain on the whole dataset, or just fit on the incremental data?

Funny you should ask this; I’ve just started a new role and I’m working on applying some of the deep learning paradigms to recommender systems and learning to rank. I’m still figuring this all out, but I think I have a handle on a few ways to do it.

First, in terms of retraining, if you’re smart about the way you choose to organize your data and calculate your embeddings, then you aren’t actually updating the whole model, just the embeddings that are affected. This is a common optimization anyway: you train the network with both positive and negative examples to accelerate learning, rather than providing only the one positive and implicitly assuming everything else is a negative. So you give it examples of the form “x was clicked with a, b, c” and others of the form “y wasn’t clicked with a, b, c”. With this negative sampling you reduce the training set because you’re being explicit about the non-matches.
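Here is a small sketch of what building such a training set might look like, assuming implicit feedback stored as a dict of clicked item ids per user. The data and the `sample_training_pairs` helper are made up for illustration, and sampling negatives uniformly is just one of several possible choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 10
# Hypothetical click history: user id -> set of item ids they clicked.
positives = {0: {1, 3}, 1: {2}}

def sample_training_pairs(positives, n_items, n_neg=2):
    """Emit (user, item, label) rows: each positive plus n_neg sampled negatives."""
    rows = []
    for user, items in positives.items():
        # Only items the user never clicked can serve as negatives.
        candidates = np.array([i for i in range(n_items) if i not in items])
        for item in sorted(items):
            rows.append((user, item, 1))
            for neg in rng.choice(candidates, size=n_neg, replace=False):
                rows.append((user, int(neg), 0))
    return rows

pairs = sample_training_pairs(positives, n_items)
```

Only the users and items that appear in `pairs` would receive gradient in a training step, which is why the update can stay local to the affected embeddings.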

This is commonly done to optimize word2vec, which is how I came to understand it. Extending that concept to online training isn’t commonplace, but http://rutumulkar.com/blog/2015/word2vec shows how it is done, and you can do similar things with recommendation. There are criticisms of the method, including that it isn’t necessarily stable long term, and I think periodically retraining offline on all of the data to renormalize is commonplace. There’s an open source gensim implementation of online training linked in the article above.

The other thing you can do if you have metadata is use that metadata to project into the embedding space. The embeddings are built from user behaviour, but if that behaviour is meaningful, the data will generally cluster by metadata, so you can use the metadata to figure out approximately where new items fit. As before, you still need to do offline training once you’ve got enough data to include the new item in the model.
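One simple way to read "project into the embedding space" is a linear map fitted by least squares from metadata features to the trained item embeddings. That is an assumption on my part, not necessarily what is used in practice, and the data below is synthetic so the example runs on its own; `project` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_meta, n_factors = 50, 4, 3

# Synthetic stand-ins: metadata features and "trained" embeddings that
# depend roughly linearly on them (an assumption made for illustration).
metadata = rng.normal(size=(n_items, n_meta))
true_map = rng.normal(size=(n_meta, n_factors))
item_factors = metadata @ true_map + 0.01 * rng.normal(size=(n_items, n_factors))

# Fit W to minimise ||metadata @ W - item_factors|| over the known items.
W, *_ = np.linalg.lstsq(metadata, item_factors, rcond=None)

def project(meta_row):
    """Approximate embedding for an item that only has metadata."""
    return meta_row @ W

new_item_meta = rng.normal(size=n_meta)
approx_factors = project(new_item_meta)
```

If the relationship is not close to linear, a small network trained on the same (metadata, embedding) pairs would play the same role.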

I’m curious to hear others’ thoughts and ideas. I’ve really been diving into the intersection between recommendation and deep learning this past month and would love to have people to bounce ideas off of.

Even

In my research I ran across this talk and the corresponding paper which are a great example of a deep learning based recommender system at scale. Hopefully someone’s interested in the topic as well; I’d love to have some people to talk with about this.

I did have some questions about the architecture they use for candidate generation (Figure 3). What confuses me are the video vectors. Based on my understanding, the videos are one-hot encoded and interpreted through the softmax as class probabilities. Are they just using the one-hot encodings and the softmax weights as the video vectors, essentially creating an inverse embedding? It’s an interesting idea and I haven’t seen it before. Their way of generating a user vector is pretty novel too.
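Here is one possible reading of that setup as a sketch, assuming the user vector is the output of the last hidden layer and each row of the softmax weight matrix acts as one video’s vector. The shapes and values are invented and this is not taken from the paper’s code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_videos, hidden = 6, 4

# Assumption: one softmax weight row per video, same width as the hidden layer.
softmax_weights = rng.normal(size=(n_videos, hidden))
user_vector = rng.normal(size=hidden)  # stand-in for the last hidden layer output

# Softmax over videos: the logits are dot products of the user and video vectors.
logits = softmax_weights @ user_vector
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Softmax is monotonic in the logits, so ranking by probability equals
# ranking by raw dot product, which is what makes a nearest-neighbour
# style search over the video vectors possible at serving time.
top_by_prob = np.argsort(-probs)[:3]
top_by_dot = np.argsort(-logits)[:3]
```

On this reading the user vector has the width of the last hidden layer, not the number of videos, and the softmax weights double as the video embeddings.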

Hey @Even, very nice topic, and I feel like this should be covered a bit more by fast.ai.
Even though this collaborative filtering approach is really nice and the principle is really cool, I found a couple of major drawbacks when putting it into production.

  1. Retraining/fine-tuning for new movies and new users, as you mentioned. This is far from trivial, and you end up with a hybrid solution, as you mentioned.
  2. Scaling. Going back to your example with users clicking or not clicking, assume you have a very high-traffic website with, let’s say, 500 million users. This matrix will never fit in memory. I am not saying there isn’t a solution, but I don’t think the class really covers it, which is sad since it is the most “real-life” application.
  3. Sparsity. Going back to your example, assume that there are several million articles but a user only clicks on 1 to 3 of them. Is collaborative filtering a good solution there?

I would be very curious to know what you did or are doing to resolve these issues. I faced them and ended up building an autoencoder on the user x movie matrix. This gave me fast training as well as a way to solve 1 and 2, with 3 still being an issue.
I would be very curious to hear what @jeremy thinks about it. I have watched last year’s and this year’s class hoping to get an answer, especially since in lesson 4 you mention that “Our goal will be for some user-movie combination we have not seen before, we have to predict a rating”, but sadly I feel the class does not fully solve it.

Hey @zipp, thanks for resurrecting this thread. :slight_smile: I’m working full time in the field and if you’re interested I could point you towards a number of good resources for deep learning based recommenders. I’ve been meaning to compile my studies over the past year and I’ll try to do so in the next few months, but it’s nice to have another interested soul in the topic.

If you want a jumping off point I’d recommend googling session based recommender systems on youtube. Those begin to explore how to do in session recommendation based on RNNs and look a lot like language models.

Thanks @Even. Yeah, please do; I don’t think it is just me who would be interested. The YouTube paper that does session-based recommendation is interesting, and there is also the spotlight library that implements it. A hybrid approach with sparse tensors can also help.
Anyway, I’d definitely be interested in some resources, since the more the better :).

As an aside, are you able to best the score of 0.76 attained by Jeremy’s CollabFilterDataset.from_csv method, which uses the EmbeddingDotBias model?

I tried many versions of the Rossmann models on it, but surprisingly none of them could best that simple model. I can’t understand why.

?

Hey!

Glad I found your comment! I came across the same article and was scratching my head too about the nearest neighbor search bit (naturally popped over to fastai forum)

Have you thought more about it? IIUC, they are using the last layer’s output as the user vector (should it be the same size as the number of videos?), but I’m not sure how to combine it with the softmax weights.

Best
A