Get predictions for test data via collab

IRailean · January 21, 2020, 3:52pm

Hello!

I want to build a small movie recommender system. The model works fine, but then I came up with a question: how do I predict with this model? My goal is to recommend the top 10 movies for a given user (a user already present in the dataset with 5-10 movie reviews).

I was not able to call get_preds on the test data using CollabDataBunch:

data_collab = CollabDataBunch.from_df(data_reduced,test=test_data, seed=42, valid_pct=0.2, user_name='user_id', item_name='movie_id', rating_name='rating')

… and then

learn.get_preds(DatasetType.Test)

gives me an error.
Clearly I'm doing something wrong. How should I properly define the test dataset and get predictions for it using a collab learner?

It is not feasible to retrain the model every time I want to make predictions. I was able to solve the problem by calling learn.predict() for each user_id - movie_id pair, but this is very slow and does not take advantage of any kind of parallelism.
How can I make fast predictions on a large amount of data?

Thank you!

mschmit5 · January 21, 2020, 7:10pm

What error are you getting?

IRailean · January 21, 2020, 7:37pm

Hi!
It is working now. I just reconnected to Colab and it turned out to be fine.

I've come up with the following solution:
Train your model on your training data.
Save it.
When you want to get predictions, define the data bunch this way:

data_collab = CollabDataBunch.from_df(data_reduced, test=test_data, seed=42, valid_pct=0.2, user_name='user_id', item_name='movie_id', rating_name='rating')

This specifies the test dataset.
Then define your learner:

learn = collab_learner(data_collab, n_factors=40, y_range=(1, 10), wd=1e-2)

Then load your saved trained model (learner)

learn_loaded = learn.load(Path(path/'trained_model'))

And then you can get predictions using

preds, y = learn_loaded.get_preds(DatasetType.Test)

As I understand it, we can use a different test dataset when defining the CollabDataBunch, and it will not cause any problem when loading the previously trained model (which seems logical, as the model itself is not changed).

Now it is working fine and fast.
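
For the original goal (top 10 movies for one user), the test DataFrame can hold one row per movie that the user has not rated yet, and the batch predictions can then be sorted. Below is a minimal sketch of that idea, not the notebook's actual code. The helper names `candidate_pairs` and `top_n`, the toy ratings, and the stand-in predictions are all assumptions made for illustration, and it assumes the predictions come back in the same row order as the test DataFrame.

```python
import numpy as np
import pandas as pd


def candidate_pairs(ratings, user_id, user_col="user_id", item_col="movie_id"):
    """Return one (user, movie) row for every movie the user has not rated."""
    rated = ratings.loc[ratings[user_col] == user_id, item_col]
    all_items = ratings[item_col].drop_duplicates()
    unseen = all_items[~all_items.isin(rated)]
    return pd.DataFrame({user_col: user_id, item_col: unseen.to_numpy()})


def top_n(candidates, preds, n=10, item_col="movie_id"):
    """Attach predicted ratings to the candidate rows and keep the n highest."""
    scored = candidates.assign(predicted=np.asarray(preds, dtype=float))
    best = scored.nlargest(n, "predicted")
    return best[[item_col, "predicted"]].reset_index(drop=True)


# Toy ratings table (made-up numbers, for illustration only).
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "movie_id": [10, 20, 20, 30, 40],
    "rating": [8, 6, 9, 7, 5],
})

test_data = candidate_pairs(ratings, user_id=1)

# With fastai v1, the calls from this thread would go here, roughly:
#   data_collab = CollabDataBunch.from_df(ratings, test=test_data, ...)
#   preds, _ = learn_loaded.get_preds(DatasetType.Test)
# Stand-in predictions so the sketch runs without a trained model:
stand_in_preds = [7.5, 9.1]

recommendations = top_n(test_data, stand_in_preds, n=10)
print(recommendations)
```

Because all candidate pairs go through a single get_preds call, this avoids the per-pair learn.predict() loop.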

jaganlal · January 22, 2020, 6:43pm

Hi @IRailean, I have the same requirement, and I'm fairly new to the AI/ML domain. Can you please tell me how you got it done?
By the way, I followed the code from https://jovian.ml/aakashns/movielens-fastai/v/14 for the movie recommendations. Now I'm trying to predict the top 10 movies for a particular user.
I've been struggling for 2 days to get this done; any help will be highly appreciated.

Thanks

IRailean · January 22, 2020, 7:20pm

Hi @jaganlal!
Here is my notebook on GitHub.

github.com

IRailean/RecommenderSystem/blob/master/MovieTweetingRecommender.ipynb


Note: a user for whom you want to make predictions must be present in the dataset before training. Otherwise the model will not have an embedding vector for that user (the same applies to movies).
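
One way to act on this note is to check the test rows against the training data before asking for predictions. The sketch below is only an illustration with pandas; the helper name `split_known_unknown` and the toy frames are hypothetical, and the column names follow the ones used earlier in the thread.

```python
import pandas as pd


def split_known_unknown(train, test, user_col="user_id", item_col="movie_id"):
    """Split test rows into those whose user AND movie both appear in train,
    and those that would need an embedding the model never learned."""
    known = test[user_col].isin(train[user_col]) & test[item_col].isin(train[item_col])
    return test[known], test[~known]


# Toy data (made up for illustration).
train = pd.DataFrame({"user_id": [1, 1, 2], "movie_id": [10, 20, 10]})
test = pd.DataFrame({"user_id": [1, 2, 3], "movie_id": [20, 30, 10]})

scorable, cold_start = split_known_unknown(train, test)
print(len(scorable), "rows can be scored;", len(cold_start), "rows involve an unseen user or movie")
```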

I have also written an article on Medium to explain what I've done.

jaganlal · January 22, 2020, 7:26pm

Thanks a Lot @IRailean, my hearty thanks to you. You saved my day. I’ll take a look at the code and try to see how the predictions are. Once again thank you very much.

IRailean · January 22, 2020, 7:39pm

My pleasure, @jaganlal. Please note that there are 2 notebooks in the GitHub repo: one for data preparation and another for modeling and predictions.

jaganlal · January 22, 2020, 8:41pm

Hi @IRailean, I would like to send you my code for some clarifications. Can you please email this ID - tsjaganlal@yahoo.com

Thanks

IRailean · January 23, 2020, 6:49am

@jaganlal You can attach a link to your code right here with your questions, so I can take a look at it.

jaganlal · January 27, 2020, 12:31am

Hi @IRailean, how do I test the model with the test data? We supply valid_pct=0.2 (20% of the data as test data) to CollabDataBunch.from_df?
How do I grab a user from that 20% test data and test it against the trained model?

Thanks in Advance,
Jagan

jaganlal · January 27, 2020, 6:00am

@IRailean - here is my very basic draft on NCF (inspired by https://jovian.ml/aakashns/movielens-fastai/v/14)
NCFMovie100k/NCFMovie100K.ipynb at master · jaganlal/NCFMovie100k · GitHub

In my source code, I'm creating test_data from the existing data (rating_movie). Is this the correct way to test my model?

data_collab = CollabDataBunch.from_df(rating_movie, test=test_data, seed=42, valid_pct=0.2)

If so, what is the significance of valid_pct=0.2?

IRailean · January 27, 2020, 6:52am

You have too high ratings for several movies.

learn1 = collab_learner(data_collab, n_factors=40, y_range=(0, 10), wd=1e-2)

y_range specifies the range of your scores. As far as I can see, you use 0-5 ratings, so change this to y_range=(0, 5).
Also, do not forget to load your model before making predictions.
In your case, something like this:

learn1 = learn.load('trained_model')

valid_pct=0.2 specifies how much of your initial data will be chosen as validation data. It means that 20% of the rating_movie dataset is used as validation data.
Please, check out this video by Andrew Ng about train/dev/test distribution.
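
To make the y_range remark above concrete: as far as I know, fastai v1's collab model squashes its raw score with a sigmoid and rescales it into y_range, which is why y_range=(0, 10) can produce predictions above 5 on a 0-5 rating scale. The snippet below is a plain NumPy sketch of that scaling under this assumption, not fastai's code; the function name `scale_to_range` and the raw scores are made up.

```python
import numpy as np


def scale_to_range(raw, y_range):
    """Squash raw scores with a sigmoid, then stretch them to (low, high)."""
    low, high = y_range
    sigmoid = 1.0 / (1.0 + np.exp(-np.asarray(raw, dtype=float)))
    return sigmoid * (high - low) + low


raw_scores = np.array([-4.0, 0.0, 4.0])  # hypothetical raw model outputs

wide = scale_to_range(raw_scores, (0, 10))   # can exceed 5
narrow = scale_to_range(raw_scores, (0, 5))  # always stays below 5
print(wide)
print(narrow)
```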

jaganlal · January 27, 2020, 2:26pm

How do I get that 20% of the data and test it? (I don't know whether this question makes sense or not.) In the past I have seen other neural network tutorials where they train the NN with 80% of the data and use the remaining 20% to test and validate the model. Similarly, is there a way to validate the model on that 20% test data?

Sorry, if this question isn't relevant or doesn't make sense, please ignore it.

Thanks

IRailean · January 27, 2020, 4:59pm

You may not know from the start which architecture or hyperparameters will be the best choice for your NN. Therefore you often want to separate your data into 3 categories: train/val/test data.

Training data/Validation data.
The NN uses the training data to learn. Then you validate your NN on the validation data.
If the values of your metrics (accuracy, MSE, RMSE, etc.) are still not satisfactory, you may change the hyperparameters or alter the architecture of the NN.

Test data
Once you have found the best hyperparameters and architecture using the train/val data, you evaluate your model on the test data. It is used as an unbiased evaluation of the final model.

When creating a databunch in fastai, you just give it your data and the valid_pct coefficient, which tells it how much of this data will be used as the validation set. It is up to fastai which entries of your data will be used for validation (I believe it splits the data randomly).

data_collab = CollabDataBunch.from_df(rating_movie, test=test_data, seed=42, valid_pct=0.2)

In this line, 80% of rating_movie will be used as the training set, 20% as the validation set, and test_data as the test set.
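
To illustrate what a seeded random 80/20 split looks like, here is a small NumPy sketch. It only demonstrates the idea; it does not reproduce which rows fastai actually picks, and the function name `random_split` is hypothetical.

```python
import numpy as np


def random_split(n_rows, valid_pct=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off valid_pct for validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_valid = int(valid_pct * n_rows)
    valid_idx = np.sort(shuffled[:n_valid])
    train_idx = np.sort(shuffled[n_valid:])
    return train_idx, valid_idx


train_idx, valid_idx = random_split(1000)
print(len(train_idx), "training rows,", len(valid_idx), "validation rows")
```

Fixing the seed makes the split reproducible, so repeated runs compare models on the same validation rows.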

jaganlal · January 27, 2020, 5:05pm

Now it makes sense and I'm able to connect the dots. Thanks @IRailean for clearing my doubts. You mentioned the data is split into a training set and a validation set. How do I validate the model with the validation data? Is there a way to extract the validation set?

IRailean · January 27, 2020, 5:13pm

fastai does this for you. Just specify how much data you want to use as validation data using valid_pct.
You can take a look at validation data in the following way:

data.valid_ds[0]

This will give you the first entry of your validation data.
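
If you want a single validation number yourself, one option is to compute RMSE from predictions and targets. The sketch below assumes you already have two arrays of equal length, for example the pair returned by learn.get_preds(DatasetType.Valid) converted to NumPy; the numbers here are made up so the snippet runs on its own.

```python
import numpy as np


def rmse(preds, targets):
    """Root mean squared error between predicted and true ratings."""
    preds = np.asarray(preds, dtype=float).ravel()
    targets = np.asarray(targets, dtype=float).ravel()
    return float(np.sqrt(np.mean((preds - targets) ** 2)))


# Made-up predictions and true ratings, for illustration only.
example_preds = [3.5, 4.0, 2.0]
example_targets = [4.0, 4.0, 1.0]
print(rmse(example_preds, example_targets))
```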

jaganlal · January 28, 2020, 4:04am

@IRailean - I updated my code to find the top 10 similar movies (I chose the Toy Story movie, expecting other animated kids' movies in my recommendation list).

github.com

jaganlal/NCFMovie100k/blob/master/NCFMovie100K.ipynb


Please let me know whether what I did is correct, and whether there are other ways to improve it.

Thanks

IRailean · January 28, 2020, 7:04am

Hi @jaganlal!
movieId should correspond to only one movie name.

You have built a test dataset that contains only 1 movieId and all users. From this dataset, you will predict how each user would rate this movie (313).

Regarding your question:
You want to find similar movies. Each movie has an embedding vector that represents it, so for a given movie you want the top 10 similar movies, where similar means having the nearest embedding vector.

For this, you can retrieve the weights and biases for each movie, calculate the distance between your movie and every other movie in the dataset, and then sort them.
Here is an explanation of how to get the bias and weights for a given movie.
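
The distance-and-sort step can be sketched with NumPy as follows. This is one possible approach under stated assumptions: it uses cosine similarity as the measure of nearness (Euclidean distance would also work), the function name `most_similar` is hypothetical, and the toy vectors stand in for real movie embeddings.

```python
import numpy as np


def most_similar(embeddings, query_index, k=10):
    """Return (row index, cosine similarity) for the k rows closest to the query row."""
    emb = np.asarray(embeddings, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # normalise each row
    sims = unit @ unit[query_index]                          # cosine similarity to the query
    order = [int(i) for i in np.argsort(-sims) if i != query_index][:k]
    return [(i, float(sims[i])) for i in order]


# Toy 2-d "movie embeddings" (made up). In fastai v1 the real rows would come
# from the learner's item weights, e.g. via learn.weight(..., is_item=True).
movie_vectors = np.array([
    [1.0, 0.0],    # movie 0: the query
    [0.9, 0.1],    # movie 1: points almost the same way as movie 0
    [0.0, 1.0],    # movie 2: orthogonal to movie 0
    [-1.0, 0.0],   # movie 3: opposite to movie 0
])

neighbours = most_similar(movie_vectors, query_index=0, k=2)
print(neighbours)
```

The returned row indices still need to be mapped back to movie IDs or titles using whatever ordering the embeddings were extracted in.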

jaganlal · January 28, 2020, 2:55pm

Sorry, my bad for using movieId wrongly.
I made some changes to my code - https://github.com/jaganlal/NCFMovie100k/blob/master/NCFMovie100K.ipynb - to extract the bias and weights for a movieId. I'm trying to find the top 10 recommendations for a movie given its ID. Is there any direct way to supply the movieId and get the bias and weights for that movie (I mean the top recommendations)?

IRailean · January 28, 2020, 3:24pm

To be honest, I do not understand what you mean by “Trying to find top 10 recommendations for a movie given its id”. Do you want to find the 10 users that would rate this movie highest, or the top 10 movies similar to a given one?