You have Learner.predict to make predictions on one item. If you want to make predictions on a large set, put it in the test set of your DataBunch and use learn.get_preds(ds_type=DatasetType.Test).
Hm – how would that work when I train the model first, and then want to evaluate it many times with changing data?
Also, I could not find out how to actually load a model saved with learner.save without first recreating the learner (using the same training data, I gathered). Other learners have a static load method on the class, but this one does not seem to offer one…
It’s all detailed in this tutorial: you basically call learn.export(), then load your empty learner with learn = load_learner(path) (note that this is on master for now, future v1.0.40).
As for predicting on:
- one item: use learn.predict(a_row_of_ratings) (for instance ratings.iloc[0], but adapt to your test data)
- a lot of data at the same time: load your learner with a test set, which will allow you to call learn.get_preds(ds_type=DatasetType.Test), with something like:
learn = load_learner(path, test=CollabList.from_df(test_ratings, cat_names=['userId', 'movieId'], path=path))
As for your other question about getting the validation indexes, you can choose the ones you want if you use the data block API (look at the docs and then the source code of CollabDataBunch for an example).
Thanks – the new load_learner function works great.
For scoring, however, I see a huge performance difference between just using learner.model.forward(u, m) and learner.get_preds(ds_type=DatasetType.Test): the former takes under 2 seconds for 1,423,498 predictions, while the latter takes more than 45 seconds. Below are the two functions for reference. This is on a GPU machine – could it be that learner.get_preds doesn’t use the GPU while model.forward does?
Also, reloading the model for each scoring request doesn’t fit my workflow very well – maybe we could add a method like score_direct below to the CollabLearner?
def score_direct(learner, test_df, user_col, item_col, prediction_col, top_k=0):
    """Score all users+movies provided and reduce to top_k items per user if top_k > 0."""
    # replace values not known to the model with #na#
    total_users, total_items = learner.data.classes.values()
    test_df.loc[~test_df[user_col].isin(total_users), user_col] = total_users[0]
    test_df.loc[~test_df[item_col].isin(total_items), item_col] = total_items[0]
    # map raw ids to embedding indexes
    u = learner.get_idx(test_df[user_col], is_item=False)
    m = learner.get_idx(test_df[item_col], is_item=True)
    # score the pytorch model directly
    pred = learner.model.forward(u, m)
    scores = pd.DataFrame({user_col: test_df[user_col],
                           item_col: test_df[item_col],
                           prediction_col: pred})
    scores = scores.sort_values([user_col, prediction_col], ascending=[True, False])
    if top_k > 0:
        top_scores = scores.groupby(user_col).head(top_k).reset_index(drop=True)
    else:
        top_scores = scores
    return top_scores
def score(path, fname, test_df, user_col, item_col, prediction_col, top_k=0):
    """Score all users+movies provided and reduce to top_k items per user if top_k > 0."""
    learner = load_learner(path=path, fname=fname,
                           test=CollabList.from_df(test_df, cat_names=[user_col, item_col]))
    preds = learner.get_preds(ds_type=DatasetType.Test)
    scores = pd.DataFrame({user_col: test_df[user_col],
                           item_col: test_df[item_col],
                           prediction_col: np.array(preds[0])})
    scores = scores.sort_values([user_col, prediction_col], ascending=[True, False])
    if top_k > 0:
        top_scores = scores.groupby(user_col).head(top_k).reset_index(drop=True)
    else:
        top_scores = scores
    return top_scores
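The speed difference comes down to doing one vectorized pass over the embeddings instead of batching through a data loader. The core of score_direct can be illustrated standalone with numpy and pandas (toy embedding matrices and hypothetical column names, not the fastai API):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_users, n_items, emb_dim = 4, 6, 3

# stand-ins for the learned user/item embedding weights
user_emb = rng.normal(size=(n_users, emb_dim))
item_emb = rng.normal(size=(n_items, emb_dim))

# every user/item pair to score, as flat index arrays
u, m = np.meshgrid(np.arange(n_users), np.arange(n_items), indexing='ij')
u, m = u.ravel(), m.ravel()

# one vectorized "forward" pass: dot product of the paired embeddings
pred = (user_emb[u] * item_emb[m]).sum(axis=1)

scores = pd.DataFrame({'userId': u, 'movieId': m, 'prediction': pred})
scores = scores.sort_values(['userId', 'prediction'], ascending=[True, False])

# keep only the top_k items per user, as score_direct does
top_k = 2
top_scores = scores.groupby('userId').head(top_k).reset_index(drop=True)
```

With a trained fastai model the weight matrices would come from the EmbeddingDotBias module (attribute names like u_weight/i_weight vary by version, so check your install), but the pattern is the same.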
Hi @sgugger, I have reproduced lesson 5’s notebook and I’m wondering how I can predict an item given the user. Furthermore, if we have new users, how do we encode them using the embeddings we trained?
If you have a new user then, by definition, you can’t make predictions for them, since they haven’t been seen by your model.
To make a user/movie prediction, put them in a dataframe row formatted the same way as your training data; then learn.predict(row) will give you the result (see also the example notebook).
Hi @sgugger, while trying to load a dataframe as a test set using:
learn = load_learner('./', test=CollabList.from_df(df_t))
I’m getting the following error:
AttributeError: Can only use .cat accessor with a 'category' dtype
DataBunch used for training:
data_bunch = CollabDataBunch.from_df(ratings=df, seed=40, user_name='ID', item_name='KEYWORD', rating_name='RATING')
DataFrame used for training is loaded from a CSV with format:
You might need to convert your dataframe columns to categories (I think this is done by the latest fastai, but I’m not sure which version you have). Also note that you will probably have to pass cat_names=[name_of_user_column, name_of_item_column] to make this work.
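The conversion itself is plain pandas; a minimal sketch, assuming the hypothetical column names used in the DataBunch above:

```python
import pandas as pd

# toy test frame with object-dtype id columns (hypothetical data)
df_t = pd.DataFrame({'ID': ['u1', 'u2'],
                     'KEYWORD': ['cats', 'dogs'],
                     'RATING': [3.0, 4.5]})

# the processor expects the id columns to carry the 'category' dtype
for col in ['ID', 'KEYWORD']:
    df_t[col] = df_t[col].astype('category')

# if the categories must match those seen at training time,
# set them explicitly (train_ids here is a hypothetical example)
train_ids = ['u1', 'u2', 'u3']
df_t['ID'] = df_t['ID'].cat.set_categories(train_ids)
```

Aligning the categories this way is also what avoids the “identical categories” mismatch when the test frame is handed to a pipeline fitted on the training data.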
My fastai version is 1.0.51.
Also, which columns do I need to change to categorical?
Running df_t.info(verbose=True) showed my NAME and KEYWORD columns to be of type object; I changed both of them to category. Do I need to change all of the columns to category?
Updated code snippet:
learn = load_learner('./', test=CollabList.from_df(df_t, cat_names=['ID', 'KEYWORD']))
UPDATE: after setting the NAME and KEYWORD columns to category, it throws this error:
ValueError: Cannot set a Categorical with another, without identical categories
Thank you for your rapid response @sgugger. What if I want to rank the movies by cosine similarity for each user? The output would be a list of, say, 5 movies that are similar to the user’s taste.
I’m using movies as an example; the real problem is that I’m trying to rank which websites are similar to what the user likes, given their history.
As training data I have the URL, keyword and score. I thought collaborative filtering was the way to go, so I created the model using the lesson-5 2018 notebook. Now I want to find which websites or keywords are similar when I input a URL or keyword.
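The thread never answers this directly, but cosine-similarity ranking over the learned item embeddings is straightforward; a numpy sketch with a toy embedding matrix (hypothetical names, not the fastai API):

```python
import numpy as np

rng = np.random.default_rng(1)
item_emb = rng.normal(size=(10, 4))          # stand-in for trained item embedding weights
names = [f'movie_{i}' for i in range(10)]

def top_similar(idx, k=5):
    """Return the k items most cosine-similar to item idx, excluding itself."""
    normed = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    sims = normed @ normed[idx]              # cosine similarity to every item
    order = np.argsort(-sims)                # descending similarity
    order = order[order != idx][:k]          # drop the query item itself
    return [names[i] for i in order]

similar = top_similar(3, k=5)
```

With a real model you would pull the trained item embedding matrix out of the learner (the attribute name depends on the fastai version) and index rows by the item’s embedding id.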
Hi @sgugger, based on the above discussion and my own experience trying to use fastai: might it be possible to get a tutorial for the collaborative filtering part of fastai, similar to the rest of the series in the link you provided?
If you are looking for an end-to-end example that includes scoring, you can check out this notebook: https://github.com/Microsoft/Recommenders/blob/master/notebooks/00_quick_start/fastai_movielens.ipynb
The actual scoring happens here: https://github.com/Microsoft/Recommenders/blob/master/reco_utils/recommender/fastai/fastai_utils.py
which is pretty much what I posted above in January, and which shows much better performance than learner.get_preds.
Hey @danielsc - the link in your original post does not work; I also read this entire thread and clicked all the links before deciding to post.
Thanks heaps for the extra help & will look into your notebooks today mate!
Hi Paul,
the link in the original post has moved here: https://github.com/Microsoft/Recommenders/blob/master/notebooks/00_quick_start/fastai_movielens.ipynb – can you not access that either?
br,
Daniel
Great! Yup Daniel, your new link works - currently trying to run my data through your workbook right now!
Btw, just shared your repo with our team and they love it - awesome work buddy!
Note: I am getting a bug, which kind of makes sense since the test data has been removed when you split between your train & test sets in your notebook. When it comes to scoring:
scores = score(learner,
               test_df=test_df.copy(),
               user_col=USER,
               item_col=ITEM,
               prediction_col=PREDICTION)
it is throwing the following error:
You're trying to access a user that isn't in the training data.
If it was in your original data, it may have been split such that it's only in the validation set now.
TypeError Traceback (most recent call last)
in ()
3 user_col=USER,
4 item_col=ITEM,
----> 5 prediction_col=PREDICTION)
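That error means the test frame contains ids the model never saw. Before scoring you can either drop those rows or map them to a fallback token, which is what score_direct above does; a toy pandas illustration with hypothetical data:

```python
import pandas as pd

# classes the model was trained on (fastai reserves '#na#' as a placeholder)
known_users = ['#na#', 'u1', 'u2']
test_df = pd.DataFrame({'userId': ['u1', 'u9', 'u2'],
                        'movieId': [1, 2, 3]})

# option 1: drop rows whose user was never seen in training
seen = test_df[test_df['userId'].isin(known_users)]

# option 2: replace unseen users with the placeholder, as in score_direct
fallback = test_df.copy()
fallback.loc[~fallback['userId'].isin(known_users), 'userId'] = known_users[0]
```

Either option avoids indexing the embedding table with an id it has no row for.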
did this work for you mate?
What version of the fastai software are you using mate?
Hey, did you ever figure this out?
I’ve been following the process detailed in the Microsoft notebook but am getting the same error.
Thanks.
I gave up on using the fast.ai package for anything beyond teaching yourself neural nets.
Updated link, using the built learner to generate recommendations and score the model to find the top recommendations: recommenders/fastai_movielens.ipynb at main · microsoft/recommenders · GitHub