CollabLearner prediction

For a project where we compare different collaborative filtering algorithms, I built a notebook using the CollabLearner from FastAI v1. The notebook can be found here.

I am wondering if I am understanding correctly how predictions should be made (I looked through the docs, and forum, but didn’t find much tangible info – maybe I missed it).
My main function for scoring is this:

def score(learner, userIds, movieIds, user_col, item_col, prediction_col, top_k=0):
    """score all users+movies provided and reduce to top_k items per user if top_k>0"""
    u = learner.get_idx(userIds, is_item=False)
    m = learner.get_idx(movieIds, is_item=True)
    
    pred = learner.model.forward(u, m)
    scores = pd.DataFrame({user_col: userIds, item_col:movieIds, prediction_col:pred})
    scores = scores.sort_values([user_col, prediction_col], ascending=[True, False])
    if top_k > 0:
        top_scores = scores.groupby(user_col).head(top_k).reset_index(drop=True)
    else:
        top_scores = scores
    return top_scores
  1. As can be seen, I am just calling PyTorch’s model.forward() after mapping the external ids to the internal ids. I was expecting get_preds to do that for me, but since it is not overridden by the class, it doesn’t work here. Is there a different method that should be used?
  2. To be able to calculate all our metrics, I need the train and validation sets, so I pull them out of the learner using the code below, which seems quite complicated (and ends up being quite slow with larger datasets). Is there a more straightforward way? Is the mapping from the original dataset to train and valid accessible somehow?
# learn is an instance of CollabLearner
valid_df = pd.DataFrame({USER: [row.classes[USER][row.cats[0]] for row in learn.data.valid_ds.x], 
                         ITEM: [row.classes[ITEM][row.cats[1]] for row in learn.data.valid_ds.x], 
                         RATING: [row.obj for row in learn.data.valid_ds.y]})

train_df = pd.DataFrame({USER: [row.classes[USER][row.cats[0]] for row in learn.data.train_ds.x], 
                         ITEM: [row.classes[ITEM][row.cats[1]] for row in learn.data.train_ds.x], 
                         RATING: [row.obj for row in learn.data.train_ds.y]})
  3. In order to provide predictions for a set of users, I create a cartesian product of those users with all the relevant items (all movies in the test set). With a large number of items, that will be a large list to score. Is that the way to go, or is there a smarter way?
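For reference, the cross join itself is cheap to build in plain pandas with a merge on a constant key (the ids and column names below are placeholders):

```python
import pandas as pd

# Hypothetical user and item ids; in practice these come from the test set.
users = pd.DataFrame({"userId": [1, 2]})
items = pd.DataFrame({"movieId": [10, 20, 30]})

# Cross join via a temporary constant key: pairs every user with every item.
pairs = (users.assign(_key=1)
              .merge(items.assign(_key=1), on="_key")
              .drop(columns="_key"))
# pairs has len(users) * len(items) rows, one per (user, item) combination.
```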

Grateful for any input,
Daniel

You have Learner.predict to make predictions on one item. Then if you want to make predictions on a large set, you should put it in the test set of your DataBunch and use learn.get_preds(ds_type = DatasetType.Test).

Hm – how would that work when I train the model first and then want to evaluate it many times with changing data?
Also, I could not find out how to load the model that learner.save saves without first recreating the learner (using the same training data, I gathered). Other learners have a static load method on the class, but this one does not seem to offer one…

It’s all detailed in this tutorial: you basically type learn.export() then load your empty learner with learn = load_learner(path) (note that this is on master for now, future v1.0.40).
As for predicting on

  • one item: use learn.predict(a_row_of_ratings) (for instance ratings.iloc[0] but adapt to your test data)
  • a lot of data at the same time: load your learner with a test set, which will allow you to call learn.get_preds(ds_type=DatasetType.Test) with something like:
learn = load_learner(path, test=CollabList.from_df(test_ratings, cat_names=['userId', 'movieId'], path=path))

As for your other question about getting the validation indexes, you can choose the ones you want if you use the data block API (just look at the docs and then the source code for CollabDataBunch to get an example).

Thanks – the new load_learner function works great.

For scoring, however, I see a huge performance difference between just using learner.model.forward(u, m) and learner.get_preds(ds_type=DatasetType.Test).

The former takes under 2 seconds for 1,423,498 predictions; the latter takes more than 45 seconds. Below are the two functions for reference. This is on a GPU machine – could it be that learner.get_preds doesn’t use the GPU while model.forward does?
Also, reloading the model for each scoring request doesn’t fit my workflow very well – maybe we could add a method like score_direct below to the CollabLearner?

def score_direct(learner, test_df, user_col, item_col, prediction_col, top_k=0):
    """score all users+movies provided and reduce to top_k items per user if top_k>0"""
    # replace values not known to the model with #na#
    total_users, total_items = learner.data.classes.values()
    test_df.loc[~test_df[user_col].isin(total_users),user_col] = total_users[0]
    test_df.loc[~test_df[item_col].isin(total_items),item_col] = total_items[0]
   
    # map ids to embedding ids 
    u = learner.get_idx(test_df[user_col], is_item=False)
    m = learner.get_idx(test_df[item_col], is_item=True)
    
    # score the pytorch model
    pred = learner.model.forward(u, m)
    scores = pd.DataFrame({user_col: test_df[user_col], item_col:test_df[item_col], prediction_col:pred})
    scores = scores.sort_values([user_col, prediction_col], ascending=[True, False])
    if top_k > 0:
        top_scores = scores.groupby(user_col).head(top_k).reset_index(drop=True)
    else:
        top_scores = scores
    return top_scores


def score(path, fname, test_df, user_col, item_col, prediction_col, top_k=0):
    """score all users+movies provided and reduce to top_k items per user if top_k>0"""        
    learner = load_learner(path=path, 
                           fname=fname, 
                           test=CollabList.from_df(test_df, cat_names=[user_col, item_col]))     
    preds = learner.get_preds(ds_type = DatasetType.Test)
    
    scores = pd.DataFrame({user_col: test_df[user_col], item_col:test_df[item_col], prediction_col:np.array(preds[0])})
    scores = scores.sort_values([user_col, prediction_col], ascending=[True, False])
    if top_k > 0:
        top_scores = scores.groupby(user_col).head(top_k).reset_index(drop=True)
    else:
        top_scores = scores
    return top_scores

Hi @sgugger, I have reproduced lesson 5’s notebook and I’m wondering how I can predict an item given the user. Furthermore, if we have new users, how do we encode them using the embedding we trained?

If you have a new user then, by definition, you can’t make predictions for them, since they haven’t been seen by your model.
To make a user/movie prediction, put them in a dataframe row formatted the same way as your training data; then learn.predict(row) will give you the result (see also the example notebook).
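For instance, a row can be sketched with plain pandas (the column names here are placeholders matching a typical ratings frame – adapt them to your data):

```python
import pandas as pd

# Hypothetical row with the same columns as the training dataframe.
# The rating value is a dummy; only user and item drive the prediction.
row = pd.Series({"userId": 42, "movieId": 7, "rating": 0.0})

# With a trained fastai v1 collab learner, this row would then be scored with:
# pred = learn.predict(row)   # `learn` is the loaded CollabLearner
```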

Hi @sgugger, while trying to load a dataframe as a test set using:

learn = load_learner(
    './',
    test = CollabList.from_df(
        df_t
    )
)

I’m getting the following error:

AttributeError: Can only use .cat accessor with a ‘category’ dtype

DataBunch used for training:

data_bunch = CollabDataBunch.from_df(
    ratings = df,
    seed = 40,
    user_name = 'ID',
    item_name = 'KEYWORD',
    rating_name = 'RATING'
)

The DataFrame used for training is loaded from a CSV (format sample not reproduced here).

You might need to convert your dataframe columns to categories (I think this is done by the latest fastai, but I’m not sure which version you have). Also note you will probably have to pass cat_names=[name_of_user_column, name_of_item_column] to make this work.
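A minimal sketch of that conversion (column names taken from the DataBunch above, data values made up):

```python
import pandas as pd

# Stand-in test dataframe with object-dtype columns, as pandas loads them by default.
df_t = pd.DataFrame({"ID": ["u1", "u2"], "KEYWORD": ["cats", "dogs"], "RATING": [3, 5]})

# Convert the user and item columns from object dtype to category dtype.
for col in ["ID", "KEYWORD"]:
    df_t[col] = df_t[col].astype("category")
```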

My fastai version is 1.0.51.
Also, which columns do I need to change to categorical?

df_t.info(verbose=True)

Running the above showed my NAME and KEYWORD columns to be of type object.
I changed both of them to categorical.
Do I need to change all of the columns to categorical?

Updated Code Snippet:

learn = load_learner(
    './',
    test = CollabList.from_df(
        df_t,
        cat_names = ['ID', 'KEYWORD']
    )
)

UPDATE:
After setting NAME and KEYWORD columns to Categorical it throws this error:

ValueError: Cannot set a Categorical with another, without identical categories
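That ValueError typically means the test column’s categories differ from those seen in training. A possible fix sketch: recreate the test columns with exactly the training categories (the training list here is a stand-in for whatever the model was fit on):

```python
import pandas as pd

# Stand-in for the category list the model was trained on.
train_users = ["u1", "u2", "u3"]

df_t = pd.DataFrame({"ID": ["u2", "u3"]})
# Rebuild the column so its categories are identical to the training categories.
df_t["ID"] = pd.Categorical(df_t["ID"], categories=train_users)
```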

Thank you for your rapid response @sgugger. What if I want to rank the movies with cosine similarity for each user? For example, the output would be a list of, say, 5 movies that are similar to the user’s taste.

I’m using movies as an example; the real problem is that I’m trying to rank which websites are similar to what the user likes, given their history.
As training data I have the URL, keyword, and score. I thought collaborative filtering was the way to go, so I created the model using the lesson-5 2018 notebook. Now I would like to find which websites or keywords are similar when I input a URL or keyword.
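One way to sketch this: pull the item embedding matrix out of the model (fastai v1’s EmbeddingDotBias stores it as learn.model.i_weight, though that attribute name is an assumption worth checking) and rank by cosine similarity with plain numpy:

```python
import numpy as np

# Stand-in embedding matrix: 4 items with 3-dimensional factors.
# With a trained model this might be learn.model.i_weight.weight.detach().numpy()
# (attribute name assumed from fastai v1's EmbeddingDotBias).
emb = np.array([[1.0, 0.0, 0.0],
                [0.9, 0.1, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])

def top_similar(emb, idx, k):
    """Return indices of the k items most cosine-similar to item idx."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[idx]    # cosine similarity of every item to item idx
    sims[idx] = -np.inf            # exclude the query item itself
    return np.argsort(-sims)[:k]
```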

Hi @sgugger, based on the above discussion and my own experience trying to use the fastai software – might it be possible to get a tutorial for the collaborative filtering part of fastai, similar to the rest of the series in the link you provided?

If you are looking for an end-to-end example that includes scoring, you can check out this notebook: https://github.com/Microsoft/Recommenders/blob/master/notebooks/00_quick_start/fastai_movielens.ipynb

the actual scoring is happening here: https://github.com/Microsoft/Recommenders/blob/master/reco_utils/recommender/fastai/fastai_utils.py
which is pretty much what I had posted above in January, and which shows much better perf than learner.get_preds.

Hey @danielsc – the link in your original post does not work :frowning: ; also, I read this entire thread and clicked all the links before deciding to post.

Thanks heaps for the extra help & will look into your notebooks today mate!

Hi Paul,
the link in the original post has moved here: https://github.com/Microsoft/Recommenders/blob/master/notebooks/00_quick_start/fastai_movielens.ipynb – can you not access that either?
br,
Daniel

Great! Yup, Daniel, your new link works – currently trying to run my data through your workbook right now!

Btw just shared your repo with our team and they love it, awesome awesome work buddy!

Note: I am getting a bug, which kind of makes sense since the test data has been removed when you split between your train & test sets in your notebook:

when it comes to scoring:

scores = score(learner, 
           test_df=test_df.copy(), 
           user_col=USER, 
           item_col=ITEM, 
           prediction_col=PREDICTION)

it is throwing the following error:

You're trying to access a user that isn't in the training data.
If it was in your original data, it may have been split such that it's only in the validation set now.

TypeError                     Traceback (most recent call last)
in ()
      3            user_col=USER,
      4            item_col=ITEM,
----> 5            prediction_col=PREDICTION)

did this work for you mate?
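One hedged workaround, assuming the error means some test users or items never appear in training: drop those rows with plain pandas before calling score (frames and column names below are stand-ins for the notebook’s actual splits):

```python
import pandas as pd

# Stand-in train/test splits; in the notebook these are the real dataframes.
train_df = pd.DataFrame({"userId": [1, 2, 3], "movieId": [10, 11, 12], "rating": [4, 3, 5]})
test_df = pd.DataFrame({"userId": [2, 3, 4], "movieId": [11, 12, 13], "rating": [5, 2, 1]})

# Keep only test rows whose user AND item were both seen during training.
known = (test_df["userId"].isin(train_df["userId"])
         & test_df["movieId"].isin(train_df["movieId"]))
test_known = test_df[known].reset_index(drop=True)
```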

What version of the fastai software are you using mate?

Hey, did you ever figure this out?
I’ve been following the process detailed in the Microsoft notebook but am getting the same error.

Thanks.

I gave up on using the fast.ai package for anything beyond teaching yourself Neural Nets