Collaborative filtering test predictions

jorgebar · March 12, 2018, 6:26am

Hi, how can I make predictions for a test set or all possible predictions with collaborative filtering?

learn.predict()

can only give predictions for the cross validation set because

CollabFilterDataset.from_csv()

does not receive a test_name argument for the test set file.

In other words, I would like to output the values for all combinations between the two independent variables. i.e. fill all NaN values in the crosstab visualization.

I want something like:

predict(userId, movieId)

Thanks in advance!

noliki · March 16, 2018, 5:35pm

Hi,

that’s a good question. I would be interested also in it.
I’ve come up with following:

Create a model which is ready to make predictions
Create the user-item combinations for which you need predictions. For the ratings add some value. Stackoverflow how to do this.
Get the correct ids for the newly created combinations for the val_idxs parameter. Make sure the new combinations are in the validation set.
Create a new data loader object ( e.g. CollabFilterDataset.from_csv). Let’s call it new_dl
Set the new DL for the model just like in lesson2 .-> “learn.set_data(new_dl)”
Get predictions by “predict(learn.model, learn.data.val_dl)”

I have not tried it in practice. But I think it should work for existing users and items that they have not rated.
For new users check out the other thread on this topic.
Let me know if it works.

bharadwaj · March 18, 2018, 8:34pm

I haven’t used the CollabFilterDataset API for generating test predictions, but I have used similar model in different competition with some success. (I would love to know if someone has used this API to make predictions, especially the test_df parameter; or maybe the predict method.)

AFAIK, for those variable combinations where we have data, we can just dot product corresponding embedding vectors to get the rating.

In case of predictions for variables for which we don’t have any data - for example in case of new users - I think one way would be to start off with median/ mean movie ratings for combination of this user and respective movie. If it’s both new variables, we can start off with some other sensible value to start off with.