Pandas: How to make the crosstab in the collab filtering example?

I’m trying to write my own recreation of the collaborative filtering notebook’s crosstab, but there is no example code in that notebook (it’s an image of Excel, I think?). Shown in the Chapter 8 of Fastbook, and in the Lesson 7 video.

I tried to recreate it, but this is as far as I got. While it’s better than a random crosstab, it’s still much more sparse than what the author used. The idea is to improve on a complete but sparse crosstab of everything by first filtering and then joining the most frequent raters with the most frequently rated movies. I thought this would be more dense, but it’s only halfway there.

(setup if you want to copy/paste)

from fastai2.collab import *
from fastai2.tabular.all import *
path = untar_data(URLs.ML_100k)  
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])  

(my crosstab attempt)

most_frequent_users = list(ratings.user.value_counts()[:20])
most_rated_movies = list(ratings.movie.value_counts()[:20])
denser_ratings = ratings[ratings.user.isin(most_frequent_users)]
denser_movies = ratings[ratings.movie.isin(most_rated_movies)]
pd.crosstab(denser_ratings.user, denser_movies.movie, values=ratings.rating, aggfunc='mean').fillna('-')

Any fluent pandas users know how to slice the frame to have a dense crosstab more like the first image?

1 Like

I posted this on my TWIML study group today https://twimlai.slack.com/archives/C01ALNFJBR6

Take a look at the “Slack Meeting chat Notes”, I posted yesterday (10/20/2020) and look at the reply. I posted a long thread on how I got to these videos and spreadsheets, which I think will help you.

You can also join our group and ask us questions, although we’re in lesson video #8

Enjoy!

I posted a StackOverflow question, then eventually figured out how to do what I wanted and posted the answer. https://stackoverflow.com/questions/64344352/pandas-dense-crosstab-with-n-most-frequent-from-column1-and-column2?noredirect=1#comment113780423_64344352

Basically, there was an error in my most_frequent_* sampling (I was using the value counts instead of the top values’ ids), and there’s also a smarter way to grab the most frequent ratings of the most frequent users. Final code below.

n_users = 10
n_movies = 20

# list the ids of the most frequent users (those who rated the most movies)
most_frequent_users = list(ratings.user.value_counts().index[:n_users])
# grab all the ratings made by these most frequent users
denser_users = ratings[ratings.user.isin(most_frequent_users)]

# list the ids of the most frequent movies within this group of users
dense_users_most_rated = list(denser_ratings.movie.value_counts().index[:n_movies])
# grab all the most frequent movies rated by the most frequent users
denser_movies = ratings[ratings.movie.isin(dense_users_most_rated)]

# plot the crosstab
pd.crosstab(denser_users.user, denser_movies.movie, values=ratings.rating, aggfunc='mean').fillna('-')

And the result

1 Like