In the collab tutorial notebook, only a subset of the top movies is used to avoid indexing errors:
g = rating_movie.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
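For context, here is the top-movies selection run end to end on a tiny synthetic frame (the 'title'/'rating' column names follow the tutorial's rating_movie; I keep the top 2 instead of the top 1000 just to keep the example small):

```python
import pandas as pd

# Synthetic stand-in for the tutorial's rating_movie frame.
rating_movie = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B', 'C'],
    'rating': [4.0, 5.0, 3.0, 2.0, 4.0, 5.0],
})

# Count ratings per title, then keep the most-rated titles.
g = rating_movie.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:2]
print(list(top_movies))  # most-rated titles first: ['A', 'B']
```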
Apparently the get_idx method only uses classes derived from the training set:
u_class, i_class = self.data.train_ds.x.classes.values()
But even when I call u_class, i_class = self.data.classes.values(), I get the same number of classes. And even when I pass the whole dataset as the test argument, like this:
data = collab.CollabDataBunch.from_df(
ratings_df,
test=ratings_df,
seed=42,
pct_val=0.1,
user_name='userId',
item_name='movieId',
rating_name='rating',
path=path)
calling data.test_ds.classes.values() still returns an incomplete set of movie classes.
The same goes for the collab_learner
method, which creates Embedding(9380, 1)
when there are 9725 unique movie ids in the dataset.
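My suspicion is that the gap comes from the random validation split: a movie whose only ratings land in the validation rows never appears in the train-derived classes at all. A minimal pandas sketch (synthetic data, no fastai involved, with the held-out row chosen by hand to make the effect deterministic):

```python
import pandas as pd

# 10 ratings: movie 99 is rated exactly once.
ratings_df = pd.DataFrame({
    'movieId': [1, 1, 2, 2, 3, 3, 4, 4, 5, 99],
    'rating':  [4, 5, 3, 2, 4, 5, 3, 4, 5, 2],
})

# Suppose the random 10% validation split happens to pick the last
# row -- the only rating for movie 99.
val_idx = [9]
train = ratings_df.drop(index=val_idx)

print(ratings_df['movieId'].nunique())  # 6 unique ids overall
print(train['movieId'].nunique())       # 5 -- movie 99 vanished from training
```

If that is what is happening, it would explain why the embedding table is sized from the training split rather than from all 9725 unique ids.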
Any solution to this?