CollabDataBunch: Incomplete set of item classes

In the collab tutorial notebook, we use only a subset of the top movies to avoid indexing errors:

# from the lesson notebook: keep only the 1,000 most-rated titles
g = rating_movie.groupby(title)['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]

Apparently, the get_idx method uses only the classes derived from the training set:

u_class,i_class = self.data.train_ds.x.classes.values()
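
In other words, an id that never appears in the training split has no embedding row and no index to look up. Here is a conceptual sketch of the lookup get_idx effectively performs (an illustration only, not the library's source; some_movie_id is a placeholder):

# conceptual illustration, not fastai's actual implementation
c2i = {c: i for i, c in enumerate(i_class)}  # class -> embedding row index
idx = c2i.get(some_movie_id)                 # some_movie_id is a placeholder; None if absent from training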

But even calling u_class, i_class = self.data.classes.values() returns the same number of classes. And even when I provide the whole dataset as the test argument, like this:

data = collab.CollabDataBunch.from_df(
    ratings_df, 
    test=ratings_df,
    seed=42, 
    pct_val=0.1, 
    user_name='userId', 
    item_name='movieId', 
    rating_name='rating', 
    path=path)

calling data.test_ds.classes.values() still gives an incomplete set of movie classes.
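
One way to see where the classes get cut off is to compare the raw vocabulary with the classes stored on the training set. A minimal check, assuming the ratings_df and data objects from above (the item classes also include the #na# placeholder that fastai prepends):

n_raw = ratings_df['movieId'].nunique()              # unique movie ids in the raw data
u_class, i_class = data.train_ds.x.classes.values()  # the same lookup get_idx uses
print(n_raw, len(i_class))                           # len(i_class) = train ids + 1 for #na#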

The same goes for collab_learner, which creates an Embedding(9380, 1) when there are 9725 unique movie ids in the dataset.
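
The embedding sizes can be checked directly on the model. A quick sketch, assuming the data object from above (n_factors=40 and y_range are arbitrary choices for illustration):

from fastai.collab import collab_learner

learn = collab_learner(data, n_factors=40, y_range=(0., 5.5))
print(learn.model)  # the (n_items, 1) embedding is the per-item bias;
                    # n_items matches the training classes, not the raw id count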

Any solution to this?

This is what happens when setting pct_val=0.99:

data.valid_ds.x
...
userId #na#; movieId 80489; 
userId #na#; movieId 80906; 
userId #na#; movieId #na#; 
userId #na#; movieId #na#; 
userId #na#; movieId #na#; 
userId #na#; movieId #na#; 
userId #na#; movieId 99114; 
userId #na#; movieId #na#; 
userId #na#; movieId 109487; 
userId #na#; movieId 112552; 
...

Okay, it seems I misunderstood the principles of collaborative filtering. By design, movies present only in the validation/test set are irrelevant for training, since their embeddings would have stayed randomly initialized. They therefore get no classes of their own and are mapped to the #na# placeholder instead.
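
To see how many movies this affects, the split can be replicated manually. A rough sketch with pandas and numpy, assuming the same ratings_df (the 90/10 random split approximates pct_val=0.1, though fastai's own split may differ in detail):

import numpy as np

# hypothetical 90/10 split approximating pct_val=0.1
rng = np.random.RandomState(42)
mask = rng.rand(len(ratings_df)) < 0.9
train, valid = ratings_df[mask], ratings_df[~mask]

train_items = set(train['movieId'])
cold = valid.loc[~valid['movieId'].isin(train_items), 'movieId']
print(cold.nunique(), "movies appear only in the validation split and map to #na#")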

That is exactly right.