Validation Dataset for Collaborative Filtering

aakashns · March 17, 2019, 12:38pm

I’ve recently been reviewing the collaborative filtering notebook discussed in Lesson 4 (https://course.fast.ai/videos/?lesson=4), and I have a question about how a validation set should be picked.

If we pick a validation set randomly (e.g. shuffle and pick 20% of the rows), then there might be some users and some items (e.g. movies) that don’t show up in the training set at all. In this case, the embedding vectors for those users/items will never get updated during the backpropagation step, because they never contribute to the training loss. Any prediction for them on the validation set would then rely on their randomly initialized embeddings.
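To make the concern concrete, here is a small sketch of how one could check a given split for users or items that appear only in the validation rows. The column names (userId, movieId, rating), the toy data, and the helper unseen_in_train are all made up for illustration; none of this comes from the fastai library.

```python
import numpy as np
import pandas as pd

def unseen_in_train(ratings, valid_mask, col):
    """Return the ids in `col` that occur in validation rows but in no training row."""
    train_ids = set(ratings.loc[~valid_mask, col].tolist())
    valid_ids = set(ratings.loc[valid_mask, col].tolist())
    return valid_ids - train_ids

# Hypothetical toy ratings table (not the MovieLens data from the lesson).
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Pretend a random split put the last two rows in the validation set.
valid_mask = np.array([False, False, False, True, True])

missing_users = unseen_in_train(ratings, valid_mask, "userId")
missing_items = unseen_in_train(ratings, valid_mask, "movieId")
print("users never seen in training:", missing_users)
print("items never seen in training:", missing_items)
```

In this toy split, user 3 and movie 30 have no training rows, so their embeddings would stay at their initial values.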

A good train-validation split should ensure the following:

Every user shows up in the training set at least once
Every item (movie) shows up in the training set at least once
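One possible way to satisfy both conditions, sketched with pandas: shuffle the rows, pin the first occurrence of every user and every item to the training set, and draw the validation rows only from what remains. The function name coverage_split, its parameters, and the column names are my own assumptions, and this is not what CollabDataBunch does. It also assumes the DataFrame has a unique index.

```python
import numpy as np
import pandas as pd

def coverage_split(ratings, user_col="userId", item_col="movieId",
                   valid_pct=0.2, seed=0):
    """Split so that every user and every item keeps at least one training row."""
    # Shuffle so that the "first occurrence" of each id is a random row.
    shuffled = ratings.sample(frac=1.0, random_state=seed)
    # Rows that are the first occurrence of a user or of an item stay in training.
    pinned = ~shuffled.duplicated(user_col) | ~shuffled.duplicated(item_col)
    candidates = shuffled.index[~pinned]
    # The validation set may end up smaller than requested if few rows are free.
    n_valid = min(int(valid_pct * len(ratings)), len(candidates))
    valid_idx = candidates[:n_valid]
    train_idx = shuffled.index.difference(valid_idx)
    return ratings.loc[train_idx], ratings.loc[valid_idx]

# Hypothetical synthetic data: 200 ratings from 20 users on 30 items.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "userId": rng.integers(0, 20, size=200),
    "movieId": rng.integers(0, 30, size=200),
    "rating": rng.integers(1, 6, size=200).astype(float),
})
train, valid = coverage_split(demo)
print(len(train), len(valid))
```

The trade-off is that users or items with very few ratings contribute little or nothing to the validation set, so the validation loss is slightly biased toward frequently rated users and items.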

Is this a valid concern, or am I missing something? I have tried reading through the CollabDataBunch source code on GitHub, and the library doesn’t seem to enforce either of the above criteria.