3 questions about chapter 8

  1. How do the dataloaders separate the training set and validation set when we haven’t told them to with the DataBlock API? Do they automatically set aside a randomly chosen 20% of the data as the validation set?
  2. When we’re training the model, are we including the ‘0’ ratings, the ones for movies that the users haven’t watched and rated yet? That seems odd. Doesn’t doing so make the model predict low or close to 0 ratings for those movies?
  3. How to choose the optimal number of latent factors?
    Thank you.

If you look at the function signature of the dataloaders (CollabDataLoaders.from_df in this chapter), you will notice that valid_pct is set to 0.2 (20%) by default. So if you don’t explicitly pass a validation percentage, as is the case in the chapter, the default applies and a randomly chosen 20% of the data is held out as the validation set.
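To make the idea of a random 20% hold-out concrete, here is a minimal sketch in plain Python. It is not fastai's own code; the helper name random_split and its seed argument are made up for illustration, and only the valid_pct default of 0.2 comes from the discussion above.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle the indices 0..n_items-1 and hold out a valid_pct share of them.

    Hypothetical helper for illustration only, not part of fastai.
    """
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)      # random order, reproducible if seed is set
    cut = int(valid_pct * n_items)         # number of validation items
    return idxs[cut:], idxs[:cut]          # (train indices, validation indices)

train_idx, valid_idx = random_split(100, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))      # 80 20
```

If you want a different split, passing valid_pct explicitly to the dataloaders call is the place to change it.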

One of the goals of collaborative filtering is to learn embeddings for the users and the movies. The ratings table only has rows for ratings that were actually given, so movies a user has not rated are not fed in as 0 ratings. Even so, an un-rated movie can still end up close in embedding space to similar movies that were rated, so its predicted rating will be similar to theirs. In short, the model should be able to predict ratings for such un-rated movies.
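Here is a rough sketch of that idea in plain Python, assuming a simple dot-product model trained with SGD. The users, movies, ratings, learning rate and number of factors are all made up for illustration, and this is a toy version, not the model from the chapter. The point is that training only ever sees the rated pairs, yet the learned embeddings still give a prediction for a pair that was never rated.

```python
import random

# Hypothetical toy data: only ratings that were actually given appear here.
# A (user, movie) pair with no rating is simply absent, not stored as 0.
ratings = [
    ("ann", "Alien", 5.0), ("ann", "Aliens", 5.0), ("ann", "Amelie", 1.0),
    ("bob", "Alien", 4.0), ("bob", "Amelie", 2.0),  # bob never rated "Aliens"
    ("cat", "Alien", 1.0), ("cat", "Aliens", 1.0), ("cat", "Amelie", 5.0),
]

n_factors = 2                      # number of latent factors (arbitrary choice)
rng = random.Random(0)
users = sorted({u for u, _, _ in ratings})
movies = sorted({m for _, m, _ in ratings})

# One small random embedding vector per user and per movie.
user_emb = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
movie_emb = {m: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for m in movies}

def predict(user, movie):
    """Predicted rating = dot product of the user and movie embeddings."""
    return sum(a * b for a, b in zip(user_emb[user], movie_emb[movie]))

def mse():
    """Mean squared error over the observed ratings only."""
    return sum((predict(u, m) - r) ** 2 for u, m, r in ratings) / len(ratings)

initial_loss = mse()
lr = 0.02
for _ in range(2000):
    for u, m, r in ratings:        # un-rated pairs never enter the loop
        err = predict(u, m) - r
        for k in range(n_factors):
            pu, qm = user_emb[u][k], movie_emb[m][k]
            user_emb[u][k] -= lr * err * qm
            movie_emb[m][k] -= lr * err * pu
final_loss = mse()

# The model can still score the pair it never saw during training.
bob_aliens = predict("bob", "Aliens")
print(f"loss {initial_loss:.2f} -> {final_loss:.2f}, bob/Aliens prediction {bob_aliens:.2f}")
```

Because no 0 ratings are in the training loop, nothing pushes the prediction for bob and "Aliens" toward 0; it comes entirely from where bob and "Aliens" ended up in embedding space.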