Part 1, online study group

Meeting Minutes of 02/02/2019

Presentation on Lesson 4 (Tabular and Collaborative Filtering)

Presenter: @Tendo

Thanks to @Tendo for the wonderful Colab notebooks!

Questions

Tabular Data:
  • What are the heuristics or the formula for determining the size of the hidden layers for the tabular learner?

    learn = tabular_learner(data, layers=[200,100], metrics=accuracy)

    • Forum thread for reference and possible further discussion linked below in Resources
  • In Tendo’s notebook, the training set has 3256 rows in total. If we choose rows 800-1000 as our validation set, that gives us 200 samples, or about 6% of the training set. Is that enough?

    test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)

    • I didn’t quite gather if we fully resolved this in the discussion
    • Also, why 800-1000? Can we not get a more random split by using a ratio/percentage, as in sklearn?
      • One reason could be that we want a contiguous set for validation. It is much like video frames: if two adjacent frames end up split, one in training and one in validation, then the model is not really learning anything - it is cheating
      • Any other explanations? Is 6% enough?
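
To make the two options concrete, here is a minimal sketch in plain Python (no fastai needed) comparing the contiguous 800-1000 block with a random percentage split. The 3256 figure and the 800-1000 range come from the notes above; the `random_valid_idx` helper, the 20% ratio, and the seed are assumptions made up for illustration, not what the notebook does.

```python
import random

n_rows = 3256  # training-set size quoted in the notes above

# Contiguous block: rows 800..999, i.e. the 200 samples discussed above.
valid_idx_contiguous = list(range(800, 1000))
valid_fraction = len(valid_idx_contiguous) / n_rows

def random_valid_idx(n_rows, valid_pct, seed=42):
    """Hypothetical helper: pick a random valid_pct share of row indices."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    n_valid = int(n_rows * valid_pct)  # number of validation rows, rounded down
    return sorted(rng.sample(range(n_rows), n_valid))

# Random split by ratio, similar in spirit to sklearn's train_test_split.
valid_idx_random = random_valid_idx(n_rows, valid_pct=0.2)

print(f"contiguous: {len(valid_idx_contiguous)} rows ({valid_fraction:.1%})")
print(f"random:     {len(valid_idx_random)} rows ({len(valid_idx_random) / n_rows:.1%})")
```

Either list of indices could, in principle, be handed to an index-based split. The random version is only safe when neighbouring rows are independent of each other, which is exactly the video-frames concern raised above.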

Collaborative Filtering:

  • How do I differentiate between when to use collaborative filtering vs tabular?
    • A thought experiment. Taking the ‘US Salary’ example of Tabular, could I instead run Collaborative Filtering on that and come up with a recommendation for a salary?
    • Basic intuition for this is to look at it as:
      • Tabular :: Supervised
      • Collaborative Filtering :: Unsupervised
  • What are n_factors?
    • n_factors is the number of hidden (latent) features that the model learns for each user and each item during training
      • For example, the model might end up separating movies that are family-friendly from those that are not. Family-friendliness would then be one of the latent factors.
    • So, when we set up the learner, is n_factors one of the hyperparameters we choose?
      • It could affect speed and accuracy, but we need more experiments to determine how.
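
As a rough illustration of what n_factors controls, here is a minimal sketch of a dot-product collaborative filtering model in plain Python. It is not how fastai implements its collab learner (there are no bias terms, no y_range, and no mini-batches); the toy ratings, the `train` and `mse` helpers, the learning rate, and the epoch count are all made up for this example. Each user and each movie gets a vector of length n_factors, and a predicted rating is the dot product of the two.

```python
import random

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def train(ratings, n_users, n_items, n_factors, epochs=200, lr=0.05, seed=0):
    """Fit user and item factor vectors with plain SGD on squared error."""
    rng = random.Random(seed)
    # One small random vector of length n_factors per user and per item.
    users = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    items = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = dot(users[u], items[i]) - r  # prediction error for this rating
            for k in range(n_factors):
                u_k, i_k = users[u][k], items[i][k]
                users[u][k] -= lr * err * i_k  # gradient step on the user factor
                items[i][k] -= lr * err * u_k  # gradient step on the item factor
    return users, items

def mse(users, items, ratings):
    """Mean squared error over the known ratings."""
    return sum((dot(users[u], items[i]) - r) ** 2 for u, i, r in ratings) / len(ratings)

# Toy data (hypothetical): (user_id, movie_id, rating) triples.
ratings = [
    (0, 0, 5.0), (0, 1, 4.0), (0, 2, 1.0),
    (1, 0, 4.0), (1, 1, 5.0), (1, 3, 1.0),
    (2, 0, 1.0), (2, 2, 5.0), (2, 3, 4.0),
    (3, 1, 1.0), (3, 2, 4.0), (3, 3, 5.0),
]

users, items = train(ratings, n_users=4, n_items=4, n_factors=2)
print("factors per user:", len(users[0]))
print("training MSE:", round(mse(users, items, ratings), 4))
```

Changing n_factors changes the length of every user and movie vector, which is why it behaves like a hyperparameter: more factors mean more parameters to train, and whether accuracy improves is something to check by experiment on a validation set.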

Resources

Jeremy’s tweet on Tabular:
