Part 1, online study group

shimsan · February 3, 2020, 5:12pm

Meeting Minutes of 02/02/2020

Presenter: @Tendo

Thanks to @Tendo for the wonderful Colab notebooks!

Tabular Data:

What are the heuristics or the formula for determining the size of the hidden layers for the tabular learner?

learn = tabular_learner(data, layers=[200,100], metrics=accuracy)
- Forum thread for reference and possible further discussion linked below in Resources
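The notes do not record a formula for choosing `layers`, but one rough way to compare candidate settings is to count the weights each one implies. The sketch below does that for a plain fully connected stack. The input width (42) and output width (2) are placeholder assumptions, not values from the notebook, and the count ignores the embeddings and batch-norm parameters a real tabular model would also have.

```python
def count_dense_params(n_inputs, layers, n_outputs):
    """Count weights and biases in a plain fully connected stack."""
    sizes = [n_inputs, *layers, n_outputs]
    # Each linear layer has fan_in * fan_out weights plus fan_out biases.
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


# Hypothetical widths: 42 input features and 2 output classes are placeholders.
N_INPUTS, N_OUTPUTS = 42, 2

for candidate in ([200, 100], [100, 50], [400, 200]):
    print(candidate, count_dense_params(N_INPUTS, candidate, N_OUTPUTS))
```

A larger count relative to the number of training rows suggests a higher risk of overfitting, which is one thing to weigh when trying different sizes.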
In Tendo’s notebook, the total size of the training set was 3256. If we choose rows 800-1000 as our validation set, we get 200 samples, which is around 6% of the training set (200 / 3256 ≈ 0.061). Is that enough?

test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)
- I am not sure we fully resolved this in the discussion
- Also, why 800-1000? Could we not get a more random split by specifying a ratio/percentage, as in sklearn?
  - One reason could be that we want a contiguous block for validation. With video frames, for example, if one of two adjacent frames lands in training and the other in validation, the two are nearly identical, so a good validation score tells us little - the model is effectively cheating
  - Any other explanations? Is 6% enough?
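To make the two options concrete, here is a minimal sketch in plain Python that builds validation indices both ways: the contiguous block used in the notebook and a random percentage-based split. The helper `random_valid_idx` and the 20% figure are illustrative assumptions. In fastai v1 the corresponding data block calls would be `split_by_idx` and `split_by_rand_pct`.

```python
import random

n_rows = 3256  # training-set size quoted in the notes

# Contiguous block: rows 800..999, matching df.iloc[800:1000].
contiguous_valid = list(range(800, 1000))
contiguous_fraction = len(contiguous_valid) / n_rows
print(f"contiguous: {len(contiguous_valid)} rows, {contiguous_fraction:.1%} of the data")


def random_valid_idx(n_rows, valid_pct, seed=42):
    """Pick a random subset of row indices to hold out (hypothetical helper)."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    n_valid = int(n_rows * valid_pct)  # round down to a whole number of rows
    return sorted(rng.sample(range(n_rows), n_valid))


random_valid = random_valid_idx(n_rows, valid_pct=0.2)
print(f"random: {len(random_valid)} rows, {len(random_valid) / n_rows:.1%} of the data")
```

The random split is the usual choice when rows are independent. The contiguous block is safer when neighbouring rows are correlated, as in the video-frame example.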

Collaborative Filtering:

How do I differentiate between when to use collaborative filtering vs tabular?
- A thought experiment: taking the ‘US Salary’ example from Tabular, could I instead run Collaborative Filtering on it and come up with a recommendation for a salary?
- Basic intuition for this is to look at it as:
  - Tabular :: Supervised
  - Collaborative Filtering :: Unsupervised
What is n_factors?
- It is the number of hidden (latent) features that the model learns for each user and each item during training
  - For example, the model might learn that some movies are family-friendly and others are not. Family-friendliness would then be one of the learned factors.
- So, when we set up the learner, is n_factors one of the hyperparameters?
  - It could affect speed and accuracy, but we need more experiments to determine how.
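As a rough illustration of what n_factors controls, the sketch below trains a tiny dot-product collaborative filtering model in plain Python. Every user and item gets a vector of length `n_factors`, and a rating is predicted as the dot product of the two. The ratings, learning rate and epoch count are made up for the example, and this is not the fastai implementation (there, the value is passed as `collab_learner(data, n_factors=...)`).

```python
import random


def train_factors(ratings, n_users, n_items, n_factors, epochs=500, lr=0.02, seed=0):
    """Fit user/item factor vectors with plain SGD on squared error."""
    rng = random.Random(seed)
    users = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    items = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(a * b for a, b in zip(users[u], items[i]))
            err = pred - r
            for k in range(n_factors):
                uk, ik = users[u][k], items[i][k]
                users[u][k] -= lr * err * ik  # gradient step for the user factor
                items[i][k] -= lr * err * uk  # gradient step for the item factor
    return users, items


def mse(ratings, users, items):
    """Mean squared error of dot-product predictions."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(a * b for a, b in zip(users[u], items[i]))
        total += (pred - r) ** 2
    return total / len(ratings)


# Made-up (user, item, rating) triples: 4 users, 3 items.
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 5.0), (2, 2, 4.0), (3, 0, 1.0), (3, 1, 4.0)]

untrained_users, untrained_items = train_factors(ratings, 4, 3, n_factors=3, epochs=0)
users, items = train_factors(ratings, 4, 3, n_factors=3)

print("error before training:", mse(ratings, untrained_users, untrained_items))
print("error after training: ", mse(ratings, users, items))
```

Raising `n_factors` gives each user and item more numbers to learn, which adds capacity and compute. That is why it behaves like a hyperparameter to tune.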

Jeremy’s tweet on Tabular: