Meeting Minutes of 02/02/2019
Presentation on Lesson 4 (Tabular and Collaborative Filtering)
Presenter: @Tendo
- Tabular with US Salary Dataset Colab Notebook
- Tabular Titanic Dataset Colab Notebook
- Collaborative Filtering Colab Notebook
Thanks to @Tendo for the wonderful Colab notebooks!
Questions
Tabular Data:
- What are the heuristics or the formula for determining the size of the hidden layers for the tabular learner?
learn = tabular_learner(data, layers=[200,100], metrics=accuracy)
- Forum thread for reference and possible further discussion linked below in Resources
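No formula was agreed on in the meeting. As a rough aid for experimenting, the sketch below counts the weights and biases implied by a given `layers` choice, so the count can be compared against the number of training rows. It treats the network as plain fully connected layers and ignores the embeddings and batch norm that fastai's tabular model adds. `dense_param_count` is a hypothetical helper, and the input size of 40 and output size of 2 are made-up values, not taken from the notebook.

```python
def dense_param_count(n_inputs, layers, n_outputs):
    """Count weights and biases in a stack of fully connected layers."""
    sizes = [n_inputs] + list(layers) + [n_outputs]
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total

# n_inputs=40 and n_outputs=2 are illustrative assumptions only.
for layers in ([200, 100], [100, 50], [400, 200]):
    print(layers, dense_param_count(40, layers, 2))
```

This is only a way to size up candidates before trying them; it does not say which one will generalize best.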
- In Tendo’s notebook, the training set has 3256 rows in total, so if we choose rows 800-1000 as our validation set, we have 200 samples, which is around 6% of the training set. Is that enough?
test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)
- I didn’t quite gather if we fully resolved this in the discussion
- Also, why 800-1000? Can we not achieve a more random split by using ratio/percentage like in sklearn?
- One reason could be that we want a contiguous block for our validation set. Much like video frames: if two adjacent frames end up one in training and one in valid, then our model is not learning anything, it is cheating
- Any other explanations? Is 6% enough?
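To make the two options above concrete, here is a minimal sketch in plain Python that checks the 6% figure and builds a random validation index list from a percentage instead of a fixed row range. `random_valid_idx` is a hypothetical helper written for these minutes, and the 20% ratio and the seed are assumptions chosen for illustration.

```python
import random

n_rows = 3256  # training set size quoted in the minutes

# Contiguous block, as in the notebook: rows 800 up to (not including) 1000.
valid_idx_contig = list(range(800, 1000))
contig_fraction = len(valid_idx_contig) / n_rows
print(f"contiguous split holds out {contig_fraction:.1%} of the rows")

def random_valid_idx(n_rows, valid_pct, seed=42):
    """Pick a random subset of row indices for validation (hypothetical helper)."""
    rng = random.Random(seed)           # fixed seed so the split is reproducible
    n_valid = int(n_rows * valid_pct)   # rounds down
    return sorted(rng.sample(range(n_rows), n_valid))

# 0.2 is an assumed ratio, not a recommendation from the meeting.
valid_idx_random = random_valid_idx(n_rows, 0.2)
print(f"random split holds out {len(valid_idx_random)} rows")
```

Either index list could then be handed to fastai's `split_by_idx`. Whether a random split is appropriate still depends on the point raised above: if neighbouring rows are closely related, a random split can leak information into validation.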
Collaborative Filtering:
- How do I differentiate between when to use collaborative filtering vs tabular?
- A thought experiment. Taking the ‘US Salary’ example of Tabular, could I instead run Collaborative Filtering on that and come up with a recommendation for a salary?
- Basic intuition for this is to look at it as:
- Tabular :: Supervised
- Collaborative Filtering :: Unsupervised
- What are n_factors?
- They are the hidden features that the model learns after training
- For example, deciding that some movies are family-friendly vs others not. Family-friendliness is one of the n_factors.
- So, while we set up the learner, is the number of n_factors we choose one of the hyperparameters?
- It could affect speed and accuracy, but we need more experiments to determine how.
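To illustrate what the hidden factors are, below is a toy matrix factorization in plain Python. It is not how fastai implements its collaborative filtering learner; it is a simplified sketch with no bias terms and no output range, trained with basic SGD. Each user and each movie gets a vector of length `n_factors`, and a predicted rating is the dot product of the two. The ratings, `train_factors`, the learning rate and the epoch count are all made up for illustration.

```python
import random

def train_factors(ratings, n_users, n_items, n_factors=2, lr=0.02, epochs=300, seed=0):
    """Learn user and item factor vectors with plain SGD (toy sketch)."""
    rng = random.Random(seed)
    # Small random starting values for every latent factor.
    U = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    M = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]

    def predict(u, m):
        # Predicted rating is the dot product of user and item factors.
        return sum(a * b for a, b in zip(U[u], M[m]))

    def mse():
        return sum((r - predict(u, m)) ** 2 for u, m, r in ratings) / len(ratings)

    history = [mse()]  # error before any training
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - predict(u, m)
            for k in range(n_factors):
                uk, mk = U[u][k], M[m][k]
                U[u][k] += lr * err * mk  # gradient step on the user factor
                M[m][k] += lr * err * uk  # gradient step on the item factor
        history.append(mse())  # error after each epoch
    return U, M, history

# Made-up (user, movie, rating) triples: users 0-1 and users 2-3 have opposite tastes.
ratings = [(0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 1, 5), (1, 2, 1),
           (2, 0, 1), (2, 1, 2), (2, 2, 5), (3, 0, 2), (3, 2, 4)]

U, M, history = train_factors(ratings, n_users=4, n_items=3, n_factors=2)
print("MSE before training:", history[0])
print("MSE after training: ", history[-1])
```

Nothing in the code names a factor "family-friendly"; any such meaning is read off the learned vectors afterwards. Changing `n_factors` here changes the size of every vector, which is why it behaves like a hyperparameter in this sketch.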
Resources
Jeremy’s tweet on Tabular: