Lesson 6 official topic

This post is for topics related to lesson 6 of the course. This lesson is based partly on chapter 9 of the book.

This is a wiki post - feel free to edit to add links from the lesson or other useful info.

<<< Lesson 5 | Lesson 7 >>>

Lesson resources

Links from the lesson

Video timestamps

  • 00:00 Review
  • 02:09 TwoR model
  • 04:43 How to create a decision tree
  • 07:02 Gini
  • 10:54 Making a submission
  • 15:52 Bagging
  • 19:06 Random forest introduction
  • 20:09 Creating a random forest
  • 22:38 Feature importance
  • 26:37 Adding trees
  • 29:32 What is OOB
  • 32:08 Model interpretation
  • 35:47 Removing the redundant features
  • 35:59 What does Partial dependence do
  • 39:22 Can you explain why a particular prediction is made
  • 46:07 Can you overfit a random forest
  • 49:03 What is gradient boosting
  • 51:56 Introducing walkthrus
  • 54:28 What does fastkaggle do
  • 1:02:52 fastcore.parallel
  • 1:04:12 item_tfms=Resize(480, method='squish')
  • 1:06:20 Fine-tuning project
  • 1:07:22 Criteria for evaluating models
  • 1:10:22 Should we submit as soon as we can
  • 1:15:15 How to automate the process of sharing kaggle notebooks
  • 1:20:17 AutoML
  • 1:24:16 Why does the first model run so slowly on Kaggle GPUs
  • 1:27:53 How much can a novel architecture improve accuracy
  • 1:28:33 Convnext
  • 1:31:10 How to iterate the model with padding
  • 1:32:01 What does our data augmentation do to images
  • 1:34:12 How to iterate the model with larger images
  • 1:36:08 pandas indexing
  • 1:38:16 What data-augmentation does tta use?

I have a question regarding the number of trees in a random forest: does increasing the number of trees always translate to a lower error?
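
One way to probe this is to watch how the spread of a bagged prediction changes as members are added. A minimal stdlib-only sketch, where each "tree" is just a toy stand-in that predicts the mean of its bootstrap sample (not a real decision tree):

```python
import random
import statistics

random.seed(0)
# toy data: noisy observations around a true value of 10
data = [10 + random.gauss(0, 5) for _ in range(200)]

def bagged_prediction(n_models):
    """Average the predictions of n_models bootstrap-trained members.
    Each toy 'member' predicts the mean of its bootstrap sample."""
    preds = [statistics.fmean(random.choices(data, k=len(data)))
             for _ in range(n_models)]
    return statistics.fmean(preds)

def prediction_spread(n_models, trials=200):
    """How much the ensemble's prediction varies between independent runs."""
    return statistics.pstdev(bagged_prediction(n_models) for _ in range(trials))
```

In this sketch `prediction_spread(25)` comes out noticeably smaller than `prediction_spread(1)`: adding members shrinks the variance with diminishing returns, but it cannot remove error that every member shares (bias), which is why more trees generally help the error but the improvement plateaus rather than continuing forever.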


Thanks for running part 1! Are there plans for part 2 already?


Questions:

  • So we know that bagging is a powerful ensemble approach to machine learning. Would it be advisable to try bagging first when approaching a particular task (say, a tabular task), before deep learning?

  • Can we create a bagging model that includes fast.ai deep learning model(s)? I imagine it would be really powerful.
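
On the second point, bagging only needs members that can fit a sample and produce predictions, so in principle any learner could be dropped in. A toy sketch of that generic interface (the linear "learner" here is purely illustrative, not fastai code):

```python
import random
import statistics

def bag_models(train_fn, data, n_models):
    """Train n_models members, each on a bootstrap sample of `data`.
    `train_fn(sample) -> predict_fn` can wrap any learner."""
    return [train_fn(random.choices(data, k=len(data))) for _ in range(n_models)]

def bagged_predict(models, x):
    """Regression-style bagging: average the members' predictions."""
    return statistics.fmean(m(x) for m in models)

# toy member: fit y = a*x by least squares on its bootstrap sample
def fit_linear(sample):
    sxx = sum(x * x for x, _ in sample)
    sxy = sum(x * y for x, y in sample)
    a = sxy / sxx
    return lambda x: a * x

random.seed(1)
data = [(x, 3 * x + random.gauss(0, 1)) for x in range(1, 21)]
models = bag_models(fit_linear, data, n_models=20)
pred = bagged_predict(models, 10)  # close to 3 * 10 = 30
```

Hypothetically, `train_fn` could be a function that fine-tunes a fastai `Learner` on the bootstrap sample and returns its prediction function; that is essentially how ensembling deep models works, at the cost of training every member.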


Is it true that a random forest model does not overfit?


Also, dropout in deep learning is very similar in spirit to bagging in random forests.


Would you ever exclude a tree from the forest if it had a ‘bad’ OOB error?


In terms of ML explainability, the feature importance of an RF model sometimes gives different results than other explainability techniques, such as the well-known SHAP method or LIME. In this situation, which would be more accurate/reliable: RF feature importance or the other techniques?
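
Part of the disagreement is that the methods measure different things: RF's built-in importance is impurity-based and computed during training, while SHAP, LIME, and permutation importance interrogate the fitted model's predictions. Permutation importance is simple enough to sketch from scratch (the toy model and all names here are illustrative):

```python
import random
import statistics

def permutation_importance(predict, X, y, score, n_repeats=10, seed=0):
    """Model-agnostic importance: shuffle one feature column at a time and
    measure how much the model's score drops on average."""
    rng = random.Random(seed)
    base = score(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - score(predict, X_perm, y))
        importances.append(statistics.fmean(drops))
    return importances  # higher = the model relied on that feature more

# toy demo: the "model" uses only feature 0, and the target equals feature 0
random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
y = [row[0] for row in X]

def predict(row):
    return row[0]

def neg_mse(model, rows, targets):
    return -statistics.fmean((model(r) - t) ** 2 for r, t in zip(rows, targets))

imps = permutation_importance(predict, X, y, neg_mse)
# imps[0] is clearly positive; imps[1] is exactly zero
```

Because it only needs a `predict` function, the same routine can be pointed at an RF, a gradient-boosted model, or a neural net, which makes it a useful cross-check when the impurity-based numbers look suspicious.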


Just some comments or thoughts (and questions) please:

  • We could go on and create ensembles (and more) of bagged models, and I assume they would result in better-performing models. So, a question here: when should we stop?

  • Another question regarding ensembles: if we’d like to create ensembles of ensembles, is combining bagged models with other bagged models a better approach than, say, combining a bagged model with a different ensemble technique, like stacking?

Yes, later this year. Check out Walkthru 13 and the Discord discussion.


How does random forest compare to bootstrapping?

“Statistical Modeling: The Two Cultures” by Leo Breiman as mentioned by Jeremy in the lecture.


Is there any relationship between random forests and having an equivalent number of weights and relu activations in a similarly deep neural network? Can random forests just be implemented with a sufficient DL network?


On the overfitting aspect:

If we use a random forest for feature importance to select the best columns in a dataset, and then also use a random forest to build the model, would that lead to overfitting?
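
The usual concern here is leakage rather than classic overfitting: if the feature-selection step sees the validation rows, the validation score is no longer honest. A stdlib-only sketch of the distinction, using a toy correlation-based "importance" as a stand-in for the RF importance step (all names illustrative):

```python
import random

random.seed(0)

# toy dataset: 20 pure-noise features and a random binary target (no signal)
n_rows, n_feats = 50, 20
X = [[random.gauss(0, 1) for _ in range(n_feats)] for _ in range(n_rows)]
y = [random.randint(0, 1) for _ in range(n_rows)]

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def top_feature(rows, targets):
    """Stand-in for 'feature importance': the feature most correlated with
    the target on the rows it is shown."""
    return max(range(n_feats),
               key=lambda j: abs(correlation([r[j] for r in rows], targets)))

# Leaky: the feature is chosen using ALL rows, so any later "validation"
# rows have already influenced the choice and scores on them are inflated.
leaky_j = top_feature(X, y)

# Honest: selection happens inside the training split only; the validation
# rows never inform which features survive.
train_X, train_y = X[:35], y[:35]
honest_j = top_feature(train_X, train_y)
```

Running importance-based selection and model fitting on the training split only, then scoring on a held-out split, keeps the estimate honest even when both steps use random forests.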


When you are working on tabular data, how do you go about trying different models like random forests, gradient boosting, neural networks, etc.? How is that decision made? Are there benchmarks in the tabular world, like the ones showing which image models are best?


Do you use AutoML frameworks to help automate your iterations, and if so, which AutoML frameworks or services do you recommend?


If you’re using Linux or WSL, autosklearn is a good library to try. As the name suggests, it is closely related to/based on sklearn, which you probably already have some familiarity with.


Do you create different Kaggle notebooks for the different models you try? So one Kaggle notebook for the first (base) model, and separate notebooks for subsequent models? Or do you put your subsequent models at the bottom of the same (base model) notebook? Just wondering what your ideal approach is.


I see there’s a dedicated framework for PyTorch: GitHub - automl/Auto-PyTorch: Automatic architecture search and hyperparameter optimization for PyTorch
