Lesson 6 official topic

This post is for topics related to lesson 6 of the course. This lesson is based partly on chapter 9 of the book.

This is a wiki post - feel free to edit to add links from the lesson or other useful info.

<<< Lesson 5 | Lesson 7 >>>

Lesson resources

Links from the lesson

Video timestamps

  • 00:00 Review
  • 02:09 TwoR model
  • 04:43 How to create a decision tree
  • 07:02 Gini
  • 10:54 Making a submission
  • 15:52 Bagging
  • 19:06 Random forest introduction
  • 20:09 Creating a random forest
  • 22:38 Feature importance
  • 26:37 Adding trees
  • 29:32 What is OOB
  • 32:08 Model interpretation
  • 35:47 Removing the redundant features
  • 35:59 What does Partial dependence do
  • 39:22 Can you explain why a particular prediction is made
  • 46:07 Can you overfit a random forest
  • 49:03 What is gradient boosting
  • 51:56 Introducing walkthrus
  • 54:28 What does fastkaggle do
  • 1:02:52 fastcore.parallel
  • 1:04:12 item_tfms=Resize(480, method='squish')
  • 1:06:20 Fine-tuning project
  • 1:07:22 Criteria for evaluating models
  • 1:10:22 Should we submit as soon as we can
  • 1:15:15 How to automate the process of sharing kaggle notebooks
  • 1:20:17 AutoML
  • 1:24:16 Why does the first model run so slowly on Kaggle GPUs
  • 1:27:53 How much can a novel architecture improve accuracy
  • 1:28:33 Convnext
  • 1:31:10 How to iterate the model with padding
  • 1:32:01 What does our data augmentation do to images
  • 1:34:12 How to iterate the model with larger images
  • 1:36:08 pandas indexing
  • 1:38:16 What data augmentation does tta use?
16 Likes

I have a question regarding the number of trees in a random forest: does increasing the number of trees always translate to a lower error?
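Not from the lesson, but a minimal sklearn sketch (on synthetic data) of how you can check this empirically: adding trees mainly reduces variance, so the validation error tends to flatten out rather than keep improving forever.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data stands in for whatever tabular dataset you are working with.
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

# Watch validation error as the number of trees grows; it usually plateaus.
for n in [1, 5, 10, 20, 40, 80, 160]:
    rf = RandomForestRegressor(n_estimators=n, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    mae = mean_absolute_error(y_valid, rf.predict(X_valid))
    print(f"{n:4d} trees -> validation MAE {mae:.2f}")
```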

3 Likes

Thanks for running part 1! Are there plans for part 2 already?

1 Like

Questions:

  • So we know that bagging is a powerful ensemble approach to machine learning. Would it be advisable to try bagging first when approaching a particular task (say, a tabular task) before deep learning?

  • Can we create a bagging model that includes fast.ai deep learning model(s)? I guess it would be really powerful? (See the sketch after this list.)
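On the second bullet, here is a rough, generic sketch of the idea using plain sklearn on synthetic data (not from the lesson): the ensemble is just an average of predictions, and either array being averaged could equally come from a fastai tabular learner via learn.get_preds().

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
gb = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# The "ensemble" is just the average of the two models' predictions; one of
# these arrays could instead be predictions from a fastai tabular learner.
preds = (rf.predict(X_valid) + gb.predict(X_valid)) / 2
print(mean_absolute_error(y_valid, preds))
```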

6 Likes

Is it true that a random forest model does not overfit?

2 Likes

Also, dropout in deep learning is very similar in spirit to bagging in a random forest.

1 Like

Would you ever exclude a tree from the forest if it had a ‘bad’ OOB error?

2 Likes

In terms of ML explainability, the feature importance of an RF model sometimes gives different results than other explainability techniques like the well-known SHAP method or LIME. In this situation, which one would be more accurate/reliable: RF feature importance or the other techniques?
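Not a verdict on which is more reliable, but one cheap comparison you can run yourself: the RF's built-in impurity-based importances versus sklearn's model-agnostic permutation importance, which is closer in spirit to SHAP/LIME (those need their own libraries). A sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, n_informative=3, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

rf = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_train, y_train)

# Built-in importances: derived from impurity decreases on the training data.
print("impurity-based:", rf.feature_importances_.round(3))

# Permutation importance: measured on held-out data by shuffling one column at
# a time, so it is model-agnostic and less biased toward high-cardinality features.
result = permutation_importance(rf, X_valid, y_valid, n_repeats=10, random_state=1)
print("permutation:   ", result.importances_mean.round(3))
```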

1 Like

Just some comments or thoughts (and questions) please:

  • We could go on and create ensembles (and more) of bagged models, and I assume they would result in better-performing models. So, a question here: when should we stop?

  • Another question regarding ensembles: if we’d like to create ensembles of ensembles, is combining bagged models with other bagged models a better approach than, say, combining a bagged model using a different ensemble technique, like stacking? (See the sketch after this list.)
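On the second question, sklearn's StackingRegressor is one concrete way to combine bagged models through a different ensembling technique. A generic sketch on synthetic data (not from the lesson; the two base models here are arbitrary choices):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=20, noise=5.0, random_state=2)

# Two bagged models as base learners, with a simple linear meta-model
# (stacking) learning how to weight their predictions.
stack = StackingRegressor(
    estimators=[
        ("rf_a", RandomForestRegressor(n_estimators=100, random_state=2)),
        ("rf_b", RandomForestRegressor(n_estimators=100, max_features=0.5, random_state=3)),
    ],
    final_estimator=Ridge(),
)
print(cross_val_score(stack, X, y, cv=3).mean())
```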

Yes, later this year. Check out Walkthru 13 and the Discord discussion.

4 Likes

How does a random forest compare to bootstrapping?

“Statistical Modeling: The Two Cultures” by Leo Breiman as mentioned by Jeremy in the lecture.
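On the question above about how a random forest compares to bootstrapping: a random forest is bootstrap aggregating (bagging) of decision trees, plus a random subset of features considered at each split. A minimal hand-rolled sketch of the bagging part, on synthetic data (not from the lesson):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=3)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=3)

rng = np.random.default_rng(3)
preds = []
for _ in range(50):
    # Bootstrapping: resample the training rows with replacement...
    idx = rng.integers(0, len(X_train), len(X_train))
    tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
    preds.append(tree.predict(X_valid))

# ...and bagging averages the bootstrapped trees. A random forest adds one more
# trick: each split also considers only a random subset of the features.
print(mean_absolute_error(y_valid, np.mean(preds, axis=0)))
```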

6 Likes

Is there any relationship between random forests and having an equivalent number of weights and relu activations in a similarly deep neural network? Can random forests just be implemented with a sufficient DL network?

1 Like

On the overfitting aspect:

If we use a random forest to compute feature importance and pick the best columns in a dataset, and then also use a random forest for creating the model, would that lead to overfitting?
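One common way to reduce that risk, sketched below on synthetic data (not from the lesson): do the importance-based column selection using only the training split (or inside cross-validation), so the held-out data never influences which columns are kept.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1500, n_features=30, n_informative=5, random_state=4)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=4)

# Step 1: rank features using the training split only.
rf = RandomForestRegressor(n_estimators=100, random_state=4).fit(X_train, y_train)
keep = np.argsort(rf.feature_importances_)[-10:]   # indices of the top 10 columns

# Step 2: refit on the selected columns. The validation split never influenced
# the selection, so its error remains an honest estimate.
rf2 = RandomForestRegressor(n_estimators=100, random_state=4).fit(X_train[:, keep], y_train)
print(mean_absolute_error(y_valid, rf2.predict(X_valid[:, keep])))
```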

2 Likes

When you are working on tabular data, how do you go about trying different models like random forests, gradient boosting, neural networks, etc.? How is that decision made? Are there benchmarks in the tabular world, like the ones showing which image models are best?
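One common starting point is a quick cross-validated baseline sweep. A generic sklearn sketch on synthetic data (the model choices here are just examples, not a recommendation from the lesson):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=5)

# A quick baseline sweep; in practice you would also try a fastai tabular
# learner and gradient-boosting libraries such as XGBoost or LightGBM.
for name, model in [
    ("ridge", Ridge()),
    ("random forest", RandomForestRegressor(n_estimators=100, random_state=5)),
    ("hist gradient boosting", HistGradientBoostingRegressor(random_state=5)),
]:
    score = cross_val_score(model, X, y, cv=3).mean()
    print(f"{name:24s} mean R^2 = {score:.3f}")
```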

1 Like

Do you use AutoML frameworks to help improve your iterations in a more automated way, and if so, which AutoML frameworks or services do you recommend?

1 Like

If you’re using Linux or WSL, autosklearn is a good library to try out. As the name suggests, it is closely related to / based on sklearn, which you probably already have some familiarity with.
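For reference, a minimal auto-sklearn sketch along the lines of the project's own examples (Linux/WSL only; treat the argument names as an assumption, since they can change between versions):

```python
# Rough auto-sklearn example; argument names follow the project's docs but may
# differ between versions, so treat this as an assumption rather than gospel.
import autosklearn.classification
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,   # total search budget in seconds
    per_run_time_limit=30,         # cap on each candidate model
)
automl.fit(X_train, y_train)
print(accuracy_score(y_test, automl.predict(X_test)))
```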

1 Like

Do you create different Kaggle notebooks for the different models you try? So one Kaggle notebook for the first (base) model, and separate notebooks for subsequent models? Or do you put your subsequent models in the bottom part of the same (base model) notebook? Just wondering what your ideal approach is.

3 Likes

I see there’s a dedicated framework for PyTorch: GitHub - automl/Auto-PyTorch: Automatic architecture search and hyperparameter optimization for PyTorch

1 Like