Lesson 7 - Official topic

It really depends on the dataset; these days I’d say people try all of them.

1 Like

Is Rachel’s mic cutting out for anyone else?

15 Likes

You’d need those probabilities to be reflected in your training dataset for the numbers predicted on new data to make sense.

1 Like

Also try addressing the imbalance by upsampling, downsampling, or generating additional synthetic data, which I think would result in more gain than trying a bigger suite of models.
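Here’s a minimal sketch of upsampling the minority class with scikit-learn’s `resample` utility (the DataFrame and column names are hypothetical, just to show the pattern):

```python
# Minimal upsampling sketch; `df` and the "target" column are hypothetical.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(10),
                   "target":  [0]*8 + [1]*2})  # imbalanced toy data

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Resample the minority class with replacement up to the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)

balanced = pd.concat([majority, minority_up])
print(balanced["target"].value_counts())
```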

1 Like

Can you comment on real-time applications of random forests? In my experience they tend to be too slow for real-time (latency-bound) use cases, like a real recommender system. An NN is much faster when run on the right hardware.

The only other option I found that is good from the performance perspective is XGBoost or CatBoost (boosted decision trees).
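For anyone curious, here’s a minimal sketch of timing single-row inference with XGBoost (the data is random, purely to illustrate the API):

```python
# Minimal latency sketch; the data is random and purely illustrative.
import time
import numpy as np
from xgboost import XGBClassifier

X = np.random.rand(10_000, 20)
y = np.random.randint(0, 2, 10_000)

model = XGBClassifier(n_estimators=100, max_depth=6)
model.fit(X, y)

start = time.perf_counter()
model.predict(X[:1])  # single-row inference, the latency-bound case
print(f"one prediction took {time.perf_counter() - start:.4f} s")
```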

3 Likes

Note that Jeremy was once the president of Kaggle and at one point the top-ranked data scientist on Kaggle! :slight_smile:

4 Likes

Yes, but my training set only has outcomes for whether sales happened or not, basically yes/no. But I’d like to know if all conditions are very favorable for making this big sale. I guess my outcome variable will have to be yes or no for my test set as well.

I agree, try all of them. There’s an argument I recommend changing in random forests; I’m not sure if it’s there in XGBoost. Try setting the `class_weight` argument to `"balanced"` to deal with the class imbalance. That’s what I use. In addition, the F1 score is a better metric for evaluating an imbalanced problem.
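Here’s a minimal sketch of what that looks like with scikit-learn (the data here is random toy data):

```python
# Minimal sketch: class_weight="balanced" plus F1 evaluation; toy data only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X = np.random.rand(1_000, 5)
y = (np.random.rand(1_000) > 0.9).astype(int)  # ~10% positive class

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# "balanced" reweights classes inversely to their frequencies
model = RandomForestClassifier(class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

print("F1:", f1_score(y_test, model.predict(X_test)))
```

(For reference, XGBoost doesn’t have `class_weight`, but its `scale_pos_weight` parameter serves a similar purpose for binary problems.)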

3 Likes

Have you looked at the current Abstraction and Reasoning Challenge competition on Kaggle, which asks whether a computer can learn complex, abstract tasks from just a few examples? Can you share some thoughts on it?

3 Likes

There are GPU-optimized implementations of random forests as well; for example, see https://github.com/rapidsai/cuml
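cuML mirrors the scikit-learn API, so a GPU random forest looks something like this (a minimal sketch; check the repo for current usage):

```python
# Minimal cuML sketch (assumes a RAPIDS install and an NVIDIA GPU).
import numpy as np
from cuml.ensemble import RandomForestClassifier

X = np.random.rand(100_000, 20).astype(np.float32)  # cuML prefers float32
y = np.random.randint(0, 2, 100_000).astype(np.int32)

model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)           # training runs on the GPU
preds = model.predict(X)  # so does inference
```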

4 Likes

Regarding Kaggle: I’m trying to use fastai2 on TPUs (the PyTorch version for TPUs came out March 25) as part of Kaggle’s “Flower Classification with TPUs” competition, in case anyone wants to join me: https://www.kaggle.com/c/flower-classification-with-tpus/overview

Jeremy, I heard that you won every Kaggle competition for 5 years straight. Is this true? Do you have any favorite stories of Kaggle competitions you were involved in?

1 Like

fastai2 won’t work directly with TPUs at this point (even with the PyTorch TPU library). There is ongoing development for this, though.

1 Like

You’ll find a few answers here: https://youtu.be/205j37G1cxw :tea:

6 Likes

Here is fastai’s competition using GPUs:
https://forums.fast.ai/t/fastgarden-a-new-imagenette-like-competition-just-for-fun/65909

2 Likes

It seems this algorithm is only for categorical variables, correct?

If I understand correctly, decision trees also work with continuous (numeric) variables. Is this true? If so, how does that work?

1 Like

If we are splitting only on categorical variables, then what do we do with the continuous variables?

2 Likes

We are just talking about the cleaning for now.

We split on threshold values: less than or greater than some value of the continuous variable.
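To make that concrete, here’s a minimal sketch (with made-up numbers) of how a tree can pick a threshold for a continuous column: try each candidate value and keep the one whose split produces the most homogeneous groups:

```python
# Minimal sketch of threshold selection on a continuous column; toy data.
import numpy as np

x = np.array([3.0, 7.0, 1.5, 9.0, 4.2, 6.1])  # continuous feature
y = np.array([10., 25., 8., 30., 12., 22.])   # regression target

def split_score(x, y, thresh):
    left, right = y[x <= thresh], y[x > thresh]
    if len(left) == 0 or len(right) == 0:
        return np.inf
    # lower weighted std = more homogeneous groups = better split
    return (left.std() * len(left) + right.std() * len(right)) / len(y)

candidates = np.unique(x)
best = min(candidates, key=lambda t: split_score(x, y, t))
print(f"best split: x <= {best}")
```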

Oh, I was referring to the section describing “The basic steps to train a decision tree can be written down very easily:”

Did I miss something?