Lesson 7 - Official topic

Oh sorry, see my other answer above.

1 Like

Does fastai use any default data augmentation or create synthetic data for tabular datasets?
Do such techniques exist?

1 Like

I don’t know if such a technique exists, and there is nothing in fastai for this. Such a thing is probably domain-dependent.

1 Like

@ilovescience I’m giving it a try and seeing what happens. If it doesn’t work, I’ll either join the GPU competition or try to get the best of both worlds, i.e., data augmentation with fastai2 and learning with tf

OK, feel free to do so. I have worked on this for a couple of months though (first with fastai and now with fastai2), so I know it’s not trivial. Just fair warning :slight_smile:
I would love to see how it goes for you though, and please do share your progress! :slight_smile:

2 Likes

I believe every candidate value in the range of that continuous variable is checked, and the best split value is chosen based on the metric used. Generally, DTs look to maximize the information gain, i.e. reduce the entropy.

1 Like
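To make that concrete, here’s a minimal sketch of the idea (not fastai’s or sklearn’s actual code): try a threshold between each pair of adjacent sorted values and keep the one that minimizes the weighted child entropy, which is the same as maximizing information gain.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def best_split(xs, ys):
    """Try a threshold between each pair of adjacent sorted values
    and return the one minimizing the weighted child entropy."""
    best_thresh, best_score = None, float("inf")
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thresh = (lo + hi) / 2
        left  = [y for x, y in zip(xs, ys) if x <= thresh]
        right = [y for x, y in zip(xs, ys) if x >  thresh]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(ys)
        if score < best_score:
            best_thresh, best_score = thresh, score
    return best_thresh, best_score

# A perfectly separable toy example: the best threshold falls
# between 3 and 10, giving zero entropy on both sides.
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```

Note this only needs to consider thresholds between adjacent observed values, not every real number in the range.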

@jeremy, do you have any thoughts on what data augmentation for tabular might look like?

7 Likes

This leads me to a follow-up question. Is there like a “resolution” for how this split value would be adjusted during training? Like adjusting the split from 0.1 to 0.15 vs 0.1 to 0.11?

Does fastai distinguish between ordered (example: “low”, “medium”, “high”) and unordered categorical variables (“red”, “green”, “blue”)?

4 Likes
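For context: pandas itself makes this distinction via ordered categoricals, and fastai’s tabular preprocessing keeps the pandas category ordering, so one approach is to declare the order up front. A minimal sketch (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "size":  ["low", "high", "medium", "low"],
    "color": ["red", "green", "blue", "red"],
})

# Unordered categorical: codes are assigned, but no order is implied.
df["color"] = df["color"].astype("category")

# Ordered categorical: declare the order explicitly, so codes and
# comparisons respect low < medium < high.
df["size"] = pd.Categorical(df["size"],
                            categories=["low", "medium", "high"],
                            ordered=True)

print(df["size"].cat.codes.tolist())  # codes follow the declared order
```

This is the same trick the book uses for ProductSize before running the tabular pipeline.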

There’s some work on using GANs for generating tabular data: https://github.com/sdv-dev/CTGAN

7 Likes

Is there a different channel one can use to chat about package issues? I ran the !pip install kaggle command as explained by Jeremy, then ran the code and got this message: Error: Missing username in configuration. Also, one needs to run !pip install dtreeviz to be able to see the decision tree visuals.

What’s the benefit of generating new tabular data? Wouldn’t it just be like copying your dataset if enough data was generated?

So for a decision tree regressor we use something like MSE as the metric for the splits. Does fastai use entropy and information gain for a decision tree classifier, or is something else used?

Is there a way to determine the optimal number of leaves, or to control when leaves are split?

2 Likes

fastai uses the sklearn implementations for its decision tree and random forest models

1 Like
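So everything in the sklearn docs applies directly. A minimal sketch of fitting sklearn’s tree on already-numericalised data (toy values, not the Bluebook dataset from the lesson):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Two obvious clusters; the tree should split between 3 and 10.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[2.5]]))  # predicts the low-cluster mean
```

In the notebook, the `X`/`y` here would be the output of fastai’s TabularPandas preprocessing.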

Think of the generated data as additional data points that are slightly perturbed from the real ones, but still realistic. You have a point: sometimes it’s just a copy and doesn’t provide much benefit, though if you run enough generations you may get something that’s different enough. This is beneficial when data collection is expensive (e.g. medical/health data). Some fields also have privacy/compliance issues that prevent developers and data scientists from accessing the real dataset; a fake dataset that yields the same performance is easier to work with and could be used for testing purposes.

2 Likes
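The “slightly perturbed” idea above can be sketched with simple Gaussian jitter on the continuous columns. This is a naive stand-in for a real generative model like CTGAN, and the noise scale is an arbitrary choice:

```python
import numpy as np

def jitter_augment(X, n_copies=2, scale=0.01, seed=0):
    """Return X plus n_copies noisy duplicates; the noise std is
    `scale` times each column's std, so perturbations stay small
    relative to that column's natural spread."""
    rng = np.random.default_rng(seed)
    col_std = X.std(axis=0)
    copies = [X + rng.normal(0.0, scale * col_std, size=X.shape)
              for _ in range(n_copies)]
    return np.vstack([X] + copies)

X = np.array([[1.0, 100.0], [2.0, 110.0], [3.0, 120.0]])
X_aug = jitter_augment(X)
print(X_aug.shape)  # original 3 rows plus 2 noisy copies -> 9 rows
```

As discussed above, whether this actually helps is domain-dependent; a GAN-based approach can capture correlations between columns that per-column jitter cannot.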

Do random forests / GBTs have analogous ways to prevent overfitting, like weight decay in NNs?

2 Likes
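Not weight decay exactly, but tree ensembles have their own regularisers: limits on tree depth and leaf size, column subsampling for forests, and shrinkage plus row subsampling for gradient boosting. A sketch of the relevant sklearn knobs (the values are illustrative, not recommendations):

```python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Random forest: limit tree complexity and decorrelate the trees.
rf = RandomForestRegressor(
    n_estimators=100,
    min_samples_leaf=5,   # bigger leaves -> smoother, less overfit trees
    max_features=0.5,     # each split considers half the columns
)

# Gradient boosting: shrinkage and row subsampling act as regularisation.
gbt = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.05,   # smaller steps, so more trees needed
    subsample=0.8,        # stochastic gradient boosting
    max_depth=3,          # shallow trees are standard for boosting
)
```

For random forests specifically, adding more trees mostly doesn’t overfit; it’s the per-tree settings above that control complexity.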

It’s not clear how a categorical variable like ProductSize can be ordered with “#NA#” - where do we place it in the order of categories?

1 Like

Did you do the following too?

You need an API key to use the Kaggle API; to get one, go to "my account" on the Kaggle website, and click "create new API token". 
This will save a file called kaggle.json to your PC. 
We need to create this on your GPU server. 
To do so, open the file you downloaded, copy the contents, and paste them inside '' below, e.g.: creds = '{"username":"xxx","key":"xxx"}'

1 Like
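After pasting the creds string, the notebook then writes it to ~/.kaggle/kaggle.json and restricts the file permissions, since the Kaggle CLI requires that. A minimal sketch of that step (the function name is made up; the creds value is a placeholder you paste yourself):

```python
from pathlib import Path

def write_kaggle_creds(creds, cred_path=Path("~/.kaggle/kaggle.json")):
    """Write the Kaggle API token to disk and make it readable
    only by the current user, as the Kaggle CLI expects."""
    cred_path = cred_path.expanduser()
    cred_path.parent.mkdir(parents=True, exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
    return cred_path

# creds = '{"username":"xxx","key":"xxx"}'  # pasted from your kaggle.json
# write_kaggle_creds(creds)
```

Once that file exists, the Missing username in configuration error mentioned above should go away.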

This tells you how to set your username (under API credentials):

1 Like