Lesson 7 - Official topic

Oh sorry, see my other answer above.

1 Like

Does fastai use any default data augmentation or create synthetic data for tabular datasets?
Do such techniques exist?

1 Like

I don’t know if such a technique exists, and there is nothing in fastai for this. Such a thing is probably domain-dependent.

1 Like

@ilovescience I’m giving it a try and seeing what happens. If it doesn’t work, I’ll either join the GPU competition or try to get the best of both worlds, i.e., data augmentation with fastai2 and learning with tf

OK, feel free to do so. I have worked on this for a couple of months though (first with fastai and now with fastai2), so I know it’s not trivial. Just fair warning :slight_smile:
I would love to see how it goes for you though, and please do share your progress! :slight_smile:

2 Likes

I believe every candidate value in the range of that continuous variable is checked, and the best split value is chosen based on the metric used. Generally, DTs look to maximize the information gain, i.e. reduce the entropy.

1 Like
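To make that concrete, here’s a minimal sketch of the idea (not fastai’s or sklearn’s actual code): try a threshold between each pair of adjacent sorted values and keep the one that minimizes the weighted child entropy, which is the same as maximizing information gain.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def best_split(xs, ys):
    """Try a threshold between each pair of adjacent sorted values
    and return the one minimizing the weighted child entropy."""
    best_thresh, best_score = None, float("inf")
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thresh = (lo + hi) / 2
        left  = [y for x, y in zip(xs, ys) if x <= thresh]
        right = [y for x, y in zip(xs, ys) if x >  thresh]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(ys)
        if score < best_score:
            best_thresh, best_score = thresh, score
    return best_thresh, best_score

# A perfectly separable toy example: the best threshold falls
# between 3 and 10, giving zero entropy on both sides.
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```

Note this only needs to consider thresholds between adjacent observed values, not every real number in the range.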

@jeremy, do you have any thoughts on what data augmentation for tabular might look like?

7 Likes

This leads me to a follow-up question. Is there like a “resolution” for how this split value would be adjusted during training? Like adjusting the split from 0.1 to 0.15 vs 0.1 to 0.11?

Does fastai distinguish between ordered (example: “low”, “medium”, “high”) and unordered categorical variables (“red”, “green”, “blue”)?

4 Likes
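For context: pandas itself makes this distinction via ordered categoricals, and fastai’s tabular preprocessing keeps the pandas category ordering, so one approach is to declare the order up front. A minimal sketch (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "size":  ["low", "high", "medium", "low"],
    "color": ["red", "green", "blue", "red"],
})

# Unordered categorical: codes are assigned, but no order is implied.
df["color"] = df["color"].astype("category")

# Ordered categorical: declare the order explicitly, so codes and
# comparisons respect low < medium < high.
df["size"] = pd.Categorical(df["size"],
                            categories=["low", "medium", "high"],
                            ordered=True)

print(df["size"].cat.codes.tolist())  # codes follow the declared order
```

This is the same trick the book uses for ProductSize before running the tabular pipeline.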

There’s some work on using GANs for generating tabular data: https://github.com/sdv-dev/CTGAN

7 Likes

Is there a different channel one can use to chat about package issues? I ran the !pip install kaggle command as explained by Jeremy, then ran the code and got this message: Error: Missing username in configuration. Also, one needs to run !pip install dtreeviz to be able to see the decision tree visuals.

What’s the benefit of generating new tabular data? Wouldn’t it just be like copying your dataset if enough data was generated?

So for a decision tree regressor we use something like MSE as the metric for the splits. Does fastai use entropy and information gain for a decision tree classifier, or is something else used?

Is there a way to determine the optimal number of leaves, or to control when leaves are split?

2 Likes

fastai uses the sklearn implementations for its decision tree and random forest models

1 Like
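So everything in the sklearn docs applies directly. A minimal sketch of fitting sklearn’s tree on already-numericalised data (toy values, not the Bluebook dataset from the lesson):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Two obvious clusters; the tree should split between 3 and 10.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[2.5]]))  # predicts the low-cluster mean
```

In the notebook, the `X`/`y` here would be the output of fastai’s TabularPandas preprocessing.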

Think of the generated data as additional data points that are slightly perturbed from the real ones, but still realistic. You have a point: sometimes it’s just a copy and doesn’t provide much benefit, though if you run enough generations you may get something that’s different enough. This is beneficial when data collection is expensive (e.g. medical/health data). Some fields also have privacy/compliance issues that prevent developers and data scientists from accessing the real dataset; a fake dataset that yields the same performance is easier to work with and could be used for testing purposes.

2 Likes
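The “slightly perturbed” idea above can be sketched with simple Gaussian jitter on the continuous columns. This is a naive stand-in for a real generative model like CTGAN, and the noise scale is an arbitrary choice:

```python
import numpy as np

def jitter_augment(X, n_copies=2, scale=0.01, seed=0):
    """Return X plus n_copies noisy duplicates; the noise std is
    `scale` times each column's std, so perturbations stay small
    relative to that column's natural spread."""
    rng = np.random.default_rng(seed)
    col_std = X.std(axis=0)
    copies = [X + rng.normal(0.0, scale * col_std, size=X.shape)
              for _ in range(n_copies)]
    return np.vstack([X] + copies)

X = np.array([[1.0, 100.0], [2.0, 110.0], [3.0, 120.0]])
X_aug = jitter_augment(X)
print(X_aug.shape)  # original 3 rows plus 2 noisy copies -> 9 rows
```

As discussed above, whether this actually helps is domain-dependent; a GAN-based approach can capture correlations between columns that per-column jitter cannot.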

Do random forests / GBTs have analogous ways to prevent overfitting, like weight decay in NNs?

2 Likes
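Not weight decay exactly, but tree ensembles have their own regularisers: limits on tree depth and leaf size, column subsampling for forests, and shrinkage plus row subsampling for gradient boosting. A sketch of the relevant sklearn knobs (the values are illustrative, not recommendations):

```python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Random forest: limit tree complexity and decorrelate the trees.
rf = RandomForestRegressor(
    n_estimators=100,
    min_samples_leaf=5,   # bigger leaves -> smoother, less overfit trees
    max_features=0.5,     # each split considers half the columns
)

# Gradient boosting: shrinkage and row subsampling act as regularisation.
gbt = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.05,   # smaller steps, so more trees needed
    subsample=0.8,        # stochastic gradient boosting
    max_depth=3,          # shallow trees are standard for boosting
)
```

For random forests specifically, adding more trees mostly doesn’t overfit; it’s the per-tree settings above that control complexity.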

It’s not clear how a categorical variable like ProductSize can be ordered with “#NA#” - where do we place it in the order of categories?

1 Like

Did you do the following too?

You need an API key to use the Kaggle API; to get one, go to "my account" on the Kaggle website, and click "create new API token". 
This will save a file called kaggle.json to your PC. 
We need to create this on your GPU server. 
To do so, open the file you downloaded, copy the contents, and paste them inside '' below, e.g.: creds = '{"username":"xxx","key":"xxx"}'

1 Like
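After pasting the creds string, the notebook then writes it to ~/.kaggle/kaggle.json and restricts the file permissions, since the Kaggle CLI requires that. A minimal sketch of that step (the function name is made up; the creds value is a placeholder you paste yourself):

```python
from pathlib import Path

def write_kaggle_creds(creds, cred_path=Path("~/.kaggle/kaggle.json")):
    """Write the Kaggle API token to disk and make it readable
    only by the current user, as the Kaggle CLI expects."""
    cred_path = cred_path.expanduser()
    cred_path.parent.mkdir(parents=True, exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
    return cred_path

# creds = '{"username":"xxx","key":"xxx"}'  # pasted from your kaggle.json
# write_kaggle_creds(creds)
```

Once that file exists, the Missing username in configuration error mentioned above should go away.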

This tells you how to set your username (under API credentials):

1 Like