Oh sorry, see my other answer above.
Does fastai use any default data augmentation or create synthetic data for tabular datasets?
Do such techniques exist?
I don't know if such a technique exists, and there is nothing in fastai for this. Such a thing is probably domain-dependent.
@ilovescience I'm giving it a try and seeing what happens. If it doesn't work, I'll either join the GPU competition or try to get the best of both worlds, i.e., data augmentation with fastai2 and learning with tf
Ok, feel free to do so. I have worked on this for a couple of months though (first with fastai and now with fastai2), so I know it's not trivial. Just a fair warning
I would love to see how it goes for you though, and please do share your progress!
I believe every value in the range of that continuous variable is checked as a candidate split, and the best value is chosen based on the metric used. Generally in DTs the aim is to maximize the information gain, i.e. reduce the entropy (or impurity).
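A minimal sketch of that search for a regression target (the function name `best_split` is made up for illustration; the score here is the weighted standard deviation of the two sides, one common proxy for split quality in regression trees):

```python
import numpy as np

def best_split(col, y):
    """Try each observed value of a continuous column as a split point."""
    best_val, best_score = None, float('inf')
    for val in np.unique(col):
        lhs = col <= val
        if lhs.all():
            continue  # degenerate split: everything lands on one side
        # Weighted std of the two sides; lower means purer children.
        score = (y[lhs].std() * lhs.sum() + y[~lhs].std() * (~lhs).sum()) / len(y)
        if score < best_score:
            best_val, best_score = val, score
    return best_val, best_score

# Toy usage with random data
col = np.random.rand(100)
y = np.random.rand(100)
print(best_split(col, y))
```

Note the candidates come straight from the observed values, not from some fixed grid.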
@jeremy, do you have any thoughts on what data augmentation for tabular might look like?
This leads me to a follow-up question. Is there like a "resolution" for how this split value would be adjusted during training? Like adjusting the split from 0.1 to 0.15 vs 0.1 to 0.11?
Does fastai distinguish between ordered (example: "low", "medium", "high") and unordered categorical variables ("red", "green", "blue")?
There's some work on using GANs for generating tabular data: https://github.com/sdv-dev/CTGAN
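For reference, a minimal sketch of how CTGAN is used per its README (class names and arguments have shifted across versions, so treat this as illustrative; the dataset and column names below are made up):

```python
import pandas as pd
from ctgan import CTGAN  # older releases expose CTGANSynthesizer instead

real_data = pd.read_csv('train.csv')           # hypothetical dataset
discrete_columns = ['color', 'product_size']   # hypothetical column names

# Fit the GAN on the real table, then sample synthetic rows from it
ctgan = CTGAN(epochs=300)
ctgan.fit(real_data, discrete_columns)
synthetic_data = ctgan.sample(1000)
```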
Is there a different channel one can use to chat about package issues? I ran the !pip install kaggle command as explained by Jeremy, then ran the code and got this message: Error: Missing username in configuration
… also, one needs to run !pip install dtreeviz to be able to see the decision tree visuals.
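For anyone looking for it, the course notebook calls it roughly like this (using the dtreeviz 1.x API of the time; m, xs, y, and dep_var are placeholders for your fitted tree, features, and target name):

```python
from dtreeviz.trees import dtreeviz

# m: a fitted sklearn DecisionTreeRegressor; xs, y: features and target;
# dep_var: the name of the target column (all placeholders here)
viz = dtreeviz(m, xs, y, xs.columns, dep_var,
               fontname='DejaVu Sans', scale=1.6,
               label_fontsize=10, orientation='LR')
viz
```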
What's the benefit of generating new tabular data? Wouldn't it just be like copying your dataset if enough data was generated?
So for a decision tree regressor we use MSE as the metric for the splits. Does fastai use entropy and information gain for a decision tree classifier, or is something else used?
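Since fastai hands this off to sklearn (see below), the criterion is just a constructor argument; a quick illustration (the regression criterion is named 'mse' in sklearn versions before 1.0):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classifier: Gini impurity by default; pass 'entropy' for information gain.
clf = DecisionTreeClassifier(criterion='gini')

# Regressor: squared-error (MSE) reduction by default.
reg = DecisionTreeRegressor(criterion='squared_error')
```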
Is there a way to determine the optimal number of leaves, or to control when leaves are split?
fastai uses the sklearn implementations for decision tree and random forest models.
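So the stopping and regularization knobs are sklearn's. A minimal sketch of the usual ones, on stand-in data, which also speaks to the weight-decay question below: capping leaves and subsampling rows/columns play that regularizing role for trees:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

xs = np.random.rand(1000, 8)  # stand-in features
y = np.random.rand(1000)      # stand-in target

# Single tree: max_leaf_nodes / min_samples_leaf cap how far leaves keep splitting.
tree = DecisionTreeRegressor(max_leaf_nodes=25, min_samples_leaf=5)
tree.fit(xs, y)

# Random forest: averaging many trees trained on row/column subsamples
# is the main defense against overfitting, loosely analogous to
# weight decay in neural nets.
rf = RandomForestRegressor(n_estimators=40, min_samples_leaf=5,
                           max_features=0.5, max_samples=0.8, n_jobs=-1)
rf.fit(xs, y)
```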
Think of the generated data as additional data points that are slightly perturbed from the real ones, but realistic enough. You have a point: sometimes it's just a copy and doesn't provide much benefit, though I imagine if you run enough generations you may get something that's different enough. This is beneficial when data collection is expensive (e.g. medical / health data). Some fields also have privacy / compliance issues that prevent developers and data scientists from accessing the real dataset; if a fake dataset yields the same performance, it is easier to work with and can be used for testing purposes.
Do random forests / gradient-boosted trees have analogous ways to stop overfitting, like weight decay in NNs?
It's not clear how a categorical variable like ProductSize can be ordered with "#NA#" - where do we place it in the order of categories?
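In the notebook, the ordering is supplied via pandas before fastai ever sees the column, roughly like this (the size labels are from the bulldozers example; df stands in for your training DataFrame). If I read the source right, fastai's Categorify then reserves a separate #na# category at index 0 for missing values, so #NA# sits outside the user-specified order rather than within it:

```python
# Tell pandas the explicit ordering of the categories
sizes = 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact'
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
```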
Did you do the following too?
You need an API key to use the Kaggle API; to get one, go to "my account" on the Kaggle website, and click "create new API token".
This will save a file called kaggle.json to your PC.
We need to create this on your GPU server.
To do so, open the file you downloaded, copy the contents, and paste them inside '' below, e.g.: creds = '{"username":"xxx","key":"xxx"}'
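The notebook cell it refers to looks roughly like this (paste your own values into creds):

```python
creds = '{"username":"xxx","key":"xxx"}'  # paste the contents of kaggle.json here

from pathlib import Path

# Write the credentials where the Kaggle API expects to find them
cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)  # Kaggle requires the file not be world-readable
```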
This tells you how to set your username (under API credentials):