Request to Support Various Sampling Strategies


(Gokkul Nath T S) #1

Hi Everyone,

I have been working with imbalanced datasets lately and found it a bit challenging to incorporate SMOTE with the fastai library (using from_csv). Hence I thought it would be great if we could include support for various sampling methods, both under-sampling and over-sampling, such as ADASYN and SMOTE. This feature would let us deal with imbalanced datasets effectively and make it easier for users to focus on building models rather than on setting up the data.

Proposed approach:
For indexed datasets (e.g. tabular data, CSV-based datasets):
Use the imbalanced-learn library: based on the method passed (e.g. SMOTE), we apply the corresponding technique and return the indices of the sampled data.

We can inherit from the torch.utils.data.Sampler base class and override __iter__ to return an iterator over those indices and __len__ to return their count. These samplers can then be passed to the existing dataloaders.
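
To make the idea concrete, here is a minimal sketch of such a sampler. It is not fastai or imbalanced-learn code: the class name OversampledIndexSampler and its labels/seed arguments are made up for illustration, and it uses plain random over-sampling of indices. SMOTE and ADASYN synthesize new rows rather than picking existing ones, so they would not map onto an index-only sampler like this one.

```python
import random
from collections import defaultdict

try:
    from torch.utils.data import Sampler
except ImportError:  # lets the sketch run without PyTorch installed
    Sampler = object


class OversampledIndexSampler(Sampler):
    """Hypothetical sampler: repeats minority-class indices until
    every class has as many entries as the largest class."""

    def __init__(self, labels, seed=0):
        rng = random.Random(seed)

        # Group dataset indices by their class label.
        by_class = defaultdict(list)
        for idx, label in enumerate(labels):
            by_class[label].append(idx)

        # Every class is topped up to the size of the largest one.
        target = max(len(idxs) for idxs in by_class.values())

        self.indices = []
        for idxs in by_class.values():
            self.indices.extend(idxs)
            # Draw the missing entries with replacement from the same class.
            self.indices.extend(rng.choices(idxs, k=target - len(idxs)))

        rng.shuffle(self.indices)

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)


# Toy example: 8 rows of class 0 and 2 rows of class 1.
labels = [0] * 8 + [1] * 2
sampler = OversampledIndexSampler(labels)
```

With PyTorch installed, this could be passed as DataLoader(dataset, batch_size=..., sampler=sampler). Note that DataLoader does not allow shuffle=True together with a custom sampler, which is why the shuffling happens inside the sampler here.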

I am not sure whether my approach will work, but I would like to help with development if everyone finds this feature valuable to have. We also need to come up with an approach for folder-based datasets (i.e. similar to lesson 1, dogs and cats). Please provide your feedback and comments. Thanks. :slight_smile:

~Gokkul
P.S.: I was not sure where to post the request, so I created a new thread.


(Jeremy Howard) #2

That’s a good idea. I believe that it’s been shown recently that over-sampling is always better than under-sampling, BTW.

This won’t be in v1, but we’ll look to add it in a future version.

Have you seen the approach we already have? Are you looking for something else?


(Gokkul Nath T S) #3

I haven’t looked into it yet… I believe it performs over-sampling by default. I will check it out. Thanks for considering the feature request.