Imbalanced dataset augmentation/oversampling

ilovescience · February 1, 2019, 6:52am

I have been trying some of the basic techniques from lesson 1 on some medical image datasets, but I noticed that a lot of them are imbalanced. Does the fastai library have something for dealing with this, or does it already handle it when loading into ImageDataBunch? If so, what kind of method does it use: oversampling or undersampling?

yeldarb · February 14, 2019, 6:07pm

I’m looking to do this too. I haven’t found a built-in option in fastai yet for oversampling classes. (I saw a pull request, linked from a related thread, that lets you give a file multiple times when loading from CSV, but I’m creating my ImageDataBunch using from_folder.)

In the past I duplicated the files on disk to balance the classes, but if I can’t find a built-in way to do it I may try adding it to the library. If anyone has suggestions for how to go about that (e.g. is there a callback that’d be handy?), that’d be helpful, since I’m not yet at all familiar with fastai.
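For what it's worth, the same balancing can probably be done in memory by repeating entries in the file list instead of copying files on disk. Below is a minimal sketch in plain Python, not a fastai API: the function name `oversample_paths`, the `(path, label)` pair format, and the example file names are all assumptions made up for illustration.

```python
import random
from collections import defaultdict

def oversample_paths(items, seed=0):
    """Repeat minority-class items until every class matches the largest one.

    `items` is a list of (path, label) pairs; nothing is copied on disk.
    (Hypothetical helper, not part of fastai.)
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in items:
        by_label[label].append(path)

    # Size of the largest class; every other class is topped up to this.
    target = max(len(paths) for paths in by_label.values())

    balanced = []
    for label, paths in by_label.items():
        balanced.extend((p, label) for p in paths)
        # Draw the shortfall with replacement from the same class.
        extra = rng.choices(paths, k=target - len(paths))
        balanced.extend((p, label) for p in extra)

    rng.shuffle(balanced)
    return balanced

# Made-up file list: 6 "benign" images and 2 "malignant" images.
items = [(f"benign/{i}.png", "benign") for i in range(6)]
items += [(f"malignant/{i}.png", "malignant") for i in range(2)]

balanced = oversample_paths(items)
```

The resulting list of pairs could then be written to a CSV or dataframe and loaded from there, since the same path is allowed to appear more than once.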

ilovescience · February 15, 2019, 5:55am

I did it with a pandas dataframe and its sample function, but I am worried that when I split it, the validation set will contain images that are also in the training set, so one cannot tell whether the model has overfit or not…
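One way to avoid that kind of leak is to split first and oversample only the training part, so that duplicated images can never land in the validation set. Here is a rough sketch in plain Python rather than pandas; the helper names `stratified_split` and `oversample_train`, the 20% validation fraction, and the file names are assumptions for illustration, and it assumes every class has at least two files.

```python
import random
from collections import defaultdict

def group_by_label(items):
    """Map each label to the list of paths that carry it."""
    by_label = defaultdict(list)
    for path, label in items:
        by_label[label].append(path)
    return by_label

def stratified_split(items, valid_frac=0.2, seed=0):
    """Split (path, label) pairs per class so both sets see every class."""
    rng = random.Random(seed)
    train, valid = [], []
    for label, paths in group_by_label(items).items():
        paths = paths[:]          # copy so the caller's list is untouched
        rng.shuffle(paths)
        n_valid = max(1, round(len(paths) * valid_frac))
        valid.extend((p, label) for p in paths[:n_valid])
        train.extend((p, label) for p in paths[n_valid:])
    return train, valid

def oversample_train(train, seed=0):
    """Top up each class in the training set to the size of the largest."""
    rng = random.Random(seed)
    by_label = group_by_label(train)
    target = max(len(paths) for paths in by_label.values())
    out = []
    for label, paths in by_label.items():
        # Original paths plus random repeats drawn from the same class.
        repeats = rng.choices(paths, k=target - len(paths))
        out.extend((p, label) for p in paths + repeats)
    rng.shuffle(out)
    return out

# Made-up file list: 20 "benign" images and 5 "malignant" images.
items = [(f"benign/{i}.png", "benign") for i in range(20)]
items += [(f"malignant/{i}.png", "malignant") for i in range(5)]

# Split first, then oversample the training set only.
train, valid = stratified_split(items)
train_balanced = oversample_train(train)
```

The validation set keeps its original class imbalance here, which is usually what you want if the metrics should reflect the real data distribution.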