Is there any built-in method to oversample minority classes?

Hi,
I’ve been working on an image dataset with 20 classes where only one class makes up 80% of the total data. As far as I understand, I need to oversample the minority classes.
Reading up on the topic, it seems that for tabular data it is as easy as duplicating existing rows.
But how do I oversample images?
And is there a built-in method I can use to oversample the classes of my choosing?

Thanks in advance.

You can just copy/paste the images to duplicate them in the particular class you want. Remember to only do this in the training set, though.
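If your training images live in one folder per class, the copy/paste can be scripted. A minimal sketch, assuming that folder layout (the function name and arguments are illustrative, not a fastai API):

```python
import shutil
from pathlib import Path

def oversample_class(class_dir, target_count):
    """Duplicate image files in class_dir (one class's training folder)
    until it holds target_count files. Copies get a suffix so filenames
    stay unique; does nothing if the folder already has enough files."""
    class_dir = Path(class_dir)
    originals = sorted(p for p in class_dir.iterdir() if p.is_file())
    n = len(originals)
    if n == 0:
        return
    for i in range(max(0, target_count - n)):
        src = originals[i % n]           # cycle through the originals
        dst = class_dir / f"{src.stem}_copy{i}{src.suffix}"
        shutil.copy(src, dst)
```

Run it only on minority-class folders inside the training split, never on the validation or test folders.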


Do you also have an idea how to resolve that issue in a segmentation task? There, it is not possible to just duplicate minority classes.

Hello again Zachary.
I’m actually using a Kaggle dataset. The dataset itself is 49GB in a read-only directory, and Kaggle kernels give me only 4.9GB of space to work with.
Do you know any ways to work around that?

Use a custom sampler that oversamples. You can pass any PyTorch sampler to the DataBunch and it will work.
Here is an ImbalancedSampler I wrote some time ago:
ImbalancedSampler
You only have to do this:

train_ds, val_ds = data.train, data.valid
sampler = ImbalancedDatasetSampler(train_ds, num_samples=sample)
train_dl = DataLoader(train_ds, bs, sampler=sampler, num_workers=12)
val_dl = DataLoader(val_ds, 2*bs, False, num_workers=8)
db = ImageDataBunch(train_dl=train_dl, valid_dl=val_dl).normalize(stats)
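For reference, the core idea of such a sampler is to weight each sample by the inverse of its class frequency and draw indices with replacement. A torch-free sketch of that logic (the class name is illustrative and this is not the linked implementation; to plug it into a DataLoader you would put the same logic in a `torch.utils.data.Sampler` subclass):

```python
import random
from collections import Counter

class SimpleImbalancedSampler:
    """Sketch of an inverse-frequency sampler for single-label datasets.
    labels: one label per sample. Yields dataset indices, with rare
    classes drawn more often so each batch is roughly balanced."""
    def __init__(self, labels, num_samples=None, seed=None):
        self.labels = labels
        self.num_samples = num_samples or len(labels)
        counts = Counter(labels)
        # each sample's weight is 1 / (size of its class)
        self.weights = [1.0 / counts[l] for l in labels]
        self.rng = random.Random(seed)

    def __iter__(self):
        # sample with replacement, weighted by class rarity
        idxs = self.rng.choices(range(len(self.labels)),
                                weights=self.weights, k=self.num_samples)
        return iter(idxs)

    def __len__(self):
        return self.num_samples
```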

Thank you so much @tcapelle .
I’ll try to integrate it into my code.

Good question. I haven’t worked much in Kaggle Kernels, but I know in Colab I solve this issue by mounting my Google Drive to the working directory. Perhaps that can be done here? Or Dropbox? Not 100% certain on that.

I couldn’t find anything on the Kaggle forums, so I created a thread there. If there’s an update, I’ll post it here as well.

Colab is a good idea, but to me the Kaggle GPU seemed faster than Colab’s. But hey, I’m broke; I’ll take what I can get :laughing:

@tcapelle I’m getting this error. Could you point out what I’m doing wrong?

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-59-03316aa7729a> in <module>()
----> 1 sampler = ImbalancedDatasetSampler(learner.data.train_ds,num_samples=20)

<ipython-input-28-22c469eed6da> in __init__(self, dataset, indices, num_samples)
 14         for idx in self.indices:
 15             label = self._get_label(dataset, idx)
---> 16             for l in label:
 17                 if l in label_to_count:
 18                     label_to_count[l] += 1

TypeError: 'int' object is not iterable

It is not working because my original sampler is aimed at multilabel classification (where you can get more than one class per image). I have modified the sampler to work with standard single-label classification.
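Concretely, the crash happens because a single-label dataset returns a bare int where the multilabel code expects an iterable of classes. A small helper along these lines (the names are hypothetical) makes the counting loop from the traceback handle both cases:

```python
from collections import Counter

def iter_labels(label):
    """Normalize one sample's label to an iterable. Multilabel samples
    already come as a list/tuple of classes; single-label samples come
    as a bare int, which is what raised "'int' object is not iterable"."""
    if isinstance(label, (list, tuple, set)):
        return label
    return [label]

def count_classes(labels):
    """The counting loop from the traceback, made label-type agnostic."""
    label_to_count = Counter()
    for label in labels:
        for l in iter_labels(label):
            label_to_count[l] += 1
    return label_to_count
```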

Here you go, a working example:
https://colab.research.google.com/drive/1hU181nhBvYTeknHDqbmCyg5IdSaeCVm2

You probably want to modify how the weights are computed in the sampler.
Here they are inversely proportional to class frequency; maybe try something lighter, like sqrt or log.
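A sketch of those alternatives (the scheme names are illustrative): inverse weighting fully balances the classes, while sqrt and log soften the correction so the minority is oversampled less aggressively:

```python
import math
from collections import Counter

def class_weights(labels, scheme="inverse"):
    """Per-sample sampling weights under different softening schemes.
    'inverse' fully balances classes; 'sqrt' and 'log' are gentler."""
    counts = Counter(labels)
    if scheme == "inverse":
        w = {c: 1.0 / n for c, n in counts.items()}
    elif scheme == "sqrt":
        w = {c: 1.0 / math.sqrt(n) for c, n in counts.items()}
    elif scheme == "log":
        # +e so a class of size 1 still gets a finite, positive weight
        w = {c: 1.0 / math.log(n + math.e) for c, n in counts.items()}
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return [w[l] for l in labels]
```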

Also, look at this post from @fchollet

I also implemented the example from @fchollet here:
https://colab.research.google.com/drive/1-MJJU8QBh_WtRyWTz6570WHRQaeT4nxS


Thank you so much for your help @tcapelle
This has been a great learning experience.

Going to tag in with something I want to be sure of: when we oversample, it’s okay to balance out all of our classes in the training set, as long as we leave the test set alone, correct?

I would say yes.
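In other words: split first, then oversample only the training portion, leaving validation/test untouched. A minimal sketch of that order of operations, with illustrative names:

```python
import random
from collections import Counter

def split_then_oversample(samples, labels, pct=0.2, seed=0):
    """Randomly hold out pct of the indices for validation, then
    duplicate minority-class *training* indices until every class
    matches the majority count. Validation indices are returned as-is."""
    rng = random.Random(seed)
    idxs = list(range(len(samples)))
    rng.shuffle(idxs)
    n_val = int(len(idxs) * pct)
    val_idx, train_idx = idxs[:n_val], idxs[n_val:]
    counts = Counter(labels[i] for i in train_idx)
    target = max(counts.values())
    balanced = list(train_idx)
    for cls, n in counts.items():
        cls_idx = [i for i in train_idx if labels[i] == cls]
        # pad this class up to the majority count with random repeats
        balanced += [rng.choice(cls_idx) for _ in range(target - n)]
    return balanced, val_idx
```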

This is just what I have been looking for! Thank you @tcapelle for posting the code.

When I run the pets notebook though, it takes much longer to train. It starts with an error of 0.70 and then slowly decreases; after 10 epochs it is still at 0.34, compared to the original notebook, which starts off with an error of 0.10 and quickly gets to 0.07 after 3 epochs. I set the LR based on lr_find, so I’m not sure why it is slower. Did you find this too, or am I doing something wrong?

Thanks

It is probably because you are sampling with replacement on an already balanced dataset.
If you find an answer, I am interested.
Also, there is now a built-in callback for sampling in fastai.

learn = cnn_learner(db, models.resnet34, metrics=error_rate, callback_fns=[OverSamplingCallback])

I made another example here using the integrated callback.


Yes, I too discovered that fastai has now integrated the OverSamplingCallback, and I am using it instead. Here is a thread with more information and example usage:

train_ds, val_ds = data.train, data.valid
sampler = ImbalancedDatasetSampler(train_ds, num_samples=sample)
train_dl = DataLoader(train_ds, bs, sampler=sampler, num_workers=12)
val_dl = DataLoader(val_ds, 2*bs, False, num_workers=8)
db = ImageDataBunch(train_dl=train_dl, valid_dl=val_dl).normalize(stats)

How would you add augmentation transforms in this example? Apparently this can only be done from one of the from_* factory methods, not the ImageDataBunch constructor unless I’m missing something (which I probably am).

To be able to use the fastai built-in transforms you will need to create an ImageList. Recently I have been replacing all transforms with mixup; you could try that. You only need to append .mixup() to your learner before training.

learn = cnn_learner(db, models.resnet34, metrics=error_rate).mixup()

Thanks, I’ve been able to apply transforms by creating an ImageList.from_folder() as you suggested and calling databunch() on it, then replacing the default batch sampler with yours:

EDIT: this didn’t actually work, see my next post.

db = ImageList.from_folder(...).[etc, etc].transform(...).databunch()
db.train_dl.batch_sampler = ImbalancedDatasetSampler(db.train_ds)

Mixing up was in my model’s TODO list, so thanks for the tip, I didn’t know about mixup()!

Sorry, scratch my above snippet. For some reason, monkey-patching the batch_sampler of an already initialized DataLoader (train_dl) didn’t actually work. This is the code that finally worked for me; it uses both @tcapelle’s ImbalancedDatasetSampler and the fast.ai built-in transforms:

data = (ImageList.from_folder('train_images/', extensions=['.png'], presort=True)
    .split_by_rand_pct(seed=6)
    .label_from_func(get_labels, classes=labels)
    .transform(tfms)
    .add_test('test_images/' + test_fns))
bs=64
train_ds, val_ds, test_ds = data.train, data.valid, data.test
sampler = ImbalancedDatasetSampler(train_ds)
train_dl = DataLoader(train_ds, bs, sampler=sampler, num_workers=8)
val_dl = DataLoader(val_ds, 2*bs, False, num_workers=8)
test_dl = DataLoader(test_ds, 2*bs, False, num_workers=8)

db = ImageDataBunch(train_dl=train_dl, valid_dl=val_dl, test_dl=test_dl).normalize(my_stats)