Is there any built-in method to oversample minority classes?

Hi,
I’ve been working on an image dataset with 20 classes where only one class makes up 80% of the total data. As far as I understand, I need to oversample the minority classes.
Reading up on the topic, it seems that for tabular data it is as easy as duplicating existing rows.
But how do I oversample images?
And is there a built-in method I can use to oversample the classes of my choosing?

Thanks in advance.

You can just copy/paste the images to duplicate them in the particular class you want. Remember to only do this in the training set, though.
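If your training images live in one folder per class, the copy/paste can be scripted. A minimal sketch, assuming that folder layout (the function name and arguments are illustrative, not a fastai API):

```python
import shutil
from pathlib import Path

def oversample_class(class_dir, target_count):
    """Duplicate image files in class_dir (one class's training folder)
    until it holds target_count files. Copies get a suffix so filenames
    stay unique; does nothing if the folder already has enough files."""
    class_dir = Path(class_dir)
    originals = sorted(p for p in class_dir.iterdir() if p.is_file())
    n = len(originals)
    if n == 0:
        return
    for i in range(max(0, target_count - n)):
        src = originals[i % n]           # cycle through the originals
        dst = class_dir / f"{src.stem}_copy{i}{src.suffix}"
        shutil.copy(src, dst)
```

Run it only on minority-class folders inside the training split, never on the validation or test folders.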


Do you also have an idea how to resolve that issue in a segmentation task? There, it is not possible to just duplicate minority classes.

Hello again Zachary.
I’m actually using a Kaggle dataset. The dataset itself is 49GB in a read-only directory, and Kaggle kernels give me only 4.9GB of space to work with.
Do you know any ways to work around that?

Use a custom sampler that oversamples. You can pass any PyTorch sampler to the DataBunch and it will work.
Here is an ImbalancedSampler I wrote some time ago:
ImbalancedSampler
You only have to do this:

train_ds, val_ds = data.train, data.valid
sampler = ImbalancedDatasetSampler(train_ds, num_samples=sample)
train_dl = DataLoader(train_ds, bs, sampler=sampler, num_workers=12)
val_dl = DataLoader(val_ds, 2*bs, False, num_workers=8)
db = ImageDataBunch(train_dl=train_dl, valid_dl=val_dl).normalize(stats)
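For reference, the core idea of such a sampler is to weight each sample by the inverse of its class frequency and draw indices with replacement. A torch-free sketch of that logic (the class name is illustrative and this is not the linked implementation; to plug it into a DataLoader you would put the same logic in a `torch.utils.data.Sampler` subclass):

```python
import random
from collections import Counter

class SimpleImbalancedSampler:
    """Sketch of an inverse-frequency sampler for single-label datasets.
    labels: one label per sample. Yields dataset indices, with rare
    classes drawn more often so each batch is roughly balanced."""
    def __init__(self, labels, num_samples=None, seed=None):
        self.labels = labels
        self.num_samples = num_samples or len(labels)
        counts = Counter(labels)
        # each sample's weight is 1 / (size of its class)
        self.weights = [1.0 / counts[l] for l in labels]
        self.rng = random.Random(seed)

    def __iter__(self):
        # sample with replacement, weighted by class rarity
        idxs = self.rng.choices(range(len(self.labels)),
                                weights=self.weights, k=self.num_samples)
        return iter(idxs)

    def __len__(self):
        return self.num_samples
```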

Thank you so much @tcapelle .
I’ll try to integrate it into my code.

Good question. I haven’t worked much in Kaggle Kernels, but I know in Colab I solve this issue by mounting my Google Drive to the working directory. Perhaps that can be done here? Or Dropbox? Not 100% certain on that.

I couldn’t find anything on the Kaggle forums, so I created a thread there. If there’s an update, I’ll post it here as well.

Colab is a good idea, but to me the Kaggle GPU seemed faster than Colab’s. But hey, I’m broke; I’ll take what I can get :laughing:

@tcapelle I’m getting this error. Could you point out what I’m doing wrong?

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-59-03316aa7729a> in <module>()
----> 1 sampler = ImbalancedDatasetSampler(learner.data.train_ds,num_samples=20)

<ipython-input-28-22c469eed6da> in __init__(self, dataset, indices, num_samples)
 14         for idx in self.indices:
 15             label = self._get_label(dataset, idx)
---> 16             for l in label:
 17                 if l in label_to_count:
 18                     label_to_count[l] += 1

TypeError: 'int' object is not iterable

It is not working because my original sampler is aimed at multilabel classification (where you can get more than one class per image). I have modified the sampler to work with standard single-label classification.
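Concretely, the crash happens because a single-label dataset returns a bare int where the multilabel code expects an iterable of classes. A small helper along these lines (the names are hypothetical) makes the counting loop from the traceback handle both cases:

```python
from collections import Counter

def iter_labels(label):
    """Normalize one sample's label to an iterable. Multilabel samples
    already come as a list/tuple of classes; single-label samples come
    as a bare int, which is what raised "'int' object is not iterable"."""
    if isinstance(label, (list, tuple, set)):
        return label
    return [label]

def count_classes(labels):
    """The counting loop from the traceback, made label-type agnostic."""
    label_to_count = Counter()
    for label in labels:
        for l in iter_labels(label):
            label_to_count[l] += 1
    return label_to_count
```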

Here you go, a working example:
https://colab.research.google.com/drive/1hU181nhBvYTeknHDqbmCyg5IdSaeCVm2

You probably want to modify how the weights are computed in the sampler.
Here they are inversely proportional to class frequency; maybe try something lighter, like sqrt or log.
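A sketch of those alternatives (the scheme names are illustrative): inverse weighting fully balances the classes, while sqrt and log soften the correction so the minority is oversampled less aggressively:

```python
import math
from collections import Counter

def class_weights(labels, scheme="inverse"):
    """Per-sample sampling weights under different softening schemes.
    'inverse' fully balances classes; 'sqrt' and 'log' are gentler."""
    counts = Counter(labels)
    if scheme == "inverse":
        w = {c: 1.0 / n for c, n in counts.items()}
    elif scheme == "sqrt":
        w = {c: 1.0 / math.sqrt(n) for c, n in counts.items()}
    elif scheme == "log":
        # +e so a class of size 1 still gets a finite, positive weight
        w = {c: 1.0 / math.log(n + math.e) for c, n in counts.items()}
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return [w[l] for l in labels]
```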

Also, look at this post from @fchollet

I also implemented the example from @fchollet here:
https://colab.research.google.com/drive/1-MJJU8QBh_WtRyWTz6570WHRQaeT4nxS


Thank you so much for your help @tcapelle
This has been a great learning experience.

Going to tag in with something I want to be sure of: when we oversample, it’s okay to balance out all of our classes in the training set, as long as we leave the test set alone, correct?

I would say yes.
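In other words: split first, then oversample only the training portion, leaving validation/test untouched. A minimal sketch of that order of operations, with illustrative names:

```python
import random
from collections import Counter

def split_then_oversample(samples, labels, pct=0.2, seed=0):
    """Randomly hold out pct of the indices for validation, then
    duplicate minority-class *training* indices until every class
    matches the majority count. Validation indices are returned as-is."""
    rng = random.Random(seed)
    idxs = list(range(len(samples)))
    rng.shuffle(idxs)
    n_val = int(len(idxs) * pct)
    val_idx, train_idx = idxs[:n_val], idxs[n_val:]
    counts = Counter(labels[i] for i in train_idx)
    target = max(counts.values())
    balanced = list(train_idx)
    for cls, n in counts.items():
        cls_idx = [i for i in train_idx if labels[i] == cls]
        # pad this class up to the majority count with random repeats
        balanced += [rng.choice(cls_idx) for _ in range(target - n)]
    return balanced, val_idx
```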

This is just what I have been looking for! Thank you @tcapelle for posting the code.

When I run the pets notebook though, it takes much longer to train. It starts with an error of 0.70 and then slowly decreases; after 10 epochs it is still at 0.34, compared to the original notebook, which starts off with an error of 0.10 and quickly gets to 0.07 after 3 epochs. I set the LR based on lr_find, so I’m not sure why it is slower. Did you find this too, or am I doing something wrong?

Thanks

It is probably because you are sampling with replacement on an already balanced dataset.
If you find an answer, I am interested.
Also, there is now a built-in callback for sampling in fastai.

learn = cnn_learner(db, models.resnet34, metrics=error_rate, callback_fns=[OverSamplingCallback])

I made another example here using the integrated callback.


Yes, I too discovered that fastai has now integrated the OverSamplingCallback, and I am using it instead. Here is a thread with more information and example usage:

train_ds, val_ds = data.train, data.valid
sampler = ImbalancedDatasetSampler(train_ds, num_samples=sample)
train_dl = DataLoader(train_ds, bs, sampler=sampler, num_workers=12)
val_dl = DataLoader(val_ds, 2*bs, False, num_workers=8)
db = ImageDataBunch(train_dl=train_dl, valid_dl=val_dl).normalize(stats)

How would you add augmentation transforms in this example? Apparently this can only be done from one of the from_* factory methods, not the ImageDataBunch constructor unless I’m missing something (which I probably am).

To be able to use the fastai built-in transforms you will need to create an ImageList. Recently I have been replacing all transforms with mixup; you could try that. You only need to append .mixup() to your learner before training.

learn = cnn_learner(db, models.resnet34, metrics=error_rate).mixup()

Thanks, I’ve been able to apply transforms by creating an ImageList.from_folder() as you suggested and calling databunch() on it, then replacing the default batch sampler with yours:

EDIT: this didn’t actually work, see my next post.

db = ImageList.from_folder(...).[etc, etc].transform(...).databunch()
db.train_dl.batch_sampler = ImbalancedDatasetSampler(db.train_ds)

Mixing up was in my model’s TODO list, so thanks for the tip, I didn’t know about mixup()!

Sorry, scratch my above snippet. For some reason, monkey-patching the batch_sampler of an already initialized DataLoader (train_dl) didn’t actually work. This is the code that finally worked for me; it uses both @tcapelle’s ImbalancedDatasetSampler and the fast.ai built-in transforms:

data = (ImageList.from_folder('train_images/', extensions=['.png'], presort=True)
    .split_by_rand_pct(seed=6)
    .label_from_func(get_labels, classes=labels)
    .transform(tfms)
    .add_test('test_images/' + test_fns))
bs=64
train_ds, val_ds, test_ds = data.train, data.valid, data.test
sampler = ImbalancedDatasetSampler(train_ds)
train_dl = DataLoader(train_ds, bs, sampler=sampler, num_workers=8)
val_dl = DataLoader(val_ds, 2*bs, False, num_workers=8)
test_dl = DataLoader(test_ds, 2*bs, False, num_workers=8)

db = ImageDataBunch(train_dl=train_dl, valid_dl=val_dl, test_dl=test_dl).normalize(my_stats)