I’ve been working on an image dataset with 20 classes where only one class makes up 80% of the total data. As far as I understand, I need to oversample the minority classes.
Reading up on the topic, it seems that for tabular data it is as easy as duplicating existing rows.
But how do I oversample images?
And is there a built-in method I can use to oversample the classes of my choosing?
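For images you generally don't duplicate files on disk; you duplicate *indices* into the dataset so the loader sees minority images more often. A minimal stdlib sketch of the idea (the class names and counts are made up for illustration):

```python
import random
from collections import Counter

# Toy dataset: class 'cat' has 8 images, 'dog' only 2.
labels = ['cat'] * 8 + ['dog'] * 2
indices_by_class = {}
for i, lbl in enumerate(labels):
    indices_by_class.setdefault(lbl, []).append(i)

# Oversample: repeat minority indices until every class matches the largest.
target = max(len(v) for v in indices_by_class.values())
oversampled = []
for lbl, idxs in indices_by_class.items():
    reps = -(-target // len(idxs))          # ceil division
    oversampled += (idxs * reps)[:target]
random.seed(0)
random.shuffle(oversampled)

print(Counter(labels[i] for i in oversampled))  # both classes now have 8 indices
```

The same index list can then be fed to whatever loads the images, so each epoch sees a balanced stream without copying any files.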
Hello again Zachary.
I’m actually using a Kaggle dataset. The dataset itself is 49 GB in a read-only directory, and Kaggle kernels give me only 4.9 GB of space to work with.
Do you know any ways to work around that?
Use a custom sampler that oversamples. You can pass any PyTorch sampler to the DataBunch and it will work.
Here is an ImbalancedSampler I wrote some time ago: ImbalancedSampler
You only have to do this:
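The snippet that followed isn't reproduced in this post, but the contract a sampler has to satisfy is small: an object that yields dataset indices and reports a length. A stdlib sketch of such an oversampling sampler (the class name and `seed` argument are my own, not the original code):

```python
import random
from collections import Counter

class ImbalancedSamplerSketch:
    """Yields dataset indices, drawing rare classes more often.

    Mirrors the PyTorch Sampler contract (__iter__ over indices plus
    __len__), which is what lets a DataLoader accept it as `sampler`.
    """
    def __init__(self, labels, num_samples=None, seed=None):
        self.labels = labels
        self.num_samples = num_samples or len(labels)
        counts = Counter(labels)
        # Weight each sample inversely to its class frequency.
        self.weights = [1.0 / counts[lbl] for lbl in labels]
        self.rng = random.Random(seed)

    def __iter__(self):
        # Draw with replacement, so minority indices repeat.
        return iter(self.rng.choices(range(len(self.labels)),
                                     weights=self.weights,
                                     k=self.num_samples))

    def __len__(self):
        return self.num_samples

labels = [0] * 90 + [1] * 10
sampler = ImbalancedSamplerSketch(labels, seed=0)
drawn = Counter(labels[i] for i in sampler)
print(drawn)  # roughly 50/50 despite the 90/10 split
```

Because every class ends up with the same total weight, each draw is equally likely to land in any class, regardless of how skewed the raw counts are.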
Good question. I haven’t worked much in Kaggle Kernels, but I know that in Colab I solve this issue by mounting my Google Drive to the working directory. Perhaps that can be done here, or with Dropbox? Not 100% certain on that.
@tcapelle I’m getting this error. Could you point out what I’m doing wrong?
TypeError Traceback (most recent call last)
<ipython-input-59-03316aa7729a> in <module>()
----> 1 sampler = ImbalancedDatasetSampler(learner.data.train_ds,num_samples=20)
<ipython-input-28-22c469eed6da> in __init__(self, dataset, indices, num_samples)
14 for idx in self.indices:
15 label = self._get_label(dataset, idx)
---> 16 for l in label:
17 if l in label_to_count:
18 label_to_count[l] += 1
TypeError: 'int' object is not iterable
It is not working because my original sampler is aimed at multi-label classification (where an image can have more than one class). I have modified the sampler to work with standard single-label classification.
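The failing loop in the traceback assumed each label was a list of classes; with single-label data `label` is a plain int, so `for l in label:` raises exactly that TypeError. A sketch of counting logic that handles both cases (the helper name is illustrative, not the actual sampler code):

```python
from collections import Counter

def count_labels(labels):
    """Count class occurrences whether each label is an int
    (single-label) or a list of ints (multi-label)."""
    label_to_count = Counter()
    for label in labels:
        if isinstance(label, (list, tuple, set)):   # multi-label sample
            label_to_count.update(label)
        else:                                       # single class per image
            label_to_count[label] += 1
    return label_to_count

print(count_labels([0, 1, 1, [0, 2]]))  # Counter({0: 2, 1: 2, 2: 1})
```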
This is just what I have been looking for! Thank you @tcapelle for posting the code.
When I run the pets notebook with it, though, training takes much longer. It starts with an error rate of 0.70 and then decreases slowly; after 10 epochs it is still at 0.34. Compare that to the original notebook, which starts with an error rate of 0.10 and quickly gets to 0.07 after 3 epochs. I set the LR based on lr_find, so I’m not sure why it is slower. Did you find this too, or am I doing something wrong?
How would you add augmentation transforms in this example? Apparently this can only be done from one of the from_* factory methods, not the ImageDataBunch constructor, unless I’m missing something (which I probably am).
To be able to use the fastai built-in transforms you will need to create an ImageList. Lately I have been replacing all transforms with mixup; you may try that. You only need to append .mixup() to your learner before training.
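fastai's .mixup() callback handles this for you during training; the underlying idea, sketched here with the stdlib only on flattened toy inputs, is to blend two examples and their one-hot targets with a Beta-distributed coefficient (the function name and alpha value are illustrative):

```python
import random

def mixup_pair(x1, y1, x2, y2, alpha=0.4, rng=random):
    """Blend two flattened inputs and their one-hot targets.

    lam ~ Beta(alpha, alpha); the network then trains on the blended
    input against the correspondingly blended (soft) target.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

random.seed(0)
x, y, lam = mixup_pair([0.0, 1.0], [1, 0], [1.0, 0.0], [0, 1])
print(lam, x, y)  # blended input and soft target
```

With a small alpha the Beta draw is usually close to 0 or 1, so most blended examples stay near one of the two originals, which is part of why mixup acts as a gentle regularizer.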
Sorry, scratch my above snippet. For some reason, monkey-patching the batch_sampler on an already-initialized DataLoader (train_dl) didn’t actually work. This is the code that finally worked for me, using both @tcapelle’s ImbalancedDatasetSampler and the fast.ai built-in transforms: