I’m working on a dataset of nearly 5 million images spread across 43 categories. The number of images in each class is shown in the image. To balance the dataset, is there any function in the fastai library that oversamples only the minority classes and downsamples the majority classes?
No, there isn’t yet.
You can look at how PyTorch’s WeightedRandomSampler works (https://pytorch.org/docs/master/data.html#torch.utils.data.WeightedRandomSampler) and pass it to the dataloader as its sampler.
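To make that concrete, here is a minimal sketch in plain PyTorch, not a fastai API. The toy dataset (three classes with 900 / 90 / 10 samples), the inverse-frequency weighting, and the variable names are all assumptions for illustration. With a dataset of your size you would build `labels` from your own label list rather than from synthetic data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)  # only to make the demo repeatable

# Toy stand-in for an imbalanced dataset: 3 classes with 900 / 90 / 10 samples.
labels = torch.tensor([0] * 900 + [1] * 90 + [2] * 10)
features = torch.randn(len(labels), 4)
dataset = TensorDataset(features, labels)

# Weight every sample by the inverse of its class count, so each class
# carries the same total probability mass.
class_counts = torch.bincount(labels)
sample_weights = (1.0 / class_counts.float())[labels]

# Drawing with replacement repeats minority samples and skips many
# majority samples, which rebalances the classes in expectation.
sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),  # draws per epoch; can be smaller
    replacement=True,
)

# A sampler and shuffle=True are mutually exclusive, so shuffle is left off.
loader = DataLoader(dataset, batch_size=100, sampler=sampler)

# Count how often each class is actually drawn in one epoch.
seen = torch.zeros(len(class_counts), dtype=torch.long)
for _, batch_labels in loader:
    seen += torch.bincount(batch_labels, minlength=len(class_counts))
print(seen)
```

The printed counts should come out roughly equal across the three classes, even though the underlying data is 90% class 0. `num_samples` sets the length of an "epoch", so on a very large dataset you can set it lower than the dataset size instead of drawing 5 million indices per pass.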
I had this problem in a recent Kaggle competition, so I oversampled the under-represented classes with replacement (it was a bit tricky because of the multi-labels), bringing them from about 0.1% to about 2.5% of the data. But it turned out not to improve performance on the test set, which kept the original distribution (public score).