Paper on Imbalanced classes

During lesson 7, a paper was mentioned describing that the best way to handle imbalanced classes is to oversample the minority class. I looked online but could not find it. Does anyone have the reference?

Thanks a lot, I really appreciate it.


It may have been this paper -

I found it when searching the forums for solutions to the class-imbalance problem. Let me know if this helps. I am in fact also looking to see if there is a way to implement the oversampling approach easily without copying the images. If anyone has ideas on how to do that, please let me know. One idea (a rough sketch, see below) might be to oversample at the index level instead of duplicating files.
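Something along these lines, assuming a PyTorch-style dataset where all labels are available up front (oversampled_indices is just an illustrative helper, untested):

import torch
from torch.utils.data import Subset

def oversampled_indices(labels):
    # repeat minority-class indices until every class matches the majority count
    labels = torch.as_tensor(labels)
    counts = torch.bincount(labels)
    target = counts.max().item()
    indices = []
    for cls in range(len(counts)):
        cls_idx = (labels == cls).nonzero().view(-1)
        if len(cls_idx) == 0:
            continue
        repeats = -(-target // len(cls_idx))  # ceiling division
        indices.append(cls_idx.repeat(repeats)[:target])
    return torch.cat(indices).tolist()

# usage: balanced_ds = Subset(dataset, oversampled_indices(all_labels))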



I use the ImbalancedDatasetSampler instead of the default RandomSampler. This way you don’t have to make copies of images.
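In plain PyTorch the usage looks something like this (a sketch, assuming the ImbalancedDatasetSampler class posted further down in this thread):

from torch.utils.data import DataLoader

# pass the sampler instead of shuffle=True - the two are mutually exclusive
train_loader = DataLoader(train_dataset,
                          sampler=ImbalancedDatasetSampler(train_dataset),
                          batch_size=64)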


I get a “page isn’t found” error.

Edit: Perfect! Will have a look!

Edited the post with the correct link :grinning:

There was a recent article related to this subject (data augmentation) that I found very interesting:

Yes, I found the ImbalancedDatasetSampler too, but I could not figure out where to use it in the current example code from the lessons. I am trying to use it for image classification, with the training images in folders. If you have a code snippet using this class with the DataBunch examples, that would be great!

Thanks in advance!

I tried using this class as a callback but ran into several issues. I could get things working by editing the sampler and dataloader files in PyTorch: I added this class to the sampler file and changed the dataloader to use the imbalanced dataset sampler instead of the random sampler. (You’ll also need to edit the package imports - see the snippets below.)

Ok - great, thanks for the tips. Will try it out and see how it works. What sort of imbalanced classes did you have, and how much did this help? I have heavily imbalanced classes - ratios of almost 30:1 and 50:1 in some cases.

I had an imbalance of approx 15:1:1 (3 classes). Without balanced sampling the classifier didn’t learn about the classes with fewer samples, so I felt it was required… I also spent some time trying out different data augmentations, which helped.
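For anyone curious, in fastai 1.0 the augmentations are controlled through get_transforms; something along these lines (the values here are illustrative, not my exact settings):

from fastai.vision import get_transforms, ImageDataBunch

# illustrative augmentation settings - tune per dataset
tfms = get_transforms(do_flip=True, flip_vert=False,
                      max_rotate=15., max_zoom=1.2,
                      max_lighting=0.3, max_warp=0.1)
# then pass them when building the DataBunch, e.g.:
# data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=224)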

Could you share the changes you made to those files? That would be very helpful. I am trying to change the fastai library rather than the PyTorch source code. The old fastai library had a place where the sampler could be swapped to do this, but the new version doesn’t, and it is not clear which part of the code can be changed to incorporate this. I will post my changes once I figure it out and make it work.

# imports needed if this is used as a standalone module
import torch
import torchvision
from torch.utils.data.sampler import Sampler

class ImbalancedDatasetSampler(Sampler):
    """Samples elements randomly from a given list of indices for an imbalanced dataset.

    Arguments:
        indices (list, optional): a list of indices
        num_samples (int, optional): number of samples to draw
    """

    def __init__(self, dataset, indices=None, num_samples=None):
        # if indices is not provided,
        # all elements in the dataset will be considered
        self.indices = list(range(len(dataset))) \
            if indices is None else indices
        # if num_samples is not provided,
        # draw `len(indices)` samples in each iteration
        self.num_samples = len(self.indices) \
            if num_samples is None else num_samples
        # distribution of classes in the dataset
        label_to_count = {}
        for idx in self.indices:
            label = self._get_label(dataset, idx)
            if label in label_to_count:
                label_to_count[label] += 1
            else:
                label_to_count[label] = 1
        # weight for each sample: inverse of its class frequency
        weights = [1.0 / label_to_count[self._get_label(dataset, idx)]
                   for idx in self.indices]
        self.weights = torch.DoubleTensor(weights)

    def _get_label(self, dataset, idx):
        dataset_type = type(dataset)
        if dataset_type is torchvision.datasets.MNIST:
            return dataset.train_labels[idx].item()
        elif dataset_type is torchvision.datasets.ImageFolder:
            return dataset.imgs[idx][1]
        else:
            raise NotImplementedError

    def __iter__(self):
        return (self.indices[i] for i in torch.multinomial(
            self.weights, self.num_samples, replacement=True))

    def __len__(self):
        return self.num_samples

(in the DataLoader class):

from . import SequentialSampler, RandomSampler, BatchSampler, ImbalancedDatasetSampler


# sampler = RandomSampler(dataset)
sampler = ImbalancedDatasetSampler(dataset)

(and in the package __init__.py):

from .sampler import Sampler, SequentialSampler, RandomSampler, SubsetRandomSampler, WeightedRandomSampler, BatchSampler, ImbalancedDatasetSampler

Thanks for posting the code. Does it work with the current versions of fastai (1.0) and PyTorch (1.0)?

I tried it out but I am getting the following error:
Exception - NotImplementedError

I debugged and found that the dataset type is
not torchvision.datasets.ImageFolder but fastai.data_block.LabelList
when the following function is called, and hence it fails to return the label value:

def _get_label(self, dataset, idx):
    dataset_type = type(dataset)
    if dataset_type is torchvision.datasets.MNIST:
        return dataset.train_labels[idx].item()
    elif dataset_type is torchvision.datasets.ImageFolder:
        return dataset.imgs[idx][1]
    else:
        raise NotImplementedError


I don’t think I’ve updated to the current version of fastai… But it’s fastai 1.0 & PyTorch 1.0.

You have to make changes to the dataloader class. Labels in fastai are stored in a LabelList… you’ll have to do something like:

dataset_type = type(dataset)

if dataset_type is fastai.data_block.LabelList:
   return dataset[idx][1].__int__()
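Putting it together, the patched _get_label might look like this (a sketch - note that indexing the LabelList loads and transforms the image too, so the pass over the whole training set can be very slow):

import torchvision
import fastai.data_block

def _get_label(self, dataset, idx):
    dataset_type = type(dataset)
    if dataset_type is torchvision.datasets.MNIST:
        return dataset.train_labels[idx].item()
    elif dataset_type is torchvision.datasets.ImageFolder:
        return dataset.imgs[idx][1]
    elif dataset_type is fastai.data_block.LabelList:
        # dataset[idx] is an (x, y) pair; the y Category converts to int
        return int(dataset[idx][1])
    else:
        raise NotImplementedError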

Thanks - I had to import the fastai.data_block.LabelList class into the module (otherwise I got a not-defined error) and the error went away, but the program seems to be stuck in a loop and the notebook shows the hourglass. So I will have to troubleshoot further - will update this thread once I figure it out.

I also noticed that PyTorch already has a WeightedRandomSampler class, which seems to be similar to the ImbalancedDatasetSampler class. How does one change the sampling method when creating the DataBunch or train dataset? If I can figure that out I could try calling the weighted sampler. Otherwise, the other option is to just make copies of the images in the minority classes - which is what @jeremy recommended doing in the lecture.

When creating the DataBunch you should be able to pass in a parameter called ‘sampler’. You could try adding sampler=YourWeightedSampler() to the parameters. From the fastai source I understand it will be passed on to the PyTorch DataLoader.
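For example (an untested sketch - path and my_sampler are placeholders for your own data path and weighted sampler):

from fastai.vision import ImageDataBunch

data = ImageDataBunch.from_folder(path, size=224)
# alternative that avoids touching library code: rebuild just the train
# DataLoader with the custom sampler (shuffle must be off, since a
# DataLoader can't have both shuffle=True and a sampler)
data.train_dl = data.train_dl.new(shuffle=False, sampler=my_sampler)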


Great - thanks for the tip. This is a much better approach, as you don’t have to change the PyTorch or fastai source code (except to import the ImbalancedDatasetSampler class).

However, that code is not working with the current versions of fastai and PyTorch. Still trying to figure it out.

It would be great if handling imbalanced datasets could become a standard part of fastai, as it is a very common problem, especially for health and medical data.

Try creating the PyTorch weighted sampler directly:
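For example, computing per-sample weights from the training labels (a sketch; it assumes data.train_ds.y.items holds the integer class index of each sample, as it does for a fastai v1 CategoryList):

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

labels = np.array(data.train_ds.y.items)   # integer class index per sample
class_counts = np.bincount(labels)
weights = (1.0 / class_counts)[labels]     # inverse class frequency per sample
sampler = WeightedRandomSampler(torch.DoubleTensor(weights),
                                num_samples=len(weights),
                                replacement=True)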

Ok, thanks - will try it out.

Pretty good resource: