Paper on Imbalanced classes

I get a "page not found" error.

Edit: Perfect! Will have a look!

Edited the post with the correct link :grinning:

There was a recent article that I found very interesting related to this subject (Data Augmentation) :

Yes, I found the imbalanced dataset sampler too, but I could not figure out where to use it in the current example code from the lessons. I am trying to use it for image classification, with the training images in folders. If you have any code snippet using this class with the DataBunch examples, that would be great!

Thanks in advance!

I tried using this class as a callback but ran into several issues. I could get things working by editing & files in PyTorch. I added this class to & changed to use imbalanced dataset sampler instead of random sampler. (You’ll also need to edit

Ok - great thanks for the tips. Will try it out and see how it works. What sort of imbalanced classes did you have - and how much did this help? I have heavily imbalanced classes - ratio of almost 30:1 and 50:1 in some cases.

I had an imbalance of approx 15:1:1 (3 classes). Without balanced sampling the classifier didn’t learn the classes with fewer samples, so I felt it was required… I also spent some time trying out different data augmentations, which helped.
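To put numbers on why balanced sampling matters with a ratio like this, here is a minimal sketch in plain Python. The label list and class names are made up for illustration; only the 15:1:1 ratio comes from the post above. It compares how often each class is drawn under uniform sampling and under inverse-frequency weights (weight = 1 / class count, the scheme the sampler code in this thread uses):

```python
from collections import Counter

# Hypothetical labels in a 15:1:1 ratio (placeholder class names).
labels = ["a"] * 1500 + ["b"] * 100 + ["c"] * 100
counts = Counter(labels)
total = len(labels)

# Uniform sampling: a class is drawn in proportion to its count.
uniform_prob = {c: n / total for c, n in counts.items()}

# Inverse-frequency weighting: each sample gets weight 1 / count(its class).
weights = [1.0 / counts[label] for label in labels]
weight_sum = sum(weights)

# Total weight per class, normalised to a probability.
class_weight = Counter()
for label, w in zip(labels, weights):
    class_weight[label] += w
balanced_prob = {c: w / weight_sum for c, w in class_weight.items()}

print(uniform_prob)
print(balanced_prob)
```

With uniform sampling the majority class makes up about 88% of what the model sees; with inverse-frequency weights each of the three classes is drawn with probability 1/3.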

Could you share the changes you made to, and - that would be very helpful. I am trying to change the library rather than the PyTorch source code. I found that the old fastai library has a that can be changed to do this, but the new version doesn’t - and it is not clear which part of the code can be changed to incorporate this. I will post my changes once I can figure it out and make it work.

# Imports needed if the class is kept in its own module:
import torch
import torchvision
from torch.utils.data.sampler import Sampler


class ImbalancedDatasetSampler(Sampler):
    """Samples elements randomly from a given list of indices for imbalanced dataset

    Arguments:
        indices (list, optional): a list of indices
        num_samples (int, optional): number of samples to draw
    """

    def __init__(self, dataset, indices=None, num_samples=None):
        # if indices is not provided,
        # all elements in the dataset will be considered
        self.indices = list(range(len(dataset))) \
            if indices is None else indices

        # if num_samples is not provided,
        # draw `len(indices)` samples in each iteration
        self.num_samples = len(self.indices) \
            if num_samples is None else num_samples

        # distribution of classes in the dataset
        label_to_count = {}
        for idx in self.indices:
            label = self._get_label(dataset, idx)
            if label in label_to_count:
                label_to_count[label] += 1
            else:
                label_to_count[label] = 1

        # weight for each sample: inverse of its class frequency
        weights = [1.0 / label_to_count[self._get_label(dataset, idx)]
                   for idx in self.indices]
        self.weights = torch.DoubleTensor(weights)

    def _get_label(self, dataset, idx):
        # only MNIST and ImageFolder datasets are supported here
        dataset_type = type(dataset)
        if dataset_type is torchvision.datasets.MNIST:
            return dataset.train_labels[idx].item()
        elif dataset_type is torchvision.datasets.ImageFolder:
            return dataset.imgs[idx][1]
        else:
            raise NotImplementedError

    def __iter__(self):
        # draw indices with replacement, in proportion to the sample weights
        return (self.indices[i] for i in torch.multinomial(
            self.weights, self.num_samples, replacement=True))

    def __len__(self):
        return self.num_samples

(in DataLoader class):

from . import SequentialSampler, RandomSampler, BatchSampler, ImbalancedDatasetSampler


# sampler = RandomSampler(dataset)
sampler = ImbalancedDatasetSampler(dataset)

from .sampler import Sampler, SequentialSampler, RandomSampler, SubsetRandomSampler, WeightedRandomSampler, BatchSampler, ImbalancedDatasetSampler

Thanks for posting the code. Does it work with the current version of fastai (1.0) and PyTorch (1.0)?

I tried it out but I am getting the following error from
Exception - NotImplementedError:

I debugged and found that, when the following function is called, the dataset type is not torchvision.datasets.ImageFolder but fastai.data_block.LabelList, so it fails to return the label value:

def _get_label(self, dataset, idx):
    dataset_type = type(dataset)
    if dataset_type is torchvision.datasets.MNIST:
        return dataset.train_labels[idx].item()
    elif dataset_type is torchvision.datasets.ImageFolder:
        return dataset.imgs[idx][1]
    else:
        # any other dataset type (e.g. LabelList) ends up here
        raise NotImplementedError


I don’t think I’ve updated to the current version of fastai… But it’s fastai 1.0 & PyTorch 1.0.

You have to make changes to the dataloader class. Labels in fastai are stored in a LabelList… you’ll have to do something like:

dataset_type = type(dataset)

if dataset_type is fastai.data_block.LabelList:
    return int(dataset[idx][1])

Thanks - I had to import the fastai…labelList class into the module (otherwise I get a "not defined" error). The error went away, but the program seems to be stuck in a loop and the notebook is stuck showing the hourglass. So I will have to troubleshoot further - will update this thread once I can figure it out.

I also noticed that PyTorch already has a WeightedRandomSampler class, which seems to be similar to the ImbalancedDatasetSampler class. How does one change the sampling method when creating the DataBunch or the training dataset? If I can figure that out, I could try calling the weighted sampler. Otherwise, the other option is to just make copies of the items in the under-represented classes - which is what @jeremy recommended doing in the lecture.
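For the "make copies" option, a rough sketch in plain Python might look like the following. This is not fastai code: `oversample`, the file names and the class names are all hypothetical, and it simply duplicates randomly chosen items from the smaller classes until every class matches the largest one.

```python
import random
from collections import Counter, defaultdict


def oversample(items, labels, seed=0):
    """Duplicate minority-class items until every class matches the largest class."""
    rng = random.Random(seed)

    # Group items by their label.
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)

    target = max(len(group) for group in by_class.values())

    out_items, out_labels = [], []
    for label, group in by_class.items():
        # Keep every original item, then add random copies up to `target`.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for item in group + extra:
            out_items.append(item)
            out_labels.append(label)
    return out_items, out_labels


# Hypothetical file list with a 30:1 imbalance.
items = [f"img_{i}.jpg" for i in range(31)]
labels = ["common"] * 30 + ["rare"]

new_items, new_labels = oversample(items, labels)
balanced_counts = Counter(new_labels)
print(balanced_counts)
```

Copies like these should only go into the training set; duplicating items in the validation set would distort the metrics.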

When creating the DataBunch you should be able to pass a parameter called ‘sampler’. You could try adding it to the parameters: sampler=YourWeightedSampler(). From the fastai source I understand it will be passed on to the PyTorch DataLoader.


Great - thanks for the tip. This is a much better approach, as you don’t have to change the PyTorch or fastai source code (except to import the ImbalancedDatasetSampler class).

However, the code at is not working with the current version of Fastai and Pytorch. Still trying to figure it out.

It would be great if handling imbalanced datasets could become a standard part of fastai, as it is a very common problem, especially for health and medical data.

Try creating a PyTorch weighted sampler directly:
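As a minimal sketch of that idea in plain PyTorch (no fastai), assuming a toy TensorDataset with made-up features and a 15:1:1 label ratio, the per-sample weights can be built from the class counts and handed to WeightedRandomSampler:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy dataset with a 15:1:1 imbalance (hypothetical data for illustration).
labels = torch.tensor([0] * 1500 + [1] * 100 + [2] * 100)
features = torch.randn(len(labels), 4)
dataset = TensorDataset(features, labels)

# One weight per sample: the inverse of its class frequency.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].double()

# Draw len(dataset) samples per epoch, with replacement.
sampler = WeightedRandomSampler(
    sample_weights, num_samples=len(dataset), replacement=True)

# DataLoader does not allow `sampler` together with `shuffle=True`,
# so shuffle is left at its default here.
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Collect the labels seen in one epoch and count them per class.
drawn = torch.cat([y for _, y in loader])
drawn_counts = torch.bincount(drawn, minlength=3)
print(drawn_counts)
```

The three counts should come out roughly equal. Because a sampler cannot be combined with shuffle=True, any wrapper that forces shuffling on the training loader would need to be adjusted before a sampler like this can be passed through.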

Ok, thanks - will try it out.

Pretty good resource:

Excellent resource - thank you! They do have batch generators for Keras and TensorFlow - not for PyTorch though.

Another nice practical example of how to deal with class imbalance, by Francois Chollet, the creator of Keras:

The most recent fastai update implements oversampling as a callback to handle imbalanced classes.

You can call it as follows:

from fastai.callbacks import *

cb = OverSamplingCallback(learn)
learn.fit_one_cycle(4, callbacks=cb)

(Make sure you have the most recent fastai version first.)