In lesson 7, a paper was mentioned which found that the best way to handle imbalanced classes is to oversample the minority class. I looked online but could not find the paper. Does anyone have the reference?
I found it when searching the forums for solutions to the class imbalance problem. Let me know if this helps. I am in fact also looking for a way to implement the oversampling approach easily in fast.ai, without copying the images. If anyone has ideas on how to do that, please let me know.
Yes, I found the ImbalancedDatasetSampler too, but I could not figure out where to use it in the current Fast.ai example code from the lessons. I am trying to use it for image classification, with the training images in folders. If you have any code snippet using this class with the DataBunch examples, that would be great!
I tried using this class as a callback but ran into several issues. I could get things working by editing the dataloader.py and sampler.py files in PyTorch: I added this class to dataloader.py and changed sampler.py to use ImbalancedDatasetSampler instead of RandomSampler. (You’ll also need to edit __init__.py.)
Ok - great, thanks for the tips. I will try it out and see how it works. What sort of imbalanced classes did you have, and how much did this help? I have heavily imbalanced classes - ratios of almost 30:1 and 50:1 in some cases.
I had an imbalance of approximately 15:1:1 (3 classes). Without balanced sampling the classifier didn’t learn the classes with fewer samples - so I felt it was required… I also spent some time trying out different data augmentations, which helped.
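For example, with fastai v1 you can vary the augmentations via get_transforms (a sketch - the values here are only illustrative, not the exact settings used above):

from fastai.vision import get_transforms

# returns (train_tfms, valid_tfms); values should be tuned per dataset
tfms = get_transforms(do_flip=True, max_rotate=15.0,
                      max_zoom=1.1, max_lighting=0.3)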
Could you share the changes you made to dataloader.py, sampler.py and __init__.py? That would be very helpful. I am trying to change the fast.ai library rather than the PyTorch source code. I found that the old fastai library has a dataloader.py that can be changed to do this, but the new version doesn’t, and it is not clear which part of the code should be changed to incorporate this. I will post my changes once I can figure it out and make it work.
import torch
import torchvision
from torch.utils.data.sampler import Sampler

class ImbalancedDatasetSampler(Sampler):
    """Samples elements randomly from a given list of indices for an imbalanced
    dataset, weighting each sample inversely to the frequency of its class.

    Arguments:
        indices (list, optional): a list of indices
        num_samples (int, optional): number of samples to draw
    """

    def __init__(self, dataset, indices=None, num_samples=None):
        # if indices is not provided,
        # all elements in the dataset will be considered
        self.indices = list(range(len(dataset))) \
            if indices is None else indices
        # if num_samples is not provided,
        # draw `len(indices)` samples in each iteration
        self.num_samples = len(self.indices) \
            if num_samples is None else num_samples
        # distribution of classes in the dataset
        label_to_count = {}
        for idx in self.indices:
            label = self._get_label(dataset, idx)
            if label in label_to_count:
                label_to_count[label] += 1
            else:
                label_to_count[label] = 1
        # weight for each sample: the inverse of its class count
        weights = [1.0 / label_to_count[self._get_label(dataset, idx)]
                   for idx in self.indices]
        self.weights = torch.DoubleTensor(weights)

    def _get_label(self, dataset, idx):
        dataset_type = type(dataset)
        if dataset_type is torchvision.datasets.MNIST:
            return dataset.train_labels[idx].item()
        elif dataset_type is torchvision.datasets.ImageFolder:
            return dataset.imgs[idx][1]
        else:
            raise NotImplementedError

    def __iter__(self):
        # draw num_samples indices with replacement, weighted by class rarity
        return (self.indices[i] for i in torch.multinomial(
            self.weights, self.num_samples, replacement=True))

    def __len__(self):
        return self.num_samples
dataloader.py (import at the top of the module, so it is visible to the DataLoader class):

from . import SequentialSampler, RandomSampler, BatchSampler, ImbalancedDatasetSampler
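If you want to sanity-check the sampler before editing any source files, it can also be passed straight to a plain PyTorch DataLoader. A minimal sketch using torchvision’s MNIST (one of the two dataset types _get_label already supports):

import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

train_ds = torchvision.datasets.MNIST(root='.', train=True, download=True,
                                      transform=T.ToTensor())
# shuffle is left at its default (False): PyTorch does not allow
# combining shuffle=True with a custom sampler
train_dl = DataLoader(train_ds, batch_size=64,
                      sampler=ImbalancedDatasetSampler(train_ds))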
Thanks for posting the code. Does it work with the current version of Fast.ai (1.0) and PyTorch (1.0)?
I tried it out, but I am getting the following error from sampler.py:

Exception - NotImplementedError:

I debugged and found that when the following function is called, the dataset type is not torchvision.datasets.ImageFolder but fastai.data_block.LabelList, so it falls through to the NotImplementedError and never returns the label value:
def _get_label(self, dataset, idx):
    dataset_type = type(dataset)
    if dataset_type is torchvision.datasets.MNIST:
        return dataset.train_labels[idx].item()
    elif dataset_type is torchvision.datasets.ImageFolder:
        return dataset.imgs[idx][1]
    else:
        raise NotImplementedError
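One possible fix is to add a branch for fastai’s dataset type. A sketch, assuming fastai v1’s LabelList, which (as far as I can tell) exposes the numeric class index of each sample through dataset.y.items:

from fastai.data_block import LabelList

def _get_label(self, dataset, idx):
    dataset_type = type(dataset)
    if dataset_type is torchvision.datasets.MNIST:
        return dataset.train_labels[idx].item()
    elif dataset_type is torchvision.datasets.ImageFolder:
        return dataset.imgs[idx][1]
    elif isinstance(dataset, LabelList):
        # fastai keeps the labels in dataset.y (a CategoryList);
        # .items holds the integer class index for each sample
        return dataset.y.items[idx]
    else:
        raise NotImplementedError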
Thanks - I had to import the fastai.data_block.LabelList class into the module (otherwise I got a “not defined” error) and that error went away, but the program seems to be stuck in a loop and the notebook is stuck with the hourglass. So I will have to troubleshoot further - I will update this thread once I figure it out.
I also noticed that sampler.py already has a WeightedRandomSampler class, which seems to be similar to the ImbalancedDatasetSampler. How does one change the sampling method when creating the DataBunch or the train dataset? If I can figure that out I could try calling the weighted sampler. Otherwise, the other option is to just make extra copies of the minority-class images - which is what @jeremy recommended doing in the lecture.
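For reference, PyTorch’s built-in WeightedRandomSampler can do the same job if you compute the inverse-frequency weights yourself. A minimal sketch, assuming labels is a list with one integer class index per training sample:

import torch
from torch.utils.data import WeightedRandomSampler

label_t = torch.tensor(labels)                   # one class index per sample
counts = torch.bincount(label_t)                 # number of samples per class
sample_weights = 1.0 / counts[label_t].double()  # rarer class -> larger weight
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)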
When creating the DataBunch you should be able to pass a keyword argument called ‘sampler’. You could try adding sampler=YourWeightedSampler(...) to the parameters. From the fastai source I understand it will be passed on to the PyTorch DataLoader.
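Something along these lines might work (a sketch, not tested; it assumes fastai v1’s DeviceDataLoader.new re-creates the underlying PyTorch DataLoader with the given keyword arguments, and that path points at your image folders):

from fastai.vision import ImageDataBunch

data = ImageDataBunch.from_folder(path, bs=64)
# shuffle must be turned off, since PyTorch forbids combining
# shuffle=True with a custom sampler
data.train_dl = data.train_dl.new(
    shuffle=False,
    sampler=ImbalancedDatasetSampler(data.train_ds))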
Great - thanks for the tip. This is a much better approach, as you don’t have to change the PyTorch or fastai source code (other than importing the ImbalancedDatasetSampler).
It would be great if handling imbalanced datasets could become a standard part of fastai, as it is a very common problem, especially for health and medical data.