In lesson 7, a paper was mentioned which found that the best way to handle imbalanced classes is to oversample the minority class. I looked online but could not find the paper. Does anyone have the reference?
I found it when searching the forums for solutions to the class imbalance problem. Let me know if this helps. I am in fact also looking for a way to implement the oversampling approach easily in fast.ai, without copying the images. If anyone has ideas on how to do that, please let me know.
Yes, I found the ImbalancedDatasetSampler too, but I could not figure out where to use it in the current Fast.ai example code from the lessons. I am trying to use it for image classification, with the training images in folders. If you have any code snippet using this class with the DataBunch examples, that would be great!
I tried using this class as a callback but ran into several issues. I could get things working by editing the dataloader.py and sampler.py files in PyTorch: I added this class to dataloader.py and changed sampler.py to use ImbalancedDatasetSampler instead of RandomSampler. (You’ll also need to edit __init__.py.)
Ok - great, thanks for the tips. I will try it out and see how it works. What sort of imbalanced classes did you have, and how much did this help? I have heavily imbalanced classes - ratios of almost 30:1 and 50:1 in some cases.
I had an imbalance of approximately 15:1:1 (3 classes). Without balanced sampling the classifier didn’t learn the classes with fewer samples - so I felt it was required… I also spent some time trying out different data augmentations, which helped.
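For example, with fastai v1 you can vary the augmentations via get_transforms (a sketch - the values here are only illustrative, not the exact settings used above):

from fastai.vision import get_transforms

# returns (train_tfms, valid_tfms); values should be tuned per dataset
tfms = get_transforms(do_flip=True, max_rotate=15.0,
                      max_zoom=1.1, max_lighting=0.3)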
Could you share the changes you made to dataloader.py, sampler.py and __init__.py? That would be very helpful. I am trying to change the fast.ai library rather than the PyTorch source code. I found that the old fastai library has a dataloader.py that can be changed to do this, but the new version doesn’t, and it is not clear which part of the code should be changed to incorporate this. I will post my changes once I can figure it out and make it work.
import torch
import torchvision
from torch.utils.data.sampler import Sampler

class ImbalancedDatasetSampler(Sampler):
    """Samples elements randomly from a given list of indices for an imbalanced
    dataset, weighting each sample inversely to the frequency of its class.

    Arguments:
        indices (list, optional): a list of indices
        num_samples (int, optional): number of samples to draw
    """

    def __init__(self, dataset, indices=None, num_samples=None):
        # if indices is not provided,
        # all elements in the dataset will be considered
        self.indices = list(range(len(dataset))) \
            if indices is None else indices
        # if num_samples is not provided,
        # draw `len(indices)` samples in each iteration
        self.num_samples = len(self.indices) \
            if num_samples is None else num_samples
        # distribution of classes in the dataset
        label_to_count = {}
        for idx in self.indices:
            label = self._get_label(dataset, idx)
            if label in label_to_count:
                label_to_count[label] += 1
            else:
                label_to_count[label] = 1
        # weight for each sample: the inverse of its class count
        weights = [1.0 / label_to_count[self._get_label(dataset, idx)]
                   for idx in self.indices]
        self.weights = torch.DoubleTensor(weights)

    def _get_label(self, dataset, idx):
        dataset_type = type(dataset)
        if dataset_type is torchvision.datasets.MNIST:
            return dataset.train_labels[idx].item()
        elif dataset_type is torchvision.datasets.ImageFolder:
            return dataset.imgs[idx][1]
        else:
            raise NotImplementedError

    def __iter__(self):
        # draw num_samples indices with replacement, weighted by class rarity
        return (self.indices[i] for i in torch.multinomial(
            self.weights, self.num_samples, replacement=True))

    def __len__(self):
        return self.num_samples
dataloader.py (import at the top of the module, so it is visible to the DataLoader class):

from . import SequentialSampler, RandomSampler, BatchSampler, ImbalancedDatasetSampler
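If you want to sanity-check the sampler before editing any source files, it can also be passed straight to a plain PyTorch DataLoader. A minimal sketch using torchvision’s MNIST (one of the two dataset types _get_label already supports):

import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

train_ds = torchvision.datasets.MNIST(root='.', train=True, download=True,
                                      transform=T.ToTensor())
# shuffle is left at its default (False): PyTorch does not allow
# combining shuffle=True with a custom sampler
train_dl = DataLoader(train_ds, batch_size=64,
                      sampler=ImbalancedDatasetSampler(train_ds))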
Thanks for posting the code. Does it work with the current version of Fast.ai (1.0) and PyTorch (1.0)?
I tried it out, but I am getting the following error from sampler.py:

Exception - NotImplementedError:

I debugged and found that when the following function is called, the dataset type is not torchvision.datasets.ImageFolder but fastai.data_block.LabelList, so it falls through to the NotImplementedError and never returns the label value:
def _get_label(self, dataset, idx):
    dataset_type = type(dataset)
    if dataset_type is torchvision.datasets.MNIST:
        return dataset.train_labels[idx].item()
    elif dataset_type is torchvision.datasets.ImageFolder:
        return dataset.imgs[idx][1]
    else:
        raise NotImplementedError
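One possible fix is to add a branch for fastai’s dataset type. A sketch, assuming fastai v1’s LabelList, which (as far as I can tell) exposes the numeric class index of each sample through dataset.y.items:

from fastai.data_block import LabelList

def _get_label(self, dataset, idx):
    dataset_type = type(dataset)
    if dataset_type is torchvision.datasets.MNIST:
        return dataset.train_labels[idx].item()
    elif dataset_type is torchvision.datasets.ImageFolder:
        return dataset.imgs[idx][1]
    elif isinstance(dataset, LabelList):
        # fastai keeps the labels in dataset.y (a CategoryList);
        # .items holds the integer class index for each sample
        return dataset.y.items[idx]
    else:
        raise NotImplementedError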
Thanks - I had to import the fastai.data_block.LabelList class into the module (otherwise I got a “not defined” error) and that error went away, but the program seems to be stuck in a loop and the notebook is stuck with the hourglass. So I will have to troubleshoot further - I will update this thread once I figure it out.
I also noticed that sampler.py already has a WeightedRandomSampler class, which seems to be similar to the ImbalancedDatasetSampler. How does one change the sampling method when creating the DataBunch or the train dataset? If I can figure that out I could try calling the weighted sampler. Otherwise, the other option is to just make extra copies of the minority-class images - which is what @jeremy recommended doing in the lecture.
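For reference, PyTorch’s built-in WeightedRandomSampler can do the same job if you compute the inverse-frequency weights yourself. A minimal sketch, assuming labels is a list with one integer class index per training sample:

import torch
from torch.utils.data import WeightedRandomSampler

label_t = torch.tensor(labels)                   # one class index per sample
counts = torch.bincount(label_t)                 # number of samples per class
sample_weights = 1.0 / counts[label_t].double()  # rarer class -> larger weight
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)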
When creating the DataBunch you should be able to pass a keyword argument called ‘sampler’. You could try adding sampler=YourWeightedSampler(...) to the parameters. From the fastai source I understand it will be passed on to the PyTorch DataLoader.
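Something along these lines might work (a sketch, not tested; it assumes fastai v1’s DeviceDataLoader.new re-creates the underlying PyTorch DataLoader with the given keyword arguments, and that path points at your image folders):

from fastai.vision import ImageDataBunch

data = ImageDataBunch.from_folder(path, bs=64)
# shuffle must be turned off, since PyTorch forbids combining
# shuffle=True with a custom sampler
data.train_dl = data.train_dl.new(
    shuffle=False,
    sampler=ImbalancedDatasetSampler(data.train_ds))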
Great - thanks for the tip. This is a much better approach, as you don’t have to change the PyTorch or fastai source code (other than importing the ImbalancedDatasetSampler).
It would be great if handling imbalanced datasets could become a standard part of fastai, as it is a very common problem, especially for health and medical data.