Oversampling in fastai2

Hi,

Is there a ‘beginner/intermediate’-friendly way to do oversampling in fastai2?

I’ve seen this post on the forum: [Oversampling Callback]. But I think it’s only for fastai1. Getting this to work in fastai2 would be very useful, since I’m interested in medical imaging and most medical image datasets are highly imbalanced.

6 Likes

We do not yet, but if you’re up to it a PR would be welcome :slight_smile:

(We just have a weighted DataLoader)

If the class is indicated in the DataFrame as category_id, then this should work:

from collections import Counter
import numpy as np

# One weight per class (inverse frequency), then one weight per training item
# Note: this assumes category_id holds integer class indices, ordered like the Counter
count = Counter(df.category_id).values()
class_weights = 1/np.array(list(count))
wgts = class_weights[dsets.train.items['category_id']]
dls = dsets.weighted_dataloaders(path=path, bs=bs, after_batch=batch_tfms, wgts=wgts)

I guess the only thing to note is that you need to create a Datasets object first instead of directly creating DataLoaders.
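For reference, a rough sketch of what the full pipeline could look like, assuming an image classification problem and a DataFrame df with filename and category_id columns (the DataBlock configuration here is just an illustration, not the only way to do it):

from fastai.vision.all import *

# Hypothetical DataBlock; adjust blocks, readers and splitter to your data
dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                   get_x=ColReader('filename'),      # assumes full paths in this column
                   get_y=ColReader('category_id'),
                   splitter=RandomSplitter())
dsets = dblock.datasets(df)   # a Datasets object, not DataLoaders
# ...compute wgts as above, then:
dls = dsets.weighted_dataloaders(path=path, bs=bs, after_batch=batch_tfms, wgts=wgts)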

12 Likes

The last time I tried to use the weighted dataloader I could not get it working properly, so I used a different approach: exporting to pandas DataFrames, performing the oversampling there, and then converting back. It’s easier than it sounds, and the full example is here
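The general idea, as a rough sketch (the column names and the exact resampling scheme here are assumptions, not the linked example):

import pandas as pd

# Oversample the *training* rows so every class has as many rows as the largest class
# (assumes train_df has a 'label' column; oversample after splitting so duplicates
# don't leak into the validation set)
max_count = train_df['label'].value_counts().max()
parts = [train_df]
for cls, cnt in train_df['label'].value_counts().items():
    if cnt < max_count:
        parts.append(train_df[train_df['label'] == cls].sample(max_count - cnt, replace=True))
train_df_balanced = pd.concat(parts).sample(frac=1).reset_index(drop=True)
# Then rebuild the DataLoaders from train_df_balanced as usual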

2 Likes

@muellerzr I don’t think I’m at that level yet, but it might be a nice project for the future.
Is there an advantage of a weighted DataLoader over oversampling?

@ilovescience Thanks, this seems like a nice solution. I’m going to try it.

I am not sure about it, but I think using the weighted dataloaders is more memory efficient than a classic oversampler that replicates data, because the dataloader simply passes items from the minority class to the model more frequently, without any explicit replication… (I should double-check this, though)
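As a toy illustration of the sampling idea (not fastai's actual implementation): with weights inversely proportional to class frequency, a weighted sampler draws rare-class indices about as often as common ones, with replacement, so nothing gets copied.

import numpy as np

labels = np.array([0]*900 + [1]*100)            # imbalanced: 90% class 0, 10% class 1
wgts = np.where(labels == 0, 1/900, 1/100)      # inverse class frequency per item
wgts = wgts / wgts.sum()
idxs = np.random.choice(len(labels), size=1000, p=wgts, replace=True)
print(np.bincount(labels[idxs]))                # roughly [500, 500]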

Anyway, what I found to work better on my problems with imbalanced data is simply to assign weights in the loss function, so that errors on the minority class are penalized more heavily.

1 Like

I wanted to double check the way you calculated the weights.
Is it the same as how you would calculate the weights for a weighted dataloader, i.e. 1/number_of_samples_per_class, or were there any tweaks?
I have very few samples of certain classes, so my ratio is 2000 to 1, and I was not getting great results with weighted loss functions. :slight_smile:

I usually apply the rule 1/n_samples_in_class in the weighted loss function and it works well in terms of Recall and F1, which are the metrics I am interested in for this type of problem.
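A minimal sketch of that rule with fastai's CrossEntropyLossFlat (train_df, dls and learn are placeholder names; the key point is ordering the weights like dls.vocab):

import torch
from collections import Counter
from fastai.vision.all import CrossEntropyLossFlat

counts = Counter(train_df['label'])             # assumes a 'label' column matching dls.vocab
class_weights = torch.tensor([1/counts[c] for c in dls.vocab], dtype=torch.float)
learn.loss_func = CrossEntropyLossFlat(weight=class_weights.to(dls.device))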

1 Like

Is it possible to fix the typo in the thread title? :slight_smile:

Very useful thread

Hmmm…

I’m getting this error when I try this:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-14-eb2169037660> in <module>()
      2 count = Counter(df.label).values()
      3 class_weights = 1/np.array(list(count))
----> 4 wgts = class_weights[dsets.train.items['label']]
      5 dls = dsets.weighted_dataloaders(path=path,bs=bs,after_batch=batch_tfms,wgts=wgts)

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Are we trying to multiply each sample in a class by the class weight?

1 Like

This does not answer your question.

But here is an example for folks who want to apply Weighted Cross-Entropy as the loss function for an imbalanced dataset.

Weighted Cross-Entropy

from collections import Counter
from functools import partial
import torch
import torch.nn.functional as F
from fastcore.all import L

# Get weights based on the class distribution in the training data
def get_weights(dls):
    # 0th index would provide the vocab from text
    # 1st index would provide the vocab from classes
    classes = dls.vocab[1]

    # Get label ids from the dataset using map
    # train_lb_ids = L(map(lambda x: x[1], dls.train_ds))
    # Get the actual labels from the label ids & the vocab
    # train_lbls = L(map(lambda x: classes[x], train_lb_ids))

    # Combine the two steps above into a single one
    train_lbls = L(map(lambda x: classes[x[1]], dls.train_ds))
    label_counter = Counter(train_lbls)
    n_most_common_class = max(label_counter.values())
    print(f'Occurrences of the most common class: {n_most_common_class}')

    # Source: https://discuss.pytorch.org/t/what-is-the-weight-values-mean-in-torch-nn-crossentropyloss/11455/9
    weights = [n_most_common_class/v for k, v in label_counter.items() if v > 0]
    return weights

# Get the weights from the classification dataloaders
weights = get_weights(dls_cls)
class_weights = torch.FloatTensor(weights).to(dls_cls.device)
learn_cls.loss_func = partial(F.cross_entropy, weight=class_weights)
4 Likes

Any updates on this? I tried the WeightedDL but couldn’t make it work.

I could train with the method @msivanes provided, but when I exported the learner and loaded it again it caused some issues with prediction.

The solution I found was to change

learn_cls.loss_func = partial(F.cross_entropy, weight=class_weights)

to

learn_cls.loss_func = CrossEntropyLossFlat(weight=class_weights)

1 Like

I missed updating the answer. @muellerzr assisted me with this issue. There's no need to use plain PyTorch; what you did is exactly the right solution.


Just to help people who are trying to get some kind of oversampler, here is the “hack” I had to do to implement it. This is a first iteration, so it could probably be made more robust; feel free to suggest improvements :slight_smile:

To set the context: the splitter here splits by index depending on the Dataset column. Also note that the label_df DataFrame is sorted so that the training rows come first, which is why I take [:len(train_df)] of the weights.

label_db = DataBlock(
    blocks=(ImageBlock(cls=PILImageBW), MultiCategoryBlock),
    get_x=ColReader('Original_Filename', pref=raw_preprocess_folder+'/', suff='.png'), 
    get_y=ColReader('Target'),
    splitter=TestColSplitter(col='Dataset'),
    item_tfms=item_tfms,
    batch_tfms=label_transform,
)

label_dl = label_db.dataloaders(label_df, bs=BATCH_SIZE, num_workers=0, shuffle_train=True, drop_last=True)

# Calculate sample weights to balance the DataLoader 
from collections import Counter

count = Counter(label_dl.items['Target'])
class_weights = {}
for c in count:
    class_weights[c] = 1/count[c]
wgts = label_dl.items['Target'].map(class_weights).values[:len(train_df)]

weighted_dl = label_db.dataloaders(label_df, bs=BATCH_SIZE, num_workers=0, shuffle_train=True, drop_last=True, dl_type=WeightedDL, wgts=wgts)
label_dl.train = weighted_dl.train
3 Likes

How can I make use of this to perform balanced sampling? Say:

C1 - 5k
C2 - 2k
C3 - 1.5k

I want each batch to have a class ratio of more or less 1:1:1, so out of C1 I may want to select not all the samples but just enough to match the less dominant classes.

Hi there!
I'm getting the same error. Did you resolve this issue?

Thanks for the code. I think you should reorder the weights according to the dls vocab as well.
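A sketch of what that reordering could look like, reusing the names from the earlier post (and assuming every class in the vocab appears at least once in the training set):

from collections import Counter
import torch
from fastcore.all import L
from fastai.losses import CrossEntropyLossFlat

def get_weights_ordered(dls):
    classes = dls.vocab[1]                      # index 1 holds the class vocab for a text classifier
    train_lbls = L(map(lambda x: classes[x[1]], dls.train_ds))
    label_counter = Counter(train_lbls)
    n_most_common = max(label_counter.values())
    # weights[i] now lines up with classes[i]
    return [n_most_common / label_counter[c] for c in classes]

class_weights = torch.FloatTensor(get_weights_ordered(dls_cls)).to(dls_cls.device)
learn_cls.loss_func = CrossEntropyLossFlat(weight=class_weights)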