How to duplicate training examples to handle class imbalance


(Adrian Galdran) #1

Hi!

I have a 21-class classification problem with a huge class imbalance. I remembered @jeremy mentioning that the better strategy for deep neural networks in this case is to oversample the minority class, so I replicated examples in my csv rows via pandas before passing the file to the data loader. However, it seems that ImageClassifierData.from_csv removes duplicate rows from my csv without asking, so my replicated examples disappear after calling it.

Is there any way to avoid that behavior of the data loading function? Or, otherwise, to replicate minority examples with an already existing fast.ai utility?
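
For readers following along, the replication step described above could look roughly like the sketch below. It is only an illustration: the column names `id` and `label`, the helper name `oversample_minority`, and the choice to balance every class up to the size of the largest one are assumptions, not details from the original post.

```python
import pandas as pd

def oversample_minority(df, label_col, seed=0):
    """Replicate rows at random until every class matches the largest class."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        parts.append(group)  # keep every original row
        n_extra = target - len(group)
        if n_extra > 0:
            # Draw the missing rows with replacement from the same class.
            parts.append(group.sample(n=n_extra, replace=True, random_state=seed))
    # Shuffle so the copies are not grouped together in the file.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Hypothetical label table: 4 dogs, 2 cats.
labels = pd.DataFrame({
    "id": ["img0", "img1", "img2", "img3", "img4", "img5"],
    "label": ["dog", "dog", "dog", "dog", "cat", "cat"],
})
balanced = oversample_minority(labels, "label")
```

The resulting frame contains duplicate rows by design, which is exactly what a loader that drops duplicates would undo.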

Many thanks!!

Adrian


(Jeremy Howard (Admin)) #2

There’s nothing I’ve implemented in fastai, sorry! Perhaps you could modify from_csv?


(Adrian Galdran) #3

Hi,

Thanks for answering. Sure, as soon as I find some time to dedicate, I will look into from_csv :slight_smile:

Cheers


(Miguel Perez Michaus) #4

About oversampling the minority class: even though I’m aware of papers proposing oversampling as a good idea, I have some doubts about it.

The first known big issue is that by replicating the minority class one reduces, sometimes heavily, the variance of that class. The consequence is easier overfitting of that class, or “asymmetric overfitting” (I don’t think the term exists :grinning: ). Anyway… the more asymmetric the overfitting, the more suboptimal your regularization, at least in theory. So, even if oversampling is “the lesser of some number of evils”, it can still be problematic.

Another doubt I have, specifically for the fastai framework: if we are loading images from a path, wouldn’t the image loader already be recycling the minority class when building the batches? (So no resampling would be needed in that case?)


(Ramesh Sampath) #5

Created a Pull Request for this fix - https://github.com/fastai/fastai/pull/43


(Ramesh Sampath) #6

The loader will not re-balance the ratio of classes. For example, if we have a 10:1 ratio of dogs to cats in our images CSV, then we may want to upsample cats to get a ratio of 10:5 so that both classes contribute more evenly towards the loss value. Otherwise the system might settle into a local minimum of predicting that everything is a dog, be right about 90% of the time, and have a reasonably small loss value.

The fastai batch loaders don’t re-balance the classes, so this feature of being able to upsample could be useful.
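
The 10:1 arithmetic above can be checked with a few lines of plain Python. This is a sketch only: `replication_factors` and `target_ratio` are hypothetical names, and rounding the number of copies up is an assumption.

```python
from collections import Counter
from math import ceil

def replication_factors(counts, target_ratio):
    """Copies of each row needed so that every class reaches at least
    target_ratio times the size of the largest class."""
    largest = max(counts.values())
    return {cls: max(1, ceil(target_ratio * largest / n)) for cls, n in counts.items()}

labels = ["dog"] * 10 + ["cat"] * 1
counts = Counter(labels)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = max(counts.values()) / len(labels)

# Going from 10:1 to 10:5 means each cat row appears 5 times.
factors = replication_factors(counts, target_ratio=0.5)
```

Here the always-dog baseline is right on 10 of 11 examples, which is the "about 90%" mentioned above.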


(Ramesh Sampath) #7

@agaldran - The pull request to allow duplicate rows (upsampling) is merged now. Try git pull and let us know if the upsampling technique helped with your problem or if there are other things you tried.


(James Requa) #8

@ramesh this is great! I too have been looking for this functionality.

One other issue to consider though is that we may also need to update get_cv_idxs because I think it could be bad if any of the upsampled minority class (duplicate images) ends up in both the training and validation sets. I have run into this trouble myself because it will appear during training that both training and validation have great accuracy but then perform poorly on a test set because, in fact, the model was just predicting on images it has already seen.
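
One way to avoid that kind of leak, sketched under assumptions: pick validation rows by unique image id instead of by row, so every copy of an image lands on the same side of the split. The helper `split_by_unique_id` is hypothetical and is not part of fastai; the index list it returns is meant to play the role of the validation indices discussed in this thread.

```python
import random

def split_by_unique_id(ids, val_frac=0.2, seed=0):
    """Return (train_idxs, val_idxs) so that all rows sharing an id stay together."""
    unique_ids = sorted(set(ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_val = max(1, int(len(unique_ids) * val_frac))
    val_ids = set(unique_ids[:n_val])
    train_idxs = [i for i, x in enumerate(ids) if x not in val_ids]
    val_idxs = [i for i, x in enumerate(ids) if x in val_ids]
    return train_idxs, val_idxs

# Hypothetical upsampled table: two cat images repeated 5 times each, ten dog images.
ids = ["cat0"] * 5 + ["cat1"] * 5 + [f"dog{i}" for i in range(10)]
train_idxs, val_idxs = split_by_unique_id(ids)

# Ids that appear on both sides of the split (should be empty).
leaked = {ids[i] for i in train_idxs} & {ids[i] for i in val_idxs}
```

Note that the validation side may still contain several copies of the same image, so de-duplicating it before computing metrics is probably worthwhile.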


(Adrian Galdran) #9

Yes, @jamesrequa, that is happening to me also. I have 99% f2-score in train and val, but when I compute externally my average AUC across classes, it is much lower, due to the class imbalance which distorts the validation loss.

In the beginning I tried to handle the class imbalance at the “pandas level”: first separating a train and a validation set from the original dataframe, then under/oversampling separately in each set, shuffling, re-merging the dataframes, and, instead of calling get_cv_idxs, passing the last 20% of the indices, counting backwards from n. Super error-prone, indeed.
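
A rough sketch of that “pandas level” procedure, with some simplifications that are assumptions rather than what was actually run: only the training part is oversampled, each row is repeated a whole number of times, the validation part is left untouched, and the split is not stratified. The names `split_then_oversample`, `id` and `label` are hypothetical.

```python
import pandas as pd

def split_then_oversample(df, label_col, val_frac=0.2, seed=0):
    """Split first, oversample the training part only, then put validation rows last."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_val = int(len(shuffled) * val_frac)
    val, train = shuffled.iloc[:n_val], shuffled.iloc[n_val:]

    # Repeat each training row so its class roughly matches the largest class.
    class_size = train[label_col].map(train[label_col].value_counts())
    reps = (class_size.max() // class_size).to_numpy()
    train_up = train.loc[train.index.repeat(reps)]

    # Validation rows go at the end, so their indices are the tail of the frame.
    merged = pd.concat([train_up, val]).reset_index(drop=True)
    val_idxs = list(range(len(train_up), len(merged)))
    return merged, val_idxs

# Hypothetical label table: 20 dogs, 5 cats.
labels = pd.DataFrame({
    "id": [f"img{i}" for i in range(25)],
    "label": ["dog"] * 20 + ["cat"] * 5,
})
merged, val_idxs = split_then_oversample(labels, "label")
```

Because the split happens before any row is copied, no image can appear in both parts, and the validation indices are simply the last rows of the merged frame.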

In the end, this path was blocked by the internal duplicate dropping in from_csv, and I forgot about it. But now that @ramesh has added this functionality, I will try it tonight and see what happens.

Thanks!


(Arun Prakash) #10

Can anyone share the paper that discusses oversampling approaches as Jeremy mentioned in the classes? Thanks


(maddogS) #11

I’m dealing with the same problem, but with NLP data. I have read that SMOTE, or SMOTE combined with ENN or Tomek link removal, might be a good idea for this. For your reference, this is the standard package that implements them: https://github.com/scikit-learn-contrib/imbalanced-learn
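
To make the SMOTE idea concrete, here is a simplified NumPy sketch of its core step: a synthetic point is placed somewhere on the segment between a minority example and one of its nearest minority neighbours. This is not the imbalanced-learn implementation, which is the one to use in practice. The function name `smote_like` is hypothetical, and it assumes the data is already in numeric feature-vector form (for text, some vector representation would be needed first).

```python
import numpy as np

def smote_like(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Requires k < len(X_min).
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances; a point is never its own neighbour.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a base point and one of its k nearest neighbours for each new sample.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical 2-D minority-class features.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_minority, n_new=5)
```

Unlike plain duplication, the new points are not exact copies, although they still stay inside the region spanned by the existing minority examples.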

I’m exploring a two-stage model:
1. Is it the majority class? (Train this stage for a low false-positive rate.)
2. If it’s not in the majority class, which minority class is it? (I only have 2.)
I thought about using an auto-encoder trained on one of the minority classes as an “anomaly detector”: basically, if you feed a minority-class data point through the auto-encoder and it comes out “mangled”, then it belongs to the other minority class.
Given that I only have 200 labelled examples of each minority class, I have some doubts about whether the auto-encoder idea will work. Would love to hear your thoughts…

Also, does anyone have ideas or experience generating NLP data to augment minority classes? (Other than SMOTE…)

thx all!


(William) #12

You’ve probably already found it :smiley:
but I think he meant this one: https://arxiv.org/pdf/1710.05381.pdf


(Sandeep Gupta) #13

Hi @ramesh - It seems this issue persists in the latest code. Can you please confirm?

I have been trying to upsample the minority class, but it seems that the duplicate records are getting removed.


(Vladimir Kovačević) #14

Would it make sense (and, if not, why not) to do something like training-time augmentation of the images that belong to the under-represented part of the dataset? For example, add flipped or zoomed versions of these images to the training set instead of just duplicating them?
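
As an illustration of that suggestion, the sketch below produces two simple variants of an image stored as a NumPy array: a horizontal flip and a centre crop standing in for a “zoom”. The helper `augmented_copies` and the crop margin are assumptions made for the example; it is not a fastai feature, and the cropped image would still need resizing back to the network’s input size.

```python
import numpy as np

def augmented_copies(image):
    """Return a horizontally flipped copy and a centre-cropped copy of an H x W x C array."""
    flipped = image[:, ::-1]  # reverse the width axis
    h, w = image.shape[:2]
    dh, dw = h // 8, w // 8   # trim one eighth from each border
    zoomed = image[dh:h - dh, dw:w - dw]
    return flipped, zoomed

# Hypothetical 8 x 8 RGB image with distinct pixel values.
image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
flipped, zoomed = augmented_copies(image)
```

Adding such variants for minority-class images gives the network slightly different inputs each time, rather than identical repeated rows.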