How to duplicate training examples to handle class imbalance

agaldran · November 26, 2017, 3:54pm

Hi!

I have a 21-class classification problem with a huge class imbalance. I remembered @jeremy mentioning that the best strategy for deep neural networks in this case is to oversample the minority class, so I replicated rows of my csv with pandas before passing it to the data loader. However, it seems that ImageClassifierData.from_csv removes duplicate rows from my csv without asking whether I want that, and my replicated examples are gone after calling it.
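For readers following along, the replication step described above might look roughly like this. It is only a sketch: it uses plain Python lists of (filename, label) pairs as a stand-in for the csv rows, and `oversample_rows` is a hypothetical helper, not something from fastai or pandas.

```python
import random
from collections import Counter, defaultdict

def oversample_rows(rows, seed=0):
    """Duplicate minority-class rows until every class matches the largest one.

    `rows` is a list of (filename, label) pairs standing in for csv rows.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    target = max(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority count.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

rows = [(f"dog_{i}.jpg", "dog") for i in range(10)] + [("cat_0.jpg", "cat")]
balanced = oversample_rows(rows)
print(Counter(label for _, label in balanced))
```

Any loader that drops duplicate rows would undo exactly this step, which is the behavior reported here.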

Is there any way to avoid that behavior of the data loading function? Or, otherwise, to replicate minority examples with an already existing fast.ai utility?

Many thanks!!

Adrian

jeremy · November 26, 2017, 10:10pm

There’s nothing I’ve implemented in fastai, sorry! Perhaps you could modify from_csv?

agaldran · November 26, 2017, 11:06pm

Hi,

Thanks for answering. Sure, as soon as I find some time, I will look into from_csv.

Cheers

miguel_perez · November 27, 2017, 12:45am

About oversampling the minority class: even though I'm aware of papers proposing oversampling as a good idea, I have some doubts about it.

The first known big issue is that by replicating the minority class one reduces, sometimes heavily, the variance of that class. The consequence is easier overfitting of that class, or “asymmetric overfitting” (I don't think the term exists). Anyway, the more asymmetric the overfitting, the more suboptimal your regularization, at least in theory. So even if oversampling is “the lesser of some number of evils”, it can still be problematic.

Another doubt I have, specifically for the fastai framework: if we are loading images from a path, wouldn't the image loader already be recycling the minority class when building the batches? (So no resampling would be needed in that case?)

ramesh · November 27, 2017, 1:01am

Created a Pull Request for this fix - https://github.com/fastai/fastai/pull/43

ramesh · November 27, 2017, 1:07am

The loader will not re-balance the ratio of classes. For example, if we have a 10:1 ratio of dogs to cats in our images CSV, we may want to upsample cats to get a ratio of 10:5 so that both classes contribute more evenly to the loss value. Otherwise the model might settle on a local minimum of predicting that everything is a dog, be right about 90% of the time, and have a reasonably small loss value.
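The arithmetic behind that example can be checked with a few lines. This is just an illustration with made-up counts (1000 dogs, 100 cats); both helper functions are hypothetical names used only for this sketch.

```python
def majority_baseline_accuracy(counts):
    """Accuracy of a model that always predicts the most frequent class."""
    return max(counts.values()) / sum(counts.values())

def copies_needed(counts, minority, majority, target_ratio):
    """Extra minority rows needed so that minority/majority reaches target_ratio."""
    wanted = round(counts[majority] * target_ratio)
    return max(0, wanted - counts[minority])

counts = {"dog": 1000, "cat": 100}  # 10:1, made-up numbers
baseline = majority_baseline_accuracy(counts)
extra_cats = copies_needed(counts, "cat", "dog", target_ratio=0.5)
print(f"always-dog accuracy: {baseline:.3f}")
print(f"extra cat rows for 10:5: {extra_cats}")
```

With these counts the always-dog baseline is 10/11 (about 0.91), and reaching 10:5 takes 400 additional cat rows.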

The fastai batch loaders don’t re-balance the classes, so being able to upsample could be useful.

ramesh · November 27, 2017, 6:13am

@agaldran - The pull request to allow duplicates (upsampling) is merged now. Try git pull and let us know whether upsampling helped with your problem or if there are other things you tried.

jamesrequa · November 27, 2017, 5:53pm

@ramesh this is great! I too have been looking for this functionality.

One other issue to consider though is that we may also need to update get_cv_idxs, because I think it could be bad if any of the upsampled minority class images (the duplicates) end up in both the training and validation sets. I have run into this trouble myself: during training, both training and validation accuracy look great, but the model then performs poorly on a test set because, in fact, it was just predicting on images it had already seen.
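A quick way to check for this kind of leakage is to look for filenames that appear on both sides of the split. The sketch below assumes you have the list of filenames (after upsampling) and the validation indices; `leaked_files` is a hypothetical helper, not a fastai function.

```python
def leaked_files(filenames, val_idxs):
    """Return filenames that appear in both the training and validation rows."""
    val_idxs = set(val_idxs)
    val_files = {f for i, f in enumerate(filenames) if i in val_idxs}
    train_files = {f for i, f in enumerate(filenames) if i not in val_idxs}
    return sorted(train_files & val_files)

# cat_0.jpg was duplicated by upsampling and one copy landed in validation.
filenames = ["dog_0.jpg", "dog_1.jpg", "cat_0.jpg", "cat_0.jpg", "cat_0.jpg", "dog_2.jpg"]
val_idxs = [4, 5]
leaks = leaked_files(filenames, val_idxs)
print(leaks)  # ['cat_0.jpg']
```

An empty list means no duplicated image straddles the split.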

agaldran · November 27, 2017, 6:09pm

Yes, @jamesrequa, that is happening to me also. I have 99% f2-score in train and val, but when I compute externally my average AUC across classes, it is much lower, due to the class imbalance which distorts the validation loss.

In the beginning I tried to handle the class imbalance at the “pandas level”: first separating a train and a validation set from the original dataframe, then under/oversampling separately within those sets, shuffling, re-merging the dataframes and, instead of calling get_cv_idxs, passing the last 20% of the row indices (counting backwards from n) as the validation indices. Super error-prone, indeed.
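A rough sketch of that split-first idea, in plain Python rather than pandas so it runs without dependencies. It is a simplified variant: only the training part is oversampled, the validation rows are left untouched, and the split is a plain random one (not stratified). `split_then_oversample` is a hypothetical name.

```python
import random
from collections import Counter, defaultdict

def split_then_oversample(rows, val_frac=0.2, seed=0):
    """Split first, oversample only the training rows, put validation last."""
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)
    n_val = int(len(rows) * val_frac)
    val, train = rows[:n_val], rows[n_val:]

    # Group the original training rows by label, then top up smaller classes.
    by_label = defaultdict(list)
    for row in train:
        by_label[row[1]].append(row)
    target = max(len(group) for group in by_label.values())
    for group in by_label.values():
        train.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(train)

    merged = train + val
    val_idxs = list(range(len(train), len(merged)))  # validation rows sit at the end
    return merged, val_idxs

rows = [(f"dog_{i}.jpg", "dog") for i in range(40)] + [(f"cat_{i}.jpg", "cat") for i in range(10)]
merged, val_idxs = split_then_oversample(rows)
print(len(merged), len(val_idxs))
print(Counter(label for _, label in merged[:val_idxs[0]]))
```

Because duplication happens after the split, no copied image can end up on both sides, and the validation set keeps its original class ratio.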

In the end, this path was blocked because of the inner duplicate dropping of from_data, and I forgot about it. But now that @ramesh has added this functionality, I will try it tonight and see what happens.

Thanks!

arunslb123 · January 28, 2018, 12:41am

Can anyone share the paper that discusses oversampling approaches as Jeremy mentioned in the classes? Thanks

maddogS · January 28, 2018, 12:54am

I’m dealing with the same problem, but with NLP data. I have read that SMOTE, or SMOTE+ENN / Tomek link removal, might be a good idea for this. For your reference, this is the standard package that has these implemented: https://github.com/scikit-learn-contrib/imbalanced-learn

I’m exploring a two-stage model:
1. Is it the majority class? (Train this stage for a low false-positive rate.)
2. If it’s not in the majority class, which minority class is it? (I only have 2.)

I thought about using an auto-encoder trained on one of the minority classes as an “anomaly detector”: basically, if you feed a minority class data point through the auto-encoder and it comes out “mangled”, then it’s the other minority class.
Given I only have 200 labelled examples of each minority class, I have some doubts about whether the auto-encoder idea will work. Would love to hear your thoughts.
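The control flow of such a two-stage setup could be sketched as below. The two "models" are toy keyword rules standing in for trained classifiers, and every name here is made up for illustration.

```python
def two_stage_predict(x, is_majority, pick_minority):
    """Stage 1 filters out the majority class; stage 2 separates the minorities."""
    if is_majority(x):
        return "majority"
    return pick_minority(x)

# Toy stand-ins for trained models, keyed on single keywords.
def is_majority(text):
    return "refund" not in text and "outage" not in text

def pick_minority(text):
    return "minority_a" if "refund" in text else "minority_b"

texts = ["hello there", "refund please", "outage again"]
predictions = [two_stage_predict(t, is_majority, pick_minority) for t in texts]
print(predictions)  # ['majority', 'minority_a', 'minority_b']
```

In the auto-encoder variant, `pick_minority` would compare reconstruction error against a threshold instead of checking a keyword.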

Also, does anyone have ideas or experience generating NLP data to augment minority classes? (Other than SMOTE.)

thx all!

wnurmi · March 16, 2018, 8:11pm

You’ve probably already found it, but I think he meant this one: https://arxiv.org/pdf/1710.05381.pdf

sandeepgupta2 · December 13, 2018, 10:17am

Hi @ramesh - It seems this issue persists in the latest code. Can you please confirm?

I have been trying to upsample the minority class, but it seems that the duplicate records are getting removed.

vladimirk · December 22, 2018, 1:42am

Would it make sense (and, if not, why not) to do something like training-time augmentation of the images that belong to the under-represented classes? For example, adding flipped or zoomed versions of these images to the training set instead of just duplicating them?
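As a toy illustration of the idea, here is what adding flipped copies (instead of exact duplicates) might look like. Images are represented as nested lists of pixel values purely to keep the sketch dependency-free; `hflip` and `augment_minority` are hypothetical helpers, and zooming is left out.

```python
def hflip(image):
    """Mirror an image left to right; `image` is a list of pixel rows."""
    return [row[::-1] for row in image]

def augment_minority(samples, minority_label):
    """Add a flipped copy of every minority-class image instead of an exact duplicate."""
    extra = [(hflip(img), label) for img, label in samples if label == minority_label]
    return samples + extra

samples = [([[1, 2], [3, 4]], "cat"), ([[5, 6], [7, 8]], "dog")]
augmented = augment_minority(samples, "cat")
print(len(augmented))    # 3
print(augmented[-1][0])  # [[2, 1], [4, 3]]
```

If the data loader already applies random transforms when images are loaded, duplicated rows would come out slightly different each epoch anyway, so it may be worth checking that first.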