Methods for dealing with class imbalance?

Hi all,

Just wanted to have a quick discussion on this topic, as I am learning to deal with it in a tabular model. I know the easiest way to deal with class imbalance is to oversample the minority class in the train set. But I’m wary of this for my task, as my largest class has 290,000 items and my smallest class has 40,000 items. That would mean each minority sample is copied ~7 times! Would it not be better to instead take the mean over all the class sizes, then over- or under-sample randomly to fill the gap? Now the target is 165,000 items, and each minority item is only copied ~4 times. Is there a caveat to doing this besides losing some data from the undersampled classes? Are there better ways?
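For concreteness, here is a rough sketch of the sample-to-the-mean idea in pandas (the column name is made up, and the numbers in the comment are just from my example above):

```python
import pandas as pd

def resample_to_mean(df: pd.DataFrame, label_col: str, seed: int = 42) -> pd.DataFrame:
    """Over- or under-sample every class to the mean class size."""
    counts = df[label_col].value_counts()
    target = int(counts.mean())  # e.g. (290_000 + 40_000) / 2 = 165_000
    parts = []
    for cls, n in counts.items():
        grp = df[df[label_col] == cls]
        # replace=True duplicates rows for small classes; large classes are subsampled
        parts.append(grp.sample(n=target, replace=n < target, random_state=seed))
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle rows
```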

Another thought I had: we have AUROC to help evaluate this a bit, but what if we had a different loss function where, at the beginning of training, we store each class’s relative abundance in relation to the most abundant class? Then, when we calculate our losses, we multiply them by 1 + (1 − relative percentage), so the losses from all classes but the largest are weighted more heavily.
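In code, that weighting idea would look something like this rough PyTorch sketch (untested; the counts are just the ones from my example above, and `nn.CrossEntropyLoss`’s built-in per-class weight argument does the multiplication):

```python
import torch
import torch.nn as nn

# class counts from the example above: 290k for the largest class, 40k for the smallest
counts = torch.tensor([290_000.0, 40_000.0])
rel = counts / counts.max()      # relative abundance vs. the most abundant class
weights = 1 + (1 - rel)          # largest class -> 1.0, smallest -> ~1.86

# CrossEntropyLoss accepts per-class weights, so the rarer class is penalized more
loss_fn = nn.CrossEntropyLoss(weight=weights)
```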

Thank you for your input!

Zach

Edit: When I have time, I will run a test of how each of the three sampling techniques does on the Adult dataset as a comparison of sorts.


Okay, so I ran a quick experiment myself; the GitHub repo is here:

I got very interesting results. The class imbalance in the Adult dataset is quite large: 7,841 samples in > 50k and 27,720 samples in < 50k. The notebook and Excel sheet both show what I did and how, but when I averaged 10 runs with the three methods above, here is what I got:

Our regular tabular model did achieve the usual 84.14% average accuracy on the test set, BUT when the error is broken down by class, 46% of the > 50k samples were incorrect, whereas < 50k had a mere 7% error. For such high accuracy, we can see it’s mostly just favoring one class and almost randomly guessing the other!
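For anyone wanting to reproduce that breakdown, it’s just 1 − recall per class from the confusion matrix; a toy sketch with stand-in arrays:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# toy stand-ins; in the notebook these would be the test labels and predictions
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)
per_class_error = 1 - cm.diagonal() / cm.sum(axis=1)  # 1 - recall for each class
print(per_class_error)  # class 0: 0.0, class 1: ~0.67
```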

Now on to over- and under-sampling. When I started doing this, I looked at the distributions of the numerical values to ensure the mean and standard deviation were still being preserved, which they were. I started to see a much more balanced error between the classes when performing both, but the overall accuracy was lowered as well!
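The check itself is just a before/after comparison of summary statistics, something like this toy sketch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.normal(38, 13, 1_000)})
resampled = df.sample(n=500, random_state=0)  # stand-in for any resampled train set

# the mean and standard deviation should stay close if the resampling is unbiased
print(df["age"].agg(["mean", "std"]))
print(resampled["age"].agg(["mean", "std"]))
```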

When I under-sampled, < 50k turned into 22% error and > 50k into 18% error. Much, much closer. But the overall accuracy was 78%; I had dropped 6% by balancing. The most obvious explanation is that the large chunk of randomly dropped data had value and meaning in it. But this approach also had comparatively high accuracy versus the other resampling methods, so perhaps we also did a little unintentional data cleaning?

When I over-sampled, < 50k error went up to 25% and > 50k went down to 14%, which makes sense, as there were now 7× as many > 50k samples in the dataset whereas < 50k did not change. Overall accuracy plummeted, though, to 76.93%. This most likely correlates with the class distribution in the test set, where > 50k accounted for only 23%.

Then finally, when I did a balanced sample, I did about as well as the under-sampling (in fact slightly better), with 22% error on < 50k, 18% error on > 50k, and an overall accuracy of 78.47%.

Clearly, the choice of over- and under-sampling method does have an impact on how the model performs, as does choosing whether to resample at all. The next thing I want to explore is the loss functions. Please let me know if you see anything wrong in my analysis, as I did make a few errors along the way and had to go back :slight_smile:

The table below shows the averages after 10 runs and related information for them:

| Sampling method | Overall accuracy | < 50k error | > 50k error |
| --- | --- | --- | --- |
| None (baseline) | 84.14% | 7% | 46% |
| Under-sampling | 78.00% | 22% | 18% |
| Over-sampling | 76.93% | 25% | 14% |
| Balanced (to mean) | 78.47% | 22% | 18% |


Hi @muellerzr, I am also working on dealing with multi-class imbalance. I am trying to do oversampling during training with the PyTorch method in this link, but I am unable to get it working. Please let me know if you could implement it successfully.

Hi @msrdinesh,
You can see this thread for image data oversampling.
@tcapelle was kind enough to provide me with a working example here.
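In case the links go stale: the usual PyTorch building block for this is `WeightedRandomSampler`; a minimal sketch with toy tensors (details may differ from the linked example):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# toy imbalanced dataset: 90 samples of class 0, 10 of class 1
x = torch.randn(100, 5)
y = torch.cat([torch.zeros(90, dtype=torch.long), torch.ones(10, dtype=torch.long)])
ds = TensorDataset(x, y)

# weight each sample by the inverse frequency of its class
sample_weights = (1.0 / torch.bincount(y).float())[y]

# replacement=True lets minority samples be drawn many times per epoch
sampler = WeightedRandomSampler(sample_weights, num_samples=len(ds), replacement=True)
loader = DataLoader(ds, batch_size=16, sampler=sampler)
```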
Hope this helps.
Cheers.


You may want to try FocalLoss.
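For reference, a minimal focal loss sketch in plain PyTorch (gamma = 2 is just the paper’s default; implementations vary):

```python
import torch
import torch.nn.functional as F

class FocalLoss(torch.nn.Module):
    """Focal loss (Lin et al. 2017): down-weights easy, well-classified examples."""
    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        ce = F.cross_entropy(logits, target, reduction="none")  # -log(p_t)
        p_t = torch.exp(-ce)                    # probability of the true class
        return ((1 - p_t) ** self.gamma * ce).mean()
```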


Thank you @abyaadrafid for sharing!

There is also a library called imbalanced-learn.
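For example, its SMOTE over-sampler (note the import name is `imblearn`) can be used like this toy sketch:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# toy imbalanced data: 900 majority vs. 100 minority samples
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 4))
y = np.array([0] * 900 + [1] * 100)

# SMOTE synthesizes new minority points instead of duplicating existing ones
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_res))  # [900 900]
```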


Hi.
A couple of weeks ago I was also working on the same imbalanced-classes issue. Unfortunately it was a dataset I’m not allowed to share, but its properties are as follows: the dataset consists of 1.7M samples, each row belongs to one of 3 classes (the dependent variable), and the class ratio is 50:6:1.
I conducted some experiments like yours. My accuracy also dropped (from 91% to 80%) after oversampling, but the confusion matrix became much more balanced (in percentage terms). So I reached the same conclusion: initially it was more beneficial for the model to just predict the first class in every “not sure” situation (while for the business, class 3 is what we are trying to catch).
But as I dug deeper and analyzed the second (balanced) model with feature importance (removing features one by one, retraining a new model, and calculating the accuracy difference), I concluded that because each class-3 sample was repeated 50 times, the model had managed to “hardwire” itself to them. It looks like it was able to recognize each sample (in fact each user, even though the user was removed from the data) by a bunch of different features (not connected with the user at first sight) that together almost uniquely identify each class-3 sample. I really don’t know whether it will work in real life or not (I’d rather wait a couple of months and collect more data; maybe that will help).
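For reference, the importance procedure I mean is roughly the following sketch, with a scikit-learn classifier standing in for my actual model (and assuming purely numeric features):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def drop_column_importance(df: pd.DataFrame, label_col: str, seed: int = 0) -> dict:
    """Retrain without each feature and record how much accuracy drops."""
    X, y = df.drop(columns=[label_col]), df[label_col]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed, stratify=y)
    base = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr).score(X_te, y_te)
    importances = {}
    for col in X.columns:
        m = RandomForestClassifier(random_state=seed).fit(X_tr.drop(columns=[col]), y_tr)
        importances[col] = base - m.score(X_te.drop(columns=[col]), y_te)
    return importances
```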

TL;DR
I was working with a real-data example of 1.7M samples and a class ratio of 50:6:1. After some digging into the model, I concluded that in my case such heavy oversampling, where each class-3 sample is repeated 50 times, is not a good thing; the model tends to “hardwire” to those samples. Maybe undersampling will work better; I will continue my experiments when I gather more data.
But for now I’m upset :frowning:
