Dealing with class imbalance in image classification

sophia · November 14, 2020, 12:05am

Hi, this question isn’t necessarily new, but I was wondering whether there is a good approach or recommendation by now?

So short overview: I want to predict classes from pathology cancer slides.
The classes I want to predict are heavily imbalanced: I have 350+ cases from which I want to predict 4 different classes (single-label classification, though I may also try label smoothing later with multiclass prediction).

The rarest class I want to predict has a little under 20 cases; the most common class has about 140 cases.

From each case I will use about 50 tiles extracted from the annotation; each tile is 512×512 pixels for now (I may change the number of tiles per case later).

I wanted to use something like SMOTE, since from what I have read random oversampling isn’t as effective, but apparently SMOTE is not directly usable for image classification? Nonetheless, it was used in an X-ray detection problem a while ago (they computed 1024 features per image and then applied SMOTE to those features): https://www.medrxiv.org/content/10.1101/2020.04.13.20063461v1
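To illustrate the idea of applying SMOTE to extracted features rather than to raw pixels, here is a minimal sketch in plain Python. It is not the method from the linked preprint: the function name `smote_features`, the parameter `k`, and the toy 2-D vectors are assumptions for illustration only. In practice the vectors would be the per-tile features produced by a network.

```python
import random

def smote_features(minority, n_new, k=3, seed=0):
    """Create n_new synthetic feature vectors by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest neighbours of `base`, excluding itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sq_dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position on the segment between base and other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Toy 2-D "features" (hypothetical); real ones would be e.g. 1024-D per tile.
minority_features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote_features(minority_features, n_new=6)
```

Because the synthetic points live in feature space, they can only be used to train a classifier on top of the features, not to retrain the feature extractor itself.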

I also found a repo that uses SMRT (Synthetic Minority Reconstruction Technique), which is apparently similar to SMOTE and applicable to images?

I could also use image augmentation such as blurring, elastic deformation, color transforms, etc., but I was wondering whether there is something better to use?

I also read something about using variational autoencoders to generate more synthetic data, but I don’t know anything about variational autoencoders yet.

Maybe someone has an idea or a recommendation?

vferrer · November 16, 2020, 10:17am

I would recommend searching the forums; there are several related threads. You can also look at Kaggle competitions to see how Kagglers handle class imbalance.

Archaeologist · December 24, 2020, 6:55pm

Have you tried undersampling/oversampling?

A combination of both helped me substantially with a similar problem with aerial image classification.

Undersampling: simply remove a random portion of your images in the majority class(es).

Oversampling: randomly duplicate / clone images in your minority class
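A minimal sketch of combining both, assuming you work with a list of labels and pick indices from it. The function name `rebalance`, the `target_per_class` parameter, and the example counts are hypothetical:

```python
import random
from collections import Counter, defaultdict

def rebalance(labels, target_per_class, seed=0):
    """Return shuffled indices with exactly target_per_class items per class.
    Larger classes are undersampled (random subset, no duplicates);
    smaller classes keep every item and are padded with random duplicates."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for idxs in by_class.values():
        if len(idxs) >= target_per_class:
            chosen.extend(rng.sample(idxs, target_per_class))
        else:
            extra = rng.choices(idxs, k=target_per_class - len(idxs))
            chosen.extend(idxs + extra)
    rng.shuffle(chosen)
    return chosen

# Hypothetical label list: 140 items of one class, 20 of another.
labels = ["common"] * 140 + ["rare"] * 20
balanced = rebalance(labels, target_per_class=60)
counts = Counter(labels[i] for i in balanced)
```

Rebalance only the training split. If several images come from the same source (for example tiles from one case), it is probably safer to split by source first so that duplicates do not end up in the validation set.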

sophia · December 29, 2020, 3:51pm

Thanks for the reply! So far I have only tried undersampling; I will try oversampling soon too. I am currently working on a different problem set with the same dataset, and I recently realized that in that problem set some tiles have a disproportionately high loss compared to other tiles from the same case (this doesn’t occur in all cases). So I am trying to figure out a way to either use an adapted loss function that deals directly with potentially noisy labels, or to remove the potentially noisily labelled tiles (the tiles with a high loss) afterwards. I might have this “noisy” problem with the original problem set too, which I want to look into again soon.
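For the second option (getting rid of high-loss tiles afterwards), here is a rough sketch of how the flagging could look, assuming per-tile losses have already been collected. The function name `flag_high_loss_tiles`, the `factor` threshold, and the loss values are made up for illustration; comparing against the median loss of the same case is just one possible heuristic.

```python
from collections import defaultdict
from statistics import median

def flag_high_loss_tiles(records, factor=3.0):
    """records: iterable of (case_id, tile_id, loss).
    Flags tiles whose loss exceeds factor * the median loss of their case."""
    by_case = defaultdict(list)
    for case_id, tile_id, loss in records:
        by_case[case_id].append((tile_id, loss))

    flagged = []
    for case_id, tiles in by_case.items():
        case_median = median(loss for _, loss in tiles)
        for tile_id, loss in tiles:
            if loss > factor * case_median:
                flagged.append((case_id, tile_id))
    return flagged

# Made-up losses: one tile in case_a stands out, case_b is uniformly hard.
records = [
    ("case_a", "t1", 0.20), ("case_a", "t2", 0.25), ("case_a", "t3", 2.10),
    ("case_b", "t1", 0.90), ("case_b", "t2", 1.00), ("case_b", "t3", 1.10),
]
suspects = flag_high_loss_tiles(records)
```

Comparing within a case rather than against a global threshold means a case that is hard overall is not flagged wholesale. Flagged tiles would still need a manual look before removing them, since a high loss can also mean a hard but correctly labelled tile.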

Have a good start to the new year!