Training a CNN model with class-imbalanced data

Hello there,

I was curious how one would approach an 18-class image classification problem where the number of examples per class varies from 1,200 to 13,000. Is there a “best practice” strategy or a fast.ai automation for this situation? Thanks!

Best,
Andrei

Up-weight the under-represented classes by re-sampling the data until the classes are even. Think of this as an augmentation of the data. This has worked well for me in the past.
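
A minimal sketch of what I mean, assuming you have a dict mapping each class name to its list of image file paths (the `oversample` helper and the `files_by_class` name are just for illustration):

```python
import math
import random

def oversample(files_by_class):
    """Replicate file paths of the smaller classes until every class
    matches the size of the largest one."""
    target = max(len(files) for files in files_by_class.values())
    balanced = {}
    for cls, files in files_by_class.items():
        reps = math.ceil(target / len(files))
        pool = (files * reps)[:target]  # tile the list, then trim to the target size
        random.shuffle(pool)
        balanced[cls] = pool
    return balanced
```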

Hi, Andrei –

Is the data available, or are you able to share it? I have an implementation of what’s called Class Rectification Loss that I’d like to showcase with some real data. Would be happy to share the code with you if I can use your dataset as a testbed.


Hi Andrew,

The data I’m using is available on Kaggle in the Painter by Numbers competition: https://www.kaggle.com/c/painter-by-numbers. It’s a big paintings database with style, genre, and date of creation for almost 80,000 works of art. The competition focused on creating a model for forgery detection, but my goal is to create a model that can accurately detect the genre of a painting. And, as I mentioned, the classes are heavily imbalanced (e.g. there are about 13,000 portrait paintings, 2,500 still life paintings, and 1,300 flower paintings).

I would love to try any solution that might get the work done :)

@bfarzin, when you mention re-sampling, are you referring to actual physical augmentation (creating new image files for the under-represented classes), or to “virtual” augmentation of some sort using the ImageDataBunch and the ds_tfms transforms?

Thanks a lot guys!

I did it by just replicating multiple copies of the same (under-sampled) class. So if I had 500 examples and there were 1,000 in the “bigger” class, I would sample each one twice. I did that with the list of files (or inside a dataframe before loading the data). The result was “balanced” classes.

I would then apply data augmentation to all the images in all the classes now that they are balanced.
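
Roughly, the dataframe version looks like this. This is a sketch assuming fastai v1 (where ImageDataBunch and ds_tfms live), a hypothetical data folder, and a CSV with hypothetical `fname`/`label` columns:

```python
from pathlib import Path

import pandas as pd
from fastai.vision import ImageDataBunch, get_transforms

path = Path('data/paintings')         # hypothetical data folder
df = pd.read_csv(path/'labels.csv')   # hypothetical CSV: 'fname' (file path) and 'label' (genre)

# oversample: duplicate rows of each smaller class until it matches the largest one
max_count = df['label'].value_counts().max()
parts = []
for label, group in df.groupby('label'):
    n_extra = max_count - len(group)
    if n_extra > 0:
        group = pd.concat([group, group.sample(n_extra, replace=True, random_state=42)])
    parts.append(group)
balanced_df = pd.concat(parts).sample(frac=1, random_state=42).reset_index(drop=True)

# let ds_tfms handle the "virtual" augmentation on the now-balanced data
data = ImageDataBunch.from_df(path, balanced_df, fn_col='fname', label_col='label',
                              ds_tfms=get_transforms(), size=224, bs=64)
```

One thing to watch: if you let from_df carve out the validation split for you, duplicated rows can leak between train and validation, so it’s safer to balance only the training files and keep a clean hold-out set.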

As always, this might not work well for your particular use case. And you might over-estimate the less frequent classes once you have a fitted model (since the model now “expects” the lower-sampled cases to be as common as the others.)

I am curious what you find out!

Ok, I’ll try it, thanks!