CNN predictions seem to favour the most prominent class in the dataset

Hi,

Just checking if anyone has trained a CNN with a skewed dataset? In other words, I have a set of 5 classes where, for one of those classes, I have a much higher number of training images than for the others (let's say the distribution of images across the 5 classes is 60%, 10%, 10%, 10%, 10%).
What I am finding is that training results in over-predicting this particular class (hardly any predictions are made for the others). For example, the network predicts this one prominent class 95% of the time.

In this case, would it be correct to remove some training images from this more prominent class to create more balance?
Also, during validation/testing, is it sufficient to just naively sample the dataset at random when we have a very prominent class? Or would it be better during testing/training to pick out a number of images that are fairly distributed over the dataset labels?

Thanks

Bilal

1 Like

If you're doing classification with cross entropy, you can assign a balance weight to each class, so you shouldn't have to remove any data. Once the right weights are set, random sampling should be fine to use.
For evaluation, don't use accuracy as a metric; instead, use precision and recall.
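For example, in Keras (which you could be using) per-class weights can be passed to `model.fit` via the `class_weight` argument. A minimal sketch, with a toy model and synthetic data standing in for the real setup:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

num_classes = 5
# Synthetic stand-in for an imbalanced dataset: class 0 is 60% of samples.
y = np.random.choice(num_classes, size=1000, p=[0.6, 0.1, 0.1, 0.1, 0.1])
x = np.random.rand(1000, 32)

# Weight each class inversely to its frequency, so mistakes on the
# rare classes contribute more to the cross-entropy loss.
counts = np.bincount(y, minlength=num_classes)
class_weight = {i: len(y) / (num_classes * c) for i, c in enumerate(counts)}

model = Sequential([Dense(num_classes, activation='softmax', input_shape=(32,))])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(x, to_categorical(y, num_classes), epochs=5, class_weight=class_weight)
```

For precision and recall after training, `sklearn.metrics.classification_report(y_true, y_pred)` gives per-class figures in one call.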

2 Likes

You may be interested in https://www.svds.com/learning-imbalanced-classes/

2 Likes

Undersampling is a standard way to tackle data imbalance, but for DL-related tasks it can make sense to oversample the "small" classes instead…
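A minimal NumPy sketch of oversampling (array names here are placeholders): each smaller class is resampled with replacement up to the size of the largest one.

```python
import numpy as np

# Placeholder data: x are features/images, y are integer class labels.
y = np.random.choice(5, size=1000, p=[0.6, 0.1, 0.1, 0.1, 0.1])
x = np.random.rand(1000, 32)

# Resample every class (with replacement) up to the majority-class count.
target = np.bincount(y).max()
idx = np.concatenate([
    np.random.choice(np.where(y == c)[0], size=target, replace=True)
    for c in np.unique(y)
])
np.random.shuffle(idx)
x_balanced, y_balanced = x[idx], y[idx]
```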

1 Like

Thank you all!
I am using Keras v2.0.2 and found that precision/recall/F1 were removed; however, they can be implemented as custom metrics by copying the code from an older version of Keras on GitHub.
I have used maximum F1 as the criterion for early stopping, rather than accuracy.
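For anyone searching later: a batch-wise F1 in the style of the metric removed from Keras 2.x looks roughly like this (note it is computed per batch, so it only approximates the epoch-level F1):

```python
from keras import backend as K

def f1(y_true, y_pred):
    # Hard 0/1 predictions from the soft outputs, per batch.
    true_pos = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    pred_pos = K.sum(K.round(K.clip(y_pred, 0, 1)))
    poss_pos = K.sum(K.round(K.clip(y_true, 0, 1)))
    precision = true_pos / (pred_pos + K.epsilon())
    recall = true_pos / (poss_pos + K.epsilon())
    return 2 * precision * recall / (precision + recall + K.epsilon())
```

Compile with `metrics=[f1]`, then pass `EarlyStopping(monitor='val_f1', mode='max')` as a callback to stop on the best validation F1.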

Let me give you a practical idea that is very easy to test.
If your class distribution is 6/1/1/1/1, then during training let the CNN see the classes with fewer examples more often.
The idea is to have a 6/6/6/6/6 distribution during training.

A few ideas you can try:

  • Repeat the same images multiple times (6 times in your case)
  • Use an image generator to do augmentation like flipping, rotating, resizing, cropping … (see the sketch after this list)
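
A minimal sketch of the second idea with Keras's `ImageDataGenerator` (the array name and the augmentation settings are placeholders, not a recommendation):

```python
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

# Placeholder: images of one under-represented class, shape (N, H, W, C).
x_small = np.random.rand(100, 64, 64, 3)

datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True,
                             width_shift_range=0.1, height_shift_range=0.1)

# Keep drawing augmented batches until the class is ~6x its original size.
augmented = [x_small]
n = len(x_small)
for batch in datagen.flow(x_small, batch_size=len(x_small), shuffle=False):
    augmented.append(batch)
    n += len(batch)
    if n >= 6 * len(x_small):
        break
x_augmented = np.concatenate(augmented)
```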

Good luck and please share the result!