Oversampling leads to rapid overfitting

Hello,

I am currently trying out a CNN on a highly imbalanced image dataset. From what I've read, oversampling seems to be a good approach to tackling imbalanced data, so I oversampled my minority classes to match the single majority class.
For oversampling I duplicated the original images and slightly changed them using fastai’s recommended data augmentation techniques (flips, rotations, zooms, etc.); a rough sketch of the idea is below.
During training my train loss decreases rapidly - almost to 0 - while my validation loss and error rate decrease to a certain extent and then plateau. It looks like the network is memorizing the training data, which it probably has an easy time doing given the large number of “repetitive” images it sees (i.e. the loss of variance in my oversampled classes).
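For reference, this is roughly the idea behind my oversampling step (a simplified sketch: the `labels.csv`, column names, and the fastai v2 `ImageDataLoaders`/`aug_transforms` calls are placeholders for my actual pipeline, where I wrote the augmented copies to disk). Minority-class rows are duplicated until each class matches the majority, and the random augmentations then vary the repeated images:

```python
import pandas as pd
from fastai.vision.all import *   # assumes fastai v2; the v1 API (get_transforms) differs

# labels.csv is hypothetical: columns fname, label, is_valid
df = pd.read_csv('labels.csv')

train = df[~df.is_valid]
counts = train.label.value_counts()
target = counts.max()             # size of the single majority class

# Duplicate minority-class rows until every class has `target` rows.
# The same file just appears several times per epoch; the random
# batch_tfms (flips, rotations, zooms, ...) vary each occurrence.
parts = [train]
for cls, n in counts.items():
    parts.append(train[train.label == cls].sample(target - n, replace=True, random_state=42))
df_balanced = pd.concat(parts + [df[df.is_valid]], ignore_index=True)

dls = ImageDataLoaders.from_df(
    df_balanced, path='.', fn_col='fname', label_col='label', valid_col='is_valid',
    item_tfms=Resize(224),
    batch_tfms=aug_transforms(),  # fastai's recommended set of augmentations
)
```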

To combat this I have increased dropout in the last two layers, trying rates ranging from the default 0.25 & 0.5 up to 0.35 & 0.7. I also tried running the network on data that was oversampled to a smaller extent, as well as a smaller CNN. The outcome, however, was similar.
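In fastai terms the dropout change is just the `ps` argument on the learner (sketch assuming the v2 API and the `dls` from above; in v1 it is the same keyword on `cnn_learner`):

```python
from fastai.vision.all import *

# The default head has two dropout layers with probabilities ps/2 and ps,
# so ps=0.7 gives the 0.35 & 0.7 configuration mentioned above.
learn = cnn_learner(dls, resnet34, metrics=error_rate, ps=0.7)
learn.fine_tune(10)
```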

Has anyone else experienced strong overfitting with oversampled data? If so, did you find a way around it? Or could I be overlooking something in my approach?

How did you split the data into training and validation set?

I split the data into an 80:20 train:validation set, but I only oversampled the training data and left the test set untouched.

Did you use only training and validation (a split into 2 sets), or training, validation, and test (a split into 3 sets)?

Further, did you make sure that each subset is balanced?

Oh yeah, sorry: I also have a test set, though it is quite small. Initially the splits are representative of the class imbalance of the overall dataset, so the train, validation, and test sets all retain the class imbalance. For training I remove the imbalance in the training set by oversampling, and I leave the validation and test sets untouched.
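Concretely, the splitting looks something like this (a simplified sketch with made-up ratios and column names; only `train_df` gets oversampled afterwards):

```python
from sklearn.model_selection import train_test_split

# Stratified splits keep the original class ratios in every subset.
train_df, rest = train_test_split(df, test_size=0.2, stratify=df.label, random_state=42)
valid_df, test_df = train_test_split(rest, test_size=0.25, stratify=rest.label, random_state=42)

# Only train_df is oversampled (as in the snippet above); valid_df and test_df stay untouched.
```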

When I started out I also balanced the validation set by oversampling, with similar results. Should the validation set be balanced as well?

I think all sets should be balanced to guarantee an undistorted result. If you make sure that all sets are balanced, you can at least rule out the imbalance itself as a factor.

Maybe you should manually select the validation and test sets to be balanced and then oversample the remaining training set. The new results may indicate the road to take from there.
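Something along these lines, as a rough sketch (per-class counts and the column name are just placeholders):

```python
import pandas as pd

n_val, n_test = 20, 10            # per-class counts, made up for illustration

val_parts, test_parts, train_parts = [], [], []
for cls, grp in df.groupby('label'):
    grp = grp.sample(frac=1, random_state=42)          # shuffle within the class
    val_parts.append(grp.iloc[:n_val])                 # balanced validation slice
    test_parts.append(grp.iloc[n_val:n_val + n_test])  # balanced test slice
    train_parts.append(grp.iloc[n_val + n_test:])      # the rest is training data, to be oversampled

valid_df, test_df, train_df = map(pd.concat, (val_parts, test_parts, train_parts))
```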