I am currently trying out a CNN on a highly imbalanced image dataset. From what I have read, oversampling seems to be a good approach to tackling imbalanced data, so I oversampled my minority classes to match the size of the single majority class.
For oversampling I duplicated the original images and slightly changed them using fastai's recommended data augmentation techniques (flips, rotations, zooms, etc.).
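For reference, the duplication step looks roughly like this (a minimal sketch in plain Python; `oversample_indices` is a hypothetical helper, not a fastai API — in my actual pipeline the augmentations are applied on top of the duplicated images):

```python
import random

def oversample_indices(labels, seed=0):
    """Build an index list where every class is resampled (with
    replacement) up to the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(idxs) for idxs in by_class.values())
    resampled = []
    for idxs in by_class.values():
        # keep every original image once ...
        resampled.extend(idxs)
        # ... then draw random duplicates until the class matches the majority
        resampled.extend(rng.choices(idxs, k=target - len(idxs)))
    rng.shuffle(resampled)
    return resampled

labels = ["majority"] * 100 + ["minority"] * 10
idxs = oversample_indices(labels)
# each class now contributes 100 samples per epoch
```

The augmentations make the duplicates differ slightly from epoch to epoch, but they are still derived from the same small pool of source images, which is where I suspect the loss of variance comes from.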
During training my training loss rapidly decreases, almost to zero, while my validation loss and error rate decrease to a certain extent and then plateau. It looks like the network is memorizing the training data, which it probably has an easy time doing given the high number of "repetitive" images it is seeing (i.e. the loss of variance in my oversampled classes).
To combat this I increased dropout in the last two layers, trying rates ranging from the default 0.25 & 0.5 up to 0.35 & 0.7. I also tried running the network with data that was oversampled to a lesser extent, as well as with a smaller CNN. The outcome, however, was similar.
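For context, the two rates refer to the two dropout layers in the model head. Mechanically, (inverted) dropout does something like the following toy sketch — this is an illustration of the technique, not the actual fastai/PyTorch implementation:

```python
import random

def dropout(activations, p, rng):
    """Inverted dropout: zero each activation with probability p and
    scale the survivors by 1/(1-p) so the expected sum is unchanged."""
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]

rng = random.Random(42)
acts = [1.0] * 1000
# the default head rates vs. the stronger ones I tried
for p in (0.25, 0.5, 0.35, 0.7):
    dropped = dropout(acts, p, rng)
    kept = sum(1 for a in dropped if a != 0.0)  # roughly (1 - p) * 1000 survive
```

So at 0.7 roughly 70% of the activations in that layer are zeroed on each forward pass, which is why I expected it to be a fairly aggressive regularizer — but it did not noticeably close the train/validation gap here.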
Has anyone else experienced strong overfitting with oversampled data? If so, did you find a way around it? Or could I be overlooking something in my approach?