Oversampling methods in fastai?

hwuau · August 9, 2019, 10:54am

I am doing a Kaggle competition and I think I have run into imbalanced classes in an image classification model.

I have tried various data augmentations, and the more augmentation I add, the worse the model performs on certain classes. In some extreme cases, the model identifies zero cases of some minority classes.

The data augmentation includes flip, rotate, zoom, circle crop, brightness, and contrast. Is there any other oversampling strategy besides introducing an external data source? Thank you for any help.

tfms2=([RandTransform(tfm=TfmCrop (crop_pad), kwargs={'row_pct': 0.5, 'col_pct': 0.5, 'padding_mode': 'zeros'}, p=1.0, resolved={}, do_run=False, is_random=True, use_on_y=True),
RandTransform(tfm=TfmAffine (dihedral_affine), kwargs={}, p=1.0, resolved={}, do_run=True, is_random=True, use_on_y=True),
RandTransform(tfm=TfmAffine (rotate), kwargs={'degrees': (-10.0, 10.0)}, p=0.75, resolved={}, do_run=True, is_random=True, use_on_y=True),
RandTransform(tfm=TfmAffine (zoom), kwargs={'scale': (1.01, 1.03), 'row_pct': 0.5, 'col_pct': 0.5}, p=1, resolved={}, do_run=True, is_random=True, use_on_y=True),
RandTransform(tfm=TfmLighting (brightness), kwargs={'change': (0.40, 0.50)}, p=0.75, resolved={}, do_run=True, is_random=True, use_on_y=True),
RandTransform(tfm=TfmLighting (contrast), kwargs={'scale': (0.9, 1.1111111111111112)}, p=0.75, resolved={}, do_run=True, is_random=True, use_on_y=True)],
[RandTransform(tfm=TfmCrop (crop_pad), kwargs={}, p=1.0, resolved={}, do_run=True, is_random=True, use_on_y=True)])

dhoa · August 9, 2019, 12:21pm

I think you will find this thread interesting: Oversampling Callback
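
For anyone who wants the general idea without reading the whole thread, here is a minimal sketch of oversampling by inverse-frequency weighting, using only the standard library. It is not the callback from that thread (I am not reproducing its code here); `oversample_indices` and the toy labels are hypothetical names made up for illustration. Each item is weighted by one over its class count, so every class ends up being drawn about equally often.

```python
import random
from collections import Counter

def oversample_indices(labels, n_samples=None, seed=0):
    """Draw indices with replacement so that every class is
    picked with roughly equal probability."""
    counts = Counter(labels)
    # Weight each item by the inverse of its class frequency.
    weights = [1.0 / counts[y] for y in labels]
    rng = random.Random(seed)
    n = n_samples if n_samples is not None else len(labels)
    return rng.choices(range(len(labels)), weights=weights, k=n)

# Hypothetical toy labels: 90 items of class "a", 10 of class "b".
labels = ["a"] * 90 + ["b"] * 10
idx = oversample_indices(labels, n_samples=10000)
resampled = Counter(labels[i] for i in idx)
print(resampled)  # the two classes should come out close to 50/50
```

In PyTorch, the same per-item weights can be passed to `torch.utils.data.WeightedRandomSampler`. Note that the minority images are simply repeated, so the model sees the same few examples many times per epoch.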

hwuau · August 13, 2019, 2:03am

@dhoa, thank you for the tip. It seems that adding oversampling makes the model perform worse.

Could it be that oversampling taught the model too much detail about the minority class, leading to worse generalisation for all classes?

muellerzr · August 13, 2019, 3:26am

That depends on a few factors, such as how your validation set is made. Is it representative of the original set? If so, then yes, it is most likely actually learning the differences between classes, which can sometimes hinder performance as it generalizes better.

An easy way to check is to compare two confusion matrices: one without oversampling and one with.
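
A rough sketch of that comparison in plain Python, in case it helps. The labels and predictions below are made-up toy data, and `confusion_matrix` and `per_class_recall` are hypothetical helper names, not fastai functions. The point is to look at per-class recall side by side rather than only overall accuracy.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

# Hypothetical validation labels and predictions from two training runs.
classes = ["major", "minor"]
y_true = ["major"] * 8 + ["minor"] * 2
pred_plain = ["major"] * 10                                   # without oversampling
pred_over = ["major"] * 6 + ["minor"] * 2 + ["minor", "major"]  # with oversampling

cm_plain = confusion_matrix(y_true, pred_plain, classes)
cm_over = confusion_matrix(y_true, pred_over, classes)

for name, cm in [("plain", cm_plain), ("oversampled", cm_over)]:
    print(name, cm, per_class_recall(cm))
```

In fastai v1 you can get the real matrices with `ClassificationInterpretation.from_learner(learn)` and then `plot_confusion_matrix()`, once for each trained learner. In this toy data the oversampled run loses some majority-class recall but starts finding the minority class, which is the kind of trade-off to look for.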