Am I overfitting on multilabel classification model?

britton · May 10, 2020, 9:51am

Hi there!

I have a rather small dataset of ~3500 images of t-shirts, each labelled with 4-5 different features (blue color, long sleeve, patch pocket, etc).

While some features are common, others aren’t – there are around 3000 possible labels in total. I am assuming this is the issue behind my poor performance, but I don’t have an intuitive sense of how it affects the behavior. Any ideas?

I’m attempting to train a multilabel classification model using lesson 3’s planets method, pretty much exactly as Jeremy did. But I’m getting some results I don’t understand:

The model gets extremely high accuracy very quickly, and the validation loss is way lower than the training loss! Shouldn’t valid be higher than train?

I do see decent results when I train through and run it on a test set, but the model generalises very poorly.

I’m not sure if I’m simply overfitting, or if this is because of the high number of labels I’m trying to work with.

What would be the best way to add regularisation to a multilabel classification model like this?

Thanks very much for any help!
Britton

vferrer · May 15, 2020, 5:40pm

Hi there,
You may be getting this low fbeta if your data is highly imbalanced. Try to oversample those images with rare labels. The easiest way is to duplicate them.
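To illustrate why accuracy can look great while fbeta stays low with this many labels, here is a minimal pure-Python sketch. The 3000 labels and 5 positives per image come from the description above; the all-negative predictor and the helper names are my own assumptions for illustration, not fastai's metric code.

```python
def thresholded_accuracy(pred, target):
    """Fraction of label slots (positive or negative) predicted correctly."""
    correct = sum(p == t for p, t in zip(pred, target))
    return correct / len(target)


def fbeta(pred, target, beta=2.0, eps=1e-9):
    """F-beta score for a single image's 0/1 label vector."""
    tp = sum(p and t for p, t in zip(pred, target))
    precision = tp / (sum(pred) + eps)
    recall = tp / (sum(target) + eps)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall + eps)


n_labels = 3000
# One image with 5 true labels out of 3000 possible ones.
target = [1] * 5 + [0] * (n_labels - 5)
# A "model" that never predicts any label at all.
all_negative = [0] * n_labels

acc = thresholded_accuracy(all_negative, target)
f2 = fbeta(all_negative, target)
print(f"accuracy={acc:.4f}  f2={f2:.4f}")
```

Predicting nothing is already right on 2995 of the 3000 slots, so accuracy says very little here, while fbeta only rewards labels that are actually found.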

britton · May 16, 2020, 1:46pm

Hi @vferrer, thanks for the tip – do you know if there is an established way in the API to change how I’m augmenting data to focus more on the rare labels? Or would that be a manual process?

Lim · May 16, 2020, 3:33pm

As you said, your valid loss is lower than your train loss, so I think you may be underfitting.

vferrer · May 17, 2020, 9:58am

@Lim is correct. You may be underfitting.

I thought that fastai v1 had a weighted dataloader, although I couldn’t find it. In fastai v2 there is WeightedDL, although I couldn’t make it work. So, in my case, I duplicated the rare train images inside the train folder. That way I could treat the problem as a regular image classification problem with a balanced dataset.
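For reference, the duplication idea can be sketched roughly like this in plain Python. It works on a list of (filename, labels) pairs instead of copying files, and `oversample_rare`, `min_count`, and the sample filenames are hypothetical names made up for the example, not part of fastai.

```python
from collections import Counter


def oversample_rare(items, min_count=3):
    """Return a new list in which images carrying rare labels are repeated.

    items: list of (filename, labels) pairs, labels being a list of strings.
    min_count: hypothetical target; images whose rarest label occurs fewer
    times than this are duplicated.
    """
    label_counts = Counter(label for _, labels in items for label in labels)
    result = []
    for filename, labels in items:
        # Frequency of the rarest label on this image (unlabelled images
        # are left alone).
        rarest = min((label_counts[label] for label in labels), default=min_count)
        # Ceiling division: repeat until the rarest label shows up
        # roughly min_count times.
        repeats = max(1, -(-min_count // rarest))
        result.extend([(filename, labels)] * repeats)
    return result


train = [
    ("a.jpg", ["blue", "long_sleeve"]),
    ("b.jpg", ["blue", "short_sleeve"]),
    ("c.jpg", ["blue", "short_sleeve", "patch_pocket"]),
]
balanced = oversample_rare(train, min_count=3)
print(len(train), "->", len(balanced))
```

Apply this to the training items only, after the validation split, so that copies of one image do not end up on both sides. Note that in a multilabel setting every duplicate also repeats the common labels on that image (like "blue" here), so the result is only roughly balanced.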