Understanding class imbalance (HAM10000 dermatology images)

When does class imbalance become an issue in binary image classification?

At the moment I’m trying to classify two groups of skin lesions. The dataset contains about 2300 images; 90% belong to class A and the remainder to class B. Even for a trained dermatologist it can be hard to distinguish between A (benign) and B (malignant).

I’m using ResNet-50, image size 299, batch size 4, wd = 0.01. After 7 epochs on the final layer and 3 more using unfreeze(), the error rate is 0.09, which is about the prior probability of class B.

About 60% of class B images are misclassified as false negatives (‘benign skin lesions’). The problem is that the CNN predicts almost everything to be in class A. So the specificity (the fraction of truly benign lesions that are predicted ‘benign’) is still very good, but the sensitivity for class B is only about 40%.
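To make these terms concrete, here is a minimal sketch that computes them from a confusion matrix, treating malignant (class B) as the positive class. The counts are illustrative assumptions chosen to roughly match the numbers above (about 60% of class B missed, about 9% error), not the actual results of the run.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return common binary metrics, treating malignant (class B) as positive."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),  # share of malignant lesions that are caught
        "specificity": tn / (tn + fp),  # share of benign lesions predicted benign
        "npv": tn / (tn + fn),          # share of 'benign' predictions that are correct
    }


# Illustrative counts only (assumed, not taken from the actual run):
# 46 malignant and 414 benign validation images.
metrics = confusion_metrics(tp=18, fn=28, fp=13, tn=401)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With counts like these, accuracy and specificity both look fine while sensitivity is poor, which is the pattern described above.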

How can I improve on this?

  • Penalize certain types of mistakes more heavily, for example by changing the threshold for saying either A or B?

  • Train only on the misclassified images (how???)

  • Oversampling?
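As a rough sketch of the threshold idea from the first bullet: instead of taking the class with the highest probability (a 0.5 cut-off in the binary case), one could call a lesion malignant whenever its predicted probability exceeds a lower threshold. The probabilities, labels, and function names below are made up for illustration.

```python
def predict_malignant(probs, threshold=0.5):
    """Label a lesion malignant (1) when its predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


def recall(y_true, y_pred):
    """Fraction of truly malignant lesions that were predicted malignant."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    return tp / positives if positives else 0.0


# Toy values (assumed): predicted probability of 'malignant' and the true label.
probs_malignant = [0.05, 0.10, 0.32, 0.35, 0.45, 0.60, 0.20]
labels = [0, 0, 0, 1, 1, 1, 0]

recall_default = recall(labels, predict_malignant(probs_malignant, 0.5))
recall_lowered = recall(labels, predict_malignant(probs_malignant, 0.3))
print(recall_default, recall_lowered)
```

Lowering the threshold catches more malignant lesions at the price of more false alarms (here the benign lesion at 0.32 is also flagged), so the cut-off would normally be tuned on a validation set.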

PS: the images are from the HAM10000 dataset. It is very interesting and contains 7 classes in total. Currently I focus on two classes: nevi and melanoma.

For more information: there is an article on a CNN that uses this dataset among others. It was developed by a team from Stanford University. They are ‘on par’ with a team of about 30 dermatologists. Their dataset contains about 120,000 images in total(!).

Accuracy is not the best metric when you have class imbalance. If you predict everything as benign, your accuracy is 90%. Similar problems have appeared in recent Kaggle competitions. Some people train using an F1 loss. See this excellent kernel: https://www.kaggle.com/iafoss/pretrained-resnet34-with-rgby-0-460-public-lb
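A small sketch of why accuracy is misleading here, using a made-up validation set of 100 lesions with the same 90/10 split and a model that always answers ‘benign’. The helper name and the data are assumptions for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; returns 0.0 when there is nothing to score."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Assumed toy data: 90 benign (0) and 10 malignant (1) lesions.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # a model that always predicts 'benign'

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_score(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

The always-benign model scores 90% accuracy but an F1 of zero for the malignant class, so F1 exposes the failure that accuracy hides.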

Thank you. I was already trying to implement the f1_score and will examine the Kaggle kernel.

Still, the metrics don’t influence the training process and I’m also interested in approaches that change the training process like oversampling and training on misclassified images.
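For the oversampling part, one common approach is to draw training indices with replacement, weighting each image by the inverse of its class frequency so that both classes show up about equally often. Below is a minimal pure-Python sketch; the function name and the 2070/230 split are assumptions based on the numbers in this thread. As far as I know, PyTorch's torch.utils.data.WeightedRandomSampler does something similar when given per-sample weights.

```python
import random
from collections import Counter


def oversampled_indices(labels, num_samples, seed=0):
    """Draw indices with replacement so that each class is roughly equally likely."""
    counts = Counter(labels)
    # Each sample is weighted by the inverse frequency of its class.
    weights = [1.0 / counts[y] for y in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)


# Assumed split: about 2300 images, 90% benign (0) and 10% malignant (1).
labels = [0] * 2070 + [1] * 230
epoch_indices = oversampled_indices(labels, num_samples=len(labels))
minority_share = sum(labels[i] for i in epoch_indices) / len(epoch_indices)
print(f"share of malignant images in one resampled epoch: {minority_share:.2f}")
```

Note that the minority images are repeated many times per epoch, so strong data augmentation is usually needed to limit overfitting on them.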

In this specific case the ‘cost’ of missing a malignant lesion should be much higher.

Hello Maria,

Today I tried to implement the F1 metric. The Kaggle kernel you referred to uses a callback, F1_callback(Callback). This class refers to the F1 metric when initialized. I just copied the whole class and tried to apply it in my FastAI 2019 framework.

Unfortunately it throws a bunch of errors :disappointed:. Any tips on how to implement this F1 metric?
When literally copying the example I receive the following error:

As the kernel progresses the author makes use of a callback:


When I remove the F1 as a metric and just use the callback another error gets thrown into the party:

I’ve got the idea I’m making things more complicated than necessary. So far I couldn’t work it out with the previous posts on the forum.


Hello Sinsji:
The kernel I shared is for fastai v0.7. Are you working on v1.0?
Also, can you share more of the code? It would be hard to follow what you are trying to do without it.

Oversampling has not worked that well for me. I tried to do this in the protein competition and I may not know how to do it correctly, but I just don’t quite trust it :slight_smile:

One other option is passing weights to the CrossEntropyLoss function so it increases the loss more for underrepresented classes.

I did this and it helped my model generalize better. Not sure if it’s a good strategy in general but here’s the code if you want to give it a try:

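# count how many training items each class has
_, class_counts = np.unique(data.y.items, return_counts=True)
# inverse-frequency weights: the rarer the class, the larger its weight
weights = np.sum(class_counts)/class_counts
weights = tensor(weights).float().cuda()
learn.loss_func = nn.CrossEntropyLoss(weight=weights)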
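To see what these weights do, here is a small pure-Python sketch of a class-weighted cross-entropy. It follows my understanding of how PyTorch averages the weighted loss (the sum of weighted per-sample losses divided by the sum of the weights of the targets). The function names, class counts, and logits are assumptions for illustration.

```python
import math


def class_weights(counts):
    """Inverse-frequency weights: total number of samples divided by each class count."""
    total = sum(counts)
    return [total / c for c in counts]


def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    num, den = 0.0, 0.0
    for row, t in zip(logits, targets):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))  # stable log-sum-exp
        nll = log_z - row[t]
        num += weights[t] * nll
        den += weights[t]
    return num / den


# Assumed counts: 2070 benign (class 0) and 230 malignant (class 1) images.
w = class_weights([2070, 230])

# Two samples that the model both calls 'benign'; the second one is actually malignant.
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]
plain = weighted_cross_entropy(logits, targets, [1.0, 1.0])
weighted = weighted_cross_entropy(logits, targets, w)
print(f"weights={w}, plain loss={plain:.3f}, weighted loss={weighted:.3f}")
```

The missed malignant sample dominates the weighted loss, so the optimizer is pushed harder to fix that kind of mistake.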

Yes, I’m using v1.0.

Here is the code I used initially:

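data = (ImageItemList.from_folder(path_img, extensions='.jpg')
       .use_partial_data(sample_pct = .1, seed = 17)
       .random_split_by_pct(valid_pct=0.2, seed = 3)
       .label_from_folder()  # assumption: one subfolder per class; the labeling step was missing from the post
       .transform(tfms, size=299)
       .databunch())  # assumption: the closing databunch() call was cut off in the post

learn = create_cnn(data, models.resnet50, metrics = [error_rate])

learn.fit_one_cycle(4, max_lr = 2e-02, wd=0.05)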

After implementing the kernel I added this:

f1_callback = F1_callback()
learn.metrics = [acc,f1_callback.f1]

learn.fit_one_cycle(4, max_lr = 2.29E-02, wd=0.05, callbacks=[f1_callback])

I literally copied the functions from the kernel otherwise…

@yeldarb thank you. I will try to add it and see!

As far as I know, in medicine the type of mistake is sometimes more relevant than the overall accuracy. In these cases the goal is more specific, like minimizing the cost of missing a certain disease or of running an unnecessary expensive test. So in this case I would like to penalize the learning process for missing the severe disease.
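One way to turn such costs into a decision rule, sketched under the assumption that the model outputs reasonably calibrated probabilities: predicting ‘benign’ has expected cost p * cost_fn and predicting ‘malignant’ has expected cost (1 - p) * cost_fp, so ‘malignant’ is the cheaper call whenever p >= cost_fp / (cost_fp + cost_fn). The cost values and function names below are hypothetical.

```python
def cost_threshold(cost_fn, cost_fp):
    """Probability above which predicting 'malignant' has the lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)


def decide(p_malignant, cost_fn, cost_fp):
    """Pick the label with the lower expected cost for one lesion."""
    if p_malignant >= cost_threshold(cost_fn, cost_fp):
        return "malignant"
    return "benign"


# Hypothetical costs: a missed melanoma is taken to be 9 times as bad as a false alarm.
threshold = cost_threshold(cost_fn=9.0, cost_fp=1.0)
print(threshold)
print(decide(0.20, cost_fn=9.0, cost_fp=1.0))
print(decide(0.05, cost_fn=9.0, cost_fp=1.0))
```

This only changes the decision after training; weighting the loss per class is the way to build the same asymmetry into the training process itself.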