Library not learning on skewed X-ray dataset

I have an X-ray dataset containing 14 classes. The dataset is skewed,
and the predictions are heavily biased towards the majority class, as the image shows.

The number above is the F1 score, which is very low.
How can I approach this?

I am using ResNet18. More complex architectures overfit much more.

A good starting point is to duplicate examples from the smaller classes until their training counts are close to, or match, the image count of the biggest class.

If you have class A with 10 examples and class B with 100 examples, you add 9 extra copies of each class A example, so that both classes end up with 100.

This might sound silly but is a good starting point (and often turns out to work the best even when compared to more advanced techniques).
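The duplication idea above can be sketched in a few lines of plain Python. This is only an illustration, not code from the thread: the function name `oversample_by_duplication`, the file names, and the toy class sizes are all made up for the example, and the top-up with a random draw (for classes whose size does not divide the target evenly) is one possible choice among several.

```python
import random
from collections import Counter


def oversample_by_duplication(samples, labels, seed=0):
    """Duplicate examples so every class matches the size of the largest class."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Repeat the whole class as many times as fits, then top up
        # with a random draw (without replacement) to reach the target.
        repeats, remainder = divmod(target, len(items))
        balanced = items * repeats + rng.sample(items, remainder)
        out_samples.extend(balanced)
        out_labels.extend([label] * len(balanced))
    return out_samples, out_labels


# Toy data (hypothetical file names): class A has 10 examples, class B has 100.
samples = [f"a_{i}.png" for i in range(10)] + [f"b_{i}.png" for i in range(100)]
labels = ["A"] * 10 + ["B"] * 100

balanced_samples, balanced_labels = oversample_by_duplication(samples, labels)
class_counts = Counter(balanced_labels)
```

The duplication should be applied to the training split only. If duplicated images end up in the validation set, the validation metrics stop reflecting the real class distribution.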

IIRC there was a discussion about this on the forums some time ago, with at least one linked paper that I thought was quite interesting. It might be worth seeing if you can find that thread.

I might be wrong, but maybe this paper?


Not wrong!

What are those advanced techniques?

Thank you! I will go through it.

Is there a built-in feature in the library that handles this?

I tried different techniques in raw PyTorch.

I built two oversampled dataset classes with the imblearn (imbalanced-learn) library, one using the SMOTE algorithm and one using the random oversampler.

  1. SMOTE: I converted the image names to numerical IDs. My dataset grew from 11K images to 64K, but after training for 3 epochs the validation loss had barely dropped and accuracy remained at 8-15%. My training accuracy was around 30%. I couldn’t explain this behavior.
  2. Random oversampler: The classifier learned to predict only 4 of the 14 classes, and I got around 30% accuracy on the validation set.

I didn’t dig deeper. Could this be a bug in my code, or is this normal?
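One possible explanation for the SMOTE result, offered as a guess rather than a diagnosis of the code above: SMOTE creates synthetic samples by interpolating between a minority sample and one of its nearest neighbours in feature space. If the only feature is a numerical image ID, the interpolated value is just some other number between two IDs, which may point at an unrelated image or at no image at all. The sketch below mimics that interpolation step in plain Python; the helper `smote_like_interpolation` and the two IDs are hypothetical and are not part of imblearn.

```python
import random


def smote_like_interpolation(x, neighbour, rng):
    """Return a random point on the line segment between x and neighbour."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbour)]


rng = random.Random(0)

# Hypothetical setup: each minority-class image is represented only by its ID.
id_of_image_a = [17.0]
id_of_image_b = [4203.0]

synthetic = smote_like_interpolation(id_of_image_a, id_of_image_b, rng)

# Rounding gives an ID somewhere between 17 and 4203. Nothing ties that ID
# to the minority class, so the label attached to it may be wrong.
synthetic_id = round(synthetic[0])
```

If this is what happened, the random oversampler would be the safer of the two for image IDs, because it only repeats existing IDs and never invents new ones.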

I am new to this forum and just wanted to ask a basic question. If our dependent variable is skewed and we remove the skew using the np.log() function, do we need to perform the same operation on our independent variable? If yes, do we need to use the same np.log() function, or can we use another method to remove the skewness of the independent variable?