Balancing an Imbalanced Dataset

A confusion matrix can be confusing though :wink: You’re forcing your predictions into a class based on some threshold.

For your validation set, you shouldn’t use stratified sampling - or if you do, you should be careful to make sure you always use the same stratification when comparing results.

For unbalanced data that doesn’t use stratified sampling, it’s correct that you get 90% accuracy in the case you mentioned - but that’s not a problem at all. A better model will get better accuracy.
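
As a rough sketch of what that thresholding looks like in practice (the 90/10 split, the 0.5 threshold and the random "probabilities" below are all made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up imbalanced ground truth: 90% class 0, 10% class 1
y_true = np.array([0] * 90 + [1] * 10)

# Pretend these are the model's predicted probabilities for class 1
rng = np.random.default_rng(42)
y_prob = rng.uniform(0, 1, size=100)

# Forcing each prediction into a class with a 0.5 threshold
y_pred = (y_prob >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))

# A "model" that always predicts the majority class already scores ~90% here
print("majority-class accuracy:", accuracy_score(y_true, np.zeros(100, dtype=int)))
```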

You can also use things like F-beta scores, which are more understandable than AUC, but less understandable than accuracy.
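
For example, with sklearn's `fbeta_score` (beta > 1 weights recall more heavily, beta < 1 weights precision; the labels below are made up):

```python
from sklearn.metrics import fbeta_score, f1_score, accuracy_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1  :", f1_score(y_true, y_pred))
print("F2  :", fbeta_score(y_true, y_pred, beta=2))    # weights recall more
print("F0.5:", fbeta_score(y_true, y_pred, beta=0.5))  # weights precision more
```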

Either way, I strongly recommend always including accuracy as a metric wherever possible. I’ve too often seen people only use AUC, and totally misunderstand whether their model is actually any good.


I thought using precision/recall/F1 score was accepted as best practice for classification in an imbalanced data setting. These metrics would be calculated using the threshold that corresponds to the optimal point on the ROC curve.
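
A minimal sketch of that approach, where the "optimal point" is taken to be the one maximising Youden's J statistic (TPR − FPR) - the probabilities here are just stand-ins for real model output:

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_score, recall_score, f1_score

# Stand-in data: y_prob would normally come from your model's predict_proba
rng = np.random.default_rng(0)
y_true = np.array([0] * 90 + [1] * 10)
y_prob = np.clip(rng.normal(0.2 + 0.4 * y_true, 0.15), 0, 1)

# Pick the threshold that maximises TPR - FPR (Youden's J)
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
best_thresh = thresholds[np.argmax(tpr - fpr)]
y_pred = (y_prob >= best_thresh).astype(int)

print(f"threshold={best_thresh:.3f}")
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```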


Yes this is a great way to do it.

This is true, but your justification for using accuracy over AUC is that it’s less confusing. I think to most beginners, a model with 90% accuracy that does no better than a random guess would be more confusing :smiley:

Anyway, we’ll have to agree to disagree on this one. Looking forward to the rest of the course - I’m from the UK and your piece on covid-19 was excellent; I’ve been telling all my friends to watch it.

I think, again, showing the confusion matrix is important here when using accuracy. Yes, we’re forcing them into one class (via the argmax), but we’re doing that anyway in this scenario. And visualizing the matrix lets beginners see how the class with less data gets confused compared to the larger one. (And it’s up to us knowledgeable folk to explain this to them.)
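
In fastai, `ClassificationInterpretation.from_learner(learn).plot_confusion_matrix()` does this for you; a framework-agnostic sketch with sklearn (the per-class probabilities below are invented) looks like:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Invented per-class probabilities for 6 samples (2 classes)
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4],
                  [0.55, 0.45], [0.3, 0.7], [0.45, 0.55]])
y_true = np.array([0, 0, 0, 1, 1, 1])

# Force each sample into one class via argmax, then visualise the confusions
y_pred = probs.argmax(axis=1)
ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.show()
```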


I thought of a different way to “balance” an imbalanced tabular dataset and I would like to get your opinion on it. What I did was augment the tabular dataset by creating fake data for the underrepresented class. I did this via a variational autoencoder: https://lschmiddey.github.io/fastpages_/2021/03/17/data-augmentation-tabular-data.html
I built a RandomForest with sklearn’s class_weight option and another one with my created fake data, and the results look quite promising.
Cheers
Lasse
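
For anyone who wants to try the same comparison, here is a bare-bones sketch - the duplicated minority rows are just a stand-in for the VAE-generated fake data from the blog post, and the synthetic dataset is only there to make it runnable:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset just to make the sketch runnable
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, random_state=0)

# Option 1: let the forest reweight the minority class internally
rf_weighted = RandomForestClassifier(class_weight="balanced", random_state=0)
rf_weighted.fit(X_train, y_train)

# Option 2: train on the original rows plus extra minority rows.
# Here they are simply duplicated; in the blog post they come from a VAE.
minority = X_train[y_train == 1]
X_aug = np.vstack([X_train, np.repeat(minority, 5, axis=0)])
y_aug = np.concatenate([y_train, np.ones(len(minority) * 5, dtype=int)])
rf_augmented = RandomForestClassifier(random_state=0).fit(X_aug, y_aug)

for name, model in [("class_weight", rf_weighted), ("augmented", rf_augmented)]:
    print(name, f1_score(y_valid, model.predict(X_valid)))
```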


great

For binary classification models it’s always a good idea to build a NULL model, which simply always predicts the majority class. Any model you build must do better than the NULL model as far as the accuracy metric is concerned. For example, a cancer that rarely occurs in the general population (in less than 1% of people) will give a NULL model with 99% accuracy. But for the task at hand - detecting the rare cancer - this model is terrible.
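
sklearn’s `DummyClassifier` gives you exactly this baseline; a quick sketch with made-up 99/1 data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up data: 99% healthy, 1% rare cancer
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # features don't matter for the NULL model

null_model = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = null_model.predict(X)

print("accuracy:", accuracy_score(y, y_pred))  # ~0.99
print("recall  :", recall_score(y, y_pred))    # 0.0 - never finds the cancer
```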

As pointed out in earlier posts, consider using other metrics such as precision, recall, ROC, AUC, F1 depending on the use case to gauge model performance.

Class imbalance may be addressed by under/over sampling, generating synthetic data (e.g. SMOTE), using class weights, etc.
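
A small sketch of two of those options - SMOTE from the imbalanced-learn package and sklearn class weights (the synthetic dataset is just for illustration):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# Synthetic minority oversampling (SMOTE)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_res))

# Or keep the data as-is and reweight the loss instead
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```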

Best!