The confusion matrix can be confusing, though: you're forcing your predictions into a class based on some threshold.
For your validation set, you shouldn't use stratified sampling - or if you do, you should be careful to make sure you always use the same stratification when comparing results.
For unbalanced data that doesn't use stratified sampling, it's correct that you get 90% accuracy in the case you mentioned - but that's not a problem at all. A better model will get better accuracy.
You can also use metrics like F-beta scores, which are more understandable than AUC, but less understandable than accuracy.
Either way, I strongly recommend always including accuracy as a metric wherever possible. I've too often seen people only use AUC, and totally misunderstand whether their model is actually any good.
I thought using precision/recall/F1 score was accepted as best practice for classification in imbalanced-data settings. These metrics would be calculated using the threshold that corresponds to the optimal point on the ROC curve.
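For anyone wanting to try this, here is a minimal sketch of that workflow in scikit-learn. It assumes "optimal point on the ROC curve" means the threshold maximising Youden's J statistic (TPR − FPR), which is one common choice; the toy dataset and variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Toy imbalanced problem: ~10% positives.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_va)[:, 1]

# "Optimal" ROC point here = threshold maximising Youden's J (TPR - FPR).
fpr, tpr, thresholds = roc_curve(y_va, probs)
best_threshold = thresholds[np.argmax(tpr - fpr)]

# Apply that threshold, then report the threshold-dependent metrics.
preds = (probs >= best_threshold).astype(int)
print(f"threshold={best_threshold:.3f}  "
      f"precision={precision_score(y_va, preds):.3f}  "
      f"recall={recall_score(y_va, preds):.3f}  "
      f"F1={f1_score(y_va, preds):.3f}")
```

Note that this threshold is tuned on the validation set, so strictly the final numbers should be confirmed on a held-out test set.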
Yes this is a great way to do it.
This is true, but your justification for using accuracy over AUC is that it's less confusing. I think to most beginners, seeing a model with 90% accuracy that was doing no better than a random guess would be more confusing.
Anyway, we'll have to agree to disagree on this one. Looking forward to the rest of the course - from the UK here, and your piece on covid-19 was excellent; I've been telling all my friends to watch it.
I think, again, showing the confusion matrix is important here when using accuracy. Yes, we're forcing each prediction into one class (via the argmax), but we're doing that anyway in this scenario. And visualizing the matrix allows beginners to see how the class with the smaller amount of data gets confused compared to the larger one. (And it's up to us knowledgeable folk to explain this to them.)
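To make the argmax point concrete, here is a tiny sketch with made-up per-class probabilities; the numbers are purely illustrative:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical per-class probabilities for 6 samples, 2 classes.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4],
                  [0.3, 0.7], [0.55, 0.45], [0.2, 0.8]])
y_true = np.array([0, 0, 1, 1, 1, 0])

preds = probs.argmax(axis=1)   # force each prediction into one class
cm = confusion_matrix(y_true, preds)
print(cm)  # rows = true class, columns = predicted class
```

Reading the matrix row by row shows exactly where the minority class is being mistaken for the majority class, which a single accuracy number hides.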
I thought of a different way to "balance" an imbalanced tabular dataset and I would like to get your opinion on it. What I did is augment the tabular dataset by creating fake data for the class which is underrepresented. I did this via a variational autoencoder: https://lschmiddey.github.io/fastpages_/2021/03/17/data-augmentation-tabular-data.html
I built a RandomForest with the class_weight reweighting from sklearn and another one with my created fake data, and the results look quite promising.
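For reference, a minimal sketch of the class_weight baseline being compared against (the dataset here is synthetic and the scoring choice is just one option; the VAE-augmented variant from the blog post isn't reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy dataset with ~5% positives, standing in for the real tabular data.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

# class_weight='balanced' reweights classes inversely to their frequency.
rf = RandomForestClassifier(class_weight="balanced", random_state=42)
score = cross_val_score(rf, X, y, scoring="f1").mean()
print(f"cross-validated F1: {score:.3f}")
```

Comparing this F1 against the same pipeline trained on the VAE-augmented data would make the "quite promising" claim easy to quantify.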
Cheers
Lasse
great
For binary classification models it's always a good idea to build a NULL model, which typically will ALWAYS predict the majority class. Any model built must do better than the NULL model as far as the accuracy metric is concerned. For example, a cancer that rarely occurs in the general population (less than 1%) will result in a NULL model with 99% accuracy. But, for the task at hand - i.e. detecting the rare cancer - this model is terrible.
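The NULL model above can be sketched with scikit-learn's DummyClassifier; the 1% prevalence and dataset here are illustrative:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical rare-cancer setting: ~1% positives.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)
X = np.zeros((10_000, 1))  # features are irrelevant to the NULL model

# NULL model: always predict the majority class.
null_model = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = null_model.score(X, y)
print(f"NULL-model accuracy: {acc:.3f}")  # high accuracy, zero cancers detected
```

Its accuracy is the floor any real model must beat, while its recall on the positive class is exactly zero, which is why accuracy alone is misleading here.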
As pointed out in earlier posts, consider using other metrics such as precision, recall, ROC, AUC, F1 depending on the use case to gauge model performance.
Class imbalance may be addressed by under/over sampling, generating synthetic data (e.g. SMOTE), using class weights, etc.
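The simplest of those options, random over-sampling of the minority class, can be done in a few lines of plain NumPy (SMOTE, which interpolates synthetic neighbours instead of duplicating rows, is available in the separate imbalanced-learn package); the class sizes below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy imbalanced data: 90 majority (0) vs 10 minority (1) samples.
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Random over-sampling: resample minority rows with replacement until balanced.
minority_idx = np.where(y == 1)[0]
extra = rng.choice(minority_idx, size=80, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print(np.bincount(y_bal))  # both classes now have 90 samples
```

Whichever technique is used, the resampling should happen only on the training split, never on the validation or test data.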
Best!