Hi guys,
I’ve been thinking about this for some time now. The datasets I work with are usually highly imbalanced, with positive/negative ratios as low as 1:1000. It’s not uncommon to get metrics like these:
Negative samples: 33440
Positive samples: 287
Matthews correlation coef. = 0.554
Recall = 0.533
Specificity = 0.996
Precision = 0.583
Balanced accuracy = 0.765
Area under precision-recall curve = 0.492
I realised that even if 99.6% of the negative samples are correctly classified, the remaining 0.4% are still comparable in number to the true positives.
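To make that concrete, here is a rough back-calculation from the numbers above (rounded to whole samples, so it won’t match the reported precision exactly):

```python
# Back-of-the-envelope check using the counts and rates from the post.
n_neg, n_pos = 33440, 287
specificity, recall = 0.996, 0.533

fp = round(n_neg * (1 - specificity))   # negatives that slip through: ~134
tp = round(n_pos * recall)              # positives actually caught:   ~153
fn = n_pos - tp
tn = n_neg - fp

precision = tp / (tp + fp)              # ~0.53, same ballpark as the reported 0.583
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}, precision~{precision:.3f}")
```

So even a tiny miss rate on the majority class produces a false-positive count on the same order as the true positives, which is exactly what drags precision down.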
I work with early drug discovery, using ML and DL to prioritize molecules for biological testing. Intuitively, I’d say filtering out 99.6% of the negative samples is a good thing, because it translates to not spending money on compounds that won’t work in bio testing. But those metrics, especially the precision and recall, make me feel uneasy.
I know I can change the classification threshold using the precision-recall curve, but is that the best we can do? Suppose we try resampling methods (e.g., undersampling, oversampling, etc.), different weights for the positive and negative classes, and so on. How do we know that a model is ready for production / publication when the metrics look so different from what tutorials show, with amazing accuracies of 0.98?
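For context, this is roughly what I mean by threshold tuning plus class weighting, as a minimal sketch on synthetic imbalanced data (scikit-learn defaults, a stand-in for the real assay data, not my actual pipeline):

```python
# Sketch: re-weight the rare class, then pick a decision threshold from the
# precision-recall curve on held-out data instead of using the default 0.5.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic ~1:100 imbalanced dataset as a placeholder for real molecule data.
X, y = make_classification(n_samples=30000, weights=[0.99, 0.01], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)

probs = clf.predict_proba(X_va)[:, 1]
prec, rec, thr = precision_recall_curve(y_va, probs)

# Choose the threshold that maximises F1 on the validation set; F-beta, MCC,
# or an expected-cost criterion (price of a wasted assay vs. a missed hit)
# could be dropped in here instead.
f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12)
best = np.argmax(f1)
print(f"threshold={thr[best]:.3f}, precision={prec[best]:.3f}, recall={rec[best]:.3f}")
```

Is tuning the threshold against a metric like this (or against an explicit cost of testing a dead compound) really the end of the road, or is there a more principled way to decide the model is good enough?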