Hi,
I’ve noticed there’s a discrepancy between fastai2 and sklearn when using metrics that take a probability as input (y_score), such as RocAuc and APScore, in binary classification problems.
I’ve looked into it and the reason seems to be that what is passed to the metric is not a probability, but an actual prediction. When I pass the predictions instead of the probabilities to sklearn I get the same result.
This is sklearn’s definition of what needs to be passed:

y_score array-like of shape (n_samples,) or (n_samples, n_classes)

Target scores. In the binary and multilabel cases, these can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers). In the multiclass case, these must be probability estimates which sum to 1. The binary case expects a shape (n_samples,), and the scores must be the scores of the class with the greater label. The multiclass and multilabel cases expect a shape (n_samples, n_classes). In the multiclass case, the order of the class scores must correspond to the order of labels, if provided, or else to the numerical or lexicographical order of the labels in y_true.

Has anyone noticed this discrepancy? I believe this is a bug.
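To illustrate the difference, here is a minimal sketch outside fastai (the toy numbers are my own, not from the gist):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up binary targets and predict_proba-style scores
y_true = np.array([0, 0, 1, 1, 1, 0])
probas = np.array([0.2, 0.4, 0.9, 0.7, 0.3, 0.1])  # probabilities of class 1
preds = (probas >= 0.5).astype(int)                # thresholded hard labels

auc_from_probas = roc_auc_score(y_true, probas)  # ranks all six distinct scores
auc_from_preds = roc_auc_score(y_true, preds)    # only sees 0/1, so ties collapse the ranking
print(auc_from_probas, auc_from_preds)
```

Passing the hard predictions gives a different (and here lower) AUC, because sklearn can no longer rank the examples within each predicted class.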

Thanks for your follow up @FraPochetti!
I’ve created a simple gist to explain the issue.
I noticed it because the metrics I was getting when I started working with fastai2 were significantly worse than with v1.

Please, let me know if this is not clear.

@muellerzr, could you please take a quick look at this? Have you noticed any issue when using RocAuc in v2?

This seems like an sklearn issue though.
As you can see we are passing 2 almost identical arrays in terms of values.
Of course valid_preds contains either 0 or 1 (I also tried with valid_preds.float() and the result is the same), whereas valid_probas contains actual floats. Rather extreme, as they are either very close to 0 or very close to 1, but still floats.
Can this be driven just by a rounding issue?
At the end of the day ROC AUC (and Precision) are calculated computing TP, TN, FP and FN at varying thresholds. Thing is, with such extreme values, these thresholds never make a difference on the underlying metrics.
Especially when the threshold is compared to 0 or 1.
It could return something funny though, when compared to 2.288320e-42, hence the rounding issue.
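A quick way to check whether rounding is the culprit (toy numbers of my own, reusing the 2.288320e-42 value mentioned above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 0, 0])
# Extreme but distinct scores: very close to 0 or 1, yet still floats
probas = np.array([1.0 - 1e-7, 2.288320e-42, 1e-6, 1e-9])
preds = (probas >= 0.5).astype(float)  # -> [1., 0., 0., 0.]

auc_probas = roc_auc_score(y_true, probas)  # uses the tiny differences to rank
auc_preds = roc_auc_score(y_true, preds)    # ties at 0 collapse the ranking
print(auc_probas, auc_preds)
```

So it is not a rounding issue: even values like 2.288320e-42 are ranked exactly by roc_auc_score, and the ranking of the float scores genuinely differs from the ranking of the thresholded 0/1 predictions.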
I have to admit I am not sure though. @lgvaz we need your expertise too!
Are we missing anything super dumb here?

Btw, have you tried stepping out of fastai and just trying to reproduce the same issue with a toy sklearn example (make_classification)?
I am planning to test that tomorrow.
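Something along these lines should do it (a sketch using only sklearn; the model choice and variable names are mine, mirroring valid_probas/valid_preds from the gist):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy binary problem, no fastai involved
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
valid_probas = clf.predict_proba(X_va)[:, 1]  # probabilities of the positive class
valid_preds = clf.predict(X_va)               # hard 0/1 labels, like fastai's argmax

auc_probas = roc_auc_score(y_va, valid_probas)
auc_preds = roc_auc_score(y_va, valid_preds)
print(auc_probas, auc_preds)
```

The two AUC values differ, reproducing the fastai2 vs v1 discrepancy with plain sklearn objects.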

@oguiza, I made a quick test in a separate env and got the same results.
This was just to prove nothing was somehow screwed up with the sklearn installation in fastai, as I had claimed this could be a sklearn-related problem.

I think you are right in saying this is a fastai bug.
It is rather obvious that passing the targets alongside 0/1 predictions, versus alongside float probabilities, returns different results.
It seems fastai uses the former, while it should use the latter.

I think this is driven by the fact that sigmoid=None in skm_to_fastai (here), which means pred is calculated with argmax (here) instead of torch.sigmoid (here).

This is how I (well, @ilovescience really) fixed this problem (in a multi-class context). Now that I think about it, it also works in the binary case, i.e. it should fix your problem too.

def _accumulate(self, learn):
    pred = learn.pred
    if self.sigmoid: pred = torch.nn.functional.softmax(pred, dim=1)  # hack for roc_auc_score
    if self.thresh:  pred = (pred >= self.thresh)
    targ = learn.y
    pred,targ = to_detach(pred),to_detach(targ)
    if self.flatten: pred,targ = flatten_check(pred,targ)
    self.preds.append(pred)
    self.targs.append(targ)
AccumMetric.accumulate = _accumulate

def RocAuc(axis=-1, average='macro', sample_weight=None, max_fpr=None, multi_class='ovr'):
    "Area Under the Receiver Operating Characteristic Curve for single-label binary classification problems"
    return skm_to_fastai(skm.roc_auc_score, axis=axis,
                         average=average, sample_weight=sample_weight, max_fpr=max_fpr,
                         flatten=False, multi_class=multi_class, sigmoid=True)

Yep, that makes sense! I will use it in the interim until we have a solution to this.
I’d prefer not to modify the AccumMetric.accumulate function as it might have an impact on other metrics.
Let’s see what Sylvain thinks of all this!

I don’t understand what you are saying. The _accumulate function is wrong: why use softmax when you ask for sigmoid? That doesn’t make any sense.

The accumulate function in fastai has a dim_argmax you can pass for softmax. Maybe what you are saying is that this argument should be used to wrap the roc metric?

The _accumulate function does exactly what I want and it returns the correct outputs. It was “hacked” in the context of ROC AUC for a multi-class problem (hence softmax) in which I needed probabilities for all classes, not just argmax. It is very likely not the most elegant, nor the most efficient way of achieving the final goal, but this is a whole different story. It was also my very first interaction with Callbacks and custom Metrics in fastai2, so I was definitely not aware of all my options. I just wanted it done.

As for the (potential) problem @oguiza originally reported, what we are saying is that the default RocAuc implementation for binary classification seems not to perform sigmoid under the hood, passing to sklearn predictions which are not probabilities, but 1 and 0 (i.e. the result of predict instead of predict_proba in sklearn jargon).

So, in a nutshell, our question is: how can we call RocAuc inside a learner and make it calculate sigmoid on the model’s outputs before passing them to sklearn’s roc_auc_score? Maybe cnn_learner(dls, arch, metrics=RocAuc(sigmoid=True))?

As suggested in the below gist, it seems the default is to pass predictions and not probabilities.

Once again, it might be we are getting this all wrong.
If this is the case, we apologize.

I think I understand a bit better. Basically you need to have some behavior where instead of taking the argmax, you just want the softmax that returns all probabilities. What confuses me in your posts is that you keep talking about sigmoid, but you don’t want to apply that, you want a softmax on a certain dimension. This means we need to add a softmax argument to skm_to_fastai not change the current behavior (otherwise it would break all multi-label metrics).

Which other metrics take those probabilities instead of predictions while I’m at it?

I could definitely have phrased the whole thing better. Sorry about that.

I think ROC AUC, Precision and Recall expect probabilities, as those metrics are based on applying different thresholds and checking how false positives, true positives, etc. change.

The below AUROC implementation from fastai1 makes sense to me, as it feeds targets and probabilities (i.e. F.softmax(last_output, dim=1)[:,-1]) to auc_roc_score.
Isn’t feeding predictions just wrong? auc_roc_score would “see” 0 and 1 and use them as probabilities, messing everything up.

@dataclass
class AUROC(Callback):
    "Computes the area under the curve (AUC) score based on the receiver operator characteristic (ROC) curve. Restricted to binary classification tasks."
    def on_epoch_begin(self, **kwargs):
        self.targs, self.preds = LongTensor([]), Tensor([])
    def on_batch_end(self, last_output:Tensor, last_target:Tensor, **kwargs):
        last_output = F.softmax(last_output, dim=1)[:,-1]
        self.preds = torch.cat((self.preds, last_output.cpu()))
        self.targs = torch.cat((self.targs, last_target.cpu().long()))
    def on_epoch_end(self, last_metrics, **kwargs):
        return add_metrics(last_metrics, auc_roc_score(self.preds, self.targs))

The whole point of roc_auc_score is to check how FPR and TPR change, with varying proba thresholds. So roc_auc_score needs probas, not predictions.
I hope I did not mess up here.

As far as I know it’s only RocAuc and average precision. Precision and Recall expect preds, not probas.
You can check this here and look for those metrics that indicate y_score instead of y_pred.
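One quick way to verify this from sklearn’s own signatures (a small sketch; the second_arg helper is mine):

```python
import inspect
from sklearn import metrics

def second_arg(fn):
    # Name of the second positional parameter, i.e. the one after y_true
    return list(inspect.signature(fn).parameters)[1]

print(second_arg(metrics.roc_auc_score))            # takes y_score (probas)
print(second_arg(metrics.average_precision_score))  # takes y_score (probas)
print(second_arg(metrics.precision_score))          # takes y_pred (hard labels)
print(second_arg(metrics.recall_score))             # takes y_pred (hard labels)
```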