Discrepancy with proba-based metrics between fastai2 and sklearn

Ok. Let me try to clarify what’s needed for a binary classification problem.

I’ve trained the learner without RocAuc or APScore.

To get the metrics calculated manually I need this:

import sklearn.metrics as skm

valid_probas, valid_targets, valid_preds = learn.get_preds(dl=dls.valid, with_decoded=True)
skm.average_precision_score(valid_targets, valid_probas[:, 1])
skm.roc_auc_score(valid_targets, valid_probas[:, 1])

So it needs the predicted probability of the positive class.

This returns the correct values. What we were getting before was:

skm.average_precision_score(valid_targets, valid_preds)
skm.roc_auc_score(valid_targets, valid_preds)
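
A toy example (synthetic numbers, purely illustrative) shows why the decoded predictions give a different value; thresholding discards the ranking information that AUC is built on:

import numpy as np
import sklearn.metrics as skm

y      = np.array([0, 0, 1, 1])
probas = np.array([0.1, 0.6, 0.4, 0.9])  # P(positive class)
preds  = (probas >= 0.5).astype(int)     # decoded hard labels: [0, 1, 0, 1]
skm.roc_auc_score(y, probas)  # 0.75, uses the full ranking
skm.roc_auc_score(y, preds)   # 0.5, ranking information lost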

This comes from the sklearn roc_auc_score documentation:

Target scores. In the binary and multilabel cases, these can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers). In the multiclass case, these must be probability estimates which sum to 1. The binary case expects a shape (n_samples,), and the scores must be the scores of the class with the greater label. The multiclass and multilabel cases expect a shape (n_samples, n_classes).

But what do you do when there are more than two labels, then? This is what I don’t get.

I’ve only used it in binary classification.
@FraPochetti worked on a multi class one. He may be able to provide a better reply.
My understanding is that for multi-class and multi-label it requires a shape of (n_samples, n_classes), plus passing multi_class='ovr' to the API:
skm.roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None, multi_class='raise', labels=None), as shown before.
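
For instance, a minimal multi-class call might look like this (tiny synthetic arrays, just to illustrate the shapes):

import numpy as np
import sklearn.metrics as skm

y = np.array([0, 1, 2, 1, 0, 2])                          # targets, shape (n_samples,)
p = np.array([[.7, .2, .1], [.1, .8, .1], [.2, .2, .6],   # probas, shape (n_samples, n_classes),
              [.3, .5, .2], [.6, .3, .1], [.1, .3, .6]])  # each row summing to 1
skm.roc_auc_score(y, p, multi_class='ovr')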

This is getting way too problem-specific for a single API. It seems the binary case will need a special metric to handle it, and the current RocAuc and APScore would then only cover the multi-label case.

I don’t know if it’s relevant, but there was a working multi-class version: Fastai v2 vision

Hi @sgugger,

@FraPochetti and I have been working together this morning to review the proba-based metrics issue in fastai2 (RocAuc and APScore), and have jointly come up with a proposal we’d like to submit to you.
It manages all possibilities sklearn allows while keeping the API consistent with the rest of fastai2 metrics.
We have tested our proposal vs sklearn’s API using this gist and everything works well.

In sklearn there are 3 scenarios for roc_auc_score (each of them calculated slightly differently):

  • Binary:

    • targets: shape = (n_samples, )
    • preds: pass through softmax and then [:, -1], shape = (n_samples,)
  • Multiclass:

    • targets: shape = (n_samples, )
    • preds: pass through softmax, shape = (n_samples, n_classes)
    • multi_class = 'ovr' or 'ovo' (1)
  • Multilabel:

    • targets: shape = (n_samples, n_classes)
    • preds: pass through sigmoid, shape = (n_samples, n_classes)

(1) 'ovr': average AUC of each class against the rest. 'ovo': average AUC of all possible pairwise combinations of classes.

sklearn’s average_precision_score implementation is restricted to binary or multilabel classification tasks. So it cannot be used in multiclass cases.
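
To illustrate the multi-label scenario and this restriction (synthetic data, illustrative only):

import numpy as np
import sklearn.metrics as skm

rng = np.random.default_rng(0)
y_ml = rng.integers(0, 2, size=(100, 3))  # multi-label targets, shape (n_samples, n_classes)
p_ml = rng.random((100, 3))               # sigmoid-style scores, same shape
skm.roc_auc_score(y_ml, p_ml, average='macro')            # supported
skm.average_precision_score(y_ml, p_ml, average='macro')  # supported

y_mc = rng.integers(0, 3, size=100)       # multi-class targets, shape (n_samples,)
p_mc = rng.random((100, 3))
p_mc /= p_mc.sum(axis=1, keepdims=True)   # rows sum to 1
# skm.average_precision_score(y_mc, p_mc)  # would raise: multiclass format is not supported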

Here’s our proposal:

class AccumMetric(Metric):
    "Stores predictions and targets on CPU in accumulate to perform final calculations with `func`."
    def __init__(self, func, dim_argmax=None, sigmoid=False, softmax=False, proba=False, thresh=None, to_np=False, invert_arg=False,
                 flatten=True, **kwargs):
        store_attr(self,'func,dim_argmax,sigmoid,softmax,proba,thresh,flatten')
        self.to_np,self.invert_args,self.kwargs = to_np,invert_arg,kwargs

    def reset(self): self.targs,self.preds = [],[]

    def accumulate(self, learn):
        pred = learn.pred.argmax(dim=self.dim_argmax) if (self.dim_argmax and not self.proba) else learn.pred
        if self.sigmoid: pred = torch.sigmoid(pred)
        if self.thresh:  pred = (pred >= self.thresh)
        if self.softmax: 
            pred = F.softmax(pred, dim=-1)
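            # binary case: keep only the positive-class probability, as sklearn expects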
            if learn.dls.c == 2: pred = pred[:, -1]
        targ = learn.y
        pred,targ = to_detach(pred),to_detach(targ)
        if self.flatten: pred,targ = flatten_check(pred,targ)
        self.preds.append(pred)
        self.targs.append(targ)

    @property
    def value(self):
        if len(self.preds) == 0: return
        preds,targs = torch.cat(self.preds),torch.cat(self.targs)
        if self.to_np: preds,targs = preds.numpy(),targs.numpy()
        return self.func(targs, preds, **self.kwargs) if self.invert_args else self.func(preds, targs, **self.kwargs)

    @property
    def name(self):  return self.func.func.__name__ if hasattr(self.func, 'func') else  self.func.__name__

def skm_to_fastai(func, is_class=True, thresh=None, axis=-1, sigmoid=None, softmax=False, proba=False, **kwargs):
    "Convert `func` from sklearn.metrics to a fastai metric"
    dim_argmax = axis if is_class and thresh is None else None
    sigmoid = sigmoid if sigmoid is not None else (is_class and thresh is not None)
    return AccumMetric(func, dim_argmax=dim_argmax, sigmoid=sigmoid, softmax=softmax, proba=proba, thresh=thresh,
                       to_np=True, invert_arg=True, **kwargs)

def APScore(axis=-1, average='macro', pos_label=1, sample_weight=None):
    "Average Precision for binary single-label classification problems"
    return skm_to_fastai(skm.average_precision_score, axis=axis, flatten=False, softmax=True, proba=True,
                         average=average, pos_label=pos_label, sample_weight=sample_weight)
    
def APScoreMulti(axis=-1, average='macro', pos_label=1, sample_weight=None):
    "Average Precision for multi-label classification problems"
    return skm_to_fastai(skm.average_precision_score, axis=axis, flatten=False, sigmoid=True, proba=True,
                         average=average, pos_label=pos_label, sample_weight=sample_weight)
    
def RocAuc(axis=-1, average='macro', sample_weight=None, max_fpr=None, multi_class='raise', labels=None):
    "Area Under the Receiver Operating Characteristic Curve for single-label classification problems"
    """use default multi_class ('raise') for binary-class, and 'ovr'(average AUC of each class against the rest) 
    or 'ovo' (average AUC of all possible pairwise combinations of classes) for multi-class tasks"""
    return skm_to_fastai(skm.roc_auc_score, axis=axis, flatten=False, softmax=True, proba=True,
                         average=average, sample_weight=sample_weight, max_fpr=max_fpr, multi_class=multi_class, labels=labels)
    
def RocAucMulti(axis=-1, average='macro', sample_weight=None, max_fpr=None):
    "Area Under the Receiver Operating Characteristic Curve for multi-label classification problems"
    return skm_to_fastai(skm.roc_auc_score, axis=axis, flatten=False, sigmoid=True, proba=True,
                         average=average, sample_weight=sample_weight, max_fpr=max_fpr)
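
For illustration, here is how the proposed metrics might be attached to a learner (a sketch only; dls, cnn_learner, and resnet34 stand in for a standard fastai2 vision setup as used later in this thread):

learn = cnn_learner(dls, resnet34, metrics=[accuracy, APScore(), RocAuc()])
learn.fit_one_cycle(1)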

Please, let us know if we can help you in any way with this.

This introduces a bit too much magic. I think there should be two names: BinaryRocAuc and RocAuc for the two separate metrics (that handle things differently).

Hi @sgugger,

Yes, @FraPochetti and I also discussed how the different cases should be grouped and named.

If we understand you correctly, you are proposing to split RocAuc in two to avoid the multi_class kwarg. That makes sense.

This would be our proposal for the 3 scenarios (gist with full code):

def RocAuc(axis=-1, average='macro', sample_weight=None, max_fpr=None):
    "Area Under the Receiver Operating Characteristic Curve for single-label binary classification problems"
    return skm_to_fastai(skm.roc_auc_score, axis=axis, flatten=False, softmax=True, proba=True,
                         average=average, sample_weight=sample_weight, max_fpr=max_fpr)

def RocAucMultiClass(axis=-1, average='macro', sample_weight=None, max_fpr=None, multi_class='ovr', labels=None):
    "Area Under the Receiver Operating Characteristic Curve for single-label multi-class classification problems"
    return skm_to_fastai(skm.roc_auc_score, axis=axis, flatten=False, softmax=True, proba=True,
                         average=average, sample_weight=sample_weight, max_fpr=max_fpr, multi_class=multi_class, labels=labels)
    
def RocAucMulti(axis=-1, average='macro', sample_weight=None, max_fpr=None):
    "Area Under the Receiver Operating Characteristic Curve for multi-label classification problems"
    return skm_to_fastai(skm.roc_auc_score, axis=axis, flatten=False, sigmoid=True, proba=True,
                         average=average, sample_weight=sample_weight, max_fpr=max_fpr)

As to the names, we have several options:

  • binary case: RocAuc or RocAucBinary and APScore
  • multi-class case: RocAucMultiClass (average precision not available in sklearn)
  • multi-label case: RocAucMulti or RocAucMultiLabel, and APScoreMulti

We believe RocAuc and RocAucMulti are consistent with all the other fastai2 metrics. The new one would be RocAucMultiClass, since the multi-class case of roc_auc_score requires different behavior.

I disagree with the multi-class terminology. All metrics for single-label work with any number of labels, so the base RocAuc/APScore should work for the multi-label case. Since the binary case requires special behavior, it should be BinaryRocAuc and BinaryAPScore.

I think you meant:
"All metrics for single-label work with any number of classes, so the base RocAuc / APScore should work for the multi-class case.”
Right?

If so, it makes sense.
May I suggest just one thing: can we use Binary as a suffix instead of a prefix? It makes the different RocAuc types easier to find with code completion, since they share a common prefix.

This way it’d be:

  • RocAuc: for single-label multi-class
  • RocAucBinary or BinaryRocAuc/ APScoreBinary or BinaryAPScore: for single-label binary
  • RocAucMulti / APScoreMulti: for multi-label

But it’s your call.

Yes I wanted to say multi-class, sorry.
No problem with having Binary as a suffix (since Multi is also a suffix).

Ok, good. So we agreed :sweat_smile:.

Here’s a gist with the code and the tests we used.

Here’s the code with agreed naming:

class AccumMetric and skm_to_fastai are unchanged from the proposal above, as are APScore and APScoreMulti; only the RocAuc variants are renamed:
    
def RocAucBinary(axis=-1, average='macro', sample_weight=None, max_fpr=None):
    "Area Under the Receiver Operating Characteristic Curve for single-label binary classification problems"
    return skm_to_fastai(skm.roc_auc_score, axis=axis, flatten=False, softmax=True, proba=True,
                         average=average, sample_weight=sample_weight, max_fpr=max_fpr)

def RocAuc(axis=-1, average='macro', sample_weight=None, max_fpr=None, multi_class='ovr', labels=None):
    "Area Under the Receiver Operating Characteristic Curve for single-label multi-class classification problems"
    return skm_to_fastai(skm.roc_auc_score, axis=axis, flatten=False, softmax=True, proba=True,
                         average=average, sample_weight=sample_weight, max_fpr=max_fpr, multi_class=multi_class, labels=labels)
    
def RocAucMulti(axis=-1, average='macro', sample_weight=None, max_fpr=None):
    "Area Under the Receiver Operating Characteristic Curve for multi-label classification problems"
    return skm_to_fastai(skm.roc_auc_score, axis=axis, flatten=False, sigmoid=True, proba=True,
                         average=average, sample_weight=sample_weight, max_fpr=max_fpr)

Will you update this in fastai2 then? Is there anything else you need from @FraPochetti or me?

I’ve made a tentative update. Let me know if you run into any problems with it.

Great!
I’ll test it right away, and will get back to you.

Ok, I’ve just finished testing, and have found a few (easy-to-solve) issues.

  • Binary: APScoreBinary and RocAucBinary both work as expected.

  • Multi-class: RocAuc works well too. But:
    • labels=None as a kwarg is missing
    • there’s a typo in the description:
      It says: "Area Under the Receiver Operating Characteristic Curve for single-label multi-label classification problems"
      when it should be: "Area Under the Receiver Operating Characteristic Curve for single-label multi-class classification problems"

  • Multi-label is not working well because a thresh=0.5 has been added. But these are proba-based metrics that don’t require a thresh.

I’ve removed thresh and now they work well.

So they should be:

def RocAuc(axis=-1, average='macro', sample_weight=None, max_fpr=None, multi_class='ovr', labels=None):
    "Area Under the Receiver Operating Characteristic Curve for single-label multi-class classification problems"
    assert multi_class in ['ovr', 'ovo']
    return skm_to_fastai(skm.roc_auc_score, axis=axis, activation=ActivationType.Softmax, flatten=False,
                         average=average, sample_weight=sample_weight, max_fpr=max_fpr,
                         multi_class=multi_class, labels=labels)


def APScoreMulti(sigmoid=True, average='macro', pos_label=1, sample_weight=None):
    "Average Precision for multi-label classification problems"
    activation = ActivationType.Sigmoid if sigmoid else ActivationType.No
    return skm_to_fastai(skm.average_precision_score, activation=activation, flatten=False,
                         average=average, pos_label=pos_label, sample_weight=sample_weight)


def RocAucMulti(sigmoid=True, average='macro', sample_weight=None, max_fpr=None):
    "Area Under the Receiver Operating Characteristic Curve for multi-label binary classification problems"
    activation = ActivationType.Sigmoid if sigmoid else ActivationType.No
    return skm_to_fastai(skm.roc_auc_score, activation=activation, flatten=False,
                         average=average, sample_weight=sample_weight, max_fpr=max_fpr)
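
For reference, a sketch of how the multi-label pair would now be invoked (dls_multilabel is a hypothetical DataLoaders built with a MultiCategoryBlock; note that no thresh is needed since these metrics are proba-based):

learn = cnn_learner(dls_multilabel, resnet34, metrics=[RocAucMulti(), APScoreMulti()])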

Thanks for investigating all of this. I removed the thresh and fixed the typo.

Great!
I’ve retested again and everything works smoothly now :ok_hand:
So from my side we can close this.
THANKS a lot @FraPochetti and @sgugger for your work to fix this issue. It’s been a pleasure working with you!

If you have classes {0, 1} and you want to use RocAuc, where the two classes are complementary (like cats & dogs):

learn = cnn_learner(dls, resnet34, metrics=[accuracy])
learn.fine_tune(1)

What’s the best way to invoke it?
I don’t see examples here

https://dev.fast.ai/metrics#RocAuc

Hi Gerardo,
Sorry for the late reply, but I was out last week.

  1. You should select the appropriate metric:
    • RocAucBinary: for single-label binary
    • RocAuc: for single-label multi-class
    • RocAucMulti / APScoreMulti: for multi-label
  2. In your case (binary classification):
    learn = cnn_learner(dls, resnet34, metrics=[accuracy, RocAucBinary()])
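
For completeness, a minimal end-to-end sketch (assuming the usual fastai2 imports and a standard two-class dls), with a manual sklearn cross-check mirroring the first post:

import sklearn.metrics as skm

learn = cnn_learner(dls, resnet34, metrics=[accuracy, RocAucBinary()])
learn.fine_tune(1)

valid_probas, valid_targets = learn.get_preds(dl=dls.valid)
skm.roc_auc_score(valid_targets, valid_probas[:, 1])  # should match the metric logged above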