Are ROC curves still useful when we use a NN with an output for each class?

I’ve been reading up on ROC curves in this article, and my understanding is that they are useful for picking a threshold for a binary classifier that outputs a single probability (e.g. a threshold of 0.4 means that anything lower than 0.4 is classified as negative).

However, with neural networks, e.g. the Cats vs. Dogs binary classification, we output two probabilities (after a softmax layer) and then classify the input according to which probability is higher (i.e. this line: preds = np.argmax(log_preds, axis=1)), which seems to bypass fiddling around with thresholds. Can/should ROC curves still be used in this context?

ROC curves are used most often in medical research, where they are essential.

I think you can replace np.argmax with np.where if you don’t want the threshold to be at the 50% mark. Say column 0 is cats and column 1 is dogs. Suppose we only want to classify an image as a cat if the column 0 probability is more than 70%.

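import numpy as np
# each row is one image: [P(cat), P(dog)]
prob = np.array([[0.51, 0.49], [0.71, 0.29], [0.2, 0.8]])
prob
>> array([[0.51, 0.49],
          [0.71, 0.29],
          [0.2 , 0.8 ]])

# predict cat (0) only when P(cat) > 0.7, otherwise dog (1)
np.where(prob[:, 0] > 0.7, 0, 1)
>> array([1, 0, 1])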

My understanding was that it didn’t need to use a threshold value at all - it was just seeing which activation is higher - either the cat or the dog node, which suggests that an ROC is not applicable in this context, but I wanted some confirmation on this conclusion!

That is correct in fastai, where the highest score is selected using np.argmax.
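It may help to see that the threshold has not really gone away. Here is a minimal sketch with made-up probabilities (not taken from the notebook): with two softmax outputs, np.argmax gives the same labels as thresholding the column 1 probability at 0.5, so the usual cut-off is just fixed implicitly.

```python
import numpy as np

# Made-up softmax outputs: column 0 = cat, column 1 = dog (illustrative only)
prob = np.array([[0.51, 0.49], [0.71, 0.29], [0.2, 0.8], [0.35, 0.65]])

# argmax picks whichever column is larger
argmax_preds = np.argmax(prob, axis=1)

# Each row sums to 1, so "column 1 is larger" is the same as "column 1 > 0.5"
threshold_preds = (prob[:, 1] > 0.5).astype(int)
assert np.array_equal(argmax_preds, threshold_preds)

# Moving the cut-off changes the predictions; an ROC curve summarizes every such choice
strict_preds = (prob[:, 1] > 0.7).astype(int)
```

So an ROC curve can still be drawn by treating one column as the score and sweeping the cut-off.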

However, say the images were x-rays and the two classes were “healthy” vs “cancer”. In that case you might want to set a threshold that minimizes the number of false negatives (i.e. the number of cases classified as healthy that in reality have cancer). A lower rate of false negatives results in a higher rate of false positives, which the doctors would then have to rule out by other means such as ultrasound, CT, medical history, family history, etc.
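As a rough sketch of how that choice could be made, the snippet below computes ROC points by hand with NumPy and picks the highest threshold that still catches every cancer case. The labels, scores, and the helper name roc_points are made up for illustration; a library routine would normally do this for you.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (thresholds, tpr, fpr), predicting positive when score >= threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    thresholds = np.unique(scores)[::-1]  # highest threshold first
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    tpr = np.array([((scores >= t) & (y_true == 1)).sum() / n_pos for t in thresholds])
    fpr = np.array([((scores >= t) & (y_true == 0)).sum() / n_neg for t in thresholds])
    return thresholds, tpr, fpr

# Made-up labels (1 = cancer, 0 = healthy) and predicted cancer probabilities
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
p_cancer = np.array([0.9, 0.8, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05])

thresholds, tpr, fpr = roc_points(y_true, p_cancer)

# Highest threshold with no false negatives (true positive rate of 1.0)
idx = np.argmax(tpr >= 1.0)
chosen_threshold, chosen_fpr = thresholds[idx], fpr[idx]
```

In this toy data, catching every cancer case means lowering the threshold to 0.3, at which point half of the healthy cases are flagged for follow-up.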

This tradeoff exists in many other domains too.

In the skin cancer classification paper published last year in Nature, the deep net outputs probabilities for 757 different disease classes, but they use ROC curves to compare binary classification performance against domain experts (see Fig. 3).
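One way to get from many class probabilities to a single score that an ROC curve can use is to add up the probability mass of the classes that count as "positive". The sketch below assumes four hypothetical classes and made-up numbers; it is only meant to illustrate the idea, not to reproduce the paper's exact procedure.

```python
import numpy as np

# Made-up softmax outputs over 4 hypothetical fine-grained classes
# columns: [benign_a, benign_b, malignant_a, malignant_b]
probs = np.array([
    [0.60, 0.30, 0.05, 0.05],
    [0.10, 0.20, 0.40, 0.30],
    [0.30, 0.25, 0.25, 0.20],
])
malignant_cols = [2, 3]

# One "malignant" score per image: total probability assigned to malignant classes
p_malignant = probs[:, malignant_cols].sum(axis=1)

# The score can be thresholded (and the threshold swept to draw an ROC curve)
preds = (p_malignant > 0.4).astype(int)
```

Note that the third row is flagged as malignant here even though its single most likely class is benign, which is something a plain argmax would not do.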

My concern with this paper is that the network was trained on a biased sample that does not reflect the actual incidence of skin cancer. They mention this as a limitation at the end of the article.

Yes, the ROC curve is not constructed to take incidence/prevalence into account.
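A small worked example of why that matters: sensitivity and specificity (the two axes of the ROC curve) can stay the same while the fraction of positive predictions that are actually correct changes a lot with prevalence. The numbers below are made up; the formula is just Bayes' rule.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value: P(disease | positive prediction)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same classifier (90% sensitivity, 90% specificity), two different populations
ppv_balanced = ppv(0.9, 0.9, prevalence=0.5)   # balanced sample
ppv_rare = ppv(0.9, 0.9, prevalence=0.01)      # disease is rare
```

With a balanced sample the positive predictive value is 0.9, but at 1% prevalence it drops to roughly 0.083, even though the ROC point has not moved.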

Thanks for the article link Todd - it does indeed set a threshold, which suggests it is capable of doing multi-label classification as well, e.g. if the probabilities of two skin cancer diseases are both above the threshold, it’ll predict both.

If the classification is mutually exclusive though, it seems like instead of outputting one value in the NN and then tweaking the threshold probability to determine whether it’s a dog or cat, you can use two outputs for the NN and then choose the higher activation, instead of having this additional parameter to tune?
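For what it’s worth, the two setups are closer than they look. A minimal sketch with made-up logits: a softmax over two outputs gives the same probability as a sigmoid applied to the difference of the two logits, so picking the higher activation is equivalent to a single output with the threshold fixed at 0.5 rather than a model with no threshold.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up logits: column 0 = cat, column 1 = dog
logits = np.array([[2.0, 0.5], [-1.0, 1.0], [0.3, 0.3]])

p_dog_softmax = softmax(logits)[:, 1]
p_dog_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])

# Two-output softmax and one-output sigmoid agree on P(dog)
assert np.allclose(p_dog_softmax, p_dog_sigmoid)
```

So the extra parameter is not removed by using two outputs; it is just left at its default, and it can still be moved if the costs of the two kinds of error differ.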