In the multicategory lecture (lecture 6), we create a custom accuracy metric that requires us to apply a sigmoid before calculating accuracy (accuracy_multi), but the corresponding function for single-category (but multiclass) accuracy does not require the sigmoid (accuracy).
Why this difference? What activations are being sent as the inp parameter by the Learner in each case? It seems from this code that post-sigmoid values are sent in the first case and pre-sigmoid values in the second. Is this right? How do you predict from a given model?
The same input activations are sent in both cases, and they won’t necessarily have been passed through a sigmoid already. Because sigmoid is a strictly increasing function, it does not matter to accuracy() whether or not it is included: the ordering of the activations is unchanged. accuracy() only needs to determine whether the single maximum prediction matches the target.
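A minimal sketch of that point, in plain Python rather than fastai or PyTorch code. The activation values and the helper names `sigmoid` and `argmax` are made up for illustration; the only claim is that a strictly increasing function leaves the position of the maximum unchanged.

```python
import math

def sigmoid(x):
    # Logistic function: maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def argmax(values):
    # Index of the largest element.
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical raw activations for one item over four classes.
logits = [-1.2, 0.3, 2.5, -0.7]
probs = [sigmoid(x) for x in logits]

# The winning class is the same before and after the sigmoid.
same_winner = argmax(logits) == argmax(probs)
print(argmax(logits), argmax(probs), same_winner)
```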
accuracy_multi uses a given probability as a threshold to decide which of several predictions count as true. If your model does not include a sigmoid at the end, you need sigmoid=True to convert the activations to probabilities, (-inf, +inf) -> (0, 1), before comparing them against thresh. If your model already outputs activations that are interpretable as probabilities, you do not need the sigmoid.
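To make the threshold issue concrete, here is a small plain-Python sketch. `accuracy_multi_sketch` and its `apply_sigmoid` flag are hypothetical stand-ins written for this post, not fastai's implementation, and the numbers are invented. It compares each label's score against thresh and reports the fraction of labels that match the targets.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def accuracy_multi_sketch(inp, targ, thresh=0.5, apply_sigmoid=True):
    # Hypothetical helper: fraction of individual labels predicted correctly.
    correct = 0
    total = 0
    for row, targets in zip(inp, targ):
        for act, t in zip(row, targets):
            score = sigmoid(act) if apply_sigmoid else act
            correct += int((score > thresh) == bool(t))
            total += 1
    return correct / total

# Two items, three labels each; values are raw (pre-sigmoid) activations.
logits = [[2.0, -1.0, 0.3], [-3.0, 0.2, 1.5]]
targets = [[1, 0, 1], [0, 1, 1]]

with_sigmoid = accuracy_multi_sketch(logits, targets, thresh=0.5, apply_sigmoid=True)
without_sigmoid = accuracy_multi_sketch(logits, targets, thresh=0.5, apply_sigmoid=False)
print(with_sigmoid, without_sigmoid)
```

With the sigmoid, the activations 0.3 and 0.2 map to probabilities just above 0.5 and count as positive. Without it, the raw values fall below the 0.5 threshold, so the two calls disagree even though the model output is identical.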
I hope this clarifies. IMHO, this reasoning and usage should be explained somewhere, maybe in the code?
But in the multiclass example in the lecture, I believe the learner will apply a sigmoid function to create the final activations, or else how is it applying binary cross-entropy loss? It’s opaque in the API, but isn’t that what is happening?
Or, asked another way, how do I determine whether a sigmoid has been applied to generate the activations of the final layer in fastai? Or does fastai always send the pre-sigmoid values?
This comes from the loss functions, not the metrics. So check out what BCEWithLogitsLossFlat will do. It has an activation function (which is where this is coming from), and so does CrossEntropyLossFlat.
(decodes is called in a situation like learn.get_preds(dl=dl, decoded=True))
I see that in BCEWithLogitsLossFlat a sigmoid is applied, but in CrossEntropyLossFlat there is a softmax.
I interpret this to mean that the model outputs the pre-sigmoid activations, and the loss function applies the final sigmoid activation prior to calculating the loss? It is the pre-sigmoid/softmax activation of the model that is sent to accuracy.
So in BCEWithLogitsLossFlat the pre-sigmoid activation is the final output of the model and this output is sent to the accuracy_multi function? The sigmoid will be applied by the loss function prior to calculating the loss, so the accuracy_multi function needs to add that sigmoid to determine the predictions and thus the accuracy?
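A rough plain-Python illustration of what "the loss applies the sigmoid" means. This is not fastai's or PyTorch's source; the function names are hypothetical. It uses the standard identity for binary cross-entropy on a logit x with target t, loss = max(x, 0) - x*t + log(1 + exp(-|x|)), and checks that it matches applying a sigmoid first and then ordinary binary cross-entropy.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_on_probability(p, t):
    # Ordinary binary cross-entropy; expects a probability p in (0, 1).
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

def bce_with_logit(x, t):
    # Same loss computed directly from the raw activation x, in a
    # numerically stable form. The sigmoid is folded into the formula.
    return max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))

# Hypothetical raw model outputs checked against both possible targets.
max_gap = 0.0
for x in [-4.0, -0.5, 0.0, 0.7, 3.2]:
    for t in (0, 1):
        gap = abs(bce_with_logit(x, t) - bce_on_probability(sigmoid(x), t))
        max_gap = max(max_gap, gap)

print(max_gap)
```

Since the model output stays pre-sigmoid in this setup, a metric that thresholds probabilities has to apply the sigmoid itself.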
With CrossEntropyLossFlat there is a softmax calculated. The highest activation prior to softmax will correspond to the highest softmax value, so there is no need to have a softmax in the accuracy function. But we could in principle add a softmax and it would not change the result?
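As far as I can tell the answer is yes. A quick plain-Python check, with invented activation values and hypothetical helper names, suggests that softmax preserves the ordering of the activations, so the argmax (and therefore single-label accuracy) is the same with or without it:

```python
import math

def softmax(values):
    # Subtract the max first for numerical stability; the result is unchanged.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical pre-softmax activations for one item over four classes.
logits = [0.4, 3.1, -2.0, 1.7]
probs = softmax(logits)

print(argmax(logits), argmax(probs))
```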