How to calculate the loss on classification tasks with more than two classes using F.nll_loss()

Problem: The column indexing approach to calculating loss for classification problems with more than two classes returns the negative probabilities, not the loss

In chapter 4 of Deep Learning for Coders with fastai and PyTorch, Jeremy describes using an indexing approach to calculate the loss on a ‘3’ vs ‘7’ binary classification problem like so:

softmax_activations = torch.softmax(acts, dim=1)
tensor([[0.6025, 0.3975],
        [0.5021, 0.4979],
        [0.1332, 0.8668],
        [0.9966, 0.0034],
        [0.5959, 0.4041],
        [0.3661, 0.6339]])

target = tensor([0,1,0,1,1,0])
index = range(6)
softmax_activations[index, target]
tensor([0.6025, 0.4979, 0.1332, 0.0034, 0.4041, 0.3661])

By indexing into the ‘3’ activation column when the target is 0 (i.e. a ‘3’) and the ‘7’ column when the target is 1 (i.e. a ‘7’), you pick out the probability assigned to the correct class, which is then used for the loss.
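For reference, the indexing step can be reproduced end to end like this (building the tensor directly from the quoted softmax outputs, since the raw acts aren't shown):

```python
import torch

# The softmax outputs quoted above, reproduced as a tensor
softmax_activations = torch.tensor([[0.6025, 0.3975],
                                    [0.5021, 0.4979],
                                    [0.1332, 0.8668],
                                    [0.9966, 0.0034],
                                    [0.5959, 0.4041],
                                    [0.3661, 0.6339]])
target = torch.tensor([0, 1, 0, 1, 1, 0])
index = range(6)

# Pick row i, column target[i]: the probability assigned to the correct class
picked = softmax_activations[index, target]
print(picked)  # tensor([0.6025, 0.4979, 0.1332, 0.0034, 0.4041, 0.3661])
```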

Jeremy says that this can scale to more than two classes, but I don’t see how. Surely you would need to sum all the columns except your target’s to get the loss? However, when I use F.nll_loss, which uses the indexing approach, on a multiclass classification example like so:

sm_acts = torch.softmax(acts, 1); sm_acts
tensor([[0.7189, 0.1380, 0.1431],
        [0.0443, 0.1951, 0.7606],
        [0.5092, 0.3590, 0.1318],
        [0.1392, 0.1421, 0.7187],
        [0.0353, 0.6423, 0.3224],
        [0.2209, 0.1252, 0.6540]])

targ = tensor([0, 0, 1, 1, 2, 2])
F.nll_loss(sm_acts, targ, reduction='none')
tensor([-0.7189, -0.0443, -0.3590, -0.1421, -0.3224, -0.6540])

It gives me the negative probability of the target class, not the loss. Can someone clear this up for me?

Hi William, this took me a while to figure out when reading the book and playing with it. I believe PyTorch assumes that you apply LogSoftmax before passing the values to NLLLoss. If you modify the example by applying log between your softmax and NLLLoss, it seems to work out well:
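Something like this, reusing the numbers from the question (note that F.cross_entropy bundles the log-softmax and NLL steps into one call):

```python
import torch
import torch.nn.functional as F

sm_acts = torch.tensor([[0.7189, 0.1380, 0.1431],
                        [0.0443, 0.1951, 0.7606],
                        [0.5092, 0.3590, 0.1318],
                        [0.1392, 0.1421, 0.7187],
                        [0.0353, 0.6423, 0.3224],
                        [0.2209, 0.1252, 0.6540]])
targ = torch.tensor([0, 0, 1, 1, 2, 2])

# nll_loss expects *log* probabilities, so take the log of the softmax first
loss = F.nll_loss(sm_acts.log(), targ, reduction='none')
print(loss)  # positive values: -log(p) of the target class for each row
```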

Hi, thanks for the reply. Yes I can see that it works now.

If I have an activation of 0.7 for the correct class, then taking the -log() gives a loss of 0.357; if I improve the prediction to 0.8, I get a loss of 0.223, which is lower, as you would expect. The -log() of a very good prediction (0.999) is very low (0.001).
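A quick check of those numbers (torch.log is the natural logarithm, so use those values rather than base-10 ones):

```python
import torch

# Probabilities assigned to the correct class
probs = torch.tensor([0.7, 0.8, 0.999])

# The loss is the negative natural log: better predictions -> lower loss
losses = -torch.log(probs)
print(losses)  # tensor([0.3567, 0.2231, 0.0010])
```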

They definitely could have explained that better in the book, I think.

Hi @wjs20, hi @darek.kleczek,

The multi-label loss function is explained in this video at time 1:02:00. It mentions using binary_cross_entropy, which is described as mnist_loss combined with log.
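If it helps, here is a small sketch of that idea: binary_cross_entropy is essentially mnist_loss with a log applied to the selected probabilities (the preds/targs values here are made up for illustration):

```python
import torch
import torch.nn.functional as F

preds = torch.tensor([0.9, 0.4, 0.2])  # made-up sigmoid outputs, one per label
targs = torch.tensor([1., 1., 0.])

# mnist_loss with log: -log(p) where the target is 1, -log(1-p) where it is 0
manual = -torch.where(targs == 1, preds, 1 - preds).log().mean()

bce = F.binary_cross_entropy(preds, targs)
print(manual.item(), bce.item())  # the two agree
```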
Hope it helps !