How is the loss function computed for multi label classification?


I have a question about multi-label classification for satellite image data.

How is the loss function computed here? Since there are multiple labels, each with a 0 or 1 output, how does the loss take each label into account?


When you have a single label, a softmax is applied after the final linear layer of the model. The softmax outputs a set of values in [0, 1] that sum to 1, so each value acts like a probability for its class.

For multiple labels, however, a sigmoid is applied after the last linear layer so that each value lies in [0, 1], but there’s no constraint on the sum of the values, so you can have multiple values close to 1.

Then for the loss function: for a single label you can have predictions [0.1, 0.2, 0.7] and target [0, 0, 1]. For multiple labels, something like [0.8, 0.2, 0.9] with target [1, 0, 1] is possible. I’m not sure exactly how this is implemented, but this is the main idea 🙂
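To make the softmax-vs-sigmoid difference concrete, here is a small plain-Python sketch (the logit values are made up for illustration):

```python
import math

def softmax(xs):
    # subtract the max for numerical stability before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

logits = [1.1, -0.1, 0.8]  # hypothetical final-layer outputs

soft = softmax(logits)               # values sum to 1 -> single-label
sig = [sigmoid(x) for x in logits]   # each in (0, 1), no sum constraint

print(soft)
print(sig)
```

Notice that the softmax outputs always sum to 1, while the sigmoid outputs can sum to anything, which is what allows several labels to be "on" at once.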


Thanks @mnpinto. I got the point.

I had the same question. After reading up on it for a bit, I ended up implementing it using BCEWithLogitsLoss from PyTorch’s nn module. It expects the raw predictions (logits) for each class and the target multi-hot label for each class. The predictions are passed through a sigmoid inside BCEWithLogitsLoss before the loss is computed.
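A minimal sketch of how this looks in PyTorch (the tensor values here are made-up examples, not from any real model):

```python
import torch
import torch.nn as nn

# raw logits from the final linear layer -- no sigmoid applied yet
logits = torch.tensor([[1.1, -0.1, 0.8]])
# multi-hot target: classes 0 and 2 are present
target = torch.tensor([[1.0, 0.0, 1.0]])

# BCEWithLogitsLoss applies the sigmoid internally, which is more
# numerically stable than calling sigmoid + BCELoss separately
loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits, target)
print(loss.item())
```

Note that the targets are floats (0.0 or 1.0), and by default the per-element losses are averaged into a single scalar.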


Suppose your multi-hot encoding label is
[1, 0, 1]

Let’s say the last layer of our model outputs
[1.1, -0.1, 0.8]

Applying the sigmoid function to each element (so that each falls strictly between 0 and 1) gives us approximately
[0.75, 0.48, 0.69]

Compare the model output [0.75, 0.48, 0.69] with the ground truth label [1, 0, 1] using binary cross-entropy, element-wise, then sum (or average) the per-element losses.
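To make that arithmetic concrete, here is a plain-Python sketch of the element-wise computation for the example logits above (the mean reduction at the end matches PyTorch’s default):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

logits = [1.1, -0.1, 0.8]   # final-layer outputs from the example above
target = [1, 0, 1]          # multi-hot ground truth label

probs = [sigmoid(x) for x in logits]

# binary cross-entropy per element: -(y*log(p) + (1-y)*log(1-p))
per_element = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
               for p, y in zip(probs, target)]

loss = sum(per_element) / len(per_element)  # mean over the three labels
print(round(loss, 4))  # → 0.4343
```

Each label contributes its own binary cross-entropy term independently, which is why the sigmoid (rather than softmax) is the right activation here.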

This is what I gathered from