How to convert multi-label predictions into multi-class probabilities?

So I’m playing in the Toxic Comments Kaggle competition and was able to hack together a simple network that does multi-label predictions.

Looked at the planet notebooks which demonstrate how to use thresholding to predict each label … but this competition wants to know the probabilities of each class/label independently.

So … what are some recommended approaches for turning these multi-label predictions into class probabilities?

And, given that the scoring metric is column-wise log loss, would it be helpful to write my own custom loss function in lieu of using binary cross entropy?


I haven’t looked at the planet notebook yet, but I think the usual approach for predicting per-label probabilities in a multi-label problem is to have your final layer be a sigmoid. No need to do any thresholding unless/until you want to make a hard yes-no decision about whether something has a certain label.

For the toxic comments competition, it looks like there are six possible labels, so your network would output six scores/logits for each comment, and then you’d feed them to a sigmoid (which works element-wise). That gets you six numbers between 0 and 1, your individual per-class probabilities.
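To make the element-wise part concrete, here is a minimal sketch in plain Python. The six logit values are made up for illustration, and `sigmoid` is written by hand here; in PyTorch you would typically call `torch.sigmoid` on the whole output tensor instead.

```python
import math

def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw scores (logits) for ONE comment, one per label.
logits = [2.0, -1.0, 0.0, -3.0, 0.5, -0.5]

# The sigmoid is applied to each logit separately, so every label gets
# its own independent probability.
probs = [sigmoid(z) for z in logits]

print([round(p, 3) for p in probs])
print(sum(probs))  # unlike a softmax, these do not have to sum to 1
```

Because each label is scored on its own, a comment can get a high probability for several labels at once, or for none of them.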

As for the loss, you’d just use binary cross entropy. As an example of where the library does this, check out

Thanks for the reply @cqfd!

Let’s say you have a label where you determine that anything > 0.2 = “yes” and anything less = “no”. Even though we’re saying that a result of 0.21 = “yes” … you are saying that the probability of that label being yes is still only 0.21, correct?

If so, I guess that is what is confusing for me.

It seems that if we are saying that a value of 0.21 = “yes”, then the probability we give it of meaning “yes” should reflect that. In other words, the probability should be >= 0.5.

Hmm, yeah, that’s confusing. Maybe the planet competition requires the tricky thresholding because its evaluation metric isn’t what we’re actually using to train our network? (Presumably because the f2 score isn’t differentiable, so we have to use a differentiable proxy for the loss, namely cross entropy.) I guess weird things can happen when the loss function we’re using for gradient descent isn’t the actual loss function/evaluation metric we care about. If I’m understanding the file, I think the opt_th function tries to find an optimal threshold by just trying out a bunch and seeing which one works best for the f2 score. At any rate, seems like someone looked into optimizing the f2 score directly; I’ll have to give that paper a read.
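Here is roughly what I mean by "trying out a bunch of thresholds", as a plain-Python sketch. This is not the library's opt_th code: `fbeta` and `best_threshold` are hypothetical helpers, the data is made up, and it scores a single label rather than the per-image F2 used in the planet competition.

```python
def fbeta(y_true, y_pred, beta=2.0):
    """F-beta score for binary labels; beta=2 weights recall over precision."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def best_threshold(y_true, probs, candidates):
    """Try every candidate threshold and keep the one with the best F2."""
    best_t, best_score = None, -1.0
    for t in candidates:
        preds = [1 if p > t else 0 for p in probs]
        score = fbeta(y_true, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Made-up targets and predicted probabilities for one label.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.4, 0.25, 0.3, 0.1, 0.05, 0.2, 0.6]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

t, score = best_threshold(y_true, probs, candidates)
print(t, round(score, 3))
```

On this toy data the best threshold comes out well below 0.5, which fits the idea that the threshold is tuned for the F2 metric rather than read as "more likely than not".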

I think the toxic comment challenge will be simpler because you’re training the network with the actual evaluation metric (binary cross entropy). I don’t see why you’d need to do any thresholding at all, although I’m definitely curious if that’s wrong!

Yup I think that’s pretty much it. I’ve just been using a threshold of 0.5 for the Jigsaw comp for reporting accuracy. It doesn’t actually matter however since the comp is evaluated using cross entropy.

The threshold you choose for yes/no affects the sensitivity of your predictions. Sometimes you want a fairly low threshold so that more things are considered to be “yes”, for example when it’s really important that you have no false negatives. When detecting cancer, where positive means a tumor is found, you want to err on the side of predicting yes when it’s really no (i.e. the patient is actually not sick) and not the other way around.

You can also vary this threshold and see what happens. This is how an ROC curve is created, for example.
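A small sketch of that sweep, assuming made-up labels and probabilities and a hypothetical `roc_points` helper. Each threshold gives one (false positive rate, true positive rate) point, and plotting those points gives the ROC curve.

```python
def roc_points(y_true, probs, thresholds):
    """Return (threshold, false positive rate, true positive rate) triples."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for t in thresholds:
        # Everything with probability >= t is predicted "yes".
        tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= t)
        points.append((t, fp / neg, tp / pos))
    return points

# Made-up targets and predicted probabilities for one label.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.4, 0.25, 0.3, 0.1, 0.05, 0.2, 0.6]

points = roc_points(y_true, probs, [0.0, 0.2, 0.5, 0.8, 1.01])
for t, fpr, tpr in points:
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A low threshold catches every positive but lets in many false positives; raising it trades recall for fewer false alarms.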

In the case of the toxic comments competition, the threshold isn’t really important as they expect you to predict probabilities for each column independently. In fact, using a threshold will harm your score as predictions that are really confident (i.e. close to 0 or 1) are penalized more heavily if they turn out to be wrong.
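A quick way to see this is to compute the log loss on the raw probabilities and again after rounding them to hard 0/1 values. The sketch below uses made-up numbers and a hand-written `log_loss`; the `eps` clipping is an assumption borrowed from common log-loss implementations so that `log(0)` never occurs.

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary cross entropy, clipping predictions away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up targets and predicted probabilities for one column.
y_true = [1, 0, 1, 0]
probs = [0.8, 0.3, 0.4, 0.1]

# Thresholding at 0.5 turns every prediction into a fully confident 0 or 1.
hard = [1.0 if p > 0.5 else 0.0 for p in probs]

soft_loss = log_loss(y_true, probs)
hard_loss = log_loss(y_true, hard)
print(soft_loss, hard_loss)
```

The third prediction (0.4 for a true positive) is only mildly wrong as a probability, but once it is rounded down to 0 it becomes a maximally confident mistake and dominates the loss.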

The loss function to use for this competition is the regular binary cross entropy (a.k.a. log loss), computed for each column independently, added up, and then divided by 6.
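Written out as a formula, that would look something like the following. The symbols are my own notation, not the competition's: N is the number of comments, y_ij is the true 0/1 value of label j for comment i, and ŷ_ij is the predicted probability.

```latex
\text{loss}
  = \frac{1}{6} \sum_{j=1}^{6}
    \left(
      -\frac{1}{N} \sum_{i=1}^{N}
      \Bigl[ y_{ij} \log \hat{y}_{ij}
           + (1 - y_{ij}) \log\bigl(1 - \hat{y}_{ij}\bigr) \Bigr]
    \right)
```

The inner sum is the ordinary log loss for a single column; the outer sum averages the six column losses.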


Thanks for the excellent clarification @cqfd and @machinethink.

So is it worth it to define a custom loss function that calculates the negative log loss for each column / 6 … or just stick with binary cross entropy?

I think PyTorch’s built-in binary cross entropy in effect already does that. For example, torch.nn.functional.binary_cross_entropy just averages over everything you feed it: given yhat and y of shape (bs, 6), it will average over all bs*6 predictions, which is equivalent to first averaging over the six columns and only then averaging over the bs rows.

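import torch
import torch.nn.functional as F

# actuals: 2 comments x 6 labels
y = torch.tensor([
    [1., 0., 0., 1., 1., 0.],
    [0., 0., 0., 1., 0., 1.],
])

# predictions (probabilities, i.e. already passed through a sigmoid)
yhat = torch.tensor([
    [0.9, 0.1, 0.0, 0.8, 0.99, 0.01],
    [0.1, 0.2, 0.1, 0.95, 0.3, 0.5],
])

print(F.binary_cross_entropy(yhat, y))

def bce_by_hand(yhat, y):
    ces = -(y*yhat.log() + (1-y)*(1-yhat).log())
    ces[ces != ces] = 0  # 0 * log(0) gives nan; replace nans with zero
    return ces.mean()

def bce_by_averaging_columns_first(yhat, y):
    ces = -(y*yhat.log() + (1-y)*(1-yhat).log())
    ces[ces != ces] = 0  # same nan trick as above
    # dim=1 averages across the six label columns, one value per row
    col_avgs = ces.mean(dim=1, keepdim=True)
    return col_avgs.mean()

# all three printed values should agree up to floating point error
print(bce_by_hand(yhat, y))
print(bce_by_averaging_columns_first(yhat, y))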