I haven’t looked at the planet notebook yet, but I think the usual approach for predicting multi-label probabilities (where each item can have several classes at once) is to make your final layer a sigmoid. No need to do any thresholding unless/until you want to make a hard yes/no decision about whether something has a certain class.
For the toxic comments competition, it looks like there are six possible labels, so your network would output six scores/logits for each comment, and then you’d feed them to a sigmoid (which works element-wise). That gets you six numbers between 0 and 1, your individual per-class probabilities.
As for the loss, you’d just use binary cross entropy. As an example of where the fast.ai library does this, check out conv_learner.py.
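For what it’s worth, here is a minimal sketch of that setup in plain PyTorch. It is not the fast.ai code: the layer sizes, the batch size, and the random inputs are all made up for illustration, and a single linear layer stands in for the whole network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes; only the six labels come from the competition.
n_features, n_labels, bs = 10, 6, 4

# Stand-in for the network's final layer: one raw score (logit) per label.
final_layer = nn.Linear(n_features, n_labels)

x = torch.randn(bs, n_features)                    # fake batch of comment features
y = torch.randint(0, 2, (bs, n_labels)).float()    # fake 0/1 targets; several can be 1 at once

logits = final_layer(x)          # shape (bs, 6), any real number
probs = torch.sigmoid(logits)    # element-wise, so each entry lands in (0, 1)

# Binary cross entropy on the probabilities...
loss = F.binary_cross_entropy(probs, y)
# ...or the variant that takes the logits directly and applies the sigmoid internally.
loss_from_logits = F.binary_cross_entropy_with_logits(logits, y)
```

Note that the six probabilities in a row are independent and don’t have to sum to 1, which is the difference from a softmax.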
Let’s say you have a label where you determine that anything > 0.2 = “yes” and anything less = “no”. Even though we’re saying that a result of 0.21 = “yes” … you are saying that the probability of that label being yes is still only 0.21, correct?
If so, I guess that is what is confusing for me.
It seems that if we are saying that a value of 0.21 = “yes”, then the probability we give it of meaning “yes” should reflect that. In other words, the probability should be >= 0.5.
Hmm, yeah, that’s confusing. Maybe the planet competition requires the tricky thresholding because its evaluation metric isn’t what we’re actually using to train our network? (Presumably because the f2 score isn’t differentiable, so we have to use a differentiable proxy for the loss, namely cross entropy.) I guess weird things can happen when the loss function we’re using for gradient descent isn’t the actual loss function/evaluation metric we care about. If I’m understanding the planet.py file, I think the opt_th function tries to find an optimal threshold by just trying out a bunch and seeing which one works best for the f2 score. At any rate, seems like someone looked into optimizing the f2 score directly; I’ll have to give that paper a read.
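If it helps, here is roughly what I’d expect a threshold search like that to look like. This is my own sketch and not the actual opt_th from planet.py: the `f2_score` and `best_threshold` helpers, the candidate thresholds, and the toy data are all made up, and I’m assuming the F2 score is averaged per sample.

```python
import numpy as np

def f2_score(y_true, y_pred, beta=2.0, eps=1e-9):
    """Sample-averaged F-beta for 0/1 arrays of shape (n_samples, n_labels)."""
    tp = (y_true * y_pred).sum(axis=1)
    precision = tp / (y_pred.sum(axis=1) + eps)
    recall = tp / (y_true.sum(axis=1) + eps)
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall + eps)
    return f.mean()

def best_threshold(probs, y_true, candidates=None):
    """Try each candidate threshold and keep the one with the highest F2."""
    if candidates is None:
        candidates = np.linspace(0.05, 0.5, 10)   # assumed search grid
    scores = [f2_score(y_true, (probs > th).astype(float)) for th in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), float(scores[best])

# Toy validation set: 3 samples, 4 labels (made-up numbers).
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]], dtype=float)
probs = np.array([[0.62, 0.08, 0.31, 0.04],
                  [0.22, 0.47, 0.11, 0.16],
                  [0.71, 0.36, 0.06, 0.27]])

th, score = best_threshold(probs, y_true)
f2_at_half = f2_score(y_true, (probs > 0.5).astype(float))
print(f"best threshold {th:.2f}, F2 {score:.3f} (vs {f2_at_half:.3f} at 0.5)")
```

On this toy data the search settles well below 0.5, because F2 weights recall more heavily than precision, so catching the low-probability true labels is worth a few extra false positives.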
I think the toxic comment challenge will be simpler because you’re training the network with the actual evaluation metric (binary cross entropy). I don’t see why you’d need to do any thresholding at all, although I’m definitely curious if that’s wrong!
Yup I think that’s pretty much it. I’ve just been using a threshold of 0.5 for the Jigsaw comp for reporting accuracy. It doesn’t actually matter however since the comp is evaluated using cross entropy.
The threshold you choose for yes/no affects the sensitivity of your predictions. Sometimes you want a fairly low threshold so that more things are considered to be “yes”, for example when it’s really important that you have no false negatives. When detecting cancer, where positive means a tumor is found, you want to err on the side of predicting yes when it’s really no (i.e. the patient is actually not sick) and not the other way around.
You can also vary this threshold and see what happens. This is how an ROC curve is created, for example.
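Here is a small sketch of that trade-off for a single label, using made-up numbers. The `roc_points` helper is hypothetical; each threshold gives one (false positive rate, true positive rate) point, and plotting those points over many thresholds gives the ROC curve.

```python
import numpy as np

def roc_points(probs, y_true, thresholds):
    """Return (false positive rate, true positive rate) for each threshold."""
    points = []
    for th in thresholds:
        pred = probs >= th
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        tpr = tp / np.sum(y_true == 1)   # sensitivity: share of real positives caught
        fpr = fp / np.sum(y_true == 0)   # share of real negatives flagged by mistake
        points.append((float(fpr), float(tpr)))
    return points

# Toy data: 1 = tumor present, 0 = healthy (made-up predictions).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
probs = np.array([0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05])

thresholds = [0.8, 0.5, 0.25]
points = roc_points(probs, y_true, thresholds)
for th, (fpr, tpr) in zip(thresholds, points):
    print(f"threshold {th:.2f}: TPR {tpr:.2f}, FPR {fpr:.2f}")
```

Lowering the threshold from 0.8 to 0.25 takes the true positive rate from 1/3 to 1 (no missed tumors), at the cost of the false positive rate going from 0 to 0.4.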
In the case of the toxic comments competition, the threshold isn’t really important as they expect you to predict probabilities for each column independently. In fact, using a threshold will harm your score as predictions that are really confident (i.e. close to 0 or 1) are penalized more heavily if they turn out to be wrong.
The loss function to use for this competition is the regular binary cross entropy (a.k.a. log loss), computed for each column independently, added up, and then divided by 6.
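To make that concrete, here is a sketch of the column-wise log loss in NumPy, plus what happens if you threshold first. The `mean_columnwise_log_loss` helper and the toy numbers are mine, and the clipping value `eps` is an assumption rather than something taken from the competition rules.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, probs, eps=1e-15):
    """Log loss for each column, then averaged over the columns."""
    p = np.clip(probs, eps, 1 - eps)      # avoid log(0); eps is an assumed value
    per_entry = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    per_column = per_entry.mean(axis=0)   # one loss per label
    return float(per_column.mean())       # add up the columns, divide by their count

# Toy data: 2 comments, 6 labels (made-up numbers).
y_true = np.array([[1, 0, 0, 1, 1, 0],
                   [0, 0, 0, 1, 0, 1]], dtype=float)
probs = np.array([[0.9, 0.1, 0.05, 0.8, 0.99, 0.01],
                  [0.1, 0.2, 0.1, 0.95, 0.3, 0.4]])

soft = mean_columnwise_log_loss(y_true, probs)

# Same predictions rounded to hard 0/1 at 0.5 before scoring.
hard = mean_columnwise_log_loss(y_true, (probs >= 0.5).astype(float))

print(f"probabilities: {soft:.3f}   thresholded: {hard:.3f}")
```

Only one entry is on the wrong side of 0.5 here (the 0.4 for a true label), but after thresholding it becomes a fully confident wrong answer, and that single entry dominates the thresholded score.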
I think PyTorch’s built-in binary cross entropy in effect already does that. For example, torch.nn.functional.binary_cross_entropy just averages over everything you feed it: given yhat and y of shape (bs, 6), it will average over all bs*6 predictions, which is equivalent to first averaging over the six columns and only then averaging over the bs rows.
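from torch import Tensor as T
from torch.autograd import Variable as V
import torch.nn.functional as F

y = V(T([
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1]
]))
yhat = V(T([
    [0.9, 0.1, 0, 0.8, 0.99, 0.01],
    [0.1, 0.2, 0.1, 0.95, 0.3, 0.5]
]))

def bce_by_hand(yhat, y):
    ces = -(y*yhat.log() + (1-y)*(1-yhat).log())
    ces[ces!=ces] = 0  # trick to replace nans with zero (0 * log(0) gives nan)
    return ces.mean()  # average over all bs*6 entries

def bce_by_averaging_columns_first(yhat, y):
    ces = -(y*yhat.log() + (1-y)*(1-yhat).log())
    ces[ces!=ces] = 0  # same nan trick, needed because yhat contains an exact 0
    col_avgs = ces.mean(dim=1, keepdim=True)  # average the six columns within each row
    return col_avgs.mean()  # then average over the bs rows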
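Here is a quick self-contained check of the "average over everything" claim on random inputs, written against a current PyTorch. The batch size, the seed, and the clamping range are arbitrary choices of mine; the clamp just keeps predictions away from exact 0 and 1 so no log(0) shows up.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

bs, n_labels = 8, 6
y = torch.randint(0, 2, (bs, n_labels)).float()
yhat = torch.rand(bs, n_labels).clamp(0.01, 0.99)   # strictly inside (0, 1)

# Per-entry cross entropy, shape (bs, 6).
ces = -(y * yhat.log() + (1 - y) * (1 - yhat).log())

builtin = F.binary_cross_entropy(yhat, y)      # default reduction: mean over all bs*6 entries
by_hand = ces.mean()                           # the same mean, written out
labels_then_rows = ces.mean(dim=1).mean()      # average the 6 labels per row, then the rows
rows_then_labels = ces.mean(dim=0).mean()      # per-label loss over the batch, then the 6 labels

print(builtin.item(), by_hand.item(), labels_then_rows.item(), rows_then_labels.item())
```

All four numbers agree up to floating point error, since every row has the same number of columns and every column the same number of rows, so the order of averaging doesn’t matter.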