When should we NOT call ".sigmoid" in a loss function?

I’m on chapter 6 of the video lessons, and it mentions that F.binary_cross_entropy does not call .sigmoid() as part of its implementation, but F.binary_cross_entropy_with_logits does make that call. For the purposes of the chapter’s multi-label classification, we’re told to use the _with_logits version, but when would we choose the other, non-sigmoid option?


With F.binary_cross_entropy(), you need to apply a sigmoid to the raw outputs of your neural network yourself before passing them into the loss function. With F.binary_cross_entropy_with_logits(), you just pass the model's raw output logits to the function, since the sigmoid is applied internally. So it's mostly an implementation decision: either you put an explicit sigmoid at the end of your model definition, or you let the loss function apply it during the loss calculation.
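
To make that concrete, here is a minimal sketch of the two pairings (the layer sizes, batch, and targets are just toy placeholders, not from the course):

import torch
import torch.nn as nn

linear = nn.Linear(10, 5)  # shared head so both paths see the same logits

# Option 1: sigmoid lives in the model, so the loss is plain BCE.
model_a = nn.Sequential(linear, nn.Sigmoid())
loss_a = nn.BCELoss()

# Option 2: the model emits raw logits and the loss applies the sigmoid itself.
model_b = nn.Sequential(linear)
loss_b = nn.BCEWithLogitsLoss()

x = torch.randn(4, 10)
targets = torch.randint(0, 2, (4, 5)).float()

print(loss_a(model_a(x), targets))
print(loss_b(model_b(x), targets))

Both prints should show the same value (up to floating-point error), since the two options share the same linear layer and only differ in where the sigmoid is applied.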


This was helpful for me. Here is a little code I used to confirm what you said:

import torch
import torch.nn.functional as F

# raw model outputs (logits) and random targets in [0, 1]
out = torch.randn((3, 2), requires_grad=True)
y = torch.rand((3, 2), requires_grad=False)

# binary_cross_entropy expects probabilities, so apply the sigmoid first
out_after_sigmoid = torch.sigmoid(out)
loss1 = F.binary_cross_entropy(out_after_sigmoid, y)
print(loss1)

# binary_cross_entropy_with_logits applies the sigmoid internally
loss2 = F.binary_cross_entropy_with_logits(out, y)
print(loss2)

output:

tensor(0.8506, grad_fn=<BinaryCrossEntropyBackward>)
tensor(0.8506, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
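
If you want a programmatic check rather than eyeballing the printed values, you could compare the two tensors from the snippet above directly:

print(torch.allclose(loss1, loss2))  # True, up to floating-point tolerance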

If generalization is your goal, the loss function isn't really the thing that matters. I would start by experimenting with different regularization methods: dropout, L2, L1, etc., in that order. I can't find the paper at hand, but I remember something along the lines of inductive bias (regularization) improving generalization when the sample size is small.
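
Not from the course, just a rough sketch of what those regularizers can look like in plain PyTorch (the layer sizes, dropout probability, weight_decay, and l1_lambda values are placeholder choices):

import torch
import torch.nn as nn

# Dropout lives inside the model definition.
model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # placeholder dropout probability
    nn.Linear(50, 5),
)

# L2 regularization via the optimizer's weight_decay term (example value).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# L1 has no built-in optimizer switch; one common approach is adding a penalty to the loss.
x = torch.randn(4, 10)
targets = torch.randint(0, 2, (4, 5)).float()
l1_lambda = 1e-4
l1_penalty = sum(p.abs().sum() for p in model.parameters())  # includes biases, fine for illustration
loss = nn.BCEWithLogitsLoss()(model(x), targets) + l1_lambda * l1_penalty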