I’m on chapter 6 of the video lessons, and it mentions that F.binary_cross_entropy does not call .sigmoid() as part of its implementation, but F.binary_cross_entropy_with_logits does make that call. For the chapter’s multi-label classification, we’re told to use the _with_logits version, but when would we choose the other, non-sigmoid option?
With F.binary_cross_entropy(), you need to apply a sigmoid to your network’s output logits before passing them into the loss function. With F.binary_cross_entropy_with_logits(), you just provide the raw logits, since the sigmoid is applied internally. So it’s largely an implementation decision: whether you want an explicit sigmoid when you define your model, or assume the sigmoid happens during the loss calculation. One practical reason to prefer the _with_logits version is numerical stability: it combines the sigmoid and the log into a single log-sum-exp computation, which avoids the precision loss you get when a separately applied sigmoid saturates to exactly 0 or 1 in floating point.
This was helpful for me. Here is a little code I used to confirm what you said:
import torch
import torch.nn.functional as F
out = torch.randn((3, 2), requires_grad=True)
y = torch.rand((3, 2), requires_grad=False)
out_after_sigmoid = torch.sigmoid(out)  # torch.sigmoid; F.sigmoid is deprecated
loss1 = F.binary_cross_entropy(out_after_sigmoid, y)
print(loss1)
loss2 = F.binary_cross_entropy_with_logits(out, y)
print(loss2)
output:
tensor(0.8506, grad_fn=<BinaryCrossEntropyBackward>)
tensor(0.8506, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
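The two match on moderate logits like these, but they diverge once the sigmoid saturates. A small follow-up experiment (the logit value 20.0 is just an arbitrary large number I picked): in float32, torch.sigmoid(20.0) rounds to exactly 1.0, so the separate-sigmoid path loses the information before the log is ever taken, while the fused version computes in log space and stays accurate:

```python
import torch
import torch.nn.functional as F

# A large positive logit with a target of 0: sigmoid saturates to 1.0 in float32.
logits = torch.tensor([20.0])
target = torch.tensor([0.0])

# Separate sigmoid: log(1 - 1.0) = -inf, which binary_cross_entropy clamps,
# so the loss comes out as a capped constant rather than the true value.
loss_sep = F.binary_cross_entropy(torch.sigmoid(logits), target)

# Fused version: stays in log space, giving the true loss of about 20.
loss_fused = F.binary_cross_entropy_with_logits(logits, target)

print(loss_sep, loss_fused)
```

With a larger negative target error the gap only grows, which is why the fused version is the safer default.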
If generalization is your goal, the loss function isn’t the thing that matters. I would start experimenting with different regularization methods: dropout, L2, L1, etc., in that order. I can’t find the paper at hand, but I remember something along the lines of inductive bias (regularization) improving generalization on small-sample-size data.
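For reference, a minimal sketch of how those three regularizers might look in PyTorch — the layer sizes, dropout probability, and the 1e-4 / 1e-5 coefficients are arbitrary placeholders, not tuned values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A small multi-label model with dropout between layers (hypothetical sizes).
model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Dropout(p=0.5),          # 1. dropout
    nn.Linear(50, 2),
)

# 2. L2 regularization via the optimizer's weight_decay argument.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# 3. L1 regularization added manually to the loss.
x = torch.randn(4, 10)
y = torch.rand(4, 2)
logits = model(x)
bce = F.binary_cross_entropy_with_logits(logits, y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = bce + 1e-5 * l1_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

You would normally try these one at a time and tune each coefficient on a validation set rather than stacking all three blindly.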