Thank you very much for this!
I like that explanation a lot. If I look at the distribution of predictions after training with the above loss function, I see the following:
If I now tweak the loss function to something else (complete nonsense, of course!):

```python
import torch

def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    # mind the change of the second argument from 1-predictions to 1+predictions
    return torch.where(targets==1, 1+predictions, predictions).mean()
```
I see a different distribution:
As expected, changing the loss function will lead to the predictions being “optimized” differently.
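To make that concrete, here is a small self-contained sketch (the function names and the toy batch are mine, not from the chapter) comparing the gradients the two losses produce on the same predictions: with the original loss, SGD pushes the predictions for `targets==1` up towards 1, while with the tweaked loss it pushes them down.

```python
import torch

# Original loss from the chapter, for comparison.
def mnist_loss_original(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets == 1, 1 - predictions, predictions).mean()

# The "nonsense" tweak from above.
def mnist_loss_tweaked(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets == 1, 1 + predictions, predictions).mean()

# Toy batch: two positive (targets==1) and two negative (targets==0) examples,
# all raw predictions at 0 (so sigmoid gives 0.5 for each).
preds = torch.zeros(4, requires_grad=True)
targets = torch.tensor([1, 1, 0, 0])

mnist_loss_original(preds, targets).backward()
print(preds.grad)  # tensor([-0.0625, -0.0625,  0.0625,  0.0625]) -> SGD pushes targets==1 predictions up

preds.grad = None
mnist_loss_tweaked(preds, targets).backward()
print(preds.grad)  # tensor([0.0625, 0.0625, 0.0625, 0.0625]) -> SGD now pushes targets==1 predictions down
```

The sign flip on the `targets==1` entries is exactly why the two training runs end up with such different prediction distributions.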