I feel like something is off with label smoothing. While the implementation is correct and agrees with the paper, my intuition suggests that the additional ε/N should not be added to the term for the correct class.
In the notebook for label smoothing we see the following explanation:
Another regularization technique that's often used is label smoothing. It's designed to make the model a little bit less certain of its decision by changing its target a little bit: instead of wanting to predict 1 for the correct class and 0 for all the others, we ask it to predict 1-ε for the correct class and ε for all the others, with ε a (small) positive number and N the number of classes. This can be written as:

loss = (1-ε) ce(i) + \frac{ε}{N} \sum_j ce(j)

where ce(x) is the cross-entropy of x (i.e. -\log(p_{x})), and i is the correct class.
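For concreteness, here is a minimal PyTorch sketch of the loss exactly as written above (just my own illustration, not the notebook's code; the function and variable names are made up):

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, target, eps=0.1):
    """loss = (1-ε) ce(i) + ε/N Σ_j ce(j), with ce(j) = -log p_j."""
    n = logits.size(-1)                           # N, the number of classes
    log_preds = F.log_softmax(logits, dim=-1)     # log p_j
    ce_i = F.nll_loss(log_preds, target)          # -log p_i for the correct class
    ce_all = -log_preds.sum(dim=-1).mean()        # Σ_j -log p_j over ALL classes
    return (1 - eps) * ce_i + eps / n * ce_all
```

Note that `ce_all` runs over every class, including the correct one, which is exactly the point below.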
However, it turns out that the sum in the second term runs over the entire class list, i.e. we never take special care to exclude the correct class. Thus, the coefficient for ce(i) becomes (1 - ε + \frac{ε}{N}).
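Splitting the sum into the correct class and the remaining classes makes this explicit (same notation as above):

loss = (1-ε) ce(i) + \frac{ε}{N} ce(i) + \frac{ε}{N} \sum_{j≠i} ce(j) = (1 - ε + \frac{ε}{N}) ce(i) + \frac{ε}{N} \sum_{j≠i} ce(j)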
This pushes the minimum of the loss further to the right. For example, in the binary case with ε=0.1, using the original formula the minimum is found at a predicted probability of x=0.95 for the correct class instead of x=0.9.
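A quick numerical check of the binary case (again just a sketch; here p is the probability the model assigns to the correct class and 1-p to the other one):

```python
import numpy as np

eps, N = 0.1, 2
p = np.linspace(1e-4, 1 - 1e-4, 1_000_000)  # probability predicted for the correct class

# Formula as implemented: the sum runs over ALL classes, including the correct one.
loss = (1 - eps) * (-np.log(p)) + eps / N * (-np.log(p) - np.log(1 - p))
print(p[loss.argmin()])  # ≈ 0.95, i.e. 1 - ε + ε/N
```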