I revised and extended the discussion on label smoothing in the 10_b_mixup_label_smoothing_jcat.ipynb notebook.
A Note on Label Smoothing
Another regularization technique that’s often used is label smoothing. The basic idea is to make the model a little bit less certain of its decision. Here we describe two approaches:
Method #1 (section 5.2 in the “Bag of Tricks” paper)
We effectively add noise to the training labels by replacing each target label with a mixture distribution, with weights of 1-\varepsilon for the case where the label is the correct class and \varepsilon for the case where the label is distributed uniformly among the incorrect classes. We choose \varepsilon to be a positive number that is much smaller than one (0.1 by default), so that the mixture is dominated by the case with the correct training label. This leads to the following loss function:
loss = (1-\varepsilon) \ell(i) + \varepsilon \sum_{k \ne i} \frac{\ell(k)}{K-1}
where \ell(k) = -\log(p_{k}) is the cross-entropy loss for class k, K is the number of classes, and i is the correct class.
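As a minimal sketch (not the notebook's actual code; `smoothed_loss_method1` and its arguments are illustrative names), Method #1 can be written directly from the formula:

```python
import numpy as np

def smoothed_loss_method1(p, i, eps=0.1):
    """Method #1: weight 1 - eps on the correct class i, and eps spread
    uniformly over the K - 1 incorrect classes."""
    ell = -np.log(p)                        # ell[k] = -log(p_k)
    K = len(p)
    wrong = (ell.sum() - ell[i]) / (K - 1)  # mean loss over incorrect classes
    return (1 - eps) * ell[i] + eps * wrong
```

With `eps = 0` this reduces to the plain cross-entropy loss `-log(p[i])`.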
Method #2 (original description of label-smoothing, from section 7 of the “Inception” paper)
Again we form a mixture distribution, this time with weights of 1-\varepsilon for the case where the label is the correct class and \varepsilon for the case where we know nothing about the label. The latter case is represented by assuming the label is distributed uniformly across the K classes. Again, since \varepsilon is a small probability, the mixture is dominated by the correct case. This leads to a slightly different loss function:
loss = (1-\varepsilon) \ell(i) + \varepsilon \sum_{k=1}^K \frac{\ell(k)}{K}
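Method #2 differs only in that the uniform part runs over all K classes, including the correct one. A minimal sketch (illustrative names, not the notebook's code):

```python
import numpy as np

def smoothed_loss_method2(p, i, eps=0.1):
    """Method #2: weight 1 - eps on the correct class i, and eps spread
    uniformly over all K classes (including the correct one)."""
    ell = -np.log(p)  # ell[k] = -log(p_k)
    return (1 - eps) * ell[i] + eps * ell.mean()
```

Again, `eps = 0` recovers plain cross-entropy, and for a uniform prediction over K classes the loss is log K for any `eps`.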
We implemented both methods for comparison and performed a set of ten single-epoch training runs for each. The resulting accuracies for the two methods were statistically indistinguishable, i.e. both methods give essentially the same result.
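This is expected: expanding the sums shows that Method #2 with smoothing \varepsilon is identical to Method #1 with smoothing \varepsilon (K-1)/K, so the two losses differ only by a rescaling of \varepsilon. A quick numerical check of this identity (illustrative function names, not the notebook's code):

```python
import numpy as np

def method1(p, i, eps):
    # weight 1 - eps on class i, eps uniform over the K - 1 wrong classes
    ell = -np.log(p)
    K = len(p)
    return (1 - eps) * ell[i] + eps * (ell.sum() - ell[i]) / (K - 1)

def method2(p, i, eps):
    # weight 1 - eps on class i, eps uniform over all K classes
    ell = -np.log(p)
    return (1 - eps) * ell[i] + eps * ell.mean()

p = np.array([0.6, 0.2, 0.1, 0.1])
K = len(p)
eps = 0.1
# Method #2 with eps matches Method #1 with eps * (K - 1) / K
assert np.isclose(method2(p, 0, eps), method1(p, 0, eps * (K - 1) / K))
```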
Label smoothing Method #2 is the one that’s implemented in the notebook.
Does label smoothing improve accuracy?
I performed a test and wrote up the results in a Conclusion section:
We performed ten single-epoch training runs each, without and with label smoothing.
Without label smoothing the mean and standard deviation in accuracy are: 0.264 and 0.082
With label smoothing the mean and standard deviation in accuracy are: 0.306 and 0.055
The standard errors of the means are
\sigma_{MeanAccuracyWithoutLS} = \frac{0.082}{\sqrt{10}} = 0.03, and
\sigma_{MeanAccuracyWithLS} = \frac{0.055}{\sqrt{10}} = 0.02
The accuracies are therefore
0.26\pm0.03 without label smoothing
0.31\pm0.02 with label smoothing
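The arithmetic above can be checked directly (n = 10 runs per configuration):

```python
import math

n = 10  # number of single-epoch training runs per configuration
sem_without = 0.082 / math.sqrt(n)  # standard error of the mean, no smoothing
sem_with = 0.055 / math.sqrt(n)     # standard error of the mean, with smoothing
print(f"0.26 ± {sem_without:.2f} without label smoothing")
print(f"0.31 ± {sem_with:.2f} with label smoothing")
```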
The difference in mean accuracy, 0.042, is about 1.2 combined standard errors (\sqrt{0.03^2 + 0.02^2} \approx 0.036), so these runs suggest that label smoothing improves accuracy, though more runs would be needed to establish the difference as statistically significant.