Is Label Smoothing off by eps/N?

I feel like something is off with label smoothing. While the implementation is correct and agrees with the paper, my intuition suggests that the additional eps/N should not be added to the term for the correct class.

In the notebook for label smoothing we see the following explanation:
Another regularization technique that’s often used is label smoothing. It’s designed to make the model a little bit less certain of its decision by changing its target a little bit: instead of wanting to predict 1 for the correct class and 0 for all the others, we ask it to predict 1-ε for the correct class and ε for all the others, with ε a (small) positive number and N the number of classes. This can be written as:

loss = (1-ε) ce(i) + ε \sum ce(j) / N

where ce(x) is cross-entropy of x (i.e. -\log(p_{x})), and i is the correct class.

However, it turns out that the second sum is over the entire class list, i.e., we never take special care to ignore the correct class. Thus, the coefficient for ce(i) becomes (1-ε + \frac{\epsilon}{N}).

This pushes the minimum of the function further to the right. For example, in the binary case with eps=0.1, if we use the original formula, the minimum is found at x=0.95 instead of x=0.9, where x is the predicted probability of the correct class.
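A quick numerical check of the binary case, as a minimal sketch in plain Python. The names `smoothed_loss` and `argmin_binary` are made up for illustration and are not fastai code; the loss is the formula quoted above, with the sum running over all classes.

```python
import math

def smoothed_loss(probs, correct, eps):
    """loss = (1-eps)*ce(correct) + eps * sum_j ce(j) / N, summing over ALL classes."""
    n = len(probs)
    ce = [-math.log(p) for p in probs]
    return (1 - eps) * ce[correct] + eps * sum(ce) / n

def argmin_binary(eps, steps=1000):
    """Grid-search the correct-class probability x that minimises the loss for N=2."""
    grid = [k / steps for k in range(1, steps)]  # skip 0 and 1, where log blows up
    return min(grid, key=lambda x: smoothed_loss([x, 1 - x], 0, eps))

best_x = argmin_binary(0.1)
print(best_x)  # the grid point closest to the minimum, 0.95 for eps=0.1
```

The grid search is deliberately crude; it is only meant to show where the minimum sits, not to be an efficient optimiser.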


We do indeed take special care. See if you can convince yourself of this:

\begin{aligned} \left ( 1 - \frac{N-1}{N} eps \right ) (-\log(p_{i})) + \sum_{j \neq i} \frac{eps}{N} (-\log(p_{j})) \\ = (1-eps) (-\log(p_{i})) + \sum \frac{eps}{N} (-\log(p_{j})) \end{aligned}

(originally @sgugger included this in the notebook but I removed it since it was more math than I wanted to show :wink: )
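For readers who want the intermediate step, here is one way to see the identity above, written with \epsilon for eps. It only splits the correct class i out of the full sum and collects the coefficients of -\log(p_{i}):

```latex
\begin{aligned}
(1-\epsilon)\,(-\log p_{i}) + \sum_{j} \frac{\epsilon}{N}\,(-\log p_{j})
&= (1-\epsilon)\,(-\log p_{i}) + \frac{\epsilon}{N}\,(-\log p_{i}) + \sum_{j \neq i} \frac{\epsilon}{N}\,(-\log p_{j}) \\
&= \left(1 - \epsilon + \frac{\epsilon}{N}\right)(-\log p_{i}) + \sum_{j \neq i} \frac{\epsilon}{N}\,(-\log p_{j}) \\
&= \left(1 - \frac{N-1}{N}\,\epsilon\right)(-\log p_{i}) + \sum_{j \neq i} \frac{\epsilon}{N}\,(-\log p_{j})
\end{aligned}
```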


Thanks for the reply!
The formula you posted works and is the formula used in the paper.
What I don’t like is the location of the minimum. If you calculate it (either empirically or analytically) you will see that it is located at (1-\frac{N-1}{N}\epsilon,\frac{\epsilon}{N}, \frac{\epsilon}{N},...) assuming 0 is the correct class. This is in contrast with the sentence “ask it to predict 1-ε for the correct class and ε for all the others”. It turns out that the ε in that sentence is actually \frac{\epsilon}{N}, and even that is imprecise for N > 2, because the correct class then gets 1-(N-1)\frac{\epsilon}{N} rather than 1-\frac{\epsilon}{N}.
My intuition was that the minimum should be at (1-\epsilon, \frac{\epsilon}{N-1}, \frac{\epsilon}{N-1}, ...) which is easily achievable if cross entropy with the noisy labels is used.
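To illustrate what "cross entropy with the noisy labels" could look like, here is a minimal plain-Python sketch. The helpers `soft_cross_entropy` and `noisy_target` are hypothetical names, not fastai functions. The check relies on the standard fact that cross-entropy against a fixed soft target is minimised when the prediction equals that target.

```python
import math

def soft_cross_entropy(target, probs):
    """Cross-entropy -sum_k target[k] * log(probs[k]) against a soft target."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))

def noisy_target(n, eps, correct=0):
    """Hypothetical helper: 1-eps on the correct class, eps/(n-1) on every other class."""
    return [1 - eps if k == correct else eps / (n - 1) for k in range(n)]

target = noisy_target(n=4, eps=0.1)
at_target = soft_cross_entropy(target, target)

# Move a little probability mass between the correct class and class 1, in both directions.
shifted_down = soft_cross_entropy(target, [0.88, target[1] + 0.02, target[2], target[3]])
shifted_up = soft_cross_entropy(target, [0.92, target[1] - 0.02, target[2], target[3]])

print(target)
print(at_target < shifted_down, at_target < shifted_up)
```

Both comparisons come out in favour of predicting the target itself, so with these targets the minimum sits at 1-eps for the correct class.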


Your intuition does sound correct. Have you tested it?

I can confirm that the minima are shifted - the math checks out both algebraically and empirically.
In terms of tests … nothing changes fundamentally, so I don’t expect any benefits in terms of the quality of training.
It’s more of a UX issue - if \epsilon=0.1, then I expect the minimum to be achieved when the probability for the correct class is 1-\epsilon=0.9. To get this behaviour currently, you need to pass \frac{N}{N-1}\epsilon instead of \epsilon, which is a nuisance.
Note that, for large N, the difference is minimal, i.e. 1- \frac{N-1}{N}\epsilon \approx 1-\epsilon. I’m working with noisy binary labels (N=2), so I can feel the difference.
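The rescaling can be sketched in a few lines of plain Python. Both function names are illustrative only; the formulas are the ones stated in this post.

```python
def effective_correct_prob(eps, n):
    """Correct-class probability at the minimum of the current loss: 1 - (N-1)/N * eps."""
    return 1 - (n - 1) / n * eps

def rescaled_eps(eps, n):
    """Value to pass as eps so that the minimum lands at 1 - eps for the correct class."""
    return n / (n - 1) * eps

for n in (2, 10, 1000):
    plain = effective_correct_prob(0.1, n)
    fixed = effective_correct_prob(rescaled_eps(0.1, n), n)
    print(n, plain, fixed)
```

For N=2 the unscaled minimum is at 0.95 and the rescaled one at 0.9, while for N=1000 the two are almost indistinguishable.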


The first implementation was using \frac{\epsilon}{N-1} for the other labels and 1-\epsilon for the correct label but we got feedback it wasn’t as good with classification problems where there are few labels (for imagenet and N=1000 it doesn’t really change anything).
Happy to change back to it if we have proof it doesn’t help, though rescaling the epsilon can probably achieve the same effect.


Great! Thanks for the discussion. I’m happy it wasn’t just for the sake of reimplementing the paper.
Will get back to you if my experiments yield a significant difference.

Would it be a good idea to have a general label smoother as a callback rather than a loss function which manipulates the target before loss is calculated?

Can’t we have uncertain labels for other tasks, e.g. multilabel, segmentation, etc.? Or does label smoothing only help in the softmax case?

It’s not immediately clear to me how to apply label smoothing for segmentation, but I’ve been trying it with multilabel with some degree of success. My approach doesn’t have the same degree of mathematical justification, but I use a binarizer to create binary indicators for each training sample (1 means the label is valid for the given example, 0 means it is not). I then just change each 1 to 0.95 and give all the 0's a small value (e.g. 0.001).

My labels don’t sum to 1 because there can be multiple true labels for a given training example. I should also mention that I don’t use fastai for any of this, I just pre-process my .csv or dataframe before feeding it to fastai.
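A minimal sketch of that pre-processing step in plain Python. The function name `smooth_multilabel` is hypothetical, and the binarizer itself is not reproduced; only the 0.95 and 0.001 values come from the description above.

```python
def smooth_multilabel(indicators, pos_value=0.95, neg_value=0.001):
    """Replace hard 0/1 indicators with softened targets, one row per training sample."""
    return [[pos_value if v == 1 else neg_value for v in row] for row in indicators]

# Two toy samples over four labels; the first one has two true labels.
hard = [[1, 0, 0, 1],
        [0, 1, 0, 0]]
soft = smooth_multilabel(hard)
print(soft)
```

As noted, the rows are not expected to sum to 1, since each label is treated as its own yes/no target.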


In the Bag of Tricks paper they justify label smoothing through distributional properties of the softmax: they compare the gap between the prediction for the correct class and the rest. That’s why I was curious.

I think label smoothing solves the problem which arises from softmax:

The optimal solution is z = inf while keeping others small enough. In other words, it encourages the output scores dramatically distinctive which potentially leads to overfitting.

Sigmoid, which is what multilabel uses, might not have this issue to begin with, but the idea can be applied to segmentation or single-shot detectors, which use softmax.

Yeah, my justification for using it was much more hand-wavy. In a clean dataset, I understand a label of 1 to mean “This label is present with probability 100%”. In my dataset (which is noisy) I interpret my label smoothing changes as saying “This label is present with probability 95%” (or whatever % is chosen).

No idea if that’s justified, but it seems to help.

That correlates with what Jeremy said in one of the lessons, you are never 100% certain of your labels :wink: It’s probably highly dependent on dataset and can’t know without trying it out as always :smiley:

I have used LabelSmoothingCrossEntropy() as my loss function and used get_preds to take a look at the predictions and losses for each sample, and the predicted values are not probabilities. How should I interpret this? I would like to know the 2nd and 3rd best predictions along with their predicted probabilities.

Here is what I have done:
I am doing multi-class text classification

learn_ls = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5, loss_func=LabelSmoothingCrossEntropy())
assert learn_ls.loss_func.reduction == 'mean'
learn_ls.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

To get the test predictions and loss:
y_pred_ls, y_true_ls, loss_ls = learn_ls.get_preds(ds_type=DatasetType.Valid, ordered = True, with_loss = True)


tensor([ 3.0519, -2.1997, -2.5680, -1.9759, -1.2035, 1.2873, -0.4322, -2.2431,
0.2241, 1.2692, 0.3663, -1.8390, -1.8057, 2.5918, -3.0455, -2.3456,
1.1350, -1.7231, -1.6235, -1.7825, -2.1809, -0.8144, 0.1560, -0.3662,
0.7369, -2.3453, -1.8301, -1.1423, -1.4672, 0.6629, 1.4636, -2.4634,
-2.5142, -1.8705, -2.3502, 0.3716, 1.3911, -2.3641, -2.1080, -1.9681,
-2.0923, -2.3841, -0.0498, -2.6077, 0.7605, 1.1574, -0.5264, -3.0365,
2.0393, -4.1978, -2.1945, 0.4962, -2.1258, 0.1349, -2.3988, -1.0581,
-2.5596, -0.4933, -3.8392, -1.1518, 0.9673, -0.9122, -2.9362, -2.2552,
-2.4577, -1.5672, -0.7736, 1.6386]) tensor(0) tensor(1.9229)

Here is my setup
=== Software ===
python : 3.7.3
fastai : 1.0.55
fastprogress : 0.1.21
torch : 1.0.0
torch cuda : 9.0 / is available
torch cudnn : 7005 / is enabled

=== Hardware ===
torch devices : 1
  • gpu0 : TITAN X (Pascal)

=== Environment ===
platform : Windows-10-10.0.15063-SP0
conda env : fastai_v1


Found my answer here

Most PyTorch loss functions have the final softmax included in them, while some of the newer loss functions do not, so only the raw scores are output, like above.

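To convert the scores into probabilities, apply softmax to them: exp(x_i) / \sum_j exp(x_j). No separate exp(x) step is needed, since softmax already exponentiates.
The probabilities of the top n categories can then be read off the softmax output. Note that exp(x)/(1+exp(x)) is the sigmoid, which scores each class independently and does not give probabilities that sum to 1 across classes.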
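As a sketch of turning raw scores into probabilities and reading off the top n classes, here is a plain-Python version. The helpers `softmax` and `top_n` are written for illustration; in PyTorch the corresponding calls would be `torch.softmax(x, dim=-1)` and `torch.topk`. The input is just the first six scores from the output shown earlier, used as toy data.

```python
import math

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_n(scores, n=3):
    """Return (class_index, probability) pairs for the n most likely classes."""
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    return [(k, probs[k]) for k in ranked[:n]]

# Toy input: the first six scores of the tensor printed above.
scores = [3.0519, -2.1997, -2.5680, -1.9759, -1.2035, 1.2873]
best = top_n(scores, n=3)
print(best)
```

The second and third entries of `best` are the 2nd and 3rd best predictions with their probabilities. With only six of the classes included, the numbers differ from what the full score vector would give.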

Not sure if I misunderstood label smoothing. Assume an 11-class problem.
After applying label smoothing with eps=0.1, I would expect the target probability to be 0.9 for the true label and 0.01 for each of the remaining classes.

Shouldn’t the formula just be

\begin{aligned} \left ( 1 - eps \right ) (-\log(p_{i})) + \sum_{j \neq i} \frac{eps}{N} (-\log(p_{j})) \end{aligned}

Where does the \frac{N-1}{N} factor in the first term come from?

I have also been looking into LS, and if the sum excludes i then it should look like this, i.e. with N-1, because one class is positive and we divide the rest evenly among the negative classes. Or am I wrong?

\begin{aligned} \left ( 1 - eps \right ) (-\log(p_{i})) + \sum_{j \neq i} \frac{eps}{N-1} (-\log(p_{j})) \end{aligned}

However, as mentioned, I guess if you tune the eps in similar but not identical implementations you can yield equivalently good results?

Another option would be to set every negative label to eps.

The labels in our formula don’t add up to 1: they add up to a bit less.

(1-eps) + (N-1)*eps/N = (N-eps)/N = 1-eps/N
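The totals for the three variants discussed in this thread can be compared with a short plain-Python sketch. The variant names and both functions are made up for illustration.

```python
def target_weights(n, eps, variant):
    """Weights on (correct class, each other class) for three variants from this thread."""
    if variant == "eps_over_n":          # 1-eps and eps/N, sum over j != i
        return 1 - eps, eps / n
    if variant == "eps_over_n_minus_1":  # 1-eps and eps/(N-1), sum over j != i
        return 1 - eps, eps / (n - 1)
    if variant == "sum_over_all":        # 1-(N-1)/N*eps and eps/N, i.e. the sum includes i
        return 1 - (n - 1) / n * eps, eps / n
    raise ValueError(variant)

def total_weight(n, eps, variant):
    """Total label mass: the correct-class weight plus N-1 copies of the other-class weight."""
    correct, other = target_weights(n, eps, variant)
    return correct + (n - 1) * other

for variant in ("eps_over_n", "eps_over_n_minus_1", "sum_over_all"):
    print(variant, total_weight(11, 0.1, variant))
```

Only the first variant falls short of 1, by eps/N; the other two sum to 1 up to floating-point rounding.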