Can label smoothing be used for multi-label images?

ilovescience · May 26, 2019, 2:39am

I am a student who finished Part 1 and am interested in applying label smoothing to a problem. I saw that it was taught in Part 2, so I figured this would be a good place to ask my question.

I was wondering whether label smoothing can be applied to multi-label problems. Also, most descriptions of label smoothing I have read online talk about replacing the labels with smoothed labels, but in practice that is done inside the loss function, correct?

jeremy · May 26, 2019, 7:32pm

See discussion in this thread: https://forums.fast.ai/t/is-label-smoothing-off-by-eps-n/44290

ilovescience · May 26, 2019, 11:14pm

Thanks so much for your response!

It seems based on this post it was possible to try label smoothing with multi-label:

However, he said the smoothed labels did not add up to 1, which seems important if they are to match up with the predicted probabilities.

Would it make sense to set the labels to $\frac{1-\epsilon}{n}$ for those labeled 1 and $\frac{\epsilon}{N-n}$ for those labeled 0, where $n$ is the number of positive labels per data point and $N$ is the total number of labels?

In terms of loss, for each data point with $n$ positive labels in its multi-hot encoding, it would be:

$$\sum_i \frac{1-\frac{N-n}{N}\epsilon}{n}(-\log p_i) + \sum_j \frac{\epsilon}{N}(-\log p_j)$$

where $i$ runs over the positive labels and $j$ over the remaining (negative) labels.

Does this seem correct?
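
One way to sanity-check the weights in the loss above is to build them explicitly and confirm that each row sums to 1. The following is only a minimal sketch in plain PyTorch: `smooth_multilabel_targets` is a hypothetical helper name (not part of fastai), and it assumes multi-hot targets of shape (batch, N) with at least one positive per row. Note that the weights appearing in the loss are not quite the same as the $\frac{1-\epsilon}{n}$ and $\frac{\epsilon}{N-n}$ labels proposed just before it, although both choices sum to 1.

```python
import torch

def smooth_multilabel_targets(y: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Hypothetical helper: turn multi-hot rows into soft targets that sum to 1.

    Positives get (1 - eps*(N-n)/N)/n and negatives get eps/N, i.e. the
    per-label weights in the loss above. Assumes >= 1 positive per row.
    """
    y = y.float()
    N = y.shape[-1]                           # total number of labels
    n = y.sum(dim=-1, keepdim=True)           # positives per row, shape (batch, 1)
    pos = (1 - eps * (N - n) / N) / n         # weight on each positive label
    neg = torch.full_like(pos, eps / N)       # weight on each negative label
    return torch.where(y > 0, pos, neg)       # broadcasts to (batch, N)

y = torch.tensor([[1, 0, 1, 0, 0],
                  [0, 1, 0, 0, 0]])
soft = smooth_multilabel_targets(y, eps=0.1)
row_sums = soft.sum(dim=-1)                   # each entry should be 1
```

With a single positive label (n = 1) the positive weight becomes 1 - ε + ε/N, which is the usual single-label smoothing target.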

ilovescience · June 4, 2019, 2:24am

Coming back to this, I realized I didn’t simplify it the way it was done for the regular single-label case. Here is the simplified version:
$$\begin{aligned} (1-\epsilon)\sum_i \left(-\frac{\log p_i}{n}\right) + \frac{\epsilon}{N} \sum_{k=1}^{N} (-\log p_k) \end{aligned}$$

where $i$ runs over the positive labels and the last term is the full cross-entropy sum over all $N$ labels.
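
As a rough check that the simplification holds, the sketch below computes both the simplified form and the unsimplified per-label-weight form and compares them numerically. Everything here is an assumption for illustration: the function names are hypothetical, the probabilities come from a softmax over the N labels, and every row is assumed to have at least one positive.

```python
import torch
import torch.nn.functional as F

def multilabel_smoothing_loss(logits, y, eps=0.1):
    """Simplified form: (1-eps) * mean(-log p_i over positives)
    + eps/N * sum(-log p_k over all labels), averaged over the batch."""
    y = y.float()
    log_p = F.log_softmax(logits, dim=-1)
    N = y.shape[-1]
    n = y.sum(dim=-1)                              # positives per row
    pos_term = -(log_p * y).sum(dim=-1) / n        # average over positive labels
    all_term = -log_p.sum(dim=-1) / N              # average over every label
    return ((1 - eps) * pos_term + eps * all_term).mean()

def explicit_loss(logits, y, eps=0.1):
    """Unsimplified form with explicit per-label weights, for comparison."""
    y = y.float()
    log_p = F.log_softmax(logits, dim=-1)
    N = y.shape[-1]
    n = y.sum(dim=-1, keepdim=True)
    pos_w = (1 - eps * (N - n) / N) / n            # weight on positives
    neg_w = torch.full_like(pos_w, eps / N)        # weight on negatives
    w = torch.where(y > 0, pos_w, neg_w)
    return -(w * log_p).sum(dim=-1).mean()

torch.manual_seed(0)
logits = torch.randn(4, 6)
y = torch.tensor([[1, 0, 1, 0, 0, 0],
                  [0, 1, 0, 0, 0, 0],
                  [1, 1, 1, 0, 0, 0],
                  [0, 0, 0, 0, 1, 1]])
simplified = multilabel_smoothing_loss(logits, y)
explicit = explicit_loss(logits, y)
```

The two values should agree up to floating-point error, since a positive label's weight (1-ε)/n + ε/N equals (1 - ε(N-n)/N)/n.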

I am unsure how to implement this. In the notebook I see `loss = reduce_loss(-log_preds.sum(dim=-1), self.reduction)` and also `nll = F.nll_loss(log_preds, target, reduction=self.reduction)`. The output seems to be `lin_comb(loss/c, nll, self.ε)`, so that would be `self.ε * loss/c + (1 - self.ε) * nll`.

Is `nll` the full cross-entropy term? If so, shouldn’t it be multiplied by `self.ε` instead of `(1 - self.ε)`?
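
One way to see which term is which is to compare the combination against cross-entropy with explicitly smoothed one-hot targets. The sketch below re-creates the two quantities in plain PyTorch for the single-label case. It does not use the notebook's `reduce_loss` or `lin_comb` helpers, and it assumes mean reduction and that `lin_comb(a, b, ε)` means `ε*a + (1-ε)*b`, as read from the lines quoted above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
eps, c = 0.1, 5                                   # smoothing factor, number of classes
logits = torch.randn(8, c)
target = torch.randint(0, c, (8,))
log_preds = F.log_softmax(logits, dim=-1)

# Sum of -log p over ALL classes, averaged over the batch ("loss" above).
loss = (-log_preds.sum(dim=-1)).mean()
# -log p of the TARGET class only, averaged over the batch ("nll" above).
nll = F.nll_loss(log_preds, target, reduction='mean')
combined = eps * loss / c + (1 - eps) * nll

# Reference: cross-entropy against smoothed one-hot targets (1-eps)*onehot + eps/c.
soft = F.one_hot(target, c).float() * (1 - eps) + eps / c
reference = -(soft * log_preds).sum(dim=-1).mean()
```

If `combined` and `reference` agree, then `nll` is the target-class term and carries the (1 - ε) weight, while the sum over all classes is the part scaled by ε/c.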

zache · September 2, 2019, 7:35pm

Did you ever manage to implement label smoothing on multi-label? Would love to see it if so

ilovescience · September 2, 2019, 7:50pm

Unfortunately not. I wasn’t able to get the math to work out as intuitively as it did for single-label, and I think I also ran into some problems during empirical tests.

However, since a couple of other people have also asked about this, I might look into it again soon.

zache · September 2, 2019, 9:32pm

Ah alright – I might do the same then. Let me know if you figure anything out!

vferrer · April 24, 2020, 9:21am

Did you manage to implement label smoothing for multi-label images?

Gusto · January 19, 2021, 10:03am

Hi,

I don’t understand why the labels should sum to one in a multi-label scenario, since they usually do not sum to one anyway. Further, can’t we simply modify the labels by a smoothing factor and use the new targets in a BCEWithLogitsLoss?

For example, assuming that the labels do not have to add up to one, we can modify the labels such that the 1’s become 1-e and the 0’s become 0+e. I implemented this logic in the following Callback (in my scenario I worked with a Callback rather than a new loss function because I wanted to compare the effect of label smoothing against my plain BCEWithLogitsLoss on the validation set):

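from fastai.basics import *  # assumed to provide Callback, Learner, torch and nn

class MultiLabelSmoothingCallback(Callback):
    "Recompute the loss against smoothed targets just before the backward pass."
    def __init__(self, learn:Learner, ε:float=0.05):
        super().__init__()
        self.ε = ε

    def before_backward(self, **kwargs):
        # Assumes hard 0/1 targets: 1 -> 1-ε and 0 -> ε, done in one vectorized step.
        labels = self.yb[0].detach().float()
        labels = labels * (1 - 2*self.ε) + self.ε
        smoothed_loss = nn.BCEWithLogitsLoss()(self.pred, labels)
        # Returning a dict is the fastai v1 convention; under fastai v2 the return
        # value may be ignored, so the loss might need to be assigned on the learner.
        return {'last_loss': smoothed_loss}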

WDYT?
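
For comparison, the same smoothing can also be written as a standalone loss module that does not depend on any fastai callback API. This is only a sketch: `SmoothedBCEWithLogitsLoss` is a hypothetical name, and it assumes hard 0/1 targets with the same shape as the logits.

```python
import torch
import torch.nn as nn

class SmoothedBCEWithLogitsLoss(nn.Module):
    """Hypothetical wrapper: BCEWithLogitsLoss on targets squeezed into [eps, 1-eps]."""
    def __init__(self, eps: float = 0.05):
        super().__init__()
        self.eps = eps
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # For hard 0/1 targets this maps 1 -> 1-eps and 0 -> eps.
        targets = targets.float() * (1 - 2 * self.eps) + self.eps
        return self.bce(logits, targets)

logits = torch.tensor([[2.0, -1.0, 0.5]])
hard = torch.tensor([[1.0, 0.0, 1.0]])
smoothed_loss = SmoothedBCEWithLogitsLoss(eps=0.05)(logits, hard)

# Same thing with the smoothed targets written out by hand.
manual = nn.BCEWithLogitsLoss()(logits, torch.tensor([[0.95, 0.05, 0.95]]))
```

Because each label is an independent binary problem under BCE, nothing here requires the smoothed targets of a row to sum to one.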