In the Bag of Tricks paper they justify label smoothing via the distributional properties of the softmax output: they compare the gap between the predicted probability of the true class and the rest of the classes, which is why I was curious.
I think label smoothing addresses a problem that arises from softmax cross-entropy. As the paper puts it, "the optimal solution is z*_y = inf while keeping others small enough. In other words, it encourages the output scores dramatically distinctive which potentially leads to overfitting." With smoothing ε, the optimal gap between the true-class logit and the rest becomes finite (log((K-1)(1-ε)/ε) in the paper's derivation), so the scores are no longer pushed apart without bound.
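For concreteness, here is a minimal sketch of label-smoothed cross-entropy following that convention (1 − ε on the true class, ε/(K−1) on each other class); the function name and the ε = 0.1 default are my own choices, not from the paper:

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, target, eps=0.1):
    # Smoothed targets: q_i = 1 - eps on the true class,
    # eps / (K - 1) spread over the remaining classes.
    k = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    logp_true = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    logp_rest = log_probs.sum(dim=-1) - logp_true
    return -((1 - eps) * logp_true + eps / (k - 1) * logp_rest).mean()

logits = torch.randn(4, 10)            # batch of 4, K = 10 classes
target = torch.randint(0, 10, (4,))
print(smoothed_cross_entropy(logits, target))
```

Note that recent PyTorch versions also have a built-in `label_smoothing` argument on `F.cross_entropy`, though that one spreads ε/K over all classes (including the true one) rather than ε/(K-1) over the others.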
Sigmoid might not have this issue to begin with, which is the case for multi-label classification, but the idea can still be applied to segmentation or single-shot detectors, which use softmax.
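For the multi-label case, the per-class-sigmoid setup looks roughly like this (a sketch; the tensor shapes are my own example), where each class gets an independent binary decision instead of competing through a shared softmax normalizer:

```python
import torch
import torch.nn.functional as F

# Multi-label: one independent logit per class, binary targets.
logits = torch.randn(4, 10)                     # batch of 4, 10 classes
targets = torch.randint(0, 2, (4, 10)).float()  # each class on/off independently
loss = F.binary_cross_entropy_with_logits(logits, targets)
```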