Thanks so much for your response!
It seems, based on this post, that it was possible to try label smoothing with multi-label targets:
However, he was saying the smoothed labels did not add up to 1, which seems important for matching them to the predicted probabilities.
Would it make sense to set the labels to \frac{1-\epsilon}{n} for those labeled 1 and \frac{\epsilon}{N-n} for those labeled 0, where n is the number of positive labels per data point and N is the total number of classes? That way the labels for each data point still sum to 1.
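To make the question concrete, here is a minimal sketch of that target construction in PyTorch (just an assumption of how I'd write it; the function name and the default epsilon are placeholders, and it assumes a multi-hot target tensor over N classes):

```python
import torch

def smooth_multilabel_targets(targets: torch.Tensor, epsilon: float = 0.1) -> torch.Tensor:
    """Spread the label mass so each row still sums to 1.

    targets: (batch, N) multi-hot tensor (1 for positive labels, 0 otherwise).
    Positives get (1 - epsilon) / n each, negatives get epsilon / (N - n) each,
    where n is the number of positives in that row.
    Assumes every row has at least one positive and one negative label.
    """
    targets = targets.float()
    N = targets.size(1)
    n = targets.sum(dim=1, keepdim=True)           # positives per data point
    pos_value = (1.0 - epsilon) / n                # shared among the n positives
    neg_value = epsilon / (N - n)                  # shared among the N - n negatives
    return targets * pos_value + (1.0 - targets) * neg_value
```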
In terms of the loss for each data point with n positive labels (multi-hot encoded), it would be:
\sum_{i \in \text{pos}} \frac{1-\epsilon}{n}(-\log(p_i)) + \sum_{j \in \text{neg}} \frac{\epsilon}{N-n}(-\log(p_j))
where the first sum runs over the n positive labels i and the second over the N-n negative labels j.
Does this seem correct?
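And a sketch of how I imagine that loss in code, again only an assumption on my part (the function name is made up), using a softmax so that the p_k sum to 1:

```python
import torch
import torch.nn.functional as F

def smoothed_multilabel_ce(logits: torch.Tensor, targets: torch.Tensor, epsilon: float = 0.1) -> torch.Tensor:
    """Cross-entropy -sum_k y_k * log(p_k) against the smoothed multi-hot targets, averaged over the batch."""
    targets = targets.float()
    N = targets.size(1)
    n = targets.sum(dim=1, keepdim=True)                      # positive labels per data point
    smoothed = (targets * (1.0 - epsilon) / n
                + (1.0 - targets) * epsilon / (N - n))        # each row sums to 1
    log_p = F.log_softmax(logits, dim=1)                      # log(p_k), with softmax probabilities
    return -(smoothed * log_p).sum(dim=1).mean()

# e.g. a batch of 4 data points with N = 10 classes and two positives each
logits = torch.randn(4, 10)
targets = torch.zeros(4, 10).scatter_(1, torch.tensor([[0, 3], [1, 2], [5, 7], [4, 9]]), 1.0)
loss = smoothed_multilabel_ce(logits, targets)
```

One assumption here is a softmax over all N classes rather than per-class sigmoids; with sigmoids the targets wouldn't need to sum to 1 in the first place, so I'm not sure the smoothing scheme above would apply the same way.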