Would someone please clarify my understanding of multi-class vs. multi-label classification problems?
Two cases…

1) A multi-class problem, where each training image is labelled as class A, B, or C. The loss function is CrossEntropyLoss. The output is three probabilities, one for each of classes A, B, and C. These probabilities necessarily sum to 1 because the final activations are exponentiated and normalized to sum to 1 (softmax).
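To make case 1) concrete, here is a minimal PyTorch sketch (the logit values and targets are made up for illustration):

```python
import torch
import torch.nn as nn

# Made-up final activations (logits) for a batch of 2 images, 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 0.3]])
targets = torch.tensor([0, 2])  # each image belongs to exactly one class

# CrossEntropyLoss applies log-softmax internally, so it takes raw logits
loss = nn.CrossEntropyLoss()(logits, targets)

# Predicted probabilities are softmax(logits): exponentiate, then normalize
probs = torch.softmax(logits, dim=1)
# Each row of probs sums to 1 by construction
```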

2) A multi-label problem where each training image is given exactly one tag: A, B, or C. (I understand that in this scenario an image could, in principle, be given several tags.) The loss function is BCEWithLogitsLoss (which includes the Sigmoid). The output is three probabilities, one that the input belongs to each of the three classes. These probabilities are obtained by taking the sigmoid of the final activations, so they do not have to sum to 1.
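And the corresponding sketch of case 2), again with made-up values. Note that the targets are multi-hot vectors, even though here each row happens to carry a single 1:

```python
import torch
import torch.nn as nn

# Same made-up logits as before: batch of 2 images, 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 0.3]])
# Multi-hot targets: exactly one tag per image in this scenario,
# but multiple 1s per row would also be valid in the multi-label setting
targets = torch.tensor([[1.0, 0.0, 0.0],
                        [0.0, 0.0, 1.0]])

# BCEWithLogitsLoss applies the sigmoid internally, so it takes raw logits
loss = nn.BCEWithLogitsLoss()(logits, targets)

# Per-class probabilities come from an element-wise sigmoid;
# the three values in a row need not sum to 1
probs = torch.sigmoid(logits)
```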
Suppose in 2) we normalize the three probabilities so they sum to one. The outputs then have the same interpretation as in 1). But do 1) and 2) amount to the same model mathematically, and would they converge to the same solution?
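To show what I mean by the normalization in 2), here is a quick check (with made-up logits) comparing normalized sigmoids against softmax probabilities for the same activations:

```python
import torch

# Made-up final activations for a single image, 3 classes
logits = torch.tensor([2.0, 0.5, -1.0])

# Case 1): softmax probabilities
softmax_probs = torch.softmax(logits, dim=0)

# Case 2): element-wise sigmoids, then normalized to sum to 1
sig = torch.sigmoid(logits)
normalized_sig = sig / sig.sum()

# Both vectors sum to 1, but they are generally different distributions
```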
If not, I would really like to understand why and how they differ.
I know this sounds theoretical, but it’s a practical issue that has come up for me while designing a model.
Thanks for any insights.