I am trying to train a decoder model whose output is a tensor with more than 4000 elements, one per class (4000+ classes).

If I understood correctly, F.nll_loss measures how far the probability at the index of the true label (the true label being marked as 1) is from 1, while cross entropy accounts for how far every probability deviates from its label, i.e. 1 for the true label and 0 for the rest.
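To check that understanding, here is a minimal sketch of how the two functions relate in PyTorch. The batch size, the exact class count and the random tensors are made up for illustration; they are not my real model or data. As far as I can tell from the PyTorch documentation, F.cross_entropy takes raw logits and applies log_softmax followed by nll_loss internally, while F.nll_loss expects log-probabilities as input.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_classes = 4000  # assumed; the real model has "4000+" classes
batch_size = 8      # hypothetical batch size

# Stand-ins for the decoder's raw outputs (logits) and the true class indices.
logits = torch.randn(batch_size, num_classes)
targets = torch.randint(0, num_classes, (batch_size,))

# F.cross_entropy works directly on raw logits.
ce = F.cross_entropy(logits, targets)

# F.nll_loss expects log-probabilities, so log_softmax is applied first.
log_probs = F.log_softmax(logits, dim=1)
nll = F.nll_loss(log_probs, targets)

# Written out by hand: the mean of -log p(true class) over the batch.
manual = -log_probs[torch.arange(batch_size), targets].mean()

print(ce.item(), nll.item(), manual.item())
```

In this sketch all three values agree up to floating-point error, so with index targets the choice looks like it mostly depends on whether the model's last layer already applies log_softmax.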

My objective is to reduce training time (number of epochs) without losing performance. My question is: which of F.nll_loss and cross entropy is the more appropriate loss function in my case?