A bit hair-splitting, but I think they are the same: nll(log(softmax())) == nll(log_softmax()) == F.cross_entropy().
Your version == Jeremy's version == the PyTorch version.
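A quick numerical check of that equivalence (a sketch with made-up logits and targets, not from the notebook):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # hypothetical batch of 4, 10 classes
targets = torch.tensor([1, 5, 0, 7])   # hypothetical class labels

a = F.nll_loss(torch.log(F.softmax(logits, dim=1)), targets)  # nll(log(softmax()))
b = F.nll_loss(F.log_softmax(logits, dim=1), targets)         # nll(log_softmax())
c = F.cross_entropy(logits, targets)                          # the PyTorch one-liner

print(torch.allclose(a, b), torch.allclose(b, c))
```

All three should agree to within floating-point tolerance, which is why `F.nll_loss` expects *log*-probabilities rather than raw probabilities: the log is already baked into its expected input.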

Yes, I find it a little confusing. In the code there is only ever ONE log operation. So when you say you are applying the "negative log-likelihood" function after applying the "log softmax" function, it sounds like there will be TWO log operations. The log, however, lives only in the "log softmax".

On the other hand, in this article, for example, the negative log-likelihood function is applied to the softmax, not the “log softmax”. So here the log is in the “negative log-likelihood”.

Oops, after working through a little more of the notebook, it appears that the log has to be bundled with the softmax in order to reduce the number of exp operations. That is,
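I believe the identity involved is the following (a sketch of the algebra, assuming logits $x$ and the standard softmax over index $j$):

```
\log(\mathrm{softmax}(x))_i
  = \log\frac{e^{x_i}}{\sum_j e^{x_j}}
  = x_i - \log\sum_j e^{x_j}
```

The log cancels the exp in the numerator, so only the sum in the denominator needs exponentiation; computing softmax first and taking the log afterwards would exponentiate everything and then undo it. (In practice this fused form is also more numerically stable, since it avoids taking the log of very small probabilities.)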