Why use exponents in cross entropy loss?

We could replace them with absolute values and still get the required probabilities, just less confident ones. Why do we artificially sharpen the probabilities instead of simply continuing to penalize the NN for non-confident predictions? As I understand it, that way we don't motivate the NN to do its job properly.

This question is related to Why not use absolute value for softmax?

Hello. In general, the final-layer activation function should be considered in combination with the loss function we use. For a multiclass classification task, negative log likelihood is a good choice: we want to maximize the probability that y_{pred} = i, where i is the target class, which means minimizing:
-\log P(y = i \mid x)

Substituting the softmax distribution over the logits z and cancelling out log and exp yields:
-z_i + \log \sum_j \exp(z_j)

So a high value for z_i (the correct class) decreases the loss, while a high value for any z_{j!=i} increases it. Since \sum_j \exp(z_j) is dominated by the highest logit, the second term is approximately max(z). As a result, confident correct predictions contribute little to the total loss, but confident wrong predictions are penalized heavily.
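Here is a minimal sketch (plain PyTorch, with made-up logits just for illustration) that checks the algebra above and shows both effects:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([[5.0, -1.0, 0.5]])        # logits for one sample, 3 classes

# -z_i + logsumexp(z) is the same thing as the built-in cross entropy
manual = -z[0, 0] + torch.logsumexp(z, dim=1)
builtin = F.cross_entropy(z, torch.tensor([0]))
print(manual.item(), builtin.item())        # both ~0.014: confident and correct -> tiny loss

# logsumexp is dominated by the largest logit, so it is roughly max(z)
print(torch.logsumexp(z, dim=1).item())     # ~5.01, close to max(z) = 5

# a confident but wrong prediction (true class is 1 here) is penalized heavily
print(F.cross_entropy(z, torch.tensor([1])).item())   # ~6.01
```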
If you replace exp() with abs() you lose this synergy, and negative and positive logits also become indistinguishable, which seems like a bad idea in this case :wink:
That said, overconfident predictions may cause problems in some cases; to avoid those, one can use a label-smoothed CE loss (see the sketch at the end of this post).
Also, it's always good to try out your ideas in practice and see what you get.
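For the label-smoothing idea, here is a small sketch, assuming a recent PyTorch where `CrossEntropyLoss` accepts a `label_smoothing` argument (the logits are made up):

```python
import torch
import torch.nn as nn

logits = torch.tensor([[4.0, -2.0, 0.0]])   # a very confident prediction for class 0
target = torch.tensor([0])

plain = nn.CrossEntropyLoss()
smoothed = nn.CrossEntropyLoss(label_smoothing=0.1)
print(plain(logits, target).item())     # close to 0 for a confident correct answer
print(smoothed(logits, target).item())  # smoothing keeps the loss from being driven all the way to 0
```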

Hi Sergey,

To add to arampacha’s fine answer, the post you linked (I think) confuses and mixes together MSELoss and CrossEntropyLoss. The latter uses no squaring at all, but does contain exponentiation and softmax. The former finds the mean of squares of the differences between predictions and targets, and uses no softmax.

There is a relative of MSELoss, called L1Loss, that does use the absolute values of the differences instead of the squares. Both MSELoss and L1Loss are used for regression.
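For concreteness, here is a tiny sketch (plain PyTorch, illustrative numbers) showing the two regression losses side by side:

```python
import torch
import torch.nn as nn

preds = torch.tensor([2.5, 0.0, 1.0])
targets = torch.tensor([3.0, -0.5, 2.0])

mse = nn.MSELoss()   # mean of squared differences
l1 = nn.L1Loss()     # mean of absolute differences
print(mse(preds, targets).item())   # 0.5
print(l1(preds, targets).item())    # ~0.667
```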

CrossEntropyLoss is used for classification problems (not regression). It is based on a mathematical argument from information theory that it is the “best” loss function to use for classification. Yes, the math is advanced, and I personally do not understand it! But I trust that mathematicians and ML engineers smarter than me have already figured this out and have used CrossEntropyLoss successfully for many years.

You can design any loss function as long as it rewards correct predictions relative to wrong ones (and is differentiable). It may not train as fast as the standard ones and may not achieve the same accuracy, but it will work.
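As a toy illustration of that point, here is a hand-rolled differentiable classification loss, a Brier-style squared error on the softmax probabilities (just an example, not something from this thread). It is not CrossEntropyLoss, but it still rewards correct predictions over wrong ones and trains with autograd:

```python
import torch
import torch.nn.functional as F

def brier_style_loss(logits, targets):
    # squared error between softmax probabilities and one-hot targets
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes=logits.shape[1]).float()
    return ((probs - one_hot) ** 2).sum(dim=1).mean()

logits = torch.randn(4, 3, requires_grad=True)
targets = torch.tensor([0, 2, 1, 0])
loss = brier_style_loss(logits, targets)
loss.backward()   # gradients flow, so this loss can be used for training
print(loss.item())
```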

Wikipedia has extensive articles on “cross entropy” and “Loss functions for classification” that explain the reasoning and math in great detail. Certainly more detail than I can personally understand and make use of! :slightly_smiling_face:
