Generally, when training a multi-class classification model, we use cross-entropy as the loss function.

The inputs to cross-entropy are:

- the true label
- the model's prediction (the model output)

In my opinion, in terms of mathematical theory, the model's prediction should be converted to a probability distribution. So there must be a softmax layer at the end of the model.
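To make concrete what I mean by "converted to a probability distribution", here is a small sketch of softmax applied to raw model scores (the numbers are made up for illustration):

```python
import math

# Hypothetical raw scores (logits) from a model, for one sample and 3 classes
logits = [2.0, 0.5, -1.0]

# Softmax turns arbitrary real-valued scores into a probability distribution:
# every entry lies in (0, 1) and the entries sum to 1
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]
```

Only after this step does taking `-log(probs[true_class])`, i.e. cross-entropy, make sense to me mathematically.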

My questions are:

- Is my understanding correct that the output should be converted to a probability distribution?
- Must there be a softmax layer at the end of the model?

In terms of mathematical theory, can we train the model if there is no softmax layer?
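To show the two setups I am comparing, here is a sketch (made-up numbers). I am aware that some frameworks, e.g. PyTorch's `nn.CrossEntropyLoss`, fold the softmax into the loss and accept raw logits, and the two paths below compute the same value:

```python
import math

logits = [2.0, 0.5, -1.0]  # raw model output, no softmax layer
label = 0                  # index of the true class

# Path A: explicit softmax layer, then cross-entropy on the probabilities
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]
loss_with_softmax = -math.log(probs[label])

# Path B: no softmax layer; the loss consumes raw logits directly
# via the algebraically identical log-sum-exp form
loss_from_logits = math.log(sum(math.exp(z) for z in logits)) - logits[label]
```

So my question is really whether Path B is mathematically legitimate on its own, or whether it only works because it is secretly equivalent to Path A.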