Generally, when training a multi-class classification model, we use cross-entropy as the loss function.

The inputs to cross-entropy are:

- the true label
- the model's prediction (the model output)

In my opinion, in terms of mathematical theory, the model's prediction should be converted to a probability distribution. So there must be a softmax layer at the end of the model.
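To make concrete what I mean by "converted to a probability distribution", here is a small sketch of softmax applied to raw model scores (the numbers are made up for illustration):

```python
import math

# Hypothetical raw scores (logits) from a model, for one sample and 3 classes
logits = [2.0, 0.5, -1.0]

# Softmax turns arbitrary real-valued scores into a probability distribution:
# every entry lies in (0, 1) and the entries sum to 1
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]
```

Only after this step does taking `-log(probs[true_class])`, i.e. cross-entropy, make sense to me mathematically.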

My questions are:

- Is my understanding correct that the output should be converted to a probability distribution?
- Must there be a softmax layer at the end of the model?

In terms of mathematical theory, can we train the model if there is no softmax layer?
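To show the two setups I am comparing, here is a sketch (made-up numbers). I am aware that some frameworks, e.g. PyTorch's `nn.CrossEntropyLoss`, fold the softmax into the loss and accept raw logits, and the two paths below compute the same value:

```python
import math

logits = [2.0, 0.5, -1.0]  # raw model output, no softmax layer
label = 0                  # index of the true class

# Path A: explicit softmax layer, then cross-entropy on the probabilities
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]
loss_with_softmax = -math.log(probs[label])

# Path B: no softmax layer; the loss consumes raw logits directly
# via the algebraically identical log-sum-exp form
loss_from_logits = math.log(sum(math.exp(z) for z in logits)) - logits[label]
```

So my question is really whether Path B is mathematically legitimate on its own, or whether it only works because it is secretly equivalent to Path A.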