For the Planet Amazon dataset (lesson 3), or for any multi-label classification problem, we use cross-entropy loss. There is a softmax layer at the end, so all the probabilities will be less than 1 and will sum to 1. But as we know, for multi-label classification the predicted probabilities need not sum to 1.

So can anyone please explain how we get good results with a loss function that will never allow the predictions to match the real targets?

Cross-entropy loss does not require the predicted probabilities to sum to 1. As long as the loss goes down when the predicted probabilities move closer to the target, gradient descent will train the weights. *What matters is the slope of the loss, not its value.* So the predictions never have to match the targets exactly, and in practice they almost never will.
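To make this concrete, here is a small plain-Python sketch with made-up numbers, assuming a per-class sigmoid with binary cross-entropy (the usual multi-label setup). The predictions never equal the 0/1 targets and do not sum to 1, yet the loss still decreases as they move toward the targets, which is all gradient descent needs:

```python
import math

def binary_cross_entropy(preds, targets):
    # Average per-class BCE; each pred is an independent sigmoid output,
    # so the preds are free to sum to anything.
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(preds, targets)) / len(preds)

targets = [1, 0, 1]           # multi-label target: classes 0 and 2 present
rough   = [0.6, 0.4, 0.55]    # early-training predictions (sum to 1.55, not 1)
better  = [0.9, 0.1, 0.85]    # later predictions, still never exactly 0 or 1

print(binary_cross_entropy(rough, targets))   # higher loss
print(binary_cross_entropy(better, targets))  # lower loss, despite preds != targets
```

The loss only reaches zero in the limit where every prediction is exactly 0 or 1, which a sigmoid never produces, so training simply keeps pushing predictions in the right direction.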

Does this answer your question?

Yup, got it.

Thanks a lot!