Cross-Entropy Loss question

I am working on learning Neural Networks, and I am a bit unclear on the benefits of cross-entropy loss function for multi-class image classification. I am hoping someone can help point me in the right direction. I am going to outline my thought process and what I think I know.

A simple way to measure loss is to take the difference between the prediction (passed through a sigmoid so all values are between 0 and 1) and my y_truth. In a single-class classifier (taken from the new fastai book by Sylvain and Jeremy on github.com/fastai/fastbook) it looks like this:

def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    # where target is 1, loss is (1 - prediction); where it is 0, loss is the prediction
    return torch.where(targets==1, 1-predictions, predictions).mean()

Why should I use cross-entropy rather than extending this? If I take predictions that are all between 0 and 1, why not just take the difference?

For example if we have 2 images we are classifying into 3 classes, we may have this:

pred = tensor([[.2,.8,.4],[.3,.1,.4]])
target = tensor([[0,1,0],[0,0,1]])

The more confident it is about a wrong class, the more it adds to the loss; the less confident it is about a correct class, the more that adds to the loss as well. We could let one image have multiple classes by having multiple columns be ‘1’ in the same row of the target. That seems to do what we want, and the more we minimize it, the more closely the model matches the targets.
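To make this concrete, here is a minimal sketch of that naive "difference" loss applied to the two-image example above (treating pred as having already gone through the sigmoid, so each value is in [0, 1]):

```python
import torch

pred = torch.tensor([[.2, .8, .4], [.3, .1, .4]])
target = torch.tensor([[0., 1., 0.], [0., 0., 1.]])

# Where the target is 1, loss is (1 - pred); where it is 0, loss is pred,
# exactly as in mnist_loss but over multiple columns per row.
naive_loss = torch.where(target == 1, 1 - pred, pred).mean()
print(naive_loss)  # mean of [.2, .2, .4, .3, .1, .6] -> tensor(0.3000)
```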

My understanding is that cross-entropy does something similar, but rather than only squashing each prediction between 0 and 1, it turns each prediction into a probability for that class (so they all sum to 1). To me, this seems like it’s doing roughly the same thing, just with an additional conversion. I don’t really understand why this would make it easier or faster for the model to train.
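That "additional conversion" is the softmax. A quick sketch of what it does to the raw activations from the example above (exponentiate each value, then divide by the row sum so each row becomes a probability distribution):

```python
import torch

pred = torch.tensor([[.2, .8, .4], [.3, .1, .4]])  # raw activations (logits)

probs = torch.softmax(pred, dim=1)  # exp each value, normalize per row
print(probs)
print(probs.sum(dim=1))  # each row now sums to 1
```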

Can anyone point me in the right direction for where I need to either expand or correct my understanding of this? My goal is to try to build an understanding of loss functions so that I can understand when and how they should be changed for specific problems.

I think you are confusing softmax and binomial loss. If the last layer has a softmax activation, your predictions will be forced to have a single large number for only one class. But if you use a binomial loss, i.e. a sigmoid activation for each of the output activations, you will have an array of floats with a high value for each of the classes present in the image. You can then threshold this array, or feed it to a binary cross-entropy loss to compute the loss. The targets for multi-label classification would have a ‘1’ at multiple places, corresponding to the classes in the image.


Thank you for answering. I have been looking, and your response helped a ton. I was confusing a few things for sure. I am going to summarize what I think I learned.

In addition to what SamJoel said, these 2 quotes from github.com/fastai/fastbook proved very helpful to me. I had read them before, but they only made sense after reading SamJoel’s answer.

CH5

When we first take the softmax, and then the log likelihood of that, that combination is called cross-entropy loss. In PyTorch, this is available as nn.CrossEntropyLoss (which, in practice, actually does log_softmax and then nll_loss):
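That equivalence is easy to check directly. A minimal sketch using the functional API on the example activations from earlier (targets are class indices here, not one-hot vectors, which is what cross_entropy and nll_loss expect):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[.2, .8, .4], [.3, .1, .4]])
targets = torch.tensor([1, 2])  # class indices, not one-hot

loss_a = F.cross_entropy(logits, targets)
loss_b = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.isclose(loss_a, loss_b))  # the two compute the same value
```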

CH6

Note that because we have a one-hot-encoded dependent variable, we can’t directly use nll_loss or softmax (and therefore we can’t use cross_entropy):

  • softmax, as we saw, requires that all predictions sum to 1, and tends to push one activation to be much larger than the others (due to the use of exp); however, we may well have multiple objects that we’re confident appear in an image, so restricting the maximum sum of activations to 1 is not a good idea. By the same reasoning, we may want the sum to be less than 1, if we don’t think any of the categories appear in an image.

So basically the reason softmax (i.e. cross-entropy) is better for gradient descent is that additional information is baked into the loss function: if one class’s probability increases, one or more of the other classes’ probabilities must decrease. This assumption makes sense in a single-label setting, and it makes the space you need to search smaller.

This is why, for multi-label problems, cross-entropy should be replaced with binary cross-entropy. Softmax encodes the assumption that there is exactly one label per image, since the probabilities all sum to 1. If that is not true, softmax should not be used. This is why binary cross-entropy is often used instead, with a sigmoid in place of the softmax for exactly this reason.
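A minimal sketch of the multi-label setup in PyTorch, using BCEWithLogitsLoss (which applies the sigmoid internally), with illustrative made-up logits and multi-hot targets:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, -1.0, 0.5], [-0.5, 1.5, 1.0]])
# Multi-label targets: more than one '1' per row is allowed.
targets = torch.tensor([[1., 0., 1.], [0., 1., 1.]])

# One independent binary decision per class; sigmoid is applied inside.
loss = nn.BCEWithLogitsLoss()(logits, targets)

# Per-class probabilities need not sum to 1 across a row.
probs = torch.sigmoid(logits)
print(loss, probs.sum(dim=1))
```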

Hi Ezno. I am not a real mathematician, so experts, feel free to make corrections. As I understand it, many different functions can serve as a loss function. It only needs to penalize wrong guesses, reward correct ones, and be mostly differentiable. So the function you posted above would work fine.

See this article for examples of standard loss functions:

https://en.wikipedia.org/wiki/Loss_functions_for_classification

Each loss function has its own justifications and characteristics, for example, how sharply it separates classes and how much it penalizes outliers. I’ve used hinge loss and found it trained to higher accuracy for one problem.

Cross entropy loss is the loss function most often used in machine learning. Its strength is that it measures the divergence between an unknown probability distribution and a predicted distribution. Here I am way out of my depth mathematically. But if you want to jump down the math rabbit hole…

https://en.wikipedia.org/wiki/Cross_entropy

and here’s a more informal explanation:

https://www.quora.com/When-should-you-use-cross-entropy-loss-and-why

To sum up, cross entropy has information-theoretic arguments for it being a generally good loss function for classification tasks. But there are many other choices for classification loss functions, each with its own strengths that can be relevant to your particular machine learning problem.


You may also want to look into label smoothing for cross-entropy. The idea is to use soft targets: for example, with 3 classes, your label might be [.98, .01, .01] instead of [1, 0, 0]. Doing that penalizes misclassification errors less harshly. It is useful when you have mistakes in your labels, for example, and it tends to facilitate learning.
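PyTorch exposes this directly via the label_smoothing argument of cross_entropy (the smoothed target mixes the one-hot vector with a uniform distribution over classes; the exact soft values therefore differ slightly from the [.98, .01, .01] example above). A quick sketch:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[3.0, 0.5, -1.0]])  # model fairly sure of class 0
target = torch.tensor([0])

hard = F.cross_entropy(logits, target)
soft = F.cross_entropy(logits, target, label_smoothing=0.1)

# With smoothing, a confident correct prediction is penalized slightly,
# so the model is discouraged from becoming overconfident.
print(hard, soft)
```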


True, and it usually boosts accuracy by a few percentage points. However, the caveat is that the model’s predictions are a little less confident (speaking purely from experience). This might be a good thing if you’re training on something objective (like dog vs. cat), but you might want a more opinionated (biased) model if you’re training on something subjective (pretty colors vs. not-so-pretty colors).

Re. Softmax vs. Sigmoid:

As mentioned above, softmax really wants to choose between the classes, and will always give a maximum probability to one class, whereas sigmoid “neurons” may be activated for multiple classes. If your labels are mutually exclusive, you probably want to go with softmax.
Sigmoid is said to be a good choice when your trained model might be asked to predict on unseen data, in which case it can simply not activate for any label. Does that make sense?
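A small sketch of that difference, using made-up logits where the model is unsure about everything (all activations negative): softmax still hands most of the probability to one class, while the per-class sigmoids can all stay below 0.5, signalling "none of the above".

```python
import torch

logits = torch.tensor([[-2.0, -1.0, -3.0]])  # model unsure about every class

print(torch.softmax(logits, dim=1))  # rows forced to sum to 1; one class still "wins"
print(torch.sigmoid(logits))         # all below 0.5: no class confidently predicted
```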

In my experience, CrossEntropyLoss (softmax) trains much better than BinaryCrossEntropy (sigmoid) loss, to the extent that I formulate problems intuitively suited for binary (sigmoid) classification as softmax problems (adding an NA class where I don’t want it to predict anything).
