Why not use Binary Cross Entropy for multi-class problems?

From my understanding, normal cross entropy cannot be used for multi-label problems, because softmax measures probabilities relative to one another: raising the probability of one class necessarily lowers the probability of the others, so the outputs always sum to 1.

(if you were confused like me at first: multi-class means multiple classes with exactly one correct answer, whereas multi-label means multiple classes with possibly several correct answers)
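To illustrate what I mean by “relative to one another”, here is a tiny sketch of my own (just illustrative, not from fastai): softmax squashes the raw outputs into probabilities that sum to 1, whereas a per-class sigmoid scores each class independently, which is what a multi-label problem needs.

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # raw model outputs for 4 classes

probs_softmax = F.softmax(logits, dim=0)  # relative: the 4 values sum to 1
probs_sigmoid = torch.sigmoid(logits)     # independent: each value is in (0, 1) on its own

print(probs_softmax, probs_softmax.sum())  # sum is exactly 1
print(probs_sigmoid, probs_sigmoid.sum())  # sum can be anything, here > 1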

To be sure, here’s a (not working) multi-label learner:

from fastai.vision.all import *
from random import randint
vocab = ['this', 'is', 'a', 'label'] # have some random classes just so we're able to create a DataLoader
path = untar_data(URLs.BIWI_HEAD_POSE)
dls = DataBlock(
    blocks=(ImageBlock, MultiCategoryBlock), 
    get_y=lambda x: [vocab[randint(0, 3)]],  # a multi-label target is a list of labels (here just one, picked at random)
    get_items=get_image_files
).dataloaders(path)
learn = vision_learner(dls, resnet18)
print(learn.loss_func)

FlattenedLoss of BCEWithLogitsLoss()

If we instead use a CategoryBlock, then the output is

FlattenedLoss of CrossEntropyLoss()

Since fast.ai is choosing it, I guess it’s safe to assume that this is the best overall loss for this type of problem.
Admittedly, I haven’t been able to test the performance of a multi-label learner with BCE yet (the PASCAL_2012 download is taking quite a while). It also doesn’t help that whenever I Google the uses of BCE, multi-label is not mentioned:

Binary Cross-Entropy is widely used in training neural networks for binary classification problems

And the one Medium article I found on it is paywalled…

If BCE works when multiple classes are correct, why shouldn’t it also work when only one is? Or does it work okay in this case, but (normal?) cross entropy just happens to be better? And if so, why?

For context, I’m beginning part 7 of the course, but I’m dragging lessons 5 and 6 along as I try to understand loss functions. Please feel free to correct any misconceptions.

Hello,

Your concerns motivated me to look into PyTorch’s documentation for those loss functions. It’s 3:00 a.m. in my time zone, so being a little tired I asked GPT-4 to help me put this together into one text, so here it is:

" The post on the fast.ai forum raises several important points regarding the use of different loss functions in classification tasks. Let’s break down these points:

  1. Multi-Class vs. Multi-Label Classification:
  • In multi-class classification, each example belongs to exactly one of multiple classes. CrossEntropyLoss is commonly used here because the softmax function (applied in CrossEntropyLoss) normalizes outputs such that the sum of probabilities across all classes equals 1. This means increasing the probability of one class decreases the probability of others, which makes sense when only one class can be correct.
  • In multi-label classification, an example can belong to multiple classes simultaneously. Here, softmax isn’t used because we want to independently evaluate the probability of each class. Instead, functions like BCEWithLogitsLoss are used, where each label is treated as a separate binary classification problem.
  2. Why BCEWithLogitsLoss for Multi-Label Problems:
  • BCEWithLogitsLoss allows for the independent assessment of the probability of each class, ideal in situations where multiple classes can be correct for a single example. Each label is considered as a separate binary classification problem.
  3. Using BCEWithLogitsLoss in Single-Class-Correct Scenarios:
  • Technically, you could use BCEWithLogitsLoss in scenarios where only one class is correct, but it’s not the optimal approach. CrossEntropyLoss is specifically designed for scenarios where only one class is correct and efficiently manages dependencies between classes.
  • Using BCEWithLogitsLoss for a single-correct-class problem treats each class as an independent binary classification, which isn’t ideal because it ignores the interdependencies among classes.
  4. Concerns About Access to Information:
  • Indeed, finding detailed information on the application of BCEWithLogitsLoss for multi-label classification can be challenging online. However, this loss function is standard for such problems. Using CrossEntropyLoss would be inappropriate in this context as it assumes that only one class is correct for a given example.

In summary, the choice of loss function depends on the type of classification problem. For multi-label classification, where multiple classes can be correct for a single example, BCEWithLogitsLoss is the appropriate choice. Conversely, in multi-class classification, where only one class is correct, CrossEntropyLoss is more suitable.

To further understand and clarify the usage of these loss functions in PyTorch, it’s highly beneficial to refer directly to the official PyTorch documentation. Here are the specific links to the relevant sections:

  1. torch.nn.BCEWithLogitsLoss: PyTorch Documentation
  • This documentation explains BCEWithLogitsLoss, which combines a sigmoid layer with the binary cross-entropy loss in one single class. It’s particularly useful for multi-label classification tasks.
  2. torch.nn.BCELoss: PyTorch Documentation
  • Here, you’ll find information on BCELoss, which is used for binary classification problems. This function requires the input to be probabilities (i.e., values between 0 and 1), so a sigmoid activation should be applied to model outputs before using this loss.
  3. torch.nn.CrossEntropyLoss: PyTorch Documentation
  • This section covers CrossEntropyLoss, ideal for multi-class classification tasks where each sample belongs to exactly one class. This loss function combines log_softmax and nll_loss in a single function."
I almost forgot: also have a look at https://docs.fast.ai/metrics.html#multi-label-classification
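To make the difference above concrete, here is a minimal PyTorch sketch I put together myself (not taken from the docs): CrossEntropyLoss expects one class index per example, while BCEWithLogitsLoss expects an independent 0/1 float target for every class, so several labels can be “on” at once.

import torch
import torch.nn as nn

batch_size, n_classes = 2, 4
logits = torch.randn(batch_size, n_classes)  # raw model outputs, no softmax/sigmoid applied

# Multi-class: exactly one correct class per example -> targets are class indices
targets_idx = torch.tensor([0, 3])
loss_ce = nn.CrossEntropyLoss()(logits, targets_idx)  # log_softmax + NLL under the hood

# Multi-label: any number of correct classes -> targets are multi-hot float vectors
targets_hot = torch.tensor([[1., 0., 1., 0.],
                            [0., 0., 0., 1.]])
loss_bce = nn.BCEWithLogitsLoss()(logits, targets_hot)  # sigmoid + BCE per class, under the hood

# As the BCELoss docs note, BCEWithLogitsLoss is sigmoid followed by BCELoss in a single,
# more numerically stable step; this gives (almost) the same value as loss_bce:
loss_bce_manual = nn.BCELoss()(torch.sigmoid(logits), targets_hot)
print(loss_ce, loss_bce, loss_bce_manual)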

Have a nice day/night!


Hey, thanks for replying (and sorry for my late reply),

  • Technically, you could use BCEWithLogitsLoss in scenarios where only one class is correct, but it’s not the optimal approach. CrossEntropyLoss is specifically designed for scenarios where only one class is correct and efficiently manages dependencies between classes.

So it really is the case that BCE works, but normal cross entropy works better? That is, are we always in a less ideal situation when tackling multi-label problems (e.g. more limited in the choice of loss functions)?

  • Using BCEWithLogitsLoss for a single-correct-class problem treats each class as an independent binary classification, which isn’t ideal because it ignores the interdependencies among classes.

What does “interdependencies among classes” mean? Why should this matter for a multi-class classifier? Or again, similar to the BCE vs CE situation, is this just something we can take advantage of, but only in multi-class problems?

Hi,

Imagine you want to determine the probability of an object appearing at a pedestrian crossing. In your data set you have the distances from the crossing to a pub, a school, a beauty salon, a car repair shop, a hardware store, and a walking route, and you also have the probability of a blind person being in the vicinity of that crossing. You want to predict which object will be at the crossing, out of 5 classes (woman, man, dog, cyclist, car).

With cross entropy you get back a tensor of 5 numbers; to simplify, think of it as a vector in which the index refers to the class and the value at that index is the probability for that class. The values of this vector sum to 1 in the case of nn.CrossEntropyLoss, because softmax is built into that function.

As for your second question: imagine that, in the data, the probability of a blind person being near the crossing is high. Wouldn’t the post-softmax values for the dog index then be high, and likewise for the woman or man indexes? This is called a “positive correlation”.
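To picture it, here is a rough sketch with made-up numbers for that scenario: with independent sigmoids (what BCE uses) both “man” and “dog” can get a high probability at the same time, while softmax forces the five classes to compete for a total of 1.

import torch
import torch.nn.functional as F

classes = ['woman', 'man', 'dog', 'cyclist', 'car']
# Made-up logits for a crossing where a blind person (and hence a guide dog) is likely
logits = torch.tensor([0.5, 2.0, 2.0, -1.0, -0.5])

softmax_probs = F.softmax(logits, dim=0)  # compete: raising 'dog' pushes every other class down
sigmoid_probs = torch.sigmoid(logits)     # independent: 'man' and 'dog' can both be high

for c, p_soft, p_sig in zip(classes, softmax_probs, sigmoid_probs):
    print(f'{c:8s} softmax={p_soft.item():.2f} sigmoid={p_sig.item():.2f}')
print('softmax sum:', softmax_probs.sum().item())  # always 1.0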

This time I wrote it without using GPT, so I hope you can understand my answer :wink:

Regards

EDIT: This lesson of Jeremy’s provides some explanation of this function. Here it is:

https://colab.research.google.com/github/fastai/fastbook/blob/master/06_multicat.ipynb

BTW: I saw a video on YT a few days ago where someone using and promoting fast.ai ‘was teaching’ how to develop apps with the ‘cats or dogs’ dataset. I only watched 4 of the 30 minutes :wink: