Doubts regarding multi-label classification

I was going through the docs of fastai V2 and was trying out the multi-label classification example using the pascal 2007 dataset and was wondering if there are useful resources to understand the basics of multi-label classification?

I understand how a single-label CNN classifier works: for instance, how a labeled picture of a cat, when put through a CNN architecture, keeps its spatial information intact; how the network gradually learns various features of the cat as it goes through the layers; and how, finally, when an image of a cat is shown to the model, specific neurons fire and the model can identify that it's a cat.

But in the case of multiple labels, my main issue is that since we aren’t identifying and labeling segments of the image and training those parts specifically, I was wondering how a CNN can be used to label multiple objects in a picture?

Is it just that the final layer of the NN has one output per class, and for each image multiple neurons are activated based on the labels, so that as we train, the CNN picks up the various features corresponding to each object type and classifies them accordingly?


I’d recommend checking out chapter 6 in the fastbook repo. It’s all about multi-label classification! The real trick is using binary cross entropy as the loss function but Jeremy and Sylvain explain the technical parts much better than I can :slight_smile:

I think an intuitive way to think about it is that there are features that indicate a dog and a cat are in the image, regardless of where they are. In the same way, a dog vs. cat classifier is somewhat agnostic to the location of the animal in each picture (in fact data augmentation in fastai basically guarantees the model will see images where the dog/cat is in different parts of the image).

Thanks, @gc_mac! I’ll definitely look into chapter 6 of the fastbook. Unfortunately, I was taking a bottom-up approach to the book and am on chapter 3, and didn’t look at the contents page properly to jump directly to that! :sweat_smile:

Let’s say the last layer in the model (before the classifier) has 512 feature maps of size 10x10.

This means the model has learned to detect 512 different patterns in the data. Each feature map represents a different pattern.

Let’s say the original input image is 320x320 pixels. Because each feature map is a grid of 10x10 pixels, each of these pixels corresponds to a group of 32x32 pixels in the original image.

This means we can tell how much each 32x32 group of pixels in the original image is matching with the pattern that belongs to that feature map.
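To make this concrete, here is a minimal sketch of the shapes involved. The backbone below is hypothetical (just five stride-2 convolutions, not a real architecture like ResNet), but it shows how a 320x320 image ends up as 512 feature maps of 10x10, where each position covers a 32x32 patch of the input:

```python
import torch
import torch.nn as nn

# Hypothetical backbone: five stride-2 conv layers downsample a
# 320x320 input to a 10x10 feature map with 512 channels.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1),    # 320 -> 160
    nn.Conv2d(32, 64, 3, stride=2, padding=1),   # 160 -> 80
    nn.Conv2d(64, 128, 3, stride=2, padding=1),  # 80 -> 40
    nn.Conv2d(128, 256, 3, stride=2, padding=1), # 40 -> 20
    nn.Conv2d(256, 512, 3, stride=2, padding=1), # 20 -> 10
)

x = torch.randn(1, 3, 320, 320)
fmap = backbone(x)
print(fmap.shape)  # torch.Size([1, 512, 10, 10])

# Each of the 10x10 positions corresponds (roughly) to a 32x32 region
# of the original image, since the total stride is 2**5 = 32.
```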

For example, if one feature map is the pattern for “cat ears” and there is a cat ear in the top-left corner of the input image, then the corresponding pixel in the feature map will have a high value, because it matches very well with this particular pattern. But the other pixels in that feature map will have low values because they do not have a cat ear in them.

When we build a classifier, we add another layer that connects these 512 patterns to the N classes that we’re interested in (often after global pooling because we may not care exactly where the pattern appears in the image, only whether it does or not).

So every class is connected to a certain combination of these patterns, and these combinations will be different for each class. (These connections are of course the weights of the classifier layer.)
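A minimal sketch of such a classifier head, assuming the 512 feature maps above and N = 20 classes (the number in Pascal VOC 2007; the exact number is just an example):

```python
import torch
import torch.nn as nn

N_CLASSES = 20  # e.g. Pascal VOC 2007 has 20 classes (an assumption here)

# Global pooling collapses each 10x10 feature map into a single number
# ("how strongly did this pattern appear anywhere in the image?"),
# then one linear layer mixes the 512 pattern scores into N class scores.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # (B, 512, 10, 10) -> (B, 512, 1, 1)
    nn.Flatten(),              # -> (B, 512)
    nn.Linear(512, N_CLASSES), # -> (B, N_CLASSES) raw scores (logits)
)

fmaps = torch.randn(2, 512, 10, 10)
logits = head(fmaps)
print(logits.shape)  # torch.Size([2, 20])
```

The weights of that `nn.Linear` layer are exactly the "connections" described above: one weight per (pattern, class) pair.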

For example, if the image is of a cat then the feature map with the “cat ears” pattern will get a high response. It’s likely that the class “cat” will be connected to the “cat ears” pattern with a large weight, and this will help the class “cat” get a high response too.

For a multi-class classifier, we apply the softmax function so that there is only one class that is the winner. Basically, we look for the class for which the patterns it is connected to have the largest response to the input image. If none of the other classes have a higher response to the image of the cat, then class “cat” wins.

For a multi-label classifier, we don’t apply the softmax, and so we decide for each class separately how much the input image matches the pattern detectors it is connected to. That is really the only difference. If class “cat” has a high response, then the image contains a cat. But if another class also gets a high-enough response, we’ll also have detected that other class. Now the classes aren’t fighting with each other over who wins.
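The difference is easy to see numerically. With made-up raw scores for three classes, softmax forces the outputs to compete (they sum to 1), while sigmoid scores each class on its own:

```python
import torch

# Hypothetical raw scores (logits) for three classes: cat, dog, car
logits = torch.tensor([2.0, 1.5, -1.0])

soft = torch.softmax(logits, dim=0)
print(soft, soft.sum())  # probabilities compete and sum to 1

sig = torch.sigmoid(logits)
print(sig)  # each class judged independently; several can exceed 0.5
```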


So, the location of a cat or a dog in the image is irrelevant and could vary a lot, since we are looking for distinct dog, cat, etc. features everywhere? In other words, the “combination” of the cats’ and dogs’ positions in the image does not matter? Also, if the multi-label classifier gets an image with one single cat in it, will it still classify it correctly?

Thanks a lot, @machinethink, my intuition seems to have been somewhat in the right place, and I think I understand it now.

So the final layer for the multi-label classifier is also a fully connected layer connecting to the N classes.

And whenever we get an image labeled ‘cat’, the feature maps that training has learned to associate with a cat will fire, and thus produce a higher activation.

The idea of not using a softmax function is that we would like the class probabilities to be independent of each other? i.e. the existence of a cat in the image shouldn’t in any way be related to the possibility of a dog also being present.

So instead of doing a normal softmax on the final layer and applying cross-entropy to get the right class, as in the multi-class case, we apply a sigmoid function and then compute the cross-entropy loss and update the weights, which I gather is the binary cross-entropy loss used in multi-label classification. Is that about right?


Yes, that is the key feature of convolution. Let’s say you have a conv layer with 3x3 kernel size and F filters. Then each of these F filters will learn to detect a specific 3x3 pattern. Since the convolution window slides over the input feature map, it checks at every position how well the 3x3 pixels in the feature map match with the pattern it has learned. (Actually, the pattern has size 3x3xC where C is the number of channels in the feature map, but it’s the same idea.)
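As a shape sketch of that last point (the channel count C and filter count F below are arbitrary examples): a conv layer's filters really are 3x3xC patterns, and the output has one match score per position per filter:

```python
import torch
import torch.nn as nn

# One conv layer with F=8 filters of kernel size 3x3 over a C=4 channel map.
conv = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, padding=1)
print(conv.weight.shape)  # torch.Size([8, 4, 3, 3]): each filter is 3x3x4

fmap = torch.randn(1, 4, 10, 10)
out = conv(fmap)
# One match score per spatial position per filter:
print(out.shape)  # torch.Size([1, 8, 10, 10])
```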

Yes, because it looks at each class independently.


Correct, it’s exactly the same except it uses sigmoid instead of softmax activation. If there are N classes, then both types of classifiers output a vector with N numbers.

The difference is that for a multi-class classifier, all the numbers in this vector sum up to 1. But for a multi-label classifier they don’t. Each element in this vector has a value between 0 and 1, and is independent of the others.

Yep, you got it. And we use binary cross entropy (i.e. logistic loss) because that goes with the sigmoid function.
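In PyTorch this is a one-liner: `nn.BCEWithLogitsLoss` applies the sigmoid internally and scores each class as an independent yes/no. A minimal sketch (the logits and targets are made up):

```python
import torch
import torch.nn as nn

# Binary cross-entropy on raw scores; the sigmoid is applied internally.
loss_fn = nn.BCEWithLogitsLoss()

logits = torch.tensor([[3.0, -2.0, 1.5]])   # one image, 3 classes
targets = torch.tensor([[1.0, 0.0, 1.0]])   # classes 0 and 2 are present
loss = loss_fn(logits, targets)
print(loss.item())
```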


I gave a talk about using fastai (an older version) for a multi-label classification competition, where I go over some of the challenges of dealing with this kind of data. It might be helpful to you.

For this competition:


@wdhorton Thanks! :slightly_smiling_face: Will definitely take a look at it!

Thank you so much, this is super helpful.

@kodzaks I’ll also mention, if you want a good example of this, that I repurposed PETS to do multi-label (as a way to have a model that can say “I don’t know”), which does a single-label problem in a multi-label framework.


So, this is a label that means “not-all-or-any–other-labels”?

Yes. It works by just not returning a label whatsoever (see the donkey example at the end of the notebook).


Interesting! So you did not train your model on a “nothing” category where you just put some random stuff; instead, when it sees something that is “not-all-or-any–other-labels”, it returns “nothing”?


Yes! Exactly. The reason is that multi-label classification gives the model a sigmoid activation, which puts every class on its own threshold. A basic example: in multi-label problems we can have 1, 2, or 3 labels present at a given moment, which means there are times when label 2 or label 3 may not be present. It follows that there can be cases where no label reaches the threshold we set. This is different from our regular classification, where we apply softmax and argmax: we take all the raw scores and scale them to 0-1 so that all our probabilities sum to 100%, then take the highest one as our answer. Here, instead, we look at each probability separately and check whether it is above a threshold. For example, with a threshold of 85%, a particular label will only show up if its probability is above 0.85.
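A minimal sketch of that thresholding (label names, logits, and the 0.85 threshold are all hypothetical):

```python
import torch

THRESH = 0.85  # hypothetical per-label threshold
labels = ["cat", "dog", "horse"]

# Made-up logits for an out-of-distribution image (say, a donkey):
probs = torch.sigmoid(torch.tensor([-3.0, -2.5, -1.0]))

predicted = [l for l, p in zip(labels, probs) if p > THRESH]
print(predicted)  # [] -- nothing clears the threshold: "I don't know"
```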


Ok, but what about situations where we have both a dog and a cat in the same image, a single cat in another image, 2 dogs in a 3rd image, and a donkey in a 4th? I think I am confused about what “multi-label” means. I thought it was for situations where several classes are present in the same image.

It is, but if used creatively we can repurpose our models to tell us if something may not be what we want our input to be (such as a user inputting a car into an animal classification model).

If we assume that this is for instance the binary classifier from lesson 1 (where it’s either cat or dog), your outputs should look like so:

  1. Dog and cat are present
  2. Cat is present but not dog
  3. Dog is present
  4. Nothing is present

Note that it does not check how many instances of something there are, just whether it is present. You’d want to combine this with, say, object detection or image regression to get some form of counting mechanism.

On 3 though, if it were instead the PETs model from last year’s lesson 1, one would presume that 3 would give species 1 and species 2 (like Labrador retriever and husky).


Thank you! I will need to think more about the best way to go about it. I have, let’s say, 5 classes (object names), and in any image there could be just one class present, or 3, or 5, or 2, in various combinations. Would object detection work better then? I need the results to be like this: this image has class 1, class 3, and class 5; but this image has class 1 only; etc., with various combinations of classes (i.e. object names).

I’d say multi-label is exactly what you need here. What I was describing above was the following situation:

I have a picture like so:

(We’ll use puppies because who doesn’t love puppies!)

Object detection (or regression) would be advisable in this situation if we wanted to know how many Rottweiler puppies were present (in this case 5). In a multi-label perspective like discussed earlier, our model would return “Rottweiler”.

Also, multi-label (like you’re describing) could apply if we rewrite the scenario like so:

In this case, our model would return that there is a Rottweiler present, a husky, and a golden retriever, despite the fact that there are a few more Rottweilers present. Bounding boxes would tell us where each one is and what each dog’s class is, while image regression would simply return a number. You could also go down a rabbit hole with image regression, but if that’s of interest I’ll explain the whole concept on Zoom and upload that :slight_smile: (along with what we just discussed here).

Does this help? :slight_smile:
