Multi-label classification?

In the Lesson 2 video, a question was raised about multi-label classification, which is different from multi-class classification (the ImageNet 1000-class problem is an example of the latter, and we already know how to solve it from the later class videos). But how do we do multi-label classification using what we’ve learned in this course? E.g. what if a picture can contain any combination of animals, and we’d like to label each picture with all the animals present in it?


I am guessing that labeling a variable number of animals might be better approached by starting with segmentation; this is just a guess though, so please take it with a grain of salt.

As for a situation where a training example can have numerous labels, here is one possible setup. Let’s say there are 4 valid labels: A, B, C and D. For each example, we encode the targets as a vector of four binary elements. A training example with only label A assigned would be encoded as [1 0 0 0], one with A and C as [1 0 1 0], and one with all four labels as [1 1 1 1].
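The encoding above can be sketched in a few lines of Python (the label names and the `encode` helper are just illustrative, not from any library):

```python
# Multi-hot encoding of label sets, as described above.
LABELS = ["A", "B", "C", "D"]

def encode(assigned):
    """Return a binary target vector: 1 where the label is assigned, else 0."""
    return [1 if label in assigned else 0 for label in LABELS]

print(encode({"A"}))                 # [1, 0, 0, 0]
print(encode({"A", "C"}))            # [1, 0, 1, 0]
print(encode({"A", "B", "C", "D"}))  # [1, 1, 1, 1]
```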

A softmax output layer is nice in that it tells the NN: hey, we are dealing with probabilities (in the mathematical, not statistical, sense), so all the outputs for a given training example should sum to one. That constraint is exactly what we don’t want for multi-label targets.
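To see why the sum-to-one constraint is a problem here, a minimal sketch (the logits are made up for illustration):

```python
import math

# Softmax forces all outputs to sum to 1, so two labels that are both
# present (e.g. target [1 0 1 0]) can never both get an output near 1.
def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

out = softmax([2.0, -1.0, 2.0, 0.0])
print(sum(out))  # sums to (approximately) 1.0
print(out)       # the two large logits each get less than 0.5
```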

If we switch the last layer to one with sigmoid activations, we can use a cost function appropriate to our targets. I think in Keras it would be called binary cross-entropy, but I might be wrong (it is basically log loss). Taking the negative log of the output has nice properties that help with backpropagating the error, but you could also go with mean squared error. With MSE, if the correct output for a training example is [1 0 1 0] and our network produces [0.6 0.9 1 0], then for each output neuron the squared difference is propagated back to earlier layers. At the last layer, this calculation is done independently for each output. In this particular case, the error of the first output neuron is (1 - 0.6)**2, which has some magnitude and gets propagated to earlier layers, while for the third output (third label), where we output 1 and label 3 is indeed assigned to this example, the error propagated back is 0 (no error at all).
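The per-output error calculation in that worked example can be checked directly (the target and output vectors are the ones from the post):

```python
# Per-output squared error: target [1 0 1 0], network output [0.6 0.9 1 0].
target = [1, 0, 1, 0]
output = [0.6, 0.9, 1.0, 0.0]

# Each output neuron's error is computed independently of the others.
errors = [(t - y) ** 2 for t, y in zip(target, output)]
print(errors)  # first neuron: (1 - 0.6)**2 = 0.16, third neuron: 0.0
```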

The reason this approach works is that the sigmoid only takes values between 0 and 1, and over a large range of inputs it is very close to 0 or 1, which gives our NN some leeway to navigate around. You could also do this with tanh or some other activation that ranges over -1 to 1, but in that case you would be better off encoding the targets with 1 and -1.
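The tanh variant just changes the target encoding from 0/1 to -1/+1 (again, the label names and helper are illustrative only):

```python
# +1/-1 target encoding for a tanh output layer, per the suggestion above.
LABELS = ["A", "B", "C", "D"]

def encode_tanh(assigned):
    """Return +1 where the label is assigned, -1 where it is not."""
    return [1 if label in assigned else -1 for label in LABELS]

print(encode_tanh({"A", "C"}))  # [1, -1, 1, -1]
```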


This is an entire area of research and is non-trivial. I believe they mentioned this will be covered in part II. For now here is a lecture from the Stanford course.

For a code implementation in Keras of multi-label classification on ImageNet-like images (albeit with a TensorFlow backend), see:

Keep in mind there are a lot of different approaches, which are generally different iterations on region proposals: Faster R-CNN, YOLO, SSD, etc.

Check out the Microsoft COCO competition for the state of the art, but you are unlikely to find a clean Keras version of any of the code.


Thank you both! So I’ll ask a naive question: why is it so hard that it deserves an entire area of research? Why can’t we just classify the pictures as cat/no-cat, then dog/no-dog, then squirrel/no-squirrel, etc., essentially translating a multi-label classification problem into repeated two-class classification problems?
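The repeated two-class framing in this question can at least be sketched as independent per-label decisions, thresholding one score per binary classifier (the label names, scores, and threshold here are all made up for illustration):

```python
# One-vs-rest view of multi-label output: one independent binary
# decision per label, here reduced to thresholding per-label scores.
LABELS = ["cat", "dog", "squirrel"]

def predict_labels(scores, threshold=0.5):
    """scores: per-label probabilities from independent binary classifiers."""
    return [label for label, s in zip(LABELS, scores) if s >= threshold]

print(predict_labels([0.9, 0.2, 0.7]))  # ['cat', 'squirrel']
```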


Yes, well the two hammers used in ML are regression and classification. That may or may not work depending on the complexity of what you need. For example, with self-driving cars you may need to know not only that there is a car, but also where the car is in the image and whether there is more than one car, and you need that information instantaneously. The proposed two-class approach fails on all three counts, which gives some indication of why it is an area of research.

One approach, Faster R-CNN, creates region proposals (a bunch of smaller cropped parts of the image) based on some indication that there could be something in each region, then performs classification on them, which ultimately yields a bounding box from the region proposal. A two-class approach would effectively be creating 1000 “region proposals” that each use the whole image, without the other benefits such as knowing the location and running faster.


Thank you, but just to clarify, I’m not asking about self-driving cars. I just wanted to know how to classify with multiple labels. The difficulty of identifying locations or doing counts in the self-driving car example is therefore not relevant to my narrow purpose. It is not clear why difficulty in that kind of problem would also mean difficulty for the problem I described originally…

I think there are probably two answers to that. One: it might work, but nobody has thought of approaching the solution this way. Two: it doesn’t work, because the NN has been trained on images of a single thing, and its accuracy drops sharply if there are, for example, two dogs or a dog and a cat in the photo.

If it is the first one, then you should attempt it. So many things we’ve seen so far (I’ve only made it to lesson 3) have been surprisingly simple, it was just that nobody had tried them before. Either it will work and you will achieve glory, or it won’t and you’ll learn a lot. :slight_smile:

If it is the second one, then a sensible first step would be segmentation, or some other approach that first divides the image into “there is a feature of interest here and here and here”; then it would have a chance.

Good points. Thank you @Rothrock42!