Let’s say the last layer in the model (before the classifier) has 512 feature maps of size 10x10.
This means the model has learned to detect 512 different patterns in the data. Each feature map represents a different pattern.
Let’s say the original input image is 320x320 pixels. Because each feature map is a 10x10 grid, each cell in that grid corresponds to a 32x32 patch of pixels in the original image.
This means we can tell how well each 32x32 patch of the original image matches the pattern that belongs to that feature map.
For example, if one feature map is the pattern for “cat ears” and there is a cat ear in the top-left corner of the input image, then the corresponding cell in the feature map will have a high value, because that patch matches this particular pattern very well. The other cells in that feature map will have low values because they do not have a cat ear in them.
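To make that mapping concrete, here is a tiny sketch in plain Python. The only assumption is the 320 / 10 = 32 stride from the example above; in a real network the receptive fields overlap somewhat because of padding and the earlier layers, so this is only the rough correspondence.

```python
def cell_to_patch(row, col, stride=32):
    """Return the (top, left, bottom, right) pixel box in the original image
    that the feature-map cell at (row, col) roughly corresponds to."""
    return (row * stride, col * stride, (row + 1) * stride, (col + 1) * stride)

# The "cat ears" example: a strong response at cell (0, 0) points at
# the top-left 32x32 pixels of the 320x320 input image.
print(cell_to_patch(0, 0))   # (0, 0, 32, 32)
```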
When we build a classifier, we add another layer that connects these 512 patterns to the N classes that we’re interested in (often after global pooling because we may not care exactly where the pattern appears in the image, only whether it does or not).
So every class is connected to a certain combination of these patterns, and these combinations will be different for each class. (These connections are of course the weights of the classifier layer.)
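As a rough sketch of what such a classifier head can look like in code (PyTorch is assumed here; the 512 feature maps and the 10x10 size are the numbers from this example, and `num_classes` stands in for N):

```python
import torch
import torch.nn as nn

num_classes = 10  # hypothetical N

# Classifier head on top of the backbone (not shown), which is assumed
# to output features of shape (batch, 512, 10, 10).
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),      # global average pooling: (B, 512, 10, 10) -> (B, 512, 1, 1)
    nn.Flatten(),                 # -> (B, 512)
    nn.Linear(512, num_classes),  # these weights are the "connections" between patterns and classes
)

features = torch.randn(1, 512, 10, 10)  # stand-in for real backbone output
logits = head(features)                 # (1, num_classes): one raw score per class
```

Each class corresponds to one row of the `nn.Linear` weight matrix, and that row of 512 numbers says how strongly each pattern votes for that class.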
For example, if the image is of a cat then the feature map with the “cat ears” pattern will get a high response. It’s likely that the class “cat” will be connected to the “cat ears” pattern with a large weight, and this will help the class “cat” get a high response too.
For a multi-class classifier, we apply the softmax function so that the scores become probabilities and only one class comes out as the winner. Basically, we look for the class whose connected patterns have the largest combined response to the input image. If none of the other classes have a higher response to the image of the cat, then class “cat” wins.
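A minimal sketch of that decision, again assuming PyTorch and a made-up score vector:

```python
import torch

# Hypothetical raw class scores (logits) for the classes ["cat", "dog", "bird"].
logits = torch.tensor([[2.5, 0.3, -1.0]])

probs = torch.softmax(logits, dim=1)   # ~[[0.88, 0.10, 0.03]], sums to 1
winner = probs.argmax(dim=1)           # tensor([0]) -> "cat" is the single winner
```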
For a multi-label classifier, we don’t apply the softmax, and so we decide for each class separately how much the input image matches the pattern detectors it is connected to. That is really the only difference. If class “cat” has a high response, then the image contains a cat. But if another class also gets a high-enough response, we’ll also have detected that other class. Now the classes aren’t fighting with each other over who wins.
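The same decision in code, with the common choice of a per-class sigmoid and a fixed threshold (the 0.5 threshold and the class names here are assumptions for illustration, not something prescribed by the model):

```python
import torch

# Hypothetical raw class scores (logits) for the labels ["cat", "indoors", "dog"].
logits = torch.tensor([[2.5, 1.8, -3.0]])

probs = torch.sigmoid(logits)   # ~[[0.92, 0.86, 0.05]], each score judged on its own
detected = probs > 0.5          # tensor([[True, True, False]]): "cat" and "indoors" both detected
```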