When to use a 'background' or 'other' class


When is it appropriate to include a “background” or “other” class in a model?

I’m working on finding animals in aerial survey photographs, using single-label and multiple-label classification. There are animals in fewer than 1% of the photographs and the background is extremely variable, including areas of forest, sparse desert vegetation, nearly pure sand, rock, winding rivers, human habitation, agriculture, and so on. Like finding cancer cells, the challenge is finding the needle in the haystack.

Including a ‘background’ class for all non-animal images seems to create problems, because the model achieves the same reduction in loss when it correctly identifies a patch of sand as ‘background’ as when it correctly identifies an animal. I don’t actually want the model to learn anything at all from the very variable background images.

I’m wondering whether it would be better to cull the training images and only train on a smaller set of ‘positives’, i.e., photographs with animals in them. But if I do that, then how can the model indicate a low probability of animals if I gave it a photo with no animals? Is that just a matter of choosing an appropriate loss function?



(Marc P. Rostock) #2

I don’t have a full or definite answer on this, but I have asked the same question before. In Lesson 10 (Part 2, 2019) Jeremy discussed some of this (see Lesson 10 Discussion & Wiki (2019)). The general suggestion is to use neither a ‘background’ class nor softmax, and instead to apply thresholds to the per-class probabilities, so that ‘background only’ images are identified by the absence of any probable detection.

As for the training practice, I don’t know, but it might be useful to experiment with curriculum learning approaches here: maybe train only with animal images first and then add more and more pure background images to the training set in later epochs. I have not tried this myself yet. If you do try, please share your results!
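To make the curriculum idea concrete, here is a minimal sketch of how the per-epoch training list could be built. The function name `curriculum_epoch`, the linear ramp, and the `max_neg_ratio` parameter are all assumptions for illustration, not something from the lecture or from fastai, and nobody in this thread has tested it.

```python
import random

def curriculum_epoch(positives, negatives, epoch, n_epochs,
                     max_neg_ratio=1.0, seed=0):
    """Build one epoch's image list, adding more background images over time.

    Hypothetical helper: the first epoch uses animal images only, and the
    number of background images grows linearly until the last epoch, where
    it reaches max_neg_ratio * len(positives).
    """
    # 0.0 at the first epoch, 1.0 at the last one.
    progress = epoch / max(n_epochs - 1, 1)
    n_neg = min(len(negatives), int(progress * max_neg_ratio * len(positives)))
    rng = random.Random(seed + epoch)  # reproducible, but different per epoch
    items = list(positives) + rng.sample(list(negatives), n_neg)
    rng.shuffle(items)
    return items

# Made-up file names, just to show the growth in background images.
positives = [f"animal_{i}.jpg" for i in range(10)]
negatives = [f"background_{i}.jpg" for i in range(100)]
for epoch in range(5):
    print(epoch, len(curriculum_epoch(positives, negatives, epoch, 5)))
```

Whether a linear ramp is the right schedule, and how many background images to end up with, would need to be found by experiment.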



Thank you for those thoughts and the link to the discussion. From my notes:

  • The softmax activation function produces normalized probability values. It calculates the probability of an item being in each category, and it “acts like it wants to choose a single answer.” It should not be used in multi-label classification.
    • softmax(x[i]) = exp(x[i]) / (sum over j of exp(x[j]))
  • The sigmoid activation function is used for multi-label classification. The sigmoid function returns a value for each class that is somewhat like a probability on [0,1], but the classes together don’t sum to 1. For example, an image can have a 50% probability of being an A and also a 70% probability of being a B.
    • sigmoid(x) = exp(x)/(1+exp(x)).
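The two formulas in these notes can be checked with a few lines of plain Python. This is only a sketch: the logit values are made up, chosen so that the sigmoid outputs come out near the 50% and 70% in the example above.

```python
import math

def softmax(logits):
    # Subtracting the max avoids overflow and does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    # Equivalent to exp(x) / (1 + exp(x)), written to avoid overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

logits = [0.0, 0.85]  # made-up raw outputs for classes A and B
soft = softmax(logits)
sig = [sigmoid(x) for x in logits]

print("softmax:", soft, "sum =", sum(soft))  # always sums to 1
print("sigmoid:", sig, "sum =", sum(sig))    # roughly 0.5 and 0.7, sum > 1
```

Softmax forces the two classes to share a total of 1, while each sigmoid output is computed on its own.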

It seems to me that perhaps the right approach would be to train the model without a ‘background’ category (in my case, train using only images that contain animals). Then you could run the fully trained model with a sigmoid activation function as your output, and if a photo didn’t contain any items of interest, the sigmoid function would just return low probabilities for all of the animal categories.
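As a sketch of what that would look like at inference time: apply a sigmoid to each raw output and keep only the classes above a threshold, so an empty result means "no animals". The class names, logits, and the 0.5 threshold are made-up assumptions; in practice the threshold would be tuned on a validation set.

```python
import math

CLASSES = ["elephant", "giraffe", "zebra"]  # hypothetical labels

def sigmoid(x):
    # Equivalent to exp(x) / (1 + exp(x)), written to avoid overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def predict_labels(logits, threshold=0.5):
    """Return the classes whose sigmoid score clears the threshold."""
    probs = [sigmoid(x) for x in logits]
    return [c for c, p in zip(CLASSES, probs) if p >= threshold]

animal_photo = [3.1, -2.4, 1.2]   # made-up raw model outputs
empty_photo = [-4.0, -3.2, -5.1]  # every class scores low

print(predict_labels(animal_photo))  # ['elephant', 'zebra']
print(predict_labels(empty_photo))   # []
```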

I’m going to experiment with this, since it seems a number of people have the same general question.



A follow-up here: I finally found the answer in 2019 Part 2 lecture 10 at 45:00. In brief, the solution is to not include an ‘other’ category, but to use a binomial function (i.e., a sigmoid on each output) instead of softmax as the final activation function. Softmax should only be used where there is exactly one item in each image (not more, not less).

That’s because softmax divides each output by the sum over all classes:
softmax(x[i]) = exp(x[i]) / (sum over j of exp(x[j])). The problem is that dividing by the sum means you get the same probabilities whether all of the raw outputs are relatively high or all are relatively low, because only the differences between them matter. Imagine that your model is trying to identify birds, cats, and fish. Softmax might tell you that there is a 70% chance that the image is a fish when a bird and a cat are also likely to be in it (i.e., all three outputs are high, but fish is the highest), or that there is a 70% chance that the image is a fish when the outputs are all low (i.e., the model is quite sure that none of them is in the image, but fish is somewhat more likely than the others). The binomial (sigmoid) function, exp(x)/(1+exp(x)), is applied to each output separately and doesn’t divide by the sum over classes, so it doesn’t suffer from that problem. This is especially important when you have a lot of images that will not include your categories of interest.
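Here is a small check of the bird/cat/fish example in plain Python. The logits are made up: the second set is the first shifted down by 10, so the gaps between classes are identical.

```python
import math

def softmax(logits):
    # Subtracting the max avoids overflow and does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    # Equivalent to exp(x) / (1 + exp(x)), written to avoid overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Raw outputs for (bird, cat, fish); values are illustrative only.
high = [3.5, 3.5, 5.0]    # all three look likely, fish most of all
low = [-6.5, -6.5, -5.0]  # none looks likely, fish the least unlikely

soft_high, soft_low = softmax(high), softmax(low)
sig_high = [sigmoid(x) for x in high]
sig_low = [sigmoid(x) for x in low]

print("softmax high:", soft_high)  # fish at about 0.69
print("softmax low: ", soft_low)   # same values as above
print("sigmoid high:", sig_high)   # all above 0.9
print("sigmoid low: ", sig_low)    # all below 0.01
```

Softmax cannot tell the two cases apart, while the sigmoid outputs clearly separate "everything is present" from "nothing is present".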