When is it appropriate to include a “background” or “other” class in a model?
I’m working on finding animals in aerial survey photographs, using single-label and multiple-label classification. There are animals in fewer than 1% of the photographs and the background is extremely variable, including areas of forest, sparse desert vegetation, nearly pure sand, rock, winding rivers, human habitation, agriculture, and so on. Like finding cancer cells, the challenge is finding the needle in the haystack.
Including a ‘background’ class for all non-animal images seems to create problems, because the model achieves the same reduction in loss when it correctly identifies a patch of sand as ‘other’, as when it correctly identifies an animal. I don’t actually want the model to learn anything at all from the very variable background images.
I’m wondering whether it would be better to cull the training images and only train on a smaller set of ‘positives’, i.e., photographs with animals in them. But if I do that, then how can the model indicate a low probability of animals if I give it a photo with no animals? Is that just a matter of choosing an appropriate loss function?
I don’t have a full or definite answer on this, but I have asked the same question before. In Lesson 10 (Part 2, 2019) Jeremy discussed some of this (Lesson 10 Discussion & Wiki (2019)), and the general suggestion is to not use a ‘background’ class and not use softmax, but rather to use the probabilities and thresholds to classify the ‘background only’ images by the absence of probable detections.
Re: the training practice, I don’t know, but it might be useful to experiment with curriculum learning approaches here: maybe train only with animal images first and then add more and more pure background images to the training set in later epochs (a rough sketch follows below). I have not tried this myself yet; if you do try, please share your results!
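If it helps, here is a rough plain-PyTorch sketch of what that staged schedule could look like; `animal_ds`, `background_ds`, and `train_one_epoch` are hypothetical placeholders, and the mixing schedule is purely illustrative:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset

def make_loader(epoch, animal_ds, background_ds, batch_size=64):
    # Epoch 0 trains on animal images only; each later epoch mixes in
    # a growing share of the pure-background images.
    n_bg = min(len(background_ds), epoch * len(background_ds) // 10)
    datasets = [animal_ds] + ([Subset(background_ds, range(n_bg))] if n_bg else [])
    return DataLoader(ConcatDataset(datasets), batch_size=batch_size, shuffle=True)

# for epoch in range(10):
#     loader = make_loader(epoch, animal_ds, background_ds)
#     train_one_epoch(model, loader)  # hypothetical training step
```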
Thank you for those thoughts and the link to the discussion. From my notes:
The softmax activation function produces normalized probability values. It calculates the probability of an item being in each category, and it “acts like it wants to choose a single answer.” It should not be used in multi-label classification.
softmax(x[i]) = exp(x[i]) / sum(exp(x[j]) for all j)
The sigmoid activation function is used for multi-label classification. The sigmoid function returns a value for each class that is somewhat like a probability on [0,1], but the classes together don’t sum to 1. For example, an image can have a 50% probability of being an A and also a 70% probability of being a B.
sigmoid(x) = exp(x)/(1+exp(x)).
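To make the difference concrete, here is a minimal PyTorch sketch (the logits are made-up example values):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5])  # raw model outputs for classes A, B, C

softmax_p = torch.softmax(logits, dim=0)
sigmoid_p = torch.sigmoid(logits)

print(softmax_p)  # tensor([0.6285, 0.2312, 0.1402]): sums to 1, "wants" a single answer
print(sigmoid_p)  # tensor([0.8808, 0.7311, 0.6225]): each class scored independently
```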
It seems to me that perhaps the right approach would be to train the model without a ‘background’ category (in my case, train using only images that contain animals). Then you could run the fully trained model with a sigmoid activation function as your output, and if a photo didn’t contain any items of interest, the sigmoid function would just return low probabilities for all of the animal categories.
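As a sketch of that decision rule, assuming sigmoid outputs over the animal classes only (the 0.5 threshold is arbitrary and would need tuning on validation data):

```python
import torch

# Hypothetical sigmoid outputs for one photo over the animal classes.
probs = torch.tensor([0.03, 0.07, 0.01])
threshold = 0.5  # illustrative; tune on a validation set

if not (probs > threshold).any():
    print("no animals detected")  # treat as a background-only photo
```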
I’m going to experiment with this–it seems a number of people have the same general question.
A follow-up here: I finally found the answer in 2019 Part 2, Lesson 10 at 45:00. In brief, the solution is to not include an ‘other’ category, but to use a binomial function instead of softmax as the final activation function. Softmax should only be used when there is exactly one item of interest in each image (not more, not fewer).
That’s because softmax divides each output by the sum of all the outputs: softmax(x[i]) = exp(x[i]) / sum(exp(x[j]) for all j). The problem is that dividing by the sum means you can get the same probability for a class whether all the raw outputs are relatively high or all of them are relatively low. Imagine that your model is trying to identify birds, cats, and fish. Softmax might tell you that there is a 70% chance that the image is a fish when a bird and a cat are also likely to be in it (i.e., all three are likely, but fish has the highest raw output of the three categories), or that there is a 70% chance that the image is a fish when the outputs are all low (i.e., the model is quite sure that none of them are in the image, but fish is slightly more likely than the others). The binomial function, p/(1+p), which is just the sigmoid when p = exp(x), doesn’t divide by the sum over all categories, so it doesn’t suffer from that problem. This is especially important when you have a lot of images that don’t include any of your categories of interest.
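You can see this failure mode directly: softmax is unchanged when you add a constant to every raw output, so ‘all high’ and ‘all low’ produce identical probabilities, while the sigmoid distinguishes them. A small sketch with made-up outputs for (bird, cat, fish):

```python
import torch

high = torch.tensor([2.0, 2.0, 3.5])  # everything likely, fish most of all
low = high - 6.0                      # same gaps, but nothing is likely

print(torch.softmax(high, dim=0))  # tensor([0.1543, 0.1543, 0.6914]): "69% fish"
print(torch.softmax(low, dim=0))   # identical output: softmax ignores the shift
print(torch.sigmoid(high))         # tensor([0.8808, 0.8808, 0.9707]): all three likely present
print(torch.sigmoid(low))          # tensor([0.0180, 0.0180, 0.0759]): probably none present
```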
@lawrence
After some research, I am unsure whether @jeremy meant binary instead of binomial. That would mean the idea is to replace softmax with sigmoid. Sigmoid maps each score from the final layer to a value between 0 and 1, independently of what the other scores are.
Apologies for the very slow reply @Andreas_Daiminger. Fastai uses the sigmoid by default in multi-label classification models. The sigmoid function is built into PyTorch, so you can also call torch.sigmoid(x) (or F.sigmoid(x), if you have imported torch.nn.functional as F) on any tensor x. Here’s how fastai works:
When you create a dataset by labeling the data with either a CategoryList or a MultiCategoryList, datablock.py chooses an appropriate loss function (categorical cross-entropy (CE) or binary cross-entropy (BCE), respectively).
The train.py module then binds a softmax activation function to CE and a sigmoid activation function to BCE.
You can see what is going on by experimenting with get_preds (which is also called internally by learn.fit): in the multi-label case it just applies the sigmoid to the model’s raw outputs.
preds, y, losses = learn.get_preds(with_loss=True)  # activ=F.sigmoid is the default for BCE
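From there, flagging the background-only images is just a threshold over those sigmoid probabilities (the 0.5 cutoff is illustrative and would need tuning):

```python
max_probs = preds.max(1)[0]  # highest class probability per image
no_animal = max_probs < 0.5  # True where no class is probable
```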
I have had a bit more experience since my original posting, and have come to think that a good model can usually train pretty well even with an ‘other’ category, although you may still get some improvement by separating the training into two stages – the first to separate ‘other’ from the rest, and the second to distinguish between non-other categories. I’d be interested in other opinions, though.
I can confirm your results, Lawrence: with single-label classification, softmax seems to work fine. I’ve been working with a bunch of single-label classification models I’ve built for a production environment, and I have been using an ‘absence’ or ‘other’ class in most of my models while achieving very high recognition rates (I typically train to 97-99%).
I’m actually switching to a multi-label solution, as there are a number of cases where the model gets one class right but misses the second one, and I was looking to see what other people do in that case. It looks like the right method would be to not train with this class. Maybe I can do a comparison, when I’m done, with and without the absence class. It does make sense that when using a sigmoid you wouldn’t necessarily need the extra data.