Aggregating TTA's outputs

TTA calculates class probabilities for the original image and 4 transformed versions of the original image. In Lesson 1, we simply average these 5 probabilities to get the probability we use for classification, and so estimate our classifier’s accuracy as follows:

log_preds,y = learn.TTA()
probs = np.mean(np.exp(log_preds),0)
accuracy_np(probs, y)

I think instead of averaging the probabilities, a better way of aggregating the probabilities would be to average (i.e., take the arithmetic mean of) the log-odds corresponding to these probabilities; or, equivalently, take the geometric mean of the odds corresponding to these probabilities.
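To make the "equivalently" concrete, here is a minimal numerical check. The five p(dog) values are made up for illustration and the variable names are mine, not part of the fastai library.

```python
import numpy as np

# Hypothetical p(dog) values for one image and its 4 augmented versions.
probs = np.array([0.1, 0.1, 0.1, 0.1, 0.99])
odds = probs / (1 - probs)

# Route 1: arithmetic mean of the log-odds, mapped back to odds.
odds_from_log_mean = np.exp(np.mean(np.log(odds)))

# Route 2: geometric mean of the odds (n-th root of their product).
odds_from_geo_mean = np.prod(odds) ** (1 / len(odds))

# Both routes give the same aggregated odds.
assert np.isclose(odds_from_log_mean, odds_from_geo_mean)

# Odds convert back to a probability via p = odds / (1 + odds).
aggregated_prob = odds_from_log_mean / (1 + odds_from_log_mean)
```

The two routes agree because the exponential of a mean of logarithms is, by definition, the geometric mean.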

As an example, say we have a cats vs dogs classifier, and the classifier sees p(dog)=10% on the first 4 images. Now, if we simply average the probabilities, it makes almost no difference whether the classifier sees p(dog)=99% or p(dog)=99.99% on the 5th image; the average will be around 28% either way. However, if we average the log-odds instead, we obtain an aggregated probability of around 30% in the first case, and of around 52% in the second case. So it does make a difference now whether the classifier is “very sure” or “virtually certain” that the 5th image shows a dog.

A different way of thinking about averaging log-odds is that, for a two-class classifier, it is equivalent to averaging the outputs of the last FC layer before the softmax layer (the “logits”): the log-odds of a class are just the difference between the two logits, so both averages lead to the same aggregated probability. The logits are natural quantities to take the arithmetic mean of, since they are themselves calculated as linear combinations of different feature activations.
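A small sketch of this equivalence for the two-class case. The logits below are invented numbers and `softmax` is a helper I define here, so treat it as an illustration under those assumptions. With more than two classes the per-class log-odds log(p/(1-p)) are no longer simple logit differences, so I would not expect exact agreement there.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Made-up logits: 5 TTA views x 2 classes (cat, dog).
logits = np.array([[2.0, -0.2],
                   [1.5, -0.7],
                   [1.0, -1.2],
                   [2.5, 0.3],
                   [-3.0, 1.6]])

# Route 1: average the logits over the views, then apply softmax once.
p_dog_from_logits = softmax(logits.mean(axis=0))[1]

# Route 2: softmax each view, average the log-odds of dog, convert back.
p_dog = softmax(logits)[:, 1]
mean_log_odds = np.mean(np.log(p_dog / (1 - p_dog)))
p_dog_from_log_odds = 1 / (1 + np.exp(-mean_log_odds))

assert np.isclose(p_dog_from_logits, p_dog_from_log_odds)
```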

In practice, I expect the difference between these two ways of aggregating to be most relevant if there is a feature in an image that is highly informative of class membership and is close to the edge of the image, so it can get lost when square-cropping.

In code, averaging the log-odds can be done as follows:

log_preds,y = learn.TTA()
all_probs = np.clip(np.exp(log_preds), 1e-6, 1-1e-6) # avoid division-by/log-of zero
log_odds = np.log(all_probs / (1-all_probs))
avg_odds = np.exp(np.mean(log_odds, 0))
probs = avg_odds / (1+avg_odds)
accuracy_np(probs, y)

I ran a small experiment using the cats vs dogs classifier from Lesson 1 and found a small, but statistically significant advantage (i.e. higher accuracy) for the second way of aggregating the probabilities.

I agree with your point. Even with ensembling, I was getting (slightly) better results with the second method of taking the geometric mean. However, I didn’t get your example on dog classification. How did you get the 28%, 30%, and 52% values? I am most likely misunderstanding something.

Hi Arka,

Maybe a bit of code is worth a thousand words. Note that with simple averaging, there is almost no difference between probs1 and probs2. However, when averaging the log-odds, there is a significant difference.

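import numpy as np

def average_log_odds(probs):
    # probs: the p(dog) values for the original image and its augmented versions
    log_odds = np.log(probs / (1-probs))
    avg_odds = np.exp(np.mean(log_odds, 0))  # geometric mean of the odds
    return avg_odds / (1+avg_odds)  # convert odds back to a probability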

probs1 = np.array([0.1,0.1,0.1,0.1,0.99])
probs2 = np.array([0.1,0.1,0.1,0.1,0.9999])

np.mean(probs1)
0.278

np.mean(probs2)
0.27998

average_log_odds(probs1)
0.30179691437705425

average_log_odds(probs2)
0.5210546449799203

Correct me if I am wrong but probs2 wouldn’t be a probability distribution. You have given the log odds for dog. Try to calculate for cat, it becomes 56%. So you would still end up choosing cat.

If there are multiple classes you would need to compute this for all.

probs1 and probs2 in my example are the 5 values for p(dog) given by the classifier for the original image and the 4 augmented images; so they are not a probability distribution.

The function average_log_odds I defined has the property that average_log_odds(1-probs) == 1-average_log_odds(probs). So you can calculate p(cat) either as average_log_odds(1-probs2), or as 1-average_log_odds(probs2), and either way you get 48%.
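The symmetry is easy to check numerically. This sketch repeats the function definition so that it runs on its own; `p_dog` and `p_cat` are names I chose for illustration.

```python
import numpy as np

def average_log_odds(probs):
    log_odds = np.log(probs / (1 - probs))
    avg_odds = np.exp(np.mean(log_odds, 0))
    return avg_odds / (1 + avg_odds)

# The 5 p(dog) values from the example (original image + 4 augmented views).
probs2 = np.array([0.1, 0.1, 0.1, 0.1, 0.9999])

p_dog = average_log_odds(probs2)
p_cat = average_log_odds(1 - probs2)

# Swapping p for 1-p flips the sign of every log-odds value, so the
# aggregated probabilities for the two classes sum to 1.
assert np.isclose(p_cat, 1 - p_dog)
assert np.isclose(p_dog + p_cat, 1.0)
```

The reason is that log((1-p)/p) = -log(p/(1-p)), so the mean log-odds simply changes sign and the aggregated odds are inverted.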

My bad. I didn’t do avg_odds = avg_odds/ (1+avg_odds).

However, this raises a question: if your model predicts with high probability (here 90%) that the image is a cat in 4 of the views, and with very high probability (99.99%) that it is a dog in the remaining one, which should you go for?

I would say cat is the more probable answer, right? I mean, in 4 of the cases it is quite sure that the answer is cat. In the one remaining image, it could be some artifact that leads it to believe it’s a dog.

Well, I gave my reasons for why I find averaging log-odds more appealing than averaging probabilities in my first post. However, these intuitive arguments only go so far. The great thing is that, if you have already run TTA, you can check for yourself, at very little extra computational cost, which of the two ways of aggregating works better for your problem.
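A minimal sketch of such a check. I assume `log_preds` has shape (views, images, classes) and holds log-probabilities, as in the snippets above. The helper names, the stand-in `accuracy` function, and the tiny hand-made array are mine; the array is contrived so that the two methods disagree, so it says nothing about real data. With real TTA output you would pass in the `log_preds` and `y` returned by `learn.TTA()` instead.

```python
import numpy as np

def aggregate_mean_probs(log_preds):
    # Arithmetic mean of the probabilities over the TTA views (axis 0).
    return np.mean(np.exp(log_preds), axis=0)

def aggregate_mean_log_odds(log_preds, eps=1e-6):
    # Arithmetic mean of the per-class log-odds, mapped back to probabilities.
    probs = np.clip(np.exp(log_preds), eps, 1 - eps)  # avoid log(0) and division by zero
    log_odds = np.log(probs / (1 - probs))
    avg_odds = np.exp(np.mean(log_odds, axis=0))
    return avg_odds / (1 + avg_odds)

def accuracy(probs, y):
    # Stand-in accuracy: fraction of images whose argmax class matches the label.
    return np.mean(np.argmax(probs, axis=1) == y)

# Contrived stand-in for TTA output: 5 views x 2 images x 2 classes (cat, dog).
view_probs = np.array(
    [[[0.9, 0.1], [0.8, 0.2]]] * 4       # first 4 views
    + [[[0.0001, 0.9999], [0.7, 0.3]]]   # 5th view is almost certain image 0 is a dog
)
log_preds = np.log(view_probs)
y = np.array([1, 0])  # hypothetical labels: image 0 is a dog, image 1 is a cat

acc_mean_probs = accuracy(aggregate_mean_probs(log_preds), y)
acc_mean_log_odds = accuracy(aggregate_mean_log_odds(log_preds), y)
print(acc_mean_probs, acc_mean_log_odds)
```

Since both aggregations reuse the same `log_preds`, the only extra cost is a few array operations.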