# Aggregating TTA's outputs

#1

TTA calculates class probabilities for the original image and 4 transformed versions of the original image. In Lesson 1, we simply average these 5 probabilities to get the probability we use for classification, and so estimate our classifier’s accuracy as follows:

```python
log_preds, y = learn.TTA()
probs = np.mean(np.exp(log_preds), 0)
accuracy_np(probs, y)
```

I think a better way of aggregating than averaging the probabilities directly would be to average (i.e., take the arithmetic mean of) the log-odds corresponding to these probabilities; or, equivalently, to take the geometric mean of the corresponding odds.
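To illustrate the equivalence, here is a quick sketch with made-up probabilities (the specific numbers are arbitrary, not from the Lesson 1 model):

```python
import numpy as np

probs = np.array([0.2, 0.5, 0.9])  # hypothetical p(dog) for 3 views
odds = probs / (1 - probs)

# Route 1: arithmetic mean of the log-odds, mapped back to a probability
mean_log_odds = np.mean(np.log(odds))
p_from_log_odds = 1 / (1 + np.exp(-mean_log_odds))

# Route 2: geometric mean of the odds, mapped back to a probability
geo_mean_odds = np.prod(odds) ** (1 / len(odds))
p_from_geo_odds = geo_mean_odds / (1 + geo_mean_odds)

# Both routes give the same aggregated probability
assert np.isclose(p_from_log_odds, p_from_geo_odds)
```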

As an example, say we have a cats vs dogs classifier, and the classifier sees p(dog)=10% on the first 4 images. Now, if we simply average the probabilities, it makes almost no difference whether the classifier sees p(dog)=99% or p(dog)=99.99% on the 5th image; the average will be around 28% either way. However, if we average the log-odds instead, we obtain an aggregated probability of around 30% in the first case, and of around 52% in the second case. So it does make a difference now whether the classifier is “very sure” or “virtually certain” that the 5th image shows a dog.

A different way of thinking about averaging log-odds is that it is equivalent to averaging the outputs of the last FC layer before the softmax layer (the “logits”). The logits are natural quantities to take the arithmetic mean of, since they are themselves calculated as linear combinations of different feature activations.
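For the binary case this equivalence is easy to check numerically: with a two-way softmax, p(dog) = sigmoid(z_dog − z_cat), so averaging the logits and then applying softmax gives the same answer as averaging the log-odds. A sketch with hypothetical logits (not from the actual model):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical (z_cat, z_dog) logits for 3 augmented views
logits = np.array([[ 1.0, -0.5],
                   [ 0.3,  0.8],
                   [-1.2,  0.4]])

# Route 1: average the logits, then softmax
p_dog_avg_logits = softmax(logits.mean(axis=0))[1]

# Route 2: softmax each view, then average the log-odds of p(dog)
p_dog = softmax(logits)[:, 1]
log_odds = np.log(p_dog / (1 - p_dog))
p_dog_avg_log_odds = 1 / (1 + np.exp(-log_odds.mean()))

assert np.isclose(p_dog_avg_logits, p_dog_avg_log_odds)
```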

In practice, I expect the difference between these two ways of aggregating to be most relevant if there is a feature in an image that is highly informative of class membership and is close to the edge of the image, so it can get lost when square-cropping.

In code, averaging the log-odds can be done as follows:

```python
log_preds, y = learn.TTA()
all_probs = np.clip(np.exp(log_preds), 1e-6, 1 - 1e-6)  # avoid log(0) and division by zero
log_odds = np.log(all_probs / (1 - all_probs))
avg_odds = np.exp(np.mean(log_odds, 0))
probs = avg_odds / (1 + avg_odds)
accuracy_np(probs, y)
```

I ran a small experiment using the cats vs dogs classifier from Lesson 1 and found a small but statistically significant advantage (i.e., higher accuracy) for the second way of aggregating the probabilities.

#2

I agree with your point: even with ensembling, I was getting slightly better results with the second method of taking the geometric mean. However, I didn't get your example on dog classification. How did you get the 28%, 30%, and 52% values? I am most likely misunderstanding something.

#3

Hi Arka,

Maybe a bit of code is worth a thousand words. Note that with simple averaging, there is almost no difference between probs1 and probs2; however, when averaging the log-odds, there is a significant difference.

```python
def average_log_odds(probs):
    log_odds = np.log(probs / (1 - probs))
    avg_odds = np.exp(np.mean(log_odds, 0))
    return avg_odds / (1 + avg_odds)

probs1 = np.array([0.1, 0.1, 0.1, 0.1, 0.99])
probs2 = np.array([0.1, 0.1, 0.1, 0.1, 0.9999])

np.mean(probs1)            # 0.278
np.mean(probs2)            # 0.27998
average_log_odds(probs1)   # 0.30179691437705425
average_log_odds(probs2)   # 0.5210546449799203
```

#4

Correct me if I am wrong, but probs2 wouldn't be a probability distribution. You have given the log-odds for dog. Try calculating it for cat: it becomes 56%, so you would still end up choosing cat.

If there are multiple classes, you would need to compute this for all of them.
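As a sketch of what the multi-class version might look like: averaging the per-class log-probabilities across views (a geometric mean of the probabilities) and then renormalizing is, up to the normalizing constant, the same as averaging the logits. The numbers below are hypothetical, not model outputs:

```python
import numpy as np

# Hypothetical log-probabilities for 5 views x 3 classes,
# in the same shape as the output of learn.TTA()
log_preds = np.log(np.array([[0.7, 0.2, 0.1],
                             [0.6, 0.3, 0.1],
                             [0.1, 0.8, 0.1],
                             [0.5, 0.4, 0.1],
                             [0.6, 0.2, 0.2]]))

# Geometric-mean aggregation: average the log-probabilities
# per class across views, then renormalize to sum to 1
avg_log = log_preds.mean(axis=0)
probs = np.exp(avg_log)
probs /= probs.sum()
```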

#5

`probs1` and `probs2` in my example are the 5 values of p(dog) given by the classifier for the original image and the 4 augmented images, so they are not a probability distribution.

The function `average_log_odds` I defined has the property that `average_log_odds(1-probs) == 1-average_log_odds(probs)`. So you can calculate p(cat) either as `average_log_odds(1-probs2)` or as `1-average_log_odds(probs2)`, and either way you get 48%.
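A quick numerical check of this symmetry (restating the helper from the earlier post so the snippet runs on its own):

```python
import numpy as np

def average_log_odds(probs):
    log_odds = np.log(probs / (1 - probs))
    avg_odds = np.exp(np.mean(log_odds, 0))
    return avg_odds / (1 + avg_odds)

probs2 = np.array([0.1, 0.1, 0.1, 0.1, 0.9999])

# Both ways of computing p(cat) agree, at about 48%
p_cat_a = average_log_odds(1 - probs2)
p_cat_b = 1 - average_log_odds(probs2)
assert np.isclose(p_cat_a, p_cat_b)
```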

#6

My bad. I didn't do `avg_odds = avg_odds / (1+avg_odds)`.