Test-time augmentation (TTA) calculates class probabilities for the original image and 4 transformed versions of it. In Lesson 1, we simply average these 5 sets of probabilities to get the probabilities we use for classification, and estimate our classifier’s accuracy as follows:

log_preds,y = learn.TTA()

probs = np.mean(np.exp(log_preds),0)

accuracy_np(probs, y)

I think that, instead of averaging the probabilities, a better way of aggregating them would be to average (i.e., take the arithmetic mean of) the corresponding log-odds; or, equivalently, to take the geometric mean of the corresponding odds and convert the result back to a probability.
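Written out, with p_1, ..., p_n denoting the probabilities that the n versions of an image assign to one class (notation introduced here for illustration, not taken from the lesson), the equivalence looks roughly like this:

```latex
\[
\frac{1}{n}\sum_{k=1}^{n}\log\frac{p_k}{1-p_k}
  = \log\left(\prod_{k=1}^{n}\frac{p_k}{1-p_k}\right)^{1/n}
  = \log\bar{o},
\qquad
\hat{p} = \frac{\bar{o}}{1+\bar{o}}
\]
```

Here \bar{o} is the geometric mean of the odds and \hat{p} is the aggregated probability.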

As an example, say we have a cats vs dogs classifier, and the classifier predicts p(dog)=10% on the first 4 images. If we simply average the probabilities, it makes almost no difference whether the classifier predicts p(dog)=99% or p(dog)=99.99% on the 5th image; the average will be around 28% either way. However, if we average the log-odds instead, we obtain an aggregated probability of around 30% in the first case and of around 52% in the second case. So it now does make a difference whether the classifier is “very sure” or “virtually certain” that the 5th image shows a dog.
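The numbers in this example can be checked with a few lines of NumPy. This is only a sketch: the helper names `mean_of_probs` and `mean_of_log_odds` are made up for illustration and are not part of the fastai library.

```python
import numpy as np

def mean_of_probs(p):
    """Aggregate by taking the arithmetic mean of the probabilities."""
    return float(np.mean(p))

def mean_of_log_odds(p):
    """Aggregate by averaging the log-odds, then mapping back to a probability."""
    p = np.asarray(p, dtype=float)
    log_odds = np.log(p / (1 - p))
    return float(1 / (1 + np.exp(-np.mean(log_odds))))

# p(dog) on the 5 versions of the image, for the two cases in the example
very_sure = [0.10, 0.10, 0.10, 0.10, 0.99]
virtually_certain = [0.10, 0.10, 0.10, 0.10, 0.9999]

for name, p in [("very sure", very_sure), ("virtually certain", virtually_certain)]:
    print(f"{name}: mean of probs = {mean_of_probs(p):.3f}, "
          f"mean of log-odds = {mean_of_log_odds(p):.3f}")
```

The mean of the probabilities comes out at about 0.278 and 0.280, while the mean of the log-odds gives about 0.302 and 0.521.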

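A different way of thinking about averaging log-odds is that, for a two-class classifier, it is equivalent to averaging the outputs of the last FC layer before the softmax layer (the “logits”): with two classes, the log-odds of a class is simply the difference between its logit and the other class’s logit. With more than two classes the equivalence is no longer exact. The logits are natural quantities to take the arithmetic mean of, since they are themselves calculated as linear combinations of different feature activations.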
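For the two-class case, the link between log-odds and logits can be checked numerically. The sketch below uses randomly generated logits rather than real network outputs, and the `softmax` helper is written out by hand; both are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(scale=2.0, size=(5, 2))  # made-up logits: 5 versions, 2 classes

# Route 1: average the logits over the 5 versions, then apply softmax.
p_from_logits = softmax(logits.mean(axis=0))

# Route 2: softmax each version, average the per-class log-odds, map back.
p = softmax(logits)
log_odds = np.log(p / (1 - p))
p_from_log_odds = 1 / (1 + np.exp(-log_odds.mean(axis=0)))

print(p_from_logits, p_from_log_odds)
print(np.allclose(p_from_logits, p_from_log_odds))
```

Both routes give the same probabilities up to floating-point error. Changing the second dimension to 3 or more classes makes the two results differ in general.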

In practice, I expect the difference between these two ways of aggregating to matter most when a feature that is highly informative of class membership sits close to the edge of the image, where it can get lost in some of the square crops.

In code, averaging the log-odds can be done as follows:

log_preds,y = learn.TTA()

all_probs = np.clip(np.exp(log_preds), 1e-6, 1-1e-6) # clip to avoid division by zero and log of zero

log_odds = np.log(all_probs / (1-all_probs))

avg_odds = np.exp(np.mean(log_odds, 0))

probs = avg_odds / (1+avg_odds)

accuracy_np(probs, y)
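To try the same steps without a trained model, they can be wrapped in a function and run on synthetic data. This is a sketch under assumptions: the function name `aggregate_log_odds` is hypothetical, the array layout (versions, images, classes) is inferred from the mean over axis 0 above, and the input is a hand-built stand-in for the output of learn.TTA().

```python
import numpy as np

def aggregate_log_odds(log_preds, eps=1e-6):
    """Average per-class log-odds over axis 0 and return probabilities.

    log_preds: log-probabilities, assumed to have shape
    (n_versions, n_images, n_classes).
    """
    probs = np.clip(np.exp(log_preds), eps, 1 - eps)  # keep the logs finite
    log_odds = np.log(probs) - np.log1p(-probs)       # log(p / (1 - p))
    return 1 / (1 + np.exp(-log_odds.mean(axis=0)))

# Synthetic stand-in for the TTA output: 5 versions, 2 images, classes (cat, dog)
p_dog = np.array([[0.10, 0.10],
                  [0.10, 0.10],
                  [0.10, 0.10],
                  [0.10, 0.10],
                  [0.99, 0.9999]])
log_preds = np.log(np.stack([1 - p_dog, p_dog], axis=-1))  # shape (5, 2, 2)

mean_probs = np.exp(log_preds).mean(axis=0)
log_odds_probs = aggregate_log_odds(log_preds)

print(mean_probs.argmax(axis=1))      # predicted class per image, mean of probs
print(log_odds_probs.argmax(axis=1))  # predicted class per image, mean of log-odds
```

On this toy input the two aggregations agree on the first image (cat) but disagree on the second, where only the log-odds average lets the 99.99% prediction tip the result to dog.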

I ran a small experiment using the cats vs dogs classifier from Lesson 1 and found a small but statistically significant advantage (i.e., higher accuracy) for averaging the log-odds over averaging the probabilities.