Why average predictions from TTA rather than selecting most confident?

In the notebooks, when using TTA, the usual procedure is:

log_preds, y = learn.TTA()                   # log-probabilities per augmentation: (n_augs, n_samples, n_classes)
probs = np.mean(np.exp(log_preds), axis=0)   # average the class probabilities across the augmentations
preds = np.argmax(probs, axis=1)             # pick the class with the highest average probability

Instead of averaging the results, why do we not take the most confident prediction? i.e.:

probs = np.max(np.exp(log_preds), axis=0)   # won't sum to one, but that doesn't matter since it only feeds the argmax below
preds = np.argmax(probs, axis=1)

My understanding of TTA is that by presenting variants of the image to the trained classifier, it has a better chance of classifying correctly, just as I might recognize an image better in one variant than another. For example, if you showed me pictures of celebrities upside down, I would be less accurate than if they were all right side up, and I would probably be more confident about the right-side-up pictures. In that case we would not take the average of the predictions; we would take the most confident one. Why do we not do this with TTA results?

I think your approach is valid, but my guess is that taking the average of the variants (rather than the max) is for robustness. By taking the max of the max, you may be over-fitting, especially if you consider many variants. Following your celebrity example: would you rather use the algorithm that outputs the celebrity for which several of the variants are similar to the target, or the one that has a single variant very similar to the target but all the other variants very different? Aside from the fact that upside-down variations are probably not a good idea with faces, I think your upside-down celebrity example is a good argument against probs = np.min(np.exp(log_preds), axis=0), rather than against probs = np.mean(np.exp(log_preds), axis=0).

Thanks @martinmm. The argmax is just selecting which class was predicted most strongly, so I don’t understand how that would be overfitting; this is test time anyway. As I understand it, augmentations are really a hack (at training time too) to force the network to generalize somewhat, so that it doesn’t learn to expect certain orientations etc. For example, if the training set always has the nose in the center of the picture, the model may learn to expect this, and augmentations help disrupt that. Without training-time augmentation, a test image with the nose offset would probably produce a lower-confidence result than one with the nose in the center. I may be wrong on this, but that’s how I currently understand it.

So by using max, what I’m really asking the model is: which augmentation of the image produces the highest-confidence result?

I ran a couple of tests on my dataset. First, with the default number of augmentations in TTA (1 + 4 augmentations), taking the most confident prediction had higher accuracy: mean 87.7% vs. max 87.9%.
Second, I increased the number of augmentations in TTA to (1 + 19). In this case the mean won out on accuracy: mean 88.1% vs. max 87.8%.

These results suggest that with a small number of test-time augmentations max is better, but that as the number of augmentations increases mean wins out. It’s very hard to draw conclusions from just the one dataset, though, and I don’t have a clear explanation for why this should be.
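In case anyone wants to reproduce the comparison, this is roughly how I scored the two aggregations. It's only a minimal sketch; it assumes log_preds from learn.TTA() has shape (n_augs, n_samples, n_classes) and that y is a numpy array of the true class indices:

import numpy as np

probs = np.exp(log_preds)                                  # (n_augs, n_samples, n_classes) probabilities
preds_mean = np.argmax(np.mean(probs, axis=0), axis=1)     # average over augmentations, then pick a class
preds_max = np.argmax(np.max(probs, axis=0), axis=1)       # most confident value per class over augmentations
print('mean acc:', (preds_mean == y).mean(), 'max acc:', (preds_max == y).mean())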

By the way, I’m not using upside-down celebrities; that was just a human perceptual example. I’m actually using the hmnist skin cancer dataset.

@MarkD I agree that it’s hard to draw conclusions from one dataset, but your results are consistent with what I would expect: the larger the number of augmentations, the better the mean works compared to max.

I meant that it’s over-fitting in the sense that the approach (with max-max) favors getting a good fit to the target merely by chance. For example, suppose that among the augmentations there is a zoom to part of the image. If the zoom happens to land on the hair of the celebrity, then the algorithm will output the celebrity with the most similar hair to the target, regardless of the rest of the face.

Another example: imagine a hiring system based on the grades of the candidate. Hiring the candidate that maximizes the average grade is probably better than hiring the candidate that maximizes the best grade. It could be even better to hire the candidate that maximizes the worst grade; that is maximin optimization. You could try this on your dataset: for TTA with 1 + 19, is it better to use mean or min?
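Concretely, that would just be a one-line change on the same log_preds (same shape assumptions as the sketch above):

probs_min = np.min(np.exp(log_preds), axis=0)   # worst-case probability of each class across augmentations
preds_min = np.argmax(probs_min, axis=1)
print('min acc:', (preds_min == y).mean())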

@martinmm I agree, and I think this may be the rationale for using mean. The samples that flip from one category to another are close in confidence, so selecting the one that happened to produce a numerically higher result in a single case is probably inferior to averaging. For example, in a binary classifier, if in 90% of the augmentations we were 51.1% confident it was class a (and 48.9% confident it was class b), and in only one augmentation we were 51.2% confident of class b, then mean would select class a and max would select class b. In this example class a is likely the better choice, so mean would generally be preferable to max.
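To make that toy example concrete, here’s a quick sketch with made-up numbers in the same spirit (19 augmentations at 51.1% for class a, one at 51.2% for class b):

import numpy as np

probs = np.array([[0.511, 0.489]] * 19 + [[0.488, 0.512]])   # rows: augmentations, columns: [class a, class b]
print(np.argmax(probs.mean(axis=0)))   # 0 -> class a wins under mean aggregation
print(np.argmax(probs.max(axis=0)))    # 1 -> class b wins under max aggregation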

Increasing the number of augmentations was still a worthwhile outcome from this experiment, as my accuracy increased from 87.7% to 88.1%. That is, using mean and increasing the number of augmentations in my TTA to 19 gave me improved accuracy, as you would expect.
