Updating here with my findings. This is pretty verbose, so TL;DR - I’m deciding to go with ensembling probabilities, but it may be worth running some tests on your own data sets and models, as my experiments were inconclusive but showed little difference between the two methods.

This paper on ensembling suggests that ensembling probabilities is experimentally *slightly* favorable. They provide the same intuition for this as you did @muellerzr, namely

It is more reasonable to average after the softmax transformation, as the scores might have

varying scales of magnitude across the base learners, as the score output from different network

might be in different magnitude.

Nice call!

I ran my own experiments to see the effects on model loss when ensembling probabilities vs. ensembling logits. I did the following on 4 different (albeit similar) tabular data sets that I use at work:

- trained 100 models with a typical train-val-test split.
- got the outputs (
`logits`

) for each of these 100 models.
- converted these logits into probabilities (
`probs`

) using softmax.
- created 10 ensembles each of varying sizes from these 100 prediction sets. So the first 10 ensembles have 1 model in them (not really an ensemble), the second 10 ensembles have 2 models in them, … the last 10 ensembles have 10 models in them.
- I got the ensemble predictions by averaging
`logits`

and compared the loss to ensembling predictions by averaging `probs`

. I compared the mean and STD of the sets of 10 ensembles.

My results were fairly inconclusive - on some data sets averaging the probs was slightly better (slightly lower mean loss and/or STD), on some data sets averaging logits was slightly better, and sometimes it just varied. Below are some plots of some of these results so you can see what I mean:

## Data set 1 - Using probs was better on test and on mean in val, logits better STD on val

## Data set 2 - hard to tell, but I’d say logits were better

I won’t post any more charts but I think you get the picture. They’re really pretty close, and neither was consistently better. Because of the theoretical/intuitive concern about scaling issues, I’m going with ensembling the probs since there may be less risk there.