Ensembling across models: workflow suggestions

Hi all,

So, I’ve finally been able to make more time for training different models, especially for Kaggle competitions. It’s more fun when you can benchmark your results. :muscle:

One piece of general advice I’ve come across for improving results is to train various models and ensemble their test predictions; and judging by the conversations going on in the forum, it seems we agree that it’s an effective technique.

I know that some of us here are using these techniques for (hopefully) better results.
Could you share your approach on how you’re going about this, especially regarding:

  • what kind of workflow you have for saving/storing the results? Separate notebooks -> write to a file? (What libs/formats are you using for that?)
  • how you combine them later? Load up everything and just average (mean) them? (Any intuition on the averaging method used? Why mean and not something else, etc.?)
  • if possible, some sort of pseudo-code-ish explanation would be superb, as I’m not extremely familiar with all the cool/efficient Python libs that I might be missing out on.
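To make the question concrete, here’s a minimal sketch of the kind of workflow I’m imagining (file names, shapes, and the two-model setup are made up; the arrays stand in for real model outputs): each model’s notebook saves its test-set probabilities to disk, and a separate combining step loads them all and takes the mean.

```python
import numpy as np

# Toy stand-ins for each model's test-set probabilities; in a real workflow
# each array would come out of its own training notebook.
# Shape convention: (n_test_images, n_classes).
rng = np.random.default_rng(0)
preds_a = rng.random((4, 3))
preds_b = rng.random((4, 3))

# Saving step: one file per model (np.save here; a CSV via pandas works too).
np.save("preds_model_a.npy", preds_a)
np.save("preds_model_b.npy", preds_b)

# Combining step: load everything, stack, and take the unweighted mean.
files = ["preds_model_a.npy", "preds_model_b.npy"]
stacked = np.stack([np.load(f) for f in files])  # (n_models, n_test, n_classes)
blend = stacked.mean(axis=0)                     # (n_test, n_classes)
```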

(cc @jeremy @KevinB @sermakarevich @wgpubs @jamesrequa @ramesh pinging you here, since I’ve seen you talk about it now and then :smiley: )


A very nice description of ensembling techniques is here: Kaggle ensembling guide.

  • If we would like to go further than averaging predictions from different models blindly, we need predictions from our models on the train set. This is typically done through CV.
  • Once we have predictions on the train set, we have another optimisation problem: new train set >> test. This task is like any other ML task, so we can solve it in any way we like: xgboost, RF, neural network, logit, weights, average, median, etc. Do whatever you want, just don’t forget to CV-verify it.

That’s it. Not much to write about. Once you know how to get predictions for your train set, you can keep stacking as long as your CV score improves.
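The recipe above can be sketched with scikit-learn (the dataset, base models, and fold count are toy placeholders): out-of-fold CV predictions become the new train set, and any second-level model fitted on them is itself CV-verified.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Out-of-fold predictions on the *train* set: each row is predicted by a
# model that never saw it during fitting, so the meta-features are not leaked.
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The stacked predictions are the new train set; solve it like any other ML
# task (here a simple logit), and don't forget to CV-verify the meta-model too.
meta = LogisticRegression()
cv_score = cross_val_score(meta, oof, y, cv=5).mean()
```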


Thanks for the link and tips. CV-verification makes a lot of sense. Do you tend to use full k-fold CV, or have you seen any benefits from going for it?


Yes, I am trying to use it now. A live example on Dog Breeds:

  • averaging models: 0.148 loss
  • using weights: 0.139 loss

Without train predictions, the only thing you can do is use the leaderboard to figure out how to blend. But you can easily overfit if you do that.
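A hedged sketch of the "using weights" idea (all data below is synthetic; in practice `p1`/`p2` would be real held-out probabilities from two models): search for the blend weight that minimises log loss on a validation set, rather than averaging blindly.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import log_loss

# Synthetic stand-ins: binary labels plus two models' validation probabilities.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=200)
p1 = np.clip(y_val * 0.7 + rng.normal(0, 0.25, 200), 0.01, 0.99)
p2 = np.clip(y_val * 0.6 + rng.normal(0, 0.30, 200), 0.01, 0.99)

def blend_loss(w):
    # Log loss of the weighted blend; w[0] is model 1's weight, 1 - w[0] model 2's.
    return log_loss(y_val, w[0] * p1 + (1 - w[0]) * p2)

# One scalar weight bounded to [0, 1]; the loss is convex in w, so this converges.
res = minimize(blend_loss, x0=[0.5], bounds=[(0, 1)])
best_w = res.x[0]
```

The optimal blend is never worse on the validation set than either model alone; the CV step is what tells you whether that gain generalises.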


Can you clarify this a bit?


Say you have 5 predictions of a test set from different models/configs. The question is how to combine these predictions to achieve better accuracy than any single prediction. You can do it in three ways:

  • blindly - just average and hope for the best
  • experimenting with how to combine, using CV predictions on the train set
  • experimenting with how to combine by sending your blends to Kaggle and checking your public leaderboard score

It is well known that tuning against the public leaderboard score can harm your model’s generalisation ability.


Ah, I see. Now I get what you meant. The first two options sound reasonable, in increasing order of implementation complexity.
The last one, though, sounds like a pure hack. I’m not very familiar with the Kaggle leaderboard, but isn’t it the case that Kaggle holds out a private test set to score predictions after the competition is over, precisely to avoid being gamed/overfitted with this strategy?


btw, that’s a very well-written guide you linked above. Thanks again for sharing!


That’s true. And that’s why the leaderboard score is not a reliable estimate of accuracy.


Have you checked out planet_cv.ipynb? That’s got examples of ensembling across cycles, and across different CV samples.


Great questions @suvash!

I’m currently working on doing a little ensembling in the planet competition and would love to hear feedback on the approach as I have no idea whether it is even legit.

  1. Basically, I’m following the training example from @jeremy in the lesson2-image_models notebook to verify the process results in a good model.
  2. Submit the results as a baseline.
  3. Do a five-fold stratified cross-validation, and in each iteration:
    – Create a new learner
    – Train using the same approach as in step #1
    – Get the F2 score and optimal threshold
    – Run TTA against the validation set and save the predictions in a CSV file along with image_name (i.e., I’m saving the probabilities, not the actual labels as you would when creating the submission file)
  4. Create a DataFrame for each of the 5 .csv files and stack them one on top of another in a master DataFrame.
  5. In the master DataFrame, group on image_name to get the mean of each column for each image.
  6. Use the mean of the 5 optimal thresholds as my final threshold.
  7. Create a final submission file based on the grouped master DataFrame and the final threshold value.
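Steps 4-7 above would look something like this in pandas (the fold DataFrames, label columns, and threshold values are toy stand-ins for what the real CSV files would contain):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the 5 per-fold CSVs: image_name plus per-label probabilities.
rng = np.random.default_rng(0)
folds = [
    pd.DataFrame({
        "image_name": ["img_0", "img_1", "img_2"],
        "clear": rng.random(3),
        "haze": rng.random(3),
    })
    for _ in range(5)
]
thresholds = [0.21, 0.24, 0.19, 0.22, 0.20]  # per-fold optimal thresholds (made up)

# Steps 4-5: stack into one master DataFrame, then average per image.
master = pd.concat(folds, ignore_index=True)
mean_probs = master.groupby("image_name").mean()

# Step 6: final threshold is the mean of the per-fold thresholds.
final_thresh = np.mean(thresholds)

# Step 7: threshold the averaged probabilities into space-separated label strings.
labels = mean_probs.apply(
    lambda row: " ".join(c for c in mean_probs.columns if row[c] > final_thresh),
    axis=1,
)
submission = pd.DataFrame({"image_name": mean_probs.index, "tags": labels.values})
```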

Thoughts? Recommendations?


I’m checking this out later today.

What kind of ensembling does it do?

It seems like the term can be defined in various ways: averaging predictions of the same model against different train/valid splits, averaging the results of different models over different splits, averaging the weights of the same model over different splits, etc.

I’m kinda confused as to what the best and most useful approaches to ensembling are, whether they vary by problem, and what logic we should use in deciding what to do. In particular, I’m more interested in approaches that can be used in a real-world production environment rather than just those that help win a Kaggle competition.

Oh, I hadn’t yet. Thanks for reminding us once again. There’s just so much info here, it’s easy to forget. :beers:

Woah! That sounds quite methodical. Good luck with the planet comp. I haven’t started that one yet, as I’ve just begun and am working with the dog breeds dataset for now. I really don’t have much to add. Sounds pretty good, and time-consuming, to me :smile:

Thanks for sharing the steps you plan to work through.

Trying this notebook out … can’t find imports:



Yeah, you can’t run it as-is. It uses a really old version of the library. But hopefully the ideas are still useful.

I’ll update it before we cover it in class.