So, I’ve finally been able to make more time for training different models, especially for Kaggle competitions. It’s more fun when you can benchmark your results.
One piece of general advice I’ve come across for improving results is to train several different models and ensemble their test predictions; and from the conversations going on in the forum, it seems we agree that it’s an effective technique.
I know that some of us here are using these techniques for (hopefully) better results.
Could you share your approach on how you’re going about this, especially regarding:
what kind of workflow you have for saving/storing the results? Separate notebooks that write to a file? (What libs/formats are you using for that?)
how you’re combining them later? Load up everything and just average (mean) them? (Any intuition on the averaging method used? Why mean and not something else, etc.?)
if possible, some sort of pseudo-code-ish explanation would be superb, as I’m not extremely familiar with all the cool/efficient Python libs that I might be missing out on.
If we would like to go further than blindly averaging predictions from different models, we need our models’ predictions on the train set as well. This is typically obtained through cross-validation (CV).
Once we have predictions on the train set, we have another optimisation problem: a new train set mapping to the test set. This task is like any other ML task, so we can solve it any way we like: xgboost, RF, a neural network, logit, weights, an average, a median, etc. Do whatever you want, just don't forget to CV-verify it.
That’s it. Not much to write about. Once you know how to get predictions for your train set, you can keep stacking as long as your CV score improves.
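A minimal stacking sketch of the idea above, using sklearn with made-up data (the base models, feature set, and sizes are all placeholders for illustration): out-of-fold predictions become the new train set, and a second-level model is fit on them, CV-verified like any other model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data; in practice these are your competition train/test sets
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_test = X[:100]

base_models = [RandomForestClassifier(n_estimators=50, random_state=0),
               LogisticRegression(max_iter=1000)]

kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros((len(X), len(base_models)))         # out-of-fold train predictions
test_preds = np.zeros((len(X_test), len(base_models)))

for j, model in enumerate(base_models):
    # Each row of `oof` is predicted by a model that never saw that row
    for train_idx, val_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx, j] = model.predict_proba(X[val_idx])[:, 1]
    # Refit on the full train set to produce test predictions
    model.fit(X, y)
    test_preds[:, j] = model.predict_proba(X_test)[:, 1]

# Second-level model on the OOF predictions -- CV-verify it too
meta = LogisticRegression()
print('meta CV accuracy:', cross_val_score(meta, oof, y, cv=5).mean())
meta.fit(oof, y)
final = meta.predict_proba(test_preds)[:, 1]
```

The same loop generalises to more levels: treat `oof` as the new features and repeat.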
Say you have 5 predictions of the test set from different models/configs. The question is how to combine these predictions to achieve better accuracy than any single prediction. You can do it in three ways:
blindly - just average and hope for the best
experimenting with how to combine using CV predictions on the train set
experimenting with how to combine by sending your blends to Kaggle and checking your public leaderboard score
It is well known that sticking to the public leaderboard score can harm your model's generalisation ability.
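For reference, the "blind" option is a one-liner. A minimal sketch with random arrays standing in for the five saved prediction files (the sizes are made up); the median is a common alternative to the mean when you want robustness to one badly calibrated model:

```python
import numpy as np

# Stand-ins for 5 models' test-set probabilities, loaded from your saved files
rng = np.random.default_rng(0)
preds = [rng.random(1000) for _ in range(5)]

blend_mean = np.mean(preds, axis=0)      # simple average across models
blend_median = np.median(preds, axis=0)  # more robust to a single outlier model
```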
Ah, I see. Now I get what you meant. The first two options sound reasonable, in increasing order of implementation complexity.
The last one, though, sounds like a pure hack. I’m not very familiar with the Kaggle leaderboard, but isn’t it the case that Kaggle holds out a final dataset to score predictions after the competition is over, precisely to avoid being gamed/overfitted by this strategy?
I’m currently working on doing a little ensembling in the planet competition and would love to hear feedback on the approach as I have no idea whether it is even legit.
Basically, I’m following the training example from @jeremy in the lesson2-image_models notebook to verify the process results in a good model.
Submit the results as a baseline
Do a five-fold stratified cross-validation, and in each iteration
– Create a new learner
– Train using the same approach as in step #1
– Get the F2 and optimal threshold
– TTA against the validation set and save predictions to a CSV file along with image_name (i.e., I’m saving the probabilities, not the actual labels as you do when creating the submission file).
Create a DataFrame for each of the 5 .csv files and stack them one on top of another in a master DataFrame
In the master DataFrame, group by image_name to get the mean of each column for each image
Use the mean of the 5 optimal thresholds as my final threshold
Create a final submission file based on the grouped, master DataFrame and final threshold value
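The last few steps can be sketched in pandas. This is only an illustration: the inline DataFrames stand in for the 5 saved fold CSVs, and the column layout (an `image_name` column plus one probability column per label) and threshold values are assumptions, not the actual planet data.

```python
import pandas as pd

# Stand-ins for the 5 fold CSVs; in practice each comes from pd.read_csv
fold_dfs = [
    pd.DataFrame({'image_name': ['img_0', 'img_1'],
                  'haze': [0.1 * (k + 1), 0.5],
                  'clear': [0.9, 0.4 + 0.1 * k]})
    for k in range(5)
]
master = pd.concat(fold_dfs, ignore_index=True)    # stack folds into one frame

# Mean probability per image across the 5 folds
mean_probs = master.groupby('image_name').mean()

# Average of the 5 per-fold optimal thresholds found during CV (illustrative)
thresholds = [0.20, 0.22, 0.18, 0.21, 0.19]
final_threshold = sum(thresholds) / len(thresholds)

# Turn probabilities into space-separated label strings, planet-style
tags = mean_probs.apply(
    lambda row: ' '.join(c for c in mean_probs.columns
                         if row[c] > final_threshold),
    axis=1)
submission = tags.rename('tags').reset_index()
# submission.to_csv('submission.csv', index=False)
```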
It seems like the term can be defined in various ways: averaging predictions of the same model across different train/valid splits, averaging the results of different models across different splits, or averaging the weights of the same model across different splits, etc.
I’m kinda confused as to what the best and most useful approaches to ensembling are, and, if they vary by problem, what logic we should use to decide what to do. In particular, I’m more interested in approaches that can be used in a real-world production environment rather than just those that can help win a Kaggle competition.
Woah! That sounds quite methodical. Good luck with the planet comp. I haven’t started that one yet, as I’ve just begun and am working with the dog prediction dataset for now. I really don’t have much to add. It sounds pretty good, and time-consuming, to me :smile:
Thanks for sharing the steps you plan to work through.