Recipe for tackling Kaggle challenges

radek · March 9, 2018, 1:24pm

I mostly learn from my mistakes, and here it is no different. I joined the toxic competition a little over a week ago. Joining a competition that late is a dubious idea in itself: it is great for learning, but it can be stressful if you cannot distance yourself from how poorly you are doing!

Based on my current understanding, here is a generic recipe for approaching a Kaggle competition:

1. Make a first submission (published kernels are a great starting point!).
2. Establish an easy way of training on a small sample.
3. Perform a full error analysis on the model from #1 (very important: this is the step I skipped this time, which left me flying blind!).
4. Build a CV training pipeline, train the original model on each fold, average the test predictions across folds, and submit.
5. Put in place a single stacking pipeline, building on the first model only (I am only learning about this right now, so maybe I am wrong on this one as I have not tested it, but it seems about right).
6. Throughout the process, keep reading relevant papers and blog posts, and learn from kernels and forum discussions <- this + error analysis is what should govern which changes you make to your original model / which architecture you train next.
7. Throughout the process, keep a Google Keep note with checkboxes for organizing ideas on what to work on next <- super valuable! This also ties in with something I have been thinking about for quite a while: “if you sit down in front of your computer with only a vague idea of what you should work on, you will waste time! You should have a plan before you sit down and stick to it for the duration of the session”
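
To make the small-sample and CV-averaging steps above a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `subsample`, `kfold_indices`, `cv_train_and_average` and the toy `MeanModel` are hypothetical names, and `MeanModel` only stands in for whatever real model you train. The idea is to train one model per fold, score it on the held-out fold, and average the test predictions of all fold models before submitting.

```python
import random
from statistics import mean


def subsample(rows, frac, seed=0):
    """Return a reproducible random subset of rows for quick experiments."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * frac))
    return rng.sample(rows, k)


def kfold_indices(n, n_folds, seed=0):
    """Shuffle indices 0..n-1 and split them into n_folds disjoint validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


class MeanModel:
    """Toy stand-in for a real model: always predicts the mean training label."""

    def fit(self, X, y):
        self.value = mean(y)
        return self

    def predict(self, X):
        return [self.value for _ in X]


def cv_train_and_average(X, y, X_test, n_folds=5, make_model=MeanModel):
    """Train one model per fold.

    Returns the per-fold validation MSE and the test predictions
    averaged over all fold models.
    """
    folds = kfold_indices(len(X), n_folds)
    test_avg = [0.0] * len(X_test)
    val_scores = []
    for val_idx in folds:
        held_out = set(val_idx)
        trn_idx = [i for i in range(len(X)) if i not in held_out]
        model = make_model().fit([X[i] for i in trn_idx], [y[i] for i in trn_idx])
        # Score on the fold the model has not seen.
        val_pred = model.predict([X[i] for i in val_idx])
        val_scores.append(mean((p - y[i]) ** 2 for p, i in zip(val_pred, val_idx)))
        # Each fold model contributes 1/n_folds of the final test prediction.
        for j, p in enumerate(model.predict(X_test)):
            test_avg[j] += p / n_folds
    return val_scores, test_avg
```

While iterating, something like `cv_train_and_average(subsample(X, 0.1), ...)` (with the labels subsampled the same way) keeps each experiment short; the full data is only needed for the run you actually submit.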

If you have any comments or suggestions, do chime in please!

SHAR1 · March 9, 2018, 5:43pm

It would be great if you could sum all this up in a post, taking a Kaggle challenge as an example. I think it would be worth your time.

Recipe 5. I am also surfing here.

Thanks for sharing your insights. Very helpful.

machinethink · March 9, 2018, 5:49pm

I like to implement the metric as soon as possible and compare the score it gives on a very basic prediction (from a published kernel, for example) with the leaderboard score. Being able to compute a local score that is close to the LB score makes it a lot easier to try out different ideas.
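
As a rough illustration of implementing the metric locally, here is a sketch that assumes the competition scores submissions by mean column-wise ROC AUC (worth double-checking against the competition's evaluation page). The function names `roc_auc` and `mean_columnwise_auc` are hypothetical, and the AUC is computed with the rank-sum formulation in plain Python rather than with any particular library.

```python
def roc_auc(y_true, y_score):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation.

    Tied scores receive the average rank of their group.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        # Find the end of the group of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_rank_sum = sum(r for r, t in zip(ranks, y_true) if t == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def mean_columnwise_auc(Y_true, Y_score):
    """Average the per-label AUC over all label columns (rows are samples)."""
    n_cols = len(Y_true[0])
    aucs = [
        roc_auc([row[c] for row in Y_true], [row[c] for row in Y_score])
        for c in range(n_cols)
    ]
    return sum(aucs) / n_cols
```

Running this on held-out predictions and comparing the number with the LB score of the same submission gives a quick check that local validation and the leaderboard are measuring the same thing.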

I also like to have all the steps in the pipeline working first, even if the predictions are not very good yet. That way you no longer have to worry about how to make the submission, etc., and you can just focus on the actual model.

kcturgutlu · March 9, 2018, 6:47pm

For stacking, see http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/. It is pretty straightforward to implement after reading this blog post. Before reading it I was too afraid of leaking data, but now it works.
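
For anyone worried about the same leakage issue, here is a minimal sketch of the usual out-of-fold approach to stacking, written in plain Python. It is not the code from the blog post: `out_of_fold_features` and the toy `ThresholdModel` are hypothetical names used only for illustration. The point is that each training row's meta-feature comes from a base model that never saw that row, while the test meta-features are averaged over the fold models.

```python
import random


class ThresholdModel:
    """Toy base model (hypothetical): predicts the mean training label
    on each side of a median cut on a single feature."""

    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        vals = sorted(row[self.feature] for row in X)
        self.cut = vals[len(vals) // 2]
        lo = [t for row, t in zip(X, y) if row[self.feature] < self.cut]
        hi = [t for row, t in zip(X, y) if row[self.feature] >= self.cut]
        overall = sum(y) / len(y)
        self.lo = sum(lo) / len(lo) if lo else overall
        self.hi = sum(hi) / len(hi) if hi else overall
        return self

    def predict(self, X):
        return [self.hi if row[self.feature] >= self.cut else self.lo for row in X]


def out_of_fold_features(X, y, X_test, model_factories, n_folds=5, seed=0):
    """Build meta-features for a second-level model.

    meta_train[i][m] is base model m's prediction for training row i,
    made by a copy of the model that was NOT trained on row i.
    meta_test[j][m] is the average prediction of the fold models on test row j.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    meta_train = [[0.0] * len(model_factories) for _ in X]
    meta_test = [[0.0] * len(model_factories) for _ in X_test]
    for m, make_model in enumerate(model_factories):
        for val_idx in folds:
            held_out = set(val_idx)
            trn_idx = [i for i in idx if i not in held_out]
            model = make_model().fit([X[i] for i in trn_idx], [y[i] for i in trn_idx])
            # Out-of-fold predictions: only rows the model has not seen.
            for i, p in zip(val_idx, model.predict([X[i] for i in val_idx])):
                meta_train[i][m] = p
            for j, p in enumerate(model.predict(X_test)):
                meta_test[j][m] += p / n_folds
    return meta_train, meta_test
```

The second-level model is then fit on `meta_train` with the original labels and applied to `meta_test`. A call might look like `out_of_fold_features(X, y, X_test, [lambda: ThresholdModel(0), lambda: ThresholdModel(1)])`.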

abhilashsk · March 16, 2018, 5:54am

Hi, do you guys use fastai for the toxic comment challenge? If so, let me know and we can discuss solving it jointly.