Recently I’ve taken Andrew Ng’s courses on Deep Learning via Coursera. While a little outdated (they use TensorFlow 1), they offer unique insight into how you should set up your experiments!
In particular, Andrew emphasizes that your dev set (“dev set” is his name for what we now call a “validation set”) should come from the same distribution as your test set.
He uses the analogy of a target you have been training to hit: then, on test day, you are asked to hit a totally different target, placed elsewhere (admittedly, this analogy has some flaws, but you get the point).
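The failure mode Andrew warns about can be sketched numerically. Below is a toy illustration (all distributions, numbers, and the trivial threshold classifier are made up for the sake of the example): a model tuned against a dev set drawn from the training distribution looks fine on dev, then degrades badly on a shifted test distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a 1-D "feature" separates two classes cleanly in
# the training distribution, but the test distribution is shifted
# (think: clean solo bird clips vs. noisy rainforest soundscapes).
def make_data(n, shift=0.0):
    y = rng.integers(0, 2, n)                 # binary labels
    x = y + rng.normal(0.0, 0.3, n) + shift   # feature, optionally shifted
    return x, y

x_train, y_train = make_data(1000)
x_dev,   y_dev   = make_data(200)             # same distribution as train
x_test,  y_test  = make_data(200, shift=0.8)  # shifted, like the real test set

# A trivial classifier: threshold the feature at the training mean.
threshold = x_train.mean()

def acc(x, y):
    return float(((x > threshold).astype(int) == y).mean())

print(f"dev accuracy:  {acc(x_dev, y_dev):.2f}")
print(f"test accuracy: {acc(x_test, y_test):.2f}")
```

The dev score stays high while the test score collapses, which is exactly the scenario where a good validation result tells you nothing about test-day performance.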
However, twice in a row, Kaggle has hosted competitions where it is impossible to extract a validation (or dev) set from your training data that mimics the test set they use.
In the PANDA competition, this resulted in a ridiculous shake-up where somebody who had just tweaked a public kernel two months before the end of the competition (and did nothing more after that) ended up in third place. In the current Cornell Birdsong Classification competition, they offer a train set consisting only of individual bird recordings.
But the test set consists of “soundscapes” (i.e., recordings of a rainforest), where multiple birds can sing at the same time, not to mention other sounds unrelated to birds.
They make a valid point as to why they did this: “We can’t record and annotate hours of rainforest soundscape every time we deploy our application to a new site”. See here for more explanation: https://www.kaggle.com/c/birdsong-recognition/discussion/159123#890675
Yet, by developing like this, they also take a huge risk: since the dev set and test set are somewhat unrelated, if you get a good score on dev/validation, then move to test and fail, you are stuck with either a failed project or the danger of overfitting to the test set.
To me, this is a mistake. Yet they are also right: in real life, you don’t necessarily have the opportunity to get all the labeled data you want.
What would you do in their situation? Would you spend even more time and resources getting data? Or is Andrew’s recommendation somewhat outdated too, or irrelevant in cases where you can’t get the data you need for a proper split?
Edit: an interesting point from radek here: we could actually treat the public leaderboard as a validation/dev set. Alas, it is often too small. And that still doesn’t answer the question about the real-life setting.