Recently I’ve taken Andrew Ng’s courses on Deep Learning via Coursera. While a little outdated (they use TensorFlow 1), they offer unique insight into how you should set up your experiments!
In particular, Andrew emphasizes that your dev set (“dev set” is his name for what we now call a “validation set”) should come from the same distribution as your test set.
He uses the analogy of a target you have been training to hit: then, on test day, you are asked to hit a totally different target, placed elsewhere (admittedly, this analogy has some flaws, but you get the point).
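The failure mode Andrew warns about can be sketched numerically. Below is a toy illustration (all distributions, numbers, and the trivial threshold classifier are made up for the sake of the example): a model tuned against a dev set drawn from the training distribution looks fine on dev, then degrades badly on a shifted test distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a 1-D "feature" separates two classes cleanly in
# the training distribution, but the test distribution is shifted
# (think: clean solo bird clips vs. noisy rainforest soundscapes).
def make_data(n, shift=0.0):
    y = rng.integers(0, 2, n)                 # binary labels
    x = y + rng.normal(0.0, 0.3, n) + shift   # feature, optionally shifted
    return x, y

x_train, y_train = make_data(1000)
x_dev,   y_dev   = make_data(200)             # same distribution as train
x_test,  y_test  = make_data(200, shift=0.8)  # shifted, like the real test set

# A trivial classifier: threshold the feature at the training mean.
threshold = x_train.mean()

def acc(x, y):
    return float(((x > threshold).astype(int) == y).mean())

print(f"dev accuracy:  {acc(x_dev, y_dev):.2f}")
print(f"test accuracy: {acc(x_test, y_test):.2f}")
```

The dev score stays high while the test score collapses, which is exactly the scenario where a good validation result tells you nothing about test-day performance.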
However, twice in a row, Kaggle has hosted competitions where it is impossible to extract a validation (or dev) set from your training data that mimics the test set they use.
In the PANDA competition, this resulted in a ridiculous shake-up where somebody who had just tweaked a public kernel two months before the end of the competition (and did nothing more after that) ended up in third place. In the current Cornell Birdsong Classification competition, they offer a train set consisting only of individual bird recordings.
But the test set consists of “soundscapes” (i.e., recordings of a rainforest), where multiple birds can sing at the same time, not to mention other sounds unrelated to birds.
They make a valid point as to why they did this: “We can’t record and annotate hours of rainforest soundscape every time we deploy our application to a new site”. See here for more explanation: https://www.kaggle.com/c/birdsong-recognition/discussion/159123#890675
Yet, by developing like this, they also take a huge risk: since the dev set and test set are somewhat unrelated, if you get a good score on dev/validation, then move to test and fail, you are stuck with either a failed project or the danger of overfitting to the test set.
To me, this is a mistake. Yet they are also right: in real life, you don’t necessarily have the opportunity to get all the labeled data you want.
What would you do in their situation? Would you spend even more time and resources getting data? Or is Andrew’s recommendation somewhat outdated too, or irrelevant in cases where you can’t get the data you need for a proper split?
Edit: an interesting point from radek here: we could actually treat the public leaderboard as a validation/dev set. Alas, it is often too small. And that still doesn’t answer the question about the real-life setting.