Noob question about creating validation sets

What are some other ways to select a validation set other than cross validation?

I see that in Kaggle competitions almost everyone uses cross-validation to test their models, but there was a fastai blog post a couple of years ago which stated that cross-validation doesn’t translate to the real world.

Fastai also has some nice defaults for building validation sets, but I wanted to know whether anyone has encountered a scenario where fastai’s default method of validation set creation is not enough, or whether it’s true that in Kaggle competitions cross-validation is simply the better way to test models.


Hello,

The post you have linked touches on the most important points to keep in mind when validating your model, but I’d like to elaborate and put my two cents in:

Careless use of cross-validation can be highly misleading, just as the blog says. In my experience, the two most frequent pitfalls arise when:

I) Your data is temporal. The Rossmann Store Sales competition on Kaggle, for example, falls under this category. Cross-validation may give optimistic results because tomorrow’s data can end up in the training set while you’re predicting today’s sales, which is clearly impossible in a real-life situation.

II) Multiple data points may belong to the same source (i.e., the same ID). For instance, the Melanoma Classification challenge includes people with more than one skin lesion. Here, naive cross-validation would be bad because someone’s moles may end up in both the training and validation sets, a form of leakage you wouldn’t see in practice.

However, note that I said careless use of cross-validation; if you look out for hidden "Gotcha!"s like the ones I’ve mentioned and select your techniques accordingly, cross-validation can be your friend (or at least not your foe). In the case of time-series data, you may go with time-series cross-validation, and when there are IDs, cross-validation should look something like this (assuming you have five IDs and each ID makes up roughly 20% of your dataset; see the sketch right after the fold listing):

Fold 1: Training IDs = 1, 2, 3, 4; Validation ID = 5
Fold 2: Training IDs = 1, 2, 3, 5; Validation ID = 4
…
Fold 5: Training IDs = 2, 3, 4, 5; Validation ID = 1
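
In scikit-learn terms, both patterns are built in. Here’s a minimal sketch under my own assumptions: the toy DataFrame, its column names, and the number of rows are all made up for illustration; `GroupKFold` and `TimeSeriesSplit` are the real scikit-learn splitters:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Hypothetical toy data: 10 rows, 5 patient IDs, assumed time-ordered.
df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "feature":    [0.2, 0.5, 0.1, 0.9, 0.4, 0.3, 0.8, 0.7, 0.6, 0.0],
    "y":          [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X, y = df[["feature"]], df["y"]

# Grouped CV: every row from a given patient stays on one side of the split,
# mirroring the fold scheme above (one held-out ID per fold).
for train_idx, valid_idx in GroupKFold(n_splits=5).split(X, y, groups=df["patient_id"]):
    pass  # fit on X.iloc[train_idx], score on X.iloc[valid_idx]

# Time-series CV: each validation fold comes strictly after its training fold,
# so "tomorrow" never leaks into the training set (rows must be time-ordered).
for train_idx, valid_idx in TimeSeriesSplit(n_splits=3).split(X):
    pass  # earlier rows train, later rows validate
```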

There is, of course, the question of whether you truly need cross-validation at all. Given an abundant amount of data, cross-validation won’t make much difference, and you’d be better off concentrating your compute time & power on other things (Kaggle is an exception: every decimal counts, so even if you’ve got lots of data, cross-validation is vital). Otherwise, cross-validation is an excellent route because it gives you much better insight into the model’s performance than hold-out validation would.

The bottom line is, in the right hands, cross-validation can’t hurt, but in today’s ML climate, where even small startups have access to plenty of data, it’s usually not needed and yields small gains. You should be fine with just one carefully chosen validation set, as long as you keep the two points above in mind.
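
If you do go the single-validation-set route, the same leakage concerns apply to how you carve it out. A minimal sketch, reusing the hypothetical `df` from above (`GroupShuffleSplit` is the real scikit-learn class; the column names are still illustrative):

```python
from sklearn.model_selection import GroupShuffleSplit

# One group-aware hold-out split: ~20% of rows go to validation,
# and no patient_id is ever split across the two sets.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, valid_idx = next(gss.split(df, groups=df["patient_id"]))
train_df, valid_df = df.iloc[train_idx], df.iloc[valid_idx]

# For temporal data, a plain cutoff plays the same role, e.g.:
# train_df, valid_df = df[df["date"] < cutoff], df[df["date"] >= cutoff]
```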

Cheers!


Thanks. It’s a very detailed explanation.
