Validation set size

Lesson 1 says the validation set should be about the same size as the test set on Kaggle. That is easy to apply to Bluebook for Bulldozers, but in the House Prices: Advanced Regression Techniques competition, for example, the test set is the same size as the training set.

So, how do you determine which percentage of the training set to separate into a validation set?

Not sure about the special case of Kaggle, but in general I reserve 20%. Pay careful attention to how you grab that 20%. If it's not a time series, it's usually no biggie, but the data might have been sorted or grouped in some special way, so grab the 20% at random rather than taking a contiguous slice. For a time series, you usually want to sort by date and then grab the last 20%. You can also down-weight older observations so that recent data counts for more in training.
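
Here is a rough sketch of both ways to grab the 20%, in plain Python so it runs without any libraries. The function names (`random_split`, `time_split`, `recency_weights`), the toy `rows` data, and the `decay=0.99` value are all made up for illustration. The exponential weighting is just one possible reading of "weight decay on older observations", not something from the lesson.

```python
import random

def random_split(rows, valid_frac=0.2, seed=42):
    """Non-time-series case: shuffle a copy, then hold out valid_frac of it."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # guards against sorted/grouped input
    cut = len(shuffled) - int(len(shuffled) * valid_frac)
    return shuffled[:cut], shuffled[cut:]

def time_split(rows, date_key, valid_frac=0.2):
    """Time-series case: sort by date, then hold out the most recent valid_frac."""
    ordered = sorted(rows, key=date_key)
    cut = len(ordered) - int(len(ordered) * valid_frac)
    return ordered[:cut], ordered[cut:]

def recency_weights(n_train, decay=0.99):
    """Assumed weighting scheme: newest row gets 1.0, each older row is
    multiplied by `decay` once more. Expects rows ordered oldest first."""
    return [decay ** (n_train - 1 - i) for i in range(n_train)]

# Toy data: ten daily observations (hypothetical).
rows = [{"day": d, "price": 100 + d} for d in range(10)]

train, valid = time_split(rows, date_key=lambda r: r["day"])
weights = recency_weights(len(train))

rand_train, rand_valid = random_split(rows)
```

The weights could then be passed to whatever sample-weight option your model exposes, if it has one. For the random split, fixing the seed keeps the validation set the same between runs, so scores stay comparable.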
