Proposed: past/future train/val split may not be helpful

dataatad · April 14, 2020, 8:01pm

Lesson 6: “This is really common. If you want to predict things, there’s no point predicting things that are in the middle of your training set. You want to predict things in the future.”

Proposed: a past/future train/val split is not actually helpful (for tabular data)

Reasoning:

If the process that generated the training data is no longer in place in the future period the validation data comes from, then generalization to that validation set is already limited, perhaps severely. You can’t tell whether poor generalization is due to overfitting or to a shift in the underlying process.

But if the same process generates both the train and val splits, then it shouldn’t matter whether the val split comes from the future (assuming the past and future splits each cover any seasonality effects present in the signal/process being modeled).

I suppose a past/future split could still be helpful if the underlying process has changed, in that poor generalization would be a clue, but you won’t know whether it is due to overfitting or process shift.
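
To make the two cases concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the helper names (`make_data`, `fit_slope`, `mse`, `compare_splits`), the one-feature linear process, and the abrupt slope change standing in for "process shift". It fits the same model under a random split and a past/future split and returns both validation errors. The model has a single parameter, so it has almost no room to overfit; any gap between the two errors here comes from the shift alone, which a real tabular model would not let you assume.

```python
import random


def make_data(n, shift_at=None, seed=0):
    """Synthetic rows (t, x, y) in time order.

    y = slope * x + noise. If shift_at is given, the slope changes from
    2.0 to -1.0 at that time step (a stand-in for a process change).
    """
    rng = random.Random(seed)
    rows = []
    for t in range(n):
        x = rng.uniform(-1, 1)
        slope = 2.0 if shift_at is None or t < shift_at else -1.0
        rows.append((t, x, slope * x + rng.gauss(0, 0.1)))
    return rows


def fit_slope(rows):
    """Least-squares slope for a line through the origin."""
    sxx = sum(x * x for _, x, _ in rows)
    sxy = sum(x * y for _, x, y in rows)
    return sxy / sxx


def mse(rows, slope):
    """Mean squared error of the fitted slope on the given rows."""
    return sum((y - slope * x) ** 2 for _, x, y in rows) / len(rows)


def compare_splits(rows, val_frac=0.2, seed=0):
    """Return (random_split_val_mse, future_split_val_mse)."""
    n_val = int(len(rows) * val_frac)

    # Past/future split: validate on the most recent rows.
    past, future = rows[:-n_val], rows[-n_val:]
    future_mse = mse(future, fit_slope(past))

    # Random split: validate on rows drawn from the whole time range.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    r_train, r_val = shuffled[:-n_val], shuffled[-n_val:]
    random_mse = mse(r_val, fit_slope(r_train))

    return random_mse, future_mse


if __name__ == "__main__":
    print("same process :", compare_splits(make_data(1000)))
    print("process shift:", compare_splits(make_data(1000, shift_at=800)))
```

With the same process throughout, the two errors should come out close to each other. With the shift placed at the start of the future window, the past/future error should be much larger than the random-split error. That gap is the "clue" mentioned above, but the past/future number on its own would not say where it came from.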

Thoughts?