Set valid set

lizhuofeng · June 3, 2022, 2:06pm

When I’m setting up a validation set, why do I have to go to the last segment of data instead of randomly drawing from the dataset, is it because, the last data can be better predictive?

Conwyn · June 3, 2022, 2:13pm

Hi Lizhuofeng
For retail time series you want to predict tomorrow so you build a model on yesterday, validate on today and then use the model to predict tomorrow.
Regards Conwyn

iamgianluca · June 3, 2022, 3:25pm

Let’s think about how our model will be asked to behave in production/inference time. We will be asking it to predict the future, so data that hasn’t occurred yet.

Now, if we train the model using data randomly sampled from the time series, and ask the model to predict some random points within the series, we are putting the model to
a significant advantage ― an advantage that it won’t have in production/inference time.

Let me give you an example…

time	0	1	2	3	4	5	6	7	8	9	10
target	0.1	0.12	0.9	0.95	1	1.1	1.05	1.5	1.75	2	2.4

In this series, something changed at t=7. Ask yourself, without knowing the value of the target after t=6, what would you think the value of t=7, or t=8 would be? We might safely assume that these values are centered around 1.

Now, assume you have every data point aside t=9, and are asked to predict its value based on every other data point in the series. You will likely predict a value between 1.75 and 2.4. Similarly, if we were asked to predict the value at t=3, we would probably think at a number close to 1.

This is the advantage I’m talking about. In real life, your model won’t get any hint about what the future looks like, and therefore will have a tougher job at predicting the future.

If we just randomly sample data points from the time-series to build our validation set, we might overestimate the accuracy of our model compared to how it will perform at inference time.