I understand that you want to predict future values based on past ones. There is a nice tutorial on getting started with time series with a focus on trends and correlation.
To do so, I suggest running a correlation test to check whether past and present values are statistically significantly related — that tells you whether present values can be used to predict future ones at all.
The correlation coefficient of two variables captures how linearly related they are. If there is no (linear) relation between present and future values, how on earth could you make any useful prediction of future values? How do you know there is one?
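A quick way to run that check is the lag correlation (autocorrelation) of the series. Here is a minimal sketch with made-up data — the toy series is my own, not from the example above:

```python
import pandas as pd

# Toy target series with an obvious linear relation between
# consecutive values (each value is the previous one plus 1).
y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Correlation between the series and itself shifted by one step,
# i.e. the lag-1 autocorrelation. Values near +/-1 mean past values
# carry a strong linear signal about the present.
r = y.autocorr(lag=1)
print(r)  # 1.0 for this perfectly linear toy series
```

Values near zero are the warning sign that a lag-based model has nothing to work with.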
A random walk is the classic failure case — note the subtlety: the *level* of a random walk correlates strongly with its previous value, but the step-to-step *changes* are uncorrelated, and the next change is exactly what you would need to predict. Past values carry no usable signal; it is just random noise, and no model in the world can predict that. To run a proper test, add the n-th previous value of the target to the feature set:
n = 1
X["last_value-" + str(n)] = y.shift(n)
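To see the random-walk point concretely, here is a small sketch (the seed and series length are my own choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A random walk: the cumulative sum of i.i.d. steps.
walk = pd.Series(rng.standard_normal(10_000).cumsum())

# The level is trivially correlated with its previous value...
level_r = walk.autocorr(lag=1)

# ...but the steps are not -- and the steps are what a model
# would actually have to predict.
step_r = walk.diff().autocorr(lag=1)

print(round(level_r, 3), round(step_r, 3))
```

So a naive lag correlation on the raw levels can look impressive even when there is nothing to learn; checking the differenced series is the honest test.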
When you take a closer look at the RF Ada Boost model discussed in the tsfresh example, you will notice in the feature importance ranking that “feature_last_value” contributes 88.5% of the accuracy, while all 347 features generated by tsfresh add only a cumulative 1.5% to the prediction.
Three insights:

- You need the correlation coefficient between the past n values and the current value to figure out whether you are barking up the wrong tree.
- To capture seasonality, cycles, and trend in time-series data, you need to categorify the date field. Jeremy made that abundantly clear in the lessons and also stresses it in the data-preparation section of the Rossmann example.
- When adding tsfresh-generated features, expect a 1–2% impact, as demonstrated in the Google example. Many generated features turn out to have low importance, so ideally you want to add correlated external data instead.
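As a sketch of what categorifying the date looks like — plain pandas here, standing in for fastai's date-part preprocessing, with a hypothetical date column:

```python
import pandas as pd

# Hypothetical frame with only a date column, standing in for your data.
df = pd.DataFrame({"date": pd.date_range("2015-01-01", periods=6, freq="D")})

# Expand the date into categorical parts so a tree model can pick up
# seasonality (month, day of week) and calendar effects directly.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["dayofweek"] = df["date"].dt.dayofweek      # Monday=0 ... Sunday=6
df["is_month_start"] = df["date"].dt.is_month_start
```

Each derived column becomes an ordinary feature, which is how the trend and seasonality signal reaches the model.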
Hope that helps.