I do not understand the following line of code:
cut = train_df['Date'][(train_df['Date'] == train_df['Date'][len(test_df)])].index.max()
The other issue is how to split. When dealing with time series, we should take the last part of the dataset, without shuffling, as the validation set, shouldn't we?
For the split, my suggestion is to do a chronological split for time series, because that's how the Rossmann competition did it.
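A minimal sketch of what a chronological split could look like in pandas. The toy frame, the `Sales` column, and `n_valid` are made up for illustration and are not from the Rossmann notebook:

```python
import pandas as pd

# Toy daily series; column names and sizes are hypothetical.
df = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=10, freq="D"),
    "Sales": range(10),
})

n_valid = 3  # hypothetical validation size

# Sort oldest to newest, then slice by position; nothing is shuffled.
df = df.sort_values("Date").reset_index(drop=True)
train_part = df.iloc[:-n_valid]   # everything except the most recent rows
valid_part = df.iloc[-n_valid:]   # the most recent n_valid rows

# Every validation date comes after every training date.
print(train_part["Date"].max() < valid_part["Date"].min())
```

The point of slicing by position after sorting is that the model is never validated on dates that are older than the ones it was trained on.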
In addition, from personal experience, price data can change in unpredictable ways, such as 10-year US bond rates in 1979 reaching levels never seen before. So my investment models built on data from before 1970 take large drawdowns around that time. However, if I can use data from after that period, hindsight makes the situation easy to navigate.
So in this particular case, we want the last n samples from our training set to be our validation set, where n is equal to the length of the test set, to mirror the time-series aspect of the whole thing. So our cut is equal to this difference.
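To unpack the line in question, here is a rough step-by-step version on toy data. It assumes `train_df` is ordered newest first with a default 0..n-1 index; that ordering is my assumption, not something stated above, and the frames below are invented for illustration:

```python
import pandas as pd

# Toy training data: two rows per date (e.g. two stores), newest date first.
dates = pd.to_datetime(
    ["2015-01-05", "2015-01-04", "2015-01-03", "2015-01-02", "2015-01-01"]
)
train_df = pd.DataFrame({"Date": dates.repeat(2), "Store": [1, 2] * 5})

# Toy test set with 3 rows; only its length matters here.
test_df = pd.DataFrame({"Date": pd.date_range("2015-01-06", periods=3, freq="D")})

# Step 1: the date found at index label len(test_df) in the training set.
boundary_date = train_df["Date"][len(test_df)]

# Step 2: keep only the rows of the Date column that share that date.
same_day = train_df["Date"][train_df["Date"] == boundary_date]

# Step 3: the largest index label among those rows is the cut.
cut = same_day.index.max()

print(boundary_date.date(), list(same_day.index), cut)
```

So the one-liner looks up the date sitting roughly `len(test_df)` rows in, and then moves the cut to the last row that has that same date. Under the newest-first assumption, the rows before `cut` are the most recent ones, which is what "the last n samples" means in time.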