Hi everyone! I am new to this, so I would be very grateful if anyone could help me with my question. It is about the Rossmann notebook. The full notebook on Colab can be found here: https://drive.google.com/file/d/1xDhquDg03X6Z314VpNsmodq-23Gzp0SR/view?usp=sharing
I do not understand the following line of code:
cut = train_df['Date'][(train_df['Date'] == train_df['Date'][len(test_df)])].index.max()
My other question is about how to split. When dealing with time series, we should take the last part of the dataset, without shuffling, as the validation set, shouldn’t we?
Many thanks in advance.
Hey Juan Carlos,
Just a noob here, but here’s what I could find.
??cut says “Object cut not found.”
But this might be related. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
For the split, my suggestion is to do a chronological split for time series, because that’s how the Rossmann competition did it.
In addition, from personal experience, price data can change in unpredictable ways, such as 10-year US bond rates in 1979 reaching levels never seen before. So my investment models before 1970 take large drawdowns around that time. However, if I can use data after that, then the hindsight makes the situation easy to navigate.
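To make the chronological split concrete, here is a minimal sketch with pandas. The data, the column names and n_valid are all made up for illustration; this is not the notebook's code, just the general idea of holding out the most recent rows without shuffling.

```python
import pandas as pd

# Hypothetical daily data; "Date" and "Sales" are illustrative column names.
df = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=10, freq="D"),
    "Sales": range(10),
})

n_valid = 3  # assumed size of the hold-out period

# Sort oldest-first, then hold out the final n_valid rows without shuffling.
df = df.sort_values("Date").reset_index(drop=True)
train_part = df.iloc[:-n_valid]
valid_part = df.iloc[-n_valid:]
```

Every validation row is later than every training row, so the model is never scored on a period it has already seen.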
So in this particular case, we want the last n samples from our training set to be our validation set, where n is equal to the length of the test set, to mirror the time-series aspect of the whole thing. So our cut is the index where that split falls.
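To unpack the line from the question, here is a rough step-by-step version on toy data. The frames below are made up, and I am assuming train_df is sorted newest-first with a default 0..N-1 index, which is how I remember the notebook's data being laid out.

```python
import pandas as pd

# Toy stand-ins for the notebook's train_df and test_df (hypothetical data).
# Assumption: newest rows first, default integer index, two stores per day.
dates = pd.date_range("2015-01-01", periods=5, freq="D")[::-1]
train_df = pd.DataFrame({
    "Date": dates.repeat(2),
    "Store": [1, 2] * 5,
})
test_df = pd.DataFrame({"Store": [1, 2]})  # only its length matters here

# Step 1: the date of the row whose index label is len(test_df).
boundary_date = train_df["Date"][len(test_df)]

# Step 2: every row that shares that date.
same_day = train_df["Date"][train_df["Date"] == boundary_date]

# Step 3: the largest index label among those rows is the cut.
cut = same_day.index.max()

# As far as I recall, the notebook then uses range(cut) as validation indices.
valid_idx = range(cut)
```

Here len(test_df) is 2, but cut comes out as 3, because the cut is pushed to the last row of the boundary day instead of landing in the middle of it.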
Thanks for your help @forenpower!
Thanks for your help @muellerzr. It is important for people like me, who are just starting out, to receive feedback from experts. I really appreciate it.
Speaking of which, I just watched this by @muellerzr and it reminded me of you @juancarloscg. Maybe you would like it.
At about 35 minutes on the topic of forecasting.
Thanks! It is very useful. Yes, I want to focus on forecasting. I need it for my work. Great contribution @muellerzr.