Lesson 6 - Managing tabular data and preparing split for Time Series Analysis

juancarloscg · June 19, 2020, 3:02pm

Hi everyone! I am new to this, so I would be very grateful if anyone could help me with my question. It is about the Rossmann notebook. The full notebook on Colab can be found here: https://drive.google.com/file/d/1xDhquDg03X6Z314VpNsmodq-23Gzp0SR/view?usp=sharing

I do not understand the following line of code:
cut = train_df['Date'][(train_df['Date'] == train_df['Date'][len(test_df)])].index.max()

My other question is about how to split. When dealing with time series, we should take the last part of the dataset, without shuffling, as the validation set, shouldn’t we?

Many thanks in advance.

forenpower · June 20, 2020, 10:25pm

Hey Juan Carlos,

Just a noob here, but here’s what I could find.

??cut says “Object cut not found.”
But this might be related. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

For the split, my suggestion is to do a chronological split for time series, because that’s how the Rossmann competition did it.

In addition, from personal experience, price data can change in unpredictable ways, such as 10-year US bond rates in 1979 reaching levels never seen before. So my investment models that only use data from before 1970 take large drawdowns around that time. However, if I can use data from after that, the hindsight makes the situation easy to navigate.
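To make the chronological split concrete, here is a minimal sketch in pandas. The toy data, the `chronological_split` helper, and the `valid_frac` parameter are all hypothetical names made up for illustration; they are not taken from the Rossmann notebook. The idea is simply to sort by date and hold out the most recent rows, with no shuffling.

```python
import pandas as pd

# Hypothetical toy data: one row per day.
df = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=10, freq="D"),
    "Sales": [5, 7, 6, 8, 9, 7, 10, 11, 9, 12],
})

def chronological_split(frame, date_col, valid_frac=0.2):
    """Hold out the most recent rows as validation, without shuffling."""
    ordered = frame.sort_values(date_col).reset_index(drop=True)
    n_valid = max(1, int(len(ordered) * valid_frac))
    train = ordered.iloc[:-n_valid]   # everything except the last n_valid rows
    valid = ordered.iloc[-n_valid:]   # the last n_valid rows (most recent)
    return train, valid

train, valid = chronological_split(df, "Date", valid_frac=0.2)

# Every validation date is later than every training date,
# so the model is never validated on the "past".
assert train["Date"].max() < valid["Date"].min()
```

A shuffled split would mix future rows into the training set, which tends to make validation scores look better than what the model would achieve on genuinely unseen future data.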

muellerzr · June 20, 2020, 10:29pm

So in this particular case, we want the last n samples from our training set to be our validation set, where n is equal to the length of the test set, to mirror the time-series aspect of the whole thing. So our cut is equal to this difference.
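For anyone who wants to unpack the line step by step, here is a sketch on toy data. Note that `cut` is just a variable name here, not `pandas.cut`. The sketch assumes that `train_df` is sorted newest first and has a default 0..n-1 index, so the most recent rows come first; the toy frames and the intermediate names `boundary_date` and `same_day` are made up for illustration. The final `range(cut)` step is how I understand the result to be used, so treat that as an assumption too.

```python
import pandas as pd

# Toy stand-ins: 3 stores per day, newest dates first, default integer index.
dates = pd.to_datetime(
    ["2015-07-31"] * 3 + ["2015-07-30"] * 3
    + ["2015-07-29"] * 3 + ["2015-07-28"] * 3
)
train_df = pd.DataFrame({"Date": dates, "Store": [1, 2, 3] * 4})
test_df = pd.DataFrame({"Store": [1, 2, 3, 1]})  # toy test set with 4 rows

# Step 1: the date at index label len(test_df), i.e. about n rows in.
boundary_date = train_df["Date"][len(test_df)]

# Step 2: boolean mask of every row that falls on that same date.
same_day = train_df["Date"] == boundary_date

# Step 3: the largest index label among those rows.
cut = train_df["Date"][same_day].index.max()

# Assumed usage: the rows before `cut` (the most recent ones)
# become the validation set.
valid_idx = range(cut)
print(boundary_date.date(), cut, list(valid_idx))
```

So rather than cutting at exactly `len(test_df)` rows, the line moves the cut to the last row that shares the date found at that position, which gives a validation set of roughly the same size as the test set.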

juancarloscg · June 22, 2020, 10:38am

Thanks for your help @forenpower!

juancarloscg · June 22, 2020, 10:39am

Thanks for your help @muellerzr. It is important for people like me, who are just starting out, to receive feedback from experts. I really appreciate it.

forenpower · June 22, 2020, 9:21pm

Speaking of which, I just watched this by @muellerzr and it reminded me of you @juancarloscg. Maybe you would like it.

At about 35 minutes on the topic of forecasting.

juancarloscg · June 23, 2020, 9:52am

Thanks! It is very useful. Yes, I want to focus on forecasting. I need it for my work. Great contribution @muellerzr.