Lesson4 Rossmann - Notebook running order

I am trying to get the same results as the video for the Rossmann data but am failing.

I understand that the notebook cannot be run sequentially and have tried my best to follow the advice, such as:

First do ‘df = train[columns]’, then run all the cells up to and including ‘joined = join_df(joined, df, ['Store', 'Date'])’.
Then do ‘df = test[columns]’ and run all the cells up to but not including ‘joined = join_df(joined, df, ['Store', 'Date'])’.
Then run ‘joined_test = join_df(joined_test, df, ['Store', 'Date'])’ and continue to run cells.
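
One way to avoid re-running cells by hand is to wrap the repeated steps in a function and call it once for train and once for test. The following is only a minimal sketch on toy data: the body of join_df is an assumed stand-in for the notebook's helper of the same name, and add_features, build and the PromoCount column are hypothetical placeholders for the cells that sit between the two steps.

```python
import pandas as pd

def join_df(left, right, left_on, right_on=None, suffix="_y"):
    """Left-join helper; an assumed stand-in for the notebook's join_df."""
    if right_on is None:
        right_on = left_on
    return left.merge(right, how="left", left_on=left_on, right_on=right_on,
                      suffixes=("", suffix))

def add_features(df):
    """Hypothetical placeholder for the feature cells between the two steps."""
    out = df.copy()
    # Running count of promo days per store, purely for illustration.
    out["PromoCount"] = out.groupby("Store")["Promo"].cumsum()
    return out

def build(source, base, columns):
    """Select columns, add features, then join onto the base frame."""
    df = source[columns]
    df = add_features(df)
    return join_df(base, df, ["Store", "Date"])

columns = ["Date", "Store", "Promo"]

# Toy stand-ins for the real train/test tables.
train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": pd.to_datetime(["2015-07-01", "2015-07-02", "2015-07-01"]),
    "Promo": [1, 0, 1],
})
test = pd.DataFrame({
    "Store": [1, 2],
    "Date": pd.to_datetime(["2015-08-01", "2015-08-01"]),
    "Promo": [0, 1],
})
joined = train[["Store", "Date"]].assign(Sales=[10.0, 12.0, 7.0])
joined_test = test[["Store", "Date"]].copy()

# Same steps, applied once per frame instead of re-running cells.
joined = build(train, joined, columns)
joined_test = build(test, joined_test, columns)
```

Because both frames go through the same function, the train and test features cannot drift apart depending on which cells were re-run.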

For val_idx I ignore ‘val_idx = list(range(train_size, len(df)))’ and ‘val_idx=[0]’ and go with the date-range version, which I take to be the latest two weeks of data: ‘val_idx = np.flatnonzero((df.index<=datetime.datetime(2014,9,17)) & (df.index>=datetime.datetime(2014,8,1)))’
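
To see what that expression selects, here is a small sketch on a toy frame with one row per day (the real data has one row per store per day, so the counts will differ). The frame, the Sales values and the names start, end and n_days are made up for illustration. Note that the range as written runs from 1 August to 17 September 2014, which is 48 days, closer to seven weeks than two.

```python
import datetime
import numpy as np
import pandas as pd

# Toy frame indexed by date; an assumption standing in for the notebook's df.
dates = pd.date_range("2014-07-01", "2014-10-31", freq="D")
df = pd.DataFrame({"Sales": np.arange(len(dates), dtype=float)}, index=dates)

start = datetime.datetime(2014, 8, 1)
end = datetime.datetime(2014, 9, 17)

# Integer positions of the rows whose date falls inside [start, end].
val_idx = np.flatnonzero((df.index <= end) & (df.index >= start))

# Length of the window in days, both endpoints included.
n_days = (end - start).days + 1

print(len(val_idx), n_days)
print(df.index[val_idx[0]].date(), df.index[val_idx[-1]].date())
```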

In the DL section I am lost. Why are there SAMPLE, ALL and TEST sections when the learner code is exactly the same? I would have thought there is only a need to run the ALL section, which would train on the full training data and use two weeks of it as the validation set (e.g. val_idx), and then, after fitting, run the prediction against the test data set.
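
The workflow I have in mind can be sketched as below. This is not the notebook's learner: a plain least-squares fit in NumPy stands in for the model, and the data, w_true and the position of val_idx are all made up, just to show the order of operations (fit on everything outside val_idx, check on val_idx, then predict on the test rows).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training data: 100 rows, 3 features, noise-free linear target.
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

# Made-up test features (no targets, as in the competition test set).
X_test = rng.normal(size=(10, 3))

# Hold out a contiguous block of rows as the validation set.
val_idx = np.arange(60, 80)
train_mask = np.ones(len(X), dtype=bool)
train_mask[val_idx] = False

# 1. Fit on the training rows only (stand-in for fitting the learner).
w, *_ = np.linalg.lstsq(X[train_mask], y[train_mask], rcond=None)

# 2. Check the fit on the held-out validation rows.
val_rmse = float(np.sqrt(np.mean((X[val_idx] @ w - y[val_idx]) ** 2)))

# 3. Predict on the test rows with the fitted model.
test_pred = X_test @ w
```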

I did that and got a private result of 0.11925, which puts me around 450th, not 5th like Jeremy.

What am I missing?

Any advice would be deeply appreciated.

I also share your confusion. Thanks for clarifying how to handle the part of the notebook where we had to repeat cells.

Also, the last two weeks of my data are later than Sep. 2014 (my dataset/notebook goes until Aug 2015), so I am also confused about the dates for val_idx.

Hi,

The Rossmann data goes up to August 2015 and we are asked to estimate the results for September 2015.

I believe the reason the validation set is selected with the line below

  • val_idx = np.flatnonzero((df.index<=datetime.datetime(2014,9,17)) & (df.index>=datetime.datetime(2014,8,1)))

is that the validation period (e.g. September 2014) will be similar to September 2015 (same holidays and other influences), which should make sure that the model works best for September months.

Hope that helps

It does a little. Is there a concern with using a two-week period when you have data both before and after it? Some folks would be concerned about leakage there.

All of the other data is still used in the training set, so a pattern is found there and then validated against the two weeks' worth of September 2014 data. I believe this is done so that the model is tuned to be better at predicting September months. Note that the model is a regression one (it predicts sales), not a classification one.

I must point out that this is my own understanding, and I have only been studying this topic for a short period of time.
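
To make the leakage question concrete, here is a rough sketch comparing the two kinds of split on a made-up daily calendar (the start and end dates of the calendar are assumptions, not the real data). It counts how many training rows are dated after the end of the validation window: with the 2014 window there are many, while holding out the most recent rows leaves none.

```python
import numpy as np
import pandas as pd

# Made-up calendar: one row per day.
dates = pd.date_range("2014-01-01", "2015-07-31", freq="D")

# Split 1: validation window in the middle of the data (as with val_idx).
in_window = (dates >= pd.Timestamp("2014-08-01")) & (dates <= pd.Timestamp("2014-09-17"))
val_idx = np.flatnonzero(in_window)
trn_idx = np.flatnonzero(~in_window)

# Training rows that come after the validation window ends.
later_train = int((dates[trn_idx] > dates[val_idx].max()).sum())

# Split 2: hold out the most recent rows instead (same number of days).
n_val = len(val_idx)
holdout_idx = np.arange(len(dates) - n_val, len(dates))
holdout_trn_idx = np.arange(0, len(dates) - n_val)
later_train_holdout = int((dates[holdout_trn_idx] > dates[holdout_idx].max()).sum())

print(later_train, later_train_holdout)
```

Whether the first split actually hurts depends on the features used; the count only shows that the model does see data from after the validation period.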
