Lesson 6 Rossmann Val Set Creation Question

whamp · November 28, 2018, 4:50pm

This is more pandas-related than deep learning, but I can’t figure out what this line is doing in the Rossmann notebook from Lesson 6:

cut = train_df['Date'][(train_df['Date'] == train_df['Date'][len(test_df)])].index.max()

I get that it first takes the Date column from train_df as a Series. A boolean mask is then applied, and that mask is built from an integer lookup using the length of test_df. I assume len(test_df) is there to make the validation set the same length as test_df, but I’ve never seen this syntax before.

whamp · November 28, 2018, 5:30pm

So, breaking it down a little further: the first term creates a Series of dates, and the boolean mask checks whether each date equals Timestamp(‘2015-06-19 00:00:00’), because train_df['Date'][len(test_df)] returns that value. The mask is True for 1,114 rows of the train_df['Date'] series. I’m not sure I follow why that date is chosen, or why the maximum index of those 1,114 rows is used as the cut value.
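For anyone following along, here is a minimal sketch that splits the one-liner into named steps. The frames are tiny hypothetical stand-ins (3 stores per day, 7 test rows), not the real Rossmann data, and the intermediate names (date_col, pivot_date, mask, matching) are made up for illustration. It assumes train_df is sorted by Date descending with a default 0..n-1 index.

```python
import pandas as pd

# Hypothetical toy data: 6 days, 3 stores per day, newest date first.
dates = pd.date_range("2015-07-01", "2015-07-06")[::-1]
train_df = pd.DataFrame({
    "Date": dates.repeat(3),
    "Store": [1, 2, 3] * len(dates),
})
test_df = pd.DataFrame({"Store": range(7)})  # 7 rows, stand-in for the test set

# Step 1: the Date column as a Series.
date_col = train_df["Date"]

# Step 2: look up the date stored at label len(test_df).
# With a default integer index this is also the row at that position.
pivot_date = date_col[len(test_df)]

# Step 3: boolean mask, True wherever the date equals pivot_date.
mask = date_col == pivot_date

# Step 4: keep only the matching rows, then take the largest index label.
matching = date_col[mask]
cut = matching.index.max()

# The original one-liner gives the same result.
one_liner = train_df['Date'][(train_df['Date'] == train_df['Date'][len(test_df)])].index.max()

print(pivot_date, int(mask.sum()), cut)
```

In this toy frame the pivot date covers rows 6 to 8, so cut is the last row carrying that date rather than row 7 itself.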

jeremy · November 29, 2018, 3:11pm

Have a look at the dates in the test set - and recall that we want our validation set to cover the same time period that the test set covers. Does that give you a clue?

whamp · November 30, 2018, 5:01pm

So my logic is that since the Test set ranges from:
(Timestamp(‘2015-08-01 00:00:00’), Timestamp(‘2015-09-17 00:00:00’))
We would want the val set to reflect a similar time period. In this case the test set is about a month and a half long, beginning at the start of a month and ending just after mid-month (therefore possibly including a 15th-of-the-month payday spike).

So I see a couple of options for the validation set.

1. You could match the specific days of the month reflected in the test set, so in this case 2015-06-01 to 2015-07-17. The problem with this is that you throw away the data from 2015-07-17 until 2015-08-01.
2. You could just match the length of time included in the test set and count back from the beginning of the test set. In this case the test set lasts 48 days, so counting back 48 days from the beginning of the test set gives 2015-06-14.
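The date arithmetic behind these two options can be checked with a short sketch. The variable names are hypothetical, and the only inputs are the test-set start and end dates quoted above.

```python
import pandas as pd

test_start = pd.Timestamp("2015-08-01")
test_end = pd.Timestamp("2015-09-17")

# Matching days of the month: shift the test window back two calendar months.
option1_start = test_start - pd.DateOffset(months=2)
option1_end = test_end - pd.DateOffset(months=2)

# Matching length: count the test days inclusively, then step back that many days.
n_days = (test_end - test_start).days + 1
option2_start = test_start - pd.Timedelta(days=n_days)

print(option1_start.date(), option1_end.date())
print(n_days, option2_start.date())
```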

Unfortunately, I remain confused about both the selection of 2015-06-19 and the syntax being used to create the indexes.

After considering my notes above, it seems like instead of matching days of the month or number of days in the test set, a third option is being used that instead matches the number of observations in the test set to the number of observations in the val set.

I guess the thing that confuses me is that len(test_df) = 41,088. So, assuming the DataFrame is sorted in descending order of date, the line:

train_df['Date'][len(test_df)] returns: Timestamp(‘2015-06-19 00:00:00’)

is saying: since test_df has 41,088 observations, count 41,088 observations in from the most recent end of train_df and give me whatever date is there. That date is then used for the boolean mask on the original train_df['Date'] series. Therefore, this segment:

train_df['Date'][(train_df['Date'] == train_df['Date'][len(test_df)])]

Ultimately just gives a series of dates from train_df that equal 2015-06-19.

Finally, .index.max() takes the index labels of that filtered series and returns the largest one, which I assume is the last row in train_df before the date switches to 2015-06-18. That is index 41,395.

This results in 41,396 observations in the validation set compared with 41,088 observations in the test set.

So at this point I think I understand what the code is doing, but I’m still a little confused about the logic. It may just be a quick and dirty way to get a validation set and I’m overthinking it. But the goal appeared to be to match the number of observations in the validation set to the number in the test set, and that isn’t quite what ended up happening, although it was obviously very close. Am I missing something, or am I on the right track at this point?

This weekend I’ll create validation sets that match the two approaches I thought of above to see if I can determine any material difference between the 3 approaches.

jeremy · November 30, 2018, 11:03pm

Yup, you’re basically right. But I didn’t want to split any date to be partially in validation and partially in training. So we needed to pick a split point that would fully split each date into train vs valid.
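A minimal sketch of the point about not splitting a date, on hypothetical toy data. It assumes the validation set is rows 0 through cut inclusive, which is how the 41,396 count above comes out, and it compares that against a naive split at exactly the test-set length. The names n_test, naive_valid and naive_train are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame: 6 days, 3 rows per day, newest date first.
dates = pd.date_range("2015-07-01", "2015-07-06")[::-1]
train_df = pd.DataFrame({"Date": dates.repeat(3)})
n_test = 7  # stand-in for len(test_df)

# Split point used in the thread: last row sharing the date found at row n_test.
pivot_date = train_df["Date"][n_test]
cut = train_df[train_df["Date"] == pivot_date].index.max()

valid = train_df.iloc[: cut + 1]
train = train_df.iloc[cut + 1 :]
shared_dates = set(valid["Date"]) & set(train["Date"])

# Naive split at exactly n_test rows, which cuts through the middle of a date.
naive_valid = train_df.iloc[:n_test]
naive_train = train_df.iloc[n_test:]
naive_shared = set(naive_valid["Date"]) & set(naive_train["Date"])

print(len(valid), shared_dates)
print(len(naive_valid), naive_shared)
```

The date-aligned split ends up slightly longer than the test set (9 rows versus 7 here), which mirrors the 41,396 versus 41,088 difference discussed above.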

whamp · December 1, 2018, 12:17am

Ok great, I just wanted to make sure I understood what was going on and didn’t gloss over anything!

wgpubs · December 1, 2018, 8:35pm

Yup.

It didn’t make sense to me either until I realized the DataFrame is sorted by Date descending. Essentially, what Jeremy is doing is creating a validation set that mimics the test set both in terms of length and in predicting sales for the next “N” number of days.
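Since the whole trick depends on the sort order, here is a small sketch of how one might check it. The frame is hypothetical. It also shows that an integer lookup on a Series with an integer index goes by label, so the index needs to be reset after sorting for train_df['Date'][n] to mean "the n-th most recent row".

```python
import pandas as pd

# Hypothetical unsorted frame with a default 0..n-1 index.
df = pd.DataFrame({"Date": pd.to_datetime(["2015-07-01", "2015-07-03", "2015-07-02"])})

# Sort newest first but keep the old labels: [0] still finds the row labelled 0.
kept_labels = df.sort_values("Date", ascending=False)
by_label = kept_labels["Date"][0]

# Sort newest first and reset the index: [0] is now the most recent date.
sorted_df = df.sort_values("Date", ascending=False).reset_index(drop=True)
by_position = sorted_df["Date"][0]

# Quick check that the column really is in descending order.
is_desc = sorted_df["Date"].is_monotonic_decreasing

print(by_label.date(), by_position.date(), is_desc)
```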

This article Rachel wrote explains it quite nicely. The fundamental idea is to structure your validation set based on your task at hand.

peterkoman · December 15, 2019, 9:45am

Thanks for the explanation @whamp. This line confused me greatly. A lot of logic is hidden in minute details.

rgarcia · April 7, 2020, 6:59pm

In cases like this, I found the ability to run only the selected code, available in Colab [1], Kaggle [2], and possibly Jupyter, incredibly useful.

Here is an example.

[1] Colab: Ctrl+Shift+Enter
[2] Kaggle: Mouse click on the >} icon on the left of cell (couldn’t find a shortcut)