I think there is an error in defining the training and validation sets for the Bluebook for Tractors competition (Chapter 9, Lesson 7).
The stated objective is to:
define a narrower training dataset which consists only of the Kaggle training data from before November 2011, and we’ll define a validation set consisting of data from after November 2011
The code used to define the condition is:
cond = (df.saleYear<2011) | (df.saleMonth<10)
But this uses bitwise OR, so it will include all indices where the saleYear is 2010 or earlier OR the saleMonth is September or earlier, regardless of the year. So it will include, for example, March 2012 in the training set, when it should be in the validation set. (Also, the month test is strictly before October, so September and earlier, when the desired cutoff is October and earlier.)
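To make the problem concrete, here is a minimal sketch on a made-up five-row DataFrame (the rows are hypothetical, not taken from the competition data). It applies the condition quoted above and shows which rows would land in the training set:

```python
import pandas as pd

# Hypothetical toy rows for illustration only (not the Kaggle data)
toy = pd.DataFrame({
    "saleYear":  [2010, 2011, 2011, 2011, 2012],
    "saleMonth": [12,   9,    10,   11,   3],
})

# The condition as quoted above: True means "goes into the training set"
cond_orig = (toy.saleYear < 2011) | (toy.saleMonth < 10)

# Rows selected for training under the original condition
train_rows = toy[cond_orig]
print(train_rows)
```

On this toy data, March 2012 is selected for training because its month is below 10, while October 2011 is excluded because neither half of the OR is true for it.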
When I run the indices for the training set through Pandas and look at the unique values, they include some dates with year 2012 and all months.
I think the condition should be:
cond = ((df.saleYear<=2010) | ((df.saleYear==2011) & (df.saleMonth<=10)))
When I rerun the unique values with this condition, they are limited to 2011 and earlier, and the maximum month for 2011 is 10 (October).
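As a sanity check, here is a small sketch of the proposed condition on made-up rows (again hypothetical, not the competition data). Splitting with np.where is my assumption about how the condition gets turned into index arrays, and the year*100+month key is just an alternative way of writing the same cutoff that I find easier to read:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows for illustration only (not the Kaggle data)
toy = pd.DataFrame({
    "saleYear":  [2010, 2011, 2011, 2011, 2012],
    "saleMonth": [12,   9,    10,   11,   3],
})

# Proposed condition: everything up to and including October 2011
cond_fixed = (toy.saleYear <= 2010) | ((toy.saleYear == 2011) & (toy.saleMonth <= 10))

# Assumed split: positions where the condition holds vs. where it does not
train_idx = np.where(cond_fixed)[0]
valid_idx = np.where(~cond_fixed)[0]

# Equivalent formulation: compare a single sortable year-month key
year_month = toy.saleYear * 100 + toy.saleMonth
cond_key = year_month <= 201110

print(toy.iloc[train_idx])
print(toy.iloc[valid_idx])
```

With these rows, training gets December 2010, September 2011 and October 2011, and validation gets November 2011 and March 2012. The single-key version gives the same mask and avoids having to nest the year and month comparisons.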
Am I missing something?