Error in setting dates for Blue Book for Bulldozers

I think there is an error in defining the training and validation sets for the Blue Book for Bulldozers competition (Chapter 9, Lesson 7).

The stated objective is to:

define a narrower training dataset which consists only of the Kaggle training data from before November 2011, and we’ll define a validation set consisting of data from after November 2011

The code used to define the condition passed to np.where is cond = (df.saleYear<2011) | (df.saleMonth<10). But the two tests are combined with OR (|), so the condition is true for every row where saleYear is 2010 or earlier OR saleMonth is September or earlier, in any year. So it will include, for example, March 2012 in the training set, when it should be in the validation set. (Also, the month test is saleMonth<10, i.e. September and earlier, when the stated objective of "before November" calls for October and earlier.)
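
To make the problem concrete, here is a minimal sketch on a made-up DataFrame with one row per (year, month) pair. The toy data and the names train and leaked_months are my own for illustration; only the cond line is taken from the book.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): one row per (year, month) pair
df = pd.DataFrame(
    [(y, m) for y in (2010, 2011, 2012) for m in range(1, 13)],
    columns=["saleYear", "saleMonth"],
)

# Condition as written in the book
cond = (df.saleYear < 2011) | (df.saleMonth < 10)
train_idx = np.where(cond)[0]
train = df.iloc[train_idx]

# Months of 2012 that end up in the training set
leaked_months = sorted(train.loc[train.saleYear == 2012, "saleMonth"].tolist())
print(leaked_months)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

On this toy frame, January through September 2012 all land in the training set, while October 2011 does not.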

When I run the indices for the training set through Pandas, I get the following values for saleYear and saleMonth:

So it includes some dates with year 2012 and all months.

I think the condition should be:

cond = ((df.saleYear<=2010) | ((df.saleYear==2011) & (df.saleMonth<=10)))

When I rerun the unique values with this condition, they are limited to 2011 and earlier. The maximum month for 2011 is 10 (October).
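
A small sketch of that check, again on a made-up frame rather than the real data. The year*100+month key is just a convenience I am assuming here to compare dates as single integers.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): one row per (year, month) pair
df = pd.DataFrame(
    [(y, m) for y in (2010, 2011, 2012) for m in range(1, 13)],
    columns=["saleYear", "saleMonth"],
)

# Proposed condition: everything up to and including October 2011
cond = (df.saleYear <= 2010) | ((df.saleYear == 2011) & (df.saleMonth <= 10))
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]
train, valid = df.iloc[train_idx], df.iloc[valid_idx]

# Encode (year, month) as one integer, e.g. 201110 for October 2011
last_train = int((train.saleYear * 100 + train.saleMonth).max())
first_valid = int((valid.saleYear * 100 + valid.saleMonth).min())
print(last_train, first_valid)  # 201110 201111
```

Note that ~cond puts November 2011 itself into the validation set; the quoted objective does not say which side that month belongs on.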

Am I missing something?

Mark_F, I agree with you on this point. I was going through that section as well and wondering why the code was written that way. I sorted the original dataset by “saledate” and found that the most recent 8,000 entries run from Feb 2012 through Apr 2012, so November through January don’t even make the cut for the validation set. Not that there is anything magical about 8,000 records for the validation set; I was just trying to use approximately the same number that Jeremy uses in the book (7,988). I’m curious to hear if we get any responses here, but for now I am going to move forward with the validation set of 8,000 that I created, which runs from Feb to Apr 2012.
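
For anyone who wants to try the same approach, a rough sketch of a "most recent N rows" split. The data here is synthetic, and n_valid is set to 5 only because the toy frame is tiny; on the full dataset it would be 8,000.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 20 distinct sale dates in shuffled order
rng = np.random.default_rng(0)
n = 20
offsets = pd.to_timedelta(rng.permutation(n) * 20, unit="D")
df = pd.DataFrame({"saledate": pd.Timestamp("2011-01-01") + offsets})

n_valid = 5  # hypothetical size for the toy frame

# Row positions ordered from oldest to newest sale
order = np.argsort(df["saledate"].to_numpy(), kind="stable")
train_idx, valid_idx = order[:-n_valid], order[-n_valid:]

print(df["saledate"].iloc[valid_idx].min(), df["saledate"].iloc[valid_idx].max())
```

The split is by row position, so the validation set is always the newest n_valid sales regardless of which calendar months they fall in.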

You’re right. For a few minutes, I was mind fucked. But then I realized that the authors have made a mistake.

@sambit Did the authors acknowledge that this is a mistake?

Is there a plan to fix this mistake in the notebooks?

I’m not sure. It will probably be fixed in the next edition of the textbook.

I ran into the same issue. I think there is an error in the book's logic: it puts entries into the training set that belong in the validation set, so the validation set is wrong as well.

Glad you guys found this as well. I used slightly different code to fix it:

cond = ((df['saleYear']==2011) & (df['saleMonth']>10)) | (df['saleYear']==2012)
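
Note that this condition selects late 2011 and 2012, i.e. the validation rows rather than the training rows, so in the book's code it would presumably play the role of ~cond. A quick sketch, on a made-up frame, checking that it is the exact complement of the training condition from the first post. The names train_cond and valid_cond are mine, and the check assumes the data has no sales after 2012.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): one row per (year, month) pair
df = pd.DataFrame(
    [(y, m) for y in (2010, 2011, 2012) for m in range(1, 13)],
    columns=["saleYear", "saleMonth"],
)

# Training condition from the first post
train_cond = (df["saleYear"] <= 2010) | ((df["saleYear"] == 2011) & (df["saleMonth"] <= 10))
# Validation condition from this post
valid_cond = ((df["saleYear"] == 2011) & (df["saleMonth"] > 10)) | (df["saleYear"] == 2012)

# Every row should satisfy exactly one of the two conditions
is_partition = bool((train_cond ^ valid_cond).all())

valid_idx = np.where(valid_cond)[0]
train_idx = np.where(~valid_cond)[0]
print(is_partition, len(train_idx), len(valid_idx))
```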