Lecture 2 - Deep dive: Ordered by date dataset + review

p2327 · October 25, 2019, 3:16pm

In the lecture 2 video, Jeremy mentions that the dataset is ordered by date, and that we can build the validation set so that it contains only “future sales”, which should make for a better model.

However, after calling add_datepart, I noticed that the rows in df_raw are NOT ordered by saleYear.
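For anyone who wants to check the same thing, here is a minimal sketch of testing whether a column is in order. The small df_raw below is a made-up stand-in for the real data, not the actual dataset.

```python
import pandas as pd

# Toy stand-in for df_raw; the values are made up for illustration.
df_raw = pd.DataFrame({
    "saleYear": [2009, 2004, 2011, 2006],
    "SalePrice": [10.2, 9.8, 10.9, 10.1],
})

# True only if saleYear never decreases from one row to the next.
is_sorted = df_raw["saleYear"].is_monotonic_increasing
print(is_sorted)
```

On the toy data this prints False, because 2004 comes after 2009.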

I had to sort the DataFrame first with df_raw = df_raw.sort_values(by=['saleYear']), and only then call proc_df and split_vals.
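Roughly, the sort-then-split idea looks like the sketch below. It uses made-up toy data, and time_split is a hypothetical helper written for this example (it is not the course's split_vals); proc_df is left out entirely.

```python
import pandas as pd

# Toy stand-in for df_raw; the values are made up for illustration.
df_raw = pd.DataFrame({
    "saleYear": [2009, 2004, 2011, 2006, 2010, 2005],
    "SalePrice": [10.2, 9.8, 10.9, 10.1, 10.5, 9.9],
})

# sort_values returns a new frame, so the result has to be assigned.
# mergesort is stable: rows with the same saleYear keep their original order.
df_sorted = df_raw.sort_values(by=["saleYear"], kind="mergesort").reset_index(drop=True)

def time_split(df, n_valid):
    """Hypothetical helper: the last n_valid rows become the validation set."""
    n_trn = len(df) - n_valid
    return df.iloc[:n_trn].copy(), df.iloc[n_trn:].copy()

train, valid = time_split(df_sorted, n_valid=2)

# Every validation sale is at least as recent as every training sale.
print(train["saleYear"].max(), valid["saleYear"].min())
```

Sorting only by year is coarse, since sales within the same year stay in whatever order they had before; a finer-grained date column, if one is available, would give a stricter ordering.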

This made my RandomForestRegressor score much better (maybe even too good?):

m = RandomForestRegressor(n_jobs=-1, random_state=1)
m.fit(X_train, y_train)
print_score(m, valid=True)

Validation set RMSE: 0.32
Validation set R^2: 0.79
Train RMSE: 0.21
Train R^2: 0.91

I don’t know whether I am doing the right thing, so I’d appreciate it if anybody with experience could offer some insight. Here is my notebook: https://bit.ly/2oicglL