In the lecture 2 video, Jeremy mentions that the dataset is ordered by date, and that holding out only the most recent (“future”) sales as the validation set makes for a better model.
However, after calling add_datepart, I noticed that the rows in df_raw are NOT ordered by saleYear.
I had to sort the DataFrame with df_raw = df_raw.sort_values(by=['saleYear']) and then call proc_df and split_vals.
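For reference, here is a minimal sketch of the sort-then-split step on a toy frame. The toy data and the validation size of 3 are made up for illustration, and split_vals is re-defined locally to mirror the helper from the lesson notebook as I understand it, so treat that as an assumption. Note that sort_values returns a new DataFrame, so the result has to be assigned back.

```python
import pandas as pd

def split_vals(a, n):
    # First n rows go to training, the remaining rows to validation.
    return a[:n].copy(), a[n:].copy()

# Toy stand-in for df_raw; the values are illustrative only.
df_raw = pd.DataFrame({
    "saleYear": [2009, 2004, 2011, 2006, 2010, 2005, 2008, 2007],
    "SalePrice": [9.8, 10.1, 10.6, 9.9, 10.3, 10.0, 10.4, 10.2],
})

# sort_values returns a new frame, so assign it back and reset the index.
df_raw = df_raw.sort_values(by=["saleYear"]).reset_index(drop=True)

n_valid = 3  # hypothetical validation size
n_trn = len(df_raw) - n_valid
train, valid = split_vals(df_raw, n_trn)

# After sorting, no validation row is older than any training row.
assert train["saleYear"].max() <= valid["saleYear"].min()
```

With the frame sorted first, taking the last rows as the validation set gives only the latest sales, which is the point of the date-based split.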
This made my RandomForestRegressor much better (even too good?):
m = RandomForestRegressor(n_jobs=-1, random_state=1)
Validation set RMSE: 0.32
Validation set R^2: 0.79
Train RMSE: 0.21
Train R^2: 0.91
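For anyone checking the numbers, this is a sketch of how the two metrics can be computed. The arrays are made up and are not from my notebook; the helper names rmse and r2 are just for illustration. In scikit-learn, RandomForestRegressor.score returns R^2 directly.

```python
import numpy as np

def rmse(pred, actual):
    # Root mean squared error between predictions and targets.
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

def r2(pred, actual):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = np.sum((actual - pred) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up targets and predictions, for illustration only.
actual = np.array([10.0, 10.5, 9.5, 11.0])
pred = np.array([10.1, 10.4, 9.7, 10.8])

print(rmse(pred, actual), r2(pred, actual))
```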
I’m not sure whether I am doing the right thing, and I’d appreciate it if anybody with experience could offer some insight. Here is my notebook: https://bit.ly/2oicglL