As mentioned in the videos, I see that proc_df returns a random subset sampled from df_raw:
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
Wouldn't this training data then overlap with the validation data?
I think we should create a pipeline to prevent the mixing of the data.
I would agree with karthikshyam. It seems to me that one should replace df_raw inside the call to proc_df with raw_train, so that the subset is drawn only from the training rows, i.e. use
df_trn, y_trn, nas_trn = proc_df(raw_train, 'SalePrice', subset=30000, na_dict=nas)
instead of
df_trn, y_trn, nas_trn = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
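To illustrate the concern, here is a minimal sketch in plain pandas rather than the course library. It assumes the validation set is simply the last rows of df_raw. The toy data, the split_vals helper, the name raw_valid, and the sizes are assumptions made for illustration, and DataFrame.sample stands in for the subset argument of proc_df. The point is only that sampling from the training portion cannot pick validation rows, while sampling from the full frame can.

```python
import numpy as np
import pandas as pd

def split_vals(df, n_trn):
    """Split by row position: first n_trn rows for training, the rest for validation."""
    return df.iloc[:n_trn].copy(), df.iloc[n_trn:].copy()

# Toy stand-in for df_raw (hypothetical data, not the real dataset).
rng = np.random.default_rng(0)
df_raw = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "SalePrice": rng.normal(size=1000),
})

n_valid = 200  # assumed validation size
raw_train, raw_valid = split_vals(df_raw, len(df_raw) - n_valid)

# Subsample from the training portion only, so no validation row can be drawn.
subset = raw_train.sample(n=300, random_state=0)

# Check for leakage: the subset and the validation set share no row labels.
overlap = subset.index.intersection(raw_valid.index)
print(len(overlap))  # 0
```

Replacing raw_train with df_raw in the sample call would allow rows from the last 200 positions into the subset, which is the overlap asked about above.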