Does the proc_df subset overlap with the validation data?


#1

As mentioned in the videos, I see that proc_df returns a randomly picked sample of rows from df_raw when subset is set:
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
Wouldn't this training data then overlap with the validation data?
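
To make the concern concrete, here is a minimal sketch in plain pandas. It does not call proc_df; it only assumes that the subset is a uniform random sample of rows and that the validation set is the last n_valid rows of the frame. The toy frame, the sizes, and the seed are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_raw (assumption: the real data is not used here).
df_raw = pd.DataFrame({"SalePrice": np.arange(1000)})

# Assumption: the validation set is the last n_valid rows of the frame.
n_valid = 200
valid_idx = df_raw.index[-n_valid:]

# A random subset drawn from the WHOLE frame, mimicking subset=... on df_raw.
subset = df_raw.sample(n=300, random_state=0)

# Rows that appear in both the sampled subset and the validation rows.
overlap = subset.index.intersection(valid_idx)
print(f"{len(overlap)} of {len(subset)} sampled rows are also validation rows")
```

If the count printed is non-zero, a model trained on the subset has seen some of the rows it is later validated on, unless those rows are dropped before training.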


#2

I think we should create a pipeline to prevent the mixing of the data.
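
One way such a pipeline could look, as a rough sketch only: split off the validation rows first, then sample the training subset from what is left. The helper name split_then_sample and its parameters are hypothetical, and it assumes the validation set is the last n_valid rows.

```python
import pandas as pd

def split_then_sample(df, n_valid, n_subset, seed=None):
    """Hypothetical helper: hold out the last n_valid rows, then
    draw the training subset only from the remaining rows."""
    if not 0 < n_valid < len(df):
        raise ValueError("n_valid must be between 1 and len(df) - 1")
    train = df.iloc[:-n_valid]   # everything except the held-out rows
    valid = df.iloc[-n_valid:]   # held-out validation rows
    train_subset = train.sample(n=n_subset, random_state=seed)
    return train_subset, valid

# Toy data for illustration only.
df = pd.DataFrame({"SalePrice": range(1000)})
train_subset, valid = split_then_sample(df, n_valid=200, n_subset=300, seed=0)

# The sampled training rows and the validation rows share no index labels.
assert train_subset.index.intersection(valid.index).empty
```

Because the sampling step never sees the validation rows, the two sets cannot overlap regardless of the random seed.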


(Antoine) #3

I agree with karthikshyam. It seems to me that one should replace df_raw in the call to proc_df with raw_train, i.e. use

df_trn, y_trn, nas_trn = proc_df(raw_train, 'SalePrice', subset=30000, na_dict=nas)

instead of

df_trn, y_trn, nas_trn = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)