Does the proc_df subset overlap with the validation data?


As mentioned in the videos, I see that proc_df, when given the subset argument, returns a random sample of rows from df_raw:
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
Wouldn’t this training data then overlap with the validation data?


I think we should create a pipeline that prevents the training and validation data from mixing.
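
To make the idea concrete, here is a minimal sketch of what I mean, not how the library itself does it. It assumes the validation set is the last n_valid rows of df_raw, and it picks the random subset only from the rows before that cutoff. The helper name split_then_sample, the seed, and the row counts other than subset=30000 are placeholders I made up for illustration.

```python
import random

def split_then_sample(n_rows, n_valid, subset, seed=42):
    """Return (train_idx, valid_idx): disjoint lists of row positions.

    Hypothetical helper: the last n_valid rows are held out for validation,
    and the training subset is drawn only from the rows before them.
    """
    if not 0 < n_valid < n_rows:
        raise ValueError("n_valid must be between 1 and n_rows - 1")
    n_trn = n_rows - n_valid
    if subset > n_trn:
        raise ValueError("subset is larger than the available training rows")

    # Hold out the validation rows first.
    valid_idx = list(range(n_trn, n_rows))

    # Sample without replacement from the remaining rows only.
    rng = random.Random(seed)
    train_idx = sorted(rng.sample(range(n_trn), subset))
    return train_idx, valid_idx

# Placeholder sizes: 100,000 rows in total, 12,000 held out for validation.
train_idx, valid_idx = split_then_sample(n_rows=100_000, n_valid=12_000, subset=30_000)
```

The two index lists could then be used to select rows, for example with df_raw.iloc[train_idx] and df_raw.iloc[valid_idx], before any further processing. Because the sample is drawn only from positions below the cutoff, no row can end up in both sets.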