Jeremy applies proc_df(), which calls fix_missing() internally, to the whole dataframe and only afterwards splits it into training and validation sets. Doesn’t this create leakage?
Inside proc_df(), fix_missing() is called, which (unless specified otherwise) replaces all NA values in continuous variables with the column median. The problem I see is that the median is calculated on, and applied to, the whole dataset, and only afterwards is the data split into train and validation sets. This means the fill value used in the training set already contains information from the validation rows, so the two sets are no longer independent. As a result, we would probably overestimate our performance on the validation set (the model does better on the validation set than it actually would on unseen data).
I think it would be better to calculate the median on the training set only and then use that same value to fill the NAs in both the training and the validation set.
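To make the suggestion concrete, here is a minimal sketch in plain pandas of what I mean. It does not use proc_df() or fix_missing() at all; the helpers fit_medians() and apply_medians() and the toy column names are made up purely for illustration.

```python
import numpy as np
import pandas as pd

def fit_medians(train_df, cols):
    """Hypothetical helper: compute fill values from the training rows only."""
    return {col: train_df[col].median() for col in cols}

def apply_medians(df, medians):
    """Hypothetical helper: fill NAs with previously computed medians."""
    out = df.copy()
    for col, med in medians.items():
        out[col] = out[col].fillna(med)
    return out

# Toy data: one continuous feature with missing values and an outlier
# that ends up in the validation part.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 3.0, 100.0, np.nan],
    "target": [0, 1, 0, 1, 0, 1],
})

# Split first (here simply by position), then compute the statistics.
train, valid = df.iloc[:4], df.iloc[4:]

medians = fit_medians(train, cols=["x"])      # uses training rows only
train_filled = apply_medians(train, medians)
valid_filled = apply_medians(valid, medians)  # reuses the training median

# For comparison: the median over the whole dataset, which is what
# gets used when the fill happens before the split.
whole_median = df["x"].median()
```

In this toy example the training median of x is 2.0, while the median over the whole dataframe is 2.5 because the validation outlier shifts it. That difference is the information I am worried about.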
Am I wrong or did I overlook something?