Leakage through fix_missing() in proc_df() function?


(Simon) #1

Hi,

Jeremy applies proc_df(), which calls fix_missing() internally, to the whole dataframe and only afterwards splits it into a training and a validation set. Doesn’t this create leakage?

Inside proc_df(), the fix_missing() function is called, which (unless specified otherwise) replaces all NA values in continuous variables with the median. The problem I see is that the median is calculated on the whole dataset, and only afterwards is the data split into training and validation sets. This means that the values imputed in the training set depend on the validation data, so information leaks between the two sets and we would overestimate our performance on the validation set (the model does better on the validation set than it actually would on unseen data).

I think it would be better to calculate the median only on the training set and then apply it to all NAs in the validation set.
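To make the idea concrete, here is a minimal sketch in plain pandas. It is not fastai's implementation: fit_medians() and apply_medians() are hypothetical helper names I made up, and the toy dataframe is invented for illustration. The medians are computed from the training rows only and then reused for the validation rows.

```python
import numpy as np
import pandas as pd


def fit_medians(train: pd.DataFrame) -> dict:
    """Return {column: median} for numeric columns of `train` that contain NaNs.

    Hypothetical helper: only the training rows are used to compute medians.
    """
    num_cols = train.select_dtypes(include="number").columns
    return {c: train[c].median() for c in num_cols if train[c].isna().any()}


def apply_medians(df: pd.DataFrame, medians: dict) -> pd.DataFrame:
    """Fill NaNs in `df` with precomputed medians and add a `<col>_na` flag.

    Columns that had no NaNs in the training data are not in `medians`,
    so they are left untouched here.
    """
    out = df.copy()
    for col, med in medians.items():
        out[col + "_na"] = out[col].isna()
        out[col] = out[col].fillna(med)
    return out


# Toy data (invented): one continuous column with missing values.
df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 5.0, 100.0, np.nan],
    "y": [1, 0, 1, 0, 1, 0],
})

# Split first, then impute.
train, valid = df.iloc[:4], df.iloc[4:]

medians = fit_medians(train)                  # median of x on train only: 3.0
train_filled = apply_medians(train, medians)
valid_filled = apply_medians(valid, medians)  # reuses the training median

# For comparison, the median over the whole dataset is 4.0,
# which is what the validation NaN would be filled with if we imputed before splitting.
whole_median = df["x"].median()
```

In this toy case the missing validation value is filled with 3.0 (training median) instead of 4.0 (whole-dataset median), so the validation row with x = 100 no longer influences the fill value.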

Am I wrong or did I overlook something?

Best


(Simon) #2

Just bumping this to see whether someone can answer my question.


#3

Hello, how do you use proc_df on test data?


(Simon) #4

Not on test data but on my validation set.


#5

I think @Buddhi’s post explains the use of nas, and hence effectively why the order in which fix_missing() is applied doesn’t matter.