Leakage through fix_missing() in proc_df() function?

SimonWeiss · November 5, 2018, 8:00pm

Hi,

Jeremy applies proc_df(), which calls fix_missing() internally, to the whole dataframe and only afterwards splits it into a training and a validation set. Doesn’t this create leakage?

Inside proc_df(), fix_missing() is called, which (unless specified otherwise) replaces all NA values in continuous variables with the column median. The problem I see is that the median is calculated on, and applied to, the whole dataset, and only afterwards is the data split into training and validation sets. This means that information leaks between the training and validation sets (the fill value is computed partly from validation rows), so we would likely overestimate our performance on the validation set (the model does better on the validation set than it actually would on unseen data).

I think it would be better to calculate the median only on the training set and then apply it to all NAs in the validation set.
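To make the idea concrete, here is a minimal sketch in plain pandas. It is not fastai's implementation: the helper names fit_medians and apply_medians and the toy column x are made up for illustration. The "_na" indicator column is included on the assumption that one wants to keep track of which values were filled.

```python
import numpy as np
import pandas as pd

def fit_medians(train_df):
    """Hypothetical helper: medians of numeric columns with NAs, from training rows only."""
    num_cols = train_df.select_dtypes(include="number").columns
    return {c: train_df[c].median() for c in num_cols if train_df[c].isna().any()}

def apply_medians(df, medians):
    """Hypothetical helper: fill NAs with previously fitted medians and flag filled rows."""
    out = df.copy()
    for col, med in medians.items():
        out[col + "_na"] = out[col].isna()  # remember which values were missing
        out[col] = out[col].fillna(med)
    return out

# Toy data: the last two rows play the role of the validation set.
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 3.0, 100.0, np.nan]})
train, valid = df.iloc[:4], df.iloc[4:]

medians = fit_medians(train)                  # computed from training rows only
train_filled = apply_medians(train, medians)
valid_filled = apply_medians(valid, medians)  # validation reuses the training medians

# For comparison: the median over the whole dataset also depends on validation rows.
whole_median = df["x"].median()
print(medians["x"], whole_median)
```

In this toy example the two medians differ because the validation row with x = 100 shifts the whole-dataset median. Note that this sketch only fills columns that had NAs in the training rows; a column that is missing values only in the validation set would need separate handling.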

Am I wrong or did I overlook something?

Best

SimonWeiss · November 15, 2018, 2:31pm

Just bumping this to see whether someone can answer my question.

YunusDev · November 18, 2018, 2:42pm

Hello, how do you use proc_df on test data?

SimonWeiss · November 18, 2018, 7:15pm

Not on test data but on my validation set.

number007 · November 19, 2018, 6:57am

I think @Buddhi’s post explains the use of nas, and hence effectively why the order in which fix_missing() is applied doesn’t matter.