Why not merge train and test before applying train_cats and proc_df?

sumeetsk · March 4, 2019, 8:21pm

In apply_cats, we pass trn so that the same values get the same category codes in both sets. Similarly, proc_df returns nas so that the columns it adds for missing values in train can also be added to test. (See How to use proc_df on a test set?)
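To make the point about matching categories concrete, here is a minimal sketch in plain pandas. It only imitates the idea behind train_cats and apply_cats and is not the fastai code itself; the `color` column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not from the original post.
train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
test = pd.DataFrame({"color": ["green", "red"]})

# Encoding the test set on its own gives codes that disagree with train.
naive_codes = test["color"].astype("category").cat.codes.tolist()

# Fix the categories on the training set (roughly the role of train_cats).
train["color"] = train["color"].astype("category").cat.as_ordered()

# Reuse the training categories on the test set (roughly the role of apply_cats).
test["color"] = pd.Categorical(
    test["color"], categories=train["color"].cat.categories, ordered=True
)

train_codes = train["color"].cat.codes.tolist()
test_codes = test["color"].cat.codes.tolist()

print(naive_codes)  # green=0, red=1 when test is encoded alone
print(train_codes)  # blue=0, green=1, red=2
print(test_codes)   # green=1, red=2, consistent with train
```

Without the shared categories, "green" would be 0 in test but 1 in train, so the model would see the wrong value.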

Is there anything wrong with the following approach instead? Merge the train and test sets, then apply train_cats and proc_df, and then split them back again. This way we don’t have to pass arguments back and forth.

Is there any issue with this approach?

krithi07 · May 7, 2020, 8:49am

Hi @sumeetsk
Yes, there is an issue with this approach. The train set is all you have to teach your model. If you start using your test set in pre-processing, you have defeated the purpose of a test set, which by definition is the ‘unseen’ data.
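One way to see the leak is with missing-value filling. The sketch below uses plain pandas and assumes missing numeric values are filled with the column median; the column `x` and the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: the test values differ a lot from the train values.
train = pd.DataFrame({"x": [1.0, 2.0, None, 3.0]})
test = pd.DataFrame({"x": [100.0, None, 200.0]})

# Statistic learned from train only: no information from test is used.
train_median = train["x"].median()

# Statistic learned after merging: test values shift the fill value.
merged_median = pd.concat([train, test])["x"].median()

# Correct usage: fill the test set with the value learned on train.
test_filled = test["x"].fillna(train_median)

print(train_median)   # median of 1, 2, 3
print(merged_median)  # median of 1, 2, 3, 100, 200
```

With the merged approach, the value used to fill gaps in train already depends on the test data, so the test set is no longer fully unseen.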

For cases where the categories are the same in both sets, this approach can be adopted for convenience, but that is not always the case. There are times when the train set has only x categories and the test set has x+3 categories.
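Here is a small sketch of that last situation in plain pandas (again not the fastai code itself; the values are made up). When the test set contains a category the train set never had, reusing the train categories turns it into a missing value with code -1 instead of giving it a new code of its own.

```python
import pandas as pd

# Hypothetical toy data: "c" appears only in the test set.
train = pd.Series(["a", "b", "a"], dtype="category")
test = pd.Series(["a", "c", "b"])

# Encode test with the categories learned from train.
test_cat = pd.Categorical(test, categories=train.cat.categories)

codes = test_cat.codes.tolist()
unseen = test[test_cat.isna()].tolist()

print(codes)   # "c" gets -1 because train never saw it
print(unseen)  # the test values that had no matching train category
```

This is what the model would face in production as well, since new data can always contain values that were absent at training time.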