Is it a good practice to append test data to train data?

saikrishna26 · April 14, 2019, 7:36pm

I am very new to ML, and this question is about Kaggle competitions.

Is it good practice to append the test set to the training set, do the preprocessing/feature engineering, and later split the rows back into test and training sets? I would like to know the pros and cons.

Thanks in advance…

alenas · April 14, 2019, 7:52pm

Hi Sai, it’s usually not a good idea to do preprocessing and feature engineering on the combined train and test data. Cons:

If you do feature engineering on the test set, you will have no independent set left to evaluate how good the engineered features are;
If you do data preprocessing on the test set appended to the training set, you have a data leakage situation, where again you can’t evaluate your model’s performance adequately.

No pros as far as I am aware
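
To make the leakage point concrete, here is a minimal sketch in plain Python. The numbers and the helper names `fit_scaler` and `apply_scaler` are made up for illustration. It shows that statistics computed on train and test together differ from statistics computed on train alone, so the combined version lets information about the test rows influence the training features:

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the mean and standard deviation from the given values only."""
    return mean(values), pstdev(values)

def apply_scaler(values, mu, sigma):
    """Standardize values using previously learned statistics."""
    return [(v - mu) / sigma for v in values]

train = [1.0, 2.0, 3.0, 4.0]
test = [100.0, 200.0]  # stands in for unseen data (made-up values)

# Leaky: statistics computed on train + test together.
leaky_mu, leaky_sigma = fit_scaler(train + test)

# Clean: statistics computed on train only, then reused for test.
mu, sigma = fit_scaler(train)
train_scaled = apply_scaler(train, mu, sigma)
test_scaled = apply_scaler(test, mu, sigma)

print("train-only mean:", mu)
print("train+test mean:", leaky_mu)
```

In the clean version the test rows are only transformed, never used to learn anything.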

lunchcook · April 14, 2019, 8:10pm

Hey @alenas, could you please help me proceed? This is my first Kaggle dataset, and I am working on Housing Prices. It consists of two parts, train.csv and test.csv. I trained the model using the Random Forest approach that Jeremy taught with the Blue Book for Bulldozers dataset. The training was fairly easy. Now I need to predict the SalePrice for test.csv, which contains all the features except the label (which is quite obvious). So do I need to preprocess the test.csv dataset as well? Or can I just use y=m.predict(df1), where y will store the predictions based on what the model learned from the train dataset and df1 is the features in the test set? I am completely new to this domain, please help.

saikrishna26 · April 14, 2019, 8:18pm

It just started to make sense to me… thank you.

Can you please suggest any references/links on how to handle a new category of a high-cardinality categorical feature that is present in the “test set” but not in the “training set”?

saikrishna26 · April 14, 2019, 8:28pm

@lunchcook Hi Mukesh, as far as I know, if you have done any basic data preprocessing on the training dataset, like dropping NA columns or imputation, then the same needs to be applied to the test dataset.

A predictive model trained on one set of features in the training dataset cannot predict outcomes for a test dataset with different features.
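
A rough sketch of that idea in plain Python, with made-up rows: the helpers `fit_preprocessing` and `apply_preprocessing` and the `min_present` threshold are hypothetical, not part of any library. The columns to keep and the fill values are learned from the training rows only, and the test rows then go through exactly the same steps so that both end up with the same features:

```python
from statistics import median

def fit_preprocessing(rows, min_present=0.5):
    """Learn which columns to keep and their fill values from training rows only."""
    keep, fill = [], {}
    for col in rows[0].keys():
        present = [r[col] for r in rows if r[col] is not None]
        if len(present) / len(rows) < min_present:
            continue  # drop columns that are mostly missing in train
        keep.append(col)
        fill[col] = median(present)
    return keep, fill

def apply_preprocessing(rows, keep, fill):
    """Apply the learned column selection and imputation to any rows."""
    return [[r[c] if r.get(c) is not None else fill[c] for c in keep]
            for r in rows]

# Made-up rows; None marks a missing value.
train_rows = [
    {"LotArea": 8450, "PoolArea": None, "YearBuilt": 2003},
    {"LotArea": None, "PoolArea": None, "YearBuilt": 1976},
    {"LotArea": 11250, "PoolArea": 512, "YearBuilt": 2001},
]
test_rows = [{"LotArea": None, "PoolArea": 300, "YearBuilt": None}]

keep, fill = fit_preprocessing(train_rows)              # learned on train only
X_train = apply_preprocessing(train_rows, keep, fill)
X_test = apply_preprocessing(test_rows, keep, fill)     # same steps reused
```

Here `X_test` has the same columns in the same order as `X_train`, which is what a call like m.predict would need.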

Anyone, please correct me if I am wrong.

lunchcook · April 14, 2019, 8:38pm

Yes, I got that, thanks. I think this link is also useful. It helped me out as well:

How to use proc_df on a test set? - #2 by maciejkpl