Is it a good practice to append test data to train data?

(Sai Krishna) #1

I am very new to ML, and this question is about Kaggle competitions.

Is it good practice to append the test set to the training set, do the preprocessing/feature engineering, and then split the rows back into test and training sets? I would like to know the pros and cons.

Thanks in advance… :slight_smile:


(Alena Harley) #2

Hi Sai, it’s usually not a good idea to do preprocessing and feature engineering on the test set appended to the training set. Cons:

  1. If you do feature engineering on the test set, you will have no independent set to evaluate how good your engineered features are;
  2. If you do data preprocessing on the test set appended to the training set, you have a data leakage situation where, again, you can’t evaluate your model's performance adequately.

No pros as far as I am aware :sweat_smile:
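A minimal sketch of the leakage in point 2, assuming a single made-up numeric feature and plain NumPy (the arrays and variable names are hypothetical, not from any competition). It compares a mean computed on train + test with one computed on train only:

```python
import numpy as np

# Toy values, invented for illustration; the test rows are shifted relative to train.
train = np.array([1.0, 2.0, 3.0, 4.0])
test = np.array([10.0, 12.0])

# Leaky: the statistic is computed on train and test together,
# so information about the test rows flows into the preprocessing.
combined = np.concatenate([train, test])
leaky_mean = combined.mean()  # 32 / 6, pulled upward by the test rows

# Safer: the statistics come from the training rows only and are reused on test.
train_mean = train.mean()  # 2.5
train_std = train.std()

train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std
```

The gap between `leaky_mean` and `train_mean` is exactly the information about the test rows that would otherwise end up in the preprocessing.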


(Mukesh Jha) #3

Hey @alenas, could you please help me proceed? This is my first Kaggle dataset, and I am working on Housing Prices. It consists of two parts, train.csv and test.csv. I trained the model using the Random Forest approach that Jeremy taught on the BlueBook for Bulldozers dataset. The training was fairly easy. Now I need to predict the SalePrice for test.csv, which contains all the features except the label (which is quite obvious). So do I need to preprocess the test.csv dataset as well? Or can I just use y=m.predict(df1), where y will store the predictions based on what the model learned from the train dataset and df1 holds the features of the test set? I am completely new to this domain, please help.


(Sai Krishna) #4

:smile: it just started to make sense to me… thank you.

Can you please suggest any references/links on how to handle a new category in a high-cardinality categorical feature that is present in the “test set” but not in the “training set”? :confused:


(Sai Krishna) #5

@lunchcook Hi Mukesh, as far as I know, if you have done any basic data preprocessing on the training dataset, like dropping NA columns or imputation, then the same needs to be applied to the test dataset.

A predictive model trained on one set of features in the training dataset cannot predict outcomes for a test dataset with different features.
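A rough sketch of what that could look like with pandas, assuming two made-up columns (`LotArea`, `Street`) and a hypothetical helper `preprocess`. The imputation value is learned from train only, and the test columns are aligned to the training columns, so a category that appears only in test ends up as all zeros:

```python
import pandas as pd

# Toy frames, invented for illustration.
train = pd.DataFrame({
    "LotArea": [8000.0, None, 9500.0],
    "Street": ["Pave", "Grvl", "Pave"],
})
test = pd.DataFrame({
    "LotArea": [None, 7000.0],
    "Street": ["Pave", "Dirt"],  # "Dirt" never appears in train
})

# Learn the imputation value from the training rows only.
lot_area_median = train["LotArea"].median()

def preprocess(df, columns=None):
    """Apply the same steps to any frame; `columns` aligns it to the training layout."""
    out = df.copy()
    out["LotArea"] = out["LotArea"].fillna(lot_area_median)
    out = pd.get_dummies(out, columns=["Street"])
    if columns is not None:
        # Drop dummy columns unseen in train and add missing ones filled with 0.
        out = out.reindex(columns=columns, fill_value=0)
    return out

X_train = preprocess(train)
X_test = preprocess(test, columns=X_train.columns)
```

After this, `X_test` has exactly the columns the model was trained on, so something like `m.predict(X_test)` would receive the layout it expects. Mapping unseen categories to all zeros is only one option; whether it is the right one depends on the feature.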

Anyone, please correct me if I am wrong.


(Mukesh Jha) #6

Yes, I got that, thanks. I think this link is also useful; it helped me as well:

How to use proc_df on a test set?
