1. How should we handle missing values for a feature in test and training datasets?
By this I mean: do we apply whatever methodology we choose for handling missing values of feature A to the training and test datasets independently, or do we apply it only to the training dataset and then reuse the resulting values on the test data?
For example, in the Kaggle Titanic competition there are a bunch of missing values for the “Age” feature. Let’s say we decided to handle this by simply replacing all null values with the mean of the values that are present. If we do this on the training and test sets independently, we will be replacing nulls with two different values. If we instead calculate the mean from the training dataset alone and use it to replace nulls in both training and test, we use the same value everywhere. Which is the right approach?
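To make the second option concrete, here is a minimal pandas sketch of the "fit on train, apply to both" approach. The data is a made-up stand-in for the Titanic “Age” column, not the real competition file:

```python
import pandas as pd

# Toy stand-in for the Titanic "Age" column (hypothetical values).
train = pd.DataFrame({"Age": [22.0, None, 30.0, 40.0]})
test = pd.DataFrame({"Age": [None, 28.0]})

# Compute the fill value from the training split ONLY...
train_mean = train["Age"].mean()  # ignores NaN by default

# ...then apply that same statistic to both splits, so no
# information from the test set leaks into preprocessing.
train["Age"] = train["Age"].fillna(train_mean)
test["Age"] = test["Age"].fillna(train_mean)
```

The same pattern is what scikit-learn's `SimpleImputer` enforces: you call `fit` on the training data and `transform` on both splits, so the learned statistic comes from training data alone.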
2. How do we handle categorical values that exist in test but not in train, and vice versa?
In my opinion, it seems like values that appear in your test dataset but not in your training dataset should either be thrown out or treated as missing values. My thinking is that if those values weren’t there to train your model on, encountering a category it never saw in training would hurt its predictive capability. However, I’ve seen others combine the training and test datasets so the encoding covers all categories, regardless of whether some exist in only one of the two.
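The "treat as missing" option from the paragraph above can be sketched in a few lines of pandas. The `Embarked`-style values here are toy data I made up for illustration:

```python
import pandas as pd

# Toy categorical feature; "Q" appears only in the test split.
train = pd.DataFrame({"Embarked": ["S", "C", "S"]})
test = pd.DataFrame({"Embarked": ["Q", "C"]})

# The set of known categories is learned from training data only.
known = set(train["Embarked"].unique())

# Test-set values never seen in training become NaN, i.e. they are
# treated as missing rather than as brand-new categories.
test["Embarked"] = test["Embarked"].where(test["Embarked"].isin(known))
```

If you one-hot encode instead, scikit-learn's `OneHotEncoder(handle_unknown="ignore")` achieves a similar effect: unseen categories are encoded as an all-zeros row rather than raising an error.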