1. How should we handle missing values for a feature in test and training datasets?
By this I mean: do we apply whatever methodology we choose for handling missing values to the training and test datasets independently, or do we fit it on the training dataset alone and then apply those fitted values to the test data?
For example, in the Kaggle Titanic competition there are a bunch of missing values for the “Age” feature. Let’s say we decided to handle it by simply replacing all null values with the mean of the values that are there. If we do this on the training and test set independently, we will be replacing nulls with two separate values. If we simply calculate the mean against the training dataset, and use it to replace nulls on both training and test, we use the same value for nulls. Which is the right approach?
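As a minimal sketch of the second approach (the "fit on train, apply to both" option), using made-up ages rather than the real Titanic data:

```python
# Hypothetical toy data; None stands in for a missing "Age" value.
train_ages = [22.0, None, 38.0, 26.0, None, 35.0]
test_ages = [None, 28.0, None]

# Compute the mean from the TRAINING rows only.
observed = [a for a in train_ages if a is not None]
train_mean = sum(observed) / len(observed)  # (22 + 38 + 26 + 35) / 4 = 30.25

def impute(ages, fill):
    """Replace missing values with a single precomputed fill value."""
    return [fill if a is None else a for a in ages]

train_filled = impute(train_ages, train_mean)
test_filled = impute(test_ages, train_mean)  # the SAME value is reused on test
```

The key point is that `train_mean` is computed once, from training rows only, and then reused unchanged on the test set.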
2. How do we handle categorical values that exist in test but not in train, and vice versa?
In my opinion, categorical values that appear in your test dataset but not in your training dataset should either be thrown out or treated as missing values. My reasoning is that if a value wasn't there to train your model on, the model has nothing useful to do when it encounters that value at prediction time. However, I've seen others combine the training and test datasets so that the encoding includes all categories, regardless of whether some exist in only one dataset.
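The "treat unseen categories as missing" option can be sketched like this, with a hypothetical `<UNK>` sentinel and made-up Titanic-style "Embarked" values:

```python
# Hypothetical data: the training set fixes the vocabulary of known categories.
train_embarked = ["S", "C", "S", "Q", "S"]
test_embarked = ["C", "X", "S"]  # "X" never appeared in training

known = set(train_embarked)  # {"S", "C", "Q"}

def encode(values, vocabulary, unknown_token="<UNK>"):
    """Map any category outside the training vocabulary to a sentinel,
    which can then be handled like any other missing value."""
    return [v if v in vocabulary else unknown_token for v in values]

encoded_test = encode(test_embarked, known)  # ["C", "<UNK>", "S"]
```

Libraries expose the same idea directly; for example, scikit-learn's `OneHotEncoder(handle_unknown="ignore")` encodes unseen test categories as an all-zeros row instead of raising an error.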
The “honest” thing to do is only ever look at what is inside your training set.
For a competition such as Titanic, where all the data is already known, limiting yourself to the training data won't get the highest leaderboard score: obviously, if you use statistics derived from the test set, you can score higher. That's fine for this particular competition, but it's a questionable strategy in practice, because the whole point of a test set is to stand in for genuinely unseen data.
You can get the highest score on the Titanic competition by looking up the actual historical records and making your test set “predictions” from those. But you probably won’t learn a lot about machine learning that way.