How to deal with missing categorical values and missing values in training and test datasets?

wgpubs · October 12, 2017, 12:17am

1. How should we handle missing values for a feature in test and training datasets?

By this I mean: do we apply whatever method we choose for handling missing values in feature A to the training and test datasets independently, or do we derive the fill values from the training dataset only and then reuse those values on the test data?

For example, in the Kaggle Titanic competition there are a bunch of missing values for the “Age” feature. Let’s say we decided to handle it by simply replacing all null values with the mean of the values that are there. If we do this on the training and test set independently, we will be replacing nulls with two separate values. If we simply calculate the mean against the training dataset, and use it to replace nulls on both training and test, we use the same value for nulls. Which is the right approach?
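To make the two options concrete, here is a minimal pandas sketch. The tiny `train` and `test` frames and their ages are made up for illustration; they are not the real Titanic data.

```python
import pandas as pd

# Toy stand-ins for the Titanic "Age" column (values invented for illustration).
train = pd.DataFrame({"Age": [22.0, 38.0, None, 30.0]})
test = pd.DataFrame({"Age": [None, 50.0, 14.0]})

# Option A: fill each dataset with its own mean (two different fill values).
train_a = train["Age"].fillna(train["Age"].mean())
test_a = test["Age"].fillna(test["Age"].mean())

# Option B: compute the mean on the training set only and reuse it for both.
train_mean = train["Age"].mean()
train_b = train["Age"].fillna(train_mean)
test_b = test["Age"].fillna(train_mean)

print("Option A test fill:", test_a.iloc[0])
print("Option B test fill:", test_b.iloc[0])
```

With Option A the fill value on the test set depends on the test data itself; with Option B the test rows are filled using only information learned from the training set.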

2. How do we handle categorical values that exist in test but not in train, and vice versa?

In my opinion, it seems like whatever values exist in your test dataset need to be thrown out if they don’t exist in your training dataset, or else should be treated as missing values. My thinking is that if the values aren’t there to train your model on, that would have a negative effect on its predictive capability when it sees a value it never saw in training. However, I’ve seen others combine the training and test datasets to include all categories, regardless of whether some exist in only one dataset.
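One way to sketch the "treat unseen categories as missing" idea is to fix the category set from the training data and let pandas turn anything else into a missing value. The column name `Embarked` is borrowed from the Titanic data, but the rows and the unseen value `"X"` are invented for illustration.

```python
import pandas as pd

# Toy data: "X" appears in test but never in train (invented for illustration).
train = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})
test = pd.DataFrame({"Embarked": ["S", "X", "C"]})

# Learn the set of allowed categories from the training data only.
known = sorted(train["Embarked"].dropna().unique())

# Values outside `known` become NaN when the categories are fixed up front.
train_cat = pd.Categorical(train["Embarked"], categories=known)
test_cat = pd.Categorical(test["Embarked"], categories=known)

# Integer codes: -1 marks a missing (here: unseen) category.
test_codes = test_cat.codes.tolist()

# One-hot encoding keeps exactly the training columns; the unseen row is all zeros.
one_hot = pd.get_dummies(pd.Series(test_cat))
print(test_codes)
print(one_hot)
```

Because the columns come from `known`, the encoded test set always has the same shape as the encoded training set, which is not guaranteed if each set is encoded separately.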

gnak · October 12, 2017, 9:34pm

Of the two methods you mention, the former appears to me to be the sounder one. Sounder still, it seems to me, is to use the mean conditional on the other attributes. I did this when I worked through the Titanic dataset:
https://github.com/gurbraj/ML/blob/master/kaggle/titanic_decision_tree/titanic_decision_tree.ipynb (under ‘Handle missing values’)

wgpubs · October 13, 2017, 12:59am

Nice notebook, @gnak. I’ve seen that technique in another Titanic notebook. If I recall correctly, that user derived missing Age values by grouping on Pclass, Sex, and Title.
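A rough sketch of that kind of group-wise fill, not the code from either notebook: it groups on `Pclass` and `Sex` only (deriving `Title` would need parsing the passenger names), uses the group mean as an assumed statistic, and learns the group values from the training set alone. The helper `impute_age` and all data values are hypothetical.

```python
import pandas as pd

# Toy data (invented for illustration).
train = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Sex": ["female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, 20.0, float("nan"), 24.0],
})
test = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["female", "male", "male"],
    "Age": [float("nan")] * 3,
})

keys = ["Pclass", "Sex"]

# Group means and the overall fallback are computed from the training set only.
group_means = train.groupby(keys)["Age"].mean().rename("AgeGroupMean").reset_index()
overall_mean = train["Age"].mean()

def impute_age(df):
    """Fill missing Age with the training group mean, else the training overall mean."""
    merged = df.merge(group_means, on=keys, how="left")
    merged.index = df.index  # a left merge keeps row order, so realign the index
    age = merged["Age"].fillna(merged["AgeGroupMean"]).fillna(overall_mean)
    return df.assign(Age=age)

train_filled = impute_age(train)
test_filled = impute_age(test)
print(test_filled)
```

The overall-mean fallback covers groups that never occur in training, such as the second-class passenger in the toy test set.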

machinethink · October 13, 2017, 12:01pm

The “honest” thing to do is only ever look at what is inside your training set.

For a competition such as Titanic where all the data is known already, limiting yourself to just the training data won’t get the highest leaderboard score – obviously, if you use data from the test set, you can score higher. That’s fine for this particular competition, but it’s a questionable strategy in practice since you want the test set to be representative of unseen data.

You can get the highest score on the Titanic competition by looking up the actual historical records and making your test set “predictions” from those. But you probably won’t learn a lot about machine learning that way.

wgpubs · October 14, 2017, 12:19am

Good advice.

Yah, I was wondering how folks were getting a perfect score on Titanic … now it all makes sense.