Best way to deal with missing values

Hello,
I was wondering how we deal with missing values for structured deep learning.
I saw that a lot of people say “just put 0” and the neural network will “figure out” that the data are missing.
Let’s forget embeddings for a second and say I have continuous data, temperature for example.
So I scale my temperature, which ranges from -40°C to +40°C, to a scale between 0 and 1 and feed that data into a fully connected layer.
Now a few of my entries are at 0°C and a few others have missing values. How can I tell the neural network that a value is missing rather than 0°C? Do I set these entries to 0 (or to the mean, 0.5?) and add a new feature missing_temperature that I set to 1? Or is there a better way to do this? I just want to confirm that I’m doing it the right way, as it’s hard for me to grasp this intuitively.
Thanks for your help :)

3 Likes

I think the idea behind filling missing values is to use a number that does not already appear in your dataset, e.g. something like -999 (as long as -999 wasn’t already in your dataset).

1 Like

But by doing this, won’t we be “breaking” the weights? Since they will be multiplied by this huge number, they may “explode”, won’t they?

1 Like

Just looked at the Rossman notebook.

for df in (joined,joined_test):
    df.CompetitionOpenSinceYear = df.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)

You can see that the missing values of the column 'CompetitionOpenSinceYear' of the joined df were filled with 1900.

-999 was just an example. Maybe for each column that has missing values, you can find the minimum value and go one below that minimum. This way you are putting in a new value that did not exist before and is also in roughly the same range as the others.
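
Here is a minimal sketch of that "one below the minimum" idea, in case it helps. The helper name `fill_below_min` and the toy values are made up for illustration (they are not from the Rossmann notebook), and it assumes the columns are numeric.

```python
import numpy as np
import pandas as pd

def fill_below_min(df, columns):
    """Fill NaNs in each listed numeric column with (column minimum - 1).

    Hypothetical helper for illustration only. Returns a filled copy and
    the fill value used per column, so the same values can be reused later.
    """
    out = df.copy()
    fill_values = {}
    for col in columns:
        # Series.min() skips NaN by default, so this is the smallest observed value.
        fill_values[col] = out[col].min() - 1
        out[col] = out[col].fillna(fill_values[col])
    return out, fill_values

# Toy data: the column name mirrors the thread's example, the values are invented.
toy = pd.DataFrame(
    {"CompetitionOpenSinceYear": [2008.0, np.nan, 2012.0, np.nan, 1999.0]}
)
filled, used = fill_below_min(toy, ["CompetitionOpenSinceYear"])
```

You would probably want to compute the fill values on the training set only and reuse them for the test set, otherwise the two sets could end up with different "missing" markers.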

2 Likes

Filling in the missing values with the mean/median of their columns is also a method suggested in the Python for Data Analysis book.

3 Likes

Note that sometimes your treatment of NAs depends on whether you are using a tree-based algorithm (random forest, gradient boosting) or a neural net (or a linear model such as logistic regression). Trees split on values, which is why using a value that is not in your dataset works (for example, -999).

In general you can do this. For a numeric variable, you can add a new column X1_is_NA, which is 1 if the original X1 column was NA and 0 otherwise. Then you replace the NA values in the original column with the mean/median. For a categorical variable, you can leave NA as a new category.
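
A rough sketch of the recipe above, assuming pandas. The function name `add_na_indicators`, the column names, and the toy data are all hypothetical, and the categorical part assumes plain string columns rather than the pandas `category` dtype (where the new level would have to be added first).

```python
import numpy as np
import pandas as pd

def add_na_indicators(df, numeric_cols, categorical_cols, fill_values=None):
    """Add <col>_is_NA flags, median-fill numeric NAs, and make NA a category.

    Hypothetical helper for illustration. Pass the fill_values returned from
    the training set when transforming the test set.
    """
    out = df.copy()
    fill_values = dict(fill_values or {})
    for col in numeric_cols:
        # 1 where the original value was missing, 0 otherwise.
        out[col + "_is_NA"] = out[col].isna().astype(np.int8)
        if col not in fill_values:
            fill_values[col] = out[col].median()  # median() ignores NaN
        out[col] = out[col].fillna(fill_values[col])
    for col in categorical_cols:
        # Treat missing as its own level, labelled "NA" here.
        out[col] = out[col].fillna("NA")
    return out, fill_values

# Toy data, invented for this example.
train = pd.DataFrame({
    "temperature": [-10.0, np.nan, 0.0, 20.0, np.nan],
    "store_type": ["a", None, "b", "a", None],
})
prepared, medians = add_na_indicators(train, ["temperature"], ["store_type"])
```

In this toy data the median happens to be 0, so the filled rows look exactly like the real 0 reading in the temperature column. The temperature_is_NA column is what lets the model tell them apart, which is the point of the original question.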

11 Likes