Missing values in time-series data


(Sean Aubin) #1

I have a large dataset with a number of missing values. I was wondering what’s the current best practice for dealing with missing values/data when you have temporal data. From reading a little bit, the typical approaches seem to be:

1. Replace missing data with an impossible value

If you replace each missing value with an impossible value (such as -1, if the rest of your data is positive), the neural net will “figure it out”, as suggested in this discussion.

2. Drop the missing values

Just don’t bother with them.

3. Data imputation

The most basic version of this is to replace each missing value with the mean of the surrounding values. However, a variety of neural networks have been created to perform imputation while doing inference/regression/forecasting. For example, in "Recurrent Neural Networks for Multivariate Time Series with Missing Values", Che et al. create a modified GRU unit (GRU-D). This paper has been cited many times, but I’m a bit hesitant to search through the citations.
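To make the three basic options concrete, here is a minimal sketch in plain Python, assuming a single univariate series where missing samples are stored as NaN. The function names are made up for illustration, and the 0/1 mask returned alongside the sentinel fill is an extra assumption (it is not part of the suggestion in the linked discussion).

```python
import math

def sentinel_fill(values, sentinel=-1.0):
    """Approach 1: replace missing entries with an impossible value.

    Also returns a 0/1 mask (1 = observed), which is an optional extra.
    """
    mask = [0 if math.isnan(v) else 1 for v in values]
    filled = [sentinel if math.isnan(v) else v for v in values]
    return filled, mask

def drop_missing(timestamps, values):
    """Approach 2: drop missing samples, keeping the timestamps of the rest."""
    return [(t, v) for t, v in zip(timestamps, values) if not math.isnan(v)]

def neighbor_mean_fill(values):
    """Approach 3 (basic): mean of the nearest observed value on each side.

    If only one side has an observed value, that value is used on its own.
    If nothing is observed at all, the entry stays NaN.
    """
    n = len(values)
    out = list(values)
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        left = next((values[j] for j in range(i - 1, -1, -1)
                     if not math.isnan(values[j])), None)
        right = next((values[j] for j in range(i + 1, n)
                      if not math.isnan(values[j])), None)
        neighbors = [x for x in (left, right) if x is not None]
        if neighbors:
            out[i] = sum(neighbors) / len(neighbors)
    return out

series = [1.0, math.nan, 3.0, math.nan, math.nan, 6.0]
filled, mask = sentinel_fill(series)
kept = drop_missing(range(len(series)), series)
imputed = neighbor_mean_fill(series)
```

Note that dropping samples makes the spacing between the remaining points irregular, which is why the sketch keeps the timestamps next to the values.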

Has anyone read up on this topic or have practical experience with it? I’m curious if the custom neural-net architectures are “worth it” in practice. Do public implementations of the neural network architectures for missing data exist? I ask because I’m getting better at PyTorch, but doubt my ability to implement the method mentioned in the paper.


(Sean Aubin) #2

One method of robust data imputation is "BRITS: Bidirectional Recurrent Imputation for Time Series" by Cao et al. According to the paper, this network architecture outperforms the GRU-D cited in the original post and functions well even if the dataset is missing 78% of the data.

The implementation in Python 2.7 + PyTorch is available on GitHub.
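
For anyone who wants intuition for the GRU-D baseline before reading either codebase, here is a rough sketch of its input-decay idea as I understand it from Che et al.: a missing input is replaced by a blend of the last observed value and the feature mean, with the weight on the last observation decaying as the gap grows. This is only the input side for one feature, in plain Python. The function name is hypothetical, and `w` and `b` are fixed stand-ins for parameters the real model learns, so treat it as an illustration and not as the paper's implementation.

```python
import math

def decay_impute(values, timestamps, mean, w=1.0, b=0.0):
    """Fill NaN entries by decaying the last observation toward `mean`.

    gamma = exp(-max(0, w * delta + b)), where delta is the time since
    the last observed value. w and b are illustrative constants here.
    """
    out = []
    last_value = mean            # used until the first real observation
    last_time = timestamps[0]
    for t, v in zip(timestamps, values):
        if math.isnan(v):
            delta = t - last_time
            gamma = math.exp(-max(0.0, w * delta + b))
            out.append(gamma * last_value + (1.0 - gamma) * mean)
        else:
            out.append(v)
            last_value, last_time = v, t
    return out

# With w = ln(2), the weight on the last observation halves per unit of time.
result = decay_impute([2.0, math.nan, math.nan], [0, 1, 2],
                      mean=0.0, w=math.log(2))
```

With short gaps the filled value stays close to the last observation, and with long gaps it drifts toward the mean.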