CNN better than LSTM/GRU for time series

Jeremy has been saying that CNNs may take over by the end of the year.
What would be the best solution for a time series with parallel inputs, the kind of problem that was normally solved with an LSTM/GRU before? For example, predicting the temperature in one place from 10 other places that report their temperature at the same time, over, say, 100 time steps. Is a CNN on an input of 11 rows by 100 columns the best solution? Will a 3*3 filter find correlations between the first row and the 11th row? Before this course I thought the solution was an LSTM/GRU with a moving window. So will the CNN solution ignore the moving window and treat the data as just separate screenshots, with every time step as a different picture? What would a good solution look like with all the tools available in this course?
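To make the layout in the question concrete, here is a minimal NumPy sketch of one possible framing, not a recommended architecture: the 11 series are treated as input channels of a 1D convolution over time rather than as rows of an image. The function name `conv1d_multichannel` and the shapes are assumptions made for illustration. Under this framing every filter sees all 11 series at once, so the first and the 11th series can be combined by a single filter regardless of how far apart the rows are.

```python
import numpy as np

def conv1d_multichannel(x, w):
    """'Valid' 1D convolution over time.

    x: array of shape (channels, steps), e.g. (11, 100)
    w: array of shape (filters, channels, width)
    returns: array of shape (filters, steps - width + 1)
    """
    channels, steps = x.shape
    filters, w_channels, width = w.shape
    if channels != w_channels:
        raise ValueError("filter channels must match input channels")
    out = np.zeros((filters, steps - width + 1))
    for t in range(steps - width + 1):
        window = x[:, t:t + width]  # all series, `width` consecutive time steps
        # Sum over both the channel axis and the time axis of the window.
        out[:, t] = np.tensordot(w, window, axes=([1, 2], [0, 1]))
    return out

# Hypothetical data: 11 temperature series, 100 time steps each.
rng = np.random.default_rng(0)
x = rng.normal(size=(11, 100))
w = rng.normal(size=(4, 11, 3))  # 4 filters, each spanning all 11 series and 3 steps
features = conv1d_multichannel(x, w)
```

The sliding of the filter along the time axis plays a role similar to the moving window mentioned in the question, with the window length set by the filter width.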


You could look into dilated convolutions (e.g. see Wavenet). Also see


Thank you!!

Thanks Jeremy,

I’m glad Rolfe asked this question because I was wondering about it too.

I searched for Wavenet and landed here, and also looked into the paper you posted.

Here is my quick interpretation for anyone who doesn’t want to read the papers. I might be wrong though, so happy to be corrected.

Wavenet is designed to take in speech and convert it to speech but from a different speaker. Since audio is a non-fixed-length input stream you would normally consider an RNN for this kind of task, but the authors observe that for this problem the output length could easily be the same as the input length. They also employ something new called a “causal CNN”, which appears to be a convolution that has its output shifted in the feature map layer. It’s “causal” because its output can never affect an output from an earlier part of the signal (I think I’m confused about how this really works). Another trick they use is to dilate these causal CNNs, which basically means adding holes into the CNN’s filter so that it skips parts of the signal. These holes mean that the CNN can have a larger receptive field, thus letting the CNN model longer-range dependencies, much like an RNN. I think that the size of the dilation of the CNNs and the time difference for the causal CNN are hyper-parameters of the model.
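For anyone who finds the "causal" and "dilated" parts easier to follow in code, here is a minimal NumPy sketch of the general idea, not the Wavenet implementation itself. The function names `causal_dilated_conv1d` and `receptive_field` are made up for illustration. The input is padded with zeros on the left only, so the output at time t is computed from the input at time t and earlier, and the dilation is the gap between the input positions that the filter taps read.

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """Compute y[t] = sum_k w[k] * x[t - k * dilation], treating x as zero before t = 0."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    pad = (len(w) - 1) * dilation
    # Pad on the left only, so no future value can leak into y[t].
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for k, wk in enumerate(w):
        start = pad - k * dilation  # tap k looks k * dilation steps into the past
        y += wk * xp[start:start + len(x)]
    return y

def receptive_field(kernel_size, dilations):
    """Number of input steps visible to one output of a stack of such layers."""
    return 1 + (kernel_size - 1) * sum(dilations)

# Hypothetical signal and a two-tap filter that adds x[t] and x[t - 2].
signal = np.arange(1, 9)
out = causal_dilated_conv1d(signal, [1.0, 1.0], dilation=2)
```

With a filter of width 2 and dilations doubling per layer (1, 2, 4, 8), `receptive_field(2, [1, 2, 4, 8])` gives 16 input steps from only four layers, which is the sense in which the holes buy a larger receptive field.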

The other paper, “Artificial Neural Networks applied to Taxi prediction”, tries to address the problem of predicting a taxi’s final destination given a random snapshot of its trajectory. That is, given a sequence of GPS coordinates and some metadata, predict the final GPS coordinate. It’s difficult because the GPS coordinates form a variable-length sequence.
It seems to use mostly techniques that we have seen in the course, although CNNs don’t appear. The paper is nice in that it compares a variety of models and has a nice explanation of why you might want to use an RNN with both a forward and a backward pass. However, I’m not convinced that their “winning” solution is all that elegant. In the winning solution they merely grab the first and last 7 GPS points and then feed those into a dense network.
They do a nice trick to make the learning task easier by getting the model to predict the likelihood of the taxi arriving at one of several predetermined common taxi endpoints. I also like the use of embeddings for things other than words, such as taxi numbers. It makes sense, but I had never thought about this kind of data like that.


This could be because the test set is so small. But note that:

  • They won easily! And this was attached to an academic conference and had hundreds of entries. So the approach they show (i.e. use embeddings plus DL) is powerful
  • They show an even better approach in the paper, based on results on a larger test set

There are a lot of elegant details in the paper. E.g. they use a softmax layer as the penultimate layer, in order to handle the common cases of a few frequent destinations (e.g. the airport) but also handle the long tail. I think any write-up of a competition-winning solution is worth serious study, because it shows what really works, rather than what is just theoretically neat.
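As a rough illustration of the softmax-as-penultimate-layer idea, here is a minimal NumPy sketch. It assumes a fixed table of candidate destination coordinates and a final output that is the softmax-weighted average of those candidates. The names `predict_destination` and `centroids`, and the coordinates used, are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D array of scores."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / e.sum()

def predict_destination(logits, centroids):
    """Blend candidate destinations using softmax probabilities.

    logits:    shape (n_candidates,), scores from the previous layer
    centroids: shape (n_candidates, 2), one (lat, lon) pair per candidate
    returns:   shape (2,), the probability-weighted average coordinate
    """
    p = softmax(logits)
    return p @ np.asarray(centroids, dtype=float)

# Hypothetical candidates: two frequent destinations.
centroids = np.array([[0.0, 0.0], [2.0, 4.0]])
confident = predict_destination([100.0, 0.0], centroids)  # nearly all mass on the first
uncertain = predict_destination([0.0, 0.0], centroids)    # equal mass on both
```

When the softmax is confident, the prediction sits on one frequent destination. When the mass is spread out, the prediction lands somewhere between the candidates, which is one way such a layer can cover both the common cases and the long tail.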

Your writeup of the Wavenet paper is terrific. Thanks for contributing that - would make for the basis of a nice blog post, if you have the time; no-one has written a clear description of the paper for a more general readership yet AFAIK.


It’s quite fantastic - we’ll be studying it in depth in part 2. Perhaps the most underappreciated part of DL at the moment, IMHO…


It’s not only about the Wavenet paper, but I found this to be a really great blog post about using CNNs for sequence data:
