Thanks Jeremy,
I’m glad Rolfe asked this question because I was wondering about it too.
I had a search for WaveNet and landed on the paper itself, https://arxiv.org/pdf/1609.03499.pdf, and I also looked into the paper you posted.
Here is my quick interpretation for anyone who doesn’t want to read the papers. I might be wrong though, so happy to be corrected.
WaveNet is a generative model for raw audio: it predicts the waveform one sample at a time, each sample conditioned on all the earlier ones. The paper's headline application is text-to-speech, and conditioning the model on a speaker identity also lets a single network produce different voices. Since audio is a variable-length stream you would normally consider an RNN for this kind of task, but here every output sample lines up with the input timeline, so the authors instead build the whole model out of convolutions. The first ingredient is a "causal" convolution: the convolution's output is shifted (equivalently, the input is padded only on the left) so that the output at time t depends only on inputs at time t and earlier; no output can ever depend on a later part of the signal. The second trick is to dilate these causal convolutions, which basically means punching holes into the filter so that it skips over parts of the signal. The holes give each layer a larger receptive field, and stacking layers with increasing dilation makes the receptive field grow exponentially with depth, letting the CNN model long-range dependencies much like an RNN. The filter sizes and the dilation at each layer are hyper-parameters of the model.
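To make the causal + dilated idea concrete, here's a tiny numpy sketch (my own illustration, not code from the paper; the function name and the zero-padding choice are mine):

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """1-D causal convolution with dilation.

    x: input signal, shape (T,)
    w: filter taps, shape (K,) -- w[0] multiplies the oldest sample
    dilation: step between the taps; the receptive field is
              (K - 1) * dilation + 1 samples, so stacking layers with
              growing dilation widens it very quickly.

    Causality comes from padding the input only on the left, so y[t]
    uses x[t], x[t - d], x[t - 2d], ... and never any future sample.
    """
    K = len(w)
    pad = (K - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        taps = x_padded[t : t + pad + 1 : dilation]  # strictly backwards in time
        y[t] = np.dot(w, taps)
    return y

# Quick check: perturbing a future sample never changes an earlier output.
x = np.random.randn(16)
w = np.random.randn(3)
y1 = causal_dilated_conv1d(x, w, dilation=2)
x2 = x.copy()
x2[10] += 5.0                          # change the signal at t = 10
y2 = causal_dilated_conv1d(x2, w, dilation=2)
assert np.allclose(y1[:10], y2[:10])   # outputs before t = 10 are untouched
```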
The other paper, "Artificial Neural Networks Applied to Taxi Destination Prediction", tackles the problem of predicting a taxi's final destination given a random snapshot of its trajectory. That is: given a sequence of GPS coordinates and some meta-data, predict the final GPS coordinate. It's difficult because the GPS trace is a variable-length sequence.
It mostly uses techniques that we have seen in the course, although CNNs don't appear. The paper is nice in that it compares a variety of models, and it has a good explanation of why you might want a bidirectional RNN (one pass running forwards over the trajectory and one running backwards). However, I'm not convinced that their "winning" solution is all that elegant: it simply grabs the first and last 7 GPS points of the trajectory and feeds those into a dense network, as sketched below.
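Here's roughly what that input trick looks like in numpy (my own sketch; the helper name, the k=7 from my reading above, and the padding rule for short trips are assumptions, so the paper's exact handling may differ):

```python
import numpy as np

def fixed_size_trace(coords, k=7):
    """Collapse a variable-length GPS trace to a fixed-size vector.

    coords: array of shape (T, 2) -- (latitude, longitude) per point.
    Returns the first k and last k points flattened into a (4*k,)
    vector that a plain dense network can consume. Trips shorter than
    k points are padded by repeating the last observed point (one
    reasonable choice, not necessarily the paper's); for trips between
    k and 2k points the head and tail simply overlap.
    """
    coords = np.asarray(coords, dtype=float)
    if len(coords) < k:
        pad = np.repeat(coords[-1:], k - len(coords), axis=0)
        coords = np.concatenate([coords, pad])
    head = coords[:k]
    tail = coords[-k:]
    return np.concatenate([head.ravel(), tail.ravel()])

# A 10-point trip becomes a fixed 4*k-dimensional input.
trip = np.cumsum(np.random.randn(10, 2) * 0.001, axis=0) + [41.15, -8.61]
x = fixed_size_trace(trip, k=7)
print(x.shape)   # (28,)
```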
They do use a nice trick to make the learning task easier: instead of regressing the coordinates directly, the model predicts the likelihood of the taxi arriving at each of a set of predetermined common endpoints (clusters of the training destinations), and the final prediction is the probability-weighted average of those cluster centres. I also liked the use of embeddings for things other than words, such as taxi IDs. It makes sense, but I had never thought about that kind of data in those terms.
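A quick sketch of that output layer, plus what an embedding for a categorical column boils down to (again my own illustration; all names and sizes here are made up):

```python
import numpy as np

def predicted_destination(logits, centroids):
    """Turn the network's raw output into a GPS prediction.

    logits:    (C,) raw scores, one per predetermined common endpoint
    centroids: (C, 2) the (lat, lon) of those endpoints, found by
               clustering the training-set destinations beforehand.
    The softmax weights give a convex combination of the centroids,
    so the prediction always lands somewhere plausible on the map,
    and the hard part of the task becomes a classification problem.
    """
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    return p @ centroids                # (2,) predicted (lat, lon)

# Embeddings for categorical meta-data are just learned lookup tables.
# Sizes here are invented for illustration:
n_taxis, emb_dim = 500, 10
taxi_embedding = np.random.randn(n_taxis, emb_dim) * 0.01  # trained with the rest of the net
taxi_vec = taxi_embedding[42]   # row for taxi #42, concatenated onto the GPS features
```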