Deep gates in RNNs?

(Alan O'Donnell) #1

I’ve been learning about LSTMs and GRUs lately, and one question that occurred to me is whether it makes sense to use deeper networks for the gates. I tried googling around but haven’t found any resources yet; if you can set me straight on why that would be a bad idea, or point me in the right direction, that would be great :slight_smile:

As an example of what I mean, the update rule for a GRU’s hidden state is an interpolation between the old state, h_{t-1}, and a new candidate state, \tilde{h}_t:

h_t = h_{t-1} + z_t * (\tilde{h}_t - h_{t-1})

The interpolation coefficients z_t (one for each slot in the state vector), also known as the “update gate”, are all between 0 and 1. They’re a function of the previous state and the current input, and they’re calculated in essentially the simplest way that could work:

z_t = sigmoid(W_zx x_t + W_zh h_{t-1})
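To make the two formulas above concrete, here’s a minimal NumPy sketch of the standard update gate and the interpolation step. The weight shapes and sizes are made up for illustration, and I’ve omitted the bias term to match the formula as written (real implementations usually include one):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8  # illustrative sizes

# Gate parameters (bias omitted, matching the formula above).
W_zx = rng.standard_normal((n_hid, n_in)) * 0.1
W_zh = rng.standard_normal((n_hid, n_hid)) * 0.1

def update_gate(x_t, h_prev):
    # z_t = sigmoid(W_zx x_t + W_zh h_{t-1}); each slot lands in (0, 1)
    return sigmoid(W_zx @ x_t + W_zh @ h_prev)

def gru_state_update(h_prev, h_tilde, z_t):
    # h_t = h_{t-1} + z_t * (h~_t - h_{t-1}): per-slot interpolation
    return h_prev + z_t * (h_tilde - h_prev)

x_t = rng.standard_normal(n_in)
h_prev = rng.standard_normal(n_hid)
h_tilde = np.tanh(rng.standard_normal(n_hid))  # stand-in candidate state

z_t = update_gate(x_t, h_prev)
h_t = gru_state_update(h_prev, h_tilde, z_t)
```

Since each z_t slot is in (0, 1), every slot of h_t is a convex combination of the corresponding slots of h_{t-1} and \tilde{h}_t.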

My question is whether it would make sense to generalize this: instead of calculating the gate with a single-layer network, what would happen if you used something fancier, like a two-layer network? Or really anything that maps h_{t-1} and x_t to suitable interpolation coefficients. I’m curious whether that could lead to better state updates (at the expense of more computation and presumably harder training).
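For instance, the “deep gate” I have in mind would just insert a hidden nonlinearity before the sigmoid. This is purely a sketch of the idea, not something from a paper; all the names and sizes are made up:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8  # illustrative sizes

# Two-layer gate: an extra tanh layer between the inputs and the sigmoid.
W1 = rng.standard_normal((n_hid, n_in + n_hid)) * 0.1
W2 = rng.standard_normal((n_hid, n_hid)) * 0.1

def deep_update_gate(x_t, h_prev):
    u = np.concatenate([x_t, h_prev])
    hidden = np.tanh(W1 @ u)       # the extra layer
    return sigmoid(W2 @ hidden)    # still squashed to (0, 1)

z_t = deep_update_gate(rng.standard_normal(n_in),
                       rng.standard_normal(n_hid))
```

The rest of the GRU update would be unchanged; only the map from (x_t, h_{t-1}) to z_t gets deeper.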