Deep gates in RNNs?

I’ve been learning about LSTMs and GRUs lately, and one question that occurred to me is whether it makes sense to use deeper networks for the gates. I tried googling around but haven’t found any resources yet; if you can set me straight on why that would be a bad idea or point me in the right direction, that would be great :slight_smile:

As an example of what I mean, the update rule for a GRU’s hidden state is an interpolation between the old state, h_{t-1}, and a new candidate state, \tilde{h}_t:

h_t = h_{t-1} + z_t * (\tilde{h}_t - h_{t-1})

The interpolation coefficients z_t (one for each slot in the state vector), also known as the “update gate”, are all between 0 and 1. They’re a function of the previous state and the current input, and they’re calculated in essentially the simplest way that could work:

z_t = sigmoid(W_{zx} x_t + W_{zh} h_{t-1})
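To make sure I’ve understood the standard version, here’s a minimal sketch of that update in PyTorch (the class name `SimpleGatedCell` and the `nn.Linear` parameterization are just my own choices, and I’ve left out the reset gate to keep the focus on z_t):

```python
import torch
import torch.nn as nn

class SimpleGatedCell(nn.Module):
    """One step of a GRU-style gated update (reset gate omitted)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Single linear layer for the update gate: combines W_zx and W_zh (plus a bias).
        self.z_gate = nn.Linear(input_size + hidden_size, hidden_size)
        # Single linear layer producing the candidate state.
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        xh = torch.cat([x_t, h_prev], dim=-1)
        z_t = torch.sigmoid(self.z_gate(xh))        # interpolation coefficients in (0, 1)
        h_tilde = torch.tanh(self.candidate(xh))    # candidate state
        # h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde, written as in the equation above
        return h_prev + z_t * (h_tilde - h_prev)
```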

My question is whether it would make sense to generalize this: instead of calculating the gate with a single-layer network, what would happen if you did something fancier, like a two-layer network? Or really anything, as long as it maps h_{t-1} and x_t to suitable interpolation coefficients. I’m curious whether that could lead to better state updates (at the expense of more computation and presumably more difficult training).
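For concreteness, this is the kind of thing I’m imagining, again just a sketch: the gate is now a small two-layer MLP, ending in a sigmoid so the coefficients stay in (0, 1). The inner width `gate_hidden` is an arbitrary number I made up.

```python
import torch
import torch.nn as nn

class DeepGateCell(nn.Module):
    """Same gated update as above, but the update gate is a two-layer network."""
    def __init__(self, input_size, hidden_size, gate_hidden=64):
        super().__init__()
        # Two-layer "deep" gate instead of a single linear map.
        self.z_gate = nn.Sequential(
            nn.Linear(input_size + hidden_size, gate_hidden),
            nn.Tanh(),
            nn.Linear(gate_hidden, hidden_size),
        )
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        xh = torch.cat([x_t, h_prev], dim=-1)
        z_t = torch.sigmoid(self.z_gate(xh))        # deeper gate, same role as before
        h_tilde = torch.tanh(self.candidate(xh))
        return h_prev + z_t * (h_tilde - h_prev)
```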