I’ve been learning about LSTMs and GRUs lately, and one question that occurred to me is whether it makes sense to use deeper networks for the gates. I tried googling around but haven’t found any resources yet. If you can set me straight on why that would be a bad idea, or point me in the right direction, that would be great.

As an example of what I mean, the update rule for a GRU’s hidden state is an interpolation between the old state, `h_{t-1}`, and a new candidate state, `\tilde{h}_t`:

```
h_t = h_{t-1} + z_t * (\tilde{h}_t - h_{t-1})
```

The interpolation coefficients `z_t` (one for each slot in the state vector), also known as the “update gate”, are all between 0 and 1. They’re a function of the previous state and the current input, and they’re calculated in essentially the simplest way that could work:

```
z_t = sigmoid(W_zx x_t + W_zh h_{t-1})
```
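To make the setup concrete, here’s a minimal numpy sketch of the gate and the interpolation update as written above. The dimensions and weights are made up for illustration, biases and the reset gate are omitted to match the simplified formulas, and the candidate state is just a random stand-in rather than the real GRU candidate computation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 3, 4                            # input and hidden sizes (arbitrary)
W_zx = rng.standard_normal((d_h, d_x))     # gate weights for the input
W_zh = rng.standard_normal((d_h, d_h))     # gate weights for the previous state

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def update_gate(x_t, h_prev):
    # z_t = sigmoid(W_zx x_t + W_zh h_{t-1})
    return sigmoid(W_zx @ x_t + W_zh @ h_prev)

x_t = rng.standard_normal(d_x)
h_prev = rng.standard_normal(d_h)
h_tilde = np.tanh(rng.standard_normal(d_h))  # stand-in for the candidate state

z_t = update_gate(x_t, h_prev)
h_t = h_prev + z_t * (h_tilde - h_prev)      # per-slot interpolation
```

Since each `z_t` entry is in (0, 1), each slot of `h_t` lands somewhere between the corresponding slots of `h_{t-1}` and `\tilde{h}_t`.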

My question is whether it would make sense to generalize this: instead of calculating the gate with a single-layer network, what would happen if you did something fancier, like a two-layer network? Or really anything, as long as it maps `h_{t-1}` and `x_t` to suitable interpolation coefficients. I’m curious whether that could lead to better state updates (at the expense of more computation and presumably more difficult training).