I’ve been learning about LSTMs and GRUs lately, and one question that occurred to me is whether it makes sense to use deeper networks for the gates. I tried googling around but haven’t found any resources yet; if you can set me straight on why that would be a bad idea, or point me in the right direction, that would be great.
As an example of what I mean, the update rule for a GRU’s hidden state is an interpolation between the old state, h_{t-1}, and a new candidate state, \tilde{h}_t:
h_t = h_{t-1} + z_t * (\tilde{h}_t - h_{t-1})
The interpolation coefficients z_t (one for each slot in the state vector), also known as the “update gate”, are all between 0 and 1. They’re a function of the previous state and the current input, and they’re calculated in essentially the simplest way that could work:
z_t = sigmoid(W_zx x_t + W_zh h_{t-1})
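To make sure I have the standard version right, here’s a minimal numpy sketch of the update described above. Biases are omitted to match the formula, and the candidate state \tilde{h}_t is taken as given (the function name and argument names are just my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_update(h_prev, x, W_zx, W_zh, h_tilde):
    # Update gate: elementwise coefficients in (0, 1), computed by a
    # single linear layer followed by a sigmoid.
    z = sigmoid(W_zx @ x + W_zh @ h_prev)
    # Interpolate between the old state and the candidate state.
    return h_prev + z * (h_tilde - h_prev)
```

With all-zero weights, z is 0.5 everywhere and the new state is just the midpoint of h_{t-1} and \tilde{h}_t, which matches the interpolation reading of the update rule.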
My question is whether it would make sense to generalize this: instead of calculating the gate with a single-layer network, what would happen if you used something fancier, like a two-layer network? Or really anything, as long as it maps h_{t-1} and x_t to suitable interpolation coefficients. I’m curious whether that could lead to better state updates (at the expense of more computation and presumably harder training).