I’m seeking an intuitive explanation of one of the features of AWD-LSTM.

Notebook 12a includes a Python version of AWD-LSTM, in which Jeremy splits the input into 4 pieces and runs each one through a gate in the `forward` method of the `LSTMCell` class. That seems to match the equations given. However, I noticed that in the equations from Colah’s *Understanding LSTM Networks*, the *entire* input seems to be used for each of the gates.

Splitting the input into 4 pieces doesn’t make sense to me intuitively, since I don’t see how you would decide which piece goes through which gate. Is there a trick I’m missing, like randomizing the split or perhaps pre-copying the matrix so that when you split it, you are back to a single copy for each gate? I don’t quite see that in the code.
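
One reading that might reconcile the two, sketched here under assumptions rather than taken from the notebook: if the thing being split is the *output* of a single linear layer with `4*nh` outputs, and not the input itself, then each piece would still be computed from the entire input. The pure-Python sketch below checks that idea on pre-activations only (no bias, no sigmoid/tanh). The sizes `ni` and `nh`, the gate order, and the helper `matvec` are all hypothetical names chosen for illustration.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

random.seed(0)
ni, nh = 3, 2  # hypothetical input size and hidden size

# One weight matrix per gate, each of shape (nh, ni): every gate sees all of x.
gate_names = ["input", "forget", "output", "cell"]  # assumed order, for illustration
per_gate_W = {
    g: [[random.uniform(-1, 1) for _ in range(ni)] for _ in range(nh)]
    for g in gate_names
}

# Fused version: stack the four matrices into one of shape (4*nh, ni).
fused_W = [row for g in gate_names for row in per_gate_W[g]]

x = [random.uniform(-1, 1) for _ in range(ni)]

# One big multiply, then split the *result* into 4 chunks of length nh.
fused_out = matvec(fused_W, x)
chunks = [fused_out[i * nh:(i + 1) * nh] for i in range(4)]

# Four separate multiplies, each applied to the full input.
separate = [matvec(per_gate_W[g], x) for g in gate_names]

# Compare the two routes element by element.
same = all(
    abs(a - b) < 1e-12
    for c, s in zip(chunks, separate)
    for a, b in zip(c, s)
)
print(same)
```

If that is what the code is doing, then which piece goes to which gate would just be a convention about how the rows of the big weight matrix are ordered, and no randomizing or pre-copying would be needed.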

Thanks,

John