I’m seeking an intuitive explanation of one of the features of AWD-LSTM.
Notebook 12a includes a Python version of AWD-LSTM, in which Jeremy splits the input into 4 pieces and runs each one through a gate in the
forward method of the LSTMCell Class. That seems to match the equations given. However, I noticed that in Colah’s Understanding LSTM Networks equations, the entire input seems to be used for each of the gates.
Splitting the input into 4 pieces doesn’t make sense to me intuitively, since I don’t see how you would decide which piece goes through which gate. Is there a trick I’m missing, like randomizing the split or perhaps pre-copying the matrix so that when you split it, you are back to a single copy for each gate? I don’t quite see that in the code.