Hi All!
I was wondering if there was a particular reason why the gradients of the hidden state of the AWD-LSTM are detached (relevant line of code).
I’ve noticed a number of people posting issues on GitHub and questions here regarding how to use AWD-LSTM hidden states in their networks, and the answer usually seems to be to edit the AWD_LSTM object such that hidden states are no longer detached (see Need help with awd-lstm translator model).
My best guess is that detaching helps performance (truncated backprop through time keeps the graph small), and that it's harmless as long as the hidden states aren't used later in the network?
Would it be desirable to have a parameter on AWD_LSTM that turns detaching the hidden states on and off?
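To make the suggestion concrete, here's a minimal sketch of what such a parameter could look like. This is not fastai's actual `AWD_LSTM` (which has embedding/weight dropout and more); the `detach_hidden` flag is hypothetical, and the module is just a toy stand-in to show where the toggle would go:

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    """Toy stand-in for AWD_LSTM, illustrating a hypothetical
    `detach_hidden` flag (not a real fastai parameter)."""
    def __init__(self, n_in=8, n_hid=16, detach_hidden=True):
        super().__init__()
        self.rnn = nn.LSTM(n_in, n_hid, batch_first=True)
        self.detach_hidden = detach_hidden
        self.hidden = None  # carried across batches

    def forward(self, x):
        out, self.hidden = self.rnn(x, self.hidden)
        if self.detach_hidden:
            # Truncated BPTT: cut the graph so gradients don't flow
            # back into previous batches through the carried state.
            self.hidden = tuple(h.detach() for h in self.hidden)
        return out
```

With `detach_hidden=False`, the hidden states stay attached to the graph and can be backpropagated through by downstream layers (e.g. a translator decoder), at the cost of a growing graph unless you detach somewhere yourself.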
Cheers!
Joseph