Why are AWD_LSTM hidden states detached?

Hi All!

I was wondering if there is a particular reason why the hidden states of AWD_LSTM are detached from the computation graph, so that no gradients flow back through them (relevant line of code).

I’ve noticed a number of people posting issues on GitHub and questions here regarding how to use AWD_LSTM hidden states in their networks, and the answer usually seems to be to edit the AWD_LSTM object such that hidden states are no longer detached (see "Need help with awd-lstm translator model").

My best guess is that detaching is better for performance when the hidden states aren’t used later in the network. Is that right?

Would it be desirable to have a parameter on AWD_LSTM that turns detaching the hidden states on and off?
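
To make the idea concrete, here is a minimal sketch of what such a switch could look like. It uses plain PyTorch rather than fastai's actual AWD_LSTM code; the class name `StatefulLSTM` and the `detach_hidden` flag are hypothetical and only there for illustration.

```python
import torch
import torch.nn as nn


class StatefulLSTM(nn.Module):
    """Toy stateful LSTM, NOT fastai's AWD_LSTM. `detach_hidden` is a hypothetical flag."""

    def __init__(self, input_size, hidden_size, detach_hidden=True):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.detach_hidden = detach_hidden
        self.hidden = None  # (h, c) carried over from the previous batch

    def forward(self, x):
        out, new_hidden = self.lstm(x, self.hidden)
        if self.detach_hidden:
            # Cut the graph so later backward passes stop at this batch boundary.
            self.hidden = tuple(h.detach() for h in new_hidden)
        else:
            # Keep the graph so other modules can backpropagate through the state.
            self.hidden = new_hidden
        return out

    def reset(self):
        # Forget the carried-over state, e.g. at the start of a new sequence.
        self.hidden = None


torch.manual_seed(0)
x = torch.randn(2, 5, 3)  # (batch, seq_len, features)

detached = StatefulLSTM(3, 4, detach_hidden=True)
detached(x)
attached = StatefulLSTM(3, 4, detach_hidden=False)
attached(x)

h_detached = detached.hidden[0]
h_attached = attached.hidden[0]
print(h_detached.requires_grad, h_attached.requires_grad)
```

With the flag off, the stored state stays attached to the graph of the batch that produced it, so anything that consumes it later can receive gradients. The trade-off, as far as I can tell, is that the graph then keeps growing across batches unless the state is reset, which costs memory.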