I have been exploring the ULMFiT source code lately and reading about AWD-LSTM. In AWD-LSTM there is the idea of variable-length (jittered) BPTT to make better use of the dataset. I am trying to see how this is implemented in the library, but I couldn't find the part where the random BPTT window is selected for language model training. Perhaps someone can enlighten me.
What I understand from the current implementation of batch creation for language modeling is that there is no fixed window; instead it steps through the data in windows of min(BPTT, remaining sequence length). That looks to me like every possible data sequence gets used, so the utilization is 100%. Is this a correct assumption?
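To make my question concrete, here is a toy sketch of what I think the batching does. This is not the actual library code, just my reading of it; the token ids and numbers are made up:

```python
import numpy as np

# Toy token stream standing in for the concatenated corpus (ids are made up)
tokens = np.arange(20)
bptt = 7

# Step through the stream; each chunk is min(bptt, whatever is left),
# and the targets are the same chunk shifted by one token
i = 0
while i < len(tokens) - 1:
    seq_len = min(bptt, len(tokens) - 1 - i)
    x = tokens[i : i + seq_len]
    y = tokens[i + 1 : i + 1 + seq_len]
    print(x, "->", y)
    i += seq_len
```

If that reading is right, every token shows up exactly once as a target per epoch, which is what I mean by 100% utilization.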
Is it possible to say that the implementation in the library is better than what's offered in AWD-LSTM, which uses a jittered bptt around seq_len?
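For reference, this is roughly what I mean by the jittered bptt in AWD-LSTM, paraphrased from memory of the paper/official repo rather than copied from either (the function name is mine):

```python
import numpy as np

base_bptt = 70  # the bptt hyperparameter from the paper

def sample_seq_len(base_bptt):
    # Use the full window most of the time, half of it occasionally,
    # then jitter the result with Gaussian noise (never shorter than 5 tokens)
    mean = base_bptt if np.random.random() < 0.95 else base_bptt / 2.0
    return max(5, int(np.random.normal(mean, 5)))

print([sample_seq_len(base_bptt) for _ in range(5)])
```

As I understand the paper, the learning rate is then rescaled in proportion to the sampled length so that short and long windows contribute comparably.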
I am very confused here; I probably didn't fully understand the implementation.