I have been exploring the ULMFiT source code lately and reading about AWD-LSTM. AWD-LSTM includes the idea of variational BPTT to better utilize the dataset, and I am trying to see how this is implemented in the library. But I couldn't find the part where the random window for BPTT is selected for language model training. Perhaps someone can enlighten me.
What I understand from the current implementation of batch creation for language modeling is that there is no fixed window but rather sliding windows of min(BPTT, sequence length), which looks to me like using every possible data sequence, so utilization is 100%. Is this a correct assumption?
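To check my understanding, here is a rough sketch of what I mean (my own paraphrase, not the library's actual code): stepping a window of min(BPTT, remaining length) over the token stream, so every token appears as a target once per epoch.

```python
import numpy as np

def sliding_windows(tokens, bptt):
    """Yield (input, target) pairs; the target is the input shifted by one.

    Hypothetical sketch: each token of the stream becomes a prediction
    target exactly once per epoch, i.e. utilization is effectively 100%.
    """
    tokens = np.asarray(tokens)
    for start in range(0, len(tokens) - 1, bptt):
        seq_len = min(bptt, len(tokens) - 1 - start)
        yield tokens[start:start + seq_len], tokens[start + 1:start + 1 + seq_len]

chunks = list(sliding_windows(list(range(10)), bptt=4))
# the concatenated targets cover tokens 1..9 exactly once
```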
Is it possible to say that the implementation in the library is better than what's offered in AWD-LSTM, which uses a jittered BPTT around seq_len?
I am very confused here; I probably didn't fully understand the implementation.
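For reference, the jittered BPTT I am referring to is sampled roughly like this in the AWD-LSTM paper's code (the values are the paper's defaults; this is my paraphrase, not the library's code):

```python
import numpy as np

def sample_seq_len(bptt=70, p=0.95, std=5, min_len=5):
    # Use the base bptt most of the time, half of it occasionally,
    # then add Gaussian jitter so batch boundaries shift between epochs.
    base = bptt if np.random.random() < p else bptt / 2
    return max(min_len, int(np.random.normal(base, std)))

np.random.seed(0)  # only to make this demo reproducible
lens = [sample_seq_len() for _ in range(1000)]
# sequence lengths cluster around 70, with occasional shorter ones
```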
You are correct:
- The variation in the BPTT was removed recently because it did not contribute to accuracy after we started shuffling the sentences after each epoch.
- At the same time we introduced the CircularIndex to handle the situation where the number of tokens is not an integer multiple of batch_size * sequence_length. The CircularIndex allows us to round up the number of iterations in each epoch so that we use at least all tokens (i.e. almost always more).
- The sliding-window approach in “fill_row” helps us radically reduce memory allocation on the CPU.
- The AWD-LSTM paper explores many regularization tricks. I guess one could say that the methods Jeremy and sgugger implemented from the paper work better. We always build on each other's methods to beat past performance :).
- About being confused => we all are :). It is always more confusing to read other people's code than to create it yourself.
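In case it helps other readers, here is a minimal sketch of the circular-index idea described above (the names are illustrative, not the library's actual API): the iteration count is rounded up and reads wrap around to the start, so every token is used at least once per epoch.

```python
import math

class CircularIndex:
    """Map any position onto a finite token stream by wrapping around."""
    def __init__(self, n):
        self.n = n
    def __getitem__(self, i):
        return i % self.n  # wrap past the end back to the start

tokens = list(range(10))        # 10 tokens
batch_size, seq_len = 4, 3      # 12 positions consumed per iteration
idx = CircularIndex(len(tokens))
n_iters = math.ceil(len(tokens) / (batch_size * seq_len))  # round up
read = [tokens[idx[i]] for i in range(n_iters * batch_size * seq_len)]
# all 10 tokens are read; the first two are read twice to fill the batch
```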
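And a sketch of the buffer-reuse idea behind “fill_row” (again illustrative, not the actual implementation): writing token slices into one preallocated batch array avoids allocating a fresh array on every iteration.

```python
import numpy as np

def fill_row(buffer, row, tokens, start):
    """Write a slice of the token stream into one row of a reused buffer."""
    n = buffer.shape[1]
    for j in range(n):
        buffer[row, j] = tokens[(start + j) % len(tokens)]  # wrap at the end

batch = np.empty((2, 5), dtype=np.int64)  # allocated once, reused per batch
tokens = np.arange(8)
fill_row(batch, 0, tokens, 0)   # row 0: tokens 0..4
fill_row(batch, 1, tokens, 4)   # row 1: tokens 4..7, then wraps to 0
```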
Thanks a lot, Kaspar! I will think deeply about those. The fact that there are now more language models in the library totally made my day - now I will have to get confused about more things.