In text classification, the prediction must cover the entire text you're classifying, which may be much longer than your bptt value. That loop runs through the whole text in chunks of length bptt.
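To make the idea concrete, here is a minimal sketch of that chunked loop in plain Python. The names `SimpleRNNCell`, `process_in_chunks`, and the running-sum "hidden state" are all illustrative stand-ins, not fastai or PyTorch API; the comments mark where a real implementation would detach the hidden state.

```python
class SimpleRNNCell:
    """Toy stand-in for an RNN: the hidden state is a running sum of inputs."""
    def __init__(self):
        self.hidden = 0

    def forward(self, chunk):
        for x in chunk:
            self.hidden += x
        return self.hidden


def process_in_chunks(cell, tokens, bptt):
    """Feed `tokens` to `cell` in chunks of length `bptt`.

    With a real framework you would also detach the hidden state after
    each chunk (e.g. `h = h.detach()` in PyTorch) so gradients do not
    flow back through the entire history -- that is what keeps memory
    bounded on long documents.
    """
    outputs = []
    for i in range(0, len(tokens), bptt):
        out = cell.forward(tokens[i:i + bptt])
        # <- in PyTorch this is where you'd do: cell.hidden = cell.hidden.detach()
        outputs.append(out)
    return outputs


tokens = list(range(10))  # a "document" of 10 tokens
outs = process_in_chunks(SimpleRNNCell(), tokens, bptt=4)
print(outs)  # one output per chunk of up to 4 tokens: [6, 28, 45]
```

The state carries across chunks (each output builds on the previous one), which is exactly why the history has to be detached regularly rather than the state reset.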
Thanks. I read and ran the notebook again and noticed this:

> Then we just have to feed our texts to those two blocks (but we can't give them all at once to the AWD_LSTM, or we might get an OOM error: we'll go for chunks of bptt length to regularly detach the history of our hidden states).