hey, great to see the course is out!
about lesson 12: i’m trying to understand why bptt is used in text classification.

    for i in range(0, sl, self.bptt):  # walk the full text in chunks of length bptt
        r, o, m = self.module(input[:, i: min(i + self.bptt, sl)])
        masks.append(pad_tensor(m, bs, 1))

I thought of two reasons:

  1. To have a dropout between the layers.
  2. For memory reasons.

Am I right? Or are there other reasons?

In text classification, you want to predict over the entire text you’re classifying, which may be much longer than your bptt value. That loop runs through the entire text in chunks of length bptt.
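For intuition, here is a minimal plain-Python sketch (the helper name `chunk_bounds` is mine, not from the library) of which slices that loop visits when the text is longer than bptt:

```python
def chunk_bounds(sl: int, bptt: int) -> list[tuple[int, int]]:
    """(start, end) index pairs covering a sequence of length sl
    in chunks of at most bptt tokens, mirroring the quoted loop."""
    return [(i, min(i + bptt, sl)) for i in range(0, sl, bptt)]

# A 100-token text with bptt=72 is processed in two chunks,
# the last one shorter than bptt:
print(chunk_bounds(100, 72))  # [(0, 72), (72, 100)]
```

So no tokens are skipped: the model still sees the whole text, just not in one forward pass.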


thanks. I read and ran the notebook again and noticed this:

    Then we just have to feed our texts to those two blocks (but we can’t give them all at once to the AWD_LSTM or we might get an OOM error): we’ll go for chunks of bptt length to regularly detach the history of our hidden states.
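That detaching is exactly what keeps memory bounded: after each chunk the hidden state is kept as a plain value, so the autograd graph for any chunk only reaches back at most bptt steps. A tiny illustrative sketch (plain Python, no torch; `graph_depth_per_chunk` is a made-up name for this illustration):

```python
def graph_depth_per_chunk(sl: int, bptt: int) -> list[int]:
    """How many steps of backprop history each chunk carries when the
    hidden state is detached after every chunk: at most bptt, and only
    the leftover length for the final, shorter chunk."""
    return [min(bptt, sl - i) for i in range(0, sl, bptt)]

# A 100-token text with bptt=72: the graph never spans more than 72 steps.
print(graph_depth_per_chunk(100, 72))  # [72, 28]
```

Without the detach, the graph for the last chunk would span all sl steps, which is what blows up memory on long texts.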