Help understanding this BPTT example

kushaj · April 21, 2019, 2:14pm

This is from Section 4.1 of the paper "Regularizing and Optimizing LSTM Language Models". Can someone explain why this is true?

Given a fixed sequence length that is used to break a data set into fixed length batches, the data set is not efficiently used. To illustrate this, imagine being given 100 elements to perform backpropagation through with a fixed backpropagation through time (BPTT) window of 10. Any element divisible by 10 will never have any elements to backprop into, no matter how many times you may traverse the data set. Indeed, the backpropagation window that each element receives is equal to i mod 10 where i is the element’s index. This is data inefficient, preventing 1/10 of the data set from ever being able to improve itself in a recurrent fashion, and resulting in 8/10 of the remaining elements receiving only a partial backpropagation window compared to the full possible backpropagation window of length 10.
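To make the numbers in the quote concrete, here is a minimal sketch that takes the quoted formula at face value (window = i mod 10, with 0-based indices, which is an assumption on my part). The function name `backprop_windows` is made up for illustration.

```python
from collections import Counter

def backprop_windows(num_elements=100, bptt=10):
    """Window each element gets under a fixed split, per the quoted formula."""
    # Assumption: 0-based index i, window = i mod bptt (offset inside its chunk).
    return [i % bptt for i in range(num_elements)]

windows = backprop_windows()
counts = Counter(windows)

# Elements at the start of a chunk have nothing earlier in the chunk to backprop into.
num_zero = counts[0]
# Elements with the largest window this formula allows (bptt - 1).
num_longest = counts[9]
# Everything in between only gets a partial window.
num_partial = len(windows) - num_zero - num_longest

print(num_zero, num_partial, num_longest)
```

Under these assumptions, 10 of the 100 elements get a window of 0 (the 1/10 in the quote), and 80 get a window strictly between 0 and the longest one (the 8/10 in the quote).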

What I understand.
If my batch_size=16 and sequence length=10, then I have a [16, 10] matrix representing my batch. Now if I set BPTT=5, then I split my sequence length of 10 into 2 parts, so I have 2 minibatches for every sequence.

Which is correct?
In the end, I keep the original batch of 16, but split the sequences into minibatches to get a [16, 2, 5] shape.
OR
In the end, I have my batch split into 32 sequences, giving a [32, 5] shape.
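For reference, the two layouts in the question can be written out with NumPy. This is only a sketch of the shapes being asked about; the names `option_a` and `option_b` are made up here, and it makes no claim about which layout any particular library uses.

```python
import numpy as np

batch_size, seq_len, bptt = 16, 10, 5

# Stand-in for a batch of token ids: row i holds sequence i.
batch = np.arange(batch_size * seq_len).reshape(batch_size, seq_len)

# Option A: keep 16 sequences, each split into 2 consecutive chunks of 5.
option_a = batch.reshape(batch_size, seq_len // bptt, bptt)

# Option B: treat every chunk as its own row, giving 32 rows of 5.
option_b = batch.reshape(batch_size * (seq_len // bptt), bptt)

print(option_a.shape)  # (16, 2, 5)
print(option_b.shape)  # (32, 5)

# Both hold the same values: the second chunk of sequence 3 is
# option_a[3, 1] in one layout and option_b[7] in the other.
print(np.array_equal(option_a[3, 1], option_b[7]))
```

The difference between the two is whether the chunks of one sequence stay grouped together (option A) or are handled as independent rows (option B).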

Kaspar · April 21, 2019, 5:16pm

I guess Merity is describing that if the batch size is 10 then you only have 9 predictions, unless you let the batches overlap by 1 as in fastai's LanguageModelPreLoader.
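A rough sketch of the counting argument, in plain Python. This is not fastai's actual LanguageModelPreLoader code; `make_pairs` is a hypothetical helper, and the 101-token stream is chosen only so the numbers come out round.

```python
def make_pairs(tokens, seq_len, overlap):
    """Cut tokens into chunks and build (input, target) next-token pairs per chunk."""
    # With overlap, each chunk borrows one extra token so its last input has a target.
    width = seq_len + 1 if overlap else seq_len
    chunks = []
    for start in range(0, len(tokens) - width + 1, seq_len):
        chunk = tokens[start:start + width]
        chunks.append(list(zip(chunk[:-1], chunk[1:])))
    return chunks

tokens = list(range(101))  # 101 tokens means 100 possible next-token pairs

plain = make_pairs(tokens, seq_len=10, overlap=False)
overlapped = make_pairs(tokens, seq_len=10, overlap=True)

print(len(plain[0]), sum(len(c) for c in plain))
print(len(overlapped[0]), sum(len(c) for c in overlapped))
```

Without overlap each chunk of 10 yields 9 pairs, and pairs that straddle a chunk boundary, such as (9, 10), are never seen. With an overlap of 1 each chunk yields 10 pairs and all 100 are covered.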