I’m trying to wrap my head around how BPT3C (implemented in MultiBatchRNN) works…

When we have a normal, simple RNN-based network, we pass in a minibatch of size (bptt, bs, emb_sz) and backprop from the loss to update all the weights in the network. Since we preserve the hidden state of our RNN layer between minibatches, we need to call repackage_var to make sure that we're not essentially creating a thousands-of-layers-deep NN. I.e., each time we backprop the loss, the network weights are updated based only on the most recent bptt timepoints.
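To make the truncation concrete for myself, here is a toy sketch in plain Python (not fastai or PyTorch code). It assumes a scalar linear RNN, h_t = w*h_{t-1} + x_t, and tracks the derivative of h with respect to w by hand. The function name run_chunks and the detach flag are made up for illustration; zeroing the tracked derivative at a chunk boundary stands in for what I understand repackage_var to do: keep the hidden value, drop its history.

```python
def run_chunks(xs, w, bptt, detach):
    """Run a scalar linear RNN h_t = w*h_{t-1} + x_t over xs in chunks of bptt.

    Returns (h, dh_dw): the final hidden value and its derivative w.r.t. w.
    This is a hand-rolled stand-in for autograd, for illustration only.
    """
    h, dh_dw = 0.0, 0.0
    for start in range(0, len(xs), bptt):
        if detach:
            # Analogue of detaching the hidden state (assumption): the value
            # of h is kept, but it is treated as a constant from here on.
            dh_dw = 0.0
        for x in xs[start:start + bptt]:
            # d/dw (w*h_prev + x) = h_prev + w * dh_prev/dw
            dh_dw = h + w * dh_dw
            h = w * h + x
    return h, dh_dw


xs = [1.0] * 6
h_full, g_full = run_chunks(xs, w=0.5, bptt=3, detach=False)
h_trunc, g_trunc = run_chunks(xs, w=0.5, bptt=3, detach=True)
print(h_full == h_trunc)  # the hidden state itself is carried across chunks
print(g_full, g_trunc)    # the gradient only sees the last chunk when detaching
```

In this toy, the forward values are identical either way; only how far back the gradient reaches changes.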

The point of BPT3C is to extend the number of timepoints into the past that we consider. It does this by calling RNN_Encoder repeatedly and appending the results to a list. The output of RNN_Encoder for a given RNN layer has size (bptt, bs, nh), so if we call RNN_Encoder e.g. 10 times, the concatenated result is a tensor of size (bptt*10, bs, nh). The number of times we append to the list is controlled by the max_seq parameter.
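The shape bookkeeping described above can be sketched in plain Python. This is only my mental model, not the library code: the function name concat_shape is hypothetical, and the keep rule (keep a chunk if it starts within the last max_seq timesteps) is an assumption; the exact boundary condition in MultiBatchRNN may differ.

```python
def concat_shape(sl, bs, nh, bptt, max_seq):
    """Return (number of kept chunks, shape of the concatenated output).

    sl is the full sequence length, processed in chunks of at most bptt
    timesteps. Each chunk is assumed to yield an output of shape
    (chunk_len, bs, nh).
    """
    kept = []
    for start in range(0, sl, bptt):
        chunk_len = min(bptt, sl - start)  # the last chunk may be shorter
        if start >= sl - max_seq:          # assumed keep rule, see lead-in
            kept.append((chunk_len, bs, nh))
    total_len = sum(shape[0] for shape in kept)
    return len(kept), (total_len, bs, nh)


# Ten chunks of 100 timesteps, all kept and concatenated along the time axis.
print(concat_shape(sl=1000, bs=64, nh=400, bptt=100, max_seq=1000))
# A longer document: only chunks from the last max_seq timesteps are kept.
print(concat_shape(sl=2000, bs=64, nh=400, bptt=100, max_seq=1000))
```

The values bs=64 and nh=400 are arbitrary placeholders.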

Since we want to consider a greater number of timepoints, my first question is: why not just set bptt to a larger value? If bptt is 100 and max_seq is 1000, then we append to the list 10 times and wind up with a tensor of size (1000, bs, nh). Why not just set bptt to 1000? We'd still wind up with an output tensor of size (1000, bs, nh).

Also, each time we call RNN_Encoder, it uses repackage_var() to detach the RNN's hidden state from its history. If we want to preserve the history over a greater number of timepoints, why are we still calling repackage_var with each call to RNN_Encoder?

Any insight is appreciated, thanks.