Sequence length, batch size & bptt

This post is dedicated to understanding the difference between three of the most important concepts in preparing your data for a Language Model:

  • SEQUENCE LENGTH: the length of the sequence you’re going to learn from (in fastai it defaults to [total length]/[batch size]).
  • BATCH SIZE: as usual, the number of “concurrent items” you feed into the model.
  • BPTT: Back Propagation Through Time - essentially the “depth” of your RNN (the number of iterations of the “for” loop in the forward pass).

Visually follow the black numbers in the picture above.
I’ve slightly modified the original 12_text.ipynb notebook with this line:

dl = DataLoader(LM_PreLoader(ll.valid, shuffle=False, bptt=5), batch_size=6) # Demo params

To show the “path” of your data using the same parameters as the initial example (SL=15, BS=6, BPTT=5).

This was the original line, with a much bigger BPTT and BS. Moreover, the original code shuffles your data (in the sequence space) on each epoch (AFAIK it’s a kind of “data augmentation” that prevents the model from overfitting…)

dl = DataLoader(LM_PreLoader(ll.valid, shuffle=True, bptt=70), batch_size=64)


  1. Get a lot of documents.
  2. Extract, process and numericalize your data.
  3. Concatenate all your numericalized documents into one stream of TOTAL LENGTH items.
  4. Chop the stream into lists of SEQUENCE LENGTH items (+).
  5. Stack your sequences into BATCH SIZE lines.
  6. Chop the groups again into chunks of BPTT items and you’ve got your X.
  7. To get your Y, do the same as in the previous step, but shifted one word to the right (note that X and Y have the same shape!).
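The steps above can be sketched in plain Python - a minimal illustration using the demo parameters (SL=15, BS=6, BPTT=5), not the actual fastai LM_PreLoader:

```python
# A minimal sketch of steps 3-7, assuming the stream is already
# numericalized. Plain Python, not the actual fastai LM_PreLoader.

stream = list(range(90))        # step 3: one long stream, TOTAL LENGTH = 90
bs, bptt = 6, 5
seq_len = len(stream) // bs     # SEQUENCE LENGTH = 90 // 6 = 15

# steps 4-5: chop the stream into bs rows of seq_len items each
rows = [stream[r * seq_len:(r + 1) * seq_len] for r in range(bs)]

# steps 6-7: slide a window of bptt tokens for X; Y is the same window
# shifted one token to the right, so X and Y always have the same shape
batches = []
for i in range(0, seq_len - 1, bptt):
    n = min(bptt, seq_len - 1 - i)              # last chunk may be shorter
    x = [row[i:i + n] for row in rows]
    y = [row[i + 1:i + 1 + n] for row in rows]
    batches.append((x, y))
```

Note how a single 90-token stream already yields three (X, Y) pairs of shape (6, bptt-ish) - this is the point made just below about getting many more pairs than documents.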

VERY IMPORTANT: unlike images or structured data (where each image is one sample), your (X,Y) pairs will be A LOT MORE numerous than your original documents!

IMPORTANT: if your SEQUENCE LENGTH isn’t a multiple of BPTT, the data inside your last batch will be padded.

(+) AFAIK you usually “randomize” the sequence length a little bit, e.g. if SL=70 you can get sequences of 50…90.
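That randomization could be sketched like this - the constants here are illustrative assumptions matching the 50…90 example, not the exact fastai recipe:

```python
import random

def jittered_bptt(bptt=70, lo=50, hi=90):
    # Gaussian jitter around the nominal bptt, clipped to [lo, hi].
    # Illustrative constants, not the exact fastai implementation.
    return max(lo, min(hi, int(random.gauss(bptt, 10))))
```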

@stas for having reviewed and corrected the first version of the post.
@sgugger for the sequence length formula: [sequence length] = [total length]/[batch size]


Is sequence length set for us automatically in fastai based on bptt, or is it something we can tune separately? I don’t think I’ve ever seen it accepted as a parameter anywhere.


In fastai sequence length is total length divided by batch size.
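Concretely, with the demo numbers used earlier in the thread:

```python
total_len, bs = 90, 6        # demo numbers: a 90-token stream, batch size 6
seq_len = total_len // bs    # each of the 6 rows holds 15 tokens
```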


One important note if you’re writing this yourself: PyTorch allocates memory as needed, so to avoid memory issues you need to make sure your first batch is of max length. Otherwise it allocates multiple buffers, often until it runs out of memory.
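One way to apply that advice, sketched under the assumption that your chunks have varying lengths (`schedule_chunks` is a hypothetical helper, not a PyTorch or fastai API):

```python
import random

def schedule_chunks(chunk_lens, seed=None):
    # Yield chunk indices with the longest chunk first, so buffers get
    # allocated at max size up front; shuffle the rest as usual.
    # Hypothetical helper for illustration, not a PyTorch/fastai API.
    order = sorted(range(len(chunk_lens)),
                   key=lambda i: chunk_lens[i], reverse=True)
    first, rest = order[0], order[1:]
    random.Random(seed).shuffle(rest)
    return [first] + rest

# e.g. with hypothetical chunk lengths:
schedule = schedule_chunks([70, 65, 80, 50, 75], seed=0)
```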


@ste, I think your step 6 is drawn incorrectly/confusingly - it shows as if bptt is slicing across B0, whereas it’s slicing across batches. It’s because in your steps 5 and 6 the stacks appear identical - whereas in step 6 those should be batches, not a single batch. It’s easy to see from the helpful token printouts that bptt is taken across batches and not a single batch. I hope it makes sense, I’m not sure I am explaining myself clearly.

I think it can be corrected/clarified by adding on the left side of step 6 something like B0, B1, … B5 (instead of the allusion to B0 from step 5 diagram).

Or alternatively, stage 5 is incorrect, since total len = bs * seq len, i.e. in your example there can be no S6+, only S0…S5.

But otherwise a very cool visual. Thank you!


Thanks @stas - I’ll check it out!