I wanted to confirm my understanding of how
DataLoaders create batches for a language model.
Question 17 of the Chapter 10 questionnaire asks:
> Why do we need padding for text classification? Why don’t we need it for language modeling?
I understand the explanation in the book (as well as Tanishq’s solution)—emphasis mine:
> The sorting and padding are automatically done by the data block API for us when using a `TextBlock`, with `is_lm=False`. **(We don’t have this same issue for language model data, since we concatenate all the documents together first, and then split them into equally sized sections.)**
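As I understand it, the classification side pads each mini-batch up to the length of its longest document, while the language-model side first concatenates everything into one long stream and then cuts it into equal pieces. A toy sketch of that difference (my own illustration in plain PyTorch, not fastai internals):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

docs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]

# Classification: documents stay separate, so a batch has to be padded
# up to the longest document in it.
clf_batch = pad_sequence(docs, batch_first=True, padding_value=0)
print(clf_batch)          # shape (3, 3); shorter docs are filled with 0

# Language modelling: documents are concatenated into one stream first,
# then split into equal-sized pieces, so no padding is needed,
# as long as the stream length divides evenly.
stream = torch.cat(docs)  # tensor([1, 2, 3, 4, 5, 6])
print(stream.split(2))    # three equal chunks of length 2
```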
However, what happens if the length of the tokens in the concatenated documents is not divisible by the product of the batch size and sequence length?
Following the first test listed in the docs for `LMDataLoader`, I recreated the following scenario:
The number of tokens (integers) is 14, the batch size is 5, and the sequence length is 2. So I would expect that splitting the 14 tokens into equal-sized batches would produce two full batches (5 rows of 2 tokens apiece, one for the independent and one for the dependent variable) and two partially full batches (with 4 tokens apiece, again one for the independent and one for the dependent variable), and that the partially full batches would be padded. However, it seems that `LMDataLoader` simply drops the last pair of batches.
The following code:
```python
from fastai.text.all import *  # provides L, tensor, and LMDataLoader

bs, sl = 5, 2
ints = L([[0,1,2,3,4,5,6,7,8,9,10,11,12,13]]).map(tensor)
dl = LMDataLoader(ints, bs=bs, seq_len=sl)
list(dl)
```
Gives the output:
```
[(LMTensorText([[0, 1],
         [2, 3],
         [4, 5],
         [6, 7],
         [8, 9]]),
  tensor([[ 1,  2],
          [ 3,  4],
          [ 5,  6],
          [ 7,  8],
          [ 9, 10]]))]
```
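As far as I can tell, the leftover tokens are discarded before any batches are formed, rather than being padded. Here is a rough sketch of the arithmetic I think is going on (my own reconstruction from the behaviour above, not the actual fastai source):

```python
import math

total_tokens = 14   # tokens 0..13 from the example above
bs, seq_len = 5, 2

# Assumption: one token is reserved for the shifted dependent variable,
# then the stream is rounded down to a multiple of the batch size.
usable    = ((total_tokens - 1) // bs) * bs   # 13 -> 10
batch_len = usable // bs                      # tokens per stream piece: 2
n_batches = math.ceil(batch_len / seq_len)    # 1 full (x, y) pair

print(usable, batch_len, n_batches)           # 10 2 1 -> tokens 11..13 are never used
```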
Given that example, am I understanding correctly that in the book code:
```python
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
```
The final x and y batches will be dropped if they are not full? And if so, why drop them instead of padding them so they are full batches?
Thanks in advance. Apologies if this is a duplicate question; I couldn’t find anything related in this thread or through search.
Update: I think this question/response from 2019 is relevant (if there is a different explanation, please let me know):
> a lot of models use BatchNorm, which behaves badly if you have a batch of a small size (especially size 1, that will throw an error). To avoid this, we drop the last batch during training (since there is shuffle, it doesn’t have any impact).
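For what it’s worth, the "drop the last batch" behaviour that answer describes is, I believe, the usual drop_last option on a DataLoader; a small illustration with plain PyTorch (separate from the token-level rounding that LMDataLoader appears to do above):

```python
from torch.utils.data import DataLoader

data = list(range(14))  # 14 items with a batch size of 5 -> the last batch would only have 4

# drop_last=True discards the incomplete final batch, so layers like BatchNorm
# never see a tiny (or size-1) batch during training.
print(len(list(DataLoader(data, batch_size=5, drop_last=True))))   # 2
print(len(list(DataLoader(data, batch_size=5, drop_last=False))))  # 3
```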