Lesson 8 - Official topic

@jaryaman

I do not know the answer myself. But …

From Chapter 10, here is some information about why this is the case:

We will expand the shortest texts to make them all the same size. To do this, we use a special padding token that will be ignored by our model. Additionally, to avoid memory issues and improve performance, we will batch together texts that are roughly the same length (with some shuffling for the training set). We do this by (approximately, for the training set) sorting the documents by length prior to each epoch. The result of this is that the documents collated into a single batch will tend to be of similar lengths. We won’t pad every batch to the same size, but will instead use the size of the largest document in each batch as the target size.

The sorting and padding are automatically done by the data block API for us when using a `TextBlock`, with `is_lm=False`. (We don’t have this same issue for language model data, since we concatenate all the documents together first, and then split them into equally sized sections.)
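For reference, this is roughly what the Chapter 10 classifier data block looks like (just my sketch based on the IMDB example; I build the vocab from scratch here instead of reusing the language model's, and `bs`/`seq_len` are the values used in the book). With `is_lm=False`, `TextBlock` makes the training DataLoader a `SortedDL`, which is the piece that does the approximate length sorting and the pad-to-longest-in-batch behaviour quoted above:

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Classifier DataBlock: TextBlock with is_lm=False picks SortedDL as the
# DataLoader type, which handles the approximate length-sorting and the
# pad-to-longest-in-batch behaviour described in the quote above.
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, is_lm=False), CategoryBlock),
    get_items=partial(get_text_files, folders=['train', 'test']),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test'),
).dataloaders(path, path=path, bs=128, seq_len=72)

print(type(dls_clas.train))                  # should be a SortedDL
print(isinstance(dls_clas.train, SortedDL))  # True if my understanding is right
```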

My understanding is that there is a very large sample in this dataset, which causes these additional padding tokens. But I do not know how to validate this because of my minimal understanding of `SortedDL`.
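To try to check that hypothesis, I would probably look at the document lengths and at how much padding actually ends up in each batch. Something like the sketch below (my own attempt, not from the book; it assumes the default fastai vocab, where `xxpad` is index 1, and reuses the `dls_clas` from the snippet above):

```python
# Sketch for checking the "one very long document" idea.
# Assumes the default fastai vocab, where 'xxpad' is index 1.
pad_idx = 1

# Distribution of numericalised document lengths in the training set
# (this iterates the whole dataset, so it can take a little while)
lens = [len(x) for x, _ in dls_clas.train_ds]
print(f"min {min(lens)}, max {max(lens)}, mean {sum(lens)/len(lens):.1f}")

# How wide each batch is (padded to its longest document) and how many
# padding tokens it actually contains
for i, (x, y) in enumerate(dls_clas.train):
    n_pad = (x == pad_idx).sum().item()
    print(f"batch {i}: shape {tuple(x.shape)}, pad tokens {n_pad}")
    if i == 4: break   # just look at the first few batches
```

If one document length dwarfs the rest, or one batch is much wider than the others, that would explain where the extra padding tokens come from.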

@init_27 @muellerzr - Could you please assist with this question?
