@NavneetSajwan @KevinB, if I understand Jeremy’s response to my post correctly, I don’t think it’s a bug; I think it’s expected (albeit confusing) behaviour.
There is one review in the IMDB dataset which is much longer than the rest of the reviews.
Fastai chops up variable-length token sequences (i.e. movie reviews) into fixed-size mini-batches: each mini-batch is a tensor with `bs` reviews stacked as its rows. The resulting mini-batches must be small enough to fit into GPU memory, but also need to retain continuity in the sequence of tokens between consecutive mini-batches, so that the text flows from one mini-batch to the next. In order for all the mini-batches to have the same dimensions, fastai pads the shorter rows with the special `xxpad` token (much like in computer vision, where black pixels can be used to make all the images square).
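To see this concretely, here’s a minimal sketch. It assumes you already have your classification dataloaders in `dls`, and that `xxpad` sits at index 1 of the vocab (fastai’s default special-token order):

```python
# Grab one mini-batch and check how much of it is padding
xb, yb = dls.one_batch()                        # xb: token ids, shape (bs, seq_len)

pad_id = 1                                      # index of xxpad in fastai's default vocab
pad_frac = (xb == pad_id).float().mean(dim=1)   # fraction of padding in each row (review)

print(xb.shape)
print(pad_frac)
```

If one row is the giant review, its padding fraction should be near 0 while the other rows are close to 1.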
If there is one review which is much longer than the rest, the other reviews in the same mini-batch will be mostly padding. Therefore what you are seeing is expected behaviour. My guess is that `show_batch()` displays reviews in order of decreasing length, which is why you’re seeing this at all.
I think there are a few ways you can verify the story above. One way might be to iterate through mini-batches until you get to e.g. the 10,000th mini-batch, and then take a look at that batch to see that you’re seeing words again, not just padding. Alternatively, you could exclude the very long movie review before building your `dls`.
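For the first check, something like this should do (same assumptions as the sketch above: a `dls` already built, `xxpad` at index 1):

```python
pad_id = 1  # fastai's default index for xxpad; adjust if you use a custom vocab

# Walk the training batches and watch the padding fraction fall off
for i, (xb, yb) in enumerate(dls.train):
    if i % 50 == 0:
        frac = (xb == pad_id).float().mean().item()
        print(f'batch {i:5d}: seq_len={xb.shape[1]:5d}  padding={100 * frac:.1f}%')
```

For the second, if you build your `dls` with the `DataBlock` API you can filter the file list in `get_items`. A rough sketch below; `MAX_CHARS` is a made-up cutoff that you’d want to tune so it only drops the outlier review:

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)
MAX_CHARS = 20_000   # arbitrary threshold, not a recommendation

def get_reviews(source):
    # Keep only reviews below the cutoff length (in characters)
    files = get_text_files(source, folders=['train', 'test'])
    return [f for f in files if len(f.read_text(encoding='utf-8')) <= MAX_CHARS]

imdb = DataBlock(
    blocks=(TextBlock.from_folder(path), CategoryBlock),
    get_items=get_reviews,
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test'))

dls = imdb.dataloaders(path, bs=64)
```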