Pretrain MLM and finetune on GLUE with fastai - 2 - TextDataloader for sliding window over sentences or tokens

=== 2020.05.18 ===

  1. Cache TextDataloader
    Initialization of TextDataloader (and also SortedDL/LMDataloader) takes time, so I made TextDataloader able to cache itself, but not the dataset inside it. You only need to initialize it once, which saves you the initialization time afterwards; see the sketch right after this list. (To cache the dataset, see huggingface/nlp.)

  2. Speed comparison
    TextDataloader is a new general dataloader that can also behave the same as SortedDL or LMDataloader, but is as fast as or faster than both of them in initialization and batch loading.
    I added a speed comparison between TextDataloader and SortedDL/LMDataloader to the notebook.
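
The caching in 1. works roughly like this: serialize the already-initialized dataloader (without the dataset) and reload it on later runs instead of rebuilding it. Below is only a minimal sketch of that pattern, not the actual TextDataloader API -- the `dataset` attribute, the cache path handling, and the constructor call in the comment are all assumptions:

```python
import pickle
from pathlib import Path

def load_or_create_dl(cache_path, create_fn, dataset):
    "Reuse a previously initialized dataloader if a cache file exists, otherwise build and cache it."
    cache_path = Path(cache_path)
    if cache_path.exists():
        with open(cache_path, 'rb') as f:
            dl = pickle.load(f)
        dl.dataset = dataset                 # the dataset itself is not cached; re-attach it
    else:
        dl = create_fn(dataset)              # the slow part: e.g. sorting samples / planning windows
        ds, dl.dataset = dl.dataset, None    # drop the dataset before pickling
        with open(cache_path, 'wb') as f:
            pickle.dump(dl, f)
        dl.dataset = ds
    return dl

# Hypothetical usage -- swap in the real TextDataloader constructor and its arguments:
# dl = load_or_create_dl('textdl.pkl', lambda ds: TextDataloader(ds, bs=64, seq_len=50), dataset)
```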

Initialization: (dataset pre-created)
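
For reference, the initialization timing can be reproduced with something like the snippet below; the constructor calls in the comments are placeholders, not the exact signatures:

```python
import time

def time_init(make_dl, n=5):
    "Average wall-clock seconds to construct a dataloader, over n runs."
    total = 0.0
    for _ in range(n):
        t0 = time.perf_counter()
        make_dl()                 # construction is the part being measured
        total += time.perf_counter() - t0
    return total / n

# Placeholder constructors -- replace with the actual TextDataloader / SortedDL / LMDataloader calls:
# print(time_init(lambda: TextDataloader(dataset, bs=64, seq_len=50)))
# print(time_init(lambda: SortedDL(dataset, bs=64)))
```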

Batch Loading:

  • When using TextDataloader as SortedDL to load batches, TextDataloader is often very slightly slower (less than 0.5 s) in this setting. I'm hoping somebody can help me figure out why…

  • When using TextDataloader as LMDataloader to load batches, TextDataloader is much faster. Note that, assuming seq_len=50, the shape of the last batch from LMDataloader will be something like (64, 30), while the one from TextDataloader will be something like (37, 50), with some padding in the last sample (see the sketch after this list). You should be able to easily add some code to TextDataloader.__init__ to adjust the last batch_size samples and get the same behavior as LMDataloader.
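
To make those last-batch shapes concrete, here is the arithmetic behind the two chunking schemes in simplified form. The token count is made up, and both functions are only models of the behavior, not the actual implementations:

```python
import math

def lm_style_last_batch(total_tokens, bs=64, seq_len=50):
    "LMDataloader-style: split the corpus into bs contiguous streams, read seq_len tokens at a time."
    stream_len = total_tokens // bs        # tokens per stream (the remainder is dropped)
    leftover = stream_len % seq_len        # tokens left per stream for the final, shorter batch
    return (bs, leftover) if leftover else (bs, seq_len)

def window_style_last_batch(total_tokens, bs=64, seq_len=50):
    "TextDataloader-style (simplified): slide a fixed seq_len window over the corpus, pad the tail."
    n_samples = math.ceil(total_tokens / seq_len)
    leftover = n_samples % bs              # samples in the final, smaller batch
    return (leftover, seq_len) if leftover else (bs, seq_len)

total = 1_000_000                          # illustrative corpus size in tokens
print(lm_style_last_batch(total))          # (64, 25): full batch size, shorter sequences
print(window_style_last_batch(total))      # (32, 50): smaller batch, full (padded) sequences
```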
