=== 2020.05.18 ===
- Cache `TextDataloader`
  Initialization of `TextDataloader` (and also of `SortedDL`/`LMDataloader`) takes time, so I made `TextDataloader` able to cache itself (but not the dataset inside it). You only need to initialize it once, which saves you the initialization time on later runs. (To cache the dataset itself, see huggingface/nlp.)
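  The caching idea can be sketched with a generic load-or-create pattern. Note this is a hedged illustration, not the library's actual API: `load_or_create_dl` and `create_fn` are hypothetical names, and the real `TextDataloader` caching mechanism may serialize differently.

  ```python
  import os
  import pickle
  import tempfile

  def load_or_create_dl(cache_path, create_fn):
      """Generic cache pattern (hypothetical helper, not the real API):
      deserialize the dataloader if a cached copy exists, otherwise
      build it once with `create_fn` and save it for next time."""
      if os.path.exists(cache_path):
          with open(cache_path, "rb") as f:
              return pickle.load(f)
      dl = create_fn()              # the expensive one-time initialization
      with open(cache_path, "wb") as f:
          pickle.dump(dl, f)        # cache the dataloader, not the dataset
      return dl

  # Tiny demo with a stand-in "dataloader" (a dict) to show the pattern.
  calls = []
  def build():
      calls.append(1)               # track how many times we actually build
      return {"vocab_size": 100}    # stand-in for a real TextDataloader

  path = os.path.join(tempfile.mkdtemp(), "dl.pkl")
  first = load_or_create_dl(path, build)
  second = load_or_create_dl(path, build)   # hits the cache, no rebuild
  ```

  After the first call, `build` is never invoked again for the same cache path; only the cheap deserialization runs.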
- Speed comparison
  `TextDataloader` is a new general dataloader that can also behave the same as `SortedDL` or `LMDataloader`, while being faster than or as fast as both in initialization and in batch loading.
  I added a speed comparison between `TextDataloader` and `SortedDL`/`LMDataloader` to the notebook.
  Initialization (dataset pre-created):
  Batch loading:
  - When using `TextDataloader` as `SortedDL` to load batches, `TextDataloader` is often very slightly slower (less than 0.5 s) in this setting. I'm hoping somebody can help me figure out why…
  - When using `TextDataloader` as `LMDataloader` to load batches, `TextDataloader` is largely faster. Note that, assuming `seq_len=50`, the shape of the last batch of `LMDataloader` will be something like `(64, 30)`, while that of `TextDataloader` will be something like `(37, 50)` with some padding in the last sample. You should be able to easily add some code to `TextDataloader.__init__` to adjust the last `batch_size` samples and get the same behavior as `LMDataloader`.
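The two last-batch conventions above come down to where the leftover tokens go. A toy shape calculation illustrates it (this is plain arithmetic under my own assumptions about the batching schemes, not the libraries' code, and the token count is chosen only to produce shapes similar to those quoted above):

```python
import math

def lm_last_batch_shape(total_tokens, bs, seq_len):
    # LMDataloader-style (as I understand it): split the corpus into
    # bs parallel streams, then chop each stream along the time axis;
    # the last chunk may be shorter than seq_len but keeps bs rows.
    stream_len = total_tokens // bs          # tokens per stream (remainder dropped)
    last_len = stream_len % seq_len or seq_len
    return (bs, last_len)

def text_last_batch_shape(total_tokens, bs, seq_len):
    # TextDataloader-style: cut the corpus into fixed-length seq_len
    # samples (padding the final one), then batch; the last batch may
    # have fewer than bs rows but keeps seq_len columns.
    n_samples = math.ceil(total_tokens / seq_len)
    last_bs = n_samples % bs or bs
    return (last_bs, seq_len)

# With 8320 tokens, bs=64, seq_len=50:
#   LM style keeps 64 rows and shortens the time axis,
#   Text style keeps 50 columns and shrinks the batch.
lm_shape = lm_last_batch_shape(8320, 64, 50)
text_shape = text_last_batch_shape(8320, 64, 50)
```

Here `lm_shape` comes out as `(64, 30)` and `text_shape` as `(39, 50)`, matching the pattern described above (exact numbers depend on the corpus size).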

