=== 2020.05.18 ===
- **Cache `TextDataloader`**
  Initialization of `TextDataloader` (and also of `SortedDL`/`LMDataloader`) takes time, so I made `TextDataloader` able to cache itself (but not the dataset inside it). You only need to initialize it once, saving the time of later initializations. (To cache the dataset, see huggingface/nlp.)
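The "initialize once, then reuse" idea can be sketched as a simple load-or-create helper. This is a hypothetical sketch using `pickle` — the helper name and the actual `TextDataloader` caching API are assumptions, not the real implementation:

```python
import os
import pickle

def load_or_create_dl(cache_path, create_fn):
    """Load a previously built dataloader from `cache_path`, or build one
    with `create_fn()` and cache it for next time.

    Hypothetical helper: the real TextDataloader caching mechanism may
    store things differently (and the dataset itself is cached separately,
    e.g. via huggingface/nlp).
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)  # skip the expensive initialization
    dl = create_fn()               # pay the initialization cost once
    with open(cache_path, "wb") as f:
        pickle.dump(dl, f)
    return dl
```

On the second call with the same `cache_path`, `create_fn` is never invoked, which is where the initialization time is saved.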
- **Speed comparison**
  `TextDataloader` is a new general dataloader that can behave the same as `SortedDL` or `LMDataloader`, but is faster than or as fast as both of them, in both initialization and batch loading. I added a speed comparison between `TextDataloader` and `SortedDL`/`LMDataloader` to the notebook.
  Initialization: (dataset pre-created)
  Batch loading:
- When using `TextDataloader` as `SortedDL` to load batches, `TextDataloader` is often very slightly slower (by less than 0.5 s) in this setting. I hope somebody can help me figure out why…
- When using `TextDataloader` as `LMDataloader` to load batches, `TextDataloader` is largely faster. Note that, assuming `seq_len=50`, the shape of the last batch of `LMDataloader` will be something like `(64, 30)`, while that of `TextDataloader` will be something like `(37, 50)`, with some padding in the last sample. But you could easily add some code to `TextDataloader.__init__` to adjust the last `batch_size` samples to get the same behavior as `LMDataloader`.
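The two last-batch conventions can be sketched with a little arithmetic. This is a simplified model with a hypothetical token count (the real dataloaders also shift targets and handle edge tokens, so the notebook's exact shapes differ):

```python
def lm_last_batch_shape(n_tokens, bs, seq_len):
    # LMDataloader-style chunking (simplified): the corpus is split into
    # `bs` parallel streams, so the last batch keeps the full batch size
    # but its sequence length is whatever remains in each stream.
    stream_len = n_tokens // bs
    last_len = stream_len % seq_len or seq_len
    return (bs, last_len)

def text_last_batch_shape(n_tokens, bs, seq_len):
    # TextDataloader-style chunking (simplified): the corpus is cut into
    # fixed-length `seq_len` samples (padding the final one), so the last
    # batch keeps the full sequence length but holds fewer samples.
    n_samples = -(-n_tokens // seq_len)  # ceiling division
    last_bs = n_samples % bs or bs
    return (last_bs, seq_len)

# Hypothetical corpus of 10,000 tokens, bs=64, seq_len=50:
print(lm_last_batch_shape(10_000, 64, 50))    # (64, 6)  - full bs, short seq
print(text_last_batch_shape(10_000, 64, 50))  # (8, 50)  - full seq, small bs
```

Matching `LMDataloader` exactly would mean trading the fixed `seq_len` of the last samples for a fixed batch size, which is the adjustment described above for `TextDataloader.__init__`.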