Speedtest: HuggingFace nlp Datasets lib vs Fastai TextDataloaders

I was playing around with HuggingFace’s nlp Datasets library and was seriously impressed by the speed you could iterate through data with it. It uses PyArrow under the hood and lets you easily load your own CSV or JSON files.

I figured it would be interesting to test whether it would make more sense to do as much of the text processing as possible (e.g. cleaning, tokenization, numericalisation) with it, instead of using Fastai’s defaults. I used Fastai’s TextDataloaders with all of its defaults and tried to replicate all of its functionality with nlp Datasets.
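For context, the numericalisation step that both pipelines perform (mapping tokens to integer ids via a frequency-capped vocab) can be sketched in plain Python. This is a conceptual sketch only; the vocab-size and frequency defaults below are illustrative, not Fastai’s or nlp Datasets’ actual settings:

```python
from collections import Counter

def build_vocab(tokenized, max_vocab=60000, min_freq=2):
    # Count token frequencies and keep only the most common ones,
    # reserving ids for unknown and padding tokens. The cutoffs here
    # are placeholders, not either library's real defaults.
    counts = Counter(tok for sample in tokenized for tok in sample)
    itos = ["<unk>", "<pad>"] + [
        t for t, c in counts.most_common(max_vocab) if c >= min_freq
    ]
    return {t: i for i, t in enumerate(itos)}

def numericalise(sample, stoi):
    # Map each token to its id, falling back to <unk> for rare tokens.
    return [stoi.get(tok, stoi["<unk>"]) for tok in sample]

tokenized = [["i", "love", "this"], ["i", "hate", "this"]]
stoi = build_vocab(tokenized, min_freq=2)
ids = numericalise(tokenized[0], stoi)  # "love" is rare, so it maps to <unk>
```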

Full blog post here

Curious if anyone has any feedback or how this test might have been done better :slight_smile:

Just tell me the results

Results were…mixed…

Fastai’s initialisation (e.g. load, preprocess, tokenize, etc.) was faster with the 1.6M-row Sentiment140 dataset I used; however, I have a few caveats:


Fastai parallelises the tokenization, which I couldn’t figure out how to do with nlp Datasets (probably my own lack of knowledge rather than a limitation of the library). My guess is that parallelising it would make nlp Datasets much faster than Fastai
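The chunk-and-map pattern behind parallel tokenization can be sketched with the stdlib. The tokenizer below is a whitespace placeholder, not Fastai’s actual Tokenizer, and threads are used only to keep the demo portable (a CPU-bound tokenizer would want a process pool instead):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

def tokenize_batch(texts):
    # Placeholder tokenizer: a real pipeline would use fastai's
    # Tokenizer or a subword tokenizer here.
    return [t.lower().split() for t in texts]

def parallel_tokenize(texts, n_workers=4, chunk_size=1000):
    # Split the corpus into chunks and tokenize each chunk on its own
    # worker, roughly the scheme fastai uses across processes. For a
    # genuinely CPU-bound tokenizer, swap in ProcessPoolExecutor.
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        results = ex.map(tokenize_batch, chunks)  # preserves chunk order
    return list(chain.from_iterable(results))

toks = parallel_tokenize(["Hello World", "so fast"] * 2000, n_workers=2)
```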

Sorting by sample length

To try and replicate SortedDL's behaviour, I sorted the entire dataset in the nlp Datasets trial, which added a significant amount of time. Possibly there's a better way to replicate SortedDL's behaviour
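The idea behind sorting by sample length is that batches of similarly-sized samples need minimal padding. A minimal pure-Python sketch of that idea (not Fastai’s actual SortedDL implementation, which also shuffles within length buckets during training):

```python
def sort_and_batch(samples, bs=64):
    # Sort token lists by length so each mini-batch contains
    # similarly-sized samples -- the idea behind fastai's SortedDL.
    ordered = sorted(samples, key=len, reverse=True)
    return [ordered[i:i + bs] for i in range(0, len(ordered), bs)]

def pad_batch(batch, pad_tok=0):
    # Pad every sample only up to the longest sample in its own batch,
    # so sorted batches waste far less compute on padding.
    max_len = max(len(s) for s in batch)
    return [s + [pad_tok] * (max_len - len(s)) for s in batch]

samples = [[1], [1, 2, 3], [1, 2], [1, 2, 3, 4]]
batches = [pad_batch(b) for b in sort_and_batch(samples, bs=2)]
```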


nlp Datasets also uses caching, so the second time you run the same pre-processing it is much, much faster
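Conceptually, that caching works by fingerprinting a processing step and reusing the stored result on a hit. The stdlib sketch below loosely mimics that fingerprint-and-reuse pattern; it is not the library’s actual Arrow-file mechanism, and `fn_id` is a hypothetical stand-in for how the library hashes the mapped function:

```python
import hashlib
import json
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp()

def cached_map(fn, rows, fn_id):
    # Fingerprint the processing step + its inputs; if a result for
    # that fingerprint is already on disk, skip recomputing -- loosely
    # how nlp Datasets avoids re-running the same map() twice.
    key = hashlib.sha256((fn_id + json.dumps(rows)).encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):                 # cache hit: load, don't recompute
        with open(path) as f:
            return json.load(f), True
    out = [fn(r) for r in rows]              # cache miss: compute and store
    with open(path, "w") as f:
        json.dump(out, f)
    return out, False

rows = ["Hello", "World"]
first, hit1 = cached_map(str.lower, rows, "lower-v1")   # computes
second, hit2 = cached_map(str.lower, rows, "lower-v1")  # served from cache
```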

10% Data

| 0.16M rows | Init (s) | 1 epoch (s) | 1 mini-batch [bs=64] (ms) |
|---|---|---|---|
| Fastai | 124 | 14.3 | 7.4 |
| Fastai w/sorted | 48.1 | 14.3 | 7.4 |
| nlp | 71.2 | 11.3 | 5.6 |

100% Data

| 1.6M rows | Init (s) | 1 epoch (s) |
|---|---|---|
| Fastai w/sorted | 484 | 142 |
| nlp | 1024 | 323 |