Speedtest: HuggingFace nlp Datasets lib vs Fastai TextDataloaders

I was playing around with HuggingFace’s nlp Datasets library and was seriously impressed by how quickly you can iterate through data with it. It uses PyArrow in the backend and lets you easily load your own CSV or JSON files.

I figured it would be interesting to test whether it makes more sense to do as much of the text processing as possible (e.g. cleaning, tokenization, numericalisation) with it, instead of using Fastai’s defaults. I used fastai’s TextDataloader with all of its defaults and tried to replicate all of its functionality with nlp Datasets.

Full blog post here

Curious if anyone has feedback, or ideas on how this test could have been done better :)

Just tell me the results

Results were…mixed…

Fastai’s initialisation (e.g. load, preprocess, tokenize etc.) was faster on the full 1.6M row Sentiment140 dataset I used. However, I have a few caveats:

Parallelisation

Fastai parallelises the tokenization, which I couldn’t figure out how to do with nlp Datasets (probably down to my own lack of knowledge rather than a limitation of the library). My guess is that parallelising it would make nlp Datasets much faster than Fastai.
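For anyone who wants to try this, here is a minimal sketch of the general idea using only the Python standard library: split the texts into chunks and tokenize each chunk in a separate process. `simple_tokenize`, `tokenize_chunk` and `parallel_tokenize` are hypothetical names, and the whitespace tokenizer is just a stand-in for a real one. This is not how either library does it internally. As far as I know, more recent releases of the library (since renamed `datasets`) also accept a `num_proc` argument on `Dataset.map`, which may be the simpler route.

```python
from concurrent.futures import ProcessPoolExecutor


def simple_tokenize(text):
    # Stand-in tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def tokenize_chunk(texts):
    # Tokenize one chunk of texts; this runs inside a worker process.
    return [simple_tokenize(t) for t in texts]


def parallel_tokenize(texts, n_workers=4, chunk_size=10_000):
    # Split the input into fixed-size chunks so each task is big enough
    # to outweigh the cost of sending data between processes.
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    if n_workers <= 1:
        # Sequential fallback, handy for debugging.
        results = [tokenize_chunk(c) for c in chunks]
    else:
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            # pool.map() returns results in the same order as the inputs.
            results = list(pool.map(tokenize_chunk, chunks))
    # Flatten the per-chunk results back into one list.
    return [tokens for chunk in results for tokens in chunk]
```

Something like `parallel_tokenize(texts, n_workers=4)` would then tokenize `texts` across four processes. On platforms that spawn worker processes, the call needs to sit under an `if __name__ == "__main__":` guard.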

Sorting by sample length

To try to replicate SortedDL's behaviour, I sorted the entire dataset in the nlp Datasets trial, which added a significant amount of time. There may be a better way to replicate SortedDL's behaviour.
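One cheaper option might be to sort the indices by length rather than the rows themselves, then slice batches from the sorted index list. Below is a minimal sketch in plain Python. `sorted_batches` is a hypothetical helper, and the longest-first order is my assumption about what SortedDL does when it is not shuffling.

```python
def sorted_batches(token_ids, bs=64):
    # Compute each sample's length once up front.
    lengths = [len(ids) for ids in token_ids]
    # Sort the *indices* by length, longest first (assumed to mirror
    # SortedDL); the underlying data is never reordered or copied.
    order = sorted(range(len(lengths)), key=lengths.__getitem__, reverse=True)
    # Yield consecutive groups of indices; each group holds samples of
    # similar length, so little padding is needed per batch.
    for start in range(0, len(order), bs):
        yield order[start:start + bs]
```

Each yielded list of indices can then be used to look up the rows for one mini-batch, so only the length column has to be read for the sort.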

Caching

nlp Datasets also caches its results, so the second time you run the same pre-processing it is much, much faster.
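To show why a repeat run can be so much quicker, here is a toy illustration of the caching idea in plain Python: results are saved to disk under a key derived from the processing function and the input, and reloaded when the same key is seen again. `cached_apply` and the pickle-file layout are hypothetical, and this is not a description of how nlp Datasets implements its cache.

```python
import hashlib
import os
import pickle
import tempfile


def cached_apply(func, texts, cache_dir=None):
    # Build a rough fingerprint from the function's bytecode, names and
    # constants plus the input data, so that changing the function body
    # or the input normally leads to a different cache file.
    cache_dir = cache_dir or tempfile.gettempdir()
    code = func.__code__
    key = hashlib.sha256()
    key.update(code.co_code)
    key.update(repr(code.co_names).encode())
    key.update(repr(code.co_consts).encode())
    key.update(pickle.dumps(texts))
    path = os.path.join(cache_dir, f"preproc_{key.hexdigest()}.pkl")

    if os.path.exists(path):
        # Cache hit: load the earlier result instead of recomputing.
        with open(path, "rb") as f:
            return pickle.load(f), True

    # Cache miss: do the work, then save it for next time.
    result = [func(t) for t in texts]
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result, False
```

The second element of the returned tuple says whether the result came from the cache, which makes the effect easy to check when timing repeat runs.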

10% Data

| 0.16M ROWS | Init (s) | 1 epoch (s) | 1 mini-batch [bs=64] (ms) |
|---|---|---|---|
| Fastai | 124 | 14.3 | 7.4 |
| Fastai w/sorted | 48.1 | 14.3 | 7.4 |
| nlp | 71.2 | 11.3 | 5.6 |

100% Data

| 1.6M ROWS | Init (s) | 1 epoch (s) |
|---|---|---|
| Fastai w/sorted | 484 | 142 |
| nlp | 1024 | 323 |