Speedtest: HuggingFace nlp Datasets lib vs Fastai TextDataloaders

I was playing around with HuggingFace’s nlp Datasets library and was seriously impressed by the speed you could iterate through data with it. It uses PyArrow under the hood and lets you easily load your own CSV or JSON files.

I figured it would be interesting to test whether it would make more sense to do as much of the text processing as possible (e.g. cleaning, tokenization, numericalisation) with it, instead of using Fastai’s defaults. I used Fastai’s TextDataloaders with all of its defaults and tried to replicate all of its functionality with nlp Datasets.
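For context, the numericalisation step that both pipelines perform (mapping tokens to integer ids via a frequency-capped vocab) can be sketched in plain Python. This is a conceptual sketch only; the vocab-size and frequency defaults below are illustrative, not Fastai’s or nlp Datasets’ actual settings:

```python
from collections import Counter

def build_vocab(tokenized, max_vocab=60000, min_freq=2):
    # Count token frequencies and keep only the most common ones,
    # reserving ids for unknown and padding tokens. The cutoffs here
    # are placeholders, not either library's real defaults.
    counts = Counter(tok for sample in tokenized for tok in sample)
    itos = ["<unk>", "<pad>"] + [
        t for t, c in counts.most_common(max_vocab) if c >= min_freq
    ]
    return {t: i for i, t in enumerate(itos)}

def numericalise(sample, stoi):
    # Map each token to its id, falling back to <unk> for rare tokens.
    return [stoi.get(tok, stoi["<unk>"]) for tok in sample]

tokenized = [["i", "love", "this"], ["i", "hate", "this"]]
stoi = build_vocab(tokenized, min_freq=2)
ids = numericalise(tokenized[0], stoi)  # "love" is rare, so it maps to <unk>
```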

Full blog post here

Curious if anyone has any feedback or how this test might have been done better :slight_smile:

Just tell me the results

Results were…mixed…

Fastai’s initialisation (e.g. load, preprocess, tokenize, etc.) was faster with the 1.6M-row Sentiment140 dataset I used; however, I have a few caveats:


Fastai parallelises the tokenization, which I couldn’t figure out how to do with nlp Datasets (probably my own lack of knowledge rather than a limitation of the library). My guess is that parallelising it would make nlp Datasets much faster than Fastai
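The chunk-and-map pattern behind parallel tokenization can be sketched with the stdlib. The tokenizer below is a whitespace placeholder, not Fastai’s actual Tokenizer, and threads are used only to keep the demo portable (a CPU-bound tokenizer would want a process pool instead):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

def tokenize_batch(texts):
    # Placeholder tokenizer: a real pipeline would use fastai's
    # Tokenizer or a subword tokenizer here.
    return [t.lower().split() for t in texts]

def parallel_tokenize(texts, n_workers=4, chunk_size=1000):
    # Split the corpus into chunks and tokenize each chunk on its own
    # worker, roughly the scheme fastai uses across processes. For a
    # genuinely CPU-bound tokenizer, swap in ProcessPoolExecutor.
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        results = ex.map(tokenize_batch, chunks)  # preserves chunk order
    return list(chain.from_iterable(results))

toks = parallel_tokenize(["Hello World", "so fast"] * 2000, n_workers=2)
```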

Sorting by sample length

To try and replicate SortedDL's behaviour, I sorted the entire dataset in the nlp Datasets trial, which added a significant amount of time. Possibly there's a better way to replicate SortedDL's behaviour
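The idea behind sorting by sample length is that batches of similarly-sized samples need minimal padding. A minimal pure-Python sketch of that idea (not Fastai’s actual SortedDL implementation, which also shuffles within length buckets during training):

```python
def sort_and_batch(samples, bs=64):
    # Sort token lists by length so each mini-batch contains
    # similarly-sized samples -- the idea behind fastai's SortedDL.
    ordered = sorted(samples, key=len, reverse=True)
    return [ordered[i:i + bs] for i in range(0, len(ordered), bs)]

def pad_batch(batch, pad_tok=0):
    # Pad every sample only up to the longest sample in its own batch,
    # so sorted batches waste far less compute on padding.
    max_len = max(len(s) for s in batch)
    return [s + [pad_tok] * (max_len - len(s)) for s in batch]

samples = [[1], [1, 2, 3], [1, 2], [1, 2, 3, 4]]
batches = [pad_batch(b) for b in sort_and_batch(samples, bs=2)]
```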


nlp Datasets also uses caching, so the second time you run the same pre-processing it is much, much faster
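Conceptually, that caching works by fingerprinting a processing step and reusing the stored result on a hit. The stdlib sketch below loosely mimics that fingerprint-and-reuse pattern; it is not the library’s actual Arrow-file mechanism, and `fn_id` is a hypothetical stand-in for how the library hashes the mapped function:

```python
import hashlib
import json
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp()

def cached_map(fn, rows, fn_id):
    # Fingerprint the processing step + its inputs; if a result for
    # that fingerprint is already on disk, skip recomputing -- loosely
    # how nlp Datasets avoids re-running the same map() twice.
    key = hashlib.sha256((fn_id + json.dumps(rows)).encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):                 # cache hit: load, don't recompute
        with open(path) as f:
            return json.load(f), True
    out = [fn(r) for r in rows]              # cache miss: compute and store
    with open(path, "w") as f:
        json.dump(out, f)
    return out, False

rows = ["Hello", "World"]
first, hit1 = cached_map(str.lower, rows, "lower-v1")   # computes
second, hit2 = cached_map(str.lower, rows, "lower-v1")  # served from cache
```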

10% Data

| 0.16M rows | Init (s) | 1 epoch (s) | 1 mini-batch [bs=64] (ms) |
|---|---|---|---|
| Fastai | 124 | 14.3 | 7.4 |
| Fastai w/sorted | 48.1 | 14.3 | 7.4 |
| nlp | 71.2 | 11.3 | 5.6 |

100% Data

| 1.6M rows | Init (s) | 1 epoch (s) |
|---|---|---|
| Fastai w/sorted | 484 | 142 |
| nlp | 1024 | 323 |