[SOLVED] PyArrow / HuggingFace nlp dataset only running on 1 CPU, slow

EDIT:

Solved:

The slow performance wasn't related to CPUs but to SortedDL's init: NLP speed-up if using SortedDL

Calling all dsets.dataloaders nerds…

I’ve been playing around with the HuggingFace nlp datasets library, which stores its datasets in PyArrow format. It’s suuuper fast for data manipulation, but when I create the dataloaders with dsets.dataloaders only 1 CPU is being used.

When I follow the same workflow using a pandas dataframe instead, dsets.dataloaders uses all 20 CPUs…

Can anyone shed any light on what is going on when dataloaders is called?

Anyone have any idea why multiprocessing isn’t working with PyArrow?

I’m trying to understand whether there is any advantage to using PyArrow, but right now they seem equivalent…

Dataframe vs PyArrow notebook here

For anyone curious about replicating the nlp library behaviour, have a look below. One thing to note is that with the PyArrow version, the data is all pre-tokenized (super fast) before the Datasets and Dataloaders are created. So there shouldn’t even be any “work” done by the dataloader; all it has to do is index into the PyArrow dataset…

from fastai2.basics import *
from fastai2.text.all import *
from fastai2.data.transforms import RandomSplitter

from nlp import load_dataset
from pprint import pprint
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast, RobertaTokenizer

# Instantiate the tokenizer used in convert_to_features and HfTokenize below
# (assuming the standard roberta-base checkpoint)
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')

class HfTokenize(Transform):
    def __init__(self,hfdset, tokenizer):
        self.hfdset, self.tokenizer, self.max_len=hfdset,tokenizer,tokenizer.max_len_single_sentence
    def encodes(self, i): return TensorText(self.hfdset[i]['input_ids'][:self.max_len])
    def decode(self, o=None, split_idx=None): return TitledStr(self.tokenizer.decode(list(o)))

class HfLabel(Transform):
    def __init__(self,hfdset): self.hfdset=hfdset
    def encodes(self, i): return int(self.hfdset[i]['sentiment'])
    
def convert_to_features(example_batch):
    encodings = tokenizer.batch_encode_plus(example_batch['text'], pad_to_max_length=False)
    return encodings

# Download 1% of the data
senti_dataset = load_dataset('sentiment140', split='train[:1%]')
# Pre-tokenize
senti_dataset = senti_dataset.map(convert_to_features, batched=True)
# Change indexing behaviour of the columns you need
senti_dataset.set_format(type='torch', columns=['input_ids','sentiment'])

# Create datasets (for this demo, the train and valid splits both use the full set of indices)
splits = [list(range(len(senti_dataset))), list(range(len(senti_dataset)))]
vcb=np.unique(senti_dataset['sentiment'])
tfms = [[HfTokenize(senti_dataset, tokenizer)],
        [HfLabel(senti_dataset), Categorize(vocab=vcb)]]
dsets = Datasets(range(len(senti_dataset)), tfms, splits=splits, dl_type=SortedDL)

# Here comes the slow bit, create dataloaders:
bs = 64
dls = dsets.dataloaders(bs=bs, before_batch=pad_input, device='cuda')

Hi Morgan, although I don’t know why either, have you considered using nlp.Dataset.map? It is fast and flexible, and since it’s native it should run into fewer low-level problems, I guess. :kissing_smiling_eyes:
Or is there something you can only do with a Transform?
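For example (just a sketch; the 'length' column name is made up), map can compute derived columns like token lengths in one batched pass, right after your tokenizing map and before set_format:

# Sketch: let nlp.Dataset.map compute token lengths once; the result is cached to Arrow on disk
def add_lengths(batch):
    return {'length': [len(ids) for ids in batch['input_ids']]}

senti_dataset = senti_dataset.map(add_lengths, batched=True)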

Thanks @Richard-Wang, I actually figured out why it was slow last night; I’m using SortedDL, and if you don’t pass it a list of keys to sort on (text length in the default case) then its init is suuuuper slow, as it loops over every item in your dataset. So nothing to do with the CPUs as it turns out.

Will be sharing code here shortly, as well as a couple of suggested PRs. Passing the list of keys took the init down from 90s to < 1s :smiley:
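For anyone hitting the same thing, this is roughly the pattern (a sketch of what worked for me; lens is just an illustrative name, and the exact kwargs may differ between fastai2 versions): precompute the sort keys once from the already-tokenized data, then hand them to SortedDL so its init doesn’t have to call the transforms on every item.

# Precompute sequence lengths once - fast, since the data is already tokenized
lens = [len(ids) for ids in senti_dataset['input_ids']]

# Pass the precomputed keys to SortedDL instead of letting it loop over the dataset:
# `res` for the training loader, `val_res` for the validation loader
srtd_dl = partial(SortedDL, res=lens)
dsets = Datasets(range(len(senti_dataset)), tfms, splits=splits, dl_type=srtd_dl)
dls = dsets.dataloaders(bs=bs, before_batch=pad_input, device='cuda',
                        dl_kwargs=[{}, {'val_res': lens}])

Since both splits cover the whole dataset in the example above, the same lens list works for both; in general you’d slice it by each split’s indices.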


I’ve hit this problem before; I just hacked SortedDL to make it cache the keys it needs, so I only have to compute them once. :sunglasses:
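Something along these lines (just a rough sketch of the idea, not my exact code; the pickle cache path is made up):

import pickle
from pathlib import Path
from fastai2.text.all import SortedDL

class CachedSortedDL(SortedDL):
    "SortedDL that pickles its sort keys on first use, so later runs skip the slow init loop"
    def __init__(self, dataset, sort_func=None, res=None, cache=Path('sort_keys.pkl'), **kwargs):
        if res is None and cache.exists():
            res = pickle.loads(cache.read_bytes())           # reuse cached keys
        super().__init__(dataset, sort_func=sort_func, res=res, **kwargs)
        if not cache.exists():
            cache.write_bytes(pickle.dumps(list(self.res)))  # cache the keys computed on first run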
