EDIT:
Solved:
The slow performance wasn’t related to CPUs but to SortedDL’s init: NLP speed-up if using SortedDL. A sketch of the workaround is at the bottom of this post.
…
Calling all dsets.dataloaders nerds…
I’ve been playing around with the HuggingFace nlp datasets library, which stores its datasets in PyArrow format. It’s suuuper fast for data manipulation, but when I create the dataloaders with dsets.dataloaders, only 1 CPU is being used.
When I follow the same workflow using a pandas DataFrame instead, dsets.dataloaders uses all 20 CPUs…
Can anyone shed any light on what is going on when dataloaders is called?
Anyone have any idea why multiprocessing isn’t working with PyArrow?
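The only knob I know of to poke at here is num_workers, which as far as I can tell is passed straight through to the DataLoaders. A minimal sketch of what I’ve been trying, just to rule out the worker count (it reuses the dsets and pad_input names from the full listing at the bottom of this post):

# Hypothetical check: force the worker count explicitly and watch CPU usage in htop
dls = dsets.dataloaders(bs=64, before_batch=pad_input, num_workers=8, device='cuda')
xb, yb = dls.train.one_batch()  # pulling a batch is what should fan out to the workers
print(xb.shape, yb.shape)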
I’m trying to understand whether there is any advantage to using PyArrow, but right now they seem equivalent…
Dataframe vs PyArrow notebook here
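Roughly the kind of comparison I mean (a hypothetical sketch, not the actual notebook code): plain indexing into the two containers, with no fastai involved.

import timeit
import pandas as pd
from nlp import load_dataset

# same 1% slice of sentiment140, once behind the PyArrow-backed nlp dataset
# and once copied into a pandas DataFrame
arrow_ds = load_dataset('sentiment140', split='train[:1%]')
df = pd.DataFrame({'text': arrow_ds['text'], 'sentiment': arrow_ds['sentiment']})

print(timeit.timeit(lambda: arrow_ds[0]['text'], number=1000))  # PyArrow-backed row access
print(timeit.timeit(lambda: df['text'].iloc[0], number=1000))   # pandas access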
For anyone curious about replicating the nlp library behaviour, have a look below. One thing to note is that with the PyArrow version, the data is all pre-tokenized (super fast) before the Datasets and DataLoaders are created, so there shouldn’t even be any “work” done by the dataloader; all it has to do is index into the PyArrow dataset…
from fastai2.basics import *
from fastai2.text.all import *
from fastai2.data.transforms import RandomSplitter
from nlp import load_dataset
from pprint import pprint
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast, RobertaTokenizer

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')  # (or whichever RoBERTa checkpoint you're using)
class HfTokenize(Transform):
    "Index into the pre-tokenized nlp dataset and return the input ids as TensorText"
    def __init__(self, hfdset, tokenizer):
        self.hfdset, self.tokenizer, self.max_len = hfdset, tokenizer, tokenizer.max_len_single_sentence
    def encodes(self, i): return TensorText(self.hfdset[i]['input_ids'][:self.max_len])
    def decode(self, o=None, split_idx=None): return TitledStr(self.tokenizer.decode(list(o)))
class HfLabel(Transform):
    "Look up the label for item i in the nlp dataset"
    def __init__(self, hfdset): self.hfdset = hfdset
    def encodes(self, i): return int(self.hfdset[i]['sentiment'])
def convert_to_features(example_batch):
    # Batch-tokenize the raw text with the HuggingFace tokenizer
    encodings = tokenizer.batch_encode_plus(example_batch['text'], pad_to_max_length=False)
    return encodings
# Download 1% of the data
senti_dataset = load_dataset('sentiment140', split='train[:1%]')
# Pre-tokenize
senti_dataset = senti_dataset.map(convert_to_features, batched=True)
# Change indexing behaviour of the columns you need
senti_dataset.set_format(type='torch', columns=['input_ids','sentiment'])
# Create datasets
splits = [list(range(len(senti_dataset))), list(range(len(senti_dataset)))]  # train and valid both use all indices here
vcb = np.unique(senti_dataset['sentiment'])
tfms = [[HfTokenize(senti_dataset, tokenizer)],
        [HfLabel(senti_dataset), Categorize(vocab=vcb)]]
dsets = Datasets(range(len(senti_dataset)), tfms, splits=splits, dl_type=SortedDL)
# Here comes the slow bit, create dataloaders:
bs = 64
dls = dsets.dataloaders(bs=bs, before_batch=pad_input, device='cuda')
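For anyone who lands here later: the slow bit turned out to be SortedDL’s init, not the CPU count. If I’m reading the source right, when the DataLoader is created SortedDL computes its sort key (the text length) for every sample, one by one, in the main process, and that is what eats all the time. A minimal sketch of the workaround, assuming SortedDL’s res/val_res arguments and the dl_kwargs argument of dataloaders behave the way I think they do (you pass in precomputed lengths so the init never has to touch each sample):

# Sketch: precompute the token lengths once (cheap, the data is already tokenized)
# and hand them to SortedDL so its init can skip the per-sample work
lens = [len(senti_dataset[i]['input_ids']) for i in range(len(senti_dataset))]
train_lens = [lens[i] for i in splits[0]]
valid_lens = [lens[i] for i in splits[1]]

dls = dsets.dataloaders(bs=bs, before_batch=pad_input, device='cuda',
                        dl_kwargs=[{'res': train_lens}, {'val_res': valid_lens}])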