Dataloaders use a single CPU and are very slow

Hi everyone.

I’m trying to build a language model with a custom tokenizer (which uses spaCy).

I create a DataBlock:

from fastai.text.all import *

SENT_LEN = 60

db = DataBlock(
    blocks=TextBlock(is_lm=True, tok_tfm=tokenizer, seq_len=SENT_LEN),
    get_x=attrgetter('_VALUE'),
    splitter=RandomSplitter(0.1)
)

then feed a pandas DataFrame into it like this:

dls_lm = db.dataloaders(
    data_subset.head(5000),
    bs=128,
    seq_len=SENT_LEN,
    verbose=True,
    num_workers=8)

I observe two strange things:

  1. Texts are passed to the tokenizer one by one, which makes it impossible to use spaCy’s pipe, so each text is tokenized sequentially (see the pipe sketch below).
  2. The dataloaders use only one CPU, no matter what value I give num_workers.

Because of that, loading the data takes forever (the texts are long and the tokenizer is not a basic one).
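
For reference, here is the kind of batching I would like the tokenizer to exploit. A minimal sketch of spaCy’s pipe(); the blank pipeline and the batch/process counts are just assumptions:

import spacy

nlp = spacy.blank('en')  # assumption: a tokenizer-only pipeline

texts = ['first long document...', 'second long document...']

# pipe() streams the documents through spaCy in batches (and can fan out
# across processes), instead of calling nlp(text) once per document
for doc in nlp.pipe(texts, batch_size=64, n_process=4):
    tokens = [t.text for t in doc]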

How can I speed this up? How can I process the texts in parallel? Can I pre-tokenize them before training?
Is there another, more efficient way to train a language model?
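
On the pre-tokenization question: fastai ships a tokenize_df helper that tokenizes a whole DataFrame column in parallel. A minimal sketch, assuming the raw text lives in the '_VALUE' column as above:

from fastai.text.all import *

# tokenize_df runs the tokenizer over the whole column with n_workers
# worker processes and returns a new DataFrame plus a token counter
tok_df, counts = tokenize_df(data_subset, text_cols='_VALUE', n_workers=8)

# tok_df now holds the tokenized text in a 'text' column, so the expensive
# tokenization happens once, up front, instead of during training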

I have found a solution that uses all CPUs, but I cannot explain why it works.
The difference is in the DataBlock: I build it using the .from_df factory:

tb = TextBlock.from_df('_VALUE', is_lm=True, tok=tokenizer)

db = DataBlock(
    blocks=tb,
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.1)
)

Creating the dataloaders is exactly the same as above.
As a result, all CPUs are used and everything runs much, much faster.

Do we have DataBlock experts here who can explain why this is the case, compared to the attrgetter method shown above?


Correction. If I use my custom tokenizer with TextBlock.from_df, creating the dataloaders fails with:

Could not do one pass in your dataloader, there is something wrong in it

Very ambiguous… It works fine with the default tokenizer.
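
For anyone hitting the same message: DataBlock.summary walks a single sample through every transform and prints each step, which usually pinpoints where the pipeline actually breaks. A quick sketch, reusing the objects from above:

# Walks one sample through the whole pipeline, printing every transform,
# so the failing step is visible instead of the generic one-pass error
db.summary(data_subset.head(100))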

Another finding.

If I use TextBlock.from_df, it uses all CPUs, but I cannot make it use my custom tokenizer via the tok argument: it hangs when I pass my custom tokenizer class (the one that is supposed to be wrapped by Tokenizer later).
It works fine with the default tokenizer though, fast and as expected.

Whereas if I use TextBlock without the factory method, it accepts my custom tokenizer (an instance of Tokenizer), but uses only one CPU and is too slow to use.

More progress 🙂

The problem was here (it is not well documented):
TextBlock.from_df('_VALUE', is_lm=True, tok=tokenizer)

It is about the tok argument.
It must be an instance of your custom tokenizer, not the class itself, and it should NOT be wrapped in Tokenizer yet: from_df wraps it internally.
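
For context, here is roughly what such an instance can look like. This is a hypothetical sketch of MySpacyTokenizer2, modeled on fastai’s own SpacyTokenizer (the internals are my assumptions, not the exact class used here); fastai calls the tokenizer with a batch of texts, which is exactly what lets it use spaCy’s pipe():

import spacy

class MySpacyTokenizer2:
    "Hypothetical sketch: a batch-wise spaCy tokenizer for fastai"
    def __init__(self, lang='en', buf_sz=5000):
        self.nlp = spacy.blank(lang)  # tokenizer-only pipeline
        self.buf_sz = buf_sz
    def __call__(self, items):
        # fastai passes a batch of texts, so we stream them through
        # spaCy's pipe() and yield one list of tokens per text
        return ([t.text for t in doc]
                for doc in self.nlp.pipe(map(str, items), batch_size=self.buf_sz))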

So, here is the final working code:

# What is a meaningful sentence length?
SENT_LEN = 60

# How we tokenize
tkn = Tokenizer.from_df('_VALUE', MySpacyTokenizer2(), rules=[])

# What we want our X and Y to look like
tb = TextBlock(tok_tfm=tkn, is_lm=True, seq_len=SENT_LEN)

# How we read from the DataFrame and split into train/val
db = DataBlock(
    blocks=tb,
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.1)
)

# Use the dataset in the pandas DataFrame to
# tokenize and numericalize (TextBlock),
# split into train/val (Datasets),
# and split into batches (DataLoaders)
dls_lm = db.dataloaders(
    data_subset,
    bs=128,
    seq_len=SENT_LEN)
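
From here the dataloaders plug into a learner as usual. A minimal sketch, assuming the standard AWD_LSTM setup (the hyperparameters are placeholders):

# Standard fastai language-model learner on top of the dataloaders above
learn = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1, 2e-2)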