Text tokenizer seems to be very slow

The tokenizer is very slow. The following command takes about 30 minutes on a dataset of about 2M short comments:

data_lm = TextLMDataBunch.from_df('.', train_df=train, valid_df=valid_small, text_cols='comment_text')

I saw that other solutions, which use the standard PyTorch text tokenizer, finish a similar task in about 2-3 minutes.

I also saw that @sgugger recently mentioned on Twitter that tokenization is now significantly faster, but where can I see that improvement?


I saw this tweet describing how to overcome this issue:

But I am still trying to figure out how to include this in
If I try to write code like this:

from spacy.lang.en import English
nlp = English()
data_lm = TextLMDataBunch.from_df('.', train_df=train, valid_df=valid_small, text_cols='comment_text', tokenizer=Tokenizer(tok_func=nlp.tokenizer))

I am getting an error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/concurrent/futures/", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/opt/conda/lib/python3.6/concurrent/futures/", line 153, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/opt/conda/lib/python3.6/concurrent/futures/", line 153, in
    return [fn(*args) for args in chunk]
  File "/opt/conda/lib/python3.6/site-packages/fastai/text/", line 113, in _process_all_1
    if self.special_cases: tok.add_special_cases(self.special_cases)
AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'add_special_cases'
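
Reading the traceback, it looks like fastai calls tok_func itself (apparently with the language code) and then calls add_special_cases on whatever comes back. Passing nlp.tokenizer directly would make that call tokenize a string and return a spaCy Doc, which would explain the AttributeError. Below is a minimal sketch of an object with the shape fastai seems to expect. The class name SimpleRegexTokenizer and the regex are hypothetical, the tokenizer method name is an assumption based on fastai v1's BaseTokenizer, and this regex splitting is a stand-in rather than spaCy's tokenization.

```python
import re
from typing import Collection, List


class SimpleRegexTokenizer:
    """Hypothetical stand-in with the interface the traceback implies."""

    def __init__(self, lang: str = "en"):
        # Assumption: fastai instantiates tok_func(lang), so accept a language code.
        self.lang = lang
        self.special_cases: List[str] = []
        self._rebuild()

    def _rebuild(self) -> None:
        # Special cases come first in the alternation so they stay single tokens.
        specials = "|".join(
            re.escape(s) for s in sorted(self.special_cases, key=len, reverse=True)
        )
        base = r"\w+|[^\w\s]"  # runs of word characters, or one punctuation mark
        self._pattern = re.compile(f"{specials}|{base}" if specials else base)

    def add_special_cases(self, toks: Collection[str]) -> None:
        # The method the traceback shows being called on the tokenizer object.
        self.special_cases = list(toks)
        self._rebuild()

    def tokenizer(self, t: str) -> List[str]:
        # Split one text into a list of string tokens.
        return self._pattern.findall(t)


# Mimic the call sequence visible in the traceback.
tok = SimpleRegexTokenizer("en")
tok.add_special_cases(["<unk>"])
print(tok.tokenizer("Hello, <unk> world!"))
```

If that reading is right, the class itself (not an instance, and not nlp.tokenizer) would be what goes into Tokenizer(tok_func=...). I have not verified this against 1.0.52, so treat it as a starting point.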

I saw that you included this improvement in version 1.0.52.
So do I need to upgrade and use this version on Kaggle/Colab?

I upgraded to version 1.0.52, and TextLMDataBunch.from_df(…) still takes about 26 minutes on a Kaggle kernel, compared to about 34 minutes with version 1.0.51.
So it is definitely an improvement, but not by a factor of 10.
I also see that the CPU load is 198%, so the CPU is probably the main bottleneck.

Still, tokenization with the Torch tokenizer takes much, much less time.
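
To compare tokenizers on equal footing, one option is to time each on a small sample and extrapolate to the full dataset. This is only a rough sketch: the function names are hypothetical, str.split stands in for a real tokenizer, the sample data is a placeholder, and the extrapolation assumes perfect scaling across workers, which real runs will not reach.

```python
import time
from typing import Callable, List, Sequence


def seconds_per_text(tokenize: Callable[[str], List[str]], sample: Sequence[str]) -> float:
    """Average wall-clock seconds spent tokenizing one text from the sample."""
    start = time.perf_counter()
    for text in sample:
        tokenize(text)
    return (time.perf_counter() - start) / len(sample)


def estimate_minutes(per_text_seconds: float, n_texts: int, n_workers: int = 1) -> float:
    """Extrapolate to the full dataset, assuming ideal scaling across workers."""
    return per_text_seconds * n_texts / n_workers / 60


sample = ["This is a short comment."] * 10_000  # placeholder data
per_text = seconds_per_text(str.split, sample)
print(f"estimated minutes for 2M texts: {estimate_minutes(per_text, 2_000_000):.2f}")
```

For scale, 30 minutes for 2M comments works out to roughly 0.9 ms per comment.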

Then use it, by all means. Spacy takes a long time to tokenize text, but we always found it was critical in getting state-of-the-art results. I’m not sure why this is such a huge issue since tokenization is a one-time thing.
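
One way to keep tokenization a one-time cost is to cache the tokenized output on disk and reload it on later runs. The sketch below is a generic illustration using only the standard library. The helper name cached_tokens and the cache file name are hypothetical, and it is not fastai's own mechanism.

```python
import pickle
from pathlib import Path
from typing import Callable, List, Sequence, Union


def cached_tokens(
    texts: Sequence[str],
    tokenize: Callable[[str], List[str]],
    cache_path: Union[str, Path],
) -> List[List[str]]:
    """Tokenize texts once; later calls load the pickled result instead."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: skip tokenization entirely.
        with cache_path.open("rb") as f:
            return pickle.load(f)
    tokens = [tokenize(t) for t in texts]
    with cache_path.open("wb") as f:
        pickle.dump(tokens, f)
    return tokens


# First call tokenizes and writes the file; a second call would only read it.
tokens = cached_tokens(["a short comment", "another one"], str.split, "tokens_cache.pkl")
print(tokens)
```

Note that this sketch never invalidates the cache, so the file has to be deleted by hand if the texts or the tokenizer change. fastai v1 also appears to offer its own save and load_data helpers for a whole DataBunch, if your version includes them.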


Thanks, Sylvain.

But both state-of-the-art results and running time are important to me, since I am trying to participate in this Kaggle competition.
There is a limit: the solution must run inside a kernel for no more than 2 hours (CPU time is basically 9 hours, so it may be fine if tokenization runs on the CPU).
Another problem is that it is difficult to save intermediate results on Kaggle, since they are all removed when the kernel restarts.

In addition, I am also trying to understand for myself (and for possible practical work) which things influence the quality of the results and which influence performance.

Did you also try other tokenizers in your work and get worse results?

That is what I said, yes. You can try on your side too; maybe it's not going to change anything for your task, and if speed is the issue, you might be better off with something faster.
