Fast.ai text tokenizer seems to be very slow

The fast.ai Tokenizer is super slow. The following command takes about 30 minutes for a dataset of about 2M short comments:

data_lm = TextLMDataBunch.from_df('.', train_df=train, valid_df=valid_small, text_cols='comment_text')

I saw that in other solutions that use the standard PyTorch text tokenizer, a similar task takes about 2-3 minutes.

I also saw that @sgugger recently mentioned on Twitter that there is now a significant speed improvement in fast.ai tokenization, but where can I see it?


I saw this tweet describing how to overcome this issue: https://twitter.com/PfeiffJo/status/1112801797096304642

But I am still trying to figure out how to include this in fast.ai.
If I try to write code like this:

from spacy.lang.en import English
nlp = English()
data_lm = TextLMDataBunch.from_df('.', train_df=train, valid_df=valid_small, text_cols='comment_text', tokenizer=Tokenizer(tok_func=nlp.tokenizer))

I am getting an error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/concurrent/futures/process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/opt/conda/lib/python3.6/concurrent/futures/process.py", line 153, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/opt/conda/lib/python3.6/concurrent/futures/process.py", line 153, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/opt/conda/lib/python3.6/site-packages/fastai/text/transform.py", line 113, in _process_all_1
    if self.special_cases: tok.add_special_cases(self.special_cases)
AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'add_special_cases'
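
Digging into fastai's transform.py, it looks like tok_func is expected to be a BaseTokenizer subclass (which fastai instantiates with a language string), not an already-built tokenizer object, which is why add_special_cases ends up being called on a spaCy Doc. Here is a minimal sketch of what I think the wrapper should look like (the class name BlankSpacyTokenizer is my own; it assumes fastai v1's BaseTokenizer interface):

import spacy
from spacy.symbols import ORTH
from fastai.text import BaseTokenizer, Tokenizer, TextLMDataBunch

class BlankSpacyTokenizer(BaseTokenizer):
    # Wrap spaCy's blank English tokenizer (no tagger/parser/NER) so fastai can use it.
    def __init__(self, lang='en'):
        self.lang = lang
        self.tok = spacy.blank(lang)   # tokenizer only, much lighter than the full pipeline
    def tokenizer(self, t):
        return [token.text for token in self.tok.tokenizer(t)]
    def add_special_cases(self, toks):
        # fastai passes its special tokens (xxbos, xxmaj, ...) here
        for w in toks:
            self.tok.tokenizer.add_special_case(w, [{ORTH: w}])

# pass the class itself via tok_func, not a tokenizer instance
tokenizer = Tokenizer(tok_func=BlankSpacyTokenizer, lang='en')
data_lm = TextLMDataBunch.from_df('.', train_df=train, valid_df=valid_small,
                                  text_cols='comment_text', tokenizer=tokenizer)

If I read the source correctly, this is close to what fastai's built-in SpacyTokenizer already does, so I am not sure how much faster it will be, but at least it avoids the error above.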

I saw that you put this improvement into fast.ai version 1.0.52.
So do I need to upgrade and use this version on Kaggle/Colab?

@sgugger
I did upgrade to version 1.0.52, and this TextLMDataBunch.from_df(…) call still takes about 26 minutes on a Kaggle kernel, compared to about 34 minutes with version 1.0.51.
So there is definitely an improvement, but not a 10x one.
I also see that the CPU load is 198%, so the main bottleneck is probably there.

Still, tokenization with the Torch tokenizer takes much, much less time.

Then use it, by all means. Spacy takes a long time to tokenize text, but we always found it was critical in getting state-of-the-art results. I’m not sure why this is such a huge issue since tokenization is a one-time thing.


Thanks, Sylvain.

But for me both state-of-the-art results and time consumption matter, since I am trying to participate in this Kaggle competition: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification
The solution has to run inside a kernel in no more than 2 hours (the CPU limit is actually 9 hours, so it may be fine if tokenization runs on the CPU).
Another problem is that it is difficult to save intermediate results on Kaggle, since they are all removed when the kernel restarts.
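
If I read the docs right, one workaround is to run the expensive tokenization once, save the processed DataBunch, and reload it in later runs with load_data; on Kaggle the saved file would have to go to /kaggle/working or be attached as a dataset, and the file name below is just a placeholder:

from fastai.text import TextLMDataBunch, load_data

# tokenize once (the slow part) and save the processed DataBunch to disk
data_lm = TextLMDataBunch.from_df('.', train_df=train, valid_df=valid_small,
                                  text_cols='comment_text')
data_lm.save('data_lm_tokenized.pkl')

# in a later run, skip tokenization entirely and reload the saved file
data_lm = load_data('.', 'data_lm_tokenized.pkl', bs=64)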

In addition, I am trying to understand for myself (and for practical work) which things influence the quality of the result and which influence the performance.

Did you try other tokenizers in your work too and get worse results?

That is what I said, yes. You can try on your side too; maybe it's not going to change anything for your task, and if speed is the issue, you might be better off with something faster.
