NLP - Tokenizer()

I have this piece of code in my prediction step, and my text has only 4 words.

# tokenize using the fastai wrapper around spaCy
tic = time.clock()
tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
toc = time.clock()

Tokenizer time 29.547838342226896
Is this normal?

This function creates multiple processes in Python. You pay the overhead of starting them, plus message passing between the processes, plus the tokenizer time itself.
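To get a rough feel for that fixed cost, here is a minimal sketch using only the standard library. It times launching one fresh Python interpreter and compares that with splitting a four-word string in-process. Note the assumptions: str.split is only a stand-in for the real tokenizer, and launching an interpreter is not identical to how a process pool starts its workers, so treat the numbers as an order-of-magnitude illustration only.

```python
import subprocess
import sys
import time


def time_interpreter_launch() -> float:
    """Return the seconds needed to start and exit one fresh Python process."""
    start = time.perf_counter()
    # "-c pass" runs an empty program, so this is almost pure startup cost.
    subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start


def time_in_process_split(text: str):
    """Split `text` on whitespace in the current process and time it."""
    start = time.perf_counter()
    tokens = text.split()  # stand-in for a real tokenizer (assumption)
    return tokens, time.perf_counter() - start


launch_seconds = time_interpreter_launch()
tokens, split_seconds = time_in_process_split("Hello World Foo Bar")

print(f"one process launch: {launch_seconds:.6f} s")
print(f"in-process split:   {split_seconds:.6f} s")
```

With only a handful of words, the per-process startup cost will usually dwarf the work being done, which is the effect described above.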

I assume you mean 29 seconds, or is it 2.9 seconds? Also, is len(texts) == 4?

29 seconds for the tokenizer, and yes, 4 words.

I doubt it. I used a larger corpus and it took less time than that. Can you confirm that the output of tok is word-based? It might be character-based (not sure).

from fastai.text import *
import time
tic = time.clock()
tok = Tokenizer().proc_all_mp(partition_by_cores(['Hello', 'World', 'Foo', 'Bar']))
toc = time.clock()
print(toc - tic)

3 seconds for me, with 8 CPUs here.
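A side note on the timing code in this thread: time.clock() was deprecated in Python 3.3 and removed in Python 3.8, and on Unix it reported CPU time of the current process rather than elapsed wall-clock time. A minimal sketch of wall-clock timing with time.perf_counter() follows. The helper name time_call is made up for illustration, and str.split again stands in for the tokenizer.

```python
import time


def time_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# str.split is a placeholder; in the thread this would be the Tokenizer call.
tokens, seconds = time_call(str.split, "Hello World Foo Bar")
print(f"Tokenizer time {seconds:.6f}")
```

Because perf_counter() measures elapsed time, it also includes time spent waiting on worker processes, which is what matters when comparing the multi-process and single-process paths.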

I ran into the same issue. I think whatever magic happens in the CPU when it’s partitioning to cores can take a bit of time.

Try using Tokenizer().proc_text(s) instead. I have found this works faster for processing small strings (i.e. for prediction), where the input is small enough that using multiple cores doesn’t really make sense.
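One way to apply this advice is to pick the code path from the input size. The sketch below is not part of fastai: tokenize_adaptive, use_parallel, and the cutoff MIN_TEXTS_FOR_PARALLEL are hypothetical names and values chosen for illustration, and whitespace_tokenize is a stand-in so the example runs without fastai installed.

```python
from typing import Callable, List, Sequence

# A batch tokenizer takes a list of strings and returns one token list per string.
BatchTokenizer = Callable[[Sequence[str]], List[List[str]]]

# Hypothetical cutoff; the right value depends on your machine and tokenizer.
MIN_TEXTS_FOR_PARALLEL = 1000


def use_parallel(n_texts: int, min_parallel: int = MIN_TEXTS_FOR_PARALLEL) -> bool:
    """Return True only when the input is large enough to justify extra processes."""
    return n_texts >= min_parallel


def tokenize_adaptive(
    texts: Sequence[str],
    serial: BatchTokenizer,
    parallel: BatchTokenizer,
    min_parallel: int = MIN_TEXTS_FOR_PARALLEL,
) -> List[List[str]]:
    """Run `serial` for small inputs and `parallel` for large ones."""
    chosen = parallel if use_parallel(len(texts), min_parallel) else serial
    return chosen(texts)


def whitespace_tokenize(texts: Sequence[str]) -> List[List[str]]:
    """Stand-in tokenizer (assumption): split each string on whitespace."""
    return [t.split() for t in texts]


# Both paths use the stand-in here; only the dispatch logic is being shown.
result = tokenize_adaptive(["Hello World Foo Bar"], whitespace_tokenize, whitespace_tokenize)
print(result)
```

With fastai, the serial callable could wrap the proc_text call suggested above, using a Tokenizer created once and reused, and the parallel callable could wrap proc_all_mp(partition_by_cores(texts)) from the earlier posts.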



tic = time.clock()
tok = Tokenizer().proc_all(texts, lang='en')
toc = time.clock()

Tokenizer time 0.9702205351916291

Great improvement
