NLP - Tokenizer()

gerardo · August 21, 2018, 10:19pm

I have this piece of code in my prediction and my text has only 4 words.

#tokenize using the fastai wrapper around spacy
tic = time.clock()
tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
toc = time.clock()

Tokenizer time 29.547838342226896
Is this normal?

tester · August 21, 2018, 10:32pm

This function creates multiple Python processes, so the total includes the overhead of starting them and of passing messages between them, on top of the tokenization time itself.
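
To get a rough feel for that overhead, here is a minimal sketch that uses only the standard library. It does not call fastai: `split_words`, `run_serial`, `run_in_pool` and `timed` are hypothetical helpers made up for illustration, and the whitespace split is only a stand-in for a real tokenizer. It uses `time.perf_counter()` for wall-clock timing, since `time.clock()` was removed in Python 3.8.

```python
import time
from concurrent.futures import ProcessPoolExecutor


def split_words(text):
    # Stand-in for a real tokenizer (hypothetical, not fastai's Tokenizer).
    return text.split()


def timed(fn, *args):
    """Return (result, elapsed wall-clock seconds) for fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_serial(texts):
    # Everything happens in the current process: no startup cost.
    return [split_words(t) for t in texts]


def run_in_pool(texts, n_workers=2):
    # Starting worker processes and pickling inputs and outputs both cost
    # time, regardless of how little work each text needs.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(split_words, texts))


if __name__ == "__main__":
    texts = ["Hello", "World", "Foo", "Bar"]
    _, serial_s = timed(run_serial, texts)
    _, pool_s = timed(run_in_pool, texts)
    print(f"serial: {serial_s:.4f}s  pool: {pool_s:.4f}s")
```

The absolute numbers depend on the machine and operating system, so treat the output as a comparison between the two paths rather than a benchmark.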

I assume you mean 29 seconds, or is it 2.9 seconds? Also, is len(texts) == 4?

gerardo · August 22, 2018, 12:25am

29 seconds for the tokenizer, and yes, 4 words.

TheShadow29 · August 22, 2018, 4:37am

I doubt it. I used a larger corpus and it took less time than that. Can you confirm that the output of tok is word-based? I think it may be character-based (not sure).

tester · August 22, 2018, 4:48am

from fastai.text import *
import time
tic = time.clock()
tok = Tokenizer().proc_all_mp(partition_by_cores(['Hello', 'World', 'Foo', 'Bar']))
toc = time.clock()
print(toc - tic)

3 seconds for me, with 8 CPUs here.

KarlH · August 22, 2018, 9:58pm

I ran into the same issue. I think whatever magic happens in the CPU when it’s partitioning to cores can take a bit of time.

Try using Tokenizer().proc_text(s) instead. I have found this works faster for processing small strings (i.e., for prediction), where the input is small enough that using multiple cores doesn’t really make sense.
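
One way to apply this idea in general is to pick the path based on input size. The sketch below is only an illustration under assumptions: `tokenize_batch`, `simple_tokenize` and the `min_parallel_size` threshold are hypothetical names, not part of fastai, and the threshold value would need tuning on your own data.

```python
from concurrent.futures import ProcessPoolExecutor


def simple_tokenize(text):
    # Stand-in for a real tokenizer (hypothetical, not fastai's Tokenizer).
    return text.lower().split()


def tokenize_batch(texts, min_parallel_size=1000, n_workers=2):
    """Tokenize in-process for small inputs, in worker processes otherwise."""
    if len(texts) < min_parallel_size:
        # Small input (e.g. a single prediction): skip process startup.
        return [simple_tokenize(t) for t in texts]
    # Large input: the startup cost is spread over many texts.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(simple_tokenize, texts, chunksize=256))


if __name__ == "__main__":
    print(tokenize_batch(["Hello World", "Foo Bar"]))
```

With only a handful of strings the call never leaves the current process, which is the case that matters at prediction time.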

gerardo · August 23, 2018, 12:38pm

WOW

tic = time.clock()
print(texts)
tok = Tokenizer().proc_all(texts, lang='en')
toc = time.clock()

Tokenizer time 0.9702205351916291

Great improvement
Thanks!