ULMFit in production, tokenizer slow (recreating Spacy tokenizer at each request)

Hi!

I’m trying to deploy a ULMFit model in production (on AWS). The language is French, so the tokenizer is French as well:

tokenizer = TokenizerForProduction(lang='fr', n_cpus=5)
data_lm = (TextList.from_df(df=df, processor=[TokenizeProcessor(tokenizer=tokenizer), NumericalizeProcessor(max_vocab=600000)])

I’ve noticed that performance is really bad when calling learner.predict:

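def predict(sentence):
    # learner.predict returns three values; only the per-class predictions are used here
    _, _, predictions = learner.predict(sentence.lower())
    # pair each score with its class name, highest score first
    result = sorted(zip(predictions.tolist(), learner.data.classes), key=lambda tup: tup[0], reverse=True)
    return result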

I’ve tracked it down to the code in https://github.com/fastai/fastai/blob/master/fastai/text/transform.py#L87 on line 112:

def _process_all_1(self, texts:Collection[str]) -> List[List[str]]:
    "Process a list of `texts` in one process."
    tok = self.tok_func(self.lang)
    if self.special_cases: tok.add_special_cases(self.special_cases)
    return [self.process_text(str(t), tok) for t in texts]

In this function, a Spacy tokenizer is recreated for each request. That is quite time consuming, especially with French.

Just to try, I wrote my own Tokenizer class where the Spacy tokenizer is created only once and stored on the class instance. I got 10 times more requests per second on a small AWS instance. (The code is at the end of this message.)
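
A minimal sketch of the idea, using a stand-in class rather than the real Spacy tokenizer. SlowTokenizer, PerRequestPipeline and CachedPipeline are hypothetical names for illustration, not fastai API; the counter only shows how often the expensive constructor runs:

```python
class SlowTokenizer:
    """Stand-in for a tokenizer that is expensive to build (hypothetical, not spaCy)."""
    instances_built = 0  # counts constructor calls across all instances

    def __init__(self, lang):
        SlowTokenizer.instances_built += 1
        self.lang = lang

    def tokenize(self, text):
        return text.split()


class PerRequestPipeline:
    """Rebuilds the tokenizer on every call, like the _process_all_1 quoted above."""
    def __init__(self, lang):
        self.lang = lang

    def process(self, texts):
        tok = SlowTokenizer(self.lang)  # constructed again for each request
        return [tok.tokenize(t) for t in texts]


class CachedPipeline:
    """Builds the tokenizer once and reuses it for every call."""
    def __init__(self, lang):
        self.tok = SlowTokenizer(lang)  # constructed a single time

    def process(self, texts):
        return [self.tok.tokenize(t) for t in texts]


# Three simulated requests against each pipeline.
per_request = PerRequestPipeline("fr")
for _ in range(3):
    per_request.process(["bonjour le monde"])
built_per_request = SlowTokenizer.instances_built  # one build per call

SlowTokenizer.instances_built = 0
cached = CachedPipeline("fr")
for _ in range(3):
    cached.process(["bonjour le monde"])
built_cached = SlowTokenizer.instances_built  # built only in __init__
```

With a real Spacy model the constructor is the costly part, so the cached variant pays that cost once instead of once per request.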

My question is: is there a design reason I’m not seeing for re-creating the Spacy tokenizer on each call to learner.predict()? Am I missing something? Is learner.predict not supposed to be used in production?



# Assumes the fastai v1 names used below are in scope, e.g. via `from fastai.text import *`.
class TokenizerForProduction():
    "Put together rules and a tokenizer function to tokenize text with multiprocessing."
    def __init__(self, tok_func:Callable=SpacyTokenizer, lang:str='en', pre_rules:ListRules=None,
                 post_rules:ListRules=None, special_cases:Collection[str]=None, n_cpus:int=None):
        self.tok_func,self.lang = tok_func,lang
        self.pre_rules  = ifnone(pre_rules,  defaults.text_pre_rules )
        self.post_rules = ifnone(post_rules, defaults.text_post_rules)
        self.special_cases = special_cases if special_cases is not None else defaults.text_spec_tok
        self.n_cpus = ifnone(n_cpus, defaults.cpus)
        # Build the Spacy tokenizer once and keep it on the instance.
        self.tok = self.tok_func(self.lang)
        # Register the special cases once here rather than on every request.
        if self.special_cases: self.tok.add_special_cases(self.special_cases)

    def __repr__(self) -> str:
        res = f'Tokenizer {self.tok_func.__name__} in {self.lang} with the following rules:\n'
        for rule in self.pre_rules: res += f' - {rule.__name__}\n'
        for rule in self.post_rules: res += f' - {rule.__name__}\n'
        return res

    def process_text(self, t:str, tok:BaseTokenizer) -> List[str]:
        "Process one text `t` with tokenizer `tok`."
        for rule in self.pre_rules: t = rule(t)
        toks = tok.tokenizer(t)
        for rule in self.post_rules: toks = rule(toks)
        return toks

    def _process_all_1(self, texts:Collection[str]) -> List[List[str]]:
        "Process a list of `texts` in one process, reusing the cached tokenizer."
        return [self.process_text(str(t), self.tok) for t in texts]

    def process_all(self, texts:Collection[str]) -> List[List[str]]:
        "Process a list of `texts`."
        if self.n_cpus <= 1: return self._process_all_1(texts)
        with ProcessPoolExecutor(self.n_cpus) as e:
            return sum(e.map(self._process_all_1, partition_by_cores(texts, self.n_cpus)), [])

Yes, this is because it needs to be created on each subprocess when tokenizing lots of text before training. We know it’s slow for production so will fix this in v2.
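
One possible way to reconcile the two needs (a tokenizer built inside each subprocess for bulk tokenization, but no rebuild per request when serving) is a lazy per-process cache. The sketch below uses a stand-in class and is not how fastai implements it; get_tokenizer, tokenize_chunk and tokenize_parallel are hypothetical names:

```python
import os
from concurrent.futures import ProcessPoolExecutor
from functools import lru_cache


class SlowTokenizer:
    """Stand-in for an expensive tokenizer (hypothetical, not spaCy)."""
    def __init__(self, lang):
        self.lang = lang
        self.pid = os.getpid()  # records which process built this instance

    def tokenize(self, text):
        return text.split()


@lru_cache(maxsize=None)
def get_tokenizer(lang):
    """Build the tokenizer at most once per process and language, then reuse it."""
    return SlowTokenizer(lang)


def tokenize_chunk(args):
    """Worker entry point: only plain data (lang, texts) is sent between processes."""
    lang, texts = args
    tok = get_tokenizer(lang)
    return [tok.tokenize(t) for t in texts]


def tokenize_parallel(lang, texts, n_workers=2):
    """Bulk path: split `texts` into contiguous chunks and tokenize them in workers."""
    size = max(1, -(-len(texts) // n_workers))  # ceiling division
    chunks = [(lang, texts[i:i + size]) for i in range(0, len(texts), size)]
    with ProcessPoolExecutor(n_workers) as pool:
        # pool.map keeps chunk order, so the result order matches the input order
        return sum(pool.map(tokenize_chunk, chunks), [])


# Serving path: repeated calls in the same process share one tokenizer.
first = get_tokenizer("fr")
second = get_tokenizer("fr")
single_result = tokenize_chunk(("fr", ["bonjour le monde"]))
```

The serving path calls tokenize_chunk directly and never pays for a second build; the bulk path lets each worker process create its own tokenizer the first time it needs one.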


Thanks, this was very useful!!

Do you know if it’s possible to change a tokenizer inside an exported learner?

(I’m not sure this is the answer you are looking for… Can you re-export your learner?)

For the production code to work, I did it the quick and dirty way for the time being:

Declare your tokenizer class and then, right after, inject it into __main__ so that the exported model can be unpickled:

import __main__
__main__.TokenizerForProduction = TokenizerForProduction

A better way would be to declare your tokenizer in a proper Python package. Then pickle should be able to unpickle it.
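
The reason either approach works is that pickle stores a class by its import path (module name plus class name), not by its code, so whatever loads the exported model must be able to find the class at that same path. A small illustration with a standard-library class, used here only as a stand-in for the tokenizer class:

```python
import pickle
from fractions import Fraction

payload = pickle.dumps(Fraction(1, 3))

# The payload names the defining module and the class, but carries no class code.
has_module_name = b"fractions" in payload
has_class_name = b"Fraction" in payload

# Unpickling imports `fractions` and looks up `Fraction` there.
restored = pickle.loads(payload)
```

If TokenizerForProduction lived in, say, a hypothetical my_package/tokenizers.py, the pickle would record my_package.tokenizers as the module, and the serving code would only need that package to be importable.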


Thanks for the answer (and the future fix)! And for the great work on the library too!

Hmm, I’m not sure I follow. What I was asking is this:

  • I have an already trained model, with the default spacy tokenizer
  • I want to replace that tokenizer inside the learner with your code, but without re-creating the databunch and re-training the model

I’m not sure if this is possible: I can’t find a reference to the tokenizer inside the learner object.