Improving inference latency for ULMFiT

Hey, I’ve been using ULMFiT for a while now to experiment with state-of-the-art results on various text classification tasks. I’m at a point where I’d like to put some of these models into a production environment. From what I can tell, fastai is the only real Python library at the moment with built-in support for NLP transfer learning.

The problem is that calling learner.predict with ULMFiT takes on average 0.5 seconds per prediction for me on the CPU, which is a good order of magnitude longer than most production tasks can afford. Using a GPU for prediction doesn’t seem to help much, and a GPU is generally not the way you want to serve predictions anyway.
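For concreteness, this is roughly how I’m timing it. The `predict` function below is a stand-in so the snippet is self-contained; in my real setup it’s the learner’s predict call on a piece of text:

```python
import time

def predict(text):
    # Stand-in for the actual model call (e.g. learner.predict(text)).
    time.sleep(0.01)  # simulate model work
    return "positive"

def mean_latency(fn, inputs, warmup=2):
    # Warm up first so one-time setup (lazy loading, caches) isn't counted.
    for text in inputs[:warmup]:
        fn(text)
    start = time.perf_counter()
    for text in inputs:
        fn(text)
    return (time.perf_counter() - start) / len(inputs)

texts = ["sample review"] * 20
avg = mean_latency(predict, texts)
print(f"mean latency: {avg * 1000:.1f} ms per prediction")
```

The warmup matters: the first call is often much slower than steady state, and it’s the steady-state number that matters for serving.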

Does anyone know of any tips to speed this up, while keeping predictions on the CPU?

The obvious things I’ve considered are 1) predicting in batches, and 2) using multiprocessing where possible. But there are limits to how much latency reduction one can reasonably hope to gain from these: they give constant-factor improvements, not the order-of-magnitude improvement I need.

Side note: it’s possible that a large part of this latency comes from the text preprocessing the learner does. I haven’t run a detailed profiler to check yet. I’ve heard the tokenizer is slow, but I’m skeptical that it’s the bottleneck for prediction.
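Profiling a single prediction with the standard-library cProfile would settle this. A sketch, with a stand-in `predict_once` where the real learner call would go:

```python
import cProfile
import io
import pstats

def predict_once():
    # Replace this stand-in with the actual call, e.g. learner.predict(text).
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
predict_once()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)  # top 10 by cumulative time
report = stream.getvalue()
print(report)
```

If tokenization/preprocessing functions dominate the cumulative-time column, the fix is very different than if the time is in the model’s forward pass.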


@rkingery Did you find any way to speed up the inference?


Interested in this thread if anyone can help.