Speeding Up fastai2 Inference - And A Few Things Learned

So today I went through and looked at trying to do inference efficiently. The first thing you’ll notice is that there is a decent speed gain to be had here. The main reason, I believe, is that fastai does a lot on the back end to make various functions/utilities easy to use, and while it tries to do so as efficiently as possible, that convenience can cost some speed. (If I am wrong here, Jeremy or Sylvain, please don’t hesitate to call me out :slight_smile: ) With that out of the way, let’s get into a few bits:

Trying to Speed Up with Jit Scripting

Something done often on Kaggle (thanks @DrHB!) to speed up inference is saving the model away as a jit-scripted module. I found that this made no noticeable change in time here (not even 0.01ms).
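For reference, the trace-and-save approach looks something like this; a minimal sketch using a tiny stand-in network (the real code would trace learner.model instead):

```python
import torch
from torch import nn

# Small stand-in network -- in practice this would be learner.model
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2)
).eval()

# Trace with a dummy input, save as TorchScript, and reload
dummy = torch.randn(1, 3, 64, 64)
torch.jit.save(torch.jit.trace(model, dummy), 'jit.pt')
loaded = torch.jit.load('jit.pt')

with torch.no_grad():
    print(loaded(dummy).shape)  # torch.Size([1, 2])
```

The loaded module can then be called exactly like the original model.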

Doing fastai2 efficiently

Now let’s get into the juicy stuff. Let’s presume the following pipeline:

  1. Make a test_dl (which, for the record, is super efficient, just microseconds in time!). For 1,000 images I clocked it at 669 microseconds
  2. Run predictions
  3. Decode your predictions (such as if we have keypoints, get back the actual values)

We’re then going to look at three different versions of this code: first using fastai as-is, then two other ways of writing similar code. (Again, this is not to bash on the library; this is an extreme edge case for people who need predictions as fast as possible!)

  • Note: Baselines taken on 100 images with a batch size of 50 (I also looked at 64; it wasn’t as fast)

fastai straight

Let’s presume we have the following prediction script, which is pretty standard for fastai:

dl = learner.dls.test_dl(imgs)
inp, preds,_,dec_preds = learner.get_preds(dl=dl, with_input=True, with_decoded=True)
full_dec = learner.dls.decode_batch((*tuplify(inp),*tuplify(dec_preds)))

This loop (timed with %%time) takes approximately 1.98 seconds in total, which is about 0.02 seconds per image. If for many this is fast enough, that’s fine, but I’ll be trying to make it faster.
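If you want to reproduce these timings outside a notebook’s %%time, a small helper along these lines (a hypothetical time_it, not part of fastai) does the job:

```python
import time

def time_it(fn, repeats=3):
    """Run fn several times and return the best wall-clock time in seconds."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Example usage with a trivial workload standing in for the prediction loop
elapsed = time_it(lambda: sum(i * i for i in range(100_000)))
print(f"best of 3: {elapsed:.4f}s")
```

Taking the best of several runs reduces the noise from caching and background work.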

Getting rid of get_preds

The next bit we’ll try is getting rid of get_preds. Here’s what this new code involves:

dec_batches = []
dl = learn.dls.test_dl(imgs)
learn.model.eval()
with torch.no_grad():
  for batch in dl:
    res = learn.model(batch[0])
    inp = batch[0]
    dec_batches.append(learn.dls.decode_batch((*tuplify(inp), *tuplify(res))))

Notice specifically that we’re calling learn.model directly and grabbing the batches from the dataloader in succession, decoding each batch as it comes out. This brings our time down to 1.7 seconds, so we shaved off 0.2 seconds.

A better way

We can shave off even more time if, instead of decoding each batch as it comes out, we combine them all into one big batch and decode at the end. Remember, since we’re not sending this to the model as one big batch, we’re okay combining it after the fact!

What does this look like? Something like so:

outs = []
inps = []
dl = learner.dls.test_dl(imgs)
learner.model.eval()
with torch.no_grad():
  for batch in dl:
    outs.append(learner.model(batch[0]))
    inps.append(batch[0])

outs = torch.cat(outs)
inps = torch.cat(inps)
dec = learner.dls.decode_batch((*tuplify(inps), *tuplify(outs)))

So we can see we combine all the batches via torch.cat (joining them along the batch dimension into one flat batch) and pass the result into decode_batch. But what does this time gain look like? It comes out to 1.3 seconds, a 33% decrease!
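Whether you combine the collected batches with torch.cat or torch.stack changes the resulting shape, which matters for how decode_batch sees the result; a quick pure-PyTorch shape check:

```python
import torch

# Two "batches" of 50 predictions with 4 values each
b1 = torch.randn(50, 4)
b2 = torch.randn(50, 4)

# cat joins along the existing batch dim -> one flat batch of 100
flat = torch.cat([b1, b2])
# stack adds a new leading dim -> a batch of 2 batches
nested = torch.stack([b1, b2])

print(flat.shape, nested.shape)  # torch.Size([100, 4]) torch.Size([2, 50, 4])
```

Note also that stack requires every batch to be the same size, while cat handles a smaller final batch fine.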

There are probably ways to decrease this further. Remember though, this is an advanced approach in the sense that you’re walking away from purely fastai code and instead combining the library with other code to take it further.

Hope this helps :slight_smile:

Also, this was just a find after a quick day; as I mentioned, there are probably better ways of doing this, and if you find them, post them in this thread!

Edit:

The new best way:

type_tfms = [PILImage.create]
item_tfms = [Resize(224), ToTensor()]
type_pipe = Pipeline(type_tfms)
item_pipe = Pipeline(item_tfms)
norm = Normalize.from_stats(*imagenet_stats)
i2f = IntToFloatTensor()


batches = []
batch = []
outs = []
inps = []
k = 0
for im in im_names:
    batch.append(item_pipe(type_pipe(im)))
    k += 1
    if k == 50:
        batches.append(torch.cat([norm(i2f(b.cuda())) for b in batch]))
        batch = []
        k = 0

learner.model.eval()
with torch.no_grad():
    for b in batches:
        outs.append(learner.model(b))
        inps.append(b)

inp = torch.cat(inps)
out = torch.cat(outs)
dec = learner.dls.decode_batch((*tuplify(inp), *tuplify(out)))

I think a really interesting direction here would be seeing how this applies (and what else could be done) with a single image with inference run on the CPU. This is likely the most common scenario to be seen in production (apart from at some very, very large places).

Scaling APIs horizontally on CPU VMs would probably be the way to go for nearly anyone, it’s relatively cheap, and shaving off any significant time here could be very useful. Anything more complex adds time very quickly, and user experience deteriorates very, very quickly past the 1 sec mark (from when the user takes an action to the result being displayed). The faster an API endpoint, the better :slight_smile:


Here are those results, and the ideas still hold true (though the difference isn’t as big)! I did the first and second methods, as the third is exclusively for batches (the decode_batch overhead won’t matter for a single image).

Regular learn.predict: 835ms/loop
Using learn.model (with .eval()): 811ms/loop

This makes sense: if you look at the source code for predict, we’re essentially replacing one line of code there, instead of the multiple lines we were able to replace in the batch case.

@radek you got me thinking about something, so I tried exporting to jit and bringing it in. What do you know, the CPU did show better performance!

732ms/loop

The steps to do so:

dummy_inp = torch.randn([2,3,224,224])
torch.jit.save(torch.jit.trace(learner.model.cpu(), dummy_inp), 'jit.pt')
model = torch.jit.load('jit.pt')
model.cpu()
model.eval();

(I’m also looking into adding an export-to-jit function to the library... or not, based on Radek’s comment below.)


A 12% decrease in execution time would definitely be helpful so this is looking great :slight_smile:

I wonder what the status of jit is, whether we can rely on saving the model and loading it like this? In part 2 of v3 lecture 12 Jeremy speaks to jit being unreliable. I wonder if this has changed?

An improvement of this magnitude would be super useful, but I wonder if we should be checking our model after export to confirm it is still doing what we want it to do?

It probably doesn’t hurt to check and I am fairly sure one would want to do so before putting anything in production :slight_smile: Just leaving this as a note for whoever might be reading this thread in the future and exploring this route.
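Checking the exported model, as suggested here, is cheap: compare the traced module’s outputs against the eager model on a few random inputs. A sketch with a stand-in network (substitute your own model):

```python
import torch
from torch import nn

# Stand-in for the real model -- substitute learner.model here
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2)).eval()
traced = torch.jit.trace(model, torch.randn(1, 10))

# Sanity-check the export on a handful of fresh random inputs
with torch.no_grad():
    for _ in range(5):
        x = torch.randn(4, 10)
        assert torch.allclose(model(x), traced(x), atol=1e-6)
print("traced model matches eager model")
```

If the model has data-dependent control flow, tracing can silently bake in one branch, which is exactly what a check like this would catch.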


The overall principle of doing work in larger chunks can save some overhead, but I doubt that removing the fastai functions in question, collecting the outputs, and decoding all at once can account for a 33% difference.

Are you sure you handled GPU-CPU transfers equivalently in each case? I believe get_preds moves results to the CPU; did your code? I didn’t see a .cpu() anywhere, so I assume you may have left results on the GPU, and I also don’t see any cuda synchronize() call in there. Generally, benchmarking data that lives on the GPU isn’t valid unless you move it to the CPU or make a synchronize call, because the kernels execute asynchronously.
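To illustrate the synchronize point: a hypothetical timing helper that calls torch.cuda.synchronize() before and after the forward pass, so the clock measures kernel completion rather than just the launch (it falls back to plain timing on CPU):

```python
import time
import torch

def timed_forward(model, x):
    """Time one forward pass, synchronizing so GPU kernels finish before we stop the clock."""
    if x.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        out = model(x)
    if x.is_cuda:
        torch.cuda.synchronize()  # without this we'd only time the kernel *launch*
    return out, time.perf_counter() - start

model = torch.nn.Linear(100, 10).eval()
x = torch.randn(32, 100)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
out, secs = timed_forward(model, x)
print(f"forward pass took {secs * 1000:.2f}ms")
```

Moving the result to the CPU (out.cpu()) inside the timed region is an equivalent way to force completion.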

Also, one thing to always consider in these sorts of optimizations: between the extremes of one sample at a time and all samples at once, you trade per-sample overhead against per-sample latency. While higher per-sample overhead may sound bad, it is sometimes a necessary price to pay to get your per-sample latency down, which is generally lowest when you process one sample at a time. Collecting inference results for a Kaggle competition with limited kernel time vs running a model in a realtime system have very different demands in this regard. Also consider that when your model inference runs within a larger pipeline, lumpier output may leave resources idling and then slam them; some overhead to smooth it out is worthwhile.


I’ll start by saying thank you for providing your input, I’m learning a great deal from your response and I have a great amount of respect for you :slight_smile:

In v2 there is no call making such a change (to my knowledge, or at least I have not found it yet while surfing through the code for the last 6 months). The initial DataLoader and the model were both set to cuda before running the benchmark, and thus ran on that particular device.

This is where it’s a bit out of my knowledge bank, so I can’t provide an answer :sweat_smile: (without more research)

Interesting, that was definitely something I did not account for (also from my lack of knowledge on it). How would this be different from, say, the DataLoader's device being cuda and the model being cuda as well? I assume there are intermediate steps that I may be missing here.

This was originally just an intermediate effort by me (a non-expert in cuda benchmarking, etc.) to try to speed up fastai with some comparisons that seemed to work on paper, so apologies if some of the techniques/ideas do not hold up in the real benchmarking world :confused: (but I can work to fix that!)


If you are interested in speeding up CPU inference times, it’s probably best to take the underlying PyTorch model, and use pure PyTorch code. Then you can speed things up with quantization and obviously TorchScript (JIT). Even better, I am pretty sure you can convert the models to C++ and you might see some speed up there. There are definitely a lot of opportunities to speed up model inference with PyTorch.

(Note: I haven’t tried any of these methods, and only have read about them, so there may be some caveats I am not aware of)
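As a hedged sketch of the dynamic-quantization route mentioned above (using a stand-in model; the real learner.model.cpu() would slot in the same way, and results will vary by architecture):

```python
import torch
from torch import nn

# Stand-in model -- in practice, learner.model.cpu()
model = nn.Sequential(nn.Linear(224, 128), nn.ReLU(), nn.Linear(128, 10)).eval()

# Dynamic quantization converts Linear weights to int8 for faster CPU inference
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 224)
with torch.no_grad():
    eager_out = model(x)
    quant_out = quantized(x)

# Outputs differ slightly due to the int8 weights, but stay close
print((eager_out - quant_out).abs().max())
```

The quantized module can also be passed through torch.jit.trace afterwards to combine both optimizations.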

I suspect it would move to CPU too. To test, check .type() on the returned tensors to see if they’re cuda. I agree with @rwightman that CUDA syncing is likely a key issue here. Also note that you have to be careful not to use up all your GPU memory. (Which is not to say what you’re doing isn’t a good idea; just pointing out compromises to be aware of.)


If cpu=True, then the DataLoader is mapped to CPU as well, and the same goes for False (they’re tied together in load_learner), so the tensors show this matching behavior.

I tried to keep that in mind as best I could, but if there are obvious ways of freeing memory that I may have missed while going through it, that would be very important to fix (aside from, say, batch size etc.).

get_preds uses GatherPredsCallback which moves everything to cpu to avoid CUDA memory issues.
You can see it through the calls to to_detach.
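What to_detach effectively does per tensor can be sketched in plain PyTorch (this is an illustration, not fastai’s actual implementation, which also handles nested tuples and dicts):

```python
import torch

# get_preds gathers results after detaching them from the autograd graph
# and moving them to the CPU; per tensor, that boils down to:
def gather(pred):
    return pred.detach().cpu()

pred = torch.randn(8, 2, requires_grad=True) * 2  # simulate a model output
gathered = gather(pred)
print(gathered.device)         # always cpu
print(gathered.requires_grad)  # False -- detached from the graph
```

This is why accumulating predictions via get_preds doesn’t exhaust GPU memory, at the cost of a transfer per batch.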


Exact line:

Thanks for finding that @boris!


No problem! Actually, that line only runs if you want to return the inputs.
Other relevant lines would be in the after_batch method, where the data (predictions and targets) is moved to the CPU.

However, it’s not that obvious, because I believe self.pred is calculated on the GPU (the model and dataloaders are still on the GPU) and then the results of all batches are gathered on the CPU.

Fixed it to point there :slight_smile:


What’s the best/fastest way to do inference on a GPU? I’ve been using get_preds thinking it was on GPU, but after reading this thread I’ve realized that’s not the case.


So far, see the third technique above, though as mentioned it’s not 100% GPU.


I’m not 100% sure, but I think the inference is done on the GPU and the inputs, labels, and predictions are moved to the CPU at each batch.

A little update to this: I found the bottleneck (duh). Batches are built on the fly; doing dl = test_dl just sets up the pipeline.

I’ve gotten it down to just a hair under a second.

Here were my steps:

  1. Build the Pipeline manually. There was a chunk of overhead being done in the background that I could avoid. My particular problem used PILImage.create, Resize, ToTensor, IntToFloatTensor, and Normalize. As such I made a few pipelines (also notice the Normalize; I couldn’t quite get it to work inside the Pipeline for some reason):
type_pipe = Pipeline(PILImage.create)
item_pipe = Pipeline([Resize(224), ToTensor()])
norm = Normalize.from_stats(*imagenet_stats)
i2f = IntToFloatTensor()
  2. Next comes actually applying it in batches:
for im in im_names:
    batch.append(item_pipe(type_pipe(im)))
    k += 1
    if k == 50:
        batches.append(torch.cat([norm(i2f(b.cuda())) for b in batch]))
        batch = []
        k = 0

Now, this single step right here is what shaved off the most time for me. The rest is the usual prediction code (we just got rid of dl = learner.dls.test_dl).

How much time did I shave? We went from our last time of 1.3 seconds down to 937ms for 100 images. I should also note that half of this time is just grabbing the data via PILImage.create :slight_smile:

Here’s the entire script:

type_tfms = [PILImage.create]
item_tfms = [Resize(224), ToTensor()]
type_pipe = Pipeline(type_tfms)
item_pipe = Pipeline(item_tfms)
norm = Normalize.from_stats(*imagenet_stats)
i2f = IntToFloatTensor()


batches = []
batch = []
outs = []
inps = []
k = 0
for im in im_names:
    batch.append(item_pipe(type_pipe(im)))
    k += 1
    if k == 50:
        batches.append(torch.cat([norm(i2f(b.cuda())) for b in batch]))
        batch = []
        k = 0

learner.model.eval()
with torch.no_grad():
    for b in batches:
        outs.append(learner.model(b))
        inps.append(b)

inp = torch.cat(inps)
out = torch.cat(outs)
dec = learner.dls.decode_batch((*tuplify(inp), *tuplify(out)))

(PS if this needs further explanation let me know, just the mad ramblings of man at 2am…)

(PPS, on a resnet18 you can get close to real time with this too on a GPU, with 3 images I clocked it at ~39ms)


Amazing! Could you update the initial post to reflect this new speed gain? So, it won’t be lost between post replies and any new user will see :slight_smile:


Done :slight_smile:


Does the time consider also your data preparation? When does the timer start?
