Speeding Up fastai2 Inference - And A Few Things Learned

I suspect it would move to the CPU too. To test, check .type() on the returned tensors to see whether they’re CUDA tensors. I agree with @rwightman that CUDA syncing is likely a key issue here. Also note that you have to be careful not to use up all your GPU memory. (Which is not to say what you’re doing isn’t a good idea - just pointing out compromises to be aware of.)
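
For instance, a quick way to check (a small sketch, assuming a learn and a test_dl already exist):

# Are the tensors returned by get_preds still on the GPU?
preds, targs = learn.get_preds(dl=test_dl)
print(preds.type())   # 'torch.FloatTensor' on the CPU vs 'torch.cuda.FloatTensor' on CUDA
print(preds.is_cuda)  # equivalent boolean check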

2 Likes

If cpu=True, then the DataLoader is mapped to the CPU as well, and the same goes for False (they’re tied together in load_learner), so the tensors show this matching behavior.
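
For example (hypothetical path), both end up on the CPU or both on the GPU:

learn_cpu = load_learner('export.pkl')             # cpu=True is the default: model and dls on the CPU
learn_gpu = load_learner('export.pkl', cpu=False)  # model and dls mapped to CUDA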

I tried to keep that in mind as best I could, but if there are obvious ways of freeing memory that I missed while going through it, that’d be very important to fix (aside from, say, batch size).

get_preds uses GatherPredsCallback, which moves everything to the CPU to avoid CUDA memory issues.
You can see it through the calls to to_detach.
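
Roughly, the idea looks like this (a sketch of the behavior, not the actual fastai source):

from fastai.torch_core import to_detach

class GatherPredsSketch:
    "Illustrative only: accumulate each batch's predictions/targets on the CPU"
    def __init__(self): self.preds, self.targs = [], []
    def after_batch(self, pred, yb):
        self.preds.append(to_detach(pred))  # to_detach detaches and (by default) moves to the CPU
        self.targs.append(to_detach(yb))

So GPU memory only ever holds the current batch, while the gathered results live in RAM.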

2 Likes

Exact line:

Thanks for finding that @boris!

2 Likes

No problem. Actually, that line only applies if you want to return the inputs.
The other relevant lines are in the after_batch method, where the data (predictions and targets) is moved to the CPU.

However, it’s not that obvious, because I believe self.pred is calculated on the GPU (the model and dataloaders are still on the GPU) and then the results of all the batches are gathered on the CPU.

Fixed it to point there :slight_smile:

1 Like

What’s the best/fastest way to do inference on a GPU? I’ve been using get_preds thinking it was on GPU, but after reading this thread I’ve realized that’s not the case.

1 Like

So far, see the third technique above, though as mentioned it’s not 100% GPU.

2 Likes

I’m not 100% sure, but I think the inference is done on the GPU and the inputs, labels, and predictions are moved to the CPU at each batch.

A little update to this: I found the bottleneck (duh). Batches are built on the fly; doing dl = test_dl just sets up the pipeline.

I’ve gotten it down to just a hair under a second.

Here were my steps:

  1. Build the Pipeline manually. There was a chunk of overhead being done in the background that I could avoid. My particular problem used PILImage.create, Resize, ToTensor, IntToFloatTensor, and Normalize. As such I made a few pipelines (also notice that Normalize is applied separately; I couldn’t quite get it to work inside the Pipeline for some reason):
type_pipe = Pipeline([PILImage.create])
item_pipe = Pipeline([Resize(224), ToTensor()])
norm = Normalize.from_stats(*imagenet_stats)
i2f = IntToFloatTensor()
  2. Next comes actually applying it in batches:
for im in im_names:
    batch.append(item_pipe(type_pipe(im)))
    k += 1
    if k == 50:
        batches.append(torch.cat([norm(i2f(b.cuda())) for b in batch]))
        batch = []
        k = 0

Now this single step right here is what shaved off the most time for me. The rest is the usual prediction code (so we just got rid of the dl = learner.dls.test_dl call).

How much time did I shave? We went from our last time of 1.3 seconds down to 937ms for 100 images, so I was able to shave off even more time. I should also note that half of this time is just grabbing the data via PILImage.create :slight_smile:

Here’s the entire script:

type_tfms = [PILImage.create]
item_tfms = [Resize(224), ToTensor()]
type_pipe = Pipeline(type_tfms)
item_pipe = Pipeline(item_tfms)
norm = Normalize.from_stats(*imagenet_stats)
i2f = IntToFloatTensor()


batches = []
batch = []
outs = []
inps = []
k = 0
for im in im_names:
    batch.append(item_pipe(type_pipe(im)))
    k += 1
    if k == 50:  # batch size of 50 (note: any leftover partial batch is not flushed)
        batches.append(torch.cat([norm(i2f(b.cuda())) for b in batch]))
        batch = []
        k = 0

learner.model.eval()   # inference mode: disable dropout, use running batchnorm stats
with torch.no_grad():  # no gradient tracking needed for inference
    for b in batches:
        outs.append(learner.model(b))
        inps.append(b)

inp = torch.stack(inps)
out = torch.stack(outs)
dec = learner.dls.decode_batch((*tuplify(inp), *tuplify(out)))

(PS: if this needs further explanation let me know; these are just the mad ramblings of a man at 2am…)

(PPS: on a resnet18 you can get close to real time with this on a GPU too; with 3 images I clocked it at ~39ms)

9 Likes

Amazing! Could you update the initial post to reflect this new speed gain? That way it won’t be lost among the replies and any new user will see it :slight_smile:

2 Likes

Done :slight_smile:

1 Like

Does the time also include your data preparation? When does the timer start?

1 Like

Yes, the timing runs from data prep through full model inference, so end to end from pulling the images in with PILImage to decoding the batch.

In a nutshell, I exposed everything, and in a way it’s very similar to plain PyTorch, just with the fastai nomenclature we’re familiar with.

2 Likes

This is interesting.
I guess it will depend on where the bottleneck is. It’s a bit easier to find it during training with CPU/GPU usage.
It would be interesting to make a function that finds where it is (whether for training or inference).
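
As a rough starting point, here is a minimal sketch of such a helper (hypothetical name profile_stages; it just times each stage of a manual pipeline over a list of items):

import time

def profile_stages(stages, items):
    "Time each (name, callable) stage over `items` to see where inference time goes"
    xs = items
    for name, fn in stages:
        start = time.perf_counter()
        xs = [fn(x) for x in xs]
        print(f'{name:15s}: {time.perf_counter() - start:.4f}s for {len(xs)} items')
    return xs

# e.g. profile_stages([('load', PILImage.create), ('resize', Resize(224)),
#                      ('to_tensor', ToTensor())], im_names)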

You can also speed up the decodes even further. For instance, here is a script I used for decoding keypoints. With the script I got 907 microseconds, and with the dataloader I got 3.13 milliseconds:

(and mind you, all my decodes function does is this):

def decode_kpts(x, y):
    imgs = [decode_im(im) for im in x[0]]
    kpts = [decode_y(pts) for pts in y]
    return (imgs, kpts)

def decode_im(x):
    # mean/std: per-channel normalization stats used to un-normalize the image
    return torch.stack([(x[i] * std[i]) + mean[i] for i in range(3)])

def decode_y(x):
    return [_unscale_pnts(x.cpu().view(-1,2), (224,224)) for i in x]

def _unscale_pnts(y, sz): return TensorPoint((y+1) * tensor(sz).float()/2, img_size=sz)

(so these are all the steps our decodes is doing)

I’m honestly unsure. It could very well be within the DataLoader itself or something. Another interesting bit: generating all my data (batches and all) takes 17.7ms, but iterating through my entire dataloader to do the same (if I’m doing this right, see below) takes 267ms. (I also sped it up a little further by installing Pillow-SIMD.)

def _build_batches(fnames, bs):
    "Builds batches to skip `DataLoader` overhead"
    type_tfms = [PILImage.create]
    item_tfms = [Resize(224), ToTensor()]
    type_pipe = Pipeline(type_tfms)
    item_pipe = Pipeline(item_tfms)
    norm = Normalize.from_stats(*imagenet_stats)
    i2f = IntToFloatTensor()
    batches = []
    batch = []
    k = 0
    for im in fnames:
        batch.append(item_pipe(type_pipe(im)))
        k += 1
        if k == bs or k == len(fnames):
            batches.append(torch.cat([norm(i2f(b.cuda())) for b in batch]))
            batch = []
            k = 0
    return batches

# 267 ms:
%%timeit
for batch in dl:
    pass

# 17.7 ms:
%%timeit
batches = _build_batches(fname_args3, 64)

And I quadruple checked that this does generate the exact same data (I know I posted this before but it’s just one example I focused on, and a very detailed blog post on this whole ordeal will come soon)

Just some further hints at this: doing a .predict on CPU and GPU for a single image gives me

414ms (CPU) and 324ms (GPU), vs. the script, which takes 49.4ms (CPU) and 10.7ms (GPU). When doing the CPU test, the model and all the transforms were adjusted to the CPU, including Normalize and IntToFloatTensor (b.cuda() became just b).

3 Likes

I finally found an answer that makes a bit more sense. When doing one image I can get within 1ms by doing:

%%timeit
test = dls.test_dl([fnames[0]])
with test.fake_l.no_multiproc():  # skip the DataLoader's multiprocessing overhead for a single item
    out = next(iter(test))

This makes much more sense as a way to go about it using the library :slight_smile: (and it can also be a little bit faster than doing it my way.)
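
For completeness, a sketch (not from the original post) of finishing the job from that batch, assuming dls comes from a loaded Learner called learner whose model sits on the same device as the batch:

learner.model.eval()
with torch.no_grad():
    preds = learner.model(*out)  # `out` is the batch tuple yielded by the test_dl
dec = learner.dls.decode_batch((*tuplify(out), *tuplify(preds)))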

3 Likes

For those who may be visiting this later, I took these practices and implemented them inside my fastinference library: muellerzr.github.io/fastinference

4 Likes

Thanks for the great post @muellerzr! I’m trying to use your code but I’m getting the following error:

  File "inference.py", line 119, in fast1
    dec = learner.dls.decode_batch((*tuplify(inps), *tuplify(outs)))
  File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 80, in decode_batch
    def decode_batch(self, b, max_n=9, full=True): return self._decode_batch(self.decode(b), max_n, full)
  File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 86, in _decode_batch
    return L(batch_to_samples(b, max_n=max_n)).map(f)
  File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 226, in map
    def map(self, f, *args, gen=False, **kwargs): return self._new(map_ex(self, f, *args, gen=gen, **kwargs))
  File "/usr/local/lib/python3.8/site-packages/fastcore/basics.py", line 578, in map_ex
    return list(res)
  File "/usr/local/lib/python3.8/site-packages/fastcore/basics.py", line 568, in __call__
    return self.func(*fargs, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fastcore/basics.py", line 588, in _inner
    for f in funcs: x = f(x, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 322, in decode
    def decode(self, o, full=True): return tuple(tl.decode(o_, full=full) for o_,tl in zip(o,tuplify(self.tls, match=o)))
  File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 322, in <genexpr>
    def decode(self, o, full=True): return tuple(tl.decode(o_, full=full) for o_,tl in zip(o,tuplify(self.tls, match=o)))
  File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 246, in decode
    def decode(self, o, **kwargs): return self.tfms.decode(o, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 206, in decode
    if full: return compose_tfms(o, tfms=self.fs, is_enc=False, reverse=True, split_idx=self.split_idx)
  File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 150, in compose_tfms
    x = f(x, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 74, in decode
    def decode  (self, x, **kwargs): return self._call('decodes', x, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 83, in _call
    return self._do_call(getattr(self, fn), x, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 89, in _do_call
    return retain_type(f(x, **kwargs), x, ret)
  File "/usr/local/lib/python3.8/site-packages/fastcore/dispatch.py", line 117, in __call__
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fastai/data/transforms.py", line 246, in decodes
    def decodes(self, o): return Category      (self.vocab    [o])
  File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 153, in __getitem__
    def __getitem__(self, k): return self.items[list(k) if isinstance(k,CollBase) else k]
  File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 183, in __getitem__
    def __getitem__(self, idx): return self._get(idx) if is_indexer(idx) else L(self._get(idx), use_list=None)
  File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 188, in _get
    i = mask2idxs(i)
  File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 103, in mask2idxs
    if hasattr(it,'item'): it = it.item()
ValueError: only one element tensors can be converted to Python scalars

I’m using your code:

from fastai.vision.all import *
import torch


def _build_batches(fnames, bs):
    "Builds batches to skip `DataLoader` overhead"
    type_tfms = [PILImage.create]
    item_tfms = [Resize(224), ToTensor()]
    type_pipe = Pipeline(type_tfms)
    item_pipe = Pipeline(item_tfms)
    norm = Normalize.from_stats(*imagenet_stats)
    i2f = IntToFloatTensor()
    batches = []
    batch = []
    k = 0
    for im in fnames:
        batch.append(item_pipe(type_pipe(im)))
        k += 1
        if k == bs or k == len(fnames):
            # batches.append(torch.cat([norm(i2f(b.cuda())) for b in batch]))
            batches.append(torch.cat([norm(i2f(b.cpu())) for b in batch]))
            batch = []
            k = 0
    return batches


def faster(bs=2):
    # https://forums.fast.ai/t/speeding-up-fastai2-inference-and-a-few-things-learned/66179/17
    fnames = [...]
    if isinstance(fnames, str):
        fnames = [fnames]
    batches = _build_batches(fnames, bs)
    outs = []
    inps = []

    learner = load_learner(...)
    learner.model.eval()
    # learner.model.cpu()
    with torch.no_grad():
        for b in batches:
            outs.append(learner.model(b))
            inps.append(b)

    inp = torch.stack(inps)
    out = torch.stack(outs)
    dec = learner.dls.decode_batch((*tuplify(inp), *tuplify(out)))

Have any idea what is going on? Thanks!

Thanks for the informative post.
Any tips on how to do the same in distributed mode?
Cheers!