Speeding Up fastai2 Inference - And A Few Things Learned

Yes, the timing covers everything from data prep through full model inference, so it is end to end: from loading the PILImage to decoding the batch.

In a nutshell, I exposed everything, and it’s very similar to PyTorch in a way, just with the fastai nomenclature we’re familiar with.

This is interesting.
I guess it will depend on where the bottleneck is. It’s a bit easier to find it during training with CPU/GPU usage.
It would be interesting to make a function that finds where it is (whether for training or inference).
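
A minimal sketch of what such a bottleneck finder could look like, assuming each stage (loading, model, decoding) can be wrapped in a context manager. `StageTimer` is a hypothetical helper, not part of fastai, and the `time.sleep` calls only stand in for real work:

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    "Accumulates wall-clock time per named stage (hypothetical helper, not a fastai API)."

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the wrapped code raises
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def slowest(self):
        "Name of the stage with the largest accumulated time."
        return max(self.totals, key=self.totals.get)

    def report(self):
        "Stages sorted from slowest to fastest as (name, seconds, calls)."
        rows = ((n, t, self.counts[n]) for n, t in self.totals.items())
        return sorted(rows, key=lambda r: r[1], reverse=True)


# Stand-in stages: replace the sleeps with real load / model / decode calls.
timer = StageTimer()
for _ in range(3):
    with timer.stage("load"):
        time.sleep(0.02)
    with timer.stage("model"):
        time.sleep(0.005)
    with timer.stage("decode"):
        time.sleep(0.001)

for name, secs, calls in timer.report():
    print(f"{name:>6}: {secs * 1000:.1f} ms over {calls} calls")
```

On a GPU the numbers would only be meaningful if the stage waits for the device to finish (for example by calling `torch.cuda.synchronize()` before leaving the `with` block), since CUDA kernels are launched asynchronously.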

You can also speed it up even further on the decodes. For instance, here is a script I used for decoding keypoints. With the script I got 907 microseconds, and with the dataloader I got 3.13 milliseconds.

(Mind you, my entire decode function looks like this):

def decode_kpts(x, y):
    imgs = [decode_im(im) for im in x[0]]
    kpts = [decode_y(pts) for pts in y]
    return (imgs, kpts)

def decode_im(x):
    return torch.stack([(x[i] * std[i]) + mean[i] for i in range(3)])

def decode_y(x):
    # x holds one sample's keypoints: reshape to (n_points, 2) and unscale once
    return _unscale_pnts(x.cpu().view(-1,2), (224,224))

def _unscale_pnts(y, sz): return TensorPoint((y+1) * tensor(sz).float()/2, img_size=sz)

(so it covers all the steps our decodes is doing)

I’m honestly unsure. It could very well be within the DataLoader itself or something. Another interesting bit: generating all my data (batches and all) takes 17.7 ms, but iterating through my entire dataloader to do the same (if I’m doing this right, see below) takes 267 ms. (I also sped it up a little further by installing Pillow-SIMD.)

# 267 ms:
%%timeit
for batch in dl:
    pass

# 17.7 ms (with `_build_batches` defined as below):
%%timeit
batches = _build_batches(fname_args3, 64)

def _build_batches(fnames, bs):
    "Builds batches to skip `DataLoader` overhead"
    type_tfms = [PILImage.create]
    item_tfms = [Resize(224), ToTensor()]
    type_pipe = Pipeline(type_tfms)
    item_pipe = Pipeline(item_tfms)
    norm = Normalize.from_stats(*imagenet_stats)
    i2f = IntToFloatTensor()
    batches = []
    batch = []
    for i, im in enumerate(fnames, 1):
        batch.append(item_pipe(type_pipe(im)))
        # Flush on a full batch, and on the last file so a partial final batch is kept
        if len(batch) == bs or i == len(fnames):
            batches.append(torch.cat([norm(i2f(b.cuda())) for b in batch]))
            batch = []
    return batches

And I quadruple-checked that this generates exactly the same data. (I know I posted this before, but it’s just one example I focused on; a very detailed blog post on this whole ordeal will come soon.)
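
The grouping step of `_build_batches` can be checked on its own, without images or a GPU. A minimal sketch, assuming plain Python lists of file names; `chunk` is a hypothetical helper and the file names are made up. It keeps the final partial batch when the number of files is not a multiple of `bs`, which is worth verifying when comparing against a `DataLoader`:

```python
def chunk(items, bs):
    "Split `items` into consecutive lists of at most `bs` elements."
    if bs < 1:
        raise ValueError("bs must be at least 1")
    return [items[i:i + bs] for i in range(0, len(items), bs)]


fnames = [f"img_{i}.jpg" for i in range(5)]  # hypothetical file names
batches = chunk(fnames, 2)
print([len(b) for b in batches])  # [2, 2, 1]
```

Each inner list could then be run through the item pipeline and concatenated into one tensor, as the function above does.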

Some further hints on this: doing a .predict on a single image gives me 414 ms (CPU) and 324 ms (GPU), versus 49.4 ms (CPU) and 10.7 ms (GPU) with the script. For the CPU test, the model and all transforms were moved to the CPU, including Normalize and IntToFloatTensor (b.cuda() became just b).

I finally found an answer that makes a bit more sense. When doing one image I can get within 1ms by doing:

%%timeit
test = dls.test_dl([fnames[0]])
with test.fake_l.no_multiproc():
    out = next(iter(test))

This makes much more sense as the way to go about it using the library (and it can also be a little bit faster than doing it my way).

For those that may be visiting this later, I took these practices and implemented them inside of my fastinference library: muellerzr.github.io/fastinference

Thanks for the great post @muellerzr! I’m trying to use your code but I’m getting the following error:

  File "inference.py", line 119, in fast1
    dec = learner.dls.decode_batch((*tuplify(inps), *tuplify(outs)))
  File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 80, in decode_batch
    def decode_batch(self, b, max_n=9, full=True): return self._decode_batch(self.decode(b), max_n, full)
  File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 86, in _decode_batch
    return L(batch_to_samples(b, max_n=max_n)).map(f)
  File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 226, in map
    def map(self, f, *args, gen=False, **kwargs): return self._new(map_ex(self, f, *args, gen=gen, **kwargs))
  File "/usr/local/lib/python3.8/site-packages/fastcore/basics.py", line 578, in map_ex
    return list(res)
  File "/usr/local/lib/python3.8/site-packages/fastcore/basics.py", line 568, in __call__
    return self.func(*fargs, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fastcore/basics.py", line 588, in _inner
    for f in funcs: x = f(x, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 322, in decode
    def decode(self, o, full=True): return tuple(tl.decode(o_, full=full) for o_,tl in zip(o,tuplify(self.tls, match=o)))
  File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 322, in <genexpr>
    def decode(self, o, full=True): return tuple(tl.decode(o_, full=full) for o_,tl in zip(o,tuplify(self.tls, match=o)))
  File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 246, in decode
    def decode(self, o, **kwargs): return self.tfms.decode(o, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 206, in decode
    if full: return compose_tfms(o, tfms=self.fs, is_enc=False, reverse=True, split_idx=self.split_idx)
  File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 150, in compose_tfms
    x = f(x, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 74, in decode
    def decode  (self, x, **kwargs): return self._call('decodes', x, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 83, in _call
    return self._do_call(getattr(self, fn), x, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 89, in _do_call
    return retain_type(f(x, **kwargs), x, ret)
  File "/usr/local/lib/python3.8/site-packages/fastcore/dispatch.py", line 117, in __call__
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fastai/data/transforms.py", line 246, in decodes
    def decodes(self, o): return Category      (self.vocab    [o])
  File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 153, in __getitem__
    def __getitem__(self, k): return self.items[list(k) if isinstance(k,CollBase) else k]
  File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 183, in __getitem__
    def __getitem__(self, idx): return self._get(idx) if is_indexer(idx) else L(self._get(idx), use_list=None)
  File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 188, in _get
    i = mask2idxs(i)
  File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 103, in mask2idxs
    if hasattr(it,'item'): it = it.item()
ValueError: only one element tensors can be converted to Python scalars

I’m using your code:

from fastai.vision.all import *
import torch


def _build_batches(fnames, bs):
    "Builds batches to skip `DataLoader` overhead"
    type_tfms = [PILImage.create]
    item_tfms = [Resize(224), ToTensor()]
    type_pipe = Pipeline(type_tfms)
    item_pipe = Pipeline(item_tfms)
    norm = Normalize.from_stats(*imagenet_stats)
    i2f = IntToFloatTensor()
    batches = []
    batch = []
    k = 0
    for im in fnames:
        batch.append(item_pipe(type_pipe(im)))
        k += 1
        if k == bs or k == len(fnames):
            # batches.append(torch.cat([norm(i2f(b.cuda())) for b in batch]))
            batches.append(torch.cat([norm(i2f(b.cpu())) for b in batch]))
            batch = []
            k = 0
    return batches


def faster(bs=2):
    # https://forums.fast.ai/t/speeding-up-fastai2-inference-and-a-few-things-learned/66179/17
    fnames = [...]
    if isinstance(fnames, str):
        fnames = [fnames]
    batches = _build_batches(fnames, bs)
    outs = []
    inps = []

    learner = load_learner(...)
    learner.model.eval()
    # learner.model.cpu()
    with torch.no_grad():
        for b in batches:
            outs.append(learner.model(b))
            inps.append(b)

    inp = torch.stack(inps)
    out = torch.stack(outs)
    dec = learner.dls.decode_batch((*tuplify(inp), *tuplify(out)))

Any idea what is going on? Thanks!
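
One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: the last frames show the category decode indexing the vocab with a tensor that has more than one element. That is what would happen if raw per-class model outputs are passed where a single class index per sample is expected. The sketch below uses plain Python lists as stand-ins for tensors, and `vocab` and the scores are made up. In PyTorch the corresponding steps would be `torch.cat` over the per-batch outputs followed by `argmax(dim=1)`:

```python
def argmax(row):
    "Index of the largest value in a list of scores."
    return max(range(len(row)), key=row.__getitem__)


vocab = ["cat", "dog", "bird"]  # hypothetical class names

# Raw outputs per batch: one batch of 2 samples, one of 1, with 3 class scores each.
outs = [
    [[0.1, 2.3, -1.0], [1.5, 0.2, 0.3]],
    [[-0.4, 0.0, 0.9]],
]

# Concatenate along the sample axis; stacking would need equally sized batches
# and would add an extra leading dimension.
scores = [row for batch in outs for row in batch]

# Reduce each sample's scores to a single class index before the vocab lookup.
idxs = [argmax(row) for row in scores]
labels = [vocab[i] for i in idxs]
print(idxs, labels)
```

With one index per sample, a lookup such as `vocab[i]` receives a scalar, which is what the failing `.item()` call in the traceback appears to expect.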

Thanks for the informative post.
Any tips on how to do the same in distributed mode?
Cheers!