I suspect it would move to CPU too. To test, check .type() on the returned tensors to see if they’re cuda. I agree with @rwightman that CUDA syncing is likely a key issue here. Also note that you have to be careful not to use up all your GPU memory. (Which is not to say what you’re doing isn’t a good idea - just pointing out compromises to be aware of.)
If cpu=True, then the DataLoader is mapped to the CPU as well, and the same goes for False (they’re tied together in load_learner), so the tensors show this matching behavior.
I tried to keep that in mind as best I could, but if there are obvious ways of freeing memory that I missed while going through it, that’d be very important to fix (aside from, say, batch size etc.).
get_preds uses GatherPredsCallback, which moves everything to the CPU to avoid CUDA memory issues. You can see it through the calls to to_detach.
No problem, actually this line is only if you want to return the inputs.
Other relevant lines would be in the after_batch method, where data is moved to the CPU (predictions and targets). However, it’s not that obvious, because I believe self.pred is calculated on the GPU (model and dataloaders still on the GPU) and then the results of all batches are gathered on the CPU.
Fixed it to point there
What’s the best/fastest way to do inference on a GPU? I’ve been using get_preds thinking it was on GPU, but after reading this thread I’ve realized that’s not the case.
So far, see the third technique above, though as mentioned it’s not 100% GPU.
I’m not 100% sure but I think the inference is done on GPU and inputs, labels, predictions are moved to CPU at each batch.
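The pattern described here can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not fastai's actual GatherPredsCallback code: the `nn.Linear` model and the random input batches are stand-ins for the thread's learner and images. The forward pass runs on whichever device is available, and only each batch's output is copied back to the CPU.

```python
import torch
from torch import nn

# Assumption: a tiny stand-in model and random inputs, not the thread's learner.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 3).to(device).eval()
batches = [torch.randn(4, 8) for _ in range(3)]

preds = []
with torch.no_grad():
    for xb in batches:
        out = model(xb.to(device))        # forward pass runs on `device`
        preds.append(out.detach().cpu())  # only the result is copied back to the CPU

preds = torch.cat(preds)  # one row per input, stored on the CPU
print(preds.shape, preds.device)
```

Keeping only the per-batch outputs on the CPU bounds GPU memory use by a single batch, at the cost of one device-to-host copy per batch.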
A little update to this: I found the bottleneck (duh). Batches are built on the fly; doing the dl = test_dl just sets up the pipeline.
I’ve gotten it down to just a hair under a second.
Here were my steps:
- Build the Pipeline manually. There was a chunk of overhead being done in the background that I could avoid. My particular problem used PILImage.create, Resize, ToTensor, IntToFloatTensor, and Normalize. As such I made a few pipelines (also notice the Normalize, which I couldn’t quite get to work in the Pipeline for some reason):
type_pipe = Pipeline(PILImage.create)
item_pipe = Pipeline(Resize(224), ToTensor())
norm = Normalize.from_stats(*imagenet_stats)
i2f = IntToFloatTensor()
- Next comes actually applying it into batches:
for im in im_names:
    batch.append(item_pipe(type_pipe(im)))
    k += 1
    if k == 50:
        batches.append(torch.cat([norm(i2f(b.cuda())) for b in batch]))
        batch = []
        k = 0
Now this single step right here is what shaved off the most time for me. The rest is the usual for predictions (so we just got rid of the dl = learner.dls.test_dl)
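The grouping logic of the batching step can be checked on its own, without images or a GPU. The sketch below uses a hypothetical `chunked` helper and made-up file names (neither is from the original script). Note that a loop which only flushes when the counter reaches the batch size leaves any remainder uncollected, so this version also yields the final, smaller batch.

```python
def chunked(items, bs):
    """Yield consecutive lists of at most `bs` items, including a final short one."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == bs:
            yield batch
            batch = []
    if batch:  # leftover items that did not fill a whole batch
        yield batch

# Hypothetical file names standing in for `im_names`.
names = [f"img_{i}.jpg" for i in range(120)]
sizes = [len(b) for b in chunked(names, 50)]
print(sizes)
```

With 100 images and a batch size of 50, as in the timing above, there is no remainder, so both approaches produce the same two batches.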
How much time did I shave? We went from our last time of 1.3 seconds down to 937 ms for 100 images, so I was able to shave off even more time. I should also note that half of this time is just grabbing the data via PILImage.create.
Here’s the entire script:
from fastai.vision.all import *

# Assumes `im_names` (a list of image paths) and `learner` (a loaded Learner) already exist
type_tfms = [PILImage.create]
item_tfms = [Resize(224), ToTensor()]
type_pipe = Pipeline(type_tfms)
item_pipe = Pipeline(item_tfms)
norm = Normalize.from_stats(*imagenet_stats)
i2f = IntToFloatTensor()

batches = []
batch = []
outs = []
inps = []
k = 0
# Build batches of 50 images by hand instead of going through a DataLoader
for im in im_names:
    batch.append(item_pipe(type_pipe(im)))
    k += 1
    if k == 50:
        batches.append(torch.cat([norm(i2f(b.cuda())) for b in batch]))
        batch = []
        k = 0

# Run the model on each batch without tracking gradients
learner.model.eval()
with torch.no_grad():
    for b in batches:
        outs.append(learner.model(b))
        inps.append(b)

inp = torch.stack(inps)
out = torch.stack(outs)
dec = learner.dls.decode_batch((*tuplify(inp), *tuplify(out)))
(PS if this needs further explanation let me know, just the mad ramblings of a man at 2am…)
(PPS, on a resnet18 you can get close to real time with this too on a GPU, with 3 images I clocked it at ~39ms)
Amazing! Could you update the initial post to reflect this new speed gain? That way it won’t get lost among the replies, and any new user will see it.
Done
Does the time consider also your data preparation? When does the timer start?
Yes, time starts from data prep into full model inference, so end to end pulling in from PILImage to decoding the batch
In a nutshell I exposed everything and it’s very similar to pytorch in a way, just with the fastai nomenclature we’re familiar with
This is interesting.
I guess it will depend on where the bottleneck is. It’s a bit easier to find it during training with CPU/GPU usage.
It would be interesting to make a function that finds where it is (whether for training or inference).
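One way such a bottleneck finder could look is a small timing helper that accumulates wall-clock time per named stage. This is only a sketch: the `timed` helper, the `timings` dictionary, and the three stand-in stages are hypothetical, and in practice the blocks would wrap data loading, the forward pass, and decoding.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(name):
    """Accumulate wall-clock seconds spent inside the block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start

# Stand-in stages; real code would load images, run the model and decode here.
with timed("load"):
    data = [i * 2 for i in range(10_000)]
with timed("model"):
    total = sum(data)
with timed("decode"):
    text = str(total)

slowest = max(timings, key=timings.get)
for name, secs in timings.items():
    print(f"{name:>8}: {secs * 1e3:.3f} ms")
print("slowest stage:", slowest)
```

When a stage runs on the GPU, CUDA kernels are launched asynchronously, so calling `torch.cuda.synchronize()` before reading the clock is needed for the per-stage numbers to be meaningful.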
Also, you can speed it up even further on the decodes too. For instance, here is a script I used for decoding keypoints. With the script I got 907 microseconds, and with the dataloader I got 3.13 milliseconds (mind you, my whole decode function looks like this):
# Assumes `mean` and `std` hold the per-channel normalization stats
def decode_kpts(x, y):
    imgs = [decode_im(im) for im in x[0]]
    kpts = [decode_y(pts) for pts in y]
    return (imgs, kpts)

def decode_im(x):
    # Undo the normalization channel by channel
    return torch.stack([(x[i] * std[i]) + mean[i] for i in range(3)])

def decode_y(x):
    return [_unscale_pnts(x.cpu().view(-1,2), (224,224)) for i in x]

# Map points from the [-1, 1] range back to pixel coordinates
def _unscale_pnts(y, sz): return TensorPoint((y+1) * tensor(sz).float()/2, img_size=sz)
(so all the steps our decodes is doing)
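The point unscaling used above is the formula (y + 1) * size / 2, which maps coordinates from the [-1, 1] range back to pixels. A minimal pure-Python sketch of the same arithmetic follows; the `unscale_points` name is hypothetical, and treating `size` as (width, height) is an assumption that makes no difference for the square 224x224 case used here.

```python
def unscale_points(points, size):
    """Map (x, y) pairs from the [-1, 1] range to pixel coordinates."""
    w, h = size  # assumption: size is (width, height)
    return [((x + 1) * w / 2, (y + 1) * h / 2) for x, y in points]

pts = [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0), (0.5, -0.5)]
print(unscale_points(pts, (224, 224)))
```

So -1 maps to the left or top edge, 0 to the centre, and 1 to the right or bottom edge.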
I’m honestly unsure. It could very well be within the DataLoader itself or something. Another interesting bit: generating all my data (batches and all) takes 17.7 ms, but iterating through my entire dataloader to do the same (if I’m doing this right, see below) takes 267 ms. (I also sped it up a little further by installing Pillow-SIMD.)
%%timeit
# 267 ms: iterate over the whole DataLoader
for batch in dl:
    pass

%%timeit
# 17.7 ms: build the same batches by hand
batches = _build_batches(fname_args3, 64)
def _build_batches(fnames, bs):
    "Builds batches to skip `DataLoader` overhead"
    type_tfms = [PILImage.create]
    item_tfms = [Resize(224), ToTensor()]
    type_pipe = Pipeline(type_tfms)
    item_pipe = Pipeline(item_tfms)
    norm = Normalize.from_stats(*imagenet_stats)
    i2f = IntToFloatTensor()
    batches = []
    batch = []
    k = 0
    for im in fnames:
        batch.append(item_pipe(type_pipe(im)))
        k += 1
        if k == bs:
            batches.append(torch.cat([norm(i2f(b.cuda())) for b in batch]))
            batch = []
            k = 0
    # Flush any leftover images so a final batch smaller than `bs` is kept
    if batch:
        batches.append(torch.cat([norm(i2f(b.cuda())) for b in batch]))
    return batches
And I quadruple checked that this does generate the exact same data (I know I posted this before but it’s just one example I focused on, and a very detailed blog post on this whole ordeal will come soon)
Just some further hints at this: doing a .predict on CPU and GPU for a single image gives me 414 ms (CPU) and 324 ms (GPU), vs. the script, which takes 49.4 ms (CPU) and 10.7 ms (GPU). When doing the CPU test, the model and all transforms were moved to the CPU, including Normalize and IntToFloatTensor (b.cuda() became just b).
I finally found an answer that makes a bit more sense. When doing one image I can get within 1ms by doing:
%%timeit
test = dls.test_dl([fnames[0]])
with test.fake_l.no_multiproc():
    out = next(iter(test))
This makes much more sense as a way to do it using the library (and it can also be a little bit faster than doing it my way).
For those that may be visiting this later, I took these practices and implemented them inside of my fastinference library: muellerzr.github.io/fastinference
Thanks for the great post @muellerzr! I’m trying to use your code but I’m getting the following error:
File "inference.py", line 119, in fast1
dec = learner.dls.decode_batch((*tuplify(inps), *tuplify(outs)))
File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 80, in decode_batch
def decode_batch(self, b, max_n=9, full=True): return self._decode_batch(self.decode(b), max_n, full)
File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 86, in _decode_batch
return L(batch_to_samples(b, max_n=max_n)).map(f)
File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 226, in map
def map(self, f, *args, gen=False, **kwargs): return self._new(map_ex(self, f, *args, gen=gen, **kwargs))
File "/usr/local/lib/python3.8/site-packages/fastcore/basics.py", line 578, in map_ex
return list(res)
File "/usr/local/lib/python3.8/site-packages/fastcore/basics.py", line 568, in __call__
return self.func(*fargs, **kwargs)
File "/usr/local/lib/python3.8/site-packages/fastcore/basics.py", line 588, in _inner
for f in funcs: x = f(x, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 322, in decode
def decode(self, o, full=True): return tuple(tl.decode(o_, full=full) for o_,tl in zip(o,tuplify(self.tls, match=o)))
File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 322, in <genexpr>
def decode(self, o, full=True): return tuple(tl.decode(o_, full=full) for o_,tl in zip(o,tuplify(self.tls, match=o)))
File "/usr/local/lib/python3.8/site-packages/fastai/data/core.py", line 246, in decode
def decode(self, o, **kwargs): return self.tfms.decode(o, **kwargs)
File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 206, in decode
if full: return compose_tfms(o, tfms=self.fs, is_enc=False, reverse=True, split_idx=self.split_idx)
File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 150, in compose_tfms
x = f(x, **kwargs)
File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 74, in decode
def decode (self, x, **kwargs): return self._call('decodes', x, **kwargs)
File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 83, in _call
return self._do_call(getattr(self, fn), x, **kwargs)
File "/usr/local/lib/python3.8/site-packages/fastcore/transform.py", line 89, in _do_call
return retain_type(f(x, **kwargs), x, ret)
File "/usr/local/lib/python3.8/site-packages/fastcore/dispatch.py", line 117, in __call__
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/fastai/data/transforms.py", line 246, in decodes
def decodes(self, o): return Category (self.vocab [o])
File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 153, in __getitem__
def __getitem__(self, k): return self.items[list(k) if isinstance(k,CollBase) else k]
File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 183, in __getitem__
def __getitem__(self, idx): return self._get(idx) if is_indexer(idx) else L(self._get(idx), use_list=None)
File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 188, in _get
i = mask2idxs(i)
File "/usr/local/lib/python3.8/site-packages/fastcore/foundation.py", line 103, in mask2idxs
if hasattr(it,'item'): it = it.item()
ValueError: only one element tensors can be converted to Python scalars
I’m using your code:
from fastai.vision.all import *
import torch

def _build_batches(fnames, bs):
    "Builds batches to skip `DataLoader` overhead"
    type_tfms = [PILImage.create]
    item_tfms = [Resize(224), ToTensor()]
    type_pipe = Pipeline(type_tfms)
    item_pipe = Pipeline(item_tfms)
    norm = Normalize.from_stats(*imagenet_stats)
    i2f = IntToFloatTensor()
    batches = []
    batch = []
    k = 0
    for im in fnames:
        batch.append(item_pipe(type_pipe(im)))
        k += 1
        if k == bs or k == len(fnames):
            # batches.append(torch.cat([norm(i2f(b.cuda())) for b in batch]))
            batches.append(torch.cat([norm(i2f(b.cpu())) for b in batch]))
            batch = []
            k = 0
    return batches

def faster(bs=2):
    # https://forums.fast.ai/t/speeding-up-fastai2-inference-and-a-few-things-learned/66179/17
    fnames = [...]
    if isinstance(fnames, str):
        fnames = [fnames]
    batches = _build_batches(fnames, bs)
    outs = []
    inps = []
    learner = load_learner(...)
    learner.model.eval()
    # learner.model.cpu()
    with torch.no_grad():
        for b in batches:
            outs.append(learner.model(b))
            inps.append(b)
    inp = torch.stack(inps)
    out = torch.stack(outs)
    dec = learner.dls.decode_batch((*tuplify(inp), *tuplify(out)))
Have any idea what is going on? Thanks!
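The traceback above ends where the category vocabulary is indexed with a tensor that has more than one element. One possible cause, offered as an assumption rather than a confirmed diagnosis, is that raw model outputs (one score per class) reach the decoder where a single class index is expected. The sketch below shows the missing reduction with plain Python lists; the `vocab`, the `logits` values, and the `argmax` helper are all made up for illustration.

```python
# Hypothetical vocabulary and raw model outputs (one row of class scores per image).
vocab = ["cat", "dog", "bird"]
logits = [
    [0.1, 2.3, -1.0],
    [1.7, 0.2, 0.5],
]

def argmax(row):
    """Index of the largest score in one row (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)

# A whole row of scores cannot be used as an index into the vocabulary,
# so each row is reduced to a single class index before the lookup.
labels = [vocab[argmax(row)] for row in logits]
print(labels)
```

With tensors, the equivalent reduction would be `out.argmax(dim=-1)`. Note too that `torch.stack` adds a leading dimension, so each decoded sample may still be a whole batch of scores, whereas `torch.cat` along dimension 0 keeps one row per image.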
Thanks for the informative post.
Any tips on how to do the same in distributed mode?
Cheers!