The correct way to %timeit functions operating on CUDA tensors

Say you want to measure how fast it is to find out whether any of the elements of a tensor on CUDA are nan:
import torch

def test_overflow(x):
    return torch.isnan(x).any().item()

x = torch.randn(512, 1024).cuda()
x[123, 145] = float('nan')  # isnan() catches nan, not inf
%timeit test_overflow(x)
- CUDA synchronize
By default, CUDA kernel calls are asynchronous, meaning that a CUDA program can proceed to the next instruction before the kernel actually completes execution. So to get the true measurements you need to make sure you sync the kernel and include that time in the execution measurement:
def test_speed_synced(x):
    test_overflow(x)
    torch.cuda.synchronize()

%timeit test_speed_synced(x)
Sometimes you will get identical results w/ and w/o sync, but it's best to always synchronize the kernel.
You don't need to sync if the operation is done on a non-CUDA tensor, since CPU ops execute synchronously.
- CUDA setup+warm up
You also want a warm-up stage, so that everything is set up and synced before the measurement starts, so you'd run it as:

def test_speed_synced(x):
    test_overflow(x)
    torch.cuda.synchronize()

test_speed_synced(x)          # warm up + sync
%timeit test_speed_synced(x)  # measure w/ sync
And, of course, you probably don’t want to run anything in parallel on that same GPU.
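Outside of %timeit, one common alternative is to time with CUDA events. This is a rough sketch (the helper name cuda_time_ms is my own) that bakes in both the warm-up and the final sync:

```python
import torch

def cuda_time_ms(fn, *args, warmup=3, iters=10):
    """Rough per-call timing in milliseconds using CUDA events.
    Assumes a CUDA device is available."""
    for _ in range(warmup):      # warm-up runs, not measured
        fn(*args)
    torch.cuda.synchronize()     # make sure warm-up work has finished
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()     # wait for the end event to complete
    return start.elapsed_time(end) / iters
```

Events record on the GPU itself, so this measures kernel time without relying on host-side timers.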
Wishlist: code a %cuda_timeit magic to get back to the nice, short way of calling it.
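Such a magic could look roughly like the sketch below — the synced_stmt helper and the registration code are my assumptions, not an existing implementation:

```python
def synced_stmt(stmt):
    # append a trailing sync so %timeit measures the full kernel time
    return f"{stmt}; torch.cuda.synchronize()"

try:
    from IPython import get_ipython
    from IPython.core.magic import register_line_magic

    if get_ipython() is not None:
        @register_line_magic
        def cuda_timeit(line):
            """Usage: %cuda_timeit test_overflow(x)"""
            get_ipython().run_line_magic('timeit', synced_stmt(line))
except ImportError:
    pass  # not running under IPython; synced_stmt still works standalone
```

With this registered, %cuda_timeit test_overflow(x) would expand to a %timeit call with the sync appended.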
Thanks to @t-v for this recipe.
note: Another way to turn off asynchronous kernel launches is to set the CUDA_LAUNCH_BLOCKING env var to 1. This can be useful for debugging, but you don't want it when you need the full speed of your GPU.
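For example, assuming a hypothetical script my_script.py, a debug run would look like:

```shell
# force every kernel launch to block until completion -- debugging only
CUDA_LAUNCH_BLOCKING=1 python my_script.py
```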