The correct way to %timeit functions operating on CUDA tensors
Say you want to measure how fast it is to find out whether any of the elements of a CUDA tensor are inf:
import torch

x = torch.randn(512, 1024).cuda()
x[123, 145] = float('inf')
- CUDA synchronize
By default, CUDA kernel launches are asynchronous: the program can proceed to the next instruction before the kernel has actually finished executing. So to get true measurements you need to make sure you synchronize the device and include that time in the execution measurement:
Sometimes you will get identical results w/ and w/o sync. But it’s best to always synchronize the kernel.
You don’t need to sync if the operation is done on a non-cuda tensor.
- CUDA setup+warm up
You also want a warm-up stage, to ensure everything is set up and synced before the measurement starts, so you'd run it as:
test_speed_synced(x) # warm up + sync
%timeit test_speed_synced(x) # measure w/ sync
And, of course, you probably don’t want to run anything in parallel on that same GPU.
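Outside of IPython, the same warm-up + sync recipe can be reproduced with the stdlib timeit module; a sketch (the `has_inf` and `cuda_time` names are illustrative, not from the original):

```python
import timeit

import torch

def has_inf(x):
    return torch.isinf(x).any()

def cuda_time(fn, x, number=100):
    # warm up + sync once before measuring
    fn(x)
    if x.is_cuda:
        torch.cuda.synchronize()

    def run():
        fn(x)
        if x.is_cuda:
            torch.cuda.synchronize()

    # average seconds per call, with the sync included
    return timeit.timeit(run, number=number) / number

x = torch.randn(512, 1024)
if torch.cuda.is_available():
    x = x.cuda()
x[123, 145] = float('inf')
print(f"{cuda_time(has_inf, x):.6f} s/call")
```

The guard on `x.is_cuda` also matches the note above: no sync is needed for non-CUDA tensors.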
Wishlist: code a %cuda_timeit magic to get back to the nice, short way of calling it.
Thanks to @t-v for this recipe.
note: Another way to turn off asynchronous kernel launching is to set the CUDA_LAUNCH_BLOCKING env var to 1. This can be useful for debugging, but you don't want it on if you want the full speed of your GPU.
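For example, from Python it could be set like this (it has to be set before the CUDA context is created, i.e. before the first CUDA call in the process, or it will have no effect):

```python
import os

# forces every kernel launch to block until the kernel completes --
# handy for debugging, bad for speed; must be set before CUDA is
# initialized in this process
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Setting it on the shell command line (`CUDA_LAUNCH_BLOCKING=1 python my_script.py`) avoids the ordering concern entirely.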