PyTorch tips and tricks

(Stas Bekman) #1

This thread is dedicated to PyTorch tips, tricks, and related goodies (similar to Jupyter Notebook Enhancements, Tips And Tricks).

Please contribute your tips and improvements that make your work with PyTorch easier.

This is a recipe-collection thread; if something needs to be discussed, please start a separate thread for it.

Thank you.


(Stas Bekman) #2

How to print tensor data on the same scale

When you print out a tensor, its entries often have different e-0X exponents in scientific notation and are therefore hard to compare visually. For example, try to find the largest entry in the following tensor:

print(preds[0])
tensor([-7.3169e-02,  1.3782e-01, 5.8808e-02, 2.4611e-01, -9.3025e-02, 
        -3.6066e-02, -3.1601e-02, 1.5187e-01, 6.2414e-02,  9.2027e-03], grad_fn=<SelectBackward>)

Here is how to print the data on a common scale, without scientific notation:

torch.set_printoptions(sci_mode=False)
print(preds[0])
tensor([-0.0732,  0.1378, 0.0588, 0.2461, -0.0930,
        -0.0361, -0.0316, 0.1519, 0.0624,  0.0092], grad_fn=<SelectBackward>)

Now one can quickly tell that entry 3 is the largest.

To restore the default behavior:

torch.set_printoptions(sci_mode=True)
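
If you toggle this often, you can wrap it in a small context manager so the change is undone automatically. Here is a minimal sketch (the helper name fixed_point_printing is mine, not part of PyTorch); note that torch.set_printoptions(profile="default") resets all print options, not just sci_mode:

from contextlib import contextmanager
import torch

@contextmanager
def fixed_point_printing():
    # temporarily turn off scientific notation for tensor printing
    torch.set_printoptions(sci_mode=False)
    try:
        yield
    finally:
        # restore PyTorch's default print settings
        torch.set_printoptions(profile="default")

with fixed_point_printing():
    print(preds[0])   # printed without per-entry exponents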

Thanks to Thomas Vman for this recipe.


(Stas Bekman) #3

The correct way to %timeit functions operating on CUDA tensors

Say you want to measure how fast you can check whether any element of a CUDA tensor is NaN:

import torch

def test_overflow(x):
    # returns True if any element of x is NaN
    return torch.isnan(x).any().item()

x = torch.randn(512, 1024).cuda()
x[123, 145] = float('nan')   # plant a NaN so there is something to find

%timeit test_overflow(x)
1. CUDA synchronize

By default, CUDA kernel launches are asynchronous, meaning the program can proceed to the next instruction before the kernel has actually finished executing. So to get a true measurement you need to make sure you synchronize with the GPU and include that wait time in what you measure:

def test_speed_synced(x): 
    test_overflow(x)
    torch.cuda.synchronize()

%timeit test_speed_synced(x)

Sometimes you will get identical results with and without the sync, but it’s best to always synchronize.

You don’t need to synchronize if the operation runs on a CPU (non-CUDA) tensor.

2. CUDA setup + warm-up

You also want a warm-up stage, so that everything is set up and synchronized before the measurement starts. So you’d run it as:

def test_speed_synced(x): 
    test_overflow(x)
    torch.cuda.synchronize()

test_speed_synced(x)         # warm up + sync
%timeit test_speed_synced(x) # measure w/ sync

And, of course, you probably don’t want to run anything in parallel on that same GPU.
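
If you want an alternative to wrapping everything in torch.cuda.synchronize(), you can also time on the GPU itself with CUDA events. This is a minimal sketch that goes beyond the original recipe; cuda_time and n_runs are names I made up:

import torch

def cuda_time(fn, x, n_runs=100):
    fn(x)                                      # warm up
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_runs):
        fn(x)
    end.record()
    torch.cuda.synchronize()                   # wait until the events are recorded
    return start.elapsed_time(end) / n_runs    # average milliseconds per call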

Wishlist: code a %cuda_timeit magic to get back to the nice, short way of calling it.
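
Until that exists, here is a rough sketch of what such a magic could look like. This is just my guess at an implementation (not something shipped with PyTorch or IPython), meant to be run in a session where torch is already imported:

from IPython import get_ipython
from IPython.core.magic import register_line_magic

@register_line_magic
def cuda_timeit(line):
    # %cuda_timeit <stmt>: warm up once, then %timeit the statement with a
    # torch.cuda.synchronize() appended to every run
    ip = get_ipython()
    ip.ex(line)                           # warm up
    ip.ex("torch.cuda.synchronize()")
    ip.run_line_magic("timeit", f"({line}); torch.cuda.synchronize()")

After that, %cuda_timeit test_overflow(x) should behave like the longer version above.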

Thanks to @t-v for this recipe.

Note: another way to turn off asynchronous kernel launches is to set the CUDA_LAUNCH_BLOCKING env var to 1. This can be useful for debugging, but you don’t want it enabled when you want the full speed of your GPU.
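
The variable needs to be in the environment before CUDA gets initialized, so set it at the very top of your script (or on the command line) rather than after the first CUDA call:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # must happen before any CUDA work

import torch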
