Mixed precision training still kills the kernel on a V100 with CUDA 10.0 and the latest driver

Impressed by the speed and performance of mixed precision training, I immediately tried to get my hands on it. Because Jeremy suggested that it only works with the latest Nvidia GPU driver, I installed the latest driver and updated CUDA 9 to CUDA 10 on my GCP V100 instance. Here is the screenshot from nvidia-smi.

However, when I run through the lesson3-camvid notebook with to_fp16() added to the learner, as Jeremy did in the lecture video, the kernel still gets killed after training starts. Memory is not the issue: usage did not even reach 20% before the kernel died.
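
For reference, this is roughly what I'm running (paraphrased from the notebook, so treat data, metrics, wd and lr as the variables defined earlier in lesson3-camvid):

from fastai.vision import *

# data, metrics, wd and lr are set up earlier in the lesson3-camvid notebook
learn = unet_learner(data, models.resnet34, metrics=metrics, wd=wd).to_fp16()
learn.fit_one_cycle(10, slice(lr))  # kernel dies shortly after this starts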

What am I missing here?


An obvious thing to play with might be batch size, but I presume you've already fiddled with that and still killed the kernel.

Are you running the lesson3-camvid notebook as is? Perhaps some modification you've made is causing GPU memory to balloon unnecessarily.
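
If you do want to experiment with batch size, something along these lines should shrink the memory footprint (a sketch using the lesson3-camvid names, so bs, size and src are assumptions on my part):

bs = 4  # e.g. half of the notebook's default
data = (src.transform(get_transforms(), size=size, tfm_y=True)
        .databunch(bs=bs)
        .normalize(imagenet_stats))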

For reference, I'm able to run it on a Tesla P4 using a GCP preemptible instance with CUDA 9.2:

jupyter@my-fastai-instance:~/tutorials/fastai$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   84C    P0    63W /  75W |   2953MiB /  7611MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9383      C   /opt/anaconda3/bin/python                   2943MiB |
+-----------------------------------------------------------------------------+

Thanks for the input. The issue here is not memory. Since half-precision floating point takes less memory than full precision, the V100 should not run out of memory, especially given that it did not come close to full memory usage even with full precision. In fact, when the kernel gets killed, there is still an ample amount of memory left.
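
For anyone who wants to verify this themselves, PyTorch's own counters can be printed from the notebook right before the crash (a minimal sketch; 0 is the device index):

import torch

print(torch.cuda.memory_allocated(0) / 2**20)      # MiB currently allocated by PyTorch
print(torch.cuda.max_memory_allocated(0) / 2**20)  # peak MiB since the process started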

Hi, were you able to succeed?
On Kaggle the latest CUDA drivers are installed, but I'm still not getting the speed-up from fp16.
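
In case it helps, the fp16 speed-up mostly comes from Tensor Cores, which need compute capability 7.0 or higher (V100, T4, ...), so a quick check in the Kaggle kernel shows whether the GPU can benefit at all (a minimal sketch, assuming PyTorch is available):

import torch

name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
print(f"{name}: compute capability {major}.{minor}")
# Tensor Cores, and with them most of the fp16 speed-up, require >= 7.0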