Reset GPU without restarting Linux?


(Antoni Batchelli) #1

I set up this computer for remote use. In some instances, CUDA errors (maybe related to network issues, I can’t tell) left the GPU unusable. Killing the Jupyter kernel didn’t help; only restarting the computer did.

This is the only GPU in the system (a GTX 1070 Ti), so I believe it’s in use by the display. I am not running X Windows or similar.

Resetting the GPU with nvidia-smi -r doesn’t help either:

tbatchelli@MLrig:~$ nvidia-smi -r
GPU Reset couldn't run because GPU 00000000:23:00.0 is the primary GPU.

Is there any way to reset the GPU without restarting Linux? Would adding a second (cheaper) GPU for the display help?


(Ramesh Sampath) #2

What’s the underlying OS? Ubuntu? Can you check which processes nvidia-smi lists as running on the GPU?

You can then kill them with sudo kill -9 <pid>. More details in this Quora answer: https://www.quora.com/How-do-I-kill-all-the-computer-processes-shown-in-nvidia-smi
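
A minimal sketch of how finding the PIDs could be scripted, assuming a driver recent enough that nvidia-smi supports --query-compute-apps. The helper name gpu_pids is made up for illustration; it only parses text, so it can be tried without a GPU.

```shell
# gpu_pids (hypothetical helper): print the PIDs found in nvidia-smi's CSV
# output, one per line. It reads from stdin, so it never touches the GPU itself.
gpu_pids() {
    # Each data line looks like "1234, python"; keep only a numeric first field,
    # which also skips a "pid, process_name" header if one is present.
    awk -F',' '$1 ~ /^[ ]*[0-9]+[ ]*$/ { gsub(/ /, "", $1); print $1 }'
}

# Typical use on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader | gpu_pids
#
# Assumption: a process can hold the device files without being listed by
# nvidia-smi; "sudo fuser -v /dev/nvidia*" is another way to look for those.
```

Each PID printed this way can be passed to sudo kill -9 as described above. An empty result means nvidia-smi reports no compute processes.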


(Antoni Batchelli) #3

Sorry, yes, it’s Ubuntu 16.04 LTS.

It happens even when no processes are using the GPU:

tbatchelli@MLrig:~$ nvidia-smi
Wed Mar 28 11:44:09 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25                 Driver Version: 390.25                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  On   | 00000000:23:00.0 Off |                  N/A |
|  0%   30C    P8     9W / 180W |      1MiB /  8116MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
tbatchelli@MLrig:~$ sudo nvidia-smi -r
GPU Reset couldn't run because GPU 00000000:23:00.0 is the primary GPU.
tbatchelli@MLrig:~$

So I wonder if this is because the computer screen is connected to the video card. I could consider buying a much cheaper card just to connect the screen to it…


(Jeremy Howard (Admin)) #4

Your Intel CPU should have an integrated GPU, so you should find there’s another video port on your motherboard that would let you use that.

Using the GPU that’s driving your display to do deep learning makes things much slower in my experience.


(Antoni Batchelli) #5

It’s not Intel :( It’s an AMD Ryzen, which doesn’t have an integrated GPU. But I guess getting another video card should work…


(Harsha Vardhan) #6

Did it work after adding another video card? I faced the same problem, so I bought a cheap video card and plugged the monitor into that, freeing up my main GPU. Even after this, sudo nvidia-smi -r gives the same error (GPU Reset couldn’t run because GPU is the primary GPU).


(Shiv Gowda) #7

I usually reload the kernel module when I get such errors (for me it usually happens after waking the Linux system from sleep/suspend). The command below only works if no processes are using the GPU, so kill any such processes before invoking it.

alias gpureload="sudo rmmod nvidia_uvm ; sudo modprobe nvidia_uvm"
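
The same idea as a small function, as a sketch only: it refuses to reload while any PIDs are passed in, and otherwise runs the two commands from the alias. The name gpu_reload and the DRY_RUN variable are assumptions added for illustration, not part of any NVIDIA tool.

```shell
# gpu_reload (hypothetical helper): reload nvidia_uvm unless the GPU is busy.
#   $1       whitespace-separated PIDs still using the GPU (may be empty)
#   DRY_RUN  set to 1 to print the commands instead of running them
gpu_reload() {
    pids="$1"
    if [ -n "$pids" ]; then
        echo "GPU still in use by: $pids" >&2
        return 1
    fi
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    # Same two commands as the alias; modprobe runs even if rmmod fails,
    # for example when the module was not loaded to begin with.
    $run sudo rmmod nvidia_uvm
    $run sudo modprobe nvidia_uvm
}

# Dry run, nothing is changed:
#   DRY_RUN=1; gpu_reload ""
```

Passing the list of PIDs explicitly keeps the check separate from the reload, so the function can be tried in dry-run mode on a machine without a GPU.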