torch.cuda.is_available() returns False


(Siddharth) #1

I am running my notebook on an AWS p2.xlarge instance, using the fastai AMI that Jeremy showed in this course.

When I run torch.cuda.is_available() it returns False, but torch.backends.cudnn.enabled returns True.

To check my installation I ran nvidia-smi, which gives the following error:

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

However, nvcc --version gives the following output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

Should I update my drivers? Can anyone point me in the right direction on how to fix this on an AWS p2.xlarge instance?
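
For anyone landing here with the same symptoms, it can help to check each layer of the GPU stack separately. nvcc ships with the CUDA toolkit and runs without the kernel driver, and torch.backends.cudnn.enabled is, as far as I know, only a configuration flag, so neither proves that the GPU is reachable; nvidia-smi and torch.cuda.is_available() both need a working driver. Below is a minimal Python sketch, not an official tool: the helper names run_check, torch_check and diagnose are made up for illustration, and it assumes nvidia-smi and nvcc would be on PATH if they are installed.

```python
import shutil
import subprocess


def run_check(command):
    """Run a command; return (ok, first line of output or an error note)."""
    if shutil.which(command[0]) is None:
        return False, f"{command[0]} not found on PATH"
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, str(exc)
    output = (result.stdout or result.stderr).strip()
    first_line = output.splitlines()[0] if output else ""
    return result.returncode == 0, first_line


def torch_check():
    """Report whether PyTorch can see a CUDA device, if torch is importable."""
    try:
        import torch
    except (ImportError, OSError):
        return False, "torch could not be imported"
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    return torch.cuda.is_available(), f"torch built for CUDA {torch.version.cuda}"


def diagnose():
    """Collect one (ok, detail) pair per layer of the GPU stack."""
    return {
        "driver (nvidia-smi)": run_check(["nvidia-smi"]),
        "toolkit (nvcc)": run_check(["nvcc", "--version"]),
        "pytorch": torch_check(),
    }


if __name__ == "__main__":
    for layer, (ok, detail) in diagnose().items():
        print(f"{'OK  ' if ok else 'FAIL'} {layer}: {detail}")
```

With the symptoms described above, one would expect the driver and pytorch rows to fail while the toolkit row passes, which points at the driver rather than at PyTorch itself.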


(Martin) #2

Have you tried a good old-fashioned restart?


(Siddharth) #3

Yes, I have tried rebooting my p2.xlarge instance but no luck.


(Martin) #4

What about this: http://www.linuxandubuntu.com/home/how-to-install-latest-nvidia-drivers-in-linux


(Siddharth) #5

Thanks for the link that you provided.

Initially, I tried the following set of commands to reinstall the NVIDIA drivers.

  1. Download the driver - curl -O http://us.download.nvidia.com/tesla/384.125/nvidia-diag-driver-local-repo-ubuntu1604-384.125_1.0-1_amd64.deb
  2. sudo apt-get purge nvidia* - To remove current installations
  3. sudo dpkg -i nvidia-diag-driver-local-repo-ubuntu1604-384.125_1.0-1_amd64.deb
  4. sudo apt-get update
  5. sudo apt-get install cuda-drivers

However, after rebooting the instance the same problem still exists.

Then I followed the instructions at http://www.linuxandubuntu.com/home/how-to-install-latest-nvidia-drivers-in-linux. However, after following those steps and rebooting the instance, lsmod | grep nvidia gives no output, which according to the post means the installation has failed. As before, nvidia-smi is not working.


(Martin) #6

I wonder why the installation failed. Did the apt-get install cuda-drivers show errors?

I googled what lsmod does:
“It shows which loadable kernel modules are currently loaded.”

Maybe it did install but isn’t loaded yet. Try running apt-get install cuda-drivers again; if it was installed successfully, it should say that the package is already installed.

I had a somewhat similar problem when I installed the driver on my Ubuntu system. I think the problem might be that the driver is not in use.
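
To make the "installed but not loaded" distinction concrete: lsmod gets its list from /proc/modules, so the same check can be scripted. The following is a rough Python sketch under that assumption; the function names nvidia_modules and loaded_nvidia_modules are hypothetical, and unlike lsmod | grep nvidia it only matches module names that start with "nvidia" rather than any line containing the word.

```python
from pathlib import Path


def nvidia_modules(proc_modules_text):
    """Return names of loaded kernel modules whose name starts with 'nvidia'."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and fields[0].startswith("nvidia"):
            names.append(fields[0])
    return names


def loaded_nvidia_modules(path="/proc/modules"):
    """Read the kernel's module list (Linux only); empty list if unreadable."""
    try:
        return nvidia_modules(Path(path).read_text())
    except OSError:
        return []


if __name__ == "__main__":
    modules = loaded_nvidia_modules()
    print(modules if modules else "no nvidia kernel module is loaded")
```

An empty result after a successful package install would mean the driver exists on disk but the running kernel has not loaded it, which is consistent with the nvidia-smi error in the first post.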


(Siddharth) #7

I tried reinstalling everything following http://www.linuxandubuntu.com/home/how-to-install-latest-nvidia-drivers-in-linux, this time watching the logs to see what exactly is going on. As before, the installation didn’t work. I also noticed a few kernel-related error messages during the install, but I can’t really figure out what they mean. I am attaching the logs…

sudo apt-get install nvidia-390

Reading package lists… Done
Building dependency tree
Reading state information… Done
The following packages were automatically installed and are no longer required:
libllvm4.0 linux-aws-headers-4.4.0-1022 linux-headers-4.4.0-1022-aws linux-image-4.4.0-1022-aws linux-image-4.4.0-1038-aws
Use ‘sudo apt autoremove’ to remove them.
The following additional packages will be installed:
libcuda1-390 nvidia-opencl-icd-390 nvidia-prime nvidia-settings
The following NEW packages will be installed:
libcuda1-390 nvidia-390 nvidia-opencl-icd-390 nvidia-prime nvidia-settings
0 upgraded, 5 newly installed, 0 to remove and 6 not upgraded.
Need to get 0 B/81.5 MB of archives.
After this operation, 362 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Selecting previously unselected package nvidia-390.
(Reading database … 261447 files and directories currently installed.)
Preparing to unpack …/nvidia-390_390.59-0ubuntu0~gpu16.04.1_amd64.deb …
Unpacking nvidia-390 (390.59-0ubuntu0~gpu16.04.1) …
Selecting previously unselected package libcuda1-390.
Preparing to unpack …/libcuda1-390_390.59-0ubuntu0~gpu16.04.1_amd64.deb …
Unpacking libcuda1-390 (390.59-0ubuntu0~gpu16.04.1) …
Selecting previously unselected package nvidia-opencl-icd-390.
Preparing to unpack …/nvidia-opencl-icd-390_390.59-0ubuntu0~gpu16.04.1_amd64.deb …
Unpacking nvidia-opencl-icd-390 (390.59-0ubuntu0~gpu16.04.1) …
Selecting previously unselected package nvidia-prime.
Preparing to unpack …/nvidia-prime_0.8.2_amd64.deb …
Unpacking nvidia-prime (0.8.2) …
Selecting previously unselected package nvidia-settings.
Preparing to unpack …/nvidia-settings_396.24-0ubuntu0~gpu16.04.1_amd64.deb …
Unpacking nvidia-settings (396.24-0ubuntu0~gpu16.04.1) …
Processing triggers for libc-bin (2.23-0ubuntu10) …
Processing triggers for ureadahead (0.100.0-19) …
Processing triggers for man-db (2.7.5-1) …
Processing triggers for gnome-menus (3.13.3-6ubuntu3.1) …
Processing triggers for desktop-file-utils (0.22-1ubuntu5.1) …
Processing triggers for mime-support (3.59ubuntu1) …
Setting up nvidia-390 (390.59-0ubuntu0~gpu16.04.1) …
update-alternatives: using /usr/lib/nvidia-390/ld.so.conf to provide /etc/ld.so.conf.d/x86_64-linux-gnu_GL.conf (x86_64-linux-gnu_gl_conf) in auto mode
update-alternatives: using /usr/lib/nvidia-390/ld.so.conf to provide /etc/ld.so.conf.d/x86_64-linux-gnu_EGL.conf (x86_64-linux-gnu_egl_conf) in auto mode
update-alternatives: using /usr/lib/nvidia-390/alt_ld.so.conf to provide /etc/ld.so.conf.d/i386-linux-gnu_GL.conf (i386-linux-gnu_gl_conf) in auto mode
update-alternatives: using /usr/lib/nvidia-390/alt_ld.so.conf to provide /etc/ld.so.conf.d/i386-linux-gnu_EGL.conf (i386-linux-gnu_egl_conf) in auto mode
update-alternatives: using /usr/share/nvidia-390/glamor.conf to provide /usr/share/X11/xorg.conf.d/glamoregl.conf (glamor_conf) in auto mode
update-initramfs: deferring update (trigger activated)
update-initramfs: Generating /boot/initrd.img-4.4.0-59-generic
WARNING: missing /lib/modules/4.4.0-59-generic
Ensure all necessary drivers are built into the linux image!
depmod: ERROR: could not open directory /lib/modules/4.4.0-59-generic: No such file or directory
depmod: FATAL: could not search modules: No such file or directory
W: mdadm: /etc/mdadm/mdadm.conf defines no arrays.
depmod: WARNING: could not open /var/tmp/mkinitramfs_rG1dM5/lib/modules/4.4.0-59-generic/modules.order: No such file or directory
depmod: WARNING: could not open /var/tmp/mkinitramfs_rG1dM5/lib/modules/4.4.0-59-generic/modules.builtin: No such file or directory
Adding system user `nvidia-persistenced' (UID 116) …
Adding new group `nvidia-persistenced' (GID 122) …
Adding new user `nvidia-persistenced' (UID 116) with group `nvidia-persistenced' …
Not creating home directory `/'.
Loading new nvidia-390-390.59 DKMS files…
First Installation: checking all kernels…
It is likely that 4.4.0-59-generic belongs to a chroot’s host
Building only for 4.4.0-1057-aws
Building for architecture x86_64
Building initial module for 4.4.0-1057-aws
Done.

nvidia_390:
Running module version sanity check.

  • Original module
    • No original module exists within this kernel
  • Installation
    • Installing to /lib/modules/4.4.0-1057-aws/updates/dkms/

nvidia_390_modeset.ko:
Running module version sanity check.

  • Original module
    • No original module exists within this kernel
  • Installation
    • Installing to /lib/modules/4.4.0-1057-aws/updates/dkms/

nvidia_390_drm.ko:
Running module version sanity check.

  • Original module
    • No original module exists within this kernel
  • Installation
    • Installing to /lib/modules/4.4.0-1057-aws/updates/dkms/

nvidia_390_uvm.ko:
Running module version sanity check.

  • Original module
    • No original module exists within this kernel
  • Installation
    • Installing to /lib/modules/4.4.0-1057-aws/updates/dkms/

depmod…

DKMS: install completed.
Setting up libcuda1-390 (390.59-0ubuntu0~gpu16.04.1) …
Setting up nvidia-opencl-icd-390 (390.59-0ubuntu0~gpu16.04.1) …
Setting up nvidia-prime (0.8.2) …
Setting up nvidia-settings (396.24-0ubuntu0~gpu16.04.1) …
Processing triggers for libc-bin (2.23-0ubuntu10) …
Processing triggers for initramfs-tools (0.122ubuntu8.11) …
update-initramfs: Generating /boot/initrd.img-4.4.0-1057-aws
W: mdadm: /etc/mdadm/mdadm.conf defines no arrays.
Processing triggers for ureadahead (0.100.0-19) …
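
One thing that stands out in the log above is that two kernel versions appear: DKMS builds the module only for 4.4.0-1057-aws, while depmod complains that /lib/modules/4.4.0-59-generic is missing. Whether that mismatch is the actual cause is not established in this thread, but it is cheap to check that the kernel the instance is running is the one the module was built for. A small Python sketch of that check follows; the function names are invented for illustration and the regular expressions assume log lines worded exactly like the ones above.

```python
import platform
import re


def dkms_build_targets(log_text):
    """Extract kernel versions named in DKMS 'Building only for' lines."""
    return re.findall(r"^Building only for (\S+)", log_text, flags=re.MULTILINE)


def missing_module_dirs(log_text):
    """Extract kernel versions whose /lib/modules directory depmod could not open."""
    pattern = r"could not open directory /lib/modules/([^:\s]+)"
    return sorted(set(re.findall(pattern, log_text)))


def driver_matches_running_kernel(log_text, running_kernel=None):
    """True if DKMS built the module for the kernel that is currently running."""
    if running_kernel is None:
        running_kernel = platform.release()  # same value as `uname -r` on Linux
    return running_kernel in dkms_build_targets(log_text)
```

Saving the apt-get output to a file and passing its contents to driver_matches_running_kernel gives False whenever the booted kernel differs from the DKMS target, in which case the freshly built module cannot be loaded by that kernel.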


#8

I had the same problem when I built my own server. I followed these instructions (kind of; I mostly used them to see the required dependencies). Using sudo apt-get install nvidia-390 never worked for me, and I wound up reinstalling Ubuntu from scratch about 4 times before figuring out the correct sequence. In my experience it’s better to just install the CUDA toolkit, which takes care of everything. Here’s where to find what distro you need. If I recall correctly, I did the following (assuming Ubuntu 16.04 and an x86_64 architecture):

wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.2.88-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_9.2.88-1_amd64.deb
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda

Hopefully this works. I don’t know what kind of GPUs a p2 instance uses, but this is what worked for my 1080 Ti.


(Siddharth) #9

I tried this method as well, and I am really surprised that none of it worked.

I think I should try Docker this time. I hope it works; otherwise I will have to move to Crestle or Paperspace for now.


(Siddharth) #10

I found a workaround for this issue: switching to the Amazon Deep Learning AMI instead of the fastai AMI. torch.cuda.is_available() now returns True. :smiley:

The only downside is that I had to download all the fastai content from GitHub, but that is okay for me.


#11

I am using Paperspace and facing the original issue: torch.cuda.is_available() returns False. When I run nvidia-smi on the machine, I get “No Devices Were Found”.

Anyone else facing this issue?


(Siddharth) #12

Are you using the fastai template on Paperspace?


(David Leslie) #13

I’m having this problem too. I’m using the GPU+ machine with the fast.ai template. Did you manage to fix this?


(David Leslie) #14

OK, I ran apt-get install cuda-drivers from the command line and that fixed the problem.


#15

Yes, I’m using the fastai template.


#16

Hey, yes, this issue is fixed. I contacted their customer support and restarted the VM a couple of times (as suggested by the support staff), and it worked like a charm. :slight_smile:


(Pawel) #17

Actually, Paperspace has been having some serious technical problems. For the last couple of days I couldn’t run a stable VM for more than 2 hours: either the GPU breaks, or they run out of space, or I cannot start it at all. I’m seriously considering moving to Amazon or setting up something of my own…


(Siddharth) #18

AWS is definitely a very good option to consider. With spot instances, AWS is the cheapest option available as of now. I further reduce the cost by switching to t2 instances when I am not training the model. To make the storage persistent, I swap my root volume.

You can use the following fastai wiki link for dealing with spot instances:

http://wiki.fast.ai/index.php/AWS_Spot_instances


(Peter Rinaudo) #19

Hi, I am having a lot of trouble with the ‘torch.cuda.is_available() returns False’ problem and would appreciate some help. I have a Windows 10 PC, started a Paperspace account and loaded all of the files as per the instructions. Everything seems to have worked fine, except that the GPU doesn’t run.
Can you please tell me where you run ‘apt-get install cuda-drivers’?
I tried running it from the Paperspace machine console but got this message:

(fastai) paperspace@psphyed93:~$ pwd
/home/paperspace
(fastai) paperspace@psphyed93:~$ apt-get install cuda-drivers
E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?

Any help would be appreciated, thanks.
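
A note on the error above: "are you root?" indicates that apt-get was started as the unprivileged paperspace user, and package operations need root, normally obtained by prefixing the command with sudo. If the command is being launched from a script, the prefix can be decided programmatically. This is only an illustrative Python sketch: with_sudo_if_needed is a hypothetical helper, the euid value 1000 stands in for an ordinary user, and the sketch only builds the command line without running it.

```python
import os


def with_sudo_if_needed(command, euid=None):
    """Prefix a command with sudo unless the effective user is already root."""
    if euid is None:
        euid = os.geteuid()  # POSIX only; 0 means root
    return list(command) if euid == 0 else ["sudo", *command]


# Example with an ordinary (non-root) user id:
print(" ".join(with_sudo_if_needed(["apt-get", "install", "cuda-drivers"], euid=1000)))
# prints: sudo apt-get install cuda-drivers
```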


(Peter Rinaudo) #20

Problem - ‘torch.cuda.is_available()’ returns ‘False’
I contacted Paperspace support and this was their reply -

It is a known issue that we’ve fixed on all future machines that are created, but you can fix it on your existing VM by running: sudo apt update && sudo apt upgrade
When you are prompted to update minions/salt, make sure you choose N. That should resolve this for you!

All working well now.