Nvidia driver install fails on GCP

gshiba · February 23, 2020, 8:24am

I’m following the instructions here: https://course.fast.ai/start_gcp.html

I can set everything up and get access to http://localhost:8080/notebooks/tutorials/fastai/course-v3/nbs/dl1/lesson1-pets.ipynb.

Problem: the Nvidia driver fails to install, so I can’t use the GPU.

When I ssh into the instance, the following prompt shows up:

This VM requires Nvidia drivers to function correctly.   Installation takes ~1 minute.
Would you like to install the Nvidia driver? [y/n]

Upon hitting y, this is the error:

Would you like to install the Nvidia driver? [y/n] y
Installing Nvidia driver.
Downloading driver from GCS location gs://nvidia-drivers-us-public/tesla/418.87.01/NVIDIA-Linux-x86_64-418.87.01.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 418.87.01..............

WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.


WARNING: nvidia-installer was forced to guess the X library path '/usr/lib' and X module path '/usr/lib/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install the
         `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.


WARNING: Unable to find a suitable destination to install 32-bit compatibility libraries. Your system may not be set up for 32-bit compatibility. 32-bit compatibility files will not be installed; if you wish to install them, re-run the
         installation and set a valid directory with the --compat32-libdir option.


ERROR: Error while parsing line 680 of '/var/lib/nvidia/log'.


ERROR: Uninstallation failed.


ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Nvidia driver installed.

Oddly enough, the last message is “Nvidia driver installed.”, but it clearly wasn’t, since I get the same prompt every time I open a new ssh connection.
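
Since the installer's final message can't be trusted here, one way to check whether the driver is actually usable is to see if `nvidia-smi` runs successfully. This is only a minimal sketch, assuming Python 3 is available on the instance; the helper name `nvidia_driver_working` is made up for illustration.

```python
import shutil
import subprocess


def nvidia_driver_working(timeout: float = 30.0) -> bool:
    """Return True if nvidia-smi is on PATH and exits with status 0.

    Hypothetical helper: a zero exit status from nvidia-smi is used here
    as a rough proxy for "the driver is installed and loaded".
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # nvidia-smi ships with the driver, so it is missing if the install failed early.
        return False
    try:
        result = subprocess.run(
            [exe], capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("driver OK" if nvidia_driver_working() else "driver not working")
```

If this reports that the driver is not working right after the installer printed “Nvidia driver installed.”, the install did fail.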

The error log is:

jupyter@my-fastai-instance:~$ cat /var/log/nvidia-installer.log
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Sun Feb 23 08:20:18 2020
installer version: 418.87.01

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

nvidia-installer command line:
    ./nvidia-installer
    --dkms
    -a
    -s
    --no-drm

Using built-in stream user interface
-> Detected 8 CPUs online; setting concurrency level to 8.
-> Installing NVIDIA driver version 418.87.01.
-> There appears to already be a driver installed on your system (version: 418.87.01).  As part of installing this driver (version: 418.87.01), the existing driver will be uninstalled.  Are you sure you want to continue? (Answer: Continue installation)
WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.
-> Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. (Answer: Yes)
WARNING: nvidia-installer was forced to guess the X library path '/usr/lib' and X module path '/usr/lib/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.
WARNING: Unable to find a suitable destination to install 32-bit compatibility libraries. Your system may not be set up for 32-bit compatibility. 32-bit compatibility files will not be installed; if you wish to install them, re-run the installation and set a valid directory with the --compat32-libdir option.
-> Will install GLVND GLX client libraries.
-> Will install GLVND EGL client libraries.
-> Skipping GLX non-GLVND file: "libGL.so.418.87.01"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.418.87.01"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
-> Parsing log file:
-> error.
ERROR: Error while parsing line 680 of '/var/lib/nvidia/log'.
ERROR: Uninstallation failed.
Looking for install checker script at ./libglvnd_install_checker/check-libglvnd-install.sh
   executing: '/bin/sh ./libglvnd_install_checker/check-libglvnd-install.sh'...
   Checking for libglvnd installation.
   Checking libGLdispatch...
   Checking libGLdispatch dispatch table
   Checking call through libGLdispatch
   All OK
   libGLdispatch is OK
   Checking for libGLX
   libGLX is OK
   Checking for libEGL
   Can't load libEGL from libEGL.so.1: libEGL.so.1: cannot open shared object file: No such file or directory
   Checking entrypoint library libOpenGL.so.0
   Checking call through libGLdispatch
   Checking call through library libOpenGL.so.0
   All OK
   Entrypoint library libOpenGL.so.0 is OK
   Checking entrypoint library libGL.so.1
   Checking call through libGLdispatch
   Checking call through library libGL.so.1
   dlopen("libGL.so.1") failed: libGL.so.1: cannot open shared object file: No such file or directory
   
   Found libglvnd libraries: libOpenGL.so.0 libGLX.so.0 libGLdispatch.so.0 
   Missing libglvnd libraries: libGL.so.1 libEGL.so.1 
   
-> An incomplete installation of libglvnd was found. Do you want to install a full copy of libglvnd? This will overwrite any existing libglvnd libraries. (Answer: Abort installation.)
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

gshiba · February 23, 2020, 6:41pm

(Answering my own question…)

Short answer: choosing a Tesla T4 fixed the problem. The prompt to install the driver did not appear.

Long answer: in the GCP console, click “Create Similar” to create a new instance and choose Tesla T4 as the GPU, instead of the --accelerator="type=nvidia-tesla-p100,count=1" given in the instructions. (I guess the Tesla P100 is no longer supported?)
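
For anyone creating the instance from the command line rather than the console, the equivalent change should be to swap the accelerator type in that flag. The sketch below only assembles and prints the command and does not run it. The function name and the zone `us-west1-b` are placeholders, and the remaining flags from the course instructions are left out on purpose, so they would need to be appended unchanged.

```python
import shlex


def build_create_command(
    instance_name: str,
    zone: str,
    gpu_type: str = "nvidia-tesla-t4",
    gpu_count: int = 1,
) -> str:
    """Build a gcloud command string with the given accelerator type.

    Hypothetical helper: only the flags relevant to the GPU choice are
    included; the other flags from the setup instructions are omitted.
    """
    args = [
        "gcloud", "compute", "instances", "create", instance_name,
        f"--zone={zone}",
        f"--accelerator=type={gpu_type},count={gpu_count}",
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # Placeholder zone; pick one where the T4 is actually offered.
    print(build_create_command("my-fastai-instance", "us-west1-b"))
```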