NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

After an update to my AWS Deep Learning AMI (Ubuntu) Version 21.2 instance, I noticed that running an epoch that would normally take 30 seconds was taking 10+ minutes.

When I logged in to the instance, I noticed the message “restart required” on my console. This usually means an automatic security update was applied, and one of the updated packages may have been a kernel update that requires a reboot.
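If you want to see what actually triggered the prompt, these standard Ubuntu locations (nothing specific to the Deep Learning AMI, so paths may differ on other distros) list the packages that requested the restart and the unattended-upgrades activity:

$cat /var/run/reboot-required.pkgs
$less /var/log/unattended-upgrades/unattended-upgrades.log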

I did the reboot, but it seemed like I was getting the performance of a CPU instead of a GPU.
Running the command “nvidia-smi” to check the status of the GPU gave the following message:

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Somehow the NVIDIA driver was removed or corrupted by the previous update.
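You can confirm this before reinstalling. The checks below are generic (exact package names depend on how the driver was originally installed), but they were enough in my case to see that the kernel module was no longer loaded:

$lsmod | grep nvidia
$dkms status
$dpkg -l | grep -i nvidia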

I fixed this by installing the latest compatible version of the NVIDIA driver.
In my case, I installed version 410.104:

$cat driver_install.sh
#!/bin/bash
set -x
version=$1
# download the Tesla driver installer for the requested version
wget http://us.download.nvidia.com/tesla/${version}/NVIDIA-Linux-x86_64-${version}.run
# run the installer unattended, registering the module with DKMS and disabling nouveau
sudo sh ./NVIDIA-Linux-x86_64-${version}.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd

$sudo ./driver_install.sh 410.104
$sudo modprobe nvidia
$nvidia-smi
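To double-check that the new module is actually loaded and that the reported version matches what was installed, these are handy as well (the --query-gpu flags are standard nvidia-smi options, not anything specific to this driver version):

$cat /proc/driver/nvidia/version
$nvidia-smi --query-gpu=name,driver_version --format=csv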

nvidia-smi works fine now.

fastai performance back to normal.
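For a quick sanity check from the training side, you can also verify that PyTorch (which fastai uses as its backend) sees the GPU. The environment name below is just the default one on my Deep Learning AMI and may differ on your instance:

$source activate pytorch_p36
$python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"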

Hope this helps someone.


Nice! Could you elaborate on the different options you gave to the installer (except for --silent, as that one looks obvious in a script)?
I never give options to the installer and it just works.