Notes on using NVIDIA A100 (40GB)


I’ve just tried to train a model on Azure’s ND96asr_v4 instance size with 8× A100 (40GB) GPUs, and wanted to share some notes and watch-its.

For some reason that I do not fully understand, a VM + OS setup that works fine on the 8×V100 instance size (ND40rs_v2) loses CUDA support in torch once the VM is resized to the 8×A100 instance size.

import torch
/anaconda/envs/a100/lib/python3.8/site-packages/torch/cuda/ UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:115.)
  return torch._C._cuda_getDeviceCount() > 0

As with most things CUDA, the error is cryptic and finding a solution is hit-and-miss. This post turned out to be very useful: it mentions the need for NVIDIA Fabric Manager and DCGM, linking to here.

My VM OS is Ubuntu 18.04, so I just had to do:

sudo apt-get install -y datacenter-gpu-manager
sudo systemctl --now enable nvidia-dcgm
sudo systemctl status nvidia-dcgm

sudo apt-get install cuda-drivers-fabricmanager
sudo systemctl start nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager

Then came the next issue in the software stack: torch 1.9.0 not working with the A100 GPUs:

/anaconda/envs/a100/lib/python3.8/site-packages/torch/cuda/ UserWarning:
NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-SXM4-40GB GPU with PyTorch, please check the instructions at

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
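The warning boils down to a set-membership check: the wheel ships compiled kernels for a fixed list of compute capabilities, and the A100's sm_80 is not in that list. A minimal sketch of that check (the helper and arch list are illustrative, not PyTorch internals):

```python
# Sketch: does a torch wheel's compiled arch list cover a GPU's
# compute capability? (helper is illustrative, not a torch API)

def supports_gpu(arch_list, capability):
    """capability is a (major, minor) tuple, e.g. (8, 0) for the A100."""
    return "sm_%d%d" % capability in arch_list

# Arch list taken from the warning above:
stock_archs = ["sm_37", "sm_50", "sm_60", "sm_70"]

print(supports_gpu(stock_archs, (7, 0)))  # V100 (sm_70) -> True
print(supports_gpu(stock_archs, (8, 0)))  # A100 (sm_80) -> False
```

This is why the same environment was fine on the V100 instance: sm_70 is covered, sm_80 is not.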

And so I created a new conda env with torch==1.8.1+cu111 via a pip install, according to the PyTorch install page:

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 -f
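The +cu111 local-version suffix is how these wheels encode the CUDA toolkit they were built against — 11.1 here, which is what brings sm_80 support. A tiny sketch of reading that suffix (the helper is mine, not part of pip or torch):

```python
# Sketch: pull the CUDA toolkit version out of a torch wheel version
# string like "1.8.1+cu111" (helper is illustrative, not a torch API).

def cuda_version(version_string):
    """Return e.g. "11.1" from "1.8.1+cu111", or None for a CPU-only build."""
    if "+cu" not in version_string:
        return None
    digits = version_string.split("+cu", 1)[1]
    return "%s.%s" % (digits[:-1], digits[-1])  # "111" -> "11.1"

print(cuda_version("1.8.1+cu111"))  # 11.1
print(cuda_version("1.9.0"))        # None
```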

After that, everything worked: torch.cuda.is_available() returned True, and I was able to run distributed multi-GPU training using the amazing fastai library support for distributed and parallel training, with minimal fuss on the software stack.

Hopefully this helps people who might want to use A100 GPUs.