Notes on using NVIDIA A100 (40GB)


I’ve just tried to train a model on Azure’s ND96asr_v4 instance size with 8× A100 (40GB) GPUs, and wanted to share some notes and watch-its.

For some reason that I do not fully understand, a VM + OS setup that works fine on the 8×V100 instance size (ND40rs_v2) loses CUDA support in torch once the VM is resized to the 8×A100 instance size.

import torch
/anaconda/envs/a100/lib/python3.8/site-packages/torch/cuda/ UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:115.)
  return torch._C._cuda_getDeviceCount() > 0

As with most things CUDA, the error is cryptic and finding a solution is hit-and-miss. This post turned out to be very useful: it mentions the need for NVIDIA Fabric Manager and DCGM, linking to here.

My VM OS is Ubuntu 18.04, so I just had to do:

sudo apt-get install -y datacenter-gpu-manager
sudo systemctl --now enable nvidia-dcgm
sudo systemctl status nvidia-dcgm

sudo apt-get install cuda-drivers-fabricmanager
sudo systemctl start nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager

Then came the next issue in the software stack: torch 1.9.0 not working with the A100 GPUs:

/anaconda/envs/a100/lib/python3.8/site-packages/torch/cuda/ UserWarning:
NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-SXM4-40GB GPU with PyTorch, please check the instructions at

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
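The warning boils down to a set-membership check: the wheel ships compiled kernels for a fixed list of compute capabilities, and the A100's sm_80 is not in that list. A minimal sketch of that check (the helper and arch list are illustrative, not PyTorch internals):

```python
# Sketch: does a torch wheel's compiled arch list cover a GPU's
# compute capability? (helper is illustrative, not a torch API)

def supports_gpu(arch_list, capability):
    """capability is a (major, minor) tuple, e.g. (8, 0) for the A100."""
    return "sm_%d%d" % capability in arch_list

# Arch list taken from the warning above:
stock_archs = ["sm_37", "sm_50", "sm_60", "sm_70"]

print(supports_gpu(stock_archs, (7, 0)))  # V100 (sm_70) -> True
print(supports_gpu(stock_archs, (8, 0)))  # A100 (sm_80) -> False
```

This is why the same environment was fine on the V100 instance: sm_70 is covered, sm_80 is not.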

And so I created a new conda env with torch==1.8.1+cu111 via a pip install, according to the PyTorch install page:

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 -f
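The +cu111 local-version suffix is how these wheels encode the CUDA toolkit they were built against — 11.1 here, which is what brings sm_80 support. A tiny sketch of reading that suffix (the helper is mine, not part of pip or torch):

```python
# Sketch: pull the CUDA toolkit version out of a torch wheel version
# string like "1.8.1+cu111" (helper is illustrative, not a torch API).

def cuda_version(version_string):
    """Return e.g. "11.1" from "1.8.1+cu111", or None for a CPU-only build."""
    if "+cu" not in version_string:
        return None
    digits = version_string.split("+cu", 1)[1]
    return "%s.%s" % (digits[:-1], digits[-1])  # "111" -> "11.1"

print(cuda_version("1.8.1+cu111"))  # 11.1
print(cuda_version("1.9.0"))        # None
```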

After that, everything worked: torch.cuda.is_available() returned True, and I was able to run distributed multi-GPU training using the amazing fastai library support for distributed and parallel training, with minimal fuss on the software stack.

Hopefully this helps people who might want to use A100 GPUs.