I’ve just tried to train a model on Azure’s ND96asr_v4 instance size with 8× A100 (40GB) GPUs, and wanted to share some notes and watch-its.
For some reason I don't fully understand, a VM + OS setup that works fine on the 8×V100 instance size (ND40rs_v2) loses CUDA support in torch once it's resized to the 8×A100 instance size:
```python
>>> import torch
>>> torch.cuda.is_available()
/anaconda/envs/a100/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? … (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:115.)
  return torch._C._cuda_getDeviceCount() > 0
```
As with most things CUDA, the error is cryptic, and finding a solution is hit-and-miss. This post turned out to be very useful: it mentions the need for the NVIDIA fabric manager and DCGM, linking to here.
My VM OS is Ubuntu 18.04, so I just had to do:
```bash
# DCGM
sudo apt-get install -y datacenter-gpu-manager
sudo systemctl --now enable nvidia-dcgm
sudo systemctl status nvidia-dcgm

# Fabric manager
sudo apt-get install cuda-drivers-fabricmanager
sudo systemctl start nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager
```
Then came the next issue in the software stack: torch 1.9.0 does not work with the A100 GPUs:
```
/anaconda/envs/a100/lib/python3.8/site-packages/torch/cuda/__init__.py:106: UserWarning:
    NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
    The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
    If you want to use the NVIDIA A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
```
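The warning boils down to a mismatch between the GPU's compute capability and the architectures the installed PyTorch wheel was compiled for. A minimal sketch of that check (the `sm_XY` strings are taken from the warning above; on a real install, `torch.cuda.get_arch_list()` reports the build's architectures and `torch.cuda.get_device_capability()` the GPU's):

```python
def is_supported(capability, build_archs):
    """Check whether a GPU compute capability (major, minor) is among the
    architectures a PyTorch build was compiled for, e.g. ['sm_37', 'sm_50', ...]."""
    return f"sm_{capability[0]}{capability[1]}" in build_archs

# Architectures listed in the torch 1.9.0 warning above:
build_archs = ["sm_37", "sm_50", "sm_60", "sm_70"]

print(is_supported((7, 0), build_archs))  # V100 is sm_70 -> True
print(is_supported((8, 0), build_archs))  # A100 is sm_80 -> False
```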
And so I created a new conda env with torch==1.8.1+cu111 installed via pip, following the PyTorch install page:
```bash
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 \
    -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
```
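The `+cu111` suffix is what matters here: it is a PEP 440 local version tag naming the CUDA toolkit the wheel was built against (11.1, which includes sm_80 support). A purely illustrative helper for pulling that tag out of a version string like `torch.__version__`:

```python
def cuda_tag(version):
    """Extract the CUDA local-version tag from a PyTorch version string,
    e.g. '1.8.1+cu111' -> 'cu111'; returns None for an untagged build."""
    _, sep, local = version.partition("+")
    return local if sep else None

print(cuda_tag("1.8.1+cu111"))  # cu111
print(cuda_tag("1.9.0"))        # None
```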
After that, everything worked: `torch.cuda.is_available()` returned `True`, and I was able to run distributed multi-GPU training with the amazing fastai library's support for distributed and parallel training, with minimal fuss on the software stack.
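For completeness, here is a minimal sketch of what that fastai setup can look like. This is an illustration under assumptions, not my exact training script: the dataset (Oxford-IIIT Pets via `untar_data`) and hyperparameters are placeholders, and the API shown is fastai 2.x (`distrib_ctx` plus the `fastai.launch` module, which spawns one process per GPU).

```python
# train.py -- launch across all GPUs with: python -m fastai.launch train.py
# Sketch of fastai multi-GPU training (fastai 2.x; dataset is illustrative).
from fastai.vision.all import *
from fastai.distributed import *

path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_name_re(
    path, get_image_files(path / "images"),
    pat=r"(.+)_\d+.jpg", item_tfms=Resize(224), bs=64)
learn = cnn_learner(dls, resnet34, metrics=error_rate)

# distrib_ctx wraps the model in DistributedDataParallel for the duration
# of the block, one process per GPU, then restores the plain model.
with learn.distrib_ctx():
    learn.fine_tune(2)
```

Note that `bs=64` is the per-process batch size, so the effective batch size on the 8×A100 instance is 8× that.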
Hopefully this helps people who might want to use A100 GPUs.