Distributed training error

I am working on an Azure VM (an Ubuntu 18.04 DSVM) with fastai installed via conda, and I haven't been able to get distributed training working. Here is a reproducible example based on the beginner vision tutorial: the following works when run in a notebook, but fails when run in distributed mode.

Before running this file, I downloaded the data:

path = untar_data(URLs.PASCAL_2007)

The runfile is called “distrib_pascal_test.py,” and I call it with:

python -m fastai.launch trident_dev/dev_nbs/distrib_pascal_test.py

Here’s the file:

import pandas as pd
from fastai.vision.all import *
from fastai.distributed import *
from fastai.callback import *
from fastai.test_utils import *

path = Path('/home/egdod/.fastai/data/pascal_2007')
df = pd.read_csv(path/'train.csv')
dls = ImageDataLoaders.from_df(df, path, folder='train', valid_col='is_valid', label_delim=' ', item_tfms=Resize(460), batch_tfms=aug_transforms(size=224))
learn = cnn_learner(dls, resnet50, metrics=partial(accuracy_multi, thresh=0.5))
with learn.distrib_ctx(): 
    learn.fine_tune(2, 3e-2)
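
As far as I can tell, fastai.launch spawns one copy of this script per GPU and hands each process its rendezvous settings through environment variables, so a quick sanity check is to print them near the top of the file (the four names below are the ones init_method='env://' reads; anything else fastai.launch may set is an assumption on my part):

# Debugging aid, not part of the tutorial: show the env:// rendezvous settings each spawned process sees.
import os
print({k: os.environ.get(k) for k in ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")})

If any of these are missing, or WORLD_SIZE doesn't match the number of processes fastai.launch starts, init_process_group(backend='nccl', init_method='env://') won't be able to rendezvous.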

Over the course of a few days, I've tried variations with different data and models, and have gotten a variety of errors. I once got it working briefly by changing the context manager to with learn.distrib_ctx(sync_bn=False): (that variant is reproduced after the traceback below), but it stopped working again for no obvious reason. This is the latest error:

Traceback (most recent call last):
  File "trident_dev/dev_nbs/distrib_pascal_test.py", line 11, in <module>
    with learn.distrib_ctx():
  File "/anaconda/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/anaconda/lib/python3.7/site-packages/fastai/distributed.py", line 167, in distrib_ctx
    setup_distrib(cuda_id)
  File "/anaconda/lib/python3.7/site-packages/fastai/distributed.py", line 61, in setup_distrib
    if num_distrib() > 0: torch.distributed.init_process_group(backend='nccl', init_method='env://')
  File "/anaconda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/anaconda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
  File "trident_dev/dev_nbs/distrib_pascal_test.py", line 11, in <module>
    with learn.distrib_ctx():
  File "/anaconda/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/anaconda/lib/python3.7/site-packages/fastai/distributed.py", line 167, in distrib_ctx
    setup_distrib(cuda_id)
  File "/anaconda/lib/python3.7/site-packages/fastai/distributed.py", line 61, in setup_distrib
    if num_distrib() > 0: torch.distributed.init_process_group(backend='nccl', init_method='env://')
  File "/anaconda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/anaconda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370156314/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
(The same NCCL traceback is printed twice more, presumably once per remaining worker process.)
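
For reference, the sync_bn=False variant mentioned above (the one that briefly worked) only changes the context-manager line; my understanding is that it skips converting the model's BatchNorm layers to SyncBatchNorm:

with learn.distrib_ctx(sync_bn=False):  # don't convert BatchNorm layers to SyncBatchNorm
    learn.fine_tune(2, 3e-2)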

I wasn't sure whether to add this to the earlier threads on the same issue, since they were much older. I'd love to know if anyone else can replicate the problem.
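
In case it helps anyone trying to reproduce this, the two checks the errors above suggest would look something like the sketch below. Both rest on assumptions on my part: that fastai.launch uses the default MASTER_PORT of 29500 for the TCPStore rendezvous, and that NCCL_DEBUG=INFO will make NCCL say what the "unhandled system error" actually is.

# Is something already listening on the rendezvous port? (assumes the default MASTER_PORT of 29500)
import socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    in_use = s.connect_ex(("127.0.0.1", 29500)) == 0
print("port 29500 already in use:", in_use)

# Ask NCCL for verbose logging; this has to be set before the process group is initialised,
# e.g. at the very top of distrib_pascal_test.py (or exported in the shell before fastai.launch).
import os
os.environ["NCCL_DEBUG"] = "INFO"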
