Distributed segmentation NCCL environment issue

For classification models, distributed training can be achieved by simply calling `fine_tune` within the distributed context manager; there is no longer any need to initialize the NCCL environment explicitly.
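For reference, here is a minimal sketch of that classification case. It assumes fastai v2 launched with something like `python -m fastai.launch train.py`; the PETS dataset, `resnet34`, and the label function are placeholders for illustration, not from the original post.

```python
# Hedged sketch of distributed classification training (assumes fastai v2).
def train_classifier():
    # Imports kept inside the function so the sketch can be defined
    # even where fastai is not installed.
    from fastai.vision.all import (untar_data, URLs, get_image_files,
                                   ImageDataLoaders, cnn_learner,
                                   resnet34, error_rate, Resize)
    import fastai.distributed  # noqa: F401 -- patches Learner.distrib_ctx

    path = untar_data(URLs.PETS)  # placeholder dataset
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path/"images"),
        label_func=lambda f: f.name[0].isupper(),  # placeholder labeling
        item_tfms=Resize(224))
    learn = cnn_learner(dls, resnet34, metrics=error_rate)
    # No explicit NCCL setup: entering the context manager is enough.
    with learn.distrib_ctx():
        learn.fine_tune(1)
```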

Doing the same for a segmentation model, however, causes this error:

```
File "/workspace/fastai/fastai/distributed.py", line 160, in distrib_ctx
  setup_distrib(cuda_id)
File "/workspace/fastai/fastai/distributed.py", line 58, in setup_distrib
  torch.distributed.init_process_group(backend='nccl', init_method='env://')
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
  barrier()
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
  work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
```
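One thing worth noting about the traceback: `init_method='env://'` means `init_process_group` reads its rendezvous information from environment variables, and if any of them is missing or inconsistent across processes, the initial `barrier()` can fail with an opaque "unhandled system error". A small stdlib-only sketch (this is not fastai code, just an illustration of the environment contract) for checking them:

```python
import os

# Rendezvous variables that torch.distributed expects when
# init_method='env://' is used, as in the traceback above.
REQUIRED = ["MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"]

def missing_env_vars(env=os.environ):
    """Return the rendezvous variables not set in `env`."""
    return [v for v in REQUIRED if v not in env]
```

Calling `missing_env_vars()` in each worker before training can confirm whether the launcher actually populated the environment.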

The fix is to initialize the NCCL environment explicitly before running `fine_tune` within the distributed context manager, by calling `setup_distrib` and `torch.cuda.set_device` with the process rank.
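A sketch of that workaround for the segmentation case. It assumes `setup_distrib` and `rank_distrib` from `fastai.distributed` (the post mentions `setup_distrib`; obtaining the rank via `rank_distrib` is my assumption), and the CAMVID_TINY setup is a placeholder:

```python
# Hedged sketch of the segmentation workaround (assumes fastai v2).
def train_segmenter():
    # Imports kept inside the function so the sketch can be defined
    # even where fastai is not installed.
    from fastai.vision.all import (untar_data, URLs, get_image_files,
                                   SegmentationDataLoaders, unet_learner,
                                   resnet34)
    from fastai.distributed import setup_distrib, rank_distrib
    import torch

    # Workaround: initialize NCCL explicitly and pin this process to
    # its GPU *before* entering the context manager.
    rank = rank_distrib()  # assumption: rank taken from the env
    setup_distrib(rank)
    torch.cuda.set_device(rank)

    path = untar_data(URLs.CAMVID_TINY)  # placeholder dataset
    codes = (path/"codes.txt").read_text().split()
    dls = SegmentationDataLoaders.from_label_func(
        path, get_image_files(path/"images"),
        label_func=lambda f: path/"labels"/f"{f.stem}_P{f.suffix}",
        codes=codes)
    learn = unet_learner(dls, resnet34)
    with learn.distrib_ctx():
        learn.fine_tune(1)
```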

Does anybody know the reason for this difference in behavior? Checking the source for `cnn_learner` and `unet_learner`, the cause was not obvious.