DistributedDataParallel init hanging

(Kerem Turgutlu) #1

Hi,

I am trying to do single-node multi-GPU (4 GPUs) training with DistributedDataParallel using to_distributed():

import os

# environment vars
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '5444'
os.environ['WORLD_SIZE'] = '4'

learn.to_distributed(0)
learn.fit(1)

The code above hangs, and I believe it hangs during torch.distributed.init_process_group(backend='nccl', init_method='env://', rank=0).

Any help? :slight_smile:

Thanks

0 Likes

#2

If you’re launching on just one machine, you normally don’t need to set those environment variables; fastai launch is enough to do everything properly for you.
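For a single machine, the launch command is typically just (the script name here is only an example):

python -m fastai.launch my_train_script.py

It spawns one training process per GPU and sets the distributed environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) for each of them.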

0 Likes

(Kerem Turgutlu) #3

Yes, it is a single machine with 8 GPUs. That was my initial approach but then I got the following error:

learn = cnn_learner(data=fold_data, base_arch=arch, metrics=[accuracy, auc], 
                    lin_ftrs=[1024,1024], ps=[0.7, 0.7, 0.7],
                    callbacks=learn_callbacks,
                    callback_fns=learn_callback_fns)
learn.to_distributed(cuda_id=0)
learn.fit(1)

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Most of the answers about data parallelism on these forums use nn.DataParallel, and I couldn’t find a working solution on the PyTorch forums either.
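(For reference, what those answers amount to is the single-process nn.DataParallel wrapper below, which is not what I’m after here.)

import torch.nn as nn

# Single-process data parallelism: the model is replicated across the visible GPUs
# on every forward pass, all inside one Python process ('learn' is the learner above).
learn.model = nn.DataParallel(learn.model)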

Then, based on that error message, I set the following, but it still hangs:

import os
import torch.distributed

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['WORLD_SIZE'] = '4'
os.environ['RANK'] = '0'
torch.distributed.init_process_group(backend='nccl')

This error is not fastai-related, but maybe someone else has faced a similar issue.
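(If I understand correctly, init_process_group with init_method='env://' blocks until WORLD_SIZE processes have joined the rendezvous, so a single process with WORLD_SIZE=4 just waits forever. A rough sketch of what a launcher has to do instead; this is only an illustration, not fastai's actual code:)

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each process joins the same rendezvous; init_process_group only
    # returns once all world_size processes have called it.
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend='nccl', init_method='env://',
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... build the learner, learn.to_distributed(rank), learn.fit(...) here ...
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(worker, args=(4,), nprocs=4)  # one process per GPU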

0 Likes

#4

Does the CIFAR10 example hang as well?

0 Likes

(Kerem Turgutlu) #5

python -m fastai.launch train_cifar.py --gpu=3

It throws a ZeroDivisionError, since n_gpu is set to 0 by:

def num_distrib():
    "Return the number of processes in distributed training (if applicable)."
    return int(os.environ.get('WORLD_SIZE', 0))
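(In other words, because I ran the script directly instead of through the launcher, WORLD_SIZE was never set, so whatever the script divides by num_distrib() blows up:)

import os

# Without the launcher, WORLD_SIZE is not in the environment,
# so num_distrib() returns 0 and any division by it raises ZeroDivisionError.
print(int(os.environ.get('WORLD_SIZE', 0)))  # -> 0 when run directly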

Then I set these inside the main function of the script:

os.environ['WORLD_SIZE'] = '4'
os.environ['CUDA_VISIBLE_DEVICES']='3,4,5,6'

and ran python -m fastai.launch train_cifar.py --gpu=3 again.
This time I get:

Traceback (most recent call last):
  File "train_cifar.py", line 8, in <module>
    def main( gpu:Param("GPU to run on", str)=None ):
  File "/home/turgutluk/fastai/fastai/script.py", line 40, in call_parse
    func(**args.__dict__)
  File "train_cifar.py", line 23, in main
    num_workers=workers).normalize(cifar_stats)
  File "/home/turgutluk/fastai/fastai/vision/data.py", line 108, in from_folder
    if valid_pct is None: src = il.split_by_folder(train=train, valid=valid)
  File "/home/turgutluk/fastai/fastai/data_block.py", line 199, in split_by_folder
    return self.split_by_idxs(self._get_by_folder(train), self._get_by_folder(valid))
  File "/home/turgutluk/fastai/fastai/data_block.py", line 195, in _get_by_folder
    return [i for i in range_of(self) if self.items[i].parts[self.num_parts]==name]
  File "/home/turgutluk/fastai/fastai/data_block.py", line 195, in <listcomp>
    return [i for i in range_of(self) if self.items[i].parts[self.num_parts]==name]
IndexError: index 0 is out of bounds for axis 0 with size 0

But when I set all these environment variables:

os.environ['WORLD_SIZE'] = '4'
os.environ['CUDA_VISIBLE_DEVICES']='3,4,5,6'
os.environ['RANK'] = '3'
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '1234'

then it hangs.

python fastai/fastai/launch.py --gpus=3,4,5,6 fastai/examples/train_cifar.py --gpu=3

gives the same index-out-of-range error.

[EDIT]

with:

python fastai/fastai/launch.py --gpus=3,4,5,6 fastai/examples/train_cifar.py --gpu=3

It works; I was missing the CIFAR data after all. I will use fastai/launch.py to spawn the processes for my own script and see if it works.

Another question: I am only seeing utilization on gpu=3 when looking at watch gpustat, but I was expecting the work to be distributed across gpus=3,4,5,6. Am I missing something? It looks like there are 4 processes running on gpu=3.

[3] GeForce RTX 2080 Ti | 87°C, 99 % | 9084 / 10989 MB | turgutluk(2263M) turgutluk(2271M) turgutluk(2269M) turgutluk(2271M)
[4] GeForce RTX 2080 Ti | 33°C,  0 % |   10 / 10989 MB |
[5] GeForce RTX 2080 Ti | 35°C,  0 % |   10 / 10989 MB |
[6] GeForce RTX 2080 Ti | 29°C,  0 % |   10 / 10989 MB |

[SOLVED]

It should be like this, since the launcher expects the GPUs as a plain string of digits (list('3456') is the correct format):

python fastai/fastai/launch.py --gpus=3456 fastai/examples/train_cifar.py
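For anyone else confused by the format, a quick check of what the two spellings turn into (my own illustration; launch.py appears to split the --gpus string into individual characters):

print(list('3456'))     # ['3', '4', '5', '6']  -> four worker processes
print(list('3,4,5,6'))  # ['3', ',', '4', ',', '5', ',', '6'] -> the commas break it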

It really scales linearly with constant batch size, wow :smiley:

Thanks!

4 Likes

#6

When using learn.to_distributed() in a Jupyter notebook, I get the same issue:
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Would it require a call to launch? Maybe wrapping the launch process inside .to_distributed() would make it easier, at least for Jupyter notebooks?

Thanks,

0 Likes

#7

You can’t run distributed training in Jupyter; it needs to be in a script (the launcher has to start several copies of the training, one per GPU, which isn’t possible in Jupyter).
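Roughly, the pattern is a small standalone script launched with python -m fastai.launch train.py. A sketch of such a script, pieced together from this thread (the CIFAR data and learner setup are just an example, adapt them to your case):

# train.py -- run with: python -m fastai.launch train.py
import torch
from fastai.script import call_parse, Param
from fastai.vision import *
from fastai.distributed import *   # adds Learner.to_distributed

@call_parse
def main(gpu: Param("GPU to run on", str) = None):
    gpu = int(gpu) if gpu is not None else 0
    torch.cuda.set_device(gpu)
    # fastai.launch starts one copy of this script per GPU and sets
    # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT for each copy.
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    path = untar_data(URLs.CIFAR)
    data = ImageDataBunch.from_folder(path, valid='test', bs=128).normalize(cifar_stats)
    learn = cnn_learner(data, models.resnet34, metrics=[accuracy])
    learn.to_distributed(gpu)
    learn.fit_one_cycle(1)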

4 Likes

#8

Oh OK, I didn’t know! Thanks! :slight_smile:

0 Likes

#9

@sgugger, @sebastienwood et al., I followed your instructions to create a script for distributed training, but as a newbie I am not sure it is actually taking place. I see processes running on the GPUs, but volatile memory is 0%…! I don’t know what that means, but I find it weird. How would you interpret these GPU stats?



Note that I added a line to save the trained model (I think it was missing, right?), but the rest is the same as in the docs.

EDIT: I realized that the volatile GPU utilization does jump to nearly 100% for all 8 GPUs from time to time (like spikes), but most of the time it is 0%, as in the picture.

0 Likes

#10

@mgloria You’re misreading the Volatile Uncorr. ECC | GPU-Util header; they are actually two different values.
GPU-Util is a time sample telling you what % of the time the GPU was running at least one process.
Source: https://stackoverflow.com/questions/40937894/nvidia-smi-volatile-gpu-utilization-explanation
Volatile Uncorr. ECC is a counter of uncorrectable ECC memory errors since the last driver load.
Source: https://www.andrey-melentyev.com/monitoring-gpus.html

Regarding your GPU stats, it looks like your DistributedDataParallel script is behaving normally: all GPUs are being utilized equally, and your memory usage is near capacity.
It’s probably a good idea to check your GPU stats every second or so with
watch -n1 nvidia-smi
It’ll give you a better idea of your actual GPU-Util.

1 Like

#11

Thanks a lot, very good explanation. Do you know why the GPU-Util is sometimes 0% and sometimes (mostly) nearly 100%? When is the GPU not being used during training?

0 Likes

#12

My guess is that you have an I/O bottleneck: your GPUs are waiting for data to be moved from disk to GPU, and during that time GPU-Util is 0%. This is a bigger problem when dealing with large input data, such as large images.
But that’s only a guess; it could be harmless. To get more detail, profiling your code with cProfile and SnakeViz, and/or the NVIDIA Visual Profiler, would be a good idea.
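For example, a quick cProfile pass around a short fit (the profile file name and the learn variable are just placeholders):

import cProfile, pstats

# Profile one epoch and dump the stats to a file you can open with snakeviz.
cProfile.run('learn.fit_one_cycle(1)', 'fit.prof')
pstats.Stats('fit.prof').sort_stats('cumulative').print_stats(20)
# Then: pip install snakeviz && snakeviz fit.prof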


1 Like

(Hallvar Gisnås) #13

Hi, maybe this is the wrong thread, but I was trying to select which GPUs to use for a multi-GPU experiment, and python fastai/fastai/launch.py --gpus=3456 fastai/examples/train_cifar.py worked.

But I was also wondering whether something similar is possible with the “official” distributed guide: https://docs.fast.ai/distributed.html. That code also works, but it uses all available GPUs. Maybe the best approach is to just adapt the examples/train_cifar.py file?

0 Likes