DistributedDataParallel init hanging

(Kerem Turgutlu) #1

Hi,

I am trying to do single-node multi-GPU (4 GPUs) training with DistributedDataParallel using to_distributed():

import os

# environment vars
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '5444'
os.environ['WORLD_SIZE'] = '4'

learn.to_distributed(0)
learn.fit(1)

The code above hangs, and I believe it hangs during torch.distributed.init_process_group(backend='nccl', init_method='env://', rank=0).

Any help? :slight_smile:

Thanks

0 Likes

#2

If you’re launching on just one machine, you normally don’t need to set those environment variables; fastai launch is enough to do everything properly for you.
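For a single machine, the launch command is typically just (the script name here is only an example):

python -m fastai.launch my_train_script.py

It spawns one training process per GPU and sets the distributed environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) for each of them.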

0 Likes

(Kerem Turgutlu) #3

Yes, it is a single machine with 8 GPUs. That was my initial approach but then I got the following error:

learn = cnn_learner(data=fold_data, base_arch=arch, metrics=[accuracy, auc], 
                    lin_ftrs=[1024,1024], ps=[0.7, 0.7, 0.7],
                    callbacks=learn_callbacks,
                    callback_fns=learn_callback_fns)
learn.to_distributed(cuda_id=0)
learn.fit(1)

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Most of the answers about data parallelism on these forums use nn.DataParallel, and I couldn’t find a working solution on the PyTorch forums either.
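(For reference, what those answers amount to is the single-process nn.DataParallel wrapper below, which is not what I’m after here.)

import torch.nn as nn

# Single-process data parallelism: the model is replicated across the visible GPUs
# on every forward pass, all inside one Python process ('learn' is the learner above).
learn.model = nn.DataParallel(learn.model)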

Then, based on that error message, I set the following, but it still hangs:

import os
import torch.distributed

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['WORLD_SIZE'] = '4'
os.environ['RANK'] = '0'
torch.distributed.init_process_group(backend='nccl')

This error is not fastai-related, but maybe someone else has faced a similar issue.
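(If I understand correctly, init_process_group with init_method='env://' blocks until WORLD_SIZE processes have joined the rendezvous, so a single process with WORLD_SIZE=4 just waits forever. A rough sketch of what a launcher has to do instead; this is only an illustration, not fastai's actual code:)

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each process joins the same rendezvous; init_process_group only
    # returns once all world_size processes have called it.
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend='nccl', init_method='env://',
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... build the learner, learn.to_distributed(rank), learn.fit(...) here ...
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(worker, args=(4,), nprocs=4)  # one process per GPU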

0 Likes

#4

Does the CIFAR10 example hang as well?

0 Likes

(Kerem Turgutlu) #5

python -m fastai.launch train_cifar.py --gpu=3

It throws a ZeroDivisionError, since n_gpu is set to 0 by:

def num_distrib():
    "Return the number of processes in distributed training (if applicable)."
    return int(os.environ.get('WORLD_SIZE', 0))
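(In other words, because I ran the script directly instead of through the launcher, WORLD_SIZE was never set, so whatever the script divides by num_distrib() blows up:)

import os

# Without the launcher, WORLD_SIZE is not in the environment,
# so num_distrib() returns 0 and any division by it raises ZeroDivisionError.
print(int(os.environ.get('WORLD_SIZE', 0)))  # -> 0 when run directly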

Then I set these inside the main function of the script:

os.environ['WORLD_SIZE'] = '4'
os.environ['CUDA_VISIBLE_DEVICES']='3,4,5,6'

and ran python -m fastai.launch train_cifar.py --gpu=3 again.
This time I get:

Traceback (most recent call last):
  File "train_cifar.py", line 8, in <module>
    def main( gpu:Param("GPU to run on", str)=None ):
  File "/home/turgutluk/fastai/fastai/script.py", line 40, in call_parse
    func(**args.__dict__)
  File "train_cifar.py", line 23, in main
    num_workers=workers).normalize(cifar_stats)
  File "/home/turgutluk/fastai/fastai/vision/data.py", line 108, in from_folder
    if valid_pct is None: src = il.split_by_folder(train=train, valid=valid)
  File "/home/turgutluk/fastai/fastai/data_block.py", line 199, in split_by_folder
    return self.split_by_idxs(self._get_by_folder(train), self._get_by_folder(valid))
  File "/home/turgutluk/fastai/fastai/data_block.py", line 195, in _get_by_folder
    return [i for i in range_of(self) if self.items[i].parts[self.num_parts]==name]
  File "/home/turgutluk/fastai/fastai/data_block.py", line 195, in <listcomp>
    return [i for i in range_of(self) if self.items[i].parts[self.num_parts]==name]
IndexError: index 0 is out of bounds for axis 0 with size 0

But when I set all these environment variables:

os.environ['WORLD_SIZE'] = '4'
os.environ['CUDA_VISIBLE_DEVICES']='3,4,5,6'
os.environ['RANK'] = '3'
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '1234'

then it hangs.

python fastai/fastai/launch.py --gpus=3,4,5,6 fastai/examples/train_cifar.py --gpu=3

gives the same index-out-of-range error.

[EDIT]

with:

python fastai/fastai/launch.py --gpus=3,4,5,6 fastai/examples/train_cifar.py --gpu=3

It works; I was missing the CIFAR data after all. I will use fastai/launch.py to spawn the processes for my own script and see if it works.

Another question: I am only seeing utilization on gpu=3 when looking at watch gpustat, but I was expecting the work to be distributed across gpus=3,4,5,6. Am I missing something? It looks like there are 4 processes running on gpu=3.

[3] GeForce RTX 2080 Ti | 87°C, 99 % | 9084 / 10989 MB | turgutluk(2263M) turgutluk(2271M) turgutluk(2269M) turgutluk(2271M)
[4] GeForce RTX 2080 Ti | 33°C,  0 % |   10 / 10989 MB |
[5] GeForce RTX 2080 Ti | 35°C,  0 % |   10 / 10989 MB |
[6] GeForce RTX 2080 Ti | 29°C,  0 % |   10 / 10989 MB |

[SOLVED]

It should be like this, since the launcher expects the GPUs as a plain string of digits (list('3456') is the correct format):

python fastai/fastai/launch.py --gpus=3456 fastai/examples/train_cifar.py
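For anyone else confused by the format, a quick check of what the two spellings turn into (my own illustration; launch.py appears to split the --gpus string into individual characters):

print(list('3456'))     # ['3', '4', '5', '6']  -> four worker processes
print(list('3,4,5,6'))  # ['3', ',', '4', ',', '5', ',', '6'] -> the commas break it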

It really scales linearly with constant batch size, wow :smiley:

Thanks!

4 Likes

#6

When using learn.to_distributed() in a Jupyter notebook, I get the same issue:
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Would it require a call to launch? Maybe wrapping the launch process inside .to_distributed() would make it easier, at least for Jupyter notebooks?

Thanks,

0 Likes

#7

You can’t run distributed training in Jupyter; it needs to be in a script (the launcher has to start several copies of the training, one per GPU, which isn’t possible in Jupyter).
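Roughly, the pattern is a small standalone script launched with python -m fastai.launch train.py. A sketch of such a script, pieced together from this thread (the CIFAR data and learner setup are just an example, adapt them to your case):

# train.py -- run with: python -m fastai.launch train.py
import torch
from fastai.script import call_parse, Param
from fastai.vision import *
from fastai.distributed import *   # adds Learner.to_distributed

@call_parse
def main(gpu: Param("GPU to run on", str) = None):
    gpu = int(gpu) if gpu is not None else 0
    torch.cuda.set_device(gpu)
    # fastai.launch starts one copy of this script per GPU and sets
    # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT for each copy.
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    path = untar_data(URLs.CIFAR)
    data = ImageDataBunch.from_folder(path, valid='test', bs=128).normalize(cifar_stats)
    learn = cnn_learner(data, models.resnet34, metrics=[accuracy])
    learn.to_distributed(gpu)
    learn.fit_one_cycle(1)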

4 Likes

#8

Oh OK, I didn’t know! Thanks! :slight_smile:

0 Likes

#9

@sgugger, @sebastienwood et al., I followed your instructions to create a script for distributed training, but as a newbie I am not sure it is actually taking place. I see processes running on the GPUs, but volatile memory is 0%…! I don’t know what that means, but I find it weird. How would you interpret these GPU stats?



Note that I added a line to save the trained model (I think it was missing, right?), but the rest is the same as in the docs.

EDIT: I realized that the volatile GPU utilization does jump to nearly 100% for all 8 GPUs from time to time (like spikes), but most of the time it is 0%, as in the picture.

0 Likes

#10

@mgloria You’re misreading the Volatile Uncorr. ECC | GPU-Util header; they are actually two different values.
GPU-Util is a time sample telling you what % of the time the GPU was running at least one process.
Source: https://stackoverflow.com/questions/40937894/nvidia-smi-volatile-gpu-utilization-explanation
Volatile Uncorr. ECC is a counter of uncorrectable ECC memory errors since the last driver load.
Source: https://www.andrey-melentyev.com/monitoring-gpus.html

Regarding your GPU stats, it looks like your DistributedDataParallel script is behaving normally: all GPUs are being utilized equally, and your memory usage is near capacity.
It’s probably a good idea to check your GPU stats every second or so with
watch -n1 nvidia-smi
It’ll give you a better idea of your actual GPU-Util.

1 Like

#11

Thanks a lot, very good explanation. Do you know why the GPU-Util is sometimes 0% and sometimes (mostly) nearly 100%? When is the GPU not being used during training?

0 Likes

#12

My guess is that you have an I/O bottleneck: your GPUs are waiting for data to be moved from disk to GPU, and during that time GPU-Util is 0%. This is a bigger problem when dealing with large input data, such as large images.
But that’s only a guess; it could be harmless. To get more detail, profiling your code with cProfile and SnakeViz, and/or the NVIDIA Visual Profiler, would be a good idea.
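For example, a quick cProfile pass around a short fit (the profile file name and the learn variable are just placeholders):

import cProfile, pstats

# Profile one epoch and dump the stats to a file you can open with snakeviz.
cProfile.run('learn.fit_one_cycle(1)', 'fit.prof')
pstats.Stats('fit.prof').sort_stats('cumulative').print_stats(20)
# Then: pip install snakeviz && snakeviz fit.prof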


1 Like

(Hallvar Gisnås) #13

Hi, maybe this is the wrong thread, but I was trying to select which GPUs to use for a multi-GPU experiment, and python fastai/fastai/launch.py --gpus=3456 fastai/examples/train_cifar.py worked.

But I was also wondering whether something similar is possible with the “official” distributed guide: https://docs.fast.ai/distributed.html. That code also works, but it uses all available GPUs. Maybe the best approach is to just adapt the examples/train_cifar.py file?

0 Likes