DistributedDataParallel init hanging

@mgloria You’re misreading the Volatile Uncorr. ECC and GPU-Util fields; they are actually two different values.
GPU-Util is a time sample telling you what % of the time, over the last sample period, the GPU was executing at least one kernel.
Source: https://stackoverflow.com/questions/40937894/nvidia-smi-volatile-gpu-utilization-explanation
Volatile Uncorr. ECC is a counter of uncorrectable ECC memory errors since the last driver load.
Source: https://www.andrey-melentyev.com/monitoring-gpus.html

Regarding your GPU stats, your DistributedDataParallel script appears to be behaving normally: all GPUs are being utilized equally and memory usage is near capacity.
It’s probably a good idea to check your GPU stats every second or so by using the command
watch -n1 nvidia-smi
It’ll give you a better idea of your actual GPU-Util than a single snapshot.
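
If you would rather log the numbers than watch them scroll, here is a minimal Python sketch that polls nvidia-smi through its CSV query mode. The helper names (parse_gpu_stats, sample_gpu_stats, watch_gpus) are my own, not part of any library, and it assumes nvidia-smi is on your PATH and reports numeric values for every field (some GPUs print "[N/A]" instead, which this does not handle).

```python
import subprocess
import time

# Fields requested from nvidia-smi; the order must match the unpacking below.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Assumption: all four fields are plain integers (no "[N/A]" entries).
        index, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        stats.append({
            "index": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(mem_used),
            "mem_total_mib": int(mem_total),
        })
    return stats


def sample_gpu_stats():
    """Run nvidia-smi once and return the parsed per-GPU stats."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)


def watch_gpus(n_samples=10, interval_s=1.0):
    """Print one line per GPU per sample, roughly like `watch -n1 nvidia-smi`."""
    for _ in range(n_samples):
        for gpu in sample_gpu_stats():
            print(f"GPU {gpu['index']}: {gpu['util_pct']}% util, "
                  f"{gpu['mem_used_mib']}/{gpu['mem_total_mib']} MiB")
        time.sleep(interval_s)
```

Calling watch_gpus() in a second terminal while training runs should show whether GPU-Util stays high on every device or only spikes occasionally.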
