Distributed training across multiple nodes

Hi there,

I have been struggling with this topic for days. I have made some progress, but it is still not working as desired.

I used DDP in PyTorch across multiple nodes (2 nodes with 2 GPUs each) and it works great. However, when I tried to replicate the behaviour in fastai, the results were quite strange.
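For reference, the plain PyTorch run looks roughly like this on my side (a minimal sketch with a toy model and random data, not my actual training code):

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK are read from the environment
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(0)  # one GPU per node in this setup

    # Toy model and data, just to exercise the distributed setup
    model = DDP(nn.Linear(10, 2).cuda(), device_ids=[0])
    ds = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(ds)   # each process gets its own shard of the data
    dl = DataLoader(ds, batch_size=64, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)       # reshuffle the shards each epoch
        for xb, yb in dl:
            loss = loss_fn(model(xb.cuda()), yb.cuda())
            opt.zero_grad()
            loss.backward()
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The same script runs on both nodes; only the RANK environment variable changes.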

First, to make that example work across different nodes, you must set a few environment variables in your scripts.

In my case I have 2 nodes with 2 GPUs each, but I am going to use just one GPU on each node to keep things simple.

So I set MASTER_ADDR and MASTER_PORT to the master node's IP address and a free port. Then you must set WORLD_SIZE and RANK. In the script for the first node (the master) I set WORLD_SIZE to 2 and RANK to 0, and on the second node I set WORLD_SIZE to 2 and RANK to 1.

WORLD_SIZE indicates the total number of processes involved in DDP, so it is set to 2 (one process per GPU used on each node). RANK is the unique id of each process.

Note that I also had to set the NCCL_SOCKET_IFNAME environment variable to the Ethernet interface used for the connection between the nodes, as I could not make it work without it.
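Putting those variables together, the top of my scripts looks something like this (a sketch; the IP address, port and interface name are just examples, not values to copy):

```python
import os

# Node 0 (the master) -- set before torch.distributed / fastai are used
os.environ['MASTER_ADDR'] = '192.168.1.10'  # example: IP of the master node
os.environ['MASTER_PORT'] = '12355'         # example: a free port on the master
os.environ['WORLD_SIZE'] = '2'              # total number of processes (one per GPU used)
os.environ['RANK'] = '0'                    # id of this process
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'   # example: the Ethernet interface to use

# Node 1 is identical except for the rank:
# os.environ['RANK'] = '1'
```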

With everything set up, I managed to train on the GPUs of both nodes. However, the behaviour is different from PyTorch DDP.
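On the fastai side this is roughly the pattern I follow (a minimal sketch using the PETS example from the fastai docs; my real dataset and model are different):

```python
from fastai.vision.all import *
from fastai.distributed import *

# MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK and NCCL_SOCKET_IFNAME
# are set as shown above before this point.

path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_name_re(
    path, get_image_files(path/'images'),
    pat=r'(.+)_\d+.jpg$', item_tfms=Resize(224), bs=64)

learn = cnn_learner(dls, resnet34, metrics=error_rate)

# distrib_ctx initializes the process group from the environment variables
# (if it is not initialized yet) and wraps the Learner for distributed training
with learn.distrib_ctx():
    learn.fit_one_cycle(2)
```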

When I try to set a bigger batch size (the main reason I want DDP), I get a CUDA out-of-memory error because that batch size does not fit in the first GPU's memory, even though the other GPU's memory usage stays around half. Besides that, with the same data, script and batch size, running DDP on two GPUs of the same node and on two GPUs of different nodes behaves differently: in the first case the second GPU keeps more free memory. The first GPU looks like a bottleneck regardless of the number of GPUs used.

I am wondering whether fastai is prepared to work with DDP across multiple nodes or if I am missing something here. My guess is that the data is not being distributed equally across the processes, or something related.
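In case it helps anyone reproduce this, here is the kind of check I have been using to see what each process actually receives (a sketch of a custom callback; RankReport is my own name, not part of fastai):

```python
import torch
import torch.distributed as dist
from fastai.callback.core import Callback

class RankReport(Callback):
    "Print, at the start of each training epoch, the batch this process sees and its GPU memory usage."
    def before_batch(self):
        if self.iter == 0 and self.training:
            rank = dist.get_rank() if dist.is_initialized() else 0
            alloc = torch.cuda.memory_allocated() / 2**30
            reserv = torch.cuda.memory_reserved() / 2**30
            print(f"rank {rank}: batch shape {tuple(self.xb[0].shape)}, "
                  f"allocated {alloc:.2f} GiB, reserved {reserv:.2f} GiB")

# usage, inside the distributed context:
# with learn.distrib_ctx():
#     learn.fit_one_cycle(2, cbs=RankReport())
```

If the data were being sharded correctly, every rank should report the same batch shape and roughly the same memory usage.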

I hope @jeremy, @sgugger or @muellerzr have some answers here.

Regards

Hi again,

I created a Medium post explaining everything in case someone is facing the same problem.

I would appreciate it if anyone could take a look and point out anything that should be fixed or improved.

Hope it will be useful!

https://medium.com/naia-blog/distributed-training-across-multiple-nodes-in-fastai-299c0ef56e2f