Distributed training across multiple nodes

Hi,

I have just started using fastai and I was trying to do distributed model training with DDP across multiple machines (2 nodes/ 2 GPUs per node). I was able to follow the example and trained using 2 GPUs on a single node using fastai.launch script. Is there a similar script that can be used across nodes? What would be the steps to follow for fastai in this case?

Any help is appreciated. Thank you!

Hi there,

I have been struggling with the same topic for days. I have made some progress that I want to share, but it is still not working as desired.

I used DDP in plain PyTorch across multiple nodes (2 nodes with 2 GPUs each) and it seems to work great. However, when I tried to replicate the behaviour in fastai, the outputs were quite strange.

First, to make the example work across different nodes, you must set a few environment variables in your scripts.

In my case I have 2 nodes with 2 GPUs each, but I am going to use just one GPU per node to keep things simple.

So I set MASTER_ADDR and MASTER_PORT to the master node's IP and a free port. Then you must set WORLD_SIZE and RANK: in the script for the first node (the master) I set WORLD_SIZE to 2 and RANK to 0, and on the second node I set WORLD_SIZE to 2 and RANK to 1.

WORLD_SIZE is the total number of processes involved in DDP, so here it is set to 2 (one process per GPU in use). RANK is the unique id of each process.

Note that I also had to set the NCCL_SOCKET_IFNAME environment variable to the ethernet interface in use, as I could not make it work without it.
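Putting the variables above together, a minimal sketch of the per-node setup (the IP address, port, and interface name are placeholders; substitute your own):

```python
import os

# Same values on both nodes, except RANK.
os.environ["MASTER_ADDR"] = "192.168.1.10"  # placeholder: IP of the master node (rank 0)
os.environ["MASTER_PORT"] = "12355"         # placeholder: any free port on the master
os.environ["WORLD_SIZE"] = "2"              # total processes (1 GPU per node in this example)
os.environ["RANK"] = "0"                    # 0 on the master node, 1 on the second node
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"   # placeholder: the ethernet interface NCCL should use
```

These must be set before `torch.distributed` is initialized; exporting them in the shell before launching the script works just as well.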

With everything set up, I managed to train on both GPUs on both nodes. However, the behaviour is different from plain PyTorch DDP.

When I try to set a bigger batch size (the main reason I want DDP), I get a CUDA out-of-memory error because that batch size does not fit in the first GPU's memory, even though the other GPU's memory is only half full. Moreover, with the same data, script, and batch size, running DDP on two GPUs on the same node versus two GPUs on different nodes behaves differently: in the first case the second GPU keeps more free memory. The first GPU seems to be a bottleneck regardless of the number of GPUs used.

I am wondering if fastai is prepared to work with DDP across multiple nodes, or if I am missing something here. My guess is that the data is not being distributed equally to each process, or something related.

Hope @jeremy, @sgugger, @muellerzr have some answers here.

Regards

Hi @adrianerrea ,

Thank you very much for sharing your solution. I was finally able to run DDP with fast.ai on 2 nodes with 2 GPUs in each one.

I did exactly what you suggested in terms of setting environment variables (e.g., WORLD_SIZE, RANK, etc.). I set WORLD_SIZE = 4 (2 processes per node). I also ended up modifying the launch script to call torch.distributed.init_process_group with the value from the RANK environment variable, and calling .to_distributed on my learner with local_rank (the GPU id within each node) instead of using learn.distrib_ctx() as in the example. There is probably a better solution requiring fewer modifications than what I did.
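For reference, here is a rough sketch of the kind of modification described. The helper names and exact call sequence are my own reconstruction, not fastai's stock launch script; it assumes the environment variables from earlier in the thread are already set and that torch and fastai are installed:

```python
import os

def local_rank_of(global_rank: int, gpus_per_node: int) -> int:
    # Map the global RANK (0..WORLD_SIZE-1) to a GPU id within a node.
    # With 2 nodes x 2 GPUs: ranks 0,1 -> node 0 GPUs 0,1; ranks 2,3 -> node 1 GPUs 0,1.
    return global_rank % gpus_per_node

def setup_distributed():
    # Hypothetical per-process setup, run once in each of the WORLD_SIZE processes.
    # Assumes RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are in the environment.
    import torch
    import torch.distributed as dist
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = local_rank_of(rank, torch.cuda.device_count())
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return local_rank

# Usage in the training script (one process per GPU, on each node):
# local_rank = setup_distributed()
# learn.to_distributed(local_rank)   # wrap the fastai Learner for DDP on this node's GPU
# learn.fit_one_cycle(n_epoch)
```

The key point is that init_process_group receives the global RANK, while the learner is placed on the local GPU id, so the two are no longer assumed to be equal.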

Thank you!

Hi again @psinthon,

Glad to see you made it work successfully!

I finally managed to train on multiple nodes with DDP by modifying fastai's launch.py script as well as the example script.

In my new launch.py I changed a few things to make it work the way I wanted and to be a bit more customizable. In the example script I modified the rank0_first() function to take the cuda_id as an argument, and I also pass this argument to learner.distrib_ctx(). This way the global RANK and the cuda_id are independent, which is not the case in fastai by default.
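A sketch of what that decoupling might look like. The mapping function and its usage are my own illustration of the idea rather than the author's exact code; it assumes torch.distributed is already initialized and `learn` is a fastai Learner:

```python
import os

GPUS_PER_NODE = 2  # our hardware: 2 nodes with 2 GPUs each

def cuda_id_from_rank(rank: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    # Make the CUDA device independent of the global RANK:
    # ranks 0..3 map to local devices 0, 1, 0, 1.
    return rank % gpus_per_node

# Hypothetical usage in the modified example script:
# cuda_id = cuda_id_from_rank(int(os.environ["RANK"]))
# with learn.distrib_ctx(cuda_id=cuda_id):   # train on this node's GPU
#     learn.fit_one_cycle(n_epoch)
```

By default fastai infers the device from the rank, which is why a per-node cuda_id has to be passed in explicitly for the multi-node case.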

Regards

Hi again,

I created a Medium post explaining everything in case someone is facing the same problem.

Hope it will be useful!

https://medium.com/naia-blog/distributed-training-across-multiple-nodes-in-fastai-299c0ef56e2f