How to use multiple GPUs

After some more searching, I found out that such imbalanced GPU usage with DataParallel is expected behaviour. It occurs because the results of all parallel computations are gathered on the main GPU, as explained in this article by Thomas Wolf from huggingface: https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255. Python's GIL can further slow down the multithreading used by DataParallel. A better approach is distributed training with torch.distributed.launch. It uses multiprocessing instead of multithreading and gives balanced GPU usage. It is also covered in the fastai docs: https://docs.fast.ai/distributed.html
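For anyone who wants a starting point, here is a minimal sketch of a training script using DistributedDataParallel, meant to be launched with `python -m torch.distributed.launch --nproc_per_node=NUM_GPUS train.py`. The model, data, and hyperparameters are just placeholders to keep it self-contained, not anything from the articles above.

```python
import argparse
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each spawned process
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # One process per GPU; NCCL is the recommended backend for GPU training
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(args.local_rank)

    # Placeholder model and data just for illustration
    model = nn.Linear(10, 1).cuda(args.local_rank)
    # Each process keeps its own replica and syncs gradients via all-reduce,
    # so outputs are never gathered onto a single GPU as with DataParallel
    model = DDP(model, device_ids=[args.local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(10):
        x = torch.randn(32, 10).cuda(args.local_rank)
        y = torch.randn(32, 1).cuda(args.local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In a real setup you would also wrap your dataset in a DistributedSampler so each process sees a different shard of the data.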
