Distributed/Multi-GPU Training with FastAi in Jupyter Notebook

Hello all

I’d like to share a tool I built to enable interactive distributed training with fastai in Jupyter notebooks. It is an IPython/Jupyter notebook extension of line and cell magics, and it uses ipyparallel to manage the multi-process PyTorch DistributedDataParallel (DDP) group.

The main objective of the tool is to let fastai users play with distributed training in fastai’s lesson notebooks with minimal changes, and without any change to the fastai code base. A few features (from the README):

  1. Switch execution easily between PyTorch’s multiprocess DDP group and local notebook namespace.
  2. Automatically empties the CUDA cache after executing a cell in the DDP group, to reduce the likelihood of OOM errors in a long notebook session.
  3. Takes only 3 - 5 lines of IPython magics to port a fastai course v3 notebook to run in DDP.
  4. Extensible architecture. Future support for fastai v2 could be implemented as a loadable module, like the one for fastai v1.
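For reference, porting a notebook starts by loading the extension and forming the DDP group. The two magics below are the exact invocation used later in this thread; the README documents further cell magics for routing individual cells to the group, which are not shown here:

```
%load_ext Ddip
%makedip -g all -a fastai_v1 --verbose False
```

Here `-g all` asks for all visible GPUs and `-a fastai_v1` loads the fastai v1 adapter module.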

Here is a summary of the speedup observed in fastai notebooks when trained with 3 GPUs.

The repository of the tool, Ddip (“Dee dip”, for Distributed Data “interactive” Parallel), is at:


Ddip is far from perfect, and a few fun puzzles remain to be solved: some models don’t see a speed-up, and I suspect some features may be better implemented using fastai’s callback architecture.

As multi-GPU machines become more common, I hope Ddip can help more fastai notebook users speed up their training. I have ported and uploaded most of the course v3-dl1 notebooks to the repo as usage examples. Please do not hesitate to ask anything about this tool; I welcome and appreciate any feedback/questions/ideas to improve it.

Since I’m not as fast a learner as fastai's Learner, by the time I got it working with fastai v1, fastai v2 was already being rolled out. Now Ddip has to catch up to v2, an exciting target.

I haven’t investigated fastai v2's distributed training capability — can anyone shed some light on it?

Thank you FastAI team and users!


Thank you! I am going to try this out now. I see you have been updating the code, do you still see instances where some models don’t see a speed increase?


Hello @GrahamAcademy, thank you for the interest and following the project.

I haven’t been working on the speed-up gains for v1 lately, only on porting the tool over to fastai v2. So the models that didn’t see a speedup before still don’t.

But from casual discussion with Sylvain, some workloads may not lend themselves to linear speedup, such as language modelling or text classification, because input sentences can be of different lengths, even though the fastai library tries to bucket them. There is a speedup, but it is not linear, and it may vary with batch size.
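A back-of-the-envelope way to see the cost of variable-length inputs: when a batch is padded to its longest sentence, the GPU does work on pad tokens too, and the wasted fraction depends on how well bucketing grouped similar lengths. The numbers below are purely illustrative, not measured from fastai:

```python
# Illustrative token counts for four sentences in one batch (made-up numbers).
lengths = [12, 15, 40, 41]

padded = max(lengths) * len(lengths)  # work actually done after padding to batch max
useful = sum(lengths)                 # work spent on real (non-pad) tokens

print(f"utilization on real tokens: {useful / padded:.0%}")  # prints 66%
```

With perfect bucketing (all lengths equal), utilization approaches 100% and the per-batch work becomes uniform, which is what linear multi-GPU scaling needs.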

WGAN is another one that is difficult to parallelize/distribute, because of the frequent critic/generator synchronization, and because the maths behind WGAN is hard to parallelize. There was a paper specifically seeking to change the WGAN mathematical problem statement so that its solution could benefit from distributed training, which seems to suggest that the regular, standard WGAN doesn’t.

In short, not all workloads/model architectures are parallelizable.

nvprof, an NVIDIA performance profiling tool (with the NVIDIA Visual Profiler as its GUI front-end), can capture a multi-GPU workload’s performance profile over its run time. But that approach may not fit everyone’s style. NVIDIA is pushing the Nsight tools to replace nvprof, though.

On the other hand, there are other places where performance can be improved by moving work to the GPU. E.g. NVIDIA’s RAPIDS is used by fastai v2’s tabular data, and DALI is used for data preprocessing on the GPU. We will never run out of bottlenecks :wink:.

I think this doesn’t work on Windows, does that sound right? Do you know if anyone has ever tested it on Windows?

I haven’t tested it in a multi-GPU Windows environment. Perhaps someone can try it within WSL.

I don’t think WSL allows for GPU use :frowning:

Well, the extension I wrote depends on ipyparallel, which probably works on Windows, because it doesn’t rely on the Unix-only fork() behavior to pass state from the parent process to the child processes. And since fastai should work on Windows, I would guess there is a good chance my extension works on Windows as well.
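The fork() point can be illustrated with Python’s standard library alone, no ipyparallel needed:

```python
import multiprocessing as mp

# fork() is Unix-only: a forked child starts as a copy of the parent and
# inherits its in-memory state for free. "spawn" -- the only start method
# on Windows -- launches a fresh interpreter, so all state must be shipped
# to the child explicitly. A tool that already avoids depending on fork,
# as ipyparallel does, therefore has a fighting chance on Windows.
print(mp.get_all_start_methods())  # "spawn" everywhere; "fork" only on Unix-like OSes
```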

But I’m sorry that I don’t have access to a Windows 10 setup. If you do I highly encourage you to try it.

Unfortunately I am getting the error

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-85059193315d> in <module>
----> 1 r = join_group_single(g_rank=g_rank, l_rank=l_rank, gpu=gpu, ws=ws)
~\Anaconda3\envs\fastai_02\lib\site-packages\Ddip\ddp.py in join_group_single(g_rank, l_rank, gpu, ws)
     30     os.environ["OMP_NUM_THREADS"] = str(1) # See https://github.com/pytorch/pytorch/pull/22501
     31     torch.cuda.set_device(gpu)
---> 32     if ws > 0: torch.distributed.init_process_group(backend='nccl', init_method='env://')
     33     return os.environ["LOCAL_RANK"]
AttributeError: module 'torch.distributed' has no attribute 'init_process_group'

I get this when running
%reload_ext autoreload
%autoreload 2
%matplotlib inline

%load_ext Ddip
%makedip -g all -a fastai_v1 --verbose False

From my research on this error, it looks like Windows doesn’t support torch.distributed. So sad!
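For anyone hitting this, a quick sanity check sketch using only PyTorch’s public API (assuming just that torch is installed):

```python
import torch.distributed as dist

# On PyTorch builds compiled without distributed support (as the Windows
# wheels were at the time), the torch.distributed module still imports,
# but is_available() returns False and functions like init_process_group
# are simply absent -- hence the AttributeError above.
if dist.is_available():
    print("distributed OK; NCCL backend available:", dist.is_nccl_available())
else:
    print("this PyTorch build has no distributed support")
```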