I’d like to share a tool I built to enable interactive distributed training of fastai in Jupyter notebooks. It is an IPython/Jupyter notebook extension of line and cell magics, and it uses ipyparallel to manage the multiprocess PyTorch DistributedDataParallel (DDP) group.
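To give a rough idea of the mechanism, here is a minimal sketch (not Ddip’s actual code) of how an ipyparallel cluster can drive a DDP group from a notebook; the cluster size, master address, and port below are illustrative assumptions:

```python
# Minimal sketch: using ipyparallel engines (started with e.g. `ipcluster start -n 2`)
# to form a PyTorch DistributedDataParallel process group, one engine per GPU.
import ipyparallel as ipp

def init_ddp(rank, world_size):
    """Runs on one engine: bind it to a GPU and join the DDP process group."""
    import os
    import torch
    import torch.distributed as dist
    os.environ['MASTER_ADDR'] = '127.0.0.1'   # illustrative rendezvous settings
    os.environ['MASTER_PORT'] = '29500'
    torch.cuda.set_device(rank)
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    return f"rank {rank}/{world_size} ready"

client = ipp.Client()                  # connect to the running ipcluster
world_size = len(client.ids)           # one engine per GPU

# init_process_group() blocks until every rank joins, so launch asynchronously
# on all engines first, then wait for all of them.
results = [client[rank].apply_async(init_ddp, rank, world_size)
           for rank in range(world_size)]
print([r.get() for r in results])
```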
The main objective of the tool is to let fastai users play with distributed training in fastai’s lesson notebooks with minimal changes, and without any change to the fastai code base. A few features (from the README):
Switch execution easily between PyTorch’s multiprocess DDP group and local notebook namespace.
Automatically empties the CUDA cache after executing a cell in the DDP group, to reduce the likelihood of OOM errors in a long notebook session.
Takes only 3 - 5 lines of IPython magics to port a fastai course v3 notebook to run in DDP (see the sketch after this list).
Extensible architecture. Future support for fastai v2 could be implemented as a loadable module, like that for fastai v1.
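For a sense of what porting a notebook looks like, here is an approximate example; the magic names and flags below are from memory of the README and may not match Ddip’s exact API, so treat them as illustrative:

```python
# Cell 1: load the extension and start an ipyparallel DDP group on all GPUs.
%load_ext Ddip
%makedip -g all --appname fastai_v1

# Cell 2: prefix a cell with the cell magic so it runs in the DDP group
# instead of the local notebook namespace.
%%dip
learn = cnn_learner(data, models.resnet34, metrics=accuracy)
learn.fit_one_cycle(4)
```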
Ddip is far from perfect, and a few fun puzzles remain to be solved: some models don’t see a speed-up, and I suspect some features may be better implemented using fastai’s callback architecture.
As multi-GPU machines become more common, I hope Ddip can help more fastai notebook users speed up training. I have ported and uploaded most of the course v3-dl1 notebooks to the repo as usage examples. Please do not hesitate to ask anything about this tool; I welcome and appreciate any feedback/questions/ideas to improve it.
Since I’m not as fast a learner as fastai’s Learner, by the time I got it working well with fastai v1, fastai v2 was already being rolled out. Now Ddip has to catch up to v2, an exciting target.
I haven’t investigated fastai v2’s distributed training capability — can anyone shed some light on it?
Thank you! I am going to try this out now. I see you have been updating the code; do you still see instances where some models don’t get a speed increase?
Hello @GrahamAcademy, thank you for the interest and following the project.
I haven’t been working on the speed-up gains for v1 lately, only on porting it over to fastai v2. So the models that didn’t see a speedup still don’t.
But from casual discussions with Sylvain, some workloads may not lend themselves to linear speed-up, such as language modelling or classification, because input sentences can be of different lengths even though the fastai library tries to bucket them. There is a speedup, but it is not linear and may vary with batch size.
WGAN is another one that is difficult to parallelize/distribute, because of the frequent critic/generator synchronization, and because the maths behind WGAN is hard to parallelize. There was a paper that specifically looked at how to change the WGAN problem statement so that its solution could benefit from distributed training, which seems to suggest that the standard WGAN doesn’t.
In short, not all workloads/model architectures are parallelizable.
nvprof, NVIDIA’s performance profiling tool (with the Visual Profiler GUI), can capture a multi-GPU workload’s performance profile over its run time. But that approach may not fit everyone’s style. NVIDIA is pushing for the Nsight tools to replace nvprof, though.
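If you do want to go down that route, one way to make the nvprof timeline easier to read is to wrap the training call in PyTorch’s NVTX emitter; this is just a generic sketch, not part of Ddip, and `learn` is assumed to be an existing fastai Learner:

```python
from torch.autograd.profiler import emit_nvtx

# Run the training script under nvprof, for example:
#   nvprof --profile-child-processes -o profile_%p.nvprof python train.py
# Operations executed inside this context then show up as named NVTX ranges
# in the captured timeline, which makes per-GPU hotspots easier to spot.
with emit_nvtx():
    learn.fit_one_cycle(1)   # `learn` is an assumed, pre-built fastai Learner
```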
On the other hand, there are other places where performance can be improved by moving work to the GPU, e.g. NVIDIA’s RAPIDS, used by fastai v2’s tabular data processing, and DALI, used for data preprocessing on the GPU. We will never run out of bottlenecks.
Well, the extension I wrote depends on ipyparallel, which probably works on Windows, because it doesn’t rely on the Unix-only fork() behavior to pass state from the parent process to the child processes. And since fastai should work on Windows, I would guess there is a good chance my extension works on Windows as well.
But I’m sorry that I don’t have access to a Windows 10 setup. If you do, I highly encourage you to try it.