This is more promising, but I end up with the following error message:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)
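That error usually means the input batch lives on a different GPU than the model's weights (here device 1 vs device 0). A minimal sketch of the usual fix, in plain PyTorch, is to move the batch to wherever the model's parameters actually live instead of hard-coding a device index (CPU here so it runs anywhere; the model and shapes are made up for illustration):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 4), nn.BatchNorm1d(4))
# In a multi-GPU setup this would be e.g. model.to("cuda:0")

x = torch.randn(16, 8)  # pretend this batch arrived on a different device

# Query the device the model's weights are on, and send the batch there
# before the forward pass -- this avoids the input/weight device mismatch:
device = next(model.parameters()).device
out = model(x.to(device))
print(out.shape)  # torch.Size([16, 4])
```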
Can you share a minimal (non-)working example, preferably on one of the datasets available in fastai? I have a multi-GPU setup, so I can try running it and see what I get.
DistributedDataLearner is the way to go. You see an approximately linear improvement in training speed: with 2 GPUs learning is ~2x as fast, with 4 it's about ~3.7x, etc.
Let me upload a sample with DDL. There are a bunch of samples already, but they are a bit hard to discover in the repo. fastai has a neat launcher script that makes the setup pretty simple and has nice rank0 helpers. I will also send a PR to improve the docs around DDL.
Also look at the train_imagenette.py example and ignore the parts that support the DistributedLearner.
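Until that sample lands, here is a bare-bones sketch of the underlying `torch.nn.parallel.DistributedDataParallel` setup (not the fastai sample mentioned above, and not fastai's API). It uses a single process with the gloo/CPU backend so it runs anywhere; with a launcher such as `python -m fastai.launch` or `torchrun`, each process would instead get its own rank and GPU:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# A launcher normally sets these per process; hard-coded here so the
# sketch runs standalone on a single CPU process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(nn.Linear(8, 4))  # on GPU you'd pass device_ids=[local_rank]
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(16, 8), torch.randn(16, 4)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()  # DDP all-reduces gradients across ranks here
opt.step()

dist.destroy_process_group()
```

The key point is that each process owns exactly one device and DDP synchronizes gradients during `backward()`, which is why the per-GPU speedup is close to linear.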