By the way, Jeremy mentioned in lesson 8 of DL course v2 that he prefers running separate models on separate GPUs rather than running one model in parallel across multiple GPUs, because the latter does not speed things up very much. That was indeed my experience too.
And yes, Jeremy did train on 8 GPUs in PyTorch (8x V100s on AWS). It's not clear to me whether he used the normal PyTorch method of running one model in parallel (maybe the NVLink topology on AWS instances is better designed than on GCP ones?), but I doubt it.
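For reference, the "normal" single-process way would be something like torch.nn.DataParallel. This is just a minimal sketch with a placeholder model and batch, not anything from his code:

```python
import torch
import torch.nn as nn

# Placeholder model, just to show the wrapping pattern.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()

# DataParallel scatters each batch across all visible GPUs, replicates the model,
# and gathers the outputs back on GPU 0 -- exactly the kind of cross-GPU traffic
# that tends to limit how well this approach scales.
model = nn.DataParallel(model)

x = torch.randn(64, 512).cuda()   # dummy batch for illustration
out = model(x)                    # split across GPUs, gathered on cuda:0
print(out.shape)
```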
The other possibility, which I lean towards more, is that Jeremy ran 8 completely separate Python processes and combined their results in a master Python script that he wrote himself. As you can see in my analysis post in the pets benchmark thread, running 8 GPUs separately is completely fine, provided they are not talking to each other. This is the most likely way he circumvented the inherent slowdown of running one model on multiple GPUs in PyTorch, which he himself described with "running one model on multiple GPUs does not speed up well" (in the lesson 8 DL course v2 link above). A rough sketch of that pattern is below.
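I have not seen his master script, so this is only a generic sketch of the idea: pin each child process to one GPU with CUDA_VISIBLE_DEVICES and combine the results afterwards. The script name and arguments here are made up, just to show the pattern:

```python
import os
import subprocess

# Launch one completely independent training process per GPU.
# 'train_one.py' and its arguments are placeholders, not Jeremy's actual script.
procs = []
for gpu in range(8):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)                 # each child sees only its own GPU
    cmd = ["python", "train_one.py", "--out", f"result_{gpu}.csv"]
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()   # wait for all runs to finish, then a master script combines result_*.csv
```

Since the processes never talk to each other, there is no synchronization overhead at all, which matches what I saw in the benchmark thread.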
Here is the snapshot of the code for his DAWNBench example run during lesson 12 of DL course v2:
We can see that he launches the multiple Python processes with his own script rather than relying on PyTorch for this. His explanation in the video makes this clearer too.
Edit:
Jeremy released the DAWNBench code in this tweet, and it indeed does not use the torch.nn.DataParallel(learn.model) method. Instead he used learn.distributed(gpu), which needs the launch module (which handles launching the distributed processes).
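For comparison, the plain-PyTorch version of that setup looks roughly like this: the launch module spawns one process per GPU and passes each one a --local_rank, and the model is wrapped in DistributedDataParallel instead of DataParallel. This is only a minimal sketch with a placeholder model, not Jeremy's actual DAWNBench code:

```python
# train_ddp.py -- minimal sketch, launched with:
#   python -m torch.distributed.launch --nproc_per_node=8 train_ddp.py
import argparse
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)   # filled in by the launch module
args = parser.parse_args()

# One process per GPU; they only communicate (via NCCL) to all-reduce gradients.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

model = nn.Linear(512, 10).cuda()                           # placeholder model
model = DistributedDataParallel(model, device_ids=[args.local_rank])

x = torch.randn(64, 512).cuda()                             # dummy batch
loss = model(x).sum()
loss.backward()   # gradients are synchronized across the 8 processes here
```

So it is still one model trained across 8 GPUs, but with one process per GPU and much less gathering on a single device than DataParallel, which is presumably why it scaled well enough for DAWNBench.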