How to use Multiple GPUs?

(Andrea de Luca) #64

They won DAWNBench a few months ago. If I remember right, they used an AWS instance with way more than 2 GPUs.

I think we are just missing something regarding parallel usage of GPUs with fastai/pytorch. Let us see if some of the big guys see this thread (tagging them will get us shot on sight).

(Gerardo Garcia) #65

I think it’s lack of expertise that’s preventing me from using multiple GPUs. :confused:

Jeremy published a tweet a few days ago

with pretty much what I’m doing, but it still does not work for me.
The GPUs get loaded and go to 100% utilization, but no results come out.


What’s your rig like, Jeremy?
I’m guessing AMD Threadripper and 4 x 2080 Ti.
Do you dual boot?

(Haider Alwasiti) #67

For anything more than 2 GPUs, PCIe lanes become an important factor.

Also, I have experimented a bit with different numbers of GPUs on Google GCP, and anything more than 2 parallel GPUs seemed handicapped by the PCIe lane speed on the 8x V100 instances. The NVLink topology was laid out in a way that did not help scaling to 4 GPUs in parallel. Of course, I could still run 4 separate models on 8 GPUs with better speedup (4 models, each on 2 GPUs).

I don’t know what the AWS V100 NVLink topology looks like, but I guess there are topologies better tuned for DL training. Specifically, the DGX-2 topology seems better than GCP’s.

Here are my 3 posts on the details of GCP’s 8 GPU analysis.

I wonder how you connected 6 GPUs. What motherboard and CPU do you use, and how many PCIe lanes do you have?

(Haider Alwasiti) #68

I kept looking for reasonably priced hardware in the high-end desktop range that can connect 4 GPUs with 16 PCIe lanes each. Maybe there are server-grade CPUs/motherboards that can do it, but the price is too high to justify.

There are new AMD CPUs, announced a few months ago, that give you 64 lanes without PLX (PLX PCIe switches can make inter-GPU communication faster, but the CPU <-> GPU link is still bottlenecked by the CPU’s PCIe lane limit). So I think for 4 GPUs at 16 lanes each, or 8 GPUs at 8 lanes each, AMD is your friend.

AMD Ryzen Threadripper 2990WX
32 cores/64 threads
4.2GHz boost/3.0GHz base
64MB L3 cache
250W TDP
64 PCIe Gen 3.0 lanes
Price: $1,799
Availability: Aug 13, 2018

AMD Ryzen Threadripper 2970WX
24 cores/48 threads
4.2GHz boost/3.0GHz base
64MB L3 cache
250W TDP
64 PCIe Gen 3.0 lanes
Price: $1,299
Availability: Oct 2018

AMD Ryzen Threadripper 2950X
16 cores/32 threads
4.4GHz boost/3.5GHz base
32MB L3 cache
180W TDP
64 PCIe Gen 3.0 lanes
Price: $899
Availability: Aug 31, 2018

AMD Ryzen Threadripper 2920X
12 cores/24 threads
4.3GHz boost/3.5GHz base
32MB L3 cache
180W TDP
64 PCIe Gen 3.0 lanes
Price: $649
Availability: Oct 2018


It seems Intel could not (or does not want to?) make CPUs with 64 lanes.

For the Nvidia DGX-2 with 16 GPUs, they seem to rely on the NVLinks between the V100s. They use 2 Xeon processors (dual Intel Xeon Platinum 8168, 2.7 GHz, 24 cores), each with 48 lanes (6 lanes/GPU?), but that is not a problem with NVLinks.

2x Intel Xeon E5-2698 v3 (16 core, Haswell-EP) with 40 lanes each
8x NVIDIA Tesla P100 (3584 CUDA Cores) with NVlinks
= 10 lanes/gpu
But again, there are NVLinks between the 8 P100 GPUs.

I think only the new AMD processors can run 4 GPUs efficiently for us without server-grade Tesla GPUs.
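
The lanes-per-GPU figures in this post are simple divisions, and a few lines of Python make them easy to re-check for other configurations. This is only a back-of-the-envelope sketch: the helper name `lanes_per_gpu` is made up for illustration, and it assumes every CPU lane is available to the GPUs, which real motherboards rarely allow (some lanes go to storage, networking and the chipset).

```python
def lanes_per_gpu(lanes_per_cpu, n_cpus, n_gpus):
    """Average number of CPU PCIe lanes available to each GPU."""
    return lanes_per_cpu * n_cpus / n_gpus


# Configurations mentioned in this thread (lane counts as quoted above).
configs = {
    "Threadripper, 4 GPUs": lanes_per_gpu(64, 1, 4),
    "Threadripper, 8 GPUs": lanes_per_gpu(64, 1, 8),
    "2x Xeon (48 lanes each), 16 GPUs": lanes_per_gpu(48, 2, 16),
    "2x Xeon (40 lanes each), 8 GPUs": lanes_per_gpu(40, 2, 8),
}

for name, lanes in configs.items():
    print(f"{name}: {lanes:g} lanes/GPU")
```

On systems with NVLink the GPU-to-GPU traffic bypasses these lanes, so a low lanes/GPU number there says less about multi-GPU scaling than it does on a desktop board.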

My question, though: if we go for the cheapest AMD with 64 lanes, how is the performance and, most importantly, the compatibility of AMD CPUs with the DL frameworks? Like this one at $650:
AMD Ryzen Threadripper 2920X
12 cores/24 threads
4.3GHz boost/3.5GHz base
32MB L3 cache
180W TDP
64 PCIe Gen 3.0 lanes
Price: $649
Availability: Oct 2018

That said, I recently purchased the newly announced Core i9-9900K ($550) to build a system with 3 GPUs.

I just don’t feel comfortable going for AMD CPUs, fearing compatibility issues and lower performance for DL or the other compute tasks I am interested in (rendering, Ansys simulations, etc.).

(Haider Alwasiti) #69

By the way, Jeremy once mentioned in lesson 8 of the DL course v2 that he prefers to run separate models on separate GPUs rather than one model in parallel on multiple GPUs, because the latter does not speed things up very much. This was indeed my experience too.

And yes, Jeremy used parallel training in PyTorch on 8 GPUs (8x V100s on AWS). It’s not clear to me whether he used the normal PyTorch method to run one model in parallel (maybe the AWS NVLink topology is better designed than on GCP instances?), but I doubt it.

The other possibility, which I lean toward more, is that Jeremy ran 8 completely separate Python processes whose results are combined by a master Python script that he wrote himself. As you can see in my analysis post in the pets benchmark thread, running 8 GPUs separately is completely fine, provided they are not talking to each other. This is most likely how he circumvented the inherent slowdown of running one model on multiple GPUs in PyTorch, which he described himself as “running one model on multiple GPUs does not speed up well” (in lesson 8 of the DL course v2, linked above).
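
To make the "separate processes, one per GPU" idea concrete, here is a minimal sketch using only the standard library. It is not the DAWNBench script and says nothing about how results were actually combined there; the helper names `worker_env` and `run_on_gpus` are made up, and the child command is a stand-in that only reports which GPU it was given. The one real mechanism it relies on is the `CUDA_VISIBLE_DEVICES` environment variable, which restricts the GPUs a process can see.

```python
import os
import subprocess
import sys


def worker_env(gpu_id, base_env=None):
    """Copy the environment and expose only one GPU to the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def run_on_gpus(command, gpu_ids):
    """Start one independent process per GPU, then wait for all of them."""
    procs = [
        subprocess.Popen(
            command, env=worker_env(gpu_id), stdout=subprocess.PIPE, text=True
        )
        for gpu_id in gpu_ids
    ]
    # communicate() blocks until the child exits and returns (stdout, stderr).
    return [proc.communicate()[0].strip() for proc in procs]


# Stand-in for a real training script: each child prints the GPU it can see.
demo_command = [
    sys.executable,
    "-c",
    "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])",
]
outputs = run_on_gpus(demo_command, gpu_ids=[0, 1, 2, 3])
print(outputs)
```

In a real run, `demo_command` would be replaced by something like a training script invocation, and the "master" part would read whatever each worker writes (metrics, checkpoints) once they finish.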

Here is a snapshot of the code for his DAWNBench example, run during lesson 12 of the DL course v2:

We can see that he launches the multiple Python processes with his own script and does not depend on PyTorch for this. His talk in the video makes this clearer too.

Jeremy released the DAWNBench code in this tweet. Indeed, it does not use the torch.nn.DataParallel(learn.model) method; instead he used learn.distributed(gpu), which needs the launch module (which handles distributed launching).

(adrian) #70

Another option is to buy used: the Xeon e5-2670 ($100), e5-2680v2 ($200) and e5-2690v2 offer very good performance per dollar. E.g., 2x e5-2680v2 gives you 80 PCIe lanes, 20 cores/40 threads and up to 768 GB of RAM per CPU. Not bad for $400. The other advantage is that they are compatible with DDR3 RAM (~$50 per 16 GB stick), which is much cheaper than DDR4.

(Haider Alwasiti) #71

As far as I know, doing that in a dual-CPU system is not a good idea. The PCIe lanes of one CPU are isolated from the other’s lanes. This means data from a GPU attached to CPU1 cannot be transferred peer-to-peer to a GPU attached to CPU2. It has to pass through the CPUs the GPUs are attached to, which makes the connection rather slow.

Now, we can notice that both the DGX-1 and DGX-2 use 2 Xeon processors, but we should keep in mind that those Tesla Pascal and Volta GPUs have NVLinks, which are way faster than PCIe lanes. So they have effectively bypassed the need for fast PCIe communication. Those cards are so expensive that they are beyond consideration for most of us, unless using ordinary GTX GPUs is not permitted (e.g., Nvidia does not allow cloud service providers to install them).

(Haider Alwasiti) #72

I would be interested to see a publication doing that. A quick Google search turned up this blog post, where even though the PyTorch speedup is not that good, other frameworks like TF and Keras are even worse:

In the Summary section:

For single-node multi-gpu training using distributed wrappers (such as Horovod or torch.distributed ) is probably likely to result in faster timings (each process will bind to a single GPU and do multi-process distributed training on a single-node).

I think Jeremy did something like this in the DAWNBench script, without using PyTorch’s straightforward torch.nn.DataParallel(learn.model, device_ids=[0,1,2,3,4,5,6,7]).

It seems Horovod is the way to go:

512-GPU Benchmark

The above benchmark was done on 128 servers with 4 Pascal GPUs each connected by RoCE-capable 25 Gbit/s network. Horovod achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 68% scaling efficiency for VGG-16.

This is Uber’s solution for massively parallel GPU training. I suspect this is how other big players like FB and Google do it. It needs a one-time, not-so-easy setup of Open MPI, but for serious projects I think it’s worth it. I am still interested in finding a better or easier solution.
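
To put the quoted scaling-efficiency figures in perspective, here is a small sketch that converts them into an effective speedup. It assumes the usual definition, efficiency = measured throughput on N GPUs divided by N times the single-GPU throughput; the benchmark text quoted above does not spell out its definition, so treat this as an assumption. The function names are illustrative.

```python
def scaling_efficiency(throughput_n, throughput_1, n_gpus):
    """Fraction of ideal linear scaling actually achieved on n_gpus."""
    return throughput_n / (n_gpus * throughput_1)


def effective_speedup(efficiency, n_gpus):
    """Speedup over a single GPU implied by a given scaling efficiency."""
    return efficiency * n_gpus


# Efficiencies quoted for the 512-GPU benchmark above.
quoted = {"Inception V3": 0.90, "ResNet-101": 0.90, "VGG-16": 0.68}

for model, eff in quoted.items():
    print(f"{model}: about {effective_speedup(eff, 512):.0f}x one GPU")
```

Under that definition, 90% efficiency on 512 GPUs is roughly a 461x speedup, and 68% is roughly 348x.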

(adrian) #73

With this MB you get 3 PCIe slots on CPU01 (x16, x16, x8) and a fourth PCIe slot (x16) on CPU02, so there is a fair amount of flexibility. It is bigger than the ATX form factor, so it needs a big case.

(Brian) #74

I am using an ASUS X99-E WS/USB, which can do x16/x16/x16/x16, in conjunction with an Intel 6850K CPU (40 PCIe lanes). I have 4 1080 Ti cards in this box. Essentially it’s a clone of an NVIDIA DevBox:

(Haider Alwasiti) #75

The real bottleneck is the CPU-to-GPU lanes, of which there are still only 40. PLX switches help a bit, but with the new 64-lane AMD CPUs, it is time to look for true 4 x 16 lanes without PLX switches.

More details :

(Gerardo Garcia) #76

Your write-up definitely explains a lot.
I connected the 6 GPUs using the

They work for mining, but they don’t seem to work for deep learning.
I created a Franken-rig with:
ASUS Z270-P, i7 3800
32 GB of RAM (I was using 16 GB and it was working fine)
4 NVIDIA 1080 Ti and 2 1070 Ti
A 256 GB M.2 drive (super necessary)

There’s not enough space on the rig to fit all the cards, so I’m going to try with extenders.

(Haider Alwasiti) #77

Well, how did you use learn.distributed(gpu)? It needs the launch module (which handles distributed launching).

I think that for now, just to test your system, it is easier to stick with the torch.nn.DataParallel(learn.model) method, because it does not need any extra module or code.

Here is my notebook example running the pets NB on 4 GPUs:

on 8 GPUs:

on 2 GPUs:

Please note that if you want to run the notebook above on 2 GPUs and run another notebook at the same time on the other free GPUs, you have to specify your main GPUs in cell [3], as in my case when I wanted to run an additional 2 GPUs alongside the above NB:
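
The notebook cell itself is not reproduced in this thread, so the following is only one common way to pin a notebook to specific GPUs, not necessarily what that cell does. It sets the `CUDA_VISIBLE_DEVICES` environment variable, which has to happen before CUDA is initialised in the process (in practice, before importing torch). The GPU indices `2,3` are just an example for a machine with at least 4 cards.

```python
import os

# Expose only physical GPUs 2 and 3 to this process. Set this before
# importing torch or any other library that initialises CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

# Inside this process the visible cards are renumbered from zero, so a
# later DataParallel call would use device_ids=[0, 1], not [2, 3].
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
local_device_ids = list(range(len(visible)))
print(visible, "->", local_device_ids)
```

A second notebook can then set the variable to `0,1` and both will run side by side without competing for the same cards.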

(Haider Alwasiti) #78

It should work with DL too; it will just be slower. In mining, the PCIe lanes do not matter. In DL, running the GPUs on only an x1 lane through those risers will slow them down. Also note that CPU speed and core count do matter if you run more than 1-2 GPUs. In my old 3-GPU system, even with x1 risers like yours, the real bottleneck was not the risers but the CPU. It was too slow to feed image data to the GPUs fast enough, so the GPUs sat idle with occasional bursts of 100% activity. If one notebook finished in 20 minutes on its own, then running a second copy of the NB at the same time made both finish in 40 minutes, which means using 2 GPUs on such a system was pointless!

Multi-GPU DL rigs need a powerful CPU plus at least 8 lanes per GPU; otherwise multiple GPUs will not gain you anything over running a single GPU.
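
The 20-minute versus 40-minute observation above can be turned into a quick check you can run on your own rig: time one job alone, time N copies running at once, and compare against running them back to back. A tiny sketch (the function name is made up for illustration):

```python
def concurrent_speedup(solo_minutes, n_jobs, concurrent_minutes):
    """Speedup of running n_jobs at once versus running them back to back."""
    sequential_minutes = solo_minutes * n_jobs
    return sequential_minutes / concurrent_minutes


# The CPU-bound case described above: two copies take 40 minutes together.
cpu_bound = concurrent_speedup(solo_minutes=20, n_jobs=2, concurrent_minutes=40)

# What a rig with no CPU or PCIe bottleneck would show instead.
ideal = concurrent_speedup(solo_minutes=20, n_jobs=2, concurrent_minutes=20)

print(cpu_bound, ideal)
```

A value close to 1.0 means the extra GPU is buying nothing; a value close to the number of jobs means the GPUs are being fed fast enough.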

(Andrea de Luca) #79

I know nothing about distributed. Could you elaborate on the difference?

I have a single e5-2680v2. Incredibly, I was able to buy it from a German reseller (eBay) for 80 euros, along with 96 GB of RAM (6 x 16 GB, triple channel) for 220 euros.
Great performance for the money, and everything you save can be spent on the GPUs.
Cons: they don’t support AVX2 and AVX-512, but who cares?

You can use NVlink starting from RTX2080 (non-ti).

I was interested in the X99-E WS, but quoting Dettmers’ blog:

<<X99E-WS uses the PLX PEX8747 PCI-E switch, which communicates with the CPU using 32x bi-directional lanes of PCI-E 3.0. The switch also communicates with each GPU using 16x bi-directional lanes of PCI-E 3.0.

Hence, the CPU can’t transmit only, or receive only, data to/from 4x GPUs using x16 lanes per GPU concurrently. It is limitted by the 32x lanes to the switch.

However, if the software schedules the data transfers so that only two concurrent READs and two concurrent WRITEs take place in parallel (at most), the algorithm will take advantage of the PCI-E switch.

Note that NVIDIA’s recent drivers for GTX graphics cards are unstable in Windows 10 since August 2017, causing BSOD. The latest stable drivers for 3 or 4 GPUs on motherboards featuring the PLX PEX8747 switch is version 382.53. This is a know issue and NVIDIA does not seem interested in fixing it.>>

(Haider Alwasiti) #80

I knew nothing about it either, but I learned that it works across multiple nodes as well as across multiple GPUs on a single node. The other difference seems to be that it isolates the communication between GPUs better, which makes it faster than the torch.nn.DataParallel(learn.model) method. But then we need a module that collects data from and feeds data to those GPUs, instead of dedicating one of the GPUs to that job as torch.nn.DataParallel(learn.model) does, and hence it needs a bit more coding. Take all of this with a grain of salt, since I haven’t tried it; it is my speculation based on what I read about Uber’s Horovod and Jeremy’s tweet.

It seems it has already been implemented in fastai, but documentation and an example are still lacking. Coming soon, I guess.

Yes, but with a maximum of only 2 GPUs connected, which I think is pointless, since PCIe is already good enough for 2 GPUs in parallel.

(Brian) #81

@hwasiti so the only change needed to use multiple GPUs is to the learn.model lines?

(Haider Alwasiti) #82

Yes. After creating the learn object, add this line:
learn.model = torch.nn.DataParallel(learn.model, device_ids=[0, 1])
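
One practical consequence of that line: DataParallel splits each input batch along the first (batch) dimension across the listed devices, so every GPU only sees a slice of the batch. The sketch below estimates the per-GPU slice sizes without needing torch or a GPU. It assumes the split behaves like `torch.chunk` (equal chunks of size ceil(batch / n), with a smaller last chunk), which is my understanding and not something stated in this thread; the function name is made up.

```python
import math


def per_gpu_batch_sizes(batch_size, n_gpus):
    """Approximate slice of the batch each GPU receives under DataParallel."""
    chunk = math.ceil(batch_size / n_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(per_gpu_batch_sizes(64, 2))  # device_ids=[0, 1]
print(per_gpu_batch_sizes(64, 8))  # device_ids=[0, ..., 7]
print(per_gpu_batch_sizes(64, 3))  # uneven split
```

If the batch size that worked well on one GPU is kept unchanged, each card ends up with a much smaller batch, so it is common to scale the batch size up with the number of GPUs when testing.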

(Haider Alwasiti) #83

I have edited my answer above; I had forgotten the learn.model = part.