How to use Multiple GPUs?

Or this way:

torch.cuda.set_device(gpu_index)

For example, if you want to use GPU number 2:

torch.cuda.set_device(2)
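A related sketch (not from the original post): you can also pin work to a specific card through a torch.device object instead of calling set_device, assuming at least three visible GPUs.

import torch

# Address the third visible GPU directly via a device object.
device = torch.device('cuda:2')

x = torch.randn(8, 3, 32, 32, device=device)   # tensor allocated on GPU 2
print(x.device)                                 # -> cuda:2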

I have three GPUs and have been trying to set up my environment so that one of them is dedicated to my monitor and the others are available for deep learning. What was confusing is the difference in the ordering of the GPUs in nvidia-smi and torch.cuda.get_device_name. Here is the output of both:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.34                 Driver Version: 387.34                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0  On |                  N/A |
| 23%   38C    P8    17W / 250W |    811MiB / 12188MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Graphics Device     Off  | 00000000:03:00.0 Off |                  N/A |
| 34%   49C    P8    26W / 250W |      0MiB / 12058MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Graphics Device     Off  | 00000000:04:00.0 Off |                  N/A |
| 28%   40C    P8    24W / 250W |      0MiB / 12058MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
>>> torch.cuda.get_device_name(0)
'Graphics Device'
>>> torch.cuda.get_device_name(1)
'TITAN Xp'
>>> torch.cuda.get_device_name(2)
'Graphics Device'

I had expected the device numbers to be consistent across these applications. (This matters because I have one Titan Xp and two Titan Vs and I want the Xp to always be dedicated to the monitor.)

After help from the folks on the PyTorch forum, I needed to adjust two environment settings:

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1,2

Now GPU 0 (the Titan Xp, as opposed to the Titan Vs) is reserved for the monitor, and it never shows up as available within torch.

torch.cuda.device_count() returns 2 instead of 3.
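With those two variables exported before launching Python, a quick sanity check looks roughly like this (a sketch based on the output above):

>>> import torch
>>> torch.cuda.device_count()      # only the two Titan Vs are visible
2
>>> torch.cuda.get_device_name(0)  # cuda:0 now maps to the card at bus 03:00.0
'Graphics Device'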

This took me quite a bit of time to figure out, so I thought I would share here.


Technically you will not see an improvement in speed by using DataParallel unless you increase the batch size. DataParallel lets you scale your batch size to batch_size * num_gpus; that is the only way you will end up seeing a performance improvement, since DataParallel simply takes a batch, splits it into N chunks (one per GPU), and trains each chunk on a replica of the model.

If you have 4 GPUs you might want to use 3 for training and 1 for debugging/testing; that way you get multi-GPU training but still have one GPU free while your model trains.
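A minimal sketch of that setup, assuming 4 visible GPUs; the model and the per-GPU batch size are placeholders:

import torch
import torch.nn as nn
import torchvision.models as models

train_gpus = [0, 1, 2]                        # 3 GPUs for training, GPU 3 stays free
model = models.resnet34().cuda(train_gpus[0])
model = nn.DataParallel(model, device_ids=train_gpus)

# Scale the batch size with the number of training GPUs,
# e.g. 32 per GPU -> 96 in total; DataParallel splits it into 3 chunks.
batch = torch.randn(32 * len(train_gpus), 3, 224, 224).cuda(train_gpus[0])
out = model(batch)                            # each GPU processes a 32-sample chunk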


I have used multiple GPUs when working on the sp-society-camera-model-identification Kaggle competition. This competition used 512x512 images, so the extra memory from using 3 GPUs allowed me to go from a batch size of 12 to 36.

The code I use to activate multiple GPUs is:

learn.models.model = torch.nn.DataParallel(learn.models.model, device_ids=[0, 1, 3])

I have 4 GPUs: three 1080 Tis and one 1050 Ti that I use for my display. The IDs I pass in the argument are the device IDs from nvidia-smi. I have found that my training time is cut in half when going from 1 GPU to 3 GPUs, so the scaling is not ideal, but it is a notable speedup. The dataloader needs to be fast enough to keep ahead of the GPUs, and I had to scale my batch size linearly with the number of GPUs. The other thing to watch out for is that weights saved while using 3 GPUs cannot be loaded back into a training session with 1 GPU. I have read a few suggestions on how to accomplish that, but I have not tried any of them yet.
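One commonly suggested fix for the save/load issue (a sketch, not something the poster above verified): DataParallel wraps the original network in a .module attribute, so the saved state_dict keys get a 'module.' prefix. Saving the unwrapped module, or stripping the prefix when loading, lets a single-GPU model use the same weights. Here model is the DataParallel-wrapped model and single_gpu_model is a fresh, unwrapped instance; both names and the file names are placeholders.

import torch

# Option 1: save the unwrapped module so the checkpoint loads on a single GPU
torch.save(model.module.state_dict(), 'weights.pth')

# Option 2: strip the 'module.' prefix from a checkpoint saved from the wrapped model
state = torch.load('multi_gpu_weights.pth', map_location='cpu')
state = {k.replace('module.', '', 1): v for k, v in state.items()}
single_gpu_model.load_state_dict(state)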


I get this error:
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:88

Do you know how to fix that?
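That error typically means the device index passed to torch is higher than the number of GPUs visible to the process (for example after restricting CUDA_VISIBLE_DEVICES). A quick check, as a sketch:

import torch

n = torch.cuda.device_count()
print('visible GPUs:', n)

gpu_index = 2                # whatever index you were trying to use
if gpu_index < n:
    torch.cuda.set_device(gpu_index)
else:
    print('index', gpu_index, 'is out of range; valid indices are 0 to', n - 1)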

Just wanted to confirm Prarieguy's experience here. The order of devices in nvidia-smi does not match that seen by torch.cuda. The code I use to specify 1 GPU for 1 Jupyter notebook (NOT using multiple GPUs for 1 notebook) follows:

torch.cuda.set_device(1)
print('Current cuda device:', torch.cuda.current_device())
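An alternative sketch for the one-GPU-per-notebook case: hide the other cards from the process entirely by setting CUDA_VISIBLE_DEVICES before torch initializes CUDA, so the notebook only ever sees one device.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'   # must be set before CUDA is initialized

import torch
print(torch.cuda.device_count())           # -> 1; that card is now cuda:0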


That would be amazing. Please someone do this. (and then make a tutorial at my level please :smiley: )


Having a similar problem with DataParallel in fastai. Still troubleshooting it. Just out of curiosity, how did you figure out it was a power problem?

My problem with multiple GPUs turned out to be a power problem which took me forever to debug. The issue was that whenever I cranked up dogs-vs-cats, I would get strange system crashes. After multiple re-installs of Ubuntu and of fastai, and switching GPUs between different PCI slots and different machines, I finally narrowed it down to the power supply. It just couldn't supply the requisite power when multiple GPUs were all fully engaged. So now I just use that machine with a single GPU and all is good. (My other machine, with a more capable PSU, is able to handle three GPUs with no problem.)

Not directly related to this problem, but I also had problems running multiple machines (each with multiple GPUs). Here the issue was pulling too many amps and blowing the circuit breaker.

Not sure if this helps, but if nothing else, don't assume every problem is software...

Thanks. I have a gold-rated 750W PSU for my X299 SLI Plus motherboard with 64GB DDR4 and a Core i7 7800, plus two ASUS GTX 1080s. The machine crashes and reboots during training with both GPUs engaged simultaneously; it works fine with only 1 GPU. I may need a larger PSU. Any thoughts?

That is exactly what was happening with me. One GPU worked, two failed. I now use it with one. It seemed too complicated to change the PSU.

Anyone here who has compared performance and practical use of 2x GTX 1070 vs 1x GTX 1080 Ti?
On paper you get:
8GB (x2) vs 11GB RAM
1920 (x2) vs 3584 CUDA cores
256 (x2) vs 484 GB/s bandwidth.

For a single model, a roughly 50% speedup from scaling to 2 GPUs would put 2x 1070 on par with 1x 1080 Ti (using Tim Dettmers' rough performance metrics).

@avn3r just to be clear: if you increase the batch size to batch_size * num_gpus, would you be able to combine the memory of both GPUs (i.e. 16GB for 2x 1070)?

Does multi-GPU in PyTorch work better for some networks (e.g. convnets) than for others, say RNNs (e.g. see Tim Dettmers' posts)?

If you have room in your box you can just use a second, smaller PSU for the second GPU only; it may be cheaper to do that than to buy a 1000W+ PSU.

It may have been mentioned somewhere here already, but PCPartPicker will calculate the power use for your current parts; the ballpark I've read is to add a 100W buffer on top of that for safety.

According to PCPartPicker the power requirement is 626W and I've got a 750W PSU. Guess an extra 124W isn't enough.

Personally I've gone with a gold+ rated PSU with some wattage to spare in case I add more parts.

Just got a reply back from ASUS:

After reviewing your email I understand that whenever you try to run both of your graphics cards at the same time your PC crashes. I'm sorry to hear you are experiencing this issue and thank you for letting us know about this. It will be my pleasure to assist you in resolving it.
I appreciate the details that you have sent out. As I check you are using a MSI motherboard, the best recommendation I can give you is to contact MSI for the power requirement of the power supply that you must use for your system to be stable.
Because as I check with our ASUS motherboard if you are a using a high-end PCIe devices like the TURBO-GTX1080-8G you must use a 1000w power supply or higher to assure the stability of your computer. Also I am very sorry to tell you but you cannot throttle back the power requirements using NVIDIA-SMI because the cards need power for it to work properly and to assure its stability.

What temperatures do you get per card when both are running? (Or has the PSU issue prevented running both for any significant time?) I was running a 1080 Ti and a 1070 in the same box and was getting 80-90 degrees C on the GPUs (a couple of mm apart; NB these were 'open' style with 3 fans, not FE turbo-shrouded). Alone, the 1080 Ti sits at around 70 degrees. Now I am in the process of watercooling, which is a pretty time-consuming process.

Use MSI Afterburner if you are using Windows to control the temperature of the NVIDIA cards. Don't let them go above 70C or your cards will fry.

In Ubuntu you can use the NVIDIA tools to set the fan parameters.

nvidia-smi is a great tool to see the numbers (Win 10 and Ubuntu).
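A small sketch for keeping an eye on those numbers from a script (the query fields are standard nvidia-smi options; the 5-second polling interval is arbitrary):

import subprocess
import time

# Poll nvidia-smi for per-GPU temperature and power draw every 5 seconds.
while True:
    out = subprocess.run(
        ['nvidia-smi', '--query-gpu=index,temperature.gpu,power.draw',
         '--format=csv,noheader'],
        capture_output=True, text=True)
    print(out.stdout.strip())
    time.sleep(5)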

Working at last with 2 GPUs; 1000 watts is the answer.

Sorry for the late reply, but I've been busy ripping out my 750W Platinum PSU and replacing it with a 1000W Gold PSU. As for the temps, the GPUs run as hot as 81C and draw as much as 187W of power each. The Enhance.ipynb notebook is now running fine with both GTX 1080s engaged.

Interesting. But how do you make it work? I foresee a couple of problems:

You'd have to have the second PSU start as the system starts, but the power-on signal comes from the motherboard, which is connected only to the main PSU.