or this way:
torch.cuda.set_device(gpu_index)
For example, if you want to use GPU number 2:
torch.cuda.set_device(2)
I have three GPUs and have been trying to set up my environment so that one GPU is dedicated to my monitor and the others are available for deep learning. What was confusing is the difference in the ordering of the GPUs in nvidia-smi
and torch.cuda.get_device_name
Here is the output of both:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.34 Driver Version: 387.34 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:02:00.0 On | N/A |
| 23% 38C P8 17W / 250W | 811MiB / 12188MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Graphics Device Off | 00000000:03:00.0 Off | N/A |
| 34% 49C P8 26W / 250W | 0MiB / 12058MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Graphics Device Off | 00000000:04:00.0 Off | N/A |
| 28% 40C P8 24W / 250W | 0MiB / 12058MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
>>> torch.cuda.get_device_name(0)
'Graphics Device'
>>> torch.cuda.get_device_name(1)
'TITAN Xp'
>>> torch.cuda.get_device_name(2)
'Graphics Device'
I had expected the device numbers to be consistent across these applications. (This matters because I have one Titan Xp and two Titan Vs, and I want the Xp to always be dedicated to the monitor.)
After help from the folks on the PyTorch forum, I needed to adjust two environment settings:
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1,2
Now, GPU 0 is reserved for the monitor. The Titan Xp (as opposed to the Titan Vs) is always used for my monitor and never shows up as being available within torch:
torch.cuda.device_count()
now returns 2 instead of 3.
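The same two variables can also be set from Python with os.environ, as long as they are set before torch initializes CUDA. A minimal sketch (the "1,2" value matches my setup, so adjust the indices for yours):

```python
import os

# Must be set before torch initializes CUDA, i.e. before `import torch`.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # order devices like nvidia-smi
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"       # hide device 0 (the display GPU)

# import torch  # torch would now report 2 devices, renumbered cuda:0 and cuda:1
```

Setting them in the shell with export has the same effect; the Python version is handy inside notebooks, provided the cell runs before any torch import.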
This took me quite a bit of time to figure out, so I thought I would share here.
Technically you will not see an improvement in speed by using DataParallel
unless you also increase the batch size. DataParallel allows you to scale your effective batch size to batch_size * num_gpus. That's the only way you will end up seeing performance improvements, since DataParallel simply takes a batch, splits it into N chunks (based on the number of GPUs), and runs each chunk through a replica of the same model.
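The chunking behaviour can be sketched in plain Python; this is a simplified stand-in for what DataParallel's scatter step does along the batch dimension, not the actual implementation:

```python
def split_batch(batch, num_gpus):
    # DataParallel scatters a batch into num_gpus chunks along dim 0;
    # each GPU runs a replica of the model on its own chunk.
    chunk = (len(batch) + num_gpus - 1) // num_gpus
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]

split_batch(list(range(12)), 3)  # three chunks of four samples each
```

This is why scaling batch_size with num_gpus matters: with the original batch size, each GPU just sees a smaller chunk and you pay the synchronization overhead for nothing.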
If you have 4 GPUs you might want to use 3 for training and 1 for debugging/testing; that way you can do multi-GPU training but still have one GPU free while your model trains.
I have used multiple GPUs when working on the sp-society-camera-model-identification Kaggle competition. This competition used 512x512 images, so the extra RAM from using 3 GPUs allowed me to go from a batch size of 12 to 36.
The code I use to activate multiple GPUs is:
learn.models.model = torch.nn.DataParallel(learn.models.model,device_ids=[0, 1, 3])
I have 4 GPUs - three 1080 Tis and one 1050 Ti that I use for my display. The IDs I pass in the argument are the device IDs from nvidia-smi. I have found that my training time is cut in half when going from 1 GPU to 3 GPUs, so the scaling is not ideal, but it is a notable speedup. The dataloader needs to be fast enough to keep ahead of the GPUs, and I had to scale my batch size linearly with the number of GPUs. The other thing to watch out for is that when you save your weights using 3 GPUs, you cannot load them back directly into a training session with 1 GPU. I have read a few suggestions on how to accomplish it, but I have not tried any of them yet.
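One commonly suggested fix for loading DataParallel-saved weights on a single GPU is stripping the "module." prefix that the wrapper adds to every state_dict key. A sketch of that approach (I have not verified it against fastai's own save format):

```python
from collections import OrderedDict

def strip_module_prefix(state_dict):
    # torch.nn.DataParallel wraps the model in `.module`, so saved keys look
    # like "module.conv1.weight"; strip the prefix to load into a bare model.
    return OrderedDict(
        (k[len("module."):] if k.startswith("module.") else k, v)
        for k, v in state_dict.items()
    )
```

You would then call something like model.load_state_dict(strip_module_prefix(torch.load(path))) on the single-GPU model.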
I get error:
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:88
Do you know how to fix that?
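That error usually means the index passed to torch.cuda.set_device does not exist from torch's point of view. One gotcha: CUDA_VISIBLE_DEVICES renumbers devices, so after CUDA_VISIBLE_DEVICES=1,2 the valid ordinals are 0 and 1, not 1 and 2. A small guard sketch (safe_set_device is a hypothetical helper, not a torch API):

```python
import torch

def safe_set_device(idx):
    # "invalid device ordinal" means idx is outside the visible range.
    # Valid ordinals are 0 .. torch.cuda.device_count() - 1, counted
    # *after* any CUDA_VISIBLE_DEVICES filtering.
    n = torch.cuda.device_count()
    if idx >= n:
        raise ValueError(f"GPU {idx} not visible; {n} device(s) available")
    torch.cuda.set_device(idx)
```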
Just wanted to confirm Prarieguy's experience here. The order of devices in nvidia-smi does not match that seen by torch.cuda. The code I use to specify 1 GPU for 1 Jupyter notebook (NOT using multiple GPUs for 1 notebook) follows:
torch.cuda.set_device(1)
print('Current cuda device:', torch.cuda.current_device())
That would be amazing. Please, someone do this (and then make a tutorial at my level, please).
Having a similar problem with DataParallel in fastai; still troubleshooting it. Just out of curiosity, how did you figure out it was a power problem?
My problem with multiple GPUs turned out to be a power problem which took me forever to debug. The issue was that whenever I cranked up dogs-vs-cats, I would get strange system crashes. After multiple re-installs of Ubuntu and of fastai, switching GPUs between different PCI slots and between different machines, I finally narrowed it down to the power supply. It just couldn't supply the requisite power when multiple GPUs were all fully engaged. So now I just use that machine with a single GPU and all is good. (My other machine, with a more capable PSU, is able to handle three GPUs with no problem.)
Not directly related to this problem, but I also had problems running multiple machines (each with multiple GPUs). Here the issue was pulling too many amps and blowing the circuit breaker.
Not sure if this helps, but if nothing else, don't assume every problem is software...
Thanks. I have a gold-rated 750 Watt PSU for my SLI X299 motherboard with 64GB DDR4 and a Core i7 7800, plus two ASUS GTX 1080s. The machine crashes and reboots during training with both GPUs running simultaneously; it works fine with only 1 GPU. I may need a larger PSU. Any thoughts?
That is exactly what was happening with me. One gpu worked, two failed. I now use it with one. It seemed too complicated to change the PSU.
Anyone here who has compared performance and practical use of 2x GTX 1070 vs 1x GTX 1080 Ti?
On paper you get:
- 8 GB (x2) vs 11 GB RAM
- 1920 (x2) vs 3584 processors
- 256 (x2) vs 484 GB/s bandwidth
For a single model, a c. 50% speedup scaling to 2x GPUs would put 2x 1070 on par with 1x 1080 Ti (using T. Dettmers' rough performance metrics).
@avn3r just to be clear: if you increase the batch size to batch_size * num_gpus, you'd be able to combine the memory of both GPUs (i.e. 16 GB for 2x 1070)?
Does multi-GPU in pytorch work better for some networks (e.g. convnets) vs, say, RNNs (e.g. see Tim Dettmers' posts)?
If you have room in your box, you can just use a second, smaller PSU for the second GPU only; that may be cheaper than buying a 1000W+ PSU.
It may have been mentioned somewhere here already, but PCPartPicker will let you calculate the power use for your current parts; the ballpark figure I've read is to add a 100W buffer to this for safety.
According to PCPartPicker the power requirements are 626W and I've got a 750W PSU. Guess an extra 124W isn't enough.
Personally I've gone with a gold+ rated PSU with some wattage to spare in case I add more parts.
Just got a reply back from ASUS:
After reviewing your email I understand that whenever you try to run both of your graphics cards at the same time your PC crashes. I'm sorry to hear you are experiencing this issue and thank you for letting us know about this. It will be my pleasure to assist you in resolving it.
I appreciate the details that you have sent out. As I check you are using a MSI motherboard, the best recommendation I can give you is to contact MSI for the power requirement of the power supply that you must use for your system to be stable.
Because as I check with our ASUS motherboard if you are a using a high-end PCIe devices like the TURBO-GTX1080-8G you must use a 1000w power supply or higher to assure the stability of your computer. Also I am very sorry to tell you but you cannot throttle back the power requirements using NVIDIA-SMI because the cards need power for it to work properly and to assure its stability.
What temperatures do you get per card when both are running? (Or has the PSU issue prevented running both for any significant time?) I was running a 1080 Ti and a 1070 in the same box and was getting 80-90 degrees C on the GPUs (a couple of mm apart; NB these were "open" style with 3 fans, not FE turbo/shrouded). Alone, the 1080 Ti sits at around 70 degrees. Now I am in the process of watercooling, which is a pretty time-consuming process.
Use MSI Afterburner if you are using Windows to control the temperature of the NVIDIA cards.
Don't let them go above 70 C or your cards will fry.
In Ubuntu you can use the NVIDIA tools to set the fan parameters.
nvidia-smi is a great tool to see the numbers (Win 10 and Ubuntu).
Working at last with 2 GPUs - 1000 Watts is the answer.
Sorry for the late reply, but I've been busy ripping out my 750W Platinum PSU and replacing it with a 1000W Gold PSU. As for the temps, the GPUs run as hot as 81C and draw as much as 187W of power each. The Enhance.ipynb notebook is now running fine with both GTX 1080s engaged.
Interesting. But how do you make it work? I foresee a couple of problems:
You've got to have the second PSU starting as the system starts, but the power-on signal comes from the motherboard, which is connected only to the main PSU.