How to use Multiple GPUs?

I just tried wrapping various parts of the fastai library code in nn.DataParallel and didn’t have any luck.

The last thing I tried was to modify an attribute of learner from the LSTM notebook,

learner = md.get_model(...)
learner.models.model = nn.DataParallel(learner.models.model)
learner.fit(...)

That results in the error:

AttributeError: 'RNN_Encoder' object has no attribute 'hidden'

due to this line in RNN_Encoder.forward():

raw_output, new_h = rnn(raw_output, self.hidden[l])

It seems trivial according to the Pytorch tutorial but I couldn’t figure out how to add it. Maybe someone smarter than me can!
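
For reference, here is a toy sketch (not the actual fastai code) of what I suspect is going on: nn.DataParallel only delegates forward(), so a custom method like reset(), which is what creates self.hidden, is no longer found on the wrapped model, and the underlying module has to be reached via .module. The ToyRNN class below is purely illustrative.

import torch
import torch.nn as nn

# Toy stand-in for fastai's RNN_Encoder: `hidden` only exists after reset() runs
class ToyRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(10, 10, batch_first=True)

    def reset(self, bs=1):
        self.hidden = torch.zeros(1, bs, 10)

    def forward(self, x):
        out, self.hidden = self.rnn(x, self.hidden)  # fails if reset() was never called
        return out

model = nn.DataParallel(ToyRNN())
print(hasattr(model, 'reset'))   # False -> callers that look for reset() will skip it
model.module.reset(bs=4)         # the original module is still reachable via .module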

3 Likes

After today’s class hopefully you’ll have enough information to do it :slight_smile:

5 Likes

Out of curiosity, what temperatures do you see on your cards while running a "fit epoch", according to nvidia-smi? In my rig, the card in slot 1, which drives the monitor, sits at ~40C, while the card in slot 5 that is running the code reaches just over 80C; it quickly drops below 60C when the run finishes, and both cards equalize after about 2 minutes.

On my system (a single 1080 Ti), the GPU reached a max of 82C, probably at close to 100% utilization.

1 Like

Problems with Multiple GPUs

  1. I've been having trouble with multiple GPUs.
  2. I am working through Dogs and Cats (version 2) with the fastai library around PyTorch.
  3. I am using the Paperspace system configuration, i.e. Ubuntu 16.04.
  4. I have two rigs. One has a single Titan Xp card and the other has two Titan Xp cards.
  5. I am working through fastai/courses/dl1/lesson1.ipynb.
  6. The code block that I am focusing on is:
arch=resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 3)
  7. Everything works as expected with the single-GPU rig.
  8. The dual-GPU rig crashes hard. It undergoes a complete reboot.
  9. The crash is reproducible. I have done a complete reinstall of the rig using the Paperspace configuration and it continues to crash.
  10. Since the primary difference between the two rigs is the number of Titan cards, I can only assume that the multiple cards are the problem.
  11. I have used the dual-GPU rig for other purposes and haven't had any problems, so it doesn't strike me as a hardware issue.
  12. To narrow this down further, I would like to find a way to use only one card and see if that fixes the problem.
  13. I have gone through the fastai code and the underlying torch.cuda code looking for a parameter to select only one card, but haven't been able to find it.
  14. I haven't done so yet, but I plan to redo Dogs and Cats using v1 of the course (with Keras) to see what happens.

If anyone could suggest how to work within the current code base and limit torch (or python or the env) to using one card, I would appreciate the help.

Other ideas would also be appreciated.

Finally, for others trying to debug errors: I found it useful to remove not only the 'data/dogscats/tmp/' directory but also the '~/.torch/model' directory. Apparently the resnet34 model file, i.e. 'resnet34-333f7ec4.pth', had become corrupted, and I couldn't replicate the bug until I had removed it and forced a fresh download by PyTorch.

UPDATE: I was able to set the environment variable CUDA_VISIBLE_DEVICES to selectively use one GPU card or the other. Even when limiting myself to one GPU, the computer still crashes when executing the code above. If it were a hardware problem with a card, I would have expected the crash to follow one card or the other.

RESOLUTION: After considerable time debugging this problem, I am embarrassed to say that it was indeed a hardware problem. Even though I had used the rig with both GPUs for considerable periods of time (crypto mining) with no problem, apparently kicking off the deep learning run created a power surge that forced a restart. Plugging the computer into another circuit solved the problem. Since this has nothing to do with deep learning, I was going to delete this post, but after all the time spent on it, perhaps it will help someone else who is stuck.

4 Likes

Wondering how to set fastai to use a different GPU (if you have more than one)?
use:
os.environ['CUDA_VISIBLE_DEVICES'] = 'n' in your notebook
where n is the GPU index (starting with zero for the first card)

this must be done BEFORE you import the fastai library.

e.g.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
from fastai.conv_learner import *

2 Likes

Just pass an integer when you call it, e.g. model.cuda(0)
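
For a plain PyTorch module outside fastai, that looks roughly like this (a toy sketch; the Linear layer and tensor are just placeholders, and it assumes at least two GPUs):

import torch
import torch.nn as nn

device_id = 1                               # index as torch.cuda sees it
model = nn.Linear(10, 2).cuda(device_id)    # pin a toy module to the second GPU
x = torch.randn(4, 10).cuda(device_id)      # inputs must live on the same device
y = model(x)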

All,
So I know that in the past I actually had the ConvLearner running on my dual-GPU system, utilizing both cards. As a technical exercise, and simply for the heck of it, I'll give it another go. I also know that I used PyTorch's DataParallel module to do so.

I will try to ensure compatibility for all executable actions in the current library. @jeremy Would you be amenable to merging such a change?

Thnx.

1 Like

Absolutely! :smiley:

1 Like

I gave this a shot … I wrapped self.model with nn.DataParallel(…) in ConvnetBuilder and began running lesson-1.ipynb. Kinda surprising that things just worked out of the box. I had to make significant changes when I first attempted this way back.

class ConvnetBuilder():
    """Class representing a convolutional network...
    """
    def __init__(self, f, c, is_multi, is_reg, ps=None, xtra_fc=None, xtra_cut=0):
        ...  # (rest of __init__ unchanged)
        if f in model_meta: cut,self.lr_cut = model_meta[f]
        self.top_model = nn.DataParallel(nn.Sequential(*layers))  # <---- (first attempt)

However, the runtimes are terrible: each epoch actually took around twice as long to finish! I could also see that my CPUs were slightly less utilized when using parallelism. Could it be some kind of starvation in how the CPUs feed images to the GPUs?

Not that it's critical! If any pointers come to mind immediately, I can look into them… If not, I'll do some other exploration.
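
One cheap thing to check first (a sketch, assuming the from_paths in your fastai version accepts a num_workers argument) is whether giving the dataloader more worker processes changes the picture:

data = ImageClassifierData.from_paths(PATH, bs=64, num_workers=8,
                                      tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=False)  # full model, so the loader matters
learn.fit(0.01, 1)   # compare epoch times with fewer/more workers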

I’ve not had luck improving performance with multiple GPUs either (not as bad results as you saw, but no faster than single GPU). I haven’t looked closely into it. I’d be interested to hear if you find out anything. Perhaps on the pytorch forums?

Puzzling phenomenon

When I run the following code without any other jobs running, it is significantly slower than when the GPU is running another process. (Specifically, when it is under heavy load running crypto-mining software.) I have repeated the trials numerous times to make sure there were no differences in pre-computing or caching. Moreover, I have tested this off and on over several weeks with the same result, and I have used nvidia-smi to verify what jobs are running on the GPU. Here are the times:

  • Time with no load: 45 seconds

  • Time with load: 20 seconds

arch=resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 5)

This really doesn’t make sense to me.

EDIT: I was wondering if someone could let me know how long the above code runs for them. (I am running this on a Titan V, though I’ve tested on a Titan X and it’s about the same.) This is right out of Lesson1. Note that in learn.fit(0.01, 5), I am running 5 epochs.

I've also experienced a similar issue when I used nn.DataParallel to run on 4 GPUs; it didn't seem to help much with training time. So I increased batch sizes to the point where all the GPUs' memory was almost full, to take full advantage of it, since there might be some bottleneck in splitting data and copying modules to all the GPUs. I'm still not sure whether it's worth running nn.DataParallel. It's probably better to stick with what Jeremy suggests and utilize the GPUs for running different experiments.

Maybe plotting performance against different batch sizes would give some clue about the bottlenecks.
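
Something like the following (a rough sketch, reusing the lesson-1 variables arch, PATH and sz from earlier in the thread) would give a crude picture of epoch time versus batch size:

import time

for bs in [32, 64, 128, 256]:
    data = ImageClassifierData.from_paths(PATH, bs=bs, tfms=tfms_from_model(arch, sz))
    learn = ConvLearner.pretrained(arch, data, precompute=False)
    start = time.time()
    learn.fit(0.01, 1)
    print('bs=%d: %.1fs per epoch' % (bs, time.time() - start))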

It's definitely possible to get nearly linear scaling with more GPUs - I just haven't looked into how to make that work. But plenty of folks have published results showing that.

Or this way:
torch.cuda.set_device(gpu_index)
For example, if you want to use GPU number 2:
torch.cuda.set_device(2)

I have three GPUs and have been trying to set up my environment so that one of them is dedicated to my monitor and the others are available for deep learning. What was confusing is the difference in the ordering of the GPUs between nvidia-smi and torch.cuda.get_device_name. Here is the output of both:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.34                 Driver Version: 387.34                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0  On |                  N/A |
| 23%   38C    P8    17W / 250W |    811MiB / 12188MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Graphics Device     Off  | 00000000:03:00.0 Off |                  N/A |
| 34%   49C    P8    26W / 250W |      0MiB / 12058MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Graphics Device     Off  | 00000000:04:00.0 Off |                  N/A |
| 28%   40C    P8    24W / 250W |      0MiB / 12058MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
>>> torch.cuda.get_device_name(0)
'Graphics Device'
>>> torch.cuda.get_device_name(1)
'TITAN Xp'
>>> torch.cuda.get_device_name(2)
'Graphics Device'

I had expected the device numbers to be consistent across these applications. (This matters because I have one Titan Xp and two Titan Vs, and I want the Xp to always be dedicated to the monitor.)

After help from the folks on the PyTorch forum, I needed to adjust two environment settings:

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1,2

Now GPU 0 (the Titan Xp, as opposed to the Titan Vs) is reserved for the monitor; it never shows up as available within torch.

torch.cuda.device_count() returns 2 instead of 3.

This took me quite a bit of time to figure out, so I thought I would share here.
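
For anyone else checking the mapping, a minimal sketch: both variables have to be set before CUDA is initialised, i.e. before torch (or fastai) is imported in that process.

import os
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'   # number GPUs by bus ID, matching nvidia-smi
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2'       # hide the display card

import torch
print(torch.cuda.device_count())                 # 2
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))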

2 Likes

Technically you will not see an improvement in speed from using DataParallel unless you increase the batch size. DataParallel allows you to increase your batch size to batch_size * num_gpus. That's the only way you will end up seeing performance improvements, since DataParallel simply takes a batch, splits it into N chunks (based on the number of GPUs), and then runs each chunk through a replica of the model (see the sketch after this post).

If you have 4 GPUs you might want to use 3 for training and 1 for debugging/testing; that way you can do multi-GPU training but still have one GPU free while your model trains.
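
A toy sketch of that splitting behaviour in plain PyTorch (assumes 3 GPUs; the Linear layer is just a placeholder model):

import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(100, 10), device_ids=[0, 1, 2]).cuda()
x = torch.randn(64 * 3, 100).cuda()   # scale the batch with the GPU count...
y = model(x)                          # ...so each of the 3 replicas still sees a batch of 64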

3 Likes

I have used multiple GPUs when working on the sp-society-camera-model-identification Kaggle competition. That competition used 512x512 images, so the extra RAM from using 3 GPUs allowed me to go from a batch size of 12 to 36.

The code I use to activate multiple GPUs is:
learn.models.model = torch.nn.DataParallel(learn.models.model, device_ids=[0, 1, 3])
I have 4 GPUs: three 1080 Tis and one 1050 Ti that I use for my display. The IDs I pass in the argument are the device IDs from nvidia-smi. I have found that my training time is cut in half when going from 1 GPU to 3 GPUs, so the scaling is not ideal, but it is a notable speedup. The dataloader needs to be fast enough to keep ahead of the GPUs, and I had to scale my batch size linearly with the number of GPUs. The other thing to watch out for is that when you save your weights using 3 GPUs, you cannot load them back into a training session with 1 GPU. I have read a few suggestions on how to accomplish that (one sketch follows), but I have not tried any of them yet.
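
For what it's worth, the workaround I've seen suggested most often (untested here; the file name and single_gpu_model are placeholders) is to strip the 'module.' prefix that DataParallel adds to every key in the saved state dict before loading it into an unwrapped model:

import torch

state = torch.load('multi_gpu_weights.pth', map_location='cpu')   # placeholder file name
state = {k[len('module.'):] if k.startswith('module.') else k: v
         for k, v in state.items()}
single_gpu_model.load_state_dict(state)   # placeholder for the unwrapped (single-GPU) model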

5 Likes

I get this error:
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:88

Do you know how to fix that?
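
That error usually means the device index is outside the range of GPUs visible to the process. Note that CUDA_VISIBLE_DEVICES renumbers the visible cards starting from 0, so after setting CUDA_VISIBLE_DEVICES=1,2 the valid ordinals are 0 and 1, not 1 and 2. A quick sanity check:

import torch

print(torch.cuda.device_count())                      # how many GPUs this process can see
torch.cuda.set_device(torch.cuda.device_count() - 1)  # the highest valid ordinal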

Just wanted to confirm Prarieguy’s experience here. The order of devices in nvidia-smi does not match that seen by torch.cuda. The code I use to specify 1 GPU for 1 Jupyter notebook (NOT using multiple GPUs for 1 notebook) follows:

torch.cuda.set_device(1)
print ('Current cuda device ', torch.cuda.current_device())

1 Like