How to use Multiple GPUs?


Is there any simple flag in the fastai / PyTorch library to use multiple GPUs?



Not in fastai. There’s DataParallel in pytorch, but we don’t currently support it.


Darn, I was wondering this as well.

I saw DataParallel. If we went through and added it to the library, would it offer a significant boost for those of us with our own DL rigs? I assume it isn’t a priority for those using AWS, Crestle and Paperspace.

Personally I don’t think it’s that useful - I’ve never found it’s helped me. Reason is: I always have multiple experiments I want to run, and I also want a spare GPU for doing interactive stuff. My main rig has 4 GPUs. So I can run 3 experiments plus an interactive notebook.


Makes sense.

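For others, here is the code to check which GPU you are using and set it to a different one in PyTorch:

import torch
torch.cuda.device_count()    # See how many devices are around
torch.cuda.set_device(1)     # Set it to a particular device (index 1 is just an example)
torch.cuda.current_device()  # Check which device you are on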

Have you tried using ‘DataParallel’ in one of the fastai notebooks?

I just tried wrapping various parts of the fastai library code in nn.DataParallel and didn’t have any luck.

The last thing I tried was to modify an attribute of learner from the LSTM notebook,

learner = md.get_model(...)
learner.models.model = nn.DataParallel(learner.models.model)

That results in the error:

AttributeError: 'RNN_Encoder' object has no attribute 'hidden'

due to this line in RNN_Encoder.forward():

raw_output, new_h = rnn(raw_output, self.hidden[l])

It seems trivial according to the Pytorch tutorial but I couldn’t figure out how to add it. Maybe someone smarter than me can!


After today’s class hopefully you’ll have enough information to do it :slight_smile:


Out of curiosity, what temperatures do you see on your cards while running a “fit epoch”, according to nvidia-smi? In my rig, the card in slot 1, which drives the monitor, sits at ~40C, while the card in slot 5 that is running the code reaches just over 80C while running. It quickly drops below 60C when finished, and both cards equalize after about 2 minutes.

My system, 1080ti, shows the single GPU reached a max of 82C, probably near 100% utilization.
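
For anyone who wants to log these readings instead of watching nvidia-smi by hand, here is a minimal sketch that polls it from Python. It assumes nvidia-smi is on your PATH and relies on its --query-gpu CSV output; the helper names (parse_gpu_stats, read_gpu_stats) and the sample text are made up for illustration, and a card that reports a field as N/A would need extra handling.

```python
import subprocess

QUERY = "index,temperature.gpu,utilization.gpu"

def parse_gpu_stats(csv_text):
    """Parse 'index, temperature, utilization' CSV lines into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, temp_c, util_pct = (int(field.strip()) for field in line.split(","))
        stats.append({"index": index, "temp_c": temp_c, "util_pct": util_pct})
    return stats

def read_gpu_stats():
    """Ask nvidia-smi for the current temperature and utilization of every GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)

# Made-up sample in the shape nvidia-smi prints with the flags above.
sample = "0, 40, 3\n1, 82, 99\n"
hottest = max(parse_gpu_stats(sample), key=lambda gpu: gpu["temp_c"])
```

Calling read_gpu_stats() in a loop with a short sleep while an epoch runs should give you a temperature trace per card.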


Problems with Multiple GPU’s

  1. I’ve been having trouble with multiple GPUs.
  2. I am working through Dogs and Cats version 2 with the fastai library around PyTorch.
  3. I am using the Paperspace system configuration, i.e. Ubuntu 16.04.
  4. I have two rigs. One has one Titan Xp card and the other has two Titan Xp cards.
  5. I am working through fastai/courses/dl1/Lesson1.ipynb
  6. The code block that I am focusing on is:
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True), 3)
  7. Everything works as expected with the single-GPU rig.
  8. The dual-GPU rig crashes hard. It undergoes a complete reboot.
  9. The crash is replicable. I have done a complete reinstall of the rig using the Paperspace configuration and it continues to crash.
  10. Since the primary difference between the two rigs is the number of Titan V cards, I can only assume that the multiple cards are a problem.
  11. I have used the rig with two GPUs for other purposes and haven’t had any problems, so it doesn’t strike me as a hardware issue.
  12. To narrow this down further, I would like to find a way to use only one card to see if this fixes the problem.
  13. I have gone through the fastai code and the underlying torch.cuda code looking for a parameter to select only one card, but haven’t been able to find it.
  14. I haven’t done so yet, but I plan to redo Dogs and Cats using v1 of the course, with Keras, to see what happens.

If anyone could suggest how to work within the current code base and limit torch (or python or the env) to using one card, I would appreciate the help.

Other ideas would also be appreciated.

Finally, for others trying to debug errors, I found it useful to not only remove the ‘data/dogscats/tmp/’ directory, but also the ‘~/.torch/model’ directory. Apparently, the resnet34 model, i.e., ‘resnet34-333f7ec4.pth’ had become corrupted. I couldn’t replicate the bug until I had removed this and forced a fresh download by Pytorch.

UPDATE: I was able to set the environment variable CUDA_VISIBLE_DEVICES to selectively use only one GPU card or the other. In spite of limiting myself to one GPU, the computer still crashes when executing the code above. If it were a hardware problem with a card, I would have expected the crash to happen with one card or the other, not both.

RESOLUTION: After considerable time debugging this problem, I am embarrassed to say that this was indeed a hardware problem. Even though I had used the rig with both GPUs for considerable periods of time (crypto mining) with no problem, apparently “kicking in” the deep learning algorithm created a power surge that forced a restart. Plugging the computer into another circuit solved the problem. Since this problem has nothing to do with deep learning, I was going to delete this post, but after all the time spent on it, perhaps it will be of help if someone else is stuck.


Wondering how to set fastai to use a different GPU (if you have more than one)?
Set os.environ['CUDA_VISIBLE_DEVICES'] = 'n' in your notebook,
where n is the number of the GPU (starting with zero for the first one).

This must be done BEFORE you import the fastai library.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
from fastai.conv_learner import *
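
The same variable also works for the "one experiment per GPU" approach suggested earlier in the thread. Below is a minimal sketch, not anything built into fastai: it starts each job as its own process with CUDA_VISIBLE_DEVICES set to a single card. The helper names (env_for_gpu, launch_on_gpu) are made up, and the command is a stand-in that only prints the GPU it was given; you would swap in your own training script. Note that inside each child process the chosen card shows up as device 0.

```python
import os
import subprocess
import sys

def env_for_gpu(gpu_index):
    """Copy the current environment, exposing only one GPU to the child process."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env

def launch_on_gpu(command, gpu_index):
    """Start `command` as a separate process pinned to the given GPU."""
    return subprocess.Popen(command, env=env_for_gpu(gpu_index),
                            stdout=subprocess.PIPE, text=True)

# Stand-in for a real training script: it only reports which GPU it was given.
report = [sys.executable, "-c",
          "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]

# Three experiments on GPUs 0-2, leaving a fourth card free for a notebook.
procs = [launch_on_gpu(report, gpu) for gpu in (0, 1, 2)]
seen = [proc.communicate()[0].strip() for proc in procs]
```

Because the variable is set per process, the parent notebook keeps whatever GPU selection it already had.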


Just pass an integer when we do model.cuda(0)

In the past, I actually had just the ConvLearner running on my dual-GPU system, utilizing both cards, and I used PyTorch’s DataParallel module to do so. As a technical exercise, and simply for the heck of it, I’ll give it another go.

I will try to ensure compatibility for all executable actions in the current library. @jeremy Would you be amenable to merging such a change?



Absolutely! :smiley:


I gave this a shot … I wrapped self.model with nn.DataParallel(…) in ConvnetBuilder and began running lesson-1.ipynb. Kinda surprising that things just worked out of the box. I had to make significant changes when I first attempted this way back.

class ConvnetBuilder():
    """Class representing a convolutional network...
    def __init__(self, f, c, is_multi, is_reg, ps=None, xtra_fc=None, xtra_cut=0):
        if f in model_meta: cut,self.lr_cut = model_meta[f]
        self.top_model = nn.DataParallel(nn.Sequential(*layers)) # <----  (First attempt)

However, the runtimes are terrible. The time to finish each epoch actually went up by around a factor of 2!! I could also see that my CPUs were slightly less utilized with parallelism. Could the GPUs be starving because of how the CPUs are feeding images to them?

Not that it’s important! If any pointers come to mind, I can look into it… If not, I’ll do some other explorations.
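
One way to check the starvation theory is to split each epoch into time spent waiting for the next batch and time spent in the training step. Here is a rough sketch under stated assumptions: profile_loop is a made-up helper, and slow_loader plus the sleep-based step are dummies standing in for a real data loader and model. With a real CUDA model you would also need to call torch.cuda.synchronize() inside the step, since GPU work is launched asynchronously and would otherwise look nearly free.

```python
import time

def profile_loop(batches, step):
    """Split wall-clock time into waiting-for-data and running `step`."""
    wait = compute = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)          # time here is the loader's fault
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)                   # time here is the model's fault
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait, compute

def slow_loader(n, delay):
    """Dummy loader: pretend the CPU is busy decoding and augmenting images."""
    for i in range(n):
        time.sleep(delay)
        yield i

# Dummy step that is much faster than the loader, i.e. a starved GPU.
wait, compute = profile_loop(slow_loader(5, 0.02), lambda batch: time.sleep(0.005))
```

If wait dominates compute on the real loader, adding GPUs cannot help until the input pipeline gets faster.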

I’ve not had luck improving performance with multiple GPUs either (not as bad results as you saw, but no faster than single GPU). I haven’t looked closely into it. I’d be interested to hear if you find out anything. Perhaps on the pytorch forums?

Puzzling phenomena

When I run the following code without any other jobs running, it is significantly slower than when the GPU is running another process. (Specifically, when it is under heavy load running crypto mining software.) I have repeated the trials numerous times to make sure that there were no differences in pre-computing or caching. Moreover, I have tested this off and on over several weeks with the same result. I have used nvidia-smi to verify what jobs are running on the GPU. Here are the times:

  • Time with no load: 45 seconds

  • Time with load: 20 seconds

data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True), 5)

This really doesn’t make sense to me.

EDIT: I was wondering if someone could let me know how long the above code runs for them. (I am running this on a Titan V, though I’ve tested on a Titan X and it’s about the same.) This is right out of Lesson1. Note that in, 5), I am running 5 epochs.

I’ve also experienced a similar issue: when I used nn.DataParallel to run on 4 GPUs, it didn’t seem to help much with training time. So I increased batch sizes to the extent that the memory on every GPU was almost full, to take full advantage of them, since there might be some bottlenecks when we are splitting data and copying modules to all GPUs. I’m still not sure whether it’s worth running nn.DataParallel. It’s probably better to stick with what Jeremy suggests and use the GPUs for running different experiments.

Maybe plotting performance vs different batch sizes might give some clue about bottlenecks.
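
A minimal sketch of such a sweep, assuming you can wrap one training step in a function that takes a batch size. throughput_by_batch_size is a made-up helper, and fake_run_batch is a dummy that models a fixed per-batch overhead (standing in for the splitting and copying cost) plus a per-sample cost; its numbers are invented, not measurements.

```python
import time

def throughput_by_batch_size(run_batch, batch_sizes, n_batches=5):
    """Return {batch_size: samples per second} for each candidate batch size."""
    results = {}
    for bs in batch_sizes:
        start = time.perf_counter()
        for _ in range(n_batches):
            run_batch(bs)
        elapsed = time.perf_counter() - start
        results[bs] = bs * n_batches / elapsed
    return results

def fake_run_batch(bs, overhead=0.01, per_sample=0.0001):
    """Dummy step: fixed per-batch overhead plus a cost per sample."""
    time.sleep(overhead + per_sample * bs)

results = throughput_by_batch_size(fake_run_batch, [16, 64, 256])
```

If samples per second keeps climbing with batch size, fixed per-batch overhead is a likely suspect; if it flattens early, the bottleneck is probably elsewhere.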

It’s definitely possible to get nearly linear scaling with more GPUs - I just haven’t looked into how to make that work. But plenty of folks have published results showing that.