Multiple Graphics Cards - Pooling RAM

I just finished lesson 3 (Pt.1) and am currently completing as many image competitions as I can on Kaggle.

I have a local machine (Ubuntu 16.04 / Core i7 / 32GB RAM / 1TB M.2 SSD / Asus GTX 1080 Ti 11GB) and am constantly running out of memory.

I bring my batch size down to 2, which generally keeps the kernel going, but my results are awful: the tiny batch forces me to lower the learning rate, and everything seems to go downhill from there.
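As an aside, there is a general rule of thumb (sometimes called the linear scaling rule - this is a common heuristic, not something from the lesson notebooks) that when you shrink the batch size you shrink the learning rate in proportion. A quick sketch, with made-up base values:

```python
# Sketch of the linear scaling rule: scale the learning rate
# proportionally with the batch size. The base values below are
# hypothetical examples, not taken from the lesson notebooks.
def scale_lr(base_lr, base_bs, new_bs):
    """Scale a learning rate linearly with batch size."""
    return base_lr * new_bs / base_bs

# e.g. a rate tuned for bs=64, reused at bs=2
print(scale_lr(1e-3, 64, 2))  # -> 3.125e-05
```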

As a rule I try not to throw hardware at a problem, but is the library capable of handling multiple graphics cards? If I bought 1 or 2 more cards, could I just plug them in, reboot my machine, and have it take things from there, pooling all of the memory?

I just found this thread: How to use Multiple GPUs? - but it seems to focus on speed and power. My question is about memory. Am I able to install multiple cards and share memory?

DataParallel makes utilizing multiple GPUs very straightforward. It can be used with fastai (at least as of the last time I checked, and I don't think this has changed) and effectively allows you to train with larger batch sizes.
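As a rough illustration of the mechanics (a minimal plain-PyTorch sketch, not fastai-specific - the model here is a throwaway placeholder): DataParallel keeps a full copy of the model on each card and splits each batch across them, so it is the batch, not the model, that gets sharded. With N cards you can therefore roughly multiply your batch size by N:

```python
import torch
import torch.nn as nn

# Placeholder model, just to show the wrapping; each GPU would hold a
# full copy of it, and each forward pass splits dim 0 of the input
# batch across the available cards.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(32, 128, device=device)  # batch of 32, divided among GPUs
out = model(x)
print(out.shape)  # torch.Size([32, 10])
```

Note the memory of the cards is not pooled into one big space - each card still sees only its own 11GB - but because each card only processes its slice of the batch, the effective batch size you can train with grows.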

Is this a custom model for detection you are using? That might explain the batch size issues. The reason I ask: I have the same GPU and was able to train this model with a batch size of ~40, IIRC. Maybe you are not resizing images, or are training with really big input sizes? I believe I used sizes of 224 x 224 or something in that ballpark.

I am not sure the currently running Kaggle deep learning competitions are a viable way of learning deep learning. They might be good for learning the Google ecosystem, figuring out how to train with large amounts of data, learning how to use object detection libraries, and making sense of docs that are not very straightforward - but for learning how to construct models, or understanding how deep learning works, I am not sure they offer a viable path.

I just looked, and it seems there are more DL competitions on Kaggle atm - my comment above relates only to the ranked competitions sponsored by Google, not the knowledge comps.

Thank you @radek. I resized the test and train images in the Invasive Species Monitoring competition to 224x168 so as not to skew the images, and followed lesson1-rxt50.ipynb (Dogs v Cats Super-charged!). That is why I am using this model.

When I do:

lr = np.array([1e-5, 1e-4, 1e-3])
learn.fit(lr, 3, cycle_len=1)

That is when the kernel crashes. I have to go all the way down to bs=2 for this to work…

Also, thank you for the heads-up on the model. I just tried resnet34 from lesson1.ipynb and was able to run it in full with bs=16.

There must be a bug somewhere. With these sizes you should be able to use much bigger batch sizes. These competitions should be good for learning - sorry for my earlier comment :slight_smile:

Not at all, thanks for the heads up on all of this.


With this GPU and resnet34 you should be able to go well over 100, maybe even over 200, in batch size - unless you are building a very big, non-standard classification head. Still, something must be way off here.

Interesting! Ok, thanks. Does it matter that this research contest is using a CSV? I assume I'm bringing that into main memory, not GPU memory?

I guess a more general question: compared with the static test/train/valid image directory setup, does the CSV-based setup take any memory from the GPU?

That shouldn’t make a difference. Maybe something you ran earlier is still holding GPU memory. The best thing would be to restart your machine, open a terminal, run nvidia-smi -l, and watch what happens to the memory as you run your code. That way, down the road, you will be able to tell just by looking at nvidia-smi whether everything is okay or whether you need to restart the kernel - the output of this command is very useful in general.
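For what it’s worth, the labels CSV is just filenames and targets parsed into ordinary host RAM; only the decoded image tensors ever move to the GPU. A stdlib sketch (the file contents below are made up, not taken from the actual competition data):

```python
import csv
import io

# A made-up labels file in the shape of the Kaggle competition CSVs.
# After parsing, it lives entirely in system RAM; nothing here touches
# GPU memory.
labels_csv = io.StringIO("name,invasive\n1.jpg,0\n2.jpg,1\n3.jpg,1\n")
rows = list(csv.DictReader(labels_csv))

print(len(rows))            # 3
print(rows[1]["invasive"])  # '1'
```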

Ok. Is there an nvidia terminal command that can flush the GPU memory? Or is rebooting the best bet?

Update: It seems killing the PID is the best way to do this if you don't reboot.
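In practice that looks something like this (the PID below is a placeholder - take the real one from the process table at the bottom of the nvidia-smi output):

```shell
# List processes currently holding GPU memory (PIDs in the bottom table)
nvidia-smi

# Watch memory usage refresh every second while training runs
nvidia-smi -l 1

# Free the memory by killing a stale python process by its PID
# (12345 is a placeholder - use a PID from the nvidia-smi output)
kill 12345       # polite; sends SIGTERM
kill -9 12345    # last resort; sends SIGKILL
```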

@radek thank you. I just ran nvidia-smi and it seems there are four processes that are eating it up:

…theyear2000/anaconda3/envs/fastai/bin/python 3473MiB
…theyear2000/anaconda3/envs/fastai/bin/python 3471MiB
…theyear2000/anaconda3/envs/fastai/bin/python 2397MiB
…theyear2000/anaconda3/envs/fastai/bin/python 1567MiB

I really appreciate the time on this. It must have been the extra Jupyter notebooks I was opening while building the model.


When you say you are out of memory, are you sure it is out of frame buffer on the GPU?

If it is out of GPU memory, the error message should say “RuntimeError: cuda runtime error (2) : out of memory”; if it is something else, the message will be different.
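One way to see the distinction (the helper below is my own, hypothetical - in PyTorch of that era a GPU OOM surfaces as a plain RuntimeError, so you can only recognize it by its message, whereas exhausting system RAM typically kills the process or raises MemoryError instead):

```python
# Hypothetical helper: older PyTorch signals GPU memory exhaustion as a
# plain RuntimeError containing "out of memory", so string-matching is
# how you tell it apart from other failures.
def is_cuda_oom(err: Exception) -> bool:
    return isinstance(err, RuntimeError) and "out of memory" in str(err)

# Simulated examples - no GPU needed:
fake = RuntimeError("cuda runtime error (2) : out of memory")
print(is_cuda_oom(fake))           # True
print(is_cuda_oom(MemoryError()))  # False
```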

Let me share my experience.

I recently got to Lesson 3, as you did, and was playing with ImageNet 2012’s massive dataset. I got frustrated because I kept getting a memory error after just a single epoch. After some digging, I found out that the default Ubuntu swap file was too small at 2GB, so despite having 16GB of system memory, the kernel would crash when trying to update after the epoch.

So, after increasing my swap file to 64GB, I was able to train ResNet50 with a much bigger batch size than before.
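For reference, one way to grow the swap file on Ubuntu (the 64GB size is just an example; the commands need root, and /swapfile is an assumed path):

```shell
# Disable current swap, then create and enable a bigger swap file
sudo swapoff -a
sudo fallocate -l 64G /swapfile   # allocate a 64GB file
sudo chmod 600 /swapfile          # restrict permissions
sudo mkswap /swapfile             # format it as swap
sudo swapon /swapfile             # enable it

# Make it permanent by adding this line to /etc/fstab:
# /swapfile none swap sw 0 0

free -h                           # verify the new swap size
```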

Hope this helps.