Running out of GPU memory on first run dl1/dogscats

Describe the bug
My machine runs out of GPU memory the first time I run ConvLearner.pretrained from dl1/lesson1. The conda env's Python process consumes 1754 MiB of GPU memory.

arch=resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 2)
Downloading: "https://download.pytorch.org/models/resnet34-333f7ec4.pth" to /root/.torch/models/resnet34-333f7ec4.pth
100%|██████████| 87306240/87306240 [00:04<00:00, 17734184.12it/s]

  0%|          | 0/360 [00:00<?, ?it/s]
RuntimeError: CUDA out of memory. Tried to allocate 24.50 MiB (GPU 0; 1.96 GiB total capacity; 1.31 GiB already allocated; 7.25 MiB free; 3.23 MiB cached)

I am using this Docker image now, but the issue also happened to me before when running inside a plain conda env. In fact, I switched to Docker in the hope of taming some bug in the memory handling.
https://github.com/Paperspace/fastai-docker/blob/master/Dockerfile

Expected behavior
The process should not run out of memory.

Screenshots
nvidia-smi:


Thu Dec 13 12:54:41 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.18       Driver Version: 415.18       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 960M    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   44C    P0    N/A /  N/A |   1997MiB /  2004MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1448      G   /usr/lib/xorg/Xorg                           231MiB |
|    0     18172      C   /opt/conda/envs/fastai/bin/python           1754MiB |
+-----------------------------------------------------------------------------+

EDIT: Added more logs

Additional context
Is there a way to run the net without consuming so much memory?

Hi @finn,

That's normal, at least while you are working with pre-trained models. If you set pretrained to False, the network has to hold the full set of parameters for every layer and backpropagate through all of them to optimize the weights, so the more layers the network has, the more GPU memory you need.

I recently tested different models and saw that some of them, in pre-trained mode with a small batch size, consume only up to about 2 GB of GPU RAM. On the other hand, the same model trained without pre-training consumed about 8 GB.

Additional context
Is there a way to run the net without consuming so much memory?

Also, the fact that Linux uses the same GPU to render your screen already costs you some memory: 231 MiB just to render the desktop.
If you had another GPU just for rendering the screen, you would save those 231 MiB!
SORRY: I just realized you are on a laptop, because the GTX 960(M) series is laptop-only, so you cannot add another GPU to the system.

 +-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1448      G   /usr/lib/xorg/Xorg                           231MiB |
|    0     18172      C   /opt/conda/envs/fastai/bin/python           1754MiB |
+-----------------------------------------------------------------------------+

Also be aware that using a pre-trained model for your specific problem only replaces the last layer(s) of your network, so only that head is trained for you. Your memory is being consumed by this head receiving backpropagation updates, while the rest of the parameters are static and just sit in memory.
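To make that concrete, here is a rough illustration in plain torchvision (not the fastai internals, just the idea of a frozen backbone with a small trainable head):

import torch
import torchvision

# load a pre-trained resnet34 and freeze every backbone parameter,
# so they only sit in memory and never receive gradients
model = torchvision.models.resnet34(pretrained=True)
for p in model.parameters():
    p.requires_grad = False

# replace the final layer with a fresh head for 2 classes (dogs vs cats);
# only this part is trained and holds gradient/optimizer state
model.fc = torch.nn.Linear(model.fc.in_features, 2)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters")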

Summing up, there is not much you can do other than switching to the CPU, if you have more memory available there.

But in fact there is one avenue that can help, although you need to dig into the PyTorch framework and make changes to your fastai library.

That is enabling torch.utils.checkpoint. If you are comfortable working in pure PyTorch, you can optimize your process by enabling it. Take some time to read and research this:

https://pytorch.org/docs/stable/checkpoint.html
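A very rough sketch of the idea in plain PyTorch (the toy model and the split into 2 segments are made up, only to show the call; checkpointing trades extra compute for less activation memory):

import torch
from torch.utils.checkpoint import checkpoint_sequential

device = "cuda" if torch.cuda.is_available() else "cpu"

# toy sequential model; with checkpointing the intermediate activations are
# recomputed during backward instead of being kept in GPU memory
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(64, 64, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(64, 2),
).to(device)

x = torch.randn(8, 3, 224, 224, device=device, requires_grad=True)

# split the model into 2 segments; only the segment boundaries keep activations
out = checkpoint_sequential(model, 2, x)
out.sum().backward()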

Now I have one question:
Do you know if you have PyTorch with Magma support?
How did you install PyTorch?

OK, based on the Dockerfile you mentioned I can see you don't have Magma installed.

I use the CPU now and it works. It's a bit slow, but for learning it will suffice. Thank you.
What's a sensible GPU RAM size?

Hi again :slight_smile: How are you?

What's a sensible GPU RAM size?

Well, assuming you are asking a general question and not one tied to your current hardware limitations:

It depends on the models and problems you are interested in. For example, computer vision with big network models in pre-trained mode uses about 2 GB, so if you plan to work only with pre-trained models and a reasonable batch size, 4 GB of GPU RAM will suffice; sometimes even 3 GB will do.
But let's say you load a pre-trained model and at some point decide to retrain the entire network for your solution, "unfreezing" the parameters so they are trained again. That can require aggressive resources.
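In the fastai 0.7 API you are already using in the first post, that switch looks roughly like this (just a sketch; PATH and sz as in your notebook, and bs shown explicitly so you can lower it if memory is tight):

from fastai.conv_learner import *

arch = resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), bs=16)
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 2)          # frozen: only the new head trains, memory stays low

learn.precompute = False    # train on real images, not precomputed activations
learn.unfreeze()            # now every layer gets gradients again
learn.fit(0.01, 2)          # memory use jumps accordingly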

Professional cards
The top-class GPUs today come with 24-48 GB of RAM
(Titan RTX, Tesla V100, Quadro RTX / GV100)

Semi-professional to professional cards
The medium-class GPUs today come with 12-16 GB of RAM
(Titan V / XP, Tesla V100 / P100 / MXX / KXX, Quadro GP100)

Gamer cards to semi-professional cards
The lower-class GPUs come with 6/8-11 GB of RAM
(RTX 20xx, GTX 10xx, older Teslas and Quadros)

The amount of memory you need depends on your model, on the batch size, and on how much has to be saved during backpropagation. The other day I trained a model that in pre-trained mode used just 1.7 GB of my GPU RAM; after the unfreezing step it consumed 10 GB.

Even a top GPU with 32 GB of RAM can hit "out of memory" errors; in that case some people will need to add another GPU. If you are working in the cloud, for double the price per hour you can just add another GPU and get the job done.

Doing the same with local hardware can cost you many, many bucks.
Another thing: there are libraries that promise to overcome the RAM limitations, but they only pay off for humongous datasets; I have not seen anybody having success with them as a general low-memory workaround yet.

Maybe PyTorch needs to change its core to make better use of that kind of API.

Hello Willismar,

I am having a pretty similar issue running the code below with fastai version 1.

arch = models.resnet34
aName = '_resNet34test'
epochS = 15
epochD = 25
maxLR = 1e-02

learnS = create_cnn(data, arch, pretrained=True,
                    metrics=[accuracy, error_rate],
                    callback_fns=[partial(CSVLogger, filename=str('stat_S' + aName)),
                                  ShowGraph,
                                  partial(SaveModelCallback, monitor='val_loss', mode='auto',
                                          name=aName)],
                    ps=[0.5])
learnS.fit_one_cycle(epochS, max_lr=maxLR, moms=[0.95, 0.85], div_factor=25.0)

I have an 8 GB gamer GPU in my laptop and I immediately get this error:

RuntimeError: CUDA out of memory. Tried to allocate 42.25 MiB (GPU 0; 8.00 GiB total capacity; 6.23 GiB already allocated; 5.15 MiB free; 51.69 MiB cached)

My entire training set is 260 MB and my testing set is 1.25 GB. I can successfully train and test resnet18, but when I increase the network size to resnet34 I get this error. (I get the same error on a semi-professional 12 GB GPU in a server when trying resnet50, though resnet34 is fine there.)

A BIG confusion for me is that a couple of months ago I was able to successfully train and test resNeXt_101_64 (which is much deeper than resnet34) with the same data and on the same GPU via fastai 0.4.

Any clue why this happens and how I can fix the issue?

Regards,
Shahinfar


Hi @Shahinfar, how are you?

Nice case you have in your hands. :slight_smile:

Can you share your notebook and dataset, or is it something private?

I have two answers for your question, a short one and a longer one.
1- The short one: you will need more GPU power.
2- The longer one: you can try to implement the algorithm directly in PyTorch and see whether the behavior is the same; if you still have the same problem there, that trade-off may be all that is left to try.

Let me know.

Hi @willismar,
I am good thanks, how about you?

Thanks for your reply,
I can share the code but not the data, as it is strictly confidential data from government agencies.
Well, the largest GPU I can access right now is 12 GB. At the moment I am running the job on multiple CPUs and it is going well, just a little slower than I would like it to be.

But since I could run it with the older version of fastai just fine, and I am getting the error with fastai v1, I still cannot understand why this happens.

cheers!

@Shahinfar, I recommend you use https://github.com/stas00/ipyexperiments and watch the real available/used GPU memory cell after cell; that will help you find the culprit. If you can't figure it out, please post the output here so we can see how much GPU RAM each cell used.
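If you just want a quick sanity check without extra tools, PyTorch's own counters give a rough picture (note they only see this process's tensors, not Xorg or anything else on the card):

import torch

def gpu_mem_report(device=0):
    # bytes currently held by tensors in this process, plus the peak so far
    alloc = torch.cuda.memory_allocated(device) / 2**20
    peak  = torch.cuda.max_memory_allocated(device) / 2**20
    total = torch.cuda.get_device_properties(device).total_memory / 2**20
    print(f"allocated: {alloc:.0f} MiB | peak: {peak:.0f} MiB | card total: {total:.0f} MiB")

gpu_mem_report()  # call this after each cell you want to measure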

And also, I'm hoping that when you say you have an 8GB card, it's actually an 8GB card dedicated to your work. If you have an X server and other stuff running on it, then this is not really an 8GB card, but whatever is left of it after all the system processes have consumed the GPU RAM they need.

A BIG confusion for me is that a couple of months ago I was able to successfully train and test resNeXt_101_64 (which is much deeper than resnet34) with the same data and on the same GPU via fastai 0.4.

you must have meant fastai 0.7. Given the complexity of the DL setup, it's very easy to mis-remember things. Can you go back to the fastai 0.7 setup, try it again using the identical hyperparameters, model and data, and actually check that what you're suggesting is so? I'm not saying there is no possibility of a regression, but I'm asking you to run an actual test rather than rely on a distant memory. Thank you.


@stas, regarding the availability of my GPU, it seems that at least 7.4 GB is free to be used by fastai (of course some memory fragmentation will automatically happen during training, which may take up some space; that is true for all sorts of computation).

Yes I meant fastai 0.7.

Just to compare the performance, I have been trying to roll fastai back to version 0.7 and try resNext101_64 again, but unfortunately I keep running into endless errors about all sorts of older dependency versions of fastai 0.7. I wonder if you have any Docker container (for Windows 10) set up for fastai 0.7 (and its requirements) that I could use to test whether resNext101 works or not? In that case I could compare the performance of fastai 0.7 and 1.0 side by side via the ipyexperiments you suggested.

Best regards,

Why not set up one dedicated conda env for fastai-0.7 and another for fastai-1.0.x? Then it should work just fine: you just conda activate envname1, test, conda activate envname2, test, done. (You will probably need to restart jupyter, but maybe not.)

I don't know if we have a Docker image; I wasn't here at the time of fastai-0.7, but perhaps someone created one - search the forums? E.g. I found this one: Docker for fast.ai. I don't know whether it's still working, but I think the first paragraph of that post should give you what you need.

If you're on Windows you will need to follow these instructions: Howto: installation on Windows

@stas, to be honest I have been trying to install fastai 0.7 in all the different ways possible for the last 7 days, and each time I encountered a different issue; when I solved one, another error would show up. For me, not being a software engineer, this is really frustrating and time consuming. My latest error, when following the Beginner Installation of FastAi 0.7 on Windows 10 Nvidia GPU 10xx steps, shows up at the end when I run the Jupyter notebook for lesson1-breeds:
ImportError                               Traceback (most recent call last)
<ipython-input> in <module>
      1 from fastai.imports import *
----> 2 from fastai.torch_imports import *
      3 from fastai.transforms import *
      4 from fastai.conv_learner import *
      5 from fastai.model import *

~\Desktop\fastai\courses\dl1\fastai\torch_imports.py in <module>
      1 import os
      2 from distutils.version import LooseVersion
----> 3 import torch, torchvision, torchtext
      4 from torch import nn, cuda, backends, FloatTensor, LongTensor, optim
      5 import torch.nn.functional as F

~\AppData\Local\Continuum\anaconda3\envs\fastai\lib\site-packages\torch\__init__.py in <module>
     74     pass
     75
---> 76 from torch._C import *
     77
     78 __all__ += [name for name in dir(_C)

ImportError: DLL load failed: The specified module could not be found.

I re-installed torch, torchvision and torchtext one by one and repeated all the steps mentioned above, but I received the same exact error again. Really no clue what I should do :sweat::triumph:

Unfortunately, I don’t know anything about windows, so can’t really help you.

If you feel that it's important, then perhaps ask someone to help you in the Windows thread. If not, then perhaps we let it go and trust that over time things will get better, even if they are somehow different now.

Hi @stas,
It is still important to know why I was able to train and test resNeXt_101_64 on my machine, yet with fastai 1 I cannot go beyond resnet18. My machine, GPU, and dataset are exactly the same. I can share both pieces of code I have tried and the characteristics of my data (unfortunately I cannot share the data itself due to government restrictions). In that case, would you be able to test both in the different versions of fastai and see where the bug is coming from?

If you make a simple reproducible test with another dataset that can be shared then I can try.

Note that we have a bunch of datasets you can use as the base - see https://github.com/fastai/fastai/blob/master/fastai/datasets.py#L17

So I will need one nb or test file doing it the fastai-0.7 way and another doing it the v1 way.
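For the v1 side, something along these lines against one of those bundled datasets would be enough (MNIST_SAMPLE is just an arbitrary pick, untested, only to show the shape I mean):

from fastai.vision import *

# minimal, shareable repro: small bundled dataset, explicit batch size
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path, ds_tfms=get_transforms(), size=28, bs=16)
learn = create_cnn(data, models.resnet34, metrics=[accuracy, error_rate])
learn.fit_one_cycle(1)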

And remember that I will be testing it on Linux - perhaps the regression is in PyTorch on Windows?

Many thanks to @stas, who helped me solve this issue!
It was a user error on my part, mainly two things:
1- I did not set the batch size in ImageDataBunch.
2- I defined two ImageDataBunch objects, only one of which was used by the learner; the other just sat there taking up memory, which in turn pushed the GPU OOM. Silly!
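For anyone hitting the same thing, this is roughly what the fix looked like (paths and sizes are placeholders, not my actual code; data_unused is a hypothetical leftover object):

from fastai.vision import *
import torch

PATH = 'data/my_dataset'   # placeholder path

# 1) set the batch size explicitly instead of relying on the default
data = ImageDataBunch.from_folder(PATH, ds_tfms=get_transforms(), size=224, bs=16)

# 2) don't keep a second, unused ImageDataBunch alive; delete it and let
#    PyTorch hand its cached blocks back to the GPU
# del data_unused           # hypothetical leftover object
torch.cuda.empty_cache()

learn = create_cnn(data, models.resnet34, metrics=[accuracy, error_rate])
learn.fit_one_cycle(15, max_lr=1e-2)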
