RuntimeError: CUDA out of memory. Tried to allocate 538.00 MiB (GPU 0; 11.00 GiB total capacity; 230.80 MiB already allocated; 8.53 GiB free; 242.00 MiB reserved in total by PyTorch)

Hello everyone!
I tried to run the CamVid project from the fastai course (https://course.fast.ai/videos/?lesson=3, https://nbviewer.jupyter.org/github/fastai/course-v3/blob/master/nbs/dl1/lesson3-camvid.ipynb) on my PC.
(Specs: Windows 10 Education 64-bit, Intel® Core™ i7-7700K CPU @ 4.20 GHz,
GPU: Nvidia GeForce GTX 1080 Ti 11 GB)
torch.__version__ = '1.4.0'
fastai.__version__ = '1.0.61'
It's my first time running any kind of CNN/U-Net, and the error message doesn't really make sense to me: it tries to allocate 538 MiB and fails, although 8.53 GiB is reported free.
Is it because PyTorch has only reserved 242 MiB?
The error occurs while running lr_find(learn).
I checked the other related forum topics but couldn't find a solution, and reducing the batch size doesn't fix the problem for me.
Neither does torch.cuda.empty_cache() or gc.collect().
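For anyone debugging the same thing, it can help to see the allocated vs. reserved numbers from the error message directly. This is just a diagnostic sketch using PyTorch's memory-stats API (function names here are my own; it does nothing on a machine without CUDA):

```python
import gc

import torch


def report_cuda_memory(tag=""):
    """Print allocated vs. reserved CUDA memory in MiB (no-op without a GPU)."""
    if not torch.cuda.is_available():
        return None
    allocated = torch.cuda.memory_allocated() / 2**20   # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / 2**20     # memory cached by the allocator
    print(f"{tag} allocated={allocated:.1f} MiB reserved={reserved:.1f} MiB")
    return allocated, reserved


def free_cached_memory():
    """Drop dead references, then return cached blocks to the driver.

    Note: this only releases memory no live tensor still points to; it
    cannot fix a genuine OOM caused by a model or batch that is too large.
    """
    gc.collect()
    torch.cuda.empty_cache()
```

Calling report_cuda_memory() right before lr_find would at least show whether the memory was eaten before the step that crashes.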

Has anyone had the same problem?

Best regards!


I got similar errors before, also using a U-Net. Please try lowering your batch size and/or image size; then it may work. At one point I skipped the learning-rate finder step and then had no problem with training.
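The "keep halving the batch size until it fits" advice can be automated. This is a generic sketch, not fastai API: run_step is a stand-in for whatever executes one training or LR-finder step at a given batch size.

```python
try:
    import torch
except ImportError:  # the halving logic itself does not need torch
    torch = None


def find_max_batch_size(run_step, start_bs=32, min_bs=1):
    """Halve the batch size on CUDA OOM until one step succeeds.

    Returns the first batch size that works, or None if even min_bs fails.
    """
    bs = start_bs
    while bs >= min_bs:
        try:
            run_step(bs)
            return bs
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise  # a different error: don't mask it
            if torch is not None and torch.cuda.is_available():
                torch.cuda.empty_cache()  # release the failed attempt's cache
            bs //= 2
    return None
```

One caveat: after a real OOM some frameworks are left in a bad state, so (as noted below) restarting the kernel and starting directly at the smaller batch size is often the more reliable route.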

Even though I have a V100 with 8 GB of RAM allocated, I can only train at 512 x 512 with bs=1. Other system info:
torch.__version__ = '1.5.0'
fastai2.__version__ = '0.0.17' - I am using fastai2 (https://dev.fast.ai/)

After you hit RuntimeError: CUDA out of memory, you need to restart the kernel.

Even when I did not hit RuntimeError: CUDA out of memory, neither torch.cuda.empty_cache() nor gc.collect() could release the CUDA memory.

Thanks for the help!
I guess there was a problem with the combination of my Anaconda version, the fastai library, and Windows.
I solved it by buying a new SSD, installing Ubuntu 20.04 on it, and it worked on the first try with a batch size of 4.

Before reducing the batch size, check the status of GPU memory :slight_smile:

nvidia-smi

Then check which process is eating up the memory, pick its PID, and kill :boom: that process with

sudo kill -9 PID

or

sudo fuser -v /dev/nvidia*

sudo kill -9 PID
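Putting those steps together (the PID 12345 below is a placeholder; take the real one from the nvidia-smi output):

```shell
# 1) See which processes hold GPU memory (PID column):
#      nvidia-smi
#    or, for processes nvidia-smi misses:
#      sudo fuser -v /dev/nvidia*
# 2) Kill the offender by PID:
#      sudo kill -9 12345
#
# Practice the kill step on a throwaway background job instead of a real GPU process:
sleep 300 &
pid=$!
kill -9 "$pid"
wait "$pid" 2>/dev/null || true   # reap it; exit status 137 means killed by SIGKILL
echo "killed $pid"
```

SIGKILL (-9) cannot be caught, so the process gets no chance to clean up; the driver reclaims its GPU memory once the process is gone.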


I have successfully run the train.py code with batch size = 4, but when I try to run test.py, I get the following error:
Traceback (most recent call last):
  File "test.py", line 50, in <module>
    outputs = model(image)
  File "C:\Users\matif\anaconda3\envs\deeplab\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\matif\anaconda3\envs\deeplab\lib\site-packages\torchvision\models\segmentation\_utils.py", line 19, in forward
    features = self.backbone(x)
  File "C:\Users\matif\anaconda3\envs\deeplab\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\matif\anaconda3\envs\deeplab\lib\site-packages\torchvision\models\_utils.py", line 63, in forward
    x = module(x)
  File "C:\Users\matif\anaconda3\envs\deeplab\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\matif\anaconda3\envs\deeplab\lib\site-packages\torch\nn\modules\container.py", line 119, in forward
    input = module(input)
  File "C:\Users\matif\anaconda3\envs\deeplab\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\matif\anaconda3\envs\deeplab\lib\site-packages\torchvision\models\resnet.py", line 136, in forward
    identity = self.downsample(x)
  File "C:\Users\matif\anaconda3\envs\deeplab\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\matif\anaconda3\envs\deeplab\lib\site-packages\torch\nn\modules\container.py", line 119, in forward
    input = module(input)
  File "C:\Users\matif\anaconda3\envs\deeplab\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\matif\anaconda3\envs\deeplab\lib\site-packages\torch\nn\modules\conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Users\matif\anaconda3\envs\deeplab\lib\site-packages\torch\nn\modules\conv.py", line 396, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 254.00 MiB (GPU 0; 8.00 GiB total capacity; 6.15 GiB already allocated; 15.44 MiB free; 6.19 GiB reserved in total by PyTorch)
Please recommend a solution.
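I can't see the original test.py, but with 6.15 GiB already allocated before a 254 MiB request fails at inference time, a very common culprit is running the forward pass with autograd enabled, which keeps every intermediate activation alive. A hedged sketch of the usual fix (model and image are stand-ins for the objects in test.py):

```python
import torch


def predict(model, image):
    """Run inference without building an autograd graph."""
    model.eval()           # disable dropout, use running batch-norm stats
    with torch.no_grad():  # no graph -> intermediate activations are freed
        return model(image)
```

If test.py already uses torch.no_grad(), the next things to check are the test-time batch size and image resolution, and whether the training process is still alive and holding GPU memory (see the nvidia-smi advice earlier in the thread).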