CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.90 GiB total capacity; 15.17 GiB already allocated; 15.88 MiB free; 15.18 GiB reserved in total by PyTorch)

Hello again,

I am back on the forums to ask about maximising RAM and GPU usage while training relatively big CNNs. Currently I am using Google Colab with a high-RAM instance (25 GB) and a P100 GPU. I am on torch 1.4.0 and torchvision 0.5.0, with fastai 1.0.46. My model is ResNext101_32x8d from the pytorchcv model zoo, and I am doing progressive resizing with rotational augmentations. I know there have been posts on this before, but I want to check whether there is something specific I am not doing correctly.

My xtra_tfms are as follows (constant throughout):
*zoom_crop(0.75, 1.25, p=0.6),
perspective_warp(magnitude=0.3),
pad(padding=30, mode='reflection')

For clarity's sake, this is how I am defining everything:

src = (ImageList.from_folder(path)
       .split_by_rand_pct(0.05)
       .label_from_folder())

xtra_tfms = [pad(padding=30, mode='reflection'),
             *zoom_crop(0.75, 1.25, p=0.6),
             perspective_warp(magnitude=0.3),
             ]

tfms = get_transforms(do_flip=True, xtra_tfms=xtra_tfms)

data = (src.transform(tfms=tfms, size=384)
        .databunch(bs=40)
        .normalize(imagenet_stats)
        )

loss_func = LabelSmoothingCrossEntropy()

learn = cnn_learner(data, base_arch=sys.modules['torchvision.models'].__dict__['resnext101_32x8d'],
                    pretrained=False, cut=-2, split_on=lambda m: (m[0][6], m[1]),
                    metrics=accuracy,
                    custom_head=custom_head,
                    loss_func=loss_func).mixup(alpha=0.4).to_fp16()

I have already resized my image dataset (100k+ images) beforehand, so everything in my data folder is already 384x384. However, I am now repeatedly hitting the CUDA out of memory error. I find this somewhat strange, as my RAM is barely utilised; or is that how it is supposed to be? I also wonder if one of my xtra_tfms is taking up too much memory.

Is there any way I can get this to work without further decreasing my batch size? At the moment it takes about 1hr 30min/epoch with bs=32, and with a smaller batch size it gets far slower. With the previous image size of 256, a batch size of 72 worked at 40min/epoch; with image size 128, a batch size of up to 128 was fine at ~10min/epoch.
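As a rough sanity check (just a back-of-envelope sketch, assuming activation memory scales roughly with batch_size * height * width for a fixed model and precision), the batch sizes above are at least consistent with each other:

ref_bs, ref_size = 72, 256                        # combination known to fit at 256px
new_size = 384
new_bs = int(ref_bs * (ref_size / new_size) ** 2)
print(new_bs)                                     # 32 -- roughly what fits at 384px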

Any advice would be appreciated. Thank you!

The CUDA out-of-memory error is referring to GPU memory, not RAM. Unfortunately there isn't really any way around it, other than reducing the batch size (so that fewer images sit on the GPU at the same time and thus less GPU memory is needed), using one or more GPUs with more memory, or using a smaller/lighter model architecture that requires less GPU memory. You can run the !nvidia-smi command to see how much memory is being used, but from the error message you can already see that the full 16 GB of GPU memory has been taken up, which is what gives you the OOM error.
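If you'd rather check from inside the notebook than via !nvidia-smi, PyTorch's own counters give you the same picture (a minimal sketch; on older torch versions memory_reserved() may still be called memory_cached()):

import torch

gib = 1024 ** 3  # bytes per GiB
print(f"allocated: {torch.cuda.memory_allocated(0) / gib:.2f} GiB")   # tensors currently held
print(f"reserved:  {torch.cuda.memory_reserved(0) / gib:.2f} GiB")    # cached by PyTorch's allocator
print(f"total:     {torch.cuda.get_device_properties(0).total_memory / gib:.2f} GiB")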

I remember there was some cool code on the forum to automatically and quickly figure out the maximum batch size that will fit in GPU memory before you start training properly, but I don't have the link to hand. Have a search on the forum? Bear in mind it only gives you a more convenient way of finding the maximum batch size before OOM; it obviously won't 'help' you run at any higher batch size.
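I don't remember exactly what that snippet did, but the gist is simple enough to sketch in plain PyTorch (this is my own rough sketch, not the forum code; max_batch_size, sample_batch and n_classes are made-up names, and with to_fp16() you would need to feed half-precision inputs):

import torch

def max_batch_size(model, loss_func, sample_batch, candidates=(16, 24, 32, 40, 48, 64)):
    # Try one forward/backward pass per candidate batch size and keep the
    # largest one that does not raise a CUDA out-of-memory error.
    best = None
    for bs in candidates:
        try:
            xb, yb = sample_batch(bs)
            loss = loss_func(model(xb), yb)
            loss.backward()
            model.zero_grad()
            best = bs
        except RuntimeError as e:
            if 'out of memory' in str(e):
                torch.cuda.empty_cache()
                break
            raise
    return best

# e.g. random 384px RGB batches against your learner's model and loss:
# sample_batch = lambda bs: (torch.randn(bs, 3, 384, 384, device='cuda'),
#                            torch.randint(0, n_classes, (bs,), device='cuda'))
# print(max_batch_size(learn.model, learn.loss_func, sample_batch))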

Yijin

Thank you for your reply. I will do a search. Do you have any idea how much additional memory the augmentations take up (if any)?

@linminhtoo It's been a while, sorry you haven't had a reply.
The augmentations themselves don't take up extra GPU memory. fastai applies the transforms on the fly, on the CPU, as each batch is assembled, and the GPU only ever sees the transformed batch.
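You can see this for yourself: pulling a single batch is the point at which the transform pipeline actually runs, on the CPU (a quick sketch assuming the data object defined above and fastai v1):

x, y = data.one_batch()      # pad, zoom_crop, perspective_warp etc. are applied here, on the CPU
print(x.shape, x.device)     # e.g. torch.Size([40, 3, 384, 384]) cpu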