Memory full with previous models

Hi!

I started working on the Plant Pathology competition and I am facing a big problem.
I cannot train any model because my GPU memory is full.

I have a 2070, and every time I try to train a new model I face this: “CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 7.76 GiB total capacity; 6.70 GiB already allocated; 50.31 MiB free; 6.73 GiB reserved in total by PyTorch)”.
I wanted to track the GPU memory usage, and everything looks fine at first: with a data loader at image size 64 and batch size 2 and a resnet50, the GPU sits at 1117MiB / 7949MiB. But as soon as I try to train, Python jumps to 7709MiB of GPU memory. Previously I could train a resnet50 with image size 512 and a batch size of at least 24; today image size 64 and bs=2 is too much.
My guess is that the tensors or models from previous runs are somehow still around, and they reclaim the memory they used before as soon as I try to train the new model.
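
For what it's worth, this is roughly how I compare what PyTorch itself holds against what nvidia-smi reports (a minimal sketch; memory_reserved was called memory_cached on older PyTorch versions):

import torch

def report_gpu_memory(tag=''):
    # memory held by live tensors vs. memory reserved by PyTorch's caching allocator
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f'{tag}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB')

report_gpu_memory('after creating the learner')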

What I tried:

  • reduce batch size
  • restart kernel
  • restart PC
  • torch.cuda.empty_cache()
  • gc.collect() (see the cleanup snippet after this list)
  • kill python processes
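
For reference, this is roughly the cleanup I run between attempts, assuming the old learner from a previous run is still in scope as learn (it did not help here):

import gc
import torch

# drop references to the old model and data, then release PyTorch's cached blocks
del learn, dls
gc.collect()
torch.cuda.empty_cache()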

Here is my code:

import pandas as pd
from pathlib import Path
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
from tqdm.notebook import tqdm
from torch.utils.data.sampler import WeightedRandomSampler
from fastai2.basics import *
from fastai2.callback.all import *
from fastai2.vision.all import *

DATA_PATH = Path('./')
IMG_PATH = DATA_PATH / 'images'
LABEL_COLS = ['healthy', 'multiple_diseases', 'rust', 'scab']
SIZE = 64
BS = 2
ARCH = resnet50
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

def get_label(row):
    for k, v in row[LABEL_COLS].items():
        if v == 1:
            return k

train_df['label'] = train_df.apply(get_label, axis=1)

dls = ImageDataLoaders.from_df(train_df, folder='images', label_col='label', suff='.jpg',
                               size=SIZE, bs=BS)
learn = cnn_learner(dls, ARCH, metrics=[error_rate, accuracy])
learn.lr_find()

When running this, anaconda3/bin/python uses 7709MiB.
Has anyone faced this problem before? I have been searching on Google for about 2 hours and could not find a fix.

Have a nice day :slight_smile:

Hey Adam,

I ran into this problem a lot with one of my models. Have you tried setting up EC2 on AWS, or using the free Kaggle GPUs?

One thing I found that made a big difference (note that it was on a tabular dataset) was to set the validation size to 0.5. This seemed to free up enough space to increase my batch size and play around with the architecture without hitting that error.
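
If it helps, here is the kind of change I mean, sketched on the image loader from your post rather than my tabular one (the main point is valid_pct; I am also passing the image size through item_tfms=Resize(SIZE), which is the usual way to resize in fastai2):

dls = ImageDataLoaders.from_df(train_df, folder='images', label_col='label', suff='.jpg',
                               valid_pct=0.5, item_tfms=Resize(SIZE), bs=BS)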

There are also a number of existing threads about this problem if you search for them, for example:

My problem is quite different.

I actually can’t train a resnet50 on a batch of 2 images of size 64 (not even 1 epoch) on a 2070.

Did you check your label function worked as expected?

It does.

Hard to tell. I do think PyTorch changed how memory allocation is done in newer versions.

You can step through your code and use torch.cuda.max_memory_allocated() to see where you are running out of memory in your training loop.
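
A minimal sketch of what I mean, assuming you pull a single batch out of your dls and run it through the model by hand (max_memory_allocated and reset_peak_memory_stats are standard torch.cuda calls):

import torch

torch.cuda.reset_peak_memory_stats()

xb, yb = dls.one_batch()    # one batch from the training DataLoader
preds = learn.model(xb)     # forward pass
print(f'peak after forward: {torch.cuda.max_memory_allocated() / 1024**2:.0f} MiB')

loss = learn.loss_func(preds, yb)
loss.backward()             # the backward pass is usually where memory peaks
print(f'peak after backward: {torch.cuda.max_memory_allocated() / 1024**2:.0f} MiB')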