Hi!
I started working on the Plant Pathology competition and I am facing a big problem.
I cannot train any model because my GPU memory is full.
I have a 2070, and every time I try to train a new model I face this: “CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 7.76 GiB total capacity; 6.70 GiB already allocated; 50.31 MiB free; 6.73 GiB reserved in total by PyTorch)”.
I tracked the GPU memory usage and everything looks fine until I try to train: with a data loader at image size 64, batch size 2, and a resnet50, I am only using 1117MiB / 7949MiB of GPU memory. As soon as training starts, Python jumps to 7709MiB. Previously I could train a resnet50 at image size 512 with a batch size of at least 24; today image size 64 with bs=2 is already too much.
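For reference, here is roughly how I checked the memory from inside the notebook (just a sketch; the per-process numbers above come from nvidia-smi, the values below from PyTorch's caching allocator):

import torch

def report_gpu_mem(tag=''):
    # values reported by PyTorch's CUDA caching allocator, in MiB
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f'{tag}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB')

report_gpu_mem('after building the DataLoaders')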
My guess is that tensors or models from previous training runs are somehow still around and grab back the memory they used as soon as I try to train a new model.
What I tried (the cleanup calls are combined as in the sketch after this list):
- reduce batch size
- restart kernel
- restart PC
- torch.cuda.empty_cache()
- gc.collect()
- kill python processes
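For completeness, the cleanup I ran between attempts looked roughly like this (a minimal sketch; it assumes the Learner from the previous run was still bound to a variable called learn):

import gc
import torch

del learn                 # drop the reference to the previous Learner
gc.collect()              # make Python actually release the objects
torch.cuda.empty_cache()  # release the cached GPU memory held by PyTorch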
Here is my code:
import pandas as pd
from pathlib import Path
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
from tqdm.notebook import tqdm
from torch.utils.data.sampler import WeightedRandomSampler
from fastai2.basics import *
from fastai2.callback.all import *
from fastai2.vision.all import *

DATA_PATH = Path('./')
IMG_PATH = DATA_PATH / 'images'
LABEL_COLS = ['healthy', 'multiple_diseases', 'rust', 'scab']
SIZE = 64
BS = 2
ARCH = resnet50
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

def get_label(row):
    for k, v in row[LABEL_COLS].items():
        if v == 1:
            return k

train_df['label'] = train_df.apply(get_label, axis=1)

dls = ImageDataLoaders.from_df(train_df, folder='images', label_col='label', suff='.jpg',
                               size=SIZE, bs=BS)
learn = cnn_learner(dls, ARCH, metrics=[error_rate, accuracy])
learn.lr_find()
When running this, the anaconda3/bin/python process uses 7709MiB of GPU memory.
Has anyone faced this problem before? I have been searching Google for about two hours and could not find a fix.
Have a nice day