CUDA error: Out of memory on last epoch of training

I am using an EfficientNet network with an RTX 2080 Ti. Until yesterday night everything was working and I was able to complete training successfully, but now it fails with CUDA error: out of memory. With the B0 model and batch sizes of 16, 32, and 64, the error always appears on the last epoch: if I train for 50 epochs I get the error on the 50th epoch, while 49 epochs finish fine.
Here is the code:

from fastai.vision import *
from fastai.callbacks import *
from fastai.metrics import error_rate
from efficientnet_pytorch import EfficientNet
from ranger import Ranger
from radam import RAdam
#from optimizer import Lookahead



#base_optimizer = RAdam
#opt = Lookahead(base_optimizer=base_optimizer,k=5,alpha=0.5)

#opt = Ranger()
spath = '/home/awais/Desktop/UCSD_Birds/datasets/bird_70_30/images/'
 
tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.1, max_warp=0.)

data = ImageDataBunch.from_folder(path=spath, train='train', valid='val', ds_tfms=tfms, 
            test=None, valid_pct=None, no_check=True, size=(224,224), num_workers=6, bs=64)

data = data.normalize(imagenet_stats)

model = EfficientNet.from_pretrained('efficientnet-b2', num_classes=200)
torch.cuda.set_device(1)

#optimizer
opt = partial(Ranger, betas=(0.95, 0.99), eps=1e-6)
#opt = partial(RAdam)
#opt = optim.Adam
learn = Learner(data, model, metrics=[error_rate, accuracy, top_k_accuracy], callback_fns=[CSVLogger])

learn.unfreeze()
learn.fit_one_cycle(50, max_lr=1e-4, callbacks=[SaveModelCallback(learn, monitor='error_rate', mode='min', name="test_birds")])

#learn.save("trained_model", return_path=True)

I fixed my Out of Memory error by blowing away the fastai directory at $HOME/.fastai

It's a bit extreme and it may not fix it for you, and you'll have to reinstall fastai afterwards. Please report back whether it helped or not.
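
If you want to script it, this is roughly what that step amounts to (a sketch; it assumes the default location and will also delete any datasets and pretrained weights cached there):

import shutil
from pathlib import Path

# Delete the fastai config/cache directory at $HOME/.fastai.
# WARNING: this also removes downloaded datasets and model weights stored there.
fastai_dir = Path.home() / '.fastai'
if fastai_dir.exists():
    shutil.rmtree(fastai_dir)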


Hard to say without the stack trace of the error. My guess would be that the memory error is caused by the call to SaveModelCallback.on_train_end, which loads the best model once training is complete.
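
If that is the cause, the checkpoint itself should already be on disk when the error hits, so one thing worth trying is to free cached GPU memory and reload it manually instead of relying on the callback's automatic reload. A rough sketch, assuming the model was saved under the name "test_birds" as in your code:

import gc
import torch

# Free cached GPU memory, then reload the best checkpoint written by SaveModelCallback.
gc.collect()               # drop Python references to stale tensors
torch.cuda.empty_cache()   # release unused blocks held by PyTorch's caching allocator
learn.load('test_birds')   # manually load the best weights saved during training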

Your suggestion solved the issue for me.

Interesting. I had opened an issue on GitHub about needing to remove the .fastai directory to solve this problem. @sgugger closed it, saying: “CUDA out of memory means you have no memory on the GPU you are using. It needs a restart of the kernel; removing the .fastai directory will have no effect on that.”

Of course, he knows best :wink: