CUDA error: Out of memory on last epoch of training

I am training an EfficientNet network on an RTX 2080 Ti. Until last night everything worked and training completed successfully. Now it fails with CUDA error: Out of memory. For the B0 model with a batch size of 16, 32, or 64, the error always occurs on the last epoch: if I train for 50 epochs, the error appears on the 50th epoch, while the first 49 epochs run fine.
Here is the code:

from fastai.vision import *
from fastai.callbacks import *
from fastai.metrics import error_rate
from efficientnet_pytorch import EfficientNet
from ranger import Ranger
from radam import RAdam
#from optimizer import Lookahead

#base_optimizer = Radam
#opt = Lookahead(base_optimizer=base_optimizer,k=5,alpha=0.5)

#opt = Ranger()
spath = '/home/awais/Desktop/UCSD_Birds/datasets/bird_70_30/images/'
tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.1, max_warp=0.)

data = ImageDataBunch.from_folder(path=spath, train='train', valid='val', ds_tfms=tfms, 
            test=None, valid_pct=None, no_check=True, size=(224,224), num_workers=6, bs=64)

data = data.normalize(imagenet_stats)

model = EfficientNet.from_pretrained('efficientnet-b2', num_classes=200)

opt = partial(Ranger, betas=(0.95, 0.99), eps=1e-6)
#opt = partial(RAdam)
#opt = optim.Adam
learn = Learner(data, model, metrics=[error_rate, accuracy, top_k_accuracy], callback_fns=[CSVLogger])

learn.fit_one_cycle(50, max_lr=1e-4, callbacks=[SaveModelCallback(learn, monitor='error_rate', mode='min', name="test_birds")])
learn.save("trained_model", return_path=True)

I fixed my Out of Memory error by blowing away the fastai directory at $HOME/.fastai

It’s a bit extreme and it may not fix it for you. You’ll have to reinstall fastai. Please report back if it helped or not.

It's hard to say without the stack trace of the error. My guess is that the memory error is caused by SaveModelCallback.on_train_end, which loads the best model once training is complete. That would explain why the failure only shows up at the end of the last epoch.
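
One way to get the stack trace asked for above is to wrap the training call and write the full traceback to a file before re-raising. This is a minimal sketch using only the standard library. The names run_and_log_traceback and train_error.log are hypothetical, and fake_training is only a stand-in for a call such as learn.fit_one_cycle(...).

```python
import traceback
from pathlib import Path


def run_and_log_traceback(fn, log_path="train_error.log"):
    """Call fn(); on failure, save the full traceback to log_path and re-raise."""
    try:
        return fn()
    except Exception:
        # format_exc() returns the same text Python would print to stderr.
        Path(log_path).write_text(traceback.format_exc())
        raise


def fake_training():
    # Hypothetical stand-in for learn.fit_one_cycle(...); it raises the
    # kind of error reported in this thread.
    raise RuntimeError("CUDA error: out of memory")


if __name__ == "__main__":
    try:
        run_and_log_traceback(fake_training)
    except RuntimeError:
        # The saved file shows which frame raised the error.
        print(Path("train_error.log").read_text())
```

If the guess above is right, the saved traceback should end in frames belonging to the callback's on_train_end rather than in the training loop itself.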

Solved the issue by following your suggestion.

Interesting. I had opened an issue on GitHub about needing to remove the .fastai directory to solve this problem. @sgugger closed it, saying: “CUDA out of memory means you have no memory on the GPU you are using. It needs a restart of the kernel, removing the .fastai directory will have no effect on that.”

Of course, he knows best :wink:
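
To check whether GPU memory is really still occupied (for example by a stale kernel) before starting a new run, one option is to query nvidia-smi. The sketch below assumes nvidia-smi is on the PATH and reports plain integers in MiB. The helper names parse_memory_csv and gpu_memory are hypothetical.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' MiB pairs, one GPU per line, into a list of tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(part) for part in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory():
    """Return [(used_mib, total_mib), ...], or None if nvidia-smi is unavailable."""
    cmd = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
           "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_memory_csv(out)


if __name__ == "__main__":
    # One (used, total) pair per GPU; high "used" before training starts
    # suggests another process or an old kernel is still holding memory.
    print(gpu_memory())
```

If the used figure is already high before training begins, restarting the kernel (or stopping the other process) is the step the quoted reply points to.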