Train_loss and valid_loss become nan from the second epoch, fastai2

mobby · January 24, 2021, 6:23am

Hi, I’m training a two-class segmentation network with fastai2. My main code is:
////////////////////////////
learn.lr_find()
lr = 1e-3
learn.fit_flat_cos(1, slice(lr))

lr=1e-4
lrs = slice(lr)
learn.unfreeze()
learn.fit_flat_cos(3, lrs)
//////////////////////
When it reaches the last line of code, something confusing happens. The train_loss and valid_loss of the first epoch are 0.4458 and 0.6682, but both become nan during the second and third epochs. No other errors are thrown during training.

My configuration is:
fastai 2.2.2

Is there something I don’t understand?

Thanks.

nishith006 · January 24, 2021, 10:11am

Hey, I think you most probably ran out of CUDA memory, or you are using the wrong metrics. Can you share a screenshot of your training code? You can also try to debug your dataloaders by running dblock.summary(path) to check that everything is all right.
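
To narrow down where the nan first appears, something like the sketch below might help. It assumes the per-batch loss values are available as a plain list of numbers (in fastai they could possibly be read from learn.recorder.losses, but that is an assumption about your setup). The helper name first_non_finite and the example values are made up for illustration.

```python
import math

def first_non_finite(losses):
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for i, value in enumerate(losses):
        # float() also accepts scalar tensors, so recorded losses can be passed as-is
        if not math.isfinite(float(value)):
            return i
    return None

# Hypothetical per-batch loss values, for illustration only
example_losses = [0.52, 0.47, 0.45, float("nan"), float("nan")]
idx = first_non_finite(example_losses)
print(idx)
```

Knowing the batch index where the loss first stops being finite makes it easier to inspect that batch, or the learning rate and memory usage at that point.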

mobby · January 24, 2021, 11:11pm

Many thanks.
Yes, I checked the GPU memory and it was nearly full. Training works normally after I reduced the batch size.