NaN valid_loss after a few epochs

I’m building a food classifier based on the Food-101 dataset, using the ‘resnet26d’ architecture and running on a Colab GPU.

from fastai.vision.all import *

# DataBlock for Food-101: squish-resize to 256, then augment down to 128
foods = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=labeller,  # my labelling function, defined earlier
    item_tfms=Resize(256, method='squish'),
    batch_tfms=aug_transforms(size=128, min_scale=0.75))

dls = foods.dataloaders(food101)  # food101 points at the downloaded Food-101 images

# Pretrained resnet26d (a timm model), trained in mixed precision
learn = vision_learner(dls, 'resnet26d', metrics=error_rate).to_fp16()

I ran lr_find to pick a starting learning rate.
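Roughly what I ran (a minimal sketch, assuming fastai's default 'valley' suggestion; I didn't pass any custom suggest_funcs):

# Run the LR finder and read off the suggested rate
suggestion = learn.lr_find()   # returns a SuggestedLRs namedtuple
print(suggestion.valley)       # fastai's default suggestion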

I tried fine-tuning the model with an LR of 0.05 and noticed that valid_loss was NaN after 1 epoch.
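For context, the call was roughly this (a sketch; the epoch count is illustrative, not my exact run):

# fine_tune: 1 frozen epoch for the new head (the fastai default),
# then the full model is unfrozen and trained at base_lr
learn.fine_tune(10, base_lr=0.05)   # epoch count is illustrative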


I then tried reducing the LR to 0.01; valid_loss was NaN for a few epochs, then valid values showed up again in the last few epochs:
[screenshot of the training results]

My questions are:

  • Is this NaN valid_loss actually a problem, given that error_rate and train_loss are still improving after each epoch?
  • Why is valid_loss NaN? Is it caused by too low or too high an LR?
  • Some other threads on the forum about NaN valid loss mention running out of CUDA memory on the GPU. How do I check whether that applies here / to my model? (Is something like the check sketched below the right approach?)
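For the memory question, this is the kind of check I had in mind (a sketch using PyTorch's CUDA memory counters; not sure it's the right way to diagnose it):

import torch

# Current / peak GPU memory as tracked by PyTorch (run right after training)
print(f"allocated:      {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"reserved:       {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

# In a Colab cell, overall GPU usage can also be checked with:
# !nvidia-smi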

Looking for some pointers; I'd appreciate any thoughts/advice!