NaN valid_loss after a few epochs

I’m building a food classifier based on the Food-101 dataset, using the ‘resnet26d’ architecture and running on a Colab GPU.

from fastai.vision.all import *

# DataBlock for Food-101: squish-resize to 256, then augment down to 128
foods = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=labeller,  # my labelling function, defined earlier
    item_tfms=Resize(256, method='squish'),
    batch_tfms=aug_transforms(size=128, min_scale=0.75))

dls = foods.dataloaders(food101)  # food101 points at the downloaded Food-101 images

# Pretrained resnet26d (a timm model), trained in mixed precision
learn = vision_learner(dls, 'resnet26d', metrics=error_rate).to_fp16()

I ran lr_find to pick a starting learning rate.
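Roughly what I ran (a minimal sketch, assuming fastai's default 'valley' suggestion; I didn't pass any custom suggest_funcs):

# Run the LR finder and read off the suggested rate
suggestion = learn.lr_find()   # returns a SuggestedLRs namedtuple
print(suggestion.valley)       # fastai's default suggestion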

I tried fine-tuning the model with an LR of 0.05 and noticed that valid_loss was NaN after 1 epoch.
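For context, the call was roughly this (a sketch; the epoch count is illustrative, not my exact run):

# fine_tune: 1 frozen epoch for the new head (the fastai default),
# then the full model is unfrozen and trained at base_lr
learn.fine_tune(10, base_lr=0.05)   # epoch count is illustrative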


I then tried reducing the LR to 0.01; valid_loss was NaN for a few epochs, then valid values showed up again in the last few epochs:
[screenshot of the training results]

My questions are:

  • Is this NaN valid_loss actually a problem, given that error_rate and train_loss are still improving after each epoch?
  • Why is valid_loss NaN? Is it caused by too low or too high an LR?
  • Some other threads on the forum about NaN valid loss mention running out of CUDA memory on the GPU. How do I check whether that applies here / to my model? (Is something like the check sketched below the right approach?)
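For the memory question, this is the kind of check I had in mind (a sketch using PyTorch's CUDA memory counters; not sure it's the right way to diagnose it):

import torch

# Current / peak GPU memory as tracked by PyTorch (run right after training)
print(f"allocated:      {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"reserved:       {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

# In a Colab cell, overall GPU usage can also be checked with:
# !nvidia-smi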

Looking for some pointers; I'd appreciate any thoughts/advice!