What are the recommended ways to deal with NaNs appearing in our validation loss?

I'm running the CAMVID notebook locally on a 1080 Ti GPU and getting nan reported for the validation loss at every epoch. I reduced the LR by 10x and was able to get things working, but I would like to know …

1) Why does this occur?

2) What are the recommended steps to deal with it?

It means your LR is too high, and you need to reduce it :slight_smile: (Make sure you have the latest fastai)
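
For anyone else hitting this, here is a minimal sketch of what reducing the LR looks like with the fastai v1 API used in the course-v3 CAMVID notebook. The data pipeline below is paraphrased from the lesson notebook, and the specific numbers (size, bs, lr, epochs) are only illustrative:

```python
from fastai.vision import *

# Build the CamVid segmentation DataBunch (roughly as in lesson3-camvid).
path = untar_data(URLs.CAMVID)
codes = np.loadtxt(path/'codes.txt', dtype=str)
get_y_fn = lambda x: path/'labels'/f'{x.stem}_P{x.suffix}'

data = (SegmentationItemList.from_folder(path/'images')
        .split_by_fname_file('../valid.txt')
        .label_from_func(get_y_fn, classes=codes)
        .transform(get_transforms(), size=(360, 480), tfm_y=True)
        .databunch(bs=8)
        .normalize(imagenet_stats))

learn = unet_learner(data, models.resnet34)

# If valid_loss comes back as nan, drop the max learning rate,
# e.g. by a factor of 10, and train again.
lr = 3e-3
learn.fit_one_cycle(10, slice(lr/10))
```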

Yah I have the latest …

I also found that reducing the batch size remedies this issue as well.
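
If it helps anyone, in fastai v1 the batch size is fixed when the DataBunch is built, so that is the place to change it. A sketch, where `src` stands for the labelled SegmentationItemList from earlier in the notebook (the variable name and sizes are illustrative):

```python
# Rebuild the DataBunch with a smaller batch size, e.g. 8 instead of 16,
# which posters here report as another way to get a finite valid_loss.
data = (src.transform(get_transforms(), size=(360, 480), tfm_y=True)
        .databunch(bs=8)
        .normalize(imagenet_stats))
```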

Huh - that’s odd. And interesting.

This is your moment, @wgpubs! Time to do some research! You might find something really cool if you can replicate the results and show them to everyone.

Hi Mr. Howard, in lesson 2 of course-v3, when I follow the example code in that notebook I get #na# for the valid_loss whether I use a high or a low LR, and therefore I can't get the valid_loss curve :worried:

Reducing the batch size from 16 to 8 got the validation loss back. Not sure if this is a bug in fastai.

I ran into the same issue as above. Could you please tell me why it happens?

I found the reason Jeremy pointed out earlier. He said the main cause is the learning rate setting. Don't give it too high a value! Otherwise, the ball may bounce off into another world and never come back: NaN!

Hi @hitgszf, thank you for your response. I am facing a similar problem. Do you mind explaining where I could update the learning rate?

Thank you!

lr_find does not use the validation data, so the validation loss will show up as #na# there. Don't worry about it. The validation loss should not be NaN during actual training.
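
A quick illustration of the difference (the epoch count and lr value are just placeholders):

```python
# lr_find runs a short mock training pass on the training set only,
# so its progress table reports #na# for valid_loss; that is expected.
learn.lr_find()
learn.recorder.plot()            # pick an lr from the loss-vs-lr curve

# Real training computes the validation loss every epoch; it should be
# a finite number here. nan at this stage means the lr is still too high.
learn.fit_one_cycle(10, slice(1e-4))
```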

Hey @PalaashAgrawal, thank you! Have a good day!