Fastai examples dogscats NaN Loss

My CPU is an AMD Phenom II X6 1055t. I got the same problem with Python 3.6.

I wonder if it's an AMD issue. Are you using Anaconda? Try a different BLAS library: https://docs.anaconda.com/mkl-optimizations/ . Please let me know if any of these fixes the issue.
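
If you want to double-check which BLAS backend NumPy actually ended up linked against after switching packages, something like this should show it:

import numpy as np

# Print the build configuration; the blas/lapack sections show whether
# MKL, OpenBLAS, or another backend is in use.
np.show_config()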

Model name: Intel® Xeon® CPU @ 2.50GHz

In my case, I am on GCP with an Intel CPU, using Python 3.6 and 3.7.

I also tried checking out tag 1.0.5, and I still get NaN.

Yes, I'm using Anaconda.

First test: Install nomkl packages
conda install nomkl numpy scipy scikit-learn numexpr
Result: NaN loss

Second test: Install openblas
conda install -c anaconda openblas
Result: NaN loss
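
As a quick sanity check that is independent of the BLAS choice, you can run a single batch through the model by hand and see whether the loss itself comes out as NaN. A rough sketch, assuming a fastai v1 learn object from the dogscats notebook:

import torch

# Pull one batch from the training DataBunch and push it through the model.
xb, yb = learn.data.one_batch()
xb, yb = xb.to(learn.data.device), yb.to(learn.data.device)
preds = learn.model(xb)
loss = learn.loss_func(preds, yb)

# torch.isnan flags NaNs element-wise; .any() tells us whether the loss blew up.
print('loss:', loss.item(), 'NaN:', torch.isnan(loss).any().item())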

Could not uninstall MKL:

The following packages will be REMOVED:

mkl: 2019.0-118
mkl_fft: 1.0.1-py36h3010b51_0 anaconda
mkl_random: 1.0.1-py36h629b387_0 anaconda
pytorch-nightly: 1.0.0.dev20181015-py3.6_cuda9.2.148_cudnn7.1.4_0 pytorch [cuda92]
torchvision-nightly: 0.2.1-py_0 fastai

Try now - update from master first. Hopefully it's fixed (I can't test since I can't repro the bug).

It's running without problems now. Thanks, Jeremy!

Latest version:

[Screenshot: Screenshot_20181016_194042]

Old version, with ds_tfms=get_transforms(max_lighting=0) (see the sketch below):

[Screenshot: Screenshot_20181016_194723]
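
For reference, a minimal sketch of how that workaround is wired up when building the data (fastai v1 API; path is a placeholder for the dogscats folder):

from fastai.vision import *

# max_lighting=0 disables the brightness/contrast augmentation that was
# producing the NaNs before the clipping fix landed.
tfms = get_transforms(max_lighting=0)
data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=224)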

Thanks! It is fixed now. I tried to look at the commits you made yesterday, but it is not obvious to me which one fixes the issue. I am interested in what was causing this.

Thank you. 🙂

For some reason, there was some numerical instability in the lighting transforms. The fix is the clipping introduced here.
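
For the curious: the v1 lighting transforms operate on the logit of the pixel values, and logit(0)/logit(1) are infinite, so pixels at the edges of [0, 1] can turn into inf/NaN and poison the loss. The kind of clipping involved looks roughly like this (an illustration, not the actual fastai source):

import torch

def safe_logit(x, eps=1e-7):
    # Clamp away from 0 and 1 before the log, so lighting changes applied
    # in logit space can never produce inf or NaN.
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))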

Ah, thank you! I didn't realize the clipping was what fixed this issue. So this instability seems to depend somehow on other things (hardware?), since it doesn't seem to be an issue for quite a few people.

Something like that. Or perhaps some BLAS issue.

Has anyone else been suffering a sudden NaN loss with other datasets?

I'm working with a large (200k) dataset for binary classification that gracefully descends a loss curve from 0.10 to 0.03, and in about 1 in 5 runs the loss suddenly goes NaN even though the previous epochs had descended nicely. Granted, that's a low per-batch likelihood, but a high per-run one. It never happened to me pre-1.0, and I haven't touched the loss_func. Once, it came back from NaN after a few epochs as if nothing had happened. My transforms are dihedral plus 10% bands of brightness and contrast change, with resnet34, using the latest fastai and PyTorch builds with fp16. Perhaps there is a clipping parameter I can set?
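
(For context: the usual knob for this kind of occasional blow-up is gradient clipping. A minimal plain-PyTorch sketch of the idea, not a fastai-specific API:)

import torch

def training_step(model, loss_func, opt, xb, yb, max_norm=1.0):
    loss = loss_func(model(xb), yb)
    loss.backward()
    # Rescale gradients so their global norm never exceeds max_norm,
    # which guards against the occasional exploding batch under fp16.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    opt.step()
    opt.zero_grad()
    return loss.item()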

Plenty of changes under the hood in v1, so hyperparams likely need to change. Try lowering your learning rate by 10x.
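
Concretely, if you were training with something like max_lr=1e-2, that would mean (illustrative values, using fastai v1's fit_one_cycle):

# before: learn.fit_one_cycle(5, max_lr=1e-2)
learn.fit_one_cycle(5, max_lr=1e-3)  # same schedule, 10x smaller peak learning rate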

But when I have unfrozen the layers and run learn.lr_find(), my valid loss shows as #na#. How can I get my valid loss?
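
(For reference, lr_find only runs training batches, so the recorder shows #na# for the validation loss. To get an actual validation loss you can run a validation pass; in fastai v1 something like:)

# Run the model over the validation set without training; returns
# [valid_loss, metric_1, ...].
val_loss, *metrics = learn.validate()
print(val_loss)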

I'm encountering the same problem now, on Windows 11, with the latest version of fastai, just running the dogscats example.