Fastai examples dogscats NaN Loss

My CPU is an AMD Phenom II X6 1055t. I got the same problem with Python 3.6.

I wonder if it's an AMD issue. Are you using Anaconda? Try a different BLAS library: https://docs.anaconda.com/mkl-optimizations/ . Please let me know if any of these fixes the issue.
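
If you want to double-check which BLAS backend NumPy actually ended up linked against after switching packages, something like this should show it:

import numpy as np

# Print the build configuration; the blas/lapack sections show whether
# MKL, OpenBLAS, or another backend is in use.
np.show_config()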

Model name: Intel® Xeon® CPU @ 2.50GHz

In my case, I am on GCP with an Intel CPU, using Python 3.6 and 3.7.

I also tried checking out tag 1.0.5, and I still get NaN.

Yes, I'm using Anaconda.

First test: Install nomkl packages
conda install nomkl numpy scipy scikit-learn numexpr
Result: NaN loss

Second test: Install openblas
conda install -c anaconda openblas
Result: NaN loss
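
As a quick sanity check that is independent of the BLAS choice, you can run a single batch through the model by hand and see whether the loss itself comes out as NaN. A rough sketch, assuming a fastai v1 learn object from the dogscats notebook:

import torch

# Pull one batch from the training DataBunch and push it through the model.
xb, yb = learn.data.one_batch()
xb, yb = xb.to(learn.data.device), yb.to(learn.data.device)
preds = learn.model(xb)
loss = learn.loss_func(preds, yb)

# torch.isnan flags NaNs element-wise; .any() tells us whether the loss blew up.
print('loss:', loss.item(), 'NaN:', torch.isnan(loss).any().item())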

Could not uninstall MKL:

The following packages will be REMOVED:

mkl: 2019.0-118
mkl_fft: 1.0.1-py36h3010b51_0 anaconda
mkl_random: 1.0.1-py36h629b387_0 anaconda
pytorch-nightly: 1.0.0.dev20181015-py3.6_cuda9.2.148_cudnn7.1.4_0 pytorch [cuda92]
torchvision-nightly: 0.2.1-py_0 fastai

Try now - update from master first. Hopefully it's fixed (I can't test since I can't repro the bug).

It's running without problems now. Thanks, Jeremy!

Latest version:

[Screenshot: Screenshot_20181016_194042]

Old version, with ds_tfms=get_transforms(max_lighting=0) (see the sketch below):

[Screenshot: Screenshot_20181016_194723]
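
For reference, a minimal sketch of how that workaround is wired up when building the data (fastai v1 API; path is a placeholder for the dogscats folder):

from fastai.vision import *

# max_lighting=0 disables the brightness/contrast augmentation that was
# producing the NaNs before the clipping fix landed.
tfms = get_transforms(max_lighting=0)
data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=224)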

Thanks! It is fixed now. I tried to look at the commits you made yesterday, but it is not obvious to me which one fixes the issue. I am interested in what was causing this.

Thank you. 🙂

For some reason, there was some numerical instability in the lighting transforms. The fix is the clipping introduced here.
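
For the curious: the v1 lighting transforms operate on the logit of the pixel values, and logit(0)/logit(1) are infinite, so pixels at the edges of [0, 1] can turn into inf/NaN and poison the loss. The kind of clipping involved looks roughly like this (an illustration, not the actual fastai source):

import torch

def safe_logit(x, eps=1e-7):
    # Clamp away from 0 and 1 before the log, so lighting changes applied
    # in logit space can never produce inf or NaN.
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))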

Ah, thank you! I didn't realize the clipping was what fixed this issue. So this instability seems to depend somehow on other things (hardware?), since it doesn't seem to be an issue for quite a few people.

Something like that. Or perhaps some BLAS issue.

Has anyone else been suffering a sudden NaN loss with other datasets?

I'm working with a large (200k) dataset for binary classification that gracefully descends a loss curve from 0.10 to 0.03, and in about 1 in 5 runs the loss suddenly goes NaN even though the previous epochs had descended nicely. Granted, that's a low per-batch likelihood, but a high per-run one. It never happened to me pre-1.0, and I haven't touched the loss_func. Once, it came back from NaN after a few epochs as if nothing had happened. My transforms are dihedral plus 10% bands of brightness and contrast change, with resnet34, using the latest fastai and PyTorch builds with fp16. Perhaps there is a clipping parameter I can set?
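
(For context: the usual knob for this kind of occasional blow-up is gradient clipping. A minimal plain-PyTorch sketch of the idea, not a fastai-specific API:)

import torch

def training_step(model, loss_func, opt, xb, yb, max_norm=1.0):
    loss = loss_func(model(xb), yb)
    loss.backward()
    # Rescale gradients so their global norm never exceeds max_norm,
    # which guards against the occasional exploding batch under fp16.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    opt.step()
    opt.zero_grad()
    return loss.item()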

Plenty of changes under the hood in v1, so hyperparams likely need to change. Try lowering your learning rate by 10x.
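
Concretely, if you were training with something like max_lr=1e-2, that would mean (illustrative values, using fastai v1's fit_one_cycle):

# before: learn.fit_one_cycle(5, max_lr=1e-2)
learn.fit_one_cycle(5, max_lr=1e-3)  # same schedule, 10x smaller peak learning rate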

But when I have unfrozen the layers and run learn.lr_find(), my valid loss shows as #na#. How can I get my valid loss?
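
(For reference, lr_find only runs training batches, so the recorder shows #na# for the validation loss. To get an actual validation loss you can run a validation pass; in fastai v1 something like:)

# Run the model over the validation set without training; returns
# [valid_loss, metric_1, ...].
val_loss, *metrics = learn.validate()
print(val_loss)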

I'm encountering the same problem now, on Windows 11, with the latest version of fastai, just running the dogscats example.