My CPU is an AMD Phenom II X6 1055t. I got the same problem with python 3.6.
I wonder if it's an AMD issue. Are you using Anaconda? Try a different BLAS library: https://docs.anaconda.com/mkl-optimizations/ . Please let me know if any of these fixes the issue.
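If it helps, you can confirm which BLAS your NumPy is actually linked against before and after swapping libraries by printing its build config (the exact output format varies by NumPy version and build):

```python
# Show the BLAS/LAPACK libraries NumPy was built against.
# Look for "mkl" vs "openblas" entries in the output.
import numpy as np

np.show_config()
```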
Model name: Intel® Xeon® CPU @ 2.50GHz
In my case, I'm on GCP with an Intel CPU, on Python 3.6 and 3.7.
I also tried checking out tag 1.0.5; I still get NaN.
Yes, I'm using Anaconda.
First test: Install nomkl packages
conda install nomkl numpy scipy scikit-learn numexpr
Result: NaN loss
Second test: Install openblas
conda install -c anaconda openblas
Result: NaN loss
Could not uninstall MKL:
The following packages will be REMOVED:
mkl: 2019.0-118
mkl_fft: 1.0.1-py36h3010b51_0 anaconda
mkl_random: 1.0.1-py36h629b387_0 anaconda
pytorch-nightly: 1.0.0.dev20181015-py3.6_cuda9.2.148_cudnn7.1.4_0 pytorch [cuda92]
torchvision-nightly: 0.2.1-py_0 fastai
Try now - update from master first. Hopefully it's fixed (I can't test since I can't repro the bug).
It's running without problems now. Thanks Jeremy!
Latest version:
Old version with ds_tfms=get_transforms(max_lighting=0)
Thx! It's fixed now. I tried to look at the commits you made yesterday, but it's not obvious to me which commit fixes the issue. I'm interested in what was causing this.
Thank you.
For some reason, there was some numerical instability in the lighting transforms. The fix is the clipping introduced here.
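For anyone reading along, the idea is just to clamp pixel values back into a valid range after the lighting change. A minimal sketch of the principle (my illustration, not the actual fastai commit):

```python
import numpy as np

def brighten(img, change):
    """Toy lighting transform: shift brightness, then clip to [0, 1].
    Without the clip, out-of-range pixel values can feed into later
    ops and destabilize things numerically."""
    return np.clip(img + change, 0.0, 1.0)

pixels = np.array([0.2, 0.9, 0.5])
print(brighten(pixels, 0.3))  # the 0.9 + 0.3 = 1.2 gets clipped to 1.0
```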
Ah, thank you! I didn't realize the clipping was fixing this issue. So this instability seems to depend on something else (hardware?), since it doesn't seem to be an issue for quite a few people.
Something like that. Or perhaps some BLAS issue.
Has anyone else been suffering a sudden NaN loss with other datasets?
I'm working with a large (200k) dataset for binary classification that gracefully descends a loss curve from .10 to .03, and in about 1 in 5 runs the loss suddenly goes NaN even though previous epochs descended nicely. Granted, that's a low per-batch likelihood but a high per-run one. It never happened to me pre-1.0, and I haven't touched the loss_func. Once, it came back from NaN after a few epochs as if nothing had happened. My transforms are dihedral plus 10% bands of brightness and contrast change, with resnet34. Using the latest fastai and pytorch builds, with fp16. Perhaps there is a clipping parameter I can set?
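(As an aside on the fp16 angle: half precision overflows to inf above ~65504, and inf - inf gives NaN, so a single oversized activation or gradient can poison a whole run. A quick demo of the failure mode:)

```python
import numpy as np

with np.errstate(over="ignore"):
    big = np.float16(60000) * np.float16(2)  # exceeds float16 max (~65504)
print(big)        # inf
print(big - big)  # inf - inf -> nan
```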
Plenty of changes under the hood in v1, so likely hyperparams need to change. Try lowering your learning rate by 10x.
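To build intuition for why the learning rate matters here: even on a convex toy problem, a step size past the stability threshold makes the iterates blow up until the float overflows and you get NaN, exactly the symptom above. A minimal sketch with plain gradient descent, nothing fastai-specific:

```python
def descend(lr, steps=2000, w=1.0):
    """Gradient descent on f(w) = w**2 (gradient is 2*w).
    Each step multiplies w by (1 - 2*lr), so |1 - 2*lr| > 1 diverges."""
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(descend(lr=1.5))   # blows up and ends as nan, like a diverging loss
print(descend(lr=0.15))  # 10x smaller: converges toward 0
```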
But when I unfreeze the layers and run learn.lr_find(), my valid loss shows as #NA. How can I get my valid loss?
I'm encountering the same problem now, on Windows 11, with the latest version of fastai, just running the dogcats example.