I asked this in another category yesterday, but I think this may be the more appropriate place.
I got a NaN loss when running through the example in fastai/example/dogscats.ipynb.
I am running this on the Google Deep Learning Image with the latest git pull, and I have checked that the library is loading from the git directory (so it is the updated version rather than the pip one).
I’m running the notebook right now without any problem on master. I’d need more information to see where this is coming from. Please also pull the latest version of fastai.
PyTorch version: 1.0.0.dev20181013
Is debug build: No
CUDA used to build PyTorch: 9.2.148
OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: Could not collect
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: GeForce GTX 1070
Nvidia driver version: 410.48
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
I tried iterating through data.train_ds and printing a message whenever a NaN appeared, and I found that the offending index is not fixed, so I suspect it is related to tfms.
I then checked get_transforms(), removed all the transformations, and got no more NaNs. Sorry if this is too messy, it’s midnight here; I can tidy it up tomorrow if needed. But since you guys don’t have this issue, maybe it’s just something that didn’t get merged into master?
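The NaN scan described above can be sketched roughly like this. This is a self-contained stand-in, not the actual notebook code: `find_nan_indices`, the toy dataset, and the `(x, y)` sample shape are illustrative assumptions; in the real case the samples would be fastai/PyTorch tensors and you would sum with `x.sum()`.

```python
import math

def find_nan_indices(dataset):
    """Return the indices of samples whose inputs contain a NaN.

    `dataset` is assumed to yield (x, y) pairs; here x is a plain
    list of floats so the sketch runs without fastai installed.
    """
    bad = []
    for i, (x, y) in enumerate(dataset):
        # Summing propagates any NaN, so one check covers the sample.
        if math.isnan(sum(x)):
            bad.append(i)
    return bad

# Toy dataset: sample 1 contains a NaN, as a corrupting transform might produce.
toy = [([0.1, 0.2], 0), ([float("nan"), 0.3], 1), ([0.4, 0.5], 0)]
print(find_nan_indices(toy))  # [1]
```

Because transforms are applied on the fly with random parameters, running this scan twice over the same dataset can flag different indices, which is what points the suspicion at tfms rather than the files on disk.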
Interesting test. When I print those sums, I get high numbers but no NaNs.
Are you in half precision by any chance? Or can you try to redownload the data (remove dogscats.tgz and the folder to force it)?
@nok You are right, I can train without a problem if I comment out the max_lighting part of get_transforms(). ds_tfms=get_transforms(max_lighting=0) works fine too.
@sgugger I still got the NaN loss after redownloading the data.
Test: GCP Deep Learning Image with latest git pull
Result: still getting NaN
I redownloaded the dogscats data and still get the same error. The tensor is on the CPU at that point, so I don’t think it’s related to fp16.
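For what it’s worth, lighting transforms are typically applied in logit space, and logits blow up at the extremes of the pixel range. The sketch below is only a hedged illustration of how that could in principle produce a NaN; the `logit` helper and its boundary handling are my own stand-ins, not fastai’s implementation, and this is not a confirmed diagnosis of the bug above.

```python
import math

def logit(p):
    # Mimic IEEE/torch semantics at the boundaries instead of raising:
    # log(0) is -inf in torch, so logit(0) -> -inf and logit(1) -> +inf.
    if p <= 0.0:
        return float("-inf")
    if p >= 1.0:
        return float("inf")
    return math.log(p / (1.0 - p))

# A fully black pixel combined with a maximal brightness shift:
# (-inf) + (+inf) is NaN under IEEE floating-point arithmetic.
shifted = logit(0.0) + logit(1.0)
print(math.isnan(shifted))  # True
```

A single NaN pixel then poisons every downstream sum, activation, and ultimately the loss, which matches the symptom of the loss going NaN only when max_lighting is enabled.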