I asked this in another category yesterday, but I think this may be the more appropriate place.
I get a NaN loss when running through the example in fastai/example/dogscats.ipynb.
I am running this on the Google Deep Learning Image with the latest git pull, and I have checked that the library is being imported from that directory (so it is the updated version, not the pip one).
PyTorch version: 1.0.0.dev20181013
Is debug build: No
CUDA used to build PyTorch: 9.2.148
OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: Could not collect
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: GeForce GTX 1070
Nvidia driver version: 410.48
cuDNN version: Probably one of the following:
I tried iterating over data.train_ds and printing a message whenever a NaN was found, and the index where it appears is not fixed between runs, so I suspect it is related to the tfms.
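For reference, this is roughly the check I ran; a minimal sketch, assuming `data` is the ImageDataBunch built in the notebook:

```python
import torch

# Scan the training set and report any sample whose image tensor
# contains NaNs. The failing index changes between runs, which points
# at the random transforms rather than a corrupt file.
for i in range(len(data.train_ds)):
    x, y = data.train_ds[i]
    if torch.isnan(x.data).any():
        print(f'NaN at index {i}, sum = {x.data.sum()}')
```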
So I checked get_transforms(), removed all the transformations, and got no NaNs anymore. Sorry if this is too messy, it's midnight here; I can tidy it up tomorrow if needed. But it seems you guys don't have this issue, so maybe it's just something that hasn't been merged into master?
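By removing the transformations I mean something like this; a sketch, assuming the same folder layout the notebook uses (`path` from untar_data):

```python
from fastai.vision import *

# Rebuild the DataBunch with ds_tfms=None so no augmentation runs,
# then repeat the NaN scan; with this I get no NaNs at all.
data_no_tfms = ImageDataBunch.from_folder(path, ds_tfms=None, size=224)
for i in range(len(data_no_tfms.train_ds)):
    x, y = data_no_tfms.train_ds[i]
    if torch.isnan(x.data).any():
        print(f'NaN at index {i}')
```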
Interesting test. When I print those sums, I get high numbers but no NaNs.
Are you in half precision by any chance? Or can you try to redownload the data (remove dogscats.tgz and the folder to force it)?
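If it helps, a minimal sketch of both checks; the paths are assumptions based on fastai's default download location, and `learn` is assumed to be the Learner from the notebook:

```python
import shutil
from pathlib import Path
from fastai.vision import *

# 1) Check whether the model is running in half precision:
#    torch.float16 here would mean fp16 is active.
print(next(learn.model.parameters()).dtype)

# 2) Force a fresh download: remove the archive and the extracted
#    folder, then untar the data again (URLs.DOGS is assumed to be
#    the dogscats dataset, as in the notebook).
data_dir = Path.home()/'.fastai'/'data'
(data_dir/'dogscats.tgz').unlink()
shutil.rmtree(data_dir/'dogscats')
path = untar_data(URLs.DOGS)
```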