Fastai examples dogscats NaN Loss

I asked this in another category yesterday, but I think this may be the more appropriate place.

I got a NaN loss when running through the example in fastai/example/dogscats.ipynb.
I am running this on the Google Deep Learning Image with the latest git pull, and I have checked that the library is being imported from the local directory (so it is the updated version rather than the pip one).

Please remove my previous post if needed.

I can confirm that. I posted about it some days ago in the Developer chat.

I saw the notebook was updated today and thought it had been re-run… maybe those outputs are just old records then…

Thanks for confirming that!

I’m running the notebook right now without any problem on master. I’d need more information to see where it’s coming from. Please also pull the latest version of fastai.


Using the latest version of fastai.

This is my current environment:

PyTorch version: 1.0.0.dev20181013
Is debug build: No
CUDA used to build PyTorch: 9.2.148

OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: GeForce GTX 1070
Nvidia driver version: 410.48
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

I tried with Python 3.6 and CUDA 9.2 too.

Also running here without problems.


May I ask what command I need to run to get this output?

(From the PyTorch repository)

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

I have pulled the latest version and re-run it, but I still get NaN. The loss does not go straight to NaN; instead it shows normal numbers for a while and then suddenly goes to NaN.

Thanks!

I tried iterating over data.train_ds and printing whenever a NaN was found, and the index where it happens is not fixed across runs, so I suspect it is related to the tfms.
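Roughly, the check looks like this (a minimal sketch rather than my exact code; it assumes data is the DataBunch from the notebook and that indexing train_ds applies the tfms):

import torch

for i in range(len(data.train_ds)):
    x, y = data.train_ds[i]            # the tfms are applied on each access
    if torch.isnan(x.data).any():      # x.data is the underlying image tensor
        print(f"NaN at index {i}")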

So I checked get_transforms() and removed all the transformations, and the NaNs went away. Sorry if this is too messy, it's midnight here; I can tidy this up tomorrow if needed. But since you guys don't seem to have this issue, maybe it's just something that didn't get merged into master?

And we have our own:

git clone https://github.com/fastai/fastai
cd fastai
python -c 'import fastai; fastai.show_install(0)'

which gives:

platform    : Linux-4.15.0-36-generic-x86_64-with-debian-buster-sid
distro      : Ubuntu 18.04 Bionic Beaver
python      : 3.6.6
fastai      : 1.0.6.dev0
torch       : 1.0.0.dev20181013
nvidia dr.  : 396.44
torch cuda  : 9.2.148
nvcc  cuda  : 9.2.148
torch gpus  : 1
  [gpu0]
  name      : GeForce GTX 1070 Ti
  total mem : 8119MB

If you pass 1 it'll also dump the nvidia-smi output:

python -c 'import fastai; fastai.show_install(1)'

Interesting test. When I print those sums, I get high numbers but no NaNs.
Are you in half precision by any chance? Or can you try redownloading the data (remove the dogscats.tgz and the folder to force it)?

@nok You are right, I can train without problems if I comment out the max_lighting part in get_transforms().
ds_tfms=get_transforms(max_lighting=0) works fine too.
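For reference, the workaround is just to build the DataBunch with the lighting transform disabled (a minimal sketch using the fastai v1 API; the exact constructor may differ slightly between 1.0.x versions):

from fastai.vision import *

path = untar_data(URLs.DOGS)                 # dogscats data used by the notebook
data = ImageDataBunch.from_folder(
    path,
    ds_tfms=get_transforms(max_lighting=0),  # turning off max_lighting avoids the NaNs
    size=224,
)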

@sgugger I still got the NaN loss after redownloading the data.

Is everyone having this problem using Google Cloud?

I'm not on Google Cloud. This is my current environment: link

Do you need more details about my PC?

I have tested it again:

Test: GCP Deep Learning Image with latest git pull
Result: Still get NaN
I redownloaded the dogscats data and still get the same error. The tensors are on the CPU at this point, so I don't think it's related to fp16.
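A quick way to check that (a small sketch, assuming data is the DataBunch from the notebook):

x, y = data.train_ds[0]
print(x.data.dtype, x.data.device)   # prints torch.float32 and cpu, so no fp16 involved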

It seems like the issue comes from GCP since it only happens there. Jeremy has told them about it so that we can sort this out.

GCP = Google Cloud Platform?

I am not on Google Cloud.

Oh interesting. @elmarculino what kind of CPU do you have? Can you try py36 and see if you still have the problem?