I asked this in another category yesterday, but I think this may be the more appropriate place.
I got a NaN loss when running through the example in fastai/example/dogscats.ipynb.
I am running this on the Google Deep Learning Image with the latest git pull, and I have checked that the library is loading from the git directory (so it is the updated version rather than the pip one).
I’m running the notebook right now without any problem on master. I’d need more information to see where this is coming from. Please also pull the latest version of fastai.
PyTorch version: 1.0.0.dev20181013
Is debug build: No
CUDA used to build PyTorch: 9.2.148
OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: Could not collect
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: GeForce GTX 1070
Nvidia driver version: 410.48
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
I tried iterating through data.train_ds and printing a message whenever a NaN appeared, and I found that the offending index is not fixed, so I suspect it is related to tfms.
I then checked get_transforms(), removed all the transformations, and got no more NaNs. Sorry if this is too messy, it’s midnight here; I can tidy it up tomorrow if needed. But since you guys don’t have this issue, maybe it’s just something that didn’t get merged into master?
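The NaN scan described above can be sketched roughly like this. This is a self-contained stand-in, not the actual notebook code: `find_nan_indices`, the toy dataset, and the `(x, y)` sample shape are illustrative assumptions; in the real case the samples would be fastai/PyTorch tensors and you would sum with `x.sum()`.

```python
import math

def find_nan_indices(dataset):
    """Return the indices of samples whose inputs contain a NaN.

    `dataset` is assumed to yield (x, y) pairs; here x is a plain
    list of floats so the sketch runs without fastai installed.
    """
    bad = []
    for i, (x, y) in enumerate(dataset):
        # Summing propagates any NaN, so one check covers the sample.
        if math.isnan(sum(x)):
            bad.append(i)
    return bad

# Toy dataset: sample 1 contains a NaN, as a corrupting transform might produce.
toy = [([0.1, 0.2], 0), ([float("nan"), 0.3], 1), ([0.4, 0.5], 0)]
print(find_nan_indices(toy))  # [1]
```

Because transforms are applied on the fly with random parameters, running this scan twice over the same dataset can flag different indices, which is what points the suspicion at tfms rather than the files on disk.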
Interesting test. When I print those sums, I get high numbers but no NaNs.
Are you in half precision by any chance? Or can you try to redownload the data (remove dogscats.tgz and the folder to force it)?
@nok You are right, I can train without a problem if I comment out the max_lighting part of get_transforms(). ds_tfms=get_transforms(max_lighting=0) works fine too.
@sgugger I still got the NaN loss after redownloading the data.
Test: GCP Deep Learning Image with latest git pull
Result: still getting NaN
I redownloaded the dogscats data and still get the same error. The tensor is on the CPU at that point, so I don’t think it’s related to fp16.
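For what it’s worth, lighting transforms are typically applied in logit space, and logits blow up at the extremes of the pixel range. The sketch below is only a hedged illustration of how that could in principle produce a NaN; the `logit` helper and its boundary handling are my own stand-ins, not fastai’s implementation, and this is not a confirmed diagnosis of the bug above.

```python
import math

def logit(p):
    # Mimic IEEE/torch semantics at the boundaries instead of raising:
    # log(0) is -inf in torch, so logit(0) -> -inf and logit(1) -> +inf.
    if p <= 0.0:
        return float("-inf")
    if p >= 1.0:
        return float("inf")
    return math.log(p / (1.0 - p))

# A fully black pixel combined with a maximal brightness shift:
# (-inf) + (+inf) is NaN under IEEE floating-point arithmetic.
shifted = logit(0.0) + logit(1.0)
print(math.isnan(shifted))  # True
```

A single NaN pixel then poisons every downstream sum, activation, and ultimately the loss, which matches the symptom of the loss going NaN only when max_lighting is enabled.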