Getting some NaN, where to start investigating?

I am getting NaN for a very small portion of my results. What tooling could help me figure out what layers might be causing this?

Thanks,

Try this: https://pytorch.org/docs/stable/autograd.html#torch.autograd.detect_anomaly
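A minimal sketch of how you might use it (the model and data here are just placeholders, not from the original post): wrap the forward and backward pass in the anomaly-detection context manager, and any backward function that produces NaN will raise a RuntimeError naming the offending function.

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Anomaly mode records a traceback for each forward op and raises as soon as
# a backward function returns NaN. It slows training, so use it only to debug.
with torch.autograd.detect_anomaly():
    x = torch.randn(8, 4)
    loss = model(x).sum()
    loss.backward()
opt.step()
```

You can also enable it globally with `torch.autograd.set_detect_anomaly(True)` instead of the context manager.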


Thanks! Very useful! Already gave me something to investigate:

RuntimeError: Function 'CudnnBatchNormBackward' returned nan values in its 1th output.

Turns out that if I remove learner = learner.to_fp16() and just train in 32-bit, autograd.detect_anomaly() doesn't complain.

You might have large numbers turning into inf when going from 32 bits to 16 bits (the largest finite 16-bit floating point value is only 65504).

(It might also be small numbers underflowing to 0 for the same reason.)
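A quick check (not from the original thread) of what casting to half precision does to values near the edges of the fp16 range:

```python
import torch

x32 = torch.tensor([65504.0, 70000.0, 1e-8])  # max fp16, too large, too small
x16 = x32.half()
print(x16)  # tensor([65504., inf, 0.], dtype=torch.float16)
```

The second value overflows to inf and the third underflows to 0, which is the kind of thing that can later propagate into NaN in a batch-norm backward pass.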

The easiest solution is to not use 16 bits here :sweat_smile:
