Getting some NaN, where to start investigating?

I am getting NaN for a very small portion of my results. What tooling could help me figure out what layers might be causing this?

Thanks,

Try this: https://pytorch.org/docs/stable/autograd.html#torch.autograd.detect_anomaly
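A minimal sketch of how you might use it (the model and data here are just placeholders, not from the original post): wrap the forward and backward pass in the anomaly-detection context manager, and any backward function that produces NaN will raise a RuntimeError naming the offending function.

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Anomaly mode records a traceback for each forward op and raises as soon as
# a backward function returns NaN. It slows training, so use it only to debug.
with torch.autograd.detect_anomaly():
    x = torch.randn(8, 4)
    loss = model(x).sum()
    loss.backward()
opt.step()
```

You can also enable it globally with `torch.autograd.set_detect_anomaly(True)` instead of the context manager.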


Thanks! Very useful! Already gave me something to investigate:

RuntimeError: Function 'CudnnBatchNormBackward' returned nan values in its 1th output.

Turns out that if I remove learner = learner.to_fp16() and just train in 32-bit, autograd.detect_anomaly() doesn't complain.

You might have large numbers turning into inf when going from 32 bits to 16 bits (the largest finite 16-bit floating point value is only 65504).

(It might also be small numbers underflowing to 0 for the same reason.)
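A quick check (not from the original thread) of what casting to half precision does to values near the edges of the fp16 range:

```python
import torch

x32 = torch.tensor([65504.0, 70000.0, 1e-8])  # max fp16, too large, too small
x16 = x32.half()
print(x16)  # tensor([65504., inf, 0.], dtype=torch.float16)
```

The second value overflows to inf and the third underflows to 0, which is the kind of thing that can later propagate into NaN in a batch-norm backward pass.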

The easiest solution is to not use 16 bits here :sweat_smile:
