Debugging CUDA runtime error: device-side assert triggered

(Malcolm McLean) #1

…and others such as
RuntimeError: reduce failed to synchronize: device-side assert triggered

The error message is non-specific, and the stack dump shows the wrong spot.

Answer: Move the model and data to the CPU and rerun. This can be done by removing .cuda() from the model and adding .cpu() to the data. The error is then raised at the offending line, and the message becomes more specific, such as

RuntimeError: Assertion "x >= 0. && x <= 1." failed. input value should be between 0~1, but got -1.058900 at /opt/conda/conda-bld/pytorch-nightly_1554613755097/work/aten/src/THNN/generic/BCECriterion.c:62

You will need to restart the Jupyter kernel to clear the error from CUDA.
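As a rough illustration of the CPU rerun, the sketch below passes an out-of-range value to BCELoss on the CPU. The helper loss_on_cpu and the names model, inputs, and targets are placeholders for illustration, not taken from any particular notebook, and the exact wording of the error depends on the PyTorch version.

```python
import torch
import torch.nn as nn


def loss_on_cpu(model, inputs, targets, loss_fn):
    """Run one forward pass and loss computation entirely on the CPU."""
    model = model.cpu()                            # instead of model.cuda()
    inputs, targets = inputs.cpu(), targets.cpu()  # move the batch as well
    return loss_fn(model(inputs), targets)


# Placeholder setup: an identity "model" whose output includes a value
# outside [0, 1], which BCELoss does not accept.
model = nn.Identity()
inputs = torch.tensor([0.3, -1.0589, 0.9])
targets = torch.tensor([0.0, 1.0, 1.0])

try:
    loss_on_cpu(model, inputs, targets, nn.BCELoss())
except RuntimeError as err:
    # On the CPU the exception is raised synchronously, so the traceback
    # points at the loss call and the message describes the bad input.
    print("caught:", err)
```

Once the bad value is found and fixed, the .cuda() calls can go back in.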

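Alternatively, you can add
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

at the top of the notebook, before anything touches CUDA, and restart the kernel. This makes kernel launches synchronous, so the stack trace should point at the call that actually failed, at the cost of slower execution. I have not tested this second method.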
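A minimal sketch of that first cell follows. The helper enable_blocking_launches is a hypothetical name introduced here. It assumes that, to be safe, the variable should be set before torch is imported, so it also reports whether torch was already loaded.

```python
import os
import sys


def enable_blocking_launches():
    """Set CUDA_LAUNCH_BLOCKING=1; return False if torch was imported first."""
    torch_already_imported = "torch" in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return not torch_already_imported


if not enable_blocking_launches():
    # Assumption: setting the variable this late may have no effect.
    print("torch was imported before the variable was set; "
          "restart the kernel and run this cell first.")
```

Outside a notebook, the same variable can be set in the shell that launches the script.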

These answers have been discovered and rediscovered many times in the forums. Also, see…

Here they are in one spot, and hopefully findable by forum search.

The underlying issue is that Python calls into C code that launches GPU work asynchronously. By the time the GPU hits an error, CPU execution may be far past the line that caused it. In addition, the error messages returned by the C module tend to be uninformative, since it cannot tell exactly what went wrong on the GPU.
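One way to narrow down the failing step without leaving the GPU is to force a synchronization after each suspect call, so that a pending device error is reported there rather than at some later line. The sketch below assumes a hypothetical helper, run_and_sync. torch.cuda.synchronize() is the real PyTorch call, and the helper does nothing extra when no GPU is available.

```python
import torch
import torch.nn.functional as F


def run_and_sync(label, fn, *args, **kwargs):
    """Call fn, then wait for queued GPU work so pending errors surface here."""
    result = fn(*args, **kwargs)
    if torch.cuda.is_available():
        try:
            torch.cuda.synchronize()
        except RuntimeError as err:
            # Re-raise with the step name so the failing call is easy to spot.
            raise RuntimeError(f"CUDA error surfaced after step '{label}'") from err
    return result


# Usage with placeholder tensors (on the CPU here, so no sync is needed).
x = torch.tensor([0.2, 0.8])
y = torch.tensor([0.0, 1.0])
loss = run_and_sync("loss", F.binary_cross_entropy, x, y)
```

Wrapping the forward pass, the loss, and the backward pass separately should show which one triggers the assert. The kernel still needs a restart afterwards.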