CuDNN Error

balnazzar · June 18, 2018, 11:14pm

I encountered the error you may see below a couple of times on two different systems.

Both times, that error produced the following consequences:

Occupied CUDA memory was not released after a kernel restart.
The GPU still showed some load even though no other process was using it (on neither system is it even connected to a monitor).
The notebook server refused to exit on Ctrl-C.
Once I manually killed the notebook server, the system hung completely.
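
For anyone who hits the same symptoms, one way to check whether some process is still holding GPU memory is to query `nvidia-smi`. The following is only a rough sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_compute_apps` and `gpu_compute_processes` are made up for illustration and are not part of fastai or PyTorch.

```python
import shutil
import subprocess


def parse_compute_apps(text):
    """Parse 'pid, process_name, used_memory' CSV rows (no header) into tuples."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        pid, used_memory = parts[0], parts[-1]
        name = ",".join(parts[1:-1])  # tolerate commas inside the process name
        rows.append((pid, name, used_memory))
    return rows


def gpu_compute_processes():
    """Return processes currently holding GPU memory, or None if nvidia-smi cannot be queried."""
    if shutil.which("nvidia-smi") is None:
        return None  # assumption: the NVIDIA driver tools are not installed here
    try:
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-compute-apps=pid,process_name,used_memory",
                "--format=csv,noheader",
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    print(gpu_compute_processes())
```

If the list still contains a PID after the kernel restart, that process (often a leftover Python kernel) is probably what keeps the memory allocated. An empty list while memory stays occupied would point more towards the driver than towards the notebook.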

Any idea about what could have caused that error? Note that it almost completed the epoch, and memory occupation rose to 80% but remained stable till the error.

Something similar was already asked in "Cudnn_status_execution_failed", but no one answered. Let me try again.

sayko · June 19, 2018, 4:47am

What version of PyTorch?
You can always ask on the PyTorch forum; I think they would be more responsive.

balnazzar · June 19, 2018, 1:41pm

Thank you for your reply. It’s the version which comes with fastai, that is 0.3.1

I’ll ask on pytorch forums, too, but I was interested in knowing if some other fastai user encountered it.

jls · October 30, 2018, 12:22am

Hi balnazzar. Did you solve the problem? I’ve encountered the same problem again and again, in different situations.

balnazzar · October 30, 2018, 12:40am

No, frankly. It seems to have been solved with newer versions of cudnn and cuda.

jls · October 31, 2018, 6:23am

Thank you for your reply, balnazzar.

I’ve been using fast.ai v0.7 following this guide, and the v0.7 library seems to use older versions of CUDA and cuDNN. So you just upgraded CUDA and cuDNN and the problem went away?
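
To see which versions an environment actually uses, something like the snippet below should work. It is a minimal sketch: `report_versions` is a name I made up, and it assumes nothing beyond `torch` being importable (if it is not, every field stays `None`).

```python
def report_versions():
    """Collect the PyTorch, CUDA and cuDNN versions visible to this Python environment."""
    info = {"torch": None, "cuda": None, "cudnn": None, "gpu_available": False}
    try:
        import torch
    except ImportError:
        return info  # PyTorch is not installed in this environment
    info["torch"] = str(torch.__version__)
    info["cuda"] = torch.version.cuda  # CUDA version PyTorch was built with (None for CPU-only builds)
    info["cudnn"] = torch.backends.cudnn.version()  # integer version code, or None if cuDNN is unavailable
    info["gpu_available"] = bool(torch.cuda.is_available())
    return info


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(key, value)
```

Note that these are the versions PyTorch itself was built against, which may differ from the CUDA toolkit installed system-wide.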

balnazzar · October 31, 2018, 10:24am

Yes, BUT I changed hardware too. Same GPUs, but newer host.

jls · November 5, 2018, 9:14am

I see. Thank you!

chandlertu · January 1, 2020, 4:31am

As of 2020/1/1, I have just encountered this issue; screenshot below.

CUDA 9.0