I have encountered the error shown below a couple of times, on two different systems.
Both times, that error produced the following consequences:
Occupied CUDA memory was not released after a kernel restart
The GPU kept showing some load even though no process was using it (on both systems it is not even connected to a monitor).
The notebook server refused to exit on Ctrl-C
Once I manually killed the notebook server, the system hung completely.
Any idea what could have caused this error? Note that the epoch was almost complete, and memory usage rose to 80% but remained stable until the error.
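For anyone hitting the same symptoms, a minimal sketch for checking whether stale processes are still holding GPU memory after a kernel restart. It assumes `nvidia-smi` is on the PATH; the helper name and query flags are my own choice, not part of fastai:

```python
import shutil
import subprocess


def gpu_memory_report():
    """List PIDs holding GPU memory via nvidia-smi, if available."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found on this machine"
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    # An empty listing means no compute process holds GPU memory,
    # so lingering load/occupation points at a driver-level problem.
    return result.stdout.strip() or "no compute process holds GPU memory"


print(gpu_memory_report())
```

If the listing is empty but `nvidia-smi` still shows memory in use, the leak is below the process level and usually only a driver reload or reboot clears it.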
A similar question was already asked (Cudnn_status_execution_failed) but no one answered. Let me try again.
What version of PyTorch?
You can always ask on the PyTorch forums; I think they would be more responsive:
Thank you for your reply. It’s the version that ships with fastai, i.e. 0.3.1.
I’ll ask on pytorch forums, too, but I was interested in knowing if some other fastai user encountered it.
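Since cuDNN errors like this are often version mismatches between PyTorch, CUDA, and cuDNN, a small sketch for reporting the versions actually in use (the function name is mine; the `torch` attributes are standard PyTorch API):

```python
def report_versions():
    """Return the PyTorch / CUDA / cuDNN versions in use, or None if absent."""
    try:
        import torch
    except ImportError:
        return {"torch": None, "cuda": None, "cudnn": None}
    cudnn = torch.backends.cudnn
    return {
        "torch": torch.__version__,
        # torch.version.cuda is None on CPU-only builds
        "cuda": torch.version.cuda,
        "cudnn": cudnn.version() if cudnn.is_available() else None,
    }


print(report_versions())
```

Including this output when posting on the PyTorch forums usually speeds up diagnosis.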
Hi balnazzar. Did you solve the problem? I’ve encountered the same problem again and again, in different situations.
Frankly, no. It seems to have been solved by newer versions of CUDA and cuDNN.
Thank you for your reply, balnazzar.
I’ve been using fast.ai v0.7 following this guide, and the v0.7 library seems to use older versions of CUDA and cuDNN. So you just upgraded CUDA and cuDNN and the problem went away?
Yes, BUT I changed hardware too. Same GPUs, but a newer host.
On 2020/1/1, I just encountered this issue as well; screenshot below.