Nasty cuDNN bug on my Ubuntu 16.04 home rig

Hello everyone,

I had been enjoying my little dual-boot Win10 + Ubuntu server with a GTX 1080 Ti for the last two weeks, until it became unstable this morning, so I ran a bunch of “sudo apt-get install/update/upgrade” commands.
I can’t recall at what stage things went really wrong, but suddenly, when starting notebooks, I got flooded with pink-box messages such as:

INFO (theano.gof.compilelock): Waiting for existing lock by process ‘2478’ (I am process ‘2680’)
INFO (theano.gof.compilelock): To manually release the lock, delete /home/eric/.theano/compiledir_Linux-4.8–generic-x86_64-with-debian-stretch-sid-x86_64-2.7.13-64/lock_dir
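If stale notebook kernels are fighting over the compile lock, releasing it by hand is usually safe once every Theano process has been stopped, as the INFO message suggests. A minimal sketch (the compiledir name below is a placeholder; use the exact path printed in your own INFO message):

```shell
# Placeholder path -- substitute the compiledir from your own INFO message.
COMPILEDIR="$HOME/.theano/compiledir_example"

# Stop every notebook/Theano process first, then remove only the lock
# directory; the cached compiled modules themselves can stay.
rm -rf "$COMPILEDIR/lock_dir" && echo "lock released"
```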

or

#define _CUDA_NDARRAY_C

#include <Python.h>
#include <structmember.h>
#include "theano_mod_helper.h"

#include <numpy/arrayobject.h>
#include <iostream>

#include "cuda_ndarray.cuh"

#ifndef CNMEM_DLLEXPORT
#define CNMEM_DLLEXPORT
#endif

#include "cnmem.h"
#include "cnmem.cpp"

//If true, when there is a gpu malloc or free error, we print the size of allocated memory on the device.
#define COMPUTE_GPU_MEM_USED 0

//If true, we fill with NAN allocated device memory.
#define ALLOC_MEMSET 0

//If true, we print out when we free a device pointer, uninitialize a
//CudaNdarray, or allocate a device pointer
#define PRINT_FREE_MALLOC 0

//If true, we do error checking at the start of functions, to make sure there
//is not a pre-existing error when the function is called.
//You probably need to set the environment variable
//CUDA_LAUNCH_BLOCKING=1, and/or modify the CNDA_THREAD_SYNC
//preprocessor macro in cuda_ndarray.cuh
//if you want this to work.
#define PRECHECK_ERROR 0

cublasHandle_t handle = NULL;
int* err_var = NULL;
(…)

I did multiple reinstalls of Theano + Keras + CUDA: no success.

Then I wiped out Anaconda2 entirely, using the “Anaconda-clean” package from
https://docs.continuum.io/anaconda/install
followed by a brutal “rm -rf ~/anaconda2”.

Then I did TWO complete reinstalls using the super-practical “bash install-gpu.sh” from wiki.fast.ai
http://wiki.fast.ai/index.php/Ubuntu_installation

And more tweaking here and there.

Now I can run Lesson 1 cell #7 again, the “state of the art custom model in 7 lines of code” with one epoch of Vgg16.
It is slower than before (307 sec vs. 205 sec), but at least it runs.
But I keep having a nasty cuDNN message at launch:

Can not use cuDNN on context None: cannot compile with cuDNN. We got this error:
/tmp/try_flags_JuwE3B.c:4:19: fatal error: cudnn.h: No such file or directory
compilation terminated.

Mapped name None to device cuda: GeForce GTX 1080 Ti (0000:01:00.0)
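That “cudnn.h: No such file or directory” means the compiler can’t see the cuDNN headers, so the usual fix is to copy the files from NVIDIA’s cuDNN tarball into the CUDA tree. A hedged sketch, assuming the common /usr/local/cuda-8.0 layout (adjust the path to wherever your CUDA actually lives):

```shell
CUDA_HOME=/usr/local/cuda-8.0   # assumption: adjust to your CUDA install path

if [ -f "$CUDA_HOME/include/cudnn.h" ]; then
    CUDNN_STATUS=found
    echo "cudnn.h found in $CUDA_HOME/include"
else
    CUDNN_STATUS=missing
    # The cuDNN tarball from NVIDIA unpacks into a directory named "cuda";
    # copying its contents into the CUDA tree makes the header visible.
    echo "cudnn.h missing; from the unpacked cuDNN tarball run:"
    echo "  sudo cp cuda/include/cudnn.h $CUDA_HOME/include/"
    echo "  sudo cp -P cuda/lib64/libcudnn* $CUDA_HOME/lib64/"
fi
```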

Has anyone encountered that?

Eric

I don’t know what went wrong on your machine, but I have struggled with installing on Linux too. Here is how I made everything work for me:

  1. Reinstall Ubuntu 16.04 (after installation, don’t change the GPU driver; the display may look ugly for now, but that’s okay).
  2. Follow this link to install everything, step by step. (I didn’t run the .sh file; instead I copy-pasted each command and ran it in the terminal, so I knew exactly what was going on.) The script is well written, so you should be able to install everything and get the environment ready for this course.
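After running through the steps, a quick sanity check that the driver and the toolkit are both visible can save a lot of guessing. A sketch using the standard NVIDIA commands; the fallback messages are just for machines where a piece is absent:

```shell
# Check the driver and the CUDA toolkit independently, since one can be
# installed (or broken) without the other.
if command -v nvidia-smi >/dev/null 2>&1; then
    DRIVER_INFO=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
else
    DRIVER_INFO="NVIDIA driver not loaded"
fi
echo "driver:  $DRIVER_INFO"

if command -v nvcc >/dev/null 2>&1; then
    TOOLKIT_INFO=$(nvcc --version | grep -i release)
else
    TOOLKIT_INFO="CUDA toolkit (nvcc) not on PATH"
fi
echo "toolkit: $TOOLKIT_INFO"
```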

Good luck!


Hi @EricPB, I have encountered similar problems. I had to revert to the old Theano backend instead of the new gpuarray backend. Try changing to the old gpu backend for now (just set the device to gpu in your .theanorc file). Also, it seems that Theano is unable to access cuDNN.

Below is my .theanorc file:
[global]
floatX=float32
device=gpu
optimizer=fast_run

[lib]
cudnn=True
cnmem=0.0

[cuda]
root = /usr/local/cuda-8.0/include

[nvcc]
fastmath=True

[blas]
ldflags = -llapack -lblas

[dnn]
enabled=True
library_path=/usr/local/cuda-8.0/lib64
include_path=/usr/local/cuda-8.0/include
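Once the file is in place, you can check whether Theano actually picked up the device setting by reading back theano.config, which reflects the effective configuration. A small sketch (it prints a notice instead of failing on machines where Theano isn’t importable):

```shell
# Read back the device Theano resolved from .theanorc; fall back to a
# notice when Theano (or python itself) is not available here.
THEANO_DEVICE=$(python -c "import theano; print(theano.config.device)" 2>/dev/null) \
    || THEANO_DEVICE="theano not importable in this environment"
echo "device: $THEANO_DEVICE"
```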

Also, what driver version are you using?

Please let me know if you are still having problems


Many thanks for the tips, @shushi2000 and @rteja1113.
I ended up upgrading to Python 3.6 and Keras 2.0 for Part #2, and used updated notebooks from @Robi.

So far so good.

Eric
