[Help wanted] Simplify Swift for TensorFlow / Harebrain installation

(Jeremy Howard (Admin)) #1

Currently, installing tf-gpu is quite a process. Students don’t already have cudnn/cuda installed since that’s inside their pytorch conda env. cudnn is particularly annoying to install since it’s behind a registration wall. (I’ve put a copy on our public file server to make life a bit easier, but I’m not sure it’s officially allowed…)

I suspect we could make life easy by simply leveraging the existing official tensorflow-gpu conda install experience. Here’s a look at how easy that is:

Since that installs cudnn and cuda automatically, we could add a flag to register.py that points LD_LIBRARY_PATH there (or similar). Would that be easy? Would it work?
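For illustration, here is a minimal shell sketch of what "pointing LD_LIBRARY_PATH there" could look like. It assumes the conda env keeps its shared libraries under $CONDA_PREFIX/lib, and prepend_lib_dir is a hypothetical helper, not something register.py provides.

```shell
#!/usr/bin/env bash
# Sketch: put a Conda environment's lib directory first on LD_LIBRARY_PATH.
# prepend_lib_dir is a hypothetical helper, not part of register.py.

# Print NEW_DIR followed by the existing search path. If NEW_DIR is already
# listed, print the path unchanged so it does not grow on every call.
prepend_lib_dir() {
    local new_dir="$1" current="${2:-}"
    case ":$current:" in
        *":$new_dir:"*) printf '%s\n' "$current" ;;
        *) printf '%s\n' "${new_dir}${current:+:$current}" ;;
    esac
}

# Only touch the environment when a Conda env is actually active.
# Assumption: the env's libraries live in $CONDA_PREFIX/lib.
if [ -n "${CONDA_PREFIX:-}" ]; then
    LD_LIBRARY_PATH="$(prepend_lib_dir "$CONDA_PREFIX/lib" "${LD_LIBRARY_PATH:-}")"
    export LD_LIBRARY_PATH
fi
```

Prepending rather than appending matters here because the loader walks the LD_LIBRARY_PATH entries in order, so the first directory that contains a matching library is the one that gets used.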

0 Likes

(Marc Rasi) #2

Since I’m about to install S4TF with CUDA on a VM (in order to play with notebook 06), I’m going to see if I can do it with conda, and report back on how it went. If I succeed, I’ll update the swift-jupyter installation instructions to describe how to do it.

1 Like

(Marc Rasi) #3

Okay, I made some changes to register.py that simplify use with Conda and that make it possible for Swift to use a Conda-installed CUDA. I updated the Conda install instructions to explain that you can install CUDA using Conda, and I made sure that it worked for me on my VM.

PR: https://github.com/google/swift-jupyter/pull/54

3 Likes

(Jeremy Howard (Admin)) #4

That’s so awesome. I’ll try it out!

0 Likes

(Marc Rasi) #5

Oh no, I am unable to run convolutions when I install using this method. It says “Fatal error: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.” (There is no warning log message printed above.)

This discussion suggests that the problem is that the cudnn in conda is too old for TF: https://github.com/tensorflow/tensorflow/issues/24828

I tried downgrading to CUDA 9.2, but that didn’t help.

Some searching on the internet suggests that it’s impossible to install any cudnn later than 7.3 in conda. I don’t know if there’s a fundamental reason, or just no one has gotten around to putting a later cudnn in conda.

Anyways, I think I will have to change the instructions to suggest installing CUDA/CUDNN using non-conda means.
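One way to see which cuDNN version a given install actually provides is to read the version macros from its cudnn.h header. The sketch below assumes a cuDNN 7 style header that defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL; cudnn_header_version is a hypothetical helper, and the header locations in the comments are guesses that will differ between setups.

```shell
#!/usr/bin/env bash
# Sketch: print the cuDNN version recorded in a cudnn.h header as MAJOR.MINOR.PATCH.
# cudnn_header_version is a hypothetical helper name.
cudnn_header_version() {
    local header="$1"
    # Fail early if the header is missing or unreadable.
    [ -r "$header" ] || return 1
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END {
            # Exit non-zero if any of the three macros was not found.
            if (major == "" || minor == "" || patch == "") exit 1
            print major "." minor "." patch
        }
    ' "$header"
}

# Possible header locations (assumptions, adjust for your machine):
#   cudnn_header_version "$CONDA_PREFIX/include/cudnn.h"
#   cudnn_header_version /usr/local/cuda/include/cudnn.h
#   cudnn_header_version /usr/include/cudnn.h
```

Comparing the output for the conda env with the output for the system-wide install should show whether the two disagree.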

1 Like

(Marc Rasi) #6

I confirm that I can convolve when I install CUDNN 7.5 manually. I have updated the readme to suggest not installing CUDNN using Conda.

1 Like

(Jeremy Howard (Admin)) #7

Wow amazing that the conda package has been in that state for months! FYI fastai and pytorch both run their own conda channels for just this reason - it’s best to keep full control over your packages.

I’ve got a few versions of cudnn available here fyi: http://files.fast.ai/files/ .

0 Likes

(Jeff Lee) #8

When I run the following in jupyter and repl:

import TensorFlow
var l1 = Conv2D<Float>(filterShape: (5, 5, 1, 32))
var out = l1.applied(to: Tensor<Float>(zeros: [1, 28, 28, 1]))

I get the same error you mentioned

“Fatal error: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.: file /swift-base/swift/stdlib/public/TensorFlow/CompilerRuntime.swift, line 2043”

Any suggestions on what I should look into to figure out what’s broken?

I did set up swift-jupyter in a conda environment, but I’m not sure if that has anything to do with this problem, since I can’t even use the conv layer in the repl.

My setup:
Ubuntu 18.04
Cuda 10.0
Nvidia Driver 410.104
Cudnn 7.5 edit: I even tried downgrading to cudnn 7.4.1.5

Note: I did not install cudnn with conda, though I have on this box in the past. I am able to run the cuda mnist sample without issues, so my cuda/cudnn install appears to be ok. I even installed tensorflow-gpu for python and ran some convs successfully in python.

0 Likes

#9

I had everything running smoothly up until I used a conv layer, and the issue was cuDNN, so I’d double-check your cuDNN install if I were you.

0 Likes

(Jeff Lee) #10

Wouldn’t any install issue have shown up when I try running the cuda mnist sample “mnistCUDNN” since it uses cudnn? And I am able to use conv layers all day long with a python tf install using the gpu. Any specific suggestions for troubleshooting the install?

0 Likes

#11

I followed Jeff Heaton’s most recent software installation tutorial for my Win 10 GTX 1070 laptop here https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class01_intro_python.ipynb and replaced his pip install --upgrade tensorflow==1.12.0 with conda install tensorflow-gpu, which takes care of the needed CUDA and CUDNN inside your dedicated environment, provided that you have already installed the NVIDIA video driver, CUDA driver and CUDNN.

EDIT: my bad @metachi , you are looking for Swift help in this topic! I had problems with tensorflow-gpu under Python and misread the title and the first post from Jeremy. Sorry. Good luck with Swift!

0 Likes

(Jeff Lee) #12

I could be wrong, but I believe Swift for TensorFlow does not play nice with conda installations of cuda/cudnn since conda installs cudnn 7.3 and S4TF needs cudnn 7.5. Are you using S4TF with that cuda/cudnn install from conda?

0 Likes

(Jeff Lee) #13

I just tried uninstalling cuda cudnn etc. and following these instructions to reinstall cuda/cudnn, but I am still seeing the same issue.

Were there specific install instructions you followed for cuda/cudnn?

Edit: I have tried just about every variation of install I can think of including directly downloading cuda, cudnn, etc. from nvidia’s website. I’ve tried multiple versions of cudnn and am still getting the same error message about cudnn failing to initialize.

0 Likes

(Pedro Cuenca) #14

I had the same problem, and it was fixed by installing cuDNN 7.5 as others said.

I installed cuDNN globally (at /usr/local/cuda), and then ensured my LD_LIBRARY_PATH was pointing to that location inside my virtualenv environment.

Are you sure your installation is seeing the same cuDNN versions you installed? Can you run the following inside a Swift notebook (use the appropriate library path for your setup) and see if test gets a value?

import func Glibc.dlopen
import func Glibc.dlsym
import var Glibc.RTLD_NOW

let handle = dlopen("/usr/local/cuda/lib64/libcudnn.so", RTLD_NOW)
let test = dlsym(handle, "cudnnCreateAttnDescriptor")

One other factor that comes to mind: are you using a nightly swift toolchain? (I am)

2 Likes

(Jeff Lee) #15

I have the nightly build (built from source) and the more stable one. In each, if I run the code above, test gets a value.

Notably however, if I run the code you provided in the swift nightly version before attempting any convolutions, I don’t get the error message about cudnn initialization. If I try to convolve before doing so, I do get it.

I really appreciate the help. I’ve been banging my head against this one all day thinking I missed an installation step.

0 Likes

(Pedro Cuenca) #16

So it looks like a different cuDNN library is being loaded by default, but loading it manually does work. Did you verify your LD_LIBRARY_PATH?

Perhaps locate libcudnn.so could shed some light on any conflicting version that could be installed in your system.
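If locate is not installed or its database is stale, a plain find over a few likely directories does a similar job. This is only a sketch: find_cudnn_copies is a hypothetical helper, and the directories in the usage comment are guesses about where copies tend to end up.

```shell
#!/usr/bin/env bash
# Sketch: list every libcudnn.so* file or symlink under the given directories.
# find_cudnn_copies is a hypothetical helper name.
find_cudnn_copies() {
    local dir
    for dir in "$@"; do
        # Skip directories that do not exist on this machine.
        [ -d "$dir" ] || continue
        find "$dir" -name 'libcudnn.so*' \( -type f -o -type l \) 2>/dev/null
    done | sort
}

# Example call (the directories are assumptions, adjust for your machine):
#   find_cudnn_copies /usr/lib /usr/local "$HOME/anaconda3"
```

More than one distinct version number in the listed file names is a hint that an older copy may be getting picked up first.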

1 Like

(Jeff Lee) #17

It is located at /usr/local/cuda/lib64/libcudnn.so.

If I run echo $LD_LIBRARY_PATH I get:
/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64

0 Likes

#18

Thanks, @pcuenq. This fixed it for me. In case it’s helpful for others, here are the details:

Like you suggested, I used locate to find my libcudnn.so, which was in /usr/lib/x86_64-linux-gnu/libcudnn.so, which I think is where it ends up if it is installed via apt-get. There were older versions elsewhere on the system.

Then I used your snippet to verify it was the right version. Here’s a script version of your snippet for that, which you can optionally point at a library by passing in a command line argument:

#!/usr/bin/env swift
import func Glibc.dlopen
import func Glibc.dlsym
import var Glibc.RTLD_NOW

var pathToLibcuddnso = "/usr/lib/x86_64-linux-gnu/libcudnn.so"
let symbolName = "cudnnCreateAttnDescriptor"

if CommandLine.arguments.count >= 2 {
    pathToLibcuddnso = CommandLine.arguments[1]
}

print("Will try to load libcudnn.so at path: \(pathToLibcuddnso)")

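let handle = dlopen(pathToLibcuddnso, RTLD_NOW)
// Only look the symbol up if the library actually loaded.
let symbolPtr: UnsafeMutableRawPointer? = (handle == nil) ? nil : dlsym(handle, symbolName)

let cudnnIsAtLeastVersion750 = (symbolPtr != nil)

if handle == nil {
    print("Could not load \(pathToLibcuddnso)")
} else if cudnnIsAtLeastVersion750 {
    print("Installed cuDNN version is at least 7.5.0, since the library \(pathToLibcuddnso) contains the symbol \(symbolName)")
} else {
    print("The library \(pathToLibcuddnso) does not contain the symbol \(symbolName), so it is probably older than 7.5.0")
}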

Finally, instead of setting LD_LIBRARY_PATH, I added a file /etc/ld.so.conf.d/cudnn.conf which only contained the path to the library, /usr/lib/x86_64-linux-gnu/libcudnn.so.
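A note for anyone copying this step: as far as I know, files under /etc/ld.so.conf.d are read as lists of directories, one per line, rather than paths to individual library files, and the loader cache has to be rebuilt with ldconfig afterwards. A minimal sketch under that assumption follows; write_ld_conf is a hypothetical helper, and the paths are the ones mentioned above.

```shell
#!/usr/bin/env bash
# Sketch: write the directory that contains a library into an ld.so.conf.d style file.
# write_ld_conf is a hypothetical helper name.
write_ld_conf() {
    local lib_path="$1" conf_file="$2" lib_dir
    # ld.so.conf entries are directories, so strip the file name.
    lib_dir="$(dirname "$lib_path")"
    printf '%s\n' "$lib_dir" > "$conf_file"
}

# On a real system (needs root):
#   write_ld_conf /usr/lib/x86_64-linux-gnu/libcudnn.so /etc/ld.so.conf.d/cudnn.conf
#   ldconfig                       # rebuild the loader cache
#   ldconfig -p | grep libcudnn    # show the cuDNN entries the loader now knows about
```

Unlike LD_LIBRARY_PATH, this applies system-wide, so it also covers processes that Jupyter starts without inheriting your shell environment.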

With all of this in place, when I started a jupyter notebook, with a kernel configured to use the swift-tensorflow nightly toolchain from 2019-04-15, I was able to train with cuDNN in notebook 06.

0 Likes

(Jeff Lee) #19

Thanks @pcuenq! This helped me find that I had about 3-4 versions of cudnn flying around from old installs that could have been conflicting. I did a lot of deleting and uninstalling and now everything works :smiley:

1 Like

#20

Just an aside: if you install the development version and samples, you can double-check the installation by building mnistCUDNN.

$ cd $HOME/cudnn_samples_v7/mnistCUDNN
$ make clean && make
$ ./mnistCUDNN

Result should be

Test Passed

So as far as nvidia and cudnn were concerned, things worked, but notebooks 02a and 06 failed with the “Failed to get convolution algorithm” error. All my environment variables seemed correct, so following the link from;

I uninstalled the cudnn library from the conda env, ran the notebook’s kernel restart option, and was able to proceed past that cell of the notebook.

Below is the installed list for cud*

sudo apt list --installed | grep cud
[sudo] password for dl:

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

cuda-10-1/unknown,now 10.1.105-1 amd64 [installed]
cuda-command-line-tools-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-compiler-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-cudart-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-cudart-dev-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-cufft-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-cufft-dev-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-cuobjdump-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-cupti-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-curand-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-curand-dev-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-cusolver-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-cusolver-dev-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-cusparse-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-cusparse-dev-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-demo-suite-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-documentation-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-driver-dev-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-drivers/unknown,now 418.40.04-1 amd64 [installed,automatic]
cuda-gdb-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-gpu-library-advisor-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-libraries-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-libraries-dev-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-license-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-memcheck-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-misc-headers-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-npp-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-npp-dev-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nsight-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nsight-compute-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nsight-systems-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nvcc-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nvdisasm-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nvgraph-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nvgraph-dev-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nvjpeg-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nvjpeg-dev-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nvml-dev-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nvprof-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nvprune-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nvrtc-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nvrtc-dev-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nvtx-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-nvvp-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-runtime-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-samples-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-sanitizer-api-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-toolkit-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-tools-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
cuda-visual-tools-10-1/unknown,now 10.1.105-1 amd64 [installed,automatic]
libcudnn7/now 7.5.0.56-1+cuda10.1 amd64 [installed,local]
libcudnn7-dev/now 7.5.0.56-1+cuda10.1 amd64 [installed,local]
libcudnn7-doc/now 7.5.0.56-1+cuda10.1 amd64 [installed,local]
0 Likes