Jeremy's Harebrained install guide

(Michael) #148

Thank you for your reply and your patience.

Unfortunately, I only have a /usr/local/cuda-10.1/lib64 directory instead of a /usr/local/cuda-10.0/lib64.

Therefore, I tried to set it up using CUDA 10.1:

echo /usr/local/cuda-10.1/lib64 | sudo tee -a /etc/ld.so.conf 
sudo ldconfig
echo 'export PATH=/usr/local/cuda-10.1/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

but still no luck.

My /etc/ld.so.conf looks like this:

include /etc/ld.so.conf.d/*.conf
/usr/local/cuda-10.1/lib64

I thought that with a working CUDA and cuDNN installation I could go straight to the S4TF installation, but that seems not to be enough.

Running swift in bash still results in this error:
swift: error while loading shared libraries: libcusparse.so.10.0: cannot open shared object file: No such file or directory
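One way to confirm whether `sudo ldconfig` actually picked up the library named in the error is to search the linker cache printed by `ldconfig -p`. The helper below is a hypothetical sketch, assuming bash and glibc's `ldconfig -p` line format (`soname (flags) => path`). It reads that text on stdin and matches the soname exactly, because a library with a different version suffix does not satisfy a request for `libcusparse.so.10.0`.

```bash
# Hypothetical helper: print the path(s) the linker cache maps soname $1 to.
# Reads `ldconfig -p` output on stdin; returns non-zero when nothing matches.
cache_has_lib() {
  awk -v want="$1" '
    $1 == want { print $NF; found = 1 }   # $NF is the path after "=>"
    END { exit !found }                   # status 1 when the soname is absent
  '
}
```

For example: `ldconfig -p | cache_has_lib libcusparse.so.10.0 || echo "libcusparse.so.10.0 is not in the linker cache"`.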

So I looked at the output of python register.py --sys-prefix --swift-python-use-conda --use-conda-shared-libs --swift-toolchain ~/swift:

kernel.json:
{
  "argv": [
    "/home/user/anaconda3/envs/swift/bin/python",
    "/home/user/Documents/swift-jupyter/parent_kernel.py",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Swift",
  "language": "swift",
  "env": {
    "PYTHONPATH": "/home/user/swift/usr/lib/python3.6/site-packages",
    "LD_LIBRARY_PATH": "/home/user/swift/usr/lib/swift/linux:/home/user/anaconda3/envs/swift/lib",
    "REPL_SWIFT_PATH": "/home/user/swift/usr/bin/repl_swift",
    "SWIFT_BUILD_PATH": "/home/user/swift/usr/bin/swift-build",
    "SWIFT_PACKAGE_PATH": "/home/user/swift/usr/bin/swift-package",
    "PYTHON_LIBRARY": "/home/user/anaconda3/envs/swift/lib/libpython3.6m.so"
  }
}

Registered kernel 'Swift' as 'swift'!

The second entry in LD_LIBRARY_PATH points to the directory /home/user/anaconda3/envs/swift/lib, which contains:

/home/user/anaconda3/envs/swift/lib/libcusparse.so
/home/user/anaconda3/envs/swift/lib/libcusparse.so.9.2
/home/user/anaconda3/envs/swift/lib/libcusparse.so.9.2.148

Maybe the problem is that there is no libcusparse.so.10.0 here?
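To see which LD_LIBRARY_PATH directories (if any) actually hold a given file, a small loop over the colon-separated entries is enough. `find_in_dirs` is a hypothetical bash helper: it only inspects the list you pass it, not RPATHs or the `ldconfig` cache, and it treats an empty entry as the current directory.

```bash
# Hypothetical helper: print every "<dir>/<name>" that exists for the
# colon-separated directory list in $2; returns 1 if none is found.
find_in_dirs() {
  local name="$1" dir found=1
  local -a parts
  IFS=: read -r -a parts <<< "$2"
  for dir in "${parts[@]}"; do
    [ -z "$dir" ] && dir=.            # an empty entry means the current directory
    if [ -e "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      found=0
    fi
  done
  return "$found"
}
```

For example: `find_in_dirs libcusparse.so.10.0 "$LD_LIBRARY_PATH"`, or pass the LD_LIBRARY_PATH value from kernel.json to check the kernel's view instead of the shell's.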

With find / -name "libcublas.so.10.0" I was able to locate the file in several places:

/home/user/anaconda3/pkgs/cudatoolkit-10.0.130-0/lib/libcublas.so.10.0
/home/user/anaconda3/envs/fastai-pytorch-nightly/lib/libcublas.so.10.0
/home/user/anaconda3/envs/fastai/lib/libcublas.so.10.0
/home/user/anaconda3/envs/fastai-dev/lib/libcublas.so.10.0

Then I tried to pass /home/user/anaconda3/pkgs/cudatoolkit-10.0.130-0/lib/ as the --use-conda-shared-libs parameter in python register.py --sys-prefix --swift-python-use-conda --use-conda-shared-libs --swift-toolchain ~/swift, but without success.

I am not sure what the syntax for passing this argument should look like. Maybe somebody can point me in the right direction?

PS:
I am not sure whether this part of the installation steps from the first post is also needed:

cd ~/download
wget http://files.fast.ai/files/cudnn-10.0-linux-x64-v7.5.0.56.tgz
tar xf cudnn-10.0-linux-x64-v7.5.0.56.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
sudo ldconfig

However, I am afraid this might break my current CUDA setup.


#149

A problem here is that when you created your swift conda environment, your setup was pointing to CUDA 9.2 as the system CUDA. That’s why you see libcusparse.so.9.2 in the envs/swift/lib directory.

There are a couple of things you might want to check.

First, check what /usr/local/cuda points to. It should point to cuda-10.0 or, in your case, it sounds like cuda-10.1. This is mine:

$ ls -l /usr/local/cuda
lrwxrwxrwx 1 root root 9 Mar  8 18:03 /usr/local/cuda -> cuda-10.0/
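A slightly more detailed check prints both the symlink target and the toolkit version the install reports. This is a hypothetical bash sketch that assumes the toolkit has a `version.txt` file containing a line like `CUDA Version 10.0.130`, as CUDA 10.x installs usually do; newer toolkits may only ship `version.json`, in which case the helper reports the version as unknown.

```bash
# Hypothetical helper: show what a CUDA install path links to and which
# version its version.txt reports (assumed format: "CUDA Version X.Y.Z").
cuda_info() {
  local root="${1:-/usr/local/cuda}"
  if [ -L "$root" ]; then
    printf 'link:    %s -> %s\n' "$root" "$(readlink "$root")"
  fi
  if [ -r "$root/version.txt" ]; then
    # The last field of the line is the version number.
    printf 'version: %s\n' "$(awk '{ print $NF }' "$root/version.txt")"
  else
    printf 'version: unknown (no %s/version.txt)\n' "$root"
    return 1
  fi
}
```

Run `cuda_info` for the default symlink, or `cuda_info /usr/local/cuda-10.1` to inspect a specific install.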

Second, confirm that your personal LD_LIBRARY_PATH points either directly to cuda-10.X/lib64 or to /usr/local/cuda/lib64, assuming the symlink points to cuda-10.X. By personal LD_LIBRARY_PATH, I mean the one in your shell, not the one set by register.py.

$ echo $LD_LIBRARY_PATH
/usr/local/cuda/lib64
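Rather than reading the value by eye, you can test for an exact entry. `path_contains` is a hypothetical bash helper: it splits the list on colons and compares each entry (ignoring a trailing slash) with the directory you expect, so `/usr/local/cuda` on its own does not count as `/usr/local/cuda/lib64`.

```bash
# Hypothetical helper: succeed if the colon-separated list in $2 contains
# exactly the directory $1 (a trailing slash on either side is ignored).
path_contains() {
  local want="${1%/}" entry
  local -a parts
  IFS=: read -r -a parts <<< "$2"
  for entry in "${parts[@]}"; do
    [ "${entry%/}" = "$want" ] && return 0
  done
  return 1
}
```

For example: `path_contains /usr/local/cuda/lib64 "$LD_LIBRARY_PATH" || echo "CUDA lib64 is not on LD_LIBRARY_PATH"`.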

If these things check out, I would create a new conda env from scratch, and check that the right libcusparse.so is in your new conda env lib directory. You can delete your current env or just name a new one.

Finally, whether you need to do the cuDNN install steps depends on whether you already have it. Check like so; if you don’t see libcudnn.so.7.5.0 as below, then you need to install it.

$ ls /usr/local/cuda/lib64/*dnn*
/usr/local/cuda/lib64/libcudnn.so@    /usr/local/cuda/lib64/libcudnn.so.7.5.0*
/usr/local/cuda/lib64/libcudnn.so.7@  /usr/local/cuda/lib64/libcudnn_static.a
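If you want to script that check, `has_cudnn7` below is a hypothetical helper that assumes cuDNN 7 and the default `/usr/local/cuda/lib64` location used above. It lists any `libcudnn.so.7*` files and fails when there are none.

```bash
# Hypothetical helper: list cuDNN 7 shared libraries in a lib64 directory
# (libcudnn.so.7 and libcudnn.so.7.x.y); returns 1 if none exist.
has_cudnn7() {
  local libdir="${1:-/usr/local/cuda/lib64}" f found=1
  for f in "$libdir"/libcudnn.so.7*; do
    [ -e "$f" ] || continue   # the glob stays literal when nothing matches
    printf '%s\n' "$f"
    found=0
  done
  return "$found"
}
```

For example: `has_cudnn7 || echo "cuDNN 7 not found, run the install steps"`.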

(Michael) #150

Thank you for your detailed reply!

I checked everything you mentioned:

$ ls -l /usr/local/cuda
/usr/local/cuda -> cuda-10.1
$ echo $LD_LIBRARY_PATH
:/usr/local/cuda

And installed cudnn:

$ ls /usr/local/cuda/lib64/*dnn*
/usr/local/cuda/lib64/libcudnn.so    /usr/local/cuda/lib64/libcudnn.so.7.5.0
/usr/local/cuda/lib64/libcudnn.so.7  /usr/local/cuda/lib64/libcudnn_static.a

Then I created a new conda env named “swift” and ran python register.py --sys-prefix --swift-python-use-conda --use-conda-shared-libs --swift-toolchain ~/swift in the swift-jupyter git directory.
(Does the conda env have to be named “swift” or “Swift”? It did not work with the empty conda env, so I installed tensorflow-gpu plus the jupyter and ipykernel packages requested by the error messages.)

Looking for libcusparse.so.10.0 in the swift env directory:

$ ls anaconda3/envs/swift/lib/libcusparse*
anaconda3/envs/swift/lib/libcusparse.so
anaconda3/envs/swift/lib/libcusparse.so.10.0
anaconda3/envs/swift/lib/libcusparse.so.10.0.130

So far this all looks good, but I still get the same error as before:

$ swift
swift: error while loading shared libraries: libcusparse.so.10.0: cannot open shared object file: No such file or directory

and the notebook in Jupyter behaves like a normal Python notebook.

It looks like swift is looking in the wrong place, even though running register.py printed "LD_LIBRARY_PATH": "/home/user/swift/usr/lib/swift/linux:/home/user/anaconda3/envs/swift/lib".
Is there a way to check where swift is looking for the missing file?
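For the command-line case, `ldd` can answer this: it resolves a binary's dependencies with the current environment (including the shell's LD_LIBRARY_PATH) and marks each unresolved one with `=> not found`. For a trace of every directory searched, glibc's dynamic linker also supports `LD_DEBUG=libs`, e.g. `LD_DEBUG=libs swift 2>&1 | grep cusparse`. The sketch below only filters the `ldd` output; both function names are hypothetical.

```bash
# Hypothetical helper: read `ldd` output on stdin, print unresolved sonames.
unresolved() {
  awk '/=> not found/ { print $1 }'
}

# Hypothetical wrapper: list the libraries binary $1 cannot load right now.
missing_libs() {
  ldd "$1" | unresolved
}
```

For example: `missing_libs "$(command -v swift)"`. Running it in the same shell you start swift from shows exactly what that shell's environment resolves.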


#151

Great. Two things.

#1: you ran swift from the command line above, so we need to think about where swift gets its libraries in that context. One thing I notice: your LD_LIBRARY_PATH is missing /lib64 at the end.

So I would change that and try again.
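When adding the directory, the `${VAR:+...}` expansion avoids the stray leading colon visible in the `echo $LD_LIBRARY_PATH` output above: it inserts `:` only when the variable already has a value. An empty entry in LD_LIBRARY_PATH is interpreted by the dynamic linker as the current directory, which is rarely intended. `prepend_ld_path` is a hypothetical name for this sketch.

```bash
# Hypothetical helper: prepend directory $1 to LD_LIBRARY_PATH.
# ${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH} expands to ":<old value>" only when
# the variable is set and non-empty, so no empty entry is created.
prepend_ld_path() {
  export LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

To make it persistent, the same expansion works in ~/.bashrc: `echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}' >> ~/.bashrc`, then open a new shell.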

#2: given that the register.py output and your library contents match up, if you run Jupyter and try the Swift kernel in a notebook, it MIGHT already work for you as is. Your error above came from running swift directly in the shell.


(Michael) #152

Thanks for catching the missing /lib64!

I corrected LD_LIBRARY_PATH and reran python register.py --sys-prefix --swift-python-use-conda --use-conda-shared-libs --swift-toolchain ~/swift, with the following output:

kernel.json:
{
  "argv": [
    "/home/user/anaconda3/bin/python",
    "/home/user/Documents/swift-jupyter/parent_kernel.py",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Swift",
  "language": "swift",
  "env": {
    "PYTHONPATH": "/home/user/swift/usr/lib/python3.7/site-packages",
    "LD_LIBRARY_PATH": "/home/user/swift/usr/lib/swift/linux:/home/user/anaconda3/lib",
    "REPL_SWIFT_PATH": "/home/user/swift/usr/bin/repl_swift",
    "SWIFT_BUILD_PATH": "/home/user/swift/usr/bin/swift-build",
    "SWIFT_PACKAGE_PATH": "/home/user/swift/usr/bin/swift-package",
    "PYTHON_LIBRARY": "/home/user/anaconda3/lib/libpython3.7m.so"
  }
}

Registered kernel 'Swift' as 'swift'!

I guess the output looks correct?

However, I still get the same error when running swift in the shell, and the kernel in the Jupyter notebook still behaves like a normal Python kernel.

I also tried this: because you mentioned that you have a cuda-10.0 directory in /usr/local/, I created a symlink with that name there pointing to the cuda-10.1 directory, but it didn’t help.


#153

Did you pick Swift as your kernel?

#154

Also, you’re now missing envs/swift/lib from the register.py output, which you had previously.


(Michael) #155

This is there, or am I misunderstanding it?

I switched to the swift kernel in the notebook. However, mine is written in lower case, i.e., “swift”. Should it be named “Swift” like in your screenshot?
(I was already wondering about this after running register.py, which says Registered kernel 'Swift' as 'swift'!)

Under /anaconda3/envs I only have a directory named “swift” (and no “Swift”).

I created my “swift” conda env with conda create --name swift before the installation steps.

Is there a way to see the output of the actions carried out by the register.py script?
Is there a way to check the changes that the register.py script should have made?
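`jupyter kernelspec list` prints every registered kernel name together with the directory that holds its `kernel.json`; with `--sys-prefix`, that directory should be under the active env’s `share/jupyter/kernels`. To inspect what register.py wrote, the hypothetical helper below prints the `env` block of that file. It assumes `python3` is on PATH and uses only its standard `json` module.

```bash
# Hypothetical helper: print the "env" entries of <kernel dir>/kernel.json,
# one KEY=VALUE per line, sorted by key.
kernel_env() {
  python3 -c '
import json, sys
with open(sys.argv[1] + "/kernel.json") as fh:
    spec = json.load(fh)
for key, value in sorted(spec.get("env", {}).items()):
    print("%s=%s" % (key, value))
' "$1"
}
```

For example, after `jupyter kernelspec list`: `kernel_env "$CONDA_PREFIX/share/jupyter/kernels/swift"`.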


#156

Your most recent register.py output is missing /envs/swift.

You might not have activated the swift conda environment where you ran register.py. You need to:

  1. source activate swift
  2. Run register.py etc.
  3. Start jupyter notebook
  4. Do all of this in the same shell, in one go.
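A guard like the following can catch the case where register.py runs outside the intended env. It relies on `conda activate` setting `CONDA_PREFIX` to the env’s directory; `require_conda_env` itself is a hypothetical helper, and the register.py flags in the usage line are the ones used earlier in this thread.

```bash
# Hypothetical guard: succeed only if the conda env named $1 is active,
# judged by the last path component of CONDA_PREFIX.
require_conda_env() {
  local want="$1"
  if [ -z "${CONDA_PREFIX:-}" ] || [ "$(basename "$CONDA_PREFIX")" != "$want" ]; then
    echo "activate the '$want' env first (current: ${CONDA_PREFIX:-none})" >&2
    return 1
  fi
}
```

For example, chaining the steps so each only runs if the previous one succeeded: `conda activate swift && require_conda_env swift && python register.py --sys-prefix --swift-python-use-conda --use-conda-shared-libs --swift-toolchain ~/swift && jupyter notebook`.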

(Michael) #157

I tried everything again:

1.) conda activate swift

2.) Set LD_LIBRARY_PATH:

$ echo $LD_LIBRARY_PATH
/usr/local/cuda/lib64:/home/user/anaconda3/envs/swift/lib

3.) Run register.py:

$ python register.py --sys-prefix --swift-python-use-conda --use-conda-shared-libs   --swift-toolchain ~/swift
kernel.json:
{
  "argv": [
    "/home/user/anaconda3/envs/swift/bin/python",
    "/home/user/Documents/swift-jupyter/parent_kernel.py",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Swift",
  "language": "swift",
  "env": {
    "PYTHONPATH": "/home/user/swift/usr/lib/python3.7/site-packages",
    "LD_LIBRARY_PATH": "/home/user/swift/usr/lib/swift/linux:/home/user/anaconda3/envs/swift/lib",
    "REPL_SWIFT_PATH": "/home/user/swift/usr/bin/repl_swift",
    "SWIFT_BUILD_PATH": "/home/user/swift/usr/bin/swift-build",
    "SWIFT_PACKAGE_PATH": "/home/user/swift/usr/bin/swift-package",
    "PYTHON_LIBRARY": "/home/user/anaconda3/envs/swift/lib/libpython3.7m.so"
  }
}

Registered kernel 'Swift' as 'swift'!

4.) Run jupyter notebook in the same bash window.

5.) Start the fastai_docs notebook and run the first cells:


(Jupyter does not print any errors on the command line.)

6.) When I run swift in the shell I get this new error:

$ swift
swift: /home/user/anaconda3/envs/swift/lib/libuuid.so.1: no version information available (required by swift)
/home/user/swift/usr/bin/lldb: /home/user/anaconda3/envs/swift/lib/libuuid.so.1: no version information available (required by /home/user/swift/usr/bin/../lib/liblldb.so.7)
error: failed to stop process at REPL breakpoint

(There is also a thread about a similar output in the notebooks: [minor] "No version information available" warning when importing modules)

Running everything in the same shell was essential (something I am not used to with the other Python kernels I use regularly). I guess the problem was that when you write the paths to .bashrc, you have to restart bash for them to take effect there too.

However, I didn’t really find anything that helped me with this “no version information available” message.

I am making small steps, but those are still steps. :wink:


#158

It seems to me that you may not actually be running a Swift kernel. The kernel name shows lower case ‘swift’ in your screenshot, but based on the register.py output it should be registered with upper case. The fact that it doesn’t recognize the magic or the comment is suspicious. You could try running some Python to see if it’s actually a Python kernel.

I have seen the error you get in the Jupyter notebook; it appeared when I updated S4TF. I believe I resolved it by updating swift-jupyter, running register.py, and then starting Jupyter. It may or may not matter whether all of this happens in the same shell, but that is what I did.

Which versions of S4TF and swift-jupyter are you running? I synced up right before Lesson 14, and I can’t vouch for anything that doesn’t match that.

What you experience in Jupyter and what you experience when you run swift in the command line aren’t necessarily completely related. So you are potentially describing two totally different issues. I think you probably want to focus on Jupyter.

As far as Jupyter goes, what register.py does is set up the context for the Jupyter kernel named ‘Swift’ (in the UI). You can look at the source.

I would avoid adding the Anaconda lib directory to your shell’s LD_LIBRARY_PATH, and definitely don’t do this in your bash init. But I don’t think it is necessarily causing your issue.
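If the conda env’s lib directory has already been added to the shell’s LD_LIBRARY_PATH, it can be dropped again for the current session without editing any files. `path_without` is a hypothetical bash helper: it removes every entry equal to the given directory (ignoring a trailing slash) and also drops empty entries.

```bash
# Hypothetical helper: print the colon-separated list $2 without any entry
# equal to directory $1 and without empty entries.
path_without() {
  local drop="${1%/}" entry out=""
  local -a parts
  IFS=: read -r -a parts <<< "$2"
  for entry in "${parts[@]}"; do
    [ -z "$entry" ] && continue
    [ "${entry%/}" = "$drop" ] && continue
    out="${out:+$out:}$entry"          # join with ":" only after the first entry
  done
  printf '%s\n' "$out"
}
```

For example: `export LD_LIBRARY_PATH="$(path_without "$CONDA_PREFIX/lib" "$LD_LIBRARY_PATH")"`.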


(Michael) #159

Thanks for your help.

My (lower-case) swift kernel in jupyter behaves like a normal python kernel and I can run python code with it.

I used swift-tensorflow-RELEASE-0.3.1-cuda10.0-cudnn7-ubuntu18.04.tar.gz and the latest swift-jupyter repo.

I agree that the problem in Jupyter is not related to the one in the shell.
It looks like the (upper-case) Swift kernel is not installed properly, even though the script doesn’t show an error.

However, I don’t know what else to try.
I guess I will switch to Colab or use Docker until the next release.


#160

Yeah, not sure what the issue is.

I know I encountered the same problem you are having in the notebook, but it was pretty easy to resolve by updating and running register.py again.


(Michael) #161

After removing the conda “swift” env and the swift folder, I now see a “Swift” entry in the list of Jupyter kernels.

Maybe the “swift” and “Swift” envs were interfering with each other?

From the install script in the first post, it seems like the conda “base” env is used?

I’ll try it again and I am happy for every tip! :slight_smile:


(Darshan Kathiriya) #162

Hi, I had the same issue with libuuid.so.1, and this solved it for me:

python register.py --sys-prefix --swift-toolchain ~/Applications/swift-tensorflow --swift-python-library ~/miniconda/envs/s4tf/lib/libpython3.6m.so

Put in your own path to libpython3.6m.so.
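To find that path for a conda env, a glob over the env’s `lib` directory is usually enough. `find_libpython` is a hypothetical helper; the `libpython3.*.so` pattern is an assumption that matches names like `libpython3.6m.so` and skips the version-less `libpython3.so`.

```bash
# Hypothetical helper: list libpython3.X*.so files in a conda env's lib dir.
# Defaults to the active env ($CONDA_PREFIX); returns 1 if nothing is found.
find_libpython() {
  local prefix="${1:-${CONDA_PREFIX:-}}" f found=1
  if [ -z "$prefix" ]; then
    echo "no env given and CONDA_PREFIX is unset" >&2
    return 1
  fi
  for f in "$prefix"/lib/libpython3.*.so; do
    [ -e "$f" ] || continue   # the glob stays literal when nothing matches
    printf '%s\n' "$f"
    found=0
  done
  return "$found"
}
```

If it prints exactly one file, it can be passed straight through: `python register.py --sys-prefix --swift-toolchain ~/swift --swift-python-library "$(find_libpython)"`.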


(Aditya Palepu) #163

For those of you creating new Google Compute Engine instances, what settings are you using? The default fastai GCP instructions install Debian I believe. Are you using the Ubuntu 18.04 LTS Image and the same fastai hardware settings recommended in the GCP guide? Also if you already had a Google VM for fastai, are you migrating that onto the new Ubuntu machine or using the Ubuntu machine exclusively for Swift?


(Aditya Palepu) #164

Just following up for anyone else wondering: I set up a new VM instance in my existing fastai project with the commands outlined in step 3 of the GCP Start Guide. I set IMAGE_FAMILY="ubuntu-1804-lts", renamed INSTANCE_NAME, and set --image-project=ubuntu-os-cloud. Otherwise, I followed the instructions for creating a new instance exactly. Then I installed everything per Jeremy’s gist and was able to start the first notebook successfully.
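For reference, here is a sketch of what the create command might look like with those changes. Only the image family, the image project and the renamed instance come from the post above; the zone, machine type, GPU type and disk size are placeholder assumptions to replace with the values from the GCP Start Guide. The script only prints the command so it can be reviewed before running it.

```bash
# Values from the post: Ubuntu 18.04 image family and a new instance name.
IMAGE_FAMILY="ubuntu-1804-lts"
INSTANCE_NAME="my-swift-instance"   # hypothetical name
# Placeholder assumptions: take the real values from the GCP Start Guide.
ZONE="us-west1-b"
INSTANCE_TYPE="n1-highmem-8"

# GPU instances cannot live-migrate, so the maintenance policy is TERMINATE.
gcloud_args=(
  compute instances create "$INSTANCE_NAME"
  --zone="$ZONE"
  --image-family="$IMAGE_FAMILY"
  --image-project=ubuntu-os-cloud
  --maintenance-policy=TERMINATE
  --accelerator="type=nvidia-tesla-p100,count=1"
  --machine-type="$INSTANCE_TYPE"
  --boot-disk-size=200GB
)

# Print the command; run it yourself once the values look right.
echo gcloud "${gcloud_args[@]}"
```

Because this image family is plain Ubuntu rather than a deep learning image, the NVIDIA driver, CUDA and cuDNN still come from the gist’s install steps afterwards.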


(john v) #165

I’m trying to set up Swift (v0.4 of S4TF). I’ve managed to get the notebooks running, but I am running into a strange error in 00_load_data.ipynb when I try to define the new protocol:

protocol ConvertibleFromByte: TensorFlowScalar {
    init(_ d:UInt8)
}

I get this error:

error: Couldn't lookup symbols:
  protocol descriptor for TensorFlow.TensorFlowScalar
  protocol descriptor for TensorFlow.TensorFlowScalar
  protocol descriptor for TensorFlow.TensorFlowScalar
  protocol descriptor for TensorFlow.TensorFlowScalar

TensorFlow.TensorFlowScalar is defined, but the error message suggests it doesn’t have a protocol descriptor?


(john v) #166

Hmmm… it seems that I had not installed CUDA 10 on my system properly. Installing CUDA 10 by following Jeremy’s minimal installation subset more carefully has helped, and I’m on to the next error :slight_smile:


(Jeremy Harari) #168

I got this


What’s the problem?
It’s too hard to install
