Making your own server

How about the CPU I am going to buy: the Intel Core i5-7500 3.4GHz quad-core processor, which only supports 16 PCIe lanes in these configurations. If I get a second GPU card, would that be a problem?

Depends on the card.

Thanks Leon, I will give that a go. I had added libgpuarray before and it didn't work, but as of yesterday's rebuild I did not. The Ubuntu kernel upgrade to 4.10.9 is certainly somewhere I have not been yet.

Roger - it has pretty consistently been the output below… cheers

Leon - thanks, I tried all that and it just did the same thing. You can see below how the error starts out. It has never changed; no matter what I do to this thing it will not budge. It may be a 'Haswell thing' - I am not sure what processor you run, but my script is pretty much the AWS one except for our departures with 8.0.61 and 6.0.20. I tried nvidia drivers 375 as well as others to no effect. It may be some kind of linking thing. I did note that as usual 'sudo ldconfig' seems to produce

sorry this …

/sbin/ldconfig.real: /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudnn.so.6 is not a symbolic link

every time. Someone may have seen this nasty before. Other than that, here is the start of the failed compile output below.

Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
#include <Python.h>
#include <iostream>
#include "theano_mod_helper.h"
#include "cuda_ndarray.cuh"
#include <math.h>
#include <numpy/arrayobject.h>
#include <numpy/arrayscalars.h>
#include "cudnn.h"
#include "cudnn_helper.h"
//////////////////////
//// Support Code
//////////////////////

void _capsule_destructor(PyObject *o) {
    void *d = PyCapsule_GetContext(o);
    void *p = PyCapsule_GetPointer(o, NULL);
    void (*f)(void *) = (void (*)(void *))d;
    if (f != NULL) f(p);
}

…and to really add insult to injury, getting cuDNN 5.1 back into place and going is as simple as…

tar -zxf cudnn.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda-8.0/lib64/
sudo cp include/cudnn.h /usr/local/cuda-8.0/include

Truly unbelievable: no dramas, no nasty cuDNN Theano messages (apart from the deprecation warning), it just works! Try that with cuDNN 6.0.20 and you get a world of hurt.
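For reference, the 'libcudnn.so.6 is not a symbolic link' warning from ldconfig mentioned above usually just means the cp copied the cuDNN libraries as plain files instead of preserving the symlink chain. A rough sketch of one way to quiet it; the exact .so filename depends on which tarball you extracted, so check with ls first (6.0.20 here is just the version discussed in this thread):

ls -l /usr/local/cuda-8.0/lib64/libcudnn*
cd /usr/local/cuda-8.0/lib64
# recreate the symlink chain by hand, using whatever versioned file ls showed
sudo ln -sf libcudnn.so.6.0.20 libcudnn.so.6
sudo ln -sf libcudnn.so.6 libcudnn.so
sudo ldconfig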

I must move on, it has me beaten. I think I will wait for Theano 0.10; maybe it has something to offer. I know theano 0.9.0rc4 is out there, but I tried other variants before with no luck so far. It's something deeper in my system, something that's missed, something I must do that others need not do.

Thanks for everyone's help, it has been appreciated!

@stephenl I think I have it.

I reinstalled fully twice and was crashing out with the same errors as you were getting.

I installed 16.04 server.
Then used the standard instructions here: https://github.com/fastai/courses/blob/master/setup/install-gpu.sh

Except I used cuDNN 6.0.20, and it was crashing.

The last two steps which fixed it were:
Use the instructions here to set up libgpuarray: http://deeplearning.net/software/libgpuarray/installation.html
extract/clone the source to a directory
For libgpuarray:

cd directory
mkdir Build
cd Build
cmake .. -DCMAKE_BUILD_TYPE=Release # or Debug if you are investigating a crash
make
make install
cd ..

Then change the .theanorc - here is mine:
[global]
device = cuda
floatX = float32

[cuda]
root = /usr/local/cuda

Finally install pygpu
conda install pygpu

that worked for me.
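If it helps, here is a quick sanity check I'd suggest after that last step (just a minimal test, assuming the anaconda python is on your PATH):

# pygpu should import cleanly, and importing theano with device = cuda
# should print a "Mapped name None to device cuda: ..." line for your card
python -c "import pygpu; print(pygpu.__version__)"
python -c "import theano"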

Sorry for the formatting. MD is not working very well in this message

Thanks Leon - OK, it looks like you have 8.0.44, which uses a 'cuda' directory rather than 'cuda-8.0', a constant source of confusion on the net. It changes to cuda-8.0 if you use the .run file, and of course there is actually another 'cuda' directory of symbolic links created during install. Will give it a go… Cheers

These are the file sizes of the ones I have installed:
1913589814 Apr 10 21:24 cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64.deb
201123192 Apr 10 21:24 cudnn-8.0-linux-x64-v6.0.tgz

OK, looks like you pulled down 8.0.61-1 as the target and not what was in the original script, which says:

cuda-repo-ubuntu1604_8.0.44-1_amd64.deb - noted
cudnn-8.0-linux-x64-v6.0.tgz - noted

cuDNN is towards the bottom - did you do cuDNN when you did the CUDA file? Order may be important, as conda install pygpu was further up the list.

rebuilding now

Yes. Once I realized that 8.0.44 was slow I stopped installing it.
I did cudnn right after I did cuda-repo
Then I did all the other stuff to get it running in a crashing state
Then I did libgpuarray
Then changed the .theanorc
Then I installed pygpu.
That's when the crashing stopped.

I hope you get it working. I know I have spent at least 18 hours putzing around with this. I don't mind learning, but once it's working, it's nice to not have to change anything for a while lol!

OK, so you did the pip install theano but it crashed, and you replaced it with conda install pygpu?

Yes. Followed the directions as listed then added the other steps to get it stable.

Oh, and the Ubuntu kernel upgrade to 4.10.9 - I guess that was after libgpuarray?

I didn’t do that this time. Standard Kernel.
4.4.0-62-generic

Miracle of miracles, it worked! Top man, Leon!

The order of libgpuarray and theano/pygpu is critical for success. So it's running CUDA 8.0.61-1 and cuDNN 6.0.20.

Remarkably there's no fluffing around with .bashrc pointers etc. The run time is down to 195 seconds, which shaved off a minute. Remarkably it's using less GPU memory and lower GPU utilisation, way down from 100% to around 30% with occasional bursts, and of course less heat generated to do a Vgg pass. So the Pascal chip and code, when matched, make a big difference. The question is: if it's running at lower utilisation and finishing quicker, could it be made to go faster still, or is the lower load just a consequence of it finishing quicker, so it doesn't need to run at 100%?

I tested on nvidia drivers 375, 378 and 381. There seems to be a marginal improvement on 378, not much, and nothing noticeable between 378 and 381.

I had an issue when I rebooted once; it forced a looping guest-login issue. I tried to fix it by removing lightdm (the GUI library), which has worked before, but not that time. So I did a run over with CUDA and all drivers except cuDNN, because pulling out lightdm to rid myself of the looping guest login yanks out a lot of CUDA links, and that made a mess, but it can be restored. Jupyter must be gracefully stopped before rebooting! Otherwise it will mess things up.

I will log what I did. Cheers and thanks for the assistance!

Awesome! BTW - I also switched my server to full text mode:
Edit your /etc/default/grub file
sudo vim /etc/default/grub
Comment out the following line to disable the splash
#GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
Uncomment the following line to disable the graphical terminal (grub-pc only)
GRUB_TERMINAL=console
Then run sudo update-grub.

Then run:
sudo systemctl enable multi-user.target --force
sudo systemctl set-default multi-user.target

Pure Text glory.
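And if anyone ever wants the desktop back, it should just be the reverse (standard systemd targets, plus undoing the grub edit):

sudo systemctl set-default graphical.target
# re-edit /etc/default/grub to restore the splash/console lines, then:
sudo update-grub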

Great, will do that. Yes, that will free up more memory. I think I have enough on my card to do batch_size=128; it didn't quite manage it on mine, just short of memory, but I expect it will if I get rid of the GUI. That said, I noticed no improvement from running larger batch sizes anyway; it seemed the same on 96 as on 64.

About to write up instructions for others to do the conversion to cuDNN 6.0 while it's still fresh.

For anybody who wishes to run cuDNN 6.0.20 on CUDA 8.0.61, the instructions here are as best I can recall. The reason to do this is the benefit of running the CUDA Pascal code on a Pascal-equipped GPU, which provides large improvements in performance over cuDNN 5.1, such as using less GPU memory, cooler GPU running temperatures and lower GPU utilisation, all while giving big gains in running convolutional neural nets. This instruction was done with the help of Leon Letto on the forum, who found a way around the issue of running cuDNN 6 on CUDA 8.0.61-1.

I did this because of the grief and time this has taken, and there's nothing on the web in complete form yet that leads to a successful install of cuDNN 6.0.20 on CUDA 8.0 with Theano 0.9.0. Theano just does not like this arrangement of CUDA 8.0 and cuDNN 6.0 at all, so it's been trial and error. I can't find a single complete instruction anywhere on the web to do this, so Leon and I have been back and forth over it. This is to capture what we know to get it going, from a completely fresh install to a working lesson 1, using CUDA 8.0.61-1 and cuDNN 6.0.20 with Theano 0.9.0, at this time, for nvidia Pascal-equipped GPUs. Good luck!

Pre-installation:
This assumes you are doing the course this forum is about, so it references material from it. Also, the spell checker in this dialogue may have changed things I did not catch, so beware.

Get two USB drives.
You will need to rescue some folders to save time on re-running Vgg later: the 'model' folder that holds the Vgg weights, and 'data'. 'model' is inside a hidden folder called ~/.keras. Dump 'model' on a USB someplace. Also rescue the data folder holding cats-dogs under nbs, where your Jupyter projects are, and dump that on the data USB.
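Roughly, the rescue copy looks something like this; the /media/usbdata mount point is just an example of wherever your data USB mounts, the nbs path is wherever your notebooks live, and the weights folder may be called 'models' rather than 'model' depending on your keras version, so check what you actually have:

# back up the Vgg weights cache and the cats-dogs data before wiping the machine
cp -r ~/.keras/model /media/usbdata/
cp -r ~/nbs/data /media/usbdata/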

The two USBs are needed: one for data and one for Ubuntu LTS 16.04, as Ubuntu may give a warning if it finds other folders present. Yes, we are nervous! This install has taken dozens of trial runs with no success to get to this point, so everything needs to be straight down the middle; unusual deviations will not help replicate success. If there's a deviation and it doesn't work, there will always be doubt about a step that was not quite true to the instructions below. We note that the installation is very order-specific, especially libgpuarray and conda pygpu, which we think are the key pieces that make this work. Yes, we do install Theano twice; it's best to 'play this game' as it's what is working as of now.

Obtain a copy of Ubuntu LTS 16.04, load it onto the Ubuntu USB and format it with a Windows PC using 'rufus': https://rufus.akeo.ie

Spin up Ubuntu and do nothing as yet. I suggest we use ssh from another machine; it helps when things go wrong, as we can still possibly have access to the system via the command line, or to run init 3 states etc. Plus it's easy to cut and paste these instructions on the other machine straight into the ssh command line.

Instructions (as best I can recall): there may be parts we don't need in this installation, but just follow it step by step to have a base reference. It is not a script; it is the modified AWS script from the course, but it's better to do this by hand until someone refines it.

This instruction is designed to work with ubuntu 16.04 LTS

INSTALL OPENSSH SERVER

sudo apt-get install openssh-server
sudo service ssh status

ctrl-c to get out

####locate your ip address and write it down to use on your other machine.

ifconfig

log into your local machine running the lab and GPU using ssh your-machine-name@your-ipaddress

ensure system is updated and has basic build tools

sudo apt-get update
sudo apt-get --assume-yes upgrade
sudo apt-get --assume-yes install tmux build-essential gcc g++ make binutils
sudo apt-get --assume-yes install software-properties-common

download and install GPU drivers

wget "http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb" -O "cuda-repo-ubuntu1604_8.0.61-1_amd64.deb"

sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get -y install cuda
sudo modprobe nvidia
nvidia-smi
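####a couple of optional checks I'd suggest at this point; paths assume the deb put things in the default locations

# the toolkit compiler should report release 8.0
/usr/local/cuda/bin/nvcc --version
# and this shows the kernel module / driver version that loaded
cat /proc/driver/nvidia/version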

Download cuDNN 6.0 from the nvidia developers website https://developer.nvidia.com/ (you will need an account; sign up). Obtain the Linux cuDNN 6.0 for CUDA 8.0 tarball and place it in your ~/Downloads folder.

INSTALL cuDNN 6.0

tar -zxf cudnn-8.0-linux-x64-v6.0.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/* /usr/local/cuda/include/
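####optional: double-check which cuDNN actually ended up in place (the CUDNN_MAJOR/MINOR/PATCHLEVEL defines are part of the standard cudnn.h header)

# should show major version 6, minor version 0
grep -m1 -A 2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h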

install Anaconda for current user

mkdir downloads
cd downloads
wget "https://repo.continuum.io/archive/Anaconda2-4.2.0-Linux-x86_64.sh" -O "Anaconda2-4.2.0-Linux-x86_64.sh"
bash "Anaconda2-4.2.0-Linux-x86_64.sh" -b

echo "export PATH=\"$HOME/anaconda2/bin:\$PATH\"" >> ~/.bashrc
export PATH="$HOME/anaconda2/bin:$PATH"
conda install -y bcolz
conda upgrade -y --all

####install and configure theano
pip install theano
echo "[global]
device = gpu
floatX = float32
[cuda]
root = /usr/local/cuda" > ~/.theanorc

####configure jupyter; it will prompt for a password - remember it!
jupyter notebook --generate-config

####make a password
jupass=`python -c "from notebook.auth import passwd; print(passwd())"`
echo "c.NotebookApp.password = u'"$jupass"'" >> $HOME/.jupyter/jupyter_notebook_config.py
echo "c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False" >> $HOME/.jupyter/jupyter_notebook_config.py

Place the 'model' and 'data' folders back into the directories you took them from before: ~/.keras and the nbs lesson 1 folder, respectively.

we will start Jupyter up as follows on the local and remote machines

####Do this from the Local ( the machine with the GPU running the labs)

jupyter notebook --no-browser --port=8889

####Do this from the remote machine you will need to modify the name and ip address

ssh -N -f -L localhost:8888:localhost:8889 machine-name@ipaddress

If you have used the remote machine for ssh before, you will need to get into the ~/.ssh/known_hosts file and erase the old host keys for previously used local machines' IP addresses.
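One way to clear a stale key without hand-editing the file is ssh-keygen's -R option (the argument is whatever host or IP you connect to):

# removes the old entry for that host/IP from ~/.ssh/known_hosts
ssh-keygen -R your-ipaddress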

OK, now you will run the nbs course lesson 1. It will look like it might work, but it will throw a compile error on In [7], the Vgg() section.

Next we will need to make some changes to the system to get this to work -at all-.

Close off the jupyter session on the remote: locate the cmd line window running jupyter and ctrl-c.

####close off jupyter on the local machine, ctrl-c.

Next we are going to download the libgpuarray library, which is key to getting this going.

git clone https://github.com/Theano/libgpuarray.git
cd libgpuarray

mkdir Build
cd Build
cmake .. -DCMAKE_BUILD_TYPE=Release
make
make install
cd ..
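####if a later step complains it cannot find libgpuarray.so, refreshing the linker cache usually sorts it out (this assumes the default /usr/local install prefix used above)

# confirm the library landed under /usr/local, then refresh the cache
ls /usr/local/lib/libgpuarray*
sudo ldconfig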

####Then change the .theanorc, which should be in ~ but is hidden; use ls -a to find it:
[global]
device = cuda
floatX = float32

[cuda]
root = /usr/local/cuda

Finally, install pygpu using conda; this is also key.

conda install pygpu

Probably a good time for a reboot, so restart the local machine.

Next, restart the local machine's jupyter session and the remote ssh sessions. I suggest you set up an additional ssh session to the local machine as well as the jupyter tunnel.

####on the remote do…
ssh your-machine-name@your-ipaddress

####on the local do…
jupyter notebook --no-browser --port=8889

####Do this from the remote machine, please insert your details your-machine-name@your-ipaddress

ssh -N -f -L localhost:8888:localhost:8889 machine-name@ipaddress

####bring up an http browser window and put in the local machine's ipaddress:8889

####re-run Vgg; on line [4], where we do the import and start theano, you will get a nasty message like this or very similar

/home/sl/anaconda2/lib/python2.7/site-packages/theano/gpuarray/dnn.py:135: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to version 5.1.
warnings.warn("Your cuDNN version is more recent than "
Using cuDNN version 6020 on context None
Mapped name None to device cuda: GeForce GTX 1080 Ti (0000:02:00.0)
Using Theano backend.

####Just ignore the nasty message above and re-run the lesson 1 Vgg line; it should now work. Next look at the nvidia card and see how it's running.

nvidia-smi

####polling the card while running the Vgg session will lower times by a couple of seconds BTW.
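####either of these works for the polling; the -l flag is nvidia-smi's built-in loop interval in seconds

# refresh the GPU stats once a second while the Vgg pass runs
watch -n 1 nvidia-smi
# or, using nvidia-smi's own loop mode
nvidia-smi -l 1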

###Next refinements: this is up to you.

create some aliases so you don’t have to re-run long instructions on the local and remote.

on local machine running our lab and GPU

echo "alias ju='jupyter notebook --no-browser --port=8889'" >> ~/.bashrc

source ~/.bashrc

On the remote, for the alias (assuming an iMac or similar) you will need to modify .bash_profile, or use nano to get into .bash_profile.

####again this is for apple iMAC

echo "alias remote='ssh -N -f -L localhost:8888:localhost:8889 sl@your-machine-name.localdomain'" >> ~/.bash_profile

source ~/.bash_profile

OK, higher risk activities ahead! (You don't need to do this.) Let's upgrade to nvidia driver 378.13.

####close jupyter notebook on the local, as not gracefully exiting jupyter notebook when rebooting may crash lightdm and cause guest-login recycling.

####download the nvidia 378.13 driver from nvidia.com and place it in ~/Downloads

####make sure you are on the remote machine and ssh'd in - this time I really mean it, you will need to ssh in, as you will lose the GUI completely on the local. You may be able to get in on the local cmd line using ctrl+alt+f1, but it does not always work, so ssh.

sudo service lightdm stop

####local screen goes black

sudo init 3

####remove the nvidia 375 driver

sudo apt-get remove --purge nvidia-375 nvidia-modprobe nvidia-settings

####there goes nvidia 375, now install 378.13

sudo sh NVIDIA-Linux-x86_64-378.13.run

####follow the screens: ignore the warning that the pre-install didn't go well, don't accept 32-bit if it offers it, say yes to the X configure, and then reboot the local.

Back on the local, check with nvidia-smi, restart jupyter on both machines and see how it goes; it may shave a few seconds off.
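If you'd rather confirm the driver non-interactively, nvidia-smi's query flags can pull just the driver version and card name:

nvidia-smi --query-gpu=driver_version,name --format=csv,noheader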

####That's it! Sorry if there are typos or I missed something; this is a recollection of events. Also, '<>' placeholders did not come out on the forum for the ssh instructions; usually it's ssh local-machine-name@local-machine-ipaddress
