Huge performance improvement with network training!

Hi @FabienP, @kzuiderveld, thank you for your replies

I was able to reduce the training time to 200s, but that is still far from the 111s in @kzuiderveld 's script. The performance improved after including the [dnn] paths in .theanorc. It was definitely not cnmem, as I trained in a similar time (199s) after disabling it.

I am still using the old Theano backend (I am getting some exceptions with the new gpuarray backend). Do you think that could be the reason for the difference between @kzuiderveld 's training time and mine, or is it something to do with BLAS/cuBLAS? The official Theano page only says 'Maybe a small run time speed up.' about converting to the new backend.
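For reference, the [dnn] entries I mean are just the cuDNN include/library paths. A typical section looks roughly like the sketch below (the paths assume a default CUDA install under /usr/local/cuda and may differ on other setups):

[dnn]
include_path = /usr/local/cuda/include
library_path = /usr/local/cuda/lib64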

When I ran check_blas.py, this is the output:

Some Theano flags:
blas.ldflags=
compiledir= /home/cvpr/.theano/compiledir_Linux-4.8–generic-x86_64-with-debian-stretch-sid-x86_64-2.7.13-64
floatX= float32
device= gpu
Some OS information:
sys.platform= linux2
sys.version= 2.7.13 |Anaconda 4.3.1 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
sys.prefix= /home/cvpr/anaconda2
Some environment variables:
MKL_NUM_THREADS= None
OMP_NUM_THREADS= None
GOTO_NUM_THREADS= None

Numpy config: (used when the Theano flag "blas.ldflags" is empty)
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blis_info:
    NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
lapack_mkl_info:
    NOT AVAILABLE
blas_mkl_info:
    NOT AVAILABLE
Numpy dot module: numpy.core.multiarray
Numpy location: /home/cvpr/anaconda2/lib/python2.7/site-packages/numpy/__init__.pyc
Numpy version: 1.12.1
nvcc version:
nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 0.31s on GPU.
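For anyone who wants to reproduce that number without the full script, the core of what check_blas.py times is just a repeated GEMM on the configured device. A minimal sketch (assuming Theano is installed and the device is selected in .theanorc) would be:

import time
import numpy as np
import theano
import theano.tensor as T

# Repeat the 5000x5000 GEMM that check_blas.py reports (10 iterations),
# on whatever device .theanorc selects.
n, iters = 5000, 10
a = theano.shared(np.random.rand(n, n).astype(theano.config.floatX))
b = theano.shared(np.random.rand(n, n).astype(theano.config.floatX))
c = theano.shared(np.zeros((n, n), dtype=theano.config.floatX))

f = theano.function([], updates=[(c, T.dot(a, b))])

t0 = time.time()
for _ in range(iters):
    f()
print("Total execution time: %.2fs" % (time.time() - t0))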

BTW, I forgot to mention that my NVIDIA driver version is 378.13 on Ubuntu 16.04.2.

Thanks in advance!

Well, that's better!

Your NVIDIA drivers are still a few months old; you should try the latest, 381.22. In my experience, drivers can change performance quite a lot, for better or for worse (I haven't seen that with neural networks yet, but it is the case for GPU 3D rendering, for instance).

Maybe you could try reinstalling Theano with the new backend following the Theano Ubuntu instructions. I would recommend doing that in a new Python environment using conda env; that way you should not break your current install.

Linking Theano to MKL or cuBLAS, or compiling your own OpenBLAS, might also help a bit.
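If you go down that road, a quick way to see which BLAS NumPy (and therefore Theano, when blas.ldflags is empty) is picking up, and to get a rough CPU GEMM timing to compare against, is a small sketch like this:

import time
import numpy as np

# Show which BLAS NumPy was built against (MKL, OpenBLAS, ...).
np.show_config()

# Rough CPU-side timing of the same 5000x5000 GEMM for comparison
# with the GPU number reported by check_blas.py.
n = 5000
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

t0 = time.time()
a.dot(b)
print("CPU GEMM time: %.2fs" % (time.time() - t0))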

Note that performance can also vary from one OS to another, and depends on your system and on the flavour of your GPU (some models are overclocked), so do not go mad trying to reach 111s.


Thanks a lot for the tips, Fabien.

I am having the same problem. Previously I had a Quadro P2200 and I was able to train a model with a batch size of 64 (it is a Dell Tower 7910 with dual CPUs). Now I have two 1080 Ti GPUs, but training runs out of memory with the batch size set to 64. If I set it to 32 it runs, but each epoch takes hours. The driver is the latest version (384), with CUDA 8 and cuDNN 7, and I am still using my old settings in .theanorc, which are:
[global]
floatX = float32
device = gpu

[cuda]
root=/usr/local/cuda/

What could be wrong?

Really useful analysis, thanks. I am considering getting a used Z800 or Z600, so I have been reading your comments on this with interest, and I am reassured that the multiprocessing will make up for the slower processor.

However, your second conclusion about the odd behaviour comes down to a bug: batch_size is set to 100 at the top but to 64 further down. This means the two runs being compared have different batch sizes but the same number of batches per epoch, which explains the difference in timings.
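A hypothetical sketch of the effect (the constant names and counts below are made up, not the ones in the script): if the number of batches per epoch is derived from the batch size set at the top while the batches are actually built with the other value, the two runs see different amounts of data per epoch even though the batch count is identical.

# Hypothetical numbers to illustrate the mismatch, not the script's actual values.
n_train = 50000

batch_size_top = 100   # set at the top of the script
batch_size_used = 64   # used further down where the batches are built

batches_per_epoch = n_train // batch_size_top        # 500 in both runs

print("samples seen per epoch:", batches_per_epoch * batch_size_used)  # 32000, not 50000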