Hi @FabienP, @kzuiderveld, thank you for your replies.
I was able to reduce training time to 200 s, but that is still far from the 111 s of @kzuiderveld's script. The performance improved after including the [dnn] paths in .theanorc. It was definitely not CNMeM: training took about the same time (199 s) after disabling it.
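For anyone else tuning this, the [dnn] section I mean looks roughly like the sketch below (the paths are just examples for a default CUDA 8.0 install; point include_path/library_path at wherever your cuDNN headers and libs actually live):

```ini
[dnn]
enabled = True
include_path = /usr/local/cuda/include
library_path = /usr/local/cuda/lib64
```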
I am still using the old Theano backend, as I am getting some exceptions with the new gpuarray backend. Do you think that could explain the difference between @kzuiderveld's training time and mine, or is it something to do with BLAS/cuBLAS? The official Theano page only promises "Maybe a small run time speed up." after converting to the new backend.
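For reference, switching to the new backend should only need a device change in .theanorc (assuming libgpuarray/pygpu are installed; `device = cuda` selects the new backend, replacing the old `device = gpu`):

```ini
[global]
device = cuda
floatX = float32
```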
When I ran check_blas.py, I got the following output:
Some Theano flags:
blas.ldflags=
compiledir= /home/cvpr/.theano/compiledir_Linux-4.8--generic-x86_64-with-debian-stretch-sid-x86_64-2.7.13-64
floatX= float32
device= gpu
Some OS information:
sys.platform= linux2
sys.version= 2.7.13 |Anaconda 4.3.1 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
sys.prefix= /home/cvpr/anaconda2
Some environment variables:
MKL_NUM_THREADS= None
OMP_NUM_THREADS= None
GOTO_NUM_THREADS= None
Numpy config: (used when the Theano flag "blas.ldflags" is empty)
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
blis_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
lapack_mkl_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
Numpy dot module: numpy.core.multiarray
Numpy location: /home/cvpr/anaconda2/lib/python2.7/site-packages/numpy/__init__.pyc
Numpy version: 1.12.1
nvcc version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).
Total execution time: 0.31s on GPU.
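For comparison, those gemm numbers work out to roughly 8 TFLOP/s. A quick back-of-the-envelope check, counting ~2·n³ floating-point operations per gemm:

```python
# Effective throughput from the check_blas.py numbers above.
n = 5000          # square matrix dimension
calls = 10        # number of gemm calls
total_time = 0.31 # seconds on GPU

flops = 2 * n**3 * calls        # ~2*n^3 flops per gemm
tflops = flops / total_time / 1e12
print(round(tflops, 2))         # ~8.06 TFLOP/s
```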
BTW, I forgot to mention that my NVIDIA driver version is 378.13 on Ubuntu 16.04.2.
Thanks in advance!!