Huge performance improvement with network training!

@kzuiderveld I can’t wait to try this. Can someone explain how to pick the number of workers? I assume it is at least partially based on the number of CPUs available.

It depends on the amount of computing done by the GPU and CPU. If you have a slow GPU, the CPU may be able to keep up and only one worker thread might suffice. If you have a box with multiple high-end GPUs, you may need quite a few worker threads to prevent the GPUs from starving for data.

Think of it as yet another hyperparameter :slight_smile: Do a timing run without worker threads, one with two workers, one with three workers… and pick the point where performance stops improving.
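Something like this toy sketch, for example (my assumptions: Keras 2's fit_generator with its workers argument and the channels_last image format - swap in your own model and generator):

import time
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

# Toy data and model, just to make the timing loop self-contained.
x = np.random.rand(2048, 64, 64, 3).astype('float32')
y = np.random.randint(0, 2, size=(2048, 1))

model = Sequential([
    Conv2D(16, (3, 3), activation='relu', input_shape=(64, 64, 3), data_format='channels_last'),
    Flatten(),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Some CPU-side augmentation so the worker threads actually have work to do.
gen = ImageDataGenerator(rotation_range=10, horizontal_flip=True, data_format='channels_last')

for n_workers in (1, 2, 3, 4):
    start = time.time()
    model.fit_generator(gen.flow(x, y, batch_size=64),
                        steps_per_epoch=len(x) // 64,
                        epochs=1, workers=n_workers, verbose=0)
    print("workers=%d: %.1fs" % (n_workers, time.time() - start))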

I’m still using Keras 1.2, where the parameter is called nb_worker. I tried it, but it doesn’t seem to help. I also got a message about nb_worker being deprecated when pickle_safe=False, so I set that to True as well.

I tried it on my version of cats and dogs and got about the same times. Oddly, if I set nb_worker to some value other than 1, I get a warning that the number of training samples is higher than the number of samples per epoch. So perhaps I haven’t implemented it correctly. Does anybody else have an example using Keras 1.2?

My graphs look pretty much the same between nb_worker=1 and nb_worker=2,3, or 4. They are somewhat different from yours. The GPU is at about 100% with occasional drops down to 50% for a few seconds. The CPU seems to hover around 25% or so. Maybe it is because I have a regular ol’ 1080?

I do love the graphing technique - that could be really helpful in so many situations. Thanks for that.
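(In case anyone else wants to make a similar plot: this isn't Karel's actual script, just a rough sketch that polls nvidia-smi and psutil once a second while training runs in another process, then plots the two curves. It assumes nvidia-smi is on the PATH and that psutil and matplotlib are installed.)

import subprocess, time
import psutil
import matplotlib.pyplot as plt

gpu_util, cpu_util = [], []
for _ in range(120):  # sample for roughly two minutes
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=utilization.gpu', '--format=csv,noheader,nounits'])
    gpu_util.append(int(out.decode().split()[0]))  # first GPU only
    cpu_util.append(psutil.cpu_percent(interval=None))
    time.sleep(1)

plt.plot(gpu_util, label='GPU %')
plt.plot(cpu_util, label='CPU %')
plt.xlabel('seconds')
plt.ylabel('utilization')
plt.legend()
plt.show()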

I guess I also have a question. In your graphs, the GPU is already running at 100% and the CPU is the bottleneck at about 33%, right? But if 33% of the CPU can keep the GPU saturated at 100%, then where is the extra performance coming from?

@Rothrock42, can’t help you with Keras 1.2 - I googled on nb_worker and saw some mixed messages (for some, it works - for others, it doesn’t). The image generator needs to be thread-safe; if it’s not, folks reported problems with the # of training samples as you found. Perhaps the Keras 1.2 image generator is not threadsafe and the Keras 2.0 version is?
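If the generator really isn't thread-safe, a generic workaround (not specific to either Keras version) is to wrap it so that only one thread can pull a batch at a time - a minimal sketch:

import threading

class ThreadsafeIterator(object):
    """Wrap a generator/iterator so that next() calls are serialized with a lock."""
    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):  # Python 3
        with self.lock:
            return next(self.it)

    next = __next__  # Python 2 compatibility

# Usage (hypothetical name): model.fit_generator(ThreadsafeIterator(my_generator), ...)

Note that this serializes batch creation, so it mainly protects correctness; with several workers the remaining speedup comes from the prefetch queue overlapping batch preparation with the GPU work.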

With respect to my graphs - yes, I was surprised to see a significant speedup when the graph suggested the GPU was already running at 100%. PCI-E throughput with multiple threads is less bursty and overall higher, so something definitely improved. Perhaps nvidia-smi’s way of measuring GPU load is not entirely accurate? Too many unknowns here…

Karel, I noticed that Keras doesn’t have ‘workers’ as an argument for the fit() function, only for fit_generator(). Do you (or anyone) know if that’s because it only supports multithreading on the latter?

I have been using fit() because it is the only way I have figured out how to use StratifiedKFold() without a lot of hassle. This is what I am trying to do:

skf = StratifiedKFold(n_splits=num_folds, random_state=1, shuffle=True)

k = 1
for trn_index, val_index in skf.split(X=np.zeros(len(trn)), y=trn_classes):
    print("Kfold = ", k)
    batches = 0
    for augmented_trn, augmented_trn_labels in gen.flow(X=trn[trn_index], y=trn_labels[trn_index], batch_size=1697, shuffle=False):
        model.fit(augmented_trn, augmented_trn_labels, batch_size=batch_size, nb_epoch=num_epoch,
                  validation_data=(trn[val_index], trn_labels[val_index]))
        batches += 1
        if batches >= len(trn_index) / batch_size:
            # we need to break the loop by hand because
            # the generator loops indefinitely
            break
    k += 1

Am I missing something?

Thanks, Christina

No, you’re not missing anything - I don’t think that model.fit supports multiple threads. But: it might be worth trying to use fit_generator instead with multithreading enabled - and not do any data augmentation (don’t specify any parameters that would augment the image).
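Something along these lines, reusing the names from your snippet above (trn, trn_labels, trn_index, val_index, batch_size, num_epoch); Keras 2 argument names shown - in 1.2 they would be samples_per_epoch / nb_epoch / nb_worker:

from keras.preprocessing.image import ImageDataGenerator

# No augmentation arguments: the generator only shuffles and batches.
plain_gen = ImageDataGenerator()

model.fit_generator(
    plain_gen.flow(trn[trn_index], trn_labels[trn_index], batch_size=batch_size),
    steps_per_epoch=len(trn_index) // batch_size,
    epochs=num_epoch,
    validation_data=(trn[val_index], trn_labels[val_index]),
    workers=4)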


Hi Karel,
Thanks. You’re right - I will try fit_generator next. I will just suck it up and do the cross validation the old fashioned way for now, with a fixed split between training and validation directories. I have only been able to use the scikit-learn StratifiedKFold() with fit(), as fit_generator() doesn’t give you the ability to index specific image lists returned for the training and validation images.

What I would really like to see is K-fold added to fit_generator internally, but I think that would be a gigantic pain, too much time when you’re doing a contest. :disappointed:

At any rate, thanks for your valuable thoughts! :slight_smile:

The Keras author nowadays suggests using tensorflow queues for best performance. It’s not well documented, but worth playing with if performance is an issue for you.
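For the curious, the old TF 1.x queue-runner API looks roughly like this - a toy sketch with random in-memory data, not wired into Keras:

import numpy as np
import tensorflow as tf

# Toy in-memory dataset, just to show the queue mechanics.
data = np.random.rand(1000, 32, 32, 3).astype('float32')
labels = np.random.randint(0, 10, size=(1000,)).astype('int32')

# One (image, label) pair at a time, shuffled, fed into a multi-threaded batching queue.
image, label = tf.train.slice_input_producer([data, labels], shuffle=True)
image_batch, label_batch = tf.train.shuffle_batch(
    [image, label], batch_size=64, capacity=2000,
    min_after_dequeue=1000, num_threads=4)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    x, y = sess.run([image_batch, label_batch])  # one prefetched batch
    print(x.shape, y.shape)
    coord.request_stop()
    coord.join(threads)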


It looks like someone’s already thought about the problem whereby you need separate directories for your training and validation data for fit_generator / flow_from_directory.

It would be great to be able to dynamically specify train/test splits on the fly… just one step away from stratified k-fold.

Of course others have looked into the problem :slight_smile: However, a simple split of “all input” might not always work (think: State Farm (split by drivers) or the Fishing competition (split by boat)) - I suspect there’s a lot of room for DIY approaches.

Karel, yes agreed that stratified k-fold would not be appropriate for all situations… :slight_smile:

I created a little convenience function to split the main train directory into training and validation directories, randomly shuffling files into them based upon the number of folds (while keeping the main training directory intact). Yes, it is a brute-force way of doing it and isn’t stratified k-fold, but you can call this function to shuffle things up between sets of epochs.

import os, shutil
import numpy as np
import random
from glob import glob

# This function keeps the main train dir intact, and creates 2 new dirs, one each for
# a randomly selected train/val split

def train_val_split(train_path, split_train_path, split_val_path):

    # First see if train_split and val_split directories already exist - if so, delete them...
    if os.path.exists(split_train_path): shutil.rmtree(split_train_path)
    if os.path.exists(split_val_path): shutil.rmtree(split_val_path)

    # Create a new val directory
    os.mkdir(split_val_path)

    # Next copy everything in the combined training directory to the split training directory
    shutil.copytree(train_path, split_train_path)

    num_folds = 5  # One of the folds to be val, the rest for train...

    for subdir in glob(split_train_path + '*'):
        valsubdir = split_val_path + subdir.split('/')[3]
        os.mkdir(valsubdir)
        g = glob(subdir + '/*.jpg')

        shuf = np.random.permutation(g)

        for i in range(int(round(len(shuf) / num_folds))):
            print("Transferring ", shuf[i], " to ", split_val_path + shuf[i].split('/')[3] + '/' + shuf[i].split('/')[4])
            os.rename(shuf[i], split_val_path + shuf[i].split('/')[3] + '/' + shuf[i].split('/')[4])
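Usage would be something like the call below (the paths are hypothetical). Note that the subdir.split('/')[3] indexing assumes a fixed directory depth such as data/<competition>/train_split/<class>/, so adjust the indices (or switch to os.path functions) for your own layout.

# Hypothetical paths - adjust for your own directory layout.
train_val_split('data/redux/train/', 'data/redux/train_split/', 'data/redux/valid_split/')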

Hi @kzuiderveld, I’m trying this on the same nvidia 1080-ti GPU, but my performance is significantly slower even though GPU utilization is 100%.
Below is the screenshot after running your code

The time it takes is 2190s compared to your 111s.

Below are my specs

PYTHON_PACKAGES
anaconda for python 2.7
theano-0.9
keras-2.0.2

CUDA
cuda-8.0
cudnn-5.1.1

HARDWARE
gpu
nvidia geforce gtx 1080-ti
cpu
dell precision tower 7910
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
RAM:32GB

Below is my .theanorc

[global]
floatX=float32
device=gpu0
optimizer=fast_run

[lib]
cudnn=True

[cuda]
root = /usr/local/cuda-8.0/include

I’m also getting the below warning when I import theano

Using Theano backend.
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end(gpuarray)

Using gpu device 0: Graphics Device (CNMeM is disabled, cuDNN Mixed dnn version. The header is from one version, but we link with a different version (5110, 5105))

Is this some kind of CPU bottleneck issue?
Has anyone faced this issue ?

Thanks in advance

Hmmm - offhand, I can’t tell why your system is so much slower. There are some differences with my setup though:

  1. I’m using an Anaconda 3 setup. My notebook assumes Python 3, so perhaps there’s an incompatibility causing the problem.
  2. I’m using device=cuda, not gpu0. I also don’t have the cudnn statement. device=cuda will cause the switch to the gpuarray backend - which might be faster.

Let me know if this helps.
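For reference, the relevant part of a .theanorc for the new backend would look something like this (device=cuda selects the gpuarray backend; cuda0, cuda1, … pick a specific GPU):

[global]
floatX = float32
device = cuda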

Hi @rteja1113 ,

From what I see, two simple changes could improve performance here.

Preallocate VRAM

Just add the following lines to your .theanorc file:

# Using old theano cuda backend
[lib] 
cnmem = 0.8

# Or new theano gpuarray backend
[gpuarray]
preallocate = 0.8

It will preallocate 80% of your VRAM for Theano. This amount can be changed depending on your setup, but as a contiguous block will be allocated, you may be limited by the memory already used for display.

Properly use cuDNN

I’m not convinced that [lib] cudnn=True is enough. I added these lines to get it working:

[dnn]
enabled = auto
root = path_to_your_CUDA\v8.0_install
library_path = path_to_your_CUDA\v8.0_libs
include_path = path_to_your_CUDA\v8.0_install_include

And I did the same with cuda to be sure theano finds everything it needs.
You can also test your BLAS binding (using either cuBLAS or openBLAS).
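(check_blas.py ships with Theano itself; something like the snippet below should locate and run it - a sketch assuming the usual theano/misc layout.)

import os, subprocess, sys
import theano

# Run Theano's bundled BLAS benchmark from wherever the package is installed.
script = os.path.join(theano.__path__[0], 'misc', 'check_blas.py')
subprocess.call([sys.executable, script])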

I personally had a hard time getting theano working correctly (on windows) but found this article very helpful, as well as this one which contains performance comparisons with/without cuDNN and CNMeM.
There are others describing the procedure for linux if needed.

Hope it helps.

Hi @FabienP, @kzuiderveld, thank you for your replies

I was able to reduce training time to 200s, but that’s still far from the 111s in @kzuiderveld 's script. The performance improved after including the [dnn] paths in .theanorc. It was definitely not cnmem, as I trained in a similar time (199s) after disabling it.

I am still using the old theano backend (as I am getting some exceptions with the new gpuarray backend). Do you think that could be the reason for the difference between @kzuiderveld 's training time and mine, or is it something to do with BLAS/cuBLAS? The official theano page only says ‘Maybe a small run time speed up.’ about converting to the new backend.

When I ran check_blas.py, this was the output:

Some Theano flags:
blas.ldflags=
compiledir= /home/cvpr/.theano/compiledir_Linux-4.8–generic-x86_64-with-debian-stretch-sid-x86_64-2.7.13-64
floatX= float32
device= gpu
Some OS information:
sys.platform= linux2
sys.version= 2.7.13 |Anaconda 4.3.1 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
sys.prefix= /home/cvpr/anaconda2
Some environment variables:
MKL_NUM_THREADS= None
OMP_NUM_THREADS= None
GOTO_NUM_THREADS= None

Numpy config: (used when the Theano flag "blas.ldflags" is empty)
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
blis_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
lapack_mkl_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
Numpy dot module: numpy.core.multiarray
Numpy location: /home/cvpr/anaconda2/lib/python2.7/site-packages/numpy/__init__.pyc
Numpy version: 1.12.1
nvcc version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 0.31s on GPU.

BTW, I forgot to mention that my NVIDIA driver version is 378.13 on Ubuntu 16.04.2.

Thanks in advance !!

Well, that’s better!

Your Nvidia drivers are still a few months old; you should try the latest, 381.22. In my experience drivers can change performance quite a lot, for better or for worse (although I haven’t seen that with neural networks yet, it is the case for GPU 3D rendering, for instance).

Maybe you could try to reinstall theano with the new backend following the theano Ubuntu instructions. I would recommend doing that in a new Python environment using conda env; this way you should not break your current install.

Maybe linking theano to either MKL or cuBLAS, or compiling your own OpenBLAS, could help a bit.

Note that performance can vary from one OS to another too, and depends on your system and on the flavour of your GPU (some models are overclocked), so don’t drive yourself mad trying to reach 111s.


Thanks a lot for the tips Fabien.

I am having the same problem. Previously I had a Quadro P2200 and I was able to train a model with the batch size set to 64 (it is a Dell tower 7910, dual CPU). Now I have two 1080 Ti GPUs, but it runs out of memory with the batch size set to 64. If I set it to 32 it runs, but each epoch takes hours. The driver is the latest version, 384, CUDA version 8, cuDNN version 7, and I am still using my old settings in .theanorc, which is:
[global]
floatX = float32
device = gpu

[cuda]
root=/usr/local/cuda/

What could be wrong?

Really useful analysis, thanks. I am just considering getting a used Z800 or Z600, so I have been reading your comments on this with interest, and I am reassured that the multiprocessing will make up for the slower processor.

However, your second conclusion re the odd behaviour is down to a bug: the batch_size is set to 100 at the top but 64 further down. This means the two runs being compared have different batch sizes but the same number of batches per epoch, which explains the difference in timings.