Huge performance improvement with network training!

I was not happy with deep learning performance, especially on an old Z800. It turns out that the CPU is not fast enough to feed the GPU; more importantly, I figured out how to achieve maximum performance. With data augmentation, an epoch now only requires 70 seconds on my system, a huge improvement.

Caveat: I am doing part 1 using a Python 3 environment and Keras 2, which has a slightly different API. This approach might work with Keras 1 (it’s really, really simple: just add workers=N to fit_generator), but I didn’t test it (don’t know whether Keras 1 provides threadsafe iterators).

My notebook contains more information on this approach:


Karel, thanks so much for this… I have been working on code parallelization for image preprocessing for my “new” 16-core Z820, but didn’t know that the data augmentation in Keras could be parallelized. This is awesome! :slight_smile:

1 Like

Can you share your updated I’ve tried to get your notebook to work, but just got stuck on

from keras.utils.layer_utils import layer_from_config

which is no longer available in keras 2.

edit: I think I managed to fix the issue by using keras.layers.deserialize() instead. Here’s my in case anyone wonders. Not sure if there are any more issues though.

1 Like

Here are the files that I’m using: And yes, I don’t know whether there any more issues either :slight_smile:

That’s great! I like your technique for graphing the nvidia-smi results too :slight_smile:

@kzuiderveld I can’t wait to try this. Can someone explain how to pick the number of workers? I assume it is at least partially based on the number of CPUs available.

It depends on the amount of computing done by the GPU and CPU. If you have a slow GPU, the CPU may be able to keep up and only one worker thread might suffice. If you have a box with multiple high-end GPUs, you may need quite a few worker thread to prevent the GPUs starving for data.

Think of it as yet another hyperparameter :slight_smile: Do a timing without worker threads, one with two workers, one with three workers… And pick the one where performance doesn’t improve anymore.

I’m still using Keras 1.2, where the parameter is called nb_worker and I tried it, but it doesn’t seem to help. I also got a message about nb_worker being depricated when pickle_safe=false. So I set that to True as well.

I tried it on my version of cats and dogs and got about the same times. It is odd that if I set a nb_worker for some value other than 1 I get an warning that the number of training samples is higher than the number of samples per epoch. So perhaps I haven’t implemented it correctly. Anybody else have an example using Keras 1.2?

My graphs look pretty much the same between nb_worker=1 and nb_worker=2,3, or 4. They are somewhat different from yours. The GPU is at about 100% with occasional drops down to 50% for a few seconds. The CPU seems to hover around 25% or so. Maybe it is because I have a regular ol’ 1080?

I do love the graphing technique. That is could be really helpful in so many situations.Thanks for that.

I guess I also have a question. In your graphs, the gpu is already running at 100% and the cpu is the bottleneck at about 33%, right? But if 33% of the cpu can keep the gpu saturated at 100%, then where is the extra performance coming from?

@Rothrock42, can’t help you with Keras 1.2 - I googled on nb_worker and saw some mixed messages (for some, it works - for others, it doesn’t). The image generator needs to be thread-safe; if it’s not, folks reported problems with the # of training samples as you found. Perhaps the Keras 1.2 image generator is not threadsafe and the Keras 2.0 version is?

With respect to my graphs - yes, I was surprised to see a significant speedup when the graph suggested the GPU was already running at 100%. PCI-E throughput with multiple threads is less bursty and overall higher, so something definitely improved. Perhaps nvidia-smi’s way of measuring GPU load is not entirely accurate? Too many unknowns here…

Karel, I noticed that Keras doesn’t have ‘workers’ as an argument for the fit() function, only for fit_generator(). Do you (or anyone) know if that’s because it only supports multithreading on the latter?

I have been using fit() because it is the only way I have figured out how to use StratifiedKFold() without a lot of hassle. This is what I am trying to do:

skf = StratifiedKFold(n_splits=num_folds, random_state=1, shuffle=True)


for trn_index, val_index in skf.split(X=np.zeros(len(trn)), y=trn_classes):
print("Kfold = ", k)

batches = 0

for augmented_trn, augmented_trn_labels in gen.flow(X=trn[trn_index], y=trn_labels[trn_index], batch_size = 1697, shuffle=False):, augmented_trn_labels, batch_size=batch_size, nb_epoch=num_epoch, validation_data=(trn[val_index], trn_labels[val_index]))
    batches += 1
    if batches >= len(trn_index) / batch_size:
        # we need to break the loop by hand because
        # the generator loops indefinitely

Am I missing something?

Thanks, Christina

No, you’re not missing anything - I don’t think that supports multiple threads. But: it might be worth trying to use fit_generator instead with multithreading enabled - and not do any data augmentation (don’t specify any parameters that would augment the image).

1 Like

Hi Karel,
Thanks. You’re right - I will try fit_generator next. I will just suck it up and do the cross validation the old fashioned way for now, with a fixed split between training and validation directories. I have only been able to use the scikit-learn StratifiedKFold() with fit(), as fit_generator() doesn’t give you the ability to index specific image lists returned for the training and validation images.

What I would really like to see is K-fold added to fit_generator internally, but I think that would be a gigantic pain, too much time when you’re doing a contest. :disappointed:

At any rate, thanks for you valuable thoughts! :slight_smile:

The Keras author nowadays suggests using tensorflow queues for best performance. It’s not well documented, but worth playing with if performance is an issue for you.


It looks like someone’s already thought about the problem whereby you need separate directories for your training and validation data for fit_generator / flow from directory:

It would be great to be able to dynamically specify train/test splits on the fly… just one step away from stratified k-fold.

Of course others have looked into the problem :slight_smile: However, a simple split of “all input” might not always work (think: State Farm (split by drivers) or Fishing competition (split by boat)) - I suspect there’s lot of room for DIY approaches.

Karel, yes agreed that stratified k-fold would not be appropriate for all situations… :slight_smile:

I created a little convenience function to split the main train directory into training and validation directories and randomly shuffle files into them, based upon the number of folds (while keeping the main training directory intact. Yes, it is a brute force way of doing it, and isn’t stratified k-fold, but you can call this function to shuffle things up between sets of epochs.

> import os, shutil
> import numpy as np
> import random

> #This function keeps the main train dir intact, and creates 2 new dirs, one each for
> #randomly selected train/test split

> def train_val_split(train_path, split_train_path, split_val_path):

>     # First see if train_split and val_split directories already exist - if so, delete them...
>     if(os.path.exists(split_train_path)): shutil.rmtree(split_train_path)
>     if(os.path.exists(split_val_path)): shutil.rmtree(split_val_path)

>     # Create a new val directory
>     os.mkdir(split_val_path)

>     # Next copy everything in the combined training directory to a the split training directory
>     shutil.copytree(train_path, split_train_path)

>     num_folds = 5  # One of the folds to be val, the rest for train...

>     for subdir in glob(split_train_path + '*'):
>         valsubdir = split_val_path + subdir.split('/')[3]
>         os.mkdir(valsubdir)
>         g = glob(subdir + '/*.jpg')

>         shuf = np.random.permutation(g)

>         for i in range(int(round(len(shuf)/num_folds))):
>             print("Transferring ", shuf[i], " to ", split_val_path + shuf[i].split('/')[3] + '/' + shuf[i].split('/')[4])           
>             os.rename(shuf[i], split_val_path + shuf[i].split('/')[3] + '/' + shuf[i].split('/')[4])
1 Like

Hi @kzuiderveld, I’m trying on the same nvidia 1080-ti gpu but my performance seems to be significantly slow even though gpu utilization is 100%
Below is the screenshot after running your code

The time it takes is 2190s compared to your 111s.

Below are my specs

anaconda for python 2.7


nvidia geforce gtx 1080-ti
dell precision tower 7910
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1

Below is my .theanorc



root = /usr/local/cuda-8.0/include

I’m also getting this below exception when I import theano

Using Theano backend.
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:

Using gpu device 0: Graphics Device (CNMeM is disabled, cuDNN Mixed dnn version. The header is from one version, but we link with a different version (5110, 5105))

Is this some kind of CPU bottleneck issue
Has anyone faced this issue ?

Thanks in advance

Hmmm - offhand, I can’t tell why your system is so much slower. There are some differences with my setup though:

  1. I’m using an Anaconda 3 setup. My notebook assumes Python 3, so perhaps there’s an incompability causing the problem.
  2. I’m using device=cuda, not gpu0. I also don’t have the cudnn statement. device=cuda will cause the switch to the gpuarray backend - which might be faster.

Let me know if this helps.

Hi @rteja1113 ,

From what I see, two simple improvements could improve performances here.

Preallocate VRAM

Just add the following lines to your .theanorc file:

# Using old theano cuda backend
cnmem = 0.8

# Or new theano gpuarray backend
preallocate = 0.8

It will preallocate 80% of your VRAM for theano. This amount can be changed depending on your setup, but as contiguous block will be used you may be limited by the memory already used for display.

Properly use cuDNN

I’m not convinced that [lib] cudnn=True is enough. I added these lines to get it working:

enabled = auto
root = path_to_your_CUDA\v8.0_install
library_path = path_to_your_CUDA\v8.0_libs
include_path = path_to_your_CUDA\v8.0_install_include

And I did the same with cuda to be sure theano finds everything it needs.
You can also test you BLAS binding (using either cuBLAS or openBLAS)

I personally had a hard time getting theano working correctly (on windows) but found this article very helpful, as well as this one which contains performance comparisons with/without cuDNN and CNMeM.
There are others describing the procedure for linux if needed.

Hope it helps.