I was not happy with deep learning performance, especially on an old Z800. It turns out that the CPU is not fast enough to feed the GPU; more importantly, I figured out how to work around that and achieve maximum performance. With data augmentation enabled, an epoch now requires only 70 seconds on my system, a huge improvement.
Caveat: I am doing part 1 using a Python 3 environment and Keras 2, which has a slightly different API. This approach might work with Keras 1 (it’s really, really simple: just add workers=N to fit_generator), but I didn’t test it (I don’t know whether Keras 1 provides thread-safe iterators).
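For concreteness, here is a minimal sketch of what that one-line change looks like. This assumes the Keras 2 API; the directory path, image size, batch size, and worker count are all placeholders, and the import is kept inside the function so the sketch stands alone:

```python
def train_multithreaded(model, train_dir, batch_size=64, epochs=3, n_workers=4):
    """Sketch: Keras 2 fit_generator with N worker threads feeding the GPU.
    Everything except `workers` is ordinary fit_generator plumbing."""
    from keras.preprocessing.image import ImageDataGenerator

    # any augmentation settings you like - these are just examples
    gen = ImageDataGenerator(rotation_range=10, horizontal_flip=True)
    batches = gen.flow_from_directory(train_dir, target_size=(224, 224),
                                      batch_size=batch_size)
    model.fit_generator(batches,
                        steps_per_epoch=batches.samples // batch_size,
                        epochs=epochs,
                        workers=n_workers)  # <- the one-line change
```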
My notebook contains more information on this approach:
Karel, thanks so much for this… I have been working on code parallelization for image preprocessing for my “new” 16-core Z820, but didn’t know that the data augmentation in Keras could be parallelized. This is awesome!
It depends on the amount of computing done by the GPU and CPU. If you have a slow GPU, the CPU may be able to keep up and a single worker thread might suffice. If you have a box with multiple high-end GPUs, you may need quite a few worker threads to prevent the GPUs from starving for data.
Think of it as yet another hyperparameter. Do a timing run without worker threads, one with two workers, one with three workers… and pick the setting beyond which performance no longer improves.
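Treating the worker count as a hyperparameter can be as mechanical as timing one epoch per setting and keeping the fastest. A sketch, with the actual Keras run stubbed out by fixed timings so the selection logic is visible (the numbers are invented):

```python
def best_worker_count(time_epoch, candidates=(1, 2, 3, 4)):
    """Run `time_epoch(n)` (seconds for one epoch with n workers)
    for each candidate and return the fastest setting plus all timings."""
    timings = {n: time_epoch(n) for n in candidates}
    return min(timings, key=timings.get), timings

# Stand-in for a real timed fit_generator run: throughput stops
# improving after 3 workers, so 3 wins.
fake_seconds = {1: 120.0, 2: 80.0, 3: 70.0, 4: 70.5}
best, timings = best_worker_count(lambda n: fake_seconds[n])
print(best)  # -> 3
```

In a real run, `time_epoch` would wrap a single-epoch fit_generator call with time.time() before and after.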
I’m still using Keras 1.2, where the parameter is called nb_worker. I tried it, but it doesn’t seem to help. I also got a message about nb_worker being deprecated when pickle_safe=False, so I set that to True as well.
I tried it on my version of cats and dogs and got about the same times. Oddly, if I set nb_worker to some value other than 1, I get a warning that the number of training samples is higher than the number of samples per epoch. So perhaps I haven’t implemented it correctly. Does anybody else have an example using Keras 1.2?
My graphs look pretty much the same between nb_worker=1 and nb_worker=2,3, or 4. They are somewhat different from yours. The GPU is at about 100% with occasional drops down to 50% for a few seconds. The CPU seems to hover around 25% or so. Maybe it is because I have a regular ol’ 1080?
I do love the graphing technique. It could be really helpful in so many situations. Thanks for that.
I guess I also have a question. In your graphs, the GPU is already running at 100% and the CPU is the bottleneck at about 33%, right? But if 33% of the CPU can keep the GPU saturated at 100%, then where is the extra performance coming from?
@Rothrock42, can’t help you with Keras 1.2 - I googled nb_worker and saw mixed messages (for some it works, for others it doesn’t). The image generator needs to be thread-safe; if it’s not, people reported problems with the number of training samples, as you found. Perhaps the Keras 1.2 image generator is not thread-safe and the Keras 2.0 version is?
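For what it’s worth, the usual workaround when a generator is not thread-safe is to serialize access to next() with a lock. A minimal, generic sketch (not Keras-specific - the class name and demo are mine):

```python
import threading

class ThreadSafeIterator:
    """Wrap any iterator so several worker threads can share it:
    only one thread at a time may advance the underlying iterator."""
    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.it)

# Demo: four threads drain one shared iterator; every item is
# delivered exactly once - which is what an unsafe iterator breaks.
safe = ThreadSafeIterator(iter(range(1000)))
seen = []

def drain():
    for item in safe:
        seen.append(item)

threads = [threading.Thread(target=drain) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(seen) == list(range(1000)))  # -> True
```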
With respect to my graphs - yes, I was surprised to see a significant speedup when the graph suggested the GPU was already running at 100%. PCI-E throughput with multiple threads is less bursty and overall higher, so something definitely improved. Perhaps nvidia-smi’s way of measuring GPU load is not entirely accurate? Too many unknowns here…
Karel, I noticed that Keras doesn’t have ‘workers’ as an argument for the fit() function, only for fit_generator(). Do you (or anyone) know if that’s because it only supports multithreading on the latter?
I have been using fit() because it is the only way I have figured out how to use StratifiedKFold() without a lot of hassle. This is what I am trying to do:
```python
for trn_index, val_index in skf.split(X=np.zeros(len(trn)), y=trn_classes):
    print("Kfold = ", k)
    batches = 0
    for augmented_trn, augmented_trn_labels in gen.flow(
            X=trn[trn_index], y=trn_labels[trn_index],
            batch_size=1697, shuffle=False):
        model.fit(augmented_trn, augmented_trn_labels,
                  batch_size=batch_size, nb_epoch=num_epoch,
                  validation_data=(trn[val_index], trn_labels[val_index]))
        batches += 1
        if batches >= len(trn_index) / batch_size:
            # we need to break the loop by hand because
            # the generator loops indefinitely
            break
```
No, you’re not missing anything - I don’t think that model.fit supports multiple threads. But: it might be worth trying to use fit_generator instead with multithreading enabled - and not do any data augmentation (don’t specify any parameters that would augment the image).
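One way to try that suggestion: keep fit_generator (so its worker threads still apply) but build the ImageDataGenerator with no arguments, so it only loads and batches images without augmenting them. A sketch (Keras 2 API assumed; paths and sizes are placeholders):

```python
def fit_plain_generator(model, train_dir, batch_size=64, epochs=3, n_workers=4):
    """Sketch: multithreaded fit_generator with NO augmentation -
    an ImageDataGenerator built with no arguments just loads images."""
    from keras.preprocessing.image import ImageDataGenerator

    gen = ImageDataGenerator()  # no augmentation parameters at all
    batches = gen.flow_from_directory(train_dir, target_size=(224, 224),
                                      batch_size=batch_size)
    model.fit_generator(batches,
                        steps_per_epoch=batches.samples // batch_size,
                        epochs=epochs,
                        workers=n_workers)
```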
Thanks. You’re right - I will try fit_generator next. I will just suck it up and do the cross validation the old fashioned way for now, with a fixed split between training and validation directories. I have only been able to use the scikit-learn StratifiedKFold() with fit(), as fit_generator() doesn’t give you the ability to index specific image lists returned for the training and validation images.
What I would really like to see is K-fold support added to fit_generator internally, but I think that would be a gigantic pain - too much time when you’re doing a contest.
Of course others have looked into the problem. However, a simple split of “all input” might not always work (think: State Farm (split by drivers) or the Fishing competition (split by boat)) - I suspect there’s a lot of room for DIY approaches.
I created a little convenience function to split the main train directory into training and validation directories and randomly shuffle files into them, based upon the number of folds (while keeping the main training directory intact). Yes, it is a brute-force way of doing it and isn’t stratified k-fold, but you can call this function to shuffle things up between sets of epochs.
```python
import os, shutil
from glob import glob
import numpy as np

# This function keeps the main train dir intact, and creates 2 new dirs,
# one each for a randomly selected train/val split
def train_val_split(train_path, split_train_path, split_val_path):
    # First see if the split train/val directories already exist - if so, delete them...
    if os.path.exists(split_train_path): shutil.rmtree(split_train_path)
    if os.path.exists(split_val_path): shutil.rmtree(split_val_path)
    # Next copy everything in the combined training directory to the split training directory
    shutil.copytree(train_path, split_train_path)
    num_folds = 5  # One of the folds becomes val, the rest stays in train...
    for subdir in glob(os.path.join(split_train_path, '*')):
        # Create a matching val directory for this class
        valsubdir = os.path.join(split_val_path, os.path.basename(subdir))
        os.makedirs(valsubdir)
        shuf = np.random.permutation(glob(os.path.join(subdir, '*.jpg')))
        for i in range(int(round(len(shuf) / num_folds))):
            dest = os.path.join(valsubdir, os.path.basename(shuf[i]))
            print("Transferring", shuf[i], "to", dest)
            os.rename(shuf[i], dest)
```
And I did the same with CUDA to be sure Theano finds everything it needs.
You can also test your BLAS binding (using either cuBLAS or OpenBLAS).
I personally had a hard time getting Theano working correctly (on Windows), but found this article very helpful, as well as this one, which contains performance comparisons with/without cuDNN and CNMeM.
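For reference, cuDNN and CNMeM are enabled through Theano’s configuration file. A sketch of a .theanorc of that era (the values are placeholders - in particular, tune cnmem, the fraction of GPU memory to pre-allocate, to your card):

```ini
[global]
device = gpu
floatX = float32

[lib]
cnmem = 0.8

[dnn]
enabled = True
```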
There are others describing the procedure for linux if needed.