8x speedup using AWS p2.8xlarge?

I’m getting tired of waiting 5+ minutes to train each epoch on an AWS p2.xlarge instance. Has anyone tried using the p2.8xlarge instances? These have 8 GPUs and 8 times the memory. Should I expect to see an 8x speedup, if I can multiply my batch_size by 8 when running fit()?

Any help or advice on speeding up the training process would be greatly appreciated! (I’m training on the State Farm distracted driver competition right now.)

Can you use a sample? Can you precompute the conv layer outputs and just train the dense layers?
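To illustrate the precompute idea: run the frozen conv layers over the data once, save the features, and fit only a small dense head on them. This is a rough sketch (random placeholder data; assumes Keras with a VGG-style model, and the `Dense(10)` output matches State Farm's 10 classes):

```python
import numpy as np
from keras.applications.vgg16 import VGG16
from keras.models import Sequential
from keras.layers import Flatten, Dense

# Placeholder data standing in for the real training set
x_train = np.random.random((32, 224, 224, 3))
y_train = np.eye(10)[np.random.randint(0, 10, 32)]

# Convolutional base only (include_top=False); its weights stay frozen,
# so its outputs for a fixed training set never change.
conv_model = VGG16(include_top=False, input_shape=(224, 224, 3))

# Run every image through the conv layers ONCE and save the features.
conv_feats = conv_model.predict(x_train, batch_size=16)

# Train a small dense head on the precomputed features; each epoch now
# only costs the dense forward/backward pass, which is far cheaper.
top_model = Sequential([
    Flatten(input_shape=conv_feats.shape[1:]),
    Dense(256, activation='relu'),
    Dense(10, activation='softmax'),
])
top_model.compile(optimizer='adam', loss='categorical_crossentropy')
top_model.fit(conv_feats, y_train, epochs=2, batch_size=16)
```

You can also save `conv_feats` to disk (e.g. with `np.save`) so later experiments on the dense layers skip the conv pass entirely.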

(The 8xlarge won’t help you, unless you write something to use all GPUs, which is non-trivial)


A discussion in the Keras GitHub issues says that training the same model across several GPUs is an open research problem:

" It is not a limitation of Keras. This is how deep learning works. Unless you want to get VERY researchy, you have to choose data parallelism or model parallelism (or a combination). No backend change could fix that"

If one chooses to train multiple models and then average their predictions, as mentioned in lesson 3, wouldn't it be at least theoretically possible to assign each model to a different GPU to speed up the overall process?
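The averaging step itself is trivial; each model could be trained in its own process pinned to one GPU (e.g. by launching with `CUDA_VISIBLE_DEVICES=0`, `CUDA_VISIBLE_DEVICES=1`, ...). A minimal sketch of the ensembling, where `preds_a` and `preds_b` stand in for each model's `predict()` output:

```python
import numpy as np

# Hypothetical class probabilities from two independently trained models
preds_a = np.array([[0.8, 0.2],
                    [0.3, 0.7]])
preds_b = np.array([[0.6, 0.4],
                    [0.5, 0.5]])

# Lesson-3-style ensemble: average the predicted probabilities
ensemble = (preds_a + preds_b) / 2  # -> [[0.7, 0.3], [0.4, 0.6]]
```

Since the models never exchange gradients, this parallelizes perfectly across GPUs, unlike training a single model.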

About training the same model across multiple GPUs, could the dropout layers help?
What I mean is: if we consider a layer i followed by a dropout layer i + 1 that drops more than 50% of the activations, could we launch two trainings of the following layers in parallel, since SGD would update different sets of weights in layer i?

It’s not that hard (using TensorFlow, though). Check out Transparent Multi-GPU Training on TensorFlow with Keras

from keras.applications.vgg16 import VGG16
from multi_gpu import make_parallel
model = VGG16(include_top=False, weights='imagenet', input_tensor=None, input_shape=(224, 224, 3))
model = make_parallel(model, 8)  # to use 8 GPUs

I didn’t get an 8x improvement though, only 5x-6x with this approach.

Yup that’s easy! Good idea :slight_smile:

Good thinking. Although perhaps not necessary - lock-free async SGD works pretty well anyway. https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf

Has anyone tried training different models across the GPUs of a p2.8xlarge using Theano’s backend yet? Would you be willing to share the recipe?

Alternatively, I could switch to TensorFlow’s backend in anticipation of the Deep Learning course pt2. I did try switching to TF as an experiment (installing TF and configuring Keras went fine), but the same model ran slower, because TensorFlow wasn’t using the GPU on the instance. I’m wondering what I was missing when setting up TensorFlow on AWS.
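One quick sanity check for that situation (using TF's device listing, which has been around since the 1.x era): list the devices TensorFlow can actually see.

```python
# If TensorFlow only lists a CPU device, the GPU build isn't installed
# or CUDA/cuDNN aren't visible; on AWS at the time that usually meant
# the CPU-only 'tensorflow' package instead of 'tensorflow-gpu'.
from tensorflow.python.client import device_lib

devices = [d.name for d in device_lib.list_local_devices()]
print(devices)  # expect a GPU entry such as '/gpu:0' alongside the CPU
```

`nvidia-smi` in the shell is the complementary check: it shows whether any process is actually using the GPU during training.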


Hi, I’m curious which versions of Keras and TensorFlow you’re using. I tried this script but can’t make it work with more than 2 GPUs (the terminal becomes unresponsive and then dies), and I can’t find any clue on the internet. Thanks.

Hi @shy, was this in training or just in the terminal itself? Sometimes Keras will hang in epoch training unless you set verbose=2
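For anyone hitting the same hang, here is a minimal self-contained illustration of the `verbose=2` setting (tiny random data, Keras 2-style API; the model itself is just a placeholder):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Toy data standing in for a real training set
x = np.random.random((64, 8))
y = np.random.randint(0, 2, (64, 1))

model = Sequential([Dense(4, activation='relu', input_shape=(8,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

# verbose=2 prints one summary line per epoch instead of the live
# progress bar, which is what can hang some terminals and multi-GPU runs.
model.fit(x, y, epochs=2, batch_size=16, verbose=2)
```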

I’ve noticed this behavior using Keras 2, Python 3, and TF. Apparently, some kind of race condition ensues when there is a large number of steps.


I’ve experienced memory allocation errors multiple times with TensorFlow on a simple P2 instance.
Currently I’m training on ~7000 images, which takes 30 min per epoch.

I was wondering - how hard is it to use multiple GPUs (8x or 16x?)

Does anyone have experience with it?
@chianti Do you have more experience since January?

Do I really just add 1 line?
Does this 1 line solve my memory allocation errors?
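Note that `make_parallel` is about speed, not memory, so it likely won't fix allocation errors by itself. If the errors come from TensorFlow pre-allocating nearly all GPU memory at startup, a commonly used snippet (TF 1.x-era API, which is what this thread is using) is:

```python
import tensorflow as tf
from keras import backend as K

# By default TF grabs almost the whole GPU at startup; allow_growth
# makes it allocate memory incrementally instead. This avoids some
# allocation errors, but won't help if the model genuinely doesn't
# fit -- in that case, reduce batch_size.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))
```

Put this before building the model, so the session is created with the modified config.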