8x speedup using AWS p2.8xlarge?

I'm getting tired of waiting 5+ minutes to train each epoch on an AWS p2.xlarge instance. Has anyone tried the p2.8xlarge instances? These have 8 GPUs and 8 times the memory. Should I expect an 8x speedup if I can multiply my batch_size by 8 when running fit()?

Any help or advice on speeding up the training process would be greatly appreciated! (I’m training on the State Farm distracted driver competition right now.)

Can you use a sample? Can you precompute the conv layer outputs and just train the dense layers?

(The 8xlarge won’t help you, unless you write something to use all GPUs, which is non-trivial)
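If you go the precompute route, here's a minimal sketch of the idea, assuming Keras 2 with the VGG16 application and training data already loaded as numpy arrays (X_train, one-hot y_train, and the file name are illustrative placeholders):

import numpy as np
from keras.applications.vgg16 import VGG16
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout

# Convolutional base with ImageNet weights; only used for prediction here
conv_base = VGG16(include_top=False, weights='imagenet', input_shape=(224, 224, 3))

# Run the slow convolutional part once and cache the result
conv_features = conv_base.predict(X_train, batch_size=64, verbose=1)
np.save('conv_features.npy', conv_features)

# Train only a small dense head on the precomputed features -- this part is fast
top_model = Sequential([
    Flatten(input_shape=conv_features.shape[1:]),
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='softmax'),  # State Farm has 10 classes
])
top_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
top_model.fit(conv_features, y_train, batch_size=64, epochs=5)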


The discussion in the Keras GitHub issues says that training the same model across several GPUs is essentially an open research problem:

" It is not a limitation of Keras. This is how deep learning works. Unless you want to get VERY researchy, you have to choose data parallelism or model parallelism (or a combination). No backend change could fix that"

If one chooses to train multiple models and then takes the average of their predictions, as mentioned in lesson 3, wouldn't it be at least theoretically possible to assign each model to a different GPU in order to speed up the overall process?

About training the same model across multiple GPUs, would it be possible to use the dropout layers to help?
What I mean is: if we consider a layer i followed by a dropout layer i + 1 that drops more than 50% of the activations, could we launch two trainings of the subsequent layers in parallel, since SGD would update different sets of weights in the i-th layer?
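On the first question above (one model per GPU), here is a rough sketch of the idea, assuming a TensorFlow backend and a hypothetical train.py that trains a single model and saves its test-set predictions (the script name and file names are made up). With Theano you would pin the device via THEANO_FLAGS (e.g. device=gpu0) instead of CUDA_VISIBLE_DEVICES:

import os
import subprocess
import numpy as np

# Launch one training process per GPU; each process only sees its assigned device
procs = []
for gpu in range(8):
    env = os.environ.copy()
    env['CUDA_VISIBLE_DEVICES'] = str(gpu)  # pin this process to a single GPU
    procs.append(subprocess.Popen(
        ['python', 'train.py', '--out', 'preds_%d.npy' % gpu], env=env))
for p in procs:
    p.wait()

# Ensemble as in lesson 3: average the per-model predictions
preds = np.mean([np.load('preds_%d.npy' % g) for g in range(8)], axis=0)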

It's not that hard (using TensorFlow, though). Check out Transparent Multi-GPU Training on TensorFlow with Keras:

from keras.applications.vgg16 import VGG16
from multi_gpu import make_parallel  # multi_gpu.py comes from the blog post linked above
model = VGG16(include_top=False, weights='imagenet', input_tensor=None, input_shape=(224, 224, 3))
model = make_parallel(model, 8)  # to use 8 GPUs
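For completeness, a hypothetical usage sketch: attach your own classification head before calling make_parallel, then compile and fit as usual. As I understand this approach, each batch is sliced across the replicas, so you also scale batch_size by the number of GPUs (X_train and y_train are placeholders):

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=64 * 8, epochs=10)  # 64 per GPU x 8 GPUs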

I didn't get an 8x improvement though, only 5x-6x with this approach.

Yup, that's easy! Good idea 🙂

Good thinking, although perhaps not necessary: lock-free async SGD (Hogwild!) works pretty well anyway. https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf

Has anyone tried training different models across the different GPUs of a p2.8xlarge using Theano's backend yet? Would you be willing to share the recipe?

Alternatively, I could switch to the TensorFlow backend in anticipation of part 2 of the Deep Learning course. However, when I tried switching to TF as an experiment (installing TF and configuring Keras went fine), the same model ran slower because TensorFlow wasn't using the GPU on the instance. I'm wondering what I was missing about running TensorFlow on AWS.
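As a quick diagnostic for the "TensorFlow isn't using the GPU" problem (assuming the TF 1.x of that era), you can list the devices TensorFlow actually sees:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())  # should include a GPU device, not just the CPU

If only the CPU shows up, the usual culprit is having the CPU-only package installed rather than tensorflow-gpu.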


Hi, I'm curious which versions of Keras and TensorFlow you're using. I tried this script, but I can't make it work with more than 2 GPUs (the terminal becomes unresponsive and then dies), and I can't find any clue on the internet. Thanks.

Hi @shy, was this during training or just in the terminal itself? Sometimes Keras will hang during epoch training unless you set verbose=2.
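For reference, verbose=2 is just an argument to fit(); the names below are placeholders:

model.fit(X_train, y_train, epochs=10, batch_size=64, verbose=2)  # one summary line per epoch, no progress bar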

I’ve noticed this behavior using Keras 2, Python 3, and TF. Apparently, some kind of race condition ensues when there is a large number of steps.

Hey,

I've run into memory allocation errors multiple times with TensorFlow on a basic P2 instance.
Currently I'm training on ~7,000 images, which takes 30 minutes per epoch.

I was wondering: how hard is it to use multiple GPUs (8x or 16x)?

Does anyone have experience with it?
@chianti, do you have more experience since January?

Do I really just add one line?
Does that one line solve my memory allocation errors?

Best,
Benedikt
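Regarding the memory allocation errors: one common workaround with a TF 1.x Keras backend is to let TensorFlow allocate GPU memory on demand rather than grabbing it all up front; if the model genuinely doesn't fit, reducing batch_size is still the fix. A minimal sketch:

import tensorflow as tf
from keras import backend as K

# Ask TF to grow GPU memory as needed instead of reserving it all at start-up
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))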