Why is my model taking ~200 seconds per epoch?

This is strange, as my partner-in-crime @prateek2686 and I have almost exactly the same architecture, yet my epochs are roughly 5 times as long as his!

I’ve tried fiddling with the layers, the learning rate, the optimizer type, and the batch size. Nothing I have tried changes the length of the epochs significantly!

Where else can I look for the source of the slowness? Yes, I am using the p2.xlarge (Tesla K80) on AWS…

The code is very straightforward:

# Vgg16, split_at, load_array and get_classes come from the course's
# vgg16.py / utils.py helper modules
from vgg16 import Vgg16
from utils import *

model = Vgg16().model
conv_layers,fc_layers = split_at(model, Convolution2D)
del fc_layers
conv_model = Sequential(conv_layers)

# Using the name 'pafs' -- predictions / activations / features -- however you want to look at it
conv_pafs = load_array(path + 'conv_pafs.bc')
val_pafs = load_array(path + 'val_pafs.bc')

(val_classes, trn_classes, val_labels, trn_labels, 
    val_filenames, filenames, test_filenames) = get_classes(path)

def get_fc_model():
    model = Sequential([
        Dense(4096, activation='relu'),
        Dense(4096, activation='relu'),
        Dense(2, activation='softmax'),
    ])

    model.compile(optimizer=RMSprop(lr=0.0001, rho=0.7),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model

fc_model = get_fc_model()

fc_model.fit(conv_pafs, trn_labels, nb_epoch=8, 
             batch_size=batch_size, validation_data=(val_pafs, val_labels))

Figured it out – it turns out that if you throw in a max pooling layer at the beginning of the fully connected model, such as
MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
the model runs a lot quicker.

This hammers home the idea that, aside from promoting translation invariance, max pooling really reduces computational demands (each spatial dimension is halved, so there is far less data to contend with).
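To see the size reduction concretely, here is a small NumPy sketch (not the Keras layer itself) of what a 2x2, stride-2 max pool does to VGG16's final conv features, which are 512 x 14 x 14 for a 224x224 input:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2, stride-2 max pooling over a (channels, height, width) array."""
    c, h, w = x.shape
    # Trim odd edges so the array splits evenly into 2x2 windows.
    x = x[:, :h // 2 * 2, :w // 2 * 2]
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

# VGG16's last conv layer outputs 512 feature maps of 14 x 14.
features = np.random.rand(512, 14, 14)
pooled = max_pool_2x2(features)
print(features.size, pooled.size)  # 100352 25088
```

Each spatial dimension is halved, so the first Dense layer sees a quarter of the inputs, and its weight matrix (the bulk of the computation here) shrinks by the same factor.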


Your main problem is that you have a batchnorm layer at the start that’s operating on the output of a convolutional layer, but you forgot to add the ‘axis=1’ parameter. If you add that, you’ll find it runs faster and is more accurate, and your max pooling layer isn’t as necessary.
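For anyone wondering what ‘axis=1’ changes here: with Theano’s channels-first layout, axis 1 is the channel axis, so batchnorm keeps one mean/variance pair per feature map instead of one per individual activation. A toy NumPy sketch of those per-channel statistics (this mirrors the idea, not Keras’s actual implementation):

```python
import numpy as np

# A batch of conv outputs in Theano's channels-first layout:
# (batch, channels, height, width)
x = np.random.rand(8, 512, 14, 14)

# axis=1 keeps the channel axis: statistics are computed over the batch
# and both spatial axes, giving one mean/variance pair per feature map.
mean = x.mean(axis=(0, 2, 3), keepdims=True)
var = x.var(axis=(0, 2, 3), keepdims=True)
normed = (x - mean) / np.sqrt(var + 1e-5)

print(mean.squeeze().shape)  # (512,) -- one statistic per channel
```

Without axis=1, batchnorm would treat every one of the 512 x 14 x 14 positions as a separate feature, which is both slower and statistically wrong for conv outputs.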


I hadn’t questioned the speed of my GPU instance until I read this thread. What ballpark time should I expect one epoch to take with the lesson1 example “out of the box”? My Tesla K80 appears to be up and running on my p2 instance. I double-checked that my .theanorc is set to GPU, and one epoch on the training set is taking > 600 seconds. Should I be concerned about my configuration?

@icanseeformiles That time seems reasonable to me. Note that for most of the course, we’ll be using pre-computed features, so the epoch times will usually be much faster (5-10 seconds).

I was facing similar issues while training and realised that cnmem was disabled by default in the setup that I think most of the class is using. cnmem controls roughly what percentage of GPU memory your script is allowed to pre-allocate while running.
I modified the .theanorc file and added the following line under the [lib] section:
cnmem = 0.9

If you’re getting memory errors you can always reduce it to a lower number. With this technique I’m seeing some improvement in my training times. You can try this and let us know.
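For context, a complete .theanorc for a GPU setup like the course’s typically looks something like the fragment below (the exact device name depends on your install; treat the values as illustrative, not the course’s canonical config):

```
[global]
device = gpu
floatX = float32

[lib]
cnmem = 0.9
```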


I am using the onboard NVIDIA GPU of my laptop. I checked that Theano and Keras switch from CPU to GPU, and they work perfectly when used from bash. However, with the lesson 1 notebook, a training epoch is taking approximately 6 hours. I think the GPU is not being invoked. How do I invoke the local GPU while using the notebook?

Make sure you start your notebook from Git Bash, @tapashettisr

I am doing that. Should I first set the THEANO_FLAGS = THEANO_FLAGS_GPU before opening the notebook?

Further, will there be any improvement if we use THEANO_FLAGS = THEANO_FLAGS_GPU_DNN?

I’ve figured out what each argument to Theano means, and kind of run a custom command, but I started with the DNN variant from the guide you used. Watch the cnmem argument; you might have to reduce the value if you get memory errors.

Show me your notebook.

The GPU is working now with the notebook. The problem was that I was setting THEANO_FLAGS and starting the notebook from different bash terminals.
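For anyone else who hits this: an exported environment variable is only inherited by processes launched from that same shell, so the flags and the notebook launch must happen in one terminal. A minimal illustrative pattern (the flag values are examples, not the course script’s exact contents):

```shell
# Export the flags in the SAME shell session that will launch the notebook;
# a child process only inherits variables set before it starts.
export THEANO_FLAGS='device=gpu,floatX=float32'
echo "$THEANO_FLAGS"     # sanity check before launching
# jupyter notebook       # then launch from this same shell
```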
