Fine-tuning VGG taking very long

Hi everyone,

I’m trying to fine-tune the VGG model to a new binary classification task (like cats vs dogs) and it’s taking very long, to the point where I’m worried my setup is broken or weird. I used the p2 setup script and haven’t changed anything on the instance.

I do have quite a lot more data than cats & dogs, and a rough back-of-the-envelope comparison with Jeremy’s cats v dogs suggests the ETA is correct. Jeremy’s notebook showed around 600 seconds to process around 20k images, which is roughly 33 images per second. At that rate 180k images should take about 5,400 seconds, which is approximately my ETA :frowning:
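
The estimate above can be written out as a small calculation. This is only a sketch: the 600 s and 20k figures are the rough numbers quoted from the notebook, 188348 is the image count shown in the progress bar below, and the function name is made up for illustration. It assumes time scales linearly with the number of images.

```python
def estimate_epoch_seconds(ref_seconds, ref_images, n_images):
    """Scale a reference epoch time linearly with the number of images."""
    images_per_second = float(ref_images) / ref_seconds
    return n_images / images_per_second

# Reference run: about 600 s for about 20k images.
rate = 20000.0 / 600
eta = estimate_epoch_seconds(600, 20000, 188348)

print("throughput: %.1f images/s" % rate)
print("estimated epoch time: %.0f s" % eta)
```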

Epoch 1/1
  2304/188348 [..............................] - ETA: 4502s - loss: 0.8955 - acc: 0.6970

If there’s nothing wrong with my setup and this is just how long it takes, does anyone have any suggestions to speed things up? I have already pre-processed my images to 225x255 - can I edit something in the vgg class to skip a resizing step? Do I need to use multiple GPUs to speed things up from here?

Thanks in advance :blush:

Have you tried “precomputing output features” as in the lesson7 notebook?

I’m also open to fine-tuning a different model - (I know VGG is a lot of layers) but any pointers of where to start would be much appreciated :sweat_smile:

When you are fine-tuning, you will be fine-tuning all the conv layers as well, which is a LOT of calculations. These conv layers recognize shapes, patterns, etc., so they are very general. You very likely only need to train the dense layers.

I’m not sure how far you are in Part 1, but in Lesson 4, Jeremy goes through this process. You can precompute the outputs from the conv layers and use these as inputs into your dense layers. Then, you just train the dense layers before connecting the models back together.

I remember seeing this - will give it a go, thanks!

This part didn’t really make sense to me because the finetune method in the vgg class sets layer.trainable = False for all but the newly added dense layer.

I understand I need to compute the conv layer outputs in order to use these as inputs into my dense layers but I’m not sure how that’ll speed things up? Is the idea that I can store those conv outputs so that when I do another epoch, I’m not recalculating the conv outputs, I’m just updating the dense layer - and is the implication that it’ll take as long for the first epoch to calculate the conv outputs for each image but then the actual fine-tuning will be fast?

thanks again :smiley:

The video starts at the relevant section. Only need to watch for a few minutes.

“…because the calculations for the convolutional layers takes nearly all the time…”

Basically, exactly as you said. Don’t have to spend any time calculating the conv layers, but instead just precalculate their output, which will always be the same for each image, and feed that directly to the dense layers that you can then train.
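
To make the idea concrete, here is a toy NumPy sketch. It is not the course code: the fixed random projection stands in for the frozen conv layers, the logistic-regression head stands in for the dense layers, and every name in it is hypothetical. The point is only that the expensive frozen part runs once, and each later epoch touches nothing but the cached features.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy stand-in for the frozen conv layers: a fixed random projection + ReLU.
W_frozen = rng.randn(64, 16) / np.sqrt(64)

def frozen_features(x):
    return np.maximum(x.dot(W_frozen), 0.0)

# Fake "images" and binary labels.
X = rng.randn(200, 64)
y = (X[:, 0] > 0).astype(np.float64)

# Expensive step, done once: compute and cache the frozen-layer outputs.
feats = frozen_features(X)

# Cheap step, repeated every epoch: train only the small head on the cache.
w = np.zeros(feats.shape[1])
b = 0.0
lr = 0.2
losses = []
for epoch in range(50):
    p = 1.0 / (1.0 + np.exp(-(feats.dot(w) + b)))
    losses.append(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))
    grad = p - y
    w -= lr * feats.T.dot(grad) / len(y)
    b -= lr * grad.mean()

print("first loss %.4f, last loss %.4f" % (losses[0], losses[-1]))
```

In the real setting the first pass over the images still costs about as much as one ordinary epoch, since every image has to go through the conv layers once. The saving comes from every epoch after that.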

Two things that you can do that I would do first:

  1. While the model is training, do nvidia-smi and see if the process shows up (to make sure your GPU is being used)
  2. Increase the batch size as much as you can to speed up training.

Tried increasing my batch size to 128 and it goes through the fitting process (see my OP) but now having waited 4500 seconds, right as it gets to 1s I get the following error:

Epoch 1/1
188288/188348 [============================>.] - ETA: 1s - loss: 0.4871 - acc: 0.8124

MemoryError                               Traceback (most recent call last)
<ipython-input-23-407aba095b25> in <module>()
  1 model_run+=1
----> 2, val_batches, nb_epoch=1)
MemoryError: Error allocating 3288334336 bytes of device memory (out of memory).
Apply node that caused the error: GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0)
Toposort index: 141
Inputs types: [TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar)]
Inputs shapes: [(), (), (), ()]
Inputs strides: [(), (), (), ()]
Inputs values: [array(256), array(64), array(224), array(224)]
Outputs clients: [[GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

:sob: :sob: :sob:

Hmm, I’m pretty much certain that it didn’t have anything to do with the test set; rather, something may have happened when it attempted to validate. The model always says 1s left while it is running through the validation data. Maybe there was an error there?

If nothing else, you’ve learned the hard way why you should always try on a sample first


Considering you have many more images than the cats-vs-dogs dataset, the precomputed data will be large, so you are likely to encounter errors such as OOM or kernel death during training. In this case, you might want to use model.fit_generator() instead of the approach in the lesson7 notebook. Something like

model.fit_generator(ImageDataGenerator().flow(conv_feat, trn_labels, batch_size=batch_size),
                    steps_per_epoch=len(conv_feat)//batch_size , epochs=ep, 
                    validation_data=(conv_val_feat, val_labels))

By the way, I use batch_size=32 when running code on a cloud platform (floydhub). A larger batch_size tends to result in a memory error.

Thanks for your reply - will look into the generator approach. So many things to try! :grin:

I also settled on a batch size of 32 but I’m still confused:

When you say a larger batch size results in a memory error, do you mean GPU memory or RAM?

If I try to use a batch size of, say, 128, the model fits and right at the end I get this memory error:

MemoryError: Error allocating 3288334336 bytes of device memory (out of memory).

Now please point out if my math is wrong but isn’t 3288334336 bytes = 3288 Megabytes ? And isn’t that much less than the 12 GB of GPU memory on the P2 instances? What’s going on?
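
One thing worth checking here: the number in the error is the size of a single allocation that failed, not the total memory in use, so it can fail on a 12 GB card if most of the memory is already taken. The figure also matches the four values on the "Inputs values" line of the traceback exactly. Reading those values as (batch, filters, height, width) of a float32 tensor is an assumption, but it fits the output of the first VGG conv block and the precision='float32' in the same traceback.

```python
# Shape values copied from the "Inputs values" line of the traceback.
# ASSUMPTION: they mean (batch, filters, height, width).
shape = (256, 64, 224, 224)
bytes_per_float32 = 4

n_bytes = bytes_per_float32
for dim in shape:
    n_bytes *= dim

print(n_bytes)                                   # 3288334336
print("%.2f GiB" % (n_bytes / float(1024 ** 3)))
```

If that reading is right, a batch of 256 images was going through the network when the error hit, not 128. That would fit a validation generator created with a larger batch size than the training one, though the truncated traceback does not show this, so treat it as a guess to verify.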

We use GPU memory to store data like the activations at every layer of the convnet and the weights (and their gradients).

Here is a related course note of CS231n:

According to this note, VGG16 requires roughly 93MB memory per image at test time (double at training time) to store activations, and another 183*3 MB for parameters. Does this mean that using batch size 64/128 training a VGG model will deplete the 12GB GPU memory?
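
Plugging the figures quoted above into a quick calculation gives a rough answer. This is a back-of-the-envelope sketch only: it uses the 93 MB and 183*3 MB numbers exactly as stated, doubles the activations for training, treats 12 GB as 12,000 MB, and ignores framework overhead and convolution workspace. The function and constant names are made up for illustration.

```python
ACTIVATIONS_MB_PER_IMAGE = 93   # forward pass only, as quoted above
PARAMS_MB = 183 * 3             # parameter figure, as quoted above
GPU_MB = 12 * 1000              # 12 GB card, in decimal MB

def estimate_training_mb(batch_size):
    # Activations are roughly doubled at training time (kept for backprop).
    return batch_size * ACTIVATIONS_MB_PER_IMAGE * 2 + PARAMS_MB

for bs in (32, 64, 128):
    mb = estimate_training_mb(bs)
    verdict = "fits" if mb < GPU_MB else "does not fit"
    print("batch %3d: about %5d MB, %s in 12 GB" % (bs, mb, verdict))
```

By this estimate a batch size of 32 is comfortable, 64 is borderline, and 128 is well over the limit during training.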

Interesting - I remember seeing that in the CS231n notes…

I’m using a batch size of 32 so that’s just 32*93 + 183*3 = ~3500 MB <<< 12 GB so not sure what’s going on.

In related news, I’m trying to precompute the convolutional layers and running into memory issues again. Copy-pasted from the statefarm code, I’m getting a memory error on this line pretty much as I run it (runs fine on 10% sample):

conv_feat = conv_model.predict_generator(batches, batches.nb_sample)


MemoryError                               Traceback (most recent call last)
<ipython-input-21-87526e7d2796> in <module>()
----> 1 conv_feat = conv_model.predict_generator(batches, batches.nb_sample)
      2 conv_val_feat = conv_model.predict_generator(val_batches, val_batches.nb_sample)
      3 conv_test_feat = conv_model.predict_generator(test_batches, test_batches.nb_sample)

/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/models.pyc in predict_generator(self, generator, val_samples, max_q_size, nb_worker, pickle_safe)
    943                                             max_q_size=max_q_size,
    944                                             nb_worker=nb_worker,
--> 945                                             pickle_safe=pickle_safe)
    947     def get_config(self):

/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.pyc in predict_generator(self, generator, val_samples, max_q_size, nb_worker, pickle_safe)
   1650                 for out in outs:
   1651                     shape = (val_samples,) + out.shape[1:]
-> 1652                     all_outs.append(np.zeros(shape, dtype=K.floatx()))
   1654             for i, out in enumerate(outs):


Calling batches.nb_sample gives 188348, which I know is high, but isn’t predict_generator supposed to iterate through that in batches? (My batch size is 32, the same as it was during the sample run where I didn’t have a memory error.)

Not sure what’s going on…

Hi @markovbling,

I remember encountering the same error.

2 things to check:

  1. Is the gpu memory insufficient for extracting conv features?
  2. Is your system memory insufficient to store the conv features in RAM? The cats and dogs dataset took almost 50 GB of RAM to store the conv features.

You may want to check these.

I’m confused though, aren’t the conv features being computed 1 batch at a time?

If it works on a sample, shouldn’t it work on the entire dataset or am I missing something?

They are indeed being calculated one batch at a time. However, if the intermediate result size is greater than the system RAM, the memory error is likely to occur. Could you check how much RAM one batch’s output occupies? That might help project the amount of RAM the 188348 records might take.
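
A rough projection can be done without running the model. The traceback above shows predict_generator allocating np.zeros((val_samples,) + out.shape[1:]), i.e. one array for the entire output, up front. The feature shape used below is an assumption (the output of VGG16's last conv layer for 224x224 input, before the final pooling), as is float32 for K.floatx(); substitute conv_model.output_shape[1:] from the actual model.

```python
import numpy as np

n_samples = 188348            # batches.nb_sample from the post above
feat_shape = (512, 14, 14)    # ASSUMPTION: last conv layer output at 224x224
itemsize = np.dtype("float32").itemsize

per_image = int(np.prod(feat_shape)) * itemsize
total = per_image * n_samples

print("per image:       %.1f KB" % (per_image / 1e3))
print("per batch of 32: %.1f MB" % (32 * per_image / 1e6))
print("all %d images: %.1f GB" % (n_samples, total / 1e9))
```

Under those assumptions the full output is on the order of 75 GB, which would explain the error appearing almost immediately on the full dataset while a 10% sample (roughly a tenth of that) goes through.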

Ah ok so you mean each batch is processed individually but the results of all batches need to be stored in RAM and that may be the issue?

Think that makes sense :blush:

I suppose I can just iterate through chunks at a time like break the dataset into 10 pieces then process each piece batch-wise and then combine the set of all convolutional activations at the end (since then only 1/10 of the results will ever be in RAM at once)…

Ah ok so you mean each batch is processed individually but the results of all batches need to be stored in RAM and that may be the issue?

Results of a batch meaning the predicted, one hot encoded values? This should be small. I believe these are used during the batch for SGD and then thrown away (i.e. available to be reclaimed by the python interpreter).

I suppose I can just iterate through chunks at a time like break the dataset into 10 pieces then process each piece batch-wise

That doesn’t sound right to me. Using generators and SGD with batches is essentially handling the memory management for you. You’re using batches so that you don’t run out of memory while training.

My understanding of things is

  • When using generators and model.fit_generator() all your training cases DO NOT need to fit in memory (GPU or system memory). Data from an entire training batch DOES need to fit in memory.
  • All the weights of your model do need to fit in GPU memory at the time of training.
  • With generators you can keep your batch size fixed but include more training cases and not use up more memory (GPU or CPU) when training your models.

Please correct me if I’ve got any of this wrong.

If you are using batches then I don’t think the result needs to be stored. That is the whole point of batches. We can scale to any level on the machine. Will just take more time.

If you are using Linux then you can install a handy utility called htop. It lets you see the memory/CPU usage.

Before you start the training, run htop and see whether the memory is already being consumed. Maybe your memory is already being used by some object which you created earlier in your notebook? As your process runs, see if there is anything unusual with memory usage - spikes maybe? That may help correlate the changes in memory with your code execution.

Looking at the documentation, it seems the problem is with the predict_generator method. This method is for generating predictions from input (which is generated in batches). So the input is in batches but the output is not: all of it is collected into one array.

I believe you need a method that takes one batch as input and gives that batch’s output, such as predict_on_batch, so that you can save each batch’s result yourself.

That would explain why it worked on samples but not the whole data set.


The thing that you will have to change is probably the saving and loading of the array. Jeremy was always loading the whole array, but you will have to find a way to save and load it in batches instead of the whole thing at once.
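
One possible way to do that, sketched with plain NumPy rather than the course's own helpers: write each batch's features straight into a disk-backed .npy file, then memory-map the file later and read it a batch at a time. fake_conv_predict is a hypothetical stand-in for calling the conv model on one batch, and the tiny shapes are only there so the example runs anywhere.

```python
import os
import tempfile
import numpy as np

def fake_conv_predict(batch):
    # HYPOTHETICAL stand-in for running the conv model on one batch.
    return batch.reshape(len(batch), -1)[:, :8] * 2.0

def batch_ranges(n_samples, batch_size):
    # Yields (start, stop) index pairs that cover all samples.
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_samples, batch_size = 100, 32
images = np.random.rand(n_samples, 3, 4, 4).astype("float32")

path = os.path.join(tempfile.mkdtemp(), "conv_feat.npy")

# Disk-backed array: only the batch being written needs to be in RAM.
out = np.lib.format.open_memmap(
    path, mode="w+", dtype="float32", shape=(n_samples, 8))
for start, stop in batch_ranges(n_samples, batch_size):
    out[start:stop] = fake_conv_predict(images[start:stop])
out.flush()
del out

# Later: memory-map the file and pull out one batch at a time for training.
feats = np.load(path, mmap_mode="r")
first_batch = np.asarray(feats[:batch_size])
print(feats.shape, first_batch.shape)
```

In a real run the images would come from the batch generator instead of an in-memory array, and the dense layers would be trained from a generator that slices the memory-mapped file in the same way.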