Fine-tuning VGG taking very long

This part didn’t really make sense to me because the finetune method in the vgg class sets layer.trainable = False for all but the newly added dense layer.

I understand I need to compute the conv layer outputs in order to use these as inputs into my dense layers but I’m not sure how that’ll speed things up? Is the idea that I can store those conv outputs so that when I do another epoch, I’m not recalculating the conv outputs, I’m just updating the dense layer - and is the implication that it’ll take as long for the first epoch to calculate the conv outputs for each image but then the actual fine-tuning will be fast?

thanks again :smiley:

The video starts at the relevant section. Only need to watch for a few minutes.

“…because the calculations for the convolutional layers takes nearly all the time…”

Basically, exactly as you said. Don’t have to spend any time calculating the conv layers, but instead just precalculate their output, which will always be the same for each image, and feed that directly to the dense layers that you can then train.
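As a toy illustration of the idea (not the course code), the sketch below counts how often an expensive frozen function runs with and without caching. `make_counter_conv`, `frozen_conv` and `train_epochs` are hypothetical stand-ins for the frozen conv layers and the dense-layer training loop.

```python
def make_counter_conv():
    """Return a stand-in for the frozen conv layers plus a call counter."""
    calls = {"n": 0}

    def frozen_conv(image):
        # Frozen weights: the same image always gives the same features.
        calls["n"] += 1
        return (sum(image), max(image))

    return frozen_conv, calls


def train_epochs(images, conv, n_epochs, precompute):
    """Run n_epochs over images; optionally compute the conv features once."""
    cached = [conv(img) for img in images] if precompute else None
    for _ in range(n_epochs):
        for i, img in enumerate(images):
            features = cached[i] if precompute else conv(img)
            _ = features  # the trainable dense layers would be updated here


images = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

conv, calls = make_counter_conv()
train_epochs(images, conv, n_epochs=5, precompute=False)
calls_without_cache = calls["n"]  # one conv call per image per epoch

conv, calls = make_counter_conv()
train_epochs(images, conv, n_epochs=5, precompute=True)
calls_with_cache = calls["n"]  # one conv call per image in total
```

The precompute pass still costs one full run through the conv layers, so the savings only show up from the second epoch onward.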

Two things that you can do that I would do first:

  1. While the model is training, run nvidia-smi and check that the process shows up (to make sure your GPU is being used)
  2. Increase the batch size as much as you can to speed up training.
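If you'd rather do the first check from inside a notebook, a minimal sketch could look like the following. It assumes Python 3 and that `nvidia-smi` is on the PATH; `gpu_report` is a made-up helper name, and it simply returns whatever plain `nvidia-smi` prints (including its process table).

```python
import shutil
import subprocess


def gpu_report():
    """Return the text output of nvidia-smi, or None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # NVIDIA tools not installed on this machine
    result = subprocess.run(
        ["nvidia-smi"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return None  # tool present but could not talk to the driver
    return result.stdout


report = gpu_report()
print(report if report is not None else "nvidia-smi not available")
```

Run it in a second cell or terminal while training is in progress and look for your Python process in the list at the bottom of the output.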

Tried increasing my batch size to 128 and it goes through the fitting process (see my OP), but after waiting 4500 seconds, right as the ETA gets to 1s, I get the following error:

Epoch 1/1
188288/188348 [============================>.] - ETA: 1s - loss: 0.4871 - acc: 0.8124

MemoryError                               Traceback (most recent call last)
<ipython-input-23-407aba095b25> in <module>()
  1 model_run+=1
----> 2, val_batches, nb_epoch=1)
MemoryError: Error allocating 3288334336 bytes of device memory (out of memory).
Apply node that caused the error: GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0)
Toposort index: 141
Inputs types: [TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar)]
Inputs shapes: [(), (), (), ()]
Inputs strides: [(), (), (), ()]
Inputs values: [array(256), array(64), array(224), array(224)]
Outputs clients: [[GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

:sob: :sob: :sob:

Hmm, I’m pretty much certain that it didn’t have anything to do with the test set, but rather when it attempted to cross-validate, something may have happened. The model always says 1s left when it is running through the cross validation data. Maybe there was an error there?

If nothing else, you’ve learned the hard way why you should always try on a sample first


Considering you have many more images than the cats-vs-dogs dataset, the precomputed data will be large, so you are likely to run into errors such as OOM or kernel death during training. In that case, you might want to use model.fit_generator() rather than the approach in the lesson7 notebook. Something like

model.fit_generator(ImageDataGenerator().flow(conv_feat, trn_labels, batch_size=batch_size),
                    steps_per_epoch=len(conv_feat)//batch_size, epochs=ep,
                    validation_data=(conv_val_feat, val_labels))

By the way, I use batch_size=32 when running code on a cloud platform (FloydHub). A larger batch_size tends to result in a memory error.

Thanks for your reply - will look into the generator approach. So many things to try! :grin:

I also settled on a batch size of 32 but I’m still confused:

When you say a larger batch size results in a memory error, do you mean GPU memory or RAM?

If I try to use a batch size of, say, 128, the model fits and right at the end I get this memory error:

MemoryError: Error allocating 3288334336 bytes of device memory (out of memory).

Now please point out if my math is wrong but isn’t 3288334336 bytes = 3288 Megabytes ? And isn’t that much less than the 12 GB of GPU memory on the P2 instances? What’s going on?
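For reference, the conversion can be checked in a couple of lines. One thing worth keeping in mind: the number in the error is the size of the single allocation that failed, not the total in use, so it comes on top of whatever the model already holds on the GPU.

```python
failed_request = 3288334336  # bytes, copied from the MemoryError message

megabytes = failed_request / 1000.0 ** 2  # decimal MB
mebibytes = failed_request / 1024.0 ** 2  # MiB
gibibytes = failed_request / 1024.0 ** 3  # GiB

print("%.0f MB, %.0f MiB, %.2f GiB" % (megabytes, mebibytes, gibibytes))
```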

We use GPU memory to store data such as the activations at every layer of the convnet and the weights (and their gradients).

Here is a related course note of CS231n:

According to this note, VGG16 requires roughly 93MB memory per image at test time (double at training time) to store activations, and another 183*3 MB for parameters. Does this mean that using batch size 64/128 training a VGG model will deplete the 12GB GPU memory?
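A rough way to answer that question is to plug the quoted figures into a small calculation. This is only a back-of-the-envelope sketch: the 93 MB and 183*3 MB values are taken from the post above, `estimated_mb` is a made-up helper, and counting the card as 12 * 1024 MB is an assumption.

```python
ACTIVATIONS_MB_PER_IMAGE = 93  # forward pass only, figure quoted from the note
PARAMS_MB = 183 * 3            # parameter term as quoted in the post above
GPU_MB = 12 * 1024             # assumption: 12 GB card, counted as 12 * 1024 MB


def estimated_mb(batch_size, training=True):
    """Rough GPU memory estimate in MB for one batch."""
    per_image = ACTIVATIONS_MB_PER_IMAGE * (2 if training else 1)
    return batch_size * per_image + PARAMS_MB


estimates = {bs: estimated_mb(bs) for bs in (32, 64, 128)}
for bs in sorted(estimates):
    verdict = "fits" if estimates[bs] <= GPU_MB else "exceeds 12 GB"
    print("batch %3d: ~%5d MB (%s)" % (bs, estimates[bs], verdict))
```

By this estimate a batch of 32 fits comfortably, while 64 already lands slightly above 12 GB when the training-time doubling is applied. With most layers frozen, the doubling likely overstates the real requirement, so treat these as upper-bound guesses.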

Interesting - I remember seeing that in the CS231n notes…

I’m using a batch size of 32 so that’s just 32*93 + 183*3 = ~3500 MB <<< 12 GB so not sure what’s going on.

In related news, I’m trying to precompute the convolutional layers and running into memory issues again. Copy-pasted from the statefarm code, I’m getting a memory error on this line pretty much as I run it (runs fine on 10% sample):

conv_feat = conv_model.predict_generator(batches, batches.nb_sample)


MemoryError                               Traceback (most recent call last)
<ipython-input-21-87526e7d2796> in <module>()
----> 1 conv_feat = conv_model.predict_generator(batches, batches.nb_sample)
      2 conv_val_feat = conv_model.predict_generator(val_batches, val_batches.nb_sample)
      3 conv_test_feat = conv_model.predict_generator(test_batches, test_batches.nb_sample)

/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/models.pyc in predict_generator(self, generator, val_samples, max_q_size, nb_worker, pickle_safe)
    943                                             max_q_size=max_q_size,
    944                                             nb_worker=nb_worker,
--> 945                                             pickle_safe=pickle_safe)
    947     def get_config(self):

/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.pyc in predict_generator(self, generator, val_samples, max_q_size, nb_worker, pickle_safe)
   1650                 for out in outs:
   1651                     shape = (val_samples,) + out.shape[1:]
-> 1652                     all_outs.append(np.zeros(shape, dtype=K.floatx()))
   1654             for i, out in enumerate(outs):


Calling batches.nb_sample gives 188348, which I know is high, but isn’t predict_generator supposed to iterate through that in batches? (My batch size is 32, the same as for the sample run where I didn’t get a memory error.)
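The traceback above fails on `np.zeros(shape, ...)` with `shape = (val_samples,) + out.shape[1:]`, which suggests the whole output array is allocated in RAM up front, even though inputs are read in batches. A quick estimate of that array's size, assuming (this is not stated in the post) that conv_model ends at VGG16's last conv layer with a per-image output of (512, 14, 14) and that K.floatx() is float32:

```python
n_samples = 188348             # batches.nb_sample from the post
feature_shape = (512, 14, 14)  # assumption: per-image output of the last conv layer
bytes_per_value = 4            # assumption: K.floatx() is float32

values_per_image = 1
for dim in feature_shape:
    values_per_image *= dim

bytes_per_image = values_per_image * bytes_per_value
total_bytes = n_samples * bytes_per_image
total_gib = total_bytes / 1024.0 ** 3
sample_gib = (n_samples // 10) * bytes_per_image / 1024.0 ** 3

print("full set: %.1f GiB, 10%% sample: %.1f GiB" % (total_gib, sample_gib))
```

Under those assumptions the full set needs roughly ten times the RAM of the 10% sample, which would explain why the sample runs and the full dataset fails immediately.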

Not sure what’s going on…

Hi @markovbling,

I remember encountering the same error.

2 things to check:

  1. Is the gpu memory insufficient for extracting conv features?
  2. Is your system memory insufficient to store the conv features in RAM? The cats and dogs dataset took almost 50 GB of RAM to store the conv features.

You may want to check these.

I’m confused though, aren’t the conv features being computed 1 batch at a time?

If it works on a sample, shouldn’t it work on the entire dataset or am I missing something?

They are indeed being calculated one batch at a time. However, if the intermediate results are larger than the system RAM, a memory error is likely to occur. Could you check how much RAM the output of one batch occupies? That might help project how much RAM the 188348 records will take.

Ah ok so you mean each batch is processed individually but the results of all batches need to be stored in RAM and that may be the issue?

Think that makes sense :blush:

I suppose I can just iterate through chunks at a time like break the dataset into 10 pieces then process each piece batch-wise and then combine the set of all convolutional activations at the end (since then only 1/10 of the results will ever be in RAM at once)…

“Ah ok so you mean each batch is processed individually but the results of all batches need to be stored in RAM and that may be the issue?”

Results of a batch meaning the predicted, one-hot encoded values? These should be small. I believe they are used during the batch for SGD and then thrown away (i.e. available to be reclaimed by the Python interpreter).

“I suppose I can just iterate through chunks at a time like break the dataset into 10 pieces then process each piece batch-wise”

That doesn’t sound right to me. Using generators and SGD with batches is essentially handling the memory management for you. You’re using batches so that you don’t run out of memory while training.

My understanding of things is

  • When using generators and model.fit_generator() all your training cases DO NOT need to fit in memory (GPU or system memory). Data from an entire training batch DOES need to fit in memory.
  • All the weights of your model do need to fit in GPU memory at the time of training.
  • With generators you can keep your batch size fixed but include more training cases and not use up more memory (GPU or CPU) when training your models.
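The points above can be illustrated with a plain Python generator. This is only a sketch of the pattern, not Keras's implementation; `batch_generator` and the file names are hypothetical.

```python
def batch_generator(items, batch_size):
    """Yield successive batches forever, the way Keras-style generators loop."""
    while True:
        for start in range(0, len(items), batch_size):
            # Only this slice is materialised; a real generator would
            # load and preprocess the images for these items here.
            yield items[start:start + batch_size]


filenames = ["img_%03d.jpg" % i for i in range(10)]  # hypothetical file names
gen = batch_generator(filenames, batch_size=4)
first_epoch = [next(gen) for _ in range(3)]  # three batches cover all 10 items
wrapped = next(gen)                          # the generator starts over
```

Adding more file names makes an epoch longer but does not change how much is held in memory at any one time, since that depends only on batch_size.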

Please correct me if I’ve got any of this wrong.

If you are using batches then I don’t think the result needs to be stored. That is the whole point of batches. We can scale to any level on the machine. Will just take more time.

If you are using Linux then you can install a handy utility called htop. It lets you see the memory/CPU usage.

Before you start the training, run htop and see whether the memory is already being consumed. Maybe your memory is already being used by some object which you created earlier in your notebook? As your process runs, see if there is anything unusual with memory usage - spikes maybe? That may help correlate the changes in memory with your code execution.

Looking at the documentation seems the problem is with the predict_generator method. This method is for generating predictions from input (which are generated from batches). So the input is in batches but the output is not.

I believe you need to use the predict method which takes batches as input and gives batches as output.

That would explain why it worked on samples but not the whole data set.


The thing that you will have to change is probably the saving and loading the array part. Jeremy was always loading the whole array but you will have to find a way to save and load that in batches instead of the whole thing at a time.
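One possible way to save and load in batches is a memory-mapped .npy file. The sketch below is an assumption about how this could be done with NumPy, not what the course notebooks do; `predict_to_disk` and `feature_batches` are made-up names, and `predict_fn` stands in for something like a model's per-batch predict call.

```python
import os
import tempfile

import numpy as np


def predict_to_disk(predict_fn, batches, n_samples, out_path):
    """Write predict_fn(batch) for each batch into an on-disk .npy array."""
    out = None
    row = 0
    for batch in batches:
        feats = np.asarray(predict_fn(batch))
        if out is None:
            # Create the file once the per-sample feature shape is known.
            out = np.lib.format.open_memmap(
                out_path, mode="w+", dtype=feats.dtype,
                shape=(n_samples,) + feats.shape[1:])
        out[row:row + len(feats)] = feats
        row += len(feats)
    if out is not None:
        out.flush()
    return row


def feature_batches(path, batch_size):
    """Yield one pass of batches read lazily from the saved array."""
    feats = np.load(path, mmap_mode="r")
    for start in range(0, len(feats), batch_size):
        yield np.array(feats[start:start + batch_size])


# Tiny demo with a fake "conv model" that just doubles its input.
data = np.arange(40, dtype=np.float32).reshape(10, 4)
data_batches = (data[i:i + 3] for i in range(0, 10, 3))
path = os.path.join(tempfile.mkdtemp(), "conv_feat.npy")
n_written = predict_to_disk(lambda x: x * 2, data_batches, len(data), path)
loaded = np.concatenate(list(feature_batches(path, batch_size=4)))
```

Only one batch of features is held in RAM while writing, and reading back through `feature_batches` keeps it that way during training. The file on disk will still be as large as the full feature array.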

Yeah I’m already using htop and the issue seems to be that the output of all the batches is being stored in memory instead of being returned batch-by-batch.

Will work on breaking up the calc and see if that helps…

Seems like predict_generator would be written for the explicit purpose of not eating up all your memory. I haven’t looked at the implementation but it should be something like

  • load a batch of test cases into memory on the GPU.
  • make and store predictions on the entire batch (predictions are stored in a relatively small numpy array)
  • remove all references to the batch allowing the python garbage collector to reclaim the used memory.

Remember that when you create an ImageDataGenerator in Keras you determine the batch_size and therefore help control the GPU and system memory used during the call to predict_generator.

predict takes an entire “batch” in the sense that it runs predict on everything you give it all at once (think of batch processing in computer science).

Any chance there’s something wrong with how you installed cuDNN?

Watch this part of the section 10 video if you’re still having issues.

EDIT: Looks like there are some embedding issues while Part 2 is still unofficial. It can be viewed directly on YouTube, timestamp is 57:26

Awesome, thanks! Will give it a go and report back…

I just spent a few days messing with notebooks in part1 and part2 on Google Compute Cloud with both Tensorflow and Theano as backends. I was playing with a bunch of settings trying to speed things up. Any chance you doubled the batch_size for the validation data? This is what the notebook does and I replicated it when writing my notebook.

batches = vgg.get_batches(train_path, batch_size=batch_size)
val_batches = vgg.get_batches(valid_path, batch_size=batch_size*2)

With Theano as a backend when I’d get to the very end of one epoch and Keras/Theano started processing the validation set I’d get an out of memory error. With Tensorflow it blew up immediately.

With Theano I was able to get to a batch_size of 128 on a 12 GB Tesla. Doubling that blows things up.
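The traceback earlier in the thread looks consistent with this. Reading its "Inputs values" as a (batch, channels, height, width) shape is an interpretation rather than something the error states, but the product matches the failed allocation exactly, and the leading 256 is twice the training batch size of 128:

```python
alloc_shape = (256, 64, 224, 224)  # "Inputs values" from the traceback
bytes_per_float32 = 4              # precision='float32' in the traceback

n_bytes = bytes_per_float32
for dim in alloc_shape:
    n_bytes *= dim

failed_request = 3288334336        # size reported in the MemoryError
train_batch_size = 128             # batch size used in the failing run

print(n_bytes == failed_request, alloc_shape[0] == 2 * train_batch_size)
```

If that reading is right, the failing buffer is a conv layer output with 64 channels for a validation batch of 256 images, so keeping the validation batch size equal to the training batch size is worth trying first.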
