I’m trying to fine-tune the VGG model to a new binary classification task (like cats vs dogs) and it’s taking very long, to the point where I’m worried my setup is broken/weird. I used the p2 setup script and haven’t changed anything on the instance.
I do have quite a lot more data than cats & dogs and a rough back-of-the-envelope comparison with Jeremy’s cats v dogs suggests the ETA is correct. Jeremy’s notebook showed around 600 seconds to process around 20k images, which means around 34 images per second, which would imply 180k images should take about 5,300 seconds, which is approximately my ETA.
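The back-of-the-envelope scaling above can be written out as a small sketch. It assumes throughput stays constant as the dataset grows; `eta_seconds` is a hypothetical helper name, and the reference figures are the rounded ones quoted in the post.

```python
def eta_seconds(n_images, ref_images=20000, ref_seconds=600.0):
    """Scale a reference run linearly to a new dataset size."""
    images_per_second = ref_images / ref_seconds  # about 33 images/s
    return n_images / images_per_second

eta = eta_seconds(180000)
print(round(eta))  # about 5400 s with these rounded inputs
```

With these inputs the estimate lands within a couple of percent of the ~5,300 s figure, so the ETA looks consistent with the reference run rather than with a broken setup.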
If there’s nothing wrong with my setup and this is just how long it takes, does anyone have any suggestions to speed things up? I have already pre-processed my images to 225x255 - can I edit something in the vgg class to skip a resizing step? Do I need to use multiple GPUs to speed things up from here?
When you are fine-tuning, you will be fine-tuning all the conv layers as well, which is a LOT of calculations. These conv layers recognize shapes, patterns, etc., so they are very general. You very likely only need to train the dense layers.
I’m not sure how far you are in Part 1, but in Lesson 4, Jeremy goes through this process. You can precompute the outputs from the conv layers and use these as inputs into your dense layers. Then, you just train the dense layers before connecting the models back together.
This part didn’t really make sense to me because the finetune method in the vgg class sets layer.trainable = False for all but the newly added dense layer.
I understand I need to compute the conv layer outputs in order to use these as inputs into my dense layers but I’m not sure how that’ll speed things up? Is the idea that I can store those conv outputs so that when I do another epoch, I’m not recalculating the conv outputs, I’m just updating the dense layer - and is the implication that it’ll take as long for the first epoch to calculate the conv outputs for each image but then the actual fine-tuning will be fast?
The video starts at the relevant section. Only need to watch for a few minutes.
“…because the calculations for the convolutional layers takes nearly all the time…”
Basically, exactly as you said. Don’t have to spend any time calculating the conv layers, but instead just precalculate their output, which will always be the same for each image, and feed that directly to the dense layers that you can then train.
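A toy illustration of that idea, in plain Python rather than Keras: the frozen conv stack is modelled as an expensive but deterministic function, so its output can be computed once per image and reused on every epoch. All names here (`frozen_conv_features`, `train_dense`) are hypothetical stand-ins, not course code.

```python
calls = {"conv": 0}

def frozen_conv_features(image):
    """Stand-in for the frozen conv layers: costly, same output every time."""
    calls["conv"] += 1
    return [sum(image), max(image)]

def train_dense(features, epochs):
    """Stand-in for dense-layer training: visits every cached feature each epoch."""
    steps = 0
    for _ in range(epochs):
        for _feature in features:
            steps += 1
    return steps

images = [[i, i + 1, i + 2] for i in range(100)]
epochs = 5

# Precompute once, then reuse the cached features across all epochs.
cached = [frozen_conv_features(img) for img in images]
dense_steps = train_dense(cached, epochs)
print(calls["conv"], dense_steps)
```

The conv function runs 100 times instead of 500, which matches the expectation in the question: the first pass costs a full forward pass per image, and the epochs after that only pay for the dense layers.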
Tried increasing my batch size to 128 and it goes through the fitting process (see my OP) but now having waited 4500 seconds, right as it gets to 1s I get the following error:
Epoch 1/1
188288/188348 [============================>.] - ETA: 1s - loss: 0.4871 - acc: 0.8124
MemoryError Traceback (most recent call last)
<ipython-input-23-407aba095b25> in <module>()
1 model_run+=1
----> 2 vgg.fit(batches, val_batches, nb_epoch=1)
...
MemoryError: Error allocating 3288334336 bytes of device memory (out of memory).
Apply node that caused the error: GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0)
Toposort index: 141
Inputs types: [TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar)]
Inputs shapes: [(), (), (), ()]
Inputs strides: [(), (), (), ()]
Inputs values: [array(256), array(64), array(224), array(224)]
Outputs clients: [[GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})]]
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
Hmm, I’m pretty much certain that it didn’t have anything to do with the test set, but rather when it attempted to cross-validate, something may have happened. The model always says 1s left when it is running through the cross validation data. Maybe there was an error there?
If nothing else, you’ve learned the hard way why you should always try on a sample first
Considering you have many more images than the cats-vs-dogs dataset, the precomputed data will be large, so you are likely to encounter errors such as OOM or kernel death during training. In this case, you might want to use model.fit_generator() instead of model.fit() in the lesson7 notebook. Something like
Thanks for your reply - will look into the generator approach. So many things to try!
I also settled on a batch size of 32 but I’m still confused:
When you say ‘result memory error’, do you mean GPU memory or RAM?
If I try to use a batch size of, say, 128, the model fits and right at the end I get this memory error:
MemoryError: Error allocating 3288334336 bytes of device memory (out of memory).
Now please point out if my math is wrong but isn’t 3288334336 bytes = 3288 Megabytes ? And isn’t that much less than the 12 GB of GPU memory on the P2 instances? What’s going on?
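A quick sanity check on that number, as a sketch: the four `Inputs values` in the traceback (256, 64, 224, 224), multiplied together at 4 bytes per float32 value, come to exactly the allocation that failed. Reading 256 as a batch size (for example, a validation generator running at twice the training batch size of 128) is an assumption, not something the trace states. `tensor_bytes` is a hypothetical helper.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor with the given shape (float32 by default)."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element

shape = (256, 64, 224, 224)  # "Inputs values" from the traceback
alloc = tensor_bytes(shape)
alloc_gib = alloc / 1024.0 ** 3
print(alloc)      # 3288334336
print(alloc_gib)  # 3.0625
```

So the roughly 3 GiB in the message is the size of the one buffer that could not be allocated, not the total in use. The request fails whenever whatever is already resident on the GPU leaves less than that free.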
According to this note, VGG16 requires roughly 93MB memory per image at test time (double at training time) to store activations, and another 183*3 MB for parameters. Does this mean that using batch size 64/128 training a VGG model will deplete the 12GB GPU memory?
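Taking the figures quoted above at face value, the estimate can be sketched as below. This is a rough model only: it assumes activation memory scales linearly with batch size and doubles at training time, and it ignores framework overhead and any savings from frozen layers. The constant and function names are hypothetical.

```python
ACTIVATION_MB_PER_IMAGE = 93  # test-time figure quoted above
PARAM_MB = 183 * 3            # parameter figure quoted above
GPU_MB = 12 * 1024            # 12 GB card, as on the P2 instance

def training_mb(batch_size):
    """Rough training-time footprint: doubled activations plus parameters."""
    return batch_size * ACTIVATION_MB_PER_IMAGE * 2 + PARAM_MB

for bs in (32, 64, 128):
    need = training_mb(bs)
    verdict = "fits" if need <= GPU_MB else "exceeds 12 GB"
    print("%d: %d MB (%s)" % (bs, need, verdict))
```

Under these assumptions a batch of 32 fits comfortably, while 64 is already marginally over the 12 GB budget and 128 is about double it.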
Interesting - I remember seeing that in the CS231n notes…
I’m using a batch size of 32 so that’s just 32*93 + 183*3 = ~3500 MB <<< 12 GB so not sure what’s going on.
In related news, I’m trying to precompute the convolutional layers and running into memory issues again. Copy-pasted from the statefarm code, I’m getting a memory error on this line pretty much as I run it (runs fine on 10% sample):
MemoryError Traceback (most recent call last)
<ipython-input-21-87526e7d2796> in <module>()
----> 1 conv_feat = conv_model.predict_generator(batches, batches.nb_sample)
2 conv_val_feat = conv_model.predict_generator(val_batches, val_batches.nb_sample)
3 conv_test_feat = conv_model.predict_generator(test_batches, test_batches.nb_sample)
/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/models.pyc in predict_generator(self, generator, val_samples, max_q_size, nb_worker, pickle_safe)
943 max_q_size=max_q_size,
944 nb_worker=nb_worker,
--> 945 pickle_safe=pickle_safe)
946
947 def get_config(self):
/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.pyc in predict_generator(self, generator, val_samples, max_q_size, nb_worker, pickle_safe)
1650 for out in outs:
1651 shape = (val_samples,) + out.shape[1:]
-> 1652 all_outs.append(np.zeros(shape, dtype=K.floatx()))
1653
1654 for i, out in enumerate(outs):
MemoryError:
Calling batches.nb_sample gives 188348, which I know is high, but isn’t predict_generator supposed to iterate through that in batches? (My batch size is 32, the same as it was during the sample where I didn’t have a memory error.)
They are indeed being calculated one batch at a time. However, if the intermediate result size is greater than the system RAM, the memory error is likely to occur. Could you check how much RAM one batch’s output occupies? That might help project the amount of RAM the 188348 records might take.
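A sketch of that projection: the traceback shows predict_generator calling `np.zeros` with shape `(val_samples,) + out.shape[1:]`, so the whole output array is allocated up front. The two feature shapes below are assumptions (they depend on where the conv model is cut and on the input size), and `output_gib` is a hypothetical helper.

```python
def output_gib(n_samples, feature_shape, bytes_per_element=4):
    """RAM needed to hold float32 features for every sample at once, in GiB."""
    per_sample = bytes_per_element
    for dim in feature_shape:
        per_sample *= dim
    return n_samples * per_sample / 1024.0 ** 3

n = 188348
before_last_pool = output_gib(n, (512, 14, 14))  # assumed: cut at the last conv layer
after_last_pool = output_gib(n, (512, 7, 7))     # assumed: cut after the final max-pool
print(round(before_last_pool, 1), round(after_last_pool, 1))
```

If the conv output is 512x14x14 per image, the full array is on the order of 70 GiB, which would explain the error appearing immediately on the full dataset but not on a 10% sample.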
Ah ok so you mean each batch is processed individually but the results of all batches need to be stored in RAM and that may be the issue?
Think that makes sense
I suppose I can just iterate through chunks at a time like break the dataset into 10 pieces then process each piece batch-wise and then combine the set of all convolutional activations at the end (since then only 1/10 of the results will ever be in RAM at once)…
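The chunking idea could be sketched like this. `chunk_ranges` is a hypothetical helper that only computes index ranges; running the model over each range and saving the result is left out. Chunk sizes are rounded up to a multiple of the batch size so that only the final chunk ends on a partial batch.

```python
def chunk_ranges(n_samples, n_chunks, batch_size):
    """Yield (start, stop) pairs; every chunk but the last is a whole number of batches."""
    per_chunk = -(-n_samples // n_chunks)                  # ceiling division
    per_chunk = -(-per_chunk // batch_size) * batch_size   # round up to a batch multiple
    for start in range(0, n_samples, per_chunk):
        yield start, min(start + per_chunk, n_samples)

ranges = list(chunk_ranges(188348, 10, 32))
print(len(ranges), ranges[0], ranges[-1])
```

Each range can then be processed and written to disk before the next one starts, so only about a tenth of the activations are in RAM at any time.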
Ah ok so you mean each batch is processed individually but the results of all batches need to be stored in RAM and that may be the issue?
Results of a batch meaning the predicted, one hot encoded values? This should be small. I believe these are used during the batch for SGD and then thrown away (i.e. available to be reclaimed by the python interpreter).
I suppose I can just iterate through chunks at a time like break the dataset into 10 pieces then process each piece batch-wise
That doesn’t sound right to me. Using generators and SGD with batches is essentially handling the memory management for you. You’re using batches so that you don’t run out of memory while training.
My understanding of things is
When using generators and model.fit_generator() all your training cases DO NOT need to fit in memory (GPU or system memory). Data from an entire training batch DOES need to fit in memory.
All the weights of your model do need to fit in GPU memory at the time of training.
With generators you can keep your batch size fixed but include more training cases and not use up more memory (GPU or CPU) when training your models.
Please correct me if I’ve got any of this wrong.
If you are using batches then I don’t think the result needs to be stored. That is the whole point of batches. We can scale to any level on the machine. Will just take more time.
If you are using Linux then you can install a handy utility called htop. It lets you see the memory/CPU usage.
Before you start the training run htop and see whether the memory is already being consumed. Maybe your memory is already being used by some object which you created earlier in your notebook? As your process runs see if there is anything unusual with memory usage - spikes maybe? May help correlate the changes in memory with your code execution.
EDIT
Looking at the documentation, it seems the problem is with the predict_generator method. This method is for generating predictions from input (which are generated from batches). So the input is in batches but the output is not.
I believe you need to use the predict method which takes batches as input and gives batches as output.
That would explain why it worked on samples but not the whole data set.
The thing that you will have to change is probably the saving and loading the array part. Jeremy was always loading the whole array but you will have to find a way to save and load that in batches instead of the whole thing at a time.
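One possible way to save and load in batches, as a sketch only: write each batch's output into a disk-backed `.npy` file through a NumPy memory map, then map the file read-only later. This swaps in plain NumPy for illustration and is not necessarily how the course notebooks store arrays. `fake_conv_batches` is a hypothetical stand-in for whatever produces per-batch conv output (for example a loop over `predict_on_batch`), and the tiny shapes are placeholders.

```python
import os
import tempfile
import numpy as np

def fake_conv_batches(n_samples, batch_size, feature_shape):
    """Hypothetical stand-in for per-batch conv output from the model."""
    for start in range(0, n_samples, batch_size):
        size = min(batch_size, n_samples - start)
        yield np.full((size,) + feature_shape, start, dtype=np.float32)

n_samples, batch_size, feature_shape = 100, 32, (4, 2, 2)
path = os.path.join(tempfile.mkdtemp(), "conv_feat.npy")

# Disk-backed .npy file: only the batch being written has to sit in RAM.
out = np.lib.format.open_memmap(path, mode="w+", dtype=np.float32,
                                shape=(n_samples,) + feature_shape)
row = 0
for batch in fake_conv_batches(n_samples, batch_size, feature_shape):
    out[row:row + len(batch)] = batch
    row += len(batch)
out.flush()
del out

# Later: map the file read-only and slice batches out of it lazily.
feats = np.load(path, mmap_mode="r")
print(feats.shape)
```

Slices such as `feats[i:i + batch_size]` can then be fed to the dense layers from a generator, so the full feature array never has to be loaded at once.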