This part didn’t really make sense to me because the finetune method in the vgg class sets layer.trainable = False for all but the newly added dense layer.
I understand I need to compute the conv layer outputs in order to use these as inputs into my dense layers but I’m not sure how that’ll speed things up? Is the idea that I can store those conv outputs so that when I do another epoch, I’m not recalculating the conv outputs, I’m just updating the dense layer - and is the implication that it’ll take as long for the first epoch to calculate the conv outputs for each image but then the actual fine-tuning will be fast?
The video starts at the relevant section. Only need to watch for a few minutes.
“…because the calculations for the convolutional layers takes nearly all the time…”
Basically, exactly as you said. Don’t have to spend any time calculating the conv layers, but instead just precalculate their output, which will always be the same for each image, and feed that directly to the dense layers that you can then train.
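A toy sketch of that idea, using plain Python stand-ins rather than Keras. Here slow_conv_features and train_dense_step are hypothetical placeholders for the frozen conv layers and the dense-layer update; the counter only shows how often the expensive part runs.

```python
# Toy illustration with hypothetical stand-ins, not the real VGG/Keras code.
conv_calls = 0

def slow_conv_features(image):
    """Stand-in for the frozen conv layers: same input always gives same output."""
    global conv_calls
    conv_calls += 1
    return [pixel * 2 for pixel in image]

def train_dense_step(features, label):
    """Stand-in for one dense-layer update; intentionally does nothing here."""
    pass

images = [[1, 2], [3, 4], [5, 6]]
labels = [0, 1, 0]
n_epochs = 5

# Precompute once: exactly one conv pass per image.
cached = [slow_conv_features(img) for img in images]

# Every epoch reuses the cached features, so no further conv passes happen.
for epoch in range(n_epochs):
    for feats, label in zip(cached, labels):
        train_dense_step(feats, label)

print(conv_calls)  # 3, versus 3 * 5 = 15 if the conv pass ran every epoch
```

This only holds while the conv outputs really are fixed per image, i.e. the conv layers are not being trained.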
Tried increasing my batch size to 128 and it goes through the fitting process (see my OP), but now, having waited 4500 seconds, right as the ETA gets to 1s I get the following error:
Epoch 1/1
188288/188348 [============================>.] - ETA: 1s - loss: 0.4871 - acc: 0.8124
MemoryError Traceback (most recent call last)
<ipython-input-23-407aba095b25> in <module>()
1 model_run+=1
----> 2 vgg.fit(batches, val_batches, nb_epoch=1)
...
MemoryError: Error allocating 3288334336 bytes of device memory (out of memory).
Apply node that caused the error: GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0)
Toposort index: 141
Inputs types: [TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar)]
Inputs shapes: [(), (), (), ()]
Inputs strides: [(), (), (), ()]
Inputs values: [array(256), array(64), array(224), array(224)]
Outputs clients: [[GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})]]
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
Hmm, I’m pretty much certain that it didn’t have anything to do with the test set, but rather when it attempted to cross-validate, something may have happened. The model always says 1s left when it is running through the cross validation data. Maybe there was an error there?
If nothing else, you’ve learned the hard way why you should always try on a sample first.
Considering you have many more images than the cat-vs-dog dataset, the precomputed data will be large, so you are likely to hit errors such as OOM or kernel death during training. In this case, you might want to use model.fit_generator() instead of model.fit() in the lesson7 notebook. Something like
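A rough sketch of what such a generator could look like. This is my own illustration with made-up names, not the code from the original post, and it uses small in-memory arrays where real chunks would be loaded from disk.

```python
import numpy as np

def feature_batches(feature_chunks, label_chunks, batch_size):
    """Yield (features, labels) batches forever, walking one chunk at a time."""
    while True:  # Keras generators are expected to loop indefinitely
        for feats, labels in zip(feature_chunks, label_chunks):
            for start in range(0, len(feats), batch_size):
                stop = start + batch_size
                yield feats[start:stop], labels[start:stop]

# Tiny stand-ins for precomputed conv features and their labels.
feature_chunks = [np.ones((5, 3)), np.zeros((4, 3))]
label_chunks = [np.ones(5), np.zeros(4)]
gen = feature_batches(feature_chunks, label_chunks, batch_size=2)

first_x, first_y = next(gen)
print(first_x.shape, first_y.shape)

# With Keras 1.x the generator would then be passed along the lines of:
# model.fit_generator(gen, samples_per_epoch=9, nb_epoch=1)
```

Only the current batch (plus whatever chunk backs it) has to be in memory at once, which is the point of switching away from model.fit().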
Thanks for your reply - will look into the generator approach. So many things to try!
I also settled on a batch size of 32 but I’m still confused:
When you say ‘result memory error’, do you mean GPU memory or RAM?
If I try use a batch size of 128 say, the model fits and right at the end I get this memory error:
MemoryError: Error allocating 3288334336 bytes of device memory (out of memory).
Now please point out if my math is wrong but isn’t 3288334336 bytes = 3288 Megabytes ? And isn’t that much less than the 12 GB of GPU memory on the P2 instances? What’s going on?
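For what it's worth, the arithmetic can be checked directly. The sketch below assumes the four numbers under "Inputs values" in the traceback are the (batch, channels, height, width) shape of the buffer being allocated, and that each element is a 4-byte float32 (the traceback shows precision='float32'). Both are a reading of the traceback, not something Theano states outright.

```python
requested = 3288334336          # bytes, from the MemoryError message
print(requested / 1024.0 ** 2)  # 3136.0 (MiB)
print(requested / 1024.0 ** 3)  # 3.0625 (GiB)

# ASSUMED reading of "Inputs values: [256, 64, 224, 224]"
batch, channels, height, width = 256, 64, 224, 224
bytes_per_float32 = 4
buffer_bytes = batch * channels * height * width * bytes_per_float32
print(buffer_bytes == requested)  # True
```

If that reading is right, the failing request is a single conv output buffer for 256 images (twice the training batch size of 128). It has to fit alongside the weights and every other buffer already on the GPU, so the total can pass 12 GB even though this one request is only about 3 GB.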
According to this note, VGG16 requires roughly 93MB memory per image at test time (double at training time) to store activations, and another 183*3 MB for parameters. Does this mean that using batch size 64/128 training a VGG model will deplete the 12GB GPU memory?
Interesting - I remember seeing that in the CS231n notes…
I’m using a batch size of 32 so that’s just 32*93 + 183*3 = ~3500 MB <<< 12 GB so not sure what’s going on.
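As a rough check of that estimate, here is the same arithmetic for a few batch sizes, using only the figures quoted above (about 93 MB of activations per image at test time, double at training time, plus 183*3 MB for parameters). The function name is made up, and the real footprint depends on the backend, so treat the numbers as ballpark only.

```python
def vgg16_mem_estimate_mb(batch_size, training=True):
    """Rough VGG16 memory estimate in MB from the quoted per-image figures."""
    act_per_image_mb = 93 * (2 if training else 1)  # activations double when training
    params_mb = 183 * 3
    return batch_size * act_per_image_mb + params_mb

for bs in (32, 64, 128):
    test_mb = vgg16_mem_estimate_mb(bs, training=False)
    train_mb = vgg16_mem_estimate_mb(bs, training=True)
    print("batch %3d: test ~%5d MB, train ~%5d MB" % (bs, test_mb, train_mb))
```

By this estimate a batch of 32 fits comfortably, while 64 at training time already lands around 12 GB and 128 goes well past it. Real runs may fit more or less than this suggests.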
In related news, I’m trying to precompute the convolutional layers and running into memory issues again. Copy-pasted from the statefarm code, I’m getting a memory error on this line pretty much as I run it (runs fine on 10% sample):
MemoryError Traceback (most recent call last)
<ipython-input-21-87526e7d2796> in <module>()
----> 1 conv_feat = conv_model.predict_generator(batches, batches.nb_sample)
2 conv_val_feat = conv_model.predict_generator(val_batches, val_batches.nb_sample)
3 conv_test_feat = conv_model.predict_generator(test_batches, test_batches.nb_sample)
/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/models.pyc in predict_generator(self, generator, val_samples, max_q_size, nb_worker, pickle_safe)
943 max_q_size=max_q_size,
944 nb_worker=nb_worker,
--> 945 pickle_safe=pickle_safe)
946
947 def get_config(self):
/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.pyc in predict_generator(self, generator, val_samples, max_q_size, nb_worker, pickle_safe)
1650 for out in outs:
1651 shape = (val_samples,) + out.shape[1:]
-> 1652 all_outs.append(np.zeros(shape, dtype=K.floatx()))
1653
1654 for i, out in enumerate(outs):
MemoryError:
Calling batches.nb_sample gives 188348, which I know is high, but isn’t predict_generator supposed to iterate through that in batches? (My batch size is 32 - same as it was during the sample where I didn’t have a memory error.)
They are indeed being calculated one batch at a time. However, if the intermediate result size is greater than the system RAM, a memory error is likely to occur. Could you check how much RAM one batch's output occupies? That would help project how much RAM the 188348 records might take.
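One way to do that projection by hand, assuming the conv model outputs a float32 array of shape (512, 14, 14) per image. That shape is an assumption for a VGG16 cut at the last conv layer; conv_model.output_shape would give the real one.

```python
n_samples = 188348           # batches.nb_sample
batch_size = 32
feat_shape = (512, 14, 14)   # ASSUMED per-image output shape
bytes_per_float32 = 4

floats_per_image = 1
for dim in feat_shape:
    floats_per_image *= dim

per_batch_bytes = batch_size * floats_per_image * bytes_per_float32
total_bytes = n_samples * floats_per_image * bytes_per_float32

print("one batch:   %.2f MiB" % (per_batch_bytes / 1024.0 ** 2))
print("all samples: %.1f GiB" % (total_bytes / 1024.0 ** 3))
```

Under that assumption a single batch is only about 12 MiB, but the full output array would be roughly 70 GiB, which would explain why a 10% sample runs and the full set does not.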
Ah ok so you mean each batch is processed individually but the results of all batches need to be stored in RAM and that may be the issue?
Think that makes sense
I suppose I can just iterate through chunks at a time like break the dataset into 10 pieces then process each piece batch-wise and then combine the set of all convolutional activations at the end (since then only 1/10 of the results will ever be in RAM at once)…
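A minimal sketch of that chunking idea with NumPy. fake_conv_predict is a hypothetical stand-in for running conv_model.predict on one piece of the data, and the file names are made up; the point is only that each piece is written to disk before the next one is computed.

```python
import os
import tempfile
import numpy as np

def fake_conv_predict(x):
    """Hypothetical stand-in for conv_model.predict on one chunk."""
    return x * 2.0

def precompute_in_chunks(data, n_chunks, out_dir):
    """Compute features chunk by chunk, saving each chunk to its own .npy file."""
    paths = []
    for i, chunk in enumerate(np.array_split(data, n_chunks)):
        feats = fake_conv_predict(chunk)
        path = os.path.join(out_dir, "conv_feat_%02d.npy" % i)
        np.save(path, feats)
        paths.append(path)
        del feats  # drop the reference so only one chunk of features is held
    return paths

data = np.arange(20, dtype=np.float32).reshape(10, 2)
out_dir = tempfile.mkdtemp()
paths = precompute_in_chunks(data, n_chunks=5, out_dir=out_dir)
first = np.load(paths[0])
print(len(paths), first.shape)
```

Note that concatenating all the pieces back into one array at the end would need the full amount of RAM again, so it is probably better to load the pieces one at a time when training on them.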
Ah ok so you mean each batch is processed individually but the results of all batches need to be stored in RAM and that may be the issue?
Results of a batch meaning the predicted, one hot encoded values? This should be small. I believe these are used during the batch for SGD and then thrown away (i.e. available to be reclaimed by the python interpreter).
I suppose I can just iterate through chunks at a time like break the dataset into 10 pieces then process each piece batch-wise
That doesn’t sound right to me. Using generators and SGD with batches should essentially handle the memory management for you. You’re using batches so that you don’t run out of memory while training.
My understanding of things is
When using generators and model.fit_generator() all your training cases DO NOT need to fit in memory (GPU or system memory). Data from an entire training batch DOES need to fit in memory.
All the weights of your model do need to fit in GPU memory at the time of training.
With generators you can keep your batch size fixed but include more training cases and not use up more memory (GPU or CPU) when training your models.
Please correct me if I’ve got any of this wrong.
If you are using batches then I don’t think the result needs to be stored. That is the whole point of batches. We can scale to any level on the machine. Will just take more time.
If you are using Linux then you can install a handy utility called htop. It lets you see the memory/CPU usage.
Before you start the training, run htop and see whether the memory is already being consumed. Maybe your memory is already being used by some object you created earlier in your notebook? As your process runs, see if there is anything unusual with memory usage (spikes, maybe?). That may help correlate the changes in memory with your code execution.
EDIT
Looking at the documentation, it seems the problem is with the predict_generator method. This method is for generating predictions from input (which is generated in batches). So the input is in batches but the output is not.
I believe you need to use the predict method which takes batches as input and gives batches as output.
That would explain why it worked on samples but not the whole data set.
The thing that you will have to change is probably the saving and loading the array part. Jeremy was always loading the whole array but you will have to find a way to save and load that in batches instead of the whole thing at a time.
Yeah I’m already using htop and the issue seems to be that the output of all the batches is being stored in memory instead of being returned batch-by-batch.
Will work on breaking up the calc and see if that helps…
Seems like predict_generator would be written for the explicit purpose of not eating up all your memory. I haven’t looked at the implementation but it should be something like
load a batch of test cases into memory on the GPU.
make and store predictions on the entire batch (predictions are stored in a relatively small numpy array)
remove all references to the batch allowing the python garbage collector to reclaim the used memory.
Remember that when you create an ImageDataGenerator in Keras you determine the batch_size and therefore help control the GPU and system memory used during the call to predict_generator.
predict takes an entire “batch” in the sense that it runs predict on everything you give it all at once (think of batch processing in computer science).
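Going by the traceback earlier in the thread (np.zeros(shape, ...) with shape = (val_samples,) + out.shape[1:]), the output side looks more like the following. This is a simplified sketch written from those few lines, not the actual Keras source, and the function and argument names are made up.

```python
import numpy as np

def predict_generator_sketch(batch_iter, predict_batch, n_samples):
    """Inputs arrive batch by batch, but outputs are written into one
    preallocated array that covers every sample."""
    all_out = None
    done = 0
    for x in batch_iter:
        out = predict_batch(x)
        if all_out is None:
            # Same pattern as np.zeros((val_samples,) + out.shape[1:]) in the traceback
            all_out = np.zeros((n_samples,) + out.shape[1:], dtype=out.dtype)
        all_out[done:done + len(out)] = out
        done += len(out)
        if done >= n_samples:
            break
    return all_out

batches = (np.ones((4, 3), dtype=np.float32) for _ in range(5))
result = predict_generator_sketch(batches, lambda x: x * 2, n_samples=20)
print(result.shape)  # (20, 3)
```

So batch_size bounds how much input is in flight, but the returned array grows with the number of samples times the size of one output. For small class-probability outputs that is negligible; for conv-layer activations it may not be.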
Any chance there’s something wrong with how you installed cuDNN?
I just spent a few days messing with notebooks in part1 and part2 on Google Compute Cloud with both Tensorflow and Theano as backends. I was playing with a bunch of settings trying to speed things up. Any chance you doubled the batch_size for the validation data? This is what the notebook does and I replicated it when writing my notebook.
With Theano as a backend when I’d get to the very end of one epoch and Keras/Theano started processing the validation set I’d get an out of memory error. With Tensorflow it blew up immediately.
With Theano I was able to get to a batch_size of 128 on a 12G Tesla. Doubling that blows things up.