Batch size effect on validation accuracy

Hi Jeremy & Rachel, and thanks for the great MOOC.

I am running VGG16 on Cats vs Dogs (Lesson 1) on my laptop with a small NVIDIA Quadro K1100M GPU, and because of its limited memory I have been experimenting with different (smaller) batch sizes. Here is what I got:

batch size      validation accuracy
   2                   0.88
   4                   0.97
   8                   0.98

i.e. it seems that validation accuracy benefits from a bigger batch size. Could you elaborate a little on this?
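
In case it helps, this is roughly how I ran each experiment (a sketch from memory of the lesson 1 notebook; the data path is a placeholder):

from vgg16 import Vgg16  # the course's VGG16 wrapper from lesson 1

path = 'data/dogscats/'  # placeholder path
batch_size = 8           # varied across runs: 2, 4, 8

vgg = Vgg16()
batches = vgg.get_batches(path + 'train', batch_size=batch_size)
val_batches = vgg.get_batches(path + 'valid', batch_size=batch_size * 2)
vgg.finetune(batches)
vgg.fit(batches, val_batches, nb_epoch=1)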

Many thanks in advance

Christos

Sure. With a small batch size, the gradients are only a very rough approximation of the true gradients. So it’ll take a lot longer to find a good solution. Generally you’ll want a batch size of around 64 if you can manage it. Smaller batch sizes are OK, but will take a bit longer.
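
In Keras 1-style code like the fit calls quoted later in this thread, that just means passing batch_size=64 (a minimal sketch; the model and feature arrays are assumed to already exist):

# Sketch: the same kind of training call as below, with the suggested
# batch size. Smaller batches give noisier gradient estimates, so they
# may need more epochs (nb_epoch) to reach the same solution.
model.fit(conv_trn_feat, trn_labels,
          nb_epoch=3, batch_size=64,
          validation_data=(conv_val_feat, val_labels))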

Hi Jeremy. Can I ask a little further about this issue? My computer's GPU has limited RAM, so I can only run with a very small batch size. Is there any way to get validation accuracy equivalent to running with a bigger batch?

One way is to train for longer, as Jeremy suggested above. Here's a quoted post describing the same issue:

Hello,

I’ve stumbled upon a very strange situation where the batch_size is the major factor in the validation set accuracy of my model. Proof:

model.fit(conv_trn_feat, trn_labels, nb_epoch=1, batch_size=256, validation_data=(conv_val_feat, val_labels))
Train on 2001 samples, validate on 500 samples
Epoch 1/1
2001/2001 [==============================] - 6s - loss: 0.0546 - acc: 0.9845 - val_loss: 1.4992 - val_acc: 0.7700
model.fit(conv_trn_feat, trn_labels, nb_epoch=1, batch_size=128, validation_data=(conv_val_feat, val_labels))
Train on 2001 samples, validate on 500 samples
Epoch 1/1
2001/2001 [==============================] - 9s - loss: 0.0439 - acc: 0.9835 - val_loss: 4.2225 - val_acc: 0.4520
model.fit(conv_trn_feat, trn_labels, nb_epoch=1, batch_size=256, validation_data=(conv_val_feat, val_labels))
Train on 2001 samples, validate on 500 samples
Epoch 1/1
2001/2001 [==============================] - 8s - loss: 0.0291 - acc: 0.9870 - val_loss: 1.4931 - val_acc: 0.7600
model.fit(conv_trn_feat, trn_labels, nb_epoch=1, batch_size=512, validation_data=(conv_val_feat, val_labels))
Train on 2001 samples, validate on 500 samples
Epoch 1/1
2001/2001 [==============================] - 8s - loss: 0.0183 - acc: 0.9940 - val_loss: 0.0129 - val_acc: 0.9960

I can achieve 99.6% validation accuracy in fewer than 10 epochs of training with batch_size=512, but with batch_size=128 I can't get the validation accuracy past 48%, even after hundreds of epochs of training, and even if I start from the same weights I used to train the model with batch_size=512. In fact, even model.evaluate() gives me numbers in the same ballpark as the ones above, depending on what batch_size I feed it.

What can I do if I want to deploy this model on something that doesn’t have the RAM to handle batch_size=512?

Feels like I'm in a similar position to you, although my results are not as extreme. Have you investigated the batch size problem any further?

I haven't investigated it further. However, this only affects backpropagation, so if I train with a high batch_size and then deploy on a low-end system for prediction only, the model will still perform at maximum accuracy, since inference only needs the feed-forward pass.

When you deploy a model, I interpret that as using the model to predict an answer.

As I understand it, batch_size is for training: it is used when computing gradients to obtain better weights for your model.

To deploy a model, you merely apply the learned weights at the model's different layers to make a single prediction.

I'm just ramping up with NNs, but that's my understanding so far. Hope it helps.
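
Putting the two previous points together, a minimal sketch (the filename is a placeholder; assumes a Keras 1-era model like the one in the fit calls above):

# On the training machine: big batches for stable gradients.
model.fit(conv_trn_feat, trn_labels, nb_epoch=10, batch_size=512,
          validation_data=(conv_val_feat, val_labels))
model.save_weights('top_model_weights.h5')  # placeholder filename

# On the low-memory deployment machine: prediction is feed-forward
# only, so a tiny batch size affects speed and memory use, not results.
model.load_weights('top_model_weights.h5')
preds = model.predict(conv_val_feat, batch_size=1)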

What loss function did you use?

I used Mean Squared Error (‘mse’). I was working on a regression problem.
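
For reference, that corresponds to a compile call along these lines (a generic sketch, not the exact code from that project):

# Generic sketch: a Keras regression model compiled with MSE loss.
model.compile(optimizer='rmsprop', loss='mse')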

On that note, what is the purpose of the batch size used for validation?

What are the pros and cons of a batch size equal to the size of one class, or to the whole dataset?

Hi Jeremy,
I'm a bit confused by the explanation. Isn't reaching a (nearly) optimal solution faster the main motivation behind SGD and mini-batch gradient descent? As far as I understand, reducing the batch size helps you reach the optimum faster, but at the cost of final accuracy. So, will training for longer improve the accuracy?
Sorry if I'm missing something.
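
One way to see the trade-off concretely (a self-contained sketch, not from the course materials): the mini-batch gradient is an average of per-example gradients, so its noise around the full-data gradient shrinks roughly like 1/sqrt(batch_size). Smaller batches take noisier steps, which is why they can need more steps or epochs to settle on an equally good solution:

import numpy as np

np.random.seed(0)

# Toy linear-regression loss: per-example gradient of (w*x - y)^2 w.r.t. w.
n = 10000
x = np.random.randn(n)
y = 3.0 * x + 0.5 * np.random.randn(n)
w = 0.0
per_example_grads = 2 * (w * x - y) * x

for bs in [2, 8, 64, 512]:
    # Spread of the mini-batch gradient estimate over 1000 random batches:
    # it shrinks roughly like 1/sqrt(bs).
    est = np.array([per_example_grads[np.random.choice(n, bs)].mean()
                    for _ in range(1000)])
    print('batch size %4d: gradient std %.3f' % (bs, est.std()))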