Why is batch_size Doubled for val_batches

In the notebook of Lesson 1, why was the batch_size for val_batches twice that of batches?

vgg = Vgg16()

batches = vgg.get_batches(path+'train', batch_size=batch_size)
val_batches = vgg.get_batches(path+'valid', batch_size=batch_size*2)

Because validation doesn’t need backprop, it needs less memory, so a larger batch size fits.
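To make the memory argument concrete, here is a rough sketch rather than a measurement. It assumes a toy model in which training must keep every layer's output around for the backward pass, while a forward-only pass needs just the input and output of the layer currently being computed. The function `activation_values` and the layer sizes are hypothetical, and real frameworks differ in the details.

```python
def activation_values(layer_sizes, batch_size, training):
    """Rough count of activation values held in memory at one time.

    layer_sizes: number of activation values per sample after each layer
    (hypothetical numbers, not taken from VGG16).
    """
    if training:
        # Backprop needs every layer's output kept until the backward pass.
        per_sample = sum(layer_sizes)
    else:
        # Forward only: keep just the current layer's input and output.
        per_sample = max(a + b for a, b in zip(layer_sizes, layer_sizes[1:]))
    return per_sample * batch_size


layer_sizes = [1000] * 6  # assumed toy network with six equally sized layers
batch_size = 64

train_cost = activation_values(layer_sizes, batch_size, training=True)
valid_cost = activation_values(layer_sizes, batch_size * 2, training=False)
print(train_cost, valid_cost)
```

Under these assumptions the doubled validation batch still holds fewer activations than the training batch. Training also stores gradients and optimizer state, which this sketch ignores.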


Does this also mean that the valid set (e.g. sample/valid/cats) should have twice as many images as sample/train/cats? In the downloaded sample dogs/cats, the valid set has half the images of the train set.

No. Batch size only controls how many images are processed at once, not how many images are in the set. You want a bigger set for training and a smaller set for validation. The reason, I think, is that the more data you have to train on, the better you can tune your model, and then you only need a smaller validation set to check its accuracy.

I don’t know if there’s a particular ratio that is recommended, though. For example, in the video @jeremy mentions that the training set originally had 12,500 images for each class (cats / dogs), but then he took 1,000 of each to create the validation data set.

In the course material, however, the sample data set has, I think, about 8 cats and 8 dogs in train and 4 of each in validation. He mentioned that he would rather have something like 100.

In my particular case, to work with dogs vs cats I am using 100 cats and 100 dogs, and dividing those into 90 for training and 10 for validation.

Not related to your question, but I found this gist by @brookisme, which comes in handy when you are setting up your data sets.

80/20 and 90/10 splits are common. I did a course earlier and they used something like that. This Stack Overflow question explains it in detail: https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio


I tend to use 80% for training and 20% for validation. I don’t know if there is a commonly accepted split ratio. But the training set should be bigger than the validation set; otherwise your model will have less data to train on and will therefore be less accurate.
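For anyone setting up their own split, here is a minimal sketch of an 80/20 split done on a list of file names. The helper `split_train_valid` and the example file names are hypothetical. It only splits a list in memory, so moving the files into train/ and valid/ folders is left to you.

```python
import random


def split_train_valid(items, valid_fraction=0.2, seed=42):
    """Shuffle items and split them into (train, valid) lists."""
    items = list(items)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(items)  # fixed seed keeps the split repeatable
    n_valid = int(round(len(items) * valid_fraction))
    return items[n_valid:], items[:n_valid]


# Hypothetical file names, for illustration only.
cats = ["cat.%d.jpg" % i for i in range(100)]
train, valid = split_train_valid(cats, valid_fraction=0.2)
print(len(train), len(valid))
```

Shuffling before splitting matters when the files are ordered in some way, and doing the split separately per class keeps cats and dogs balanced in both sets.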

Currently, the batch size doubling appears to be buried in the DataLoader class, so after setting the batch size to, say, 64, the validation generator puts out batches of 128.
I have to say, this is not a stellar idea from the coding standpoint. One often runs these generators for debugging purposes, and the first time the validation generator spits out twice the expected batch size, it is a surprise that you MUST track down. That can take a long time, as one would not suspect a surprise in a bs that one specifically sets (note that the rationale would suggest that the doubling be applied to the test data set too, but that does not appear to be the case). If it is really necessary to do this, there should be a triple of bs’s (similar to what one does for differential learning rates). Then the user will always get the same batch size unless she explicitly requests a variation.
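The triple-of-bs’s idea could look something like the sketch below. This is not the library’s API. `resolve_batch_sizes` is a hypothetical helper that shows the proposed behaviour: a single number gives the same batch size everywhere, and a different validation or test size has to be asked for explicitly.

```python
def resolve_batch_sizes(bs):
    """Return (train_bs, valid_bs, test_bs) from an int or a 3-tuple.

    Hypothetical helper: a single int is used unchanged for all three
    sets, so nothing is doubled behind the user's back.
    """
    if isinstance(bs, int):
        return (bs, bs, bs)
    train_bs, valid_bs, test_bs = bs  # raises ValueError unless exactly three
    return (train_bs, valid_bs, test_bs)


print(resolve_batch_sizes(64))              # same size for every set
print(resolve_batch_sizes((64, 128, 128)))  # explicit opt-in to larger batches
```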