Why is batch_size Doubled for val_batches


(Nyx) #1

In the notebook of Lesson 1, why was the batch_size for val_batches twice that of batches?

vgg = Vgg16()

batches = vgg.get_batches(path+'train', batch_size=batch_size)
val_batches = vgg.get_batches(path+'valid', batch_size=batch_size*2)

Lesson 1 discussion
(Jeremy Howard) #2

Because it doesn’t need backprop, so needs less memory.


(Vinay Agarwal) #3

Does this also mean that the valid set (e.g. sample/valid/cats) should have twice the images as sample/train/cats? In the downloaded sample dogs/cats, the valid set has half the images of the train set.


(Alejandro Castro) #4

No. You want to have a bigger set for training, and a smaller set for validation. The reason I think is because the more data you have to train, the better you will be able to tune your model, and then you just need a smaller validation set to check the accuracy of your model.

I don’t know if there’s a particular ratio that is recommended thought. For example, in the video @jeremy mentions that for the training set there where originally 12,500 images for each (cats / dogs) but then he took 1,000 for each to create the validation data set.

In the material however, on the sample data set, there are I think about 8 cats and 8 dogs in the train and 4 of each in the validation. He mentioned however that he would have rather have something like 100.

In my particular case, to work with dogs vs cats I am using 100 cats and 100 dogs, and dividing those in 90 for training and 10 for validation.

Not related to your question, but I found this gist by @brookisme, which comes in handy when you are setting up your data sets.


(Aseem Bansal) #5

80/20 and 90/10 occurs commonly. I did a course earlier and they used something like that. This https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio explains it in detail.


#6

I tend to use 80% for training and 20% for validation. I don’t know if there is a commonly accepted split ratio of data. But training set should be bigger than validation set, otherwise your model will have less data to train on, and will therefore be less accurate.