What are the pros and cons of using batched data to train your model? What is the difference, if any?

Example of training an NN on ImageNet data that doesn’t fit in memory.

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    featurewise_center=True,             # set input mean to 0 over the dataset
    samplewise_center=False,             # set each sample mean to 0
    featurewise_std_normalization=True,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,  # divide each input by its own std
    zca_whitening=False,                 # apply ZCA whitening
    rotation_range=20,                   # randomly rotate images in the range (degrees, 0 to 180)
    width_shift_range=0.2,               # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.2,              # randomly shift images vertically (fraction of total height)
    horizontal_flip=True,                # randomly flip images horizontally
    vertical_flip=False)                 # randomly flip images vertically

datagen.fit(X_sample) # let’s say X_sample is a small-ish but statistically representative sample of your data

# Example 1
for e in range(nb_epoch):
    print("epoch %d" % e)
    for X_train, Y_train in ImageNet():  # these are chunks of ~10k pictures
        batches = 0
        for X_batch, Y_batch in datagen.flow(X_train, Y_train, batch_size=32):  # chunks of 32 samples
            loss = model.train_on_batch(X_batch, Y_batch)
            batches += 1
            if batches >= len(X_train) / 32:
                break  # datagen.flow() loops indefinitely, so stop after one pass over the chunk

# Alternatively
for e in range(nb_epoch):
    print("epoch %d" % e)
    for X_train, Y_train in ImageNet():  # these are chunks of ~10k pictures
        model.fit(X_train, Y_train, batch_size=32, nb_epoch=1)

You can just use model.train_on_batch. That would be more "Pythonic" (explicit is better than implicit).

Thanks for sharing, but I don't think this answers the question.

I’m not 100% sure what the question is. Are you asking about the difference between model.train_on_batch() and model.fit()?

If so, fit() also does validation and creates a progress bar. Other than that, they do the same thing.
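To make the difference concrete, here is a minimal sketch of the two approaches side by side. It assumes a compiled Keras model and in-memory arrays X and Y (placeholder names, not from the example above):

# fit() handles the batching loop for you, plus optional validation and a progress bar:
model.fit(X, Y, batch_size=32, nb_epoch=1, validation_split=0.1)

# train_on_batch() performs exactly one gradient update on the batch you hand it,
# so you write the batching loop yourself:
for i in range(0, len(X), 32):
    loss = model.train_on_batch(X[i:i + 32], Y[i:i + 32])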

The second piece of code ignores data augmentation. Augmentation is usually beneficial (if parameters are set correctly). But it has nothing to do with the batch structure and memory consumption.

Maybe I did a poor job of asking the question. Let me rephrase: how does varying the batch_size in Keras' model.fit(…) affect the deep learning model? Here is the explanation in the documentation: batch_size: integer. Number of samples per gradient update.

The SGD gradient update is performed over the batch. If the batch size is 1, the gradient update happens after each example; if the batch size is 32, the update happens after every 32 examples; and so on.
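As a quick sanity check, you can compute how many gradient updates one epoch produces for a given batch size. The training-set size below is made up for illustration:

import math

num_samples = 50000  # hypothetical training-set size
for batch_size in (1, 32, num_samples):
    updates = math.ceil(num_samples / batch_size)
    print("batch_size=%d -> %d gradient updates per epoch" % (batch_size, updates))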

The larger the batch size, the more the gradient update resembles the true gradient of the loss function. If the batch size is equal to the size of the training set, the gradient is the actual gradient of the loss over the whole training set. For smaller batch sizes, the gradient is only an approximation of that true gradient. That's why the loss bounces around so much when you use SGD: it doesn't compute the true gradient, only an approximation.
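You can see this directly with a toy experiment (made-up linear-regression data here, nothing to do with ImageNet): compare the gradient computed on a mini-batch against the gradient computed on the full dataset. The gap shrinks as the batch size grows:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(10000, 5)            # toy inputs
w_true = rng.randn(5)
y = X.dot(w_true) + 0.1 * rng.randn(10000)
w = np.zeros(5)                    # current parameters

def grad(Xb, yb, w):
    # gradient of the mean squared error over the batch
    return 2.0 / len(Xb) * Xb.T.dot(Xb.dot(w) - yb)

g_full = grad(X, y, w)             # the "true" gradient over the whole dataset
for bs in (1, 32, 1024):
    idx = rng.choice(len(X), bs, replace=False)
    g = grad(X[idx], y[idx], w)
    print("batch_size=%4d  distance from true gradient: %.3f" % (bs, np.linalg.norm(g - g_full)))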

The reason we use batches (or more correctly, mini-batches) is that the whole dataset typically doesn’t fit in GPU memory. Usually there is only room for 32 or so images at once.

Using a batch size of 32 is more efficient than using a smaller batch size, because the GPU can do work more efficiently when it’s kept busy. So a larger batch size results in less waiting. If your GPU can handle a batch size of 64 but you’re only giving it a batch with 32 images, then you’re not putting your GPU to full use.

As I pointed out, the smaller your batch size, the more SGD tends to jump around. But this "noisiness" isn't a bad thing: it actually helps the optimizer find an optimum (the optimization won't get stuck, because the gradient update process is a little unpredictable).


I just wanted to add that image data is correlated: if you have 12,500 images of cats, any given picture will be similar to quite a few others. This produces a nice property of SGD: you look at only a subset of the data, but you get an update that approximates what you would have gotten had you looked at the entire dataset.

Hence, even when performance issues are not a constraint, SGD can give you faster convergence than plain gradient descent, which calculates each update over the entirety of the dataset.
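A toy illustration of that point, again with made-up linear-regression data: in a single pass over the data, mini-batch SGD makes many updates while full-batch gradient descent makes exactly one, so SGD typically ends the pass with a much lower loss:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(10000, 5)
w_true = rng.randn(5)
y = X.dot(w_true)

def loss(w):
    return np.mean((X.dot(w) - y) ** 2)

lr = 0.05

# Full-batch gradient descent: one pass over the data = one update.
w_gd = np.zeros(5)
w_gd -= lr * 2.0 / len(X) * X.T.dot(X.dot(w_gd) - y)

# Mini-batch SGD: one pass over the data = one update per batch of 32.
w_sgd = np.zeros(5)
for i in range(0, len(X), 32):
    Xb, yb = X[i:i + 32], y[i:i + 32]
    w_sgd -= lr * 2.0 / len(Xb) * Xb.T.dot(Xb.dot(w_sgd) - yb)

print("loss after one pass, full-batch GD:  %.4f" % loss(w_gd))
print("loss after one pass, mini-batch SGD: %.4f" % loss(w_sgd))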

I think the stochasticity of the batch-based updates (the inherent small differences in the images across batches) also gives you some nice properties for navigating the error surface, but I don't recall whether that is exactly the case, nor where I read it. I can see how it could help us escape some shallow local minima (or at least regions that look like a local minimum across many dimensions). But take this with a grain of salt; I could just as well be dreaming this up :slight_smile: