Data Augmentation

I don’t know - I assume they just mean they’re using twice as much augmentation as some other model, but it’s not clear.

Maybe they mean that the size of the training set was doubled with augmented images.

Could be - but most people do real-time data augmentation, where such a concept doesn’t really make sense…

Thanks Jeff! Will give that a shot.


I read the documentation and still have doubts about data augmentation.
First theory:
When a batch of 32 images is passed through data augmentation, a random subset, say 10 of the 32, is chosen and augmented, and we get back 22 unmodified images plus 10 augmented ones. Is this correct?
Second theory:
The 1st epoch gets randomly augmented images,
the 2nd epoch gets another set of randomly augmented images, and so on?
Third theory:
All 32 images are augmented, and the augmentation parameters are chosen randomly for each image?

We can find out by looking at the source code:

That one!
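
For anyone who wants to see the third theory concretely, here is a minimal NumPy sketch. It is not the Keras source, only an illustration under some assumptions: `random_shift` and `augmented_batches` are made-up names, and a wrap-around pixel shift stands in for the real rotation/shear/zoom transforms. The point is that every image in every batch gets its own freshly drawn random transform, so each epoch sees a different variant of the same image.

```python
import numpy as np

def random_shift(img, max_shift, rng):
    """Shift one image by a random offset along each axis (pixels wrap around)."""
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    return np.roll(img, shift=(dy, dx), axis=(0, 1))

def augmented_batches(x, batch_size, max_shift, epochs, seed=0):
    """Yield (epoch, indices, batch); every image is re-augmented on every visit."""
    rng = np.random.default_rng(seed)
    for epoch in range(epochs):
        order = rng.permutation(len(x))  # reshuffle the data each epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            # Each image draws its own random offset; none is passed through "as is"
            # unless the drawn offset happens to be zero.
            batch = np.stack([random_shift(x[i], max_shift, rng) for i in idx])
            yield epoch, idx, batch

# Toy data: 8 "images" of 6x6 pixels, batches of 4, two epochs.
x = np.arange(8 * 6 * 6, dtype=np.float32).reshape(8, 6, 6)
results = list(augmented_batches(x, batch_size=4, max_shift=2, epochs=2))
```

Note that the number of images per epoch stays the same as the original dataset; the variety comes from drawing new random parameters each time an image is visited.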


@jeremy, @prateek2686
In Lesson3.ipynb, we learned that we do not augment validation data. However, below (in mnist.ipynb), we use the augmented data generator ‘gen’ both for train and test/validation data:

gen = image.ImageDataGenerator(rotation_range=8, width_shift_range=0.08, shear_range=0.3,
                           height_shift_range=0.08, zoom_range=0.08)
batches = gen.flow(X_train, y_train, batch_size=64)
test_batches = gen.flow(X_test, y_test, batch_size=64)

I assume this is a mistake, and we should have done this instead:

from keras.preprocessing.image import ImageDataGenerator

gen_train = ImageDataGenerator(rotation_range=8, width_shift_range=0.08,
                               shear_range=0.3, height_shift_range=0.08,
                               zoom_range=0.08)
# !! NB: We don't want to augment or shuffle the validation / test set,
# so just use a plain generator for test
gen_test = ImageDataGenerator()  # no augmentation parameters set
batches = gen_train.flow(X_train, y_train, batch_size=64)
test_batches = gen_test.flow(X_test, y_test, batch_size=64, shuffle=False)

Am I right?


You are totally right! Many thanks for spotting this :slight_smile:

I was curious about something. You said that to find the best data augmentation parameters, we can just run 4-5 experiments each for width, height, rotation, shear, zoom, channel, etc., and combine the best values from all of them. But aren’t these parameters interrelated?
What I mean is: if the best width shift comes out as 0.5 and the best height shift also comes out as 0.5, is the best validation loss necessarily the combination of 0.5 width and 0.5 height? Isn’t it possible that these extreme values together hurt our validation accuracy?

You’re right. If you really want to figure it out, you need to do a grid search.
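
To make the difference concrete, here is a small sketch comparing the one-parameter-at-a-time approach with a grid search. Everything in it is hypothetical: `validation_loss` is a toy formula standing in for "train a model with these settings and measure validation loss", and it is deliberately built so that the two shifts interact.

```python
from itertools import product

def validation_loss(width_shift, height_shift):
    """Toy stand-in for training a model and returning its validation loss."""
    # Each shift helps on its own, but the last term penalizes
    # using large values of both at the same time.
    return ((width_shift - 0.5) ** 2
            + 1.5 * (height_shift - 0.5) ** 2
            + 4.0 * width_shift * height_shift)

candidates = [0.0, 0.1, 0.3, 0.5]

# One-at-a-time search: tune each parameter while the other stays at 0.
best_width = min(candidates, key=lambda w: validation_loss(w, 0.0))
best_height = min(candidates, key=lambda h: validation_loss(0.0, h))
combined_loss = validation_loss(best_width, best_height)

# Grid search: evaluate every (width, height) pair jointly.
best_joint = min(product(candidates, candidates),
                 key=lambda pair: validation_loss(*pair))
joint_loss = validation_loss(*best_joint)

print("one-at-a-time:", (best_width, best_height), combined_loss)
print("grid search:  ", best_joint, joint_loss)
```

With this toy loss, combining the individually best values gives a worse result than the pair the grid search picks. The price is the number of runs: the grid needs one experiment per combination, which grows quickly as parameters are added.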

From what I learned in the Keras documentation and this thread, it seems that in real-time augmentation the images are randomly augmented based on the augmentation criteria. Also, when an image is augmented, only the augmented image is fed to the network, not the original. Is my understanding correct?

Given that, I am thinking it would be nice to save the augmented images somewhere and train the network on both the original and augmented images, so that we have multiple augmented versions plus the original.

You’re right again, and Fchollet already did the hard work for you - there’s an option (in the flow and flow_from_directory methods IIRC) to save the augmented images (the save_to_dir, save_prefix and save_format parameters set the save directory, image prefix and image format).

The ImageDataGenerator class as it is written doesn’t let you use augmented and original images, but you could always just create a second generator without any manipulation and run one epoch over that.
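
A rough sketch of that schedule in plain Python, without Keras: a few augmented epochs followed by one pass over the untouched images. The names `plain_epoch`, `augmented_epoch` and `training_schedule` are made up for illustration, and `augment` is any callable you supply; in Keras the equivalent would be fitting on the augmenting generator and then on a second, parameter-free one.

```python
import random

def plain_epoch(images, batch_size):
    """One pass over the untouched images, in their original order."""
    for start in range(0, len(images), batch_size):
        yield images[start:start + batch_size]

def augmented_epoch(images, batch_size, augment, rng):
    """One shuffled pass in which every image goes through `augment`."""
    shuffled = images[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield [augment(img, rng) for img in shuffled[start:start + batch_size]]

def training_schedule(images, batch_size, augment, n_augmented_epochs, seed=0):
    """Yield batches for several augmented epochs, then one plain epoch."""
    rng = random.Random(seed)
    for _ in range(n_augmented_epochs):
        yield from augmented_epoch(images, batch_size, augment, rng)
    yield from plain_epoch(images, batch_size)

# Toy usage: numbers stand in for images, adding noise stands in for augmentation.
images = [float(i) for i in range(10)]
add_noise = lambda img, rng: img + rng.uniform(-0.1, 0.1)
batches = list(training_schedule(images, batch_size=4, augment=add_noise,
                                 n_augmented_epochs=2))
```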


Very informative thread, thanks to all!

My confusion is this: let us say I have very few images, say 2-5 per category, and these are all the images I have in total. Using these 2-5 images per category plus data augmentation, we can generate enough images to train a model. That is all good for training. But:

As discussed above, the validation set shouldn’t be augmented. In that case, how can I validate with such a small dataset? I know I am asking about a corner case, but in the real world we often deal with very small datasets. Any help would be greatly appreciated. Thanks.