Lesson 3 - Why does data augmentation still result in the same number of samples?

Hi Everyone,

(This is my first post to this forum so hopefully I did this correctly)

I noticed in the Lesson 3 video that when Jeremy used data augmentation and reran the learning, it still reported the same number of samples as in the earlier run. My understanding of data augmentation is that it "augments" the dataset so you can effectively increase the volume of your data without actually having to collect more from an external source. As such, I expected that data augmentation would result in a larger training set. For example, if the augmentation was tilting the image by a certain angle, I would expect the training to be done on both the tilted and untilted image. Why is it that we don't retain the original image for the training?

Thank you,
Patrick


If I remember correctly, it does do that in fastai, but I couldn't find that in the code. However, to answer your question, data augmentation not only helps in getting more data but also helps the model generalize better by preventing it from memorizing the training data (also known as overfitting). This means that your model will perform better on unseen or future data that may not be as perfectly curated as your training data.

A good example comes from image recognition. You can curate the training data easily, but the ordinary images you find online, or photos that people take, may not be centered and may be tilted at an angle depending on how the camera was held. Augmentation helps the model predict these types of real-world images much better.

I believe @Nell missed the main point.

The reason the number of samples stays the same is that the augmentation is done live during training. Imagine a black box that randomly augments a single image according to the parameters you specified. This box is added to your pipeline, so every image is randomly transformed each time it is loaded, instead of adding the image to your dataset in 20 different augmented versions.
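Here's a minimal sketch of the idea using plain torchvision (the path and parameter values are just placeholders, not the exact fastai internals):

```python
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

# Random transforms live in the loading pipeline; they don't change the dataset size.
train_tfms = T.Compose([
    T.RandomRotation(degrees=10),    # a fresh random tilt each time the image is read
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

# len(train_ds) still equals the number of files on disk;
# each epoch simply sees a newly randomized version of every image.
train_ds = ImageFolder("data/train", transform=train_tfms)
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True)
```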


Thank you for this explanation. I thought that might be the reason - it just seemed weird to me that it would still report the raw number of samples and not the total including augmentation (for example, if you were to try to compute the training time per sample, it would make a big difference whether augmentation was happening behind the scenes or not).

Thank you,
Patrick

It actually generates only augmented samples; the original image will never be part of the training again (unless the randomization happens to pick values very close to zero for the augmentation). So each image appears exactly once per epoch, with some augmentation applied according to your settings, which is why the number of samples stays the same.


Well true, unless one uses torchvision.transforms.RandomApply.
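For reference, a small sketch of that (the probability value here is just an example):

```python
import torchvision.transforms as T

# RandomApply applies the wrapped transform(s) with probability p,
# so with probability 1 - p the image passes through completely unchanged.
maybe_rotate = T.Compose([
    T.RandomApply([T.RandomRotation(degrees=10)], p=0.5),
    T.ToTensor(),
])
```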