Manually enhanced training vs data augmentation

Dear all,

I would like to verify whether my understanding is correct with those who have been in this field much longer.

What I have observed is that the model's validation accuracy is ~95% when the training set is manually augmented (expanded ahead of training), versus 50-52% when data augmentation is applied on the fly, with the same augmentation schemes used in both scenarios.

TL;DR:
Let's consider the following:

(1) Training set size: X (associated labels y are available). (X ~ 1000)
(2) Two data augmentation functions are defined, say A1 and A2.

Scenario 1:
Use A1 and A2 to create 5 variants of each training sample, resulting in a total training set size of (5+1)X = 6X.

Validation accuracy: 96% after 150 epochs

Scenario 2:
Deploy A1 and A2 to operate on the fly during training, with training set size X.

Validation accuracy: 52% after 150 epochs

Question: Is this always to be expected? Functionally, the two scenarios seem similar (in fact, scenario 2 will present many more variants to the model overall, since the number of epochs is high). Why does the training not take off in scenario 2?
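The per-epoch sample counts in the two scenarios can be sketched in plain Python. This is a toy sketch, not the poster's actual pipeline; `augment` is a hypothetical stand-in for A1/A2:

```python
import random

def augment(x, rng):
    # Hypothetical stand-in for A1/A2: jitter the sample slightly
    return x + rng.uniform(-0.1, 0.1)

X = list(range(1000))  # toy training set, X ~ 1000 as above
rng = random.Random(0)

# Scenario 1: offline expansion -- the original plus 5 stored variants,
# so every epoch iterates over (5+1)X = 6X samples
offline = X + [augment(x, rng) for _ in range(5) for x in X]

# Scenario 2: on the fly -- each epoch draws one fresh variant per sample,
# so an epoch iterates over only X samples
online_epoch = [augment(x, rng) for x in X]

print(len(offline), len(online_epoch))  # 6000 vs 1000
```

After 150 epochs, scenario 1 has performed roughly 6x as many gradient updates as scenario 2, which alone can explain very different accuracies at the same epoch count.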

Thanks and regards,
~anoop

Anoop Kulkarni, PhD
Innotomy Consulting


If I had to guess: your per-epoch sizes aren't the same. For an accurate comparison, scenario 2 should also present 6X samples per epoch, since on-the-fly augmentation yields just one variant of each sample per epoch, whereas scenario 1 gives the model six. If my understanding of what you are doing here is correct :slight_smile:
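One way to make the comparison apples-to-apples is to have the on-the-fly pipeline emit the original plus five fresh variants of every sample each epoch, matching scenario 1's 6X per-epoch size. A minimal sketch in plain Python, with `augment` again a hypothetical stand-in for A1/A2:

```python
import random

def augment(x, rng):
    # Hypothetical stand-in for A1/A2: jitter the sample slightly
    return x + rng.uniform(-0.1, 0.1)

def epoch_samples(X, variants_per_sample, rng):
    """Yield each original sample plus (variants_per_sample - 1) fresh
    augmentations of it, so one epoch covers variants_per_sample * len(X)."""
    for x in X:
        yield x
        for _ in range(variants_per_sample - 1):
            yield augment(x, rng)

X = list(range(1000))
rng = random.Random(0)
epoch = list(epoch_samples(X, 6, rng))
print(len(epoch))  # 6000, same per-epoch size as scenario 1
```

Unlike the offline expansion, the augmented variants here are redrawn every epoch, so over 150 epochs the model sees far more distinct variants at the same per-epoch cost.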

Yes, you are right, that is my scenario. I just wanted to reconfirm whether I would still need the 6X per-epoch size even while using on-the-fly data augmentation.

best regards
~anoop