Fastai's batch_tfms/aug_transforms


  1. Could anyone clarify exactly how aug transforms are applied to the images in a batch in fastai?
    My understanding is that, for each item/image in a batch, one aug transform is picked at random and applied to that image, so that for a batch size of 64 we end up sending 64 augmented images to the model to train on.
    Btw, if so, does that mean the model will never (i.e. with probability zero) see the original images, only the augmented versions?

  2. More specifically, is it possible (and if so, how) to configure the augmentation process so that, for each item in a batch, each augmentation from our aug_transforms set is applied a certain (fixed) number of times? Assuming a batch size of 64, 10 augmentations per image per transform (let's call it the augmentation rate) and 4 transforms in our aug_transforms, the model would effectively see a total of 64x10x4 augmented images for each batch of 64 images during training.
    The idea, when the training set is small, is to set a rather low batch size (e.g. 2) so that the model weights are updated more often when going through one epoch, while still having a significant number of images in each batch (2x10x4 images, assuming an augmentation rate of 10 and 4 transforms), so that each batch is representative enough of the total population/distribution.

  3. Finally, what is the best way to figure out something like the above (reading the source code, the fastai documentation, tutorials?) instead of having people on the forum read through all this? :slightly_smiling_face:
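
On the side question in 1, here is a rough sketch of how one could estimate the chance that an image reaches the model unaugmented. It rests on an assumption of mine that is not confirmed anywhere in this thread: that each transform is applied independently with its own probability p. The function name `prob_unaugmented` and the example probabilities are hypothetical.

```python
from math import prod

def prob_unaugmented(apply_probs):
    """Chance that none of the transforms fires, assuming independence.

    apply_probs: per-transform application probabilities (hypothetical values).
    """
    return prod(1.0 - p for p in apply_probs)

# Four hypothetical transforms, each applied with probability 0.75
print(prob_unaugmented([0.75, 0.75, 0.75, 0.75]))  # 0.00390625

# If every transform is always applied (p = 1), the original is never seen
print(prob_unaugmented([1.0, 1.0, 1.0, 1.0]))  # 0.0
```

Under that assumption, the original image is seen with probability zero only if at least one transform is always applied.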

Thanks! :mask: :smile:


Got the answer for 1. Apparently, *apply_tfms* applies all transforms to each image in a batch (see below), so that the final augmented image is T_1(T_2(…T_n(X))), where X is the original image and T_1, …, T_n are the aug transforms.
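
To make the order of the composition concrete, here is a minimal sketch in plain Python. It is not fastai code: `compose`, `flip_lr` and `brighten` are hypothetical toy stand-ins, and the "image" is just a list of pixel rows.

```python
from functools import reduce

def compose(transforms):
    """Return a function computing T_1(T_2(...T_n(x))) for transforms [T_1, ..., T_n]."""
    def composed(x):
        # reversed(): T_n is applied first and T_1 last
        return reduce(lambda acc, t: t(acc), reversed(transforms), x)
    return composed

# Toy stand-ins for aug transforms (hypothetical, not fastai's implementations)
def flip_lr(img):
    return [row[::-1] for row in img]

def brighten(img):
    # Add 10 to every pixel, clipping at 255
    return [[min(255, px + 10) for px in row] for row in img]

image = [[0, 100, 250]]
augment = compose([flip_lr, brighten])  # computes flip_lr(brighten(image))
print(augment(image))  # [[255, 110, 10]]
```

Note that the clipping in `brighten` already discards information (250 and 245 would both map to 255), and a composed pipeline can accumulate that kind of loss.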
Although this makes some sense, this approach might not be optimal, since:

a. Applying a number of transforms (i.e. a composition) will result in a loss of information (each transform involves modifying/interpolating the pixel values), so that the model sees significantly altered images. This is particularly important when fine/granular details are key for the classification task.
b. As a way around the above issue, it is sufficient to apply a single transformation to each image.
This can be justified mathematically as follows. Augmentation is a way to teach the model that its output (e.g. the predicted class M(X) for an image X) is invariant w.r.t. a transformation T, i.e. M(T(X)) = M(X) = label_X for any X.
If the model learns that the output should be invariant with respect to each individual transformation T and S, then it will also be invariant w.r.t. their composition. Indeed, writing Y = S(X), we get M(T(S(X))) = M(T(Y)) = M(Y) = M(S(X)) = M(X), using invariance w.r.t. T at Y and then invariance w.r.t. S at X.
So, given a set of independent transformations T_1, …, T_n (independent in the sense that no transform can be represented as a composition of the other transforms), a better approach would be to apply the transforms individually, one per image, for the model to train on (e.g. as described in item 2 of my previous post).
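
The scheme from item 2 can be sketched in plain Python as follows. This is only an illustration under my own assumptions, not a fastai feature: `expand_batch` and the four toy transforms are hypothetical, and images are 1-D lists of pixel values to keep the example self-contained.

```python
import random

def expand_batch(batch, transforms, copies_per_transform):
    """Apply each transform separately to every image, copies_per_transform times.

    Returns (augmented_image, label) pairs; exactly one transform per output image.
    """
    expanded = []
    for image, label in batch:
        for transform in transforms:
            for _ in range(copies_per_transform):
                expanded.append((transform(image), label))
    return expanded

# Toy random transforms (hypothetical stand-ins for real aug transforms)
def random_shift(img):
    k = random.randrange(len(img))
    return img[k:] + img[:k]

def random_brightness(img):
    delta = random.randint(-20, 20)
    return [max(0, min(255, px + delta)) for px in img]

def random_flip(img):
    return img[::-1] if random.random() < 0.5 else list(img)

def random_noise(img):
    return [max(0, min(255, px + random.randint(-5, 5))) for px in img]

transforms = [random_shift, random_brightness, random_flip, random_noise]
batch = [([10, 20, 30, 40], "cat"), ([200, 150, 100, 50], "dog")]

expanded = expand_batch(batch, transforms, copies_per_transform=10)
print(len(expanded))  # 80 = 2 images x 4 transforms x 10 copies
```

How to wire something like this into fastai's batch_tfms is not shown here and would still need to be worked out.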

Would be great if anyone could comment on whether the above makes sense :face_with_monocle: