New coordinate transforms pipeline

Jeremy and I thought it would be a good idea to document a bit more of what we are currently doing, both for those who read the development notebooks as they come out and want to understand what is going on, and for the future documentation of the library, to explain why we made certain design choices. So tonight, I’ll talk a bit about the new transforms pipeline we decided on and how we’ll do data augmentation in fastai_v1.

What does a transform do?

Typically, a data augmentation operation will randomly modify an image input. This operation can apply to pixels (when we modify the contrast or brightness for instance) or to coordinates (when we do a rotation, a zoom or a resize). The operations that apply to pixels can easily be coded in numpy/pytorch, directly on an array/tensor, but the ones that modify the coordinates are a bit trickier.

They usually come in three steps. First, we create a grid of coordinates for our picture: this is an array of size h * w * 2 (h for height, w for width in the rest of this post) that contains, at position (i,j), two floats representing the position of the pixel (i,j) in the picture. They could simply be the integers i and j, but since most transformations are centered on the center of the picture, they are usually rescaled to go from -1 to 1: (-1,-1) is the top-left corner of the picture, (1,1) the bottom-right corner (and (0,0) the center). This can be seen as a regular grid of size h * w. Here is what our grid would look like for a 5px by 5px image.
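As a concrete illustration (a minimal numpy sketch for exposition, not the fastai_v1 code), the grid for a 5px by 5px image could be built like this:

```python
import numpy as np

# Build the normalized coordinate grid for a 5x5 image: entry (i, j)
# holds the (y, x) position of pixel (i, j), rescaled so that (-1, -1)
# is the top-left corner, (1, 1) the bottom-right and (0, 0) the center.
h, w = 5, 5
ys = np.linspace(-1, 1, h)
xs = np.linspace(-1, 1, w)
grid = np.stack(np.meshgrid(ys, xs, indexing='ij'), axis=-1)

print(grid.shape)   # (5, 5, 2)
print(grid[0, 0])   # [-1. -1.] -> top-left corner
print(grid[2, 2])   # [0. 0.]   -> center
```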


Then, we apply the transformation to modify this grid of coordinates. For instance, to apply an affine transformation (like a rotation), we map each size-2 vector x at every position in the grid to Ax + b. This gives us the new coordinates, as seen here in the case of our previous grid.
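Sketched in numpy (a hypothetical illustration, with a 30-degree rotation as the affine transform and no translation), applying Ax + b at every position of the grid looks like:

```python
import numpy as np

# Apply an affine transform x -> Ax + b to every coordinate in the grid.
# Here A is a 30-degree rotation matrix and b is zero (no translation).
theta = np.pi / 6
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
b = np.zeros(2)

h, w = 5, 5
ys, xs = np.linspace(-1, 1, h), np.linspace(-1, 1, w)
grid = np.stack(np.meshgrid(ys, xs, indexing='ij'), axis=-1)  # (h, w, 2)

# (h, w, 2) @ (2, 2) applies A to each size-2 vector; then we add b
new_grid = grid @ A.T + b
```

Note that a rotation preserves distances to the center, so the corners (at distance sqrt(2) from the center) now fall outside the [-1, 1] square.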

There are two problems that arise after the transformation: the first is that the transformed coordinates won’t fall exactly on the grid, and the second is that some values can end up outside the grid (one of the coordinates is greater than 1 or lower than -1).

Which takes us to the last step, an interpolation. If we forget the rescale for a minute and go back to coordinates being integers, the result of our transformation gives us float coordinates, and we need to decide, for each (i,j), which pixel value in the original picture we should take. The most basic interpolation, called nearest neighbor, would just round the floats and take the nearest integers. If we think in terms of the grid of coordinates (going from -1 to 1), the result of our transformation gives a point that isn’t in the grid, and we replace it by its nearest neighbor in the grid.

To be smarter, we can perform a bilinear interpolation. This takes an average of the values of the pixels corresponding to the four points in the grid surrounding the result of our transformation, with weights depending on how close we are to each of those points. This comes at a computational cost though, so this is where we have to be careful.

As for the values that fall outside the picture, we handle them with padding, either:

  • by adding zeros on the sides (so the pixels that fall outside will be black)
  • by replacing them with the value at the border
  • by mirroring the content of the picture on the other side (reflect padding).
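To make the interpolation step concrete, here is a minimal numpy sketch of roughly what pytorch’s grid_sample does in bilinear mode with zero padding (purely illustrative, not the fastai_v1 implementation):

```python
import numpy as np

def bilinear_sample(img, grid):
    """Sample img (h, w) at the float coordinates in grid (..., 2),
    with bilinear interpolation and zero padding outside [-1, 1]."""
    h, w = img.shape
    # Map normalized coordinates back to pixel indices
    ys = (grid[..., 0] + 1) * (h - 1) / 2
    xs = (grid[..., 1] + 1) * (w - 1) / 2
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = ys - y0, xs - x0  # distances to the top-left neighbor

    def get(y, x):
        # Zero padding: out-of-bounds pixels contribute 0
        inside = (y >= 0) & (y < h) & (x >= 0) & (x < w)
        return np.where(inside, img[np.clip(y, 0, h - 1), np.clip(x, 0, w - 1)], 0.0)

    # Weighted average of the four surrounding grid points
    return ((1 - wy) * (1 - wx) * get(y0, x0) + (1 - wy) * wx * get(y0, x1)
            + wy * (1 - wx) * get(y1, x0) + wy * wx * get(y1, x1))
```

Swapping the 0.0 in np.where for the clipped border value would give border padding, and mirroring the indices would give reflect padding.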

Be smart and efficient

Usually, data augmentation libraries keep the different operations separate. So for a resize, we go through the three steps above; then if we do a random rotation, we go through those steps again; then for a zoom, etc. The idea behind the new fastai library is to do all the transformations of the coordinates at the same time, so that we only go through those three steps once, especially the last one (the interpolation), which is the most computationally heavy.

The first thing is that we can regroup all affine transforms into just one (since the composition of two affine transforms is another affine transform). We’re not the first to think of this, and there are already libraries (like torchsample) that implement it. We pushed things one step further, though, to integrate the resize, the crop and any non-affine transformation of the coordinates into the process. Let’s dig in!

In step 1, when we create the grid, we use the new size we want for our image, so new_h,new_w (and not h,w). This takes care of the resize operation (usually a resize to 1.1 or 1.25 times the size we will keep at the end by cropping) at no extra cost.

In step 2, we do only one affine transformation, by multiplying beforehand all the affine matrices of the transforms we want to apply (those are 3 by 3 matrices, so it’s super fast), then we apply to the coordinates any non-affine transformation we might want (jitter, elastic distortion…) before…
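For instance (a hedged numpy sketch; the helper name and conventions are illustrative, not the library API), a rotation followed by a zoom can be fused into a single 3 by 3 matrix before ever touching the grid:

```python
import numpy as np

def affine_mat(A, b):
    """Embed a 2D affine transform (A, b) into a 3x3 matrix so that
    composing transforms becomes a plain matrix product."""
    m = np.eye(3)
    m[:2, :2] = A
    m[:2, 2] = b
    return m

theta = np.pi / 6  # a 30-degree rotation
rot = affine_mat(np.array([[np.cos(theta), -np.sin(theta)],
                           [np.sin(theta),  np.cos(theta)]]), np.zeros(2))
zoom = affine_mat(1.2 * np.eye(2), np.zeros(2))  # zoom by 1.2

# One fused matrix: applying `fused` equals applying rot, then zoom.
fused = zoom @ rot
```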

Step 2.5: we crop (either centrally or randomly) the coordinates we want to keep. Cropping is easy to do at any point, but by doing it just before the interpolation, we avoid computing pixel values that won’t be used at the end, gaining a bit more efficiency.

Then Step 3: the final interpolation. Afterward, we can apply to the picture all the transforms that operate pixel-wise (brightness, contrast… as mentioned before) and we’re done with data augmentation.
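Putting the steps together, the whole pipeline could be sketched as follows (numpy, with nearest-neighbor interpolation for brevity; the real code relies on pytorch’s affine_grid and grid_sample, and all names here are illustrative):

```python
import numpy as np

def transform(img, A, b, new_size, crop_size):
    """Resize + fused affine + crop + a single interpolation."""
    nh, nw = new_size
    # Step 1: grid at the *target* size, so the resize comes for free
    ys, xs = np.linspace(-1, 1, nh), np.linspace(-1, 1, nw)
    grid = np.stack(np.meshgrid(ys, xs, indexing='ij'), axis=-1)
    # Step 2: one fused affine transform (non-affine tweaks would go here too)
    grid = grid @ A.T + b
    # Step 2.5: center-crop the *coordinates* before interpolating
    ch, cw = crop_size
    dy, dx = (nh - ch) // 2, (nw - cw) // 2
    grid = grid[dy:dy + ch, dx:dx + cw]
    # Step 3: a single (here nearest-neighbor) interpolation
    h, w = img.shape
    iy = np.clip(np.rint((grid[..., 0] + 1) * (h - 1) / 2), 0, h - 1).astype(int)
    ix = np.clip(np.rint((grid[..., 1] + 1) * (w - 1) / 2), 0, w - 1).astype(int)
    return img[iy, ix]
```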

But does it work?

More tests are needed, and we have asked the pytorch team to optimize and add a few options to the functions we need (essentially affine_grid, which combines steps 1 and 2, and grid_sample, which does step 3, for those who want to dig into the code). But we can already see the difference: loading all the batches of the training set of dogs and cats with standard data augmentation in torchvision takes 48s on a p3 (optimized with libjpeg-turbo and pillow_simd), while our way takes 37.6s to 43.8s (depending on the padding we use; we hope that will come down with further optimization on the pytorch side).

Also, adding a new transformation barely hurts performance (since the costly steps are done only once), whereas with classic data augmentation implementations, it results in a longer training time.

Even in terms of final result, doing only one interpolation gives a better output: if we stack several transforms and interpolate after each one, we approximate the true values of our coordinates each time, which tends to blur the image a bit. By regrouping all the transformations together and interpolating only at the end, we get something nicer.

Look at how the same rotation then zoom done separately (so with two interpolations)
is blurrier than regrouping the transforms and doing just one interpolation

This is where we stand for now, and hopefully you will clearly see the three steps I mentioned above when you read the final version of the transforms code. Don’t hesitate to add any resource that might be helpful or to ask questions about points that are unclear, as this post will be reworked to go into the documentation of fastai_v1.


Nice work! I saw your tweet about cutting cifar10 training speed in half. I tried to replicate it based on the v1 github notebook, but I didn’t see improved results; actually I got slightly worse results: 22–24 min to 94% accuracy, compared to 21 min with my current setup.
What kind of changes did you apply on top of that to drive down the time to 13min? I guess you did some more clever tricks that are not shown. Or am I missing something?

The main change to speed things up to 13 minutes is to use the pytorch DataLoader. It’s super fast now. Then to get down to 9-10 minutes, I’m using mixup with a few tricks that I’ll detail on the forum in another post.


I’m a bit confused still. Notebook. I am using the exact code from ‘Cifar10-comparison-pipelines.ipynb’, in cell 13 it has:
from import DataLoader as DataLoader1

isn’t that the one?

mixup is good! It has been on my list for some time, but I didn’t have time to test it. I tried to work it into fastai but was not successful, so it’s good to hear you implemented it.
I saw you also implemented Cutout, which seems to help too. I tested various forms of Cutout, and I’m not sure yet which one is best; all seem to help by about the same amount. Probability seems to be the most important parameter in my tests.

It is the one indeed. Are you on an instance as well? I’m guessing other parameters (like the speed of the hard drive) might interfere.

ok, that could be a bummer :frowning: I use my local machine with a Samsung SSD, which should do approximately 500 MB/s read/write.
So you are saying that with, let’s say, a Samsung NVMe (about 2 GB/s read/write), it should be 2x faster? Seems probable, if they have really managed to remove this bottleneck.

What version of pytorch? I am using 0.4. Should I pull from master to get the latest speed boost?

Results were with pytorch 0.4.0.

One thing I have been thinking of for quite a while is giving people the ability to train easily on imagenet, like in the dawnbench submission. It seems that with the new improvements, maybe even more minutes could be shaved off the training time.

The dawnbench repo is not that easy to follow, so I was thinking of doing the legwork and figuring out how things should be run, so that in the end people could put their model in, say, a file, run the training pipeline and voila: out the other end would come a trained model using the parameters from the dawnbench submission.

This capability could be quite useful for experimenting with new architectures or just for pretraining models. I think right now this capability sort of exists, but it is out of reach for mere mortals, not because of the price point (~$25) but because of how unwieldy the code is to run.

Anyhow, just wanted to say that I am really excited to see the new developments :slight_smile: I didn’t have time to work on the imagenet idea (and don’t have the resources either; a single GPU might be too little for this), but maybe some of this new work could be applicable and it would not require too much effort… I’m not asking anyone to put work into it, but on the off chance that a slightly cleaned-up imagenet training setup could be produced as a by-product of the development effort, I think it would be really useful and really neat :slight_smile:

Anyhow, it’s so amazing that this rewrite is happening :heart: Hoping to get involved if I manage it time-wise, and for now I will continue to root for you from the sidelines!!!


That’s being worked on at the moment! :slight_smile: Andrew Shaw has it down from 3 hours to 2.5 hours already, and the new transforms pipeline should make it even faster, plus some other tricks we’ve got coming soon…


Thank you team for letting us witness this development at fastai_dev. Also, if you could put some instructions around the machine setup, addressing things like using pillow_simd instead of pillow or installing nccl for multi-GPU (or provide a setup script), one of us could help prepare a dockerfile so that the environment where we try to replicate the results is identical to yours. Thank you for this once again.

This sounds like a great approach. The 3x3 affine transform multiplications seem like a neat trick. Any reason for choosing 3x3… is it to cover the points (-1,1),(0,0),(1,1) ?

Here’s my attempt at restating this in my own words; let me know if there are gaps. I did have to pay attention to when the pixel values are modified and when the positions are.
Augmentations can modify:
  • pixels: contrast, brightness etc. These are in place and relatively easy to implement.
  • position (coordinates): resize, rescale, zoom etc.

We start with an original ‘image sheet’, which is composed of coordinate pairs (i,j), each of which points to a pixel value (p).
Step 1 is modifying coordinates, i.e. new_(i,j). This is like resizing the image sheet.
Step 2 is again modifying coordinates, by applying a single affine transform (derived from the multiplication of all the affine transforms), and then any non-affine transform.
At this stage we have a new ‘image sheet’, which may be larger, slightly bent or rotated depending on the transforms applied.
Step 2.5 is cropping the new coordinates which fall outside the original image window size, i.e. cutting the new image sheet to fit the coordinate space of the required output frame.
Step 3: the new image sheet doesn’t cover the original frame exactly, and some parts of the original sheet are uncovered. We use a weighted average of the pixel values (bilinear interpolation) to fill the original image sheet with new pixel values. This process is the most computationally intensive of the lot.

We combined the work of forming the new image sheet (relatively fast thanks to the 3x3 affine matrix multiplications), and we guessed (interpolated) how the new values fit onto the original frame only once, thereby saving computation. Now this feature needs to be tested more.

That’s it for the plan. I would just say that step 1 creates the original map of coordinates rather than modifying it, and that this is where we choose the size of the resize (instead of the size of the original picture).

As for the affine matrices, they are 3 by 3 because an affine operation of the plane is something like Ax + b, and representing it with the matrix that has (A b) on the first two rows and (0 0 1) on the last row turns the composition of two affine transforms into a regular matrix product (it’s just a trick to do this fast).
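A quick sanity check of that trick (a throwaway numpy snippet, not library code):

```python
import numpy as np

# Verify that embedding (A, b) as [[A, b], [0, 0, 1]] turns the composition
# of affine maps x -> A2(A1 x + b1) + b2 into a 3x3 matrix product.
def embed(A, b):
    M = np.eye(3)
    M[:2, :2] = A
    M[:2, 2] = b
    return M

rng = np.random.default_rng(0)
A1, b1 = rng.standard_normal((2, 2)), rng.standard_normal(2)
A2, b2 = rng.standard_normal((2, 2)), rng.standard_normal(2)
x = rng.standard_normal(2)

sequential = A2 @ (A1 @ x + b1) + b2
fused = (embed(A2, b2) @ embed(A1, b1) @ np.append(x, 1.0))[:2]
print(np.allclose(sequential, fused))  # True
```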


This sounds very exciting! I was wondering if you had considered doing the expensive image manipulation steps on the GPU instead of the CPU. Rotations and interpolation would basically come for free if you leverage the texture mapping of the GPU. Plus you might even be able to use other types of distortions on the image (warping etc.). The downside would be the time to transfer images from CPU to GPU and back, which might take longer than just doing it in CPU - unless we could pass the transformed images right from the GPU to the next processing steps.

We have thought about it, and one of the big pluses of the code we are writing is that it can run either on the CPU or the GPU (since it operates on torch tensors), and even batch-wise (as long as you give a batch of matrices for the affine transforms, for instance).
For now we don’t saturate the CPUs, so there’s no real need to move this to the GPU. The option will probably be there in the fastai_v1 library, and this is definitely something we will experiment with.


Noted, thanks! I’ll keep track of updates here.

@sgugger I saw that you mentioned one of the effects of the new transforms pipeline was that

adding a new transformation almost doesn’t hurt performance

Would that make it easier to apply something like the ImageNet transforms from AutoAugment, since the computational penalty of using 25+ transforms is drastically reduced?


We plan to try that, yes.


Is pytorch faster than OpenCV for affine transformations?