New coordinate transforms pipeline

It is the one indeed. Are you on an instance as well? I’m guessing other factors (like the speed of the hard drive) might also affect the results.

Ok, that could be a bummer :frowning: I use my local machine with a Samsung SSD, which should be around 500 MB/s read/write.
So you are saying that with, say, a Samsung NVMe (around 2 GB/s read/write), it should be roughly 2x faster? Seems plausible, if they have really managed to remove this bottleneck.

What version of pytorch? I am using 0.4. Should I pull from master to get the latest, faster version?

Results were with pytorch 0.4.0.

One thing I have been thinking about for quite a while is giving people the ability to train easily on ImageNet, like in the fast.ai dawnbench submission. It seems that with the new improvements even more minutes could be shaved off the training time.

The dawnbench repo is not that easy to follow, so I was thinking of doing the legwork and figuring out how things should be run. At the end, people could put their model in, say, a model.py file, run the training pipeline, and voilà - out the other end would come a model trained with the parameters used for the dawnbench submission.

This capability could be quite useful for experimenting with new architectures or just for pretraining models - I think it sort of exists right now but is out of reach for mere mortals, not because of the price point (~$25) but because of how unwieldy the code is to run.

Anyhow - just wanted to say that I am really excited to see the new developments :slight_smile: I didn’t have time to work on the ImageNet idea (nor the resources - a single GPU might be too little for this), but maybe some of this new work could be applicable and it would not require too much effort… Not asking anyone to put any work into it, but on the off chance that a slightly more cleaned-up ImageNet training setup could be produced as a by-product of the development effort… I think this would be really useful and really neat :slight_smile:

Anyhow - it’s so amazing that this rewrite is happening :heart: Hoping to get involved if I can manage it time-wise, and for now I will continue to root for you from the sidelines!!!


That’s being worked on at the moment! :slight_smile: Andrew Shaw has it down from 3 hours to 2.5 hours already, and the new transforms pipeline should make it even faster, plus some other tricks we’ve got coming soon…


Thank you, team, for letting us witness this development at fastai_dev. Also, if you could put up some instructions around the machine setup - things like using pillow_simd instead of pillow, or installing nccl for multi-GPU - or provide a setup script, one of us could help prepare a Dockerfile so that the environment where we try to replicate the results is identical to yours. Thank you for this once again.

This sounds like a great approach. The 3x3 affine transform multiplications seem like a neat trick. Any reason for choosing 3x3… is it to cover the points (-1,1),(0,0),(1,1) ?

Here’s my attempt at repeating this in my own words, let me know if there are gaps. I did have to pay attention to when the pixel values are modified, and when the positions are.
Augmentations can modify:
- pixels: contrast, brightness, etc. These are done in place and are relatively easy to implement.
- positions (coordinates): resize, rescale, zoom, etc.

We start with an original ‘image sheet’, which is comprised of coordinate pairs (i,j), each of which points to a pixel value (p).
Step 1 is modifying coordinates, i.e. producing new (i,j). This is like resizing the image sheet.
Step 2 is again modifying coordinates, by applying a single affine transform (derived from multiplying all the affine transforms together), and then any non-affine transforms.
At this stage we have a new ‘image sheet’, which may be larger, slightly bent, or rotated, depending on the transforms applied.
Step 2.5 is cropping the new coordinates which fall outside the original image window size, i.e. cutting the new image sheet to fit into the coordinate space of the required output frame for the image.
Step 3: The new image sheet doesn’t cover the original frame exactly, and some parts of the original sheet are uncovered. We use a weighted average of the pixel values (bilinear interpolation) to fill the original image sheet with new pixel values. This step is the most computationally intensive of the lot.

We combine the work of forming the new image sheet (relatively fast thanks to the 3x3 affine matrix multiplications), and we guess (interpolate) how the new values fit onto the original frame only once, thereby saving computation. Now this feature needs to be tested more.
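To make that concrete, here is a rough PyTorch sketch of the flow (purely illustrative, not the fastai code - the helper names are made up): compose all the affine transforms with one 3x3 matrix product, build the sampling grid once, and interpolate only once at the end.

```python
# Rough sketch only: compose every affine transform into one 3x3 matrix,
# build the sampling grid once, and interpolate pixel values once.
import math
import torch
import torch.nn.functional as F

def rotation(theta):
    # 3x3 affine matrix for a rotation by theta radians
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])

def zoom(scale):
    # 3x3 affine matrix for a zoom by `scale` (grid coords scale by 1/scale)
    return torch.tensor([[1 / scale, 0., 0.], [0., 1 / scale, 0.], [0., 0., 1.]])

img = torch.rand(1, 3, 224, 224)           # (batch, channels, height, width)
m = rotation(0.1) @ zoom(1.2)              # step 2: one matrix product for all affine tfms
grid = F.affine_grid(m[:2].unsqueeze(0), img.shape)   # new coordinate 'sheet'
out = F.grid_sample(img, grid)             # step 3: single bilinear interpolation
```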

That’s pretty much it for the plan. I would just say that step 1 creates the original map of coordinates rather than modifying it, and that we choose the size of the resize there (instead of the size of the original picture).

For the affine matrices, they are 3 by 3 because an affine operation of the plane is something like Ax + b, and representing it with the matrix that has (A b) on the first two rows and (0 0 1) on the last row makes the composition of two affine transforms a regular matrix product (it’s just a trick to do this fast).
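Written out, if the two transforms are $x \mapsto A_1 x + b_1$ and $x \mapsto A_2 x + b_2$, then

$$\begin{pmatrix} A_2 & b_2 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} A_1 & b_1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} A_2 A_1 & A_2 b_1 + b_2 \\ 0 & 1 \end{pmatrix},$$

which is exactly the 3x3 matrix of the composed map $x \mapsto A_2(A_1 x + b_1) + b_2$, so any number of affine transforms collapses into a single matrix before the grid is ever built.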


This sounds very exciting! I was wondering if you had considered doing the expensive image manipulation steps on the GPU instead of the CPU. Rotations and interpolation would basically come for free if you leverage the texture mapping of the GPU. Plus you might even be able to use other types of distortions on the image (warping etc.). The downside would be the time to transfer images from CPU to GPU and back, which might take longer than just doing it on the CPU - unless we could pass the transformed images right from the GPU to the next processing steps.

We have thought about it, and one of the big pluses of the code we are writing is that it can run either on the CPU or the GPU (since it operates on torch tensors), and even batch-wise (as long as you give it a batch of matrices for the affine transforms, for instance).
For now we don’t saturate the CPUs, so there’s no real need to move this to the GPU. The option will probably be there in the fastai_v1 library, and this is definitely something we will experiment with.
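As a purely illustrative sketch (not the fastai_v1 API), this is the kind of batch-wise, device-agnostic processing that plain torch tensors make possible:

```python
# Illustrative only: the same two functional calls work on a whole batch at once
# and on whichever device the tensors live on.
import torch
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'
imgs = torch.rand(8, 3, 224, 224, device=device)   # a batch of images

# one 2x3 affine matrix per image in the batch (here: small random rotations)
angles = torch.empty(8).uniform_(-0.1, 0.1)
cos, sin = angles.cos(), angles.sin()
mats = torch.zeros(8, 2, 3)
mats[:, 0, 0], mats[:, 0, 1] = cos, -sin
mats[:, 1, 0], mats[:, 1, 1] = sin, cos

grid = F.affine_grid(mats.to(device), imgs.shape)   # batch of sampling grids
out = F.grid_sample(imgs, grid)                     # runs on CPU or GPU alike
```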


Noted, thanks! I’ll keep track of updates here.

@sgugger I saw that you mentioned one of the effects of the new transforms pipeline was that

adding a new transformation almost doesn’t hurt performance

Would that make it easier to apply something like the ImageNet transforms from AutoAugment (https://arxiv.org/abs/1805.09501), since the computational penalty of using 25+ transforms is drastically reduced?


We plan to try that, yes.


Is pytorch faster than OpenCV for affine transformations?

Will elastic distortions also be done directly on the torch tensors (if so, how will the grid flow be generated in a device-agnostic way)? Will piecewise affine transforms of an image also be considered? Thanks for all the goodies so far and the goodies currently baking…

We’re running tests to see if our implementation is faster or not on a wide range of tasks. Torchvision is slightly slower than opencv, in the few tests I did.

We’ve not implemented that yet, but we’ll be looking at it. All the functions we use work on the CPU or the GPU (mainly affine_grid and grid_sampler) so even if it ends up being on one device in fastai_v1 (for now we’re mainly looking at the CPU), it’ll be easy to adapt it to another.

It’s a little early to say anything definitive about speed, since Soumith has been kind enough to prioritise optimizing stuff that we need for performance in fastai - so for instance they just added a PR that optimizes grid_sample by 10x, and there’s more to come. :slight_smile:


Yes, that’s the plan. The grid flow generation will be done in a similar way to the current affine matrix generation (that is, it’ll be put on the same device as the image when it’s used).
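For a sense of what a grid-flow based elastic distortion could look like on plain tensors (just a sketch, not the fastai_v1 implementation), something along these lines works on either device because the flow is created wherever the image lives:

```python
# Sketch only: an elastic-style distortion done directly on tensors. The random
# flow is created on the same device as the image, so the code is device-agnostic.
import torch
import torch.nn.functional as F

def elastic(img, magnitude=0.05):
    n, c, h, w = img.shape
    # identity affine -> base sampling grid of shape (n, h, w, 2)
    identity = torch.eye(2, 3, device=img.device).repeat(n, 1, 1)
    grid = F.affine_grid(identity, img.shape)
    # random displacement field, smoothed with average pooling so the warp is gentle
    flow = (torch.rand(n, 2, h, w, device=img.device) * 2 - 1) * magnitude
    flow = F.avg_pool2d(flow, kernel_size=9, stride=1, padding=4)
    return F.grid_sample(img, grid + flow.permute(0, 2, 3, 1))

out = elastic(torch.rand(2, 3, 224, 224))
```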