Best practice for preprocessing and transformation graph

Hi,
I have a dataset with image filenames and labels.
For each image, I want to:

  1. Load the image itself
  2. Load the image metadata (EXIF etc)
  3. Run a preprocessing algorithm on the image to extract features (SIFT, segmentation, mask, etc)
  4. Combine results from (3) with the source image, for example applying the mask to it
  5. Combine results from (2) and (3), for example colorizing only B&W images
  6. Feed outputs from the various steps into a single algorithm (i.e. the input to the network is a batch with a set of images, a set of metadata, and a set of masks)

What is the best practice way to do this in fastai v2?

Is it to take (Filename)->(Filename,FilenameMetadata)->(PILImage,FilenameMetadata)->(PILImage,FilenameMetadata,ImageAlgoResult)->(…),
i.e. an ever-growing, enriched tuple?
If I use this method, transforms don’t work, since the item type becomes a tuple instead of something they recognize, like an image.

Or
(Filename)->(PILImage)
(Filename)->(FilenameMetadata)
(Filename)->(PILImage)->(ImageAlgoResult)
And then somehow combine them.
If I use this method, I’m not sure how to reuse the image loading (so each image is loaded only once) or how to combine the results (PILImage+ImageAlgoResult=ImageWithAlgoApplied).

Or another concept I’m missing?

Thanks!

You should use a DataSource to combine your two pipelines
(Filename)->(PILImage)
(Filename)->(FilenameMetadata)
and output tuples.

Then, when you define your dataloader, you should pass in after_item all the transforms you want applied to your images, plus a custom transform to regroup PILImage+ImageAlgoResults=ImageWithAlgoApplied. This transform should have force_as_item=True so it’s not applied to each part of the tuple separately but receives the whole tuple; that way you can take the tuple and return what you need. It also needs an order greater than your image transforms’ to be sure it runs last.
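Roughly, as an untested sketch (written against the v2 dev/pre-release API this thread uses, so DataSource, databunch and force_as_item may be renamed later; get_metadata and the combination step are placeholders for your own code):

```python
from fastai2.vision.all import *

def get_metadata(fn):
    "Placeholder: read EXIF or whatever metadata you need for one file"
    return {"fname": str(fn)}

class CombineAlgoResult(Transform):
    "Receives the whole (image, metadata) tuple and returns what the model needs"
    order = 100              # higher than the image transforms, so it runs last
    force_as_item = True     # applied to the tuple as a whole, not element-wise
    def encodes(self, x):
        img, meta = x
        # placeholder: combine image and metadata/algo result however you need
        return img, meta

path  = untar_data(URLs.PETS)/'images'
items = get_image_files(path)
dsrc  = DataSource(items, tfms=[[PILImage.create],   # (Filename) -> (PILImage)
                                [get_metadata]])     # (Filename) -> (FilenameMetadata)
dbunch = dsrc.databunch(bs=8,
                        after_item=[Resize(224), ToTensor, CombineAlgoResult()],
                        after_batch=[IntToFloatTensor])
```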


Thanks! That cleared up some important points.

However, if I understand correctly, this all applies to transformations that work on a single item at a time.

What should I do with transformations that need to work on a batch?
The ImageAlgoResults come from running a NN on all the images, and it’s more efficient to run it in batches.

My current approach is to run the preprocessing offline and save the results in files, so that ImageAlgoResults just maps the original image filename to the algorithm result filename. But I’m wondering how to do it online.

You can put that in after_batch in the TfmdDataLoader.
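For example, continuing (and simplifying) the sketch above, a batch-level transform can wrap a network and go in after_batch; seg_model here is a hypothetical pretrained segmentation model:

```python
import torch

class SegmentBatch(Transform):
    "Runs a (hypothetical) segmentation model on the whole collated batch"
    order = 50                               # after IntToFloatTensor
    def __init__(self, model): self.model = model
    def encodes(self, x: TensorImage):
        with torch.no_grad():
            return self.model(x)             # e.g. masks for all images at once

dbunch = dsrc.databunch(bs=8,
                        after_item=[Resize(224), ToTensor],
                        after_batch=[IntToFloatTensor, SegmentBatch(seg_model)])
```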

This means that my batch transformations must run at the end, after all single-item transformations? There is no going back and forth between batch and per-item?

Right - the PyTorch DataLoader collates things into a batch. So after that happens, you are working with batches. However, generally PyTorch ops that work on individual items tend to work on batches too.
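For instance, a normalization written with broadcasting works unchanged on a single (3, H, W) image tensor or on an (N, 3, H, W) batch:

```python
import torch

mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std  = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

img   = torch.rand(3, 224, 224)        # single item
batch = torch.rand(16, 3, 224, 224)    # collated batch

normed_img   = (img   - mean) / std    # broadcasts over (3, H, W)
normed_batch = (batch - mean) / std    # broadcasts over (N, 3, H, W) just as well
```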

So if I have this pipeline sequence:

  1. Single item - load image from file
  2. Batch - run a segmentation algorithm to find the object
  3. Single item - crop to the segmented object from (2)
  4. Single item - resize (3) so all images are the same size
  5. Batch - run another PyTorch algorithm on all images
  6. Batch - apply augmentations
  7. Single item - run OpenCV feature extraction, and so on
  8. Provide the results from the above as a batch for classification NN training

Then what would be the recommended way to do it?


Treat it as a preprocessing step.

By preprocessing you mean go over the data offline and save the results, right? (That’s what I’m doing.)

Is there anything in fastai now, or planned, for managing a preprocessing pipeline, or is that totally external?

Yes, offline. I don’t know of any other way to do what you describe in any framework, or even what that might look like.

You can certainly use the pipeline functionality we have to help with preprocessing, and you can also use the parallel function to make things faster. See the text notebooks for examples.
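For instance, a rough sketch of the offline route (my_expensive_algo stands in for whatever preprocessing you run, the .algo.npy naming is made up, and parallel is assumed to be the helper from fastcore):

```python
from pathlib import Path
import numpy as np
from PIL import Image
from fastcore.parallel import parallel

def run_algo(fn):
    "Load one image, run the expensive step, save the result next to the file"
    img = Image.open(fn)
    result = my_expensive_algo(img)                    # placeholder for your algorithm
    np.save(Path(fn).with_suffix('.algo.npy'), result)

parallel(run_algo, items, n_workers=8)                 # items = your list of filenames
```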