Best practice for preprocessing and transformation graph

I have a dataset with image filenames and labels.
For each image, I want to:

  1. Load the image itself
  2. Load the image metadata (EXIF etc)
  3. Run a preprocessing algorithm on the image to extract features (SIFT, segmentation, mask, etc)
  4. Combine results from (3), for example applying the mask to the source image
  5. Combine results from (2) and (3), for example colorize only B&W images
  6. Feed outputs from various parts to a single algorithm (e.g. the network input is a batch containing a set of images, a set of metadata, and a set of masks)

What is the best practice way to do this in fastai v2?

Is it to take (Filename) -> (Filename, FilenameMetadata) -> (PILImage, FilenameMetadata) -> (PILImage, FilenameMetadata, ImageAlgoResult) -> (…)?
I.e. an ever more enriched tuple.
If I use this method, then transforms don't work, since the item type becomes a tuple instead of something they are familiar with, like an image.

Or is it to build separate pipelines and then somehow combine them?
If I use this method, then I'm not sure how to reuse the image loading (once per image) or how to combine results (PILImage + ImageAlgoResult = ImageWithAlgoApplied).

Or another concept I’m missing?


You should use a DataSource to combine your two pipelines
and output tuples.

Then, when you define your dataloader, pass in after_item all the transforms you want applied to your images, plus a custom transform to regroup PILImage + ImageAlgoResult = ImageWithAlgoApplied. This transform should have force_as_item=True so it is not applied to each part of the tuple but instead receives the whole tuple; that way you can take the tuple and return what you need. It also needs an order greater than your image transforms' so that it is sure to run last.
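In released fastai v2, the whole-tuple behavior described above is what `ItemTransform` provides (force_as_item appears to be the name used during development), and sequencing is controlled by a transform's `order` attribute. Here is a minimal framework-agnostic sketch of the mechanism — toy classes, not the fastai API:

```python
# Toy illustration (not fastai code): elementwise transforms are applied to
# each part of the tuple, while a "whole tuple" transform receives the tuple
# intact. Transforms run in ascending `order`.

class Transform:
    order = 0
    whole_tuple = False          # analogous to force_as_item / ItemTransform
    def __call__(self, x): return x

class ScaleImage(Transform):
    "Elementwise: applied to every part of the tuple separately."
    order = 10
    def __call__(self, x): return x * 2

class ApplyAlgo(Transform):
    "Whole-tuple: combines PILImage + ImageAlgoResult into one output."
    order = 100                  # greater than the image transforms, so it runs last
    whole_tuple = True
    def __call__(self, tup):
        img, algo_result = tup
        return img + algo_result # stand-in for "apply mask to image"

def run_pipeline(item, tfms):
    for t in sorted(tfms, key=lambda t: t.order):
        item = t(item) if t.whole_tuple else tuple(t(p) for p in item)
    return item

result = run_pipeline((3, 4), [ApplyAlgo(), ScaleImage()])
# ScaleImage runs first on each part: (6, 8); ApplyAlgo then combines them: 14
```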


Thanks! That cleared up some important points.

However, if I understand correctly, this all applies to transformations that work on a single item at a time.

What should I do with transformations that need to work on a batch?
ImageAlgoResults are the result of running a NN on all the images, and it's more efficient to run it in batches.

My current approach is to run the preprocessing offline and save the results to files; ImageAlgoResults then just maps the original image filename to the algorithm's result filename. But I'm wondering how to do it online.

You can put that in after_batch in the TfmdDataLoader.
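To make the item/batch split concrete, here is a toy sketch of where the two hooks fire in a dataloader. The parameter names mirror fastai's after_item/after_batch, but this is an illustration, not the TfmdDL implementation:

```python
# Toy dataloader sketch: per-item transforms run before collation,
# batch transforms run on the collated batch afterwards.

def collate(items):
    "Stack single items into one batch (here: just a list)."
    return list(items)

def toy_dataloader(items, after_item=None, after_batch=None, bs=2):
    batch = []
    for it in items:
        if after_item: it = after_item(it)      # per-item transforms
        batch.append(it)
        if len(batch) == bs:
            b = collate(batch)
            if after_batch: b = after_batch(b)  # batch-level step, e.g. a NN
            yield b
            batch = []

batches = list(toy_dataloader([1, 2, 3, 4],
                              after_item=lambda x: x * 10,
                              after_batch=lambda b: [x + 1 for x in b]))
# batches == [[11, 21], [31, 41]]
```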

Does this mean that my batch transformations must run at the end, after all the single-item transformations? Is there no going back and forth between batch and per-item?

Right - the PyTorch DataLoader collates things into a batch. So after that happens, you are working with batches. However, generally PyTorch ops that work on individual items tend to work on batches too.
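For instance, many tensor ops are elementwise or broadcast over leading dimensions, so code written for one image often handles a batch unchanged. A toy illustration (`normalize` is a hypothetical helper, not a fastai function):

```python
# Broadcasting means the same op handles (C, H, W) and (N, C, H, W).
import torch

def normalize(x, mean=0.5, std=0.25):
    "Works unchanged on a single image or on a whole batch."
    return (x - mean) / std

item  = torch.rand(3, 8, 8)     # one image: (C, H, W)
batch = torch.rand(4, 3, 8, 8)  # a batch:   (N, C, H, W)

assert normalize(item).shape  == item.shape
assert normalize(batch).shape == batch.shape
```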

So if I have this pipeline sequence:

  1. Single item - load image from file
  2. Batch - run segmentation algorithm to find the object
  3. Single item - crop to the object segmented in (2)
  4. Single item - resize all the crops from (3) to the same size
  5. Batch - run another PyTorch algorithm on all images
  6. Batch - apply augmentations
  7. Single item - run OpenCV feature extraction, and so on
  8. Provide the results from the above as a batch to classification NN training

Then what would be the recommended way to do it?


Treat it as a preprocessing step.

By preprocessing you mean going over the data offline and saving the results, right? (That's what I'm doing.)

Is there anything now, or planned, in fastai for managing a preprocessing pipeline, or is it totally external?

Yes, offline. I don't know of any way to do what you describe otherwise, in any framework, or even what that might look like.

You can certainly use the pipeline functionality we have to help with preprocessing, and you can also use the parallel function to make things faster. See the text notebooks for examples.
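fastcore's parallel is essentially a convenience for mapping a function over a collection with several workers; the same idea can be sketched with only the standard library (extract_features is a stand-in for your real per-file step, not a fastai function):

```python
# Stdlib sketch of parallel preprocessing: map a per-file function over all
# filenames with a pool of workers (fastcore's `parallel` is similar sugar).
from concurrent.futures import ThreadPoolExecutor

def extract_features(fn):
    "Stand-in for a per-file preprocessing step (e.g. OpenCV features)."
    return fn.upper()

def preprocess_parallel(filenames, n_workers=2):
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        return list(ex.map(extract_features, filenames))
```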