Lesson 1: Do you have a definitive list of tips for image cleaning/preprocessing?

I’m trying to create my own image dataset to train a classifier as in Lesson 1.

The problem is that from the way I collected the images, they may be tricky to process:

  • they come in a variety of formats (mainly .tif/.tiff and .jpg, but also .eps)
  • their height x width can be large: running verify_images (https://docs.fast.ai/vision.data.html#verify_images) on my folders, I get a bunch of messages like `Image size (89992333 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack`.
  • some of them have 4 channels, some 1 (also found with verify_images).

What steps would you suggest for dealing with the issues above? For now, for simplicity, I am just disregarding those images (or rather, verify_images did it for me by automatically deleting them), but they are a very important fraction of my dataset.
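For reference, here is roughly the preprocessing pass I have pieced together so far with Pillow (a sketch, not a settled recipe: the folder layout, `max_side` value, and JPEG quality are my own choices, and opening .eps files additionally requires Ghostscript to be installed):

```python
from pathlib import Path
from PIL import Image

# Lift PIL's decompression-bomb limit (default ~89.5M pixels).
# Only safe for data you trust; None disables the check entirely.
Image.MAX_IMAGE_PIXELS = None

def normalize_images(src: Path, dst: Path, max_side: int = 1024) -> None:
    """Open every PIL-readable file in src, force 3 channels,
    cap the longest side, and save as JPEG into dst."""
    dst.mkdir(parents=True, exist_ok=True)
    for p in src.iterdir():
        try:
            with Image.open(p) as im:
                im = im.convert("RGB")              # 1- or 4-channel -> 3-channel
                im.thumbnail((max_side, max_side))  # shrink huge images, keep aspect
                im.save(dst / (p.stem + ".jpg"), "JPEG", quality=90)
        except OSError as e:
            print(f"skipping {p.name}: {e}")        # report instead of deleting
```

The idea is to do the destructive steps (format conversion, channel squashing, downsizing) into a separate output folder, so the originals survive, unlike verify_images' in-place deletion.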

Incidentally, I also think it would be great if verify_images produced a comprehensive report on “image health”: how many images are usable, a 2D scatter plot of height x width, a histogram of channel counts, etc. Also, the default behavior of deleting images that can’t be converted automatically seems a tad too aggressive to me (it shouldn’t be the default).
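The kind of report I have in mind could look something like this (a hypothetical helper, not part of fastai; `image_health_report` and its return keys are names I made up):

```python
from collections import Counter
from pathlib import Path
from PIL import Image

def image_health_report(folder: Path) -> dict:
    """Summarize a folder of images: how many open cleanly,
    their (width, height) sizes, and a histogram of channel counts."""
    sizes, channels, broken = [], Counter(), []
    for p in sorted(folder.iterdir()):
        try:
            with Image.open(p) as im:
                sizes.append(im.size)                 # (width, height) pairs
                channels[len(im.getbands())] += 1     # e.g. 1 for L, 3 for RGB, 4 for RGBA
        except OSError:
            broken.append(p.name)                     # unreadable, but NOT deleted
    return {"usable": len(sizes), "broken": broken,
            "sizes": sizes, "channels": dict(channels)}
```

From the `sizes` and `channels` entries it would be easy to draw the scatter plot and histogram mentioned above, and nothing gets deleted along the way.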


It is not directly related to your question, but how did you manage to load your images into a notebook if you are using the cloud (like GCP)?