Using transforms as preprocessor in ImageItemList API

Is there a preprocessing step in the new ImageItemList API that I could use to apply transforms ahead of time on the whole dataset before training? How do I try it out? @jeremy @sgugger

By ahead of time, I mean any execution model (all at once, or a few batches ahead), whichever doesn't leave the GPU starving when it's ready to train.

Here is what the code looks like right now…

bs, sz = 64, 160

tfms = get_transforms(do_flip=False, flip_vert=False,
                      max_rotate=0., max_zoom=1.25,
                      max_lighting=0., max_warp=0.,
                      p_affine=0., p_lighting=0.)

src = (ImageItemList.from_folder(path)
       .split_by_folder(train='train', valid='valid'))

data = (src.transform(tfms, size=sz)
        .databunch(bs=bs).normalize())

I pass just the zoom transform in tfms; if my understanding is right, transforms run lazily, only when a batch is about to be fed to the learner during training.
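The eager alternative I have in mind would look something like this minimal sketch. Everything here is hypothetical (load_item, apply_tfms, and preprocess_all are stand-in names, not fastai API): apply the transform once up front and cache the results, so the training loop only reads precomputed items.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins (not fastai API): a real version would open the
# image file and apply the actual fastai transforms (zoom, resize, ...).
def load_item(path):
    return {"path": path, "pixels": [1.0, 1.0, 1.0, 1.0]}

def apply_tfms(item):
    # Pretend "zoom": scale every pixel value by 1.25.
    item["pixels"] = [p * 1.25 for p in item["pixels"]]
    return item

def preprocess_all(paths, workers=4):
    """Eagerly apply transforms to the whole dataset before training starts.

    Threads are used here for simplicity; CPU-bound transforms would
    benefit more from a process pool.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(apply_tfms, map(load_item, paths)))

cache = preprocess_all([f"img_{i}.png" for i in range(8)])
# The training loop would then read precomputed items from `cache`.
```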


ImageDataBunch objects take a very long time to return each batch after applying transformations on the CPU (seemingly), while my GPU sits waiting after crunching the previous batch…

  • I have tried resizing my original data to smaller sizes using the resize script from the imagenet-fast repo (for gradual sz-increase steps), but that doesn't seem to have much effect… the CPU is still a bottleneck for each batch at runtime.
  • I have also tried removing transforms altogether, but that starts throwing dimension errors. I also don't think it's a good idea, since some transforms (like zoom) would actually help my model.
  • Higher batch sizes (24, 32, 64) give better results for me than smaller values like 4 or 8, so I want to keep using those.

I think applying transforms ahead of time instead of executing them lazily would help…

A few things to take into account are:

  • What is your hard drive? If your data isn't on an SSD or something equivalent, you will always have a data-loading bottleneck.
  • Do you have libjpeg-turbo and pillow-simd installed? They will speed up opening the images.

The pure transform part is really fast now, and we often found that loading the images was what caused a bottleneck. If you resize all your images to a perfect square (the same size for all), then try to load without transforms, you'll see whether your bottleneck comes from loading the images or from the transforms. In the first case there isn't much to be done, since RAM isn't infinite.

For at least some datasets, that’s not true, as David Page showed in his recent amazing DAWNBench results:

He pointed out that especially when you don't have many epochs to run, it's particularly important to minimize the number of workers you use. So it might well still be useful to have some PreProcessor that handles jpeg loading and normalization, and saves (e.g.) to a bcolz array.
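That PreProcessor idea could be sketched roughly like this. Everything below is hypothetical (the function names are mine, and pickle stands in for a bcolz array store): decode and normalize once, persist to disk, and let training skip that CPU work entirely.

```python
import pickle

def decode_and_normalize(path, mean=0.5, std=0.25):
    # Stand-in for jpeg decoding; a real version would return the
    # decoded pixel data for the image at `path`.
    pixels = [0.25, 0.5, 0.75]
    return [(p - mean) / std for p in pixels]

def preprocess_to_disk(paths, out_file):
    """Decode + normalize once and persist, so training skips that CPU work."""
    arrays = [decode_and_normalize(p) for p in paths]
    with open(out_file, "wb") as f:
        pickle.dump(arrays, f)

def load_preprocessed(out_file):
    """Training-time loading: just deserialize the precomputed arrays."""
    with open(out_file, "rb") as f:
        return pickle.load(f)
```

A real version would use a chunked, compressed on-disk format (bcolz, as mentioned above) rather than one pickle, so batches can be read without loading the whole dataset.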


It's on an SSD. I have installed pillow-simd, but not libjpeg-turbo. My data is stored as PNGs right now; I'm not sure if that's impacting training time. Is just installing libjpeg-turbo on Ubuntu enough to take advantage of the package? I'm guessing I should be using JPEGs as well to get a speed boost here?

I converted all my images to grayscale, put the whole dataset on a mounted ramdisk, and then ran training for a few epochs across different batch sizes to test whether this is the case…
I didn't see any noticeable difference in training time between runs on the ramdisk and runs on my SSD… If 5 epochs of a one-cycle run with bs 32 take 17 minutes on the SSD, they take roughly 17 minutes on the ramdisk as well… So I don't think opening images is the issue here.
Also, while my GPU sits idle, I see the CPU running at full load and not much disk read activity in iotop, so I do think the transforms are the culprit here.

I wasn't sure how to convert them all to squares without losing some data… I have read in other threads that we can use rectangular images in the new version of fastai… but I can try that as well to see if it improves the situation. Any pointers on how to do that?
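One way to square images without losing data is padding (letterboxing) rather than cropping. A minimal sketch that just computes the padding amounts (square_padding is my own helper name, not a fastai function; you'd apply the result with your image library's pad/expand call):

```python
def square_padding(width, height):
    """Return (left, top, right, bottom) padding that makes an image square.

    Letterboxing keeps every original pixel instead of cropping any away;
    the shorter dimension is padded evenly on both sides.
    """
    side = max(width, height)
    pad_w, pad_h = side - width, side - height
    left, top = pad_w // 2, pad_h // 2
    return (left, top, pad_w - left, pad_h - top)

# A 160x100 image needs 30px of padding above and below to become 160x160.
pads = square_padding(160, 100)
```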

I would like to try this out. Is this already part of the library now?

data = (src.transform(tfms, size=sz)
        .databunch(bs=bs, num_workers=0).normalize())

On workers: I have tried reducing the number of workers without any preprocessing, but that only slows training down drastically. The defaults.cpus value works best for me when there's no preprocessing involved in the data block API…