Making image preprocessing faster

Hi everyone, I’m trying to train a CNN on Kaggle using the Cars-196 dataset.
The code I’m using for preprocessing is:

from fastai.vision.all import *

dls = ImageDataLoaders.from_folder(
    path,
    valid_pct=0.2,
    seed=42,
    bs=32,
    # presize: resize each image to 460px per item on the CPU...
    item_tfms=[Resize(460)],
    # ...then random-augment and crop down to 224px on the batch,
    # plus ImageNet normalization
    batch_tfms=[*aug_transforms(size=224, min_scale=0.75),
                Normalize.from_stats(*imagenet_stats)]
)

Each epoch takes 5 minutes, and it’s all due to the time it takes to process each image.
One way I thought I could speed this up is to perform the preprocessing once, save the resulting dataset (after augmentation), and use that one to train.

I’m not sure if it’s a good way or if there is a better one.
And if it is a decent solution, how do I perform it? How do I get those augmented images and save them?

Unlikely it’s so slow because of the augmentation tbh. Why do you think that’s the case?

Also part of what’s good about the augmentation is that you’re automatically getting a slightly different version of each image each time it’s grabbed. You can ask for the same image 100 times in a row and they’ll all be slightly different within the confines of the parameters you’ve specified. That’s normally a good thing, I wouldn’t put effort into stopping that.
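If you want to see that in action, something like this should show it (assuming dls is the loaders object from your first post): passing unique=True to show_batch repeats the same image, so every copy you see has a different random augmentation applied.

# Show one training image several times; each copy gets its own random
# crop/flip/lighting change from aug_transforms.
dls.train.show_batch(max_n=8, nrows=2, unique=True)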

Also, you’re using an item tfm to resize them all once on the CPU (which is correct), and then resizing them again in the batch tfms?

[edit] Looking at aug_transforms, I’m not sure anymore whether that’s actually suboptimal or not.

Well, I did try to run the whole program on a different dataset, which I resized beforehand, and it took much less time, about 50 seconds per epoch.
Also, I can see that my CPU is maxed out the entire time with almost no GPU usage, which indicates the bottleneck is the CPU rather than the GPU; this wasn’t the case when I used the presized dataset.

The reason I resize the images to 460 first and then to 224 is that it’s good practice, as shown in the course at https://course.fast.ai/videos/?lesson=4 around 1:20:00.

As for it making the dataset static: if that’s the only way, I prefer it over waiting hours for the training to finish :frowning:
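
Here’s roughly what I have in mind (just a sketch: cars_460 is a placeholder for wherever I’d write the presized copy, and it assumes the dataset is laid out as one subfolder per class with .jpg files). The idea is to shrink every image once on disk so only small files get read each epoch, while keeping the aug_transforms as they are so the augmentation stays random per batch.

from pathlib import Path
from PIL import Image

src  = Path(path)          # original dataset folder (one subfolder per class)
dest = Path('cars_460')    # placeholder: where the presized copy gets written

for img_path in src.rglob('*.jpg'):
    out = dest / img_path.relative_to(src)
    out.parent.mkdir(parents=True, exist_ok=True)
    img = Image.open(img_path).convert('RGB')
    # Not identical to fastai's Resize(460) crop, but small enough that the
    # per-epoch resize on the CPU becomes cheap.
    img.thumbnail((460, 460))   # shrink so the longest side is <= 460px
    img.save(out, quality=90)

# Then build the DataLoaders from `dest` instead of `path`, with the same
# item_tfms/batch_tfms, so the random augmentation still happens every batch.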
