Preprocessing a whole image dataset

I was looking around but couldn’t find anything on easily processing a whole dataset prior to training a model with Fast.ai. I have a 30GB image dataset to which I want to apply some of the Fast.ai transformations (zoom, rotate, etc.) and resize everything to 224. With the normal data block API this is done on the fly, but I’m bottlenecked by CPU power, so I want to do it once up front and skip it during training.
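
For context, this is roughly the on-the-fly setup I’m trying to avoid (a minimal sketch for fastai v1; the folder path, validation split and batch size are placeholders, not my actual values):

from fastai.vision import *

# Standard data block pipeline: every epoch, each image is re-loaded,
# re-augmented (zoom, rotate, ...) and resized to 224 on the CPU.
data = (ImageList.from_folder('../../data')
        .split_by_rand_pct(0.2)
        .label_from_folder()
        .transform(get_transforms(), size=224)
        .databunch(bs=64)
        .normalize(imagenet_stats))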

What would be the easiest way to approach this?

I figured it out using some of the examples in the documentation:

import os
from pathlib import Path

from fastai.vision import ImageList, get_transforms, open_image
from joblib import Parallel, delayed
from tqdm import tqdm

tfms = get_transforms()  # the usual fastai augmentations (zoom, rotate, ...); tfms[0] are the training transforms

def transform_image(path):
    # apply the training transforms and resize to 224 on load
    d = open_image(path).apply_tfms(tfms[0], size=224)
    p = Path(path)
    # mirror the source folder structure under a 'data_transformed' tree
    target_path = '/'.join(p.parts[:2]) + '/data_transformed/' + '/'.join(p.parts[3:])
    # makedirs with exist_ok is safe when several workers hit the same folder
    os.makedirs('/'.join(target_path.split('/')[:-1]), exist_ok=True)
    d.save(target_path)

items = ImageList.from_folder(path='../../data').items
Parallel(n_jobs=8)(delayed(transform_image)(i) for i in tqdm(items))

This works quite nicely and is efficient.
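
Once the images are written out, you can point the data block API at the new folder and drop the expensive augmentations, since they are already baked in. A minimal sketch, assuming the preprocessed images landed in ../../data_transformed with the same class subfolders (the split, batch size and model are placeholders):

from fastai.vision import *

# Load the already-transformed, already-resized images with no extra augmentation.
data = (ImageList.from_folder('../../data_transformed')
        .split_by_rand_pct(0.2)
        .label_from_folder()
        .databunch(bs=64)
        .normalize(imagenet_stats))

learn = cnn_learner(data, models.resnet34, metrics=accuracy)
learn.fit_one_cycle(1)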
