I was looking around but couldn’t find anything on easily processing a whole dataset prior to training a model with Fast.ai. I have a 30GB image dataset to which I want to apply some of the Fast.ai transformations like zoom, rotate, etc., and resize the images to 224. In the normal datablock API this is done on the fly, but I’m bottlenecked by CPU power, so I want to do it once up front and then skip it during training.
What would be the easiest way to approach this?
I kinda figured it out using some of the examples in the documentation.
```python
import os
from pathlib import Path
from joblib import Parallel, delayed
from tqdm import tqdm
from fastai.vision import ImageList, open_image

def transform_image(path):
    # Apply the transforms and resize to 224 (tfms defined elsewhere)
    d = open_image(path).apply_tfms(tfms, size=224)
    # Mirror the source tree under a 'data_transformed' folder
    p = Path(path)
    target_path = '/'.join(p.parts[:2]) + '/data_transformed/' + '/'.join(p.parts[3:])
    os.makedirs('/'.join(target_path.split('/')[:-1]), exist_ok=True)
    d.save(target_path)

items = ImageList.from_folder(path='../../data').items
Parallel(n_jobs=8)(delayed(transform_image)(i) for i in tqdm(items))
```
This works quite nicely and is efficient.
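One detail worth spelling out: the target path is built by swapping the third path component (the original dataset folder) for `data_transformed`, so the folder tree under it is mirrored one-to-one. A minimal sketch of just that mapping, using plain `pathlib` (the helper name `to_transformed_path` is mine, not from Fast.ai):

```python
from pathlib import Path

def to_transformed_path(path):
    # Replace the third path component with 'data_transformed',
    # keeping the rest of the tree intact.
    p = Path(path)
    parts = p.parts[:2] + ('data_transformed',) + p.parts[3:]
    return Path(*parts)

print(to_transformed_path('../../data/train/cat/img1.jpg').as_posix())
# → ../../data_transformed/train/cat/img1.jpg
```

Because every worker writes into this mirrored tree, the class/label structure survives and the transformed folder can be read back with the same `from_folder` call as the original.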