Building data batches from a Python generator

Hi, for various reasons we are trying to use fastai as a component of a complex pipeline that trains a vision model on synthetic images. The idea is that the ML step receives two Python generator objects, one for the training set and one for the validation set. These generators produce images on demand, so there is no finite dataset; instead, the two generators should be called every time a batch of images is needed.

My idea was to build a custom DataBlock with custom DataLoaders that I could call by passing a dictionary like `{'train': SDETrainGen, 'validation': SDEValidationGen}`.

But the DataBlock seems to be built around the concept of first loading the complete set and then building batches from it, if I am not mistaken.

So I am wondering if there is a way to fit fastai into this architecture; any suggestions are very much appreciated. TIA.

The most straightforward way is probably to create a custom PyTorch DataLoader that generates the images, and then wrap it in fastai's DataLoaders, as shown in this tutorial: Migrating PyTorch - Verbose.
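A minimal sketch of that approach, using a plain `torch.utils.data.IterableDataset` to stream items from a generator. Note that `fake_gen` is a hypothetical stand-in for the `SDETrainGen` generator from the question, and the small tensor shapes are just for illustration:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader


class GeneratorDataset(IterableDataset):
    """Streams (image, label) pairs from a factory that returns a fresh generator."""

    def __init__(self, gen_factory):
        self.gen_factory = gen_factory

    def __iter__(self):
        # Called each time the DataLoader is iterated, so every epoch
        # gets a fresh generator producing images on demand.
        return self.gen_factory()


def fake_gen():
    # Hypothetical stand-in for SDETrainGen: yields a few synthetic
    # 3x8x8 "images" with integer labels.
    for label in range(6):
        yield torch.rand(3, 8, 8), label


# A standard PyTorch DataLoader batches the streamed items.
train_dl = DataLoader(GeneratorDataset(fake_gen), batch_size=2)

for xb, yb in train_dl:
    print(xb.shape, yb.shape)
```

From there, if I understand the tutorial correctly, you would build a second DataLoader for the validation generator and pass both to fastai (e.g. `fastai.data.core.DataLoaders(train_dl, valid_dl)`) before creating a `Learner`. Since the dataset is infinite in your case, you may also need to cap the generator's output (or the number of batches) per epoch so training epochs terminate.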
