I’m working on an image segmentation application that consumes very high resolution images (~500 Mpx, or about 23,000 × 23,000 px). It works fine consuming the images as smaller tiles; that’s acceptable given the nature of the images (microscopy slide scans), since a “whole slide” view would be useless anyway.
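For context, the tiling itself is straightforward. A minimal sketch of what I mean (pure numpy; `tile_image` is my own helper name, and it assumes dimensions are exact multiples of the tile size, whereas the real code would pad or crop the edges):

```python
import numpy as np

def tile_image(img, tile_size):
    """Chop an H x W x C array into square tiles of side tile_size.

    Assumes H and W are exact multiples of tile_size; a real version
    would handle the remainder at the right/bottom edges.
    """
    h, w = img.shape[:2]
    tiles = []
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tiles.append(img[y:y + tile_size, x:x + tile_size])
    return tiles

# A toy 4x4 single-channel "image" yields four 2x2 tiles.
img = np.arange(16).reshape(4, 4, 1)
tiles = tile_image(img, 2)
```

So the tiles exist as plain arrays in RAM; the question is only about feeding them to fastai from there.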
But I was thinking of avoiding writing the image tiles to disk. Generating the tiles in memory (system RAM, of course, not the GPU’s) for the duration of a short training session is fine. So, my question is:
How can I create an ImageDataBunch from a set of in-memory (system RAM) images (the tiles I’d chop the original large image into), preferably without writing them to disk?
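Concretely, what I have in mind is keeping the tiles in a plain in-memory mapping keyed by synthetic “file names”, so that downstream code expecting names can stay unchanged and only the loading step changes (all names here are hypothetical, just to illustrate the idea):

```python
import numpy as np

# Hypothetical in-memory "file system": synthetic names -> numpy arrays.
tile_store = {}

def store_tile(name, arr):
    tile_store[name] = arr

def open_tile(name):
    # Stand-in for whatever fastai would otherwise do with a path:
    # look the array up in RAM instead of reading from disk.
    return tile_store[name]

store_tile('slide1_tile_0_0.png', np.zeros((256, 256, 3), dtype=np.uint8))
tile = open_tile('slide1_tile_0_0.png')
```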
My loading code for now is very simple. (I’m also kind of a total noob to fastai’s data block API; note that “dataset” in my code refers to nothing remotely similar to a fastai Dataset, it’s specific to my app’s models.)
```python
import numpy as np
from fastai.vision import SegmentationItemList, get_transforms

def make_data_bunch(dataset_images, cls_codes):
    # dataset_images is a list of dicts (my app's models, not fastai's)
    dir_path = dataset_images[0]['image'].parent.parent
    return (
        SegmentationItemList(
            items=[di['image'] for di in dataset_images],
            path=dir_path
        )  # -> SegmentationItemList
        .split_by_files([
            di['image'].name
            for di in dataset_images if di['purpose'] == 'validation'
        ])  # -> ItemLists(train: SegmentationItemList, valid: SegmentationItemList)
        .label_from_lists(
            train_labels=[di['label_image'] for di in dataset_images
                          if di['purpose'] == 'train'],
            valid_labels=[di['label_image'] for di in dataset_images
                          if di['purpose'] == 'validation'],
            classes=np.asarray(cls_codes)
        )  # -> LabelLists(train: LabelList(x, y: SegmentationItemList), valid: ...)
        .transform(get_transforms(flip_vert=True), tfm_y=True)
        .databunch(bs=1)  # -> ImageDataBunch
    )
```
(Right now, by handling this entirely outside the fastai lib, I’d end up writing the tiles to disk, but I was looking for a more “fastai idiomatic” way of doing it.)
Also, note that in my application training happens in production: users create training sessions through a web UI, set their parameters, etc. It’s not a “train, then deploy the trained model” scenario. But the number of concurrent users would be small and the machine can have a ton of RAM, so I’m fine with creating GB-sized images in memory.
Thanks in advance,