ImageDataBunch from 500 megapixel images as tiles

I’m working on an image segmentation application that consumes very high resolution images (~500 Mpx, or 23000 x 23000 px). It works fine consuming the images in smaller tiles (which is OK given the nature of the images: microscopy slide scans, where a “whole slide” view would be useless anyway).

But I was thinking of avoiding writing the image tiles to disk. Having the tiles generated in memory (system RAM of course, not the GPU’s) for the purpose of a short training session is OK though. So, my question is:

How could I create an ImageDataBunch from a set of in-memory (system RAM) images (tiles I’d chop the original large image into), preferably without writing them to disk?

My loading code for now is very simple (and I’m kind of a total noob with fastai’s data block api; also, “dataset” in my code refers to nothing remotely similar to a fastai Dataset, it’s specific to my app’s models):

def make_data_bunch(dataset_images, cls_codes):
    dir_path = dataset_images[0]['image'].parent.parent
    return (
        SegmentationItemList(
            items=[di['image'] for di in dataset_images],
            path=dir_path,
        )  # -> : SegmentationItemList
        .split_by_files([
            di['image'].name for di in dataset_images
            if di['purpose'] == 'validation'
        ])  # -> : ItemLists(train: SegmentationItemList, valid: SegmentationItemList)
        .label_from_lists(
            train_labels=[di['label_image'] for di in dataset_images
                          if di['purpose'] == 'train'],
            valid_labels=[di['label_image'] for di in dataset_images
                          if di['purpose'] == 'validation'],
            classes=cls_codes,
        )  # -> : LabelLists(train: LabelList(x, y: SegmentationItemList), valid: LabelList(x, y: SegmentationItemList))
        .transform(get_transforms(flip_vert=True), tfm_y=True)
        .databunch(bs=1)  # -> : ImageDataBunch(train: LabelList(x, y: SegmentationItemList), valid: LabelList(x, y: SegmentationItemList))
    )

(Right now, by handling this entirely outside the fastai lib, I’d end up writing them to disk, but I was looking for a more “fastai-idiomatic” way of doing it.)

Also, note that in my application training happens in production: the user actually creates training sessions through a web UI, sets their parameters etc.; it’s not a “train, then deploy the trained model to production” scenario. But the number of concurrent users would be small and the machine can have a ton of RAM, so I’m fine with creating GB-sized images in memory :slight_smile:

Thanks in advance,


That code is VERY hard to read.

I suggest editing it so that other people can understand it more easily.

I am not sure I understand fully.

You have big images and want to segment them into smaller images.
Is the question how you can create an ImageDataBunch object
using as input the big pictures and labels?

That’s tricky. You can subclass SegmentationItemList to change its get function (which is what returns the item numbered i) so that it returns a piece of your image. For instance, if you want to split each image into a 4 by 4 grid, you’d open the image numbered i//16 and return its i%16-th chunk.
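A rough sketch of that indexing arithmetic (plain NumPy here; the 4-by-4 grid and the `tile_of` helper name are just illustrative, not fastai API):

```python
import numpy as np

GRID = 4  # split each big image into GRID x GRID = 16 tiles

def tile_of(big_images, i, tile_size):
    """Return tile number i, where tiles 0..15 come from image 0,
    tiles 16..31 from image 1, and so on."""
    img = big_images[i // (GRID * GRID)]   # look up / open the image numbered i//16
    t = i % (GRID * GRID)                  # which tile inside that image
    row, col = divmod(t, GRID)
    y, x = row * tile_size, col * tile_size
    return img[y:y + tile_size, x:x + tile_size]
```

A subclassed get(i) would do exactly this arithmetic before handing the crop back to fastai.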


So ImageDataBunch takes in a DataLoader for training and validation.
It looks like you can already split your data into train and validation.
Can you make DataLoaders out of them?

Then I think you want a custom DataLoader that doesn’t just yield the whole images;
instead it cuts each one into smaller images and yields those with the label of the whole image.

Does that sound like what you want?


What you ultimately want is a SegmentationItemList, I guess. The class hierarchy for this is
ItemList > ImageList > SegmentationItemList, and I believe you will have to make changes in both ItemList and ImageList, because the ItemList ultimately determines the number of items in your dataset, which you will probably want to be the total number of tiles. If it is not the number of tiles, you might have to think harder about whether your tiles are still nicely shuffled.

What I propose is to change the items in ItemList.__init__ (which holds your image paths) somehow like this:
list(zip(items*num_tiles, [t for t in range(num_tiles) for _ in range(num_im)]))
(Make a TiledItemList extending ItemList, overriding the __init__)

so that you can access that tile information in your get function that you replace in ImageList in order to load the correct tile. (Make a TiledImageList extending TiledItemList, mostly copying the contents from the original, except for get)

I am not sure if something breaks if you make items a list of tuples. You might also just append your tile number to the path there and split it off again in ImageList.get.
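To illustrate the two encodings mentioned above (hypothetical names and a made-up `#tile` separator; num_tiles tiles per image):

```python
items = ["a.png", "b.png"]
num_tiles, num_im = 3, len(items)

# option 1: items as (path, tile_idx) tuples
pairs = list(zip(items * num_tiles,
                 [t for t in range(num_tiles) for _ in range(num_im)]))
# -> [('a.png', 0), ('b.png', 0), ('a.png', 1), ('b.png', 1), ('a.png', 2), ('b.png', 2)]

# option 2: encode the tile number in the path itself and cut it off again in get
encoded = [f"{p}#tile{t}" for p, t in pairs]
path, tile_idx = encoded[2].rsplit("#tile", 1)
# -> ('a.png', '1')
```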

When you have changed those two (maybe as a TiledItemList, TiledImageList), I think making it work for segmentation is as easy as

class TiledSegmentationItemList(TiledImageList):
    "`ItemList` suitable for segmentation tasks."
    _label_cls,_square_show_res = SegmentationLabelList,False

(which is the standard SegmentationItemList implementation extending your tiled version instead)

Make sure to check out the corresponding source code of ItemList, ImageList and SegmentationItemList.

You have to create a TiledSegmentationLabelList replacing the SegmentationLabelList as well I think (which should also just use your Tiled classes instead of the original)

That sounds so much more complicated than a simple custom DataLoader.
What advantage would SegmentationItemList have over a custom DataLoader?
Did I miss something that @neuronq needs?

Agreed, it’s not easy, but you should get all the behavior from the data blocks api I think. I’m not saying it’s worth the effort, but that is how I would build it if I wanted to represent one item of a dataset as multiple items in a fastai data block api-ish way :smiley: If you’re not going to have more cases for such tiling, I would honestly just stick with what you have or write the tiles to disk.

I don’t think there is an easy, low-effort way to change this fundamental fact: one item in your dataset is just one item.


Thanks! I’ll look into this and see if it’s worth the effort going this way…

My question wrt this would be though: would that play well with however the GPU computing works (sorry, I’m totally ignorant in the GPU computing area…)? I mean, is it simply what SegmentationItemList.get returns that gets loaded, with nothing weirder/fancier going on like memory mapping from file and stuff like that? (I know, I could read through the code, but I’m very new to the field and I’d rather put effort into pushing the project further right now…)

Thanks, I’ll dig deeper into it and try to see which one is easier to start with, a custom DataLoader or SegmentationItemList.

(Oh, and sorry for the code, I just took it directly from the app instead of rewriting it for posting here, but it’s not relevant beyond showing that I’m already starting from the equivalent of two lists of absolute file paths, a list for training and a list for validation, that are already picked from somewhere else in the app so I want to avoid the train/validation splitting functionality of the data block api.)

Something I wrote quick:

import torch
from torch.utils import data

class SegmentedDataset(data.Dataset):
    def __init__(self, images, labels, segments_per_image):
        self.images = images
        self.labels = labels
        self.segments_per_image = segments_per_image

    def get_segment(self, image, segment_ID):
        # implement this function:
        # returns one segment (tile) of one image
        raise NotImplementedError

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.labels) * self.segments_per_image

    def __getitem__(self, index):
        'Generates one sample of data'
        segment_ID = index % self.segments_per_image
        image_ID = index // self.segments_per_image

        image = self.images[image_ID]
        # data and label
        X = self.get_segment(image, segment_ID)
        y = self.labels[image_ID]

        return X, y

training_set = SegmentedDataset(train_images, train_labels, segments_per_image)
training_generator = data.DataLoader(training_set, ...)  # arguments like batch size

validation_set = SegmentedDataset(validation_images, validation_labels, segments_per_image)
validation_generator = data.DataLoader(validation_set, ...)  # arguments like batch size

ImageDataBunch(train_dl=training_generator, valid_dl=validation_generator, ...)  # other arguments

Hope it helps :slight_smile:


This way is a lot easier than what I was suggesting. Unless you really need such a tiled structure very frequently I would also suggest you try this rather than what I proposed.


Now the same with the data block API :stuck_out_tongue:


:stuck_out_tongue: @sgugger
This is all I got to but I’m sure @neuronq can take it from here :slight_smile:

class SegmentationTileItemList(SegmentationItemList):
    def __init__(self, segments_per_image, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.segments_per_image = segments_per_image

    def get_image_segment(self, full_image, segment_idx):
        pass # implement this function

    def get(self, i):
        segment_idx = i % self.segments_per_image
        image_idx = i // self.segments_per_image

        full_image = super().get(image_idx)

        res = self.get_image_segment(full_image, segment_idx)
        self.sizes[i] = res.size
        return res

class SegmentationTileLabelList(SegmentationLabelList):
    def __init__(self, segments_per_label, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.segments_per_label = segments_per_label

    def get_label_segment(self, full_label, segment_idx):
        pass # implement this function

    def get(self, i):
        segment_idx = i % self.segments_per_label
        label_idx = i // self.segments_per_label

        full_label = super().get(label_idx)

        res = self.get_label_segment(full_label, segment_idx)
        self.sizes[i] = res.size
        return res

Nice job! You should just add _label_cls=SegmentationTileLabelList as a class variable of your first class, so that it knows to label with this automatically.

Thanks a looot to all! I’ll refactor (the current code ended up with the “write tiles to disk” solution, which took 15 min to code :stuck_out_tongue:…), and post the solution I picked sometime ~Monday-ish…

(spoiler alert: it’ll likely be @Hadus 's SegmentedDataset solution…)

Chiming in, for breaking an image down into patches, the tensor.unfold function makes it really easy to turn a tensor into tiles.

You could load the large image as a tensor in RAM, break it into patches, then feed batches of patches to the GPU. If you have your segmentation ground truth as a large array, you can break it into patches the same way.

(link discusses 3D images but it works fine for 2D as well)
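For instance, tiling a small stand-in for a slide (the sizes here are made up; a real slide would be e.g. 3 x 23000 x 23000):

```python
import torch

img = torch.arange(64, dtype=torch.float32).reshape(1, 8, 8)  # (C, H, W) stand-in

tile = 4
# unfold height (dim 1), then width (dim 2), with window size = stride = tile
patches = img.unfold(1, tile, tile).unfold(2, tile, tile)
# patches now has shape (C, n_rows, n_cols, tile, tile)
patches = patches.contiguous().view(-1, tile, tile)  # (n_patches, tile, tile)
```

Each element of patches is one non-overlapping tile, ready to be batched onto the GPU; applying the same two unfold calls to the ground-truth mask keeps image and label tiles aligned.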


Are you sure about this get function? First, I don’t see where index is coming from; index = i, I assume?

When using from_folder or so, it will load all the filenames it can find (let’s say 100). When this thing is sampled in the end, that means the i in get will always be between 0 and 100, not 100 * n_segments.

That is the problem that I was trying to refer to. Since the whole WhateverList hierarchy in fastai ultimately extends ItemList, it will be the filenames (.items) that determine how big this index is going to be.
And while you can easily fix that in the __len__ in a pytorch dataset, it won’t be that easy for this deep class hierarchy.
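A quick way to see the problem with a plain PyTorch dataset (hypothetical numbers): the sampler only ever draws indices below len(dataset), so if that length stays at the number of filenames, get(i) never sees i >= 100:

```python
from torch.utils.data import Dataset, DataLoader

class Dummy(Dataset):
    def __init__(self, n):
        self.n = n           # n plays the role of len(self.items)
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        return i             # just echo back the index the sampler produced

seen = [i for batch in DataLoader(Dummy(100), batch_size=10) for i in batch.tolist()]
# max(seen) is 99: the index never reaches 100 * n_segments
```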

You can of course do data = FancyList.from_whereever(), then modify items and continue with the data blocks pipeline. However, I would not easily assume (without carefully checking the data blocks api code) that this change will have no undesired behavior compared to having the correct items array set in the __init__. The same goes for the LabelList that is created internally. Maybe modifying items is the only thing left to do; can’t tell for sure right now.

Maybe there’s also something that I don’t see right now (should definitely go to bed :D). Anyways, hope this helps :slight_smile:

Good point!
I totally missed the fact that now we have a lot more data…

But anyhow @neuronq said he is going to most likely use the SegmentedDataset solution instead.

self.get is

how to create item i from self.items

The easiest solution would be to generate a random segment_idx.
The only change is:

    # in SegmentationTileItemList:
    def get(self, i):
        segment_idx = np.random.randint(0, self.segments_per_image)
        image_idx = i

    # in SegmentationTileLabelList:
    def get(self, i):
        segment_idx = np.random.randint(0, self.segments_per_label)
        label_idx = i

This would work pretty well. It would act as a kind of data augmentation.
(This kind of behaviour could probably be achieved more easily with actual data augmentation…)
One epoch would go through all the full images, but only one segment of each.

If we want to do it properly then I think we do have to do more complicated stuff with self.items.

self.items[i] (which is a filename)

So in self.items there are all the file names we use; my idea is to just make segments_per_image copies of self.items and store them back in self.items:

That way we have the right number of data items. When we go through an epoch it will visit the same filename multiple times, so we also need something to map each occurrence of a filename to a specific segment.

class SegmentationTileItemList(SegmentationItemList):
    def __init__(self, segments_per_image, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.segments_per_image = segments_per_image

        self.segment_idxs = np.repeat([range(self.segments_per_image)], len(self.items))
        # [0, 0, 0, 1, 1, 1, 2, 2, 2]
        self.items = np.repeat([self.items], self.segments_per_image, axis=0).flatten()
        # ["img0", "img1", "img2", "img0", "img1", "img2", "img0", "img1", "img2"]

    def get_image_segment(self, full_image, segment_idx):
        pass # implement this function

    def get(self, i):
        segment_idx = self.segment_idxs[i]

        full_image = super().get(i)

        res = self.get_image_segment(full_image, segment_idx)

        self.sizes[i] = res.size
        return res

It is pretty much the same for the label one…

When picking segments at random, one should also ensure that the same random segment is picked for the label. Might be as simple as setting a random seed, though.
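One way to do that (a sketch; the class name is made up): derive the segment index deterministically from the item index, so the image list and the label list independently compute the same “random” choice without sharing state:

```python
import numpy as np

class SharedSegmentPicker:
    """Both the image list and the label list can own one of these;
    seeding per item index makes their picks agree without shared state."""
    def __init__(self, segments_per_image, epoch=0):
        self.segments_per_image = segments_per_image
        self.epoch = epoch  # bump this per epoch to vary the segments over training

    def pick(self, i):
        # a fresh RNG seeded from (epoch, item index) is reproducible on both sides
        rng = np.random.RandomState(self.epoch * 1_000_003 + i)
        return int(rng.randint(0, self.segments_per_image))
```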

One whole other aspect of this that my head can’t stop thinking about is: how small would you want to tile the image? More tiles means more individual samples that you can augment etc., which means a lot of images.
On the other hand, more tiles means you put in more ‘crappy’ information through convolution padding and data augmentation padding where you would often have actual data… (OK, for the augmentation part you can get around this if you want to.)
I wonder where the sweet spot is. If you find out @neuronq, I would be really interested to hear about it :slight_smile:

This is the step that prevented me completing a solution to a similar need a few months ago. I’d be interested in any update if someone gets it working well.