ImageDataBunch from 500 megapixel images as tiles

Something I wrote up quickly:

import torch
from torch.utils import data

class SegmentedDataset(data.Dataset):
    """Dataset that exposes each large image as `segments_per_image` tiles.

    Index `i` maps to tile ``i % segments_per_image`` of image
    ``i // segments_per_image``; every tile inherits its image's label.
    Subclasses must override :meth:`get_segment` to do the actual cropping.
    Assumes ``len(images) == len(labels)`` — TODO confirm at the call site.
    """

    def __init__(self, images, labels, segments_per_image: int):
        self.images = images
        self.labels = labels
        self.segments_per_image = segments_per_image

    def get_segment(self, image, segment_ID):
        """Return one tile of `image`. Must be implemented by a subclass."""
        # Raising (instead of the original silent `pass`, which returned
        # None) surfaces a missing override immediately rather than
        # producing None batches downstream.
        raise NotImplementedError("subclasses must implement get_segment")

    def __len__(self) -> int:
        """Total number of samples: one per (image, tile) pair."""
        return len(self.labels) * self.segments_per_image

    def __getitem__(self, index):
        """Return ``(tile, label)`` for flat sample index `index`."""
        segment_ID = index % self.segments_per_image
        image_ID = index // self.segments_per_image

        image = self.images[image_ID]

        # data and label
        X = self.get_segment(image, segment_ID)
        y = self.labels[image_ID]

        return X, y

# NOTE: the constructor takes three arguments — the original example omitted
# `segments_per_image`, which would raise a TypeError.
training_set = SegmentedDataset(train_images, train_labels, segments_per_image)
training_generator = data.DataLoader(training_set, batch_size=batch_size, shuffle=True)  # tune args as needed

validation_set = SegmentedDataset(validation_images, validation_labels, segments_per_image)
validation_generator = data.DataLoader(validation_set, batch_size=batch_size)

# Wrap the two loaders for fastai; pass any other ImageDataBunch arguments here.
data_bunch = ImageDataBunch(train_dl=training_generator, valid_dl=validation_generator)

Hope it helps :slight_smile:

2 Likes