Mapping between dataloader source objects and dataset items

Hello guys,

When creating a fastai2.data.core.DataLoaders object with the fastai2.data.block.DataBlock.dataloaders method, the first parameter is a list of source objects. In my case this is a list of custom wrapper objects that basically wrap the path of the image and some other information.
Once the dataloaders are created, the items in the datasets no longer contain a reference to these source objects, as far as I can tell.
At prediction time I want to store the predictions in these custom wrapper objects. So far, the only way I see to map between the dataset items from the dataloaders and the original source objects is that they are in the same order, and I feel a bit uncomfortable relying only on that.
Is there some reference I haven’t found or has someone come up with a better way to tackle this?

Here is my code calling the data block api for better understanding:

    data = fastai2.data.block.DataBlock(
        blocks=(ImageBlock, fastai2.data.block.MultiCategoryBlock),
        get_x=lambda x: x.path,
        get_y=lambda x: x.classification_labels,
        splitter=fastai2.data.transforms.FuncSplitter(lambda x: x.is_valid),
        item_tfms=fastai2.vision.augment.Resize(final_size),
        batch_tfms=fastai2.vision.augment.aug_transforms(flip_vert=True))

    dls = data.dataloaders(object_manager.objects, bs=bs)

The “object_manager.objects” simply wrap pre-extracted tiles from whole-slide images together with some additional information. [wsi_processing_pipeline/preprocessing/objects.py at master · FAU-DLM/wsi_processing_pipeline · GitHub]
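
For illustration, this is the order-based mapping I mean. It is only a rough sketch: “learn” is the trained Learner, and the “prediction” attribute is something I would add to my wrapper objects, not part of fastai:

    # Predictions come back in dataset order when run over the (unshuffled) validation dataloader.
    preds, _ = learn.get_preds(dl=dls.valid)

    # FuncSplitter keeps the original order within each split, so I currently rely on the
    # validation objects lining up one-to-one with the prediction rows.
    valid_objects = [o for o in object_manager.objects if o.is_valid]
    assert len(valid_objects) == len(preds)
    for obj, pred in zip(valid_objects, preds):
        obj.prediction = pred  # attribute on my wrapper object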

Thanks a lot in advance!

Christoph

That order is what you have to go with unless you make your own transforms to store this information (such as inheriting from PIL and giving it whatever attribute you wish to store). The API wasn’t designed around an object-based system like the one you are describing. However, since you are using images, at the image level they are still Pillow Image.Image’s, so you could try a process like the following to keep the associated file names:
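
Roughly like this; an untested sketch where the class name and the source_path attribute are placeholders I made up, not existing fastai names:

    from fastai2.vision.core import PILImage

    class PILImageWithSource(PILImage):
        "A PILImage that remembers which file it was created from"
        @classmethod
        def create(cls, fn, **kwargs):
            img = super().create(fn, **kwargs)  # open the image as usual
            img.source_path = fn                # stash the source path on the image instance
            return img

You would then point the image block at it with ImageBlock(cls=PILImageWithSource), since ImageBlock accepts a cls argument as far as I remember.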

So you can either work around it via the path method or build a custom Transform that inherits from PILBase to extract the source information you want.
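
For the path route, something as simple as a dict keyed by path should work; again only a sketch reusing the wrapper objects from your question:

    # Recover the original wrapper object from the file path that get_x returned.
    obj_by_path = {o.path: o for o in object_manager.objects}
    source_obj = obj_by_path[some_path]  # some_path: whatever path/file name you carried along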

However, do note that this information is then lost at the DataLoader output level (by then everything is a tensor).

Sorry to revive this old post, but I do not have permission to create a new post, and my problem is similar to this one.

I want to be able to retrieve additional metadata during training, not just the labels.

You can hack a transform that retrieves metadata from the PILImage, but that does not help once you only have tensors, for example in a before_batch callback.
Is there a way to create a subclass of the fastai DataLoader that saves the items’ metadata as a class member/attribute for each batch it creates? And how do I tell the DataBlock API to use this new DataLoader? Or do we need to do some monkey patching?
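
Something like this is what I have in mind; an untested sketch where MetaTfmdDL and last_batch_items are names I made up, and I am not sure how it behaves with multiple workers:

    from fastai2.data.core import TfmdDL

    class MetaTfmdDL(TfmdDL):
        "A TfmdDL that keeps a reference to the samples of the batch it just collated"
        def create_batch(self, b):
            self.last_batch_items = b  # made-up attribute; b is the list of samples being collated
            return super().create_batch(b)

From a quick look at the API, it seems a dl_type argument can be passed to the DataBlock constructor (or to .dataloaders()); is that the intended way to swap in a custom DataLoader class?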

I tried to modify the DataLoader.sample function and save the indices as a DataLoader attribute. But then, in the callback, self.learn.dls.train seems to be a different DataLoader instance from the one used to generate the batch, because my attribute is there but not set.
Modified sample function:

    def sample(self):
        self.last_batch_idx = list(b for i, b in enumerate(self.__idxs) if i // (self.bs or 1) % self.num_workers == self.offs)
        return self.last_batch_idx

Something to do with multiple workers / subprocesses, I guess?
Where would be the ideal place to store the item indices then?
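
Would overriding get_idxs instead be the right direction? It seems to be called in the main process before the workers start, so an attribute set there should survive. A rough, untested sketch (IdxTrackingDL and last_epoch_idxs are names I made up):

    from fastai2.data.core import TfmdDL

    class IdxTrackingDL(TfmdDL):
        "Stores the index order of the current epoch on the main-process DataLoader"
        def get_idxs(self):
            idxs = super().get_idxs()
            self.last_epoch_idxs = idxs  # made-up attribute, set before the workers are spawned
            return idxs

A callback could then slice last_epoch_idxs per batch using the batch counter and the batch size, assuming the shuffle/drop_last details line up.
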
@muellerzr