When are images loaded?

PeterY · May 9, 2019, 9:27pm

When using ImageDataBunch, the outputs are image paths and labels. I didn’t see any code in Learner that reads these paths, so I’m wondering: when exactly are the images loaded? Thanks!

Kornel · May 10, 2019, 10:23am

Learner is not responsible for loading data.

ItemList has a get method that is responsible for returning an item. get is invoked by __getitem__, the standard PyTorch Dataset method, which in turn is called by the PyTorch DataLoader as it iterates in the training loop (fastai/basic_train.py:99) or the validation loop (fastai/basic_train.py:57). Learner only invokes the training/validation loop.

ImageList overrides this method (fastai/vision/data.py:268), so its get loads the file (using the open_image function, fastai/vision/image.py:388) instead of returning the stored item.
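
A minimal sketch of that delegation, using simplified stand-in classes rather than the real fastai code. ToyItemList, ToyImageList and fake_open_image are hypothetical names made up for illustration, and the "loading" is faked so the snippet runs without any image files:

```python
from pathlib import Path

def fake_open_image(path):
    # Stand-in for fastai's open_image; the real one reads the file from disk.
    return f"<pixels of {path.name}>"

class ToyItemList:
    """Hypothetical, simplified stand-in for fastai's ItemList."""
    def __init__(self, items):
        self.items = list(items)

    def get(self, i):
        # Base behaviour: hand back the stored item unchanged.
        return self.items[i]

    def __getitem__(self, i):
        # Indexing delegates to get(), so subclasses only need to override get().
        return self.get(i)

    def __len__(self):
        return len(self.items)

class ToyImageList(ToyItemList):
    """Hypothetical stand-in for ImageList: items are paths, get() loads them."""
    def get(self, i):
        path = super().get(i)
        return fake_open_image(path)

paths = [Path("cat.jpg"), Path("dog.jpg")]
plain = ToyItemList(paths)
images = ToyImageList(paths)

print(plain[0])         # the stored path, untouched
print(images.items[0])  # still just a path; nothing has been loaded
print(images[0])        # indexing triggers get(), which "loads" the image
```

This is also why inspecting the items only ever shows paths: the load happens at indexing time, not when the list is built.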

saltdoc · May 10, 2019, 9:52pm

If you have a tar file, try the code below (be sure to omit the file extension). It sets path_img, loads it into an ImageDataBunch, and the learner reads that in as “data”.

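from fastai.vision import *  # fastai v1: untar_data, ImageDataBunch, cnn_learner, models, ...

bs = 64  # batch size; assumed value, not given in the original post
path = untar_data('https://download.com/tarfilewithoutext'); path
path_img = path
data = ImageDataBunch.from_folder(path=path_img, valid_pct=0.3, ds_tfms=get_transforms(), bs=bs, size=224, num_workers=8).normalize(imagenet_stats)
learn = cnn_learner(data, models.resnet101, metrics=[accuracy, error_rate])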

PeterY · May 11, 2019, 1:47am

Thanks Kornel, that makes a lot of sense. But when you say the get method is invoked by __getitem__, where exactly does that happen? Is it a Python default?
I didn’t see get being called in the __getitem__ method of PyTorch’s Dataset.

Thanks!

PeterY · May 11, 2019, 1:48am

Hi saltdoc,
I think my question is more about where the image is actually loaded, since if you inspect the databunch, the dataset or dataloader still only holds paths. Thanks

Kornel · May 11, 2019, 2:45pm

In PyTorch, the DataLoader is used for joining and splitting data items into batches. It is also iterable, so when you call next(iter(dl)) it returns the next batch, which is what the fastai training loop does.
The Dataset only tells the DataLoader how to get a single item, which you do by implementing __getitem__.

You can check the PyTorch source code, but the guide is clear enough: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
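
To make the division of labour concrete, here is a rough sketch in plain Python. ToyLoader is a hypothetical, much-simplified stand-in for a DataLoader (the real one also shuffles, collates samples into tensors and can use worker processes), and SquaresDataset is a made-up dataset that records which indices were requested:

```python
class SquaresDataset:
    """Toy dataset: item i is computed only when it is asked for."""
    def __init__(self, n):
        self.n = n
        self.fetched = []  # indices requested so far

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        self.fetched.append(i)
        return i * i

class ToyLoader:
    """Hypothetical, much-simplified stand-in for a PyTorch DataLoader."""
    def __init__(self, dataset, batch_size):
        self.dataset, self.batch_size = dataset, batch_size

    def __iter__(self):
        for start in range(0, len(self.dataset), self.batch_size):
            stop = min(start + self.batch_size, len(self.dataset))
            # One __getitem__ call per sample, then the samples are joined into a batch.
            yield [self.dataset[i] for i in range(start, stop)]

ds = SquaresDataset(5)
dl = ToyLoader(ds, batch_size=2)

first_batch = next(iter(dl))
print(first_batch)  # [0, 1]
print(ds.fetched)   # [0, 1]: only the items of the first batch were fetched
```

The loader never knows how an item is produced; it only indexes the dataset, so whatever __getitem__ does (such as opening an image file) happens at that moment.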

slawekbiel · May 11, 2019, 3:19pm

Here is what happens, step by step:

1. You get a dataloader from your ImageDataBunch for the set type you want, for example my_dl = data.dl(ds_type=DatasetType.Train).
2. You trigger the __iter__ method by iterating over the dataloader or by directly calling next(my_dl.__iter__()).
3. There are several levels of wrappers here, but the call basically propagates DeviceDataLoader->DataLoader->LabelList->ImageList.
4. On the ImageList it calls __getitem__() to load a single image which after a couple of more calls invokes open_image that uses the PIL library to load the actual image from the disk.

PeterY · May 11, 2019, 5:26pm

Hi Kornel, I know PyTorch and use it very often; my question is why __getitem__ calls the get method. Thanks for your patience!

PeterY · May 11, 2019, 5:28pm

Thanks slawekbiel!
I think you mean DataLoader->DeviceDataLoader->LabelList->ImageList. (I’m wrong)
Can you point me to where this happens? “load a single image which after a couple of more calls invokes open_image that uses the PIL library”

slawekbiel · May 11, 2019, 6:32pm

The order of the dataloaders was correct: the ImageDataBunch holds an instance of DeviceDataLoader, which in turn holds PyTorch’s DataLoader. You can easily verify things like that in a notebook.

type(data.train_dl), type(data.train_dl.dl)
(fastai.basic_data.DeviceDataLoader, torch.utils.data.dataloader.DataLoader)

The order of calls to get an item is:

LabelList::__getitem__()
ItemList::__getitem__()
ImageList::get()
ImageList::open()
open_image()
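
The same order can be reproduced with a small trace. The classes below reuse the fastai names only so the trace reads the same; they are toy stand-ins written for this illustration, not the library code, and open_image here is a fake that does not touch the disk:

```python
calls = []  # records the order in which the stand-in methods run

def open_image(path):
    # Stand-in: the real fastai function reads the file from disk with PIL.
    calls.append("open_image")
    return f"image({path})"

class ItemList:
    # Toy stand-in, not fastai's ItemList.
    def __init__(self, items):
        self.items = items

    def get(self, i):
        return self.items[i]

    def __getitem__(self, i):
        calls.append("ItemList.__getitem__")
        return self.get(i)

class ImageList(ItemList):
    # Toy stand-in: get() is overridden to open the stored path.
    def get(self, i):
        calls.append("ImageList.get")
        return self.open(super().get(i))

    def open(self, path):
        calls.append("ImageList.open")
        return open_image(path)

class LabelList:
    # Toy stand-in: pairs inputs (x) with labels (y).
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __getitem__(self, i):
        calls.append("LabelList.__getitem__")
        return self.x[i], self.y[i]

ll = LabelList(ImageList(["a.png", "b.png"]), ["cat", "dog"])
sample = ll[1]
print(sample)
print(calls)
```

Note that ImageList does not define __getitem__ itself; it inherits it from ItemList, which is why the ItemList step appears in the chain.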

PeterY · May 11, 2019, 7:34pm

Thanks slawekbiel!
I finally found that get is called in ItemList’s __getitem__.

https://github.com/fastai/fastai/blob/e5b98c9171c502bfab9734b31d3d85cd0ca83e44/fastai/data_block.py#L119

def add(self, items:'ItemList'):
    self.items = np.concatenate([self.items, items.items], 0)
    if self.inner_df is not None and items.inner_df is not None:
        self.inner_df = pd.concat([self.inner_df, items.inner_df])
    else: self.inner_df = self.inner_df or items.inner_df
    return self


def __getitem__(self,idxs:int)->Any:
    idxs = try_int(idxs)
    if isinstance(idxs, Integral): return self.get(idxs)
    else: return self.new(self.items[idxs], inner_df=index_row(self.inner_df, idxs))


@classmethod
def from_folder(cls, path:PathOrStr, extensions:Collection[str]=None, recurse:bool=True,
                include:Optional[Collection[str]]=None, processor:PreProcessors=None, presort:Optional[bool]=False, **kwargs)->'ItemList':
    """Create an `ItemList` in `path` from the filenames that have a suffix in `extensions`.
    `recurse` determines if we search subfolders."""
    path = Path(path)
    return cls(get_files(path, extensions, recurse=recurse, include=include, presort=presort), path=path, processor=processor, **kwargs)

joshiharshit5077 · July 21, 2022, 6:15am

Hey, I have been using the fastai v1 object detection library (ChristianMarzahl/ObjectDetection on GitHub) and I have a question about plotting the items of the “data” databunch. I need to know the distribution of images in the train and validation sets (for example, how many hard negatives and hard positives there are). How can I do that?
This is my code for loading data:


import numpy as np
train_samples_per_scanner = 3000
val_samples_per_scanner = 1000

train_images = list(np.random.choice(training_set, train_samples_per_scanner))
valid_images = list(np.random.choice(valid_set, val_samples_per_scanner))
batch_size = 64

do_flip = True
flip_vert = True 
max_rotate = 90 
max_zoom = 1.1 
max_lighting = 0.2
max_warp = 0.2
p_affine = 0.75 
p_lighting = 0.75 

tfms = get_transforms(do_flip=do_flip,
                      flip_vert=flip_vert,
                      max_rotate=max_rotate,
                      max_zoom=max_zoom,
                      max_lighting=max_lighting,
                      max_warp=max_warp,
                      p_affine=p_affine,
                      p_lighting=p_lighting)

train, valid ,test = ObjectItemListSlide(train_images), ObjectItemListSlide(valid_images), ObjectItemListSlide(test_images)
item_list = ItemLists(".", train, test)
lls = item_list.label_from_func(lambda x: x.y, label_cls=SlideObjectCategoryList)
lls = lls.transform(tfms, tfm_y=True, size=patch_size)
data = lls.databunch(bs=batch_size, collate_fn=bb_pad_collate,num_workers=0).normalize()

Here the training set and validation set are lists of object_detection_fastai.helper.wsi_loader.SlideContainer objects.
I want a plot of how many items belong to each class ([0,1,2]=[‘background’, ‘hard negative’, ‘mitotic figure’]).
All suggestions, code patches, and notebooks are welcome; please share whatever resources you have for this problem.
Thank you in advance,
Harshit

joshiharshit5077 · July 21, 2022, 6:47am

Data as shown by the Learner object:

learn

data=ImageDataBunch;

Train: SlideLabelList (3000 items)
x: ObjectItemListSlide
Image (3, 256, 256),Image (3, 256, 256),Image (3, 256, 256),Image (3, 256, 256),Image (3, 256, 256)
y: SlideObjectCategoryList
ImageBBox (256, 256),ImageBBox (256, 256),ImageBBox (256, 256),ImageBBox (256, 256),ImageBBox (256, 256)
Path: .;

Valid: SlideLabelList (1000 items)
x: ObjectItemListSlide
Image (3, 256, 256),Image (3, 256, 256),Image (3, 256, 256),Image (3, 256, 256),Image (3, 256, 256)
y: SlideObjectCategoryList
ImageBBox (256, 256),ImageBBox (256, 256),ImageBBox (256, 256),ImageBBox (256, 256),ImageBBox (256, 256)
Path: