I'm having a hard time solving the following problem:
I have 3 datasets and a common test set for image segmentation.
This is my folder structure:
Datasets/
- Dataset1/
  - Images/
  - Label/
- Dataset2/
  - Images/
  - Label/
- Dataset3/
  - Images/
  - Label/
- test/
  - Images/
  - Label/
I need to train a separate model on each dataset and validate each of them against the same test set.
I have tried using the DataBlock API:
path = Path("Datasets/")
manual = DataBlock(blocks=(ImageBlock, MaskBlock(codes)),
                   get_items=partial(get_image_files, folders=["Dataset1", "test"]),
                   splitter=GrandparentSplitter(valid_name='test'),
                   get_y=get_y_fn_manual,
                   item_tfms=Resize((size, size)),
                   batch_tfms=Normalize.from_stats(*imagenet_stats))
manual.summary(path)
dls = manual.dataloaders(path, bs=1)
I don't fully understand how the get_items step works.
In any case, it keeps throwing the following error:
Setting-up type transforms pipelines
Collecting items from ../datasets
Found 0 items
2 datasets of sizes 0,0
Setting up Pipeline: PILBase.create
---------------------------------------------------------------------------
TypeError: 'NoneType' object is not iterable
Then I tried the following code:
manual = DataBlock(blocks=(ImageBlock, MaskBlock(codes)),
                   get_items=partial(get_image_files, folders=["Dataset1", "test"]),
                   get_y=get_y_fn_manual,
                   splitter=GrandparentSplitter(train_name="Dataset1", valid_name="test"),
                   item_tfms=Resize((size, size)),
                   batch_tfms=Normalize.from_stats(*imagenet_stats))
manual.summary(path)
dls = manual.dataloaders(path, bs=1)
It's creating the databunch now; however, it doesn't distinguish between the Images and Label files.