Creating custom ItemList for segmentation masks


Hi everyone!

I’m trying to tackle the 2018 Data Science Bowl to get some practice with medical segmentation, and I’d like to do it with the fastai library. So far I’ve worked around it by building custom DataBunches directly from the constructor, using custom dataloaders and datasets. Now I’d like to use the data_block API to make it cleaner, but there are a lot of things I don’t understand and I’m getting lost in the docs and the source code.
The competition consists of detecting nuclei in cell images. To do that, we are given a training set that contains the original images and, for each of them, mask images for every nucleus to be detected (one mask per nucleus). The data is organized so that the image with the id id can be found in id/images/id.png and its masks are all in the folder id/masks. We also have a csv with the run-length encoding of the masks along with the corresponding image id (one line per mask). However, I’d rather not use the RLE file, as I compute the metric directly from the combined mask (the sum of all nuclei masks). The evaluation metric is a custom mean_iou that you can get more information on here, but as it is quite complicated, I will not detail it. The important thing to note is that finding a clear separation between nuclei matters. To begin with, my training target is the combined mask (btw, I wonder if there is a way to train on targets of different sizes, for instance if I want to train on lists of nuclei centers and radii).
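For readers following along, the folder layout described above can be sketched with pathlib. This is only an illustration: the helper names `mask_dir_for` and `is_input_image` are hypothetical, and the code assumes every path follows the id/images/id.png and id/masks convention.

```python
from pathlib import Path

def mask_dir_for(image_path):
    """Map id/images/id.png to the sibling folder id/masks (assumed layout)."""
    image_path = Path(image_path)
    # parent is id/images, parent.parent is id
    return image_path.parent.parent / "masks"

def is_input_image(path):
    """True for files that sit directly inside an images/ folder."""
    return Path(path).parent.name == "images"

example = Path("stage1_train/abc/images/abc.png")
print(mask_dir_for(example))    # folder holding the per-nucleus masks
print(is_input_image(example))  # input image, not a mask
```

A function like `mask_dir_for` is the kind of thing one would hand to a label function, since it turns an input path into the location of its target.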
Now, here are my remarks/questions:

  • Images are obviously of varying sizes, so I decided to train on random crops. I therefore need to apply the exact same transformation to the mask image. How is that supposed to happen with the data_block API? I have a hard time understanding how and when in the pipeline transforms are applied. As I understand it, I can pass a list of transforms to the ItemList, so if I give it a random cropping function, how can I make it apply the same transformation to an image and its mask but still have it change for every image?
  • What does the c attribute correspond to in an ImageList?
  • I guess here my label_cls is ImageList (or a custom MaskList)?
  • To get labels, I need to call label_from_func with a custom func of mine, right (as I don’t have labels but masks)?
  • If I want to add a test set, I have another problem, as I use overlapping crops for prediction. Do I need to create a custom learner that overrides predict to do that?
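The overlapping-crop prediction mentioned in the last point can be sketched independently of any library. This is a minimal NumPy sketch, not fastai code: `tile_starts` and `predict_tiled` are hypothetical helpers, `predict_fn` stands in for whatever the model does on one crop, and it is assumed to return an array with the same height and width as the crop it receives.

```python
import numpy as np

def tile_starts(length, tile, stride):
    """Start offsets so that tiles of size `tile` cover [0, length)."""
    if tile >= length:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)  # last tile sits flush with the border
    return starts

def predict_tiled(image, predict_fn, tile=256, stride=128):
    """Average predict_fn outputs over overlapping tiles of an image."""
    if not 0 < stride <= tile:
        raise ValueError("stride must be in (0, tile] so tiles leave no gaps")
    h, w = image.shape[:2]
    total = np.zeros((h, w), dtype=np.float64)
    count = np.zeros((h, w), dtype=np.float64)
    for top in tile_starts(h, tile, stride):
        for left in tile_starts(w, tile, stride):
            patch = image[top:top + tile, left:left + tile]
            total[top:top + tile, left:left + tile] += predict_fn(patch)
            count[top:top + tile, left:left + tile] += 1
    # Every pixel is covered at least once, so count is never zero.
    return total / count

# Sanity check with an identity "model": averaging must return the input.
demo = np.random.default_rng(0).random((300, 500))
restored = predict_tiled(demo, lambda patch: patch, tile=128, stride=96)
print(np.allclose(restored, demo))
```

Plain averaging is the simplest way to merge overlaps; weighting tile centers more than borders is a common refinement but is not shown here.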

As I was writing this I answered some of the other questions I had myself, but in general I’m still quite lost on how to tackle this specific dataset using the data_block API (which is why this text is quite messy, I’m sorry). If you have any insights or ideas on how to do it, I would appreciate it a lot!




You should start by checking this tutorial, which has an example of loading data for a segmentation task. Since your targets are the masks obtained by summing the nuclei, I think you can use SegmentationItemList (which is just an ImageList with default label_cls to SegmentMask).

To make sure the transforms are properly applied to your masks, you just need to pass tfm_y=True in the call to transform.



Thanks! My main concern is whether random transforms will be the same for the image and the corresponding mask. Is that the case? And do you have any insights about how I can change the test-time behaviour of the learner?

EDIT: Yeah, I just noticed the answer to my transform question is in the tutorial; I shall read more carefully in the future. However, I’m still not sure about something: even if I use SegmentationItemList, isn’t it supposed to load only one mask per image? Do I still need a custom implementation of SegmentMask to override the open function so that it sums over all masks in the folder?



I have another question about transforms: if I want to use a random transform from pytorch, for instance torchvision.transforms.RandomCrop, do I just need to pass something like RandTransform(Transform(RandomCrop(size))) (I know I have to convert to PIL image and back to tensors, but that is not my issue here)? Or do I actually need to get the functional version and then let the RandTransform compute the parameters on its own? I know that for cropping I can directly use the fastai version, but it is in case I want to do more bizarre things afterwards.



If you use torchvision transforms, it won’t apply the same thing to the target, so you should use fastai transforms (there is a random crop too). As for your other questions, yes you will need to write a custom open method to sum over the masks.
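To make the point about shared randomness concrete, here is a minimal NumPy sketch of a crop whose random offsets are drawn once and reused for the target. It is not how fastai implements it; `paired_random_crop` is a hypothetical helper, and it assumes the image and mask are arrays with the same height and width.

```python
import numpy as np

def paired_random_crop(image, mask, size, rng=None):
    """Cut the same random size x size window out of an image and its mask."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = mask.shape[:2]
    if image.shape[:2] != (h, w):
        raise ValueError("image and mask must share height and width")
    if size > h or size > w:
        raise ValueError("crop size is larger than the image")
    # Draw the offsets once, then reuse them for both arrays.
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    window = (slice(top, top + size), slice(left, left + size))
    return image[window], mask[window]

# Toy data where the mask is a known function of the image,
# so alignment of the two crops can be checked.
img = np.arange(120).reshape(10, 12)
msk = img % 2
img_crop, msk_crop = paired_random_crop(img, msk, 4, np.random.default_rng(0))
print(np.array_equal(img_crop % 2, msk_crop))
```

Calling a random transform separately on the image and on the mask would draw two different windows, which is the failure mode described above.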



thanks a lot!


(Terry) #7

Hi ! :vulcan_salute:

I was going to post about the datascience bowl 2018 since I didn’t find many fastai enthusiasts on this kaggle, but since this has just been posted here, I will ask here.

I am also trying to use the data block API, but I am struggling with how to input the data. I fooled around with pandas DataFrames, grabbing the list of image paths and merging the masks, but that’s it. SegmentationItemList gives me a ValueError: setting an array element with a sequence.

My notebook is available here:

I would also appreciate it if someone has a working notebook with fastai for this Kaggle competition.
Disclaimer: I’m a biologist, so quite the beginner at coding. Be kind :slight_smile:

Thanks !




I managed to make it work, I am going to create a clean notebook and share it here as soon as it is finished (probably later today or tomorrow). Stay tuned!


(Antonio de Perio) #9

Hi! I’m working on the exact same thing right now, and this is how I am parsing out the input training images:

# train and test directories
path_img_train = base_dir + 'stage1_train/' # need to split this folder into train and val sets
path_img_test = base_dir + 'stage1_test/' # images only, use to test

# When we grab images from_folder, we also grab all of the masks
# we want to filter out all of the mask images
def filter_only_training_images(file_path):
    # Drop anything inside a masks/ folder; keep everything else
    if Path(file_path).match('*/masks/*'):
        return False
    return True

# Create a segmentation list
itemList = (SegmentationItemList


This seems to be working well up to this point.

Now I’m actually working out how to parse the masks as labels for these inputs. If anyone can give any tips on this that would be great for me too! :slight_smile: There are multiple masks for this dataset that don’t overlap, and the approach I’m thinking through at the moment is how to combine these masks into one single mask so that I can use one of the label functions (ie label_from_func) in the data block api. Still haven’t figured this out yet…
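One way to merge the per-nucleus masks into a single target is an element-wise union. The following is a minimal NumPy sketch under stated assumptions: `combine_masks` is a hypothetical helper, every mask is already loaded as an array of the same shape, and any non-zero pixel counts as foreground. Reading the files from the masks folder (for instance with PIL) is left out on purpose.

```python
import numpy as np

def combine_masks(masks):
    """Merge per-nucleus masks of identical shape into one 0/1 mask."""
    masks = [np.asarray(m) for m in masks]
    if not masks:
        raise ValueError("need at least one mask")
    combined = np.zeros(masks[0].shape, dtype=bool)
    for m in masks:
        if m.shape != combined.shape:
            raise ValueError("all masks must have the same shape")
        combined |= m > 0  # union of foreground pixels
    return combined.astype(np.uint8)

# Two tiny non-overlapping "nuclei" stored as 0/255 images.
first = np.array([[255, 0, 0], [0, 0, 0]], dtype=np.uint8)
second = np.array([[0, 0, 0], [0, 0, 255]], dtype=np.uint8)
print(combine_masks([first, second]))
```

A union rather than a plain sum keeps the result in {0, 1} even if two masks happened to overlap or are stored as 0/255. Note that touching nuclei merge into one blob this way, so the separation between nuclei has to be recovered by other means.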




I’m moving away from the fastai library for this particular work as I find it a bit limiting, but you can find what I basically did to make it work in this notebook. It is a bit messy and I didn’t take the time to clean the outputs, but it basically works. If you have any remarks, feel free to share them, of course!


(Antonio de Perio) #11

Hi @florobax thanks for this! This will be useful for me :slight_smile: Just trying to get to a point where I can actually do a training run.

Can I ask what part of fastai you found limiting on this particular dataset?


(Patrick Mccaffrey) #12

Thanks for sharing your notebook @florobax!

It is so interesting to see how someone else attacked the problem.
What score did you get in the end? The best Private score I got was 0.00227, which seems bad.

It was a mission: my first working Kaggle submission.

Things I found challenging were combining the masks, making the DataBunch, and scaling the test images.

I’ll try clean up my notebook and post it here as well.



Hi !
To answer @adeperio: the first problem is that I can’t find a good way to integrate my test pipeline with fastai. Besides, every time I need to add a custom part to it, I find myself losing 5 hours reading the docs and the source code, as some parts feel very unnatural to me, for instance when I tried to implement additional transforms. Finally, some features, like custom batch samplers, are either not handled by the API or well hidden. All in all, I find it more fitting to create my own mini library that will work exactly as I want it to.

As for @musedivision, I got up to 0.37503 with fastai (and 0.41547 without using the API but still using the one-cycle policy). However, I got very low results every time I tried to use normalize() on the dataset, which makes me think it should not be used here (besides, exploring the source code of the winners, I can’t find a single mention of the word “normalize”, hence I guess it is useless). Good luck with this competition; I have been working on it for a month and I am still far from a standout result^^ If you’re interested, I created a repo with my source code (it is a work in progress, so still messy and, more importantly, not functional yet), but it could give you inspiration on some steps. To be more precise, branch master does work (it is my last version with fastai integration), while branch full_refactor is absolutely not finished for now. The code is not commented, so don’t hesitate to ask questions if you’re curious. I also encourage you to check the GitHub repos of top-ranking teams once you are familiar with the basics; it helps with ideas for what to try next. Good luck!
