No results or error from df to label_func

Hello,

I can’t find a good solution here; it would be great if someone could help.

I can’t get the labels from the df with get_y = label_func.

Is there a better way or a workaround? With my current approach I also get an error I can’t fix:

def label_func(path):
    df_train.image_id[df_train['img_path']==str(path)].values
item_tfms = RandomResizedCrop(64, min_scale=0.75, ratio=(1.,1.))
batch_tfms = [*aug_transforms(size=64, max_warp=0), Normalize.from_stats(*imagenet_stats)]

bs=64

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func,
                   splitter  = RandomSplitter(),
                   item_tfms = Resize(64))
dls = dblock.dataloaders(path/"train")
dls.show_batch()

The error:


Thank you very much for your help.

Hey,
Right now your label_func is returning the image_id for an img_path; is that what you’re trying to predict? That’s the only obvious thing I can see, so if you need more help, the easiest thing would be to make your Kaggle notebook public (File > Share, select Public from the dropdown) and post the link here, so we can have a go at it ourselves :slight_smile:

1 Like

Thank you for the reply. I missed mentioning something obvious, sorry. :slight_smile: I want to label the pictures with the “cancer” column of df_train. The image_id is matched with the img_path, but I haven’t found an easy way to label the pictures with the 0/1 values of the cancer column.

When I comment out get_y = label_func it works, but then it labels the pictures with the img_path.

Commented out:

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   # get_y     = label_func,
                   splitter  = RandomSplitter(),
                   item_tfms = Resize(64))

The batch then looks like this:

df_train['img_path']==str(path) matches the path against the entries of the img_path column; df_train.image_id[…].values returns the values of the image_id column for the rows that match the path (hope that makes sense).
If you want to return the values of the cancer column instead, you can change your label function to:

def label_func(path):
    df_train.cancer[df_train['img_path']==str(path)].values 
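To make that boolean-mask lookup concrete, here is a minimal self-contained sketch; the toy df_train and the paths in it are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data standing in for df_train
df_train = pd.DataFrame({
    "img_path": ["train/1/10.png", "train/1/11.png", "train/2/20.png"],
    "image_id": [10, 11, 20],
    "cancer":   [0, 1, 0],
})

path = "train/1/11.png"
# Boolean mask: True for rows whose img_path matches `path`
mask = df_train["img_path"] == str(path)
# Select the cancer values of the matching rows; note this is an array
labels = df_train.cancer[mask].values
print(labels)
```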

Does this fix your issue? :slight_smile:

1 Like

Thank you for the suggestion, but with your function I get this error:

I don’t know why.

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_23/1823673056.py in <module>
      6                    item_tfms = Resize(64))
      7 
----> 8 dls = dblock.dataloaders(path/"train")
      9 dls.show_batch()

/opt/conda/lib/python3.7/site-packages/fastai/data/block.py in dataloaders(self, source, path, verbose, **kwargs)
    153         **kwargs
    154     ) -> DataLoaders:
--> 155         dsets = self.datasets(source, verbose=verbose)
    156         kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
    157         return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)

/opt/conda/lib/python3.7/site-packages/fastai/data/block.py in datasets(self, source, verbose)
    145         splits = (self.splitter or RandomSplitter())(items)
    146         pv(f"{len(splits)} datasets of sizes {','.join([str(len(s)) for s in splits])}", verbose)
--> 147         return Datasets(items, tfms=self._combine_type_tfms(), splits=splits, dl_type=self.dl_type, n_inp=self.n_inp, verbose=verbose)
    148 
    149     def dataloaders(self, 

/opt/conda/lib/python3.7/site-packages/fastai/data/core.py in __init__(self, items, tfms, tls, n_inp, dl_type, **kwargs)
    449     ):
    450         super().__init__(dl_type=dl_type)
--> 451         self.tls = L(tls if tls else [TfmdLists(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
    452         self.n_inp = ifnone(n_inp, max(1, len(self.tls)-1))
    453 

/opt/conda/lib/python3.7/site-packages/fastai/data/core.py in <listcomp>(.0)
    449     ):
    450         super().__init__(dl_type=dl_type)
--> 451         self.tls = L(tls if tls else [TfmdLists(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
    452         self.n_inp = ifnone(n_inp, max(1, len(self.tls)-1))
    453 

/opt/conda/lib/python3.7/site-packages/fastcore/foundation.py in __call__(cls, x, *args, **kwargs)
     96     def __call__(cls, x=None, *args, **kwargs):
     97         if not args and not kwargs and x is not None and isinstance(x,cls): return x
---> 98         return super().__call__(x, *args, **kwargs)
     99 
    100 # %% ../nbs/02_foundation.ipynb 46

/opt/conda/lib/python3.7/site-packages/fastai/data/core.py in __init__(self, items, tfms, use_list, do_setup, split_idx, train_setup, splits, types, verbose, dl_type)
    363         if do_setup:
    364             pv(f"Setting up {self.tfms}", verbose)
--> 365             self.setup(train_setup=train_setup)
    366 
    367     def _new(self, items, split_idx=None, **kwargs):

/opt/conda/lib/python3.7/site-packages/fastai/data/core.py in setup(self, train_setup)
    390             for f in self.tfms.fs:
    391                 self.types.append(getattr(f, 'input_types', type(x)))
--> 392                 x = f(x)
    393             self.types.append(type(x))
    394         types = L(t if is_listy(t) else [t] for t in self.types).concat().unique()

/opt/conda/lib/python3.7/site-packages/fastcore/transform.py in __call__(self, x, **kwargs)
     79     @property
     80     def name(self): return getattr(self, '_name', _get_name(self))
---> 81     def __call__(self, x, **kwargs): return self._call('encodes', x, **kwargs)
     82     def decode  (self, x, **kwargs): return self._call('decodes', x, **kwargs)
     83     def __repr__(self): return f'{self.name}:\nencodes: {self.encodes}decodes: {self.decodes}'

/opt/conda/lib/python3.7/site-packages/fastcore/transform.py in _call(self, fn, x, split_idx, **kwargs)
     89     def _call(self, fn, x, split_idx=None, **kwargs):
     90         if split_idx!=self.split_idx and self.split_idx is not None: return x
---> 91         return self._do_call(getattr(self, fn), x, **kwargs)
     92 
     93     def _do_call(self, f, x, **kwargs):

/opt/conda/lib/python3.7/site-packages/fastcore/transform.py in _do_call(self, f, x, **kwargs)
     95             if f is None: return x
     96             ret = f.returns(x) if hasattr(f,'returns') else None
---> 97             return retain_type(f(x, **kwargs), x, ret)
     98         res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
     99         return retain_type(res, x)

/opt/conda/lib/python3.7/site-packages/fastcore/dispatch.py in retain_type(new, old, typ, as_copy)
    186     # e.g. old is TensorImage, new is Tensor - if not subclass then do nothing
    187     if new is None: return
--> 188     assert old is not None or typ is not None
    189     if typ is None:
    190         if not isinstance(old, type(new)): return new

AssertionError: 

It’s hard to guess what the issue is from that error message alone, but if the dataloaders work when you comment out the label_func, that’s where to look. Could you try df['cancer'].isnull().sum() and, if the value is not 0, drop those rows:

df = df[~df['cancer'].isnull()]

just in case there is an instance that doesn’t have a label?
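For illustration, here is a tiny self-contained version of that null check on made-up data:

```python
import pandas as pd
import numpy as np

# Toy frame with one missing label (made-up data)
df = pd.DataFrame({"image_id": [1, 2, 3], "cancer": [0, np.nan, 1]})

# Count rows that have no label
n_missing = df["cancer"].isnull().sum()
print(n_missing)

# Drop the rows without a label
df = df[~df["cancer"].isnull()]
print(len(df))
```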

Not sure what else it could be… but if you share your notebook I could have a closer look :wink:

There are no missing values, and all values are int64.

Sure! I used an edited dataset from another user, who was kind enough to share a dataset he transformed from DICOM files into .png files.

I downloaded it and uploaded it into my kernel, but Kaggle seems to have an issue saving the notebook properly.^^

That is my notebook: RSNA Screening Mammography Breast Cancer Detect | Kaggle

I’m not exactly sure, but I think you use this dataset. If that is the case, you can add it to your kernel via the “+Add Data” button in the right console by searching for rsna-mammography-images-as-pngs, and spare yourself the down- and uploading :wink:.

To the issue:
Your label_func is missing the return keyword…

Adding that got me to a couple of other problems:

  • The file paths had a different structure than the generated img_paths in the dataframe, and I had to change this to make them match:
df_train['img_path'] = df_train.apply(
    lambda i: os.path.join(
        #f"{trn_path}", str(i['patient_id']) + "_" + str(i['image_id']) + '.png'
        f"{trn_path}", str(i['patient_id']) + "/" + str(i['image_id']) + '.png'
    ), axis=1
)

but that could easily be because I loaded the dataset in a different way.

  • Something complained about the return type being an array, so I return the plain item instead:
def label_func(path):
    # df_train.cancer[df_train['img_path']==str(path)].values
    return df_train.cancer[df_train['img_path']==str(path)].item()
  • Lastly: creating the dataloaders took a very long time. I guess that’s because each label has to be looked up in a pretty big dataframe, so an alternative way to get your labels could be this:
label_lookup = {str(row['image_id']):row['cancer'] for _,row in df_train.iterrows()}

def label_func(path):
    return label_lookup[path.stem]

which sped the process up a lot.
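To show the dictionary lookup end to end, here is a runnable, self-contained sketch; the toy df_train and the path are made up, and it assumes (as fastai does with get_items = get_image_files) that get_y receives a pathlib.Path, so path.stem is the filename without its extension:

```python
import pandas as pd
from pathlib import Path

# Made-up frame standing in for df_train
df_train = pd.DataFrame({
    "image_id": [10, 11, 20],
    "cancer":   [0, 1, 0],
})

# Build the lookup once: image_id (as a string) -> cancer label
label_lookup = {str(row["image_id"]): row["cancer"] for _, row in df_train.iterrows()}

def label_func(path):
    # Path("train/1/11.png").stem == "11", which matches the image_id key
    return label_lookup[path.stem]

print(label_func(Path("train/1/11.png")))
```

The dict is built once up front, so each label is an O(1) lookup instead of a full scan of the dataframe per image.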

Hope that solves everything :slight_smile:

1 Like

Thank you very much for this awesome help! Can’t say how much I appreciate that! :slight_smile:

I tried out both solutions, and at first I thought it would work, but it seems the “slower” function only takes ~10,000 pictures, and the other one says it has an issue with picture xyz. When I drop that picture it finds the next one, and so on, so I guess the difference is between the ~50k pictures and the ~10k that were labeled correctly. Maybe it has something to do with the sample?

I did not use Radek’s; I took this one: ⭐️⭐️ Breast Cancer - ROI (brest) extractor ⭐️⭐️ | Kaggle

I think he offered it before Radek, so I took it, and I liked it because he used YOLO to cut out just the breast in the picture, so there is much less black and the picture is more “focused”, but it seems it won’t work with that.^^

So I will probably also use Radek’s pictures; then your function will work. :slight_smile:

Thanks for the advice about “+Add Data”; I thought it would only work with the data Kaggle is offering.

Yeah, you linked to the notebook that creates the ROI rather than the finished dataset, and I was too lazy to read the content and assumed you meant the input dataset from Radek, sorry :slight_smile: So my second try is this dataset; is this the correct one? :joy: (If that’s not the case, I think making your private dataset public would be the easiest solution.)

I couldn’t reproduce that with either dataset. For the ROI dataset I had to adjust my “fast” label_func for the different file names; other than that, everything seems to work fine:

Hope this finally solves the issue :slight_smile:

1 Like

Problem solved thanks to the great help of @benkarr!

I really appreciate your help, thank you again!

1 Like