FuncSplitter taking three hours?

It takes three hours before my model even starts training, because the code below takes that long to run. The dataset is 2.5 million 214x214-pixel images, about 5 GB in total, but I suspect the slowness comes from FuncSplitter rather than from the data itself.

cards_df = pd.read_csv("/kaggle/input/pump-cards/header.csv", dtype={"Api12": str})
ids_non_holdout = cards_df.loc[cards_df["Api12"].str[-3] == "0", "id"]  
ids_non_holdout = [str(i) + ".png" for i in ids_non_holdout]  # list of filenames for training on
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock), 
    get_items=get_image_files, 
    splitter=FuncSplitter(lambda o: o.name in ids_non_holdout),
    get_y=parent_label,
).dataloaders(path)

Is it looking up each image one by one in this large list? Is there a more efficient way to split the test/train set?

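Yes, and yes. Every `o.name in ids_non_holdout` check scans the Python list from the start, and that happens once per image, so the cost grows with (number of images) x (length of the list). Use pandas to create the split indices directly, and hand those indices to your splitter.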
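For illustration, the smallest change that avoids the repeated list scan is to convert the list to a set once. This is only a sketch: the file names and the helper `is_holdout` are made up, and it assumes FuncSplitter sends items for which the function returns True to the validation set (as I understand the fastai docs), so the predicate below flags the holdout files rather than the training files.

```python
from pathlib import Path

# Hypothetical stand-ins for the real data
ids_non_holdout = ["1.png", "3.png"]   # training file names (a plain list)
items = [Path("x/1.png"), Path("x/2.png"), Path("y/3.png")]

# Build the set once; membership tests are then O(1) on average
# instead of a linear scan of the list for every image.
train_names = set(ids_non_holdout)

def is_holdout(o):
    """Return True when the file is NOT one of the training files."""
    return o.name not in train_names

flags = [is_holdout(o) for o in items]
print(flags)
```

In the DataBlock this would be passed as `splitter=FuncSplitter(is_holdout)`, under the assumption stated above about which side True maps to.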

I had some trouble getting that to work:

imgs = get_image_files(path)
bool_list = [o.name for o in imgs]
bool_list = np.isin(bool_list, ids_non_holdout)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock), 
    get_items=imgs, 
    splitter=bool_list,
    get_y=parent_label,
).dataloaders(path)

That yields an error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_17/3322969700.py in <module>
      4     splitter=bool_list,
      5     get_y=parent_label,
----> 6 ).dataloaders(path)

/opt/conda/lib/python3.7/site-packages/fastai/data/block.py in dataloaders(self, source, path, verbose, **kwargs)
    156         **kwargs
    157     ) -> DataLoaders:
--> 158         dsets = self.datasets(source, verbose=verbose)
    159         kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
    160         return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)

/opt/conda/lib/python3.7/site-packages/fastai/data/block.py in datasets(self, source, verbose)
    145     ) -> Datasets:
    146         self.source = source                     ; pv(f"Collecting items from {source}", verbose)
--> 147         items = (self.get_items or noop)(source) ; pv(f"Found {len(items)} items", verbose)
    148         splits = (self.splitter or RandomSplitter())(items)
    149         pv(f"{len(splits)} datasets of sizes {','.join([str(len(s)) for s in splits])}", verbose)

TypeError: 'L' object is not callable

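Hope you already solved that. If not: the traceback points at get_items (line 147 of block.py), which has to be a function. It was given the list imgs itself, hence "'L' object is not callable". The splitter has to be a function as well: it takes the items and returns the splits as two lists of indices (training first, then validation), not a boolean mask. You could try:

get_items=lambda p: imgs,
splitter=lambda items: (np.flatnonzero(bool_list).tolist(), np.flatnonzero(~bool_list).tolist()),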
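A slightly more defensive variant is sketched below. It turns a boolean mask into the pair of index lists that, as far as I know, fastai expects a splitter to return, and it checks that the mask lines up with the items. The name `mask_splitter` is hypothetical and not part of fastai; the mask is assumed to be True for training items, as bool_list is in the post above.

```python
def mask_splitter(train_mask):
    """Build a splitter from a boolean mask (True = training item)."""
    def _split(items):
        # The mask must have been computed over the same items, in the same order.
        assert len(items) == len(train_mask), "mask must line up with items"
        train_idxs = [i for i, m in enumerate(train_mask) if m]
        valid_idxs = [i for i, m in enumerate(train_mask) if not m]
        return train_idxs, valid_idxs
    return _split

# Tiny made-up example
split = mask_splitter([True, False, True, True])
result = split(["a", "b", "c", "d"])
print(result)
```

With the variables from the thread this would be `splitter=mask_splitter(bool_list)`, together with a get_items that returns the same imgs the mask was built from.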