It takes three hours before my model even starts training, because the code below is so slow. The dataset is 2.5 million 214x214 pixel images, amounting to 5 GB of data, but I think the slowness is coming from FuncSplitter.
import pandas as pd
from fastai.vision.all import *

cards_df = pd.read_csv("/kaggle/input/pump-cards/header.csv", dtype={"Api12": str})
ids_non_holdout = cards_df.loc[cards_df["Api12"].str[-3] == "0", "id"]
ids_non_holdout = [str(i) + ".png" for i in ids_non_holdout]  # list of filenames for training on

dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=FuncSplitter(lambda o: o.name in ids_non_holdout),
    get_y=parent_label,
).dataloaders(path)
Is it looking up each image one by one in this large list? Is there a more efficient way to split the train/test set?
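To check whether the list lookup is really the bottleneck, I wrote this small self-contained timing sketch. The 2.5 million synthetic filenames are stand-ins for my real list, and the numbers are illustrative; the point is that `in` on a Python list scans element by element (O(n) per lookup), while `in` on a set is a hash lookup (average O(1)):

```python
import time

# Synthetic stand-in for my real filename list: 2.5 million entries.
names = [f"{i}.png" for i in range(2_500_000)]
name_set = set(names)  # one-time conversion, O(n)

probe = "2499999.png"  # worst case for the list: the last element

t0 = time.perf_counter()
found_list = probe in names        # scans up to 2.5M elements
list_time = time.perf_counter() - t0

t0 = time.perf_counter()
found_set = probe in name_set      # single hash lookup
set_time = time.perf_counter() - t0

print(found_list, found_set)   # True True
print(set_time < list_time)    # the set lookup is much faster
```

If this is right, then converting `ids_non_holdout` to a set once before building the `DataBlock` (so the splitter's lambda does `o.name in ids_non_holdout_set`) should turn 2.5 million O(n) scans into 2.5 million O(1) lookups, without changing which files end up in each split.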