GrandparentSplitter - test/valid data not selected

gegraham · March 30, 2024, 3:07pm

I have manually split my data into train and test sets. This is what my data directory looks like:

pothole_or_not/
   test/
   train/
     pothole/
     road/

And this is my datablock:

dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=GrandparentSplitter(train_name='train', valid_name='test'),
    get_y=parent_label,
    item_tfms=[Resize(192, method='squish')]
).dataloaders(path)

print(f"Train paths: {dls.train.items}")
print(f"Test paths: {dls.valid.items}")

After training the model, the “valid_loss” and “metric” columns all show “None”, which suggests that the test data was not used during training. The test directory contains just the images; unlike the train directory, the images are not in any sub-directories.
I also get a warning that says “Your generator is empty”.
What do I do? I have been stuck on this for a while. Thank you.

vbakshi · April 1, 2024, 5:57am

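I think the issue is that you have not labeled the data being used for the validation set (the test folder), so parent_label cannot return the classes pothole and road, which are needed when calculating validation loss and metrics. In addition, GrandparentSplitter matches on the name of each file's grandparent folder, so images sitting directly under test/ are not picked up for the validation set at all. I suggest trying a folder structure as follows: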

pothole_or_not/
   test/
     pothole/
     road/
   train/
     pothole/
     road/
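
To see why the extra folder level matters, here is a minimal plain-Python sketch of the grandparent-name check. It is not fastai's actual code, only an approximation of the behaviour as I understand it; the function name grandparent_split and the file names are made up for illustration.

```python
from pathlib import Path

def grandparent_split(items, train_name="train", valid_name="valid"):
    """Return (train_idxs, valid_idxs) based on each file's grandparent folder name."""
    train_idxs = [i for i, p in enumerate(items)
                  if Path(p).parent.parent.name == train_name]
    valid_idxs = [i for i, p in enumerate(items)
                  if Path(p).parent.parent.name == valid_name]
    return train_idxs, valid_idxs

# Hypothetical file names, for illustration only.
flat_layout = [
    Path("pothole_or_not/train/pothole/a.jpg"),
    Path("pothole_or_not/train/road/b.jpg"),
    Path("pothole_or_not/test/c.jpg"),          # grandparent is "pothole_or_not"
]
labeled_layout = [
    Path("pothole_or_not/train/pothole/a.jpg"),
    Path("pothole_or_not/train/road/b.jpg"),
    Path("pothole_or_not/test/pothole/c.jpg"),  # grandparent is "test"
]

flat_split = grandparent_split(flat_layout, valid_name="test")
labeled_split = grandparent_split(labeled_layout, valid_name="test")
print(flat_split)     # the image directly under test/ lands in neither set
print(labeled_split)  # the image under test/pothole/ lands in the validation set
```

With the flat layout the validation index list comes back empty, which would be consistent with the “Your generator is empty” warning.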

If you want to set aside a true test set (to use after training), then I suggest a folder structure like:

pothole_or_not/
   test/
   valid/
     pothole/
     road/
   train/
     pothole/
     road/

In that case, your splitter would become GrandparentSplitter(train_name='train', valid_name='valid').
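
If you do not have a labeled valid/ folder yet, one way to get that layout is to move a random fraction of each class folder out of train/. The sketch below is only one possible approach; the helper name carve_valid_set, the 20% default, and the seed are my assumptions. It moves files on disk, so try it on a copy of your data first.

```python
import random
import shutil
from pathlib import Path

def carve_valid_set(root, valid_pct=0.2, seed=42):
    """Move a random fraction of each train/<class>/ folder into valid/<class>/.

    Returns a dict mapping class name to the number of files moved.
    """
    root = Path(root)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    moved = {}
    class_dirs = sorted(p for p in (root / "train").iterdir() if p.is_dir())
    for class_dir in class_dirs:
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        n_valid = int(len(files) * valid_pct)
        dest = root / "valid" / class_dir.name
        dest.mkdir(parents=True, exist_ok=True)
        for f in files[:n_valid]:
            shutil.move(str(f), str(dest / f.name))
        moved[class_dir.name] = n_valid
    return moved
```

For example, carve_valid_set("pothole_or_not") would leave test/ untouched and create valid/pothole/ and valid/road/ next to train/.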

If you have this code in a Google Colab or Kaggle notebook, please share it, as that helps with troubleshooting.

gegraham · April 1, 2024, 10:40am

Thank you for your response. Indeed, GrandparentSplitter only handles data laid out in this fashion.