Empty label list for test dataset using from_folder

uulwake · October 24, 2019, 8:15am

Hi, I have a problem with ImageDataBunch. I am trying to do the following.

data = ImageDataBunch.from_folder(train_path,
                                  train="train",
                                  test="test",
                                  valid_pct=0.1,
                                  ds_tfms=get_transforms(),
                                  size=224, num_workers=4).normalize(imagenet_stats)

It runs without errors, but when I look at the data, the labels for the test dataset are empty. Here is the output.

ImageDataBunch;

Train: LabelList (180 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: CategoryList
Class A,Class A,Class A,Class A,Class A
Path: dataset;

Valid: LabelList (20 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: CategoryList
Class A,Class B,Class A,Class A,Class B
Path: dataset;

Test: LabelList (120 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: EmptyLabelList
,,,,
Path: dataset

As you can see, the labels for Test are empty. Here is my folder structure.

dataset:
|- train
   |- Class A
   |- Class B
   |- Class C
   |- Class D
|- test
   |- Class A
   |- Class B
   |- Class C
   |- Class D

What should I do to get labels for the Test set? Should I change my folder structure, or is there something wrong with the code?

Thank you.
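
As a side note on the folder layout above: folder-based labelling generally takes each file's label from the name of the directory that contains it, which is why a test folder with class subfolders can in principle be labelled the same way as train. The following is a minimal sketch of that idea in plain Python, not fastai's implementation. The helper names label_from_parent and collect_labelled_files are hypothetical, and the temporary tree only mimics the structure shown in the question.

```python
from pathlib import Path
import tempfile


def label_from_parent(file_path: Path) -> str:
    """Return the name of the directory that directly contains the file."""
    return file_path.parent.name


def collect_labelled_files(root: Path) -> list:
    """Pair every file under root with its parent-folder label, sorted by path."""
    return [(p, label_from_parent(p)) for p in sorted(root.rglob("*")) if p.is_file()]


# Build a small throwaway tree that mirrors the test folder layout (assumed names).
with tempfile.TemporaryDirectory() as tmp:
    test_root = Path(tmp) / "test"
    for class_name in ["Class A", "Class B"]:
        (test_root / class_name).mkdir(parents=True)
        (test_root / class_name / "img0.jpg").touch()
    labels = [label for _, label in collect_labelled_files(test_root)]

print(labels)
```

The point is only that the information needed to label the test images is already present in the directory names. Whether a given loader uses it is a separate API decision.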

TomB · October 24, 2019, 9:03am

In fastai, the test data is expected to be unlabelled (in v1 anyway; I think v2 has more flexible support here). So it is ‘test data’ in the sense Kaggle uses, meaning the data on which you’ll be evaluated, rather than in the sense of a holdout, separate from your validation data, to use more sparingly.

There doesn’t seem to be any easy way around this in the API. You might be able to create the train/valid sets as usual and then assign a labelled test dataset and loader to data.test_ds/data.test_dl, and they might work, though I’m not sure any functions will actually use them. If that does work, you could also use DataBunch.create, which takes a test dataset parameter.
Though I suspect that won’t really help, as things won’t expect labelled test data. So you might be better off creating a separate databunch with the test data as the validation set. Then you can evaluate against that and fastai should do what you want (since it expects the validation set to be labelled). I think passing that dataloader into the various Learner methods that accept a dataloader (like Learner.validate) should work; I don’t think it has to be one of the Learner’s own loaders.
I’m not sure whether any of the methods for duplicating the various item list classes (ItemList; ItemLists, which is split; LabelLists, which is split and labelled) will let you copy over the parameters from the main databunch, but maybe.

muellerzr · October 24, 2019, 1:13pm

TomB is right, v2 will allow labeled test sets easily. Until then, please see my notebook below for a workaround that uses a labeled test set in v1:

https://github.com/muellerzr/fastai-Experiments-and-tips/blob/master/Test%20Set%20Generation/Labeled_Test_Set.ipynb

uulwake · October 24, 2019, 5:52pm

Thank you for your answer. Actually, I can compute the accuracy score manually by comparing the true labels with the predicted labels. However, I want to do TTA, and I do not know how to write my own TTA, so I need to use learn.TTA() from the fastai library.

I will consider your suggestion.
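
For readers who want the manual route mentioned above, here is a minimal sketch of computing accuracy from per-class probabilities and true class indices in plain Python. It does not use fastai. The function names argmax and manual_accuracy and the sample numbers are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a sequence of scores."""
    return max(range(len(row)), key=row.__getitem__)


def manual_accuracy(probs, targets):
    """Fraction of rows whose highest-scoring class equals the true class index."""
    if len(probs) != len(targets):
        raise ValueError("probs and targets must have the same length")
    if not probs:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for row, target in zip(probs, targets) if argmax(row) == target)
    return correct / len(targets)


# Illustrative data: 4 items, 3 classes. The last item is misclassified.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
targets = [0, 1, 2, 1]

acc = manual_accuracy(probs, targets)
print(acc)  # 3 of 4 correct -> 0.75
```

This covers plain predictions only. It does not reproduce test-time augmentation, which is why learn.TTA() is still needed for that part.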

uulwake · October 24, 2019, 6:08pm

Thank you. I will study your notebook.

uulwake · October 25, 2019, 11:38am

I have solved it.

Thank you very much @muellerzr for your notebook. It really helped me a lot.

First, I create an ImageDataBunch for training.

data = ImageDataBunch.from_folder(train_path/"train", 
                                  train=".",
                                  valid_pct=0.1, 
                                  ds_tfms=get_transforms(), 
                                  size=224, num_workers=4).normalize(imagenet_stats)

Next, I create an ImageDataBunch for the test set.

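# Put every test image into the "train" split and label it from its parent folder.
src = (ImageList.from_folder(train_path/"test")
            .split_none()
            .label_from_folder())

data_test = (src.transform(get_transforms(), size=224))

# Reuse the labelled items as the validation set so the databunch has a labelled valid_dl.
data_test.valid = data_test.train
data_test = data_test.databunch().normalize(imagenet_stats)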
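
One thing that may be worth checking at this point: because the test databunch builds its own class list from the test folders, its class indices only line up with the model's outputs if both class lists are identical and in the same order (in fastai v1 these should be visible as data.classes and data_test.classes). With the folder structure above, where train and test contain the same four classes, they would be expected to match. The sketch below is a plain-Python illustration of the check and of a remapping for the case where they differ. The function remap_targets and the sample lists are hypothetical.

```python
def remap_targets(train_classes, test_classes, test_targets):
    """Translate test-set class indices into the training set's index space."""
    train_index = {name: i for i, name in enumerate(train_classes)}
    unknown = [name for name in test_classes if name not in train_index]
    if unknown:
        raise ValueError(f"classes missing from the training set: {unknown}")
    return [train_index[test_classes[t]] for t in test_targets]


# Illustrative case: the test folder only contains two of the four classes,
# so its indices 0 and 1 mean something different from the training indices.
train_classes = ["Class A", "Class B", "Class C", "Class D"]
test_classes = ["Class B", "Class D"]
test_targets = [0, 1, 1]

if train_classes == test_classes:
    remapped = list(test_targets)  # indices already agree
else:
    remapped = remap_targets(train_classes, test_classes, test_targets)

print(remapped)  # [1, 3, 3]
```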

After that, by doing the following, I am able to run TTA prediction on my test set.

# Swap the learner's validation loader for the labelled test loader.
learn.data.valid_dl = data_test.valid_dl
y_preds, y = learn.TTA(ds_type=DatasetType.Valid)
if y.shape[0] == len(data_test.valid_ds):
    print(accuracy(y_preds, y))
else:
    print(f'Size mismatch. Shape of y_preds: {y_preds.shape}. Shape of y: {y.shape}')
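
For anyone wondering what TTA does with the predictions, the general idea is to predict on several augmented copies of each image, average those predictions, and blend the average with the prediction on the unaugmented image. Below is a minimal plain-Python sketch of that blending step only. It is not fastai's code: the function names are hypothetical, and the default beta=0.4 reflects my understanding of the fastai v1 default, so treat it as an assumption.

```python
def mean_over_views(view_preds):
    """Average class probabilities element-wise across augmented views.

    view_preds[v][i][c] is the probability of class c for item i under view v.
    """
    n_views = len(view_preds)
    n_items = len(view_preds[0])
    n_classes = len(view_preds[0][0])
    return [
        [sum(view_preds[v][i][c] for v in range(n_views)) / n_views
         for c in range(n_classes)]
        for i in range(n_items)
    ]


def tta_blend(plain_preds, view_preds, beta=0.4):
    """Weight plain predictions by beta and the augmented average by 1 - beta."""
    avg = mean_over_views(view_preds)
    return [
        [beta * p + (1 - beta) * a for p, a in zip(plain_row, avg_row)]
        for plain_row, avg_row in zip(plain_preds, avg)
    ]


# Illustrative numbers: one item, two classes, two augmented views.
plain = [[0.6, 0.4]]
views = [[[0.2, 0.8]], [[0.4, 0.6]]]

blended = tta_blend(plain, views)
print(blended)  # approximately [[0.42, 0.58]]
```

In this made-up case the plain prediction favours the first class, but the blended prediction favours the second, which shows how TTA can change the predicted label.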