How to load learner and test on separate ImageList from folder?

I trained a network on a dataset a few days ago. Now I want to test the model on a new, separate dataset, but all my labels come back as ‘0’. The dataset is in the following directory structure:

./test
     ./ClassA
     ./ClassB

This is the code I use:

test_data_set = root_data_path / 'june_2019_testset/test'
test_learner = load_learner(path / 'models', file='model_20epochs-v4-1.0.50.pkl', test=ImageList.from_folder(test_data_set))
preds, y = test_learner.get_preds(ds_type=DatasetType.Test)

When viewing y I get a tensor with just 0 values:
tensor([0, 0, ..., 0])
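Part of what is going on here: in fastai v1 a test set attached via `test=` is deliberately unlabeled, so the `y` returned by `get_preds(ds_type=DatasetType.Test)` is just dummy zeros, not your folder labels. Since the folder structure encodes the true labels, you can recover them yourself with plain `pathlib`. A standalone sketch (the toy file names are made up for the demo; the commented scoring lines assume hypothetical `learn` and `preds` objects like the ones above):

```python
# Standalone sketch (no fastai needed): recover ground-truth labels from a
# test/ClassA, test/ClassB folder layout, since fastai's test set stores
# only dummy zeros for y. File and class names here are made up.
import tempfile
from pathlib import Path

# Build a toy directory tree mimicking ./test/ClassA, ./test/ClassB
root = Path(tempfile.mkdtemp())
for cls, names in {"ClassA": ["a1.jpg", "a2.jpg"], "ClassB": ["b1.jpg"]}.items():
    d = root / cls
    d.mkdir()
    for n in names:
        (d / n).touch()

# The parent directory name of each file is its true label
files = sorted(root.glob("*/*.jpg"))
true_labels = [f.parent.name for f in files]
print(true_labels)  # ['ClassA', 'ClassA', 'ClassB']

# With these labels you can score the preds from get_preds() yourself, e.g.
# (hypothetical, assuming a trained fastai learner called `learn`):
# classes = learn.data.classes
# pred_idx = preds.argmax(dim=1)
# acc = (pred_idx == tensor([classes.index(l) for l in true_labels])).float().mean()
```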

I have tried to load an ImageDataBunch from just the test folder, but it gives me the following error:

data = (ImageDataBunch.from_folder(test_data_set, ignore_empty=True))

IndexError: index 0 is out of bounds for axis 0 with size 0

which is probably because the train and validation sets are empty?

Any ideas? I’m using fastai 1.0.53.

I had this issue with testing recently.

What I did:
Build a new databunch at test time, putting your training data in the train dataset (just to fill it up) and your test data in the validation dataset.

Then run

learn.validate(new_databunch.valid_dl)

where new_databunch is the databunch that I mentioned above.
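Stripped of the fastai specifics, the pattern is: put your labeled test data in the slot the validation routine reads from. A minimal standalone sketch of the idea (the `validate` function and `valid_dl` name here just mirror `learn.validate(new_databunch.valid_dl)` above; the model is a toy stand-in, not a real network):

```python
# Schematic sketch of the trick (no fastai required): labeled TEST data goes
# into the "validation" slot, then validate() scores it like fastai would.

def validate(model, valid_dl):
    """Average accuracy over (inputs, labels) batches, like learn.validate."""
    correct = total = 0
    for xb, yb in valid_dl:
        preds = [model(x) for x in xb]
        correct += sum(int(p == y) for p, y in zip(preds, yb))
        total += len(yb)
    return correct / total

model = lambda x: 0  # toy "model": always predicts class 0

# Stand-in for new_databunch.valid_dl: our labeled test data in (x, y) batches
test_valid_dl = [([10, 11], [0, 1]), ([12], [0])]

acc = validate(model, test_valid_dl)
print(acc)  # 2 of 3 correct -> 0.666...
```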

Please see the discussion here, as that will not do quite what you are hoping for @jianshen92, it’s close though :slight_smile:

@muellerzr Interesting. No wonder I am constantly getting a higher test set accuracy.

What do you think is actually happening?

I wish I knew the answer to that, but I do not! I was seeing the exact same thing as well… perhaps a lurker could chime in with their input, but I do not know!

I see, thanks for pointing it out! Curious question: did you cross-check the validate() accuracy against the predict() accuracy?

It wasn’t for accuracy, more for speed. Though I have noticed maybe a 0.5%–1% difference between the two, which at the end of the day is negligible. But I could go further there as well; I only put that together briefly this morning. If you would like, I can do that real quick :slight_smile:

Going to cross-check quickly because I’m submitting an assignment soon, and knowing the real accuracy on the test set matters. I would expect both of them to be the same. Maybe we can share our findings! :slight_smile:

Sure! Give me a few minutes :slight_smile:

@jianshen92 I updated the notebook. It was a tenth of a percentage point difference.

Thanks @muellerzr , mine was identical too!

Good to know! @sgugger, any thoughts on why, when we pass in an outside dataloader, our accuracy sky-rockets? Thanks to your help we have already determined the proper way to do it, but why does this behavior exist? You can see in my example that I get almost 90% on the test set doing it the wrong way. Any thoughts?

Thanks for the replies.

I feel the API should support loading an ImageDataBunch from just a single folder, especially since you can use ignore_empty.

Also, the error it produces is quite unclear. Yes, it is trying to load something from an empty set, but it doesn’t say what exactly. Is it trying to deduce the labels by looking for a train folder that doesn’t exist? Is it trying to index a folder that doesn’t exist (train & valid)?

I feel fastai is quite good, but the error reporting could be significantly improved to make it even easier for new people like me to understand what a function does and why it fails.

It can with the data block API:

data = ImageList.from_folder(path).split_none().label_from_folder().databunch()  # label_from_folder, or however you want to label

and then you’ll have a databunch with just a train_dl and not valid_dl/test_dl. In general, the factory methods of ImageDataBunch are only suitable for scenarios very similar to what’s in the MOOC for beginners, the data block API is what you should use when you want more flexibility.
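To see why this gives a databunch with only a train set, here is a standalone sketch of what that chain does conceptually (toy files, not fastai internals): collect items from the folder, send everything to train with `split_none`-style behavior, and label each item by its parent directory name:

```python
# Standalone sketch of the data block idea: items come from a folder,
# "split_none" puts everything in train (validation stays empty), and each
# item is labeled by its parent directory. Toy code, not fastai internals.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
for cls in ("ClassA", "ClassB"):
    (root / cls).mkdir()
    (root / cls / "img.jpg").touch()

items = sorted(root.glob("*/*.jpg"))           # ImageList.from_folder(path)
train, valid = items, []                       # .split_none()
labeled = [(p, p.parent.name) for p in train]  # .label_from_folder()

print(len(train), len(valid))             # 2 0
print([lbl for _, lbl in labeled])        # ['ClassA', 'ClassB']
```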

As for the error reporting, it can obviously be improved. Any PR for that will be welcome!
