How to load learner and test on separate ImageList from folder?

Wesley · June 12, 2019, 1:05pm

I trained a network on a dataset a few days ago. Now I want to test the model on a new, separate dataset, but all my labels come back as ‘0’. The dataset has the following directory structure:

./test
     ./ClassA
     ./ClassB

This is the code I use:

test_data_set = root_data_path / 'june_2019_testset/test'
test_learner = load_learner(path / 'models' , file='model_20epochs-v4-1.0.50.pkl', test=ImageList.from_folder(test_data_set))
preds, y = test_learner.get_preds(ds_type=DatasetType.Test)

When viewing y I get a tensor with just 0 values:
tensor([0, 0, ..., 0])

I have also tried to load an ImageDataBunch from just the test folder, but it gives me the following error:

data = (ImageDataBunch.from_folder(test_data_set, ignore_empty=True))

IndexError: index 0 is out of bounds for axis 0 with size 0

which is probably because the train and validation sets are empty?

Any ideas? Using 1.0.53
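For context, fastai v1 treats the test set as unlabelled, so the `y` returned for `DatasetType.Test` is only a placeholder and the zeros above are expected. Since the folder names carry the true classes here, one workaround is to rebuild the labels from the file paths and score the predictions by hand. A minimal sketch in plain Python, assuming the scores have been converted to nested lists (for example with `preds.tolist()`), that the paths are in the same order as the ImageList passed as `test=`, and that the class order matches the one used in training. `labels_from_paths` and `accuracy` are hypothetical helper names, not fastai functions, and the sample data is made up:

```python
from pathlib import Path

def labels_from_paths(paths, classes):
    """Map each file to the index of its parent folder name in `classes`."""
    class_to_idx = {name: i for i, name in enumerate(classes)}
    return [class_to_idx[Path(p).parent.name] for p in paths]

def accuracy(scores, targets):
    """Fraction of rows whose highest-scoring class equals the target index."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in scores]
    correct = sum(p == t for p, t in zip(predicted, targets))
    return correct / len(targets)

# Hypothetical stand-ins for the real objects (assumptions, not real results):
classes = ["ClassA", "ClassB"]  # class order used during training
paths = ["test/ClassA/img1.jpg", "test/ClassB/img2.jpg", "test/ClassB/img3.jpg"]
scores = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]  # one row of class scores per file

targets = labels_from_paths(paths, classes)
print(targets, accuracy(scores, targets))
```

The important detail is the class order: if the test folders were indexed in a different order than the training classes, the accuracy would be computed against the wrong targets.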

jianshen92 · June 13, 2019, 9:39pm

I had this issue with testing recently.

What I did:
Build a new databunch for testing, where you put your training data in the train dataset (just to fill it up) and your test data in the validation dataset.

Then run

learn.validate(new_databunch.valid_dl)

where new_databunch is the databunch that I mentioned above.

muellerzr · June 13, 2019, 9:44pm

Please see the discussion here, as that will not do quite what you are hoping for, @jianshen92. It’s close, though.

jianshen92 · June 13, 2019, 9:52pm

@muellerzr Interesting. No wonder I am constantly getting a higher test set accuracy.

What do you think is actually happening?

muellerzr · June 13, 2019, 9:53pm

That I wish I knew the answer to, but I do not! As I was seeing the exact same thing as well… perhaps a lurker could see and give their input but I do not know!

jianshen92 · June 13, 2019, 9:58pm

I see, thanks for pointing it out! Curious question: did you cross-check the validate() accuracy against the predict() accuracy?

muellerzr · June 13, 2019, 9:59pm

It wasn’t for accuracy, more so for time. Though I have noticed maybe a 0.5%-1% difference between the two, which at the end of the day is negligible. But I could go further there as well; I just made that briefly this morning. If you would like me to, I can check real quick.

jianshen92 · June 13, 2019, 10:03pm

Going to cross-check quickly because I’m submitting an assignment soon. Knowing the real accuracy on the test set would matter. I would expect both of them to be the same. Maybe we can share our findings!

muellerzr · June 13, 2019, 10:03pm

Sure! Give me a few minutes

muellerzr · June 13, 2019, 10:13pm

@jianshen92 I updated the notebook. It was a tenth of a percentage difference.

jianshen92 · June 13, 2019, 10:25pm

Thanks @muellerzr , mine was identical too!

muellerzr · June 13, 2019, 10:30pm

Good to know! @sgugger, any thoughts on why our accuracy sky-rockets when we pass in an outside dataloader? Thanks to your help we have already determined the proper way to do it, but why would this behavior exist? You can see in my example I get almost 90% doing it the wrong way on the test set. Any thoughts?

Wesley · June 14, 2019, 6:53am

Thanks for the replies.

I feel the API should support loading an ImageDataBunch from just a single folder, especially when you are able to use ignore_empty…

Also, the error it produces is quite unclear. Yes, it is trying to load something from an empty set, but it doesn’t say exactly what. Is it trying to deduce the labels by looking for a train folder which doesn’t exist? Is it still trying to index folders which don’t exist (train & valid)?

I feel FastAI is quite good but the error reporting can be significantly improved to make it even easier for new people like me to understand what and why a function fails.
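On the unclear error: by default `ImageDataBunch.from_folder` looks for `train` and `valid` subfolders (unless `valid_pct` is passed), so a folder that contains only class directories produces empty splits. A small pre-flight check, sketched here in plain Python, reports that condition directly. `check_split_folders` is a hypothetical helper, not part of fastai, and the list of image extensions is an assumption:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed set of image extensions

def check_split_folders(root, splits=("train", "valid")):
    """Raise a descriptive error if an expected split folder is missing or has no images."""
    root = Path(root)
    for split in splits:
        folder = root / split
        if not folder.is_dir():
            raise FileNotFoundError(
                f"expected subfolder '{split}' in {root}, but it does not exist"
            )
        # Count image files anywhere below the split folder (class subfolders included).
        n_images = sum(1 for f in folder.rglob("*") if f.suffix.lower() in IMAGE_EXTS)
        if n_images == 0:
            raise ValueError(f"subfolder '{split}' in {root} contains no images")
```

Called on a folder laid out like the `test` directory in the question, this would raise a `FileNotFoundError` naming the missing `train` subfolder rather than an index error.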

sgugger · June 14, 2019, 1:08pm

It can with the data block API:

ImageList.from_folder(path).split_none().label_as_you_want

and then you’ll have a databunch with just a train_dl and no valid_dl/test_dl. In general, the factory methods of ImageDataBunch are only suitable for scenarios very similar to what’s in the MOOC for beginners; the data block API is what you should use when you want more flexibility.

As for the error reporting, it can obviously be improved. Any PR for that will be welcome!