How do I use from_folder to accommodate a Test dataset?

Hi,

I have searched multiple topics in this forum, and none of them seem to answer my question. As such, I am posting it here, hopefully with detailed screenshots to provide a better understanding!

My file directory structure is as such:

[screenshot]

As you can see, there are two distinct folders, Train and Test. What I want to do is train the model using the Train dataset for both training and validation (with validation comprising 20% of the training data), before predicting on the Test set, which has no labels.

Based on this, my ImageDataBunch.from_folder command looks like the following, but note that under class labels, Test becomes a third class in addition to the two classes that I actually want.

To get around this problem, I edited it to this:

So far, this edited command seems to work, as it gives me two class labels. However, the problem comes later downstream…

One of the most common solutions in the forums is to set get_preds(is_test=True), but as you can see, in fastai v1.0 this no longer works. I also pressed Shift-Tab to see what arguments are available, and none of them is an is_test=True parameter.

[screenshot]

So, I went and searched… and found this proposed solution which seems to work… until I checked the shape:

[screenshot]

Note: in the Test folder there are only 179 files for prediction, whereas the shape here shows 509 images. That number comes from the training dataset: (444 + 193) * 0.8 = 509.
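(A quick sanity check on that arithmetic, assuming the default 20% validation split keeps 80% of the labelled images for training:)

```python
# Sanity check: the 509 "test" predictions match an 80% training split
# of the labelled images, not the 179 files in the Test folder.
train_images = 444 + 193          # images in the two labelled class folders
valid_pct = 0.2                   # fastai's default validation split
train_split = int(train_images * (1 - valid_pct))
print(train_split)                # 509
```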

I am currently stuck and have been stuck for many hours. Could someone please advise me on how I should go about solving this? @jeremy

Also, if anyone could provide me with some guidance on how I can print out Test Images with the associated predicted class labels, I will be eternally grateful!

Thank you!

Hi !

So first off, try using this instruction to get the predictions (I don’t think it will fix your last issue, though):

learn.get_preds(ds_type=DatasetType.Test)

If that does not work, try the following. I do not have much experience with the DataBunch API; I would encourage you to switch to the data block API, which is more flexible.
Using the data block API, here’s how I would build the DataBunch:

path = Path('../input/electricstreet/electric-street-test/Electric-Street/')
data = (ImageItemList.from_folder(path/'Train') #Where to find the data? -> in path/'Train' and its subfolders
        .split_by_random_pct()           #How to split in train/valid? -> randomly with the default 20% split
        .label_from_folder()             #How to label? -> use the folder names
        .add_test_folder('../Test')      #add the unlabelled test set
        .transform(tfms, size=224)       #Data augmentation? -> use tfms with a size of 224
        .databunch(bs=64))               #Finally? -> convert to an ImageDataBunch with the defaults

Could you try that and tell me how it goes ?

If that works, you should be able to use learn.get_preds(ds_type=DatasetType.Test), I think.

Lastly, please try not to @ Jeremy or Sylvain directly unless your question is specifically for them :slight_smile:


@PierreO Thanks for helping me! :slight_smile: really appreciate it!

Ah, I forgot to mention that I did try learn.get_preds(ds_type=DatasetType.Test) with my commands!

[screenshot]

I modified your commands a bit because there are some very specific transformations that I want to apply!

But even after executing your proposed solution, I still obtain the same error! I was afraid of breaking something!

Try changing your .split_by_random_pct() to .random_split_by_pct().

So it would look like:

path = Path('../input/electricstreet/electric-street-test/Electric-Street/')
data = (ImageItemList.from_folder(path/'Train')
        .random_split_by_pct()       
        .label_from_folder()           
        .add_test_folder('../Test')   
        .transform(tfms, size=224) 
        .databunch(bs=64))  

And then you would just pass that databunch data to your learner like usual.
learn = create_cnn(data, models.resnet50)
After training, you would predict with
logs_preds_test = learn.get_preds(ds_type=DatasetType.Test)
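And since you asked about printing the predicted class labels: get_preds returns per-class probabilities for each image, so mapping them to names is just an argmax over the class axis. Here is a rough sketch with made-up class names and probabilities (in fastai you would use data.classes and the tensor returned by get_preds instead):

```python
import numpy as np

# Stand-ins for data.classes and the probability tensor from get_preds
classes = ['class_a', 'class_b']           # hypothetical label names
probs = np.array([[0.9, 0.1],              # one row of probabilities per image
                  [0.2, 0.8],
                  [0.7, 0.3]])

pred_idx = probs.argmax(axis=1)            # index of the most likely class
pred_labels = [classes[i] for i in pred_idx]
print(pred_labels)                         # ['class_a', 'class_b', 'class_a']
```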


@ark_aung Hi! Thanks for the help! I am using fastai v1.0.39! I have tried your solution and obtained the following errors!

Hey @wxs171530, you are making progress in terms of creating a DataBunch. This new error has to do with your data itself. My initial guess is that your images are of different sizes, and when show_batch tried to display them, it ran into problems. Your images differ in the second dimension (the width, judging by your print output). In principle this should not be a problem, since you resized them in your transform step with size=224.

First, I want to make sure that your DataBunch creation was successful. Try running your cell without
electricstreet.show_batch(rows=8, figsize=(48,48)). Then try running another cell with
electricstreet.train_ds.x[0]. This should display your first image as output.

One thing you could do is put up your whole notebook on colab or github and I can help you better with your debugging process.

Btw, please update fastai to its latest version so we can be sure that whatever we are trying to figure out is not a bug in the older version :slightly_smiling_face:


Your initial guess is right! My images are all different sizes, which is why I included the transformation step that rescales the images down to 224.

I have tried running the command without electricstreet.show_batch(rows=8, figsize=(48,48)). While it doesn’t throw any red errors, there is still a warning message.

Aside from this, as per your recommended command, I am able to execute electricstreet.train_ds.x[508] (I wanted to see an image other than the first).

Looks like I am able to get it! However… :frowning: when I try to execute learn.fit_one_cycle()

You have been an extremely great help! And I really appreciate the time and effort you have spent to help me debug! :smile:

P.S.: The reason I am running fastai v1.0.39 is that I am running from a Kaggle Kernel. I don’t think I have the choice to update it to the latest version? Correct me if I am wrong!

The error that arises while training the model is exactly because of that UserWarning you received while creating the DataBunch. Since your images are of different sizes, it was impossible to stack them up to form a mini-batch. Hence the error: It's not possible to collate samples of your dataset together in a batch.

An example would be something like this:
Let’s say x and o are the pixels of two images. If both images are 3x3, then when we make a batch, we can just stack them up, forming a tensor of shape (2, 3, 3), where 2 is the batch size.
xxx
xxx
xxx
ooo
ooo
ooo

But if we have a 2x4 image and a 3x2 image,
xxxx
xxxx
oo
oo
oo
you cannot stack them up, and that’s why you cannot train your model. I will check out your Kaggle kernel and debug the transform step.
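You can see the same constraint with plain NumPy arrays (just a sketch, not fastai code):

```python
import numpy as np

# Two images of the same shape stack into one batch tensor
a = np.zeros((3, 3))
b = np.ones((3, 3))
batch = np.stack([a, b])
print(batch.shape)                # (2, 3, 3): batch of 2 images

# Images with different shapes cannot be collated into a batch
c = np.zeros((2, 4))
d = np.ones((3, 2))
try:
    np.stack([c, d])
except ValueError:
    print("cannot collate images of different shapes")
```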

By the way, you can update fastai with
!pip install --upgrade fastai
in one of the cells of your Kaggle notebook.

Hey @sgugger @Jeremy @Sylvain! Could you please help me out with this issue? :slight_smile: