[Issue] ImageDataBunch.from_folder() classifies my test folder as a label for my data

ggrelet · January 30, 2019, 4:02am

Hi,

My data directory is set like this:

.
|-- test
`-- train
    |-- nok
    |-- ok
    `-- unsure

The test folder contains an unsorted batch of files which are, of course, unlabeled.

When I’m using ImageDataBunch.from_folder() like the following:

data = ImageDataBunch.from_folder(path,
        ds_tfms=get_transforms(), valid_pct=0.2, size=224, num_workers=4, bs=bs, test="test").normalize()

I get the following result (the train/valid/test separation itself is as expected):

ImageDataBunch;

Train: LabelList
y: CategoryList (923 items)
[Category ok, Category ok, Category ok, Category ok, Category ok]...
Path: data
x: ImageItemList (923 items)
[Image (3, 960, 1280), Image (3, 960, 1280), Image (3, 960, 1280), Image (3, 960, 1280), Image (3, 960, 1280)]...
Path: data;

Valid: LabelList
y: CategoryList (230 items)
[Category test, Category test, Category nok, Category ok, Category nok]...
Path: data
x: ImageItemList (230 items)
[Image (3, 960, 1280), Image (3, 960, 1280), Image (3, 960, 1280), Image (3, 960, 1280), Image (3, 960, 1280)]...
Path: data;

Test: LabelList
y: EmptyLabelList (478 items)
[EmptyLabel , EmptyLabel , EmptyLabel , EmptyLabel , EmptyLabel ]...
Path: .
x: ImageItemList (478 items)
[Image (3, 960, 1280), Image (3, 960, 1280), Image (3, 960, 1280), Image (3, 960, 1280), Image (3, 960, 1280)]...
Path: data

However, when I call data.classes I get: ['nok', 'ok', 'test', 'unsure'], i.e. the framework now treats test as a label of my data, which messes up my classification model.

If I remove the keyword test="test", I end up with a Test: None data set, but I still have the test label on my data.

Am I missing something?

Thanks for your help, much appreciated
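
Editor's sketch of the likely mechanism, using only the standard library and not fastai's actual code: if images are collected recursively from the root folder and each one is labeled with the name of its parent folder, the files in test receive the label test. The helper names make_tree and labels_from_parent are hypothetical, and the assumption that the factory method scans from the root when valid_pct is set is inferred from the behavior reported above.

```python
import tempfile
from pathlib import Path


def make_tree(root, layout):
    """Create empty .jpg files; layout maps a folder (relative to root) to a file count."""
    for folder, count in layout.items():
        target = root / folder
        target.mkdir(parents=True, exist_ok=True)
        for i in range(count):
            (target / f"img_{i}.jpg").touch()


def labels_from_parent(root):
    """Label every image found below root with the name of its parent folder."""
    return sorted({p.parent.name for p in root.rglob("*.jpg")})


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Same layout as in the post: labeled folders under train, loose files in test.
    make_tree(root, {"train/ok": 3, "train/nok": 2, "train/unsure": 1, "test": 4})
    classes_from_root = labels_from_parent(root)             # scan starts at the root
    classes_from_train = labels_from_parent(root / "train")  # scan starts at train

print(classes_from_root)   # ['nok', 'ok', 'test', 'unsure']
print(classes_from_train)  # ['nok', 'ok', 'unsure']
```

Under that assumption, the extra class disappears as soon as the scan no longer includes the test folder.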

ggrelet · January 30, 2019, 4:05am

PS: I know I can add the argument classes=["ok", "nok", "unsure"] and I’ll simply get a warning that test is discarded. But I wanted to point out that this could be unexpected behavior and a real problem.

raimanu-ds · January 30, 2019, 1:21pm

Hi @ggrelet,

I have run into the same issue and noticed it only when looking at the result of .plot_top_losses(): it looked as if the model had also been fitted on the test set images.

ggrelet · January 31, 2019, 12:56am

@raimanu-ds I can confirm that my model is also fitted on the test set images.
Maybe we should open a GitHub Issue?

ggrelet · February 1, 2019, 4:43am

I opened a GitHub Issue.

teacha_max · February 2, 2019, 6:45am

Hey guys, I don’t know if it’s related or not, but I have a similar issue with ImageDataBunch.from_folder.

My data directory is set like this:

.
`-- datafolder
    |-- class1
    |-- class2

I’m using ImageDataBunch.from_folder() like the following:

data = ImageDataBunch.from_folder(path, train=".", ds_tfms=get_transforms(),
               valid_pct=0.2, size=224, num_workers=4).normalize(imagenet_stats)

When I call data.classes I get: ['datafolder', 'class1', 'class2'], which means ImageDataBunch.from_folder adds the root folder to the classes, or maybe I’m doing something wrong.

brownmamba · March 5, 2019, 5:37pm

I am getting the same error.

ggrelet · March 6, 2019, 12:51am

Hello, can you describe your problem on the GitHub issue I opened? Right now it’s closed, but if someone posts similar problems with new example data, maybe @sgugger will re-open it?

sgugger · March 6, 2019, 1:21am

The answer will be the same: use the data block API.
There may very well be edge cases where the factory methods of ImageDataBunch fail, and then you should use the data block API because it’s more flexible.

In those cases you can filter the folders you want to keep with filter_by_folder(['class1', 'class2']).
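
Editor's sketch of what such a folder filter does, written with the standard library only. It assumes the filter compares the first folder below the root path against the include and exclude lists; that reading of filter_by_folder is an assumption, and filter_by_top_folder is a hypothetical stand-in, not the fastai implementation.

```python
from pathlib import PurePosixPath


def filter_by_top_folder(files, root, include=None, exclude=None):
    """Keep the files whose first folder below root is allowed."""
    include, exclude = set(include or []), set(exclude or [])
    kept = []
    for f in files:
        top = PurePosixPath(f).relative_to(root).parts[0]
        if include and top not in include:
            continue  # an include list is given and this folder is not in it
        if top in exclude:
            continue  # this folder is explicitly excluded
        kept.append(f)
    return kept


# Hypothetical file list mirroring the layout from the first post.
files = ["data/train/ok/a.jpg", "data/train/nok/b.jpg", "data/test/c.jpg"]
kept = filter_by_top_folder(files, "data", include=["train"])
labels = sorted({PurePosixPath(f).parent.name for f in kept})

print(kept)    # ['data/train/ok/a.jpg', 'data/train/nok/b.jpg']
print(labels)  # ['nok', 'ok']
```

Because the filtering happens before labeling, the unwanted folder never reaches the step that turns folder names into classes.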

brownmamba · March 7, 2019, 4:01am

Doing this fixed everything. Thank you!!

mturbot · March 11, 2019, 3:54pm

Hi,

I think this thread has some similarities with my problem.
Before really getting into the learner, I am trying to fully understand how to create my databunches correctly and what is happening under the hood.

The fastai data block docs say that:
data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=64)

is a shortcut method for:
data = (ImageList.from_folder(path)
        .split_by_folder()
        .label_from_folder()
        .add_test_folder()
        .transform(tfms, size=64)
        .databunch())

These last lines are said to be generic and usable all the time.

While working through lesson 2, I am trying to create my databunch with the data block API.
I have a parent folder ‘sneakers’ with 2 subfolders, ‘nike’ and ‘adidas’, where I have put my images. There are no valid or train folders.

What should I do to get the same result as:
data = ImageDataBunch.from_folder(path, train='.', valid_pct=0.2, ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

using the Data Block API …

I don’t get it…

Thanks
Michael
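
Editor's sketch of the two steps that differ from the documented shortcut, in plain Python rather than fastai: with no train/valid folders, a split by folder has nothing to split on, so the valid_pct behavior corresponds to a random percentage split followed by labeling from the parent folder. The helpers split_by_pct and label_from_parent are hypothetical. In fastai v1 the matching data block step is, as far as I know, split_by_rand_pct (random_split_by_pct in older releases); treat that name as an assumption to check against your installed version.

```python
import random
from pathlib import PurePosixPath


def split_by_pct(items, valid_pct=0.2, seed=42):
    """Shuffle a copy of items and hold out the first valid_pct share for validation."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * valid_pct)
    return shuffled[cut:], shuffled[:cut]  # (train, valid)


def label_from_parent(item):
    """Use the name of the containing folder as the label."""
    return PurePosixPath(item).parent.name


# Hypothetical file names for the sneakers/nike and sneakers/adidas layout.
files = [f"sneakers/{brand}/{i}.jpg" for brand in ("nike", "adidas") for i in range(10)]

train, valid = split_by_pct(files, valid_pct=0.2)
train_labeled = [(f, label_from_parent(f)) for f in train]
valid_labeled = [(f, label_from_parent(f)) for f in valid]
classes = sorted({label for _, label in train_labeled + valid_labeled})

print(len(train), len(valid))  # 16 4
print(classes)                 # ['adidas', 'nike']
```

The split is done on the flat list of files, so the folder names are only used for labels and never for deciding what goes into validation.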

BlueLightning · May 27, 2019, 4:29pm

Hi,

I was facing the same issue as @ggrelet and this is the solution I found.

Given the folder structure:

    data_folder/
    |--  train/
         |--  class-1/
         |--  class-2/
    |--  test/

To add the test set when using ImageDataBunch.from_folder() you do as follows:

data = ImageDataBunch.from_folder( Path("data_folder/train"), train='.', test='../test'
               , valid_pct=0.2, bs=64, size=224, ds_tfms=get_transforms())

This sorts the problem of having test show up as a class:

print(data.classes)
['class-1', 'class-2']

In this example, data returns:

ImageDataBunch;

Train: LabelList (65934 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: CategoryList
class-1,class-1,class-2,class-1,class-1
Path: /data_folder/train;

Valid: LabelList (16483 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: CategoryList
class-1,class-2,class-1,class-2,class-2
Path: /data_folder/train;

Test: LabelList (17687 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: EmptyLabelList
,,,,
Path: /data_folder/train

The only small issue is that data reports the wrong path for the test folder. Other than that, everything works fine and you can get predictions for the test set by running the code below on your learner, say learn:

preds, y = learn.get_preds( ds_type=DatasetType.Test)

Hope this helps

dreambeats · May 28, 2019, 8:20am

I would recommend the following instead:

data = ImageDataBunch.from_folder( Path("data_folder"), train='train', test='test'
               , valid_pct=0.2, bs=64, size=224, ds_tfms=get_transforms())

This way the models directory (where the learner saves model weights) gets created within ‘data_folder’ instead of ‘data_folder/train’, which makes more sense.

BlueLightning · May 28, 2019, 10:49am

@dreambeats as you say, it makes more sense, and that was the first thing I tried. For some reason, though, as a result of this modification the test folder was then included as a class. I therefore obtained:

print(data.classes)
['class-1', 'class-2', 'test']

which was quite problematic. Have you not encountered that issue?