Bug when using ImageDataBunch.from_folder and valid_pct with test

Here is the command I try using:

PATH = Path("kaggleData/competitions/imaterialist-challenge-furniture-2018/")

data = ImageDataBunch.from_folder(PATH, train="train", test="test", ds_tfms=get_transforms(), size=112, bs=64, valid_pct=0.90)

and I get this error:

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-12-7267138863ad> in <module>()
      3 #ds_tfms=([crop_pad(size=([112,112]))],[crop_pad(size=([112,112]))])
      4 
----> 5 data = ImageDataBunch.from_folder(PATH, train="train", test="test", ds_tfms=get_transforms(), size=112, bs=64, valid_pct=0.90)
      6 #data = ImageDataBunch.from_folder(PATH, train="train", valid="valid", test="test", size=112, bs=64)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/fastai/vision/data.py in from_folder(cls, path, train, valid, test, valid_pct, **kwargs)
    280 
    281         if test: datasets.append(ImageClassificationDataset.from_single_folder(
--> 282             path/test,classes=train_ds.classes))
    283         return cls.create(*datasets, path=path, **kwargs)
    284 

UnboundLocalError: local variable 'train_ds' referenced before assignment
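
The root cause is ordinary Python scoping rather than anything fastai-specific: a variable assigned in only one branch of an `if` is still treated as a local in the whole function, so reading it from the other branch raises `UnboundLocalError`. A minimal, fastai-free sketch of the same mistake (all names here are illustrative, not fastai's):

```python
def from_folder(valid_pct=None, test=None):
    # train_ds is only bound in the valid_pct-is-None branch...
    if valid_pct is None:
        train_ds = ["train images"]
        datasets = [train_ds, ["valid images"]]
    else:
        datasets = [["train images"], ["valid images"]]  # train_ds never assigned here
    if test:
        # ...so this raises UnboundLocalError when valid_pct was given
        datasets.append(("test images", train_ds))
    return datasets
```

Calling `from_folder(valid_pct=0.9, test="test")` here fails exactly like the traceback above, while either fix proposed below (binding `train_ds` in the `else` branch, or reading `datasets[0]` instead) avoids it.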

The problem is in ImageDataBunch.from_folder. It currently looks like this:

    @classmethod
    def from_folder(cls, path:PathOrStr, train:PathOrStr='train', valid:PathOrStr='valid',
                    test:Optional[PathOrStr]=None, valid_pct=None, **kwargs:Any)->'ImageDataBunch':
        "Create from imagenet style dataset in `path` with `train`,`valid`,`test` subfolders (or provide `valid_pct`)."
        path=Path(path)
        if valid_pct is None:
            train_ds = ImageClassificationDataset.from_folder(path/train)
            datasets = [train_ds, ImageClassificationDataset.from_folder(path/valid, classes=train_ds.classes)]
        else: datasets = ImageClassificationDataset.from_folder(path/train, valid_pct=valid_pct)

        if test: datasets.append(ImageClassificationDataset.from_single_folder(
            path/test,classes=train_ds.classes))
        return cls.create(*datasets, path=path, **kwargs)

However, if you use valid_pct (i.e. valid_pct is not None), execution takes the else branch, which only assigns datasets, so train_ds never gets defined. When you also pass test, the test branch then references train_ds and fails with the error above. My suggestion to fix this is the following:

@classmethod
def from_folder(cls, path:PathOrStr, train:PathOrStr='train', valid:PathOrStr='valid',
                test:Optional[PathOrStr]=None, valid_pct=None, **kwargs:Any)->'ImageDataBunch':
    "Create from imagenet style dataset in `path` with `train`,`valid`,`test` subfolders (or provide `valid_pct`)."
    path=Path(path)
    if valid_pct is None:
        train_ds = ImageClassificationDataset.from_folder(path/train)
        datasets = [train_ds, ImageClassificationDataset.from_folder(path/valid, classes=train_ds.classes)]
    else: 
        datasets = ImageClassificationDataset.from_folder(path/train, valid_pct=valid_pct)
        train_ds = datasets[0] #<--------This line is the fix I am proposing
        
    if test: datasets.append(ImageClassificationDataset.from_single_folder(
        path/test,classes=train_ds.classes))
    return cls.create(*datasets, path=path, **kwargs)

I believe this fixes the issue without adding complications. The alternative would be to change this line:

if test: datasets.append(ImageClassificationDataset.from_single_folder(
        path/test,classes=train_ds.classes))

to something like this:

if test: datasets.append(ImageClassificationDataset.from_single_folder(
        path/test,classes=datasets[0].classes))

I believe both solutions solve the issue, so go with whichever is preferable.

Here is my install information:

=== Software === 
python version  : 3.6.6
fastai version  : 1.0.14
torch version   : 1.0.0.dev20181015
nvidia driver   : 396.24
torch cuda ver  : 9.2.148
torch cuda is   : available
torch cudnn ver : 7104
torch cudnn is  : enabled

=== Hardware === 
nvidia gpus     : 2
torch available : 2
  - gpu0        : 11175MB | GeForce GTX 1080 Ti
  - gpu1        : 11178MB | GeForce GTX 1080 Ti

=== Environment === 
platform        : Linux-4.15.0-36-generic-x86_64-with-debian-stretch-sid
distro          : Ubuntu 16.04 Xenial Xerus
conda env       : fastai
python          : /home/kbird/anaconda3/envs/fastai/bin/python
sys.path        : 
/home/kbird/anaconda3/envs/fastai/lib/python36.zip
/home/kbird/anaconda3/envs/fastai/lib/python3.6
/home/kbird/anaconda3/envs/fastai/lib/python3.6/lib-dynload
/home/kbird/anaconda3/envs/fastai/lib/python3.6/site-packages
/home/kbird/anaconda3/envs/fastai/lib/python3.6/site-packages/IPython/extensions
/home/kbird/.ipython

Nice catch. I personally prefer the second solution since it doesn’t add a line of code. Would you mind making a PR for this?

I’ll make a PR later today.

Let me know if you want me to add anything else to this. I am not very familiar with contributing to open source projects.

Thanks!

This bug was crying for a test :wink:
added - https://github.com/fastai/fastai/pull/1003


Eh eh, thanks!
I was wondering how to test those test set issues, and creating fake images sure works. I wonder if adding a test set to MNIST_TINY wouldn't be better, though. @jeremy what do you think?


Good idea

Hi @sgugger
I am a bit confused about the behavior of ImageDataBunch.from_folder

I have 9 categories, with 10 images per category in the training folders and 250 images per category in the related test folders.

If I define my data as:

data = ImageDataBunch.from_folder(f'{d_path}' + "LC_SE_10", valid='test', bs=8)

it is working as expected, 90 training items and 2250 validation items.

However, if I use
data = ImageDataBunch.from_folder(f'{d_path}' + "LC_SE_10", valid_pct=0.1, bs=8)
this is what I get:

my data has increased by 26 times, and then 10% of it is allocated for validation.
My confusion is why this happens when I have not defined any tfms, so I would not expect the image count to grow through augmentation.

I would appreciate if you can clarify this for me.

Please don’t tag me personally when there are multiple people who can answer your question.

I don’t see where your confusion comes from: you have a total of 2,340 images, 90 in the train folder and 2,250 in the valid folder. In your first call they are split according to that.
In your second call, you ask to split all the images randomly and select 10% (234) for your validation set, so you have 2,106 left in your training set.
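
In numbers, taking the folder counts from the post above:

```python
total = 90 + 2250         # images found across the train and test folders
valid = int(total * 0.1)  # valid_pct=0.1 samples 10% of everything
train = total - valid
print(train, valid)       # 2106 234
```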

Thanks for the clarification.
But I was hoping I could use 10% of the training data for validation and leave the test data intact for TTA after training. I never thought it would sample randomly from the training and testing sets combined; this has not been clarified in the docs either.

Conceptually, validation data should be kept separate from the test set, and in many ML packages the validation percentage is partitioned from the training set only, not from training and testing combined.
If valid_pct is taken from training and testing combined, then any statistics calculated on the test set at the end are biased.

Here in fastai, even if we indicate the test set separately, valid_pct is still taken from both:

So to be clear, my question is: how can we have valid_pct partitioned only from the training set and not the test set, so that we can draw unbiased conclusions from predictions on the test set?

ImageList.from_folder looks recursively into all the subfolders for the images it can find, but you can then use filter_by_folder to include/exclude the folders you want/don’t want.

Thank you for your suggestion.
But after looking deeper into the API, it seems the validation set is not used in the training process (I mean as internal fine-tuning data, as is common in many packages).
For example, if I define my DataBunch as:
data = ImageDataBunch.from_folder(path, ds_tfms=(tfms, tfms), train="train", valid="test", bs=8)

the validation set will be used only at prediction time, for example in:
learn.get_preds(ds_type=DatasetType.Valid)
or:
learn.TTA()

If this is correct, I guess there is no concern with having valid="test", right?

Normally you shouldn’t use your test set until the very end. The validation set isn’t used for training, but it’s used to fine-tune hyper-parameters since you adjust them to get a better validation loss/metric at the end of training.

Thanks!

So I guess if I create my data as below, it should keep the test set separate from the training set, right?

   data = (ImageList.
              from_folder(f'{path}' +'/train/').
              split_by_rand_pct(valid_pct = 0.2, seed=None).
              add_test_folder(f'{path}' +'/test/').
              label_from_folder().
              transform(get_transforms(),size=329).
              databunch(bs=16, num_workers=0)   
           )

But I am getting this error:

AttributeError: 'ImageList' object has no attribute 'add_test_folder'

while I clearly see that add_test_folder is a method of the LabelLists class, which inherits from ImageList. What am I doing wrong here?

The add_test_folder call should go after label_from_folder.
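
For reference, the earlier pipeline with the call reordered might look like this (a sketch only, untested here; the path, size, and batch settings are taken from the post above):

```python
data = (ImageList
        .from_folder(f'{path}' + '/train/')
        .split_by_rand_pct(valid_pct=0.2, seed=None)
        .label_from_folder()
        .add_test_folder(f'{path}' + '/test/')  # now called on a LabelLists
        .transform(get_transforms(), size=329)
        .databunch(bs=16, num_workers=0))
```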

Thanks!
And how do you define the labels of the test set to come from the folder names?
This does not seem to be correct :sweat:
add_test_folder(f'{path}' + '/test/', label_from_folder())

I am a bit tired of saying it again and again and again. The test set in fastai is unlabeled, as is fully explained in the docs.


Thanks for the clarification, and sorry for tiring you.
But the confusion arises because there are methods such as .TTA() or .get_preds() where you need labels for evaluation. Perhaps a clarification in the docs where those methods are explained would solve the issue; maybe there is one now, but the last time I checked there wasn’t any. :slight_smile:

It seems that the trick mentioned at https://docs.fast.ai/data_block.html#Add-a-test-folder
for using test data as a labeled validation set only works for

learn.validate(data_test.valid_dl)

when I use

log_preds, y_true = learn.TTA(ds_type=data_test.valid_dl)
or: 
log_preds, y_true = learn.get_preds(data_test.valid_dl)

it will use the valid_pct partition of the training data and not the new validation set defined by
.split_by_folder(train='train', valid='test')

Any suggestions for this?