Bug when using ImageDataBunch.from_folder and valid_pct with a test set

Thanks for the clarification.
But I was hoping I could use 10% of the training data for validation and leave the test data intact for TTA after training. I never expected it to sample randomly from the training and test sets combined, and this isn't clarified in the docs either.

Conceptually, the validation set should be kept separate from the test set (as well as from the training set), and in many ML packages the validation percentage is partitioned from the training set only, not from training and test combined.
If valid_pct is taken from training and test combined, then any statistics calculated on the test set at the end are biased.

Here in fastai, even if we specify the test set separately, valid_pct is still taken from both.

So, to be clear, my question is: how can we have valid_pct partitioned only from the training set and not the test set, so that we can draw unbiased conclusions from predictions on the test set?

ImageList.from_folder looks recursively into all the subfolders for the images it can find, but you can then use filter_by_folder to include or exclude the folders you do or don't want.
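For instance, a minimal sketch of that approach (folder names assumed), so the random validation split never touches the test folder:

    from fastai.vision import *

    # gather every image under path, keep only the train folder, then carve
    # the validation split out of train alone
    data = (ImageList.from_folder(path)
            .filter_by_folder(include=['train'])
            .split_by_rand_pct(valid_pct=0.1, seed=42)
            .label_from_folder()
            .databunch(bs=16))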

Thank you for your suggestion.
But now, after looking deeper into the API, it seems that the validation set is not used during training (I mean as internal fine-tuning data, as is common in many packages).
For example, if I define my databunch as:
data = ImageDataBunch.from_folder(path, ds_tfms=(tfms, tfms), train='train', valid='test', bs=8)

The validation set will be used only at prediction time, for example in:
learn.get_preds(ds_type=DatasetType.Valid)
or:
learn.TTA()

If this is correct, I guess there is no concern in having valid='test', right?

Normally you shouldn’t use your test set until the very end. The validation set isn’t used for training, but it’s used to fine-tune hyper-parameters since you adjust them to get a better validation loss/metric at the end of training.

Thanks!

So I guess if I create my data as below, it should keep the test set separate from the training set, right?

   data = (ImageList.
              from_folder(f'{path}' +'/train/').
              split_by_rand_pct(valid_pct = 0.2, seed=None).
              add_test_folder(f'{path}' +'/test/').
              label_from_folder().
              transform(get_transforms(),size=329).
              databunch(bs=16, num_workers=0)   
           )

But I am getting this error:

AttributeError: 'ImageList' object has no attribute 'add_test_folder'

while I can clearly see that add_test_folder is a method of the LabelLists class that the ImageList chain produces. What am I doing wrong here?

The add_test_folder call should go after label_from_folder.
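For example, a sketch of the corrected chain from the snippet above (add_test_folder only exists on the LabelLists that label_from_folder returns):

    data = (ImageList
            .from_folder(f'{path}' + '/train/')
            .split_by_rand_pct(valid_pct=0.2, seed=None)
            .label_from_folder()
            .add_test_folder(f'{path}' + '/test/')
            .transform(get_transforms(), size=329)
            .databunch(bs=16, num_workers=0))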

Thanks!
And how do you define the labels of the test set to come from the folder names?
This doesn't seem to be correct :sweat:
add_test_folder(f'{path}' + '/test/', label_from_folder())

I am a bit tired of saying it again and again and again. The test set in fastai is unlabeled, as is fully explained in the docs.
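To be concrete (a minimal sketch, assuming a trained learner as above): predictions on the fastai test set come back with dummy labels, so there is no ground truth to score against.

    from fastai.vision import *

    # the test set carries no labels: the second tensor returned here is
    # just zeros, not ground truth, so no metric can be computed from it
    preds, _ = learn.get_preds(ds_type=DatasetType.Test)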


Thanks for the clarification, and sorry for tiring you.
But the confusion comes in when there are methods such as .TTA() or .get_preds() where you need labels for evaluation. Perhaps a clarification in the docs where those methods are explained would solve the issue; perhaps there is an explanation now, but the last time I checked there wasn't any. :slight_smile:

It seems that the trick mentioned at https://docs.fast.ai/data_block.html#Add-a-test-folder
for using the test data as a labelled validation set only works for

learn.validate(data_test.valid_dl)

When I use

log_preds, y_true = learn.TTA(ds_type=data_test.valid_dl)
or:
log_preds, y_true = learn.get_preds(data_test.valid_dl)

it uses the valid_pct partition of the training data, and not the new validation set defined by
.split_by_folder(train='train', valid='test')

Any suggestions for this?

You have to change the data object of your Learner for learn.TTA and learn.get_preds. Did you set learn.data = data_test?

Yes I did, but it's still using the valid_pct partition of the first dataset.
This is exactly what I have done:

learn.load(S)
learn.data = test_data
log_preds, y_true = learn.TTA(ds_type=test_data.valid_dl, beta=0.5, scale=1.3)

However, I figured out what the issue is: unless you specify ds_type=DatasetType.Valid, it won't use the new validation set.

Could you please let me know why this is, or point me to an explanation? It would help me and others better understand the difference between DatasetType.Valid and data_test.valid_dl.
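For reference, this is the distinction as I now understand it (a sketch using the variable names from above): ds_type is an enum that picks a split out of the Learner's own learn.data, while learn.validate takes an actual DataLoader.

    from fastai.vision import *

    learn.data = data_test                # 'valid' here is really the test folder
    # the enum resolves to learn.data.valid_dl internally
    preds, y_true = learn.get_preds(ds_type=DatasetType.Valid)
    # validate, by contrast, accepts a DataLoader directly
    loss_and_metrics = learn.validate(data_test.valid_dl)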

Thanks a lot!

That line can't work with current fastai: ds_type must be a DatasetType. I'll stop replying until you provide your whole code, as it's pointless for me to try to guess what's happening. Not trying to be mean, but I (or any other person on this forum) really can't help without seeing everything; the failure might be linked to some lines of code before what you are showing.
Also, the whole error message (if applicable) and your current setup (given by show_install) are necessary information to figure out what's going on.
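(For reference, show_install lives in fastai's utils; a minimal way to produce the report pasted below:)

    from fastai.utils.collect_env import show_install
    show_install()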

Thanks, and I think the answer is [quote=“sgugger, post:23, topic:28292”]
That line can't work with current fastai. ds_type must be of DatasetType.
[/quote]
Simply, during the version update many things changed while the docs were not yet fully updated, which is somewhat frustrating for everyone using the library. :sweat:

I understand that I have to share my code, but in this case everything was done according to the docs I linked above, so I thought it was needless to copy and paste it all again. But here we go, as you requested:

data = (ImageList.
                  from_folder(f'{path_LC}' ).
                  filter_by_folder(include=['train'], exclude=['test']).
                  split_by_rand_pct(valid_pct = 0.2, seed=None). 
                  label_from_folder().
                  transform(ds_tfms,size=329).
                  databunch(bs=8, num_workers=0)   
               )

learn = cnn_learner(data, arch, pretrained=True,
                    metrics=[accuracy, error_rate, top_k_accuracy],
                    callback_fns=[partial(CSVLogger, filename=str('stat_' + str(tr) + '_S_' + aName))])

learn.fit_one_cycle(epoch, max_lr=maxLR, moms=[0.95, 0.85], div_factor=25.0)

learn.freeze()
learn.export()

test_data = (ImageList.
                    from_folder(f'{path_LC}').
                    split_by_folder(train='train', valid='test').
                    label_from_folder().
                    transform(ds_tfms,size=329).
                    databunch(bs=16, num_workers=0))

learn.load()
learn.data = test_data
log_preds, y_true = learn.TTA(ds_type=test_data.valid_ds, beta=0.5, scale=1.3)

=== Software ===
python        : 3.7.1
fastai        : 1.0.49
fastprogress  : 0.1.20
torch         : 1.0.1
torch cuda    : 10.0 / is available
torch cudnn   : 7401 / is enabled

=== Hardware === 
torch devices : 1
  - gpu0      : GeForce GTX 1080 with Max-Q Design

=== Environment === 
platform      : Windows-10-10.0.16299-SP0
conda env     : base
python        : C:\ProgramData\Anaconda3\python.exe
sys.path      : C:\Users\sshahinf\Desktop\Python_code
C:\ProgramData\Anaconda3\python37.zip
C:\ProgramData\Anaconda3\DLLs
C:\ProgramData\Anaconda3\lib
C:\ProgramData\Anaconda3

C:\ProgramData\Anaconda3\lib\site-packages
C:\ProgramData\Anaconda3\lib\site-packages\win32
C:\ProgramData\Anaconda3\lib\site-packages\win32\lib
C:\ProgramData\Anaconda3\lib\site-packages\Pythonwin
C:\ProgramData\Anaconda3\lib\site-packages\IPython\extensions
C:\Users\sshahinf\.ipython
no nvidia-smi is found

The docs are updated with each new version. If there are places where they're not fully updated, any PR to fix them is always more than welcome. TTA isn't documented in any case, which is another place where a contribution would be appreciated. In general, you'll find people more willing to help if you use language like "it's not perfect, how can I help make it better?" rather than just complaining.

Same for the changes: I'm not sure what the many things that changed are, since the functions you use haven't moved in the past three months.

In any case, the correct line is

learn.TTA(ds_type=DatasetType.Test, beta=0.5, scale=1.3)

Thanks for your clarification. I will try to contribute to the TTA documentation.
This is what I wrote two comments above:

and I just asked what the difference between the two dataset types is.

Well, there were changes from v0.7 to v1.0, and each one took a bit of time until we figured out how to fix it. It was unfortunate that we tried to use this library at a time when many changes had to happen; needless to mention all of them here. It was just bad timing, I guess :smile:

Thanks anyway!

Ah yes, sorry: ds_type=DatasetType.Valid is what you want when you have set your new data, because it's the validation set of data_test. I confused myself :wink:
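So, putting it together, a sketch of the corrected final step (names taken from your code above; the saved-model name is assumed):

    learn.load('S')
    learn.data = test_data
    log_preds, y_true = learn.TTA(ds_type=DatasetType.Valid, beta=0.5, scale=1.3)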

Oh, I didn't realize you were talking about v0.7. It has been stated very clearly that v1.0 was a complete rewrite, so there is absolutely no backward compatibility. Also, there were no docs for v0.7, so it's not a question of updating them, more like writing them :slight_smile:

Why do we have two different ways to create data:
ImageDataBunch and ImageList?

Hi Shruti,

There are many methods, actually : ) Please have a look at the data_block docs and the DataBunch docs for a start. You should also be able to Google many useful blog posts about the "fastai data block" API. It's worth noticing the striving for consistency (kudos to the devs) between the creation methods for different data types, e.g. image, tabular, text, etc., such that very similar lines of fastai code can be used for different DL models and applications. Thanks.
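For example, here is a sketch of the two routes to the same DataBunch (paths and sizes assumed for illustration): the one-line factory method, and the data block API making the same steps explicit and composable.

    from fastai.vision import *

    # 1) the factory method: one call, sensible defaults
    data = ImageDataBunch.from_folder(path, train='train', valid='valid',
                                      ds_tfms=get_transforms(), size=224, bs=16)

    # 2) the data block API: each step (gather, split, label, transform,
    #    bunch) is explicit and swappable
    data = (ImageList.from_folder(path)
            .split_by_folder(train='train', valid='valid')
            .label_from_folder()
            .transform(get_transforms(), size=224)
            .databunch(bs=16))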

Yijin


Yep, did just that: read a blog post on the data block API. Now things make more sense :slight_smile: