How to add test data into ImageDataBunch.from_df and make predictions?

Hello all

I am very confused about how the data structures work in fast.ai.
I am trying to predict on the test data after training the model.

This shows no error at all (default)

data = ImageDataBunch.from_df(path='data/', df=df_trn, ds_tfms=tfms,
size=224, bs=bs).normalize(imagenet_stats)

When I try to pass a test dataset (test=df_tst), I get an error:

data = ImageDataBunch.from_df(path='data/', df=df_trn, ds_tfms=tfms,
size=224, bs=bs, test=df_tst).normalize(imagenet_stats)

TypeError: expected str, bytes or os.PathLike object, not list

So now I try to pass just the test list (test=list(df_tst['name']))

data = ImageDataBunch.from_df(path='data/', df=df_trn, ds_tfms=tfms,
size=224, bs=bs, test=list(df_tst['name'])).normalize(imagenet_stats)

and I still get an error.
Please help.


Do you want a labeled test set or unlabeled?

Thanks for the fast reply :slight_smile:
The train and test dataframes look the same, just like in the lecture, something like this:

| name | label |
|---|---|
| path_img_1 | category_1 |
| path_img_2 | category_2 |
| path_img_3 | category_1 |
| path_img_4 | category_2 |

So each row has a path and its label.

You’ll want to make a separate databunch just for it, and pass it in as the validation set when you want to run analysis via learn.validate(). The reason is that test sets in fastai are unlabeled, so this is how we can make a labeled test set to work with. An example using tabular is shown below:

data = (TabularList.from_df(train_set, path=Path(''), cat_names=cat_var, 
                            cont_names=cont_var, procs=procs)
       .split_by_rand_pct(0.2)
       .label_from_df(dep_var, classes=classes)
       .databunch(bs=5000))

data_test = (TabularList.from_df(test, path=Path(''), cat_names=cat_var, 
                            cont_names=cont_var, procs=procs, processor=data.processor)
       .split_none()
       .label_from_df(dep_var, classes=classes)
       .databunch(bs=5000))

Notice that I make a separate databunch here: passing classes ensures both databunches have the same classes, and passing the processor ensures the test data is preprocessed the same way as the training data. If you need it I can quickly write one for an ImageList in a moment, but see if you can’t work it out yourself first. Also notice the split_none() on the test set databunch.

Then when you are ready to use it and validate, you can do the following:

learn.data.valid_dl = data_test.train_dl
learn.validate()
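
For images, the same pattern might look roughly like the sketch below. It assumes fastai v1, the name/label column layout shown earlier, and that `data` is the databunch the model was trained on. `make_labeled_test_databunch` is a made-up helper name, not part of fastai.

```python
def make_labeled_test_databunch(df_tst, data, tfms, path='data/', size=224, bs=64):
    """Build a databunch whose 'train' split holds the labeled test images.

    Assumptions: fastai v1, file names in column 0 and labels in column 1 of
    `df_tst`, `data` is the training databunch, `tfms` comes from get_transforms().
    """
    # Imported inside the function so the helper can be defined without fastai installed.
    from fastai.vision import ImageList, imagenet_stats

    return (ImageList.from_df(df_tst, path=path, cols=0)
            .split_none()                                  # no validation split: all rows go to 'train'
            .label_from_df(cols=1, classes=data.classes)   # reuse the classes seen during training
            .transform((tfms[1], tfms[1]), size=size)      # validation-style transforms, no augmentation
            .databunch(bs=bs)
            .normalize(imagenet_stats))
```

Something like `data_test = make_labeled_test_databunch(df_tst, data, tfms, bs=bs)` should then slot into the `learn.data.valid_dl = data_test.train_dl` step above.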

Good luck!


Thanks a lot, I’ll take some time to digest that :slight_smile:


Hi @muellerzr,

  1. How do I set aside a validation set from a dataframe? Looking at the fast.ai code, from_df sets valid_pct to 0.2. Does that mean 20% of the data is set aside for the validation set automatically?

Below is my code. When I call learn.recorder.plot_losses(), it doesn’t show the validation loss graph.

I have one DataFrame with the image path and label.

image_dataset = pd.concat([df['image_path'], df['lesion']], axis=1, keys=['name', 'label'])
bs = 8

tfms = get_transforms(flip_vert=True)
data = ImageDataBunch.from_df(".", image_dataset, ds_tfms=tfms, size=450, bs=bs).normalize(imagenet_stats)

Fast.ai code

    @classmethod
    def from_df(cls, path:PathOrStr, df:pd.DataFrame, folder:PathOrStr=None, label_delim:str=None, valid_pct:float=0.2,
                seed:int=None, fn_col:IntsOrStrs=0, label_col:IntsOrStrs=1, suffix:str='', **kwargs:Any)->'ImageDataBunch':
        "Create from a `DataFrame` `df`."
        src = (ImageList.from_df(df, path=path, folder=folder, suffix=suffix, cols=fn_col)
                .split_by_rand_pct(valid_pct, seed)
                .label_from_df(label_delim=label_delim, cols=label_col))
        return cls.create_from_ll(src, **kwargs)

Thanks.
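
On the valid_pct question: the from_df source quoted above calls split_by_rand_pct(valid_pct, seed), so with the default of 0.2 roughly 20% of the rows should be held out at random for validation. The sketch below illustrates that idea in plain Python. It is not fastai's implementation, and the function and file names are only for illustration.

```python
import random

def split_by_rand_pct(items, valid_pct=0.2, seed=None):
    """Return (train, valid), holding out about valid_pct of items at random."""
    rng = random.Random(seed)            # a fixed seed makes the split reproducible
    idx = list(range(len(items)))
    rng.shuffle(idx)
    cut = int(valid_pct * len(items))    # number of items held out for validation
    valid_idx = set(idx[:cut])
    train = [it for i, it in enumerate(items) if i not in valid_idx]
    valid = [it for i, it in enumerate(items) if i in valid_idx]
    return train, valid

names = [f"img_{i}.jpg" for i in range(10)]
train, valid = split_by_rand_pct(names, valid_pct=0.2, seed=42)
print(len(train), len(valid))  # 8 2
```

Passing the same seed gives the same split every time, which is what the seed argument of from_df is for.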


I would do the following:

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

After that, you can go through the data block API and pass in the train and validation dataframes (see the API for ImageList). If you need help with that, let me know.
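
One possible way to do that, sketched under the assumption of fastai v1 and dataframes with the file name in the first column and the label in the second: tag each row with an is_valid flag and let split_from_df pick it up. The helper name and the is_valid column name are my own choices, not something the library requires.

```python
def databunch_from_train_valid(train_df, valid_df, tfms, path='.', size=224, bs=8):
    """Combine a train and a validation DataFrame into one fastai v1 databunch.

    Assumptions: both frames have file names in column 0 and labels in column 1,
    and `tfms` comes from get_transforms().
    """
    # Imported inside the function so the helper can be defined without these libraries installed.
    import pandas as pd
    from fastai.vision import ImageList, imagenet_stats

    # Mark validation rows, then stack both frames into a single DataFrame.
    df = pd.concat([train_df.assign(is_valid=False),
                    valid_df.assign(is_valid=True)], ignore_index=True)

    return (ImageList.from_df(df, path=path, cols=0)
            .split_from_df(col='is_valid')   # rows with is_valid=True become the validation set
            .label_from_df(cols=1)
            .transform(tfms, size=size)
            .databunch(bs=bs)
            .normalize(imagenet_stats))
```

With the split from above, the call might be `data = databunch_from_train_valid(train, test, tfms, bs=bs)`, where `test` plays the role of the validation set.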


Hi,
I am having a similar issue, but my test set is unlabelled!
(Leaf Identification Kaggle problem)
Can I please get some help? I am stuck.

Hi,
I have an ImageNet-style folder, i.e. with ‘test/’, ‘train/’ and ‘valid/’ subfolders. When I create an ImageDataBunch from this folder structure like this:

It only seems to detect the train and valid folders, ignoring the test folder.

How can I test my model on the ‘test’ folder images once it has been trained?

Many thanks!

I faced a very similar issue today where I *did not* explicitly pass the test folder as part of ImageDataBunch.from_folder. This caused the test images to be included in the training dataset.

Here’s the databunch creation code, which also added test as one of the labels in the confusion matrix. As you can see, it includes all the images from the child directories as part of the training set (the test dataset is None).

After fixing the databunch creation code and explicitly specifying the train and test directory names, we can confirm that the databunch is created correctly.

Note that the factory methods are only there for beginners and to grab “easy datasets”. You should learn to use the data block API, which is far more flexible (and lets you control which folders to include/exclude :wink: )
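
As a rough sketch of what that could look like for a folder with train/, valid/ and test/ subfolders (assuming fastai v1; both helper names are hypothetical):

```python
def databunch_with_test_folder(path, tfms, size=224, bs=64):
    """Data block version: explicit train/valid folders plus an unlabeled test folder."""
    # Imported inside the function so the helper can be defined without fastai installed.
    from fastai.vision import ImageList, imagenet_stats

    return (ImageList.from_folder(path)
            .split_by_folder(train='train', valid='valid')  # only these two folders are used for fitting
            .label_from_folder()
            .add_test_folder('test')                        # test images are attached without labels
            .transform(tfms, size=size)
            .databunch(bs=bs)
            .normalize(imagenet_stats))

def predict_test(learn):
    """Return the predicted class index for every image in the test set."""
    from fastai.basic_data import DatasetType

    preds, _ = learn.get_preds(ds_type=DatasetType.Test)    # the returned targets are placeholders
    return preds.argmax(dim=1)
```

The indices returned by `predict_test(learn)` should map back to label names through `learn.data.classes`. Since the test set carries no labels, this gives predictions only, not a test metric.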


Thanks, and sorry for the delayed reply!