How to add test data into ImageDataBunch.from_df and make predictions?

(Khunakorn Luyaphan) #1

Hello all

I am very confused about how the data structures work in fast.ai.
I am trying to predict on the test data after training the model.

This runs with no error at all (the default call):

data = ImageDataBunch.from_df(path='data/', df=df_trn, ds_tfms=tfms,
                              size=224, bs=bs).normalize(imagenet_stats)

When I try to pass a test dataset in (test=df_tst), I get an error:

data = ImageDataBunch.from_df(path='data/', df=df_trn, ds_tfms=tfms,
                              size=224, bs=bs, test=df_tst).normalize(imagenet_stats)

TypeError: expected str, bytes or os.PathLike object, not list

So now I try to pass just the list of file names (test=list(df_tst['name'])):

data = ImageDataBunch.from_df(path='data/', df=df_trn, ds_tfms=tfms,
                              size=224, bs=bs, test=list(df_tst['name'])).normalize(imagenet_stats)

and I still get an error.
Please help.

(Zachary Mueller) #2

Do you want a labeled test set or unlabeled?
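For the unlabeled case, here is a minimal sketch, assuming fastai v1 and dataframes with name and label columns. As far as I can tell, the test argument of ImageDataBunch.from_df is treated as a folder name, which would explain the TypeError when a dataframe or list is passed. The data block API lets you attach the test images with add_test instead. The helper names (build_data_with_test, predict_test, attach_predictions) and the default bs are my own, not part of fastai.

```python
import pandas as pd


def build_data_with_test(df_trn, df_tst, tfms, path="data/", size=224, bs=64):
    """Build a DataBunch whose unlabeled test set comes from df_tst."""
    # fastai v1 imports are kept local so attach_predictions works without fastai.
    from fastai.vision import ImageList, imagenet_stats

    src = (ImageList.from_df(df_trn, path=path, cols="name")
           .split_by_rand_pct(0.2)
           .label_from_df(cols="label"))
    # add_test takes an ItemList; any labels in df_tst are not used here.
    src.add_test(ImageList.from_df(df_tst, path=path, cols="name"))
    return (src.transform(tfms, size=size)
               .databunch(bs=bs)
               .normalize(imagenet_stats))


def predict_test(learn):
    """Return the predicted class index for every test image."""
    from fastai.basic_data import DatasetType

    preds, _ = learn.get_preds(ds_type=DatasetType.Test)
    return preds.argmax(dim=1).tolist()


def attach_predictions(df_tst, pred_idx, classes):
    """Return a copy of df_tst with a 'pred' column of class names.

    Assumes pred_idx is in the same order as the rows of df_tst.
    """
    out = df_tst.copy()
    out["pred"] = [classes[i] for i in pred_idx]
    return out
```

Usage would look roughly like attach_predictions(df_tst, predict_test(learn), learn.data.classes) once the learner has been trained on the databunch returned by build_data_with_test.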


(Khunakorn Luyaphan) #3

Thanks for the fast reply :slight_smile:
The train and test dataframes look the same, just like in the lecture, something like this:

| name | label |
|---|---|
| path_img_1 | category_1 |
| path_img_2 | category_2 |
| path_img_3 | category_1 |
| path_img_4 | category_2 |

So each row has the image path and the answer (the label).


(Zachary Mueller) #4

You’ll want to make a separate databunch just for the test set, and use it as the validation data when you want to run analysis via learn.validate(). The reason is that test sets in fastai are unlabeled, so this is how we can make a labeled test set to work with. An example using tabular data is shown below:

data = (TabularList.from_df(train_set, path=Path(''), cat_names=cat_var,
                            cont_names=cont_var, procs=procs)
        .split_by_rand_pct(0.2)
        .label_from_df(dep_var, classes=classes)
        .databunch(bs=5000))

data_test = (TabularList.from_df(test, path=Path(''), cat_names=cat_var,
                                 cont_names=cont_var, procs=procs, processor=data.processor)
             .split_none()
             .label_from_df(dep_var, classes=classes)
             .databunch(bs=5000))

Notice that I make a separate databunch here: classes ensures both databunches use the same classes, and processor makes sure the test data is preprocessed the same way as the training data. If you need it, I can quickly write one for an ImageList in a moment, but see if you can work it out yourself first. Also notice the split_none() on the test set databunch.

Then when you are ready to use it and validate, you can do the following:

learn.data.valid_dl = data_test.train_dl
learn.validate()
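
For images, the same idea could look like the sketch below. This is only a rough version, assuming fastai v1, a dataframe with name and label columns, and tfms coming from get_transforms(). The function names, the path and bs defaults, and the unseen_labels check are my own additions, not fastai API.

```python
import pandas as pd


def unseen_labels(df_tst, classes, label_col="label"):
    """Labels that occur in the test dataframe but not in the training classes."""
    return sorted(set(df_tst[label_col]) - set(classes))


def build_labeled_test_data(df_tst, data, tfms, path="data/", size=224, bs=64):
    """Build a separate, labeled databunch for the test dataframe."""
    # fastai v1 imports are kept local so unseen_labels works without fastai.
    from fastai.vision import ImageList, imagenet_stats

    return (ImageList.from_df(df_tst, path=path, cols="name")
            .split_none()
            # Reuse the training classes so the label indices line up.
            .label_from_df(cols="label", classes=data.classes)
            # tfms[1] is the validation-time transform list from get_transforms().
            .transform((tfms[1], tfms[1]), size=size)
            .databunch(bs=bs)
            .normalize(imagenet_stats))


def validate_on(learn, data_test):
    """Swap the test dataloader in as the validation set, then validate."""
    learn.data.valid_dl = data_test.train_dl
    return learn.validate()
```

It is probably worth calling unseen_labels(df_tst, data.classes) first, since the test labels are expected to be a subset of the training classes.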

Good luck!

(Khunakorn Luyaphan) #5

Thanks a lot, I’ll take some time to digest that :slight_smile:


(Jbo) #6

Hi @muellerzr,

  1. How do I set aside a validation set from a dataframe? Looking at the fast.ai code, from_df sets valid_pct to 0.2. Does that mean 20% of the data is automatically kept aside as the validation set?

Below is my code. When I call learn.recorder.plot_losses(), it doesn’t show the validation loss graph.

I have one DataFrame with the image paths and labels.

image_dataset = pd.concat([df['image_path'], df['lesion']], axis=1, keys=['name', 'label'])
bs = 8

tfms = get_transforms(flip_vert=True)
data = ImageDataBunch.from_df(".", image_dataset, ds_tfms=tfms, size=450, bs=bs).normalize(imagenet_stats)

The fast.ai code:

    @classmethod
    def from_df(cls, path:PathOrStr, df:pd.DataFrame, folder:PathOrStr=None, label_delim:str=None, valid_pct:float=0.2,
                seed:int=None, fn_col:IntsOrStrs=0, label_col:IntsOrStrs=1, suffix:str='', **kwargs:Any)->'ImageDataBunch':
        "Create from a `DataFrame` `df`."
        src = (ImageList.from_df(df, path=path, folder=folder, suffix=suffix, cols=fn_col)
                .split_by_rand_pct(valid_pct, seed)
                .label_from_df(label_delim=label_delim, cols=label_col))
        return cls.create_from_ll(src, **kwargs)

Thanks.


(Zachary Mueller) #7

I would do the following:

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

After that, you can go through the data block API and pass in the train and validation dataframes (see the API for ImageList); the test split returned above plays the role of the validation set. If you need help with that, let me know.
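
One way this could be wired up is sketched below, assuming fastai v1 and the name/label columns of image_dataset. Instead of two dataframes it keeps a single one and flags the validation rows in an is_valid column, which split_from_df can read. It also uses pandas sample in place of train_test_split. The helper names and the seed default are hypothetical.

```python
import pandas as pd


def mark_validation_rows(df, valid_pct=0.2, seed=42):
    """Return a copy of df with a boolean 'is_valid' column set on a random subset."""
    out = df.reset_index(drop=True).copy()
    valid_idx = out.sample(frac=valid_pct, random_state=seed).index
    out["is_valid"] = out.index.isin(valid_idx)
    return out


def build_data(df, tfms, path=".", size=450, bs=8):
    """Build an image databunch from a dataframe that has an 'is_valid' column."""
    # fastai v1 imports are kept local so mark_validation_rows works without fastai.
    from fastai.vision import ImageList, imagenet_stats

    return (ImageList.from_df(df, path=path, cols="name")
            .split_from_df(col="is_valid")   # rows with is_valid=True go to validation
            .label_from_df(cols="label")
            .transform(tfms, size=size)
            .databunch(bs=bs)
            .normalize(imagenet_stats))
```

Usage would be something like data = build_data(mark_validation_rows(image_dataset), tfms). Fixing the seed keeps the same rows in the validation set between runs.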
