How to create a databunch and split it with split_from_df into train, valid and test sets?

Hello everyone: I have been facing this issue for a few weeks and have been unable to solve it.

I am working on medical images to diagnose disease. I am creating a databunch with a custom data loader in fastai; the code is below.
This code creates only a train and a valid split, and the valid split also contains the test set. I don’t know how to load the test set separately. My sets are already defined in a data frame as 80% train, 10% valid, and 10% test.

I know I am doing something wrong in the code, but I am a newbie, and whenever I try to change it I run into many other problems.
My data loader code:

def get_chestxray8(path:PathOrStr, bs:int, img_sz:int, valid_only_bbx:bool=False, tfms:bool=True, convert_mode:str='RGB',
                   normalize:bool=True, norm_stats:Tuple[Floats, Floats]=imagenet_stats, processor:Optional[Callable]=None,
                   **kwargs:Any)->DataBunch:
    '''
    TODO
    '''
    path = Path(path)
    df = pd.read_pickle(path / 'full_ds_bbx.pkl')
    df['is_valid'] = df.set!='Train'
    if valid_only_bbx: df = df[(df.set=='Train') | df.bbx]

    if processor is not None: df = processor(df)

    lbl_dict = df[['file','label']].set_index('file')['label'].to_dict()
    def bbox_label_func(fn:str)->list: return lbl_dict[Path(fn).name]
    lbls = ['No finding', 'Atelectasis', 'Cardiomegaly', 'Consolidation', 'Infiltration', 
    'Lung Opacity', 'Mass', 'Pleural effusion', 'Pleural thickening', 'Pneumothorax', 'Pulmonary fibrosis']


    src = (CustomObjectItemList.from_df(df, path / 'images', cols='file', convert_mode=convert_mode)
                               .split_from_df('is_valid')
                               .label_from_func(bbox_label_func, classes=lbls))

    if tfms: src = src.transform(get_transforms(**kwargs), size=img_sz, tfm_y=True)

    data =  src.databunch(bs=bs, collate_fn=multiclass_bb_pad_collate)
    if normalize: data = data.normalize(stats=norm_stats)

    return data

Please help me. Thanks.
@much_learner @muellerzr

You should be able to take your 10% test set and pass it into a databunch just like you’re doing above. Just split your dataframe so that the 10% test set is not in the other 90%.
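
A minimal sketch of that split with pandas. It assumes the `set` column holds the values 'Train', 'Valid' and 'Test' (only 'Train' appears in the code above, so the other two names are guesses), and the toy data frame and variable names are made up for illustration:

```python
import pandas as pd

# Toy stand-in for full_ds_bbx.pkl; the real frame has more columns.
df = pd.DataFrame({
    'file': [f'img_{i}.png' for i in range(10)],
    'set':  ['Train'] * 8 + ['Valid', 'Test'],  # 'Valid'/'Test' are assumed labels
})

# Hold the test rows out before anything is handed to fastai.
is_test = df['set'] == 'Test'
test_df = df[is_test].reset_index(drop=True)
trainval_df = df[~is_test].reset_index(drop=True)

# The flag now marks only the validation rows.
trainval_df['is_valid'] = trainval_df['set'] != 'Train'

print(len(trainval_df), len(test_df))
```

`trainval_df` can then go through the same databunch code as before, while `test_df` is kept aside for predictions after training.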


If I understand you correctly, should I remove the test rows before preparing this data frame and then prepare a separate databunch for my test set?
Can you provide a code snippet for guidance? Every time I try to make a test databunch I get confused because of the data loader above.

How about this:

If you look at what Zach is doing on
cell 5, he splits the dataframes; note the test dataframe.
In cell 25 he creates his databunch using only the train and val dl, not the test one, even though the test dataframe goes through similar processing.
In cells 34 and 35 he runs predictions on the test dataset, i.e. test_dl. With the preds, you can then determine accuracy yourself.

df['is_valid'] = df.set!='Train'

With this part of the code you are setting the column is_valid to True for all the rows where ‘set’ is not ‘Train’.

Later you use the ‘is_valid’ column to split the dataset, so all your valid and test rows end up in the valid split.

What you can do is move the test split into a separate dataframe and then use it for predictions later.

Or, assuming that you are doing classification, you can try something like this, using native PyTorch with fastai.


Thanks for your reply.
I applied this method and am now predicting on the test set. But one of my main aims is to load the test set into the same databunch as well. If you have experience with Keras, it loads the train, valid, and test splits at once, and after training you can just pass the test data loader to the trained model and evaluate it.

That way we can compute a confusion matrix and other metrics.
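
For reference, a confusion matrix only needs the true and predicted labels of the test set, wherever they were loaded from. A minimal sketch with NumPy, assuming one true and one predicted class index per image (the arrays are made-up values and `confusion_matrix` is a hypothetical helper, not a fastai function):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    "Rows are true classes, columns are predicted classes."
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # Add 1 to cell (true, pred) for every sample, including repeated pairs.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

# Made-up labels standing in for test-set targets and model predictions.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()  # correct predictions sit on the diagonal
print(cm)
print(accuracy)
```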