How do you use the data block API when you already have separate training and validation DataFrames?

Still trying to wrap my head around how to use the new data block API to accomplish things I was previously able to do without it (and that look to be deprecated in the near future).

So … I have a train_df and a valid_df ready to go for a multi-label text classification problem, along with the column names for the labels and for the text data I want to use.

How do I create a DataBunch using the data block API in this scenario?

1 Like

I would combine both dataframes into a single dataframe (with two columns filename and label), then use:

data = (ImageFileList.from_df(df)
        .label_from_df(df)
        .split_by_idx(valid_idx)
        .datasets()
        .databunch())

Here’s an example on Google Colab

2 Likes

Nice suggestion!

Another option would be to add a column to the merged DataFrame, like is_valid, so I know which rows should be used for the training and validation datasets (rather than having to know which indices to split on).

Actually, even better might be to name this column dataset with one of the values train | valid | test to indicate which dataset each row belongs in.

I was hoping there might be a way to just pass in all 2-3 DataFrames initially and then apply other data block API calls to indicate which should be used for the training, validation, and (optionally) test datasets (similar to how you can specify which folders to use when reading the inputs from a parent folder).

Yes. You could use np.where() on the is_valid column to get the indices of the validation rows; this would be your valid_idx.

It’s one line of code to merge/combine multiple DataFrames (see the pd.merge or pd.concat docs).
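A minimal sketch of the whole idea, combining the two frames and recovering valid_idx (the DataFrame contents here are made up for illustration):

```python
import numpy as np
import pandas as pd

train_df = pd.DataFrame({'filename': ['a.jpg', 'b.jpg'], 'label': [0, 1]})
valid_df = pd.DataFrame({'filename': ['c.jpg'], 'label': [1]})

# Tag each frame before stacking them into one DataFrame
train_df['is_valid'] = False
valid_df['is_valid'] = True
df = pd.concat([train_df, valid_df], ignore_index=True)

# Indices of the validation rows, usable as valid_idx in split_by_idx()
valid_idx = np.where(df['is_valid'])[0]
print(valid_idx)  # [2]
```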

@kachio, @wgpubs

Can you check the lengths of train_ds and valid_ds in your case? For me, both show the same length, which is incorrect, even though the DataBunch was created successfully!

How would this work in the case of dealing with text data inside the df?

For example, let’s say I have a DataFrame with the following columns: text, is_positive, is_negative.

How would I use the DataBunch API to create a DataBunch for an NLP multilabel classification problem using the above DataFrame?

I’ve been fighting with this for about 2 hours now and can’t make sense of how to do it with the new API. On top of that, the documentation seems to be out of sync with the source, which makes it even harder to figure things out.

My understanding is that the input DataFrame should have two columns (e.g. name/text and label). One way to deal with multi-label data, as seen in the Planet notebook, is to concatenate the labels into a single string. In the case you’ve presented, is_positive and is_negative would be combined into a single label, e.g. '0 1' or '1 1'. (The labels are separated by a space; see screenshot.)

To create the databunch:

data = (ItemList.from_df(df)
        .random_split_by_pct()
        .label_from_df(sep=' ')
        .databunch())

I was able to create databunch using ItemList for multilabel data and text. Here’s an example notebook

Hello, sir, I’ve been stuck on this for two weeks.
I am doing the same thing: I made a set column with values train, valid, and test, but I am using an is_valid column, which combines the test and validation splits into one (valid), so my test set ends up empty.

My dataloader is below:

def get_chestxray8(path:PathOrStr, bs:int, img_sz:int, valid_only_bbx:bool=False, tfms:bool=False, convert_mode:str='RGB',
                   normalize:bool=True, norm_stats:Tuple[Floats, Floats]=imagenet_stats, processor:Optional[Callable]=None,
                   **kwargs:Any)->DataBunch:
    '''
    TODO
    '''
    path = Path(path)
    df = pd.read_pickle(path/'full_ds_bbx.pkl')
    df['is_valid'] = df.set != 'Train'  # marks both 'Valid' and 'Test' rows as validation
    if valid_only_bbx: df = df[(df.set=='Train')]

    if processor is not None: df = processor(df)

    lbl_dict = df[['file','label']].set_index('file')['label'].to_dict()
    def bbox_label_func(fn:str)->list: return lbl_dict[Path(fn).name]
    lbls = ['No finding', 'Atelectasis', 'Cardiomegaly', 'Consolidation', 'Infiltration', 
    'Lung Opacity', 'Mass', 'Pleural effusion', 'Pleural thickening', 'Pneumothorax', 'Pulmonary fibrosis']


    src = (CustomObjectItemList.from_df(df, path / 'images', cols='file', convert_mode=convert_mode)
                               .split_from_df('is_valid')
                               .label_from_func(bbox_label_func, classes=lbls))

    if tfms: src = src.transform(get_transforms(**kwargs), size=img_sz, tfm_y=True)

    data = src.databunch(bs=bs, collate_fn=multiclass_bb_pad_collate)
    if normalize: data = data.normalize(stats=norm_stats)

    return data

and I get this databunch, where test and valid are combined and I only get valid:

ImageDataBunch;

Train: LabelList (12662 items)
x: CustomObjectItemList
Image (3, 512, 512),Image (3, 512, 512),Image (3, 512, 512),Image (3, 512, 512),Image (3, 512, 512)
y: CustomObjectCategoryList
ImageBBox (512, 512),ImageBBox (512, 512),ImageBBox (512, 512),ImageBBox (512, 512),ImageBBox (512, 512)
Path: /home/ali/Desktop/CX Product/RpSalWeaklyDet/images;

Valid: LabelList (2129 items)
x: CustomObjectItemList
Image (3, 512, 512),Image (3, 512, 512),Image (3, 512, 512),Image (3, 512, 512),Image (3, 512, 512)
y: CustomObjectCategoryList
ImageBBox (512, 512),ImageBBox (512, 512),ImageBBox (512, 512),ImageBBox (512, 512),ImageBBox (512, 512)
Path: /home/ali/Desktop/CX Product/RpSalWeaklyDet/images;

Test: None

Please suggest how I could add a test set without combining test + valid.
Thanks

The easiest way to add a test set and get predictions is to create one using your fast.ai DataLoaders:

test_dl = dls.test_dl(test_df, with_labels=True)
preds = learn.get_preds(dl=test_dl)

… where test_df looks just like the DataFrame you’re using to define your training/validation data. You can see how I do this in my Blurr library here.

Another option is to simply use PyTorch to get the predictions by iterating through the test_dl DataLoader yourself, as I illustrate here.
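Iterating a DataLoader yourself is just the standard PyTorch inference pattern; a self-contained sketch (the dummy model and random tensors below stand in for learn.model and the real test_dl):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for learn.model and dls.test_dl(test_df)
model = torch.nn.Linear(4, 2)
test_dl = DataLoader(TensorDataset(torch.randn(10, 4)), batch_size=4)

model.eval()                   # disable dropout/batchnorm training behavior
preds = []
with torch.no_grad():          # no gradients needed at inference time
    for (xb,) in test_dl:
        preds.append(model(xb))
preds = torch.cat(preds)
print(preds.shape)  # torch.Size([10, 2])
```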

See the docs for DataLoaders.test_dl and if you’re looking for a detailed walk-thru, check out @muellerzr 's post on test_dl here.

1 Like