Data Block API: from_folder with labels in a .csv?

I can’t get my data retrieval procedure to work.
My main folder (PATH) contains a ‘train’, ‘valid’ and ‘test’ folder. The labels are in a .csv file (‘labels.csv’) in the same main folder. The .csv file has two columns: filenames without extension (’.tif’) and labels.

I made a chain of methods like below:

data = (ImageItemList.from_folder(PATH, extensions='.tif')
       .use_partial_data(sample_pct = .1, seed= 34)
       .label_from_df(pd.read_csv('labels.csv'))
       .random_split_by_pct(valid_pct=0.2, seed=34)       
       .transform(tfms, size = 96)
       .databunch(bs=64)).normalize(imagenet_stats)

In an earlier example on the forum, label_from_csv was mentioned, but I can’t find this method in the docs.

Essentially I’m trying to translate the following (which works):

data = ImageDataBunch.from_csv(PATH, folder='train', test='test', csv_labels='labels.csv', suffix='.tif', valid_pct = 0.2, ds_tfms=tfms, size=sz, bs=bs
                                  ).normalize(imagenet_stats)

In addition I want to include use_partial_data.
How do I retrieve the data and label the files, using folders and a .csv file?

Any ideas/examples? Thanks

Hi,
It sounds to me like you need the data block API, which is documented under doc->core->datablock.

It seems you already have the train/valid split in two folders, so you can try the following:

ImageItemList.from_csv()
.split_by_folder()
.label_from_df()
.transform()
.databunch()

Now you can specify the path in each function call according to the docs. Your label_from_df() only needs cols to point at your label column (by default it is cols=1, but you can pass e.g. cols=2 or cols='label'). Since you have already passed the csv file path to .from_csv(), it should automatically find and open the csv file.
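To make the role of cols concrete, here is a minimal plain-Python sketch of the idea. It is not fastai's actual implementation: it only shows one csv column supplying the file name (without extension) and another supplying the label, with the folder and suffix added to build the full path. The helper read_items, its parameters, the column names and the sample rows are all made up for illustration.

```python
import csv
import io
from pathlib import Path

# Hypothetical stand-in for labels.csv: column 0 holds file names without
# the '.tif' extension, column 1 holds the label.
CSV_TEXT = "id,label\nimg_001,0\nimg_002,1\n"


def read_items(csv_file, root, folder, suffix, fname_col=0, label_col=1):
    """Return (path, label) pairs built from the chosen csv columns."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    items = []
    for row in reader:
        # Full path = root / folder / (file name + suffix)
        path = Path(root) / folder / (row[fname_col] + suffix)
        items.append((path, row[label_col]))
    return items


items = read_items(io.StringIO(CSV_TEXT), "PATH", "train", ".tif")
for path, label in items:
    print(path.as_posix(), label)
```

Changing fname_col or label_col here plays the same role as changing cols in the chain above: it only selects which column is read, the csv itself stays the same.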

Hope that helps.

Thank you. I solved it with help from another post like this:

    data = (ImageItemList.from_csv(PATH, folder='train', csv_file='labels.csv', cols={your_fname_col_name})
            .use_partial_data(sample_pct=0.1, seed=34)
            .random_split_by_pct(valid_pct=0.2, seed=34)
            .label_from_df(cols={your_label_cols_name})
            .transform(tfms, size=96)
            .databunch(bs=64)).normalize(imagenet_stats)