Add a use_partial_data to the data block api

I have added a new method to the data block API that works as follows:

item_list = (JSONImageItemList.from_folder(PATH/'train', extensions=[".txt"])
             .use_partial_data(0.001, seed=seed)  # New functionality: selects a random 0.1% of the total records
             .random_split_by_pct(0.1, seed=seed)
             .label_from_folder()
             .add_test(ItemList.from_folder(PATH/'test'))
             .databunch())

I used random_split_by_pct as a guide. Here is the code I propose to add:

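def use_partial_data(self, sample_pct:float=1.0, seed:int=None)->'ItemList':
    "Use only a random sample of `sample_pct` of the full dataset, with an optional `seed`."
    # Seeding makes the sample reproducible; note that this reseeds NumPy's global state.
    if seed is not None: np.random.seed(seed)
    # Shuffle all indices, then keep the first `sample_pct` fraction of them.
    rand_idx = np.random.permutation(range_of(self))
    cut = int(sample_pct * len(self))
    # Indexing returns a new list of the same type; the original is left unchanged.
    return self[rand_idx[:cut]]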
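
For anyone who wants to try the sampling logic outside fastai, here is a minimal standard-library sketch of the same idea (shuffle the indices, keep the first sample_pct fraction). It is not the proposed method itself: partial_indices is a hypothetical helper, and it uses a local random.Random generator instead of np.random.seed, so the global random state is not touched.

```python
import random

def partial_indices(n_items, sample_pct=1.0, seed=None):
    """Return int(sample_pct * n_items) randomly chosen indices from range(n_items)."""
    rng = random.Random(seed)        # local generator; same seed gives the same sample
    idx = list(range(n_items))
    rng.shuffle(idx)                 # random permutation of all indices
    cut = int(sample_pct * n_items)  # truncates toward zero, as in the proposed method
    return idx[:cut]

subset = partial_indices(10_000, sample_pct=0.001, seed=42)
print(len(subset))                                           # 10
print(len(partial_indices(500, sample_pct=0.001, seed=42)))  # 0
```

One thing this makes visible: because the cut is truncated with int(), a very small sample_pct on a small dataset can select zero items.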

I believe this will make it much quicker to prototype on a small sample before moving to the full dataset.

Here is the pull request:

Thanks! This seems like a good and useful idea, will review later today.

How is the use_partial_data supposed to be applied?

Both of the following lines run without error:
data.use_partial_data(sample_pct=.2, seed=34)
data = data.use_partial_data(sample_pct=.2, seed=34)

But the next line throws an error either way:
data.show_batch(rows=3, figsize=(5,4))

AttributeError: 'ImageItemList' object has no attribute 'show_batch'

I would like to use a subset of the whole dataset to speed up experiments.

You have to set it at the very beginning, before splitting and labelling:
data = ImageItemList.from_folder(path).use_partial_data(sample_pct = .2, seed= 34).random_split_by_pct()…
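
To illustrate why show_batch fails in the question above, here is a toy sketch. ToyItemList and ToyDataBunch are made-up stand-ins, not fastai classes, and the sampling is simplified to a slice. The only point is that use_partial_data returns a new item list of the same type (as the `return self[...]` in the proposed method does), so the chain still has to be finished before anything DataBunch-specific is available.

```python
class ToyDataBunch:
    """Stand-in for the final object of the chain."""
    def show_batch(self):
        return "showing a batch"

class ToyItemList:
    """Stand-in for an item list at the start of the chain."""
    def __init__(self, items):
        self.items = list(items)

    def use_partial_data(self, sample_pct=1.0):
        # Returns a NEW list of the same type; self is not modified.
        cut = int(sample_pct * len(self.items))
        return ToyItemList(self.items[:cut])

    def databunch(self):
        return ToyDataBunch()

full = ToyItemList(range(100))
subset = full.use_partial_data(0.2)
print(len(full.items), len(subset.items))         # 100 20
print(hasattr(subset, "show_batch"))              # False
print(hasattr(subset.databunch(), "show_batch"))  # True
```

This also shows why calling the method without assigning the result has no effect: the original list keeps all of its items.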

Thank you. I think the problem is that I was trying to apply it to a DataBunch object.
It seems that this is not possible.

For example:

data = ImageDataBunch.from_csv(PATH, folder='train', test='test', csv_labels='labels.csv', suffix='.tif', valid_pct = 0.2, ds_tfms=get_transforms(), size=sz, bs=bs).use_partial_data(sample_pct = .2, seed= 34)

It would be convenient if this worked, though :) Currently my data is stored with .csv labels. Maybe I’m missing something here?

You should learn how to use the data block API, as the factory methods will only get you so far :wink:
Here, you can just copy the source code of that factory method as a starting point.

The data block API seems intuitive, but there is one thing I don’t understand.
My main folder (PATH) contains ‘train’, ‘valid’ and ‘test’ folders. The labels are in a .csv file (‘labels.csv’) in the same main folder. The .csv file holds the filenames (without extensions) and the labels.

I made a chain of methods like below:

data = (ImageItemList.from_folder(PATH, extensions='.tif')
       .use_partial_data(sample_pct = .1, seed= 34)
       .label_from_df(pd.read_csv('labels.csv'))
       .random_split_by_pct(valid_pct=0.2, seed=34)       
       .transform(tfms, size = 96)
       .databunch(bs=64)).normalize(imagenet_stats)

In an earlier example on the forum you used label_from_csv. I can’t find this method in the docs.

Essentially I’m trying to translate the following, after addition of .use_partial_data:

data = ImageDataBunch.from_csv(PATH, folder='train', test='test', csv_labels='labels.csv', suffix='.tif',
                               valid_pct=0.2, ds_tfms=get_transforms(do_flip=True, flip_vert=True),
                               size=sz, bs=bs).normalize(imagenet_stats)

How do I label the inputs using the folder structure and a .csv file, considering the partial data approach?

Your csv must have a column with the filenames (otherwise I don’t know how you would label), so you should use:

data = (ImageItemList.from_csv(PATH, 'labels.csv', cols={your_fname_col_name})
       .use_partial_data(sample_pct = .1, seed= 34)
       .random_split_by_pct(valid_pct=0.2, seed=34)
       .label_from_df(cols={your_label_cols_name})
       .transform(tfms, size = 96)
       .databunch(bs=64)).normalize(imagenet_stats)

Also, always remember to split before you label or you’ll get an error.

Thanks again! Getting there…
Somehow, with this approach the ‘train’ folder is not found. This error showed up:

FileNotFoundError: [Errno 2] No such file or directory:
/home/jupyter/tutorials/data/lymph_node/./file_1.tif

As you can see there is a ‘.’ instead of ‘train’.

Should the labels.csv file be in the same folder as the training images?

What did I forget this time :disappointed: :slightly_smiling_face: ?

You should add folder='train' in your first call (so that it is used in place of ‘.’).
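
As a rough illustration of where the ‘.’ in the error comes from, the sketch below assumes each item path is built by joining the base path, the folder and the file name. item_path is a hypothetical helper, not a fastai function; the default folder of "." is inferred from the error message above, and the suffix argument is an assumption based on the CSV holding names without extensions.

```python
def item_path(path, fname, folder=".", suffix=""):
    # Hypothetical helper: plain string join, so a "." folder is kept verbatim.
    return f"{path}/{folder}/{fname}{suffix}"

base = "/home/jupyter/tutorials/data/lymph_node"

# Default folder: the path points at the main folder, where the images are not stored.
print(item_path(base, "file_1", suffix=".tif"))
# /home/jupyter/tutorials/data/lymph_node/./file_1.tif

# With folder="train": the path points inside the training folder.
print(item_path(base, "file_1", folder="train", suffix=".tif"))
# /home/jupyter/tutorials/data/lymph_node/train/file_1.tif
```

Under that assumption, the labels file can stay in the main folder; only the image paths need the extra folder component.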
