Add a use_partial_data method to the data block API

KevinB · December 1, 2018, 5:10am

I have created a new feature for the data block API that works as follows:

item_list = (JSONImageItemList.from_folder(PATH/'train', extensions=[".txt"])
             .use_partial_data(0.001, seed=seed)  # New functionality: selects a random 0.1% of the total records.
             .random_split_by_pct(0.1, seed=seed)
             .label_from_folder()
             .add_test(ItemList.from_folder(PATH/'test'))
             .databunch())

I used the random_split_by_pct as a guide. Here is the code I propose to add:

def use_partial_data(self, sample_pct:float=1.0, seed:int=None)->'ItemList':
    "Use only a sample of the full dataset."
    if seed is not None: np.random.seed(seed)
    rand_idx = np.random.permutation(range_of(self))
    cut = int(sample_pct * len(self))
    return self[rand_idx[:cut]]
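To see the sampling logic in isolation, here is a minimal sketch on a plain Python list, with no fastai dependency. The function name sample_fraction is hypothetical, and it uses the standard library's random module as a stand-in for np.random.permutation, so the items it picks for a given seed will differ from what NumPy would pick.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_fraction(items: Sequence[T], sample_pct: float = 1.0,
                    seed: Optional[int] = None) -> List[T]:
    """Return a random subset holding roughly sample_pct of items."""
    rng = random.Random(seed)            # local RNG, global state is untouched
    idx = list(range(len(items)))
    rng.shuffle(idx)                     # random permutation of the indices
    cut = int(sample_pct * len(items))   # number of items to keep, rounded down
    return [items[i] for i in idx[:cut]]


# Hypothetical file names, only for illustration.
files = [f"img_{i}.tif" for i in range(1000)]
subset = sample_fraction(files, sample_pct=0.25, seed=34)
```

Because cut is rounded down, a very small sample_pct on a small dataset can yield zero items, which is worth keeping in mind when prototyping.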

I believe this will make using sample data much quicker for prototyping before using the full dataset.

Here is the pull request:

sgugger · December 1, 2018, 12:31pm

Thanks! This seems like a good and useful idea, will review later today.

sinsji · January 29, 2019, 8:01pm

How is use_partial_data supposed to be applied?

Both of the following lines run without error:
data.use_partial_data(sample_pct = .2, seed= 34) OR
data = data.use_partial_data(sample_pct = .2, seed= 34)

But the next line throws an error, no matter the approach:
data.show_batch(rows=3, figsize=(5,4))

AttributeError: 'ImageItemList' object has no attribute 'show_batch'

I would like to use a subset of the whole dataset to speed up experiments.

sgugger · January 29, 2019, 8:05pm

You have to set it at the very beginning, before splitting and labelling:
data = ImageItemList.from_folder(path).use_partial_data(sample_pct = .2, seed= 34).random_split_by_pct()…
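A toy sketch of why the order matters: sampling lives on the item-list stage, while show_batch only exists on the final object. ToyItemList and ToyDataBunch are hypothetical stand-ins written for this illustration, not fastai classes, and the sampling here simply keeps the first items rather than shuffling.

```python
class ToyItemList:
    """Stand-in for an item list: the stage where sampling is available."""

    def __init__(self, items):
        self.items = list(items)

    def use_partial_data(self, sample_pct=1.0):
        # Keep the first sample_pct of the items (no shuffling, for brevity).
        cut = int(sample_pct * len(self.items))
        return ToyItemList(self.items[:cut])

    def databunch(self):
        # Move on to the final stage of the chain.
        return ToyDataBunch(self.items)


class ToyDataBunch:
    """Stand-in for the final object: it can show batches but cannot sample."""

    def __init__(self, items):
        self.items = items

    def show_batch(self, rows=3):
        return self.items[:rows]


item_list = ToyItemList(range(10)).use_partial_data(0.5)  # still an item list
bunch = item_list.databunch()                              # final stage
```

In this toy model, calling show_batch on item_list or use_partial_data on bunch raises AttributeError, which matches the kind of error reported above.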

sinsji · January 29, 2019, 8:14pm

Thank you. I think the problem is that I was trying to apply it to a DataBunch object.
It seems that this is not possible.

For example:

data = ImageDataBunch.from_csv(PATH, folder='train', test='test', csv_labels='labels.csv', suffix='.tif', valid_pct = 0.2, ds_tfms=get_transforms(), size=sz, bs=bs).use_partial_data(sample_pct = .2, seed= 34)

It would be convenient if this worked, though :) Currently my data is stored with .csv labels. Maybe I’m missing something here?

sgugger · January 29, 2019, 8:58pm

You should learn how to use the data block API, as the factory methods will only get you so far.
Here, just copy the source code of that factory method to help you.

sinsji · January 30, 2019, 9:02am

The data block API seems intuitive, but there is one thing I don’t understand.
My main folder (PATH) contains ‘train’, ‘valid’ and ‘test’ folders. The labels are in a .csv file (‘labels.csv’) in the same main folder. The .csv file lists the filenames (without extensions) and the labels.

I made a chain of methods like below:

data = (ImageItemList.from_folder(PATH, extensions='.tif')
       .use_partial_data(sample_pct = .1, seed= 34)
       .label_from_df(pd.read_csv('labels.csv'))
       .random_split_by_pct(valid_pct=0.2, seed=34)       
       .transform(tfms, size = 96)
       .databunch(bs=64)).normalize(imagenet_stats)

In an earlier example on the forum you used label_from_csv. I can’t find this method in the docs.

Essentially I’m trying to translate the following, after addition of .use_partial_data:

data = ImageDataBunch.from_csv(PATH, folder='train', test='test', csv_labels='labels.csv', suffix='.tif', valid_pct = 0.2, ds_tfms=get_transforms(do_flip=True, flip_vert=True), size=sz, bs=bs
                                  ).normalize(imagenet_stats)

How do I label the inputs using the folder structure and a .csv file, considering the partial data approach?

sgugger · January 30, 2019, 2:23pm

Your csv must have a column with the filenames (otherwise I don’t know how you label), so you should use

data = (ImageItemList.from_csv(PATH, 'labels.csv', cols={your_fname_col_name})
       .use_partial_data(sample_pct = .1, seed= 34)
       .random_split_by_pct(valid_pct=0.2, seed=34)
       .label_from_df(cols={your_label_cols_name})
       .transform(tfms, size = 96)
       .databunch(bs=64)).normalize(imagenet_stats)

Also, always remember to split before you label, or you’ll get an error.
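The order of the chain (subsample, then split, then label) can be sketched with the standard library alone. This does not reproduce fastai's internals; the CSV content, the column names id and label, and the use of random instead of NumPy are all assumptions made for the illustration.

```python
import csv
import io
import random

# Hypothetical labels.csv content: a filename column and a label column.
CSV_TEXT = "id,label\n" + "".join(f"file_{i},{i % 2}\n" for i in range(100))

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
labels = {row["id"]: int(row["label"]) for row in rows}
names = [row["id"] for row in rows]

rng = random.Random(34)
rng.shuffle(names)

# 1. Subsample first: keep 10% of the items.
names = names[: int(0.1 * len(names))]

# 2. Split next: hold out 20% of the sample for validation.
n_valid = int(0.2 * len(names))
valid_names, train_names = names[:n_valid], names[n_valid:]

# 3. Label last: look up each remaining filename in the CSV mapping.
train = [(name, labels[name]) for name in train_names]
valid = [(name, labels[name]) for name in valid_names]
```

Sampling before the split means both the training and validation sets shrink together, so the 0.2 validation fraction applies to the sample rather than to the full dataset.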

sinsji · January 30, 2019, 3:00pm

Thanks again! Getting there…
Somehow, with this approach the ‘train’ folder is not found. This error showed up:

FileNotFoundError: [Errno 2] No such file or directory:
/home/jupyter/tutorials/data/lymph_node/./file_1.tif

As you can see there is a ‘.’ instead of ‘train’.

Should the labels.csv file be in the same folder as the training images?

What did I forget this time?

sgugger · January 30, 2019, 3:26pm

You should add folder='train' in your first call (so that it is used in place of the ‘.’).
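A small sketch of how a default folder of '.' could produce the path in that traceback. The helper build_path is hypothetical and the joining logic is inferred from the error message, not taken from fastai's source.

```python
import posixpath


def build_path(root, fname, folder="."):
    """Join root, folder and file name the way the traceback suggests."""
    return posixpath.join(root, folder, fname)


ROOT = "/home/jupyter/tutorials/data/lymph_node"

# With the default folder, the '.' ends up in the middle of the path.
without_folder = build_path(ROOT, "file_1.tif")

# Passing folder="train" points the path at the training images instead.
with_folder = build_path(ROOT, "file_1.tif", folder="train")
```

Under these assumptions, the labels file can stay in the main folder while the images are looked up inside the subfolder named by folder.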