Add a use_partial_data to the data block api

(Kevin Bird) #1

I have created a new functionality for the datablock api that works as follows:

item_list = (JSONImageItemList.from_folder(PATH/'train', extensions=[".txt"])
             .use_partial_data(0.001, seed=seed)  # New functionality: selects a random 0.1% of the total records
             .random_split_by_pct(0.1, seed=seed)
             .label_from_folder()
             .add_test(ItemList.from_folder(PATH/'test'))
             .databunch())

I used the random_split_by_pct as a guide. Here is the code I propose to add:

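def use_partial_data(self, sample_pct:float=1.0, seed:int=None)->'ItemList':
    "Use only a random sample of `sample_pct` of the full dataset."
    # Seed NumPy's global RNG so the same subset is drawn on every run
    if seed is not None: np.random.seed(seed)
    # Shuffle all item indices, then keep the first `cut` of them
    rand_idx = np.random.permutation(range_of(self))
    cut = int(sample_pct * len(self))
    # Indexing with an array of indices returns a new item list
    return self[rand_idx[:cut]]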

I believe this will make using sample data much quicker for prototyping before using the full dataset.
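
To see the sampling logic on its own, here is a minimal sketch that applies the same permutation-and-cut idea to a plain Python list instead of an item list. The function name sample_fraction and the file names are made up for illustration; only the NumPy calls are taken from the proposal above.

```python
import numpy as np

def sample_fraction(items, sample_pct=1.0, seed=None):
    """Return a random subset holding about sample_pct of items (illustrative helper)."""
    # Seeding the global RNG makes the chosen subset reproducible
    if seed is not None:
        np.random.seed(seed)
    # Shuffle the indices and keep the first `cut` of them
    rand_idx = np.random.permutation(len(items))
    cut = int(sample_pct * len(items))
    return [items[i] for i in rand_idx[:cut]]

# Hypothetical file names standing in for a dataset
files = [f"img_{i}.png" for i in range(1000)]
subset = sample_fraction(files, sample_pct=0.1, seed=42)
print(len(subset))
```

Because the cut is computed with int(), the sample size is rounded down, so a very small sample_pct on a small dataset can yield an empty subset.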

Here is the pull request:

6 Likes

#2

Thanks! This seems like a good and useful idea, will review later today.

1 Like

How to use sample subset of dataset with ImageClassifierData?
(David) #3

How is the use_partial_data supposed to be applied?

Both of the following lines run without error:
data.use_partial_data(sample_pct = .2, seed= 34)
data = data.use_partial_data(sample_pct = .2, seed= 34)

But the next line throws an error with either approach:
data.show_batch(rows=3, figsize=(5,4))

AttributeError: 'ImageItemList' object has no attribute 'show_batch'

I would like to use a subset of the whole dataset to speed up experiments.

0 Likes

#4

You have to set it at the very beginning, before splitting and labelling:
data = ImageItemList.from_folder(path).use_partial_data(sample_pct = .2, seed= 34).random_split_by_pct()…
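
To illustrate why the order matters, here is a rough sketch in plain NumPy, not fastai code: the subset is drawn first, and the train/validation split is then made inside that subset, so both sets shrink together. The helper name sample_then_split and the file names are assumptions made for this example.

```python
import numpy as np

def sample_then_split(items, sample_pct=1.0, valid_pct=0.2, seed=None):
    """Draw a random subset, then split it into train and valid (illustrative helper)."""
    if seed is not None:
        np.random.seed(seed)
    # Step 1: keep only a random fraction of all items
    n = len(items)
    sample_idx = np.random.permutation(n)[:int(sample_pct * n)]
    sample = [items[i] for i in sample_idx]
    # Step 2: split the reduced list into validation and training parts
    split_idx = np.random.permutation(len(sample))
    cut = int(valid_pct * len(sample))
    valid = [sample[i] for i in split_idx[:cut]]
    train = [sample[i] for i in split_idx[cut:]]
    return train, valid

# Hypothetical file names standing in for a dataset
files = [f"img_{i}.tif" for i in range(1000)]
train, valid = sample_then_split(files, sample_pct=0.2, valid_pct=0.2, seed=34)
print(len(train), len(valid))
```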

2 Likes

(David) #5

Thank you. I think the problem is that I am trying to apply it to a DataBunch object, which does not seem to be possible.

For example:

data = ImageDataBunch.from_csv(PATH, folder='train', test='test', csv_labels='labels.csv', suffix='.tif', valid_pct = 0.2, ds_tfms=get_transforms(), size=sz, bs=bs).use_partial_data(sample_pct = .2, seed= 34)

It would be convenient if this worked, though :) Currently my labels are stored in a .csv file. Maybe I’m missing something here?

0 Likes

#6

You should learn how to use the data block API as the factory methods will only get you so far :wink:
Here just copy the source code of that factory method to help you.

1 Like

(David) #7

The data block API seems intuitive, but there is one thing I don’t understand.
My main folder (PATH) contains a ‘train’, a ‘valid’ and a ‘test’ folder. The labels are in a .csv file (‘labels.csv’) in the same main folder. The .csv file lists the filenames (without extensions) and their labels.

I made a chain of methods like below:

data = (ImageItemList.from_folder(PATH, extensions='.tif')
       .use_partial_data(sample_pct = .1, seed= 34)
       .label_from_df(pd.read_csv('labels.csv'))
       .random_split_by_pct(valid_pct=0.2, seed=34)       
       .transform(tfms, size = 96)
       .databunch(bs=64)).normalize(imagenet_stats)

In an earlier example on the forum you used label_from_csv. I can’t find this method in the docs.

Essentially I’m trying to translate the following call into the data block API so that I can add .use_partial_data:

data = ImageDataBunch.from_csv(PATH, folder='train', test='test', csv_labels='labels.csv', suffix='.tif', valid_pct = 0.2, ds_tfms=get_transforms(do_flip=True, flip_vert=True), size=sz, bs=bs
                                  ).normalize(imagenet_stats)

How do I label the inputs using the folder structure and a .csv file, considering the partial data approach?

0 Likes

#8

Your csv must have a column with the filenames (otherwise I don’t know how you label), so you should use

data = (ImageItemList.from_csv(PATH, 'labels.csv', cols={your_fname_col_name})
       .use_partial_data(sample_pct = .1, seed= 34)
       .random_split_by_pct(valid_pct=0.2, seed=34)
       .label_from_df(cols={your_label_cols_name})
       .transform(tfms, size = 96)
       .databunch(bs=64)).normalize(imagenet_stats)

Also, always remember to split before you label, or you’ll get an error.

0 Likes

(David) #9

Thanks again! Getting there…
Somehow, with this approach the ‘train’ folder is not found. This error showed up:

FileNotFoundError: [Errno 2] No such file or directory:
/home/jupyter/tutorials/data/lymph_node/./file_1.tif

As you can see, the path contains a ‘.’ where ‘train’ should be.

Should the labels.csv file be in the same folder as the training images?

What did I forget this time :disappointed: :slightly_smiling_face: ?

0 Likes

#10

You should add folder='train' to your first call (from_csv), so that it is used in place of ‘.’.
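
As a simplified illustration of where the ‘.’ in that error comes from (this is not fastai's actual code), assume each file path is assembled from a base path, a folder that defaults to '.', the filename from the CSV, and an optional suffix. The helper build_path below is hypothetical.

```python
def build_path(base, fname, folder=".", suffix=""):
    """Join base, folder and filename the way the error message suggests (hypothetical helper)."""
    return f"{base}/{folder}/{fname}{suffix}"

base = "/home/jupyter/tutorials/data/lymph_node"

# Without a folder argument, the default '.' ends up in the path
default_path = build_path(base, "file_1", suffix=".tif")
print(default_path)

# Passing the folder name points at the training images instead
train_path = build_path(base, "file_1", folder="train", suffix=".tif")
print(train_path)
```

Under this assumption the CSV can stay in the main folder; only the folder argument decides which subdirectory the filenames are resolved against.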

1 Like