I have created a new piece of functionality for the data block API that works as follows:
item_list = (JSONImageItemList.from_folder(PATH/'train', extensions=[".txt"])
             .use_partial_data(0.001, seed=seed)  # New functionality: selects a random 0.1% of the total records.
             .random_split_by_pct(0.1, seed=seed)
             .label_from_folder()
             .add_test(ItemList.from_folder(PATH/'test'))
             .databunch())
I used the random_split_by_pct as a guide. Here is the code I propose to add:
def use_partial_data(self, sample_pct:float=1.0, seed:int=None)->'ItemList':
    "Use only a sample of the full dataset."
    if seed is not None: np.random.seed(seed)
    rand_idx = np.random.permutation(range_of(self))  # shuffled indices of all items
    cut = int(sample_pct * len(self))                 # number of items to keep
    return self[rand_idx[:cut]]
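For anyone who wants to try the sampling logic outside fastai, here is a minimal standalone sketch of the same idea. It is not the proposed method itself: the name partial_sample is made up for illustration, it works on a plain list instead of an ItemList, and it swaps NumPy's global seed for a local random.Random generator from the standard library so that seeding does not affect other code.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")

def partial_sample(items: Sequence[T], sample_pct: float = 1.0,
                   seed: Optional[int] = None) -> List[T]:
    """Return a random subset holding about `sample_pct` of `items` (hypothetical helper)."""
    if not 0.0 <= sample_pct <= 1.0:
        raise ValueError("sample_pct must be between 0 and 1")
    rng = random.Random(seed)             # local generator; global random state is untouched
    idx = list(range(len(items)))
    rng.shuffle(idx)                      # random permutation of the indices
    cut = int(sample_pct * len(items))    # number of items to keep, rounded down
    return [items[i] for i in idx[:cut]]

# Example: keep a quarter of 1000 made-up file names.
files = [f"img_{i:04d}.txt" for i in range(1000)]
subset = partial_sample(files, sample_pct=0.25, seed=42)
print(len(subset))
```

Calling it twice with the same seed gives the same subset, which is what makes the sample reproducible between runs.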
I believe this will make using sample data much quicker for prototyping before using the full dataset.
How is use_partial_data supposed to be applied?
Both of the following lines run without error: data.use_partial_data(sample_pct=.2, seed=34) and data = data.use_partial_data(sample_pct=.2, seed=34)
But the next line throws an error, no matter the approach: data.show_batch(rows=3, figsize=(5,4))
AttributeError: 'ImageItemList' object has no attribute 'show_batch'
I would like to use a subset of the whole dataset to speed up experiments.
You have to set it at the very beginning, before splitting and labelling:
data = ImageItemList.from_folder(path).use_partial_data(sample_pct=.2, seed=34).random_split_by_pct()…
You should learn how to use the data block API, as the factory methods will only get you so far.
Here, just copy the source code of that factory method to help you.
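To illustrate why the order matters, here is a toy sketch. ToyItemList and ToyDataBunch are made-up stand-ins, not fastai's classes, and the real implementation differs. The only point is that sampling an item list gives back another item list, while show_batch belongs to the object produced at the end of the chain, which is consistent with the AttributeError above.

```python
import random

class ToyDataBunch:
    """Hypothetical stand-in for the final object of the chain; only this one can show batches."""
    def __init__(self, items):
        self.items = list(items)

    def show_batch(self, rows=3):
        # Return the first `rows` items instead of plotting them.
        return self.items[:rows]

class ToyItemList:
    """Hypothetical stand-in for an item list: it holds items but has no show_batch."""
    def __init__(self, items):
        self.items = list(items)

    def use_partial_data(self, sample_pct=1.0, seed=None):
        rng = random.Random(seed)
        idx = list(range(len(self.items)))
        rng.shuffle(idx)
        cut = int(sample_pct * len(self.items))
        # Sampling returns another item list, not a data bunch.
        return ToyItemList([self.items[i] for i in idx[:cut]])

    def databunch(self):
        # Splitting and labelling are left out of this toy; this is just the final step.
        return ToyDataBunch(self.items)

full = ToyItemList(range(100))
subset = full.use_partial_data(sample_pct=0.2, seed=34)
print(hasattr(subset, "show_batch"))              # the sampled item list has no show_batch
print(hasattr(subset.databunch(), "show_batch"))  # the final object does
```

In the real API the same reasoning applies: sample first, then split, label and build the data bunch, and call show_batch on that final object.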
The data block API seems intuitive. One thing I don’t understand.
My main folder (PATH) contains ‘train’, ‘valid’ and ‘test’ folders. The labels are in a .csv file (‘labels.csv’) in the same main folder. The .csv file lists the filenames, without extensions, and their labels.