I have added a new piece of functionality to the data block API. It works as follows:
item_list = (JSONImageItemList.from_folder(PATH/'train', extensions=[".txt"])
             .use_partial_data(0.001, seed=seed)  # New functionality: selects a random 0.1% of the total records.
             .random_split_by_pct(0.1, seed=seed)
             .label_from_folder()
             .add_test(ItemList.from_folder(PATH/'test'))
             .databunch())
I used the random_split_by_pct as a guide. Here is the code I propose to add:
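def use_partial_data(self, sample_pct:float=1.0, seed:int=None)->'ItemList':
    "Use only a sample of the full dataset."
    # Seed the global NumPy RNG so the same subset is picked on every run.
    if seed is not None: np.random.seed(seed)
    # Shuffle all item indices, then keep the first `sample_pct` fraction.
    rand_idx = np.random.permutation(range_of(self))
    cut = int(sample_pct * len(self))
    # Indexing with an array of indices returns a new ItemList, not an ItemLists.
    return self[rand_idx[:cut]]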
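To try the sampling logic outside fastai, here is a minimal standalone sketch on a plain Python list. It is not the proposed method itself: `sample_partial` is a hypothetical helper name, and it uses the standard library `random` module in place of `np.random.permutation` so that it runs without NumPy or fastai installed.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_partial(items: Sequence[T], sample_pct: float = 1.0,
                   seed: Optional[int] = None) -> List[T]:
    """Return a random `sample_pct` fraction of `items`.

    Hypothetical helper for illustration only; not part of fastai.
    """
    # A local RNG keeps the example from touching global random state.
    # (The proposed method seeds the global NumPy RNG instead.)
    rng = random.Random(seed)
    idx = list(range(len(items)))
    rng.shuffle(idx)                      # random permutation of the indices
    cut = int(sample_pct * len(items))    # number of items to keep, rounded down
    return [items[i] for i in idx[:cut]]


if __name__ == "__main__":
    files = [f"file_{i}.txt" for i in range(1000)]
    subset = sample_partial(files, 0.1, seed=42)
    print(len(subset))  # 100
```

Because `cut` is rounded down, a very small `sample_pct` on a small dataset can select nothing at all: 0.1% of 500 items gives `int(0.5) == 0`.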
I believe this will make it much quicker to prototype on a small sample of the data before moving to the full dataset.
Here is the pull request: