How to shrink a large dataset to work locally without GPU

To experiment with code efficiently on a local machine without a GPU, you need a smaller dataset, or a subset of a large one. Since I am experimenting with Jeremy's Paddy competition notebooks, I needed a way to shrink its dataset, which is about 1 GB in size.

At first, I tried to shrink the training and validation sets after I had already built a DataLoaders object dls. I managed to shrink the training set with dls.train_ds.items = dls.train_ds.items[:1000], and the validation set likewise. However, I then couldn't run learn.lr_find or learn.fit without errors. It seems dls still relies on metadata produced when it was created from the full dataset.
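For reference, here is roughly what that failed attempt looked like (a sketch, not my exact code; my guess is that the split indices inside dls still reference positions in the full file list, so indexing breaks after the truncation):

dls = ImageDataLoaders.from_folder(path/"train_images", valid_pct=0.2, seed=42,
    item_tfms=Resize(480, method='squish'))
dls.train_ds.items = dls.train_ds.items[:1000]   # shrink training set in place
dls.valid_ds.items = dls.valid_ds.items[:200]    # shrink validation set in place
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.lr_find()   # errors out: dls metadata still reflects the full dataset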

Rather than tweaking the Datasets and DataLoaders source code to handle that metadata, it should be easier to shrink the dataset before creating the dls.

If using ImageDataLoaders.from_folder, its signature tells us that to get a shrunken dataset we would have to build a parent folder containing all 10 disease folders, each reduced in size. I don't have an existing code template to tweak to make that work, but a rough sketch is below.
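Something like this hypothetical helper could build the reduced folder tree (just a sketch for completeness, not the approach I took; copy_subset and the train_images_small folder name are made up, and it assumes the .jpg files of the Paddy dataset):

import shutil
from pathlib import Path

def copy_subset(src, dst, ratio=0.1):
    "Copy the first `ratio` of images from each class folder of `src` into `dst`."
    src, dst = Path(src), Path(dst)
    for cls_dir in [d for d in src.iterdir() if d.is_dir()]:
        files = sorted(cls_dir.glob('*.jpg'))
        (dst/cls_dir.name).mkdir(parents=True, exist_ok=True)
        for f in files[:int(len(files)*ratio)]:
            shutil.copy(f, dst/cls_dir.name/f.name)

# copy_subset(path/"train_images", path/"train_images_small")
# dls = ImageDataLoaders.from_folder(path/"train_images_small", valid_pct=0.2, seed=42, ...)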

If using ImageDataLoaders.from_path_func, based on its signature, instead of a parent folder of folders (a shrunken version) I just need to provide an L list of image files that is a shrunken version of the full dataset. This approach seems more straightforward, needing only a little tweaking of get_image_files (see the code and output below).

I would love to see how you would approach this problem, if you'd like to share.

import os
from fastai.vision.all import *  # Path, L, setify, delegates, image_extensions

def _get_files_subset(p,   # path
                      fs,  # list of filenames
                      extensions=None,
                      ratio=0.1):
    "Get the full paths for the filenames in `fs` under `p`, keeping only the first `ratio` of them"
    p = Path(p)
    res = [p/f for f in fs if not f.startswith('.')
           and ((not extensions) or f'.{f.split(".")[-1].lower()}' in extensions)]
    return res[:int(len(res)*ratio)]  # keep only the first `ratio` of the files

@delegates(_get_files_subset)
def get_files_subset(path, recurse=True, folders=None, followlinks=True, **kwargs):
    "Get all the files in `path` with optional `extensions`, optionally with `recurse`, only in `folders`, if specified."
    path = Path(path)
    folders = L(folders)
    extensions = setify(kwargs.get('extensions'))
    extensions = {e.lower() for e in extensions}
    ratio = kwargs.get('ratio', 0.1)
    if recurse:
        res = []
        for i,(p,d,f) in enumerate(os.walk(path, followlinks=followlinks)): # returns (dirpath, dirnames, filenames)
            if len(folders) !=0 and i==0: d[:] = [o for o in d if o in folders]
            else:                         d[:] = [o for o in d if not o.startswith('.')]
            if len(folders) !=0 and i==0 and '.' not in folders: continue
            res += _get_files_subset(p, f, extensions, ratio=ratio)  # was _get_files, which doesn't take `ratio`
    else:
        f = [o.name for o in os.scandir(path) if o.is_file()]
        res = _get_files_subset(path, f, extensions, ratio=ratio)
    return L(res)
@delegates(get_files_subset)
def get_image_files_subset(path, ratio=0.1, **kwargs):
    "Get image files in `path` recursively, only in `folders`, if specified."
    return get_files_subset(path, ratio=ratio, extensions=image_extensions, **kwargs)
train_reduced = get_image_files_subset(path/"train_images", 0.1)
train_reduced
(#1036) [Path('paddy-disease-classification/train_images/dead_heart/110369.jpg'),Path('paddy-disease-classification/train_images/dead_heart/105002.jpg'),Path('paddy-disease-classification/train_images/dead_heart/106279.jpg'),Path('paddy-disease-classification/train_images/dead_heart/108254.jpg'),Path('paddy-disease-classification/train_images/dead_heart/104308.jpg'),Path('paddy-disease-classification/train_images/dead_heart/107629.jpg'),Path('paddy-disease-classification/train_images/dead_heart/110355.jpg'),Path('paddy-disease-classification/train_images/dead_heart/100146.jpg'),Path('paddy-disease-classification/train_images/dead_heart/103329.jpg'),Path('paddy-disease-classification/train_images/dead_heart/105980.jpg')...]
For comparison, train_files here is the full file list from get_image_files:

len(train_files)
10407
len(train_reduced)
1036
train_reduced[100]
Path('paddy-disease-classification/train_images/dead_heart/101271.jpg')
train_reduced[200]
Path('paddy-disease-classification/train_images/bacterial_leaf_blight/101649.jpg')
train_reduced[300]
Path('paddy-disease-classification/train_images/brown_spot/100780.jpg')
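Because the slice in _get_files_subset is applied once per folder as os.walk visits each disease directory, every class keeps roughly the first 10% of its files, so the class mix is preserved (the selection isn't random, though; shuffle res before slicing if you want that). A quick way to check the per-class counts:

from collections import Counter
Counter(p.parent.name for p in train_reduced)  # ~10% of each disease class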
dls = ImageDataLoaders.from_path_func(".", train_reduced, valid_pct=0.2, seed=42,
#     label_func = lambda x: str(x).split('/')[-2],
    label_func = parent_label,
    item_tfms=Resize(480, method='squish'),
    batch_tfms=aug_transforms(size=128, min_scale=0.75))
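(parent_label is fastai's built-in equivalent of the commented-out lambda: it labels each image with its parent folder's name, i.e. the disease class, without depending on '/' as the path separator.)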

dls.show_batch(max_n=6)
dls.train_ds
(#829) [(PILImage mode=RGB size=480x640, TensorCategory(7)),(PILImage mode=RGB size=480x640, TensorCategory(7)),(PILImage mode=RGB size=480x640, TensorCategory(7)),(PILImage mode=RGB size=480x640, TensorCategory(6)),(PILImage mode=RGB size=480x640, TensorCategory(5)),(PILImage mode=RGB size=480x640, TensorCategory(0)),(PILImage mode=RGB size=480x640, TensorCategory(5)),(PILImage mode=RGB size=480x640, TensorCategory(3)),(PILImage mode=RGB size=480x640, TensorCategory(3)),(PILImage mode=RGB size=480x640, TensorCategory(9))...]
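With the reduced dls, learn.lr_find and learn.fit should now run in reasonable time on CPU. A quick sanity check (a sketch; resnet18 chosen arbitrarily as a small model):

learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.lr_find()
learn.fine_tune(1)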