Is there a way to trim a dataset inside a data bunch before training?

I am practicing with the places2 dataset, which has 365 labels and 1.8M images. I am loading it with ImageDataBunch.from_folder since the file structure is /train/{label}/<images>.

Because places2 is so big, I want to trim the training dataset to a subset before sending it to the learner; this way I can try different subsets programmatically. The subsets could be created by reducing the number of labels and/or the number of images per label.

I have been looking at the code, but I haven’t been able to find an obvious way to do this inside fastai. Is the suggested approach to build a data frame out of the file structure and then use ImageDataBunch.from_df?

(Background: I am interested in getting a feel for the smallest number of images needed to get good results, and eventually in understanding how the quality of the classification varies with the number of classes and the number of samples.)
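For the from_df route asked about above, one possible sketch uses only the standard library to walk /train/{label}/, keep a subset of labels, and cap the images per label; the rows can then feed a DataFrame for ImageDataBunch.from_df. The function and parameter names (sample_folder, keep_labels, per_label) are invented here for illustration, and the DataFrame column names should be checked against the fastai docs:

```python
import random
from pathlib import Path

def sample_folder(train_dir, keep_labels, per_label, seed=0):
    """Collect (filename, label) pairs from train_dir/<label>/<image>,
    keeping only labels in keep_labels and at most per_label images each."""
    rng = random.Random(seed)
    rows = []
    for label_dir in sorted(Path(train_dir).iterdir()):
        if label_dir.name not in keep_labels:
            continue
        files = sorted(label_dir.iterdir())
        # Cap the draw at however many files the label actually has
        for f in rng.sample(files, min(per_label, len(files))):
            rows.append((str(f), label_dir.name))
    return rows

# The rows could then become a DataFrame for from_df, e.g.:
# df = pd.DataFrame(rows, columns=['name', 'label'])
# data = ImageDataBunch.from_df(path, df, ds_tfms=get_transforms(), size=224)
```

Because both labels and the per-label cap are parameters, the same helper covers both kinds of trimming the question mentions.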


I’d just use the from_name_re approach we saw at the end of last week’s lesson. You can pass in a random subset of file names, and extract the labels from the paths.


Maybe you know it already, but here is the concrete code for Jeremy’s approach (based on the lesson1-pets notebook):

fnames = random.sample(fnames, 2000)
pat = r'/([^/]+)_\d+.jpg$'
data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(), size=224)

My laptop (with a 1060 6GB GPU) needs about 4 times longer to train compared to Jeremy’s setup, so I also need to trim the data to understand the concepts first. :smiley:


Thank you, this would help me create subsets of labels, which would be a big improvement :slight_smile:

I’d still want to reduce the number of samples per label, but I think I have a direction (suggested by this very recent post): ImageDataBunch has a classes array, and train_ds is a sequence of (image, class_num) tuples, so in theory I can filter the contents of train_ds. Will report back if I get anywhere.
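The filtering idea described here can be sketched in plain Python: given (item, class_num) pairs like those in train_ds and the classes array mapping index to name, keep only chosen classes and cap each at a maximum count. The function name filter_pairs and its parameters are invented for illustration; hooking the result back into a real DataBunch is left aside:

```python
from collections import defaultdict

def filter_pairs(pairs, classes, keep, max_per_class):
    """pairs: (item, class_num) tuples; classes: list mapping index -> name.
    Keep only classes whose name is in `keep`, capped at max_per_class each."""
    keep_idx = {i for i, name in enumerate(classes) if name in keep}
    counts = defaultdict(int)
    out = []
    for item, cnum in pairs:
        if cnum in keep_idx and counts[cnum] < max_per_class:
            counts[cnum] += 1
            out.append((item, cnum))
    return out
```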


I ended up taking a slightly different approach and building an ImageDataBunch from scratch: I build the training and validation datasets from the files in the directory, but limit the number of classes and the number of images in each class before building the training set.

This is how it looks in code, in this case limiting to 7 classes and 600 images in each class for the training set:

my_classes = [...]  # the 7 class names to keep
data = SampledImageDataBunch(path/'train', path/'val', my_classes, num_samples=600, ds_tfms=get_transforms(), size=224, bs=128)

The code to achieve this is below. It’s probably not very pythonic, but it works:

def training_set(train_dir, classes_to_train, num_samples=50, shuffle=True):
    train_dirs = [dir for dir in train_dir.iterdir() if dir.name in classes_to_train]
    train_files = []
    train_labels = []
    for dir in train_dirs:
        # verify that there are enough samples per class
        fs = [f for f in dir.iterdir()]
        print(dir.name + " has " + str(len(fs)) + " samples")
        if shuffle:
            files = random.sample(fs, num_samples)
        else:
            files = fs[:num_samples]
        for f in files:
            train_files.append(f)
            train_labels.append(dir.name)

    ICD = ImageClassificationDataset(train_files, train_labels)
    return ICD

def validation_set(val_dir, classes_to_train):
    val_dirs = [dir for dir in val_dir.iterdir() if dir.name in classes_to_train]
    val_files = []
    val_labels = []
    for dir in val_dirs:
        files = [f for f in dir.iterdir()]
        for f in files:
            val_files.append(f)
            val_labels.append(dir.name)

    ICD = ImageClassificationDataset(val_files, val_labels)
    return ICD

# classes_to_train = [ 'fire_escape','lake-natural', 'cliff']
# ts = training_set(train_dir, classes_to_train, num_samples=200, shuffle=False)
# vs = validation_set(path/'val',classes_to_train)
# print(ts)
# print(vs)

def SampledImageDataBunch(train_dir, val_dir, classes_to_train, num_samples=200, shuffle=True, **kwargs):
    ts = training_set(train_dir, classes_to_train, num_samples, shuffle)
    vs = validation_set(val_dir, classes_to_train)
    data = ImageDataBunch.create(ts, vs, **kwargs)
    return data

I’ve used this approach:

  1. Build an array of "all available samples": all_files
    all_files = flat_list([d.glob('*') for d in path_train.glob('*')])

  2. Create the real working set by decimating the full list with a specific function
    files = random.sample(all_files,items_count)

  3. Create labels for files:
    labels = list(map(extractClass, map(Path.as_posix, files)))

  4. Create DataBunch using files and labels of previous points
    data = ImageDataBunch.from_lists(path=path, fnames=files, labels=labels, valid_pct=0.2, test='test', ds_tfms=tfms, size=256, bs=32)

Using this approach you can change the “decimation” function in step (2) to experiment with new working sets.
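As one example of a different decimation function for step (2): instead of a single uniform random.sample over all files, you could draw up to a fixed number per class so rare classes are not crowded out. This is a stdlib sketch; decimate_balanced is an invented name, and get_label stands in for a helper like the extractClass used above:

```python
import random
from collections import defaultdict

def decimate_balanced(all_files, n_per_class, get_label, seed=42):
    """Draw up to n_per_class files per label instead of sampling
    uniformly over the whole file list."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for f in all_files:
        by_label[get_label(f)].append(f)
    picked = []
    for label, fs in sorted(by_label.items()):
        # Cap at however many files the label actually has
        picked.extend(rng.sample(fs, min(n_per_class, len(fs))))
    return picked
```

Swapping this in at step (2) keeps the rest of the pipeline (labels, from_lists) unchanged.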


I like your approach. Seems much simpler. Thank you!


I am trying to use a sample of the data with the data_block API instead of the factory methods, and I am struggling since I can’t pass filenames to the API.
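One possibility worth checking: rather than passing filenames in, decide up front which files to keep and filter by membership inside the data_block chain. The predicate below is plain Python; the commented chain is a sketch against the fastai v1 data_block API, so verify that filter_by_func exists in your version (make_keep_predicate and keep_files are names invented here):

```python
from pathlib import Path

def make_keep_predicate(keep_files):
    """Return a predicate that is True only for files in keep_files.
    keep_files would come from your own sampling logic."""
    keep = {Path(f) for f in keep_files}
    return lambda fname: Path(fname) in keep

# Sketch of the data_block chain (fastai v1; unverified):
# src = (ImageList.from_folder(path/'train')
#        .filter_by_func(make_keep_predicate(keep_files))
#        .split_by_rand_pct(0.2)
#        .label_from_folder()
#        .transform(get_transforms(), size=224))
# data = src.databunch(bs=64)
```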

This approach works well. I’d just like to point out to readers (if it was not obvious from the post) that this approach works if the train folder has classes in the form ‘c1’, ‘c2’ … ‘ci’; it combines them and randomly forms an image dataset from all the classes.


Hello @tbatchelli. Yes, there is a way to trim a dataset inside a data bunch before training. I saw a similar question, with an answer, in the Applied AI Course chat group; check it out there.