Filter learn.data.train_dl at epoch begin

(Sam Shleifer) #1

I am trying to implement Curriculum Learning as a callback.

My current implementation builds a new ImageList -> DataBunch every time, using filter_by_func to keep only the examples at the current difficulty level. This works, but it seems wasteful and inelegant because I am running

    image_list = ImageList.from_folder(path)
    image_list = ClassUtils.filter_classes(image_list, classes, woof)

    learn.data.train_dl = (image_list
            .filter_by_func(filter_func)
            .use_partial_data(sample)
            .split_by_folder(valid='val')

            .label_from_folder().transform(([flip_lr(p=flip_lr_p)], []), size=size)
            .databunch(bs=bs, num_workers=workers, shuffle_train=shuffle_train)
            .presize(size, scale=(0.35,1))
            .normalize(imagenet_stats))

every epoch.

Unfortunately, when I instead try to filter_by_func the resulting DataBunch later in the process, I get an error saying xb should be a Tensor, not an Image.

Is there a better way to filter learn.data.train_dl at the beginning of every epoch? Can a DeviceDataLoader be filtered? (Potentially related: what is CallbackHandler.set_dl used for?)

Thanks!


#2

I think what you want is a sampler (from PyTorch, not fastai), which lets you loop through only certain elements of your dataset. You can pass one after creating your data with:

    data.train_dl = data.train_dl.new(shuffle=False, sampler=my_sampler)
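To make the sampler idea concrete, here is a minimal sketch of a curriculum sampler. PyTorch only requires a sampler to support `__iter__` (yielding dataset indices) and `__len__`, so a plain class works. The `difficulties` array and the `level` attribute are hypothetical: how you score examples and when you raise the level are up to your curriculum scheme, not anything fastai or PyTorch defines.

```python
import random

class CurriculumSampler:
    """Sketch of a per-epoch filtering sampler (not fastai API).

    `difficulties` is a hypothetical per-example difficulty score;
    only indices with difficulty <= `level` are yielded.
    """
    def __init__(self, difficulties, level=0, shuffle=True):
        self.difficulties = difficulties
        self.level = level          # raise this between epochs
        self.shuffle = shuffle

    def _allowed(self):
        # Indices of examples at or below the current difficulty level
        return [i for i, d in enumerate(self.difficulties) if d <= self.level]

    def __iter__(self):
        idxs = self._allowed()
        if self.shuffle:
            random.shuffle(idxs)
        return iter(idxs)

    def __len__(self):
        return len(self._allowed())
```

In a fastai Callback you could then bump `sampler.level` in `on_epoch_begin` and rebuild the loader with the `new` call above, so the dataset itself never has to be re-filtered.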