How to use DatasetFormatter on the entire dataset

Hi all,

I’m trying to find duplicate/similar/dissimilar images in my dataset. For this, I looked into the DatasetFormatter class introduced in the latest fastai. The problem is that it always returns the training set, even when passing DatasetType.Valid. The data object was instantiated like this: vision.ImageDataBunch.from_folder(path, valid_pct=0.2). I’d also like to remove duplicates from both the training AND the validation set; any ideas on how to do that?
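
For reference, a minimal sketch of my setup (the path, model, and training call are placeholders; create_cnn is the fastai v1 constructor, renamed cnn_learner in later releases):

from fastai.vision import *
from fastai.widgets import DatasetFormatter

path = Path('data/images')  # hypothetical ImageNet-style folder
data = ImageDataBunch.from_folder(path, valid_pct=0.2)
learn = create_cnn(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(1)

# always comes back with training-set images, even for DatasetType.Valid
ds, idxs = DatasetFormatter().from_similars(learn, ds_type=DatasetType.Valid)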


Please show us the call to DatasetFormatter()

I was using the default DatasetFormatter().from_similars(learn, ds_type=DatasetType.Valid) as shown in the lecture, but then I discovered that the ds_type argument is never used in image_cleaner.py; the dataloader is in fact hardcoded to dl = learn.data.fix_dl.
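
A quick check makes the hardcoding visible (this assumes a trained learn as in the setup above): the returned dataset always has the training set’s length, regardless of which ds_type you pass.

ds, idxs = DatasetFormatter().from_similars(learn, ds_type=DatasetType.Valid)
print(len(ds), len(learn.data.valid_ds), len(learn.data.train_ds))
# len(ds) matches the training set, not the validation set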

I used this subclass of DatasetFormatter to clean my training set. Paste the code into your Jupyter notebook and the training images, alongside the Delete buttons, will show up.

from fastai.vision import *
from fastai.widgets import DatasetFormatter, ImageDeleter

class DatasetFormatter_Training(DatasetFormatter):
    @classmethod
    def from_training(cls, learn, n_imgs=None, ds_type:DatasetType=DatasetType.Train, **kwargs):
        "Return the padded `ds_type` dataset and the indices of its first `n_imgs` images."
        dl = learn.dl(ds_type)
        if not n_imgs: n_imgs = len(dl.dataset)
        idxs = range(n_imgs)
        return cls.padded_ds(dl.dataset, **kwargs), idxs

ds, idxs = DatasetFormatter_Training().from_training(learn, ds_type=DatasetType.Train)
fd1 = ImageDeleter(ds, idxs, batch_size=10)

I used it to clean my training set here, but note that training sets are much bigger than validation sets and take much more time to clean. Maybe you could adapt the same idea to get rid of similar images; a rough sketch follows.
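
Something along these lines might work. This is an untested sketch: DatasetFormatterDsType and from_similars_ds are my own names, while get_actns, comb_similarity, sort_idxs, and padded_ds are the existing fastai v1 helpers. It simply swaps the hardcoded fix_dl for the dataloader matching ds_type.

from fastai.vision import *
from fastai.widgets import DatasetFormatter
from fastai.callbacks.hooks import hook_output

class DatasetFormatterDsType(DatasetFormatter):
    "Hypothetical variant of from_similars that honours `ds_type`."
    @classmethod
    def from_similars_ds(cls, learn, layer_ls=[0,7,2], ds_type=DatasetType.Valid, **kwargs):
        # hook the same layer the stock from_similars uses for activations
        hook = hook_output(learn.model[layer_ls[0]][layer_ls[1]][layer_ls[2]])
        # note: DatasetType.Train shuffles; use DatasetType.Fix for the
        # training set to keep activations aligned with dataset order
        dl = learn.dl(ds_type)  # instead of the hardcoded learn.data.fix_dl
        ds_actns = cls.get_actns(learn, hook=hook, dl=dl, **kwargs)
        similarities = cls.comb_similarity(ds_actns, ds_actns, **kwargs)
        idxs = cls.sort_idxs(similarities)
        return cls.padded_ds(dl.dataset, **kwargs), idxs

ds, idxs = DatasetFormatterDsType.from_similars_ds(learn, ds_type=DatasetType.Valid)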


Thanks, @wyquek. The issue for me is cleaning up images from both sets simultaneously, since you never want your training data to appear in the validation set. My current approach is the following (wrapped in a DatasetFormatter subclass so it can be called like the built-in methods):

from fastai.vision import *
from fastai.widgets import DatasetFormatter
from fastai.callbacks.hooks import hook_output

class DatasetFormatterCombined(DatasetFormatter):
    @classmethod
    def get_similars_idxs(cls, learn, layer_ls, **kwargs):
        "Gets the indices for the most similar images across the train AND valid sets"
        hook = hook_output(learn.model[layer_ls[0]][layer_ls[1]][layer_ls[2]])

        # fix_dl serves the training set without shuffling, so the activations
        # stay aligned with dataset order (train_dl would shuffle them)
        train_actns = cls.get_actns(learn, hook=hook, dl=learn.data.fix_dl, **kwargs)
        valid_actns = cls.get_actns(learn, hook=hook, dl=learn.data.valid_dl, **kwargs)
        ds_actns = torch.cat((train_actns, valid_actns), 0)

        similarities = cls.comb_similarity(ds_actns, ds_actns, **kwargs)
        idxs = cls.sort_idxs(similarities)
        # indices >= len(train_ds) point into the validation set, which the
        # padded training dataset returned here does not cover yet
        return cls.padded_ds(learn.data.fix_dl.dataset, **kwargs), idxs
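
To act on the combined indices, one option (a sketch; the split logic is mine, not fastai’s) is to map them back to per-set indices using the training-set length:

ds, idxs = DatasetFormatterCombined.get_similars_idxs(learn, layer_ls=[0, 7, 2])

n_train = len(learn.data.train_ds)
train_idxs = [int(i) for i in idxs if i < n_train]             # into train_ds
valid_idxs = [int(i) - n_train for i in idxs if i >= n_train]  # into valid_ds

A duplicate pair that straddles the split then shows up as one index in each list, which is exactly the train/valid leakage I want to catch.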

Nice, I’ll use it to clean my dataset too.