I’m trying to find duplicate/similar/dissimilar images in my dataset. For this I looked into the DatasetFormatter class introduced in the latest fastai. The problem is that it always returns the training set, even when I pass DatasetType.Valid. The data object was instantiated like this: vision.ImageDataBunch.from_folder(path, valid_pct=0.2). I’d also like to remove duplicates from the training AND the validation set; any ideas on how to do that?
Please show us the call to DatasetFormatter.
I was using the default DatasetFormatter().from_similars(learn, ds_type=DatasetType.Valid) as shown in the lecture, but then I discovered that the ds_type argument is not used anywhere in image_cleaner.py; in fact, the dataloader is hardcoded to dl = learn.data.fix_dl.
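To illustrate what that hardcoding means in practice, here is a toy sketch with stand-in objects (FakeData and both helper functions are made up for the example; this is not fastai code):

```python
# Toy illustration of the bug: image_cleaner.py ignores ds_type and always
# uses fix_dl. FakeData and both helpers are hypothetical stand-ins.
class FakeData:
    def __init__(self):
        self.fix_dl = "train dataloader (fixed transforms)"
        self.valid_dl = "validation dataloader"

def hardcoded_dl(data, ds_type):
    # what image_cleaner.py effectively does today
    return data.fix_dl

def ds_type_aware_dl(data, ds_type):
    # what one would expect: honour the requested split
    return data.valid_dl if ds_type == "valid" else data.fix_dl
```

With data = FakeData(), hardcoded_dl(data, "valid") still hands back the training dataloader, while ds_type_aware_dl(data, "valid") returns the validation one — which is why passing DatasetType.Valid currently has no effect.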
I used this subclass of DatasetFormatter to clean my training set. Paste the code into your Jupyter notebook and the training images will show up alongside the Delete buttons.
class DatasetFormatter_Training(DatasetFormatter):
    @classmethod
    def from_training(cls, learn, n_imgs=None, ds_type:DatasetType=DatasetType.Train, **kwargs):
        "Return a padded dataset and indices for the requested split (defaults to Train)."
        dl = learn.dl(ds_type)
        if not n_imgs: n_imgs = len(dl.dataset)
        idxs = range(n_imgs)
        return cls.padded_ds(dl.dataset, **kwargs), idxs
ds, idxs = DatasetFormatter_Training().from_training(learn, ds_type=DatasetType.Train)
fd1 = ImageDeleter(ds, idxs, batch_size=10)
I used it to clean the training set here, but note that training sets are way bigger than validation sets and take much more time to clean. Maybe you could do something similar to get rid of similar images.
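For context, the similarity search under the hood boils down to comparing activation vectors pairwise; here is a minimal pure-Python sketch of that idea (the helper names are made up — this is not the fastai implementation, which works on batched tensors):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_similar_pair(actns):
    """Return the (i, j) pair of activation vectors with highest similarity."""
    best, pair = -1.0, None
    for i in range(len(actns)):
        for j in range(i + 1, len(actns)):
            s = cosine_sim(actns[i], actns[j])
            if s > best:
                best, pair = s, (i, j)
    return pair
```

For example, most_similar_pair([[1, 0], [0, 1], [1, 0.01]]) flags the first and third vectors as near-duplicates, since they point in almost the same direction.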
Thanks, @wyquek. The issue for me is cleaning up images from both sets simultaneously, since you never want your training data to appear in the validation set. My approach is currently as follows.
class DatasetFormatter_Combined(DatasetFormatter):  # subclass name is arbitrary
    @classmethod
    def get_similars_idxs(cls, learn, layer_ls, **kwargs):
        "Gets the indices for the most similar images across the train AND valid datasets"
        hook = hook_output(learn.model[layer_ls[0]][layer_ls[1]][layer_ls[2]])
        train_actns = cls.get_actns(learn, hook=hook, dl=learn.data.train_dl, **kwargs)
        valid_actns = cls.get_actns(learn, hook=hook, dl=learn.data.valid_dl, **kwargs)
        ds_actns = torch.cat((train_actns, valid_actns), 0)
        similarities = cls.comb_similarity(ds_actns, ds_actns, **kwargs)
        idxs = cls.sort_idxs(similarities)
        # TODO: idxs run over the concatenated activations, but padded_ds takes a
        # single dataset; indices past len(train_actns) still need remapping
        return cls.padded_ds(learn.data.train_dl.dataset, **kwargs), idxs
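One wrinkle with concatenating the train and valid activations: the resulting indices run over both splits, so each one has to be mapped back to the right dataset before deleting a file. A tiny helper for that (hypothetical, assuming n_train is the length of the training set, e.g. len(learn.data.train_ds)):

```python
def split_idx(i, n_train):
    """Map an index into concatenated (train-first) activations back to
    its split name and local index within that split."""
    return ("train", i) if i < n_train else ("valid", i - n_train)
```

So with 5 training images, index 3 maps to the training set, while index 7 maps to position 2 of the validation set.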
Nice, I’ll use it to clean my dataset too.