Duplicate Widget


(Francisco Ingham) #1

I have added duplicate-finding functionality to the ImageCleaner widget. Users can scan the most similar pairs of images in their dataset and choose to delete them if necessary. The only differences in usage are calling the from_similars() method of DatasetFormatter and specifying duplicates=True when calling ImageCleaner.


(Constantin Baumgartner) #3

@lesscomfortable I’m trying to use .from_similars and I’m running into a weird problem:

I build the databunch strictly adhering to the docs:

db = (ImageItemList.from_folder(PATH)
      .no_split()
      .label_from_folder()
      .transform(no_tfms, size=122)
      .databunch(bs=16))

The databunch has the correct number of images:

db.train_ds.x
-> ImageItemList (273 items)

But when I use DatasetFormatter().from_similars(learn), I end up with a list of IDs that’s twice as long as the dataset and just matches each image with itself:

ds, idxs = DatasetFormatter().from_similars(learn)

len(idxs)
-> 546

idxs[:8]
-> [223, 223, 133, 133, 79, 79, 191, 191]
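
A quick way to confirm the self-matching is sketched below. It assumes that idxs is a flat list in which each consecutive pair of entries is one candidate duplicate pair (that layout is inferred from the output above, not taken from the docs), and self_matches is a hypothetical helper name:

```python
def self_matches(idxs):
    """Group a flat index list into consecutive pairs and
    return the pairs whose two members are the same index."""
    # Assumption: entries 0 and 1 form the first pair, 2 and 3 the second, etc.
    pairs = list(zip(idxs[0::2], idxs[1::2]))
    return [p for p in pairs if p[0] == p[1]]

# The first eight indices from the output above.
idxs_head = [223, 223, 133, 133, 79, 79, 191, 191]
print(self_matches(idxs_head))
# -> [(223, 223), (133, 133), (79, 79), (191, 191)]
```

Every pair in the sample is a self-match, which is consistent with 546 = 2 x 273 entries, one self-pair per image.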

Am I implementing this incorrectly or misunderstanding how it’s supposed to be used?

Thanks for the help.


(Francisco Ingham) #4

Hey! This is not expected; I’m going to check it myself in a few minutes. It seems to be comparing images with themselves, which is not the correct behavior. Are you using the latest version of the library?


(Constantin Baumgartner) #5

Yes, I upgraded yesterday morning so I’m on the latest version.


(Francisco Ingham) #6

I found the bug and will submit a PR now. In the meantime, please replace the last line in comb_similarity with these two:

t = torch.mm(t1, t2.t()) / (w1 * w2.t()).clamp(min=1e-8)
return torch.tril(t, diagonal=-1)
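
To illustrate why this helps, here is a dependency-free sketch in plain Python rather than the library code. The first line of the fix computes a cosine similarity matrix, whose diagonal holds each image compared with itself (similarity of about 1), so self-pairs would rank as the most similar. Keeping only the strict lower triangle, as torch.tril(t, diagonal=-1) does, zeroes the diagonal and leaves each unordered pair exactly once. The helper names and the toy feature vectors are made up for this example:

```python
import math

def cosine_similarity_matrix(rows):
    """Pairwise cosine similarity between the row vectors in `rows`."""
    norms = [math.sqrt(sum(x * x for x in r)) for r in rows]
    n = len(rows)
    sims = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            # Same guard against division by zero as clamp(min=1e-8).
            sims[i][j] = dot / max(norms[i] * norms[j], 1e-8)
    return sims

def strict_lower_triangle(m):
    """Zero the diagonal and everything above it,
    mimicking torch.tril(m, diagonal=-1)."""
    return [[v if j < i else 0.0 for j, v in enumerate(row)]
            for i, row in enumerate(m)]

def top_pair(m):
    """Return the (row, column) position of the largest entry."""
    n = len(m)
    return max(((i, j) for i in range(n) for j in range(n)),
               key=lambda ij: m[ij[0]][ij[1]])

# Toy "activations": items 0 and 1 are near-duplicates, item 2 is different.
features = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
]

sims = cosine_similarity_matrix(features)
masked = strict_lower_triangle(sims)

print(top_pair(sims))    # lands on the diagonal: an item matched with itself
print(top_pair(masked))  # (1, 0): the genuine near-duplicate pair
```

Without the mask, the best match is always an image paired with itself; with it, the top entry is the pair of distinct images that are actually similar.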


(Constantin Baumgartner) #7

Works great, thanks!


(Francisco Ingham) #8

Thank you for catching this!