Duplicate Widget

Hi karthika, hope all is well!
Make sure to visually check your data path and directories; I once had a similar problem when I had somehow duplicated some directories containing images, probably caused by having to run the notebook more than once.
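
If it helps, here is a minimal sketch of that kind of sanity check; the data/bears path and the .jpg extension are just placeholders for your own dataset:

from pathlib import Path

path = Path('data/bears')  # hypothetical dataset root, use your own

# Count images per class folder to spot accidentally duplicated directories
for sub in sorted(p for p in path.iterdir() if p.is_dir()):
    n_images = len(list(sub.glob('*.jpg')))
    print(f'{sub.name}: {n_images} images')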

mrfabulous1 :smiley::smiley:

I am using fastai version 1.0.54.
In ImageCleaner(ds, fns_idxs, 'path', duplicates=True), len(ds) gives me the correct number of images in my dataset, but len(fns_idxs) gives me double the number of images. I think fns_idxs is a list of indices, right? Why is it doubled? Can you please clarify or explain this to me?

Hey, sorry, I can’t find a topic which already answers my question, but it seems that you already solved this problem. I am trying to clean up a vision dataset, but ImageList does not exist anymore (as it is used in the documentation), so I don’t know what to use:

db = (ImageList.from_folder(path)
        .split_none()
        .label_from_folder()
        .transform(get_transforms(), size=224)
        .databunch())

What did you use instead? ImageItemList does not have .split_none.
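
A hedged sketch of what the older call chain probably looked like, assuming a fastai 1.0.x release from before ImageItemList was renamed to ImageList and no_split() was renamed to split_none(); upgrading fastai is usually the simpler fix:

from fastai.vision import *

path = Path('data/bears')  # placeholder dataset root

db = (ImageItemList.from_folder(path)
        .no_split()
        .label_from_folder()
        .transform(get_transforms(), size=224)
        .databunch())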


Is there a way to use widgets in Colab?

I made ClassConfusion; to do so I had to work around Google Colab’s widget library. The native fastai widgets will not work there. See my repo here: https://github.com/muellerzr/ClassConfusion
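
A rough usage sketch, assuming ClassConfusion takes a ClassificationInterpretation object plus a list of class names as its first arguments and that the import below matches the repo; check the repo’s README for the exact signature:

from fastai.vision import *
from fastai.widgets import ClassConfusion  # in Colab, import the class from the repo above instead

# 'learn' is assumed to be an already-trained Learner
interp = ClassificationInterpretation.from_learner(learn)

# Show where these (hypothetical) classes get confused with one another
ClassConfusion(interp, ['grizzly', 'black'])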


It was very discouraging for me as well to ‘delete’ samples without knowing the category. As I found out later, downloading images from Google/Baidu/etc. will give you a large amount of garbage, and that has to be cleaned up manually.

I went away looking for other deduplicators but eventually came back in an attempt to stick to the tools provided by the course.

I found something interesting. It is not the intended use, but in my opinion it works much better.

If you do:

ds, idxs = DatasetFormatter().from_similars(learn_cln)
ImageCleaner(ds, idxs, path, duplicates=True)

Then you get the candidate duplicates displayed, but without their categories.

But!! If you do:

ds, idxs = DatasetFormatter().from_similars(learn_cln)
ImageCleaner(ds, idxs, path, duplicates=False)

Then you get the same similar images, but with their categories visible.

This allows you to SEE the categories and clean up the dataset faster.

Because… rather than

  1. Do top_losses
  2. Re-categorize N out of M clones (multiple times)
  3. Reload the csv
  4. Then do from_similars
  5. Delete N out of M clones (multiple times)

I do:

  1. Do from_similars
  2. Delete N out of M clones and re-categorize the image that is left (multiple times)

Then you can focus on top_losses knowing that all the duplicates are out of the game.
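
In case it helps anyone follow along, here is a rough sketch of that shorter workflow in fastai v1; the path, the resnet34 architecture and the ‘stage-1’ weights name are placeholders in the spirit of the lesson 2 notebook, so adjust them to your own setup:

from fastai.vision import *
from fastai.widgets import DatasetFormatter, ImageCleaner

path = Path('data/bears')  # placeholder dataset root

# Databunch with no validation split, so every image is available for cleaning
db = (ImageList.from_folder(path)
        .split_none()
        .label_from_folder()
        .transform(get_transforms(), size=224)
        .databunch())

learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)
learn_cln.load('stage-1')  # previously saved weights

# Step 1: go through the near-duplicates first, with categories visible
ds, idxs = DatasetFormatter().from_similars(learn_cln)
ImageCleaner(ds, idxs, path, duplicates=False)

# Step 2: with the duplicates out of the game, review the highest-loss images
# (as in the lesson 2 notebook, rebuild the databunch from cleaned.csv between passes)
ds, idxs = DatasetFormatter().from_toplosses(learn_cln)
ImageCleaner(ds, idxs, path)

# The widget records deletions and re-labels in cleaned.csv instead of touching the files
df = pd.read_csv(path/'cleaned.csv')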

================

This has nothing to do with the tool but is a VERY NICE trick.
If you get an image that you are unsure of (you simply don’t know the category it belongs to).

Then right-click on it (only in Chrome, sorry) and click on “Search Google for image”.


In the heading of the results you may find your answer.

If not, find a large picture on the results to open, or look at the titles of webpages containing the photo.

In other words, use Google’s power to find out the answer.


I cannot edit this post, but I have been asked to confirm that it does not work in Google Colab. If you are going to use it, please do so in a traditional Jupyter Notebook!

Thank you. Can we put a notice that this does not work in Colab in the notebook, in the general questions before you start lesson 2? BTW, so far I think Colab works very well for most things, and I don’t need to worry much about dependencies, virtual environments, setup, cloning, etc. However, the other options are good experience for me to learn different tools.


Yes! I will do that this weekend.


Does the widget only display images that are too similar, or does it go through the full dataset showing decreasingly similar images?

If it only displays images that are too similar, is it best practice to resolve every duplicate that appears?

You should use it until you see images that are not too similar. When you see four or five in a row that are clearly not duplicates, you know you are good.


Does this duplicate-cleaner widget exist in fastai2? If not, it would be good to port it. Using other tools (fdupes and findimagedupes), I noticed that there are a large number of duplicates in the oxford-iiit-pet dataset. I think Francisco’s method would be better, though.
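
For the exact-duplicate part, here is a small sketch of roughly what tools like fdupes do under the hood (hashing file contents); note this only catches byte-identical files, not the near-duplicates the widget finds:

import hashlib
from collections import defaultdict
from pathlib import Path

path = Path('data/oxford-iiit-pet/images')  # placeholder location of the images

# Group files by a hash of their raw bytes; any group with more than one file is a set of exact duplicates
groups = defaultdict(list)
for f in path.rglob('*.jpg'):
    groups[hashlib.md5(f.read_bytes()).hexdigest()].append(f)

for files in groups.values():
    if len(files) > 1:
        print('Exact duplicates:', [f.name for f in files])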


Wow, this seems like a pretty useful tool. Does it work with the latest version of fastai?