If anyone is in need of a small tool to simplify downloading and cleaning multi-class image datasets from Google, Bing, and Baidu: I recently created a Python package called fastclass.
You simply define a CSV file with search terms for your classes and it’ll download, resize, and de-duplicate the images for you. Afterwards you can use a second script to inspect and clean them…
Let me know if you have problems or suggestions for improvement
Happy crawling…
Christian
@Jeremy Please advise if stuff like this should be posted somewhere else…
Sure thing. The example gets buried in the site-packages folder, I guess. But anyway, the format is super simple:
the first column is your search term; the second column is a list of expressions you don’t want to appear in the final class names (which are based on the search term)…
Let me give you a quick example:
searchterm,exclude
guitar gibson les paul,guitar
guitar "g&l" legacy, guitar
This will create two classes and exclude “guitar” in the final class names (they will be: gibson_les_paul, gandl_legacy). I added exclude terms since they help you find better results in the actual query on Google/Bing/etc.
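To illustrate the naming, here is a minimal sketch of how a class name could be derived from a search term and its exclude list. This is an illustration only, not fastclass’s actual code — the class_name helper is hypothetical:

```python
def class_name(search_term, excludes):
    """Derive a class name: strip quotes, turn '&' into 'and',
    drop excluded words, and join the rest with underscores."""
    term = search_term.replace('"', '').replace('&', 'and')
    words = [w for w in term.split() if w not in excludes]
    return '_'.join(words)

print(class_name('guitar gibson les paul', ['guitar']))  # gibson_les_paul
print(class_name('guitar "g&l" legacy', ['guitar']))     # gandl_legacy
```

With the two-row CSV above, this reproduces the class names mentioned (gibson_les_paul and gandl_legacy).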
Wow, looks promising. I just looked at it briefly and like the fact that there’s an option to clean the downloaded images quickly. I’ll try it in detail in the next few days and post feedback here. In the meanwhile, great work @cwerner.
First, thanks for sharing this tool. It worked as it should when I tried it out on Colab. I couldn’t find the sample CSV file either, so I created one locally and used
from google.colab import files

uploaded = files.upload()
for fn in uploaded.keys():
    print('blah...')
to feed the fcd function.
One question: how long did it take to return the images? On Colab, four search queries took quite some time to complete, so I’d like to know whether that was related to Colab or not.
The sample file is probably installed somewhere in site-packages.
You can try: pip show fastclass
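If pip show doesn’t reveal it, you can also locate an installed package’s directory from Python itself. A minimal sketch (the package_path helper is hypothetical, not part of fastclass; shown with the stdlib json package so it runs anywhere):

```python
import importlib.util
import os

def package_path(name):
    """Return the directory of an installed package, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)

# e.g. package_path("fastclass") once it is installed;
# here the stdlib 'json' package is used as a stand-in
print(package_path("json"))
```

The same call with "fastclass" should point you at the folder containing the sample file.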
I just bumped the version to 0.1.3. This version adds the flag -m INT, which lets you specify how many images the crawler should pull. Try 50 or so to see how fast it gets… 1000 is the max.
I haven’t tested this much yet. Let me know if it gives you an error. I’m back at my desk in 1h.
-m is a great addition to the function. Specifying 50 images sped it up significantly. Using -m 50, -c GOOGLE and four search queries, the process took 43 seconds. Using all three services took just over 2.5 minutes.
I used pip show fastclass to find the install location yesterday, to no avail. In the fastclass folder, I see deduplicate.py, fc_clean.py, fc_download.py, imageprocessing.py, misc.py, __init__.py and __pycache__/. In the …dist-info folder I see entry_points.txt, INSTALLER, LICENSE, METADATA, RECORD, WHEEL and top_level.txt.
This is awesome. I just started working on my multi-class project! Will definitely check it out.
Do you know if there’s a way to do the interpretations (e.g. top_losses / confusion matrix) for multi-class stuff with fastai? I haven’t been able to figure that out yet, but thought you might know! Thanks ~~
Ah, I mean multiple labels per image (satellite images). I haven’t been able to find a way to evaluate the trained model. The notebook suggests uploading to Kaggle for that specific example, but I’d like to use the other “interp” methods available in the single class classification. Do you have any suggestions?