If anyone is in need of a small tool to simplify downloading and cleaning multi-class image datasets from Google, Bing, and Baidu: I recently created a Python package called fastclass.
You simply define a CSV file with search terms for your classes and it’ll download, resize, and de-duplicate the images for you. Afterwards you can use a second script to inspect and clean them…
Let me know if you have problems or suggestions for improvement
Happy crawling…
Christian
@Jeremy Please advise if stuff like this should be posted somewhere else…
Sure thing. The example gets buried in the site-packages folder, I guess. But anyway, the format is super simple:
the first column is your search term; the second column is a list of expressions you don’t want to appear in the final class names (which are based on the search term)…
Let me give you a quick example:
searchterm,exclude
guitar gibson les paul,guitar
guitar "g&l" legacy, guitar
This will create two classes and exclude “guitar” in the final class names (they will be: gibson_les_paul, gandl_legacy). I added exclude terms since they help you find better results in the actual query on Google/Bing/etc.
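To illustrate the naming, here is a minimal sketch of how a class name could be derived from a search term and its exclude list. This is an illustration only, not fastclass’s actual code — the class_name helper is hypothetical:

```python
def class_name(search_term, excludes):
    """Derive a class name: strip quotes, turn '&' into 'and',
    drop excluded words, and join the rest with underscores."""
    term = search_term.replace('"', '').replace('&', 'and')
    words = [w for w in term.split() if w not in excludes]
    return '_'.join(words)

print(class_name('guitar gibson les paul', ['guitar']))  # gibson_les_paul
print(class_name('guitar "g&l" legacy', ['guitar']))     # gandl_legacy
```

With the two-row CSV above, this reproduces the class names mentioned (gibson_les_paul and gandl_legacy).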
Wow, looks promising. I just looked at it briefly and like the fact that there’s an option to clean the downloaded images quickly. I’ll try it in detail in the next few days and post feedback here. In the meanwhile, great work @cwerner.
First, thanks for sharing this tool. It worked as it should when I tried it out on Colab. I couldn’t find the sample CSV file either, so I created one locally and used
from google.colab import files

uploaded = files.upload()
for fn in uploaded.keys():
    print('blah...')
to feed the fcd function.
One question: how long did it take to return the images? On Colab, four search queries took quite some time to complete, so I’d like to know whether that was related to Colab or not.
The sample file is probably installed somewhere in site-packages.
You can try: pip show fastclass
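If pip show doesn’t reveal it, you can also locate an installed package’s directory from Python itself. A minimal sketch (the package_path helper is hypothetical, not part of fastclass; shown with the stdlib json package so it runs anywhere):

```python
import importlib.util
import os

def package_path(name):
    """Return the directory of an installed package, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)

# e.g. package_path("fastclass") once it is installed;
# here the stdlib 'json' package is used as a stand-in
print(package_path("json"))
```

The same call with "fastclass" should point you at the folder containing the sample file.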
I just bumped the version to 0.1.3. This version adds the flag -m INT, which lets you specify how many images the crawler should pull. Try 50 or so to see how fast it gets… 1000 is the max.
I haven’t tested this much yet. Let me know if it gives you an error. I’m back at my desk in 1h.
-m is a great addition to the function. Specifying 50 images sped it up significantly. Using -m 50, -c GOOGLE and four search queries, the process took 43 seconds. Using all three services took just over 2.5 minutes.
I used pip show fastclass to find the install location yesterday, to no avail. In the fastclass folder, I see deduplicate.py, fc_clean.py, fc_download.py, imageprocessing.py, misc.py, __init__.py and __pycache__/. In the …dist-info folder I see entry_points.txt, INSTALLER, LICENSE, METADATA, RECORD, WHEEL and top_level.txt.
This is awesome. I just started working on my multi-class project! Will definitely check it out.
Do you know if there’s a way to do the interpretations (e.g. top_losses / confusion matrix) for multi-class stuff with fastai? I haven’t been able to figure that out yet, but thought you might know! Thanks ~~
Ah, I mean multiple labels per image (satellite images). I haven’t been able to find a way to evaluate the trained model. The notebook suggests uploading to Kaggle for that specific example, but I’d like to use the other “interp” methods available in the single class classification. Do you have any suggestions?