Small tool to build image dataset: fastclass

Hi all.

If anyone is in need of a small tool to simplify downloading and cleaning multi-class image datasets from Google, Bing and Baidu: I recently created a Python package called fastclass.

You can get it from my GitHub:

I also wrote a brief blog post about it here: https://www.christianwerner.net/tech/Build-your-image-dataset-faster

You simply define a csv file with search terms for your classes and it’ll download, resize and de-duplicate them for you. Afterwards you can use a second script to inspect and clean them…

Let me know if you have problems or suggestions for improvement.

Happy crawling… :wink:

Christian

@Jeremy Please advise if stuff like this should be posted somewhere else…

Hi Christian. I installed the script, but it didn’t come with a csv example in the install location. Can you post the sample csv here?

Hi Mauro.

Sure thing. The example gets buried in the site-packages folder I guess. But anyways, the format is super simple:

The first column is your search term, the second column is a list of expressions you don’t want to appear in the final class names (which are based on the search term)…

Let me give you a quick example:

searchterm,exclude
guitar gibson les paul,guitar
guitar "g&l" legacy, guitar

This will create two classes and exclude “guitar” from the final class names (they will be: gibson_les_paul, gandl_legacy). I added exclude terms since they help you find better results in the actual query on Google, Bing, etc.
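To make the format concrete, here is a minimal sketch of how the two class names above could be derived from that csv. It is not fastclass’s actual code: the helper names `class_name` and `read_classes` are made up for illustration, and the rules (replace “&” with “and”, drop other punctuation, split the exclude column on whitespace) are assumptions chosen only because they reproduce the names given above.

```python
import csv
import io
import re

SAMPLE_CSV = """searchterm,exclude
guitar gibson les paul,guitar
guitar "g&l" legacy, guitar
"""


def class_name(search_term, exclude_terms):
    """Hypothetical helper: turn a search term into a folder-style class name."""
    # Assumption: "&" becomes "and"; all other punctuation is dropped.
    name = search_term.lower().replace("&", "and")
    name = re.sub(r"[^a-z0-9\s]", "", name)
    # Drop every word that is listed in the exclude column.
    words = [w for w in name.split() if w not in exclude_terms]
    return "_".join(words)


def read_classes(csv_text):
    """Hypothetical helper: return one class name per csv row."""
    # skipinitialspace tolerates the blank after the comma in the second row.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    classes = []
    for row in reader:
        # Assumption: exclude expressions are single words separated by spaces.
        exclude_terms = set(row["exclude"].lower().split())
        classes.append(class_name(row["searchterm"], exclude_terms))
    return classes


print(read_classes(SAMPLE_CSV))
```

The search term still contains “guitar” when it is sent to the crawler; only the resulting class name has it stripped.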

The full example file is here:

Wow, looks promising. I just looked at it briefly and like the fact that there is an option to clean the downloaded images quickly. I will try it in detail in the next few days and post feedback here. In the meantime, great work @cwerner.

Thanks!

Let me know (issue or otherwise) if you find any problems… I’m in the middle of deleting bad guitars atm :wink:

First, thanks for sharing this tool. It worked as it should when I tried it out on Colab. I couldn’t find the sample csv file either, so I created one locally and used

    from google.colab import files

    # Opens a file picker and returns a dict mapping file names to contents.
    uploaded = files.upload()
    for fn in uploaded.keys():
        print('blah...')

to feed the fcd function.

One question: how long did it take to return the images? On colab, four search queries took quite some time to complete, so I would like to know if it was related to colab or not.

~ML

Hi

The sample file is probably installed somewhere in site-packages

You can try:
pip show fastclass
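
pip show only prints the install location, so you still have to look through the folder yourself. As a rough sketch (the helper name `find_package_files` is made up for this post and is not part of fastclass), you could search the installed package folder for csv files from Python, assuming the sample file was shipped inside the package directory at all:

```python
import importlib.util
from pathlib import Path


def find_package_files(package_name, pattern="*.csv"):
    """Hypothetical helper: list files matching `pattern` inside an installed package."""
    spec = importlib.util.find_spec(package_name)
    # Not installed, or a single-module package without a directory of its own.
    if spec is None or not spec.submodule_search_locations:
        return []
    found = []
    for location in spec.submodule_search_locations:
        # rglob also searches sub-folders of the package directory.
        found.extend(sorted(Path(location).rglob(pattern)))
    return found
```

Calling `find_package_files("fastclass")` returns an empty list if the package is not installed or if no csv file was installed with it.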

I just bumped the version to 0.1.3. This version now has the flag -m INT where you can specify how many images the crawler should pull. Try with 50 or so to see how fast it will get… 1000 is max.

I did not test this much. Let me know if it gives you an error. I’m back at my desk in 1h

-m is a great addition to the function. Specifying 50 images sped it up significantly. Using -m 50, -c GOOGLE and four search queries, the process took 43 seconds. Using all three services took just over 2.5 minutes.

I used pip show fastclass to find the install location y’day, to no avail. In the fastclass folder, I see deduplicate.py, fc_clean.py, fc_download.py, imageprocessing.py, misc.py, __init__.py and __pycache__/. In the …dist-info folder I see entry_points.txt, INSTALLER, LICENSE, METADATA, RECORD, WHEEL and top_level.txt.

Yeah, I also noticed that icrawler can get really slow with 1000 images.

Hm, I will investigate if there is a better way to use setup.py. Maybe one can copy these files into the user directory under a .fastclass folder…

Hi @cwerner

I could not get this to work yet

I did the installation as recommended

$ pip install git+https://github.com/cwerner/fastclass.git#egg=fastclass
$ pip show fastclass

Made sure that it’s installed in
Location: /opt/anaconda3/lib/python3.6/site-packages

Then in Jupyter

  from fastclass import *

File "<ipython-input-5-a5b8965ec448>", line 1
    fcd -c GOOGLE -c BING -s 224 /home/jupyter/course-v3/nbs/dl1/data/guitars.csv
                ^
SyntaxError: invalid syntax

Maybe I am missing something very basic

Thanks,
A

This is awesome. I just started working on my multi-class project! Will definitely check it out.

Do you know if there’s a way to do the interpretations (e.g. top_losses / confusion matrix) for multi-class stuff with fastai? I couldn’t figure that out yet, but thought you might know! Thanks ~~

Hi

The two tools fcd and fcc are meant to be used from the command line.

fcd -c GOOGLE -c BING -s 224 /home/jupyter/course-v3/nbs/dl1/data/guitars.csv

Also note that the example csv file will end up in the site-packages directory of your installation.
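
If you want to stay inside a notebook, you can prefix the command with `!` in a cell, or launch it from Python. Below is a small sketch of the second option. The function names are made up for this post, the only flags used are the ones mentioned in this thread (-c, -s, -m), and it assumes fcd is on your PATH and is version 0.1.3 or later (needed for -m).

```python
import shlex
import subprocess


def build_fcd_command(csv_path, crawlers=("GOOGLE",), size=224, max_images=50):
    """Hypothetical helper: build the fcd argument list from the flags in this thread."""
    cmd = ["fcd"]
    for crawler in crawlers:
        cmd += ["-c", crawler]  # -c can be repeated, one per search engine
    cmd += ["-s", str(size), "-m", str(max_images), csv_path]
    return cmd


def run_fcd(csv_path, **options):
    """Hypothetical helper: run fcd as a subprocess and fail loudly on errors."""
    cmd = build_fcd_command(csv_path, **options)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)
```

For example, `run_fcd("guitars.csv", crawlers=("GOOGLE", "BING"))` would run `fcd -c GOOGLE -c BING -s 224 -m 50 guitars.csv`.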

Let me know if you have further questions!

Hi,

And thanks. Do you mean multiple labels per image (satellite notebook lesson2)? Or a multi-class classification as done in lesson 1?

Ah, I mean multiple labels per image (satellite images). I haven’t been able to find a way to evaluate the trained model. The notebook suggests uploading to Kaggle for that specific example, but I’d like to use the other “interp” methods available in the single class classification. Do you have any suggestions?

Thanks for this, very useful!

Thanks for using it! Feel free to improve on it on GitHub with a PR…

Awesome lil tool Christian ! Thanks

Pleasure :wink::+1:

Thanks for sharing.

Very useful tool, works great, thanks so much!
