Tips for building large image datasets

This fork of google_images_download works. It has not been merged yet, but you can use it in place of the version from pip install google-images-download:

However:

  • I cannot download more than 100 images per search
  • I can’t seem to use the -wr parameter, which forces me to slightly vary the keyword between searches. That’s not great for building a consistent image dataset; I chose to search for different colors of a similar object to build it anyway

Thanks… this was really useful

Can you share your notebook for reference?

Can you share your notebook so I can understand what worked?

I have shared the file here: https://github.com/debunker/HousePlantClassifier/blob/master/HPC.ipynb

Good, thanks!

Thank you for posting this! I’m getting a syntax error message when trying to run the downloading step. Any advice?

This is a great way to host and serve data. It makes it very easy in the future to edit notebooks to reference separate groups of image data.

The first method doesn’t work for me. It eventually finishes with “Unfortunately all 50 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!”. Before that, you need to install Selenium and chromedriver (I had some version-mismatch errors to solve, etc.)

Let’s start with the positive. The following worked for me.
googliser is a shell script that worked for me (the only mechanism that worked for me in Colab).
Here are the steps (they can be found in the repo as well):

  1. !apt install imagemagick
  2. !bash <(wget -qO- git.io/get-googliser)
  3. !googliser --phrase "apple" --title 'Apples!' --color 'full' --number 50 --upper-size 100000 -o './data' -G
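Not from the thread, but a common follow-up after any bulk download: some files come back truncated or aren’t images at all, and they will break training later. A minimal sketch for pruning them with Pillow (the helper name `prune_corrupt_images` and the folder layout are my assumptions, not part of googliser):

```python
from pathlib import Path

from PIL import Image


def prune_corrupt_images(folder):
    """Delete files under `folder` that Pillow cannot open as images.

    Returns the list of removed paths so you can log what was dropped.
    """
    removed = []
    for p in Path(folder).rglob("*"):
        if not p.is_file():
            continue
        try:
            with Image.open(p) as img:
                img.verify()  # raises if the file is truncated or not an image
        except Exception:
            p.unlink()
            removed.append(p)
    return removed
```

Run it on the output directory (e.g. `prune_corrupt_images("./data")`) before building your dataset.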

Here’s what didn’t work for me:

  1. google-images-download. I should have searched the forum earlier.
  2. ai_utilities almost worked, except that image_download() hits some kind of issue with the path.

This Flickr scraper has worked for me: https://github.com/ultralytics/flickr_scraper. The instructions in the README are clear. Thanks to ultralytics for making it.

Now to train my croc or birkenstock classifier!


How do I access this data once it’s downloaded in Colab? I’m a bit new here; I started the course two days ago. Sorry for the silly question.
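In case it helps: downloads in Colab land on the VM’s local filesystem, under whatever output path the scraper used (e.g. `./data` in the googliser command above), so plain Python can find them. A minimal sketch, assuming that folder layout; the helper name `list_images` is mine:

```python
from pathlib import Path


def list_images(root, exts=(".jpg", ".jpeg", ".png", ".gif")):
    """Return a sorted list of image files under `root`, searched recursively."""
    root = Path(root)
    return sorted(p for p in root.rglob("*") if p.suffix.lower() in exts)


# Example: peek at the first few downloaded files
# for p in list_images("./data")[:10]:
#     print(p)
```

Note the `suffix.lower()`: scrapers often save files with uppercase extensions like `.JPG`, and this keeps them from being missed.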

Got it! :)

I’m using the Paperspace virtual machine. Will it work there?

Thanks! So useful!
The only problem is that it says all the images are not downloadable. :(
Has anyone faced that? What’s the solution?
Thanks


I’m facing the same issue. I looked for troubleshooting documentation in the original repo, but there is nothing about it. Maybe it’s related to the chromedriver installation; I’m not sure I did it right on my VM.


Hi Lindy, thanks for sharing. For your information, I hit an issue using google-images-download, with the output “Unfortunately all 100 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!” I searched for a solution; most likely Google has changed something that broke the crawler.


[FIX UPDATE] So, some of the issues below still stand, but the official guide DOES still work. If you are having issues, make sure “internet” is enabled in your Kaggle notebook (facepalm). The interface is a bit different from the other steps I found:

  1. Click the ‘Kaggle’ icon in top right (looks like >|)
  2. Click Preferences
  3. Toggle “Internet” on

@joedockrill has also provided a helpful tool below that he wrote and maintains.

----------------------original comment below---------------------------------

I am having the same issue: Unfortunately all 500 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!

I tried using this guide as well and apparently had the same problem. Every download attempt got an error like this one:

Error https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQhPp0TLFJJJVgkpP-LThb46ySlqEL9kvTtHg&usqp=CAU HTTPSConnectionPool(host='encrypted-tbn0.gstatic.com', port=443): Max retries exceeded with url: /images?q=tbn%3AANd9GcQhPp0TLFJJJVgkpP-LThb46ySlqEL9kvTtHg&usqp=CAU (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f259ea9bc90>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
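That `[Errno -3] Temporary failure in name resolution` means the machine couldn’t resolve the hostname at all, i.e. there is no working network/DNS from the notebook (which is exactly what a Kaggle kernel looks like with Internet disabled). A quick standard-library check to run before blaming the scraper; the function name is my own:

```python
import socket


def can_resolve(host):
    """Return True if DNS resolution works for `host` from this machine."""
    try:
        socket.gethostbyname(host)
        return True
    except socket.gaierror:
        return False


# If this prints False, fix the notebook's internet access first:
# print(can_resolve("encrypted-tbn0.gstatic.com"))
```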

For those of you still looking for solutions, these other resources appear to be useful. The Chrome extension is very user friendly and good for a first project.

How to scrape images

Google changed their page structure and everything written a while back to work on it is broken.

You can search around for something newer (or something which has been subsequently fixed), or you can use my scraper notebook and search on duckduckgo instead. It’s less painful.
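Whichever search engine you scrape, the end product is usually a list of image URLs, and fetching those needs nothing beyond the standard library. A minimal sketch (this is my own hypothetical helper, not the notebook mentioned above; it names everything `.jpg` for simplicity):

```python
import urllib.request
from pathlib import Path


def download_urls(urls, dest):
    """Fetch each URL into `dest`, skipping any that fail; return saved paths."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls):
        out = dest / f"img_{i:05d}.jpg"
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                out.write_bytes(resp.read())
            saved.append(out)
        except Exception:
            pass  # dead link, timeout, bad URL -- just move on
    return saved
```

Skipping failures silently is deliberate here: with scraped URL lists, a fraction of links are always dead, and one bad URL shouldn’t abort the whole download.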


Currently, this handy Firefox extension works on Google and DuckDuckGo.


It also shows the number of image links you download and saves them to a CSV file.