Tips for building large image datasets

This fork of google_images_download works. It has not been merged yet, but you can use it in place of the version from pip install google-images-download:

However:

  • I cannot download more than 100 images per search
  • I can’t seem to use the -wr parameter, which forces me to slightly vary the keyword between searches. That’s not great for building a consistent image dataset; I chose to search for different colors of a similar object to build it anyway

Thanks… this was really useful

Can you share your notebook for reference?

Can you share your notebook so I can understand what worked?

I have shared the file here: https://github.com/debunker/HousePlantClassifier/blob/master/HPC.ipynb

Good, thanks!

Thank you for posting this! I’m getting a syntax error message when trying to run the downloading step. Any advice?

This is a great way to host and serve data. It makes it very easy in the future to edit notebooks to reference separate groups of image data.

The first method doesn’t work for me. It eventually finishes with “Unfortunately all 50 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!”. Before that, you need to install Selenium and chromedriver (I had some version-mismatch errors to solve, etc.)

Let’s start with the positive. The following worked for me.
googliser is a shell script that worked for me (the only mechanism that worked for me in Colab).
Here are the steps (they can be found in the repo as well):

  1. !apt install imagemagick
  2. !bash <(wget -qO- git.io/get-googliser)
  3. !googliser --phrase "apple" --title 'Apples!' --color 'full' --number 50 --upper-size 100000 -o './data' -G
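Not from the thread, but a common follow-up after any bulk download: some files come back truncated or aren’t images at all, and they will break training later. A minimal sketch for pruning them with Pillow (the helper name `prune_corrupt_images` and the folder layout are my assumptions, not part of googliser):

```python
from pathlib import Path

from PIL import Image


def prune_corrupt_images(folder):
    """Delete files under `folder` that Pillow cannot open as images.

    Returns the list of removed paths so you can log what was dropped.
    """
    removed = []
    for p in Path(folder).rglob("*"):
        if not p.is_file():
            continue
        try:
            with Image.open(p) as img:
                img.verify()  # raises if the file is truncated or not an image
        except Exception:
            p.unlink()
            removed.append(p)
    return removed
```

Run it on the output directory (e.g. `prune_corrupt_images("./data")`) before building your dataset.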

Here’s what didn’t work for me:

  1. google-images-download. I should have searched the forum earlier.
  2. ai_utilities almost worked, except that image_download() hits some kind of issue with the path.

This Flickr scraper has worked for me: https://github.com/ultralytics/flickr_scraper. The instructions in the README are clear. Thanks to ultralytics for making it.

Now to train my croc or birkenstock classifier!


How do I access this data once it’s downloaded in Colab? I’m a bit new here; I started the course two days ago. Sorry for the silly question.
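In case it helps: downloads in Colab land on the VM’s local filesystem, under whatever output path the scraper used (e.g. `./data` in the googliser command above), so plain Python can find them. A minimal sketch, assuming that folder layout; the helper name `list_images` is mine:

```python
from pathlib import Path


def list_images(root, exts=(".jpg", ".jpeg", ".png", ".gif")):
    """Return a sorted list of image files under `root`, searched recursively."""
    root = Path(root)
    return sorted(p for p in root.rglob("*") if p.suffix.lower() in exts)


# Example: peek at the first few downloaded files
# for p in list_images("./data")[:10]:
#     print(p)
```

Note the `suffix.lower()`: scrapers often save files with uppercase extensions like `.JPG`, and this keeps them from being missed.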

Got it! :)

I’m using the Paperspace virtual machine. Will it work there?

Thanks! So useful!
The only problem is that it says all the images are not downloadable. :(
Has anyone faced that? What’s the solution?
Thanks


I’m facing the same issue. I looked for troubleshooting documentation in the original repo, but there is nothing about it. Maybe it’s related to the chromedriver installation; I’m not sure I did it right on my VM.


Hi Lindy, thanks for sharing. For your information, I hit an issue using google-images-download, with the output “Unfortunately all 100 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!” I searched for a solution; most likely Google has changed something that broke the crawler.


[FIX UPDATE] So, some of the issues below still stand, but the official guide DOES still work. If you are having issues, make sure “internet” is enabled in your Kaggle notebook (facepalm). The interface is a bit different from the other steps I found:

  1. Click the ‘Kaggle’ icon in top right (looks like >|)
  2. Click Preferences
  3. Toggle “Internet” on

@joedockrill has also provided a helpful tool below that he wrote and maintains.

----------------------original comment below---------------------------------

I am having the same issue: Unfortunately all 500 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!

I tried using this guide as well and apparently had the same problem. Every download attempt got an error like this one:

Error https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQhPp0TLFJJJVgkpP-LThb46ySlqEL9kvTtHg&usqp=CAU HTTPSConnectionPool(host='encrypted-tbn0.gstatic.com', port=443): Max retries exceeded with url: /images?q=tbn%3AANd9GcQhPp0TLFJJJVgkpP-LThb46ySlqEL9kvTtHg&usqp=CAU (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f259ea9bc90>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
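That `[Errno -3] Temporary failure in name resolution` means the machine couldn’t resolve the hostname at all, i.e. there is no working network/DNS from the notebook (which is exactly what a Kaggle kernel looks like with Internet disabled). A quick standard-library check to run before blaming the scraper; the function name is my own:

```python
import socket


def can_resolve(host):
    """Return True if DNS resolution works for `host` from this machine."""
    try:
        socket.gethostbyname(host)
        return True
    except socket.gaierror:
        return False


# If this prints False, fix the notebook's internet access first:
# print(can_resolve("encrypted-tbn0.gstatic.com"))
```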

For those of you still looking for solutions, these other resources appear to be useful. The Chrome extension is very user friendly and good for a first project.

How to scrape images

Google changed their page structure and everything written a while back to work on it is broken.

You can search around for something newer (or something which has been subsequently fixed), or you can use my scraper notebook and search on duckduckgo instead. It’s less painful.
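Whichever search engine you scrape, the end product is usually a list of image URLs, and fetching those needs nothing beyond the standard library. A minimal sketch (this is my own hypothetical helper, not the notebook mentioned above; it names everything `.jpg` for simplicity):

```python
import urllib.request
from pathlib import Path


def download_urls(urls, dest):
    """Fetch each URL into `dest`, skipping any that fail; return saved paths."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls):
        out = dest / f"img_{i:05d}.jpg"
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                out.write_bytes(resp.read())
            saved.append(out)
        except Exception:
            pass  # dead link, timeout, bad URL -- just move on
    return saved
```

Skipping failures silently is deliberate here: with scraped URL lists, a fraction of links are always dead, and one bad URL shouldn’t abort the whole download.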


Currently, this handy Firefox extension works on Google and DuckDuckGo.


It also shows the number of image links you download and saves them to a CSV file.