How to scrape the web for images?

wgpubs · November 6, 2017, 8:55pm

Anyone have any recommended approaches for scraping images for ML tasks?

For example, imagine we didn’t have our lesson 1 dogs/cats dataset. What programmatic approach could we follow to use Google (as an example) to find and download cat and dog pics for us?

jeremy · November 6, 2017, 8:59pm

Might be worth checking this out and adding your thoughts/questions there: Challenges while creating your own dataset

wgpubs · November 7, 2017, 5:59pm

Adding this here and will link to it from the post you mentioned.

Here is a simple project you can use to scrape images from a google search. It uses selenium and should only be used for educational purposes. Comments, recommendations, and pull requests are welcomed!

naveenmanwani · November 10, 2017, 5:52pm

@wgpubs, even if we download the dataset with the resource you mentioned above, we then have to label about 70% (just suppose) of the data, e.g. banana1.jpg, banana2.jpg, then reserve 10% of the data for the validation set as unlabeled and 20% of the data for the test set as unlabeled. Please correct me if I’m wrong about the procedure.

wgpubs · November 10, 2017, 6:37pm

You are correct.

You’ll have to write code to create the same directory structure used in the notebooks (e.g., /train, /valid, /test, /tmp, /models). From there you’ll have to move the images into sub-folders under /train, and then move a portion (20% or whatever) into /valid.
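A minimal sketch of that setup, assuming one sub-folder per class under /train and /valid. The function names (`make_layout`, `move_to_valid`) and the `fraction` and `seed` parameters are made up for illustration and do not come from the notebooks:

```python
import random
import shutil
from pathlib import Path


def make_layout(root, classes):
    """Create train/valid sub-folders per class, plus test, tmp, and models."""
    root = Path(root)
    for split in ("train", "valid"):
        for cls in classes:
            (root / split / cls).mkdir(parents=True, exist_ok=True)
    for name in ("test", "tmp", "models"):
        (root / name).mkdir(parents=True, exist_ok=True)


def move_to_valid(root, classes, fraction=0.2, seed=0):
    """Move a random `fraction` of each class's files from train to valid."""
    root = Path(root)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    moved = {}
    for cls in classes:
        train_dir = root / "train" / cls
        files = sorted(p for p in train_dir.iterdir() if p.is_file())
        picked = rng.sample(files, int(len(files) * fraction))
        for p in picked:
            shutil.move(str(p), str(root / "valid" / cls / p.name))
        moved[cls] = len(picked)
    return moved
```

You would call `make_layout("data/pets", ["cats", "dogs"])`, copy the scraped images into the matching /train sub-folders, and then call `move_to_valid("data/pets", ["cats", "dogs"])`. The path and class names here are placeholders.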

I also create a /sample directory and put a subset of everything in there for development. It makes it faster and allows you to debug things before using the full dataset.
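One way to build such a /sample directory is sketched below. It assumes the /train and /valid layout with one sub-folder per class, and it copies files rather than moving them so the full dataset stays intact. The name `make_sample` and the `per_class` parameter are hypothetical:

```python
import random
import shutil
from pathlib import Path


def make_sample(root, per_class=20, seed=0):
    """Copy up to `per_class` files per class from train and valid into sample/."""
    root = Path(root)
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    copied = 0
    for split in ("train", "valid"):
        split_dir = root / split
        if not split_dir.is_dir():
            continue
        for cls_dir in sorted(d for d in split_dir.iterdir() if d.is_dir()):
            files = sorted(p for p in cls_dir.iterdir() if p.is_file())
            dest = root / "sample" / split / cls_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            # Take fewer files when the class has less than per_class available.
            for p in rng.sample(files, min(per_class, len(files))):
                shutil.copy2(p, dest / p.name)
                copied += 1
    return copied
```

Pointing the notebook at the sample folder instead of the full dataset root should then be enough for quick debugging runs.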

You may want to look at the original part 1 notebooks as there is much more info there on how you can do the above.

naveenmanwani · November 10, 2017, 6:48pm

@wgpubs where can I find the original part 1 notebooks? And just a clarification: will the validation set consist of labeled or unlabeled data?

wgpubs · November 10, 2017, 9:44pm

It’s up on the fast.ai GitHub repo: https://github.com/fastai/courses

hiromi · November 15, 2017, 12:23am

I sometimes use this Chrome extension for downloading images.

helena · November 15, 2017, 12:37am

Incidentally, I was reminded of this thread by this one on HN today. Apparently it’s quite a hot topic: Ask HN: What are best tools for web scraping?

radi · January 24, 2018, 7:39pm

I found this website for images: http://www.image-net.org/. Hope that helps other people.

GertjanBrouwer · January 24, 2018, 8:09pm

Nobody has mentioned Scrapy yet. Scrapy is a Python web scraper/crawler. It has very extensive documentation and is used by multiple prominent companies. I am currently using it myself and I love it.

suvash · January 24, 2018, 8:45pm

Depends on how big your needs are. Here’s a good guide to using Scrapy that you can really scale up: https://learn.scrapinghub.com/

ianianian · February 20, 2018, 9:19pm

I believe that differing image sizes matter, but I don’t recall whether the fastai library corrects for these variations.
If not, is there any resource or framework I could reference to prep image sizes before feeding them into my neural network?

Apologies if these are basic questions; I’m pretty new to (but excited about!) everything here!

Ian

renato · February 20, 2018, 9:47pm

This one can also be helpful: https://www.pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/

LaPatel · April 1, 2018, 3:33am

A semi-manual method to download Google images:

Go to Google.
Search for your term, say “sand”.
Filter the search for images.
Right click in the top blank part.
Click Save As.
Save it as “sand - Google Search.html” on the desktop.
The desktop will have this file and “sand - Google Search_files” folder.
That folder will have many “images(*)” files without extension.
To add jpg extension to these files:
- Open Command Prompt.
- Cd to desktop folder and then to “sand - Google Search_files” folder.
- Write: “ren images* images*.jpg”.
- Run it.
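For anyone not on Windows, the rename step above can be sketched in Python instead of `ren`. This assumes, as in the steps above, that the saved files start with “images” and have no extension. The function name `add_jpg_extension` is made up for illustration, and the files are not checked to confirm they really are JPEGs:

```python
from pathlib import Path


def add_jpg_extension(folder, prefix="images"):
    """Append .jpg to extension-less files in `folder` whose names start with `prefix`."""
    folder = Path(folder)
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for p in sorted(folder.iterdir()):
        if p.is_file() and p.name.startswith(prefix) and p.suffix == "":
            target = p.with_name(p.name + ".jpg")
            if not target.exists():  # never overwrite an existing file
                p.rename(target)
                renamed.append(target.name)
    return renamed
```

Calling `add_jpg_extension("sand - Google Search_files")` from the desktop folder would rename the files in place and return the new names.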

willsa14 · April 8, 2018, 4:13am

This seems to be a very nice script for scraping images from the web:

I found this thanks to a tweet by Sebastian Raschka.

yuan0420 · July 27, 2018, 8:30am

I found ScrapeStorm useful. I think it is very simple and convenient for scraping images from web pages. I recommend it.

maggieliuzzi · November 4, 2018, 11:37pm

This is awesome!

kernel_panic · December 8, 2018, 9:32pm

What are the best options for scraping texts rather than images from Google please? I’d like to do something like what Jeremy showed in the earlier part of lesson 4 – not texts from arXiv, but from a Google keyword search. Any pointers will be greatly appreciated!

kernel_panic · December 9, 2018, 8:39am

In case anyone has similar interests: I am resorting to good old lynx and wget, and finding them better than the browser plugins and extensions I explored.