How to scrape the web for images?

Anyone have any recommended approaches for scraping images for ML tasks?

For example, imagine we didn’t have our lesson 1 dogs/cats dataset. What would be a programmatic approach we could follow to use Google (as an example) to find and download cat and dog pics for us?

8 Likes

Might be worth checking this out and adding your thoughts/questions there: Challenges while creating your own dataset

1 Like

Adding this here and will link to it from the post you mentioned.

Here is a simple project you can use to scrape images from a google search. It uses selenium and should only be used for educational purposes. Comments, recommendations, and pull requests are welcomed!
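Once a scraper (selenium-based or otherwise) has collected a list of image URLs, the download step itself is simple. A minimal sketch using only the standard library — the URL list, output directory, and `img` prefix here are just illustrative, not from the linked project:

```python
import os
import urllib.request

def download_images(urls, out_dir, prefix="img"):
    """Download each URL in `urls` into `out_dir` as prefix0.jpg, prefix1.jpg, ..."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for i, url in enumerate(urls):
        path = os.path.join(out_dir, f"{prefix}{i}.jpg")
        try:
            with urllib.request.urlopen(url) as resp, open(path, "wb") as f:
                f.write(resp.read())
            saved.append(path)
        except OSError:
            # Skip dead links rather than aborting the whole run
            continue
    return saved
```

Expect a fair number of dead links and non-image responses from search results, so skipping failures (as above) tends to work better than stopping on the first error.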

7 Likes

@wgpubs, even if we download the dataset with the resource you mentioned above, we would then have to label about 70% of the data (for example: banana1.jpg, banana2.jpg), reserve 10% of the data as an unlabeled validation set, and 20% as an unlabeled test set. Please correct me if I’m wrong about the procedure.

You are correct.

You’ll have to write code to create the same directory structure used in the notebooks (e.g., /train, /valid, /test, /tmp, /models) that you see in the notebooks. From there you’ll have to move the images into sub-folders under /train … and from there, move a portion (20% or whatever) into /valid.

I also create a /sample directory and put a subset of everything in there for development. It makes it faster and allows you to debug things before using the full dataset.
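The shuffling described above is easy to script. A rough sketch of the move-into-/valid step — the directory names follow the notebooks, while the 20% fraction and the fixed seed are just illustrative choices:

```python
import os
import random
import shutil

def make_valid_split(data_dir, classes, valid_frac=0.2, seed=42):
    """Move a random `valid_frac` of each class's images from
    data_dir/train/<class> into data_dir/valid/<class>."""
    rng = random.Random(seed)
    for cls in classes:
        train_cls = os.path.join(data_dir, "train", cls)
        valid_cls = os.path.join(data_dir, "valid", cls)
        os.makedirs(valid_cls, exist_ok=True)
        files = sorted(os.listdir(train_cls))
        rng.shuffle(files)
        n_valid = int(len(files) * valid_frac)
        for fname in files[:n_valid]:
            shutil.move(os.path.join(train_cls, fname),
                        os.path.join(valid_cls, fname))
```

Splitting per class like this keeps the class balance of /valid roughly the same as /train; the same function could populate a /sample tree by copying instead of moving.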

You may want to look at the original part 1 notebooks as there is much more info there on how you can do the above.

@wgpubs, where can I find the original part 1 notebooks? And just a clarification: will the validation set consist of labeled or unlabeled data?

It’s up on the fast.ai github repo: https://github.com/fastai/courses

1 Like

I sometimes use this Chrome extension for downloading images.

17 Likes

Incidentally, I was reminded of this thread by this one on HN today - apparently quite a hot topic: Ask HN: What are best tools for web scraping?

6 Likes

I found this http://www.image-net.org/ website for images. Hope that helps other people.

2 Likes

Nobody has mentioned Scrapy yet. Scrapy is a Python web scraper/crawler. It has very extensive documentation and is used by multiple prominent companies. I’m currently using it myself and I love it.

4 Likes

Depends on how big your needs are. Here’s a good guide to using scrapy that you can really scale up. https://learn.scrapinghub.com/

1 Like
  1. I believe it matters that the images are different sizes, but I don’t recall whether the fastai library corrects for these variations?

  2. If not, is there any resource or framework that I could reference in order to prep image sizes to feed into my Neural Network?

Apologies if these are basic questions :grimacing: I’m pretty new (but excited!) to everything here!

Ian

This one can also be helpful https://www.pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/

3 Likes

A semi-manual method to download Google images:

  • Go to Google.
  • Search for your term, say “sand”.
  • Filter the search for images.
  • Right click in the top blank part.
  • Click Save As.
  • Save it as “sand - Google Search.html” on the desktop.
  • The desktop will have this file and “sand - Google Search_files” folder.
  • That folder will have many “images(*)” files without extension.
  • To add jpg extension to these files:
    • Open Command Prompt.
    • Cd to desktop folder and then to “sand - Google Search_files” folder.
    • Write: “ren images* images*.jpg”.
    • Run it.
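The rename step at the end can also be done cross-platform in Python. A sketch equivalent to the `ren` command above — the folder path is whatever your browser saved:

```python
import os

def add_jpg_extension(folder):
    """Rename every extensionless 'images*' file in `folder` to end in .jpg."""
    renamed = []
    for fname in os.listdir(folder):
        root, ext = os.path.splitext(fname)
        if fname.startswith("images") and not ext:
            new_path = os.path.join(folder, fname + ".jpg")
            os.rename(os.path.join(folder, fname), new_path)
            renamed.append(new_path)
    return renamed
```

Checking for a missing extension first means re-running it on the same folder is harmless.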

1 Like

This seems to be a very nice script for scraping images from the web:

I found this thanks to a tweet by Sebastian Raschka.

7 Likes

I found ScrapeStorm useful. I think it is very simple and convenient for scraping webpage images. I recommend it to you.

1 Like

This is awesome!

What are the best options for scraping texts rather than images from Google please? I’d like to do something like what Jeremy showed in the earlier part of lesson 4 – not texts from arXiv, but from a Google keyword search. Any pointers will be greatly appreciated!

In case anyone has similar interests - I’m resorting to good old lynx and wget, and finding them better than the browser plugins and extensions I explored.
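For anyone who’d rather stay in Python, here’s a rough stand-in for the tag-stripping part of `lynx -dump`, using only the standard library (fetching the page is left to urllib or wget; the class and function names are my own):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self._chunks.append(data.strip())

def html_to_text(html):
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser._chunks)
```

For serious text scraping a real library (BeautifulSoup, or Scrapy as mentioned above) handles malformed pages far more gracefully, but this is enough to turn search-result pages into plain text for a language-model corpus.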