How to scrape the web for images?

Incidentally, I was reminded of this thread by one on HN today, which is apparently quite a hot topic: Ask HN: What are best tools for web scraping?

6 Likes

I found this website for images: http://www.image-net.org/ Hope that helps other people.

2 Likes

Nobody has mentioned Scrapy yet. Scrapy is a Python web scraper/crawler. It has very extensive documentation and is used by multiple prominent companies. I am currently using it myself and I love it.

4 Likes

Depends on how big your needs are. Here’s a good guide to using Scrapy, which you can really scale up: https://learn.scrapinghub.com/

1 Like
  1. I believe differences in image size matter. Does the fastai library correct for these variations? I don’t recall.

  2. If not, is there a resource or framework I could use to prepare image sizes before feeding them into my neural network?

Apologies if these are basic questions :grimacing: I’m pretty new (but excited!) to everything here!

Ian
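On the image-size question above: if you want to normalise sizes yourself before training, here is a minimal sketch that center-crops each image to a square and resizes it. It assumes Pillow is installed; the names `center_crop_box` and `prepare_image` and the 224-pixel default are my own illustrative choices, not anything from fastai.

```python
def center_crop_box(width, height):
    """Return the (left, upper, right, lower) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


def prepare_image(src_path, dst_path, size=224):
    """Center-crop an image to a square, then resize it to size x size pixels."""
    # Assumption: Pillow is installed (pip install Pillow).
    from PIL import Image

    with Image.open(src_path) as img:
        img = img.convert("RGB")  # normalise grayscale/RGBA inputs to 3 channels
        square = img.crop(center_crop_box(*img.size))
        square.resize((size, size)).save(dst_path)
```

Cropping before resizing keeps the aspect ratio intact at the cost of losing the edges of the image; squashing the whole image to a square is the other common choice.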

This one can also be helpful: https://www.pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/

3 Likes

A semi-manual method to download Google images:

  • Go to Google.
  • Search for your term, say “sand”.
  • Filter the search for images.
  • Right-click in a blank area near the top of the page.
  • Click Save As.
  • Save it as “sand - Google Search.html” on the desktop.
  • The desktop will have this file and “sand - Google Search_files” folder.
  • That folder will have many “images(*)” files without extension.
  • To add a .jpg extension to these files:
    • Open Command Prompt.
    • cd to the desktop folder and then into the “sand - Google Search_files” folder.
    • Type: “ren images* images*.jpg”.
    • Run it.
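
If you are not on Windows, or would rather not use `ren`, the renaming step above can be sketched in Python with only the standard library. The function name `add_jpg_extension` and the default `images` prefix are assumptions for illustration. It also assumes the saved files really are JPEGs, which the browser does not guarantee.

```python
from pathlib import Path


def add_jpg_extension(folder, prefix="images"):
    """Rename extension-less files whose names start with `prefix` to end in .jpg."""
    renamed = []
    # Take a sorted snapshot of the folder first so renaming does not disturb the loop.
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.name.startswith(prefix) and path.suffix == "":
            target = path.with_name(path.name + ".jpg")
            if target.exists():
                continue  # never overwrite an existing file
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

For the example above you would call something like `add_jpg_extension(Path.home() / "Desktop" / "sand - Google Search_files")`.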
1 Like

This seems to be a very nice script for scraping images from the web:

I found this thanks to a tweet by Sebastian Raschka.

7 Likes

I found ScrapeStorm useful. It is simple and convenient for scraping images from web pages, and I recommend it.

1 Like

This is awesome!

What are the best options for scraping text rather than images from Google, please? I’d like to do something like what Jeremy showed in the earlier part of lesson 4: not text from arXiv, but from a Google keyword search. Any pointers will be greatly appreciated!

In case anyone has similar interests: I am resorting to good old lynx and wget, and finding them better than the browser plugins and extensions I explored.
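
For anyone who wants to stay in Python after fetching pages with wget, here is a rough sketch of pulling the visible text out of a saved HTML file using only the standard library. It is a much cruder cousin of what `lynx -dump` gives you, not a replacement for it; `TextExtractor` and `html_to_text` are names I made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one text fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Read the file wget saved (for example with `Path("page.html").read_text(errors="ignore")`, where `page.html` is a placeholder name) and pass the string to `html_to_text`.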

I use https://github.com/althonos/InstaLooter, though it limits how much you can download daily.

Works really well, thank you!

I have used this extension quite a bit when I wanted to create a detailed sitemap (it works with pagination too) for image and text scraping. Hope it helps! :slight_smile:

I followed lesson 2 and tried to make the text file of image URLs, but it came out as an empty text file. Any thoughts? Thanks!

@yishai Did you follow the code exactly as in the notebook? Can you share here the part of the code where that happens?

Pinterest is also a good place for collecting images. I forked this repo (PinterestDL) and made it work on Colab.
Find the source here and a Colab sample here.

This script is currently broken. The issue has already been reported in the GitHub repo.

I am sharing links to some applications that can help you extract data online:
justdial data extractor
indiamart data extractor