Anyone have any recommended approaches for scraping images for ML tasks?
For example, imagine we didn’t have our lesson 1 dogs/cats dataset. What would be a programmatic approach to using Google (as an example) to find and download cat and dog pics for us?
Might be worth checking this out and adding your thoughts/questions there: Challenges while creating your own dataset
Adding this here and will link to it from the post you mentioned.
Here is a simple project you can use to scrape images from a google search. It uses selenium and should only be used for educational purposes. Comments, recommendations, and pull requests are welcomed!
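The linked project drives a real browser with selenium; just to illustrate the extraction step on its own, here is a minimal sketch that pulls `<img>` URLs out of an already-saved results page using only the standard library. The HTML structure and attribute names here are generic assumptions, not Google’s actual markup.

```python
# Collect the src attribute of every <img> tag in an HTML document.
# This is the parsing half of a scraper; fetching/saving the page
# (via selenium, wget, etc.) is a separate step.
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Accumulates img src URLs in document order."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def extract_image_urls(html):
    parser = ImgSrcParser()
    parser.feed(html)
    return parser.urls
```

Usage would be something like `extract_image_urls(open("results.html").read())`, after which you can download each URL however you prefer.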
@wgpubs, even if we download the dataset with the resource you mentioned above, we would then have to label about 70% of the data (only a supposed split), e.g. banana1.jpg, banana2.jpg, then reserve 10% of the data for the validation set as unlabeled and 20% for the test set as unlabeled. Please correct me if I’m wrong about the procedure.
You are correct.
You’ll have to write code to create the same directory structure used in the notebooks (e.g., /train, /valid, /test, /tmp, /models). From there you’ll have to move the images into sub-folders under /train … and then move a portion (20% or whatever) into /valid.
I also create a /sample directory and put a subset of everything in there for development. It makes it faster and allows you to debug things before using the full dataset.
You may want to look at the original part 1 notebooks as there is much more info there on how you can do the above.
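The train/valid split described above can be sketched in a few lines; this assumes you already have labeled sub-folders under /train (the paths and the 20% figure are just examples):

```python
# Given data/train/<class>/... , move a fraction of each class into a
# parallel data/valid/<class>/... tree.
import random
import shutil
from pathlib import Path


def split_train_valid(data_dir, valid_pct=0.2, seed=42):
    data_dir = Path(data_dir)
    train_dir, valid_dir = data_dir / "train", data_dir / "valid"
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    for class_dir in train_dir.iterdir():
        if not class_dir.is_dir():
            continue
        files = sorted(class_dir.iterdir())
        rng.shuffle(files)
        n_valid = int(len(files) * valid_pct)
        dest = valid_dir / class_dir.name
        dest.mkdir(parents=True, exist_ok=True)
        for f in files[:n_valid]:
            shutil.move(str(f), str(dest / f.name))
```

The same idea extends to carving out a /sample directory: point it at a copy of the data and use a small fraction instead.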
@wgpubs, where can I find the original part 1 notebooks? And just a clarification: will the validation set consist of labeled or unlabeled data?
I sometimes use this chrome extension for downloading images.
Incidentally, I was reminded of this thread by this one on HN today - apparently quite a hot topic: Ask HN: What are best tools for web scraping?
I found this website for images: http://www.image-net.org/. Hope that helps other people.
Nobody has mentioned Scrapy yet. Scrapy is a Python web scraper/crawler. It has very extensive documentation and is used by multiple prominent companies. I am currently using it myself and I love it.
Depends on how big your needs are. Here’s a good guide to using scrapy that you can really scale up. https://learn.scrapinghub.com/
I believe that images being different sizes matters, but I don’t recall whether the fastai library corrects for these variations.
If not, is there any resource or framework I could reference to prep image sizes to feed into my neural network?
Apologies if these are basic questions I’m pretty new (but excited!) to everything here!
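As far as I recall, fastai resizes images as part of its transforms (the size argument you pass in lesson 1), so differing source sizes are fine. If you do want to pre-resize a folder yourself, the sizing arithmetic is simple; here is a sketch (the Pillow call is commented out and the paths are hypothetical):

```python
def fit_size(width, height, target):
    """Largest (w, h) that fits in a target x target box, keeping aspect ratio."""
    scale = target / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


# With Pillow installed, resizing a folder might look like:
# from PIL import Image
# from pathlib import Path
# for p in Path("data/train/cats").glob("*.jpg"):
#     img = Image.open(p)
#     img.resize(fit_size(*img.size, 224)).save(p)
```

For example, an 800×600 image fit into a 224 box comes out 224×168.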
A semi-manual method to download Google images:
- Go to Google.
- Search for your term, say “sand”.
- Filter the search for images.
- Right click in the top blank part.
- Click Save As.
- Save it as “sand - Google Search.html” on the desktop.
- The desktop will have this file and “sand - Google Search_files” folder.
- That folder will have many “images(*)” files without extension.
- To add a jpg extension to these files:
  - cd to the desktop folder and then into the “sand - Google Search_files” folder.
  - Run: `ren images* images*.jpg`.
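The renaming step at the end can also be scripted, which avoids the Windows-only `ren` command; this sketch appends “.jpg” to every extensionless “images*” file in a folder (the folder name from the steps above is just the example):

```python
# Append ".jpg" to extensionless files named images* in the given folder.
from pathlib import Path


def add_jpg_extension(folder):
    renamed = []
    for p in Path(folder).glob("images*"):
        if p.is_file() and p.suffix == "":
            new = p.with_suffix(".jpg")
            p.rename(new)
            renamed.append(new.name)
    return renamed
```

Call it as `add_jpg_extension("sand - Google Search_files")` from the desktop.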
This seems to be a very nice script for scraping images from the web:
I found this thanks to a tweet by Sebastian Raschka.
I found ScrapeStorm useful. I think it is very simple and convenient for scraping images from web pages. I recommend it to you.
What are the best options for scraping texts rather than images from Google please? I’d like to do something like what Jeremy showed in the earlier part of lesson 4 – not texts from arXiv, but from a Google keyword search. Any pointers will be greatly appreciated!
In case anyone has similar interests: I am resorting to good old lynx and wget, and finding them better than the browser plugins and extensions I explored.
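For what it’s worth, a typical wget invocation for grabbing just the images linked from a page might look like this (the URL is a placeholder, and depth/extensions are examples to adjust):

```shell
# Recurse one level from the page, keep only common image extensions,
# flatten the output into an images/ folder.
wget --recursive --level=1 --no-directories \
     --accept jpg,jpeg,png,gif \
     --directory-prefix=images \
     https://example.com/gallery
```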