Anyone have any recommended approaches for scraping images for ML tasks?
For example, imagine we didn’t have our lesson 1 dogs/cats dataset. What programmatic approach could we follow to use Google (as an example) to find and download cat and dog pics for us?
Adding this here and will link to it from the post you mentioned.
Here is a simple project you can use to scrape images from a Google search. It uses Selenium and should only be used for educational purposes. Comments, recommendations, and pull requests are welcome!
@wgpubs, even if we download the dataset with the resource you mentioned above, we still have to label about 70% of the data (just as an assumption), e.g. banana1.jpg, banana2.jpg. Then we reserve 10% of the data as an unlabeled validation set and 20% as an unlabeled test set. Please correct me if I’m wrong about the procedure.
You’ll have to write code to create the same directory structure used in the notebooks (e.g., /train, /valid, /test, /tmp, /models). From there you’ll have to move the images into sub-folders (one per class) under /train, and then move a portion (20% or whatever) into matching sub-folders under /valid.
I also create a /sample directory and put a subset of everything in there for development. It makes it faster and allows you to debug things before using the full dataset.
You may want to look at the original part 1 notebooks as there is much more info there on how you can do the above.
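As a rough sketch of the steps above (not the code from the notebooks), something like the following could build the layout. It assumes the scraped images already sit in one folder per class (e.g. raw/cats, raw/dogs), so the folder name acts as the label. The function name build_dataset, the parameters valid_frac and sample_n, and the way /sample mirrors /train and /valid are all my own choices for illustration. It also copies files instead of moving them, so the raw downloads stay untouched.

```python
import random
import shutil
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}  # assumption: extensions worth keeping


def build_dataset(raw_dir, out_dir, valid_frac=0.2, sample_n=10, seed=42):
    """Copy images from raw_dir/<class>/ into out_dir/train and out_dir/valid.

    Hypothetical helper: raw_dir is assumed to hold one sub-folder per class.
    A small subset of each split is also copied under out_dir/sample.
    """
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    for class_dir in sorted(p for p in raw_dir.iterdir() if p.is_dir()):
        images = sorted(
            f for f in class_dir.iterdir() if f.suffix.lower() in IMAGE_EXTS
        )
        rng.shuffle(images)

        # Hold out a fraction of each class for validation.
        n_valid = int(len(images) * valid_frac)
        splits = {"valid": images[:n_valid], "train": images[n_valid:]}

        for split, files in splits.items():
            dest = out_dir / split / class_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy2(f, dest / f.name)

            # Small development subset with the same structure.
            sample_dest = out_dir / "sample" / split / class_dir.name
            sample_dest.mkdir(parents=True, exist_ok=True)
            for f in files[:sample_n]:
                shutil.copy2(f, sample_dest / f.name)

    # Remaining folders mentioned above, created empty.
    for extra in ("test", "tmp", "models"):
        (out_dir / extra).mkdir(parents=True, exist_ok=True)


# Example call (paths are placeholders):
# build_dataset("raw", "data/dogscats", valid_frac=0.2, sample_n=10)
```

Since the labels come from the folder names, there is no need to rename individual files; only /test stays unlabeled.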
Nobody has mentioned Scrapy yet. Scrapy is a Python web scraper/crawler. It has very extensive documentation and is used by multiple prominent companies. I am currently using it myself and I love it.
What are the best options for scraping text rather than images from Google, please? I’d like to do something like what Jeremy showed in the earlier part of lesson 4, but with text from a Google keyword search rather than from arXiv. Any pointers will be greatly appreciated!
In case anyone has similar interests: I am resorting to good old lynx and wget, and finding them better than the browser plugins and extensions I explored.
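If you would rather stay in Python for the text-extraction step, here is a minimal sketch of the same idea using only the standard library. This is a stand-in I am suggesting, not what lynx does internally: it simply drops tags plus script/style contents and keeps the visible text. The names TextExtractor and html_to_text are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text chunks of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Example with a page already saved by wget (file name is a placeholder):
# with open("page.html", encoding="utf-8", errors="replace") as fh:
#     print(html_to_text(fh.read()))
```

Real pages are messier than this handles (navigation menus, cookie banners, and so on), so expect to filter the output before using it as training text.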