Sharing an app (QImageScraper) to scrape images from Google, Bing and Yahoo search results

tham · June 2, 2017, 12:06pm

Do you ever need to gather images for your own computer vision projects? I do, and not all of the images I need can be found in ready-made data sets like ImageNet, MNIST, Caltech, etc.

I looked at a lot of apps found through Google search, but none of the free ones gave me satisfying results (Extreme Image Finder is great, but not free). They either download only thumbnails or cannot find all of the image links, which is why I decided to develop one myself.

I placed the project explanation on GitHub; you can download it from here.

To make this app better, I need a list of free, safe proxies I could connect to; this could lower the chance of being spotted as a “robot”. If you know how to do that, please give me some guidance, thanks.

ecase · June 2, 2017, 12:51pm

I don’t know much about automating image scraping but I do know how many hours I’ve spent trawling Google/Bing for photos that fit the needs of my project on ID’ing habitats. Will check this out. Thanks!

Jason · June 2, 2017, 1:21pm

Not sure if this will help, but I’m using it for something “somewhat similar” to what you’re doing. It’s a Google Chrome plug-in, so not exactly automated, but it works well for my needs. It doesn’t pull the source links from Google image search, but if you’re using other websites it pulls the full image (not the CSS-resized version).

You can also try taking a look at this to see if it helps. I haven’t used it yet, but it looks promising.

tham · June 2, 2017, 6:35pm

Thanks for the links

I tried Fatkun: most of the original images on Google cannot be downloaded, and many of the downloaded images are thumbnails. Worse, the plugin sometimes crashes and suddenly closes the whole browser. I do not know the cause: a bug in the plugin, a bug in the OS (Windows 10, 64-bit), or Google disallowing apps in the Chrome store from scraping Google images?

The PyImageSearch tutorial is a good start, but Scrapy does not suit this task. I need a web view to scroll the page down, emulate click actions (Show more images, See more images, etc.), get the HTML contents after scrolling and clicking, parse links from the HTML, and handle many network errors. I also need to be able to design simple to complex GUIs, since this app may gain more features in the future (depending on requirements, either mine or others’).

tham · June 2, 2017, 7:19pm

I found a list of free proxies that could be used (not sure whether they are safe or not). This may increase the share of images that can be downloaded at full size (right now it varies from 80% to 100%; if a full-size image cannot be downloaded, QImageScraper downloads the thumbnail instead).

clu2033 · June 6, 2017, 2:13am

Be mindful of websites’ robots.txt and try not to overload their servers. Using wait times between requests goes a long way toward not getting banned.
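
To make the robots.txt advice concrete, here is a minimal sketch in standard C++ of checking a path against the `Disallow` rules in the `User-agent: *` group. The helper names (`trim`, `disallowedPrefixes`, `isPathAllowed`) are hypothetical and not part of QImageScraper. It is a deliberately simplified subset: `Allow` rules, wildcards, agent-specific groups and groups that list several `User-agent` lines are all ignored, so a real crawler should rely on a complete parser.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Strip leading and trailing spaces, tabs and carriage returns.
static std::string trim(const std::string &s)
{
    const char *ws = " \t\r";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Collect the Disallow prefixes listed under "User-agent: *".
// Simplification (assumption): Allow rules, wildcards and
// agent-specific groups are ignored.
std::vector<std::string> disallowedPrefixes(const std::string &robotsTxt)
{
    std::vector<std::string> prefixes;
    std::istringstream input(robotsTxt);
    std::string line;
    bool inWildcardGroup = false;
    while (std::getline(input, line)) {
        line = trim(line.substr(0, line.find('#')));  // drop comments
        const auto colon = line.find(':');
        if (colon == std::string::npos) continue;     // blank or malformed line
        std::string key = trim(line.substr(0, colon));
        for (char &c : key) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        const std::string value = trim(line.substr(colon + 1));
        if (key == "user-agent") {
            inWildcardGroup = (value == "*");
        } else if (key == "disallow" && inWildcardGroup && !value.empty()) {
            prefixes.push_back(value);
        }
    }
    return prefixes;
}

// A path is allowed unless it starts with one of the disallowed prefixes.
bool isPathAllowed(const std::vector<std::string> &prefixes,
                   const std::string &path)
{
    for (const auto &prefix : prefixes) {
        if (path.compare(0, prefix.size(), prefix) == 0) return false;
    }
    return true;
}
```

A scraper could fetch `/robots.txt` once per host, cache the prefixes, and skip any image URL whose path fails `isPathAllowed`.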

tham · June 7, 2017, 2:01am

Thanks, I do add a delay to every download request (1000~1500 msec; maybe too short, would 4000~5000 msec be better?).

About the robots: I now switch randomly between four user agents, two Google bots and two Bing bots; most websites should allow Google and Bing bots to scan their data.

I gave proxies a try and they work, but I stopped using them because the speed is too slow. I may add a function to import proxy lists in the future; after all, I cannot find a free, safe and ultra-fast proxy for everyone.

bahram1 · June 7, 2017, 9:01pm

I am researching a scraping project as well; I haven’t started coding yet.
My idea was to use Tor: it is free and you get a new IP every 10 min.

Maybe this helps:

Please share your findings and whether it works for you or not.

tham · June 7, 2017, 10:42pm

Thanks, Tor is an important keyword for an image scraper.
I will share what I find, maybe in a few days.

For anyone who is interested in Qt5, here is a nice video on YouTube.

tham · June 8, 2017, 6:19am

Using Tor with Qt is easier than I thought:

    // Needs <QNetworkAccessManager>, <QNetworkProxy>, <QNetworkReply>,
    // <QFile> and <QDebug>; runs inside a member function of a QObject subclass.
    // Route all requests through the local Tor SOCKS5 proxy.
    QNetworkProxy proxy(QNetworkProxy::Socks5Proxy, "127.0.0.1", 9150);
    QNetworkAccessManager *manager = new QNetworkAccessManager(this);
    manager->setProxy(proxy);
    // finished() is not overloaded, so no static_cast is needed.
    // The lambda must capture manager in order to use it.
    connect(manager, &QNetworkAccessManager::finished,
            [manager](QNetworkReply *reply)
    {
        qDebug()<<"push reply";
        // Save the page so the reported IP can be inspected.
        QFile file("tor_reply.html");
        if(file.open(QIODevice::WriteOnly)){
            file.write(reply->readAll());
        }
        // Release the reply and the manager once the event loop is idle.
        reply->deleteLater();
        manager->deleteLater();
    });
    manager->get(QNetworkRequest(QUrl("http://www.whatsmyip.org/")));

Every time I connect to the website, my IP has changed. I haven’t measured the speed yet (anyway, I do not have high expectations about speed), but Tor does satisfy the requirements of “safe and reliable”.

I had some trouble figuring out the connection; you can find my solution on Stack Overflow.

tham · June 21, 2017, 6:21pm

@bahram1 Hi, I have something new to share with you (about Tor). If you want to know how to renew the IP of Tor, please check this: get-new-ip-of-tor7-0-by-standard-python-socket-only.

If you want to know how to do it with Qt5, check the source code of QImageScraper. Since this is an app with a UI, I do not use synchronous network APIs at all.

lateralplacket · June 21, 2017, 8:11pm

With respect @tham, though I’m sympathetic to the desire to advance your own learning and the field of machine learning, I hope you reconsider what effects your use of Tor will have on others and stop using it for this purpose.

I’m pretty sure using Tor to scrape websites would be considered an abuse of Tor, of the website you’re scraping, and of the freedoms of people using Tor legitimately, by all three groups of people (Tor operators and maintainers, website operators, and users) - and also by the subset of the general public who know what Tor is and know its legitimate uses.

I hope that isn’t too blunt, but it’s easy to forget about the wider context here!

The reasons include

Other people using Tor will be prevented from browsing that website using Tor if/when either Tor as a whole or particular IPs you end up using are blocked
Tor’s reputation goes downhill: “it’s just a hiding place for web scrapers and terrorists”
People who really NEED Tor because for example they’re in an oppressive regime or otherwise persecuted for their ideas (or reading about others’ ideas), will more often find they’re unable to see what you can by any means, or publish content on those websites, because they have started blocking Tor to prevent abuse
People who try to use Tor to defend freedom in currently-more-free states by circumventing surveillance also find it incrementally harder to do so

OK enough lecturing, this is only my second post here!

lateralplacket · June 21, 2017, 8:15pm

Have you considered taking your delay times from a pseudo-random distribution?

Interesting to hear about what tools you’re finding useful (aside from Tor!), thanks for posting.

tham · June 22, 2017, 3:52am

Thanks, this idea sounds reasonable; it should make the requests look more human.
I will replace the naive random number with a normal distribution.

Haven’t found anything useful yet.

Thanks for your lecture; I will remove Tor support from QImageScraper.
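
As an illustration of drawing delays from a normal distribution, here is a minimal sketch in standard C++. The function name `sampleDelay` and every parameter value are assumptions chosen for the example, not QImageScraper's actual settings. The sample is clamped because a normal distribution can occasionally produce very small, negative or very large values.

```cpp
#include <algorithm>
#include <chrono>
#include <random>

// Draw a delay from a normal distribution and clamp it to [minMs, maxMs].
// All default values are illustrative assumptions.
std::chrono::milliseconds sampleDelay(std::mt19937 &rng,
                                      double meanMs = 4500.0,
                                      double stddevMs = 1000.0,
                                      double minMs = 1000.0,
                                      double maxMs = 10000.0)
{
    std::normal_distribution<double> dist(meanMs, stddevMs);
    // Clamp so an outlier never yields a negative or extremely long wait.
    const double clamped = std::min(maxMs, std::max(minMs, dist(rng)));
    return std::chrono::milliseconds(static_cast<long long>(clamped));
}
```

In an app with a UI, the result would more likely be handed to a timer such as `QTimer::singleShot` than to a blocking sleep, so that the interface stays responsive.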

natedl98 · June 22, 2017, 6:55am

lateralplacket:

With respect @tham, though I’m sympathetic to the desire to advance your own learning and the field of machine learning, I hope you reconsider what effects your use of Tor will have on others and stop using it for this purpose.

It’s taking every ounce of my willpower not to troll this post

lateralplacket · June 22, 2017, 9:11pm

Well done Nate

bahram1 · June 24, 2017, 12:27am

Great, I was playing with it a little bit, but it is not fast enough! It takes seconds to renew the connection and get a new IP!

I was thinking of dockerizing it! Let’s say 200 Docker instances, each with port forwarding to a different port. That way we can slow each request by a second or two and still have reasonable output and prevent IP blocking. But I don’t know enough iptables rules and it is going to be a big distraction.

I understand @lateralplacket’s reasons and they all make sense, but a poor researcher needs their data and there is no free alternative.

I can roll out a couple of droplets on DigitalOcean or something cheaper and use them as proxies, but the problem is that the sites I am trying to scrape are blocking all the buyable IPs!!

tham · June 25, 2017, 2:35pm

Renewing the Tor IP is very fast, but accessing network data through Tor is another story. In my case it is not a big deal: even without a proxy, the download success rate for big images found via Google is better than 90% in most cases. Besides, humans love big images, but big images are not mandatory for many computer vision tasks.

What is dockerizing? If you want to lower the download request frequency, why not just wait a few seconds (minimum_delay_secs + random_secs) before the next download starts?
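
The "minimum delay plus a random extra" idea above can be sketched in standard C++ as follows. The name `nextDelay` and the default values are hypothetical; the only point is that the wait never drops below the minimum while the random part keeps the requests from being evenly spaced.

```cpp
#include <chrono>
#include <random>

// Wait time before the next download: a fixed minimum plus a uniformly
// distributed extra in [0, maxExtraMs]. Default values are assumptions.
std::chrono::milliseconds nextDelay(std::mt19937 &rng,
                                    int minimumDelayMs = 4000,
                                    int maxExtraMs = 1000)
{
    std::uniform_int_distribution<int> extra(0, maxExtraMs);
    return std::chrono::milliseconds(minimumDelayMs + extra(rng));
}
```

With these defaults every wait falls between 4000 and 5000 msec, which matches the longer range floated earlier in the thread.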

bahram1 · June 25, 2017, 2:54pm

Docker 🙂 I need to hit almost 2 million pages in 24 hr, so I guess I have to bite the bullet and pay for a proxy service!
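
For scale, a quick back-of-the-envelope calculation, assuming the load is spread evenly over the day and over the 200 instances suggested earlier in the thread:

$$
\frac{2{,}000{,}000\ \text{pages}}{24 \times 3600\ \text{s}} = \frac{2{,}000{,}000}{86{,}400} \approx 23.1\ \text{pages/s}
$$

$$
\frac{86{,}400\ \text{s} \times 200}{2{,}000{,}000\ \text{pages}} = 8.64\ \text{s between requests per instance}
$$

By comparison, a single IP making one request every 1 to 2 seconds covers only 43,200 to 86,400 pages per day, so that target needs roughly 24 to 47 such IPs working in parallel.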

Vyachez · October 24, 2017, 12:55am

Hello folks, I was doing some work on this. I don’t have a plugin for Google and other engines right now, but here is what I have created so far:

This is just a web-link image scraper, but here is what it can offer:

Uploading verified images (verified by size to avoid junk);
Creates the right directory structure, as Jeremy advised in Lesson 1;
Calibrates the number of images for each class at the end (equaliser.py).

Please have a look; I hope it helps those interested in the topic.
P.S. This is the first more or less useful code I have shared publicly.