Download Images from Google Image Search

Hey Fast AI fam!

Newbie deep learner here (pun not intended and possibly inaccurate). This is my first post on the forum.

I’m really struggling with collecting my own image dataset from Google Images and would love your help. :slight_smile:

Steps I took:

  1. Searched for images in Google Image Search
  2. Scrolled all the way to the bottom of the page
  3. Opened the developer console
  4. Ran the following bit of JS:

urls = Array.from(document.querySelectorAll('.rg_i'))
  .map(el => el.hasAttribute('data-src') ? el.getAttribute('data-src') : el.getAttribute('data-iurl'));
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));

  5. A file downloaded
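To see what the steps above actually produced, here's a small sanity check you could run on the downloaded file. This is a hypothetical helper (not from the original post): it counts how many URLs the file contains and how many are Google thumbnail links, which start with `https://encrypted-tbn`.

```python
from pathlib import Path

def inspect_url_file(path):
    # Read the downloaded URL list, ignoring blank lines
    urls = [u.strip() for u in Path(path).read_text().splitlines() if u.strip()]
    # Count Google-hosted thumbnail URLs (the "encrypted-tbn" ones)
    thumbs = sum(u.startswith('https://encrypted-tbn') for u in urls)
    return len(urls), thumbs
```

Running this on the file below would show that every URL is a thumbnail link.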

However… (drum roll)

All the URLs in the file start with encrypted-tbn0, etc.

https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcT8G9tvTSwgLkzxVwi1IHdu_YDKKZjGaBeqpY2_xvsHZN2BYVT2
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQM6xoeZVS43ODMpCtG0LSWziOKKww4Vs8L8V3DjU6LtpfI5nza
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTE8d9xxBVt3aNEjSSZS0JkQp-GPDaGoHQwhef8PfyGTnjFnjuA

I’ve tried to look up what this is. A Google support thread has this explanation: https://support.google.com/webmasters/thread/5941325?hl=en

I’ve tried using an incognito browser too!

Is there something I’m doing wrong? Has anyone else encountered this problem and found a workaround?

Any help would be much appreciated!

Thank you!

Adi

Hi, if downloading still does not work, you can try this one: Problems fetching urls from Google Images

2 Likes

Thanks mate!

Here’s the issue I’m running into when I use the file with the https://encrypted-tbn0… URLs.

Sorry, all of this is a bit new to me. The last time I ran the notebook I had no issues; the only big difference was that the URLs were not encrypted.

Thanks!

** Update **

False alarm. :grinning:

I waited, checked the folder, and the images had downloaded. I’m not clear why the error appears; an academic pursuit for when I’m a bit more experienced!

Thanks all!

I want to point out that you usually can’t use images downloaded from Google Images because the copyright holder needs to give consent.

It’s probably fine if you’re just doing this for a personal study project, but you need to be careful about publishing a model that you trained on images that you do not have copyrights to, or are not given consent to use. (This is why large datasets such as ImageNet use only images from Flickr that have a Creative Commons license, and even that is not entirely uncontroversial.)

Note: I am not a lawyer. And I don’t think there is any legal consensus (yet) that a model trained on copyrighted images is a derivative work and therefore infringes on the copyright. But it’s smarter (and nicer) to only use images you have permission to use.

3 Likes

Great call!

Do you think it’s an issue that it’s being taught in the course?

I found this court ruling on web scraping and its legality. http://cdn.ca9.uscourts.gov/datastore/opinions/2019/09/09/17-16783.pdf

I’m obviously paraphrasing here :smiley:

It’s because you have some lines with empty strings in your CSV file. It only shows the error for those URLs; it doesn’t prevent downloading the images with valid URLs. I fixed this issue by filtering out empty-string URLs in a given file with this PR to the fastai repo.
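For anyone who hits this before the fix lands, here's a minimal sketch of the same idea applied to the file itself: strip blank lines before handing the file to the downloader. `clean_url_file` is a hypothetical helper, not part of the fastai API.

```python
from pathlib import Path

def clean_url_file(path):
    # Drop blank lines so the downloader only ever sees valid URLs
    urls = [u.strip() for u in Path(path).read_text().splitlines() if u.strip()]
    Path(path).write_text('\n'.join(urls))
    return urls
```

After this, the error messages for the empty lines should disappear.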

1 Like

Thanks for clarifying @ozgur :smiley:

1 Like

I created this Google Colab notebook for downloading from Google Images easily. Create your CSV of links, drag and drop it into the folder panel on the left, and then run.
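For a rough idea of what such a notebook does, here's a standard-library-only sketch: read one URL per row from the CSV, skip blanks, and save each image into a folder. `download_from_csv` and the numbered filenames are my assumptions for illustration, not the actual Colab code.

```python
import csv
import os
from urllib.request import urlretrieve

def download_from_csv(csv_path, dest='images'):
    # Read one URL per row, skipping blank rows
    os.makedirs(dest, exist_ok=True)
    with open(csv_path) as f:
        urls = [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]
    saved = []
    for i, url in enumerate(urls):
        out = os.path.join(dest, f'{i:04d}.jpg')
        try:
            urlretrieve(url, out)  # fetch the image to dest/0000.jpg, 0001.jpg, ...
            saved.append(out)
        except Exception as e:
            # Keep going past dead links instead of aborting the whole run
            print(f'skipping {url}: {e}')
    return saved
```

A real notebook would likely add content-type checks and verify the images open, but this is the core loop.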

2 Likes

I downloaded the file using the JS code, but it is empty. How can I fix this?

You may also want to look into the image download function Jeremy shared in his Kaggle notebooks (Is it a Bird?), which uses DuckDuckGo.

from fastcore.all import *  # provides urlread, urljson, and the L list class
import re
import time

def search_images(term, max_images=200):
    # Fetch the DuckDuckGo search page to obtain the vqd token the image API needs
    url = 'https://duckduckgo.com/'
    res = urlread(url, data={'q':term})
    searchObj = re.search(r'vqd=([\d-]+)\&', res)
    requestUrl = url + 'i.js'
    params = dict(l='us-en', o='json', q=term, vqd=searchObj.group(1), f=',,,', p='1', v7exp='a')
    urls,data = set(),{'next':1}
    # Page through results until we have enough URLs or there is no next page
    while len(urls)<max_images and 'next' in data:
        data = urljson(requestUrl, data=params)
        urls.update(L(data['results']).itemgot('image'))
        requestUrl = url + data['next']
        time.sleep(0.2)  # small pause between requests to be polite to the server
    return L(urls)[:max_images]

Thanks,

Amr

Actually, the file I am downloading from Google, which has the image URLs, is empty.