Download Images from Google Image Search

Hey Fast AI fam!

Newbie deep learner here (pun not intended and possibly inaccurate). This is my first post on the forum.

I’m really struggling with collecting my own image dataset from Google Images and would love your help. :slight_smile:

Steps I took:

  1. Searched for images in Google Image Search
  2. Scrolled all the way to the bottom of the page
  3. Opened the developer console
  4. Ran the following bit of JS:

urls = Array.from(document.querySelectorAll('.rg_i'))
  .map(el => el.hasAttribute('data-src') ? el.getAttribute('data-src') : el.getAttribute('data-iurl'));
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));

  5. A file downloaded
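To see what the steps above actually produced, here's a small sanity check you could run on the downloaded file. This is a hypothetical helper (not from the original post): it counts how many URLs the file contains and how many are Google thumbnail links, which start with `https://encrypted-tbn`.

```python
from pathlib import Path

def inspect_url_file(path):
    # Read the downloaded URL list, ignoring blank lines
    urls = [u.strip() for u in Path(path).read_text().splitlines() if u.strip()]
    # Count Google-hosted thumbnail URLs (the "encrypted-tbn" ones)
    thumbs = sum(u.startswith('https://encrypted-tbn') for u in urls)
    return len(urls), thumbs
```

Running this on the file below would show that every URL is a thumbnail link.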

However… (drum roll)

All the URLs in the file start with encrypted-tbn0, etc.

https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcT8G9tvTSwgLkzxVwi1IHdu_YDKKZjGaBeqpY2_xvsHZN2BYVT2
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQM6xoeZVS43ODMpCtG0LSWziOKKww4Vs8L8V3DjU6LtpfI5nza
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTE8d9xxBVt3aNEjSSZS0JkQp-GPDaGoHQwhef8PfyGTnjFnjuA

I’ve tried to look up what this is. A Google support thread has this explanation: https://support.google.com/webmasters/thread/5941325?hl=en

I’ve tried using an incognito browser too!

Is there something I’m doing wrong? Has anyone else encountered this problem and found a workaround?

Any help would be much appreciated!

Thank you!

Adi

Hi, if downloading still does not work, you can try this one: Problems fetching urls from Google Images

2 Likes

Thanks mate!

Here’s the issue I’m running into when I use the file with the https://encrypted-tbn0… URLs.

Sorry, all of this is a bit new to me. The last time I ran the notebook I had no issues; the only big difference was that the URLs were not encrypted.

Thanks!

** Update **

False alarm. :grinning:

I waited, checked the folder, and the images had downloaded. I’m not clear why the error appears; an academic pursuit for when I’m a bit more experienced!

Thanks all!

I want to point out that you usually can’t use images downloaded from Google Images because the copyright holder needs to give consent.

It’s probably fine if you’re just doing this for a personal study project, but you need to be careful about publishing a model that you trained on images that you do not have copyrights to, or are not given consent to use. (This is why large datasets such as ImageNet use only images from Flickr that have a Creative Commons license, and even that is not entirely uncontroversial.)

Note: I am not a lawyer. And I don’t think there is any legal consensus (yet) that a model trained on copyrighted images is a derivative work and therefore infringes on the copyright. But it’s smarter (and nicer) to only use images you have permission to use.

3 Likes

Great call!

Do you think it’s an issue that it’s being taught in the course?

I found this court ruling on web scraping and its legality. http://cdn.ca9.uscourts.gov/datastore/opinions/2019/09/09/17-16783.pdf

I’m obviously paraphrasing here :smiley:

It’s because you have some lines with empty strings in your CSV file. It only shows the error for those URLs; it doesn’t prevent downloading the images with valid URLs. I fixed this issue by filtering out empty-string URLs in a given file with this PR to the fastai repo.
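For anyone who hits this before the fix lands, here's a minimal sketch of the same idea applied to the file itself: strip blank lines before handing the file to the downloader. `clean_url_file` is a hypothetical helper, not part of the fastai API.

```python
from pathlib import Path

def clean_url_file(path):
    # Drop blank lines so the downloader only ever sees valid URLs
    urls = [u.strip() for u in Path(path).read_text().splitlines() if u.strip()]
    Path(path).write_text('\n'.join(urls))
    return urls
```

After this, the error messages for the empty lines should disappear.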

1 Like

Thanks for clarifying @ozgur :smiley:

1 Like

I created this Google Colab notebook for downloading from Google Images easily. Create your CSV of links, drag and drop it into the folder panel on the left, and then run.
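For a rough idea of what such a notebook does, here's a standard-library-only sketch: read one URL per row from the CSV, skip blanks, and save each image into a folder. `download_from_csv` and the numbered filenames are my assumptions for illustration, not the actual Colab code.

```python
import csv
import os
from urllib.request import urlretrieve

def download_from_csv(csv_path, dest='images'):
    # Read one URL per row, skipping blank rows
    os.makedirs(dest, exist_ok=True)
    with open(csv_path) as f:
        urls = [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]
    saved = []
    for i, url in enumerate(urls):
        out = os.path.join(dest, f'{i:04d}.jpg')
        try:
            urlretrieve(url, out)  # fetch the image to dest/0000.jpg, 0001.jpg, ...
            saved.append(out)
        except Exception as e:
            # Keep going past dead links instead of aborting the whole run
            print(f'skipping {url}: {e}')
    return saved
```

A real notebook would likely add content-type checks and verify the images open, but this is the core loop.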

2 Likes

I downloaded the file using the JS code, but it is empty. How can I fix this?

You may also want to look into the image download function Jeremy shared in his Kaggle notebooks (Is it a Bird?), which uses DuckDuckGo.

from fastcore.all import *  # provides urlread, urljson, and the L list class
import re
import time

def search_images(term, max_images=200):
    # Fetch the DuckDuckGo search page to obtain the vqd token the image API needs
    url = 'https://duckduckgo.com/'
    res = urlread(url, data={'q':term})
    searchObj = re.search(r'vqd=([\d-]+)\&', res)
    requestUrl = url + 'i.js'
    params = dict(l='us-en', o='json', q=term, vqd=searchObj.group(1), f=',,,', p='1', v7exp='a')
    urls,data = set(),{'next':1}
    # Page through results until we have enough URLs or there is no next page
    while len(urls)<max_images and 'next' in data:
        data = urljson(requestUrl, data=params)
        urls.update(L(data['results']).itemgot('image'))
        requestUrl = url + data['next']
        time.sleep(0.2)  # small pause between requests to be polite to the server
    return L(urls)[:max_images]

Thanks,

Amr

Actually, the file I am downloading from Google, which has the image URLs, is empty.