Problems fetching urls from Google Images

You can still download the images with the download_images function (https://docs.fast.ai/vision.data.html#download_images). The solution worked fine for me.
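For reference, a minimal usage sketch (my own example; the file and folder names are placeholders - the CSV holds one image URL per line, as produced by the console snippets further down):

from fastai.vision import *

# 'urls_teddies.csv': one image URL per line; 'data/teddies': destination folder
download_images('urls_teddies.csv', 'data/teddies', max_pics=200)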

I use the Chrome browser on Ubuntu.

I’ve been having problems here, and unfortunately @abakke’s JavaScript gave me a SyntaxError: Invalid or unexpected token.

I used the following code:

urls=Array.from(document.querySelectorAll('.rg_i')).map(el=> el.hasAttribute('data-src')?el.getAttribute('data-src'):el.getAttribute('data-iurl'));
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\\n')));

EDIT: the code above had an error: the newline ‘\n’ was escaped, so the join was just inserting a literal ‘\n’ instead of creating a new line. The correct code is below:

urls=Array.from(document.querySelectorAll('.rg_i')).map(el=> el.hasAttribute('data-src')?el.getAttribute('data-src'):el.getAttribute('data-iurl'));
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
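As an aside, the same escaping pitfall is easy to reproduce in Python, where '\\n' is a literal backslash plus n rather than a newline (a quick illustration of my own):

# '\\n' is two characters (a backslash and an 'n'); '\n' is an actual newline
print('\\n'.join(['url1', 'url2']))  # prints: url1\nurl2  (all on one line)
print('\n'.join(['url1', 'url2']))   # prints url1 and url2 on separate lines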

I’ve left my original problems below, just in case anyone finds sed useful. But it’s now solved.

I then have two problems: Chrome is blocking the pop-up window for saving, so I can’t specify a file name. It just downloads the file as “download” into my Downloads folder.

Secondly, the “download” file has literal ‘\n’s in it - not newlines - so it isn’t read as a CSV with many lines, just as one really long line.

I use Linux and got around this with the following:

sed 's/\\n/\n/g' download > newfilename.txt

This turned my one-line download file into a newfilename.txt file with a new line for each URL. There is good documentation for sed if you type info sed into the terminal.

This is a bit clunky, but by coming up with an alias I can at least rename and reformat the URL file in a short line of code in my terminal.

If anyone else has a more elegant solution I’d love to hear it!
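If sed isn’t your thing, the same fix is a couple of lines of Python (my own sketch, with placeholder file names):

# read the one-line 'download' file and turn literal \n sequences into real newlines
text = open('download').read().replace('\\n', '\n')
with open('newfilename.txt', 'w') as f:
    f.write(text)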


Ohhh, the quotation marks got messed up! Thanks for telling me.

I then have two problems: Chrome is blocking the pop-up window for saving, so I can’t specify a file name. It just downloads the file as “download” into my Downloads folder.

Hmm, it could be that you just have to rename the file with e.g. .txt at the end, so it knows which file type to assign to it.

I don’t believe it’s necessary to do the ‘\n’ fix you propose. This is because the image_downloader function splits the links on ‘\n’ anyway, so it doesn’t matter that it’s not shown as a newline :slight_smile:

I still can’t get your JavaScript to work in my console. BUT - I’ve figured out what was wrong in my JavaScript code. I had ‘\\n’, which was escaping the ‘\n’, hence getting the entire file on one line. (I did try adding .txt and .csv extensions to the file name originally.) Hey-ho - at least I’ve learnt about sed and got a bit better at noticing/using the escape character!

It worked in mine when I changed all the quotation marks. I wanted to edit your code into my first post, but I can’t figure out how… the edit button seems to have disappeared.

Hey-ho - at least I’ve learnt about sed and got a bit better at noticing/using the escape character!

That’s the spirit! Every day we get a little bit better!

Hi Everyone,

Here I show you the screenshots of what I have done:


The problem is that I got URLs like https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQCSqUKhuFf0jtfMzQhSwUKdHfggYEa8wl3oI3t7aaUp3Qp6I1E in the downloaded CSV file.

All the links are like https://encrypted-tbn0.gstatic.com/**

So I get these encrypted links whenever I go to Google looking for images.
Please help me as soon as possible.

If you click on the link you sent, is the image showing? Because it does for me. If the image is showing, the image_downloader function should have no problem retrieving the images for you.


Hi abakke,

For the last two days I was looking into an alternative method.
Method 2: I finally found https://serpapi.com/

But it was a headache because it returns a JSON object, so in the end I made a Python script to build a url_file.csv.

Edit:
Thanks for getting my attention.
Sorry buddy! There was a big mistake on my side: the downloaded CSV file contains whitespace (empty lines).
So the existing method works fine (what they taught on fast.ai).

I think you mean download_images, right? It’s a factory method in fastai. With the encrypted links, I’m getting this sort of error.


Yes, you are right, that is what I meant :smile: It’s hard for me to tell why it fails… Could you provide a little more information, like what the URLs you’re feeding it look like?


https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcT4vcf9ev4ozewwikw6Qn8iJ-xFqj4j38kx1hX3Hg7SqhLW9uja They look like that. I’m running on Kaggle kernels. I’ll show my code when I get on a computer. Thank you for the reply.

Alright, I realised I could just show you my kernel. Here it is in its entirety: https://www.kaggle.com/bongbonglemon/adult-vs-teenager-fastai

Hmm, I’m not sure what’s wrong. Could you upload your URL data files, so that I can test locally? :slight_smile:

https://www.kaggle.com/bongbonglemon/adult-vs-teenager Here it is on Kaggle. Thank you. @abakke

I checked the problem; it originates from this line. It seems the download_images method reads the CSV file and splits it on “\n” to read it line by line, but does not filter empty strings out of the resulting array.

urls = open(urls).read().strip().split("\n")[:max_pics]

Changing this to filter out the empty strings, as below, will fix the issue:

urls = list(filter(None, open(urls).read().strip().split("\n")))[:max_pics]
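To see why the empty strings matter, here is a quick standalone illustration (my own sketch, not code from fastai):

# a URL export with a stray blank line, like the CSVs produced by the console snippet
raw = "http://example.com/a.jpg\nhttp://example.com/b.jpg\n\nhttp://example.com/c.jpg\n"

print(raw.strip().split("\n"))
# ['http://example.com/a.jpg', 'http://example.com/b.jpg', '', 'http://example.com/c.jpg']
# the empty string in the middle survives strip() and later breaks the download

print(list(filter(None, raw.strip().split("\n"))))
# ['http://example.com/a.jpg', 'http://example.com/b.jpg', 'http://example.com/c.jpg']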

I will create a PR from my fork, where I have already fixed it.


The problem is that there are some empty lines in your CSV files, every 100 lines or so :slight_smile: The images are getting downloaded, though.

I just ran into this problem, and my solution uses the below JavaScript in the Chrome console:

urls = Array.from(document.querySelectorAll('.tx8vtf')).map(el=>el.getAttribute('src')).filter(el=>el).filter(el=>el.includes('http'))
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));

After getting the URLs in the CSV, I strip out the commas and use wget from the command line like so:

wget -i urls

to download all the URLs. The downloads happen serially, so you are possibly better off using the library function, which AFAIK runs in parallel.
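If you want parallel downloads without the fastai dependency, here is a rough sketch using only the standard library plus requests (my own example; the file names, extension, and worker count are arbitrary):

import concurrent.futures
import requests

def fetch(url, i):
    # download one image and save it as 000.jpg, 001.jpg, ... (the extension is a guess)
    r = requests.get(url, timeout=10)
    with open('{:03d}.jpg'.format(i), 'wb') as f:
        f.write(r.content)

# 'urls' is the file produced by the console snippet, one URL per line
urls = [u for u in open('urls').read().split('\n') if u]
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as ex:
    for i, u in enumerate(urls):
        ex.submit(fetch, u, i)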


Thanks buddy, it’s a great help. I just couldn’t get the URLs file; it was always 0 bytes. Your answer fixed my problem, thank you so much.
I’m not familiar with JavaScript - could you please tell me more about your thinking behind fixing this problem? Thank you again.

That JavaScript snippet looks for any HTML element that has the CSS class “.tx8vtf”. In my experiments, the image elements from Google Images have that CSS class. I check this by using the Google Chrome plugin SelectorGadget (https://selectorgadget.com/), which highlights all elements on the page that match the CSS selector you are looking for. Check out the video on their page for more info.

After you determine that you have the right CSS selector (.tx8vtf in my example), the JavaScript code finds all matching elements and extracts the URLs that you can then use to download the images.

If the count of URLs is 0, then experiment with changing the CSS class you use to select the images.
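If you’d rather do the same extraction outside the browser, here is a rough Python equivalent (my own sketch; it assumes you saved the search results page as a local HTML file, and that the .tx8vtf class is still current):

from bs4 import BeautifulSoup

# 'page.html' is a placeholder: save the Google Images results page from your browser
soup = BeautifulSoup(open('page.html', encoding='utf-8').read(), 'html.parser')

# same logic as the JavaScript: grab src attributes and keep only real http(s) URLs
urls = [el.get('src') for el in soup.select('.tx8vtf')]
urls = [u for u in urls if u and u.startswith('http')]

with open('urls.csv', 'w') as f:
    f.write('\n'.join(urls))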

Hope this helps.

Hello,

You can all follow this tutorial on how to download your image data sets from Bing.
I compiled it into a script with slight modifications; you can use it as follows:
$ python download_images.py -q "search query" -o <dest_dir> -m <result_size> -g <group_by> -k <bing_API_KEY>

from requests import exceptions
import argparse
import requests
import cv2
import os



# construct the argument parser and parse the command-line arguments
ap = argparse.ArgumentParser()
ap.add_argument("-q", "--query", required=True,
	help="search query to search Bing Image API for")
ap.add_argument("-o", "--output", required=True,
	help="path to output directory of images")
ap.add_argument("-m", "--max", required=True,
	help="maximum number of results")
ap.add_argument("-g", "--group", required=True,
	help="group number of results")
ap.add_argument("-k", "--apikey", required=True,
	help="bing api key")

args = vars(ap.parse_args())

API_KEY = args['apikey']

MAX_RESULTS = int(args['max'])
GROUP_SIZE = int(args['group'])

# make sure the output directory exists before writing images into it
os.makedirs(args["output"], exist_ok=True)

# Bing Image Search API v7 endpoint
URL = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"

# the exception types we tolerate (skip the image) while downloading
EXCEPTIONS = set([IOError, FileNotFoundError,
	exceptions.RequestException, exceptions.HTTPError,
	exceptions.ConnectionError, exceptions.Timeout])

# build the request headers and search parameters, then make the initial request
term = args["query"]
headers = {"Ocp-Apim-Subscription-Key": API_KEY}
params = {"q": term, "offset": 0, "count": GROUP_SIZE}
print("[INFO] searching Bing API for '{}'".format(term))
search = requests.get(URL, headers=headers, params=params)
search.raise_for_status()
results = search.json()

# take the smaller of the estimated matches and the requested maximum
estNumResults = min(results["totalEstimatedMatches"], MAX_RESULTS)
print("[INFO] {} total results for '{}'".format(estNumResults, term))
total = 0


# loop over the estimated results in GROUP_SIZE batches
for offset in range(0, estNumResults, GROUP_SIZE):
	print("[INFO] making request for group {}-{} of {}...".format(
		offset, offset + GROUP_SIZE, estNumResults))
	params["offset"] = offset
	search = requests.get(URL, headers=headers, params=params)
	search.raise_for_status()
	results = search.json()
	print("[INFO] saving images for group {}-{} of {}...".format(
		offset, offset + GROUP_SIZE, estNumResults))


	# loop over the individual image results in this batch
	for v in results["value"]:
		try:
			print("[INFO] fetching: {}".format(v["contentUrl"]))
			r = requests.get(v["contentUrl"], timeout=30)
			# build the output path from a zero-padded counter and the URL's extension
			ext = v["contentUrl"][v["contentUrl"].rfind("."):]
			p = os.path.sep.join([args["output"], "{}{}".format(
				str(total).zfill(8), ext)])
			f = open(p, "wb")
			f.write(r.content)
			f.close()
		except Exception as e:
			# skip images whose download fails; re-raise anything unexpected
			if type(e) in EXCEPTIONS:
				print("[INFO] skipping: {}".format(v["contentUrl"]))
				continue
			raise
		# sanity-check that OpenCV can read the file; delete it if not
		image = cv2.imread(p)
		if image is None:
			print("[INFO] deleting: {}".format(p))
			os.remove(p)
			continue
		total += 1