You can still download the images with the download_images function (https://docs.fast.ai/vision.data.html#download_images). The solution worked fine for me.
I use Chrome browser with Ubuntu OS.
I’ve been having problems here; unfortunately @abakke’s JavaScript gave a Syntax Error: Invalid or unexpected token.
I used the following code:
urls=Array.from(document.querySelectorAll('.rg_i')).map(el=> el.hasAttribute('data-src')?el.getAttribute('data-src'):el.getAttribute('data-iurl'));
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\\n')));
EDIT: the code above had an error: the newline ‘\n’ was escaped, so the join was inserting a literal “\n” instead of creating a new line. The correct code is below:
urls=Array.from(document.querySelectorAll('.rg_i')).map(el=> el.hasAttribute('data-src')?el.getAttribute('data-src'):el.getAttribute('data-iurl'));
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
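For anyone unsure what the escaping bug actually does: ‘\\n’ is two characters (a backslash and an n), while ‘\n’ is a real newline. The same distinction holds in Python, so a quick standalone illustration:

```python
urls = ["http://a.jpg", "http://b.jpg"]

# "\\n" is a literal backslash + n: the join stays on one visual line
bad = "\\n".join(urls)

# "\n" is a real newline: the join puts one URL per line
good = "\n".join(urls)

print(bad)                # http://a.jpg\nhttp://b.jpg (one line)
print(good.splitlines())  # ['http://a.jpg', 'http://b.jpg']
```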
I’ve left my original problems below, just in case anyone finds sed useful. But it’s now solved.
I then have two problems: Chrome is blocking the pop up window for saving so I can’t specify a file name. It just downloads the file as “download” into my Downloads folder.
Secondly, the “download” file has literal ‘\n’s in it - not new lines. So it isn’t read as a CSV with many lines - just as one really long line.
I use Linux and got around this with the following:
sed 's/\\n/\n/g' download > newfilename.txt
This turned my one-line download file into a newfilename.txt file with a new line for each url. There is good documentation for sed if you type info sed into the terminal.
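For anyone not on Linux (or without sed), the same repair is a one-liner in Python. A minimal sketch - the helper name is mine, not from any library:

```python
def unescape_newlines(text):
    # same substitution as the sed command above: literal "\n" -> real newline
    return text.replace("\\n", "\n")

# what the broken "download" file contains: one long line
one_line = "http://a.jpg\\nhttp://b.jpg"
fixed = unescape_newlines(one_line)
print(fixed.splitlines())  # one URL per line
```

To repair an actual file, read it, pass the contents through the function, and write the result back out under a new name.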
This is a bit clunky, but by coming up with an alias I can at least rename and reformat the url file with a short line in my terminal.
If anyone else has a more elegant solution I’d love to hear it!
Ohhh, the quotation marks got messed up! Thanks for telling me.
I then have two problems: Chrome is blocking the pop up window for saving so I can’t specify a file name. It just downloads the file as “download” into my Downloads folder.
Hmm, it could be that you just have to rename the file with e.g. .txt at the end, so it knows which filetype to open it as.
I don’t believe it’s necessary to do the ‘\n’ fix you propose. This is because the image_downloader function splits the links on ‘\n’ anyway, so it doesn’t matter that it’s not shown as a newline.
I still can’t get your javascript to work in my console. BUT - I’ve figured out what was wrong in my javascript code. I had ‘\\n’, which escaped the ‘\n’, hence the entire file ending up on one line. (I did try adding .txt and .csv extensions to the file name originally.) Hey-ho - at least I’ve learnt about sed and got a bit better at noticing/using the escape character!
It worked in mine, when I changed all the quotation marks. I wanted to edit your code into my first post, but I can’t figure out how… the edit button seems to have disappeared.
Hey-ho - at least I’ve learnt about sed and got a bit better at noticing/using the escape character!
That’s the spirit! Every day we get a little bit better!
Hi Everyone,
Here are screenshots of what I have done.
The problem is that I got links like https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQCSqUKhuFf0jtfMzQhSwUKdHfggYEa8wl3oI3t7aaUp3Qp6I1E
in the downloaded csv file.
All the links look like https://encrypted-tbn0.gstatic.com/**
So I get these encrypted links whenever I go to Google looking for images.
Please help me as soon as possible.
If you click on the link you sent, is the image showing? Because it does for me. If the image is showing, the image_downloader function should have no problem retrieving the images for you.
Hi abakke,
For the last two days I was looking into an alternative method.
Method 2:
Finally I found https://serpapi.com/
But it was a headache because it returns a JSON object; in the end I made a Python script to build a url_file.csv.
Edit:
Thanks for getting my attention.
Sorry buddy! There was a big mistake on my side: the downloaded csv file contains whitespace (empty lines).
So the existing method works fine (what they taught on fast.ai).
I think you mean download_images, right? It’s a factory method in fastai. With the encrypted links, I’m getting this sort of error.
Yes, you are right, that is what I meant.
It’s hard for me to tell why it fails… Could you provide a little more information, like what the URLs you’re feeding it look like?
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcT4vcf9ev4ozewwikw6Qn8iJ-xFqj4j38kx1hX3Hg7SqhLW9uja They look like that. I’m running on Kaggle kernels. I’ll show my code when I get on a computer. Thank you for the reply.
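A side note on those links, as an observation worth checking: the “encrypted” gstatic URLs are ordinary URLs whose query string is simply percent-encoded, so the stdlib can decode them:

```python
from urllib.parse import urlparse, parse_qs

url = ("https://encrypted-tbn0.gstatic.com/images"
       "?q=tbn%3AANd9GcT4vcf9ev4ozewwikw6Qn8iJ-xFqj4j38kx1hX3Hg7SqhLW9uja")

# parse_qs percent-decodes the value, so %3A becomes ":"
q = parse_qs(urlparse(url).query)["q"][0]
print(q[:4])  # tbn:
```

So the URLs themselves are well-formed thumbnail links; the download failure is more likely elsewhere (e.g. in the URL file itself).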
Alright I realised I could just show you my kernel. Here it is in its entirety. https://www.kaggle.com/bongbonglemon/adult-vs-teenager-fastai
Hmm, I’m not sure what’s wrong. Could you upload your URL data files, so that I can test locally?
I checked the problem; it originates from this line. It seems the download_images method reads the CSV file and splits it on “\n” to read it line by line, but does not filter empty strings out of the resulting array.
urls = open(urls).read().strip().split("\n")[:max_pics]
Changing this to filtering as below will fix this issue.
urls = list(filter(None, open(urls).read().strip().split("\n")))[:max_pics]
I will create a PR from my fork, where I have already fixed it.
The problem is that there are some empty lines in your csv files, every 100 lines or so. The images are getting downloaded, though.
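The effect of the filter(None, …) fix can be seen on a small standalone example (made-up URLs, with a blank line in the middle as in the problematic csv files):

```python
raw = "http://a.jpg\n\nhttp://b.jpg\n"  # note the blank line in the middle

# plain split keeps the empty string, which later breaks the download loop
plain = raw.strip().split("\n")
print(plain)  # ['http://a.jpg', '', 'http://b.jpg']

# filter(None, ...) drops falsy entries, i.e. the empty strings
clean = list(filter(None, raw.strip().split("\n")))
print(clean)  # ['http://a.jpg', 'http://b.jpg']
```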
I just ran into this problem and my solution uses the below javascript in the Chrome console:
urls = Array.from(document.querySelectorAll('.tx8vtf')).map(el=>el.getAttribute('src')).filter(el=>el).filter(el=>el.includes('http'))
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
After getting the urls in csv, I strip out the commas and use wget from the command line like so:
wget -i urls
to download all the urls. The download happens in serial so you possibly are better off using the library function which AFAIK runs in parallel.
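To sketch what “runs in parallel” could look like in Python - this is an assumption about the library’s internals, not the fastai code itself, and fetch here is a placeholder for a real HTTP request:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # placeholder: a real version would call requests.get(url)
    # and write the response bytes to disk
    return "downloaded " + url

urls = ["http://a.jpg", "http://b.jpg", "http://c.jpg"]

# map over the urls with a pool of worker threads instead of one at a time
with ThreadPoolExecutor(max_workers=8) as ex:
    results = list(ex.map(fetch, urls))

print(results)
```

Because image downloads are I/O-bound, even a small thread pool usually gives a large speedup over a serial wget loop.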
Thanks buddy, it’s a great help. I just couldn’t get the urls file; it was always 0 bytes. Your answer fixed my problem, thank you so much.
I’m not familiar with javascript; could you please tell me more about your approach to fixing this problem? Thank you again.
That javascript snippet is looking for any html element that has the css class “.tx8vtf”. In my experiments, the image elements from google images have that css class. I check this by using the google chrome plugin SelectorGadget https://selectorgadget.com/ which highlights all elements on the page which have the css selector you are looking for. Check out the video on their page for more info.
After you determine you have the right css selector (.tx8vtf in my example), the javascript code finds all of them and extracts the urls that you can then use to download the images.
If the count of urls is 0, then experiment with changing the css class you use to select the images.
Hope this helps.
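The same select-by-class-and-collect-src logic can be mimicked in Python with the stdlib parser - a sketch for anyone who wants to test the idea outside the browser console (the sample HTML is made up):

```python
from html.parser import HTMLParser

class SrcCollector(HTMLParser):
    """Collect http(s) src attributes from elements with a given css class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # an element may carry several classes, so split on whitespace
        if self.css_class in attrs.get("class", "").split():
            src = attrs.get("src", "")
            if src.startswith("http"):
                self.urls.append(src)

p = SrcCollector("tx8vtf")
p.feed('<img class="tx8vtf" src="http://a.jpg">'
       '<img class="other" src="http://b.jpg">')
print(p.urls)  # only the matching element's url
```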
Hello,
You can all follow this tutorial on how to download your image data sets from Bing.
I compiled it into a script with slight modifications, which you can use as follows:
$ python download_images.py -q "search query" -o <dest_dir> -m <result_size> -g <group_by> -k <bing_API_KEY>
from requests import exceptions
import argparse
import requests
import cv2
import os

ap = argparse.ArgumentParser()
ap.add_argument("-q", "--query", required=True,
    help="search query to search Bing Image API for")
ap.add_argument("-o", "--output", required=True,
    help="path to output directory of images")
ap.add_argument("-m", "--max", required=True,
    help="maximum number of results")
ap.add_argument("-g", "--group", required=True,
    help="group number of results")
ap.add_argument("-k", "--apikey", required=True,
    help="bing api key")
args = vars(ap.parse_args())

API_KEY = args["apikey"]
MAX_RESULTS = int(args["max"])
GROUP_SIZE = int(args["group"])
URL = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"

# network/file errors worth skipping rather than crashing on
EXCEPTIONS = set([IOError, FileNotFoundError,
    exceptions.RequestException, exceptions.HTTPError,
    exceptions.ConnectionError, exceptions.Timeout])

term = args["query"]
headers = {"Ocp-Apim-Subscription-Key": API_KEY}
params = {"q": term, "offset": 0, "count": GROUP_SIZE}

print("[INFO] searching Bing API for '{}'".format(term))
search = requests.get(URL, headers=headers, params=params)
search.raise_for_status()
results = search.json()
estNumResults = min(results["totalEstimatedMatches"], MAX_RESULTS)
print("[INFO] {} total results for '{}'".format(estNumResults, term))

total = 0
# page through the results GROUP_SIZE at a time
for offset in range(0, estNumResults, GROUP_SIZE):
    print("[INFO] making request for group {}-{} of {}...".format(
        offset, offset + GROUP_SIZE, estNumResults))
    params["offset"] = offset
    search = requests.get(URL, headers=headers, params=params)
    search.raise_for_status()
    results = search.json()
    print("[INFO] saving images for group {}-{} of {}...".format(
        offset, offset + GROUP_SIZE, estNumResults))
    for v in results["value"]:
        try:
            print("[INFO] fetching: {}".format(v["contentUrl"]))
            r = requests.get(v["contentUrl"], timeout=30)
            # derive the file extension from the URL
            ext = v["contentUrl"][v["contentUrl"].rfind("."):]
            p = os.path.sep.join([args["output"], "{}{}".format(
                str(total).zfill(8), ext)])
            f = open(p, "wb")
            f.write(r.content)
            f.close()
        except Exception as e:
            if type(e) in EXCEPTIONS:
                print("[INFO] skipping: {}".format(v["contentUrl"]))
                continue
        # drop any file that OpenCV cannot decode as an image
        image = cv2.imread(p)
        if image is None:
            print("[INFO] deleting: {}".format(p))
            os.remove(p)
            continue
        total += 1