Problems fetching urls from Google Images

abakke · February 6, 2020, 10:00am

Hi everyone!

It seems that Google has changed the HTML of the Google Images page! As a result, the code presented in lesson 2 (bear classifier) for retrieving image URLs from Google Images no longer works… I got around this problem with a slight modification of the JavaScript code:

function imurl(el) { let a = el.getAttribute("data-iurl"); if (a == null) { return el.getAttribute("src"); } else { return a; } };
urls = Array.from(document.querySelectorAll('.rg_i')).map(el=>(imurl(el))); window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));

Also has anyone else noticed some problems with the ImageCleaner? It seems like it’s not writing all the file-paths it should to the csv file (deleting images that should not be deleted).

JonathanSum · February 6, 2020, 11:05pm

I just used an image downloader from the Chrome store. However, I still hope someone will come along and post updated code. ImageCleaner does not work on Colab.

manohar.fast.ai · February 7, 2020, 1:25am

I have the same issue, but I got the encrypted URL links below:

https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcSu1nG35CdrxHAlayOIIvF5z4iAM9Uifc248DTEu-LA3YaejyMl
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcR6YynbqCENsdodWNDsx60HHTpoqNKIHK9P5k1WTDulix-qwG8v
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRvGf09u6efj7Eeb0nXpJ22zPw2z2lqAC_1tjiqgJ46yB0zNGmF

Could someone please help me find a better solution?

abakke · February 7, 2020, 7:32am

You can still download the images with the download_images function (https://docs.fast.ai/vision.data.html#download_images). The solution worked fine for me.

alexxcollins21 · February 7, 2020, 4:23pm

I use Chrome browser with Ubuntu OS.

I’ve been having problems here, and unfortunately @abakke’s JavaScript gave me a syntax error: Invalid or unexpected token.

I used the following code:

urls=Array.from(document.querySelectorAll('.rg_i')).map(el=> el.hasAttribute('data-src')?el.getAttribute('data-src'):el.getAttribute('data-iurl'));
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\\n')));

EDIT: the code above had an error: the newline '\n' was escaped (written as '\\n'), so the join was inserting a literal backslash followed by "n" instead of creating a new line. The correct code is below:

urls=Array.from(document.querySelectorAll('.rg_i')).map(el=> el.hasAttribute('data-src')?el.getAttribute('data-src'):el.getAttribute('data-iurl'));
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
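The difference between the two versions comes down to string escaping. The same rule holds in Python, so here is a minimal sketch of it there (the example URLs are made up for illustration):

```python
# Hypothetical URLs, used only for illustration.
urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]

# "\\n" is a backslash followed by the letter n: two characters, no line break.
wrong = "\\n".join(urls)

# "\n" is a single newline character, so each URL lands on its own line.
right = "\n".join(urls)

print(len(wrong.splitlines()))  # 1
print(len(right.splitlines()))  # 2
```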

I’ve left my original problems below, just in case anyone finds sed useful. But it’s now solved.

I then have two problems: Chrome is blocking the pop up window for saving so I can’t specify a file name. It just downloads the file as “download” into my Downloads folder.

Secondly the “download” file has literal '\n’s in it - not new lines. So it isn’t read as a CSV with many lines - just as one really long line.

I use Linux and got around this with the following:

sed 's/\\n/\n/g' download > newfilename.txt

This turned my one line download file into a newfilename.txt file with a new line for each url. There is good documentation for sed if you type info sed into the terminal.
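For anyone without sed to hand, the same repair can be sketched in Python. The helper name `fix_newlines` is hypothetical, and the temporary files below only stand in for the "download" and "newfilename.txt" files mentioned above:

```python
import tempfile
from pathlib import Path


def fix_newlines(src, dst):
    """Replace each literal backslash-n pair in src with a real newline, writing dst."""
    text = Path(src).read_text()
    Path(dst).write_text(text.replace("\\n", "\n"))


# Small demonstration on a throwaway file containing made-up URLs.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "download"
    dst = Path(tmp) / "newfilename.txt"
    src.write_text("https://example.com/a.jpg\\nhttps://example.com/b.jpg")
    fix_newlines(src, dst)
    print(dst.read_text().splitlines())
```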

This is a bit clunky, but by coming up with an alias I can at least rename and reformat the URL file with one short line in my terminal.

If anyone else has a more elegant solution I’d love to hear it!

abakke · February 7, 2020, 8:56pm

ohhh, the quotation marks got messed up! Thanks for telling.

I then have two problems: Chrome is blocking the pop up window for saving so I can’t specify a file name. It just downloads the file as “download” into my Downloads folder.

Hmm, it could be that you just have to rename the file with e.g. .txt at the end, so that the system knows which file type it is.

I don’t believe it’s necessary to do the ‘\n’ fix you propose. This is because the image_downloader function splits the links on the ‘\n’ anyway, so it doesn’t matter that it’s not shown as a newline.

alexxcollins21 · February 7, 2020, 9:16pm

I still can’t get your JavaScript to work in my console. BUT - I’ve figured out what was wrong in my JavaScript code. I had '\\n', which escaped the backslash, so the entire file ended up on one line. (I did try adding .txt and .csv extensions to the file name originally.) Hey-ho - at least I’ve learnt about sed and got a bit better at noticing/using the escape character!

abakke · February 7, 2020, 9:29pm

It worked in mine when I changed all the quotation marks. I wanted to edit your code into my first post, but I can’t figure out how… the edit button seems to have disappeared.

Hey-ho - at least I’ve learnt about sed and got a bit better at noticing/using the escape character!

That’s the spirit! Every day we get a little bit better!

manohar.fast.ai · February 8, 2020, 3:37am

Hi Everyone,

Here are screenshots of what I have done.

The problem is that this is what I got from the downloaded CSV file:
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQCSqUKhuFf0jtfMzQhSwUKdHfggYEa8wl3oI3t7aaUp3Qp6I1E

All the links look like this: https://encrypted-tbn0.gstatic.com/**

So I get encrypted links whenever I search Google for images.
Please help me as soon as possible.

abakke · February 8, 2020, 8:16am

If you click on the link you sent, is the image showing? Because it does for me. If the image is showing, the image_downloader function should have no problem retrieving the images for you.
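To see what one of these links actually contains, it can be taken apart with Python's standard library. This is only a sketch using a URL posted earlier in the thread; it shows that the link is an ordinary HTTPS URL whose query string is percent-encoded (`%3A` is just `:`), with "encrypted" being part of the host name:

```python
from urllib.parse import parse_qs, urlsplit

# One of the URLs posted above.
url = (
    "https://encrypted-tbn0.gstatic.com/images"
    "?q=tbn%3AANd9GcSu1nG35CdrxHAlayOIIvF5z4iAM9Uifc248DTEu-LA3YaejyMl"
)

parts = urlsplit(url)
print(parts.scheme)  # https
print(parts.netloc)  # encrypted-tbn0.gstatic.com

# parse_qs decodes the percent-encoding in the query string.
print(parse_qs(parts.query)["q"][0])  # tbn:ANd9GcSu1nG35CdrxHAlayOIIvF5z4iAM9Uifc248DTEu-LA3YaejyMl
```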

manohar.fast.ai · February 8, 2020, 10:18am

Hi abakke,

For the last two days I was looking into an alternative method.
Method 2:
I finally found https://serpapi.com/

But it was a headache because it returns a JSON object; in the end I wrote a Python script to produce a url_file.csv.

Edit:
Thanks for getting my attention.
Sorry buddy! There was a big mistake on my side: the downloaded CSV file contains whitespace.
So the existing method (the one taught in fast.ai) works fine.

Gordon · February 13, 2020, 6:23am

I think you mean download_images, right? It’s a factory method in fastai. With the encrypted links, I’m getting this sort of error.

abakke · February 13, 2020, 10:12am

Yes, you are right, that is what I meant. It’s hard for me to tell why it fails… Could you provide a little more information, like what the URLs you’re feeding it look like?

Gordon · February 13, 2020, 10:37am

https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcT4vcf9ev4ozewwikw6Qn8iJ-xFqj4j38kx1hX3Hg7SqhLW9uja They look like that. I’m running on Kaggle kernels. I’ll show my code when I get on a computer. Thank you for the reply.

Gordon · February 13, 2020, 12:33pm

Alright I realised I could just show you my kernel. Here it is in its entirety. https://www.kaggle.com/bongbonglemon/adult-vs-teenager-fastai

abakke · February 13, 2020, 12:43pm

Hmm, I’m not sure what’s wrong. Could you upload your URL data files, so that I can test locally?

Gordon · February 13, 2020, 3:56pm

https://www.kaggle.com/bongbonglemon/adult-vs-teenager Here it is on Kaggle. Thank you. @abakke

ozgur · February 13, 2020, 5:57pm

I checked the problem, and it originates from the line below. It seems the download_images method reads the CSV file and splits it on “\n” to go through it line by line, but does not filter empty strings out of the resulting list.

urls = open(urls).read().strip().split("\n")[:max_pics]
Changing this to filtering as below will fix this issue.
urls = list(filter(None, open(urls).read().strip().split("\n")))[:max_pics]

I will create a PR from my fork, where I have already fixed it.
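The effect of the filtering can be tried on a small string without touching fastai. This is a sketch only: `read_urls` is a hypothetical helper rather than the library's code, the default of 1000 is an arbitrary choice here, and unlike the one-line fix above it also strips whitespace around each entry, which would cover the stray-whitespace case mentioned earlier in the thread:

```python
def read_urls(text, max_pics=1000):
    """Split text on newlines, strip whitespace, drop blank entries, keep at most max_pics."""
    lines = (line.strip() for line in text.strip().split("\n"))
    return [line for line in lines if line][:max_pics]


# Made-up file contents with an empty line, a whitespace-only line and a trailing space.
sample = "https://example.com/a.jpg\n\n  \nhttps://example.com/b.jpg \n"
print(read_urls(sample))  # ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```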

abakke · February 13, 2020, 6:08pm

The problem is that there are some empty lines in your CSV files, every 100 lines or so. The images are getting downloaded, though.

idraja · February 17, 2020, 9:57pm

I just ran into this problem and my solution uses the below javascript in the Chrome console:

urls = Array.from(document.querySelectorAll('.tx8vtf')).map(el=>el.getAttribute('src')).filter(el=>el).filter(el=>el.includes('http'))
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));

After getting the urls in csv I strip out the commas and use wget from the command line like so

wget -i urls

to download all the urls. The download happens in serial so you possibly are better off using the library function which AFAIK runs in parallel.