Updated JS to scrape image urls

oddrationale · February 14, 2020, 5:59pm

I made a few updates to the JavaScript URL scraping code snippet:

// Google Images
var urls=Array.from(document.querySelectorAll(".rg_i")).map(el=>el.hasAttribute("data-src")?el.getAttribute("data-src"):el.getAttribute("data-iurl")).filter(l=>l!=null).join("\n");
var a = document.createElement("a");a.download = "filename.txt";a.href = "data:text/csv;charset=utf-8,"+urls;a.click();

// DuckDuckGo Images (uses Bing)
var urls=Array.from(document.querySelectorAll(".tile--img__img")).map(el=>el.hasAttribute("data-src")?el.getAttribute("data-src"):el.getAttribute("src")).filter(l=>l!=null).map(l=>"https:"+l).join("\n");
var a = document.createElement("a");a.download = "filename.txt";a.href = "data:text/csv;charset=utf-8,"+urls;a.click();

// Bing Images
var urls=Array.from(document.querySelectorAll(".mimg")).map(el=>el.hasAttribute("src")?el.getAttribute("src"):null).filter(l=>l!=null&&l.startsWith("http")).join("\n");
var a = document.createElement("a");a.download = "filename.txt";a.href = "data:text/csv;charset=utf-8,"+urls;a.click();

This fixes a few issues in the code snippet from lesson2-download.ipynb:

the escape() function was escaping the newline characters
filters out the blank lines that were being added every 100 lines or so
you can now set the file name by editing filename.txt, instead of the file always being saved as download
uses an anchor tag instead of window.open(), which works in more browsers
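
The fixes listed above can also be gathered into a reusable helper. This is only a minimal sketch, not the code from the notebook: the function names collectUrls and downloadAsTextFile are hypothetical, and it swaps the data: URI for a Blob object URL (my assumption, not something the snippets above do) so the newline-joined list needs no escaping at all.

```javascript
// Hypothetical helper: for each element, return the value of the first
// attribute in `attrs` that is present, and drop elements that have none.
function collectUrls(elements, attrs) {
  return Array.from(elements)
    .map(el => {
      const name = attrs.find(a => el.hasAttribute(a));
      return name ? el.getAttribute(name) : null;
    })
    .filter(u => u != null && u !== "");
}

// Hypothetical helper: save the URLs, one per line, under `filename`.
// Assumption of this sketch: a Blob object URL replaces the data: URI.
function downloadAsTextFile(urls, filename) {
  const blob = new Blob([urls.join("\n")], { type: "text/plain" });
  const href = URL.createObjectURL(blob);
  const a = document.createElement("a");
  a.href = href;
  a.download = filename;
  document.body.appendChild(a);
  a.click();
  a.remove();
  // Release the object URL a little later so the download can start first.
  setTimeout(() => URL.revokeObjectURL(href), 1000);
}

// Runs only where a DOM exists, e.g. pasted into the browser console
// on a Google Images results page (selector taken from the snippet above).
if (typeof document !== "undefined") {
  const urls = collectUrls(
    document.querySelectorAll(".rg_i"),
    ["data-src", "data-iurl"]
  );
  downloadAsTextFile(urls, "filename.txt");
}
```

For Bing or DuckDuckGo, the same two helpers should work with the selector and attribute names from the corresponding snippet above; the DuckDuckGo URLs would still need the "https:" prefix added before saving.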

Hope you find this useful!

Fast.MT · July 19, 2020, 1:59pm

Thanks a lot! This was very helpful.

avkornaev · July 16, 2021, 12:59pm

Wow, it works! Thank you very much!