Hey Fast AI fam!
Newbie deep learner here (pun not intended and possibly inaccurate). This is my first post on the forum.
I’m really struggling with collecting my own image dataset from Google Images and would love your help.
Steps I took:
- Searched for images in google image search
- Scrolled all the way to the bottom of the page
- Opened up the developer console
- Added the following bit of JS
urls=Array.from(document.querySelectorAll('.rg_i')).map(el=> el.hasAttribute('data-src')?el.getAttribute('data-src'):el.getAttribute('data-iurl'));window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
- A file downloaded
However… (drum roll)
All the URLs in the file start with encrypted-tbn0 etc.
I’ve tried to look up what this is. A Google forum has this explanation: https://support.google.com/webmasters/thread/5941325?hl=en
I’ve tried using an incognito browser too!
Is there something I’m doing wrong? Has anyone else encountered this problem and found a workaround?
Any help would be much appreciated!
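One quick way to see what the downloaded file actually contains is to tally the hostnames of the URLs in it. Here is a minimal Python sketch, assuming the file has one URL per line. `count_hosts` is a hypothetical helper name for illustration, not part of fastai.

```python
from collections import Counter
from urllib.parse import urlparse


def count_hosts(urls):
    """Tally the hostname of each non-empty URL (hypothetical helper)."""
    # Skip blank lines, then parse out the host part of every remaining URL.
    hosts = (urlparse(u.strip()).hostname for u in urls if u.strip())
    # Entries without a hostname (for example data: URIs) are ignored.
    return Counter(h for h in hosts if h)
```

Calling `count_hosts(open(path).read().splitlines())` on the downloaded file (with `path` pointing at wherever it was saved) should show whether every entry comes from the same encrypted-tbn0 host or whether some full-size image links are mixed in.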
Hi, if downloading still does not work, you can try this one: Problems fetching urls from Google Images
Here’s the issue I’m running into when I use the file with the https://encrypted-tbn0… urls
Sorry, all of this is a bit new to me. The last time I ran the notebook I had no issues. The only big difference was that the URLs were not encrypted.
** Update **
I waited and checked the folder and the images downloaded. I’m not clear why the error appears. An academic pursuit for when I’m a bit more experienced!
I want to point out that you usually can’t use images downloaded from Google Images because the copyright holder needs to give consent.
It’s probably fine if you’re just doing this for a personal study project, but you need to be careful about publishing a model that you trained on images that you do not hold the copyright to or do not have consent to use. (This is why large datasets such as ImageNet use only images from Flickr that have a Creative Commons license, and even that is not entirely uncontroversial.)
Note: I am not a lawyer. And I don’t think there is any legal consensus (yet) that a model trained on copyrighted images is a derivative work and therefore infringes on the copyright. But it’s smarter (and nicer) to only use images you have permission to use.
Do you think it’s an issue if it’s being taught on the course?
I found this court ruling on web scraping and its legality. http://cdn.ca9.uscourts.gov/datastore/opinions/2019/09/09/17-16783.pdf
I’m obviously paraphrasing here
It’s because you have some lines with empty strings in your CSV file. The error is shown only for those empty URLs; it doesn’t prevent images with valid URLs from downloading. I fixed this issue by filtering out empty-string URLs in a given file with this PR to the fastai repo.
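For anyone who wants to clean the file by hand before downloading, here is a minimal Python sketch of the same idea. It is an illustration, not the code from the PR. It assumes a file with one URL per line, and the names `read_nonempty_urls` and `write_clean_url_file` are hypothetical.

```python
from pathlib import Path


def read_nonempty_urls(path):
    """Return the URLs from a one-URL-per-line file, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    # Lines that are empty or contain only whitespace are dropped.
    return [line.strip() for line in text.splitlines() if line.strip()]


def write_clean_url_file(src, dst):
    """Write the non-empty URLs from src to dst and return how many were kept."""
    urls = read_nonempty_urls(src)
    # No trailing newline, so the cleaned file does not end with an empty entry.
    Path(dst).write_text("\n".join(urls), encoding="utf-8")
    return len(urls)
```

You would call something like `write_clean_url_file("urls.csv", "urls_clean.csv")` (both file names are placeholders) and then point the download step at the cleaned file.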
Thanks for clarifying @ozgur
I created this Google Colab notebook to make downloading from Google Images easier. Create your CSV file of links, drag and drop it into the folder panel on the left, and then run the notebook.