Creating Image Datasets for Vision Learning

I’m starting this thread for Part 1 (2020) to discuss approaches for creating image datasets for vision learning. (There are other threads for earlier classes.)

I’ll start with https://github.com/prairie-guy/ai_utilities, a GitHub repository I wrote containing several Python functions useful for fastai.

image_download() uses your choice of search engines to download a specified number of images. The default search engine is Bing. Each search engine sets its own maximum number of downloads. (I’m working on increasing this limit by searching over multiple date ranges.)

Flickr requires an API key, but one is easy to obtain for non-commercial use. With it, the download limit is larger than with the other search engines.
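For a quick sense of the calls involved, downloads look something like this (a sketch only; check the repo README for the exact signature, since the engine and apikey keyword names here are as I recall them, not verified):

from ai_utilities import image_download

my_apikey = 'your-flickr-api-key'  # placeholder, not a real key
image_download('cat', 100)                                      # default engine (bing)
image_download('cat', 500, engine='flickr', apikey=my_apikey)   # hypothetical kwargs; larger limit via Flickr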

Searching with Google is currently not working due to a bug in the upstream package icrawler. (I have a fix and have issued a pull request. If anyone wants the fix, let me know.)

Installation

git clone git@github.com:prairie-guy/ai_utilities.git
pip install icrawler
pip install python-magic

Within your Python code, you will need to include the following lines to access ai_utilities (sorry, no easy install with pip). You will need to indicate the parent directory of ai_utilities, something like /home/prairieguy/:

import sys
sys.path.append('your-parent-directory-of-ai_utilities')
from ai_utilities import *
from pathlib import Path
from fastai.vision.all import *

Usage

Here is sample Python code which does the following: downloads up to 100 images of each animal, checks that each image file is a valid jpeg, removes duplicates, saves everything to the directory dataset, and creates data = ImageDataLoaders.from_folder(...). Optionally, it creates an imagenet-style directory structure.

import sys
sys.path.append('your-parent-directory-of-ai_utilities')
from ai_utilities import *
from pathlib import Path
from fastai.vision.all import *

# Download up to 100 validated, de-duplicated images per class into ./dataset
for p in ['dog', 'goat', 'sheep']:
    image_download(p, 100)

path = Path.cwd()/'dataset'
data = ImageDataLoaders.from_folder(path, valid_pct=0.2, item_tfms=Resize(224))

# Optionally, create an imagenet-style train/valid directory structure.
make_train_valid(path)
data = ImageDataLoaders.from_folder(path, train='train', valid='valid', item_tfms=Resize(224))
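From there, the usual fastai calls apply to data; for example (not from the original post, just the standard next steps):

data.show_batch(max_n=9)                                   # eyeball a sample of the downloads
learn = cnn_learner(data, resnet34, metrics=error_rate)    # train a quick baseline
learn.fine_tune(1)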

How do I use this in a Colab notebook?

  • pip install icrawler
  • pip install python-magic or pip install python-magic-bin

I’ve tried all this code to import the library but it still gives this error:

ModuleNotFoundError: No module named 'ai_utilities'

Were you able to import icrawler and python-magic?

If so, then you will need to clone ai_utilities from GitHub:
git clone git@github.com:prairie-guy/ai_utilities.git

Then, within your code, you will need to include the following lines to access ai_utilities (sorry, I haven’t figured out how to make it installable with pip). You will need to indicate the parent directory, something like /home/NoBlueWithoutYellow/:

import sys
sys.path.append('your-parent-directory-of-ai_utilities')
from ai_utilities import *
from pathlib import Path
from fastai.vision.all import *

I’m using a Google Colab notebook.
Would I still need the parent directory then?

This is the error I’m getting when using the git clone.

Host key verification failed.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

I’ve never used Colab before. I just tried logging on. I was able to use pip to install the dependencies; however, I was not able to use git clone. Perhaps someone with more experience with Colab can provide an alternative.
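A likely fix (my assumption, not tested in the original thread): the git@github.com: address clones over SSH, which fails on Colab because no SSH key is set up there, hence the "Host key verification failed" error. Cloning the same public repository over HTTPS needs no key. In a Colab cell:

!git clone https://github.com/prairie-guy/ai_utilities.git

import sys
sys.path.append('/content')   # Colab clones into /content by default
from ai_utilities import *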

I’ll add my $0.02 to this thread as well since this issue of how to create datasets comes up so often.

You can open the notebook directly from here.


Hi Joe - Nice job on your notebook! It is nice to be able to use the DuckDuckGo search engine. In https://github.com/prairie-guy/ai_utilities, I had previously written a function image_download() that allows for the search and download of images from Google, Bing, Flickr, or Baidu. I have yet to add DuckDuckGo.

In the meantime, based upon your work and that of https://github.com/deepanprabhu, I created the function search_images_ddg(), which has the same API as search_images_bing(), found in fastbook/utils.py. It can easily be used in place of search_images_bing(). Just add the following code to a notebook. Its dependencies are either fastai.vision.all or fastbook. (I have added it to my repository as well, though it has not been documented.)

Disclaimer: I have not looked up the “Terms of Service for DuckDuckGo” to see if this is an acceptable use of their search engine or not.

import time                     # used below to back off between retries
from fastai.vision.all import *

def search_images_ddg(key, max_n=200):
    """Search for 'key' with DuckDuckGo and return up to 'max_n' unique image URLs."""
    url       = 'https://duckduckgo.com/'
    params    = {'q': key}
    res       = requests.post(url, data=params)
    # DuckDuckGo's image endpoint requires a 'vqd' token scraped from the search page
    searchObj = re.search(r'vqd=([\d-]+)\&', res.text)
    if not searchObj:
        print('Token Parsing Failed !')
        return
    requestUrl = url + 'i.js'
    headers    = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0'}
    params     = (('l','us-en'), ('o','json'), ('q',key), ('vqd',searchObj.group(1)),
                  ('f',',,,'), ('p','1'), ('v7exp','a'))
    urls       = []
    while True:
        try:
            res  = requests.get(requestUrl, headers=headers, params=params)
            data = json.loads(res.text)
            for obj in data['results']:
                urls.append(obj['image'])
                max_n = max_n - 1
                if max_n < 1: return L(set(urls))     # dedupe
            if 'next' not in data: return L(set(urls))
            requestUrl = url + data['next']           # follow pagination
        except:
            time.sleep(1)                             # back off briefly, then retry the page
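To go from URLs to an image dataset, the standard fastai helpers work; for example (my sketch, not from the original post):

urls = search_images_ddg('grizzly bear', max_n=50)
dest = Path('bears/grizzly')
dest.mkdir(parents=True, exist_ok=True)
download_images(dest, urls=urls)                 # fastai download helper
failed = verify_images(get_image_files(dest))    # flag corrupt/unreadable files
failed.map(Path.unlink)                          # and delete them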

I’ve been using your search_images_ddg function for a while now (thank you!).

As of today I’m getting a new error after running this code block:
urls = search_images_ddg('black bear', max_images=200)

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 urls = search_images_ddg('black bear', max_images=200)

/usr/local/lib/python3.6/dist-packages/fastbook/__init__.py in search_images_ddg(term, max_images)
     55     assert max_images<1000
     56     url = 'https://duckduckgo.com/'
---> 57     res = urlread(url,data={'q':term}).decode()
     58     searchObj = re.search(r'vqd=([\d-]+)&', res)
     59     assert searchObj

AttributeError: 'str' object has no attribute 'decode'

Do you have any idea what might be going on here?
Thanks!


I am facing the same error as Alex when calling search_images_ddg instead of search_images_bing in the Practical Deep Learning for Coders lesson.
It seems to be a compatibility issue with specific Python versions.

Alex - Thanks for pointing this out. I’m glad you have found this function helpful! The problem is not with my original code contributed to the fastbook GitHub repository. The issue is with the pip package version of fastbook, in which a broken version of search_images_ddg overwrites my code: as the traceback shows, that version calls urlread(url, data={'q': term}).decode(), but urlread returns a str, which has no decode method. I’ve opened an issue here, as I have no ability to edit the pip package.

Until it is fixed, you can simply copy and paste the original code into your notebook. I know it’s a hack, but it will restore the functionality until the issue is resolved.
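Concretely, the workaround is just a matter of definition order (a sketch only; paste in the full body from earlier in this thread):

from fastbook import *          # pulls in the broken pip version

def search_images_ddg(key, max_n=200):
    # ... paste the body of the working version posted above ...
    ...

# Name lookup now finds this definition, shadowing the pip one:
urls = search_images_ddg('black bear', max_n=200)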


Hi @prairieguy - The workaround you mentioned is working fine for me, many thanks! 🙂


Nice! Thanks for sharing.