Tips for building large image datasets

Would love to learn different methods people have used to create their own large image training datasets. I’ll share mine below:

1) Using google-images-download

$ pip install google_images_download

Install Chrome and ChromeDriver so that images can be downloaded from the command line.

I installed these on my virtual machine by navigating to the respective download pages in my laptop’s Chrome browser, then pasting the wget command generated by Chrome’s CurlWget extension into the VM’s terminal.

Now I can download. The following gets me 500 medium-sized images of baseball games:

$ googleimagesdownload -k "baseball game" -s medium -l 500 -o fastai/courses/dl1/data/baseballcricket -i train/baseball -cd ~/chromedriver

Experimentally, requesting 500 images worked fine and requesting 4000 cut me off at 450. So to get thousands of images I run the command a few times, changing the date range for each request:

$ googleimagesdownload -k "baseball game" -s medium -wr '{"time_min":"09/01/2018","time_max":"09/30/2018"}' -l 500 -o fastai/courses/dl1/data/baseballcricket -i train/baseball -cd ~/chromedriver
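To make that less tedious, a small loop over monthly ranges works too. Here’s a sketch using the package’s Python API (the argument names mirror the CLI flags, but may differ slightly between versions):

```
import os
from google_images_download import google_images_download

# (time_min, time_max) pairs to sweep over -- adjust to taste
date_ranges = [
    ("07/01/2018", "07/31/2018"),
    ("08/01/2018", "08/31/2018"),
    ("09/01/2018", "09/30/2018"),
]

downloader = google_images_download.googleimagesdownload()
for time_min, time_max in date_ranges:
    downloader.download({
        "keywords": "baseball game",
        "size": "medium",
        "limit": 500,
        "output_directory": "fastai/courses/dl1/data/baseballcricket",
        "image_directory": "train/baseball",
        "chromedriver": os.path.expanduser("~/chromedriver"),
        "time_range": f'{{"time_min":"{time_min}","time_max":"{time_max}"}}',
    })
```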

Notes: I run this on a FastAI Ubuntu 16.04 machine hosted by Paperspace, so this method works without a GUI browser.

2) Following the tutorial on PyImageSearch using a paid Bing Image API account.

I found this to be incredibly easy and could download thousands of images at once.

3) Using sentdex’s script for downloading from ImageNet URLs

4) There’s a great thread in Part I using the package ai-utilities developed by prairieguy for fastai, and I’d love to hear about people’s experiences using that, or anything else you’ve found helpful. I’m mostly interested in tips for building large datasets in the range of tens of thousands of images.

172 Likes

Great idea for a topic! :slight_smile: And thanks for the useful links. Can you describe some datasets you’ve created with this?

3 Likes

Sure. I used the ImageNet downloader to get about 1,500 images each of coho and chinook salmon to train a classifier between them. ImageNet has really good, accurately labeled, specific groupings like that, but also a limit on the number of images in a single synset. 1,500 isn’t astoundingly high, so it would be useful to supplement those datasets with the Bing and Google image download tools.

I used the Bing Image API to download photos of Mary-Kate and Ashley, the Olsen twins, and built an excellent classifier between the two of them. I was able to get many thousands of photos for each, but then needed to become an expert myself in telling them apart so I could weed out mislabeled photos, which definitely introduced some bias.

But google_images_download has been the best so far. I’m just starting out with it and have played around with the baseball/cricket example. Searching by date range reduces duplicate photos. It was great to download 100 images per class and look at model accuracy, then download 400 more, then 500 more, then 1000 more, and get a feel for how dataset size affects my results.

35 Likes

I’ve also been able to get a ton of data from NASA’s Aqua/Terra satellites directly from their AppEEARS API, which has very good documentation. I’d recommend it for anybody interested in geospatial data. It’s very easy to download a time series of the same location.

16 Likes

This may be the coolest deep learning project ever :smiley:

20 Likes

Searching by date is a great idea for google-images-download! Duplicates were a problem when scaling beyond a few hundred images. That, and looking out for mislabeled data.
I did get a bit paranoid and spent some time looking into downloading only copyright-free images, but I’m not sure how this would scale…
Amazon Mechanical Turk is probably overkill; just putting it out there.

2 Likes

I was trying to use the Bing API. It seems like a good solution, though I didn’t compare it with the Google approach. My snippet was something like this, and I was able to retrieve several hundred images (with duplicates). I am thinking of building a facial-emotions dataset, similar to one of the datasets from Kaggle (I can’t remember its name).
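For reference, the snippet was roughly along these lines (a sketch against the Bing Image Search v7 REST endpoint; the key and query here are placeholders):

```
import requests

SUBSCRIPTION_KEY = "YOUR_BING_API_KEY"  # placeholder
SEARCH_URL = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"

def search_images(query, total=500, page_size=50):
    """Page through Bing image results and return a list of image URLs."""
    headers = {"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY}
    urls = []
    for offset in range(0, total, page_size):
        params = {"q": query, "count": page_size, "offset": offset}
        resp = requests.get(SEARCH_URL, headers=headers, params=params)
        resp.raise_for_status()
        urls += [item["contentUrl"] for item in resp.json().get("value", [])]
    return urls

urls = search_images("smiling human face photo")
```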

The main advantage of Bing, from my point of view, is that it provides a programmatic API, while google-images-download seems to rely on Selenium and browser automation. Though perhaps Google allows using its image search engine via an API as well?

The only open question is about possible copyright issues. Can I just gather a bunch of images from Bing and build a publicly available dataset, or do I need to make sure I use royalty-free images only? Is it enough to simply state that the images were collected via the Bing API?

1 Like

To extend your list: I have successfully scraped DuckDuckGo for images. I have copied my script to a gist, but it is not pretty; it was quickly thrown together and never finished. Right now you can specify search terms, the number of images, and the image format you would like (e.g. jpg).
This could easily be extended to filter by picture size (the info is already there). The script also creates a CSV file with all titles, URLs, and sizes that were downloaded, in case you want to check something later. As I said, not finished, but maybe helpful for someone:
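The gist itself isn’t reproduced here, but the rough idea is: fetch a per-query token from the search page, then page through the JSON image endpoint. A sketch (endpoint names and parameters are from memory and may have changed):

```
import csv
import re
import requests

def ddg_image_results(query, max_results=200):
    """Sketch: grab DuckDuckGo's per-query 'vqd' token, then page through i.js results."""
    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0"
    page = session.get("https://duckduckgo.com/", params={"q": query})
    vqd = re.search(r"vqd=['\"]?([\d-]+)", page.text).group(1)
    results = []
    params = {"q": query, "o": "json", "vqd": vqd}
    while len(results) < max_results:
        resp = session.get("https://duckduckgo.com/i.js", params=params)
        data = resp.json()
        new = data.get("results", [])
        if not new:
            break
        results += new
        if "next" not in data:
            break
        params["s"] = str(len(results))  # offset for the next page
    return results[:max_results]

# Save title, URL and size to a CSV for later checking
rows = ddg_image_results("baseball game", max_results=100)
with open("ddg_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url", "width", "height"])
    for r in rows:
        writer.writerow([r.get("title"), r.get("image"), r.get("width"), r.get("height")])
```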

15 Likes

It is probably worth attempting to write some kind of image-collecting package :smile:

8 Likes

Nice projects, and thanks for the details. I’ve been thinking about building a dataset for artist/style classification, in the hope that it will produce more interesting embeddings for style transfer from those pretrained weights. I’ll give your method a try. You mentioned that searching by date range reduces duplicate photos. The google_images_download CLI also supports site-specific searches, so I’ll probably just target Wikimedia. Thanks for the great suggestions here.

I’ve implemented a package that wraps google-images-download with the additional functionality of sanity-checking the images (making sure they can be opened and have three channels) and finally organising the files into separate train/validation/test folders: https://github.com/svenski/duckgoose

The name comes from what I tried to classify instead of dogs and cats in the previous version of the course.

31 Likes

This is a very helpful post! Thank you for putting it together!

2 Likes

Thanks for sharing these! I’ve been using google_images_download and intermittently found that a few images were corrupted/unreadable (usually about 1/20th of the total). I had to delete such images manually. Is there any way to delete them automatically? Any other suggestions?
Edit: I found the solution below (thanks @kai), but am looking for something shorter.

```
import os
import PIL.Image

def check_images(PATH):
    """Return paths of images that PIL cannot open/verify."""
    broken_images = []
    for pic_class in os.listdir(PATH):
        for pic in os.listdir(f'{PATH}/{pic_class}'):
            try:
                img = PIL.Image.open(f'{PATH}/{pic_class}/{pic}')
                img.verify()
            except (IOError, SyntaxError):
                print('Bad file:', f'{PATH}/{pic_class}/{pic}')
                broken_images.append(f'{PATH}/{pic_class}/{pic}')
    return broken_images

img_to_del = check_images(f'{PATH}train')
[os.remove(pic) for pic in img_to_del]
```
7 Likes

I don’t know about shorter, but I found it useful to also verify that there are three channels, i.e. RGB; some of the images were black & white.

This is from the duckgoose package:

```
from PIL import Image
ii = Image.open(ff)  # ff is the path to an image file
number_of_channels = len(ii.getbands())
```
3 Likes

I think there is certainly the potential for copyright issues if you are making the dataset available publicly, even with royalty-free images, depending on the exact licence. Some Creative Commons licenses, for example, don’t allow derivatives, so taking a 299x299 crop of a larger image could be considered a derivative. Most of the answers in the Bing copyright FAQ start with “it depends…”. Aiming to use only certain images based on their license could even introduce bias.

1 Like

A good solution, which ImageNet used to use, is to distribute the URLs of the images. You can also distribute a download script along with the list. That way people can fetch the images themselves.
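Such a download script can stay tiny. A minimal sketch (the urls.txt filename and output folder are just examples):

```
import os
import requests

# One image URL per line, e.g. the list distributed with the dataset
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

os.makedirs("images", exist_ok=True)
for i, url in enumerate(urls):
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        with open(f"images/{i:06d}.jpg", "wb") as out:
            out.write(resp.content)
    except requests.RequestException as e:
        print(f"Skipping {url}: {e}")
```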

5 Likes

Is anyone aware of a method/script/technique to bulk-download images from a Facebook page?

@devfortu
I have used the Bing API multiple times for building custom datasets. Yes, you can gather a bunch of images from the Bing API and build your own dataset.

There is a very good blog post written by Dr. Adrian Rosebrock on building a deep learning image dataset using the Bing API.

9 Likes

Yes, I am experimenting with the Bing API right now; I can’t say the results are too promising in terms of how well the retrieved images match the query. However, I am going to give it a try and collect some data.

OK, got it. Agreed, URLs sound good. Actually, I was thinking about a similar approach when I scraped text data from a platform that doesn’t allow its content to be used except for personal/educational purposes. So I can just share the scraping script without the content itself.

Here is my attempt to build a simple wrapper on top of the Bing API:

The goal is to make a simple CLI tool/library for image search queries like:

python -m imds download smiling human face photo | dogs b/w pictures | cats

where each request is separated by a pipe symbol, or to read prepared queries from a file.
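Parsing the pipe-separated requests is simple enough. A hypothetical sketch (imds itself isn’t shown; download_query stands in for the actual Bing call, and the request string is assumed to be quoted so the shell doesn’t interpret the pipes):

```
import sys

def parse_queries(argv):
    """Split 'smiling human face photo | dogs b/w pictures | cats' into separate queries."""
    raw = " ".join(argv)
    return [q.strip() for q in raw.split("|") if q.strip()]

if __name__ == "__main__":
    for query in parse_queries(sys.argv[1:]):
        print(f"Would download images for: {query!r}")
        # download_query(query)  # hypothetical hook into the Bing wrapper
```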

1 Like

So, I have a bunch of images in Google Drive. What’s an easy way to move them to my AWS notebook instance?
Thanks