Tips for building large image datasets

Try ImageMagick:
https://imagemagick.org/script/mogrify.php

E.g.:

magick mogrify -resize 256x256 *.jpg

Resizes all the .jpg images in a folder so they fit within 256x256 (the aspect ratio is preserved).
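
If you would rather do the same thing from Python (say, inside a notebook), here is a rough Pillow equivalent, assuming you are happy overwriting the files in place just as mogrify does:

# Rough Python equivalent of `magick mogrify -resize 256x256 *.jpg`:
# shrink every .jpg in the current folder to fit within 256x256.
from pathlib import Path
from PIL import Image

for p in Path('.').glob('*.jpg'):
    img = Image.open(p)
    img.thumbnail((256, 256))   # resizes in place, preserving aspect ratio
    img.save(p)                 # overwrite the original, like mogrify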

6 Likes

If you want to download more than 100 images from Google Images, you’re going to have to install Selenium, ChromeDriver, and Chrome. This series of steps worked for me: https://gist.github.com/ziadoz/3e8ab7e944d02fe872c3454d17af31a5
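
The gist covers the actual setup; purely to show the shape of the approach (this is not the gist’s code), here is a stripped-down Selenium sketch. The search URL, scroll count and sleep are illustrative, Google’s page markup changes frequently, and most of the collected src attributes will only be thumbnails:

# Stripped-down sketch: open Google Images in a real browser, scroll so more
# results lazy-load, then collect whatever image URLs are on the page.
import time
from selenium import webdriver

driver = webdriver.Chrome()   # assumes chromedriver is installed and on PATH
driver.get("https://www.google.com/search?q=sports+car&tbm=isch")

for _ in range(10):           # each scroll loads another batch of results
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

urls = [img.get_attribute("src")
        for img in driver.find_elements_by_tag_name("img")
        if img.get_attribute("src")]
print(f"collected {len(urls)} candidate URLs")   # many are just thumbnails
driver.quit()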

I am having trouble with google_images_download. It mostly works, but my dataset keeps including corrupted images that need to be removed manually. This is the command I am using:

googleimagesdownload -k "sports car" -s medium -f png -l 500 -o ~/storage/cars -i sports --chromedriver /home/paperspace/anaconda3/envs/fastai/bin/chromedriver

In jupyter I run:

data = ImageDataBunch.from_folder(path, ds_tfms=get_transforms(),
                                  valid_pct=0.25, size=224, bs=bs).normalize(imagenet_stats)

The error message I get looks like this:

/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/fastai/basic_data.py:226: UserWarning: There seems to be something wrong with your dataset, can’t access these elements in self.train_ds: 1010,934
warn(warn_msg)

I can go through the images one by one; a few will not open, and I can then remove them. Once I have gone through the whole dataset, everything works fine.

Has anyone seen this / has any idea how to automatically remove these corrupted images?

for c in classes:
    print(c)
    verify_images(path/c, delete=True, max_size=500)  # delete unopenable images; shrink any larger than 500px

as per lesson2-download.ipynb :smile:

HTH

2 Likes

thanks!

You are welcome - happy to help :smile:

I’m still having trouble getting the JavaScript in the Jupyter Notebook for lesson 2 working consistently.

But here is one quick and dirty way to get even more images if the standard results from Google Images aren’t enough:

Redo your search again in different languages.

Try translating what you’re searching for into some of the more popular languages on the Internet (Spanish, Chinese, Russian, French, German, Italian, Japanese, etc.) and combining the results.
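
One way to combine them, sketched with google_images_download (the translations, limit and argument names here are only illustrative, so check the library’s docs):

# Download results for the same concept searched in several languages
# into a single folder.
from google_images_download import google_images_download

searches = ["sports car",        # English
            "coche deportivo",   # Spanish
            "voiture de sport",  # French
            "Sportwagen"]        # German

downloader = google_images_download.googleimagesdownload()
for term in searches:
    downloader.download({"keywords": term,
                         "limit": 100,
                         "output_directory": "dataset",
                         "image_directory": "sports_car"})  # one combined folder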

1 Like

I was having trouble using the script recommended by Jeremy because I am on Windows 10: if I open the downloaded URL files in Windows, my Colab notebook cannot read them anymore. I found a Chrome extension called Fatkun Batch Download, which I have used to download thousands of images. All you need to do is search for the images, open the extension, and download them to your hard disk. It can download from multiple tabs too. After downloading, you can upload them to Google Drive and run the data object creator.

1 Like

Hi Lindy.
First time learning anything about DL and starting out with fast.ai.
Can you tell me how to download a specific set of images, say balloons, from ImageNet?
And if I do not want my neural network to be trained on all the images from ImageNet, just ‘balloons’, can I do that?

I know the answer to the second question will be long. Any help will be appreciated :slight_smile:

1 Like

I have used Fatkun Batch Download with positive results as well. Then, for self-labelling images, I have used LabelImg: https://github.com/tzutalin/labelImg

@lindyrock how did you vary your dates? Did each command correspond to a month? Thanks!

Hello, I had some trouble using chromedriver / running headless Chrome on Google Colab! So I improvised on @lindyrock’s commands. I have created a script that can download thousands of images without using chromedriver / selenium. It downloads at most 100 images per request, for up to n requests, by changing the date range each time. You can look at the script over here. As @jeremy said, this method does not comply with Google’s terms of service. Please pardon my shell scripting :sweat_smile:
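
The rough idea, if you call google_images_download from Python instead of a shell script, looks like this (the time_range argument and its MM/DD/YYYY JSON format are as I remember the library’s docs, so double-check before relying on it):

# Request up to 100 images per month-long window, so the 100-image cap
# applies per window rather than to the whole search.
from google_images_download import google_images_download

downloader = google_images_download.googleimagesdownload()
months = [("01/01/2018", "01/31/2018"),
          ("02/01/2018", "02/28/2018"),
          ("03/01/2018", "03/31/2018")]

for time_min, time_max in months:
    downloader.download({
        "keywords": "sports car",
        "limit": 100,
        "time_range": '{"time_min":"%s","time_max":"%s"}' % (time_min, time_max),
        "output_directory": "dataset",
        "image_directory": "sports_car",   # everything lands in one folder
    })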

Hello, I had come across this problem too. You can check out this script. What @artuskg suggested is the cleaner solution and requires you to proceed in the lecture series; however, the above script may help you understand what errors we were getting and one way to avoid the problem. Basically, it deletes all the images that can’t be opened (i.e. are corrupted). Hope it helps!
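
The core of such a script is small. A minimal sketch with Pillow (the dataset path is a placeholder; fastai’s verify_images shown earlier does the same job more conveniently):

# Delete every file under dataset/ that Pillow cannot open as an image.
from pathlib import Path
from PIL import Image

for p in Path('dataset').rglob('*.*'):
    try:
        with Image.open(p) as img:
            img.verify()        # cheap integrity check; raises on corrupt files
    except Exception:
        print('removing', p)
        p.unlink()              # corrupted, or not an image at all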

As I understand it, after spending a lot of time searching, there is no way on Google Colab to get google-images-download to download more than 100 images natively without using an external script. For limits greater than 100 I always get this error in Colab:
"Looks like we cannot locate the path the 'chromedriver' (use the '--chromedriver' argument to specify the path to the executable.)"
I’m following this workflow https://colab.research.google.com/drive/1Mqi5FrhV_ZcmmSAdVD71cljn5i5z2NsA#scrollTo=YqJOQScMItMK
Could anyone point out how to work around this error so I can raise the download limit beyond 100 in Colab? Thanks

I wrote this Python module, ai_utilities, to simplify downloading images directly to a path for subsequent use by ImageList.from_folder(path).

ai_utilities

A set of scripts useful with fast.ai lectures and libraries.

image_download is the primary function. It provides easy download of images from google, bing and/or flickr (though the latter requires an apikey). It is intended to be imported and used within a Python script or Jupyter notebook. (This differs from previous versions, which were intended for use as a CLI script.)

This is a new version based on icrawler instead of selenium. It is much cleaner to install, use and extend. (It is itself an extension of work from fastclass.)

make_train_valid makes a train/valid directory structure, randomly copying files from labels_dir to sub-directories. It is largely obsolete due to the new capabilities provided directly within fastai.

Installation

  • Anaconda should be installed
  • With fastai installed, the dependencies are: icrawler and python-magic
  • conda install -c hellock icrawler
  • pip install python-magic

Example Usage

Download up to 500 images of each class, check that each file is a valid JPEG image, save them to the directory dataset, create an imagenet-type directory structure, and create data = ImageDataBunch.from_folder(...):

import sys
sys.path.append('your-parent-directory-of-ai_utilities')  # wherever you cloned ai_utilities
from ai_utilities import *
from fastai.vision import *

pets = ['dog', 'cat', 'gold fish', 'tortoise', 'snake']
for p in pets:
    image_download(p, 500)   # download up to 500 valid images per class into dataset/

path = Path.cwd()/'dataset'
make_train_valid(path)       # build train/valid/test sub-directories
data = ImageDataBunch.from_folder(path, ds_tfms=get_transforms(), size=224, bs=64).normalize(imagenet_stats)

Functions

image_download.py

Downloads up to a given number of images (typically limited to 1000) from a specified search engine: google, bing or flickr. The search_text can be different from its label. Downloads are checked to be valid images. By default, images are saved to the directory dataset.

usage: image_download(search_text:Path, num_images, label:str=None, engine:str='google', image_dir='dataset', apikey=None)
           where, 'engine'   = ['google'|'bing'|'all'|'flickr'],
                  'all'    = 'google' and 'bing',
                  'flickr' requires an apikey
           where, 'label' can be different from 'search_text'
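
For example, an illustrative call based on the signature above (assuming from ai_utilities import * as in the earlier example):

# Search both google and bing for "gold fish", saving results under the label "goldfish".
image_download('gold fish', 100, label='goldfish', engine='all')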

make_train_valid.py

From a directory containing sub-directories, each with a different class of images, make an imagenet-type directory structure.
It randomly copies files from labels_dir to the sub-directories train, valid and test, creating a directory structure usable by ImageDataBunch.from_folder(dir, ...).

usage: make_train_valid(labels_dir:Path, train:float=.8, valid:float=.2, test:float=0)
     positional arguments:
        labels_dir     Contains at least two directories of labels, each containing
                       files of that label
     optional arguments:
        train=.8       fraction of files for training, default=.8
        valid=.2       fraction of files for validation, default=.2
        test=0         fraction of files for testing, default=.0
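
For example, to also hold out a test split (illustrative fractions; Path comes from pathlib or the fastai import):

make_train_valid(Path.cwd()/'dataset', train=.7, valid=.2, test=.1)   # 70/20/10 split
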
6 Likes

@prairieguy Awesome tool! Saved me a lot of time. I had to add an exception handler to filter_images(), as the script kept crashing with a MagicError: name use count (30) exceeded for some images (maybe it’s a long EXIF block or something).
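
For anyone hitting the same error: this is not the actual ai_utilities code, just the shape of the guard, assuming filter_images() uses python-magic to check each file’s type:

# Illustrative only: skip files that libmagic refuses to read instead of crashing.
import magic

def looks_like_jpeg(path):
    try:
        return magic.from_file(str(path), mime=True) == 'image/jpeg'
    except Exception:   # e.g. "name use count (30) exceeded" on unusual EXIF data
        return False    # treat unreadable files as invalid and skip them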

sxela - Thrilled that ai_utilities was of use to you. I use it a ton for image downloads within Jupyter. It’s easier for me than the alternatives.

This is a recent re-write of my original image_download.py. Previously, it depended upon selenium and was really a shell script. This version uses icrawler and is meant to be imported and used within Jupyter (or a Python script). Specifying engine='all' uses both google and bing. Though flickr can also be specified as an engine, it does require an apikey. (I haven’t used it much.) Also, search_text and label can be different. One thing I need to fix is that num_images is only a maximum. (The actual number of downloaded images depends on how many images the search engine returns, how many download successfully, and how many are not corrupted.)

Also, thanks for adding the exception handling for filtering downloaded images! My first pull request! It made my day!

I should also acknowledge that the decision to use icrawler, and some of the code used here, is an extension of the work done by fastclass.

The Bing Image API apparently does not give us legal permission to use it for any 'machine learning' purposes.

Hi Bryan,
Thank you for your library.
I did as in your example usage.
It works fine for me until I run data.show_batch(rows=3, figsize=(7,6)) as in course 1. As a result I get "RuntimeError: cuda runtime error (2) : out of memory …".
I changed bs=64 to bs=16 as mentioned in the course notebook and restarted the kernel. Still I get the same error.
I set up a Google Cloud Platform instance as recommended.
Did you ever experience that issue?
Cheers

Hmm. I didn’t realize that. I tried to find their EULA (or equivalent), but couldn’t. I am using 'bing' for strictly educational purposes and wanted to see if there was an 'educational carve-out'. If you have a link, I would be interested in checking. Thanks.