Would love to learn different methods people have used to create their own large image training datasets. I’ll share mine below:
1) Using google-images-download
$ pip install google_images_download
Install Chrome and Chromedriver to download images from the command line.
I installed these onto my virtual machine by navigating to the respective download pages in my laptop's Chrome browser, then copying/pasting the corresponding wget command into the virtual machine's terminal using Chrome's CurlWget extension.
Now I can download. The following gets me 500 medium-sized images of baseball games:
$ googleimagesdownload -k "baseball game" -s medium -l 500 -o fastai/courses/dl1/data/baseballcricket -i train/baseball -cd ~/chromedriver
In my experience, requesting 500 images worked fine, but requesting 4000 cut me off at 450. So to get thousands of images, I run the command several times, changing the date range for each request (a scripted version of this loop is sketched at the end of this item):
$ googleimagesdownload -k "baseball game" -s medium -wr '{"time_min":"09/01/2018","time_max":"09/30/2018"}' -l 500 -o fastai/courses/dl1/data/baseballcricket -i train/baseball -cd ~/chromedriver
Note: I run this on a fastai Ubuntu 16.04 machine hosted by Paperspace, so this method works without a GUI browser.
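For reference, here's a minimal Python sketch of that loop, which just re-runs the same CLI command via subprocess over a list of month-long date ranges. The month list is a placeholder, and the paths/chromedriver location are the ones from my commands above, so adjust them to your own setup.

import os
import subprocess

# Placeholder month ranges -- extend the list to cover as many months as you need
months = [("07/01/2018", "07/31/2018"),
          ("08/01/2018", "08/31/2018"),
          ("09/01/2018", "09/30/2018")]

for time_min, time_max in months:
    # Same JSON format as the -wr argument shown above
    time_range = '{"time_min":"%s","time_max":"%s"}' % (time_min, time_max)
    subprocess.run([
        "googleimagesdownload",
        "-k", "baseball game",
        "-s", "medium",
        "-wr", time_range,
        "-l", "500",
        "-o", "fastai/courses/dl1/data/baseballcricket",
        "-i", "train/baseball",
        "-cd", os.path.expanduser("~/chromedriver"),
    ], check=True)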
2) Following the PyImageSearch tutorial that uses a paid Bing Image Search API account.
I found this to be incredibly easy and could download thousands of images at once.
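If you'd rather not follow the whole tutorial, here's a rough, self-contained sketch of the Bing approach (not the PyImageSearch code itself). It assumes the v7 Image Search endpoint; the API key, query, output path, and limits are placeholders.

import os
import requests

BING_API_KEY = "YOUR_KEY_HERE"                      # placeholder subscription key
ENDPOINT = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"
QUERY = "baseball game"
OUTPUT_DIR = "data/baseballcricket/train/baseball"  # placeholder path
PER_PAGE = 50                                       # results per API request
TOTAL = 1000                                        # total images to attempt

os.makedirs(OUTPUT_DIR, exist_ok=True)
headers = {"Ocp-Apim-Subscription-Key": BING_API_KEY}

saved = 0
for offset in range(0, TOTAL, PER_PAGE):
    params = {"q": QUERY, "offset": offset, "count": PER_PAGE}
    results = requests.get(ENDPOINT, headers=headers, params=params).json()
    for item in results.get("value", []):
        try:
            img = requests.get(item["contentUrl"], timeout=10)
            img.raise_for_status()
        except requests.RequestException:
            continue  # skip dead or slow URLs
        ext = item.get("encodingFormat", "jpeg")
        with open(os.path.join(OUTPUT_DIR, f"{saved:06d}.{ext}"), "wb") as f:
            f.write(img.content)
        saved += 1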
3) Using sentdex’s script for downloading from ImageNet URLs
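A condensed sketch of that idea is below (in the spirit of sentdex's script, not his exact code). The synset ID and output path are placeholders, and it assumes ImageNet's geturls endpoint is still serving URL lists; expect a lot of dead links either way.

import os
import requests

WNID = "n00000000"                  # placeholder: look up the synset ID for your class
URL_LIST = "http://www.image-net.org/api/text/imagenet.synset.geturls?wnid=" + WNID
OUTPUT_DIR = "data/imagenet_class"  # placeholder path

os.makedirs(OUTPUT_DIR, exist_ok=True)
urls = requests.get(URL_LIST).text.splitlines()

for i, url in enumerate(urls):
    try:
        resp = requests.get(url.strip(), timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        continue  # many of the listed URLs are dead -- just skip them
    with open(os.path.join(OUTPUT_DIR, f"{i:06d}.jpg"), "wb") as f:
        f.write(resp.content)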
4) There's a great thread in Part 1 using the ai-utilities package developed by prairieguy for fastai. I'd love to hear about people's experiences with that, or anything else you've found helpful. I'm mostly interested in tips for building large datasets in the tens-of-thousands-of-images range.