I was about to start my second pass through the fastai course yesterday and wanted to follow along with my own case studies this time. While that hasn’t been difficult for structured data (my work involves a lot of it) or for language models (shockingly, there’s a lot of text out there for all of us), it has been a bit trickier for image datasets.
This amazing little addon allows you to download ALL images on a Google image search at the press of a button, and it even provides a text file containing all the image links. I know there are probably dozens of ways to scrape images off the net, but this one is the most satisfying I’ve found to date and I wanted to share it.
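As a side note, that text file of links drops straight into fastai. Here is a minimal sketch assuming fastai v2’s download_images; the urls.txt file name and the destination folder are just placeholders for whatever the extension saved for you:

```python
from fastai.vision.all import *

# Placeholder destination folder for one class of images
dest = Path('data/architects/firm_a')
dest.mkdir(parents=True, exist_ok=True)

# urls.txt is the text file of image links saved by the extension
download_images(dest, url_file='urls.txt')

# Remove anything that didn't come down as a valid image
failed = verify_images(get_image_files(dest))
failed.map(Path.unlink)
```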
As an example, in about 20 minutes I had 12,500 images of different architectural practices (around 700 images of buildings from each of ~20 firms) for my ‘architect studio classification’ project.
Thanks! I tried this extension and it seems to work, but it only downloaded around 80 images for the search term I had. Any suggestions on how to get it to grab more results for the same search term?
This is a great tool - I got way better pictures of monkey faces from here than I did from ImageNet. I’m trying to build a model that can distinguish between 3 different types of monkeys.
Couple of questions:
How many images of each kind of monkey do you think are enough to train the model?
Once I get the images, do I need to crop them all to the same size? I could have sworn I watched a video where Jeremy said they all needed to be normalized to the same size (224, I think) and the same number of pixels. Is there a fastai function that does this for you?
I think the typical number people quote for deep learning is 5,000 data points per class, but with transfer learning these days it can be much, much lower.
Concerning resizing, I’m guessing you can do it while loading the batch using the transforms. It’s been a while since I did it and I can’t remember the exact line, but there’s an image resize transform in there.
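Something along these lines should do it in fastai v2; the folder layout (one subfolder per monkey class) and the 224 size are just assumptions for the sketch:

```python
from fastai.vision.all import *

path = Path('data/monkeys')  # assumed: one subfolder per monkey class

# Resize every image to 224x224 as an item transform, before batching
dls = ImageDataLoaders.from_folder(
    path, valid_pct=0.2, seed=42,
    item_tfms=Resize(224))
```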
I was looking for something simple to get some images downloaded to my VM so I could play around with the Lesson 1 code on my own images prior to digging into all the Lesson 2 download tutorials.
(caveat: I’m using the Google Cloud Platform GPU server approach)
Lots of the other suggestions in these forums no longer seem to work according to recent posts (e.g. Googlizer, google-images-download).
I came across this other GitHub project, which seems to work and is super simple to get going:
My GCP VM seems to come with npm already installed, so it took less than 5 minutes to open the same terminal/shell I use for running Jupyter, npm install this project, and run a simple download.
Under the hood it just hits the google.com/search images endpoint and screen-scrapes the thumbnail images from that page. It seems to be limited to around 550 images, and they are not big (generally under 20 KB and small dimensions), but I’m assuming that’s fine for some basic image classification.
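For anyone curious, a rough Python equivalent of that screen-scraping idea might look like the sketch below. The query, output folder, and the assumption that usable thumbnail URLs appear in plain img tags are all mine, and Google’s markup changes often, so treat this as an illustration only, not the project’s actual code:

```python
import re
from pathlib import Path

import requests

def scrape_thumbnails(query, dest='thumbs', limit=100):
    """Fetch the google.com/search image results page and save its thumbnails."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)

    # tbm=isch selects the image-search results page
    resp = requests.get(
        'https://www.google.com/search',
        params={'q': query, 'tbm': 'isch'},
        headers={'User-Agent': 'Mozilla/5.0'},
        timeout=10)
    resp.raise_for_status()

    # Grab whatever http(s) image URLs show up in <img> tags on the page
    urls = re.findall(r'<img[^>]+src="(https?://[^"]+)"', resp.text)

    for i, url in enumerate(urls[:limit]):
        img = requests.get(url, timeout=10)
        if img.ok:
            (dest / f'{query}_{i:04d}.jpg').write_bytes(img.content)

scrape_thumbnails('monkey faces', limit=50)
```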
I’m seeing your question months later, and version 4 of the course is now out. Jeremy goes over how to make dataset images a uniform size in chapter 2 of the book and lesson 2 of the course. You can crop at the center, resize by stretching or squishing, or pad around the image to reach the desired size. He also goes into data augmentation and selective cropping, which may remove the need to resize the images up front. Very informative. You can find it about halfway through the chapter, under the section titled From Data to DataLoaders. Fair warning: this is fastai 2.
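For anyone landing here later, those options look roughly like this in fastai 2; the sizes and min_scale are just the book’s examples from memory, so check chapter 2 for the exact values:

```python
from fastai.vision.all import *

# Default: resize by cropping at the center
item_tfms = Resize(224)

# Or stretch/squish the whole image to the target size
item_tfms = Resize(224, ResizeMethod.Squish)

# Or pad the shorter sides instead of cropping
item_tfms = Resize(224, ResizeMethod.Pad, pad_mode='zeros')

# Or randomly crop a different part of the image each epoch,
# which doubles as data augmentation
item_tfms = RandomResizedCrop(224, min_scale=0.3)
```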