Tips for building large image datasets

This may be the coolest deep learning project ever :smiley:

20 Likes

Searching by date is a great idea for google-images-download! Duplicates were a problem when scaling beyond a few hundred images. That, and looking out for mislabeled data.
I did get a bit paranoid and spent some time looking into downloading copyright-free images, but I'm not sure how this would scale…
Amazon Mturk is probably overkill, just putting it out there.

2 Likes

I was trying to use the Bing API. It seems like a good solution, although I didn't compare it with the Google approach. Actually, I am thinking of building a facial emotions dataset, similar to one of the datasets from Kaggle (can't remember its name). My snippet was something like this, and I was able to retrieve several hundred images (with duplicates):
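(A minimal sketch of the idea, assuming the Bing Image Search v7 REST endpoint and a placeholder subscription key; the helper name and paging values are illustrative:)

```
# Rough sketch: query the Bing Image Search v7 REST API and collect image URLs.
# Endpoint and headers follow the public docs; the subscription key is a placeholder.
import requests

ENDPOINT = 'https://api.cognitive.microsoft.com/bing/v7.0/images/search'
SUBSCRIPTION_KEY = 'your-key-here'  # placeholder

def search_images(query, count=300, page_size=50):
    headers = {'Ocp-Apim-Subscription-Key': SUBSCRIPTION_KEY}
    urls = []
    for offset in range(0, count, page_size):
        params = {'q': query, 'count': page_size, 'offset': offset}
        resp = requests.get(ENDPOINT, headers=headers, params=params)
        resp.raise_for_status()
        urls += [img['contentUrl'] for img in resp.json().get('value', [])]
    return urls
```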

From my point of view, Bing's main advantage is that it provides a programmatic API, while google-images-download seems to rely on Selenium and browser automation. Then again, perhaps Google allows using their engine's API as well?

The only open question is about possible copyright issues. I wonder, can I just gather a bunch of images from Bing and build a publicly available dataset, or do I need to make sure I use royalty-free images only? I mean, is it enough to just state that the images were collected via the Bing API?

1 Like

To extend your list: I have successfully scraped DuckDuckGo for images. I have copied my script to a gist, but it is not pretty; it was just quickly thrown together and never finished. Basically, right now you can specify search terms, the number of images, and what image format you would like (e.g. jpg).
This could easily be extended to image sizes too (the info is already there). The script also creates a csv file with all the titles, urls and sizes that were downloaded, in case you want to check something later. As I said, not finished, but maybe helpful for someone:
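(The gist itself isn't embedded here, but the core of the approach is roughly the following; the vqd token dance reflects how DuckDuckGo's image endpoint worked when I wrote it, and the regex and JSON field names are best-effort assumptions:)

```
# Rough outline of scraping DuckDuckGo images: fetch a vqd token from the
# search page, then page through the i.js endpoint for JSON results.
# Endpoint details are approximate and may change without notice.
import csv
import re
import requests

def ddg_image_results(query, max_results=100):
    session = requests.Session()
    # The first request yields a vqd token required by the image endpoint.
    page = session.get('https://duckduckgo.com/', params={'q': query})
    match = re.search(r"vqd=['\"]?([\d-]+)", page.text)
    if not match:
        raise RuntimeError('Could not find vqd token')
    results, offset = [], 0
    while len(results) < max_results:
        resp = session.get('https://duckduckgo.com/i.js',
                           params={'q': query, 'vqd': match.group(1),
                                   'o': 'json', 's': offset})
        batch = resp.json().get('results', [])
        if not batch:
            break
        results += batch
        offset += len(batch)
    return results[:max_results]

def save_csv(results, path='results.csv'):
    # Keep titles, urls and sizes around in case you want to check later.
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'url', 'width', 'height'])
        for r in results:
            writer.writerow([r.get('title'), r.get('image'),
                             r.get('width'), r.get('height')])
```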

15 Likes

It is probably worth attempting to write some kind of image-collecting package :smile:

8 Likes

Nice projects, and thanks for the details. I've been thinking about building a dataset for artist/style classification, in the hope that it'll produce more interesting embeddings for style transfer from those pretrained weights. I'll give your method a try. You mentioned that searching by date range reduces duplicate photos. The google_images_download cli also supports site-specific searches, so I'll probably just target Wikimedia. Thanks for the great suggestions here.
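(For the site-restricted search, something like this should work, if I understand the library's arguments correctly; the keywords and site are just my use case:)

```
# Sketch of restricting google_images_download to a single site (Wikimedia).
# Argument names follow the library's documented options; values are my own.
from google_images_download import google_images_download

downloader = google_images_download.googleimagesdownload()
downloader.download({
    'keywords': 'claude monet painting',
    'specific_site': 'commons.wikimedia.org',
    'limit': 100,
    'format': 'jpg',
})
```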

I’ve implemented a package that wraps google-images-download with the additional functionality of sanity-checking the images (making sure they can be opened and have three channels) and finally organising the files into separate train/validation/test folders: https://github.com/svenski/duckgoose

The name comes from what I tried to classify instead of dogs and cats in the previous version of the course.

31 Likes

This is a very helpful post! Thank you for putting it together!

2 Likes

Thanks for sharing these! I've been using google_images_download and intermittently found that a few images were corrupted/unreadable (usually about 1/20th of the total). I had to delete such images manually. Is there any way to delete them automatically? Any other suggestions?
Edit: I found the solution below (thanks @kai), but am looking for something shorter.

```
import os
import PIL.Image

def check_images(PATH):
    broken_images = []
    for pic_class in os.listdir(PATH):
        for pic in os.listdir(f'{PATH}/{pic_class}'):
            try:
                img = PIL.Image.open(f'{PATH}/{pic_class}/{pic}')
                img.verify()  # raises if the file is corrupted/truncated
            except (IOError, SyntaxError):
                print('Bad file:', f'{PATH}/{pic_class}/{pic}')
                broken_images.append(f'{PATH}/{pic_class}/{pic}')
    return broken_images

img_to_del = check_images(f'{PATH}train')
for pic in img_to_del:
    os.remove(pic)
```
7 Likes

I don't know about shorter, but I found it useful to verify that there are three channels too, i.e. RGB; some of the images were black & white.

This is from the duckgoose package:

```
from PIL import Image

ii = Image.open(ff)  # ff is the path to the image file
number_of_channels = len(ii.getbands())  # e.g. ('R', 'G', 'B') -> 3
```
3 Likes

I think there is certainly the potential for copyright issues if you are making the dataset available publicly, even with royalty-free images, depending on the exact licence. Some Creative Commons licences, for example, don't allow derivatives, so taking a 299x299 crop of a larger image could be considered a derivative. Most of the answers in the Bing copyright FAQ start with "it depends…". Aiming to use only certain images based on licence could even introduce bias.

1 Like

A good solution, which ImageNet used to use, is to distribute the URLs of the images. You can also distribute a download script along with the list, so that people can fetch the images themselves.
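(For instance, a minimal sketch of such a script, assuming a urls.txt with one image URL per line; the file naming and error handling are illustrative:)

```
# Download every image listed in urls.txt into a local folder.
# Naming files by line index is an arbitrary choice for this sketch.
import os
import requests

def download_all(url_file='urls.txt', out_dir='images'):
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for i, url in enumerate(urls):
        try:
            r = requests.get(url, timeout=10)
            r.raise_for_status()
            with open(os.path.join(out_dir, f'{i:06d}.jpg'), 'wb') as out:
                out.write(r.content)
        except requests.RequestException as e:
            print(f'Failed to download {url}: {e}')
```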

5 Likes

Is anyone aware of a method/script/technique to bulk-download images from a Facebook page?

@devfortu
I have used the Bing API multiple times for building custom datasets. Yes, you can gather a bunch of images from the Bing API and build your own dataset.

There is a very good blog post by Dr. Adrian Rosebrock on building a deep learning image dataset using the Bing API.

9 Likes

Yes, I am experimenting with the Bing API right now; I can't say the results are too promising in terms of how well the retrieved images match the query. However, I am going to give it a try and collect some data.

Ok, got it. Agreed, URLs sound good. Actually, I was thinking about a similar approach when I scraped text data from a platform that doesn't allow using its content except for personal/educational purposes. So I can just share the scraping script without the content itself.

Here is my attempt to build a simple wrapper on top of Bing API:

The goal is to make a simple CLI tool/library for image search queries like:

python -m imds download smiling human face photo | dogs b/w pictures | cats

where each request is separated by a pipe symbol (which would need quoting in a real shell, since | is the pipe operator), or prepared queries are read from a file.
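(A hypothetical sketch of the query parsing for such a CLI; the module layout and the download_query helper are assumptions, not the actual imds code:)

```
# Split a pipe-separated request string into individual queries and
# dispatch each one. Assumes argv[1] is the "download" sub-command.
import sys

def parse_queries(raw):
    return [q.strip() for q in raw.split('|') if q.strip()]

if __name__ == '__main__':
    # e.g. python -m imds download "smiling human face photo | dogs | cats"
    raw = ' '.join(sys.argv[2:]) if len(sys.argv) > 2 else sys.stdin.read()
    for query in parse_queries(raw):
        print(f'Downloading images for: {query!r}')
        # download_query(query)  # assumed wrapper around the Bing API search
```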

1 Like

So, I have a bunch of images in Google Drive. What's an easy way to move them to my AWS notebook instance?
Thanks

You can also try the rsync CLI utility if you're on macOS or Linux.
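For example, something like `rsync -avz ./images/ ubuntu@your-instance:~/data/images/` once the files are downloaded locally; the paths and username here are placeholders.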

1 Like

Thanks for sharing. Duplicate Photo Finder is a great tool to find duplicate or similar images.

cc: @beecoder

7 Likes

I bet you could build a better one using a CNN @Moody :slight_smile: Let me know if you need help getting started.
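(For instance, a rough sketch of the idea: embed each image with a pretrained CNN and flag pairs whose embeddings are nearly identical. The model choice and similarity threshold below are arbitrary assumptions:)

```
# Embed images with a pretrained ResNet and flag near-duplicate pairs by
# cosine similarity of the pooled features. Threshold is a rough guess.
from itertools import combinations

import torch
import torchvision.transforms as T
from PIL import Image
from torchvision import models

model = models.resnet34(pretrained=True)
model.fc = torch.nn.Identity()  # use the pooled features as an embedding
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path):
    with torch.no_grad():
        x = preprocess(Image.open(path).convert('RGB')).unsqueeze(0)
        return torch.nn.functional.normalize(model(x), dim=1)[0]

def find_duplicates(paths, threshold=0.97):
    embeddings = {p: embed(p) for p in paths}
    return [(a, b) for a, b in combinations(paths, 2)
            if float(embeddings[a] @ embeddings[b]) > threshold]
```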

10 Likes

Hi, I have used your script and it downloads the images, but it doesn't split them into train/valid/test folders, although it creates those folders.

link to notebook

1 Like