Tips for building large image datasets

I’ve implemented a package that wraps google-image-download with additional functionality: it sanity checks the images (making sure they can be opened and have three channels) and organises the files into separate train/validation/test folders: https://github.com/svenski/duckgoose

The name comes from what I tried to classify instead of dogs and cats in the previous version of the course.
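
Roughly, the split step works like this (a simplified sketch of the idea, not the package’s actual code; the directory names and split ratios here are just placeholders):

import random
import shutil
from pathlib import Path

def split_class(src_dir, dst_dir, class_name, valid_pct=0.2, test_pct=0.1):
    # Shuffle the verified images for one class and copy them into
    # train/valid/test subfolders under dst_dir.
    files = list(Path(src_dir, class_name).iterdir())
    random.shuffle(files)
    n_valid, n_test = int(len(files) * valid_pct), int(len(files) * test_pct)
    splits = {'valid': files[:n_valid],
              'test': files[n_valid:n_valid + n_test],
              'train': files[n_valid + n_test:]}
    for split, split_files in splits.items():
        out = Path(dst_dir, split, class_name)
        out.mkdir(parents=True, exist_ok=True)
        for f in split_files:
            shutil.copy(f, out / f.name)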

31 Likes

This is a very helpful post! Thank you for putting it together!

2 Likes

Thanks for sharing these! I’ve been using google_images_download and intermittently found that a few images were corrupted/unreadable (usually about 1/20th of the total). I had to delete such images manually. Is there any way to automatically delete them? Any other suggestions?
Edit: I found the solution below (thanks @kai), but am looking for something shorter.

import os
import PIL.Image

def check_images(PATH):
    # Collect the paths of images that PIL cannot open and verify
    broken_images = []
    for pic_class in os.listdir(PATH):
        for pic in os.listdir(f'{PATH}/{pic_class}'):
            try:
                img = PIL.Image.open(f'{PATH}/{pic_class}/{pic}')
                img.verify()
            except (IOError, SyntaxError):
                print('Bad file:', f'{PATH}/{pic_class}/{pic}')
                broken_images.append(f'{PATH}/{pic_class}/{pic}')
    return broken_images

img_to_del = check_images(f'{PATH}train')
for pic in img_to_del:
    os.remove(pic)
7 Likes

I don’t know about shorter, but I found it useful to verify that there are three channels too, i.e. RGB; some of the images were black & white.

This is from the duckgoose package:

from PIL import Image

ii = Image.open(ff)
number_of_channels = len(ii.getbands())
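
Putting the two checks together, one way to drop anything that can’t be opened or isn’t three-channel (a sketch; the folder layout and file extension are assumptions):

from pathlib import Path
from PIL import Image

for ff in Path('data/train').rglob('*.jpg'):
    try:
        ii = Image.open(ff)
        if len(ii.getbands()) != 3:   # e.g. greyscale or RGBA
            ff.unlink()
    except (IOError, SyntaxError):
        ff.unlink()                   # unreadable file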
3 Likes

I think there is certainly the potential for copyright issues if you are making the dataset available publicly, even with royalty-free images, depending on the exact licence. Some Creative Commons licences, for example, don’t allow derivatives, so taking a 299x299 crop of a larger image could be considered a derivative. Most of the answers in the Bing copyright FAQ start with “it depends…”. Aiming to use only certain images based on their licence could even introduce bias.

1 Like

A good solution, which ImageNet used to use, is to distribute the URLs of the images, along with a script that downloads from the list. That way people can download the images themselves.
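
For example, a minimal downloader that could be shipped alongside a plain text file of URLs (a sketch; urls.txt and the output folder are placeholder names):

import os
import requests

# Download every image listed (one URL per line) in urls.txt into images/
os.makedirs('images', exist_ok=True)
with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

for i, url in enumerate(urls):
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        with open(f'images/{i:05d}.jpg', 'wb') as out:
            out.write(r.content)
    except requests.RequestException:
        print('skipping', url)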

5 Likes

Is anyone aware of a method/script/technique to download bulk images from a Facebook page?

@devfortu
I have used the Bing API multiple times for building custom datasets. Yes, you can gather a bunch of images from the Bing API and build your own dataset.

There is a very good blog post by Dr Adrian Rosebrock on building a deep learning image dataset using the Bing API.
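
The overall approach looks roughly like this (a sketch, not the code from that post; the endpoint version, query and output folder are assumptions, and you need your own key from Azure):

import os
import requests

# Bing Image Search: the key and endpoint version are assumptions
subscription_key = os.environ['BING_SEARCH_KEY']
endpoint = 'https://api.bing.microsoft.com/v7.0/images/search'

params = {'q': 'duck', 'count': 50, 'offset': 0, 'imageType': 'photo'}
headers = {'Ocp-Apim-Subscription-Key': subscription_key}

resp = requests.get(endpoint, headers=headers, params=params)
resp.raise_for_status()
urls = [img['contentUrl'] for img in resp.json()['value']]

os.makedirs('data/duck', exist_ok=True)
for i, url in enumerate(urls):
    try:
        r = requests.get(url, timeout=10)
        with open(f'data/duck/{i:05d}.jpg', 'wb') as f:
            f.write(r.content)
    except requests.RequestException:
        pass  # skip URLs that fail to download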

9 Likes

Yes, I am experimenting with the Bing API right now; I can’t say the results are too promising in terms of how well the retrieved images match the query. However, I am going to give it a try and collect some data.

Ok, got it. Agreed, URLs sound good. Actually, I was thinking about a similar approach when I scraped text data from a platform that doesn’t allow its content to be used except for personal/educational purposes. That way I can just share a scraping script without the content itself.

Here is my attempt to build a simple wrapper on top of the Bing API:

The goal is to make a simple CLI tool/library for making image search queries like:

python -m imds download smiling human face photo | dogs b/w pictures | cats

Each request is separated by a pipe symbol, or prepared queries can be read from a file.
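
Splitting the queries out is the easy part; here is a sketch of how that input might be parsed (this is not the actual imds code):

import sys

def parse_queries(raw):
    # 'smiling human face photo | dogs b/w pictures | cats'
    #   -> ['smiling human face photo', 'dogs b/w pictures', 'cats']
    return [q.strip() for q in raw.split('|') if q.strip()]

if __name__ == '__main__':
    queries = parse_queries(' '.join(sys.argv[1:]))
    for q in queries:
        print('searching for:', q)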

1 Like

So, I have a bunch of images in Google Drive. What’s an easy way to move them to my AWS notebook instance?
Thanks

You can also try the rsync CLI utility if you’re on macOS or Linux.

1 Like

Thanks for sharing. Duplicate Photo Finder is a great tool for finding duplicate or similar images.

cc: @beecoder

7 Likes

I bet you could build a better one using a CNN @Moody :slight_smile: Let me know if you need help getting started.

10 Likes

Hi, I have used your script and it downloads the images, but it doesn’t split them into the train/valid/test folders, although it does create them.

link to notebook

1 Like

Oh, is there anything in the train/valid/test directories? I have a sneaking suspicion there’s a bug where the folders end up named after the search phrases. I’ll check later today.

No, they are empty; only downloaded_from_google has the images, distributed into folders by class.

I used to use a crawling library, icrawler. It supports Google, Bing and Baidu search engine crawling, and can be extended to download from our own custom webpages. A sample notebook on using it: https://github.com/nareshr8/Image-Localisation/blob/master/crawler.ipynb
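
Basic usage looks roughly like this (a sketch; the keywords and output folders are placeholders):

from icrawler.builtin import GoogleImageCrawler  # BingImageCrawler, BaiduImageCrawler also available

# Download up to 200 images per class into its own folder
for keyword in ['duck', 'goose']:
    crawler = GoogleImageCrawler(storage={'root_dir': f'data/{keyword}'})
    crawler.crawl(keyword=keyword, max_num=200)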

2 Likes

I found the reason it didn’t work: the part that sanity checks and organises the images uses a glob pattern to find the files, which assumes that the file names start with the class name. Since the search term you used didn’t have the class as the first term, nothing matched. I’ve changed it to match anything containing the search term for now. That’s slightly brittle; if a file name contains a search term it might be assigned to several classes, so I’ll change it to use a sanitised version of the search terms later on.
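
Roughly the difference (an illustration only, not duckgoose’s actual code; the paths and terms are made up):

import glob

download_dir = 'downloaded_from_google'        # hypothetical path
class_name, search_term = 'duck', 'duck swimming photo'

# Old assumption: file names start with the class name
old_match = glob.glob(f'{download_dir}/{class_name}*')

# 0.1.7: match anything containing the search term (brittle if terms overlap)
new_match = glob.glob(f'{download_dir}/*{search_term}*')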

So the new version of duckgoose (0.1.7) will work, or you can rearrange the search terms to have the class name first.

Thanks for letting me know it didn’t work for you.

1 Like

Does anyone know how to open images in a Jupyter notebook while waiting for input? I’m writing a data-checking function so you can go through your images by class after downloading and delete the ones that don’t belong.

No luck with

  • show_image(open_image(img_path))
  • or img = open_image(img_path); img.show()
  • or plt.imshow(np.rollaxis((np.array(open_image(img_path).data) * 255).astype(np.int32), 0, 3))

All three only display after input is received; same behaviour in the terminal. So far only PIL.Image works:

import PIL.Image
...
img = PIL.Image.open(class_folder_path/f)
...
img.show()

Unfortunately this opens an image using your system’s default viewer, and running img.close() will not close the window - you have to do it manually. An issue for datasets with hundreds of images.
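
In a notebook specifically, IPython.display may do the trick, since it renders the image inline before input() returns (an untested sketch; review_class is a hypothetical helper):

from pathlib import Path
import PIL.Image
from IPython.display import display

def review_class(class_folder_path):
    # Render each image inline, then wait for a keep/delete decision
    for f in sorted(Path(class_folder_path).iterdir()):
        img = PIL.Image.open(f)
        display(img)
        img.close()
        if input(f'{f.name} - delete? [y/N] ').strip().lower() == 'y':
            f.unlink()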

There is a way to do this, at least from the terminal, with OpenCV, but I’m hesitant since fastai isn’t using OpenCV. It’s similar to something I did in an old project a while back (tuning bounding boxes in that case; I may blog about it).

Edit: I put together an OpenCV script for the data cleaner; here’s a video of how it works. Not sure if that works on a cloud instance with no GUI.
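
The core of it is just imshow plus waitKey; something along these lines (a rough sketch of the same idea, not the script from the video; it needs a local display):

import os
import glob
import cv2

def clean_images(class_folder, exts=('*.jpg', '*.png')):
    # Step through images: press 'd' to delete, any other key to keep, 'q' to quit
    paths = [p for ext in exts for p in glob.glob(os.path.join(class_folder, ext))]
    for path in paths:
        img = cv2.imread(path)
        if img is None:          # unreadable file, treat as broken and remove
            os.remove(path)
            continue
        cv2.imshow('review', img)
        key = cv2.waitKey(0) & 0xFF
        if key == ord('q'):
            break
        if key == ord('d'):
            os.remove(path)
    cv2.destroyAllWindows()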


On a separate note: you can also get image data from video. Using OpenCV and MSS, you can build a dataset by playing a video and taking screenshots of that part of the screen, with labels mapped to the keys you press. Here’s how I did that in that same project.

You can build pretty big datasets quickly that way too; your bigger problem will be making sure the data itself is varied enough – since 20 shots of Matt Damon smiling in a 5-second cut are all going to contain basically the same information.
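
A minimal sketch of that idea (not the project’s actual code; the screen region and label keys are made-up examples):

import os
import numpy as np
import cv2
from mss import mss

# Region of the screen where the video plays (coordinates are an assumption)
region = {'top': 100, 'left': 100, 'width': 640, 'height': 360}
labels = {ord('s'): 'smiling', ord('n'): 'not_smiling'}   # hypothetical classes
for name in labels.values():
    os.makedirs(f'data/{name}', exist_ok=True)

with mss() as sct:
    i = 0
    while True:
        # Grab the region, convert BGRA -> BGR for OpenCV
        frame = cv2.cvtColor(np.array(sct.grab(region)), cv2.COLOR_BGRA2BGR)
        cv2.imshow('capture', frame)
        key = cv2.waitKey(1) & 0xFF
        if key == ord('q'):
            break
        if key in labels:
            cv2.imwrite(f'data/{labels[key]}/{i:05d}.jpg', frame)
            i += 1
cv2.destroyAllWindows()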

3 Likes

I built a dataset curator to help find and remove both duplicate images and images from outside the data distribution. It uses the intermediate representations from a pretrained VGG network (similar to the content loss used in style transfer).
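
The underlying idea is roughly this (a sketch, not the curator’s code; the layer cut-off, file names and similarity threshold are assumptions):

import torch
import torch.nn.functional as F
from torchvision import models, transforms
import PIL.Image

# Truncate a pretrained VGG at an intermediate layer and compare images
# by cosine similarity of their activations
vgg = models.vgg16(pretrained=True).features[:16].eval()
prep = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def embed(path):
    x = prep(PIL.Image.open(path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        return vgg(x).flatten(1)

sim = F.cosine_similarity(embed('a.jpg'), embed('b.jpg')).item()
if sim > 0.95:       # threshold is an assumption
    print('likely duplicates')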

1 Like