Tips for building large image datasets

Oh, is there anything in the train/valid/test directories? I have a sneaking suspicion there’s a bug that means the folders are named after the search phrases rather than the classes. I’ll check later today.

No, they are empty; only downloaded_from_google has the images, distributed into folders on the basis of classes.

I used to use a library for crawling, icrawler. It supports crawling the Google, Bing and Baidu search engines, and it can be extended to download from your own custom webpages. A sample notebook on using it: https://github.com/nareshr8/Image-Localisation/blob/master/crawler.ipynb
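
A minimal sketch of typical icrawler usage (the search term and output folder here are placeholders; see the notebook above for a fuller example):

from icrawler.builtin import GoogleImageCrawler, BingImageCrawler

# Download up to 100 images per search engine into a per-class folder.
for crawler_cls in (GoogleImageCrawler, BingImageCrawler):
    crawler = crawler_cls(storage={'root_dir': 'downloaded/grizzly_bear'})
    crawler.crawl(keyword='grizzly bear', max_num=100)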


I found the reason it didn’t work: the part that sanity-checks and organises the images uses a glob pattern to find the files, which assumes that the file names start with the class name. Since the search term you used didn’t have the class as the first word, it didn’t match anything. I’ve changed it to match anything containing the search term for now. That’s slightly brittle: if a file name contains another class’s search term it might be assigned to several classes, so I’ll change it to use a sanitized version of the search terms later on.
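
To illustrate the difference (a sketch of the pattern change only, not the actual duckgoose code; class_name and search_term are placeholders):

from pathlib import Path

class_name = 'bear'            # hypothetical class
search_term = 'grizzly bear'   # hypothetical search phrase used for that class
downloads = Path('downloaded_from_google')

# Old behaviour: only files whose names start with the class name match,
# so files named after a search phrase like 'grizzly bear ...' are missed.
old_matches = list(downloads.glob(f'{class_name}*'))

# New behaviour: match the search term anywhere in the file name.
# Works, but brittle if one search term appears inside another file's name.
new_matches = list(downloads.glob(f'*{search_term}*'))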

So the new version of duckgoose (0.1.7) will work, or you can rearrange the search terms to have the class name first.

Thanks for letting me know it didn’t work for you.


Does anyone know how to open images in jupyter notebook while waiting for input? I’m writing a data checking function so you can go through your images by class after downloading and delete the ones that don’t belong.

No luck with

  • show_image(open_image(img_path))
  • or img = open_image(img_path); img.show()
  • or plt.imshow(np.rollaxis((np.array(open_image(img_path).data) * 255).astype(np.int32), 0, 3))

All 3 ways display after input is received; same behavior on terminal. So far only PIL.Image works:

import PIL.Image
...
img = PIL.Image.open(class_folder_path/f)
...
img.show()

Unfortunately this opens the image in your system’s default viewer, and running img.close() will not close the window; you have to do it manually. That’s an issue for datasets with hundreds of images.

There is a way to do this, at least from the terminal, with OpenCV, but I’m hesitant to use it since fastai isn’t using OpenCV. It’s similar to something I did in an old project a while back (tuning bounding boxes in that case; I may blog about that).

edit: I put together an OpenCV script for the data cleaner; here’s a video of how it works. Not sure if that works on a cloud instance with no GUI.
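
This isn’t the script itself, but the core OpenCV loop looks roughly like this (a sketch; the folder and key bindings are made up):

import cv2
from pathlib import Path

folder = Path('data/teddy_bear')    # placeholder class folder

for img_path in sorted(folder.glob('*.jpg')):
    img = cv2.imread(str(img_path))
    if img is None:                 # skip unreadable files
        continue
    cv2.imshow('datacleaner', img)
    key = cv2.waitKey(0) & 0xFF     # blocks until a key is pressed
    if key == ord('d'):             # 'd' deletes the image
        img_path.unlink()
    elif key == ord('q'):           # 'q' quits
        break

cv2.destroyAllWindows()

This needs a display, which is why it may not work on a headless cloud instance.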


On a separate note: you can also get image data from video. Using OpenCV and MSS, you can build a dataset by playing a video and grabbing screenshots of the part of the screen it’s playing in, with labels mapped to the keys you press. Here’s how I did that in that same project.
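
A minimal sketch of that capture loop (not the original project code; the capture region, classes and key bindings are placeholders):

import os
import cv2
import numpy as np
from mss import mss

# Hypothetical region of the screen where the video is playing.
region = {'top': 100, 'left': 100, 'width': 640, 'height': 360}
labels = {ord('1'): 'class_a', ord('2'): 'class_b'}   # hypothetical classes
for lbl in labels.values():
    os.makedirs(f'data/{lbl}', exist_ok=True)

counter = 0
with mss() as sct:
    while True:
        # Grab the region and convert the BGRA screenshot to BGR for OpenCV.
        frame = cv2.cvtColor(np.array(sct.grab(region)), cv2.COLOR_BGRA2BGR)
        cv2.imshow('capture', frame)
        key = cv2.waitKey(1) & 0xFF
        if key in labels:                              # press 1/2 to save a labelled frame
            cv2.imwrite(f"data/{labels[key]}/{counter:05d}.jpg", frame)
            counter += 1
        elif key == ord('q'):                          # q to quit
            break

cv2.destroyAllWindows()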

You can build pretty big datasets quickly that way too; your bigger problem will be making sure the data itself is varied enough – since 20 shots of Matt Damon smiling in a 5-second cut are all going to contain basically the same information.


I built a dataset curator to help find and remove both duplicate images and images from outside of the data distribution. It uses the intermediate representations from a pretrained vgg network (similar to content loss when doing style transfer).
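
Not the curator itself, but the underlying idea looks roughly like this (a sketch using torchvision’s VGG16; the helper names and paths are made up for illustration):

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models, transforms

# Use the convolutional part of a pretrained VGG16 as a feature extractor.
vgg = models.vgg16(pretrained=True).features.eval()
prep = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def embed(path):
    x = prep(Image.open(path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        return vgg(x).flatten(1)            # [1, 512*7*7] feature vector

def similarity(path_a, path_b):
    # Cosine similarity near 1.0 suggests near-duplicates; unusually low
    # similarity to everything else suggests an out-of-distribution image.
    return F.cosine_similarity(embed(path_a), embed(path_b)).item()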


Any recommendations around image resizing? I’ve built my own dataset, but the images I grabbed are on the larger side. What’s everyone doing in this context? I know the library can resize, but I’m guessing that’s a costly operation and would be better done once?

Updated my thing using the new stuff I learned tonight. Now the interface doesn’t look like it was made by a child :rofl:


Kind of a double post (see also Small tool to build image dataset: fastclass).

I wrote a small python package fastclass that tackles two problems I had when building a dataset:

  1. easily download images for multiple classes from the big search engines without using their (paid) APIs
  2. quickly filter the results and mark images for deletion (or grade them; more on that below)

For my example I defined 25 search terms (guitars, it’s also in the GitHub repo under examples)…

The first script, fcd, pulls from Google, Bing or Baidu (or all three) and resizes the images, too (it uses icrawler). Simply write a CSV file where each row contains the search terms you want to push to the search engines.
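
For illustration, a hypothetical guitars.csv along those lines, one search phrase per row (check the repo’s examples folder for the exact column layout fcd expects):

Gibson Les Paul
Gibson SG
Fender Stratocaster
Fender Telecaster
...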

Then the second script, fcc, launches a Tkinter GUI so you can quickly flick through the produced folders and mark any file for deletion or, optionally, assign it one of several “grades”.

In my case I used 4 grades (and deleted a bunch):
Grade 1: good
Grade 2: only the body of a guitar (still super useful to distinguish between models)
Grade 3: headstock only (not used in first model)
Grade 4: really hard (back of guitar, not used in first model)

I ended up with roughly 9000 images for 11 classes. Quality check takes some time - but it’s worth it!

You simply press a number to assign a grade, d to mark for deletion, and you can always flick back and forth using the arrow keys. Once you are done, press x to terminate and write the report file…

I wrote about it here:

Repo is here:

Notebook with the classifier is here (97% on 11 Gibson and Fender models; I only used grade 1+2 images for the classifier for the moment and will experiment with the others later):

Let me know with an issue or via these forums if you find any issues with it. Hope it’s useful to you…


@jeremy This thread offers better methods than the javascript code in the lesson2-download notebook. The javascript approach is problematic because it doesn’t work in all browsers, fails with blockers, and isn’t a solution for those who don’t have a way to access a browser UI (Colab et al.).

AFAICT none of the methods presented here are allowed under Google’s Terms of Service. I’m fine with them being discussed here, but I don’t think we should be teaching them in the course.


I’m experimenting with Crestle for uploading and syncing. Since they offer a terminal session, I’ve been able to use GoodSync effectively. The benefits of GoodSync are support for all platforms, powerful features, a good UI (Windows, Mac), and compatibility with cloud-backed storage services (Dropbox, Google Cloud, Microsoft OneDrive, etc.), and it works well. I’m only able to get 700 Kbps uploads/sync; I haven’t identified the bottleneck and haven’t benchmarked against similar services, so it might be as fast as it gets.

https://www.goodsync.com/for-linux

FWIW, I’ve cobbled together a python program that copies the contents of a Google Drive. I’m experimenting with it on Crestle. It may be useful for people who prefer working with Google Drive. I’d prefer to be able to mount a Google Drive, as can be done on Colab, instead of just a file copy method. GoodSync is a better choice for most.

import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

def main():
	gauth = GoogleAuth()
	# Try to load saved client credentials
	gauth.LoadCredentialsFile("mycreds.txt")
	if gauth.credentials is None:
		# Authenticate if they're not there
		#gauth.LocalWebserverAuth()
		gauth.CommandLineAuth()
	elif gauth.access_token_expired:
		# Refresh them if expired
		gauth.Refresh()
	else:
		# Initialize the saved creds
		gauth.Authorize()

	# Save the current credentials to a file
	gauth.SaveCredentialsFile("mycreds.txt")

	drive = GoogleDrive(gauth)
	local_expanded_path = os.path.expanduser('~/data')
	copy_directory(drive, 'root', local_expanded_path)

def copy_directory(drive, source_id, local_path):
	# Recursively mirror the Drive folder `source_id` into `local_path`.
	print(f'source_id:{source_id} local_path:{local_path}')
	try:
		os.makedirs(local_path, exist_ok=True)
	except OSError as e:
		print(f'makedirs failed: {local_path} errno:{e.errno}')

	# List everything whose parent is this folder.
	file_list = drive.ListFile({'q': "'{source_id}' in parents".format(source_id=source_id)}).GetList()
	for f in file_list:
		print(f["title"], f["id"], f["mimeType"])
		if f["title"].startswith("."):
			# Skip hidden files/folders.
			continue
		fname = os.path.join(local_path, f['title'])
		if f['mimeType'] == 'application/vnd.google-apps.folder':
			# Recurse into subfolders.
			copy_directory(drive, f['id'], fname)
		else:
			# Download regular files as-is. (Google-native docs/sheets would
			# need to be exported to a concrete format instead.)
			item = drive.CreateFile({'id': f['id']})
			item.GetContentFile(fname)

if __name__ == "__main__":
	main()

Curious about your Olsen twin project… We tried a “Chrisifier” (Chris Pine/Evans/Pratt/Hemsworth) and were able to get the error down to around 25% using the standard pipeline from the bears notebook from class. So decent accuracy, but far from perfect. How accurate were you able to get your Olsen twins model? Any tricks you’d be willing to share? Apart from the obvious (gather more data; we only have about 200 images of each Chris), we were thinking it might be possible to pretrain on a large facial recognition dataset.

Hey thanks for putting this together.

I’m currently having an issue getting the images for 2 of my classes to download. The output suggests the script is running correctly, and the dirs are created, but the output_path dir is empty on inspection after running.

My problem is with the ‘laver’ and ‘badderlocks’ classes. All the others have downloaded successfully. Can you point me in the right direction?

NB here

Even,

Not sure if you found one, but I use https://www.bricelam.net/ImageResizer/.
Easy to use and works well.


Try ImageMagick:
https://imagemagick.org/script/mogrify.php

For example:

magick mogrify -resize 256x256 *.jpg

This resizes every JPEG in the current folder to fit within 256x256, preserving aspect ratio (use 256x256! to force exact dimensions). Note that mogrify overwrites the files in place, so keep a copy of the originals if you need them.


If you want to download more than 100 images from Google Images, you’re going to have to install Selenium, a webdriver, and Chrome. This series of steps worked for me: https://gist.github.com/ziadoz/3e8ab7e944d02fe872c3454d17af31a5
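
The reason a real browser is needed is that Google Images loads more results only as the page scrolls. A rough sketch of the scrolling part (not the gist above; the query and scroll count are placeholders, and URL extraction is omitted):

import time
from selenium import webdriver

# Launch headless Chrome and scroll the results page so more thumbnails load.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://www.google.com/search?q=grizzly+bear&tbm=isch')  # placeholder query
for _ in range(10):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(1)   # give the page time to load more results

# ... parse driver.page_source for image URLs here ...
driver.quit()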

I am having trouble with google_images_download. It mostly works, but my dataset keeps including corrupted images that need to be removed manually. This is the command I am using:

googleimagesdownload -k "sports car" -s medium -f png -l 500 -o ~/storage/cars -i sports --chromedriver /home/paperspace/anaconda3/envs/fastai/bin/chromedriver

In jupyter I run:

data = ImageDataBunch.from_folder(path, ds_tfms=get_transforms(),
                                  valid_pct=0.25, size=224, bs=bs).normalize(imagenet_stats)

The error message I get looks like this:

/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/fastai/basic_data.py:226: UserWarning: There seems to be something wrong with your dataset, can’t access these elements in self.train_ds: 1010,934
warn(warn_msg)

I can go through the images one by one and a few will not open, which I can then remove. Once I have gone through the whole dataset, everything works fine.

Has anyone seen this / has any idea how to automatically remove these corrupted images?

for c in classes:
    print(c)
    # deletes images that can't be opened and resizes anything larger than 500px
    verify_images(path/c, delete=True, max_size=500)

as per lesson2-download.ipynb :smile:
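
If you’re not using fastai, a rough equivalent with plain PIL looks something like this (a sketch, not what verify_images actually does internally; the path is just the one from your command above):

from pathlib import Path
from PIL import Image

def remove_corrupt_images(folder):
    # Try to fully decode each image; delete the files that fail.
    for p in Path(folder).expanduser().glob('*'):
        if not p.is_file():
            continue
        try:
            with Image.open(p) as img:
                img.load()          # force a full decode, not just the header read
        except Exception:
            print(f'removing {p}')
            p.unlink()

remove_corrupt_images('~/storage/cars/sports')   # placeholder path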

HTH
