Datasets: downloading images for multiple classes

I’ve been using this notebook (by @lesscomfortable) to put together my dataset of towers of the world.

However, with more than a few classes (I’m using 12), I found it tedious to create all the folders and download the data one class at a time without manual intervention.

So… I went ahead and wrote this little bit of code to loop over the classes.

  1. Write down the classes (names) you will be gathering data for.
  2. Follow the notebook’s first instructions and get your URL files from Google Images.
  3. Upload all the files to the desired path (e.g. ‘data/towers’). Make sure the file names match the ‘urls_class’ pattern. I like hyphens in folder names and underscores in file names, but that’s just my preference.
  4. Run the cells. It will create the subfolders and download the data for each one of the classes.
  5. Wait for your notebook to finish: the circle in the upper-right corner should become an empty circle. See this if you have questions about that.

See all the code here

    from pathlib import Path
    from fastai.vision import download_images  # fastai v1

    classes = ['space-needle', 'eiffel-tower', 'cn-tower', 'canton-tower', 'kl-tower',
               'liberation-tower', 'milad-tower', 'oriental-pearl-tower', 'ostankino-tower',
               'tokyo-skytree', 'tokyo-tower', 'washington-monument']

    path = Path('data/towers')

    for class_name in classes:
        folder = class_name
        # 'eiffel-tower' -> 'urls_eiffel_tower.txt', matching the uploaded file names
        file = f'urls_{class_name}.txt'.replace('-', '_')
        dest = path/folder
        dest.mkdir(parents=True, exist_ok=True)
        print(file)
        download_images(path/file, dest, max_pics=200)


Great idea! We will probably add it to the download_images function in the following days.

I could try to create a better shaped PR :blush:

Go for it! Just remember to edit download_images to accept a list of filenames instead of a single one. We will probably be able to look at it after the lesson, so no rush.
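
A rough sketch of what that could look like, assuming fastai v1’s download_images(urls, dest, max_pics=...) signature; the list-accepting variant and its name are hypothetical:

    from pathlib import Path
    from fastai.vision import download_images  # fastai v1

    def download_images_list(path, url_files, max_pics=200):
        # Hypothetical variant: take several URL files, one per class folder.
        # e.g. 'urls_eiffel_tower.txt' -> images saved under 'eiffel-tower/'
        path = Path(path)
        for url_file in url_files:
            folder = Path(url_file).stem.replace('urls_', '').replace('_', '-')
            dest = path/folder
            dest.mkdir(parents=True, exist_ok=True)
            download_images(path/url_file, dest, max_pics=max_pics)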

Probably easier to just create a CSV and use that factory method.
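
For reference, a minimal sketch of that fastai v1 factory method; the labels.csv name and its two-column layout (filename, label) are just assumptions for the example:

    from pathlib import Path
    from fastai.vision import ImageDataBunch, get_transforms, imagenet_stats

    # Assumes data/towers/labels.csv with rows like: eiffel-tower/00000001.jpg,eiffel-tower
    data = (ImageDataBunch.from_csv(Path('data/towers'), csv_labels='labels.csv',
                                    ds_tfms=get_transforms(), size=224, bs=64)
            .normalize(imagenet_stats))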


I meant the notebook. I think the method is good; if anything, you could build another one that receives a dictionary/list of files + destinations (a CSV, like @jeremy pointed out) and calls this one.

Yes. Maybe we could loop over filenames and URL files and feed them to download_images, then create a CSV with filenames and labels and run .from_csv. How does that sound @pherra?
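
Something along those lines could look like this sketch, assuming the per-class folders created by the loop above and a placeholder labels.csv name:

    import csv
    from pathlib import Path

    path = Path('data/towers')

    # After downloading, write filename,label pairs for every image we kept
    with open(path/'labels.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['name', 'label'])
        for class_name in classes:
            for img in sorted((path/class_name).iterdir()):
                writer.writerow([f'{class_name}/{img.name}', class_name])

    # Then: ImageDataBunch.from_csv(path, csv_labels='labels.csv', ...)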

I did something similar, using scrapy to scrape the names of the items I wanted to download. However, scrapy works best with bash, and I don’t think my environment is set up properly, because I kept getting errors when I tried exporting the bash variable with the extracted names to Python. The idea would then be to create a function that loops over each name in the list and downloads the images using google_image_downloader. Does anyone have experience passing variables from bash to Python in Jupyter?
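
One thing that might help, staying inside Jupyter/IPython rather than exporting bash variables: the ! syntax captures a command’s stdout into a Python list, and {var} interpolates Python values back into shell commands. The file name below is a hypothetical output of the scrapy job:

    # IPython/Jupyter only: capture shell output into a Python variable
    names = !cat scraped_names.txt          # hypothetical file written by the scrapy run
    names = [n.strip() for n in names if n.strip()]

    # Interpolate Python values back into a shell command
    for name in names:
        !echo "would download images for: {name}"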

This project was created during fastai v1 to download class-wise images. It can be run both from the terminal and from Python programs.
https://github.com/hardikvasa/google-images-download
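
For what it’s worth, a minimal sketch of its Python API along the lines of that project’s README (argument names taken from the README; not tested here, and the queries/paths are just examples):

    from google_images_download import google_images_download

    response = google_images_download.googleimagesdownload()
    for query in ['eiffel tower', 'cn tower', 'tokyo skytree']:
        response.download({
            'keywords': query,
            'limit': 100,
            'output_directory': 'data/towers',
            'image_directory': query.replace(' ', '-'),
        })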


Great! That said, it was good for me to do it manually this time, since going through the Google searches got me familiar with my data. I’ll definitely look at it for a larger dataset.

Have you had to delete EXIF data and re-do it? I ran into this issue and a Kaggle thread about it. I applied the commands to all my images and got clean results when verifying them.
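
In case it helps, one Python-only way to strip the metadata, as a sketch (assumes RGB/grayscale JPEGs under data/towers; re-saving from raw pixel data drops EXIF along with any other metadata):

    from pathlib import Path
    from PIL import Image

    for img_path in Path('data/towers').rglob('*.jpg'):
        with Image.open(img_path) as im:
            # Rebuild the image from pixel data only, so no metadata is carried over
            clean = Image.new(im.mode, im.size)
            clean.putdata(list(im.getdata()))
        clean.save(img_path)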

The download_images notebook trick (CSV style) downloaded some non-image files, which had to be deleted with the verify_images function. I don’t remember whether those failing files included EXIF data.
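
For reference, the fastai v1 verify_images call that handles this, looping over the same class list as above (the delete and max_size values are just examples):

    from pathlib import Path
    from fastai.vision import verify_images

    path = Path('data/towers')
    for class_name in classes:
        # Deletes files that can't be opened as images; optionally resizes big ones
        verify_images(path/class_name, delete=True, max_size=500)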

I found that any unsupervised downloads from icrawler, Google image search, or other downloaders contain a lot of wrong and even corrupted images, and I feared they’d mess up my classification…

So I always check them visually and mark any unsuitable ones for deletion…

I wrote a small Python package for this task (auto-download based on a CSV file of search queries/terms) plus a GUI script to quickly clean the results afterwards…

I linked it elsewhere in the forums:

It still takes a lot of time to weed through many-class datasets, but you end up with clean data (11 classes, approx. 8,500 images downloaded and left after cleaning with the script).


That GitHub page isn’t available anymore. I get a 404 error.