ImageDownloader widget

(Nate Gadzhibalaev) #1

I’ve put together a widget that can search for and download images from Google Search and store them on disk, so that folks can play around with simple CNNs on their own dataset ideas!

The good:

  • It’s built as a part of the existing widget system. As easy as ImageDownloader().
  • It works in both Notebook and Lab.
  • It’s just 100 LOC.
  • I’ve included some examples in the documentation in docs_src and updated the docs, including how to create a data bunch and a learner with the images downloaded that way.
  • It allows folks to pick the resolution they want and how many images they want. It should work for 100+ images per label.

The ugly:

I made this a couple of weeks back and thought I’d finally share it as is. Here’s my branch with it. If @lesscomfortable and @jeremy think it’s good enough, I’ll open a pull request.

upd: missed the link:

I can also add some examples to more novice-facing docs, maybe in the examples folder? I think it makes sense to show this to novice users: it’s obviously not a serious research tool, more of a playground feature.

One thing that could be improved: pluck the algorithm and pieces of code from google_images_download to fetch image URLs, then use the parallel version of download_images from basic_data to download them, and hook all of that up to a fastprogress widget. Not sure I’ll have time to do that this week or next, though.
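The parallel download step described above could look roughly like the sketch below. It is not the widget's code or the library's download_images: download_all, fetch_bytes, the .jpg file naming, and the default of 8 workers are all hypothetical names and choices made for illustration, and it uses only the Python standard library (no fastprogress bar).

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url, timeout=4):
    """Fetch one URL and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, dest, max_workers=8, fetch=fetch_bytes):
    """Download every URL into dest in parallel.

    Returns (saved_paths, failed_urls). The fetch argument is injectable
    so the function can be exercised without network access.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)

    def download_one(indexed_url):
        index, url = indexed_url
        try:
            data = fetch(url)
        except Exception:
            return url, None  # one failed download must not abort the batch
        # Assumption: every file is saved with a .jpg extension.
        target = dest / f"{index:08d}.jpg"
        target.write_bytes(data)
        return url, target

    saved, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        for url, target in pool.map(download_one, enumerate(urls)):
            if target is None:
                failed.append(url)
            else:
                saved.append(target)
    return saved, failed
```

Threads fit here because the work is I/O-bound; a progress bar could be driven from the loop over pool.map results.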

Another thing to work on would be checking all the images after they’ve been downloaded and auto-deleting the broken ones. Otherwise users might run into problems with DataLoaders, and with num_workers > 0 those are tricky to debug.
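As a rough illustration of the clean-up idea, the sketch below deletes files that do not start with a known image signature. This is a cruder check than actually decoding each image: it catches things like saved HTML error pages and empty files, but not truncated images. looks_like_image and delete_broken_images are hypothetical names, not functions from the library.

```python
from pathlib import Path

# Leading bytes that identify a few common image formats.
SIGNATURES = (
    b"\xff\xd8\xff",       # JPEG
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"GIF87a",             # GIF (1987 variant)
    b"GIF89a",             # GIF (1989 variant)
)


def looks_like_image(path):
    """Return True if the file starts with a known image signature."""
    try:
        with open(path, "rb") as f:
            header = f.read(8)
    except OSError:
        return False
    # bytes.startswith accepts a tuple of candidate prefixes.
    return header.startswith(SIGNATURES)


def delete_broken_images(folder):
    """Delete files in folder that fail the signature check.

    Returns the list of removed paths so the caller can report them.
    Subfolders are left alone.
    """
    removed = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and not looks_like_image(path):
            path.unlink()
            removed.append(path)
    return removed
```

Running a pass like this once, right after downloading, surfaces bad files before they can fail inside a DataLoader worker.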

@lesscomfortable, what do you think? Let me know if you’d like to tweak this a bit, or if we can merge this and tweak it on the go.

(George Zhang) #2

This looks amazing to me. Thanks for the effort!

(Francisco Ingham) #3

Hey, sorry for the delay! This looks very nice indeed. The only problem with this approach (and the reason why we did it with a JavaScript script) is that it infringes Google’s TOS. Some people are OK with that, but we thought that the default option should be in line with the TOS. This, however, is definitely an easier approach, and I think it would be very nice to have a gist for those who want to use it. As for merging into the library, that’s for @sgugger to decide.

(Nate Gadzhibalaev) #4

Thanks for the update!

Yep, I didn’t think about the infringement of Google’s ToS either. :face_with_monocle:

In the meantime, I’ve dropped the dependency and implemented parallel downloads and a fastprogress bar.

I’ll see if everything works together, whether I can add a headless browser option for the >100 images scenario, and document using the function directly instead of the widget.

I’ll then pack it in a pull request and let’s see if @sgugger thinks it’s OK to have this in the repo. No hard feelings if Google’s TOS is a show stopper — we can do a gist, or I’ll pack it into a pypi package and we can link it from the notebooks if you’d like.

I need a day to finish the code, tests and docs. @sgugger, @lesscomfortable, expect an update on this tomorrow morning. :wink:


It may take some time to review it with the holidays, but it would be a nice addition to the library (as long as we don’t violate any ToS :wink: )

(Nate Gadzhibalaev) #6


It’s fine if it takes longer than usual to review the PR, and I’ll be happy to improve the code and how it fits into the library.

(Nate Gadzhibalaev) #7

Bump: cleaned up docs notebooks, fixed a bug. Ready for review.

(Nate Gadzhibalaev) #8

@sgugger, I was unavailable for a couple of days. I’ll catch up on those test issues and fix them this week.

(Pierre Guillou) #9

Hello @xnutsive. Thanks a lot for your ImageDownloader widget!

Just a quick check. The function verify_images() is called in your widget, but I’m not sure the widget passes it the max_size argument to resize all images (I do not see it in the widget’s code):

verify_images(label_path, max_workers=max_workers)

(Nate Gadzhibalaev) #10

Thank you for using it, you’re very welcome :wink:

verify_images is in, so ImageDownloader uses the same verification function that’s in the library already. I’m not sure it can resize images on the fly though.

I don’t pass a max resolution to verify_images in the widget code, but you can call verify_images yourself with that arg right after downloading and it should work.

(Pierre Guillou) #11

It can :slight_smile: From Jeremy:

Do you plan to support its max_size argument in your ImageDownloader?

(Nate Gadzhibalaev) #12

Thought about this for a bit more.

Seems like adding a separate input for max size will confuse novice users: they have search size (i.e. 400x300, 800x600, etc), and there’ll be one more size input — might be too much.

And if I just add an option to pass max_size to the download_google_images() then it’s so small and simple it’s not even worth adding into the download function — you can just do this:

path = Path("data")
download_google_images(path, "cats", n_images=100)
verify_images(path, max_size=224)

Alternatively, if you’re using the widget and not the download_google_images() function directly, you can just put verify_images(max_size=224) in the next cell of the notebook and get the same result.
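To make the effect of a max_size cap concrete, here is a small sketch of the size arithmetic, assuming max_size limits the longer side and preserves the aspect ratio. The thread only says that verify_images can resize images, so the exact rule and rounding inside the library may differ. capped_size is a hypothetical helper, not a library function.

```python
def capped_size(width, height, max_size):
    """Return (width, height) scaled so the longer side is at most max_size.

    Assumption: the aspect ratio is preserved and images that are already
    small enough are left untouched.
    """
    longer = max(width, height)
    if longer <= max_size:
        return width, height
    # Integer arithmetic avoids floating-point rounding surprises.
    new_width = max(1, width * max_size // longer)
    new_height = max(1, height * max_size // longer)
    return new_width, new_height


print(capped_size(800, 600, 224))   # landscape image, scaled down
print(capped_size(100, 50, 224))    # already small, unchanged
```

Under these assumptions an 800x600 download becomes 224x168, which is about 13 times fewer pixels to decode on every epoch.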

However, I’m not 100% sure. If there’s a good training performance boost to having everything resized prior to training, I’ll add something in there so that novice users get better results out of the box and get more encouraged to pursue their projects.

One way I think the widget can be improved is if I add some docs and examples on how to clean up the new dataset by showing images that are not like other downloaded images. I.e. if I want to download 100 polar bears and get 95 bears and 5 white bear toys, I want a snippet of code to invoke ImageCleaner with one line of code that’ll show me the outliers and ask if I want to delete them or keep them.

@sgugger what do you think?