ImageDownloader widget

xnutsive · December 16, 2018, 1:46am

I’ve put together a widget that can search for and download images from Google Search and store them on disk, so that folks can play around with simple CNNs on their own dataset ideas!

The good:

It’s built as a part of the existing fast.ai widget system. As easy as ImageDownloader().
It works in both Notebook and Lab.
It’s just 100 LOC.
I’ve included some examples in the documentation in docs_src and updated the docs, including how to create a data bunch and a learner with the images downloaded that way.
It allows folks to pick the resolution they want and how many images they want. It should work for 100+ images per label.

The ugly:

It uses the google_images_download script and I’m not a huge fan of it. It’s under MIT license.

I’ve made this a couple weeks back and thought I’d finally share this as is. Here’s my branch with it — if @lesscomfortable and @jeremy think it’s good enough — I’ll open a pull request.

upd: missed the link: https://github.com/xnutsive/fastai/tree/image_downloader

I can also add some examples to more novice-facing docs, maybe in the examples folder? I think it makes sense to show this to novice users: it’s obviously not a super serious research tool and more of a playground feature.

One thing that can be improved is to pluck the algorithm and the pieces of code that fetch image URLs from google_images_download, then use the parallel version of download_images from basic_data to download them, and hook all that up to a fastprogress widget. Not sure if I have time to do that this week or next, though.
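
A minimal sketch of the parallel-download idea, using only the Python standard library rather than fastai’s download_images or fastprogress. The names download_parallel, fetch_bytes, and on_progress, as well as the numbered .jpg file naming, are assumptions made up for illustration; on_progress marks the place where a progress bar could be hooked in.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url, timeout=10.0):
    """Default fetcher: return the raw bytes served at `url`."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_parallel(urls, dest, max_workers=8, fetch=fetch_bytes, on_progress=None):
    """Download `urls` into `dest` concurrently.

    Returns (saved_paths, failed_urls). `on_progress(done, total)` is an
    optional callback, called once per finished download.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []

    def job(index, url):
        data = fetch(url)
        target = dest / f"{index:08d}.jpg"  # hypothetical naming scheme
        target.write_bytes(data)
        return target

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(job, i, u): u for i, u in enumerate(urls)}
        for done, future in enumerate(as_completed(futures), start=1):
            try:
                saved.append(future.result())
            except Exception:
                # A failed URL should not abort the whole batch.
                failed.append(futures[future])
            if on_progress is not None:
                on_progress(done, len(futures))
    return sorted(saved), failed
```

Passing fetch as a parameter keeps the sketch testable without network access; failed URLs are collected instead of raising, so one dead link does not stop the batch.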

Another thing to work on would be to check all the images after they’ve been downloaded and auto-delete the broken ones. Otherwise users might have problems with DataLoaders, and with num_workers > 0 those are tricky to debug.
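
A rough sketch of what such a cleanup pass could look like, assuming a cheap check of the file’s leading bytes is good enough as a first filter. This is not the library’s verification code: looks_like_image and delete_broken_images are hypothetical helpers, and a signature check will not catch truncated files, which only a full decode would reveal.

```python
from pathlib import Path

# Leading bytes ("magic numbers") of a few common image formats.
SIGNATURES = (
    b"\xff\xd8\xff",       # JPEG
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"GIF87a",             # GIF (1987 spec)
    b"GIF89a",             # GIF (1989 spec)
)


def looks_like_image(path):
    """Return True if the file starts with one of the known signatures."""
    try:
        with open(path, "rb") as f:
            header = f.read(8)
    except OSError:
        return False
    return header.startswith(SIGNATURES)


def delete_broken_images(folder):
    """Delete files in `folder` that fail the signature check.

    Returns the names of the deleted files, in sorted order.
    """
    removed = []
    for p in sorted(Path(folder).iterdir()):
        if p.is_file() and not looks_like_image(p):
            p.unlink()
            removed.append(p.name)
    return removed
```

Typical failures this catches are HTML error pages saved with an image extension and empty files.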

@lesscomfortable, what do you think? Let me know if you’d like me to tweak this a bit, or if we can merge this and tweak it on the go.

PegasusWithoutWinds · December 17, 2018, 1:43am

This looks amazing to me. Thanks for the effort!

lesscomfortable · December 21, 2018, 3:56am

Hey, sorry for the delay! This looks very nice indeed. The only problem with this approach (and the reason why we did it with a JavaScript script) is that it infringes Google’s TOS. Some people are OK with that, but we thought that the default option should be in line with the TOS. This, however, is definitely an easier approach, and I think it would be very nice to have a gist for those who want to use it. As for merging into the library, that’s for @sgugger to decide.

xnutsive · December 21, 2018, 6:18pm

Thanks for the update!

Yep, I hadn’t thought about infringing Google’s ToS.

In the meantime, I’ve dropped the dependency, implemented parallel downloads and fastprogress bar.

I’ll check that everything works together, see if I can add a headless browser option for the >100 images scenario, and document using the function directly instead of the widget.

I’ll then pack it in a pull request and let’s see if @sgugger thinks it’s OK to have this in the repo. No hard feelings if Google’s TOS is a show stopper — we can do a gist, or I’ll pack it into a pypi package and we can link it from the notebooks if you’d like.

I need a day to finish the code, tests and docs. @sgugger, @lesscomfortable, expect an update on this tomorrow morning.

sgugger · December 22, 2018, 7:09am

It may take some time to review it with the holidays, but it would be a nice addition to the library (as long as we don’t violate any ToS).

xnutsive · December 23, 2018, 5:11am

Boom: https://github.com/fastai/fastai/pull/1382

It’s fine if it takes longer than usual to review the PR, and I’ll be happy to improve the code and how it fits into the library.

xnutsive · December 28, 2018, 7:44pm

Bump: cleaned up docs notebooks, fixed a bug. Ready for review.

xnutsive · January 3, 2019, 7:27am

@sgugger, I was unavailable for a couple of days. I’ll catch up on those test issues and fix them this week.

pierreguillou · January 22, 2019, 11:16am

Hello @xnutsive. Thanks a lot for your ImageDownloader widget!

Just to verify: the function verify_images() is used in your widget, but I’m not sure it accepts the argument max_size to resize all images (I do not see it in the code of the widget):

verify_images(label_path, max_workers=max_workers)

xnutsive · January 22, 2019, 2:50pm

Thank you for using it, you’re very welcome!

verify_images is in vision.data, so ImageDownloader uses the same verification function that’s in the library already. I’m not sure it can resize images on the fly though.

I don’t pass a max resolution to verify_images in the widget code, but you can call verify_images yourself with that argument after downloading and it should work.

pierreguillou · January 22, 2019, 4:01pm

It can. From Jeremy: Best way to resize pictures for model training - #5 by jeremy

Do you plan to support its max_size argument in your ImageDownloader?

xnutsive · January 23, 2019, 5:51am

Thought about this for a bit more.

Seems like adding a separate input for max size will confuse novice users: they have search size (i.e. 400x300, 800x600, etc), and there’ll be one more size input — might be too much.

And if I just add an option to pass max_size to the download_google_images() then it’s so small and simple it’s not even worth adding into the download function — you can just do this:

path = Path("data")
download_google_images(path, "cats", n_images=100)
verify_images(path, max_size=224)

Alternatively, if you’re using the widget and not the download_google_images() function directly, you can just put verify_images(path, max_size=224) in the next cell of the notebook and get the same result.

However, I’m not 100% sure. If there’s a good training performance boost to having everything resized prior to training, I’ll add something in there so that novice users get better results out of the box and get more encouraged to pursue their projects.

One way I think the widget can be improved is if I add some docs and examples on how to clean up the new dataset by showing images that are not like other downloaded images. I.e. if I want to download 100 polar bears and get 95 bears and 5 white bear toys, I want a snippet of code to invoke ImageCleaner with one line of code that’ll show me the outliers and ask if I want to delete them or keep them.

@sgugger what do you think?

sgugger · January 23, 2019, 2:16pm

You don’t get a boost in accuracy by resizing in advance, but you do get one in training speed (the bottleneck is almost always the CPU with data augmentation). I don’t see a problem with keeping it as a separate instruction to avoid too much complexity.
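
To make the “resize in advance” step concrete, here is a small sketch of the arithmetic for fitting an image inside a max_size box while keeping its aspect ratio. fit_within is a hypothetical helper written for this illustration; the library’s own resizing logic may round or behave differently.

```python
def fit_within(width, height, max_size):
    """Return (new_width, new_height) so that the longer side is at most
    `max_size`, preserving aspect ratio. Smaller images are left unchanged."""
    longest = max(width, height)
    if longest <= max_size:
        return width, height
    # Integer arithmetic avoids floating-point rounding surprises.
    new_w = max(1, width * max_size // longest)
    new_h = max(1, height * max_size // longest)
    return new_w, new_h


print(fit_within(800, 600, 224))  # (224, 168)
print(fit_within(200, 100, 224))  # (200, 100), already small enough
```

An 800x600 download shrinks to 224x168, roughly 13 times fewer pixels to decode and augment on the CPU for every batch.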

pierreguillou · January 24, 2019, 11:12am

Hi @xnutsive.

When we want to download more than 100 images, I noticed that the fastai docs on ImageDownloader do not explain where to get chromedriver for Windows 10. I found the answer on Stack Overflow (update 2):

Download chromedriver_win32.zip from the download page of ChromeDriver.
Unzip to chromedriver.exe in C:\Windows

pierreguillou · January 24, 2019, 11:22am

That’s the way

ImageDownloader(path)

path_to_folder = path / 'your search query in ImageDownloader'
verify_images(path_to_folder, delete=True, max_size=500)

pierreguillou · January 24, 2019, 11:38am

Nate, it looks like ImageDownloader does not accept the following search (I’m using Windows 10).
The use of “” in the search creates a folder name problem when Windows tries to create a folder with the search string.

pierreguillou · January 24, 2019, 11:51am

@xnutsive. One more question (gloups).

Do you have any plans to implement in your ImageDownloader widget the gi2ds snippet from @melonkernel, which is a tool for deleting files on the Google Image Search page before downloading?

This is very useful when you search with a label shared by different products, because there are plenty of images you don’t want to download in that case (for example, “manga” translates to “mango” in English, but it is also the Japanese style of drawing…).

xnutsive · January 24, 2019, 3:28pm

Whoa, thank you @pierreguillou for the questions!

> Download chromedriver_win32.zip from the download page of ChromeDriver.
> Unzip to chromedriver.exe in C:\Windows

I’ll add this to the docs, along with verifying images (resizing) after download.

> Nate, it looks like ImageDownloader does not accept the following search (I’m using Windows 10).
> The use of “” in the search creates a folder name problem when Windows tries to create a folder with the search string.

That’s a good catch. Just so I understand, you’re expecting to use advanced Google search syntax, right? I’ll add label escaping to the code.
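
One possible shape for that escaping, as a sketch only: replace the characters Windows forbids in file and folder names before using the search string as a directory name. label_to_folder_name is a hypothetical helper, not the function that ended up in the widget, and it does not handle reserved device names such as CON or NUL.

```python
import re

# Characters Windows does not allow in file or folder names,
# plus ASCII control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def label_to_folder_name(label, replacement="_"):
    """Turn a search query into a folder name that is safe on Windows."""
    name = _FORBIDDEN.sub(replacement, label)
    # Windows also rejects names that end in a dot or a space.
    name = name.strip(" .")
    return name or "unnamed"


print(label_to_folder_name('cats -"maine coon"'))  # cats -_maine coon_
```

The original query can still be sent to the search engine unchanged; only the folder name on disk needs sanitizing.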

> Do you have any plans to implement in your ImageDownloader widget the gi2ds snippet from @melonkernel, which is a tool for deleting files on the Google Image Search page before downloading?

I absolutely understand the need. I don’t have a plan to add this yet, but I’ll think about how this can be done. Maybe for smaller searches I can show a grid of images you can toggle with a pagination widget, and then you hit a “save” button.

pierreguillou · January 24, 2019, 4:23pm

Great

Yes. I’m building a classifier for product varieties. The variety names have 1, 2, or even 3 words. That means I need to make searches like:
product_name variety_name1 -"variety_name21 variety_name22"
(with variety_name2 = “variety_name21 variety_name22”).

That would be a great help, as I often use the gi2ds snippet to avoid useless downloads.

nareshr8 · August 24, 2019, 3:42am

Has ImageDownloader been moved to a different package? I don’t see it in the imports…