ImageDownloader widget

Thanks for the update!

Yep, I hadn’t thought about infringing Google’s ToS either. :face_with_monocle:

In the meantime, I’ve dropped the dependency, implemented parallel downloads, and added a fastprogress bar.

Next, I’ll check that everything works together, see whether I can add a headless browser option for the >100 images scenario, and document how to use the function directly instead of the widget.

I’ll then pack it into a pull request, and we’ll see if @sgugger thinks it’s OK to have this in the repo. No hard feelings if Google’s ToS is a showstopper: we can do a gist, or I’ll pack it into a PyPI package and we can link it from the notebooks if you’d like.

I need a day to finish the code, tests and docs. @sgugger, @lesscomfortable, expect an update on this tomorrow morning. :wink:

It may take some time to review it with the holidays, but it would be a nice addition to the library (as long as we don’t violate any ToS :wink: )

Boom: https://github.com/fastai/fastai/pull/1382

It’s fine if it takes longer than usual to review the PR, and I’ll be happy to improve the code and how it fits into the library.

Bump: cleaned up docs notebooks, fixed a bug. Ready for review.

@sgugger, I was unavailable for a couple of days. I’ll catch up on those test issues and fix them this week.

Hello @xnutsive. Thanks a lot for your ImageDownloader widget!

Just checking one thing. Your widget calls the function verify_images(), but I’m not sure it passes the max_size argument to resize all images (I do not see it in the widget code):

verify_images(label_path, max_workers=max_workers)

Thank you for using it, you’re very welcome :wink:

verify_images is in vision.data, so ImageDownloader uses the same verification function that’s in the library already. I’m not sure it can resize images on the fly though.

I don’t pass max_size to verify_images in the widget code, but you can call verify_images yourself right after downloading, with that argument, and it should work.

It can :slight_smile: From Jeremy: https://forums.fast.ai/t/best-way-to-resize-pictures-for-model-training/28307/5

Do you plan to support its max_size argument in your ImageDownloader?

Thought about this for a bit more.

Seems like adding a separate input for max size will confuse novice users: they already have the search size (i.e. 400x300, 800x600, etc.), and one more size input might be too much.

And if I just add an option to pass max_size to the download_google_images() then it’s so small and simple it’s not even worth adding into the download function — you can just do this:

path = Path("data")
download_google_images(path, "cats", n_images=100)
verify_images(path, max_size=224)

Alternatively, if you’re using the widget and not the download_google_images() function directly, you can just put verify_images(path, max_size=224) in the next cell of the notebook and get the same result.
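For intuition, here is a rough sketch of what capping images at max_size means for their dimensions. It assumes the longer side is scaled down to max_size, the aspect ratio is kept, and smaller images are left untouched. fit_within is a hypothetical helper written for illustration, not a fastai function, and the library's own resizing may differ in details such as rounding.

```python
def fit_within(width, height, max_size):
    """Return (width, height) scaled so the longer side is at most max_size.

    Hypothetical helper for illustration only; not part of fastai.
    """
    longest = max(width, height)
    if longest <= max_size:
        # Already small enough, so the image would be left as it is.
        return width, height
    # Integer arithmetic avoids floating-point rounding surprises.
    new_width = max(1, width * max_size // longest)
    new_height = max(1, height * max_size // longest)
    return new_width, new_height


print(fit_within(800, 600, 224))  # an 800x600 download capped at 224
print(fit_within(100, 50, 224))   # a small image stays unchanged
```

Under these assumptions an 800x600 image ends up at 224x168, which is far less work for the CPU to decode and augment on every epoch.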

However, I’m not 100% sure. If there’s a good training performance boost to having everything resized prior to training, I’ll add something in there so that novice users get better results out of the box and get more encouraged to pursue their projects.

One way I think the widget can be improved is if I add some docs and examples on how to clean up the new dataset by showing images that are not like other downloaded images. I.e. if I want to download 100 polar bears and get 95 bears and 5 white bear toys, I want a snippet of code to invoke ImageCleaner with one line of code that’ll show me the outliers and ask if I want to delete them or keep them.

@sgugger what do you think?

Resizing in advance doesn’t give you a boost in accuracy, but it does give you one in training speed (the bottleneck is almost always the CPU with data augmentation). I don’t see a problem with keeping it as a separate instruction and avoiding too much complexity.

Hi @xnutsive.

When we want to download more than 100 images, I noticed that the fastai docs on ImageDownloader do not explain where to get chromedriver for Windows 10. I found the answer on Stack Overflow (update 2):

  1. Download chromedriver_win32.zip from the download page of ChromeDriver.
  2. Unzip to chromedriver.exe in C:\Windows
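A quick way to check that the steps above worked is to ask Python whether the driver is visible on the PATH (C:\Windows normally is). This is only a sketch: find_chromedriver is a hypothetical helper, and it says nothing about whether the driver version matches the installed Chrome.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver executable if it is on PATH, else None.

    Hypothetical helper for illustration. On Windows, shutil.which also
    tries the usual executable extensions, so "chromedriver" can match
    chromedriver.exe.
    """
    return shutil.which(name)


location = find_chromedriver()
if location is None:
    print("chromedriver was not found on PATH")
else:
    print(f"chromedriver found at {location}")
```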

That’s the way :slight_smile:

ImageDownloader(path)

path_to_folder = path / 'your search query in ImageDownloader'
verify_images(path_to_folder, delete=True, max_size=500)

Nate, it looks like ImageDownloader does not accept the following search (I’m using Windows 10).
Using double quotes (“”) in the search causes a folder name problem when Windows tries to create a folder named after the search string.

@xnutsive. One more question (gloups :slight_smile: ).

Do you have any plan to implement in your ImageDownloader widget the gi2ds snippet from @melonkernel, which is a tool for deleting files on the Google Image Search page before downloading?

This is very useful when you search with a label shared by different products, because in that case there are plenty of images you don’t want to download (for example, “manga” can mean mango in English, but also Japanese drawings…).

Whoa, thank you @pierreguillou for the questions!

  1. Download chromedriver_win32.zip from the download page of ChromeDriver.
  2. Unzip to chromedriver.exe in C:\Windows

I’ll add this to the docs, along with verifying images (resizing) after download.

Nate, it looks like ImageDownloader does not accept the following search (I’m using Windows 10).
Using double quotes (“”) in the search causes a folder name problem when Windows tries to create a folder named after the search string.

That’s a good catch. Just so I understand, you’re expecting to use advanced google search syntax, right? I’ll add escaping to labels to the code.
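A rough sketch of the kind of escaping discussed here, assuming the folder is named after the raw search string. label_to_folder_name is a hypothetical helper and not what the widget actually does. It replaces the characters Windows forbids in file and folder names and trims trailing spaces and dots.

```python
import re

# Characters Windows does not allow in file or folder names.
_WINDOWS_FORBIDDEN = r'[<>:"/\\|?*]'


def label_to_folder_name(label, replacement="_"):
    """Turn a search string into a name that is safe to use as a folder.

    Hypothetical helper for illustration; not part of fastai.
    """
    cleaned = re.sub(_WINDOWS_FORBIDDEN, replacement, label)
    # Windows also rejects names that end in a space or a dot.
    cleaned = cleaned.strip().rstrip(". ")
    # Fall back to a placeholder if nothing usable is left.
    return cleaned or "unnamed"


print(label_to_folder_name('product_name variety_name1 -"variety_name21 variety_name22"'))
```

The search string sent to Google can stay untouched. Only the folder name needs cleaning.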

Do you have any plan to implement in your ImageDownloader widget the gi2ds snippet from @melonkernel, which is a tool for deleting files on the Google Image Search page before downloading?

I absolutely understand the need. I don’t have a plan to add this yet, but I’ll think about how this can be done. Maybe for smaller searches I can show a grid of images you can toggle with a pagination widget, and then you hit a “save” button.

Great :slight_smile:

Yes. I’m building a classifier of product varieties. The variety names have 1, 2 or even 3 words. That means I need to make searches like:
product_name variety_name1 -"variety_name21 variety_name22"
(with variety_name2 = “variety_name21 variety_name22”).
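Search strings of this shape can be put together with a few lines of Python. This is a sketch only: build_query is a hypothetical helper, and it relies on the usual Google search syntax where a leading minus excludes a term and double quotes keep a multi-word phrase together.

```python
def build_query(product, variety, exclude=()):
    """Build a search string that keeps one variety and excludes others.

    Hypothetical helper for illustration; not part of fastai.
    """
    parts = [product, variety]
    for phrase in exclude:
        # Quote multi-word phrases so they are excluded as a whole.
        quoted = f'"{phrase}"' if " " in phrase else phrase
        parts.append(f"-{quoted}")
    return " ".join(parts)


query = build_query(
    "product_name",
    "variety_name1",
    exclude=["variety_name21 variety_name22"],
)
print(query)
```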

That would be a great help, as I often use the gi2ds snippet to avoid useless downloads.

Has imageDownloader been moved to a different package? I don’t see it in the imports…

It’s “ImageDownloader” with a capital I :slight_smile:

Hi @xnutsive.
thanks for this super widget :slight_smile:
I wrote a sample for this widget in a Colab notebook, using Selenium and ChromeDriver.

It gives me this error when I try to download more than 100 images.

Interesting!

Honestly, I haven’t touched it for over a year and I’m surprised it still works :wink: Great work porting this to Colab.

I can take a look into the issue, but that’ll likely take a week or so. I’ll ping you guys here with an update, and probably try to also port this to fastai v2 to see how it works internally.
