Yep, I hadn’t thought about the possible infringement of Google’s ToS either.
In the meantime, I’ve dropped the dependency, implemented parallel downloads and added a fastprogress bar.
I’ll check that everything works together, see whether I can add a headless browser option for the >100 images scenario, and document how to use the function directly instead of the widget.
I’ll then pack it into a pull request and we’ll see if @sgugger thinks it’s OK to have this in the repo. No hard feelings if Google’s ToS is a showstopper: we can do a gist, or I’ll pack it into a PyPI package and we can link it from the notebooks if you’d like.
I need a day to finish the code, tests and docs. @sgugger, @lesscomfortable, expect an update on this tomorrow morning.
Just to verify: the function verify_images() is used in your widget, but I’m not sure the widget accepts the argument max_size to resize all images (I do not see it in the code of the widget):
verify_images is in vision.data, so ImageDownloader uses the same verification function that’s in the library already. I’m not sure it can resize images on the fly though.
I don’t pass max_size to verify_images in the widget code, but you can call verify_images yourself right after downloading, with that argument, and it should work.
Adding a separate input for max size seems likely to confuse novice users: they already have the search size (e.g. 400x300, 800x600), and one more size input might be too much.
And if I just add an option to pass max_size to the download_google_images() then it’s so small and simple it’s not even worth adding into the download function — you can just do this:
Alternatively, if you’re using the widget and not the download_google_images() function directly, you can just put verify_images(max_size=224) in the next cell of the notebook and get the same result.
However, I’m not 100% sure. If there’s a good training performance boost to having everything resized prior to training, I’ll add something in there so that novice users get better results out of the box and get more encouraged to pursue their projects.
One way I think the widget can be improved is by adding docs and examples on how to clean up the new dataset by showing images that are unlike the other downloaded images. For example, if I want to download 100 polar bears and get 95 bears and 5 white bear toys, I want a one-line snippet that invokes ImageCleaner, shows me the outliers and asks whether I want to delete them or keep them.
Resizing in advance doesn’t boost accuracy, but it does improve training speed (with data augmentation, the bottleneck is almost always the CPU). I don’t see a problem with keeping it as a separate instruction to avoid too much complexity.
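To make the max_size discussion above concrete, here is a minimal sketch of what capping an image at a maximum side length means. The helper `capped_size` is hypothetical and is not taken from the fastai source; it only assumes that the longer side is scaled down to max_size while the aspect ratio is kept, and that smaller images are left alone.

```python
def capped_size(width, height, max_size):
    """Return (width, height) scaled so the longer side is at most max_size.

    Hypothetical helper for illustration only; not fastai's implementation.
    """
    longest = max(width, height)
    if longest <= max_size:
        # Already small enough: leave the image dimensions untouched.
        return width, height
    # Integer arithmetic avoids floating-point rounding surprises.
    new_w = max(1, width * max_size // longest)
    new_h = max(1, height * max_size // longest)
    return new_w, new_h


print(capped_size(800, 600, 224))  # (224, 168)
```

An 800x600 download shrinks to 224x168, so every later epoch decodes and augments far fewer pixels, which is where the CPU time is saved.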
Nate, it looks like ImageDownloader does not accept the following search (I’m using Windows 10).
The double quotes (“”) in the search cause a problem when Windows tries to create a folder named after the search string.
This kind of search is very useful when a label is shared by different products, because it filters out plenty of images you don’t want to download (for example, “manga” translates to “mango” in English but also refers to Japanese drawings…).
I’ll add this to the docs, along with verifying images (resizing) after download.
That’s a good catch. Just so I understand, you’re expecting to use advanced Google search syntax, right? I’ll add label escaping to the code.
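As a rough sketch of what such escaping could look like (this is not the actual ImageDownloader code), the hypothetical helper `label_to_folder_name` below replaces the characters Windows rejects in folder names, including the double quote. Reserved device names such as CON or NUL are not handled here.

```python
import re

# Characters Windows does not allow in file or folder names,
# plus ASCII control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def label_to_folder_name(label, replacement="_"):
    """Turn a search string into a folder name that is safe on Windows.

    Hypothetical helper for illustration only.
    """
    name = _FORBIDDEN.sub(replacement, label)
    # Collapse runs of whitespace into single spaces.
    name = re.sub(r"\s+", " ", name)
    # Windows rejects trailing spaces and dots; strip both ends for simplicity.
    name = name.strip(" .")
    return name or "unnamed"


print(label_to_folder_name('mango -"manga drawing"'))  # mango -_manga drawing_
```

The original search string can still be sent to the search engine unchanged; only the folder name derived from it needs sanitizing.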
I absolutely understand the need. I don’t have a plan to add this yet, but I’ll think about how this can be done. Maybe for smaller searches I can show a grid of images you can toggle with a pagination widget, and then you hit a “save” button.
Yes. I’m building a classifier of product varieties. The variety names have 1, 2 or even 3 words. That means I need to run searches like: product_name variety_name1 -"variety_name21 variety_name22"
(with variety_name2 = “variety_name21 variety_name22”).
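A search string of that shape can be assembled programmatically. The sketch below uses a hypothetical `build_query` helper (not part of the widget) and assumes that multi-word terms should be wrapped in double quotes and that excluded terms get a leading minus sign, as in the search above.

```python
def build_query(product, variety, exclude=()):
    """Build a search string that keeps one variety and excludes others.

    Hypothetical helper for illustration only.
    """
    def quote(term):
        # Multi-word terms are quoted so they are matched as a phrase.
        return f'"{term}"' if " " in term else term

    parts = [product, quote(variety)]
    # A leading minus sign asks the search engine to exclude the term.
    parts += [f"-{quote(term)}" for term in exclude]
    return " ".join(parts)


query = build_query(
    "product_name",
    "variety_name1",
    exclude=["variety_name21 variety_name22"],
)
print(query)  # product_name variety_name1 -"variety_name21 variety_name22"
```

Looping over all varieties and passing the others as `exclude` would give one query per class.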
That would be a great help, as I often use the gi2ds snippet to avoid useless downloads.
Honestly, I haven’t touched it for over a year and I’m surprised it still works. Great work porting this to Colab.
I can take a look into the issue, but that’ll likely take a week or so. I’ll ping you guys here with an update, and probably try to also port this to fastai v2 to see how it works internally.