I’ve put together a widget that can search for and download images from Google Search and store them on disk, so that folks can play around with simple CNNs on their own dataset ideas!
It’s built as a part of the existing fast.ai widget system. As easy as ImageDownloader().
It works in both Notebook and Lab.
It’s just 100 LOC.
I’ve included some examples in the documentation in docs_src and updated the docs, including how to create a data bunch and a learner with the images downloaded that way.
It allows folks to pick the resolution they want and how many images they want. It should work for 100+ images per label.
The ugly:
It uses the google_images_download script and I’m not a huge fan of it. It’s under MIT license.
I can also add some examples to docs aimed at novices, maybe in the examples folder? I think it makes sense to show this to novice users: it’s obviously not a serious research tool, more of a playground feature.
One thing that can be improved is to pluck the URL-fetching algorithm and code from google_images_download, then use a parallel version of download_images from basic_data to download the images, and hook all that up to a fastprogress widget. I’m not sure I’ll have time to do that this week or next, though.
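To make the parallel download idea concrete, here is a minimal sketch using only the standard library. It is not the fastai download_images implementation: the names fetch_url and download_all are hypothetical, the .jpg extension is an assumption, and the fastprogress hook is left out.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlopen


def fetch_url(url, timeout=4):
    """Default fetcher (hypothetical helper): return the raw bytes behind `url`."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, dest, fetch=fetch_url, max_workers=8):
    """Download `urls` into `dest` in parallel; return the paths that were written."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)

    def _one(item):
        i, url = item
        try:
            data = fetch(url)
        except Exception:
            return None  # skip failures so one bad URL does not stop the batch
        out = dest / f"{i:08d}.jpg"  # assumes JPEG; a fuller version would keep the real extension
        out.write_bytes(data)
        return out

    # Threads suit this job because the work is dominated by network waits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(_one, enumerate(urls)))
    return [p for p in results if p is not None]
```

Passing the fetcher in as an argument keeps the function easy to try without network access, and pool.map preserves the input order, so file names stay aligned with the URL list.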
Another thing to work on would be to check all the images once they’ve been downloaded and auto-delete the broken ones; otherwise users might have problems with DataLoaders, and with num_workers > 0 those are tricky to debug.
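As a rough illustration of that cleanup step, the sketch below flags files that do not start with a known image signature. It is only a heuristic and an assumption on my part, not how the library checks images: a truncated file with a valid header would still pass, whereas a real check would decode each image. The names looks_like_image and remove_broken are hypothetical.

```python
from pathlib import Path

# Leading bytes of a few common image formats (a heuristic, not a full decode).
SIGNATURES = (
    b"\xff\xd8\xff",        # JPEG
    b"\x89PNG\r\n\x1a\n",   # PNG
    b"GIF87a", b"GIF89a",   # GIF
)


def looks_like_image(path):
    """Return True if the file starts with one of the known image signatures."""
    with open(path, "rb") as f:
        head = f.read(8)
    return head.startswith(SIGNATURES)


def remove_broken(folder, delete=True):
    """Return the files in `folder` that fail the check, deleting them if `delete` is True."""
    broken = [p for p in sorted(Path(folder).iterdir())
              if p.is_file() and not looks_like_image(p)]
    if delete:
        for p in broken:
            p.unlink()
    return broken
```

Running this right after a download catches the most common failure, where a server returns an HTML error page or an empty body saved under an image file name.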
@lesscomfortable, what do you think? Let me know if you’d like me to tweak this a bit, or if we can merge this and tweak it on the go.
Hey, sorry for the delay! This looks very nice indeed. The only problem with this approach (and the reason why we did it with a Javascript script) is that it infringes Google’s TOS. Some people are OK with that, but we thought that the default option should be in line with the TOS. This, however, is definitely an easier approach, and I think it would be very nice to have a gist for those who want to use it. As for merging into the library, that’s for @sgugger to decide.
Yep, I didn’t think about the infringement of Google’s ToS either.
In the meantime, I’ve dropped the dependency, implemented parallel downloads and fastprogress bar.
I’ll check that everything works together, see whether I can add a headless browser option for the >100 images scenario, and document using the function directly instead of the widget.
I’ll then pack it in a pull request and let’s see if @sgugger thinks it’s OK to have this in the repo. No hard feelings if Google’s TOS is a show stopper — we can do a gist, or I’ll pack it into a pypi package and we can link it from the notebooks if you’d like.
I need a day to finish the code, tests and docs. @sgugger, @lesscomfortable, expect an update on this tomorrow morning.
Just a verification. The function verify_images() is implemented in your widget, but I’m not sure it accepts the max_size argument to resize all images (I do not see it in the code of the widget):
verify_images is in vision.data, so ImageDownloader uses the same verification function that’s in the library already. I’m not sure it can resize images on the fly though.
I don’t pass a max resolution to verify_images in the widget code, but you can call verify_images yourself with that argument right after downloading and it should work.
Seems like adding a separate input for max size will confuse novice users: they have search size (i.e. 400x300, 800x600, etc), and there’ll be one more size input — might be too much.
And if I just add an option to pass max_size to the download_google_images() then it’s so small and simple it’s not even worth adding into the download function — you can just do this:
Alternatively, if you’re using the widget and not the download_google_images() function directly, you can just put verify_images(path, max_size=224) in the next cell of the notebook, with path pointing at the download folder, and get the same result.
However, I’m not 100% sure. If there’s a good training performance boost to having everything resized prior to training, I’ll add something in there so that novice users get better results out of the box and get more encouraged to pursue their projects.
One way I think the widget can be improved is if I add some docs and examples on how to clean up the new dataset by showing images that are not like other downloaded images. I.e. if I want to download 100 polar bears and get 95 bears and 5 white bear toys, I want a snippet of code to invoke ImageCleaner with one line of code that’ll show me the outliers and ask if I want to delete them or keep them.
Resizing in advance doesn’t boost accuracy, but it does speed up training (with data augmentation, the bottleneck is almost always the CPU). I don’t see a problem with having it as a separate instruction to avoid too much complexity.
Nate, it looks like ImageDownloader does not accept the following search (I’m using Windows 10).
The use of “” in the search creates a folder name problem when Windows tries to create a folder with the search string.
This is very useful when you search for a label shared by different products, because a plain search returns plenty of images you don’t want to download (for example, “manga” translates to “mango” in English, but it is also the Japanese word for comic drawings…).
I’ll add this to the docs, along with verifying images (resizing) after download.
That’s a good catch. Just so I understand, you’re expecting to use advanced google search syntax, right? I’ll add escaping to labels to the code.
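For illustration, escaping a label before using it as a folder name could look like the sketch below. This is an assumption about one way to do it, not the code that ended up in the widget; label_to_folder is a hypothetical name, and reserved device names such as CON or NUL are not handled.

```python
import re

# Characters Windows does not allow in file or folder names, plus control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def label_to_folder(label, fallback="unnamed"):
    """Turn a search string into a name that is safe to use as a folder."""
    name = _FORBIDDEN.sub("_", label)
    name = re.sub(r"\s+", " ", name)  # collapse runs of whitespace
    name = name.strip(" .")           # Windows rejects trailing spaces and dots
    return name or fallback
```

With this in place the original search string can still be sent to the search engine unchanged, while only the sanitized version touches the file system.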
I absolutely understand the need. I don’t have a plan to add this yet, but I’ll think about how this can be done. Maybe for smaller searches I can show a grid of images you can toggle with a pagination widget, and then you hit a “save” button.
Yes. I’m building a classifier of product varieties. The variety names have 1, 2 or even 3 words. That means I need to make searches like: product_name variety_name1 -"variety_name21 variety_name22"
(with variety_name2 = “variety_name21 variety_name22”).
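A small helper can assemble search strings of that shape. This is only a sketch: build_query is a hypothetical name, and it assumes the search backend honours the leading minus and quoted-phrase operators.

```python
def build_query(product, include=(), exclude=()):
    """Compose a search string: included terms are kept, excluded terms get a leading '-'."""
    def quote(term):
        # Multi-word terms are wrapped in double quotes so they are matched as a phrase.
        return f'"{term}"' if " " in term else term

    parts = [product]
    parts += [quote(t) for t in include]
    parts += ["-" + quote(t) for t in exclude]
    return " ".join(parts)
```

For example, build_query("product_name", ["variety_name1"], ["variety_name21 variety_name22"]) reproduces the search string quoted above.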
That would be a great help, as I often use the gi2ds snippet to avoid useless downloads.