Getting more than 150 images using search_images_bing

Any idea how I can obtain more than 150 images using search_images_bing?

I tried redefining the function with a cnt parameter for the number of images:
def search_images_bing2(key, term, min_sz=128, cnt=150):
    client = api('https://api.cognitive.microsoft.com', auth(key))
    return L(client.images.search(query=term, count=cnt, min_height=min_sz, min_width=min_sz).value)

But I still got 150 images using
results = search_images_bing2(key,f'{o} men',cnt=500)


I had the exact same question!

According to the Bing Image Search API reference, the maximum images you can get in one request is 150 images.

However, using the count and offset query parameters, we can page through and get 150 images at a time through multiple requests.
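To make that paging arithmetic concrete, here is a small pure-Python sketch (the helper name paging_params is mine, not part of any API) that generates the (count, offset) pairs needed to cover a requested total:

```python
def paging_params(total_count, max_count=150):
    """Yield (count, offset) pairs that cover total_count results
    in pages of at most max_count each."""
    for offset in range(0, total_count, max_count):
        yield (min(max_count, total_count - offset), offset)

# For 400 requested images: three requests of 150, 150, and 100.
print(list(paging_params(400)))  # [(150, 0), (150, 150), (100, 300)]
```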

After some debugging, I’ve made the following changes to the search_images_bing() function in the utils.py file:

# pip install azure-cognitiveservices-search-imagesearch

from itertools import chain

from azure.cognitiveservices.search.imagesearch import ImageSearchClient as api
from fastcore.foundation import L
from msrest.authentication import CognitiveServicesCredentials as auth


def search_images_bing(key, term, total_count=150, min_sz=128):
    """Search for images using the Bing API
    
    :param key: Your Bing API key
    :type key: str
    :param term: The search term to search for
    :type term: str
    :param total_count: The total number of images you want to return (default is 150)
    :type total_count: int
    :param min_sz: the minimum height and width of the images to search for (default is 128)
    :type min_sz: int
    :returns: An L-collection of ImageObject
    :rtype: L
    """
    max_count = 150
    client = api("https://api.cognitive.microsoft.com", auth(key))
    imgs = [
        client.images.search(
            query=term, min_height=min_sz, min_width=min_sz, count=count, offset=offset
        ).value
        for count, offset in (
            (
                max_count if total_count - offset > max_count else total_count - offset,
                offset,
            )
            for offset in range(0, total_count, max_count)
        )
    ]
    return L(chain(*imgs))

You can now use the search_images_bing() function in your notebooks as follows:

from utils import *


results = search_images_bing(key, "grizzly bear", 500)

I have not been able to find whether there is a maximum number of times we can apply an offset, for example if Bing runs out of images. So I don’t know what will happen if you request an extremely large number of images (10,000+) and Bing has fewer than that. But as long as you don’t hit that limit, you should be OK.

Please test and let me know how it works for you. If anyone has any suggestions to improve the code, I’d love to know! If it works well, I may put in a pull request to the course repo.


Good Stuff!
Thank you for posting @oddrationale, it worked well for me.

Could you give me a hand and explain how I can change the utils.py file on Gradient?
(I ran your code by copying to the notebook and overriding the function)

I’m not using Gradient, but I think the process should be the same. In the folder that has the chapter notebooks, if you scroll down, you should see a utils.py file. You can open that file, find the search_images_bing() function, and overwrite the code.

Just came across the same issue and wondered if anyone already solved it. Thank you @oddrationale!


Hi, can you give me a hand with this?

I already have my key from Azure, and I’m starting to make the change you described in utils.py. But after running a search, I still end up with an empty folder.


@Alfiesan, let’s check to make sure the search_images_bing() function returned images as we expected.

key = "<YOUR_KEY>"
results = search_images_bing(key, 'grizzly bear')
results

Run that in a cell and make sure results contains ImageObject. If it does, then we’re good. Run download_images() and double-check that the path you passed into get_image_files() is correct.

If results is empty, then we have a problem and probably need more troubleshooting. So let me know.

@oddrationale, I really like how you have extended search_images_bing() to download more than 150 images from bing.

@jeremy would it make sense to include this extended version of search_images_bing() within fastai/fastai/vision/utils.py? I really like the functionality provided by this PR. I understand that it doesn’t make sense to include within fastai/fastbook/utils.py. I think finding images for training vision models is an appropriate addition to fastai itself, as it can be a core process in creating vision models.

I also created a function search_images_ddg and a PR #250 which might be considered for inclusion there as well.


@oddrationale, as the number of downloads increases, one needs to account for duplicate images, as shown here, where len(list(set(ims))) counts unique URLs:

for max_n in [100,150,200,500,1000,2000]:
    results = search_images_bing(key, 'grizzly bear',max_n)
    ims = results.attrgot('content_url')
    print(max_n, len(ims), len(list(set(ims))))
100 100 100
150 150 150
200 200 168
500 500 409
1000 954 788
2000 1280 787

To fix this, you would either need to break the API of search_images_bing() or have the user check for duplicates after calling the function. This is not an issue with the original function: when requesting 150 images or fewer, duplicates do not appear to be a problem.
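Since set() does not preserve order, one way to drop the duplicates while keeping Bing’s ranking is an order-preserving filter. This is just a sketch of that post-processing step (the helper name is mine), independent of search_images_bing() itself:

```python
def unique_urls(urls):
    """Return urls with duplicates removed, preserving first-seen order."""
    seen = set()
    out = []
    for u in urls:
        if u not in seen:
            seen.add(u)
            out.append(u)
    return out

urls = ["a.jpg", "b.jpg", "a.jpg", "c.jpg", "b.jpg"]
print(unique_urls(urls))  # ['a.jpg', 'b.jpg', 'c.jpg']
```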


Thanks, @prairieguy, for the constructive feedback! I appreciate you testing it so thoroughly.

To me, I see that you brought up two separate issues:

  1. Bing Search API does not always return the count that you give it, and
  2. Duplicate content_url values can be returned.

I think these should be handled separately.

Bing Search API returns fewer images

I did some additional testing and I believe this occurs because we’ve reached the maximum Bing will return.

results = L(client.images.search(query="grizzly bear", min_height=128, min_width=128, count=150, offset=1500).value)
len(results)

Output:

54

Here I called the client.images.search() function directly, without using search_images_bing(). I used a count of 150 and a high offset of 1500. Only 54 items were returned.

Let’s try again with a higher offset.

results2 = L(client.images.search(query="grizzly bear", min_height=128, min_width=128, count=150, offset=1800).value)
len(results2)

Output:

54

After increasing the offset, I got the same number of results as before. I did some random sampling, and these two lists seem to have the same content_url values.

print(results.attrgot("content_url")[42])
print(results2.attrgot("content_url")[42])

Output:

https://www.freedomskateshop.at/media/images/product/1800x1200/1grizzly_woodland_camo_cut_out_skateboard_griptape.jpg
https://www.freedomskateshop.at/media/images/product/1800x1200/1grizzly_woodland_camo_cut_out_skateboard_griptape.jpg

I think what is going on here is that once Bing has reached the maximum number of images it has for that search term, it will continue to return the same set of images after a certain offset.

One solution for this is to perhaps just show a warning message to the user, but return the images anyways. Something like:

import logging

found = L(chain(*imgs))
if len(found) < total_count:
    logging.warning(f"Bing only found {len(found)} images for '{term}'. Total requested was {total_count}.")

Duplicate image URLs

For the second issue of duplicate image URLs, I think it should be the user’s responsibility to remove the duplicates. The search_images_bing() function should just return the images that Bing provides. It would be analogous to browsing Bing Image Search manually and seeing duplicate images. The user can decide to keep or remove the duplicate image URLs.

Again, I appreciate you taking the time to read and test my code, which is why I wanted to in turn give my detailed response. Would love to get your thoughts on this and I’ll modify the snippet in my original posting accordingly so that others can benefit!


@oddrationale, a couple of comments:

  • It does appear that there is a relatively low number of images available. Adding a warning as you described makes sense, provided the fewer results are still returned.

  • Since it appears that your PR was not accepted, I think it’s less important that this function needs to maintain the original api.

  • I would change it to include the .attrgot('content_url') call within search_images_bing. This makes sense because the user shouldn’t have to remember why and what that call is about. I think the function should just return an L of image URLs. It seems like a cleaner interface to me: this function returns image URLs, and download_images handles the downloading.

  • Finally, I do think that it would make sense to remove duplicates. I can’t think of many reasons the user would want duplicate images. Moreover, I for one, wouldn’t want to remember to remove duplicates.

  • I went ahead and rewrote the code to reflect these changes. I changed the style to my personal taste, but I believe it reflects your original logic.

def search_images_bing(key, term, total_count=150, min_sz=128):
    """Search for images using the Bing API
    
    :param key: Your Bing API key
    :type key: str
    :param term: The search term to search for
    :type term: str
    :param total_count: The total number of images you want to return (default is 150)
    :type total_count: int
    :param min_sz: the minimum height and width of the images to search for (default is 128)
    :type min_sz: int
    :returns: An L-collection of unique image URLs
    :rtype: L
    """
    max_count = 150
    client = api("https://api.cognitive.microsoft.com", auth(key))
    imgs = []                                                                                                                                         
    for offset in range(0, total_count, max_count):
        count = max_count if total_count - offset > max_count else total_count - offset
        img = client.images.search(query=term,min_height=min_sz,min_width=min_sz,count=count,offset=offset).value
        imgs.append(img) 
    return L(chain(*imgs)).attrgot('content_url').unique()

Regarding duplicates, the other issue I sometimes face with my own scraper is not duplicate URLs, but duplicate images coming from different places, for example if you search for Van Gogh paintings, you’re going to get very many copies of starry night and sunflowers.

I’m planning on writing some duplicate image detection code as soon as I get time but if anyone had already done this, please speak up.


@joedockrill - I have written the function dedupe_images(image_dir:Path)->int: (https://github.com/prairie-guy/ai_utilities/blob/master/image_download.py), which takes a directory of images as input and removes duplicates, returning how many were deleted. It does so at a binary level, so this would not detect different images of the same painting, only multiple identical images. (This happens relatively frequently.)
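For reference, a byte-level dedup along those lines might look like the sketch below. This is my own simplified version using a content hash, not the actual ai_utilities implementation:

```python
import hashlib
from pathlib import Path

def dedupe_files(image_dir):
    """Delete byte-identical duplicate files in image_dir.
    Returns the number of files deleted."""
    seen = set()
    deleted = 0
    for p in sorted(Path(image_dir).iterdir()):
        if not p.is_file():
            continue
        # Hash the full file contents: only exact binary duplicates match
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        if digest in seen:
            p.unlink()  # identical content already kept once
            deleted += 1
        else:
            seen.add(digest)
    return deleted
```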

In the same file, filter_images(image_dir:Path, img_type:str='JPEG')->int will remove non-image files from a directory.

The dependencies for these functions are hashlib and magic.
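A magic-byte check can approximate that filtering even without the magic dependency. The sketch below (my own, not the ai_utilities code) keeps only files whose leading bytes match the requested format:

```python
from pathlib import Path

# Leading bytes of common image formats
SIGNATURES = {
    "JPEG": b"\xff\xd8\xff",
    "PNG": b"\x89PNG\r\n\x1a\n",
    "GIF": b"GIF8",
}

def filter_images(image_dir, img_type="JPEG"):
    """Delete files in image_dir whose leading bytes don't match img_type.
    Returns the number of files deleted."""
    magic_bytes = SIGNATURES[img_type]
    deleted = 0
    for p in Path(image_dir).iterdir():
        if p.is_file() and not p.read_bytes().startswith(magic_bytes):
            p.unlink()
            deleted += 1
    return deleted
```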


I was thinking more of something that would say: this is a different size, slightly different brightness/contrast, and maybe even a slightly different crop, but I think it’s the same picture. Here are all the images I think are the same and the one I think you should keep; press buttons please.

I think I know how to go about it, I just haven’t had time to do anything about it yet.

Sounds like a difficult problem. Not sure how I would even tackle it. Perhaps a neural network itself, in which you manually curate training examples and train for “sameness”. Otherwise, it sounds like lots of case checking of images.

What were you thinking as a general strategy?

Tbh I’ve got all my money on skimage.metrics.structural_similarity at the moment. If that doesn’t quite fit then I’m not sure but I’ll figure it out.

Looked it up. Looks promising. Good luck. I’ve got a bunch of old image classifier data sets that I’ve downloaded from web. I could use them for testing. Happy to test early code if helpful.


You are changing the fastbook module. May I ask whether the changes are expected to work right off the bat? What I mean is: once I make the changes to utils.py, do I need to re-install or restart the kernel?

It doesn’t seem to work right off the bat, after making changes to utils.py.

Thej, I don’t think this change ever made it into fast.ai.

Also, if anyone still cares about getting this working as intended, you should look at totalEstimatedMatches in the original search result to see how many results exist in total, and page through them 150 at a time. Also look at the nextOffset field in conjunction with offset to avoid duplicate results.
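In the Python SDK those response fields surface as total_estimated_matches and next_offset (that mapping is my assumption; check your SDK version). A sketch of that paging loop, with search_page standing in for the client.images.search call:

```python
def collect_results(search_page, page_size=150):
    """Page through image results, using total_estimated_matches to decide
    when to stop and next_offset to decide where to resume.

    search_page(offset, count) is assumed to return an object with
    .value (the page of results), .total_estimated_matches, and
    .next_offset, mirroring the Bing response fields."""
    first = search_page(0, page_size)
    results = list(first.value)
    total = first.total_estimated_matches
    offset = first.next_offset
    while offset < total:
        page = search_page(offset, page_size)
        if not page.value:
            break  # no more results, regardless of the estimate
        results.extend(page.value)
        if page.next_offset <= offset:
            break  # no forward progress: stop to avoid duplicates
        offset = page.next_offset
    return results
```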

I think you have misunderstood my question. My question was about how to see the changes I make to utils.py, because from utils import * didn’t work.