Duck Duck Go code not working

Sorry to ask such a banal question, but I cannot get the Duck Duck Go code to work in Colab. Per this page, the following code should work on Colab:

!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
from fastbook import *

urls = search_images_ddg('grizzly bear', max_images=100)

This throws the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-1f02cd6eb038> in <module>()
      4 from fastbook import *
      5 
----> 6 urls = search_images_ddg('grizzly bear', max_images=100)
      7 len(urls),urls[0]

/usr/local/lib/python3.6/dist-packages/fastbook/__init__.py in search_images_ddg(term, max_images)
     55     assert max_images<1000
     56     url = 'https://duckduckgo.com/'
---> 57     res = urlread(url,data={'q':term}).decode()
     58     searchObj = re.search(r'vqd=([\d-]+)\&', res)
     59     assert searchObj

AttributeError: 'str' object has no attribute 'decode'

I see in the fastai docs that

they do not have an official API, so the function we’ll show here relies on the particular structure of their web interface, which may change.

Perhaps the API changed. Can anyone get this to work?

Hello, you can check the underlying code of the function using:

??search_images_ddg

you’ll get the following output:

def search_images_ddg(term, max_images=200):
    "Search for `term` with DuckDuckGo and return a unique urls of about `max_images` images"
    assert max_images<1000
    url = 'https://duckduckgo.com/'
    res = urlread(url,data={'q':term}).decode()
    searchObj = re.search(r'vqd=([\d-]+)\&', res)
    assert searchObj
    requestUrl = url + 'i.js'
    params = dict(l='us-en', o='json', q=term, vqd=searchObj.group(1), f=',,,', p='1', v7exp='a')
    urls,data = set(),{'next':1}
    while len(urls)<max_images and 'next' in data:
        try:
            data = urljson(requestUrl,data=params)
            urls.update(L(data['results']).itemgot('image'))
            requestUrl = url + data['next']
        except (URLError,HTTPError): pass
        time.sleep(0.2)
    return L(urls)

The issue is in this part of the code:

res = urlread(url,data={'q':term}).decode()

which leads to the error:

AttributeError: 'str' object has no attribute 'decode'
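
To see why, note that urlread here appears to already return a decoded str, so the extra .decode() call fails. A minimal illustration of the failure mode (plain Python, nothing fastai-specific):

res = "already a decoded string"   # what urlread effectively hands back now
try:
    res.decode()                   # only bytes objects have a .decode() method
except AttributeError as e:
    print(e)                       # 'str' object has no attribute 'decode'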

I’d suggest you write a new function, removing the decode method. Something like:

def search_images_ddg_corrected(term, max_images=200):
    "Search for `term` with DuckDuckGo and return a unique urls of about `max_images` images"
    assert max_images<1000
    url = 'https://duckduckgo.com/'
    res = urlread(url,data={'q':term})
    searchObj = re.search(r'vqd=([\d-]+)\&', res)
    assert searchObj
    requestUrl = url + 'i.js'
    params = dict(l='us-en', o='json', q=term, vqd=searchObj.group(1), f=',,,', p='1', v7exp='a')
    urls,data = set(),{'next':1}
    while len(urls)<max_images and 'next' in data:
        try:
            data = urljson(requestUrl,data=params)
            urls.update(L(data['results']).itemgot('image'))
            requestUrl = url + data['next']
        except (URLError,HTTPError): pass
        time.sleep(0.2)
    return L(urls)
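
For example, to reproduce the original call from the question (just a quick sanity check, assuming the rest of the notebook setup is already in place):

urls = search_images_ddg_corrected('grizzly bear', max_images=100)
len(urls), urls[0]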

Maybe we should submit a PR?

Thank you for this. Would you like to make the PR? If not, I can.

You can submit it, no problem!

I was hoping to make a PR on this today, but I see that this fix is already in the GitHub repo. It just hasn’t been pushed to PyPI.

Even with the change suggested here, I’m still getting

TypeError: cannot use a string pattern on a bytes-like object

I suggest using jmd_imagescraper instead; check it out: https://pypi.org/project/jmd-imagescraper/

And for an example check: https://towardsdatascience.com/classifying-cats-vs-dogs-a-beginners-guide-to-deep-learning-4469ffed086c
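
For instance, here is a minimal sketch based on the package's README (the duckduckgo_search call and its parameter names are from memory, so double-check them against the docs above):

# pip install -q jmd_imagescraper
from pathlib import Path
from jmd_imagescraper.core import duckduckgo_search

root = Path.cwd()/'images'
# downloads up to 20 images into images/grizzly/
duckduckgo_search(root, 'grizzly', 'grizzly bear', max_results=20)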

The second one is nice, thanks :+1:

Actually, instead of removing the decode call altogether, you just need to pass decode as a parameter to the urlread function. Like so:

res = urlread(url, data={'q':term}, decode=True)

That should fix the problem. I just ran the code locally. I'm new to the forum so I'm not sure how best to get this fix into the repo, but I wanted to let you know that this approach still works without using the jmd image library.
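
If you want to check that keyword in isolation first (assuming your installed fastcore's urlread accepts decode, as it did for me), something like this should print str rather than bytes:

from fastcore.net import urlread

res = urlread('https://duckduckgo.com/', data={'q': 'grizzly bear'}, decode=True)
print(type(res))  # expect <class 'str'>, so no extra .decode() is needed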

Hope that helps!

This returns another error on my machine:

TypeError: cannot use a string pattern on a bytes-like object

Any idea what needs changing?

Can you post a snippet of code?

Hi @jonathanl and @ACassoni, I wonder if you still have this issue! This is my suggested solution:

# To install, run the following line in the terminal (or prefix it with ! in a notebook cell)
pip install DuckDuckGoImages

# Then in the Jupyter notebook run the following
# (this assumes the usual book setup, e.g. from fastbook import *, so that
#  Path, get_image_files and verify_images below are available)
import DuckDuckGoImages as ddg

bear_types = 'grizzly','black','teddy'
path = Path('bears')
for o in bear_types:
    ims = ddg.download(f'{o} bear', folder=f'./bears/{o}')

fns = get_image_files(path)
fns

failed = verify_images(fns)
failed

I’ve recently worked on a small image classifier using the DuckDuckGo scraper on Colab and I didn’t have any particular problem. This is the code I used:

import json, re, requests
from fastcore.foundation import L  # also comes in via `from fastbook import *`

def search_images_ddg(key, max_n=200):
  """Search for 'key' with DuckDuckGo and return unique urls of up to 'max_n' images
  (Adapted from https://github.com/deepanprabhu/duckduckgo-images-api)
  """
  url = 'https://duckduckgo.com/'
  params = {'q':key}
  res = requests.post(url,data=params)
  searchObj = re.search(r'vqd=([\d-]+)\&',res.text)

  if not searchObj:
    print('Token Parsing Failed !')
    return

  requestUrl = url + 'i.js'
  headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0'}
  params = (('l','us-en'),('o','json'),('q',key),('vqd',searchObj.group(1)),('f',',,,'),('p','1'),('v7exp','a'))
  urls = []

  while True:
    try:
      res = requests.get(requestUrl,headers=headers,params=params)
      data = json.loads(res.text)
      for obj in data['results']:
        urls.append(obj['image'])
        max_n = max_n - 1
        if max_n < 1:
          return L(set(urls))
      if 'next' not in data:
        return L(set(urls))
      requestUrl = url + data['next']
    except:
      # ignore transient request/JSON errors and retry the same page
      pass

And then I used these lines of code to scrape the images I wanted:

toyota_cars = ['4runner', 'land cruiser', 'rav4']
path = Path('/tmp/toyota_cars')

if not path.exists():
  path.mkdir()

for toyota_car in toyota_cars:
  dest = (path/toyota_car)
  dest.mkdir(exist_ok=True)
  urls = search_images_ddg(f'toyota {toyota_car}', max_n=200)
  download_images(dest, urls=urls)

images_path = get_image_files(path)

I hope it can help.

Guys, I’m getting the same error in Colab for this function.

Thanks a lot. I am quite a newbie to GitHub and development in general, and the second link (https://towardsdatascience.com/classifying-cats-vs-dogs-a-beginners-guide-to-deep-learning-4469ffed086c) helped a lot. It walks step by step through the whole process, and thanks to it my lesson 2 code is complete and working.

I was not able to get a Microsoft Azure API key and couldn’t proceed further, but the second link (the Towards Data Science blog) saved me. Thanks for this. :star_struck:

Yes, as noted above, a simpler fix is to drop the .decode() call and pass decode=True to urlread instead.

Great, thanks a lot! I had been puzzling over chapter 2 for a couple of days.

Thank you so much for sharing this with us. It really helped a lot. I also went through the entire post; it is a step-by-step project and I really enjoyed it. Thank you so much once again.

Hi Jonathan,
I am new to all this; can you please share a link to where this change has been made in the GitHub repo? I am not able to find it.