Duck Duck Go code not working

Sorry to ask such a banal question, but I cannot get the Duck Duck Go code to work in Colab. Per this page, the following code should work on Colab:

!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
from fastbook import *

urls = search_images_ddg('grizzly bear', max_images=100)

This throws the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-1f02cd6eb038> in <module>()
      4 from fastbook import *
      5 
----> 6 urls = search_images_ddg('grizzly bear', max_images=100)
      7 len(urls),urls[0]

/usr/local/lib/python3.6/dist-packages/fastbook/__init__.py in search_images_ddg(term, max_images)
     55     assert max_images<1000
     56     url = 'https://duckduckgo.com/'
---> 57     res = urlread(url,data={'q':term}).decode()
     58     searchObj = re.search(r'vqd=([\d-]+)\&', res)
     59     assert searchObj

AttributeError: 'str' object has no attribute 'decode'

I see in the fastai docs that

they do not have an official API, so the function we’ll show here relies on the particular structure of their web interface, which may change.

Perhaps the API changed. Can anyone get this to work?

Hello, you can check the underlying code of the function using:

??search_images_ddg

you’ll get the following output:

def search_images_ddg(term, max_images=200):
    "Search for `term` with DuckDuckGo and return a unique urls of about `max_images` images"
    assert max_images<1000
    url = 'https://duckduckgo.com/'
    res = urlread(url,data={'q':term}).decode()
    searchObj = re.search(r'vqd=([\d-]+)\&', res)
    assert searchObj
    requestUrl = url + 'i.js'
    params = dict(l='us-en', o='json', q=term, vqd=searchObj.group(1), f=',,,', p='1', v7exp='a')
    urls,data = set(),{'next':1}
    while len(urls)<max_images and 'next' in data:
        try:
            data = urljson(requestUrl,data=params)
            urls.update(L(data['results']).itemgot('image'))
            requestUrl = url + data['next']
        except (URLError,HTTPError): pass
        time.sleep(0.2)
    return L(urls)

The issue is in this part of the code:

res = urlread(url,data={'q':term}).decode()

which leads to the error:

AttributeError: 'str' object has no attribute 'decode'
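
To see why, note that urlread here appears to already return a decoded str, so the extra .decode() call fails. A minimal illustration of the failure mode (plain Python, nothing fastai-specific):

res = "already a decoded string"   # what urlread effectively hands back now
try:
    res.decode()                   # only bytes objects have a .decode() method
except AttributeError as e:
    print(e)                       # 'str' object has no attribute 'decode'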

I’d suggest you write a new function, removing the decode method. Something like:

def search_images_ddg_corrected(term, max_images=200):
    "Search for `term` with DuckDuckGo and return a unique urls of about `max_images` images"
    assert max_images<1000
    url = 'https://duckduckgo.com/'
    res = urlread(url,data={'q':term})
    searchObj = re.search(r'vqd=([\d-]+)\&', res)
    assert searchObj
    requestUrl = url + 'i.js'
    params = dict(l='us-en', o='json', q=term, vqd=searchObj.group(1), f=',,,', p='1', v7exp='a')
    urls,data = set(),{'next':1}
    while len(urls)<max_images and 'next' in data:
        try:
            data = urljson(requestUrl,data=params)
            urls.update(L(data['results']).itemgot('image'))
            requestUrl = url + data['next']
        except (URLError,HTTPError): pass
        time.sleep(0.2)
    return L(urls)
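
For example, to reproduce the original call from the question (just a quick sanity check, assuming the rest of the notebook setup is already in place):

urls = search_images_ddg_corrected('grizzly bear', max_images=100)
len(urls), urls[0]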

Maybe we should submit a PR?

Thank you for this. Would you like to make the PR? If not, I can.

You can submit it, no problem!

I was hoping to make a PR on this today, but I see that this fix is already in the GitHub repo. It just hasn’t been pushed to PyPI.

Even with the change suggested here, I’m still getting

TypeError: cannot use a string pattern on a bytes-like object

I suggest using jmd_imagescraper instead; check it out: https://pypi.org/project/jmd-imagescraper/

And for an example check: https://towardsdatascience.com/classifying-cats-vs-dogs-a-beginners-guide-to-deep-learning-4469ffed086c
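
For instance, here is a minimal sketch based on the package's README (the duckduckgo_search call and its parameter names are from memory, so double-check them against the docs above):

# pip install -q jmd_imagescraper
from pathlib import Path
from jmd_imagescraper.core import duckduckgo_search

root = Path.cwd()/'images'
# downloads up to 20 images into images/grizzly/
duckduckgo_search(root, 'grizzly', 'grizzly bear', max_results=20)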

The second one is nice, thanks :+1:

Actually, instead of removing the decode call altogether, you just need to pass decode as a parameter to the urlread function. Like so:

res = urlread(url, data={'q':term}, decode=True)

That should fix the problem. I just ran the code locally. I'm new to the forum so I'm not sure how best to get this fix into the repo, but I wanted to let you know that this approach still works without using the jmd image library.
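
If you want to check that keyword in isolation first (assuming your installed fastcore's urlread accepts decode, as it did for me), something like this should print str rather than bytes:

from fastcore.net import urlread

res = urlread('https://duckduckgo.com/', data={'q': 'grizzly bear'}, decode=True)
print(type(res))  # expect <class 'str'>, so no extra .decode() is needed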

Hope that helps!

This returns another error on my machine:

TypeError: cannot use a string pattern on a bytes-like object

Any idea what needs changing?

Can you post a snippet of code?

Hi @jonathanl and @ACassoni, I wonder if you still have this issue! This is my suggested solution:

# To install, run the following line in the terminal (or prefix it with ! in a notebook cell)
pip install DuckDuckGoImages

# Then in the Jupyter notebook run the following
# (this assumes the usual book setup, e.g. from fastbook import *, so that
#  Path, get_image_files and verify_images below are available)
import DuckDuckGoImages as ddg

bear_types = 'grizzly','black','teddy'
path = Path('bears')
for o in bear_types:
    ims = ddg.download(f'{o} bear', folder=f'./bears/{o}')

fns = get_image_files(path)
fns

failed = verify_images(fns)
failed

I’ve recently worked on a small image classifier using the DuckDuckGo scraper on Colab and I didn’t have any particular problem. This is the code I used:

import json, re, requests
from fastcore.foundation import L  # also comes in via `from fastbook import *`

def search_images_ddg(key, max_n=200):
  """Search for 'key' with DuckDuckGo and return unique urls of up to 'max_n' images
  (Adapted from https://github.com/deepanprabhu/duckduckgo-images-api)
  """
  url = 'https://duckduckgo.com/'
  params = {'q':key}
  res = requests.post(url,data=params)
  searchObj = re.search(r'vqd=([\d-]+)\&',res.text)

  if not searchObj:
    print('Token Parsing Failed !')
    return

  requestUrl = url + 'i.js'
  headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0'}
  params = (('l','us-en'),('o','json'),('q',key),('vqd',searchObj.group(1)),('f',',,,'),('p','1'),('v7exp','a'))
  urls = []

  while True:
    try:
      res = requests.get(requestUrl,headers=headers,params=params)
      data = json.loads(res.text)
      for obj in data['results']:
        urls.append(obj['image'])
        max_n = max_n - 1
        if max_n < 1:
          return L(set(urls))
      if 'next' not in data:
        return L(set(urls))
      requestUrl = url + data['next']
    except:
      # ignore transient request/JSON errors and retry the same page
      pass

And then I used these lines of code to scrape the images I wanted:

toyota_cars = ['4runner', 'land cruiser', 'rav4']
path = Path('/tmp/toyota_cars')

if not path.exists():
  path.mkdir()

for toyota_car in toyota_cars:
  dest = (path/toyota_car)
  dest.mkdir(exist_ok=True)
  urls = search_images_ddg(f'toyota {toyota_car}', max_n=200)
  download_images(dest, urls=urls)

images_path = get_image_files(path)

I hope it can help.

Guys, I’m getting the same error in Colab for this function.

Thanks a lot. I am quite a newbie to GitHub and development in general, and the second link (https://towardsdatascience.com/classifying-cats-vs-dogs-a-beginners-guide-to-deep-learning-4469ffed086c) helped a lot. It walks step by step through the whole process, and thanks to it my lesson 2 code is complete and working.

I was not able to get a Microsoft Azure API key and couldn’t proceed further, but the second link (the Towards Data Science blog) saved me. Thanks for this. :star_struck:

Yes, as noted above, a simpler fix is to drop the .decode() call and pass decode=True to urlread instead.

Great, thanks a lot! I had been puzzling over chapter 2 for a couple of days.

Thank you so much for sharing this with us. It really helped a lot. I also went through the entire post; it is a step-by-step project and I really enjoyed it. Thank you so much once again.

Hi Jonathan,
I am new to all this; can you please share a link to where this change has been made in the GitHub repo? I am not able to find it.