Lesson 1 - Problems downloading data for classifier

Hi guys,

I’m trying to build a plant classifier. From Wikipedia, I got a list of plants and their Latin names, and I was trying to build on the example in lesson 1 to create the classifier.

However, on certain plant names, I’m getting a KeyError during the download_images execution:

import shutil

path = Path('OregonPlants')

def search_to_folder(x): return '_'.join(x.split(' ')[-2:])

for o in search_terms:
    dnld = (path/'dnld')
    dest = (path/search_to_folder(o))
    dnld.mkdir(exist_ok=True, parents=True)
    dest.mkdir(exist_ok=True, parents=True)
    print(o)
    download_images(dnld, urls=search_images(f'{o} photo'), max_pics=20)
    resize_images(dnld, max_size=400, dest=dest)
    shutil.rmtree(dnld)

and get the following output:

Vaccinium ovalifolium Alaska blueberry
Daucus pusillus American wild
Melica aristata Awned melic
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [37], in <cell line: 7>()
     11 dest.mkdir(exist_ok=True, parents=True)
     12 print(o)
---> 13 download_images(dnld, urls=search_images(f'{o} photo'), max_pics=20)
     14 resize_images(dnld, max_size=400, dest=dest)
     15 shutil.rmtree(dnld)

Input In [14], in search_images(term, max_images)
     12     data = urljson(requestUrl,data=params)
     13     urls.update(L(data['results']).itemgot('image'))
---> 14     requestUrl = url + data['next']
     15     time.sleep(0.2)
     16 return L(urls)[:max_images]

KeyError: 'next'

I haven’t changed the download_images function from the notebook, but it seems to get hung up on specific examples. Any ideas what could be happening? A quick manual search on DDG for the failing term shows plenty of image results for the string.

Hello!

It seems that the error is coming from search_images.

KeyError: 'next' means that Python tried to look up the key 'next' in the variable called data on line 14 and failed. However, lines 12 to 15 are inside a while loop whose condition should prevent entering the body if 'next' is not in data. I paste below the function as I see it in the Kaggle notebook:

def search_images(term, max_images=200):
    url = 'https://duckduckgo.com/'
    res = urlread(url,data={'q':term})
    searchObj = re.search(r'vqd=([\d-]+)\&', res)
    requestUrl = url + 'i.js'
    params = dict(l='us-en', o='json', q=term, vqd=searchObj.group(1), f=',,,', p='1', v7exp='a')
    urls,data = set(),{'next':1}
    while len(urls)<max_images and 'next' in data:
        data = urljson(requestUrl,data=params)
        urls.update(L(data['results']).itemgot('image'))
        requestUrl = url + data['next']
        time.sleep(0.2)
    return L(urls)[:max_images]

Does your function look the same?


Hi Amalia,

Indeed, the function looks the same. I just reused the one from the lesson 1 notebook; the only change in my workflow is that I use the species I found on Wikipedia. The strange thing is that download_images successfully gets images for some species but gets stuck on others. I tried removing a problematic one, but another one later fails the same way.

Thanks for your help.

Apparently this error occurs when the search fails to return enough URLs and the response no longer contains a ‘next’ key.

I also removed rare search terms and reduced the number of URLs requested.
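Exactly — the loop condition only checks 'next' in the *previous* response, so when a fresh response arrives without a 'next' key, the line requestUrl = url + data['next'] raises before the condition is re-checked. A defensive tweak (my own sketch, not the official notebook code — fetch here stands in for the urljson call) is to read the key with .get() and break as soon as pagination ends:

```python
# Sketch of a defensive pagination loop. `fetch` is a stand-in for the
# urljson(requestUrl, data=params) call in search_images.
def collect_urls(fetch, max_images=200):
    urls, next_token = set(), ''
    while len(urls) < max_images:
        data = fetch(next_token)
        urls.update(data.get('results', []))
        # Stop cleanly when the API stops returning a 'next' token,
        # instead of raising KeyError like data['next'] would.
        next_token = data.get('next')
        if next_token is None:
            break
    return list(urls)[:max_images]

# Fake two-page response for illustration: the last page has no 'next'.
pages = [
    {'results': ['a.jpg', 'b.jpg'], 'next': 'i.js?page=2'},
    {'results': ['c.jpg']},
]
def fake_fetch(token, _it=iter(pages)): return next(_it)

print(sorted(collect_urls(fake_fetch)))  # → ['a.jpg', 'b.jpg', 'c.jpg']
```

In the real search_images you'd apply the same idea by replacing requestUrl = url + data['next'] with a data.get('next') check that breaks out of the loop when the key is absent.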