Resolved: HTTP403ForbiddenError in search_images functions

thomaspurk · March 22, 2024, 2:03am

To refresh my Python and Jupyter skills before tackling the rest of the “Practical Deep Learning for Coders” coursework, I coded along with Mr. Howard’s presentation (Practical Deep Learning for Coders: Lesson 1) in my own notebook.

I encountered the HTTP403ForbiddenError shown in the following screen capture. Quite a few people have reported this error, but I did not see this direct solution posted elsewhere.

After some print-statement debugging in JupyterLab, I found that the HTTP request fails on the second iteration of the loop. The problem is an incompatibility between the request parameters sent as POST data on line 17 (second screen capture) and the URL parameters appended on line 20 (now commented out). I assume the urljson function either cannot handle both, or the URL parameters override the POST data parameters.

The purpose of line 20 seems to be to increment the “s” (start) parameter so that each subsequent request starts at a higher position and grabs the next “page” of results: the first iteration pulls down results 0-99 and the second 100-199. That way we don’t get duplicate URLs on every iteration.

The params dictionary defined on line 11 contains the required vqd parameter, whereas the data['next'] parameter string does not, so line 17 must stay. The start value therefore has to be incremented by 100 on a separate line and written back to the “s” member of the params dictionary (lines 14, 21, and 22).
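For readers who cannot see the screen captures, here is a minimal sketch of the paging idea described above. It is not the notebook's actual code: the function name collect_image_urls and the fetch_page callable are hypothetical stand-ins (fetch_page plays the role of the urljson call that POSTs params), and the response shape with "results", "image" and "next" keys is assumed from the description in this post.

```python
import time

def collect_image_urls(fetch_page, params, max_images=200, page_size=100, delay=0.2):
    """Page through results by bumping params["s"] instead of following data["next"].

    fetch_page is a hypothetical callable that sends `params` as the request
    data and returns the decoded JSON response as a dict.
    """
    params = dict(params)              # copy so the caller's dict is not mutated
    start = int(params.get("s", 0))    # offset of the first result to request
    urls, seen = [], set()
    data = {"next": True}              # sentinel so the loop runs at least once
    while len(urls) < max_images and "next" in data:
        params["s"] = start            # vqd and the other params are sent on every request
        data = fetch_page(params)
        for result in data.get("results", []):
            image = result["image"]
            if image not in seen:      # skip duplicate URLs across pages
                seen.add(image)
                urls.append(image)
        start += page_size             # the next request asks for the next page
        time.sleep(delay)              # small pause between requests
    return urls[:max_images]
```

Passing the request function in as an argument is only a convenience for trying the loop without network access; in the notebook the call to urljson would sit directly inside the loop.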

After making these changes, the function performed as expected and 200 unique bird and 200 unique forest images were downloaded.

ren_rsa · March 24, 2024, 1:27pm

I managed to cobble together an alternative function to retrieve images from DuckDuckGo:

import httpx
from urllib.parse import quote_plus

def get_images(keywords, max_results=None):
    """Return a list of image URLs from DuckDuckGo for the given keywords."""
    # URL-encode the keywords so spaces and special characters survive in the query string
    query = quote_plus(keywords)
    url = f"https://duckduckgo.com/?va=f&t=hg&q={query}&iax=images&ia=images"
    # Browser-like headers sent with both requests
    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "accept-language": "en-US,en;q=0.9,hi;q=0.8",
        "cache-control": "max-age=0",
        "sec-ch-ua": "\"Google Chrome\";v=\"113\", \"Chromium\";v=\"113\", \"Not-A.Brand\";v=\"24\"",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "\"Windows\"",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "sec-gpc": "1",
        "upgrade-insecure-requests": "1",
        "cookie": "p=-2; ah=in-en; l=in-en",
        "Referer": "https://duckduckgo.com/",
        "Referrer-Policy": "origin"
    }

    # First request: load the search page, which embeds the vqd token
    with httpx.Client() as client:
        response = client.get(url, headers=headers)
    resp = response.text

    # Extract the token from the vqd="..." attribute in the page source
    vqd_index_start = resp.index('vqd="') + 5
    vqd_index_end = resp.index('"', vqd_index_start)
    vqd = resp[vqd_index_start:vqd_index_end]

    # Second request: query the JSON image endpoint with the token
    images_url = f"https://duckduckgo.com/i.js?o=json&q={query}&vqd={vqd}"
    with httpx.Client() as client:
        response = client.get(images_url, headers=headers)

    results = response.json()["results"]
    if max_results:
        results = results[:max_results]
    return [res['image'] for res in results]

anjoal · March 25, 2024, 1:33pm

Hi @thomaspurk!

Thank you for resolving this. I started the course today and the code was giving me this error.

As a last note, for your code to work we must also import the time library; otherwise the call to time.sleep(0.2) fails because the name time is not defined.