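Building a Bring Your Own Browser (BYOB) Tool for Web Browsing and Summarization

September 26, 2024

Disclaimer: This cookbook is for educational purposes only. Ensure that you comply with all applicable laws and terms of service when using web search and scraping technologies. This cookbook restricts the search to the openai.com domain and retrieves only public information to illustrate the concepts.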

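Large language models (LLMs) such as GPT-4o have a knowledge cutoff date, which means they lack information about events that occurred after that date. In scenarios where the most recent data is essential, the LLM needs access to current web information to give accurate and relevant responses.

In this guide, we will build a Bring Your Own Browser (BYOB) tool in Python to overcome this limitation. Our goal is a system that provides up-to-date answers in your application, including recent developments such as the latest OpenAI product launches. By integrating web search with an LLM, we enable the model to generate responses based on the latest information available online.

Although you can use any publicly available search API, we will use Google's Custom Search API to perform web searches. The information retrieved from the search results is processed and passed to the LLM, which generates the final response through Retrieval-Augmented Generation (RAG).

A Bring Your Own Browser (BYOB) tool lets users perform web browsing tasks programmatically. In this notebook, we will create a BYOB tool that will: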

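#1. Set up a search engine: Use a public search API, such as Google's Custom Search API, to perform web searches and obtain a list of relevant search results.

#2. Build a search dictionary: Collect the title, URL, and summary of each web page from the search results to create a structured dictionary of information.

#3. Generate a RAG response: Implement Retrieval-Augmented Generation (RAG) by passing the gathered information to the LLM, which then generates the final response to the user's query.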

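Setting up a BYOB tool

To provide the model with information about recent events, we will follow these steps:

Step 1: Set up a search engine to provide web search results

Step 2: Build a search dictionary with titles, URLs, and summaries of web pages

Step 3: Pass the information to the model to generate a RAG response to the user query

Before we begin, make sure the following is installed on your machine: Python 3.12 or later. You will also need a Google Custom Search API key and a Custom Search Engine ID (CSE ID). Python packages to install: requests, beautifulsoup4, openai, and python-dotenv (used below to load the API key and CSE ID from a .env file). Also make sure OPENAI_API_KEY is set as an environment variable.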
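The prerequisites above can be checked before any API call is made. The following is a minimal sketch, not part of the original notebook: the helper name missing_env_vars is hypothetical, and the variable names API_KEY and CSE_ID are taken from the os.getenv calls used later in this guide.

```python
import os

# Names used in this guide: OPENAI_API_KEY for the OpenAI client,
# API_KEY and CSE_ID for the Google Custom Search call.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "API_KEY", "CSE_ID")


def missing_env_vars(names=REQUIRED_ENV_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print(f"Missing environment variables: {', '.join(missing)}")
    else:
        print("All required environment variables are set.")
```

If you keep the Google credentials in a .env file, run this check after load_dotenv has been called, otherwise API_KEY and CSE_ID will be reported as missing.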

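Step 1: Set up a search engine to provide web search results

You can use any publicly available web search API for this task. We will configure a custom search engine using Google's Custom Search API. The engine fetches a list of relevant web pages based on the user's query, with a focus on the most recent and most relevant results.

a. Configure the search API key and function: Obtain a Google API key and a Custom Search Engine ID (CSE ID) from the Google Developers Console. You can set up both the API key and the CSE ID from the Programmable Search Engine page.

The search function below sets up the search from the search term, the API key and CSE ID, and the number of search results to return. We introduce a parameter, site_filter, to restrict the output to openai.com only.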

import requests  # For making HTTP requests to APIs and websites

def search(search_item, api_key, cse_id, search_depth=10, site_filter=None):
    service_url = 'https://www.googleapis.com/customsearch/v1'

    params = {
        'q': search_item,
        'key': api_key,
        'cx': cse_id,
        'num': search_depth
    }

    try:
        response = requests.get(service_url, params=params, timeout=10)
        response.raise_for_status()
        results = response.json()

        # 'items' is absent from the response when the search has no hits
        items = results.get('items', [])
        if not items:
            print("No search results found.")
            return []

        if site_filter is None:
            return items

        # Keep only results whose link contains site_filter
        filtered_results = [result for result in items if site_filter in result['link']]
        if not filtered_results:
            print(f"No results with {site_filter} found.")
        return filtered_results

    except requests.exceptions.RequestException as e:
        print(f"An error occurred during the search: {e}")
        return []
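The search function filters by site after the results come back, so a request for 10 results can return fewer once other domains are dropped. As an alternative, the restriction can be sent with the request itself. The sketch below is an assumption-laden variant, not the notebook's method: build_search_params is a hypothetical helper, and it assumes the siteSearch and siteSearchFilter query parameters behave as described in the Custom Search JSON API reference ("i" includes only the given site). It also clamps num to the 1 to 10 range the API accepts per request.

```python
def build_search_params(search_item, api_key, cse_id, search_depth=10, site=None):
    """Build the query parameters for a Custom Search request.

    Hypothetical helper: restricts results on the API side instead of
    filtering the returned links afterwards.
    """
    params = {
        "q": search_item,
        "key": api_key,
        "cx": cse_id,
        # The API accepts between 1 and 10 results per request.
        "num": max(1, min(search_depth, 10)),
    }
    if site:
        params["siteSearch"] = site        # for example "openai.com"
        params["siteSearchFilter"] = "i"   # "i" = include only this site
    return params


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    print(build_search_params("Latest OpenAI product launches", "KEY", "CSE", 25, "openai.com"))
```

The returned dictionary can be passed as params to requests.get in place of the dictionary built inside search.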

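b. Identify the search terms for the search engine: Before we can retrieve specific results from a third-party API, we may need to use query expansion to identify the specific terms our browser search API should retrieve. Query expansion is the process of broadening the original user query by adding related terms, synonyms, or variations. This technique matters because search engines such as Google's Custom Search API are often better at matching a range of related terms than the natural-language prompt a user would type.

For example, searching with only the raw query "List the latest OpenAI product launches in chronological order from latest to oldest in the past 2 years" may return fewer and less relevant results than a more specific and direct search using a succinct phrase such as "Latest OpenAI product launches". In the code below, we use the user's original search_query to produce a more specific search term to use with the Google API to retrieve the results.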

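from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

search_query = "List the latest OpenAI product launches in chronological order from latest to oldest in the past 2 years"

search_term = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Provide a google search term based on search query provided below in 3-4 words"},
        {"role": "user", "content": search_query}]
).choices[0].message.content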

print(search_term)

Latest OpenAI product launches
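The model's reply is free text, so it may arrive wrapped in quotes or followed by extra lines. A minimal sketch of tidying it before it is sent to the search API is shown here; clean_search_term and its max_words cap are hypothetical and not part of the original notebook.

```python
def clean_search_term(raw, max_words=6):
    """Normalize a model-generated search term.

    Keeps the first non-empty line, strips surrounding quotes and
    whitespace, and caps the number of words.
    """
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if not lines:
        return ""
    term = lines[0].strip("\"'` ")
    return " ".join(term.split()[:max_words])


if __name__ == "__main__":
    print(clean_search_term('"Latest OpenAI product launches"\n'))
    # prints: Latest OpenAI product launches
```

An empty return value is a signal to fall back to the original search_query rather than send an empty search.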

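c. Invoke the search function: Now that we have the search term, we invoke the search function to retrieve results from the Google search API. At this point the results contain only the link and snippet of each web page. In the next step, we retrieve more information from each web page and summarize it in a dictionary to pass to the model.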

from dotenv import load_dotenv
import os

load_dotenv('.env')

api_key = os.getenv('API_KEY')
cse_id = os.getenv('CSE_ID')

search_items = search(search_item=search_term, api_key=api_key, cse_id=cse_id, search_depth=10, site_filter="https://openai.com")

for item in search_items:
    print(f"Link: {item['link']}")
    print(f"Snippet: {item['snippet']}\n")

Link: https://openai.com/news/
Snippet: Overview ; Product. Sep 12, 2024. Introducing OpenAI o1 ; Product. Jul 25, 2024. SearchGPT is a prototype of new AI search features ; Research. Jul 18, 2024. GPT- ...

Link: https://openai.com/index/new-models-and-developer-products-announced-at-devday/
Snippet: Nov 6, 2023 ... GPT-4 Turbo with 128K context · We released the first version of GPT-4 in March and made GPT-4 generally available to all developers in July.

Link: https://openai.com/news/product/
Snippet: Discover the latest product advancements from OpenAI and the ways they're being used by individuals and businesses.

Link: https://openai.com/
Snippet: A new series of AI models designed to spend more time thinking before they respond. Learn more · (opens in a new window) ...

Link: https://openai.com/index/sora/
Snippet: Feb 15, 2024 ... We plan to include C2PA metadata(opens in a new window) in the future if we deploy the model in an OpenAI product. In addition to us developing ...

Link: https://openai.com/o1/
Snippet: We've developed a new series of AI models designed to spend more time thinking before they respond. Here is the latest news on o1 research, product and ...

Link: https://openai.com/index/introducing-gpts/
Snippet: Nov 6, 2023 ... We plan to offer GPTs to more users soon. Learn more about our OpenAI DevDay announcements for new models and developer products.

Link: https://openai.com/api/
Snippet: The most powerful platform for building AI products ... Build and scale AI experiences powered by industry-leading models and tools. Start building (opens in a ...
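The call above passes site_filter as a URL prefix, and search keeps any link that contains that string. A substring test also accepts links that merely mention the site, for example in a query string. A stricter check compares the hostname instead. The sketch below uses only the standard library; is_on_site is a hypothetical helper, not part of the original notebook.

```python
from urllib.parse import urlparse


def is_on_site(link, domain):
    """Return True if link's hostname is domain or one of its subdomains."""
    host = (urlparse(link).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    links = [
        "https://openai.com/news/",
        "https://platform.openai.com/docs",
        "https://example.com/?ref=https://openai.com",
    ]
    for link in links:
        print(link, is_on_site(link, "openai.com"))
```

Whether subdomains should count as part of the site is a design choice; drop the endswith branch to accept the exact host only.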

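Step 2: Build a search dictionary with titles, URLs, and summaries of web pages

After obtaining the search results, we extract and organize the relevant information so it can be passed to the LLM for the final output.

a. Scrape web page content: For each URL in the search results, retrieve the web page and extract its text content while filtering out irrelevant data such as scripts and ads, as shown in the function retrieve_content.

b. Summarize content: Use an LLM to generate a concise summary of the scraped content, focusing on information relevant to the user's query. The model can be given the original search text so that it focuses the summary on the search intent, as shown in the function summarize_content.

c. Create a structured dictionary: Organize the data into a dictionary or DataFrame containing the title, link, and summary of each web page. This structure can be passed to the LLM to generate a summary with proper citations.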
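To make the filtering in step a concrete, here is a small sketch that runs on an inline HTML string instead of a live page. It uses the standard library's html.parser as a stand-in for BeautifulSoup, so it is not the notebook's implementation; VisibleTextParser and extract_visible_text are hypothetical names. Like retrieve_content, it drops only script and style content and does not attempt to detect ads.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_visible_text(html, max_chars=None):
    """Return the visible text of an HTML string, optionally truncated."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = parser.text()
    return text if max_chars is None else text[:max_chars]


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{color:red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    )
    print(extract_visible_text(sample))  # prints: Title Hello world
```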

import requests
from bs4 import BeautifulSoup

TRUNCATE_SCRAPED_TEXT = 50000  # Adjust based on your model's context window
SEARCH_DEPTH = 5

def retrieve_content(url, max_tokens=TRUNCATE_SCRAPED_TEXT):
    try:
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        # Drop script and style elements before extracting the text
        soup = BeautifulSoup(response.content, 'html.parser')
        for script_or_style in soup(['script', 'style']):
            script_or_style.decompose()

        text = soup.get_text(separator=' ', strip=True)
        characters = max_tokens * 4  # Approximate conversion: about 4 characters per token
        text = text[:characters]
        return text
    except requests.exceptions.RequestException as e:
        print(f"Failed to retrieve {url}: {e}")
        return None

def summarize_content(content, search_term, character_limit=500):
    prompt = (
        f"You are an AI assistant tasked with summarizing content relevant to '{search_term}'. "
        f"Please provide a concise summary in {character_limit} characters or less."
    )
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": content}]
        )
        summary = response.choices[0].message.content
        return summary
    except Exception as e:
        print(f"An error occurred during summarization: {e}")
        return None

def get_search_results(search_items, character_limit=500):
    # Scrape and summarize each search result.
    # Note: relies on the module-level search_term generated in step 1b.
    results_list = []
    for idx, item in enumerate(search_items, start=1):
        url = item.get('link')
        snippet = item.get('snippet', '')
        web_content = retrieve_content(url, TRUNCATE_SCRAPED_TEXT)

        if web_content is None:
            print(f"Error: skipped URL: {url}")
        else:
            summary = summarize_content(web_content, search_term, character_limit)
            result_dict = {
                'order': idx,
                'link': url,
                'title': snippet,  # the search snippet stands in for the page title
                'Summary': summary
            }
            results_list.append(result_dict)
    return results_list

results = get_search_results(search_items)

for result in results:
    print(f"Search order: {result['order']}")
    print(f"Link: {result['link']}")
    print(f"Snippet: {result['title']}")
    print(f"Summary: {result['Summary']}")
    print('-' * 80)

Search order: 1
Link: https://openai.com/news/
Snippet: Overview ; Product. Sep 12, 2024. Introducing OpenAI o1 ; Product. Jul 25, 2024. SearchGPT is a prototype of new AI search features ; Research. Jul 18, 2024. GPT- ...
Summary: OpenAI recently launched several notable products in 2024, including OpenAI o1 and SearchGPT, a prototype for enhanced AI search capabilities. Additionally, GPT-4o mini was introduced, enhancing cost-efficient intelligence. The organization also rolled out OpenAI for Nonprofits and ChatGPT Edu to support various sectors. Improvements in data analysis within ChatGPT and enhancements to the fine-tuning API were also announced. These updates reflect OpenAI's ongoing commitment to advancing AI technologies across different fields.
--------------------------------------------------------------------------------
Search order: 2
Link: https://openai.com/index/new-models-and-developer-products-announced-at-devday/
Snippet: Nov 6, 2023 ... GPT-4 Turbo with 128K context · We released the first version of GPT-4 in March and made GPT-4 generally available to all developers in July.
Summary: OpenAI's recent DevDay revealed several new products and model updates, including the launch of GPT-4 Turbo with a 128K context window, new pricing, and enhanced multimodal capabilities. Key features include the new Assistants API for developing specialized AI applications, improved function calling, and advanced capabilities like text-to-speech and DALL·E 3 integration. Additionally, OpenAI introduced a Copyright Shield for legal protection and Whisper v3 for improved speech recognition. Pricing reductions and rate limit increases were also announced across several models.
--------------------------------------------------------------------------------
Search order: 3
Link: https://openai.com/news/product/
Snippet: Discover the latest product advancements from OpenAI and the ways they're being used by individuals and businesses.
Summary: As of September 2024, OpenAI has launched several significant products, including OpenAI o1, a versatile AI tool, and SearchGPT, a prototype aimed at enhancing AI-driven search capabilities. Earlier, in May 2024, they introduced OpenAI for Education, emphasizing AI's integration into educational settings. Upcoming enhancements to existing products like GPT-4, DALL·E 3, and ChatGPT are also in focus, continuing OpenAI's mission to innovate across various sectors with cutting-edge AI technologies.
--------------------------------------------------------------------------------
Search order: 4
Link: https://openai.com/
Snippet: A new series of AI models designed to spend more time thinking before they respond. Learn more · (opens in a new window) ...
Summary: OpenAI has recently launched several innovative products, including the OpenAI o1 and o1-mini models which focus on enhanced reasoning capabilities. The partnership with Apple aims to integrate ChatGPT into Apple’s user experience. OpenAI also debuted "Sora," a video generation tool from text prompts, and made significant upgrades to the ChatGPT Enterprise with new compliance tools. The introduction of structured outputs in the API and enhanced data analysis features are also notable advancements, further expanding the utility of AI in various domains.
--------------------------------------------------------------------------------
Search order: 5
Link: https://openai.com/index/sora/
Snippet: Feb 15, 2024 ... We plan to include C2PA metadata(opens in a new window) in the future if we deploy the model in an OpenAI product. In addition to us developing ...
Summary: OpenAI has launched Sora, an innovative AI model capable of generating high-quality text-to-video content. Sora can create videos up to one minute long, simulating complex scenes with motion and character interactions based on user prompts. The model uses advanced diffusion techniques, akin to its predecessors in the GPT and DALL·E families, enabling it to understand and animate real-world physics and nuances. OpenAI is working with external artists and domain experts to ensure safety and accuracy, while gathering feedback for future enhancements before wider release.
--------------------------------------------------------------------------------
Search order: 6
Link: https://openai.com/o1/
Snippet: We've developed a new series of AI models designed to spend more time thinking before they respond. Here is the latest news on o1 research, product and ...
Summary: OpenAI has introduced the o1 series, a new set of AI models aimed at improving response deliberation. This innovation allows models to "think" more before generating replies. The o1 models can be accessed via ChatGPT Plus and through APIs. Other recent advancements include updates to GPT-4, GPT-4o mini, and DALL·E 3. OpenAI continues to focus on enhancing product offerings for individual, team, and enterprise use, reflecting its commitment to research and safety in AI technologies.
--------------------------------------------------------------------------------
Search order: 7
Link: https://openai.com/index/introducing-gpts/
Snippet: Nov 6, 2023 ... We plan to offer GPTs to more users soon. Learn more about our OpenAI DevDay announcements for new models and developer products.
Summary: On November 6, 2023, OpenAI launched "GPTs," allowing users to create customized versions of ChatGPT tailored to specific tasks without needing coding skills. These custom GPTs can assist in various activities, from learning games to workplace tasks. The upcoming GPT Store will feature creations from users, making them searchable and shareable. Enterprise users can develop internal-only versions, enhancing workplace productivity. Additionally, ChatGPT Plus users benefit from an improved interface that consolidates features like DALL·E and data analysis.
--------------------------------------------------------------------------------
Search order: 8
Link: https://openai.com/api/
Snippet: The most powerful platform for building AI products ... Build and scale AI experiences powered by industry-leading models and tools. Start building (opens in a ...
Summary: OpenAI has launched several notable products, including GPT-4o and GPT-4o mini, designed for complex and lightweight tasks respectively, both featuring a 128k context length. New models like OpenAI o1-preview and o1-mini enhance reasoning capabilities. The API platform offers various tools for building AI applications, including Chat Completions, Assistants, and Batch APIs. Enhanced customization options include Fine-tuning and a Custom Model Program. OpenAI's enterprise features emphasize security, compliance, and dedicated support, facilitating widespread innovative applications across sectors.
--------------------------------------------------------------------------------

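We retrieved the most recent results. (Note that these results will vary depending on when you run this script.)

Step 3: Pass the information to the model to generate a RAG response to the user query

With the search data organized in a JSON data structure, we pass this information to the LLM along with the original user query to generate the final response. The LLM response now includes information beyond its original knowledge cutoff, providing up-to-date insights.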

import json

final_prompt = (
    f"The user will provide a dictionary of search results in JSON format for search query {search_term}. Based on the search results provided by the user, provide a detailed response to this query: **'{search_query}'**. Make sure to cite all the sources at the end of your answer."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": final_prompt},
        {"role": "user", "content": json.dumps(results)}],
    temperature=0
)
summary = response.choices[0].message.content

print(summary)

Based on the search results provided, here is a chronological list of the latest OpenAI product launches from the past two years, ordered from the most recent to the oldest:

1. **September 12, 2024**: **OpenAI o1**
   - A versatile AI tool designed to enhance reasoning capabilities.
   - Source: [OpenAI News](https://openai.com/news/)

2. **July 25, 2024**: **SearchGPT**
   - A prototype aimed at enhancing AI-driven search capabilities.
   - Source: [OpenAI News](https://openai.com/news/)

3. **July 18, 2024**: **GPT-4o mini**
   - A cost-efficient intelligence model.
   - Source: [OpenAI News](https://openai.com/news/)

4. **May 2024**: **OpenAI for Education**
   - Focuses on integrating AI into educational settings.
   - Source: [OpenAI News](https://openai.com/news/product/)

5. **February 15, 2024**: **Sora**
   - An AI model capable of generating high-quality text-to-video content.
   - Source: [OpenAI Sora](https://openai.com/index/sora/)

6. **November 6, 2023**: **GPT-4 Turbo**
   - Features a 128K context window and enhanced multimodal capabilities.
   - Source: [OpenAI DevDay](https://openai.com/index/new-models-and-developer-products-announced-at-devday/)

7. **November 6, 2023**: **GPTs**
   - Allows users to create customized versions of ChatGPT tailored to specific tasks.
   - Source: [OpenAI DevDay](https://openai.com/index/introducing-gpts/)

8. **March 2023**: **GPT-4**
   - The first version of GPT-4 was released.
   - Source: [OpenAI DevDay](https://openai.com/index/new-models-and-developer-products-announced-at-devday/)

9. **July 2023**: **GPT-4 General Availability**
   - GPT-4 was made generally available to all developers.
   - Source: [OpenAI DevDay](https://openai.com/index/new-models-and-developer-products-announced-at-devday/)

10. **2023**: **Whisper v3**
    - An improved speech recognition model.
    - Source: [OpenAI DevDay](https://openai.com/index/new-models-and-developer-products-announced-at-devday/)

11. **2023**: **DALL·E 3 Integration**
    - Enhanced capabilities for generating images from text prompts.
    - Source: [OpenAI DevDay](https://openai.com/index/new-models-and-developer-products-announced-at-devday/)

12. **2023**: **Assistants API**
    - For developing specialized AI applications.
    - Source: [OpenAI DevDay](https://openai.com/index/new-models-and-developer-products-announced-at-devday/)

13. **2023**: **Copyright Shield**
    - Legal protection for AI-generated content.
    - Source: [OpenAI DevDay](https://openai.com/index/new-models-and-developer-products-announced-at-devday/)

14. **2023**: **OpenAI for Nonprofits**
    - Support for various sectors through AI.
    - Source: [OpenAI News](https://openai.com/news/)

15. **2023**: **ChatGPT Edu**
    - Aimed at educational support.
    - Source: [OpenAI News](https://openai.com/news/)

16. **2023**: **ChatGPT Enterprise**
    - New compliance tools and enhanced data analysis features.
    - Source: [OpenAI](https://openai.com/)

17. **2023**: **OpenAI o1-mini**
    - A lightweight version of the OpenAI o1 model.
    - Source: [OpenAI](https://openai.com/)

18. **2023**: **OpenAI o1-preview**
    - An early version of the OpenAI o1 model.
    - Source: [OpenAI](https://openai.com/api/)

19. **2023**: **Custom Model Program**
    - Enhanced customization options for AI models.
    - Source: [OpenAI](https://openai.com/api/)

20. **2023**: **Fine-tuning API Enhancements**
    - Improvements to the fine-tuning API.
    - Source: [OpenAI News](https://openai.com/news/)

### Sources:
- [OpenAI News](https://openai.com/news/)
- [OpenAI DevDay](https://openai.com/index/new-models-and-developer-products-announced-at-devday/)
- [OpenAI Sora](https://openai.com/index/sora/)
- [OpenAI API](https://openai.com/api/)
- [OpenAI](https://openai.com/)
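The final prompt asks the model to cite its sources, but nothing in the pipeline confirms that the cited links are the ones that were retrieved. A minimal sketch of such a check follows. It is not part of the original notebook: check_citations is a hypothetical helper, and it assumes the answer uses markdown links of the form [label](url), as in the output above.

```python
import re

# Matches markdown links and captures the URL: [label](https://...)
MARKDOWN_LINK = re.compile(r"\[[^\]]*\]\((https?://[^\s)]+)\)")


def check_citations(answer, results):
    """Compare links cited in the answer with links in the search results.

    Returns the cited links that were never retrieved ("unknown") and the
    retrieved links that were never cited ("unused").
    """
    cited = set(MARKDOWN_LINK.findall(answer))
    known = {result["link"] for result in results}
    return {
        "unknown": sorted(cited - known),
        "unused": sorted(known - cited),
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample_results = [
        {"link": "https://openai.com/news/"},
        {"link": "https://openai.com/api/"},
    ]
    sample_answer = (
        "See [OpenAI News](https://openai.com/news/) "
        "and [Blog](https://example.com/post)."
    )
    print(check_citations(sample_answer, sample_results))
```

With the notebook's own variables, the call would be check_citations(summary, results). A non-empty "unknown" list suggests the model cited a page that was not in the search dictionary.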

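Conclusion

Large language models (LLMs) have a knowledge cutoff and may not be aware of recent events. To provide them with the latest information, you can build a Bring Your Own Browser (BYOB) tool in Python. This tool retrieves current web data and feeds it to the LLM, enabling up-to-date responses.

The process involves three main steps:

#1. Set up a search engine: Use a public search API, such as Google's Custom Search API, to perform web searches and obtain a list of relevant search results.

#2. Build a search dictionary: Collect the title, URL, and summary of each web page from the search results to create a structured dictionary of information.

#3. Generate a RAG response: Implement Retrieval-Augmented Generation (RAG) by passing the gathered information to the LLM, which then generates the final response to the user's query.

By following these steps, you can enhance your LLM's ability to provide up-to-date answers in your application, including recent developments such as the latest OpenAI product launches.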