Building a Voice Assistant with the Agents SDK

Mar 27, 2025

Imagine you're the AI lead at a consumer tech company. You have a vision of deploying a single-entry-point digital voice assistant that can help users with any query, whether they want to take action on their account, find product information, or receive real-time guidance.

However, making this vision a reality can be extremely difficult - it requires building and testing the capability to handle each individual use case through text first, integrating access to the various tools and systems they require, and somehow orchestrating them into a coherent experience. Then, once you reach a satisfactory level of quality (and even assessing that can be hard), you face the daunting task of re-architecting the entire workflow for voice interaction.

Fortunately for you, three recent OpenAI releases make realising this vision simpler than ever, providing the tools to build and orchestrate modular agentic workflows through voice with minimal configuration:

  • Responses API - an agentic API for easy interaction with our frontier models through hosted, stateful conversations, response tracking for evals, and built-in tools for file search, web search, computer use, and more
  • Agents SDK - a lightweight, customizable open-source framework for building and orchestrating workflows across many different agents, allowing your assistant to route inputs to the appropriate agent and scale to support many use cases
  • Voice agents - an extension of the Agents SDK that supports voice pipelines, enabling your agents to go from text-based to interpreting and generating audio in just a few lines of code

This cookbook demonstrates how to use the tools above to build a simple in-app voice assistant for a fictional consumer application. We'll create a triage agent that welcomes the user, determines their intent, and routes requests to one of three specialised agents:

  • Search agent - performs web searches via the Responses API's built-in tool to provide real-time information about the user's query
  • Knowledge agent - leverages the Responses API's file search tool to retrieve information from an OpenAI-managed vector store
  • Account agent - uses function calling to provide the ability to trigger custom actions via an API

Finally, we'll use the Agents SDK's voice capabilities to turn this workflow into a real-time voice assistant, capturing microphone input, performing speech-to-text, routing through our agents, and responding with text-to-speech.

Setup

To execute this cookbook, you'll need to install the following packages, which provide access to the OpenAI API, the Agents SDK, and audio processing libraries. In addition, you can set your OpenAI API key for the agents to use via the set_default_openai_key function.

%pip install openai
%pip install openai-agents 'openai-agents[voice]'
%pip install numpy
%pip install sounddevice
from agents import Agent, function_tool, WebSearchTool, FileSearchTool, set_default_openai_key
from agents.extensions.handoff_prompt import prompt_with_handoff_instructions

set_default_openai_key("YOUR_API_KEY")
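Hardcoding a key is fine for a quick notebook run, but a safer pattern is to read it from the environment. Below is a minimal sketch, assuming you have exported OPENAI_API_KEY in your shell; the helper name is our own, not part of the SDK:

```python
import os

def resolve_api_key(env=os.environ) -> str:
    # Look up the key from the environment rather than embedding it in source.
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before running this cookbook.")
    return key

# You could then pass the result to the SDK:
# set_default_openai_key(resolve_api_key())
```

This keeps secrets out of version control and lets the same notebook run unchanged across environments.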

Defining Agents and Tools

Today we'll be building an assistant for our fictional consumer application, the ACME shop, initially focused on supporting three key use cases:

  • Answering real-time questions with web search to inform purchasing decisions
  • Providing information on the options available in our product portfolio
  • Providing account information so users can understand their budget and spending

To achieve this, we'll use an agentic architecture. This lets us split the functionality for each use case into a separate agent, reducing the complexity/scope of tasks any single agent may be asked to complete and improving accuracy. Our agent architecture is relatively simple, focused on the three use cases above, but the beauty of the Agents SDK is that it's trivial to extend and add further agents to the workflow when you want to add new functionality:

Agent Architecture
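To make the routing idea concrete before we wire up the real agents, here is a toy, keyword-based stand-in for the triage step. It is illustrative only: in the actual workflow the LLM performs this routing via handoffs, and these keywords are our own invention.

```python
def toy_triage(query: str) -> str:
    """Crude keyword router mimicking the triage agent's intent routing."""
    q = query.lower()
    if "account" in q or "balance" in q:
        return "AccountAgent"      # account-related queries
    if "product" in q or "catalogue" in q:
        return "KnowledgeAgent"    # product portfolio FAQs
    return "SearchAgent"           # anything needing real-time search
```

Adding a fourth use case would just be another branch here - or, with the SDK, another entry in the triage agent's handoffs list.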

Search Agent

Our first agent is a simple web search agent that uses the WebSearchTool provided by the Responses API to find real-time information about the user's query. We'll keep the instruction prompts simple for each of these examples, but we'll iterate later to show how to optimise the response format for your use case.

# --- Agent: Search Agent ---
search_agent = Agent(
    name="SearchAgent",
    instructions=(
        "You immediately provide an input to the WebSearchTool to find up-to-date information on the user's query."
    ),
    tools=[WebSearchTool()],
)

Knowledge Agent

Our second agent needs to be able to answer questions about our product portfolio. For this we'll use the FileSearchTool to retrieve information from an OpenAI-managed vector store containing our company-specific product information. Here we have two options:

  1. Use the OpenAI platform website - go to platform.openai.com/storage and create a vector store, uploading the documents of your choice. Then, take the vector store ID and substitute it into the FileSearchTool initialisation below.

  2. Use the OpenAI API - create a vector store with the vector_stores.create function of the OpenAI Python client, then add files to it with the vector_stores.files.create function. Once complete, you can again use the FileSearchTool to search the vector store. See the code example below for how to do this, either with the provided example file or by changing the path to your own local files.

from openai import OpenAI
import os

client = OpenAI(api_key='YOUR_API_KEY')

def upload_file(file_path: str, vector_store_id: str):
    file_name = os.path.basename(file_path)
    try:
        file_response = client.files.create(file=open(file_path, 'rb'), purpose="assistants")
        attach_response = client.vector_stores.files.create(
            vector_store_id=vector_store_id,
            file_id=file_response.id
        )
        return {"file": file_name, "status": "success"}
    except Exception as e:
        print(f"Error with {file_name}: {str(e)}")
        return {"file": file_name, "status": "failed", "error": str(e)}

def create_vector_store(store_name: str) -> dict:
    try:
        vector_store = client.vector_stores.create(name=store_name)
        details = {
            "id": vector_store.id,
            "name": vector_store.name,
            "created_at": vector_store.created_at,
            "file_count": vector_store.file_counts.completed
        }
        print("Vector store created:", details)
        return details
    except Exception as e:
        print(f"Error creating vector store: {e}")
        return {}
    
vector_store_details = create_vector_store("ACME Shop Product Knowledge Base")
upload_file("voice_agents_knowledge/acme_product_catalogue.pdf", vector_store_details["id"])

With the vector store in place, we can now enable the knowledge agent to search the given store ID with the FileSearchTool.

# --- Agent: Knowledge Agent ---
knowledge_agent = Agent(
    name="KnowledgeAgent",
    instructions=(
        "You answer user questions on our product portfolio with concise, helpful responses using the FileSearchTool."
    ),
    tools=[FileSearchTool(
            max_num_results=3,
            vector_store_ids=["VECTOR_STORE_ID"],
        ),],
)

Account Agent

So far we've used the built-in tools provided by the Agents SDK, but you can define your own tools for agents to use to integrate with your systems via the function_tool decorator. Here we'll define a simple dummy function that returns account information for a given user ID for our account agent.

# --- Tool 1: Fetch account information (dummy) ---
@function_tool
def get_account_info(user_id: str) -> dict:
    """Return dummy account info for a given user."""
    return {
        "user_id": user_id,
        "name": "Bugs Bunny",
        "account_balance": "£72.50",
        "membership_status": "Gold Executive"
    }

# --- Agent: Account Agent ---
account_agent = Agent(
    name="AccountAgent",
    instructions=(
        "You provide account information based on a user ID using the get_account_info tool."
    ),
    tools=[get_account_info],
)

For more information on function calling with the Agents SDK, see the Agents SDK documentation.
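Roughly speaking, function_tool builds the tool's description for the model from the function's signature and docstring. The sketch below shows the raw ingredients it reads, using only the standard library - a simplified illustration of the idea, not the SDK's actual implementation, and get_order_status is a hypothetical helper of our own:

```python
import inspect
from typing import get_type_hints

# Hypothetical tool function (our own example, not one of the cookbook's tools).
def get_order_status(order_id: str) -> dict:
    """Return the shipping status for a given order."""
    return {"order_id": order_id, "status": "shipped"}

# The decorator can recover the parameter names, their type annotations,
# and the docstring to describe the tool and its arguments to the model.
params = list(inspect.signature(get_order_status).parameters)
hints = get_type_hints(get_order_status)
description = inspect.getdoc(get_order_status)
```

This is why well-typed parameters and a clear docstring matter: they become the model's only documentation for your tool.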

Finally, we'll define the triage agent, which routes user queries to the appropriate agent based on the user's intent. Here we use the prompt_with_handoff_instructions function, which provides additional guidance on how to handle handoffs and is recommended for any agent that has a defined set of handoffs alongside its instructions.

# --- Agent: Triage Agent ---
triage_agent = Agent(
    name="Assistant",
    instructions=prompt_with_handoff_instructions("""
You are the virtual assistant for Acme Shop. Welcome the user and ask how you can help.
Based on the user's intent, route to:
- AccountAgent for account-related queries
- KnowledgeAgent for product FAQs
- SearchAgent for anything requiring real-time web search
"""),
    handoffs=[account_agent, knowledge_agent, search_agent],
)

Running the Workflow

Now that we've defined our agents, we can run the workflow on some example queries and see how it performs.

from agents import Runner, trace

async def test_queries():
    examples = [
        "What's my ACME account balance doc? My user ID is 1234567890", # Account Agent test
        "Ooh i've got money to spend! How big is the input and how fast is the output of the dynamite dispenser?", # Knowledge Agent test
        "Hmmm, what about duck hunting gear - what's trending right now?", # Search Agent test

    ]
    with trace("ACME App Assistant"):
        for query in examples:
            result = await Runner.run(triage_agent, query)
            print(f"User: {query}")
            print(result.final_output)
            print("---")
# Run the tests
await test_queries()
User: What's my ACME account balance doc? My user ID is 1234567890
Your ACME account balance is £72.50. You have a Gold Executive membership.
---
User: Ooh i've got money to spend! How big is the input and how fast is the output of the dynamite dispenser?
The Automated Dynamite Dispenser can hold up to 10 sticks of dynamite and dispenses them at a speed of 1 stick every 2 seconds.
---
User: Hmmm, what about duck hunting gear - what's trending right now?
Staying updated with the latest trends in duck hunting gear can significantly enhance your hunting experience. Here are some of the top trending items for the 2025 season:

**Banded Aspire Catalyst Waders**  
These all-season waders feature waterproof-breathable technology, ensuring comfort in various conditions. They boast a minimal-stitch design for enhanced mobility and include PrimaLoft Aerogel insulation for thermal protection. Additional features like an over-the-boot protective pant and an integrated LED light in the chest pocket make them a standout choice. ([blog.gritroutdoors.com](https://blog.gritroutdoors.com/must-have-duck-hunting-gear-for-a-winning-season/?utm_source=openai))

**Sitka Delta Zip Waders**  
Known for their durability, these waders have reinforced shins and knees with rugged foam pads, ideal for challenging terrains. Made with GORE-TEX material, they ensure dryness throughout the season. ([blog.gritroutdoors.com](https://blog.gritroutdoors.com/must-have-duck-hunting-gear-for-a-winning-season/?utm_source=openai))

**MOmarsh InvisiMan Blind**  
This one-person, low-profile blind is praised for its sturdiness and ease of setup. Hunters have reported that even late-season, cautious ducks approach without hesitation, making it a valuable addition to your gear. ([bornhunting.com](https://bornhunting.com/top-duck-hunting-gear/?utm_source=openai))

**Slayer Calls Ranger Duck Call**  
This double reed call produces crisp and loud sounds, effectively attracting distant ducks in harsh weather conditions. Its performance has been noted for turning the heads of ducks even at extreme distances. ([bornhunting.com](https://bornhunting.com/top-duck-hunting-gear/?utm_source=openai))

**Sitka Full Choke Pack**  
A favorite among hunters, this backpack-style blind bag offers comfort and efficiency. It has proven to keep gear dry during heavy downpours and is durable enough to withstand over 60 hunts in a season. ([bornhunting.com](https://bornhunting.com/top-duck-hunting-gear/?utm_source=openai))

Incorporating these trending items into your gear can enhance your comfort, efficiency, and success during the hunting season. 
---

Tracing

Above, we can see the outputs appear to match our expectations, but a key benefit of the Agents SDK is that it includes built-in tracing, which tracks the flow of events across LLM calls, handoffs, and tools during an agent run.

Using the Traces dashboard, we can debug, visualise, and monitor our workflows during development and in production. As we can see below, each test query was correctly routed to the appropriate agent.

Traces Dashboard

Enabling Voice

With our workflow designed - in reality we'd spend time here evaluating traces and iterating on the workflow to make it as effective as possible, but let's assume we're happy with it - we can now turn to converting our in-app assistant from text-based to voice-based interaction.

To do this, we can simply leverage the classes the Agents SDK provides to convert our text-based workflow into a voice-based one. The VoicePipeline class provides an interface for transcribing audio input, executing a given agent workflow, and generating a text-to-speech response for playback to the user, while the SingleAgentVoiceWorkflow class lets us reuse the same agent workflow we built for the text-based version. To supply and receive audio, we'll use the sounddevice library.
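One small detail worth understanding before the full loop: sounddevice delivers microphone audio as a stream of small int16 arrays via its callback, while the pipeline's AudioInput wants one contiguous buffer. A minimal sketch with synthetic chunks (the shapes here are chosen for illustration only):

```python
import numpy as np

# Four synthetic "callback" chunks of 512 mono int16 samples each,
# standing in for what the sd.InputStream callback appends while you speak.
chunks = [np.zeros((512, 1), dtype=np.int16) for _ in range(4)]

# One contiguous buffer, ready to wrap in AudioInput(buffer=...).
buffer = np.concatenate(chunks, axis=0)
```

The same concatenation happens in reverse on the way out: the pipeline streams audio events, and we collect and concatenate them before playback.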

End to end, the new workflow looks like this:

Agent Architecture 2

The code to enable this is below:

import numpy as np
import sounddevice as sd
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

async def voice_assistant():
    samplerate = sd.query_devices(kind='input')['default_samplerate']

    while True:
        pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(triage_agent))

        # Check for input to either provide voice or exit
        cmd = input("Press Enter to speak your query (or type 'esc' to exit): ")
        if cmd.lower() == "esc":
            print("Exiting...")
            break      
        print("Listening...")
        recorded_chunks = []

         # Start streaming from microphone until Enter is pressed
        with sd.InputStream(samplerate=samplerate, channels=1, dtype='int16', callback=lambda indata, frames, time, status: recorded_chunks.append(indata.copy())):
            input()

        # Concatenate chunks into single buffer
        recording = np.concatenate(recorded_chunks, axis=0)

        # Input the buffer and await the result
        audio_input = AudioInput(buffer=recording)

        with trace("ACME App Voice Assistant"):
            result = await pipeline.run(audio_input)

         # Transfer the streamed result into chunks of audio
        response_chunks = []
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                response_chunks.append(event.data)

        response_audio = np.concatenate(response_chunks, axis=0)

        # Play response
        print("Assistant is responding...")
        sd.play(response_audio, samplerate=samplerate)
        sd.wait()
        print("---")

# Run the voice assistant
await voice_assistant()
Listening...
Assistant is responding...
---
Exiting...

Executing the code above, we got the following responses, which correctly provide the same functionality as the text-based workflow.

from IPython.display import display, Audio
display(Audio("voice_agents_audio/account_balance_response_base.mp3"))
display(Audio("voice_agents_audio/product_info_response_base.mp3"))
display(Audio("voice_agents_audio/trending_items_response_base.mp3"))

Tip: when using tracing with voice agents, you can play back the audio in the Traces dashboard:

Audio trace

Optimizing for Voice

This is a good start, but we can do better. Because we've simply converted text-based agents into voice-based ones, the responses aren't optimised for tone or format in their output, which means they feel robotic and unnatural.

To address this, we need to make a few changes to our prompts.

First, we can adjust our existing agents to include a common system prompt with instructions on how to optimise their text responses for later conversion to voice:

# Common system prompt for voice output best practices:
voice_system_prompt = """
[Output Structure]
Your output will be delivered in an audio voice response, please ensure that every response meets these guidelines:
1. Use a friendly, human tone that will sound natural when spoken aloud.
2. Keep responses short and segmented—ideally one to two concise sentences per step.
3. Avoid technical jargon; use plain language so that instructions are easy to understand.
4. Provide only essential details so as not to overwhelm the listener.
"""

# --- Agent: Search Agent ---
search_voice_agent = Agent(
    name="SearchVoiceAgent",
    instructions=voice_system_prompt + (
        "You immediately provide an input to the WebSearchTool to find up-to-date information on the user's query."
    ),
    tools=[WebSearchTool()],
)

# --- Agent: Knowledge Agent ---
knowledge_voice_agent = Agent(
    name="KnowledgeVoiceAgent",
    instructions=voice_system_prompt + (
        "You answer user questions on our product portfolio with concise, helpful responses using the FileSearchTool."
    ),
    tools=[FileSearchTool(
            max_num_results=3,
            vector_store_ids=["VECTOR_STORE_ID"],
        ),],
)

# --- Agent: Account Agent ---
account_voice_agent = Agent(
    name="AccountVoiceAgent",
    instructions=voice_system_prompt + (
        "You provide account information based on a user ID using the get_account_info tool."
    ),
    tools=[get_account_info],
)

# --- Agent: Triage Agent ---
triage_voice_agent = Agent(
    name="VoiceAssistant",
    instructions=prompt_with_handoff_instructions("""
You are the virtual assistant for Acme Shop. Welcome the user and ask how you can help.
Based on the user's intent, route to:
- AccountAgent for account-related queries
- KnowledgeAgent for product FAQs
- SearchAgent for anything requiring real-time web search
"""),
    handoffs=[account_voice_agent, knowledge_voice_agent, search_voice_agent],
)

Next, we can give the default OpenAI TTS model used by the Agents SDK, gpt-4o-mini-tts, instructions on how to deliver the audio output of the text our agents generate, using its instructions field.

Here we have enormous control over the output, including the ability to specify the personality, pronunciation, speed, and emotion of the delivery.

Below are some examples of how you could prompt the model for different applications.

health_assistant = (
    "Voice Affect: Calm, composed, and reassuring; project quiet authority and confidence. "
    "Tone: Sincere, empathetic, and gently authoritative—express genuine apology while conveying competence. "
    "Pacing: Steady and moderate; unhurried enough to communicate care, yet efficient enough to demonstrate professionalism."
)

coach_assistant = (
    "Voice: High-energy, upbeat, and encouraging, projecting enthusiasm and motivation. "
    "Punctuation: Short, punchy sentences with strategic pauses to maintain excitement and clarity. "
    "Delivery: Fast-paced and dynamic, with rising intonation to build momentum and keep engagement high."
)

themed_character_assistant = (
    "Affect: Deep, commanding, and slightly dramatic, with an archaic and reverent quality that reflects the grandeur of Olde English storytelling. "
    "Tone: Noble, heroic, and formal, capturing the essence of medieval knights and epic quests, while reflecting the antiquated charm of Olde English. "
    "Emotion: Excitement, anticipation, and a sense of mystery, combined with the seriousness of fate and duty. "
    "Pronunciation: Clear, deliberate, and with a slightly formal cadence. "
    "Pause: Pauses after important Olde English phrases such as \"Lo!\" or \"Hark!\" and between clauses like \"Choose thy path\" to add weight to the decision-making process and allow the listener to reflect on the seriousness of the quest."
)

Our configuration will focus on creating a friendly, warm, and supportive tone that sounds natural when spoken aloud and guides the user through the conversation.

from agents.voice import TTSModelSettings, VoicePipeline, VoicePipelineConfig, SingleAgentVoiceWorkflow, AudioInput
import sounddevice as sd
import numpy as np

# Define custom TTS model settings with the desired instructions
custom_tts_settings = TTSModelSettings(
    instructions=(
        "Personality: upbeat, friendly, persuasive guide. "
        "Tone: Friendly, clear, and reassuring, creating a calm atmosphere and making the listener feel confident and comfortable. "
        "Pronunciation: Clear, articulate, and steady, ensuring each instruction is easily understood while maintaining a natural, conversational flow. "
        "Tempo: Speak relatively fast, including brief pauses before and after questions. "
        "Emotion: Warm and supportive, conveying empathy and care, ensuring the listener feels guided and safe throughout the journey."
    )
)

async def voice_assistant_optimized():
    samplerate = sd.query_devices(kind='input')['default_samplerate']
    voice_pipeline_config = VoicePipelineConfig(tts_settings=custom_tts_settings)

    while True:
        pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(triage_voice_agent), config=voice_pipeline_config)

        # Check for input to either provide voice or exit
        cmd = input("Press Enter to speak your query (or type 'esc' to exit): ")
        if cmd.lower() == "esc":
            print("Exiting...")
            break       
        print("Listening...")
        recorded_chunks = []

         # Start streaming from microphone until Enter is pressed
        with sd.InputStream(samplerate=samplerate, channels=1, dtype='int16', callback=lambda indata, frames, time, status: recorded_chunks.append(indata.copy())):
            input()

        # Concatenate chunks into single buffer
        recording = np.concatenate(recorded_chunks, axis=0)

        # Input the buffer and await the result
        audio_input = AudioInput(buffer=recording)

        with trace("ACME App Optimized Voice Assistant"):
            result = await pipeline.run(audio_input)

         # Transfer the streamed result into chunks of audio
        response_chunks = []
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                response_chunks.append(event.data)
        response_audio = np.concatenate(response_chunks, axis=0)

        # Play response
        print("Assistant is responding...")
        sd.play(response_audio, samplerate=samplerate)
        sd.wait()
        print("---")

# Run the voice assistant
await voice_assistant_optimized()
Listening...
Assistant is responding...
---
Listening...
Assistant is responding...
---
Listening...
Assistant is responding...
---
Listening...
Assistant is responding...

Running the code above, we got the following responses, which are more naturally worded and more engaging in their delivery.

display(Audio("voice_agents_audio/account_balance_response_opti.mp3"))
display(Audio("voice_agents_audio/product_info_response_opti.mp3"))
display(Audio("voice_agents_audio/trending_items_response_opti.mp3"))

...and for something a little less subtle, we can switch to the themed_character_assistant instructions and receive the following responses:

display(Audio("voice_agents_audio/product_info_character.wav"))
display(Audio("voice_agents_audio/product_info_character_2.wav"))

Conclusion

Voilà!

In this cookbook, we've demonstrated how to:

  • Define agents providing use-case-specific functionality for our in-app voice assistant
  • Use built-in and custom tools with the Responses API to give the agents a range of capabilities, and evaluate their performance with tracing
  • Orchestrate these agents using the Agents SDK
  • Convert the agents from text-based to voice-based interaction using the Agents SDK's voice capabilities

The Agents SDK enables a modular approach to building a voice assistant, allowing you to work use case by use case, evaluating and iterating on each individually before implementing the next, and converting your workflow from text to voice when you're ready.

We hope this cookbook has been a useful guide to getting started building your own in-app voice assistant!