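Steering Text-to-Speech for more dynamic audio generation

Our traditional TTS APIs cannot steer how the generated voice sounds. For example, if you want to convert a paragraph of text to audio, you cannot give any specific instructions about how that audio should be spoken.

With audio Chat Completions, you can give specific instructions before the audio is generated. This lets you tell the API to speak at different speeds, in different tones, and with different accents. With appropriate instructions, the voices can be more dynamic, more natural, and better suited to the context.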

Traditional TTS

Traditional TTS lets you choose a voice, but not the tone, accent, or any other contextual audio parameter.

from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

tts_text = """
Once upon a time, Leo the lion cub woke up to the smell of pancakes and scrambled eggs.
His tummy rumbled with excitement as he raced to the kitchen. Mama Lion had made a breakfast feast!
Leo gobbled up his pancakes, sipped his orange juice, and munched on some juicy berries.
"""

# The ./sounds directory must already exist
speech_file_path = "./sounds/default_tts.mp3"

# Only the model, the voice, and the input text can be set here
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="alloy",
    input=tts_text,
)

response.write_to_file(speech_file_path)

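Chat Completions TTS

With Chat Completions, you can give specific instructions before the audio is generated. In the following example, we generate a British accent for a children's learning setting. This is especially useful in educational applications, where the assistant's voice matters to the learning experience.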

import base64

# Example 1: British accent, enunciated for a child
# (reuses client and tts_text from the traditional TTS example)
speech_file_path = "./sounds/chat_completions_tts.mp3"

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "mp3"},
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that can generate audio from text. Speak in a British accent and enunciate like you're talking to a child.",
        },
        {
            "role": "user",
            "content": tts_text,
        }
    ],
)

# The audio is returned base64-encoded in message.audio.data
mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open(speech_file_path, "wb") as f:
    f.write(mp3_bytes)

# Example 2: same accent, but spoken quickly
speech_file_path = "./sounds/chat_completions_tts_fast.mp3"

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "mp3"},
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that can generate audio from text. Speak in a British accent and speak really fast.",
        },
        {
            "role": "user",
            "content": tts_text,
        }
    ],
)

mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open(speech_file_path, "wb") as f:
    f.write(mp3_bytes)
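The two requests above differ only in the steering sentence appended to the system message. As a minimal sketch, the shared arguments could be assembled by a small helper. The function name build_audio_request and the constant BASE_INSTRUCTION are hypothetical and not part of the OpenAI SDK; the model name and parameters are the ones used in the examples above.

```python
from typing import Any, Dict

# Hypothetical constant: the fixed opening sentence of the system message
# used in the examples above.
BASE_INSTRUCTION = "You are a helpful assistant that can generate audio from text."


def build_audio_request(
    text: str,
    steering: str,
    voice: str = "alloy",
    audio_format: str = "mp3",
) -> Dict[str, Any]:
    """Assemble keyword arguments for client.chat.completions.create.

    `steering` is the free-form instruction (accent, pace, tone) that is
    appended to the system message. This helper is an illustration only.
    """
    return {
        "model": "gpt-4o-audio-preview",
        "modalities": ["text", "audio"],
        "audio": {"voice": voice, "format": audio_format},
        "messages": [
            {"role": "system", "content": f"{BASE_INSTRUCTION} {steering}"},
            {"role": "user", "content": text},
        ],
    }


# Build (but do not send) a request for a fast British reading
request = build_audio_request(
    "Once upon a time, Leo the lion cub woke up.",
    "Speak in a British accent and speak really fast.",
)
```

Under these assumptions the call would become client.chat.completions.create(**request), so trying a new accent or pace means changing a single string.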

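Chat Completions Multilingual TTS

We can also generate audio in the accents of different languages. In the following example, we first translate the text and then generate audio in a specific Uruguayan Spanish accent.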

# Step 1: translate the story into Uruguayan Spanish with a text-only model
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are an expert translator. Translate any text given into Spanish like you are from Uruguay.",
        },
        {
            "role": "user",
            "content": tts_text,
        }
    ],
)
translated_text = completion.choices[0].message.content
print(translated_text)

# Step 2: speak the translated text with a Uruguayan accent, more slowly
speech_file_path = "./sounds/chat_completions_tts_es_uy.mp3"

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "mp3"},
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that can generate audio from text. Speak any text that you receive in a Uruguayan spanish accent and more slowly.",
        },
        {
            "role": "user",
            "content": translated_text,
        }
    ],
)

mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open(speech_file_path, "wb") as f:
    f.write(mp3_bytes)

Había una vez un leoncito llamado Leo que se despertó con el aroma de panqueques y huevos revueltos. Su pancita gruñía de emoción mientras corría hacia la cocina. ¡Mamá León había preparado un festín de desayuno! Leo devoró sus panqueques, sorbió su jugo de naranja y mordisqueó algunas bayas jugosas.
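Each Chat Completions example ends with the same step: decode the base64 audio payload and write it to a file under ./sounds. The sketch below wraps that step in one function and also creates the output directory if it is missing, which the examples assume already exists. The name save_base64_audio is hypothetical; only the Python standard library is used.

```python
import base64
from pathlib import Path


def save_base64_audio(audio_b64: str, path: str) -> Path:
    """Decode a base64 audio payload and write it to `path`.

    Parent directories are created when missing, so a fresh checkout
    without a ./sounds folder does not fail on the write.
    """
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_bytes(base64.b64decode(audio_b64))
    return out
```

With a completion from the examples above, a call might look like save_base64_audio(completion.choices[0].message.audio.data, "./sounds/chat_completions_tts_es_uy.mp3").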

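Conclusion

The ability to steer the voice of the generated audio opens up many possibilities for richer audio experiences. Use cases include:

Enhanced expressiveness: steerable TTS can adjust tone, pitch, speed, and emotion, so the voice can convey different moods (for example, excitement, calm, or urgency).
Language learning and education: steerable TTS can mimic accents, inflections, and pronunciation, which benefits language learners and educational applications where accurate intonation and emphasis are critical.
Contextual voice: steerable TTS adapts the voice to the context of the content, such as a formal tone for professional documents or a friendly, conversational style for social interactions. This helps create more natural conversations in virtual assistants and chatbots.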
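One way to put these use cases into practice is to keep a small table of steering instructions and pick one per context. The sketch below is illustrative only: the preset names and their wording are assumptions made for this example, not tested prompts, and system_prompt_for is a hypothetical helper.

```python
# Hypothetical presets, loosely following the use cases listed above.
STEERING_PRESETS = {
    "expressive": "Speak with an excited, upbeat tone and a lively pace.",
    "language_learning": "Speak slowly, with clear pronunciation and careful intonation.",
    "formal": "Speak in a formal, measured tone suited to professional documents.",
    "conversational": "Speak in a friendly, conversational style.",
}


def system_prompt_for(preset: str) -> str:
    """Return a system message for the given preset name.

    Raises ValueError for an unknown preset so that typos fail loudly
    instead of silently producing unsteered audio.
    """
    try:
        steering = STEERING_PRESETS[preset]
    except KeyError:
        choices = ", ".join(sorted(STEERING_PRESETS))
        raise ValueError(f"Unknown preset {preset!r}; choose from: {choices}") from None
    return f"You are a helpful assistant that can generate audio from text. {steering}"


print(system_prompt_for("formal"))
```

The returned string would go in the system message of a gpt-4o-audio-preview request, in the same position as the British-accent instruction earlier in the article.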