使用 GPT 的视觉能力和 TTS API 处理和叙述视频

2023年11月6日
在 Github 中打开

本笔记本演示了如何将 GPT 的视觉能力与视频结合使用。GPT-4o 不直接接受视频作为输入,但我们可以使用视觉和 128K 上下文窗口一次性描述整个视频的静态帧。我们将介绍两个示例

  1. 使用 GPT-4o 获取视频的描述
  2. 使用 GPT-o 和 TTS API 为视频生成配音
from IPython.display import display, Image, Audio

import cv2  # We're using OpenCV to read video, to install !pip install opencv-python
import base64
import time
from openai import OpenAI
import os
import requests

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

首先,我们使用 OpenCV 从包含野牛和狼的自然视频中提取帧

video = cv2.VideoCapture("data/bison.mp4")

base64Frames = []
while video.isOpened():
    success, frame = video.read()
    if not success:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

video.release()
print(len(base64Frames), "frames read.")
618 frames read.

显示帧以确保我们已正确读取它们

display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8"))))
    time.sleep(0.025)
image generated by notebook

获得视频帧后,我们编写提示并向 GPT 发送请求(请注意,我们不需要发送每一帧即可让 GPT 理解正在发生的事情)

PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.",
            *map(lambda x: {"image": x, "resize": 768}, base64Frames[0::50]),
        ],
    },
]
params = {
    "model": "gpt-4o",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 200,
}

result = client.chat.completions.create(**params)
print(result.choices[0].message.content)
Title: "Epic Wildlife Showdown: Wolves vs. Bison in the Snow"

Description: 
Witness the raw power and strategy of nature in this intense and breathtaking video! A pack of wolves face off against a herd of bison in a dramatic battle for survival set against a stunning snowy backdrop. See how the wolves employ their cunning tactics while the bison demonstrate their strength and solidarity. This rare and unforgettable footage captures the essence of the wild like never before. Who will prevail in this ultimate test of endurance and skill? Watch to find out and experience the thrill of the wilderness! 🌨️🦊🐂 #Wildlife #NatureDocumentary #AnimalKingdom #SurvivalOfTheFittest #NatureLovers

让我们以大卫·艾登堡的风格为这个视频创建一个配音。使用相同的视频帧,我们提示 GPT 为我们提供一个简短的脚本

PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration.",
            *map(lambda x: {"image": x, "resize": 768}, base64Frames[0::60]),
        ],
    },
]
params = {
    "model": "gpt-4o",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 500,
}

result = client.chat.completions.create(**params)
print(result.choices[0].message.content)
In the frozen expanses of the North American wilderness, a battle unfolds—a testament to the harsh realities of survival.

The pack of wolves, relentless and coordinated, closes in on the mighty bison. Exhausted and surrounded, the bison relies on its immense strength and bulk to fend off the predators.

But the wolves are cunning strategists. They work together, each member playing a role in the hunt, nipping at the bison's legs, forcing it into a corner.

The alpha female leads the charge, her pack following her cues. They encircle their prey, tightening the noose with every passing second.

The bison makes a desperate attempt to escape, but the wolves latch onto their target, wearing it down through sheer persistence and teamwork.

In these moments, nature's brutal elegance is laid bare—a primal dance where only the strongest and the most cunning can thrive.

The bison, now overpowered and exhausted, faces its inevitable fate. The wolves have triumphed, securing a meal that will sustain their pack for days to come.

And so, the cycle of life continues, as it has for millennia, in this unforgiving land where the struggle for survival is an unending battle.

现在我们可以将脚本传递给 TTS API,它将生成配音的 mp3 文件

response = requests.post(
    "https://api.openai.com/v1/audio/speech",
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    },
    json={
        "model": "tts-1-1106",
        "input": result.choices[0].message.content,
        "voice": "onyx",
    },
)

audio = b""
for chunk in response.iter_content(chunk_size=1024 * 1024):
    audio += chunk
Audio(audio)