Introduction to GPT-4o and GPT-4o mini

July 18, 2024
GPT-4o ("o" for "omni") and GPT-4o mini are natively multimodal models designed to handle a combination of text, audio, and video inputs, and to generate outputs in text, audio, and image formats. GPT-4o mini is a lightweight version of GPT-4o.

Background

Before GPT-4o, users could interact with ChatGPT using Voice Mode, which operated with three separate models. GPT-4o integrates these capabilities into a single model trained across text, vision, and audio. This unified approach ensures that all inputs - whether text, visual, or auditory - are processed cohesively by the same neural network.

GPT-4o mini is the next iteration of this omni model family, available in a smaller and cheaper version. This model offers higher accuracy than GPT-3.5 Turbo while being just as fast, and supports multimodal inputs and outputs.

Current API Capabilities

Currently, the gpt-4o-mini model supports {text, image} inputs with {text} outputs, the same modalities as gpt-4-turbo. As a preview, we will also use the gpt-4o-audio-preview model to showcase transcription with the GPT-4o models.

%pip install --upgrade openai

Configuring the OpenAI client and submitting a test request

To set up the client for our use, we need to create an API key to use with our requests. Skip these steps if you already have an API key.

You can get an API key by following these steps:

  1. Create a new project
  2. Generate an API key in your project
  3. (Recommended, but not required) Set your API key as an environment variable for all projects (a minimal sketch of this is shown right after this list)
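
If you take the environment-variable route, here is a minimal sketch of step 3. It assumes a Unix-like shell and that the variable is named OPENAI_API_KEY, which is what the client code below reads:

import os

# In your shell, something like: export OPENAI_API_KEY="<your key>"
# Here we only confirm the variable is visible to Python before creating the client.
if os.environ.get("OPENAI_API_KEY"):
    print("OPENAI_API_KEY found in the environment")
else:
    print("OPENAI_API_KEY is not set - the client below will fall back to the placeholder string")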

Once this is set up, let's start with a simple {text} input to the model for our first request. We'll use both system and user messages for our first request, and we'll receive a response from the assistant role.

from openai import OpenAI 
import os

## Set the API key and model name
MODEL="gpt-4o-mini"
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as an env var>"))
completion = client.chat.completions.create(
  model=MODEL,
  messages=[
    {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"}, # <-- This is the system message that provides context to the model
    {"role": "user", "content": "Hello! Could you solve 2+2?"}  # <-- This is the user message for which the model will generate a response
  ]
)

print("Assistant: " + completion.choices[0].message.content)
Assistant: Of course! \( 2 + 2 = 4 \).

Image Processing

GPT-4o mini can directly process images and take intelligent actions based on the image. We can provide images in two formats:

  1. Base64 Encoded
  2. URL

Let's first view the image we'll be using, then try sending this image to the API both as Base64 and as a URL link.

from IPython.display import Image, display, Audio, Markdown
import base64

IMAGE_PATH = "data/triangle.png"

# Preview image for context
display(Image(IMAGE_PATH))
image generated by notebook
# Open the image file and encode it as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image(IMAGE_PATH)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)
To find the area of the triangle, you can use the formula:

\[
\text{Area} = \frac{1}{2} \times \text{base} \times \text{height}
\]

In the triangle you provided:

- The base is \(9\) (the length at the bottom).
- The height is \(5\) (the vertical line from the top vertex to the base).

Now, plug in the values:

\[
\text{Area} = \frac{1}{2} \times 9 \times 5
\]

Calculating this:

\[
\text{Area} = \frac{1}{2} \times 45 = 22.5
\]

Thus, the area of the triangle is **22.5 square units**.
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": "https://upload.wikimedia.org/wikipedia/commons/e/e2/The_Algebra_of_Mohammed_Ben_Musa_-_page_82b.png"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)
To find the area of the triangle, you can use the formula:

\[
\text{Area} = \frac{1}{2} \times \text{base} \times \text{height}
\]

In the triangle you provided:

- The base is \(9\) (the length at the bottom).
- The height is \(5\) (the vertical line from the top vertex to the base).

Now, plug in the values:

\[
\text{Area} = \frac{1}{2} \times 9 \times 5
\]

Calculating this gives:

\[
\text{Area} = \frac{1}{2} \times 45 = 22.5
\]

Thus, the area of the triangle is **22.5 square units**.

Video Processing

While it's not possible to send a video directly to the API, GPT-4o can understand videos if you sample frames and then provide them as images.

Since GPT-4o mini in the API does not yet support audio input (as of July 2024), we'll use a combination of GPT-4o mini and the gpt-4o-audio-preview model to process both the audio and visual content of a provided video, showcasing two use cases:

  1. Summarization
  2. Question Answering

Setup for Video Processing

We'll use two Python packages for video processing - opencv-python and moviepy.

These require ffmpeg, so make sure to install it beforehand. Depending on your OS, you may need to run brew install ffmpeg or sudo apt install ffmpeg.
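
Before processing any video, you can sanity-check the ffmpeg installation from Python. This is a minimal sketch that assumes your setup relies on a system-wide ffmpeg on the PATH (some moviepy installs bundle their own binary via imageio-ffmpeg, in which case this check is optional):

import shutil
import subprocess

# Confirm ffmpeg is discoverable before moviepy tries to use it
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found on PATH - install it (e.g. via brew or apt) before continuing")

# Print the first line of `ffmpeg -version` as a quick confirmation
print(subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True).stdout.splitlines()[0])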

%pip install opencv-python
%pip install moviepy
import cv2
from moviepy import *
import time
import base64

# We'll be using the OpenAI DevDay Keynote Recap video. You can review the video here: https://www.youtube.com/watch?v=h02ti0Bl6zk
VIDEO_PATH = "data/keynote_recap.mp4"
def process_video(video_path, seconds_per_frame=2):
    base64Frames = []
    base_video_path, _ = os.path.splitext(video_path)

    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = video.get(cv2.CAP_PROP_FPS)
    frames_to_skip = int(fps * seconds_per_frame)
    curr_frame=0

    # Loop through the video and extract frames at specified sampling rate
    while curr_frame < total_frames - 1:
        video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
        curr_frame += frames_to_skip
    video.release()

    # Extract audio from video
    audio_path = f"{base_video_path}.mp3"
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path, bitrate="32k")
    clip.audio.close()
    clip.close()

    print(f"Extracted {len(base64Frames)} frames")
    print(f"Extracted audio to {audio_path}")
    return base64Frames, audio_path

# Extract 1 frame per second. You can adjust the `seconds_per_frame` parameter to change the sampling rate
base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=1)
MoviePy - Writing audio in data/keynote_recap.mp3
                                                                      
MoviePy - Done.
Extracted 218 frames
Extracted audio to data/keynote_recap.mp3
## Display the frames and audio for context
display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8")), width=600))
    time.sleep(0.025)

Audio(audio_path)
image generated by notebook

Example 1: Summarization

Now that we have both the video frames and the audio, let's run a few different tests to generate a video summary and compare the results of giving the model different modalities. We should expect the summary generated with context from both the visual and audio inputs to be the most accurate, since the model can use the full context of the video.

  1. Visual Summary
  2. Audio Summary
  3. Visual + Audio Summary

Visual Summary

The visual summary is generated by sending the model only the frames from the video. With just the frames, the model is likely to capture the visual aspects but will miss any details discussed by the speaker.

response = client.chat.completions.create(
    model=MODEL,
    messages=[
    {"role": "system", "content": "You are generating a video summary. Please provide a summary of the video. Respond in Markdown."},
    {"role": "user", "content": [
        "These are the frames from the video.",
        *map(lambda x: {"type": "image_url", 
                        "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}}, base64Frames)
        ],
    }
    ],
    temperature=0,
)
print(response.choices[0].message.content)
# OpenAI Dev Day Summary

## Overview
The video captures highlights from OpenAI's Dev Day, showcasing new advancements and features in AI technology, particularly focusing on the latest developments in the GPT-4 model and its applications.

## Key Highlights

### Event Introduction
- The event is branded as "OpenAI Dev Day," setting the stage for discussions on AI advancements.

### Keynote Recap
- The keynote features a recap of significant updates and innovations in AI, particularly around the GPT-4 model.

### New Features
- **GPT-4 Turbo**: Introduction of a faster and more efficient version of GPT-4, emphasizing improved performance and reduced costs.
- **DALL-E 3**: Updates on the image generation model, showcasing its capabilities and integration with other tools.
- **Custom Models**: Introduction of features allowing users to create tailored AI models for specific tasks.

### Technical Innovations
- **Function Calling**: Demonstration of how the model can handle complex instructions and execute functions based on user queries.
- **JSON Mode**: A new feature that allows for structured data handling, enhancing the model's ability to process and respond to requests.

### User Experience Enhancements
- **Threading and Retrieval**: New functionalities that improve how users can interact with the model, making it easier to manage conversations and retrieve information.
- **Code Interpreter**: Introduction of a tool that allows the model to execute code, expanding its utility for developers.

### Community Engagement
- The event emphasizes community involvement, encouraging developers to explore and utilize the new features in their applications.

### Conclusion
- The event wraps up with a call to action for developers to engage with the new tools and features, fostering innovation in AI applications.

## Closing Remarks
The OpenAI Dev Day serves as a platform for showcasing the latest advancements in AI technology, encouraging developers to leverage these innovations for enhanced applications and user experiences.

The results are as expected - the model is able to capture the high-level aspects of the video's visuals, but misses the details provided in the speech.

Audio Summary

The audio summary is generated by sending the model the transcript of the audio. With just the audio, the model is likely to bias towards the audio content and will miss the context provided by the presentation and visuals.

{audio} input for GPT-4o is currently in preview, but will be incorporated into the base model in the near future. For now, we'll use the gpt-4o-audio-preview model to process the audio.

#transcribe the audio
with open(audio_path, 'rb') as audio_file:
    audio_content = base64.b64encode(audio_file.read()).decode('utf-8')

response = client.chat.completions.create(
            model='gpt-4o-audio-preview',
            modalities=["text"],
            messages=[
                    {   "role": "system", 
                        "content":"You are generating a transcript. Create a transcript of the provided audio."
                    },
                    {
                        "role": "user",
                        "content": [
                            { 
                                "type": "text",
                                "text": "this is the audio."
                            },
                            {
                                "type": "input_audio",
                                "input_audio": {
                                    "data": audio_content,
                                    "format": "mp3"
                                }
                            }
                        ]
                    },
                ],
            temperature=0,
        )

# Extract and return the transcription
transcription = response.choices[0].message.content
print (transcription)

That looks good. Now let's summarize the transcript and format it in markdown.

#summarize the transcript
response = client.chat.completions.create(
            model=MODEL,
            modalities=["text"],
            messages=[
                {"role": "system", "content": "You are generating a transcript summary. Create a summary of the provided transcription. Respond in Markdown."},
                {"role": "user", "content": f"Summarize this text: {transcription}"},
            ],
            temperature=0,
        )
transcription_summary = response.choices[0].message.content
print (transcription_summary)
# OpenAI Dev Day Summary

On the inaugural OpenAI Dev Day, several significant updates and features were announced:

- **Launch of GPT-4 Turbo**: This new model supports up to 128,000 tokens of context and is designed to follow instructions more effectively.
  
- **JSON Mode**: A new feature that ensures the model responds with valid JSON.

- **Function Calling**: Users can now call multiple functions simultaneously, enhancing the model's capabilities.

- **Retrieval Feature**: This allows models to access external knowledge from documents or databases, improving their contextual understanding.

- **Knowledge Base**: GPT-4 Turbo has knowledge up to April 2023, with plans for ongoing improvements.

- **Dolly 3 and New Models**: The introduction of Dolly 3, GPT-4 Turbo with Vision, and a new Text-to-Speech model, all available via the API.

- **Custom Models Program**: A new initiative where researchers collaborate with companies to create tailored models for specific use cases.

- **Increased Rate Limits**: Established GPT-4 customers will see a doubling of tokens per minute, with options to request further changes in API settings.

- **Cost Efficiency**: GPT-4 Turbo is significantly cheaper than its predecessor, with a 3x reduction for prompt tokens and 2x for completion tokens.

- **Introduction of GPTs**: Tailored versions of ChatGPT designed for specific purposes, allowing users to create and share private or public GPTs easily, even without coding skills.

- **Upcoming GPT Store**: A platform for users to share their GPT creations.

- **Assistance API**: Features persistent threads, built-in retrieval, a code interpreter, and improved function calling to streamline user interactions.

The event concluded with excitement about the future of AI technology and an invitation for attendees to return next year to see further advancements.

The audio summary is biased towards the content discussed during the speech, but comes out with much less structure than the video summary.

Audio + Visual Summary

The audio + visual summary is generated by sending the model both the visual and the audio from the video at once. When sending both, the model is expected to summarize better, since it can perceive the entire video at once.

## Generate a summary with visual and audio
response = client.chat.completions.create(
    model=MODEL,
    messages=[
    {"role": "system", "content":"""You are generating a video summary. Create a summary of the provided video and its transcript. Respond in Markdown"""},
    {"role": "user", "content": [
        "These are the frames from the video.",
        *map(lambda x: {"type": "image_url", 
                        "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}}, base64Frames),
        {"type": "text", "text": f"The audio transcription is: {transcription}"}
        ],
    }
],
    temperature=0,
)
print(response.choices[0].message.content)
# OpenAI Dev Day Summary

## Overview
The first-ever OpenAI Dev Day introduced several exciting updates and features, primarily focusing on the launch of **GPT-4 Turbo**. This new model enhances capabilities and expands the potential for developers and users alike.

## Key Announcements

### 1. **GPT-4 Turbo**
- **Token Support**: Supports up to **128,000 tokens** of context.
- **JSON Mode**: A new feature that ensures responses are in valid JSON format.
- **Function Calling**: Improved ability to call multiple functions simultaneously and better adherence to instructions.

### 2. **Knowledge Retrieval**
- **Enhanced Knowledge Access**: Users can now integrate external documents or databases, allowing models to access updated information beyond their training cut-off (April 2023).

### 3. **DALL-E 3 and Other Models**
- Launch of **DALL-E 3**, **GPT-4 Turbo with Vision**, and a new **Text-to-Speech model** in the API.

### 4. **Custom Models Program**
- Introduction of a program where OpenAI researchers collaborate with companies to create tailored models for specific use cases.

### 5. **Rate Limits and Pricing**
- **Increased Rate Limits**: Doubling tokens per minute for established GPT-4 customers.
- **Cost Efficiency**: GPT-4 Turbo is **3x cheaper** for prompt tokens and **2x cheaper** for completion tokens compared to GPT-4.

### 6. **Introduction of GPTs**
- **Tailored Versions**: GPTs are customized versions of ChatGPT designed for specific tasks, combining instructions, expanded knowledge, and actions.
- **User-Friendly Creation**: Users can create GPTs through conversation, making it accessible even for those without coding skills.
- **GPT Store**: A new platform for sharing and discovering GPTs, launching later this month.

### 7. **Assistance API Enhancements**
- Features include persistent threads, built-in retrieval, a code interpreter, and improved function calling.

## Conclusion
The event highlighted OpenAI's commitment to enhancing AI capabilities and accessibility for developers. The advancements presented are expected to empower users to create innovative applications and solutions. OpenAI looks forward to future developments and encourages ongoing engagement with the community. 

Thank you for attending!

After combining both the video and audio, we're able to get a much more detailed and comprehensive summary of the event, one that draws on information from both the visual and audio elements of the video.

Example 2: Question Answering

For the Q&A, we'll use the same concept as before and ask questions about the video we just processed, running the same 3 tests to demonstrate the benefit of combining input modalities:

  1. Visual Q&A
  2. Audio Q&A
  3. Visual + Audio Q&A
QUESTION = "Question: Why did Sam Altman have an example about raising windows and turning the radio on?"
qa_visual_response = client.chat.completions.create(
    model=MODEL,
    messages=[
    {"role": "system", "content": "Use the video to answer the provided question. Respond in Markdown."},
    {"role": "user", "content": [
        "These are the frames from the video.",
        *map(lambda x: {"type": "image_url", "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}}, base64Frames),
        QUESTION
        ],
    }
    ],
    temperature=0,
)
print("Visual QA:\n" + qa_visual_response.choices[0].message.content)
Visual QA:
Sam Altman used the example of raising windows and turning the radio on to illustrate the concept of function calling in AI. This example demonstrates how AI can interpret natural language commands and translate them into specific function calls, making interactions more intuitive and user-friendly. By showing a relatable scenario, he highlighted the advancements in AI's ability to understand and execute complex tasks based on simple instructions.
qa_audio_response = client.chat.completions.create(
    model=MODEL,
    messages=[
    {"role": "system", "content":"""Use the transcription to answer the provided question. Respond in Markdown."""},
    {"role": "user", "content": f"The audio transcription is: {transcription}. \n\n {QUESTION}"},
    ],
    temperature=0,
)
print("Audio QA:\n" + qa_audio_response.choices[0].message.content)
Audio QA:
The transcription provided does not include any mention of Sam Altman discussing raising windows or turning the radio on. Therefore, I cannot provide an answer to that specific question based on the given text. If you have more context or another transcription that includes that example, please share it, and I would be happy to help!
qa_both_response = client.chat.completions.create(
    model=MODEL,
    messages=[
    {"role": "system", "content":"""Use the video and transcription to answer the provided question."""},
    {"role": "user", "content": [
        "These are the frames from the video.",
        *map(lambda x: {"type": "image_url", 
                        "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}}, base64Frames),
                        {"type": "text", "text": f"The audio transcription is: {transcription}"},
        QUESTION
        ],
    }
    ],
    temperature=0,
)
print("Both QA:\n" + qa_both_response.choices[0].message.content)
Both QA:
Sam Altman used the example of raising windows and turning the radio on to illustrate the new function calling feature in GPT-4 Turbo. This example demonstrates how the model can interpret natural language commands and translate them into specific function calls, making it easier for users to interact with the model in a more intuitive way. It highlights the model's ability to understand context and perform multiple actions based on user instructions.

Comparing the three answers, the most accurate answer is generated by using both the audio and visual content of the video. Sam Altman did not discuss raising windows or the radio during the keynote, but he referenced the model's improved ability to execute multiple functions in a single request while the example was displayed behind him.

Conclusion

Integrating multiple input modalities such as audio, visual, and text significantly enhances the model's performance on a diverse range of tasks. This multimodal approach allows for more comprehensive understanding and interaction, mirroring more closely how humans perceive and process information.

Currently, GPT-4o and GPT-4o mini in the API support text and image inputs, with audio capabilities coming soon. For now, use gpt-4o-audio-preview for audio inputs.