Using GPT-4 Vision with Function Calling

Dec 13, 2024

GPT-4o, available as gpt-4o-2024-11-20 as of November 2024, now supports function calling with vision capabilities, along with improved reasoning and a knowledge cutoff of October 2023. Combining images with function calling unlocks multimodal use cases and reasoning, letting you go beyond OCR and image captioning.
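
As a quick sketch of the request shape, an image is passed to the model as a base64 data URL inside a user message, alongside any text. The bytes below are a hypothetical stand-in, not a real photo; in practice they come from reading an image file:

```python
import base64

# Hypothetical stand-in for real image bytes; in the notebook they come
# from reading a .jpg file on disk.
fake_jpeg_bytes = b"\xff\xd8\xff\xe0 not a real photo"
b64 = base64.b64encode(fake_jpeg_bytes).decode("utf-8")

# Multimodal input to the Chat Completions API: text and image parts sit
# side by side in the `content` list of a single user message.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is the condition of this package?"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        },
    ],
}
```

Any tools passed in the same request can then be invoked by the model based on what it sees in the image.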

We'll walk through two examples to demonstrate function calling with GPT-4o with vision:

  1. Simulating a customer service assistant for delivery exception support
  2. Analyzing an organizational chart to extract employee information
!pip install pymupdf --quiet
!pip install openai --quiet
!pip install matplotlib --quiet
# instructor makes it easy to work with function calling
!pip install instructor --quiet
import base64
import os
from enum import Enum
from io import BytesIO
from typing import Iterable
from typing import List
from typing import Literal, Optional

import fitz
# Instructor is powered by Pydantic, which is powered by type hints.
# Schema validation and prompting are controlled by type annotations.
import instructor
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import display
from PIL import Image
from openai import OpenAI
from pydantic import BaseModel, Field
Matplotlib is building the font cache; this may take a moment.

1. Simulating a customer service assistant for delivery exception support

We'll simulate a customer service assistant for a delivery service that is equipped to analyze images of packages. Based on the image analysis, the assistant will perform the following actions:

  • If the package appears damaged in the image, automatically process a refund according to policy.
  • If the package appears wet, initiate a replacement.
  • If the package appears normal and undamaged, escalate to an agent.

Let's look at the sample package images that the customer service assistant will analyze to determine the appropriate action. We'll encode the images as base64 strings so the model can process them.

# Function to encode the image as base64
def encode_image(image_path: str):
    # check if the image exists
    if not os.path.exists(image_path):
        raise FileNotFoundError(f"Image file not found: {image_path}")
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


# Sample images for testing
image_dir = "images"

# encode all images within the directory
image_files = os.listdir(image_dir)
image_data = {}
for image_file in image_files:
    image_path = os.path.join(image_dir, image_file)
    # encode the image with key as the image file name
    image_data[image_file.split('.')[0]] = encode_image(image_path)
    print(f"Encoded image: {image_file}")


def display_images(image_data: dict):
    fig, axs = plt.subplots(1, len(image_data), figsize=(18, 6))
    for i, (key, value) in enumerate(image_data.items()):
        img = Image.open(BytesIO(base64.b64decode(value)))
        ax = axs[i]
        ax.imshow(img)
        ax.axis("off")
        ax.set_title(key)
    plt.tight_layout()
    plt.show()


display_images(image_data)
Encoded image: wet_package.jpg
Encoded image: damaged_package.jpg
Encoded image: normal_package.jpg
image generated by notebook

We have successfully encoded the sample images as base64 strings and displayed them. The customer service assistant will analyze these images to determine the appropriate action based on the package condition.

Now let's define the functions/tools for order handling, such as escalating an order to an agent, refunding an order, and replacing an order. We'll create placeholder functions to simulate processing these actions based on the identified tools, and use Pydantic models to define the data structure for the order actions.

MODEL = "gpt-4o-2024-11-20"

class Order(BaseModel):
    """Represents an order with details such as order ID, customer name, product name, price, status, and delivery date."""
    order_id: str = Field(..., description="The unique identifier of the order")
    product_name: str = Field(..., description="The name of the product")
    price: float = Field(..., description="The price of the product")
    status: str = Field(..., description="The status of the order")
    delivery_date: str = Field(..., description="The delivery date of the order")
# Placeholder functions for order processing

def get_order_details(order_id):
    # Placeholder function to retrieve order details based on the order ID
    return Order(
        order_id=order_id,
        product_name="Product X",
        price=100.0,
        status="Delivered",
        delivery_date="2024-04-10",
    )

def escalate_to_agent(order: Order, message: str):
    # Placeholder function to escalate the order to a human agent
    return f"Order {order.order_id} has been escalated to an agent with message: `{message}`"

def refund_order(order: Order):
    # Placeholder function to process a refund for the order
    return f"Order {order.order_id} has been refunded successfully."

def replace_order(order: Order):
    # Placeholder function to replace the order with a new one
    return f"Order {order.order_id} has been replaced with a new order."

class FunctionCallBase(BaseModel):
    rationale: Optional[str] = Field(..., description="The reason for the action.")
    image_description: Optional[str] = Field(
        ..., description="The detailed description of the package image."
    )
    action: Literal["escalate_to_agent", "replace_order", "refund_order"]
    message: Optional[str] = Field(
        ...,
        description="The message to be escalated to the agent if action is escalate_to_agent",
    )
    # Placeholder functions to process the action based on the order ID
    def __call__(self, order_id):
        order: Order = get_order_details(order_id=order_id)
        if self.action == "escalate_to_agent":
            return escalate_to_agent(order, self.message)
        if self.action == "replace_order":
            return replace_order(order)
        if self.action == "refund_order":
            return refund_order(order)

class EscalateToAgent(FunctionCallBase):
    """Escalate to an agent for further assistance."""
    pass

class OrderActionBase(FunctionCallBase):
    pass

class ReplaceOrder(OrderActionBase):
    """Tool call to replace an order."""
    pass

class RefundOrder(OrderActionBase):
    """Tool call to refund an order."""
    pass
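
Under the hood, instructor hands each of these Pydantic classes to the API as a function/tool JSON schema, which is what the model fills in when it picks an action. A trimmed sketch of how to inspect that schema, assuming Pydantic v2's `model_json_schema`:

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field

# Trimmed copy of the tool classes above, kept only to inspect the schema.
class FunctionCallBase(BaseModel):
    rationale: Optional[str] = Field(None, description="The reason for the action.")
    action: Literal["escalate_to_agent", "replace_order", "refund_order"]
    message: Optional[str] = Field(None, description="Escalation message, if any.")

class RefundOrder(FunctionCallBase):
    """Tool call to refund an order."""

# The generated JSON schema lists the properties the model must fill in.
schema = RefundOrder.model_json_schema()
print(sorted(schema["properties"]))
```

The `action` field is a Literal, so the schema constrains the model to one of the three predefined actions rather than free-form text.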

Simulating user messages and processing the package images

We'll simulate user messages containing package images and process the images with GPT-4o with vision. The model will identify the appropriate tool call based on the image analysis and the predefined actions for damaged, wet, or normal packages. We'll then process the identified action based on the order ID and display the results.

# extract the tool call from the response
ORDER_ID = "12345"  # Placeholder order ID for testing
INSTRUCTION_PROMPT = "You are a customer service assistant for a delivery service, equipped to analyze images of packages. If a package appears damaged in the image, automatically process a refund according to policy. If the package looks wet, initiate a replacement. If the package appears normal and not damaged, escalate to agent. For any other issues or unclear images, escalate to agent. You must always use tools!"

def delivery_exception_support_handler(test_image: str):
    payload = {
        "model": MODEL,
        "response_model": Iterable[RefundOrder | ReplaceOrder | EscalateToAgent],
        "tool_choice": "auto",  # automatically select the tool based on the context
        "temperature": 0.0,  # for less diversity in responses
        "seed": 123,  # Set a seed for reproducibility
    }
    payload["messages"] = [
        {
            "role": "user",
            "content": INSTRUCTION_PROMPT,
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data[test_image]}"
                    }
                },
            ],
        }
    ]
    function_calls = instructor.from_openai(
        OpenAI(), mode=instructor.Mode.PARALLEL_TOOLS
    ).chat.completions.create(**payload)

    for tool in function_calls:
        print(f"- Tool call: {tool.action} for provided img: {test_image}")
        print(f"- Parameters: {tool}")
        print(f">> Action result: {tool(ORDER_ID)}")
        return tool


print("Processing delivery exception support for different package images...")

print("\n===================== Simulating user message 1 =====================")
assert delivery_exception_support_handler("damaged_package").action == "refund_order"

print("\n===================== Simulating user message 2 =====================")
assert delivery_exception_support_handler("normal_package").action == "escalate_to_agent"

print("\n===================== Simulating user message 3 =====================")
assert delivery_exception_support_handler("wet_package").action == "replace_order"
Processing delivery exception support for different package images...

===================== Simulating user message 1 =====================
- Tool call: refund_order for provided img: damaged_package
- Parameters: rationale='The package appears damaged as it is visibly crushed and deformed.' image_description='A package that is visibly crushed and deformed, with torn and wrinkled packaging material.' action='refund_order' message=None
>> Action result: Order 12345 has been refunded successfully.

===================== Simulating user message 2 =====================
- Tool call: escalate_to_agent for provided img: normal_package
- Parameters: rationale='The package appears normal and undamaged in the image.' image_description='A cardboard box placed on a wooden floor, showing no visible signs of damage or wetness.' action='escalate_to_agent' message='The package appears normal and undamaged. Please review further.'
>> Action result: Order 12345 has been escalated to an agent with message: `The package appears normal and undamaged. Please review further.`

===================== Simulating user message 3 =====================
- Tool call: replace_order for provided img: wet_package
- Parameters: rationale='The package appears wet, which may compromise its contents.' image_description="A cardboard box labeled 'Fragile' with visible wet spots on its surface." action='replace_order' message=None
>> Action result: Order 12345 has been replaced with a new order.

2. Analyzing an organizational chart to extract employee information

For the second example, we'll analyze an organizational chart image to extract employee information such as employee names, roles, managers, and manager roles. We'll use GPT-4o with vision to process the org chart image and extract structured data about the employees in the organization. In effect, function calling lets us go beyond OCR to actually infer and transform the hierarchical relationships in the chart.

We'll start with a sample organizational chart in PDF format that we want to analyze, and convert the first page of the PDF to a JPEG image for analysis.

# Function to convert a single page PDF page to a JPEG image
def convert_pdf_page_to_jpg(pdf_path: str, output_path: str, page_number=0):
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)  # 0 is the first page
    pix = page.get_pixmap()
    # Save the pixmap as a JPEG
    pix.save(output_path)


def display_img_local(image_path: str):
    img = Image.open(image_path)
    display(img)


pdf_path = 'data/org-chart-sample.pdf'
output_path = 'org-chart-sample.jpg'

convert_pdf_page_to_jpg(pdf_path, output_path)
display_img_local(output_path)
image generated by notebook

The organizational chart image has been successfully extracted from the PDF file and displayed. Now let's define a function that uses GPT-4o with vision to analyze the org chart image. The function will extract information about the employees, their roles, and their managers from the image. We'll use function/tool calling to specify the input parameters for the organizational structure, such as employee names, roles, and manager names and roles, with Pydantic models defining the data structure.

base64_img = encode_image(output_path)

class RoleEnum(str, Enum):
    """Defines possible roles within an organization."""
    CEO = "CEO"
    CTO = "CTO"
    CFO = "CFO"
    COO = "COO"
    EMPLOYEE = "Employee"
    MANAGER = "Manager"
    INTERN = "Intern"
    OTHER = "Other"

class Employee(BaseModel):
    """Represents an employee, including their name, role, and optional manager information."""
    employee_name: str = Field(..., description="The name of the employee")
    role: RoleEnum = Field(..., description="The role of the employee")
    manager_name: Optional[str] = Field(None, description="The manager's name, if applicable")
    manager_role: Optional[RoleEnum] = Field(None, description="The manager's role, if applicable")


class EmployeeList(BaseModel):
    """A list of employees within the organizational structure."""
    employees: List[Employee] = Field(..., description="A list of employees")

def parse_orgchart(base64_img: str) -> EmployeeList:
    response = instructor.from_openai(OpenAI()).chat.completions.create(
        model=MODEL,
        response_model=EmployeeList,
        messages=[
            {
                "role": "user",
                "content": 'Analyze the given organizational chart and very carefully extract the information.',
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_img}"
                        }
                    },
                ],
            }
        ],
    )
    return response

Now we'll call the function to parse the response from GPT-4o with vision and extract the employee data, then tabulate the extracted data for easy visualization. Note that the accuracy of the extracted data may vary based on the complexity and clarity of the input image.

# call the functions to analyze the organizational chart and parse the response
result = parse_orgchart(base64_img)

# tabulate the extracted data
df = pd.DataFrame([{
    'employee_name': employee.employee_name,
    'role': employee.role.value,
    'manager_name': employee.manager_name,
    'manager_role': employee.manager_role.value if employee.manager_role else None
} for employee in result.employees])

display(df)
employee_name role manager_name manager_role
0 Juliana Silva CEO None None
1 Kim Chun Hei CFO Juliana Silva CEO
2 Cahaya Dewi Manager Kim Chun Hei CFO
3 Drew Feig Employee Cahaya Dewi Manager
4 Richard Sanchez Employee Cahaya Dewi Manager
5 Sacha Dubois Intern Cahaya Dewi Manager
6 Chad Gibbons CTO Juliana Silva CEO
7 Shawn Garcia Manager Chad Gibbons CTO
8 Olivia Wilson Employee Shawn Garcia Manager
9 Matt Zhang Intern Shawn Garcia Manager
10 Chiaki Sato COO Juliana Silva CEO
11 Aaron Loeb Manager Chiaki Sato COO
12 Avery Davis Employee Aaron Loeb Manager
13 Harper Russo Employee Aaron Loeb Manager
14 Taylor Alonso Intern Aaron Loeb Manager

The data extracted from the organizational chart has been successfully parsed and displayed in a DataFrame. This approach lets us leverage GPT-4o's vision capabilities to extract structured information from images such as org charts and diagrams, and process the data for further analysis. By using function calling, we can extend the capabilities of multimodal models to perform specific tasks or call external functions.
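
Because the extraction preserves each employee's manager, the flat table can be folded back into a tree for downstream analysis. A minimal stdlib sketch using a few of the rows above:

```python
from collections import defaultdict

# A few (employee, manager) pairs mirroring rows of the extracted table.
rows = [
    ("Juliana Silva", None),
    ("Kim Chun Hei", "Juliana Silva"),
    ("Cahaya Dewi", "Kim Chun Hei"),
    ("Drew Feig", "Cahaya Dewi"),
    ("Sacha Dubois", "Cahaya Dewi"),
]

# Invert the manager column into a manager -> direct reports mapping,
# recovering the hierarchy of the original org chart.
reports = defaultdict(list)
for employee, manager in rows:
    if manager is not None:
        reports[manager].append(employee)

print(dict(reports))
```

From here it is straightforward to compute span of control, walk reporting chains, or render the hierarchy with a graph library.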