Using GPT-4 Vision with Function Calling

Dec 13, 2024

GPT-4o, available as gpt-4o-2024-11-20 as of November 2024, now supports function calling with vision capabilities, along with improved reasoning and a knowledge cutoff of October 2023. Combining images with function calling unlocks multimodal use cases and reasoning, letting you go beyond OCR and image captioning.
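
As a quick sketch of the request shape, an image is passed to the model as a base64 data URL inside a user message, alongside any text. The bytes below are a hypothetical stand-in, not a real photo; in practice they come from reading an image file:

```python
import base64

# Hypothetical stand-in for real image bytes; in the notebook they come
# from reading a .jpg file on disk.
fake_jpeg_bytes = b"\xff\xd8\xff\xe0 not a real photo"
b64 = base64.b64encode(fake_jpeg_bytes).decode("utf-8")

# Multimodal input to the Chat Completions API: text and image parts sit
# side by side in the `content` list of a single user message.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is the condition of this package?"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        },
    ],
}
```

Any tools passed in the same request can then be invoked by the model based on what it sees in the image.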

We'll walk through two examples to demonstrate function calling with GPT-4o with vision:

  1. Simulating a customer service assistant for delivery exception support
  2. Analyzing an organizational chart to extract employee information
!pip install pymupdf --quiet
!pip install openai --quiet
!pip install matplotlib --quiet
# instructor makes it easy to work with function calling
!pip install instructor --quiet
import base64
import os
from enum import Enum
from io import BytesIO
from typing import Iterable
from typing import List
from typing import Literal, Optional

import fitz
# Instructor is powered by Pydantic, which is powered by type hints.
# Schema validation and prompting are controlled by type annotations.
import instructor
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import display
from PIL import Image
from openai import OpenAI
from pydantic import BaseModel, Field
Matplotlib is building the font cache; this may take a moment.

1. Simulating a customer service assistant for delivery exception support

We'll simulate a customer service assistant for a delivery service that is equipped to analyze images of packages. Based on the image analysis, the assistant will perform the following actions:

  • If the package appears damaged in the image, automatically process a refund according to policy.
  • If the package appears wet, initiate a replacement.
  • If the package appears normal and undamaged, escalate to an agent.

Let's look at the sample package images that the customer service assistant will analyze to determine the appropriate action. We'll encode the images as base64 strings so the model can process them.

# Function to encode the image as base64
def encode_image(image_path: str):
    # check if the image exists
    if not os.path.exists(image_path):
        raise FileNotFoundError(f"Image file not found: {image_path}")
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


# Sample images for testing
image_dir = "images"

# encode all images within the directory
image_files = os.listdir(image_dir)
image_data = {}
for image_file in image_files:
    image_path = os.path.join(image_dir, image_file)
    # encode the image with key as the image file name
    image_data[image_file.split('.')[0]] = encode_image(image_path)
    print(f"Encoded image: {image_file}")


def display_images(image_data: dict):
    fig, axs = plt.subplots(1, len(image_data), figsize=(18, 6))
    for i, (key, value) in enumerate(image_data.items()):
        img = Image.open(BytesIO(base64.b64decode(value)))
        ax = axs[i]
        ax.imshow(img)
        ax.axis("off")
        ax.set_title(key)
    plt.tight_layout()
    plt.show()


display_images(image_data)
Encoded image: wet_package.jpg
Encoded image: damaged_package.jpg
Encoded image: normal_package.jpg
image generated by notebook

We have successfully encoded the sample images as base64 strings and displayed them. The customer service assistant will analyze these images to determine the appropriate action based on the package condition.

Now let's define the functions/tools for order handling, such as escalating an order to an agent, refunding an order, and replacing an order. We'll create placeholder functions to simulate processing these actions based on the identified tools, and use Pydantic models to define the data structure for the order actions.

MODEL = "gpt-4o-2024-11-20"

class Order(BaseModel):
    """Represents an order with details such as order ID, customer name, product name, price, status, and delivery date."""
    order_id: str = Field(..., description="The unique identifier of the order")
    product_name: str = Field(..., description="The name of the product")
    price: float = Field(..., description="The price of the product")
    status: str = Field(..., description="The status of the order")
    delivery_date: str = Field(..., description="The delivery date of the order")
# Placeholder functions for order processing

def get_order_details(order_id):
    # Placeholder function to retrieve order details based on the order ID
    return Order(
        order_id=order_id,
        product_name="Product X",
        price=100.0,
        status="Delivered",
        delivery_date="2024-04-10",
    )

def escalate_to_agent(order: Order, message: str):
    # Placeholder function to escalate the order to a human agent
    return f"Order {order.order_id} has been escalated to an agent with message: `{message}`"

def refund_order(order: Order):
    # Placeholder function to process a refund for the order
    return f"Order {order.order_id} has been refunded successfully."

def replace_order(order: Order):
    # Placeholder function to replace the order with a new one
    return f"Order {order.order_id} has been replaced with a new order."

class FunctionCallBase(BaseModel):
    rationale: Optional[str] = Field(..., description="The reason for the action.")
    image_description: Optional[str] = Field(
        ..., description="The detailed description of the package image."
    )
    action: Literal["escalate_to_agent", "replace_order", "refund_order"]
    message: Optional[str] = Field(
        ...,
        description="The message to be escalated to the agent if action is escalate_to_agent",
    )
    # Placeholder functions to process the action based on the order ID
    def __call__(self, order_id):
        order: Order = get_order_details(order_id=order_id)
        if self.action == "escalate_to_agent":
            return escalate_to_agent(order, self.message)
        if self.action == "replace_order":
            return replace_order(order)
        if self.action == "refund_order":
            return refund_order(order)

class EscalateToAgent(FunctionCallBase):
    """Escalate to an agent for further assistance."""
    pass

class OrderActionBase(FunctionCallBase):
    pass

class ReplaceOrder(OrderActionBase):
    """Tool call to replace an order."""
    pass

class RefundOrder(OrderActionBase):
    """Tool call to refund an order."""
    pass
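
Under the hood, instructor hands each of these Pydantic classes to the API as a function/tool JSON schema, which is what the model fills in when it picks an action. A trimmed sketch of how to inspect that schema, assuming Pydantic v2's `model_json_schema`:

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field

# Trimmed copy of the tool classes above, kept only to inspect the schema.
class FunctionCallBase(BaseModel):
    rationale: Optional[str] = Field(None, description="The reason for the action.")
    action: Literal["escalate_to_agent", "replace_order", "refund_order"]
    message: Optional[str] = Field(None, description="Escalation message, if any.")

class RefundOrder(FunctionCallBase):
    """Tool call to refund an order."""

# The generated JSON schema lists the properties the model must fill in.
schema = RefundOrder.model_json_schema()
print(sorted(schema["properties"]))
```

The `action` field is a Literal, so the schema constrains the model to one of the three predefined actions rather than free-form text.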

Simulating user messages and processing the package images

We'll simulate user messages containing package images and process the images with GPT-4o with vision. The model will identify the appropriate tool call based on the image analysis and the predefined actions for damaged, wet, or normal packages. We'll then process the identified action based on the order ID and display the results.

# extract the tool call from the response
ORDER_ID = "12345"  # Placeholder order ID for testing
INSTRUCTION_PROMPT = "You are a customer service assistant for a delivery service, equipped to analyze images of packages. If a package appears damaged in the image, automatically process a refund according to policy. If the package looks wet, initiate a replacement. If the package appears normal and not damaged, escalate to agent. For any other issues or unclear images, escalate to agent. You must always use tools!"

def delivery_exception_support_handler(test_image: str):
    payload = {
        "model": MODEL,
        "response_model": Iterable[RefundOrder | ReplaceOrder | EscalateToAgent],
        "tool_choice": "auto",  # automatically select the tool based on the context
        "temperature": 0.0,  # for less diversity in responses
        "seed": 123,  # Set a seed for reproducibility
    }
    payload["messages"] = [
        {
            "role": "user",
            "content": INSTRUCTION_PROMPT,
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data[test_image]}"
                    }
                },
            ],
        }
    ]
    function_calls = instructor.from_openai(
        OpenAI(), mode=instructor.Mode.PARALLEL_TOOLS
    ).chat.completions.create(**payload)

    for tool in function_calls:
        print(f"- Tool call: {tool.action} for provided img: {test_image}")
        print(f"- Parameters: {tool}")
        print(f">> Action result: {tool(ORDER_ID)}")
        return tool


print("Processing delivery exception support for different package images...")

print("\n===================== Simulating user message 1 =====================")
assert delivery_exception_support_handler("damaged_package").action == "refund_order"

print("\n===================== Simulating user message 2 =====================")
assert delivery_exception_support_handler("normal_package").action == "escalate_to_agent"

print("\n===================== Simulating user message 3 =====================")
assert delivery_exception_support_handler("wet_package").action == "replace_order"
Processing delivery exception support for different package images...

===================== Simulating user message 1 =====================
- Tool call: refund_order for provided img: damaged_package
- Parameters: rationale='The package appears damaged as it is visibly crushed and deformed.' image_description='A package that is visibly crushed and deformed, with torn and wrinkled packaging material.' action='refund_order' message=None
>> Action result: Order 12345 has been refunded successfully.

===================== Simulating user message 2 =====================
- Tool call: escalate_to_agent for provided img: normal_package
- Parameters: rationale='The package appears normal and undamaged in the image.' image_description='A cardboard box placed on a wooden floor, showing no visible signs of damage or wetness.' action='escalate_to_agent' message='The package appears normal and undamaged. Please review further.'
>> Action result: Order 12345 has been escalated to an agent with message: `The package appears normal and undamaged. Please review further.`

===================== Simulating user message 3 =====================
- Tool call: replace_order for provided img: wet_package
- Parameters: rationale='The package appears wet, which may compromise its contents.' image_description="A cardboard box labeled 'Fragile' with visible wet spots on its surface." action='replace_order' message=None
>> Action result: Order 12345 has been replaced with a new order.

2. Analyzing an organizational chart to extract employee information

For the second example, we'll analyze an organizational chart image to extract employee information such as employee names, roles, managers, and manager roles. We'll use GPT-4o with vision to process the org chart image and extract structured data about the employees in the organization. In effect, function calling lets us go beyond OCR to actually infer and transform the hierarchical relationships in the chart.

We'll start with a sample organizational chart in PDF format that we want to analyze, and convert the first page of the PDF to a JPEG image for analysis.

# Function to convert a single page PDF page to a JPEG image
def convert_pdf_page_to_jpg(pdf_path: str, output_path: str, page_number=0):
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)  # 0 is the first page
    pix = page.get_pixmap()
    # Save the pixmap as a JPEG
    pix.save(output_path)


def display_img_local(image_path: str):
    img = Image.open(image_path)
    display(img)


pdf_path = 'data/org-chart-sample.pdf'
output_path = 'org-chart-sample.jpg'

convert_pdf_page_to_jpg(pdf_path, output_path)
display_img_local(output_path)
image generated by notebook

The organizational chart image has been successfully extracted from the PDF file and displayed. Now let's define a function that uses GPT-4o with vision to analyze the org chart image. The function will extract information about the employees, their roles, and their managers from the image. We'll use function/tool calling to specify the input parameters for the organizational structure, such as employee names, roles, and manager names and roles, with Pydantic models defining the data structure.

base64_img = encode_image(output_path)

class RoleEnum(str, Enum):
    """Defines possible roles within an organization."""
    CEO = "CEO"
    CTO = "CTO"
    CFO = "CFO"
    COO = "COO"
    EMPLOYEE = "Employee"
    MANAGER = "Manager"
    INTERN = "Intern"
    OTHER = "Other"

class Employee(BaseModel):
    """Represents an employee, including their name, role, and optional manager information."""
    employee_name: str = Field(..., description="The name of the employee")
    role: RoleEnum = Field(..., description="The role of the employee")
    manager_name: Optional[str] = Field(None, description="The manager's name, if applicable")
    manager_role: Optional[RoleEnum] = Field(None, description="The manager's role, if applicable")


class EmployeeList(BaseModel):
    """A list of employees within the organizational structure."""
    employees: List[Employee] = Field(..., description="A list of employees")

def parse_orgchart(base64_img: str) -> EmployeeList:
    response = instructor.from_openai(OpenAI()).chat.completions.create(
        model=MODEL,
        response_model=EmployeeList,
        messages=[
            {
                "role": "user",
                "content": 'Analyze the given organizational chart and very carefully extract the information.',
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_img}"
                        }
                    },
                ],
            }
        ],
    )
    return response

Now we'll call the function to parse the response from GPT-4o with vision and extract the employee data, then tabulate the extracted data for easy visualization. Note that the accuracy of the extracted data may vary based on the complexity and clarity of the input image.

# call the functions to analyze the organizational chart and parse the response
result = parse_orgchart(base64_img)

# tabulate the extracted data
df = pd.DataFrame([{
    'employee_name': employee.employee_name,
    'role': employee.role.value,
    'manager_name': employee.manager_name,
    'manager_role': employee.manager_role.value if employee.manager_role else None
} for employee in result.employees])

display(df)
employee_name role manager_name manager_role
0 Juliana Silva CEO None None
1 Kim Chun Hei CFO Juliana Silva CEO
2 Cahaya Dewi Manager Kim Chun Hei CFO
3 Drew Feig Employee Cahaya Dewi Manager
4 Richard Sanchez Employee Cahaya Dewi Manager
5 Sacha Dubois Intern Cahaya Dewi Manager
6 Chad Gibbons CTO Juliana Silva CEO
7 Shawn Garcia Manager Chad Gibbons CTO
8 Olivia Wilson Employee Shawn Garcia Manager
9 Matt Zhang Intern Shawn Garcia Manager
10 Chiaki Sato COO Juliana Silva CEO
11 Aaron Loeb Manager Chiaki Sato COO
12 Avery Davis Employee Aaron Loeb Manager
13 Harper Russo Employee Aaron Loeb Manager
14 Taylor Alonso Intern Aaron Loeb Manager

The data extracted from the organizational chart has been successfully parsed and displayed in a DataFrame. This approach lets us leverage GPT-4o's vision capabilities to extract structured information from images such as org charts and diagrams, and process the data for further analysis. By using function calling, we can extend the capabilities of multimodal models to perform specific tasks or call external functions.
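
Because the extraction preserves each employee's manager, the flat table can be folded back into a tree for downstream analysis. A minimal stdlib sketch using a few of the rows above:

```python
from collections import defaultdict

# A few (employee, manager) pairs mirroring rows of the extracted table.
rows = [
    ("Juliana Silva", None),
    ("Kim Chun Hei", "Juliana Silva"),
    ("Cahaya Dewi", "Kim Chun Hei"),
    ("Drew Feig", "Cahaya Dewi"),
    ("Sacha Dubois", "Cahaya Dewi"),
]

# Invert the manager column into a manager -> direct reports mapping,
# recovering the hierarchy of the original org chart.
reports = defaultdict(list)
for employee, manager in rows:
    if manager is not None:
        reports[manager].append(employee)

print(dict(reports))
```

From here it is straightforward to compute span of control, walk reporting chains, or render the hierarchy with a graph library.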