Vision Fine-Tuning on GPT-4o for Visual Question Answering

November 1, 2024

We're excited to announce vision fine-tuning for GPT-4o, a cutting-edge multimodal fine-tuning capability that lets developers fine-tune GPT-4o using both images and text. With this new feature, you can customize the model for stronger image understanding, unlocking new possibilities across industries and applications.

From advanced visual search to improved object detection for autonomous vehicles or smart cities, vision fine-tuning enables you to craft solutions tailored to your specific needs. By combining text and image inputs, this capability is uniquely suited to tasks like visual question answering, where detailed, context-aware answers are drawn from analyzing images. In general, it appears to be most effective when the questions and images the model receives resemble the training data, as we can teach the model how to search for and identify the relevant parts of an image to answer a question correctly. Much like fine-tuning on text inputs, vision fine-tuning is less useful for teaching the model new information.

In this guide, we'll walk you through the steps to fine-tune GPT-4o with multimodal inputs. Specifically, we'll demonstrate how to train a model to answer questions about images of books, but the potential applications span countless domains, from web design and education to healthcare and research.

Whether you're looking to build a smarter defect-detection model for manufacturing, enhance complex document processing and chart understanding, or develop applications with better visual understanding for a variety of other use cases, this guide will show you just how quick and easy it is to get started.

For more information, check out the full documentation.

from openai import OpenAI
import json
import os

client = OpenAI()

We'll use a dataset of question-answer pairs on images of books from the OCR-VQA dataset, accessible through HuggingFace. The dataset contains 207,572 images of books, along with associated question-answer pairs asking about the title, author, edition, year, and genre of each book. In total, it contains roughly 1 million QA pairs. For the purposes of this guide, we'll use only a small subset of the dataset to train, validate, and test our model.

We believe this dataset is well suited for fine-tuning on multimodal inputs, because it requires the model not only to accurately identify the relevant bounding boxes to extract key information, but also to reason about the content of the image to answer the question correctly.

from datasets import load_dataset

# load dataset
ds = load_dataset("howard-hou/OCR-VQA")

We'll start by sampling 150 training examples, 50 validation examples, and 100 test examples. We'll also explode the questions and answers columns to create one QA pair per row. Additionally, since our images are stored as byte strings, we'll convert them to images for processing.

import pandas as pd
from io import BytesIO
from PIL import Image

# sample 150 training examples, 50 validation examples and 100 test examples
ds_train = ds['train'].shuffle(seed=42).select(range(150))
ds_val = ds['validation'].shuffle(seed=42).select(range(50))
ds_test = ds['test'].shuffle(seed=42).select(range(100))

# convert to pandas dataframe
ds_train = ds_train.to_pandas()
ds_val = ds_val.to_pandas()
ds_test = ds_test.to_pandas()

# convert byte strings to images
ds_train['image'] = ds_train['image'].apply(lambda x: Image.open(BytesIO(x['bytes'])))
ds_val['image'] = ds_val['image'].apply(lambda x: Image.open(BytesIO(x['bytes'])))
ds_test['image'] = ds_test['image'].apply(lambda x: Image.open(BytesIO(x['bytes'])))

# explode the 'questions' and 'answers' columns
ds_train = ds_train.explode(['questions', 'answers'])
ds_val = ds_val.explode(['questions', 'answers'])
ds_test = ds_test.explode(['questions', 'answers'])

# rename columns
ds_train = ds_train.rename(columns={'questions': 'question', 'answers': 'answer'})
ds_val = ds_val.rename(columns={'questions': 'question', 'answers': 'answer'})
ds_test = ds_test.rename(columns={'questions': 'question', 'answers': 'answer'})

# create unique ids for each example
ds_train = ds_train.reset_index(drop=True)
ds_val = ds_val.reset_index(drop=True)
ds_test = ds_test.reset_index(drop=True)

# select columns
ds_train = ds_train[['question', 'answer', 'image']]
ds_val = ds_val[['question', 'answer', 'image']]
ds_test = ds_test[['question', 'answer', 'image']]

Let's inspect a random example from the training set.

In this example, the question asks the model to determine the title of the book. The answer here is quite ambiguous, since there is both a main title, "Patty's Patterns - Advanced Series Vol. 1 & 2," and a subtitle, "100 Full-Page Patterns Value Bundle," located in different parts of the image. Furthermore, the author here is not an individual but a group called "Penny Farthing Graphics," which could be mistaken for part of the title.

This type of task is typical in visual question answering, where the model must interpret complex images and provide accurate, context-specific responses. By training on these kinds of questions, we can improve the model's ability to perform detailed image analysis across a variety of domains.

from IPython.display import display

# display a random training example
print('QUESTION:', ds_train.iloc[198]['question'])
display(ds_train.iloc[198]['image'])
print('ANSWER:', ds_train.iloc[198]['answer'])
QUESTION: What is the title of this book?
image generated by notebook
ANSWER: Patty's Patterns - Advanced Series Vol. 1 & 2: 100 Full-Page Patterns Value Bundle

To ensure a successful fine-tune of our model, it's essential to structure the training data correctly. Properly formatting the data helps avoid validation errors during training and ensures the model can learn effectively from both text and image inputs. The good news is that this process is quite straightforward.

Each example in the training dataset should be a conversation in the same format as the Chat Completions API. Specifically, this means structuring the data as a series of messages, where each message includes a role (such as "user" or "assistant") and the content of the message.

Since we're fine-tuning on both text and images, we'll construct these messages to contain both content types. For each training example, the question about the image is presented as a user message, and the corresponding answer is supplied as an assistant message.

Images can be included in one of two ways:

  • As an HTTP URL referencing the image's location.
  • As a data URL containing the image encoded in base64.

Here's an example of what the message format should look like:

{
    "messages": 
    [
        {
            "role": "system",
            "content": "Use the image to answer the question."
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the title of this book?"},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<encoded_image>"}}
            ]
        }
    ]
}
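
For reference, the image_url entry can equally point to a hosted image over HTTP instead of embedding it as base64. The URL below is a hypothetical placeholder:

{"type": "image_url", "image_url": {"url": "https://example.com/book-cover.jpg"}}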

Let's begin by defining the system instructions for our model. These instructions give the model important context and guide how it should behave when processing the training data. Clear, concise system instructions are especially useful for ensuring the model reasons well over both text and images.

SYSTEM_PROMPT = """
Generate an answer to the question based on the image of the book provided.
Questions will include both open-ended questions and binary "yes/no" questions.
The questions will inquire about the title, author, edition, year and genre of the book in the image.

You will read the question and examine the corresponding image to provide an accurate answer.

# Steps

1. **Read the Question:** Carefully analyze the question to understand what information is being asked.
2. **Examine the Image:**
   - **Identify Relevant Bounding Boxes (if applicable):** For questions requiring specific details like the title or author, focus on the relevant areas or bounding boxes within the image to extract the necessary text. There may be multiple relevant bounding boxes in the image, so be sure to consider all relevant areas.
   - **Analyze the Whole Image:** For questions that need general reasoning (e.g., "Is this book related to Children's Books?"), consider the entire image, including title, graphics, colors, and overall design elements.
3. **Formulate a Reasoned Answer:**
   - For binary questions (yes/no), use evidence from the image to support your answer.
   - For open-ended questions, provide the exact text from the image or a concise phrase that best describes the requested information.

# Output Format

- Provide your answer in a concise and clear manner. Always return the final conclusion only, no additional text or reasoning.
- If the question is binary, answer with "Yes" or "No."
- For open-ended questions requesting specific details (e.g., title, author), return the exact text from the image.
- For questions about general attributes like "genre," return a single word or phrase that best describes it.

# Notes

- Always prioritize accuracy and clarity in your responses.
- If multiple authors are listed, return the first author listed.
- If the information is not present in the image, try to reason about the question using the information you can gather from the image e.g. if the author is not listed, use the title and genre to find the author.
- Ensure reasoning steps logically lead to the conclusions before stating your final answer.

# Examples
You will be provided with examples of questions and corresponding images of book covers, along with the reasoning and conclusion for each example. Use these examples to guide your reasoning process."""

To make sure our images are correctly formatted for vision fine-tuning, they must be base64-encoded and in RGB or RGBA mode. This ensures the model can process the images accurately during training. Below is a function that handles encoding the images and converts them to the correct format where necessary.

This function lets us control the quality of the image encoding, which can be useful if we want to reduce file size; 100 is the highest quality and 1 is the lowest. The maximum file size for a fine-tuning job is 1 GB, although we're unlikely to see improvements from a very large volume of training data. Still, we can use the quality parameter to shrink the file size if we need to stay within that limit.

import base64

def encode_image(image, quality=100):
    if image.mode != 'RGB':
        image = image.convert('RGB')  # Convert to RGB
    buffered = BytesIO()
    image.save(buffered, format="JPEG", quality=quality) 
    return base64.b64encode(buffered.getvalue()).decode("utf-8")
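
As a quick, optional check, we can compare the size of the base64 payload at a few quality settings. This is just a sketch using the first training image from the dataframe we prepared above:

# compare base64 payload sizes at different JPEG quality settings
sample_image = ds_train.iloc[0]['image']
for q in (100, 75, 50):
    encoded = encode_image(sample_image, quality=q)
    print(f"quality={q}: {len(encoded) / 1024:.1f} KB")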

We'll also include few-shot examples from the training set as user and assistant messages to help guide the model's reasoning process.

FEW_SHOT_EXAMPLES = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "**Example 1:**\n\n**Question:** Who wrote this book?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(ds_train.iloc[286]['image'], quality=50)}"}}
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "**Reasoning:** The cover clearly displays two authors' names, 'Evelyn M. Thomson' and 'Orlen N. Johnson,' at the bottom of the cover, with Evelyn M. Thomson listed first. Typically, the first-listed author is considered the primary author or main contributor.\n\n**Conclusion:** Evelyn Thomson"}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "**Example 2:**\n\n**Question:** What is the title of this book?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(ds_train.iloc[22]['image'], quality=50)}"}}
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "**Answer:**\n\n**Reasoning:** The cover prominently displays the title across the top and center of the image. The full title reads, 'Computer Systems: An Integrated Approach to Architecture and Operating Systems,' with each component of the title clearly separated and formatted to stand out.\n\n**Conclusion:** Computer Systems: An Integrated Approach to Architecture and Operating Systems"}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "**Example 3:**\n\n**Question:** Is this book related to Children's Books?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(ds_train.iloc[492]['image'], quality=50)}"}}
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "**Answer:**\n\n**Reasoning:** The cover illustration features a whimsical mermaid holding a red shoe, with gentle, child-friendly artwork that suggests it is targeted toward a young audience. Additionally, the style and imagery are typical of children's literature.\n\n**Conclusion:** Yes"}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "**Example 4:**\n\n**Question:** Is this book related to History?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(ds_train.iloc[68]['image'], quality=50)}"}}
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "**Answer:**\n\n**Reasoning:** The title 'Oliver Wendell Holmes, Jr.: Civil War Soldier, Supreme Court Justice' clearly indicates that this book focuses on the life of Oliver Wendell Holmes, Jr., providing a biographical account rather than a general historical analysis. Although it references historical elements (Civil War, Supreme Court), the primary focus is on the individual rather than historical events as a whole.\n\n**Conclusion:** No"}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "**Example 5:**\n\n**Question:** What is the genre of this book?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(ds_train.iloc[42]['image'], quality=50)}"}}
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "**Answer:**\n\n**Reasoning:** The cover prominently features an image of a train station and the title 'Railway Depots, Stations & Terminals,' which directly suggests a focus on railway infrastructure. This points to the book being related to topics within Engineering & Transportation.\n\n**Conclusion:** Engineering & Transportation"}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "**Example 6:**\n\n**Question:** What type of book is this?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(ds_train.iloc[334]['image'], quality=50)}"}}
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "**Answer:**\n\n**Reasoning:** The title 'Principles and Practice of Modern Chromatographic Methods' suggests a focus on chromatography, a scientific technique used in chemistry and biology. This aligns with the academic and technical nature typical of books in the 'Science & Math' category.\n\n**Conclusion:** Science & Math"}
        ]
    }
]

Now that we have our system instructions, few-shot examples, and image-encoding function, the next step is to iterate over the training set and construct the messages needed for fine-tuning. As a reminder, each training example must be formatted as a conversation and must include the image (in base64) along with the corresponding question and answer.

To fine-tune GPT-4o, we recommend providing at least 10 examples, but you'll typically see clear improvements with 50 to 100 training examples. In this case, we'll go all in and fine-tune the model on our larger training sample of 150 images and 721 QA pairs.

from tqdm import tqdm

# constructing the training set
json_data = []

for idx, example in tqdm(ds_train.iterrows()):
    system_message = {
        "role": "system",
        "content": [{"type": "text", "text": SYSTEM_PROMPT}]
    }
    
    user_message = {
        "role": "user",
        "content": [
            {"type": "text", "text": f"Question [{idx}]: {example['question']}"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(example['image'], quality=50)}"}}
        ]
    }
    
    assistant_message = {
        "role": "assistant",
        "content": [{"type": "text", "text": example["answer"]}]
    }

    all_messages = [system_message] + FEW_SHOT_EXAMPLES + [user_message, assistant_message]
    
    json_data.append({"messages": all_messages})
721it [00:01, 518.61it/s]

We'll save the final training set to a .jsonl file, where each line in the file represents a single example from the training dataset.

# save the JSON data to a file
with open("ocr-vqa-train.jsonl", "w") as f:
    for message in json_data:
        json.dump(message, f)
        f.write("\n")
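
Before uploading, it's worth a quick sanity check that every line parses as valid JSON and contains the expected messages key; a minimal sketch:

# sanity check: each line should parse and contain a "messages" list
with open("ocr-vqa-train.jsonl") as f:
    examples = [json.loads(line) for line in f]
print(f"{len(examples)} training examples")
assert all("messages" in ex for ex in examples)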

As with the training set, we need to construct our validation and test sets in the same message format. For the test set, however, there's one key difference: since the test set is used for evaluation, we omit the assistant message (i.e., the answer). This ensures the model generates its own answers, which we can later compare against the ground truth to evaluate performance.

# constructing the validation set
json_data = []

for idx, example in tqdm(ds_val.iterrows()):
    system_message = {
        "role": "system",
        "content": [{"type": "text", "text": SYSTEM_PROMPT}]
    }
    
    user_message = {
        "role": "user",
        "content": [
            {"type": "text", "text": f"Question [{idx}]: {example['question']}"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(example['image'], quality=50)}"}}
        ]
    }

    assistant_message = {
        "role": "assistant",
        "content": [{"type": "text", "text": example["answer"]}]
    }

    all_messages = [system_message] + FEW_SHOT_EXAMPLES + [user_message, assistant_message]
    
    json_data.append({"messages": all_messages})

# save the JSON data to a file
with open("ocr-vqa-validation.jsonl", "w") as f:
    for message in json_data:
        json.dump(message, f)
        f.write("\n")
239it [00:00, 474.76it/s]
# constructing the test set
json_data = []

for idx, example in tqdm(ds_test.iterrows()):
    system_message = {
        "role": "system",
        "content": [{"type": "text", "text": SYSTEM_PROMPT}]
    }
    
    user_message = {
        "role": "user",
        "content": [
            {"type": "text", "text": f"Question [{idx}]: {example['question']}"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(example['image'], quality=50)}"}}
        ]
    }

    all_messages = [system_message] + FEW_SHOT_EXAMPLES + [user_message]
    
    json_data.append({"messages": all_messages})

# save the JSON data to a file
with open("ocr-vqa-test.jsonl", "w") as f:
    for message in json_data:
        json.dump(message, f)
        f.write("\n")
485it [00:00, 490.79it/s]

Now that our training and validation datasets are prepared in the correct format, we can upload them for fine-tuning using the Files API.

# upload training file
train_file = client.files.create(
  file=open("ocr-vqa-train.jsonl", "rb"),
  purpose="fine-tune"
)

# upload validation file
val_file = client.files.create(
  file=open("ocr-vqa-validation.jsonl", "rb"),
  purpose="fine-tune"
)

With the files uploaded, we're ready for the next step: kicking off the fine-tuning job.

To create a fine-tuning job, we use the fine-tuning API. The job may take a while to complete, but you can track its progress in the platform UI.

# create fine tuning job
file_train = train_file.id
file_val = val_file.id

client.fine_tuning.jobs.create(
  training_file=file_train,
  # note: validation file is optional
  validation_file=file_val,
  model="gpt-4o-2024-08-06"
)
FineTuningJob(id='ftjob-I1GKWTvusx0900L4ggohrGCP', created_at=1730479789, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-4o-2024-08-06', object='fine_tuning.job', organization_id='org-l89177bnhkme4a44292n5r3j', result_files=[], seed=662273734, status='validating_files', trained_tokens=None, training_file='file-UzGnMr4kYPgcFeuq121UBifQ', validation_file='file-LoWiW0fCIa3eirRZExRU3pKB', estimated_finish=None, integrations=[], user_provided_suffix=None, method=None)
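
Alongside the platform UI, you can also poll the job from the API; a minimal sketch using the job ID returned above:

# check the job status and recent events (ID from the create call above)
job = client.fine_tuning.jobs.retrieve("ftjob-I1GKWTvusx0900L4ggohrGCP")
print(job.status)  # e.g. validating_files -> running -> succeeded

events = client.fine_tuning.jobs.list_events("ftjob-I1GKWTvusx0900L4ggohrGCP", limit=5)
for event in events.data:
    print(event.message)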

Once the fine-tuning job is complete, it's time to evaluate our model's performance by running inference on the test set. This step involves using the fine-tuned model to generate responses to the questions in the test set and comparing the model's predictions against the ground-truth answers. We'll also run inference on the test set with the non-fine-tuned GPT-4o model for comparison.

from concurrent.futures import ThreadPoolExecutor, as_completed
import re

# load the test data from JSONL file
test_data = []
with open("ocr-vqa-test.jsonl", "r") as f:
    for line in f:
        test_data.append(json.loads(line))

def process_example(example, model):
    response = client.chat.completions.create(
        model=model,
        messages=example["messages"],
        store=True,
        metadata={'dataset': 'ocr-vqa-test'}
    )
    predicted_answer = response.choices[0].message.content.strip()
    
    # regex to get the question ID
    match = re.search(r'\[(\d+)\]', example["messages"][-1]["content"][0]["text"])
    if match:
        example_id = int(match.group(1))
    else:
        example_id = -1
    
    actual_answer = ds_test.iloc[example_id]['answer']

    return {
        "example_id": example_id,
        "predicted_answer": predicted_answer,
        "actual_answer": actual_answer
    }

# run the prompts through the finetuned model and store the results
model = "ft:gpt-4o-2024-08-06:openai::AOY1M8VG"
results = []
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(process_example, example, model): example for example in test_data}
    for future in tqdm(as_completed(futures), total=len(futures)):
        results.append(future.result())

# save the results to a file
with open("ocr-vqa-ft-results.jsonl", "w") as f:
    for result in results:
        json.dump(result, f)
        f.write("\n")

# run the prompts through the non-fine-tuned model and store the results
model = "gpt-4o"
results = []
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(process_example, example, model): example for example in test_data}
    for future in tqdm(as_completed(futures), total=len(futures)):
        results.append(future.result())

# save the results to a file
with open("ocr-vqa-4o-results.jsonl", "w") as f:
    for result in results:
        json.dump(result, f)
        f.write("\n")
100%|██████████| 485/485 [02:03<00:00,  3.93it/s]
100%|██████████| 485/485 [01:35<00:00,  5.09it/s]

Now that we've run inference with our fine-tuned model, let's examine a few specific examples to see how the model's answers compare with the actual answers.

# Q: What is the title of this book?
{"example_id": 6, "predicted_answer": "A Wrinkle in Time", "actual_answer": "A Wrinkle in Time (Time Quintet)"}
# Q: Who wrote this book?
{"example_id": 10, "predicted_answer": "DK Travel", "actual_answer": "DK Publishing"}
# Q: What is the title of this book?
{"example_id": 11, "predicted_answer": "DK Eyewitness Travel Guide: Peru", "actual_answer": "DK Eyewitness Travel Guide: Peru"}
# Q: What type of book is this?
{"example_id": 12, "predicted_answer": "Travel", "actual_answer": "Travel"}
# Q: Who wrote this book?
{"example_id": 437, "predicted_answer": "Cookshack, Inc.", "actual_answer": "Cookshack"}
# Q: What type of book is this?
{"example_id": 482, "predicted_answer": "Christian Books & Bibles", "actual_answer": "Religion & Spirituality"}

As we can see, the fine-tuned model does an excellent job of answering the questions, with many responses exactly correct.

However, in some cases the model's predicted answer is close to the ground truth but not an exact match, especially for open-ended questions where the wording or level of detail can vary. To assess the quality of these predictions, we'll use GPT-4o to judge the similarity between the predicted responses and the ground-truth labels in the dataset.

To evaluate our model's responses, we'll use GPT-4o to determine the similarity between the ground-truth answer and our predicted response. We'll rank each predicted answer according to the following criteria:

  • Very Similar: The predicted answer exactly matches the ground truth with no important information omitted, although there may be minor omissions or discrepancies in punctuation.

  • Mostly Similar: The predicted answer closely aligns with the ground truth, perhaps with some missing words or phrases.

  • Somewhat Similar: Although the predicted answer differs noticeably from the ground truth, the core content is accurate and semantically similar, perhaps with some missing information.

  • Incorrect: The predicted answer is completely incorrect, irrelevant, or contains critical errors or omissions relative to the ground truth.

from pydantic import BaseModel, Field

# define output schema
class Result(BaseModel):
    example_id: int = Field(description="The unique ID of the question")
    rating: str = Field(description="The assigned similarity rating. One of [Very Similar | Mostly Similar | Somewhat Similar | Incorrect]")
    type: str = Field(description="The type of question. Closed if the question is a binary yes/no question, otherwise Open. One of [Open | Closed]")

EVAL_PROMPT = """
Evaluate the closeness between the predicted answer and the ground truth for each provided result.
Rank the predicted answer based on the following criteria:

1. **Very Similar**: The predicted answer exactly matches the ground truth and there is no important information omitted, although there may be some minor omissions or discrepancies in punctuation.
2. **Mostly Similar**: The predicted answer closely aligns with the ground truth, perhaps with some missing words or phrases.
3. **Somewhat Similar**: Although the predicted answer has noticeable differences to the ground truth, the core content is accurate and semantically similar, perhaps with some missing information.
4. **Incorrect**: The predicted answer is completely incorrect, irrelevant, or contains critical errors or omissions from the ground truth.

Ensure to consider both open-ended and yes/no questions.

# Steps
1. **Analyze the Answers**: Read the predicted answer, and ground truth carefully.
2. **Evaluate Similarity**:
    - Check if the predicted answer contains the same core information and correctness as the ground truth.
    - Determine if there are any important omissions or errors.
3. **Assign a Rating**: Based on your evaluation, assign the appropriate rating: Very Similar, Mostly Similar, Somewhat Similar, or Incorrect.

# Output Format
```json
[
    {
        "example_id": [example_id],
        "rating": "[Very Similar | Mostly Similar | Somewhat Similar | Incorrect]",
        "type": "[Open | Closed]"
    }
]
```

# Examples

**Input:**
```json
{"example_id": 6, "predicted_answer": "A Wrinkle in Time", "actual_answer": "A Wrinkle in Time (Time Quintet)"}
```
**Reasoning:**
The predicted answer "A Wrinkle in Time" is a very close match to the ground truth "A Wrinkle in Time (Time Quintet)" with a missing tagline or subtitle.
**Output:**
```json
{ "example_id": 6, "rating": "Mostly Similar", "type": "Open" }
```

**Input:**
```json
{"example_id": 437, "predicted_answer": "Cookshack, Inc.", "actual_answer": "Cookshack"}
```
**Reasoning:**
The predicted answer "Cookshack, Inc." is exactly the same as the ground truth "Cookshack", with only a difference in punctuation.
**Output:**
```json
{ "example_id": 437, "rating": "Very Similar", "type": "Open" }
```

**Input:**
```json
{"example_id": 482, "predicted_answer": "Christian Books & Bibles", "actual_answer": "Religion & Spirituality"}
```
**Reasoning:**
The predicted answer "Christian Books & Bibles" is semantically similar to the ground truth "Religion & Spirituality", however there is a key difference in the predicted answer.
**Output:**
```json
{ "example_id": 482, "rating": "Somewhat Similar", "type": "Open" }
```

**Input:**
```json
{ "example_id": 417, "predicted_answer": "yes", "actual_answer": "no" }
```
**Reasoning:**
The predicted answer "yes" is completely incorrect compared to the actual answer "no."
**Output:**
```json
{ "example_id": 417, "rating": "Incorrect", "type": "Closed" }
```
"""

def process_result(result):
    messages = [
        {
            "role": "system",
            "content": EVAL_PROMPT
        },
        {
            "role": "user",
            "content": str(result)
        }
    ]

    response = client.beta.chat.completions.parse(
        model='gpt-4o',
        messages=messages,
        temperature=0,
        response_format=Result
    )

    return json.loads(response.choices[0].message.content)

# fine-tuned model results with scores
results = []
with open("ocr-vqa-ft-results.jsonl", "r") as f:
    for line in f:
        results.append(json.loads(line))

results_w_scores = []
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(process_result, result): result for result in results}
    for future in tqdm(as_completed(futures), total=len(futures)):
        results_w_scores.append(future.result())

# Save the results to a file
with open("ocr-vqa-ft-similarity.jsonl", "w") as f:
    for score in results_w_scores:
        json.dump(score, f)
        f.write("\n")

# non-fine-tuned model results with scores
results = []
with open("ocr-vqa-4o-results.jsonl", "r") as f:
    for line in f:
        results.append(json.loads(line))

results_w_scores_4o = []
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(process_result, result): result for result in results}
    for future in tqdm(as_completed(futures), total=len(futures)):
        results_w_scores_4o.append(future.result())

# Save the results to a file
with open("ocr-vqa-4o-similarity.jsonl", "w") as f:
    for score in results_w_scores_4o:
        json.dump(score, f)
        f.write("\n")
100%|██████████| 485/485 [00:18<00:00, 25.58it/s]
100%|██████████| 485/485 [00:17<00:00, 27.09it/s]

To fully understand the impact of fine-tuning, we also evaluated the same set of test questions with the non-fine-tuned GPT-4o model.

Let's start by comparing the performance of the fine-tuned and non-fine-tuned models on closed-ended (yes/no) questions.

Note that with the fine-tuned model we can check for exact matches between the predicted and actual answers, since the model has learned to produce consistent answers that follow the response format specified in the system prompt. For the non-fine-tuned model, however, we need to account for variations in phrasing and wording in the predicted answers. Below is an example of the non-fine-tuned model's output: as we can see, the final answer is correct, but the response format is inconsistent and the reasoning is included in the response.

# example of non-fine-tuned model output
{"example_id": 14, "predicted_answer": "**Answer:**\n\nNo. \n\n**Reasoning:** The cover shows \"Eyewitness Travel\" and \"Peru,\" indicating it is a travel guide focused on the country, rather than a pharmaceutical book.", "actual_answer": "No"}
# read in results
results_ft = []
with open("ocr-vqa-ft-results.jsonl", "r") as f:
    for line in f:
        results_ft.append(json.loads(line))

results_4o = []
with open("ocr-vqa-4o-results.jsonl", "r") as f:
    for line in f:
        results_4o.append(json.loads(line))

# filter results for yes/no questions
results_ft_closed = [result for result in results_ft if result['actual_answer'] in ['Yes', 'No']]
results_4o_closed = [result for result in results_4o if result['actual_answer'] in ['Yes', 'No']]

# check for correct predictions
correct_ft_closed = [result for result in results_ft_closed if result['predicted_answer'] == result['actual_answer']]
correct_4o_closed = [
    result for result in results_4o_closed 
    if result['predicted_answer'].lower() == result['actual_answer'].lower() 
    or result['actual_answer'].lower() in result['predicted_answer'].lower()
]
print(f"Fine-tuned model accuracy: {round(100*len(correct_ft_closed) / len(results_ft_closed), 2)}%")
print(f"Non-fine-tuned model accuracy: {round(100*len(correct_4o_closed) / len(results_4o_closed), 2)}%")
Fine-tuned model accuracy: 90.53%
Non-fine-tuned model accuracy: 87.89%

Even with a generous allowance for variations in phrasing and wording for the non-fine-tuned model, including ignoring case and permitting partial matches, the fine-tuned model still outperforms it by 2.64% on this set of questions.

Now let's compare the performance of the two models on all open-ended questions. First, we'll check for exact matches between predicted and actual answers, again allowing broad variation in phrasing and wording for the non-fine-tuned model while holding the fine-tuned model to a strict standard.

# filter results for open-ended questions
results_ft_open = [result for result in results_ft if result['actual_answer'] not in ['Yes', 'No']]
results_4o_open = [result for result in results_4o if result['actual_answer'] not in ['Yes', 'No']]

# check for correct predictions
correct_ft_open = [result for result in results_ft_open if result['predicted_answer'] == result['actual_answer']]
correct_4o_open = [
    result for result in results_4o_open 
    if result['predicted_answer'].lower() == result['actual_answer'].lower() 
    or result['actual_answer'].lower() in result['predicted_answer'].lower()
]
print(f"Fine-tuned model accuracy: {round(100*len(correct_ft_open) / len(results_ft_open), 2)}%")
print(f"Non-fine-tuned model accuracy: {round(100*len(correct_4o_open) / len(results_4o_open), 2)}%")
Fine-tuned model accuracy: 64.07%
Non-fine-tuned model accuracy: 46.1%

The accuracy improvement here is far more pronounced: the fine-tuned model outperforms the non-fine-tuned model by a substantial 17.97%, even with a very generous allowance for variations in phrasing and wording in the non-fine-tuned model's answers!

If we grant the fine-tuned model the same leniency, we see an additional 4.1% gain in accuracy, bringing the total improvement to 22.07%.
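
A minimal sketch of that comparison, applying the same lenient matching to the fine-tuned model's open-ended results from the cell above:

# apply the same lenient matching (case-insensitive, substring) to the fine-tuned model
lenient_ft_open = [
    result for result in results_ft_open
    if result['predicted_answer'].lower() == result['actual_answer'].lower()
    or result['actual_answer'].lower() in result['predicted_answer'].lower()
]
print(f"Fine-tuned model accuracy (lenient): {round(100*len(lenient_ft_open) / len(results_ft_open), 2)}%")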

To dig deeper, we can also look at accuracy by question type.

import matplotlib.pyplot as plt

# separate by question type
def get_question_type(question):
    if question in ["What is the title of this book?"]:
        return "Title"
    elif question in ["What is the genre of this book?", "What type of book is this?"]:
        return "Genre"
    elif question in ["Who wrote this book?", "Who is the author of this book?"]:
        return "Author"
    else:
        return "Other"

# get index numbers for each question type
question_type_indexes = {
    "Title": [],
    "Genre": [],
    "Author": [],
    "Other": []
}

for idx, row in ds_test.iterrows():
    question_type = get_question_type(row['question'])
    question_type_indexes[question_type].append(idx)

# plot accuracy by question type
accuracy_by_type_ft = {}
accuracy_by_type_4o = {}

for question_type, indexes in question_type_indexes.items():
    correct_predictions_ft = [
        result for result in results_ft if result['example_id'] in indexes and (
            result['predicted_answer'].lower() == result['actual_answer'].lower() or
            result['actual_answer'].lower() in result['predicted_answer'].lower()
        )
    ]
    correct_predictions_4o = [
        result for result in results_4o if result['example_id'] in indexes and (
            result['predicted_answer'].lower() == result['actual_answer'].lower() or
            result['actual_answer'].lower() in result['predicted_answer'].lower()
        )
    ]
    accuracy_ft = len(correct_predictions_ft) / len(indexes) if indexes else 0
    accuracy_4o = len(correct_predictions_4o) / len(indexes) if indexes else 0
    accuracy_by_type_ft[question_type] = accuracy_ft * 100 
    accuracy_by_type_4o[question_type] = accuracy_4o * 100

# prepare data for plotting
question_types = list(accuracy_by_type_ft.keys())
accuracies_ft = list(accuracy_by_type_ft.values())
accuracies_4o = list(accuracy_by_type_4o.values())

# plot grouped bar chart
bar_width = 0.35
index = range(len(question_types))

plt.figure(figsize=(10, 6))
bar1 = plt.bar(index, accuracies_ft, bar_width, label='Fine-tuned GPT-4o', color='skyblue')
bar2 = plt.bar([i + bar_width for i in index], accuracies_4o, bar_width, label='Non-fine-tuned GPT-4o', color='lightcoral')

plt.xlabel('Question Type')
plt.ylabel('Accuracy (%)')
plt.title('Accuracy by Question Type')
plt.ylim(0, 100)
plt.xticks([i + bar_width / 2 for i in index], question_types, rotation=45)
plt.legend()

plt.show()
image generated by notebook

It looks like the biggest performance gains for the fine-tuned model come from questions in the Genre category, such as "What type of book is this?" or "What is the genre of this book?". This likely reflects a general benefit of fine-tuning: we teach the model to classify genres according to the categories present in the training data. It also highlights the model's strong visual understanding, since genre can be identified from the visual content of the book cover alone.

We also see a significant lift in the Title category, suggesting that fine-tuning enhanced the model's OCR capabilities and its ability to understand the layout and structure of book covers to extract the relevant information.

Finally, let's compare the distribution of similarity ratings between the fine-tuned and non-fine-tuned models, allowing for variations in phrasing and wording.

from collections import Counter

# extract ratings
ratings_ft = [result['rating'] for result in results_w_scores if result['type'] == 'Open']
ratings_4o = [result['rating'] for result in results_w_scores_4o if result['type'] == 'Open']

# count occurrences of each rating
rating_counts_ft = Counter(ratings_ft)
rating_counts_4o = Counter(ratings_4o)

# define the order of ratings
rating_order = ["Very Similar", "Mostly Similar", "Somewhat Similar", "Incorrect"]

# create bar chart
bar_width = 0.35
index = range(len(rating_order))

fig, ax = plt.subplots()
bar1 = ax.bar(index, [rating_counts_ft.get(rating, 0) for rating in rating_order], bar_width, label='FT GPT-4o')
bar2 = ax.bar([i + bar_width for i in index], [rating_counts_4o.get(rating, 0) for rating in rating_order], bar_width, label='GPT-4o')

ax.set_xlabel('Ratings')
ax.set_ylabel('Count')
ax.set_title('Ratings Distribution')
ax.set_xticks([i + bar_width / 2 for i in index])
ax.set_xticklabels(rating_order)
ax.legend()

plt.show()
image generated by notebook

The results paint a clear picture of the gains achieved through fine-tuning alone, with no other modifications. Comparing the distribution of ratings between the fine-tuned GPT-4o model and the non-fine-tuned GPT-4o, we see that the fine-tuned model produces many more exactly correct responses, while the number of incorrect responses is comparable.

Key takeaways

  • Improved precision: Fine-tuning helped the model produce more precise answers that match the ground truth, especially in highly domain-specific tasks like OCR on book covers.
  • Better generalization: While the non-fine-tuned GPT-4o could at least approximate the ground truth for many questions, it was less consistent. The fine-tuned model generalized better across the range of test questions thanks to its exposure to multimodal data during training.

While the results of vision fine-tuning are encouraging, there is still room for improvement. Much like fine-tuning on text, the effectiveness of vision fine-tuning depends heavily on the quality, diversity, and representativeness of the training data. In particular, the model benefits from focusing on the cases where errors occur most frequently, enabling targeted improvements.

On reviewing the incorrect results, many of the fine-tuned model's "incorrect" responses were actually due to inconsistencies in the dataset's labels. For example, some ground-truth answers give only the author's first and last name, while the image also shows a middle initial. Similarly, some ground-truth titles include subtitles and taglines, while others do not.

Another common theme was genre misclassification. Although the model almost always produced a genre semantically similar to the ground truth, the answer was sometimes off. This may be because those genres were missing from the training data. Providing the model with more diverse training examples covering those genres, or clearer instructions for handling edge cases, could help guide the model's understanding.

Next steps

  • Expand the training dataset: Adding more varied examples covering areas where the model is weaker, such as genre identification, could significantly improve performance.

  • Expert-guided prompting: Incorporating domain-specific instructions into the training prompts could further improve the model's ability to interpret and respond accurately in complex cases.

Although there is still progress to be made on this particular task, the initial results are very promising. With minimal setup and effort, we've already observed significant gains in overall accuracy from vision fine-tuning, suggesting great potential for this approach. Vision fine-tuning opens up possibilities for improving a wide range of visual question answering tasks, as well as other tasks that rely on strong visual understanding.