Doing RAG on PDFs using File Search in the Responses API

March 11, 2025

While RAG can feel complex, searching across PDF files shouldn't be. One of the most common approaches used today is to parse your PDFs, define a chunking strategy, upload those chunks to a storage provider, run embeddings on the chunks of text, and store those embeddings in a vector database. And that's just the setup; retrieving content in our LLM workflows takes multiple steps as well.

That's where file search, a hosted tool you can use in the Responses API, comes in. It allows you to search your knowledge base and generate an answer based on the retrieved content. In this cookbook, we'll upload PDFs taken from OpenAI's blog (openai.com/news) to a vector store on OpenAI, generate a small set of questions from those PDFs, and then use file search to fetch additional context from the vector store when answering those questions.

File search was previously available on the Assistants API. It's now available on the new Responses API, an API that can be stateful or stateless, and it comes with new features such as metadata filtering.
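
For instance, metadata filtering lets you restrict a file search to files whose attributes match a filter. Below is a minimal sketch, assuming the files were attached to the vector store with an attributes dict; the "category" key, the "blog" value, and the vector store id are hypothetical placeholders, not something built in this cookbook.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    input="What's new at OpenAI?",
    model="gpt-4o-mini",
    tools=[{
        "type": "file_search",
        "vector_store_ids": ["<vector_store_id>"],  # replace with a real vector store id
        # only search files that were uploaded with attributes={"category": "blog"}
        "filters": {"type": "eq", "key": "category", "value": "blog"},
    }],
)
print(response.output_text)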

Setup

!pip install PyPDF2 pandas tqdm openai -q
from openai import OpenAI
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
import concurrent
import PyPDF2
import os
import pandas as pd
import base64

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
dir_pdfs = 'openai_blog_pdfs' # have those PDFs stored locally here
pdf_files = [os.path.join(dir_pdfs, f) for f in os.listdir(dir_pdfs)]

Creating a vector store with our PDFs

We'll create a vector store on the OpenAI API and upload our PDFs to it. OpenAI will read those PDFs, split the content into multiple chunks of text, run embeddings on those chunks, and store both the embeddings and the text in the vector store. This will enable us to query the vector store and get back content relevant to a given query.

def upload_single_pdf(file_path: str, vector_store_id: str):
    file_name = os.path.basename(file_path)
    try:
        # use a context manager so the file handle is closed after upload
        with open(file_path, 'rb') as f:
            file_response = client.files.create(file=f, purpose="assistants")
        attach_response = client.vector_stores.files.create(
            vector_store_id=vector_store_id,
            file_id=file_response.id
        )
        return {"file": file_name, "status": "success"}
    except Exception as e:
        print(f"Error with {file_name}: {str(e)}")
        return {"file": file_name, "status": "failed", "error": str(e)}

def upload_pdf_files_to_vector_store(vector_store_id: str):
    pdf_files = [os.path.join(dir_pdfs, f) for f in os.listdir(dir_pdfs)]
    stats = {"total_files": len(pdf_files), "successful_uploads": 0, "failed_uploads": 0, "errors": []}
    
    print(f"{len(pdf_files)} PDF files to process. Uploading in parallel...")

    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        futures = {executor.submit(upload_single_pdf, file_path, vector_store_id): file_path for file_path in pdf_files}
        for future in tqdm(concurrent.futures.as_completed(futures), total=len(pdf_files)):
            result = future.result()
            if result["status"] == "success":
                stats["successful_uploads"] += 1
            else:
                stats["failed_uploads"] += 1
                stats["errors"].append(result)

    return stats

def create_vector_store(store_name: str) -> dict:
    try:
        vector_store = client.vector_stores.create(name=store_name)
        details = {
            "id": vector_store.id,
            "name": vector_store.name,
            "created_at": vector_store.created_at,
            "file_count": vector_store.file_counts.completed
        }
        print("Vector store created:", details)
        return details
    except Exception as e:
        print(f"Error creating vector store: {e}")
        return {}

store_name = "openai_blog_store"
vector_store_details = create_vector_store(store_name)
upload_pdf_files_to_vector_store(vector_store_details["id"])
Vector store created: {'id': 'vs_67d06b9b9a9c8191bafd456cf2364ce3', 'name': 'openai_blog_store', 'created_at': 1741712283, 'file_count': 0}
21 PDF files to process. Uploading in parallel...
100%|███████████████████████████████| 21/21 [00:09<00:00,  2.32it/s]
{'total_files': 21,
 'successful_uploads': 21,
 'failed_uploads': 0,
 'errors': []}

Now that our vector store is ready, we can query it directly and retrieve relevant content for a specific query. Using the new vector search API, we can find relevant items from our knowledge base without having to integrate it into an LLM query.

query = "What's Deep Research?"
search_results = client.vector_stores.search(
    vector_store_id=vector_store_details['id'],
    query=query
)
for result in search_results.data:
    print(f"{len(result.content[0].text)} characters of content from {result.filename} with a relevance score of {result.score}")
3502 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.9813588865322393
3493 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.9522476825143714
3634 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.9397930296526796
2774 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.9101975747303771
3474 characters of content from Deep research System Card _ OpenAI.pdf with a relevance score of 0.9036647613464299
3123 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.887120981288272
3343 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.8448454849432881
3262 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.791345286655509
3271 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.7485530025091963
2721 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.734033360849088

We can see that results of different sizes (and different underlying text) were returned for the search query. Each has a different relevance score, computed by our ranker, which uses hybrid search.
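
If we only want to keep the most confident matches, one simple option is to post-filter the results by score before using them downstream. A minimal sketch; the 0.9 cutoff is an arbitrary choice for illustration:

# keep only the chunks the ranker scored above an arbitrary 0.9 cutoff
high_confidence = [r for r in search_results.data if r.score > 0.9]
for r in high_confidence:
    print(f"{r.filename}: {r.score:.3f}")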

Integrating search results with an LLM in a single API call

However, rather than querying the vector store and then passing the retrieved data into a Responses or Chat Completions API call, an even more convenient way to use these search results in an LLM query is to plug the file_search tool into an OpenAI Responses API call.

query = "What's Deep Research?"
response = client.responses.create(
    input=query,
    model="gpt-4o-mini",
    tools=[{
        "type": "file_search",
        "vector_store_ids": [vector_store_details['id']],
    }]
)

# Extract annotations from the response
annotations = response.output[1].content[0].annotations
    
# Get top-k retrieved filenames
retrieved_files = set([result.filename for result in annotations])

print(f'Files used: {retrieved_files}')
print('Response:')
print(response.output[1].content[0].text) # output[0] is the file_search call, output[1] the model's message
Files used: {'Introducing deep research _ OpenAI.pdf'}
Response:
Deep Research is a new capability introduced by OpenAI that allows users to conduct complex, multi-step research tasks on the internet efficiently. Key features include:

1. **Autonomous Research**: Deep Research acts as an independent agent that synthesizes vast amounts of information across the web, enabling users to receive comprehensive reports similar to those produced by a research analyst.

2. **Multi-Step Reasoning**: It performs deep analysis by finding, interpreting, and synthesizing data from various sources, including text, images, and PDFs.

3. **Application Areas**: Especially useful for professionals in fields such as finance, science, policy, and engineering, as well as for consumers seeking detailed information for purchases.

4. **Efficiency**: The output is fully documented with citations, making it easy to verify information, and it significantly speeds up research processes that would otherwise take hours for a human to complete.

5. **Limitations**: While Deep Research enhances research capabilities, it is still subject to limitations, such as potential inaccuracies in information retrieval and challenges in distinguishing authoritative data from unreliable sources.

Overall, Deep Research marks a significant advancement toward automated general intelligence (AGI) by improving access to thorough and precise research outputs.

We can see that gpt-4o-mini was able to answer a query that required more recent, specialized knowledge about OpenAI's Deep Research. It used content from the file Introducing deep research _ OpenAI.pdf, which held the most relevant chunks of text. If we want to dig even deeper into the retrieved chunks, we can analyze the different texts returned by the search engine by adding include=["output[*].file_search_call.search_results"] to our query.
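
A minimal sketch of that deeper inspection, assuming the model calls the tool first so that the file_search call sits at output[0], as in the response above:

response = client.responses.create(
    input=query,
    model="gpt-4o-mini",
    tools=[{
        "type": "file_search",
        "vector_store_ids": [vector_store_details['id']],
    }],
    include=["output[*].file_search_call.search_results"],
)

# with the include flag, the retrieved chunks are exposed on the file_search call itself
for result in response.output[0].results:
    print(f"{result.filename} (score {result.score:.3f}): {result.text[:100]}...")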

Evaluating performance

For information retrieval systems like this, it's also key to measure the relevance and quality of the files retrieved for those answers. The next steps of this cookbook will consist of generating an evaluation dataset and computing different metrics over it. This is an imperfect approach, and we'd always recommend a human-verified evaluation dataset for your own use cases, but it will show you the methodology for evaluating these systems. It is imperfect because some of the generated questions may be generic (e.g.: what did the main stakeholder say in this document), and our retrieval test will struggle to figure out which document such a question was generated for.

Generating questions

We'll create functions that read our locally stored PDFs and generate a question that can only be answered from that document. This gives us an evaluation dataset we can use afterwards.

def extract_text_from_pdf(pdf_path):
    text = ""
    try:
        with open(pdf_path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text
    except Exception as e:
        print(f"Error reading {pdf_path}: {e}")
    return text

def generate_questions(pdf_path):
    text = extract_text_from_pdf(pdf_path)

    prompt = (
        "Can you generate a question that can only be answered from this document?:\n"
        f"{text}\n\n"
    )

    response = client.responses.create(
        input=prompt,
        model="gpt-4o",
    )

    question = response.output[0].content[0].text

    return question

If we run the generate_questions function on the first PDF file, we can see the kind of question it generates.

generate_questions(pdf_files[0])
'What new capabilities will ChatGPT have as a result of the partnership between OpenAI and Schibsted Media Group?'

We can now generate questions for all the PDFs we have stored locally.

# Generate questions for each PDF and store in a dictionary
questions_dict = {}
for pdf_path in pdf_files:
    questions = generate_questions(pdf_path)
    questions_dict[os.path.basename(pdf_path)] = questions
questions_dict
{'OpenAI partners with Schibsted Media Group _ OpenAI.pdf': 'What is the purpose of the partnership between Schibsted Media Group and OpenAI announced on February 10, 2025?',
 'OpenAI and the CSU system bring AI to 500,000 students & faculty _ OpenAI.pdf': 'What significant milestone did the California State University system achieve by partnering with OpenAI, making it the first of its kind in the United States?',
 '1,000 Scientist AI Jam Session _ OpenAI.pdf': 'What was the specific AI model used during the "1,000 Scientist AI Jam Session" event across the nine national labs?',
 'Announcing The Stargate Project _ OpenAI.pdf': 'What are the initial equity funders and lead partners in The Stargate Project announced by OpenAI, and who holds the financial and operational responsibilities?',
 'Introducing Operator _ OpenAI.pdf': 'What is the name of the new model that powers the Operator agent introduced by OpenAI?',
 'Introducing NextGenAI _ OpenAI.pdf': 'What major initiative did OpenAI launch on March 4, 2025, and which research institution from Europe is involved as a founding partner?',
 'Introducing the Intelligence Age _ OpenAI.pdf': "What is the name of the video generation tool used by OpenAI's creative team to help produce their Super Bowl ad?",
 'Operator System Card _ OpenAI.pdf': 'What is the preparedness score for the "Cybersecurity" category according to the Operator System Card?',
 'Strengthening America’s AI leadership with the U.S. National Laboratories _ OpenAI.pdf': "What is the purpose of OpenAI's agreement with the U.S. National Laboratories as described in the document?",
 'OpenAI GPT-4.5 System Card _ OpenAI.pdf': 'What is the Preparedness Framework rating for "Cybersecurity" for GPT-4.5 according to the system card?',
 'Partnering with Axios expands OpenAI’s work with the news industry _ OpenAI.pdf': "What is the goal of OpenAI's new content partnership with Axios as announced in the document?",
 'OpenAI and Guardian Media Group launch content partnership _ OpenAI.pdf': 'What is the main purpose of the partnership between OpenAI and Guardian Media Group announced on February 14, 2025?',
 'Introducing GPT-4.5 _ OpenAI.pdf': 'What is the release date of the GPT-4.5 research preview?',
 'Introducing data residency in Europe _ OpenAI.pdf': 'What are the benefits of data residency in Europe for new ChatGPT Enterprise and Edu customers according to the document?',
 'The power of personalized AI _ OpenAI.pdf': 'What is the purpose of the "Model Spec" document published by OpenAI for ChatGPT?',
 'Disrupting malicious uses of AI _ OpenAI.pdf': "What is OpenAI's mission as stated in the document?",
 'Sharing the latest Model Spec _ OpenAI.pdf': 'What is the release date of the latest Model Spec mentioned in the document?',
 'Deep research System Card _ OpenAI.pdf': "What specific publication date is mentioned in the Deep Research System Card for when the report on deep research's preparedness was released?",
 'Bertelsmann powers creativity and productivity with OpenAI _ OpenAI.pdf': 'What specific AI-powered solutions is Bertelsmann planning to implement for its divisions RTL Deutschland and Penguin Random House according to the document?',
 'OpenAI’s Economic Blueprint _ OpenAI.pdf': 'What date and location is scheduled for the kickoff event of OpenAI\'s "Innovating for America" initiative as mentioned in the Economic Blueprint document?',
 'Introducing deep research _ OpenAI.pdf': 'What specific model powers the "deep research" capability in ChatGPT that is discussed in this document, and what are its main features designed for?'}

Now that we have a dictionary of filename:question pairs, we can loop through it and ask gpt-4o-mini those questions without providing the documents; the model should be able to find the relevant document in the vector store.

We'll convert our dictionary into a list of rows and process them with gpt-4o-mini, keeping track of the expected file for each question so we can check whether file search retrieves it.

rows = []
for filename, query in questions_dict.items():
    rows.append({"query": query, "_id": filename.replace(".pdf", "")})

# Metrics evaluation parameters
k = 5
total_queries = len(rows)
correct_retrievals_at_k = 0
reciprocal_ranks = []
average_precisions = []

def process_query(row):
    query = row['query']
    expected_filename = row['_id'] + '.pdf'
    # Call file_search via Responses API
    response = client.responses.create(
        input=query,
        model="gpt-4o-mini",
        tools=[{
            "type": "file_search",
            "vector_store_ids": [vector_store_details['id']],
            "max_num_results": k,
        }],
        tool_choice="required" # it will force the file_search, while not necessary, it's better to enforce it as this is what we're testing
    )
    # Extract annotations from the response
    annotations = None
    if hasattr(response.output[1], 'content') and response.output[1].content:
        annotations = response.output[1].content[0].annotations
    elif hasattr(response.output[1], 'annotations'):
        annotations = response.output[1].annotations

    if annotations is None:
        print(f"No annotations for query: {query}")
        return False, 0, 0

    # Get top-k retrieved filenames
    retrieved_files = [result.filename for result in annotations[:k]]
    if expected_filename in retrieved_files:
        rank = retrieved_files.index(expected_filename) + 1
        rr = 1 / rank
        correct = True
    else:
        rr = 0
        correct = False

    # Calculate Average Precision
    precisions = []
    num_relevant = 0
    for i, fname in enumerate(retrieved_files):
        if fname == expected_filename:
            num_relevant += 1
            precisions.append(num_relevant / (i + 1))
    avg_precision = sum(precisions) / len(precisions) if precisions else 0
    
    if expected_filename not in retrieved_files:
        print("Expected file NOT found in the retrieved files!")
        
    if retrieved_files and retrieved_files[0] != expected_filename:
        print(f"Query: {query}")
        print(f"Expected file: {expected_filename}")
        print(f"First retrieved file: {retrieved_files[0]}")
        print(f"Retrieved files: {retrieved_files}")
        print("-" * 50)
    
    
    return correct, rr, avg_precision
process_query(rows[0])
(True, 1.0, 1.0)

For this example, recall and precision are both 1, and because the file ranked first, the reciprocal rank and average precision (and hence MRR and MAP over this single query) are 1 as well.
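
To make the per-query metrics concrete, here is a tiny hypothetical ranking where the expected file lands at rank 2 instead:

# hypothetical ranking for illustration only
retrieved = ["Other doc.pdf", "Expected doc.pdf", "Another doc.pdf"]
rank = retrieved.index("Expected doc.pdf") + 1  # rank 2
print(1 / rank)  # reciprocal rank = 0.5; with a single relevant doc, average precision is 0.5 too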

We can now run this processing on our full set of questions.

with ThreadPoolExecutor() as executor:
    results = list(tqdm(executor.map(process_query, rows), total=total_queries))

correct_retrievals_at_k = 0
reciprocal_ranks = []
average_precisions = []

for correct, rr, avg_precision in results:
    if correct:
        correct_retrievals_at_k += 1
    reciprocal_ranks.append(rr)
    average_precisions.append(avg_precision)

recall_at_k = correct_retrievals_at_k / total_queries
precision_at_k = recall_at_k  # In this context, same as recall
mrr = sum(reciprocal_ranks) / total_queries
map_score = sum(average_precisions) / total_queries
 62%|███████████████████▏           | 13/21 [00:07<00:03,  2.57it/s]
Expected file NOT found in the retrieved files!
Query: What is OpenAI's mission as stated in the document?
Expected file: Disrupting malicious uses of AI _ OpenAI.pdf
First retrieved file: Introducing the Intelligence Age _ OpenAI.pdf
Retrieved files: ['Introducing the Intelligence Age _ OpenAI.pdf']
--------------------------------------------------
 71%|██████████████████████▏        | 15/21 [00:14<00:06,  1.04s/it]
Expected file NOT found in the retrieved files!
Query: What is the purpose of the "Model Spec" document published by OpenAI for ChatGPT?
Expected file: The power of personalized AI _ OpenAI.pdf
First retrieved file: Sharing the latest Model Spec _ OpenAI.pdf
Retrieved files: ['Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf']
--------------------------------------------------
100%|███████████████████████████████| 21/21 [00:15<00:00,  1.38it/s]

The outputs logged above show the cases where a file wasn't ranked first as our evaluation dataset expected, or wasn't found at all. As anticipated with our imperfect evaluation dataset, some questions were generic and expected another document, which our retrieval system did not retrieve for that specific question.

# Print the metrics with k
print(f"Metrics at k={k}:")
print(f"Recall@{k}: {recall_at_k:.4f}")
print(f"Precision@{k}: {precision_at_k:.4f}")
print(f"Mean Reciprocal Rank (MRR): {mrr:.4f}")
print(f"Mean Average Precision (MAP): {map_score:.4f}")
Metrics at k=5:
Recall@5: 0.9048
Precision@5: 0.9048
Mean Reciprocal Rank (MRR): 0.9048
Mean Average Precision (MAP): 0.8954

In this cookbook, we saw how to:

  • Generate an evaluation dataset with PDF context stuffing (leveraging 4o's vision modality) and with traditional PDF readers
  • Create a vector store and populate it with PDFs
  • Get LLM answers to a query, leveraging an out-of-the-box RAG system powered by the file_search tool call in OpenAI's Responses API
  • Understand how chunks of text are retrieved, ranked, and used as part of the Responses API
  • Measure accuracy, precision, recall, MRR, and MAP on the previously generated evaluation dataset

By using file search with the Responses API, you can simplify your RAG architecture and leverage it in a single API call. File storage, embeddings, and retrieval are all integrated into one tool!