Long Document Content Extraction

Feb 20, 2023

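GPT models can help us extract key figures, dates, or other important pieces of content from documents that are too large to fit in the context window. One way to do this is to split the document into chunks, process each chunk separately, and then combine the results into a single list of answers.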

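In this notebook, we walk through this approach in detail:

  • Load a long PDF and extract its text
  • Create a prompt for extracting key pieces of information
  • Split the document into chunks and process each chunk to extract any answers
  • Combine the answers at the end
  • Extend this simple approach to three more difficult questions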

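Approach

  • Setup: Take a PDF, a Formula 1 Financial Regulations document on Power Units, and extract its text for entity extraction. We'll use it to try to extract answers that are buried in the content.
  • Simple entity extraction: Extract key pieces of information from chunks of the document by:
    • Creating a template prompt that contains our questions and an example of the expected format
    • Creating a function that takes a chunk of text as input, combines it with the prompt, and gets a response
    • Running a script that chunks the text, extracts answers, and outputs them for parsing
  • Complex entity extraction: Ask some more difficult questions that require tougher reasoning to work out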
!pip install textract
!pip install tiktoken
import textract
import os
import openai
import tiktoken

client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

# Extract the raw text from the PDF using textract
text = textract.process('data/fia_f1_power_unit_financial_regulations_issue_1_-_2022-08-16.pdf', method='pdfminer').decode('utf-8')
# Flatten the text: collapse double spaces, then turn newlines (and any semicolons) into spaces
clean_text = text.replace("  ", " ").replace("\n", "; ").replace(';',' ')
# Example prompt: the <document> placeholder is replaced with each chunk in extract_chunk
template_prompt = '''Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output \"Not specified\".
When you extract a key piece of information, include the closest page number.
Use the following format:\n0. Who is the author\n1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR\n2. What is the value of External Manufacturing Costs in USD\n3. What is the Capital Expenditure Limit in USD\n\nDocument: \"\"\"<document>\"\"\"\n\n0. Who is the author: Tom Anderson (Page 1)\n1.'''
print(template_prompt)
Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output "Not specified".
When you extract a key piece of information, include the closest page number.
Use the following format:
0. Who is the author
1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR
2. What is the value of External Manufacturing Costs in USD
3. What is the Capital Expenditure Limit in USD

Document: """<document>"""

0. Who is the author: Tom Anderson (Page 1)
1.
# Split a text into smaller chunks of roughly n tokens, preferably ending at the end of a sentence
def create_chunks(text, n, tokenizer):
    """Yield successive chunks of roughly n tokens from text, as lists of token ids."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.5 * n and 1.5 * n tokens
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            # Decode the tokens and check for full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence found, use n tokens as the chunk size
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j

def extract_chunk(document,template_prompt):
    prompt = template_prompt.replace('<document>',document)

    messages = [
            {"role": "system", "content": "You help extract information from documents."},
            {"role": "user", "content": prompt}
            ]

    response = client.chat.completions.create(
            model='gpt-4', 
            messages=messages,
            temperature=0,
            max_tokens=1500,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
    return "1." + response.choices[0].message.content
# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

results = []
    
chunks = create_chunks(clean_text,1000,tokenizer)
text_chunks = [tokenizer.decode(chunk) for chunk in chunks]

for chunk in text_chunks:
    results.append(extract_chunk(chunk,template_prompt))
    #print(chunk)
    print(results[-1])
groups = [r.split('\n') for r in results]

# zip the groups together
zipped = list(zip(*groups))
zipped = [x for y in zipped for x in y if "Not specified" not in x and "__" not in x]
zipped
['1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR: USD 95,000,000 (Page 2); GBP 76,459,000 (Page 2); EUR 90,210,000 (Page 2)',
 '2. What is the value of External Manufacturing Costs in USD: US Dollars 20,000,000 in respect of each of the Full Year Reporting Periods ending on 31 December 2023, 31 December 2024 and 31 December 2025, adjusted for Indexation (Page 10)',
 '3. What is the Capital Expenditure Limit in USD: US Dollars 30,000,000 (Page 32)']
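One caveat with the merge step above: `zip` stops at the shortest input, so if any single response has fewer lines than the others, the later questions are dropped for every chunk. The following is a minimal sketch of a more forgiving merge that uses `itertools.zip_longest` instead. The helper name `merge_answers` and the sample responses are hypothetical and for illustration only; they are not part of the original notebook.

```python
from itertools import zip_longest

def merge_answers(results):
    """Group answers by question position and drop empty or 'Not specified' lines."""
    groups = [r.split("\n") for r in results]
    merged = []
    # zip_longest pads short responses with "" instead of truncating the longer ones
    for same_question in zip_longest(*groups, fillvalue=""):
        for answer in same_question:
            if answer.strip() and "Not specified" not in answer and "__" not in answer:
                merged.append(answer)
    return merged

# Hypothetical model responses: the second chunk returned only one line
sample_results = [
    "1. Cost cap: USD 95,000,000 (Page 2)\n2. External costs: Not specified\n3. Capex limit: Not specified",
    "1. Cost cap: Not specified",
    "1. Cost cap: Not specified\n2. External costs: Not specified\n3. Capex limit: USD 30,000,000 (Page 32)",
]
print(merge_answers(sample_results))
```

With a plain `zip`, the one-line response would cut the merge off after the first question, and the answer to question 3 would be lost.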
# Example prompt for the more difficult questions
template_prompt = '''Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output \"Not specified\".
When you extract a key piece of information, include the closest page number.
Use the following format:\n0. Who is the author\n1. How is a Minor Overspend Breach calculated\n2. How is a Major Overspend Breach calculated\n3. Which years do these financial regulations apply to\n\nDocument: \"\"\"<document>\"\"\"\n\n0. Who is the author: Tom Anderson (Page 1)\n1.'''
print(template_prompt)
Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output "Not specified".
When you extract a key piece of information, include the closest page number.
Use the following format:
0. Who is the author
1. How is a Minor Overspend Breach calculated
2. How is a Major Overspend Breach calculated
3. Which years do these financial regulations apply to

Document: """<document>"""

0. Who is the author: Tom Anderson (Page 1)
1.
results = []

for chunk in text_chunks:
    results.append(extract_chunk(chunk,template_prompt))
    
groups = [r.split('\n') for r in results]

# zip the groups together
zipped = list(zip(*groups))
zipped = [x for y in zipped for x in y if "Not specified" not in x and "__" not in x]
zipped
['1. How is a Minor Overspend Breach calculated: A Minor Overspend Breach arises when a Power Unit Manufacturer submits its Full Year Reporting Documentation and Relevant Costs reported therein exceed the Power Unit Cost Cap by less than 5% (Page 24)',
 '2. How is a Major Overspend Breach calculated: A Material Overspend Breach arises when a Power Unit Manufacturer submits its Full Year Reporting Documentation and Relevant Costs reported therein exceed the Power Unit Cost Cap by 5% or more (Page 25)',
 '3. Which years do these financial regulations apply to: 2026 onwards (Page 1)',
 '3. Which years do these financial regulations apply to: 2023, 2024, 2025, 2026 and subsequent Full Year Reporting Periods (Page 2)',
 '3. Which years do these financial regulations apply to: 2022-2025 (Page 6)',
 '3. Which years do these financial regulations apply to: 2023, 2024, 2025, 2026 and subsequent Full Year Reporting Periods (Page 10)',
 '3. Which years do these financial regulations apply to: 2022 (Page 14)',
 '3. Which years do these financial regulations apply to: 2022 (Page 16)',
 '3. Which years do these financial regulations apply to: 2022 (Page 19)',
 '3. Which years do these financial regulations apply to: 2022 (Page 21)',
 '3. Which years do these financial regulations apply to: 2026 onwards (Page 26)',
 '3. Which years do these financial regulations apply to: 2026 (Page 2)',
 '3. Which years do these financial regulations apply to: 2022 (Page 30)',
 '3. Which years do these financial regulations apply to: 2022 (Page 32)',
 '3. Which years do these financial regulations apply to: 2023, 2024 and 2025 (Page 1)',
 '3. Which years do these financial regulations apply to: 2022 (Page 37)',
 '3. Which years do these financial regulations apply to: 2026 onwards (Page 40)',
 '3. Which years do these financial regulations apply to: 2022 (Page 1)',
 '3. Which years do these financial regulations apply to: 2026 to 2030 seasons (Page 46)',
 '3. Which years do these financial regulations apply to: 2022 (Page 47)',
 '3. Which years do these financial regulations apply to: 2022 (Page 1)',
 '3. Which years do these financial regulations apply to: 2022 (Page 1)',
 '3. Which years do these financial regulations apply to: 2022 (Page 56)',
 '3. Which years do these financial regulations apply to: 2022 (Page 1)',
 '3. Which years do these financial regulations apply to: 2022 (Page 16)',
 '3. Which years do these financial regulations apply to: 2022 (Page 16)']

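Consolidation

We were able to extract the first two answers reliably. The third was confounded by the date that appears on every page, although the correct answer is among the results as well.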
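To make this kind of noise easier to inspect, one option is to tally the distinct answers per question once the page references are stripped. The sketch below assumes each line follows the `N. question: answer (Page P)` shape seen in the output above. The helper name `tally_answers` is hypothetical, and the sample lines are a small subset of the results shown earlier.

```python
from collections import Counter, defaultdict
import re

def tally_answers(lines):
    """Count how often each distinct answer appears per question, ignoring page references."""
    tallies = defaultdict(Counter)
    for line in lines:
        # Split "3. Question text: answer (Page 12)" into question and answer
        question, _, answer = line.partition(": ")
        # Strip a trailing "(Page N)" so the same answer on different pages is counted together
        answer = re.sub(r"\s*\(Page \d+\)\s*$", "", answer)
        tallies[question][answer] += 1
    return tallies

sample = [
    "3. Which years do these financial regulations apply to: 2026 onwards (Page 1)",
    "3. Which years do these financial regulations apply to: 2022 (Page 14)",
    "3. Which years do these financial regulations apply to: 2022 (Page 16)",
    "3. Which years do these financial regulations apply to: 2026 onwards (Page 26)",
    "3. Which years do these financial regulations apply to: 2022 (Page 19)",
]
for question, counts in tally_answers(sample).items():
    print(question, counts.most_common())
```

Frequency alone does not identify the correct answer, since a value repeated on every page can easily dominate the tally. The tally does, however, reduce a long list of near-duplicates to a handful of candidates to review.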

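To tune this further, you could consider experimenting with:

  • A more descriptive or more specific prompt
  • If you have enough training data, fine-tuning a model to find a given set of outputs very well
  • The way you chunk your data: we used 1000 tokens with no overlap, but more intelligent chunking (breaking the information into sections, cutting by tokens, or similar) may give better results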
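As one illustration of the chunking point, here is a minimal sketch of fixed-size windows that overlap, so a sentence cut at one chunk boundary still appears whole in the next chunk. The function name `create_overlapping_chunks` and the `overlap` parameter are assumptions for illustration and not part of the original notebook. The function works on any list of token ids, such as the output of `tokenizer.encode(clean_text)`.

```python
def create_overlapping_chunks(tokens, n, overlap):
    """Yield windows of up to n tokens; each window repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < n:
        raise ValueError("overlap must be non-negative and smaller than n")
    step = n - overlap
    for start in range(0, len(tokens), step):
        yield tokens[start:start + n]
        # Stop once a window has reached the end of the token list
        if start + n >= len(tokens):
            break

# Toy example with integers standing in for token ids
tokens = list(range(10))
print(list(create_overlapping_chunks(tokens, n=4, overlap=1)))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each yielded window can be passed to `tokenizer.decode` in the same way as the chunks from `create_chunks`. Overlap increases the number of model calls and makes duplicate answers more likely, so the merged results may need deduplication.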

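However, with minimal tuning we have now answered 6 questions of varying difficulty using the contents of a long document, and we have a reusable approach that can be applied to any long document that requires entity extraction. We look forward to seeing what you can do with it!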