How to evaluate a summarization task

Aug 16, 2023

In this notebook we delve into evaluation techniques for abstractive summarization using a simple example. In addition to showcasing a more novel approach that uses LLMs as evaluators, we explore traditional evaluation methods such as ROUGE and BERTScore.

Evaluating the quality of summaries is a time-consuming process, as it involves different quality dimensions such as coherence, conciseness, readability and content. Traditional automatic evaluation metrics such as ROUGE and BERTScore are concrete and reliable, but they may correlate poorly with the actual quality of summaries. Their correlation with human judgment is relatively low, particularly for open-ended generation tasks (Liu et al., 2023). There is a growing need to lean on human evaluations, user feedback, or model-based metrics while staying vigilant about potential biases. While human judgment provides invaluable insights, it is often not scalable and can be cost-prohibitive.

Beyond these traditional metrics, we also showcase a method (G-Eval) that leverages Large Language Models (LLMs) as a novel, reference-free metric for assessing abstractive summaries. In this case, we use gpt-4 to score candidate outputs: gpt-4 has effectively learned an internal model of language quality that allows it to distinguish between fluent, coherent text and low-quality text. Harnessing this internal scoring mechanism allows automated assessment of new candidate outputs generated by an LLM.

# Installing necessary packages for the evaluation
# rouge: For evaluating with ROUGE metric
# bert_score: For evaluating with BERTScore
# openai: To interact with OpenAI's API
!pip install rouge --quiet
!pip install bert_score --quiet
!pip install openai --quiet
from openai import OpenAI
import os
import re
import pandas as pd

# Python Implementation of the ROUGE Metric
from rouge import Rouge

# BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.
from bert_score import BERTScorer

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

Example task

For the purposes of this notebook, we will use the example summarization below. Notice that we provide two generated summaries to compare, along with a reference human-written summary, which evaluation metrics like ROUGE and BERTScore require.

Excerpt (excerpt)

OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.

Summaries

Reference Summary / ref_summary (human generated):
OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges.

Eval Summary 1 / eval_summary_1 (system generated):
OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good.

Eval Summary 2 / eval_summary_2 (system generated):
OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff.

Take a moment to consider which summary you personally prefer, and which one best captures OpenAI's mission.

excerpt = "OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges."
ref_summary = "OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges."
eval_summary_1 = "OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good."
eval_summary_2 = "OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff."

Evaluating using ROUGE

ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation, primarily gauges the overlap of words between a generated output and a reference text. It is a prevalent metric for evaluating automatic summarization tasks. Among its variants, ROUGE-L measures the longest common subsequence (not necessarily contiguous) shared by the system-generated and reference summaries, gauging how well the system retains the essence of the reference.

# function to calculate the Rouge score
def get_rouge_scores(text1, text2):
    rouge = Rouge()
    return rouge.get_scores(text1, text2)


rouge_scores_out = []

# Calculate the ROUGE scores for both summaries using reference
eval_1_rouge = get_rouge_scores(eval_summary_1, ref_summary)
eval_2_rouge = get_rouge_scores(eval_summary_2, ref_summary)

# Collect the F-score for each ROUGE variant
for metric in ["rouge-1", "rouge-2", "rouge-l"]:
    eval_1_score = eval_1_rouge[0][metric]["f"]
    eval_2_score = eval_2_rouge[0][metric]["f"]

    row = {
        "Metric": f"{metric} (F-Score)",
        "Summary 1": eval_1_score,
        "Summary 2": eval_2_score,
    }
    rouge_scores_out.append(row)


def highlight_max(s):
    is_max = s == s.max()
    return [
        "background-color: lightgreen" if v else "background-color: white"
        for v in is_max
    ]


rouge_scores_out = (
    pd.DataFrame(rouge_scores_out)
    .set_index("Metric")
    .style.apply(highlight_max, axis=1)
)

rouge_scores_out
Metric              Summary 1   Summary 2
rouge-1 (F-Score)   0.488889    0.511628
rouge-2 (F-Score)   0.230769    0.163265
rouge-l (F-Score)   0.488889    0.511628

The table shows the ROUGE scores for the two summaries evaluated against the reference text. For rouge-1, Summary 2 outperforms Summary 1, indicating better overlap of individual words; for rouge-l, Summary 2 also scores higher, implying a closer match in the longest common subsequence and therefore a potentially better summary in terms of capturing the main content and ordering of the original text. Since Summary 2 lifts many words and short phrases directly from the excerpt, its overlap with the reference summary is likely to be higher, leading to higher ROUGE scores.

While ROUGE and similar metrics, such as BLEU and METEOR, offer quantitative measures, they often fail to capture the true essence of a well-generated summary, and they correlate poorly with human judgment. Given the advancements in LLMs, which are adept at producing fluent and coherent summaries, traditional metrics like ROUGE may inadvertently penalize these models. This is especially true if the summaries are articulated differently but still encapsulate the core information accurately.
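
To make this concrete, here is a minimal sketch that reuses get_rouge_scores on invented toy sentences, comparing a near-verbatim candidate and an accurate paraphrase against the same reference. The sentences are purely illustrative, but the verbatim candidate will typically score far higher even though both convey the same information.

# A quick sketch with invented toy sentences: ROUGE rewards verbatim overlap and
# penalizes an accurate paraphrase of the same content.
toy_reference = "OpenAI promotes research to make AGI safe across the AI community."
toy_verbatim = "OpenAI promotes research to make AGI safe across the community."
toy_paraphrase = "OpenAI encourages safety work on general-purpose AI throughout the field."

for name, candidate in [("Verbatim", toy_verbatim), ("Paraphrase", toy_paraphrase)]:
    scores = get_rouge_scores(candidate, toy_reference)[0]
    print(name, {m: round(scores[m]["f"], 3) for m in ["rouge-1", "rouge-2", "rouge-l"]})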

Evaluating using BERTScore

ROUGE relies on the exact presence of words in both the predicted and reference texts, failing to account for their underlying semantics. This is where BERTScore comes in: it leverages contextual embeddings from the BERT model and aims to evaluate the similarity between a predicted and a reference sentence in the context of machine-generated text. By comparing the embeddings of the two sentences, BERTScore captures semantic similarities that traditional n-gram based metrics might miss.

# Instantiate the BERTScorer object for English language
scorer = BERTScorer(lang="en")

# Calculate BERTScore for summary 1 against the reference summary
# P1, R1, F1_1 represent Precision, Recall, and F1 Score respectively
P1, R1, F1_1 = scorer.score([eval_summary_1], [ref_summary])

# Calculate BERTScore for summary 2 against the reference summary
# P2, R2, F2_2 represent Precision, Recall, and F1 Score respectively
P2, R2, F2_2 = scorer.score([eval_summary_2], [ref_summary])

print("Summary 1 F1 Score:", F1_1.tolist()[0])
print("Summary 2 F1 Score:", F2_2.tolist()[0])
Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Summary 1 F1 Score: 0.9227314591407776
Summary 2 F1 Score: 0.9189572930335999

The close F1 scores between the two summaries indicate that they may perform similarly in capturing the key information. However, this small difference should be interpreted with caution. Since BERTScore may not fully grasp the subtleties and high-level concepts a human evaluator would understand, relying solely on this metric could lead to misinterpreting the actual quality and nuances of the summaries. An integrated approach that combines BERTScore with human judgment and other metrics can offer a more reliable evaluation.
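
To illustrate this semantic robustness, the short sketch below re-scores the hypothetical verbatim/paraphrase pair from the ROUGE section with the already-instantiated scorer. The comparison is indicative only, but the gap between the two candidates is typically much narrower under BERTScore than under exact n-gram overlap.

# Sketch (hypothetical sentences from the ROUGE example above): BERTScore matches
# tokens by embedding similarity, so the gap between a verbatim candidate and an
# accurate paraphrase is typically much smaller than under n-gram overlap.
toy_reference = "OpenAI promotes research to make AGI safe across the AI community."
toy_verbatim = "OpenAI promotes research to make AGI safe across the community."
toy_paraphrase = "OpenAI encourages safety work on general-purpose AI throughout the field."

_, _, toy_f1 = scorer.score([toy_verbatim, toy_paraphrase], [toy_reference, toy_reference])
print("Verbatim BERTScore F1:", round(toy_f1.tolist()[0], 3))
print("Paraphrase BERTScore F1:", round(toy_f1.tolist()[1], 3))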

Evaluating using GPT-4

Here we implement an example reference-free text evaluator using gpt-4, inspired by the G-Eval framework, which evaluates the quality of generated text with large language models. Unlike metrics such as ROUGE or BERTScore that rely on comparison against a reference summary, the gpt-4 based evaluator assesses the quality of generated content based solely on the input prompt and the text, without any ground truth reference. This makes it applicable to new datasets and tasks where human references are sparse or unavailable.

Here is an overview of this approach:

  1. We define four distinct criteria:
    1. Relevance: Evaluates whether the summary includes only important information and excludes redundancies.
    2. Coherence: Assesses the logical flow and organization of the summary.
    3. Consistency: Checks whether the summary aligns with the facts in the source document.
    4. Fluency: Rates the grammar and readability of the summary.
  2. We craft prompts for each criterion, taking the original document and the summary as inputs, and leveraging chain-of-thought generation to guide the model to output a numeric score from 1 to 5 for each criterion.
  3. We generate scores from gpt-4 with the defined prompts and compare them across the summaries.

In this demonstration we use a direct scoring function, where gpt-4 generates a discrete score (1-5) for each metric. Normalizing the scores and taking a weighted sum could yield a more robust, continuous score that better reflects the quality and diversity of the summaries.
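
As a rough sketch of what such an aggregation could look like, the snippet below rescales each criterion to the 0-1 range using its own scale (1-5 for relevance, coherence, and consistency; 1-3 for fluency) and combines the results with a weighted sum. The weights are arbitrary placeholders, not a recommendation.

# Sketch of score aggregation: normalize each criterion to [0, 1] by its own scale,
# then combine with illustrative (arbitrary) weights that sum to 1.
SCORE_RANGES = {"Relevance": (1, 5), "Coherence": (1, 5), "Consistency": (1, 5), "Fluency": (1, 3)}
WEIGHTS = {"Relevance": 0.3, "Coherence": 0.3, "Consistency": 0.3, "Fluency": 0.1}


def aggregate_scores(raw_scores: dict) -> float:
    total = 0.0
    for metric, value in raw_scores.items():
        low, high = SCORE_RANGES[metric]
        total += WEIGHTS[metric] * (value - low) / (high - low)
    return total


# Example with made-up per-criterion scores
print(aggregate_scores({"Relevance": 5, "Coherence": 4, "Consistency": 5, "Fluency": 3}))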

# Evaluation prompt template based on G-Eval
EVALUATION_PROMPT_TEMPLATE = """
You will be given one summary written for an article. Your task is to rate the summary on one metric.
Please make sure you read and understand these instructions very carefully. 
Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

{criteria}

Evaluation Steps:

{steps}

Example:

Source Text:

{document}

Summary:

{summary}

Evaluation Form (scores ONLY):

- {metric_name}
"""

# Metric 1: Relevance

RELEVANCY_SCORE_CRITERIA = """
Relevance(1-5) - selection of important content from the source. \
The summary should include only important information from the source document. \
Annotators were instructed to penalize summaries which contained redundancies and excess information.
"""

RELEVANCY_SCORE_STEPS = """
1. Read the summary and the source document carefully.
2. Compare the summary to the source document and identify the main points of the article.
3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.
4. Assign a relevance score from 1 to 5.
"""

# Metric 2: Coherence

COHERENCE_SCORE_CRITERIA = """
Coherence(1-5) - the collective quality of all sentences. \
We align this dimension with the DUC quality question of structure and coherence \
whereby "the summary should be well-structured and well-organized. \
The summary should not just be a heap of related information, but should build from sentence to a\
coherent body of information about a topic."
"""

COHERENCE_SCORE_STEPS = """
1. Read the article carefully and identify the main topic and key points.
2. Read the summary and compare it to the article. Check if the summary covers the main topic and key points of the article,
and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.
"""

# Metric 3: Consistency

CONSISTENCY_SCORE_CRITERIA = """
Consistency(1-5) - the factual alignment between the summary and the summarized source. \
A factually consistent summary contains only statements that are entailed by the source document. \
Annotators were also asked to penalize summaries that contained hallucinated facts.
"""

CONSISTENCY_SCORE_STEPS = """
1. Read the article carefully and identify the main facts and details it presents.
2. Read the summary and compare it to the article. Check if the summary contains any factual errors that are not supported by the article.
3. Assign a score for consistency based on the Evaluation Criteria.
"""

# Metric 4: Fluency

FLUENCY_SCORE_CRITERIA = """
Fluency(1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.
1: Poor. The summary has many errors that make it hard to understand or sound unnatural.
2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.
3: Good. The summary has few or no errors and is easy to read and follow.
"""

FLUENCY_SCORE_STEPS = """
Read the summary and evaluate its fluency based on the given criteria. Assign a fluency score from 1 to 3.
"""


def get_geval_score(
    criteria: str, steps: str, document: str, summary: str, metric_name: str
):
    prompt = EVALUATION_PROMPT_TEMPLATE.format(
        criteria=criteria,
        steps=steps,
        metric_name=metric_name,
        document=document,
        summary=summary,
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=5,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response.choices[0].message.content


evaluation_metrics = {
    "Relevance": (RELEVANCY_SCORE_CRITERIA, RELEVANCY_SCORE_STEPS),
    "Coherence": (COHERENCE_SCORE_CRITERIA, COHERENCE_SCORE_STEPS),
    "Consistency": (CONSISTENCY_SCORE_CRITERIA, CONSISTENCY_SCORE_STEPS),
    "Fluency": (FLUENCY_SCORE_CRITERIA, FLUENCY_SCORE_STEPS),
}

summaries = {"Summary 1": eval_summary_1, "Summary 2": eval_summary_2}

data = {"Evaluation Type": [], "Summary Type": [], "Score": []}

for eval_type, (criteria, steps) in evaluation_metrics.items():
    for summ_type, summary in summaries.items():
        data["Evaluation Type"].append(eval_type)
        data["Summary Type"].append(summ_type)
        result = get_geval_score(criteria, steps, excerpt, summary, eval_type)
        score_num = int(result.strip())
        data["Score"].append(score_num)

pivot_df = pd.DataFrame(data, index=None).pivot(
    index="Evaluation Type", columns="Summary Type", values="Score"
)
styled_pivot_df = pivot_df.style.apply(highlight_max, axis=1)
display(styled_pivot_df)
Evaluation Type   Summary 1   Summary 2
Coherence         5           3
Consistency       5           5
Fluency           3           2
Relevance         5           4

Overall, Summary 1 appears to outperform Summary 2 in three of the four categories (coherence, relevance and fluency), while both summaries receive the same consistency score. The results suggest that Summary 1 is generally preferable given these evaluation criteria.

Limitations

Please note that LLM-based metrics could carry a bias towards preferring LLM-generated texts over human-written texts. Additionally, LLM-based metrics are sensitive to system messages/prompts. We recommend experimenting with other techniques that can help improve performance and/or yield more consistent scores, striking the right balance between high-quality, expensive evaluation and automated evaluation. It is also worth noting that this scoring methodology is currently limited by gpt-4's context window.
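
One simple technique in that direction, shown below as a hedged sketch only (the helper is hypothetical and not part of the pipeline above), is to sample the evaluator several times at a non-zero temperature and average the resulting scores, which tends to smooth out run-to-run variance.

# Hypothetical helper (sketch): sample the evaluator multiple times and average the
# scores to reduce run-to-run variance of LLM-based grading.
def get_geval_score_averaged(
    criteria: str, steps: str, document: str, summary: str, metric_name: str, n_samples: int = 3
) -> float:
    prompt = EVALUATION_PROMPT_TEMPLATE.format(
        criteria=criteria,
        steps=steps,
        metric_name=metric_name,
        document=document,
        summary=summary,
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=1,
        n=n_samples,
        max_tokens=5,
    )
    scores = [int(choice.message.content.strip()) for choice in response.choices]
    return sum(scores) / len(scores)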

Conclusion

Evaluating abstractive summarization remains an open area for further improvement. Traditional metrics such as ROUGE, BLEU, and BERTScore provide useful automatic evaluation but have limitations in capturing semantic similarity and the more nuanced aspects of summarization quality. Moreover, they require reference outputs, which can be expensive to collect or label. LLM-based metrics show promise as a reference-free way of evaluating coherence, fluency, and relevance. However, they may also carry a bias towards favoring text generated by LLMs. Ultimately, a combination of automatic metrics and human evaluation is ideal for reliably assessing abstractive summarization systems. While human evaluation is indispensable for a comprehensive understanding of summary quality, it should be complemented with automated evaluation to enable efficient, large-scale testing. The field will continue to evolve more robust evaluation techniques that balance quality, scalability, and fairness. Advancing evaluation methods is crucial for driving progress in production applications.

References