Detecting hallucinations with a custom LLM-as-a-judge and Braintrust

October 14, 2024

Let's say you're working on a customer service bot and trying to evaluate the quality of its responses. Consider a question like "What is your return policy?". If the correct answer is "You can return items within 30 days of purchase" but your bot generates "You can return items within 30 days", how would you evaluate whether this is a good response?

A heuristic like Levenshtein string distance would say the response is incorrect. A better approach, however, is to use an LLM-as-a-judge to assess the accuracy of the response. LLM-as-a-judge is a technique that uses an LLM to score the quality of answers. Because an LLM can reason about language beyond surface-level string comparison, it can evaluate answers far more accurately.
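
To make this concrete, here's a minimal sketch (using the Levenshtein scorer that ships with the autoevals package we install below) of how string distance under-rates an essentially correct answer:

from autoevals import Levenshtein

expected = "You can return items within 30 days of purchase"
generated = "You can return items within 30 days"

# Levenshtein similarity penalizes every differing character, so this
# paraphrased-but-correct answer scores well below a perfect 1.0.
print(Levenshtein()(output=generated, expected=expected).score)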

In this guide, we'll walk through how to build an LLM-as-a-judge scorer that detects hallucinations using Braintrust, a third-party evaluation platform that is compatible with OpenAI's models.

Install dependencies

Let's install a few basic dependencies. We'll use the CoQA dataset (via DuckDB), Braintrust for evals, and OpenAI's models. Note that Braintrust is a third-party evaluation platform, and you should review their terms of service and privacy policy before proceeding.

%pip install autoevals duckdb braintrust openai --quiet
Note: you may need to restart the kernel to use updated packages.

Next, let's initialize the OpenAI client. We'll use the AsyncOpenAI client so that we can parallelize our requests. The braintrust.wrap_openai function wraps the OpenAI client to enable logging LLM calls to Braintrust. We'll use Braintrust to facilitate the evals that follow. Before proceeding, you should sign up for a Braintrust account and set BRAINTRUST_API_KEY to a valid API key in your environment.

import os

import braintrust
from openai import AsyncOpenAI

braintrust.login(api_key=os.environ["BRAINTRUST_API_KEY"])
client = braintrust.wrap_openai(AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"]))

Explore the dataset

We'll use the CoQA dataset, which contains a wide variety of passages, questions, and answers. Since CoQA is quite large, we'll just look at the first few passages. As with any public dataset, there's some chance the underlying LLM has memorized aspects of it, so when developing your own scorers, it's best to test them on your own private data.

import duckdb

# DuckDB has an easy wrapper for loading datasets from Hugging Face.
con = duckdb.connect(":memory:")
full_result = con.query("""
    SELECT * FROM 'hf://datasets/stanfordnlp/coqa/data/validation-00000-of-00001.parquet'
        LIMIT 40
""").fetchall()

single_result = full_result[10]

print("Passage:")
print(single_result[1])

print("\nQuestion:")
print(single_result[2][0])

print("\nAnswer:")
print(single_result[3]["input_text"][0])
Passage:
(CNN)A chiseled boxer's Instagram feed shows him making constant references to the Bible and enjoying gospel singing with his wife. 

Another features his formidable opponent counting stacks of money, hanging out in strip clubs, and flashing diamond watches and Ferraris. 

Welcome to the world of boxing promotion, circa 2015. 

American Floyd Mayweather and Filipino Manny Pacquiao are set to officially announce their heavily anticipated boxing match at a press conference in Los Angeles Wednesday. 

With the combined purse for the May 2 bout in Las Vegas reported to touch $300 million pending viewership numbers, the incentives to self-promote could not be higher. 

"Nowadays you have to be on social media to launch the fight and to build hype," says boxing promoter Nisse Sauerland, CEO of Team Sauerland. "It couldn't be done without it." 

Thirty-eight year old Mayweather (47-0, 26 knockouts), who favors the moniker "The Money Man" or "TBE" (The Best Ever), boasts nearly five million Instagram followers, 5.65 million followers on Twitter and 9.2 million Facebook likes. 

He famously confirmed the fight via Shots, a photo sharing social media application that he's invested in, and displays links to his clothing brand, The Money Team, on all his accounts. 

Along with professing to the be the best fighter of all time, he could also stake a claim to be one of the greatest social media users in sports. 

"I think they're both playing their roles," says Sauerland, who promotes over 45 boxers. "You've got the bad guy and the good guy, really. You've got the guy who throws the money around (Mayweather), that's his image, and Pacquiao, he's the hope of a nation." 

Question:
Who are the two boxer featured in this article?

Answer:
Floyd Mayweather and Manny Pacquiao

The data consists of a series of passages, each with a number of questions and answers. Let's flatten it into a list of (passage, question, answer) tuples.

from dataclasses import dataclass


@dataclass
class QuestionAnswer:
    passage: str
    question: str
    expected_answer: str
    generated_answer: str


qa_pairs = [
    QuestionAnswer(
        passage=r[1],
        question=question,
        # For these "correct" QA pairs, the generated answer is simply the
        # ground-truth answer itself; hallucinated variants are created below.
        generated_answer=r[3]["input_text"][i],
        expected_answer=r[3]["input_text"][i],
    )
    for r in full_result
    for (i, question) in enumerate(r[2])
]

print(len(qa_pairs))
629

Add hallucinations

Since our scorer is meant to test for hallucinations, we can use the QA pairs to generate known hallucinations. We'll create hallucinated answers by asking an LLM to confidently answer each question without using the passage.

import asyncio
import random

random.seed(42)


async def hallucinate_answer(qa):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """\
You are a helpful hallucinating assistant, who makes up fake answers to questions.

Answer the following question in 1 sentence. If you know the answer, then make up some fake
superfluous details that are not in the passage you have memorized.

Make sure to always answer it confidently, even if you don't know the answer. Do not use words
like "perhaps", "likely", "maybe", etc. or punctuation like "...".Do not admit that you cannot
or do not know the answer.""",
            },
            {"role": "user", "content": qa.question},
        ],
        temperature=1,
        max_tokens=100,
    )
    return response.choices[0].message.content


hallucinated_answers = await asyncio.gather(
    *[hallucinate_answer(qa) for qa in qa_pairs]
)


hallucinations = [
    QuestionAnswer(
        passage=qa.passage,
        question=qa.question,
        expected_answer=qa.expected_answer,
        generated_answer=hallucination,
    )
    for (qa, hallucination) in zip(qa_pairs, hallucinated_answers)
    # Exclude simple yes/no answers.
    if "yes" not in hallucination.lower() and "no" not in hallucination.lower()
]

print("Passage:")
print(hallucinations[0].passage)
print("\nQuestion:")
print(hallucinations[0].question)
print("\nExpected Answer:")
print(hallucinations[0].expected_answer)
print("\nGenerated Answer:")
print(hallucinations[0].generated_answer)

print("\n\nNumber of hallucinations:", len(hallucinations))
Passage:
Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. 

"What are you doing, Cotton?!" 

"I only wanted to be more like you". 

Cotton's mommy rubbed her face on Cotton's and said "Oh Cotton, but your fur is so pretty and special, like you. We would never want you to be any other way". And with that, Cotton's mommy picked her up and dropped her into a big bucket of water. When Cotton came out she was herself again. Her sisters licked her face until Cotton's fur was all all dry. 

"Don't ever do that again, Cotton!" they all cried. "Next time you might mess up that pretty white fur of yours and we wouldn't want that!" 

Then Cotton thought, "I change my mind. I like being special".

Question:
Where did she live?

Expected Answer:
in a barn

Generated Answer:
She lived in a quaint cottage on the edge of the Misty Hollow Forest, where elves and talking owls often hosted moonlit storytelling festivals.


Number of hallucinations: 270

Create the evaluators

We'll look at a few common approaches to building an LLM-as-a-judge. For each one, we'll create a scorer and then "meta-evaluate" it to see how well it performs. Since we know the hallucinated answers are incorrect, we'll measure each evaluator's quality by testing how often it scores a hallucinated answer as 0.

LLM-as-a-judge #1: Numeric rater

A common first instinct when creating an LLM-as-a-judge is to ask the LLM to rate answers on a numeric scale, say 1 to 5. The appeal of this approach is that it's easy to convert the LLM's output into a numeric score.

We'll use a modified version of the Factuality template, but ask the LLM to rate the answer on a scale of 1 to 10.

import json

PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {input}
************
[Expert]: {expected}
************
[Submission]: {output}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
Rate the submission on a scale of 1 to 10.
"""


@braintrust.traced
async def numeric_rater(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": PROMPT.format(input=input, output=output, expected=expected),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Rate the submission on a scale of 1 to 10.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "rating": {"type": "integer", "minimum": 1, "maximum": 10},
                        },
                        "required": ["rating"],
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    return (arguments["rating"] - 1) / 9


print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(
    await numeric_rater(
        qa_pairs[10].question,
        qa_pairs[10].generated_answer,
        qa_pairs[10].expected_answer,
    )
)

print(
    hallucinations[10].question,
    "On a hallucinated answer:",
    hallucinations[10].generated_answer,
)
print(
    await numeric_rater(
        hallucinations[10].question,
        hallucinations[10].generated_answer,
        hallucinations[10].expected_answer,
    )
)
What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1.0
What? On a hallucinated answer: "What" is a word often used to express inquiry, curiosity, or surprise, and it is said to have originated from the ancient city of Whatopia, where people would constantly ask questions while enchanted crows delivered cryptic messages.
0.0

This looks promising! Now that we've sanity-checked it on individual examples, let's run a proper eval to see how it performs across a broader dataset. An eval consists of three components:

  • Data: in this case, the input is the question, the hallucinated answer, and the ground-truth answer. The scorer converts this into a score between 0 and 1. The expected score is 0, since the answer is a hallucination.
  • Task: the task simply calls the numeric rater on each input.
  • Scores: we assess the quality of each generated score by comparing it to the ground-truth score. Since both numbers are between 0 and 1, we can use the normalized difference as the score.
from dataclasses import asdict

from braintrust import Eval


def data():
    for pair in hallucinations:
        yield dict(
            input=dict(asdict(pair)), expected=0, metadata=dict(hallucination=True)
        )


async def task(input):
    return await numeric_rater(
        input=input["question"],
        output=input["generated_answer"],
        expected=input["expected_answer"],
    )


def normalized_diff(output, expected):
    return 1 - abs(output - expected)


await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Numeric rater",
    max_concurrency=10,
)
Experiment Numeric rater is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater
LLM-as-a-judge [experiment_name=Numeric rater] (data): 270it [00:00, 54634.41it/s]
LLM-as-a-judge [experiment_name=Numeric rater] (tasks):   0%|          | 0/270 [00:00<?, ?it/s]
=========================SUMMARY=========================
95.35% 'normalized_diff' score

201.60tok prompt_tokens
5tok completion_tokens
206.60tok total_tokens

See results for Numeric rater at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater
EvalResultWithSummary(summary="...", results=[...])

It looks like the numeric rater scores around 95%. That's not bad, but if 5% of your evals are misjudged, it would be hard to trust them. Let's dig into the Braintrust UI to understand what's going on.

(Screenshot: hallucinated answers receiving partial credit in the Braintrust UI)

It looks like many of the incorrect answers were rated somewhere between 1 and 10, but we have no visibility into why the model assigned those scores. Let's see if we can fix that next by asking the model to reason about its rating.
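
LLM-as-a-judge #2: Numeric rater with reasoning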

@braintrust.traced
async def numeric_rater(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": PROMPT.format(input=input, output=output, expected=expected),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Rate the submission on a scale of 1 to 10.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "reasons": {
                                "description": "Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.",
                                "title": "Reasoning",
                                "type": "string",
                            },
                            "rating": {"type": "integer", "minimum": 1, "maximum": 10},
                        },
                        "required": ["rating"],
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    return (arguments["rating"] - 1) / 9


print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(
    await numeric_rater(
        qa_pairs[10].question,
        qa_pairs[10].generated_answer,
        qa_pairs[10].expected_answer,
    )
)

print(
    hallucinations[10].question,
    "On a hallucinated answer:",
    hallucinations[10].generated_answer,
)
print(
    await numeric_rater(
        hallucinations[10].question,
        hallucinations[10].generated_answer,
        hallucinations[10].expected_answer,
    )
)
What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1.0
What? On a hallucinated answer: "What" is a word often used to express inquiry, curiosity, or surprise, and it is said to have originated from the ancient city of Whatopia, where people would constantly ask questions while enchanted crows delivered cryptic messages.
0.0
await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Numeric rater with reasoning",
    max_concurrency=10,
)
Experiment Numeric rater with reasoning is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater%20with%20reasoning
LLM-as-a-judge [experiment_name=Numeric rater with reasoning] (data): 270it [00:00, 111715.70it/s]
LLM-as-a-judge [experiment_name=Numeric rater with reasoning] (tasks):   0%|          | 0/270 [00:00<?, ?it/s]
=========================SUMMARY=========================
Numeric rater with reasoning compared to Numeric rater:
92.10% (-03.25%) 'normalized_diff' score	(5 improvements, 63 regressions)

3.68s duration
3.68s llm_duration
239.60tok (+3800.00%) 'prompt_tokens'    	(0 improvements, 270 regressions)
136.82tok (+13182.22%) 'completion_tokens'	(0 improvements, 270 regressions)
376.43tok (+16982.22%) 'total_tokens'     	(0 improvements, 270 regressions)
0.00$ estimated_cost

See results for Numeric rater with reasoning at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater%20with%20reasoning
EvalResultWithSummary(summary="...", results=[...])

Adding reasoning doesn't appear to have helped the score (in fact, it dropped by about 3%). However, looking at one of the failure cases gives us insight into what the model was thinking. Here's an example of a hallucinated answer:

(Screenshot: the output for a hallucinated answer)

along with the score and its reasoning:

(Screenshot: the score and its reasoning)

It looks like the model is applying its own judgment to award partial credit. This is a common problem with numeric ratings, for models and humans alike, and it can usually be solved with a better prompt.

LLM-as-a-judge #3: Classification instead of rating

Next, we'll spell out specific criteria and ask the model to classify the answer against them. This approach lets us steer the model more precisely toward the hallucinations we're testing for. Intuitively, giving the model concrete grading criteria should produce more accurate scores.

PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {input}
************
[Expert]: {expected}
************
[Submission]: {output}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.

Answer the question by calling `select_choice` with your reasoning in a step-by-step manner to be
sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Select a
single choice by setting the `choice` parameter to a single choice from A, B, C, D, or E.
"""

# Since we're testing for hallucinations, penalize (B) as much as (D).
CHOICE_SCORES = {
    "A": 0.5,
    "B": 0,
    "C": 1,
    "D": 0,
    "E": 1,
}


@braintrust.traced
async def classifier(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": PROMPT.format(input=input, output=output, expected=expected),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Call this function to select a choice.",
                    "parameters": {
                        "properties": {
                            "reasons": {
                                "description": "Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.",
                                "type": "string",
                            },
                            "choice": {
                                "description": "The choice",
                                "type": "string",
                                "enum": ["A", "B", "C", "D", "E"],
                            },
                        },
                        "required": ["reasons", "choice"],
                        "type": "object",
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    choice = arguments["choice"]
    return CHOICE_SCORES[choice] if choice in CHOICE_SCORES else None


print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(
    await classifier(
        qa_pairs[10].question,
        qa_pairs[10].generated_answer,
        qa_pairs[10].expected_answer,
    )
)

print(
    hallucinations[10].question,
    "On a hallucinated answer:",
    hallucinations[10].generated_answer,
)
print(
    await classifier(
        hallucinations[10].question,
        hallucinations[10].generated_answer,
        hallucinations[10].expected_answer,
    )
)
What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1
What? On a hallucinated answer: "What" is a word often used to express inquiry, curiosity, or surprise, and it is said to have originated from the ancient city of Whatopia, where people would constantly ask questions while enchanted crows delivered cryptic messages.
0
async def task(input):
    return await classifier(
        input=input["question"],
        output=input["generated_answer"],
        expected=input["expected_answer"],
    )


await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Classifier",
    max_concurrency=10,
)
Experiment Classifier is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Classifier
LLM-as-a-judge [experiment_name=Classifier] (data): 270it [00:00, 84930.41it/s]
LLM-as-a-judge [experiment_name=Classifier] (tasks):   0%|          | 0/270 [00:00<?, ?it/s]
=========================SUMMARY=========================
Classifier compared to Numeric rater with reasoning:
98.15% (+06.05%) 'normalized_diff' score	(86 improvements, 5 regressions)

4.41s (+72.60%) 'duration'         	(104 improvements, 165 regressions)
4.40s (+72.59%) 'llm_duration'     	(104 improvements, 165 regressions)
418.60tok (+17900.00%) 'prompt_tokens'    	(0 improvements, 270 regressions)
164.91tok (+2809.26%) 'completion_tokens'	(64 improvements, 204 regressions)
583.52tok (+20709.26%) 'total_tokens'     	(0 improvements, 270 regressions)
0.00$ (+00.07%) 'estimated_cost'   	(8 improvements, 255 regressions)

See results for Classifier at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Classifier
EvalResultWithSummary(summary="...", results=[...])

The classifier scores 98%, a significant improvement!

Codifying this pattern

The classifier above can be rewritten simply as:

import autoevals

PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{input}}
************
[Expert]: {{expected}}
************
[Submission]: {{output}}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.

Answer the question by calling `select_choice` with your reasoning in a step-by-step manner to be
sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Select a
single choice by setting the `choice` parameter to a single choice from A, B, C, D, or E.
"""

Classifier = autoevals.LLMClassifier(
    name="Hallucination detector",
    prompt_template=PROMPT,
    choice_scores={"A": 0.5, "B": 0, "C": 1, "D": 0, "E": 1},
    use_cot=True,
)
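
As a quick sanity check, you can call the scorer directly. (A hedged sketch: autoevals scorers can be awaited via eval_async with the same input/output/expected arguments we used above, returning a Score object.)

result = await Classifier.eval_async(
    input=hallucinations[0].question,
    output=hallucinations[0].generated_answer,
    expected=hallucinations[0].expected_answer,
)
# The Score object carries the numeric value plus metadata, including the
# model's rationale when use_cot=True.
print(result.score)
print(result.metadata)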

Next steps

As a next step, you can dig into individual improvements and regressions to assess them and consider future changes to the prompt. You can also test the scorer on your own data and double-check that the results hold for your use case. You could evaluate a model like o1, try fine-tuning a smaller model to see whether the results are reproducible, or use few-shot prompting to align the model with more subjective criteria. In every case, you should work on evaluating your evaluators so that you can rigorously assess the impact of each change.
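
For instance, aligning the judge with more subjective criteria via few-shot prompting could be as simple as appending a few graded examples to the prompt template. (A hypothetical sketch; the examples and wording are purely illustrative.)

FEW_SHOT_SUFFIX = """\

Here are two examples of how to grade:

[Question]: Where did Cotton live?
[Expert]: in a barn
[Submission]: She lived in a quaint cottage on the edge of the Misty Hollow Forest.
Choice: D (the submission contradicts the expert answer)

[Question]: Who are the two boxers featured in this article?
[Expert]: Floyd Mayweather and Manny Pacquiao
[Submission]: The article features Mayweather and Pacquiao.
Choice: C (same factual content, different phrasing)
"""

FewShotClassifier = autoevals.LLMClassifier(
    name="Hallucination detector (few-shot)",
    prompt_template=PROMPT + FEW_SHOT_SUFFIX,
    choice_scores={"A": 0.5, "B": 0, "C": 1, "D": 0, "E": 1},
    use_cot=True,
)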