Using logprobs

December 20, 2023

This notebook demonstrates the use of the logprobs parameter in the Chat Completions API. When logprobs is enabled, the API returns the log probability of each output token, along with a limited number of the most likely tokens at each token position and their log probabilities. The relevant request parameters are:

  • logprobs: whether or not to return the log probabilities of the output tokens. If true, returns the log probabilities of each output token returned in the content of message.
  • top_logprobs: an integer between 0 and 5 specifying the number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.

The log probability of an output token indicates how likely that token is to occur in the sequence given the context. To simplify, a logprob is log(p), where p = the probability of the token occurring at a specific position given the previous tokens in the context. For example, a token generated with probability p = 0.5 has a logprob of log(0.5) ≈ -0.69. Some key points about logprobs:

  • Higher log probabilities suggest a higher likelihood of the token in that context. This allows users to gauge the model's confidence in its output or explore alternative responses the model considered.
  • A logprob can be any negative number or 0.0, where 0.0 corresponds to 100% probability.
  • logprobs allow us to compute the joint probability of a sequence as the sum of the logprobs of the individual tokens. This is useful for scoring and ranking model outputs. Another common approach is to take the average per-token logprob of a sentence to choose the best generation (see the short sketch after this list).
  • We can examine the logprobs assigned to different candidate tokens to understand which options the model considered plausible or implausible.
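As a minimal sketch of the scoring idea above (the token_logprobs values here are made up for illustration):

import numpy as np

# Hypothetical per-token log probabilities for one generated sequence
token_logprobs = [-0.01, -0.45, -0.02, -1.20]

joint_logprob = np.sum(token_logprobs)  # log P(sequence) = sum of per-token logprobs
joint_prob = np.exp(joint_logprob)      # joint probability of the whole sequence
mean_logprob = np.mean(token_logprobs)  # average per-token logprob, a common ranking score

print(f"Joint probability: {joint_prob:.4f}")    # ~0.1864
print(f"Average logprob:   {mean_logprob:.4f}")  # -0.4200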

While there is a wide array of use cases for logprobs, this notebook will focus on its use for:

  1. Classification tasks
  • Large Language Models excel at many classification tasks, but accurately measuring the model's confidence in its outputs can be challenging. logprobs provide a probability associated with each class prediction, enabling users to set their own classification or confidence thresholds.
  2. Retrieval (Q&A) evaluation
  • logprobs can assist with self-evaluation in retrieval applications. In the Q&A example, the model outputs a contrived boolean has_sufficient_context_for_answer, which can serve as a confidence score of whether the answer is contained in the retrieved content. Evaluations of this type can reduce retrieval-based hallucinations and enhance accuracy.
  3. Autocomplete
  • logprobs could help us decide how to suggest words as a user is typing.
  4. Token highlighting and outputting bytes
  • Users can easily create a token highlighter using the built-in tokenization that comes with enabling logprobs. Additionally, the bytes parameter includes the ASCII encoding of each output character, which is particularly useful for reproducing emojis and special characters.
  5. Calculating perplexity
  • logprobs can be used to help us assess the model's overall confidence in a result, and to compare the confidence of results from different prompts.

First, let's import what we need and set up a helper function for calling the Chat Completions API:
from openai import OpenAI
from math import exp
import numpy as np
from IPython.display import display, HTML
import os

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4",
    max_tokens=500,
    temperature=0,
    stop=None,
    seed=123,
    tools=None,
    logprobs=None,  # whether or not to return the log probabilities of the output tokens. If true, returns the log probabilities of each output token returned in the content of message.
    top_logprobs=None,  # an integer between 0 and 5 specifying the number of most likely tokens to return at each token position
):
    params = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": stop,
        "seed": seed,
        "logprobs": logprobs,
        "top_logprobs": top_logprobs,
    }
    if tools:
        params["tools"] = tools

    completion = client.chat.completions.create(**params)
    return completion

1. Using logprobs to assess confidence for classification tasks

Let's say we want to create a system to classify news articles into a set of pre-defined categories. Without logprobs we can use Chat Completions to do this, but it is much more difficult to assess how certain the model is in its classifications.

Now, with logprobs enabled, we can see exactly how confident the model is in its predictions, which is crucial for creating an accurate and trustworthy classifier. For example, if the logprob for the chosen category is high, this suggests the model is quite confident in its classification. If it is low, this suggests the model is less confident. This can be particularly useful in cases where the model's classification is not what you expected, or when the model's outputs should be reviewed or validated by humans.

We'll begin with a prompt that presents the model with four categories: Technology, Politics, Sports, and Art. The model is then tasked with classifying articles into these categories based solely on their headlines.

CLASSIFICATION_PROMPT = """You will be given a headline of a news article.
Classify the article into one of the following categories: Technology, Politics, Sports, and Art.
Return only the name of the category, and nothing else.
MAKE SURE your output is one of the four categories stated.
Article headline: {headline}"""

Let's look at three sample headlines, starting with a standard Chat Completions output without using logprobs:

headlines = [
    "Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.",
    "Local Mayor Launches Initiative to Enhance Urban Public Transport.",
    "Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut",
]
for headline in headlines:
    print(f"\nHeadline: {headline}")
    API_RESPONSE = get_completion(
        [{"role": "user", "content": CLASSIFICATION_PROMPT.format(headline=headline)}],
        model="gpt-4o",
    )
    print(f"Category: {API_RESPONSE.choices[0].message.content}\n")
Headline: Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.
Category: Technology


Headline: Local Mayor Launches Initiative to Enhance Urban Public Transport.
Category: Politics


Headline: Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut
Category: Art

Here we can see the selected category for each headline. However, we have no visibility into how confident the model is in its predictions. Let's rerun the same prompt but with logprobs enabled, and top_logprobs set to 2 (this will show us the 2 most likely output tokens at each token position). Additionally, we can also output the linear probability of each output token, in order to convert the log probability to the more easily interpretable scale of 0-100%.

for headline in headlines:
    print(f"\nHeadline: {headline}")
    API_RESPONSE = get_completion(
        [{"role": "user", "content": CLASSIFICATION_PROMPT.format(headline=headline)}],
        model="gpt-4o-mini",
        logprobs=True,
        top_logprobs=2,
    )
    top_two_logprobs = API_RESPONSE.choices[0].logprobs.content[0].top_logprobs
    html_content = ""
    for i, logprob in enumerate(top_two_logprobs, start=1):
        html_content += (
            f"<span style='color: cyan'>Output token {i}:</span> {logprob.token}, "
            f"<span style='color: darkorange'>logprobs:</span> {logprob.logprob}, "
            f"<span style='color: magenta'>linear probability:</span> {np.round(np.exp(logprob.logprob)*100,2)}%<br>"
        )
    display(HTML(html_content))
    print("\n")
Headline: Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.
Output token 1: Technology, logprobs: 0.0, linear probability: 100.0%
Output token 2: Technology, logprobs: -18.75, linear probability: 0.0%


Headline: Local Mayor Launches Initiative to Enhance Urban Public Transport.
Output token 1: Politics, logprobs: -3.1281633e-07, linear probability: 100.0%
Output token 2: Polit, logprobs: -16.0, linear probability: 0.0%


Headline: Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut
Output token 1: Art, logprobs: -0.028133942, linear probability: 97.23%
Output token 2: Sports, logprobs: -4.278134, linear probability: 1.39%

As expected for the first two headlines, gpt-4o-mini is 100% confident in its classifications, as the content clearly focuses on technology and politics respectively. However, the third headline combines both sports- and art-related topics, resulting in a slightly lower confidence of 97%, while still demonstrating strong certainty in its classification.

logprobs are very useful for classification tasks. They allow us to set confidence thresholds, or to output multiple potential tokens when the log probability of the selected output is not sufficiently high. For instance, when creating a recommendation engine to tag articles, we can automatically classify headlines that cross a certain threshold, and send the less certain ones for manual review.
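A minimal sketch of such a gate, reusing CLASSIFICATION_PROMPT and get_completion from above (the 0.9 threshold and the classify_with_threshold helper are illustrative assumptions, not part of the API):

CONFIDENCE_THRESHOLD = 0.9  # illustrative; tune for your application

def classify_with_threshold(headline: str):
    response = get_completion(
        [{"role": "user", "content": CLASSIFICATION_PROMPT.format(headline=headline)}],
        model="gpt-4o-mini",
        logprobs=True,
    )
    top_token = response.choices[0].logprobs.content[0]
    confidence = np.exp(top_token.logprob)  # convert logprob to a linear probability
    if confidence >= CONFIDENCE_THRESHOLD:
        return response.choices[0].message.content  # auto-accept the label
    return None  # below threshold: route to human review instead

Note that this sketch only inspects the first output token, which is enough to separate the four single-word category names used in this prompt.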

2. Retrieval confidence scoring to reduce hallucinations

To reduce hallucinations, and improve the performance of our RAG-based Q&A system, we can use logprobs to evaluate how confident the model is in its retrieval.

Let's say we have built a retrieval system using RAG for Q&A, but are struggling with hallucinated answers to our questions. Note: we will use a hardcoded article for this example, but see other entries in the cookbook for tutorials on using RAG for Q&A.

# Article retrieved
ada_lovelace_article = """Augusta Ada King, Countess of Lovelace (née Byron; 10 December 1815 – 27 November 1852) was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation.
Ada Byron was the only legitimate child of poet Lord Byron and reformer Lady Byron. All Lovelace's half-siblings, Lord Byron's other children, were born out of wedlock to other women. Byron separated from his wife a month after Ada was born and left England forever. He died in Greece when Ada was eight. Her mother was anxious about her upbringing and promoted Ada's interest in mathematics and logic in an effort to prevent her from developing her father's perceived insanity. Despite this, Ada remained interested in him, naming her two sons Byron and Gordon. Upon her death, she was buried next to him at her request. Although often ill in her childhood, Ada pursued her studies assiduously. She married William King in 1835. King was made Earl of Lovelace in 1838, Ada thereby becoming Countess of Lovelace.
Her educational and social exploits brought her into contact with scientists such as Andrew Crosse, Charles Babbage, Sir David Brewster, Charles Wheatstone, Michael Faraday, and the author Charles Dickens, contacts which she used to further her education. Ada described her approach as "poetical science" and herself as an "Analyst (& Metaphysician)".
When she was eighteen, her mathematical talents led her to a long working relationship and friendship with fellow British mathematician Charles Babbage, who is known as "the father of computers". She was in particular interested in Babbage's work on the Analytical Engine. Lovelace first met him in June 1833, through their mutual friend, and her private tutor, Mary Somerville.
Between 1842 and 1843, Ada translated an article by the military engineer Luigi Menabrea (later Prime Minister of Italy) about the Analytical Engine, supplementing it with an elaborate set of seven notes, simply called "Notes".
Lovelace's notes are important in the early history of computers, especially since the seventh one contained what many consider to be the first computer program—that is, an algorithm designed to be carried out by a machine. Other historians reject this perspective and point out that Babbage's personal notes from the years 1836/1837 contain the first programs for the engine. She also developed a vision of the capability of computers to go beyond mere calculating or number-crunching, while many others, including Babbage himself, focused only on those capabilities. Her mindset of "poetical science" led her to ask questions about the Analytical Engine (as shown in her notes) examining how individuals and society relate to technology as a collaborative tool.
"""

# Questions that can be easily answered given the article
easy_questions = [
    "What nationality was Ada Lovelace?",
    "What was an important finding from Lovelace's seventh note?",
]

# Questions that are not fully covered in the article
medium_questions = [
    "Did Lovelace collaborate with Charles Dickens",
    "What concepts did Lovelace build with Charles Babbage",
]

Now, what we can do is ask the model to answer the question, but also to evaluate its own response. Specifically, we will ask the model to output a boolean has_sufficient_context_for_answer. We can then evaluate the logprobs to see just how confident the model is that the answer is contained in the provided context.

PROMPT = """You retrieved this article: {article}. The question is: {question}.
Before even answering the question, consider whether you have sufficient information in the article to answer the question fully.
Your output should JUST be the boolean true or false, of if you have sufficient information in the article to answer the question.
Respond with just one word, the boolean true or false. You must output the word 'True', or the word 'False', nothing else.
"""
html_output = ""
html_output += "Questions clearly answered in article"

for question in easy_questions:
    API_RESPONSE = get_completion(
        [
            {
                "role": "user",
                "content": PROMPT.format(
                    article=ada_lovelace_article, question=question
                ),
            }
        ],
        model="gpt-4o-mini",
        logprobs=True,
    )
    html_output += f'<p style="color:green">Question: {question}</p>'
    for logprob in API_RESPONSE.choices[0].logprobs.content:
        html_output += f'<p style="color:cyan">has_sufficient_context_for_answer: {logprob.token}, <span style="color:darkorange">logprobs: {logprob.logprob}, <span style="color:magenta">linear probability: {np.round(np.exp(logprob.logprob)*100,2)}%</span></p>'

html_output += "Questions only partially covered in the article"

for question in medium_questions:
    API_RESPONSE = get_completion(
        [
            {
                "role": "user",
                "content": PROMPT.format(
                    article=ada_lovelace_article, question=question
                ),
            }
        ],
        model="gpt-4o",
        logprobs=True,
        top_logprobs=3,
    )
    html_output += f'<p style="color:green">Question: {question}</p>'
    for logprob in API_RESPONSE.choices[0].logprobs.content:
        html_output += f'<p style="color:cyan">has_sufficient_context_for_answer: {logprob.token}, <span style="color:darkorange">logprobs: {logprob.logprob}, <span style="color:magenta">linear probability: {np.round(np.exp(logprob.logprob)*100,2)}%</span></p>'

display(HTML(html_output))
Questions clearly answered in article

Question: What nationality was Ada Lovelace?

has_sufficient_context_for_answer: True, logprobs: -3.1281633e-07, linear probability: 100.0%

Question: What was an important finding from Lovelace's seventh note?

has_sufficient_context_for_answer: True, logprobs: -7.89631e-07, linear probability: 100.0%

Questions only partially covered in the article

Question: Did Lovelace collaborate with Charles Dickens

has_sufficient_context_for_answer: False, logprobs: -0.008654992, linear probability: 99.14%

Question: What concepts did Lovelace build with Charles Babbage

has_sufficient_context_for_answer: True, logprobs: -0.004082317, linear probability: 99.59%

For the first two questions, our model asserts with (near) 100% confidence that the article has sufficient context to answer the posed questions.

On the other hand, for the trickier questions which are less clearly answered in the article, the model is less confident that it has sufficient context. This is a great guardrail to help ensure our retrieved content is sufficient.

This self-evaluation can help reduce hallucinations, as you can restrict answers or re-prompt the user when your sufficient_context_for_answer log probability falls below a certain threshold. Methods like this have been shown to significantly reduce RAG-for-Q&A hallucinations and errors (example).
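A minimal sketch of such a guardrail, reusing the PROMPT and article from above (the has_sufficient_context helper and the 0.98 threshold are illustrative assumptions):

def has_sufficient_context(article: str, question: str, threshold: float = 0.98) -> bool:
    response = get_completion(
        [{"role": "user", "content": PROMPT.format(article=article, question=question)}],
        model="gpt-4o-mini",
        logprobs=True,
    )
    token = response.choices[0].logprobs.content[0]
    # Only trust an affirmative answer the model is highly confident in
    return token.token == "True" and np.exp(token.logprob) >= threshold

# If this returns False, withhold or caveat the answer (or re-retrieve)
# rather than risk a hallucinated response.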

3. Autocomplete

Another use case for logprobs is autocomplete systems. Without creating an entire end-to-end autocomplete system, let's demonstrate how logprobs could help us decide how to suggest words as a user is typing.

First, let's come up with a sample sentence: "My least favorite TV show is Breaking Bad." Let's say we want to dynamically recommend the next word or token as the sentence is being typed, but only when the model is quite sure of what the next word will be. To demonstrate this, let's break the sentence up into sequential components.

sentence_list = [
    "My",
    "My least",
    "My least favorite",
    "My least favorite TV",
    "My least favorite TV show",
    "My least favorite TV show is",
    "My least favorite TV show is Breaking Bad",
]

Now, we can ask gpt-4o-mini to act as an autocomplete engine given whatever context the model has. We can enable logprobs and see how confident the model is in its predictions.

high_prob_completions = {}
low_prob_completions = {}
html_output = ""

for sentence in sentence_list:
    PROMPT = """Complete this sentence. You are acting as auto-complete. Simply complete the sentence to the best of your ability, make sure it is just ONE sentence: {sentence}"""
    API_RESPONSE = get_completion(
        [{"role": "user", "content": PROMPT.format(sentence=sentence)}],
        model="gpt-4o-mini",
        logprobs=True,
        top_logprobs=3,
    )
    html_output += f'<p>Sentence: {sentence}</p>'
    first_token = True
    for token in API_RESPONSE.choices[0].logprobs.content[0].top_logprobs:
        html_output += f'<p style="color:cyan">Predicted next token: {token.token}, <span style="color:darkorange">logprobs: {token.logprob}, <span style="color:magenta">linear probability: {np.round(np.exp(token.logprob)*100,2)}%</span></p>'
        if first_token:
            if np.exp(token.logprob) > 0.95:
                high_prob_completions[sentence] = token.token
            if np.exp(token.logprob) < 0.60:
                low_prob_completions[sentence] = token.token
        first_token = False
    html_output += "<br>"

display(HTML(html_output))

Sentence: My

Predicted next token: My, logprobs: -0.08344023, linear probability: 91.99%

Predicted next token: dog, logprobs: -3.3334403, linear probability: 3.57%

Predicted next token: ap, logprobs: -3.5834403, linear probability: 2.78%


Sentence: My least

Predicted next token: My, logprobs: -0.1271426, linear probability: 88.06%

Predicted next token: favorite, logprobs: -2.1271427, linear probability: 11.92%

Predicted next token: My, logprobs: -9.127143, linear probability: 0.01%


Sentence: My least favorite

Predicted next token: My, logprobs: -0.052905332, linear probability: 94.85%

Predicted next token: food, logprobs: -4.0529056, linear probability: 1.74%

Predicted next token: color, logprobs: -5.0529056, linear probability: 0.64%


Sentence: My least favorite TV

Predicted next token: show, logprobs: -0.57662326, linear probability: 56.18%

Predicted next token: My, logprobs: -0.82662326, linear probability: 43.75%

Predicted next token: show, logprobs: -8.201623, linear probability: 0.03%


Sentence: My least favorite TV show

Predicted next token: is, logprobs: -0.70817715, linear probability: 49.25%

Predicted next token: My, logprobs: -0.70817715, linear probability: 49.25%

Predicted next token: was, logprobs: -4.833177, linear probability: 0.8%


Sentence: My least favorite TV show is

Predicted next token: My, logprobs: -0.47896808, linear probability: 61.94%

Predicted next token: one, logprobs: -1.7289681, linear probability: 17.75%

Predicted next token: the, logprobs: -2.9789681, linear probability: 5.08%


Sentence: My least favorite TV show is Breaking Bad

Predicted next token: because, logprobs: -0.034502674, linear probability: 96.61%

Predicted next token: ,, logprobs: -3.7845027, linear probability: 2.27%

Predicted next token: because, logprobs: -5.0345025, linear probability: 0.65%
Let's look at the high-confidence autocompletions:

high_prob_completions
{'My least favorite TV show is Breaking Bad': 'because'}

This looks reasonable, and we can feel confident in the suggestion! After typing "My least favorite TV show is Breaking Bad", the model is quite sure the next word should be "because". Now let's look at the autocompletion suggestions the model was less certain about:

low_prob_completions
{'My least favorite TV': 'show', 'My least favorite TV show': 'is'}

These are logical as well. It's not clear what the user is going to say with just the prefix "my least favorite", and it's really anyone's guess what the author's least favorite TV show is.

So, with gpt-4o-mini, we can use logprobs to create the foundations of a dynamic autocompletion engine!
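Pulling this together, a minimal suggestion gate might look like the sketch below (the suggest_next_token helper and the 0.95 cutoff are illustrative assumptions):

AUTOCOMPLETE_PROMPT = """Complete this sentence. You are acting as auto-complete. Simply complete the sentence to the best of your ability, make sure it is just ONE sentence: {sentence}"""

def suggest_next_token(text_so_far: str, cutoff: float = 0.95):
    response = get_completion(
        [{"role": "user", "content": AUTOCOMPLETE_PROMPT.format(sentence=text_so_far)}],
        model="gpt-4o-mini",
        logprobs=True,
    )
    top_token = response.choices[0].logprobs.content[0]
    # Only surface a suggestion when the model is very sure of the next token
    if np.exp(top_token.logprob) > cutoff:
        return top_token.token
    return None  # stay silent rather than make a low-confidence suggestion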

4. Token highlighting and outputting bytes

Let's quickly touch on creating a simple token highlighter with logprobs, and using the bytes parameter. First, we can create a function that counts and highlights each token. While this doesn't use the log probabilities, it uses the built-in tokenization that comes with enabling logprobs.

PROMPT = """What's the longest word in the English language?"""

API_RESPONSE = get_completion(
    [{"role": "user", "content": PROMPT}], model="gpt-4o", logprobs=True, top_logprobs=5
)


def highlight_text(api_response):
    colors = [
        "#FF00FF",  # Magenta
        "#008000",  # Green
        "#FF8C00",  # Dark Orange
        "#FF0000",  # Red
        "#0000FF",  # Blue
    ]
    tokens = api_response.choices[0].logprobs.content

    color_idx = 0  # Initialize color index
    html_output = ""  # Initialize HTML output
    for t in tokens:
        token_str = bytes(t.bytes).decode("utf-8")  # Decode bytes to string

        # Add colored token to HTML output
        html_output += f"<span style='color: {colors[color_idx]}'>{token_str}</span>"

        # Move to the next color
        color_idx = (color_idx + 1) % len(colors)
    display(HTML(html_output))  # Display HTML output
    print(f"Total number of tokens: {len(tokens)}")
highlight_text(API_RESPONSE)
The longest word in the English language is often considered to be "pneumonoultramicroscopicsilicovolcanoconiosis," a term referring to a type of lung disease caused by inhaling very fine silicate or quartz dust. However, it's worth noting that this word was coined more for its length than for practical use. There are also chemical names for proteins and other compounds that can be much longer, but they are typically not used in everyday language.
Total number of tokens: 95

Next, let's reconstruct a sentence using the bytes parameter. With logprobs enabled, we are given both each token and the ASCII (decimal utf-8) values of the token string. These ASCII values can be helpful when handling tokens that contain, or consist of, emojis or special characters.

PROMPT = """Output the blue heart emoji and its name."""
API_RESPONSE = get_completion(
    [{"role": "user", "content": PROMPT}], model="gpt-4o", logprobs=True
)

aggregated_bytes = []
joint_logprob = 0.0

# Iterate over tokens, aggregate bytes and calculate joint logprob
for token in API_RESPONSE.choices[0].logprobs.content:
    print("Token:", token.token)
    print("Log prob:", token.logprob)
    print("Linear prob:", np.round(exp(token.logprob) * 100, 2), "%")
    print("Bytes:", token.bytes, "\n")
    aggregated_bytes += token.bytes
    joint_logprob += token.logprob

# Decode the aggregated bytes to text
aggregated_text = bytes(aggregated_bytes).decode("utf-8")

# Assert that the decoded text is the same as the message content
assert API_RESPONSE.choices[0].message.content == aggregated_text

# Print the results
print("Bytes array:", aggregated_bytes)
print(f"Decoded bytes: {aggregated_text}")
print("Joint prob:", np.round(exp(joint_logprob) * 100, 2), "%")
Token: Here
Log prob: -0.054242473
Linear prob: 94.72 %
Bytes: [72, 101, 114, 101] 

Token:  is
Log prob: -0.0044352207
Linear prob: 99.56 %
Bytes: [32, 105, 115] 

Token:  the
Log prob: -2.1008714e-06
Linear prob: 100.0 %
Bytes: [32, 116, 104, 101] 

Token:  blue
Log prob: -0.0013290489
Linear prob: 99.87 %
Bytes: [32, 98, 108, 117, 101] 

Token:  heart
Log prob: 0.0
Linear prob: 100.0 %
Bytes: [32, 104, 101, 97, 114, 116] 

Token:  emoji
Log prob: 0.0
Linear prob: 100.0 %
Bytes: [32, 101, 109, 111, 106, 105] 

Token:  and
Log prob: -0.038287632
Linear prob: 96.24 %
Bytes: [32, 97, 110, 100] 

Token:  its
Log prob: 0.0
Linear prob: 100.0 %
Bytes: [32, 105, 116, 115] 

Token:  name
Log prob: -1.569009e-05
Linear prob: 100.0 %
Bytes: [32, 110, 97, 109, 101] 

Token: :


Log prob: -0.11313002
Linear prob: 89.3 %
Bytes: [58, 10, 10] 

Token: \xf0\x9f\x92
Log prob: -0.09048584
Linear prob: 91.35 %
Bytes: [240, 159, 146] 

Token: \x99
Log prob: 0.0
Linear prob: 100.0 %
Bytes: [153] 

Token:  Blue
Log prob: -0.023958502
Linear prob: 97.63 %
Bytes: [32, 66, 108, 117, 101] 

Token:  Heart
Log prob: -6.2729996e-06
Linear prob: 100.0 %
Bytes: [32, 72, 101, 97, 114, 116] 

Bytes array: [72, 101, 114, 101, 32, 105, 115, 32, 116, 104, 101, 32, 98, 108, 117, 101, 32, 104, 101, 97, 114, 116, 32, 101, 109, 111, 106, 105, 32, 97, 110, 100, 32, 105, 116, 115, 32, 110, 97, 109, 101, 58, 10, 10, 240, 159, 146, 153, 32, 66, 108, 117, 101, 32, 72, 101, 97, 114, 116]
Decoded bytes: Here is the blue heart emoji and its name:

💙 Blue Heart
Joint prob: 72.19 %

Here, we see that while the emoji's first token is \xf0\x9f\x92, we can take its ASCII values and append them to a bytes array. Then, we can easily decode this array into a full sentence, and validate with our assert statement that the decoded bytes are identical to our completion message!

Additionally, we can get the joint probability of the entire completion, which is the exponentiated sum of the log probabilities of each token (equivalently, the product of the individual token probabilities). This gives us a sense of how likely this given completion is given the prompt. Since our prompt is quite directive (asking for a certain emoji and its name), the joint probability of this output is high! If we ask for a random output however, we'll see a much lower joint probability. This can also be a good tactic for developers during prompt engineering.
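As a sketch of that prompt-engineering tactic, you could compare the joint probability of completions across prompts (the joint_probability helper and the second prompt are illustrative assumptions):

def joint_probability(prompt: str) -> float:
    response = get_completion(
        [{"role": "user", "content": prompt}], model="gpt-4o", logprobs=True
    )
    logprobs = [t.logprob for t in response.choices[0].logprobs.content]
    return float(np.exp(np.sum(logprobs)))  # exp of the summed logprobs

# A narrow, directive prompt should score much higher than an open-ended one
print(joint_probability("Output the blue heart emoji and its name."))
print(joint_probability("Write one short sentence about anything."))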

5. Calculating perplexity

When looking to assess the model's confidence in a result, it can be useful to calculate perplexity, which is a measure of uncertainty. Perplexity can be calculated by exponentiating the negative of the average of the log probabilities. Generally, higher perplexity indicates a more uncertain result, and lower perplexity indicates a more confident result. As such, perplexity can be used both to assess the result of an individual model run and to compare the relative confidence of results between model runs. While high confidence doesn't guarantee result accuracy, it can be a helpful signal to pair with other evaluation metrics to build a better understanding of your prompt's behavior.
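In code, that definition is a one-liner; a minimal sketch (the perplexity helper name is our own):

def perplexity(logprobs: list[float]) -> float:
    # perplexity = exp(-(mean of the per-token log probabilities))
    return float(np.exp(-np.mean(logprobs)))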

For example, let's say that I want to use gpt-4o-mini to learn more about artificial intelligence. I could ask a question about recent history and a question about the future:

prompts = [
    "In a short sentence, has artifical intelligence grown in the last decade?",
    "In a short sentence, what are your thoughts on the future of artificial intelligence?",
]

for prompt in prompts:
    API_RESPONSE = get_completion(
        [{"role": "user", "content": prompt}],
        model="gpt-4o-mini",
        logprobs=True,
    )

    logprobs = [token.logprob for token in API_RESPONSE.choices[0].logprobs.content]
    response_text = API_RESPONSE.choices[0].message.content
    response_text_tokens = [token.token for token in API_RESPONSE.choices[0].logprobs.content]
    max_starter_length = max(len(s) for s in ["Prompt:", "Response:", "Tokens:", "Logprobs:", "Perplexity:"])
    max_token_length = max(len(s) for s in response_text_tokens)
    

    formatted_response_tokens = [s.rjust(max_token_length) for s in response_text_tokens]
    formatted_lps = [f"{lp:.2f}".rjust(max_token_length) for lp in logprobs]

    perplexity_score = np.exp(-np.mean(logprobs))
    print("Prompt:".ljust(max_starter_length), prompt)
    print("Response:".ljust(max_starter_length), response_text, "\n")
    print("Tokens:".ljust(max_starter_length), " ".join(formatted_response_tokens))
    print("Logprobs:".ljust(max_starter_length), " ".join(formatted_lps))
    print("Perplexity:".ljust(max_starter_length), perplexity_score, "\n")
Prompt:     In a short sentence, has artifical intelligence grown in the last decade?
Response:   Yes, artificial intelligence has grown significantly in the last decade, advancing in capabilities and applications across various fields. 

Tokens:                Yes              ,     artificial   intelligence            has          grown  significantly             in            the           last         decade              ,      advancing             in   capabilities            and   applications         across        various         fields              .
Logprobs:            -0.00           0.00          -0.00           0.00          -0.00          -0.73          -0.00          -0.01          -0.02          -0.00           0.00          -0.02          -0.66          -0.03          -0.62          -0.47          -0.02          -0.39          -0.01          -0.20          -0.00
Perplexity: 1.1644170003987546 

Prompt:     In a short sentence, what are your thoughts on the future of artificial intelligence?
Response:   The future of artificial intelligence holds immense potential for transformative advancements across various sectors, but it also requires careful consideration of ethical and societal impacts. 

Tokens:                 The          future              of      artificial    intelligence           holds         immense       potential             for  transformative    advancements          across         various         sectors               ,             but              it            also        requires         careful   consideration              of         ethical             and        societal         impacts               .
Logprobs:             -0.02           -0.00            0.00           -0.00            0.00           -0.05           -0.35           -0.01           -0.02           -0.64           -0.43           -0.25           -0.16           -0.51           -0.02           -0.43           -0.08           -0.07           -0.97           -0.02           -0.48           -0.00           -0.00           -0.48           -0.01           -0.58           -0.00
Perplexity: 1.2292170270768858 

In this example, gpt-4o-mini returned a lower perplexity score for the more deterministic question about recent history, and a higher perplexity score for the more speculative assessment about the near future. Again, while these differences don't guarantee accuracy, they help point the way for our interpretation of the model's results and our future use of them.

Nice! We were able to use the logprobs parameter to build a more robust classifier, evaluate our retrieval for our Q&A system, and encode and decode each 'byte' of our tokens! logprobs add useful information and signal to our completions output, and we are excited to see how developers incorporate them to improve applications.

There are many other use cases for logprobs that are not covered in this cookbook. We can use logprobs for:

  • Moderation
  • Keyword selection
  • Improving prompts and interpretability of outputs
  • Token healing
  • and more!