Techniques to improve reliability | OpenAI Cookbook

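When GPT-3 fails on a task, what should you do?

Search for a better prompt that elicits more reliable answers?
Invest in thousands of examples to fine-tune a custom model?
Assume the model is incapable of the task, and move on?

There is no simple answer; it depends. However, if your task involves logical reasoning or complexity, consider trying the techniques in this article to build more reliable, high-performing prompts.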

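Why GPT-3 fails on complex tasks

If you were asked to multiply 13 by 17, would the answer pop immediately into your mind? For most of us, probably not. Yet that doesn't mean humans are incapable of two-digit multiplication. With a few seconds, and some pen and paper, it's not too taxing to work out that 13 x 17 = 130 + 70 + 21 = 221.

Similarly, if you give GPT-3 a task that's too complex to do in the time it takes to calculate its next token, it may confabulate an incorrect guess. Yet, akin to humans, that doesn't necessarily mean the model is incapable of the task. With some time and space to reason things out, the model may still be able to answer reliably.

As an example, if you ask gpt-3.5-turbo-instruct the following math problem about juggling balls, it answers incorrectly: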

Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there?
A:

There are 8 blue golf balls.

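Does this mean that GPT-3 cannot do simple math problems? No; in fact, it turns out that by prompting the model with Let's think step by step, the model solves the problem reliably: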

Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there?
A: Let's think step by step.

There are 16 balls in total.
Half of the balls are golf balls.
That means that there are 8 golf balls.
Half of the golf balls are blue.
That means that there are 4 blue golf balls.

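Of course, it's hard to tell from a single example whether this Let's think step by step trick works in general or just got lucky on this particular problem. But it really does work. On a benchmark of word math problems, the Let's think step by step trick raised GPT-3's solve rate massively, from a worthless 18% to a decent 79%!

Model capabilities depend on context

When learning to work with GPT-3, one common conceptual mistake is to believe that its capabilities are fixed across all contexts. For example, it is tempting to conclude that if GPT-3 gets a simple logic question wrong, then it must be incapable of simple logic.

But as the Let's think step by step example illustrates, apparent failures of GPT-3 can sometimes be remedied with a better prompt that helps the model steer itself toward the correct output.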

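How to improve reliability on complex tasks

The rest of this article shares techniques for improving the reliability of large language models on complex tasks. Although some of the techniques are specific to certain types of problems, many of them are built upon general principles that can be applied to a wide range of tasks, e.g.:

Give clearer instructions
Split complex tasks into simpler subtasks
Structure the instruction to keep the model on task
Prompt the model to explain before answering
Ask for justifications of many possible answers, and then synthesize
Generate many outputs, and then use the model to pick the best one
Fine-tune custom models to maximize performance

Split complex tasks into simpler tasks

One way to give a model more time and space to think is to break tasks into simpler pieces.

As an example, consider a task where we ask the model a multiple-choice question about some text, in this case a game of Clue. When asked directly, gpt-3.5-turbo-instruct isn't able to put clues 3 and 5 together, and answers incorrectly: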

Use the following clues to answer the following multiple-choice question.

Clues:
1. Miss Scarlett was the only person in the lounge.
2. The person with the pipe was in the kitchen.
3. Colonel Mustard was the only person in the observatory.
4. Professor Plum was not in the library nor the billiard room.
5. The person with the candlestick was in the observatory.

Question: Was Colonel Mustard in the observatory with the candlestick?
(a) Yes; Colonel Mustard was in the observatory with the candlestick
(b) No; Colonel Mustard was not in the observatory with the candlestick
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick

Solution:

(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick

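Although clues 3 and 5 establish that Colonel Mustard was the only person in the observatory and that the person in the observatory had the candlestick, the model fails to combine them into the correct answer of (a) Yes.

However, instead of asking for the answer directly, we can split the task into three pieces:

First, go through the clues one by one and consider whether the clue is potentially relevant
Second, combine the relevant clues to reason out the answer to the question
Third, write the final answer: either (a), (b), or (c)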

Use the following clues to answer the following multiple-choice question, using the following procedure:
(1) First, go through the clues one by one and consider whether the clue is potentially relevant
(2) Second, combine the relevant clues to reason out the answer to the question
(3) Third, map the answer to one of the multiple choice answers: either (a), (b), or (c)

Clues:
1. Miss Scarlett was the only person in the lounge.
2. The person with the pipe was in the kitchen.
3. Colonel Mustard was the only person in the observatory.
4. Professor Plum was not in the library nor the billiard room.
5. The person with the candlestick was in the observatory.

Question: Was Colonel Mustard in the observatory with the candlestick?
(a) Yes; Colonel Mustard was in the observatory with the candlestick
(b) No; Colonel Mustard was not in the observatory with the candlestick
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick

Solution:
(1) First, go through the clues one by one and consider whether the clue is potentially relevant:

Clue 1: Miss Scarlett was the only person in the lounge. This clue is not relevant.
Clue 2: The person with the pipe was in the kitchen. This clue is not relevant.
Clue 3: Colonel Mustard was the only person in the observatory. This clue is relevant.
Clue 4: Professor Plum was not in the library nor the billiard room. This clue is not relevant.
Clue 5: The person with the candlestick was in the observatory. This clue is relevant.

(2) Second, combine the relevant clues to reason out the answer to the question:
Clue 3 and Clue 5 both say that the person with the candlestick was in the observatory. Colonel Mustard was the only person in the observatory, so Colonel Mustard was the person with the candlestick.

(3) Third, write the final answer:
(a) Yes; Colonel Mustard was in the observatory with the candlestick

By giving the model more time and space to think, and guiding it along a reasoning plan, it's able to figure out the correct answer of (a) Yes.
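The three-part prompt above can be assembled programmatically. The following is a minimal sketch, not a prescribed implementation: the function name `build_stepwise_prompt` and its parameters are hypothetical, and the procedure text is copied from the example prompt. It only builds the prompt string; sending it to a model is left out.

```python
def build_stepwise_prompt(clues, question, choices):
    """Build a prompt that walks the model through a fixed three-step procedure.

    `clues` is a list of strings, `choices` a list of (label, text) pairs.
    All names here are illustrative assumptions, not part of any API.
    """
    procedure = [
        "(1) First, go through the clues one by one and consider whether the clue is potentially relevant",
        "(2) Second, combine the relevant clues to reason out the answer to the question",
        "(3) Third, map the answer to one of the multiple choice answers: either (a), (b), or (c)",
    ]
    lines = [
        "Use the following clues to answer the following multiple-choice question, "
        "using the following procedure:"
    ]
    lines += procedure
    lines += ["", "Clues:"]
    lines += [f"{i}. {clue}" for i, clue in enumerate(clues, start=1)]
    lines += ["", f"Question: {question}"]
    lines += [f"({label}) {text}" for label, text in choices]
    # End with the first step so the model continues the procedure itself.
    lines += ["", "Solution:", procedure[0] + ":"]
    return "\n".join(lines)
```

Ending the prompt with the first step of the procedure nudges the model to continue in the same structure rather than jumping straight to an answer.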

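Another benefit of splitting complex instructions into smaller subtasks is that it can help keep the model focused on each subtask.

For example, if we ask gpt-3.5-turbo-instruct to summarize a text in its original language, the model can lapse back into English: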

Summarize the text using the original language of the text. The summary should be one sentence long.

Text:
"""
La estadística (la forma femenina del término alemán Statistik, derivado a su vez del italiano statista, "hombre de Estado") es una ciencia que estudia la variabilidad, colección, organización, análisis, interpretación, y presentación de los datos, así como el proceso aleatorio que los genera siguiendo las leyes de la probabilidad. La estadística es una ciencia formal deductiva, con un conocimiento propio, dinámico y en continuo desarrollo obtenido a través del método científico formal. En ocasiones, las ciencias fácticas necesitan utilizar técnicas estadísticas durante su proceso de investigación factual, con el fin de obtener nuevos conocimientos basados en la experimentación y en la observación. En estos casos, la aplicación de la estadística permite el análisis de datos provenientes de una muestra representativa, que busca explicar las correlaciones y dependencias de un fenómeno físico o natural, de ocurrencia en forma aleatoria o condicional.
"""

Summary:

The text explains that statistics is a science that studies the variability, collection, organization, analysis, interpretation, and presentation of data, as well as the random process that generates them following the laws of probability.

However, if we first ask the model to identify the language of the text, and then summarize the text, it becomes more reliable:

First, identify the language of the text. Second, summarize the text using the original language of the text. The summary should be one sentence long.

Text:
"""
La estadística (la forma femenina del término alemán Statistik, derivado a su vez del italiano statista, "hombre de Estado") es una ciencia que estudia la variabilidad, colección, organización, análisis, interpretación, y presentación de los datos, así como el proceso aleatorio que los genera siguiendo las leyes de la probabilidad. La estadística es una ciencia formal deductiva, con un conocimiento propio, dinámico y en continuo desarrollo obtenido a través del método científico formal. En ocasiones, las ciencias fácticas necesitan utilizar técnicas estadísticas durante su proceso de investigación factual, con el fin de obtener nuevos conocimientos basados en la experimentación y en la observación. En estos casos, la aplicación de la estadística permite el análisis de datos provenientes de una muestra representativa, que busca explicar las correlaciones y dependencias de un fenómeno físico o natural, de ocurrencia en forma aleatoria o condicional.
"""

Language:

Spanish

La estadística es una ciencia que estudia la variabilidad, colección, organización, análisis, interpretación, y presentación de los datos, así como el proceso aleatorio que los genera siguiendo las leyes de la probabilidad.

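Prompt the model to explain before answering

Another powerful technique for improving the reliability of answers is to prompt the model to gradually reason out the answer rather than jumping immediately to the final answer. By "thinking aloud," the model can be far more likely to arrive at the correct answer.

Zero-shot

Method

The easiest way to prompt a model to reason out the answer, published by Takeshi Kojima et al. in 2022, is to prepend answers with Let's think step by step. Figure 2 illustrates an example:

zero-shot reasoning example
Source: Large Language Models are Zero-Shot Reasoners by Takeshi Kojima et al. (2022).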
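In code, the zero-shot trick is just string construction. The sketch below follows the Q:/A: layout used in the juggler example earlier in this article; the names `REASONING_TRIGGER` and `zero_shot_reasoning_prompt` are hypothetical.

```python
# The trigger phrase studied by Kojima et al.; the constant name is our own.
REASONING_TRIGGER = "Let's think step by step."


def zero_shot_reasoning_prompt(question: str) -> str:
    """Return a prompt whose answer slot already begins with the trigger phrase."""
    return f"Q: {question.strip()}\nA: {REASONING_TRIGGER}"
```

Because the answer already starts with the trigger phrase, the model's most natural continuation is a chain of reasoning rather than a bare guess.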

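Results

Applying this simple trick to the MultiArith math dataset, the authors found that Let's think step by step quadrupled the accuracy, from 18% to 79%!

zero-shot reasoning example
Source: Large Language Models are Zero-Shot Reasoners by Takeshi Kojima et al. (2022).

Implications

Although the Let's think step by step trick works well on math problems, it's not effective on all tasks. The authors found that it was most helpful for multi-step arithmetic problems, symbolic reasoning problems, strategy problems, and other reasoning problems. It didn't help with simple math problems or common sense questions, and presumably wouldn't help with many other non-reasoning tasks either.

zero-shot reasoning example
Source: Large Language Models are Zero-Shot Reasoners by Takeshi Kojima et al. (2022).

To learn more, read the full paper.

If you apply this technique to your own tasks, don't be afraid to experiment with customizing the instruction. Let's think step by step is rather generic, so you may find better performance with instructions that hew to a stricter format customized to your use case. For example, you can try more structured variants like First, think step by step about why X might be true. Second, think step by step about why Y might be true. Third, think step by step about whether X or Y makes more sense. You can even give the model an example format to help keep it on track, e.g.: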

Using the IRS guidance below, answer the following questions using this format:
(1) For each criterion, determine whether it is met by the vehicle purchase
- {Criterion} Let's think step by step. {explanation} {yes or no, or if the question does not apply then N/A}.
(2) After considering each criterion in turn, phrase the final answer as "Because of {reasons}, the answer is likely {yes or no}."

IRS guidance:
"""
You may be eligible for a federal tax credit under Section 30D if you purchased a car or truck that meets the following criteria:
- Does the vehicle have at least four wheels?
- Does the vehicle weigh less than 14,000 pounds?
- Does the vehicle draw energy from a battery with at least 4 kilowatt hours that may be recharged from an external source?
- Was the vehicle purchased in a year before 2022?
  - If so, has the manufacturer sold less than 200,000 qualifying vehicles? (Tesla and GM have sold more than 200,000 qualifying vehicles.)
- Was the vehicle purchased in a year after 2022?
  - If so, is the vehicle present in the following list of North American-assembled vehicles? (The only electric vehicles assembled in North America are the Audi Q5, BMW 330e, BMW X5, Chevrolet Bolt EUV, Chevrolet Bolt EV, Chrysler Pacifica PHEV, Ford Escape PHEV, Ford F Series, Ford Mustang MACH E, Ford Transit Van, GMC Hummer Pickup, GMC Hummer SUV, Jeep Grand Cherokee PHEV, Jeep Wrangler PHEV, Lincoln Aviator PHEV, Lincoln Corsair Plug-in, Lucid Air, Nissan Leaf, Rivian EDV, Rivian R1S, Rivian R1T, Tesla Model 3, Tesla Model S, Tesla Model X, Tesla Model Y, Volvo S60, BMW 330e, Bolt EV, Cadillac Lyriq, Mercedes EQS SUV, and Nissan Leaf.)
"""

Question: Can I claim a federal tax credit for my Toyota Prius Prime bought in 2021?

Solution:

(1) For each criterion, determine whether it is met by the vehicle purchase
- Does the vehicle have at least four wheels? Let's think step by step.

 The Toyota Prius Prime has four wheels, so the answer is yes.
- Does the vehicle weigh less than 14,000 pounds? Let's think step by step. The Toyota Prius Prime weighs less than 14,000 pounds, so the answer is yes.
- Does the vehicle draw energy from a battery with at least 4 kilowatt hours that may be recharged from an external source? Let's think step by step. The Toyota Prius Prime has a battery with at least 4 kilowatt hours that may be recharged from an external source, so the answer is yes.
- Was the vehicle purchased in a year before 2022? Let's think step by step. The Toyota Prius Prime was purchased in 2021, which is before 2022, so the answer is yes.
- Was the vehicle purchased in a year after 2022? N/A
- If so, is the vehicle present in the following list of North American-assembled vehicles? N/A

(2) After considering each criterion in turn, phrase the final answer as "Because of {reasons}, the answer is likely {yes or no}."

Because the Toyota Prius Prime meets all of the criteria for a federal tax credit, the answer is likely yes.

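Few-shot examples

Method

Prompting the model to reason out its answers can be done in many ways. One way is to demonstrate with a few examples ("few-shot"), as studied by Jason Wei and Denny Zhou et al. from Google. Here's an example few-shot chain-of-thought prompt:

chain of thought example
Source: Chain of Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei and Denny Zhou et al. (2022)

More demonstrations of reasoning chains written by human labelers:

chain of thought example
Source: Chain of Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei and Denny Zhou et al. (2022)

(Note: whether pears actually float has been called into question.)

Results

Testing on grade school math problems, the authors found that chain-of-thought prompting tripled the solve rate, from 18% to 57%.

chain of thought example
Source: Chain of Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei and Denny Zhou et al. (2022)

In addition to math problems, chain-of-thought prompting also lifted performance on questions related to sports understanding, coin flip tracking, and last letter concatenation. In most cases, not many examples were needed to saturate the performance gains (fewer than 8 or so).

chain of thought example
Source: Chain of Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei and Denny Zhou et al. (2022)

To learn more, read the full paper.

Implications

One advantage of the few-shot example-based approach relative to the Let's think step by step technique is that you can more easily specify the format, length, and style of reasoning that you want the model to perform before landing on its final answer. This can be particularly helpful in cases where the model isn't initially reasoning in the right way or at the right depth.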
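A few-shot chain-of-thought prompt can be built from a small list of worked demonstrations. This is a rough sketch under assumptions: the `Demo` class, the function name, and the "The answer is ..." closing phrase are illustrative choices, not a format mandated by the paper. The demonstration reuses the juggler problem from earlier in this article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    """One worked demonstration: a question, its reasoning chain, and the answer."""
    question: str
    reasoning: str
    answer: str


def few_shot_cot_prompt(demos: List[Demo], question: str) -> str:
    """Concatenate worked demonstrations, then leave the new answer open."""
    blocks = [
        f"Q: {d.question}\nA: {d.reasoning} The answer is {d.answer}."
        for d in demos
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


juggler = Demo(
    question=("A juggler has 16 balls. Half of the balls are golf balls and "
              "half of the golf balls are blue. How many blue golf balls are there?"),
    reasoning=("There are 16 balls in total. Half of the balls are golf balls, "
               "so there are 8 golf balls. Half of the golf balls are blue, "
               "so there are 4 blue golf balls."),
    answer="4",
)
```

The style, length, and format of the `reasoning` strings are what the model tends to imitate, which is where the extra control over reasoning comes from.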

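Fine-tuned

Method

In general, to eke out maximum performance on a task, you'll need to fine-tune a custom model. However, fine-tuning a model using explanations may take thousands of example explanations, which are costly to write.

In 2022, Eric Zelikman and Yuhuai Wu et al. published a clever procedure for using a few-shot prompt to generate a dataset of explanations that could be used to fine-tune a model. The idea is to use a few-shot prompt to generate candidate explanations, and keep only the explanations that produce the correct answer. Then, to get additional explanations for some of the questions that were answered incorrectly, retry the few-shot prompt with the correct answer given as part of the question. The authors called their procedure STaR (Self-taught Reasoner):

STaR procedure
Source: STaR: Bootstrapping Reasoning With Reasoning by Eric Zelikman and Yuhuai Wu et al. (2022)

With this technique, you can combine the benefits of fine-tuning with the benefits of chain-of-thought prompting without needing to write thousands of example explanations.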
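The data-collection step described above can be sketched as a simple filter. This is a simplified illustration of one pass only, not the authors' code: `generate` is a hypothetical callable standing in for a few-shot prompted model that returns a (rationale, answer) pair, and the fine-tuning step itself is omitted.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical model interface: generate(question, hint) -> (rationale, answer).
# `hint` is None on the first attempt and the correct answer on the retry.
Generator = Callable[[str, Optional[str]], Tuple[str, str]]


def collect_star_examples(
    problems: Iterable[Tuple[str, str]],
    generate: Generator,
) -> List[Tuple[str, str, str]]:
    """Collect (question, rationale, answer) triples whose answer is correct."""
    kept = []
    for question, correct in problems:
        rationale, answer = generate(question, None)
        if answer != correct:
            # Retry with the correct answer supplied as part of the question.
            rationale, answer = generate(question, correct)
        if answer == correct:
            kept.append((question, rationale, correct))
    return kept
```

The returned triples would then serve as fine-tuning data; examples that stay wrong even with the hint are simply dropped.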

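Results

When the authors applied this technique to a Common Sense Q&A dataset, they found that STaR outperformed both chain-of-thought prompting alone (73% > 37%) and fine-tuning alone (73% > 60%):

STaR results
Source: STaR: Bootstrapping Reasoning With Reasoning by Eric Zelikman and Yuhuai Wu et al. (2022)

To learn more, read the full paper.

Implications

Using a few-shot prompt to extend or modify a fine-tuning dataset is an idea that can be generalized beyond explanation writing. For example, if you have large quantities of unstructured text that you want to train on, you may find opportunities to use a prompt to extract a structured dataset from your unstructured text, and then fine-tune a custom model on that structured dataset.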

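Extensions to chain-of-thought prompting

A number of extensions of chain-of-thought prompting have been published as well.

Selection-inference prompting

Method

One extension of the chain-of-thought technique, published by Antonia Creswell et al., is to split the single prompt for generating explanations and answers into smaller parts. First, a prompt selects a relevant subset of facts from the text ("selection prompt"). Then, a second prompt infers a conclusion from the selected facts ("inference prompt"). These prompts are then alternated in a loop to generate multiple steps of reasoning and eventually land on a final answer. The authors illustrate the idea in the following figure:

Selection-inference prompting
Source: Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning by Antonia Creswell et al. (2022)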
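The alternating loop can be sketched as follows. This is an illustrative outline only: `select` and `infer` are hypothetical callables standing in for the selection prompt and the inference prompt, and the fixed step count is an assumption made for simplicity.

```python
from typing import Callable, List, Tuple


def selection_inference(
    facts: List[str],
    question: str,
    select: Callable[[List[str], str], List[str]],
    infer: Callable[[List[str]], str],
    n_steps: int = 3,
) -> List[Tuple[List[str], str]]:
    """Alternate selection and inference, feeding each conclusion back as a fact."""
    known = list(facts)  # copy so the caller's list is not mutated
    trace = []
    for _ in range(n_steps):
        selected = select(known, question)   # selection prompt: pick relevant facts
        conclusion = infer(selected)         # inference prompt: derive one new fact
        trace.append((selected, conclusion))
        known.append(conclusion)             # later steps may build on it
    return trace
```

Note that `infer` sees only the selected facts, not the question or the full context, which keeps each step small and easy to inspect.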

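Results

When applied to a 7B-parameter model, the authors found that selection-inference prompting substantially improved performance relative to chain-of-thought prompting on the bAbi and Proof Writer benchmark tasks (both of which require longer sequences of reasoning steps). The best performance they achieved combined selection-inference prompting with fine-tuning.

Selection-inference prompting
Source: Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning by Antonia Creswell et al. (2022)

Implications

Although the gains on these benchmarks were large, these benchmarks were specifically chosen because they required longer sequences of reasoning. On problems that don't require reasoning with many steps, the gains are likely smaller.

The results highlight a couple of general lessons for working with large language models. One, splitting up complex tasks into smaller tasks is a great way to improve reliability and performance; the more atomic the task, the less room there is for the model to err. Two, getting maximum performance often means combining fine-tuning with whatever approach you've chosen.

To learn more, read the full paper.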

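Faithful reasoning architecture

A few months after publishing the selection-inference prompting technique, the authors extended the technique in a follow-up paper, with ideas for:

figuring out when the selection-inference cycle should stop or continue
adding a value function to help search over multiple reasoning paths
reducing hallucination of fake facts by fine-tuning a model to reason about sentence labels (e.g., sen1) rather than writing out the sentences themselves

Method

In the original selection-inference technique, specialized "selection" and "inference" prompts are alternated to select facts and make inferences from those facts, combining to generate a sequence of reasoning steps.

The authors extend this technique with two additional components.

First, the authors add a "halter" model that, after each inference step, is asked whether the inferences thus far are sufficient to answer the question. If yes, then the model generates a final answer.

The halter model brings a couple of advantages:

it can tell the selection-inference process to stop or keep going, as necessary
if the process never halts, you'll get no answer, which is often preferable to a hallucinated guess

Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022)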
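The control flow that the halter adds can be sketched like this. All three callables (`step`, `can_answer`, `answer`) are hypothetical stand-ins for model calls, and the `max_steps` cap is an assumption added so the loop always terminates.

```python
from typing import Callable, List, Optional


def reason_with_halter(
    facts: List[str],
    question: str,
    step: Callable[[List[str], str], str],
    can_answer: Callable[[List[str], str], bool],
    answer: Callable[[List[str], str], str],
    max_steps: int = 5,
) -> Optional[str]:
    """Run reasoning steps until the halter says the question can be answered."""
    known = list(facts)
    for _ in range(max_steps):
        known.append(step(known, question))   # one selection-inference step
        if can_answer(known, question):       # halter check after each inference
            return answer(known, question)
    # Never halted: return no answer instead of forcing a guess.
    return None
```

Returning `None` when the halter never fires mirrors the second advantage listed above: the caller gets an explicit "no answer" rather than a hallucinated one.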

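Second, the authors add a value function, which is used to assess the quality of reasoning steps and search over multiple reasoning trajectories. This echoes a common theme for increasing reliability: instead of generating a single answer from the model, generate a set of answers and then use some type of value function, discriminator, or verifier model to pick the best one.

Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022)

In addition to these two extensions, the authors also use a trick to reduce hallucination of fake facts. Rather than asking the model to write out factual sentences, they fine-tune a model to work with sentence labels (e.g., sen1) instead. This helps prevent the model from hallucinating fake facts not mentioned in the prompt context.

Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022)

Results

The authors evaluated their technique on two benchmarks: the ProofWriter task (not shown) and EntailmentBankQA (shown). The technique increased accuracy substantially, especially on harder reasoning problems.

Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022), https://arxiv.org/abs/2208.14271

In addition, their sentence label manipulation trick essentially eliminated hallucination!

Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022), https://arxiv.org/abs/2208.14271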

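Implications

This paper illustrates a number of helpful lessons for improving the reliability of large language models:

Split complex tasks into smaller, more reliable subtasks
Generate your answer in a step-by-step fashion, evaluating it along the way
Generate many possible answers and use another model or function to pick the ones that look best
Reduce hallucination by constraining what the model can say (e.g., by using sentence labels instead of sentences)
Maximize performance of models by fine-tuning them on specialized tasks

To learn more, read the full paper.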

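Least-to-most prompting

In addition to doing poorly on long reasoning chains (where selection-inference shines), chain-of-thought prompting can especially struggle when the examples are short but the task is long.

Method

Least-to-most prompting is another technique that splits up reasoning tasks into smaller, more reliable subtasks. The idea is to elicit a subtask from the model by prompting it with something like To solve {question}, we need to first solve: ". Then, with that subtask in hand, the model can generate a solution. The solution is appended to the original question and the process is repeated until a final answer is produced.

Source: Least-to-most Prompting Enables Complex Reasoning in Large Language Models by Denny Zhou et al. (2022)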
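The elicit-solve-append loop described above can be sketched as follows. This is a simplified outline under assumptions: `next_subtask` and `solve` are hypothetical callables standing in for model calls, `next_subtask` is assumed to return `None` once no further subtask is needed, and the Q:/A: layout of the growing context is an illustrative choice.

```python
from typing import Callable, Optional, Tuple


def least_to_most(
    question: str,
    next_subtask: Callable[[str], Optional[str]],
    solve: Callable[[str, str], str],
    max_rounds: int = 5,
) -> Tuple[str, str]:
    """Solve subtasks one at a time, appending each solution to the context."""
    context = question
    for _ in range(max_rounds):
        subtask = next_subtask(context)  # e.g. prompted with "To solve ..., we need to first solve:"
        if subtask is None:
            break
        solution = solve(context, subtask)
        context += f"\nQ: {subtask}\nA: {solution}"  # later rounds see earlier solutions
    # With the subtask solutions in context, answer the original question.
    return solve(context, question), context
```

Each round sees every earlier subtask and its solution, so the final call answers the original question with the intermediate work already in view.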

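Results

When applied to benchmarks involving long reasoning chains using code-davinci-002 (which is optimized for code but can still understand text), the authors measured gains as large as 16% -> 99.7%!

Least-to-most prompting results on last-letter-concatenation task
Source: Least-to-most Prompting Enables Complex Reasoning in Large Language Models by Denny Zhou et al. (2022)

Implications

Although the above gains from least-to-most prompting are impressive, they are measured on a very narrow set of tasks that require long reasoning chains.

Still, they illustrate a common theme: increase reliability by (a) breaking complex tasks into smaller subtasks and (b) giving the model more time and space to work out the answer.

To learn more, read the full paper.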

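Maieutic prompting

Method

In contrast to the previous techniques, which try to maximize the likelihood of correct answers, another approach is to use GPT-3 to generate a tree of possible explanations (both correct and incorrect), and then analyze their relationships to guess at which set is correct. This technique was coined maieutic prompting by Jaehun Jung et al. in May 2022 (maieutic means relating to the Socratic method of asking questions to elicit ideas).

The method is complicated, and works as follows:

First, build a maieutic tree, where each node is a statement that could be true or false:
- Start with a multiple-choice question or true/false statement (e.g. War cannot have a tie)
- For each possible answer to the question, use the model to generate a corresponding explanation (with a prompt like War cannot have a tie? True, because)
- Then, prompt the model with the question and the generated explanation, and ask it to produce the answer. If reversing the explanation (with a prefix like It is wrong to say that {explanation}) reverses the answer, then the explanation is considered "logically integral."
- If an explanation is not logically integral, then repeat the above process recursively, with each explanation turned into a True or False question, and generate more explanations for each new question.
- After all of the recursive explaining is done, you end up with a tree of explanations, where each leaf on the tree has the property that reversing the explanation reverses the model's answer.
Second, convert the tree into a graph of relations:
- For each node in the tree, calculate the model's relative belief in each node (inferred from the probability of getting an answer of True given an explanation)
- For each pair of nodes in the tree, use the model to identify whether they are entailed (implied) or contradicted
Third, find the most consistent set of beliefs and take those to be true:
- Specifically, using the strength of belief in each node and the logical relationships between them, formulate the problem as a weighted maximum satisfiability problem (MAX-SAT)
- Use a solver to find the most self-consistent set of beliefs, and take those as true

Source: Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations by Jaehun Jung et al. (2022)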
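To make the third step concrete, here is a toy brute-force version of the weighted consistency search. It is a stand-in for a real MAX-SAT solver and only practical for a handful of nodes. The scoring scheme (belief weights plus a weight for every satisfied relation) and all names are assumptions made for illustration, not the paper's exact formulation.

```python
from itertools import product
from typing import Dict, List, Tuple

# (a, b, kind, weight) where kind is "entails" (a implies b) or "contradicts".
Relation = Tuple[str, str, str, float]


def most_consistent_beliefs(
    belief: Dict[str, float],
    relations: List[Relation],
) -> Dict[str, bool]:
    """Try every true/false assignment and keep the highest-scoring one.

    belief[name] is the model's confidence (0..1) that the statement is true.
    """
    names = sorted(belief)
    best_score, best = float("-inf"), {}
    for values in product([True, False], repeat=len(names)):
        assign = dict(zip(names, values))
        # Reward agreeing with the model's own belief in each node.
        score = sum(belief[n] if assign[n] else 1 - belief[n] for n in names)
        # Reward every logical relation that the assignment satisfies.
        for a, b, kind, weight in relations:
            if kind == "entails":
                satisfied = (not assign[a]) or assign[b]
            else:  # "contradicts": both cannot be true together
                satisfied = not (assign[a] and assign[b])
            if satisfied:
                score += weight
        if score > best_score:
            best_score, best = score, assign
    return best
```

In a small case where the model only weakly believes the question Q (0.4) but strongly believes an explanation E1 that entails it (0.9), the search can settle on Q being true, which is how relations between nodes can override a shaky direct answer.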

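Results

Maieutic prompting results
Source: Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations by Jaehun Jung et al. (2022)

Implications

Beyond the complexity, one limitation of this method is that it appears to apply only to questions that can be posed as multiple-choice.

To learn more, read the full paper.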

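Extensions

Self-consistency

Method

For tasks with a discrete set of answers, one simple way to improve reliability is to sample multiple explanations and answers from the model (using a positive temperature) and then pick the final answer that appears most often.

Self-consistency method
Source: Self-Consistency Improves Chain of Thought Reasoning in Language Models by Xuezhi Wang et al. (2022)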
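The voting step is short enough to sketch directly. Here `sample` is a hypothetical zero-argument callable that runs one sampled generation (at a positive temperature) and returns only the extracted final answer; how the answer is parsed out of the explanation is left out.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistent_answer(sample: Callable[[], str], n: int = 10) -> Tuple[str, float]:
    """Sample n final answers and return the most common one with its vote share."""
    answers = [sample() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n
```

The vote share is a rough confidence signal: a winner with a small share suggests the samples disagree and the answer deserves less trust.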

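Results

This technique lifted accuracies by anywhere from 1 to 24 percentage points on a suite of math and reasoning benchmarks. (Plotted below are results from Google's LaMDA model; using Google's larger PaLM model, the baselines were higher but the gains were a bit smaller.)

Self-consistency results
Source: Self-Consistency Improves Chain of Thought Reasoning in Language Models by Xuezhi Wang et al. (2022)

Implications

Although this technique is simple to implement, it can be costly. Generating a set of 10 answers will increase your costs by 10x.

Also, as with many of these techniques, it applies only to tasks with a limited set of answers. For open-ended tasks where each answer is unique (such as writing a poem), it's not obvious what it would mean to pick the most common answer.

Lastly, this technique ought to be most beneficial when there are multiple paths or phrasings to reach an answer; if there's only one path, then the technique may not help at all. An extreme example: if the task is to generate a single-token answer, then taking the most common token from 100 generations is no different from taking the token with the highest logprobs (which you can get with a single generation at temperature=0).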

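Verifiers

Another key technique for improving task performance is to train a verifier or discriminator model to evaluate the outputs of the main generative model. If the discriminator rejects the output, then you can resample the generative model until you get an acceptable output. In many cases, it's easier to judge an answer than it is to create an answer, which helps explain the power of this method.

Method

In 2021, OpenAI researchers applied this technique to grade school math problems, using the following procedure:

First, they fine-tuned a model on questions and solutions
For each problem in the training set, they generated 100 solutions
Each of those 100 solutions was automatically labeled as either correct or incorrect, based on whether the final answer was correct
Using those solutions, with some labeled correct and some labeled incorrect, they fine-tuned a verifier model to classify whether a question and candidate solution was correct or incorrect
Finally, at test time, the generative model creates 100 solutions to each problem, and the one with the highest score according to the verifier model is picked as the final answer

Verifier method
Source: Training Verifiers to Solve Math Word Problems by Karl Cobbe et al. (2021)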
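The test-time step of this procedure (generate many candidates, keep the one the verifier scores highest) can be sketched as below. Both `generate` and `score` are hypothetical callables standing in for the fine-tuned generator and verifier; training those models is outside this sketch.

```python
from typing import Callable


def best_of_n(
    question: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 100,
) -> str:
    """Sample n candidate solutions and return the one the verifier rates highest."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda solution: score(question, solution))
```

Unlike majority voting, this selection still works when every candidate is phrased differently, because the verifier scores each solution on its own.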

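Results

With a 175B GPT-3 model and 8,000 training examples, this technique substantially lifted grade school math accuracy from ~33% to ~55%.

Verifier results
Source: Training Verifiers to Solve Math Word Problems by Karl Cobbe et al. (2021)

Implications

Similar to the self-consistency technique, this method can get expensive, as generating, say, 100 solutions per task will increase your costs by roughly 100x.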

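Theories of reliability

Although the techniques above vary in their approach, they all share the goal of improving reliability on complex tasks. Mainly they do this by:

decomposing unreliable operations into smaller, more reliable operations (e.g., selection-inference prompting)
using multiple steps or multiple relationships to make the system's reliability greater than any individual component (e.g., maieutic prompting)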
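As a rough numerical illustration of the second point, consider majority voting over several attempts at a yes/no question. The sketch assumes each attempt is correct with the same probability p and that errors are independent; real model samples are usually correlated, so treat the numbers as an optimistic upper bound. The function name is hypothetical.

```python
from math import comb


def majority_vote_reliability(p: float, n: int) -> float:
    """Probability that more than half of n independent attempts are correct.

    Assumes a binary task, identical per-attempt accuracy p, and independent errors.
    """
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))
```

Under these assumptions, three attempts at p = 0.7 give a majority that is right about 78% of the time, higher than any single attempt.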

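Lesson	Paper	Date
Break complex tasks into simpler subtasks (and consider exposing the intermediate outputs to users)	AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts	Oct 2021
You can improve output by generating many candidates, and then picking the one that looks best	Training Verifiers to Solve Math Word Problems	Oct 2021
On reasoning tasks, models do better when they reason step-by-step before answering	Chain of Thought Prompting Elicits Reasoning in Large Language Models	Jan 2022
You can improve step-by-step reasoning by generating many explanation-answer outputs, and picking the most popular answer	Self-Consistency Improves Chain of Thought Reasoning in Language Models	Mar 2022
If you want to fine-tune a step-by-step reasoner, you can do it with multiple-choice question and answer data alone	STaR: Bootstrapping Reasoning With Reasoning	Mar 2022
The step-by-step reasoning method works well even with zero examples	Large Language Models are Zero-Shot Reasoners	May 2022
You can do better than step-by-step reasoning by alternating a "selection" prompt and an "inference" prompt	Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning	May 2022
On long reasoning problems, you can improve step-by-step reasoning by splitting the problem into pieces to solve incrementally	Least-to-most Prompting Enables Complex Reasoning in Large Language Models	May 2022
You can have the model analyze both good and bogus explanations to figure out which set of explanations is most consistent	Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations	May 2022
You can think about these techniques in terms of probabilistic programming, where systems comprise unreliable components	Language Model Cascades	Jul 2022
You can eliminate hallucination with sentence label manipulation, and you can reduce wrong answers with a "halter" prompt	Faithful Reasoning Using Large Language Models	Aug 2022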