How to implement LLM guardrails

Dec 19, 2023

In this notebook we share examples of how to implement guardrails for your LLM applications. A guardrail is a generic term for detective controls that aim to steer your application. Given the inherent randomness of LLMs, greater steerability is a common requirement, and creating effective guardrails has become one of the most common areas of performance optimization when pushing an LLM from prototype to production.

Guardrails are incredibly diverse, and can be deployed to virtually any context you can imagine an LLM going wrong in. This notebook aims to give simple examples that can be extended to meet your unique use case, as well as outlining the trade-offs to consider when deciding whether to implement a guardrail, and how to do it.

This notebook will focus on:

  1. Input guardrails that flag inappropriate content before it gets to your LLM
  2. Output guardrails that validate what your LLM has produced before it gets to the customer

Note: This notebook treats guardrails as a generic term for detective controls around an LLM - for official libraries that provide frameworks of pre-built guardrails, check out offerings like NeMo Guardrails and Guardrails AI.

import openai

GPT_MODEL = 'gpt-4o-mini'

1. Input guardrails

Input guardrails aim to prevent inappropriate content from getting to the LLM in the first place - some common use cases are:

  • Topical guardrails: Identify when a user asks an off-topic question and give them advice on what topics the LLM can help them with.
  • Jailbreaking: Detect when a user is trying to hijack the LLM and override its prompting.
  • Prompt injection: Pick up instances of prompt injection, where users try to hide malicious code that will be executed in any downstream functions the LLM executes.

In all of these cases the guardrail acts as a preventative control, running either before or in parallel with the LLM, and triggering your application to behave differently if one of these criteria is met.

Designing a guardrail

When designing guardrails it is important to consider the trade-off between accuracy, latency, and cost: you want to achieve maximum accuracy with the least possible impact on your bottom line and the user's experience.

We'll begin with a simple topical guardrail which aims to detect off-topic questions and prevent the LLM from answering if triggered. This guardrail consists of a simple prompt and uses gpt-4o-mini, optimizing for latency/cost while keeping accuracy good enough; if we wanted to optimize further we could consider:

  • Accuracy: You could consider fine-tuning gpt-4o-mini or using few-shot examples to increase accuracy (a few-shot variant is sketched after this list). RAG can also be effective if you have a corpus of information that can help determine whether a piece of content is allowed or not.
  • Latency/Cost: You could try fine-tuning smaller models, such as babbage-002 or open-source offerings like Llama, which can perform quite well when given enough training examples. When using open-source offerings you can also tune the machines you use for inference to minimize cost or latency.
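
As a rough illustration of the few-shot idea, prepending a handful of labeled examples to the guardrail prompt used later in this notebook can often lift accuracy without any fine-tuning. This is a hypothetical sketch; the example questions and labels are made up.

few_shot_guardrail_messages = [
    {
        "role": "system",
        "content": "Your role is to assess whether the user question is allowed or not. The allowed topics are cats and dogs. If the topic is allowed, say 'allowed' otherwise say 'not_allowed'",
    },
    # Hypothetical labeled examples showing the model what each verdict looks like
    {"role": "user", "content": "How often should I feed a kitten?"},
    {"role": "assistant", "content": "allowed"},
    {"role": "user", "content": "What stocks should I buy this year?"},
    {"role": "assistant", "content": "not_allowed"},
    # The real user request would be appended here before calling the model
]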

This simple topical guardrail aims to ensure the LLM only answers a predefined set of topics, and responds to out-of-bounds queries with a canned message.

Embrace async

A common design to minimize latency is to send your guardrails asynchronously along with your main LLM call. If a guardrail gets triggered you send back its response; otherwise you send back the LLM response.

We'll use this approach, creating an execute_chat_with_guardrail function that runs our LLM's get_chat_response and the topical_guardrail guardrail in parallel, and returns the LLM response only if the guardrail returns allowed.

Limitations

You should always consider the limitations of guardrails when developing your design. A few of the key ones to be aware of are:

  • When using LLMs as a guardrail, be aware that they have the same vulnerabilities as your base LLM call itself. For example, a prompt injection attempt could be successful in evading both your guardrail and your actual LLM call.
  • As conversations get longer, LLMs are more susceptible to jailbreaking, as your instructions become diluted by the extra text.
  • Guardrails can harm the user experience if you make them overly restrictive to compensate for the issues above. This manifests as over-refusals, where your guardrails reject innocuous user requests because they resemble prompt injection or jailbreaking attempts.

Mitigations

If you can combine your guardrails with rules-based or more traditional machine learning models for detection, this can mitigate some of these risks. We've also seen customers have guardrails that only ever consider the latest message, to alleviate the risk of the model being confused by a long conversation.
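
As a minimal sketch of the rules-based idea, a cheap deterministic check can run before any LLM-based guardrail and short-circuit the more expensive calls entirely. The pattern list and function name here are hypothetical.

import re

# Hypothetical blocklist of phrasings commonly seen in jailbreak attempts
JAILBREAK_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"pretend (you are|to be)",
    r"you are now in developer mode",
]


def rules_based_guardrail(user_request):
    # Deterministic pre-filter: True if the request matches a known-bad pattern.
    # Checking only the latest message also sidesteps long-conversation dilution.
    return any(re.search(p, user_request, re.IGNORECASE) for p in JAILBREAK_PATTERNS)

A check like this would typically run before the async tasks below are created, returning a canned refusal immediately on a match.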

We'd also recommend doing a gradual rollout with active monitoring of conversations, so you can pick up instances of prompt injection or jailbreaking and either add more guardrails to cover these new types of behaviour, or include them as training examples for your existing guardrails.

system_prompt = "You are a helpful assistant."

bad_request = "I want to talk about horses"
good_request = "What are the best breeds of dog for people that like cats?"
import asyncio


async def get_chat_response(user_request):
    print("Getting LLM response")
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    # Note: this uses the synchronous module-level client; for true parallelism
    # with the guardrail task you could use openai.AsyncOpenAI() and await the call
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0.5
    )
    print("Got LLM response")

    return response.choices[0].message.content


async def topical_guardrail(user_request):
    print("Checking topical guardrail")
    messages = [
        {
            "role": "system",
            "content": "Your role is to assess whether the user question is allowed or not. The allowed topics are cats and dogs. If the topic is allowed, say 'allowed' otherwise say 'not_allowed'",
        },
        {"role": "user", "content": user_request},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0
    )

    print("Got guardrail response")
    return response.choices[0].message.content


async def execute_chat_with_guardrail(user_request):
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        # Wait until at least one of the two tasks has finished
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                # Guardrail tripped: cancel the in-flight chat task and return a canned message
                chat_task.cancel()
                print("Topical guardrail triggered")
                return "I can only talk about cats and dogs, the best animals that ever lived."
            elif chat_task in done:
                # Guardrail passed and the chat response is ready to return
                chat_response = chat_task.result()
                return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again
# Call the main function with the good request - this should go through
response = await execute_chat_with_guardrail(good_request)
print(response)
Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
If you like cats and are considering getting a dog, there are several breeds known for their compatibility with feline friends. Here are some of the best dog breeds that tend to get along well with cats:

1. **Golden Retriever**: Friendly and tolerant, Golden Retrievers often get along well with other animals, including cats.

2. **Labrador Retriever**: Similar to Golden Retrievers, Labs are social and friendly, making them good companions for cats.

3. **Cavalier King Charles Spaniel**: This breed is gentle and affectionate, often forming strong bonds with other pets.

4. **Basset Hound**: Basset Hounds are laid-back and generally have a calm demeanor, which can help them coexist peacefully with cats.

5. **Beagle**: Beagles are friendly and sociable, and they often enjoy the company of other animals, including cats.

6. **Pug**: Pugs are known for their playful and friendly nature, which can make them good companions for cats.

7. **Shih Tzu**: Shih Tzus are typically friendly and adaptable, often getting along well with other pets.

8. **Collie**: Collies are known for their gentle and protective nature, which can extend to their relationships with cats.

9. **Newfoundland**: These gentle giants are known for their calm demeanor and often get along well with other animals.

10. **Cocker Spaniel**: Cocker Spaniels are friendly and affectionate dogs that can get along well with cats if introduced properly.

When introducing a dog to a cat, it's important to do so gradually and supervise their interactions to ensure a positive relationship. Each dog's personality can vary, so individual temperament is key in determining compatibility.
# Call the main function with the bad request - this should get blocked
response = await execute_chat_with_guardrail(bad_request)
print(response)
Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Topical guardrail triggered
I can only talk about cats and dogs, the best animals that ever lived.

It looks like our guardrail worked - the first question was allowed through, but the second was blocked for being off-topic. Now we'll extend this concept to moderate the response we get from the LLM as well.

2. Output guardrails

Output guardrails govern what the LLM comes back with. These can take many forms, with some of the most common being:

  • Hallucination/fact-checking guardrails: Using a corpus of ground-truth information, or a training set of hallucinated responses, to block hallucinated responses.
  • Moderation guardrails: Applying brand and corporate guidelines to moderate the LLM's results, and either blocking or rewriting its response if it breaches them.
  • Syntax checks: Structured output from LLMs can come back corrupt or unparsable - these guardrails detect that and either retry or fail gracefully, preventing failures in downstream applications.
    • This is a common control to apply with function calling, ensuring that the expected schema is returned in the arguments when the LLM returns a function_call (a hypothetical validation sketch follows this list).
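
As a minimal sketch of such a syntax check - the expected argument keys below are made up for illustration:

import json

# Hypothetical set of keys we expect in the function_call arguments
EXPECTED_ARGUMENT_KEYS = {"location", "unit"}


def validate_function_arguments(raw_arguments):
    # Return the parsed arguments if they are valid JSON containing the expected
    # keys, otherwise None - the caller can then retry or fail gracefully
    try:
        arguments = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return None
    if not EXPECTED_ARGUMENT_KEYS.issubset(arguments):
        return None
    return arguments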

Moderation guardrail

Here we implement a moderation guardrail that uses a version of the G-Eval evaluation method to score the presence of unwanted content in the LLM's response. This approach is demonstrated in more detail in one of our other notebooks.

To accomplish this we'll make an extensible framework for moderating content, which takes in a domain and applies criteria to a piece of content using a set of steps:

  1. We set a domain name, which describes the type of content we're going to moderate.
  2. We provide criteria, which outline clearly what the content should and should not contain.
  3. We provide step-by-step instructions for the LLM to grade the content.
  4. The LLM returns a discrete score from 1-5.

Setting guardrail thresholds

Our output guardrail will assess the LLM's response and block anything scoring 3 or higher. Setting this threshold is a common area for optimization - we recommend building an evaluation set and grading the results using a confusion matrix to set the right tolerance for your guardrail. The trade-off here is generally:

  • More false positives lead to a fractured user experience, where customers get annoyed and the assistant seems less helpful.
  • More false negatives can cause lasting harm to your business, as people get the assistant to answer inappropriate questions, or prompt inject/jailbreak it.

For example, for jailbreaking you may want to have a very low threshold, as the risk to your business is very high if your LLM is hijacked and used to produce dangerous content that ends up on social media. However, for our use case we're willing to accept a few false negatives, as the worst that could happen is someone ends up with a Bichon Frise who might have been better suited to a Labrador, which, though sad, will probably not cause lasting damage to our business (we hope).
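
A minimal sketch of how a candidate threshold could be graded against a labeled evaluation set - the scores and labels below are made up for illustration:

# Hypothetical evaluation set: (guardrail score, human label) pairs, where the
# label is True when the content genuinely violates our policy
eval_set = [(5, True), (4, True), (4, False), (3, False), (2, False), (1, False)]


def confusion_matrix(threshold):
    # Count how the block/allow decision at this threshold agrees with the labels
    tp = sum(1 for score, bad in eval_set if score >= threshold and bad)
    fp = sum(1 for score, bad in eval_set if score >= threshold and not bad)
    fn = sum(1 for score, bad in eval_set if score < threshold and bad)
    tn = sum(1 for score, bad in eval_set if score < threshold and not bad)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


for threshold in range(1, 6):
    print(threshold, confusion_matrix(threshold))

Sweeping the threshold like this makes the false-positive/false-negative trade-off explicit, so you can pick the tolerance that matches your risk appetite.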

domain = "animal breed recommendation"

animal_advice_criteria = """
Assess the presence of explicit recommendation of cat or dog breeds in the content.
The content should contain only general advice about cats and dogs, not specific breeds to purchase."""

animal_advice_steps = """
1. Read the content and the criteria carefully.
2. Assess how much explicit recommendation of cat or dog breeds is contained in the content.
3. Assign an animal advice score from 1 to 5, with 1 being no explicit cat or dog breed advice, and 5 being multiple named cat or dog breeds.
"""

moderation_system_prompt = """
You are a moderation assistant. Your role is to detect content about {domain} in the text provided, and mark the severity of that content.

## {domain}

### Criteria

{scoring_criteria}

### Instructions

{scoring_steps}

### Content

{content}

### Evaluation (score only!)
"""
async def moderation_guardrail(chat_response):
    print("Checking moderation guardrail")
    mod_messages = [
        {"role": "user", "content": moderation_system_prompt.format(
            domain=domain,
            scoring_criteria=animal_advice_criteria,
            scoring_steps=animal_advice_steps,
            content=chat_response
        )},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=mod_messages, temperature=0
    )
    print("Got moderation response")
    return response.choices[0].message.content
    
    
async def execute_all_guardrails(user_request):
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                chat_task.cancel()
                print("Topical guardrail triggered")
                return "I can only talk about cats and dogs, the best animals that ever lived."
            elif chat_task in done:
                chat_response = chat_task.result()
                moderation_response = await moderation_guardrail(chat_response)

                # The moderation prompt asks for the score only; a stricter
                # implementation would validate this parse before trusting it
                if int(moderation_response) >= 3:
                    print(f"Moderation guardrail flagged with a score of {int(moderation_response)}")
                    return "Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have."

                else:
                    print('Passed moderation')
                    return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again
# Adding a request that should pass both our topical guardrail and our moderation guardrail
great_request = 'What is some advice you can give to a new dog owner?'
tests = [good_request,bad_request,great_request]

for test in tests:
    result = await execute_all_guardrails(test)
    print(result)
    print('\n\n')
    
Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Checking moderation guardrail
Got moderation response
Moderation guardrail flagged with a score of 5
Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have.



Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Topical guardrail triggered
I can only talk about cats and dogs, the best animals that ever lived.



Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Checking moderation guardrail
Got moderation response
Moderation guardrail flagged with a score of 3
Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have.



Conclusion

Guardrails are a vibrant and evolving topic in LLMs, and we hope this notebook has given you an effective introduction to the core concepts around them. To recap:

  • Guardrails are detective controls that aim to prevent harmful content getting to your applications and your users, and add steerability to your LLM in production.
  • They can take the form of input guardrails, which target content before it gets to the LLM, and output guardrails, which control the LLM's response.
  • Designing guardrails and setting their thresholds is a trade-off between accuracy, latency, and cost. Your decision should be based on clear evaluations of the performance of your guardrails, and an understanding of what the cost of a false negative and a false positive would be for your business.
  • By embracing async design principles, you can scale guardrails horizontally to minimize the impact on the user as your guardrails increase in number and scope.

We look forward to seeing how you take this forward, and how thinking on guardrails evolves as the ecosystem matures.