Enhance your prompts with meta prompting

Oct 23, 2024

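Welcome to our cookbook on meta prompting! In this guide, we'll take a basic prompt and refine it to improve the quality of a language model's output, using news-article summarization as the running example.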

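Meta prompting is a technique in which an LLM is used to generate or improve prompts. Typically, a more capable model optimizes the prompt for a less capable one. It is the practice of using prompts to guide, structure, and optimize other prompts, so that they steer the LLM more effectively toward high-quality, relevant output. We'll use o1-preview, a more capable model with advanced reasoning skills, to improve a prompt for gpt-4o (the summarization code below calls gpt-4o-mini).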

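Our aim with this technique is to make building with LLMs smoother and more accessible. Don't forget to check out the "Generate anything" feature in the Playground; it's a great starting point for exploring meta prompting.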

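In this example, we'll start with a simple prompt for summarizing news articles and then enhance it to see how the output improves. We'll use o1-preview to analyze and refine the prompt, adding detail and clarity along the way. Finally, we'll evaluate the outputs systematically to understand the impact of our changes.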

import pandas as pd
import openai 
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
from pydantic import BaseModel
from datasets import load_dataset

client = openai.Client()

Importing the data

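Let's start by importing the bbc_news_alltime dataset from Hugging Face. It contains BBC News articles published each month from 2017 up to the most recent complete month. For our experiment, we'll use only a sample from a recent month (August 2024) to keep the content current and manageable.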

ds = load_dataset("RealTimeData/bbc_news_alltime", "2024-08")
df = pd.DataFrame(ds['train']).sample(n=100, random_state=1)
df.head()
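| | title | published_date | authors | description | section | content | link | top_image |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2662 | Laura Whitmore: I was gaslit after raising... | 2024-08-04 | https://127.0.0.1/bbcnews | The former Love Island host says things were... | Culture | Television presenter Laura Whitmore says... | http://www.bbc.co.uk/news/articles/c9wvwvzm7x7o | https://ichef.bbci.co.uk/ace/standard/2560/cps... |
| 1865 | Errollyn Wallen named Master of the King's Music... | 2024-08-25 | https://127.0.0.1/bbcnews | She is best known for her work on the 2012 Par... | Culture | Celebrated composer and singer-songwriter Erro... | http://www.bbc.co.uk/news/articles/c4gl758g7zgo | https://ichef.bbci.co.uk/ace/standard/2560/cps... |
| 2554 | SDLP: Matthew O'Toole backs Claire Hanna for... | 2024-08-30 | https://127.0.0.1/bbcnews | Matthew O'Toole had been named by some as a potential... | Northern Ireland Politics | Matthew O'Toole, who leads his party's official opposition... | http://www.bbc.co.uk/news/articles/cvg41j7xrzdo | https://ichef.bbci.co.uk/ace/standard/3840/cps... |
| 1338 | Rotherham rioters sentenced - BBC News | 2024-08-20 | https://127.0.0.1/bbcnews | Two rioters involved in an attack on a Hol... | South Yorkshire | A Rotherham pair have been sentenced over the UK riots... | http://www.bbc.co.uk/news/articles/cwywggd7qw6o | https://ichef.bbci.co.uk/ace/standard/2560/cps... |
| 1232 | BBC News - BBC iPlayer | 2024-08-02 | | | | JavaScript seems to be disabled. Please enable... | http://www.bbc.co.uk/news/10318089 | |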
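
The last row of the preview is not a real article: its content is a notice about JavaScript being disabled. The notebook keeps such rows, but if you would rather drop them before summarizing, a minimal sketch could look like the following. The helper name `looks_like_article`, the 50-word threshold, and the marker phrase are assumptions made for illustration; they are not part of the original notebook or the dataset.

```python
# Hypothetical helper: flag rows whose "content" is too short or looks like a
# placeholder page rather than a news article. The marker text and the
# threshold are illustrative guesses, not values taken from the dataset.
PLACEHOLDER_MARKERS = ("javascript seems to be disabled",)


def looks_like_article(text, min_words=50):
    """Return True if `text` plausibly contains a real article body."""
    if not isinstance(text, str):
        return False  # covers None / NaN cells
    lowered = text.lower()
    if any(marker in lowered for marker in PLACEHOLDER_MARKERS):
        return False
    return len(text.split()) >= min_words
```

With the dataframe above, this could be applied as `df = df[df["content"].apply(looks_like_article)]`. The rest of this guide keeps all 100 sampled rows.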

Let's start with a simple prompt and then use o1-preview to enhance it for better results. We want to summarize news articles, so that's what we'll ask the model to do.

simple_prompt = "Summarize this news article: {article}"

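To improve the prompt, we need to give o1-preview the context and the goal we want to achieve. We can then ask it to generate a more detailed prompt that produces richer, more comprehensive news summaries.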

meta_prompt = """
Improve the following prompt to generate a more detailed summary. 
Adhere to prompt engineering best practices. 
Make sure the structure is clear and intuitive and contains the type of news, tags and sentiment analysis.

{simple_prompt}

Only return the prompt.
"""
def get_model_response(messages, model="o1-preview"):
    response = client.chat.completions.create(
        messages=messages,
        model=model,
    )
    return response.choices[0].message.content


complex_prompt = get_model_response([{"role": "user", "content": meta_prompt.format(simple_prompt=simple_prompt)}])
complex_prompt
'Please read the following news article and provide a comprehensive summary that includes:\n\n1. **Type of News**: Specify the category of the news article (e.g., Politics, Technology, Health, Sports, etc.).\n2. **Summary**: Write a concise and clear summary of the main points, ensuring the structure is logical and intuitive.\n3. **Tags**: List relevant keywords or tags associated with the article.\n4. **Sentiment Analysis**: Analyze the overall sentiment of the article (positive, negative, or neutral) and briefly explain your reasoning.\n\n**Article:**\n\n{article}'

Generating the summaries

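Now that we have both the simple prompt and the enhanced prompt, let's generate summaries. For each entry in the dataset, we'll run both prompts and compare the results, which shows how the refinements made with o1-preview lead to richer, more detailed summaries.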

def generate_response(prompt): 
    messages = [{"role": "user", "content": prompt}]
    response = get_model_response(messages, model="gpt-4o-mini")
    return response

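def generate_summaries(row):
    article = row["content"]
    simple_summary = generate_response(simple_prompt.format(article=article))
    # complex_prompt is model-generated: fill its {article} slot when present,
    # otherwise append the article. str.replace avoids str.format errors on stray braces.
    if "{article}" in complex_prompt:
        complex_summary = generate_response(complex_prompt.replace("{article}", article))
    else:
        complex_summary = generate_response(complex_prompt + "\n\n" + article)
    return simple_summary, complex_summary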

Let's check that everything works by generating summaries for the first news article.

generate_summaries(df.iloc[0])
('Television presenter Laura Whitmore has shared that the issues she attempted to address during her time on *Strictly Come Dancing* eight years ago are now surfacing, stating that she experienced "gaslighting" that made her concerns seem normalized. In a recent interview, she expressed the difficulties she faced, including being portrayed negatively and feeling "broken" during the competition. Whitmore indicated that she raised concerns about inappropriate behavior and is currently providing evidence for a BBC investigation, although she has not made an official complaint herself. The BBC is facing allegations of mistreatment towards contestants, prompting them to announce new welfare measures, including the presence of a chaperone during rehearsals. Other celebrities participating in the show have also made allegations against professional dancers, leading to growing scrutiny around conditions on the show. The BBC emphasized that it takes complaints very seriously and is committed to updating its support processes.',
 '1. **Type of News**: Entertainment\n\n2. **Summary**: Laura Whitmore, a television presenter, has spoken out about her experiences on Strictly Come Dancing, revealing that issues she attempted to address during her tenure on the show are now coming to light. In an interview with The Irish Times, she described feeling "gaslit" and suggested that her concerns, which she raised eight years ago, were not taken seriously at the time. Whitmore recalled that her participation left her feeling "broken" and criticized how she was portrayed during the show. She mentioned contributing evidence to an ongoing review involving incidents of alleged inappropriate behavior during her time on the show, although she did not make an official complaint. The BBC, which has been navigating its own controversy related to the treatment of contestants, stated it is taking these claims seriously and plans to enhance welfare measures on the show, including the introduction of a chaperone at rehearsals. Recent allegations from other contestants have further intensified the scrutiny of Strictly Come Dancing.\n\n3. **Tags**: Laura Whitmore, Strictly Come Dancing, BBC, allegations, inappropriate behavior, gaslighting, welfare measures, entertainment controversy\n\n4. **Sentiment Analysis**: The overall sentiment of the article is negative. It highlights serious allegations of mistreatment and inappropriate behavior associated with a popular television show, along with personal accounts from Whitmore that reflect emotional distress and professional struggles. The tone conveys a sense of urgency and seriousness regarding the issues raised, indicating a critical atmosphere within the entertainment industry related to contestant treatment.')

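Comparing the summaries generated by the simple and enhanced prompts, we can already see a clear difference. The first summary gives a general overview of the article, while the enhanced one goes further: besides a detailed summary, it categorizes the type of news, lists relevant tags, and includes a sentiment analysis.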

Now let's test on the entire dataset!

# Add new columns to the dataframe for storing summaries
df['simple_summary'] = None
df['complex_summary'] = None

# Use ThreadPoolExecutor to generate summaries concurrently
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(generate_summaries, row): index for index, row in df.iterrows()}
    for future in tqdm(as_completed(futures), total=len(futures), desc="Generating Summaries"):
        index = futures[future]
        simple_summary, complex_summary = future.result()
        df.at[index, 'simple_summary'] = simple_summary
        df.at[index, 'complex_summary'] = complex_summary

df.head()
Generating Summaries: 100%|██████████| 100/100 [00:50<00:00,  1.98it/s]
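| | title | published_date | authors | description | section | content | link | top_image | simple_summary | complex_summary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2662 | Laura Whitmore: I was gaslit after raising... | 2024-08-04 | https://127.0.0.1/bbcnews | The former Love Island host says things were... | Culture | Television presenter Laura Whitmore says... | http://www.bbc.co.uk/news/articles/c9wvwvzm7x7o | https://ichef.bbci.co.uk/ace/standard/2560/cps... | Television presenter Laura Whitmore has spoken about... | 1. **Type of News**: Entertainment/Television... |
| 1865 | Errollyn Wallen named Master of the King's Music... | 2024-08-25 | https://127.0.0.1/bbcnews | She is best known for her work on the 2012 Par... | Culture | Celebrated composer and singer-songwriter Erro... | http://www.bbc.co.uk/news/articles/c4gl758g7zgo | https://ichef.bbci.co.uk/ace/standard/2560/cps... | Errollyn Wallen has been appointed Master of the King's Music... | 1. **Type of News**: Arts/Music\n\n2. **Summary**... |
| 2554 | SDLP: Matthew O'Toole backs Claire Hanna for... | 2024-08-30 | https://127.0.0.1/bbcnews | Matthew O'Toole had been named by some as a potential... | Northern Ireland Politics | Matthew O'Toole, who leads his party's official opposition... | http://www.bbc.co.uk/news/articles/cvg41j7xrzdo | https://ichef.bbci.co.uk/ace/standard/3840/cps... | Official opposition leader Matthew O'Toole... | 1. **Type of News**: Politics\n\n2. **Summary**... |
| 1338 | Rotherham rioters sentenced - BBC News | 2024-08-20 | https://127.0.0.1/bbcnews | Two rioters involved in an attack on a Hol... | South Yorkshire | A Rotherham pair have been sentenced over the UK riots... | http://www.bbc.co.uk/news/articles/cwywggd7qw6o | https://ichef.bbci.co.uk/ace/standard/2560/cps... | Two men, Nathan Palmer, 29, and Niven Matthewm... | 1. **Type of News**: Politics/Crime and Justice... |
| 1232 | BBC News - BBC iPlayer | 2024-08-02 | | | | JavaScript seems to be disabled. Please enable... | http://www.bbc.co.uk/news/10318089 | | The article discusses the need to enable JavaScript... | I'm unable to provide a summary of the article because... |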
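
In the loop above, `future.result()` re-raises any exception thrown inside a worker, so a single failed API call stops the whole run. One possible mitigation, sketched here under assumptions, is to wrap each task in a small retry helper with exponential backoff. `call_with_retries` and its parameters are hypothetical names introduced for this example, and catching every `Exception` is a simplification; in practice you would probably retry only on transient errors such as rate limits.

```python
import random
import time


def call_with_retries(fn, *args, attempts=3, base_delay=1.0, **kwargs):
    """Call fn(*args, **kwargs), retrying on any exception with backoff.

    Waits base_delay * 2**n seconds (plus a little jitter) between tries and
    re-raises the last exception once `attempts` tries have failed.
    """
    for attempt in range(attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

It would slot into the executor call as `executor.submit(call_with_retries, generate_summaries, row)`, leaving the rest of the loop unchanged.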

Evaluating the results

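To assess the difference in performance between the two prompts, we'll use a structured evaluation approach with an LLM acting as a judge. In other words, we'll use a language model itself to evaluate and compare the outputs against specific criteria.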

What does "LLM as a judge" mean?

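Using an LLM as a judge means having a language model evaluate its own output or the output of another model. It applies predefined criteria to assess aspects such as accuracy, clarity, and relevance. This gives us consistent, repeatable evaluations that depend less on any individual reviewer's bias, which makes it easier to identify improvements between prompts. Our cookbook on getting started with OpenAI Evals gives a brief introduction to this approach.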

Here is the prompt we'll use for evaluation:

evaluation_prompt = """
You are an expert editor tasked with evaluating the quality of a news article summary. Below is the original article and the summary to be evaluated:

**Original Article**:  
{original_article}

**Summary**:  
{summary}

Please evaluate the summary based on the following criteria, using a scale of 1 to 5 (1 being the lowest and 5 being the highest). Be critical in your evaluation and only give high scores for exceptional summaries:

1. **Categorization and Context**: Does the summary clearly identify the type or category of news (e.g., Politics, Technology, Sports) and provide appropriate context?  
2. **Keyword and Tag Extraction**: Does the summary include relevant keywords or tags that accurately capture the main topics and themes of the article?  
3. **Sentiment Analysis**: Does the summary accurately identify the overall sentiment of the article and provide a clear, well-supported explanation for this sentiment?  
4. **Clarity and Structure**: Is the summary clear, well-organized, and structured in a way that makes it easy to understand the main points?  
5. **Detail and Completeness**: Does the summary provide a detailed account that includes all necessary components (type of news, tags, sentiment) comprehensively?  


Provide your scores and justifications for each criterion, ensuring a rigorous and detailed evaluation.
"""

class ScoreCard(BaseModel):
    justification: str
    categorization: int
    keyword_extraction: int
    sentiment_analysis: int
    clarity_structure: int
    detail_completeness: int
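
The evaluation prompt asks for scores from 1 to 5, but `ScoreCard` types each score as a plain `int`, so nothing in the schema itself enforces that range. As a light sanity check, the sketch below flags out-of-range values after parsing. It works on a plain dict (for example the result of `model_dump()` on a `ScoreCard`); `SCORE_FIELDS` and `out_of_range_scores` are hypothetical names introduced here, and the example values are made up.

```python
# Score fields of the ScoreCard model above (justification is free text).
SCORE_FIELDS = (
    "categorization",
    "keyword_extraction",
    "sentiment_analysis",
    "clarity_structure",
    "detail_completeness",
)


def out_of_range_scores(scores, low=1, high=5):
    """Return {field: value} for every score outside [low, high]."""
    return {
        name: scores[name]
        for name in SCORE_FIELDS
        if not (low <= scores[name] <= high)
    }


# Made-up scorecard for illustration only.
example = {
    "justification": "...",
    "categorization": 4,
    "keyword_extraction": 0,
    "sentiment_analysis": 5,
    "clarity_structure": 6,
    "detail_completeness": 3,
}
print(out_of_range_scores(example))  # {'keyword_extraction': 0, 'clarity_structure': 6}
```

An empty dict means every score fell within the expected scale.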

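Here's a pro tip: you can also use meta prompting to refine your evaluation prompt! By applying the same iterative enhancement to the prompt that instructs the LLM to act as a judge, you can make your evaluations more precise and insightful.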
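
As a rough sketch of that tip, the same pattern used earlier for the summarization prompt can be pointed at the judge prompt. The meta-prompt wording and the names `eval_meta_prompt` and `build_eval_meta_messages` are assumptions made for illustration, not the notebook's own code. The doubled braces keep the `{original_article}` and `{summary}` placeholders literal when `str.format` fills in the evaluation prompt.

```python
# Hypothetical meta-prompt for the judge prompt; the wording is illustrative.
eval_meta_prompt = """
Improve the following evaluation prompt for an LLM acting as a judge.
Make each criterion unambiguous and describe what a score of 1, 3 and 5 looks like.
Keep the {{original_article}} and {{summary}} placeholders exactly as they are.

{evaluation_prompt}

Only return the prompt.
"""


def build_eval_meta_messages(evaluation_prompt):
    """Build the chat messages asking a stronger model to refine the judge prompt."""
    content = eval_meta_prompt.format(evaluation_prompt=evaluation_prompt)
    return [{"role": "user", "content": content}]
```

The messages could then be passed to the helper defined earlier, for example `get_model_response(build_eval_meta_messages(evaluation_prompt))`. The rest of this guide keeps the original `evaluation_prompt` unchanged.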

Let's use this prompt to evaluate our summaries!

def evaluate_summaries(row):
    simple_messages = [{"role": "user", "content": evaluation_prompt.format(original_article=row["content"], summary=row['simple_summary'])}]
    complex_messages = [{"role": "user", "content": evaluation_prompt.format(original_article=row["content"], summary=row['complex_summary'])}]
    
    simple_summary = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=simple_messages,
        response_format=ScoreCard)
    simple_summary = simple_summary.choices[0].message.parsed
    
    complex_summary = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=complex_messages,
        response_format=ScoreCard)
    complex_summary = complex_summary.choices[0].message.parsed
    
    return simple_summary, complex_summary

# Add new columns to the dataframe for storing evaluations
df['simple_evaluation'] = None
df['complex_evaluation'] = None

# Use ThreadPoolExecutor to evaluate summaries concurrently
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(evaluate_summaries, row): index for index, row in df.iterrows()}
    for future in tqdm(as_completed(futures), total=len(futures), desc="Evaluating Summaries"):
        index = futures[future]
        simple_evaluation, complex_evaluation = future.result()
        df.at[index, 'simple_evaluation'] = simple_evaluation
        df.at[index, 'complex_evaluation'] = complex_evaluation

df.head()
Evaluating Summaries: 100%|██████████| 100/100 [01:42<00:00,  1.02s/it]
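| | title | published_date | authors | description | section | content | link | top_image | simple_summary | complex_summary | simple_evaluation | complex_evaluation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2662 | Laura Whitmore: I was gaslit after raising... | 2024-08-04 | https://127.0.0.1/bbcnews | The former Love Island host says things were... | Culture | Television presenter Laura Whitmore says... | http://www.bbc.co.uk/news/articles/c9wvwvzm7x7o | https://ichef.bbci.co.uk/ace/standard/2560/cps... | Television presenter Laura Whitmore has spoken about... | 1. **Type of News**: Entertainment/Television... | categorization=4 keyword_extraction=3 sentiment... | categorization=5 keyword_extraction=5 sentiment... |
| 1865 | Errollyn Wallen named Master of the King's Music... | 2024-08-25 | https://127.0.0.1/bbcnews | She is best known for her work on the 2012 Par... | Culture | Celebrated composer and singer-songwriter Erro... | http://www.bbc.co.uk/news/articles/c4gl758g7zgo | https://ichef.bbci.co.uk/ace/standard/2560/cps... | Errollyn Wallen has been appointed Master of the King's Music... | 1. **Type of News**: Arts/Music\n\n2. **Summary**... | categorization=4 keyword_extraction=4 sentiment... | categorization=5 keyword_extraction=5 sentiment... |
| 2554 | SDLP: Matthew O'Toole backs Claire Hanna for... | 2024-08-30 | https://127.0.0.1/bbcnews | Matthew O'Toole had been named by some as a potential... | Northern Ireland Politics | Matthew O'Toole, who leads his party's official opposition... | http://www.bbc.co.uk/news/articles/cvg41j7xrzdo | https://ichef.bbci.co.uk/ace/standard/3840/cps... | Official opposition leader Matthew O'Toole... | 1. **Type of News**: Politics\n\n2. **Summary**... | categorization=5 keyword_extraction=4 sentiment... | categorization=5 keyword_extraction=5 sentiment... |
| 1338 | Rotherham rioters sentenced - BBC News | 2024-08-20 | https://127.0.0.1/bbcnews | Two rioters involved in an attack on a Hol... | South Yorkshire | A Rotherham pair have been sentenced over the UK riots... | http://www.bbc.co.uk/news/articles/cwywggd7qw6o | https://ichef.bbci.co.uk/ace/standard/2560/cps... | Two men, Nathan Palmer, 29, and Niven Matthewm... | 1. **Type of News**: Politics/Crime and Justice... | categorization=3 keyword_extraction=3 sentiment... | categorization=5 keyword_extraction=4 sentiment... |
| 1232 | BBC News - BBC iPlayer | 2024-08-02 | | | | JavaScript seems to be disabled. Please enable... | http://www.bbc.co.uk/news/10318089 | | The article discusses the need to enable JavaScript... | I'm unable to provide a summary of the article because... | categorization=2 keyword_extraction=3 sentiment... | categorization=1 keyword_extraction=1 sentiment... |
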
import matplotlib.pyplot as plt

df["simple_scores"] = df["simple_evaluation"].apply(lambda x: [score for key, score in x.model_dump().items() if key != 'justification'])
df["complex_scores"] = df["complex_evaluation"].apply(lambda x: [score for key, score in x.model_dump().items() if key != 'justification'])


# Calculate average scores for each criterion
criteria = [
    'Categorisation',
    'Keywords and Tags',
    'Sentiment Analysis',
    'Clarity and Structure',
    'Detail and Completeness'
]

# Calculate average scores for each criterion by model
simple_avg_scores = df['simple_scores'].apply(pd.Series).mean()
complex_avg_scores = df['complex_scores'].apply(pd.Series).mean()


# Prepare data for plotting
avg_scores_df = pd.DataFrame({
    'Criteria': criteria,
    'Original Prompt': simple_avg_scores,
    'Improved Prompt': complex_avg_scores
})

# Plotting
ax = avg_scores_df.plot(x='Criteria', kind='bar', figsize=(6, 4))
plt.ylabel('Average Score')
plt.title('Comparison of Simple vs Complex Prompt Performance by Criterion')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
plt.show()
[Figure: bar chart of the average score per criterion for the original prompt and the improved prompt]

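Looking at the results, the basic prompt performs well on clarity and structure, but the enhanced prompt significantly improves the output on several other key criteria: categorization, keywords and tags, sentiment analysis, and detail and completeness. The complex prompt produces summaries that are more informative, better organized, and richer in content.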
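
To read these differences as numbers rather than from the chart, the per-criterion means can be put side by side. The sketch below works on plain lists of score lists, in the same order as the `ScoreCard` fields; the function name `mean_score_gain` and the sample scores are hypothetical and are not results from this guide.

```python
from statistics import mean

CRITERIA = [
    "Categorization",
    "Keywords and Tags",
    "Sentiment Analysis",
    "Clarity and Structure",
    "Detail and Completeness",
]


def mean_score_gain(simple_scores, complex_scores):
    """Per-criterion (simple mean, complex mean, gain) from paired score lists."""
    simple_cols = list(zip(*simple_scores))    # one tuple of scores per criterion
    complex_cols = list(zip(*complex_scores))
    return {
        name: (mean(s), mean(c), mean(c) - mean(s))
        for name, s, c in zip(CRITERIA, simple_cols, complex_cols)
    }


# Made-up scores for two articles, for illustration only.
simple = [[4, 3, 2, 5, 3], [2, 3, 2, 4, 2]]
complex_ = [[5, 5, 5, 5, 5], [5, 4, 4, 4, 4]]
for name, (s, c, gain) in mean_score_gain(simple, complex_).items():
    print(f"{name:<24} {s:.2f} -> {c:.2f} ({gain:+.2f})")
```

With the dataframe built above, the inputs would be `df["simple_scores"].tolist()` and `df["complex_scores"].tolist()`.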

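This shows how much refining a prompt can improve the quality of the generated summaries. Although this is a simplified example, the benefits of prompt optimization are expected to be even more pronounced in real, production-grade applications, bringing outputs closer to specific goals and user needs.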

Conclusion

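Meta prompting is a powerful technique that can significantly improve the quality of language model output. Our exploration showed that starting from a simple prompt and refining it with o1-preview yields summaries that are more informative, better organized, and richer in content, with gains on key criteria such as categorization, keywords and tags, sentiment analysis, and completeness. The exercise underlines the value of prompt optimization: the benefits are clear even in this simplified example. In real applications, meta prompting and tools such as o1-preview can raise the performance of language models and better serve your specific goals and user needs.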