Fine-tuned Q&A - Create Q&A

Mar 10, 2022


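Note: To answer questions based on text documents, we recommend the procedure in Question Answering using Embeddings. Some of the code below may rely on deprecated API endpoints.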

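2. Creating a synthetic Q&A dataset

We use davinci-instruct-beta-v3, a model specialized in following instructions, to create questions based on the given context. We then use davinci-instruct-beta-v3 again to answer those questions, given the same context.

This is expensive and takes a long time, because we call the davinci engine for each section. You can simply download the final dataset instead.

We are using the dataset created in the previous notebook.

2.1 Read in the data, and create a context

Create a context by concatenating the title, the heading and the content of that section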

import pandas as pd
df = pd.read_csv('olympics-data/olympics_sections.csv')
df['context'] = df.title + "\n" + df.heading + "\n\n" + df.content
df.head()

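	title	heading	content	tokens	context
0	2020 Summer Olympics	Summary	The 2020 Summer Olympics (Japanese: 2020年夏季オリン...	713	2020 Summer Olympics\nSummary\n\nThe 2020 Summer O...
1	2020 Summer Olympics	Host city selection	The International Olympic Committee (IOC) vote...	126	2020 Summer Olympics\nHost city selection\n\nT...
2	2020 Summer Olympics	Impact of the COVID-19 pandemic	In January 2020, concerns were raised about th...	369	2020 Summer Olympics\nImpact of the COVID-19 p...
3	2020 Summer Olympics	Qualifying event cancellation and postponement	Concerns about the pandemic began to affect qu...	298	2020 Summer Olympics\nQualifying event cancell...
4	2020 Summer Olympics	Effect on doping tests	Mandatory doping tests were being severely res...	163	2020 Summer Olympics\nEffect on doping tests\n...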

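2.2 Create questions based on the context

Use davinci-instruct to generate a number of plausible questions relating to the Wikipedia section contents.

Note: We have used temperature=0, but it may be beneficial to experiment with a higher temperature to get more diverse questions.

WARNING: This step will last a long time and consume a lot of tokens, as it calls davinci-instruct for every section to generate a number of questions.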

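import os

from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

def get_questions(context):
    try:
        # davinci-instruct-beta-v3 takes a plain `prompt` and returns `choices[0].text`,
        # so it is called through the completions endpoint, not chat completions.
        response = client.completions.create(
            model="davinci-instruct-beta-v3",
            prompt=f"Write questions based on the text below\n\nText: {context}\n\nQuestions:\n1.",
            temperature=0,
            max_tokens=257,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=["\n\n"],
        )
        return response.choices[0].text
    except Exception:
        # Return an empty string so that one failed call does not abort the whole run
        return ""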


df['questions']= df.context.apply(get_questions)
df['questions'] = "1." + df.questions
print(df[['questions']].values[0][0])

1. What is the 2020 Summer Olympics?
2. When did the 2020 Summer Olympics take place?
3. Who won the most medals at the 2020 Summer Olympics?
4. Who won the most gold medals at the 2020 Summer Olympics?
5. Who won the most medals at the 2020 Summer Olympics?

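The prompt is designed to generate a number of questions. The example questions above were generated from the Summary section of the 2020 Summer Olympics page.

We can observe that questions 3 and 5 above are duplicates. Sometimes the generated questions can also be ambiguous in the absence of context. We will show that, even despite these limitations, we can create a successful model.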
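The notebook keeps the duplicated questions as they are. If you would rather drop them before generating answers, a minimal sketch follows. The helper name `dedupe_questions` is hypothetical (it is not part of the notebook), and it assumes every line starts with an "N." prefix, as in the output above. It would need to run before the answers are generated so that question and answer numbering stay aligned.

```python
def dedupe_questions(numbered_text):
    """Drop repeated questions from a numbered list and renumber the rest."""
    seen = set()
    unique = []
    for line in numbered_text.split("\n"):
        # Strip the leading "N." prefix, e.g. "3. Who won ..." -> "Who won ..."
        _, _, question = line.partition(".")
        question = question.strip()
        key = question.lower()
        # Lines without a "." (for example blank lines) yield an empty question
        if question and key not in seen:
            seen.add(key)
            unique.append(question)
    return "\n".join(f"{i}. {q}" for i, q in enumerate(unique, start=1))


sample = (
    "1. What is the 2020 Summer Olympics?\n"
    "2. When did the 2020 Summer Olympics take place?\n"
    "3. Who won the most medals at the 2020 Summer Olympics?\n"
    "4. Who won the most gold medals at the 2020 Summer Olympics?\n"
    "5. Who won the most medals at the 2020 Summer Olympics?"
)
print(dedupe_questions(sample))
```

Only exact repeats (ignoring case) are removed; near-duplicates such as questions 3 and 4 are kept, because they ask different things.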

print(df.content.values[0])

The 2020 Summer Olympics (Japanese: 2020年夏季オリンピック, Hepburn: Nisen Nijū-nen Kaki Orinpikku), officially the Games of the XXXII Olympiad (第三十二回オリンピック競技大会, Dai Sanjūni-kai Orinpikku Kyōgi Taikai) and branded as Tokyo 2020 (東京2020, Tōkyō Nii Zero Nii Zero), was an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan, with some preliminary events that began on 21 July.
Tokyo was selected as the host city during the 125th IOC Session in Buenos Aires, Argentina, on 7 September 2013. Originally scheduled to take place from 24 July to 9 August 2020, the event was postponed to 2021 in March 2020 as a result of the COVID-19 pandemic, the first such instance in the history of the Olympic Games (previous games had been cancelled but not rescheduled). However, the event retained the Tokyo 2020 name for marketing and branding purposes. It was largely held behind closed doors with no public spectators permitted due to the declaration of a state of emergency in the Greater Tokyo Area in response to the pandemic. The Summer Paralympics were held between 24 August and 5 September 2021, 16 days after the completion of the Olympics.The 2020 Games were the fourth Olympic Games to be held in Japan, following the Tokyo 1964 (Summer), Sapporo 1972 (Winter) and Nagano 1998 (Winter) games. Tokyo is the first city in Asia to hold the Summer Games twice. The 2020 Games were the second of three consecutive Olympics to be held in East Asia, following the 2018 Winter Olympics in Pyeongchang, South Korea and preceding the 2022 Winter Olympics in Beijing, China.
New events were introduced in existing sports for 2020, including 3x3 basketball, freestyle BMX and mixed gender team events in a number of existing sports, as well as the return of madison cycling for men and an introduction of the same event for women. New IOC policies also allowed the host organizing committee to add new sports to the Olympic program for just one Games. The disciplines added by the Japanese Olympic Committee were baseball and softball, karate, sport climbing, surfing and skateboarding, the last four of which made their Olympic debuts, and the last three of which will remain on the Olympic program.The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88). Host nation Japan finished third, setting a record for the most gold medals and total medals ever won by their delegation at an Olympic Games with 27 and 58. Great Britain finished fourth, with a total of 22 gold and 65 medals, becoming the first nation at the Summer Olympics to increase or equal their total medals won in the two Games subsequent to hosting them. The Russian delegation competing as the ROC (not to be confused with the Republic of China (Taiwan) which competed as Chinese Taipei, not ROC) finished fifth with 20 gold medals and third in the overall medal count, with 71 medals. Bermuda, the Philippines and Qatar won their first-ever Olympic gold medals. Burkina Faso, San Marino and Turkmenistan won their first-ever Olympic medals.

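2.3 Create answers based on the context

Use davinci-instruct to answer the questions, given the relevant Wikipedia section contents.

Note: We have used temperature=0, but it may be beneficial to experiment with a higher temperature to get more diverse answers.

WARNING: This step will last a long time and consume a lot of tokens, as it calls davinci-instruct for every section to answer all the questions.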

def get_answers(row):
    try:
        # Completions endpoint with `model=` (the v1 client has no `engine=` argument)
        response = client.completions.create(
            model="davinci-instruct-beta-v3",
            prompt=f"Write answer based on the text below\n\nText: {row.context}\n\nQuestions:\n{row.questions}\n\nAnswers:\n1.",
            temperature=0,
            max_tokens=257,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
        return response.choices[0].text
    except Exception as e:
        # Log the error and return an empty answer so the remaining rows still run
        print(e)
        return ""


df['answers']= df.apply(get_answers, axis=1)
df['answers'] = "1." + df.answers
df = df.dropna().reset_index().drop('index',axis=1)
print(df[['answers']].values[0][0])

1. The 2020 Summer Olympics is an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan.
2. The 2020 Summer Olympics took place from 23 July to 8 August 2021.
3. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).
4. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).
5. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).

These are the answers to the questions above, based on the context of the Summary section (the first row of the dataset).

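We can see that answers 3-5 contain the correct answer, but instead of answering the question directly, each one is a verbatim extraction from the text. Despite these occasional lower-quality answers, we will show that the model can learn the task reasonably well, given a large number of examples.

2.4 Save the Olympics Q&A dataset based on Wikipedia sections

We save the file for use in the next notebook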

df.to_csv('olympics-data/olympics_qa.csv', index=False)

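2.5 Search file (DEPRECATED)

We create a search file (API reference), which can be used to retrieve the relevant context when a question is asked.

DEPRECATED: The /search endpoint is deprecated in favour of using embeddings. Embeddings are cheaper, faster and can support a better search experience. See the Question Answering Guide for a search implementation using embeddings.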
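To illustrate what embedding-based search does in principle, here is a minimal sketch that ranks sections by cosine similarity to a query vector. The three-dimensional vectors below are made up for illustration only; in practice they would come from an embeddings model, and the function names `cosine_similarity` and `rank_sections` are hypothetical helpers rather than part of any API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_sections(query_vec, section_vecs):
    """Return section names sorted from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in section_vecs.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy vectors (assumption): real embeddings have hundreds or thousands of dimensions
toy_sections = {
    "Summary": [0.9, 0.1, 0.0],
    "Host city selection": [0.1, 0.9, 0.1],
    "Effect on doping tests": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

print(rank_sections(toy_query, toy_sections))
```

The top-ranked sections would then be concatenated into the context, in the same way the search results are used below.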

df = df[df.tokens<2000]
df[['context', 'tokens']].rename(columns={'context':'text','tokens':'metadata'}).to_json('olympics-data/olympics_search.jsonl', orient='records', lines=True)

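# Legacy step: the 'search' file purpose belongs to the deprecated /search endpoint
search_file = client.files.create(
  file=open("olympics-data/olympics_search.jsonl", "rb"),
  purpose='search'
)
# The v1 client returns an object, so the id is an attribute rather than a dict key
olympics_search_fileid = search_file.id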

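2.6 Answer questions based on the context provided

We will use a simple implementation of the answers endpoint. It works by using the /search endpoint to search an indexed file for relevant sections that can be included in the context, followed by a question-answering prompt sent to the specified model.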
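The source of `create_context` is not shown in this notebook, so the following is only a sketch of the general idea: take ranked sections in order and keep adding them until a token budget would be exceeded. It mirrors the accumulation loop in `check_context` further below, including the 4-token allowance for the `\n\n###\n\n` separator. The name `build_context` and the `(text, n_tokens)` input format are assumptions made for this example.

```python
SEPARATOR = "\n\n###\n\n"
SEPARATOR_TOKENS = 4  # same allowance that check_context uses for the separator


def build_context(results, max_len=400):
    """Join ranked sections until the token budget would be exceeded.

    `results` is a list of (text, n_tokens) pairs, best match first.
    """
    chosen = []
    cur_len = 0
    for text, n_tokens in results:
        cur_len += n_tokens + SEPARATOR_TOKENS
        if cur_len > max_len:
            break  # adding this section would go over budget
        chosen.append(text)
    return SEPARATOR.join(chosen)


# Toy results (assumption): section texts with pre-computed token counts
toy_results = [("Section A", 100), ("Section B", 250), ("Section C", 100)]
print(build_context(toy_results, max_len=400))
```

With a budget of 400 tokens, the first two toy sections fit (104 + 254 = 358) and the third is left out because it would bring the total to 462.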

from answers_with_ft import create_context, answer_question
print(create_context("Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?", olympics_search_fileid, max_len=400))

Athletics at the 2020 Summer Olympics – Women's 4 × 100 metres relay
Summary

The women's 4 × 100 metres relay event at the 2020 Summer Olympics took place on 5 and 6 August 2021 at the Japan National Stadium. There were 16 competing relay teams, with each team having 5 members from which 4 were selected in each round.

###

Athletics at the 2020 Summer Olympics – Men's 4 × 100 metres relay
Qualification

National Olympic Committees (NOCs) could qualify one relay team in one of three following ways:
The top 8 NOCs at the 2019 World Athletics Championships qualified a relay team.
The top 8 NOCs at the 2021 World Athletics Relays qualified a relay team.
Where an NOC placed in the top 8 at both the 2019 World Championships and the 2021 World Relays, the quota place was allocated to the world top list as of 29 June 2021. In this case, 4 teams did so, so there are 4 places available through the world rankings.A total of five athletes may be entered for a relay team. Should a NOC have also entered individual athletes in the corresponding individual event (100 m), the entered individual athletes must be included in the total of five (5) athletes entered for the relay event. In addition of five, NOCs can nominate a maximum of one alternate athlete for each team.
The qualifying period was originally from 1 May 2019 to 29 June 2020. Due to the COVID-19 pandemic, the period was suspended from 6 April 2020 to 30 November 2020, with the end date extended to 29 June 2021. The qualifying time standards could be obtained in various meets during the given period that have the approval of the IAAF. Both indoor and outdoor meets are eligible. The most recent Area Championships may be counted in the ranking, even if not during the qualifying period.

answer_question(olympics_search_fileid, "davinci-instruct-beta-v3", 
            "Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?")

' Japan National Stadium'

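After we fine-tune the model for Q&A, we will be able to use it instead of davinci-instruct-beta-v3 to obtain better answers when the question cannot be answered from the context. We see a downside of davinci-instruct-beta-v3, which always attempts to answer the question, regardless of whether the relevant context is present. (Note that the second question asks about a future event, set in 2048.)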

answer_question(olympics_search_fileid, "davinci-instruct-beta-v3", 
            "Where did women's 4 x 100 metres relay event take place during the 2048 Summer Olympics?", max_len=1000)

' Japan National Stadium'

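We can see that davinci tends to answer the question even when it cannot be answered from the context provided. Note that the question asked about the 2048 Summer Olympics, which have not yet happened, and the retrieved content only returned results for 2020.

2.7 (Optional) Investigation into how likely the search endpoint is to return the relevant context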

def check_context(title, heading, question, max_len=1800, search_model='ada', max_rerank=10):
    """
    Evaluate the performance of the search model in retrieving the correct context

    Parameters
    ----------
    title: str
        The title of the Wikipedia page
    heading: str
        The heading of the Wikipedia section
    question: str
        The question
    max_len: int
        The maximum length of the context
    search_model: str
        The search model to use - `ada` is most cost effective
    max_rerank: int
        The maximum number of reranking documents to use the search model on

    Returns
    -------
    rank: int
        The rank of the correct context
    token_length: int
        The number of tokens needed to obtain the correct context
    """
    
    try:
        # TODO: openai.Engine(search_model) is deprecated; this legacy call also needs
        # `import openai` and an SDK version that still ships `openai.Engine`
        results = openai.Engine(search_model).search(
            search_model=search_model, 
            query=question, 
            max_rerank=max_rerank,
            file=olympics_search_fileid,
            return_metadata=True
        )
        index=-1
        returns = []
        cur_len = 0
        for result in results['data']:
            cur_len += int(result['metadata']) + 4 # we add 4 tokens for the separator `\n\n###\n\n`
            if cur_len > max_len:
                break
            returns.append(result['text'])
            res = result['text'].split('\n')
            if res[0] == title and res[1] == heading:
                index = len(returns) - 1
                break
        return index, cur_len
    except Exception as e:
        #print (e)
        return []
print(check_context("Athletics at the 2020 Summer Olympics – Women's 4 × 100 metres relay", "Summary", "Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?", max_len=10000))

(0, 58)

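We use the questions generated from each context to estimate how often we can retrieve the original context. These questions are noisy, so this is not a perfect estimate.

Our questions and answers are prefixed with numbered bullet points. However, because of the way they were generated, they are missing the first number, so we add "1." to the list of questions (and answers).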
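As a small illustration of that prefixing, and of how a numbered list can later be split back into individual questions, here is a sketch. It uses a regular expression rather than the fixed `q[3:]` slice and length filter used in the notebook's code, so that two-digit numbers such as "10." are also handled; the helper name `split_questions` and the sample text are made up for this example.

```python
import re


def split_questions(numbered_text):
    """Return the questions in a numbered list without their "N." prefixes."""
    questions = []
    for line in numbered_text.split("\n"):
        # Remove a prefix like "1. " or "12. " if present
        question = re.sub(r"^\s*\d+\.\s*", "", line).strip()
        if question:  # skip empty lines
            questions.append(question)
    return questions


# The model output starts right after the "1." that ended the prompt (made-up sample)
generated = " What is X?\n2. When did Y happen?\n"
restored = "1." + generated

print(restored)
print(split_questions(restored))
```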

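We calculate the rank of the relevant section retrieved using ada search, and the number of tokens of context needed to retrieve the relevant section in full.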

ada_results = df.apply(lambda x: [
                    check_context( x.title, 
                                   x.heading, 
                                   q[3:],     # remove the number prefix
                                   max_len=1000000, # set a large number to get the full context 
                                   search_model='ada', 
                                   max_rerank=200,
                                 ) 
                    for q in (x.questions).split('\n') # split the questions
                    if len(q) >10 # remove the empty questions
                ], axis=1)
ada_results.head()

0    [(132, 27104), (-1, 22939), (8, 2151), (2, 121...
1    [(4, 1737), (0, 130), (8, 744), (96, 17208), (...
2          [(0, 373), (0, 373), (-1, 40610), (1, 570)]
3            [(0, 302), (0, 302), (5, 968), (8, 1425)]
4                      [(0, 167), (0, 167), (2, 1442)]
Name: ada, dtype: object

out = pd.concat([ada_results], axis=1)
out.columns = ['ada']
out.to_csv('olympics-data/search_engine_results.csv')

def expand_lists(out):
    """
    Expand a pandas series containing lists into a series, where each list element becomes a value on its own

    Input is a row per paragraph, which has multiple questions
    Output is a row per question
    """
    cols = [pd.DataFrame(out[name].tolist()).stack().reset_index(level=1, drop=True).rename(name) for name in out.columns] 
    return pd.concat(cols, axis=1)

out_expanded = expand_lists(out)
out_expanded['rank'] = out_expanded.ada.apply(lambda x: x[0] if x != [] else -2)
out_expanded['tokens'] = out_expanded.ada.apply(lambda x: x[1] if x != [] else -2)

within_2k = (out_expanded.tokens < 2000).mean()
print(f"{within_2k*100:.1f}% of relevant paragraphs are retrieved within the first 2k tokens")

74.3% of relevant paragraphs are retrieved within the first 2k tokens

On this dataset, the relevant context can be obtained 74% of the time.

outside_200 = (out_expanded['rank'] == -1).mean()
print(f"{outside_200*100:.1f}% of relevant paragraphs are not retrieved within the first 200 results")

7.4% of relevant paragraphs are not retrieved within the first 200 results

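7.4% of the time, this is because the keyword search part of the search algorithm does not retrieve the relevant context within the first 200 results. 18.3% of the time, it is because the semantic search does not place the relevant context within the first 2000 tokens.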
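The 18.3% figure is what remains after subtracting the two printed percentages from 100%. A quick check, using the rounded values from the outputs above:

```python
within_2k = 74.3    # % of relevant paragraphs retrieved within the first 2k tokens
outside_200 = 7.4   # % of relevant paragraphs not in the first 200 results

# Retrieved within the first 200 results, but too far down to fit in 2k tokens
ranked_too_low = 100 - within_2k - outside_200
print(f"{ranked_too_low:.1f}%")  # prints 18.3%
```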

import matplotlib.pyplot as plt

# plot a histogram, and add axis descriptions and title
out_expanded[(out_expanded['rank'] >=0)&(out_expanded['rank'] <30)]['rank'].hist(bins=29)
plt.xlabel('rank')
plt.ylabel('count')
plt.title('Histogram of ranks of retrieved paragraphs')
plt.show()

out_expanded[(out_expanded.tokens>=0)&(out_expanded.tokens < 2000)]['tokens'].hist(bins=29)
plt.xlabel('tokens')
plt.ylabel('count')
plt.title('Histogram of the number of minimum tokens needed')
plt.show()

We can observe that the context is most likely to be returned as one of the first results, and most likely to be returned within the first 200-500 tokens.

# normalized value_counts
out_expanded['rank'].value_counts(normalize=True).sort_index()[:13]

-2     0.000063
-1     0.074428
 0     0.453420
 1     0.089515
 2     0.047146
 3     0.032437
 4     0.024139
 5     0.019676
 6     0.015967
 7     0.013452
 8     0.011189
 9     0.009869
 10    0.009178
Name: rank, dtype: float64

The probability of the relevant context being returned at each rank. (-2 means a processing error, -1 means the rank is >200)
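These per-rank probabilities can be summed to estimate how often the relevant context appears among the first k results. The sketch below copies the values from the output above into a plain dictionary; the helper name `top_k_probability` is hypothetical.

```python
# Per-rank probabilities copied from the value_counts output above (ranks 0-10)
rank_probs = {
    0: 0.453420, 1: 0.089515, 2: 0.047146, 3: 0.032437,
    4: 0.024139, 5: 0.019676, 6: 0.015967, 7: 0.013452,
    8: 0.011189, 9: 0.009869, 10: 0.009178,
}


def top_k_probability(probs, k):
    """Probability that the relevant section is among the first k results."""
    return sum(p for rank, p in probs.items() if 0 <= rank < k)


for k in (1, 3, 5, 10):
    print(f"top {k}: {top_k_probability(rank_probs, k):.1%}")
```

On these numbers, the top 3 results cover roughly 59% of questions and the top 10 roughly 72%, which shows how quickly the benefit of retrieving additional results flattens out.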