Fine-tuned Q&A - Train

Mar 10, 2022

3. Train a fine-tuning model specialized for Q&A

This notebook uses the dataset of context, question and answer pairs to additionally create adversarial question and context pairs, in which the question was not generated from that context. In those cases the model is prompted to answer "No appropriate context found to answer the question." We will also train a discriminator model, which predicts whether a question can be answered based on the given context.

We will also add harder adversarial examples, based either on semantically similar sections or on neighbouring sections originating from the same article.
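For reference, the two prompt/completion formats that the dataset-creation code below produces look roughly like this (an illustrative sketch only; the angle-bracket placeholders stand in for real contexts, questions and answers):

# Q&A model: answers from the context, or declines when the context does not fit the question
qa_example = {
    "prompt": "<context>\nQuestion: <question>\nAnswer:",
    "completion": " <answer>",  # or " No appropriate context found to answer the question."
}

# Discriminator model: predicts whether the question can be answered from the given context
discriminator_example = {
    "prompt": "<context>\nQuestion: <question>\n Related:",
    "completion": " yes",  # or " no"
}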

import openai
import pandas as pd

# Load the context/question/answer pairs generated in the previous notebook
df = pd.read_csv('olympics-data/olympics_qa.csv')
# ID of the uploaded search file, used later to retrieve semantically similar contexts
olympics_search_fileid = "file-c3shd8wqF3vSCKaukW4Jr1TT"
df.head()
  title  heading  content  tokens  context  questions  answers
0  2020 Summer Olympics  Summary  The 2020 Summer Olympics (Japanese: 2020年夏季オリン...  713  2020 Summer Olympics\nSummary\n\nThe 2020 Summer Oly...  1. What is the 2020 Summer Olympics?\n2. When ...  1. The 2020 Summer Olympics is an international...
1  2020 Summer Olympics  Host city selection  The International Olympic Committee (IOC) vote...  126  2020 Summer Olympics\nHost city selection\n\nThe Int...  1. \n2. \n3. \n4.  1. What is the International Olympic Committee...
2  2020 Summer Olympics  Impact of the COVID-19 pandemic  In January 2020, concerns began to be raised ab...  369  2020 Summer Olympics\nImpact of the COVID-19 pandemic\n...  1. What is the COVID-19 pandemic?\n2. How did the...  1. The COVID-19 pandemic is a pandemic that began...
3  2020 Summer Olympics  Qualifying event cancellation and postponement  Concerns about the pandemic began to affect qua...  298  2020 Summer Olympics\nQualifying event cancellation and postponement\n...  1. Where was the original location of the Asia qu...  1. The original location of the Asia & Oceania qu...
4  2020 Summer Olympics  Effect on doping tests  Mandatory doping tests were being severely restri...  163  2020 Summer Olympics\nEffect on doping tests\n...  1. What is the COVID-19 pandemic?\n2. What caused...  1. The COVID-19 pandemic is a pandemic that began...

Split the sections into a training and testing set

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
len(train_df), len(test_df)
(3014, 754)

We check that the separator we intend to use is not present within the contexts.

df.context.str.contains('->').sum()
0

3.1 Create the fine-tuning datasets for the Q&A and discriminator models

The fine-tuning dataset is created in the following way. For every corresponding question, answer and context pair, we create:

  • Positive example: correct question, answer, context pair
  • Negative examples:
    • A random negative example, where a random context is paired with the question
    • Two hard negative examples
      • One originating from the same wikipedia article
      • Another, which is most similar to the correct context

This process is noisy, as sometimes the question might be answerable given a different context, but on average we hope this won't affect the performance too much.


import random

def get_random_similar_contexts(question, context, file_id=olympics_search_fileid, search_model='ada', max_rerank=10):
    """
    Find similar contexts to the given context using the search file
    """
    try:
        # TODO: openai.Engine(search_model) is deprecated
        results = openai.Engine(search_model).search(
            search_model=search_model, 
            query=question, 
            max_rerank=max_rerank,
            file=file_id
        )
        candidates = []
        for result in results['data'][:3]:
            if result['text'] == context:
                continue
            candidates.append(result['text'])
        random_candidate = random.choice(candidates)
        return random_candidate
    except Exception as e:
        print(e)
        return ""

def create_fine_tuning_dataset(df, discriminator=False, n_negative=1, add_related=False):
    """
    Create a dataset for fine tuning the OpenAI model; either for a discriminator model, 
    or a model specializing in Q&A, where it says if no relevant context is found.

    Parameters
    ----------
    df: pd.DataFrame
        The dataframe containing the question, answer and context pairs
    discriminator: bool
        Whether to create a dataset for the discriminator
    n_negative: int
        The number of random negative samples to add (using a random context)
    add_related: bool
        Whether to add the related contexts to the correct context. These are hard negative examples

    Returns
    -------
    pd.DataFrame
        The dataframe containing the prompts and completions, ready for fine-tuning
    """
    rows = []
    for i, row in df.iterrows():
        for q, a in zip(("1." + row.questions).split('\n'), ("1." + row.answers).split('\n')):
            if len(q) >10 and len(a) >10:
                if discriminator:
                    rows.append({"prompt":f"{row.context}\nQuestion: {q[2:].strip()}\n Related:", "completion":f" yes"})
                else:
                    rows.append({"prompt":f"{row.context}\nQuestion: {q[2:].strip()}\nAnswer:", "completion":f" {a[2:].strip()}"})

    for i, row in df.iterrows():
        for q in ("1." + row.questions).split('\n'):
            if len(q) >10:
                for j in range(n_negative + (2 if add_related else 0)):
                    random_context = ""
                    if j == 0 and add_related:
                        # add the related contexts based on originating from the same wikipedia page
                        subset = df[(df.title == row.title) & (df.context != row.context)]
                        
                        if len(subset) < 1:
                            continue
                        random_context = subset.sample(1).iloc[0].context
                    elif j == 1 and add_related:
                        # add the related contexts based on the most similar contexts according to the search
                        random_context = get_random_similar_contexts(q[2:].strip(), row.context, search_model='ada', max_rerank=10)
                    else:
                        while True:
                            # add random context, which isn't the correct context
                            random_context = df.sample(1).iloc[0].context
                            if random_context != row.context:
                                break
                    if discriminator:
                        rows.append({"prompt":f"{random_context}\nQuestion: {q[2:].strip()}\n Related:", "completion":f" no"})
                    else:
                        rows.append({"prompt":f"{random_context}\nQuestion: {q[2:].strip()}\nAnswer:", "completion":f" No appropriate context found to answer the question."})

    return pd.DataFrame(rows) 

We apply the same process of dataset creation for both the discriminator model and the Q&A model. We apply the process separately for the training and testing set, to ensure that examples from the training set don't appear in the test set.

for name, is_disc in [('discriminator', True), ('qa', False)]:
    for train_test, dt in [('train', train_df), ('test', test_df)]:
        ft = create_fine_tuning_dataset(dt, discriminator=is_disc, n_negative=1, add_related=True)
        ft.to_json(f'olympics-data/{name}_{train_test}.jsonl', orient='records', lines=True)
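As an optional spot check, the generated JSONL files can be read back with pandas to verify the prompt/completion structure (the path below assumes the olympics-data/ directory used elsewhere in this notebook):

# Optional: inspect one of the generated prompt/completion pairs
qa_train_check = pd.read_json('olympics-data/qa_train.jsonl', lines=True)
print(qa_train_check.shape)
print(qa_train_check.iloc[0]['prompt'][-200:])  # the prompt ends with "...\nQuestion: ...\nAnswer:"
print(qa_train_check.iloc[0]['completion'])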

We formatted the data according to the recommendations from the fine-tuning tool, which is available using

openai tools fine_tunes.prepare_data -f olympics-data/qa_train.jsonl

We highly recommend using this tool, which can suggest improvements to your data's formatting for fine-tuning.

3.2 Submit the datasets for fine-tuning

!openai api fine_tunes.create -t "olympics-data/discriminator_train.jsonl" -v "olympics-data/discriminator_test.jsonl" --batch_size 16  --compute_classification_metrics --classification_positive_class " yes" --model ada
!openai api fine_tunes.create -t "olympics-data/qa_train.jsonl" -v "olympics-data/qa_test.jsonl" --batch_size 16

3.3 Using the fine-tuned models

We will now use the fine-tuned discriminator and the fine-tuned Q&A model. By requesting logprobs, we can see how certain the discriminator is in a yes vs no answer.

ft_discriminator = "curie:ft-openai-internal-2021-08-23-23-58-57"
ft_qa = "curie:ft-openai-internal-2021-08-23-17-54-10"

def apply_ft_discriminator(context, question, discriminator_model):
    """
    Apply the fine tuned discriminator to a question, to assess whether it can be answered from the context.
    """
    prompt = f"{context}\nQuestion: {question}\n Related:"
    result = openai.Completion.create(model=discriminator_model, prompt=prompt, max_tokens=1, temperature=0, top_p=1, n=1, logprobs=2)
    return result['choices'][0]['logprobs']['top_logprobs']

apply_ft_discriminator('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.', 
                        'What was the first human-made object in space?', ft_discriminator)
[<OpenAIObject at 0x7fe812e602b0> JSON: {
   " no": -10.819577,
   " yes": -2.045765e-05
 }]
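Since logprobs are natural log probabilities, they can be converted into probabilities with exp; for the output above, the discriminator assigns essentially all of the probability mass to " yes" (a small illustrative calculation using the numbers shown):

import math

# Values copied from the top_logprobs output above
top_logprobs = {" no": -10.819577, " yes": -2.045765e-05}
probabilities = {token: math.exp(logprob) for token, logprob in top_logprobs.items()}
print(probabilities)  # {' no': ~2e-05, ' yes': ~0.99998}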

We can see that the model generalizes well to different contexts and questions.

def apply_ft_qa_answer(context, question, answering_model):
    """
    Apply the fine tuned Q&A model to a question, answering based on the given context.
    """
    prompt = f"{context}\nQuestion: {question}\nAnswer:"
    result = openai.Completion.create(model=answering_model, prompt=prompt, max_tokens=30, temperature=0, top_p=1, n=1, stop=['.','\n'])
    return result['choices'][0]['text']

apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.', 
                    'What was the first human-made object in space?', ft_qa)
' The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957'

We can see that the model can answer the question when the context is appropriate.

apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.',
                    'What is impressive about the Soviet Union?', ft_qa)
' The Soviet Union was the first country to successfully launch a satellite into space'
apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.',
                    'How many cars were produced in the Soviet Union in 1970?', ft_qa)
' No appropriate context found to answer the question'

We can see that the model knows when to answer the question, and when to say that there is insufficient context to answer the question.

We can also combine the discriminator with a base model or with the fine-tuned Q&A model. The discriminator can essentially act as a decision-maker for whether the question can be answered given the context.

def answer_question_conditionally(answering_model, discriminator_model, context, question, discriminator_logprob_yes_modifier=0):
    # top_logprobs is a list with one entry per generated token; take the entry for the single token generated
    logprobs = apply_ft_discriminator(context, question, discriminator_model)[0]
    yes_logprob = logprobs[' yes'] if ' yes' in logprobs else -100
    no_logprob = logprobs[' no'] if ' no' in logprobs else -100
    if yes_logprob + discriminator_logprob_yes_modifier < no_logprob:
        return " No appropriate context found to answer the question based on the discriminator."
    return apply_ft_qa_answer(context, question, answering_model)
answer_question_conditionally(ft_qa, ft_discriminator, 
                                "Crowdless games are a rare although not unheard-of occurrence in sports. \
                                 When they do occur, it is usually the result of events beyond the control \
                                 of the teams or fans, such as weather-related concerns, public health concerns, \
                                 or wider civil disturbances unrelated to the game. For instance, \
                                 the COVID-19 pandemic caused many sports leagues around the world \
                                 to be played behind closed doors.",
                                "Could weather cause a sport event to have no crowd?")
' Weather could cause a sport event to have no crowd'

The function above illustrates how the discriminator and the fine-tuned Q&A model can potentially be combined. This gives more fine-grained control over how certain we want the model to be before it answers the question.
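For instance, passing a negative discriminator_logprob_yes_modifier makes the pipeline more conservative: the discriminator then has to favour " yes" over " no" by a larger margin before the Q&A model is consulted at all (the value -5 below is an arbitrary, illustrative threshold):

# Only attempt an answer if the discriminator favours " yes" over " no" by at least 5 (in log-prob terms);
# -5 is an arbitrary threshold chosen purely for illustration
answer_question_conditionally(ft_qa, ft_discriminator,
                                "The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.",
                                "What is impressive about the Soviet Union?",
                                discriminator_logprob_yes_modifier=-5)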

We'll now take a look at how the answers endpoint works - combining search to retrieve the relevant context from a knowledge base, and then using the fine-tuned Q&A model to answer the question.

from answers_with_ft import answer_question
answer_question(olympics_search_fileid, ft_qa, "Which country won the Women's football tournament at the 2020 Olympic games?")
" Canada won the Women's football tournament at the 2020 Olympic games"