Fine-tuned Q&A - Train

Mar 10, 2022

3. Train a fine-tuning model specialized for Q&A

This notebook uses the dataset of context, question and answer pairs to additionally create adversarial question and context pairs, in which the question was not generated from that context. In those cases the model is prompted to answer "No appropriate context found to answer the question." We will also train a discriminator model, which predicts whether a question can be answered based on the given context.

We will also add harder adversarial examples, based either on semantically similar sections or on neighbouring sections originating from the same article.
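For reference, the two prompt/completion formats that the dataset-creation code below produces look roughly like this (an illustrative sketch only; the angle-bracket placeholders stand in for real contexts, questions and answers):

# Q&A model: answers from the context, or declines when the context does not fit the question
qa_example = {
    "prompt": "<context>\nQuestion: <question>\nAnswer:",
    "completion": " <answer>",  # or " No appropriate context found to answer the question."
}

# Discriminator model: predicts whether the question can be answered from the given context
discriminator_example = {
    "prompt": "<context>\nQuestion: <question>\n Related:",
    "completion": " yes",  # or " no"
}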

import openai
import pandas as pd

# Load the context/question/answer pairs generated in the previous notebook
df = pd.read_csv('olympics-data/olympics_qa.csv')
# ID of the uploaded search file, used later to retrieve semantically similar contexts
olympics_search_fileid = "file-c3shd8wqF3vSCKaukW4Jr1TT"
df.head()
  title  heading  content  tokens  context  questions  answers
0  2020 Summer Olympics  Summary  The 2020 Summer Olympics (Japanese: 2020年夏季オリン...  713  2020 Summer Olympics\nSummary\n\nThe 2020 Summer Oly...  1. What is the 2020 Summer Olympics?\n2. When ...  1. The 2020 Summer Olympics is an international...
1  2020 Summer Olympics  Host city selection  The International Olympic Committee (IOC) vote...  126  2020 Summer Olympics\nHost city selection\n\nThe Int...  1. \n2. \n3. \n4.  1. What is the International Olympic Committee...
2  2020 Summer Olympics  Impact of the COVID-19 pandemic  In January 2020, concerns began to be raised ab...  369  2020 Summer Olympics\nImpact of the COVID-19 pandemic\n...  1. What is the COVID-19 pandemic?\n2. How did the...  1. The COVID-19 pandemic is a pandemic that began...
3  2020 Summer Olympics  Qualifying event cancellation and postponement  Concerns about the pandemic began to affect qua...  298  2020 Summer Olympics\nQualifying event cancellation and postponement\n...  1. Where was the original location of the Asia qu...  1. The original location of the Asia & Oceania qu...
4  2020 Summer Olympics  Effect on doping tests  Mandatory doping tests were being severely restri...  163  2020 Summer Olympics\nEffect on doping tests\n...  1. What is the COVID-19 pandemic?\n2. What caused...  1. The COVID-19 pandemic is a pandemic that began...

Split the sections into a training and testing set

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
len(train_df), len(test_df)
(3014, 754)

We check that the separator we intend to use is not present within the contexts.

df.context.str.contains('->').sum()
0

3.1 Create the fine-tuning datasets for the Q&A and discriminator models

The fine-tuning dataset is created in the following way. For every corresponding question, answer and context pair, we create:

  • Positive example: correct question, answer, context pair
  • Negative examples:
    • A random negative example, where a random context is paired with the question
    • Two hard negative examples
      • One originating from the same wikipedia article
      • Another, which is most similar to the correct context

This process is noisy, as sometimes the question might be answerable given a different context, but on average we hope this won't affect the performance too much.


import random

def get_random_similar_contexts(question, context, file_id=olympics_search_fileid, search_model='ada', max_rerank=10):
    """
    Find similar contexts to the given context using the search file
    """
    try:
        # TODO: openai.Engine(search_model) is deprecated
        results = openai.Engine(search_model).search(
            search_model=search_model, 
            query=question, 
            max_rerank=max_rerank,
            file=file_id
        )
        candidates = []
        for result in results['data'][:3]:
            if result['text'] == context:
                continue
            candidates.append(result['text'])
        random_candidate = random.choice(candidates)
        return random_candidate
    except Exception as e:
        print(e)
        return ""

def create_fine_tuning_dataset(df, discriminator=False, n_negative=1, add_related=False):
    """
    Create a dataset for fine tuning the OpenAI model; either for a discriminator model, 
    or a model specializing in Q&A, where it says if no relevant context is found.

    Parameters
    ----------
    df: pd.DataFrame
        The dataframe containing the question, answer and context pairs
    discriminator: bool
        Whether to create a dataset for the discriminator
    n_negative: int
        The number of random negative samples to add (using a random context)
    add_related: bool
        Whether to add the related contexts to the correct context. These are hard negative examples

    Returns
    -------
    pd.DataFrame
        The dataframe containing the prompts and completions, ready for fine-tuning
    """
    rows = []
    for i, row in df.iterrows():
        for q, a in zip(("1." + row.questions).split('\n'), ("1." + row.answers).split('\n')):
            if len(q) >10 and len(a) >10:
                if discriminator:
                    rows.append({"prompt":f"{row.context}\nQuestion: {q[2:].strip()}\n Related:", "completion":f" yes"})
                else:
                    rows.append({"prompt":f"{row.context}\nQuestion: {q[2:].strip()}\nAnswer:", "completion":f" {a[2:].strip()}"})

    for i, row in df.iterrows():
        for q in ("1." + row.questions).split('\n'):
            if len(q) >10:
                for j in range(n_negative + (2 if add_related else 0)):
                    random_context = ""
                    if j == 0 and add_related:
                        # add the related contexts based on originating from the same wikipedia page
                        subset = df[(df.title == row.title) & (df.context != row.context)]
                        
                        if len(subset) < 1:
                            continue
                        random_context = subset.sample(1).iloc[0].context
                    elif j == 1 and add_related:
                        # add the related contexts based on the most similar contexts according to the search
                        random_context = get_random_similar_contexts(q[2:].strip(), row.context, search_model='ada', max_rerank=10)
                    else:
                        while True:
                            # add random context, which isn't the correct context
                            random_context = df.sample(1).iloc[0].context
                            if random_context != row.context:
                                break
                    if discriminator:
                        rows.append({"prompt":f"{random_context}\nQuestion: {q[2:].strip()}\n Related:", "completion":f" no"})
                    else:
                        rows.append({"prompt":f"{random_context}\nQuestion: {q[2:].strip()}\nAnswer:", "completion":f" No appropriate context found to answer the question."})

    return pd.DataFrame(rows) 

We apply the same process of dataset creation for both the discriminator model and the Q&A model. We apply the process separately for the training and testing set, to ensure that examples from the training set don't appear in the test set.

for name, is_disc in [('discriminator', True), ('qa', False)]:
    for train_test, dt in [('train', train_df), ('test', test_df)]:
        ft = create_fine_tuning_dataset(dt, discriminator=is_disc, n_negative=1, add_related=True)
        ft.to_json(f'olympics-data/{name}_{train_test}.jsonl', orient='records', lines=True)
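As an optional spot check, the generated JSONL files can be read back with pandas to verify the prompt/completion structure (the path below assumes the olympics-data/ directory used elsewhere in this notebook):

# Optional: inspect one of the generated prompt/completion pairs
qa_train_check = pd.read_json('olympics-data/qa_train.jsonl', lines=True)
print(qa_train_check.shape)
print(qa_train_check.iloc[0]['prompt'][-200:])  # the prompt ends with "...\nQuestion: ...\nAnswer:"
print(qa_train_check.iloc[0]['completion'])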

We formatted the data according to the recommendations from the fine-tuning tool, which is available using

openai tools fine_tunes.prepare_data -f olympics-data/qa_train.jsonl

We highly recommend using this tool, which can suggest improvements to your data's formatting for fine-tuning.

3.2 Submit the datasets for fine-tuning

!openai api fine_tunes.create -t "olympics-data/discriminator_train.jsonl" -v "olympics-data/discriminator_test.jsonl" --batch_size 16  --compute_classification_metrics --classification_positive_class " yes" --model ada
!openai api fine_tunes.create -t "olympics-data/qa_train.jsonl" -v "olympics-data/qa_test.jsonl" --batch_size 16

3.3 Using the fine-tuned models

We will now use the fine-tuned discriminator and the fine-tuned Q&A model. By requesting logprobs, we can see how certain the discriminator is in a yes vs no answer.

ft_discriminator = "curie:ft-openai-internal-2021-08-23-23-58-57"
ft_qa = "curie:ft-openai-internal-2021-08-23-17-54-10"

def apply_ft_discriminator(context, question, discriminator_model):
    """
    Apply the fine tuned discriminator to a question, to assess whether it can be answered from the context.
    """
    prompt = f"{context}\nQuestion: {question}\n Related:"
    result = openai.Completion.create(model=discriminator_model, prompt=prompt, max_tokens=1, temperature=0, top_p=1, n=1, logprobs=2)
    return result['choices'][0]['logprobs']['top_logprobs']

apply_ft_discriminator('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.', 
                        'What was the first human-made object in space?', ft_discriminator)
[<OpenAIObject at 0x7fe812e602b0> JSON: {
   " no": -10.819577,
   " yes": -2.045765e-05
 }]
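Since logprobs are natural log probabilities, they can be converted into probabilities with exp; for the output above, the discriminator assigns essentially all of the probability mass to " yes" (a small illustrative calculation using the numbers shown):

import math

# Values copied from the top_logprobs output above
top_logprobs = {" no": -10.819577, " yes": -2.045765e-05}
probabilities = {token: math.exp(logprob) for token, logprob in top_logprobs.items()}
print(probabilities)  # {' no': ~2e-05, ' yes': ~0.99998}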

We can see that the model generalizes well to different contexts and questions.

def apply_ft_qa_answer(context, question, answering_model):
    """
    Apply the fine tuned Q&A model to a question, answering based on the given context.
    """
    prompt = f"{context}\nQuestion: {question}\nAnswer:"
    result = openai.Completion.create(model=answering_model, prompt=prompt, max_tokens=30, temperature=0, top_p=1, n=1, stop=['.','\n'])
    return result['choices'][0]['text']

apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.', 
                    'What was the first human-made object in space?', ft_qa)
' The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957'

We can see that the model can answer the question when the context is appropriate.

apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.',
                    'What is impressive about the Soviet Union?', ft_qa)
' The Soviet Union was the first country to successfully launch a satellite into space'
apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.',
                    'How many cars were produced in the Soviet Union in 1970?', ft_qa)
' No appropriate context found to answer the question'

We can see that the model knows when to answer the question, and when to say that there is insufficient context to answer the question.

We can also combine the discriminator with a base model or with the fine-tuned Q&A model. The discriminator can essentially act as a decision-maker for whether the question can be answered given the context.

def answer_question_conditionally(answering_model, discriminator_model, context, question, discriminator_logprob_yes_modifier=0):
    # top_logprobs is a list with one entry per generated token; take the entry for the single token generated
    logprobs = apply_ft_discriminator(context, question, discriminator_model)[0]
    yes_logprob = logprobs[' yes'] if ' yes' in logprobs else -100
    no_logprob = logprobs[' no'] if ' no' in logprobs else -100
    if yes_logprob + discriminator_logprob_yes_modifier < no_logprob:
        return " No appropriate context found to answer the question based on the discriminator."
    return apply_ft_qa_answer(context, question, answering_model)
answer_question_conditionally(ft_qa, ft_discriminator, 
                                "Crowdless games are a rare although not unheard-of occurrence in sports. \
                                 When they do occur, it is usually the result of events beyond the control \
                                 of the teams or fans, such as weather-related concerns, public health concerns, \
                                 or wider civil disturbances unrelated to the game. For instance, \
                                 the COVID-19 pandemic caused many sports leagues around the world \
                                 to be played behind closed doors.",
                                "Could weather cause a sport event to have no crowd?")
' Weather could cause a sport event to have no crowd'

The function above illustrates how the discriminator and the fine-tuned Q&A model can potentially be combined. This gives more fine-grained control over how certain we want the model to be before it answers the question.
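For instance, passing a negative discriminator_logprob_yes_modifier makes the pipeline more conservative: the discriminator then has to favour " yes" over " no" by a larger margin before the Q&A model is consulted at all (the value -5 below is an arbitrary, illustrative threshold):

# Only attempt an answer if the discriminator favours " yes" over " no" by at least 5 (in log-prob terms);
# -5 is an arbitrary threshold chosen purely for illustration
answer_question_conditionally(ft_qa, ft_discriminator,
                                "The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.",
                                "What is impressive about the Soviet Union?",
                                discriminator_logprob_yes_modifier=-5)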

We'll now take a look at how the answers endpoint works - combining search to retrieve the relevant context from a knowledge base, and then using the fine-tuned Q&A model to answer the question.

from answers_with_ft import answer_question
answer_question(olympics_search_fileid, ft_qa, "Which country won the Women's football tournament at the 2020 Olympic games?")
" Canada won the Women's football tournament at the 2020 Olympic games"