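Leveraging model distillation to fine-tune a model

Oct 16, 2024

OpenAI recently released Distillation, which lets you use the outputs of a (large) model to fine-tune another (smaller) model. This can significantly reduce the price and latency of specific tasks as you move to a smaller model. In this cookbook we'll look at a dataset, distill the output of gpt-4o into gpt-4o-mini, and show how we can get significantly better results than with a generic, non-distilled 4o-mini.

We'll also leverage Structured Outputs for a classification problem that uses an enum list. We'll see how fine-tuned models benefit from Structured Outputs and how it affects performance. We'll show that Structured Outputs works with all of these models, including the distilled one.

We'll first analyze the dataset, get the outputs of both 4o and 4o-mini, highlight the difference in performance between the two models, then proceed to the distillation and analyze the performance of the distilled model.

Prerequisites

Let's install and load the dependencies. Make sure your OpenAI API key is defined in your environment as "OPENAI_API_KEY"; the client will load it directly.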

! pip install openai tiktoken numpy pandas tqdm --quiet

import openai
import json
import tiktoken
from tqdm import tqdm
from openai import OpenAI
import numpy as np
import concurrent.futures
import pandas as pd

client = OpenAI()

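Loading and understanding the dataset

For this cookbook, we'll load data from the following Kaggle dataset: https://www.kaggle.com/datasets/zynicide/wine-reviews.

This dataset has a large number of rows and you're free to run this cookbook on the whole data, but as a biased French wine lover, I narrowed the dataset down to French wines only, to focus on fewer rows and grape varieties.

We're looking at a classification problem: we'd like to guess the grape variety based on all the other criteria available, including the description, subregion and province, which we'll include in the prompt. That gives the model a lot of information. You're free to remove information that helps the prediction significantly, such as the region where the wine was produced, to see whether the model still does well at finding the grape.

Let's filter out the grape varieties that appear fewer than 5 times in the reviews.

Let's then proceed with a subset of 500 random rows from this dataset.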

df = pd.read_csv('data/winemag/winemag-data-130k-v2.csv')
df_france = df[df['country'] == 'France']

# Filter out varieties with fewer than 5 reviews. We would like to predict those too, but they are
# outliers we don't want to optimize for: they would make our enum list too long
# and could add noise to the rest of the dataset, eventually reducing our accuracy.

# Compute the counts once and reuse them for the filter
variety_counts = df_france['variety'].value_counts()
varieties_less_than_five_list = variety_counts[variety_counts < 5].index.tolist()
df_france = df_france[~df_france['variety'].isin(varieties_less_than_five_list)]

df_france_subset = df_france.sample(n=500)
df_france_subset.head()

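	Unnamed: 0	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
95206	95206	France	Full, fat, ripe, perfumed wine that is full of...	Château de Mercey Premier Cru	91	35.0	Burgundy	Mercurey	NaN	Roger Voss	@vossroger	Antonin Rodet 2010 Château de Mercey Premier C...	Pinot Noir	Antonin Rodet
66403	66403	France	For simple Chablis, this is impressive, rich, ...	Domaine	89	26.0	Burgundy	Chablis	NaN	Roger Voss	@vossroger	William Fèvre 2005 Domaine (Chablis)	Chardonnay	William Fèvre
71277	71277	France	This 50-50 blend of Marselan and Merlot opens ...	La Remise	84	13.0	France Other	Vin de France	NaN	Lauren Buzzeo	@laurbuzz	Domaine de la Mordorée 2014 La Remise Red (Vin...	Red Blend	Domaine de la Mordorée
27484	27484	France	The medium-intense nose of this solid and easy...	Authentic & Chic	86	10.0	France Other	Vin de France	NaN	Lauren Buzzeo	@laurbuzz	Romantic 2014 Authentic & Chic Cabernet Sauvig...	Cabernet Sauvignon	Romantic
124917	124917	France	Fresh, pure notes of Conference pear peel...	NaN	89	30.0	Alsace	Alsace	NaN	Anne Krebiehl MW	@AnneInVino	Domaine Vincent Stoeffler 2015 Pinot Gris (Als...	Pinot Gris	Domaine Vincent Stoeffler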

Let's retrieve all grape varieties so we can include them in the prompt and in our Structured Outputs enum list.

varieties = np.array(df_france['variety'].unique()).astype('str')
varieties

array(['Gewürztraminer', 'Pinot Gris', 'Gamay',
       'Bordeaux-style White Blend', 'Champagne Blend', 'Chardonnay',
       'Petit Manseng', 'Riesling', 'White Blend', 'Pinot Blanc',
       'Alsace white blend', 'Bordeaux-style Red Blend', 'Malbec',
       'Tannat-Cabernet', 'Rhône-style Red Blend', 'Ugni Blanc-Colombard',
       'Savagnin', 'Pinot Noir', 'Rosé', 'Melon',
       'Rhône-style White Blend', 'Pinot Noir-Gamay', 'Colombard',
       'Chenin Blanc', 'Sylvaner', 'Sauvignon Blanc', 'Red Blend',
       'Chenin Blanc-Chardonnay', 'Cabernet Sauvignon', 'Cabernet Franc',
       'Syrah', 'Sparkling Blend', 'Duras', 'Provence red blend',
       'Tannat', 'Merlot', 'Malbec-Merlot', 'Chardonnay-Viognier',
       'Cabernet Franc-Cabernet Sauvignon', 'Muscat', 'Viognier',
       'Picpoul', 'Altesse', 'Provence white blend', 'Mondeuse',
       'Grenache-Syrah', 'G-S-M', 'Pinot Meunier', 'Cabernet-Syrah',
       'Vermentino', 'Marsanne', 'Colombard-Sauvignon Blanc',
       'Gros and Petit Manseng', 'Jacquère', 'Negrette', 'Mauzac',
       'Pinot Auxerrois', 'Grenache', 'Roussanne', 'Gros Manseng',
       'Tannat-Merlot', 'Aligoté', 'Chasselas', "Loin de l'Oeil",
       'Malbec-Tannat', 'Carignan', 'Colombard-Ugni Blanc', 'Sémillon',
       'Syrah-Grenache', 'Sciaccerellu', 'Auxerrois', 'Mourvèdre',
       'Tannat-Cabernet Franc', 'Braucol', 'Trousseau',
       'Merlot-Cabernet Sauvignon'], dtype='<U33')

Generating the prompt

Let's build a function to generate our prompt and try it on the first wine of our list.

def generate_prompt(row, varieties):
    # Format the varieties list as a comma-separated string
    variety_list = ', '.join(varieties)
    
    prompt = f"""
    Based on this wine review, guess the grape variety:
    This wine is produced by {row['winery']} in the {row['province']} region of {row['country']}.
    It was grown in {row['region_1']}. It is described as: "{row['description']}".
    The wine has been reviewed by {row['taster_name']} and received {row['points']} points.
    The price is {row['price']}.

    Here is a list of possible grape varieties to choose from: {variety_list}.
    
    What is the likely grape variety? Answer only with the grape variety name or blend from the list.
    """
    return prompt

# Example usage with a specific row
prompt = generate_prompt(df_france.iloc[0], varieties)
prompt

'\n    Based on this wine review, guess the grape variety:\n    This wine is produced by Trimbach in the Alsace region of France.\n    It was grown in Alsace. It is described as: "This dry and restrained wine offers spice in profusion. Balanced with acidity and a firm texture, it\'s very much for food.".\n    The wine has been reviewed by Roger Voss and received 87 points.\n    The price is 24.0.\n\n    Here is a list of possible grape varieties to choose from: Gewürztraminer, Pinot Gris, Gamay, Bordeaux-style White Blend, Champagne Blend, Chardonnay, Petit Manseng, Riesling, White Blend, Pinot Blanc, Alsace white blend, Bordeaux-style Red Blend, Malbec, Tannat-Cabernet, Rhône-style Red Blend, Ugni Blanc-Colombard, Savagnin, Pinot Noir, Rosé, Melon, Rhône-style White Blend, Pinot Noir-Gamay, Colombard, Chenin Blanc, Sylvaner, Sauvignon Blanc, Red Blend, Chenin Blanc-Chardonnay, Cabernet Sauvignon, Cabernet Franc, Syrah, Sparkling Blend, Duras, Provence red blend, Tannat, Merlot, Malbec-Merlot, Chardonnay-Viognier, Cabernet Franc-Cabernet Sauvignon, Muscat, Viognier, Picpoul, Altesse, Provence white blend, Mondeuse, Grenache-Syrah, G-S-M, Pinot Meunier, Cabernet-Syrah, Vermentino, Marsanne, Colombard-Sauvignon Blanc, Gros and Petit Manseng, Jacquère, Negrette, Mauzac, Pinot Auxerrois, Grenache, Roussanne, Gros Manseng, Tannat-Merlot, Aligoté, Chasselas, Loin de l\'Oeil, Malbec-Tannat, Carignan, Colombard-Ugni Blanc, Sémillon, Syrah-Grenache, Sciaccerellu, Auxerrois, Mourvèdre, Tannat-Cabernet Franc, Braucol, Trousseau, Merlot-Cabernet Sauvignon.\n    \n    What is the likely grape variety? Answer only with the grape variety name or blend from the list.\n    '

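To get an idea of the cost before running the queries, you can use tiktoken to count the tokens we'll send and estimate the associated cost. The count below covers the user prompts only, so it leaves out the system message and the output tokens. It also only gives an estimate for running the completions, not for the fine-tuning process (used later in this cookbook when running the distillation), which depends on other factors such as the number of epochs and the training set.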

# Load the encoding used by gpt-4o
enc = tiktoken.encoding_for_model("gpt-4o")

# Initialize a variable to store the total number of tokens
total_tokens = 0

for index, row in df_france_subset.iterrows():
    prompt = generate_prompt(row, varieties)
    
    # Tokenize the input text and count tokens
    tokens = enc.encode(prompt)
    token_count = len(tokens)
    
    # Add the token count to the total
    total_tokens += token_count

print(f"Total number of tokens in the dataset: {total_tokens}")
print(f"Total number of prompts: {len(df_france_subset)}")

Total number of tokens in the dataset: 245439
Total number of prompts: 500

# Estimated cost in $ for the input tokens, using prices as of 2024/10/16

gpt4o_token_price = 2.50 / 1_000_000  # $2.50 per 1M tokens
gpt4o_mini_token_price = 0.150 / 1_000_000  # $0.15 per 1M tokens

total_gpt4o_cost = gpt4o_token_price*total_tokens
total_gpt4o_mini_cost = gpt4o_mini_token_price*total_tokens

print(total_gpt4o_cost)
print(total_gpt4o_mini_cost)

0.6135975
0.03681585

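Preparing functions to store completions

Since we're working with a limited list of possible responses (an enum list of grape varieties), let's leverage Structured Outputs to make sure the model answers from this list. This also lets us compare the model's answer directly with the grape variety and get a deterministic answer (as opposed to a model that might answer "I think the grape is Pinot Noir" instead of just "Pinot Noir"), while also improving performance by avoiding grape varieties that are not in our dataset.

If you want to know more about Structured Outputs, you can read the Structured Outputs cookbook and documentation guide.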

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "grape-variety",
        "schema": {
            "type": "object",
            "properties": {
                "variety": {
                    "type": "string",
                    "enum": varieties.tolist()
                }
            },
            "additionalProperties": False,
            "required": ["variety"],
        },
        "strict": True
    }
}
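The schema restricts answers to the enum on the API side, but the same check can be mirrored locally, which is handy when testing parsing code offline. The following is only a minimal sketch: `build_response_format` and `parse_variety` are hypothetical helper names that do not appear in this cookbook, and the three-item variety list is a toy example.

```python
import json

def build_response_format(varieties):
    """Build a strict JSON-schema response format restricted to the given varieties."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": "grape-variety",
            "schema": {
                "type": "object",
                "properties": {
                    "variety": {"type": "string", "enum": list(varieties)}
                },
                "additionalProperties": False,
                "required": ["variety"],
            },
            "strict": True,
        },
    }

def parse_variety(raw_content, response_format):
    """Parse a JSON message body and check the answer against the schema's enum."""
    allowed = response_format["json_schema"]["schema"]["properties"]["variety"]["enum"]
    variety = json.loads(raw_content)["variety"]
    if variety not in allowed:
        raise ValueError(f"Unexpected variety: {variety!r}")
    return variety

# Toy enum list (assumption, for illustration only)
example_format = build_response_format(["Pinot Noir", "Chardonnay", "Gamay"])
print(parse_variety('{"variety": "Pinot Noir"}', example_format))
```

With strict Structured Outputs the local check should never fail, so it mostly acts as a guard for tests or for models called without the schema.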

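To distill a model, you need to store all completions from the model so that you can provide them as a reference for fine-tuning the smaller model. We therefore add the store=True parameter to our client.chat.completions.create call so that these completions from gpt-4o are stored.

We'll store all completions (including those of 4o-mini and our future fine-tuned model) so we can run Evals directly from the OpenAI platform.

When storing these completions, it's useful to attach a metadata tag, which lets you filter on the OpenAI platform down to the specific set of completions you want to use for distillation and evaluation.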

# Metadata tag used to filter these stored completions on the OpenAI platform
metadata_value = "wine-distillation" # that's a funny metadata tag :-)

# Function to call the API and process the result for a single model (blocking call in this case)
def call_model(model, prompt):
    response = client.chat.completions.create(
        model=model,
        store=True,
        metadata={
            "distillation": metadata_value,
        },
        messages=[
            {
                "role": "system",
                "content": "You're a sommelier expert and you know everything about wine. You answer precisely with the name of the variety/blend."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        response_format=response_format
    )
    return json.loads(response.choices[0].message.content.strip())['variety']

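Parallel processing

As we'll run this on a lot of rows, let's run these completions in parallel using concurrent.futures. We'll iterate over our DataFrame and track progress with a tqdm progress bar. We'll store each model's completions in the same DataFrame, under the column name {model}-variety.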

def process_example(index, row, model, df, progress_bar):
    global progress_index

    try:
        # Generate the prompt using the row
        prompt = generate_prompt(row, varieties)

        df.at[index, model + "-variety"] = call_model(model, prompt)
        
        # Update the progress bar
        progress_bar.update(1)
        
        progress_index += 1
    except Exception as e:
        print(f"Error processing model {model}: {str(e)}")

def process_dataframe(df, model):
    global progress_index
    progress_index = 1  # Reset progress index

    # Create a tqdm progress bar
    with tqdm(total=len(df), desc="Processing rows") as progress_bar:
        # Process each example concurrently using ThreadPoolExecutor
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = {executor.submit(process_example, index, row, model, df, progress_bar): index for index, row in df.iterrows()}
            
            for future in concurrent.futures.as_completed(futures):
                try:
                    future.result()  # Wait for each example to be processed
                except Exception as e:
                    print(f"Error processing example: {str(e)}")

    return df
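The functions above write into the shared DataFrame and update a global counter from several threads. As a hedged alternative, the same fan-out pattern can collect results per key and return them, leaving it to the caller to write them back. This sketch runs offline: `fake_predict` is a hypothetical stand-in for `call_model`, and `run_in_parallel` is not part of this cookbook.

```python
import concurrent.futures

def fake_predict(prompt):
    """Hypothetical stand-in for call_model, so the sketch runs without API calls."""
    return prompt.upper()

def run_in_parallel(items, worker, max_workers=8):
    """Run worker over a {key: value} mapping in threads.

    Returns ({key: result}, {key: exception}) so failures stay attached to their row.
    """
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the key (for example a DataFrame index label)
        futures = {executor.submit(worker, value): key for key, value in items.items()}
        for future in concurrent.futures.as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:
                errors[key] = exc
    return results, errors

results, errors = run_in_parallel({10: "gamay", 11: "syrah"}, fake_predict)
print(results, errors)
```

Keeping the index label next to each future makes it possible to retry only the rows that failed, instead of just printing the error.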

Let's try our call_model function before processing the whole DataFrame and check the output.

answer = call_model('gpt-4o', generate_prompt(df_france_subset.iloc[0], varieties))
answer

'Pinot Noir'

Great! We confirmed that we get a grape variety as output. Now let's process the dataset with both gpt-4o and gpt-4o-mini and compare the results.

df_france_subset = process_dataframe(df_france_subset, "gpt-4o")

Processing rows: 100%|███████████████████████████████████████████████| 500/500 [00:41<00:00, 12.09it/s]

df_france_subset = process_dataframe(df_france_subset, "gpt-4o-mini")

Processing rows: 100%|███████████████████████████████████████████████| 500/500 [01:31<00:00,  5.45it/s]

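Comparing gpt-4o and gpt-4o-mini

Now that we have all chat completions for both models, let's compare them against the expected grape variety and assess their accuracy at finding it. We'll do this directly in Python here since we only need a simple string check, but if your task involves more complex evals you can leverage OpenAI Evals or our open-source eval framework.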

models = ['gpt-4o', 'gpt-4o-mini']

def get_accuracy(model, df):
    return np.mean(df['variety'] == df[model + '-variety'])

for model in models:
    print(f"{model} accuracy: {get_accuracy(model, df_france_subset) * 100:.2f}%")

gpt-4o accuracy: 81.80%
gpt-4o-mini accuracy: 69.00%

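We can see that gpt-4o is better at finding grape varieties than 4o-mini (12.80 percentage points higher, or about 18.6% higher relative to 4o-mini!). I now wonder whether we made gpt-4o drink wine during training!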
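The gap between two accuracies can be stated either in percentage points (absolute) or as a percentage of the baseline (relative), and the two are easy to mix up. A small sketch of the arithmetic, using the accuracies printed above; `improvement` is a hypothetical helper name.

```python
def improvement(baseline_pct, new_pct):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new_pct - baseline_pct
    relative = absolute / baseline_pct * 100
    return absolute, relative

# gpt-4o-mini (baseline) vs gpt-4o on the 500-row subset
absolute, relative = improvement(69.00, 81.80)
print(f"{absolute:.2f} percentage points, {relative:.1f}% relative")
# prints: 12.80 percentage points, 18.6% relative
```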

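Distilling gpt-4o outputs to gpt-4o-mini

Let's assume we'd like to run this prediction often: we want completions to be faster and cheaper while keeping that level of accuracy. It would be great to distill 4o's accuracy into 4o-mini, wouldn't it? Let's do it!

We'll now go to the OpenAI stored completions page: https://platform.openai.com/chat-completions.

Let's select the model gpt-4o (make sure to do so; you don't want to distill the 4o-mini outputs we also ran). Let's also select the metadata distillation: wine-distillation to get only the completions stored from this cookbook.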

[Image: Filtering out completions]

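Once you've selected the completions, click "Distill" in the top right corner to fine-tune a model based on them. This automatically creates a file used to run the fine-tuning process. Then select gpt-4o-mini as the base model and keep the default parameters (though you're free to change them or iterate on them to improve performance).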

[Image: Distilling modal]

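Once the fine-tuning job has started, you can retrieve the fine-tuning job ID from the fine-tuning page. We'll use it to monitor the status of the job and to retrieve the fine-tuned model ID once it's done.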

[Image: Fine tuning job]

# copy paste your fine-tune job ID below
finetune_job = client.fine_tuning.jobs.retrieve("ftjob-pRyNWzUItmHpxmJ1TX7FOaWe")

if finetune_job.status == 'succeeded':
    fine_tuned_model = finetune_job.fine_tuned_model
    print('finetuned model: ' + fine_tuned_model)
else:
    print('finetuned job status: ' + finetune_job.status)

finetuned model: ft:gpt-4o-mini-2024-07-18:distillation-test:wine-distillation:AIZntSyE
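The check above inspects the job once. If the job is still running, one option is to poll until it reaches a terminal status. The sketch below is an assumption-laden illustration rather than part of the original cookbook: `wait_for_job` is a hypothetical helper, the polling interval is arbitrary, and the retrieve function is injected so the demo runs offline against a fake job object.

```python
import time
from types import SimpleNamespace

# Statuses after which a fine-tuning job will not change any more
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, job_id, poll_seconds=30, max_polls=120):
    """Poll retrieve(job_id) until the job reaches a terminal status or polls run out."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} still not finished after {max_polls} polls")

# Offline demo with a fake retrieve function that walks through three statuses
_statuses = iter(["queued", "running", "succeeded"])
demo_job = wait_for_job(
    lambda job_id: SimpleNamespace(id=job_id, status=next(_statuses)),
    "demo-job",
    poll_seconds=0,
)
print(demo_job.status)
```

With the client from this cookbook, the call would look like `wait_for_job(client.fine_tuning.jobs.retrieve, "<your job ID>")`, after which `fine_tuned_model` can be read from the returned job if its status is `succeeded`.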

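Running completions for the distilled model

Now that we have our fine-tuned model, we can use it to run completions and compare its accuracy with gpt-4o and gpt-4o-mini. Let's take a different subset of French wines (as we restricted the outputs to French grape varieties, without outliers, we need to focus our validation dataset on these too). Let's run this on 300 entries for each model.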

validation_dataset = df_france.sample(n=300)

models.append(fine_tuned_model)

for model in models:
    another_subset = process_dataframe(validation_dataset, model)

Processing rows: 100%|███████████████████████████████████████████████| 300/300 [00:20<00:00, 14.69it/s]
Processing rows: 100%|███████████████████████████████████████████████| 300/300 [00:27<00:00, 10.99it/s]
Processing rows: 100%|███████████████████████████████████████████████| 300/300 [00:37<00:00,  8.08it/s]
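Because the validation sample above is drawn from all of df_france, some of its rows may also be among the 500 rows whose gpt-4o completions were used for distillation. A minimal sketch of one way to draw a sample that excludes them, shown on a toy DataFrame; `sample_disjoint` is a hypothetical helper, and the accuracies reported in this cookbook were not produced this way.

```python
import pandas as pd

def sample_disjoint(df, exclude_index, n, random_state=None):
    """Sample n rows from df, skipping any row whose index label is in exclude_index."""
    candidates = df.drop(index=exclude_index, errors="ignore")
    return candidates.sample(n=n, random_state=random_state)

# Toy data standing in for df_france (assumption, for illustration only)
toy = pd.DataFrame({"variety": ["Gamay", "Syrah", "Merlot", "Malbec", "Tannat", "Melon"]})
train = toy.sample(n=3, random_state=0)           # plays the role of df_france_subset
validation = sample_disjoint(toy, train.index, n=2, random_state=0)
print(sorted(validation.index))
```

In this cookbook's terms, the call would be `sample_disjoint(df_france, df_france_subset.index, n=300)`.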

Let's compare the accuracy of the models

for model in models:
    print(f"{model} accuracy: {get_accuracy(model, another_subset) * 100:.2f}%")

gpt-4o accuracy: 79.67%
gpt-4o-mini accuracy: 64.67%
ft:gpt-4o-mini-2024-07-18:distillation-test:wine-distillation:AIZntSyE accuracy: 79.33%

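That's about a 22.7% relative improvement over the non-distilled gpt-4o-mini! 🎉

Our fine-tuned model performs much better than gpt-4o-mini while having the same base model, and it comes within 0.34 percentage points of gpt-4o. We'll be able to use this model to run inference at lower cost and lower latency for future grape variety predictions.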