Synthetic data generation (Part 1)

Apr 10, 2024

Generating synthetic data with large language models (LLMs) offers a powerful solution to a common problem: the availability of high-quality, diverse, and privacy-compliant data. It can be used in a number of scenarios, such as training data science machine learning models (SVMs, decision trees, KNN), fine-tuning a different GPT model on the data, solving the cold-start problem, building compelling demos/apps with realistic data, scenario testing, and more.

There are a number of key drivers that may make you want to leverage synthetic data:

  1. Human data may have privacy restrictions and/or contain identifiable data that we do not want used.
  2. Synthetic data can be much more structured than real data, and therefore easier to manipulate.
  3. In domains where data is sparse, or where data in certain categories is sparse, we may want to augment the data.
  4. When dealing with imbalanced or non-diverse datasets, we may want to create data to improve the richness of the dataset.

Unlike traditional data augmentation or manual data creation, using LLMs allows the generation of rich, nuanced, and contextually relevant datasets, significantly enhancing their usefulness to enterprises and developers.

We split this tutorial into two parts. In this cookbook, the agenda is as follows:

  1. CSV with a structured prompt
  2. CSV with a Python program
  3. Multitable CSV with a Python program
  4. Simply creating textual data
  5. Dealing with imbalanced or non-diverse textual data, while in Part 2 we will look at prompting strategies for getting better textual data.

The last two are especially useful for creating synthetic data to fine-tune another GPT model, for example using higher-quality data produced by gpt-4o to fine-tune the cheaper and quicker gpt-3.5-turbo for improved performance at a lower cost.

%pip install openai
%pip install pandas
%pip install scikit-learn
%pip install matplotlib
from openai import OpenAI
import os
import re
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import json
import matplotlib

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

1. CSV with a structured prompt

Here we create data in the simplest way. You can generate data quickly by addressing 3 key points: tell it the format of the data (CSV), the schema, and useful information about how the columns relate (the LLM will be able to deduce this from the column names, but a helping hand improves performance).

datagen_model = "gpt-4o-mini"
question = """
Create a CSV file with 10 rows of housing data.
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense). Also only respond with the CSV.
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)
```csv
id,house_size_m2,house_price,location,number_of_bedrooms
1,50,150000,Suburban,2
2,75,250000,City Center,3
3,100,350000,Suburban,4
4,120,450000,Suburban,4
5,80,300000,City Center,3
6,90,400000,City Center,3
7,150,600000,Premium Area,5
8,200,750000,Premium Area,5
9,55,180000,Suburban,2
10,300,950000,Premium Area,6
```
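If you want to work with this output programmatically, a small parsing step helps. The step below is our addition rather than part of the original flow; it assumes the model wrapped its answer in a csv-fenced code block as above (str.removeprefix/removesuffix require Python 3.9+).

from io import StringIO

# Strip the markdown fence around the model's reply and load the rows into pandas.
csv_text = res.strip().removeprefix("```csv").removesuffix("```").strip()
df_houses = pd.read_csv(StringIO(csv_text))
print(df_houses.head())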

2. CSV with a Python program

The problem with generating data directly is that the amount of data we can produce is limited by the context window. What we can do instead is ask the LLM to generate a Python program that generates the synthetic data. This lets us scale to much more data, while inspecting the Python program gives us visibility into exactly how the data is generated.

This gives us a good starting point while letting us edit the Python program as we see fit.

question = """
Create a Python program to generate 100 rows of housing data.
I want you to at the end of it output a pandas dataframe with 100 rows of data.
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)
Certainly! Below is a Python program that generates synthetic housing data according to your specifications. We will create a pandas DataFrame with the defined fields and characteristics.

```python
import pandas as pd
import random

def generate_housing_data(num_rows):
    data = []
    
    locations = [
        ('City Center', 10000, 150),  # (location name, base price per m², base size)
        ('Suburban Area', 8000, 100),
        ('Country Side', 5000, 80),
        ('Coastal Region', 12000, 110),
        ('Urban Neighborhood', 9000, 130)
    ]
    
    for i in range(1, num_rows + 1):
        # Randomly pick a location
        location, base_price_per_m2, base_size = random.choice(locations)
        
        # Generate number of bedrooms (1 to 5)
        number_of_bedrooms = random.randint(1, 5)
        
        # Calculate house size based on the number of bedrooms
        house_size = base_size + (10 * number_of_bedrooms) + random.randint(-5, 15)  # Adding some noise
        
        # Calculate house price based on house size and location
        house_price = base_price_per_m2 * house_size + random.randint(-5000, 10000)  # Adding some noise

        # Append the generated data to the list
        data.append({
            'id': i,
            'house_size_m2': house_size,
            'house_price': house_price,
            'location': location,
            'number_of_bedrooms': number_of_bedrooms
        })

    # Create a pandas DataFrame
    df = pd.DataFrame(data)
    return df

# Generate 100 rows of housing data
housing_data_df = generate_housing_data(100)

# Show the result
print(housing_data_df)
```

### Explanation:
- The `generate_housing_data` function creates synthetic housing data for a specified number of rows (`num_rows`).
- We define different locations with corresponding base prices per square meter and average house sizes.
- For each house, we randomly select a location, number of bedrooms, and calculate house size and price to ensure a sensible correlation between the values.
- Finally, we create a pandas DataFrame from the generated data and return it.

You can run this program in your Python environment, and it will output a DataFrame containing 100 rows of synthetic housing data.

We need to make sure we parse this output appropriately, since there is often text surrounding the Python code. We could also explicitly ask it to state all the assumptions it made about the data it generated, although in this case it told us automatically.
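As a minimal parsing sketch (our addition, assuming the model wraps its program in a python-fenced code block as in the response above), we can extract just the code before reviewing or saving it:

match = re.search(r"```python\n(.*?)```", res, re.DOTALL)
if match:
    generated_code = match.group(1)
    # Review before executing; running model-generated code blindly is unsafe.
    print(generated_code)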

3. Multitable CSV with a Python program

For more complex relationships, we need to make sure we specify a few more characteristics.

To create multiple datasets that relate to one another (e.g. houses, locations, house types), we need to specify the format, the schema, and useful information, as before. However, more useful information is now required to get good performance. It is case-specific, but a good set of things to describe are: how the datasets relate to one another, the sizes of the datasets relative to each other, that foreign and primary keys should be created appropriately, and ideally that previously generated datasets be used to populate new ones, so that the actual data values match where necessary. (We sanity-check the generated keys with a short join after the output below.)

question = """
Create a Python program to generate 3 different pandas dataframes.

1. Housing data
I want 100 rows. Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms
 - house type
 + any relevant foreign keys

2. Location
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - country
 - city
 - population
 - area (m^2)
 + any relevant foreign keys

 3. House types
 - id (incrementing integer starting at 1)
 - house type
 - average house type price
 - number of houses
 + any relevant foreign keys

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).
Make sure that the dataframe generally follow common sense checks, e.g. the size of the dataframes make sense in comparison with one another.
Make sure the foreign keys match up and you can use previously generated dataframes when creating each consecutive dataframes.
You can use the previously generated dataframe to generate the next dataframe.
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)
Certainly! Below is a Python program that generates the three specified pandas DataFrames for housing data, location data, and house types. Each DataFrame will include the necessary fields, and the foreign keys will ensure proper relationships among them.

```python
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(0)

# Function to generate location DataFrame
def generate_location_data(num_locations):
    locations = {
        "id": range(1, num_locations + 1),
        "country": np.random.choice(['USA', 'Canada', 'UK'], num_locations),
        "city": np.random.choice(['New York', 'Toronto', 'London', 'Vancouver', 'Manchester'], num_locations),
        "population": np.random.randint(50000, 1000000, num_locations),
        "area": np.random.randint(10000, 500000, num_locations)
    }
    return pd.DataFrame(locations)

# Function to generate house types DataFrame
def generate_house_type_data(num_house_types):
    house_types = {
        "id": range(1, num_house_types + 1),
        "house_type": np.random.choice(['Detached', 'Semi-Detached', 'Terraced', 'Flat'], num_house_types),
        "average_house_type_price": np.random.randint(100000, 1000000, num_house_types),
        "number_of_houses": np.random.randint(10, 1000, num_house_types)
    }
    return pd.DataFrame(house_types)

# Function to generate housing data DataFrame
def generate_housing_data(num_houses, location_df, house_type_df):
    house_sizes = np.random.randint(50, 300, num_houses)  # size in m^2
    location_ids = np.random.choice(location_df['id'], num_houses)
    house_type_ids = np.random.choice(house_type_df['id'], num_houses)
    
    # Generate prices based on size, location, and house type
    house_prices = (house_sizes * np.random.randint(2000, 5000, num_houses) // 10) + \
                   (location_ids * 1000) + \
                   (house_type_df.loc[house_type_ids - 1, 'average_house_type_price'].values // 4)
    
    housing_data = {
        "id": range(1, num_houses + 1),
        "house_size": house_sizes,
        "house_price": house_prices,
        "location_id": location_ids,
        "bedrooms": np.random.randint(1, 6, num_houses),
        "house_type_id": house_type_ids
    }
    
    return pd.DataFrame(housing_data)

# Generate DataFrames
num_locations = 10
num_house_types = 4
num_houses = 100

location_df = generate_location_data(num_locations)
house_type_df = generate_house_type_data(num_house_types)
housing_df = generate_housing_data(num_houses, location_df, house_type_df)

# Display the generated DataFrames
print("Location DataFrame:")
print(location_df.head(), "\n")

print("House Types DataFrame:")
print(house_type_df.head(), "\n")

print("Housing DataFrame:")
print(housing_df.head(), "\n")

# Printing the DataFrame shapes
print(f"Shapes: \nLocation: {location_df.shape}, House Types: {house_type_df.shape}, Housing: {housing_df.shape}")
```

### Explanation of the Code:
1. **Location DataFrame:** 
   - Generates random locations with attributes such as country, city, population, and area.
  
2. **House Types DataFrame:** 
   - Generates different types of houses along with average prices and quantity available.
  
3. **Housing DataFrame:** 
   - Generates housing data with increments on price based on house size, location, and house type, while also ensuring foreign keys (IDs) for location and house type.

### Output:
The three DataFrames generated will logically relate to one another with consistent data types and primary–foreign key relationships, resulting in a coherent representation of the housing dataset. The output displays heads of each DataFrame and their shapes for verification.
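Since we asked for matching foreign keys, it is worth verifying them once the generated program has been run. The check below is our own addition and assumes the dataframes and column names produced by the program above; it joins each housing row to its location and house type.

merged = (
    housing_df
    .merge(location_df, left_on="location_id", right_on="id", suffixes=("", "_location"))
    .merge(house_type_df, left_on="house_type_id", right_on="id", suffixes=("", "_house_type"))
)
# If every foreign key resolved, no housing rows were dropped by the inner joins.
assert len(merged) == len(housing_df)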

4. Simply creating textual data

Here we take our first look at creating textual data. This can be used, for example, to fine-tune another GPT model. In this case we imagine ourselves as a retailer trying to streamline the process of creating descriptions for items we sell. We again need to specify the format of the data, and in particular, in this case, we want one that is easy to parse as an output.

The example we consider below is one in which we want to create input-output training pairs to fine-tune a GPT model on. The input is the name of a product and the category it belongs to, and the output is a description. At the end of this section we also sketch how to write these pairs out in a fine-tuning-ready format.

Explicitly specifying the structure of the output, and instructing the model not to deviate from it, helps enforce the output structure. You can run this in a loop and append the data to generate more synthetic data. Again, as before, we need to parse the data well so that our downstream code does not break.

output_string = ""
for i in range(3):
  question = f"""
  I am creating input output training pairs to fine tune my gpt model. The usecase is a retailer generating a description for a product from a product catalogue. I want the input to be product name and category (to which the product belongs to) and output to be description.
  The format should be of the form:
  1.
  Input: product_name, category
  Output: description
  2.
  Input: product_name, category
  Output: description

  Do not add any extra characters around that formatting as it will make the output parsing break.
  Create as many training pairs as possible.
  """

  response = client.chat.completions.create(
    model=datagen_model,
    messages=[
      {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
      {"role": "user", "content": question}
    ]
  )
  res = response.choices[0].message.content
  output_string += res + "\n" + "\n"
print(output_string[:1000]) #displaying truncated response
1.
Input: Wireless Bluetooth Headphones, Electronics
Output: Immerse yourself in high-quality sound with these Wireless Bluetooth Headphones, featuring active noise cancellation and a comfortable over-ear design for extended listening sessions.

2.
Input: Organic Green Tea, Beverages
Output: Enjoy a refreshing cup of Organic Green Tea, sourced from the finest leaves, packed with antioxidants, and perfect for a healthy, invigorating boost anytime.

3.
Input: Stainless Steel Kitchen Knife, Kitchenware
Output: Cut with precision and ease using this Stainless Steel Kitchen Knife, designed with an ergonomic handle and a sharp blade for all your culinary tasks.

4.
Input: Hiking Backpack, Outdoor Gear
Output: Explore the great outdoors with this durable Hiking Backpack, featuring multiple compartments for optimal organization and a breathable design for ultimate comfort on long treks.

5.
Input: Air Fryer, Kitchen Appliances
Output: Cook your favorite meals with less oil using this Air Fryer

Note: the above output is truncated. We can now parse it as below to get lists of the products, categories, and their descriptions. For example, let's take a look at the products it generated.

#regex to parse data
pattern = re.compile(r'Input:\s*(.+?),\s*(.+?)\nOutput:\s*(.+?)(?=\n\n|\Z)', re.DOTALL)
matches = pattern.findall(output_string)
products = []
categories = []
descriptions = []

for match in matches:
    product, category, description = match
    products.append(product.strip())
    categories.append(category.strip())
    descriptions.append(description.strip())
products
['Wireless Bluetooth Headphones',
 'Organic Green Tea',
 'Stainless Steel Kitchen Knife',
 'Hiking Backpack',
 'Air Fryer',
 "Kids' Educational Tablet",
 'Bluetooth Speaker',
 'Yoga Mat',
 'Memory Foam Mattress',
 'Smartwatch',
 'Leather Wallet',
 'Portable Phone Charger',
 'Non-Stick Cookware Set',
 'Pet Dog Bed',
 'Fitness Tracker',
 'Wireless Earbuds',
 'Organic Green Tea',
 'Reusable Water Bottle',
 'Yoga Mat',
 'Leather Wallet',
 'Air Fryer',
 'Gaming Mouse',
 'Crochet Kit',
 'Hiking Boots',
 'Scented Candles',
 'Bluetooth Speaker',
 'Stainless Steel Cookware Set',
 'Fitness Tracker',
 'Decorative Throw Pillows',
 'Eco-Friendly Cleaning Supplies',
 'Wireless Noise Cancelling Headphones',
 'Organic Green Tea',
 'Adjustable Yoga Mat',
 'Bluetooth Smart Scale',
 'Stainless Steel Water Bottle',
 'Soft Cotton Bedding Set',
 'Multi-Functional Kitchen Blender',
 'Eco-Friendly Reusable Bags',
 'Portable Phone Charger',
 'Classic Leather Wallet',
 'Suede Chelsea Boots',
 'Non-Stick Cookware Set',
 'Pet-Friendly Indoor Plants',
 'High-Protein Snack Bars',
 'LED Desk Lamp with USB Port']
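As a final step for the fine-tuning use case, the parsed lists can be written out in the chat-format JSONL that OpenAI fine-tuning expects. This step is our addition (the file name is arbitrary); the notebook itself stops at parsing.

# Write each parsed pair as a chat-format fine-tuning record.
with open("product_descriptions.jsonl", "w") as f:
    for product, category, description in zip(products, categories, descriptions):
        record = {
            "messages": [
                {"role": "user", "content": f"{product}, {category}"},
                {"role": "assistant", "content": description},
            ]
        }
        f.write(json.dumps(record) + "\n")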

5. Dealing with imbalanced or non-diverse textual data

Some of the most important aspects of generating high-quality synthetic data are accuracy (does the data make sense), consistency (are two separate data points for the same input roughly the same), and diversity (making sure our data distribution matches, as much as possible, the distribution that exists in production).

To increase the diversity of our data, we start by clustering it. This gives us information about which clusters are underrepresented (an imbalanced dataset) or which data is not addressed at all (widening the data distribution). We will then either suggest new clusters (using a self-reflection-style call to GPT) or ask the next iteration of our synthetic-generation calls to explicitly target the underrepresented clusters.

We can then run this generate-and-analyze-clusters loop recursively to automate the production of diverse synthetic data.

For demonstration purposes, we explicitly prompt the LLM to generate information about 4 different topic areas: vehicle, clothing, toiletries, food. We will then cluster the data and see whether it managed to find these 4 topic areas.

output_string = ""
for i in range(3):
  question = f"""
  I am creating input output training pairs to fine tune my gpt model. I want the input to be product name and category and output to be description. the category should be things like: mobile phones, shoes, headphones, laptop, electronic toothbrush, etc. and also more importantly the categories should come under 4 main topics: vehicle, clothing, toiletries, food)
  After the number of each example also state the topic area. The format should be of the form:
  1. topic_area
  Input: product_name, category
  Output: description

  Do not add any extra characters around that formatting as it will make the output parsing break.

  Here are some helpful examples so you get the style of output correct.

  1) clothing
  Input: "Shoe Name, Shoes"
  Output: "Experience unparalleled comfort. These shoes feature a blend of modern style and the traditional superior cushioning, perfect for those always on the move."
  """

  response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
      {"role": "user", "content": question}
    ]
  )
  res = response.choices[0].message.content
  output_string += res + "\n" + "\n"
print(output_string[:1000]) #displaying truncated response
1. vehicle  
Input: "Tesla Model 3, Electric Car"  
Output: "The Tesla Model 3 is a revolutionary electric car with impressive range and cutting-edge technology, designed to provide an exhilarating driving experience while minimizing environmental impact."

2. clothing  
Input: "Nike Air Max, Shoes"  
Output: "Elevate your sneaker game with Nike Air Max. Combining iconic style with superior comfort and support, these shoes are perfect for both workouts and casual outings."

3. toiletries  
Input: "Oral-B Pro 1000, Electronic Toothbrush"  
Output: "Achieve a superior clean with the Oral-B Pro 1000. This electronic toothbrush features 3D cleaning action that pulsates and oscillates to remove more plaque than a regular manual toothbrush."

4. food  
Input: "Chobani Greek Yogurt, Yogurt"  
Output: "Indulge in a nutritious snack with Chobani Greek Yogurt. Packed with protein and delicious flavors, it’s the perfect choice for a healthy breakfast or a satisfying treat anytime."

5. vehicle  

Note: the above output is truncated. In the example above, we explicitly include the topic area as part of the response for each example, as it helps condition the subsequent output and tends to give better performance. We also give it an actual example of the output, so that it gets the right idea of the output style, which also helps enforce the structure.

pattern = re.compile(r'(\d+)\.\s*(\w+)\s*Input:\s*"(.+?),\s*(.+?)"\s*Output:\s*"(.*?)"', re.DOTALL)
matches = pattern.findall(output_string)

topics = []
products = []
categories = []
descriptions = []

for match in matches:
    number, topic, product, category, description = match
    topics.append(topic)
    products.append(product)
    categories.append(category)
    descriptions.append(description)
products
['Tesla Model 3',
 'Nike Air Max',
 'Oral-B Pro 1000',
 'Chobani Greek Yogurt',
 'Ford F-150',
 "Levi's 511",
 'Philips Sonicare',
 'Quaker Oatmeal',
 'Toyota Camry',
 'Adidas Ultraboost',
 'Toyota Camry',
 'Nike Air Max',
 'Colgate Electric Toothbrush',
 'Blue Diamond Almonds',
 'Harley Davidson Fat Boy',
 'Adidas UltraBoost',
 "Dove Men's Body Wash",
 'Quaker Oats',
 'Ford F-150',
 "Levi's 501 Jeans",
 'Tesla Model 3',
 'Nike Air Max',
 'Oral-B Pro 1000',
 'Organic Almond Butter',
 'Yamaha YZF-R3',
 'Adidas Ultraboost',
 'Philips Sonicare',
 'Organic Quinoa']

We will now cluster the data in order to analyze it. We will use K-means clustering to segregate the data. An important parameter of K-means is K, the number of clusters.

We know there should be 4 clusters (4 topics), since we specified this in the prompt: vehicle, clothing, toiletries, food. In general, however, we will not know the number of clusters that exist in our data. We therefore use the elbow method to find the optimal number of clusters.

In the elbow method, we iterate through a range of different Ks, storing the inertia each time. Inertia measures the sum of the squared distances between each point in a cluster and the centroid of that cluster, which tells us how well-separated and dense the clusters are. If we plot K against inertia, we can see how the inertia drops, and where the drop is least rapid (often forming an elbow shape) we can set our optimal number of clusters. You can read about the elbow method in more depth elsewhere.
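To make the inertia definition concrete, here is a tiny self-contained illustration we add (toy 2-D points rather than our embeddings): summing the squared distances from each point to its assigned centroid reproduces the inertia that scikit-learn reports.

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Inertia: sum over clusters of squared distances to the cluster centroid.
manual_inertia = sum(
    np.sum((points[km.labels_ == k] - centre) ** 2)
    for k, centre in enumerate(km.cluster_centers_)
)
print(manual_inertia, km.inertia_)  # the two values agree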

First, let's store our data in a pandas dataframe for ease of analysis.

data = {
    'Product': products,
    'Category': categories,
    'Description': descriptions
}

df = pd.DataFrame(data)

Next, let's embed our data, since the embeddings are what we will cluster: similar items should sit close to one another in vector space.

def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")

    response = client.embeddings.create(input=[text], model=model)

    return response.data[0].embedding

embedding_model = "text-embedding-3-small"
df["embedding"] = df.Category.apply(lambda x: get_embedding(x, model=embedding_model))

# Ensure there are embeddings to concatenate
if len(df.embedding.values) > 0:
    matrix = np.vstack(df.embedding.values)
else:
    matrix = np.array([])  # Handle the case where there are no embeddings
df
Product Category Description embedding
0 Tesla Model 3 Electric Car The Tesla Model 3 is a revolutionary electric c... [0.003255360759794712, -0.039260633289813995, ...
1 Nike Air Max Shoes Elevate your sneaker game with Nike Air Max. C... [0.03943369910120964, 0.022045187652111053, -0...
2 Oral-B Pro 1000 Electronic Toothbrush Achieve a superior clean with the Oral-B Pro 1... [-0.003470012918114662, -0.01911414973437786, ...
3 Chobani Greek Yogurt Yogurt Indulge in a nutritious snack with Chobani Gre... [0.0208318829536438, -0.02645781636238098, -0....
4 Ford F-150 Pickup Truck The Ford F-150 is the ultimate pickup truck, d... [0.007467855699360371, -0.05288049206137657, -...
5 Levi's 511 Jeans Step out in style with Levi's 511 jeans. Featu... [0.0037206460256129503, 0.022772302851080894, ...
6 Philips Sonicare Electronic Toothbrush Discover a new level of oral care with the Phi... [-0.00724813062697649, -0.011600878089666367, ...
7 Quaker Oatmeal Breakfast Cereal Start your day right with Quaker Oatmeal. This... [-0.006529285106807947, 0.007865572348237038, ...
8 Toyota Camry Sedan The Toyota Camry stands out in the sedan categ... [-0.02088991366326809, -0.006191295105963945, ...
9 Adidas Ultraboost Running Shoes Run like never before with the Adidas Ultraboo... [0.02679188922047615, 0.014639599248766899, 8....
10 Toyota Camry Car The Toyota Camry is a reliable midsize sedan t... [0.008056452497839928, -0.007912316359579563, ...
11 Nike Air Max Shoes Elevate your sneaker game with Nike Air Max, f... [0.03943241760134697, 0.02208484522998333, -0....
12 Colgate Electric Toothbrush Electronic Toothbrush Transform your oral hygiene routine with the C... [-0.003470012918114662, -0.01911414973437786, ...
13 Blue Diamond Almonds Nuts Snack healthy with Blue Diamond Almonds. These... [-0.013289917260408401, 0.036334190517663956, ...
14 Harley Davidson Fat Boy Motorcycle Experience the thrill of the open road with th... [0.012365399859845638, 0.03552943095564842, -0...
15 Adidas UltraBoost Sneakers Enjoy the perfect blend of comfort and perform... [0.013107392005622387, 0.02963760495185852, -0...
16 Dove Men's Body Wash Body Wash Refresh and hydrate your skin with Dove Men's ... [0.03760576993227005, -0.008475445210933685, -...
17 Quaker Oats Oats Start your day right with Quaker Oats. Packed ... [-0.00903365109115839, 0.00896345917135477, 0....
18 Ford F-150 Truck The Ford F-150 is a durable and dependable tru... [0.023461222648620605, -0.026651185005903244, ...
19 Levi's 501 Jeans Jeans Discover the timeless style of Levi's 501 jean... [0.003762696636840701, 0.02275814116001129, -0...
20 Tesla Model 3 Mobile Phones Explore the future of driving with the Tesla M... [0.03703858703374863, 0.03407958149909973, 0.0...
21 Nike Air Max Shoes Elevate your game with Nike Air Max. Designed ... [0.03943369910120964, 0.022045187652111053, -0...
22 Oral-B Pro 1000 Electronic Toothbrush Achieve a superior clean with the Oral-B Pro 1... [-0.003470012918114662, -0.01911414973437786, ...
23 Organic Almond Butter Food Indulge in the creamy goodness of Organic Almo... [-0.014613640494644642, -0.002179765608161688,...
24 Yamaha YZF-R3 Mobile Phones Introducing the Yamaha YZF-R3, the ultimate sp... [0.03703858703374863, 0.03407958149909973, 0.0...
25 Adidas Ultraboost Shoes Discover the Adidas Ultraboost, a shoe that of... [0.03944042697548866, 0.022062409669160843, -0...
26 Philips Sonicare Electronic Toothbrush Experience the revolution in dental care with ... [-0.003470012918114662, -0.01911414973437786, ...
27 Organic Quinoa Food Nourish your body with Organic Quinoa, a nutri... [-0.014613640494644642, -0.002179765608161688,...

Now we perform the elbow method.

# Determine the optimal number of clusters using the elbow method
inertias = []
range_of_clusters = range(1, 13)  # Adjust the range as necessary

for n_clusters in range_of_clusters:
    kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init=10)
    kmeans.fit(matrix)
    inertias.append(kmeans.inertia_)

This outputs a chart in which we have to judge visually where the optimal cluster point is. We can see below a gradual decrease in inertia rather than a sharp elbow, but the steepest drop appears to occur around 3, 4, or 5 clusters, which lines up with our expectations given the prompt.

# Plotting the elbow plot
plt.figure(figsize=(10, 6))
plt.plot(range_of_clusters, inertias, '-o')
plt.title('Elbow Method to Determine Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.xticks(range_of_clusters)
plt.show()
(elbow_chart: plot of inertia against the number of clusters, generated by the notebook)

For demonstration purposes we will pick 5 as the optimal number of clusters, to show that it doesn't matter exactly where we land as long as we are approximately right. There are numerous valid ways to categorize data. We also store the cluster each data point belongs to.

n_clusters = 5

kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42)
kmeans.fit(matrix)
labels = kmeans.labels_
df["Cluster"] = labels

We will now analyze the clustered data, addressing two separate things: 1. imbalanced data, and 2. expanding the data distribution.

First, for imbalanced data, we count the number of examples in each cluster. Then we select a few examples from each cluster at random and ask the LLM which topics these map to.

cluster_counts = df["Cluster"].value_counts().sort_index()
print(cluster_counts)
Cluster
0    5
1    7
2    8
3    6
4    2
Name: count, dtype: int64

We can see that the topics found (as identified by the LLM below): automotive, personal care, footwear, and food, match well with our initial prompt of vehicle, clothing, toiletries, food, but not exactly.

Since we chose 5 clusters, the vehicles were split across two automotive clusters, which doesn't affect us much further downstream.

df
Product Category Description embedding Cluster
0 Tesla Model 3 Electric Car The Tesla Model 3 is a revolutionary electric c... [0.003255360759794712, -0.039260633289813995, ... 1
1 Nike Air Max Shoes Elevate your sneaker game with Nike Air Max. C... [0.03943369910120964, 0.022045187652111053, -0... 2
2 Oral-B Pro 1000 Electronic Toothbrush Achieve a superior clean with the Oral-B Pro 1... [-0.003470012918114662, -0.01911414973437786, ... 1
3 Chobani Greek Yogurt Yogurt Indulge in a nutritious snack with Chobani Gre... [0.0208318829536438, -0.02645781636238098, -0.... 3
4 Ford F-150 Pickup Truck The Ford F-150 is the ultimate pickup truck, d... [0.007467855699360371, -0.05288049206137657, -... 0
5 Levi's 511 Jeans Step out in style with Levi's 511 jeans. Featu... [0.0037206460256129503, 0.022772302851080894, ... 2
6 Philips Sonicare Electronic Toothbrush Discover a new level of oral care with the Phi... [-0.00724813062697649, -0.011600878089666367, ... 1
7 Quaker Oatmeal Breakfast Cereal Start your day right with Quaker Oatmeal. This... [-0.006529285106807947, 0.007865572348237038, ... 3
8 Toyota Camry Sedan The Toyota Camry stands out in the sedan categ... [-0.02088991366326809, -0.006191295105963945, ... 0
9 Adidas Ultraboost Running Shoes Run like never before with the Adidas Ultraboo... [0.02679188922047615, 0.014639599248766899, 8.... 2
10 Toyota Camry Car The Toyota Camry is a reliable midsize sedan t... [0.008056452497839928, -0.007912316359579563, ... 0
11 Nike Air Max Shoes Elevate your sneaker game with Nike Air Max, f... [0.03943241760134697, 0.02208484522998333, -0.... 2
12 Colgate Electric Toothbrush Electronic Toothbrush Transform your oral hygiene routine with the C... [-0.003470012918114662, -0.01911414973437786, ... 1
13 Blue Diamond Almonds Nuts Snack healthy with Blue Diamond Almonds. These... [-0.013289917260408401, 0.036334190517663956, ... 3
14 Harley Davidson Fat Boy Motorcycle Experience the thrill of the open road with th... [0.012365399859845638, 0.03552943095564842, -0... 0
15 Adidas UltraBoost Sneakers Enjoy the perfect blend of comfort and perform... [0.013107392005622387, 0.02963760495185852, -0... 2
16 Dove Men's Body Wash Body Wash Refresh and hydrate your skin with Dove Men's ... [0.03760576993227005, -0.008475445210933685, -... 1
17 Quaker Oats Oats Start your day right with Quaker Oats. Packed ... [-0.00903365109115839, 0.00896345917135477, 0.... 3
18 Ford F-150 Truck The Ford F-150 is a durable and dependable tru... [0.023461222648620605, -0.026651185005903244, ... 0
19 Levi's 501 Jeans Jeans Discover the timeless style of Levi's 501 jean... [0.003762696636840701, 0.02275814116001129, -0... 2
20 Tesla Model 3 Mobile Phones Explore the future of driving with the Tesla M... [0.03703858703374863, 0.03407958149909973, 0.0... 4
21 Nike Air Max Shoes Elevate your game with Nike Air Max. Designed ... [0.03943369910120964, 0.022045187652111053, -0... 2
22 Oral-B Pro 1000 Electronic Toothbrush Achieve a superior clean with the Oral-B Pro 1... [-0.003470012918114662, -0.01911414973437786, ... 1
23 Organic Almond Butter Food Indulge in the creamy goodness of Organic Almo... [-0.014613640494644642, -0.002179765608161688,... 3
24 Yamaha YZF-R3 Mobile Phones Introducing the Yamaha YZF-R3, the ultimate sp... [0.03703858703374863, 0.03407958149909973, 0.0... 4
25 Adidas Ultraboost Shoes Discover the Adidas Ultraboost, a shoe that of... [0.03944042697548866, 0.022062409669160843, -0... 2
26 Philips Sonicare Electronic Toothbrush Experience the revolution in dental care with ... [-0.003470012918114662, -0.01911414973437786, ... 1
27 Organic Quinoa Food Nourish your body with Organic Quinoa, a nutri... [-0.014613640494644642, -0.002179765608161688,... 3
selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(3, replace=True)).reset_index(drop=True)

# Format the selected examples
formatted_examples = "\n".join(
    f'Input: "{row["Product"]}, {row["Category"]}"\nOutput: "{row["Description"]}"\nCluster: "{row["Cluster"]}"'
    for _, row in selected_examples.iterrows()
)

topic_prompt = f"""
    I previously generated some examples of input output trainings pairs and then I clustered them based on category. From each cluster I picked 3 example data point which you can find below.
    I want you identify the broad topic areas these clusters belong to.
    Previous examples:
    {formatted_examples}


    Your output should be strictly of the format:
    Cluster: number, topic: topic
    Cluster: number, topic: topic
    Cluster: number, topic: topic

    Do not add any extra characters around that formatting as it will make the output parsing break.
    """

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed analyze clustered data"},
    {"role": "user", "content": topic_prompt}
  ]
)
res = response.choices[0].message.content

pattern = r"Cluster: (\d+), topic: ([^\n]+)"
matches = re.findall(pattern, res)
clusters = [{"cluster": int(cluster), "topic": topic} for cluster, topic in matches]
json_output = json.dumps(clusters, indent=2)
print(json_output)
[
  {
    "cluster": 0,
    "topic": "Automotive  "
  },
  {
    "cluster": 1,
    "topic": "Personal Care  "
  },
  {
    "cluster": 2,
    "topic": "Footwear  "
  },
  {
    "cluster": 3,
    "topic": "Food  "
  },
  {
    "cluster": 4,
    "topic": "Automotive  "
  }
]

We now have the clusters and their counts, so we could prompt the LLM to generate more examples within the topics we want. However, for this example we won't take it further, since the clusters are well split; you would simply follow the procedure above for prompting the model to generate data, while passing in the underrepresented topics (a sketch of such a call follows).
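Below is a sketch of what that targeted call could look like. It is our addition, reusing the cluster_counts and clusters objects from above to pick the smallest cluster and request examples only for its topic.

smallest_cluster = cluster_counts.idxmin()
target_topic = next(c["topic"] for c in clusters if c["cluster"] == smallest_cluster).strip()

question = f"""
I am creating input output training pairs to fine tune my gpt model. I want the input to be product name and category and output to be description.
All examples must come under this underrepresented topic area: {target_topic}.
The format should be of the form:
1. {target_topic}
Input: product_name, category
Output: description

Do not add any extra characters around that formatting as it will make the output parsing break.
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)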

Next, we will try to deal with increasing the diversity of our data distribution.

We start in a similar way, by finding a few examples from each cluster at random and asking the LLM which topics these map to. In the same LLM call, we also ask it to generate further topics to increase the diversity of our data. We do this in a single call to save time and cost.

selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(3, replace=True)).reset_index(drop=True)

# Format the selected examples
formatted_examples = "\n".join(
    f'Input: "{row["Product"]}, {row["Category"]}"\nOutput: "{row["Description"]}"\nCluster: "{row["Cluster"]}"'
    for _, row in selected_examples.iterrows()
)

topic_prompt = f"""
    I previously generated some examples of input output trainings pairs and then I clustered them based on category. From each cluster I picked 3 example data point which you can find below.
    I want to promote diversity in my examples across categories so follow the procedure below:
    1. You must identify the broad topic areas these clusters belong to.
    2. You should generate further topic areas which don't exist so I can generate data within these topics to improve diversity.


    Previous examples:
    {formatted_examples}


    Your output should be strictly of the format:

    1. Cluster topic mapping
    Cluster: number, topic: topic
    Cluster: number, topic: topic
    Cluster: number, topic: topic

    2. New topics
    1. topic
    2. topic
    3. topic
    4. topic

    Do not add any extra characters around that formatting as it will make the output parsing break. It is very important you stick to that output format
    """

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to analyze clustered data"},
    {"role": "user", "content": topic_prompt}
  ]
)
res = response.choices[0].message.content
print(res)
1. Cluster topic mapping
Cluster: 0, topic: Automotive
Cluster: 1, topic: Personal Care
Cluster: 2, topic: Footwear
Cluster: 3, topic: Food
Cluster: 4, topic: Electric Vehicles

2. New topics
1. topic: Home Appliances
2. topic: Outdoor Equipment
3. topic: Smart Home Technology
4. topic: Fitness Equipment

Again, we explicitly prompt for the output structure it should follow. We also tell it the purpose of generating topics (to promote diversity), so the model has the full context.

We then parse the data into a list of cluster-mapping jsons and a list of topics.

parts = res.split("\n\n")
cluster_mapping_part = parts[0]
new_topics_part = parts[1]

# Parse cluster topic mapping
cluster_topic_mapping_lines = cluster_mapping_part.split("\n")[1:]  # Skip the first two lines
cluster_topic_mapping = [{"cluster": int(line.split(",")[0].split(":")[1].strip()), "topic": line.split(":")[2].strip()} for line in cluster_topic_mapping_lines]

# Parse new topics
new_topics_lines = new_topics_part.split("\n")[1:]  # Skip the first line
new_topics = [line.split(". ")[1] for line in new_topics_lines]

cluster_topic_mapping, new_topics
([{'cluster': 0, 'topic': 'Automotive'},
  {'cluster': 1, 'topic': 'Personal Care'},
  {'cluster': 2, 'topic': 'Footwear'},
  {'cluster': 3, 'topic': 'Food'},
  {'cluster': 4, 'topic': 'Electric Vehicles'}],
 ['topic: Home Appliances',
  'topic: Outdoor Equipment',
  'topic: Smart Home Technology',
  'topic: Fitness Equipment'])

Finally, we can use this information to prompt the model to keep generating synthetic data. We do this by passing all the topics in the json list into the prompt below.

output_string = ""
for i in range(3):
  question = f"""
  I am creating input output training pairs to fine tune my gpt model. I want the input to be product name and category and output to be description. the category should be things like: mobile phones, shoes, headphones, laptop, electronic toothbrush, etc. and also more importantly the categories should come under some main topics: {[entry['topic'] for entry in cluster_topic_mapping]})
  After the number of each example also state the topic area. The format should be of the form:
  1. topic_area
  Input: product_name, category
  Output: description

  Do not add any extra characters around that formatting as it will make the output parsing break.

  Here are some helpful examples so you get the style of output correct.

  1) clothing
  Input: "Shoe Name, Shoes"
  Output: "Experience unparalleled comfort. These shoes feature a blend of modern style and the traditional superior cushioning, perfect for those always on the move."
  """

  response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
      {"role": "user", "content": question}
    ]
  )
  res = response.choices[0].message.content
  output_string += res + "\n" + "\n"
print(output_string)
1. Automotive  
Input: "Tesla Model S, Electric Vehicles"  
Output: "The Tesla Model S delivers exhilarating performance with advanced electric technology, offering a sleek design, impressive range, and an industry-leading infotainment system."

2. Personal Care  
Input: "Oral-B Pro 1000, Electronic Toothbrush"  
Output: "The Oral-B Pro 1000 features a 3D cleaning action that oscillates, rotates, and pulsates to remove plaque, ensuring a deeper clean for healthier gums."

3. Footwear  
Input: "Nike Air Max 270, Shoes"  
Output: "Step into comfort and style with Nike Air Max 270, designed with a large Max Air unit for superior cushioning and a breathable upper for a snug fit."

4. Electronics  
Input: "Apple iPhone 12, Mobile Phones"  
Output: "The Apple iPhone 12 combines powerful performance with stunning design, equipped with A14 Bionic chip and advanced camera systems for capturing every moment in stunning detail."

5. Food  
Input: "Nature Valley Granola Bars, Snacks"  
Output: "Nature Valley Granola Bars offer a wholesome crunch made from simple, delicious ingredients, providing a perfect snack that fuels your adventure."

6. Automotive  
Input: "Ford F-150, Electric Vehicles"  
Output: "The Ford F-150 stands at the forefront of durability and innovation, with its powerful electric version setting new standards for strength and sustainability in the truck category." 

7. Personal Care  
Input: "Philips Sonicare, Electronic Toothbrush"  
Output: "Philips Sonicare delivers superior cleaning with dynamic technology that provides up to 31,000 strokes per minute for a healthier mouth and brighter smile."

8. Footwear  
Input: "Adidas Ultraboost, Shoes"  
Output: "The Adidas Ultraboost is a game-changer in running footwear, featuring responsive cushioning and a knit upper for a snug, supportive fit that adapts to any run."

9. Electronics  
Input: "Dell XPS 13, Laptop"  
Output: "The Dell XPS 13 is a remarkable laptop with an ultra-thin design, featuring a stunning InfinityEdge display and powerful performance to accommodate your multitasking needs."

10. Food  
Input: "Kraft Macaroni & Cheese, Instant Food"  
Output: "Kraft Macaroni & Cheese offers quick and convenient comfort food, combining creamy cheese sauce with perfectly cooked pasta for a simple meal that satisfies."

1. Automotive  
Input: "Toyota Camry, Mobile Phones"  
Output: "The Toyota Camry is a midsize sedan that combines efficiency with modern technology. It offers a spacious interior and the latest features for an enjoyable driving experience."

2. Personal Care  
Input: "Oral-B Pro 1000, Electronic Toothbrush"  
Output: "The Oral-B Pro 1000 not only provides powerful cleaning action but also enhances your oral hygiene routine with its smart pressure sensor and various cleaning modes."

3. Footwear  
Input: "Nike Air Max, Shoes"  
Output: "Step into comfort with the Nike Air Max. With cutting-edge technology and a sleek design, these shoes are perfect for athletes and casual wearers alike."

4. Food  
Input: "Nature's Valley Granola Bar, Food"  
Output: "Savor the wholesome goodness of Nature's Valley Granola Bar, crafted with real ingredients to fuel your day with delicious flavor and crunchy satisfaction."

5. Electric Vehicles  
Input: "Tesla Model 3, Mobile Phones"  
Output: "The Tesla Model 3 is a revolutionary electric vehicle that combines performance with sustainability, featuring an intuitive interface and cutting-edge technology for an exceptional driving experience."

1. Automotive  
Input: "Tesla Model 3, Electric Vehicles"  
Output: "The Tesla Model 3 combines cutting-edge technology with eco-friendly driving. Enjoy a sleek design, impressive range, and top-notch safety features, making it the perfect electric car for the modern driver."

2. Personal Care  
Input: "Oral-B Pro 1000, Electronic Toothbrush"  
Output: "Achieve a superior clean with the Oral-B Pro 1000. Featuring advanced 3D cleaning action, this electronic toothbrush ensures effective plaque removal while being gentle on gums, allowing you to maintain optimum oral health."

3. Footwear  
Input: "Nike Air Max, Shoes"  
Output: "Step up your game with Nike Air Max shoes. Combining iconic cushioning technology and bold style, these shoes provide ultimate comfort and support, perfect for both casual wear and athletic performance."

4. Food  
Input: "Oreo Cookies, Snacks"  
Output: "Indulge in the classic taste of Oreo Cookies. With their irresistible cream filling sandwiched between two crunchy chocolate wafers, these treats are perfect for satisfying your sweet tooth any time of the day."

5. Personal Care  
Input: "Garnier Micellar Water, Skincare"  
Output: "Garnier Micellar Water gently removes makeup and impurities while hydrating the skin. This soothing formula is suitable for all skin types, making it a must-have in your daily skincare routine."

6. Automotive  
Input: "Ford F-150, Trucks"  
Output: "The Ford F-150 is the quintessential pickup truck, combining power, reliability, and innovative technology. Equipped with advanced towing capabilities and a spacious interior, it's designed for both work and play."

7. Electronics  
Input: "Samsung Galaxy S21, Mobile Phones"  
Output: "Experience the future of mobile technology with the Samsung Galaxy S21. This smartphone features a stunning display, powerful processor, and multiple camera options, perfect for capturing life's moments in high definition."

8. Footwear  
Input: "Adidas Ultraboost, Shoes"  
Output: "Run in style with Adidas Ultraboost shoes. Known for their comfort and performance, these shoes utilize responsive cushioning to provide unmatched energy return with every step you take." 

9. Electronics  
Input: "Dell XPS 13, Laptops"  
Output: "The Dell XPS 13 redefines the laptop experience with its stunning InfinityEdge display, powerful performance, and sleek design. Ideal for both professionals and students looking for portability and functionality."

10. Personal Care  
Input: "Philips Sonicare, Electronic Toothbrush"  
Output: "Philips Sonicare's electronic toothbrush guarantees a superior cleaning experience with its advanced sonic technology. This toothbrush not only helps remove plaque but also promotes healthier gums for a brighter smile."


You can run this in a loop to append to your previous data; in this way you can keep generating more textual synthetic data to train another GPT model, while making sure we cater to imbalanced datasets and generate diverse data.
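A rough sketch of that loop (our addition; it reuses the diversity-targeted question from above and re-declares the pair regex from earlier in this section, since pattern was later reassigned to the cluster-parsing regex):

pair_pattern = re.compile(r'(\d+)\.\s*(\w+)\s*Input:\s*"(.+?),\s*(.+?)"\s*Output:\s*"(.*?)"', re.DOTALL)

for _ in range(3):
  response = client.chat.completions.create(
    model=datagen_model,
    messages=[
      {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
      {"role": "user", "content": question}
    ]
  )
  batch = pair_pattern.findall(response.choices[0].message.content)
  new_rows = pd.DataFrame(
      [{"Product": p.strip(), "Category": c.strip(), "Description": d.strip()}
       for _, _, p, c, d in batch]
  )
  # The embedding and Cluster columns are left empty here; recompute them
  # before re-running the clustering analysis.
  df = pd.concat([df, new_rows], ignore_index=True)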

You have now completed Part 1 of the synthetic data generation tutorial, in which we covered:

  • CSV with a structured prompt
  • CSV with a Python program
  • Multitable CSV with a Python program
  • Simply creating textual data
  • Dealing with imbalanced or non-diverse textual data

In Part 2 we will look at techniques for better prompting an LLM to enhance the generation of textual synthetic data.