Using reasoning for data validation | OpenAI Cookbook

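In this guide, we'll explore how to use the o1 models, specifically o1-preview, to perform data validation through reasoning. We'll walk through a practical example involving a synthetic medical dataset and demonstrate how to assess the model's accuracy in identifying issues within the data.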

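Overview

Data validation is a critical step in ensuring the quality and reliability of datasets, especially in sensitive fields like healthcare. Traditional validation methods often rely on predefined rules and patterns. However, advanced models like o1 can understand context and reason about data, offering a more flexible and intelligent approach to validation.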

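In this tutorial, we will:

Generate a synthetic dataset of medical data that contains inconsistencies.
Define a function that takes in a row of data and validates its accuracy.
Run the validation process and compute accuracy metrics.
Analyze and interpret the results.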

from openai import OpenAI
import json
from IPython.display import display, HTML
from sklearn.metrics import precision_score, recall_score, f1_score
from concurrent.futures import ThreadPoolExecutor, as_completed
import csv
import pandas as pd

client = OpenAI()
MODEL = 'o1-preview'

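We will lean heavily on the principles described in the Synthetic Data Generation cookbook to create the foundation of our dataset.

We will prompt the model to generate a set of medical data for our use case. The prompt gives the model detailed instructions on how to create the dataset, what format to follow, and how to seed it with inaccuracies. It also includes a few example rows to get the model started.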

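Each row in the dataset will have the following fields:

Patient ID: A randomly generated patient ID
Date of Birth: Date of birth of the patient
Gender: M/F
Medical History: Past diagnoses
Current Medications: Medication the patient is taking
Allergies: Identified allergies
Lab Results (Glucose mg/dL)
Diagnoses: Current diagnosis
Treatment Plan: Current treatment plan
Is Valid: Whether or not the current row of data is valid (True/False)
Issue: If the row of data is not valid, what the issue is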
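
To make the field layout concrete, here is a minimal sketch that parses rows in this format with Python's `csv.DictReader` and separates the fields shown to the validator from the ground-truth labels. The two sample rows are copied from the example rows in the generation prompt; the helper name `split_row` and the constant `LABEL_COLUMNS` are hypothetical and not part of the original notebook.

```python
import csv
import io

# Header plus two example rows taken from the generation prompt.
SAMPLE_CSV = """Patient ID,Date of Birth,Gender,Medical History,Current Medications,Allergies,Lab Results (Glucose mg/dL),Diagnoses,Treatment Plan,Is Valid,Issue
P001,1980-05-14,M,Hypertension,Lisinopril,None,110,Hypertension,Continue Lisinopril,True,
P004,2000-03-10,M,None,Amoxicillin,Penicillin,95,Infection,Prescribe Amoxicillin,False,Prescribed Amoxicillin despite Penicillin allergy
"""

LABEL_COLUMNS = ("Is Valid", "Issue")  # hypothetical name for the two label fields


def split_row(row):
    """Separate the fields shown to the validator from the ground-truth labels."""
    features = {k: v for k, v in row.items() if k not in LABEL_COLUMNS}
    labels = {"is_valid": row["Is Valid"] == "True", "issue": row["Issue"]}
    return features, labels


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
parsed = [split_row(r) for r in rows]

for features, labels in parsed:
    print(features["Patient ID"], labels)
```

Keeping the labels out of the model input matters here: the validator should only ever see the nine descriptive fields.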

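Some examples of inaccuracies that may be present in the data are:

Prescribing medication that the patient is allergic to
Current medications do not match medical history
Treatment plan does not match diagnosis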
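
For contrast with the reasoning-based approach, the first kind of error can be caught by a hand-written rule, but only for the drug classes someone remembers to encode. The sketch below is illustrative only: the `DRUG_CLASS` mapping and the `has_allergy_conflict` function are hypothetical, cover just two drugs, and assume multiple entries are separated by semicolons as in the example rows.

```python
# Hypothetical, deliberately tiny mapping from drug name to allergy class.
DRUG_CLASS = {
    "amoxicillin": "penicillin",
    "penicillin": "penicillin",
}


def has_allergy_conflict(medications, allergies):
    """Return True if any medication belongs to a class the patient is allergic to."""
    allergy_classes = {a.strip().lower() for a in allergies.split(";") if a.strip()}
    for med in medications.split(";"):
        med_class = DRUG_CLASS.get(med.strip().lower())
        if med_class is not None and med_class in allergy_classes:
            return True
    return False


print(has_allergy_conflict("Amoxicillin", "Penicillin"))  # True
print(has_allergy_conflict("Metformin", "Penicillin"))    # False
```

A rule like this misses anything outside its mapping, which is the gap the reasoning model is meant to cover.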

def generate_data():
    messages = [
        {
            "role": "user",
            "content": """
You are a helpful assistant designed to generate data. You will be given a format for the data to generate and some examples of the data.

When generating Patient IDs, use the format 'P' followed by a three-digit number (e.g., P006, P941, P319).

Intentionally make some mistakes in the data generation and document them in the appropriate columns ('Is Valid' and 'Issue') if the row of data is invalid.

The types of mistakes to include are:

- **Allergy Contradictions**: Prescribing a medication that the patient is allergic to (e.g., prescribing Penicillin to a patient allergic to Penicillin).
- **Medical History and Medication Mismatch**: A patient with a medical condition not receiving appropriate medication (e.g., a diabetic patient not prescribed any diabetes medication).
- **Lab Results and Diagnosis Mismatch**: Lab results that do not support the diagnosis (e.g., normal glucose levels but diagnosed with Diabetes Type 2).
- **Other Plausible Mistakes**: Any other realistic errors that could occur in medical records, such as incorrect gender entries, impossible dates of birth, or inconsistent treatment plans.

Ensure that when 'Is Valid' is 'False', the 'Issue' column clearly explains the problem.

Return 100 rows of data for the user. Your response should strictly be in the format of a valid CSV.

Generate Synthetic Medical Records Dataset with the following columns:
    - Patient ID: A randomly generated patient id
    - Date of Birth: Date of birth of the patient
    - Gender: M/F
    - Medical History: Past diagnoses
    - Current Medications: Medication the patient is taking
    - Allergies: Identified allergies
    - Lab Results (Glucose mg/dL)
    - Diagnoses: Current diagnosis
    - Treatment Plan: Current treatment plan
    - Is Valid: Whether or not the current row of data is valid (True/False)
    - Issue: If the row of data is not valid, what the issue is

Patient ID,Date of Birth,Gender,Medical History,Current Medications,Allergies,Lab Results (Glucose mg/dL),Diagnoses,Treatment Plan,Is Valid,Issue
P001,1980-05-14,M,Hypertension,Lisinopril,None,110,Hypertension,Continue Lisinopril,True,
P002,1975-11-30,F,Diabetes Type 2,Metformin,Penicillin,90,Diabetes Type 2,Continue Metformin,True,
P003,1990-07-22,F,Asthma,Albuterol,Aspirin,85,Asthma,Prescribe Albuterol,True,
P004,2000-03-10,M,None,Amoxicillin,Penicillin,95,Infection,Prescribe Amoxicillin,False,Prescribed Amoxicillin despite Penicillin allergy
P005,1985-09-18,F,Hyperlipidemia,Atorvastatin,None,200,Hyperlipidemia,Continue Atorvastatin,True,
P006,1978-12-05,M,Hypertension; Diabetes Type 2,Lisinopril; Insulin,None,55,Diabetes Type 2,Adjust insulin dosage,False,Low glucose level not properly addressed
            """
        }
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )

    # Strip any Markdown code fences the model wraps around the CSV
    return response.choices[0].message.content.replace('```csv', '').replace('```', '')

# Generate data using the generate_data function defined above
data = generate_data()

# Parse the response as CSV so that quoted fields containing commas stay intact
generated_rows = [row for row in csv.reader(data.strip().splitlines()) if row]

# Append the generated rows to the medicalData.csv file
with open('../data/medicalData.csv', 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerows(generated_rows)

print("Synthetic data generation and appending completed.")

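Now that we have our dataset prepared, we will prompt the reasoning model to review each row of data and determine whether or not it contains an issue. We will ask the model to output whether there is an issue in the data and then offer an explanation of the issue.

Once the model has determined its list of invalid data, we will pass those results on to a model grader to assess two metrics:

Accuracy of the model's ability to correctly identify issues with the data
For the subset of data where issues have been correctly identified, how accurately the model identifies the issue at hand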

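Given that this task is much narrower, we can use the faster gpt-4o model to calculate the accuracy.

REMINDER: Given that these models are still in beta, rate limits will be significantly reduced. Please adjust the number of concurrent workers accordingly.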

def validate_data(input_data):
    messages = [
        {
            "role": "user",
            "content": f"""
You are a helpful assistant designed to validate the quality of medical datasets.

You will be given a single row of medical data, and your task is to determine whether the data is valid.

- Carefully analyze the data for any inconsistencies, contradictions, missing values, or implausible information.
- Consider the logical relationships between different fields (e.g., treatments should be appropriate for the diagnoses, medications should not conflict with allergies, lab results should be consistent with diagnoses, etc.).
- Use your general medical knowledge to assess the validity of the data.
- Focus solely on the information provided without making assumptions beyond the given data.

**Return only a JSON object** with the following two properties:

- `"is_valid"`: a boolean (`true` or `false`) indicating whether the data is valid.
- `"issue"`: if `"is_valid"` is `false`, provide a brief explanation of the issue; if `"is_valid"` is `true`, set `"issue"` to `null`.

Both JSON properties must always be present.

Do not include any additional text or explanations outside the JSON object.

MEDICAL DATA:
{input_data}
            """
        }
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )

    # Strip any Markdown code fences before parsing the JSON payload
    response_content = response.choices[0].message.content.replace('```json', '').replace('```', '').strip()

    try:
        return json.loads(response_content)
    except json.JSONDecodeError:
        print(f"Failed to decode JSON response: {response_content}")
        raise
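
Because the model is asked for a bare JSON object but may still wrap it in a code fence or omit a key, a slightly stricter parser can help. This is a sketch under that assumption; `parse_validation_response` is a hypothetical helper and not part of the original notebook.

```python
import json

FENCE = "`" * 3  # three backticks, built indirectly so this example stays fence-safe


def parse_validation_response(text):
    """Strip optional Markdown fences, parse the JSON, and check the expected keys."""
    cleaned = text.replace(FENCE + "json", "").replace(FENCE, "").strip()
    data = json.loads(cleaned)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"is_valid", "issue"} - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["is_valid"], bool):
        raise ValueError("'is_valid' must be a boolean")
    return data


raw = (
    FENCE + "json\n"
    '{"is_valid": false, "issue": "Prescribed Amoxicillin despite Penicillin allergy"}\n'
    + FENCE
)
result = parse_validation_response(raw)
print(result)
```

Failing loudly on a malformed response makes it easier to spot rows that need a retry instead of silently recording a wrong prediction.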

# Read the CSV file once, separating the model inputs from the true labels
input_data = []
true_is_valid = []
true_issues = []

with open('../data/medicalData.csv', 'r') as file:
    reader = csv.reader(file)
    headers = next(reader)
    for row in reader:
        input_data.append(row[:-2])  # Exclude "Is Valid" and "Issue" columns
        true_is_valid.append(row[-2] == 'True')
        true_issues.append(row[-1])

# Function to validate a single row of data
def validate_row(row):
    input_str = ','.join(row)
    result_json = validate_data(input_str)
    return result_json

# Validate data rows and collect results
pred_is_valid = [False] * len(input_data)
pred_issues = [''] * len(input_data)

# Pass max_workers=... to ThreadPoolExecutor to stay within your rate limits
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(validate_row, row): i for i, row in enumerate(input_data)}

    for future in as_completed(futures):
        i = futures[future]  # Get the index of the current row
        result_json = future.result()
        pred_is_valid[i] = result_json['is_valid']
        pred_issues[i] = result_json['issue']

Now that we have the model's results, we can compare them against the source of truth and determine the system's accuracy.

# Convert predicted and true 'is_valid' labels to boolean if they aren't already
pred_is_valid_bool = [bool(val) if isinstance(val, bool) else val == 'True' for val in pred_is_valid]
true_is_valid_bool = [bool(val) if isinstance(val, bool) else val == 'True' for val in true_is_valid]

# Calculate precision, recall, and f1 score for the 'is_valid' prediction
precision = precision_score(true_is_valid_bool, pred_is_valid_bool, pos_label=True)
recall = recall_score(true_is_valid_bool, pred_is_valid_bool, pos_label=True)
f1 = f1_score(true_is_valid_bool, pred_is_valid_bool, pos_label=True)

# Initialize issue_matches_full with False
issue_matches_full = [False] * len(true_is_valid)
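
To make the metric definitions concrete, here is a small pure-Python sketch that computes precision, recall, and F1 for a toy pair of label lists, treating "valid" (`True`) as the positive class to match `pos_label=True` above. The labels are made up for illustration and are not results from the dataset; `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 with True as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)        # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)    # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)    # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels: 3 true positives, 1 false positive, 1 false negative
y_true = [True, True, True, False, False, True]
y_pred = [True, True, False, False, True, True]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```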

We will now determine the model's ability to accurately classify the issue in the data.

def validate_issue(model_generated_answer, correct_answer):
    messages = [
        {
            "role": "user",
            "content": f"""
You are a medical expert assistant designed to validate the quality of an LLM-generated answer.

The model was asked to review a medical dataset row to determine if the data is valid. If the data is not valid, it should provide a justification explaining why.

Your task:

    • Compare the model-generated justification with the correct reason provided.
    • Determine if they address the same underlying medical issue or concern, even if phrased differently.
    • Focus on the intent, medical concepts, and implications rather than exact wording.

Instructions:

    • If the justifications have the same intent or address the same medical issue, return True.
    • If they address different issues or concerns, return False.
    • Only respond with a single word: True or False.

Examples:

    1. Example 1:
    • Model Generated Response: “The patient is allergic to penicillin”
    • Correct Response: “The patient was prescribed penicillin despite being allergic”
    • Answer: True
    2. Example 2:
    • Model Generated Response: “The date of birth of the patient is incorrect”
    • Correct Response: “The patient was prescribed penicillin despite being allergic”
    • Answer: False

Model Generated Response: {model_generated_answer}
Correct Response: {correct_answer}
            """
        }
    ]

    response = client.chat.completions.create(
        model="o1-preview",  # grader model; a faster model such as gpt-4o can be substituted, as noted above
        messages=messages
    )

    # Strip surrounding whitespace so the caller's comparison with 'True' is exact
    result = response.choices[0].message.content.strip()

    return result

# Validate issues for rows where both true and predicted 'is_valid' are False
validation_results = []

with ThreadPoolExecutor() as executor:
    futures = {
        executor.submit(validate_issue, pred_issues[i], true_issues[i]): i
        for i in range(len(pred_is_valid_bool))
        if not pred_is_valid_bool[i] and not true_is_valid_bool[i]
    }

    for future in as_completed(futures):
        i = futures[future]  # Get the original index
        issue_match = future.result()
        issue_matches_full[i] = (issue_match.strip() == 'True')
        validation_results.append({
            "index": i,
            "predicted_issue": pred_issues[i],
            "true_issue": true_issues[i],
            "issue_match": issue_matches_full[i]
        })

# Calculate issue accuracy (guarding against an empty result list)
issue_accuracy = (
    sum(r['issue_match'] for r in validation_results) / len(validation_results)
    if validation_results else 0.0
)

# Store the results in the dictionary
model_results = {
    "precision": precision,
    "recall": recall,
    "f1": f1,
    "issue_accuracy": issue_accuracy
}

# Create a DataFrame to store the results
df_results = pd.DataFrame([model_results])

# Create a DataFrame to store the validation results for each row
df_validation_results = pd.DataFrame(validation_results)

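Below we display the subset of rows that we correctly identified as containing an issue. For each row, we show the predicted issue versus the true issue, and whether or not the two match.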

def display_formatted_dataframe(df):
    def format_text(text):
        # Render newlines as HTML line breaks
        return text.replace('\n', '<br>')

    df_formatted = df.copy()
    df_formatted['predicted_issue'] = df_formatted['predicted_issue'].apply(format_text)
    df_formatted['true_issue'] = df_formatted['true_issue'].apply(format_text)

    display(HTML(df_formatted.to_html(escape=False, justify='left')))

display_formatted_dataframe(pd.DataFrame(validation_results))

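	index	predicted_issue	true_issue	issue_match
0	39	Amoxicillin is prescribed to a patient with a Penicillin allergy.	Prescribed Amoxicillin despite Penicillin allergy	True
1	50	Patient diagnosed with Type 1 Diabetes is not on any medication, and the treatment field lists the diagnosis instead of an appropriate treatment.	Diabetes Type 1 patient not receiving insulin	True
2	51	Lab result of 300 indicates hyperglycemia, but no diagnosis or treatment is recorded.	Extremely high glucose level not diagnosed or treated	True
3	26	Patient is being prescribed Penicillin despite being allergic to Penicillin.	Prescribed Penicillin despite Penicillin allergy	True
4	31	The patient's age (88) is inconsistent with the date of birth (1996-11-05).	Osteoporosis patient not receiving treatment	False
5	24	The 'Treatment Plan' field should not be 'Depression'; it should specify the treatment prescribed for depression.	Depression patient not receiving treatment	True
6	3	Patient is allergic to Penicillin but is prescribed Amoxicillin.	Prescribed Amoxicillin despite Penicillin allergy	True
7	28	The treatment field contains 'Asthma', which is a diagnosis, not a treatment.	Asthma patient not prescribed any medication	False
8	7	Patient with asthma and a low lab result (100) is treated only with lifestyle modifications, without medication, which is inappropriate.	Asthma patient not prescribed any medication	True
9	16	The patient's age (86) does not match the date of birth (1955-10-10).	COPD patient not receiving treatment	False
10	53	The age provided (92) is inconsistent with the date of birth (1983-08-19).	Depression patient not receiving treatment	False
11	23	The treatment field incorrectly lists 'Hyperlipidemia' instead of an appropriate treatment for the diagnosis.	Hyperlipidemia patient not prescribed any medication	True
12	13	Patient is allergic to sulfa drugs but is prescribed Sulfamethoxazole, which is a sulfa drug.	Prescribed Sulfa drug despite Sulfa allergy	True
13	98	Patient is prescribed Penicillin despite being allergic to Penicillin.	Prescribed Penicillin despite Penicillin allergy	True
14	9	Patient has a medication allergy to Penicillin but is prescribed Penicillin.	Prescribed Penicillin despite Penicillin allergy	True
15	85	Treatment field contains 'Hyperlipidemia', which is a diagnosis, not a treatment.	Hyperlipidemia patient not prescribed any medication	False
16	18	The prescribed treatment (Aspirin) is not appropriate for the diagnosis of infection.	Prescribed Aspirin despite Aspirin allergy; high glucose level not addressed	False
17	70	The treatment field contains the diagnosis 'Osteoporosis' instead of a treatment.	Osteoporosis patient not receiving treatment	True
18	57	Patient is allergic to Penicillin but is being prescribed Amoxicillin, which is contraindicated.	Prescribed Amoxicillin despite Penicillin allergy	True
19	80	The treatment field incorrectly lists 'Diabetes Type 2' instead of a valid treatment plan.	Diabetes Type 2 patient not receiving medication	True
20	87	The treatment plan includes prescribing Amoxicillin, to which the patient is allergic.	Prescribed Amoxicillin despite Penicillin allergy	True
21	37	Treatment field contains 'Hyperlipidemia', which is a diagnosis, not a treatment.	Hyperlipidemia patient not prescribed any medication	False
22	95	The treatment is listed as 'Asthma', which is not an appropriate treatment for the diagnosis.	Asthma patient not prescribed any medication	True
23	96	The treatment field lists 'Hyperlipidemia', which is not an appropriate treatment.	Hyperlipidemia patient not prescribed any medication	False
24	59	The treatment field contains 'Anemia', which is not a valid treatment.	Anemia patient not receiving treatment	False
25	5	Age does not match date of birth	Low glucose level not properly addressed	False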
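
As a sanity check on the table above, the issue accuracy is simply the fraction of `issue_match` values that are true. The list below transcribes that column in row order (rows 0 through 25); everything else in the snippet is plain arithmetic.

```python
# issue_match column from the table above, rows 0 through 25
issue_match = [
    True, True, True, True, False, True, True, False, True, False,
    False, True, True, True, True, False, False, True, True, True,
    True, False, True, False, False, False,
]

matches = sum(issue_match)
issue_accuracy = matches / len(issue_match)
print(f"{matches}/{len(issue_match)} = {issue_accuracy:.3f}")  # 16/26 = 0.615
```

Several of the mismatches are rows where the model flagged a different problem from the labeled one (for example, a treatment field that repeats the diagnosis), so the grader's strictness affects this number.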

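Conclusion

The results show that we can achieve high precision and recall for issue identification, along with decent accuracy in pinpointing the exact issue in the data.

This should help streamline data validation for evaluation sets across a variety of domains.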