Robust Question Answering with Chroma and OpenAI

Apr 6, 2023

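This notebook guides you step by step through answering questions about a collection of data, using Chroma, an open-source embeddings database, together with OpenAI's text embeddings and chat completion APIs.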

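The notebook also demonstrates some of the trade-offs involved in making a question answering system more robust. As we shall see, simple querying doesn't always produce the best results!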

Question Answering with LLMs

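Large language models (LLMs) like OpenAI's ChatGPT can be used to answer questions about data that the model may not have been trained on or does not have access to. For example: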

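Personal data, such as emails and notes
Highly specialized data, such as archival or legal documents
Newly created data, such as recent news stories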

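To overcome this limitation, we can use a data store that can be queried in natural language, just like the LLM itself. An embeddings store like Chroma represents each document as an embedding, alongside the document itself.

By embedding a text query, Chroma can find relevant documents, which we can then pass to the LLM to answer our question. We'll show detailed examples and variants of this approach.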
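
To make the retrieval idea concrete, here is a minimal sketch of nearest-neighbor lookup over embeddings. It is not how Chroma is implemented: the three-dimensional vectors, the document names, and the `nearest` helper are all invented for illustration, and cosine distance is only one of several possible metrics.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; real embeddings have far more dimensions.
toy_store = {
    "doc_vitamins": [0.9, 0.1, 0.0],
    "doc_genomes": [0.1, 0.9, 0.1],
    "doc_prions": [0.0, 0.2, 0.9],
}

def nearest(query_embedding, store, n_results=2):
    """Rank stored documents by distance to the query, closest first."""
    ranked = sorted(
        store,
        key=lambda doc_id: cosine_distance(query_embedding, store[doc_id]),
    )
    return ranked[:n_results]

# A hypothetical embedded query that points mostly along the first axis.
toy_query = [0.8, 0.2, 0.1]
print(nearest(toy_query, toy_store))
```

The documents returned this way are what would be handed to the LLM as context.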

Setup and preliminaries

First, we make sure the Python dependencies we need are installed.

%pip install -qU openai chromadb pandas

Note: you may need to restart the kernel to use updated packages.

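We use OpenAI's API throughout this notebook. You can get an API key from https://beta.openai.com/account/api-keys

You can add your API key as an environment variable by running export OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx in a terminal. Note that you will need to reload the notebook if the environment variable wasn't already set when it started. Alternatively, you can set it inside the notebook; see below.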

import os

# Uncomment the following line to set the environment variable in the notebook
# os.environ["OPENAI_API_KEY"] = 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

if os.getenv("OPENAI_API_KEY") is not None:
    print("OPENAI_API_KEY is ready")
    import openai
    openai.api_key = os.getenv("OPENAI_API_KEY")
else:
    print("OPENAI_API_KEY environment variable not found")

OPENAI_API_KEY is ready

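Dataset

Throughout this notebook, we use the SciFact dataset. This is a curated dataset of expert-annotated scientific claims, with an accompanying text corpus of paper titles and abstracts. Each claim may be supported, contradicted, or have not enough evidence either way, according to the documents in the corpus.

Having the corpus available as ground truth allows us to investigate how well the following approaches to LLM question answering perform.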

# Load the claim dataset
import pandas as pd

data_path = '../../data'

claim_df = pd.read_json(f'{data_path}/scifact_claims.jsonl', lines=True)
claim_df.head()

	id	claim	evidence	cited_doc_ids
0	1	0-dimensional biomaterials show inductive properties...	{}	[31715818]
1	3	1,000 genomes project enables mapping of genetic...	{'14717500': [{'sentences': [2, 5], 'label': '...	[14717500]
2	5	1/2000 in UK have abnormal PrP positivity.	{'13734012': [{'sentences': [4], 'label': 'SUP...	[13734012]
3	13	5% of perinatal mortality is due to low birth weight...	{}	[1606628]
4	36	A deficiency of vitamin B12 increases blood levels...	{}	[5152028, 11705328]

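Just asking the model

GPT-3.5 was trained on a large amount of scientific information. As a baseline, we'd like to understand what the model already knows without any further context. This will allow us to calibrate overall performance.

We construct an appropriate prompt with some example facts, then query the model with each claim in the dataset. We ask the model to assess a claim as 'True', 'False', or 'NEE' if there is not enough evidence either way.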

def build_prompt(claim):
    return [
        {"role": "system", "content": "I will ask you to assess a scientific claim. Output only the text 'True' if the claim is true, 'False' if the claim is false, or 'NEE' if there's not enough evidence."},
        {"role": "user", "content": f"""        
Example:

Claim:
0-dimensional biomaterials show inductive properties.

Assessment:
False

Claim:
1/2000 in UK have abnormal PrP positivity.

Assessment:
True

Claim:
Aspirin inhibits the production of PGE2.

Assessment:
False

End of examples. Assess the following claim:

Claim:
{claim}

Assessment:
"""}
    ]


def assess_claims(claims):
    responses = []
    # Query the OpenAI API
    for claim in claims:
        response = openai.ChatCompletion.create(
            model='gpt-3.5-turbo',
            messages=build_prompt(claim),
            max_tokens=3,
        )
        # Strip any punctuation or whitespace from the response
        responses.append(response.choices[0].message.content.strip('., '))

    return responses
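
Stripping punctuation does not guarantee that the reply is exactly one of the three expected labels, and a table keyed on those labels will fail on anything else. The following is a minimal sketch of a normalization step; `normalize_label` and its fallback to 'NEE' are assumptions made for illustration, not part of the original notebook.

```python
VALID_LABELS = ("True", "False", "NEE")

def normalize_label(raw, default="NEE"):
    """Map a raw model reply onto one of the three expected labels.

    Replies that match no label (ignoring case, surrounding whitespace,
    and punctuation) fall back to `default`. The fallback value is an
    assumption; choosing 'NEE' treats unparseable replies as "no verdict".
    """
    cleaned = raw.strip(" .,'\"\n").lower()
    for label in VALID_LABELS:
        if cleaned == label.lower():
            return label
    return default

print(normalize_label(" true."))
print(normalize_label("I cannot say"))
```

Mapping unrecognized replies to 'NEE' inflates the NEE row slightly; raising an error instead would be a reasonable alternative when strictness matters more than completing the run.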

We sample 50 claims from the dataset.

# Let's take a look at 50 claims
samples = claim_df.sample(50)

claims = samples['claim'].tolist()

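We derive the ground truth from the dataset. According to the dataset description, each claim is either supported or contradicted by the evidence, or else there isn't enough evidence either way.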

def get_groundtruth(evidence):
    groundtruth = []
    for e in evidence:
        # Evidence is empty 
        if len(e) == 0:
            groundtruth.append('NEE')
        else:
            # In this dataset, all evidence for a given claim is consistent, either SUPPORT or CONTRADICT
            if list(e.values())[0][0]['label'] == 'SUPPORT':
                groundtruth.append('True')
            else:
                groundtruth.append('False')
    return groundtruth

evidence = samples['evidence'].tolist()
groundtruth = get_groundtruth(evidence)

We also output the confusion matrix, comparing the model's assessments with the ground truth, in an easy-to-read table.

def confusion_matrix(inferred, groundtruth):
    assert len(inferred) == len(groundtruth)
    confusion = {
        'True': {'True': 0, 'False': 0, 'NEE': 0},
        'False': {'True': 0, 'False': 0, 'NEE': 0},
        'NEE': {'True': 0, 'False': 0, 'NEE': 0},
    }
    for i, g in zip(inferred, groundtruth):
        confusion[i][g] += 1

    # Pretty print the confusion matrix
    print('\tGroundtruth')
    print('\tTrue\tFalse\tNEE')
    for i in confusion:
        print(i, end='\t')
        for g in confusion[i]:
            print(confusion[i][g], end='\t')
        print()

    return confusion

We ask the model to assess the claims directly, without additional context.

gpt_inferred = assess_claims(claims)
confusion_matrix(gpt_inferred, groundtruth)

	Groundtruth
	True	False	NEE
True	15	5	14	
False	0	2	1	
NEE	3	3	7

{'True': {'True': 15, 'False': 5, 'NEE': 14},
 'False': {'True': 0, 'False': 2, 'NEE': 1},
 'NEE': {'True': 3, 'False': 3, 'NEE': 7}}

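Results

From these results we see that the LLM is strongly biased toward assessing claims as true, even when they are false, and also tends to assess false claims as not enough evidence. Note that 'not enough evidence' here refers to the model's assessment of the claim in a vacuum, without additional context.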
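
One way to quantify this bias is to compute per-label precision and recall from the confusion matrix printed above. The sketch below copies that matrix by hand; `per_label_scores` is a helper introduced here for illustration and is not part of the original notebook.

```python
# Baseline confusion matrix from the output above: confusion[inferred][groundtruth].
baseline = {
    'True': {'True': 15, 'False': 5, 'NEE': 14},
    'False': {'True': 0, 'False': 2, 'NEE': 1},
    'NEE': {'True': 3, 'False': 3, 'NEE': 7},
}

def per_label_scores(confusion):
    """Return {label: (precision, recall)} for confusion[inferred][groundtruth]."""
    scores = {}
    for label in confusion:
        correct = confusion[label][label]
        # Row sum: how often the model answered with this label.
        inferred_total = sum(confusion[label].values())
        # Column sum: how often this label is the ground truth.
        groundtruth_total = sum(row[label] for row in confusion.values())
        precision = correct / inferred_total if inferred_total else 0.0
        recall = correct / groundtruth_total if groundtruth_total else 0.0
        scores[label] = (precision, recall)
    return scores

for label, (precision, recall) in per_label_scores(baseline).items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

In this run the model answered 'True' for 34 of the 50 claims, of which only 15 were actually supported, which is the bias described above.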

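Adding context

We now add the additional context available in the corpus of paper titles and abstracts. This section shows how to load a text corpus into Chroma, using OpenAI text embeddings.

First, we load the text corpus.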

# Load the corpus into a dataframe
corpus_df = pd.read_json(f'{data_path}/scifact_corpus.jsonl', lines=True)
corpus_df.head()

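	doc_id	title	abstract	structured
0	4983	Microstructural development of human newborn c...	[Alterations of the architecture of cerebral white matter...]	False
1	5836	Induction of myelodysplasia by myeloid-derived...	[Myelodysplastic syndromes (MDS) are age-dependent...]	False
2	7912	BC1 RNA, the transcript from a master gene...	[ID elements are short interspersed elements (...	False
3	18670	The DNA methylome of human peripheral blood mononuclear cells...	[DNA methylation plays an important role in biology...]	False
4	19238	The human myelin basic protein gene is included in...	[Two human Golli (for gene expressed in the ol...	False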

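Loading the corpus into Chroma

The next step is to load the corpus into Chroma. Given an embedding function, Chroma automatically handles embedding each document and stores the embedding alongside the document's text and metadata, making it simple to query.

We instantiate an (ephemeral) Chroma client and create a collection for the SciFact title and abstract corpus. Chroma can also be instantiated in a persisted configuration; learn more in the Chroma docs.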

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# We initialize an embedding function, and provide it to the collection.
embedding_function = OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))

chroma_client = chromadb.Client() # Ephemeral by default
scifact_corpus_collection = chroma_client.create_collection(name='scifact_corpus', embedding_function=embedding_function)

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.

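Next we load the corpus into Chroma. Because this data loading is memory intensive, we recommend a batched loading scheme with batches of 50-1000 documents. For this example, loading the entire corpus should take a little over one minute. The documents are embedded automatically in the background, using the embedding_function we specified earlier.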

batch_size = 100

for i in range(0, len(corpus_df), batch_size):
    batch_df = corpus_df[i:i+batch_size]
    scifact_corpus_collection.add(
        ids=batch_df['doc_id'].apply(lambda x: str(x)).tolist(), # Chroma takes string IDs.
        documents=(batch_df['title'] + '. ' + batch_df['abstract'].apply(lambda x: ' '.join(x))).to_list(), # We concatenate the title and abstract.
        metadatas=[{"structured": structured} for structured in batch_df['structured'].to_list()] # We also store the metadata, though we don't use it in this example.
    )

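Retrieving context

Next we retrieve documents from the corpus that may be relevant to each claim in our sample. We want to provide these as context to the LLM for assessing the claims. We retrieve the 3 most relevant documents for each claim, according to embedding distance.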

claim_query_result = scifact_corpus_collection.query(query_texts=claims, include=['documents', 'distances'], n_results=3)
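
The query result is organized per query: each of the 'ids', 'documents', and 'distances' entries holds one inner list per claim, in the same order as the claims. The sketch below uses a hand-written dictionary with that shape to show how to read it; the values and the `closest_document` helper are invented for illustration and do not come from the real corpus.

```python
# Hypothetical result for two queries with n_results=3, mimicking the
# list-of-lists layout of a query result.
mock_query_result = {
    "ids": [["101", "102", "103"], ["201", "202", "203"]],
    "documents": [["doc A", "doc B", "doc C"], ["doc D", "doc E", "doc F"]],
    "distances": [[0.18, 0.27, 0.41], [0.22, 0.35, 0.36]],
}
mock_claims = ["first claim", "second claim"]

def closest_document(query_result, query_index):
    """Return (document, distance) for the closest match of one query."""
    docs = query_result["documents"][query_index]
    dists = query_result["distances"][query_index]
    best = min(range(len(docs)), key=lambda i: dists[i])
    return docs[best], dists[best]

for index, claim in enumerate(mock_claims):
    document, distance = closest_document(mock_query_result, index)
    print(f"{claim!r} -> {document!r} (distance {distance})")
```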

We create a new prompt, this time taking into account the additional context we retrieve from the corpus.

def build_prompt_with_context(claim, context):
    return [{'role': 'system', 'content': "I will ask you to assess a particular scientific claim, based on evidence provided. Output only the text 'True' if the claim is true, 'False' if the claim is false, or 'NEE' if there's not enough evidence."}, 
            {'role': 'user', 'content': f"""
The evidence is the following:

{' '.join(context)}

Assess the following claim on the basis of the evidence. Output only the text 'True' if the claim is true, 'False' if the claim is false, or 'NEE' if there's not enough evidence. Do not output any other text. 

Claim:
{claim}

Assessment:
"""}]


def assess_claims_with_context(claims, contexts):
    responses = []
    # Query the OpenAI API
    for claim, context in zip(claims, contexts):
        # If no evidence is provided, return NEE
        if len(context) == 0:
            responses.append('NEE')
            continue
        response = openai.ChatCompletion.create(
            model='gpt-3.5-turbo',
            messages=build_prompt_with_context(claim=claim, context=context),
            max_tokens=3,
        )
        # Strip any punctuation or whitespace from the response
        responses.append(response.choices[0].message.content.strip('., '))

    return responses

Then we ask the model to assess the claims using the retrieved context.

gpt_with_context_evaluation = assess_claims_with_context(claims, claim_query_result['documents'])
confusion_matrix(gpt_with_context_evaluation, groundtruth)

	Groundtruth
	True	False	NEE
True	16	2	8	
False	1	6	5	
NEE	1	2	9

{'True': {'True': 16, 'False': 2, 'NEE': 8},
 'False': {'True': 1, 'False': 6, 'NEE': 5},
 'NEE': {'True': 1, 'False': 2, 'NEE': 9}}

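Results

We see that the model is less likely to assess false claims as true (2 instances vs. 5 previously), but claims without enough evidence are still often assessed as true or false.

Looking at the retrieved documents, we see that they are sometimes not relevant to the claim at all. The extra information can confuse the model, which may decide that sufficient evidence is present even when that information is irrelevant. This happens because we always ask for the 3 'most' relevant documents, but beyond a certain point these may not be relevant at all.

Filtering context on relevance

Along with the documents themselves, Chroma returns a distance score. We can try thresholding on distance, so that fewer irrelevant documents make it into the context we provide to the model.

If no context documents remain after filtering on the threshold, we bypass the model and simply return that there is not enough evidence.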

def filter_query_result(query_result, distance_threshold=0.25):
    # For each query result, retain only the documents whose distance is at or below the threshold.
    # Note: the lists inside query_result are modified in place.
    for ids, docs, distances in zip(query_result['ids'], query_result['documents'], query_result['distances']):
        for i in range(len(ids)-1, -1, -1):
            if distances[i] > distance_threshold:
                ids.pop(i)
                docs.pop(i)
                distances.pop(i)
    return query_result

filtered_claim_query_result = filter_query_result(claim_query_result)
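
Note that the function above removes entries from the lists it is given, so the unfiltered result is no longer available afterwards. If both versions are needed, a copying variant can be used instead. The following is a minimal sketch under that assumption; `filter_query_result_copy` and the toy data are introduced here for illustration.

```python
def filter_query_result_copy(query_result, distance_threshold=0.25):
    """Return a new result dict keeping only entries at or below the threshold.

    The input dict and its lists are left untouched.
    """
    filtered = {"ids": [], "documents": [], "distances": []}
    for ids, docs, distances in zip(
        query_result["ids"], query_result["documents"], query_result["distances"]
    ):
        kept = [i for i, d in enumerate(distances) if d <= distance_threshold]
        filtered["ids"].append([ids[i] for i in kept])
        filtered["documents"].append([docs[i] for i in kept])
        filtered["distances"].append([distances[i] for i in kept])
    return filtered

# Hypothetical query result for two claims.
toy_result = {
    "ids": [["a", "b", "c"], ["d", "e", "f"]],
    "documents": [["A", "B", "C"], ["D", "E", "F"]],
    "distances": [[0.10, 0.25, 0.40], [0.30, 0.35, 0.50]],
}

toy_filtered = filter_query_result_copy(toy_result)
print(toy_filtered["documents"])  # the second claim keeps no documents
print(toy_result["documents"])    # the original lists are unchanged
```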

Now we assess the claims using this cleaner context.

gpt_with_filtered_context_evaluation = assess_claims_with_context(claims, filtered_claim_query_result['documents'])
confusion_matrix(gpt_with_filtered_context_evaluation, groundtruth)

	Groundtruth
	True	False	NEE
True	10	2	1	
False	0	2	1	
NEE	8	6	20

{'True': {'True': 10, 'False': 2, 'NEE': 1},
 'False': {'True': 0, 'False': 2, 'NEE': 1},
 'NEE': {'True': 8, 'False': 6, 'NEE': 20}}

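Results

The model now assesses far fewer claims as True or False when there is not enough evidence. However, it is now biased away from certainty: most claims are assessed as having not enough evidence, because for a large fraction of them every retrieved document is filtered out by the distance threshold. The distance threshold can be tuned to find the best operating point, but this can be difficult and depends on the dataset and the embedding model.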
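
A simple way to explore the operating point is to count, for several candidate thresholds, how many claims keep at least one context document. The sketch below does this on invented distances; in practice the input would be the 'distances' lists of an unfiltered query result. `claims_with_context` is a helper introduced here for illustration.

```python
# Hypothetical distances for four claims with three retrieved documents each.
toy_distances = [
    [0.12, 0.21, 0.38],
    [0.27, 0.31, 0.45],
    [0.19, 0.33, 0.36],
    [0.41, 0.44, 0.52],
]

def claims_with_context(distances_per_claim, threshold):
    """Count claims that keep at least one document at or below the threshold."""
    return sum(
        1
        for distances in distances_per_claim
        if any(d <= threshold for d in distances)
    )

for threshold in (0.15, 0.25, 0.35, 0.45):
    kept = claims_with_context(toy_distances, threshold)
    print(f"threshold={threshold:.2f}: {kept}/{len(toy_distances)} claims keep some context")
```

Claims that keep no context are answered 'NEE' without calling the model, so this count gives a quick view of how aggressively a given threshold pushes the results toward 'NEE'.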

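Hypothetical Document Embeddings: Using hallucinations productively

We want to be able to retrieve relevant documents without retrieving less relevant ones that might confuse the model. One way to accomplish this is to improve the retrieval query.

Until now, we have queried the dataset using claims, which are single-sentence statements, while the corpus contains abstracts describing scientific papers. Intuitively, while the two may be related, there are significant differences in their structure and meaning. These differences are encoded by the embedding model, and so they influence the distances between the query and the most relevant results.

We can overcome this by leveraging the power of LLMs to generate relevant text. While the facts may be hallucinated, the content and structure of the documents the model generates are more similar to the documents in our corpus than the queries are. This could lead to better queries and hence better results.

This approach is called Hypothetical Document Embeddings (HyDE), and has been shown to perform well on retrieval tasks. It should help us bring more relevant information into the context, without polluting it.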

TL;DR

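You get much better matches when you embed whole abstracts rather than single sentences
But claims are usually single sentences
So HyDE shows that using GPT3 to expand claims into hallucinated abstracts and then searching based on those abstracts (claims -> abstracts -> results) works better than searching directly (claims -> results)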
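
The intuition can be illustrated with a toy example. In the sketch below, a word-count vector stands in for the embedding model and a fixed template stands in for the LLM; both `toy_embed` and `toy_hallucinate` are hypothetical stand-ins, and the texts are invented. It only shows that a query shaped like an abstract can land closer to an abstract than the bare claim does.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def toy_hallucinate(claim):
    """Hypothetical stand-in for the LLM call that writes an abstract for a claim."""
    return f"background {claim} methods we measured patients in a cohort study results"

toy_claim = "aspirin reduces inflammation"
toy_abstract = (
    "background aspirin is widely used methods we measured inflammation "
    "in patients results inflammation fell"
)

# Direct search: claim -> results
direct_distance = cosine_distance(toy_embed(toy_claim), toy_embed(toy_abstract))
# HyDE-style search: claim -> hypothetical abstract -> results
hyde_distance = cosine_distance(
    toy_embed(toy_hallucinate(toy_claim)), toy_embed(toy_abstract)
)

print(f"direct: {direct_distance:.3f}  via hypothetical abstract: {hyde_distance:.3f}")
```

Because part of the gain in this toy comes from shared structural words, a hypothetical abstract may move closer to every abstract in the corpus, not only the relevant ones, so a distance threshold chosen for claim queries may need to be revisited.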

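First, we use in-context examples to prompt the model to generate, for each claim we want to assess, a document similar to those in the corpus.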

def build_hallucination_prompt(claim):
    return [{'role': 'system', 'content': """I will ask you to write an abstract for a scientific paper which supports or refutes a given claim. It should be written in scientific language, include a title. Output only one abstract, then stop.
    
    An Example:

    Claim:
    A high microerythrocyte count raises vulnerability to severe anemia in homozygous alpha (+)- thalassemia trait subjects.

    Abstract:
    BACKGROUND The heritable haemoglobinopathy alpha(+)-thalassaemia is caused by the reduced synthesis of alpha-globin chains that form part of normal adult haemoglobin (Hb). Individuals homozygous for alpha(+)-thalassaemia have microcytosis and an increased erythrocyte count. Alpha(+)-thalassaemia homozygosity confers considerable protection against severe malaria, including severe malarial anaemia (SMA) (Hb concentration < 50 g/l), but does not influence parasite count. We tested the hypothesis that the erythrocyte indices associated with alpha(+)-thalassaemia homozygosity provide a haematological benefit during acute malaria.   
    METHODS AND FINDINGS Data from children living on the north coast of Papua New Guinea who had participated in a case-control study of the protection afforded by alpha(+)-thalassaemia against severe malaria were reanalysed to assess the genotype-specific reduction in erythrocyte count and Hb levels associated with acute malarial disease. We observed a reduction in median erythrocyte count of approximately 1.5 x 10(12)/l in all children with acute falciparum malaria relative to values in community children (p < 0.001). We developed a simple mathematical model of the linear relationship between Hb concentration and erythrocyte count. This model predicted that children homozygous for alpha(+)-thalassaemia lose less Hb than children of normal genotype for a reduction in erythrocyte count of >1.1 x 10(12)/l as a result of the reduced mean cell Hb in homozygous alpha(+)-thalassaemia. In addition, children homozygous for alpha(+)-thalassaemia require a 10% greater reduction in erythrocyte count than children of normal genotype (p = 0.02) for Hb concentration to fall to 50 g/l, the cutoff for SMA. We estimated that the haematological profile in children homozygous for alpha(+)-thalassaemia reduces the risk of SMA during acute malaria compared to children of normal genotype (relative risk 0.52; 95% confidence interval [CI] 0.24-1.12, p = 0.09).   
    CONCLUSIONS The increased erythrocyte count and microcytosis in children homozygous for alpha(+)-thalassaemia may contribute substantially to their protection against SMA. A lower concentration of Hb per erythrocyte and a larger population of erythrocytes may be a biologically advantageous strategy against the significant reduction in erythrocyte count that occurs during acute infection with the malaria parasite Plasmodium falciparum. This haematological profile may reduce the risk of anaemia by other Plasmodium species, as well as other causes of anaemia. Other host polymorphisms that induce an increased erythrocyte count and microcytosis may confer a similar advantage.

    End of example. 
    
    """}, {'role': 'user', 'content': f"""
    Perform the task for the following claim.

    Claim:
    {claim}

    Abstract:
    """}]


def hallucinate_evidence(claims):
    responses = []
    # Query the OpenAI API
    for claim in claims:
        response = openai.ChatCompletion.create(
            model='gpt-3.5-turbo',
            messages=build_hallucination_prompt(claim),
        )
        responses.append(response.choices[0].message.content)
    return responses

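We hallucinate a document for each claim.

NB: This can take a while, about 30 minutes for 100 claims. You can reduce the number of claims we assess to get results more quickly.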

hallucinated_evidence = hallucinate_evidence(claims)

We use the hallucinated documents as queries into the corpus, and filter the results using the same distance threshold.

hallucinated_query_result = scifact_corpus_collection.query(query_texts=hallucinated_evidence, include=['documents', 'distances'], n_results=3)
filtered_hallucinated_query_result = filter_query_result(hallucinated_query_result)

We then ask the model to assess the claims, using the new context.

gpt_with_hallucinated_context_evaluation = assess_claims_with_context(claims, filtered_hallucinated_query_result['documents'])
confusion_matrix(gpt_with_hallucinated_context_evaluation, groundtruth)

	Groundtruth
	True	False	NEE
True	15	2	5	
False	1	5	4	
NEE	2	3	13

{'True': {'True': 15, 'False': 2, 'NEE': 5},
 'False': {'True': 1, 'False': 5, 'NEE': 4},
 'NEE': {'True': 2, 'False': 3, 'NEE': 13}}

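Results

Combining HyDE with a simple distance threshold leads to a significant improvement. The model is no longer biased toward assessing claims as True, nor toward there being not enough evidence. It also correctly identifies claims without enough evidence more often than the baseline did (13 instances vs. 7).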
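
To compare the four approaches on a single number, the sketch below computes overall accuracy from the confusion matrices printed in this notebook. The matrices are copied by hand from the outputs above; `accuracy` is a helper introduced here for illustration. Each figure comes from a single run over 50 sampled claims, so small differences should be read with caution.

```python
# Confusion matrices copied from the outputs above: confusion[inferred][groundtruth].
results = {
    "model only": {
        'True': {'True': 15, 'False': 5, 'NEE': 14},
        'False': {'True': 0, 'False': 2, 'NEE': 1},
        'NEE': {'True': 3, 'False': 3, 'NEE': 7},
    },
    "with context": {
        'True': {'True': 16, 'False': 2, 'NEE': 8},
        'False': {'True': 1, 'False': 6, 'NEE': 5},
        'NEE': {'True': 1, 'False': 2, 'NEE': 9},
    },
    "filtered context": {
        'True': {'True': 10, 'False': 2, 'NEE': 1},
        'False': {'True': 0, 'False': 2, 'NEE': 1},
        'NEE': {'True': 8, 'False': 6, 'NEE': 20},
    },
    "HyDE + filtered context": {
        'True': {'True': 15, 'False': 2, 'NEE': 5},
        'False': {'True': 1, 'False': 5, 'NEE': 4},
        'NEE': {'True': 2, 'False': 3, 'NEE': 13},
    },
}

def accuracy(confusion):
    """Fraction of claims whose inferred label equals the ground truth."""
    correct = sum(confusion[label][label] for label in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total

for name, confusion in results.items():
    print(f"{name}: accuracy={accuracy(confusion):.2f}")
```

Accuracy alone hides how the errors are distributed across the three labels, so it is best read alongside the confusion matrices themselves.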

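Conclusion

Equipping LLMs with context drawn from a corpus of documents is a powerful technique for bringing the general reasoning and natural language interactions of LLMs to your own data. However, it's important to know that naive query and retrieval may not produce the best possible results! Ultimately, understanding the data will help you get the most out of the retrieval-based question answering approach.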