Fixing Hallucination in LLMs
In this notebook we will learn how to query Pinecone for contexts relevant to our query and pass those contexts to a generative OpenAI model to produce an answer backed by real data sources.
A common problem with using GPT-3 to answer questions accurately is that GPT-3 sometimes makes things up. The GPT models have broad general knowledge, but that does not necessarily extend to more specific information. To address this we use the Pinecone vector database as our "external knowledge base", acting like long-term memory for GPT-3.
The installs required for this notebook are:
!pip install -qU openai pinecone-client datasets
import openai
# get API key from top-right dropdown on OpenAI website
openai.api_key = "OPENAI_API_KEY"
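Hard-coding the key is fine for a quick demo, but in practice you may prefer to load it from an environment variable; a minimal sketch, assuming the key has been exported as OPENAI_API_KEY:
import os
# read the API key from the environment rather than hard-coding it in the notebook
openai.api_key = os.getenv("OPENAI_API_KEY")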
For many questions, state-of-the-art (SOTA) LLMs are perfectly capable of answering correctly.
query = "who was the 12th person on the moon and when did they land?"
# now query `gpt-3.5-turbo-instruct` WITHOUT context
res = openai.Completion.create(
engine='gpt-3.5-turbo-instruct',
prompt=query,
temperature=0,
max_tokens=400,
top_p=1,
frequency_penalty=0,
presence_penalty=0,
stop=None
)
res['choices'][0]['text'].strip()
'The 12th person on the moon was Harrison Schmitt, and he landed on December 11, 1972.'
However, that is not always the case. First, let's rewrite the above into a simple function so we don't need to rewrite it every time.
def complete(prompt):
    res = openai.Completion.create(
        engine='gpt-3.5-turbo-instruct',
        prompt=prompt,
        temperature=0,
        max_tokens=400,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    return res['choices'][0]['text'].strip()
Now let's ask a more specific question about training a type of transformer model called a sentence transformer. The ideal answer we'd expect here is "multiple negatives ranking (MNR) loss".
Don't worry if this seems like an unfamiliar term; it isn't required for understanding what we're doing or demonstrating here.
query = (
"Which training method should I use for sentence transformers when " +
"I only have pairs of related sentences?"
)
complete(query)
'If you only have pairs of related sentences, then the best training method to use for sentence transformers is the supervised learning approach. This approach involves providing the model with labeled data, such as pairs of related sentences, and then training the model to learn the relationships between the sentences. This approach is often used for tasks such as natural language inference, semantic similarity, and paraphrase identification.'
One of the common answers we get to this is:
The best training method to use for fine-tuning a pre-trained model with sentence transformers is the Masked Language Model (MLM) training. MLM training involves randomly masking some of the words in a sentence and then training the model to predict the masked words. This helps the model to learn the context of the sentence and better understand the relationships between words.
This answer seems pretty convincing, right? Yet it is wrong. MLM is typically used in the pretraining step of transformer models, but it cannot be used to fine-tune a sentence transformer and has nothing to do with having "pairs of related sentences".
Another answer we get (and the one returned above) is that the supervised learning approach is the most suitable. This is completely true, but it is not specific and does not answer the question.
We have two options for enabling our LLM to understand and answer this question correctly:
1. We fine-tune the LLM on text data covering the topic in question, likely articles and papers discussing sentence transformers, semantic search training methods, and so on.
2. We use Retrieval Augmented Generation (RAG), a technique that adds an information retrieval component to the generation process, allowing us to retrieve relevant information and feed it into the generative model as a secondary source of information.
We will demonstrate option 2.
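At a high level, the flow we will build looks something like the sketch below, where retrieve and complete are the helper functions defined later in this notebook:
# high-level RAG flow (each piece is implemented step by step below)
def answer(question):
    # 1. embed the question and pull relevant context from the vector database
    prompt_with_context = retrieve(question)
    # 2. pass the context-augmented prompt to the generative model
    return complete(prompt_with_context)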
For option 2, retrieving relevant information requires an external "knowledge base", a place where we can store information and retrieve it efficiently. We can think of this as the external long-term memory of our LLM.
We will need to retrieve information that is semantically related to our queries; to do that we use "dense vector embeddings". These can be thought of as numerical representations of the meaning behind our sentences.
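For intuition, relatedness between two such dense vectors is typically scored with cosine similarity, which is also the metric we use for the Pinecone index below; a minimal NumPy sketch, with small placeholder vectors standing in for real embeddings:
import numpy as np

def cosine_similarity(a, b):
    # cosine similarity = dot product divided by the product of the vector norms
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# with real embeddings, semantically similar sentences score closer to 1
print(cosine_similarity([0.1, 0.3, 0.5], [0.12, 0.28, 0.52]))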
To create these dense vectors we use the text-embedding-ada-002 model.
We have already authenticated our OpenAI connection; to create an embedding we just do:
embed_model = "text-embedding-ada-002"
res = openai.Embedding.create(
input=[
"Sample document text goes here",
"there will be several phrases in each batch"
], engine=embed_model
)
In the response res we will find a JSON-like object containing our new embeddings within the 'data' field.
res.keys()
dict_keys(['object', 'data', 'model', 'usage'])
Inside 'data' we will find two records, one for each of the two sentences we just embedded. Each vector embedding contains 1536 dimensions (the output dimensionality of the text-embedding-ada-002 model).
len(res['data'])
2
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])
(1536, 1536)
We will apply the same embedding logic to a dataset containing information relevant to our query (and to many other queries on the topics of ML and AI).
The dataset we will use is jamescalam/youtube-transcriptions from Hugging Face Datasets. It contains transcribed audio from several ML and tech YouTube channels. We download it with:
from datasets import load_dataset
data = load_dataset('jamescalam/youtube-transcriptions', split='train')
data
Using custom data configuration jamescalam--youtube-transcriptions-6a482f3df0aedcdb Reusing dataset json (/Users/jamesbriggs/.cache/huggingface/datasets/jamescalam___json/jamescalam--youtube-transcriptions-6a482f3df0aedcdb/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)
Dataset({ features: ['title', 'published', 'url', 'video_id', 'channel_id', 'id', 'text', 'start', 'end'], num_rows: 208619 })
data[0]
{'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4', 'published': '2021-07-06 13:00:03 UTC', 'url': 'https://youtu.be/35Pdoyi6ZoQ', 'video_id': '35Pdoyi6ZoQ', 'channel_id': 'UCv83tO5cePwHMt1952IVVHw', 'id': '35Pdoyi6ZoQ-t0.0', 'text': 'Hi, welcome to the video.', 'start': 0.0, 'end': 9.36}
The dataset contains many small snippets of text data. We will need to merge many snippets from each video to create more substantial chunks of text that contain more information.
from tqdm.auto import tqdm
new_data = []
window = 20 # number of sentences to combine
stride = 4 # number of sentences to 'stride' over, used to create overlap
for i in tqdm(range(0, len(data), stride)):
    i_end = min(len(data)-1, i+window)
    if data[i]['title'] != data[i_end]['title']:
        # in this case we skip this entry as we have start/end of two videos
        continue
    text = ' '.join(data[i:i_end]['text'])
    # create the new merged dataset
    new_data.append({
        'start': data[i]['start'],
        'end': data[i_end]['end'],
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published'],
        'channel_id': data[i]['channel_id']
    })
0%| | 0/52155 [00:00<?, ?it/s]
new_data[0]
{'start': 0.0, 'end': 74.12, 'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4', 'text': "Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch mini series. So if you haven't been following along, we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it. And then we've set up our input pipeline ready to begin actually training our model, which is what we're going to cover in this video. So let's move over to the code. And we see here that we have essentially everything we've done so far. So we've built our input data, our input pipeline. And we're now at a point where we have a data loader, PyTorch data loader, ready. And we can begin training a model with it. So there are a few things to be aware of. So I mean, first, let's just have a quick look at the structure of our data.", 'id': '35Pdoyi6ZoQ-t0.0', 'url': 'https://youtu.be/35Pdoyi6ZoQ', 'published': '2021-07-06 13:00:03 UTC', 'channel_id': 'UCv83tO5cePwHMt1952IVVHw'}
Now we need somewhere to store these embeddings and enable efficient vector search over them. To do that we use Pinecone; we can get a free API key and enter it below, where we initialize our connection to Pinecone and create a new index.
import pinecone
index_name = 'openai-youtube-transcriptions'
# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(
api_key="PINECONE_API_KEY",
environment="us-east1-gcp" # may be different, check at app.pinecone.io
)
# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='cosine',
        metadata_config={'indexed': ['channel_id', 'published']}
    )
# connect to index
index = pinecone.Index(index_name)
# view index stats
index.describe_index_stats()
{'dimension': 1536, 'index_fullness': 0.0, 'namespaces': {}, 'total_vector_count': 0}
We can see the index is currently empty, with a total_vector_count of 0. We can begin populating it with embeddings built using OpenAI's text-embedding-ada-002 like so:
from tqdm.auto import tqdm
from time import sleep
batch_size = 100 # how many embeddings we create and insert at once
for i in tqdm(range(0, len(new_data), batch_size)):
    # find end of batch
    i_end = min(len(new_data), i+batch_size)
    meta_batch = new_data[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings (try-except added to avoid RateLimitError)
    done = False
    while not done:
        try:
            res = openai.Embedding.create(input=texts, engine=embed_model)
            done = True
        except:
            sleep(5)
    embeds = [record['embedding'] for record in res['data']]
    # cleanup metadata
    meta_batch = [{
        'start': x['start'],
        'end': x['end'],
        'title': x['title'],
        'text': x['text'],
        'url': x['url'],
        'published': x['published'],
        'channel_id': x['channel_id']
    } for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)
0%| | 0/487 [00:00<?, ?it/s]
Now we search. To do that we need to create a query vector xq:
res = openai.Embedding.create(
input=[query],
engine=embed_model
)
# retrieve from Pinecone
xq = res['data'][0]['embedding']
# get relevant contexts (including the questions)
res = index.query(xq, top_k=2, include_metadata=True)
res
{'matches': [{'id': 'pNvujJ1XyeQ-t418.88', 'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw', 'end': 568.4, 'published': datetime.date(2021, 11, 24), 'start': 418.88, 'text': 'pairs of related sentences you can go ' 'ahead and actually try training or ' 'fine-tuning using NLI with multiple ' "negative ranking loss. If you don't have " 'that fine. Another option is that you have ' 'a semantic textual similarity data set or ' 'STS and what this is is you have so you ' 'have sentence A here, sentence B here and ' 'then you have a score from from 0 to 1 ' 'that tells you the similarity between ' 'those two scores and you would train this ' 'using something like cosine similarity ' "loss. Now if that's not an option and your " 'focus or use case is on building a ' 'sentence transformer for another language ' 'where there is no current sentence ' 'transformer you can use multilingual ' 'parallel data. So what I mean by that is ' 'so parallel data just means translation ' 'pairs so if you have for example a English ' 'sentence and then you have another ' 'language here so it can it can be anything ' "I'm just going to put XX and that XX is " 'your target language you can fine-tune a ' 'model using something called multilingual ' 'knowledge distillation and what that does ' 'is takes a monolingual model for example ' 'in English and using those translation ' 'pairs it distills the knowledge the ' 'semantic similarity knowledge from that ' 'monolingual English model into a ' 'multilingual model which can handle both ' 'English and your target language. So ' "they're three options quite popular very " 'common that you can go for and as a ' 'supervised methods the chances are that ' 'probably going to outperform anything you ' 'do with unsupervised training at least for ' 'now. So if none of those sound like ' 'something', 'title': 'Today Unsupervised Sentence Transformers, ' 'Tomorrow Skynet (how TSDAE works)', 'url': 'https://youtu.be/pNvujJ1XyeQ'}, 'score': 0.865277052, 'sparseValues': {}, 'values': []}, {'id': 'WS1uVMGhlWQ-t737.28', 'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw', 'end': 900.72, 'published': datetime.date(2021, 10, 20), 'start': 737.28, 'text': "were actually more accurate. So we can't " "really do that. We can't use this what is " 'called a mean pooling approach. Or we ' "can't use it in its current form. Now the " 'solution to this problem was introduced by ' 'two people in 2019 Nils Reimers and Irenia ' 'Gurevich. They introduced what is the ' 'first sentence transformer or sentence ' 'BERT. And it was found that sentence BERT ' 'or S BERT outformed all of the previous ' 'Save the Art models on pretty much all ' 'benchmarks. Not all of them but most of ' 'them. And it did it in a very quick time. ' 'So if we compare it to BERT, if we wanted ' 'to find the most similar sentence pair ' 'from 10,000 sentences in that 2019 paper ' 'they found that with BERT that took 65 ' 'hours. With S BERT embeddings they could ' 'create all the embeddings in just around ' 'five seconds. And then they could compare ' 'all those with cosine similarity in 0.01 ' "seconds. So it's a lot faster. We go from " '65 hours to just over five seconds which ' 'is I think pretty incredible. Now I think ' "that's pretty much all the context we need " 'behind sentence transformers. And what we ' 'do now is dive into a little bit of how ' 'they actually work. 
Now we said before we ' 'have the core transform models and what S ' 'BERT does is fine tunes on sentence pairs ' 'using what is called a Siamese ' 'architecture or Siamese network. What we ' 'mean by a Siamese network is that we have ' 'what we can see, what can view as two BERT ' 'models that are identical and the weights ' 'between those two models are tied. Now in ' 'reality when implementing this we just use ' 'a single BERT model. And what we do is we ' 'process one sentence, a sentence A through ' 'the model and then we process another ' 'sentence, sentence B through the model. ' "And that's the sentence pair. So with our " 'cross-linked we were processing the ' 'sentence pair together. We were putting ' 'them both together, processing them all at ' 'once. This time we process them ' 'separately. And during training what ' 'happens is the weights', 'title': 'Intro to Sentence Embeddings with ' 'Transformers', 'url': 'https://youtu.be/WS1uVMGhlWQ'}, 'score': 0.85855335, 'sparseValues': {}, 'values': []}], 'namespace': ''}
limit = 3750  # max number of characters of retrieved context to include in the prompt
def retrieve(query):
    res = openai.Embedding.create(
        input=[query],
        engine=embed_model
    )
    # retrieve from Pinecone
    xq = res['data'][0]['embedding']
    # get relevant contexts
    res = index.query(xq, top_k=3, include_metadata=True)
    contexts = [
        x['metadata']['text'] for x in res['matches']
    ]
    # build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # default to including all contexts, then trim if we hit the character limit
    prompt = (
        prompt_start +
        "\n\n---\n\n".join(contexts) +
        prompt_end
    )
    for i in range(1, len(contexts)):
        if len("\n\n---\n\n".join(contexts[:i])) >= limit:
            # too long: keep only the contexts that fit within the limit
            prompt = (
                prompt_start +
                "\n\n---\n\n".join(contexts[:i-1]) +
                prompt_end
            )
            break
    return prompt
# first we retrieve relevant items from Pinecone
query_with_contexts = retrieve(query)
query_with_contexts
"Answer the question based on the context below.\n\nContext:\npairs of related sentences you can go ahead and actually try training or fine-tuning using NLI with multiple negative ranking loss. If you don't have that fine. Another option is that you have a semantic textual similarity data set or STS and what this is is you have so you have sentence A here, sentence B here and then you have a score from from 0 to 1 that tells you the similarity between those two scores and you would train this using something like cosine similarity loss. Now if that's not an option and your focus or use case is on building a sentence transformer for another language where there is no current sentence transformer you can use multilingual parallel data. So what I mean by that is so parallel data just means translation pairs so if you have for example a English sentence and then you have another language here so it can it can be anything I'm just going to put XX and that XX is your target language you can fine-tune a model using something called multilingual knowledge distillation and what that does is takes a monolingual model for example in English and using those translation pairs it distills the knowledge the semantic similarity knowledge from that monolingual English model into a multilingual model which can handle both English and your target language. So they're three options quite popular very common that you can go for and as a supervised methods the chances are that probably going to outperform anything you do with unsupervised training at least for now. So if none of those sound like something\n\n---\n\nwere actually more accurate. So we can't really do that. We can't use this what is called a mean pooling approach. Or we can't use it in its current form. Now the solution to this problem was introduced by two people in 2019 Nils Reimers and Irenia Gurevich. They introduced what is the first sentence transformer or sentence BERT. And it was found that sentence BERT or S BERT outformed all of the previous Save the Art models on pretty much all benchmarks. Not all of them but most of them. And it did it in a very quick time. So if we compare it to BERT, if we wanted to find the most similar sentence pair from 10,000 sentences in that 2019 paper they found that with BERT that took 65 hours. With S BERT embeddings they could create all the embeddings in just around five seconds. And then they could compare all those with cosine similarity in 0.01 seconds. So it's a lot faster. We go from 65 hours to just over five seconds which is I think pretty incredible. Now I think that's pretty much all the context we need behind sentence transformers. And what we do now is dive into a little bit of how they actually work. Now we said before we have the core transform models and what S BERT does is fine tunes on sentence pairs using what is called a Siamese architecture or Siamese network. What we mean by a Siamese network is that we have what we can see, what can view as two BERT models that are identical and the weights between those two models are tied. Now in reality when implementing this we just use a single BERT model. And what we do is we process one sentence, a sentence A through the model and then we process another sentence, sentence B through the model. And that's the sentence pair. So with our cross-linked we were processing the sentence pair together. We were putting them both together, processing them all at once. This time we process them separately. 
And during training what happens is the weights\n\n---\n\nTransformer-based Sequential Denoising Autoencoder. So what we'll do is jump straight into it and take a look at where we might want to use this training approach and and how we can actually implement it. So the first question we need to ask is do we really need to resort to unsupervised training? Now what we're going to do here is just have a look at a few of the most popular training approaches and what sort of data we need for that. So the first one we're looking at here is Natural Language Inference or NLI and NLI requires that we have pairs of sentences that are labeled as either contradictory, neutral which means they're not necessarily related or as entailing or as inferring each other. So you have pairs that entail each other so they are both very similar pairs that are neutral and also pairs that are contradictory. And this is the traditional NLI data. Now using another version of fine-tuning with NLI called a multiple negatives ranking loss you can get by with only entailment pairs so pairs that are related to each other or positive pairs and it can also use contradictory pairs to improve the performance of training as well but you don't need it. So if you have positive pairs of related sentences you can go ahead and actually try training or fine-tuning using NLI with multiple negative ranking loss. If you don't have that fine. Another option is that you have a semantic textual similarity data set or STS and what this is is you have so you have sentence A here, sentence B\n\nQuestion: Which training method should I use for sentence transformers when I only have pairs of related sentences?\nAnswer:"
# then we complete the context-infused query
complete(query_with_contexts)
'You should use Natural Language Inference (NLI) with multiple negative ranking loss.'
We immediately get a much better answer, explicitly telling us to use multiple negatives ranking (MNR) loss.
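Once you are done experimenting, you may want to delete the index to free up resources; a short sketch assuming the same pinecone-client setup used above:
# remove the index when it is no longer needed (this is irreversible)
pinecone.delete_index(index_name)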