Philosophy with vector embeddings, OpenAI and Cassandra / Astra DB

Aug 29, 2023


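CassIO version

In this quickstart you will learn how to build a "philosophy quote finder and generator" using OpenAI's vector embeddings and Apache Cassandra®, or equivalently DataStax Astra DB through CQL, as the vector store for data persistence.

The basic workflow of this notebook is outlined below. You will evaluate and store the vector embeddings for a number of quotes by famous philosophers, use them to build a powerful search engine and, after that, even generate new quotes!

The notebook demonstrates some of the standard usage patterns of vector search, while showing how easy it is to get started with the vector capabilities of Cassandra / Astra DB through CQL.

For background on using vector search and text embeddings to build a question-answering system, check out this excellent hands-on notebook: Question answering using embeddings.

Choose-your-framework

Please note that this notebook uses the CassIO library, but we also cover other technology choices that accomplish the same task. Check out this folder's README for other options. This notebook can run either as a Colab notebook or as a regular Jupyter notebook.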

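Table of contents:

Setup
Get DB connection
Connect to OpenAI
Load quotes into the Vector Store
Use case 1: quote search engine
Use case 2: quote generator
(Optional) exploit partitioning in the Vector Store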

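How it works

Indexing

Each quote is turned into an embedding vector with OpenAI's embedding API. These vectors are saved in the Vector Store for later use in searching. Some metadata, including the author's name and a few other pre-computed tags, is stored alongside the vectors to allow for search customization.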

1_vector_indexing

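Search

To find quotes similar to a provided search quote, the latter is first turned into an embedding vector, and this vector is then used to query the store for similar vectors, i.e. similar quotes that were previously indexed. The search can optionally be constrained by additional metadata ("find me quotes by Spinoza similar to this one ...").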

2_vector_search
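
As a rough illustration of the search step, here is a minimal in-memory sketch; it is not how Cassandra / Astra DB executes a query. The tiny 3-dimensional vectors, the `toy_store` list and the `toy_search` helper are all made up for this example, and a real store relies on an approximate nearest-neighbor index rather than scanning every row.

```python
import math

# Hypothetical toy store: (quote, vector, metadata) rows with made-up 3-d vectors.
toy_store = [
    ("quote A", [1.0, 0.0, 0.0], {"author": "spinoza"}),
    ("quote B", [0.8, 0.6, 0.0], {"author": "kant"}),
    ("quote C", [0.0, 1.0, 0.0], {"author": "spinoza"}),
]


def cosine_similarity(v1, v2):
    """Scalar product divided by the product of the two norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


def toy_search(query_vector, n, metadata=None):
    """Return the n most similar quotes among rows matching every metadata item."""
    metadata = metadata or {}
    candidates = [
        (cosine_similarity(query_vector, vector), quote)
        for quote, vector, row_md in toy_store
        if all(row_md.get(key) == value for key, value in metadata.items())
    ]
    candidates.sort(reverse=True)  # highest similarity first
    return [quote for _, quote in candidates[:n]]


print(toy_search([0.9, 0.1, 0.0], n=2))
print(toy_search([0.9, 0.1, 0.0], n=2, metadata={"author": "spinoza"}))
```

The second call shows the effect of the metadata constraint: rows by other authors are dropped before the ranking takes place.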

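The key point here is that "quotes similar in content" translates, in vector space, to vectors that are metrically close to each other: thus, vector similarity search effectively implements semantic similarity. This is the key reason vector embeddings are so powerful.

The sketch below tries to convey this idea. Each quote, once it is turned into a vector, is a point in space. Well, in this case it lies on a sphere, since OpenAI's embedding vectors, like most others, are normalized to unit length. Oh, and the sphere is actually not three-dimensional, but rather 1536-dimensional!

So, in essence, a similarity search in vector space returns the vectors that are closest to the query vector: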

3_vector_space
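
A small numeric check can make the "points on a sphere" picture concrete. The sketch below uses made-up 3-dimensional vectors (the embeddings in this notebook have 1536 components) and a hypothetical `normalize` helper. Once vectors have unit length, their cosine similarity is just their scalar product, and it is tied to the Euclidean distance d by d^2 = 2 - 2 cos, so "closest on the sphere" and "most similar" produce the same ranking.

```python
import math


def normalize(v):
    """Rescale a vector to unit length (hypothetical helper for this sketch)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(v1, v2):
    """Scalar product of two equally long vectors."""
    return sum(a * b for a, b in zip(v1, v2))


def euclidean_distance(v1, v2):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


# Made-up vectors standing in for embeddings.
u = normalize([3.0, 4.0, 0.0])
w = normalize([1.0, 2.0, 2.0])

cos_sim = dot(u, w)  # for unit vectors this is the cosine similarity
dist = euclidean_distance(u, w)

print(f"cosine similarity = {cos_sim:.4f}")
print(f"distance squared  = {dist ** 2:.4f}")
print(f"2 - 2 * cosine    = {2 - 2 * cos_sim:.4f}")
```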

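Generation

Given a suggestion (a topic or a tentative quote), the search step is performed, and the top returned results (quotes) are fed into an LLM prompt which asks the generative model to invent a new text along the lines of the passed examples and the initial suggestion.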

4_quote_generation

Setup

First install some required packages:

!pip install --quiet "cassio>=0.1.3" "openai>=1.0.0" datasets

from getpass import getpass
from collections import Counter

import cassio
from cassio.table import MetadataVectorCassandraTable

import openai
from datasets import load_dataset

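Get DB connection

In order to connect to your Astra DB through CQL, you need two things:

A Token, with role "Database Administrator" (it looks like AstraCS:...)
The database ID (it looks like 3df2a5b6-...)

Make sure you have both strings, which are obtained in the Astra UI once you sign in. For more information, see here: database ID and Token.

If you want to connect to a Cassandra cluster instead (which, however, must support Vector Search), replace the cassio.init call below with cassio.init(session=..., keyspace=...), passing a suitable Session and keyspace name for your cluster.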

astra_token = getpass("Please enter your Astra token ('AstraCS:...')")
database_id = input("Please enter your database id ('3df2a5b6-...')")

Please enter your Astra token ('AstraCS:...') ········
Please enter your database id ('3df2a5b6-...') 01234567-89ab-dcef-0123-456789abcdef

cassio.init(token=astra_token, database_id=database_id)

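Creation of the DB connection

The cassio.init(...) call above is what creates the connection to Astra DB through CQL.

(Incidentally, you could also use any Cassandra cluster, as long as it provides Vector capabilities, by passing a Session and a keyspace to cassio.init instead of the token and database ID.)

Creation of the Vector Store through CassIO

You need a table which supports vectors and is equipped with metadata. Call it "philosophers_cassio":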

v_table = MetadataVectorCassandraTable(table="philosophers_cassio", vector_dimension=1536)

Connect to OpenAI

Set up your secret key

OPENAI_API_KEY = getpass("Please enter your OpenAI API Key: ")

Please enter your OpenAI API Key:  ········

A test call for embeddings

Quickly check how one can get the embedding vectors for a list of input texts:

client = openai.OpenAI(api_key=OPENAI_API_KEY)
embedding_model_name = "text-embedding-3-small"

result = client.embeddings.create(
    input=[
        "This is a sentence",
        "A second sentence"
    ],
    model=embedding_model_name,
)

Note: the above is the syntax for OpenAI v1.0+. If using previous versions, the code to get the embeddings will look different.

print(f"len(result.data)              = {len(result.data)}")
print(f"result.data[1].embedding      = {str(result.data[1].embedding)[:55]}...")
print(f"len(result.data[1].embedding) = {len(result.data[1].embedding)}")

len(result.data)              = 2
result.data[1].embedding      = [-0.010821706615388393, 0.001387271680869162, 0.0035479...
len(result.data[1].embedding) = 1536

Load quotes into the Vector Store

Get a dataset with the quotes:

philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]

A quick inspection:

print("An example entry:")
print(philo_dataset[16])

An example entry:
{'author': 'aristotle', 'quote': 'Love well, be loved and do something of value.', 'tags': 'love;ethics'}

Check the dataset size:

author_count = Counter(entry["author"] for entry in philo_dataset)
print(f"Total: {len(philo_dataset)} quotes. By author:")
for author, count in author_count.most_common():
    print(f"    {author:<20}: {count} quotes")

Total: 450 quotes. By author:
    aristotle           : 50 quotes
    schopenhauer        : 50 quotes
    spinoza             : 50 quotes
    hegel               : 50 quotes
    freud               : 50 quotes
    nietzsche           : 50 quotes
    sartre              : 50 quotes
    plato               : 50 quotes
    kant                : 50 quotes

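Insert quotes into the Vector Store

You will compute the embeddings for the quotes and save them into the Vector Store, along with the text itself and the metadata planned for later use. Note that the author is added as a metadata field along with the "tags" already found with the quote itself.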
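
To make the metadata layout concrete, this small sketch shows how one dataset entry could be turned into the metadata dictionary stored next to its vector. The `build_metadata` helper is a name invented here for illustration; the entry is the sample shown in the inspection above.

```python
def build_metadata(author, tags_field):
    """Turn a ';'-separated tag string plus an author into a flat metadata dict."""
    tags = set(tags_field.split(";")) if tags_field else set()
    metadata = {tag: True for tag in tags}  # each tag becomes a boolean flag
    metadata["author"] = author
    return metadata


entry = {
    "author": "aristotle",
    "quote": "Love well, be loved and do something of value.",
    "tags": "love;ethics",
}
print(build_metadata(entry["author"], entry["tags"]))
print(build_metadata("kant", None))  # an entry without tags keeps only the author
```

One consequence of this flat layout is that a tag literally named "author" would be overwritten by the author field, so tag names and field names share a single namespace.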

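To optimize speed and reduce the number of calls, you will perform batched calls to the OpenAI embedding service.

(Note: for faster execution, Cassandra and CassIO would let you do concurrent inserts, which we don't do here in order to keep the demo code more straightforward.)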
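
The batching arithmetic used by the loop can be checked in isolation. The sketch below is an illustration with a hypothetical `batch_bounds` helper and a made-up list of seven items: the ceiling division gives the number of batches, and the last batch is simply shorter when the total is not a multiple of the batch size.

```python
def batch_bounds(num_items, batch_size):
    """Yield (start, end) index pairs covering num_items in chunks of batch_size."""
    num_batches = (num_items + batch_size - 1) // batch_size  # ceiling division
    for batch_i in range(num_batches):
        start = batch_i * batch_size
        end = min(start + batch_size, num_items)  # clamp the final batch
        yield start, end


items = [f"quote {i}" for i in range(7)]  # made-up stand-in for the quote list
for start, end in batch_bounds(len(items), batch_size=3):
    print(start, end, items[start:end])
```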

BATCH_SIZE = 50

num_batches = ((len(philo_dataset) + BATCH_SIZE - 1) // BATCH_SIZE)

quotes_list = philo_dataset["quote"]
authors_list = philo_dataset["author"]
tags_list = philo_dataset["tags"]

print("Starting to store entries:")
for batch_i in range(num_batches):
    b_start = batch_i * BATCH_SIZE
    b_end = (batch_i + 1) * BATCH_SIZE
    # compute the embedding vectors for this batch
    b_emb_results = client.embeddings.create(
        input=quotes_list[b_start : b_end],
        model=embedding_model_name,
    )
    # prepare the rows for insertion
    print("B ", end="")
    for entry_idx, emb_result in zip(range(b_start, b_end), b_emb_results.data):
        if tags_list[entry_idx]:
            tags = {
                tag
                for tag in tags_list[entry_idx].split(";")
            }
        else:
            tags = set()
        author = authors_list[entry_idx]
        quote = quotes_list[entry_idx]
        v_table.put(
            row_id=f"q_{author}_{entry_idx}",
            body_blob=quote,
            vector=emb_result.embedding,
            metadata={**{tag: True for tag in tags}, **{"author": author}},
        )
        print("*", end="")
    print(f" done ({len(b_emb_results.data)})")

print("\nFinished storing entries.")

Starting to store entries:
B ************************************************** done (50)
B ************************************************** done (50)
B ************************************************** done (50)
B ************************************************** done (50)
B ************************************************** done (50)
B ************************************************** done (50)
B ************************************************** done (50)
B ************************************************** done (50)
B ************************************************** done (50)

Finished storing entries.

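Use case 1: quote search engine

For the quote-search functionality, you first need to turn the input quote into a vector, and then use it to query the store (besides passing the optional metadata to the search call, that is).

Encapsulate the search-engine functionality into a function for ease of re-use: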

def find_quote_and_author(query_quote, n, author=None, tags=None):
    query_vector = client.embeddings.create(
        input=[query_quote],
        model=embedding_model_name,
    ).data[0].embedding
    metadata = {}
    if author:
        metadata["author"] = author
    if tags:
        for tag in tags:
            metadata[tag] = True
    #
    results = v_table.ann_search(
        query_vector,
        n=n,
        metadata=metadata,
    )
    return [
        (result["body_blob"], result["metadata"]["author"])
        for result in results
    ]

Putting search to test

Passing just a quote:

find_quote_and_author("We struggle all our life for nothing", 3)

[('Life to the great majority is only a constant struggle for mere existence, with the certainty of losing it at last.',
  'schopenhauer'),
 ('We give up leisure in order that we may have leisure, just as we go to war in order that we may have peace.',
  'aristotle'),
 ('Perhaps the gods are kind to us, by making life more disagreeable as we grow older. In the end death seems less intolerable than the manifold burdens we carry',
  'freud')]

Search restricted to an author:

find_quote_and_author("We struggle all our life for nothing", 2, author="nietzsche")

[('To live is to suffer, to survive is to find some meaning in the suffering.',
  'nietzsche'),
 ('What makes us heroic?--Confronting simultaneously our supreme suffering and our supreme hope.',
  'nietzsche')]

Search constrained to a tag (out of those saved earlier with the quotes):

find_quote_and_author("We struggle all our life for nothing", 2, tags=["politics"])

[('Mankind will never see an end of trouble until lovers of wisdom come to hold political power, or the holders of power become lovers of wisdom',
  'plato'),
 ('Everything the State says is a lie, and everything it has it has stolen.',
  'nietzsche')]

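Cutting out irrelevant results

The vector similarity search generally returns the vectors that are closest to the query, even if that means results that might be somewhat irrelevant when there is nothing better.

To keep this issue under control, you can get the actual "distance" between the query and each result, and then set a cutoff on it, effectively discarding results that are beyond that threshold. Tuning this threshold correctly is not an easy problem: here, we'll just show you the way.

To get a feeling for how this works, try the following query and play with the choice of quote and threshold to compare the results:

Note (for the mathematically inclined): this "distance" is exactly the cosine similarity between the vectors, i.e. the scalar product divided by the product of the norms of the two vectors. As such, it is a number ranging from -1 to +1, where -1 is for exactly opposite-facing vectors and +1 for identically-oriented vectors, so higher values mean closer matches. Elsewhere (e.g. in the "CQL" counterpart of this demo) you would get a rescaling of this quantity to fit the [0, 1] interval, which means the resulting numerical values and adequate thresholds there are transformed accordingly.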
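
The cutoff and the rescaling can be sketched with plain numbers. Two things below are assumptions of this sketch rather than statements from the notebook: that results scoring at or above the threshold are the ones kept, and that the [0, 1] variant is the linear map (1 + cos) / 2. The helper names and the scores are made up.

```python
def rescale_to_unit_interval(cosine_similarity):
    """Map a cosine similarity from [-1, 1] onto [0, 1] (assumed linear rescaling)."""
    return (1 + cosine_similarity) / 2


def keep_above_threshold(scored_results, threshold):
    """Keep only (score, text) pairs whose score is at least the threshold."""
    return [(score, text) for score, text in scored_results if score >= threshold]


# Made-up scores, for illustration only.
scored = [(0.86, "close match"), (0.84, "borderline"), (0.79, "weak match")]

print(keep_above_threshold(scored, threshold=0.84))
print(rescale_to_unit_interval(0.84))
```

Under that assumed rescaling, a threshold tuned on the [-1, 1] scale has to be mapped through the same function before it is reused on the [0, 1] scale.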

quote = "Animals are our equals."
# quote = "Be good."
# quote = "This teapot is strange."

metric_threshold = 0.84

quote_vector = client.embeddings.create(
    input=[quote],
    model=embedding_model_name,
).data[0].embedding

results = list(v_table.metric_ann_search(
    quote_vector,
    n=8,
    metric="cos",
    metric_threshold=metric_threshold,
))

print(f"{len(results)} quotes within the threshold:")
for idx, result in enumerate(results):
    print(f"    {idx}. [distance={result['distance']:.3f}] \"{result['body_blob'][:70]}...\"")

3 quotes within the threshold:
    0. [distance=0.855] "The assumption that animals are without rights, and the illusion that ..."
    1. [distance=0.843] "Animals are in possession of themselves; their soul is in possession o..."
    2. [distance=0.841] "At his best, man is the noblest of all animals; separated from law and..."

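Use case 2: quote generator

For this task you need another component from OpenAI, namely an LLM to generate the quote for us (based on input obtained by querying the Vector Store).

You also need a template for the prompt that will be filled in for the quote-generation LLM completion task.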

completion_model_name = "gpt-3.5-turbo"

generation_prompt_template = """Generate a single short philosophical quote on the given topic,
similar in spirit and form to the provided actual example quotes.
Do not exceed 20-30 words in your quote.

REFERENCE TOPIC: "{topic}"

ACTUAL EXAMPLES:
{examples}
"""

As with search, this functionality is best wrapped into a handy function (which internally uses search):

def generate_quote(topic, n=2, author=None, tags=None):
    quotes = find_quote_and_author(query_quote=topic, n=n, author=author, tags=tags)
    if quotes:
        prompt = generation_prompt_template.format(
            topic=topic,
            examples="\n".join(f"  - {quote[0]}" for quote in quotes),
        )
        # a little logging:
        print("** quotes found:")
        for q, a in quotes:
            print(f"**    - {q} ({a})")
        print("** end of logging")
        #
        response = client.chat.completions.create(
            model=completion_model_name,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=320,
        )
        return response.choices[0].message.content.replace('"', '').strip()
    else:
        print("** no quotes found.")
        return None

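Note: similar to the case of the embedding computation, the code for the Chat Completion API would be slightly different for OpenAI prior to v1.0.

Putting quote generation to test

Just passing a text (a "quote", but one can actually just suggest a topic, since its vector embedding will still end up at the right place in the vector space):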

q_topic = generate_quote("politics and virtue")
print("\nA new generated quote:")
print(q_topic)

** quotes found:
**    - Happiness is the reward of virtue. (aristotle)
**    - Our moral virtues benefit mainly other people; intellectual virtues, on the other hand, benefit primarily ourselves; therefore the former make us universally popular, the latter unpopular. (schopenhauer)
** end of logging

A new generated quote:
Virtuous politics purifies society, while corrupt politics breeds chaos and decay.

Use inspiration from just a single philosopher:

q_topic = generate_quote("animals", author="schopenhauer")
print("\nA new generated quote:")
print(q_topic)

** quotes found:
**    - Because Christian morality leaves animals out of account, they are at once outlawed in philosophical morals; they are mere 'things,' mere means to any ends whatsoever. They can therefore be used for vivisection, hunting, coursing, bullfights, and horse racing, and can be whipped to death as they struggle along with heavy carts of stone. Shame on such a morality that is worthy of pariahs, and that fails to recognize the eternal essence that exists in every living thing, and shines forth with inscrutable significance from all eyes that see the sun! (schopenhauer)
**    - The assumption that animals are without rights, and the illusion that our treatment of them has no moral significance, is a positively outrageous example of Western crudity and barbarity. Universal compassion is the only guarantee of morality. (schopenhauer)
** end of logging

A new generated quote:
The true measure of humanity lies not in our dominion over animals, but in our ability to show compassion and respect for all living beings.

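(Optional) Partitioning

There's an interesting topic to examine before completing this quickstart. While, generally, tags and quotes can be in any relationship (e.g. a quote having multiple tags), authors are effectively an exact grouping (they define a "disjoint partitioning" on the set of quotes): each quote has exactly one author (for us, at least).

Now, suppose you know in advance that your application will usually (or always) run queries on a single author. Then you can take full advantage of the underlying database structure: if you group quotes in partitions (one per author), vector queries on just one author will use fewer resources and return much faster.

We'll not dive into the details here, which have to do with the Cassandra storage internals: the important message is that if your queries are run within a group, consider partitioning accordingly to boost performance.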
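
As a loose analogy only (it says nothing about how Cassandra actually lays out data on disk), the benefit of partitioning can be pictured with a dictionary keyed by author. The rows and the `candidates` helper below are made up: a query restricted to one author only has to look inside that author's group.

```python
from collections import defaultdict

# Made-up rows: (author, quote_id). In the notebook each row also carries a vector.
rows = [
    ("plato", "q1"), ("kant", "q2"), ("plato", "q3"),
    ("hegel", "q4"), ("kant", "q5"), ("plato", "q6"),
]

# Group rows by author: each author becomes one partition.
partitions = defaultdict(list)
for author, quote_id in rows:
    partitions[author].append(quote_id)


def candidates(author=None):
    """Rows a search has to consider: one partition if an author is given, else all."""
    if author is not None:
        return list(partitions.get(author, []))
    return [quote_id for ids in partitions.values() for quote_id in ids]


print(len(candidates()))        # every row is a candidate
print(len(candidates("kant")))  # only kant's partition is touched
```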

You'll now see this choice in action.

First, you need a different table abstraction from CassIO:

from cassio.table import ClusteredMetadataVectorCassandraTable

v_table_partitioned = ClusteredMetadataVectorCassandraTable(table="philosophers_cassio_partitioned", vector_dimension=1536)

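Now repeat the compute-embeddings-and-insert step on the new table.

Compared to what you have seen earlier, there is a crucial difference in that now the quote's author is stored as the partition id for the inserted row, instead of being added to the catch-all "metadata" dictionary.

While you are at it, by way of demonstration, you will insert the quotes of each batch concurrently: with CassIO, this is done by calling the asynchronous put_async method for each quote, collecting the resulting list of Future objects, and then calling the result() method on each of them to ensure they have all executed. Cassandra / Astra DB supports a high degree of concurrency in I/O operations well.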
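
The submit-then-wait pattern described above is not specific to CassIO. The sketch below reproduces it with Python's standard concurrent.futures module and a made-up `fake_put` function; it does not touch the database, and the futures returned by put_async are a different type from the ones used here, so treat this purely as an illustration of the control flow.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_put(row_id):
    """Stand-in for a database write: sleeps briefly, then returns the row id."""
    time.sleep(0.01)
    return row_id


row_ids = [f"q_plato_{i}" for i in range(5)]

with ThreadPoolExecutor(max_workers=5) as executor:
    # Start every write without waiting for the previous one to finish.
    futures = [executor.submit(fake_put, row_id) for row_id in row_ids]
    # Block until each write has completed; result() re-raises any failure.
    written = [future.result() for future in futures]

print(written)
```

Waiting on every future before moving to the next batch keeps the number of in-flight writes bounded by the batch size.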

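(Note: one could have cached the embeddings computed previously to save a few API tokens; here, however, we wanted to keep the code easier to inspect.)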
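
For reference, such a cache could be as simple as a dictionary keyed by the quote text. The following is a minimal sketch under that assumption, with a made-up `fake_embed` function standing in for the embedding API call; none of these names appear in the notebook.

```python
embedding_cache = {}  # maps each text to its already-computed vector
call_log = []         # records every (simulated) API call, for inspection


def fake_embed(text):
    """Made-up stand-in for the embedding API call; logs each invocation."""
    call_log.append(text)
    return [float(len(text)), 0.0]


def cached_embedding(text):
    """Compute the embedding only the first time a given text is seen."""
    if text not in embedding_cache:
        embedding_cache[text] = fake_embed(text)
    return embedding_cache[text]


first = cached_embedding("Happiness is the reward of virtue.")
second = cached_embedding("Happiness is the reward of virtue.")
print(len(call_log))
```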

BATCH_SIZE = 50

num_batches = ((len(philo_dataset) + BATCH_SIZE - 1) // BATCH_SIZE)

quotes_list = philo_dataset["quote"]
authors_list = philo_dataset["author"]
tags_list = philo_dataset["tags"]

print("Starting to store entries:")
for batch_i in range(num_batches):
    b_start = batch_i * BATCH_SIZE
    b_end = (batch_i + 1) * BATCH_SIZE
    # compute the embedding vectors for this batch
    b_emb_results = client.embeddings.create(
        input=quotes_list[b_start : b_end],
        model=embedding_model_name,
    )
    # prepare the rows for insertion
    futures = []
    print("B ", end="")
    for entry_idx, emb_result in zip(range(b_start, b_end), b_emb_results.data):
        if tags_list[entry_idx]:
            tags = {
                tag
                for tag in tags_list[entry_idx].split(";")
            }
        else:
            tags = set()
        author = authors_list[entry_idx]
        quote = quotes_list[entry_idx]
        futures.append(v_table_partitioned.put_async(
            partition_id=author,
            row_id=f"q_{author}_{entry_idx}",
            body_blob=quote,
            vector=emb_result.embedding,
            metadata={tag: True for tag in tags},
        ))
    #
    for future in futures:
        future.result()
    #
    print(f" done ({len(b_emb_results.data)})")

print("\nFinished storing entries.")

Starting to store entries:
B  done (50)
B  done (50)
B  done (50)
B  done (50)
B  done (50)
B  done (50)
B  done (50)
B  done (50)
B  done (50)

Finished storing entries.

With this new table, the similarity search changes accordingly (note the arguments to ann_search):

def find_quote_and_author_p(query_quote, n, author=None, tags=None):
    query_vector = client.embeddings.create(
        input=[query_quote],
        model=embedding_model_name,
    ).data[0].embedding
    metadata = {}
    partition_id = None
    if author:
        partition_id = author
    if tags:
        for tag in tags:
            metadata[tag] = True
    #
    results = v_table_partitioned.ann_search(
        query_vector,
        n=n,
        partition_id=partition_id,
        metadata=metadata,
    )
    return [
        (result["body_blob"], result["partition_id"])
        for result in results
    ]

That's it: the new table still fully supports the "generic" similarity searches ...

find_quote_and_author_p("We struggle all our life for nothing", 3)

[('Life to the great majority is only a constant struggle for mere existence, with the certainty of losing it at last.',
  'schopenhauer'),
 ('We give up leisure in order that we may have leisure, just as we go to war in order that we may have peace.',
  'aristotle'),
 ('Perhaps the gods are kind to us, by making life more disagreeable as we grow older. In the end death seems less intolerable than the manifold burdens we carry',
  'freud')]

... but it is when an author is specified that you would notice a huge performance advantage:

find_quote_and_author_p("We struggle all our life for nothing", 2, author="nietzsche")

[('To live is to suffer, to survive is to find some meaning in the suffering.',
  'nietzsche'),
 ('What makes us heroic?--Confronting simultaneously our supreme suffering and our supreme hope.',
  'nietzsche')]

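Well, you would notice a performance gain if you had a realistic-size dataset. In this demo, with only a few hundred entries, there is no noticeable difference, but you get the idea.

Conclusion

Congratulations! You have learned how to use OpenAI for vector embeddings and Cassandra / Astra DB through CQL for storage in order to build a sophisticated philosophical search engine and quote generator.

This example used CassIO to interface with the Vector Store, but this is not the only choice. Check the README for other options and integration with popular frameworks.

To find out more on how Astra DB's Vector Search capabilities can be a key ingredient in your ML/GenAI applications, visit Astra DB's web page on the topic.

Cleanup

If you want to remove all resources used for this demo, run this cell (warning: this will delete the tables and the data inserted in them!):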

# we peek at CassIO's config to get a direct handle to the DB connection
session = cassio.config.resolve_session()
keyspace = cassio.config.resolve_keyspace()

session.execute(f"DROP TABLE IF EXISTS {keyspace}.philosophers_cassio;")
session.execute(f"DROP TABLE IF EXISTS {keyspace}.philosophers_cassio_partitioned;")

<cassandra.cluster.ResultSet at 0x7fdcc42e8f10>