Philosophy with Vector Embeddings, OpenAI and Cassandra / Astra DB

August 29, 2023

In this quickstart you will learn how to build a "philosophy quote finder & generator" using OpenAI's vector embeddings and Apache Cassandra®, or equivalently DataStax Astra DB through CQL, as the vector store for data persistence.

The basic workflow of this notebook is outlined below. You will compute and store the vector embeddings for a number of quotes by famous philosophers, use them to build a powerful search engine and, after that, even a generator of new quotes!

The notebook exemplifies some of the standard usage patterns of Vector Search, while showing how easy it is to get started with the vector capabilities of Cassandra / Astra DB through CQL.

For background on using Vector Search and text embeddings to build a question-answering system, please check out this excellent hands-on notebook: Question answering using embeddings.

Choose your framework

Please note that this notebook uses the Cassandra drivers and runs CQL (Cassandra Query Language) statements directly, but we cover other choices of technology to accomplish the same task. Check out this folder's README for other options. This notebook can run either as a Colab notebook or as a regular Jupyter notebook.

Table of contents

  • Setup
  • Get a DB connection
  • Connect to OpenAI
  • Load quotes into the Vector Store
  • Use case 1: quote search engine
  • Use case 2: quote generator
  • (Optional) exploit partitioning in the Vector Store

How it works

Indexing

Each quote is made into an embedding vector with OpenAI's Embedding. These are saved in the Vector Store for later use in searching. Some metadata, including the author's name and a few other pre-computed tags, are stored alongside, to allow for search customization.

1_vector_indexing_cql

Search

To find a quote similar to the provided search quote, the latter is made into an embedding vector on the fly, and this vector is used to query the store for similar vectors, i.e. similar quotes that were previously indexed. The search can optionally be constrained by additional metadata ("find me quotes by Spinoza similar to this one ...").

2_vector_search_cql

The key point here is that "quotes similar in content" translates, in vector space, to vectors that are metrically close to each other: thus, vector similarity search effectively implements semantic similarity. This is the key reason why vector embeddings are so powerful.

The sketch below tries to convey this idea. Each quote, once it's made into a vector, is a point in space. Well, in this case it's on a sphere, since OpenAI's embedding vectors, as most others, are normalized to unit length. Oh, and the sphere is not actually three-dimensional, rather 1536-dimensional!

So, in essence, a similarity search in vector space returns the vectors that are closest to the query vector:

3_vector_space
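To make this concrete, here is a tiny, purely illustrative sketch (plain NumPy, not part of the notebook's pipeline): a few unit-length toy vectors stand in for embedded quotes, and ranking them by dot product against a unit-length query vector is exactly the same as ranking them by cosine similarity.

import numpy as np

# Toy stand-ins for embedded quotes: unit-length vectors in a tiny 3-D space
# (the real vectors live in 1536 dimensions, but the idea is identical).
toy_quotes = {
    "quote A": np.array([0.9, 0.1, 0.1]),
    "quote B": np.array([0.1, 0.9, 0.1]),
    "quote C": np.array([0.7, 0.6, 0.2]),
}
toy_quotes = {name: v / np.linalg.norm(v) for name, v in toy_quotes.items()}

query = np.array([0.8, 0.3, 0.1])
query = query / np.linalg.norm(query)

# For unit vectors, the dot product *is* the cosine similarity,
# so sorting by dot product returns the "semantically closest" quotes first.
ranked = sorted(toy_quotes.items(), key=lambda kv: float(query @ kv[1]), reverse=True)
for name, vec in ranked:
    print(f"{name}: similarity = {float(query @ vec):.3f}")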

Generation

Given a suggestion (a topic or a tentative quote), the search step is run, and the first returned results (quotes) are fed into an LLM prompt which asks the generative model to invent a new text along the lines of the passed examples and the initial suggestion.

4_quote_generation

Setup

Install and import the necessary dependencies:

!pip install --quiet "cassandra-driver>=3.28.0" "openai>=1.0.0" datasets
import os
from uuid import uuid4
from getpass import getpass
from collections import Counter

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

import openai
from datasets import load_dataset

Don't mind the next cell too much: it is only needed to detect Colab and, in that case, let you upload the SCB file (see below).

try:
    from google.colab import files
    IS_COLAB = True
except ModuleNotFoundError:
    IS_COLAB = False

A couple of secrets are required to create the Session object (the connection to your Astra DB instance).

(Note: some steps will be slightly different on Google Colab and on local Jupyter; that's why the notebook detects the runtime type.)

# Your database's Secure Connect Bundle zip file is needed:
if IS_COLAB:
    print('Please upload your Secure Connect Bundle zipfile: ')
    uploaded = files.upload()
    if uploaded:
        astraBundleFileTitle = list(uploaded.keys())[0]
        ASTRA_DB_SECURE_BUNDLE_PATH = os.path.join(os.getcwd(), astraBundleFileTitle)
    else:
        raise ValueError(
            'Cannot proceed without Secure Connect Bundle. Please re-run the cell.'
        )
else:
    # you are running a local-jupyter notebook:
    ASTRA_DB_SECURE_BUNDLE_PATH = input("Please provide the full path to your Secure Connect Bundle zipfile: ")

ASTRA_DB_APPLICATION_TOKEN = getpass("Please provide your Database Token ('AstraCS:...' string): ")
ASTRA_DB_KEYSPACE = input("Please provide the Keyspace name for your Database: ")
Please provide the full path to your Secure Connect Bundle zipfile:  /path/to/secure-connect-DatabaseName.zip
Please provide your Database Token ('AstraCS:...' string):  ········
Please provide the Keyspace name for your Database:  my_keyspace

Creation of the DB connection

This is how you create a connection to Astra DB:

(Incidentally, you could also use any Cassandra cluster, as long as it provides Vector capabilities, simply by changing the parameters of the Cluster instantiation below.)
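For reference, a minimal sketch of what that alternative instantiation could look like against a self-managed, vector-capable Cassandra cluster. The contact point, port and credentials below are placeholders, not values used anywhere in this quickstart, which instead connects to Astra DB in the cell that follows:

# Hypothetical connection to a self-managed, vector-capable Cassandra cluster
# (contact points and credentials are placeholders -- adapt to your deployment):
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

oss_cluster = Cluster(
    contact_points=["127.0.0.1"],
    port=9042,
    auth_provider=PlainTextAuthProvider("my_username", "my_password"),
)
oss_session = oss_cluster.connect()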

# Don't mind the "Closing connection" error after "downgrading protocol..." messages you may see,
# it is really just a warning: the connection will work smoothly.
cluster = Cluster(
    cloud={
        "secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH,
    },
    auth_provider=PlainTextAuthProvider(
        "token",
        ASTRA_DB_APPLICATION_TOKEN,
    ),
)

session = cluster.connect()
keyspace = ASTRA_DB_KEYSPACE

Creation of the Vector table in CQL

You need a table which supports vectors and is equipped with metadata. Call it "philosophers_cql".

Each row will store: a quote, its vector embedding, the quote's author and a set of "tags". You also need a primary key to ensure uniqueness of rows.

The following is the full CQL command that creates the table (check out this page for more on the CQL syntax of this and the following statements):

create_table_statement = f"""CREATE TABLE IF NOT EXISTS {keyspace}.philosophers_cql (
    quote_id UUID PRIMARY KEY,
    body TEXT,
    embedding_vector VECTOR<FLOAT, 1536>,
    author TEXT,
    tags SET<TEXT>
);"""

Pass this statement to your database Session to execute it:

session.execute(create_table_statement)
<cassandra.cluster.ResultSet at 0x7feee37b3460>

In order to run ANN (approximate-nearest-neighbor) searches on the vectors in the table, you need to create a specific index on the embedding_vector column.

When creating the index, you can optionally choose the "similarity function" used to compute vector distances: since for unit-length vectors (such as those coming from OpenAI) the "cosine difference" is the same as the "dot product", you'll use the latter, which is computationally less expensive.

Run this CQL statement:

create_vector_index_statement = f"""CREATE CUSTOM INDEX IF NOT EXISTS idx_embedding_vector
    ON {keyspace}.philosophers_cql (embedding_vector)
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
    WITH OPTIONS = {{'similarity_function' : 'dot_product'}};
"""
# Note: the double '{{' and '}}' are just the F-string escape sequence for '{' and '}'

session.execute(create_vector_index_statement)
<cassandra.cluster.ResultSet at 0x7feeefd3da00>

That is enough to run vector searches on the table ... but you also want to be able to optionally restrict the quote search by author and/or some tags. Create two more indexes to support this:

create_author_index_statement = f"""CREATE CUSTOM INDEX IF NOT EXISTS idx_author
    ON {keyspace}.philosophers_cql (author)
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex';
"""
session.execute(create_author_index_statement)

create_tags_index_statement = f"""CREATE CUSTOM INDEX IF NOT EXISTS idx_tags
    ON {keyspace}.philosophers_cql (VALUES(tags))
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex';
"""
session.execute(create_tags_index_statement)
<cassandra.cluster.ResultSet at 0x7fef2c64af70>
Connect to OpenAI

OPENAI_API_KEY = getpass("Please enter your OpenAI API Key: ")
Please enter your OpenAI API Key:  ········
client = openai.OpenAI(api_key=OPENAI_API_KEY)
embedding_model_name = "text-embedding-3-small"

result = client.embeddings.create(
    input=[
        "This is a sentence",
        "A second sentence"
    ],
    model=embedding_model_name,
)

Note: the syntax above is for OpenAI v1.0 and later. If you are using a previous version, the code to get the embeddings will look different.
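For comparison only, the pre-v1.0 call looked roughly like the commented sketch below. It is shown purely as an illustration and is not used anywhere in this notebook, which assumes openai >= 1.0:

# Legacy (openai < 1.0) style, for illustration only -- do not run with openai >= 1.0:
# import openai
# openai.api_key = OPENAI_API_KEY
# legacy_result = openai.Embedding.create(
#     input=["This is a sentence", "A second sentence"],
#     model=embedding_model_name,
# )
# first_vector = legacy_result["data"][0]["embedding"]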

print(f"len(result.data)              = {len(result.data)}")
print(f"result.data[1].embedding      = {str(result.data[1].embedding)[:55]}...")
print(f"len(result.data[1].embedding) = {len(result.data[1].embedding)}")
len(result.data)              = 2
result.data[1].embedding      = [-0.0108176339417696, 0.0013546717818826437, 0.00362232...
len(result.data[1].embedding) = 1536
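As a quick sanity check (not part of the original flow), you can verify the claim made earlier: these embeddings come back normalized to unit length, which is why the dot product can stand in for cosine similarity in the index created above. This assumes numpy is available in your environment (it is pulled in by datasets):

import numpy as np

v0 = np.array(result.data[0].embedding)
v1 = np.array(result.data[1].embedding)

print(f"Norm of first embedding:  {np.linalg.norm(v0):.6f}")  # expected ~1.0
print(f"Norm of second embedding: {np.linalg.norm(v1):.6f}")  # expected ~1.0
# For unit-length vectors, cosine similarity and dot product coincide:
cosine = float(v0 @ v1) / (np.linalg.norm(v0) * np.linalg.norm(v1))
print(f"dot product = {float(v0 @ v1):.6f}, cosine = {cosine:.6f}")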

Load quotes into the Vector Store

Get a dataset with the quotes. (We adapted and augmented the data from this Kaggle dataset, ready to use in this demo.)

philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]

A quick inspection:

print("An example entry:")
print(philo_dataset[16])
An example entry:
{'author': 'aristotle', 'quote': 'Love well, be loved and do something of value.', 'tags': 'love;ethics'}

Check the dataset size:

author_count = Counter(entry["author"] for entry in philo_dataset)
print(f"Total: {len(philo_dataset)} quotes. By author:")
for author, count in author_count.most_common():
    print(f"    {author:<20}: {count} quotes")
Total: 450 quotes. By author:
    aristotle           : 50 quotes
    schopenhauer        : 50 quotes
    spinoza             : 50 quotes
    hegel               : 50 quotes
    freud               : 50 quotes
    nietzsche           : 50 quotes
    sartre              : 50 quotes
    plato               : 50 quotes
    kant                : 50 quotes

Insert quotes into the vector store

You will compute the embeddings for the quotes and save them into the Vector Store, along with the text itself and the metadata planned for later use.

To optimize speed and reduce the number of calls, you'll perform batched calls to the OpenAI embedding service.

The DB write is accomplished with a CQL statement. But since you'll run this particular insertion several times (albeit with different values), it's best to prepare the statement and then just run it over and over.

(Note: for faster insertion, the Cassandra drivers would let you run concurrent inserts, which we don't do here in favor of more straightforward demo code.)

prepared_insertion = session.prepare(
    f"INSERT INTO {keyspace}.philosophers_cql (quote_id, author, body, embedding_vector, tags) VALUES (?, ?, ?, ?, ?);"
)

BATCH_SIZE = 20

num_batches = ((len(philo_dataset) + BATCH_SIZE - 1) // BATCH_SIZE)

quotes_list = philo_dataset["quote"]
authors_list = philo_dataset["author"]
tags_list = philo_dataset["tags"]

print("Starting to store entries:")
for batch_i in range(num_batches):
    b_start = batch_i * BATCH_SIZE
    b_end = (batch_i + 1) * BATCH_SIZE
    # compute the embedding vectors for this batch
    b_emb_results = client.embeddings.create(
        input=quotes_list[b_start : b_end],
        model=embedding_model_name,
    )
    # prepare the rows for insertion
    print("B ", end="")
    for entry_idx, emb_result in zip(range(b_start, b_end), b_emb_results.data):
        if tags_list[entry_idx]:
            tags = {
                tag
                for tag in tags_list[entry_idx].split(";")
            }
        else:
            tags = set()
        author = authors_list[entry_idx]
        quote = quotes_list[entry_idx]
        quote_id = uuid4()  # a new random ID for each quote. In a production app you'll want to have better control...
        session.execute(
            prepared_insertion,
            (quote_id, author, quote, emb_result.embedding, tags),
        )
        print("*", end="")
    print(f" done ({len(b_emb_results.data)})")

print("\nFinished storing entries.")
Starting to store entries:
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ******************** done (20)
B ********** done (10)

Finished storing entries.

Use case 1: quote search engine

For the quote-search functionality you first need to turn the input quote into a vector, and then use it to query the store (besides handling the optional metadata in the search call, that is).

Encapsulate the search-engine functionality into a function, for ease of reuse:

def find_quote_and_author(query_quote, n, author=None, tags=None):
    query_vector = client.embeddings.create(
        input=[query_quote],
        model=embedding_model_name,
    ).data[0].embedding
    # depending on what conditions are passed, the WHERE clause in the statement may vary.
    where_clauses = []
    where_values = []
    if author:
        where_clauses += ["author = %s"]
        where_values += [author]
    if tags:
        for tag in tags:
            where_clauses += ["tags CONTAINS %s"]
            where_values += [tag]
    # The reason for these two lists above is that when running the CQL search statement the values passed
    # must match the sequence of "?" marks in the statement.
    if where_clauses:
        search_statement = f"""SELECT body, author FROM {keyspace}.philosophers_cql
            WHERE {' AND '.join(where_clauses)}
            ORDER BY embedding_vector ANN OF %s
            LIMIT %s;
        """
    else:
        search_statement = f"""SELECT body, author FROM {keyspace}.philosophers_cql
            ORDER BY embedding_vector ANN OF %s
            LIMIT %s;
        """
    # For best performance, one should keep a cache of prepared statements (see the insertion code above)
    # for the various possible statements used here.
    # (We'll leave it as an exercise to the reader to avoid making this code too long.
    # Remember: to prepare a statement you use '?' instead of '%s'.)
    query_values = tuple(where_values + [query_vector] + [n])
    result_rows = session.execute(search_statement, query_values)
    return [
        (result_row.body, result_row.author)
        for result_row in result_rows
    ]
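As the comments in the function above hint, a production version would prepare and cache each distinct search statement instead of re-parsing it on every call. Here is a minimal, illustrative sketch of such a cache; the helper name get_prepared is ours, and remember that prepared statements use '?' markers instead of '%s':

# A minimal prepared-statement cache (illustrative sketch, not used by the demo functions):
prepared_cache = {}

def get_prepared(cql_text):
    # Prepare each distinct CQL string once, then reuse the prepared statement.
    if cql_text not in prepared_cache:
        prepared_cache[cql_text] = session.prepare(cql_text)
    return prepared_cache[cql_text]

# Example shape of a cached statement (no-WHERE-clause case), with '?' instead of '%s':
ann_search_cql = (
    f"SELECT body, author FROM {keyspace}.philosophers_cql "
    "ORDER BY embedding_vector ANN OF ? LIMIT ?;"
)
# rows = session.execute(get_prepared(ann_search_cql), (query_vector, 3))
# (query_vector would come from an embeddings call, as in find_quote_and_author.)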

Passing just a quote:

find_quote_and_author("We struggle all our life for nothing", 3)
[('Life to the great majority is only a constant struggle for mere existence, with the certainty of losing it at last.',
  'schopenhauer'),
 ('We give up leisure in order that we may have leisure, just as we go to war in order that we may have peace.',
  'aristotle'),
 ('Perhaps the gods are kind to us, by making life more disagreeable as we grow older. In the end death seems less intolerable than the manifold burdens we carry',
  'freud')]

Search restricted to an author:

find_quote_and_author("We struggle all our life for nothing", 2, author="nietzsche")
[('To live is to suffer, to survive is to find some meaning in the suffering.',
  'nietzsche'),
 ('What makes us heroic?--Confronting simultaneously our supreme suffering and our supreme hope.',
  'nietzsche')]

Search constrained to a tag (among those saved earlier with the quotes):

find_quote_and_author("We struggle all our life for nothing", 2, tags=["politics"])
[('Mankind will never see an end of trouble until lovers of wisdom come to hold political power, or the holders of power become lovers of wisdom',
  'plato'),
 ('Everything the State says is a lie, and everything it has it has stolen.',
  'nietzsche')]

Cutting out irrelevant results

The vector similarity search generally returns the vectors that are closest to the query, even if that means results that might be somewhat irrelevant when there is nothing better.

To keep this issue under control, you can get the actual "similarity" between the query and each result, and then set a cutoff on it, effectively discarding the results that fall below that threshold. Tuning this threshold correctly is not an easy problem: here, we'll just show you the way.

To get a feeling for how this works, try the following query and play with the choice of quote and threshold to compare the results:

Note (for the mathematically inclined): this value is a rescaling, to the zero-to-one range, of the cosine difference between the vectors, i.e. of the scalar product divided by the product of the norms of the two vectors. In other words, this is 0 for opposite-facing vectors and +1 for parallel vectors. For other measures of similarity, check the documentation -- and keep in mind that the metric in the SELECT query should match the one used when creating the index earlier, for meaningful, ordered results.
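If you want to make that rescaling explicit, here is a tiny numpy sketch on toy vectors. Our reading of the note above (stated here as an assumption, not taken from the original) is that the value corresponds to (1 + cosine) / 2: 0 for opposite vectors, 0.5 for orthogonal ones, 1 for parallel ones.

import numpy as np

def rescaled_cosine(u, v):
    # Assumed rescaling of the cosine into [0, 1]: 0 = opposite, 1 = parallel.
    cos = float(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))
    return (1.0 + cos) / 2.0

a = np.array([1.0, 0.0])
print(rescaled_cosine(a, a))                      # 1.0 (parallel)
print(rescaled_cosine(a, -a))                     # 0.0 (opposite)
print(rescaled_cosine(a, np.array([0.0, 1.0])))   # 0.5 (orthogonal)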

quote = "Animals are our equals."
# quote = "Be good."
# quote = "This teapot is strange."

similarity_threshold = 0.92

quote_vector = client.embeddings.create(
    input=[quote],
    model=embedding_model_name,
).data[0].embedding

# Once more: remember to prepare your statements in production for greater performance...

search_statement = f"""SELECT body, similarity_dot_product(embedding_vector, %s) as similarity
    FROM {keyspace}.philosophers_cql
    ORDER BY embedding_vector ANN OF %s
    LIMIT %s;
"""
query_values = (quote_vector, quote_vector, 8)

result_rows = session.execute(search_statement, query_values)
results = [
    (result_row.body, result_row.similarity)
    for result_row in result_rows
    if result_row.similarity >= similarity_threshold
]

print(f"{len(results)} quotes within the threshold:")
for idx, (r_body, r_similarity) in enumerate(results):
    print(f"    {idx}. [similarity={r_similarity:.3f}] \"{r_body[:70]}...\"")
3 quotes within the threshold:
    0. [similarity=0.927] "The assumption that animals are without rights, and the illusion that ..."
    1. [similarity=0.922] "Animals are in possession of themselves; their soul is in possession o..."
    2. [similarity=0.920] "At his best, man is the noblest of all animals; separated from law and..."

Use case 2: quote generator

For this task you need another component from OpenAI, namely an LLM that generates the quote for us (based on input obtained by querying the Vector Store).

You also need a template for the prompt that will be filled in for the generate-quote LLM completion task.

completion_model_name = "gpt-3.5-turbo"

generation_prompt_template = """Generate a single short philosophical quote on the given topic,
similar in spirit and form to the provided actual example quotes.
Do not exceed 20-30 words in your quote.

REFERENCE TOPIC: "{topic}"

ACTUAL EXAMPLES:
{examples}
"""

As with the search, this functionality is best wrapped into a handy function (which in turn uses the search internally):

def generate_quote(topic, n=2, author=None, tags=None):
    quotes = find_quote_and_author(query_quote=topic, n=n, author=author, tags=tags)
    if quotes:
        prompt = generation_prompt_template.format(
            topic=topic,
            examples="\n".join(f"  - {quote[0]}" for quote in quotes),
        )
        # a little logging:
        print("** quotes found:")
        for q, a in quotes:
            print(f"**    - {q} ({a})")
        print("** end of logging")
        #
        response = client.chat.completions.create(
            model=completion_model_name,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=320,
        )
        return response.choices[0].message.content.replace('"', '').strip()
    else:
        print("** no quotes found.")
        return None

Note: just as with the embedding computation, the code for the Chat Completion API would be slightly different for OpenAI prior to v1.0.
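Again for comparison only, the pre-v1.0 call looked roughly like the commented sketch below; it is shown as an illustration and not used anywhere in this notebook:

# Legacy (openai < 1.0) style, for illustration only -- do not run with openai >= 1.0:
# legacy_response = openai.ChatCompletion.create(
#     model=completion_model_name,
#     messages=[{"role": "user", "content": prompt}],
#     temperature=0.7,
#     max_tokens=320,
# )
# generated = legacy_response["choices"][0]["message"]["content"]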

Passing just a text (a "quote", but you can actually just suggest a topic, since its vector embedding will still end up in the right place in vector space):

q_topic = generate_quote("politics and virtue")
print("\nA new generated quote:")
print(q_topic)
** quotes found:
**    - Happiness is the reward of virtue. (aristotle)
**    - Our moral virtues benefit mainly other people; intellectual virtues, on the other hand, benefit primarily ourselves; therefore the former make us universally popular, the latter unpopular. (schopenhauer)
** end of logging

A new generated quote:
True politics is not the pursuit of power, but the cultivation of virtue for the betterment of all.

Using inspiration from just a single philosopher:

q_topic = generate_quote("animals", author="schopenhauer")
print("\nA new generated quote:")
print(q_topic)
** quotes found:
**    - Because Christian morality leaves animals out of account, they are at once outlawed in philosophical morals; they are mere 'things,' mere means to any ends whatsoever. They can therefore be used for vivisection, hunting, coursing, bullfights, and horse racing, and can be whipped to death as they struggle along with heavy carts of stone. Shame on such a morality that is worthy of pariahs, and that fails to recognize the eternal essence that exists in every living thing, and shines forth with inscrutable significance from all eyes that see the sun! (schopenhauer)
**    - The assumption that animals are without rights, and the illusion that our treatment of them has no moral significance, is a positively outrageous example of Western crudity and barbarity. Universal compassion is the only guarantee of morality. (schopenhauer)
** end of logging

A new generated quote:
Do not judge the worth of a soul by its outward form, for within every animal lies an eternal essence that deserves our compassion and respect.

(Optional) exploit partitioning in the Vector Store

There's an interesting topic to examine before completing this quickstart. While, generally, tags and quotes can be in any relationship (e.g. a quote having multiple tags), authors are effectively an exact grouping (they define a "disjoint partitioning" over the set of quotes): each quote has exactly one author (for us, at least).

Now, suppose you know in advance your application will usually (or always) run queries on a single author. Then you can take full advantage of the underlying database structure: if you group quotes in partitions (one per author), vector queries restricted to an author will use fewer resources and return much faster.

We'll not dive into the details here, which have to do with Cassandra storage internals: the important message is that if your queries are run within a group, consider partitioning accordingly to boost performance.

You'll now see this choice in action.

Partitioning per author calls for a new table schema: create a new table called "philosophers_cql_partitioned", along with the necessary indexes:

create_table_p_statement = f"""CREATE TABLE IF NOT EXISTS {keyspace}.philosophers_cql_partitioned (
    author TEXT,
    quote_id UUID,
    body TEXT,
    embedding_vector VECTOR<FLOAT, 1536>,
    tags SET<TEXT>,
    PRIMARY KEY ( (author), quote_id )
) WITH CLUSTERING ORDER BY (quote_id ASC);"""

session.execute(create_table_p_statement)

create_vector_index_p_statement = f"""CREATE CUSTOM INDEX IF NOT EXISTS idx_embedding_vector_p
    ON {keyspace}.philosophers_cql_partitioned (embedding_vector)
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
    WITH OPTIONS = {{'similarity_function' : 'dot_product'}};
"""

session.execute(create_vector_index_p_statement)

create_tags_index_p_statement = f"""CREATE CUSTOM INDEX IF NOT EXISTS idx_tags_p
    ON {keyspace}.philosophers_cql_partitioned (VALUES(tags))
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex';
"""
session.execute(create_tags_index_p_statement)
<cassandra.cluster.ResultSet at 0x7fef149d7940>

Now repeat the compute-embeddings-and-insert step on the new table.

You could use the very same insertion code as before, because the differences are hidden "behind the scenes": the database will store the inserted rows differently according to the partitioning scheme of this new table.

However, by way of demonstration, you will take advantage of a handy facility offered by the Cassandra drivers: running several queries (in this case, INSERTs) concurrently. This is something that Cassandra / Astra DB through CQL supports very well, and it can lead to a significant speedup with very few changes in the client code.

(Note: one could additionally have cached the embeddings computed previously to save a few API tokens -- here, however, we wanted to keep the code easier to inspect.)
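If you did want such a cache, a minimal sketch could look like the following: an in-memory dict keyed by the quote text, wrapped around the batched embedding call. The helper name get_embeddings_cached is ours and this is purely illustrative; the demo below does not use it.

# Illustrative in-memory embedding cache (not used in the demo flow below):
_embedding_cache = {}

def get_embeddings_cached(texts):
    # Only send to the API the texts that have not been embedded yet:
    missing = [t for t in texts if t not in _embedding_cache]
    if missing:
        api_result = client.embeddings.create(input=missing, model=embedding_model_name)
        for text, item in zip(missing, api_result.data):
            _embedding_cache[text] = item.embedding
    return [_embedding_cache[t] for t in texts]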

from cassandra.concurrent import execute_concurrent_with_args
prepared_insertion = session.prepare(
    f"INSERT INTO {keyspace}.philosophers_cql_partitioned (quote_id, author, body, embedding_vector, tags) VALUES (?, ?, ?, ?, ?);"
)

BATCH_SIZE = 50

num_batches = ((len(philo_dataset) + BATCH_SIZE - 1) // BATCH_SIZE)

quotes_list = philo_dataset["quote"]
authors_list = philo_dataset["author"]
tags_list = philo_dataset["tags"]

print("Starting to store entries:")
for batch_i in range(num_batches):
    print("[...", end="")
    b_start = batch_i * BATCH_SIZE
    b_end = (batch_i + 1) * BATCH_SIZE
    # compute the embedding vectors for this batch
    b_emb_results = client.embeddings.create(
        input=quotes_list[b_start : b_end],
        model=embedding_model_name,
    )
    # prepare this batch's entries for insertion
    tuples_to_insert = []
    for entry_idx, emb_result in zip(range(b_start, b_end), b_emb_results.data):
        if tags_list[entry_idx]:
            tags = {
                tag
                for tag in tags_list[entry_idx].split(";")
            }
        else:
            tags = set()
        author = authors_list[entry_idx]
        quote = quotes_list[entry_idx]
        quote_id = uuid4()  # a new random ID for each quote. In a production app you'll want to have better control...
        # append a *tuple* to the list, and in the tuple the values are ordered to match "?" in the prepared statement:
        tuples_to_insert.append((quote_id, author, quote, emb_result.embedding, tags))
    # insert the batch at once through the driver's concurrent primitive
    conc_results = execute_concurrent_with_args(
        session,
        prepared_insertion,
        tuples_to_insert,
    )
    # check that all insertions succeed (better to always do this):
    if any([not success for success, _ in conc_results]):
        print("Something failed during the insertions!")
    else:
        print(f"{len(b_emb_results.data)}] ", end="")

print("\nFinished storing entries.")
Starting to store entries:
[...50] [...50] [...50] [...50] [...50] [...50] [...50] [...50] [...50] 
Finished storing entries.

Despite the different table schema, the DB query behind the similarity search is essentially the same:

def find_quote_and_author_p(query_quote, n, author=None, tags=None):
    query_vector = client.embeddings.create(
        input=[query_quote],
        model=embedding_model_name,
    ).data[0].embedding
    # Depending on what conditions are passed, the WHERE clause in the statement may vary.
    # Construct it accordingly:
    where_clauses = []
    where_values = []
    if author:
        where_clauses += ["author = %s"]
        where_values += [author]
    if tags:
        for tag in tags:
            where_clauses += ["tags CONTAINS %s"]
            where_values += [tag]
    if where_clauses:
        search_statement = f"""SELECT body, author FROM {keyspace}.philosophers_cql_partitioned
            WHERE {' AND '.join(where_clauses)}
            ORDER BY embedding_vector ANN OF %s
            LIMIT %s;
        """
    else:
        search_statement = f"""SELECT body, author FROM {keyspace}.philosophers_cql_partitioned
            ORDER BY embedding_vector ANN OF %s
            LIMIT %s;
        """
    query_values = tuple(where_values + [query_vector] + [n])
    result_rows = session.execute(search_statement, query_values)
    return [
        (result_row.body, result_row.author)
        for result_row in result_rows
    ]

That's it: the new table still fully supports the "generic" similarity searches ...

find_quote_and_author_p("We struggle all our life for nothing", 3)
[('Life to the great majority is only a constant struggle for mere existence, with the certainty of losing it at last.',
  'schopenhauer'),
 ('We give up leisure in order that we may have leisure, just as we go to war in order that we may have peace.',
  'aristotle'),
 ('Perhaps the gods are kind to us, by making life more disagreeable as we grow older. In the end death seems less intolerable than the manifold burdens we carry',
  'freud')]

... but when an author is specified, you would notice a great performance advantage:

find_quote_and_author_p("We struggle all our life for nothing", 2, author="nietzsche")
[('To live is to suffer, to survive is to find some meaning in the suffering.',
  'nietzsche'),
 ('What makes us heroic?--Confronting simultaneously our supreme suffering and our supreme hope.',
  'nietzsche')]

Well, you would notice a performance gain if you had a realistic-size dataset. In this demo, with just a few dozen entries, there's no noticeable difference -- but you get the idea.
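If you want to measure it yourself on a larger dataset, a quick-and-dirty timing comparison could look like the sketch below. This is purely illustrative: on this toy dataset the numbers will be dominated by network latency and by the embedding call, which both functions perform.

import time

def timed(fn, *args, **kwargs):
    # Return the wall-clock time of a single call, in seconds.
    t0 = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - t0

t_unpartitioned = timed(find_quote_and_author,   "We struggle all our life for nothing", 2, author="nietzsche")
t_partitioned   = timed(find_quote_and_author_p, "We struggle all our life for nothing", 2, author="nietzsche")
print(f"Unpartitioned table: {t_unpartitioned:.3f} s")
print(f"Partitioned table:   {t_partitioned:.3f} s")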

Conclusion

Congratulations! You have learned how to use OpenAI for vector embeddings and Astra DB / Cassandra for storage in order to build a sophisticated philosophical search engine and quote generator.

This example used the Cassandra drivers and runs CQL (Cassandra Query Language) statements directly to interface with the Vector Store - but this is not the only choice. Check out the README for other options and for integrations with popular frameworks.

To find out more on how Astra DB's Vector Search capabilities can be a key ingredient in your ML/GenAI applications, visit Astra DB's web page on the topic.

Cleanup

If you want to remove all resources used for this demo, run this cell (warning: this will delete the tables and the data inserted in them!):

session.execute(f"DROP TABLE IF EXISTS {keyspace}.philosophers_cql;")
session.execute(f"DROP TABLE IF EXISTS {keyspace}.philosophers_cql_partitioned;")
<cassandra.cluster.ResultSet at 0x7fef149096a0>