Using MyScale as a vector database for OpenAI embeddings | OpenAI Cookbook

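This notebook is a step-by-step guide to using MyScale as a vector database for OpenAI embeddings. The process includes:

Using embeddings precomputed with the OpenAI API.
Storing those embeddings in a cloud instance of MyScale.
Converting a raw text query into an embedding with the OpenAI API.
Using MyScale to run a nearest-neighbor search over the stored collection.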

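What is MyScale

MyScale is a database built on ClickHouse that combines vector search with SQL analytics to offer a high-performance, streamlined, fully managed experience. It is designed for joint queries and analyses over structured and vector data, with comprehensive SQL support for all data processing.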

Deployment options

Using the MyScale console, you can deploy a cluster and run vector search with SQL on it within two minutes.

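Prerequisites

To follow this guide, you will need:

A MyScale cluster deployed by following the quickstart guide.
The 'clickhouse-connect' library for interacting with MyScale.
An OpenAI API key for vectorizing queries.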

Install requirements

This notebook requires openai, clickhouse-connect, and a few other dependencies. Install them before continuing.
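The install command itself is not reproduced in this copy of the guide. As a minimal sketch, the snippet below checks which packages are missing and prints a matching `pip install` command. The package list is an assumption drawn from the imports used later in this guide (`wget`, `pandas`, `tqdm`), and the helper names `missing_packages` and `pip_command` are hypothetical.

```python
import importlib.util

# Mapping of pip package name -> importable module name.
# Assumption: this list is inferred from the imports used in this guide.
REQUIRED = {
    "openai": "openai",
    "clickhouse-connect": "clickhouse_connect",
    "wget": "wget",
    "pandas": "pandas",
    "tqdm": "tqdm",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


def pip_command(packages):
    """Build a pip command for the given packages, or '' if the list is empty."""
    return "pip install " + " ".join(packages) if packages else ""


print(pip_command(missing_packages()) or "all dependencies are already installed")
```

Run the printed command in a shell, or prefix it with `!` in a notebook cell.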

Prepare your OpenAI API key

To use the OpenAI API, you need to set up an API key. If you do not have one yet, you can get one from OpenAI.
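One way to supply the key without hard-coding it is to read it from an environment variable. The sketch below assumes the key is stored in `OPENAI_API_KEY`; the helper name `load_openai_api_key` is hypothetical and not part of the `openai` package.

```python
import os


def load_openai_api_key(env=os.environ, var_name="OPENAI_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is unset or empty."""
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage with the module-level (pre-1.0) `openai` interface used later in this guide:
# import openai
# openai.api_key = load_openai_api_key()
```

Failing early with a clear message is usually easier to debug than an authentication error from the first API call.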

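Connect to MyScale

Follow the Connection Details section to retrieve the cluster host, username, and password from the MyScale console, then use them to create a connection to your cluster.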
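A minimal sketch of the connection step using `clickhouse_connect.get_client`. The host, username, and password strings are placeholders, port 8443 is an assumption (the HTTPS port commonly used for MyScale clusters), and the two helper functions are hypothetical wrappers rather than part of any library.

```python
def myscale_connection_kwargs(host, username, password, port=8443):
    """Collect the connection details copied from the MyScale console."""
    if not host or not username or not password:
        raise ValueError("host, username and password are all required")
    return {"host": host, "port": port, "username": username, "password": password}


def connect_to_myscale(**kwargs):
    """Open a client; requires `clickhouse-connect` and network access."""
    # Imported lazily so the helper above can be used without the package installed.
    import clickhouse_connect

    return clickhouse_connect.get_client(**kwargs)


# Placeholder values, replace them with your own cluster details:
# client = connect_to_myscale(**myscale_connection_kwargs(
#     "YOUR_CLUSTER_HOST", "YOUR_USERNAME", "YOUR_CLUSTER_PASSWORD"))
```

The resulting `client` object is the one the later cells call `command`, `insert`, and `query` on.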

We need to load the dataset of precomputed vector embeddings for Wikipedia articles provided by OpenAI. Use the wget package to download the dataset.

import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)

Once the download completes, extract the file with the zipfile package.
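The extraction step can be sketched with the standard-library `zipfile` module. The target directory `../data` is taken from the CSV path used when the data is loaded in the next step; the helper name `extract_archive` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of `zip_path` into `dest_dir` and return their names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(dest)
        return archive.namelist()


# Usage for the archive downloaded above:
# extract_archive("vector_database_wikipedia_articles_embedded.zip", "../data")
```

Returning the member names makes it easy to confirm that the expected CSV file is present before loading it.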

Now we can load the data from vector_database_wikipedia_articles_embedded.csv into a Pandas DataFrame.

import pandas as pd
from ast import literal_eval

# read data from csv
article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')
article_df = article_df[['id', 'url', 'title', 'text', 'content_vector']]

# read vectors from strings back into a list
article_df["content_vector"] = article_df.content_vector.apply(literal_eval)
article_df.head()

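Index data

We will create an SQL table named articles in MyScale to store the embedding data. The table has a vector index that uses the cosine distance metric, plus a constraint on the embedding length. Use the following code to create the articles table and insert the data into it.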

# create articles table with vector index
embedding_len = len(article_df['content_vector'][0])  # 1536

client.command(f"""
CREATE TABLE IF NOT EXISTS default.articles
(
    id UInt64,
    url String,
    title String,
    text String,
    content_vector Array(Float32),
    CONSTRAINT cons_vector_len CHECK length(content_vector) = {embedding_len},
    VECTOR INDEX article_content_index content_vector TYPE HNSWFLAT('metric_type=Cosine')
)
ENGINE = MergeTree ORDER BY id
""")

# insert data into the table in batches
from tqdm.auto import tqdm

batch_size = 100
total_records = len(article_df)

# upload data in batches
data = article_df.to_records(index=False).tolist()
column_names = article_df.columns.tolist()

for i in tqdm(range(0, total_records, batch_size)):
    i_end = min(i + batch_size, total_records)
    client.insert("default.articles", data[i:i_end], column_names=column_names)

Because the vector index is built automatically in the background, we need to check its build status before running a search.

# check count of inserted data
print(f"articles count: {client.command('SELECT count(*) FROM default.articles')}")

# check the status of the vector index, make sure vector index is ready with 'Built' status
get_index_status = "SELECT status FROM system.vector_indices WHERE name='article_content_index'"
print(f"index build status: {client.command(get_index_status)}")
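The status check above runs once. If the index is still being built, a small polling loop can wait until it reports 'Built'. This is a minimal sketch: the helper `wait_for_index`, its timeout, and its polling interval are illustrative assumptions, not part of MyScale or clickhouse-connect.

```python
import time


def wait_for_index(get_status, ready="Built", timeout=600.0, poll_interval=5.0,
                   sleep=time.sleep):
    """Poll `get_status()` until it returns `ready`.

    Returns True once the status matches, or False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == ready:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval)


# Usage with the client and query string defined above:
# wait_for_index(lambda: client.command(get_index_status))
```

Passing the status lookup as a callable keeps the loop independent of the database client, so it can be exercised with a stub.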

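Search data

Once the data is indexed in MyScale, we can run a vector search to find similar content. First, we use the OpenAI API to generate an embedding for the query. Then we run the vector search in MyScale.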

import openai

query = "Famous battles in Scottish history"

# creates embedding vector from user query
# (openai.Embedding.create is the pre-1.0 interface of the openai package)
embed = openai.Embedding.create(
    input=query,
    model="text-embedding-3-small",
)["data"][0]["embedding"]

# query the database to find the top K similar content to the given query
top_k = 10
results = client.query(f"""
SELECT id, url, title, distance(content_vector, {embed}) as dist
FROM default.articles
ORDER BY dist
LIMIT {top_k}
""")

# display results
for i, r in enumerate(results.named_results()):
    print(i+1, r['title'])
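The `openai.Embedding.create` call above belongs to the pre-1.0 interface of the `openai` package, which was removed in version 1.0. With a newer SDK, the same step might look like the sketch below. The helper `embed_query` is hypothetical; it takes the client as an argument so it can be tried with a stub before any API call is made.

```python
def embed_query(client, text, model="text-embedding-3-small"):
    """Return the embedding of `text` as a list of floats.

    `client` is expected to expose the openai>=1.0 interface, that is
    `client.embeddings.create(...)` returning an object with `.data[0].embedding`.
    """
    response = client.embeddings.create(input=text, model=model)
    return list(response.data[0].embedding)


# With the real SDK (needs openai>=1.0 and OPENAI_API_KEY set in the environment):
# from openai import OpenAI
# embed = embed_query(OpenAI(), "Famous battles in Scottish history")
```

The returned list can be interpolated into the SQL query exactly as `embed` is above.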