使用 Hologres 作为 OpenAI 嵌入的向量数据库

2023 年 5 月 19 日

本笔记本逐步指导您如何使用 Hologres 作为 OpenAI 嵌入的向量数据库。

本笔记本展示了一个端到端的流程，包括：

使用由 OpenAI API 创建的预计算嵌入。
将嵌入存储在 Hologres 的云实例中。
使用 OpenAI API 将原始文本查询转换为嵌入。
使用 Hologres 在创建的集合中执行最近邻搜索。
在提示工程中，为大型语言模型提供搜索结果作为上下文

什么是 Hologres

Hologres 是由阿里云开发的统一实时数据仓库服务。您可以使用 Hologres 实时写入、更新、处理和分析大量数据。Hologres 支持标准 SQL 语法，兼容 PostgreSQL，并支持大多数 PostgreSQL 函数。Hologres 支持对高达 PB 级的数据进行在线分析处理 (OLAP) 和即席分析，并提供高并发和低延迟的在线数据服务。Hologres 支持多工作负载的细粒度隔离和企业级安全能力。Hologres 与 MaxCompute、Realtime Compute for Apache Flink 和 DataWorks 深度集成，为企业提供全栈在线和离线数据仓库解决方案。

Hologres 通过采用 Proxima 提供向量数据库功能。

Proxima 是由阿里巴巴达摩院开发的高性能软件库。它允许您搜索向量的最近邻。与 Facebook AI Similarity Search (Faiss) 等类似的开源软件相比，Proxima 提供了更高的稳定性和性能。Proxima 提供了在行业中具有领先性能和效果的基础模块，并允许您搜索相似的图像、视频或人脸。Hologres 与 Proxima 深度集成，以提供高性能的向量搜索服务。

部署选项

点击这里快速部署 Hologres 数据仓库。

前提条件

为了进行本练习，我们需要准备以下几项内容：

Hologres 云服务器实例。
用于与向量数据库交互的 'psycopg2-binary' 库。任何其他 postgresql 客户端库也可以。
一个 OpenAI API 密钥。

我们可以通过运行一个简单的 curl 命令来验证服务器是否成功启动

安装要求

本笔记本显然需要 openai 和 psycopg2-binary 包，但我们还将使用其他一些额外的库。以下命令安装所有这些库：

! pip install openai psycopg2-binary pandas wget

准备您的 OpenAI API 密钥

OpenAI API 密钥用于文档和查询的向量化。

如果您没有 OpenAI API 密钥，您可以从 https://beta.openai.com/account/api-keys 获取一个。

获取密钥后，请将其添加到您的环境变量中，变量名为 OPENAI_API_KEY。

# Test that your OpenAI API key is correctly set as an environment variable
# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.
import os

# Note. alternatively you can set a temporary env variable like this:
# os.environ["OPENAI_API_KEY"] = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

if os.getenv("OPENAI_API_KEY") is not None:
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")

OPENAI_API_KEY is ready

连接到 Hologres

首先将其添加到您的环境变量中。或者您可以直接更改下面的 "psycopg2.connect" 参数

使用官方 Python 库可以轻松连接到正在运行的 Hologres 服务器实例

import os
import psycopg2

# Note. alternatively you can set a temporary env variable like this:
# os.environ["PGHOST"] = "your_host"
# os.environ["PGPORT"] "5432"),
# os.environ["PGDATABASE"] "postgres"),
# os.environ["PGUSER"] "user"),
# os.environ["PGPASSWORD"] "password"),

connection = psycopg2.connect(
    host=os.environ.get("PGHOST", "localhost"),
    port=os.environ.get("PGPORT", "5432"),
    database=os.environ.get("PGDATABASE", "postgres"),
    user=os.environ.get("PGUSER", "user"),
    password=os.environ.get("PGPASSWORD", "password")
)
connection.set_session(autocommit=True)

# Create a new cursor object
cursor = connection.cursor()

我们可以通过运行任何可用的方法来测试连接


# Execute a simple query to test the connection
cursor.execute("SELECT 1;")
result = cursor.fetchone()

# Check the query result
if result == (1,):
    print("Connection successful!")
else:
    print("Connection failed.")

Connection successful!

import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)

下载的文件必须解压

import zipfile
import os
import re
import tempfile

current_directory = os.getcwd()
zip_file_path = os.path.join(current_directory, "vector_database_wikipedia_articles_embedded.zip")
output_directory = os.path.join(current_directory, "../../data")

with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall(output_directory)


# check the csv file exist
file_name = "vector_database_wikipedia_articles_embedded.csv"
data_directory = os.path.join(current_directory, "../../data")
file_path = os.path.join(data_directory, file_name)


if os.path.exists(file_path):
    print(f"The file {file_name} exists in the data directory.")
else:
    print(f"The file {file_name} does not exist in the data directory.")

The file vector_database_wikipedia_articles_embedded.csv exists in the data directory.

加载数据

在本节中，我们将加载会话之前准备好的数据，这样您就不必使用自己的额度重新计算维基百科文章的嵌入。

!unzip -n vector_database_wikipedia_articles_embedded.zip
!ls -lh vector_database_wikipedia_articles_embedded.csv

Archive:  vector_database_wikipedia_articles_embedded.zip
-rw-r--r--@ 1 geng  staff   1.7G Jan 31 01:19 vector_database_wikipedia_articles_embedded.csv

查看数据。

import pandas, json
data = pandas.read_csv('../../data/vector_database_wikipedia_articles_embedded.csv')
data

	id	url	title	text	title_vector	content_vector	vector_id
0	1	https://simple.wikipedia.org/wiki/April	April	April 是 J... 年的第四个月	[0.001009464613161981, -0.020700545981526375, ...	[-0.011253940872848034, -0.013491976074874401,...	0
1	2	https://simple.wikipedia.org/wiki/August	August	August (Aug.) 是一年中的第八个月 ...	[0.0009286514250561595, 0.000820168002974242, ...	[0.0003609954728744924, 0.007262262050062418, ...	1
2	6	https://simple.wikipedia.org/wiki/Art	Art	Art 是一种表达想象力的创造性活动 ...	[0.003393713850528002, 0.0061537534929811954, ...	[-0.004959689453244209, 0.015772193670272827, ...	2
3	8	https://simple.wikipedia.org/wiki/A	A	A 或 a 是英文字母表的第一个字母 ...	[0.0153952119871974, -0.013759135268628597, 0....	[0.024894846603274345, -0.022186409682035446, ...	3
4	9	https://simple.wikipedia.org/wiki/Air	Air	Air 指的是地球的大气层。空气是一种 ...	[0.02224554680287838, -0.02044147066771984, -0...	[0.021524671465158463, 0.018522677943110466, -...	4
...	...	...	...	...	...	...	...
24995	98295	https://simple.wikipedia.org/wiki/Geneva	Geneva	Geneva (, , , , ) 是第二大城市 ...	[-0.015773078426718712, 0.01737344264984131, 0...	[0.008000412955880165, 0.02008531428873539, 0....	24995
24996	98316	https://simple.wikipedia.org/wiki/Concubinage	Concubinage	Concubinage 是女性处于关系中的状态 ...	[-0.00519518880173564, 0.005898841191083193, 0...	[-0.01736736111342907, -0.002740012714639306, ...	24996
24997	98318	https://simple.wikipedia.org/wiki/Mistress%20%...	Mistress (情人)	Mistress 是男性的长期女性性伴侣 ...	[-0.023164259269833565, -0.02052430994808674, ...	[-0.017878392711281776, -0.0004517830966506153...	24997
24998	98326	https://simple.wikipedia.org/wiki/Eastern%20Front	Eastern Front	Eastern Front 可以指以下内容：\n\n...	[-0.00681863259524107, 0.002171179046854377, 8...	[-0.0019235472427681088, -0.004023272544145584...	24998
24999	98327	https://simple.wikipedia.org/wiki/Italian%20Ca...	Italian Campaign	Italian Campaign 可以指以下内容：\n\n...	[-0.014151256531476974, -0.008553029969334602,...	[-0.011758845299482346, -0.01346028596162796, ...	24999

25000 行 × 7 列

title_vector_length = len(json.loads(data['title_vector'].iloc[0]))
content_vector_length = len(json.loads(data['content_vector'].iloc[0]))

print(title_vector_length, content_vector_length)

1536 1536

创建表和 Proxima 向量索引

Hologres 将数据存储在表中，其中每个对象至少由一个向量描述。我们的表将被称为 articles，每个对象将由 title 和 content 向量描述。

我们将从创建表开始，并在 title 和 content 上创建 Proxima 索引，然后我们将用预计算的嵌入填充它。

cursor.execute('CREATE EXTENSION IF NOT EXISTS proxima;')
create_proxima_table_sql = '''
BEGIN;
DROP TABLE IF EXISTS articles;
CREATE TABLE articles (
    id INT PRIMARY KEY NOT NULL,
    url TEXT,
    title TEXT,
    content TEXT,
    title_vector float4[] check(
        array_ndims(title_vector) = 1 and 
        array_length(title_vector, 1) = 1536
    ), -- define the vectors
    content_vector float4[] check(
        array_ndims(content_vector) = 1 and 
        array_length(content_vector, 1) = 1536
    ),
    vector_id INT
);

-- Create indexes for the vector fields.
call set_table_property(
    'articles',
    'proxima_vectors', 
    '{
        "title_vector":{"algorithm":"Graph","distance_method":"Euclidean","builder_params":{"min_flush_proxima_row_count" : 10}},
        "content_vector":{"algorithm":"Graph","distance_method":"Euclidean","builder_params":{"min_flush_proxima_row_count" : 10}}
    }'
);  

COMMIT;
'''

# Execute the SQL statements (will autocommit)
cursor.execute(create_proxima_table_sql)

上传数据

现在，让我们使用 COPY 语句将数据上传到 Hologres 云实例。根据网络带宽，这可能需要 5-10 分钟。

import io

# Path to the unzipped CSV file
csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'

# In SQL, arrays are surrounded by {}, rather than []
def process_file(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            # Replace '[' with '{' and ']' with '}'
            modified_line = line.replace('[', '{').replace(']', '}')
            yield modified_line

# Create a StringIO object to store the modified lines
modified_lines = io.StringIO(''.join(list(process_file(csv_file_path))))

# Create the COPY command for the copy_expert method
copy_command = '''
COPY public.articles (id, url, title, content, title_vector, content_vector, vector_id)
FROM STDIN WITH (FORMAT CSV, HEADER true, DELIMITER ',');
'''

# Execute the COPY command using the copy_expert method
cursor.copy_expert(copy_command, modified_lines)

Proxima 索引将在后台构建。我们可以在此期间进行搜索，但如果没有向量索引，查询会很慢。使用此命令等待索引构建完成。

cursor.execute('vacuum articles;')

# Check the collection size to make sure all the points have been stored
count_sql = "select count(*) from articles;"
cursor.execute(count_sql)
result = cursor.fetchone()
print(f"Count:{result[0]}")

Count:25000

搜索数据

数据上传完成后，我们将开始查询集合以查找最接近的向量。我们可以提供额外的参数 vector_name 以从标题搜索切换到内容搜索。由于预计算的嵌入是使用 text-embedding-3-small OpenAI 模型创建的，因此我们在搜索期间也必须使用它。

import openai
def query_knn(query, table_name, vector_name="title_vector", top_k=20):

    # Creates embedding vector from user query
    embedded_query = openai.Embedding.create(
        input=query,
        model="text-embedding-3-small",
    )["data"][0]["embedding"]

    # Convert the embedded_query to PostgreSQL compatible format
    embedded_query_pg = "{" + ",".join(map(str, embedded_query)) + "}"

    # Create SQL query
    query_sql = f"""
    SELECT id, url, title, pm_approx_euclidean_distance({vector_name},'{embedded_query_pg}'::float4[]) AS distance
    FROM {table_name}
    ORDER BY distance
    LIMIT {top_k};
    """
    # Execute the query
    cursor.execute(query_sql)
    results = cursor.fetchall()

    return results

query_results = query_knn("modern art in Europe", "Articles")
for i, result in enumerate(query_results):
    print(f"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})")

1. Museum of Modern Art (Score: 0.501)
2. Western Europe (Score: 0.485)
3. Renaissance art (Score: 0.479)
4. Pop art (Score: 0.472)
5. Northern Europe (Score: 0.461)
6. Hellenistic art (Score: 0.458)
7. Modernist literature (Score: 0.447)
8. Art film (Score: 0.44)
9. Central Europe (Score: 0.439)
10. Art (Score: 0.437)
11. European (Score: 0.437)
12. Byzantine art (Score: 0.436)
13. Postmodernism (Score: 0.435)
14. Eastern Europe (Score: 0.433)
15. Cubism (Score: 0.433)
16. Europe (Score: 0.432)
17. Impressionism (Score: 0.432)
18. Bauhaus (Score: 0.431)
19. Surrealism (Score: 0.429)
20. Expressionism (Score: 0.429)

# This time we'll query using content vector
query_results = query_knn("Famous battles in Scottish history", "Articles", "content_vector")
for i, result in enumerate(query_results):
    print(f"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})")

1. Battle of Bannockburn (Score: 0.489)
2. Wars of Scottish Independence (Score: 0.474)
3. 1651 (Score: 0.457)
4. First War of Scottish Independence (Score: 0.452)
5. Robert I of Scotland (Score: 0.445)
6. 841 (Score: 0.441)
7. 1716 (Score: 0.441)
8. 1314 (Score: 0.429)
9. 1263 (Score: 0.428)
10. William Wallace (Score: 0.426)
11. Stirling (Score: 0.419)
12. 1306 (Score: 0.419)
13. 1746 (Score: 0.418)
14. 1040s (Score: 0.414)
15. 1106 (Score: 0.412)
16. 1304 (Score: 0.411)
17. David II of Scotland (Score: 0.408)
18. Braveheart (Score: 0.407)
19. 1124 (Score: 0.406)
20. July 27 (Score: 0.405)