使用 Neon Postgres 的向量相似性搜索

2023年9月28日

在 Github 中打开

本笔记本指导您如何使用 Neon Serverless Postgres 作为 OpenAI 嵌入的向量数据库。它演示了如何：

使用 OpenAI API 创建的嵌入。
将嵌入存储在 Neon Serverless Postgres 数据库中。
使用 OpenAI API 将原始文本查询转换为嵌入。
使用带有 pgvector 扩展的 Neon 执行向量相似性搜索。

先决条件

在开始之前，请确保您拥有以下内容：

Neon Postgres 数据库。您可以创建一个帐户，并通过几个简单的步骤设置一个包含即用型 neondb 数据库的项目。有关说明，请参阅注册和创建您的第一个项目。
您的 Neon 数据库的连接字符串。您可以从 Neon 仪表板上的 连接详细信息 小部件复制它。请参阅从任何应用程序连接。
pgvector 扩展。通过运行 CREATE EXTENSION vector; 在 Neon 中安装扩展。有关说明，请参阅启用 pgvector 扩展。
您的 OpenAI API 密钥。
Python 和 pip。

安装所需的模块

本笔记本需要 openai、psycopg2、pandas、wget 和 python-dotenv 包。您可以使用 pip 安装它们

! pip install openai psycopg2 pandas wget python-dotenv

准备您的 OpenAI API 密钥

生成文档和查询的向量需要 OpenAI API 密钥。

如果您没有 OpenAI API 密钥，请从 https://platform.openai.com/account/api-keys 获取一个。

将 OpenAI API 密钥添加为操作系统环境变量，或在提示时为会话提供它。如果您定义环境变量，请将变量命名为 OPENAI_API_KEY。

有关将 OpenAI API 密钥配置为环境变量的信息，请参阅 API 密钥安全最佳实践。

测试您的 OpenAPI 密钥

# Test to ensure that your OpenAI API key is defined as an environment variable or provide it when prompted
# If you run this notebook locally, you may have to reload the terminal and the notebook to make the environment available

import os
from getpass import getpass

# Check if OPENAI_API_KEY is set as an environment variable
if os.getenv("OPENAI_API_KEY") is not None:
    print("Your OPENAI_API_KEY is ready")
else:
    # If not, prompt for it
    api_key = getpass("Enter your OPENAI_API_KEY: ")
    if api_key:
        print("Your OPENAI_API_KEY is now available for this session")
        # Optionally, you can set it as an environment variable for the current session
        os.environ["OPENAI_API_KEY"] = api_key
    else:
        print("You did not enter your OPENAI_API_KEY")

Your OPENAI_API_KEY is ready

连接到您的 Neon 数据库

在下方提供您的 Neon 数据库连接字符串，或使用 DATABASE_URL 变量在 .env 文件中定义它。有关获取 Neon 连接字符串的信息，请参阅从任何应用程序连接。

import os
import psycopg2
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# The connection string can be provided directly here.
# Replace the next line with Your Neon connection string.
connection_string = "postgres://<user>:<password>@<hostname>/<dbname>"

# If connection_string is not directly provided above, 
# then check if DATABASE_URL is set in the environment or .env.
if not connection_string:
    connection_string = os.environ.get("DATABASE_URL")

    # If neither method provides a connection string, raise an error.
    if not connection_string:
        raise ValueError("Please provide a valid connection string either in the code or in the .env file as DATABASE_URL.")

# Connect using the connection string
connection = psycopg2.connect(connection_string)

# Create a new cursor object
cursor = connection.cursor()

测试与数据库的连接

# Execute this query to test the database connection
cursor.execute("SELECT 1;")
result = cursor.fetchone()

# Check the query result
if result == (1,):
    print("Your database connection was successful!")
else:
    print("Your connection failed.")

Your database connection was successful!

本指南使用 OpenAI Cookbook examples 目录中提供的预计算的 Wikipedia 文章嵌入，因此您不必使用自己的 OpenAI 额度来计算嵌入。

导入预计算的嵌入 zip 文件

import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB. Importing it will take several minutes.
wget.download(embeddings_url)

'vector_database_wikipedia_articles_embedded.zip'

解压下载的 zip 文件

import zipfile
import os
import re
import tempfile

current_directory = os.getcwd()
zip_file_path = os.path.join(current_directory, "vector_database_wikipedia_articles_embedded.zip")
output_directory = os.path.join(current_directory, "../../data")

with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall(output_directory)


# Check to see if the csv file was extracted
file_name = "vector_database_wikipedia_articles_embedded.csv"
data_directory = os.path.join(current_directory, "../../data")
file_path = os.path.join(data_directory, file_name)


if os.path.exists(file_path):
    print(f"The csv file {file_name} exists in the data directory.")
else:
    print(f"The csv file {file_name} does not exist in the data directory.")

The file vector_database_wikipedia_articles_embedded.csv exists in the data directory.

创建表并为您的向量嵌入添加索引

在您的数据库中创建的向量表名为 articles。每个对象都有 title 和 content 向量。

索引在 title 和 content 向量列上定义。

create_table_sql = '''
CREATE TABLE IF NOT EXISTS public.articles (
    id INTEGER NOT NULL,
    url TEXT,
    title TEXT,
    content TEXT,
    title_vector vector(1536),
    content_vector vector(1536),
    vector_id INTEGER
);

ALTER TABLE public.articles ADD PRIMARY KEY (id);
'''

# SQL statement for creating indexes
create_indexes_sql = '''
CREATE INDEX ON public.articles USING ivfflat (content_vector) WITH (lists = 1000);

CREATE INDEX ON public.articles USING ivfflat (title_vector) WITH (lists = 1000);
'''

# Execute the SQL statements
cursor.execute(create_table_sql)
cursor.execute(create_indexes_sql)

# Commit the changes
connection.commit()

加载数据

将预计算的向量数据从 .csv 文件加载到您的 articles 表中。有 25000 条记录，因此预计此操作需要几分钟时间。

import io

# Path to your local CSV file
csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'

# Define a generator function to process the csv file
def process_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            yield line

# Create a StringIO object to store the modified lines
modified_lines = io.StringIO(''.join(list(process_file(csv_file_path))))

# Create the COPY command for copy_expert
copy_command = '''
COPY public.articles (id, url, title, content, title_vector, content_vector, vector_id)
FROM STDIN WITH (FORMAT CSV, HEADER true, DELIMITER ',');
'''

# Execute the COPY command using copy_expert
cursor.copy_expert(copy_command, modified_lines)

# Commit the changes
connection.commit()

检查记录数以确保数据已加载。应该有 25000 条记录。

# Check the size of the data
count_sql = """select count(*) from public.articles;"""
cursor.execute(count_sql)
result = cursor.fetchone()
print(f"Count:{result[0]}")

Count:25000

搜索您的数据

将数据存储在您的 Neon 数据库中后，您可以查询数据以查找最近邻。

首先定义 query_neon 函数，该函数在您运行向量相似性搜索时执行。该函数基于用户的查询创建嵌入，准备 SQL 查询，并使用嵌入运行 SQL 查询。您加载到数据库中的预计算嵌入是使用 text-embedding-3-small OpenAI 模型创建的，因此您必须使用相同的模型为相似性搜索创建嵌入。

提供了一个 vector_name 参数，允许您基于“title”或“content”进行搜索。

def query_neon(query, collection_name, vector_name="title_vector", top_k=20):

    # Create an embedding vector from the user query
    embedded_query = openai.Embedding.create(
        input=query,
        model="text-embedding-3-small",
    )["data"][0]["embedding"]

    # Convert the embedded_query to PostgreSQL compatible format
    embedded_query_pg = "[" + ",".join(map(str, embedded_query)) + "]"

    # Create the SQL query
    query_sql = f"""
    SELECT id, url, title, l2_distance({vector_name},'{embedded_query_pg}'::VECTOR(1536)) AS similarity
    FROM {collection_name}
    ORDER BY {vector_name} <-> '{embedded_query_pg}'::VECTOR(1536)
    LIMIT {top_k};
    """
    # Execute the query
    cursor.execute(query_sql)
    results = cursor.fetchall()

    return results

基于 title_vector 嵌入运行相似性搜索

# Query based on `title_vector` embeddings
import openai

query_results = query_neon("Greek mythology", "Articles")
for i, result in enumerate(query_results):
    print(f"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})")

1. Greek mythology (Score: 0.998)
2. Roman mythology (Score: 0.7)
3. Greek underworld (Score: 0.637)
4. Mythology (Score: 0.635)
5. Classical mythology (Score: 0.629)
6. Japanese mythology (Score: 0.615)
7. Norse mythology (Score: 0.569)
8. Greek language (Score: 0.566)
9. Zeus (Score: 0.534)
10. List of mythologies (Score: 0.531)
11. Jupiter (mythology) (Score: 0.53)
12. Greek (Score: 0.53)
13. Gaia (mythology) (Score: 0.526)
14. Titan (mythology) (Score: 0.522)
15. Mercury (mythology) (Score: 0.521)
16. Ancient Greece (Score: 0.52)
17. Greek alphabet (Score: 0.52)
18. Venus (mythology) (Score: 0.515)
19. Pluto (mythology) (Score: 0.515)
20. Athena (Score: 0.514)

基于 content_vector 嵌入运行相似性搜索

# Query based on `content_vector` embeddings
query_results = query_neon("Famous battles in Greek history", "Articles", "content_vector")
for i, result in enumerate(query_results):
    print(f"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})")

1. 222 BC (Score: 0.489)
2. Trojan War (Score: 0.458)
3. Peloponnesian War (Score: 0.456)
4. History of the Peloponnesian War (Score: 0.449)
5. 430 BC (Score: 0.441)
6. 168 BC (Score: 0.436)
7. Ancient Greece (Score: 0.429)
8. Classical Athens (Score: 0.428)
9. 499 BC (Score: 0.427)
10. Leonidas I (Score: 0.426)
11. Battle (Score: 0.421)
12. Greek War of Independence (Score: 0.421)
13. Menelaus (Score: 0.419)
14. Thebes, Greece (Score: 0.417)
15. Patroclus (Score: 0.417)
16. 427 BC (Score: 0.416)
17. 429 BC (Score: 0.413)
18. August 2 (Score: 0.412)
19. Ionia (Score: 0.411)
20. 323 (Score: 0.409)