Using Redis as a Vector Database with OpenAI

Feb 13, 2023

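This notebook provides an introduction to using Redis as a vector database with OpenAI embeddings. Redis is a scalable, real-time database that can be used as a vector database when the RediSearch module is enabled. RediSearch lets you index and search vectors in Redis. This notebook shows how to use RediSearch to index and search vectors that were created with the OpenAI API and stored in Redis.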

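What is Redis?

Most developers from a web services background are probably familiar with Redis. At its core, Redis is an open-source key-value store that can be used as a cache, message broker, and database. Developers choose Redis because it is fast, has a large ecosystem of client libraries, and has been deployed by major enterprises for years.

In addition to these traditional uses, Redis provides Redis Modules, which are a way to extend Redis with new data types and commands. Example modules include RedisJSON, RedisTimeSeries, RedisBloom, and RediSearch.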

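What is RediSearch?

RediSearch is a Redis module that provides querying, secondary indexing, full-text search, and vector search for Redis. To use RediSearch, you first declare indexes on your Redis data. You can then use the RediSearch clients to query that data. For more information on the RediSearch feature set, see the README or the RediSearch documentation.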

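Deployment options

There are a number of ways to deploy Redis. For local development, the quickest method is to use the Redis Stack Docker container, which we will use here. Redis Stack contains a number of Redis modules that can be used together to create a fast, multi-model data store and query engine.

For production use cases, the easiest way to get started is to use the Redis Cloud service, a fully managed Redis service. You can also deploy Redis on your own infrastructure using Redis Enterprise, which can be deployed on Kubernetes, on-premises, or in the cloud.

Additionally, every major cloud provider (AWS Marketplace, Google Marketplace, and Azure Marketplace) offers Redis Enterprise as a marketplace offering.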

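Prerequisites

Before we start this project, we need to set up the following: a running Redis database with RediSearch, the required Python libraries, and an OpenAI API key.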

===========================================================

Start Redis

To keep this example simple, we will use the Redis Stack Docker container, which we can start as follows:

$ docker-compose up -d

This also includes the RedisInsight GUI for managing your Redis database, which you can view at http://127.0.0.1:8001 once the Docker container is running.

You're all set up and ready to go! Next, we import and create our client for communicating with the Redis database we just created.

Install requirements

Redis-Py is the Python client for communicating with Redis. We will use it to communicate with our Redis Stack database.

! pip install redis wget pandas "openai<1"

===========================================================

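Prepare your OpenAI API key

The OpenAI API key is used to vectorize the queries.

If you don't have an OpenAI API key, you can get one from https://beta.openai.com/account/api-keys.

Once you have your key, add it to your environment variables as OPENAI_API_KEY using the following command: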

! export OPENAI_API_KEY="your API key"
# Test that your OpenAI API key is correctly set as an environment variable
# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.
import os
import openai

# Note. alternatively you can set a temporary env variable like this:
# os.environ["OPENAI_API_KEY"] = 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

if os.getenv("OPENAI_API_KEY") is not None:
    openai.api_key = os.getenv("OPENAI_API_KEY")
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")
OPENAI_API_KEY is ready

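Load data

In this section we'll load embedded data that has already been converted into vectors. We'll use this data to create an index in Redis and then search for similar vectors.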

import sys
import numpy as np
import pandas as pd
from typing import List

# use helper function in nbutils.py to download and read the data
# this should take from 5-10 min to run
if os.getcwd() not in sys.path:
    sys.path.append(os.getcwd())
import nbutils

nbutils.download_wikipedia_data()
data = nbutils.read_wikipedia_data()

data.head()
File Downloaded
  | id | url | title | text | title_vector | content_vector | vector_id
0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... | [0.001009464613161981, -0.020700545981526375, ... | [-0.011253940872848034, -0.013491976074874401,... | 0
1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... | [0.0009286514250561595, 0.000820168002974242, ... | [0.0003609954728744924, 0.007262262050062418, ... | 1
2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imagination... | [0.003393713850528002, 0.0061537534929811954, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2
3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alphabet... | [0.0153952119871974, -0.013759135268628597, 0.... | [0.024894846603274345, -0.022186409682035446, ... | 3
4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is ... | [0.02224554680287838, -0.02044147066771984, -0... | [0.021524671465158463, 0.018522677943110466, -... | 4

Connect to Redis

Now that our Redis database is running, we can connect to it using the Redis-py client. We will use the default host and port for the Redis database, which is localhost:6379.

import redis
from redis.commands.search.indexDefinition import (
    IndexDefinition,
    IndexType
)
from redis.commands.search.query import Query
from redis.commands.search.field import (
    TextField,
    VectorField
)

REDIS_HOST =  "localhost"
REDIS_PORT = 6379
REDIS_PASSWORD = "" # default for passwordless Redis

# Connect to Redis
redis_client = redis.Redis(
    host=REDIS_HOST,
    port=REDIS_PORT,
    password=REDIS_PASSWORD
)
redis_client.ping()
True

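Creating a search index in Redis

The cells below show how to specify and create a search index in Redis. We will:

  1. Set some constants for defining our index, such as the distance metric and the index name
  2. Define the index schema with RediSearch fields
  3. Create the index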
# Constants
VECTOR_DIM = len(data['title_vector'][0]) # length of the vectors
VECTOR_NUMBER = len(data)                 # initial number of vectors
INDEX_NAME = "embeddings-index"           # name of the search index
PREFIX = "doc"                            # prefix for the document keys
DISTANCE_METRIC = "COSINE"                # distance metric for the vectors (ex. COSINE, IP, L2)
# Define RediSearch fields for each of the columns in the dataset
title = TextField(name="title")
url = TextField(name="url")
text = TextField(name="text")
title_embedding = VectorField("title_vector",
    "FLAT", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": VECTOR_NUMBER,
    }
)
text_embedding = VectorField("content_vector",
    "FLAT", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": VECTOR_NUMBER,
    }
)
fields = [title, url, text, title_embedding, text_embedding]
# Check if index exists
try:
    redis_client.ft(INDEX_NAME).info()
    print("Index already exists")
except redis.exceptions.ResponseError:
    # info() raises ResponseError when the index is unknown, so create it
    redis_client.ft(INDEX_NAME).create_index(
        fields = fields,
        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
    )
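
As a rough sizing aid, the raw storage a vector field needs follows from the index parameters above: each FLOAT32 component takes 4 bytes, so one field holds about (number of vectors) x DIM x 4 bytes before any index overhead. The sketch below is only an estimate under stated assumptions: it assumes 1,536-dimensional embeddings (in the notebook VECTOR_DIM is read from the data) and 25,000 documents, and raw_vector_bytes is a hypothetical helper, not part of redis-py.

```python
BYTES_PER_FLOAT32 = 4

def raw_vector_bytes(num_vectors: int, dim: int, num_fields: int = 1) -> int:
    """Bytes needed for the raw FLOAT32 vector data alone (no index overhead)."""
    return num_vectors * dim * BYTES_PER_FLOAT32 * num_fields

# Assumed values for illustration: 25,000 documents, 1,536 dimensions,
# and two vector fields (title_vector and content_vector).
total = raw_vector_bytes(num_vectors=25_000, dim=1536, num_fields=2)
print(f"{total / 2**20:.1f} MiB of raw vector data")
```

The real memory footprint will be higher, since Redis also stores the text fields, the keys, and the index structures themselves.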

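Load documents into the index

Now that we have a search index, we can load documents into it. We will use the same documents we used in the previous examples. In Redis, either the HASH or the JSON data type (JSON requires RedisJSON in addition to RediSearch) can be used to store documents. We will use the HASH data type in this example. The cell below shows how to load documents into the index.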

def index_documents(client: redis.Redis, prefix: str, documents: pd.DataFrame):
    records = documents.to_dict("records")
    for doc in records:
        key = f"{prefix}:{str(doc['id'])}"

        # create byte vectors for title and content
        title_embedding = np.array(doc["title_vector"], dtype=np.float32).tobytes()
        content_embedding = np.array(doc["content_vector"], dtype=np.float32).tobytes()

        # replace list of floats with byte vectors
        doc["title_vector"] = title_embedding
        doc["content_vector"] = content_embedding

        client.hset(key, mapping = doc)
index_documents(redis_client, PREFIX, data)
print(f"Loaded {redis_client.info()['db0']['keys']} documents in Redis search index with name: {INDEX_NAME}")
Loaded 25000 documents in Redis search index with name: embeddings-index
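
The loader above stores each vector as raw bytes rather than as a list of floats. A FLOAT32 field expects 4 bytes per dimension, which is what converting to np.float32 and calling tobytes() produces. The sketch below illustrates that layout on a toy 3-dimensional vector. It uses the standard library array module in place of NumPy so that it runs on its own, and to_float32_bytes and from_float32_bytes are hypothetical names introduced for illustration.

```python
from array import array

def to_float32_bytes(vector):
    """Pack a list of floats as consecutive 32-bit floats in native byte order."""
    return array("f", vector).tobytes()

def from_float32_bytes(blob):
    """Inverse of to_float32_bytes: decode the blob back into Python floats."""
    decoded = array("f")
    decoded.frombytes(blob)
    return decoded.tolist()

toy_vector = [0.5, -0.25, 2.0]   # values that float32 represents exactly
blob = to_float32_bytes(toy_vector)

print(len(blob))                  # 4 bytes per dimension
print(from_float32_bytes(blob))   # round-trips to the original values
```

Values that float32 cannot represent exactly, which is the usual case for real embeddings, come back slightly rounded after the round trip.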

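Simple vector search queries with OpenAI query embeddings

Now that we have a search index and documents loaded into it, we can run search queries. Below we provide a function that runs a search query and returns the results. Using this function, we run a few queries that show how Redis can be used as a vector database.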

def search_redis(
    redis_client: redis.Redis,
    user_query: str,
    index_name: str = "embeddings-index",
    vector_field: str = "title_vector",
    return_fields: list = ["title", "url", "text", "vector_score"],
    hybrid_fields = "*",
    k: int = 20,
    print_results: bool = True,
) -> List[dict]:

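    # Create the embedding vector for the user query.
    # Note: openai.Embedding.create is the interface of openai-python releases before 1.0.
    # The model should match the one that produced the stored vectors,
    # otherwise the distances are not meaningful.
    embedded_query = openai.Embedding.create(input=user_query,
                                            model="text-embedding-3-small",
                                            )["data"][0]['embedding']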

    # Prepare the Query
    base_query = f'{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]'
    query = (
        Query(base_query)
         .return_fields(*return_fields)
         .sort_by("vector_score")
         .paging(0, k)
         .dialect(2)
    )
    params_dict = {"vector": np.array(embedded_query).astype(dtype=np.float32).tobytes()}

    # perform vector search
    results = redis_client.ft(index_name).search(query, params_dict)
    if print_results:
        for i, article in enumerate(results.docs):
            score = 1 - float(article.vector_score)
            print(f"{i}. {article.title} (Score: {round(score ,3) })")
    return results.docs
# For using OpenAI to generate query embedding
results = search_redis(redis_client, 'modern art in Europe', k=10)
0. Museum of Modern Art (Score: 0.875)
1. Western Europe (Score: 0.868)
2. Renaissance art (Score: 0.864)
3. Pop art (Score: 0.86)
4. Northern Europe (Score: 0.855)
5. Hellenistic art (Score: 0.853)
6. Modernist literature (Score: 0.847)
7. Art film (Score: 0.843)
8. Central Europe (Score: 0.843)
9. European (Score: 0.841)
results = search_redis(redis_client, 'Famous battles in Scottish history', vector_field='content_vector', k=10)
0. Battle of Bannockburn (Score: 0.869)
1. Wars of Scottish Independence (Score: 0.861)
2. 1651 (Score: 0.853)
3. First War of Scottish Independence (Score: 0.85)
4. Robert I of Scotland (Score: 0.846)
5. 841 (Score: 0.844)
6. 1716 (Score: 0.844)
7. 1314 (Score: 0.837)
8. 1263 (Score: 0.836)
9. William Wallace (Score: 0.835)
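
The scores printed above are computed as 1 - vector_score. With the COSINE metric, vector_score is a cosine distance (smaller means closer), so subtracting it from 1 recovers the cosine similarity. The following pure-Python sketch shows that relationship on toy 2-dimensional vectors; the helper names and the vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance form of the metric: 0 for identical directions, 1 for orthogonal ones."""
    return 1.0 - cosine_similarity(a, b)

query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # different length, same direction
orthogonal = [0.0, 3.0]

for name, vec in [("same_direction", same_direction), ("orthogonal", orthogonal)]:
    distance = cosine_distance(query, vec)
    score = 1 - distance      # same conversion as in search_redis
    print(name, round(distance, 3), round(score, 3))
```

Because cosine distance ignores vector length, same_direction gets a distance of 0 even though it is twice as long as the query.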

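Hybrid queries with Redis

The previous examples showed how to run vector search queries with RediSearch. In this section, we show how to combine vector search with other RediSearch fields for hybrid search. In the examples below, we combine vector search with full-text search.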

def create_hybrid_field(field_name: str, value: str) -> str:
    return f'@{field_name}:"{value}"'

# search the title vector for articles about famous battles in Scottish history and only include results with Scottish in the title
results = search_redis(redis_client,
                       "Famous battles in Scottish history",
                       vector_field="title_vector",
                       k=5,
                       hybrid_fields=create_hybrid_field("title", "Scottish")
                       )
0. First War of Scottish Independence (Score: 0.892)
1. Wars of Scottish Independence (Score: 0.889)
2. Second War of Scottish Independence (Score: 0.879)
3. List of Scottish monarchs (Score: 0.873)
4. Scottish Borders (Score: 0.863)
# run a hybrid query for articles about Art in the title vector and only include results with the phrase "Leonardo da Vinci" in the text
results = search_redis(redis_client,
                       "Art",
                       vector_field="title_vector",
                       k=5,
                       hybrid_fields=create_hybrid_field("text", "Leonardo da Vinci")
                       )

# find specific mention of Leonardo da Vinci in the text that our full-text-search query returned
mention = [sentence for sentence in results[0].text.split("\n") if "Leonardo da Vinci" in sentence][0]
mention
0. Art (Score: 1.0)
1. Paint (Score: 0.896)
2. Renaissance art (Score: 0.88)
3. Painting (Score: 0.874)
4. Renaissance (Score: 0.846)
'In Europe, after the Middle Ages, there was a "Renaissance" which means "rebirth". People rediscovered science and artists were allowed to paint subjects other than religious subjects. People like Michelangelo and Leonardo da Vinci still painted religious pictures, but they also now could paint mythological pictures too. These artists also invented perspective where things in the distance look smaller in the picture. This was new because in the Middle Ages people would paint all the figures close up and just overlapping each other. These artists used nudity regularly in their art.'
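
The plain and hybrid searches in this notebook send the same kind of query string; only the part before => changes, from * (match everything) to a full-text filter. The sketch below rebuilds those strings without contacting Redis so the syntax is easy to inspect. build_knn_query is a hypothetical helper that mirrors the f-string inside search_redis.

```python
def create_hybrid_field(field_name: str, value: str) -> str:
    """Full-text filter on one field, as defined earlier in the notebook."""
    return f'@{field_name}:"{value}"'

def build_knn_query(k: int, vector_field: str, hybrid_fields: str = "*") -> str:
    """Pre-filter, then KNN over vector_field; the distance is returned as vector_score."""
    return f"{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]"

plain = build_knn_query(10, "title_vector")
hybrid = build_knn_query(5, "title_vector", create_hybrid_field("title", "Scottish"))

print(plain)
print(hybrid)
```

The $vector placeholder is filled in at search time from the params_dict passed to search().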

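HNSW index

Up until now, we've been using the FLAT, or "brute-force", index to run our queries. Redis also supports the HNSW index, which is a fast, approximate index. HNSW is a graph-based index that stores vectors in a Hierarchical Navigable Small World graph. It is a good choice for large datasets where approximate results are acceptable.

In most cases, HNSW takes longer to build and consumes more memory than FLAT, but it runs queries faster, especially on large datasets.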
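
To make the trade-off concrete: an exact, brute-force search conceptually compares the query with every stored vector and keeps the k closest, so its cost grows linearly with the number of vectors. The following is only an illustrative pure-Python sketch of that idea, not a description of how Redis implements FLAT internally; exact_knn and the toy data are hypothetical.

```python
import heapq
import math

def cosine_distance(a, b):
    """1 - cosine similarity, matching the COSINE metric used in this notebook."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def exact_knn(query, vectors, k):
    """Scan every vector and return the k (doc_id, distance) pairs closest to query."""
    scored = ((cosine_distance(query, vec), doc_id) for doc_id, vec in vectors.items())
    return [(doc_id, dist) for dist, doc_id in heapq.nsmallest(k, scored)]

toy_index = {
    "doc:1": [1.0, 0.0],
    "doc:2": [0.9, 0.1],
    "doc:3": [0.0, 1.0],
    "doc:4": [-1.0, 0.0],
}

top2 = exact_knn([1.0, 0.0], toy_index, k=2)
print([doc_id for doc_id, _ in top2])
```

An approximate index such as HNSW avoids the full scan by visiting only part of the data, which is why it can miss some of the true nearest neighbours.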

The following cells show how to create an HNSW index and run queries with it, using the same data as before.

# re-define RediSearch vector fields to use HNSW index
title_embedding = VectorField("title_vector",
    "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": VECTOR_NUMBER
    }
)
text_embedding = VectorField("content_vector",
    "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": VECTOR_NUMBER
    }
)
fields = [title, url, text, title_embedding, text_embedding]
import time
# Check if index exists
HNSW_INDEX_NAME = INDEX_NAME+ "_HNSW"

try:
    redis_client.ft(HNSW_INDEX_NAME).info()
    print("Index already exists")
except redis.exceptions.ResponseError:
    # info() raises ResponseError when the index is unknown, so create it
    redis_client.ft(HNSW_INDEX_NAME).create_index(
        fields = fields,
        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
    )

# since RediSearch creates the index in the background for existing documents, we will wait until
# indexing is complete before running our queries. Although this is not necessary for the first query,
# some queries may take longer to run if the index is not fully built. In general, Redis will perform
# best when adding new documents to existing indices rather than new indices on existing documents.
while redis_client.ft(HNSW_INDEX_NAME).info()["indexing"] == "1":
    time.sleep(5)
results = search_redis(redis_client, 'modern art in Europe', index_name=HNSW_INDEX_NAME, k=10)
0. Western Europe (Score: 0.868)
1. Northern Europe (Score: 0.855)
2. Central Europe (Score: 0.843)
3. European (Score: 0.841)
4. Eastern Europe (Score: 0.839)
5. Europe (Score: 0.839)
6. Western European Union (Score: 0.837)
7. Southern Europe (Score: 0.831)
8. Western civilization (Score: 0.83)
9. Council of Europe (Score: 0.827)
# compare the results of the HNSW index to the FLAT index and time both queries
def time_queries(iterations: int = 10):
    print(" ----- Flat Index ----- ")
    t0 = time.time()
    for i in range(iterations):
        results_flat = search_redis(redis_client, 'modern art in Europe', k=10, print_results=False)
    t0 = (time.time() - t0) / iterations
    results_flat = search_redis(redis_client, 'modern art in Europe', k=10, print_results=True)
    print(f"Flat index query time: {round(t0, 3)} seconds\n")
    time.sleep(1)
    print(" ----- HNSW Index ------ ")
    t1 = time.time()
    for i in range(iterations):
        results_hnsw = search_redis(redis_client, 'modern art in Europe', index_name=HNSW_INDEX_NAME, k=10, print_results=False)
    t1 = (time.time() - t1) / iterations
    results_hnsw = search_redis(redis_client, 'modern art in Europe', index_name=HNSW_INDEX_NAME, k=10, print_results=True)
    print(f"HNSW index query time: {round(t1, 3)} seconds")
    print(" ------------------------ ")
time_queries()
 ----- Flat Index ----- 
0. Museum of Modern Art (Score: 0.875)
1. Western Europe (Score: 0.867)
2. Renaissance art (Score: 0.864)
3. Pop art (Score: 0.861)
4. Northern Europe (Score: 0.855)
5. Hellenistic art (Score: 0.853)
6. Modernist literature (Score: 0.847)
7. Art film (Score: 0.843)
8. Central Europe (Score: 0.843)
9. Art (Score: 0.842)
Flat index query time: 0.263 seconds

 ----- HNSW Index ------ 
0. Western Europe (Score: 0.867)
1. Northern Europe (Score: 0.855)
2. Central Europe (Score: 0.843)
3. European (Score: 0.841)
4. Eastern Europe (Score: 0.839)
5. Europe (Score: 0.839)
6. Western European Union (Score: 0.837)
7. Southern Europe (Score: 0.831)
8. Western civilization (Score: 0.83)
9. Council of Europe (Score: 0.827)
HNSW index query time: 0.129 seconds
 ------------------------
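
One way to quantify how approximate the HNSW results are is to measure how many of the exact FLAT results they recover. The sketch below does this for the two top-10 lists printed above. overlap_at_k is a hypothetical helper, and a single query is far too small a sample to characterize the index in general.

```python
# Titles copied from the FLAT and HNSW outputs of time_queries() above.
flat_titles = [
    "Museum of Modern Art", "Western Europe", "Renaissance art", "Pop art",
    "Northern Europe", "Hellenistic art", "Modernist literature", "Art film",
    "Central Europe", "Art",
]
hnsw_titles = [
    "Western Europe", "Northern Europe", "Central Europe", "European",
    "Eastern Europe", "Europe", "Western European Union", "Southern Europe",
    "Western civilization", "Council of Europe",
]

def overlap_at_k(exact, approximate):
    """Fraction of the exact results that also appear in the approximate results."""
    return len(set(exact) & set(approximate)) / len(exact)

shared = sorted(set(flat_titles) & set(hnsw_titles))
print(shared)
print(overlap_at_k(flat_titles, hnsw_titles))
```

For this query only three of the ten FLAT results (Central Europe, Northern Europe, and Western Europe) also appear in the HNSW list, which illustrates the accuracy that is traded for the shorter query time.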