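Using Redis for embeddings search

June 28, 2023

This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a vector database. This is a common requirement for customers who want to store and search our embeddings alongside their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.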

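What is a Vector Database

A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.

Why use a Vector Database

Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question answering, chatbots and recommendation services, for example) and use them in a secure, scalable environment. Many of our customers use embeddings to solve their problems at small scale, but performance and security hold them back from going into production. We see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.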

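Demo Flow

The demo flow is:

Setup: Import packages and set any required variables
Load data: Load a dataset that has already been embedded with OpenAI embeddings
Redis
- Setup: Set up the Redis-Py client. See the redis-py documentation for more details.
- Index Data: Create the search index for vector search and hybrid search (vector + full-text search) on all available fields.
- Search Data: Run a few example queries with various goals in mind.

Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases that make use of our embeddings.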

Setup

Import the required libraries and set the embedding model that we'd like to use.

# We'll need to install the Redis client
!pip install redis

#Install wget to pull zip file
!pip install wget

import openai

from typing import List, Iterator
import pandas as pd
import numpy as np
import os
import wget
from ast import literal_eval

# Redis client library for Python
import redis

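# I've set this to our new embeddings model, this can be changed to the embedding model of your choice.
# Note: query embeddings are only comparable with stored vectors that were produced by the same model.
EMBEDDING_MODEL = "text-embedding-3-small"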

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

Load data

In this section we'll load embedded data that we prepared ahead of this session.

embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)

import zipfile
with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("../data")

article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')

article_df.head()

	id	url	title	text	title_vector	content_vector	vector_id
0	1	https://simple.wikipedia.org/wiki/April	April	April is the fourth month of the year in the Gregorian calendar...	[0.001009464613161981, -0.020700545981526375, ...	[-0.011253940872848034, -0.013491976074874401,...	0
1	2	https://simple.wikipedia.org/wiki/August	August	August (Aug.) is the eighth month of the year in the Gregorian calendar...	[0.0009286514250561595, 0.000820168002974242, ...	[0.0003609954728744924, 0.007262262050062418, ...	1
2	6	https://simple.wikipedia.org/wiki/Art	Art	Art is a creative activity that expresses imagination...	[0.003393713850528002, 0.0061537534929811954, ...	[-0.004959689453244209, 0.015772193670272827, ...	2
3	8	https://simple.wikipedia.org/wiki/A	A	A or a is the first letter of the English alphabet...	[0.0153952119871974, -0.013759135268628597, 0....	[0.024894846603274345, -0.022186409682035446, ...	3
4	9	https://simple.wikipedia.org/wiki/Air	Air	Air refers to the Earth's atmosphere. Air is a...	[0.02224554680287838, -0.02044147066771984, -0...	[0.021524671465158463, 0.018522677943110466, -...	4

# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)

article_df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              25000 non-null  int64 
 1   url             25000 non-null  object
 2   title           25000 non-null  object
 3   text            25000 non-null  object
 4   title_vector    25000 non-null  object
 5   content_vector  25000 non-null  object
 6   vector_id       25000 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.3+ MB
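
The CSV stores each embedding as the string form of a Python list, which is why the cell above runs literal_eval over both vector columns. A minimal sketch of that conversion follows; the sample string is made up for illustration and is not taken from the dataset.

```python
from ast import literal_eval

# Hypothetical cell value as read from a CSV: a list serialized as text
raw_cell = "[0.0010, -0.0207, 0.0153]"

# literal_eval parses Python literals only; it does not execute arbitrary code
vector = literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(vector).__name__, "of length", len(vector))
```

Until this step runs, the vector columns hold plain strings, so anything that treats them as numeric arrays would fail or silently misbehave.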

Redis

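The vector database covered in this tutorial is Redis. You most likely already know Redis. What you might not be aware of is the RediSearch module. Enterprises have been using Redis with the RediSearch module for years across all major cloud providers, Redis Cloud, and on premise. Recently, the Redis team added vector storage and search capability to this module in addition to the features RediSearch already had.

Given the large ecosystem around Redis, there are most likely client libraries in the language you need. You can use any standard Redis client library to run RediSearch commands, but it's easiest to use a library that wraps the RediSearch API. Below are a few examples; the Redis documentation lists more client libraries.

Project	Language	License	Author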
jedis	Java	MIT	Redis
redis-py	Python	MIT	Redis
node-redis	Node.js	MIT	Redis
nredisstack	.NET	MIT	Redis
redisearch-go	Go	BSD	Redis
redisearch-api-rs	Rust	BSD	Redis

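In the cells below, we'll walk you through using Redis as a vector database. Since many of you are likely already used to the Redis API, this should feel familiar to most.

Setup

There are many ways to deploy Redis with RediSearch. The easiest way to get started is to use Docker, but there are many other potential deployment options. For those, see the redis directory in this repo.

For this tutorial, we will use Redis Stack on Docker.

Start a version of Redis with RediSearch (Redis Stack) by running the following docker command: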

$ cd redis
$ docker compose up -d

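This also includes the RedisInsight GUI for managing your Redis database, which you can view at http://localhost:8001 once the docker container has started.

You're all set up and ready to go! Next, we import and create our client for communicating with the Redis database we just created.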

import redis
from redis.commands.search.indexDefinition import (
    IndexDefinition,
    IndexType
)
from redis.commands.search.query import Query
from redis.commands.search.field import (
    TextField,
    VectorField
)

REDIS_HOST =  "localhost"
REDIS_PORT = 6379
REDIS_PASSWORD = "" # default for passwordless Redis

# Connect to Redis
redis_client = redis.Redis(
    host=REDIS_HOST,
    port=REDIS_PORT,
    password=REDIS_PASSWORD
)
redis_client.ping()

True

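Creating a Search Index

The cells below show how to specify and create a search index in Redis. We will:

Set some constants for defining our index, such as the distance metric and the index name
Define the index schema with RediSearch fields
Create the index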

# Constants
VECTOR_DIM = len(article_df['title_vector'][0]) # length of the vectors
VECTOR_NUMBER = len(article_df)                 # initial number of vectors
INDEX_NAME = "embeddings-index"                 # name of the search index
PREFIX = "doc"                                  # prefix for the document keys
DISTANCE_METRIC = "COSINE"                      # distance metric for the vectors (ex. COSINE, IP, L2)

# Define RediSearch fields for each of the columns in the dataset
title = TextField(name="title")
url = TextField(name="url")
text = TextField(name="text")
title_embedding = VectorField("title_vector",
    "FLAT", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": VECTOR_NUMBER,
    }
)
text_embedding = VectorField("content_vector",
    "FLAT", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": VECTOR_NUMBER,
    }
)
fields = [title, url, text, title_embedding, text_embedding]

# Check if index exists
try:
    redis_client.ft(INDEX_NAME).info()
    print("Index already exists")
except redis.exceptions.ResponseError:
    # Create RediSearch Index
    redis_client.ft(INDEX_NAME).create_index(
        fields = fields,
        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
    )
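
As a back-of-the-envelope check on the index definition above, the raw FLOAT32 vector data for one field takes roughly DIM × 4 bytes per document. The sketch below assumes 1536-dimensional vectors, which is an assumption about this dataset (the notebook reads the real value into VECTOR_DIM), and it ignores Redis's own per-key and index overhead, so treat the result as a rough lower bound.

```python
# Assumed values for illustration; the notebook derives these from article_df
vector_dim = 1536        # assumption: embedding length, not stated in the notebook
vector_number = 25_000   # row count reported by article_df.info()
bytes_per_float32 = 4    # TYPE is FLOAT32 in both VectorField definitions
vector_fields = 2        # title_vector and content_vector

raw_bytes = vector_dim * bytes_per_float32 * vector_number * vector_fields
print(f"~{raw_bytes / 1024**2:.0f} MiB of raw vector data")
```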

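Load Documents into the Index

Now that we have a search index, we can load documents into it. We will use the article DataFrame loaded earlier. In Redis, either the Hash or JSON (if using RedisJSON in addition to RediSearch) data types can be used to store documents. We will use the HASH data type in this example. The cell below shows how to load documents into the index.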

def index_documents(client: redis.Redis, prefix: str, documents: pd.DataFrame):
    records = documents.to_dict("records")
    for doc in records:
        key = f"{prefix}:{str(doc['id'])}"

        # create byte vectors for title and content
        title_embedding = np.array(doc["title_vector"], dtype=np.float32).tobytes()
        content_embedding = np.array(doc["content_vector"], dtype=np.float32).tobytes()

        # replace list of floats with byte vectors
        doc["title_vector"] = title_embedding
        doc["content_vector"] = content_embedding

        client.hset(key, mapping = doc)

index_documents(redis_client, PREFIX, article_df)
print(f"Loaded {redis_client.info()['db0']['keys']} documents in Redis search index with name: {INDEX_NAME}")

Loaded 25000 documents in Redis search index with name: embeddings-index
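
index_documents stores each vector as a raw byte string because a VectorField of TYPE FLOAT32 expects packed 32-bit floats. The round trip below uses the standard library's array module as a stand-in for np.array(..., dtype=np.float32).tobytes(); both write native-order 4-byte floats on mainstream platforms, but that equivalence is an assumption of this sketch, and the sample values are made up.

```python
from array import array

# Made-up embedding; real ones are much longer
embedding = [0.25, -0.5, 1.0]

# 'f' is a C float (4 bytes on mainstream platforms), in native byte order
blob = array("f", embedding).tobytes()

# Decode the bytes back into floats, as you might after reading a hash field
decoded = array("f")
decoded.frombytes(blob)

print(len(blob), list(decoded))
```

The blob length is always four times the vector length, which gives a quick sanity check that a stored field matches the DIM declared on the index.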

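Running Search Queries

Now that we have a search index and documents loaded into it, we can run search queries. Below we provide a function that runs a search query and returns the results. Using this function we run a few queries that show how you can use Redis as a vector database. Each example demonstrates specific features to keep in mind when developing your search application with Redis.

Return Fields: You can specify which fields you want to return in the search results. This is useful if you only want a subset of the fields in your documents, and it avoids a separate call to retrieve the documents. In the examples below, search_redis requests the title, url, text and vector_score fields and prints only the title and score.
Hybrid Search: You can combine vector search with any of the other RediSearch fields for hybrid search, such as full text search, tag, geo, and numeric. In the examples below, we combine vector search with full text search.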

def search_redis(
    redis_client: redis.Redis,
    user_query: str,
    index_name: str = "embeddings-index",
    vector_field: str = "title_vector",
    return_fields: list = ["title", "url", "text", "vector_score"],
    hybrid_fields = "*",
    k: int = 20,
) -> List[dict]:

    # Creates embedding vector from user query
    # (pre-1.0 openai SDK call; openai>=1.0 removed openai.Embedding)
    embedded_query = openai.Embedding.create(input=user_query,
                                            model=EMBEDDING_MODEL,
                                            )["data"][0]['embedding']

    # Prepare the Query
    base_query = f'{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]'
    query = (
        Query(base_query)
         .return_fields(*return_fields)
         .sort_by("vector_score")
         .paging(0, k)
         .dialect(2)
    )
    params_dict = {"vector": np.array(embedded_query).astype(dtype=np.float32).tobytes()}

    # perform vector search
    results = redis_client.ft(index_name).search(query, params_dict)
    for i, article in enumerate(results.docs):
        score = 1 - float(article.vector_score)
        print(f"{i}. {article.title} (Score: {round(score ,3) })")
    return results.docs

# For using OpenAI to generate query embedding
openai.api_key = os.getenv("OPENAI_API_KEY", "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
results = search_redis(redis_client, 'modern art in Europe', k=10)

0. Museum of Modern Art (Score: 0.875)
1. Western Europe (Score: 0.867)
2. Renaissance art (Score: 0.864)
3. Pop art (Score: 0.86)
4. Northern Europe (Score: 0.855)
5. Hellenistic art (Score: 0.853)
6. Modernist literature (Score: 0.847)
7. Art film (Score: 0.843)
8. Central Europe (Score: 0.843)
9. European (Score: 0.841)
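
With DISTANCE_METRIC set to COSINE, the vector_score that Redis returns is a cosine distance (1 minus cosine similarity), which is why search_redis prints 1 - vector_score. Conceptually, a KNN query against a FLAT index compares the query with every stored vector and keeps the k closest. A pure-Python sketch of that idea on toy 2-D vectors follows; the function names and data are illustrative and not part of RediSearch.

```python
import math
from typing import Dict, List, Tuple


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn(query: List[float], docs: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Exhaustively score every document and return the k smallest distances."""
    scored = [(name, cosine_distance(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy data standing in for stored embeddings
toy_docs = {"east": [1.0, 0.0], "north": [0.0, 1.0], "north_east": [1.0, 1.0]}
hits = knn([1.0, 0.0], toy_docs, k=2)
for name, distance in hits:
    # Convert distance back to a similarity, as search_redis does when printing
    print(name, round(1 - distance, 3))
```

A document whose vector points in the same direction as the query has distance 0 and therefore a printed score of 1.0.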

results = search_redis(redis_client, 'Famous battles in Scottish history', vector_field='content_vector', k=10)

0. Battle of Bannockburn (Score: 0.869)
1. Wars of Scottish Independence (Score: 0.861)
2. 1651 (Score: 0.853)
3. First War of Scottish Independence (Score: 0.85)
4. Robert I of Scotland (Score: 0.846)
5. 841 (Score: 0.844)
6. 1716 (Score: 0.844)
7. 1314 (Score: 0.837)
8. 1263 (Score: 0.836)
9. William Wallace (Score: 0.835)

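Hybrid Queries with Redis

The previous examples showed how to run vector search queries with RediSearch. In this section, we will show how to combine vector search with other RediSearch fields for hybrid search. In the examples below, we combine vector search with full text search.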

def create_hybrid_field(field_name: str, value: str) -> str:
    return f'@{field_name}:"{value}"'

# search the title vector for articles about famous battles in Scottish history and only include results with Scottish in the title
results = search_redis(redis_client,
                       "Famous battles in Scottish history",
                       vector_field="title_vector",
                       k=5,
                       hybrid_fields=create_hybrid_field("title", "Scottish")
                       )

0. First War of Scottish Independence (Score: 0.892)
1. Wars of Scottish Independence (Score: 0.889)
2. Second War of Scottish Independence (Score: 0.879)
3. List of Scottish monarchs (Score: 0.873)
4. Scottish Borders (Score: 0.863)
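
It can help to look at the query string that search_redis builds for a hybrid search: the full-text filter goes in front of the => and the KNN clause then runs over the documents that pass it. The snippet below repeats create_hybrid_field and the f-string template from search_redis without contacting Redis; build_query is a hypothetical helper name introduced only for this illustration.

```python
def create_hybrid_field(field_name: str, value: str) -> str:
    # Same helper as in the notebook: an exact-phrase filter on one field
    return f'@{field_name}:"{value}"'


def build_query(hybrid_fields: str = "*", vector_field: str = "title_vector", k: int = 20) -> str:
    # Same template as base_query in search_redis
    return f"{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]"


plain = build_query(k=10)
hybrid = build_query(create_hybrid_field("title", "Scottish"), k=5)
print(plain)
print(hybrid)
```

With the default "*" filter every document is a KNN candidate; swapping in a field filter narrows the candidate set before the nearest neighbours are ranked.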

# run a hybrid query for articles about Art in the title vector and only include results with the phrase "Leonardo da Vinci" in the text
results = search_redis(redis_client,
                       "Art",
                       vector_field="title_vector",
                       k=5,
                       hybrid_fields=create_hybrid_field("text", "Leonardo da Vinci")
                       )

# find specific mention of Leonardo da Vinci in the text that our full-text-search query returned
mention = [sentence for sentence in results[0].text.split("\n") if "Leonardo da Vinci" in sentence][0]
mention

0. Art (Score: 1.0)
1. Paint (Score: 0.896)
2. Renaissance art (Score: 0.88)
3. Painting (Score: 0.874)
4. Renaissance (Score: 0.846)

'In Europe, after the Middle Ages, there was a "Renaissance" which means "rebirth". People rediscovered science and artists were allowed to paint subjects other than religious subjects. People like Michelangelo and Leonardo da Vinci still painted religious pictures, but they also now could paint mythological pictures too. These artists also invented perspective where things in the distance look smaller in the picture. This was new because in the Middle Ages people would paint all the figures close up and just overlapping each other. These artists used nudity regularly in their art.'

For more examples of using Redis as a vector database, see the README and examples in the vector_databases/redis directory of this repository.