Using Qdrant for Embeddings Search

Jun 28, 2023

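This notebook walks through a simple flow: download some data, embed it, then index and search it with a vector database. This is a common requirement for customers who want to store and search our embeddings alongside their own data in a secure environment, to support production use cases such as chatbots and topic modelling.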

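What is a Vector Database

A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.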
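To make the idea concrete, here is a minimal sketch of what a vector database does conceptually: it keeps vectors under ids and returns the ones most similar to a query vector. The ToyVectorStore class and its method names are hypothetical and exist only for this illustration. Production systems such as Qdrant rely on approximate indexes rather than the brute-force scan used here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store; illustration only, not a Qdrant API."""

    def __init__(self):
        self._vectors = {}

    def add(self, item_id, vector):
        self._vectors[item_id] = vector

    def search(self, query, top_k=3):
        # Brute force: score every stored vector, then keep the best top_k.
        scored = [
            (cosine_similarity(query, vector), item_id)
            for item_id, vector in self._vectors.items()
        ]
        scored.sort(reverse=True)
        return [(item_id, score) for score, item_id in scored[:top_k]]


# Made-up three-dimensional "embeddings" for three items.
store = ToyVectorStore()
store.add("cat", [0.9, 0.1, 0.0])
store.add("dog", [0.8, 0.2, 0.1])
store.add("car", [0.0, 0.1, 0.9])

nearest = store.search([1.0, 0.0, 0.0], top_k=2)
print(nearest)
```

The scan costs one similarity computation per stored vector, which is why dedicated indexes matter once a collection grows to many thousands of items.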

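Why use a Vector Database

Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question answering, chatbot and recommendation services, for example) and use them in a secure, scalable environment. Many of our customers use embeddings to solve their problems at small scale, but performance and security hold them back from going into production. We see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.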

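Demo Flow

The demo flow is:

  • Setup: Import packages and set any required variables
  • Load data: Load a dataset that was embedded with OpenAI embeddings ahead of time
  • Qdrant
    • Setup: Here we'll set up the Python client for Qdrant. For more details, see the Qdrant documentation
    • Index data: We'll create a collection with vectors for titles and content
    • Search data: We'll run a few searches to confirm it works

Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases that make use of our embeddings.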

Setup

Import the required libraries and set the embedding model that we'd like to use.

# We'll need to install Qdrant client
!pip install qdrant-client
import openai
import pandas as pd
from ast import literal_eval
import qdrant_client # Qdrant's client library for Python

# This can be changed to the embedding model of your choice. Make sure it's the same model that was used to generate the stored embeddings
EMBEDDING_MODEL = "text-embedding-ada-002"

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 

Load data

In this section we'll load embedded data that we prepared ahead of this session.

import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)
100% [......................................................................] 698933052 / 698933052
'vector_database_wikipedia_articles_embedded (10).zip'
import zipfile
with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("../data")
article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')
article_df.head()
   id  url                                       title   text                                                       title_vector                                       content_vector                                     vector_id
0  1   https://simple.wikipedia.org/wiki/April   April   April is the fourth month of the year in the J...          [0.001009464613161981, -0.020700545981526375, ...  [-0.011253940872848034, -0.013491976074874401,...  0
1  2   https://simple.wikipedia.org/wiki/August  August  August (Aug.) is the eighth month of the year ...          [0.0009286514250561595, 0.000820168002974242, ...  [0.0003609954728744924, 0.007262262050062418, ...  1
2  6   https://simple.wikipedia.org/wiki/Art     Art     Art is a creative activity that expresses imagination ...  [0.003393713850528002, 0.0061537534929811954, ...  [-0.004959689453244209, 0.015772193670272827, ...  2
3  8   https://simple.wikipedia.org/wiki/A       A       A or a is the first letter of the English alphabet ...     [0.0153952119871974, -0.013759135268628597, 0....  [0.024894846603274345, -0.022186409682035446, ...  3
4  9   https://simple.wikipedia.org/wiki/Air     Air     Air refers to the Earth's atmosphere. Air is a ...         [0.02224554680287838, -0.02044147066771984, -0...  [0.021524671465158463, 0.018522677943110466, -...  4
# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)
article_df.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              25000 non-null  int64 
 1   url             25000 non-null  object
 2   title           25000 non-null  object
 3   text            25000 non-null  object
 4   title_vector    25000 non-null  object
 5   content_vector  25000 non-null  object
 6   vector_id       25000 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.3+ MB
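The literal_eval step above is needed because a CSV file has no list type: each embedding is written out as text and comes back as a string. A small sketch with a made-up three-number vector (the values are illustrative, not taken from the dataset):

```python
from ast import literal_eval

# A vector as it looks after being read back from a CSV cell: plain text.
raw = "[0.25, -0.5, 1.0]"
print(type(raw).__name__, len(raw))        # a str; len counts characters

# literal_eval parses Python literals only, so it does not execute code.
vector = literal_eval(raw)
print(type(vector).__name__, len(vector))  # a list; len counts dimensions
```

Only after this conversion does len() on a row's vector give the embedding dimension, which the collection setup later relies on.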

Qdrant

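Qdrant is a high-performance vector search database written in Rust. It offers both on-premise and cloud versions, but for the purposes of this example we'll use the local deployment mode.

Setting everything up will require:

  • Spinning up a local instance of Qdrant
  • Configuring the collection and storing the data in it
  • Trying out some queries

Setup

For the local deployment, we are going to use Docker, following the Qdrant documentation: https://qdrant.tech/documentation/quick_start/. Qdrant requires just a single container, but an example docker-compose.yaml file is available at ./qdrant/docker-compose.yaml in this repo.

You can start a Qdrant instance locally by navigating to this directory and running docker-compose up -d (or docker compose up -d with Docker Compose v2, as in the cell below).

You might need to increase the memory limit for Docker to 8GB or more. Otherwise Qdrant might fail to start, with an error message like 7 Killed.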

! docker compose up -d
[+] Running 1/0
 ✔ Container qdrant-qdrant-1  Running                                      0.0s 
qdrant = qdrant_client.QdrantClient(host="localhost", port=6333)
qdrant.get_collections()
CollectionsResponse(collections=[CollectionDescription(name='Articles')])
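A freshly started container may refuse connections for a few seconds, so the first client call can fail. One possible way to handle that is a small retry loop. The helper name wait_until_ready and its parameters are hypothetical; it accepts any zero-argument callable, so it makes no assumption about the client library beyond the get_collections call already used above.

```python
import time


def wait_until_ready(check, attempts=10, delay=1.0):
    """Call check() until it stops raising, then return its result.

    `check` is any zero-argument callable, for example
    lambda: qdrant.get_collections() with the client created above.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return check()
        except Exception as error:  # e.g. connection refused while booting
            last_error = error
            time.sleep(delay)
    raise RuntimeError(f"service not ready after {attempts} attempts") from last_error


# Stand-in for a service that fails twice before answering.
calls = {"n": 0}


def flaky_check():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("not up yet")
    return "ok"


result = wait_until_ready(flaky_check, attempts=5, delay=0.01)
print(result, calls["n"])
```

In the notebook this would be called as wait_until_ready(lambda: qdrant.get_collections()) right after creating the client.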

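Index data

Qdrant stores data in collections, where each object is described by at least one vector and may contain additional metadata called payload. Our collection will be called Articles, and each object will be described by both title and content vectors.

We'll be using the official qdrant-client package, which has all the utility methods already built in.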

from qdrant_client.http import models as rest
# Get the vector size from the first row to set up the collection
vector_size = len(article_df['content_vector'][0])

# Set up the collection with the vector configuration. You need to declare the vector size and distance metric for the collection. Distance metric enables vector database to index and search vectors efficiently.
qdrant.recreate_collection(
    collection_name='Articles',
    vectors_config={
        'title': rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
        'content': rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
    }
)
True
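The collection is configured with a single size taken from the first row, so every stored vector must have that same length. A quick sanity check before indexing could look like the sketch below. The function name common_dimension is hypothetical and the sample vectors are made up.

```python
def common_dimension(vectors):
    """Return the length shared by all vectors, or raise ValueError."""
    sizes = {len(vector) for vector in vectors}
    if len(sizes) != 1:
        raise ValueError(f"expected exactly one vector size, found {sorted(sizes)}")
    return sizes.pop()


# Made-up vectors; the notebook's columns would be passed in the same way.
sample = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(common_dimension(sample))
```

With the notebook's data, common_dimension(article_df['content_vector']) should agree with vector_size; text-embedding-ada-002 produces 1536-dimensional vectors.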

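In addition to the vector configuration defined under vectors_config, we can also define the payload. Payload is an optional field that allows you to store additional metadata alongside the vectors. In our case, we'll store the id, title, and url of each article. Because the search results return the title of the nearest articles from the payload, we can also give the user the URL of the article (which is part of the metadata).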

from qdrant_client.models import PointStruct # Import the PointStruct to store the vector and payload
from tqdm import tqdm # Library to show the progress bar 

# Populate collection with vectors using tqdm to show progress
for k, v in tqdm(article_df.iterrows(), desc="Upserting articles", total=len(article_df)):
    try:
        qdrant.upsert(
            collection_name='Articles',
            points=[
                PointStruct(
                    id=k,
                    vector={'title': v['title_vector'], 
                            'content': v['content_vector']},
                    payload={
                        'id': v['id'],
                        'title': v['title'],
                        'url': v['url']
                    }
                )
            ]
        )
    except Exception as e:
        print(f"Failed to upsert row {k}: {v}")
        print(f"Exception: {e}")
Upserting articles: 100%|█████████████████████████████████████████████████████████████████████████████████████| 25000/25000 [02:52<00:00, 144.82it/s]
# Check the collection size to make sure all the points have been stored
qdrant.count(collection_name='Articles')
CountResult(count=25000)
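The loop above sends one request per article. Grouping points into batches usually cuts the number of round trips. The sketch below shows only the batching logic; the names batched and upload_in_batches, the send_batch callback, and the batch size of 500 are all assumptions made for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upload_in_batches(records, send_batch, batch_size=500):
    """Send records batch by batch; return how many requests were made."""
    requests_made = 0
    for batch in batched(records, batch_size):
        send_batch(batch)
        requests_made += 1
    return requests_made


# Stand-in sender that only records the size of each batch it receives.
sent_sizes = []
records = list(range(1050))
n_requests = upload_in_batches(
    records, lambda batch: sent_sizes.append(len(batch)), batch_size=500
)
print(n_requests, sent_sizes)
```

In the notebook, records would be a list of PointStruct objects built as in the loop above, and send_batch could wrap qdrant.upsert(collection_name='Articles', points=batch), since upsert already takes a list of points.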

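Search data

Once the data is in Qdrant, we can start querying the collection for the closest vectors. The additional vector_name parameter switches between title-based and content-based search. Make sure you use the text-embedding-ada-002 model for queries, because the embeddings in the file were created with that model.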

def query_qdrant(query, collection_name, vector_name='title', top_k=20):

    # Creates embedding vector from user query
    embedded_query = openai.embeddings.create(
        input=query,
        model=EMBEDDING_MODEL,
    ).data[0].embedding # We take the first embedding from the list
    
    query_results = qdrant.search(
        collection_name=collection_name,
        query_vector=(
            vector_name, embedded_query
        ),
        limit=top_k, 
        query_filter=None
    )
    
    return query_results
query_results = query_qdrant('modern art in Europe', 'Articles', 'title')
for i, article in enumerate(query_results):
    print(f'{i + 1}. {article.payload["title"]}, URL: {article.payload["url"]} (Score: {round(article.score, 3)})')
1. Museum of Modern Art, URL: https://simple.wikipedia.org/wiki/Museum%20of%20Modern%20Art (Score: 0.875)
2. Western Europe, URL: https://simple.wikipedia.org/wiki/Western%20Europe (Score: 0.867)
3. Renaissance art, URL: https://simple.wikipedia.org/wiki/Renaissance%20art (Score: 0.864)
4. Pop art, URL: https://simple.wikipedia.org/wiki/Pop%20art (Score: 0.86)
5. Northern Europe, URL: https://simple.wikipedia.org/wiki/Northern%20Europe (Score: 0.855)
6. Hellenistic art, URL: https://simple.wikipedia.org/wiki/Hellenistic%20art (Score: 0.853)
7. Modernist literature, URL: https://simple.wikipedia.org/wiki/Modernist%20literature (Score: 0.847)
8. Art film, URL: https://simple.wikipedia.org/wiki/Art%20film (Score: 0.843)
9. Central Europe, URL: https://simple.wikipedia.org/wiki/Central%20Europe (Score: 0.843)
10. European, URL: https://simple.wikipedia.org/wiki/European (Score: 0.841)
11. Art, URL: https://simple.wikipedia.org/wiki/Art (Score: 0.841)
12. Byzantine art, URL: https://simple.wikipedia.org/wiki/Byzantine%20art (Score: 0.841)
13. Postmodernism, URL: https://simple.wikipedia.org/wiki/Postmodernism (Score: 0.84)
14. Eastern Europe, URL: https://simple.wikipedia.org/wiki/Eastern%20Europe (Score: 0.839)
15. Cubism, URL: https://simple.wikipedia.org/wiki/Cubism (Score: 0.839)
16. Europe, URL: https://simple.wikipedia.org/wiki/Europe (Score: 0.839)
17. Impressionism, URL: https://simple.wikipedia.org/wiki/Impressionism (Score: 0.838)
18. Bauhaus, URL: https://simple.wikipedia.org/wiki/Bauhaus (Score: 0.838)
19. Surrealism, URL: https://simple.wikipedia.org/wiki/Surrealism (Score: 0.837)
20. Expressionism, URL: https://simple.wikipedia.org/wiki/Expressionism (Score: 0.837)
# This time we'll query using content vector
query_results = query_qdrant('Famous battles in Scottish history', 'Articles', 'content')
for i, article in enumerate(query_results):
    print(f'{i + 1}. {article.payload["title"]}, URL: {article.payload["url"]} (Score: {round(article.score, 3)})')
1. Battle of Bannockburn, URL: https://simple.wikipedia.org/wiki/Battle%20of%20Bannockburn (Score: 0.869)
2. Wars of Scottish Independence, URL: https://simple.wikipedia.org/wiki/Wars%20of%20Scottish%20Independence (Score: 0.861)
3. 1651, URL: https://simple.wikipedia.org/wiki/1651 (Score: 0.852)
4. First War of Scottish Independence, URL: https://simple.wikipedia.org/wiki/First%20War%20of%20Scottish%20Independence (Score: 0.85)
5. Robert I of Scotland, URL: https://simple.wikipedia.org/wiki/Robert%20I%20of%20Scotland (Score: 0.846)
6. 841, URL: https://simple.wikipedia.org/wiki/841 (Score: 0.844)
7. 1716, URL: https://simple.wikipedia.org/wiki/1716 (Score: 0.844)
8. 1314, URL: https://simple.wikipedia.org/wiki/1314 (Score: 0.837)
9. 1263, URL: https://simple.wikipedia.org/wiki/1263 (Score: 0.836)
10. William Wallace, URL: https://simple.wikipedia.org/wiki/William%20Wallace (Score: 0.835)
11. Stirling, URL: https://simple.wikipedia.org/wiki/Stirling (Score: 0.831)
12. 1306, URL: https://simple.wikipedia.org/wiki/1306 (Score: 0.831)
13. 1746, URL: https://simple.wikipedia.org/wiki/1746 (Score: 0.83)
14. 1040s, URL: https://simple.wikipedia.org/wiki/1040s (Score: 0.828)
15. 1106, URL: https://simple.wikipedia.org/wiki/1106 (Score: 0.827)
16. 1304, URL: https://simple.wikipedia.org/wiki/1304 (Score: 0.826)
17. David II of Scotland, URL: https://simple.wikipedia.org/wiki/David%20II%20of%20Scotland (Score: 0.825)
18. Braveheart, URL: https://simple.wikipedia.org/wiki/Braveheart (Score: 0.824)
19. 1124, URL: https://simple.wikipedia.org/wiki/1124 (Score: 0.824)
20. July 27, URL: https://simple.wikipedia.org/wiki/July%2027 (Score: 0.823)
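The tail of the content search includes loosely related year and date pages. One simple way to trim such hits is a score cutoff applied after the search. The sketch below uses stand-in objects shaped like the printed hits (a score and a payload), with scores copied from the output above. The function name keep_above and the 0.85 cutoff are arbitrary choices for illustration, not recommended values.

```python
from types import SimpleNamespace


def keep_above(results, min_score):
    """Keep hits whose score is at least min_score, preserving their order."""
    return [hit for hit in results if hit.score >= min_score]


# Stand-ins with the same attributes the notebook reads: .score and .payload.
hits = [
    SimpleNamespace(score=0.869, payload={"title": "Battle of Bannockburn"}),
    SimpleNamespace(score=0.852, payload={"title": "1651"}),
    SimpleNamespace(score=0.823, payload={"title": "July 27"}),
]

strong = keep_above(hits, min_score=0.85)
print([hit.payload["title"] for hit in strong])
```

A fixed cutoff is a blunt tool: cosine scores from this model sit in a narrow band, so a suitable threshold has to be chosen by inspecting results for the data at hand.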