Using Typesense for Embeddings Search

Jun 28, 2023

This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment, in support of production use cases such as chatbots, topic modelling and more.

What is a Vector Database

A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.
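At its core, the search a vector database performs is nearest-neighbor retrieval over embedding vectors. As a toy illustration of the underlying idea (plain numpy with made-up data; real databases use approximate indexes such as HNSW to do this at scale):

import numpy as np

# Toy sketch: rank stored vectors by cosine similarity to a query vector.
stored_vectors = np.random.rand(1000, 1536)   # 1,000 embeddings, 1536 dims each
query_vector = np.random.rand(1536)

similarities = stored_vectors @ query_vector / (
    np.linalg.norm(stored_vectors, axis=1) * np.linalg.norm(query_vector)
)
top_5 = np.argsort(-similarities)[:5]          # indices of the 5 nearest neighbors
print(top_5)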

Why use a Vector Database

Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question answering, chatbots and recommendation services, for example) and make use of them in a secure, scalable environment. Many of our customers make embeddings solve their problems at small scale, but performance and security hold them back from going into production - we see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.

Demo Flow

The demo flow is:

  • Setup: Import packages and set any required variables
  • Load data: Load a dataset and embed it using OpenAI embeddings
  • Typesense
    • Setup: Set up the Typesense Python client. For more details go here
    • Index Data: We'll create a collection and index it for both titles and content.
    • Search Data: Run a few example queries with various goals in mind.

Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases making use of our embeddings.

Setup

Import the required libraries and set the embedding model that we'd like to use.

# We'll need to install the Typesense client
!pip install typesense

# Install wget to pull the zip file
!pip install wget
import openai

from typing import List, Iterator
import pandas as pd
import numpy as np
import os
import wget
from ast import literal_eval

# Typesense's client library for Python
import typesense

# I've set this to our new embeddings model, this can be changed to the embedding model of your choice
EMBEDDING_MODEL = "text-embedding-3-small"

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 

Load data

In this section we'll load the embedded data that we've prepared previously.

embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)
import zipfile
with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("../data")
article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')
article_df.head()
id url title text title_vector content_vector vector_id
0 1 https://simple.wikipedia.org/wiki/April April April is the fourth month of the year in the Gregorian calendar... [0.001009464613161981, -0.020700545981526375, ... [-0.011253940872848034, -0.013491976074874401,... 0
1 2 https://simple.wikipedia.org/wiki/August August August (Aug.) is the eighth month of the year in the Gregorian calendar... [0.0009286514250561595, 0.000820168002974242, ... [0.0003609954728744924, 0.007262262050062418, ... 1
2 6 https://simple.wikipedia.org/wiki/Art Art Art is a creative activity that expresses imagination... [0.003393713850528002, 0.0061537534929811954, ... [-0.004959689453244209, 0.015772193670272827, ... 2
3 8 https://simple.wikipedia.org/wiki/A A A or a is the first letter of the English alphabet... [0.0153952119871974, -0.013759135268628597, 0.... [0.024894846603274345, -0.022186409682035446, ... 3
4 9 https://simple.wikipedia.org/wiki/Air Air Air refers to the Earth's atmosphere. Air is a... [0.02224554680287838, -0.02044147066771984, -0... [0.021524671465158463, 0.018522677943110466, -... 4
# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)
article_df.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              25000 non-null  int64 
 1   url             25000 non-null  object
 2   title           25000 non-null  object
 3   text            25000 non-null  object
 4   title_vector    25000 non-null  object
 5   content_vector  25000 non-null  object
 6   vector_id       25000 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.3+ MB
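
As an aside, if the literal_eval parsing above feels slow, json.loads is a drop-in alternative here, since the vectors are stored as plain bracketed lists of numbers:

import json

# Optional alternative to the literal_eval calls above: json.loads parses the
# same "[0.001, -0.02, ...]" strings and is typically faster on 25,000 rows.
# Run this instead of (not in addition to) the literal_eval cell.
article_df['title_vector'] = article_df.title_vector.apply(json.loads)
article_df['content_vector'] = article_df.content_vector.apply(json.loads)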

Typesense

The next vector store we'll look at is Typesense, an open-source, in-memory search engine that you can either self-host or run on Typesense Cloud.

Typesense focuses on performance by storing the entire index in RAM (with a backup on disk), and on providing an out-of-the-box developer experience by simplifying available options and setting good defaults. It also lets you combine attribute-based filtering together with vector queries.

For this example, we will set up a local Docker-based Typesense server, index our vectors in Typesense and then do some nearest-neighbor search queries. If you use Typesense Cloud, you can skip the Docker setup part and just obtain the hostname and API key from your cluster dashboard.

Setup

To run Typesense locally, you'll need Docker. Following the instructions contained in the Typesense documentation here, we created an example docker-compose.yml file in this repo, saved at ./typesense/docker-compose.yml.

After starting Docker, you can start Typesense locally by navigating to the examples/vector_databases/typesense/ directory and running docker-compose up -d.

The default API key is set to xyz in the Docker compose file, and the default Typesense port is 8108.
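
For reference, a minimal docker-compose.yml consistent with those defaults could look like the sketch below. The image tag and data path here are assumptions; defer to the actual file at ./typesense/docker-compose.yml in this repo.

# Sketch of a minimal Typesense docker-compose.yml; the image tag and data
# path are assumptions - check the file in this repo for the real values.
version: "3.4"
services:
  typesense:
    image: typesense/typesense:0.24.1
    restart: on-failure
    ports:
      - "8108:8108"
    volumes:
      - ./typesense-data:/data
    command: "--data-dir /data --api-key=xyz --enable-cors"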

import typesense

typesense_client = typesense.Client({
    "nodes": [{
        "host": "localhost",  # For Typesense Cloud use xxx.a1.typesense.net
        "port": "8108",       # For Typesense Cloud use 443
        "protocol": "http"    # For Typesense Cloud use https
    }],
    "api_key": "xyz",
    "connection_timeout_seconds": 60
})

Index data

To index vectors in Typesense, we first create a Collection (which is a collection of Documents) and turn on vector indexing for particular fields. You can even store multiple vector fields in a single document.

# Delete existing collections if they already exist
try:
    typesense_client.collections['wikipedia_articles'].delete()
except Exception as e:
    pass

# Create a new collection

schema = {
    "name": "wikipedia_articles",
    "fields": [
        {
            "name": "content_vector",
            "type": "float[]",
            "num_dim": len(article_df['content_vector'][0])
        },
        {
            "name": "title_vector",
            "type": "float[]",
            "num_dim": len(article_df['title_vector'][0])
        }
    ]
}

create_response = typesense_client.collections.create(schema)
print(create_response)

print("Created new collection wikipedia-articles")
{'created_at': 1687165065, 'default_sorting_field': '', 'enable_nested_fields': False, 'fields': [{'facet': False, 'index': True, 'infix': False, 'locale': '', 'name': 'content_vector', 'num_dim': 1536, 'optional': False, 'sort': False, 'type': 'float[]'}, {'facet': False, 'index': True, 'infix': False, 'locale': '', 'name': 'title_vector', 'num_dim': 1536, 'optional': False, 'sort': False, 'type': 'float[]'}], 'name': 'wikipedia_articles', 'num_documents': 0, 'symbols_to_index': [], 'token_separators': []}
Created new collection wikipedia_articles
# Upsert the vector data into the collection we just created
#
# Note: This can take a few minutes, especially if you're on an M1 and running Docker in emulated mode

print("Indexing vectors in Typesense...")

document_counter = 0
documents_batch = []

for k,v in article_df.iterrows():
    # Create a document with the vector data

    # Notice how you can add any fields that you haven't added to the schema to the document.
    # These will be stored on disk and returned when the document is a hit.
    # This is useful to store attributes required for display purposes.

    document = {
        "title_vector": v["title_vector"],
        "content_vector": v["content_vector"],
        "title": v["title"],
        "content": v["text"],
    }
    documents_batch.append(document)
    document_counter = document_counter + 1

    # Upsert a batch of 100 documents
    if document_counter % 100 == 0 or document_counter == len(article_df):
        response = typesense_client.collections['wikipedia_articles'].documents.import_(documents_batch)
        # print(response)

        documents_batch = []
        print(f"Processed {document_counter} / {len(article_df)} ")

print(f"Imported ({len(article_df)}) articles.")
Indexing vectors in Typesense...
Processed 100 / 25000 
Processed 200 / 25000 
Processed 300 / 25000 
...
Processed 24900 / 25000 
Processed 25000 / 25000 
Imported (25000) articles.
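
One refinement worth making to the loop above: import_ returns one result per document, so you can surface per-document failures that the batch call itself won't raise as exceptions. A small sketch, assuming the list-of-dicts return shape the Python client uses:

# Sketch: check the per-document results that import_ returns.
# Each entry is a dict with a "success" key, plus an "error" message on failure.
response = typesense_client.collections['wikipedia_articles'].documents.import_(documents_batch)
failed = [r for r in response if not r.get('success')]
if failed:
    print(f'{len(failed)} documents failed to import, e.g.: {failed[0]}')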
# Check the number of documents imported

collection = typesense_client.collections['wikipedia_articles'].retrieve()
print(f'Collection has {collection["num_documents"]} documents')
Collection has 25000 documents

Search data

Now that we've imported the vectors into Typesense, we can do some nearest-neighbor searches on the title_vector and content_vector fields.

def query_typesense(query, field='title', top_k=20):

    # Creates embedding vector from user query
    openai.api_key = os.getenv("OPENAI_API_KEY", "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
    embedded_query = openai.Embedding.create(
        input=query,
        model=EMBEDDING_MODEL,
    )['data'][0]['embedding']

    typesense_results = typesense_client.multi_search.perform({
        "searches": [{
            "q": "*",
            "collection": "wikipedia_articles",
            "vector_query": f"{field}_vector:([{','.join(str(v) for v in embedded_query)}], k:{top_k})"
        }]
    }, {})

    return typesense_results
query_results = query_typesense('modern art in Europe', 'title')

for i, hit in enumerate(query_results['results'][0]['hits']):
    document = hit["document"]
    vector_distance = hit["vector_distance"]
    print(f'{i + 1}. {document["title"]} (Distance: {vector_distance})')
1. Museum of Modern Art (Distance: 0.12482291460037231)
2. Western Europe (Distance: 0.13255876302719116)
3. Renaissance art (Distance: 0.13584274053573608)
4. Pop art (Distance: 0.1396539807319641)
5. Northern Europe (Distance: 0.14534103870391846)
6. Hellenistic art (Distance: 0.1472070813179016)
7. Modernist literature (Distance: 0.15296930074691772)
8. Art film (Distance: 0.1567266583442688)
9. Central Europe (Distance: 0.15741699934005737)
10. European (Distance: 0.1585891842842102)
query_results = query_typesense('Famous battles in Scottish history', 'content')

for i, hit in enumerate(query_results['results'][0]['hits']):
    document = hit["document"]
    vector_distance = hit["vector_distance"]
    print(f'{i + 1}. {document["title"]} (Distance: {vector_distance})')
1. Battle of Bannockburn (Distance: 0.1306111216545105)
2. Wars of Scottish Independence (Distance: 0.1384994387626648)
3. 1651 (Distance: 0.14744246006011963)
4. First War of Scottish Independence (Distance: 0.15033596754074097)
5. Robert I of Scotland (Distance: 0.15376019477844238)
6. 841 (Distance: 0.15609073638916016)
7. 1716 (Distance: 0.15615153312683105)
8. 1314 (Distance: 0.16280347108840942)
9. 1263 (Distance: 0.16361045837402344)
10. William Wallace (Distance: 0.16464537382125854)
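
As mentioned earlier, Typesense also lets you combine attribute-based filtering with vector queries in the same search. Below is a hedged sketch of what that looks like; it assumes you also declare a filterable string field (e.g. a plain title field) in the collection schema, which the schema above does not, so treat it as illustrative rather than runnable against this exact collection:

# Sketch: combine a filter_by expression with a vector query in one search.
# Assumes the schema also declares {"name": "title", "type": "string"} so
# that "title" is indexed and filterable - illustrative only.
embedded_query = openai.Embedding.create(
    input='modern art in Europe',
    model=EMBEDDING_MODEL,
)['data'][0]['embedding']

filtered_results = typesense_client.multi_search.perform({
    "searches": [{
        "q": "*",
        "collection": "wikipedia_articles",
        "filter_by": "title:!=[Western Europe, Central Europe]",
        "vector_query": f"title_vector:([{','.join(str(v) for v in embedded_query)}], k:10)"
    }]
}, {})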

Thanks for following along - you're now equipped to set up your own vector database and use embeddings to do all kinds of cool things. For more complex use cases, please continue to work through the other cookbook examples in this repo.