Filtered Search with Milvus and OpenAI

Mar 28, 2023

Finding your next movie

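In this notebook we generate embeddings of movie descriptions with OpenAI and use those embeddings in Milvus to find relevant movies. To narrow the search results and try something new, we combine the vector search with metadata filtering. The dataset in this example comes from HuggingFace datasets and contains a little over 8,000 movie entries.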

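Let's begin by downloading the libraries this notebook requires:

  • openai is used for communicating with the OpenAI embedding service
  • pymilvus is used for communicating with the Milvus server
  • datasets is used for downloading the dataset
  • tqdm is used for the progress bars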
! pip install openai pymilvus datasets tqdm

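With the required packages installed, we can get started by launching the Milvus service. The file being run is the docker-compose.yaml located in the same folder as this notebook. The command below launches a Milvus standalone instance, which we will use for this test.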

! docker compose up -d
E0317 14:06:38.344884000 140704629352640 fork_posix.cc:76]             Other threads are currently calling into gRPC, skipping fork() handlers
[+] Running 4/4
 ⠿ Network milvus               Created                                    0.1s
 ⠿ Container milvus-etcd        Started                                    0.9s
 ⠿ Container milvus-minio       Started                                    1.0s
 ⠿ Container milvus-standalone  Started                                    1.6s

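With Milvus running, we can set up our global variables:

  • HOST: The Milvus host address
  • PORT: The Milvus port number
  • COLLECTION_NAME: What to name the collection within Milvus
  • DIMENSION: The dimension of the embeddings
  • OPENAI_ENGINE: Which embedding model to use
  • openai.api_key: Your OpenAI account key
  • INDEX_PARAM: The index settings to use for the collection
  • QUERY_PARAM: The search parameters to use
  • BATCH_SIZE: How many movies to embed and insert at once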
import openai

HOST = 'localhost'
PORT = 19530
COLLECTION_NAME = 'movie_search'
DIMENSION = 1536
OPENAI_ENGINE = 'text-embedding-3-small'
openai.api_key = 'sk-your_key'

INDEX_PARAM = {
    'metric_type':'L2',
    'index_type':"HNSW",
    'params':{'M': 8, 'efConstruction': 64}
}

QUERY_PARAM = {
    "metric_type": "L2",
    "params": {"ef": 64},
}

BATCH_SIZE = 1000
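Before connecting, it can be useful to confirm that something is listening on the Milvus port. The following is a minimal sketch using only the Python standard library. The helper name `milvus_port_is_open` is hypothetical and not part of pymilvus, and the defaults simply mirror the HOST and PORT values above. An open port only shows that the TCP endpoint accepts connections, not that Milvus has finished starting up.

```python
import socket

def milvus_port_is_open(host="localhost", port=19530, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage (hypothetical): retry or inspect `docker compose ps` if this is False
# print("Milvus port reachable:", milvus_port_is_open())
```

If the check returns False shortly after `docker compose up -d`, waiting a few seconds and trying again is usually a reasonable first step.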
from pymilvus import connections, utility, FieldSchema, Collection, CollectionSchema, DataType

# Connect to Milvus Database
connections.connect(host=HOST, port=PORT)
# Remove collection if it already exists
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)
# Create collection which includes the id, title, type, release year, rating, description, and embedding.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=64000),
    FieldSchema(name='type', dtype=DataType.VARCHAR, max_length=64000),
    FieldSchema(name='release_year', dtype=DataType.INT64),
    FieldSchema(name='rating', dtype=DataType.VARCHAR, max_length=64000),
    FieldSchema(name='description', dtype=DataType.VARCHAR, max_length=64000),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)
# Create the index on the collection and load it.
collection.create_index(field_name="embedding", index_params=INDEX_PARAM)
collection.load()

Dataset

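With Milvus up and running, we can begin grabbing our data. Hugging Face Datasets is a hub that hosts many different user datasets, and for this example we use HuggingLearners's netflix-shows dataset. It contains a little over 8,000 movies, each paired with its metadata. We will embed each description and store it in Milvus along with its title, type, release year, and rating.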

import datasets

# Download the dataset 
dataset = datasets.load_dataset('hugginglearners/netflix-shows', split='train')
Found cached dataset csv (/Users/filiphaltmayer/.cache/huggingface/datasets/hugginglearners___csv/hugginglearners--netflix-shows-03475319fc65a05a/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)

Insert the Data

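Now that the data is on our machine, we can begin embedding it and inserting it into Milvus. The embedding function takes in a list of texts and returns their embeddings as a list of vectors.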

# Simple function that converts the texts to embeddings.
# Written for openai>=1.0, where the older openai.Embedding.create call was removed.
def embed(texts):
    embeddings = openai.embeddings.create(
        input=texts,
        model=OPENAI_ENGINE
    )
    return [x.embedding for x in embeddings.data]

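The next step does the actual inserting. We iterate through all the entries and build up batches, inserting each batch as soon as it reaches the configured batch size. After the loop finishes, we insert the last remaining batch if there is one.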

from tqdm import tqdm

data = [
    [], # title
    [], # type
    [], # release_year
    [], # rating
    [], # description
]

# Embed and insert in batches
for i in tqdm(range(0, len(dataset))):
    data[0].append(dataset[i]['title'] or '')
    data[1].append(dataset[i]['type'] or '')
    data[2].append(dataset[i]['release_year'] or -1)
    data[3].append(dataset[i]['rating'] or '')
    data[4].append(dataset[i]['description'] or '')
    if len(data[0]) % BATCH_SIZE == 0:
        data.append(embed(data[4]))
        collection.insert(data)
        data = [[],[],[],[],[]]

# Embed and insert the remainder 
if len(data[0]) != 0:
    data.append(embed(data[4]))
    collection.insert(data)
    data = [[],[],[],[],[]]
100%|██████████| 8807/8807 [00:31<00:00, 276.82it/s]
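The batching logic above can be checked offline by separating it from Milvus and OpenAI. The following is a minimal sketch under stated assumptions: `insert_in_batches`, `fake_embed`, and the sample rows are hypothetical stand-ins introduced for illustration, with `fake_embed` playing the role of `embed` and a plain list append playing the role of `collection.insert`.

```python
def insert_in_batches(rows, batch_size, embed_fn, insert_fn):
    """Collect rows column-wise, then embed and insert every `batch_size` rows.

    Returns the number of insert calls that were made.
    """
    def empty_columns():
        return [[], [], [], [], []]  # title, type, release_year, rating, description

    columns = empty_columns()
    calls = 0
    for row in rows:
        # Same fallbacks as the notebook loop: '' for text fields, -1 for the year
        columns[0].append(row.get('title') or '')
        columns[1].append(row.get('type') or '')
        columns[2].append(row.get('release_year') or -1)
        columns[3].append(row.get('rating') or '')
        columns[4].append(row.get('description') or '')
        if len(columns[0]) == batch_size:
            insert_fn(columns + [embed_fn(columns[4])])
            calls += 1
            columns = empty_columns()
    # Insert the remainder, if any
    if columns[0]:
        insert_fn(columns + [embed_fn(columns[4])])
        calls += 1
    return calls

# Hypothetical stand-ins so the sketch runs without any network access
def fake_embed(texts):
    return [[0.0, 0.0, 0.0, 0.0] for _ in texts]

inserted = []
sample_rows = [{'title': f'Movie {i}', 'description': f'Plot {i}'} for i in range(5)]
num_calls = insert_in_batches(sample_rows, 2, fake_embed, inserted.append)
print(num_calls, [len(batch[0]) for batch in inserted])
```

With five rows and a batch size of two, the sketch makes two full inserts and one final insert for the leftover row, which is the same shape of behavior as the loop above.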

Query the Database

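With our data safely inserted in Milvus, we can now perform a query. The query takes a tuple containing the movie description you are searching for and the filter expression to apply. More information on filter expressions is available in the Milvus documentation. The search first prints out your description and filter expression. After that, for each result we print the score, title, type, release year, rating, and description of the matching movie.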
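To make the filter semantics concrete, here is a minimal sketch that mimics the expression `release_year < 2019 and rating like "PG%"` in plain Python. It assumes that `like "PG%"` behaves as a prefix match, and it is only an illustration of which rows would pass, not how Milvus evaluates filters internally. The function name `passes_filter` and the rows labelled "Hypothetical" are made up for this example.

```python
def passes_filter(row):
    """Mimic: release_year < 2019 and rating like "PG%" (treated as a prefix match)."""
    return row["release_year"] < 2019 and row["rating"].startswith("PG")

sample_rows = [
    {"title": "Show Dogs", "release_year": 2018, "rating": "PG"},          # appears in this notebook's results
    {"title": "Hypothetical A", "release_year": 2015, "rating": "PG-13"},  # prefix "PG" matches
    {"title": "Hypothetical B", "release_year": 2020, "rating": "PG"},     # too recent
    {"title": "Hypothetical C", "release_year": 2010, "rating": "TV-PG"},  # "PG" is not a prefix
]

kept = [row["title"] for row in sample_rows if passes_filter(row)]
print(kept)  # ['Show Dogs', 'Hypothetical A']
```

Note that a prefix pattern such as `PG%` also admits PG-13, so a stricter query would use `rating == "PG"` instead.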

import textwrap

def query(query_pair, top_k=5):
    # query_pair is a tuple: (description text to search for, Milvus filter expression)
    text, expr = query_pair
    # embed(text) returns a list with a single vector, so res holds one list of hits
    res = collection.search(embed(text), anns_field='embedding', expr=expr, param=QUERY_PARAM, limit=top_k, output_fields=['title', 'type', 'release_year', 'rating', 'description'])
    for hits in res:
        print('Description:', text, 'Expression:', expr)
        print('Results:')
        for rank, hit in enumerate(hits, start=1):
            print('\t' + 'Rank:', rank, 'Score:', hit.score, 'Title:', hit.entity.get('title'))
            print('\t\t' + 'Type:', hit.entity.get('type'), 'Release Year:', hit.entity.get('release_year'), 'Rating:', hit.entity.get('rating'))
            print(textwrap.fill(hit.entity.get('description'), 88))
            print()

my_query = ('movie about a fluffly animal', 'release_year < 2019 and rating like \"PG%\"')

query(my_query)
Description: movie about a fluffly animal Expression: release_year < 2019 and rating like "PG%"
Results:
	Rank: 1 Score: 0.30083978176116943 Title: The Lamb
		Type: Movie Release Year: 2017 Rating: PG
A big-dreaming donkey escapes his menial existence and befriends some free-spirited
animal pals in this imaginative retelling of the Nativity Story.

	Rank: 2 Score: 0.33528298139572144 Title: Puss in Boots
		Type: Movie Release Year: 2011 Rating: PG
The fabled feline heads to the Land of Giants with friends Humpty Dumpty and Kitty
Softpaws on a quest to nab its greatest treasure: the Golden Goose.

	Rank: 3 Score: 0.33528298139572144 Title: Puss in Boots
		Type: Movie Release Year: 2011 Rating: PG
The fabled feline heads to the Land of Giants with friends Humpty Dumpty and Kitty
Softpaws on a quest to nab its greatest treasure: the Golden Goose.

	Rank: 4 Score: 0.3414868116378784 Title: Show Dogs
		Type: Movie Release Year: 2018 Rating: PG
A rough and tough police dog must go undercover with an FBI agent as a prim and proper
pet at a dog show to save a baby panda from an illegal sale.

	Rank: 5 Score: 0.3414868116378784 Title: Show Dogs
		Type: Movie Release Year: 2018 Rating: PG
A rough and tough police dog must go undercover with an FBI agent as a prim and proper
pet at a dog show to save a baby panda from an illegal sale.
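Because the collection uses the L2 metric, a smaller score means a closer match. As a hedged aside, the sketch below rests on two assumptions: that the score is the squared Euclidean distance (the Milvus documentation describes its L2 metric as computed without the square root), and that the embeddings are unit length (OpenAI states its embeddings are normalized to length 1). Under those assumptions, ||a - b||^2 = 2 - 2*cos(a, b), so a score can be converted to a cosine similarity. The function names are hypothetical.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(score):
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos, hence cos = 1 - score / 2."""
    return 1.0 - score / 2.0

# Two unit vectors 60 degrees apart: cosine similarity should be 0.5
a = [1.0, 0.0]
b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]
score = squared_l2(a, b)
cosine = cosine_from_squared_l2(score)
print(round(score, 6), round(cosine, 6))
```

Applied to the top result above, a score of about 0.30 would correspond to a cosine similarity of roughly 0.85, if both assumptions hold.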