Getting started with Milvus and OpenAI

March 28, 2023

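Finding your next book

In this notebook we will go over generating embeddings of book descriptions with OpenAI and using those embeddings within Milvus to find relevant books. The dataset in this example is sourced from HuggingFace datasets and contains a little over 1 million title-description pairs.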

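First, let's install the libraries this notebook needs:

  • openai is used for communicating with the OpenAI embedding service
  • pymilvus is used for communicating with the Milvus server
  • datasets is used for downloading the dataset
  • tqdm is used for the progress bars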
! pip install openai pymilvus datasets tqdm
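Because the install command above does not pin versions, it can help to record which versions actually ended up in the environment before going further. The following is a minimal sketch using the standard library's importlib.metadata; the helper name installed_versions is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the four packages this notebook installs.
for name, version in installed_versions(
    ["openai", "pymilvus", "datasets", "tqdm"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

Knowing the installed openai version is useful in particular, since the 0.x and 1.x releases of that library expose different call styles.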

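With the required packages installed we can get started. Let's begin by launching the Milvus service. The file being run is the docker-compose.yaml found in the same folder as this notebook. This command launches a Milvus standalone instance, which we will use for this test.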

! docker compose up -d
[+] Running 4/4
 ⠿ Network milvus               Created                                    0.1s
 ⠿ Container milvus-minio       Started                                    1.8s
 ⠿ Container milvus-etcd        Started                                    1.7s
 ⠿ Container milvus-standalone  Started                                    2.6s
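docker compose returns as soon as the containers have started, which can be slightly before Milvus accepts connections. A minimal sketch of waiting for the port to open is shown below. The helper name wait_for_port and its timeout values are assumptions for illustration, and an open port only shows that something is listening, not that Milvus has finished initializing.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Opening and immediately closing a connection is enough to
            # show that a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

In this notebook it could be called as wait_for_port('localhost', 19530) before connecting with pymilvus.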

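With Milvus running we can set up our global variables:

  • HOST: The Milvus host address
  • PORT: The Milvus port number
  • COLLECTION_NAME: What to name the collection within Milvus
  • DIMENSION: The dimension of the embeddings
  • OPENAI_ENGINE: Which embedding model to use
  • openai.api_key: Your OpenAI account key
  • INDEX_PARAM: The index settings to use for the collection
  • QUERY_PARAM: The search parameters to use
  • BATCH_SIZE: How many texts to embed and insert at once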
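import openai

HOST = 'localhost'
PORT = 19530
COLLECTION_NAME = 'book_search'
DIMENSION = 1536  # text-embedding-3-small returns 1536-dimensional vectors by default
OPENAI_ENGINE = 'text-embedding-3-small'
openai.api_key = 'sk-your_key'  # Placeholder; replace with your own key

# HNSW index built with the L2 (Euclidean) metric
INDEX_PARAM = {
    'metric_type': 'L2',
    'index_type': 'HNSW',
    'params': {'M': 8, 'efConstruction': 64}
}

# Search with the same metric the index was built with
QUERY_PARAM = {
    'metric_type': 'L2',
    'params': {'ef': 64},
}

BATCH_SIZE = 1000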
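Both INDEX_PARAM and QUERY_PARAM select the L2 metric, under which a smaller distance means a closer match. The HNSW index answers this nearest-neighbor question approximately. As a point of reference, an exact brute-force version can be sketched in pure Python. The function names l2_distance and top_k_l2 and the toy vectors are made up for illustration and are not part of the notebook.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_l2(query, vectors, k=5):
    """Return (index, distance) pairs for the k vectors closest to query."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 3-dimensional vectors (made up for illustration; the embeddings
# in this notebook have DIMENSION = 1536 components).
vectors = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 3.0, 4.0]]
print(top_k_l2([0.9, 0.0, 0.0], vectors, k=2))
```

Brute force compares the query against every stored vector, which is why an index such as HNSW is used once the collection grows to around a million entries.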

Milvus

This section deals with Milvus and setting up the database for this use case. Within Milvus we need to create a collection and build an index on it.

from pymilvus import connections, utility, FieldSchema, Collection, CollectionSchema, DataType

# Connect to Milvus Database
connections.connect(host=HOST, port=PORT)
# Remove collection if it already exists
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)
# Create collection which includes the id, title, description, and embedding.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=64000),
    FieldSchema(name='description', dtype=DataType.VARCHAR, max_length=64000),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)
# Create the index on the collection and load it.
collection.create_index(field_name="embedding", index_params=INDEX_PARAM)
collection.load()
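The title and description fields above are declared with max_length=64000, so a longer string would likely be rejected on insert. One way to guard against that is to trim text before inserting it. The sketch below is an assumption-laden illustration: the helper name truncate_utf8 is hypothetical, and it measures the limit in UTF-8 bytes, the stricter reading, since trimming to N bytes also guarantees at most N characters.

```python
def truncate_utf8(text, max_bytes=64000):
    """Trim text so its UTF-8 encoding is at most max_bytes long."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a trailing partial multi-byte character, if any.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

It could be applied to each title and description just before they are appended to a batch.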

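Dataset

With Milvus up and running we can begin grabbing our data. Hugging Face Datasets is a hub that hosts many user-contributed datasets, and for this example we are using Skelebor's book dataset. This dataset contains title-description pairs for over 1 million books. We are going to embed each description and store it within Milvus along with its title.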

import datasets

# Download the dataset and only use the `train` portion (file is around 800 MB)
dataset = datasets.load_dataset('Skelebor/book_titles_and_descriptions_en_clean', split='train')

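Insert the data

Now that we have our data on our machine we can begin embedding it and inserting it into Milvus. The embedding function takes in text and returns the embeddings as a list.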

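# Simple function that converts the texts to embeddings.
# Uses the openai>=1.0 interface; on the 0.x SDK the equivalent call is
# openai.Embedding.create(input=texts, model=OPENAI_ENGINE).
def embed(texts):
    response = openai.embeddings.create(
        input=texts,
        model=OPENAI_ENGINE
    )
    return [x.embedding for x in response.data]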
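Embedding around a million descriptions involves many API calls, any of which may fail transiently, for example because of rate limits or network errors. A minimal sketch of a retry wrapper with exponential backoff is shown below. The name with_retries, the attempt count and the delays are assumptions for illustration, and catching Exception broadly is a simplification; in practice the exception types would be narrowed to the ones worth retrying.

```python
import time


def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Wrap fn so failed calls are retried with exponential backoff."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Re-raise once the final attempt has failed.
                if attempt == max_attempts - 1:
                    raise
                # Wait base_delay, then 2x, 4x, ... before the next attempt.
                sleep(base_delay * 2 ** attempt)
    return wrapper
```

With the embed function above, this could be used as embed_with_retries = with_retries(embed), and embed_with_retries then called wherever embed would be.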

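The next step does the actual inserting. Because there are so many data points, you can stop the insertion cell early and move on if you want to test things right away. Doing so will probably reduce the accuracy of the results because fewer data points are indexed, but they should still be good enough.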

from tqdm import tqdm

data = [
    [], # title
    [], # description
]

# Embed and insert in batches
for i in tqdm(range(0, len(dataset))):
    data[0].append(dataset[i]['title'])
    data[1].append(dataset[i]['description'])
    if len(data[0]) % BATCH_SIZE == 0:
        data.append(embed(data[1]))
        collection.insert(data)
        data = [[],[]]

# Embed and insert the remainder 
if len(data[0]) != 0:
    data.append(embed(data[1]))
    collection.insert(data)
    data = [[],[]]
  0%|          | 1999/1032335 [00:06<57:22, 299.31it/s]  
KeyboardInterrupt
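The loop above accumulates rows, flushes them every BATCH_SIZE items, and then handles the remainder separately. The same pattern can be sketched as a generator that yields fixed-size chunks, so the final partial batch needs no special case. The helper name batched and the toy rows are hypothetical; Python 3.12 and later ship a similar itertools.batched.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of up to `size` items from iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Toy rows standing in for dataset records (made up for illustration).
rows = [{"title": f"Book {i}", "description": f"About {i}"} for i in range(5)]
for chunk in batched(rows, 2):
    titles = [row["title"] for row in chunk]
    descriptions = [row["description"] for row in chunk]
    # In the notebook, this is where the descriptions would be embedded
    # and the three columns inserted into the collection.
    print(len(titles), len(descriptions))
```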

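Query the database

With our data safely inserted in Milvus, we can now perform a query. The query function takes a string or a list of strings and searches for them. For each query it prints the description you provided, followed by the results, each with its score, title, and book description.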

import textwrap

def query(queries, top_k=5):
    # Accept a single string as well as a list of strings
    if not isinstance(queries, list):
        queries = [queries]
    res = collection.search(
        embed(queries),
        anns_field='embedding',
        param=QUERY_PARAM,
        limit=top_k,
        output_fields=['title', 'description']
    )
    # res holds one list of hits per query, in the same order as queries
    for i, hits in enumerate(res):
        print('Description:', queries[i])
        print('Results:')
        for rank, hit in enumerate(hits, start=1):
            print('\t' + 'Rank:', rank, 'Score:', hit.score, 'Title:', hit.entity.get('title'))
            print(textwrap.fill(hit.entity.get('description'), 88))
            print()

query('Book about a k-9 from europe')
RPC error: [search], <MilvusException: (code=1, message=code: UnexpectedError, reason: code: CollectionNotExists, reason: can't find collection: book_search)>, <Time:{'RPC start': '2023-03-17 14:22:18.368461', 'RPC error': '2023-03-17 14:22:18.382086'}>
MilvusException<MilvusException: (code=1, message=code: UnexpectedError, reason: code: CollectionNotExists, reason: can't find collection: book_search)>
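The error output above reports that the book_search collection could not be found when the search ran. A guard that checks for the collection before searching fails earlier and with a clearer message. The sketch below takes the existence check as a callable so it can be exercised without a running server; the name require_collection is hypothetical.

```python
def require_collection(name, has_collection):
    """Raise a descriptive error unless has_collection(name) is true."""
    if not has_collection(name):
        raise RuntimeError(
            f"Collection {name!r} does not exist; "
            "create it and insert data before querying."
        )
    return name
```

With pymilvus, the utility.has_collection function used earlier in this notebook could be passed in, for example require_collection(COLLECTION_NAME, utility.has_collection) just before calling query.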