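Using MyScale for Embeddings Search | OpenAI Cookbook

This notebook walks you through a simple flow: download some data, embed it, and then index and search it using a vector database (MyScale in this notebook). This is a common requirement for customers who want to store and search our embeddings alongside their own data in a secure environment, to support production use cases such as chatbots, topic modelling, and more.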

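What is a vector database

A vector database is a database built to store, manage, and search embedding vectors. In recent years, the use of embeddings to encode unstructured data (text, audio, video, and more) as vectors for consumption by machine-learning models has exploded, driven by the growing effectiveness of AI at solving use cases involving natural language, image recognition, and other forms of unstructured data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.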

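Why use a vector database

Vector databases let enterprises take many of the embeddings use cases we've shared in this repo (for example question answering, chatbots, and recommendation services) and use them in a secure, scalable environment. Many of our customers use embeddings to solve their problems at small scale, but performance and security hold them back from going into production. We see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database, and using it for semantic search.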

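Demo flow

The demo flow is:

Setup: import packages and set any required variables
Load data: load a dataset and embed it using OpenAI embeddings
MyScale
- Setup: set up the MyScale Python client. For more details, go here
- Index data: we'll create a table and index content.
- Search data: run a few example queries with various goals in mind.

Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases that make use of our embeddings.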

import openai
from typing import List, Iterator
import pandas as pd
import numpy as np
import os
import wget
from ast import literal_eval
# MyScale's client library for Python
import clickhouse_connect
# I've set this to our new embeddings model, this can be changed to the embedding model of your choice
EMBEDDING_MODEL = "text-embedding-3-small"
# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings
warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

Load data

In this section we'll load embedded data that we prepared ahead of this session.

	id	url	title	text	title_vector	content_vector	vector_id
0	1	https://simple.wikipedia.org/wiki/April	April	April is the fourth month of the year in the Gregorian calendar...	[0.001009464613161981, -0.020700545981526375, ...	[-0.011253940872848034, -0.013491976074874401,...	0
1	2	https://simple.wikipedia.org/wiki/August	August	August (Aug.) is the eighth month of the year in the Gregorian calendar...	[0.0009286514250561595, 0.000820168002974242, ...	[0.0003609954728744924, 0.007262262050062418, ...	1
2	6	https://simple.wikipedia.org/wiki/Art	Art	Art is a creative activity that expresses imagination...	[0.003393713850528002, 0.0061537534929811954, ...	[-0.004959689453244209, 0.015772193670272827, ...	2
3	8	https://simple.wikipedia.org/wiki/A	A	A or a is the first letter of the English alphabet...	[0.0153952119871974, -0.013759135268628597, 0....	[0.024894846603274345, -0.022186409682035446, ...	3
4	9	https://simple.wikipedia.org/wiki/Air	Air	Air refers to the Earth's atmosphere. Air is a...	[0.02224554680287838, -0.02044147066771984, -0...	[0.021524671465158463, 0.018522677943110466, -...	4

# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)
# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              25000 non-null  int64
 1   url             25000 non-null  object
 2   title           25000 non-null  object
 3   text            25000 non-null  object
 4   title_vector    25000 non-null  object
 5   content_vector  25000 non-null  object
 6   vector_id       25000 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.3+ MB
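The `literal_eval` step above is needed because a CSV file stores each embedding as text, so the vector columns come back as strings rather than lists of floats. Here is a minimal sketch of that conversion on a single made-up cell value (the sample string is illustrative and not taken from the dataset):

```python
from ast import literal_eval

# Hypothetical cell value, as it would look after being read from a CSV file
raw_cell = "[0.25, -0.5, 0.125]"

# literal_eval parses Python literals (lists, numbers, strings) without
# executing arbitrary code, which makes it a safer choice than eval()
vector = literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(vector).__name__)
print(len(vector), vector[0])
```

Applying the same function column-wise with `.apply(literal_eval)`, as the notebook does, turns every string cell into a list the database client can insert.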

MyScale

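The vector database we'll consider here is MyScale.

MyScale is a database built on ClickHouse that combines vector search and SQL analytics to offer a high-performance, streamlined, and fully managed experience. It is designed to facilitate joint queries and analyses on both structured and vector data, with comprehensive SQL support for all data processing.

Deploy and execute vector search with SQL on your cluster within two minutes by using the MyScale Console.

Connect to MyScale

Follow the Connection Details section to retrieve the cluster host, username, and password from the MyScale console, and use them to create a connection to your cluster.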
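A minimal sketch of that connection step, assuming the credentials are supplied through environment variables. The variable names `MYSCALE_HOST`, `MYSCALE_PORT`, `MYSCALE_USERNAME`, and `MYSCALE_PASSWORD`, the helper names, and the default port 8443 are assumptions made for this illustration; use the values shown in your own MyScale console.

```python
import os


def myscale_connection_kwargs(env=None):
    """Collect connection settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    return {
        "host": env["MYSCALE_HOST"],
        # Assumed default port; check the value shown in your console
        "port": int(env.get("MYSCALE_PORT", "8443")),
        "username": env["MYSCALE_USERNAME"],
        "password": env["MYSCALE_PASSWORD"],
    }


def connect_to_myscale(env=None):
    """Open a client with clickhouse_connect using the collected settings."""
    # Imported lazily so the helper above works without the package installed
    import clickhouse_connect

    return clickhouse_connect.get_client(**myscale_connection_kwargs(env))


# Placeholder values only; nothing is contacted until connect_to_myscale() runs
example_env = {
    "MYSCALE_HOST": "YOUR_CLUSTER_HOST",
    "MYSCALE_USERNAME": "YOUR_USERNAME",
    "MYSCALE_PASSWORD": "YOUR_PASSWORD",
}
print(myscale_connection_kwargs(example_env)["port"])
```

With real credentials exported in the environment, `client = connect_to_myscale()` would give the `client` object that the later cells rely on.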

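Index data

We'll create an SQL table called articles in MyScale to store the embeddings data. The table will include a vector index with a cosine distance metric and a constraint on the length of the embeddings. Use the following code to create the articles table and insert data into it: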

# create articles table with vector index
embedding_len=len(article_df['content_vector'][0]) # 1536
client.command(f"""
CREATE TABLE IF NOT EXISTS default.articles
(
    id UInt64,
    url String,
    title String,
    text String,
    content_vector Array(Float32),
    CONSTRAINT cons_vector_len CHECK length(content_vector) = {embedding_len},
    VECTOR INDEX article_content_index content_vector TYPE HNSWFLAT('metric_type=Cosine')
)
ENGINE = MergeTree ORDER BY id
""")
# insert data into the table in batches
from tqdm.auto import tqdm
batch_size = 100
total_records = len(article_df)
# we only need subset of columns
article_df = article_df[['id', 'url', 'title', 'text', 'content_vector']]
# upload data in batches
data = article_df.to_records(index=False).tolist()
column_names = article_df.columns.tolist()
for i in tqdm(range(0, total_records, batch_size)):
    i_end = min(i + batch_size, total_records)
    client.insert("default.articles", data[i:i_end], column_names=column_names)
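To make the two table-level guarantees above concrete, here is a small pure-Python sketch of what a cosine distance and a vector-length check compute. It assumes the common convention that cosine distance equals one minus cosine similarity; MyScale's internal implementation is not shown in this notebook, and the function names are illustrative.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Note: a zero vector would cause a division by zero here
    return 1.0 - dot / (norm_a * norm_b)


def has_expected_length(vector, expected_len):
    """Mirror the idea of the CHECK constraint: accept only vectors of one size."""
    return len(vector) == expected_len


# Vectors pointing the same way are at distance 0; orthogonal ones at distance 1
same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

Because the distance ignores vector magnitude, smaller values mean more similar content, which is why the search query later sorts by ascending distance.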

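We need to check the build status of the vector index before proceeding with the search, since it is built automatically in the background.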

# check count of inserted data
print(f"articles count: {client.command('SELECT count(*) FROM default.articles')}")
# check the status of the vector index, make sure vector index is ready with 'Built' status
get_index_status="SELECT status FROM system.vector_indices WHERE name='article_content_index'"
print(f"index build status: {client.command(get_index_status)}")
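Because the index is built in the background, a script may need to wait before searching. Here is a minimal polling sketch, assuming the status check is wrapped in a zero-argument callable (for example `lambda: client.command(get_index_status)`) and that `'Built'` is the ready state, as the comment in the code above indicates. The helper name, timeout, interval, and the placeholder status strings are illustrative choices.

```python
import time


def wait_for_index(get_status, ready_status="Built", timeout_s=600.0,
                   interval_s=5.0, sleep=time.sleep):
    """Poll get_status() until it returns ready_status or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == ready_status:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"index not ready, last status: {status!r}")
        sleep(interval_s)


# Stand-in status source for demonstration; "NotReady" is a placeholder,
# not a documented MyScale status value
fake_statuses = iter(["NotReady", "NotReady", "Built"])
final_status = wait_for_index(lambda: next(fake_statuses), interval_s=0.0)
print(final_status)
```

Passing the status check in as a callable keeps the helper independent of any particular client object, so it can be exercised without a live cluster.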

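Search data

Once indexed in MyScale, we can perform vector search to find similar content. First, we'll use the OpenAI API to generate embeddings for our query. Then, we'll perform the vector search using MyScale.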

query = "Famous battles in Scottish history"
# creates embedding vector from user query
# note: openai.Embedding.create is the interface of openai-python releases before 1.0;
# later releases expose openai.embeddings.create(...) instead
embed = openai.Embedding.create(
    input=query,
    model="text-embedding-3-small",
)["data"][0]["embedding"]
# query the database to find the top K similar content to the given query
top_k = 10
results = client.query(f"""
SELECT id, url, title, distance(content_vector, {embed}) as dist
FROM default.articles
ORDER BY dist
LIMIT {top_k}
""")
# display results
for i, r in enumerate(results.named_results()):
    print(i+1, r['title'])
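The query step can also be wrapped in a reusable helper. This is a sketch rather than part of the original notebook: `build_search_sql`, `search_articles`, and the stand-in classes are hypothetical names, and the helper only assumes that the client exposes `query(...)` returning an object with `named_results()`, as used in the code above.

```python
def build_search_sql(embedding, top_k, table="default.articles"):
    """Build the vector search SQL used above for a given query embedding."""
    vector_literal = "[" + ", ".join(repr(float(x)) for x in embedding) + "]"
    return (
        f"SELECT id, url, title, distance(content_vector, {vector_literal}) AS dist "
        f"FROM {table} ORDER BY dist LIMIT {int(top_k)}"
    )


def search_articles(client, embedding, top_k=10):
    """Run the search and return the matching titles in ranked order."""
    results = client.query(build_search_sql(embedding, top_k))
    return [row["title"] for row in results.named_results()]


class _FakeResult:
    """Stand-in for a query result, for demonstration only."""

    def __init__(self, rows):
        self._rows = rows

    def named_results(self):
        return iter(self._rows)


class _FakeClient:
    """Stand-in client that records the SQL it receives."""

    def __init__(self):
        self.last_sql = None

    def query(self, sql):
        self.last_sql = sql
        return _FakeResult([{"title": "Example A"}, {"title": "Example B"}])


fake_client = _FakeClient()
titles = search_articles(fake_client, [0.5, -0.25], top_k=2)
print(titles)
```

With a real connection, `search_articles(client, embed, top_k)` would take the place of the inline query and the result loop.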