Using Pinecone for Embeddings Search | OpenAI Cookbook

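This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a vector database (Pinecone in this notebook). This is a common requirement for customers who want to store and search our embeddings alongside their own data in a secure environment to support production use cases such as chatbots and topic modelling.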

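What is a Vector Database

A vector database is a database built to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, driven by the increasing effectiveness of AI on use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective way for enterprises to deliver and scale these use cases.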
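To make "searching embedding vectors" concrete, the sketch below scores a query against a handful of stored vectors by cosine similarity and returns the closest ones. It is a brute-force illustration only: the three-dimensional vectors and the names `cosine_similarity`, `brute_force_search` and `toy_vectors` are made up for this example, and production vector databases generally rely on approximate nearest-neighbor indexes rather than scanning every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, vectors, top_k=2):
    """Score every stored vector against the query and return the top_k (id, score) pairs."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
toy_vectors = {
    "art": [0.9, 0.1, 0.0],
    "april": [0.0, 0.8, 0.6],
    "air": [0.1, 0.2, 0.9],
}

results = brute_force_search([1.0, 0.0, 0.1], toy_vectors)
for vec_id, score in results:
    print(f"{vec_id} (score = {score:.3f})")
```

The ranking logic is the same idea a vector database applies at scale: the query is embedded into the same space as the stored items, and the nearest vectors are returned with a similarity score.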

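Why use a Vector Database

Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question answering, chatbots and recommendation services, for example) and use them in a secure, scalable environment. Many of our customers use embeddings to solve their problems at small scale, but performance and security hold them back from going into production. We see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.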

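Demo Flow

The demo flow is:

Setup: Import packages and set any required variables
Load data: Load a dataset and embed it using OpenAI embeddings
Pinecone
- Setup: Here we'll set up the Python client for Pinecone. For more details, see the Pinecone documentation
- Index Data: We'll create an index with namespaces for titles and content
- Search Data: We'll test both namespaces with search queries to confirm they work

Once you've run through this notebook you should have a basic understanding of how to set up and use a vector database, and can move on to more complex use cases that make use of our embeddings.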

Output of the package installation step (abridged): pinecone-client 2.2.2 and wget 3.2 were already installed, together with their dependencies (requests 2.31.0, pyyaml 6.0, loguru 0.7.0, typing-extensions 4.5.0, dnspython 2.3.0, python-dateutil 2.8.2, urllib3 1.26.16, tqdm 4.65.0, numpy 1.25.0).

import openai

from typing import List, Iterator
import pandas as pd
import numpy as np
import os
import wget
from ast import literal_eval

# Pinecone's client library for Python
import pinecone

# I've set this to our new embeddings model, this can be changed to the embedding model of your choice
# Queries must be embedded with the same model that produced the stored vectors
EMBEDDING_MODEL = "text-embedding-3-small"

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

Load data

In this section we'll load embedded data that was prepared ahead of this session.

	id	url	title	text	title_vector	content_vector	vector_id
0	1	https://simple.wikipedia.org/wiki/April	April	April is the fourth month of the year in the Gregorian calendar...	[0.001009464613161981, -0.020700545981526375, ...	[-0.011253940872848034, -0.013491976074874401,...	0
1	2	https://simple.wikipedia.org/wiki/August	August	August (Aug.) is the eighth month of the year in the Gregorian calendar...	[0.0009286514250561595, 0.000820168002974242, ...	[0.0003609954728744924, 0.007262262050062418, ...	1
2	6	https://simple.wikipedia.org/wiki/Art	Art	Art is a creative activity that expresses imagination...	[0.003393713850528002, 0.0061537534929811954, ...	[-0.004959689453244209, 0.015772193670272827, ...	2
3	8	https://simple.wikipedia.org/wiki/A	A	A or a is the first letter of the English alphabet...	[0.0153952119871974, -0.013759135268628597, 0....	[0.024894846603274345, -0.022186409682035446, ...	3
4	9	https://simple.wikipedia.org/wiki/Air	Air	Air refers to the Earth's atmosphere. Air is a...	[0.02224554680287838, -0.02044147066771984, -0...	[0.021524671465158463, 0.018522677943110466, -...	4

# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)
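The conversion above is needed because a CSV file stores each embedding as text. The sketch below shows the same idea on a made-up two-row CSV using only the standard library; `raw_csv` and its values are hypothetical and stand in for the article dataset.

```python
import csv
import io
from ast import literal_eval

# Hypothetical two-row CSV; the vector column is stored as text, as in the article dataset.
raw_csv = 'vector_id,title_vector\n0,"[0.1, -0.2, 0.3]"\n1,"[0.4, 0.5, -0.6]"\n'

rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Before parsing, each vector is just a string.
unparsed = rows[0]["title_vector"]

# literal_eval evaluates Python literals (lists, numbers, strings) without running arbitrary code.
parsed = [literal_eval(row["title_vector"]) for row in rows]

print(type(unparsed).__name__, type(parsed[0]).__name__)
```

`literal_eval` is a reasonable choice here because it only accepts literals, unlike `eval`, which would execute whatever expression the file contained.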

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              25000 non-null  int64
 1   url             25000 non-null  object
 2   title           25000 non-null  object
 3   text            25000 non-null  object
 4   title_vector    25000 non-null  object
 5   content_vector  25000 non-null  object
 6   vector_id       25000 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.3+ MB

Pinecone

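The vector database we'll use is Pinecone, a managed vector database that offers a cloud-native option.

Before you proceed with this step you'll need to navigate to Pinecone, sign up, and then save your API key as an environment variable named PINECONE_API_KEY.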
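One way to pick the key up inside the notebook is to read it from the environment and fail early with a clear message if it is missing. This is a minimal sketch: the helper name `read_api_key` is hypothetical, and how the key is then handed to the client depends on the client version (with the 2.x client used in this notebook it is typically passed to `pinecone.init`).

```python
import os

def read_api_key(env=os.environ, name="PINECONE_API_KEY"):
    """Return the API key from the environment, or raise a clear error if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running this section.")
    return key

# Usage in the notebook would look like: api_key = read_api_key()
# The resulting key is then passed to the Pinecone client during initialization.
```

Keeping the key in an environment variable rather than in the notebook itself avoids committing a secret alongside the code.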

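For this section we will:

Create an index with multiple namespaces for article titles and content
Store our data in the index with separate searchable "namespaces" for article titles and content
Fire some similarity search queries to verify our setup is working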

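Create Index

First we need to create an index, which we'll call wikipedia-articles. Once we have an index, we can create multiple namespaces, which make a single index searchable for various use cases. For more details, consult the Pinecone documentation.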
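As a mental model for namespaces, the toy class below keeps one dictionary of vectors per namespace, so the same vector ID can hold a title embedding in one namespace and a content embedding in another. `ToyIndex` is purely illustrative and hypothetical; it is not how Pinecone is implemented and it performs no similarity search.

```python
from collections import defaultdict

class ToyIndex:
    """In-memory stand-in for an index with namespaces (illustration only)."""

    def __init__(self):
        # namespace name -> {vector id -> vector}
        self._namespaces = defaultdict(dict)

    def upsert(self, vectors, namespace):
        """Insert or overwrite (id, vector) pairs inside one namespace."""
        for vec_id, vector in vectors:
            self._namespaces[namespace][vec_id] = vector

    def fetch(self, vec_id, namespace):
        """Look up an ID in one namespace only; other namespaces are not consulted."""
        return self._namespaces[namespace].get(vec_id)

toy_index = ToyIndex()
toy_index.upsert([("0", [0.1, 0.2])], namespace="title")
toy_index.upsert([("0", [0.9, 0.8])], namespace="content")

print(toy_index.fetch("0", "title"), toy_index.fetch("0", "content"))
```

This is why the notebook can reuse `vector_id` for both the title and content vectors of an article: each upsert and each query is scoped to a single namespace.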

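If you want to batch insert into your index in parallel to increase insertion speed, the Pinecone documentation has a good guide on parallel batch inserts.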

# Models a simple batch generator that makes chunks out of an input DataFrame
class BatchGenerator:

    def __init__(self, batch_size: int = 10) -> None:
        self.batch_size = batch_size

    # Makes chunks out of an input DataFrame
    def to_batches(self, df: pd.DataFrame) -> Iterator[pd.DataFrame]:
        splits = self.splits_num(df.shape[0])
        if splits <= 1:
            yield df
        else:
            for chunk in np.array_split(df, splits):
                yield chunk

    # Determines how many chunks DataFrame contains
    def splits_num(self, elements: int) -> int:
        return round(elements / self.batch_size)

    __call__ = to_batches

df_batcher = BatchGenerator(300)
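Because the number of chunks is computed with `round`, the resulting batches can be slightly larger than `batch_size`. The sketch below reproduces the chunk sizes of an even split (the first `n_rows % n_chunks` chunks get one extra row, which is how `np.array_split` distributes the remainder) for the 25,000 rows and batch size of 300 used here, and compares it with a `math.ceil` variant. The helper `chunk_sizes` is hypothetical and only models the sizes, not the data.

```python
import math

def chunk_sizes(n_rows, n_chunks):
    """Sizes produced by an even split: the first n_rows % n_chunks chunks get one extra row."""
    base, extra = divmod(n_rows, n_chunks)
    return [base + 1] * extra + [base] * (n_chunks - extra)

n_rows, batch_size = 25000, 300

# round() as in BatchGenerator.splits_num: 25000 / 300 = 83.33 -> 83 chunks
rounded = chunk_sizes(n_rows, round(n_rows / batch_size))

# math.ceil() alternative: 84 chunks, so no chunk exceeds batch_size
ceiled = chunk_sizes(n_rows, math.ceil(n_rows / batch_size))

print(len(rounded), max(rounded))  # 83 302
print(len(ceiled), max(ceiled))    # 84 298
```

For this dataset the difference is harmless, but if a service enforces a hard per-request limit, rounding up is the safer way to pick the number of chunks.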

# Pick a name for the new index
index_name = 'wikipedia-articles'

# Check whether the index with the same name already exists - if so, delete it
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

# Creates new index
pinecone.create_index(name=index_name, dimension=len(article_df['content_vector'][0]))
index = pinecone.Index(index_name=index_name)

# Confirm our index was created
pinecone.list_indexes()

# Upsert content vectors in content namespace - this can take a few minutes
print("Uploading vectors to content namespace..")
for batch_df in df_batcher(article_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')

# Upsert title vectors in title namespace - this can also take a few minutes
print("Uploading vectors to title namespace..")
for batch_df in df_batcher(article_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.title_vector), namespace='title')

Search data

Now we'll run some dummy searches and check that we get sensible results back.

# First we'll create dictionaries mapping vector IDs to their outputs so we can retrieve the text for our search results
titles_mapped = dict(zip(article_df.vector_id, article_df.title))
content_mapped = dict(zip(article_df.vector_id, article_df.text))

def query_article(query, namespace, top_k=5):
    '''Embeds the query, searches the specified namespace, and prints the results.'''

    # Create a vector embedding for the query text
    embedded_query = openai.Embedding.create(
        input=query,
        model=EMBEDDING_MODEL,
    )["data"][0]['embedding']

    # Query the namespace passed as parameter using the query embedding
    query_result = index.query(embedded_query, namespace=namespace, top_k=top_k)

    # Print query results
    print(f'\nMost similar results to {query} in "{namespace}" namespace:\n')
    if not query_result.matches:
        print('no query result')

    # Map each match back to its title and text via the lookup dictionaries
    matches = query_result.matches
    ids = [res.id for res in matches]
    scores = [res.score for res in matches]
    df = pd.DataFrame({'id': ids,
                       'score': scores,
                       'title': [titles_mapped[_id] for _id in ids],
                       'content': [content_mapped[_id] for _id in ids],
                       })

    for _, row in df.iterrows():
        print(f'{row.title} (score = {row.score})')

    print('\n')

    return df
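The function above calls `openai.Embedding.create`, which belongs to the pre-1.0 `openai` Python package; from version 1.0 onward the library exposes embeddings through a client object instead. The sketch below shows how the embedding step might look with that client interface. The helper `embed_query` and the `FakeEmbeddings` stand-in are hypothetical: the stand-in only mimics the shape of the response so the example runs offline, and its vectors are meaningless.

```python
from types import SimpleNamespace

def embed_query(client, query, model="text-embedding-3-small"):
    """Return the embedding for one query string using an OpenAI-style client object."""
    response = client.embeddings.create(model=model, input=query)
    return response.data[0].embedding

class FakeEmbeddings:
    """Offline stand-in that mimics the shape of the embeddings response."""

    def create(self, model, input):
        # Not a real embedding: just a deterministic placeholder vector.
        vector = [float(len(input)), 0.0]
        return SimpleNamespace(data=[SimpleNamespace(embedding=vector)])

fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
embedded = embed_query(fake_client, "modern art")
print(embedded)

# With the real library this would be used roughly as:
#   from openai import OpenAI
#   client = OpenAI()  # reads OPENAI_API_KEY from the environment
#   embedded_query = embed_query(client, "modern art in Europe")
```

Passing the client in as an argument keeps the embedding step easy to test and makes it straightforward to swap the model name.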

Most similar results to modern art in Europe in "title" namespace:

Museum of Modern Art (score = 0.875177085)
Western Europe (score = 0.867441177)
Renaissance art (score = 0.864156306)
Pop art (score = 0.860346854)
Northern Europe (score = 0.854658186)

Most similar results to Famous battles in Scottish history in "content" namespace:

Battle of Bannockburn (score = 0.869336188)
Wars of Scottish Independence (score = 0.861470938)
1651 (score = 0.852588475)
First War of Scottish Independence (score = 0.84962213)
Robert I of Scotland (score = 0.846214116)