This notebook demonstrates how to build a semantic search application using OpenAI and MongoDB Atlas Vector Search.


Step 1: Set up the environment

There are two prerequisites:

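MongoDB Atlas cluster: To create a forever-free MongoDB Atlas cluster, first create a MongoDB Atlas account if you do not already have one. Visit the MongoDB Atlas website and click "Register", then go to the MongoDB Atlas dashboard and set up your cluster. To use the $vectorSearch operator in an aggregation pipeline, the cluster must run MongoDB version 6.0.11 or later. This tutorial can be built on a free cluster. While setting up the deployment, you will be prompted to create a database user and to define network connection rules. Store the username and password somewhere safe, and set the IP access rules correctly so that you can connect to the cluster. If you need more help getting started, see the MongoDB Atlas tutorial.
OpenAI API key: To create an OpenAI key, you first need an OpenAI account. Once you have one, visit the OpenAI platform, click the profile icon in the top-right corner of the screen to open the dropdown menu, and select "View API keys".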

Note: After completing the steps above, you will be prompted to enter your credentials.

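This tutorial uses the MongoDB sample dataset, which you can load from the Atlas UI. We will work with the "sample_mflix" database, whose "movies" collection contains one document per movie with fields such as title, plot, genres, cast, and directors.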
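The credential prompt and connection described above might be sketched as follows. This is a minimal sketch, not the notebook's original cell: the environment variable names `MONGODB_ATLAS_CLUSTER_URI` and `OPENAI_API_KEY` and the helper names `read_secret` and `get_movies_collection` are assumptions chosen for illustration.

```python
import os
from getpass import getpass


def read_secret(name, env=os.environ, prompt=getpass):
    """Return a secret from the environment, or prompt for it if it is unset."""
    value = env.get(name)
    if not value:
        value = prompt(f"Enter {name}: ")
    return value


def get_movies_collection(uri):
    """Connect to Atlas and return the sample_mflix.movies collection."""
    # Imported lazily so read_secret can be used without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    return client["sample_mflix"]["movies"]
```

With these helpers, `collection = get_movies_collection(read_secret("MONGODB_ATLAS_CLUSTER_URI"))` would yield the `collection` object used in the later steps.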

Step 2: Set up the embedding generation function
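A function like the `generate_embedding` used in the later steps might look like the sketch below. It assumes the `openai` 1.x client interface (`client.embeddings.create`) and the `text-embedding-3-small` model, which returns 1536-dimensional vectors by default. The factory name `make_generate_embedding` and the constant `EMBEDDING_MODEL` are hypothetical names introduced for illustration.

```python
EMBEDDING_MODEL = "text-embedding-3-small"  # assumption: 1536-dimensional output


def make_generate_embedding(client, model=EMBEDDING_MODEL):
    """Build a generate_embedding(text) function bound to an OpenAI client."""

    def generate_embedding(text):
        # The embeddings endpoint accepts a list of inputs; we send one text
        # and read back the single resulting vector.
        response = client.embeddings.create(input=[text], model=model)
        return response.data[0].embedding

    return generate_embedding
```

Assuming the `openai` package is installed, `generate_embedding = make_generate_embedding(OpenAI(api_key=...))` (with `from openai import OpenAI`) would produce the one-argument function that the following steps call.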

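Step 3: Create and store embeddings

Each document in the sample_mflix.movies sample dataset corresponds to one movie. We will create a vector embedding for the data in the "plot" field of each document and store it in the database. Creating vector embeddings with the OpenAI embeddings endpoint is what makes intent-based similarity search possible.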

from pymongo import ReplaceOne

# Update the collection with the embeddings
requests = []

for doc in collection.find({'plot': {"$exists": True}}).limit(500):
    # Embed the plot text and queue a replacement of the whole document
    doc[EMBEDDING_FIELD_NAME] = generate_embedding(doc['plot'])
    requests.append(ReplaceOne({'_id': doc['_id']}, doc))

# Send all queued replacements in a single bulk operation
collection.bulk_write(requests)

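After running the code above, the documents in the "movies" collection will contain an additional "embedding" field (named by the EMBEDDING_FIELD_NAME variable) alongside the existing fields such as title, plot, genres, cast, and directors.

Note: To save time, we limit this to 500 documents. Doing the same for the entire dataset of more than 23,000 documents in the sample_mflix database will take a while. Alternatively, you can use the sample_mflix.embedded_movies collection, which contains a pre-populated plot_embedding field with embeddings created using OpenAI's text-embedding-3-small embedding model, and use it with the Atlas Search vector search functionality.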
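If you do want to embed the whole collection, one option is to write in batches so that no single bulk request grows too large. The following is a minimal sketch under stated assumptions, not the notebook's method: it swaps `ReplaceOne` for `UpdateOne` with `$set` so that only the embedding field is written, and the helper names `batched` and `embed_in_batches` and the batch size of 100 are hypothetical.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def embed_in_batches(collection, generate_embedding, field_name, batch_size=100):
    """Add an embedding to every document with a plot, one bulk write per batch."""
    from pymongo import UpdateOne  # lazy import; requires pymongo

    # Fetch only the plot field, since that is all the embedding needs.
    cursor = collection.find({"plot": {"$exists": True}}, {"plot": 1})
    total = 0
    for docs in batched(cursor, batch_size):
        requests = [
            UpdateOne(
                {"_id": doc["_id"]},
                {"$set": {field_name: generate_embedding(doc["plot"])}},
            )
            for doc in docs
        ]
        collection.bulk_write(requests)
        total += len(requests)
    return total
```

A call such as `embed_in_batches(collection, generate_embedding, EMBEDDING_FIELD_NAME)` would return the number of documents updated.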

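Step 4: Create a vector search index

We will create an Atlas Vector Search index on this collection. The index lets us run approximate KNN searches, which power the semantic search. We cover two ways to create it: the Atlas UI and the MongoDB Python driver.

(Optional) Documentation: Create a Vector Search Index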

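Now head over to the Atlas UI and create an Atlas Vector Search index using the steps described in the Atlas documentation. The value 1536 in the "dimensions" field corresponds to OpenAI's text-embedding-ada-002 model.

Use the definition given below in the JSON editor in the Atlas UI.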

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "dimensions": 1536,
        "similarity": "dotProduct",
        "type": "knnVector"
      }
    }
  }
}
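The "dotProduct" similarity in the definition above is meant for vectors normalized to unit length, which OpenAI embeddings are. For such vectors the dot product equals the cosine similarity, as this small illustration with made-up 3-dimensional vectors suggests (real embeddings here have 1536 dimensions):

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, chosen only for illustration.
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 0.0, 1.0])

# For unit-length vectors the two measures coincide (up to rounding).
print(math.isclose(dot(a, b), cosine(a, b)))
```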

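(Optional) Alternatively, we can create the vector search index programmatically with the pymongo driver. The Python command in the cell below creates the index. This only works with a recent version of the Python driver for MongoDB and with Atlas clusters running MongoDB server version 7.0 or later.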

collection.create_search_index(
    {
        "definition": {
            "mappings": {
                "dynamic": True,
                "fields": {
                    EMBEDDING_FIELD_NAME: {
                        "dimensions": 1536,
                        "similarity": "dotProduct",
                        "type": "knnVector",
                    }
                },
            }
        },
        "name": ATLAS_VECTOR_SEARCH_INDEX_NAME,
    }
)
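Index creation is asynchronous, so a query issued immediately afterwards may find nothing. One way to wait is to poll `list_search_indexes` until the index reports itself as queryable. The sketch below assumes the returned index documents carry a boolean `queryable` field; the helper name `wait_until_queryable` and the timeout values are hypothetical.

```python
import time


def wait_until_queryable(collection, index_name, timeout_s=300, poll_s=5, sleep=time.sleep):
    """Poll list_search_indexes until the named index reports queryable=True.

    Returns True once the index is queryable, or False if the timeout elapses.
    """
    waited = 0
    while waited <= timeout_s:
        for index in collection.list_search_indexes(index_name):
            if index.get("queryable"):
                return True
        sleep(poll_s)
        waited += poll_s
    return False
```

For example, `wait_until_queryable(collection, ATLAS_VECTOR_SEARCH_INDEX_NAME)` could be called before moving on to the query step.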

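Step 5: Query your data

The query below finds movies whose plots are semantically similar to the text in the query string, rather than matching on keywords.

(Optional) Documentation: Run Vector Search Queries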

def query_results(query, k):
    # Embed the query text and return the k most similar documents
    results = collection.aggregate([
        {
            '$vectorSearch': {
                "index": ATLAS_VECTOR_SEARCH_INDEX_NAME,
                "path": EMBEDDING_FIELD_NAME,
                "queryVector": generate_embedding(query),
                "numCandidates": 50,
                "limit": k,  # previously hard-coded to 5, which ignored k
            }
        }
    ])
    return results
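To see how closely each result matches, the pipeline can be extended with a `$project` stage that exposes the `vectorSearchScore` metadata. The following is a sketch under stated assumptions rather than part of the original notebook: the helper names `build_vector_search_pipeline` and `print_movies` are hypothetical, and the projected fields assume the sample_mflix movie schema.

```python
def build_vector_search_pipeline(query_vector, k, index_name, path, num_candidates=50):
    """Return a $vectorSearch pipeline that also projects the similarity score."""
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": query_vector,
                # Keep numCandidates at least as large as limit.
                "numCandidates": max(num_candidates, k),
                "limit": k,
            }
        },
        {
            "$project": {
                "_id": 0,
                "title": 1,
                "plot": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


def print_movies(movies):
    """Print the title, score, and plot of each result document."""
    for movie in movies:
        print(f"{movie['title']} (score {movie['score']:.3f})\n  {movie['plot']}\n")
```

A typical call would be `print_movies(collection.aggregate(build_vector_search_pipeline(generate_embedding(query), 5, ATLAS_VECTOR_SEARCH_INDEX_NAME, EMBEDDING_FIELD_NAME)))`, where `query` is any natural-language description of a plot.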

Semantic search using MongoDB Atlas Vector Search and OpenAI
