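This notebook provides step-by-step instructions for using Azure AI Search (formerly Azure Cognitive Search) as a vector database for OpenAI embeddings. Azure AI Search is a cloud search service that gives developers the infrastructure, APIs, and tools to build rich search experiences over private, heterogeneous content in web, mobile, and enterprise applications.
Prerequisites
To complete this exercise, you need an Azure AI Search service and an Azure OpenAI endpoint with an embedding deployment, plus credentials for both (an API key or Azure Active Directory access).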
! pip install wget
! pip install azure-search-documents
! pip install azure-identity
! pip install openai
import json
import wget
import pandas as pd
import zipfile
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient, SearchIndexingBufferedSender
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.models import (
    QueryAnswerType,
    QueryCaptionType,
    QueryType,
    VectorizedQuery,
)
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    HnswParameters,
    SearchField,
    SearchableField,
    SearchFieldDataType,
    SearchIndex,
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch,
    SimpleField,
    VectorSearch,
    VectorSearchAlgorithmKind,
    VectorSearchAlgorithmMetric,
    VectorSearchProfile,
)
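This section walks you through setting up authentication for Azure OpenAI so that you can interact with the service securely using either Azure Active Directory (AAD) or an API key. Before proceeding, make sure you have your Azure OpenAI endpoint and credentials ready. For detailed instructions on setting up AAD with Azure OpenAI, see the official documentation.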
endpoint: str = "YOUR_AZURE_OPENAI_ENDPOINT"
api_key: str = "YOUR_AZURE_OPENAI_KEY"
api_version: str = "2023-05-15"
deployment = "YOUR_AZURE_OPENAI_DEPLOYMENT_NAME"
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(
    credential, "https://cognitiveservices.azure.com/.default"
)
# Set this flag to True if you are using Azure Active Directory
use_aad_for_aoai = True
if use_aad_for_aoai:
    # Use Azure Active Directory (AAD) authentication
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        api_version=api_version,
        azure_ad_token_provider=token_provider,
    )
else:
    # Use API key authentication
    client = AzureOpenAI(
        api_key=api_key,
        api_version=api_version,
        azure_endpoint=endpoint,
    )
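Hard-coded keys in a notebook are easy to leak. As a minimal sketch, the settings above could be read from environment variables instead. The helper `load_aoai_settings` and the variable names (`AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_DEPLOYMENT`, `AZURE_OPENAI_KEY`, `AZURE_OPENAI_API_VERSION`) are hypothetical and chosen only for illustration; they are not part of the original notebook.

```python
import os
from typing import Mapping, Optional


def load_aoai_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect Azure OpenAI settings from environment variables.

    The variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    endpoint = env.get("AZURE_OPENAI_ENDPOINT")
    deployment = env.get("AZURE_OPENAI_DEPLOYMENT")
    if not endpoint or not deployment:
        raise ValueError(
            "AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_DEPLOYMENT must be set"
        )
    api_key = env.get("AZURE_OPENAI_KEY")
    return {
        "endpoint": endpoint,
        "deployment": deployment,
        "api_key": api_key,
        "api_version": env.get("AZURE_OPENAI_API_VERSION", "2023-05-15"),
        # Fall back to AAD authentication when no key is supplied
        "use_aad": not api_key,
    }
```

The returned `use_aad` value could then drive the same branch as `use_aad_for_aoai` above.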
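This section explains how to set up the Azure AI Search client for use with the vector store functionality. You can find your Azure AI Search service details in the Azure portal, or retrieve them programmatically through the Search Management SDK.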
# Configuration
search_service_endpoint: str = "YOUR_AZURE_SEARCH_ENDPOINT"
search_service_api_key: str = "YOUR_AZURE_SEARCH_ADMIN_KEY"
index_name: str = "azure-ai-search-openai-cookbook-demo"
# Set this flag to True if you are using Azure Active Directory
use_aad_for_search = True
if use_aad_for_search:
    # Use Azure Active Directory (AAD) authentication
    credential = DefaultAzureCredential()
else:
    # Use API key authentication
    credential = AzureKeyCredential(search_service_api_key)
# Initialize the SearchClient with the selected authentication method
search_client = SearchClient(
    endpoint=search_service_endpoint, index_name=index_name, credential=credential
)
embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"
# The file is ~700 MB so this will take some time
wget.download(embeddings_url)
'vector_database_wikipedia_articles_embedded.zip'
with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("../../data")
article_df = pd.read_csv("../../data/vector_database_wikipedia_articles_embedded.csv")
# Read vectors from strings back into a list using json.loads
article_df["title_vector"] = article_df.title_vector.apply(json.loads)
article_df["content_vector"] = article_df.content_vector.apply(json.loads)
article_df["vector_id"] = article_df["vector_id"].apply(str)
article_df.head()
| | id | url | title | text | title_vector | content_vector | vector_id |
---|---|---|---|---|---|---|---|
0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... | [0.001009464613161981, -0.020700545981526375, ... | [-0.011253940872848034, -0.013491976074874401,... | 0 |
1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... | [0.0009286514250561595, 0.000820168002974242, ... | [0.0003609954728744924, 0.007262262050062418, ... | 1 |
2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... | [0.003393713850528002, 0.0061537534929811954, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2 |
3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... | [0.0153952119871974, -0.013759135268628597, 0.... | [0.024894846603274345, -0.022186409682035446, ... | 3 |
4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... | [0.02224554680287838, -0.02044147066771984, -0... | [0.021524671465158463, 0.018522677943110466, -... | 4 |
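The vector columns arrive as JSON strings, which is why the cell above applies `json.loads`. As a small sketch, a stricter parser could also check that each vector has the dimension the index expects (1536 in this notebook). The helper name `parse_vector` is hypothetical.

```python
import json
from typing import List


def parse_vector(raw: str, expected_dim: int = 1536) -> List[float]:
    """Parse a JSON-encoded embedding and verify its length."""
    vector = json.loads(raw)
    if not isinstance(vector, list) or len(vector) != expected_dim:
        raise ValueError(f"expected a list of {expected_dim} numbers")
    # Normalize ints such as 0 or 1 to floats
    return [float(x) for x in vector]
```

It could be used in place of `json.loads`, for example `article_df.content_vector.apply(parse_vector)`, so that a malformed row fails at load time rather than during upload.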
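This code snippet shows how to define and create a search index using SearchIndexClient from the Azure AI Search Python SDK. The index combines vector search with semantic ranker capabilities. For more details, see the documentation on how to create a vector index.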
# Initialize the SearchIndexClient
index_client = SearchIndexClient(
    endpoint=search_service_endpoint, credential=credential
)
# Define the fields for the index
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String),
    SimpleField(name="vector_id", type=SearchFieldDataType.String, key=True),
    SimpleField(name="url", type=SearchFieldDataType.String),
    SearchableField(name="title", type=SearchFieldDataType.String),
    SearchableField(name="text", type=SearchFieldDataType.String),
    SearchField(
        name="title_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        vector_search_dimensions=1536,
        vector_search_profile_name="my-vector-config",
    ),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        vector_search_dimensions=1536,
        vector_search_profile_name="my-vector-config",
    ),
]
# Configure the vector search configuration
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="my-hnsw",
            kind=VectorSearchAlgorithmKind.HNSW,
            parameters=HnswParameters(
                m=4,
                ef_construction=400,
                ef_search=500,
                metric=VectorSearchAlgorithmMetric.COSINE,
            ),
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="my-vector-config",
            algorithm_configuration_name="my-hnsw",
        )
    ],
)
# Configure the semantic search configuration
semantic_search = SemanticSearch(
    configurations=[
        SemanticConfiguration(
            name="my-semantic-config",
            prioritized_fields=SemanticPrioritizedFields(
                title_field=SemanticField(field_name="title"),
                keywords_fields=[SemanticField(field_name="url")],
                content_fields=[SemanticField(field_name="text")],
            ),
        )
    ]
)
# Create the search index with the vector search and semantic search configurations
index = SearchIndex(
    name=index_name,
    fields=fields,
    vector_search=vector_search,
    semantic_search=semantic_search,
)
# Create or update the index
result = index_client.create_or_update_index(index)
print(f"{result.name} created")
azure-ai-search-openai-cookbook-demo created
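The index above uses HNSW, an approximate nearest-neighbor graph, with the cosine metric. To make clear what is being approximated, here is a sketch of the exact (brute-force) search that HNSW trades away for speed. The function names are hypothetical and the code is plain Python, not part of the Azure SDK.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Score every vector against the query and keep the k most similar."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nearest = exact_top_k([1.0, 0.1], toy_vectors, k=2)
print([index for index, _ in nearest])  # indices of the two closest vectors
```

Exact search compares the query with every stored vector, which scales linearly with the corpus. HNSW avoids that by walking a graph; broadly, larger `m`, `ef_construction`, and `ef_search` values improve recall at the cost of memory, build time, and query latency.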
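The following code snippet outlines how to upload a batch of documents, specifically Wikipedia articles with pre-computed embeddings, from a pandas DataFrame to an Azure AI Search index. For a detailed guide to data import strategies and best practices, see Data import in Azure AI Search.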
from azure.core.exceptions import HttpResponseError
# Convert the 'id' and 'vector_id' columns to string so one of them can serve as our key field
article_df["id"] = article_df["id"].astype(str)
article_df["vector_id"] = article_df["vector_id"].astype(str)
# Convert the DataFrame to a list of dictionaries
documents = article_df.to_dict(orient="records")
# Create a SearchIndexingBufferedSender
batch_client = SearchIndexingBufferedSender(
    search_service_endpoint, index_name, credential
)
try:
    # Add upload actions for all documents in a single call
    batch_client.upload_documents(documents=documents)
    # Manually flush to send any remaining documents in the buffer
    batch_client.flush()
except HttpResponseError as e:
    print(f"An error occurred: {e}")
finally:
    # Clean up resources
    batch_client.close()
print(f"Uploaded {len(documents)} documents in total")
Uploaded 25000 documents in total
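`SearchIndexingBufferedSender` handles batching for you. If you upload through `SearchClient.upload_documents` instead, you would need to split the documents yourself. The sketch below shows one way to do that; the helper `chunked` is hypothetical, and the default batch size of 1000 is an assumption that should be checked against the current service limits.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[dict], size: int = 1000) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` documents."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Usage sketch (requires the search_client and documents defined in the notebook):
# for batch in chunked(documents):
#     search_client.upload_documents(documents=batch)
```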
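If your dataset does not include pre-computed embeddings, you can create them with the function below and the openai Python library. Note that the same function and model are also used to generate the query embeddings for vector search.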
# Example function to generate document embedding
def generate_embeddings(text, model):
    # Generate embeddings for the provided text using the specified model
    embeddings_response = client.embeddings.create(model=model, input=text)
    # Extract the embedding data from the response
    embedding = embeddings_response.data[0].embedding
    return embedding
first_document_content = documents[0]["text"]
print(f"Content: {first_document_content[:100]}")
content_vector = generate_embeddings(first_document_content, deployment)
print("Content vector generated")
Content: April is the fourth month of the year in the Julian and Gregorian calendars, and comes between March
Content vector generated
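Embedding one text per request becomes slow for a whole dataset. The embeddings endpoint also accepts a list of inputs, so a batched variant could look like the sketch below. The function `generate_embeddings_batch` is hypothetical, and the fake client exists only so the example runs without a service; in the notebook you would pass the real `client` and `deployment`.

```python
from types import SimpleNamespace
from typing import List


def generate_embeddings_batch(client, texts: List[str], model: str) -> List[List[float]]:
    """Embed several texts in one request, returned in input order."""
    response = client.embeddings.create(model=model, input=texts)
    # Each item carries the index of the input it belongs to
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


class _FakeEmbeddings:
    """Stand-in for client.embeddings, used only to make this sketch runnable."""

    def create(self, model, input):
        data = [
            SimpleNamespace(index=i, embedding=[float(len(text))])
            for i, text in enumerate(input)
        ]
        # Return items out of order to show why sorting by index matters
        return SimpleNamespace(data=list(reversed(data)))


fake_client = SimpleNamespace(embeddings=_FakeEmbeddings())
vectors = generate_embeddings_batch(fake_client, ["a", "abc"], "any-deployment")
print(vectors)
```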
# Pure Vector Search
query = "modern art in Europe"
search_client = SearchClient(search_service_endpoint, index_name, credential)
vector_query = VectorizedQuery(
    vector=generate_embeddings(query, deployment),
    k_nearest_neighbors=3,
    fields="content_vector",
)
results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    select=["title", "text", "url"],
)
for result in results:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"URL: {result['url']}\n")
Title: Documenta
Score: 0.8599451
URL: https://simple.wikipedia.org/wiki/Documenta

Title: Museum of Modern Art
Score: 0.85260946
URL: https://simple.wikipedia.org/wiki/Museum%20of%20Modern%20Art

Title: Expressionism
Score: 0.852354
URL: https://simple.wikipedia.org/wiki/Expressionism
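The scores above are not raw cosine similarities. For the cosine metric, the Azure AI Search documentation describes `@search.score` as `1 / (1 + cosine_distance)`, where the distance is `1 - cosine_similarity`. Assuming that mapping applies to this index, the sketch below inverts it. The helper name is hypothetical.

```python
def score_to_cosine_similarity(score: float) -> float:
    """Invert score = 1 / (1 + (1 - cosine_similarity)).

    Assumes the index uses the cosine metric and the documented score mapping.
    """
    if not 0 < score <= 1:
        raise ValueError("score must be in the interval (0, 1]")
    return 2.0 - 1.0 / score


# Top hit from the output above; the result is roughly 0.84
print(round(score_to_cosine_similarity(0.8599451), 4))
```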
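Hybrid search combines traditional keyword search with vector-based similarity search to return more relevant, context-aware results. It is especially useful for complex queries that benefit from an understanding of the semantic meaning behind the text.
The following code snippet shows how to run a hybrid search query.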
# Hybrid Search
query = "Famous battles in Scottish history"
search_client = SearchClient(search_service_endpoint, index_name, credential)
vector_query = VectorizedQuery(
    vector=generate_embeddings(query, deployment),
    k_nearest_neighbors=3,
    fields="content_vector",
)
results = search_client.search(
    search_text=query,
    vector_queries=[vector_query],
    select=["title", "text", "url"],
    top=3,
)
for result in results:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"URL: {result['url']}\n")
Title: Wars of Scottish Independence
Score: 0.03306011110544205
URL: https://simple.wikipedia.org/wiki/Wars%20of%20Scottish%20Independence

Title: Battle of Bannockburn
Score: 0.022253260016441345
URL: https://simple.wikipedia.org/wiki/Battle%20of%20Bannockburn

Title: Scottish
Score: 0.016393441706895828
URL: https://simple.wikipedia.org/wiki/Scottish
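Hybrid scores are much smaller than the pure vector scores because they are fused rank scores rather than similarities: Azure AI Search merges the keyword and vector rankings with Reciprocal Rank Fusion (RRF). The sketch below illustrates the general RRF idea only. The constant `k=60` and the 1-based ranks are assumptions, so it is not guaranteed to reproduce the service's exact numbers.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1 / (k + rank) for each document across all rankings."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


keyword_ranking = ["a", "b", "c"]  # hypothetical keyword results
vector_ranking = ["a", "c", "d"]   # hypothetical vector results
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(list(fused))
```

A document that appears in both rankings accumulates two terms, which is why documents matched by both the keyword and vector queries tend to rise to the top.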
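The semantic ranker significantly improves search relevance by using language understanding to rerank search results. It can also return extractive captions, answers, and highlights.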
# Semantic Hybrid Search
query = "What were the key technological advancements during the Industrial Revolution?"
search_client = SearchClient(search_service_endpoint, index_name, credential)
vector_query = VectorizedQuery(
    vector=generate_embeddings(query, deployment),
    k_nearest_neighbors=3,
    fields="content_vector",
)
results = search_client.search(
    search_text=query,
    vector_queries=[vector_query],
    select=["title", "text", "url"],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name="my-semantic-config",
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
    top=3,
)
semantic_answers = results.get_answers()
for answer in semantic_answers:
    if answer.highlights:
        print(f"Semantic Answer: {answer.highlights}")
    else:
        print(f"Semantic Answer: {answer.text}")
    print(f"Semantic Answer Score: {answer.score}\n")
for result in results:
    print(f"Title: {result['title']}")
    print(f"Reranker Score: {result['@search.reranker_score']}")
    print(f"URL: {result['url']}")
    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")
Semantic Answer: Advancements During the industrial revolution, new technology brought many changes. For example:<em> Canals</em> were built to allow heavy goods to be moved easily where they were needed. The steam engine became the main source of power. It replaced horses and human labor. Cheap iron and steel became mass-produced.
Semantic Answer Score: 0.90478515625

Title: Industrial Revolution
Reranker Score: 3.408700942993164
URL: https://simple.wikipedia.org/wiki/Industrial%20Revolution
Caption: Advancements During the industrial revolution, new technology brought many changes. For example: Canals were built to allow heavy goods to be moved easily where they were needed. The steam engine became the main source of power. It replaced horses and human labor. Cheap iron and steel became mass-produced.

Title: Printing
Reranker Score: 1.603400707244873
URL: https://simple.wikipedia.org/wiki/Printing
Caption: Machines to speed printing, cheaper paper, automatic stitching and binding all arrived in the 19th century during the industrial revolution. What had once been done by a few men by hand was now done by limited companies on huge machines. The result was much lower prices, and a much wider readership.

Title: Industrialisation
Reranker Score: 1.3238357305526733
URL: https://simple.wikipedia.org/wiki/Industrialisation
Caption: <em>Industrialisation</em> (or<em> industrialization)</em> is a process that happens in countries when they start to use machines to do work that was once done by people.<em> Industrialisation changes</em> the things people do.<em> Industrialisation</em> caused towns to grow larger. Many people left farming to take higher paid jobs in factories in towns.
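The highlights in answers and captions are marked with `<em>` tags, as in the output above. If you want plain text for display, or the highlighted phrases on their own, a small post-processing step is enough. The helper `split_highlights` is a hypothetical sketch, not part of the SDK.

```python
import re
from typing import List, Tuple

_EM_SPAN = re.compile(r"<em>(.*?)</em>")
_EM_TAG = re.compile(r"</?em>")


def split_highlights(text: str) -> Tuple[str, List[str]]:
    """Return the text without <em> tags and the list of highlighted phrases."""
    phrases = [match.strip() for match in _EM_SPAN.findall(text)]
    plain = _EM_TAG.sub("", text)
    return plain, phrases


plain, phrases = split_highlights("For example:<em> Canals</em> were built")
print(plain)
print(phrases)
```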