Azure AI Search with Azure Functions and GPT Actions in ChatGPT

Jul 8, 2024

This notebook provides step-by-step instructions for using Azure AI Search (formerly Azure Cognitive Search) as a vector database with OpenAI embeddings, and then creating an Azure Function on top of it to plug into a custom GPT in ChatGPT.

This can be a solution for customers looking to set up RAG infrastructure within Azure and expose it as an endpoint to integrate with other platforms, such as ChatGPT.

Azure AI Search is a cloud search service that gives developers infrastructure, APIs, and tools for building rich search experiences over private, heterogeneous content in web, mobile, and enterprise applications.

Azure Functions is a serverless compute service that runs event-driven code, automatically managing the infrastructure, scaling, and integrating with other Azure services.

Architecture

Below is a diagram of the architecture of this solution, which we'll walk through step by step.

azure-rag-architecture.png

Note: this architecture pattern of vector data store + serverless functions can be generalized to other vector data stores. For example, if you wanted to use a service like Postgres within Azure, you would change the Configure Azure AI Search Settings step to set up the requirements for Postgres, you would modify Create Azure AI Vector Search to create a database and tables in Postgres, and you would update the function_app.py code in this repository to query Postgres instead of Azure AI Search. The data preparation and the creation of the Azure Function would stay consistent. A minimal sketch of what the swapped-in query could look like follows below.
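
For illustration only, a query against a hypothetical pgvector-backed table might look like the sketch below; the connection string, table name, and column names here are assumptions, not part of this solution:

# Hypothetical sketch: replacing the Azure AI Search lookup with a Postgres
# (pgvector) query. The connection string, table, and columns are illustrative.
import json
import psycopg2

def pgvector_similarity_search(query_embedding, k=3):
    conn = psycopg2.connect("<your-postgres-connection-string>")
    cur = conn.cursor()
    # "<=>" is pgvector's cosine distance operator; smaller means more similar
    cur.execute(
        "SELECT title, text FROM oai_docs ORDER BY content_vector <=> %s::vector LIMIT %s",
        (json.dumps(query_embedding), k),
    )
    rows = cur.fetchall()
    cur.close()
    conn.close()
    return rows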

  1. Setup Environment: Set up the environment by installing and importing the required libraries and configuring our Azure settings.

  2. Prepare Data: Prepare the data for upload by embedding the documents and capturing additional metadata. We will use a subset of OpenAI's documentation as example data.

  3. Create Azure AI Vector Search: Create the Azure AI Vector Search and upload the data we prepared.

  4. Create Azure Function: Create an Azure Function to interact with the Azure AI Vector Search.

  5. Input into a Custom GPT in ChatGPT: Integrate the Azure Function with a custom GPT in ChatGPT.

Setup Environment

We'll set up our environment by importing the required libraries and configuring our Azure settings.

! pip install -q wget
! pip install -q azure-search-documents 
! pip install -q azure-identity
! pip install -q openai
! pip install -q azure-mgmt-search
! pip install -q pandas
! pip install -q azure-mgmt-resource 
! pip install -q azure-mgmt-storage
! pip install -q pyperclip
! pip install -q PyPDF2
! pip install -q tiktoken
! pip install -q python-dotenv
# Standard Libraries
import json  
import os
import platform
import subprocess
import csv
from itertools import islice
import uuid
import shutil
import concurrent.futures

# Third-Party Libraries
import pandas as pd
from PyPDF2 import PdfReader
import tiktoken
from dotenv import load_dotenv
import pyperclip

# OpenAI Libraries (note we use OpenAI directly here, but you can replace with Azure OpenAI as needed)
from openai import OpenAI

# Azure Identity and Credentials
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.core.credentials import AzureKeyCredential  
from azure.core.exceptions import HttpResponseError

# Azure Search Documents
from azure.search.documents import SearchClient, SearchIndexingBufferedSender  
from azure.search.documents.indexes import SearchIndexClient  
from azure.search.documents.models import (
    VectorizedQuery
)
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    HnswParameters,
    SearchField,
    SearchableField,
    SearchFieldDataType,
    SearchIndex,
    SimpleField,
    VectorSearch,
    VectorSearchAlgorithmKind,
    VectorSearchAlgorithmMetric,
    VectorSearchProfile,
)

# Azure Management Clients
from azure.mgmt.search import SearchManagementClient
from azure.mgmt.resource import ResourceManagementClient, SubscriptionClient
from azure.mgmt.storage import StorageManagementClient
openai_api_key = os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as an env var>") # Saving this as a variable to reference in function app in later step
openai_client = OpenAI(api_key=openai_api_key)
embeddings_model = "text-embedding-3-small" # We'll use this by default, but you can change to text-embedding-3-large if desired

Prerequisites

  • Subscription ID from Azure
  • Resource group name from Azure
  • Region in Azure
# Update the below with your values
subscription_id="<enter_your_subscription_id>"
resource_group="<enter_your_resource_group>"

## Make sure to choose a region that supports the proper products. We've defaulted to "eastus" below. https://azure.microsoft.com/en-us/explore/global-infrastructure/products-by-region/#products-by-region_tab5
region = "eastus"
credential = InteractiveBrowserCredential()
subscription_client = SubscriptionClient(credential)
subscription = next(subscription_client.subscriptions.list())
# Initialize the SearchManagementClient with the provided credentials and subscription ID
search_management_client = SearchManagementClient(
    credential=credential,
    subscription_id=subscription_id,
)

# Generate a unique name for the search service using UUID, but you can change this if you'd like.
generated_uuid = str(uuid.uuid4())
search_service_name = "search-service-gpt-demo" + generated_uuid
## The below is the default endpoint structure that is created when you create a search service. This may differ based on your Azure settings.
search_service_endpoint = 'https://'+search_service_name+'.search.windows.net'

# Create or update the search service with the specified parameters
response = search_management_client.services.begin_create_or_update(
    resource_group_name=resource_group,
    search_service_name=search_service_name,
    service={
        "location": region,
        "properties": {"hostingMode": "default", "partitionCount": 1, "replicaCount": 1},
        # We are using the free pricing tier for this demo. You are only allowed one free search service per subscription.
        "sku": {"name": "free"},
        "tags": {"app-name": "Search service demo"},
    },
).result()

# Convert the response to a dictionary and then to a pretty-printed JSON string
response_dict = response.as_dict()
response_json = json.dumps(response_dict, indent=4)

print(response_json)
print("Search Service Name:" + search_service_name)
print("Search Service Endpoint:" + search_service_endpoint)
# Retrieve the admin keys for the search service
try:
    response = search_management_client.admin_keys.get(
        resource_group_name=resource_group,
        search_service_name=search_service_name,
    )
    # Extract the primary API key from the response and save as a variable to be used later
    search_service_api_key = response.primary_key
    print("Successfully retrieved the API key.")
except Exception as e:
    print(f"Failed to retrieve the API key: {e}")

Prepare Data

We'll embed a few pages of the OpenAI documentation stored in the oai_docs folder. We'll first embed each page, add it to a CSV, and then use that CSV to upload to the index.

To handle longer text files beyond the 8191-token context limit of the embeddings model, we can either use the chunk embeddings separately, or combine them in some way, such as averaging (weighted by the size of each chunk).

We will take a function from Python's own cookbook that breaks up a sequence into chunks.

def batched(iterable, n):
    """Batch data into tuples of length n. The last batch may be shorter."""
    # batched('ABCDEFG', 3) --> ABC DEF G
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while (batch := tuple(islice(it, n))):
        yield batch

Now we define a function that encodes a string into tokens and then breaks it up into chunks. We'll use tiktoken, a fast open-source tokenizer by OpenAI.

To read more about counting tokens with tiktoken, check out this cookbook.
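
As a quick illustration of what tiktoken does (this snippet is not part of the original flow), the following counts the tokens in a short string:

# Quick illustration: count the tokens in a string using the cl100k_base encoding
encoding_demo = tiktoken.get_encoding("cl100k_base")
print(len(encoding_demo.encode("How many tokens is this sentence?")))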

def chunked_tokens(text, chunk_length, encoding_name='cl100k_base'):
    # Get the encoding object for the specified encoding name. 'cl100k_base' is the
    # encoding used by OpenAI's text-embedding-3 models; tiktoken supports several
    # encodings (e.g. 'r50k_base', 'p50k_base', 'cl100k_base', 'o200k_base') for
    # different model families.
    encoding = tiktoken.get_encoding(encoding_name)
    # Encode the input text into tokens
    tokens = encoding.encode(text)
    # Create an iterator that yields chunks of tokens of the specified length
    chunks_iterator = batched(tokens, chunk_length)
    # Yield each chunk from the iterator
    yield from chunks_iterator

Finally, we can write a function that safely handles embedding requests, even when the input text is longer than the maximum context length, by chunking the input tokens and embedding each chunk individually. The function below returns the list of chunk embeddings along with the corresponding text chunks; a sketch of a weighted-average variant follows the code below.

Note: there are other, more sophisticated techniques you could take here, including:

  • Using GPT-4o to capture descriptions of images/charts for embedding.
  • Keeping text overlap between chunks to minimize the loss of important context (see the sketch after this list).
  • Chunking based on paragraphs or sections.
  • Adding more descriptive metadata about each article.
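
For example, a minimal sketch of the overlap idea (the overlap size is an assumption to tune for your data) could look like:

def chunked_tokens_with_overlap(text, chunk_length, overlap, encoding_name='cl100k_base'):
    # Variant of chunked_tokens that repeats the last `overlap` tokens of each
    # chunk at the start of the next one, reducing context lost at chunk boundaries
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    step = max(1, chunk_length - overlap)
    for start in range(0, len(tokens), step):
        yield tuple(tokens[start:start + chunk_length])
        if start + chunk_length >= len(tokens):
            break
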
## Change the below based on your model. The below is for the latest embeddings models from OpenAI, so you can leave as is unless you are using a different embedding model.
EMBEDDING_CTX_LENGTH = 8191
EMBEDDING_ENCODING='cl100k_base'
def generate_embeddings(text, model):
    # Generate embeddings for the provided text using the specified model
    embeddings_response = openai_client.embeddings.create(model=model, input=text)
    # Extract the embedding data from the response
    embedding = embeddings_response.data[0].embedding
    return embedding

def len_safe_get_embedding(text, model=embeddings_model, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING):
    # Initialize lists to store embeddings and corresponding text chunks
    chunk_embeddings = []
    chunk_texts = []
    # Iterate over chunks of tokens from the input text
    for chunk in chunked_tokens(text, chunk_length=max_tokens, encoding_name=encoding_name):
        # Generate embeddings for each chunk and append to the list
        chunk_embeddings.append(generate_embeddings(chunk, model=model))
        # Decode the chunk back to text and append to the list
        chunk_texts.append(tiktoken.get_encoding(encoding_name).decode(chunk))
    # Return the list of chunk embeddings and the corresponding text chunks
    return chunk_embeddings, chunk_texts
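
If you would rather have a single embedding per document, a weighted average of the chunk embeddings (weighted by chunk size, as mentioned above) works well. Here is a minimal sketch, assuming numpy is available (it is installed alongside pandas):

import numpy as np

def len_safe_get_average_embedding(text, model=embeddings_model):
    # Embed each chunk, then average the chunk embeddings,
    # weighted by the length of each decoded chunk
    chunk_embeddings, chunk_texts = len_safe_get_embedding(text, model=model)
    weights = [len(chunk) for chunk in chunk_texts]
    average = np.average(chunk_embeddings, axis=0, weights=weights)
    # Normalize so the combined embedding is a unit vector, like the API's output
    average = average / np.linalg.norm(average)
    return average.tolist()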

Next, we can define a helper function to capture additional metadata about the documents. This is useful as a metadata filter for search queries, and for capturing richer data for searching.

In this example, I'll choose from a list of categories to use later in a metadata filter.

## These are the categories I will be using for the categorization task. You can change these as needed based on your use case.
categories = ['authentication','models','techniques','tools','setup','billing_limits','other']

def categorize_text(text, categories):
    # Create a prompt for categorization
    messages = [
        {"role": "system", "content": f"""You are an expert in LLMs, and you will be given text that corresponds to an article in OpenAI's documentation.
         Categorize the document into one of these categories: {', '.join(categories)}. Only respond with the category name and nothing else."""},
        {"role": "user", "content": text}
    ]
    try:
        # Call the OpenAI API to categorize the text
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=messages
        )
        # Extract the category from the response
        category = response.choices[0].message.content
        return category
    except Exception as e:
        print(f"Error categorizing text: {str(e)}")
        return None

Now we can define some helper functions to process the .txt files in the oai_docs folder within the data folder. Feel free to use these with your own data; they support both .txt and .pdf files.

def extract_text_from_pdf(pdf_path):
    # Initialize the PDF reader
    reader = PdfReader(pdf_path)
    text = ""
    # Iterate through each page in the PDF and extract text
    for page in reader.pages:
        text += page.extract_text()
    return text

def process_file(file_path, idx, categories, embeddings_model):
    file_name = os.path.basename(file_path)
    print(f"Processing file {idx + 1}: {file_name}")
    
    # Read text content from .txt files
    if file_name.endswith('.txt'):
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
    # Extract text content from .pdf files
    elif file_name.endswith('.pdf'):
        text = extract_text_from_pdf(file_path)
    
    title = file_name
    # Generate embeddings for the title
    title_vectors, title_text = len_safe_get_embedding(title, embeddings_model)
    print(f"Generated title embeddings for {file_name}")
    
    # Generate embeddings for the content
    content_vectors, content_text = len_safe_get_embedding(text, embeddings_model)
    print(f"Generated content embeddings for {file_name}")
    
    category = categorize_text(' '.join(content_text), categories)
    print(f"Categorized {file_name} as {category}")
    
    # Prepare the data to be appended
    data = []
    for i, content_vector in enumerate(content_vectors):
        data.append({
            "id": f"{idx}_{i}",
            "vector_id": f"{idx}_{i}",
            "title": title_text[0],
            "text": content_text[i],
            "title_vector": json.dumps(title_vectors[0]),  # Assuming title is short and has only one chunk
            "content_vector": json.dumps(content_vector),
            "category": category
        })
        print(f"Appended data for chunk {i + 1}/{len(content_vectors)} of {file_name}")
    
    return data

We'll now use this helper function to process our OpenAI documentation. Feel free to update this to use your own data by changing folder_name in the code below.

Note that this will process the documents in the chosen folder concurrently, so it should take under 30 seconds if using txt files, and a bit longer if using PDFs.

## Customize the location below if you are using different data besides the OpenAI documentation. Note that if you are using a different dataset, you will need to update the categories list as well.
folder_name = "../../../data/oai_docs"

files = [os.path.join(folder_name, f) for f in os.listdir(folder_name) if f.endswith('.txt') or f.endswith('.pdf')]
data = []

# Process each file concurrently
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = {executor.submit(process_file, file_path, idx, categories, embeddings_model): idx for idx, file_path in enumerate(files)}
    for future in concurrent.futures.as_completed(futures):
        try:
            result = future.result()
            data.extend(result)
        except Exception as e:
            print(f"Error processing file: {str(e)}")

# Write the data to a CSV file
csv_file = os.path.join("..", "embedded_data.csv")
with open(csv_file, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ["id", "vector_id", "title", "text", "title_vector", "content_vector","category"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in data:
        writer.writerow(row)
        print(f"Wrote row with id {row['id']} to CSV")

# Convert the CSV file to a Dataframe
article_df = pd.read_csv("../embedded_data.csv")
# Read vectors from strings back into a list using json.loads
article_df["title_vector"] = article_df.title_vector.apply(json.loads)
article_df["content_vector"] = article_df.content_vector.apply(json.loads)
article_df["vector_id"] = article_df["vector_id"].apply(str)
article_df["category"] = article_df["category"].apply(str)
article_df.head()

We now have an embedded_data.csv file with seven columns that we can upload to our vector database!

Create Index

We'll define and create a search index using the SearchIndexClient from the Azure AI Search Python SDK. The index combines vector search and hybrid search capabilities. For more details, visit Microsoft's documentation on how to Create a Vector Index.

index_name = "azure-ai-search-openai-cookbook-demo"
# index_name = "<insert_name_for_index>"

index_client = SearchIndexClient(
    endpoint=search_service_endpoint, credential=AzureKeyCredential(search_service_api_key)
)
# Define the fields for the index. Update these based on your data.
# Each field represents a column in the search index
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String),  # Simple string field for document ID
    SimpleField(name="vector_id", type=SearchFieldDataType.String, key=True),  # Key field for the index
    # SimpleField(name="url", type=SearchFieldDataType.String),  # URL field (commented out)
    SearchableField(name="title", type=SearchFieldDataType.String),  # Searchable field for document title
    SearchableField(name="text", type=SearchFieldDataType.String),  # Searchable field for document text
    SearchField(
        name="title_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),  # Collection of single values for title vector
        vector_search_dimensions=1536,  # Number of dimensions in the vector
        vector_search_profile_name="my-vector-config",  # Profile name for vector search configuration
    ),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),  # Collection of single values for content vector
        vector_search_dimensions=1536,  # Number of dimensions in the vector
        vector_search_profile_name="my-vector-config",  # Profile name for vector search configuration
    ),
    SearchableField(name="category", type=SearchFieldDataType.String, filterable=True),  # Searchable field for document category
]

# This configuration defines the algorithm and parameters for vector search
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="my-hnsw",  # Name of the HNSW algorithm configuration
            kind=VectorSearchAlgorithmKind.HNSW,  # Type of algorithm
            parameters=HnswParameters(
                m=4,  # Number of bi-directional links created for every new element
                ef_construction=400,  # Size of the dynamic list for the nearest neighbors during construction
                ef_search=500,  # Size of the dynamic list for the nearest neighbors during search
                metric=VectorSearchAlgorithmMetric.COSINE,  # Distance metric used for the search
            ),
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="my-vector-config",  # Name of the vector search profile
            algorithm_configuration_name="my-hnsw",  # Reference to the algorithm configuration
        )
    ],
)

# Create the search index with the vector search configuration
# This combines all the configurations into a single search index
index = SearchIndex(
    name=index_name,  # Name of the index
    fields=fields,  # Fields defined for the index
    vector_search=vector_search  # Vector search configuration
)

# Create or update the index
# This sends the index definition to the Azure Search service
result = index_client.create_index(index)
print(f"{result.name} created")  # Output the name of the created index

Upload Data

We'll now upload the articles stored above in embedded_data.csv from the pandas DataFrame to the Azure AI Search index. For a detailed guide on data import strategies and best practices, refer to Data Import in Azure AI Search.

# Convert the 'id' and 'vector_id' columns to string so one of them can serve as our key field
article_df["id"] = article_df["id"].astype(str)
article_df["vector_id"] = article_df["vector_id"].astype(str)

# Convert the DataFrame to a list of dictionaries
documents = article_df.to_dict(orient="records")

# Log the number of documents to be uploaded
print(f"Number of documents to upload: {len(documents)}")

# Create a SearchIndexingBufferedSender
batch_client = SearchIndexingBufferedSender(
    search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key)
)
# Get the first document to check its schema
first_document = documents[0]

# Get the index schema
index_schema = index_client.get_index(index_name)

# Get the field names from the index schema
index_fields = {field.name: field.type for field in index_schema.fields}

# Check each field in the first document
for field, value in first_document.items():
    if field not in index_fields:
        print(f"Field '{field}' is not in the index schema.")

# Check for any fields in the index schema that are not in the documents
for field in index_fields:
    if field not in first_document:
        print(f"Field '{field}' is in the index schema but not in the documents.")

try:
    if documents:
        # Add upload actions for all documents in a single call
        upload_result = batch_client.upload_documents(documents=documents)

        # Check if the upload was successful
        # Manually flush to send any remaining documents in the buffer
        batch_client.flush()
        
        print(f"Uploaded {len(documents)} documents in total")
    else:
        print("No documents to upload.")
except HttpResponseError as e:
    print(f"An error occurred: {e}")
    raise  # Re-raise the exception to ensure it errors out
finally:
    # Clean up resources
    batch_client.close()

Now that the data is uploaded, we'll test both vector similarity search and hybrid search locally to make sure it is working as expected.

You can test both a pure vector search and a hybrid search. A pure vector search passes None to search_text below and searches only on vector similarity, while a hybrid search combines traditional keyword-based search with vector-based similarity search, by passing the query text query to search_text, to provide more relevant and contextual results.

query = "What model should I use to embed?"
# Note: we'll have the GPT choose the category automatically once we put it in ChatGPT
category ="models"

search_client = SearchClient(search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key))
vector_query = VectorizedQuery(vector=generate_embeddings(query, embeddings_model), k_nearest_neighbors=3, fields="content_vector")
  
results = search_client.search(  
    search_text=None, # Pass in None if you want to use pure vector search, and `query` if you want to use hybrid search
    vector_queries= [vector_query], 
    select=["title", "text"],
    filter=f"category eq '{category}'" 
)

for result in results:  
    print(result)

Create Azure Function

Azure Functions are an easy way to build an API on top of our new AI search. Our code (see the function_app.py file in this folder, or linked here) does the following (a minimal sketch follows the list below):

  1. Receives as input the user's query, the search index endpoint, the index name, k_nearest_neighbors*, the search column to use (either content_vector or title_vector), and whether a hybrid query should be used
  2. Takes the user's query and embeds it.
  3. Runs a vector search and retrieves the relevant text chunks.
  4. Returns those relevant text chunks in the response body.

*In the context of vector search, k_nearest_neighbors specifies the number of "closest" vectors (in terms of cosine similarity) that the search should return. For example, if k_nearest_neighbors is set to 3, the search will return the 3 vectors in the index most similar to the query vector.
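
The sketch below is only an illustration of what such a function could look like, using the v2 Python programming model for Azure Functions; the function_app.py file in the repository is the source of truth, and the request fields mirror the OpenAPI spec later in this notebook.

# Illustrative sketch only; see function_app.py in this repository for the actual code.
import json
import os

import azure.functions as func
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import OpenAI

app = func.FunctionApp(http_auth_level=func.AuthLevel.ANONYMOUS)

@app.route(route="vector_similarity_search", methods=["POST"])
def vector_similarity_search(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    # Embed the user's query with the same embeddings model used for the index
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    embedding = client.embeddings.create(
        model=os.environ["EMBEDDINGS_MODEL"], input=body["query"]
    ).data[0].embedding

    search_client = SearchClient(
        body["search_service_endpoint"],
        body["index_name"],
        AzureKeyCredential(os.environ["SEARCH_SERVICE_API_KEY"]),
    )
    vector_query = VectorizedQuery(
        vector=embedding,
        k_nearest_neighbors=body["k_nearest_neighbors"],
        fields=body["search_column"],
    )
    results = search_client.search(
        # Hybrid search passes the raw query text alongside the vector query
        search_text=body["query"] if body.get("use_hybrid_query") else None,
        vector_queries=[vector_query],
        select=["title", "text"],
        filter=f"category eq '{body['category']}'" if body.get("category") else None,
    )
    return func.HttpResponse(
        json.dumps({"results": [dict(r) for r in results]}),
        mimetype="application/json",
    )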

Note that this Azure Function does not have any authentication on it. However, you can set up authentication on it by following the documentation here.

We can create a new storage account using the code below, but feel free to skip that code block and modify the subsequent steps to use an existing storage account instead. This may take up to 30 seconds.

## Update below with a different name
storage_account_name = "<enter-storage-account-name>"

## Use below SKU or any other SKU as per your requirement
sku = "Standard_LRS"
resource_client = ResourceManagementClient(credential, subscription_id)
storage_client = StorageManagementClient(credential, subscription_id)

# Create resource group if it doesn't exist
rg_result = resource_client.resource_groups.create_or_update(resource_group, {"location": region})

# Create storage account
storage_async_operation = storage_client.storage_accounts.begin_create(
    resource_group,
    storage_account_name,
    {
        "sku": {"name": sku},
        "kind": "StorageV2",
        "location": region,
    },
)
storage_account = storage_async_operation.result()

print(f"Storage account {storage_account.name} created")

Create Function App

The Function App is where the Python code executes once it is triggered by the GPT Action. To learn more about Function Apps, see the documentation here.

To deploy the Function App, we need to use the Azure CLI and Azure Functions Core Tools.

The below will attempt to install and run these based on the platform type of your virtual environment, but if that does not work, read the Azure documentation on how to install Azure Functions Core Tools and the Azure CLI. Once you've done that, run the subprocess.run commands below in your terminal after navigating to this folder.

First, we'll make sure we have the relevant tools in our environment so we can run the necessary Azure commands. This may take a few minutes to install.

os_type = platform.system()

if os_type == "Windows":
    # Install Azure Functions Core Tools on Windows
    subprocess.run(["npm", "install", "-g", "azure-functions-core-tools@3", "--unsafe-perm", "true"], check=True)
    # Install Azure CLI on Windows
    subprocess.run(["powershell", "-Command", "Invoke-WebRequest -Uri https://aka.ms/installazurecliwindows -OutFile .\\AzureCLI.msi; Start-Process msiexec.exe -ArgumentList '/I AzureCLI.msi /quiet' -Wait"], check=True)
elif os_type == "Darwin":  # MacOS
    # Install Azure Functions Core Tools on MacOS
    if platform.machine() == 'arm64':
        # For M1 Macs
        subprocess.run(["arch", "-arm64", "brew", "install", "azure-functions-core-tools@3"], check=True)
    else:
        # For Intel Macs
        subprocess.run(["brew", "install", "azure-functions-core-tools@3"], check=True)
    # Install Azure CLI on MacOS
    subprocess.run(["brew", "update"], check=True)
    subprocess.run(["brew", "install", "azure-cli"], check=True)
elif os_type == "Linux":
    # Install Azure Functions Core Tools on Linux
    subprocess.run(["curl", "https://packages.microsoft.com/keys/microsoft.asc", "|", "gpg", "--dearmor", ">", "microsoft.gpg"], check=True, shell=True)
    subprocess.run(["sudo", "mv", "microsoft.gpg", "/etc/apt/trusted.gpg.d/microsoft.gpg"], check=True)
    subprocess.run(["sudo", "sh", "-c", "'echo \"deb [arch=amd64] https://packages.microsoft.com/repos/microsoft-ubuntu-$(lsb_release -cs)-prod $(lsb_release -cs) main\" > /etc/apt/sources.list.d/dotnetdev.list'"], check=True, shell=True)
    subprocess.run(["sudo", "apt-get", "update"], check=True)
    subprocess.run(["sudo", "apt-get", "install", "azure-functions-core-tools-3"], check=True)
    # Install Azure CLI on Linux
    subprocess.run(["curl", "-sL", "https://aka.ms/InstallAzureCLIDeb", "|", "sudo", "bash"], check=True, shell=True)
else:
    # Raise an error if the operating system is not supported
    raise OSError("Unsupported operating system")

# Verify the installation of Azure Functions Core Tools
subprocess.run(["func", "--version"], check=True)
# Verify the installation of Azure CLI
subprocess.run(["az", "--version"], check=True)

subprocess.run([
    "az", "login"
], check=True)

Now we need to create a local.settings.json file, which contains the key environment variables for Azure.

local_settings_content = f"""
{{
  "IsEncrypted": false,
  "Values": {{
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "OPENAI_API_KEY": "{openai_api_key}",
    "EMBEDDINGS_MODEL": "{embeddings_model}",
    "SEARCH_SERVICE_API_KEY": "{search_service_api_key}",
  }}
}}
"""

with open("local.settings.json", "w") as file:
    file.write(local_settings_content)

Check the local.settings.json file and confirm that the environment variables match what you expect.

Now, name your app below, and you are ready to create the Function App and then publish your function.

# Replace this with your own values. This name will appear in the URL of the API call https://<app_name>.azurewebsites.net
app_name = "<app-name>"

subprocess.run([
    "az", "functionapp", "create",
    "--resource-group", resource_group,
    "--consumption-plan-location", region,
    "--runtime", "python",
    "--name", app_name,
    "--storage-account", storage_account_name,
    "--os-type", "Linux",
], check=True)

Once the Function App is created, we want to add the configuration variables to the Function App for use in the function. Specifically, we need OPENAI_API_KEY, SEARCH_SERVICE_API_KEY, and EMBEDDINGS_MODEL, as these are all used in the function_app.py code.

# Collect the relevant environment variables 
env_vars = {
    "OPENAI_API_KEY": openai_api_key,
    "SEARCH_SERVICE_API_KEY": search_service_api_key,
    "EMBEDDINGS_MODEL": embeddings_model
}

# Create the settings argument for the az functionapp create command
settings_args = []
for key, value in env_vars.items():
    settings_args.append(f"{key}={value}")

subprocess.run([
    "az", "functionapp", "config", "appsettings", "set",
    "--name", app_name,
    "--resource-group", resource_group,
    "--settings", *settings_args
], check=True)

We are now ready to publish the function code, function_app.py, to the Azure Function. This may take up to 10 minutes to deploy. Once this is done, we have an API endpoint built with an Azure Function on top of Azure AI Search.

subprocess.run([
    "func", "azure", "functionapp", "publish", app_name
], check=True)
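
Once deployment completes, a quick smoke test like the sketch below can confirm the endpoint responds. This assumes the requests library is installed; the request body mirrors the OpenAPI spec in the next section:

import requests

# Hypothetical smoke test of the deployed endpoint; adjust the values as needed
response = requests.post(
    f"https://{app_name}.azurewebsites.net/api/vector_similarity_search",
    json={
        "search_service_endpoint": search_service_endpoint,
        "index_name": index_name,
        "query": "What model should I use to embed?",
        "k_nearest_neighbors": 3,
        "search_column": "content_vector",
        "use_hybrid_query": True,
        "category": "models",
    },
)
print(response.status_code)
print(response.json())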

Input in a Custom GPT in ChatGPT

Now that we have an Azure Function that queries this vector search index, let's put it in a GPT Action!

See the documentation here on GPTs and here on GPT Actions. Use the below as the instructions for the GPT and as the OpenAPI spec for the GPT Action.

Create OpenAPI Spec

Below is a sample OpenAPI spec. When we run the block below, a functional spec should be copied to the clipboard to paste into the GPT Action.

Note that this does not have any authentication by default, but you can set up Azure Functions with OAuth by following the pattern in the Authentication section of this cookbook or looking at the documentation here.


spec = f"""
openapi: 3.1.0
info:
  title: Vector Similarity Search API
  description: API for performing vector similarity search.
  version: 1.0.0
servers:
  - url: https://{app_name}.azurewebsites.net/api
    description: Main (production) server
paths:
  /vector_similarity_search:
    post:
      operationId: vectorSimilaritySearch
      summary: Perform a vector similarity search.
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                search_service_endpoint:
                  type: string
                  description: The endpoint of the search service.
                index_name:
                  type: string
                  description: The name of the search index.
                query:
                  type: string
                  description: The search query.
                k_nearest_neighbors:
                  type: integer
                  description: The number of nearest neighbors to return.
                search_column:
                  type: string
                  description: The name of the search column.
                use_hybrid_query:
                  type: boolean
                  description: Whether to use a hybrid query.
                category:
                  type: string
                  description: category to filter.
              required:
                - search_service_endpoint
                - index_name
                - query
                - k_nearest_neighbors
                - search_column
                - use_hybrid_query
      responses:
        '200':
          description: A successful response with the search results.
          content:
            application/json:
              schema:
                type: object
                properties:
                  results:
                    type: array
                    items:
                      type: object
                      properties:
                        id:
                          type: string
                          description: The identifier of the result item.
                        score:
                          type: number
                          description: The similarity score of the result item.
                        content:
                          type: object
                          description: The content of the result item.
        '400':
          description: Bad request due to missing or invalid parameters.
        '500':
          description: Internal server error.
"""
pyperclip.copy(spec)
print("OpenAPI spec copied to clipboard")
print(spec)

创建 GPT 指令

随意修改指令以适合您的需求。查看我们的此处文档,了解有关提示工程的一些技巧。

instructions = f'''
You are an OAI docs assistant. You have an action in your knowledge base where you can make a POST request to search for information. The POST request should always include: {{
    "search_service_endpoint": "{search_service_endpoint}",
    "index_name": {index_name},
    "query": "<user_query>",
    "k_nearest_neighbors": 1,
    "search_column": "content_vector",
    "use_hybrid_query": true,
    "category": "<category>"
}}. Only the query and category change based on the user's request. Your goal is to assist users by performing searches using this POST request and providing them with relevant information based on the query.

You must only include knowledge you get from your action in your response.
The category must be from the following list: {categories}, which you should determine based on the user's query. If you cannot determine, then do not include the category in the POST request.
'''
pyperclip.copy(instructions)
print("GPT Instructions copied to clipboard")
print(instructions)

We now have a GPT that queries the vector database!

Recap

We have now successfully integrated Azure AI Search with GPT Actions in ChatGPT by doing the following:

  1. Embedded our data using OpenAI's embeddings, while adding some additional metadata using gpt-4o.
  2. Uploaded that data to Azure AI Search.
  3. Created an endpoint to query it using Azure Functions.
  4. Incorporated it into a custom GPT.

Our GPT can now retrieve information to help answer user queries, making it much more accurate and customized to our data. Here is the GPT in action:

azure-rag-quickstart-gpt.png