Kusto as a Vector Database for Embeddings

May 10, 2023

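This notebook provides step-by-step instructions for using Azure Data Explorer (Kusto) as a vector database with OpenAI embeddings.

It walks through an end-to-end process:

  1. Use precomputed embeddings created by the OpenAI API.
  2. Store the embeddings in Kusto.
  3. Convert a raw text query to an embedding with the OpenAI API.
  4. Use Kusto to perform a cosine similarity search over the stored embeddings.
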
%pip install wget

%pip install openai

%pip install azure-kusto-data

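In this section we load prepared embedding data, so you do not have to spend your own credits recomputing the embeddings of the Wikipedia articles.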

import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)
'vector_database_wikipedia_articles_embedded.zip'

import zipfile

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("/lakehouse/default/Files/data")
import pandas as pd

from ast import literal_eval

article_df = pd.read_csv('/lakehouse/default/Files/data/vector_database_wikipedia_articles_embedded.csv')
# Read vectors from strings back into a list
article_df["title_vector"] = article_df.title_vector.apply(literal_eval)
article_df["content_vector"] = article_df.content_vector.apply(literal_eval)
article_df.head()
| | id | url | title | text | title_vector | content_vector | vector_id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year, with 30 days in the Gregorian calendar. | [0.001009464613161981, -0.020700545981526375, ... | [-0.011253940872848034, -0.013491976074874401,... | 0 |
| 1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year, with 31 days in the Gregorian calendar. | [0.0009286514250561595, 0.000820168002974242, ... | [0.0003609954728744924, 0.007262262050062418, ... | 1 |
| 2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imagination or technical skill. | [0.003393713850528002, 0.0061537534929811954, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2 |
| 3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alphabet. | [0.0153952119871974, -0.013759135268628597, 0.... | [0.024894846603274345, -0.022186409682035446, ... | 3 |
| 4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a mixture of many gases. | [0.02224554680287838, -0.02044147066771984, -0... | [0.021524671465158463, 0.018522677943110466, -... | 4 |
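
The CSV stores each embedding as a string, which is why `literal_eval` is applied to both vector columns above. A minimal sketch of that parsing step, using a made-up three-element value as a stand-in for a real CSV cell (the real vectors are much longer):

```python
from ast import literal_eval

# Hypothetical, shortened stand-in for one CSV cell; not a real embedding.
raw_cell = "[0.0010, -0.0207, 0.0153]"

# literal_eval only parses Python literals, so it does not execute arbitrary code.
vector = literal_eval(raw_cell)

print(type(vector).__name__, len(vector))
```

After this step each cell holds a Python list of floats rather than a string, which is the shape Spark needs when the DataFrame is converted later.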

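Create a table and load the vectors into Kusto based on the contents of the DataFrame. The Spark option "CreateIfNotExist" creates the table automatically if it does not already exist.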

# replace with your AAD Tenant ID, Kusto Cluster URI, Kusto DB name and Kusto Table
AAD_TENANT_ID = ""
KUSTO_CLUSTER =  ""
KUSTO_DATABASE = "Vector"
KUSTO_TABLE = "Wiki"

kustoOptions = {"kustoCluster": KUSTO_CLUSTER, "kustoDatabase" :KUSTO_DATABASE, "kustoTable" : KUSTO_TABLE }

# Replace the auth method based on your desired authentication mechanism  - https://github.com/Azure/azure-kusto-spark/blob/master/docs/Authentication.md
access_token=mssparkutils.credentials.getToken(kustoOptions["kustoCluster"])
#Pandas data frame to spark dataframe
sparkDF=spark.createDataFrame(article_df)
# Write data to a Kusto table
sparkDF.write. \
format("com.microsoft.kusto.spark.synapse.datasource"). \
option("kustoCluster",kustoOptions["kustoCluster"]). \
option("kustoDatabase",kustoOptions["kustoDatabase"]). \
option("kustoTable", kustoOptions["kustoTable"]). \
option("accessToken", access_token). \
option("tableCreateOptions", "CreateIfNotExist").\
mode("Append"). \
save()
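
If you would rather create the table explicitly than rely on "CreateIfNotExist", the Kusto management command can be assembled as a string. The sketch below is an illustration only: the column-to-type mapping is an assumption inferred from the DataFrame columns shown earlier, not something this notebook specifies, and `create_table_command` is a hypothetical helper.

```python
# Assumed mapping from DataFrame columns to Kusto types (hypothetical).
WIKI_SCHEMA = {
    "id": "long",
    "url": "string",
    "title": "string",
    "text": "string",
    "title_vector": "dynamic",
    "content_vector": "dynamic",
    "vector_id": "long",
}


def create_table_command(table, schema):
    """Build a `.create table` management command from a name-to-type mapping."""
    columns = ", ".join(f"{name}:{kql_type}" for name, kql_type in schema.items())
    return f".create table {table} ({columns})"


command = create_table_command("Wiki", WIKI_SCHEMA)
print(command)
```

Once a `KustoClient` is available (one is created later in this notebook), a string like this could be sent with its `execute_mgmt` method. Verify the column types against what the Spark connector writes before using it.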
import openai
openai.api_version = '2022-12-01'
openai.api_base = '' # Please add your endpoint here
openai.api_type = 'azure'
openai.api_key = ''  # Please add your api key here

def embed(query):
    # Creates embedding vector from user query
    embedded_query = openai.Embedding.create(
            input=query,
            deployment_id="embed", #replace with your deployment id
            chunk_size=1
    )["data"][0]["embedding"]
    return embedded_query

Run this cell only if you plan to use OpenAI (rather than Azure OpenAI) for embeddings.

openai.api_key = ""


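def embed(query):
    # Creates embedding vector from user query.
    # Query vectors are only comparable to stored vectors produced by the
    # same embedding model, so match the model used for the dataset.
    embedded_query = openai.Embedding.create(
        input=query,
        model="text-embedding-3-small",
    )["data"][0]["embedding"]
    return embedded_query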

searchedEmbedding = embed("places where you worship")
#print(searchedEmbedding)
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table
import pandas as pd
KCSB = KustoConnectionStringBuilder.with_aad_device_authentication(
    KUSTO_CLUSTER)
KCSB.authority_id = AAD_TENANT_ID
KUSTO_CLIENT = KustoClient(KCSB)
KUSTO_QUERY = "Wiki | extend similarity = series_cosine_similarity_fl(dynamic("+str(searchedEmbedding)+"), content_vector,1,1) | top 10 by similarity desc "

RESPONSE = KUSTO_CLIENT.execute(KUSTO_DATABASE, KUSTO_QUERY)
df = dataframe_from_result_table(RESPONSE.primary_results[0])
df
| | id | url | title | text | title_vector | content_vector | vector_id | similarity |
|---|---|---|---|---|---|---|---|---|
| 0 | 852 | https://simple.wikipedia.org/wiki/Temple | Temple | A temple is a place where people go to pray and worship. | [-0.021837441250681877, -0.007722342386841774,... | [-0.0019541378132998943, 0.007151313126087189,... | 413 | 0.834495 |
| 1 | 78094 | https://simple.wikipedia.org/wiki/Christian%20... | Christian worship | In Christianity, worship is considered the first duty of Christians toward God. | [0.0017675267299637198, -0.008890199474990368,... | [0.020530683919787407, 0.0024345638230443, -0.... | 20320 | 0.832132 |
| 2 | 59154 | https://simple.wikipedia.org/wiki/Service%20of... | Service of worship | A service of worship is a religious gathering where people come together to worship. | [-0.007969820871949196, 0.0004240311391185969,... | [0.003784010885283351, -0.0030924836173653603,... | 15519 | 0.831633 |
| 3 | 51910 | https://simple.wikipedia.org/wiki/Worship | Worship | Worship is a word often used in religion. It refers to the act of showing awe and respect toward God or a god. | [0.0036036288365721703, -0.01276545226573944, ... | [0.007925753481686115, -0.0110504487529397, 0.... | 14010 | 0.828185 |
| 4 | 29576 | https://simple.wikipedia.org/wiki/Altar | Altar | An altar is a place, usually a table, where religious ceremonies are held. | [0.007887467741966248, -0.02706138789653778, -... | [0.023901859298348427, -0.031175222247838977, ... | 8708 | 0.824124 |
| 5 | 92507 | https://simple.wikipedia.org/wiki/Shrine | Shrine | A shrine is a holy or sacred place that holds some important religious objects. | [-0.011601685546338558, 0.006366696208715439, ... | [0.016423320397734642, -0.0015560361789539456,... | 23945 | 0.823863 |
| 6 | 815 | https://simple.wikipedia.org/wiki/Synagogue | Synagogue | A synagogue is a place where Jews gather to pray and worship. | [-0.017317570745944977, 0.0022673190105706453,... | [-0.004515442531555891, 0.003739549545571208, ... | 398 | 0.819942 |
| 7 | 68080 | https://simple.wikipedia.org/wiki/Shinto%20shrine | Shinto shrine | A Shinto shrine is a sacred place or site where the spirits of Shinto (kami) dwell. | [0.0035740730818361044, 0.0028098472394049168,... | [0.011014971882104874, 0.00042272370774298906,... | 18106 | 0.818475 |
| 8 | 57790 | https://simple.wikipedia.org/wiki/Chapel | Chapel | A chapel is a place of Christian worship. The word "chapel" has different meanings in different Christian traditions. | [-0.01371884811669588, 0.0031672674231231213, ... | [0.002526090247556567, 0.02482965588569641, 0.... | 15260 | 0.817608 |
| 9 | 142 | https://simple.wikipedia.org/wiki/Church%20%28... | Church (building) | A church is a building constructed for Christian religious worship. | [0.0021336888894438744, 0.0029748091474175453,... | [0.016109377145767212, 0.022908871993422508, 0... | 74 | 0.812636 |
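
The similarity column above holds the cosine similarity between the query embedding and each stored vector, and the query keeps the ten highest scores. As a rough illustration of that ranking idea (not the Kusto implementation), here is a pure Python sketch on made-up two-dimensional vectors; `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, rows, k):
    """Score every (title, vector) row against the query and keep the best k."""
    scored = [(title, cosine_similarity(query, vec)) for title, vec in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data invented for illustration; real embeddings have far more dimensions.
rows = [("A", [1.0, 0.0]), ("B", [0.0, 1.0]), ("C", [1.0, 1.0])]
result = top_k([1.0, 0.0], rows, k=2)
print(result)
```

Row "A" points in the same direction as the query and ranks first, "C" is at 45 degrees and ranks second, and the orthogonal row "B" is dropped. Running the search inside Kusto avoids pulling every stored vector back to the client to do this scoring.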
searchedEmbedding = embed("unfortunate events in history")
KUSTO_QUERY = "Wiki | extend similarity = series_cosine_similarity_fl(dynamic("+str(searchedEmbedding)+"), title_vector,1,1) | top 10 by similarity desc "
RESPONSE = KUSTO_CLIENT.execute(KUSTO_DATABASE, KUSTO_QUERY)

df = dataframe_from_result_table(RESPONSE.primary_results[0])
df
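| | id | url | title | text | title_vector | content_vector | vector_id | similarity |
|---|---|---|---|---|---|---|---|---|
| 0 | 848 | https://simple.wikipedia.org/wiki/Tragedy | Tragedy | In theatre, tragedy as defined by Aristotle is a type of drama that imitates a noble and complete action. | [-0.019502468407154083, -0.010160734876990318,... | [-0.012951433658599854, -0.018836138769984245,... | 410 | 0.851848 |
| 1 | 4469 | https://simple.wikipedia.org/wiki/The%20Holocaust | The Holocaust | The Holocaust, sometimes also called the Shoah (), was a genocide that took place in Europe in the 1930s and 1940s, in which Nazi Germany and its collaborators killed about six million Jews. | [-0.030233195051550865, -0.024401605129241943,... | [-0.016398731619119644, -0.013267949223518372,... | 1203 | 0.847222 |
| 2 | 64216 | https://simple.wikipedia.org/wiki/List%20of%20... | List of historical plagues | This list contains famous or well-documented plagues and epidemics. | [-0.010667890310287476, -0.0003575817099772393... | [-0.010863155126571655, -0.0012196656316518784... | 16859 | 0.844411 |
| 3 | 4397 | https://simple.wikipedia.org/wiki/List%20of%20... | List of disasters | This is a list of disasters, both natural and man-made. | [-0.02713736332952976, -0.005278210621327162, ... | [-0.023679986596107483, -0.006126823835074902,... | 1158 | 0.843063 |
| 4 | 23073 | https://simple.wikipedia.org/wiki/Disaster | Disaster | A disaster is something very bad that happens in a short time and causes a lot of harm. | [-0.018235962837934497, -0.020034968852996823,... | [-0.02504003793001175, 0.007415903266519308, 0... | 7251 | 0.840334 |
| 5 | 4382 | https://simple.wikipedia.org/wiki/List%20of%20... | List of terrorist incidents | The following is a list, by date, of acts of terrorism and failed acts. | [-0.03989032283425331, -0.012808636762201786, ... | [-0.045838188380002975, -0.01682935282588005, ... | 1149 | 0.836162 |
| 6 | 13528 | https://simple.wikipedia.org/wiki/A%20Series%2... | A Series of Unfortunate Events | A Series of Unfortunate Events is a series of children's novels written by Daniel Handler under the pen name Lemony Snicket. | [0.0010618815431371331, -0.0267023965716362, -... | [0.002801976166665554, -0.02904471382498741, -... | 4347 | 0.835172 |
| 7 | 42874 | https://simple.wikipedia.org/wiki/History%20of... | History of the world | The history of the world (also called human history) is the memory, discovery, collection, organization, presentation, and interpretation of humanity's past. | [0.0026915925554931164, -0.022206028923392296,... | [0.013645033352077007, -0.005165994167327881, ... | 11672 | 0.830243 |
| 8 | 4452 | https://simple.wikipedia.org/wiki/Accident | Accident | An accident is something that happens when things go wrong without being planned. | [-0.004075294826179743, -0.0059883203357458115... | [0.00926120299845934, 0.013705797493457794, 0.... | 1190 | 0.826898 |
| 9 | 324 | https://simple.wikipedia.org/wiki/History | History | History is the study of past events. People learn about history by looking at written documents and artifacts. | [0.006603690329939127, -0.011856242083013058, ... | [0.0048830462619662285, 0.0032003086525946856,... | 170 | 0.824645 |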
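
The two searches above differ only in the query text and in which vector column they compare against (`content_vector` versus `title_vector`). A small helper can make that explicit. This is a sketch, not part of the original notebook: `build_similarity_query` is a hypothetical name, and it reproduces the same query shape used above, including the trailing `1, 1` arguments, without interpreting them.

```python
import json


def build_similarity_query(table, vector_column, embedding, k=10):
    """Assemble the KQL text for a top-k cosine similarity search."""
    # json.dumps renders the list as a bracketed array literal for dynamic(...).
    vector_literal = json.dumps(list(embedding))
    return (
        f"{table} "
        f"| extend similarity = series_cosine_similarity_fl("
        f"dynamic({vector_literal}), {vector_column}, 1, 1) "
        f"| top {k} by similarity desc"
    )


# Toy three-element vector invented for illustration.
query = build_similarity_query("Wiki", "title_vector", [0.1, -0.2, 0.3], k=5)
print(query)
```

With a helper like this, the second search would become `build_similarity_query(KUSTO_TABLE, "title_vector", embed("unfortunate events in history"))`, passed to `KUSTO_CLIENT.execute` as before. The table and column names are interpolated directly into the query text, so they should come from trusted code rather than user input.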