Find Your Next Book
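In this notebook, we show how to use OpenAI to generate embeddings of book descriptions and then use those embeddings in Zilliz to find relevant books. The dataset in this example comes from Hugging Face Datasets and contains over 1 million title-description pairs.
Let's start by installing the libraries this notebook needs:
- openai is used to communicate with the OpenAI embedding service
- pymilvus is used to communicate with the Zilliz instance
- datasets is used to download the dataset
- tqdm is used for the progress bars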
! pip install openai pymilvus datasets tqdm
To get Zilliz up and running, follow the Zilliz setup instructions. Once your account and database are set up, fill in the following values:
import openai
URI = 'your_uri'
TOKEN = 'your_token' # TOKEN == user:password or api_key
COLLECTION_NAME = 'book_search'
DIMENSION = 1536
OPENAI_ENGINE = 'text-embedding-3-small'
openai.api_key = 'sk-your-key'
INDEX_PARAM = {
    'metric_type':'L2',
    'index_type':"AUTOINDEX",
    'params':{}
}
QUERY_PARAM = {
    "metric_type": "L2",
    "params": {},
}
BATCH_SIZE = 1000
This section covers Zilliz and setting up the database for this use case. In Zilliz, we need to create a collection and index it.
from pymilvus import connections, utility, FieldSchema, Collection, CollectionSchema, DataType
# Connect to Zilliz Database
connections.connect(uri=URI, token=TOKEN)
# Remove collection if it already exists
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)
# Create collection which includes the id, title, description, and embedding.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=64000),
    FieldSchema(name='description', dtype=DataType.VARCHAR, max_length=64000),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)
# Create the index on the collection and load it.
collection.create_index(field_name="embedding", index_params=INDEX_PARAM)
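collection.load()
With Zilliz up and running, we can start fetching our data. `Hugging Face Datasets` is a hub that hosts many user-contributed datasets; for this example we use Skelebor's book dataset. It contains title-description pairs for over 1 million books. We will embed each description and store it in Zilliz along with its title.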
import datasets
# Download the dataset and only use the `train` portion (file is around 800Mb)
dataset = datasets.load_dataset('Skelebor/book_titles_and_descriptions_en_clean', split='train')
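Now that the data is on our machine, we can begin embedding it and inserting it into Zilliz. The embedding function takes in text and returns the embeddings in list format.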
# Simple function that converts the texts to embeddings
def embed(texts):
    embeddings = openai.Embedding.create(
        input=texts,
        engine=OPENAI_ENGINE
    )
    return [x['embedding'] for x in embeddings['data']]
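The collection schema fixes the vector size at DIMENSION, so every vector handed to the insert step needs exactly that many floats. The following is a minimal sketch of a pre-insert check, not part of the original notebook: `check_dimension` is a hypothetical helper, and the toy vectors stand in for real embeddings so that no API call is made.

```python
DIMENSION = 1536  # same value as in the configuration cell


def check_dimension(vectors, dim=DIMENSION):
    """Raise ValueError if any vector's length differs from dim (hypothetical helper)."""
    for idx, vec in enumerate(vectors):
        if len(vec) != dim:
            raise ValueError(
                f"vector {idx} has {len(vec)} dimensions, expected {dim}"
            )
    return vectors


# Toy vectors stand in for the output of embed(); no API call is made here.
good = [[0.0] * DIMENSION, [1.0] * DIMENSION]
check_dimension(good)
```

One possible use is to wrap the embedding call, for example `check_dimension(embed(batch))`, so that a mismatch between the chosen model and DIMENSION surfaces before the insert.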
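The next step is the actual insert. Because there are so many data points, you can stop the insert cell early and move on if you want to test right away. Doing so may reduce the accuracy of the results because fewer data points are indexed, but they should still be good enough.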
from tqdm import tqdm
data = [
    [], # title
    [], # description
]
# Embed and insert in batches
for i in tqdm(range(0, len(dataset))):
    data[0].append(dataset[i]['title'])
    data[1].append(dataset[i]['description'])
    if len(data[0]) % BATCH_SIZE == 0:
        data.append(embed(data[1]))
        collection.insert(data)
        data = [[],[]]
# Embed and insert the remainder 
if len(data[0]) != 0:
    data.append(embed(data[1]))
    collection.insert(data)
    data = [[],[]]
0%| | 2999/1032335 [00:19<1:49:30, 156.66it/s]
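As an alternative to interrupting the cell by hand, the number of rows can be capped up front. The sketch below is one way to do that under stated assumptions: `column_batches` and its `limit` parameter are hypothetical names that do not appear in the original notebook, and the toy rows stand in for the Hugging Face dataset so the example runs without any network access.

```python
from itertools import islice


def column_batches(rows, batch_size, limit=None):
    """Yield ([titles], [descriptions]) batches from an iterable of row dicts.

    limit caps the total number of rows read; None reads everything.
    Hypothetical helper, not part of the original notebook.
    """
    titles, descriptions = [], []
    for row in islice(rows, limit):
        titles.append(row['title'])
        descriptions.append(row['description'])
        if len(titles) == batch_size:
            yield titles, descriptions
            titles, descriptions = [], []
    if titles:  # final partial batch
        yield titles, descriptions


# Toy rows stand in for the Hugging Face dataset.
rows = [{'title': f't{i}', 'description': f'd{i}'} for i in range(10)]
batches = list(column_batches(rows, batch_size=4, limit=6))
print([len(t) for t, _ in batches])  # [4, 2]
```

With the real data, each batch could then go through the same steps as the loop above, for example `collection.insert([titles, descriptions, embed(descriptions)])`.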
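With our data safely inserted into Zilliz, we can now run a query. The query function takes a string or a list of strings and searches for them. For each query it prints the description you provided, followed by the results, each with its score, title, and book description.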
import textwrap
def query(queries, top_k=5):
    # Accept a single string or a list of strings
    if not isinstance(queries, list):
        queries = [queries]
    res = collection.search(
        embed(queries),
        anns_field='embedding',
        param=QUERY_PARAM,
        limit=top_k,
        output_fields=['title', 'description'],
    )
    # res holds one list of hits per query, in query order
    for i, hits in enumerate(res):
        print('Description:', queries[i])
        print('Results:')
        for rank, hit in enumerate(hits, start=1):
            print('\t' + 'Rank:', rank, 'Score:', hit.score, 'Title:', hit.entity.get('title'))
            print(textwrap.fill(hit.entity.get('description'), 88))
            print()
query('Book about a k-9 from europe')
Description: Book about a k-9 from europe
Results:
	Rank: 1 Score: 0.3047754764556885 Title: Bark M For Murder
Who let the dogs out? Evildoers beware! Four of mystery fiction's top storytellers are setting the hounds on your trail -- in an incomparable quartet of crime stories with a canine edge. Man's (and woman's) best friends take the lead in this phenomenal collection of tales tense and surprising, humorous and thrilling: New York Timesbestselling author J.A. Jance's spellbinding saga of a scam-busting septuagenarian and her two golden retrievers; Anthony Award winner Virginia Lanier's pureblood thriller featuring bloodhounds and bloody murder; Chassie West's suspenseful stunner about a life-saving German shepherd and a ghastly forgotten crime; rising star Lee Charles Kelley's edge-of-your-seat yarn that pits an ex-cop/kennel owner and a yappy toy poodle against a craven killer.

	Rank: 2 Score: 0.3283390402793884 Title: Texas K-9 Unit Christmas: Holiday Hero\Rescuing Christmas
CHRISTMAS COMES WRAPPED IN DANGER Holiday Hero by Shirlee McCoy Emma Fairchild never expected to find trouble in sleepy Sagebrush, Texas. But when she's attacked and left for dead in her own diner, her childhood friend turned K-9 cop Lucas Harwood offers a chance at justice--and love. Rescuing Christmas by Terri Reed She escaped a kidnapper, but now a killer has set his sights on K-9 dog trainer Lily Anderson. When fellow officer Jarrod Evans appoints himself her bodyguard, Lily knows more than her life is at risk--so is her heart. Texas K-9 Unit: These lawmen solve the toughest cases with the help of their brave canine partners

	Rank: 3 Score: 0.33899369835853577 Title: Dogs on Duty: Soldiers' Best Friends on the Battlefield and Beyond
When the news of the raid on Osama Bin Laden's compound broke, the SEAL team member that stole the show was a highly trained canine companion. Throughout history, dogs have been key contributors to military units. Dorothy Hinshaw Patent follows man's best friend onto the battlefield, showing readers why dogs are uniquely qualified for the job at hand, how they are trained, how they contribute to missions, and what happens when they retire. With full-color photographs throughout and sidebars featuring heroic canines throughout history, Dogs on Duty provides a fascinating look at these exceptional soldiers and companions.

	Rank: 4 Score: 0.34207457304000854 Title: Toute Allure: Falling in Love in Rural France
After saying goodbye to life as a successful fashion editor in London, Karen Wheeler is now happy in her small village house in rural France. Her idyll is complete when she meets the love of her life - he has shaggy hair, four paws and a wet nose!

	Rank: 5 Score: 0.343595951795578 Title: Otherwise Alone (Evan Arden, #1)
Librarian's note: This is an alternate cover edition for ASIN: B00AP5NNWC. Lieutenant Evan Arden sits in a shack in the middle of nowhere, waiting for orders that will send him back home - if he ever gets them. Other than his loyal Great Pyrenees, there's no one around to break up the monotony. The tedium is excruciating, but it is suddenly interrupted when a young woman stumbles up his path. "It's only 50-something pages, but in that short amount of time, the author's awesome writing packs in a whole lotta character detail. And sets the stage for the series, perfectly." -Maryse.net, 4.5 Stars He has two choices - pick her off from a distance with his trusty sniper-rifle, or dare let her approach his cabin and enter his life. Why not? It's been ages, and he is otherwise alone...
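Because QUERY_PARAM uses the L2 metric, the Score column is a distance: smaller values mean a closer match, which is why Rank 1 carries the lowest score. The sketch below illustrates that ordering with toy 3-dimensional vectors in place of real embeddings. It assumes unit-length vectors and treats the score as a squared Euclidean distance; whether the service reports the squared or the plain distance is an assumption here, not something the notebook states. Under those assumptions the distance relates to cosine similarity as

    d^2 = 2 - 2 * cos(a, b)

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Toy 3-d vectors stand in for 1536-d embeddings.
query_vec = normalize([1.0, 0.0, 0.0])
candidates = {
    'near': normalize([0.9, 0.1, 0.0]),
    'far': normalize([0.0, 1.0, 0.0]),
}

# Sort ascending: smallest distance first.
ranked = sorted(candidates, key=lambda name: squared_l2(query_vec, candidates[name]))
print(ranked)  # ['near', 'far']
```

Read this way, the five results above are all fairly close to the query and to each other, since their scores span only about 0.30 to 0.34.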