Question answering using LangChain, Deep Lake, and OpenAI

Sep 30, 2023

This notebook shows how to implement a question answering system with LangChain, Deep Lake as a vector store, and OpenAI embeddings. We will take the following steps to achieve this:

  1. Load a Deep Lake text dataset
  2. Initialize a Deep Lake vector store with LangChain
  3. Add text to the vector store
  4. Run queries on the database
  5. Done!

You can also follow other tutorials, such as question answering over any type of data (PDFs, json, csv, text): chat with any data stored in Deep Lake, code understanding, question answering over PDFs, or recommending songs.

!pip install deeplake langchain openai tiktoken
import getpass
import os

os.environ['OPENAI_API_KEY'] = getpass.getpass()
··········
import deeplake

ds = deeplake.load("hub://activeloop/cohere-wikipedia-22-sample")
ds.summary()
Opening dataset in read-only mode as you don't have write permissions.
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/activeloop/cohere-wikipedia-22-sample

hub://activeloop/cohere-wikipedia-22-sample loaded successfully.

Dataset(path='hub://activeloop/cohere-wikipedia-22-sample', read_only=True, tensors=['ids', 'metadata', 'text'])

  tensor    htype     shape      dtype  compression
 -------   -------   -------    -------  ------- 
   ids      text    (20000, 1)    str     None   
 metadata   json    (20000, 1)    str     None   
   text     text    (20000, 1)    str     None   
 


Let's look at some examples:

ds[:3].text.data()["value"]
['The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time. Also, the international standard notation of time (ISO 8601) is based on this format.',
 'A time in the 24-hour clock is written in the form hours:minutes (for example, 01:23), or hours:minutes:seconds (01:23:45). Numbers under 10 have a zero in front (called a leading zero); e.g. 09:07. Under the 24-hour clock system, the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and ends at 24:00, which is identical to 00:00 of the following day. 12:00 can only be mid-day. Midnight is called 24:00 and is used to mean the end of the day and 00:00 is used to mean the beginning of the day. For example, you would say "Tuesday at 24:00" and "Wednesday at 00:00" to mean exactly the same time.',
 'However, the US military prefers not to say 24:00 - they do not like to have two names for the same thing, so they always say "23:59", which is one minute before midnight.']
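
Each passage also carries a metadata entry, which we will pass to the vector store later. It can be inspected with the same slicing API; a quick, illustrative peek (the exact fields depend on the dataset):

ds[:3].metadata.data()["value"]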
dataset_path = 'wikipedia-embeddings-deeplake'

We'll set up OpenAI's text-embedding-3-small as our embedding function and initialize the Deep Lake vector store at dataset_path...

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

embedding = OpenAIEmbeddings(model="text-embedding-3-small")
db = DeepLake(dataset_path, embedding=embedding, overwrite=True)


... and populate it with samples, one batch at a time, using the add_texts method.

from tqdm.auto import tqdm

batch_size = 100

nsamples = 10  # for testing. Replace with len(ds) to append everything
for i in tqdm(range(0, nsamples, batch_size)):
    # find end of batch
    i_end = min(nsamples, i + batch_size)

    batch = ds[i:i_end]
    id_batch = batch.ids.data()["value"]
    text_batch = batch.text.data()["value"]
    meta_batch = batch.metadata.data()["value"]

    db.add_texts(text_batch, metadatas=meta_batch, ids=id_batch)
  0%|          | 0/1 [00:00<?, ?it/s]
creating embeddings:   0%|          | 0/1 [00:00<?, ?it/s]
creating embeddings: 100%|██████████| 1/1 [00:02<00:00,  2.11s/it]

100%|██████████| 10/10 [00:00<00:00, 462.42it/s]
Dataset(path='wikipedia-embeddings-deeplake', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (10, 1)      str     None   
 metadata     json      (10, 1)      str     None   
 embedding  embedding  (10, 1536)  float32   None   
    id        text      (10, 1)      str     None   


Run user queries on the database

The underlying Deep Lake dataset object is accessible through db.vectorstore.dataset, and the data structure can be summarized using db.vectorstore.summary(), which shows 4 tensors with 10 samples.

db.vectorstore.summary()
Dataset(path='wikipedia-embeddings-deeplake', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (10, 1)      str     None   
 metadata     json      (10, 1)      str     None   
 embedding  embedding  (10, 1536)  float32   None   
    id        text      (10, 1)      str     None   
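
The same handle also exposes the raw tensors directly. As a minimal sketch (tensor names taken from the summary above; raw_ds is just a local alias introduced here):

raw_ds = db.vectorstore.dataset
print(len(raw_ds))                        # 10 samples
print(raw_ds.embedding[0].numpy().shape)  # a single 1536-dimensional embedding vector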

We'll now set up QA on our vector store, using GPT-3.5-Turbo as the LLM.

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Re-load the vector store in case it's no longer initialized
# db = DeepLake(dataset_path = dataset_path, embedding_function=embedding)

qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model='gpt-3.5-turbo'), chain_type="stuff", retriever=db.as_retriever())

Let's try running a prompt and check the output. Internally, this API performs an embedding search to find the most relevant data to feed into the LLM context.

query = 'Why does the military not say 24:00?'
qa.run(query)
'The military prefers not to say 24:00 because they do not like to have two names for the same thing. Instead, they always say "23:59", which is one minute before midnight.'
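
To see that retrieval step on its own, you can also query the vector store directly with LangChain's similarity_search; a minimal sketch (k=3 is an arbitrary choice for illustration):

docs = db.similarity_search(query, k=3)
for doc in docs:
    print(doc.page_content[:100])  # first 100 characters of each retrieved passage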

Voila!