Financial Document Analysis with LlamaIndex

June 22, 2023

In this example notebook, we show how to perform financial analysis over 10-K documents with the LlamaIndex framework in just a few lines of code.

Notebook Outline

Introduction
Setup
Data Loading and Indexing
Simple QA
Advanced QA - Compare and Contrast

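Introduction

LlamaIndex

LlamaIndex is a data framework for LLM applications. You can get started with just a few lines of code and build a retrieval-augmented generation (RAG) system in minutes. For more advanced users, LlamaIndex offers a rich toolkit for ingesting and indexing your data, modules for retrieval and re-ranking, and composable components for building custom query engines.

For more details, see the full documentation.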

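Financial Analysis over 10-K Documents

A key part of a financial analyst's job is to extract information and synthesize insights from long financial documents. A good example is the Form 10-K, an annual report required by the U.S. Securities and Exchange Commission (SEC) that gives a comprehensive summary of a company's financial performance. These documents typically run to hundreds of pages and contain domain-specific terminology, which makes them hard for a layperson to digest quickly.

We show how LlamaIndex can help a financial analyst quickly extract information and synthesize insights across multiple documents with very little coding.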

Setup

First, we need to install the llama-index library, along with pypdf for parsing the PDF files.

!pip install llama-index pypdf

Now, we import all the modules used in this tutorial.

from langchain import OpenAI

from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index import set_global_service_context
from llama_index.response.pprint_utils import pprint_response
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine

Before we start, we can configure the LLM provider and model that will power our RAG system.
Here, we pick gpt-3.5-turbo-instruct from OpenAI.

llm = OpenAI(temperature=0, model_name="gpt-3.5-turbo-instruct", max_tokens=-1)

We construct a ServiceContext and set it as the global default, so that all subsequent operations that depend on LLM calls use the model we configured here.

service_context = ServiceContext.from_defaults(llm=llm)
set_global_service_context(service_context=service_context)

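Data Loading and Indexing

Now, we load and parse two PDFs: the 2021 Uber 10-K and the 2021 Lyft 10-K.
Under the hood, each PDF is converted into plain-text Document objects, one per page.

Note: this operation might take a while to run, since each document is more than 100 pages long.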

lyft_docs = SimpleDirectoryReader(input_files=["../data/10k/lyft_2021.pdf"]).load_data()
uber_docs = SimpleDirectoryReader(input_files=["../data/10k/uber_2021.pdf"]).load_data()

print(f'Loaded lyft 10-K with {len(lyft_docs)} pages')
print(f'Loaded Uber 10-K with {len(uber_docs)} pages')

Loaded lyft 10-K with 238 pages
Loaded Uber 10-K with 307 pages

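Now, we can build an (in-memory) VectorStoreIndex over the documents we have loaded.

Note: this operation might take a while to run, since it calls the OpenAI API to compute vector embeddings for the document chunks.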

lyft_index = VectorStoreIndex.from_documents(lyft_docs)
uber_index = VectorStoreIndex.from_documents(uber_docs)
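
To give a feel for what "document chunks" means here, the following is a minimal, library-free sketch of splitting page text into overlapping word windows. The function name `chunk_words`, the sample sentence, and the window sizes are all hypothetical; LlamaIndex's own node parser is more sophisticated, and its defaults are not reproduced here.

```python
def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical page text, for illustration only.
page_text = (
    "Revenue increased as ride volumes recovered from the "
    "pandemic lows seen in the prior year"
)
chunks = chunk_words(page_text)
for chunk in chunks:
    print(chunk)
```

Each chunk would then be embedded separately; the overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.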

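Simple QA

Now we are ready to run some queries against our indices!
To do so, we first configure a QueryEngine, which simply captures a set of configurations describing how we want to query the underlying index.

For a VectorStoreIndex, the most commonly tuned setting is similarity_top_k, which controls how many document chunks (which we call Node objects) are retrieved as context for answering our question.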

lyft_engine = lyft_index.as_query_engine(similarity_top_k=3)

uber_engine = uber_index.as_query_engine(similarity_top_k=3)
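
As an illustration of what a top-k setting does, here is a minimal sketch of similarity-based retrieval over toy vectors. It assumes cosine similarity as the scoring function; the chunk names, the 3-dimensional vectors, and the helper `top_k` are hypothetical and only stand in for real embeddings and for the index's internal retriever.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy embeddings, invented for illustration.
toy_chunks = {
    "revenue_table": [0.9, 0.1, 0.0],
    "risk_factors": [0.1, 0.9, 0.1],
    "revenue_discussion": [0.8, 0.3, 0.1],
    "legal_proceedings": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

retrieved = top_k(query, toy_chunks, k=2)
print(retrieved)
```

A larger k gives the LLM more context to draw on, at the cost of a longer prompt and a higher chance of including irrelevant chunks.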

Let's see some queries in action!

response = await lyft_engine.aquery('What is the revenue of Lyft in 2021? Answer in millions with page reference')

print(response)

$3,208.3 million (page 63)

response = await uber_engine.aquery('What is the revenue of Uber in 2021? Answer in millions, with page reference')

print(response)

$17,455 (page 53)
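
The cells above use top-level `await`, which works in a notebook because an event loop is already running. In a plain Python script, the coroutine would need to be driven explicitly, for example with `asyncio.run`. The sketch below shows that pattern and also runs both queries concurrently with `asyncio.gather`. `StubEngine` is a hypothetical stand-in that returns canned strings; only its `aquery` method name mirrors the engines used above.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for a query engine; returns a canned answer."""

    def __init__(self, name: str):
        self.name = name

    async def aquery(self, question: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return f"[{self.name}] answer to: {question}"


async def ask_both(engines, question):
    """Send the same question to every engine concurrently."""
    return await asyncio.gather(*(engine.aquery(question) for engine in engines))


# asyncio.run starts an event loop, runs the coroutine, and closes the loop.
answers = asyncio.run(
    ask_both([StubEngine("lyft"), StubEngine("uber")], "What is the revenue in 2021?")
)
for answer in answers:
    print(answer)
```

`asyncio.gather` returns results in the order the coroutines were passed in, so the answers line up with the list of engines.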

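Advanced QA - Compare and Contrast

More complex financial analysis often requires referencing multiple documents.

As an example, let's look at how to run compare-and-contrast queries over the Lyft and Uber financials.
To do so, we build a SubQuestionQueryEngine, which breaks a complex compare-and-contrast query down into simpler sub-questions and executes each one on the sub query engine backed by the corresponding index.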

query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine, 
        metadata=ToolMetadata(name='lyft_10k', description='Provides information about Lyft financials for year 2021')
    ),
    QueryEngineTool(
        query_engine=uber_engine, 
        metadata=ToolMetadata(name='uber_10k', description='Provides information about Uber financials for year 2021')
    ),
]

s_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)
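
The decompose, route, and combine flow described above can be sketched in plain Python. This is not the SubQuestionQueryEngine implementation: the `Tool` class, `decompose`, and `answer_with_sub_questions` are hypothetical names, the decomposition is a fixed rule rather than an LLM call, and the answers are canned strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tool:
    """Hypothetical stand-in for a query engine tool: a name plus a callable."""
    name: str
    answer: Callable[[str], str]


def decompose(question: str, tools: List[Tool]) -> List[Tuple[str, str]]:
    """Toy decomposition: one sub-question per tool.

    A real engine would ask an LLM to write the sub-questions and to pick
    a tool for each one based on the tool descriptions.
    """
    return [(tool.name, f"{question} (for {tool.name})") for tool in tools]


def answer_with_sub_questions(question: str, tools: List[Tool]) -> str:
    """Route each sub-question to its tool and join the partial answers."""
    tools_by_name = {tool.name: tool for tool in tools}
    partial_answers = []
    for tool_name, sub_question in decompose(question, tools):
        reply = tools_by_name[tool_name].answer(sub_question)
        partial_answers.append(f"[{tool_name}] {reply}")
    # A real engine would hand the partial answers to an LLM for synthesis.
    return "\n".join(partial_answers)


toy_tools = [
    Tool(name="lyft_10k", answer=lambda q: "canned Lyft answer"),
    Tool(name="uber_10k", answer=lambda q: "canned Uber answer"),
]
combined = answer_with_sub_questions("Compare revenue growth", toy_tools)
print(combined)
```

The tool names and descriptions matter in the real engine because they are what the LLM sees when deciding which sub-question goes to which index.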

Let's see these queries in action!

response = await s_engine.aquery('Compare and contrast the customer segments and geographies that grew the fastest')

Generated 4 sub questions.
[uber_10k] Q: What customer segments grew the fastest for Uber
[uber_10k] A: in 2021?

The customer segments that grew the fastest for Uber in 2021 were its Mobility Drivers, Couriers, Riders, and Eaters. These segments experienced growth due to the continued stay-at-home order demand related to COVID-19, as well as Uber's introduction of its Uber One, Uber Pass, Eats Pass, and Rides Pass membership programs. Additionally, Uber's marketplace-centric advertising helped to connect merchants and brands with its platform network, further driving growth.
[uber_10k] Q: What geographies grew the fastest for Uber
[uber_10k] A: 
Based on the context information, it appears that Uber experienced the most growth in large metropolitan areas, such as Chicago, Miami, New York City, Sao Paulo, and London. Additionally, Uber experienced growth in suburban and rural areas, as well as in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain.
[lyft_10k] Q: What customer segments grew the fastest for Lyft
[lyft_10k] A: 
The customer segments that grew the fastest for Lyft were ridesharing, light vehicles, and public transit. Ridesharing grew as Lyft was able to predict demand and proactively incentivize drivers to be available for rides in the right place at the right time. Light vehicles grew as users were looking for options that were more active, usually lower-priced, and often more efficient for short trips during heavy traffic. Public transit grew as Lyft integrated third-party public transit data into the Lyft App to offer users a robust view of transportation options around them.
[lyft_10k] Q: What geographies grew the fastest for Lyft
[lyft_10k] A: 
It is not possible to answer this question with the given context information.

print(response)

The customer segments that grew the fastest for Uber in 2021 were its Mobility Drivers, Couriers, Riders, and Eaters. These segments experienced growth due to the continued stay-at-home order demand related to COVID-19, as well as Uber's introduction of its Uber One, Uber Pass, Eats Pass, and Rides Pass membership programs. Additionally, Uber's marketplace-centric advertising helped to connect merchants and brands with its platform network, further driving growth. Uber experienced the most growth in large metropolitan areas, such as Chicago, Miami, New York City, Sao Paulo, and London. Additionally, Uber experienced growth in suburban and rural areas, as well as in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain.

The customer segments that grew the fastest for Lyft were ridesharing, light vehicles, and public transit. Ridesharing grew as Lyft was able to predict demand and proactively incentivize drivers to be available for rides in the right place at the right time. Light vehicles grew as users were looking for options that were more active, usually lower-priced, and often more efficient for short trips during heavy traffic. Public transit grew as Lyft integrated third-party public transit data into the Lyft App to offer users a robust view of transportation options around them. It is not possible to answer the question of which geographies grew the fastest for Lyft with the given context information.

In summary, Uber and Lyft both experienced growth in customer segments related to mobility, couriers, riders, and eaters. Uber experienced the most growth in large metropolitan areas, as well as in suburban and rural areas, and in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain. Lyft experienced the most growth in ridesharing, light vehicles, and public transit. It is not possible to answer the question of which geographies grew the fastest for Lyft with the given context information.

response = await s_engine.aquery('Compare revenue growth of Uber and Lyft from 2020 to 2021')

Generated 2 sub questions.
[uber_10k] Q: What is the revenue growth of Uber from 2020 to 2021
[uber_10k] A: 
The revenue growth of Uber from 2020 to 2021 was 57%, or 54% on a constant currency basis.
[lyft_10k] Q: What is the revenue growth of Lyft from 2020 to 2021
[lyft_10k] A: 
The revenue growth of Lyft from 2020 to 2021 is 36%, increasing from $2,364,681 thousand to $3,208,323 thousand.

print(response)

The revenue growth of Uber from 2020 to 2021 was 57%, or 54% on a constant currency basis, while the revenue growth of Lyft from 2020 to 2021 was 36%. This means that Uber had a higher revenue growth than Lyft from 2020 to 2021.
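
Because LLM answers can contain arithmetic slips, it is worth checking a reported percentage against the raw figures when they are available. The short sketch below recomputes Lyft's growth from the two revenue figures quoted in the sub-answer above; the helper `growth_pct` is a hypothetical name. The Uber figure is not rechecked here because the 2020 revenue does not appear in the output.

```python
def growth_pct(previous: float, current: float) -> float:
    """Year-over-year growth as a percentage of the previous value."""
    return (current - previous) / previous * 100


# Revenue in thousands of USD, as quoted in the lyft_10k sub-answer.
lyft_2020 = 2_364_681
lyft_2021 = 3_208_323

lyft_growth = growth_pct(lyft_2020, lyft_2021)
print(f"Lyft revenue growth 2020 to 2021: {lyft_growth:.1f}%")
# prints: Lyft revenue growth 2020 to 2021: 35.7%
```

The result rounds to the 36% reported by the query engine, so the answer is consistent with the figures it cites.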