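Recommendations are widespread across the web.
- "Bought that item? Try these similar items."
- "Enjoy that book? Try these similar titles."
- "Not the help page you were looking for? Try these similar pages."

This notebook demonstrates how to use embeddings to find similar items to recommend. In particular, we use AG's corpus of news articles as our dataset.
Our model will answer the question: given an article, what other articles are most similar to it?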
import pandas as pd
import pickle
from utils.embeddings_utils import (
    get_embedding,
    distances_from_embeddings,
    tsne_components_from_embeddings,
    chart_from_components,
    indices_of_nearest_neighbors_from_distances,
)
EMBEDDING_MODEL = "text-embedding-3-small"
Next, let's load the AG news data and see what it looks like.
# load data (full dataset available at http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html)
dataset_path = "data/AG_news_samples.csv"
df = pd.read_csv(dataset_path)
n_examples = 5
df.head(n_examples)
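| | title | description | label_int | label |
|---|---|---|---|---|
| 0 | World Briefings | BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister... | 1 | World |
| 1 | Nvidia Puts a Firewall on a Motherboard (PC Wo... | PC World - Upcoming chip set will include built-in... | 4 | Sci/Tech |
| 2 | Olympic joy in Greek, Chinese press | Newspapers in Greece reflect a mixture of exhilaration... | 2 | Sports |
| 3 | U2 Can iPod with Pictures | SAN JOSE, Calif. -- Apple Computer (Quote, Cha... | 4 | Sci/Tech |
| 4 | The Dream Factory | Any product, any shape, any size -- manufactured... | 4 | Sci/Tech |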
Let's take a look at those same examples, but not truncated by ellipses.
# print the title, description, and label of each example
for idx, row in df.head(n_examples).iterrows():
    print("")
    print(f"Title: {row['title']}")
    print(f"Description: {row['description']}")
    print(f"Label: {row['label']}")

Title: World Briefings
Description: BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister Tony Blair urged the international community to consider global warming a dire threat and agree on a plan of action to curb the quot;alarming quot; growth of greenhouse gases.
Label: World

Title: Nvidia Puts a Firewall on a Motherboard (PC World)
Description: PC World - Upcoming chip set will include built-in security features for your PC.
Label: Sci/Tech

Title: Olympic joy in Greek, Chinese press
Description: Newspapers in Greece reflect a mixture of exhilaration that the Athens Olympics proved successful, and relief that they passed off without any major setback.
Label: Sports

Title: U2 Can iPod with Pictures
Description: SAN JOSE, Calif. -- Apple Computer (Quote, Chart) unveiled a batch of new iPods, iTunes software and promos designed to keep it atop the heap of digital music players.
Label: Sci/Tech

Title: The Dream Factory
Description: Any product, any shape, any size -- manufactured on your desktop! The future is the fabricator. By Bruce Sterling from Wired magazine.
Label: Sci/Tech
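Before getting embeddings for these articles, let's set up a cache to save the embeddings we generate. In general, it's a good idea to save your embeddings so you can reuse them later. If you don't save them, you'll pay again each time you recompute them.
The cache is a dictionary that maps tuples of (text, model) to an embedding, which is a list of floats. The cache is saved as a Python pickle file.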
# establish a cache of embeddings to avoid recomputing
# cache is a dict of tuples (text, model) -> embedding, saved as a pickle file
# set path to embedding cache
embedding_cache_path = "data/recommendations_embeddings_cache.pkl"
# load the cache if it exists, and save a copy to disk
try:
    embedding_cache = pd.read_pickle(embedding_cache_path)
except FileNotFoundError:
    embedding_cache = {}
with open(embedding_cache_path, "wb") as embedding_cache_file:
    pickle.dump(embedding_cache, embedding_cache_file)
# define a function to retrieve embeddings from the cache if present, and otherwise request via the API
def embedding_from_string(
    string: str,
    model: str = EMBEDDING_MODEL,
    embedding_cache=embedding_cache
) -> list:
    """Return embedding of given string, using a cache to avoid recomputing."""
    if (string, model) not in embedding_cache:
        embedding_cache[(string, model)] = get_embedding(string, model)
        # persist the updated cache so the new embedding survives a restart
        with open(embedding_cache_path, "wb") as embedding_cache_file:
            pickle.dump(embedding_cache, embedding_cache_file)
    return embedding_cache[(string, model)]
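The caching pattern above can be tried without calling the API. The following is a minimal sketch, assuming a stand-in function `fake_get_embedding` (hypothetical, not part of `utils.embeddings_utils`) that counts how often it is called, and a temporary cache path chosen only for this demo. It illustrates that a repeated request for the same `(text, model)` key is served from the dictionary and that the pickle file on disk matches the in-memory cache.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for get_embedding so the sketch runs offline.
api_calls = 0


def fake_get_embedding(text: str, model: str) -> list[float]:
    """Return a deterministic dummy vector and count the call."""
    global api_calls
    api_calls += 1
    return [float(len(text)), float(len(model))]


# Assumed temporary location for the cache file (not the notebook's path).
cache_path = os.path.join(tempfile.mkdtemp(), "demo_embeddings_cache.pkl")
cache: dict[tuple[str, str], list[float]] = {}


def cached_embedding(text: str, model: str = "demo-model") -> list[float]:
    """Look up (text, model) in the cache; compute and persist on a miss."""
    key = (text, model)
    if key not in cache:
        cache[key] = fake_get_embedding(text, model)
        with open(cache_path, "wb") as cache_file:
            pickle.dump(cache, cache_file)
    return cache[key]


first = cached_embedding("hello world")
second = cached_embedding("hello world")  # served from the cache, no new call

# Reload the pickle to confirm the on-disk copy matches the in-memory cache.
with open(cache_path, "rb") as cache_file:
    reloaded = pickle.load(cache_file)

print(f"API calls: {api_calls}, cached entries: {len(reloaded)}")
```

Rewriting the whole pickle on every miss is simple but can become slow for large caches; writing once after a batch of requests is a possible alternative.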
Let's check that it works by getting an embedding.
# as an example, take the first description from the dataset
example_string = df["description"].values[0]
print(f"\nExample string: {example_string}")
# print the first 10 dimensions of the embedding
example_embedding = embedding_from_string(example_string)
print(f"\nExample embedding: {example_embedding[:10]}...")

Example string: BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister Tony Blair urged the international community to consider global warming a dire threat and agree on a plan of action to curb the quot;alarming quot; growth of greenhouse gases.

Example embedding: [0.0545826330780983, -0.00428084097802639, 0.04785159230232239, 0.01587914116680622, -0.03640881925821304, 0.0143799539655447, -0.014267769642174244, -0.015175441280007362, -0.002344391541555524, 0.011075624264776707]...
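To find similar articles, let's follow a three-step plan: get the embeddings of all the article descriptions, calculate the distance between the source article's embedding and every other embedding, and print out the articles closest to the source.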
def print_recommendations_from_strings(
    strings: list[str],
    index_of_source_string: int,
    k_nearest_neighbors: int = 1,
    model=EMBEDDING_MODEL,
) -> list[int]:
    """Print out the k nearest neighbors of a given string."""
    # get embeddings for all strings
    embeddings = [embedding_from_string(string, model=model) for string in strings]

    # get the embedding of the source string
    query_embedding = embeddings[index_of_source_string]

    # get distances between the source embedding and other embeddings (function from utils/embeddings_utils.py)
    distances = distances_from_embeddings(query_embedding, embeddings, distance_metric="cosine")

    # get indices of nearest neighbors (function from utils/embeddings_utils.py)
    indices_of_nearest_neighbors = indices_of_nearest_neighbors_from_distances(distances)

    # print out source string
    query_string = strings[index_of_source_string]
    print(f"Source string: {query_string}")
    # print out its k nearest neighbors
    k_counter = 0
    for i in indices_of_nearest_neighbors:
        # skip any strings that are identical matches to the starting string
        if query_string == strings[i]:
            continue
        # stop after printing out k articles
        if k_counter >= k_nearest_neighbors:
            break
        k_counter += 1

        # print out the similar strings and their distances
        print(
            f"""
        --- Recommendation #{k_counter} (nearest neighbor {k_counter} of {k_nearest_neighbors}) ---
        String: {strings[i]}
        Distance: {distances[i]:0.3f}"""
        )

    return indices_of_nearest_neighbors
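The two helpers used above are imported from `utils.embeddings_utils` and their bodies are not shown in this notebook. As a rough illustration of the idea only, here is a minimal pure-Python sketch, assuming cosine distance is defined as 1 minus cosine similarity and that neighbors are ranked by ascending distance. The names `cosine_distance` and `nearest_neighbor_indices` and the toy vectors are hypothetical, not the library's API.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbor_indices(distances: list[float]) -> list[int]:
    """Return indices sorted from smallest to largest distance."""
    return sorted(range(len(distances)), key=lambda i: distances[i])


# Toy 2-D "embeddings"; index 0 plays the role of the source article.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_distances = [cosine_distance(toy_embeddings[0], e) for e in toy_embeddings]
toy_order = nearest_neighbor_indices(toy_distances)

print(toy_order)
```

The source vector is at distance 0 from itself, so it sorts first. That is why the function above skips any string identical to the source before counting recommendations.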
Let's look for articles similar to the first one, which is about Tony Blair.
article_descriptions = df["description"].tolist()
tony_blair_articles = print_recommendations_from_strings(
    strings=article_descriptions,  # let's base similarity off of the article description
    index_of_source_string=0,  # articles similar to the first one about Tony Blair
    k_nearest_neighbors=5,  # 5 most similar articles
)
Source string: BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister Tony Blair urged the international community to consider global warming a dire threat and agree on a plan of action to curb the quot;alarming quot; growth of greenhouse gases.

--- Recommendation #1 (nearest neighbor 1 of 5) ---
String: The anguish of hostage Kenneth Bigley in Iraq hangs over Prime Minister Tony Blair today as he faces the twin test of a local election and a debate by his Labour Party about the divisive war.
Distance: 0.514

--- Recommendation #2 (nearest neighbor 2 of 5) ---
String: THE re-election of British Prime Minister Tony Blair would be seen as an endorsement of the military action in Iraq, Prime Minister John Howard said today.
Distance: 0.516

--- Recommendation #3 (nearest neighbor 3 of 5) ---
String: Israel is prepared to back a Middle East conference convened by Tony Blair early next year despite having expressed fears that the British plans were over-ambitious and designed
Distance: 0.546

--- Recommendation #4 (nearest neighbor 4 of 5) ---
String: Allowing dozens of casinos to be built in the UK would bring investment and thousands of jobs, Tony Blair says.
Distance: 0.568

--- Recommendation #5 (nearest neighbor 5 of 5) ---
String: AFP - A battle group of British troops rolled out of southern Iraq on a US-requested mission to deadlier areas near Baghdad, in a major political gamble for British Prime Minister Tony Blair.
Distance: 0.579
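Pretty good! All 5 recommendations explicitly mention Tony Blair, although they cover other stories (Iraq, a Middle East conference, UK casinos) rather than climate change.
Let's see how our recommender does on the second example article, about NVIDIA's new chipset with more security.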
chipset_security_articles = print_recommendations_from_strings(
    strings=article_descriptions,  # let's base similarity off of the article description
    index_of_source_string=1,  # let's look at articles similar to the second one about a more secure chipset
    k_nearest_neighbors=5,  # let's look at the 5 most similar articles
)
Source string: PC World - Upcoming chip set will include built-in security features for your PC.

--- Recommendation #1 (nearest neighbor 1 of 5) ---
String: PC World - Updated antivirus software for businesses adds intrusion prevention features.
Distance: 0.422

--- Recommendation #2 (nearest neighbor 2 of 5) ---
String: PC World - Symantec, McAfee hope raising virus-definition fees will move users to\ suites.
Distance: 0.518

--- Recommendation #3 (nearest neighbor 3 of 5) ---
String: originally offered on notebook PCs -- to its Opteron 32- and 64-bit x86 processors for server applications. The technology will help servers to run
Distance: 0.522

--- Recommendation #4 (nearest neighbor 4 of 5) ---
String: PC World - Send your video throughout your house--wirelessly--with new gateways and media adapters.
Distance: 0.532

--- Recommendation #5 (nearest neighbor 5 of 5) ---
String: Chips that help a computer's main microprocessors perform specific types of math problems are becoming a big business once again.\
Distance: 0.532
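From the printed distances, you can see that the #1 recommendation is much closer than all the others (0.422 vs. 0.518+). The #1 recommendation also looks very similar to the starting article: it's another article from PC World about increasing computer security. Pretty good!
A more sophisticated way to build a recommender system is to train a machine learning model that takes in tens or hundreds of signals, such as item popularity or user click data. Even in such a system, embeddings can be a very useful signal for the recommender, especially for items that are being "cold started" with no user data yet (e.g., a brand-new product added to the catalog that has not received any clicks).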
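To make the cold-start point concrete, here is a small hypothetical sketch. It is a hand-weighted blend of two signals, not a trained model, and the function name `blended_score`, its parameters, the 100-impression threshold, and the 0.7 weight are all assumptions chosen for illustration. Items with too little interaction data fall back to embedding similarity alone.

```python
def blended_score(
    embedding_similarity: float,
    clicks: int,
    impressions: int,
    min_impressions: int = 100,
    similarity_weight: float = 0.7,
) -> float:
    """Blend embedding similarity with click-through rate.

    Items with fewer than `min_impressions` impressions are treated as
    cold-start items and scored by embedding similarity alone.
    """
    if impressions < min_impressions:
        return embedding_similarity
    click_through_rate = clicks / impressions
    return (
        similarity_weight * embedding_similarity
        + (1 - similarity_weight) * click_through_rate
    )


# A brand-new item with no user data is ranked purely by its embedding.
new_item = blended_score(embedding_similarity=0.8, clicks=0, impressions=0)

# An established item mixes both signals.
popular_item = blended_score(embedding_similarity=0.6, clicks=300, impressions=1000)

print(f"new item: {new_item:.2f}, popular item: {popular_item:.2f}")
```

In a real system the weights would typically be learned from data rather than fixed by hand.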
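To get a sense of what our nearest neighbor recommender is doing, let's visualize the article embeddings. Although we can't plot each of the 1536 dimensions of each embedding vector (the default output size of text-embedding-3-small), we can use techniques like t-SNE or PCA to compress the embeddings down into 2 or 3 dimensions, which we can chart.
Before visualizing the nearest neighbors, let's visualize all of the article descriptions using t-SNE. Note that t-SNE is not deterministic, meaning that results may vary from run to run.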
# get embeddings for all article descriptions
embeddings = [embedding_from_string(string) for string in article_descriptions]
# compress the 1536-dimensional embeddings into 2 dimensions using t-SNE
tsne_components = tsne_components_from_embeddings(embeddings)
# get the article labels for coloring the chart
labels = df["label"].tolist()
chart_from_components(
    components=tsne_components,
    labels=labels,
    strings=article_descriptions,
    width=600,
    height=500,
    title="t-SNE components of article descriptions",
)
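As you can see in the chart above, even the highly compressed embeddings do a good job of clustering article descriptions by category. It's worth emphasizing: this clustering is done with no knowledge of the labels themselves!
Also, if you look closely at the most egregious outliers, they are often due to mislabeling rather than poor embedding. For example, the majority of the blue World points in the green Sports cluster appear to be Sports stories.
Next, let's recolor the points by whether they are a source article, its nearest neighbors, or other.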
# create labels for the recommended articles
def nearest_neighbor_labels(
    list_of_indices: list[int],
    k_nearest_neighbors: int = 5
) -> list[str]:
    """Return a list of labels to color the k nearest neighbors."""
    labels = ["Other" for _ in list_of_indices]
    # the first index is the source itself (distance 0)
    source_index = list_of_indices[0]
    labels[source_index] = "Source"
    for i in range(k_nearest_neighbors):
        nearest_neighbor_index = list_of_indices[i + 1]
        labels[nearest_neighbor_index] = f"Nearest neighbor (top {k_nearest_neighbors})"
    return labels


tony_blair_labels = nearest_neighbor_labels(tony_blair_articles, k_nearest_neighbors=5)
chipset_security_labels = nearest_neighbor_labels(chipset_security_articles, k_nearest_neighbors=5)
# a 2D chart of nearest neighbors of the Tony Blair article
chart_from_components(
    components=tsne_components,
    labels=tony_blair_labels,
    strings=article_descriptions,
    width=600,
    height=500,
    title="Nearest neighbors of the Tony Blair article",
    category_orders={"label": ["Other", "Nearest neighbor (top 5)", "Source"]},
)
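Looking at the 2D chart above, we can see that the articles about Tony Blair are somewhat close together inside the World news cluster. Interestingly, although the 5 nearest neighbors (red) were closest in the high-dimensional space, they are not the closest points in this compressed 2D space. Compressing the embeddings down to 2 dimensions discards much of their information, and the nearest neighbors in the 2D space don't seem to be as relevant as those in the full embedding space.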
# a 2D chart of nearest neighbors of the chipset security article
chart_from_components(
    components=tsne_components,
    labels=chipset_security_labels,
    strings=article_descriptions,
    width=600,
    height=500,
    title="Nearest neighbors of the chipset security article",
    category_orders={"label": ["Other", "Nearest neighbor (top 5)", "Source"]},
)
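For the chipset security example, the 4 closest nearest neighbors in the full embedding space remain nearest neighbors in this compressed 2D visualization. The fifth is displayed as more distant, despite being closer in the full embedding space.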
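One way to put a number on this kind of observation is to count how many of the k nearest neighbors in the full space are also among the k nearest in the reduced space. The sketch below assumes cosine distance in the full space and Euclidean distance in the 2D projection. The function names and toy vectors are illustrative and are not part of the notebook's utilities, and the "projection" here simply drops the third coordinate rather than running t-SNE.

```python
import math
from typing import Callable

Vector = list[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(
    source_index: int,
    vectors: list[Vector],
    k: int,
    distance: Callable[[Vector, Vector], float],
) -> set[int]:
    """Return the k indices closest to the source, excluding the source."""
    others = [i for i in range(len(vectors)) if i != source_index]
    others.sort(key=lambda i: distance(vectors[source_index], vectors[i]))
    return set(others[:k])


def neighbor_overlap(
    full_vectors: list[Vector],
    reduced_vectors: list[Vector],
    source_index: int,
    k: int,
) -> int:
    """Count neighbors shared by the full-space and reduced-space top-k sets."""
    full = top_k(source_index, full_vectors, k, cosine_distance)
    reduced = top_k(source_index, reduced_vectors, k, math.dist)
    return len(full & reduced)


# Toy 3-D vectors; index 0 is the source.
full_space = [
    [1.0, 0.0, 0.0],
    [1.0, 0.1, 0.0],
    [1.0, 0.0, 0.2],
    [1.0, 0.05, 3.0],
    [0.5, 1.0, 0.0],
]
# Crude stand-in for a 2-D projection: keep only the first two coordinates.
reduced_space = [v[:2] for v in full_space]

overlap = neighbor_overlap(full_space, reduced_space, source_index=0, k=2)
print(f"{overlap} of 2 nearest neighbors are preserved in the reduced space")
```

In this toy data, the point at index 3 is far from the source in 3D but lands close to it once the third coordinate is discarded, which mirrors how a 2D chart can reorder neighbors.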
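If you like, you can also make an interactive 3D plot of the embeddings with the function chart_from_components_3D. (Doing so will require recomputing the t-SNE components with n_components=3.)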