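This notebook covers use cases where your data is unlabelled but has features that can be used to cluster it into meaningful categories. The challenge with clustering is making the features that distinguish each cluster human-readable, and that is where we will use GPT-3 to generate meaningful cluster descriptions for us. We can then use these descriptions to apply labels to a previously unlabelled dataset.
To feed the model we use embeddings created with the approach shown in the Multiclass classification for transactions notebook, applied to all 359 transactions in the dataset to give us a bigger pool for learning.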
# optional env import
from dotenv import load_dotenv
load_dotenv()
True
# imports
from openai import OpenAI
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
import matplotlib
import matplotlib.pyplot as plt
import os
from ast import literal_eval
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
COMPLETIONS_MODEL = "gpt-3.5-turbo"
# This path leads to a file with data and precomputed embeddings
embedding_path = "data/library_transactions_with_embeddings_359.csv"
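We'll reuse the approach from the Clustering notebook, using K-Means to cluster our dataset on the feature embeddings we created previously. We'll then use the Chat Completions endpoint to generate cluster descriptions for us and judge their effectiveness.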
df = pd.read_csv(embedding_path)
df.head()
 | Date | Supplier | Description | Transaction value (£) | combined | n_tokens | embedding |
---|---|---|---|---|---|---|---|
0 | 21/04/2016 | M & J Ballantyne Ltd | George IV Bridge Work | 35098.0 | Supplier: M & J Ballantyne Ltd; Description: Geo... | 118 | [-0.013169967569410801, -0.004833734128624201,... |
1 | 26/04/2016 | Private Sale | Literary & Archival Items | 30000.0 | Supplier: Private Sale; Description: Literary ... | 114 | [-0.019571533426642418, -0.010801066644489765,... |
2 | 30/04/2016 | City Of Edinburgh Council | Non Domestic Rates | 40800.0 | Supplier: City Of Edinburgh Council; Description... | 114 | [-0.0054041435942053795, -6.548957026097924e-0... |
3 | 09/05/2016 | Computacenter Uk | Kelvin Hall | 72835.0 | Supplier: Computacenter Uk; Description: Kel... | 113 | [-0.004776035435497761, -0.005533686839044094,... |
4 | 09/05/2016 | John Graham Construction Ltd | Causewayside Refurbishment | 64361.0 | Supplier: John Graham Construction Ltd; Description... | 117 | [0.003290407592430711, -0.0073441751301288605,... |
embedding_df = pd.read_csv(embedding_path)
embedding_df["embedding"] = embedding_df.embedding.apply(literal_eval).apply(np.array)
matrix = np.vstack(embedding_df.embedding.values)
matrix.shape
(359, 1536)
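The embedding column comes back from the CSV as text, which is why the cell above applies `literal_eval` before stacking. As a minimal sketch of that round trip, the example below uses two made-up rows; the `toy_df` and `toy_matrix` names and the three-dimensional vectors are illustrative assumptions, not values from the dataset.

```python
from ast import literal_eval

import numpy as np
import pandas as pd

# Hypothetical two-row frame mimicking embeddings that were saved to CSV as strings
toy_df = pd.DataFrame({"embedding": ["[0.1, -0.2, 0.3]", "[0.0, 0.5, -0.5]"]})

# Parse each string into a Python list, then convert each list to a NumPy vector
toy_df["embedding"] = toy_df.embedding.apply(literal_eval).apply(np.array)

# Stack the per-row vectors into a single (n_rows, n_dims) matrix for scikit-learn
toy_matrix = np.vstack(toy_df.embedding.values)
print(toy_matrix.shape)
```

On the real file the same steps yield one row per transaction and one column per embedding dimension.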
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init=10)
kmeans.fit(matrix)
labels = kmeans.labels_
embedding_df["Cluster"] = labels
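The value `n_clusters = 5` is chosen by hand here. One common way to sanity-check such a choice, which is not part of the original notebook, is to compare silhouette scores across several candidate values. The sketch below runs on synthetic blobs rather than the transaction embeddings; `toy_matrix`, `scores`, and `best_k` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the embedding matrix: three well-separated blobs in 8 dimensions
rng = np.random.default_rng(42)
centers = np.array([[0.0] * 8, [5.0] * 8, [-5.0] * 8])
toy_matrix = np.vstack([c + rng.normal(scale=0.5, size=(40, 8)) for c in centers])

# Fit K-Means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    candidate = KMeans(n_clusters=k, init="k-means++", random_state=42, n_init=10)
    candidate_labels = candidate.fit_predict(toy_matrix)
    scores[k] = silhouette_score(toy_matrix, candidate_labels)

# A higher silhouette score suggests tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

To try this on the real data, `toy_matrix` could be swapped for `matrix`. Real embeddings rarely separate as cleanly as these blobs, so the scores are best read as a rough guide rather than a definitive answer.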
tsne = TSNE(
    n_components=2, perplexity=15, random_state=42, init="random", learning_rate=200
)
vis_dims2 = tsne.fit_transform(matrix)
x = [x for x, y in vis_dims2]
y = [y for x, y in vis_dims2]
for category, color in enumerate(["purple", "green", "red", "blue", "yellow"]):
    xs = np.array(x)[embedding_df.Cluster == category]
    ys = np.array(y)[embedding_df.Cluster == category]
    plt.scatter(xs, ys, color=color, alpha=0.3)

    # Mark each cluster's centroid in the 2D projection with an "x"
    avg_x = xs.mean()
    avg_y = ys.mean()

    plt.scatter(avg_x, avg_y, marker="x", color=color, s=100)
plt.title("Clusters identified visualized in language 2d using t-SNE")
Text(0.5, 1.0, 'Clusters identified visualized in language 2d using t-SNE')
# We'll read 10 transactions per cluster as we're expecting some variation
transactions_per_cluster = 10
for i in range(n_clusters):
    print(f"Cluster {i} Theme:\n")

    transactions = "\n".join(
        embedding_df[embedding_df.Cluster == i]
        .combined.str.replace("Supplier: ", "")
        .str.replace("Description: ", ": ")
        .str.replace("Value: ", ": ")
        .sample(transactions_per_cluster, random_state=42)
        .values
    )
    response = client.chat.completions.create(
        model=COMPLETIONS_MODEL,
        # We'll include a prompt to instruct the model what sort of description we're looking for
        messages=[
            {"role": "user",
             "content": f'''We want to group these transactions into meaningful clusters so we can target the areas we are spending the most money.
What do the following transactions have in common?\n\nTransactions:\n"""\n{transactions}\n"""\n\nTheme:'''}
        ],
        temperature=0,
        max_tokens=100,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    print(response.choices[0].message.content.replace("\n", ""))
    print("\n")

    # Print the same sampled rows (same random_state) so the theme can be checked by eye
    sample_cluster_rows = embedding_df[embedding_df.Cluster == i].sample(transactions_per_cluster, random_state=42)
    for j in range(transactions_per_cluster):
        print(sample_cluster_rows.Supplier.values[j], end=", ")
        print(sample_cluster_rows.Description.values[j], end="\n")

    print("-" * 100)
    print("\n")
Cluster 0 Theme:

The common theme among these transactions is that they all involve spending money on various expenses such as electricity, non-domestic rates, IT equipment, computer equipment, and the purchase of an electric van.

EDF ENERGY, Electricity Oct 2019 3 buildings
City Of Edinburgh Council, Non Domestic Rates
EDF, Electricity
EX LIBRIS, IT equipment
City Of Edinburgh Council, Non Domestic Rates
CITY OF EDINBURGH COUNCIL, Rates for 33 Salisbury Place
EDF Energy, Electricity
XMA Scotland Ltd, IT equipment
Computer Centre UK Ltd, Computer equipment
ARNOLD CLARK, Purchase of an electric van
----------------------------------------------------------------------------------------------------

Cluster 1 Theme:

The common theme among these transactions is that they all involve payments for various goods and services. Some specific examples include student bursary costs, collection of papers, architectural works, legal deposit services, papers related to Alisdair Gray, resources on slavery abolition and social justice, collection items, online/print subscriptions, ALDL charges, and literary/archival items.

Institute of Conservation, This payment covers 2 invoices for student bursary costs
PRIVATE SALE, Collection of papers of an individual
LEE BOYD LIMITED, Architectural Works
ALDL, Legal Deposit Services
RICK GEKOSKI, Papers 1970's to 2019 Alisdair Gray
ADAM MATTHEW DIGITAL LTD, Resource - slavery abolution and social justice
PROQUEST INFORMATION AND LEARN, This payment covers multiple invoices for collection items
LM Information Delivery UK LTD, Payment of 18 separate invoice for Online/Print subscriptions Jan 20-Dec 20
ALDL, ALDL Charges
Private Sale, Literary & Archival Items
----------------------------------------------------------------------------------------------------

Cluster 2 Theme:

The common theme among these transactions is that they all involve spending money at Kelvin Hall.

CBRE, Kelvin Hall
GLASGOW CITY COUNCIL, Kelvin Hall
University Of Glasgow, Kelvin Hall
GLASGOW LIFE, Oct 20 to Dec 20 service charge - Kelvin Hall
Computacenter Uk, Kelvin Hall
XMA Scotland Ltd, Kelvin Hall
GLASGOW LIFE, Service Charges Kelvin Hall 01/07/19-30/09/19
Glasgow Life, Kelvin Hall Service Charges
Glasgow City Council, Kelvin Hall
GLASGOW LIFE, Quarterly service charge KH
----------------------------------------------------------------------------------------------------

Cluster 3 Theme:

The common theme among these transactions is that they all involve payments for facility management fees and services provided by ECG Facilities Service.

ECG FACILITIES SERVICE, This payment covers multiple invoices for facility management fees
ECG FACILITIES SERVICE, Facilities Management Charge
ECG FACILITIES SERVICE, Inspection and Maintenance of all Library properties
ECG Facilities Service, Facilities Management Charge
ECG FACILITIES SERVICE, Maintenance contract - October
ECG FACILITIES SERVICE, Electrical and mechanical works
ECG FACILITIES SERVICE, This payment covers multiple invoices for facility management fees
ECG FACILITIES SERVICE, CB Bolier Replacement (1),USP Batteries,Gutter Works & Cleaning of pigeon fouling
ECG Facilities Service, Facilities Management Charge
ECG Facilities Service, Facilities Management Charge
----------------------------------------------------------------------------------------------------

Cluster 4 Theme:

The common theme among these transactions is that they all involve construction or refurbishment work.

M & J Ballantyne Ltd, George IV Bridge Work
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
ARTHUR MCKAY BUILDING SERVICES, Causewayside Work
John Graham Construction Ltd, Causewayside Refurbishment
Morris & Spottiswood Ltd, George IV Bridge Work
ECG FACILITIES SERVICE, Causewayside IT Work
John Graham Construction Ltd, Causewayside Refurbishment
----------------------------------------------------------------------------------------------------
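We now have five new clusters that we can use to describe our data. Looking at the visualization, some of our clusters overlap and we'll need some tuning to get to the right place, but already we can see that GPT-3 has made some effective inferences. In particular, it picked up that items involving legal deposit were related to literary archives, which is true even though the model was given no clues about it. Very cool, and with some tuning we can create a base set of clusters that we can then use with a multiclass classifier to generalise to other transactional datasets we might use.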
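Once the themes look reasonable, applying them as labels amounts to mapping each cluster id to a short name. The sketch below shows one way this might be done with pandas. The `cluster_names` dictionary is a hand-written paraphrase of the themes printed above, and `toy_df` is a hypothetical stand-in for `embedding_df`; both are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame standing in for embedding_df with its "Cluster" column
toy_df = pd.DataFrame(
    {
        "Supplier": ["EDF Energy", "Glasgow Life", "ECG Facilities Service"],
        "Cluster": [0, 2, 3],
    }
)

# Hand-written names loosely paraphrasing the generated themes (an assumption)
cluster_names = {
    0: "Utilities, rates and IT equipment",
    1: "Collections and subscriptions",
    2: "Kelvin Hall",
    3: "Facilities management",
    4: "Construction and refurbishment",
}

# Map each numeric cluster id to its human-readable label
toy_df["Label"] = toy_df["Cluster"].map(cluster_names)
print(toy_df)
```

The resulting `Label` column could then serve as the target for a multiclass classifier, subject to a manual review of the cluster assignments first.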