交易的多类别分类

2022 年 10 月 20 日
在 Github 中打开

对于此 notebook,我们将研究如何将公共交易数据集分类到我们预定义的多个类别中。这些方法应可复制到任何多类别分类用例,在这些用例中,我们尝试将交易数据拟合到预定义的类别中,并且在完成此过程后,您应该掌握一些处理标记和未标记数据集的方法。

我们在此 notebook 中将采用的不同方法是

  • 零样本分类: 首先,我们将进行零样本分类,仅使用提示进行指导,将交易放入五个命名的类别中
  • 使用嵌入进行分类: 接下来,我们将在标记数据集上创建嵌入,然后使用传统的分类模型来测试其在识别我们的类别方面的有效性
  • 微调分类: 最后,我们将生成一个在我们标记的数据集上训练的微调模型,以查看它与零样本和少样本分类方法的比较结果
%load_ext autoreload
%autoreload
%pip install openai 'openai[datalib]' 'openai[embeddings]' transformers
import openai
import pandas as pd
import numpy as np
import json
import os

COMPLETIONS_MODEL = "gpt-4"

client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if you didn't set as an env var>"))
transactions = pd.read_csv('./data/25000_spend_dataset_current.csv', encoding= 'unicode_escape')
len(transactions)
359
transactions.head()
日期 供应商 描述 交易价值 (£)
0 21/04/2016 M & J Ballantyne Ltd George IV Bridge Work 35098.0
1 26/04/2016 Private Sale Literary & Archival Items 30000.0
2 30/04/2016 City Of Edinburgh Council Non Domestic Rates 40800.0
3 09/05/2016 Computacenter Uk Kelvin Hall 72835.0
4 09/05/2016 John Graham Construction Ltd Causewayside Refurbishment 64361.0
def request_completion(prompt):

    completion_response = openai.chat.completions.create(
                            prompt=prompt,
                            temperature=0,
                            max_tokens=5,
                            top_p=1,
                            frequency_penalty=0,
                            presence_penalty=0,
                            model=COMPLETIONS_MODEL)

    return completion_response

def classify_transaction(transaction,prompt):

    prompt = prompt.replace('SUPPLIER_NAME',transaction['Supplier'])
    prompt = prompt.replace('DESCRIPTION_TEXT',transaction['Description'])
    prompt = prompt.replace('TRANSACTION_VALUE',str(transaction['Transaction value (£)']))

    classification = request_completion(prompt).choices[0].message.content.replace('\n','')

    return classification

# This function takes your training and validation outputs from the prepare_data function of the Finetuning API, and
# confirms that each have the same number of classes.
# If they do not have the same number of classes the fine-tune will fail and return an error

def check_finetune_classes(train_file,valid_file):

    train_classes = set()
    valid_classes = set()
    with open(train_file, 'r') as json_file:
        json_list = list(json_file)
        print(len(json_list))

    for json_str in json_list:
        result = json.loads(json_str)
        train_classes.add(result['completion'])
        #print(f"result: {result['completion']}")
        #print(isinstance(result, dict))

    with open(valid_file, 'r') as json_file:
        json_list = list(json_file)
        print(len(json_list))

    for json_str in json_list:
        result = json.loads(json_str)
        valid_classes.add(result['completion'])
        #print(f"result: {result['completion']}")
        #print(isinstance(result, dict))

    if len(train_classes) == len(valid_classes):
        print('All good')

    else:
        print('Classes do not match, please prepare data again')

零样本分类

我们将首先评估基础模型在使用简单提示对这些交易进行分类时的性能。我们将为模型提供 5 个类别和一个“无法分类”的兜底类别,用于模型无法归类的交易。

zero_shot_prompt = '''You are a data expert working for the National Library of Scotland.
You are analysing all transactions over £25,000 in value and classifying them into one of five categories.
The five categories are Building Improvement, Literature & Archive, Utility Bills, Professional Services and Software/IT.
If you can't tell what it is, say Could not classify

Transaction:

Supplier: SUPPLIER_NAME
Description: DESCRIPTION_TEXT
Value: TRANSACTION_VALUE

The classification is:'''
# Get a test transaction
transaction = transactions.iloc[0]

# Interpolate the values into the prompt
prompt = zero_shot_prompt.replace('SUPPLIER_NAME',transaction['Supplier'])
prompt = prompt.replace('DESCRIPTION_TEXT',transaction['Description'])
prompt = prompt.replace('TRANSACTION_VALUE',str(transaction['Transaction value (£)']))

# Use our completion function to return a prediction
completion_response = request_completion(prompt)
print(completion_response.choices[0].text)
 Building Improvement

我们的第一次尝试是正确的,M & J Ballantyne Ltd 是一家房屋建筑商,他们执行的工作确实是建筑改进。

让我们将样本量扩大到 25 个,看看它的表现如何,同样只需一个简单的提示来指导它

test_transactions = transactions.iloc[:25]
test_transactions['Classification'] = test_transactions.apply(lambda x: classify_transaction(x,zero_shot_prompt),axis=1)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.ac.cn/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
test_transactions['Classification'].value_counts()
 Building Improvement    14
 Could not classify       5
 Literature & Archive     3
 Software/IT              2
 Utility Bills            1
Name: Classification, dtype: int64
test_transactions.head(25)
日期 供应商 描述 交易价值 (£) 分类
0 21/04/2016 M & J Ballantyne Ltd George IV Bridge Work 35098.0 建筑改进
1 26/04/2016 Private Sale Literary & Archival Items 30000.0 文学与档案
2 30/04/2016 City Of Edinburgh Council Non Domestic Rates 40800.0 水电费账单
3 09/05/2016 Computacenter Uk Kelvin Hall 72835.0 软件/IT
4 09/05/2016 John Graham Construction Ltd Causewayside Refurbishment 64361.0 建筑改进
5 09/05/2016 A McGillivray Causewayside Refurbishment 53690.0 建筑改进
6 16/05/2016 John Graham Construction Ltd Causewayside Refurbishment 365344.0 建筑改进
7 23/05/2016 Computacenter Uk Kelvin Hall 26506.0 软件/IT
8 23/05/2016 ECG Facilities Service 设施管理费 32777.0 建筑改进
9 23/05/2016 ECG Facilities Service 设施管理费 32777.0 建筑改进
10 30/05/2016 ALDL ALDL 收费 32317.0 无法分类
11 10/06/2016 Wavetek Ltd Kelvin Hall 87589.0 无法分类
12 10/06/2016 John Graham Construction Ltd Causewayside Refurbishment 381803.0 建筑改进
13 28/06/2016 ECG Facilities Service 设施管理费 32832.0 建筑改进
14 30/06/2016 Glasgow City Council Kelvin Hall 1700000.0 建筑改进
15 11/07/2016 Wavetek Ltd Kelvin Hall 65692.0 无法分类
16 11/07/2016 John Graham Construction Ltd Causewayside Refurbishment 139845.0 建筑改进
17 15/07/2016 Sotheby'S Literary & Archival Items 28500.0 文学与档案
18 18/07/2016 Christies Literary & Archival Items 33800.0 文学与档案
19 25/07/2016 A McGillivray Causewayside Refurbishment 30113.0 建筑改进
20 31/07/2016 ALDL ALDL 收费 32317.0 无法分类
21 08/08/2016 ECG Facilities Service 设施管理费 32795.0 建筑改进
22 15/08/2016 Creative Video Productions Ltd Kelvin Hall 26866.0 无法分类
23 15/08/2016 John Graham Construction Ltd Causewayside Refurbishment 196807.0 建筑改进
24 24/08/2016 ECG Facilities Service 设施管理费 32795.0 建筑改进

即使没有标记示例,初始结果也相当不错!无法分类的那些案例更加棘手,几乎没有关于其主题的线索,但是如果我们清理标记数据集以提供更多示例,也许我们可以获得更好的性能。

使用嵌入进行分类

让我们从我们目前已分类的小型集合中创建嵌入 - 我们通过在数据集中的 101 个交易上运行零样本分类器并手动更正我们获得的 15 个无法分类结果,制作了一组标记示例

创建嵌入

初始部分重用了来自 Get_embeddings_from_dataset Notebook 的方法,以从组合字段(连接了我们所有特征的字段)创建嵌入

df = pd.read_csv('./data/labelled_transactions.csv')
df.head()
日期 供应商 描述 交易价值 (£) 分类
0 15/08/2016 Creative Video Productions Ltd Kelvin Hall 26866 其他
1 29/05/2017 John Graham Construction Ltd Causewayside Refurbishment 74806 建筑改进
2 29/05/2017 Morris & Spottiswood Ltd George IV Bridge Work 56448 建筑改进
3 31/05/2017 John Graham Construction Ltd Causewayside Refurbishment 164691 建筑改进
4 24/07/2017 John Graham Construction Ltd Causewayside Refurbishment 27926 建筑改进
df['combined'] = "Supplier: " + df['Supplier'].str.strip() + "; Description: " + df['Description'].str.strip() + "; Value: " + str(df['Transaction value (£)']).strip()
df.head(2)
日期 供应商 描述 交易价值 (£) 分类 组合
0 15/08/2016 Creative Video Productions Ltd Kelvin Hall 26866 其他 供应商:Creative Video Productions Ltd;描述...
1 29/05/2017 John Graham Construction Ltd Causewayside Refurbishment 74806 建筑改进 供应商:John Graham Construction Ltd;描述...
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

df['n_tokens'] = df.combined.apply(lambda x: len(tokenizer.encode(x)))
len(df)
101
embedding_path = './data/transactions_with_embeddings_100.csv'
from utils.embeddings_utils import get_embedding

df['babbage_similarity'] = df.combined.apply(lambda x: get_embedding(x, model='gpt-4'))
df['babbage_search'] = df.combined.apply(lambda x: get_embedding(x, model='gpt-4'))
df.to_csv(embedding_path)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from ast import literal_eval

fs_df = pd.read_csv(embedding_path)
fs_df["babbage_similarity"] = fs_df.babbage_similarity.apply(literal_eval).apply(np.array)
fs_df.head()
未命名: 0 日期 供应商 描述 交易价值 (£) 分类 组合 n_tokens babbage_similarity babbage_search
0 0 15/08/2016 Creative Video Productions Ltd Kelvin Hall 26866 其他 供应商:Creative Video Productions Ltd;描述... 136 [-0.009802100248634815, 0.022551486268639565, ... [-0.00232666521333158, 0.019198870286345482, 0...
1 1 29/05/2017 John Graham Construction Ltd Causewayside Refurbishment 74806 建筑改进 供应商:John Graham Construction Ltd;描述... 140 [-0.009065819904208183, 0.012094118632376194, ... [0.005169447045773268, 0.00473341578617692, -0...
2 2 29/05/2017 Morris & Spottiswood Ltd George IV Bridge Work 56448 建筑改进 供应商:Morris & Spottiswood Ltd;描述... 141 [-0.009000026620924473, 0.02405017428100109, -... [0.0028343256562948227, 0.021166473627090454, ...
3 3 31/05/2017 John Graham Construction Ltd Causewayside Refurbishment 164691 建筑改进 供应商:John Graham Construction Ltd;描述... 140 [-0.009065819904208183, 0.012094118632376194, ... [0.005169447045773268, 0.00473341578617692, -0...
4 4 24/07/2017 John Graham Construction Ltd Causewayside Refurbishment 27926 建筑改进 供应商:John Graham Construction Ltd;描述... 140 [-0.009065819904208183, 0.012094118632376194, ... [0.005169447045773268, 0.00473341578617692, -0...
X_train, X_test, y_train, y_test = train_test_split(
    list(fs_df.babbage_similarity.values), fs_df.Classification, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
probas = clf.predict_proba(X_test)

report = classification_report(y_test, preds)
print(report)
                      precision    recall  f1-score   support

Building Improvement       0.92      1.00      0.96        11
Literature & Archive       1.00      1.00      1.00         3
               Other       0.00      0.00      0.00         1
         Software/IT       1.00      1.00      1.00         1
       Utility Bills       1.00      1.00      1.00         5

            accuracy                           0.95        21
           macro avg       0.78      0.80      0.79        21
        weighted avg       0.91      0.95      0.93        21

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

此模型的性能非常强大,因此创建嵌入并使用更简单的分类器看起来也是一种有效的方法,零样本分类器帮助我们完成了未标记数据集的初始分类。

让我们更进一步,看看在同一标记数据集上训练的微调模型是否能为我们提供可比较的结果

微调交易分类

对于此用例,我们将尝试通过在相同的 101 个交易的标记集上训练微调模型,并将此微调模型应用于一组未见交易,从而改进上述少样本分类

构建微调分类器

我们首先需要进行一些数据准备工作,以使我们的数据准备就绪。这将采取以下步骤

  • 首先,我们将列出我们的类别,并用数字标识符替换它们。使模型预测单个 token 而不是像“建筑改进”这样的多个连续 token 应该会给我们更好的结果
  • 我们还需要为每个示例添加一个常见的前缀和后缀,以帮助模型进行预测 - 在我们的例子中,我们的文本已经以“供应商”开头,我们将添加后缀“\n\n###\n\n”
  • 最后,我们将在我们的每个目标分类类别上添加一个前导空格,同样是为了帮助模型
ft_prep_df = fs_df.copy()
len(ft_prep_df)
101
ft_prep_df.head()
未命名: 0 日期 供应商 描述 交易价值 (£) 分类 组合 n_tokens babbage_similarity babbage_search
0 0 15/08/2016 Creative Video Productions Ltd Kelvin Hall 26866 其他 供应商:Creative Video Productions Ltd;描述... 12 [-0.009630300104618073, 0.009887108579277992, ... [-0.008217384107410908, 0.025170527398586273, ...
1 1 29/05/2017 John Graham Construction Ltd Causewayside Refurbishment 74806 建筑改进 供应商:John Graham Construction Ltd;描述... 16 [-0.006144719664007425, -0.0018709596479311585... [-0.007424891460686922, 0.008475713431835175, ...
2 2 29/05/2017 Morris & Spottiswood Ltd George IV Bridge Work 56448 建筑改进 供应商:Morris & Spottiswood Ltd;描述... 17 [-0.005225738976150751, 0.015156379900872707, ... [-0.007611643522977829, 0.030322374776005745, ...
3 3 31/05/2017 John Graham Construction Ltd Causewayside Refurbishment 164691 建筑改进 供应商:John Graham Construction Ltd;描述... 16 [-0.006144719664007425, -0.0018709596479311585... [-0.007424891460686922, 0.008475713431835175, ...
4 4 24/07/2017 John Graham Construction Ltd Causewayside Refurbishment 27926 建筑改进 供应商:John Graham Construction Ltd;描述... 16 [-0.006144719664007425, -0.0018709596479311585... [-0.007424891460686922, 0.008475713431835175, ...
classes = list(set(ft_prep_df['Classification']))
class_df = pd.DataFrame(classes).reset_index()
class_df.columns = ['class_id','class']
class_df  , len(class_df)
(   class_id                 class
 0         0  Literature & Archive
 1         1         Utility Bills
 2         2  Building Improvement
 3         3           Software/IT
 4         4                 Other,
 5)
ft_df_with_class = ft_prep_df.merge(class_df,left_on='Classification',right_on='class',how='inner')

# Adding a leading whitespace onto each completion to help the model
ft_df_with_class['class_id'] = ft_df_with_class.apply(lambda x: ' ' + str(x['class_id']),axis=1)
ft_df_with_class = ft_df_with_class.drop('class', axis=1)

# Adding a common separator onto the end of each prompt so the model knows when a prompt is terminating
ft_df_with_class['prompt'] = ft_df_with_class.apply(lambda x: x['combined'] + '\n\n###\n\n',axis=1)
ft_df_with_class.head()
未命名: 0 日期 供应商 描述 交易价值 (£) 分类 组合 n_tokens babbage_similarity babbage_search 类别 ID 提示
0 0 15/08/2016 Creative Video Productions Ltd Kelvin Hall 26866 其他 供应商:Creative Video Productions Ltd;描述... 12 [-0.009630300104618073, 0.009887108579277992, ... [-0.008217384107410908, 0.025170527398586273, ... 4 供应商:Creative Video Productions Ltd;描述...
1 51 31/03/2017 NLS Foundation 补助金支付 177500 其他 供应商:NLS Foundation;描述:补助金支付... 11 [-0.022305507212877274, 0.008543581701815128, ... [-0.020519884303212166, 0.01993306167423725, -... 4 供应商:NLS Foundation;描述:补助金支付...
2 70 26/06/2017 British Library 法定送存服务 50056 其他 供应商:British Library;描述:法定... 11 [-0.01019938476383686, 0.015277703292667866, -... [-0.01843327097594738, 0.03343546763062477, -0... 4 供应商:British Library;描述:法定...
3 71 24/07/2017 ALDL 法定送存服务 27067 其他 供应商:ALDL;描述:法定送存服务... 11 [-0.008471488021314144, 0.004098685923963785, ... [-0.012966590002179146, 0.01299362163990736, 0... 4 供应商:ALDL;描述:法定送存服务...
4 100 24/07/2017 AM Phillip 车辆采购 26604 其他 供应商:AM Phillip;描述:车辆采购... 10 [-0.003459023078903556, 0.004626389592885971, ... [-0.0010945454705506563, 0.008626140654087067,... 4 供应商:AM Phillip;描述:车辆采购...
# This step is unnecessary if you have a number of observations in each class
# In our case we don't, so we shuffle the data to give us a better chance of getting equal classes in our train and validation sets
# Our fine-tuned model will error if we have less classes in the validation set, so this is a necessary step

import random

labels = [x for x in ft_df_with_class['class_id']]
text = [x for x in ft_df_with_class['prompt']]
ft_df = pd.DataFrame(zip(text, labels), columns = ['prompt','class_id']) #[:300]
ft_df.columns = ['prompt','completion']
ft_df['ordering'] = ft_df.apply(lambda x: random.randint(0,len(ft_df)), axis = 1)
ft_df.set_index('ordering',inplace=True)
ft_df_sorted = ft_df.sort_index(ascending=True)
ft_df_sorted.head()
提示 完成
订购
0 供应商:Sothebys;描述:文学与档案... 0
1 供应商:Sotheby'S;描述:文学与 A... 0
2 供应商:City Of Edinburgh Council;描述... 1
2 供应商:John Graham Construction Ltd;描述... 2
3 供应商:John Graham Construction Ltd;描述... 2
# This step is to remove any existing files if we've already produced training/validation sets for this classifier
#!rm transactions_grouped*

# We output our shuffled dataframe to a .jsonl file and run the prepare_data function to get us our input files
ft_df_sorted.to_json("transactions_grouped.jsonl", orient='records', lines=True)
!openai tools fine_tunes.prepare_data -f transactions_grouped.jsonl -q
# This functions checks that your classes all appear in both prepared files
# If they don't, the fine-tuned model creation will fail
check_finetune_classes('transactions_grouped_prepared_train.jsonl','transactions_grouped_prepared_valid.jsonl')
31
8
All good
# This step creates your model
!openai api fine_tunes.create -t "transactions_grouped_prepared_train.jsonl" -v "transactions_grouped_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 5 -m curie

# You can use following command to get fine tuning job status and model name, replace the job name with your job
#!openai api fine_tunes.get -i ft-YBIc01t4hxYBC7I5qhRF3Qdx
# Congrats, you've got a fine-tuned model!
# Copy/paste the name provided into the variable below and we'll take it for a spin
fine_tuned_model = 'curie:ft-personal-2022-10-20-10-42-56'

应用微调分类器

现在我们将应用我们的分类器,看看它的表现如何。我们的训练集中只有 31 个唯一的观察结果,验证集中有 8 个,所以让我们看看性能如何

test_set = pd.read_json('transactions_grouped_prepared_valid.jsonl', lines=True)
test_set.head()
提示 完成
0 供应商:Wavetek Ltd;描述:Kelvin Hal... 2
1 供应商:ECG Facilities Service;描述:... 1
2 供应商:M & J Ballantyne Ltd;描述:G... 2
3 供应商:Private Sale;描述:文学... 0
4 供应商:Ex Libris;描述:IT 设备... 3
test_set['predicted_class'] = test_set.apply(lambda x: openai.chat.completions.create(model=fine_tuned_model, prompt=x['prompt'], max_tokens=1, temperature=0, logprobs=5),axis=1)
test_set['pred'] = test_set.apply(lambda x : x['predicted_class']['choices'][0]['text'],axis=1)
test_set['result'] = test_set.apply(lambda x: str(x['pred']).strip() == str(x['completion']).strip(), axis = 1)
test_set['result'].value_counts()
True     4
False    4
Name: result, dtype: int64

性能不是很好 - 不幸的是,这是预期的。由于每个类别只有几个示例,因此上述使用嵌入和传统分类器的方法效果更好。

微调模型在有大量标记观察结果时效果最佳。如果我们有几百或几千个,我们可能会获得更好的结果,但让我们在保留集上进行最后一次测试,以确认它不能很好地推广到一组新的观察结果

holdout_df = transactions.copy().iloc[101:]
holdout_df.head()
日期 供应商 描述 交易价值 (£)
101 23/10/2017 City Building LLP Causewayside Refurbishment 53147.0
102 30/10/2017 ECG Facilities Service 设施管理费 35758.0
103 30/10/2017 ECG Facilities Service 设施管理费 35758.0
104 06/11/2017 John Graham Construction Ltd Causewayside Refurbishment 134208.0
105 06/11/2017 ALDL 法定送存服务 27067.0
holdout_df['combined'] = "Supplier: " + holdout_df['Supplier'].str.strip() + "; Description: " + holdout_df['Description'].str.strip() + '\n\n###\n\n' # + "; Value: " + str(df['Transaction value (£)']).strip()
holdout_df['prediction_result'] = holdout_df.apply(lambda x: openai.chat.completions.create(model=fine_tuned_model, prompt=x['combined'], max_tokens=1, temperature=0, logprobs=5),axis=1)
holdout_df['pred'] = holdout_df.apply(lambda x : x['prediction_result']['choices'][0]['text'],axis=1)
holdout_df.head(10)
日期 供应商 描述 交易价值 (£) 组合 预测结果 预测
101 23/10/2017 City Building LLP Causewayside Refurbishment 53147.0 供应商:City Building LLP;描述:Caus... {'id': 'cmpl-63YDadbYLo8xKsGY2vReOFCMgTOvG', '... 2
102 30/10/2017 ECG Facilities Service 设施管理费 35758.0 供应商:ECG Facilities Service;描述:... {'id': 'cmpl-63YDbNK1D7UikDc3xi5ATihg5kQEt', '... 2
103 30/10/2017 ECG Facilities Service 设施管理费 35758.0 供应商:ECG Facilities Service;描述:... {'id': 'cmpl-63YDbwfiHjkjMWsfTKNt6naeqPzOe', '... 2
104 06/11/2017 John Graham Construction Ltd Causewayside Refurbishment 134208.0 供应商:John Graham Construction Ltd;描述... {'id': 'cmpl-63YDbWAndtsRqPTi2ZHZtPodZvOwr', '... 2
105 06/11/2017 ALDL 法定送存服务 27067.0 供应商:ALDL;描述:法定送存服务... {'id': 'cmpl-63YDbDu7WM3svYWsRAMdDUKtSFDBu', '... 2
106 27/11/2017 Maggs Bros Ltd Literary & Archival Items 26500.0 供应商:Maggs Bros Ltd;描述:Literar... {'id': 'cmpl-63YDbxNNI8ZH5CJJNxQ0IF9Zf925C', '... 0
107 30/11/2017 Glasgow City Council Kelvin Hall 42345.0 供应商:Glasgow City Council;描述:K... {'id': 'cmpl-63YDb8R1FWu4bjwM2xE775rouwneV', '... 2
108 11/12/2017 ECG Facilities Service 设施管理费 35758.0 供应商:ECG Facilities Service;描述:... {'id': 'cmpl-63YDcAPsp37WhbPs9kwfUX0kBk7Hv', '... 2
109 11/12/2017 John Graham Construction Ltd Causewayside Refurbishment 159275.0 供应商:John Graham Construction Ltd;描述... {'id': 'cmpl-63YDcML2welrC3wF0nuKgcNmVu1oQ', '... 2
110 08/01/2018 ECG Facilities Service 设施管理费 35758.0 供应商:ECG Facilities Service;描述:... {'id': 'cmpl-63YDc95SSdOHnIliFB2cjMEEm7Z2u', '... 2
holdout_df['pred'].value_counts()
 2    231
 0     27
Name: pred, dtype: int64

嗯,这些结果同样不尽如人意 - 因此我们了解到,对于具有少量标记观察结果的数据集,零样本分类或使用嵌入的传统分类比微调模型返回更好的结果。

微调模型仍然是一个很棒的工具,但是当您要分类的每个类别都有大量标记示例时,它会更有效