Search reranking with cross-encoders

Jun 28, 2023

This notebook takes you through examples of using a cross-encoder to re-rank search results.

This is a common use case with our customers, where you've implemented semantic search using embeddings (produced by a bi-encoder) but the results are not as accurate as your use case requires. One possible cause is that there is some business rule you could use to re-rank the documents, such as how recent or how popular a document is.

Often, however, there are subtle domain-specific rules that help determine relevancy, and this is where a cross-encoder becomes useful. Cross-encoders are more accurate than bi-encoders but don't scale well, so using them to re-rank a shortened list returned by semantic search is the ideal use case.

Example

Consider a search task with D documents and Q queries.

The brute-force approach of computing the relevancy of every pair is expensive; its cost scales with D * Q. This is known as cross-encoding.

A faster approach is embeddings-based search, in which an embedding is computed once for each document and each query, and then re-used to cheaply compute pairwise relevancy. Because embeddings are computed only once, the cost scales with D + Q. This is known as bi-encoding.

Although embeddings-based search is faster, the quality can be worse. To get the best of both, a common approach is to use embeddings (or another bi-encoder) to cheaply identify top candidates, and then use GPT (or another cross-encoder) to expensively re-rank those top candidates. The cost of this hybrid approach scales with (D + Q) * cost of embedding + (N * Q) * cost of re-ranking, where N is the number of candidates re-ranked.
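
To make the scaling concrete, a back-of-the-envelope sketch (the unit costs are illustrative assumptions; only the ratios matter):

D = 10_000  # documents
Q = 100     # queries
N = 20      # candidates re-ranked per query
embed_cost = 1     # relative cost of embedding one text
rerank_cost = 100  # relative cost of one cross-encoder call

cross_encode_all = D * Q * rerank_cost               # score every pair
bi_encode_only = (D + Q) * embed_cost                # embed once, reuse
hybrid = (D + Q) * embed_cost + N * Q * rerank_cost  # embed, then re-rank top N
print(cross_encode_all, bi_encode_only, hybrid)
100000000 10100 210100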

Walkthrough

To illustrate this approach we'll build a GPT-powered cross-encoder using text-davinci-003 with logprobs enabled. Our GPT models have strong general language understanding, which, when tuned with a few few-shot examples, can provide a simple and effective cross-encoding option.

This notebook draws on this great article by Weaviate, and this excellent explanation of bi-encoders vs. cross-encoders from Sentence Transformers.

!pip install openai
!pip install arxiv
!pip install tenacity
!pip install pandas
!pip install tiktoken
import arxiv
from math import exp
from openai import OpenAI
import os
import pandas as pd
from tenacity import retry, wait_random_exponential, stop_after_attempt
import tiktoken

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

OPENAI_MODEL = "text-davinci-003"  # a Completions model with logprobs support, matching the token ids computed below

In this example we'll use the arXiv search service, but this step could be performed by any search service you have. The key consideration is to over-fetch slightly, to capture all the potentially relevant documents before re-sorting them.

query = "how do bi-encoders work for sentence embeddings"
search = arxiv.Search(
    query=query, max_results=20, sort_by=arxiv.SortCriterion.Relevance
)
result_list = []

for result in search.results():
    result_dict = {}

    result_dict.update({"title": result.title})
    result_dict.update({"summary": result.summary})

    # Take the first link as the article page and the second as the pdf
    result_dict.update({"article_url": [x.href for x in result.links][0]})
    result_dict.update({"pdf_url": [x.href for x in result.links][1]})
    result_list.append(result_dict)
result_list[0]
{'title': 'SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features',
 'summary': 'Models based on large-pretrained language models, such as S(entence)BERT,\nprovide effective and efficient sentence embeddings that show high correlation\nto human similarity ratings, but lack interpretability. On the other hand,\ngraph metrics for graph-based meaning representations (e.g., Abstract Meaning\nRepresentation, AMR) can make explicit the semantic aspects in which two\nsentences are similar. However, such metrics tend to be slow, rely on parsers,\nand do not reach state-of-the-art performance when rating sentence similarity.\n  In this work, we aim at the best of both worlds, by learning to induce\n$S$emantically $S$tructured $S$entence BERT embeddings (S$^3$BERT). Our\nS$^3$BERT embeddings are composed of explainable sub-embeddings that emphasize\nvarious semantic sentence features (e.g., semantic roles, negation, or\nquantification). We show how to i) learn a decomposition of the sentence\nembeddings into semantic features, through approximation of a suite of\ninterpretable AMR graph metrics, and how to ii) preserve the overall power of\nthe neural embeddings by controlling the decomposition learning process with a\nsecond objective that enforces consistency with the similarity ratings of an\nSBERT teacher model. In our experimental studies, we show that our approach\noffers interpretability -- while fully preserving the effectiveness and\nefficiency of the neural sentence embeddings.',
 'article_url': 'http://arxiv.org/abs/2206.07023v2',
 'pdf_url': 'http://arxiv.org/pdf/2206.07023v2'}
for i, result in enumerate(result_list):
    print(f"{i + 1}: {result['title']}")
1: SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features
2: Are Classes Clusters?
3: Semantic Composition in Visually Grounded Language Models
4: Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions
5: Learning Probabilistic Sentence Representations from Paraphrases
6: Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings
7: How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation
8: Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences
9: Vec2Sent: Probing Sentence Embeddings with Natural Language Generation
10: Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings
11: SentPWNet: A Unified Sentence Pair Weighting Network for Task-specific Sentence Embedding
12: Learning Joint Representations of Videos and Sentences with Web Image Search
13: Character-based Neural Networks for Sentence Pair Modeling
14: Train Once, Test Anywhere: Zero-Shot Learning for Text Classification
15: Hierarchical GPT with Congruent Transformers for Multi-Sentence Language Models
16: Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models
17: In Search for Linear Relations in Sentence Embedding Spaces
18: Learning to Borrow -- Relation Representation for Without-Mention Entity-Pairs for Knowledge Graph Completion
19: Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences
20: Relational Sentence Embedding for Flexible Semantic Matching

Cross-encoders

We'll create a cross-encoder using the Completions endpoint - the key factors to consider here are:

  • Make your examples domain-specific - the strength of cross-encoders comes when you tailor them to your domain.
  • There is a trade-off between how many potential examples to re-rank vs. processing speed. Consider batching and parallel-processing cross-encoder requests to process them more quickly (a minimal parallelization sketch follows the scoring loop below).

The steps here are:

  • Build a prompt to assess relevance, with few-shot examples to tune it to your domain.
  • Add a logit bias for the Yes and No tokens to decrease the likelihood of any other tokens occurring.
  • Return the yes/no classification along with its logprobs.
  • Re-rank the results by the logprobs keyed on Yes.
tokens = [" Yes", " No"]
tokenizer = tiktoken.encoding_for_model(OPENAI_MODEL)
ids = [tokenizer.encode(token) for token in tokens]
ids[0], ids[1]
([3363], [1400])
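Rather than hardcoding the bias map, it can be derived from the ids computed above; a small sketch, equivalent to the literal {3363: 1, 1400: 1} used in the calls below (a mild positive bias nudges the model toward these tokens without forcing them):

# Each token encodes to a single id, so take the first element
logit_bias = {ids[0][0]: 1, ids[1][0]: 1}
logit_bias
{3363: 1, 1400: 1}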
prompt = '''
You are an Assistant responsible for helping detect whether the retrieved document is relevant to the query. For a given input, you need to output a single token: "Yes" or "No" indicating the retrieved document is relevant to the query.

Query: How to plant a tree?
Document: """Cars were invented in 1886, when German inventor Carl Benz patented his Benz Patent-Motorwagen.[3][4][5] Cars became widely available during the 20th century. One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company. Cars were rapidly adopted in the US, where they replaced horse-drawn carriages.[6] In Europe and other parts of the world, demand for automobiles did not increase until after World War II.[7] The car is considered an essential part of the developed economy."""
Relevant: No

Query: Has the coronavirus vaccine been approved?
Document: """The Pfizer-BioNTech COVID-19 vaccine was approved for emergency use in the United States on December 11, 2020."""
Relevant: Yes

Query: What is the capital of France?
Document: """Paris, France's capital, is a major European city and a global center for art, fashion, gastronomy and culture. Its 19th-century cityscape is crisscrossed by wide boulevards and the River Seine. Beyond such landmarks as the Eiffel Tower and the 12th-century, Gothic Notre-Dame cathedral, the city is known for its cafe culture and designer boutiques along the Rue du Faubourg Saint-Honoré."""
Relevant: Yes

Query: What are some papers to learn about PPO reinforcement learning?
Document: """Proximal Policy Optimization and its Dynamic Version for Sequence Generation: In sequence generation task, many works use policy gradient for model optimization to tackle the intractable backpropagation issue when maximizing the non-differentiable evaluation metrics or fooling the discriminator in adversarial learning. In this paper, we replace policy gradient with proximal policy optimization (PPO), which is a proved more efficient reinforcement learning algorithm, and propose a dynamic approach for PPO (PPO-dynamic). We demonstrate the efficacy of PPO and PPO-dynamic on conditional sequence generation tasks including synthetic experiment and chit-chat chatbot. The results show that PPO and PPO-dynamic can beat policy gradient by stability and performance."""
Relevant: Yes

Query: Explain sentence embeddings
Document: """Inside the bubble: exploring the environments of reionisation-era Lyman-α emitting galaxies with JADES and FRESCO: We present a study of the environments of 16 Lyman-α emitting galaxies (LAEs) in the reionisation era (5.8<z<8) identified by JWST/NIRSpec as part of the JWST Advanced Deep Extragalactic Survey (JADES). Unless situated in sufficiently (re)ionised regions, Lyman-α emission from these galaxies would be strongly absorbed by neutral gas in the intergalactic medium (IGM). We conservatively estimate sizes of the ionised regions required to reconcile the relatively low Lyman-α velocity offsets (ΔvLyα<300kms−1) with moderately high Lyman-α escape fractions (fesc,Lyα>5%) observed in our sample of LAEs, indicating the presence of ionised ``bubbles'' with physical sizes of the order of 0.1pMpc≲Rion≲1pMpc in a patchy reionisation scenario where the bubbles are embedded in a fully neutral IGM. Around half of the LAEs in our sample are found to coincide with large-scale galaxy overdensities seen in FRESCO at z∼5.8-5.9 and z∼7.3, suggesting Lyman-α transmission is strongly enhanced in such overdense regions, and underlining the importance of LAEs as tracers of the first large-scale ionised bubbles. Considering only spectroscopically confirmed galaxies, we find our sample of UV-faint LAEs (MUV≳−20mag) and their direct neighbours are generally not able to produce the required ionised regions based on the Lyman-α transmission properties, suggesting lower-luminosity sources likely play an important role in carving out these bubbles. These observations demonstrate the combined power of JWST multi-object and slitless spectroscopy in acquiring a unique view of the early stages of Cosmic Reionisation via the most distant LAEs."""
Relevant: No

Query: {query}
Document: """{document}"""
Relevant:
'''


@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))
def document_relevance(query, document):
    # Single-token completion; the logit bias steers the model toward " Yes"/" No"
    response = client.completions.create(
        model=OPENAI_MODEL,
        prompt=prompt.format(query=query, document=document),
        temperature=0,
        logprobs=1,
        logit_bias={3363: 1, 1400: 1},
        max_tokens=1,
    )

    return (
        query,
        document,
        response.choices[0].text,
        response.choices[0].logprobs.token_logprobs[0],
    )
content = result_list[0]["title"] + ": " + result_list[0]["summary"]

# Set logprobs to 1 so our response will include the most probable token the model identified
response = client.completions.create(
    model=OPENAI_MODEL,
    prompt=prompt.format(query=query, document=content),
    temperature=0,
    logprobs=1,
    logit_bias={3363: 1, 1400: 1},
    max_tokens=1,
)
result = response.choices[0]
print(f"Result was {result.text}")
print(f"Logprobs was {result.logprobs.token_logprobs[0]}")
print("\nBelow is the full logprobs object\n\n")
print(result.logprobs)
Result was Yes
Logprobs was -0.05869877

Below is the full logprobs object


{
  "tokens": [
    "Yes"
  ],
  "token_logprobs": [
    -0.05869877
  ],
  "top_logprobs": [
    {
      "Yes": -0.05869877
    }
  ],
  "text_offset": [
    5764
  ]
}
output_list = []
for x in result_list:
    content = x["title"] + ": " + x["summary"]

    try:
        output_list.append(document_relevance(query, document=content))

    except Exception as e:
        print(e)
output_list[:10]
[('how do bi-encoders work for sentence embeddings',
  'SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features: Models based on large-pretrained language models, such as S(entence)BERT,\nprovide effective and efficient sentence embeddings that show high correlation\nto human similarity ratings, but lack interpretability. On the other hand,\ngraph metrics for graph-based meaning representations (e.g., Abstract Meaning\nRepresentation, AMR) can make explicit the semantic aspects in which two\nsentences are similar. However, such metrics tend to be slow, rely on parsers,\nand do not reach state-of-the-art performance when rating sentence similarity.\n  In this work, we aim at the best of both worlds, by learning to induce\n$S$emantically $S$tructured $S$entence BERT embeddings (S$^3$BERT). Our\nS$^3$BERT embeddings are composed of explainable sub-embeddings that emphasize\nvarious semantic sentence features (e.g., semantic roles, negation, or\nquantification). We show how to i) learn a decomposition of the sentence\nembeddings into semantic features, through approximation of a suite of\ninterpretable AMR graph metrics, and how to ii) preserve the overall power of\nthe neural embeddings by controlling the decomposition learning process with a\nsecond objective that enforces consistency with the similarity ratings of an\nSBERT teacher model. In our experimental studies, we show that our approach\noffers interpretability -- while fully preserving the effectiveness and\nefficiency of the neural sentence embeddings.',
  'Yes',
  -0.05326408),
 ('how do bi-encoders work for sentence embeddings',
  'Are Classes Clusters?: Sentence embedding models aim to provide general purpose embeddings for\nsentences. Most of the models studied in this paper claim to perform well on\nSTS tasks - but they do not report on their suitability for clustering. This\npaper looks at four recent sentence embedding models (Universal Sentence\nEncoder (Cer et al., 2018), Sentence-BERT (Reimers and Gurevych, 2019), LASER\n(Artetxe and Schwenk, 2019), and DeCLUTR (Giorgi et al., 2020)). It gives a\nbrief overview of the ideas behind their implementations. It then investigates\nhow well topic classes in two text classification datasets (Amazon Reviews (Ni\net al., 2019) and News Category Dataset (Misra, 2018)) map to clusters in their\ncorresponding sentence embedding space. While the performance of the resulting\nclassification model is far from perfect, it is better than random. This is\ninteresting because the classification model has been constructed in an\nunsupervised way. The topic classes in these real life topic classification\ndatasets can be partly reconstructed by clustering the corresponding sentence\nembeddings.',
  'No',
  -0.009535169),
 ('how do bi-encoders work for sentence embeddings',
  "Semantic Composition in Visually Grounded Language Models: What is sentence meaning and its ideal representation? Much of the expressive\npower of human language derives from semantic composition, the mind's ability\nto represent meaning hierarchically & relationally over constituents. At the\nsame time, much sentential meaning is outside the text and requires grounding\nin sensory, motor, and experiential modalities to be adequately learned.\nAlthough large language models display considerable compositional ability,\nrecent work shows that visually-grounded language models drastically fail to\nrepresent compositional structure. In this thesis, we explore whether & how\nmodels compose visually grounded semantics, and how we might improve their\nability to do so.\n  Specifically, we introduce 1) WinogroundVQA, a new compositional visual\nquestion answering benchmark, 2) Syntactic Neural Module Distillation, a\nmeasure of compositional ability in sentence embedding models, 3) Causal\nTracing for Image Captioning Models to locate neural representations vital for\nvision-language composition, 4) Syntactic MeanPool to inject a compositional\ninductive bias into sentence embeddings, and 5) Cross-modal Attention\nCongruence Regularization, a self-supervised objective function for\nvision-language relation alignment. We close by discussing connections of our\nwork to neuroscience, psycholinguistics, formal semantics, and philosophy.",
  'No',
  -0.008887106),
 ('how do bi-encoders work for sentence embeddings',
  "Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions: Text embedding models from Natural Language Processing can map text data\n(e.g. words, sentences, documents) to supposedly meaningful numerical\nrepresentations (a.k.a. text embeddings). While such models are increasingly\napplied in social science research, one important issue is often not addressed:\nthe extent to which these embeddings are valid representations of constructs\nrelevant for social science research. We therefore propose the use of the\nclassic construct validity framework to evaluate the validity of text\nembeddings. We show how this framework can be adapted to the opaque and\nhigh-dimensional nature of text embeddings, with application to survey\nquestions. We include several popular text embedding methods (e.g. fastText,\nGloVe, BERT, Sentence-BERT, Universal Sentence Encoder) in our construct\nvalidity analyses. We find evidence of convergent and discriminant validity in\nsome cases. We also show that embeddings can be used to predict respondent's\nanswers to completely new survey questions. Furthermore, BERT-based embedding\ntechniques and the Universal Sentence Encoder provide more valid\nrepresentations of survey questions than do others. Our results thus highlight\nthe necessity to examine the construct validity of text embeddings before\ndeploying them in social science research.",
  'No',
  -0.008583762),
 ('how do bi-encoders work for sentence embeddings',
  'Learning Probabilistic Sentence Representations from Paraphrases: Probabilistic word embeddings have shown effectiveness in capturing notions\nof generality and entailment, but there is very little work on doing the\nanalogous type of investigation for sentences. In this paper we define\nprobabilistic models that produce distributions for sentences. Our\nbest-performing model treats each word as a linear transformation operator\napplied to a multivariate Gaussian distribution. We train our models on\nparaphrases and demonstrate that they naturally capture sentence specificity.\nWhile our proposed model achieves the best performance overall, we also show\nthat specificity is represented by simpler architectures via the norm of the\nsentence vectors. Qualitative analysis shows that our probabilistic model\ncaptures sentential entailment and provides ways to analyze the specificity and\npreciseness of individual words.',
  'No',
  -0.011975748),
 ('how do bi-encoders work for sentence embeddings',
  "Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings: Semantic sentence embeddings are usually supervisedly built minimizing\ndistances between pairs of embeddings of sentences labelled as semantically\nsimilar by annotators. Since big labelled datasets are rare, in particular for\nnon-English languages, and expensive, recent studies focus on unsupervised\napproaches that require not-paired input sentences. We instead propose a\nlanguage-independent approach to build large datasets of pairs of informal\ntexts weakly similar, without manual human effort, exploiting Twitter's\nintrinsic powerful signals of relatedness: replies and quotes of tweets. We use\nthe collected pairs to train a Transformer model with triplet-like structures,\nand we test the generated embeddings on Twitter NLP similarity tasks (PIT and\nTURL) and STSb. We also introduce four new sentence ranking evaluation\nbenchmarks of informal texts, carefully extracted from the initial collections\nof tweets, proving not only that our best model learns classical Semantic\nTextual Similarity, but also excels on tasks where pairs of sentences are not\nexact paraphrases. Ablation studies reveal how increasing the corpus size\ninfluences positively the results, even at 2M samples, suggesting that bigger\ncollections of Tweets still do not contain redundant information about semantic\nsimilarities.",
  'No',
  -0.01219046),
 ('how do bi-encoders work for sentence embeddings',
  "How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation: Sentence encoders map sentences to real valued vectors for use in downstream\napplications. To peek into these representations - e.g., to increase\ninterpretability of their results - probing tasks have been designed which\nquery them for linguistic knowledge. However, designing probing tasks for\nlesser-resourced languages is tricky, because these often lack large-scale\nannotated data or (high-quality) dependency parsers as a prerequisite of\nprobing task design in English. To investigate how to probe sentence embeddings\nin such cases, we investigate sensitivity of probing task results to structural\ndesign choices, conducting the first such large scale study. We show that\ndesign choices like size of the annotated probing dataset and type of\nclassifier used for evaluation do (sometimes substantially) influence probing\noutcomes. We then probe embeddings in a multilingual setup with design choices\nthat lie in a 'stable region', as we identify for English, and find that\nresults on English do not transfer to other languages. Fairer and more\ncomprehensive sentence-level probing evaluation should thus be carried out on\nmultiple languages in the future.",
  'No',
  -0.015550519),
 ('how do bi-encoders work for sentence embeddings',
  'Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences: Sentence embedding methods offer a powerful approach for working with short\ntextual constructs or sequences of words. By representing sentences as dense\nnumerical vectors, many natural language processing (NLP) applications have\nimproved their performance. However, relatively little is understood about the\nlatent structure of sentence embeddings. Specifically, research has not\naddressed whether the length and structure of sentences impact the sentence\nembedding space and topology. This paper reports research on a set of\ncomprehensive clustering and network analyses targeting sentence and\nsub-sentence embedding spaces. Results show that one method generates the most\nclusterable embeddings. In general, the embeddings of span sub-sentences have\nbetter clustering properties than the original sentences. The results have\nimplications for future sentence embedding models and applications.',
  'No',
  -0.012663184),
 ('how do bi-encoders work for sentence embeddings',
  'Vec2Sent: Probing Sentence Embeddings with Natural Language Generation: We introspect black-box sentence embeddings by conditionally generating from\nthem with the objective to retrieve the underlying discrete sentence. We\nperceive of this as a new unsupervised probing task and show that it correlates\nwell with downstream task performance. We also illustrate how the language\ngenerated from different encoders differs. We apply our approach to generate\nsentence analogies from sentence embeddings.',
  'Yes',
  -0.004863006),
 ('how do bi-encoders work for sentence embeddings',
  'Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings: Semantic representation learning for sentences is an important and\nwell-studied problem in NLP. The current trend for this task involves training\na Transformer-based sentence encoder through a contrastive objective with text,\ni.e., clustering sentences with semantically similar meanings and scattering\nothers. In this work, we find the performance of Transformer models as sentence\nencoders can be improved by training with multi-modal multi-task losses, using\nunpaired examples from another modality (e.g., sentences and unrelated\nimage/audio data). In particular, besides learning by the contrastive loss on\ntext, our model clusters examples from a non-linguistic domain (e.g.,\nvisual/audio) with a similar contrastive loss at the same time. The reliance of\nour framework on unpaired non-linguistic data makes it language-agnostic,\nenabling it to be widely applicable beyond English NLP. Experiments on 7\nsemantic textual similarity benchmarks reveal that models trained with the\nadditional non-linguistic (/images/audio) contrastive objective lead to higher\nquality sentence embeddings. This indicates that Transformer models are able to\ngeneralize better by doing a similar task (i.e., clustering) with unpaired\nexamples from different modalities in a multi-task fashion.',
  'No',
  -0.013869206)]
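Scoring each document sequentially like this is slow. As the considerations above note, cross-encoder requests can be batched or parallelized; a minimal sketch with a thread pool, reusing the document_relevance function (max_workers is an illustrative value - tune it against your rate limits):

from concurrent.futures import ThreadPoolExecutor

# Score all candidates concurrently; results come back in the same order as contents
contents = [x["title"] + ": " + x["summary"] for x in result_list]
with ThreadPoolExecutor(max_workers=5) as executor:
    parallel_output = list(executor.map(lambda c: document_relevance(query, c), contents))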
output_df = pd.DataFrame(
    output_list, columns=["query", "document", "prediction", "logprobs"]
).reset_index()
# Use exp() to convert logprobs into probability
output_df["probability"] = output_df["logprobs"].apply(exp)
# Convert to the probability that the document is relevant, i.e. the probability of "Yes"
output_df["yes_probability"] = output_df.apply(
    lambda x: 1 - x["probability"]
    if x["prediction"] == "No"
    else x["probability"],
    axis=1,
)
output_df.head()
  index  query                                             document                                            prediction  logprobs   probability  yes_probability
0 0      how do bi-encoders work for sentence embeddings  SBERT studies Meaning Representations: Decompo...  Yes         -0.053264  0.948130     0.948130
1 1      how do bi-encoders work for sentence embeddings  Are Classes Clusters?: Sentence embedding mode...  No          -0.009535  0.990510     0.009490
2 2      how do bi-encoders work for sentence embeddings  Semantic Composition in Visually Grounded Lang...  No          -0.008887  0.991152     0.008848
3 3      how do bi-encoders work for sentence embeddings  Evaluating the Construct Validity of Text Embe...  No          -0.008584  0.991453     0.008547
4 4      how do bi-encoders work for sentence embeddings  Learning Probabilistic Sentence Representation...  No          -0.011976  0.988096     0.011904
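As a worked check of this conversion: row 1 was predicted No with a logprob of -0.009535, so exp(-0.009535) ≈ 0.990510 is the probability of No, and yes_probability = 1 - 0.990510 ≈ 0.009490, matching the table.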
# Return reranked results
reranked_df = output_df.sort_values(
    by=["yes_probability"], ascending=False
).reset_index()
reranked_df.head(10)
  level_0  index  query                                             document                                            prediction  logprobs   probability  yes_probability
0 16       16     how do bi-encoders work for sentence embeddings  In Search for Linear Relations in Sentence Emb...  Yes         -0.004824  0.995187     0.995187
1 8        8      how do bi-encoders work for sentence embeddings  Vec2Sent: Probing Sentence Embeddings with Nat...  Yes         -0.004863  0.995149     0.995149
2 19       19     how do bi-encoders work for sentence embeddings  Relational Sentence Embedding for Flexible Sem...  Yes         -0.038814  0.961930     0.961930
3 0        0      how do bi-encoders work for sentence embeddings  SBERT studies Meaning Representations: Decompo...  Yes         -0.053264  0.948130     0.948130
4 15       15     how do bi-encoders work for sentence embeddings  Sentence-T5: Scalable Sentence Encoders from P...  No          -0.291893  0.746849     0.253151
5 6        6      how do bi-encoders work for sentence embeddings  How to Probe Sentence Embeddings in Low-Resour...  No          -0.015551  0.984570     0.015430
6 18       18     how do bi-encoders work for sentence embeddings  Efficient and Flexible Topic Modeling using Pr...  No          -0.015296  0.984820     0.015180
7 9        9      how do bi-encoders work for sentence embeddings  Non-Linguistic Supervision for Contrastive Lea...  No          -0.013869  0.986227     0.013773
8 12       12     how do bi-encoders work for sentence embeddings  Character-based Neural Networks for Sentence P...  No          -0.012866  0.987216     0.012784
9 7        7      how do bi-encoders work for sentence embeddings  Clustering and Network Analysis for the Embedd...  No          -0.012663  0.987417     0.012583
# Inspect our new top document following reranking
reranked_df["document"][0]
'In Search for Linear Relations in Sentence Embedding Spaces: We present an introductory investigation into continuous-space vector\nrepresentations of sentences. We acquire pairs of very similar sentences\ndiffering only by a small alterations (such as change of a noun, adding an\nadjective, noun or punctuation) from datasets for natural language inference\nusing a simple pattern method. We look into how such a small change within the\nsentence text affects its representation in the continuous space and how such\nalterations are reflected by some of the popular sentence embedding models. We\nfound that vector differences of some embeddings actually reflect small changes\nwithin a sentence.'

Conclusion

We've shown how to create a tailored cross-encoder to re-rank academic papers. This approach works best where there are domain-specific nuances that can be used to pick the most relevant corpus for your users, and where some pre-filtering has taken place to limit the amount of data the cross-encoder needs to process.

A few typical use cases we've seen are:

  • Returning a list of the 100 most relevant stock reports, then re-ordering into a top 5 or 10 based on the detailed context of a particular customer's portfolio
  • Running after a classic rules-based search that fetches the top 100 or 1000 most relevant results, to prune them according to a particular user's context

Next steps

Taking a few-shot approach, as we have here, works well when the domain is general enough that a small number of examples covers most re-ranking cases. However, as the differences between documents become more specific, you may want to consider the Fine-tuning endpoint to build a more elaborate cross-encoder with a wider variety of examples.

There is also a latency impact of using text-davinci-003 that you'll need to consider, with even our few examples above taking a couple of seconds each - again, the Fine-tuning endpoint may help you here if you can get decent results from an ada or babbage fine-tuned model.

We've used OpenAI's Completions endpoint to build our cross-encoder, but this area is well served by the open-source community. Here is an example from HuggingFace.
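
As a minimal sketch of that route, using the sentence-transformers library (the model name is an illustrative off-the-shelf choice, not necessarily the one from the linked example):

from sentence_transformers import CrossEncoder

# A pretrained cross-encoder scores (query, document) pairs directly
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, x["title"] + ": " + x["summary"]) for x in result_list]
scores = cross_encoder.predict(pairs)  # higher score = more relevant
reranked = sorted(zip(scores, result_list), key=lambda t: t[0], reverse=True)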

We hope you find this useful for tuning your search use cases, and we look forward to seeing what you build.