Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent
Abstract
In this work, we propose and study annotated code search: the retrieval of code snippets paired with brief descriptions of their intent using natural language queries. On three benchmark datasets, we investigate how code retrieval systems can be improved by leveraging descriptions to better capture the intents of code snippets. Building on recent progress in transfer learning and natural language processing, we create a domain-specific retrieval model for code annotated with a natural language description. We find that our model yields significantly more relevant search results (with absolute gains up to $20.6\%$ in mean reciprocal rank) compared to state-of-the-art code retrieval methods that do not use descriptions but attempt to compute the intent of snippets solely from unannotated code.
I. INTRODUCTION
Many modern-day applications and software systems are built from a variety of open source libraries and components. As a result, developers find themselves working with a growing number of libraries and tools in their day-to-day coding activities. Quickly and efficiently retrieving relevant documentation for these libraries is becoming increasingly important. Concrete code examples that illustrate the usage of a library or specific library feature are particularly helpful to software developers, as witnessed by the rapid growth of online QA platforms such as Stack Overflow.
To address this need, the research community has been investigating systems that retrieve code snippets from natural language queries [1, 2, 3, 4, 5, 6]. All of these works use neural models to compute the similarity between code snippets and natural language queries. While this setup enables code search on large code corpora, understanding the intent of an unannotated code fragment remains challenging even for experienced developers, let alone for automated systems. For instance, programmers who were asked to assess the relevance of code snippets to a query reported that it would have been helpful to get additional context about the code snippet [4]. Moreover, the low percentage of keyword matches between the search terms and the code tokens [2] makes the retrieval of unannotated code snippets particularly challenging.
In this paper, we take a step back and investigate code search in a more restricted (but, we argue, equally useful) setting where code snippets are accompanied with brief natural language descriptions that capture their intent. Code snippets that are meant to be reused by others are frequently accompanied by descriptions. Such annotated snippets are found for instance in coding cheat sheets, interactive notebook environments, official library documentation, technical blog posts, programming textbooks, or on QA websites such as Stack Overflow. While none of these resources use a common, structured format to label code snippets with their intent, simple heuristics can often be used to pair code snippets with meaningful descriptions. We demonstrate that the quality of the search results improves significantly when these descriptions are taken into account by the snippet ranking algorithm.
The key contributions of this paper are as follows:
• We propose the annotated code search task: the problem of retrieving code snippets annotated with short descriptions, from natural language queries. To evaluate this task, we created three benchmark datasets, each of which consists of an annotated code snippet collection and a set of queries linked to one or more relevant snippets in the collection.
II. BACKGROUND AND RELATED WORK
A. Code search
There has been a growing interest in retrieving code using natural language queries [1, 2, 3, 4, 5, 6]. These works use neural networks to project code and queries into the same vector space. During training, the model parameters are optimized to minimize the distance between pairs of code snippets and corresponding natural language descriptions. These descriptions are typically extracted from the method documentation. The rationale is that, at prediction time, a query will be projected to a vector that is similar to the vector representations of related code snippets. As such, code search can be reduced to a $k$-nearest neighbors problem in the joint query-snippet vector space. Gu et al. [1], Sachdev et al. [2], Cambronero et al. [3], Husain et al. [4], Yao et al. [5], and Feng et al. [6] have all built retrieval models for snippet collections that consist of pure source code. By contrast, in this work we study a different code search setting where every snippet is annotated with a brief natural language description describing the code's intent.
Yao et al. [5] proposed a model to automatically generate descriptions from the code and utilize reinforcement learning to directly optimize the descriptions for improving code retrieval. They showed that an ensemble of 1) a model that used the generated descriptions and 2) an existing code search model yielded better performance for retrieving SQL code snippets. Ye et al. [7] improved on this work with a dual learning model that simultaneously optimizes for code generation and code summarization. Both of these works describe an alternative framework to obtain code descriptions but are otherwise orthogonal to our work.
Recent work explored the application of NLP transfer learning techniques to code [8, 6]. CodeBERT [6], a bimodal transformer model for code and natural language, obtained state-of-the-art results on highly specific tasks (for example, identifying methods/functions given their documentation among a list of "distractor" snippets), but performance on such proxy tasks correlates poorly with performance on actual code search [4]. It therefore remains to be seen how well CodeBERT performs on downstream tasks such as code search. In this work, we study retrieval of code snippets that leverages both the code tokens and a short description of the code. For computing the similarity between the query and the code, we compare with existing models such as NCS [2] and UNIF [3].
B. Transfer learning in natural language processing
In the past two years, natural language processing has seen substantial improvements across a variety of tasks including question answering, summarization, and information retrieval [9, 10]. Researchers have shown that fine-tuning large models, pre-trained on vast amounts of unlabeled text data, is often far superior to learning models from scratch on labeled, task-specific data only [11, 12, inter alia]. During pre-training, a neural net is optimized to predict words and/or sentences from their context (i.e., the previous or surrounding words/sentences), and as a result, learns to compute word representations that capture semantic and syntactic properties of the words and the context in which they appear. During fine-tuning, the last layer of the pre-trained model is replaced with a task-specific layer and all model parameters are fine-tuned on labeled data of the downstream task.
All recent state-of-the-art systems use the Transformer architecture [13], a neural network that processes sequences using multi-head self-attention. This facilitates capturing long-range dependencies and makes the computations more parallelizable than those of recurrent neural networks (RNNs). Due to the increased scale of the models and text corpora, pre-training has become impractical for the average researcher. Fortunately, several institutions have made pre-trained models publicly available: the universal sentence encoder (USE) [14], GPT and GPT-2 [15, 16], BERT [11], RoBERTa [17], and T5 [9]. The models differ in a) the pre-training objectives (e.g., BERT predicts words given the surrounding words, while GPT models only use the previous words); b) the text corpora on which they were trained; c) the model size; and d) variations on how the transformer architecture is used (e.g., BERT computes representations of (sub)words, sentences, and pairs of sentences, whereas the universal sentence encoder is aimed at computing sentence representations only).
Our interest in these transfer learning models stems from the need to compute the semantic similarity between search queries and short code descriptions. Because software development is terminology-heavy, the domains of the training corpora play a crucial role in performance on downstream tasks such as code search. Although the T5 model yields state-of-the-art results on many benchmarks, it is less suited for tasks in the software domain because its pre-training data was filtered with a heuristic to exclude documents that contain code. By contrast, the text corpora on which the universal sentence encoder was trained included data from QA websites such as Stack Overflow.
III. ANNOTATED CODE SEARCH
We cast annotated code search as a retrieval task: given a collection of annotated code snippets $\mathcal{C} = \{s_1, s_2, \ldots, s_N\}$, where each annotated snippet $s_i$ consists of a code fragment $c_i$ paired with a brief natural language description $d_i$, and a natural language query $q$, the goal is to generate a ranked list of code snippets $s_{q_1}, \ldots, s_{q_k}$. In the context of this paper, a code snippet will always be paired with a description, so for convenience, we will use the terms annotated code snippet and code snippet interchangeably. Prior work has studied the retrieval of code snippets without descriptions; we will refer to this setting as code-only code search.
To benchmark the retrieval performance of a model on a snippet collection $\mathcal{C}$, we require a ground truth dataset that links queries to code snippets from $\mathcal{C}$ that match the query. Performance can then be measured objectively with standard information retrieval metrics: mean reciprocal rank (MRR) and recall@k (r@k). The reciprocal rank is the inverse of the rank of the first matching snippet for a given query; MRR computes the average of the reciprocal ranks for the queries in the evaluation set. For MRR, the top-ranked snippets are most influential, e.g., finding a relevant snippet at rank 1 instead of rank 2 will have a bigger impact on the score than finding a snippet at rank 4 instead of rank 5. Recall@k is defined as the percentage of queries for which at least one matching snippet was found in the top-k results.
To get insight into the difficulty of a benchmark, we will also measure the word overlap between the queries and descriptions of matching snippets. We define the word overlap between a query $q$ and a description $d$ as the number of unique words that occur in both $q$ and $d$. Relative overlap is computed as the ratio between the overlap and the number of unique words in $q$. Before computing the overlap, $q$ and $d$ are tokenized and lowercased, and stop words are filtered out.
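To make the metric concrete, here is a minimal Python sketch of the relative word overlap computation; the tokenizer and stop-word list are illustrative stand-ins, not the exact pre-processing pipeline used in the paper.

```python
import re

# Illustrative stop-word list; the paper does not specify its exact list.
STOP_WORDS = {"a", "an", "the", "in", "to", "of", "with", "how", "using"}

def relative_overlap(query: str, description: str) -> float:
    """Unique-word overlap between query and description, relative to the query."""
    def word_set(text: str) -> set:
        tokens = re.findall(r"\w+", text.lower())   # tokenize and lowercase
        return {t for t in tokens if t not in STOP_WORDS}

    q, d = word_set(query), word_set(description)
    return len(q & d) / len(q) if q else 0.0

# relative_overlap("plot histogram pandas dataframe",
#                  "How to plot a histogram with matplotlib")  ->  0.5
```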
Fig. 1: Illustration of how the CoNaLa corpus [18] is converted into a benchmark for annotated code search.
IV. PACS: BENCHMARKS FOR ANNOTATED CODE SEARCH
Existing code search benchmarks [1, 2, 3, 4, 5] link queries to relevant code snippets, but these snippets are not consistently paired with descriptions and hence not directly applicable to this study. To evaluate and train annotated code retrieval models, we therefore created three benchmarking datasets, which we collectively refer to as the Python Annotated Code Search benchmark (PACS). Table I gives an overview of the PACS datasets. In the remainder of this section, we provide details on the construction of the benchmarks and explain how we obtain relevant training data.
A. Snippet collections and ground truth
We compile snippet collections from two existing datasets (CoNaLa and StaQC) and mine a third snippet collection from Stack Overflow posts in the data science domain.
a) CoNaLa: The CoNaLa corpus [18] is a curated collection of short Python code snippets annotated with their natural language intent. Yin et al. [18] used crowd-sourcing to label a sample of Stack Overflow posts. An example record of the corpus is shown in Figure 1. It consists of a code snippet, intent, re-written intent, and the id of the Stack Overflow post. The code snippets are selected from post answers. The intent is a high-level, natural language description of the code snippet and typically corresponds to the Stack Overflow post title. Because more than one code snippet can be extracted from the same post, different code snippets are sometimes annotated with the same intent. The re-written intent is a reformulation of the intent that should better reflect the full meaning of the code.
Figure 1 illustrates how we create a benchmark for annotated code search from the corpus. The snippet collection is constructed by selecting the re-written intents and their corresponding code snippets. This results in 2,777 snippets. To create the ground truth, we group the CoNaLa records by their Stack Overflow post id. For each group, we use the intent of one of the snippets as a query. The matching snippets are the code snippets that were constructed from the records in this group. Many of the re-written intents are similar to the original intent. Queries that have a relative word overlap of more than 0.5 with the description of a matching snippet are removed. This results in a ground truth dataset with 762 queries.
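The grouping and filtering steps can be sketched as follows; the record field names (question_id, intent, rewritten_intent, snippet) are assumptions about the corpus schema, and relative_overlap is the helper from the sketch in Section III.

```python
from collections import defaultdict

def build_conala_benchmark(records, max_overlap=0.5):
    # snippet collection: re-written intents paired with their code snippets
    collection = [(r["rewritten_intent"], r["snippet"]) for r in records]

    # group records by their Stack Overflow post id
    groups = defaultdict(list)
    for r in records:
        groups[r["question_id"]].append(r)

    queries = []
    for post_id, group in groups.items():
        query = group[0]["intent"]          # one intent per group as the query
        # drop queries that overlap too much with a matching description
        if all(relative_overlap(query, r["rewritten_intent"]) <= max_overlap
               for r in group):
            queries.append((query, [r["snippet"] for r in group]))
    return collection, queries
```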
Fig. 2: Illustration of the additional cleaning for the StaQC corpus [19]: we strip prompts, filter out code snippets that do not parse, and automatically rewrite questions as descriptions with simple regular expressions.


Fig. 3: To create the StaQC-py and SO-DS ground truth we make use of Stack Overflow posts that were tagged as duplicates by users.
b) StaQC-py: StaQC [19] is a large collection of Python and SQL question-code snippet pairs that are mined from Stack Overflow. These pairs were extracted with machine learning models that identify how-to questions and their corresponding code snippet(s) in the answers. The dataset has been used in prior work on code summarization [20] and (code-only) code search [5]. To create a snippet collection, we started from the 272K Python question-code pairs in StaQC and performed additional cleaning as illustrated in Figure 2: we strip prompt prefixes (>>>, ..., In [1]:, etc.), filter out code snippets that do not parse in Python 2 and 3, and use simple heuristics to reformulate questions as descriptions. This results in a snippet collection of 204K (description, code snippet) pairs.
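The cleaning steps could look roughly like the following sketch; the prompt patterns and the question-rewriting rule are simplified examples of the heuristics described above, not the full rule set (the paper also checks Python 2, which ast.parse below does not).

```python
import ast
import re

PROMPT_RE = re.compile(r"^(>>> ?|\.\.\. ?|In \[\d+\]: ?)")
QUESTION_PREFIX_RE = re.compile(r"^how (do|can|would|should) (i|you|we) |^how to ",
                                re.IGNORECASE)

def strip_prompts(code: str) -> str:
    return "\n".join(PROMPT_RE.sub("", line) for line in code.splitlines())

def parses(code: str) -> bool:
    try:
        ast.parse(code)          # Python 3 syntax check
        return True
    except SyntaxError:
        return False

def question_to_description(title: str) -> str:
    # "How do I merge two dicts?" -> "merge two dicts"
    return QUESTION_PREFIX_RE.sub("", title).rstrip("?").strip()

# pairs = [(question_to_description(q), strip_prompts(c))
#          for q, c in staqc_pairs          # staqc_pairs: raw (question, code) pairs
#          if parses(strip_prompts(c))]
```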
To create the ground truth, we leverage the fact that some Stack Overflow posts are tagged as a duplicate of another post. The titles of duplicate posts are descriptions that are created independently by different users but reflect the same underlying intent. Therefore, as illustrated in Figure 3, we collect the duplicates of the posts used for the StaQC snippet collection and take the duplicates’ titles as queries and the corresponding StaQC snippets as the matching results.
The duplicate posts are collected from the Stack Overflow dump of February 2020. Some posts are tagged as a duplicate of multiple questions, and it also happens that a post A is tagged as a duplicate of post B, while post C is tagged as a duplicate of A. We transitively group all such duplicate relations (i.e.,
A, B, and C will be part of the same group) and end up with 195K duplicate groups. Most groups consist of two posts, but popular posts can have more than 100 duplicates. To create the StaQC ground truth, we select all groups containing 1) a post that was the source for a StaQC snippet (to have snippets that match the query), as well as 2) a post that was not used for the StaQC data (to select the query). This resulted in 5,497 duplicate groups. From each such group we extracted a query with matching results. The ground truth set is split evenly into a validation and a test set. For the validation set, we additionally filter out any queries that overlap with the CoNaLa and SO-DS test sets to avoid overfitting when tuning model hyper-parameters.
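A minimal union-find sketch of the transitive grouping, assuming duplicate links are given as pairs of post ids:

```python
def group_duplicates(duplicate_pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for a, b in duplicate_pairs:             # merge the two posts' groups
        parent[find(a)] = find(b)

    groups = {}
    for post in list(parent):
        groups.setdefault(find(post), set()).add(post)
    return list(groups.values())

# group_duplicates([("A", "B"), ("C", "A")]) -> [{"A", "B", "C"}]
```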
c) SO-DS: The CoNaLa and StaQC benchmarks have different merits: the CoNaLa snippets and descriptions are curated, but the collection is small and the queries in the ground truth were not created independently from the descriptions. The StaQC benchmark, on the other hand, has a large snippet collection with realistic ground truth queries that were written independently from the snippet descriptions, but despite StaQC's advanced mining methodology the snippet collection contains more noise. We therefore collect a third snippet collection, SO-DS, where we control the quality by only mining from Stack Overflow posts with a large number of upvotes. More specifically, we mine snippets from the most upvoted Stack Overflow posts that are labeled with "python" and one or more tags related to data science libraries, such as "tensorflow", "matplotlib" and "beautifulsoup".
We iterate over each tag's posts in decreasing order of their upvote scores. For each post, we extract a maximum of two snippets from the post answers, prioritizing answers with more votes. A snippet is extracted by concatenating all the parseable code blocks from a post answer. We opted not to automatically select the most relevant lines of code with a machine learning model, as was done for StaQC, because incorrect predictions would introduce noise. Note that, unlike CoNaLa and StaQC, this dataset is specifically intended for code search, not for code generation or code summarization, where a one-to-one alignment between description and source code is important. Similar to the StaQC benchmark, we use the post title as the snippet description and apply the same cleaning pipeline: we remove prompts, filter out snippets that cannot be parsed, and use simple heuristics to reformulate questions as descriptions. Once we had extracted 250 snippets or ran out of posts with the correct tag, we moved on to the next tag. As a final step, we filter out duplicate snippets, which occur because some Stack Overflow posts with multiple tags were mined more than once. The resulting snippet collection consists of 12,137 snippets and 7,674 unique descriptions. Table II displays a few examples from the collection.
The ground truth is collected analogously to StaQC-py by creating queries from Stack Overflow duplicate posts. This results in 2,225 annotated queries, which are evenly split into a validation and a test set. The validation set is again filtered to ensure that no queries overlap with the CoNaLa and StaQC-py test sets.
d) Benchmark statistics: Table I reports summary statistics on the three benchmarks. The CoNaLa snippets are mostly one-liners, and their descriptions are on average slightly longer than those in the other collections. The SO-DS snippets are the longest, which is to be expected since the SO-DS mining procedure does not attempt to automatically extract the most relevant lines of code in Stack Overflow posts, as was done for StaQC. Concerning the different ground truth sets, we observe that most of the CoNaLa queries only have one matching snippet, whereas SO-DS and StaQC on average have 1.7 and 3.4 matching snippets. To assess the difficulty of each benchmark, Figure 4 shows histograms of the relative word overlap between the queries and the descriptions of matching snippets.
| Benchmark | #snippets | #description tokens | Code length (#lines of code) | #queries | #snippets/query |
|---|---|---|---|---|---|
| CoNaLa | 2,777 | 10.32 | 1.07 | 762 | 1.17 |
| StaQC-py | 203,700 | 8.36 | 9.84 | 2,749 | 3.40 |
| SO-DS | 12,137 | 7.35 | 14.98 | 1,113 | 1.70 |
TABLE I: Overview of the Python annotated code search (PACS) benchmarks.

Fig. 4: Histograms of the relative word overlap between the queries and the matching snippet descriptions in the PACS test sets.
The three benchmarks have an average relative word overlap between 0.28 and 0.29, and there are queries with a relative overlap of more than 0.6.
B. Training data
Because it is infeasible to construct a code search ground truth dataset that is large enough for model training, we resort to different proxy datasets. We train query-code similarity models (i.e., the code-only models that will be presented in Section V-B) on the description-code snippet pairs in the respective collections. To train the query-description similarity models (i.e., the description-only models that will be presented in Section V-A), we create two additional datasets.
D: Extracting and saving video frames
D: Log keras output to a file
C: from keras.callbacks import CSVLogger
   csv_logger = CSVLogger('log.csv', append=True, separator=';')
   model.fit(X_train, Y_train, callbacks=[csv_logger])
S: https://stackoverflow.com/questions/33311153
S: https://stackoverflow.com/questions/38445982

D: Plot correlation matrix using pandas
D: Regex find all overlapping matches
C: import matplotlib.pyplot as plt
   plt.matshow(dataframe.corr())
   plt.show()
S: https://stackoverflow.com/questions/29432629
S: https://stackoverflow.com/questions/5616822
TABLE II: Examples from the SO-DS snippet collection.
| Original | Duplicate |
|---|---|
| Python functions call by reference | Is it possible to create a function that modifies the input |
| Scrape/listen to AJAX data using Javascript? | Intercepting AJAX responses in a Chrome extension |
| How to write a conditional import statement in QML? | QML import later module version only if available |
| Boost.Asio SSL thread safety | Boost Asio and OpenSSL usage with multiple threads |
| Is it possible to stop Javascript execution? | System.exit in Javascript |

TABLE III: Examples of Stack Overflow post titles that were tagged as duplicates.
Firstly, we collect all Python post titles on Stack Overflow before February 2020, resulting in a total of 1.30M questions. Secondly, we create pairs of related code descriptions from the Stack Overflow duplicate post records that were not used to compile ground truth datasets and snippet collections. To avoid biasing the data toward questions with many duplicates (e.g., How to clone or copy a list has more than 289 duplicates), we only sample one pair per duplicate group. This resulted in 187K pairs of related code descriptions. Table III shows a few examples from the dataset.
V. MODELS FOR ANNOTATED CODE SEARCH
In line with recent work in code search [1, 2, 3, 4, 5], we formulate snippet retrieval as a $k$-nearest neighbor problem: queries and snippets are projected into a shared vector space such that any given query and its matching snippets are represented by similar vectors. The snippet collection $s_1, s_2, \ldots, s_N$ can then be ranked w.r.t. a query $q$ by sorting snippets according to the cosine similarity between their respective vectors $\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_N$ and the query vector $\mathbf{q}$.
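As a sketch, ranking a snippet collection for a query then reduces to a cosine-similarity search over the snippet vectors (here with plain NumPy; at scale one would use an approximate nearest-neighbor index):

```python
import numpy as np

def rank_snippets(query_vec: np.ndarray, snippet_vecs: np.ndarray, k: int = 10):
    """snippet_vecs: (N, dim) matrix; returns indices of the top-k snippets."""
    q = query_vec / np.linalg.norm(query_vec)
    s = snippet_vecs / np.linalg.norm(snippet_vecs, axis=1, keepdims=True)
    sims = s @ q                     # cosine similarity to every snippet
    return np.argsort(-sims)[:k]    # highest-similarity snippets first
```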

Fig. 5: Architectures of query/snippet similarity models presented in Section V: A) depicts the neural bag-of-words model of queries and snippet descriptions; B) depicts the query and snippet description embedding using USE; C) depicts the NCS architecture for embedding queries and actual code; and D) depicts an ensemble of embedding retrieval models.
An annotated code snippet $s$ consists of a description $d$ and the actual code snippet $c$. Ideally, we would train a model that jointly embeds $d$ and $c$. However, as mentioned in Section IV-B, we lack sufficient ground truth data to train such a model. Instead, we independently train two embedding models that separately capture the similarity between 1) descriptions and queries (Section V-A); and 2) code fragments and queries (Section V-B). In Section V-C, we explain how these models can be efficiently combined in a single ensemble model.
A. Query-description similarity
In this section, we present two query-description similarity models: Neural bag-of-words, which will be used as a baseline, and the universal sentence encoder.
Neural bag-of-words (NBOW): Bag-of-words models treat input texts as unordered word sets. While throwing away the sequence order may seem a crude approximation, for information retrieval it results in hard-to-beat baselines. Figure 5 A) illustrates how bag-of-words is applied in a neural setting. We obtain embeddings $\mathbf{q}^{\mathrm{bow}}$ and $\mathbf{d}^{\mathrm{bow}}$ for the query $q$ and description $d$ by summing their respective word embeddings:
$$\mathbf{q}^{\mathrm{bow}} = \sum_{j=1}^{m} e_1(q_j), \qquad \mathbf{d}^{\mathrm{bow}} = \sum_{j=1}^{n} e_1(d_j)$$
where $q_j$ and $d_j$ denote the $j^{\mathrm{th}}$ words in the query/description after pre-processing, and $e_1(\cdot)$ is the word embedding model that maps a description/query word to its embedding.
We pre-process queries and descriptions by applying token iz ation, lemma ti z ation, and stopword removal. To learn the word embeddings, we used fastText’s [21] continuous skip-gram model with negative sampling. This model is similar to the skip-gram model from the word2vec toolkit, but in contrast to the latter, fastText also incorporates subword information. The fastText embeddings are trained on the Python Stack Overflow questions dataset (see Section IV-B).
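A minimal sketch of the NBOW embedding, assuming a fastText model trained on the Stack Overflow titles; the model file name and the simplified pre-processing below are placeholders for the pipeline described above.

```python
import numpy as np
import fasttext

ft = fasttext.load_model("so_python_titles.bin")   # hypothetical model file

def preprocess(text: str) -> list:
    # stand-in for the tokenize + lemmatize + stop-word pipeline
    return [w for w in text.lower().split() if w not in {"a", "the", "in", "to"}]

def nbow_embed(text: str) -> np.ndarray:
    # sum of the word embeddings of the pre-processed tokens
    return np.sum([ft.get_word_vector(w) for w in preprocess(text)], axis=0)
```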
Universal sentence encoder (USE): The universal sentence encoder (USE) is a transformer-based transfer learning model that computes a vector for a sequence of words (e.g., a sentence or a paragraph). We leverage USE to compute representations for the queries and snippet descriptions as shown in Figure 5 B): USE utilizes a transformer to obtain contextualized representations $\mathbf{h}_1^q, \mathbf{h}_2^q, \ldots, \mathbf{h}_m^q$ for each query token. An embedding for the query is then calculated by summing the transformer's outputs $\mathbf{h}_1^q, \mathbf{h}_2^q, \ldots, \mathbf{h}_m^q$. It is worth noting that each representation $\mathbf{h}_i^q$ produced by the transformer is contextualized on all the query tokens, not solely on the current and preceding tokens $q_1, \ldots, q_i$ as would be the case with RNNs. The embeddings of the descriptions are computed analogously, with the same architecture and model parameters.
For our experiments, we used a pre-trained model available on TensorFlow Hub that outputs 512-dimensional embeddings and has a total of 147M parameters. The model was pre-trained in a multi-task learning setting with three tasks: 1) skip-thought's next sentence prediction [22]; 2) a conversational input-response task [23], where the goal is to assess the relevance of a response to a given input (e.g., a reply to an email or a Stack Overflow post); and 3) natural language inference using the SNLI corpus [24]. The training data for tasks 1 and 2 comes from a variety of sources crawled from the web, including Stack Overflow. This makes USE a good choice for downstream tasks in the software development domain, such as code search.
Although USE's pre-training ingests textual data from the software domain, without fine-tuning the model does not significantly outperform our baselines (see Section VII). Training USE directly on code search is infeasible due to the lack of annotated data. Therefore, we propose the detection of duplicate Stack Overflow posts as a proxy task. That is, we tune the USE model to distinguish similar Stack Overflow post titles from unrelated title pairs. This task encourages titles of duplicate posts to be mapped to similar vectors, while vectors of random title pairs are adjusted to be more orthogonal. As such, the USE embeddings are tailored to semantic similarity computation in the software domain.
A dataset of duplicate title pairs was created as described in Section IV. For every "positive" instance, we construct five negative examples by sampling random title pairs (for which we verify that they were not tagged as each other's duplicate). Given a title pair $(t_1, t_2)$, we model the probability that $t_1$ and $t_2$ are duplicates by first computing the cosine similarity of their USE embeddings and clipping negative similarities to zero with the rectified linear unit (ReLU) activation function, see Equation 1. Next, the result is fed to a layer with weight parameter $w$, bias $b$ and the logistic activation function $\sigma$, see Equation 2. We fine-tune $w$, $b$, and the parameters of USE $\theta$ to minimize binary cross-entropy. The inclusion of the ReLU activation avoids that the training objective pushes the cosine similarity between embeddings of unrelated titles to $-1$, as this corresponds to a strong negative correlation between the embeddings.
$$\mathrm{sim}(t_1, t_2) = \mathrm{ReLU}\big(\cos(\mathbf{t}_1, \mathbf{t}_2)\big) \quad (1)$$

$$P(\mathrm{duplicate} \mid t_1, t_2) = \sigma\big(w \cdot \mathrm{sim}(t_1, t_2) + b\big) \quad (2)$$

where $\mathbf{t}_1$ and $\mathbf{t}_2$ denote the USE embeddings of $t_1$ and $t_2$.
When $w$ and $b$ are initialized randomly, we found that the model sometimes degenerates to labeling every pair as unrelated. This issue is avoided by initializing $w$ and $b$ such that, initially, all pairs with a high cosine similarity are considered a duplicate with high probability. For our experiments, we initialized $w$ and $b$ with 15 and $-5$, respectively.
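A sketch of the training head in TensorFlow, assuming a publicly available USE checkpoint on TensorFlow Hub (the exact checkpoint used in the paper is not named); Equations 1 and 2 appear as the relu and sigmoid lines, and $w$, $b$ start at 15 and $-5$ as described.

```python
import tensorflow as tf
import tensorflow_hub as hub

# assumed checkpoint; the paper uses a 512-dim, 147M-parameter USE model
use = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-large/5",
    trainable=True)
w = tf.Variable(15.0)
b = tf.Variable(-5.0)

def duplicate_probability(titles_1, titles_2):
    e1 = tf.math.l2_normalize(use(tf.constant(titles_1)), axis=-1)
    e2 = tf.math.l2_normalize(use(tf.constant(titles_2)), axis=-1)
    cos = tf.reduce_sum(e1 * e2, axis=-1)   # cosine similarity per pair
    sim = tf.nn.relu(cos)                   # clip negative similarities (Eq. 1)
    return tf.sigmoid(w * sim + b)          # logistic output layer (Eq. 2)

# trained with binary cross-entropy on 1 positive : 5 negative title pairs
```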

Fig. 6: Illustration of how sentences are created from description-code snippet pairs to serve as input for training fastText embeddings. If we only append the code tokens to the description words (top right) as is done in the original NCS model, hist and histogram appear far apart. By creating the other contexts (middle and bottom right) they appear in closer proximity, causing fastText to push their embeddings closer in the vector space.
B. Query-code similarity
To compute the similarity between query and code, we experimented with models whose designs are based on two recently published (unannotated) code search models: NCS [2] and UNIF [3].
NCS: The architecture of the neural code search (NCS) model [2] is shown in Figure 5 C). Like the neural bag-of-words model introduced in the previous section, the query representation is computed as a sum of the query word embeddings: $\mathbf{q}^{ncs} = \sum_{j=1}^{m} e_2(q_j)$. Similarly, an embedding for the code is computed by summing the code token embeddings weighted by their respective IDF scores: $\mathbf{c}^{ncs} = \sum_{j=1}^{n} idf(c_j)\, e_2(c_j)$, where $e_2(\cdot)$ is the embedding model that maps code tokens and query words to their respective embeddings.
The embedding model $e_2(\cdot)$ is trained with fastText on a multi-modal text corpus containing both source code and natural language code descriptions. For each of the PACS benchmarks, we train a fastText skip-gram model using the respective snippet collections to create the text corpora. In the original NCS model, each sentence in the training corpus is constructed from a different description-code snippet pair by concatenating the description words with the code tokens. We found it to be beneficial to further augment the corpora by also: 1) inserting the description words in the middle of the code tokens; and 2) appending the description words to the code tokens (see Figure 6 and the sketch below). We also increase fastText's window size parameter to 20, whereas the original model kept fastText's default value (i.e., 5). These modifications are aimed at better capturing the relations between code tokens and description words and are inspired by work on multilingual embedding models [25]. In particular, by enforcing that code tokens and description words that belong to the same snippet appear close to each other in the fastText training corpus, their representations will become more similar.
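A sketch of the corpus augmentation in Figure 6: each description-code pair contributes three training sentences, so description words and code tokens appear in closer proximity.

```python
def augmented_sentences(description_words, code_tokens):
    mid = len(code_tokens) // 2
    return [
        description_words + code_tokens,                            # prepended (original NCS)
        code_tokens[:mid] + description_words + code_tokens[mid:],  # inserted in the middle
        code_tokens + description_words,                            # appended
    ]
```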
We pre-process queries by applying tokenization and stopword removal. Code snippets are tokenized with Tree-sitter, and we retain only code identifiers (method names, imports, variable names, keyword argument names, and other code identifiers) and inline comments. The code identifiers are split on camel case and underscores.
UNIF: The UNIF [3] and NCS architectures are similar, but instead of using TF-IDF weighting, UNIF computes the code token weights with a neural network. More specifically, the representation for the code $\mathbf{c}^{unif}$ is calculated as a weighted sum of the code token embeddings $e_3(c_1), e_3(c_2), \ldots, e_3(c_m)$:
$$\mathbf{c}^{unif} = \sum_{i=1}^{m} \alpha_i\, e_3(c_i), \qquad \alpha_i = \frac{\exp\big(\mathbf{a}^\top e_3(c_i)\big)}{\sum_{j=1}^{m} \exp\big(\mathbf{a}^\top e_3(c_j)\big)}$$
where $\mathbf{a}$ is a vector with the same dimensionality as the embeddings. UNIF trains $\mathbf{a}$ and fine-tunes the description and code token embeddings by maximizing the margin between 1) the cosine distance of descriptions and corresponding code snippets; and 2) the cosine distance of descriptions and random code snippets.
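A minimal NumPy sketch of this attention pooling, following the softmax weighting sketched in the equation above:

```python
import numpy as np

def unif_code_embedding(token_embeddings: np.ndarray, a: np.ndarray) -> np.ndarray:
    """token_embeddings: (m, dim) matrix e3(c_1..c_m); a: (dim,) learned vector."""
    logits = token_embeddings @ a
    alphas = np.exp(logits - logits.max())  # numerically stable softmax
    alphas /= alphas.sum()
    return alphas @ token_embeddings        # attention-weighted sum, shape (dim,)
```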
Although making the bag-of-words assumption yields a rudimentary view on source code, state-of-the-art results on code search are obtained under this assumption [2, 3, 4]. We hypothesize that one reason may be that for many snippets in the collections, re-ordering function calls does not lead to a meaningful snippet with a different intent (i.e., solving a different task).
C. Ensemble
We expect that the different similarity models we presented will make different types of errors. Therefore, an ensemble model that computes query-snippet similarity as a linear combination of the fine-tuned USE and NCS model outputs should obtain better results. We note that an ensemble of cosine similarity models can be easily reformulated as a single cosine similarity model:
$$\lambda_1 \cos\!\left(\mathbf{q}^{use}, \mathbf{d}^{use}\right) + \lambda_2 \cos\!\left(\mathbf{q}^{ncs}, \mathbf{c}^{ncs}\right)$$
is equivalent to
$$\cos\!\left(\left[\lambda_1\, \hat{\mathbf{q}}^{use};\; \lambda_2\, \hat{\mathbf{q}}^{ncs}\right],\ \left[\hat{\mathbf{d}}^{use};\; \hat{\mathbf{c}}^{ncs}\right]\right)$$

where $\hat{\cdot}$ denotes L2 normalization and $[\cdot\,;\cdot]$ vector concatenation; the two scores are equal up to a constant factor per query, so the ranking is unchanged.
This implementation trick is relevant when scaling up to very large snippet collections, as it allows using approximate nearest neighbor libraries such as faiss [26]. These libraries can do nearest neighbor search with billions of instances but expect that each respective instance/query is represented by a single vector.
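A sketch of this reformulation: normalize each model's vectors, scale the query side by the ensemble weights, and concatenate, after which a single cosine (or inner-product) index such as faiss can serve the ensemble. The $\lambda$ values 1 and 0.5 are those reported in Section VI-B.

```python
import numpy as np

def l2(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def ensemble_vectors(q_use, q_ncs, d_use, c_ncs, lam_use=1.0, lam_ncs=0.5):
    # query side carries the ensemble weights; snippet side is only normalized
    query = np.concatenate([lam_use * l2(q_use), lam_ncs * l2(q_ncs)])
    snippet = np.concatenate([l2(d_use), l2(c_ncs)])
    return query, snippet   # cos(query, snippet) ranks like the weighted sum
```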
VI. EXPERIMENTAL SETUP
A. Baselines
To compare with classic information retrieval approaches, we set up two Okapi BM25 retrieval models using the rank-bm25 library. Okapi BM25 is a popular bag-of-words ranking function based on exact keyword matching. $\mathrm{BM25}_{\mathrm{code}}$ indexes the code and serves as a baseline for the neural code models. It uses the same pre-processing as NCS and UNIF, except that we replace each query and code token by its lemma. $\mathrm{BM25}_{\mathrm{descr}}$ indexes the code descriptions and serves as a baseline for the neural bag-of-words and universal sentence encoder models. We also report the results of the original NCS model (i.e., without our proposed adaptations), which we reproduced from [2] (to our knowledge, the model is not publicly available).
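For reference, a minimal sketch of such a BM25 baseline with rank-bm25; the whitespace tokenization is a stand-in for the lemma-based pre-processing described above.

```python
from rank_bm25 import BM25Okapi

descriptions = ["plot correlation matrix using pandas",
                "log keras output to a file"]
corpus = [d.lower().split() for d in descriptions]   # tokenized index
bm25 = BM25Okapi(corpus)

scores = bm25.get_scores("plot a correlation matrix".lower().split())
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])  # best match first
```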
TABLE IV: Results on the Python annotated code search (PACS) benchmark test sets. Mean reciprocal rank (MRR), recall@3 (r@3) and recall@10 (r@10) are expressed as percentages. NCSoriginal* and UNIF* are the results obtained with our implementations of the NCS/UNIF models as described in [2] and [3], respectively (the original models were not released).
| Model | Input | CoNaLa MRR | CoNaLa r@3 | CoNaLa r@10 | StaQC-py MRR | StaQC-py r@3 | StaQC-py r@10 | SO-DS MRR | SO-DS r@3 | SO-DS r@10 |
|---|---|---|---|---|---|---|---|---|---|---|
| BM25code | code | 6.9 | 7.0 | 14.6 | 2.0 | 2.1 | 4.2 | 6.7 | 7.7 | 13.5 |
| NCSorig | code | 1.5 | 1.3 | 3.7 | 1.7 | 1.7 | 3.5 | 4.4 | 4.9 | 8.9 |
| NCSours | code | 16.7 | 19.9 | 31.2 | 3.0 | 3.2 | 6.1 | 11.3 | 12.7 | 23.5 |
| UNIF | code | 15.6 | 18.3 | 29.8 | 2.4 | 2.8 | 5.5 | 11.6 | 13.1 | 23.7 |
| BM25descr | description | 23.8 | 26.4 | 39.1 | 7.5 | 8.0 | 13.3 | 21.6 | 23.8 | 32.3 |
| NBOW | description | 17.7 | 18.8 | 30.6 | 9.5 | 10.9 | 16.6 | 24.7 | 27.7 | 38.0 |
| USE | description | 18.1 | 20.7 | 32.3 | 10.4 | 11.2 | 17.6 | 24.4 | 27.3 | 37.9 |
| USEtuned | description | 30.0 | 34.0 | 48.7 | 11.1 | 11.5 | 19.0 | 29.0 | 32.3 | 45.3 |
| USEtuned+ReLU | description | 34.0 | 37.9 | 54.9 | 11.7 | 12.8 | 20.3 | 30.4 | 33.3 | 48.5 |
| Ensemble | code + descr. | 35.1 | 39.8 | 57.2 | 12.6 | 14.3 | 22.7 | 32.3 | 37.0 | 50.1 |
B. Hyper-parameters
For training the embeddings of the neural bag-of-words model, we kept fastText's default parameters except for the number of epochs, which we increased to 300. The USE fine-tuning was implemented in TensorFlow [27] using stochastic gradient descent (SGD) with mini-batches of size 512, a learning rate of 1e-4, and Adam optimization [28]. We fine-tuned USE for 15 epochs, while tracking the retrieval performance on the SO-DS validation set every 51,200 examples, after which we retained the best performing snapshot to evaluate on the test set. The NCS embeddings (our version) were trained for 30 epochs, with window size 20, and without filtering out infrequent words. Other parameters were kept at fastText's default settings. The original NCS model did not change fastText's defaults. The UNIF model was trained for 20 epochs, using SGD with mini-batches of size 32, a learning rate of 1e-4, and Adam optimization [28]. For StaQC-py and SO-DS, we track retrieval performance on the respective validation sets every epoch and keep the best performing checkpoint for evaluation on the test set. For CoNaLa, we do not have a validation set, so we keep the model after 20 epochs. The $\lambda$ weights for the ensemble model were tuned on the SO-DS validation set and were set to 0.5 for NCS and 1 for USE.
C. Evaluation metrics
As discussed in Section III, we use mean reciprocal rank (MRR) and recall@k (r@k) to measure snippet retrieval performance. We cut off the search result list at 10 snippets when computing MRR, as users will rarely look beyond the first 10 results. For recall@k, we set $k=3$ and $k=10$, measuring the percentage of queries for which at least one matching snippet was found in the top-3 and top-10 results, respectively.
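A minimal sketch of both metrics, given a ranked list of snippet ids and the set of relevant ids for one query:

```python
def mrr_at_10(results, relevant):
    for rank, snippet_id in enumerate(results[:10], start=1):
        if snippet_id in relevant:
            return 1.0 / rank        # reciprocal rank of the first match
    return 0.0

def recall_at_k(results, relevant, k):
    # 1.0 if at least one matching snippet appears in the top-k, else 0.0
    return float(any(s in relevant for s in results[:k]))

# Averaged over the evaluation set, e.g.:
# mrr = sum(mrr_at_10(res, rel) for res, rel in queries) / len(queries)
```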
VII. RESULTS AND DISCUSSION
The performance of the various models on our annotated code search benchmarks is listed in Table IV. The results are grouped by the information that the models index: the source code, the description, or both. As expected, we find that leveraging code descriptions significantly boosts retrieval performance. The big gap between the best code-only and best description-only models confirms that retrieving snippets on the source code alone remains a very challenging task. This is likely due to the fact that the vocabulary of code tokens and description tokens is quite disjoint [2]. We also observe that our NCS variant and UNIF outperform classic keyword matching. The modifications to NCS that we proposed in Section V-B improve performance over the original model across all three benchmarks, and achieve results on par with or better than UNIF. Table V reports an ablation study of the proposed adjustments. It demonstrates that fastText hyper-parameter tuning, the more inclusive code pre-processing (i.e., retaining all code identifiers), and the fastText training corpus augmentation illustrated in Figure 6, all contribute to this performance increase.
TABLE V: Ablation study of the changes to the original NCS model: NCSoriginal*+X corresponds to the original model with one of our modifications, and NCSours-X corresponds to our NCS variant without one of our modifications. contextAugment refers to the method we propose for generating the fastText input data (see Figure 6).
| Method | CoNaLa MRR | CoNaLa r@3 | CoNaLa r@10 | StaQC-py MRR | StaQC-py r@3 | StaQC-py r@10 | SO-DS MRR | SO-DS r@3 | SO-DS r@10 |
|---|---|---|---|---|---|---|---|---|---|
| NCSoriginal* | 1.5 | 1.3 | 3.7 | 1.7 | 1.7 | 3.5 | 4.4 | 4.9 | 8.9 |
| NCSoriginal*+epochs=30 | 4.7 | 5.0 | 11.7 | 2.0 | 2.0 | 4.1 | 6.3 | 7.3 | 14.5 |
| NCSoriginal*+alIdentifiers | 2.6 | 2.8 | 6.0 | 2.0 | 2.0 | 4.3 | 6.5 | 7.2 | 14.4 |
| NCSoriginal*+window=20 | 1.5 | 1.7 | 3.7 | 2.0 | 2.1 | 4.1 | 5.9 | 6.0 | 12.6 |
| NCSoriginal*+contextAugment | 2.3 | 2.6 | 5.5 | 1.6 | 1.5 | 3.5 | 7.0 | 7.0 | 14.6 |
| NCSours-epochs=30 | 7.4 | 8.1 | 18.1 | 2.4 | 2.5 | 5.3 | 9.8 | 11.3 | 19.2 |
| NCSours-allIdentifiers | 10.1 | 10.6 | 19.9 | 2.5 | 2.7 | 5.3 | 8.9 | 8.7 | 19.3 |
| NCSours-window=20 | 15.8 | 18.0 | 29.0 | 2.5 | 2.7 | 5.4 | 9.8 | 11.1 | 20.7 |
| NCSours-contextAugment | 14.3 | 15.6 | 28.5 | 2.6 | 2.8 | 5.3 | 10.0 | 11.3 | 22.0 |
| NCSours | 16.7 | 19.9 | 31.2 | 3.0 | 3.2 | 6.1 | 11.3 | 12.7 | 23.5 |
In an additional experiment, we verified whether using more training data for NCS/UNIF could help close the gap with the description-only models. To this end, we constructed a superset of the SO-DS snippet collection by mining annotated snippets with the procedure described in Section IV, except that we retain 1,000 instead of 250 snippets per tag. Somewhat surprisingly, we found that having four times more training data did not improve the performance for NCS and UNIF.
When comparing the description-only baselines, we find that on CoNaLa classic keyword matching (i.e., BM25) outperforms the neural baselines (i.e., the neural bag-of-words model and USE without fine-tuning), whereas on StaQC-py and SO-DS the inverse is true. This indicates that for StaQC-py and SO-DS there tends to be more (accidental) keyword overlap between queries and descriptions of irrelevant snippets. This is to be expected given that the StaQC-py and SO-DS snippet collections are considerably larger than the CoNaLa collection.
Another important finding is that, across all three benchmarks, the proposed fine-tuning recipe boosts USE's performance significantly. The fine-tuned USE models outperform all other models by large margins. The experiments also validate the benefit of introducing the ReLU activation before the output layer, as it consistently leads to the best results. Given these large improvements, we also considered fine-tuning other pre-trained transformer models. We conducted preliminary experiments with sentence-BERT models [29], which outperformed USE on semantic similarity benchmarks, and GPT-2 [16]. Even after fine-tuning, these models were unable to beat the baselines. For the sentence-BERT models, we attribute this to the mismatch between the domain of the pre-training corpora and the software development domain. For the GPT-2 model, on the other hand, we found that the cosine similarities between hidden states of unrelated sequences are particularly high, a phenomenon that is studied in Ethayarajh [30]. This makes the model a poor fit for our fine-tuning procedure.
C: import matplotlib.pyplot as plt

C: import re
   url = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se'
   r = re.compile('(?<=href=").*?(?=")')
   r.findall(url)
S: https://stackoverflow.com/questions/9622163
S: https://stackoverflow.com/questions/499345

C: sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
S: https://stackoverflow.com/questions/30088006
S: https://stackoverflow.com/questions/38009682
TABLE VI: Example queries (Q) with one of the top-3 results (a description D, a code snippet C, and a URL to the source S) from the SO-DS collection, returned by the ensemble model.
表 VI: 集成模型返回的SO-DS集合中示例查询(Q)及其前三结果之一(描述D、代码片段C和来源链接S)。
A final observation from Table IV is that the ensemble model, which combines USEtuned+ReLU with NCSours, yields the best results. This indicates that incorporating the source code is still beneficial. To demonstrate that the ensemble model can answer non-trivial queries, Table VI reports one of the top-3 results from the SO-DS collection for four non-trivial example queries. Although each query has at most one word in common with the description or code, the results capture the query intent very well. Interestingly, the result for the query "check that tf uses gpu" also uncovers a limitation of current code search systems: the code snippet is relevant but outdated, as it will not work in the latest versions of the TensorFlow library. Dealing with the evolution of software libraries and tools in a code search engine is an interesting challenge for future research.
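For intuition, ensembling two retrieval models can be as simple as a convex combination of their per-snippet similarity scores. The sketch below is our own illustration; the exact combination rule and weighting used for USEtuned+ReLU and NCSours are not reproduced here, and alpha is a hypothetical weight.

import numpy as np

def cosine_scores(query_vec, snippet_mat):
    # Cosine similarity between one query vector and each row of a matrix.
    snippet_mat = snippet_mat / np.linalg.norm(snippet_mat, axis=1, keepdims=True)
    query_vec = query_vec / np.linalg.norm(query_vec)
    return snippet_mat @ query_vec

def ensemble_scores(q_use, q_ncs, desc_embs, code_embs, alpha=0.5):
    # Convex combination of description-based (USE) and code-based (NCS)
    # similarities over the same snippet collection.
    return alpha * cosine_scores(q_use, desc_embs) + \
           (1 - alpha) * cosine_scores(q_ncs, code_embs)

# ranking = np.argsort(-ensemble_scores(q_use, q_ncs, desc_embs, code_embs))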
To further improve code search, we see the most potential in learning better code representations. An open question is whether the transfer learning techniques that have pushed the state of the art in natural language processing can yield similar improvements when leveraged for code representation learning. One could question to what extent the distributional hypothesis (i.e., the idea that words that occur in the same contexts tend to have similar meanings [31]) is as powerful for source code. Because software developers are trained to avoid code duplication (by refactoring repeated code fragments into functions, methods, modules, etc.), what can be learned from token co-occurrence is inherently more limited due to the sparsity of patterns. This raises the question of whether we should incorporate other modalities, such as runtime information, when learning code representations.
Threats to validity
Stack Overflow titles as queries: As in previous code search studies [2, 3, 5], ground-truth queries are compiled from Stack Overflow post titles. While the resulting evaluation and test sets are a good reflection of the types of questions that programmers have, Stack Overflow titles can be more verbose than actual code search queries.
Training data: Due to the inherent differences between their architectures, NCS/UNIF and USE cannot be trained on the same data. Therefore, the results are sensitive to the training data we chose for the respective models. As this work was mainly concerned with investigating the benefit of code descriptions, we aimed to optimize the training data for the code-only models. We verified whether the performance of these models improved when training them on more in-domain code snippet–description pairs, but this was not the case.
Interpreting performance: Our approach to constructing the respective ground-truth datasets does not yield an exhaustive set of all relevant snippets for a given query. As such, the performance metrics in this paper are only meaningful for comparing models relative to each other, not for measuring absolute performance. For instance, a recall@10 score of 50% does not mean that the model retrieves a relevant snippet in the top 10 for (only) half of the queries.
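To make this caveat concrete, the sketch below shows how MRR and recall@k are computed when each query is paired with a single labeled snippet (a simplifying assumption for illustration). A rank of None means the labeled snippet was not retrieved at all, even though an unlabeled but relevant snippet may well appear in the top 10.

def mrr(ranks):
    # `ranks` holds the 1-based rank of the labeled snippet for each query,
    # or None when it does not appear in the result list at all.
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

def recall_at_k(ranks, k):
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

ranks = [1, 4, None, 12]       # hypothetical ranks for four queries
print(mrr(ranks))              # (1 + 1/4 + 0 + 1/12) / 4 ≈ 0.33
print(recall_at_k(ranks, 10))  # 2/4 = 0.5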
Snippet collection: The choice of the snippet collection unavoidably impacts the results. For instance, one could argue that the lack of static types makes learning code representations for Python more challenging than for languages like Java. Given the considerable performance gaps between the code-only and description-only models across three different snippet collections, we expect our conclusions to remain valid on other collections.
VIII. CONCLUSION
As the number of software libraries and tools increases, so does the need for effective code search tools to identify relevant example code. Most work on code search considers the problem of retrieving unannotated code snippets based on natural language queries. In this work, we considered the problem of retrieving code snippets annotated with a brief description of their intent, as is usually available for code that is part of tutorials, cheat sheets, textbooks, notebooks, and other forms of code documentation.
We showed that by leveraging code descriptions and by applying recent advances in natural language processing, we can build retrieval models that produce far more relevant search results compared to models that solely use source code. In particular, we described how we can fine-tune the universal sentence encoder for code search, showed how it can be effectively combined with a new variant of the neural code search model, and evaluated the models on three new code search benchmarks. On these benchmarks, we significantly outperformed state-of-the-art code search models, with absolute gains up to 20.6%, 23.9%, and 26.4% in mean reciprocal rank (MRR), recall@3, and recall@10, respectively.
REFERENCES
thought Vectors," in Advances in Neural Information Processing Systems, 2015, pp. 3294–3302.
[23] M. Henderson, R. Al-Rfou, B. Strope, Y.-H. Sung, L. Lukacs, R. Guo, S. Kumar, B. Miklos, and R. Kurzweil, "Efficient Natural Language Response Suggestion for Smart Reply," arXiv preprint arXiv:1705.00652, 2017.
[24] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, "A Large Annotated Corpus for Learning Natural Language Inference," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2015.
[25] I. Vulić and M.-F. Moens, "Bilingual Word Embeddings from Comparable Data with Application to Bilingual Lexicon Extraction and Word Translation Disambiguation," Journal of Artificial Intelligence Research, 2016.
[26] J. Johnson, M. Douze, and H. Jégou, "Billion-Scale Similarity Search with GPUs," arXiv preprint arXiv:1702.08734, 2017.
[27] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems," 2015. [Online]. Available: http://tensorflow.org/
[28] D. Kingma and M. Welling, "Auto-Encoding Variational Bayes," in Proceedings of the International Conference on Learning Representations (ICLR), 2014, pp. 1–14.
[29] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT Networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, 2019, pp. 3982–3992. [Online]. Available: https://www.aclweb.org/anthology/D19-1410
[30] K. Ethayarajh, "How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 55–65.
[31] Z. S. Harris, "Distributional Structure," Word, vol. 10, no. 2–3, pp. 146–162, 1954.
