[论文翻译]神经代码搜索再探:通过自然语言意图增强代码片段检索


原文地址:https://arxiv.org/pdf/2008.12193v1


Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent

神经代码搜索再探:通过自然语言意图增强代码片段检索

Abstract

摘要

In this work, we propose and study annotated code search: the retrieval of code snippets paired with brief descriptions of their intent using natural language queries. On three benchmark datasets, we investigate how code retrieval systems can be improved by leveraging descriptions to better capture the intents of code snippets. Building on recent progress in transfer learning and natural language processing, we create a domain-specific retrieval model for code annotated with a natural language description. We find that our model yields significantly more relevant search results (with absolute gains up to $20.6\%$ in mean reciprocal rank) compared to state-of-the-art code retrieval methods that do not use descriptions but attempt to compute the intent of snippets solely from unannotated code.

在本工作中,我们提出并研究了带注释的代码搜索:通过自然语言查询检索配有简短意图描述的代码片段。基于三个基准数据集,我们探究了如何利用描述来更好地捕捉代码片段的意图,从而改进代码检索系统。结合迁移学习和自然语言处理领域的最新进展,我们创建了一个针对带自然语言描述注释代码的领域专用检索模型。实验表明,与最先进的、不使用描述而仅从未注释代码推断片段意图的代码检索方法相比,我们的模型能显著提升搜索结果相关性(平均倒数排名绝对增益最高达 $20.6\%$)。

I. INTRODUCTION

I. 引言

Many modern-day applications and software systems are built from a variety of open source libraries and components. As a result, developers find themselves working with a growing number of libraries and tools in their day-to-day coding activities. Quickly and efficiently retrieving relevant documentation for these libraries is becoming increasingly important. Concrete code examples that illustrate the usage of a library or specific library feature are particularly helpful to software developers, as witnessed by the rapid growth of online QA platforms such as Stack Overflow.

许多现代应用程序和软件系统由各种开源库和组件构建而成。因此,开发人员发现自己在日常编码活动中需要使用的库和工具越来越多。快速高效地检索这些库的相关文档变得愈发重要。展示库或特定库功能使用的具体代码示例对软件开发人员特别有帮助,Stack Overflow 等在线问答平台的快速增长就是明证。

To address this need, the research community has been investigating systems that retrieve code snippets from natural language queries [1, 2, 3, 4, 5, 6]. All of these works use neural models to compute the similarity between code snippets and natural language queries. While this setup enables code search on large code corpora, understanding the intent of an unannotated code fragment remains challenging even for experienced developers, let alone for automated systems. For instance, programmers that were asked to assess the relevance of code snippets to a query reported that it would have been helpful to get additional context about the code snippet [4]. Moreover, the low percentage of keyword matches between the search terms and the code tokens [2] makes the retrieval of unannotated code snippets particularly challenging.

为满足这一需求,研究界一直在探索从自然语言查询中检索代码片段的系统[1, 2, 3, 4, 5, 6]。这些工作均采用神经网络模型计算代码片段与自然语言查询之间的相似度。虽然这种架构支持大规模代码库的搜索,但理解未注释代码段的意图对经验丰富的开发者而言仍具挑战性,更不用说自动化系统。例如,被要求评估代码片段与查询相关性的程序员反馈称,获取代码片段的额外上下文会很有帮助[4]。此外,搜索词与代码token之间的关键词匹配率较低[2],这使得未注释代码片段的检索尤为困难。

In this paper, we take a step back and investigate code search in a more restricted (but, we argue, equally useful) setting where code snippets are accompanied with brief natural language descriptions that capture their intent. Code snippets that are meant to be reused by others are frequently accompanied by descriptions. Such annotated snippets are found for instance in coding cheat sheets, interactive notebook environments, official library documentation, technical blog posts, programming textbooks, or on QA websites such as Stack Overflow. While none of these resources use a common, structured format to label code snippets with their intent, simple heuristics can often be used to pair code snippets with meaningful descriptions. We demonstrate that the quality of the search results improves significantly when these descriptions are taken into account by the snippet ranking algorithm.

本文回归基础,研究了一种更受限制(但我们认为同样实用)的代码搜索场景:代码片段附带有描述其意图的简短自然语言说明。那些旨在被他人复用的代码片段通常都会配有描述性文字,这类带标注的片段可见于编程速查表、交互式笔记本环境、官方库文档、技术博客文章、编程教科书,或是Stack Overflow等问答网站。虽然这些资源都未采用统一的结构化格式来标记代码片段的意图,但通过简单启发式方法往往能将代码片段与有意义的描述配对。我们证明,当片段排序算法将这些描述纳入考量时,搜索结果质量会显著提升。

The key contributions of this paper are as follows:

本文的主要贡献如下:

• We propose the annotated code search task: the problem of retrieving code snippets annotated with short descriptions, from natural language queries. To evaluate this task, we created three benchmark datasets, each of which is comprised of an annotated code snippet collection and a set of queries linked to one or more relevant snippets in the collection.

• 我们提出了带注释的代码搜索任务:从自然语言查询中检索带有简短描述的代码片段。为了评估该任务,我们创建了三个基准数据集,每个数据集均由带注释的代码片段集合和一组与集合中一个或多个相关片段链接的查询组成。

II. BACKGROUND RELATED WORK

II. 背景与相关工作

A. Code search

A. 代码搜索

There has been a growing interest in retrieving code using natural language queries [1, 2, 3, 4, 5, 6]. These works use neural networks to project code and queries into the same vector space. During training, the model parameters are optimized to minimize the distance between pairs of code snippets and corresponding natural language descriptions. These descriptions are typically extracted from the method documentation. The rationale is that, at prediction time, a query will be projected to a vector that is similar to the vector representations of related code snippets. As such, code search can be reduced to a $k$-nearest neighbors problem in the joint query-snippet vector space. Gu et al. [1], Sachdev et al. [2], Cambronero et al. [3], Husain et al. [4], Yao et al. [5], and Feng et al. [6] have all built retrieval models for snippet collections that consist of pure source code. By contrast, in this work we study a different code search setting where every snippet is annotated with a brief natural language description describing the code’s intent.

近年来,利用自然语言查询检索代码的研究日益受到关注 [1, 2, 3, 4, 5, 6]。这些工作通过神经网络将代码和查询映射到同一向量空间。训练过程中,模型参数通过最小化代码片段与其对应自然语言描述之间的距离进行优化,这些描述通常提取自方法文档。其核心思想是:在预测阶段,查询语句会被映射到与相关代码片段向量表示相近的向量,从而使代码搜索转化为联合查询-片段向量空间中的 $k$ 最近邻问题。Gu等 [1]、Sachdev等 [2]、Cambronero等 [3]、Husain等 [4]、Yao等 [5]、Feng等 [6] 均已构建了针对纯源代码片段集合的检索模型。而本研究则探讨了一种不同的代码搜索场景:每个代码片段都标注了描述代码意图的简短自然语言说明。

Yao et al. [5] proposed a model to automatically generate descriptions from the code and utilize reinforcement learning to directly optimize the descriptions for improving code retrieval. They showed that an ensemble of 1) a model that used the generated descriptions and 2) an existing code search model yielded better performance for retrieving SQL code snippets. Ye et al. [7] improved on this work with a dual learning model that simultaneously optimizes for code generation and code summarization. Both of these works describe an alternative framework to obtain code descriptions but are otherwise orthogonal to our work.

Yao等人[5]提出了一种从代码自动生成描述并利用强化学习直接优化描述以改进代码检索的模型。他们证明,结合1)使用生成描述的模型与2)现有代码搜索模型的集成方法,在检索SQL代码片段时能获得更好性能。Ye等人[7]通过双学习模型改进了这项工作,该模型同时优化代码生成和代码摘要任务。这两项工作都提出了获取代码描述的替代框架,但在其他方面与我们的工作正交。

Recent work explored the application of NLP transfer learning techniques to code [8, 6]. CodeBERT [6], a bimodal transformer model for code and natural language, obtained state-of-the-art results on highly specific tasks (for example, identifying methods/functions given their documentation among a list of “distractor” snippets), but performance on such proxy tasks correlates poorly with performance on actual code search [4]. It therefore remains to be seen how well CodeBERT performs on downstream tasks such as code search. In this work, we study the retrieval of code snippets leveraging both the code tokens and a short description of the code. For computing the similarity between the query and the code, we compare with existing models such as NCS [2] and UNIF [3].

近期研究探索了将自然语言处理 (NLP) 迁移学习技术应用于代码领域 [8, 6]。CodeBERT [6] 作为一款面向代码与自然语言的双模态 Transformer 模型,在高度特定任务中 (例如从干扰代码片段列表中识别出与文档描述匹配的方法/函数) 取得了最先进成果,但此类代理任务的表现与实际代码搜索性能相关性较低 [4]。因此 CodeBERT 在代码搜索等下游任务中的表现仍有待验证。本研究致力于同时利用代码 Token 和简短描述来实现代码片段检索,在计算查询与代码相似度时,我们与 NCS [2] 和 UNIF [3] 等现有模型进行了对比。

B. Transfer learning in natural language processing

B. 自然语言处理中的迁移学习

In the past two years, natural language processing has seen substantial improvements across a variety of tasks including question answering, summarization, and information retrieval [9, 10]. Researchers have shown that fine-tuning large models, pre-trained on vast amounts of unlabeled text data, is often far superior to learning models from scratch on labeled, task-specific data only [11, 12, inter alia]. During pre-training, a neural net is optimized to predict words and/or sentences from their context (i.e., the previous or surrounding words/sentences), and as a result, learns to compute word representations that capture semantic and syntactic properties of the words and the context in which they appear. During fine-tuning, the last layer of the pre-trained model is replaced with a task-specific layer and all model parameters are fine-tuned on labeled data of the downstream task.

过去两年,自然语言处理在问答、摘要和信息检索等多项任务中取得了显著进步[9, 10]。研究表明,基于海量无标注文本数据预训练的大模型经过微调后,其表现通常远超仅用标注任务数据从头训练的模型[11, 12等]。预训练阶段,神经网络通过预测上下文(即前后或周边词句)中的单词/句子进行优化,从而学习到能捕捉词语及其所处上下文语义与句法特征的词向量表示。微调阶段则替换预训练模型的最后一层为任务特定层,并基于下游任务的标注数据对所有模型参数进行微调。

All recent state-of-the-art systems use the Transformer architecture [13], a neural network that processes sequences using multi-head self-attention. This facilitates capturing long-range dependencies and makes the computations more parallelizable than those of recurrent neural networks (RNNs). Due to the increased scale of the models and text corpora, pre-training has become impractical for the average researcher. Fortunately, several institutions have made pre-trained models publicly available: the universal sentence encoder (USE) [14], GPT and GPT-2 [15, 16], BERT [11], RoBERTa [17], and T5 [9] are all publicly available. The models differ in a) the pre-training objectives (e.g., BERT predicts words given the surrounding words, while GPT models only use the previous words); b) the text corpora on which they were trained; c) the model size; and d) variations on how the transformer architecture is used (e.g., BERT computes representations of (sub)words, sentences, and pairs of sentences, whereas the universal sentence encoder is aimed at computing sentence representations only).

所有最新的先进系统都采用了Transformer架构[13],这是一种利用多头自注意力机制处理序列的神经网络。该架构有助于捕捉长距离依赖关系,并使计算比循环神经网络(RNN)更具并行性。由于模型规模和文本语料库的扩大,预训练对普通研究者来说已变得不切实际。幸运的是,多家机构公开提供了预训练模型:通用句子编码器(USE)[14]、GPT和GPT-2[15,16]、BERT[11]、RoBERTa[17]、T5[9]均可公开获取。这些模型的主要区别在于:a)预训练目标(例如BERT预测给定上下文单词,而GPT模型仅使用上文单词);b)训练所用的文本语料库;c)模型规模;d)Transformer架构的应用方式差异(例如BERT计算(子)词、句子和句子对的表示,而通用句子编码器仅专注于计算句子表示)。

Our interest in these transfer learning models stems from the need to compute the semantic similarity between search queries and short code descriptions. Because software development is terminology-heavy, the domains of the training corpora play a crucial role in performance on downstream tasks such as code search. Although the T5 model yields state-of-the-art results on many benchmarks, it is less suited for tasks in the software domain because its pre-training data was filtered with a heuristic to exclude documents that contain code. By contrast, the text corpora on which the universal sentence encoder was trained included data from QA websites such as Stack Overflow.

我们对这些迁移学习模型的兴趣源于需要计算搜索查询与简短代码描述之间的语义相似度。由于软件开发涉及大量专业术语,训练语料的领域对代码搜索等下游任务的性能起着关键作用。尽管T5模型在许多基准测试中取得了最先进的结果,但由于其预训练数据通过启发式方法过滤排除了包含代码的文档,该模型不太适合软件领域的任务。相比之下,通用句子编码器的训练文本语料包含了Stack Overflow等问答网站的数据。

III. ANNOTATED CODE SEARCH

III. 带注释的代码搜索

We cast annotated code search as a retrieval task: given a collection of annotated code snippets $\mathcal{C} = \{s_1, s_2, \ldots, s_N\}$, where each annotated snippet $s_i$ consists of a code fragment $c_i$ paired with a brief natural language description $d_i$, and a natural language query $q$, the goal is to generate a ranked list of code snippets $s_{q_1}, \ldots, s_{q_k}$. In the context of this paper, a code snippet will always be paired with a description, so for convenience, we will use the terms annotated code snippet and code snippet interchangeably. Prior work has studied the retrieval of code snippets without descriptions; we will refer to this setting as code-only code search.

我们将带注释的代码搜索视为一项检索任务:给定带注释代码片段的集合 $\mathcal{C} = \{s_1, s_2, \ldots, s_N\}$(其中每个带注释的片段 $s_i$ 由代码片段 $c_i$ 与简短自然语言描述 $d_i$ 配对组成)以及一个自然语言查询 $q$,目标是生成排序后的代码片段列表 $s_{q_1}, \ldots, s_{q_k}$。在本文的上下文中,代码片段总是与描述配对,因此为了方便起见,我们将交替使用"带注释的代码片段"和"代码片段"这两个术语。先前的工作研究了不带描述的代码片段检索,我们将此设置称为纯代码 (code-only) 搜索。

To benchmark the retrieval performance of a model on a snippet collection $\mathcal{C}$, we require a ground truth dataset that links queries to code snippets from $\mathcal{C}$ that match the query. Performance can then be measured objectively with standard information retrieval metrics: mean reciprocal rank (MRR) and recall@k (r@k). The reciprocal rank is the inverse of the rank of the first matching snippet for a given query; MRR computes the average of the reciprocal ranks over the queries in the evaluation set. For MRR the top-ranked snippets are most influential, e.g., finding a relevant snippet at rank 1 instead of rank 2 will have a bigger impact on the score than finding a snippet at rank 4 instead of rank 5. Recall@k is defined as the percentage of queries for which at least one matching snippet was found in the top-k results.

为了衡量模型在代码片段集合 $\mathcal{C}$ 上的检索性能,我们需要一个基准真值数据集,将查询与 $\mathcal{C}$ 中匹配该查询的代码片段关联起来。随后可通过标准信息检索指标客观评估性能:平均倒数排名 (MRR) 和召回率@k (r@k)。倒数排名是指给定查询的第一个匹配片段排名的倒数,MRR 计算评估集中所有查询倒数排名的平均值。对于 MRR 而言,排名靠前的片段影响最大,例如将相关片段从第 2 名提升至第 1 名对分数的影响,比从第 5 名提升至第 4 名更大。召回率@k 定义为在排名前 k 的结果中至少找到一个匹配片段的查询所占的百分比。
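下面给出这两个指标的一个最小示意实现(假设 `results` 是每个查询返回的片段ID排序列表,`relevant` 是对应的相关片段ID集合;函数与变量名为说明而设,并非论文原始代码):

```python
def mrr(results, relevant):
    """平均倒数排名:对每个查询取第一个相关结果排名的倒数,再对所有查询求平均。"""
    total = 0.0
    for ranked, gold in zip(results, relevant):
        for rank, snippet_id in enumerate(ranked, start=1):
            if snippet_id in gold:
                total += 1.0 / rank
                break
    return total / len(results)

def recall_at_k(results, relevant, k):
    """召回率@k:前 k 个结果中至少命中一个相关片段的查询所占比例。"""
    hits = sum(1 for ranked, gold in zip(results, relevant)
               if any(s in gold for s in ranked[:k]))
    return hits / len(results)
```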

To get insight into the difficulty of a benchmark, we will also measure the word overlap between the queries and the descriptions of matching snippets. We define the word overlap between a query $q$ and a description $d$ as the number of unique words that occur in both $q$ and $d$. Relative overlap is computed as the ratio between the overlap and the number of unique words in $q$. Before computing the overlap, $q$ and $d$ are tokenized and lowercased, and stop words are filtered out.

为了解基准测试的难度,我们还将计算查询与匹配片段描述之间的词汇重叠度。查询 $q$ 和描述 $d$ 之间的词汇重叠定义为二者共有的唯一词汇数量;相对重叠率为重叠词数与查询 $q$ 中唯一词汇总数的比值。在计算重叠前,$q$ 和 $d$ 需进行分词、小写转换并过滤停用词。
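上述相对重叠率可以用如下草图计算(这里以空白切分和一个小型停用词表近似原文未详述的预处理细节,仅作示意):

```python
STOP_WORDS = {"how", "to", "a", "an", "the", "in", "of", "is", "for", "with"}  # 仅作示意

def relative_overlap(query: str, description: str) -> float:
    """相对词汇重叠率:查询与描述共有的唯一词数 / 查询中的唯一词数。"""
    def prep(text):
        tokens = text.lower().split()  # 原文使用分词器,此处以空白切分近似
        return {t for t in tokens if t not in STOP_WORDS}
    q, d = prep(query), prep(description)
    return len(q & d) / len(q) if q else 0.0
```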

Records from the CoNaLa corpus

来自CoNaLa语料库的记录

Fig. 1: Illustration of how the CoNaLa corpus [18] is converted into a benchmark for annotated code search.

图 1: CoNaLa语料库[18]被转换为带标注代码搜索的基准测试。

IV. PACS: BENCHMARKS FOR ANNOTATED CODE SEARCH

IV. PACS: 带标注代码搜索基准

Existing code search benchmarks [1, 2, 3, 4, 5] link queries to relevant code snippets, but these snippets are not consistently paired with descriptions and hence not directly applicable to this study. To evaluate and train annotated code retrieval models, we therefore created three benchmarking datasets, which we collectively refer to as the Python Annotated Code Search benchmark (PACS). Table I gives an overview of the PACS datasets. In the remainder of this section, we provide details on the construction of the benchmarks and explain how we obtain relevant training data.

现有代码搜索基准[1, 2, 3, 4, 5]将查询与相关代码片段关联,但这些片段并未统一配备描述,因此无法直接用于本研究。为评估和训练带注释的代码检索模型,我们创建了三个基准数据集,统称为Python注释代码搜索基准(PACS)。表1概述了PACS数据集。本节后续内容将详述基准构建过程,并说明相关训练数据的获取方法。

A. Snippet collections and ground truth

A. 代码片段集合与真实数据

We compile snippet collections from two existing datasets (CoNaLa and StaQC) and mine a third snippet collection from Stack Overflow posts in the data science domain.

我们基于两个现有数据集(CoNaLa和StaQC)构建代码片段集合,并从数据科学领域的Stack Overflow帖子中挖掘第三个代码片段集合。

a) CoNaLa: The CoNaLa corpus [18] is a curated collection of short Python code snippets annotated with their natural language intent. Yin et al. [18] used crowd-sourcing to label a sample of Stack Overflow posts. An example record of the corpus is shown in Figure 1. It consists of a code snippet, intent, re-written intent, and the id of the Stack Overflow post. The code snippets are selected from post answers. The intent is a high-level, natural language description of the code snippet and typically corresponds to the Stack Overflow post title. Because more than one code snippet can be extracted from the same post, different code snippets are sometimes annotated with the same intent. The re-written intent is a reformulation of the intent that should better reflect the full meaning of the code.

a) CoNaLa: CoNaLa语料库[18]是一个精选的Python短代码片段集合,附带自然语言意图标注。Yin等人[18]通过众包方式对Stack Overflow帖子样本进行了标注。该语料库的示例记录如图1所示,包含代码片段、意图、改写意图及Stack Overflow帖子ID。代码片段选自帖子回答,意图是对代码片段的高层次自然语言描述,通常对应Stack Overflow帖子标题。由于同一帖子可能提取多个代码片段,不同代码片段有时会标注相同意图。改写意图是对原意图的重新表述,应更完整地反映代码含义。

Figure 1 illustrates how we create a benchmark for annotated code search from the corpus. The snippet collection is constructed by selecting the re-written intents and their corresponding code snippets. This results in 2,777 snippets. To create the ground truth, we group the CoNaLa records by their Stack Overflow post id. For each group, we use the intent of one of the snippets as a query. The matching snippets are the code snippets that were constructed from the records in this group. Many of the re-written intents are similar to the original intent. Queries that have a relative word overlap of more than 0.5 with the description of a matching snippet are removed. This results in a ground truth dataset with 762 queries.

图1展示了我们如何从该语料库构建带注释代码搜索基准。片段集合通过选取重写意图及其对应代码片段构建而成,共得到2,777个片段。为建立基准真值,我们按Stack Overflow帖子ID对CoNaLa记录进行分组,每组选用其中一条记录的意图作为查询,该组所有记录对应的代码片段即为匹配片段。许多重写意图与原始意图高度相似,因此我们剔除了与任一匹配片段描述的相对词汇重叠率超过0.5的查询,最终生成包含762条查询的基准真值数据集。
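这一构建流程大致可以概括为如下草图(字段名沿用图1中的 intent、rewritten_intent、snippet 与 question_id;`relative_overlap` 复用上文的重叠率函数,整体仅为示意性重构):

```python
from collections import defaultdict

def build_conala_ground_truth(records, threshold=0.5):
    """按帖子ID分组;以其中一条记录的 intent 作查询,组内片段为匹配结果;
    若查询与任一匹配片段的描述重叠率超过阈值,则剔除该查询。"""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["question_id"]].append(rec)

    ground_truth = []
    for recs in groups.values():
        query = recs[0]["intent"]
        descriptions = [r["rewritten_intent"] for r in recs]
        snippets = [r["snippet"] for r in recs]
        if all(relative_overlap(query, d) <= threshold for d in descriptions):
            ground_truth.append((query, snippets))
    return ground_truth
```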

Original StaQC corpus

原始 StaQC 语料库


Fig. 2: Illustration of the additional cleaning for the StaQC corpus [19]: we strip prompts, filter out code snippets that do not parse, and automatically rewrite questions as descriptions with simple regular expressions.


图 2: StaQC语料库 [19] 的额外清洗流程示意: 我们移除提示符、过滤无法解析的代码片段, 并通过简单正则表达式自动将问题重写为描述语句。


Fig. 3: To create the StaQC-py and SO-DS ground truth we make use of Stack Overflow posts that were tagged as duplicates by users.

图 3: 为构建 StaQC-py 和 SO-DS 的真实标注数据,我们利用了被用户标记为重复内容的 Stack Overflow 帖子。

b) StaQC-py: StaQC [19] is a large collection of Python and SQL question-code snippet pairs that are mined from Stack Overflow. These pairs were extracted with machine learning models that identify how-to questions and their corresponding code snippet(s) in the answers. The dataset has been used in prior work on code summarization [20] and (code-only) code search [5]. To create a snippet collection we started from the 272K Python question-code pairs in StaQC and performed additional cleaning as illustrated in Figure 2: we strip prompt prefixes (>>>, ..., In [1]:, etc.), filter out code snippets that do not parse in Python 2 and 3, and use simple heuristics to reformulate questions as descriptions. This results in a snippet collection of 204K (description, code snippet) pairs.

b) StaQC-py: StaQC [19] 是一个从 Stack Overflow 挖掘的大型 Python 和 SQL 问题-代码片段对集合。这些数据对由机器学习模型提取,该模型负责识别 how-to 问题及其答案中对应的代码片段。该数据集已用于先前关于代码摘要 [20] 和(纯代码)代码搜索 [5] 的研究工作。为创建代码片段集合,我们从 StaQC 的 272K 个 Python 问题-代码对出发,并进行了如图 2 所示的额外清理:移除提示符前缀(>>>、...、In [1]: 等),过滤无法在 Python 2 和 3 中解析的代码片段,并使用简单启发式方法将问题改写为描述。最终得到包含 204K 个(描述, 代码片段)对的片段集合。
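上述清洗流程可以用如下草图示意(其中的正则表达式和启发式规则是笔者的近似重构,并非 StaQC 原始脚本;为简化起见仅检查 Python 3 语法):

```python
import ast
import re

PROMPT_RE = re.compile(r"^(>>> ?|\.\.\. ?|In \[\d+\]: ?)")

def strip_prompts(code: str) -> str:
    """逐行移除 >>>、... 与 In [n]: 等解释器提示符前缀。"""
    return "\n".join(PROMPT_RE.sub("", line) for line in code.splitlines())

def parses(code: str) -> bool:
    """片段能否被解析为合法 Python(原文同时检查 Python 2 和 3,此处仅示意 Python 3)。"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def question_to_description(title: str) -> str:
    """用简单正则把问题标题改写为描述,例如 "How do I merge two dicts?" -> "merge two dicts"。"""
    title = re.sub(r"(?i)^how\s+(do|can|would|should)\s+(i|you|we)\s+", "", title)
    title = re.sub(r"(?i)^how\s+to\s+", "", title)
    return title.rstrip("?").strip()
```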

To create the ground truth, we leverage the fact that some Stack Overflow posts are tagged as a duplicate of another post. The titles of duplicate posts are descriptions that are created independently by different users but reflect the same underlying intent. Therefore, as illustrated in Figure 3, we collect the duplicates of the posts used for the StaQC snippet collection and take the duplicates’ titles as queries and the corresponding StaQC snippets as the matching results.

为创建基准数据,我们利用了一些Stack Overflow帖子被标记为其他帖子重复版本这一事实。重复帖子的标题是由不同用户独立创建但反映相同底层意图的描述。因此,如图3所示,我们收集了用于StaQC代码片段集的帖子的重复版本,并将重复帖子的标题作为查询,将对应的StaQC代码片段作为匹配结果。

The duplicate posts are collected from the Stack Overflow dump of February 2020. Some posts are tagged as a duplicate of multiple questions, and it also happens that a post A is tagged as a duplicate of post B, while post C is tagged as a duplicate of A. We transitively group all such duplicate relations (i.e., A, B, and C will be part of the same group) and end up with 195K duplicate groups. Most groups consist of two posts, but popular posts can have more than 100 duplicates. To create the StaQC ground truth, we select all groups containing 1) a post that was the source for a StaQC snippet (to have snippets that match the query), as well as 2) a post that was not used for the StaQC data (to select the query). This resulted in 5,497 duplicate groups. From each such group we extracted a query with matching results. The ground truth set is split evenly into a validation and a test set. For the validation set we additionally filter out any queries that overlap with the CoNaLa and SO-DS test sets to avoid overfitting when tuning model hyper-parameters.

重复帖子收集自2020年2月的Stack Overflow数据转储。部分帖子被标记为多个问题的重复项,也存在帖子A被标记为帖子B的重复项、而帖子C又被标记为A的重复项的情况。我们对所有此类重复关系进行传递性分组(即A、B和C将属于同一组),最终得到195K个重复组。大多数组包含两个帖子,但热门帖子可能有超过100个重复项。为了创建StaQC基准数据集,我们筛选出同时满足以下条件的组:1) 包含作为StaQC片段来源的帖子(以确保有与查询匹配的片段);2) 包含未用于StaQC数据的帖子(用于选取查询)。最终得到5,497个重复组,并从每组中提取一个查询及其匹配结果。基准数据集被平均划分为验证集和测试集。对于验证集,我们还过滤掉与CoNaLa和SO-DS测试集重叠的所有查询,以避免在调整模型超参数时出现过拟合。
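这种传递式分组可以用并查集 (union-find) 实现,下面是一个示意草图(假设输入 `duplicate_pairs` 是 (帖子A, 帖子B) 形式的重复关系列表):

```python
def group_duplicates(duplicate_pairs):
    """用并查集把 A~B、C~A 这类重复关系传递地合并成组。"""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # 路径压缩
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in duplicate_pairs:
        union(a, b)

    groups = {}
    for post in parent:
        groups.setdefault(find(post), set()).add(post)
    return list(groups.values())
```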

c) SO-DS: The CoNaLa and StaQC benchmarks have different merits: the CoNaLa snippets and descriptions are curated, but the collection is small and the queries in the ground truth were not created independently from the descriptions. The StaQC benchmark, on the other hand, has a large snippet collection with realistic ground truth queries that were written independently from the snippet descriptions, but despite StaQC’s advanced mining methodology the snippet collection contains more noise. We therefore collect a third snippet collection, SO-DS, where we control the quality by only mining from Stack Overflow posts with a large number of upvotes. More specifically, we mine snippets from the most upvoted Stack Overflow posts that are labeled with “python” and one or more tags related to data science libraries such as “tensorflow”, “matplotlib” and “beautifulsoup”.

c) SO-DS: CoNaLa和StaQC基准各有优势:CoNaLa的代码片段和描述经过人工整理,但集合规模较小,且基准真值中的查询并非独立于描述创建。StaQC基准则拥有大规模片段集合,其基准真值查询由用户独立于片段描述撰写,较为真实;但尽管StaQC采用了先进的挖掘方法,其片段集合仍包含较多噪声。为此,我们收集了第三个片段集合SO-DS,通过仅从高赞Stack Overflow帖子中挖掘来控制质量。具体而言,我们从标记为"python"并附带"tensorflow"、"matplotlib"、"beautifulsoup"等数据科学库相关标签的最高赞帖子中挖掘代码片段。

We iterate over each tag’s posts in decreasing order of their upvote scores. For each post, we extract a maximum of two snippets from the post answers, prioritizing answers with more votes. A snippet is extracted by concatenating all the parseable code blocks from a post answer. We opted not to automatically select the most relevant lines of code with a machine learning model, as was done for StaQC, because incorrect predictions would introduce noise. Note that, unlike CoNaLa and StaQC, this dataset is specifically intended for code search, not for code generation or code summarization, where a one-to-one alignment between description and source code is important. Similar to the StaQC benchmark, we use the post title as the snippet description and apply the same cleaning pipeline: we remove prompts, filter out snippets that cannot be parsed, and use simple heuristics to reformulate questions as descriptions. When we have extracted 250 snippets or run out of posts with the correct tag, we move on to the next tag. As a final step, we filter out duplicate snippets, which occur because some Stack Overflow posts with multiple tags were mined more than once. The resulting snippet collection consists of 12,137 snippets and 7,674 unique descriptions. Table II displays a few examples from the collection.

我们按每个标签下帖子的投票分数降序进行遍历。对于每个帖子,我们最多从其回答中提取两个代码片段,优先选择投票数更多的回答。代码片段通过拼接回答中所有可解析的代码块生成。我们没有像StaQC那样采用机器学习模型自动选择最相关的代码行,因为错误的预测会引入噪声。需要注意的是,与CoNaLa和StaQC不同,该数据集专门用于代码搜索而非代码生成或代码摘要,因此不需要严格保持描述与源代码的一一对应关系。与StaQC基准类似,我们将帖子标题作为片段描述,并采用相同的清洗流程:移除提示语、过滤无法解析的片段,使用简单启发式方法将问题改写为描述。当提取到250个片段或该标签下无更多合格帖子时,转至下一个标签。最后过滤重复片段(因部分含多标签的Stack Overflow帖子被重复采集)。最终获得的片段集合包含12,137个代码片段和7,674条唯一描述。表II展示了该集合的部分示例。
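整个挖掘循环大致如下(`fetch_posts` 等接口为假设性示意,`parses` 与 `question_to_description` 复用前文草图):

```python
def mine_so_ds(tags, per_tag=250, max_snippets_per_post=2):
    """SO-DS 挖掘流程示意:按标签依投票降序遍历帖子,每帖最多抽取两个片段。"""
    collection = []
    for tag in tags:
        mined = 0
        for post in fetch_posts(tag, order_by="votes"):  # fetch_posts 为假设的接口
            answers = sorted(post["answers"], key=lambda a: -a["votes"])
            for answer in answers[:max_snippets_per_post]:
                # 片段 = 该回答中所有可解析代码块的拼接
                code = "\n".join(b for b in answer["code_blocks"] if parses(b))
                if code:
                    collection.append({"description": question_to_description(post["title"]),
                                       "code": code})
                    mined += 1
            if mined >= per_tag:
                break
    # 去重:带多个标签的帖子可能被挖掘多次
    seen = set()
    return [s for s in collection if not (s["code"] in seen or seen.add(s["code"]))]
```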

The ground truth is collected analogously to StaQC-py by creating queries from Stack Overflow duplicate posts. This results in 2,225 annotated queries, which are evenly split in a validation and test set. The validation set is again filtered to ensure that no queries overlap with the CoNaLa and StaQC-py test sets.

真实标注数据采用与StaQC-py类似的方式收集,通过Stack Overflow重复帖生成查询语句。最终获得2,225条标注查询,平均划分为验证集和测试集。验证集经过二次过滤以确保与CoNaLa及StaQC-py测试集无查询重叠。

d) Benchmark statistics: Table I reports summary statistics on the three benchmarks. The CoNaLa snippets are mostly one-liners and its descriptions are on average slightly longer than those in the other collections. The SO-DS snippets are the longest, which is to be expected since the SO-DS mining procedure does not attempt to automatically extract the most relevant lines of code in Stack Overflow posts, as was done for StaQC. Concerning the different ground truth sets, we observe that most of the CoNaLa queries only have one matching snippet, whereas SO-DS and StaQC on average have 1.7 and 3.4 matching snippets. To assess the difficulty of each benchmark, Figure 4 shows histograms of the relative word overlap between the queries and the descriptions of matching snippets. The three benchmarks have an

d) 基准测试统计:表1报告了三个基准测试的汇总统计数据。CoNaLa代码片段多为单行代码,其描述平均略长于其他数据集。SO-DS代码片段最长,这符合预期,因为SO-DS挖掘过程不像StaQC那样尝试自动提取Stack Overflow帖子中最相关的代码行。关于不同的真实数据集,我们观察到大多数CoNaLa查询只有一个匹配代码片段,而SO-DS和StaQC平均分别有1.7和3.4个匹配片段。为评估各基准测试难度,图4展示了查询与匹配片段描述之间相对词汇重叠的直方图。这三个基准测试具有...

| 基准测试 | #代码片段 | #描述 token 数 | 代码长度 (#代码行) | #查询数 | #代码片段/查询 |
| --- | --- | --- | --- | --- | --- |
| CoNaLa | 2,777 | 10.32 | 1.07 | 762 | 1.17 |
| StaQC-py | 203,700 | 8.36 | 9.84 | 2,749 | 3.40 |
| SO-DS | 12,137 | 7.35 | 14.98 | 1,113 | 1.70 |

(第2-4列为代码片段集合统计,第5-6列为测试集统计。)

TABLE I: Overview of the Python annotated code search (PACS) benchmarks.

表 1: Python语言标注代码搜索(PACS)基准概述


Fig. 4: Histograms of the relative word overlap between the queries and the matching snippet descriptions in the PACS test sets.

图 4: PACS测试集中查询与匹配片段描述之间相对单词重叠的直方图。

average relative word overlap between 0.28 and 0.29, and there are queries with a relative overlap of more than 0.6.

平均相对词重叠率在0.28到0.29之间,部分查询的相对重叠率甚至超过0.6。

B. Training data

B. 训练数据

Because it is infeasible to construct a code search ground truth dataset that is large enough for model training, we resort to different proxy datasets. We train query-code similarity models (i.e., the code-only models that will be presented in Section V-B ) on the description-code snippet pairs in the respective collections. To train the query-description similarity models (i.e., the description-only models that will be

由于构建规模足以支撑模型训练的代码搜索基准真值数据集不可行,我们转而采用不同的代理数据集。我们在各片段集合的描述-代码片段对上训练查询-代码相似度模型(即第V-B节将介绍的纯代码模型)。为训练查询-描述相似度模型(即第V-A节将介绍的纯描述模型),

TABLE II: Examples from the SO-DS snippet collection.

表 2: SO-DS 片段集合示例。

Extracting and saving video frames(提取并保存视频帧)
https://stackoverflow.com/questions/33311153

Log keras output to a file(将 keras 输出记录到文件)

```python
from keras.callbacks import CSVLogger

csv_logger = CSVLogger('log.csv', append=True, separator=';')
model.fit(X_train, Y_train, callbacks=[csv_logger])
```

https://stackoverflow.com/questions/38445982

Plot correlation matrix using pandas(使用 pandas 绘制相关性矩阵)

```python
import matplotlib.pyplot as plt

plt.matshow(dataframe.corr())
plt.show()
```

https://stackoverflow.com/questions/29432629

Regex find all overlapping matches(正则表达式查找所有重叠匹配)
https://stackoverflow.com/questions/5616822

TABLE III: Examples of Stack Overflow post titles that were tagged as duplicates.

表 3: 被标记为重复的 Stack Overflow 帖子标题示例。

| 原始帖子标题 | 重复帖子标题 |
| --- | --- |
| Python语言函数通过引用调用 | 是否可以创建一个修改输入的函数 |
| 使用Javascript抓取/监听AJAX数据? | 在Chrome扩展中拦截AJAX响应 |
| 如何在QML中编写条件导入语句? | QML仅在可用时导入更高版本模块 |
| Boost.Asio SSL线程安全性 | Boost Asio和多线程中OpenSSL的使用 |
| 是否可以停止Javascript执行? | Javascript中的System.exit |

presented in Section V-A), we create two additional datasets. Firstly, we collect all Python post titles on Stack Overflow before February 2020, resulting in a total of 1.30M questions. Secondly, we create pairs of related code descriptions from the Stack Overflow duplicate post records that were not used to compile ground truth datasets and snippet collections. To avoid biasing the dataset toward questions with many duplicates (e.g., How to clone or copy a list has more than 289 duplicates), we only sample one pair per duplicate group. This resulted in 187K pairs of related code descriptions. Table III shows a few examples from the dataset.

我们创建了两个额外数据集。首先,我们收集了2020年2月之前Stack Overflow上所有Python帖子的标题,共计130万个问题。其次,我们从未用于构建基准真值数据集和片段集合的Stack Overflow重复帖子记录中,构建相关联的代码描述对。为避免偏向拥有大量重复项的问题(例如"How to clone or copy a list"有超过289个重复项),我们每个重复组仅采样一对。最终得到18.7万对相关代码描述。表III展示了该数据集的部分示例。

V. MODELS FOR ANNOTATED CODE SEARCH

V. 带标注代码搜索模型

In line with recent work in code search [1, 2, 3, 4, 5], we formulate snippet retrieval as a $k$-nearest neighbor problem: queries and snippets are projected into a shared vector space such that any given query and its matching snippets are represented by similar vectors. The snippet collection $s_1, s_2, \ldots, s_N$ can then be ranked w.r.t. a query $q$ by sorting snippets according to the cosine similarity between their respective vectors $\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_N$ and the query vector $\mathbf{q}$:

与近期代码搜索领域的研究[1, 2, 3, 4, 5]一致,我们将代码片段检索建模为 $k$ 近邻问题:查询语句和代码片段被映射到共享向量空间,使得任意查询与其匹配片段通过相似向量表示。对于查询 $q$,可按各片段向量 $\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_N$ 与查询向量 $\mathbf{q}$ 的余弦相似度,对片段集合 $s_1, s_2, \ldots, s_N$ 进行排序:

$$\cos(\mathbf{s}_i, \mathbf{q}) = \frac{\mathbf{s}_i \cdot \mathbf{q}}{\|\mathbf{s}_i\|\,\|\mathbf{q}\|}$$
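基于该余弦相似度的 $k$ 近邻检索可以直接向量化实现,下面的草图假设所有片段向量已按行堆叠为矩阵 `S`:

```python
import numpy as np

def rank_snippets(S: np.ndarray, q: np.ndarray, k: int = 10):
    """按余弦相似度返回与查询向量 q 最相似的 k 个片段下标(k 近邻检索)。"""
    S_norm = S / np.linalg.norm(S, axis=1, keepdims=True)
    q_norm = q / np.linalg.norm(q)
    sims = S_norm @ q_norm          # 各片段与查询的余弦相似度
    return np.argsort(-sims)[:k]    # 相似度降序的前 k 个下标
```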


Fig. 5: Architectures of query/snippet similarity models presented in Section V: A) depicts the neural bag-of-words model of queries and snippet descriptions; B) depicts the query and snippet description embedding using USE; C) depicts the NCS architecture for embedding queries and actual code; and D) depicts an ensemble of embedding retrieval models.

图 5: 第V节中提出的查询/代码片段相似度模型架构: A) 展示了查询和片段描述的神经词袋模型; B) 展示了使用通用句子编码器(USE)的查询和片段描述嵌入; C) 展示了用于嵌入查询和实际代码的NCS架构; D) 展示了嵌入检索模型的集成方案。

An annotated code snippet $s$ consists of a description $d$ and the actual code snippet $c$. Ideally, we would train a model that jointly embeds $d$ and $c$. However, as mentioned in Section IV-B, we lack sufficient ground truth data to train such a model. Instead, we independently train two embedding models that separately capture the similarity between 1) descriptions and queries (Section V-A); and 2) code fragments and queries (Section V-B). In Section V-C, we explain how these models can be efficiently combined in a single ensemble model.

带注释的代码片段 $s$ 包含描述 $d$ 和实际代码片段 $c$。理想情况下,我们会训练一个联合嵌入 $d$ 和 $c$ 的模型。但如第 IV-B 节所述,我们缺乏足够的基准真值数据来训练此类模型。因此,我们独立训练了两个嵌入模型,分别捕获:1) 描述与查询之间的相似性(第 V-A 节);2) 代码片段与查询之间的相似性(第 V-B 节)。第 V-C 节将阐述如何将这些模型高效组合成单一集成模型。

A. Query-description similarity

A. 查询-描述相似性

In this section, we present two query-description similarity models: Neural bag-of-words, which will be used as a baseline, and the universal sentence encoder.

本节介绍两种查询-描述相似度模型:作为基线的神经词袋模型 (Neural bag-of-words) 和通用语句编码器 (universal sentence encoder)。

Neural bag-of-words (NBOW): Bag-of-words models treat input texts as unordered word sets. While throwing away the sequence order may seem a crude approximation, for information retrieval it results in hard-to-beat baselines. Figure 5 A) illustrates how bag-of-words is applied in a neural setting. We obtain embeddings $\mathbf{q}^{\mathrm{bow}}$ and $\mathbf{d}^{\mathrm{bow}}$ for the query $q$ and description $d$ by summing their respective word embeddings:

神经词袋模型 (NBOW): 词袋模型将输入文本视为无序的单词集合。虽然丢弃序列顺序看似是一种粗糙的近似方法,但在信息检索领域却能产生难以超越的基线性能。图 5 A) 展示了词袋方法在神经网络中的应用方式。我们通过对查询 $q$ 和描述 $d$ 的各自词向量求和,获得其嵌入表示 $\mathbf{q}^{\mathrm{bow}}$ 和 $\mathbf{d}^{\mathrm{bow}}$:

$$\mathbf{q}^{\mathrm{bow}} = \sum_{j=1}^{m} e_1(q_j), \qquad \mathbf{d}^{\mathrm{bow}} = \sum_{j=1}^{n} e_1(d_j)$$

where $q_j$ and $d_j$ denote the $j^{th}$ word in the query/description after pre-processing, and $e_1(\cdot)$ is the word embedding model that maps a description/query word to its embedding.

其中 q_{j}d_{j} 表示预处理后查询/描述中的第 mathrm{j}^{t h} 个词,e_{1}(cdot) 是将描述/查询词映射到其嵌入向量的词嵌入模型。

We pre-process queries and descriptions by applying tokenization, lemmatization, and stopword removal. To learn the word embeddings, we used fastText’s [21] continuous skip-gram model with negative sampling. This model is similar to the skip-gram model from the word2vec toolkit, but in contrast to the latter, fastText also incorporates subword information. The fastText embeddings are trained on the Python Stack Overflow questions dataset (see Section IV-B).

我们通过应用分词 (tokenization)、词形还原 (lemmatization) 和停用词去除对查询和描述进行预处理。为了学习词嵌入,我们使用了 fastText [21] 的连续跳字模型 (continuous skip-gram model) 和负采样 (negative sampling)。该模型类似于 word2vec 工具包中的跳字模型,但与后者不同的是,fastText 还融合了子词信息。fastText 词嵌入是在 Python语言 Stack Overflow 问题数据集上训练的(参见第 IV-B 节)。
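用 fastText 官方 Python 包训练这类跳字嵌入并构造词袋向量的方式大致如下(语料文件名与超参数取值仅作示意):

```python
import numpy as np
import fasttext

# 假设 so_python_titles.txt 每行是一条预处理后的 Stack Overflow Python 帖子标题
model = fasttext.train_unsupervised(
    "so_python_titles.txt",
    model="skipgram",  # 连续跳字模型,默认采用负采样损失
    dim=100,           # 向量维度,取值仅作示意
    minn=3, maxn=6,    # 子词 n-gram 长度范围
)

def nbow_embedding(tokens):
    """神经词袋 (NBOW):对词向量逐项求和,得到查询/描述的向量表示。"""
    return np.sum([model.get_word_vector(t) for t in tokens], axis=0)
```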

Universal sentence encoder (USE): The universal sentence encoder (USE) is a transformer-based transfer learning model that computes a vector for a sequence of words (e.g., a sentence or a paragraph). We leverage USE to compute representations for the queries and snippet descriptions as shown in Figure 5 B): USE uses a Transformer to obtain contextualized representations $\mathbf{h}_1^q, \mathbf{h}_2^q, \ldots, \mathbf{h}_m^q$ for each query token. An embedding for the query is then calculated by summing the Transformer’s outputs $\mathbf{h}_1^q, \mathbf{h}_2^q, \ldots, \mathbf{h}_m^q$. It is worth noting that each representation $\mathbf{h}_i^q$ produced by the Transformer is contextualized on all the query tokens, not solely on the current and preceding tokens $q_1, \ldots, q_i$ as would be the case with RNNs. The embeddings of the descriptions are computed analogously, with the same architecture and model parameters.

通用句子编码器 (USE):通用句子编码器 (USE) 是一种基于Transformer的迁移学习模型,可为单词序列(如句子或段落)计算向量。我们利用USE为查询和片段描述计算表示,如图5 B)所示:USE采用Transformer获取每个查询token的上下文化表示 $\mathbf{h}_1^q, \mathbf{h}_2^q, \ldots, \mathbf{h}_m^q$,随后对这些输出求和得到查询的嵌入向量。值得注意的是,Transformer生成的每个表示 $\mathbf{h}_i^q$ 都以全部查询token为上下文,而非像RNN那样仅依赖当前及之前的token $q_1, \ldots, q_i$。描述文本的嵌入向量采用相同架构和模型参数进行类似计算。

For our experiments, we used a pre-trained model available on TensorFlow Hub that outputs 512-dimensional embeddings and has a total of 147M parameters. The model was pre-trained in a multi-task learning setting with three tasks: 1) skip-thought’s next-sentence prediction [22]; 2) a conversational input-response task [23], where the goal is to assess the relevance of a response to a given input (e.g., a reply to an email or a Stack Overflow post); and 3) natural language inference using the SNLI corpus [24]. The training data for tasks 1 and 2 comes from a variety of sources crawled from the web, including Stack Overflow. This makes USE a good choice for downstream tasks in the software development domain, such as code search.

在我们的实验中,我们使用了TensorFlow Hub上提供的预训练模型,该模型输出512维嵌入向量,总计1.47亿参数。该模型通过多任务学习框架在以下三个任务上进行了预训练:1) skip-thought的下一句预测 [22];2) 对话输入-响应任务 [23],目标是评估给定输入(如电子邮件回复或Stack Overflow帖子)的响应相关性;3) 使用SNLI语料库 [24] 的自然语言推理。任务1和2的训练数据来自网络爬取的各种来源,包括Stack Overflow。这使得USE成为软件开发领域下游任务(如代码搜索)的理想选择。
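加载这一预训练模型并计算查询与描述相似度的方式大致如下(此处以 TF Hub 上的 universal-sentence-encoder-large/5 为例,具体版本以论文实际所用为准):

```python
import numpy as np
import tensorflow_hub as hub

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

query = ["plot a correlation matrix"]
descriptions = ["Plot correlation matrix using pandas",
                "Log keras output to a file"]

q_emb = use(query).numpy()          # 形状 (1, 512)
d_emb = use(descriptions).numpy()   # 形状 (2, 512)

# 归一化后内积即余弦相似度
q_emb /= np.linalg.norm(q_emb, axis=1, keepdims=True)
d_emb /= np.linalg.norm(d_emb, axis=1, keepdims=True)
print(d_emb @ q_emb.T)  # 每条描述与查询的相似度
```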

Although USE’s pre-training ingests textual data from the software domain, without fine-tuning the model does not significantly outperform our baselines (see Section VII). Training USE directly on code search is infeasible due to the lack of annotated data. Therefore, we propose the detection of duplicate Stack Overflow posts as a proxy task. That is, we tune the USE model to distinguish similar Stack Overflow post titles from unrelated title pairs. This task encourages titles of duplicate posts to be mapped to similar vectors, while vectors of random title pairs are adjusted to be more orthogonal. As such, the USE embeddings are tailored to semantic similarity computation in the software domain.

尽管USE的预训练吸收了软件领域的文本数据,但在未经微调的情况下,该模型并未显著超越我们的基线(参见第VII节)。由于缺乏标注数据,直接在代码搜索上训练USE不可行。因此,我们提出将重复Stack Overflow帖子检测作为代理任务。即通过调整USE模型来区分相似的Stack Overflow帖子标题与无关标题对。该任务促使重复帖子标题被映射为相似向量,同时随机标题对的向量被调整为更正交的状态。由此,USE嵌入向量被专门适配于软件领域的语义相似度计算。

A dataset of duplicate title pairs was created as described in Section IV. For every “positive” instance, we construct five negative examples by sampling random title pairs (for which we verify that they were not tagged as each other’s duplicate). Given a title pair $(t_1, t_2)$, we model the probability that $t_1$ and $t_2$ are duplicates by first computing the cosine similarity of their USE embeddings and clipping negative similarities to zero with the rectified linear unit (ReLU) activation function, see Equation 1. Next, the result is fed to a layer with weight parameter $w$, bias $b$, and the logistic activation function $\sigma$, see Equation 2. We fine-tune $w$, $b$, and the parameters $\boldsymbol{\theta}$ of USE to minimize binary cross-entropy. The inclusion of the ReLU activation avoids that the training objective pushes the cosine similarity between embeddings of unrelated titles to $-1$, as this corresponds to a strong negative correlation between the embeddings.

按照第IV节所述方法创建了一个重复标题对数据集。对于每个"正例"样本,我们通过随机采样标题对构建五个负例样本(并验证这些标题对未被标记为彼此重复)。给定标题对 $(t_1, t_2)$,我们建模 $t_1$ 与 $t_2$ 为重复的概率:首先计算二者USE嵌入的余弦相似度,并用修正线性单元 (ReLU) 激活函数将负相似度截断为零(见公式1);随后将结果输入一个具有权重参数 $w$、偏置 $b$ 和逻辑激活函数 $\sigma$ 的层(见公式2)。我们微调 $w$、$b$ 以及USE的参数 $\boldsymbol{\theta}$ 以最小化二元交叉熵。引入ReLU激活可避免训练目标将无关标题嵌入间的余弦相似度推向 $-1$,因为那对应于嵌入之间的强负相关。

$$sim(t_1, t_2) = \mathrm{ReLU}\big(\cos(USE(t_1; \boldsymbol{\theta}),\, USE(t_2; \boldsymbol{\theta}))\big) \tag{1}$$

$$P(duplicate = 1 \mid t_1, t_2) = \sigma\big(sim(t_1, t_2)\, w + b\big) \tag{2}$$

When $w$ and $b$ are initialized randomly, we found that the model sometimes degenerates to labeling every pair as unrelated. This issue is avoided by initializing $w$ and $b$ such that, initially, all pairs with a high cosine similarity are considered duplicates with high probability. For our experiments, we initialized $w$ and $b$ with $15$ and $-5$ respectively.

w,b 随机初始化时,我们发现模型有时会退化为将所有样本对标记为不相关。通过初始化 w,b 使得初始状态下所有高余弦相似度样本对都有较高概率被判定为重复,可以避免该问题。实验中我们分别将 wb 初始化为15和 -5


Fig. 6: Illustration of how sentences are created from description-code snippet pairs to serve as input for training fastText embeddings. If we only append the code tokens to the description words (top right) as is done in the original NCS model, hist and histogram appear far apart. By creating the other contexts (middle and bottom right) they appear in closer proximity, causing fastText to push their embeddings closer in the vector space.

图 6: 展示如何从描述-代码片段对生成句子作为fastText嵌入训练的输入。若仅按原始NCS模型做法将代码token直接附加到描述词汇后(右上),hist和histogram在向量空间中相距甚远。通过构建其他上下文(右中和右下),二者距离被拉近,促使fastText使其嵌入向量更接近。

B. Query-code similarity

B. 查询-代码相似度

To compute the similarity between query and code, we experimented with models whose designs are based on two recently published (unannotated) code search models: NCS [2] and UNIF [3].

为了计算查询与代码之间的相似度,我们基于两种近期发布的(未标注)代码搜索模型NCS [2]和UNIF [3]进行了模型实验。

NCS: The architecture of the neural code search (NCS) model [2] is shown in Figure 5 C). Like the neural bag-of-words model introduced in the previous section, the query representation is computed as a sum of the query word embeddings: $\mathbf{q}^{ncs} = \sum_{j=1}^{m} e_2(q_j)$. Similarly, an embedding for the code is computed by summing the code token embeddings weighted with their respective IDF scores: $\mathbf{c}^{ncs} = \sum_{j=1}^{n} idf(c_j)\, e_2(c_j)$, where $e_2(\cdot)$ is the embedding model that maps code tokens and query words to their respective embeddings.

NCS: 神经代码搜索 (NCS) 模型 [2] 的架构如图 5 C) 所示。与上一节介绍的神经词袋模型类似,查询表示通过查询词嵌入求和得到:$\mathbf{q}^{ncs} = \sum_{j=1}^{m} e_2(q_j)$。类似地,代码的嵌入通过以各自 IDF 分数加权的代码 token 嵌入求和得到:$\mathbf{c}^{ncs} = \sum_{j=1}^{n} idf(c_j)\, e_2(c_j)$。其中 $e_2(\cdot)$ 是将代码 token 和查询词映射到各自嵌入的嵌入模型。

The embedding model $e_2(\cdot)$ is trained with fastText.
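IDF 加权的代码嵌入可以这样示意(IDF 统计方式按常见定义 $\log(N/df)$ 给出,论文未必使用完全相同的公式;`embed` 可复用前文的 fastText 词向量):

```python
import math
import numpy as np
from collections import Counter

def build_idf(tokenized_snippets):
    """idf(t) = log(N / df(t)),其中 df(t) 为包含 token t 的片段数。"""
    n = len(tokenized_snippets)
    df = Counter(t for tokens in tokenized_snippets for t in set(tokens))
    return {t: math.log(n / c) for t, c in df.items()}

def ncs_code_embedding(code_tokens, idf, embed):
    """NCS 式代码向量:token 嵌入按各自 IDF 分数加权求和。"""
    return np.sum([idf.get(t, 0.0) * embed(t) for t in code_tokens], axis=0)
```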