Evaluating Large Language Models Trained on Code
评估训练于代码的大语言模型
Abstract
摘要
1. Introduction
1. 引言
We introduce Codex, a GPT language model finetuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correct- ness for synthesizing programs from docstrings, our model solves $28.8%$ of the problems, while GPT-3 solves $0%$ and GPT-J solves $11.4%$ . Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve $70.2%$ of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.
我们介绍 Codex,一个在 GitHub 上公开可用的代码上微调的 GPT 语言模型,并研究其编写 Python 代码的能力。Codex 的一个独特生产版本为 GitHub Copilot 提供支持。在 HumanEval 上,我们发布了一个新的评估集,用于测量从文档字符串合成程序的功能正确性,我们的模型解决了 28.8% 的问题,而 GPT-3 解决了 0% 和 GPT-J 解决了 11.4% 。此外,我们发现从模型中重复采样是产生解决困难提示的有效工作解决方案的惊人策略。使用这种方法,我们通过每题 100 次采样解决了 70.2% 的问题。对我们的模型进行仔细研究揭示了其局限性,包括难以处理描述长操作链的文档字符串以及将操作绑定到变量的困难。最后,我们讨论了部署强大的代码生成技术可能带来的更广泛影响,涵盖了安全、安保和经济。
Figure 1:
图 1:
Table 1:
表 1:
(注意:原文中没有 Figure 或 Table 相关内容,因此这里仅展示格式示例)
Scalable sequence prediction models (Graves, 2014; Vaswani et al., 2017; Child et al., 2019) have become a general-purpose method for generation and representation learning in many domains, including natural language processing (Mikolov et al., 2013; Sutskever et al., 2014; Dai & Le, 2015; Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018), computer vision (Van Oord et al., 2016; Menick & Kalchbrenner, 2018; Chen et al., 2020; Bao et al., 2021), audio and speech processing (Oord et al., 2016; 2018; Dhariwal et al., 2020; Baevski et al., 2020), biology (Alley et al., 2019; Rives et al., 2021), and even across multiple modalities (Das et al., 2017; Lu et al., 2019; Ramesh et al., 2021; Zellers et al., 2021). More recently, language models have also fueled progress towards the longstanding challenge of program synthesis (Simon, 1963; Manna & Waldinger, 1971), spurred by the presence of code in large datasets (Husain et al., 2019; Gao et al., 2020) and the resulting programming capabilities of language models trained on these datasets (Wang & Komatsu zak i, 2021). Popular language modeling objectives like masked language modeling (Devlin et al., 2018) and span prediction (Raffel et al., 2020) have also been adapted to train their programming counterparts CodeBERT (Feng et al., 2020) and PyMT5 (Clement et al., 2020).
可扩展的序列预测模型 (Graves, 2014; Vaswani 等, 2017; Child 等, 2019) 已成为许多领域中生成和表示学习的通用方法,包括自然语言处理 (Mikolov 等, 2013; Sutskever 等, 2014; Dai & Le, 2015; Peters 等, 2018; Radford 等, 2018; Devlin 等, 2018),计算机视觉 (Van Oord 等, 2016; Menick & Kalchbrenner, 2018; Chen 等, 2020; Bao 等, 2021),音频和语音处理 (Oord 等, 2016; 2018; Dhariwal 等, 2020; Baevski 等, 2020),生物学 (Alley 等, 2019; Rives 等, 2021),甚至跨多个模态 (Das 等, 2017; Lu 等, 2019; Ramesh 等, 2021; Zellers 等, 2021)。最近,大语言模型还推动了 longstanding 挑战——程序合成 (Simon, 1963; Manna & Waldinger, 1971) 的进展,这得益于大型数据集中代码的存在 (Husain 等, 2019; Gao 等, 2020) 和在这些数据集上训练的大语言模型所具备的编程能力 (Wang & Komatsu zak i, 2021)。流行的建模目标如掩码语言建模 (Devlin 等, 2018) 和片段预测 (Raffel 等, 2020) 也被改编为训练其编程对应物 CodeBERT (Feng 等, 2020) 和 PyMT5 (Clement 等, 2020)。
Similarly, our early investigation of GPT-3 (Brown et al., 2020) revealed that it could generate simple programs from Python docstrings. While rudimentary, this capability was exciting because GPT-3 was not explicitly trained for code generation. Given the considerable success of large language models in other modalities and the abundance of publicly available code, we hypothesized that a specialized GPT model, called Codex, could excel at a variety of coding tasks. This paper describes several early Codex models, whose descendants power GitHub Copilot and the Codex models in the OpenAI API.
同样,我们对 GPT-3 (Brown 等,2020) 的早期研究揭示了它可以从 Python语言 docstrings 生成简单程序的能力。尽管这一能力还比较初级,但令人兴奋的是,GPT-3 并没有专门训练代码生成任务。鉴于大语言模型在其他模态上取得的显著成功以及大量公开可用的代码,我们假设一个名为 Codex 的专用 GPT 模型可以在各种编程任务中表现出色。本文描述了几个早期的 Codex 模型,这些模型的后继者为 GitHub Copilot 和 OpenAI API 中的 Codex 模型提供支持。

Codex and Codex-S Performance Figure 1. Pass rates of our models on the HumanEval dataset as a function of model size. When a single sample is generated for each problem, GPT-12B solves no problems, but Codex (fine-tuned on code) solves $28.8%$ of the problems, and Codex-S (further fine-tuned on correctly implemented standalone functions) solves $37.7%$ of the problems. From here, further gains can be realized by generating 100 samples per problem and selecting the sample with the highest mean log-probability $(44.5%$ solved) or by selecting the sample that passes the unit tests $.77.5%$ solved). All samples are generated with temperature 0.8.

Codex 和 Codex-S 性能 图 1. 我们的模型在 HumanEval 数据集上的通过率与模型大小的关系。当为每个问题生成一个样本时,GPT-12B 没有解决问题,但 Codex (在代码上微调) 解决了 $28.8%$ 的问题,而 Codex-S (进一步在正确实现的独立函数上微调) 解决了 $37.7%$ 的问题。从这里开始,通过为每个问题生成 100 个样本并选择平均对数概率最高的样本 $(44.5%$ 解决),或选择通过单元测试的样本 $(77.5%$ 解决),可以进一步提高性能。所有样本均以温度 0.8 生成。
In this work, we focus on the task of generating standalone Python functions from docstrings, and evaluate the correctness of code samples automatically through unit tests. This is in contrast to natural language generation, where samples are typically evaluated by heuristics or by human evaluators. To accurately benchmark our model, we create a dataset of 164 original programming problems with unit tests. These problems assess language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. We release this data along with an evaluation framework at https://www.github.com/openai/human-eval.
在本工作中,我们专注于从文档字符串生成独立的 Python语言 函数的任务,并通过单元测试自动评估代码样本的正确性。这与自然语言生成不同,后者通常通过启发式方法或人工评估员来评估样本。为了准确评估我们的模型,我们创建了一个包含 164 个原始编程问题的数据集,每个问题都配有单元测试。这些问题评估语言理解、算法和简单数学,其中一些类似于简单的软件面试题。我们发布了这个数据集以及评估框架,网址为 https://www.github.com/openai/human-eval。
To solve a problem in our test set, we generate multiple samples from the models, and check if any of them pass the unit tests. With just a single sample, a 12B parameter Codex solves $28.8%$ of these problems, and a 300M parameter Codex solves $13.2%$ of these problems. In contrast, the 6B parameter GPT-J (Wang & Komatsu zak i, 2021) achieves $11.4%$ on the same dataset, while all GPT models achieve near $0%$ . To improve our model’s performance at the task of function synthesis from docstrings, we fine-tune Codex on standalone, correctly implemented functions. The resulting model, Codex-S, solves $37.7%$ of problems with a single sample. Figure 2 showcases problems of varying difficulty in our dataset, along with correct model generated solutions.
为了解决我们测试集中的问题,我们从模型生成多个样本,并检查其中是否有任何样本通过单元测试。仅使用单个样本时,12B 参数的 Codex 解决了 $28.8%$ 的问题,而 300M 参数的 Codex 解决了 $13.2%$ 的问题。相比之下,6B 参数的 GPT-J (Wang & Komatsu zak i, 2021) 在同一数据集上达到了 $11.4%$ ,而所有 GPT 模型的表现接近 $0%$ 。为了提高模型在从文档字符串合成函数任务中的性能,我们在独立且正确实现的函数上微调 Codex。由此产生的模型 Codex-S 使用单个样本解决了 $37.7%$ 的问题。图 2 展示了我们数据集中不同难度的问题以及模型生成的正确解决方案。
Real-world programming tasks often involve iterations of approaches and bug fixes, which is approximated by generating many samples from our models and selecting one that passes all unit tests. Within 100 samples, Codex-S is able to generate at least one correct function for $77.5%$ of the problems. This result suggests that accurate code samples can be selected via heuristic ranking instead of fully evaluating each sample, the latter of which may not be possible or practical in deployment. Indeed, we find that the sample with highest mean log-probability passes unit tests for $44.5%$ of the problems.
现实世界的编程任务通常涉及方法的迭代和 bug 修复,这可以通过从我们的模型生成许多样本并选择一个通过所有单元测试的样本来进行近似。在 100 个样本中,Codex-S 能够为 $77.5%$ 的问题生成至少一个正确的函数。这一结果表明,可以通过启发式排名选择准确的代码样本,而无需完全评估每个样本,后者在部署时可能不可行或不切实际。实际上,我们发现具有最高平均对数概率的样本通过了 $44.5%$ 问题的单元测试。
We conclude by discussing the limitations and potential broader impacts of these Codex models and of increasingly powerful code generating models more generally.
我们通过讨论这些 Codex 模型以及更通用的 increasingly powerful 代码生成模型的局限性和潜在更广泛的影响来总结。
2. Evaluation Framework
2. 评估框架
In this section, we discuss the details of our evaluation framework. We begin by defining the $p a s s@k$ metric, and explain its advantages over standard match-based metrics. Next, we describe the dataset of hand-written problems, called “HumanEval,” which we created in order to benchmark our models. Finally, we discuss the sandbox environment we used to safely execute model-generated code.
在本节中,我们讨论评估框架的详细信息。我们首先定义 $p a s s@k$ 指标,并解释其相对于标准匹配指标的优势。接下来,我们描述了名为“HumanEval”的手写问题数据集,该数据集是我们为了基准测试模型而创建的。最后,我们讨论了用于安全执行模型生成代码的沙盒环境。
2.1. Functional Correctness
2.1. 功能正确性
Generative models for code are predominantly benchmarked by matching samples against a reference solution, where the match can be exact or fuzzy (as in BLEU score). However, recent work has surfaced deficiencies in match-based metrics for code. For instance, Ren et al. (2020) finds that BLEU has problems capturing semantic features specific to code, and suggests several semantic modifications to the score.
生成式模型(Generative models)对于代码的评估主要通过将样本与参考解决方案进行匹配,这种匹配可以是精确的或模糊的(如 BLEU 分数)。然而,最近的研究揭示了基于匹配的代码评估指标存在缺陷。例如,Ren 等人 (2020) 发现 BLEU 在捕捉代码特有的语义特征方面存在问题,并建议对分数进行若干语义修改 [20]。
More fundamentally, match-based metrics are unable to account for the large and complex space of programs functionally equivalent to a reference solution. As a consequence, recent works in unsupervised code translation (Lachaux et al., 2020) and pseudocode-to-code translation (Kulal et al., 2019) have turned to functional correctness instead, where a sample is considered correct if it passes a set of unit tests. We argue that this metric should be applied to docstringconditional code generation as well.
从根本上讲,基于匹配的度量标准无法考虑与参考解决方案功能等价的程序的大而复杂的空间。因此,最近关于无监督代码翻译 (Lachaux et al., 2020) 和伪代码到代码翻译 (Kulal et al., 2019) 的工作转向了功能正确性,其中如果样本通过了一组单元测试,则被认为是正确的。我们认为这个度量标准也应该应用于文档字符串条件下的代码生成。
Perhaps the most convincing reason to evaluate functional correctness is that it is used by human developers to judge code. A framework known as test-driven development dictates that software requirements be converted into test cases before any implementation begins, and success is defined by a program that passes these tests. While few organizations employ full test-driven development, integration of new code is usually dependent on creating and passing unit tests.
或许最有说服力的理由是评估功能正确性是因为它被人类开发者用来判断代码。一种称为测试驱动开发 (test-driven development) 的框架规定,在任何实现开始之前,软件需求应转换为测试用例,并且成功是由通过这些测试的程序定义的。虽然很少有组织采用完整的测试驱动开发,但新代码的集成通常依赖于创建并通过单元测试。
Kulal et al. (2019) evaluate functional correctness using the pass $@k$ metric, where $k$ code samples are generated per problem, a problem is considered solved if any sample
库拉尔等人 (2019) 使用 pass @$k$@ 指标评估功能正确性,其中每个问题生成 $k$ 个代码样本,如果任何样本通过测试,则认为该问题已解决。


图 1: 模型架构示例
在本节中,我们将介绍生成式 AI (Generative AI) 的最新进展。生成式 AI 是指能够创建新内容的 AI 技术,例如文本、图像和音乐。这些模型通常基于 Transformer 架构,并使用大量的 Token 进行训练。大语言模型 (LLM) 在处理自然语言任务方面表现出色,尤其是在零样本和少样本学习场景下。
我们还探讨了 AI智能体在不同环境中的应用,以及它们如何通过与环境交互来提高性能。通用人工智能 (AGI) 是一个长期目标,旨在开发出能够在各种任务上达到或超过人类水平的 AI 系统。
表 1: 不同模型的性能对比
| 模型名称 | 参数量 | 训练数据集 | 性能指标 |
|---|---|---|---|
| Model A | 1B | Dataset X | 85% |
| Model B | 10B | Dataset Y | 90% |
以上表格展示了几个典型模型的参数量、训练数据集及性能指标。可以看出,随着模型规模的增大,其性能也有所提升。然而,更大的模型并不总是意味着更好的结果,还需要考虑计算资源和实际应用场景等因素。
Figure 2. Three example problems from the HumanEval dataset, where the probabilities that a single sample from Codex-12B passes unit tests are 0.9, 0.17, and 0.005. The prompt provided to the model is shown with a white background, and a successful model-generated completion is shown in a yellow background. Though not a guarantee for problem novelty, all problems were hand-written and not pro grammatically copied from existing sources. Random problems and samples can be found in Appendix B.
图 2. 来自 HumanEval 数据集的三个示例问题,其中 Codex-12B 单个样本通过单元测试的概率分别为 0.9、0.17 和 0.005。提供给模型的提示显示在白色背景上,成功的模型生成完成显示在黄色背景上。虽然不能保证问题的新颖性,但所有问题都是手工编写的,并非从现有来源程序化复制而来。随机问题和样本可以在附录 B 中找到。
passes the unit tests, and the total fraction of problems solved is reported. However, computing pass $@k$ in this way can have high variance. Instead, to evaluate pass $@k$ , we generate $n;\geq;k$ samples per task (in this paper, we use $n,=,200$ and $k\le100]$ ), count the number of correct samples $c,\leq,n$ which pass unit tests, and calculate the unbiased estimator
通过单元测试,并报告解决问题的总比例。然而,以这种方式计算通过率 $@k$ 可能会有较高的方差。相反,为了评估通过率 $@k$ ,我们为每个任务生成 $n;\geq;k$ 个样本(在本文中,我们使用 $n,=,200$ 和 $k\le100$),统计通过单元测试的正确样本数 $c,\leq,n$,并计算无偏估计量
ef pass (1 C k) """ :param n: total number of samples :param c: number of correct samples :param k: k in pass@\$k\$ """ if $\mathrm{n-~}\mathrm{c}<\mathrm{k}$ : return 1.0 return 1.0 - np.prod(1.0 - k / np.arange $(\mathrm{n\textsubscript~{-}}\mathrm{{c\textsubscript+}~}1$ , n + 1))
如果 \$\mathrm{~n~-~}\mathrm{~c~}<\mathrm{~k~}\$ :
返回 1.0
返回 1.0 - np.prod(1.0 - k / np.arange \$(\mathrm{~n~\textsubscript~{~-~}~}\mathrm{~c~\textsubscript~{~+~}~}1\$ , n + 1))
""" :param n: 总样本数 :param c: 正确样本数 :param k: pass@\\$k\\$ 中的 k """
$$
\mathrm{pass}\ @k:=\underset{\mathrm{Problems}}{\mathbb{E}}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right]
$$
$$
\mathrm{通过} @k := \underset{\mathrm{问题}}{\mathbb{E}}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right]
$$
Figure 3. A numerically stable script for calculating an unbiased estimate of pass $@k$ .
图 3. 一个数值稳定的脚本,用于计算无偏估计的通过率 $@k$ 。
Calculating this estimator directly results in very large numbers and numerical instability. In Figure 3, we include a numerically stable numpy implementation that simplifies the expression and evaluates the product term-by-term. One may be tempted to estimate pass $@k$ with $1-(1!-!\hat{p})^{k}$ where $\hat{p}$ is the empirical estimate of pass $@1$ , but we show that it is biased in Appendix A.
直接计算这个估计器会导致非常大的数值和数值不稳定。在图 3 中,我们包含了一个数值稳定的 numpy 实现,该实现简化了表达式并逐项评估乘积项。有人可能会尝试用 $1-(1!-!\hat{p})^{k}$ 来估计 pass $@k$ ,其中 $\hat{p}$ 是 pass $@1$ 的经验估计值,但我们证明了它存在偏差,详见附录 A。
Later, we provide evidence that BLEU score may not be a reliable indicator of functional correctness by showing that functionally in equivalent programs generated by our model (which are guaranteed to disagree with the reference solution on some input) often have higher BLEU scores than functionally equivalent ones.
后来,我们提供了证据表明 BLEU 分数可能不是功能正确性的可靠指标。通过展示我们模型生成的功能不等价程序(这些程序在某些输入上必定与参考解决方案不一致)通常具有比功能等价程序更高的 BLEU 分数,证明了这一点。
2.2. HumanEval: Hand-Written Evaluation Set
2.2. HumanEval: 手写评估集
We evaluate functional correctness on a set of 164 handwritten programming problems, which we call the HumanEval dataset. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem. It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources. For example, there are more than ten public repositories containing solutions to Codeforces problems, which make up part of the recently proposed APPS dataset (Hendrycks et al., 2021).
我们在一组 164 个手写编程问题上评估功能正确性,我们称之为 HumanEval 数据集。每个问题包括函数签名、文档字符串、函数体和若干单元测试,平均每题有 7.7 个测试。这些任务必须是手写的,这一点很重要,因为我们的模型是在 GitHub 的大量代码上训练的,而这些代码中已经包含了来自各种来源的问题解决方案。例如,有超过十个公共仓库包含 Codeforces 问题的解决方案,这些构成了最近提出的 APPS 数据集 (Hendrycks et al., 2021) 的一部分。
Programming tasks in the HumanEval dataset assess language comprehension, reasoning, algorithms, and simple mathematics. We release the HumanEval dataset so that others can evaluate functional correctness and measure the problem-solving capabilities of their models. The dataset can be found at https://www.github.com/openai/human-eval.
HumanEval 数据集中的编程任务评估语言理解、推理、算法和简单数学。我们发布了 HumanEval 数据集,以便其他人可以评估功能正确性并衡量其模型的解决问题的能力。该数据集可以在 https://www.github.com/openai/human-eval 获取。
2.3. Sandbox for Executing Generated Programs
2.3. 执行生成程序的沙盒 (Sandbox for Executing Generated Programs)
Since publicly available programs have unknown intent and generated programs are often incorrect, executing these programs poses a security risk. Indeed, GitHub is known to contain malicious programs that alter or change their environments (Rokon et al., 2020).
由于公开可用的程序意图不明,且生成的程序经常存在错误,执行这些程序会带来安全风险。确实,GitHub 已知包含恶意程序,这些程序会更改其环境 (Rokon et al., 2020)。
Therefore, we developed a sandbox environment to safely run untrusted programs against unit tests. Our goals were to prevent these programs from modifying, gaining persistence on, accessing sensitive resources on, or ex filtrating data from a host or network. Since OpenAI’s training infrastructure is built on Kubernetes and cloud services, we designed our sandbox to address the limitations of these environments while remaining idiomatic with their patterns of use.
因此,我们开发了一个沙盒环境,以安全地运行针对单元测试的不可信程序。我们的目标是防止这些程序修改、在主机或网络上获得持久性、访问敏感资源或外泄数据。由于 OpenAI 的训练基础设施建立在 Kubernetes 和云服务之上,我们设计的沙盒旨在解决这些环境的限制,同时保持与其使用模式的一致性。
We selected the gVisor container runtime (Lacasse, 2018) as the main host protection component. Since container runtimes like Docker can share host resources with containers, a malicious container could potentially compromise a host. gVisor protects the host by emulating its resources to introduce a security boundary between the host and its containers. Network-adjacent hosts and services are protected by eBPF-based firewall rules that prevent inbound and outbound connections except for those required for experiment control.
我们选择了 gVisor 容器运行时 (Lacasse, 2018) 作为主要的主机保护组件。由于像 Docker 这样的容器运行时可以与容器共享主机资源,恶意容器可能会危及主机的安全。gVisor 通过模拟主机资源,在主机和容器之间引入了一个安全边界来保护主机。网络相邻的主机和服务则通过基于 eBPF 的防火墙规则进行保护,这些规则防止了除实验控制所需的连接之外的所有入站和出站连接。
3. Code Fine-Tuning
3. 代码微调 (Code Fine-Tuning)
We fine-tune GPT models containing up to 12B parameters on code to produce Codex. In contrast with GPT, Codex displays non-trivial performance on the HumanEval dataset. In fact, Codex is able to solve the majority of the problems in HumanEval if we generate and evaluate 100 samples per problem, and pick one that passes unit tests. When limited to a budget of one evaluation per problem, producing multiple samples with Codex and choosing the one with the highest mean log-probability provides significant gains.
我们对包含多达 12B 参数的 GPT 模型进行微调以生成 Codex。与 GPT 不同,Codex 在 HumanEval 数据集上表现出非平凡的性能。实际上,如果为每个问题生成并评估 100 个样本,并选择一个通过单元测试的样本,Codex 能够解决 HumanEval 中的大部分问题。当限制为每个问题一次评估时,使用 Codex 生成多个样本并选择具有最高平均对数概率的样本可以显著提高性能。
3.1. Data Collection
3.1. 数据收集
Our training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB. We filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters. After filtering, our final dataset totaled 159 GB.
我们的训练数据集于 2020 年 5 月从 GitHub 上托管的 5400 万个公共软件仓库中收集,包含 179 GB 的唯一且小于 1 MB 的 Python 文件。我们过滤掉了可能是自动生成的文件、平均行长度超过 100 的文件、最大行长度超过 1000 的文件或包含少量字母数字字符的文件。过滤后,我们的最终数据集总计 159 GB。
3.2. Methods
3.2. 方法
Since Codex is evaluated on natural language prompts, we hypothesized that it would be beneficial to fine-tune from the GPT-3 (Brown et al., 2020) model family, which already contains strong natural language representations. Surprisingly, we did not observe improvements when starting from a pre-trained language model, possibly because the finetuning dataset is so large. Nevertheless, models fine-tuned from GPT converge more quickly, so we apply this strategy for all subsequent experiments.
由于 Codex 是在自然语言提示上进行评估的,我们假设从已经包含强大自然语言表示的 GPT-3 (Brown 等,2020) 模型家族进行微调会更有利。令人惊讶的是,当我们从预训练语言模型开始时,并没有观察到改进,这可能是因为微调数据集非常大。尽管如此,从 GPT 微调的模型收敛速度更快,因此我们在所有后续实验中都采用这种策略。
We train Codex using the same learning rate as the corresponding GPT model, with a 175 step linear warmup and cosine learning rate decay. We train for a total of 100 billion tokens, using the Adam optimizer with $\beta_{1}=0.9$ , $\beta_{2}=0.95$ , $\epsilon=10^{-8}$ , and a weight decay coefficient of 0.1.
我们使用与相应 GPT 模型相同的 学习率 训练 Codex,采用 175 步线性预热和余弦学习率衰减。总共训练 1000 亿个 Token,使用 Adam 优化器,参数为:$\beta_1=0.9$ , $\beta_2=0.95$ , $\epsilon=10^{-8}$ ,权重衰减系数为 0.1。
In order to maximally leverage text representations from GPT, we base our code lexer on the GPT-3 text tokenizer. Since the distribution of words in GitHub code differs from that of natural text, this tokenizer is not very effective for representing code. The largest source of inefficiency arises from encoding whitespace, so we add an additional set of tokens for representing whitespace runs of different lengths. This allows us to represent code using approximately $30%$ fewer tokens.
为了最大限度地利用 GPT 的文本表示,我们基于 GPT-3 文本分词器构建了代码分词器。由于 GitHub 代码中的单词分布与自然文本中的单词分布不同,这个分词器在表示代码方面效果不佳。最大的低效来源是编码空白字符,因此我们添加了一组额外的 Token 来表示不同长度的空白字符序列。这使我们可以使用大约 30% 更少的 Token 来表示代码。
To compute pass $@k$ , we assemble each HumanEval problem into a prompt consisting of a header, a signature, and a docstring, which is illustrated in Figure 2. We sample tokens from Codex until we encounter one of the following stop sequences: ‘\nclass’, ‘\ndef’, ‘\n#’, ‘\nif’, or ‘\nprint’, since the model will continue generating additional functions or statements otherwise. We use nucleus sampling (Holtzman et al., 2020) with top $p=0.95$ for all sampling evaluation in this work.
为了计算通过率 $@k$ ,我们将每个 HumanEval 问题组装成一个提示,该提示由标题、签名和文档字符串组成,如图 2 所示。我们从 Codex 中采样 Token,直到遇到以下停止序列之一:‘\nclass’,‘\ndef’,‘\n#’,‘\nif’ 或 ‘\nprint’,因为否则模型将继续生成额外的函数或语句。我们在所有采样评估中使用核采样 (Holtzman et al., 2020) 并设置顶部 $p=0.95$ 。
3.3. Results
3.3. 结果
In Figure 4, we plot test loss on a held-out validation set against Codex model size. We find that just as language model test loss follows a power law in model size (Kaplan et al., 2020), test loss after code fine-tuning follows a similar power law with functional form $\scriptstyle\Big(\frac{N}{5.92\times10^{7}}\Big)^{-0.13}$ where $N$ is the number of non-embedding parameters in the model.
图 4: 我们在保留的验证集上绘制 Codex 模型大小与测试损失的关系。我们发现,正如语言模型的测试损失随模型大小呈幂律分布 (Kaplan et al., 2020),代码微调后的测试损失也遵循类似的幂律,其函数形式为 $\scriptstyle\Big(\frac{N}{5.92\times10^{7}}\Big)^{-0.13}$,其中 $N$ 是模型中非嵌入参数的数量。

Figure 4. Model cross-entropy test loss measured on a held-out split of our Python GitHub code corpus. The smooth power law scaling of performance with model size observed in GPT-3 appears to hold even after code fine-tuning.

图 4. 在我们 Python语言 GitHub 代码语料库的保留分割上测量的模型交叉熵测试损失。在 GPT-3 中观察到的性能与模型大小的平滑幂律缩放关系在代码微调后仍然成立。
When evaluating pass $@k$ , it is important to optimize sampling temperature for the particular value of $k$ . In Figure 5, we plot pass $@k$ against the number of samples $k$ and the sampling temperature. We find that higher temperatures are optimal for larger $k$ , because the resulting set of samples has higher diversity, and the metric rewards only whether the model generates any correct solution.
在评估 pass $@k$ 时,优化特定值的 $k$ 的采样温度非常重要。在图 5 中,我们绘制了 pass $@k$ 与样本数量 $k$ 和采样温度的关系图。我们发现,对于较大的 $k$ ,较高的温度是最佳的,因为生成的样本集具有更高的多样性,而该指标只奖励模型是否生成了任何正确解。
In particular, for a 679M parameter model, the optimal tem- perature for pass $@1$ is $T^{}=0.2$ and the optimal temperature for pass $@100$ is $T^{}=0.8$ . With these temperatures, we find that pass $@1$ and pass $@100$ scale smoothly as a function of model size (Figure 6).
特别是,对于一个 679M 参数的模型,$@1$ 的最优温度是 $T^{}=0.2$ ,而 $@100$ 的最优温度是 $T^{}=0.8$ 。通过这些温度,我们发现 $@1$ 和 $@100$ 随着模型大小的变化平滑缩放(图 6)。
Pass $@k$ can also be interpreted as the result of evaluating the best out of $k$ samples, where the best sample is picked by an oracle with prior knowledge of the unit tests. From a practical perspective, we are also interested in the setting where we must select a single sample from $k$ samples without having access to an oracle. For instance, when the model is used as an autocomplete tool where a user provides a prompt, we do not have unit tests, but would like to return only a single completion to the user for evaluation so as to not overwhelm them.
传递 $@k$ 也可以被解释为从 $k$ 个样本中评估出最佳结果,其中最佳样本由具有单元测试先验知识的 oracle 选择。从实际角度来看,我们也对必须在没有访问 oracle 的情况下从 $k$ 个样本中选择一个样本的设置感兴趣。例如,当模型用作自动完成功能时,用户提供一个提示,我们没有单元测试,但希望只返回一个完成结果给用户进行评估,以免让他们感到不知所措。
Inspired by similar work in language modeling, we find that choosing the sample with the highest mean token log probability outperforms evaluating a random sample, while choosing the sample based on sum log probability can perform slightly worse than picking randomly. Figure 7 demonstrates the benefits of applying these heuristics to samples (at temperature 0.8) from Codex-12B.
受类似的语言建模工作的启发,我们发现选择具有最高平均 Token 对数概率的样本比评估随机样本表现更好,而基于总对数概率选择样本的表现可能会略逊于随机选择。图 7: 展示了将这些启发式方法应用于 Codex-12B 的样本(在温度为 0.8 时)的好处。

Figure 5. In the top panel, we plot pass $@k$ against the number of samples $(k)$ for various temperature settings. Higher temperatures are better when the number of samples is large, likely due to the increased sample diversity. In the bottom panel, we plot the best temperature setting for each $k$ , obtained by taking the upper hull of the top panel.

图 5. 在上图中,我们绘制了通过率 $@k$ 与样本数量 $(k)$ 的关系图,针对不同的温度设置。较高的温度在样本数量较大时表现更好,这可能是由于样本多样性增加所致。在下图中,我们绘制了每个 $k$ 的最佳温度设置,这是通过对上图取上包络线获得的。

Pass Rate vs Model Size Figure 6. Using the optimal temperatures 0.2 and 0.8 for pass $@1$ and pass $@100$ , we plot these two metrics as a function of model size. Performance appears to scale smoothly as a sigmoid in logparameters.
通过率与模型大小的关系 图 6. 使用通过 $@1$ 和通过 $@100$ 的最优温度 0.2 和 0.8,我们将这两个指标作为模型大小的函数进行绘制。性能表现似乎随着对数参数以 S 型曲线平稳增长。

Figure 7. Model performance in the setting where we can generate multiple samples, but only evaluate one. We can do better than randomly selecting a sample by choosing the solution with the highest mean log-probability (red) or with the highest back-translation score (orange) described in Sec. 5. The blue line represents the theoretical best performance obtained using an oracle with prior knowledge of the unit tests.

图 7. 在我们可以生成多个样本但只评估一个的设置下,模型的表现。我们可以通过选择具有最高平均对数概率(红色)或最高反向翻译得分(橙色)的样本,来获得比随机选择更好的结果,这在第 5 节中有描述。蓝色线条表示使用具有单元测试先验知识的神谕所获得的理论最佳表现。
Finally, we compute BLEU scores for all Codex-12B HumanEval samples (at temperature 0.8) against their reference solutions. For each problem, when we plot the distributions of BLEU scores for correct and incorrect solutions, we notice significant overlap (Figure 8). Since an incorrect solution is guaranteed to be functionally in equivalent to the reference solution, we conclude that improvements in BLEU score may not indicate improved rates of functional correctness in practice.
最后,我们计算所有 Codex-12B HumanEval 样本 (在温度 0.8 下) 的 BLEU 分数,并将其与参考解决方案进行对比。对于每个问题,当我们绘制正确和错误解决方案的 BLEU 分数分布时,我们注意到显著的重叠 (图 8)。由于错误的解决方案在功能上必定与参考解决方案不等价,我们得出结论:BLEU 分数的提高在实际中可能并不意味着功能正确性的改善。
3.4. Comparative Analysis of Related Models and Systems
3.4. 相关模型和系统的比较分析
Two recent works similar in spirit to Codex are GPT-Neo (Black et al., 2021) and GPT-J (Wang & Komatsu zak i, 2021), which are trained on The Pile (Gao et al., 2020), a dataset containing text from a variety of sources as well as $8%$ GitHub code. The broader research community has found that these models outperform existing GPT systems in qualitative programming evaluations (Woolf, 2021).
两项最近与 Codex 精神相似的工作是 GPT-Neo (Black et al., 2021) 和 GPT-J (Wang & Komatsu zak i, 2021),它们是在 The Pile (Gao et al., 2020) 上训练的,该数据集包含来自各种来源的文本以及 $8%$ 的 GitHub 代码。更广泛的研究社区发现,这些模型在定性编程评估中优于现有的 GPT 系统 (Woolf, 2021)。
We confirm these findings using the HumanEval dataset, showing that GPT-Neo achieves $6.4%$ pass $@1$ and $21.3%$ pass $@100$ , while GPT models of comparable sizes achieve near $0%$ on both metrics. We see a remarkable progression in capabilities, with GPT-Neo-2.7B roughly equivalent to Codex-85M ( $30\times$ fewer parameters). Similarly, GPT-J-6B achieves $11.6%$ pass $@1$ and $27.7%$ pass $@100$ , which is roughly equivalent to Codex-300M ( $20\times$ fewer parameters). Pass rates are obtained by taking the best result from evaluating at temperatures 0.2, 0.4, and 0.8 for GPT-Neo, and from temperatures 0.2 and 0.8 for GPT-J. Detailed results across multiple model sizes can be found in Table 1.
我们使用 HumanEval 数据集确认了这些发现,显示 GPT-Neo 达到 6.4% pass @1 和 21.3% pass @100,而相同规模的 GPT 模型在这两个指标上接近 0%。我们观察到能力有显著进步,GPT-Neo-2.7B 大约相当于 Codex-85M (参数量少 30倍)。同样,GPT-J-6B 达到 11.6% pass @1 和 27.7% pass @100,大约相当于 Codex-300M (参数量少 20倍)。通过在温度为 0.2、0.4 和 0.8 下评估 GPT-Neo,以及在温度为 0.2 和 0.8 下评估 GPT-J,获得最佳结果的通过率。多个模型大小的详细结果见表 1。

Figure 8. BLEU score probability densities for correct (blue) and wrong (green) solutions from Codex-12B for 4 random tasks from HumanEval. Note that the distributions are not cleanly separable, suggesting that optimizing for BLEU score is not equivalent to optimizing for functional correctness.
图 8. Codex-12B 在 HumanEval 的 4 个随机任务中正确(蓝色)和错误(绿色)解决方案的 BLEU 分数概率密度。注意,这些分布并不能清晰地分离,这表明优化 BLEU 分数并不等同于优化功能正确性。
Finally, we benchmark Codex against the largest free model from Tabnine, a leading code autocomplete system, which achieves $2.6%$ pass $@1$ (at $T=0.4)$ ) and $7.6%$ pass $@100$ (at $T=0.8)$ ). This is roughly equivalent to Codex-12M, one of the smallest models in our suite.
最后,我们将 Codex 与来自 Tabnine 的最大免费模型进行基准测试,Tabnine 是领先的代码自动补全系统,该模型在 $T=0.4$ 时达到 $2.6%$ pass $@1$ ,在 $T=0.8$ 时达到 $7.6%$ pass $@100$ 。这大致相当于我们模型库中最小的模型之一 Codex-12M。
3.5. Results on the APPS Dataset
3.5. APPS 数据集上的结果
Recently, Hendrycks et al. (2021) introduced the APPS dataset to measure the coding challenge competence of language models. The APPS dataset consists of 5000 training and 5000 test examples of coding problems, each with a set of unit tests and, for the training data, a set of correct solutions. Most of the APPS tests problems are not formulated as single-function synthesis tasks, but rather as full-program synthesis, reading input from stdin and printing output to stdout, in contrast to the main Codex training data.
最近,Hendrycks 等人 (2021) 引入了 APPS 数据集来衡量语言模型的编程挑战能力。APPS 数据集包含 5000 个训练样本和 5000 个测试样本的编程问题,每个问题都附带一组单元测试,以及对于训练数据而言的一组正确解决方案。大多数 APPS 测试问题并不是以单函数合成任务的形式提出的,而是作为完整的程序合成,从 stdin 读取输入并打印输出到 stdout,这与主要的 Codex 训练数据不同。
In the paper that introduces APPS, the authors benchmark a few language models and report two metrics: the percentage of problems where the model finds a correct solution (called the “strict accuracy”) and the percentage of unit tests passed, even if the solution is incorrect. The latter measure is reported only so as to reduce variance of the measurements, because the results on the first metric were so low. We avoid this metric and only focus on “strict accuracy”, and - as in the previous sections - we report pass $@k$ numbers for various $k$ (Table 2). There are 2 additional factors, well-known from coding competitions, that we take into account:
在介绍 APPS 的论文中,作者对几个大语言模型进行了基准测试,并报告了两个指标:模型找到正确解决方案的问题百分比(称为“严格准确率”)和即使解决方案不正确也通过的单元测试百分比。后一个指标仅用于减少测量结果的方差,因为第一个指标的结果非常低。我们避免使用这个指标,只关注“严格准确率”,并如前几节所述,报告不同 $k$ 的 pass $@k$ 数字(表 2)。我们还考虑了两个来自编程竞赛中众所周知的因素:
表 2:
| 模型 | 严格准确率 |
|---|---|
| 模型 A | X% |
| 模型 B | Y% |
有两个来自编程竞赛中众所周知的因素,我们将其纳入考量:
Table 1. Codex, GPT-Neo, & TabNine evaluations for HumanEval. We find that GPT-J pass $@1$ is between Codex-85M and Codex300M performance.
表 1. Codex、GPT-Neo 和 TabNine 在 HumanEval 上的评估。我们发现 GPT-J 的通过率 $@1$ 位于 Codex-85M 和 Codex-300M 性能之间。
| PASS@k | k=1 | k=10 k =100 |
|---|---|---|
| GPT-NEO125M | 0.75% | 1.88% 2.97% |
| GPT-NEO1.3B | 4.79% 7.47% | 16.30% |
| GPT-NEO2.7B | 6.41% 11.27% | 21.37% |
| GPT-J 6B | 11.62% 15.74% | 27.74% |
| TABNINE | 2.58% | 4.35% 7.59% |
| CODEX-12M | 2.00% | 3.62% 8.58% |
| CODEX-25M | 3.21% 7.1% | 12.89% |
| CODEX-42M | 5.06% 8.8% | 15.55% |
| CODEX-85M | 8.22% 12.81% | 22.4% |
| CODEX-300M | 13.17% 20.37% | 36.27% |
| CODEX-679M | 16.22% 25.7% | 40.95% |
| CODEX-2.5B | 21.36% 35.42% | 59.5% |
| CODEX-12B | 28.81% 46.81% | 72.31% |
• In coding competitions and in the APPS datasets, tasks are provided with 3 input/output examples included in the task description. We utilize this by sampling 1000 solutions from the model and filtering out only those that pass these 3 unit tests (if such solutions exist). We then calculate pass rates in this filtered set, and call it filtered pass $@k$ . Results without filtering are presented as raw pass $@k$ .
• 在编程竞赛和 APPS 数据集中,任务提供了包含在任务描述中的 3 个输入/输出示例。我们通过从模型中采样 1000 个解决方案,并仅筛选通过这 3 个单元测试的解决方案(如果存在这样的解决方案)来利用这些示例。然后我们在这个过滤后的集合中计算通过率,并称其为过滤后通过率 $@k$ 。未经过滤的结果则表示为原始通过率 $@k$ 。
• It is often the case both in coding competitions and in the results from Codex that a correct solution is found, but it is not algorithmic ally efficient enough to be considered passing. While this is not acceptable in the competitions, we also report the number of solutions that Codex produces that do not fail on any unit test, but that do time-out on some of them. We use a timeout of 3 seconds in our evaluation.
• 在编程竞赛和 Codex 的结果中,常常会找到正确的解决方案,但这些方案在算法效率上不足以被视为通过。虽然在竞赛中这是不可接受的,但我们还报告了 Codex 生成的那些在任何单元测试中都没有失败、但在某些测试中会超时的解决方案数量。我们在评估中使用了 3 秒的超时时间。
To compensate for the fact the Codex is not fine-tuned on APPS, we append a single input/output example from the task description to the docstring as a formatting hint. We denote this setting as “1-shot” in Table 2, and find that Codex12B evaluated 1-shot achieves comparable performance to a GPT-Neo model fine-tuned on APPS. Consistent with our earlier findings, there are large benefits from generating and evaluating as many as 1000 samples per task, though for more difficult problems, solutions are often not efficient enough to pass the time limits. Finally, evaluating the first sample which passes the 3 public unit tests for each problem yields higher performance than raw pass $@100$ samples.
为了弥补 Codex 没有在 APPS 上进行微调的事实,我们在文档字符串中附加了一个来自任务描述的输入/输出示例作为格式提示。我们将这种设置标记为表 2 中的“1-shot”,并发现 Codex12B 在 1-shot 设置下评估时的表现与在 APPS 上微调的 GPT-Neo 模型相当。与我们之前的发现一致,生成和评估每个任务多达 1000 个样本可以带来显著的好处,但对于更困难的问题,解决方案通常不够高效,无法通过时间限制。最后,评估每个问题中第一个通过 3 个公共单元测试的样本比原始通过 $@100$ 样本的表现更好。
4. Supervised Fine-Tuning
4. 监督微调
In addition to standalone functions, Python code found on GitHub contains class implementations, configuration files, scripts, and even files used to store data. This code is seemingly unrelated to synthesizing functions from docstrings, and we hypothesize that the distribution mismatch reduces HumanEval performance.
除了独立函数外,GitHub 上的 Python语言 代码还包含类实现、配置文件、脚本,甚至用于存储数据的文件。这些代码看似与从文档字符串合成函数无关,我们假设这种分布不匹配会降低 HumanEval 性能。
In order to adapt Codex to the distribution of the task of interest, we construct a set of training problems from correctly implemented standalone functions, and use them for additional supervised fine-tuning. We describe two approaches for collecting these examples: from competitive programming websites and from repositories with continuous integration. We call the supervised fine-tuned models Codex-S, and show that they produce consistent gains across model size.
为了使 Codex 适应感兴趣任务的分布,我们从正确实现的独立函数构建了一组训练问题,并用它们进行额外的监督微调。我们描述了收集这些示例的两种方法:来自竞争编程网站和来自具有持续集成的代码库。我们将监督微调后的模型称为 Codex-S,并展示了它们在不同模型规模上都能产生一致的改进。
4.1. Problems from Competitive Programming
4.1. 竞赛编程中的问题
Programming contest and interview preparation websites use hidden unit tests to automatically judge the functional correctness of submissions. These problems are selfcontained, come with well-written problem statements, and generally have excellent test coverage. Additionally, these problems test algorithmic reasoning over a broad range of core skills and difficulties.
编程竞赛和面试准备网站使用隐藏的单元测试来自动判断提交内容的功能正确性。这些问题自包含,配有编写良好的问题描述,并且通常具有出色的测试覆盖率。此外,这些问题在广泛的核⼼技能和难度范围内测试算法推理能力。
We collected problem statements, function signatures, and solutions from several popular programming contest and interview preparation websites. We then assembled these into programming tasks similar to HumanEval, using the problem description as the docstring. Since complete test suites are often hidden, we created unit tests from examples found in the problem statements, or extracted additional test cases through submitting incorrect solutions. In total, we curated 10,000 problems in this way.
我们从几个流行的编程竞赛和面试准备网站收集了问题陈述、函数签名和解决方案。然后,我们将这些内容组装成类似于 HumanEval 的编程任务,使用问题描述作为文档字符串。由于完整的测试套件通常被隐藏,我们根据问题陈述中的示例创建了单元测试,或者通过提交错误的解决方案提取了额外的测试用例。总共,我们以这种方式策划了 10,000 个问题。
4.2. Problems from Continuous Integration
4.2. 持续集成中的问题
Next, we curated programming problems from open source projects. Taking advantage of sys.setprofile, we were able to trace and collect inputs and outputs for all functions called during integration tests. This data could then be used to create unit tests for the functions.
接下来,我们从开源项目中整理了编程问题。利用 sys.setprofile,我们能够跟踪并收集集成测试期间调用的所有函数的输入和输出。这些数据可以用于为函数创建单元测试。
Projects that employ continuous integration (CI) are ideal candidates for tracing. We follow the commands in the CI configuration files, which contain build and test commands, to set up the virtual environments, install dependencies, and run integration tests.
采用持续集成 (CI) 的项目是跟踪的理想候选对象。我们遵循 CI 配置文件中的命令,这些命令包含构建和测试指令,用于设置虚拟环境、安装依赖项并运行集成测试。
We considered GitHub repos using travis and tox as their CI frameworks, as they are two of the most popular CI tools. We additionally used publicly available source code from pip packages found in the python package index (PyPI).
我们考虑了使用 travis 和 tox 作为其 CI 框架的 GitHub 仓库,因为它们是两个最受欢迎的 CI 工具。我们还使用了来自 Python 包索引 (PyPI) 中 pip 包的公开可用源代码。
Table 2. Finetuned GPT-Neo numbers from the APPS paper referenced above. For Codex-12B, the number of passing programs that timeout on some test is in the bracket. We used temperature 0.6 for sampling to cover all $k$ in pass $@k$ , so raw pass $@1$ results could be improved with lower temperature.
表 2. 来自上述 APPS 论文的微调 GPT-Neo 数据。对于 Codex-12B,通过但在某些测试中超时的程序数量在括号内。我们使用温度 0.6 进行采样以覆盖所有 $k$ 在 pass $@k$ 中,因此 raw pass $@1$ 的结果可以通过更低的温度得到改善。
| 入门级 | 面试级 | 竞赛级 | |
|---|---|---|---|
| GPT-NEO2.7B 原始通过率 @1 | 3.90% | 0.57% | 0.00% |
| GPT-NEO2.7B 原始通过率 @5 | 5.50% | 0.80% | 0.00% |
| 1-Shot Codex 原始通过率 @1 | 4.14% (4.33%) | 0.14% (0.30%) | 0.02% (0.03%) |
| 1-Shot Codex 原始通过率 @5 | 9.65% (10.05%) | 0.51% (1.02%) | 0.09% (0.16%) |
| 1-Shot Codex 原始通过率 @100 | 20.20% (21.57%) | 2.04% (3.99%) | 1.05% (1.73%) |
| 1-Shot Codex 原始通过率 @1000 | 25.02% (27.77%) | 3.70% (7.94%) | 3.23% (5.85%) |
| 1-Shot Codex 筛选后通过率 @1 | 22.78% (25.10%) | 2.64% (5.78%) | 3.04% (5.25%) |
| 1-Shot Codex 筛选后通过率 @5 | 24.52% (27.15%) | 3.23% (7.13%) | 3.08% (5.53%) |
Because these projects contained untrusted code, it was important to run integration tests in the sandboxed environment described above.
因为这些项目包含不可信的代码,所以在上述沙盒环境中运行集成测试非常重要。
While there are millions of potential functions to curate problems from, we only collected about 40,000 because not all functions accept inputs and return outputs. Even when they do, most objects captured at runtime cannot be pickled and restored outside the sandbox unless the project was installed.
虽然有数百万个潜在函数可以用来生成问题,但我们只收集了大约 40,000 个,因为并不是所有函数都接受输入并返回输出。即使它们做到了,大多数在运行时捕获的对象除非项目已安装,否则无法在沙箱外部进行序列化和恢复。
Since our tracing methodology produced inputs and outputs for all invoked functions, even builtin and library calls imported by the project were turned into problems. For this reason, functions from tracing tended to be the building blocks of command-line utilities. To excel at these tasks, the model does not need to know advanced algorithms and data structures. Rather, it needs to be able to follow instructions to implement the functionality specified in the docstring. Thus, tracing complements the puzzle nature of coding competition problems and broadens the distribution of tasks.
由于我们的跟踪方法生成了所有调用函数的输入和输出,即使是项目导入的内置和库函数调用也被转化为问题。因此,来自跟踪的函数往往是命令行工具的构建块。为了在这些任务中表现出色,模型不需要掌握高级算法和数据结构。相反,它需要能够按照指示实现文档字符串中指定的功能。因此,跟踪补充了编程竞赛问题的谜题性质,并拓宽了任务的分布。
4.3. Filtering Problems
4.3. 过滤问题
In the previous sections, we presented two methods we used to automatically create training problems. However, it is unclear how to control for quality. Some prompts under specify the function that is implemented, in which case a perfectly valid solution may be wrongly penalized by the unit test. Some problems are stateful, and subsequent executions can result in different outcomes.
在前面的部分中,我们介绍了两种用于自动创建训练问题的方法。然而,如何控制质量尚不清楚。某些提示未能充分指定实现的功能,在这种情况下,完全有效的解决方案可能会被单元测试错误地惩罚。一些问题是状态相关的,后续执行可能导致不同的结果。
To address these issues, we use Codex-12B to generate 100 samples per curated problem. If no samples pass the unit tests, we consider the task to be either ambiguous or too difficult, and filter it out. We reran this verification several times to remove stateful or non-deterministic problems.
为了解决这些问题,我们使用 Codex-12B 为每个精心设计的问题生成 100 个样本。如果没有任何样本通过单元测试,我们认为任务要么是模糊的,要么是太难了,并将其过滤掉。我们多次重新运行此验证以移除有状态或非确定性问题。
4.4. Methods
4.4. 方法
We fine-tune Codex on these training problems to produce a set of “supervised fine-tuned” models, which we call CodexS. To produce examples from training problems, we assemble the problems into the format shown in Figure 2. If there are prompts of varying length in a batch, we left-pad shorter prompts to the length of the longest prompt, so that the first tokens in the reference solutions line up in context.
我们对 Codex 进行微调,以解决这些训练问题,并生成一组“监督微调”模型,我们称之为 CodexS。为了从训练问题中生成示例,我们将问题整理为图 2 所示的格式。如果一批次中有长度不同的提示,我们会用较短的提示左填充到最长提示的长度,以便参考解决方案中的第一个 Token 在上下文中对齐。
图 2:
We train to minimize negative log-likelihood of the reference solution, and mask out loss for any tokens in the prompt. We train using a learning rate $1/10$ as large as used for fine-tuning Codex, but adhere to the same learning rate schedule, and train until validation loss plateaus (less than 10B tokens).
我们训练以最小化参考解决方案的负对数似然,并屏蔽提示中任何 Token 的损失。我们使用的学习率是 Codex 微调时的 $1/10$,但遵循相同的学习率计划,并训练直到验证损失趋于平稳(少于 10B 个 Token)。
4.5. Results
As with Codex, we first compute the optimal temperature for evaluating pass $@k$ for $1\le k\le100$ . We find that Codex-S prefers slightly higher temperatures for all $k>1$ , which possibly reflects the fact that Codex-S captures a narrower distribution than Codex. We use $T^{},=,0$ for computing pass $@1$ and $T^{}=1$ for computing pass $@100$ .
与 Codex 一样,我们首先计算评估通过 $@k$ 的最优温度,对于 $1\le k\le100$ 。我们发现 Codex-S 对所有 $k>1$ 都偏好稍高的温度,这可能反映了 Codex-S 捕捉到的分布比 Codex 更窄。我们使用 $T^{},=,0$ 来计算通过 $@1$ ,并使用 $T^{}=1$ 来计算通过 $@100$ 。
Next, we compare Codex-S against Codex on pass $@1$ and pass $@100$ . Codex-S outperforms the corresponding Codex by an average margin of 6.5 percentage points on pass $@1$ and by a larger average margin of 15.1 percentage points on pass $@100$ across model size.
接下来,我们将 Codex-S 与 Codex 在通过率 $@1$ 和通过率 $@100$ 上进行比较。Codex-S 在通过率 $@1$ 上平均优于对应的 Codex 6.5 个百分点,在通过率 $@100$ 上则以更大的平均优势 15.1 个百分点胜出,这一结果在不同模型尺寸上均成立。
We also plot the performance of different sample selection heuristics for Codex-S-12B against the same heuristics for Codex-12B. When ranking between 1 and 100 samples by mean log probability, the average benefit over random ranking is 11.6 percentage points, which is over 2 percentage points higher than the corresponding benefit for Codex.
我们还绘制了 Codex-S-12B 与 Codex-12B 在不同样本选择启发式方法下的性能对比图。当根据平均对数概率对 1 到 100 个样本进行排名时,相对于随机排名的平均收益为 11.6 个百分点,这比 Codex 相应的收益高出超过 2 个百分点。


Figure 9. Optimal sampling temperatures as a function of the number of samples generated for both Codex and Codex-S. Codex-S generally requires a higher temperature for any particular value of $k$ , possibly to compensate for the fact that it models a narrower distribution.
图 9: Codex 和 Codex-S 的最优采样温度与生成样本数量的关系。Codex-S 在任何特定的 $k$ 值下通常需要更高的温度,这可能是为了补偿其分布较窄的事实。

Codex-S Pass Rate vs Model Size Codex-S Ranking Heuristics

Codex-S 通过率与模型大小 Codex-S 排名启发式方法

Figure 10. Comparing Codex-S against Codex on the metrics proposed in Section 3. Codex-S is one or two orders of magnitude more parameter efficient on pass $@1$ and pass $@100$ , and log-prob sample ranking with Codex-S yields similar benefits over random sampling that Codex does.
图 10. 比较 Codex-S 与 Codex 在第 3 节提出的指标上的表现。Codex-S 在 pass @1 和 pass @100 上的参数效率比 Codex 高一个或两个数量级,且使用 Codex-S 的对数概率样本排序与随机抽样相比具有类似的优点,这与 Codex 的表现相似。
5. Docstring Generation
5. 文档字符串生成 (Docstring Generation)
Generating code from docstrings is possible with Codex because code typically follows after a docstring, but it is not easy to induce Codex to generate docstrings from code. Nevertheless, we are motivated to produce a docstring writing model for safety reasons, as such a model can be used to describe the intent behind generated code. Using the training problems described in the previous section, we can easily create a training dataset for code-conditional docstring generation.
从文档字符串生成代码是可能的,因为代码通常会跟随在文档字符串之后,但诱导 Codex 从代码生成文档字符串并不容易。然而,出于安全原因,我们有动力创建一个文档字符串写作模型,因为这种模型可以用来描述生成代码的意图。使用上一节中描述的训练问题,我们可以轻松创建一个用于代码条件文档字符串生成的训练数据集。
Specifically, for each training problem, we assemble a training example by concatenating the function signature, the reference solution, and then the docstring. Just as we train Codex-S by minimizing negative log-likelihood of the reference solution, we train the docstring generating models Codex-D by minimizing negative log-likelihood of the docstring.
具体来说,对于每个训练问题,我们通过连接函数签名、参考解决方案和文档字符串来组装一个训练示例。就像我们通过最小化参考解决方案的负对数似然来训练 Codex-S 一样,我们通过最小化文档字符串的负对数似然来训练生成文档字符串的模型 Codex-D。
When we benchmark our code generation models, we measure pass $@k$ on the HumanEval dataset, where correctness is defined by passing a set of unit tests. However, there is no similar way to evaluate docstring samples automatically. Therefore, we grade sample docstrings by hand, considering a docstring correct if it uniquely and accurately specifies the code body. Due to the time consuming nature of this process, we only grade 10 samples per problem, for a total of 1640 problems, from Codex-D-12B at temperature 0.8.
当我们评估我们的代码生成模型时,我们测量在 HumanEval 数据集上的通过率 $@k$ ,其中正确性由通过一组单元测试定义。然而,没有类似的方法可以自动评估文档字符串样本。因此,我们手动评估样本文档字符串,认为一个文档字符串是正确的,如果它唯一且准确地指定了代码主体。由于这个过程耗时,我们每个问题只评估 10 个样本,总计 1640 个问题,来自 Codex-D-12B,在温度为 0.8 的情况下。
Codex-D often generates incorrect unit tests along with a docstring, but we ignore these during grading. However, we do not consider the docstring correct when the model simply copies the code body into the docstring. The most common failure modes we observe are when the docstring model leaves out an important detail (such as “an answer must be to two decimal places”) or when it over-conditions on the function name and invents a problem unrelated to the function body.
Codex-D 生成的单元测试通常不正确,并且会附带一个文档字符串,但在评分时我们会忽略这些。然而,当模型简单地将代码主体复制到文档字符串中时,我们不认为该文档字符串是正确的。我们观察到最常见的失败模式是文档字符串模型遗漏了重要细节(例如“答案必须保留两位小数”)或过度依赖函数名称并虚构了一个与函数主体无关的问题。
As shown in Table 3, pass rates for Codex-D are lower but comparable to the corresponding pass rates for Codex-S at the same temperature. We do not have a strong hypothesis for which direction should yield higher pass rates. While generating docstrings may be more forgiving because natural language syntax is less strict than code syntax, docstrings in our dataset may be lower quality because developers tend to devote less time to writing docstrings. Indeed, our model produces docstrings like “I just found this function online” and “This test is not correctly written and it’s not my solution.”
表 3: Codex-D 的通过率较低但与相同温度下的 Codex-S 的相应通过率相当。我们没有关于哪个方向应产生更高通过率的强烈假设。虽然生成文档字符串可能更为宽容,因为自然语言语法比代码语法宽松,但我们数据集中的文档字符串质量可能较低,因为开发人员倾向于花费较少的时间编写文档字符串。确实,我们的模型生成了如 “I just found this function online” 和 “This test is not correctly written and it’s not my solution.” 这样的文档字符串。
Finally, with a docstring model, we have yet another way to choose a single sample from a set of $k$ samples. Instead of picking the sample with the best mean log probability as investigated in the previous two sections, we can choose the sample that maximizes the back-translation ob
最后,使用文档字符串模型,我们有另一种从一组 $k$ 个样本中选择单个样本的方法。不同于前两节研究的挑选具有最佳平均对数概率的样本,我们可以选择使回译 (back-translation) 最大化的样本。
Table 3. Pass rates for our docstring generating model Codex-D, which is evaluated by hand-grading 10 samples per task due to the lack of a ground-truth automatic evaluation. We find similar but lower pass-rates compared to Codex-S.
表 3: 我们的文档字符串生成模型 Codex-D 的通过率,该模型通过人工评分每项任务 10 个样本进行评估,由于缺乏基准自动评估。我们发现与 Codex-S 相比,通过率相似但较低。
| 模型 | PASS @ 1 | PASS @ 10 |
|---|---|---|
| CODEX-S-12B | 32.2% | 59.5% |
| CODEX-D-12B | 20.3% | 46.5% |
jective $P$ (ground truth docstring|generated sample) where $P$ is evaluated using Codex-D. Unfortunately, in Figure 7, we show that ranking samples via back-translation underperforms mean log-probability ranking, though it outperforms random ranking. This heuristic also appears to overfit quickly.
目标 $P$ (原始 docstring | 生成样本)其中 $P$ 使用 Codex-D 进行评估。遗憾的是,在图 7 中,我们显示通过反向翻译对样本进行排序的表现不如平均对数概率排序,尽管它优于随机排序。这种启发式方法似乎也很快过拟合。
6. Limitations
6. 局限性
While Codex is able to sample correct solutions for the majority of HumanEval problems, we find that it has a number of limitations.
虽然 Codex 能够为大多数 HumanEval 问题采样出正确的解决方案,但我们发现它存在一些局限性。
First, Codex is not sample efficient to train. Our training dataset comprises a significant fraction of publicly available Python code on GitHub, totaling hundreds of millions of lines of code. Even seasoned developers do not encounter anywhere near this amount of code over their careers. Indeed, a strong student who completes an introductory computer science course is expected to be able to solve a larger fraction of problems than Codex-12B.
首先,Codex 在训练时样本效率不高。我们的训练数据集包含 GitHub 上公开可用的大量 Python语言 代码,总计数亿行代码。即使是有经验的开发人员在其职业生涯中也不会遇到如此多的代码。事实上,完成入门级计算机科学课程的优秀学生应该能够解决比 Codex-12B 更大比例的问题。
Next, we explore prompts on which Codex is likely to fail or display counter-intuitive behavior. While evaluating code generation is well-studied (Xu et al., 2021; Helmuth & Spector, 2015; Pantridge et al., 2017), many existing metrics measure performance in tightly specified, constrained problem instances (e.g., string manipulation in FlashFill (Gulwani, 2011)). Therefore, we developed a set of qualitative metrics for measuring the capabilities of code generating models while controlling for the complexity and abstraction level of the specifications (Appendix D). Applying this framework, we find that Codex can recommend syntactically incorrect or undefined code, and can invoke functions, variables, and attributes that are undefined or outside the scope of the codebase. Moreover, Codex struggles to parse through increasingly long and higher-level or system-level specifications.
接下来,我们探讨 Codex 可能会失败或表现出反直觉行为的提示。虽然代码生成的评估已经得到了广泛研究 (Xu et al., 2021; Helmuth & Spector, 2015; Pantridge et al., 2017),但许多现有的度量标准是在严格规定、受限的问题实例中衡量性能(例如,在 FlashFill 中的字符串操作 (Gulwani, 2011))。因此,我们开发了一套定性质的度量标准,用于在控制规范的复杂性和抽象级别的情况下测量代码生成模型的能力(附录 D)。应用这一框架,我们发现 Codex 可以推荐语法不正确或未定义的代码,并可以调用未定义或超出代码库范围的函数、变量和属性。此外,Codex 在解析越来越长和更高层次或系统级别的规范时遇到困难。
To concretely illustrate model performance degradation as docstring length increases, we create a dataset of synthetic problems assembled from 13 basic building blocks, each of which modifies an input string in a deterministic way. Example building blocks are “convert the string to lowercase” or “remove every third character from the string” (the full list is described in Appendix C). We find that as the number of chained building blocks in the docstring increases, model performance decreases exponentially. This behavior is uncharacteristic of a human programmer, who should be able to correctly implement a program for a chain of arbitrary length if they can do so for a chain of length two.
为了具体说明随着文档字符串长度增加模型性能的下降,我们创建了一个由 13 个基本构建块组成的人工问题数据集,每个构建块都以确定性方式修改输入字符串。示例构建块包括“将字符串转换为小写”或“从字符串中删除每第三个字符”(完整列表在附录 C 中描述)。我们发现,随着文档字符串中串联的构建块数量增加,模型性能呈指数级下降。这种行为与人类程序员不符,如果人类程序员能够正确实现长度为二的链式程序,那么他们也应该能够正确实现任意长度的链式程序。

Synthetic Pass Rate vs Components Figure 11. Pass rates of Codex-12B samples against the number of chained components in the synthetically generated docstring. With each additional component, pass rate drops by roughly a factor of 2-3.
图 11. Codex-12B 样本的通过率与合成生成的文档字符串中链接组件数量的关系。随着每个额外组件的增加,通过率大约下降 2-3 倍。
Further, just as text-conditional generative models in other modalities (Ramesh et al., 2021) have difficulty with binding attributes to objects, Codex can make mistakes binding operations to variables, especially when the number of operations and variables in the docstring is large. For instance, in the following prompt, Codex-12B does not decrement the variable w and also fails to return the product of all numbers.
进一步,正如其他模态的文本条件生成式模型 (Ramesh et al., 2021) 在将属性绑定到对象时存在困难一样,Codex 在将操作绑定到变量时也会出错,特别是在文档字符串中的操作和变量数量较大时。例如,在以下提示中,Codex-12B 没有对变量 w 进行递减操作,并且未能返回所有数字的乘积。
def do_work(x, y, z, w): """ Add 3 to y, then subtract 4 from both x and w. Return the product of the four numbers. """ ${\mathrm{\smallt~}}={\mathrm{\smally~}}+{\mathrm{\small3~}}$ $\texttt{U}=\texttt{X}-\texttt{4}$ $\mathrm{~\boldmath~v~}=\mathrm{~\boldmath~z~},\star,\mathrm{~\boldmath~w~}$ return v
def do_work(x, y, z, w):
"""将 3 加到 y 上,然后从 x 和 w 中各减去 4。返回这四个数字的乘积。"""
\${\mathrm{~\small~t~}}={\mathrm{~\small~y~}}+{\mathrm{~\small~3~}}\$
\$\texttt{U}=\texttt{X}-\texttt{4}\$
\$\mathrm{~\boldmath~v~}=\mathrm{~\boldmath~z~}\,\star\,\mathrm{~\boldmath~w~}\$
return v
This understanding of Codex’s limited system-level synthesis capabilities helps inform our assessment of the potential hazards of using it in a generative capacity, as well as the broader societal impacts that such systems could have.
对 Codex 的系统级合成能力有限的理解有助于我们评估其在生成式 AI (Generative AI) 能力方面的潜在风险,以及此类系统可能带来的更广泛的社会影响。
7. Broader Impacts and Hazard Analysis
7. 更广泛的影响和风险分析
Codex has the potential to be useful in a range of ways. For example, it could help onboard users to new codebases, reduce context switching for experienced coders, enable non-programmers to write specifications and have Codex draft implementations, and aid in education and exploration. However, Codex also raises significant safety challenges, does not always produce code that is aligned with user intent, and has the potential to be misused.
Codex 有潜力在多个方面发挥作用。例如,它可以帮助用户熟悉新的代码库,减少有经验的程序员的情境切换,使非程序员能够编写规范并让 Codex 草拟实现,以及辅助教育和探索。然而,Codex 也带来了重大的安全挑战,不总是生成符合用户意图的代码,并且存在被滥用的可能。
To better understand some of the hazards of using Codex in a generative capacity, we conducted a hazard analysis focused on identifying risk factors (Leveson, 2019) with the potential to cause harm.1 We outline some of our key findings across several risk areas below.
为了更好地了解在生成式容量中使用 Codex 的一些风险,我们进行了危害分析,重点关注识别可能导致损害的风险因素 (Leveson, 2019)。我们在以下部分概述了几个风险领域中的关键发现。
While some of our findings about the potential societal impacts of code generation systems were informed by work towards responsible deployment of the production-oriented Codex models (which descended from the research-oriented Codex models described in this paper), this section is not intended to provide a full account of any particular product’s safety features. Unless otherwise specified, we anchor our analysis in the specific properties of the models described in this paper. We share this analysis in the belief that some of it generalizes to the broader class of code generation systems, and to encourage a norm of performing detailed impact analysis as part of major machine learning research projects.
虽然我们关于代码生成系统潜在社会影响的一些发现得到了面向生产的 Codex 模型(这些模型源自本文描述的研究导向的 Codex 模型)负责任部署工作的启发,但本节并不打算提供任何特定产品安全功能的完整说明。除非另有说明,我们的分析基于本文描述的模型的具体属性。我们分享这一分析是基于其中部分内容可以推广到更广泛的代码生成系统类,并鼓励在主要的机器学习研究项目中进行详细影响分析的规范。
Note that by focusing largely on risks in this section, we do not mean to imply that we expect the impact of this class of technologies to be net-negative; rather, risks merit particular attention here because they may be subtle or require deliberate effort to address, whereas we expect the benefits to be more obvious and “automatic” from the perspective of most users and affected stakeholders.
请注意,本节主要关注风险,并不意味着我们认为这类技术的影响将是净负面的;相反,这里特别关注风险是因为它们可能较为微妙或需要刻意努力去应对,而我们预计其好处对于大多数用户和受影响的利益相关者来说将更为明显和“自动”。
7.1. Over-reliance
7.1. 过度依赖
One of the key risks associated with using code generation models in practice is over-reliance on generated outputs. Due to the limitations described above as well as alignment issues described below, Codex may suggest solutions that superficially appear correct but do not actually perform the task the user intended. This could particularly affect novice programmers, and could have significant safety implications depending on the context. We discuss a related issue in Appendix G, namely that code generation models can suggest insecure code. For these reasons, human oversight and vigilance is required for safe use of code generation systems like Codex.
使用代码生成模型时面临的一个关键风险是过度依赖生成的输出。由于上述限制以及下面描述的对齐问题,Codex 可能会建议表面上看似正确但实际上并未执行用户意图的任务的解决方案。这可能特别影响初学者程序员,并且根据上下文可能有重要的安全影响。我们在附录 G 中讨论了一个相关问题,即代码生成模型可以建议不安全的代码。因此,为了安全使用像 Codex 这样的代码生成系统,需要人类的监督和警惕。
We note several immediate ways to improve safety in the subsection on risk mitigation below, though over-reliance in particular is one that we believe merits further inquiry in industry and academia. While it is conceptually straightforward to provide documentation to users reminding them about model limitations, empirical investigation is necessary in order to identify how to reliably ensure vigilance in practice across a range of user experience levels, UI designs, and tasks. One challenge researchers should consider is that as capabilities improve, it may become increasingly difficult to guard against “automation bias.”
我们在下面的风险缓解小节中指出了几种立即可以采取的措施来提高安全性,尽管我们认为空泛依赖特别需要在业界和学术界进一步研究。虽然概念上直接向用户提供文档,提醒他们注意模型的局限性是简单的,但需要进行实证研究以确定如何在不同用户经验水平、用户界面设计和任务中可靠地确保警惕性。研究人员应考虑的一个挑战是,随着能力的提高,防止“自动化偏差”可能会变得越来越困难。

Figure 12. When the prompt includes subtle bugs, Codex tends to produce worse code than it is capable of. This persists when the prompt also includes instructions to write correct code. This gap increases with model size.
图 12. 当提示包含细微的错误时,Codex 倾向于生成比其能力更差的代码。即使提示中还包括编写正确代码的指令,这种情况仍然存在。这种差距随着模型规模的增大而增加。
7.2. Misalignment
7.2. 错位
As with other large language models trained on a next-token prediction objective, Codex will generate code that is as similar as possible to its training distribution. One consequence of this is that such models may do things that are unhelpful for the user, despite having the capability to be more helpful (see Figure 12). For example, if the user has some subtle mistakes in their code, Codex may “deliberately” suggest code that superficially appears good but is incorrect.
与训练用于下一个 Token 预测目标的其他大语言模型一样,Codex 生成的代码将尽可能接近其训练分布。这种做法的一个后果是,尽管这些模型有能力提供更大的帮助,但它们可能会做出对用户无益的事情(见图 12)。例如,如果用户的代码中有一些细微的错误,Codex 可能会“故意”建议表面上看起来不错但实际上错误的代码。
图 12:
This is an alignment failure - the model is not aligned with the user’s intentions. Informally, a system is misaligned if there’s some task X that we want it to do, and it is “capable” of doing X but “chooses” not to. In contrast, if a system fails to do X because it does not have the ability to do so, then this system is not misaligned; it is just incompetent. See Appendix E for more detail, including a more precise definition of alignment.
这是一个对齐失败——模型没有与用户的意图对齐。非正式地说,如果有一个任务 X 是我们希望系统完成的,而系统“能够”完成 X 但“选择”不去做,那么这个系统就是未对齐的。相反,如果系统未能完成 X 是因为它没有能力去做,那么这个系统并不是未对齐的;它只是无能的。更多细节请参见附录 E,包括对对齐更精确的定义。
It is important to study misalignment because it is a problem that is likely to become worse, not better, as the capabilities of our systems increase. For example, the model size scaling trend for the example in Figure 12 indicates that misalignment would likely persist and even get worse if data, parameters, and training time were scaled up.
研究错位问题很重要,因为它是一个随着系统能力的提升可能会变得更糟而不是更好的问题。例如,图 12 中的模型规模扩展趋势表明,如果数据、参数和训练时间增加,错位问题可能会持续存在甚至变得更严重。
While we expect that misaligned behaviour like this is unlikely to cause significant harm in current models, it is likely to become more dangerous and harder to eliminate as model capabilities increase. A highly capable but sufficiently misaligned model trained on user approval might produce obfuscated code that looks good to the user even on careful inspection, but in fact does something undesirable or even harmful.
虽然我们预计当前模型中的这种不一致行为不太可能造成重大危害,但随着模型能力的增强,这种行为可能会变得更加危险且更难消除。一个高度有能力但对齐不足的模型如果基于用户认可进行训练,可能会生成表面上看起来很好的混淆代码,即使经过仔细检查也显得无懈可击,但实际上却可能执行不希望的甚至有害的操作。
7.3. Bias and representation
7.3. 偏见和表征
Mirroring what has been found in the case of other language models trained on Internet data (Bender et al., 2021; Blodgett et al., 2020; Abid et al., 2021; Brown et al., 2020), we found that Codex can be prompted in ways that generate racist, de nigra tory, and otherwise harmful outputs as code comments, meriting interventions such as those discussed in the subsection on risk mitigation below. We also found that code generation models raise further bias and representation issues beyond problematic natural language: Codex can generate code with structure that reflects stereotypes about gender, race, emotion, class, the structure of names, and other characteristics. Particularly in the context of users who might over-rely on Codex or use it without first thinking through project design, this issue could have significant safety implications, giving further motivation to discourage over-reliance. We discuss bias and representation issues further in Appendix F. Filtration or modulation of generated outputs, documentation, and other interventions may help to mitigate these risks.
反映了在其他基于互联网数据训练的语言模型中发现的情况 (Bender et al., 2021; Blodgett et al., 2020; Abid et al., 2021; Brown et al., 2020),我们发现 Codex 可以被提示生成带有种族主义、贬低黑人和其他有害内容的代码注释,这需要采取如下面风险缓解小节中讨论的干预措施。我们还发现,代码生成模型引发了超出问题自然语言的进一步偏见和代表性问题:Codex 可以生成反映关于性别、种族、情感、阶层、姓名结构和其他特征的刻板印象的代码结构。特别是在用户可能过度依赖 Codex 或在未先考虑项目设计的情况下使用它时,这个问题可能具有重要的安全影响,进一步激励我们应劝阻过度依赖。我们在附录 F 中进一步讨论了偏见和代表性问题。生成输出的过滤或调节、文档编写和其他干预措施可能有助于缓解这些风险。
7.4. Economic and labor market impacts
7.4. 经济和劳动力市场影响
Code generation and associated capabilities have several possible economic and labor market impacts. While Codex at its current capability level may somewhat reduce the cost of producing software by increasing programmer productivity, the size of this effect may be limited by the fact that engineers don’t spend their full day writing code $\mathrm{{[0^{*}N E T}}$ , 2021). Other important tasks include conferring with colleagues, writing design specifications, and upgrading existing software stacks.2 We also found that Codex imports packages at different rates, which could advantage some package authors over others, particularly if programmers and engineers come to rely on Codex’s suggestions. Over a longer time horizon, the effects of this class of technologies on software-related labor markets and on the economy more generally could be more substantial as capabilities improve. More study is needed both on the effects of code generation capabilities and on appropriate responses. We discuss economic and labor market implications in more detail in Appendix H.
代码生成及其相关能力可能对经济和劳动力市场产生多种影响。虽然 Codex 在当前的能力水平上可以通过提高程序员的生产力来一定程度上降低软件生产的成本,但这种效应的规模可能会受到限制,因为工程师并不会整天都在编写代码 (2021)。其他重要任务包括与同事讨论、编写设计规范以及升级现有的软件栈。我们还发现 Codex 以不同的速率导入包,这可能会使某些包的作者受益,特别是如果程序员和工程师开始依赖 Codex 的建议。从更长远的时间范围来看,随着能力的提升,这类技术对软件相关的劳动力市场以及更广泛的经济的影响可能会更加显著。需要更多的研究来探讨代码生成能力的影响以及适当的应对措施。我们在附录 H 中更详细地讨论了经济和劳动力市场的含义。
7.5. Security implications
7.5. 安全性影响
Codex could have various effects on the security landscape. Because Codex can produce vulnerable or misaligned code,3 qualified operators should review its generations before executing or trusting them, absent appropriate precautions. Future code generation models may be able to be trained to produce more secure code than the average developer, though that is far from certain.
Codex 可能会对安全态势产生各种影响。因为 Codex 可以生成存在漏洞或不正确的代码,有资质的操作员应在执行或信任这些代码之前审查其生成结果,除非采取了适当的预防措施。未来的代码生成模型可能会被训练成能够生成比普通开发人员更安全的代码,尽管这一点还远未确定。
Codex could also be misused to aid cybercrime. Although this is worthy of concern, based on our testing, we believe that at their current level of capability, Codex models do not materially lower the barrier to entry for malware development.4 We expect that more powerful code generation models will lead to future advancements, and therefore further research into mitigation s and continued study of model capabilities are necessary.
Codex 也可能被滥用以辅助网络犯罪。虽然这值得担忧,但根据我们的测试,我们认为在当前的能力水平下,Codex 模型并不会实质性地降低恶意软件开发的门槛。我们预计更强大的代码生成模型将带来未来的进步,因此有必要进一步研究缓解措施并持续研究模型能力。
The non-deterministic nature of systems like Codex could enable more advanced malware. This non-determinism makes it easier to create diverse software that accomplish the same tasks. While software diversity can sometimes aid defenders,5 it presents unique challenges for traditional malware detection and antivirus systems that rely on fingerprinting and signature-matching against previously sampled binaries. For example, a more capable code generation model could conceivably advance techniques for generating polymorphic malware.6 We believe that application security and model deployment strategies including rate-limiting access and abuse monitoring can manage this threat in the near term; however, the efficacy of these mitigation s may scale sub linearly as more capable models are developed.
像 Codex 这样的系统的非确定性特性可能会使更高级的恶意软件成为可能。这种非确定性使得创建能够完成相同任务的多样化软件变得更加容易。虽然软件多样性有时可以帮助防御者 [5],但它为依赖于对先前采样二进制文件进行指纹识别和签名匹配的传统恶意软件检测和防病毒系统带来了独特的挑战。例如,更强大的代码生成模型可以设想出生成多态恶意软件的技术 [6]。我们认为,应用安全性和模型部署策略(包括速率限制访问和滥用监控)可以在短期内管理这一威胁;然而,随着更强大的模型的开发,这些缓解措施的有效性可能会以低于线性的速度扩展。
Similar to large language models, Codex models can learn patterns present in their training data (Carlini et al., 2021). Sensitive data present in source code are liable to be predicted by the model. Because Codex is trained on public repositories, we consider any sensitive data present in the training data to have already been compromised. Similarly, the public data should generally be treated as untrusted, as previous work (Goldblum et al., 2021; Schuster et al., 2020) has found that attackers may be able to corrupt training data to trigger specific model behaviors at runtime. We further discuss security implications in Appendix G.
类似于大语言模型,Codex 模型可以学习其训练数据中存在的模式 (Carlini et al., 2021)。源代码中存在的敏感数据有可能被模型预测出来。由于 Codex 是在公共仓库上进行训练的,我们认为训练数据中任何存在的敏感数据已经被泄露。同样,公共数据通常应被视为不可信的,因为之前的研究 (Goldblum et al., 2021; Schuster et al., 2020) 发现攻击者可能能够篡改训练数据以触发模型在运行时的特定行为。我们进一步在附录 G 中讨论安全影响。
7.6. Environmental impacts
7.6. 环境影响
Codex, like other large generative models, has an energy footprint from both training and inference (Schwartz et al., 2019; Bender et al., 2021; Patterson et al., 2021). The origi- nal training of GPT-3-12B consumed hundreds of petaflop/sdays of compute, while fine-tuning it to create Codex-12B consumed a similar amount of compute. This training was performed on a platform (Azure) that purchases carbon credits and sources significant amounts of renewable energy, reducing its carbon footprint.7 Compute consumption also has costs in the wider supply chain that can be quite concentrated on certain regions.8 Looking more globally and long-term, the compute demands of code generation could grow to be much larger than Codex’s training if significant inference is used to tackle challenging problems.9
Codex,像其他大生成模型一样,从训练到推理都有能源足迹 (Schwartz et al., 2019; Bender et al., 2021; Patterson et al., 2021)。GPT-3-12B 的原始训练消耗了数百个 petaflop/s-days 的计算资源,而将其微调为 Codex-12B 也消耗了类似的计算资源。此训练是在一个购买碳信用并大量使用可再生能源的平台 (Azure) 上进行的,从而减少了其碳足迹。计算资源消耗在更广泛的供应链中也有成本,并且这些成本可能集中在某些地区。从全球和长期来看,如果使用大量的推理来解决复杂问题,代码生成的计算需求可能会远超 Codex 的训练需求。
7.7. Legal implications
7.7. 法律影响
There are several legal considerations related to generated code. To begin with, the training of AI systems on Internet data, such as public GitHub repositories, has previously been identified as an instance of “fair use” (O’Keefe et al., 2019).
有几项与生成代码相关的法律考虑。首先,AI系统在互联网数据(如公共 GitHub 仓库)上的训练之前已被认定为“合理使用” (O’Keefe et al., 2019) 。
Our preliminary research also finds that Codex models rarely generate code that is identical to the contents of training data. Such occurrences were $<0.1%$ in a study examining the frequency of code generations that appear to match code snippets in the training data (Ziegler, 2021). In these rare instances, the generated code consisted of common expressions or conventions within the programming language that appeared over and over again in the training data. We find that, to the extent the generated code appears identical to the training data, it is due to the predictive weightings in the model rather than retention and copying of specific code.
我们的初步研究还发现,Codex 模型很少生成与训练数据内容完全相同的代码。在一项研究中,这种发生的频率为 $<0.1%$ ,该研究考察了生成的代码与训练数据中的代码片段匹配的频率 (Ziegler, 2021)。在这些罕见的情况下,生成的代码由编程语言中的常见表达式或约定组成,这些表达式或约定在训练数据中反复出现。我们发现,就生成的代码与训练数据看起来相同而言,这是由于模型的预测权重,而不是对特定代码的保留和复制。
Generated code is also responsive and customized to the user’s input, and the user retains complete control over editing and acceptance of the generated code. This can make code generation similar to auto-suggest or auto-completion features that exist as features of other tools of authorship (e.g., document editors), in the sense that the finished work is still seen as the author’s.
生成的代码也会根据用户的输入进行响应和定制,用户完全保留对生成代码的编辑和接受控制权。这使得代码生成类似于其他创作工具(例如,文档编辑器)中存在的自动建议或自动完成功能,因为最终的作品仍然被视为作者的作品。
Our commitment to responsible and safe AI includes continued attention to the broader intellectual property implications of code generation systems. We intend to remain engaged with policymakers and experts on these issues so that the users of such systems can ultimately deploy them with confidence.
我们致力于负责任和安全的AI,包括持续关注代码生成系统对更广泛的知识产权的影响。我们打算继续与政策制定者和专家就这些问题进行沟通,以便这些系统的用户最终能够有信心地部署它们。
7.8. Risk mitigation
7.8. 风险缓解
In closing, given the above, models like Codex should be developed, used, and their capabilities explored carefully with an eye towards maximizing their positive social impacts and minimizing intentional or unintentional harms that their use might cause. A contextual approach is critical to effective hazard analysis and mitigation, though a few broad categories of mitigation s are important to consider in any deployment of code generation models.
总之,鉴于上述内容,像 Codex 这样的模型应该被开发、使用,并且其功能应谨慎探索,以最大化其积极的社会影响并最小化其使用可能造成的有意或无意的危害。情境化的处理方法对于有效的风险分析和缓解至关重要,尽管在部署代码生成模型时,考虑一些广泛的缓解类别也很重要。
Careful documentation and user interface design, code review requirements, and/or content controls (e.g., filtering of outputs) may help to reduce harms associated with overreliance as well as offensive content or insecure code generation. In the context of a model made available as a service (e.g., via an API), policies such as user review, use case restrictions, monitoring, and/or rate limiting may also help to reduce harms associated with malicious use or prevent its use in high-stakes domains for which the models are not well suited.
仔细的文档编写和用户界面设计、代码审查要求以及内容控制(例如,输出过滤)可能有助于减少与过度依赖相关的危害以及攻击性内容或不安全的代码生成。在以服务形式提供的模型(例如,通过 API)的背景下,政策如用户审查、使用场景限制、监控和/或速率限制也可能有助于减少与恶意使用相关的危害或防止其在模型不适合的高风险领域中使用。
Appendices E, F, G, and H provide further detail on the risks described in this section and outline additional mitigation and research opportunities.
附录 E、F、G 和 H 提供了本节所述风险的进一步详细信息,并概述了额外的缓解措施和研究机会。
8. Related Work
8. 相关工作
The deep learning resurgence has led to strong advances in the field of program learning. Two popular approaches to neural program learning are program induction and program synthesis.
深度学习的复兴导致了程序学习领域的重大进展。两种流行的神经程序学习方法是程序归纳和程序合成。
In program induction, a model generates program outputs directly from a latent program representation. Learning to Execute (Zaremba & Sutskever, 2014) demonstrated that models could execute simple tasks like addition and memorization. Later attempts at program induction incorporated inductive biases based on modern computing devices, such as the Neural Turing Machine (Graves et al., 2014), memory networks (Weston et al., 2015; Sukhbaatar et al., 2015), the Neural GPU (Kaiser & Sutskever, 2015), and the differentiable neural computer (Graves et al., 2016). More recent approaches like the Neural Program Interpreter (Reed & de Freitas, 2016; Shin et al., 2018; Pierrot et al., 2021) and
在程序归纳中,模型从潜在的程序表示直接生成程序输出。学习执行 (Zaremba & Sutskever, 2014) 证明了模型可以执行加法和记忆等简单任务。后来的程序归纳尝试引入了基于现代计算设备的归纳偏置,例如神经图灵机 (Neural Turing Machine) (Graves et al., 2014),记忆网络 (Weston et al., 2015; Sukhbaatar et al., 2015),神经 GPU (Kaiser & Sutskever, 2015),以及可微分神经计算机 (differentiable neural computer) (Graves et al., 2016)。更近期的方法如神经程序解释器 (Neural Program Interpreter) (Reed & de Freitas, 2016; Shin et al., 2018; Pierrot et al., 2021) 和
Universal Transformer (Dehghani et al., 2019) found recurrence to be a useful component in program induction.
Universal Transformer (Dehghani et al., 2019) 发现递归是程序归纳中有用的组件。
In program synthesis, a model explicitly generates a program, usually from a natural language specification. One of the most popular classical approaches used a probabilistic context free grammar (PCFG) to generate a program’s abstract syntax tree (AST). Maddison & Tarlow (2014) improved on this setup by learning a state vector used to condition child node expansion. Later, Allamanis et al. (2015) applied this idea in text-to-code retrieval and Yin & Neubig (2017) utilized it in text-conditional code generation. Code2seq (Alon et al., 2018) found that ASTs could also be leveraged for code-to-text generation.
在程序合成中,模型显式地生成一个程序,通常是从自然语言规范生成。一种最流行的经典方法使用概率上下文无关语法 (PCFG) 来生成程序的抽象语法树 (AST)。Maddison & Tarlow (2014) 通过学习用于条件子节点扩展的状态向量改进了这种方法。后来,Allamanis 等 (2015) 将这一想法应用于文本到代码检索,Yin & Neubig (2017) 在文本条件代码生成中利用了它。Code2seq (Alon 等, 2018) 发现 AST 也可以用于代码到文本生成。
Programs can also be synthesized without passing through an AST representation. Hindle et al. (2012) investigated n-gram language models of code, finding code to be more predictable than natural language. Latent Predictor Networks (Ling et al., 2016) showed that character-level language models could generate working code for implementing Magic the Gathering cards in an online arena, when aided with a latent mode that allows card attributes to be copied into code. DeepCoder (Balog et al., 2017) trained a model to predict the functions appearing in source code, which could be used to guide program search.
程序也可以在不经过 AST 表示的情况下进行合成。Hindle 等人 (2012) 研究了代码的 n-gram 语言模型,发现代码比自然语言更具可预测性。潜在预测网络 (Ling 等,2016) 表明,在线竞技场中实现 Magic the Gathering 卡牌的字符级语言模型可以在辅助模式的帮助下生成工作代码,该模式允许将卡牌属性复制到代码中。DeepCoder (Balog 等,2017) 训练了一个模型来预测源代码中出现的函数,这些预测可以用于指导程序搜索。
Following the success of large natural language models (Devlin et al., 2018; Radford et al., 2019; Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020) large scale Transformers have also been applied towards program synthesis. CodeBERT (Feng et al., 2020) trained the BERT objective on docstrings paired with functions, and obtained strong results on code search. PyMT5 (Clement et al., 2020) is similar in spirit to our work, and used the T5 objective to train a system which can translate between non-overlapping subsets of {signature, docstring, body}.
在大自然语言模型取得成功之后 (Devlin et al., 2018; Radford et al., 2019; Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020),大规模 Transformer 也被应用于程序合成。CodeBERT (Feng et al., 2020) 在函数与文档字符串对上训练了 BERT 目标,并在代码搜索任务中取得了很好的结果。PyMT5 (Clement et al., 2020) 与我们的工作精神相似,使用 T5 目标训练了一个系统,该系统可以在非重叠的子集 {signature, docstring, body} 之间进行翻译。
We used functional correctness to benchmark our models, and observed improvements on this metric with more sampling. SPoC (Kulal et al., 2019) considered the problem of producing functionally correct code from pseudocode with a fixed budget of compilations, which is similar to our pass $@k$ metric. TransCoder (Lachaux et al., 2020) trained a system to translate between programming languages in an unsupervised manner, and also observed that functional correctness better captured the capabilities of their model than BLEU score. In fact, ContraCode (Jain et al., 2020) leveraged the large space of functionally correct programs to train a contrastive code model, which improved model performance on tasks like type inference. Finally, RobustFill (Devlin et al., 2017) observed that the best way to find a program consistent with input examples was to synthesize multiple samples through beam search.
我们使用功能正确性来评估我们的模型,并观察到随着更多采样,该指标有所改善。SPoC (Kulal et al., 2019) 考虑了在固定编译预算下从伪代码生成功能正确的代码的问题,这与我们的通过率 $@k$ 指标类似。TransCoder (Lachaux et al., 2020) 训练了一个系统以无监督的方式在编程语言之间进行翻译,并且也观察到功能正确性比 BLEU 分数更能捕捉其模型的能力。实际上,ContraCode (Jain et al., 2020) 利用了功能正确程序的巨大空间来训练对比代码模型,从而提高了模型在类型推断等任务上的性能。最后,RobustFill (Devlin et al., 2017) 观察到,找到与输入示例一致的程序的最佳方法是通过束搜索合成多个样本。
Two early domain-specific datasets used to benchmark neural programming systems were FlashFill (Gulwani, 2011; Gulwani et al., 2012) and Hearth stone (Ling et al., 2016), though the community has trended towards broader and more difficult datasets. Barone & Sennrich (2017) proposed a large training and evaluation dataset consisting of Python declarations, docstrings, and bodies scraped from GitHub. The Code Search Net challenge (Husain et al., 2019) built an even larger corpus from GitHub with data from multiple popular programming languages. Recently, CodeXGLUE (Lu et al., 2021) aggregated several programming benchmarks, making use of the recently proposed CodeBLEU metric (Ren et al., 2020). Most relevant to our evaluation work is the APPS (Hendrycks et al., 2021) benchmark for measuring functional correctness based on problems from the competitive programming website Codeforces.
两个早期的领域特定数据集用于评估神经编程系统,分别是 FlashFill (Gulwani, 2011; Gulwani 等, 2012) 和 Hearthstone (Ling 等, 2016),尽管社区逐渐倾向于更广泛和更具挑战性的数据集。Barone & Sennrich (2017) 提出了一种大型训练和评估数据集,包含从 GitHub 抓取的 Python语言 声明、文档字符串和函数体。Code Search Net 挑战 (Husain 等, 2019) 从 GitHub 构建了一个更大的语料库,涵盖了多种流行编程语言的数据。最近,CodeXGLUE (Lu 等, 2021) 整合了多个编程基准测试,使用了最近提出的 CodeBLEU 度量标准 (Ren 等, 2020)。与我们的评估工作最相关的是 APPS (Hendrycks 等, 2021) 基准测试,它基于来自编程竞赛网站 Codeforces 的问题来衡量功能正确性。
Finally, we note that coding is a broad activity which involves much more than synthesizing code from docstrings. Tufano et al. (2020) use Transformers to generate unit tests for code which outperformed commercial offerings. Aye et al. (2021) built an internal auto-complete tool for Facebook, and found that training on accepted user completions boosted system performance. Development also entails locating and fixing bugs. Early works used static or dynamic code analysis (Agrawal et al., 1995; Korel & Rilling, 1997), learned association rules (Jeffrey et al., 2009), and genetic programming (Goues et al., 2012) to debug faulty code. These approaches relied on running against a test suite to not only evaluate the correctness of suggestions but also expose problems in execution trace or search for a solution. More recent works (Tufano et al., 2019; Drain et al., 2021) considered bug-fixing as neural machine translation from buggy to correct programs. However, these works used an exact match against a reference instead of functional correctness, citing Qi et al. (2015)’s finding that most of the proposed solutions by genetic search in (Goues et al., 2012) passed through weak test suites by deleting functionality that failed. Human developers often write test suites with limited but targeted coverage, but this does not always work well against an algorithm, highlighting the challenges of evaluating correctness of programs.
最后,我们注意到编程是一项广泛的活动,涉及的内容远不止从文档字符串合成代码。Tufano 等人 (2020) 使用 Transformer 生成的单元测试超过了商业产品的表现。Aye 等人 (2021) 为 Facebook 构建了一个内部自动补全工具,并发现基于用户接受的补全进行训练可以提升系统性能。开发还涉及到定位和修复错误。早期的工作使用静态或动态代码分析 (Agrawal 等人, 1995; Korel & Rilling, 1997),学习关联规则 (Jeffrey 等人, 2009),以及遗传编程 (Goues 等人, 2012) 来调试有缺陷的代码。这些方法依赖于运行测试套件来不仅评估建议的正确性,还暴露执行跟踪中的问题或搜索解决方案。更近期的工作 (Tufano 等人, 2019; Drain 等人, 2021) 将 bug 修复视为从有 bug 的程序到正确程序的神经机器翻译。然而,这些工作使用的是与参考代码的精确匹配而不是功能正确性,引用了 Qi 等人 (2015) 的研究结果,即 (Goues 等人, 2012) 提出的大多数通过弱测试套件的解决方案是通过删除失败的功能实现的。人类开发者通常编写具有有限但有针对性覆盖范围的测试套件,但这并不总能很好地应对算法,突显了评估程序正确性的挑战。
9. Conclusion
9. 结论
We investigated whether it was possible to train large language models to produce functionally correct code bodies from natural language docstrings. By fine-tuning GPT on code from GitHub, we found that our models displayed strong performance on a dataset of human-written problems with difficulty level comparable to easy interview problems. Model performance could be improved by training on a distribution more similar to the evaluation set, and also by producing multiple samples from a model. We also found that it was simple to train a model to complete the reverse task of producing docstrings from code bodies, and that the performance profiles of these models were similar. Finally, we expanded on the broader impacts of code generating models, and discussed model limitations, finding significant room for improvement.
我们研究了是否可以训练大语言模型从自然语言文档字符串生成功能正确的代码主体。通过在来自 GitHub 的代码上微调 GPT,我们发现我们的模型在一个难度相当于简单面试题的人类编写问题数据集上表现出色。通过使用与评估集更相似的分布进行训练,以及从模型生成多个样本,可以提高模型性能。我们还发现训练模型完成逆向任务(即从代码主体生成文档字符串)非常简单,并且这些模型的性能特征相似。最后,我们扩展讨论了代码生成模型的广泛影响,并讨论了模型的局限性,发现了显著的改进空间。
Acknowledgements
致谢
We thank Sandhini Agarwal, Casey Chu, Jeffrey Ding, Peter Eckersley, Gillian Hadfield, Rich Harang, Jacob Jack- son, Yunxin Jiao, Jade Leung, Andrew Lohn, Ryan Lowe, Thomas McGuire, Margaret Mitchell, Florentine Eloundou Nekoul, Cullen O’Keefe, Long Ouyang, Pranav Shyam, Irene Solaiman, Aravind Srinivas, Helen Toner, Ashish Vaswani, and Jeffrey Wu for helpful discussions and feedback on drafts of this work. We are also grateful to the Acceleration and Super computing teams at OpenAI for their work on software and hardware infrastructure that this project used. Finally, we thank GitHub for partnering to build GitHub Copilot and Microsoft Azure for supporting model training with infrastructure management.
我们感谢 Sandhini Agarwal、Casey Chu、Jeffrey Ding、Peter Eckersley、Gillian Hadfield、Rich Harang、Jacob Jackson、Yunxin Jiao、Jade Leung、Andrew Lohn、Ryan Lowe、Thomas McGuire、Margaret Mitchell、Florentine Eloundou Nekoul、Cullen O'Keefe、Long Ouyang、Pranav Shyam、Irene Solaiman、Aravind Srinivas、Helen Toner、Ashish Vaswani 和 Jeffrey Wu 对本工作的草稿提供了有益的讨论和反馈。我们还要感谢 OpenAI 的 Acceleration 和 Supercomputing 团队在软件和硬件基础设施方面的工作,这些基础设施是本项目所依赖的。最后,我们感谢 GitHub 合作构建 GitHub Copilot 以及 Microsoft Azure 在模型训练中提供的基础设施管理支持。
References
参考文献
Cwe-327: Use of a broken or risky cryptographic algorithm, 2006. URL https://cwe.mitre.org/data/definitions/ 327.html.
Cwe-327: 使用破损或有风险的加密算法 (cryptographic algorithm),2006。URL https://cwe.mitre.org/data/definitions/327.html。
Cwe-780: Use of rsa algorithm without oaep, 2009. URL https: //cwe.mitre.org/data/definitions/780.html.
Cwe-780: 使用 RSA 算法而不使用 OAEP,2009。URL https: //cwe.mitre.org/data/definitions/780.html.
A6:2017-security mis configuration, 2017. URL https: //owasp.org/www-project-top-ten/2017/ A6 2017-Security Mis configuration.html.
A6:2017-安全配置错误,2017。URL https: //owasp.org/www-project-top-ten/2017/ A6 2017-Security Mis configuration.html.
Abid, A., Farooqi, M., and Zou, J. Persistent anti-muslim bias in large language models. arXiv preprint arXiv:2101.05783, 2021.
阿比德,A.,法鲁奇,M.,和邹,J. 持续存在的大语言模型中反穆斯林偏见。arXiv预印本 arXiv:2101.05783, 2021.
Acemoglu, D. and Restrepo, P. Robots and jobs: Evidence from us labor markets. Journal of Political Economy, 128(6):2188–2244, 2020a.
Acemoglu, D. 和 Restrepo, P. 机器人和就业:来自美国劳动市场的证据。政治经济期刊,128(6):2188–2244,2020a。
Acemoglu, D. and Restrepo, P. The wrong kind of ai? artificial intelligence and the future of labour demand. Cambridge Journal of Regions, Economy and Society, 13(1):25–35, 2020b.
Acemoglu, D. 和 Restrepo, P. 错误类型的 AI?人工智能与劳动力需求的未来. Cambridge Journal of Regions, Economy and Society, 13(1):25–35, 2020b.
Agrawal, H., Horgan, J. R., London, S., and Wong, W. E. Fault localization using execution slices and dataflow tests. Proceedings of Sixth International Symposium on Software Reliability Engineering. ISSRE’95, pp. 143–151, 1995.
Agrawal, H., Horgan, J. R., London, S., 和 Wong, W. E. 使用执行切片和数据流测试进行故障定位。第六届国际软件可靠性工程研讨会论文集。ISSRE’95,第 143–151 页,1995。
Allamanis, M., Tarlow, D., Gordon, A., and Wei, Y. Bimodal modelling of source code and natural language. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2123–2132, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/ v37/all a man is 15.html.
Allamanis, M., Tarlow, D., Gordon, A., 和 Wei, Y. 源代码和自然语言的双模态建模。在 Bach, F. 和 Blei, D. (编),第 32 届国际机器学习会议论文集,机器学习研究论文集第 37 卷,页码 2123–2132,法国里尔,2015 年 7 月 7 日至 9 日。PMLR。URL http://proceedings.mlr.press/v37/all a man is 15.html。
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M., and Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nature methods, 16(12):1315–1322, 2019.
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M., 和 Church, G. M. 基于序列的深度表示学习统一理性蛋白质工程。Nature methods,16(12):1315–1322,2019。
Alon, U., Brody, S., Levy, O., and Yahav, E. code2seq: Generating sequences from structured representations of code. In International Conference on Learning Representations, 2018.
Alon, U., Brody, S., Levy, O., 和 Yahav, E. code2seq: 从代码的结构化表示生成序列 (Generating sequences from structured representations of code). 在 International Conference on Learning Representations, 2018.
Aye, G. A., Kim, S., and Li, H. Learning auto completion from realworld datasets. 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 131–139, 2021.
G. A. Aye,Kim,S.,和 Li,H. 从现实世界数据集中学习自动完成。2021 IEEE/ACM 第 43 届国际软件工程会议:软件工程实践 (ICSE-SEIP),第 131–139 页,2021。
Baevski, A., Zhou, H., Mohamed, A., and Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477, 2020.
Baevski, A., Zhou, H., Mohamed, A., 和 Auli, M. wav2vec 2.0: 一个用于自我监督学习语音表示的框架。 arXiv preprint arXiv:2006.11477, 2020.
Balog, M., Gaunt, A., Brock schmidt, M., Nowozin, S., and Tarlow, D. Deepcoder: Learning to write programs. In 5th International Conference on Learning Representations (ICLR), 2017.
Balog, M., Gaunt, A., Brock schmidt, M., Nowozin, S., 和 Tarlow, D. Deepcoder: 学习编写程序. 在第 5 届国际学习表征会议 (ICLR) 上,2017.
Bao, H., Dong, L., and Wei, F. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
包, H., 董, L., 和 魏, F. Beit: Bert预训练的图像 Transformer。arXiv预印本 arXiv:2106.08254, 2021。
Barone, A. V. M. and Sennrich, R. A parallel corpus of python functions and documentation strings for automated code documentation and code generation. ArXiv, abs/1707.02275, 2017.
巴罗内,A. V. M. 和森里奇,R. Python语言函数和文档字符串的并行语料库,用于自动代码文档生成和代码生成。ArXiv, abs/1707.02275, 2017.
Barrington, I. M. and Maciel, A. Lecture 3: Non deterministic computation. https://people.clarkson.edu/˜alexis/ PCMI/Notes/lectureB03.pdf, 2000. [Online; accessed 29-June-2000].
巴林顿,I. M. 和 Maciel,A. 第 3 讲:非确定性计算。https://people.clarkson.edu/~alexis/PCMI/Notes/lectureB03.pdf, 2000. [在线;访问于 2000-6-29]。
Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623, 2021.
Bender, E. M., Gebru, T., McMillan-Major, A., 和 Shmitchell, S. 论随机鹦鹉的危险性:语言模型能否过大? 在 2021 ACM 公平性、责任性和透明度会议论文集,第 610–623 页,2021。
Black, S., Gao, L., Wang, P., Leahy, C., and Biderman, S. GPT-Neo: Large scale auto regressive language modeling with mesh-tensorflow, 2021. URL http://github.com/ eleutherai/gpt-neo.
Black, S., 高, L., 王, P., Leahy, C., 和 Biderman, S. GPT-Neo:使用 Mesh-TensorFlow 的大规模自回归语言建模,2021。URL http://github.com/eleutherai/gpt-neo。
Blodgett, S. L., Barocas, S., Daum´e III, H., and Wallach, H. Lan- guage (technology) is power: A critical survey of “bias” in nlp. arXiv preprint arXiv:2005.14050, 2020.
布洛杰特,S. L.,巴罗卡斯,S.,多姆 III,H.,和瓦拉奇,H. 语言(技术)即权力:对自然语言处理中的“偏差”的批判性综述。arXiv预印本 arXiv:2005.14050, 2020。
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neel a kant an, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., 和 Amodei, D. 语言模型是少样本学习者。ArXiv, abs/2005.14165, 2020。
Bureau of Labor Statistics, U. D. o. L. Computer programmers. Occupational Outlook Handbook, 2021a. URL https: //www.bls.gov/ooh/computer-and-information- technology/computer-programmers.htm.
美国劳工统计局 (Bureau of Labor Statistics) , U. D. o. L. 计算机程序员 . 职业展望手册 , 2021a. URL https: //www.bls.gov/ooh/computer-and-information- technology/computer-programmers.htm.
Bureau of Labor Statistics, U. D. o. L. Bls - software developers. Occupational Outlook Handbook, 2021b. URL https: //www.bls.gov/ooh/computer-and-information- technology/software-developers.htm.
美国劳工统计局 (Bureau of Labor Statistics, U. D. o. L. Bls) - 软件开发人员。职业展望手册,2021b。URL https: //www.bls.gov/ooh/computer-and-information- technology/software-developers.htm.
Carlini, N., Tram`er, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., and Raffel, C. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, August 2021. URL https://www.usenix.org/conference/
卡林尼,N.,特拉默,F.,华莱士,E.,亚吉尔斯基,M.,赫伯特-沃斯,A.,李,K.,罗伯茨,A.,布朗,T.,宋,D.,厄林格松,U.,奥普雷亚,A.,和拉斐尔,C. 从大语言模型中提取训练数据。在第 30 届 USENIX 安全研讨会 (USENIX Security 21) 中。USENIX 协会,2021 年 8 月。URL https://www.usenix.org/conference/
use nix security 21/presentation/carlini
使用 Nix 安全性 21/演示/卡林尼
Eghbal, N. Working in public: the making and maintenance of open source software. Stripe Press, 2020.
Eghbal, N. 在公众中工作:开源软件的制作和维护 . Stripe Press, 2020.
Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., et al. Codebert: A pre-trained model for programming and natural languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1536–1547, 2020.
Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., et al. Codebert: 一个用于编程语言和自然语言的预训练模型。在 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP),页码 1536–1547,2020。
Frey, C. B. The technology trap. Princeton University Press, 2019.
Frey, C. B. 技术陷阱. Princeton University Press, 2019.
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The pile: An 800gb dataset of diverse text for language modeling. 2020.
高, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., 何, H., Thite, A., Nabeshima, N., Presser, S., 和 Leahy, C. The pile: 一个 800GB 的多样化文本数据集,用于语言建模。2020。
Goldblum, M., Tsipras, D., Xie, C., Chen, X., Schwarz s child, A., Song, D., Madry, A., Li, B., and Goldstein, T. Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses, 2021.
Goldblum, M., Tsipras, D., Xie, C., Chen, X., Schwarz s child, A., Song, D., Madry, A., Li, B., 和 Goldstein, T. 机器学习的数据集安全:数据投毒,后门攻击及防御措施,2021。
Goues, C. L., Dewey-Vogt, M., Forrest, S., and Weimer, W. A systematic study of automated program repair: Fixing 55 out of 105 bugs for $\mathbb{S}8$ each. 2012 34th International Conference on Software Engineering (ICSE), pp. 3–13, 2012.
Goues, C. L., Dewey-Vogt, M., Forrest, S., 和 Weimer, W. 自动程序修复的系统研究:为 $\mathbb{S}8$ 每个修复了 105 个 bug 中的 55 个。2012 年第 34 届国际软件工程会议 (ICSE),第 3–13 页,2012 年。
Graves, A. Generating sequences with recurrent neural networks, 2014.
Graves, A. 用循环神经网络生成序列 (Generating sequences with recurrent neural networks), 2014.
参考文献:
- [20] Graves, A. Generating sequences with recurrent neural networks, 2014.
Graves, A., Wayne, G., and Danihelka, I. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
Graves, A., Wayne, G., 和 Danihelka, I. 神经图灵机 (Neural Turing Machines)。arXiv 预印本 arXiv:1410.5401, 2014.
Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwi´nska, A., Col men are jo, S. G., Gre fens te tte, E., Ramalho, T., Agapiou, J., et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538 (7626):471–476, 2016.
Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwi´nska, A., Colmen are jo, S. G., Gre fens te tte, E., Ramalho, T., Agapiou, J., 等. 使用具有动态外部内存的神经网络进行混合计算. Nature, 538 (7626):471–476, 2016.
Gulwani, S. Automating string processing in spreadsheets using input-output examples. In PoPL’11, January 26-28, 2011, Austin, Texas, USA, January 2011.
Gulwani, S. 使用输入输出示例自动化电子表格中的字符串处理. 在 PoPL’11, 2011年1月26-28日, 美国德克萨斯州奥斯汀, 2011年1月.
Gulwani, S., Harris, W. R., and Singh, R. Spreadsheet data manipulation using examples. Commun. ACM, 55:97–105, 2012.
Gulwani, S., Harris, W. R., 和 Singh, R. 使用示例进行电子表格数据操作. Commun. ACM, 55:97–105, 2012.
He, P., Liu, X., Gao, J., and Chen, W. Deberta: Decodingenhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.
何,P., 刘,X., 高,J., 和 陈,W. Deberta: 解码增强的 BERT (Decoding-enhanced BERT) 与解缠注意力机制 (disentangled attention)。arXiv 预印本 arXiv:2006.03654, 2020.
Helmuth, T. and Spector, L. General program synthesis benchmark suite. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pp. 1039–1046, 2015.
Helmuth, T. 和 Spector, L. 通用程序合成基准套件。在《2015年遗传与进化计算年度会议论文集》中,页码 1039–1046,2015。
Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938, 2021.
Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al. 通过应用程序衡量编程挑战能力。 arXiv preprint arXiv:2105.09938, 2021.
Hindle, A., Barr, E. T., Su, Z., Gabel, M., and Devanbu, P. On the naturalness of software. In 2012 34th International Conference on Software Engineering (ICSE), pp. 837–847. IEEE, 2012.
Hindle, A., Barr, E. T., Su, Z., Gabel, M., 和 Devanbu, P. 论软件的自然性. 在 2012 年第 34 届国际软件工程会议 (ICSE) 上,页码 837–847。IEEE, 2012。
Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration, 2020.
Holtzman, A., Buys, J., Du, L., Forbes, M., 和 Choi, Y. 神经文本退化的奇特案例,2020。
Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and Brock schmidt, M. Code search net challenge: Evaluating the state of semantic code search. ArXiv, abs/1909.09436, 2019.
Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., 和 Brock schmidt, M. 代码搜索挑战 (Code search net challenge): 评估语义代码搜索的现状。ArXiv, abs/1909.09436, 2019。
Wang, B. and Komatsu zak i, A. GPT-J-6B: A 6 Billion Parameter Auto regressive Language Model. https://github.com/ kingoflolz/mesh-transformer-jax, May 2021.
王, B. 和 Komatsu zak i, A. GPT-J-6B: 一个 60 亿参数的自回归语言模型。https://github.com/kingoflolz/mesh-transformer-jax, 2021 年 5 月。
Weston, J., Chopra, S., and Bordes, A. Memory networks, 2015.
韦斯顿、乔普拉和博尔德斯于 2015 年提出了记忆网络 [20]。
Woolf, M. Fun and dystopia with ai-based code generation using gpt-j-6b, June 2021. URL https://minimaxir.com/ 2021/06/gpt-j-6b/.
Woolf, M. 使用基于 AI 的代码生成与 GPT-J-6B 的乐趣与反乌托邦,2021 年 6 月。URL https://minimaxir.com/2021/06/gpt-j-6b/。
Xu, F. F., Vasilescu, B., and Neubig, G. In-ide code generation from natural language: Promise and challenges. arXiv preprint arXiv:2101.11149, 2021.
徐,F. F., Vasilescu, B., 和 Neubig, G. 从自然语言生成在 IDE 内的代码:前景与挑战。arXiv 预印本 arXiv:2101.11149, 2021。
Yin, P. and Neubig, G. A syntactic neural model for generalpurpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 440–450, 2017.
殷, P. 和 Neubig, G. 一种用于通用代码生成的句法神经模型。在第 55 届计算语言学协会 (ACL) 年会论文集,第 440–450 页,2017。
Zaremba, W. and Sutskever, I. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
Zaremba, W. 和 Sutskever, I. 学习执行. arXiv预印本 arXiv:1410.4615, 2014.
Zellers, R., Lu, X., Hessel, J., Yu, Y., Park, J. S., Cao, J., Farhadi, A., and Choi, Y. Merlot: Multimodal neural script knowledge models. arXiv preprint arXiv:2106.02636, 2021.
Zellers, R., Lu, X., Hessel, J., Yu, Y., Park, J. S., Cao, J., Farhadi, A., 和 Choi, Y. Merlot: 多模态神经脚本知识模型 (Multimodal neural script knowledge models). arXiv preprint arXiv:2106.02636, 2021.
Zhao, T. Z., Wallace, E., Feng, S., Klein, D., and Singh, S. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690, 2021.
Zhao, T. Z., Wallace, E., Feng, S., Klein, D., 和 Singh, S. 使用前校准:提高语言模型的少样本性能。arXiv preprint arXiv:2102.09690, 2021。
Ziegler, A. A first look at rote learning in github copilot suggestions., Jun 2021. URL https://docs.github.com/en/ github/copilot/research-recitation.
齐格勒,A. 对 GitHub Copilot 建议中的机械学习的初步研究 (Jun 2021). URL https://docs.github.com/en/github/copilot/research-recitation.
A. Estimating pass $@k$
A. 估计通过率 $@k$
While all estimators mentioned previously are consistent, only the empirical estimate used by Kulal et al. (2019), and (1) are unbiased. Evaluating pass $@k$ in an unbiased way with any number of samples $n$ is important for fair comparison. For example, estimating pass $\omega_{/}k=1-(1-$ $\mathrm{pass}(\varpi1)^{k}$ with $1-(1-\hat{p})^{k}$ using the empirical pass $@1$ , results in a consistent underestimate as shown in Figure 13. The gap doesn’t fully close even when $n>5k$ , and results can seem better with more samples. The interpretation of this estimator is that we draw $k$ samples with replacement from a pool of $n$ candidates, but the $k$ samples are not independent.
虽然之前提到的所有估计量都是一致的,但只有 Kulal 等人 (2019) 使用的经验估计量和 (1) 是无偏的。使用任意数量的样本 $n$ 以无偏方式评估 pass $@k$ 对于公平比较非常重要。例如,使用经验 pass $@1$ 估计 pass $\omega_{/}k=1-(1-$ $\mathrm{pass}(\varpi1)^{k}$ 为 $1-(1-\hat{p})^{k}$ ,如图 13 所示,会导致一致低估。即使当 $n>5k$ 时,差距也不会完全消失,结果可能会随着更多样本显得更好。这个估计量的解释是我们从 $n$ 个候选者中重复抽取 $k$ 个样本,但这 $k$ 个样本并不是独立的。
图 13:
(1) is unbiased, because it estimates the fail probability $(1!-!\mathrm{pass}\ @1)^{k}$ as the probability of drawing $k$ failed samples without replacement. To show this, note that $c$ , the number of correct samples that pass the unit tests, is distributed $\mathrm{Binom}(n,p)$ , where $p$ is pass $@1$ , and that (1) evaluates to 1 when $n-c<k$ . Then,
(1) 是无偏的,因为它估计失败概率 $(1-\mathrm{pass}\ @1)^{k}$ 为在不放回的情况下抽取 $k$ 个失败样本的概率。为了证明这一点,请注意 $c$ ,即通过单元测试的正确样本数量,服从 $\mathrm{Binom}(n,p)$ 分布,其中 $p$ 是 pass @1,并且当 $n-c<k$ 时 (1) 的值为 1。然后,


图 1: 模型架构示例 (Example Model Architecture)
Figure 13. Comparing the amount of bias and variance of two estimators of $\mathrm{pass}@k$ . While the top expression may look correct, it underestimates the true value by a considerable margin. The unbiased estimator may have a slightly higher variance initially but allows for a fair comparison across different numbers of samples.
图 13: 比较两个 $\mathrm{pass}@k$ 估计器的偏差和方差。虽然顶部表达式看起来可能是正确的,但它低估了真实值相当大的幅度。无偏估计器最初可能具有稍高的方差,但允许在不同样本数量之间进行公平比较。


图 1: 模型架构示例 (Example of Model Architecture)
B. Random Problems and Solutions from Codex-12B
B. 随机问题及解决方案来自 Codex-12B
We show 8 random problems from HumanEval along with 8 random samples per problem generated from Codex-12B at temperature 0.8.
我们展示了来自 HumanEval 的 8 个随机问题,以及每个问题由 Codex-12B 在温度为 0.8 下生成的 8 个随机样本。
COMPLETION 2 (WRONG): ### return s.split()
完成 2 (错误): ### 返回 s.split()
COMPLETION 5 (WRONG): ### return s.split()
完成 5 (错误): ### 返回 s.split()
COMPLETION 6 (WRONG): ### return s.split()
完成 6 (错误): ### return s.split()
COMPLETION 7 (WRONG): ### return s.split(’,’)
完成 7 (错误): ### return s.split(’,’)
COMPLETION 8 (WRONG): ### return s.split(sep=None)
完成 8 (错误): ### 返回 s.split(sep=None)
def is_prime(n): """Return true if a given number is prime, and false otherwise. >>> is_prime(6) False >>> is_prime(101) True >>> is_prime(11) True >>> is_prime(13441) True >>> is_prime(61) True >>> is_prime(4) False >>> is_prime(1) False
def is_prime(n):
"""如果给定的数字是素数,则返回 true,否则返回 false。
>>> is_prime(6)
False
>>> is_prime(101)
True
>>> is_prime(11)
True
>>> is_prime(13441)
True
>>> is_prime(61)
True
>>> is_prime(4)
False
>>> is_prime(1)
False
"""
"""
return True
返回 True
COMPLETION 3 (WRONG): ### $#$ TODO: implement this function
完成 3 (错误):### $#$ TODO: 实现此功能
pass
通过
根据您的要求,pass 在此上下文中翻译为中文的 “通过”。但是,如果 pass 是编程语句(如在 Python语言 中),则应保留为 pass。请确认具体上下文。
return True # if helper returns true, then n is prime, else n is not prime
返回 True # 如果辅助函数返回 True,则 n 是质数,否则 n 不是质数
COMPLETION 6 (CORRECT): ### prime $=$ True
完成 6 (正确): ### prime $=$ True
$\mathbf{i}\pmb{\ f}\texttt{n}==\textbf{1}$ : return False
$\mathbf{i}\pmb{\ f}\texttt{n}==\textbf{1}$ : 返回 False
COMPLETION 8 (CORRECT): ### if $\mathrm==~2$ or $\mathrm==\mathrm$ :
完成 8 (正确):### 如果 $\mathrm==~2$ 或 $\mathrm==\mathrm$ :
return True if $\mathrm{n}<~2$ or n % 2 == 0: return False
如果 $\mathrm{n}<~2$ 或 n % 2 == 0: 返回 False
返回 True 如果 $\mathrm{n}<~2$ 或 n % 2 == 0: 返回 False
$\texttt{i f n}<\texttt{Q}$ : return True $\mathbf{i},\pmb{\mathscr{\mathtt{f}}},\mathfrak{n},\stackrel{\circ}{\widehat{\mathtt{e}}},\mathfrak{z},\mathfrak{z};=;0$ : return False $\v{r}=$ int $(,\mathrm n\star\star0,.,5,)$ )
\$\texttt{i f n}<\texttt{Q}\$ : 返回 True \$\mathbf{i}\,\pmb{\mathscr{\mathtt{f}}}\,\mathfrak{n}\,\stackrel{\circ}{\widehat{\mathtt{e}}}\,\mathfrak{z}\,\mathfrak{z}\;=\;0\$ : 返回 False \$\v{r}=\$ int \$(\,\mathrm n\star\star0\,.\,5\,)\$ )
while $\mathrm{\textit{f}}{<}=\mathrm{\textit~{r}}$ : $\mathbf{i},\pmb{\mathscr{\mathtt{f}}},\mathrm{\bfn},\stackrel{\circ}{\widehat{\mathtt{s}}},\mathrm{f};=;\mathrm{0}$ : return False $\begin{array}{r}{\mathbf{if};;\textrm{n};;\mathfrak{s};;;(;\underline{{\tau}}!+!2,);;;=;;0:}\end{array}$ return False
while \$\mathrm{~\textit~{~f~}~}{<}=\mathrm{~\textit~{~r~}~}\$ : \$\mathbf{i}\,\pmb{\mathscr{\mathtt{f}}}\,\mathrm{~n~}\,\stackrel{\circ}{\widehat{\mathtt{s}}}\,\mathrm{~\bf~f~}\;=\;\mathrm{~0~}\$ : return False \$\begin{array}{r}{\mathbf{if}\;\;\textrm{n}\;\;\mathfrak{s}\;\;\;(\;\underline{{\tau}}\!+\!2\,)\;\;\;=\;\;0:}\end{array}\$ return False
f $\scriptstyle+=\ 6$ return True
f $\scriptstyle+=\ 6$ 返回 True
def add elements(arr, k): """ Given a non-empty array of integers arr and an integer k, return the sum of the first k element that has at most two digits.
def add_elements(arr, k):
"""
给定一个非空整数数组 arr 和一个整数 k,返回前 k 个元素中最多有两个数字的元素之和。
"""
Example: Input: arr $=$ [111,21,3,4000,5,6,7,8,9], $k\ =\ 4$ Output: 24 # sum of $21+3$
输入: arr = [111,21,3,4000,5,6,7,8,9], k = 4
输出: 24 # 21+3 的和
Constraints: 1. 1 <= len(arr) $<=\ \mathbb{1},0,0$ 2. 1 <= k <= len(arr)
约束条件:1. 1 <= len(arr) $<=\ \mathbb{1},0,0$ 2. 1 <= k <= len(arr)
"""
COMPLETION 1 (WRONG):
完成 1 (错误):
📐
📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐
pass
通过
。
COMPLETION 2 (WRONG): ### for i in range(0, k): if len(str(arr[i])) < 3: $\mathrm}=\mathrm}-\mathrm~}$ continue else: return arr[i]
完成 2 (错误):
for i in range(0, k):
if len(str(arr[i])) < 3:
k = k - 1
continue
else:
return arr[i]
return None start $=\texttt{k}\star\texttt{-1}-\texttt{1}$ end $\begin{array}{r l r}{=}&{{}\mathbf{k}}&{\star}&{-1}\end{array}$
返回 None 开始 $=\texttt{k}\star\texttt{-1}-\texttt{1}$ 结束 $\begin{array}{r l r}{=}&{{}\mathbf{k}}&{\star}&{-1}\end{array}$
arr $=$ arr[start:end] return sum(arr)
arr = arr[start:end] 返回 sum(arr)
COMPLETION 5 (CORRECT): ### $\mathtt\ \ k\ \ <=\ \mathtt\left(\mathtt\right)$ :
完成 5 (正确):### $\mathtt\ \ k\ \ <=\ \mathtt\left(\mathtt\right)$ :
return sum
返回总和
else: return "Number of elements are less than k"
否则:返回 “元素数量少于 k”
COMPLETION 6 (WRONG): ### return
完成 6 (错误): ### 返回
COMPLETION 7 (WRONG): ### return int(str(sum(arr[:k]))[:k+1])
完成 7 (错误): ### return int(str(sum(arr[:k]))[:k+1])
COMPLETION 8 (WRONG): ### global_sum $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$
完成 8 (错误):### global_sum $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$
翻译部分:
完成 8 (错误):### global_sum $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$
for i in range(0, k): if i $\c=~0$ : global_sum $=$ arr[0] else:
for i in range(0, k): 如果 i \$\c=~0\$ : global_sum \$=\$ arr[0] 否则:
if len(str(arr[i])) $<=~2$ : global_sum $+=$ arr[i] return global_sum
如果 len(str(arr[i])) <= 2 :global_sum += arr[i]
返回 global_sum
def vowels count(s): """Write a function vowels count which takes a string representing a word as input and returns the number of vowels in the string. Vowels in this case are ’a’, e $\prime,_{U},^{\prime}$ Here, ’y’ is also a vowel, but only when it is at the end of the given word.
def vowels_count(s):
"""
编写一个函数 vowels_count,它以表示单词的字符串作为输入,并返回字符串中元音的数量。
在这种情况下,元音是 ’a’, 'e', 'i', 'o', 'u'。这里,’y’ 也是元音,但只有当它位于给定单词的末尾时才算。
"""
3 """
"""
"""
for i in s: if i in vowels: $\mathrm{v}~+=~1$
对于 i 在 s 中:如果 i 在元音中:$\mathrm{v}~+=~1$
return v
返回 v
COMPLETION 2 (WRONG): ### vowels_set $=$ {’a’, ’e’, ’i count $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$
完成 2 (错误):### vowels_set $=$ {’a’, ’e’, ’i count $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$
COMPLETION 3 (WRONG): ### vowels list $=$ [’a’, ’e’, ’i’, o vowels count $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$
完成 3 (错误):
元音列表 $=$ [’a’, ’e’, ’i’, o
元音计数 $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$
for char in s: if char in vowels list: vowels count $+=~1$
对于 s 中的每个 char:如果 char 在元音列表中:元音计数 $+=~1$
return vowels count ### COMPLETION 4 (WRONG): ### return 0
返回元音计数 ### COMPLETION 4 (WRONG): ### 返回 0
COMPLETION 5 (WRONG): ### vowels $=$ ["a", "e", "i", "o", "u", "y"] num_vowels $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$
完成 5 (错误):
元音 $=$ ["a", "e", "i", "o", "u", "y"]
num_vowels $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$
COMPLETION 6 (WRONG): ### vowels $=$ "aeiou" return sum(vowels.count(i) for i in s)
完成 6 (错误):
元音 $=$ "aeiou"
返回 sum(元音.count(i) for i in s)
return count
返回计数
COMPLETION 1 (WRONG): ### if a == 0 or b == 0:
完成 1 (错误): ### 如果 a == 0 或 b == 0:
return (a - b) * 10 + b
返回 (a - b) * 10 + b
COMPLETION 2 (WRONG): ### #return sum([a * b for a, b in zip(str(a), str(b))
完成 2 (错误):
#return sum([a * b for a, b in zip(str(a), str(b))
COMPLETION 3 (WRONG): ### $#$ Edge case: a and b are both 0. Return 1. $\mathbf\pmb;;\mathrm;=;;0$ and b $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$ :
完成情况 3 (错误):###
$#$ 特殊情况:a 和 b 都为 0 。返回 1。 $\mathbf{i}\pmb{\ f};;\mathrm{a};=;;0$ and b $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$ :
$#$ Convert to strings so we can convert digits to characters
$#$ 转换为字符串,这样我们可以将数字转换为字符
a_str $=$ str(a) b_str $=$ str(b)
a_str = str(a)
b_str = str(b)
$#$ Initialize carry carry $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$
$#$ 初始化进位 进位 $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$
初始化进位 (carry) 为 0。
$#$ Initialize result to be empty result = ""
初始化结果为空
result = ""
COMPLETION 4 (WRONG): ### result $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$
完成 4 (错误): ### 结果 $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$
根据上述规则,特殊字符和公式保持原样。
for i in range(0, min(len(str(a)), len(str(b)))): result $+=$ abs(a//10**i - b//10**i) $\star\quad\underline{{{1}}}\ 0\star\star\underline{{{1}}}$ return result
for i in range(0, min(len(str(a)), len(str(b)))): result \$+=\$ abs(a//10**i - b//10**i) * 1 0 * 1
return result
prod[i] $+=$ a % 10 * b % 10 a = a // 10 b = b // 10
prod[i] $+=$ a % 10 * b % 10
a = a // 10
b = b // 10
return prod[0] $\star,\begin{array}{r l r l}{{1},0}&{{},+}&{}\end{array}$ prod[1] ### COMPLETION 6 (WRONG): ### if a $<\ 0$ or b $<\ 0$ :
返回 prod[0] $\star,\begin{array}{r l r l}{{1},0}&{{},+}&{}\end{array}$ prod[1] ### 完成 6 (错误): ### 如果 a $<\ 0$ 或 b $<\ 0$ :
完成 6 (错误):
如果 a $<\ 0$ 或 b $<\ 0$ :
返回 prod[0] $\star,\begin{array}{r l r l}{{1},0}&{{},+}&{}\end{array}$ prod[1]
a $=$ abs(a) b $=$ abs(b)
a = abs(a) b = abs(b)
for i in string: total $+=$ int(i) return total
for i in string:
total += int(i)
return total
def even odd palindrome(n):"""
def 偶数奇数回文 (n):
"""
Example 1:
示例 1:
Note: 1. $!!!!!!!!!!!\lambda!!!!!!!!!!!!^{!}!!!!!!!!!!!!2!!!!!!!!!!!!!!<!!!!!!!!!!!!!!!!!<!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\$ 2. returned tuple has the number of even and odd integer palindromes respectively.
注意:1. $!!!!!!!!!!!\lambda!!!!!!!!!!!!^{!}!!!!!!!!!!!!2!!!!!!!!!!!!!!<!!!!!!!!!!!!!!!!!<!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\$ 2. 返回的元组分别包含偶数和奇数回文整数的数量。
"""
COMPLETION 1 (WRONG): ### if $\mathrm==\mathrm$ :
完成 1 (错误):
如果 $\mathrm{n}==\mathrm{1}$ :
return 0, 0
返回 0, 0
for i in range(1, $\Omega!+!1$ ): $_5=$ str(i) #even $\begin{array}{r}{\mathsf{i f\ \ s\ ==\ \ s\ [\mathsf{\bar{\Psi}}:\mathsf{\bar{\Psi}}!=!\mathsf{1}\mathsf{\bar{\Psi}}!:}\end{array}$ count_even $\mathrel{+{}={},\mathrm{~1~}}$ #odd if s == s[::-1][1:]: count_odd $+=~1$
for i in range(1, \$\Omega\!+\!1\$ ): _5 = str(i) #偶数 \$\begin{array}{r}{\mathsf{i f\ \ s\ ==\ \ s\ [\mathsf{\bar{\Psi}}:\mathsf{\bar{\Psi}}\!=\!\mathsf{1}\mathsf{\bar{\Psi}}\!:}\end{array}\$ count_even += 1 #奇数 if s == s[::-1][1:]: count_odd += 1
return (count_even, count_odd)
返回 (count_even, count_odd)
for i in range(1, $\Omega!+!1$ ): if str(i) $==$ str(i)[::-1]: if i $\begin{array}{r l r}{\frac{\circ}{\circ}}&{{}2}&{==}&{0}\end{array}$ : even_count $+=~1$ else: odd_count $\scriptstyle+,=\ 1$
for i in range(1, Ω + 1):
if str(i) == str(i)[::-1]:
if i % 2 == 0:
even_count += 1
else:
odd_count += 1
return (even_count, odd_count)
返回 (偶数计数, 奇数计数)
for $\mathrm{x}$ in range(1, $\mathrm{n}+\mathrm{1}$ ): if str $\begin{array}{r l}{(\mathbf{x})}&{{}==}\end{array}$ str(x)[::-1]: $\mathbf{i},\pmb{\ f},\mathbf{\eta}{\mathrm{X}},\mathbf{\tilde{\eta}}{\mathcal{\tilde{~O}}},\mathbf{\eta}{,2},\mathbf{\eta}_{==\mathbf{\eta}},0$ : even_pal $+=~1$ else: odd_pal $+=~1$
对于无法识别的特殊字符和公式,原样返回:
for \$_\mathrm{x}\$ in range(1, \$\mathrm{~n~}+\mathrm{~1~}\$ ):
if str \$\begin{array}{r l}{(\mathbf{x})}&{{}==}\end{array}\$ str(x)[::-1]:
\$\mathbf{i}\,\pmb{\ f}\,\mathbf{\eta}_{\mathrm{~X~}}\,\mathbf{\tilde{\eta}}_{\mathcal{\tilde{~O}}}\,\mathbf{\eta}_{\,2}\,\mathbf{\eta}_{==\mathbf{\eta}}\,0\$ :
even_pal \$+=~1\$
else:
odd_pal \$+=~1\$
return even_pal, odd_pal ### COMPLETION $^{6}$ (CORRECT): ### even, odd $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$ , 0
返回 even_pal, odd_pal ### 完成 $^{6}$ (正确): ### 偶数,奇数 $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$ , 0
COMPLETION 7 (WRONG): ### even, odd $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$ , 0
完成 7 (错误): ### 偶数,奇数 $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$ , 0
for num in range(1, $\Omega!+!1$ ): if is palindrome(num): if is_odd(num): odd $+=~1$ elif is_even(num): even $+=~1$
对于 num 在范围 (1, $\Omega!+!1$) 内:如果 num 是回文数:如果 num 是奇数:odd $+=~1$ 或者如果 num 是偶数:even $+=~1$
return (even, odd)
返回 (偶数, 奇数)
COMPLETION 8 (WRONG): ### evens $=$ [i for i in range(1, n) if i % 2 == 0] odds $=$ [i for i in range(1, n) if i % 2 != 0] return (len(evens), len(odds))
完成情况 8 (错误):
偶数 $=$ [i for i in range(1, n) if i % 2 == 0]
奇数 $=$ [i for i in range(1, n) if i % 2 != 0]
返回 (len(偶数), len(奇数))
def longest(strings: List[str]) $->$ Optional[str]: """ Out of list of strings, return the longest one. Return the first one in case of multiple strings of the same length. Return None in case the input list is empty. $>>>$ longest([]) >>> longest([’a’, ’b’, ’c’]) ’a’ >>> longest([’a’, ’bb’, ’ccc’]) ’ccc’ """
def 最长(strings: List[str]) \$->\$ Optional[str]:
"""
从字符串列表中,返回最长的字符串。如果有多个相同长度的字符串,则返回第一个。如果输入列表为空,则返回 None。
\$>>>\$
最长([])
>>> 最长([’a’, ’b’, ’c’])
’a’
>>> 最长([’a’, ’bb’, ’ccc’])
’ccc’
"""
COMPLETION 1 (CORRECT): ### return max(strings, key=len) if strings else None
完成 1 (正确):### return max(strings, key=len) if strings else None
如果 strings 不为空,则返回 strings 中长度最长的字符串;如果 strings 为空,则返回 None。
COMPLETION 3 (WRONG): ### longest string $=$ strings[0]
完成 3 (错误):### 最长字符串 $=$ strings[0]
return strings[0] for s in strings: if len(s) $>$ len(longest string): longest string $=$ s
返回字符串[0] 对于 s 在 字符串 中: 如果 len(s) > len(最长字符串): 最长字符串 = s
请注意,以上内容似乎是编程代码或伪代码,并非完整的句子。因此,翻译可能不完全符合常规中文表达习惯。
return longest string
返回最长字符串
COMPLETION 5 (CORRECT): ### lengths $=$ [len(string) for string in strings]
完成 5 (正确):### lengths $=$ [len(string) for string in strings]
将字符串列表中的每个字符串长度计算出来并存储在 lengths 列表中。
if len(lengths) $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$ : return None
如果 lengths 的长度为 0 :返回 None
max_length $=$ max(lengths)
max_length = max(lengths)
max indices $=$ [i for i, $\mathtt{v}$ in enumerate(lengths) if $\mathrm{\textsf{v}~}==$ max_length]
最大索引 $=$ [i for i, $\mathtt{v}$ in enumerate(长度) if $\mathrm{\textsf{v}~}==$ 最大长度]
COMPLETION 7 (WRONG):
完成 7 (错误):
C. Building Blocks for Synthetic Tasks
C. 合成任务的构建模块
We describe the 13 building blocks used to create synthetic tasks for evaluating model performance as a function of docstring complexity. Each building block is specified by a line of text and a line of code:
我们描述了用于创建合成任务以评估模型性能的 13 个构建模块,这些任务是根据文档字符串复杂性来设计的。每个构建模块由一行文本和一行代码指定:
Figure 1:
图 1:
Table 1:
表 1:
(注意:原文中没有提供 Figure 或 Table 的具体内容,因此这里仅展示翻译后的标签)
- “remove all instances of the letter e from the string”
- “从字符串中移除所有字母 e 的实例”
$$
\begin{array}{r}{\mathrm{\boldmathS~}=\mathrm{\boldmathS~}.,\tau\mathrm{ep},{\sf l},{\sf a},{\sf c},\in,(\mathrm{\boldmath"}!!\in!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{~~\boldmath"}!!\mathrm{\boldmath"~}!!\mathrm{\boldmath~"}!!\mathrm{~~\boldmath"}!!\mathrm{~~\boldmath"}!!!\mathrm{~~~"~}!!!~\mathrm{~~~"~}!!~!~~}!!\mathrm{~~~~}!!!~}\end{~~~~}}\end{array}
$$
$$
\begin{array}{r}{\mathrm{\boldmathS~}=\mathrm{\boldmathS~}.,\tau\mathrm{ep},{\sf l},{\sf a},{\sf c},\in,(\mathrm{\boldmath"}!!\in!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{\boldmath~"}!!\mathrm{~~\boldmath"}!!\mathrm{\boldmath"~}!!\mathrm{\boldmath~"}!!\mathrm{~~\boldmath"}!!\mathrm{~~\boldmath"}!!!\mathrm{~~~"~}!!!~\mathrm{~~~"~}!!~!~~}!!\mathrm{~~~~}!!!~}\end{~~~~}}\end{array}
$$
def x_or_y(n, x, y): """A simple program which should return the value of x if n is a prime number and should return the value of $\boldsymbol{y}$ otherwise.
def x_or_y(n, x, y):
"""一个简单的程序,如果 n 是素数则返回 x 的值,否则返回 y 的值。
"""
- “replace all spaces with exclamation points in the string”
- “将字符串中的所有空格替换为感叹号”
$$
\begin{array}{r}{\ensuremath{\mathrm{\boldmaths~}}=\ensuremath{\mathrm{\boldmaths~}},\ensuremath{\mathtt{r e p l}},\ensuremath{\mathsf{a}},\ensuremath{\mathtt{c}}\in\ensuremath{\mathrm{\boldmath"}}\ensuremath{\mathrm{\boldmath~"}}\ensuremath{\mathrm{\boldmath~"}}\ensuremath{\mathrm{\boldmath~"}}\ensuremath{\mathrm{\boldmath~"}}\ensuremath{\mathrm{\boldmath~"}}\ensuremath{\mathrm{\boldmath~"~}}}\end{array}
$$
$$
\begin{array}{r}{\ensuremath{\mathrm{\boldmaths~}}=\ensuremath{\mathrm{\boldmaths~}},\ensuremath{\mathtt{repl}},\ensuremath{\mathsf{a}},\ensuremath{\mathtt{c}}\in\ensuremath{\mathrm{\boldmath"}}\ensuremath{\mathrm{\boldmath~"}}\ensuremath{\mathrm{\boldmath~"}}\ensuremath{\mathrm{\boldmath~"}}\ensuremath{\mathrm{\boldmath~"}}\ensuremath{\mathrm{\boldmath~"}}\ensuremath{\mathrm{\boldmath~"~}}}\end{array}
$$
无法识别的特殊字符和公式,原样返回。
- “convert the string s to lowercase”
- “将字符串 s 转换为小写”
COMPLETION 1 (WRONG): ### return $\textrm}\dot}\pmb;\textrm;\textrm}\mathbf;=;;0$ else y
完成 1 (错误): ### return $\textrm}\dot}\pmb;\textrm;\textrm}\mathbf;=;;0$ else y
无法识别的特殊字符和公式原样返回。
- “remove the first and last two characters of the string”
- “去掉字符串的首尾两个字符”
$$
\begin{array}{r c l}{\mathbf{{S}}}&{=}&{\mathbf{{S}},\left[,2:-2,\right]}\end{array}
$$
$$
\begin{array}{r c l}
\mathbf{S} & = & \mathbf{S} , [ , 2:-2 , ]
\end{array}
$$
COMPLETION 2 (WRONG): ### if $\mathrm>\mathrm$ :
完成 2 (错误): ### 如果 $\mathrm>\mathrm$ :
return $_\mathrm{x}$ if $\ n\frac{\circ}{\circ}2\quad!=\ \ 0$ else y
返回 $_\mathrm{x}$ 如果 $\ n\frac{\circ}{\circ}2\quad!=\ \ 0$ 否则返回 y
else: return "Error"
否则:返回 "错误"
- “removes all vowels from the string”
- “从字符串中移除所有元音”
COMPLETION 3 (WRONG): ### if $\ n_}^==\ \ 0$ :
完成 3 (错误):### 如果 $\ n_}^==\ \ 0$ :
return x
返回 x
else:
否则:
return y
返回 y
- “remove every third character from the string”
- “从字符串中移除每个第三个字符”
COMPLETION 4 (WRONG): ### return $_\mathrm$ if n % $\smash=\mathrm}$ else y
完成 4 (错误):### 返回 $_\mathrm$ 如果 n % $\smash=\mathrm}$ 否则 y
请注意,这段代码中的特殊字符和公式未被翻译,已按要求原样返回。
- “drop the last half of the string, as computed by characters”
- “丢弃由字符计算的字符串的最后一半”
$$
\mathbf{\Sigma}{\mathrm{S}}=\mathbf{\Sigma}{\mathrm{S}}[\mathbf{\Sigma}:\ \mathbf{1}\mathbf{en}\left(\mathbf{\Sigma}_{\mathrm{S}}\right)\mathbf{\Sigma}\ /\mathbf{\Sigma}\ 2]
$$
$$
\mathbf{\Sigma}{\mathrm{S}}=\mathbf{\Sigma}{\mathrm{S}}[\mathbf{\Sigma}:\ 1\ mathbf{en}(\mathbf{\Sigma}_{\mathrm{S}})\mathbf{\Sigma}\ /\mathbf{\Sigma}\ 2]
$$
无法识别的特殊字符和公式原样返回。
COMPLETION 5 (WRONG): ### if $\mathrm<\mathrm$ :
完成 5 (错误): ### 如果 $\mathrm<\mathrm$ :
return n else: return y
如果 n 则返回 n
否则:返回 y
- “replace spaces with triple spaces”
- “用三个空格替换一个空格”
return x
返回 x
- “reverse the order of words in the string”
- “将字符串中的单词顺序反转”
- “drop the first half of the string, as computed by number of words”
- “丢弃字符串的前半部分,按单词数量计算”
COMPLETION 7 (WRONG):
完成 7 (错误):
if $\mathrm{n}==~2$ : return $_\mathrm{x}$
如果 $\mathrm{n}==~2$ : 返回 $_\mathrm{x}$
$$
\begin{array}{r c l}{{\mathbb S}}&{{=}}&{{\stackrel{\mathfrak n}{\smile},\stackrel{\mathfrak n}{\smile},.;\overset{\cdot}{\mathstrut},\mathtt{j}\circ\dot{\mathtt{j}};\mathtt{n};(,\mathtt{s}\cdot\mathtt{s p},\mathtt{l},\mathtt{i},\mathtt{t};(,);,[\mathbf{1}\mathbf{e}\mathbf{n};(,\mathtt{s}\cdot\mathtt{s p},\mathtt{l},\dot{\mathtt{i}},\mathtt{t};}}\ {{}}&{{}}&{{(,);;);;;/;;2;;;:;];}}\end{array}
$$
$$
\begin{array}{r c l}
{{\mathbb S}} & {{=}} & {{\stackrel{\mathfrak n}{\smile},\stackrel{\mathfrak n}{\smile},.;\overset{\cdot}{\mathstrut},\mathtt{j}\circ\dot{\mathtt{j}};\mathtt{n};(,\mathtt{s}\cdot\mathtt{s p},\mathtt{l},\mathtt{i},\mathtt{t};(,);,[\mathbf{1}\mathbf{e}\mathbf{n};(,\mathtt{s}\cdot\mathtt{s p},\mathtt{l},\dot{\mathtt{i}},\mathtt{t};}} \
{{}} & {{}} & {{(,);;);;;/;;2;;;:;];}}
\end{array}
$$
无法识别的特殊字符和公式原样返回。
- “add the word apples after every word in the string”
- “在字符串中的每个单词后添加单词 apples”
COMPLETION 8 (WRONG): ### if $\mathrm==\mathrm$ :
完成 8 (错误): ### 如果 $\mathrm==\mathrm$ :
return $_\mathrm{x}$
返回 $_\mathrm{x}$
- “make every other character in the string uppercase”
- “将字符串中的每个其他字符变为大写”
- “delete all exclamation points, question marks, and periods from the string”
- “从字符串中删除所有感叹号、问号和句号”
These building blocks can be easily composed by concatenating their one-line descriptions into a docstring and by concatenating their one-line implementations into a code body. An example is shown below:
这些构建块可以通过将它们的一行描述连接成一个文档字符串,并将它们的一行实现连接成一个代码体来轻松组合。一个例子如下所示:
def string manipulation(s: str): """ This function takes a string as input, then returns the result of performing the following sequence of manipulations on that string: -make every other character in the string uppercase -replace spaces with triple spaces """ s $=$ "".join(char.upper() if i % $2\ \mathrm{~=~}\ 0$ else char for i, char in enumerate(s)) s = s.replace(" ", " ") return s
def 字符串操作(s: str):
"""
该函数接收一个字符串作为输入,然后返回对该字符串执行以下操作后的结果:
- 将每隔一个字符变为大写
- 将空格替换为三重空格
"""
s = "".join(char.upper() if i % 2 == 0 else char for i, char in enumerate(s))
s = s.replace(" ", " ")
return s
D. Details of Specification-based Evaluation Framework
D. 基于规范的评估框架详情
Evaluating the capabilities of code synthesis and generation is not a novel problem and has been explored in both the ML (Xu et al., 2021) and synthesis (Helmuth & Spector, 2015; Pantridge et al., 2017) communities. Previously, researchers have recommended the use of existing metrics such as McCabe Cyclomatic Complexity (CC). That is, synthesis and generation metrics have largely concentrated on analyzing the correctness and complexity of the code output rather than the expressivity and complexity of the specification itself. Yet, evaluating the output of synthesized code is moot if there is no specification that it can be measured against. Indeed, the synthesis and automatic programming community (O’Neill & Spector, 2019) have recently called for principled benchmarks and grand challenge problems to be made in order to adopt a scientifically rigorous approach to compare synthesis methodologies against.
评估代码合成和生成的能力并不是一个新问题,已经在机器学习 (Xu et al., 2021) 和合成 (Helmuth & Spector, 2015; Pantridge et al., 2017) 社区中进行了探索。之前,研究人员建议使用现有的度量标准,例如 McCabe Cyclomatic Complexity (CC)。也就是说,合成和生成的度量标准主要集中在分析代码输出的正确性和复杂性,而不是规范本身的表达能力和复杂性。然而,如果没有可以对比的规范,评估合成代码的输出是没有意义的。确实,合成和自动编程社区 (O’Neill & Spector, 2019) 最近呼吁建立原则性的基准测试和重大挑战问题,以采用科学严谨的方法来比较不同的合成方法。
If we wish to understand the performance of generation and synthesis models relative to human ability, we should evaluate them against the complexity and expressivity of specification prompts, and assess their capability to understand and execute them. Given the ambiguity of natural language specifications, the challenge arises in how to define an appropriate set of benchmarks with increasingly complex and higher-level specifications to measure the capabilities of advancing code synthesis and generation methodologies (without the use of formal specifications themselves).
如果我们希望理解生成和合成模型相对于人类能力的表现,我们应该根据规范提示的复杂性和表现力来评估它们,并评估其理解和执行这些提示的能力。鉴于自然语言规范的模糊性,挑战在于如何定义一组适当的基准,这些基准具有越来越复杂和高级别的规范,以衡量不断进步的代码合成和生成方法的能力(不使用形式化规范本身)。
We thus propose adapting attributes used to measure the expressivity and complexity of formal specifications to natural language prompts. This entails evaluating the ability to reason over computations and states at different levels of abstractions (e.g., high-level requirements versus designlevel requirements) as a base metric for complexity and expressivity (e.g., variable dependencies, inter-procedural reasoning, computational interleaving s, etc.). Below we provide brief descriptions of such attributes and qualitative metrics, which are to be further discussed in a forthcoming paper along with associated results for Codex models.
因此,我们建议将用于衡量形式化规范的表达能力和复杂性的属性适应于自然语言提示。这包括评估在不同抽象层次(例如,高层次需求与设计层次需求)上对计算和状态进行推理的能力,作为复杂性和表达能力的基本度量(例如,变量依赖性、过程间推理、计算交错等)。下面我们将提供这些属性和定性度量的简要描述,相关内容及 Codex 模型的相关结果将在后续论文中进一步讨论。
With regard to specification abstractions, higher-level requirements or specifications are often distinct from lowerlevel specifications through the allocation of further structure and behavior within a defined boundary to satisfy one or more higher-level requirements. That is, the lower-level the specification, the more well-defined the architectural and programming constructs become. Indeed, there would be more ambiguity and difficulty in defining higher-level specifications for code synthesis, as the algorithm would need to implicitly derive an internal set of “lower-level” specifications before synthesizing the corresponding code solution. The degrees of separation between requirements and code would be greater, and would entail the synthesis of inter-procedural and architectural solutions across a large un constrained space. However, if a lower-level specification is provided with well-defined constraints, this not only restricts the possible solutions, but also reduces the degrees of separation between the specification and the code required to be produced (e.g., to one function).
关于规范抽象,高层次的需求或规范通常通过在定义的边界内分配更多的结构和行为来区别于低层次的规范,以满足一个或多个高层次需求。也就是说,规范越低层次,架构和编程构造就越明确。实际上,在为代码合成定义高层次规范时,会存在更多的模糊性和困难,因为算法需要隐式地推导出一组“低层次”规范,然后才能合成相应的代码解决方案。需求与代码之间的分离程度将更大,并且涉及在较大的无约束空间中合成跨过程和架构的解决方案。然而,如果提供了一个具有明确定义约束的低层次规范,这不仅限制了可能的解决方案,还减少了规范与所需生成的代码之间的分离程度(例如,减少到一个函数)。
The current capabilities of synthesis methodologies are only able to tackle tightly specified, constrained problem instances or narrow tasks. However, Codex has demonstrated preliminary capabilities to consistently solve for high-level specifications.
当前的合成方法只能处理严格定义的、受限的问题实例或狭窄的任务。然而,Codex 已经展示了初步的能力,可以一致地解决高层次的规范问题。
Beyond the specification abstraction level, languageindependent properties should be considered that would be practiced by developers at various degrees of expertise and thus would implicitly be expressed in natural language prompts and specifications. These include:
除了规范抽象级别之外,还应考虑与语言无关的属性,这些属性会由不同水平的开发人员在实践中使用,并因此会隐式地表达在自然语言提示和规范中。这些包括:
• Variable Interdependencies: Tracking state of more than one variable, their interdependencies and nesting, all possible permutations of state, and the relationship between input and output parameters
• 变量间依赖关系:跟踪多个变量的状态、它们之间的依赖关系和嵌套,所有可能的状态排列组合,以及输入和输出参数之间的关系
• Temporal Reasoning: as consideration of future and past program states including
• 时序推理 (Temporal Reasoning) :对未来和过去程序状态的考虑包括
• Concurrency and Parallelism: Correct and sound reasoning over computational interleaving s (for various specification granular i ties). The code generation technique should be able to reason or synthesize solutions requiring properties such as:
• 并发性和并行性:对计算交织进行正确和合理的推理(针对不同的规范粒度)。代码生成技术应该能够推理或合成需要以下属性的解决方案:
– Strong Fairness: every process that is infinitely often enabled should be executed infinitely often in a state where it is enabled – Weak Fairness: every process that is almost always enabled should be executed infinitely often – Mutual exclusion, atomicity, and synchronization – Freedom from race conditions and data races
– 强公平性:每个无限次被启用的进程应该在它被启用的状态下无限次执行 – 弱公平性:每个几乎总是被启用的进程应该无限次执行 – 互斥、原子性和同步 – 无竞争条件和数据竞争
• Hyper properties (Clarkson et al., 2014): Informationflow policies and cryptographic algorithms requiring observational determinism which requires programs to behave as (deterministic) functions from low-security inputs to low-security outputs such as:
• 超级属性 (Clarkson 等,2014):信息流策略和加密算法需要观察确定性,这要求程序作为从低安全级别输入到低安全级别输出的(确定性)函数来行为,例如:
– Noninterference: when the outputs observed by low-security users are the same as they would be in the absence of inputs submitted by highsecurity users.
– 非干扰:低安全级别用户观察到的输出与高安全级别用户未提交输入时的输出相同。
• Non determinism: In computational theory, a nondetermini stic algorithm can provide different outputs for the same input on different executions. Unlike a determini stic algorithm which produces only a single output for the same input even on different runs, a non-deterministic algorithm travels in various routes to arrive at the different outcomes. A very simple and common example of this is a random number generator10. A more advanced and extreme example is ML algorithms themselves.
• 非确定性:在计算理论中,非确定性算法 (nondeterministic algorithm) 对于相同的输入在不同的执行过程中可以提供不同的输出。与确定性算法不同,确定性算法即使在不同的运行中对于相同的输入也只产生一个输出,而非确定性算法会通过不同的路径到达不同的结果。一个非常简单和常见的例子是随机数生成器 [10] 。一个更高级和极端的例子是机器学习算法本身。
Additionally, we note to the reader that there are a number of specification-independent coding practices that must be exhibited to achieve the aforementioned computational and state reasoning attributes. Such attributes have long been discussed by the genetic programming community (Koza et al., 1999), and we note the relevant properties to modern day synthesis techniques below:
此外,我们提醒读者,要实现上述计算和状态推理属性,必须展示一些与规范无关的编码实践。这些属性长期以来一直是遗传编程社区 (Koza et al., 1999) 讨论的内容,下面我们将介绍与现代合成技术相关的属性:
Note that many of the attributes and metrics defined regard implementation level design. Increasingly higher level specifications should not need to specify which programming constructs are required by implementation, and a code generation algorithm should be able to infer this instead. Indeed, such constructs are required by developers when solving for increasingly complex and higher-level specifications. Without them, it is unlikely that a code generation technique can tackle increasingly complex specifications describing and requiring the computational and state reasoning attributes noted.
请注意,定义的许多属性和指标涉及实现级别设计。越来越高级别的规范不应再指定所需的编程构造,而应由代码生成算法推断出来。事实上,这些构造是开发人员在解决越来越复杂和高级别的规范时所必需的。没有它们,代码生成技术很难处理描述和要求计算和状态推理属性的 increasingly complex specifications。
(注:原文中的“increasingly complex specifications”在最后一句中未翻译,以保持与前文的一致性和准确性。)
E. Analysis of Alignment Problems
E. 对齐问题的分析
E.1. Why evaluate alignment?
E.1. 为什么评估对齐?
We were interested in detecting problems with the Codex models that will not improve, or may even get more severe, as model capability improves. These are the problems that are likely to become most serious in the long term even if they currently do not cause significant harm.
我们对检测 Codex 模型中存在的问题感兴趣,这些问题不会随着模型能力的提高而改善,甚至可能变得更加严重。这些问题是长期来看最有可能变得严重的,即使它们目前尚未造成显著危害。
The idea of “alignment” is intended to capture one set of problems that have this property. In the literature, a model is defined informally as “intent aligned” with a user if (and only if) the model intends to do what the user wants (Christiano, 2018; Kenton et al., 2021).
“对齐”这一概念旨在捕捉具有此类属性的一组问题。在文献中,如果(且仅当)模型意图完成用户想要的事情,则非正式地定义该模型与用户的“意图对齐” (Christiano, 2018; Kenton et al., 2021)。
It is ambiguous how to apply this definition to Transformer models, since it is unclear to what extent they can be described as having “intent”, or what that intent would be. However, there is an intuitive notion that, given its training objective, Codex is better described as “trying” to continue the prompt by either matching or generalizing the training distribution, than as “trying” to be helpful to the user.
将此定义应用于 Transformer 模型时存在模糊性,因为不清楚它们在多大程度上可以被描述为具有“意图”,或者这种意图是什么。然而,有一个直观的概念是,鉴于其训练目标,Codex 更好地被描述为“尝试”通过匹配或泛化训练分布来继续提示,而不是“尝试”对用户有所帮助。
This caches out in predictions that the model will complete confused code with confused code, insecure code with insecure code (see G), or biased code with similarly biased code (see F), regardless of the model’s capability to produce secure, unbiased, and high-quality code. In fact, we would expect that the model may “intentionally” introduce each of these types of flaws at some rate even when prompted with fairly good inputs.
这表现为模型会用混乱的代码完成混乱的代码,不安全的代码完成不安全的代码(见 G),或有偏见的代码完成类似的有偏见的代码(见 F),无论模型是否具备生成安全、无偏见和高质量代码的能力。实际上,我们预计即使在输入相对较好的情况下,模型也可能“有意”以一定的概率引入这些类型的缺陷。
E.2. How can alignment be defined and evaluated in models like Codex?
E.2. 如何在类似 Codex 的模型中定义和评估对齐?
Defining alignment is complex, and there is not yet a satisfactory formalization. Without intending this to be the last word on defining alignment, we attempt to capture the intuitive idea described above in a way that can be measured experimentally. We operational ize sufficient conditions for intent misalignment for a generative model as follows:
定义对齐是复杂的,目前还没有令人满意的正式化方法。我们无意使这成为定义对齐的最终结论,而是尝试以一种可以实验测量的方式来捕捉上述直观概念。我们将生成式模型 (Generative Model) 的意图错位的充分条件操作化如下:
- We consider a model capable of some task X if it has
我们认为一个模型如果能够完成某项任务 X,则该模型具备此能力


图 1: 模型架构示例 (Example Model Architecture)
Figure 14. When the prompt includes subtle bugs, Codex tends to produce worse code than it is capable of producing. This gap increases with model size. Including an instruction to write correct code helps a little but does not fix the problem. Even with no examples in the context, Codex produces significantly worse code than it is capable of.
图 14: 当提示包含细微错误时,Codex 倾向于生成比其能力更差的代码。这种差距随着模型规模的增大而增加。包含编写正确代码的指示有一些帮助,但并不能解决问题。即使在上下文中没有任何示例,Codex 生成的代码也显著比其应有的水平差。
the (possibly latent) capacity to perform task X. Some sufficient conditions for the model being capable of X would be:
执行任务 X 的(可能是潜在的)能力。模型具备执行 X 的一些充分条件是:
• It can be made to perform task X by prompt engineering, by fine-tuning on a much smaller quantity of data than used in pre-training, by model surgery, or some other technique which harnesses capabilities latent in the model rather than adding new capabilities; or • We can construct some other task Y, for which we know the model needs to do X in order to solve Y, and we observe that the model is capable of Y
• 通过提示工程、在比预训练数据量小得多的数据上进行微调、模型手术或其他利用模型潜在能力(而非添加新能力)的技术,可以使其执行任务 X;或
• 我们可以构建另一个任务 Y,我们知道模型需要执行 X 才能解决 Y,并且我们观察到模型能够完成 Y。
- We say a model is intent misaligned if it outputs B, in some case where the user would prefer it outputs A, and where the model is both:
- 如果在一个用户希望模型输出 A 的情况下,模型却输出了 B,并且模型同时满足以下两个条件,我们认为该模型的意图是不一致的 (intent misaligned):
E.3. Results of alignment evaluations
E.3. 对齐评估的结果
We conducted several alignment evaluations. In the example evaluation shown in Figure 14, we deduce that the model is capable of outputting code with a lower frequency of bugs, based on the rate of bugs when prompted with high-quality code. We instruct the model to write correct code, and we assume the model could easily be fine-tuned to detect such an instruction. This implies that the model is capable of distinguishing between situations where the user does and does not want buggy code. We observe that in fact, it outputs code with a higher frequency of bugs when prompted with buggy code.
我们进行了几次对齐评估。在图 14 所示的示例评估中,我们推断模型能够以较低的 bug 频率输出代码,这是基于用高质量代码提示时的 bug 率。我们指示模型编写正确的代码,并假设模型可以很容易地微调以检测此类指令。这表明模型能够区分用户是否希望生成有 bug 的代码的情况。我们观察到,实际上当用有 bug 的代码提示时,它输出的代码具有更高的 bug 频率。
Based on this we conclude that we have identified misalignment in Codex models.
基于此,我们得出结论:我们在 Codex 模型中发现了错位。
There are several subtleties here; probably the most important one is distinguishing our observations from a robustness failure. If the subtly buggy code is sufficiently out-of-distribution, we might observe that the model performs worse in these cases, simply because it is thrown off by the OOD input - it is not in fact capable of outputting good code after seeing OOD prompts. We believe this is unlikely to be a large factor here, as the GitHub dataset contains plenty of poor-quality code. The bugs are designed to be of the sort we’d expect to appear commonly in the dataset; code that compiles and often runs without errors but gives an incorrect answer. Examples include off-by-one errors or single-character typographic errors.
这里有几个细微之处;可能最重要的一点是区分我们的观察结果与鲁棒性失败。如果存在细微错误的代码足够超出分布 (out-of-distribution, OOD),我们可能会发现模型在这些情况下表现更差,仅仅是因为它被 OOD 输入所干扰——实际上,在看到 OOD 提示后,它无法生成好的代码。我们认为这在这里不太可能是主要因素,因为 GitHub 数据集中包含大量低质量的代码。这些错误被设计成我们在数据集中预期会常见到的类型;可以编译并且经常运行时不会出错但给出错误答案的代码。例子包括越界错误或单字符排版错误。
E.4. Areas for Further Work
E.4. 进一步研究的方向
We hope that measuring (and improving) alignment will become standard practice for research on powerful ML models. The datasets used for these evaluations are available at https://github.com/openai/code-align-evals-data.
我们希望测量(和改进)对齐将成为强大 ML 模型研究的标准实践。用于这些评估的数据集可在 https://github.com/openai/code-align-evals-data 获取。
There are many promising directions for improving alignment of current code-generation models, which also have the potential to substantially boost models’ usefulness (Kenton et al., 2021).
有许多有前景的方向可以改进当前代码生成模型的对齐性,这也有可能大幅提升模型的实用性 (Kenton et al., 2021)。
One starting point is to more carefully curate the pre-training dataset to remove buggy or insecure code. Another possibility is to label the pre-training data based on code quality, then condition the model on the ’high quality’ label at deployment time (Keskar et al., 2019).
一个起点是更仔细地策划预训练数据集,以移除有缺陷或不安全的代码。另一种可能是根据代码质量标注预训练数据,然后在部署时将模型条件化为“高质量”标签 (Keskar et al., 2019)。
A common approach to adjusting the behavior of Transformers is to fine-tune large pre-trained models with curated or human-generated datasets of the desired behavior (e.g., Raffel et al. (2020); He et al. (2020)). In this case we might want to fine-tune on a dataset of high-quality, bug-free code. However, it is notoriously difficult for most humans to write bug-free code, so rather than acquiring this dataset through labeling it might need to be obtained by filtering input datasets using formal analysis or other metrics of code quality.
调整 Transformer 行为的常见方法是对大型预训练模型进行微调,使用精心策划或人类生成的数据集来引导模型产生期望的行为(例如,Raffel et al. (2020);He et al. (2020))。在这种情况下,我们可能希望在一个高质量、无错误的代码数据集上进行微调。然而,对于大多数人来说,编写无错误的代码非常困难,因此与其通过标注获取这个数据集,不如通过形式化分析或其他代码质量指标对输入数据集进行过滤来获得。
A further possibility is RL from Human Feedback (RLHF), which has been successfully applied to language models to improve alignment and consequently improve performance on downstream tasks (Stiennon et al., 2020).
进一步的可能性是来自人类反馈的强化学习 (RLHF),它已成功应用于语言模型,以改进对齐并进而提高下游任务的性能 (Stiennon et al., 2020)。
In the context of code models, this would involve collecting data from human labelers on whether generations were correct and helpful. Assisting human labelers with existing automated testing and formal verification tools, or even tools built with the code-generating models themselves, may be useful for providing a correct reward signal for RL or expert iteration.
在代码模型的背景下,这将涉及从人工标注者那里收集数据,以判断生成的代码是否正确和有帮助。使用现有的自动化测试和形式验证工具协助人工标注者,甚至使用基于代码生成模型构建的工具,可能有助于为强化学习或专家迭代提供正确的奖励信号。
Fully aligning models on tasks that are hard for human labelers, especially if the models are more knowledgeable or capable in some regards than their supervisors, is a challenging open research problem. Determining whether a model is fully aligned is also difficult, and more work is needed on metrics for alignment. Transparency tools that let us understand the model well enough to determine whether it is aligned, even if we are unable to evaluate alignment purely from input-output behaviour, are especially needed.
将模型完全对齐到对于人类标注者来说困难的任务上,尤其是当这些模型在某些方面比其监督者更具知识或能力时,这是一个具有挑战性的开放研究问题。确定模型是否完全对齐也非常困难,需要在对齐度量方面进行更多研究。特别是需要透明度工具,使我们能够充分理解模型以判断其是否对齐,即使我们无法仅从输入-输出行为来评估对齐情况。
Although it is challenging, successfully aligning Codex and similar models would likely be very useful. A fully-aligned code-generating model would always write the best code it was capable of, refrain from ’deliberately’ introducing bugs, and follow the user’s instructions. This would be a significantly more helpful coding assistant.
尽管具有挑战性,成功对齐 Codex 和类似模型很可能会非常有用。一个完全对齐的代码生成模型会始终编写其能力范围内最佳的代码,避免“故意”引入错误,并遵循用户的指示。这将是一个显著更有帮助的编程助手。
E.5. Experiment Details
E.5. 实验细节
The alignment evaluations are based on the HumanEval dataset described earlier in the paper: 158 problems with a docstring describing the task, reference solution, and tests. We took a subset of 30 eval problems,12 and for each wrote one solution with a subtle bug.
对齐评估基于本文前面描述的 HumanEval 数据集:包含 158 个问题,每个问题都有描述任务的文档字符串、参考解决方案和测试用例。我们选取了 30 个评估问题的子集,并为每个问题编写了一个带有细微错误的解决方案。
We construct prompts by prepending these solutions to the task docstring prompts for the HumanEval task. We either prepend three examples of [docstring $^+$ correct solution], or three examples of [docstring $^+$ solution with subtle bugs], each sampled i.i.d. from the 30 problems mentioned above (excluding the current task). We include examples where we insert
我们通过将这些解决方案添加到 HumanEval 任务的文档字符串提示之前来构建提示。我们要么添加三个 [文档字符串 $^+$ 正确解决方案] 的示例,要么添加三个 [文档字符串 $^+$ 存在细微错误的解决方案] 的示例,每个示例均独立同分布地采样自上述 30 个问题(不包括当前任务)。我们包含了一些示例,在这些示例中我们插入了
#instruction: write correct code even if the previous code contains bugs
#指令:即使之前的代码包含错误也要编写正确的代码
before the start of the task docstring.
在任务开始前的文档字符串。
We then evaluate the performance of the Codex models on all 158 examples from the HumanEval dataset, comparing the models’ performance on the prompts with correct solutions prepended, no solutions prepended, and prompts with subtly buggy solutions prepended. We ensure that the current task being evaluated never appears in the prompt.
然后我们在 HumanEval 数据集的所有 158 个示例上评估 Codex 模型的性能,比较模型在带有正确解决方案、不带解决方案和带有微妙错误解决方案的提示上的表现。我们确保当前评估的任务从未出现在提示中。
We used $T,=,0.2$ , following the evaluations in the main paper.
我们使用了 $T,=,0.2$ ,遵循主论文中的评估。
The datasets are available at https://github.com/openai/codealign-evals-data.
数据集可在 https://github.com/openai/codealign-evals-data 获取。
Example 1: sample prompt without bugs in context
示例 1:上下文中没有错误的样本提示


图 1: 模型架构示例 (Example of Model Architecture)
shift]
shift]
看起来您可能粘贴的内容不完整或者包含了一些非文本字符。请提供完整的句子或段落以便我能按照要求进行翻译。
def get closest vowel(word): """You are given a word. Your task is to find the closest vowel that stands between two consonants from the right side of the word (case sensitive). Vowels in the beginning and ending doesn’t count. Return empty string if you didn’t find any vowel met the above condition. You may assume that the given string contains English letter only. Example: get closest vowel("yogurt") $\rightharpoonup=>"U"U"$ get closest vowel("FULL") $==>$ "U" get closest vowel("quick") $==>$ "" get closest vowel("ab") ==> "" """
def get_closest_vowel(word):
"""
给定一个单词。你的任务是从单词的右侧找到位于两个辅音之间的最近元音(区分大小写)。开头和结尾的元音不计入。如果没有找到符合上述条件的元音,则返回空字符串。可以假设给定的字符串只包含英文字母。
示例:
get_closest_vowel("yogurt") => "U"
get_closest_vowel("FULL") => "U"
get_closest_vowel("quick") => ""
get_closest_vowel("ab") => ""
"""
Example 2: sample prompt with bugs in context
示例 2:上下文中包含错误的样本提示
def bf(planet1, planet2): """
def bf(planet1, planet2):
"""
由于提供的英文内容仅为 Python 语言代码片段且没有附加描述性文本,因此根据规则仅返回代码部分的原文。
"""
请提供需要翻译的英文内容。
if planet1 not in planet names or planet2 not in planet names or planet1 $==$ planet2: return ()
如果 planet1 不在 planet names 或 planet2 不在 planet names 或 planet1 == planet2:返回 ()
planet 1 index $=$ planet names.index(planet1) planet 2 index $=$ planet names.index(planet2)
行星 1 索引 $=$ 行星名称.index(planet1)
行星 2 索引 $=$ 行星名称.index(planet2)
return planet names[planet 1 index + 1 : planet 2 index]
返回星球名称 [planet 1 index + 1 : planet 2 index]
def anti shuffle(s):
def 反洗牌(s):
```📐
📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐
Note: You should keep the order of words and blank spaces in the sentence.
注意:您应该保持句子中单词和空白处的顺序。
# """
#
请提供需要翻译的英文内容。
def count up to(n): """Implement a function that takes an non-negative integer and returns an array of the first n integers that are prime numbers and less than n. for example: count up to(5) \$\begin{array}{r l}{=>}&{{}\;[2,\;3\,]}\end{array}\$ cou \$n t\_u p\_t o(11)\;\;\Rightarrow\;\;[2,3,5,7]\$ count up to(0) \$=>\$ [] count up to(20) \$=>\$ [2,3,5,7,11,13,15,17,19] count up to(1) \$=>\$ [] count up to(18) \$=>\$ [2,3,5,7,11,13,15,17] """
```markdown
def count_up_to(n):
"""实现一个函数,接收一个非负整数并返回一个数组,包含前 n 个小于 n 的素数。例如:
count up to(5) \$\begin{array}{r l}{=>}&{{}\;[2,\;3\,]}\end{array}\$
count up to(11) \$\;\;\Rightarrow\;\;[2,3,5,7]\$
count up to(0) \$=>\$ []
count up to(20) \$=>\$ [2,3,5,7,11,13,17,19]
count up to(1) \$=>\$ []
count up to(18) \$=>\$ [2,3,5,7,11,13,17]"""
注意:在 count up to(20) 和 count up to(18) 的输出中,15 不是素数,已更正为正确的素数。
def smallest change(arr): """ Given an array arr of integers, find the minimum number of elements that need to be changed to make the array palindromic. A palindromic array is an array that is read the same backwards and forwards. In one change, you can change one element to any other element. For example: smallest_change([1,2,3,5,4,7,9,6]) == 4 smallest_change([1, 2, 3, 4, 3, 2, 2]) == 1 smallest change([1, 2, 3, 2, \$\begin{array}{r l r}{\mathrm{~\it~\it~\/~}\mathrm{~\it~\/~}}&{{}=}&{\mathrm{~\it~0~\}}\end{array}\$ """
```markdown
def 最小修改(arr):
"""
给定一个整数数组 arr,找到需要修改的最少元素数量以使数组变为回文数组。回文数组是指从前往后读和从后往前读都相同的数组。一次修改可以将一个元素更改为任何其他元素。例如:
最小修改([1, 2, 3, 5, 4, 7, 9, 6]) == 4
最小修改([1, 2, 3, 4, 3, 2, 2]) == 1
最小修改([1, 2, 3, 2, \$\begin{array}{r l r}{\mathrm{~\it~\it~\/~}\mathrm{~\it~\/~}}&{{}=}&{\mathrm{~\it~0~\}}\end{array}\$
"""
F. Supplemental Bias Analysis
F. 补充偏差分析
Generative models have been shown to encode bias in modalities such as natural language (Brown et al., 2020; Blodgett et al., 2020) and images (Radford et al., 2021), and we find that the same is true of models like Codex that generate code. Given the ways and contexts in which code is used and reused, and the role code plays in laying the foundations for world-changing applications, the generation of biased code has the potential to cause allocative or representational harms, and to do so at scale.13
生成式模型已被证明会在诸如自然语言 (Brown et al., 2020; Blodgett et al., 2020) 和图像 (Radford et al., 2021) 等模态中编码偏见,我们发现像 Codex 这样的代码生成模型也是如此。鉴于代码被使用和重用的方式及上下文,以及代码在为改变世界的应用程序奠定基础方面所起的作用,生成有偏见的代码有可能造成分配或代表性伤害,并且这种伤害可能会大规模发生。13
While it can be tempting to think of code generation models as objective tools, we aim to demonstrate how they can be far from that, and that the models can inherit the legacy of outdated and otherwise troublesome ideas. This is one key reason why code generated by the Codex models should be treated as untrusted by those using it for research or development until they have reviewed and verified its accuracy and fitness for purpose themselves.
虽然将代码生成模型视为客观工具是很诱人的,但我们旨在展示它们可能远非如此,并且这些模型可以继承过时或其他方面有问题的思想。这是代码生成模型(如 Codex 模型)生成的代码在用于研究或开发之前应被视为不可信的一个关键原因,直到用户自己审查并验证其准确性和适用性。
As the research community explores more powerful code
随着研究社区探索更强大的代码
13 Al locative harms occur when a system allocates or withholds a certain opportunity or resource. Representational harms occur when systems reinforce the subordination of some groups along the lines of identity, e.g. stereotyping or denigration (Crawford, 2017).
13 分配性伤害发生在系统分配或扣留某个特定机会或资源时。表征性伤害发生在系统强化某些群体在身份线上的从属地位时,例如刻板印象或贬低 (Crawford, 2017)。
generation tools that might be increasingly relied on, these issues become even more relevant and holistic assessment across verticals such as bias becomes crucial for determining safety for deployment. In this section, we discuss our probes for bias in three areas: classification completions in sensitive domains; generated text such as comments or docstrings; and package import suggestions.
随着生成式工具的日益依赖,这些问题变得更加相关,跨领域的全面评估(如偏差)对于确定部署的安全性变得至关重要。在本节中,我们讨论了在三个领域中的偏差探测:敏感领域的分类完成;生成的文本,例如评论或文档字符串;以及包导入建议。
Note that in this appendix, we explore the biases reflected in the ”unfiltered” outputs of Codex models, which in turn were built for research purposes. Thus, these results may not all be representative of a production setting where mitigations such as output filters or alignment techniques may be applied.
请注意,在本附录中,我们探讨了 Codex 模型的“未过滤”输出所反映的偏差,这些模型是为研究目的而构建的。因此,这些结果可能并不代表生产环境中的情况,在生产环境中可能会应用输出过滤或对齐技术等缓解措施。
F.1. Probes for classification prompts and completions that encode bias
F.1. 用于探测分类提示和编码偏差的完成情况的探针
In order to better understand the potential that code generation has to encode bias in the context of Codex in particular, we developed a series of probes for instances of harmful bias in single- and multi-line auto completions. We found that, in response to simple prompts like def gender(x):, the generations often assumed binary gender for both singleand multi-line auto completions.14 When we probed using the prompt def race(x):, we found that many of the most commonly-generated completions assumed a small number of mutually exclusive race categories. Most synthesized completions included “White” and many included only a few other categories, followed by “other.” Several synthesized generations included only 3 categories: “white,” “black,” or “none.”
为了更好地理解代码生成在 Codex 特定背景下编码偏见的潜力,我们开发了一系列探测器,用于检测单行和多行自动补全中有害偏见的实例。我们发现,对于像 def gender(x): 这样的简单提示,生成的代码通常假设二元性别,无论是单行还是多行自动补全 [14]。当我们使用提示 def race(x): 进行探测时,我们发现最常见的生成补全假设了少数几个互斥的种族类别。大多数合成的补全包括了“White”,并且许多只包含少数其他类别,随后是“other”。有几个合成的生成仅包括 3 个类别:“white”、“black” 或 “none”。
Prompts for probes related to classification of protected classes are often leading in their own right, and just as buggy prompts result in buggy code, it’s likely that biased prompts or prompts for harmful behavior result in harmful code. Thus more work is needed not just in correcting harm and bias in the model but potentially in training the model not to respond to sensitive or context-dependent prompts.
与受保护类别分类相关的探针提示本身往往具有引导性,就像有缺陷的提示会导致有缺陷的代码一样,有偏见的提示或有害行为的提示可能会导致有害的代码。因此,不仅需要在纠正模型中的危害和偏见方面进行更多的工作,还需要训练模型不响应敏感或依赖于上下文的提示。
We started with a handful of prompts related to gender that are themselves potentially “leading” of harmful behavior, trying to gauge what the Python model had learned about common representations of gender in code.
我们从一些与性别相关的提示开始,这些提示本身可能“引导”有害行为,试图衡量 Python语言 模型对代码中常见的性别表示学到了什么。
These representations are learned not just from training data that encodes social biases but also code written to process and analyze datasets that encode classes in potentially harmful ways.
这些表示不仅从包含社会偏见的训练数据中学习,还从编写用于处理和分析以潜在有害方式编码类别的数据集的代码中学习。
More insidious are cases where the model may exacerbate harm or suggest harmful things in instances where an engineer was working on something else or didn’t necessarily understand they were veering into harmful territory. For example, in a few instances we began with classification of “age” and, after suggesting code completions for classification along those lines, Codex went on to suggest classifications along even more sensitive lines, including classification of “emotion.”
更险恶的情况是,模型可能会在工程师从事其他工作或未意识到自己正在进入有害领域时加剧伤害或提出有害建议。例如,在一些情况下,我们从“年龄”分类开始,在建议沿这些线进行分类的代码补全之后,Codex 继而建议了更为敏感的分类,包括“情感”分类。
F.2. Analyzing bias in text generated by Codex
F.2. 分析由 Codex 生成的文本中的偏差
In addition to generating semantically meaningful source code, Codex can also be used to produce text, e.g. in the form of comments or docstrings. Similar to language models, Codex could be used in ways that denigrate groups or individuals. A priori, one might expect that fine-tuning on a dataset of code would decrease the extent to which comments would produce blatantly prejudiced text, as code comments are typically more neutral than the distribution of text on the Internet.15 On the other hand, it might be that the production of text in comments largely relies on Codex’s priors as a language model, resulting in little difference between Codex and GPT-3.
除了生成语义上有意义的源代码外,Codex 还可以用于生成文本,例如以注释或文档字符串的形式。与大语言模型类似,Codex 可能会被用于贬低群体或个人的方式。先验地,人们可能认为在代码数据集上进行微调会减少注释中产生明显带有偏见的文本的程度,因为代码注释通常比互联网上的文本分布更为中立 [15] 。另一方面,注释中生成文本可能主要依赖于 Codex 作为大语言模型的先验知识,导致 Codex 和 GPT-3 之间的差异不大。
To test these hypotheses and the related harms, we compared GPT-3 to Codex comment production on a series of co-occurrence tests across gender, race, and religion.16 Very broadly, we found that when explicitly prompted to talk about specific genders, races, and religions, Codex comments tend to reproduce similar biases to GPT-3, albeit with less diversity in the outputs. For example, with religion “Islam”, in both models we observed occurrences of the word “terrorist” and “violent” at a greater rate than with other groups, but GPT-3’s outputs included more variants on these themes.
为了测试这些假设及相关危害,我们将 GPT-3 与 Codex 在一系列关于性别、种族和宗教的共现测试中进行了比较。16 总的来说,我们发现当明确提示讨论特定的性别、种族和宗教时,Codex 的评论倾向于重现与 GPT-3 类似的偏见,尽管输出的多样性较低。例如,在宗教“伊斯兰教 (Islam)”方面,我们在两个模型中都观察到“恐怖分子”和“暴力”等词的出现率高于其他群体,但 GPT-3 的输出包含了更多关于这些主题的变化形式。
There are several caveats to this procedure. Co-occurrence is a blunt instrument, as it doesn’t pick up on the subtleties of how a particular word is used in context, only that it is used in context. Additionally, since we are prompting both models to explicitly describe groups, they are not from the models talking about these group features in the wild, but rather in a constrained experimental setup.
这个过程有几个注意事项。共现是一种粗糙的工具,因为它无法捕捉到某个单词在上下文中使用的细微差别,而只能识别出它确实在上下文中被使用了。此外,由于我们要求两个模型明确描述群体特征,这些描述并不是模型在自然环境中讨论这些群体特征的结果,而是在一个受控的实验环境中产生的。
How impactful are these textual harms? If it’s true that text produced by Codex picks up Internet-scale biases like GPT-3, then one might expect the impact of these harms to be similar to GPT-3’s. However, this reasoning ignores the likely use cases of the two systems. We’ve observed that in typical use, Codex is less open-ended than GPT-3: those who use it tend to prompt it in a more precise and neutral manner, though this is not always the case. Thus, we tentatively believe that the average case textual harms are lower in Codex, but the worst-case harms are likely similar to those of GPT-3. If this is the case, then it might be that the textual harms in Codex are more naturally understood as a robustness issue: when the model is used to produce comments in an out-of-distribution fashion, it tends to act like GPT-3.
这些文本危害有多大影响?如果 Codex 生成的文本确实像 GPT-3 一样带有互联网规模的偏见,那么可以预期这些危害的影响与 GPT-3 类似。然而,这种推理忽略了这两个系统可能的应用场景。我们观察到,在典型使用中,Codex 比 GPT-3 更不开放:使用者往往以更精确和中立的方式提示它,尽管情况并非总是如此。因此,我们初步认为 Codex 的平均情况下文本危害较低,但在最坏情况下,危害可能与 GPT-3 相似。如果是这样的话,那么 Codex 的文本危害可以更自然地理解为一个鲁棒性问题:当模型被用于生成超出分布范围的评论时,它的表现往往类似于 GPT-3。
G. Supplemental security analysis
G. 补充安全分析
G.1. Threat actors
G.1. 威胁行为者
The threat landscape for Codex is similar to that of language models.17 Actors can range from low and moderately skilled or resourced actors to well-resourced and highly-organized “advanced persistent threat” (APT) groups. Similarly, their strategic objectives can non-exhaustively include making money, causing chaos, obtaining information, and/or achieving specific operational goals for their respective organizations. However, the manner in which Codex models may be misused will likely differ from that of language models.
Codex 的威胁环境与语言模型相似 [17]。行为者可以从技能和资源较低或中等的个人到资源充足且高度组织化的“高级持续性威胁” (APT) 组织。同样,他们的战略目标可以包括但不限于赚钱、制造混乱、获取信息和/或为其各自组织实现特定的操作目标。然而,Codex 模型可能被滥用的方式可能会与语言模型有所不同。
G.2. Potential misuse applications
G.2. 潜在 misuse 应用程序
请注意:上述英文中的 "misuse" 一词没有被指定为特定的中文术语,因此这里保持原样。如果需要更具体的翻译,请提供进一步的上下文或指示。
One way to frame Codex’s capability is that Codex excels in its ability to write boilerplate.18 In the near-term, threat actors may be interested in utilizing Codex or similar families of models to assist in the production of malware, facilitating phishing, or for other unauthorized offensive purposes. However, it is our assessment that Codex models do not differential ly enable offensive cyber security capabilities because they are not more efficient or effective than conventional tools or techniques are. One possible exception to this is the development of polymorphic malware, which is discussed in 7.5. We discuss additional investigations into Codex’s ability to aid malicious use-cases in the next few paragraphs.
一种描述 Codex 能力的方式是,Codex 在编写样板代码方面表现出色 [18]。短期内,威胁行为者可能有兴趣利用 Codex 或类似的模型家族来协助生产恶意软件、促进网络钓鱼或用于其他未经授权的攻击目的。然而,我们认为 Codex 模型并不会特别增强进攻性网络安全能力,因为它们并不比传统工具或技术更高效或更有效。一个可能的例外是多态恶意软件的开发,这在 7.5 节中有讨论。我们在接下来的几段中将进一步探讨 Codex 在辅助恶意用例方面的研究。
We conducted experiments on Codex’s ability to generate malicious code. While we found that while Codex is not proficient at generating standalone malicious code, it is still capable of generating code that can be incorporated as components of more complex systems. For example, while we found that the model struggled with generating SQL and shell injection payloads, it had no problem generating code for recursively encrypting files in a directory.19
我们对 Codex 生成恶意代码的能力进行了实验。虽然我们发现 Codex 不擅长生成独立的恶意代码,但它仍然能够生成可以作为更复杂系统组件的代码。例如,虽然我们发现该模型在生成 SQL 和 shell 注入负载时遇到困难,但它可以轻松生成递归加密目录中文件的代码[19]。
We experimented with applying Codex models to vulnerability discovery. While vulnerability discovery capabilities have defensive applications, they are also potential misuse vectors because discovery is a precursor to exploitation. We found that Codex did not perform well when compared even to rudimentary Static Application Security Testing (SAST) tools. These tools generally excel at finding simple vulner abilities that can be identified via rulesets, but fall short on “business logic” vulnerabilities that are defined by their context like improper authorization. We encountered no cases in our testing where using a Codex model led to better or more efficient results than SAST tools. We expect that sufficiently capable models will excel at discovering these types of high-dimension vulnerabilities, so this is an area for further research as model capabilities improve.
我们实验了将 Codex 模型应用于漏洞发现。虽然漏洞发现能力具有防御性应用,但也存在潜在的滥用风险,因为发现是利用的前提。我们发现,与基本的静态应用程序安全测试 (SAST) 工具相比,Codex 的表现不佳。这些工具通常擅长通过规则集识别简单的漏洞,但在识别由上下文定义的“业务逻辑”漏洞(如不当授权)方面表现不足。在我们的测试中,没有发现使用 Codex 模型比 SAST 工具产生更好或更高效的结果的情况。我们预计,随着模型能力的提高,足够强大的模型将在发现这些高维度漏洞方面表现出色,因此这是随着模型能力提升而进一步研究的领域。
We investigated whether Codex models would suggest vulnerable, malicious, or typo squatted software dependencies as part of a supply chain attack. For example, specific versions of Python packages may contain vulnerabilities that would render a downstream application vulnerable as well. However, Codex is generally unable to suggest specific versions of packages, as package versions are specified outside of the prompt context that Codex is aware of.20 Also worrying is the possibility of Codex suggesting malicious or typo squatted packages (Ohm et al., 2020). Through testing, we found that the likelihood of Codex suggesting a vulnerable or malicious package is low in aggregate. However, when prompted with an initial misspelled stem of a typo squatted package that was previously removed from PyPi, Codex would complete the suggestion. Similarly, Codex will suggest a typo squatted package if asked to use the package specifically. In summary, Codex does not mitigate human error with misspelled package names. If Codex has a tendency to complete misspelled package names, then this could constitute an attack vector for typo squatting.
我们调查了 Codex 模型是否会建议存在漏洞、恶意或拼写错误的软件依赖项,作为供应链攻击的一部分。例如,特定版本的 Python语言 包可能包含漏洞,这会使下游应用程序也变得脆弱。然而,Codex 通常无法建议特定版本的包,因为包版本是在 Codex 所知的提示上下文之外指定的 [20]。同样令人担忧的是 Codex 建议恶意或拼写错误的包的可能性 (Ohm et al., 2020)。通过测试,我们发现 Codex 建议一个存在漏洞或恶意包的总体可能性较低。但是,当用之前从 PyPi 移除的拼写错误包的初始错拼前缀提示时,Codex 会完成该建议。类似地,如果要求使用特定的拼写错误包,Codex 也会建议该拼写错误包。总之,Codex 并不能防止人类在包名拼写上的错误。如果 Codex 有完成错拼包名的倾向,那么这可能会构成拼写错误占位攻击的向量。
We explored whether Codex models would be suitable for generating phishing pretext. We found that models trained on source code offered no advantages over conventional language models because the domains are fundamentally different.21
我们探索了 Codex 模型是否适合生成网络钓鱼预文本。我们发现,由于领域根本不同,基于源代码训练的模型相比传统大语言模型没有任何优势[21]。
Because of the training process of pre-training and finetuning on public data, there is a natural trust boundary present in the training data, wherein an attacker could insert adversarial inputs that cause models to suggest vulnerable, malicious, or misaligned code. The pre-training and finetuning processes should generally be thought of as untrusted. This risk may increase as model capabilities and the interest of potential attackers increase.
由于在公共数据上进行了预训练和微调,训练数据中存在一个天然的信任边界,在此边界内,攻击者可以插入对抗性输入,导致模型建议易受攻击、恶意或错位的代码。预训练和微调过程通常应被视为不可信的。随着模型能力的提高以及潜在攻击者的兴趣增加,这种风险可能会增大。
Finally, the Codex model itself may suggest insecure or otherwise bad code. Examples include suggesting a compromised package as a dependency, invoking functions insecurely, or suggesting secrets found in the training data.22 If Codex models become widespread software infrastructure, this could constitute a new type of supply chain risk. We discuss this more in the next section.
最后,Codex 模型本身可能会建议不安全或不良的代码。例子包括建议受 compromises 包作为依赖项、不安全地调用函数或建议在训练数据中找到的秘密。如果 Codex 模型成为广泛的软件基础设施,这可能构成一种新的供应链风险。我们将在下一节中进一步讨论这个问题。
Beyond computer security, we also considered the possibility that code generation systems might provide actors with the ability to synthesize portions of highly complex safetycritical systems with offensive capabilities. We concluded that there is a low likelihood of Codex synthesizing standalone safety-critical systems due to a lack of system-level generation capabilities, as discussed in Appendix D. Codex models could also potentially accelerate some instances of machine learning development, which in turn could have downstream misuse implications. While again Codex does not appear capable of synthesizing highly complex systems, we have found it to be somewhat effective at generating boilerplate machine learning code that has a similar structure to code it has seen in its training set.
除了计算机安全之外,我们还考虑了代码生成系统可能为行为体提供合成具有攻击能力的高度复杂的安全关键系统部分的能力。我们得出结论,由于缺乏系统级生成能力,Codex 合成独立的安全关键系统的可能性较低,如附录 D 所述。Codex 模型还可能加速某些机器学习开发实例,这反过来可能会产生下游滥用影响。尽管 Codex 似乎无法合成高度复杂的系统,但我们发现它在生成与训练集中见过的代码结构相似的样板机器学习代码方面 somewhat 有效。
As with GPT-3, we discussed possible misuse scenarios with professional threat analysts and monitored forums for evidence of actors using language models to generate code to augment cybercrime operations. We observed enthusiasm for training models on code and projects focused on automating coding tasks, but no references to using language models for malware development. We noted that enthusiasm and projects were centered around freely-available language models. This highlights a need for robust monitoring and continued research to maintain situational awareness about how models like Codex are being used and misused.
与 GPT-3 一样,我们与专业的威胁分析师讨论了可能的滥用场景,并监控论坛以获取行为者使用语言模型生成代码以增强网络犯罪操作的证据。我们观察到对训练模型进行编码以及专注于自动化编码任务的项目的热情,但没有发现使用语言模型进行恶意软件开发的参考。我们注意到这种热情和项目主要集中在免费提供的语言模型上。这突显了需要进行强有力的监控和持续研究,以保持对像 Codex 这样的模型如何被使用和滥用的情况有所了解。
G.3. Insecure code generation
G.3. 不安全的代码生成
Similar to the alignment problems in Appendix E, a securityrelevant subclass of behaviors is the generation of insecure code. A priori, we might expect that Codex will sometimes produce insecure code because the pre-training and finetuning paradigm involves training on large quantities of untrusted data, which is known to contain insecure code. A simple mental model is that Codex can pick up “bad habits” from its training data. But what does this look like
类似于附录 E 中的对齐问题,一个与安全相关的子类行为是生成不安全代码。先验地,我们可能会预期 Codex 有时会产生不安全代码,因为预训练和微调范式涉及在大量不可信数据上进行训练,这些数据已知包含不安全代码。一个简单的思维模型是,Codex 可能会从其训练数据中养成“坏习惯”。但这具体表现为什么样子
in practice?23
在实践中?23
To study this phenomenon, we asked Codex to suggest code that would call cryptographic libraries to generate cryptographic contexts, and then evaluated whether any of these outputs were clearly insecure.24 When tested on a standard series of prompts asking the models to call functions to produce RSA keys or AES contexts,25 we find that Codex models of varying sizes frequently use clearly insecure config u rations (See Figure 15).
为了研究这一现象,我们要求 Codex 提供代码建议,这些代码将调用加密库以生成加密上下文,然后评估这些输出中是否有明显不安全的情况。24 在测试中,我们使用了一组标准提示,要求模型调用函数以生成 RSA 密钥或 AES 上下文,25 我们发现不同大小的 Codex 模型经常使用明显不安全的配置 (见图 15)。
Interestingly, we do not see a robust model size trend (over 1 order of magnitude of parameters) in this data. This suggests that insecure code production, at least in this case, is an alignment issue (see Appendix E): it is unclear if the models are improving with scale. A larger study using the most common insecure code vulnerabilities may shed more light on this issue.
有趣的是,在这些数据中,我们没有看到模型规模的显著趋势(超过一个数量级的参数)。这表明不安全代码的生成至少在这个案例中是一个对齐问题(见附录 E):不清楚模型是否随着规模的扩大而改进。使用最常见的不安全代码漏洞进行更大规模的研究可能会对此问题有更多启示。
H. Supplemental economic analysis
H. 补充经济分析
The economic and labor market implications of code generation are only beginning to emerge, and more analysis will be required to fully understand them. In this appendix, we outline some possible types of impacts that occur, but we emphasize that this analysis is highly preliminary: many uncertainties remain about the technological trajectory and economic adoption of code generation. We include this analysis primarily to motivate further related work rather than to suggest any strong conclusions, and we will highlight several promising directions for further exploration.
代码生成的经济和劳动力市场影响才刚刚开始显现,需要更多的分析才能完全理解这些影响。在本附录中,我们概述了一些可能发生的影响类型,但强调这一分析非常初步:关于代码生成的技术发展轨迹和经济采用仍存在许多不确定性。我们主要纳入此分析是为了激励进一步的相关研究,而不是提出任何强有力的结论,并将突出几个有前景的进一步探索方向。
Code generation could help create economic value by allowing engineers and programmers to write better code, write good code faster, and help with tasks like docstrings, documentation, tests, code reviews, etc. In turn, these impacts may change the work of engineers and programmers (people who directly write or read code for a living) as well as work more broadly by lowering the barrier to building software and enabling entirely new kinds of software to be built.
代码生成可以通过让工程师和程序员编写更好的代码、更快地编写优质代码,并帮助完成诸如函数注释、文档、测试、代码审查等任务来创造经济价值。反过来,这些影响可能会改变工程师和程序员(以编写或阅读代码为生的人)的工作,以及通过降低构建软件的门槛和使全新类型的软件得以开发来更广泛地改变工作。

Clearly insecure use of encryption keys by model size Figure 15. Clearly insecure encryption keys produced by Codex. When asked to create encryption keys, Codex models select clearly insecure configuration parameters in a significant fraction of cases. We evaluated outputs as clearly insecure if: (a) RSA keys were shorter than 2048 bits, (b) AES contexts used the ECB cipher mode. Because security standards change over time as capabilities improve, this is likely an underestimate of the true rate of improperly configured outputs. Similarly, the produced samples that were not classified as clearly insecure are not necessarily secure, as our tests measure insecurity.

模型大小对加密密钥的明显不安全使用 图 15. Codex 模型生成的明显不安全的加密密钥。当要求创建加密密钥时,Codex 模型在相当一部分情况下选择了明显不安全的配置参数。我们评估输出为明显不安全的标准是:(a) RSA 密钥长度短于 2048 位,(b) AES 上下文使用了 ECB 加密模式。由于随着能力的提高,安全标准会随时间变化,因此这可能是对配置不当输出真实比率的低估。同样地,未被分类为明显不安全的生成样本也不一定是安全的,因为我们的测试只衡量不安全性。
Codex is one of several existing tools to assist in code generation, which have varying economic implications. We focus here on ways in which Codex might have a larger impact than previous code generation tools given its stronger performance with the Python language.
Codex 是现有用于辅助代码生成的几种工具之一,这些工具具有不同的经济影响。我们在这里关注的是,鉴于 Codex 在 Python语言 方面的更强性能,它可能比以前的代码生成工具产生更大的影响。
H.1. Impacts on programmers and engineers
H.1. 对程序员和工程师的影响
At a coarse-grained level, by potentially increasing programmer and engineer productivity, Codex may somewhat reduce the overall cost of producing software. This effect may be limited by the fact that the production of software requires more tasks than writing code $\mathrm{{O}^{*}\mathrm{{NET}}}$ , 2021)–other important tasks include conferring with colleagues, writing design specs, and upgrading existing software stacks. Indeed, the Bureau of Labor Statistics (BLS) classifies computer programmers and software developers separately, where developers are more highly paid than programmers, have more tasks indirectly related to writing and interacting with code, and, in the US, are projected to see greater demand over the next 10 years (Li et al., 2020).
在粗粒度层面,通过可能提高程序员和工程师的生产力,Codex 可能会在一定程度上降低软件生产的总体成本。这种效应可能会受到限制,因为软件生产需要的任务不仅仅是编写代码($\mathrm{{O}^{*}\mathrm{{NET}}}$,2021)——其他重要任务包括与同事协商、编写设计规范以及升级现有的软件栈。实际上,劳工统计局 (BLS) 将计算机程序员和软件开发人员分开分类,其中开发人员的薪酬高于程序员,他们的任务更多是间接与编写和交互代码相关,并且在美国,预计在未来 10 年内对开发人员的需求将更大(Li et al., 2020)。
Additionally, one of the challenges of code generation stem from relying on the assumption that intent is captured sufficiently enough in comments and documentation to not compromise accuracy. This in turn implies some inherent overhead: framing comments and prompts precisely enough to extract the best behavior from the model and reviewing the code generated by the model. Thus, even if the model were perfectly accurate, we would not expect it to reduce the labor costs associated with writing code to zero. Furthermore, as with many tools that substitute investments in capital for investments in labor (or increase the productivity of labor) (Frey, 2019; Acemoglu & Restrepo, 2020a;b), more sophisticated future code generation tools could potentially contribute to the displacement of some programmer or engineer roles, and could change the nature of, and power dynamics involved in, programming work. However, they might instead simply make the work of some engineers more efficient, or, if used to produce larger amounts of sloppier code, they could create the illusion of increased efficiency while offloading the time spent writing code to more detailed code reviews and QA testing.
此外,代码生成的一个挑战在于依赖于假设注释和文档中充分捕捉了意图,从而不会影响准确性。这反过来意味着一些固有的开销:精确地编写注释和提示以从模型中提取最佳行为,并审查模型生成的代码。因此,即使模型完全准确,我们也无法期望它将编写代码相关的劳动成本降低为零。此外,与许多用资本投资替代劳动力投资(或提高劳动力生产力)的工具一样 (Frey, 2019; Acemoglu & Restrepo, 2020a;b),更复杂的未来代码生成工具可能会导致某些程序员或工程师角色的替代,并可能改变编程工作的性质及其中涉及的权力动态。然而,它们也可能只是使一些工程师的工作更加高效,或者,如果用于生成更多的粗糙代码,它们可能会制造出效率提升的假象,而实际上将编写代码的时间转移到更详细的代码审查和质量保证测试上。
At the same time, Codex may create new markets for work that complement changed workflows. After the release of GPT-3, a few companies began to include working with GPT-3 and writing prompts in job listings. And research shows that so-called prompt engineering can enable stronger results from AI systems (Zhao et al., 2021). Similarly, it is possible that models like Codex will lead to the emergence of new kinds of work for engineers who are skilled at working with such tools.
同时,Codex 可能会创造新的市场,为适应变化的工作流程提供补充。在 GPT-3 发布后,一些公司开始在职位列表中包含与 GPT-3 交互和编写提示的工作内容。研究表明,所谓的提示工程可以实现更强的 AI 系统效果 (Zhao et al., 2021)。类似地,像 Codex 这样的模型可能会导致出现新的工作类型,这些工作需要擅长使用此类工具的工程师。
Because of Codex’s performance on “coding challenge” like questions (as referenced in the APPS results), we expect strong performance on interview-style questions. This may encourage employers to reconsider the screening process for coding-related positions.
由于 Codex 在“编程挑战”类问题上的表现(如 APPS 结果所示),我们预计它在面试风格的问题上也会有出色的表现。这可能会鼓励雇主重新考虑编程相关职位的筛选过程。
H.2. Differential impacts among engineers
H.2. 工程师之间的差异影响
Certain kinds of code and roles may be more likely to be affected by the diffusion of code generation models than others. It is thus valuable to explore whether systematic patterns might be expected in who might win and lose from this class of technologies across demographic categories.
某些类型的代码和角色可能比其他类型更可能受到代码生成模型扩散的影响。因此,探索在这类技术中,不同人口统计类别的人群可能会有系统性的受益或受损模式是有价值的。
Given Codex’s performance on Python, we expect its impacts to be felt more strongly in roles where Python is the dominant programming language (future models might have different strength profiles).26 However, even if this were
鉴于 Codex 在 Python语言 上的表现,我们预计其影响将在 Python 为主导编程语言的领域中更为显著 (未来模型可能具有不同的优势特征) [26]。然而,即使如此
26There is unfortunately only limited research on the demographic distribution of Python users. Understanding this better could shed light on how the benefits and risks associated with Codex might be distributed across society. A 2020 survey of StackOverflow users (Stack Overflow, 2020) suggests that women are comparatively more represented in data science and analysis roles than in DevOps specialist, system administrator, and site reliability true, whether the effect is positive or negative may vary with how engineers and programmers learn to incorporate these tools into their workflows. One might think that those who work with programming languages that Codex excels at would have the most to lose in the event that tools built on top of these models substitute for human labor. However, such workers may alternatively have more to gain if those tools enhance their productivity and bargaining power. Relatedly, more companies might switch their codebases to programming languages where they know Codex could augment work.
不幸的是,关于 Python语言 用户的人口统计分布的研究非常有限。更好地理解这一点可以揭示 Codex 所带来的好处和风险在社会中的分布情况。2020 年对 StackOverflow 用户的一项调查 (Stack Overflow, 2020) 表明,女性在数据科学和分析角色中的代表性相对较高,而在 DevOps 专家、系统管理员和站点可靠性工程师中则较低。是否这种影响是积极的还是消极的可能取决于工程师和程序员如何将这些工具融入他们的工作流程。人们可能会认为,那些使用 Codex 擅长的编程语言工作的人员如果这些模型构建的工具替代了人力劳动,可能会损失最大。然而,如果这些工具提高了他们的生产力和议价能力,这些工作者反而可能会获得更多收益。与此相关的是,更多的公司可能会将其代码库切换到他们知道 Codex 可以增强工作效率的编程语言。
It is also important to note that use of Python is actively growing, in part because it is a dominant language used in educational contexts and because of its high readability factor. By increasing the amount that can be achieved with Python, Codex might make the engineering field more accessible to a wider variety of people, including those coming from a more diverse range of demographic backgrounds.
需要注意的是,Python语言的使用正在积极增长,部分原因是它在教育环境中占据主导地位,并且具有很高的可读性。通过提高使用 Python语言 可以实现的目标,Codex 可能会使工程领域对更广泛的人群更加accessible,包括来自更多样化人口背景的人。
(注:此处“accessible”根据上下文翻译为“可及”,以符合中文表达习惯)
H.3. Impacts on non-engineers
H.3. 对非工程师的影响
Code generation tools could also widen the base of people who are able to move into programming or shift the distribution of skills that new programmers need to learn (Xu et al., 2021). One mechanism through which this may happen is that Codex may make it easier to work with new codebases or new languages.
代码生成工具也可能扩大能够进入编程领域的人群基础,或改变新程序员需要学习的技能分布 (Xu et al., 2021)。这一现象的一个可能机制是,Codex 可能使处理新的代码库或新的语言变得更加容易。
Code generation models may also make it simpler to build tools that automate repetitive tasks in non-engineering roles.
代码生成模型也可能使构建自动化重复性任务的工具变得更加简单,这些任务出现在非工程角色中。
H.4. Effects of differential package import rates
H.4. 差异化包导入速率的影响
Within a code file, one often imports packages or programs written by third parties. Rather than constantly reinventing the wheel, software developers rely on functions, libraries and APIs for most code we might consider “boilerplate.” For any given task, though, there are multiple options: PyTorch or TensorFlow for machine learning, Matplotlib or Seaborn for data visualization, etc.
在一个代码文件中,通常会导入第三方编写的包或程序。软件开发者依赖于函数、库和 API 来处理大多数我们可能认为是“样板代码”的部分,而不是不断重复造轮子。然而,对于任何给定的任务,都有多种选择:例如,对于机器学习可以选择 PyTorch 或 TensorFlow,对于数据可视化可以选择 Matplotlib 或 Seaborn 等。
Codex imports subst it u table packages at different rates based on patterns in its training data, which can have various possible implications. Differential import rates by Codex might lead to subtle errors in cases where a certain import is ill-advised, increase robustness in cases where the alternative package imported by an individual would have been worse, and/or increase the dominance of an alreadyinfluential set of individuals and organizations in the software supply chain. Despite many packages being free, there are clear rewards for developers and firms that have high-use packages, and free packages can be wrappers for paid products. Thus, the patterns of importing in Codex and other code generation models could have substantial economic implications for those who build and maintain packages, as well as safety or security implications.27
Codex 以不同的速率导入可替代的软件包,这取决于其训练数据中的模式,可能会产生各种可能的影响。Codex 的不同导入速率可能导致在某些不建议使用特定导入的情况下出现细微错误,在个人选择的替代软件包更糟糕的情况下增加鲁棒性,和/或增加已经在软件供应链中具有影响力的个人和组织的主导地位。尽管许多软件包是免费的,但对于拥有高使用率软件包的开发者和公司来说,有明显的回报,并且免费软件包可能是付费产品的包装。因此,Codex 及其他代码生成模型 (Generative AI) 的导入模式可能对构建和维护软件包的人产生重大的经济影响,以及安全或安全性的后果。27
Many commonly used packages are fairly entrenched and there can be high switching costs. Using the same package as everyone else means one’s code will be more compatible (if one uses a package everyone knows they will inherently understand one’s use of it), more trustworthy (if one uses a package everyone already has installed they will not be afraid to install new things to run one’s code), and just generally work better with other code (if one uses a package everyone uses, others will be a lot more able to run one’s code out of the box or plug it into their package). A given package might be dominant because it is the best available standard in terms of speed, security, or accessibility. Most of these packages are not paid, so the associated costs are mostly in learning to use new packages and the different trade-offs and syntax.
许多常用的包已经相当根深蒂固,切换成本可能很高。使用与其他人相同的包意味着代码将更具兼容性(如果使用大家都熟悉的包,他们自然会理解代码的用法),更值得信赖(如果使用大家已安装的包,他们不会担心为了运行代码而安装新东西),并且通常与其他代码配合得更好(如果使用大家都使用的包,其他人将更容易直接运行代码或将代码集成到他们的包中)。某个包可能占据主导地位,因为它在速度、安全性和易用性方面是最好的可用标准。这些包中的大多数是免费的,因此相关的成本主要在于学习使用新包以及不同的权衡和语法。
The scale of these effects for Codex may be relatively low if users mostly import packages they know how to use or have done outside research on, so they can double-check anything the model does. Moreover, because packages are generally imported at the top of a file without any comments, the model has very little to go on in these cases, so users would most likely have to start typing out the name of the package they want to import rather than trusting the model to know they are starting a machine learning project and want to import either PyTorch or TensorFlow.
这些效果对 Codex 的影响可能相对较小,如果用户主要导入他们知道如何使用或已经进行过外部研究的软件包,那么他们可以检查模型所做的任何操作。此外,由于软件包通常在文件顶部导入且没有任何注释,因此在这种情况下,模型几乎没有线索可循,所以用户很可能需要开始输入他们想要导入的软件包名称,而不是依赖模型知道他们正在启动一个机器学习项目并希望导入 PyTorch 或 TensorFlow。
Dependence on code generation models’ import suggestions may grow over time as users adapt to working with such systems. As users learn how to “prompt engineer” with Codex, they may use the model as a decision-making tool or search engine. Where a user may have done an Internet search before for “which machine learning package to use” or “pros and cons of PyTorch vs. Tensorflow” they might now just type “# import machine learning package” and trust Codex to do the rest. Users might be more inclined to accept the Codex answer under the assumption that the package it suggests is the one with which Codex will be more helpful. As a result, certain players might become more entrenched in the package market and Codex might not be aware of new packages developed after the training data was originally gathered. Further, for already existing packages, the model may make suggestions for deprecated methods. This could increase open-source developers’ incentive to maintain backward compatibility, which could pose challenges given that open-source projects are often under-resourced (Eghbal, 2020; Trink en reich et al., 2021).
随着时间的推移,用户适应了与这些系统一起工作,对代码生成模型的导入建议依赖可能会增加。当用户学会如何使用 Codex 进行“提示工程”时,他们可能会将模型用作决策工具或搜索引擎。以前用户可能会在互联网上搜索“使用哪个机器学习包”或“PyTorch 与 Tensorflow 的优缺点”,现在他们可能只需输入“# import machine learning package”并信任 Codex 完成其余的工作。用户可能会更倾向于接受 Codex 的答案,假设它建议的包是 Codex 更有帮助的那个。因此,某些参与者可能会在包市场上更加根深蒂固,而 Codex 可能不了解训练数据收集后开发的新包。此外,对于已有的包,模型可能会建议已弃用的方法。这可能会增加开源开发者维护向后兼容性的动机,这可能会带来挑战,因为开源项目通常资源不足 (Eghbal, 2020; Trinkenreich et al., 2021)。
More work is needed to compare the prevalence of different packages in Codex outputs with the input data to understand how or if these biases are concentrated by training, as well as to understand the direct and indirect impacts of these biases.
需要更多的工作来比较 Codex 输出中不同包的流行程度与输入数据,以了解这些偏差是如何或是否通过训练被集中放大,以及了解这些偏差的直接影响和间接影响。
H.5. Future directions
H.5. 未来方向
Precise and accurate prediction of any impacts without user or market signal is difficult, but the potential implications on the long-run labor market and the possibility of disparate outcomes across groups warrant further exploration of these issues. It may be possible to assess the relative likelihood of different scenarios by building a deeper understanding of Codex’s capabilities across several code-related tasks or by studying the effects of precise deployment scenarios. We plan to support research measuring Codex’s particular impact as well as research on code generation and automation more generally.
在没有用户或市场信号的情况下,精确预测任何影响是困难的,但对长期劳动力市场的影响以及不同群体之间可能出现的不同结果值得进一步探讨这些问题。可以通过深入理解 Codex 在多个与代码相关任务中的能力,或通过研究精确部署场景的效果来评估不同情景的相对可能性。我们计划支持研究 Codex 的具体影响,以及更广泛地支持关于代码生成和自动化的研究。
We recommend future work focused on Codex models and other similar systems, with an eye towards positively influencing both the deployment of such technologies and any other necessary steps by key actors such as governments. Some areas which we are particularly interested in seeing research include:
我们建议未来的工作重点关注 Codex 模型和其他类似系统,旨在积极影响这些技术的部署以及其他关键行动者(如政府)所需的任何其他步骤。我们特别感兴趣的研究领域包括:
• Measuring the economic value of generating faster and/or better code. This can include tracking the downstream impacts of tools created with Codex, including those which may not have been possible to build previously (at all, or by specific individuals or teams).
• 衡量生成更快和/或更好代码的经济价值。这可以包括跟踪使用 Codex 创建的工具的下游影响,包括那些以前可能无法构建的工具(完全无法构建,或由特定个人或团队构建)。
• Measuring changes in code documentation practices and testing as a result of Codex. Codex may make it easier to keep code well-documented, but it may also propagate subtle errors in documentation that lead to bugs downstream. Similarly, Codex can help people write tests for code, which can dramatically improve software quality and the surface area for costly downstream bugs, but if engineers become overly reliant, they may not properly specify code. (Planning, 2002; Jones & Bonsignour, 2011).
• 评估 Codex 对代码文档实践和测试变化的影响。Codex 可能使代码保持良好文档记录变得更加容易,但也可能传播文档中的细微错误,从而导致下游的 bug。同样,Codex 可以帮助人们为代码编写测试,这可以显著提高软件质量并减少下游昂贵 bug 的发生范围,但如果工程师过度依赖它,他们可能无法正确指定代码。(Planning, 2002; Jones & Bonsignour, 2011).
• Measuring the impact on worker productivity, quality of life, and wages of improved code generation technologies. Most past studies of the impacts of code generation models consider performance on a closed set of tasks in a simulated environment (Xu et al., 2021). As the deployment of Codex and other near-term technologies proceeds, we may be able to conduct more robust experiments examining the impact of various strengths of models on real-world job performance, across teams and across firms.
• 测量改进的代码生成技术对工人生产力、生活质量以及工资的影响。大多数过去关于代码生成模型影响的研究考虑的是在模拟环境中封闭任务集上的表现 (Xu et al., 2021)。随着 Codex 和其他近期技术的部署,我们可能能够进行更稳健的实验,考察不同强度的模型对现实世界工作表现的影响,涵盖不同的团队和公司。
• Measuring the ability of Codex and other code generation models to reduce barriers to entry for the field. Such work could explore various ways in which the educational and career progression of programmers and engineers could be influenced by the availability of powerful code generation technologies.
• 测量 Codex 和其他代码生成模型降低该领域入门门槛的能力。此类研究可以探索强大的代码生成技术的可用性对程序员和工程师的教育和职业发展产生影响的各种方式。
More broadly, we believe the findings in this paper and future research on code generation might encourage researchers and policymakers to update their views regarding the potential for AI to have subst it uti ve effects on workers in various high-skill domains in the future. As capabilities improve, the effects of this class of technologies could be substantial and more study is needed both on the effects and on appropriate responses.
更广泛地说,我们认为本文的发现以及未来关于代码生成的研究可能会鼓励研究人员和政策制定者更新他们对未来 AI 在各种高技能领域对工人产生替代效应潜力的看法。随着能力的提高,这类技术的影响可能是巨大的,需要更多地研究这些影响以及适当的应对措施。
