Evaluating Large Language Models Trained on Code

评估训练于代码的大语言模型

Abstract

摘要

1. Introduction

1. 引言

We introduce Codex, a GPT language model finetuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correct- ness for synthesizing programs from docstrings, our model solves $28.8%$ of the problems, while GPT-3 solves $0%$ and GPT-J solves $11.4%$ . Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve $70.2%$ of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

我们介绍 Codex，一个在 GitHub 上公开可用的代码上微调的 GPT 语言模型，并研究其编写 Python 代码的能力。Codex 的一个独特生产版本为 GitHub Copilot 提供支持。在 HumanEval 上，我们发布了一个新的评估集，用于测量从文档字符串合成程序的功能正确性，我们的模型解决了 28.8% 的问题，而 GPT-3 解决了 0% 和 GPT-J 解决了 11.4% 。此外，我们发现从模型中重复采样是产生解决困难提示的有效工作解决方案的惊人策略。使用这种方法，我们通过每题 100 次采样解决了 70.2% 的问题。对我们的模型进行仔细研究揭示了其局限性，包括难以处理描述长操作链的文档字符串以及将操作绑定到变量的困难。最后，我们讨论了部署强大的代码生成技术可能带来的更广泛影响，涵盖了安全、安保和经济。

Figure 1:
图 1:

Table 1:
表 1:

（注意：原文中没有 Figure 或 Table 相关内容，因此这里仅展示格式示例）

Scalable sequence prediction models (Graves, 2014; Vaswani et al., 2017; Child et al., 2019) have become a general-purpose method for generation and representation learning in many domains, including natural language processing (Mikolov et al., 2013; Sutskever et al., 2014; Dai & Le, 2015; Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018), computer vision (Van Oord et al., 2016; Menick & Kalchbrenner, 2018; Chen et al., 2020; Bao et al., 2021), audio and speech processing (Oord et al., 2016; 2018; Dhariwal et al., 2020; Baevski et al., 2020), biology (Alley et al., 2019; Rives et al., 2021), and even across multiple modalities (Das et al., 2017; Lu et al., 2019; Ramesh et al., 2021; Zellers et al., 2021). More recently, language models have also fueled progress towards the longstanding challenge of program synthesis (Simon, 1963; Manna & Waldinger, 1971), spurred by the presence of code in large datasets (Husain et al., 2019; Gao et al., 2020) and the resulting programming capabilities of language models trained on these datasets (Wang & Komatsu zak i, 2021). Popular language modeling objectives like masked language modeling (Devlin et al., 2018) and span prediction (Raffel et al., 2020) have also been adapted to train their programming counterparts CodeBERT (Feng et al., 2020) and PyMT5 (Clement et al., 2020).

可扩展的序列预测模型 (Graves, 2014; Vaswani 等, 2017; Child 等, 2019) 已成为许多领域中生成和表示学习的通用方法，包括自然语言处理 (Mikolov 等, 2013; Sutskever 等, 2014; Dai & Le, 2015; Peters 等, 2018; Radford 等, 2018; Devlin 等, 2018)，计算机视觉 (Van Oord 等, 2016; Menick & Kalchbrenner, 2018; Chen 等, 2020; Bao 等, 2021)，音频和语音处理 (Oord 等, 2016; 2018; Dhariwal 等, 2020; Baevski 等, 2020)，生物学 (Alley 等, 2019; Rives 等, 2021)，甚至跨多个模态 (Das 等, 2017; Lu 等, 2019; Ramesh 等, 2021; Zellers 等, 2021)。最近，大语言模型还推动了 longstanding 挑战——程序合成 (Simon, 1963; Manna & Waldinger, 1971) 的进展，这得益于大型数据集中代码的存在 (Husain 等, 2019; Gao 等, 2020) 和在这些数据集上训练的大语言模型所具备的编程能力 (Wang & Komatsu zak i, 2021)。流行的建模目标如掩码语言建模 (Devlin 等, 2018) 和片段预测 (Raffel 等, 2020) 也被改编为训练其编程对应物 CodeBERT (Feng 等, 2020) 和 PyMT5 (Clement 等, 2020)。

Similarly, our early investigation of GPT-3 (Brown et al., 2020) revealed that it could generate simple programs from Python docstrings. While rudimentary, this capability was exciting because GPT-3 was not explicitly trained for code generation. Given the considerable success of large language models in other modalities and the abundance of publicly available code, we hypothesized that a specialized GPT model, called Codex, could excel at a variety of coding tasks. This paper describes several early Codex models, whose descendants power GitHub Copilot and the Codex models in the OpenAI API.

同样，我们对 GPT-3 (Brown 等，2020) 的早期研究揭示了它可以从 Python语言 docstrings 生成简单程序的能力。尽管这一能力还比较初级，但令人兴奋的是，GPT-3 并没有专门训练代码生成任务。鉴于大语言模型在其他模态上取得的显著成功以及大量公开可用的代码，我们假设一个名为 Codex 的专用 GPT 模型可以在各种编程任务中表现出色。本文描述了几个早期的 Codex 模型，这些模型的后继者为 GitHub Copilot 和 OpenAI API 中的 Codex 模型提供支持。

Codex and Codex-S Performance Figure 1. Pass rates of our models on the HumanEval dataset as a function of model size. When a single sample is generated for each problem, GPT-12B solves no problems, but Codex (fine-tuned on code) solves $28.8%$ of the problems, and Codex-S (further fine-tuned on correctly implemented standalone functions) solves $37.7%$ of the problems. From here, further gains can be realized by generating 100 samples per problem and selecting the sample with the highest mean log-probability $(44.5%$ solved) or by selecting the sample that passes the unit tests $.77.5%$ solved). All samples are generated with temperature 0.8.

Codex 和 Codex-S 性能图 1. 我们的模型在 HumanEval 数据集上的通过率与模型大小的关系。当为每个问题生成一个样本时，GPT-12B 没有解决问题，但 Codex (在代码上微调) 解决了 $28.8%$ 的问题，而 Codex-S (进一步在正确实现的独立函数上微调) 解决了 $37.7%$ 的问题。从这里开始，通过为每个问题生成 100 个样本并选择平均对数概率最高的样本 $(44.5%$ 解决)，或选择通过单元测试的样本 $(77.5%$ 解决)，可以进一步提高性能。所有样本均以温度 0.8 生成。

In this work, we focus on the task of generating standalone Python functions from docstrings, and evaluate the correctness of code samples automatically through unit tests. This is in contrast to natural language generation, where samples are typically evaluated by heuristics or by human evaluators. To accurately benchmark our model, we create a dataset of 164 original programming problems with unit tests. These problems assess language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. We release this data along with an evaluation framework at https://www.github.com/openai/human-eval.

在本工作中，我们专注于从文档字符串生成独立的 Python语言函数的任务，并通过单元测试自动评估代码样本的正确性。这与自然语言生成不同，后者通常通过启发式方法或人工评估员来评估样本。为了准确评估我们的模型，我们创建了一个包含 164 个原始编程问题的数据集，每个问题都配有单元测试。这些问题评估语言理解、算法和简单数学，其中一些类似于简单的软件面试题。我们发布了这个数据集以及评估框架，网址为 https://www.github.com/openai/human-eval。

To solve a problem in our test set, we generate multiple samples from the models, and check if any of them pass the unit tests. With just a single sample, a 12B parameter Codex solves $28.8%$ of these problems, and a 300M parameter Codex solves $13.2%$ of these problems. In contrast, the 6B parameter GPT-J (Wang & Komatsu zak i, 2021) achieves $11.4%$ on the same dataset, while all GPT models achieve near $0%$ . To improve our model’s performance at the task of function synthesis from docstrings, we fine-tune Codex on standalone, correctly implemented functions. The resulting model, Codex-S, solves $37.7%$ of problems with a single sample. Figure 2 showcases problems of varying difficulty in our dataset, along with correct model generated solutions.

为了解决我们测试集中的问题，我们从模型生成多个样本，并检查其中是否有任何样本通过单元测试。仅使用单个样本时，12B 参数的 Codex 解决了 $28.8%$ 的问题，而 300M 参数的 Codex 解决了 $13.2%$ 的问题。相比之下，6B 参数的 GPT-J (Wang & Komatsu zak i, 2021) 在同一数据集上达到了 $11.4%$ ，而所有 GPT 模型的表现接近 $0%$ 。为了提高模型在从文档字符串合成函数任务中的性能，我们在独立且正确实现的函数上微调 Codex。由此产生的模型 Codex-S 使用单个样本解决了 $37.7%$ 的问题。图 2 展示了我们数据集中不同难度的问题以及模型生成的正确解决方案。

Real-world programming tasks often involve iterations of approaches and bug fixes, which is approximated by generating many samples from our models and selecting one that passes all unit tests. Within 100 samples, Codex-S is able to generate at least one correct function for $77.5%$ of the problems. This result suggests that accurate code samples can be selected via heuristic ranking instead of fully evaluating each sample, the latter of which may not be possible or practical in deployment. Indeed, we find that the sample with highest mean log-probability passes unit tests for $44.5%$ of the problems.

现实世界的编程任务通常涉及方法的迭代和 bug 修复，这可以通过从我们的模型生成许多样本并选择一个通过所有单元测试的样本来进行近似。在 100 个样本中，Codex-S 能够为 $77.5%$ 的问题生成至少一个正确的函数。这一结果表明，可以通过启发式排名选择准确的代码样本，而无需完全评估每个样本，后者在部署时可能不可行或不切实际。实际上，我们发现具有最高平均对数概率的样本通过了 $44.5%$ 问题的单元测试。

We conclude by discussing the limitations and potential broader impacts of these Codex models and of increasingly powerful code generating models more generally.

我们通过讨论这些 Codex 模型以及更通用的 increasingly powerful 代码生成模型的局限性和潜在更广泛的影响来总结。

2. Evaluation Framework

2. 评估框架

In this section, we discuss the details of our evaluation framework. We begin by defining the $p a s s@k$ metric, and explain its advantages over standard match-based metrics. Next, we describe the dataset of hand-written problems, called “HumanEval,” which we created in order to benchmark our models. Finally, we discuss the sandbox environment we used to safely execute model-generated code.

在本节中，我们讨论评估框架的详细信息。我们首先定义 $p a s s@k$ 指标，并解释其相对于标准匹配指标的优势。接下来，我们描述了名为“HumanEval”的手写问题数据集，该数据集是我们为了基准测试模型而创建的。最后，我们讨论了用于安全执行模型生成代码的沙盒环境。

2.1. Functional Correctness

2.1. 功能正确性

Generative models for code are predominantly benchmarked by matching samples against a reference solution, where the match can be exact or fuzzy (as in BLEU score). However, recent work has surfaced deficiencies in match-based metrics for code. For instance, Ren et al. (2020) finds that BLEU has problems capturing semantic features specific to code, and suggests several semantic modifications to the score.

生成式模型（Generative models）对于代码的评估主要通过将样本与参考解决方案进行匹配，这种匹配可以是精确的或模糊的（如 BLEU 分数）。然而，最近的研究揭示了基于匹配的代码评估指标存在缺陷。例如，Ren 等人 (2020) 发现 BLEU 在捕捉代码特有的语义特征方面存在问题，并建议对分数进行若干语义修改 [20]。

More fundamentally, match-based metrics are unable to account for the large and complex space of programs functionally equivalent to a reference solution. As a consequence, recent works in unsupervised code translation (Lachaux et al., 2020) and pseudocode-to-code translation (Kulal et al., 2019) have turned to functional correctness instead, where a sample is considered correct if it passes a set of unit tests. We argue that this metric should be applied to docstringconditional code generation as well.

从根本上讲，基于匹配的度量标准无法考虑与参考解决方案功能等价的程序的大而复杂的空间。因此，最近关于无监督代码翻译 (Lachaux et al., 2020) 和伪代码到代码翻译 (Kulal et al., 2019) 的工作转向了功能正确性，其中如果样本通过了一组单元测试，则被认为是正确的。我们认为这个度量标准也应该应用于文档字符串条件下的代码生成。

Perhaps the most convincing reason to evaluate functional correctness is that it is used by human developers to judge code. A framework known as test-driven development dictates that software requirements be converted into test cases before any implementation begins, and success is defined by a program that passes these tests. While few organizations employ full test-driven development, integration of new code is usually dependent on creating and passing unit tests.

或许最有说服力的理由是评估功能正确性是因为它被人类开发者用来判断代码。一种称为测试驱动开发 (test-driven development) 的框架规定，在任何实现开始之前，软件需求应转换为测试用例，并且成功是由通过这些测试的程序定义的。虽然很少有组织采用完整的测试驱动开发，但新代码的集成通常依赖于创建并通过单元测试。

Kulal et al. (2019) evaluate functional correctness using the pass $@k$ metric, where $k$ code samples are generated per problem, a problem is considered solved if any sample

库拉尔等人 (2019) 使用 pass @$k$@ 指标评估功能正确性，其中每个问题生成 $k$ 个代码样本，如果任何样本通过测试，则认为该问题已解决。

图 1: 模型架构示例

在本节中，我们将介绍生成式 AI (Generative AI) 的最新进展。生成式 AI 是指能够创建新内容的 AI 技术，例如文本、图像和音乐。这些模型通常基于 Transformer 架构，并使用大量的 Token 进行训练。大语言模型 (LLM) 在处理自然语言任务方面表现出色，尤其是在零样本和少样本学习场景下。

我们还探讨了 AI智能体在不同环境中的应用，以及它们如何通过与环境交互来提高性能。通用人工智能 (AGI) 是一个长期目标，旨在开发出能够在各种任务上达到或超过人类水平的 AI 系统。

表 1: 不同模型的性能对比

模型名称	参数量	训练数据集	性能指标
Model A	1B	Dataset X	85%
Model B	10B	Dataset Y	90%

以上表格展示了几个典型模型的参数量、训练数据集及性能指标。可以看出，随着模型规模的增大，其性能也有所提升。然而，更大的模型并不总是意味着更好的结果，还需要考虑计算资源和实际应用场景等因素。

Figure 2. Three example problems from the HumanEval dataset, where the probabilities that a single sample from Codex-12B passes unit tests are 0.9, 0.17, and 0.005. The prompt provided to the model is shown with a white background, and a successful model-generated completion is shown in a yellow background. Though not a guarantee for problem novelty, all problems were hand-written and not pro grammatically copied from existing sources. Random problems and samples can be found in Appendix B.

图 2. 来自 HumanEval 数据集的三个示例问题，其中 Codex-12B 单个样本通过单元测试的概率分别为 0.9、0.17 和 0.005。提供给模型的提示显示在白色背景上，成功的模型生成完成显示在黄色背景上。虽然不能保证问题的新颖性，但所有问题都是手工编写的，并非从现有来源程序化复制而来。随机问题和样本可以在附录 B 中找到。

passes the unit tests, and the total fraction of problems solved is reported. However, computing pass $@k$ in this way can have high variance. Instead, to evaluate pass $@k$ , we generate $n;\geq;k$ samples per task (in this paper, we use $n,=,200$ and $k\le100]$ ), count the number of correct samples $c,\leq,n$ which pass unit tests, and calculate the unbiased estimator

通过单元测试，并报告解决问题的总比例。然而，以这种方式计算通过率 $@k$ 可能会有较高的方差。相反，为了评估通过率 $@k$ ，我们为每个任务生成 $n;\geq;k$ 个样本（在本文中，我们使用 $n,=,200$ 和 $k\le100$），统计通过单元测试的正确样本数 $c,\leq,n$，并计算无偏估计量

ef pass （1 C k) """ :param n: total number of samples :param c: number of correct samples :param k: k in pass@\$k\$ """ if $\mathrm{n-~}\mathrm{c}<\mathrm{k}$ : return 1.0 return 1.0 - np.prod(1.0 - k / np.arange $(\mathrm{n\textsubscript~{-}~~}\mathrm{c\textsubscript~~{+}~}1$ , n + 1))

如果  \$\mathrm{~n~-~}\mathrm{~c~}<\mathrm{~k~}\$  ：  
返回 1.0  

返回 1.0 - np.prod(1.0 - k / np.arange  \$(\mathrm{~n~\textsubscript~{~-~}~}\mathrm{~c~\textsubscript~{~+~}~}1\$  , n + 1))  

""" :param n: 总样本数 :param c: 正确样本数 :param k: pass@\\$k\\$ 中的 k """

$$
\mathrm{pass}\ @k:=\underset{\mathrm{Problems}}{\mathbb{E}}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right]
$$

$$
\mathrm{通过} @k := \underset{\mathrm{问题}}{\mathbb{E}}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right]
$$

Figure 3. A numerically stable script for calculating an unbiased estimate of pass $@k$ .

图 3. 一个数值稳定的脚本，用于计算无偏估计的通过率 $@k$ 。

Calculating this estimator directly results in very large numbers and numerical instability. In Figure 3, we include a numerically stable numpy implementation that simplifies the expression and evaluates the product term-by-term. One may be tempted to estimate pass $@k$ with $1-(1!-!\hat{p})^{k}$ where $\hat{p}$ is the empirical estimate of pass $@1$ , but we show that it is biased in Appendix A.

直接计算这个估计器会导致非常大的数值和数值不稳定。在图 3 中，我们包含了一个数值稳定的 numpy 实现，该实现简化了表达式并逐项评估乘积项。有人可能会尝试用 $1-(1!-!\hat{p})^{k}$ 来估计 pass $@k$ ，其中 $\hat{p}$ 是 pass $@1$ 的经验估计值，但我们证明了它存在偏差，详见附录 A。

Later, we provide evidence that BLEU score may not be a reliable indicator of functional correctness by showing that functionally in equivalent programs generated by our model (which are guaranteed to disagree with the reference solution on some input) often have higher BLEU scores than functionally equivalent ones.

后来，我们提供了证据表明 BLEU 分数可能不是功能正确性的可靠指标。通过展示我们模型生成的功能不等价程序（这些程序在某些输入上必定与参考解决方案不一致）通常具有比功能等价程序更高的 BLEU 分数，证明了这一点。

2.2. HumanEval: Hand-Written Evaluation Set

2.2. HumanEval: 手写评估集

We evaluate functional correctness on a set of 164 handwritten programming problems, which we call the HumanEval dataset. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem. It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources. For example, there are more than ten public repositories containing solutions to Codeforces problems, which make up part of the recently proposed APPS dataset (Hendrycks et al., 2021).

我们在一组 164 个手写编程问题上评估功能正确性，我们称之为 HumanEval 数据集。每个问题包括函数签名、文档字符串、函数体和若干单元测试，平均每题有 7.7 个测试。这些任务必须是手写的，这一点很重要，因为我们的模型是在 GitHub 的大量代码上训练的，而这些代码中已经包含了来自各种来源的问题解决方案。例如，有超过十个公共仓库包含 Codeforces 问题的解决方案，这些构成了最近提出的 APPS 数据集 (Hendrycks et al., 2021) 的一部分。

Programming tasks in the HumanEval dataset assess language comprehension, reasoning, algorithms, and simple mathematics. We release the HumanEval dataset so that others can evaluate functional correctness and measure the problem-solving capabilities of their models. The dataset can be found at https://www.github.com/openai/human-eval.

HumanEval 数据集中的编程任务评估语言理解、推理、算法和简单数学。我们发布了 HumanEval 数据集，以便其他人可以评估功能正确性并衡量其模型的解决问题的能力。该数据集可以在 https://www.github.com/openai/human-eval 获取。

2.3. Sandbox for Executing Generated Programs

2.3. 执行生成程序的沙盒 (Sandbox for Executing Generated Programs)

Since publicly available programs have unknown intent and generated programs are often incorrect, executing these programs poses a security risk. Indeed, GitHub is known to contain malicious programs that alter or change their environments (Rokon et al., 2020).

由于公开可用的程序意图不明，且生成的程序经常存在错误，执行这些程序会带来安全风险。确实，GitHub 已知包含恶意程序，这些程序会更改其环境 (Rokon et al., 2020)。

Therefore, we developed a sandbox environment to safely run untrusted programs against unit tests. Our goals were to prevent these programs from modifying, gaining persistence on, accessing sensitive resources on, or ex filtrating data from a host or network. Since OpenAI’s training infrastructure is built on Kubernetes and cloud services, we designed our sandbox to address the limitations of these environments while remaining idiomatic with their patterns of use.

因此，我们开发了一个沙盒环境，以安全地运行针对单元测试的不可信程序。我们的目标是防止这些程序修改、在主机或网络上获得持久性、访问敏感资源或外泄数据。由于 OpenAI 的训练基础设施建立在 Kubernetes 和云服务之上，我们设计的沙盒旨在解决这些环境的限制，同时保持与其使用模式的一致性。

We selected the gVisor container runtime (Lacasse, 2018) as the main host protection component. Since container runtimes like Docker can share host resources with containers, a malicious container could potentially compromise a host. gVisor protects the host by emulating its resources to introduce a security boundary between the host and its containers. Network-adjacent hosts and services are protected by eBPF-based firewall rules that prevent inbound and outbound connections except for those required for experiment control.

我们选择了 gVisor 容器运行时 (Lacasse, 2018) 作为主要的主机保护组件。由于像 Docker 这样的容器运行时可以与容器共享主机资源，恶意容器可能会危及主机的安全。gVisor 通过模拟主机资源，在主机和容器之间引入了一个安全边界来保护主机。网络相邻的主机和服务则通过基于 eBPF 的防火墙规则进行保护，这些规则防止了除实验控制所需的连接之外的所有入站和出站连接。

3. Code Fine-Tuning

3. 代码微调 (Code Fine-Tuning)

We fine-tune GPT models containing up to 12B parameters on code to produce Codex. In contrast with GPT, Codex displays non-trivial performance on the HumanEval dataset. In fact, Codex is able to solve the majority of the problems in HumanEval if we generate and evaluate 100 samples per problem, and pick one that passes unit tests. When limited to a budget of one evaluation per problem, producing multiple samples with Codex and choosing the one with the highest mean log-probability provides significant gains.

我们对包含多达 12B 参数的 GPT 模型进行微调以生成 Codex。与 GPT 不同，Codex 在 HumanEval 数据集上表现出非平凡的性能。实际上，如果为每个问题生成并评估 100 个样本，并选择一个通过单元测试的样本，Codex 能够解决 HumanEval 中的大部分问题。当限制为每个问题一次评估时，使用 Codex 生成多个样本并选择具有最高平均对数概率的样本可以显著提高性能。

3.1. Data Collection

3.1. 数据收集

Our training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB. We filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters. After filtering, our final dataset totaled 159 GB.

我们的训练数据集于 2020 年 5 月从 GitHub 上托管的 5400 万个公共软件仓库中收集，包含 179 GB 的唯一且小于 1 MB 的 Python 文件。我们过滤掉了可能是自动生成的文件、平均行长度超过 100 的文件、最大行长度超过 1000 的文件或包含少量字母数字字符的文件。过滤后，我们的最终数据集总计 159 GB。

3.2. Methods

3.2. 方法

Since Codex is evaluated on natural language prompts, we hypothesized that it would be beneficial to fine-tune from the GPT-3 (Brown et al., 2020) model family, which already contains strong natural language representations. Surprisingly, we did not observe improvements when starting from a pre-trained language model, possibly because the finetuning dataset is so large. Nevertheless, models fine-tuned from GPT converge more quickly, so we apply this strategy for all subsequent experiments.

由于 Codex 是在自然语言提示上进行评估的，我们假设从已经包含强大自然语言表示的 GPT-3 (Brown 等，2020) 模型家族进行微调会更有利。令人惊讶的是，当我们从预训练语言模型开始时，并没有观察到改进，这可能是因为微调数据集非常大。尽管如此，从 GPT 微调的模型收敛速度更快，因此我们在所有后续实验中都采用这种策略。

We train Codex using the same learning rate as the corresponding GPT model, with a 175 step linear warmup and cosine learning rate decay. We train for a total of 100 billion tokens, using the Adam optimizer with $\beta_{1}=0.9$ , $\beta_{2}=0.95$ , $\epsilon=10^{-8}$ , and a weight decay coefficient of 0.1.

我们使用与相应 GPT 模型相同的学习率训练 Codex，采用 175 步线性预热和余弦学习率衰减。总共训练 1000 亿个 Token，使用 Adam 优化器，参数为：$\beta_1=0.9$ ， $\beta_2=0.95$ ， $\epsilon=10^{-8}$ ，权重衰减系数为 0.1。

In order to maximally leverage text representations from GPT, we base our code lexer on the GPT-3 text tokenizer. Since the distribution of words in GitHub code differs from that of natural text, this tokenizer is not very effective for representing code. The largest source of inefficiency arises from encoding whitespace, so we add an additional set of tokens for representing whitespace runs of different lengths. This allows us to represent code using approximately $30%$ fewer tokens.

为了最大限度地利用 GPT 的文本表示，我们基于 GPT-3 文本分词器构建了代码分词器。由于 GitHub 代码中的单词分布与自然文本中的单词分布不同，这个分词器在表示代码方面效果不佳。最大的低效来源是编码空白字符，因此我们添加了一组额外的 Token 来表示不同长度的空白字符序列。这使我们可以使用大约 30% 更少的 Token 来表示代码。

To compute pass $@k$ , we assemble each HumanEval problem into a prompt consisting of a header, a signature, and a docstring, which is illustrated in Figure 2. We sample tokens from Codex until we encounter one of the following stop sequences: ‘\nclass’, ‘\ndef’, ‘\n#’, ‘\nif’, or ‘\nprint’, since the model will continue generating additional functions or statements otherwise. We use nucleus sampling (Holtzman et al., 2020) with top $p=0.95$ for all sampling evaluation in this work.

为了计算通过率 $@k$ ，我们将每个 HumanEval 问题组装成一个提示，该提示由标题、签名和文档字符串组成，如图 2 所示。我们从 Codex 中采样 Token，直到遇到以下停止序列之一：‘\nclass’，‘\ndef’，‘\n#’，‘\nif’ 或 ‘\nprint’，因为否则模型将继续生成额外的函数或语句。我们在所有采样评估中使用核采样 (Holtzman et al., 2020) 并设置顶部 $p=0.95$ 。

3.3. Results

3.3. 结果

In Figure 4, we plot test loss on a held-out validation set against Codex model size. We find that just as language model test loss follows a power law in model size (Kaplan et al., 2020), test loss after code fine-tuning follows a similar power law with functional form $\scriptstyle\Big(\frac{N}{5.92\times10^{7}}\Big)^{-0.13}$ where $N$ is the number of non-embedding parameters in the model.

图 4: 我们在保留的验证集上绘制 Codex 模型大小与测试损失的关系。我们发现，正如语言模型的测试损失随模型大小呈幂律分布 (Kaplan et al., 2020)，代码微调后的测试损失也遵循类似的幂律，其函数形式为 $\scriptstyle\Big(\frac{N}{5.92\times10^{7}}\Big)^{-0.13}$，其中 $N$ 是模型中非嵌入参数的数量。

Figure 4. Model cross-entropy test loss measured on a held-out split of our Python GitHub code corpus. The smooth power law scaling of performance with model size observed in GPT-3 appears to hold even after code fine-tuning.

图 4. 在我们 Python语言 GitHub 代码语料库的保留分割上测量的模型交叉熵测试损失。在 GPT-3 中观察到的性能与模型大小的平滑幂律缩放关系在代码微调后仍然成立。

When evaluating pass $@k$ , it is important to optimize sampling temperature for the particular value of $k$ . In Figure 5, we plot pass $@k$ against the number of samples $k$ and the sampling temperature. We find that higher temperatures are optimal for larger $k$ , because the resulting set of samples has higher diversity, and the metric rewards only whether the model generates any correct solution.

在评估 pass $@k$ 时，优化特定值的 $k$ 的采样温度非常重要。在图 5 中，我们绘制了 pass $@k$ 与样本数量 $k$ 和采样温度的关系图。我们发现，对于较大的 $k$ ，较高的温度是最佳的，因为生成的样本集具有更高的多样性，而该指标只奖励模型是否生成了任何正确解。

In particular, for a 679M parameter model, the optimal tem- perature for pass $@1$ is $T^{}=0.2$ and the optimal temperature for pass $@100$ is $T^{}=0.8$ . With these temperatures, we find that pass $@1$ and pass $@100$ scale smoothly as a function of model size (Figure 6).

特别是，对于一个 679M 参数的模型，$@1$ 的最优温度是 $T^{}=0.2$ ，而 $@100$ 的最优温度是 $T^{}=0.8$ 。通过这些温度，我们发现 $@1$ 和 $@100$ 随着模型大小的变化平滑缩放（图 6）。

Pass $@k$ can also be interpreted as the result of evaluating the best out of $k$ samples, where the best sample is picked by an oracle with prior knowledge of the unit tests. From a practical perspective, we are also interested in the setting where we must select a single sample from $k$ samples without having access to an oracle. For instance, when the model is used as an autocomplete tool where a user provides a prompt, we do not have unit tests, but would like to return only a single completion to the user for evaluation so as to not overwhelm them.

传递 $@k$ 也可以被解释为从 $k$ 个样本中评估出最佳结果，其中最佳样本由具有单元测试先验知识的 oracle 选择。从实际角度来看，我们也对必须在没有访问 oracle 的情况下从 $k$ 个样本中选择一个样本的设置感兴趣。例如，当模型用作自动完成功能时，用户提供一个提示，我们没有单元测试，但希望只返回一个完成结果给用户进行评估，以免让他们感到不知所措。

Inspired by similar work in language modeling, we find that choosing the sample with the highest mean token log probability outperforms evaluating a random sample, while choosing the sample based on sum log probability can perform slightly worse than picking randomly. Figure 7 demonstrates the benefits of applying these heuristics to samples (at temperature 0.8) from Codex-12B.

受类似的语言建模工作的启发，我们发现选择具有最高平均 Token 对数概率的样本比评估随机样本表现更好，而基于总对数概率选择样本的表现可能会略逊于随机选择。图 7: 展示了将这些启发式方法应用于 Codex-12B 的样本（在温度为 0.8 时）的好处。

Figure 5. In the top panel, we plot pass $@k$ against the number of samples $(k)$ for various temperature settings. Higher temperatures are better when the number of samples is large, likely due to the increased sample diversity. In the bottom panel, we plot the best temperature setting for each $k$ , obtained by taking the upper hull of the top panel.

图 5. 在上图中，我们绘制了通过率 $@k$ 与样本数量 $(k)$ 的关系图，针对不同的温度设置。较高的温度在样本数量较大时表现更好，这可能是由于样本多样性增加所致。在下图中，我们绘制了每个 $k$ 的最佳温度设置，这是通过对上图取上包络线获得的。

Pass Rate vs Model Size Figure 6. Using the optimal temperatures 0.2 and 0.8 for pass $@1$ and pass $@100$ , we plot these two metrics as a function of model size. Performance appears to scale smoothly as a sigmoid in logparameters.

通过率与模型大小的关系图 6. 使用通过 $@1$ 和通过 $@100$ 的最优温度 0.2 和 0.8，我们将这两个指标作为模型大小的函数进行绘制。性能表现似乎随着对数参数以 S 型曲线平稳增长。

Figure 7. Model performance in the setting where we can generate multiple samples, but only evaluate one. We can do better than randomly selecting a sample by choosing the solution with the highest mean log-probability (red) or with the highest back-translation score (orange) described in Sec. 5. The blue line represents the theoretical best performance obtained using an oracle with prior knowledge of the unit tests.

图 7. 在我们可以生成多个样本但只评估一个的设置下，模型的表现。我们可以通过选择具有最高平均对数概率（红色）或最高反向翻译得分（橙色）的样本，来获得比随机选择更好的结果，这在第 5 节中有描述。蓝色线条表示使用具有单元测试先验知识的神谕所获得的理论最佳表现。

Finally, we compute BLEU scores for all Codex-12B HumanEval samples (at temperature 0.8) against their reference solutions. For each problem, when we plot the distributions of BLEU scores for correct and incorrect solutions, we notice significant overlap (Figure 8). Since an incorrect solution is guaranteed to be functionally in equivalent to the reference solution, we conclude that improvements in BLEU score may not indicate improved rates of functional correctness in practice.

最后，我们计算所有 Codex-12B HumanEval 样本 (在温度 0.8 下) 的 BLEU 分数，并将其与参考解决方案进行对比。对于每个问题，当我们绘制正确和错误解决方案的 BLEU 分数分布时，我们注意到显著的重叠 (图 8)。由于错误的解决方案在功能上必定与参考解决方案不等价，我们得出结论：BLEU 分数的提高在实际中可能并不意味着功能正确性的改善。

3.4. 相关模型和系统的比较分析

Two recent works similar in spirit to Codex are GPT-Neo (Black et al., 2021) and GPT-J (Wang & Komatsu zak i, 2021), which are trained on The Pile (Gao et al., 2020), a dataset containing text from a variety of sources as well as $8%$ GitHub code. The broader research community has found that these models outperform existing GPT systems in qualitative programming evaluations (Woolf, 2021).

两项最近与 Codex 精神相似的工作是 GPT-Neo (Black et al., 2021) 和 GPT-J (Wang & Komatsu zak i, 2021)，它们是在 The Pile (Gao et al., 2020) 上训练的，该数据集包含来自各种来源的文本以及 $8%$ 的 GitHub 代码。更广泛的研究社区发现，这些模型在定性编程评估中优于现有的 GPT 系统 (Woolf, 2021)。

We confirm these findings using the HumanEval dataset, showing that GPT-Neo achieves $6.4%$ pass $@1$ and $21.3%$ pass $@100$ , while GPT models of comparable sizes achieve near $0%$ on both metrics. We see a remarkable progression in capabilities, with GPT-Neo-2.7B roughly equivalent to Codex-85M ( $30\times$ fewer parameters). Similarly, GPT-J-6B achieves $11.6%$ pass $@1$ and $27.7%$ pass $@100$ , which is roughly equivalent to Codex-300M ( $20\times$ fewer parameters). Pass rates are obtained by taking the best result from evaluating at temperatures 0.2, 0.4, and 0.8 for GPT-Neo, and from temperatures 0.2 and 0.8 for GPT-J. Detailed results across multiple model sizes can be found in Table 1.

我们使用 HumanEval 数据集确认了这些发现，显示 GPT-Neo 达到 6.4% pass @1 和 21.3% pass @100，而相同规模的 GPT 模型在这两个指标上接近 0%。我们观察到能力有显著进步，GPT-Neo-2.7B 大约相当于 Codex-85M (参数量少 30倍)。同样，GPT-J-6B 达到 11.6% pass @1 和 27.7% pass @100，大约相当于 Codex-300M (参数量少 20倍)。通过在温度为 0.2、0.4 和 0.8 下评估 GPT-Neo，以及在温度为 0.2 和 0.8 下评估 GPT-J，获得最佳结果的通过率。多个模型大小的详细结果见表 1。

Figure 8. BLEU score probability densities for correct (blue) and wrong (green) solutions from Codex-12B for 4 random tasks from HumanEval. Note that the distributions are not cleanly separable, suggesting that optimizing for BLEU score is not equivalent to optimizing for functional correctness.

图 8. Codex-12B 在 HumanEval 的 4 个随机任务中正确（蓝色）和错误（绿色）解决方案的 BLEU 分数概率密度。注意，这些分布并不能清晰地分离，这表明优化 BLEU 分数并不等同于优化功能正确性。

Finally, we benchmark Codex against the largest free model from Tabnine, a leading code autocomplete system, which achieves $2.6%$ pass $@1$ (at $T=0.4)$ ) and $7.6%$ pass $@100$ (at $T=0.8)$ ). This is roughly equivalent to Codex-12M, one of the smallest models in our suite.

最后，我们将 Codex 与来自 Tabnine 的最大免费模型进行基准测试，Tabnine 是领先的代码自动补全系统，该模型在 $T=0.4$ 时达到 $2.6%$ pass $@1$ ，在 $T=0.8$ 时达到 $7.6%$ pass $@100$ 。这大致相当于我们模型库中最小的模型之一 Codex-12M。

3.5. Results on the APPS Dataset

3.5. APPS 数据集上的结果

Recently, Hendrycks et al. (2021) introduced the APPS dataset to measure the coding challenge competence of language models. The APPS dataset consists of 5000 training and 5000 test examples of coding problems, each with a set of unit tests and, for the training data, a set of correct solutions. Most of the APPS tests problems are not formulated as single-function synthesis tasks, but rather as full-program synthesis, reading input from stdin and printing output to stdout, in contrast to the main Codex training data.

最近，Hendrycks 等人 (2021) 引入了 APPS 数据集来衡量语言模型的编程挑战能力。APPS 数据集包含 5000 个训练样本和 5000 个测试样本的编程问题，每个问题都附带一组单元测试，以及对于训练数据而言的一组正确解决方案。大多数 APPS 测试问题并不是以单函数合成任务的形式提出的，而是作为完整的程序合成，从 stdin 读取输入并打印输出到 stdout，这与主要的 Codex 训练数据不同。

In the paper that introduces APPS, the authors benchmark a few language models and report two metrics: the percentage of problems where the model finds a correct solution (called the “strict accuracy”) and the percentage of unit tests passed, even if the solution is incorrect. The latter measure is reported only so as to reduce variance of the measurements, because the results on the first metric were so low. We avoid this metric and only focus on “strict accuracy”, and - as in the previous sections - we report pass $@k$ numbers for various $k$ (Table 2). There are 2 additional factors, well-known from coding competitions, that we take into account:

在介绍 APPS 的论文中，作者对几个大语言模型进行了基准测试，并报告了两个指标：模型找到正确解决方案的问题百分比（称为“严格准确率”）和即使解决方案不正确也通过的单元测试百分比。后一个指标仅用于减少测量结果的方差，因为第一个指标的结果非常低。我们避免使用这个指标，只关注“严格准确率”，并如前几节所述，报告不同 $k$ 的 pass $@k$ 数字（表 2）。我们还考虑了两个来自编程竞赛中众所周知的因素：

表 2:

模型	严格准确率
模型 A	X%
模型 B	Y%

有两个来自编程竞赛中众所周知的因素，我们将其纳入考量：

Table 1. Codex, GPT-Neo, & TabNine evaluations for HumanEval. We find that GPT-J pass $@1$ is between Codex-85M and Codex300M performance.

表 1. Codex、GPT-Neo 和 TabNine 在 HumanEval 上的评估。我们发现 GPT-J 的通过率 $@1$ 位于 Codex-85M 和 Codex-300M 性能之间。

PASS@k	k=1	k=10 k =100
GPT-NEO125M	0.75%	1.88% 2.97%
GPT-NEO1.3B	4.79% 7.47%	16.30%
GPT-NEO2.7B	6.41% 11.27%	21.37%
GPT-J 6B	11.62% 15.74%	27.74%
TABNINE	2.58%	4.35% 7.59%
CODEX-12M	2.00%	3.62% 8.58%
CODEX-25M	3.21% 7.1%	12.89%
CODEX-42M	5.06% 8.8%	15.55%
CODEX-85M	8.22% 12.81%	22.4%
CODEX-300M	13.17% 20.37%	36.27%
CODEX-679M	16.22% 25.7%	40.95%
CODEX-2.5B	21.36% 35.42%	59.5%
CODEX-12B	28.81% 46.81%	72.31%

• In coding competitions and in the APPS datasets, tasks are provided with 3 input/output examples included in the task description. We utilize this by sampling 1000 solutions from the model and filtering out only those that pass these 3 unit tests (if such solutions exist). We then calculate pass rates in this filtered set, and call it filtered pass $@k$ . Results without filtering are presented as raw pass $@k$ .

• 在编程竞赛和 APPS 数据集中，任务提供了包含在任务描述中的 3 个输入/输出示例。我们通过从模型中采样 1000 个解决方案，并仅筛选通过这 3 个单元测试的解决方案（如果存在这样的解决方案）来利用这些示例。然后我们在这个过滤后的集合中计算通过率，并称其为过滤后通过率 $@k$ 。未经过滤的结果则表示为原始通过率 $@k$ 。

• It is often the case both in coding competitions and in the results from Codex that a correct solution is found, but it is not algorithmic ally efficient enough to be considered passing. While this is not acceptable in the competitions, we also report the number of solutions that Codex produces that do not fail on any unit test, but that do time-out on some of them. We use a timeout of 3 seconds in our evaluation.

• 在编程竞赛和 Codex 的结果中，常常会找到正确的解决方案，但这些方案在算法效率上不足以被视为通过。虽然在竞赛中这是不可接受的，但我们还报告了 Codex 生成的那些在任何单元测试中都没有失败、但在某些测试中会超时的解决方案数量。我们在评估中使用了 3 秒的超时时间。

To compensate for the fact the Codex is not fine-tuned on APPS, we append a single input/output example from the task description to the docstring as a formatting hint. We denote this setting as “1-shot” in Table 2, and find that Codex12B evaluated 1-shot achieves comparable performance to a GPT-Neo model fine-tuned on APPS. Consistent with our earlier findings, there are large benefits from generating and evaluating as many as 1000 samples per task, though for more difficult problems, solutions are often not efficient enough to pass the time limits. Finally, evaluating the first sample which passes the 3 public unit tests for each problem yields higher performance than raw pass $@100$ samples.

为了弥补 Codex 没有在 APPS 上进行微调的事实，我们在文档字符串中附加了一个来自任务描述的输入/输出示例作为格式提示。我们将这种设置标记为表 2 中的“1-shot”，并发现 Codex12B 在 1-shot 设置下评估时的表现与在 APPS 上微调的 GPT-Neo 模型相当。与我们之前的发现一致，生成和评估每个任务多达 1000 个样本可以带来显著的好处，但对于更困难的问题，解决方案通常不够高效，无法通过时间限制。最后，评估每个问题中第一个通过 3 个公共单元测试的样本比原始通过 $@100$ 样本的表现更好。

4. Supervised Fine-Tuning

4. 监督微调

In addition to standalone functions, Python code found on GitHub contains class implementations, configuration files, scripts, and even files used to store data. This code is seemingly unrelated to synthesizing functions from docstrings, and we hypothesize that the distribution mismatch reduces HumanEval performance.

除了独立函数外，GitHub 上的 Python语言代码还包含类实现、配置文件、脚本，甚至用于存储数据的文件。这些代码看似与从文档字符串合成函数无关，我们假设这种分布不匹配会降低 HumanEval 性能。

In order to adapt Codex to the distribution of the task of interest, we construct a set of training problems from correctly implemented standalone functions, and use them for additional supervised fine-tuning. We describe two approaches for collecting these examples: from competitive programming websites and from repositories with continuous integration. We call the supervised fine-tuned models Codex-S, and show that they produce consistent gains across model size.

为了使 Codex 适应感兴趣任务的分布，我们从正确实现的独立函数构建了一组训练问题，并用它们进行额外的监督微调。我们描述了收集这些示例的两种方法：来自竞争编程网站和来自具有持续集成的代码库。我们将监督微调后的模型称为 Codex-S，并展示了它们在不同模型规模上都能产生一致的改进。

4.1. Problems from Competitive Programming

4.1. 竞赛编程中的问题

Programming contest and interview preparation websites use hidden unit tests to automatically judge the functional correctness of submissions. These problems are selfcontained, come with well-written problem statements, and generally have excellent test coverage. Additionally, these problems test algorithmic reasoning over a broad range of core skills and difficulties.

编程竞赛和面试准备网站使用隐藏的单元测试来自动判断提交内容的功能正确性。这些问题自包含，配有编写良好的问题描述，并且通常具有出色的测试覆盖率。此外，这些问题在广泛的核⼼技能和难度范围内测试算法推理能力。

We collected problem statements, function signatures, and solutions from several popular programming contest and interview preparation websites. We then assembled these into programming tasks similar to HumanEval, using the problem description as the docstring. Since complete test suites are often hidden, we created unit tests from examples found in the problem statements, or extracted additional test cases through submitting incorrect solutions. In total, we curated 10,000 problems in this way.

我们从几个流行的编程竞赛和面试准备网站收集了问题陈述、函数签名和解决方案。然后，我们将这些内容组装成类似于 HumanEval 的编程任务，使用问题描述作为文档字符串。由于完整的测试套件通常被隐藏，我们根据问题陈述中的示例创建了单元测试，或者通过提交错误的解决方案提取了额外的测试用例。总共，我们以这种方式策划了 10,000 个问题。

4.2. Problems from Continuous Integration

4.2. 持续集成中的问题

Next, we curated programming problems from open source projects. Taking advantage of sys.setprofile, we were able to trace and collect inputs and outputs for all functions called during integration tests. This data could then be used to create unit tests for the functions.

接下来，我们从开源项目中整理了编程问题。利用 sys.setprofile，我们能够跟踪并收集集成测试期间调用的所有函数的输入和输出。这些数据可以用于为函数创建单元测试。

Projects that employ continuous integration (CI) are ideal candidates for tracing. We follow the commands in the CI configuration files, which contain build and test commands, to set up the virtual environments, install dependencies, and run integration tests.

采用持续集成 (CI) 的项目是跟踪的理想候选对象。我们遵循 CI 配置文件中的命令，这些命令包含构建和测试指令，用于设置虚拟环境、安装依赖项并运行集成测试。

We considered GitHub repos using travis and tox as their CI frameworks, as they are two of the most popular CI tools. We additionally used publicly available source code from pip packages found in the python package index (PyPI).

我们考虑了使用 travis 和 tox 作为其 CI 框架的 GitHub 仓库，因为它们是两个最受欢迎的 CI 工具。我们还使用了来自 Python 包索引 (PyPI) 中 pip 包的公开可用源代码。

Table 2. Finetuned GPT-Neo numbers from the APPS paper referenced above. For Codex-12B, the number of passing programs that timeout on some test is in the bracket. We used temperature 0.6 for sampling to cover all $k$ in pass $@k$ , so raw pass $@1$ results could be improved with lower temperature.

表 2. 来自上述 APPS 论文的微调 GPT-Neo 数据。对于 Codex-12B，通过但在某些测试中超时的程序数量在括号内。我们使用温度 0.6 进行采样以覆盖所有 $k$ 在 pass $@k$ 中，因此 raw pass $@1$ 的结果可以通过更低的温度得到改善。

	入门级	面试级	竞赛级
GPT-NEO2.7B 原始通过率 @1	3.90%	0.57%	0.00%
GPT-NEO2.7B 原始通过率 @5	5.50%	0.80%	0.00%
1-Shot Codex 原始通过率 @1	4.14% (4.33%)	0.14% (0.30%)	0.02% (0.03%)
1-Shot Codex 原始通过率 @5	9.65% (10.05%)	0.51% (1.02%)	0.09% (0.16%)
1-Shot Codex 原始通过率 @100	20.20% (21.57%)	2.04% (3.99%)	1.05% (1.73%)
1-Shot Codex 原始通过率 @1000	25.02% (27.77%)	3.70% (7.94%)	3.23% (5.85%)
1-Shot Codex 筛选后通过率 @1	22.78% (25.10%)	2.64% (5.78%)	3.04% (5.25%)
1-Shot Codex 筛选后通过率 @5	24.52% (27.15%)	3.23% (7.13%)	3.08% (5.53%)

Because these projects contained untrusted code, it was important to run integration tests in the sandboxed environment described above.

因为这些项目包含不可信的代码，所以在上述沙盒环境中运行集成测试非常重要。

While there are millions of potential functions to curate problems from, we only collected about 40,000 because not all functions accept inputs and return outputs. Even when they do, most objects captured at runtime cannot be pickled and restored outside the sandbox unless the project was installed.

虽然有数百万个潜在函数可以用来生成问题，但我们只收集了大约 40,000 个，因为并不是所有函数都接受输入并返回输出。即使它们做到了，大多数在运行时捕获的对象除非项目已安装，否则无法在沙箱外部进行序列化和恢复。

Since our tracing methodology produced inputs and outputs for all invoked functions, even builtin and library calls imported by the project were turned into problems. For this reason, functions from tracing tended to be the building blocks of command-line utilities. To excel at these tasks, the model does not need to know advanced algorithms and data structures. Rather, it needs to be able to follow instructions to implement the functionality specified in the docstring. Thus, tracing complements the puzzle nature of coding competition problems and broadens the distribution of tasks.

由于我们的跟踪方法生成了所有调用函数的输入和输出，即使是项目导入的内置和库函数调用也被转化为问题。因此，来自跟踪的函数往往是命令行工具的构建块。为了在这些任务中表现出色，模型不需要掌握高级算法和数据结构。相反，它需要能够按照指示实现文档字符串中指定的功能。因此，跟踪补充了编程竞赛问题的谜题性质，并拓宽了任务的分布。

4.3. Filtering Problems

4.3. 过滤问题

In the previous sections, we presented two methods we used to automatically create training problems. However, it is unclear how to control for quality. Some prompts under specify the function that is implemented, in which case a perfectly valid solution may be wrongly penalized by the unit test. Some problems are stateful, and subsequent executions can result in different outcomes.

在前面的部分中，我们介绍了两种用于自动创建训练问题的方法。然而，如何控制质量尚不清楚。某些提示未能充分指定实现的功能，在这种情况下，完全有效的解决方案可能会被单元测试错误地惩罚。一些问题是状态相关的，后续执行可能导致不同的结果。

To address these issues, we use Codex-12B to generate 100 samples per curated problem. If no samples pass the unit tests, we consider the task to be either ambiguous or too difficult, and filter it out. We reran this verification several times to remove stateful or non-deterministic problems.

为了解决这些问题，我们使用 Codex-12B 为每个精心设计的问题生成 100 个样本。如果没有任何样本通过单元测试，我们认为任务要么是模糊的，要么是太难了，并将其过滤掉。我们多次重新运行此验证以移除有状态或非确定性问题。

4.4. Methods

4.4. 方法

We fine-tune Codex on these training problems to produce a set of “supervised fine-tuned” models, which we call CodexS. To produce examples from training problems, we assemble the problems into the format shown in Figure 2. If there are prompts of varying length in a batch, we left-pad shorter prompts to the length of the longest prompt, so that the first tokens in the reference solutions line up in context.

我们对 Codex 进行微调，以解决这些训练问题，并生成一组“监督微调”模型，我们称之为 CodexS。为了从训练问题中生成示例，我们将问题整理为图 2 所示的格式。如果一批次中有长度不同的提示，我们会用较短的提示左填充到最长提示的长度，以便参考解决方案中的第一个 Token 在上下文中对齐。

图 2:

We train to minimize negative log-likelihood of the reference solution, and mask out loss for any tokens in the prompt. We train using a learning rate $1/10$ as large as used for fine-tuning Codex, but adhere to the same learning rate schedule, and train until validation loss plateaus (less than 10B tokens).

我们训练以最小化参考解决方案的负对数似然，并屏蔽提示中任何 Token 的损失。我们使用的学习率是 Codex 微调时的 $1/10$，但遵循相同的学习率计划，并训练直到验证损失趋于平稳（少于 10B 个 Token）。

4.5. Results

As with Codex, we first compute the optimal temperature for evaluating pass $@k$ for $1\le k\le100$ . We find that Codex-S prefers slightly higher temperatures for all $k>1$ , which possibly reflects the fact that Codex-S captures a narrower distribution than Codex. We use $T^{},=,0$ for computing pass $@1$ and $T^{}=1$ for computing pass $@100$ .

与 Codex 一样，我们首先计算评估通过 $@k$ 的最优温度，对于 $1\le k\le100$ 。我们发现 Codex-S 对所有 $k>1$ 都偏好稍高的温度，这可能反映了 Codex-S 捕捉到的分布比 Codex 更窄。我们使用 $T^{},=,0$ 来计算通过 $@1$ ，并使用 $T^{}=1$ 来计算通过 $@100$ 。

Next, we compare Codex-S against Codex on pass $@1$ and pass $@100$ . Codex-S outperforms the corresponding Codex by an average margin of 6.5 percentage points on pass $@1$ and by a larger average margin of 15.1 percentage points on pass $@100$ across model size.

接下来，我们将 Codex-S 与 Codex 在通过率 $@1$ 和通过率 $@100$ 上进行比较。Codex-S 在通过率 $@1$ 上平均优于对应的 Codex 6.5 个百分点，在通过率 $@100$ 上则以更大的平均优势 15.1 个百分点胜出，这一结果在不同模型尺寸上均成立。

We also plot the performance of different sample selection heuristics for Codex-S-12B against the same heuristics for Codex-12B. When ranking between 1 and 100 samples by mean log probability, the average benefit over random ranking is 11.6 percentage points, which is over 2 percentage points higher than the corresponding benefit for Codex.

我们还绘制了 Codex-S-12B 与 Codex-12B 在不同样本选择启发式方法下的性能对比图。当根据平均对数概率对 1 到 100 个样本进行排名时，相对于随机排名的平均收益为 11.6 个百分点，这比 Codex 相应的收益高出超过 2 个百分点。

Figure 9. Optimal sampling temperatures as a function of the number of samples generated for both Codex and Codex-S. Codex-S generally requires a higher temperature for any particular value of $k$ , possibly to compensate for the fact that it models a narrower distribution.

图 9: Codex 和 Codex-S 的最优采样温度与生成样本数量的关系。Codex-S 在任何特定的 $k$ 值下通常需要更高的温度，这可能是为了补偿其分布较窄的事实。

Codex-S Pass Rate vs Model Size Codex-S Ranking Heuristics

Codex-S 通过率与模型大小 Codex-S 排名启发式方法

Figure 10. Comparing Codex-S against Codex on the metrics proposed in Section 3. Codex-S is one or two orders of magnitude more parameter efficient on pass $@1$ and pass $@100$ , and log-prob sample ranking with Codex-S yields similar benefits over random sampling that Codex does.

图 10. 比较 Codex-S 与 Codex 在第 3 节提出的指标上的表现。Codex-S 在 pass @1 和 pass @100 上的参数效率比 Codex 高一个或两个数量级，且使用 Codex-S 的对数概率样本排序与随机抽样相比具有类似的优点，这与 Codex 的表现相似。

5. Docstring Generation

5. 文档字符串生成 (Docstring Generation)

Generating code from docstrings is possible with Codex because code typically follows after a docstring, but it is not easy to induce Codex to generate docstrings from code. Nevertheless, we are motivated to produce a docstring writing model for safety reasons, as such a model can be used to describe the intent behind generated code. Using the training problems described in the previous section, we can easily create a training dataset for code-conditional docstring generation.

从文档字符串生成代码是可能的，因为代码通常会跟随在文档字符串之后，但诱导 Codex 从代码生成文档字符串并不容易。然而，出于安全原因，我们有动力创建一个文档字符串写作模型，因为这种模型可以用来描述生成代码的意图。使用上一节中描述的训练问题，我们可以轻松创建一个用于代码条件文档字符串生成的训练数据集。

Specifically, for each training problem, we assemble a training example by concatenating the function signature, the reference solution, and then the docstring. Just as we train Codex-S by minimizing negative log-likelihood of the reference solution, we train the docstring generating models Codex-D by minimizing negative log-likelihood of the docstring.

具体来说，对于每个训练问题，我们通过连接函数签名、参考解决方案和文档字符串来组装一个训练示例。就像我们通过最小化参考解决方案的负对数似然来训练 Codex-S 一样，我们通过最小化文档字符串的负对数似然来训练生成文档字符串的模型 Codex-D。

When we benchmark our code generation models, we measure pass $@k$ on the HumanEval dataset, where correctness is defined by passing a set of unit tests. However, there is no similar way to evaluate docstring samples automatically. Therefore, we grade sample docstrings by hand, considering a docstring correct if it uniquely and accurately specifies the code body. Due to the time consuming nature of this process, we only grade 10 samples per problem, for a total of 1640 problems, from Codex-D-12B at temperature 0.8.

当我们评估我们的代码生成模型时，我们测量在 HumanEval 数据集上的通过率 $@k$ ，其中正确性由通过一组单元测试定义。然而，没有类似的方法可以自动评估文档字符串样本。因此，我们手动评估样本文档字符串，认为一个文档字符串是正确的，如果它唯一且准确地指定了代码主体。由于这个过程耗时，我们每个问题只评估 10 个样本，总计 1640 个问题，来自 Codex-D-12B，在温度为 0.8 的情况下。

Codex-D often generates incorrect unit tests along with a docstring, but we ignore these during grading. However, we do not consider the docstring correct when the model simply copies the code body into the docstring. The most common failure modes we observe are when the docstring model leaves out an important detail (such as “an answer must be to two decimal places”) or when it over-conditions on the function name and invents a problem unrelated to the function body.

Codex-D 生成的单元测试通常不正确，并且会附带一个文档字符串，但在评分时我们会忽略这些。然而，当模型简单地将代码主体复制到文档字符串中时，我们不认为该文档字符串是正确的。我们观察到最常见的失败模式是文档字符串模型遗漏了重要细节（例如“答案必须保留两位小数”）或过度依赖函数名称并虚构了一个与函数主体无关的问题。

As shown in Table 3, pass rates for Codex-D are lower but comparable to the corresponding pass rates for Codex-S at the same temperature. We do not have a strong hypothesis for which direction should yield higher pass rates. While generating docstrings may be more forgiving because natural language syntax is less strict than code syntax, docstrings in our dataset may be lower quality because developers tend to devote less time to writing docstrings. Indeed, our model produces docstrings like “I just found this function online” and “This test is not correctly written and it’s not my solution.”

表 3: Codex-D 的通过率较低但与相同温度下的 Codex-S 的相应通过率相当。我们没有关于哪个方向应产生更高通过率的强烈假设。虽然生成文档字符串可能更为宽容，因为自然语言语法比代码语法宽松，但我们数据集中的文档字符串质量可能较低，因为开发人员倾向于花费较少的时间编写文档字符串。确实，我们的模型生成了如 “I just found this function online” 和 “This test is not correctly written and it’s not my solution.” 这样的文档字符串。

Finally, with a docstring model, we have yet another way to choose a single sample from a set of $k$ samples. Instead of picking the sample with the best mean log probability as investigated in the previous two sections, we can choose the sample that maximizes the back-translation ob

最后，使用文档字符串模型，我们有另一种从一组 $k$ 个样本中选择单个样本的方法。不同于前两节研究的挑选具有最佳平均对数概率的样本，我们可以选择使回译 (back-translation) 最大化的样本。

Table 3. Pass rates for our docstring generating model Codex-D, which is evaluated by hand-grading 10 samples per task due to the lack of a ground-truth automatic evaluation. We find similar but lower pass-rates compared to Codex-S.

表 3: 我们的文档字符串生成模型 Codex-D 的通过率，该模型通过人工评分每项任务 10 个样本进行评估，由于缺乏基准自动评估。我们发现与 Codex-S 相比，通过率相似但较低。

模型	PASS @ 1	PASS @ 10
CODEX-S-12B	32.2%	59.5%
CODEX-D-12B	20.3%	46.5%

jective $P$ (ground truth docstring|generated sample) where $P$ is evaluated using Codex-D. Unfortunately, in Figure 7, we show that ranking samples via back-translation underperforms mean log-probability ranking, though it outperforms random ranking. This heuristic also appears to overfit quickly.

目标 $P$ （原始 docstring | 生成样本）其中 $P$ 使用 Codex-D 进行评估。遗憾的是，在图 7 中，我们显示通过反向翻译对样本进行排序的表现不如平均对数概率排序，尽管它优于随机排序。这种启发式方法似乎也很快过拟合。

6. Limitations

6. 局限性

While Codex is able to sample correct solutions for the majority of HumanEval problems, we find that it has a number of limitations.

虽然 Codex 能够为大多数 HumanEval 问题采样出正确的解决方案，但我们发现它存在一些局限性。

First, Codex is not sample efficient to train. Our training dataset comprises a significant fraction of publicly available Python code on GitHub, totaling hundreds of millions of lines of code. Even seasoned developers do not encounter anywhere near this amount of code over their careers. Indeed, a strong student who completes an introductory computer science course is expected to be able to solve a larger fraction of problems than Codex-12B.

首先，Codex 在训练时样本效率不高。我们的训练数据集包含 GitHub 上公开可用的大量 Python语言代码，总计数亿行代码。即使是有经验的开发人员在其职业生涯中也不会遇到如此多的代码。事实上，完成入门级计算机科学课程的优秀学生应该能够解决比 Codex-12B 更大比例的问题。

Next, we explore prompts on which Codex is likely to fail or display counter-intuitive behavior. While evaluating code generation is well-studied (Xu et al., 2021; Helmuth & Spector, 2015; Pantridge et al., 2017), many existing metrics measure performance in tightly specified, constrained problem instances (e.g., string manipulation in FlashFill (Gulwani, 2011)). Therefore, we developed a set of qualitative metrics for measuring the capabilities of code generating models while controlling for the complexity and abstraction level of the specifications (Appendix D). Applying this framework, we find that Codex can recommend syntactically incorrect or undefined code, and can invoke functions, variables, and attributes that are undefined or outside the scope of the codebase. Moreover, Codex struggles to parse through increasingly long and higher-level or system-level specifications.

接下来，我们探讨 Codex 可能会失败或表现出反直觉行为的提示。虽然代码生成的评估已经得到了广泛研究 (Xu et al., 2021; Helmuth & Spector, 2015; Pantridge et al., 2017)，但许多现有的度量标准是在严格规定、受限的问题实例中衡量性能（例如，在 FlashFill 中的字符串操作 (Gulwani, 2011)）。因此，我们开发了一套定性质的度量标准，用于在控制规范的复杂性和抽象级别的情况下测量代码生成模型的能力（附录 D）。应用这一框架，我们发现 Codex 可以推荐语法不正确或未定义的代码，并可以调用未定义或超出代码库范围的函数、变量和属性。此外，Codex 在解析越来越长和更高层次或系统级别的规范时遇到困难。

To concretely illustrate model performance degradation as docstring length increases, we create a dataset of synthetic problems assembled from 13 basic building blocks, each of which modifies an input string in a deterministic way. Example building blocks are “convert the string to lowercase” or “remove every third character from the string” (the full list is described in Appendix C). We find that as the number of chained building blocks in the docstring increases, model performance decreases exponentially. This behavior is uncharacteristic of a human programmer, who should be able to correctly implement a program for a chain of arbitrary length if they can do so for a chain of length two.

为了具体说明随着文档字符串长度增加模型性能的下降，我们创建了一个由 13 个基本构建块组成的人工问题数据集，每个构建块都以确定性方式修改输入字符串。示例构建块包括“将字符串转换为小写”或“从字符串中删除每第三个字符”（完整列表在附录 C 中描述）。我们发现，随着文档字符串中串联的构建块数量增加，模型性能呈指数级下降。这种行为与人类程序员不符，如果人类程序员能够正确实现长度为二的链式程序，那么他们也应该能够正确实现任意长度的链式程序。

Synthetic Pass Rate vs Components Figure 11. Pass rates of Codex-12B samples against the number of chained components in the synthetically generated docstring. With each additional component, pass rate drops by roughly a factor of 2-3.

图 11. Codex-12B 样本的通过率与合成生成的文档字符串中链接组件数量的关系。随着每个额外组件的增加，通过率大约下降 2-3 倍。

Further, just as text-conditional generative models in other modalities (Ramesh et al., 2021) have difficulty with binding attributes to objects, Codex can make mistakes binding operations to variables, especially when the number of operations and variables in the docstring is large. For instance, in the following prompt, Codex-12B does not decrement the variable w and also fails to return the product of all numbers.

进一步，正如其他模态的文本条件生成式模型 (Ramesh et al., 2021) 在将属性绑定到对象时存在困难一样，Codex 在将操作绑定到变量时也会出错，特别是在文档字符串中的操作和变量数量较大时。例如，在以下提示中，Codex-12B 没有对变量 w 进行递减操作，并且未能返回所有数字的乘积。

def do_work(x, y, z, w): """ Add 3 to y, then subtract 4 from both x and w. Return the product of the four numbers. """ ${\mathrm{~~\small~~t~}}={\mathrm{~~\small~~y~}}+{\mathrm{~~\small~~3~}}$ $\texttt{U}=\texttt{X}-\texttt{4}$ $\mathrm{~\boldmath~v~}=\mathrm{~\boldmath~z~},\star,\mathrm{~\boldmath~w~}$ return v

def do_work(x, y, z, w):
    """将 3 加到 y 上，然后从 x 和 w 中各减去 4。返回这四个数字的乘积。"""
    \${\mathrm{~\small~t~}}={\mathrm{~\small~y~}}+{\mathrm{~\small~3~}}\$  
    \$\texttt{U}=\texttt{X}-\texttt{4}\$  
    \$\mathrm{~\boldmath~v~}=\mathrm{~\boldmath~z~}\,\star\,\mathrm{~\boldmath~w~}\$  
    return v

This understanding of Codex’s limited system-level synthesis capabilities helps inform our assessment of the potential hazards of using it in a generative capacity, as well as the broader societal impacts that such systems could have.

对 Codex 的系统级合成能力有限的理解有助于我们评估其在生成式 AI (Generative AI) 能力方面的潜在风险，以及此类系统可能带来的更广泛的社会影响。

7. Broader Impacts and Hazard Analysis

7. 更广泛的影响和风险分析

Codex has the potential to be useful in a range of ways. For example, it could help onboard users to new codebases, reduce context switching for experienced coders, enable non-programmers to write specifications and have Codex draft implementations, and aid in education and exploration. However, Codex also raises significant safety challenges, does not always produce code that is aligned with user intent, and has the potential to be misused.

Codex 有潜力在多个方面发挥作用。例如，它可以帮助用户熟悉新的代码库，减少有经验的程序员的情境切换，使非程序员能够编写规范并让 Codex 草拟实现，以及辅助教育和探索。然而，Codex 也带来了重大的安全挑战，不总是生成符合用户意图的代码，并且存在被滥用的可能。

To better understand some of the hazards of using Codex in a generative capacity, we conducted a hazard analysis focused on identifying risk factors (Leveson, 2019) with the potential to cause harm.1 We outline some of our key findings across several risk areas below.

为了更好地了解在生成式容量中使用 Codex 的一些风险，我们进行了危害分析，重点关注识别可能导致损害的风险因素 (Leveson, 2019)。我们在以下部分概述了几个风险领域中的关键发现。

While some of our findings about the potential societal impacts of code generation systems were informed by work towards responsible deployment of the production-oriented Codex models (which descended from the research-oriented Codex models described in this paper), this section is not intended to provide a full account of any particular product’s safety features. Unless otherwise specified, we anchor our analysis in the specific properties of the models described in this paper. We share this analysis in the belief that some of it generalizes to the broader class of code generation systems, and to encourage a norm of performing detailed impact analysis as part of major machine learning research projects.

虽然我们关于代码生成系统潜在社会影响的一些发现得到了面向生产的 Codex 模型（这些模型源自本文描述的研究导向的 Codex 模型）负责任部署工作的启发，但本节并不打算提供任何特定产品安全功能的完整说明。除非另有说明，我们的分析基于本文描述的模型的具体属性。我们分享这一分析是基于其中部分内容可以推广到更广泛的代码生成系统类，并鼓励在主要的机器学习研究项目中进行详细影响分析的规范。

Note that by focusing largely on risks in this section, we do not mean to imply that we expect the impact of this class of technologies to be net-negative; rather, risks merit particular attention here because they may be subtle or require deliberate effort to address, whereas we expect the benefits to be more obvious and “automatic” from the perspective of most users and affected stakeholders.

请注意，本节主要关注风险，并不意味着我们认为这类技术的影响将是净负面的；相反，这里特别关注风险是因为它们可能较为微妙或需要刻意努力去应对，而我们预计其好处对于大多数用户和受影响的利益相关者来说将更为明显和“自动”。

7.1. Over-reliance

7.1. 过度依赖

One of the key risks associated with using code generation models in practice is over-reliance on generated outputs. Due to the limitations described above as well as alignment issues described below, Codex may suggest solutions that superficially appear correct but do not actually perform the task the user intended. This could particularly affect novice programmers, and could have significant safety implications depending on the context. We discuss a related issue in Appendix G, namely that code generation models can suggest insecure code. For these reasons, human oversight and vigilance is required for safe use of code generation systems like Codex.

使用代码生成模型时面临的一个关键风险是过度依赖生成的输出。由于上述限制以及下面描述的对齐问题，Codex 可能会建议表面上看似正确但实际上并未执行用户意图的任务的解决方案。这可能特别影响初学者程序员，并且根据上下文可能有重要的安全影响。我们在附录 G 中讨论了一个相关问题，即代码生成模型可以建议不安全的代码。因此，为了安全使用像 Codex 这样的代码生成系统，需要人类的监督和警惕。

We note several immediate ways to improve safety in the subsection on risk mitigation below, though over-reliance in particular is one that we believe merits further inquiry in industry and academia. While it is conceptually straightforward to provide documentation to users reminding them about model limitations, empirical investigation is necessary in order to identify how to reliably ensure vigilance in practice across a range of user experience levels, UI designs, and tasks. One challenge researchers should consider is that as capabilities improve, it may become increasingly difficult to guard against “automation bias.”

我们在下面的风险缓解小节中指出了几种立即可以采取的措施来提高安全性，尽管我们认为空泛依赖特别需要在业界和学术界进一步研究。虽然概念上直接向用户提供文档，提醒他们注意模型的局限性是简单的，但需要进行实证研究以确定如何在不同用户经验水平、用户界面设计和任务中可靠地确保警惕性。研究人员应考虑的一个挑战是，随着能力的提高，防止“自动化偏差”可能会变得越来越困难。

Figure 12. When the prompt includes subtle bugs, Codex tends to produce worse code than it is capable of. This persists when the prompt also includes instructions to write correct code. This gap increases with model size.

图 12. 当提示包含细微的错误时，Codex 倾向于生成比其能力更差的代码。即使提示中还包括编写正确代码的指令，这种情况仍然存在。这种差距随着模型规模的增大而增加。

7.2. Misalignment

7.2. 错位

As with other large language models trained on a next-token prediction objective, Codex will generate code that is as similar as possible to its training distribution. One consequence of this is that such models may do things that are unhelpful for the user, despite having the capability to be more helpful (see Figure 12). For example, if the user has some subtle mistakes in their code, Codex may “deliberately” suggest code that superficially appears good but is incorrect.

与训练用于下一个 Token 预测目标的其他大语言模型一样，Codex 生成的代码将尽可能接近其训练分布。这种做法的一个后果是，尽管这些模型有能力提供更大的帮助，但它们可能会做出对用户无益的事情（见图 12）。例如，如果用户的代码中有一些细微的错误，Codex 可能会“故意”建议表面上看起来不错但实际上错误的代码。

图 12:

This is an alignment failure - the model is not aligned with the user’s intentions. Informally, a system is misaligned if there’s some task X that we want it to do, and it is “capable” of doing X but “chooses” not to. In contrast, if a system fails to do X because it does not have the ability to do so, then this system is not misaligned; it is just incompetent. See Appendix E for more detail, including a more precise definition of alignment.

这是一个对齐失败——模型没有与用户的意图对齐。非正式地说，如果有一个任务 X 是我们希望系统完成的，而系统“能够”完成 X 但“选择”不去做，那么这个系统就是未对齐的。相反，如果系统未能完成 X 是因为它没有能力去做，那么这个系统并不是未对齐的；它只是无能的。更多细节请参见附录 E，包括对对齐更精确的定义。

It is important to study misalignment because it is a problem that is likely to become worse, not better, as the capabilities of our systems increase. For example, the model size scaling trend for the example in Figure 12 indicates that misalignment would likely persist and even get worse if data, parameters, and training time were scaled up.

研究错位问题很重要，因为它是一个随着系统能力的提升可能会变得更糟而不是更好的问题。例如，图 12 中的模型规模扩展趋势表明，如果数据、参数和训练时间增加，错位问题可能会持续存在甚至变得更严重。

While we expect that misaligned behaviour like this is unlikely to cause significant harm in current models, it is likely to become more dangerous and harder to eliminate as model capabilities increase. A highly capable but sufficiently misaligned model trained on user approval might produce obfuscated code that looks good to the user even on careful inspection, but in fact does something undesirable or even harmful.

虽然我们预计当前模型中的这种不一致行为不太可能造成重大危害，但随着模型能力的增强，这种行为可能会变得更加危险且更难消除。一个高度有能力但对齐不足的模型如果基于用户认可进行训练，可能会生成表面上看起来很好的混淆代码，即使经过仔细检查也显得无懈可击，但实际上却可能执行不希望的甚至有害的操作。

7.3. Bias and representation

7.3. 偏见和表征

Mirroring what has been found in the case of other language models trained on Internet data (Bender et al., 2021; Blodgett et al., 2020; Abid et al., 2021; Brown et al., 2020), we found that Codex can be prompted in ways that generate racist, de nigra tory, and otherwise harmful outputs as code comments, meriting interventions such as those discussed in the subsection on risk mitigation below. We also found that code generation models raise further bias and representation issues beyond problematic natural language: Codex can generate code with structure that reflects stereotypes about gender, race, emotion, class, the structure of names, and other characteristics. Particularly in the context of users who might over-rely on Codex or use it without first thinking through project design, this issue could have significant safety implications, giving further motivation to discourage over-reliance. We discuss bias and representation issues further in Appendix F. Filtration or modulation of generated outputs, documentation, and other interventions may help to mitigate these risks.

反映了在其他基于互联网数据训练的语言模型中发现的情况 (Bender et al., 2021; Blodgett et al., 2020; Abid et al., 2021; Brown et al., 2020)，我们发现 Codex 可以被提示生成带有种族主义、贬低黑人和其他有害内容的代码注释，这需要采取如下面风险缓解小节中讨论的干预措施。我们还发现，代码生成模型引发了超出问题自然语言的进一步偏见和代表性问题：Codex 可以生成反映关于性别、种族、情感、阶层、姓名结构和其他特征的刻板印象的代码结构。特别是在用户可能过度依赖 Codex 或在未先考虑项目设计的情况下使用它时，这个问题可能具有重要的安全影响，进一步激励我们应劝阻过度依赖。我们在附录 F 中进一步讨论了偏见和代表性问题。生成输出的过滤或调节、文档编写和其他干预措施可能有助于缓解这些风险。

7.4. Economic and labor market impacts

7.4. 经济和劳动力市场影响

Code generation and associated capabilities have several possible economic and labor market impacts. While Codex at its current capability level may somewhat reduce the cost of producing software by increasing programmer productivity, the size of this effect may be limited by the fact that engineers don’t spend their full day writing code $\mathrm{{[0^{*}N E T}}$ , 2021). Other important tasks include conferring with colleagues, writing design specifications, and upgrading existing software stacks.2 We also found that Codex imports packages at different rates, which could advantage some package authors over others, particularly if programmers and engineers come to rely on Codex’s suggestions. Over a longer time horizon, the effects of this class of technologies on software-related labor markets and on the economy more generally could be more substantial as capabilities improve. More study is needed both on the effects of code generation capabilities and on appropriate responses. We discuss economic and labor market implications in more detail in Appendix H.

代码生成及其相关能力可能对经济和劳动力市场产生多种影响。虽然 Codex 在当前的能力水平上可以通过提高程序员的生产力来一定程度上降低软件生产的成本，但这种效应的规模可能会受到限制，因为工程师并不会整天都在编写代码 (2021)。其他重要任务包括与同事讨论、编写设计规范以及升级现有的软件栈。我们还发现 Codex 以不同的速率导入包，这可能会使某些包的作者受益，特别是如果程序员和工程师开始依赖 Codex 的建议。从更长远的时间范围来看，随着能力的提升，这类技术对软件相关的劳动力市场以及更广泛的经济的影响可能会更加显著。需要更多的研究来探讨代码生成能力的影响以及适当的应对措施。我们在附录 H 中更详细地讨论了经济和劳动力市场的含义。

7.5. Security implications

7.5. 安全性影响

Codex could have various effects on the security landscape. Because Codex can produce vulnerable or misaligned code,3 qualified operators should review its generations before executing or trusting them, absent appropriate precautions. Future code generation models may be able to be trained to produce more secure code than the average developer, though that is far from certain.

Codex 可能会对安全态势产生各种影响。因为 Codex 可以生成存在漏洞或不正确的代码，有资质的操作员应在执行或信任这些代码之前审查其生成结果，除非采取了适当的预防措施。未来的代码生成模型可能会被训练成能够生成比普通开发人员更安全的代码，尽管这一点还远未确定。

Codex could also be misused to aid cybercrime. Although this is worthy of concern, based on our testing, we believe that at their current level of capability, Codex models do not materially lower the barrier to entry for malware development.4 We expect that more powerful code generation models will lead to future advancements, and therefore further research into mitigation s and continued study of model capabilities are necessary.

Codex 也可能被滥用以辅助网络犯罪。虽然这值得担忧，但根据我们的测试，我们认为在当前的能力水平下，Codex 模型并不会实质性地降低恶意软件开发的门槛。我们预计更强大的代码生成模型将带来未来的进步，因此有必要进一步研究缓解措施并持续研究模型能力。

The non-deterministic nature of systems like Codex could enable more advanced malware. This non-determinism makes it easier to create diverse software that accomplish the same tasks. While software diversity can sometimes aid defenders,5 it presents unique challenges for traditional malware detection and antivirus systems that rely on fingerprinting and signature-matching against previously sampled binaries. For example, a more capable code generation model could conceivably advance techniques for generating polymorphic malware.6 We believe that application security and model deployment strategies including rate-limiting access and abuse monitoring can manage this threat in the near term; however, the efficacy of these mitigation s may scale sub linearly as more capable models are developed.

像 Codex 这样的系统的非确定性特性可能会使更高级的恶意软件成为可能。这种非确定性使得创建能够完成相同任务的多样化软件变得更加容易。虽然软件多样性有时可以帮助防御者 [5]，但它为依赖于对先前采样二进制文件进行指纹识别和签名匹配的传统恶意软件检测和防病毒系统带来了独特的挑战。例如，更强大的代码生成模型可以设想出生成多态恶意软件的技术 [6]。我们认为，应用安全性和模型部署策略（包括速率限制访问和滥用监控）可以在短期内管理这一威胁；然而，随着更强大的模型的开发，这些缓解措施的有效性可能会以低于线性的速度扩展。

Similar to large language models, Codex models can learn patterns present in their training data (Carlini et al., 2021). Sensitive data present in source code are liable to be predicted by the model. Because Codex is trained on public repositories, we consider any sensitive data present in the training data to have already been compromised. Similarly, the public data should generally be treated as untrusted, as previous work (Goldblum et al., 2021; Schuster et al., 2020) has found that attackers may be able to corrupt training data to trigger specific model behaviors at runtime. We further discuss security implications in Appendix G.

类似于大语言模型，Codex 模型可以学习其训练数据中存在的模式 (Carlini et al., 2021)。源代码中存在的敏感数据有可能被模型预测出来。由于 Codex 是在公共仓库上进行训练的，我们认为训练数据中任何存在的敏感数据已经被泄露。同样，公共数据通常应被视为不可信的，因为之前的研究 (Goldblum et al., 2021; Schuster et al., 2020) 发现攻击者可能能够篡改训练数据以触发模型在运行时的特定行为。我们进一步在附录 G 中讨论安全影响。

7.6. Environmental impacts

7.6. 环境影响

Codex, like other large generative models, has an energy footprint from both training and inference (Schwartz et al., 2019; Bender et al., 2021; Patterson et al., 2021). The origi- nal training of GPT-3-12B consumed hundreds of petaflop/sdays of compute, while fine-tuning it to create Codex-12B consumed a similar amount of compute. This training was performed on a platform (Azure) that purchases carbon credits and sources significant amounts of renewable energy, reducing its carbon footprint.7 Compute consumption also has costs in the wider supply chain that can be quite concentrated on certain regions.8 Looking more globally and long-term, the compute demands of code generation could grow to be much larger than Codex’s training if significant inference is used to tackle challenging problems.9

Codex，像其他大生成模型一样，从训练到推理都有能源足迹 (Schwartz et al., 2019; Bender et al., 2021; Patterson et al., 2021)。GPT-3-12B 的原始训练消耗了数百个 petaflop/s-days 的计算资源，而将其微调为 Codex-12B 也消耗了类似的计算资源。此训练是在一个购买碳信用并大量使用可再生能源的平台 (Azure) 上进行的，从而减少了其碳足迹。计算资源消耗在更广泛的供应链中也有成本，并且这些成本可能集中在某些地区。从全球和长期来看，如果使用大量的推理来解决复杂问题，代码生成的计算需求可能会远超 Codex 的训练需求。

7.7. Legal implications

7.7. 法律影响

There are several legal considerations related to generated code. To begin with, the training of AI systems on Internet data, such as public GitHub repositories, has previously been identified as an instance of “fair use” (O’Keefe et al., 2019).

有几项与生成代码相关的法律考虑。首先，AI系统在互联网数据（如公共 GitHub 仓库）上的训练之前已被认定为“合理使用” (O’Keefe et al., 2019) 。

Our preliminary research also finds that Codex models rarely generate code that is identical to the contents of training data. Such occurrences were $<0.1%$ in a study examining the frequency of code generations that appear to match code snippets in the training data (Ziegler, 2021). In these rare instances, the generated code consisted of common expressions or conventions within the programming language that appeared over and over again in the training data. We find that, to the extent the generated code appears identical to the training data, it is due to the predictive weightings in the model rather than retention and copying of specific code.

我们的初步研究还发现，Codex 模型很少生成与训练数据内容完全相同的代码。在一项研究中，这种发生的频率为 $<0.1%$ ，该研究考察了生成的代码与训练数据中的代码片段匹配的频率 (Ziegler, 2021)。在这些罕见的情况下，生成的代码由编程语言中的常见表达式或约定组成，这些表达式或约定在训练数据中反复出现。我们发现，就生成的代码与训练数据看起来相同而言，这是由于模型的预测权重，而不是对特定代码的保留和复制。

Generated code is also responsive and customized to the user’s input, and the user retains complete control over editing and acceptance of the generated code. This can make code generation similar to auto-suggest or auto-completion features that exist as features of other tools of authorship (e.g., document editors), in the sense that the finished work is still seen as the author’s.

生成的代码也会根据用户的输入进行响应和定制，用户完全保留对生成代码的编辑和接受控制权。这使得代码生成类似于其他创作工具（例如，文档编辑器）中存在的自动建议或自动完成功能，因为最终的作品仍然被视为作者的作品。

Our commitment to responsible and safe AI includes continued attention to the broader intellectual property implications of code generation systems. We intend to remain engaged with policymakers and experts on these issues so that the users of such systems can ultimately deploy them with confidence.

我们致力于负责任和安全的AI，包括持续关注代码生成系统对更广泛的知识产权的影响。我们打算继续与政策制定者和专家就这些问题进行沟通，以便这些系统的用户最终能够有信心地部署它们。

7.8. Risk mitigation

7.8. 风险缓解

In closing, given the above, models like Codex should be developed, used, and their capabilities explored carefully with an eye towards maximizing their positive social impacts and minimizing intentional or unintentional harms that their use might cause. A contextual approach is critical to effective hazard analysis and mitigation, though a few broad categories of mitigation s are important to consider in any deployment of code generation models.

总之，鉴于上述内容，像 Codex 这样的模型应该被开发、使用，并且其功能应谨慎探索，以最大化其积极的社会影响并最小化其使用可能造成的有意或无意的危害。情境化的处理方法对于有效的风险分析和缓解至关重要，尽管在部署代码生成模型时，考虑一些广泛的缓解类别也很重要。

Careful documentation and user interface design, code review requirements, and/or content controls (e.g., filtering of outputs) may help to reduce harms associated with overreliance as well as offensive content or insecure code generation. In the context of a model made available as a service (e.g., via an API), policies such as user review, use case restrictions, monitoring, and/or rate limiting may also help to reduce harms associated with malicious use or prevent its use in high-stakes domains for which the models are not well suited.

仔细的文档编写和用户界面设计、代码审查要求以及内容控制（例如，输出过滤）可能有助于减少与过度依赖相关的危害以及攻击性内容或不安全的代码生成。在以服务形式提供的模型（例如，通过 API）的背景下，政策如用户审查、使用场景限制、监控和/或速率限制也可能有助于减少与恶意使用相关的危害或防止其在模型不适合的高风险领域中使用。

Appendices E, F, G, and H provide further detail on the risks described in this section and outline additional mitigation and research opportunities.

附录 E、F、G 和 H 提供了本节所述风险的进一步详细信息，并概述了额外的缓解措施和研究机会。

8. 相关工作

The deep learning resurgence has led to strong advances in the field of program learning. Two popular approaches to neural program learning are program induction and program synthesis.

深度学习的复兴导致了程序学习领域的重大进展。两种流行的神经程序学习方法是程序归纳和程序合成。

In program induction, a model generates program outputs directly from a latent program representation. Learning to Execute (Zaremba & Sutskever, 2014) demonstrated that models could execute simple tasks like addition and memorization. Later attempts at program induction incorporated inductive biases based on modern computing devices, such as the Neural Turing Machine (Graves et al., 2014), memory networks (Weston et al., 2015; Sukhbaatar et al., 2015), the Neural GPU (Kaiser & Sutskever, 2015), and the differentiable neural computer (Graves et al., 2016). More recent approaches like the Neural Program Interpreter (Reed & de Freitas, 2016; Shin et al., 2018; Pierrot et al., 2021) and

在程序归纳中，模型从潜在的程序表示直接生成程序输出。学习执行 (Zaremba & Sutskever, 2014) 证明了模型可以执行加法和记忆等简单任务。后来的程序归纳尝试引入了基于现代计算设备的归纳偏置，例如神经图灵机 (Neural Turing Machine) (Graves et al., 2014)，记忆网络 (Weston et al., 2015; Sukhbaatar et al., 2015)，神经 GPU (Kaiser & Sutskever, 2015)，以及可微分神经计算机 (differentiable neural computer) (Graves et al., 2016)。更近期的方法如神经程序解释器 (Neural Program Interpreter) (Reed & de Freitas, 2016; Shin et al., 2018; Pierrot et al., 2021) 和

Universal Transformer (Dehghani et al., 2019) found recurrence to be a useful component in program induction.

Universal Transformer (Dehghani et al., 2019) 发现递归是程序归纳中有用的组件。

In program synthesis, a model explicitly generates a program, usually from a natural language specification. One of the most popular classical approaches used a probabilistic context free grammar (PCFG) to generate a program’s abstract syntax tree (AST). Maddison & Tarlow (2014) improved on this setup by learning a state vector used to condition child node expansion. Later, Allamanis et al. (2015) applied this idea in text-to-code retrieval and Yin & Neubig (2017) utilized it in text-conditional code generation. Code2seq (Alon et al., 2018) found that ASTs could also be leveraged for code-to-text generation.

在程序合成中，模型显式地生成一个程序，通常是从自然语言规范生成。一种最流行的经典方法使用概率上下文无关语法 (PCFG) 来生成程序的抽象语法树 (AST)。Maddison & Tarlow (2014) 通过学习用于条件子节点扩展的状态向量改进了这种方法。后来，Allamanis 等 (2015) 将这一想法应用于文本到代码检索，Yin & Neubig (2017) 在文本条件代码生成中利用了它。Code2seq (Alon 等, 2018) 发现 AST 也可以用于代码到文本生成。

Programs can also be synthesized without passing through an AST representation. Hindle et al. (2012) investigated n-gram language models of code, finding code to be more predictable than natural language. Latent Predictor Networks (Ling et al., 2016) showed that character-level language models could generate working code for implementing Magic the Gathering cards in an online arena, when aided with a latent mode that allows card attributes to be copied into code. DeepCoder (Balog et al., 2017) trained a model to predict the functions appearing in source code, which could be used to guide program search.

程序也可以在不经过 AST 表示的情况下进行合成。Hindle 等人 (2012) 研究了代码的 n-gram 语言模型，发现代码比自然语言更具可预测性。潜在预测网络 (Ling 等，2016) 表明，在线竞技场中实现 Magic the Gathering 卡牌的字符级语言模型可以在辅助模式的帮助下生成工作代码，该模式允许将卡牌属性复制到代码中。DeepCoder (Balog 等，2017) 训练了一个模型来预测源代码中出现的函数，这些预测可以用于指导程序搜索。

Following the success of large natural language models (Devlin et al., 2018; Radford et al., 2019; Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020) large scale Transformers have also been applied towards program synthesis. CodeBERT (Feng et al., 2020) trained the BERT objective on docstrings paired with functions, and obtained strong results on code search. PyMT5 (Clement et al., 2020) is similar in spirit to our work, and used the T5 objective to train a system which can translate between non-overlapping subsets of {signature, docstring, body}.

在大自然语言模型取得成功之后 (Devlin et al., 2018; Radford et al., 2019; Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020)，大规模 Transformer 也被应用于程序合成。CodeBERT (Feng et al., 2020) 在函数与文档字符串对上训练了 BERT 目标，并在代码搜索任务中取得了很好的结果。PyMT5 (Clement et al., 2020) 与我们的工作精神相似，使用 T5 目标训练了一个系统，该系统可以在非重叠的子集 {signature, docstring, body} 之间进行翻译。

We used functional correctness to benchmark our models, and observed improvements on this metric with more sampling. SPoC (Kulal et al., 2019) considered the problem of producing functionally correct code from pseudocode with a fixed budget of compilations, which is similar to our pass $@k$ metric. TransCoder (Lachaux et al., 2020) trained a system to translate between programming languages in an unsupervised manner, and also observed that functional correctness better captured the capabilities of their model than BLEU score. In fact, ContraCode (Jain et al., 2020) leveraged the large space of functionally correct programs to train a contrastive code model, which improved model performance on tasks like type inference. Finally, RobustFill (Devlin et al., 2017) observed that the best way to find a program consistent with input examples was to synthesize multiple samples through beam search.

我们使用功能正确性来评估我们的模型，并观察到随着更多采样，该指标有所改善。SPoC (Kulal et al., 2019) 考虑了在固定编译预算下从伪代码生成功能正确的代码的问题，这与我们的通过率 $@k$ 指标类似。TransCoder (Lachaux et al., 2020) 训练了一个系统以无监督的方式在编程语言之间进行翻译，并且也观察到功能正确性比 BLEU 分数更能捕捉其模型的能力。实际上，ContraCode (Jain et al., 2020) 利用了功能正确程序的巨大空间来训练对比代码模型，从而提高了模型在类型推断等任务上的性能。最后，RobustFill (Devlin et al., 2017) 观察到，找到与输入示例一致的程序的最佳方法是通过束搜索合成多个样本。

Two early domain-specific datasets used to benchmark neural programming systems were FlashFill (Gulwani, 2011; Gulwani et al., 2012) and Hearth stone (Ling et al., 2016), though the community has trended towards broader and more difficult datasets. Barone & Sennrich (2017) proposed a large training and evaluation dataset consisting of Python declarations, docstrings, and bodies scraped from GitHub. The Code Search Net challenge (Husain et al., 2019) built an even larger corpus from GitHub with data from multiple popular programming languages. Recently, CodeXGLUE (Lu et al., 2021) aggregated several programming benchmarks, making use of the recently proposed CodeBLEU metric (Ren et al., 2020). Most relevant to our evaluation work is the APPS (Hendrycks et al., 2021) benchmark for measuring functional correctness based on problems from the competitive programming website Codeforces.

两个早期的领域特定数据集用于评估神经编程系统，分别是 FlashFill (Gulwani, 2011; Gulwani 等, 2012) 和 Hearthstone (Ling 等, 2016)，尽管社区逐渐倾向于更广泛和更具挑战性的数据集。Barone & Sennrich (2017) 提出了一种大型训练和评估数据集，包含从 GitHub 抓取的 Python语言声明、文档字符串和函数体。Code Search Net 挑战 (Husain 等, 2019) 从 GitHub 构建了一个更大的语料库，涵盖了多种流行编程语言的数据。最近，CodeXGLUE (Lu 等, 2021) 整合了多个编程基准测试，使用了最近提出的 CodeBLEU 度量标准 (Ren 等, 2020)。与我们的评估工作最相关的是 APPS (Hendrycks 等, 2021) 基准测试，它基于来自编程竞赛网站 Codeforces 的问题来衡量功能正确性。

Finally, we note that coding is a broad activity which involves much more than synthesizing code from docstrings. Tufano et al. (2020) use Transformers to generate unit tests for code which outperformed commercial offerings. Aye et al. (2021) built an internal auto-complete tool for Facebook, and found that training on accepted user completions boosted system performance. Development also entails locating and fixing bugs. Early works used static or dynamic code analysis (Agrawal et al., 1995; Korel & Rilling, 1997), learned association rules (Jeffrey et al., 2009), and genetic programming (Goues et al., 2012) to debug faulty code. These approaches relied on running against a test suite to not only evaluate the correctness of suggestions but also expose problems in execution trace or search for a solution. More recent works (Tufano et al., 2019; Drain et al., 2021) considered bug-fixing as neural machine translation from buggy to correct programs. However, these works used an exact match against a reference instead of functional correctness, citing Qi et al. (2015)’s finding that most of the proposed solutions by genetic search in (Goues et al., 2012) passed through weak test suites by deleting functionality that failed. Human developers often write test suites with limited but targeted coverage, but this does not always work well against an algorithm, highlighting the challenges of evaluating correctness of programs.

最后，我们注意到编程是一项广泛的活动，涉及的内容远不止从文档字符串合成代码。Tufano 等人 (2020) 使用 Transformer 生成的单元测试超过了商业产品的表现。Aye 等人 (2021) 为 Facebook 构建了一个内部自动补全工具，并发现基于用户接受的补全进行训练可以提升系统性能。开发还涉及到定位和修复错误。早期的工作使用静态或动态代码分析 (Agrawal 等人, 1995; Korel & Rilling, 1997)，学习关联规则 (Jeffrey 等人, 2009)，以及遗传编程 (Goues 等人, 2012) 来调试有缺陷的代码。这些方法依赖于运行测试套件来不仅评估建议的正确性，还暴露执行跟踪中的问题或搜索解决方案。更近期的工作 (Tufano 等人, 2019; Drain 等人, 2021) 将 bug 修复视为从有 bug 的程序到正确程序的神经机器翻译。然而，这些工作使用的是与参考代码的精确匹配而不是功能正确性，引用了 Qi 等人 (2015) 的研究结果，即 (Goues 等人, 2012) 提出的大多数通过弱测试套件的解决方案是通过删除失败的功能实现的。人类开发者通常编写具有有限但有针对性覆盖范围的测试套件，但这并不总能很好地应对算法，突显了评估程序正确性的挑战。

9. Conclusion

9. 结论

We investigated whether it was possible to train large language models to produce functionally correct code bodies from natural language docstrings. By fine-tuning GPT on code from GitHub, we found that our models displayed strong performance on a dataset of human-written problems with difficulty level comparable to easy interview problems. Model performance could be improved by training on a distribution more similar to the evaluation set, and also by producing multiple samples from a model. We also found that it was simple to train a model to complete the reverse task of producing docstrings from code bodies, and that the performance profiles of these models were similar. Finally, we expanded on the broader impacts of code generating models, and discussed model limitations, finding significant room for improvement.

我们研究了是否可以训练大语言模型从自然语言文档字符串生成功能正确的代码主体。通过在来自 GitHub 的代码上微调 GPT，我们发现我们的模型在一个难度相当于简单面试题的人类编写问题数据集上表现出色。通过使用与评估集更相似的分布进行训练，以及从模型生成多个样本，可以提高模型性能。我们还发现训练模型完成逆向任务（即从代码主体生成文档字符串）非常简单，并且这些模型的性能特征相似。最后，我们扩展讨论了代码生成模型的广泛影响，并讨论了模型的局限性，发现了显著的改进空间。

Acknowledgements

致谢

We thank Sandhini Agarwal, Casey Chu, Jeffrey Ding, Peter Eckersley, Gillian Hadfield, Rich Harang, Jacob Jack- son, Yunxin Jiao, Jade Leung, Andrew Lohn, Ryan Lowe, Thomas McGuire, Margaret Mitchell, Florentine Eloundou Nekoul, Cullen O’Keefe, Long Ouyang, Pranav Shyam, Irene Solaiman, Aravind Srinivas, Helen Toner, Ashish Vaswani, and Jeffrey Wu for helpful discussions and feedback on drafts of this work. We are also grateful to the Acceleration and Super computing teams at OpenAI for their work on software and hardware infrastructure that this project used. Finally, we thank GitHub for partnering to build GitHub Copilot and Microsoft Azure for supporting model training with infrastructure management.

我们感谢 Sandhini Agarwal、Casey Chu、Jeffrey Ding、Peter Eckersley、Gillian Hadfield、Rich Harang、Jacob Jackson、Yunxin Jiao、Jade Leung、Andrew Lohn、Ryan Lowe、Thomas McGuire、Margaret Mitchell、Florentine Eloundou Nekoul、Cullen O'Keefe、Long Ouyang、Pranav Shyam、Irene Solaiman、Aravind Srinivas、Helen Toner、Ashish Vaswani 和 Jeffrey Wu 对本工作的草稿提供了有益的讨论和反馈。我们还要感谢 OpenAI 的 Acceleration 和 Supercomputing 团队在软件和硬件基础设施方面的工作，这些基础设施是本项目所依赖的。最后，我们感谢 GitHub 合作构建 GitHub Copilot 以及 Microsoft Azure 在模型训练中提供的基础设施管理支持。

References

参考文献

Cwe-327: Use of a broken or risky cryptographic algorithm, 2006. URL https://cwe.mitre.org/data/definitions/ 327.html.

Cwe-327: 使用破损或有风险的加密算法 (cryptographic algorithm)，2006。URL https://cwe.mitre.org/data/definitions/327.html。

Cwe-780: Use of rsa algorithm without oaep, 2009. URL https: //cwe.mitre.org/data/definitions/780.html.

Cwe-780: 使用 RSA 算法而不使用 OAEP，2009。URL https: //cwe.mitre.org/data/definitions/780.html.

A6:2017-security mis configuration, 2017. URL https: //owasp.org/www-project-top-ten/2017/ A6 2017-Security Mis configuration.html.

A6:2017-安全配置错误，2017。URL https: //owasp.org/www-project-top-ten/2017/ A6 2017-Security Mis configuration.html.

Abid, A., Farooqi, M., and Zou, J. Persistent anti-muslim bias in large language models. arXiv preprint arXiv:2101.05783, 2021.

阿比德，A.，法鲁奇，M.，和邹，J. 持续存在的大语言模型中反穆斯林偏见。arXiv预印本 arXiv:2101.05783, 2021.

Acemoglu, D. and Restrepo, P. Robots and jobs: Evidence from us labor markets. Journal of Political Economy, 128(6):2188–2244, 2020a.

Acemoglu, D. 和 Restrepo, P. 机器人和就业：来自美国劳动市场的证据。政治经济期刊，128(6):2188–2244，2020a。

Acemoglu, D. and Restrepo, P. The wrong kind of ai? artificial intelligence and the future of labour demand. Cambridge Journal of Regions, Economy and Society, 13(1):25–35, 2020b.

Acemoglu, D. 和 Restrepo, P. 错误类型的 AI？人工智能与劳动力需求的未来. Cambridge Journal of Regions, Economy and Society, 13(1):25–35, 2020b.

Agrawal, H., Horgan, J. R., London, S., and Wong, W. E. Fault localization using execution slices and dataflow tests. Proceedings of Sixth International Symposium on Software Reliability Engineering. ISSRE’95, pp. 143–151, 1995.

Agrawal, H., Horgan, J. R., London, S., 和 Wong, W. E. 使用执行切片和数据流测试进行故障定位。第六届国际软件可靠性工程研讨会论文集。ISSRE’95，第 143–151 页，1995。

Allamanis, M., Tarlow, D., Gordon, A., and Wei, Y. Bimodal modelling of source code and natural language. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2123–2132, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/ v37/all a man is 15.html.

Allamanis, M., Tarlow, D., Gordon, A., 和 Wei, Y. 源代码和自然语言的双模态建模。在 Bach, F. 和 Blei, D. (编)，第 32 届国际机器学习会议论文集，机器学习研究论文集第 37 卷，页码 2123–2132，法国里尔，2015 年 7 月 7 日至 9 日。PMLR。URL http://proceedings.mlr.press/v37/all a man is 15.html。

Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M., and Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nature methods, 16(12):1315–1322, 2019.

Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M., 和 Church, G. M. 基于序列的深度表示学习统一理性蛋白质工程。Nature methods，16(12):1315–1322，2019。

Alon, U., Brody, S., Levy, O., and Yahav, E. code2seq: Generating sequences from structured representations of code. In International Conference on Learning Representations, 2018.

Alon, U., Brody, S., Levy, O., 和 Yahav, E. code2seq: 从代码的结构化表示生成序列 (Generating sequences from structured representations of code). 在 International Conference on Learning Representations, 2018.

Aye, G. A., Kim, S., and Li, H. Learning auto completion from realworld datasets. 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 131–139, 2021.

G. A. Aye，Kim，S.，和 Li，H. 从现实世界数据集中学习自动完成。2021 IEEE/ACM 第 43 届国际软件工程会议：软件工程实践 (ICSE-SEIP)，第 131–139 页，2021。

Baevski, A., Zhou, H., Mohamed, A., and Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477, 2020.

Baevski, A., Zhou, H., Mohamed, A., 和 Auli, M. wav2vec 2.0: 一个用于自我监督学习语音表示的框架。 arXiv preprint arXiv:2006.11477, 2020.

Balog, M., Gaunt, A., Brock schmidt, M., Nowozin, S., and Tarlow, D. Deepcoder: Learning to write programs. In 5th International Conference on Learning Representations (ICLR), 2017.

Balog, M., Gaunt, A., Brock schmidt, M., Nowozin, S., 和 Tarlow, D. Deepcoder: 学习编写程序. 在第 5 届国际学习表征会议 (ICLR) 上，2017.

Bao, H., Dong, L., and Wei, F. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.

包, H., 董, L., 和魏, F. Beit: Bert预训练的图像 Transformer。arXiv预印本 arXiv:2106.08254, 2021。

Barone, A. V. M. and Sennrich, R. A parallel corpus of python functions and documentation strings for automated code documentation and code generation. ArXiv, abs/1707.02275, 2017.

巴罗内，A. V. M. 和森里奇，R. Python语言函数和文档字符串的并行语料库，用于自动代码文档生成和代码生成。ArXiv, abs/1707.02275, 2017.

Barrington, I. M. and Maciel, A. Lecture 3: Non deterministic computation. https://people.clarkson.edu/˜alexis/ PCMI/Notes/lectureB03.pdf, 2000. [Online; accessed 29-June-2000].

巴林顿，I. M. 和 Maciel，A. 第 3 讲：非确定性计算。https://people.clarkson.edu/~alexis/PCMI/Notes/lectureB03.pdf, 2000. [在线；访问于 2000-6-29]。

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623, 2021.

Bender, E. M., Gebru, T., McMillan-Major, A., 和 Shmitchell, S. 论随机鹦鹉的危险性：语言模型能否过大？在 2021 ACM 公平性、责任性和透明度会议论文集，第 610–623 页，2021。

Black, S., Gao, L., Wang, P., Leahy, C., and Biderman, S. GPT-Neo: Large scale auto regressive language modeling with mesh-tensorflow, 2021. URL http://github.com/ eleutherai/gpt-neo.

Black, S., 高, L., 王, P., Leahy, C., 和 Biderman, S. GPT-Neo：使用 Mesh-TensorFlow 的大规模自回归语言建模，2021。URL http://github.com/eleutherai/gpt-neo。

Blodgett, S. L., Barocas, S., Daum´e III, H., and Wallach, H. Lan- guage (technology) is power: A critical survey of “bias” in nlp. arXiv preprint arXiv:2005.14050, 2020.

布洛杰特，S. L.，巴罗卡斯，S.，多姆 III，H.，和瓦拉奇，H. 语言（技术）即权力：对自然语言处理中的“偏差”的批判性综述。arXiv预印本 arXiv:2005.14050, 2020。

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neel a kant an, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., 和 Amodei, D. 语言模型是少样本学习者。ArXiv, abs/2005.14165, 2020。

Bureau of Labor Statistics, U. D. o. L. Computer programmers. Occupational Outlook Handbook, 2021a. URL https: //www.bls.gov/ooh/computer-and-information- technology/computer-programmers.htm.

美国劳工统计局 (Bureau of Labor Statistics) , U. D. o. L. 计算机程序员 . 职业展望手册 , 2021a. URL https: //www.bls.gov/ooh/computer-and-information- technology/computer-programmers.htm.

Bureau of Labor Statistics, U. D. o. L. Bls - software developers. Occupational Outlook Handbook, 2021b. URL https: //www.bls.gov/ooh/computer-and-information- technology/software-developers.htm.

美国劳工统计局 (Bureau of Labor Statistics, U. D. o. L. Bls) - 软件开发人员。职业展望手册，2021b。URL https: //www.bls.gov/ooh/computer-and-information- technology/software-developers.htm.

Carlini, N., Tram`er, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., and Raffel, C. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, August 2021. URL https://www.usenix.org/conference/

卡林尼，N.，特拉默，F.，华莱士，E.，亚吉尔斯基，M.，赫伯特-沃斯，A.，李，K.，罗伯茨，A.，布朗，T.，宋，D.，厄林格松，U.，奥普雷亚，A.，和拉斐尔，C. 从大语言模型中提取训练数据。在第 30 届 USENIX 安全研讨会 (USENIX Security 21) 中。USENIX 协会，2021 年 8 月。URL https://www.usenix.org/conference/

use nix security 21/presentation/carlini

使用 Nix 安全性 21/演示/卡林尼

Eghbal, N. Working in public: the making and maintenance of open source software. Stripe Press, 2020.

Eghbal, N. 在公众中工作：开源软件的制作和维护 . Stripe Press, 2020.

Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., et al. Codebert: A pre-trained model for programming and natural languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1536–1547, 2020.

Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., et al. Codebert: 一个用于编程语言和自然语言的预训练模型。在 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)，页码 1536–1547，2020。

Frey, C. B. The technology trap. Princeton University Press, 2019.

Frey, C. B. 技术陷阱. Princeton University Press, 2019.

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The pile: An 800gb dataset of diverse text for language modeling. 2020.

高, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., 何, H., Thite, A., Nabeshima, N., Presser, S., 和 Leahy, C. The pile: 一个 800GB 的多样化文本数据集，用于语言建模。2020。

Goldblum, M., Tsipras, D., Xie, C., Chen, X., Schwarz s child, A., Song, D., Madry, A., Li, B., and Goldstein, T. Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses, 2021.

Goldblum, M., Tsipras, D., Xie, C., Chen, X., Schwarz s child, A., Song, D., Madry, A., Li, B., 和 Goldstein, T. 机器学习的数据集安全：数据投毒，后门攻击及防御措施，2021。

Goues, C. L., Dewey-Vogt, M., Forrest, S., and Weimer, W. A systematic study of automated program repair: Fixing 55 out of 105 bugs for $\mathbb{S}8$ each. 2012 34th International Conference on Software Engineering (ICSE), pp. 3–13, 2012.

Goues, C. L., Dewey-Vogt, M., Forrest, S., 和 Weimer, W. 自动程序修复的系统研究：为 $\mathbb{S}8$ 每个修复了 105 个 bug 中的 55 个。2012 年第 34 届国际软件工程会议 (ICSE)，第 3–13 页，2012 年。

Graves, A. Generating sequences with recurrent neural networks, 2014.

Graves, A. 用循环神经网络生成序列 (Generating sequences with recurrent neural networks), 2014.

参考文献：

[20] Graves, A. Generating sequences with recurrent neural networks, 2014.

Graves, A., Wayne, G., and Danihelka, I. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.

Graves, A., Wayne, G., 和 Danihelka, I. 神经图灵机 (Neural Turing Machines)。arXiv 预印本 arXiv:1410.5401, 2014.

Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwi´nska, A., Col men are jo, S. G., Gre fens te tte, E., Ramalho, T., Agapiou, J., et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538 (7626):471–476, 2016.

Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwi´nska, A., Colmen are jo, S. G., Gre fens te tte, E., Ramalho, T., Agapiou, J., 等. 使用具有动态外部内存的神经网络进行混合计算. Nature, 538 (7626):471–476, 2016.

Gulwani, S. Automating string processing in spreadsheets using input-output examples. In PoPL’11, January 26-28, 2011, Austin, Texas, USA, January 2011.

Gulwani, S. 使用输入输出示例自动化电子表格中的字符串处理. 在 PoPL’11, 2011年1月26-28日, 美国德克萨斯州奥斯汀, 2011年1月.

Gulwani, S., Harris, W. R., and Singh, R. Spreadsheet data manipulation using examples. Commun. ACM, 55:97–105, 2012.

Gulwani, S., Harris, W. R., 和 Singh, R. 使用示例进行电子表格数据操作. Commun. ACM, 55:97–105, 2012.

He, P., Liu, X., Gao, J., and Chen, W. Deberta: Decodingenhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.

何，P., 刘，X., 高，J., 和陈，W. Deberta: 解码增强的 BERT (Decoding-enhanced BERT) 与解缠注意力机制 (disentangled attention)。arXiv 预印本 arXiv:2006.03654, 2020.

Helmuth, T. and Spector, L. General program synthesis benchmark suite. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pp. 1039–1046, 2015.

Helmuth, T. 和 Spector, L. 通用程序合成基准套件。在《2015年遗传与进化计算年度会议论文集》中，页码 1039–1046，2015。

Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938, 2021.

Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al. 通过应用程序衡量编程挑战能力。 arXiv preprint arXiv:2105.09938, 2021.

Hindle, A., Barr, E. T., Su, Z., Gabel, M., and Devanbu, P. On the naturalness of software. In 2012 34th International Conference on Software Engineering (ICSE), pp. 837–847. IEEE, 2012.

Hindle, A., Barr, E. T., Su, Z., Gabel, M., 和 Devanbu, P. 论软件的自然性. 在 2012 年第 34 届国际软件工程会议 (ICSE) 上，页码 837–847。IEEE, 2012。

Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration, 2020.

Holtzman, A., Buys, J., Du, L., Forbes, M., 和 Choi, Y. 神经文本退化的奇特案例，2020。

Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and Brock schmidt, M. Code search net challenge: Evaluating the state of semantic code search. ArXiv, abs/1909.09436, 2019.

Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., 和 Brock schmidt, M. 代码搜索挑战 (Code search net challenge): 评估语义代码搜索的现状。ArXiv, abs/1909.09436, 2019。

Wang, B. and Komatsu zak i, A. GPT-J-6B: A 6 Billion Parameter Auto regressive Language Model. https://github.com/ kingoflolz/mesh-transformer-jax, May 2021.

王, B. 和 Komatsu zak i, A. GPT-J-6B: 一个 60 亿参数的自回归语言模型。https://github.com/kingoflolz/mesh-transformer-jax, 2021 年 5 月。

Weston, J., Chopra, S., and Bordes, A. Memory networks, 2015.

韦斯顿、乔普拉和博尔德斯于 2015 年提出了记忆网络 [20]。

Woolf, M. Fun and dystopia with ai-based code generation using gpt-j-6b, June 2021. URL https://minimaxir.com/ 2021/06/gpt-j-6b/.

Woolf, M. 使用基于 AI 的代码生成与 GPT-J-6B 的乐趣与反乌托邦，2021 年 6 月。URL https://minimaxir.com/2021/06/gpt-j-6b/。

Xu, F. F., Vasilescu, B., and Neubig, G. In-ide code generation from natural language: Promise and challenges. arXiv preprint arXiv:2101.11149, 2021.

徐，F. F., Vasilescu, B., 和 Neubig, G. 从自然语言生成在 IDE 内的代码：前景与挑战。arXiv 预印本 arXiv:2101.11149, 2021。

Yin, P. and Neubig, G. A syntactic neural model for generalpurpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 440–450, 2017.

殷, P. 和 Neubig, G. 一种用于通用代码生成的句法神经模型。在第 55 届计算语言学协会 (ACL) 年会论文集，第 440–450 页，2017。

Zaremba, W. and Sutskever, I. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.

Zaremba, W. 和 Sutskever, I. 学习执行. arXiv预印本 arXiv:1410.4615, 2014.

Zellers, R., Lu, X., Hessel, J., Yu, Y., Park, J. S., Cao, J., Farhadi, A., and Choi, Y. Merlot: Multimodal neural script knowledge models. arXiv preprint arXiv:2106.02636, 2021.

Zellers, R., Lu, X., Hessel, J., Yu, Y., Park, J. S., Cao, J., Farhadi, A., 和 Choi, Y. Merlot: 多模态神经脚本知识模型 (Multimodal neural script knowledge models). arXiv preprint arXiv:2106.02636, 2021.

Zhao, T. Z., Wallace, E., Feng, S., Klein, D., and Singh, S. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690, 2021.

Zhao, T. Z., Wallace, E., Feng, S., Klein, D., 和 Singh, S. 使用前校准：提高语言模型的少样本性能。arXiv preprint arXiv:2102.09690, 2021。

Ziegler, A. A first look at rote learning in github copilot suggestions., Jun 2021. URL https://docs.github.com/en/ github/copilot/research-recitation.

齐格勒，A. 对 GitHub Copilot 建议中的机械学习的初步研究 (Jun 2021). URL https://docs.github.com/en/github/copilot/research-recitation.

A. Estimating pass $@k$

A. 估计通过率 $@k$

While all estimators mentioned previously are consistent, only the empirical estimate used by Kulal et al. (2019), and (1) are unbiased. Evaluating pass $@k$ in an unbiased way with any number of samples $n$ is important for fair comparison. For example, estimating pass $\omega_{/}k=1-(1-$ $\mathrm{pass}(\varpi1)^{k}$ with $1-(1-\hat{p})^{k}$ using the empirical pass $@1$ , results in a consistent underestimate as shown in Figure 13. The gap doesn’t fully close even when $n>5k$ , and results can seem better with more samples. The interpretation of this estimator is that we draw $k$ samples with replacement from a pool of $n$ candidates, but the $k$ samples are not independent.

虽然之前提到的所有估计量都是一致的，但只有 Kulal 等人 (2019) 使用的经验估计量和 (1) 是无偏的。使用任意数量的样本 $n$ 以无偏方式评估 pass $@k$ 对于公平比较非常重要。例如，使用经验 pass $@1$ 估计 pass $\omega_{/}k=1-(1-$ $\mathrm{pass}(\varpi1)^{k}$ 为 $1-(1-\hat{p})^{k}$ ，如图 13 所示，会导致一致低估。即使当 $n>5k$ 时，差距也不会完全消失，结果可能会随着更多样本显得更好。这个估计量的解释是我们从 $n$ 个候选者中重复抽取 $k$ 个样本，但这 $k$ 个样本并不是独立的。

图 13:

(1) is unbiased, because it estimates the fail probability $(1!-!\mathrm{pass}\ @1)^{k}$ as the probability of drawing $k$ failed samples without replacement. To show this, note that $c$ , the number of correct samples that pass the unit tests, is distributed $\mathrm{Binom}(n,p)$ , where $p$ is pass $@1$ , and that (1) evaluates to 1 when $n-c<k$ . Then,

(1) 是无偏的，因为它估计失败概率 $(1-\mathrm{pass}\ @1)^{k}$ 为在不放回的情况下抽取 $k$ 个失败样本的概率。为了证明这一点，请注意 $c$ ，即通过单元测试的正确样本数量，服从 $\mathrm{Binom}(n,p)$ 分布，其中 $p$ 是 pass @1，并且当 $n-c<k$ 时 (1) 的值为 1。然后，

图 1: 模型架构示例 (Example Model Architecture)

Figure 13. Comparing the amount of bias and variance of two estimators of $\mathrm{pass}@k$ . While the top expression may look correct, it underestimates the true value by a considerable margin. The unbiased estimator may have a slightly higher variance initially but allows for a fair comparison across different numbers of samples.

图 13: 比较两个 $\mathrm{pass}@k$ 估计器的偏差和方差。虽然顶部表达式看起来可能是正确的，但它低估了真实值相当大的幅度。无偏估计器最初可能具有稍高的方差，但允许在不同样本数量之间进行公平比较。

图 1: 模型架构示例 (Example of Model Architecture)

B. Random Problems and Solutions from Codex-12B

B. 随机问题及解决方案来自 Codex-12B

We show 8 random problems from HumanEval along with 8 random samples per problem generated from Codex-12B at temperature 0.8.

我们展示了来自 HumanEval 的 8 个随机问题，以及每个问题由 Codex-12B 在温度为 0.8 下生成的 8 个随机样本。

COMPLETION 2 (WRONG): ### return s.split()

完成 2 (错误)： ### 返回 s.split()

COMPLETION 5 (WRONG): ### return s.split()

完成 5 (错误)： ### 返回 s.split()

COMPLETION 6 (WRONG): ### return s.split()

完成 6 (错误)： ### return s.split()

COMPLETION 7 (WRONG): ### return s.split(’,’)

完成 7 (错误)： ### return s.split(’,’)

COMPLETION 8 (WRONG): ### return s.split(sep=None)

完成 8 (错误)： ### 返回 s.split(sep=None)

def is_prime(n): """Return true if a given number is prime, and false otherwise. >>> is_prime(6) False >>> is_prime(101) True >>> is_prime(11) True >>> is_prime(13441) True >>> is_prime(61) True >>> is_prime(4) False >>> is_prime(1) False

def is_prime(n):
    """如果给定的数字是素数，则返回 true，否则返回 false。
    >>> is_prime(6)
    False
    >>> is_prime(101)
    True
    >>> is_prime(11)
    True
    >>> is_prime(13441)
    True
    >>> is_prime(61)
    True
    >>> is_prime(4)
    False
    >>> is_prime(1)
    False
    """

"""

return True

返回 True

COMPLETION 3 (WRONG): ### $#$ TODO: implement this function

完成 3 (错误)：### $#$ TODO: 实现此功能

pass

通过

根据您的要求，pass 在此上下文中翻译为中文的 “通过”。但是，如果 pass 是编程语句（如在 Python语言中），则应保留为 pass。请确认具体上下文。

return True # if helper returns true, then n is prime, else n is not prime

返回 True # 如果辅助函数返回 True，则 n 是质数，否则 n 不是质数

COMPLETION 6 (CORRECT): ### prime $=$ True

完成 6 (正确)： ### prime $=$ True

$\mathbf{i}\pmb{\ f}\texttt{n}==\textbf{1}$ : return False

$\mathbf{i}\pmb{\ f}\texttt{n}==\textbf{1}$ : 返回 False

COMPLETION 8 (CORRECT): ### if $\mathrm==~2$ or $\mathrm==\mathrm$ :

完成 8 (正确)：### 如果 $\mathrm==~2$ 或 $\mathrm==\mathrm$ ：

return True if $\mathrm{n}<~2$ or n % 2 == 0: return False

如果 $\mathrm{n}<~2$ 或 n % 2 == 0: 返回 False

返回 True 如果 $\mathrm{n}<~2$ 或 n % 2 == 0: 返回 False

$\texttt{i f n}<\texttt{Q}$ : return True $\mathbf{i},\pmb{\mathscr{\mathtt{f}}},\mathfrak{n},\stackrel{\circ}{\widehat{\mathtt{e}}},\mathfrak{z},\mathfrak{z};=;0$ : return False $\v{r}=$ int $(,\mathrm n\star\star0,.,5,)$ )

\$\texttt{i f n}<\texttt{Q}\$  : 返回 True  \$\mathbf{i}\,\pmb{\mathscr{\mathtt{f}}}\,\mathfrak{n}\,\stackrel{\circ}{\widehat{\mathtt{e}}}\,\mathfrak{z}\,\mathfrak{z}\;=\;0\$  : 返回 False  \$\v{r}=\$   int  \$(\,\mathrm n\star\star0\,.\,5\,)\$  )

while $\mathrm{~~\textit~~{f}~~}{<}=\mathrm{~~\textit~{r}~~}$ : $\mathbf{i},\pmb{\mathscr{\mathtt{f}}},\mathrm{n},\stackrel{\circ}{\widehat{\mathtt{s}}},\mathrm{~~\bff};=;\mathrm{0}$ : return False $\begin{array}{r}{\mathbf{if};;\textrm{n};;\mathfrak{s};;;(;\underline{{\tau}}!+!2,);;;=;;0:}\end{array}$ return False

while   \$\mathrm{~\textit~{~f~}~}{<}=\mathrm{~\textit~{~r~}~}\$  :  \$\mathbf{i}\,\pmb{\mathscr{\mathtt{f}}}\,\mathrm{~n~}\,\stackrel{\circ}{\widehat{\mathtt{s}}}\,\mathrm{~\bf~f~}\;=\;\mathrm{~0~}\$  : return False  \$\begin{array}{r}{\mathbf{if}\;\;\textrm{n}\;\;\mathfrak{s}\;\;\;(\;\underline{{\tau}}\!+\!2\,)\;\;\;=\;\;0:}\end{array}\$  return False

f $\scriptstyle+=\ 6$ return True

f $\scriptstyle+=\ 6$ 返回 True

def add elements(arr, k): """ Given a non-empty array of integers arr and an integer k, return the sum of the first k element that has at most two digits.

def add_elements(arr, k):
    """
    给定一个非空整数数组 arr 和一个整数 k，返回前 k 个元素中最多有两个数字的元素之和。
    """

Example: Input: arr $=$ [111,21,3,4000,5,6,7,8,9], $k\ =\ 4$ Output: 24 # sum of $21+3$

输入: arr = [111,21,3,4000,5,6,7,8,9], k = 4
输出: 24 # 21+3 的和

Constraints: 1. 1 <= len(arr) $<=\ \mathbb{1},0,0$ 2. 1 <= k <= len(arr)

约束条件：1. 1 <= len(arr) $<=\ \mathbb{1},0,0$ 2. 1 <= k <= len(arr)

"""

COMPLETION 1 (WRONG):

完成 1 (错误):

📐

📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐📐

pass

通过

。

COMPLETION 2 (WRONG): ### for i in range(0, k): if len(str(arr[i])) < 3: $\mathrm}=\mathrm}-\mathrm~}$ continue else: return arr[i]

完成 2 (错误)：

for i in range(0, k):
    if len(str(arr[i])) < 3:
        k = k - 1
        continue
    else:
        return arr[i]

return None start $=\texttt{k}\star\texttt{-1}-\texttt{1}$ end $\begin{array}{r l r}{=}&{{}\mathbf{k}}&{\star}&{-1}\end{array}$

返回 None 开始 $=\texttt{k}\star\texttt{-1}-\texttt{1}$ 结束 $\begin{array}{r l r}{=}&{{}\mathbf{k}}&{\star}&{-1}\end{array}$

arr $=$ arr[start:end] return sum(arr)

arr = arr[start:end] 返回 sum(arr)

COMPLETION 5 (CORRECT): ### $\mathtt\ \ k\ \ <=\ \mathtt\left(\mathtt\right)$ :

完成 5 (正确)：### $\mathtt\ \ k\ \ <=\ \mathtt\left(\mathtt\right)$ :

return sum

返回总和

else: return "Number of elements are less than k"

否则：返回 “元素数量少于 k”

COMPLETION 6 (WRONG): ### return

完成 6 (错误)： ### 返回

COMPLETION 7 (WRONG): ### return int(str(sum(arr[:k]))[:k+1])

完成 7 (错误)： ### return int(str(sum(arr[:k]))[:k+1])

COMPLETION 8 (WRONG): ### global_sum $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$

完成 8 (错误)：### global_sum $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$

翻译部分：

完成 8 (错误)：### global_sum $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$

for i in range(0, k): if i $\c=~0$ : global_sum $=$ arr[0] else:

for i in range(0, k): 如果 i  \$\c=~0\$  : global_sum  \$=\$  arr[0] 否则:

if len(str(arr[i])) $<=~2$ : global_sum $+=$ arr[i] return global_sum

如果 len(str(arr[i])) <= 2 ：global_sum += arr[i]
返回 global_sum

def vowels count(s): """Write a function vowels count which takes a string representing a word as input and returns the number of vowels in the string. Vowels in this case are ’a’, e $\prime,_{U},^{\prime}$ Here, ’y’ is also a vowel, but only when it is at the end of the given word.

def vowels_count(s):
    """
    编写一个函数 vowels_count，它以表示单词的字符串作为输入，并返回字符串中元音的数量。
    在这种情况下，元音是 ’a’, 'e', 'i', 'o', 'u'。这里，’y’ 也是元音，但只有当它位于给定单词的末尾时才算。
    """

3 """

"""

for i in s: if i in vowels: $\mathrm{v}~+=~1$

对于 i 在 s 中：如果 i 在元音中：$\mathrm{v}~+=~1$

return v

返回 v

COMPLETION 2 (WRONG): ### vowels_set $=$ {’a’, ’e’, ’i count $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$

完成 2 (错误)：### vowels_set $=$ {’a’, ’e’, ’i count $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$

COMPLETION 3 (WRONG): ### vowels list $=$ [’a’, ’e’, ’i’, o vowels count $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$

完成 3 (错误)：

元音列表 $=$ [’a’, ’e’, ’i’, o
元音计数 $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$

for char in s: if char in vowels list: vowels count $+=~1$

对于 s 中的每个 char：如果 char 在元音列表中：元音计数 $+=~1$

return vowels count ### COMPLETION 4 (WRONG): ### return 0

返回元音计数 ### COMPLETION 4 (WRONG): ### 返回 0

COMPLETION 5 (WRONG): ### vowels $=$ ["a", "e", "i", "o", "u", "y"] num_vowels $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$

完成 5 (错误)：

元音 $=$ ["a", "e", "i", "o", "u", "y"]
num_vowels $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$

COMPLETION 6 (WRONG): ### vowels $=$ "aeiou" return sum(vowels.count(i) for i in s)

完成 6 (错误)：

元音 $=$ "aeiou"
返回 sum(元音.count(i) for i in s)

return count

返回计数

COMPLETION 1 (WRONG): ### if a == 0 or b == 0:

完成 1 (错误)： ### 如果 a == 0 或 b == 0：

return (a - b) * 10 + b

返回 (a - b) * 10 + b

COMPLETION 2 (WRONG): ### #return sum([a * b for a, b in zip(str(a), str(b))

完成 2 (错误)：

#return sum([a * b for a, b in zip(str(a), str(b))

COMPLETION 3 (WRONG): ### $#$ Edge case: a and b are both 0. Return 1. $\mathbf\pmb;;\mathrm;=;;0$ and b $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$ :

完成情况 3 (错误)：###

$#$ 特殊情况：a 和 b 都为 0 。返回 1。 $\mathbf{i}\pmb{\ f};;\mathrm{a};=;;0$ and b $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$ :

$#$ Convert to strings so we can convert digits to characters

$#$ 转换为字符串，这样我们可以将数字转换为字符

a_str $=$ str(a) b_str $=$ str(b)

a_str = str(a)
b_str = str(b)

$#$ Initialize carry carry $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$

$#$ 初始化进位进位 $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$

初始化进位 (carry) 为 0。

$#$ Initialize result to be empty result = ""

初始化结果为空
result = ""

COMPLETION 4 (WRONG): ### result $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$

完成 4 (错误)： ### 结果 $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad=\quad0$

根据上述规则，特殊字符和公式保持原样。

for i in range(0, min(len(str(a)), len(str(b)))): result $+=$ abs(a//10**i - b//10**i) $\star\quad\underline{{{1}}}\ 0\star\star\underline{{{1}}}$ return result

for i in range(0, min(len(str(a)), len(str(b)))): result  \$+=\$   abs(a//10**i - b//10**i)   * 1 0 * 1  
return result

prod[i] $+=$ a % 10 * b % 10 a = a // 10 b = b // 10

prod[i] $+=$ a % 10 * b % 10
a = a // 10
b = b // 10

return prod[0] $\star,\begin{array}{r l r l}{{1},0}&{{},+}&{}\end{array}$ prod[1] ### COMPLETION 6 (WRONG): ### if a $<\ 0$ or b $<\ 0$ :

返回 prod[0] $\star,\begin{array}{r l r l}{{1},0}&{{},+}&{}\end{array}$ prod[1] ### 完成 6 （错误）： ### 如果 a $<\ 0$ 或 b $<\ 0$ ：

完成 6 （错误）：

如果 a $<\ 0$ 或 b $<\ 0$ ：
返回 prod[0] $\star,\begin{array}{r l r l}{{1},0}&{{},+}&{}\end{array}$ prod[1]

a $=$ abs(a) b $=$ abs(b)

a = abs(a) b = abs(b)

for i in string: total $+=$ int(i) return total

for i in string:
    total += int(i)
return total

def even odd palindrome(n):"""

def 偶数奇数回文 (n):
"""

Example 1:

示例 1:

Note: 1. $!!!!!!!!!!!\lambda!!!!!!!!!!!!^{!}!!!!!!!!!!!!2!!!!!!!!!!!!!!<!!!!!!!!!!!!!!!!!<!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\$ 2. returned tuple has the number of even and odd integer palindr

[论文翻译]评估训练于代码的大语言模型

原文地址：https://arxiv.org/pdf/2107.03374