How is Google using AI for internal code migrations?
Stoyan Nikolov∗, Daniele Codecasa∗, Anna Sjövall∗, Maxim Tabachnyk∗, Satish Chandra∗, Siddharth Taneja†, Celal Ziftci† ∗Google Core, †Google Ads Email: {stoyannk, cdcs, annaps, tabachnyk, chandra satish, tanejas, ziftci}@google.com
Abstract—In recent years, there has been a tremendous interest in using generative AI, and particularly large language models (LLMs) in software engineering; indeed there are now several commercially available tools, and many large companies also have created proprietary ML-based tools for their own software engineers. While the use of ML for common tasks such as code completion is available in commodity tools, there is a growing interest in application of LLMs for more bespoke purposes. One such purpose is code migration.
This article is an experience report on using LLMs for code migrations at Google. It is not a research study, in the sense that we do not carry out comparisons against other approaches or evaluate research questions/hypotheses. Rather, we share our experiences in applying LLM-based code migration in an enterprise context across a range of migration cases, in the hope that other industry practitioners will find our insights useful. Many of these learnings apply to any application of ML in software engineering. We see evidence that the use of LLMs can reduce the time needed for migrations significantly, and can reduce barriers to get started and complete migration programs.
I. INTRODUCTION
Google Product Areas (PAs) such as Ads, Search, Workspace and YouTube are mature software development organizations that perhaps resemble many Fortune 500 companies in terms of software development challenges:
Over the years, Google as a whole has adopted software engineering principles – monorepo, analysis tools, a rigorous code review, CI/CD etc. – that have served all PAs well. See [26].
In the era of LLMs, the practice of software engineering is going through an industry-wide transformation. This profound change is taking place at Google as well. Google uses AI technologies in software engineering internally at two levels:
First is the generic AI-based tooling for software development that is designed for all Googlers across all PAs. These technologies, built by Google for Google, include code completion, code review, question answering, and so on. The generic AI-based development tools have been very successful at Google (see Section II) as well as in the external community, thanks to the work of GitHub and other companies [19, 7]. However, by necessity the IDE-based tooling is designed for “mass market” use, where the UX is paramount and, typically, the value added per interaction is small but the interaction happens many times (e.g. code completion). This is the space that mass-market, generic tools optimize for. We have previously talked about this in blog posts and papers [23, 3, 11].
Second is bespoke solutions for each PA. Examples include specific code migrations, code efficiency optimization, and test generation tasks, where we have effectively used LLMs in custom (or “bespoke”) ways. As opposed to mass market, or generic tools, the number of interactions may be smaller for bespoke tools, but the complexity of each interaction is often higher. Furthermore, the emphasis here is often to accomplish an end-to-end task rather than just being an IDE convenience. For instance, in code migration at scale, the need is to be able to perform repo-level change correctly and consistently. These need custom solutions. They might use the same primitives as mass market tools, but the goals are different.
This article focuses on such bespoke solutions, specifically on the migration workloads that we see from Google product units. We describe the setting of migration projects in Google PAs, and the challenges we address in making LLM-based code migrations deliver the intended business value. The measure of success we adopt is whether there is at least a 50% acceleration in task completion rate in an ongoing project (see Section III).
Section IV covers a case study from the Google Ads PA. The Google Ads business is one of the largest in the world and is built on a code base of more than 500 million lines of code. The Ads code base uses several IDs that were defined as 32-bit integers but need to be converted to 64 bits to avoid negative rollover scenarios that could cause outages. With the LLM-based approach (see details in Section IV), we are on track to achieve our migration targets, surpassing the success metric of 50% or better acceleration.
Building on the success of the Ads experience, the number of different migration initiatives around Google has been increasing. In the paper, we discuss three other distinct migration problems arising from different Google PAs:
• JUnit3 to JUnit4 migration (Section V)
• Joda time to Java time migration (Section VI)
While the above are a good representative set, we have worked on several additional migrations: in fact, through 2024 the number of change lists (similar to pull requests in Git) from migration efforts across Ads and other product areas has steadily increased (see Fig 1) and the types of changes supported have expanded significantly. This has created an ecosystem that allows the teams to employ economies of scale, slotting new migrations into already proven workflows.
Fig. 1. Landed change lists of AI-powered migrations for the first 3 quarters of 2024.
Not only did the use of LLMs accelerate these migrations; from an organizational viewpoint, we have also been able to complete complex migrations that had been stalled for several years and required continued attention from the business. We have completed efforts that spanned several teams using a handful of engineers and saved the business hundreds of engineers' worth of work.
Achieving success in LLM-based code migration is not straightforward. The use of LLMs alone through simple prompting is not sufficient for anything but the simplest of migrations. Instead, as we found through our journeys, and as described in the case studies in this paper, a combination of AST-based techniques, heuristics, and LLMs is needed to achieve success. Moreover, rolling out the changes in a safe way to avoid costly regressions is also important.
Although each migration is different, and requires bespoke work, they often follow similar patterns. The cases described in this paper show a set of such patterns that, we believe, will continue to appear in future migration projects. We believe that the techniques described are not Google-specific and we expect that they can be applied to any LLM-powered code migration at large enterprises.
To help with the challenges of migrations and to leverage the commonality that we see across the many cases, we have developed a common toolkit (Sec IV-2) that we use for the code changes and for the techniques that find the relevant files to change. The project-specific customization comes from the LLM prompts used, as well as from the validation steps for the code changes and the review and rollout phases, which are still largely human-driven.
The rest of the paper is organized as follows: Section II briefly recaps the generic code AI technologies that we use at Google. Section III talks about the bespoke technologies, focusing on code migration. Following this section, there are four sections, each describing a different code migration case study. Finally, Section VIII discusses our learning and takeaways from our experience.
II. GENERIC AI TOOLS IN GOOGLE INTERNAL SOFTWARE DEVELOPMENT
Ever since the advent of powerful transformer-based models, we started exploring how to apply LLMs to software development. LLM-based inline code completion is the most popular application of AI applied to software development: it is a natural application of LLM technology to use the code itself as training data. The UX feels natural to developers since word-level autocomplete has been a core feature of IDEs for many years. Also, it’s possible to use a rough measure of impact, e.g., the percentage of new characters written by AI. For these reasons and more, it made sense for this application of LLMs to be the first to deploy.
We have seen continued fast growth similar to other enterprise contexts [7], with an acceptance rate by software engineers of 38% and AI assistance in the completion of 67% of code characters [3]. The latter is defined as the number of accepted characters from AI-based suggestions divided by the sum of manually typed characters and accepted characters from AI-based suggestions. In other words, the number of characters in the code that are completed with AI-based assistance is now higher than the number manually typed by developers. While developers still need to spend time reviewing suggestions, they have more time to focus on code design.
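The character-level metric defined above can be written down directly. The following is a minimal sketch of that ratio only; the function name and the sample counts are illustrative assumptions, not Google's internal tooling or data.

```python
def ai_completion_fraction(accepted_ai_chars: int, manually_typed_chars: int) -> float:
    """Accepted AI-suggested characters divided by all characters written.

    "All characters" is the sum of manually typed characters and accepted
    characters from AI-based suggestions, as in the definition in the text.
    """
    total = accepted_ai_chars + manually_typed_chars
    if total == 0:
        return 0.0  # nothing was written, so there is nothing to attribute
    return accepted_ai_chars / total


# Hypothetical counts, chosen only to illustrate the arithmetic.
fraction = ai_completion_fraction(accepted_ai_chars=670, manually_typed_chars=330)
```

Any value above 0.5 means more characters came from accepted suggestions than from manual typing, which is the situation the paragraph describes.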
Key improvements came from both the models — larger models with improved coding capabilities, heuristics for constructing the context provided to the model, as well as tuning models on usage logs containing acceptances, rejections and corrections — and the UX. This cycle is essential for learning from practical behavior, rather than synthetic formulations.
Fig. 2. Improving AI-based features in coding tools (e.g., in the IDE) with historical high quality data across tools and with usage data capturing user preferences and needs.
We use our extensive and high quality logs of internal software engineering activities across multiple tools, which we have curated over many years. This data, for example, enables us to represent fine-grained code edits, build outcomes, edits to resolve build issues, code copy-paste actions, fixes of pasted code [9], code reviews, edits to fix reviewer issues, and change submissions to a repository. The training data is an aligned corpus of code with task-specific annotations in input as well as in output. The design of the data collection process, the shape of the training data, and the model that is trained on this data were described in our DIDACT [17] blog. We continue to explore these powerful datasets with newer generations of foundation models available to us (discussed more below).
Our next significant deployments were resolving code review comments [10] (>8% of which are now addressed with AI-based assistance) and automatically adapting pasted code [9] to the surrounding context (now responsible for ~2% of code in the IDE). Further deployments include instructing the IDE to perform code edits with natural language and predicting fixes to build failures [15]. Other applications that follow a similar pattern, e.g., predicting tips for code readability [25], are also possible.
Together, these deployed applications have been successful, highly-used applications at Google, with measurable impact on productivity in a real, industrial context.
Fig. 3. A demonstration of how a variety of AI-based features can work together to assist with coding in the IDE. top: code completion, middle: adjusting copy-pasted code to the context, bottom: code edits based on natural language instructions. See our blog for more details [3].
III. BESPOKE USE OF LLMS FOR CODE MIGRATION
In this section we describe some of the LLM-powered efforts that Google implemented to address lingering technical debt and code migrations within several PAs.
The need for code migration is not new at Google (nor elsewhere). At Google’s monorepo scale, special infrastructure for large-scale changes [26] was, and still is, used. This has allowed huge migrations such as programming language version changes, API deprecations, etc. at a fraction of the cost of doing them manually. That infrastructure uses static analysis and tools like Kythe [16], Code Search [26] and ClangMR [5].
However, for the kinds of code migrations that we wish to accomplish, these “deterministic” code change solutions have not proven to be quite as effective. This is because the contextual clues and the actual changes to be made have quite a bit of variance, and these are difficult to write out in a deterministic code transformation pass. This is exactly where modern LLMs are very effective, and in practice, they offer a lower barrier to entry compared to devising other customized program transformation systems, such as those based on program synthesis. Our goal is to find opportunities where LLMs provide additional value by not requiring hard-to-maintain AST (abstract syntax tree)-based transformations and, at the same time, have a scale of application that justifies the work. While AST-based techniques offer deterministic change generation, the cases we want to tackle can span complex code constructs for which AST transformations covering all cases would be hard to implement. At a high level, the use of an LLM prompt to make a task-specific code change might look like the following:
Where rule refers to an informal description of what needs to be accomplished; as opposed to a precise AST-level pattern matching and transformation.
We use LLM prompting to build common workflows that contain bespoke, customised parts—per-task instructions to the model itself—as well as some of the sub-steps in the code migration like the file discovery and validations. This approach allows us to quickly onboard new use cases, because we built a palette of reusable sub-steps which we can combine and adapt.
We emphasize that LLMs are only one part of the complete solution, which includes some AST and symbol-based techniques, and some purely process issues such as change rollout. The LLM role is focused on the edit generation. The parts where we need to identify locations at which to make changes, and where we need to validate that the right thing took place, are handled mostly using deterministic AST techniques, with a few cases supported by the LLM as well. In Fig 4 we provide a conceptual diagram of the step-by-step process. After the opportunities for the change are discovered, the changes are generated using the LLM and a loop begins that iterates on tests and other validations until the changes are deemed good. Humans review the LLM-generated code the same way as any other code and they add any missing tests to cover the changed lines. The final step, landing the changes (executing them in a production environment), depends on the product they are part of.
As mentioned in Section I, for each migration described we have defined success as AI saving at least 50% of the time for the end-to-end work. This time is not only the change generation itself (the code rewrite), but also finding the places to migrate, doing reviews and rolling out the changes.
This is notably different from the success metric typically used in application of generic technologies (such as code completion, where we typically report percent of code written by AI, or the acceptance rate). We found that while the ratio of code generated by AI that gets committed is a good proxy for the time savings in most cases, it does not always capture all the value. Anecdotal remarks from developers suggest that even if the changes are not perfect, there is a lot of value in having an initial version of the changelist already created, through which developers can quickly find the places where the changes are needed. Thus, we believe that success should be assessed based on time saving on the end-to-end work. The impact on code quality is also important. Currently the quality is assured through the manual review process of the code - same as for purely human-generated code. Whether there is a long-term impact on quality remains to be seen.
Fig. 4. The high-level process to land an AI-authored change in the monorepo. We use LLMs extensively in code change creation, and partly in discovery and validation phases.
In all of our case studies, the speed-up provided by the AI on the code changes alone was higher, but we believe that anchoring the success metric on the whole development journey offers a clearer impact goal and is aligned with the business outcome, rapid completion of the migrations. For fully completed migrations, we were able to estimate accurately the time savings (mostly based on historical data), while for some of the ongoing ones (such as Joda time API migration, Section VI) we relied on the expert engineers estimating a time-saving for a set of change lists created by the tools and we extrapolated from that.
IV. INT32 TO INT64 ID MIGRATION
We previously shared information about this work in this blog post [18].
Google Ads has dozens of numerical unique “ID” types used as handles (for users, merchants, campaigns, etc.), and these IDs were originally defined as 32-bit integers in C++ and Java. But with the current growth in the number of IDs, we expect them to overflow the 32-bit capacity much sooner than originally expected.
Although some migration was done initially manually, we decided to employ an LLM-powered code migration flow, described below, to accelerate the work. There are several challenges related to the code changes that needed addressing:
• The IDs are often defined as generic numbers (int32_t in C++ or Integer in Java) and are not of a unique, easily searchable type, which makes the process of finding them through static tooling non-trivial.
• There are tens of thousands of code locations across thousands of files where these IDs are used.
The full effort, if done manually, was expected to require hundreds of software engineering years and complex cross-team coordination. The approach Google has taken for such cases is to have a central team that drives the migration. We did the same in this case and devised the following workflow:
We found that 80% of the code modifications in the landed CLs were fully AI-authored, and the rest were human-authored or edited from the AI suggestions. We calculate the percentage of AI-authored code by tracking the state of the change lists (similar to pull requests in git). The first version (known internally as a snapshot) of the changelist is the one generated by the LLM-powered tooling. The version actually committed to the repo is compared to this snapshot and a per-character difference is calculated.
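The per-character comparison between the AI-generated snapshot and the committed version can be approximated with a standard sequence diff. The sketch below uses Python's `difflib` as a stand-in; the internal tooling and the exact diff algorithm are not described in the text, so the function name and the choice to normalize by the committed length are assumptions.

```python
import difflib


def ai_authored_fraction(snapshot: str, committed: str) -> float:
    """Share of committed characters that survive unchanged from the snapshot.

    `snapshot` is the first, tool-generated version of the change and
    `committed` is the version that landed. Both are plain strings here.
    """
    if not committed:
        return 0.0
    # autojunk=False disables difflib's popularity heuristic, which could
    # otherwise ignore frequent characters in long inputs.
    matcher = difflib.SequenceMatcher(None, snapshot, committed, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(committed)


# Hypothetical example: a reviewer renamed one variable before landing.
snapshot = "int64_t id = request.campaign_id();"
committed = "int64_t campaign_id = request.campaign_id();"
share = ai_authored_fraction(snapshot, committed)
```

A change that lands exactly as generated scores 1.0, and every character a human adds or rewrites lowers the score.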
We discovered that in most cases, the human needed to revert at least some changes the model made that were either incorrect or not necessary. Given the complexity and sensitive nature of the modified code, effort has to be spent in carefully rolling out each change to users. This observation led to further investment in LLM-driven verification (described later) to reduce this burden.
The total time spent on the migration was reduced by an estimated 50% as reported by the engineers doing the migration, when compared to a similar exercise carried out without LLM assistance. In this calculation, we take into account the time needed to review and land the changes. There was also a significant reduction in communication overhead, as a single engineer could generate all necessary changes.
We will now discuss some detail about each of the steps in the workflow for this migration:
- Finding code locations to modify: In this migration, the main target is usually one or more fields defined in a protocol buffer [20], a standard mechanism for serializing structured data. We use a variety of techniques to identify the final files and code lines to modify.
First, we start with the manual identification of the protocol buffer fields for an ID, called a “seed”. Then, we use Kythe [16] to find references to the seed in the entire Google codebase, called “direct references”. This process is repeated 3 times, so that references-to-references are also found automatically in a breadth-first-search fashion, where each level is farther away from the initial seed.
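The level-by-level expansion from a seed can be sketched as a depth-limited breadth-first search. In the sketch below a plain dictionary stands in for the Kythe cross-reference index, and reading "repeated 3 times" as a maximum depth of 3 is an assumption; the symbol and location names are hypothetical.

```python
from collections import deque


def expand_references(seed, references, max_depth=3):
    """Return {location: depth} for everything reachable from `seed`.

    `references` maps a symbol or location to the locations that refer to it.
    Depth 1 holds the direct references; deeper levels are
    references-to-references, up to `max_depth`.
    """
    depth_of = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        depth = depth_of[node]
        if depth == max_depth:
            continue  # do not expand beyond the last allowed level
        for ref in references.get(node, ()):
            if ref not in depth_of:  # first visit is the shortest distance
                depth_of[ref] = depth + 1
                queue.append(ref)
    del depth_of[seed]  # the seed itself is not a reference
    return depth_of


# Hypothetical reference graph for a seed field named "CampaignId".
toy_index = {
    "CampaignId": ["a.java:10", "b.cc:42"],
    "a.java:10": ["c.java:7"],
    "c.java:7": ["d.java:3"],
    "d.java:3": ["e.java:1"],
}
candidates = expand_references("CampaignId", toy_index)
```

The result is deliberately a superset: it records how far each location is from the seed and leaves the decision about whether it needs a change to later filters.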
The result of this Kythe search is a superset of files and lines that may potentially need to be modified. We filter this superset to identify the locations to be modified accurately, before the files are passed to our LLM for the actual modification. To improve accuracy, we use various pluggable and extensible strategies that help decide whether a specific location needs to be migrated. First, we pass the locations through regular expressions to identify if production code contains any casting in different languages, e.g. (int) seed.getLong(). Such locations are tagged as to-be-migrated. Then, for test code, we check whether values passed as IDs are larger than the 32-bit space, e.g. seed.setLong(5L). When they fit the 32-bit space, they are marked as to-be-modified too, since we want to execute tests with values larger than the maximum 32-bit value.
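The two regular-expression filters described above can be sketched as follows. The patterns are illustrative assumptions that cover only the forms mentioned in the text (a narrowing cast in production code and a small integer literal passed to a setter in test code); the actual filters are pluggable and are not reproduced here.

```python
import re

INT32_MAX = 2**31 - 1

# Narrowing casts, e.g. `(int) seed.getLong()` in Java or
# `static_cast<int32_t>(...)` in C++.
NARROWING_CAST = re.compile(r"\(\s*int\s*\)|static_cast<\s*(?:int|int32_t)\s*>")

# Integer literal passed to a setter, e.g. `seed.setLong(5L)`.
SETTER_LITERAL = re.compile(r"\.set\w*\(\s*(-?\d+)[lL]?\s*\)")


def needs_migration(line: str, is_test: bool) -> bool:
    """Tag a candidate line as to-be-migrated under the two simple rules."""
    if not is_test:
        return NARROWING_CAST.search(line) is not None
    match = SETTER_LITERAL.search(line)
    # Test values that still fit in 32 bits are tagged, so that the tests can
    # later be exercised with values above the 32-bit maximum.
    return match is not None and int(match.group(1)) <= INT32_MAX
```

Lines that match neither rule are not discarded at this point; in the workflow they go on to the additional filters described next.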
The remaining locations are passed through more filters that eliminate false positives with additional techniques. One approach is to parse the source code to check whether a given location contains an actual call relevant to the seed or not. As an example, a method/function definition may have been found with the Kythe references search, but usually function definitions are not relevant places to be changed for test files. Such locations are marked as irrelevant.
At the end of these filters, the remaining locations are kept to be passed to our LLM for migration. The human involvement in the process decreased during the development of the workflow but still remained high due to the prevalence of ambiguous locations to change.
- Code migration flow: To generate and validate the code changes we leverage a version of the Gemini model that we fine-tuned on internal Google code and data.
Each migration requires as input:
• A set of files and the locations of expected changes: path + line number in the file.
• One or two prompts that describe the change.
• [Optional] Few-shot examples to decide if a file actually needs migration.
Fig. 5. Example execution of the multi-stage code migration process.
We have developed a code migration toolkit that is used in the solutions described in this work. The toolkit is versatile and can be used for code migrations with varying requirements and outputs. All seed file change locations are provided by the user and collected through processes similar to the ones described above. The migration toolkit automatically expands this set with additional relevant files that can include: test files, interface files, and other dependencies. This step uses symbol cross-reference information.
In many cases, the set of files to migrate provided by the user is not perfect. It is not unusual for some files to have already been partially or completely migrated. Thus, to avoid redundant changes or confusing the model during edit generation we provide the model with few-shot examples and ask it to predict if a file needs to be migrated.
The edit generation and validation step is where we have found the most benefit from an ML-based approach. Our LLM was trained following the DIDACT [17] methodology on data from Google’s monorepo and processes. At inference time, we annotate each line where we expect a change is needed with a natural language instruction, and we also give the model a general instruction. In each model query, the input context contains one or multiple files related to each other, for example implementation files with headers, tests, interface declarations, etc. Fig 5 shows how related files are grouped together automatically and how their combined changes are later added to the set of changes comprising a “run” of the toolkit. The model predicts differences (diffs) for the files where changes are needed and will also change related sections so that the final code is correct. The instructions to the model are relatively simple (see Fig 6), but remind the model to also update the test files. Examples of consistent results can be seen in Fig 7 and Fig 8.
Fig. 6. Prompt for int32 to int64 migration. The model makes the change across all the file(s) even without being asked.
The model's ability to change related code beyond the annotated locations is critical for increasing migration velocity: the generated changes might not line up with the locations initially requested, but they still fulfill the intent. This reduces the need to manually find the full set of lines or files where changes are needed, and is a big step forward compared to purely deterministic change generation based on AST modifications.
Fig. 7. In the example above we prompt the model to only update the constructor of the class where the type has to change. In the predicted unified diff, the model correctly also fixes the private field and usages within the class.
Fig. 8. The model also updated the test file to use an integer value that does not fit in 32 bits.
Different combinations of prompts yield different results depending on the input context. In some cases, annotating too many locations where one might expect a change results in worse performance than specifying the change in just one place in the file and prompting the model to apply it to the file holistically.
As we apply changes across dozens and potentially hundreds of files, we implement a mechanism that generates prompt combinations that are tried in parallel for each file group. This is similar to a pass@k [4] strategy, where, rather than varying only the inference temperature, we also vary the prompting strategy.
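A rough sketch of trying several prompt combinations in parallel for one file group is given below. The way variants are formed (all annotated locations versus a single location, crossed with alternative general instructions) and the `query_model` callable are assumptions; `fake_model` is a stub standing in for the real LLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def prompt_variants(locations, general_instructions):
    """Combine each general instruction with different location sets."""
    location_sets = [tuple(locations)]
    if len(locations) > 1:
        # A single hint, leaving the model to apply the change holistically.
        location_sets.append((locations[0],))
    return list(product(general_instructions, location_sets))


def generate_candidates(file_group, variants, query_model, max_workers=8):
    """Query the model once per variant, in parallel, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(query_model, file_group, instruction, locations)
            for instruction, locations in variants
        ]
        return [future.result() for future in futures]


def fake_model(file_group, instruction, locations):
    """Stub for the model call; returns a label instead of a diff."""
    return f"{instruction}|{len(locations)}"


variants = prompt_variants(
    [12, 40, 57],
    ["Update the type.", "Update the type and the tests."],
)
candidates = generate_candidates(["id_store.cc", "id_store.h"], variants, fake_model)
print(candidates)
```

Each returned candidate would then go through validation and scoring, so a variant that performs poorly on one file group does not block the migration of that group.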
We validate the resulting changes automatically. The validations are configurable and often depend on the migration. The two most common validations are building the changed files and running their unit tests. Each of the failed validation steps can optionally run an ML-powered “repair”. The model has also been trained on a large set of failed builds and tests paired with the diffs that then fixed them. For each of the build/test failures that we encounter, we prompt the model with the changed files, the build/test error and a prompt that requests a fix. With this approach, we observe that in a significant number of cases the model is able to fix the code.
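The validate-then-repair cycle can be sketched as a simple loop. The validator and repair callables, the `max_repairs` limit, and the stubbed build check below are all hypothetical; in the real system the validators would build the changed files and run their unit tests, and the repair step would query the model with the changed files and the error.

```python
def first_failure(change, validators):
    """Return the first error reported by a validator, or None if all pass."""
    for validate in validators:
        error = validate(change)
        if error:
            return error
    return None


def validate_with_repair(change, validators, repair, max_repairs=2):
    """Validate a change, asking `repair` for a fix after each failure.

    Returns (final_change, passed). `max_repairs` is an assumed budget.
    """
    repairs_used = 0
    while True:
        error = first_failure(change, validators)
        if error is None:
            return change, True
        if repairs_used == max_repairs:
            return change, False
        change = repair(change, error)
        repairs_used += 1


def fake_build(change):
    """Stub build step: fails until the variable is declared as long."""
    return None if "long id" in change else "type mismatch: expected long"


def fake_repair(change, error):
    """Stub for the ML-powered repair; a real one would prompt the model."""
    return change.replace("int id", "long id")


result = validate_with_repair("int id = parse(s);", [fake_build], fake_repair)
print(result)
```

Returning the pass/fail flag alongside the change lets the later scoring step prefer candidates that validated cleanly.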
As we generate multiple changes for each file group, we score them based on the validations and at the end decide which change to propagate back to the final change list (similar to a pull request in Git). There are different strategies for ranking changes and selecting the best one, and which to use depends on the use case. A commonly good approach is to provide examples of the expected diff and to score changes higher the closer they are to the provided examples. Closeness can be defined by standard code/text distance metrics, or by querying Gemini to ask which of the changes best matches the given examples.
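As one possible instance of the distance-based ranking, the sketch below scores each candidate diff by its best similarity to the example diffs, using `difflib.SequenceMatcher` from the Python standard library as the text distance metric. The choice of metric, the preference for candidates that passed validation, and the sample diffs are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity_to_examples(candidate_diff, example_diffs):
    """Best similarity ratio (0.0 to 1.0) between a candidate and any example."""
    return max(
        SequenceMatcher(None, candidate_diff, example).ratio()
        for example in example_diffs
    )


def pick_best(candidates, example_diffs):
    """Pick the candidate closest to the examples.

    `candidates` is a list of (diff_text, passed_validation) pairs.
    Candidates that passed validation are preferred when any exist.
    """
    passing = [diff for diff, passed in candidates if passed]
    pool = passing or [diff for diff, _ in candidates]
    return max(pool, key=lambda diff: similarity_to_examples(diff, example_diffs))


example_diffs = ["-  int32 id;\n+  int64 id;"]
candidates = [
    ("-  return 0;\n+  return 1;", True),
    ("-  int32 count;\n+  int64 count;", True),
    ("-  int32 id;\n+  int64 id;", False),  # matches the example but failed validation
]
best = pick_best(candidates, example_diffs)
print(best)
```

The third candidate is textually identical to the example but is skipped because it failed validation, so the closest passing candidate is selected instead.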
The process and toolkit described in Sec. IV-2 are mostly generic, and are used in the other migration exercises that we present next.
V. JUNIT3 TO JUNIT4 MIGRATION
Large codebases tend to have some parts that begin to fall behind the rest: they might use an older version of a library or API, or a framework that is being deprecated elsewhere. Google has had a one-version policy for years, which helps keep every dependency and library as fresh as possible. Nevertheless, as standards change, some migrations that cannot be fully automated tend to take quarters or even years.
A group of teams at Google had a substantial set of test files that used the now-outdated JUnit3 library. Updating all of them manually is a huge investment, and although the old tests are not on the critical path of development, they negatively affect the codebase. They are technical debt and tend to replicate themselves, as developers might inadvertently copy old code when writing new code.
We needed to make a decisive push to migrate a critical mass of these tests to the new JUnit4 library. Although for a human such a migration is relatively simple, doing so with purely AST-based techniques was deemed infeasible as there are just too many edge cases.
We used the LLM migration stack (described above) to run an automatic migration on the old tests. We already had a list of all the JUnit3 tests in the repository, so running over all of them was simple. While we modified each test file, the internal system for Large Scale Changes (LSCs, [26]) split the resulting change lists into smaller sets that were sent to the owners of the tests for review. An example change can be seen in Fig 10.