How is Google using AI for internal code migrations?
Stoyan Nikolov∗, Daniele Codecasa∗, Anna Sjövall∗, Maxim Tabachnyk∗, Satish Chandra∗, Siddharth Taneja†, Celal Ziftci† ∗Google Core, †Google Ads Email: {stoyannk, cdcs, annaps, tabachnyk, chandra satish, tanejas, ziftci}@google.com
Abstract—In recent years, there has been tremendous interest in using generative AI, and particularly large language models (LLMs), in software engineering; indeed, there are now several commercially available tools, and many large companies have also created proprietary ML-based tools for their own software engineers. While the use of ML for common tasks such as code completion is available in commodity tools, there is growing interest in applying LLMs for more bespoke purposes. One such purpose is code migration.
This article is an experience report on using LLMs for code migrations at Google. It is not a research study, in the sense that we do not carry out comparisons against other approaches or evaluate research questions/hypotheses. Rather, we share our experiences in applying LLM-based code migration in an enterprise context across a range of migration cases, in the hope that other industry practitioners will find our insights useful. Many of these learnings apply to any application of ML in software engineering. We see evidence that the use of LLMs can reduce the time needed for migrations significantly, and can reduce barriers to get started and complete migration programs.
I. INTRODUCTION
Google Product Areas (PAs) such as Ads, Search, Workspace and YouTube are mature software development organizations that perhaps resemble many Fortune 500 companies in terms of software development challenges:
Over the years, Google as a whole has adopted software engineering principles – monorepo, analysis tools, rigorous code review, CI/CD, etc. – that have served all PAs well. See [26].
In the era of LLMs, the practice of software engineering is going through an industry-wide transformation. This profound change is taking place at Google as well. Google uses AI technologies in software engineering internally at two levels:
First is the generic AI-based tooling for software development that is designed for all Googlers across all PAs. These technologies – built by Google for Google – include code completion, code review, question answering, and so on. The generic AI-based development tools have been very successful at Google (see Section II) as well as in the external community, thanks to the work of GitHub and other companies [19, 7]. However, by necessity the IDE-based tooling is designed for “mass market” use, where the UX is paramount and, typically, the value added per interaction is small but it happens many times (e.g., code completion). This is the space that mass-market, generic tools optimize for. We have previously talked about this in blog posts and papers [23, 3, 11].
Second is bespoke solutions for each PA. Examples include specific code migrations, code efficiency optimization, and test generation tasks, where we have effectively used LLMs in custom (or “bespoke”) ways. As opposed to mass-market, generic tools, the number of interactions may be smaller for bespoke tools, but the complexity of each interaction is often higher. Furthermore, the emphasis here is often on accomplishing an end-to-end task rather than just being an IDE convenience. For instance, in code migration at scale, the need is to be able to perform repo-level changes correctly and consistently. These need custom solutions. They might use the same primitives as mass-market tools, but the goals are different.
This article focuses on such bespoke solutions, specifically on the migration workloads that we see from Google product units. We describe the setting of migration projects in Google PAs, and the challenges we address in making LLM-based code migrations deliver the intended business value. The measure of success we adopt is whether there is at least a 50% acceleration in task completion rate in an ongoing project (see Section III).
Section IV in the paper covers a case study from the Google Ads PA. The Google Ads business is one of the largest in the world and is built on a code base of 500M+ lines of code. The Ads code base uses several IDs that were 32 bits, but need to be converted to 64 bits to avoid negative rollover scenarios that could cause outages. With the use of the LLM-based approach (see details in Section IV), we are on track to achieve our migration targets, surpassing the success metric of 50% or better acceleration.
Building on the success of the Ads experience, the number of different migration initiatives around Google has been increasing. In the paper, we discuss three other distinct migration problems arising from different Google PAs:
• JUnit3 to JUnit4 migration (Section V)
• Joda time to Java time migration (Section VI)
While the above are a good representative set, we have worked on several additional migrations: in fact, through 2024 the number of change lists (similar to pull requests in Git) from migration efforts across Ads and other product areas has steadily increased (see Fig 1) and the types of changes supported have expanded significantly. This has created an ecosystem that allows the teams to employ economies of scale, slotting new migrations into already proven workflows.

Fig. 1. Landed change lists of AI-powered migrations for the first three quarters of 2024.

The use of LLMs not only accelerated these migrations; from an organizational viewpoint, it also enabled us to complete complex migrations that had been stalled for several years and required continued attention from the business. We have completed efforts that spanned several teams using a handful of engineers and saved the business hundreds of engineers’ worth of work.
Achieving success in LLM-based code migration is not straightforward. The use of LLMs alone through simple prompting is not sufficient for anything but the simplest of migrations. Instead, as we found through our journeys, and as described in the case studies in this paper, a combination of AST-based techniques, heuristics, and LLMs is needed to achieve success. Moreover, rolling out the changes in a safe way to avoid costly regressions is also important.
Although each migration is different, and requires bespoke work, they often follow similar patterns. The cases described in this paper show a set of such patterns that, we believe, will continue to appear in future migration projects. We believe that the techniques described are not Google-specific and we expect that they can be applied to any LLM-powered code migration at large enterprises.
To help with the challenges of migrations and to leverage the commonality that we see across the many cases, we have developed a common toolkit (Sec IV-2) that we use for the code changes, along with the techniques to find the relevant files to change. The project-specific customization comes from the LLM prompts used, as well as from the validation steps for the code changes and the review and rollout phases, which are still largely human-driven.
The rest of the paper is organized as follows: Section II briefly recaps the generic code AI technologies that we use at Google. Section III talks about the bespoke technologies, focusing on code migration. Following this section, there are four sections, each describing a different code migration case study. Finally, Section VIII discusses our learning and takeaways from our experience.
II. GENERIC AI TOOLS IN GOOGLE INTERNAL SOFTWARE DEVELOPMENT
Since the advent of powerful transformer-based models, we have been exploring how to apply LLMs to software development. LLM-based inline code completion is the most popular application of AI applied to software development: it is a natural application of LLM technology to use the code itself as training data. The UX feels natural to developers since word-level autocomplete has been a core feature of IDEs for many years. Also, it’s possible to use a rough measure of impact, e.g., the percentage of new characters written by AI. For these reasons and more, it made sense for this application of LLMs to be the first to deploy.
We have seen continued fast growth similar to other enterprise contexts [7], with an acceptance rate by software engineers of 38%, assisting in the completion of 67% of code characters [3], defined as the number of accepted characters from AI-based suggestions divided by the sum of manually typed characters and accepted characters from AI-based suggestions. In other words, the number of characters in the code that are completed with AI-based assistance is now higher than the number manually typed by developers. While developers still need to spend time reviewing suggestions, they have more time to focus on code design.
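As a toy illustration of the metric defined above (accepted characters from AI suggestions divided by the sum of manually typed and accepted characters), the following sketch uses invented numbers, not the reported measurements:

```python
# Illustrative only: the completion-assistance metric as defined in the text.
# The figures passed in below are made up for demonstration.
def ai_completion_share(accepted_chars: int, manual_chars: int) -> float:
    """Fraction of code characters completed with AI-based assistance."""
    total = accepted_chars + manual_chars
    return accepted_chars / total if total else 0.0

# e.g. 670 accepted vs. 330 manually typed characters -> 0.67
share = ai_completion_share(670, 330)
```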
Key improvements came from both the models — larger models with improved coding capabilities, heuristics for constructing the context provided to the model, as well as tuning models on usage logs containing acceptances, rejections and corrections — and the UX. This cycle is essential for learning from practical behavior, rather than synthetic formulations.

Fig. 2. Improving AI-based features in coding tools (e.g., in the IDE) with historical high quality data across tools and with usage data capturing user preferences and needs.
We use our extensive and high quality logs of internal software engineering activities across multiple tools, which we have curated over many years. This data, for example, enables us to represent fine-grained code edits, build outcomes, edits to resolve build issues, code copy-paste actions, fixes of pasted code [9], code reviews, edits to fix reviewer issues, and change submissions to a repository. The training data is an aligned corpus of code with task-specific annotations in input as well as in output. The design of the data collection process, the shape of the training data, and the model that is trained on this data was described in our DIDACT [17] blog. We continue to explore these powerful datasets with newer generations of foundation models available to us (discussed more below).
Our next significant deployments were resolving code review comments [10] (>8% of which are now addressed with AI-based assistance) and automatically adapting pasted code [9] to the surrounding context (now responsible for ~2% of code in the IDE). Further deployments include instructing the IDE to perform code edits with natural language and predicting fixes to build failures [15]. Other applications, e.g., predicting tips for code readability [25] following a similar pattern, are also possible.
Together, these deployed applications have been successful, highly-used applications at Google, with measurable impact on productivity in a real, industrial context.

Fig. 3. A demonstration of how a variety of AI-based features can work together to assist with coding in the IDE. top: code completion, middle: adjusting copy-pasted code to the context, bottom: code edits based on natural language instructions. See our blog for more details [3].

III. BESPOKE USE OF LLMS FOR CODE MIGRATION
In this section we describe some of the LLM-powered efforts that Google implemented to address lingering technical debt and code migrations within several PAs.
The need for code migration is not new at Google (nor elsewhere). At Google’s monorepo scale, special infrastructure for large-scale changes [26] was, and still is, used. This has allowed huge migrations like programming language version changes, API deprecations, etc. at a fraction of the cost of doing them manually. That infrastructure uses static analysis and tools like Kythe [16], Code Search [26] and ClangMR [5].
However, for the kinds of code migrations that we wish to accomplish, these “deterministic” code change solutions have not proven to be quite as effective. This is because the contextual clues and the actual changes to be made have quite a bit of variance, and these are difficult to write out in a deterministic code transformation pass. This is exactly where modern LLMs are very effective, and in practice, they offer a lower barrier to entry compared to devising other customized program transformation systems, such as those based on program synthesis. Our goal is to find opportunities where LLMs will provide additional value by not requiring hard-to-maintain AST (abstract syntax tree)-based transformations, and at the same time, have a scale of application that justifies the work. While AST-based techniques offer deterministic change generation, the cases we want to tackle can span complex code constructs that would be hard to cover exhaustively with AST-based implementations. At a high level, the use of an LLM prompt to make a task-specific code change might look like the following:


Here, rule refers to an informal description of what needs to be accomplished, as opposed to a precise AST-level pattern matching and transformation.
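As a hypothetical sketch of that kind of prompt (the template wording, file path, and code snippet below are our own illustration, not the actual prompt used internally):

```python
# Hypothetical prompt template: an informal "rule" plus the file contents,
# rather than an AST-level pattern. Wording is illustrative only.
PROMPT_TEMPLATE = """You are migrating code.
Rule: {rule}
Apply the rule to the following file and output the updated file.

File: {path}
{contents}
"""

def build_prompt(rule: str, path: str, contents: str) -> str:
    return PROMPT_TEMPLATE.format(rule=rule, path=path, contents=contents)

prompt = build_prompt(
    rule="Change 32-bit ID fields to 64-bit and update affected call sites.",
    path="ads/campaign.cc",          # invented example path
    contents="int32_t campaign_id = GetId();",
)
```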
We use LLM prompting to build common workflows that contain bespoke, customized parts—per-task instructions to the model itself—as well as some of the sub-steps in the code migration, like the file discovery and validations. This approach allows us to quickly onboard new use cases, because we have built a palette of reusable sub-steps which we can combine and adapt.
We emphasize that LLMs are only one part of the complete solution, which includes some AST- and symbol-based techniques, and some purely process issues such as change rollout. The LLM role is focused on the edit generation. The parts where we need to identify locations at which to make changes, and where we need to validate that the right thing took place, are handled mostly using deterministic AST techniques, with a few cases supported by the LLM as well. In Fig 4 we provide a conceptual diagram of the step-by-step process. After the opportunities for the change are discovered, the changes are generated using the LLM and a loop begins that iterates on tests and other validations until the changes are deemed good. Humans review the LLM-generated code the same way as any other code, and they add any missing tests to cover the changed lines. The final step, landing the changes (executing them in a production environment), depends on the product they are part of.
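The discover–generate–validate loop of Fig 4 can be sketched as follows; every function name here is an illustrative stand-in, not the actual toolkit API:

```python
# Minimal sketch of the Fig. 4 workflow: generate an edit with an LLM, then
# loop on validations (build/tests) until the change is deemed good or the
# attempt budget is exhausted. All callables are hypothetical stand-ins.
def migrate_file(path, generate_edit, validate, max_attempts=3):
    edit = generate_edit(path, feedback=None)
    for _ in range(max_attempts):
        ok, feedback = validate(path, edit)
        if ok:
            return edit          # ready for human review and landing
        edit = generate_edit(path, feedback=feedback)  # retry with errors
    return None                  # fall back to manual migration

# Toy stand-ins that just exercise the control flow:
def fake_generate(path, feedback):
    return "fixed" if feedback else "broken"

def fake_validate(path, edit):
    return (edit == "fixed", "build error: type mismatch")

result = migrate_file("ads/user_id.cc", fake_generate, fake_validate)
```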
As mentioned in Section I, for each migration described we have defined success as AI saving at least 50% of the time for the end-to-end work. This time includes not only the change generation itself (the code rewrite), but also finding the places to migrate, doing reviews, and rolling out the changes.
This is notably different from the success metric typically used in applications of generic technologies (such as code completion, where we typically report the percent of code written by AI, or the acceptance rate). We found that while the ratio of code generated by AI that gets committed is a good proxy for the time savings in most cases, it does not always capture all the value. Anecdotal remarks from developers suggest that even if the changes are not perfect, there is a lot of value in having an initial version of the changelist already created, through which developers can quickly find the places where the changes are needed. Thus, we believe that success should be assessed based on time saved on the end-to-end work. The impact on code quality is also important. Currently, the quality is assured through the manual review process of the code, the same as for purely human-generated code. Whether there is a long-term impact on quality remains to be seen.

Fig. 4. The high-level process to land an AI-authored change in the monorepo. We use LLMs extensively in code change creation, and partly in discovery and validation phases.

In all of our case studies, the speed-up provided by the AI on the code changes alone was higher, but we believe that anchoring the success metric on the whole development journey offers a clearer impact goal and is aligned with the business outcome: rapid completion of the migrations. For fully completed migrations, we were able to estimate the time savings accurately (mostly based on historical data), while for some of the ongoing ones (such as the Joda time API migration, Section VI) we relied on the expert engineers estimating a time saving for a set of change lists created by the tools, and we extrapolated from that.
IV. INT32 TO INT64 ID MIGRATION
We previously shared information about this work in this blog post [18].
Google Ads has dozens of unique numerical “ID” types used as handles — for users, merchants, campaigns, etc. — and these IDs were originally defined as 32-bit integers in C++ and Java. But with the current growth in the number of IDs, we expect them to overflow the 32-bit capacity much sooner than originally expected.
Although some migration was done initially manually, we decided to employ an LLM-powered code migration flow, described below, to accelerate the work. There are several challenges related to the code changes that needed addressing:
• The IDs are often defined as generic numbers (int32_t in C++ or Integer in Java) and are not of a unique, easily searchable type, which makes the process of finding them through static tooling non-trivial.
• There are tens of thousands of code locations across thousands of files where these IDs are used.
The full effort, if done manually, was expected to require hundreds of software engineering years and complex cross-team coordination. The approach Google has had for such cases is to have a central team that drives the migration. That is what we did in this case as well, devising the following workflow:
We found that 80% of the code modifications in the landed CLs were fully AI-authored, and the rest were human-authored or edited from the AI suggestions. We calculate the percentage of AI-authored code by tracking the state of the change lists (similar to pull requests in git). The first version (known internally as a snapshot) of the changelist is the one generated by the LLM-powered tooling. The version actually committed to the repo is compared to this and a per-character difference is calculated.
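One plausible way to compute such a per-character comparison is sketched below; `difflib` and the toy snippets are our stand-ins for the internal diffing tooling, not the actual implementation:

```python
import difflib

# Count how many characters of the committed change already appeared in the
# LLM-generated first snapshot. Illustrative, not Google's actual tooling.
def ai_authored_fraction(snapshot: str, committed: str) -> float:
    matcher = difflib.SequenceMatcher(a=snapshot, b=committed)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(committed) if committed else 1.0

# Invented example: the human renamed a getter after the AI snapshot.
snap = "long campaign_id = getId();"
final = "long campaign_id = getCampaignId();"
share = ai_authored_fraction(snap, final)
```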
We discovered that in most cases, the human needed to revert at least some changes the model made that were either incorrect or not necessary. Given the complexity and sensitive nature of the modified code, effort has to be spent in carefully rolling out each change to users. This observation led to further investment in LLM-driven verification (described later) to reduce this burden.
The total time spent on the migration was reduced by an estimated 50%, as reported by the engineers doing the migration, when compared to a similar exercise carried out without LLM assistance. In this calculation, we take into account the time needed to review and land the changes. There was also a significant reduction in communication overhead, as a single engineer could generate all necessary changes.
We will now discuss some detail about each of the steps in the workflow for this migration:
- Finding code locations to modify: In this migration, the main target is usually one or more fields defined in a protocol buffer [20] - a standard mechanism for serializing structured data. We use a variety of techniques to identify the final files and code lines to modify.
First, we start with the manual identification of the protocol buffer fields for an ID, called a “seed”. Then, we use Kythe [16] to find references to the seed in the entire Google codebase, called “direct references”. This process is repeated three times, where references-to-references are also automatically found in a breadth-first-search fashion, with each level farther away from the initial seed.
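The breadth-first expansion can be sketched as follows; the `xrefs` mapping and the symbol names are invented stand-ins for a Kythe reference lookup:

```python
from collections import deque

# Sketch of the breadth-first reference expansion described above. `xrefs`
# stands in for a Kythe cross-reference query; the 3-level depth limit
# follows the text. Symbol names are hypothetical.
def expand_references(seed, xrefs, max_depth=3):
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        symbol, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for ref in xrefs.get(symbol, ()):
            if ref not in seen:
                seen.add(ref)
                queue.append((ref, depth + 1))
    return seen

xrefs = {
    "CampaignProto.id": ["getCampaignId"],
    "getCampaignId": ["CampaignService", "CampaignTest"],
    "CampaignService": ["FrontendHandler"],
    "FrontendHandler": ["too_far_away"],  # 4th level: not reached
}
found = expand_references("CampaignProto.id", xrefs)
```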
The result of this Kythe search is a superset of files and lines that may potentially need to be modified. We filter this superset to identify the locations to be modified accurately, before the files are passed to our LLM for the actual modification. To improve accuracy, we use various pluggable and extensible strategies that help decide whether a specific location needs to be migrated. First, we pass the locations through regular expressions to identify if production code contains any casting in different languages, e.g. (int) seed.getLong(). Such locations are tagged as to-be-migrated. Then, for test code, we check whether values passed as IDs are larger than the 32-bit space, e.g. seed.setLong(5L). When they fit the 32-bit space, they are marked as to-be-modified too, since we want to execute tests with values larger than the maximum 32-bit value.
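The two filtering heuristics just described (casts in production code, and test values that still fit in 32 bits) could look roughly like the sketch below; the regexes are our approximations, not the production rules:

```python
import re

# Illustrative versions of the two filters described above.
CAST_RE = re.compile(r"\(int\)\s*\w+\.get\w*\(")     # e.g. (int) seed.getLong(
SET_LITERAL_RE = re.compile(r"\.set\w*\((\d+)L?\)")  # e.g. seed.setLong(5L)

INT32_MAX = 2**31 - 1

def needs_migration(line: str, is_test: bool) -> bool:
    if not is_test:
        # Production code: flag explicit narrowing casts.
        return bool(CAST_RE.search(line))
    # Test code: flag ID literals that still fit the 32-bit space,
    # since tests should exercise values larger than INT32_MAX.
    m = SET_LITERAL_RE.search(line)
    return bool(m) and int(m.group(1)) <= INT32_MAX
```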
The remaining locations are passed through more filters that eliminate false positives with additional techniques. One approach is to parse the source code to check whether a given location contains an actual call relevant to the seed or not. As an example, a method/function definition may have been found with the Kythe reference search, but usually function definitions are not relevant places to be changed for test files. Such locations are marked as irrelevant.
At the end of these filters, the remaining locations are kept to be passed to our LLM for migration. The human involvement in the process decreased during the development of the workflow but still remained high due to the prevalence of ambiguous locations to change.
- Code migration flow: To generate and validate the code changes we leverage a version of the Gemini model that we fine-tuned on internal Google code and data.
Each migration requires as input:
• A set of files and the locations of expected changes: path + line number in the file.
• One or two prompts that describe the change.
• [Optional] Few-shot examples to decide if a file actually needs migration.
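The inputs listed above could be bundled as a small config object, sketched below; the field names are our own and do not reflect the internal toolkit's schema:

```python
from dataclasses import dataclass, field

# Hypothetical config mirroring the per-migration inputs listed in the text.
@dataclass
class MigrationConfig:
    # (file path, line number) pairs where changes are expected
    change_locations: list[tuple[str, int]]
    # one or two prompts that describe the change
    prompts: list[str]
    # optional few-shot examples to decide if a file needs migration
    few_shot_examples: list[str] = field(default_factory=list)

cfg = MigrationConfig(
    change_locations=[("ads/campaign.cc", 42)],   # invented example
    prompts=["Widen 32-bit campaign IDs to 64-bit."],
)
```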

Fig. 5. Example execution of the multi-stage code migration process.
We have developed a code migration toolkit that is used in the solutions described in this work. The toolkit is versatile and can be used for code migrations with varying requirements and outputs. All seed file change locations are provided by the user and collected through processes similar to the ones described above. The migration toolkit automatically expands this set with additional relevant files that can include: test files, interface files, and other dependencies. This step uses symbol cross-reference information.
In many cases, the set of files to migrate provided by the user is not perfect. It is not unusual for some files to have already been partially or completely migrated. Thus, to avoid redundant changes or confusing the model during edit generation, we provide the model with few-shot examples and ask it to predict if a file needs to be migrated.
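A sketch of that pre-check is below; `call_model` stands in for the actual LLM call, and the few-shot framing and labels are our own illustration:

```python
# Hypothetical "does this file still need migration?" pre-check.
FEW_SHOT = """Example: file stores ids in narrow 32-bit fields -> NEEDS_MIGRATION
Example: file stores ids in wide 64-bit fields -> ALREADY_MIGRATED
"""

def file_needs_migration(file_contents: str, call_model) -> bool:
    prompt = FEW_SHOT + "File:\n" + file_contents + "\n->"
    return "NEEDS_MIGRATION" in call_model(prompt)

# Toy model that "classifies" by keyword, just to exercise the flow:
def toy_model(prompt: str) -> str:
    return "NEEDS_MIGRATION" if "int32_t" in prompt else "ALREADY_MIGRATED"
```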
The edit generation and validation step is where we have found the most benefit from an ML-based approach. Our LLM was trained following the DIDACT [17] methodology on data from Google’s monorepo and processes. At inference time, we annotate each line where we expect a change is needed with a natural language instruction, as well as a general instruction for the model. In each model query, the input context contains one or multiple files related to each other, for example implementation files with headers, tests, interface declarations, etc. Fig 5 shows how related files are grouped together automatically and their combined changes are later added to the set of changes comprising a ‘run’ of the toolkit. The model predicts differences (diffs) between the files where changes are needed and will also change related sections so that the final code is correct. The instructions to the model are relatively simple (see Fig 6), but remind the model to also update the test files. Examples of consistent results can be seen in Fig 7 and Fig 8.

Fig. 6. Prompt for int32 to int64 migration. The model makes the change across all the file(s) even without being asked.
This last capability is critical to increasing migration velocity, because the generated changes might not be aligned with the initially requested locations, but they still fulfill the intent. This reduces the need to manually find the full set of lines or files where changes are needed and is a big step forward compared to purely deterministic change generation based on AST modifications.

Fig. 7. In the example above we prompt the model to only update the constructor of the class where the type has to change. In the predicted unified diff, the model correctly also fixes the private field and usages within the class.

Fig. 8. The model updated also the test file with an integer that is larger than 32-bit.
图 8: 模型还更新了测试文件,使用了大于 32 位的整数。
Different combinations of prompts yield different results depending on the input context. In some cases, providing too many locations where one might expect a change results in worse performance than specifying a change in just one place in the file and prompting the model to apply the change to the file holistically.
As we apply changes across dozens and potentially hundreds of files, we implement a mechanism that generates prompt combinations that are tried in parallel for each file group. This is similar to a pass@k [4] strategy where, instead of varying just the inference temperature, we also modify the prompting strategy.
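A minimal sketch of such an attempt generator is shown below. The class, enum and method names are our own illustration, not the toolkit's actual API; the two strategies mirror the contrast described above (annotating every expected change site versus annotating one site and asking for a holistic change).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a pass@k-style attempt generator: instead of varying only the
// sampling temperature, we also vary the prompting strategy. All names here
// are illustrative; the toolkit's real API is not public.
public class AttemptGenerator {
    // Two strategies the text contrasts: annotate every expected change
    // site, or annotate one site and ask for a holistic file-wide change.
    enum PromptStrategy { ANNOTATE_ALL_LOCATIONS, ANNOTATE_ONE_APPLY_HOLISTICALLY }

    record Attempt(PromptStrategy strategy, double temperature) {}

    // Cartesian product of strategies and temperatures; each attempt is then
    // run in parallel per file group and validated independently.
    public static List<Attempt> attempts(double[] temperatures) {
        List<Attempt> out = new ArrayList<>();
        for (PromptStrategy s : PromptStrategy.values())
            for (double t : temperatures)
                out.add(new Attempt(s, t));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(attempts(new double[] {0.2, 0.8}).size()); // 4
    }
}
```

Each generated attempt corresponds to one model query per file group; the validation step later decides which attempt survives.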
We validate the resulting changes automatically. The validations are configurable and often depend on the migration. The two most common validations are building the changed files and running their unit tests. Each failed validation step can optionally trigger an ML-powered “repair”. The model has also been trained on a large set of failed builds and tests, paired with the diffs that then fixed them. For each of the build/test failures that we encounter, we prompt the model with the changed files, the build/test error and a request for a fix. With this approach, we observe that in a significant number of cases the model is able to fix the code.
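The validate-and-repair loop can be sketched as follows, with the build/test run and the model call stubbed out behind functional interfaces. This is an illustrative skeleton under our own assumptions, not the internal implementation.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

// Sketch of the generate/validate/repair loop described above. The build,
// test and model calls are stubbed behind functional interfaces; the real
// toolkit wires these to internal build and inference systems.
public class ValidateRepairLoop {
    record Outcome(boolean ok, String error) {}

    // Try a change; on each failed validation, ask the model for a repair,
    // up to maxRepairs times. Returns the passing change, or null if no
    // attempt validated (a human then takes over).
    public static String run(String change,
                             Function<String, Outcome> validate,
                             BiFunction<String, String, String> repair,
                             int maxRepairs) {
        for (int i = 0; i <= maxRepairs; i++) {
            Outcome o = validate.apply(change);
            if (o.ok()) return change;                // builds and tests pass
            if (i == maxRepairs) break;               // out of repair budget
            change = repair.apply(change, o.error()); // ML-powered "repair"
        }
        return null;
    }

    public static void main(String[] args) {
        // Toy stand-ins: a change "fails" until it contains "fixed".
        String result = run("v1",
                c -> c.contains("fixed") ? new Outcome(true, null)
                                         : new Outcome(false, "build error"),
                (c, err) -> c + "-fixed",
                3);
        System.out.println(result); // v1-fixed
    }
}
```

Bounding the number of repair attempts keeps inference costs predictable when the loop runs over hundreds of file groups.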
As we generate multiple changes for each file group, we score them based on the validations and at the end decide which change to propagate back to the final change list (similar to a pull request in Git). There are different strategies to rank changes and select the best; which to use depends on the use case. Commonly, a good approach is to provide examples of the expected diff and favor the changes that are closer to the set of provided examples. “Closer” can be defined by standard code/text distance metrics or by querying Gemini to ask which of the changes best matches the given examples.
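For the text-distance variant of the ranking, a minimal sketch might look like the following. The helper names are ours, and the paper does not specify which distance metric is used; plain Levenshtein distance stands in here.

```java
import java.util.List;

// Sketch of example-based ranking: score each candidate change by its edit
// distance to a set of exemplar diffs and keep the closest one. The text
// also mentions asking Gemini to judge similarity; this sketch shows only
// the simpler text-distance variant.
public class ChangeRanker {
    // Classic dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1]
                            + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // Return the candidate whose summed distance to the exemplars is lowest.
    public static String best(List<String> candidates, List<String> exemplars) {
        String best = null;
        long bestScore = Long.MAX_VALUE;
        for (String c : candidates) {
            long score = 0;
            for (String e : exemplars) score += levenshtein(c, e);
            if (score < bestScore) { bestScore = score; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(best(
                List.of("-int32 x;+int64 x;", "-int32 x;+long x;"),
                List.of("-int32 y;+int64 y;"))); // picks the int64 variant
    }
}
```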
The process and toolkit described in Sec IV-2 are mostly generic, and are used in the other migration exercises that we present next.
V. JUNIT3 TO JUNIT4 MIGRATION
Large codebases tend to have some parts that fall behind the rest: they might use an older version of a library or API, or a framework that is being deprecated elsewhere. Google has had a one-version policy for years, which helps keep every dependency and library as fresh as possible. Nevertheless, as standards change, some migrations that cannot be fully automated tend to take quarters, even years.
A group of teams at Google had a substantial set of test files that used the now old JUnit3 library. Updating all of them manually would be a huge investment, and although the old tests are not on the critical path of development, they negatively affect the codebase. They are technical debt and tend to replicate themselves, as developers may inadvertently copy old code when producing new code.
We needed to make a decisive push to migrate a critical mass of these tests to the new JUnit4 library. Although such a migration is relatively simple for a human, doing it with purely AST-based techniques was deemed infeasible, as there are just too many edge cases.
We used the LLM migration stack (described above) to run an automatic migration on the old tests. We already had a list of all the JUnit3 tests in the repository, so running over all of them was simple. While we modified each test file, the internal system for Large Scale Changes (LSCs, [26]) split the resulting change lists into smaller sets that were sent to the owners of the tests for review. An example change can be seen in Fig 10.

Fig. 9. Excerpt from the JUnit3 to JUnit4 prompt.
Fig. 10. Changes required for a case of JUnit3 to JUnit4 conversion.
The version of Gemini that we used was fine-tuned on the internal Google codebase, so it had already seen quite a few JUnit4 tests. This allowed us to use a relatively simple set of prompts that essentially consisted of the list of rules (see Fig 9) that humans use to do the migration manually.
The updated test files were built and the updated tests were run again. Any failures were sent back to the model for fixing, as described in Fig 5.
With this technique we were able to migrate 5,359 files, modifying more than 149,000 lines of code, in 3 months. The bottleneck in the process was the speed at which engineers could review the changes. We purposefully limited the number of changes we generated every week to avoid overwhelming reviewers. At the end of the migration, ~87% of the code generated by the AI ended up committed without any change.
VI. JODA TIME TO JAVA TIME MIGRATION
Some parts of the codebase still use the Joda time library for Java instead of the standard java.time package. Although the Java versions within the monorepo are regularly updated, this dependency remains and is still widespread with thousands of occurrences. We decided to tackle migrating from Joda time to standard java.time.
One major challenge in such a migration is that the changes are not scoped to single methods, but very often require changes in class public interfaces and fields. The situation becomes even more complex as we cannot just create a giant change and update all occurrences - there are thousands of them. Instead we need to split the work into chunks that can be reviewed and committed separately. The continuous availability of code reviewers is also not guaranteed at all times, so we need to migrate only the parts of the codebase where we have the bandwidth to land the changes. Often there are interdependencies between components we want to change and others we do not want to touch yet. This means we need to insert conversion functions in the interfaces where two such components interact and employ the types we want to migrate. In many cases there is a 1:1 correspondence between the timekeeping functions and data structures; for example, joda.time.Duration can be replaced with java.time.Duration, and the constructing functions can be modified so that instead of joda.time.Duration.millis() we call java.time.Duration.ofMillis().
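To make the correspondence concrete, the snippet below shows the java.time side of two such mappings, with the Joda originals kept in comments (joda-time is deliberately not on the classpath here; only java.time, which ships with the JDK, is used).

```java
import java.time.Duration;
import java.time.Instant;

// Illustration of the 1:1 mappings described above, with the Joda
// originals shown in comments.
public class JodaToJavaTime {
    public static Duration timeout() {
        // Before: return org.joda.time.Duration.millis(1500);
        return Duration.ofMillis(1500); // construction moves to ofMillis()
    }

    public static Duration elapsed(Instant start, Instant end) {
        // Before: return new org.joda.time.Duration(start, end);
        return Duration.between(start, end);
    }

    public static void main(String[] args) {
        System.out.println(timeout().toMillis()); // 1500
        System.out.println(elapsed(Instant.ofEpochSecond(0),
                                   Instant.ofEpochSecond(5))); // PT5S
    }
}
```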
In other cases, though, no direct type translation is possible. For example, there is no counterpart to joda.time.Interval in the standard Java Time API; Duration has no concept of an exact start time. Instead, the guideline was to substitute a joda.time.Interval with a common.collect.Range<java.time.Instant>. This requires a more involved change in the logic of the functions and, most importantly, in the class interfaces that use this type.
All these challenges meant we needed a new and more complex approach to land the migration gradually. We again split the process into 3 stages: change targeting (also known as localization), change execution, and review and landing.
For the targeting, we built a pipeline on top of Kythe [16], which provides cross-reference information. We start with the directories (which very roughly correspond to components and projects) where we have reviewers ready. For each file in such a directory we build a cross-reference graph showing where the Joda time types are used and what the dependencies are. We need answers to several questions:
To answer these questions we built a clustering solution on top of Kythe, in which we categorize the potential changes. The cross-references form directed acyclic graphs (DAGs) connecting files. The simplest changes are in files where no modifications are needed in an interface - this happens when the types to change are used only within the implementation of a class. The call-graphs tend to cluster, and we split the clusters into categories by the number of files affected.
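The clustering step can be approximated with a plain connected-components pass over the cross-reference edges. In this sketch the edges are hard-coded pairs, whereas the real pipeline obtains them from Kythe; the bucketing by size mirrors the categorization described above.

```java
import java.util.*;

// Sketch of the clustering step: treat cross-references as edges between
// files, compute connected groups, and sort them by size so that small
// clusters can be migrated (and reviewed) first.
public class FileClusterer {
    public static List<Set<String>> clusters(List<String[]> edges) {
        Map<String, Set<String>> adj = new HashMap<>();
        for (String[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        List<Set<String>> out = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String f : adj.keySet()) {
            if (seen.contains(f)) continue;
            Set<String> cluster = new TreeSet<>();
            Deque<String> stack = new ArrayDeque<>(List.of(f));
            while (!stack.isEmpty()) { // plain flood fill over the edges
                String cur = stack.pop();
                if (!seen.add(cur)) continue;
                cluster.add(cur);
                stack.addAll(adj.getOrDefault(cur, Set.of()));
            }
            out.add(cluster);
        }
        out.sort(Comparator.comparingInt(s -> s.size())); // smallest first
        return out;
    }

    public static void main(String[] args) {
        System.out.println(clusters(List.of(
                new String[] {"A.java", "ATest.java"},
                new String[] {"B.java", "C.java"},
                new String[] {"C.java", "CTest.java"})));
    }
}
```

Cross-references are directed in reality; treating them as undirected is a simplification that is enough to decide which files must be migrated together.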
The clustering is also important to get the model to make consistent changes across files. When there is a dependency, ideally we migrate all files in one prompt.
Fig. 11. Excerpt from the Joda time to Java time prompt: “Never rename functions or completely remove their implementation.”

Fig. 12. Example of a Joda time to Java time migration.
This approach led us to successfully migrate many smaller and medium-size file clusters. This migration is still ongoing, and we have challenges to solve. Sometimes huge clusters arise and require us to carefully order the changes and synchronize with code reviewers. We need to improve some of the call-graph logic, as connections are sometimes missed when components depend on each other through Guice [13]. Another challenge is the presence of switch statements that use dynamic typing. We are migrating our pipeline to support Guice dependencies, though, which will cover the majority of the references.
Our current estimate shows that we are able to save ~89% of the time it would have taken humans to make the change in the small clusters. The number is calculated across multiple changes where human experts (team technical leads) have compared their experience with the AI-powered tooling to the previous purely manual approach. An additional time-saver mentioned by engineers is that the current algorithm helps them quickly identify all places and dependencies to update. Even when the toolkit made some errors, they were relatively trivial to fix.
If we split them, Gemini would effectively not know, between inference invocations, whether the referenced file had been migrated - it would have to assume whether a direct call or a type conversion is needed. The alternative is to show the already-migrated file to the model to avoid inconsistencies. Fortunately, as Gemini offers a huge context window, we can comfortably fit many of the clusters into it. This means we can prompt once for many files and get the whole cluster migrated. After the code change we run builds and tests and try to fix any eventual failures.
For the changes themselves, the instructions to the model are similar to the ones we provide human engineers. See Fig 11.
Gemini has seen enough Joda and java.time code during its training that we do not actually have to explain what they are. It was able to correctly use the right APIs and did not hallucinate wrong parameters or data structures. Fig 12 shows an example of a correct diff generated for the migration.
The only additional context we offer, apart from the files to migrate, is an auxiliary class with conversion functions that sometimes need to be used. They are specific to the internal Google code, and the model should consistently use them (instead of writing new ones).
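As an illustration only (the internal helper class is not public and its actual shape is unknown to us), such an auxiliary conversion class might look like the following, built on the epoch-millis round trip that both libraries share.

```java
import java.time.Instant;

// Hypothetical shape of an auxiliary conversion class for the boundary
// between migrated and not-yet-migrated components. The real internal
// helpers are not public; only the epoch-millis round trip is generic.
public final class TimeConversions {
    private TimeConversions() {}

    // A Joda Instant and a java.time Instant round-trip through epoch
    // milliseconds: new org.joda.time.Instant(millis) <-> ofEpochMilli.
    public static long toEpochMillis(Instant instant) {
        return instant.toEpochMilli();
    }

    public static Instant fromEpochMillis(long millis) {
        // Joda-side counterpart: new org.joda.time.Instant(millis)
        return Instant.ofEpochMilli(millis);
    }
}
```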
VII. CLEANUP OF EXPERIMENTAL CODE
Google uses thousands of runtime experiments running continuously to improve its products. At its most simplified, an experiment is a flag in code that receives some value from the outside (the experiment execution engine). The code logic is branched according to the value of the flag, or the value is used directly somewhere as a parameter. After an experiment has either succeeded or failed, the code related to it needs to be cleaned up - it either becomes the default code path or gets removed entirely.
Sometimes, due to their large number, experiments become stale: an experiment might be abandoned or permanently rolled out, or its value effectively becomes a constant, but the flag and the associated code remain even though no branching is needed anymore. This constitutes dead code and technical debt. The experimentation engine tracks all experiments and flags across Google and can list all the stale flags - defined as flags whose value has not changed over a long period of time. Developers need to clean the associated code manually, though.
Cleaning all these flags is time-consuming, and we decided to use AI to accomplish the job.
The task requires:
1) Finding the code locations where the flag is referenced
2) Deleting the code references to the experimental flag
3) Simplifying any conditional expressions that depend on the experimental flag
4) Cleaning any now dead code
5) Updating the tests, often deleting now useless tests
Steps 2), 3) and 4) could be implemented with an AST-based approach, and even 5) could be approximated with a heuristic. In fact, Google already had an AST-based tool for C++, but we discovered that it was missing some cases, and the ambition was to build something language-agnostic.
A. Flag discovery and targeting
Step 1) is often simple - the flags are declared in configuration files and manifest themselves in code through getters whose symbol names are derived from the name of the flag in a well-known form. We use Code Search to find where the flag is used in the code. One challenge is that in test files the flag is not necessarily used with its exact getter name - in fact, it is not extracted from the experiments system at all, but is a developer-written fake passed to the focal code. Software engineers use local variables with names that roughly align with the experiment flag name but are not the same - in this case Code Search will not find the test flag.
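For illustration, assuming a hypothetical convention in which a snake_case flag name maps to a camelCase getter (the real internal naming scheme may differ), the Code Search query term could be derived like this:

```java
// Derive the getter symbol name for a flag, under an assumed convention:
// snake_case flag name -> "get" + CamelCase. Illustrative only; the real
// internal convention is not public.
public class FlagGetterName {
    public static String getterFor(String flagName) {
        StringBuilder sb = new StringBuilder("get");
        for (String part : flagName.split("_")) {
            sb.append(Character.toUpperCase(part.charAt(0)))
              .append(part.substring(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(getterFor("enable_fast_path")); // getEnableFastPath
    }
}
```

The derived symbol is then used as the Code Search query to enumerate implementation files; as noted above, it will not surface developer-written test fakes.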
We could use cross-reference information to try to map the uses of the experiment flag between the focal function and the test variable (often passed to a constructor). Human intuition, however, is enough to easily identify the test flag name when looking at the test and implementation files, so we decided to use the LLM to tell us which test flag we need to delete. Fig 14 illustrates how the LLM tagged it by adding a comment. Through a Code Search query we know all the implementation files where the flag is used, and due to the way tests are organized at Google, we also know all the corresponding test files. We pass all these files and the instructions below (see Fig 13) to the model to discover which test flags to delete.
Given the strong code change capabilities of Gemini, we decided to ask it to put a comment on the test flag. We then use this data to guide the model in a second step that actually deletes the code across the implementation and test files (see Fig 15).
B. Code cleanup
The code changes follow a similar pattern to the other migrations. The input to the model is:
• Set of files, the symbol name of the flag to clean, and the value of the flag
• Set of test files and the test flags to clean
• Instructions on how to execute the cleanup
The value of the flag is critical, as we effectively need to substitute that constant wherever the flag is used.
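The role of the flag value can be shown on a toy example: substitute the (hypothetical) getter call with the constant, then fold the now-trivial conditional. The real cleanup is performed by the LLM on arbitrary code; this string-level sketch only illustrates the idea.

```java
// Toy illustration of flag cleanup: once the getter is replaced by its
// final constant value, conditionals can be folded and one branch becomes
// dead. Real code needs far more than this regex, which only handles the
// single-line shape "if (true) {A} else {B}".
public class FlagFolder {
    // Substitute the getter call with the flag's final value.
    public static String substitute(String code, String getter, boolean value) {
        return code.replace(getter + "()", Boolean.toString(value));
    }

    // Fold the trivial shapes "if (true) {A} else {B}" / "if (false) ...".
    public static String fold(String code) {
        java.util.regex.Matcher m = java.util.regex.Pattern
                .compile("if \\((true|false)\\) \\{(.*?)\\} else \\{(.*?)\\}")
                .matcher(code);
        if (!m.find()) return code;
        return m.group(1).equals("true") ? m.group(2) : m.group(3);
    }

    public static void main(String[] args) {
        String code = "if (useNewPath()) {newPath();} else {oldPath();}";
        System.out.println(fold(substitute(code, "useNewPath", true))); // newPath();
    }
}
```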

Fig. 13. Prompt for unused flag cleanup

Fig. 14. The model discovers and ’tags’ a test flag related to the implementation flag we want to delete.
Given Gemini’s large context size, we are able to pack all the usages of a flag into one query to the model, even though some files in the Google monorepo are very large. The implementation and test files are both visible to the model together, so that it can make a consistent change across them. After the code is cleaned, we run additional validations to make sure all the instances were deleted, the code builds and the tests pass.
The implementation files’ updates are of high quality, and Gemini can successfully delete functions that are now dead code (because they were only used in a branch that is now deleted).
The most challenging part is cleaning up the right tests. The model correctly deletes the tests where the flag is set to values that are now impossible. However, in some cases the tests are not ‘pure’: they might exercise multiple flags in the same unit test, or some interaction between their values. Such cleanups are very difficult for humans as well. In these cases we rely on the test failing and the model attempting a fix, or leave it to the human reviewers to do a final fix. Addressing these unit-test cleanups is a future area of research for the team.
VIII. DISCUSSION AND TAKEAWAYS
A. LLMs for Code Modernization
LLMs offer a significant opportunity for assisting in modernizing and updating large codebases. They come with a lot of flexibility, and thus a variety of code transformation tasks can be framed in a similar workflow and achieve success. This approach has the potential to radically change the way code is maintained in large enterprises. Not only can it accelerate the work of engineers, but it can also make possible efforts that were previously infeasible due to the huge investment needed. Unfinished migrations tend to continuously slow down teams even if they are not working directly on them; they confuse new developers with obsolete patterns and require additional cognitive load. Landing migrations faster has wide-reaching benefits beyond the actual technical code change.

Fig. 15. The first large block with red background is a direct flag dependency, while the second large block with red background depends on the removal of the first block.
B. LLM+AST, better together
Many code migrations can be split into discrete steps, and each step can use LLM generation or LLM introspection, but also traditional methods like AST-based techniques and even grep-like searches. We discovered that LLM planning capabilities are often not needed and add a layer of complexity that should be avoided when possible.
AST-based tooling has the advantage of being ‘always correct’ and does not suffer from model version changes or prompt-change fluctuations. For example, simple model mistakes such as adding unnecessary comments or changing methods it shouldn’t can easily be caught by an AST parser diffing the before/after of the file’s AST nodes. Working with smaller and better defined LLM-powered prompts increases the reliability of the results while reducing debugging effort and the need to tune the prompts. An additional benefit is that the steps that don’t rely on LLMs tend to be much cheaper computation-wise. Although the cost per token for predictions has steadily decreased, migrations often require touching thousands of files, and the costs can quickly add up.
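One such deterministic guardrail can be sketched as a token-level comparison that ignores comments and whitespace. A real implementation would diff parsed AST nodes; the helper below is only an approximation for illustration (its crude tokenizer drops operators, which is enough to flag comment-only edits).

```java
import java.util.Arrays;
import java.util.List;

// Sketch of a deterministic check on model output: strip comments and
// whitespace and compare token streams, so an edit that only added
// comments or reformatted code is flagged as a semantic no-op.
public class SemanticDiffCheck {
    static List<String> tokens(String code) {
        String noComments = code
                .replaceAll("//[^\n]*", " ")         // line comments
                .replaceAll("(?s)/\\*.*?\\*/", " "); // block comments
        return Arrays.stream(noComments.split("\\W+"))
                     .filter(t -> !t.isEmpty())
                     .toList();
    }

    public static boolean sameTokens(String before, String after) {
        return tokens(before).equals(tokens(after));
    }

    public static void main(String[] args) {
        System.out.println(sameTokens(
                "int x = f(y);",
                "// an unnecessary model-added comment\nint x = f(y);")); // true
    }
}
```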
C. Divide and Conquer
Simpler sub-tasks allow the developer building the migration workflow to iterate faster and ‘divide and conquer’ the problem. In the toolkit described in this paper (see Sec IV-2), different engineers are in charge of tuning and optimizing a specific step that is then used for multiple migrations. For example, the build-fixing step is ubiquitous, and improving it leads to economies of scale.
When the model output is not perfect, leaning on validation and verification across multiple steps can ‘recover’ the change and lead to a successful workflow. In our observations, the LLMs are very good at fixing a well-defined problem, which complements their first attempt at a change.
D. Landing the changes: a human process
Although LLMs can yield significant time savings for the change generation itself, additional tooling will be needed to further reduce or accelerate human involvement. Code reviews and change rollouts still require a human operator and can quickly become bottlenecks in the larger code migration process. We needed to slow down some of the migration work to avoid overwhelming the teams that had to do the reviews and to make sure that the rollouts happened in a gradual and responsible way. Assisting the human developers in the tasks adjacent to the actual code changes in a migration is an area we are actively exploring.
E. Metrics and evaluation
A lot of work in the ML and applied-ML communities focuses on evals, typically measuring the model’s performance on isolated, hermetic tasks such as “generate Python code to ...”. While these measure some aspects of model goodness, in the end what matters is how model-level performance translates to business-level outcomes. Bad model-level performance can compromise business-level outcomes, but good model-level performance does not guarantee good business-level outcomes, since the ML model is only one part of a complex transformation task. We constantly measure ourselves at the level of business-level outcomes.
F. Training and adoption
Using generative AI widely, with bespoke techniques, comes with a hidden cost: that of having to train a number of engineers in the use of these techniques. Building elaborate tooling to completely hide the use of AI behind tools is expensive, and it creates a technical obligation to maintain that tooling, which is used by a relatively small number of engineers. (By comparison, generic technologies such as code auto-completion are easy to amortize over a much larger population.)
G. Custom vs generic models
Similarly to the above point, while fine-tuned models can be useful, they also come with a cost, and a company needs to constantly assess the “area under the curve” of investing in custom models for better outcomes versus working with out-of-the-box models. We have so far used fine-tuned models for generating edits and fixing build failures.
IX. RELATED WORK
Repository-level changes with planning: The need to perform repository-level code changes is the key challenge of code migrations. For code migrations, structural relationships across the repository are important for LLMs to capture, but the whole repository does not fit into the prompt. One approach, discussed by Bairi et al. [2], is framing the task as a multi-step planning problem, including static dependency analysis, edit-relevance analysis (to determine the relevant context for the LLM), repairs of errors from validations in the previous step, and adaptive plan generation. Jiang et al. [14] presented how LLMs can be used to generate the plan. Further relevant literature includes [29], employing look-ahead planning, and [12], with a theoretical discussion. There are various demos of agent-based approaches [6, 28, 30], e.g. giving the agent access to the editor, terminal and web. Although all these planning approaches show promise in capturing complex relationships and flexibly adapting during execution, it is not clear how well the methods generalize, and the complexity of the systems is high.
Repository-level changes without planning: Xia et al. [27] challenge the above assumption that planning is needed. With a simpler setup - a sequence of localizing the next relevant edit locations and repairing based on validation - better-quality results are achieved on SWEBench [22] compared to more involved approaches with planning. In a typical setup such an approach is also more cost-effective. Similarly, the approaches in this paper rely on agent-less techniques.
Code migrations: There is prior art in using LLMs for language translation [8] and code refactoring [21]. However, those approaches do not yet reach high enough quality to be useful. Amazon recently shared a product for code migrations using agents [1]. In [24] the team explores the human-AI partnership in the product: the shortcomings of LLMs are compensated for by offering the human the ability to easily correct intermediate steps of the LLM’s code edits.
X. CONCLUSION AND FUTURE WORK
We intend to build upon this success and expand the portfolio of programs leveraging LLM capabilities from a handful to several across multiple teams in Ads and other product areas across Google. Scaling AI-assisted migrations through easy-to-use workflows and high-confidence AI recommendations will be critical to expanding adoption.
We also plan to expand the portfolio of use cases from code migration to creating agents that can automate triaging, mitigating and resolving complex system escalations, which will help the business run more smoothly with minimal interruptions.
