[论文翻译] LLM4Decompile:使用大语言模型反编译二进制代码


原文地址:https://arxiv.org/pdf/2403.05286v3


LLM4Decompile: Decompiling Binary Code with Large Language Models

LLM4Decompile:使用大语言模型反编译二进制代码

1 Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China 2 Department of Computing, The Hong Kong Polytechnic University, HKSAR, China 3 Research Centre for Data Science & Artificial Intelligence hanzhuo.tan@connect.polyu.hk, 12232440@mail.sustech.edu.cn, jing-amelia.li@polyu.edu.hk, zhangyq@sustech.edu.cn

1 南方科技大学计算机科学与工程系,中国深圳
2 香港理工大学计算学系,中国香港特别行政区
3 数据科学与人工智能研究中心
hanzhuo.tan@connect.polyu.hk, 12232440@mail.sustech.edu.cn, jing-amelia.li@polyu.edu.hk, zhangyq@sustech.edu.cn

Abstract

摘要

Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in Large Language Models (LLMs), we propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code. We optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary directly. The resulting models significantly outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100% in terms of re-executability rate. Additionally, we improve the standard refinement approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled code from Ghidra and achieve a further 16.2% improvement over LLM4Decompile-End. LLM4Decompile demonstrates the potential of LLMs to revolutionize binary code decompilation, delivering remarkable improvements in readability and executability while complementing conventional tools for optimal results.

反编译的目标是将二进制代码转换为高级源代码,但像 Ghidra 这样的传统工具生成的结果通常难以阅读和执行。受大语言模型 (LLM) 进步的启发,我们提出了 LLM4Decompile,这是第一个也是最大的开源大语言模型系列 (1.3B 到 33B),专门用于反编译二进制代码。我们优化了大语言模型的训练过程,并引入了 LLM4Decompile-End 模型来直接反编译二进制代码。在 HumanEval 和 ExeBench 基准测试中,这些模型的可重执行率显著优于 GPT-4o 和 Ghidra,超出 100% 以上。此外,我们改进了标准的精炼方法来微调 LLM4Decompile-Ref 模型,使其能够有效地精炼来自 Ghidra 的反编译代码,并实现了比 LLM4Decompile-End 进一步 16.2% 的提升。LLM4Decompile 展示了大语言模型在二进制代码反编译领域的革命性潜力,显著提高了可读性和可执行性,同时与传统工具互补以实现最佳结果。

1 Introduction

1 引言

Decompilation, the reverse process of converting machine code or binary code into a high-level programming language, facilitates various reverse engineering tasks such as vulnerability identification, malware research, and legacy software migration (Brumley et al., 2013; Katz et al., 2018; Hosseini and Dolan-Gavitt, 2022; Xu et al., 2023; Armengol-Estapé et al., 2023; Jiang et al., 2023; Huaqingweiyang, 2024; Wong et al., 2023; Hu et al., 2024). Decompilation is challenging due to the loss of information inherent in the compilation process, particularly finer details such as

反编译是将机器代码或二进制代码转换为高级编程语言的逆向过程,有助于进行各种逆向工程任务,如漏洞识别、恶意软件研究和遗留软件迁移 (Brumley et al., 2013; Katz et al., 2018; Hosseini and Dolan-Gavitt, 2022; Xu et al., 2023; Armengol-Estapé et al., 2023; Jiang et al., 2023; Huaqingweiyang, 2024; Wong et al., 2023; Hu et al., 2024)。反编译具有挑战性,因为编译过程中会丢失信息,尤其是更细微的细节,如

Figure 1: Illustration of compiling source code to binary, disassembling binary to assembly code (ASM), and decompiling ASM to pseudo-code with Ghidra. The pseudo-code is hard to read and not executable.

图 1: 展示了将源代码编译为二进制文件、将二进制文件反汇编为汇编代码 (ASM) 以及使用 Ghidra 将 ASM 反编译为伪代码的过程。伪代码难以阅读且不可执行。

variable names (Lacomis et al., 2019) and fundamental structures like loops and conditionals (Wei et al., 2007). To address these challenges, numerous tools have been developed for decompilation, with Ghidra (Ghidra, 2024a) and IDA Pro (Hex-Rays, 2024) being the most commonly used. Although these tools have the capability to revert binary code to high-level pseudo-code, the outputs often lack readability and re-executability (Liu and Wang, 2020a; Wang et al., 2017), which are essential for applications like legacy software migration and security instrumentation tasks (Wong et al., 2023; Dinesh et al., 2020).

变量名 (Lacomis et al., 2019) 以及基本结构如循环和条件语句 (Wei et al., 2007)。为了解决这些挑战,已经开发了许多反编译工具,其中 Ghidra (Ghidra, 2024a) 和 IDA Pro (HexRays, 2024) 是最常用的。尽管这些工具能够将二进制代码还原为高级伪代码,但输出通常缺乏可读性和可重执行性 (Liu and Wang, 2020a; Wang et al., 2017),这对于遗留软件迁移和安全检测任务 (Wong et al., 2023; Dinesh et al., 2020) 至关重要。

Figure 1 illustrates the transformation from the source C code to a binary file, assembly code (ASM), and pseudo-code decompiled from Ghidra. In this pseudo-code, the original nested for structure is replaced with a less intuitive combination of a do-while loop inside another while loop. Furthermore, array indexing like num[i] is decompiled into complicated pointer arithmetic such as *(float *)(param_2 + (long)local_24 * 4). The decompiled output also exhibits syntactical errors, with the function return type being converted to undefined4. Overall, traditional decompilation tools often strip away the syntactic clarity provided by high-level languages and do not ensure the correctness of syntax, posing significant challenges even for skilled developers to reconstruct the algorithmic logic (Wong et al., 2023; Hu et al., 2024).

图 1 展示了从源 C 代码到二进制文件、汇编代码 (ASM) 以及从 Ghidra 反编译得到的伪代码的转换过程。在此伪代码中,原始的嵌套 for 结构被替换为不太直观的 do-while 循环内嵌另一个 while 循环的组合。此外,像 num[i] 这样的数组索引被反编译为复杂的指针运算,例如 *(float *)(param_2 + (long)local_24 * 4)。反编译的输出还表现出语法错误,函数的返回类型被转换为 undefined4。总体而言,传统的反编译工具通常会剥离高级语言提供的语法清晰性,并且不确保语法的正确性,即使对于熟练的开发人员来说,重建算法逻辑也面临重大挑战 (Wong et al., 2023; Hu et al., 2024)。

Recent advancements in Large Language Models (LLMs) have greatly improved the process of decompiling code. There are two primary approaches to LLM-based decompilation: Refined-Decompile and End2end-Decompile. In particular, Refined-Decompile prompts LLMs to refine the results from traditional decompilation tools (Hu et al., 2024; Wong et al., 2023; Xu et al., 2023). However, LLMs are primarily optimized for high-level programming languages and may not be as effective with binary data. End2end-Decompile fine-tunes LLMs to decompile binaries directly. Nevertheless, previous open-source applications of this approach were limited by the use of smaller models with only around 200 million parameters and a restricted training corpus (Hosseini and Dolan-Gavitt, 2022; Armengol-Estapé et al., 2023; Jiang et al., 2023). In contrast, utilizing larger models trained on broader datasets has proven to substantially improve performance (Hoffmann et al., 2024; Kaplan et al., 2020; Rozière et al., 2023; OpenAI, 2023).

大语言模型(LLMs)的最新进展极大地改进了代码反编译的过程。基于大语言模型的反编译主要有两种方法——RefinedDecompile 和 End2end-Decompile。特别是,Refined-Decompile 提示大语言模型优化传统反编译工具的结果 (Hu et al., 2024; Wong et al., 2023; Xu et al., 2023)。然而,大语言模型主要针对高级编程语言进行了优化,可能对二进制数据的处理效果不佳。End2end-Decompile 则通过微调大语言模型直接反编译二进制文件。尽管如此,此前该方法在开源应用中的局限性在于使用了仅约2亿参数的小模型,并且训练语料受限 (Hosseini and Dolan-Gavitt, 2022; Armengol-Estapé et al., 2023; Jiang et al., 2023)。相比之下,使用在更广泛数据集上训练的大模型已被证明能显著提升性能 (Hoffmann et al., 2024; Kaplan et al., 2020; Rozière et al., 2023; OpenAI, 2023)。

To address the limitations of previous studies, we propose LLM4Decompile, the first and largest open-source LLM series with sizes ranging from 1.3B to 33B parameters specifically trained to decompile binary code. To the best of our knowledge, no previous study has attempted to improve the capability of LLM-based decompilation in such depth or to incorporate such large-scale LLMs. Based on the End2end-Decompile approach, we introduce three critical steps: data augmentation, data cleaning, and two-stage training, to optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary directly. Specifically, our LLM4Decompile-End-6.7B model demonstrates a successful decompilation rate of 45.4% on HumanEval (Chen et al., 2021) and 18.0% on ExeBench (Armengol-Estapé et al., 2022), far exceeding Ghidra (Ghidra, 2024a) and GPT-4o (OpenAI, 2023) by over 100%. Additionally, we improve the Refined-Decompile strategy by examining the efficiency of Ghidra's decompilation process, augmenting and filtering data to fine-tune the LLM4Decompile-Ref models, which excel at refining Ghidra's output. Experiments suggest a higher performance ceiling for the enhanced Refined-Decompile approach, with a 16.2% improvement over LLM4Decompile-End. Additionally, we assess the risks associated with the potential misuse of our model under obfuscation conditions commonly used in software protection. Our findings indicate that neither our approach nor Ghidra can effectively decompile obfuscated code, mitigating concerns about unauthorized use for infringement of intellectual property.

为了解决以往研究的局限性,我们提出了 LLM4Decompile,这是第一个也是最大的开源大语言模型系列,其参数规模从 1.3B 到 33B 不等,专门用于反编译二进制代码。据我们所知,之前没有研究尝试如此深入地提升基于大语言模型的反编译能力,也没有引入如此大规模的大语言模型。基于 End2end-Decompile 方法,我们引入了三个关键步骤:数据增强、数据清理和两阶段训练,以优化大语言模型的训练过程,并推出 LLM4Decompile-End 模型来直接反编译二进制代码。具体来说,我们的 LLM4Decompile-End-6.7B 模型在 HumanEval (Chen et al., 2021) 上实现了 45.4% 的成功反编译率,在 ExeBench (Armengol-Estapé et al., 2022) 上实现了 18.0% 的成功反编译率,超出 Ghidra (Ghidra, 2024a) 和 GPT-4o (OpenAI, 2023) 达 100% 以上。此外,我们通过研究 Ghidra 反编译过程的效率,增强并过滤数据以微调 LLM4Decompile-Ref 模型,从而改进了 Refined-Decompile 策略,该模型在优化 Ghidra 输出方面表现出色。实验表明,增强后的 Refined-Decompile 方法具有更高的性能上限,比 LLM4Decompile-End 提高了 16.2%。此外,我们评估了在软件保护中常用的混淆条件下模型被滥用的潜在风险。研究结果表明,无论是我们的方法还是 Ghidra,都无法有效反编译混淆代码,从而减轻了对未经授权使用以侵犯知识产权的担忧。

2 Related Work

2 相关工作

The practice of reversing executable binaries to their source code form, known as decompilation, has been researched for decades (Miecznikowski and Hendren, 2002; Nolan, 2012; Katz et al., 2019). Traditional decompilation relies on analyzing the control and data flows of the program (Brumley et al., 2013) and employing pattern matching, as seen in tools like Hex-Rays IDA Pro (Hex-Rays, 2024) and Ghidra (Ghidra, 2024a). These systems attempt to identify patterns within a program's control-flow graph (CFG) that correspond to standard programming constructs such as conditional statements or loops. However, the output from such decompilation processes tends to be a source-code-like representation of assembly code, including direct translations of variables to registers, use of gotos, and other low-level operations instead of the original high-level language constructs. This output, while often functionally similar to the original code, is difficult to understand and may not be re-executable (Liu and Wang, 2020b; Wong et al., 2023). Drawing inspiration from neural machine translation, researchers have reformulated decompilation as a translation exercise, converting machine-level instructions into readable source code (Katz et al., 2019). Initial attempts in this area utilized recurrent neural networks (RNNs) (Katz et al., 2018) for decompilation, complemented by error-correction techniques to enhance the outcomes.

将可执行二进制文件反编译回源代码形式的实践(称为反编译)已经研究了几十年 (Miecznikowski 和 Hendren, 2002; Nolan, 2012; Katz 等, 2019)。传统的反编译依赖于分析程序的控制流和数据流 (Brumley 等, 2013),并采用模式匹配,如 Hex-Rays IDA Pro (Hex-Rays, 2024) 和 Ghidra (Ghidra, 2024a) 等工具中所示。这些系统试图识别程序控制流图 (CFG) 中与标准编程结构(如条件语句或循环)相对应的模式。然而,这种反编译过程的输出往往是汇编代码的类源代码表示,包括将变量直接翻译为寄存器、使用 goto 语句以及其他低级操作,而不是原始的高级语言结构。这种输出虽然在功能上通常与原始代码相似,但难以理解且可能无法重新执行 (Liu 和 Wang, 2020b; Wong 等, 2023)。受神经机器翻译的启发,研究人员将反编译重新定义为一种翻译练习,将机器级指令转换为可读的源代码 (Katz 等, 2019)。该领域的初步尝试利用循环神经网络 (RNNs) (Katz 等, 2018) 进行反编译,并辅以纠错技术以增强结果。

Motivated by the success of Large Language Models (Li et al., 2023; Rozière et al., 2023; Guo et al., 2024), researchers have employed LLMs for decompilation, primarily through two approaches: Refined-Decompile and End2end-Decompile. In particular, Refined-Decompile prompts the LLMs to refine results from traditional decompilation tools like Ghidra or IDA Pro. For instance, DeGPT (Hu et al., 2024) enhances Ghidra's readability by reducing cognitive load by 24.4%, while DecGPT (Wong et al., 2023) increases IDA Pro's re-executability rate to over 75% by integrating error messages into its refinement process. These approaches, however, largely ignore the fact that LLMs are designed primarily for high-level programming languages (Li et al., 2023; Rozière et al., 2023; Guo et al., 2024), and their effectiveness with binary files is not well-established. End2end-Decompile, on the other hand, fine-tunes LLMs to decompile binaries directly. Early open-source models like BTC (Hosseini and Dolan-Gavitt, 2022) and the recent development Slade (Armengol-Estapé et al., 2023) adopt language models with around 200 million parameters (Lewis et al., 2020a) and fine-tune them for decompilation. Nova (Jiang et al., 2023), which is not open-sourced, develops a binary LLM with 1 billion parameters and fine-tunes it for decompilation. Consequently, the largest open-source model in this domain is limited to 200M parameters, whereas utilizing larger models trained on broader datasets has proven to substantially improve performance (Hoffmann et al., 2024; Kaplan et al., 2020; Rozière et al., 2023).

受大语言模型成功 (Li et al., 2023; Rozière et al., 2023; Guo et al., 2024) 的启发,研究人员主要采用两种方法——Refined-Decompile 和 End2end-Decompile——利用大语言模型进行反编译。具体而言,Refined-Decompile 提示大语言模型优化传统反编译工具(如 Ghidra 或 IDA Pro)的结果。例如,DeGPT (Hu et al., 2024) 通过将认知负荷降低 24.4% 来增强 Ghidra 输出的可读性,而 DecGPT (Wong et al., 2023) 通过将错误信息集成到其优化过程中,将 IDA Pro 输出的可重执行率提高到 75% 以上。然而,这些方法很大程度上忽略了大语言模型主要是为高级编程语言设计的这一事实 (Li et al., 2023; Rozière et al., 2023; Guo et al., 2024),它们在处理二进制文件时的有效性尚未得到充分证实。另一方面,End2end-Decompile 则对大语言模型进行微调,使其直接反编译二进制文件。早期的开源模型如 BTC (Hosseini and Dolan-Gavitt, 2022) 以及近期的 Slade (Armengol-Estapé et al., 2023) 采用约 2 亿参数的语言模型 (Lewis et al., 2020a) 进行反编译微调。Nova (Jiang et al., 2023) 虽然未开源,但开发了一个拥有 10 亿参数的二进制大语言模型,并对其进行反编译微调。因此,该领域最大的开源模型目前仅限于 200M 参数。而利用在更广泛数据集上训练的更大模型已被证明可以显著提高性能 (Hoffmann et al., 2024; Kaplan et al., 2020; Rozière et al., 2023)。

Therefore, our objective is to present the first and most extensive open-source LLM4Decompile series, aiming at comprehensively advancing the decompilation capability of LLMs. Initially, we optimize the End2end-Decompile approach to train LLM4Decompile-End, demonstrating its effectiveness in directly decompiling binary files. Subsequently, we enhance the Refined-Decompile framework to integrate LLMs with Ghidra, augmenting traditional tools for optimal effectiveness.

因此,我们的目标是推出首个且最全面的开源 LLM4Decompile 系列,旨在全面提升大语言模型的反编译能力。首先,我们优化了 End2end-Decompile 方法来训练 LLM4Decompile-End,展示了其在直接反编译二进制文件方面的有效性。随后,我们改进了 Refined-Decompile 框架,将大语言模型与 Ghidra 集成,增强传统工具以达到最佳效果。


Figure 2: End2end-Decompile framework. The source code (SRC) is compiled to binary, disassembled to assembly instructions (ASM), and decompiled by LLM4Decompile to generate SRC’. Loss is computed between SRC and SRC’ for training.

图 2: End2end-Decompile 框架。源代码 (SRC) 被编译为二进制文件,反汇编为汇编指令 (ASM),然后由 LLM4Decompile 反编译生成 SRC’。在 SRC 和 SRC’ 之间计算损失以进行训练。

3 LLM4Decompile

3 LLM4Decompile

First, we introduce our strategy for optimizing LLM training to directly decompile binaries; the resulting models are named LLM4Decompile-End. Following this, we detail our efforts for enhancing the Refined-Decompile approach; the corresponding fine-tuned models are referred to as LLM4Decompile-Ref, which can effectively refine the decompiled results from Ghidra.

首先,我们介绍了优化大语言模型训练以直接反编译二进制文件的策略,生成的模型命名为LLM4Decompile-End。随后,我们详细阐述了增强Refined-Decompile方法的努力,相应的微调模型称为LLM4Decompile-Ref,这些模型能够有效优化Ghidra的反编译结果。

3.1 LLM4Decompile-End

3.1 LLM4Decompile-End

In this section, we describe the general End2end-Decompile framework, and present details on our strategy to optimize the training of LLM4Decompile-End models.

在本节中,我们将描述通用的 End2end-Decompile 框架,并详细介绍我们优化 LLM4Decompile-End 模型训练的策略。

3.1.1 The End2End-Decompile Framework

3.1.1 End2End-Decompile 框架

Figure 2 illustrates the End2end-Decompile framework from compilation to decompilation processes. During compilation, the Preprocessor processes the source code (SRC) to eliminate comments and expand macros or includes. The cleaned code is then forwarded to the Compiler, which converts it into assembly code (ASM). This ASM is transformed into binary code (0s and 1s) by the Assembler. The Linker finalizes the process by linking function calls to create an executable file. Decompilation, on the other hand, involves converting binary code back into a source file. LLMs, being trained on text, lack the ability to process binary data directly. Therefore, binaries must first be disassembled by Objdump into assembly language (ASM). It should be noted that binary and disassembled ASM are equivalent: they can be interconverted, and thus we refer to them interchangeably. Finally, the loss is computed between the decompiled code and the source code to guide the training.

图 2 展示了从编译到反编译过程的 End2end-Decompile 框架。在编译过程中,预处理器 (Preprocessor) 处理源代码 (SRC),以消除注释并扩展宏或包含文件。清理后的代码随后被转发给编译器 (Compiler),编译器将其转换为汇编代码 (ASM)。该汇编代码由汇编器 (Assembler) 转换为二进制代码 (0 和 1)。链接器 (Linker) 通过链接函数调用完成最终的可执行文件创建过程。反编译则涉及将二进制代码转换回源文件。由于大语言模型 (LLM) 是基于文本训练的,无法直接处理二进制数据。因此,二进制文件必须首先由 Objdump 反汇编为汇编语言 (ASM)。需要注意的是,二进制文件和反汇编后的汇编代码是等价的,它们可以相互转换,因此我们在文中交替使用这两个术语。最后,计算反编译代码与源代码之间的损失,以指导训练。
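The compile-then-disassemble preprocessing described above can be sketched with standard tooling (GCC and objdump). The helper names below are ours, not from the paper; actually running `binary_to_asm` requires both tools on PATH:

```python
import subprocess

def compile_cmd(src_path: str, obj_path: str, opt: str = "O0") -> list:
    """GCC command that compiles one C file into an (unlinked) object file."""
    return ["gcc", f"-{opt}", "-c", src_path, "-o", obj_path]

def disassemble_cmd(obj_path: str) -> list:
    """objdump command that disassembles the object file into ASM text."""
    return ["objdump", "-d", obj_path]

def binary_to_asm(src_path: str, obj_path: str, opt: str = "O0") -> str:
    """Compile then disassemble; returns the ASM dump as a string."""
    subprocess.run(compile_cmd(src_path, obj_path, opt), check=True)
    result = subprocess.run(disassemble_cmd(obj_path), check=True,
                            capture_output=True, text=True)
    return result.stdout
```

Because the LLM consumes text, this objdump output is what stands in for the binary during both training and inference.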

3.1.2 Optimize LLM4Decompile-End

3.1.2 优化 LLM4Decompile-End

We optimize the training of LLM4Decompile-End models through three key steps: 1) augmenting the training corpus, 2) improving the quality of the data, and 3) incorporating two-stage training.

我们通过三个关键步骤优化 LLM4Decompile-End 模型的训练:1) 扩充训练语料库,2) 提高数据质量,3) 引入两阶段训练。

Training Corpus. As indicated by the Scaling Law (Hoffmann et al., 2024; Kaplan et al., 2020), the effectiveness of an LLM heavily relies on the size of the training corpus. Consequently, our initial step in training optimization involves incorporating a large training corpus. We construct asm-source pairs based on ExeBench (Armengol-Estapé et al., 2022), which is the largest public collection of C functions, with five million samples. To further expand the training data, we consider the compilation optimization states frequently used by developers. Compilation optimization involves techniques like eliminating redundant instructions, better register allocation, and loop transformations (Muchnick, 1997), which perfectly act as data augmentation for decompilation. The key optimization levels range from O0 (default, no optimization) to O3 (aggressive optimizations). We compile the source code at all four levels, i.e., O0, O1, O2, and O3, and pair each of them with the source code.

训练语料库。正如扩展定律 (Scaling Law) 所指出的 (Hoffmann et al., 2024; Kaplan et al., 2020),大语言模型的有效性在很大程度上依赖于训练语料库的规模。因此,我们在训练优化中的第一步是引入一个大规模的训练语料库。我们基于 ExeBench (Armengol-Estapé et al., 2022) 构建 asm-source 对,这是目前最大的公开 C 函数集合,包含五百万个函数。为了进一步扩展训练数据,我们考虑了开发者常用的编译优化状态。编译优化涉及消除冗余指令、更好的寄存器分配和循环转换等技术 (Muchnick, 1997),这些技术完美地充当了反编译的数据增强手段。关键的优化级别从 O0(默认,无优化)到 O3(激进优化)。我们将源代码以全部四个优化级别,即 O0、O1、O2 和 O3 进行编译,并将每个结果与源代码配对。
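The O0-O3 augmentation can be sketched as a small loop that pairs one source function with its ASM at every optimization level. `binary_to_asm` stands for any compile-plus-disassemble helper; it is stubbed here so the sketch runs on its own:

```python
OPT_LEVELS = ["O0", "O1", "O2", "O3"]

def make_pairs(src: str, binary_to_asm) -> list:
    """One (asm, source) training pair per optimization level."""
    return [{"opt": opt, "asm": binary_to_asm(src, opt), "src": src}
            for opt in OPT_LEVELS]

# Stubbed disassembler for illustration only.
pairs = make_pairs("int add(int a, int b) { return a + b; }",
                   lambda src, opt: f"<asm at -{opt}>")
assert len(pairs) == 4 and pairs[-1]["opt"] == "O3"
```

Each source function thus contributes four distinct training samples, since optimized ASM can look very different from its O0 counterpart.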

Data Quality. Data quality is critical in training an effective model (Li et al., 2023). Therefore, our second step is to clean our training set. We follow the guidelines of StarCoder (Li et al., 2023) by computing MinHash (Broder, 2000) for the code and utilizing Locally Sensitive Hashing (LSH) to remove duplicates. We also exclude samples that are less than 10 tokens.

数据质量。数据质量在训练有效模型时至关重要 (Li et al., 2023)。因此,我们的第二步是清理训练集。我们遵循 StarCoder 的指南 (Li et al., 2023),通过为代码计算 MinHash (Broder, 2000) 并利用局部敏感哈希 (LSH) 来去除重复项。我们还排除了少于 10 个 token 的样本。
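As a toy illustration of this cleaning step, the sketch below computes MinHash signatures from salted token hashes, drops samples under 10 tokens, and removes near-duplicates by pairwise signature comparison. The tokenizer, permutation count, and 0.85 threshold are our assumptions; a production pipeline would use banded LSH (as in StarCoder) rather than this quadratic scan:

```python
import hashlib
import re

def tokens(code: str) -> list:
    return re.findall(r"\w+|\S", code)

def minhash(code: str, num_perm: int = 64) -> tuple:
    """Toy MinHash: per-'permutation' minimum over salted token hashes."""
    toks = set(tokens(code))
    return tuple(min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                     for t in toks)
                 for seed in range(num_perm))

def jaccard_est(sig_a: tuple, sig_b: tuple) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(samples, threshold: float = 0.85, min_tokens: int = 10) -> list:
    kept, sigs = [], []
    for code in samples:
        if len(tokens(code)) < min_tokens:   # drop near-empty samples
            continue
        sig = minhash(code)
        if all(jaccard_est(sig, s) < threshold for s in sigs):
            kept.append(code)
            sigs.append(sig)
    return kept
```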

Two-Stage Training. Our final step for training optimization aims to educate the model with binary knowledge, and includes two-stage training. In the first stage, we train the model with a large corpus of compilable but not linkable (i.e., not executable) data. Note that it is significantly easier to extract C code that is compilable but not linkable (da Silva et al., 2021; Armengol-Estapé et al., 2022). Such non-executable binary object code closely resembles its executable version, except that it lacks linked addresses for external symbols. Therefore, in the first stage, we use the extensive compilable code to ground our model in binary knowledge. In the second stage, we refine the model using executable code to ensure its practical applicability. We also conduct an ablation study for the two-stage training in Section 4.1.2. A comparison between compilable and executable data is detailed in Appendix B.

两阶段训练。为了优化训练,我们的最后一步是让模型学习二进制知识,这包括两个阶段的训练。在第一阶段,我们使用大量可编译但不可链接(不可执行)的数据来训练模型。需要注意的是,提取可编译但不可链接的C代码要容易得多 (da Silva et al., 2021; Armengol-Estapé et al., 2022)。这种不可执行的二进制目标代码与其可执行版本非常相似,只是缺少外部符号的链接地址。因此,在第一阶段,我们使用大量的可编译代码来让模型掌握二进制知识。在第二阶段,我们使用可执行代码来精炼模型,以确保其实用性。我们还在第4.1.2节中对两阶段训练进行了消融研究。关于可编译数据和可执行数据的比较详见附录B。


Figure 3: Refined-Decompile framework. It differs from End2end-Decompile (Figure 2) only in the LLM’s input, which is pseudo-code decompiled from Ghidra.

图 3: Refined-Decompile 框架。它与 End2end-Decompile (图 2) 的区别仅在于大语言模型的输入,该输入是从 Ghidra 反编译得到的伪代码。

3.2 LLM4Decompile-Ref

3.2 LLM4Decompile-Ref

We now examine how the conventional decompilation tool, Ghidra, can be significantly improved by integrating it with LLMs. Note that our approach aims at refining entire outputs from Ghidra, offering a broader strategy than merely recovering names or types (Nitin et al., 2021; Xu et al., 2024). We begin by detailing the general Refined-Decompile framework, and then discuss our strategy to enhance Ghidra's output with LLM4Decompile-Ref.

我们现在研究如何通过将传统反编译工具 Ghidra 与大语言模型 (LLM) 集成来显著改进它。请注意,我们的方法旨在优化 Ghidra 的完整输出,提供比仅仅恢复名称或类型 (Nitin et al., 2021; Xu et al., 2024) 更广泛的策略。我们首先详细介绍通用的 Refined-Decompile 框架,然后讨论通过 LLM4Decompile-Ref 增强 Ghidra 输出的策略。

3.2.1 The Refined-Decompile Framework

3.2.1 精炼反编译框架

The Refined-Decompile approach is shown in Figure 3. This approach differs from that in Figure 2 only in terms of the LLM's input, which in the case of Refined-Decompile comes from Ghidra's decompilation output. Specifically, Ghidra is used to decompile the binary, and then the LLM is fine-tuned to enhance Ghidra's output. While Ghidra produces high-level pseudo-code that may suffer from readability issues and syntax errors, it effectively preserves the underlying logic. Refining this pseudo-code significantly mitigates the challenges associated with understanding the obscure ASM.

图 3 展示了 Refined-Decompile 方法。该方法与图 2 中的方法仅在 LLM 的输入方面有所不同,在 Refined-Decompile 的情况下,输入来自 Ghidra 的反编译输出。具体来说,使用 Ghidra 对二进制文件进行反编译,然后对 LLM 进行微调以增强 Ghidra 的输出。虽然 Ghidra 生成的高级伪代码可能存在可读性问题和语法错误,但它有效地保留了底层逻辑。优化这些伪代码可以显著缓解理解晦涩的 ASM 所带来的挑战。

3.2.2 Refine Ghidra by LLM4Decompile-Ref

Decompiling using Ghidra. Decompiling the executable code with Ghidra (Figure 3) is time-consuming due to the complex nature of the executables in ExeBench, which include numerous external functions and IO wrappers. Ghidra Headless (Ghidra, 2024b) requires 2 seconds per sample using 128-core multiprocessing. Given such a high computational load, and the high similarities between non-executable and executable binaries, we choose to decompile the non-executable files using Ghidra. This choice significantly reduces the time to 0.2 seconds per sample, enabling us to efficiently gather large amounts of training data.

3.2.2 使用 LLM4Decompile-Ref 优化 Ghidra

使用 Ghidra 进行反编译。由于 ExeBench 中的可执行文件包含大量外部函数和 IO 封装,使用 Ghidra 反编译可执行代码(图 3)非常耗时。Ghidra Headless (Ghidra, 2024b) 在使用 128 核多处理的情况下,每个样本需要 2 秒。考虑到如此高的计算负载,以及非可执行文件与可执行文件之间的高度相似性,我们选择使用 Ghidra 反编译非可执行文件。这一选择将每个样本的时间显著减少至 0.2 秒,使我们能够高效地收集大量训练数据。
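Batch decompilation of this kind is typically driven through Ghidra's headless analyzer. The sketch below only builds the command line; `DecompileToFile.py` is a placeholder for a user-supplied post-script that dumps the pseudo-code (Ghidra does not ship a script by that name), while the remaining flags are standard analyzeHeadless options:

```python
def ghidra_headless_cmd(binary_path: str,
                        project_dir: str = "/tmp/ghidra_proj",
                        post_script: str = "DecompileToFile.py") -> list:
    """Command line for one headless Ghidra decompilation run."""
    return ["analyzeHeadless", project_dir, "tmp_proj",
            "-import", binary_path,
            "-postScript", post_script,
            "-deleteProject"]          # discard the throwaway project

cmd = ghidra_headless_cmd("func.o")
assert cmd[0] == "analyzeHeadless" and "-postScript" in cmd
```

In a setup like the paper's, many such invocations would be fanned out across worker processes to cover millions of object files.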

Optimization Strategies. Similar to Section 3.1.2, we augment our dataset by compiling with optimization levels O0, O1, O2, and O3. We further filter the dataset using LSH to remove duplicates. As shown in Figure 1, Ghidra often generates overly long pseudo-code. Consequently, we filter out any samples that exceed the maximum length accepted by our model.

优化策略。与第3.1.2节类似,我们通过使用优化级别O0、O1、O2和O3编译来扩充数据集。我们进一步使用LSH过滤数据集以去除重复项。如图1所示,Ghidra经常生成过长的伪代码。因此,我们过滤掉任何超过模型接受的最大长度的样本。

4 Experiments

4 实验

In this section, we discuss the experimental setups and results for LLM4Decompile-End and LLM4Decompile-Ref respectively.

在本节中,我们分别讨论 LLM4Decompile-End 和 LLM4Decompile-Ref 的实验设置和结果。

4.1 LLM4Decompile-End

4.1 LLM4Decompile-End

4.1.1 Experimental Setups

4.1.1 实验设置

Training Data. As discussed in Section 3.1.2, we construct asm-source pairs based on the compilable and executable datasets from ExeBench (Armengol-Estapé et al., 2022), where we only consider the decompilation of GCC (Stallman et al., 2003) compiled C functions under the x86 Linux platform. After filtering, our refined compilable training dataset includes 7.2 million samples, containing roughly 7 billion tokens. Our executable training dataset includes 1.6 million samples, containing roughly 572 million tokens. To train the model, we use the following template: # This is the assembly code: [ASM code] # What is the source code? [source code], where [ASM code] corresponds to the disassembled assembly code from the binary, and [source code] is the original C function. Note that the template choice does not impact the performance, since we fine-tune the model to produce the source code.

训练数据。如第 3.1.2 节所述,我们基于 ExeBench (Armengol-Estapé et al., 2022) 中的可编译和可执行数据集构建 asm-source 对,其中我们仅考虑在 x86 Linux 平台上由 GCC (Stallman et al., 2003) 编译的 C 函数的反编译。经过过滤后,我们的可编译训练数据集包含 720 万个样本,大约包含 70 亿个 Token。我们的可执行训练数据集包含 160 万个样本,大约包含 5.72 亿个 Token。为了训练模型,我们使用以下模板:# This is the assembly code: [ASM code] # What is the source code? [source code],其中 [ASM code] 对应二进制文件反汇编得到的汇编代码,[source code] 是原始的 C 函数。请注意,模板的选择不会影响性能,因为我们微调模型以生成源代码。
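Instantiating the template above is straightforward; the `prompt`/`completion` field names are our own convention for a supervised fine-tuning pair, not a format prescribed by the paper:

```python
TEMPLATE = ("# This is the assembly code:\n{asm}\n"
            "# What is the source code?\n")

def build_example(asm: str, src: str) -> dict:
    """One supervised pair: the model reads `prompt`, loss is on `completion`."""
    return {"prompt": TEMPLATE.format(asm=asm), "completion": src}

ex = build_example("endbr64\nlea eax,[rdi+rsi]\nret",
                   "int add(int a, int b) { return a + b; }")
assert ex["prompt"].startswith("# This is the assembly code:")
```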

Evaluation Benchmarks and Metrics. To evaluate the models, we introduce the HumanEval (Chen et al., 2021) and ExeBench (Armengol-Estapé et al., 2022) benchmarks. HumanEval is the leading benchmark for code generation assessment and includes 164 programming challenges with accompanying Python solutions and assertions. We converted these Python solutions and assertions into C, making sure that they can be compiled with the GCC compiler using standard C libraries (Free Software Foundation, 2024) and pass all the assertions, and name it HumanEval-Decompile. ExeBench consists of 5000 real-world C functions taken from GitHub with IO examples. Note that HumanEval-Decompile consists of individual functions that depend only on the standard C library. In contrast, ExeBench includes functions extracted from real-world projects with user-defined structures and functions.

评估基准与指标。为了评估模型,我们引入了 HumanEval [Chen et al., 2021] 和 ExeBench [Armengol-Estapé et al., 2022] 基准。HumanEval 是代码生成评估的主要基准,包含 164 个编程挑战及其对应的 Python 语言解决方案和断言。我们将这些 Python 语言解决方案和断言转换为 C 语言,确保它们可以使用 GCC 编译器 (Free Software Foundation, 2024) 通过标准 C 库编译并通过所有断言,并将其命名为 HumanEval-Decompile。ExeBench 包含从 GitHub 中提取的 5000 个真实世界的 C 函数及其输入输出示例。需要注意的是,HumanEval-Decompile 由仅依赖于标准 C 库的独立函数组成。相比之下,ExeBench 包含从真实项目中提取的函数,这些函数涉及用户定义的结构和函数[2]。

As for the evaluation metrics, we follow previous work to calculate the re-executability rate (Armengol-Estapé et al., 2023; Wong et al., 2023). During evaluation, the C source code is first compiled into a binary, then disassembled into assembly code, and fed into the decompilation system to be reconstructed back into C code. This decompiled C code is then combined with the assertions to check if it can successfully execute and pass those assertions.

至于评估指标,我们遵循之前的工作来计算重新执行能力率 (Armengol-Estapé et al., 2023; Wong et al., 2023)。在评估过程中,C 源代码首先被编译成二进制文件,然后反汇编成汇编代码,并输入到反编译系统中以重新构建为 C 代码。接着,将反编译后的 C 代码与断言结合,检查其是否能够成功执行并通过这些断言。
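The re-executability check can be sketched as follows: glue the decompiled function to an assertion harness (a `main` full of asserts), compile, run, and count success only on exit code 0. The timeout and compiler flags here are our assumptions, and running `re_executable` requires GCC on PATH:

```python
import os
import subprocess
import tempfile

def combine_sources(decompiled_c: str, harness_c: str) -> str:
    """Glue the decompiled function and the assertion harness into one file."""
    return decompiled_c + "\n\n" + harness_c

def re_executable(decompiled_c: str, harness_c: str) -> bool:
    """True iff the combined program compiles, runs, and exits with 0."""
    with tempfile.TemporaryDirectory() as work:
        src = os.path.join(work, "prog.c")
        exe = os.path.join(work, "prog")
        with open(src, "w") as f:
            f.write(combine_sources(decompiled_c, harness_c))
        if subprocess.run(["gcc", src, "-o", exe],
                          capture_output=True).returncode != 0:
            return False                      # does not even compile
        try:
            run = subprocess.run([exe], timeout=10, capture_output=True)
        except subprocess.TimeoutExpired:
            return False
        return run.returncode == 0            # passed every assertion
```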

Model Configurations. LLM4Decompile uses the same architecture as DeepSeek-Coder (Guo et al., 2024) and we initialize our models with the corresponding DeepSeek-Coder checkpoints. We employ sequence-to-sequence prediction (S2S), the training objective adopted in most neural machine translation models, which aims to predict the output given the input sequence. As illustrated in Equation 1, it minimizes the negative log likelihood for the source code tokens x_i, ..., x_j:

模型配置。LLM4Decompile 使用了与 DeepSeek-Coder (Guo et al., 2024) 相同的架构,并使用相应的 DeepSeek-Coder 检查点初始化我们的模型。我们采用序列到序列预测 (Sequence-to-sequence prediction, S2S),这是大多数神经机器翻译模型中采用的训练目标,旨在给定输入序列的情况下预测输出。如公式 1 所示,它最小化源代码 Token x_i, ..., x_j 的负对数似然:

$$
\mathcal{L}(\theta) = -\sum_{t=i}^{j} \log P\left(x_t \mid x_1, \ldots, x_{t-1}; \theta\right) \tag{1}
$$

where the loss is calculated only over the output sequence x_i ... x_j, i.e., the source code.

其中损失仅在输出序列 x_i ... x_j(即源代码)上计算。
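A pure-Python toy makes the masking in Equation 1 concrete: given per-token log-probabilities, the loss is summed only over positions flagged as source-code tokens, mirroring how prompt (ASM) tokens are masked out (e.g. with -100 labels) in common causal-LM fine-tuning setups:

```python
def s2s_nll(token_logprobs, loss_mask):
    """Negative log-likelihood over output (source-code) positions only."""
    return -sum(lp for lp, m in zip(token_logprobs, loss_mask) if m)

# Five tokens: two ASM prompt tokens (masked out), three source tokens.
logprobs = [-0.1, -0.2, -0.5, -0.4, -0.3]
mask     = [0, 0, 1, 1, 1]
assert abs(s2s_nll(logprobs, mask) - 1.2) < 1e-9   # 0.5 + 0.4 + 0.3
```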

Baselines. We selected two key baselines for comparison. First, GPT-4o (OpenAI, 2023) represents the most capable LLMs, providing an upper bound on LLM performance. Second, DeepSeekCoder (Guo et al., 2024) is selected as the current SOTA open-source Code LLM. It represents the forefront of publicly available models specifically tailored for coding tasks. While recent work

基线。我们选择了两个关键基线进行比较。首先,GPT-4o (OpenAI, 2023) 代表了能力最强的大语言模型,提供了大语言模型性能的上限。其次,DeepSeekCoder (Guo et al., 2024) 被选为当前的SOTA开源代码大语言模型。它代表了专门为编码任务量身定制的前沿公开模型。

下表中左五列为 HumanEval-Decompile,右五列为 ExeBench;数值为可重执行率 (%)。

| 模型/基准 | O0 | O1 | O2 | O3 | AVG | O0 | O1 | O2 | O3 | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-Coder-6.7B | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| GPT-4o | 30.49 | 11.59 | 10.37 | 11.59 | 16.01 | 4.43 | 3.28 | 3.97 | 3.43 | 3.78 |
| LLM4Decompile-End-1.3B | 47.20 | 20.61 | 21.22 | 20.24 | 27.32 | 17.86 | 13.62 | 13.20 | 13.28 | 14.49 |
| LLM4Decompile-End-6.7B | 68.05 | 39.51 | 36.71 | 37.20 | 45.37 | 22.89 | 16.60 | 16.18 | 16.25 | 17.98 |
| LLM4Decompile-End-33B | 51.68 | 25.56 | 24.15 | 24.75 | 31.54 | 18.86 | 14.65 | 13.96 | 14.11 | 15.40 |

Table 1: Main comparison of End2end-Decompile approaches for re-executability rates (%) on evaluation benchmarks.

表 1: 评估基准上 End2end-Decompile 方法可重执行率 (%) 的主要比较。

下表中左五列为 HumanEval-Decompile,右五列为 ExeBench;数值为可重执行率 (%)。

| 模型/基准 | O0 | O1 | O2 | O3 | AVG | O0 | O1 | O2 | O3 | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| Compilable-1.3B | 42.68 | 16.46 | 16.46 | 17.07 | 23.17 | 5.68 | 4.46 | 4.16 | 4.43 | 4.68 |
| Compilable-6.7B | 51.83 | 33.54 | 32.32 | 32.32 | 37.50 | 7.52 | 6.49 | 6.71 | 6.60 | 6.83 |
| Executable-1.3B | 19.51 | 12.80 | 12.80 | 11.59 | 14.18 | 21.94 | 19.46 | 19.31 | 19.50 | 20.05 |
| Executable-6.7B | 37.20 | 18.29 | 22.56 | 17.07 | 23.78 | 29.38 | 25.98 | 25.91 | 25.49 | 26.69 |

Table 2: Ablation study on the training dataset. The “Compilable” models are trained on 7.2M non-executable functions, while the “Executable” models are trained on 1.6M executable functions.

表 2: 训练数据集的消融研究。“Compilable” 模型在 720 万个不可执行函数上训练,而 “Executable” 模型在 160 万个可执行函数上训练。

Slade (Armengol-Estapé et al., 2023) fine-tunes a language model for decompilation, it relies on intermediate compiler outputs, specifically, the *.s files. In practice, however, such intermediate files are rarely released by developers. Therefore, we focus on a more realistic approach and consider decompilation only from the binaries; for further discussion, please refer to Appendix B.

Slade (Armengol-Estapé et al., 2023) 针对反编译对语言模型进行了微调,它依赖于中间编译器输出,特别是 *.s 文件。然而,在实际中,开发者很少发布此类中间文件。因此,我们关注一种更现实的方法,仅从二进制文件进行反编译,更多讨论请参见附录 B。

Implementation. We use the DeepSeek-Coder models obtained from Hugging Face (Wolf et al., 2019). We train our models using the LLaMA-Factory library (Zheng et al., 2024). For the 1.3B and 6.7B models, we set batch size = 2048 and learning rate = 2e-5 and train the models for 2 epochs (15B tokens). Experiments are performed on NVIDIA A100-80GB GPU clusters. Fine-tuning the 1.3B and 6.7B LLM4Decompile-End takes 12 and 61 days on 8×A100 respectively. Limited by the resources, for the 33B model we only train for 200M tokens. For evaluation, we use vLLM (Kwon et al., 2023) to accelerate the generation (decompilation) process. We employ greedy decoding to minimize randomness.

实现。我们使用从 Hugging Face (Wolf et al., 2019) 获取的 DeepSeek-Coder 模型,并使用 LLaMA-Factory 库 (Zheng et al., 2024) 训练模型。对于 1.3B 和 6.7B 模型,我们设置批大小为 2048、学习率为 2e-5,并训练 2 个 epoch(150 亿 Token)。实验在 NVIDIA A100-80GB GPU 集群上进行。在 8×A100 上微调 1.3B 和 6.7B 的 LLM4Decompile-End 分别需要 12 天和 61 天。受资源限制,33B 模型仅训练了 2 亿 Token。在评估时,我们使用 vLLM (Kwon et al., 2023) 加速生成(反编译)过程,并采用贪婪解码以最小化随机性。

4.1.2 Experimental Results

4.1.2 实验结果

Main Results. Table 1 presents the re-executability rate under different optimization states for our studied models. The base version of DeepSeek-Coder-33B is unable to accurately decompile binaries. It could generate code that seemed correct but failed to retain the original program semantics. GPT-4o shows notable decompilation skills; it is capable of decompiling non-optimized (O0) code with a success rate of 30.5%, though the rate significantly decreases to about 11% for optimized codes (O1-O3). The LLM4Decompile-End models, on the other hand, demonstrate excellent decompilation abilities. The 1.3B version successfully decompiles and retains the program semantics in 27.3% of cases on average, whereas the 6.7B version has a success rate of 45.4%. This improvement underscores the advantages of using larger models to capture a program's semantics more effectively. While attempting to fine-tune the 33B model, we encountered substantial challenges related to the high communication loads, which significantly slowed the training process and restricted us to using only 200M tokens (Section 4.1.1). Despite this limitation, the 33B model still outperforms the 1.3B model, reaffirming the importance of scaling up the model size.

主要结果。表 1 展示了我们研究的模型在不同优化级别下的可重执行率。DeepSeek-Coder-33B 的基础版本无法准确反编译二进制文件。它生成的代码看似正确,但未能保留原始程序的语义。GPT-4o 展示了显著的反编译能力:它能够以 30.5% 的成功率反编译未优化的 (O0) 代码,但对于优化代码 (O1-O3),成功率显著下降至约 11%。另一方面,LLM4Decompile-End 模型展示了出色的反编译能力。1.3B 版本平均在 27.3% 的案例中成功反编译并保留了程序语义,而 6.7B 版本的成功率为 45.4%。这一改进凸显了使用更大模型更有效地捕捉程序语义的优势。在尝试微调 33B 模型时,我们遇到了与高通信负载相关的重大挑战,这显著减缓了训练过程,并限制我们仅使用 200M token(第 4.1.1 节)。尽管存在这一限制,33B 模型仍然优于 1.3B 模型,再次证明了扩大模型规模的重要性。

Ablation Study. As discussed in Section 4.1.1, our training data comprises two distinct sets: 7.2M compilable functions (non-executable) and 1.6M executable functions. We conducted an ablation study using these datasets, and the results are displayed in Table 2. Here, "Compilable" denotes the model trained solely on compilable data, while "Executable" indicates models trained exclusively on executable data. Notably, the binary objects from compilable functions lack links to function calls, which makes their text distribution similar to the HumanEval-Decompile data, consisting of single functions dependent only on standard C libraries. Consequently, the 6.7B model trained only on compilable data successfully decompiled 37.5% of HumanEval-Decompile functions, but only 6.8% on ExeBench, which features real functions with extensive user-defined functions. On the other hand, the 6.7B model trained solely on executable data achieved a 26.7% re-executability rate on the ExeBench test set but faced challenges with single functions, with only a 23.8% success rate on HumanEval-Decompile due to the smaller size of the training corpus. Limited by space, we present further analysis in Appendix C.

消融研究。如第 4.1.1 节所述,我们的训练数据包含两个不同的集合:720 万可编译函数(不可执行)和 160 万可执行函数。我们使用这些数据集进行了消融研究,结果如表 2 所示。其中,"Compilable"表示仅使用可编译数据训练的模型,而"Executable"表示仅使用可执行数据训练的模型。值得注意的是,来自可编译函数的二进制目标文件缺少函数调用的链接,这使其文本分布与 HumanEval-Decompile 数据相似,后者由仅依赖标准 C 库的单个函数组成。因此,仅使用可编译数据训练的 6.7B 模型成功反编译了 37.5% 的 HumanEval-Decompile 函数,但在包含大量用户自定义函数的真实函数的 ExeBench 上仅为 6.8%。另一方面,仅使用可执行数据训练的 6.7B 模型在 ExeBench 测试集上实现了 26.7% 的可重执行率,但在处理单个函数时面临挑战,由于训练语料库规模较小,在 HumanEval-Decompile 上的成功率仅为 23.8%。由于篇幅限制,我们在附录 C 中提供了进一步的分析。

Table 3: Main comparison of Refined-Decompile approaches in terms of re-executability rate and Edit Similarity on the HumanEval-Decompile benchmark. "+GPT-4o" refers to enhancing the Ghidra results with GPT-4o; "+LLM4Decompile-Ref" means refining the Ghidra results with the fine-tuned LLM4Decompile-Ref models. Note that the 33B model was trained using only 200M tokens, which is just 10% of the tokens used for the 1.3B/6.7B/22B models. For the 22B model, please refer to Appendix D.

表 3: 在 HumanEval-Decompile 基准测试中,Refined-Decompile 方法在可重执行率和编辑相似度方面的主要比较。"+GPT-4o"表示使用 GPT-4o 增强 Ghidra 的结果,"+LLM4Decompile-Ref"表示使用微调的 LLM4Decompile-Ref 模型精炼 Ghidra 的结果。请注意,33B 模型仅使用 200M token 进行训练,这只是 1.3B/6.7B/22B 模型所用 token 的 10%。有关 22B 模型的详细信息,请参阅附录 D。

4.2 LLM4Decompile-Ref

4.2 LLM4Decompile-Ref

4.2.1 Experimental Setups

4.2.1 实验设置

Experimental Datasets. The training data is constructed using ExeBench, with Ghidra Headless (Ghidra, 2024b) employed to decompile the binary object files. Due to constraints in computational resources, only 400K functions at each optimization level from O0 to O3 (1.6M samples, 1B tokens) are used for training, and the evaluation is conducted on HumanEval-Decompile. The models are trained using the same template described in Section 4.1.1. In addition, following previous work (Hosseini and Dolan-Gavitt, 2022; Armengol-Estapé et al., 2023), we assess the readability of decompiled results in terms of the Edit Similarity score.

实验数据集。训练数据使用 ExeBench 构建,通过 Ghidra Headless (Ghidra, 2024b) 反编译二进制目标文件。由于计算资源限制,仅使用 O0 到 O3 每个优化级别各 400K 个函数(共 1.6M 样本,1B token)进行训练,并在 HumanEval-Decompile 上进行评估。模型使用第 4.1.1 节中描述的相同模板进行训练。此外,遵循之前的工作 (Hosseini and Dolan-Gavitt, 2022; Armengol-Estapé et al., 2023),我们从编辑相似度 (Edit Similarity) 评分角度评估反编译结果的可读性。
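编辑相似度的一种常见定义是 1 减去 Levenshtein 编辑距离与较长字符串长度之比;论文评估中的具体归一化方式可能略有不同,下面仅是一个基于该假设的最小草图:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_similarity(source: str, decompiled: str) -> float:
    """Normalized similarity in [0, 1]; higher means closer to the source."""
    if not source and not decompiled:
        return 1.0
    return 1.0 - levenshtein(source, decompiled) / max(len(source), len(decompiled))
```

例如 `edit_similarity("kitten", "sitting")` 约为 0.571(编辑距离 3,较长串长度 7)。由于编译会丢弃变量名并重排逻辑,即使语义完全一致的反编译结果,该分数通常也远低于 1。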

Implementation. Configuration settings for the model are consistent with those in Section 4.1.1. For the 1.3B and 6.7B models, the fine-tuning process involves 2B tokens over 2 epochs, and requires 2 and 8 days respectively on 8×A100. Limited by resources, for the 33B model we only train on 200M tokens. For evaluation, we first assess the re-executability rate of Ghidra to establish a baseline. Subsequently, GPT-4o is used to enhance Ghidra's decompilation results with the prompt "Generate linux compilable C/C++ code of the main and other functions in the supplied snippet without using goto, fix any missing headers. Do not explain anything.", following DecGPT (Wong et al., 2023). Finally, we use the LLM4Decompile-Ref models to refine Ghidra's output.

实现。模型的配置设置与第 4.1.1 节中的设置保持一致。对于 1.3B 和 6.7B 模型,微调过程涉及 2B token,共进行 2 个 epoch,在 8×A100 上分别需要 2 天和 8 天。由于资源限制,33B 模型仅训练了 200M token。评估时,我们首先测定 Ghidra 的可重执行率以建立基线。随后,按照 DecGPT (Wong et al., 2023) 的方法,使用 GPT-4o 通过提示词 "Generate linux compilable C/C++ code of the main and other functions in the supplied snippet without using goto, fix any missing headers. Do not explain anything." 来增强 Ghidra 的反编译结果。最后,我们使用 LLM4Decompile-Ref 模型来优化 Ghidra 的输出。
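上述 GPT-4o 精炼步骤可以组织为一个简单的提示词构造函数。指令字符串来自正文引用的 DecGPT 提示词;函数名与消息格式(system/user 两条消息)是示意性假设,并非论文官方实现:

```python
# The instruction text is quoted from the paper (following DecGPT);
# the message layout below is an illustrative assumption.
REFINE_INSTRUCTION = (
    "Generate linux compilable C/C++ code of the main and other functions "
    "in the supplied snippet without using goto, fix any missing headers. "
    "Do not explain anything."
)

def build_refine_messages(ghidra_pseudocode: str) -> list[dict]:
    """Pair the fixed refinement instruction with Ghidra's pseudo-code."""
    return [
        {"role": "system", "content": REFINE_INSTRUCTION},
        {"role": "user", "content": ghidra_pseudocode},
    ]
```

构造出的消息列表随后可直接传给任意聊天式补全接口;精炼后的 C 代码再进入前述的重编译与测试流程。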

4.2.2 Experimental Results

4.2.2 实验结果

The results for the baselines and Refined-Decompile approaches are summarized in Table 3. For the pseudo-code decompiled by Ghidra, which is not optimized for re-execution, only an average of 20.1% passes the test cases. GPT-4o assists in refining this pseudo-code and enhancing its quality. The LLM4Decompile-Ref models offer substantial improvements over Ghidra's outputs, with the 6.7B model yielding a 160% increase in re-executability. Similar to the discussion in Section 4.1.2, the 33B model outperforms the 1.3B model even though it used considerably less training data, and it achieves performance that is only 3.6% below the 6.7B model, which benefited from ten times more training data. When compared to LLM4Decompile-End-6.7B, the LLM4Decompile-Ref-6.7B model, though trained on just 10% of the data used for LLM4Decompile-End, shows a 16.2% performance increase, suggesting a greater potential for the Refined-Decompile approach. We present further analysis in Appendix D.

基线方法和 Refined-Decompile 方法的结果总结在表 3 中。对于由 Ghidra 反编译的伪代码,由于未针对重新执行进行优化,平均只有 20.1% 通过了测试用例。GPT-4o 协助优化了这些伪代码并提升了其质量。LLM4Decompile-Ref 模型相较于 Ghidra 的输出有了显著改进,6.7B 模型使可重执行率提高了 160%。与第 4.1.2 节的讨论类似,33B 模型尽管使用了更少的训练数据,仍然优于 1.3B 模型,并且其性能仅比 6.7B 模型低 3.6%,而 6.7B 模型得益于十倍多的训练数据。与 LLM4Decompile-End-6.7B 相比,LLM4Decompile-Ref-6.7B 模型尽管仅使用了 LLM4Decompile-End 所用数据的 10%,却表现出 16.2% 的性能提升,这表明 Refined-Decompile 方法具有更大的潜力。我们在附录 D 中提供了进一步的分析。


Figure 4: Decompilation results of different approaches. The GPT-4o output is plausible yet fails to recover the array dimensions (incorrect 2D array arr[outer][inner]). Ghidra's pseudo-code is notably less readable, as discussed in Figure 1. The GPT-refined Ghidra result (Ghidra+GPT-4o) marginally enhances readability but fails to correctly render the for loops and array indexing. Conversely, LLM4Decompile-End and LLM4Decompile-Ref produce accurate and easy-to-read outputs.

图 4: 不同方法的反编译结果。GPT-4o 的输出看似合理,但未能恢复数组维度(错误的二维数组 arr[outer][inner])。Ghidra 的伪代码可读性较差,如图 1 所述。经过 GPT 优化的 Ghidra 结果(Ghidra+GPT-4o)略微提升了可读性,但未能正确还原 for 循环和数组索引。相反,LLM4Decompile-End 和 LLM4Decompile-Ref 生成了准确且易于阅读的输出。

An analysis of readability across different methods is also conducted and presented in Table 3, with illustrative examples in Figure 4. In terms of text similarity, all decompiled outputs diverge from the original source code, with Edit Similarity ranging from 5.7% to 14.0%, primarily because the compilation process removes variable names and optimizes the logic structure. Ghidra generates pseudo-code that is particularly hard to read, with 6.2% Edit Similarity on average. Interestingly, with refinement from GPT (Ghidra+GPT-4o), there is a marginal decrease in Edit Similarity. GPT assists in refining type errors like undefined4 and ulong (Figure 4). However, it struggles to accurately reconstruct for loops and array indexing. In contrast, both LLM4Decompile-End and LLM4Decompile-Ref generate outputs that are more aligned with the format of the source code and easier to comprehend.

对不同方法的可读性分析也在表 3 中进行了展示,并在图 4 中提供了示例。在文本相似性方面,所有反编译输出都与原始源代码存在差异,编辑相似度 (Edit Similarity) 范围从 5.7% 到 14.0%,这主要是因为编译过程会移除变量名并优化逻辑结构。Ghidra 生成的伪代码可读性尤其低,平均编辑相似度为 6.2%。有趣的是,经过 GPT (Ghidra+GPT-4o) 的优化后,编辑相似度反而略有下降。GPT 帮助修正了诸如 undefined4 和 ulong 等类型错误(图 4)。然而,它在准确重构 for 循环和数组索引方面仍存在困难。相比之下,LLM4Decompile-End 和 LLM4Decompile-Ref 生成的输出更符合源代码的格式,且更易于理解。

To summarize, domain-specific fine-tuning is crucial for enhancing the re-executability and readability of decompilation outputs.

总结来说,领域特定的微调对于提升反编译输出的重新执行能力和可读性至关重要。

We further employed GPT-4o to evaluate readability (Wang et al., 2023; Liu et al., 2023). Specifically, we guide GPT to assess syntax similarity (variables, loops, conditions) and structural integrity (logic flow, structure) using a structured template. We then summarize readability with a score from 1 (Poor) to 5 (Excellent), based on detailed comparisons between the original and decompiled code. The template is available in our GitHub repository. Table 4 summarizes our readability assessments on HumanEval-Decompile across various models and optimization levels.

我们进一步使用 GPT-4o 来评估代码的可读性 (Wang et al., 2023; Liu et al., 2023)。具体来说,我们引导 GPT 使用结构化模板来评估语法相似性(变量、循环、条件)和结构完整性(逻辑流、结构)。然后,我们根据原始代码和反编译代码之间的详细比较,将可读性总结为 1(差)到 5(优秀)的评分。该模板可在我们的 GitHub 仓库中找到。表 4 总结了我们对 HumanEval-Decompile 在不同模型和优化级别上的可读性评估。
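这一结构化评分流程可以草拟如下。注意:实际模板发布在作者的 GitHub 仓库中,下面的评分细则措辞与分数解析函数均为示意性假设:

```python
import re

# Hypothetical rubric in the spirit of the paper's description;
# the official template may differ in wording and format.
RUBRIC = """Compare the original and decompiled C functions.
Assess syntax similarity (variables, loops, conditions) and
structural integrity (logic flow, structure).
End your answer with: Score: <1-5>  (1 = Poor, 5 = Excellent)."""

def build_eval_prompt(original: str, decompiled: str) -> str:
    """Combine the rubric with the code pair to be judged."""
    return f"{RUBRIC}\n\n[Original]\n{original}\n\n[Decompiled]\n{decompiled}"

def parse_score(reply: str) -> int:
    """Extract the 1-5 readability score from the judge model's reply."""
    m = re.search(r"Score:\s*([1-5])", reply)
    if m is None:
        raise ValueError("no score found in reply")
    return int(m.group(1))
```

将 `build_eval_prompt` 的输出发给评审模型,再用 `parse_score` 汇总各优化级别上的平均分,即可得到类似表 4 的结果。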

Optimization Level          O0      O1      O2      O3      AVG
GPT-4o                      2.8171  2.3537  2.2927  2.3110  2.4436
Ghidra                      2.9756  2.4085  2.5183  2.3841  2.5716
LLM4Decompile-End-6.7B      4.0732  3.4634  3.4024  3.2378  3.5442

Table 4: Evaluation by GPT-4o on the readability of decompiled results from various methods.

表 4: GPT-4o 对各种方法反编译结果可读性的评估。

Compared with the results in Table 3, this indicates that Edit Similarity (ES) follows a trend similar to the GPT evaluation. Although ES is mathematically grounded, its values can be difficult to interpret. For instance, a 15% ES score obtained by the LLM4Decompile model may seem low, yet the decompiled function and the source code are highly aligned. In contrast, the GPT evaluation, which measures readability conceptually, is more intuitive: a score of 4 on the GPT scale suggests that the decompiled code is nearly identical to the original. Nonetheless, these scores are derived from GPT's "subjective" judgments. Combining insights from both ES and GPTEval could lead to a more thorough assessment of code readability.

与表 3 中的结果相比,这表明编辑相似度 (ES) 遵循与 GPT 评估类似的趋势。尽管 ES 是基于数学计算的,但其数值可能难以解释。例如,LLM4Decompile 模型获得的 15% 的 ES 分数可能看起来较低,但反编译的函数与源代码高度一致。相比之下,GPT 评估从概念上衡量可读性,更为直观:GPT 评分中的 4 分表明反编译代码与原始代码几乎相同。然而,这些分数来自 GPT 的"主观"判断。结合 ES 和 GPTEval 的见解,可以对代码可读性进行更全面的评估。

5 Obfuscation Discussion

5 混淆讨论

The process of decompilation aims at revealing the source code from binaries distributed by developers, presenting a potential threat to the protection of intellectual property. To address these ethical concerns, this section assesses the risks of possible misuse of our decompilation models.

反编译过程旨在从开发者分发的二进制文件中揭示源代码,对知识产权保护构成潜在威胁。为了解决伦理问题,本节评估了我们的反编译模型可能被滥用的风险。

In software development, engineers typically implement obfuscation techniques before releasing binary files to the public (Lachaux et al., 2021; Junod et al., 2015). This is done to protect the software from unauthorized analysis or modification. In our study, we focus on two fundamental obfuscation techniques as implemented in Obfuscator-LLVM (Junod et al., 2015): Control Flow Flattening (CFF) and Bogus Control Flow (BCF). These techniques are designed to disguise the true logic of the software, thereby making decompilation more challenging and protecting the software's intellectual property. We present the details of these two techniques in Appendix E.

在软件开发中,工程师通常在公开发布二进制文件之前实施混淆技术 (Lachaux et al., 2021; Junod et al., 2015)。这样做是为了保护软件免受未经授权的分析或修改。在我们的研究中,我们专注于 Obfuscator-LLVM (Junod et al., 2015) 中实现的两种基本混淆技术:控制流平坦化 (Control Flow Flattening, CFF) 和虚假控制流 (Bogus Control Flow, BCF)。这些技术旨在掩盖软件的真实逻辑,从而使反编译更具挑战性,以保护软件的知识产权。我们在附录 E 中详细介绍了这两种技术。
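控制流平坦化的核心思想是把结构化的循环与分支改写成由一个调度器驱动的状态机,从而掩盖原始控制流。下面用 Python 给出一个纯示意性的对照(并非 Obfuscator-LLVM 的实际变换,真实场景作用于 LLVM IR 层面的 C/C++ 代码):

```python
def sum_to_n(n: int) -> int:
    """Original, structured control flow: a plain accumulation loop."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_to_n_flattened(n: int) -> int:
    """Same semantics after flattening: every basic block becomes a
    numbered state, and a single dispatcher loop drives execution."""
    state, total, i = 0, 0, 1
    while True:
        if state == 0:            # state 0: loop guard
            state = 1 if i <= n else 2
        elif state == 1:          # state 1: loop body
            total += i
            i += 1
            state = 0
        else:                     # state 2: exit block
            return total
```

两个函数的输出完全一致,但平坦化版本中原有的循环结构已不可见——这正是表 5 中反编译成功率大幅下降的直观原因。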

Table 5: Re-executability rates of different approaches on the HumanEval-Decompile benchmark under obfuscation. Compared to Table 3, the decompilation success rates drop significantly, by over 70%.

表 5: 在混淆情况下,不同方法在 HumanEval-Decompile 基准上的可重执行率。与表 3 相比,反编译成功率显著下降,降幅超过 70%。

Results summarized in Table 5 demonstrate that basic conventional obfuscation techniques are sufficient to prevent both Ghidra and LLM4Decompile from decoding obfuscated binaries. For example, the decompilation success rate of the most advanced model, LLM4Decompile-Ref-6.7B, drops significantly, by 90.2% (0.5274 to 0.0519) under CFF and 78.0% (0.5274 to 0.1159) under BCF. Considering the industry standard of employing several complex obfuscation methods prior to software release, the experimental results in Table 5 mitigate concerns about unauthorized use for infringement of intellectual property.

表 5 中汇总的结果表明,基本的传统混淆技术足以阻止 Ghidra 和 LLM4Decompile 对混淆后的二进制文件进行解码。例如,对于最先进的模型 LLM4Decompile-Ref-6.7B,在 CFF 下,反编译成功率从 0.5274 显著下降至 0.0519,降幅为 90.2%;在 BCF 下,从 0.5274 下降至 0.1159,降幅为 78.0%。考虑到在软件发布前采用多种复杂混淆方法的行业标准,表 5 中的实验结果缓解了关于未经授权使用侵犯知识产权的担忧。

6 Conclusions

6 结论

We propose LLM4Decompile, the first and largest open-source LLM series, with sizes ranging from 1.3B to 33B, trained to decompile binary code. Based on the End2end-Decompile approach, we optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binaries directly. The resulting 6.7B model achieves a decompilation accuracy of 45.4% on HumanEval and 18.0% on ExeBench, surpassing existing tools like Ghidra and GPT-4o by over 100%. Additionally, we improve the Refined-Decompile strategy to fine-tune the LLM4Decompile-Ref models, which excel at refining Ghidra's output, with a 16.2% improvement over LLM4Decompile-End. Finally, we conduct obfuscation experiments and address concerns regarding the misuse of LLM4Decompile models for infringement of intellectual property.

我们提出了 LLM4Decompile,这是第一个也是最大的开源大语言模型系列,其规模从 1.3B 到 33B 不等,专为反编译二进制代码而训练。基于 End2end-Decompile 方法,我们优化了大语言模型的训练过程,并引入了 LLM4Decompile-End 模型,直接用于反编译二进制代码。所得到的 6.7B 模型在 HumanEval 上的反编译准确率为 45.4%,在 ExeBench 上为 18.0%,超越 Ghidra 和 GPT-4o 等现有工具 100% 以上。此外,我们改进了 Refined-Decompile 策略,对 LLM4Decompile-Ref 模型进行微调,该模型在优化 Ghidra 输出方面表现出色,相比 LLM4Decompile-End 提高了 16.2%。最后,我们进行了混淆实验,并回应了关于滥用 LLM4Decompile 模型侵犯知识产权的担忧。

Limitations

局限性

The scope of this research is limited to the compilation and decompilation of the C language targeting the x86 platform. While we are confident that the methodologies developed here could be easily adapted to other programming languages and platforms, these potential extensions have been reserved for future investigation. Furthermore, our research is limited by financial constraints, with a budget equivalent to using 8×A100 GPUs for one year, which includes all trials and iterations. As a result, we have only managed to fully fine-tune models up to 6.7B, and conducted initial explorations on the 33B models with a small dataset, leaving the exploration of 70B and larger models to future studies. Nonetheless, our preliminary tests confirm the potential advantages of scaling up model sizes and suggest a promising direction for future decompilation research into larger models.

本研究范围仅限于针对 x86 平台的C语言编译与反编译。尽管我们坚信此处开发的方法论可轻松适用于其他编程语言和平台,但这些潜在的扩展已被留待未来研究。此外,我们的研究受限于财务约束,预算相当于使用 8×A100 GPU一年,包括所有试验和迭代。因此,我们仅能对6.7B模型进行完整微调,并在小数据集上对33B模型进行了初步探索,将70B及以上模型的探索留给未来研究。尽管如此,我们的初步测试证实了扩大模型规模的潜在优势,并为未来反编译研究进入更大模型领域指明了有前景的方向。

Ethics Statement

道德声明

We have evaluated the risks of possible misuse of our decompilation models in Section 5. Basic obfuscation methods such as Control Flow Flattening and Bogus Control Flow have been empirically tested and shown to protect against unauthorized decompilation by both traditional tools like Ghidra and advanced models like LLM4Decompile. This built-in limitation ensures that while LLM4Decompile is a powerful tool for legitimate uses, it does not facilitate the infringement of intellectual property.

我们在第 5 节中评估了反编译模型可能被滥用的风险。控制流平坦化和虚假控制流等基本混淆方法已经通过实验测试,并证明能够有效防止 Ghidra 等传统工具和 LLM4Decompile 等高级模型进行未经授权的反编译。这种内置的限制确保了 LLM4Decompile 虽然是一个强大的合法使用工具,但不会助长对知识产权的侵犯。

In practical industry applications, software developers typically employ a series of complex obfuscation methods before releasing their software. This practice adds an additional layer of security and intellectual property protection against decompilation. LLM4Decompile's design and intended use respect these measures, ensuring that it serves as an aid in legal and ethical scenarios, such as understanding legacy code or enhancing cybersecurity defenses, rather than undermining them.

在实际的工业应用中,软件开发者在发布软件前通常会采用一系列复杂的混淆方法。这种做法为防范反编译增加了额外的安全性和知识产权保护层。LLM4Decompile 的设计和预期用途尊重这些措施,确保其在合法和道德的场景中发挥作用,例如理解遗留代码或增强网络安全防御,而不是削弱它们。

The development and deployment of LLM4Decompile are guided by strict ethical standards. The model is primarily intended for use in scenarios where permission has been granted or where the software is not protected by copyright. This includes academic research, debugging, learning, and situations where companies seek to recover lost source code of their own software.

LLM4Decompile 的开发和部署遵循严格的道德标准。该模型主要用于已获得授权或软件不受版权保护的场景。这包括学术研究、调试、学习以及公司寻求恢复自己软件丢失源代码的情况。

Acknowledgments

致谢

This work is partially supported by the National Natural Science Foundation of China (No. 62372220), the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. PolyU/25200821), the NSFC Young Scientists Fund (Project No. 62006203), the Innovation and Technology Fund (Project No. PRP/047/22FX), and PolyU Internal Fund from RC-DSAI (Project No. 1-CE1E).

本工作部分得到了国家自然科学基金(编号 62372220)、中国香港特别行政区研究资助局(项目编号 PolyU/25200821)、国家自然科学基金青年科学家基金(项目编号 62006203)、创新及科技基金(项目编号 PRP/047/22FX)以及香港理工大学 RC-DSAI 内部基金(项目编号 1-CE1E)的资助。