[Paper Translation] Large Language Models for Cyber Security: A Systematic Literature Review


Original paper: https://arxiv.org/pdf/2405.04760v3


Large Language Models for Cyber Security: A Systematic Literature Review


HANXIANG XU, Huazhong University of Science and Technology, China
SHENAO WANG, Huazhong University of Science and Technology, China
NINGKE LI, Huazhong University of Science and Technology, China
KAILONG WANG*, Huazhong University of Science and Technology, China
YANJIE ZHAO, Huazhong University of Science and Technology, China
KAI CHEN*, Huazhong University of Science and Technology, China
TING YU, Hamad Bin Khalifa University, The State of Qatar
YANG LIU, Nanyang Technological University, Singapore
HAOYU WANG*, Huazhong University of Science and Technology, China

The rapid advancement of Large Language Models (LLMs) has opened up new opportunities for leveraging artificial intelligence in a variety of application domains, including cyber security. As the volume and sophistication of cyber threats continue to grow, there is an increasing need for intelligent systems that can automatically detect vulnerabilities, analyze malware, and respond to attacks. In this survey, we conduct a comprehensive review of the literature on the application of LLMs in cyber security (LLM4Security). By comprehensively collecting over 30K relevant papers and systematically analyzing 127 papers from top security and software engineering venues, we aim to provide a holistic view of how LLMs are being used to solve diverse problems across the cyber security domain.


Through our analysis, we identify several key findings. First, we observe that LLMs are being applied to a wide range of cyber security tasks, including vulnerability detection, malware analysis, network intrusion detection, and phishing detection. Second, we find that the datasets used for training and evaluating LLMs in these tasks are often limited in size and diversity, highlighting the need for more comprehensive and representative datasets. Third, we identify several promising techniques for adapting LLMs to specific cyber security domains, such as fine-tuning, transfer learning, and domain-specific pre-training. Finally, we discuss the main challenges and opportunities for future research in LLM4Security, including the need for more interpretable and explainable models, the importance of addressing data privacy and security concerns, and the potential for leveraging LLMs for proactive defense and threat hunting.


Overall, our survey provides a comprehensive overview of the current state-of-the-art in LLM4Security and identifies several promising directions for future research. We believe that the insights and findings presented in this survey will contribute to the growing body of knowledge on the application of LLMs in cyber security and provide valuable guidance for researchers and practitioners working in this field.


1 INTRODUCTION


The rapid advancements in natural language processing (NLP) over the past decade have been largely driven by the development of large language models (LLMs). By leveraging the Transformer architecture [206] and training on massive amounts of textual data, LLMs like BERT [50], GPT-3,4 [148, 150], PaLM [41], Claude [16] and Chinchilla [79]


*Corresponding authors


Table 1. State-of-the-art surveys related to LLMs for security.


Reference | Year | Topic scope | Dimensions discussed | Time range | # Papers
Motlagh et al. [80] | 2024 | Security applications | Tasks | 2022-2023 | Not specified
Divakaran et al. [51] | 2024 | Security applications | Tasks | 2020-2024 | Not specified
Yao et al. [230] | 2023 | Security applications, security of LLMs | Models, tasks | 2019-2024 | 281
Yigit et al. [232] | 2024 | Security applications, security of LLMs | Tasks | 2020-2024 | Not specified
Coelho et al. [43] | 2024 | Security applications | Tasks, domain-specific techniques | 2021-2023 | 19
Novelli et al. [146] | 2024 | Security applications, security of LLMs | Tasks | 2020-2024 | Not specified
LLM4Security | 2024 | Security applications | Models, tasks, domain-specific techniques, data | 2020-2024 | 127

have achieved remarkable performance across a wide range of NLP tasks, including language understanding, generation, and reasoning. These foundational models learn rich linguistic representations that can be adapted to downstream applications with minimal fine-tuning, enabling breakthroughs in domains such as open-domain question answering [2], dialogue systems [152, 231], and program synthesis [6].


In particular, one important domain where LLMs are beginning to show promise is cyber security. With the growing volume and sophistication of cyber threats, there is an urgent need for intelligent systems that can automatically detect vulnerabilities, analyze malware, and respond to attacks [20, 36, 138]. Recent research has explored the application of LLMs across a wide range of cyber security tasks, i.e., LLM4Security hereafter. In the domain of software security, LLMs have been used for detecting vulnerabilities from natural language descriptions and source code, as well as generating security-related code, such as patches and exploits. These models have shown high accuracy in identifying vulnerable code snippets and generating effective patches for common types of vulnerabilities [30, 40, 65]. Beyond code-level analysis, LLMs have also been applied to understand and analyze higher-level security artifacts, such as security policies and privacy policies, helping to classify documents and detect potential violations [75, 135]. In the realm of network security, LLMs have demonstrated the ability to detect and classify various types of attacks from network traffic data, including DDoS attacks, port scanning, and botnet traffic [10, 11, 140]. Malware analysis is another key area where LLMs are showing promise, with models being used to classify malware families based on textual analysis reports and behavioral descriptions, as well as detecting malicious domains and URLs [93, 123]. LLMs have also been employed in the field of social engineering to detect and defend against phishing attacks by analyzing email contents and identifying deceptive language patterns [90, 172].
Moreover, researchers are exploring the use of LLMs to enhance the robustness and resilience of security systems themselves, by generating adversarial examples for testing the robustness of security classifiers and simulating realistic attack scenarios for training and evaluation purposes [31, 179, 198]. These diverse applications demonstrate the significant potential of LLMs to improve the efficiency and effectiveness of cyber security practices by processing and extracting insights from large amounts of unstructured text, learning patterns from vast datasets, and generating relevant examples for testing and training purposes.


While there have been several valuable efforts in the literature to survey the LLM4Security field [43, 51, 141, 230], given the growing body of work in this direction, these studies often have a more focused scope. Many of the existing surveys primarily concentrate on reviewing the types of tasks that LLMs can be applied to, without providing an extensive analysis of other essential aspects related to these tasks, such as the data and domain-specific techniques employed [146, 232], as shown in Table 1. For example, Divakaran et al. [51] only analyzed the prospects and challenges of LLMs in various security tasks, discussing the characteristics of each task separately. However, it lacks insight into the connection between the requirements of these security tasks and data, as well as the application of LLMs in domain-specific technologies.


To address these limitations and provide an in-depth understanding of the state-of-the-art in LLM4Security, we conduct a systematic and extensive survey of the literature. By comprehensively collecting 38,112 relevant papers and systematically analyzing 127 papers from top security and software engineering venues, our survey aims to provide a holistic view of how LLMs are being applied to solve diverse problems across the cyber security domain. In addition to identifying the types of tasks that LLMs are being used for, we also examine the specific datasets, preprocessing techniques, and domain adaptation methods employed in each case. This enables us to provide a more nuanced analysis of the strengths and limitations of different approaches, and to identify the most promising directions for future research. Specifically, we focus on answering four key research questions (RQs):


For each research question, we provide a fine-grained analysis of the approaches, datasets, and evaluation methodologies used in the surveyed papers. We identify common themes and categorize the papers along different dimensions to provide a structured overview of the landscape. Furthermore, we highlight the key challenges and limitations of current approaches to guide future research towards addressing the gaps. We believe our survey can serve as a valuable resource for researchers working at the intersection of NLP, AI, and cyber security. The contributions of this work are summarized as follows:


The survey progresses with the following framework. We outline our survey methodology, including the search strategy, inclusion/exclusion criteria, and the data extraction process, in Section 2. The analysis and findings for each of the four research questions can be found in Sections 3 through 6. Sections 7 and 8 discuss the limitations and implications of our results, while also identifying promising directions for future research. Finally, Section 9 concludes the paper.


2 METHODOLOGY


In this study, we conducted a Systematic Literature Review (SLR) to investigate the latest research on LLM4Security. This review aims to provide a comprehensive mapping of the landscape, identifying how LLMs are being deployed to enhance cyber security measures.



Fig. 1. Systematic Literature Review Methodology for LLM4Security.


Following the established SLR guidelines [99, 164], our methodology is structured into three pivotal stages as shown in Figure 1: Planning (§2.1), Conducting (§2.2, §2.3), and Reporting (§2.4), each meticulously designed to ensure comprehensive coverage and insightful analysis of the current state of research in this burgeoning field.


Planning. Initially, we formulated precise research questions to understand how LLMs are being utilized in security tasks, the benefits derived, and the associated challenges. Subsequently, we developed a detailed protocol delineating our search strategy, including specific venues and databases, keywords, and quality assessment criteria. Each co-author reviewed this protocol to enhance its robustness and align with our research objectives.


Literature survey and analysis. We meticulously crafted our literature search to ensure comprehensiveness, employing both manual and automated strategies across various databases to encompass a wide range of papers. Each study identified underwent a stringent screening process, initially based on their titles and abstracts, followed by a thorough review of the full text to ensure conformity with our predefined criteria. To prevent overlooking related papers, we also conducted forward and backward snowballing on the collected papers.


Reporting. We present our findings through a structured narrative, complemented by visual aids like flowcharts and tables, providing a clear and comprehensive overview of the existing literature. The discussion delves into the implications of our findings, addressing the potential of LLMs to revolutionize cyber security practices and identifying gaps that warrant further investigation.


2.1 Research Questions


The primary aim of this SLR, focused on the context of LLM4Security, is to meticulously dissect and synthesize existing research at the intersection of these two critical fields. This endeavor seeks to illuminate the multifaceted applications of LLMs in cyber security, assess their effectiveness, and delineate the spectrum of methodologies employed across various studies. To further refine this objective, we formulated the following four Research Questions (RQs):


2.2 Search Strategy


To collect and identify a set of relevant literature as accurately as possible, we employed the "Quasi-Gold Standard" (QGS) [239] strategy for literature search. The overview of the strategy we applied in this work is as follows:


Step 1: Identify related venues and databases. To initiate this approach, we first identify specific venues for manual search and then choose suitable libraries and databases for the automated search. In this stage, we opt for six of the top Security conferences and journals (i.e., S&P, NDSS, USENIX Security, CCS, TDSC, and TIFS) as well as six of the leading Software Engineering conferences and journals (i.e., ICSE, ESEC/FSE, ISSTA, ASE, TOSEM, and TSE). Given the emerging nature of LLMs in research, we also include arXiv in both manual and automated searches, enabling us to capture the latest unpublished studies in this rapidly evolving field. For automated searches, we select seven widely utilized databases, namely the ACM Digital Library, IEEE Xplore, Science Direct, Web of Science, Springer, Wiley, and arXiv. These databases offer comprehensive coverage of computer science literature and are commonly employed in systematic reviews within this domain [80, 236, 252].


Step 2: Establish the QGS. In this step, we start by creating a manually curated set of studies that have been carefully screened to form the QGS. A total of 41 papers relevant to LLM4Security are manually identified, aligning with the research objective and encompassing various techniques, application domains, and evaluation methods.


Step 3: Define search keywords. The keywords for automatic search are elicited from the titles and abstracts of the selected QGS papers through word frequency analysis. The search string consists of two sets of keywords:

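The word-frequency elicitation step can be sketched as follows. The QGS titles, the stopword list, and the resulting keywords below are all illustrative placeholders — the survey's actual keyword sets are not reproduced in this excerpt.

```python
from collections import Counter
import re

# Hypothetical QGS titles/abstracts (illustrative only).
qgs_texts = [
    "Large language models for vulnerability detection in source code",
    "Detecting malware with a large language model based classifier",
    "LLM-assisted fuzzing for protocol security testing",
]

STOPWORDS = {"for", "in", "with", "a", "the", "based", "of", "and"}

def elicit_keywords(texts, top_n=5):
    """Rank candidate search keywords by word frequency across titles/abstracts."""
    words = []
    for text in texts:
        words += [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(top_n)]

print(elicit_keywords(qgs_texts))
```

In practice the ranked candidates would be split by hand into the two keyword sets (LLM-related terms and security-related terms) that the search string pairs together.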


Fig. 2. Paper Search and Selection Process.


Step 4: Conduct an automated search. The identified keywords are paired one by one and input into automated searches across the seven widely used databases mentioned above. Our automated search focused on papers published from 2019 onward, the year GPT-2 was released, as it marked a significant milestone in the development of large language models. The search was conducted in the title, abstract, and keyword fields of the papers in each database. Specifically, the number of papers retrieved from each database after applying the search query and the year filter (2019-2023) is as follows: 398 papers in the ACM Digital Library, 2,112 papers in IEEE Xplore, 724 papers in Science Direct, 4,245 papers in Web of Science, 23,721 papers in Springer, 7,154 papers in Wiley, and 3,557 papers in arXiv.

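A minimal sketch of how such paired query strings might be assembled. The keyword lists below are placeholders (the survey's actual keyword sets are not shown in this excerpt), and each real database has its own query syntax, so the string template is illustrative.

```python
from itertools import product

# Illustrative keyword sets: one LLM-related, one security-related (assumptions).
llm_keywords = ["large language model", "LLM", "GPT"]
security_keywords = ["security", "vulnerability", "malware"]

def build_queries(set_a, set_b, year_from=2019):
    """Pair every keyword from one set with every keyword from the other,
    yielding one search string per pair, restricted to papers from year_from on."""
    return [
        f'("{a}" AND "{b}") AND year>={year_from}'
        for a, b in product(set_a, set_b)
    ]

queries = build_queries(llm_keywords, security_keywords)
print(len(queries))  # 3 x 3 = 9 paired queries
```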

2.3 Study Selection


After obtaining the initial pool of 38,112 papers (38,071 from the automated search and 41 from the QGS), we conducted a multi-stage study selection process to identify the most relevant and high-quality papers for our systematic review.


2.3.1 Coarse-Grained Inclusion and Exclusion Criteria. To select relevant papers for our research questions, we defined four inclusion criteria and eight exclusion criteria (as listed in Table 2) for the coarse-grained paper selection process. Among them, In#1, Ex#1, Ex#2, and Ex#3 were automatically applied based on the keywords, duplication status, length, and publication venue of the papers. The remaining inclusion criteria (In#2~4) and exclusion criteria (Ex#4~8) were manually applied by inspecting the topic and content of each paper. Specifically, criterion In#1 retained 7,582 papers whose titles and abstracts contained a pair of the identified search keywords. Subsequently, Ex#1 filtered out 440 duplicate or multi-version papers from the same authors with little difference. Next, the automated filtering criterion Ex#2 was applied to exclude short papers, tool demos, keynotes, editorials, books, theses, workshop papers, and poster papers, removing 4,855 papers. The remaining papers were then screened against criterion Ex#3, which retained 523 full research papers published in the identified venues or as preprints on arXiv. The remaining inclusion and exclusion criteria (In#2~4, Ex#4~8) were then manually applied to the titles and abstracts of these 523 papers to determine their relevance to the research topic. Three researchers independently applied the inclusion and exclusion criteria to the titles and abstracts, and disagreements were resolved through discussion and consensus. After this manual inspection stage, 156 papers were included for further fine-grained full-text quality assessment.

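The coarse-grained funnel above reduces to simple arithmetic over the counts reported in this subsection; the variable names here are ours, not the paper's.

```python
# Funnel arithmetic for the coarse-grained selection stage (Section 2.3.1).
initial = 38_071 + 41          # automated search + QGS
after_in1 = 7_582              # In#1: keyword pair in title/abstract
after_ex1 = after_in1 - 440    # Ex#1: duplicates / multi-version papers
after_ex2 = after_ex1 - 4_855  # Ex#2: short papers, demos, keynotes, ...
after_ex3 = 523                # Ex#3: identified venues or arXiv preprints
after_manual = 156             # In#2~4, Ex#4~8 applied to titles/abstracts

assert initial == 38_112       # matches the initial pool reported above
assert after_ex2 == 2_287      # papers entering the venue filter
print(after_manual)
```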

Table 2. Inclusion and exclusion criteria.


Inclusion criteria
In#1: The title and abstract of the paper contain a pair of the identified search keywords;
In#4: The paper evaluates the performance or effectiveness of LLMs in security scenarios.
Exclusion criteria
Ex#1: Duplicate papers, or multi-version studies from the same authors with little difference;
Ex#2: Short papers of fewer than 8 pages, tool demos, keynotes, editorials, books, theses, workshop papers, or poster papers;
Ex#3: Papers neither published in the identified conferences or journals nor released as preprints on arXiv;
Ex#4: Papers that do not focus on security tasks (e.g., general-domain NLP tasks);
Ex#5: Papers that use traditional machine learning or deep learning techniques without involving LLMs;
Ex#6: Secondary studies such as systematic literature reviews (SLRs), reviews, or surveys;
Ex#7: Papers not written in English;
Ex#8: Papers that focus on the security of LLMs rather than using them for security tasks.

2.3.2 Fine-Grained Quality Assessment. To ensure the included papers are of sufficient quality and rigor, we assessed them using a set of quality criteria adapted from existing guidelines for systematic reviews in software engineering. The quality criteria included:


Each criterion was scored on a 3-point scale (0: not met, 1: partially met, 2: fully met). Papers with a total score of 6 or higher (out of 10) were considered as having acceptable quality. After the quality assessment, 93 papers remained in the selected set.

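A minimal sketch of the pass/fail rule, assuming five criteria scored 0-2 each (inferred from the 10-point maximum stated above); the criteria themselves are not listed in this excerpt.

```python
def acceptable_quality(scores, threshold=6):
    """Each criterion is scored 0 (not met), 1 (partially met), or 2 (fully met);
    a paper passes if its total reaches the threshold (6 of 10 in the survey).
    Assumes five criteria, inferred from the 10-point maximum."""
    assert len(scores) == 5 and all(s in (0, 1, 2) for s in scores)
    return sum(scores) >= threshold

# Hypothetical papers scored against the five criteria.
print(acceptable_quality([2, 2, 1, 1, 0]))  # total 6 -> True
print(acceptable_quality([1, 1, 1, 1, 1]))  # total 5 -> False
```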

2.3.3 Forward and Backward Snowballing. To further expand the coverage of relevant literature, we performed forward and backward snowballing on the 93 selected papers. Forward snowballing identified papers that cited the selected papers, while backward snowballing identified papers that were referenced by the selected papers.


Here we obtained 2,056 and 5,255 papers during the forward and backward processes, respectively. We then applied the same inclusion/exclusion criteria and quality assessment to the papers found through snowballing. After the initial keyword filtering and deduplication, 1,978 papers remained. Among them, 68 papers were excluded during the page-number filtering step, and a further 1,235 papers were removed to ensure the papers were published in the selected venues. After confirming the paper topics and assessing the paper quality, only 44 papers were ultimately retained from the snowballing process, resulting in a final set of 127 papers for data extraction and synthesis.


2.4 Statistics of Selected Papers


After conducting searches and snowballing, a total of 127 relevant research papers were ultimately obtained. The distribution of the included documents is outlined in Figure 3. As depicted in Figure 3(A), 39% of the papers underwent peer review before publication. Among these venues, ICSE had the highest frequency, contributing 7%. Other venues making significant contributions included FSE, ISSTA, ASE, and TSE, with contributions of 5%, 5%, 3%, and 3%, respectively. Meanwhile, the remaining 61% of the papers were published on arXiv, an open-access platform serving as a repository for scholarly articles. This discovery is unsurprising given the rapid emergence of new LLM4Security studies, with many works recently completed and potentially undergoing peer review. Despite the lack of peer review, we conducted rigorous quality assessments on all collected papers to ensure the integrity of our investigation results. This approach enables us to include all high-quality and relevant publications while upholding stringent research standards.



Fig. 3. Overview of the selected 127 papers’ distribution.


The temporal distribution of the included papers is depicted in Figure 3(B). Since 2020, there has been a notable upward trend in the number of publications. In 2020, only 1 relevant paper was published, followed by 2 in 2021. However, the number of papers sharply increased to 11 by 2022. Surprisingly, in 2023, the total count surged to 109 published papers. This rapid growth trend signifies an increasing interest in LLM4Security research. Currently, many works from 2024 are still under review or unpublished. Hence, we have chosen only 6 representative papers. We will continue to observe the developments in LLM4Security research throughout 2024.


Table 3. Extracted data items and related research questions (RQs).


RQ | Data item
1,2,3,4 | Category of the LLMs
1,3,4 | Category of the security domains
1,2,3 | Properties and applicability of the LLMs
1,3 | Security task requirements and the application of LLM solutions
1 | Security tasks within each security domain
3 | Techniques for adapting LLMs to the task
3 | Notable external enhancement techniques
4 | Types and characteristics of the datasets used

After completing the full-text review phase, we proceeded with data extraction. The objective was to collect all relevant information essential for offering detailed and insightful answers to the RQs outlined in §2.1. As illustrated in Table 3, the extracted data included the categorization of security tasks, their corresponding domains, as well as classifications of LLMs, external enhancement techniques, and dataset characteristics. Using the gathered data, we systematically examined the relevant aspects of LLM application within the security domains.


3 RQ1: WHAT TYPES OF SECURITY TASKS HAVE BEEN FACILITATED BY LLM-BASED APPROACHES?


This section delves into the detailed examination of LLM utilization across diverse security domains. We have classified them into five primary domains, aligning with the themes of the collected papers: software and system security, network security, information and content security, hardware security, and blockchain security, totaling 127 papers. Figure 4 visually depicts the distribution of LLMs within these five domains. Additionally, Table 4 offers a comprehensive breakdown of research detailing specific security tasks addressed through LLM application.



Fig. 4. Distribution of LLM usages in security domains.


The majority of research activity in the realm of software and system security, constituting around 59.84% of the total research output, is attributed to the advancements made by code LLMs [178, 247, 250] and the extensive applications of LLMs in software engineering [80]. This emphasis underscores the significant role and impact of LLMs in software and system security, indicating a predominant focus on leveraging LLMs to automate the handling of potential security issues in programs and systems. Approximately 17.32% of the research focus pertains to network security tasks, highlighting the importance of LLMs in aiding traffic detection and network threat analysis. Information and content security activities represent around 14.17% of the research output, signaling a growing interest in employing LLMs for generating and detecting fake content. Conversely, activities in hardware security and blockchain security account for approximately 4.72% and 3.94% of the research output, respectively, suggesting that while exploration in these domains has been comparatively limited thus far, there remains research potential in utilizing LLMs to analyze hardware-level vulnerabilities and potential security risks in blockchain technology.

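The reported shares can be cross-checked against the per-domain paper counts in Table 4 (software/system 76, network 22, information/content 18, hardware 6, blockchain 5, totaling 127):

```python
# Recompute each domain's share of the 127 surveyed papers from Table 4 counts.
counts = {
    "software and system security": 76,
    "network security": 22,
    "information and content security": 18,
    "hardware security": 6,
    "blockchain security": 5,
}
total = sum(counts.values())
shares = {domain: round(100 * n / total, 2) for domain, n in counts.items()}

assert total == 127
assert shares["software and system security"] == 59.84
assert shares["blockchain security"] == 3.94
print(shares)
```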

Table 4. Distribution of security tasks over six security domains.


Security domain | Security tasks | Total
Network security | Web fuzzing (3), Traffic and intrusion detection (10), Cyber threat analysis (5), Penetration testing (4) | 22
Software and system security | Vulnerability detection (17), Vulnerability repair (10), Bug detection (8), Bug repair (20), Program fuzzing (6), Reverse engineering and binary analysis (7), Malware detection (2), System log analysis (6) | 76
Information and content security | Phishing and scam detection (8), Harmful content detection (6), Steganography (2), Access control (1), Forensics (1) | 18
Hardware security | Hardware vulnerability detection (2), Hardware vulnerability repair (4) | 6
Blockchain security | Smart contract security (4), Transaction anomaly detection (1) | 5

3.1 Application of LLMs in Network Security


This section explores the application of LLMs in the field of network security. The tasks include web fuzzing, intrusion and anomaly detection, cyber threat analysis, and penetration testing.


Web fuzzing. Web fuzzing is a mutation-based fuzzing technique that generates test cases incrementally based on the coverage feedback it receives from the instrumented web application [205]. Security is undeniably the most critical concern for web applications. Fuzzing can help operators discover more potential security risks in web applications. Liang et al. [115] proposed GPTFuzzer, based on an encoder-decoder architecture. It generates effective payloads for web application firewalls (WAFs) targeting SQL injection, XSS, and RCE attacks by generating fuzz test cases. The model undergoes reinforcement-learning [112] fine-tuning with a KL-divergence penalty to effectively generate attack payloads and mitigate the local-optimum issue. Similarly, Liu et al. [120] utilized an encoder-decoder architecture model to generate SQL injection detection test cases for web applications, enabling the translation of user inputs into new test cases. Meng et al.'s CHATAFL [133], on the other hand, shifts the focus to leveraging LLMs to generate structured and sequenced effective test inputs for network protocols lacking machine-readable versions.

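As a sanity check on the mutation-plus-coverage-feedback loop described above, here is a toy coverage-guided fuzzer. The target, its coverage signal, and the mutation fragments are all illustrative stand-ins: real web fuzzers such as those surveyed instrument a live application (or observe WAF responses) and derive feedback from it.

```python
import random

def target_coverage(payload: str) -> set:
    """Toy stand-in for an instrumented target: returns the set of 'branches'
    the payload reaches (illustrative, not a real web application)."""
    cov = {"base"}
    if "'" in payload:
        cov.add("quote-branch")
    if "OR" in payload.upper():
        cov.add("or-branch")
    if "--" in payload:
        cov.add("comment-branch")
    return cov

def mutate(payload: str, rng: random.Random) -> str:
    """Append a random attack-like fragment (illustrative mutation operator)."""
    fragments = ["'", " OR 1=1", "--", "<script>", " UNION"]
    return payload + rng.choice(fragments)

def fuzz(seed: str, iterations: int = 200, rng=None) -> list:
    """Keep only mutants that reach coverage not seen before."""
    rng = rng or random.Random(0)
    corpus, seen = [seed], target_coverage(seed)
    for _ in range(iterations):
        mutant = mutate(rng.choice(corpus), rng)
        cov = target_coverage(mutant)
        if not cov <= seen:            # new coverage -> keep the mutant
            seen |= cov
            corpus.append(mutant)
    return corpus

print(len(fuzz("id=1")) > 1)
```

The design choice mirrors the coverage-guided principle: mutants are retained only when they exercise new behavior, which keeps the corpus small while steadily expanding coverage.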

Traffic and intrusion detection. Detecting network traffic and intrusions is a crucial aspect of network security and management [137]. LLMs have been widely applied in network intrusion detection tasks, covering traditional web applications, IoT (Internet of Things), and in-vehicle network scenarios [11, 62, 131, 138]. LLMs not only learn the characteristics of malicious traffic data [10, 11, 138] and capture anomalies in user-initiated behaviors [24] but also describe the intent of intrusions and abnormal behaviors [3, 10, 58]. Additionally, they can provide corresponding security recommendations and response strategies for identified attack types [37]. Liu et al. [123] proposed a method for detecting malicious URL behavior by utilizing LLMs to extract hierarchical features of malicious URLs. Their work extends the application of LLMs in intrusion detection tasks to the user level, demonstrating the generality and effectiveness of LLMs in intrusion and anomaly detection tasks.

流量和入侵检测。检测网络流量和入侵是网络安全和管理的关键方面 [137]。大语言模型已广泛应用于网络入侵检测任务,涵盖传统网络应用、物联网 (Internet of Things) 和车载网络场景 [11, 62, 131, 138]。大语言模型不仅学习了恶意流量数据的特征 [10, 11, 138] 并捕捉了用户发起的行为中的异常 [24],还描述了入侵和异常行为的意图 [3, 10, 58]。此外,它们还可以为识别出的攻击类型提供相应的安全建议和响应策略 [37]。Liu 等人 [123] 提出了一种通过利用大语言模型提取恶意 URL 的层次特征来检测恶意 URL 行为的方法。他们的工作将大语言模型在入侵检测任务中的应用扩展到了用户层面,展示了大语言模型在入侵和异常检测任务中的通用性和有效性。
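
"提取 URL 的层次特征并交给模型判断"这一思路可以用下面的示意代码说明。`classify_with_llm` 中的判定规则只是占位(真实系统会把构造好的提示词发送给大语言模型),特征字段与后缀列表均为假设:

```python
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """可以写入提示词的层次化 URL 特征(示意)。"""
    p = urlparse(url)
    return {"scheme": p.scheme, "host": p.netloc, "path": p.path,
            "suspicious_tld": p.netloc.endswith((".xyz", ".top"))}

def classify_with_llm(url: str) -> str:
    prompt = f"该 URL 是否为恶意?特征: {url_features(url)}"
    # 占位:真实场景中将 prompt 发送给 LLM 并解析其回答
    return "malicious" if url_features(url)["suspicious_tld"] else "benign"

print(classify_with_llm("http://login-update.xyz/verify"))
print(classify_with_llm("https://example.com/docs"))
```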

Cyber threat analysis. In contemporary risk management strategies, Cyber Threat Intelligence (CTI) reporting plays a pivotal role, as evidenced by recent research [34]. With the continued surge in the volume of CTI reports, there is a growing need for automated tools to facilitate report generation. The application of LLMs in network threat analysis can be categorized into CTI generation and CTI analysis for decision-making. The emphasis on CTI generation varies, including extracting CTI from network security text information (such as books, blogs, news) [5], generating structured CTI reports from unstructured information [189], and generating CTI from network security entity graphs [162]. Aghaei et al.'s CVEDrill [4] can generate priority recommendation reports for potential cyber security threats and predict their impact. Additionally, Moskal et al. [140] explored the application of ChatGPT in assisting or automating response decision-making for threat behaviors, demonstrating the potential of LLMs in addressing simple network attack activities.

网络威胁分析。在当代风险管理策略中,网络威胁情报 (CTI) 报告扮演着关键角色,正如最近的研究 [34] 所证明的那样。随着 CTI 报告数量的持续激增,对自动化工具以促进报告生成的需求也在不断增加。大语言模型在网络威胁分析中的应用可分为 CTI 生成和 CTI 分析以支持决策。CTI 生成的重点各不相同,包括从网络安全文本信息(如书籍、博客、新闻)中提取 CTI [5],从非结构化信息生成结构化的 CTI 报告 [189],以及从网络安全实体图中生成 CTI [162]。Aghaei 等人的 CVEDrill [4] 可以为潜在网络安全威胁生成优先级推荐报告并预测其影响。此外,Moskal 等人 [140] 探索了 ChatGPT 在辅助或自动化威胁行为响应决策中的应用,展示了大语言模型在应对简单网络攻击活动中的潜力。
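
"从非结构化文本生成结构化 CTI"的核心,是让模型按固定模式输出 JSON。下面的草图用正则抽取来模拟模型的输出(仅为可离线运行的占位,字段名 `ips`、`cves` 为假设):

```python
import json
import re

def extract_cti(report: str) -> dict:
    prompt = ("请从下面的威胁报告中抽取 IP 与 CVE 编号,"
              "以含 'ips' 和 'cves' 两个键的 JSON 返回。\n\n" + report)
    # 占位:真实系统会把 prompt 发送给 LLM 并解析返回的 JSON
    return {
        "ips": re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", report),
        "cves": re.findall(r"CVE-\d{4}-\d{4,7}", report),
    }

report = "C2 traffic to 203.0.113.7 exploiting CVE-2021-44228."
print(json.dumps(extract_cti(report)))
```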

Penetration test. Conducting a controlled attack on a computer system to evaluate its security is the essence of penetration testing, which remains a pivotal approach utilized by organizations to bolster their defenses against cyber threats [183]. The general penetration testing process consists of three steps: information gathering, payload construction, and vulnerability exploitation. Temara [198] utilized LLMs to gather information for penetration testing, including the IP address, domain information, vendor technologies, SSL/TLS credentials, and other details of the target website. Sai Charan et al. [31] critically examined the capability of LLMs to generate malicious payloads for penetration testing, with results indicating that ChatGPT can generate more targeted and complex payloads for attackers. Happe et al. [74] developed an automated Linux privilege escalation guidance tool using LLMs. Additionally, the automated penetration testing tool PentestGPT [45], based on LLMs, achieved excellent performance on a penetration testing benchmark containing 13 scenarios and 182 subtasks by combining three self-interacting modules (inference, generation, and parsing modules).

渗透测试。对计算机系统进行受控攻击以评估其安全性是渗透测试的核心,这仍然是组织用来加强防御网络威胁的关键方法 [183]。一般的渗透测试过程包括三个步骤:信息收集、载荷构建和漏洞利用。Temara [198] 利用大语言模型收集渗透测试所需的信息,包括目标网站的 IP 地址、域名信息、供应商技术、SSL/TLS 凭证等详细信息。Sai Charan 等人 [31] 批判性地研究了大语言模型生成渗透测试恶意载荷的能力,结果表明 ChatGPT 可以为攻击者生成更具针对性和复杂性的载荷。Happe 等人 [74] 利用大语言模型开发了一种自动化的 Linux 权限提升指导工具。此外,基于大语言模型的自动化渗透测试工具 PentestGPT [45] 通过结合三个自交互模块(推理、生成和解析模块),在包含 13 个场景和 182 个子任务的渗透测试基准上取得了出色的性能。
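
"推理、生成、解析三模块自交互"的协作方式可以用下面的极简草图示意(并非 PentestGPT 的真实实现,三个函数均为占位,命令映射表为假设,实际系统中每个模块都是一次 LLM 调用):

```python
def reasoning_module(state: dict) -> str:
    """推理模块:根据当前任务列表选择下一步子任务。"""
    return state["todo"][0] if state["todo"] else "done"

def generation_module(task: str) -> str:
    """生成模块:为子任务产出具体命令(此处为固定映射占位)。"""
    commands = {"port_scan": "nmap -sV target",
                "dir_enum": "gobuster dir -u target"}
    return commands.get(task, "")

def parsing_module(output: str, state: dict) -> dict:
    """解析模块:把工具输出写回任务状态,推进任务树。"""
    state["findings"].append(output)
    state["todo"].pop(0)
    return state

state = {"todo": ["port_scan", "dir_enum"], "findings": []}
while state["todo"]:
    task = reasoning_module(state)
    cmd = generation_module(task)
    state = parsing_module(f"ran: {cmd}", state)
print(state["findings"])
```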

3.2 Application of LLMs in Software and System Security

3.2 大语言模型在软件和系统安全中的应用

This section explores the application of LLMs in the field of software and system security. LLMs excel in understanding user commands, inferring program control and data flow, and generating complex data structures [216]. The tasks include vulnerability detection, vulnerability repair, bug detection, bug repair, program fuzzing, reverse engineering and binary analysis, malware detection, and system log analysis.

本节探讨了大语言模型(LLM)在软件和系统安全领域的应用。大语言模型在理解用户命令、推断程序控制和数据流以及生成复杂数据结构方面表现出色[216]。其任务包括漏洞检测、漏洞修复、错误检测、错误修复、程序模糊测试、逆向工程和二进制分析、恶意软件检测以及系统日志分析。

Vulnerability detection. The escalation in software vulnerabilities is evident in the recent surge of vulnerability reports documented by Common Vulnerabilities and Exposures (CVEs) [14]. With this rise, the potential for cyber security attacks grows, posing significant economic and social risks. Hence, the detection of vulnerabilities becomes imperative to safeguard software systems and uphold social and economic stability. The method of utilizing LLMs for static vulnerability detection in code shows significant performance improvements compared to traditional approaches based on graph neural networks or matching rules [17, 36, 38, 40, 61, 98, 124, 168, 199, 203, 211, 238, 246]. The potential demonstrated by GPT series models in vulnerability detection tasks is particularly evident [38, 61, 98, 124, 204, 238]. However, LLMs may generate false positives when dealing with vulnerability detection tasks due to minor changes in function and variable names or modifications to library functions [203]. Liu et al. [121] proposed LATTE, which combines LLMs to achieve automated binary taint analysis. This overcomes the limitations of traditional taint analysis, which requires manual customization of taint propagation rules and vulnerability inspection rules. They discovered 37 new vulnerabilities in real firmware. Tihanyi et al. [200] used LLMs to generate a large-scale vulnerability-labeled dataset, FormAI, while also noting that over $50\%$ of the code generated by LLMs may contain vulnerabilities, posing a significant risk to software security.

漏洞检测。软件漏洞的增加在最近由通用漏洞披露 (CVE) [14] 记录的漏洞报告中表现得尤为明显。随着漏洞的增加,网络安全攻击的潜在风险也在增长,带来了重大的经济和社会风险。因此,漏洞检测变得至关重要,以保护软件系统并维护社会和经济的稳定。利用大语言模型进行代码静态漏洞检测的方法,相比基于图神经网络或匹配规则的传统方法,显示出显著的性能提升 [17, 36, 38, 40, 61, 98, 124, 168, 199, 203, 211, 238, 246]。GPT 系列模型在漏洞检测任务中表现出的潜力尤为明显 [38, 61, 98, 124, 204, 238]。然而,大语言模型在处理漏洞检测任务时,可能会因函数和变量名称的微小变化或库函数的修改而产生误报 [203]。Liu 等人 [121] 提出了 LATTE,结合大语言模型实现自动化的二进制污点分析。这克服了传统污点分析需要手动定制污点传播规则和漏洞检查规则的局限性。他们在真实固件中发现了 37 个新漏洞。Tihanyi 等人 [200] 使用大语言模型生成了一个大规模的漏洞标记数据集 FormAI,同时也指出大语言模型生成的代码中超过 $50\%$ 可能包含漏洞,对软件安全构成了重大风险。
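
基于提示词的静态漏洞检测,其基本形态是"代码片段进、CWE 判定出"。下面是一个示意性草图:`detect` 中的字符串规则只是占位判定(真实系统会把 `prompt` 发送给模型并解析其回答),CWE 编号映射为假设:

```python
def build_prompt(code: str) -> str:
    return "请判断以下 C 代码是否存在安全漏洞,并给出 CWE 编号:\n" + code

def detect(code: str) -> str:
    prompt = build_prompt(code)  # 真实场景:send_to_llm(prompt)
    if "strcpy(" in code:
        return "CWE-120"  # 缓冲区溢出(占位判定)
    return "no-vuln"

snippet = "void f(char *s){ char buf[8]; strcpy(buf, s); }"
print(detect(snippet))
```

正文提到的误报问题在这种形态下尤其直观:仅凭表面特征(函数名、库调用)判定,重命名或封装即可绕过或触发误报,因此 LATTE 等工作才引入污点分析等程序分析信号。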

Vulnerability repair. Due to the sharp increase in the number of detected vulnerabilities and the complexity of modern software systems, manually fixing security vulnerabilities is extremely time-consuming and labor-intensive for security experts [243]. Research shows that $50\%$ of vulnerabilities have a lifecycle exceeding 438 days [110]. Delayed vulnerability patching may result in ongoing attacks on software systems [118], causing economic losses to users. The T5 model based on the encoder-decoder architecture performs better in vulnerability repair tasks [65, 240]. Although LLMs can effectively generate fixes, challenges remain in maintaining the functional correctness of the repaired code [158], and they are susceptible to influences from different programming languages. For example, the current capabilities of LLMs in repairing Java vulnerabilities are limited [218]. Constructing a comprehensive vulnerability repair dataset and fine-tuning LLMs on it can significantly improve the model's performance in vulnerability repair tasks [65]. Alrashedy et al. [30] proposed an automated vulnerability repair tool driven by feedback from static analysis tools. Tol et al. [201] proposed a method called ZeroLeak, which utilizes LLMs to repair side-channel vulnerabilities in programs. Charalambous et al. [12] combined LLMs with Bounded Model Checking (BMC) to verify the effectiveness of repair solutions, addressing the problem of decreased functional correctness after using LLMs to repair vulnerabilities.

漏洞修复。由于检测到的漏洞数量急剧增加以及现代软件系统的复杂性,手动修复安全漏洞对安全专家来说极其耗时且费力 [243]。研究表明,$50\%$ 的漏洞生命周期超过 438 天 [110]。延迟修复漏洞可能导致软件系统持续受到攻击 [118],给用户造成经济损失。基于编码器-解码器架构的 T5 模型在漏洞修复任务中表现更好 [65, 240]。尽管大语言模型能够有效生成修复方案,但在保持功能正确性方面仍存在挑战 [158],并且它们容易受到不同编程语言的影响。例如,目前大语言模型在修复 Java 漏洞方面的能力有限 [218]。构建全面的漏洞修复数据集并在此基础上对大语言模型进行微调,可以显著提高模型在漏洞修复任务中的性能 [65]。Alrashedy 等人 [30] 提出了一种由静态分析工具反馈驱动的自动化漏洞修复工具。Tol 等人 [201] 提出了一种名为 ZeroLeak 的方法,利用大语言模型修复程序中的侧信道漏洞。Charalambous 等人 [12] 将大语言模型与有界模型检验 (BMC) 相结合,验证修复方案的有效性,解决了使用大语言模型修复漏洞后功能正确性下降的问题。
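
"生成补丁、用验证器(静态分析或 BMC)检查、不通过则带反馈重试"的迭代结构可以抽象为下面的草图。`llm_fix` 与 `verifier` 均为假设的占位实现(前者代替模型调用,后者代替真实的 BMC/静态分析器):

```python
def llm_fix(code: str, attempt: int) -> str:
    """占位:模拟模型在第二次尝试才给出可通过验证的补丁。"""
    return code.replace("strcpy", "strncpy") if attempt >= 2 else code

def verifier(code: str) -> bool:
    """占位验证器:不再出现 strcpy 即视为通过检查。"""
    return "strcpy" not in code

def repair(code: str, max_attempts: int = 3):
    """补丁生成-验证迭代循环;全部失败则返回 None。"""
    for attempt in range(1, max_attempts + 1):
        patched = llm_fix(code, attempt)
        if verifier(patched):
            return patched
    return None

print(repair("strcpy(buf, src);"))
```

Charalambous 等人 [12] 的思路即属此类:用 BMC 作为 `verifier`,把反例作为反馈拼回下一轮提示词,从而缓解修复后功能正确性下降的问题。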

Bug detection. Bugs typically refer to any small faults or errors present in software or hardware, which may cause programs to malfunction or produce unexpected results. Some bugs may be exploited by attackers to create security vulnerabilities. Therefore, bug detection is crucial for the security of software and systems. LLMs can be utilized to generate code lines and compare them with the original code to flag potential bugs within code snippets [7]. They can also combine feedback from static analysis tools to achieve precise bug localization [92, 111]. Fine-tuning techniques are crucial for bug detection tasks as well; applying fine-tuning allows LLMs to identify errors in code without relying on test cases [106, 227]. Additionally, Du et al. [54] and Li et al. [114] introduced the concept of contrastive learning, which focuses LLMs on the subtle differences between correct and buggy versions of code lines. Fang et al. [57] proposed a software-agnostic representation method called Represent Them All, based on contrastive learning and fine-tuning modules, suitable for various downstream tasks including bug detection and predicting the priority and severity of bugs.

Bug 检测。Bugs 通常指软件或硬件中存在的任何小故障或错误,可能导致程序运行异常或产生意外结果。一些 bugs 可能被攻击者利用来制造安全漏洞。因此,bug 检测对于软件和系统的安全至关重要。大语言模型 (LLM) 可用于生成代码行并与原始代码进行比较,以标记代码片段中的潜在 bugs [7]。它们还可以结合静态分析工具的反馈来实现精确的 bug 定位 [92, 111]。微调技术对于 bug 检测任务也至关重要,应用微调可以使大语言模型在不依赖测试用例的情况下识别代码中的错误 [106, 227]。此外,Du 等人 [54] 和 Li 等人 [114] 引入了对比学习的概念,它使大语言模型专注于正确和错误代码行之间的细微差异。Fang 等人 [57] 提出了一种名为 Represent Them All 的与软件无关的表示方法,基于对比学习和微调模块,适用于各种下游任务,包括 bug 检测和预测 bug 的优先级和严重性。

Bug repair. LLMs possess robust code generation capabilities, and their utilization in engineering for code generation can significantly enhance efficiency. However, code produced by LLMs often carries increased security risks, such as bugs and vulnerabilities [163]. These program bugs can lead to persistent security vulnerabilities. Hence, automating the process of bug fixing is imperative, involving the use of automation technology to analyze flawed code and generate accurate patches to rectify identified issues. LLMs like CodeBERT [88, 105, 222, 241], CodeT5 [88, 197, 209], Codex [56, 92, 223], LLaMa [197], CodeLLaMa [147, 188], CodeGEN [223], UniXcoder [241], T5 [234], PLBART [88], and GPT Series [147, 197, 223, 224, 241, 242, 244] have showcased effectiveness in generating syntactically accurate and contextually relevant code. This includes frameworks with encoder-decoder architecture like Repilot [214], tailored specifically for producing repair patches. Utilizing LLMs for program repair can achieve competitive performance in producing patches for various types of errors and defects [224]. These models effectively capture the underlying semantics and dependencies in code, resulting in precise and efficient patches. Moreover, fine-tuning LLMs on specific code repair datasets can further improve their ability to generate high-quality patches for real-world software projects. Integrating LLMs into program repair not only speeds up the error-fixing process but also allows software developers to focus on more complex tasks, thereby enhancing the reliability and maintainability of the software [223]. As demonstrated in the case of ChatGPT, integrating interactive feedback loops notably enhances the accuracy of program repairs [223]. This iterative process of patch generation and validation fosters a nuanced comprehension of software semantics, thereby resulting in more impactful fixes.
By integrating domain-specific knowledge and technologies with the capabilities of LLMs, their performance is further enhanced. Custom prompts, fine-tuning for specific tasks, retrieving external data, and utilizing static analysis tools [65, 92, 197, 221, 240] significantly improve the effectiveness of bug fixes driven by LLMs.

Bug修复。LLMs具备强大的代码生成能力,在工程中利用它们进行代码生成可以显著提高效率。然而,LLMs生成的代码通常携带更高的安全风险,例如漏洞和缺陷 [163]。这些程序漏洞可能导致持久的安全漏洞。因此,自动化修复漏洞的过程势在必行,这涉及使用自动化技术分析有缺陷的代码并生成准确的补丁来修复已识别的问题。像CodeBERT [88, 105, 222, 241]、CodeT5 [88, 197, 209]、Codex [56, 92, 223]、LLaMa [197]、CodeLLaMa [147, 188]、CodeGEN [223]、UniXcoder [241]、T5 [234]、PLBART [88]和GPT系列 [147, 197, 223, 224, 241, 242, 244]等LLMs在生成语法正确且上下文相关的代码方面展示了有效性。这包括专门用于生成修复补丁的具有编码器-解码器架构的框架,如Repilot [214]。利用LLMs进行程序修复可以在生成各种类型错误和缺陷的补丁方面达到有竞争力的性能 [224]。这些模型有效捕捉了代码中的底层语义和依赖关系,从而生成精确且高效的补丁。此外,在特定代码修复数据集上微调LLMs可以进一步提高它们为现实世界软件项目生成高质量补丁的能力。将LLMs集成到程序修复中不仅加快了错误修复过程,还使软件开发人员能够专注于更复杂的任务,从而提高软件的可靠性和可维护性 [223]。正如ChatGPT的案例所示,当与交互式反馈循环结合时,显著提高了程序修复的准确性 [223]。这种补丁生成和验证的迭代过程促进了对软件语义的细致理解,从而产生更有影响力的修复。通过将领域特定知识和技术与LLMs的能力相结合,进一步提升了它们的性能。自定义提示、特定任务微调、检索外部数据以及使用静态分析工具 [65, 92, 197, 221, 240]显著提高了LLMs驱动的漏洞修复效果。

Program fuzzing. Fuzz testing, or fuzzing, refers to an automated testing method aimed at generating inputs to uncover unforeseen behaviors. Both researchers and practitioners have effectively developed practical fuzzing tools, demonstrating significant success in detecting numerous bugs and vulnerabilities within real-world systems [22]. The generation capability of LLMs enables testing against various input program languages and different features [46, 220], effectively overcoming the limitations of traditional fuzz testing methods. Under strategies such as repetitive querying, example querying, and iterative querying [237], LLMs can significantly enhance the generation effectiveness of test cases. LLMs can generate test cases that trigger vulnerabilities from historical bug reports of programs [47], produce test cases similar but different from sample inputs [85], analyze compiler source code to generate programs that trigger specific optimizations [228], and split testing requirements and test case generation using a dual-model interaction framework, assigning them to different LLMs for processing.

程序模糊测试。模糊测试(Fuzzing)是一种自动化测试方法,旨在生成输入以发现未预见的行为。研究人员和实践者已经有效地开发了实用的模糊测试工具,并在检测现实世界系统中的众多错误和漏洞方面取得了显著成功 [22]。大语言模型的生成能力使其能够针对各种输入程序语言和不同特性进行测试 [46, 220],有效克服了传统模糊测试方法的局限性。在重复查询、示例查询和迭代查询等策略下 [237],大语言模型可以显著提高测试用例的生成效果。大语言模型可以从程序的历史错误报告中生成触发漏洞的测试用例 [47],生成与样本输入相似但不同的测试用例 [85],分析编译器源代码以生成触发特定优化的程序 [228],并使用双模型交互框架将测试需求和测试用例生成分开,分配给不同的大语言模型处理。
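
迭代式询问策略的骨架非常简单:把上一轮生成的用例(连同反馈)再次喂给模型,逐轮扩大输入多样性。下面的草图中 `mutate` 是模型调用的占位实现(真实系统会把种子与覆盖率反馈写入提示词):

```python
def mutate(seed: str, round_no: int) -> str:
    """占位:真实系统把 seed 与反馈放入提示词后再次询问 LLM。"""
    return seed + str(round_no)

def iterative_fuzz(seed: str, rounds: int = 3) -> list[str]:
    """迭代询问:每轮的输出成为下一轮的输入。"""
    cases, current = [], seed
    for r in range(1, rounds + 1):
        current = mutate(current, r)
        cases.append(current)
    return cases

print(iterative_fuzz("{}"))
```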

Reverse engineering and binary analysis. Reverse engineering is the process of attempting to understand how existing artifacts work, whether for malicious purposes or defensive purposes, and it holds significant security implications. The capability of LLMs to recognize software functionality and extract important information enables them to perform certain reverse engineering steps [159]. For example, Xu et al. [226] achieved recovery of variable names from binary files by propagating LLM query results through multiple rounds. Armengol-Estape et al. [15] combined type inference engines with LLMs to perform disassembly of executable files and generate program source code. LLMs can also be used to assist in binary program analysis. Sun et al. [193] proposed DexBert for characterizing Android system binary bytecode. Pei et al. [160] preserved the semantic symmetry of code based on group theory, resulting in their binary analysis framework SYMC demonstrating outstanding generalization and robustness in various binary analysis tasks. Song et al. [191] utilized LLMs to address authorship analysis issues in software engineering, effectively applying them to real-world APT malicious software for organization-level verification. Some studies [86] apply LLMs to enhance the readability and usability of decompiler outputs, thereby assisting reverse engineers in better understanding binary files.

逆向工程与二进制分析。逆向工程是试图理解现有工件如何运作的过程,无论是出于恶意目的还是防御目的,它都具有重大的安全意义。大语言模型识别软件功能和提取重要信息的能力使它们能够执行某些逆向工程步骤 [159]。例如,Xu 等人 [226] 通过多轮传播大语言模型的查询结果,实现了从二进制文件中恢复变量名称。Armengol-Estape 等人 [15] 将类型推断引擎与大语言模型结合,执行可执行文件的反汇编并生成程序源代码。大语言模型还可用于辅助二进制程序分析。Sun 等人 [193] 提出了 DexBert 用于表征 Android 系统二进制字节码。Pei 等人 [160] 基于群论保留了代码的语义对称性,使他们的二进制分析框架 SYMC 在各种二进制分析任务中表现出出色的泛化性和鲁棒性。Song 等人 [191] 利用大语言模型解决软件工程中的作者分析问题,有效地将其应用于现实世界中的 APT 恶意软件进行组织级验证。一些研究 [86] 应用大语言模型来提高反编译器输出的可读性和可用性,从而帮助逆向工程师更好地理解二进制文件。
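
"多轮传播查询结果以恢复变量名"的思路可以这样示意:每一轮用已恢复的名字更新上下文,再为剩余变量征求命名。`suggest_name` 是模型调用的占位实现,提示表 `hints` 为假设,与所引工作的真实实现无关:

```python
def suggest_name(var: str, context: str) -> str:
    """占位:根据上下文中的 API 调用为反编译变量起有含义的名字。"""
    hints = {"strlen": "length", "fopen": "file_handle"}
    for api, name in hints.items():
        if api in context:
            return name
    return var

def propagate(uses: dict, rounds: int = 2) -> dict:
    """多轮传播:上一轮的命名结果进入下一轮的查询上下文。"""
    names = {v: v for v in uses}
    for _ in range(rounds):
        for v, ctx in uses.items():
            names[v] = suggest_name(names[v], ctx)
    return names

uses = {"v1": "v1 = strlen(s);", "v2": "v2 = fopen(path);"}
print(propagate(uses))
```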

Malware detection. Due to the rising volume and intricacy of malware, detecting malicious software has emerged as a significant concern. While conventional detection techniques rely on signatures and heuristics, they exhibit limited effectiveness against unknown attacks and are susceptible to evasion through obfuscation techniques [20]. LLMs can extract semantic features of malware, leading to more competitive performance. AVScan2Vec, proposed by Joyce et al. [93], transforms antivirus scan reports into vector representations, effectively handling large-scale malware datasets and performing well in tasks such as malware classification, clustering, and nearest neighbor search. Botacin [23] explored the application of LLMs in malware defense from the perspective of malware generation. While LLMs cannot directly generate complete malware based on simple instructions, they can generate building blocks of malware and successfully construct various malware variants by blending different functionalities and categories. This provides a new perspective for malware detection and defense.

恶意软件检测。随着恶意软件数量和复杂性的增加,检测恶意软件已成为一个重要问题。传统的检测技术依赖于签名和启发式方法,但它们在应对未知攻击时效果有限,并且容易通过混淆技术逃避检测 [20]。大语言模型可以提取恶意软件的语义特征,从而获得更具竞争力的性能。Joyce 等人 [93] 提出的 AVScan2Vec 将反病毒扫描报告转换为向量表示,有效处理大规模恶意软件数据集,并在恶意软件分类、聚类和最近邻搜索等任务中表现良好。Botacin [23] 从恶意软件生成的角度探讨了大语言模型在恶意软件防御中的应用。虽然大语言模型无法根据简单的指令直接生成完整的恶意软件,但它们可以生成恶意软件的构建块,并通过混合不同功能和类别成功构建各种恶意软件变体。这为恶意软件检测和防御提供了新的视角。
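
"把反病毒扫描报告转为向量再做最近邻检索"的目标可以用一个极简版本体会:下面用词袋代替 AVScan2Vec 学习到的嵌入(仅为示意,引擎名与标签均为虚构样例),同族样本的报告向量应当彼此更近:

```python
import math
from collections import Counter

def report_to_vec(report: dict) -> Counter:
    """把 {引擎: 检测标签} 形式的扫描报告拆词后计入词袋。"""
    tokens = []
    for engine, label in report.items():
        tokens += label.lower().replace("/", ".").split(".")
    return Counter(tokens)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

a = report_to_vec({"e1": "Trojan.Emotet", "e2": "Win32/Emotet"})
b = report_to_vec({"e1": "Trojan.Emotet"})
c = report_to_vec({"e1": "Adware.Generic"})
print(cosine(a, b) > cosine(a, c))  # 同族报告更相近
```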

System log analysis. Analyzing the growing amount of log data generated by software-intensive systems manually is infeasible due to its sheer volume. Numerous deep learning approaches have been suggested for detecting anomalies in log data. These approaches encounter various challenges, including dealing with high-dimensional and noisy log data, addressing class imbalances, and achieving generalization [89]. Nowadays, researchers are utilizing the language understanding capabilities of LLMs to identify and analyze anomalies in log data. Compared to traditional deep learning methods, LLMs demonstrate outstanding performance and good interpretability [166, 185]. Fine-tuning LLMs for specific types of logs [97] or using reinforcement learning-based fine-tuning strategies [72] can significantly enhance their performance in log analysis tasks. LLMs are also being employed for log analysis in cloud servers [39, 119], where their reasoning abilities can be combined with server logs to infer the root causes of cloud service incidents.

系统日志分析。由于软件密集型系统生成的日志数据量庞大,手动分析这些不断增长的日志数据是不可行的。许多深度学习方法已被提出用于检测日志数据中的异常。这些方法面临各种挑战,包括处理高维和噪声日志数据、解决类别不平衡问题以及实现泛化 [89]。如今,研究人员正在利用大语言模型的语言理解能力来识别和分析日志数据中的异常。与传统的深度学习方法相比,大语言模型展现出卓越的性能和良好的可解释性 [166, 185]。通过对大语言模型进行特定类型日志的微调 [97] 或使用基于强化学习的微调策略 [72],可以显著提升其在日志分析任务中的性能。大语言模型还被用于云服务器的日志分析 [39, 119],其推理能力可以与服务器日志结合,推断云服务事件的根本原因。
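
最常见的做法之一是少样本提示:把若干带标签的日志示例拼进提示词,再让模型给新日志行打标签。下面的草图中 `classify` 的判定规则仅为占位(真实系统会把 `build_prompt` 的结果发送给模型),示例日志为虚构样例:

```python
FEW_SHOT = [
    ("Accepted password for root", "normal"),
    ("Failed password for invalid user admin", "anomalous"),
]

def build_prompt(line: str) -> str:
    """把少样本示例与待判定日志拼成提示词。"""
    shots = "\n".join(f"日志: {l}\n标签: {y}" for l, y in FEW_SHOT)
    return f"{shots}\n日志: {line}\n标签:"

def classify(line: str) -> str:
    _ = build_prompt(line)  # 占位:真实场景中发送给 LLM
    return "anomalous" if "Failed" in line or "invalid" in line else "normal"

print(classify("Failed password for invalid user guest"))
```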

3.3 Application of LLMs in Information and Content Security

3.3 大语言模型在信息和内容安全中的应用

This section explores the application of LLMs in the field of information and content security. The tasks include phishing and scam detection, harmful content detection, steganography, access control, and forensics.

本节探讨大语言模型在信息和内容安全领域的应用,涵盖钓鱼和诈骗、有害内容、隐写术 (steganography)、访问控制和取证等任务。

Phishing and scam detection. Network deception is a deliberate act of introducing false or misleading content into a network system, threatening the personal privacy and property security of users. Emails, short message service (SMS), and web advertisements are leveraged by attackers to entice users and steer them towards phishing sites, enticing them to click on malicious links [196]. LLMs can generate deceptive or false information on a large scale under specific prompts [172], making them useful for automated phishing email generation [77, 176], but compared to manual design methods, phishing emails generated by LLMs have lower click-through rates [77]. LLMs can achieve phishing email detection through prompts based on website information [100] or fine-tuning for specific email features [139, 176]. Spam often contains a large number of phishing emails. Labonne et al.'s research [102] has demonstrated the effectiveness of LLMs in spam email detection, showing significant advantages over traditional machine learning methods. An interesting study [28] suggests that LLMs can mimic real human interactions with scammers in an automated and meaningless manner, thereby wasting scammers' time and resources and alleviating the nuisance of scam emails.

钓鱼和诈骗检测。网络欺骗是故意在网络系统中引入虚假或误导性内容的行为,威胁用户的个人隐私和财产安全。攻击者利用电子邮件、短信服务(SMS)和网页广告引诱用户进入钓鱼网站,诱使他们点击恶意链接[196]。大语言模型可以在特定提示下大规模生成欺骗性或虚假信息[172],使其在自动生成钓鱼邮件方面具有应用价值[77, 176],但与手动设计方法相比,大语言模型生成的钓鱼邮件点击率较低[77]。大语言模型可以通过基于网站信息的提示[100]或针对特定邮件特征的微调[139, 176]来实现钓鱼邮件检测。垃圾邮件中通常包含大量钓鱼邮件。Labonne等人的研究[102]表明,大语言模型在垃圾邮件检测方面具有显著优势,优于传统的机器学习方法。一项有趣的研究[28]指出,大语言模型可以以自动化和无意义的方式模仿真实人类与诈骗者的互动,从而浪费诈骗者的时间和资源,减轻诈骗邮件的困扰。
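
基于提示词的钓鱼邮件检测,框架上就是"整封邮件进提示词、模型给出判定"。下面的草图用两条占位规则(催促性措辞、含链接)代替模型判断,仅为示意,邮件样例为虚构:

```python
def is_phishing(subject: str, body: str) -> bool:
    prompt = f"判断该邮件是否为钓鱼邮件。\n主题: {subject}\n正文: {body}"
    # 占位判定:真实系统会把 prompt 交给 LLM;此处用常见钓鱼特征近似
    urgent = any(w in subject.lower() for w in ("urgent", "verify", "suspended"))
    has_link = "http" in body
    return urgent and has_link

print(is_phishing("URGENT: verify your account", "Click http://bad.example"))
print(is_phishing("Team lunch on Friday", "See you at noon."))
```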

Harmful contents detection. Social media platforms frequently face criticism for amplifying political polarization and deteriorating public discourse. Users often contribute harmful content that reflects their political beliefs, thereby intensifying contentious and toxic discussions or participating in harmful behavior [215]. The application of LLMs in detecting harmful content can be divided into three aspects: detection of extreme political stances [73, 135], tracking of criminal activity discourse [83], and identification of social media bots [27]. LLMs tend to express attitudes consistent with the values encoded in the programming when faced with political discourse, indicating the complexity and limitations of LLMs in handling social topics [75]. Hartvigsen et al. [132] generated a large-scale dataset of harmful and benign discourse targeting 13 minority groups using LLMs. Through validation, it was found that human annotators struggled to distinguish between LLM-generated and human-written discourse, advancing efforts in filtering and combating harmful contents.

有害内容检测。社交媒体平台经常因加剧政治极化和恶化公共讨论而受到批评。用户常常发布反映其政治信仰的有害内容,从而加剧争议性和有毒的讨论,或参与有害行为 [215]。大语言模型在检测有害内容方面的应用可以分为三个方面:极端政治立场的检测 [73, 135]、犯罪活动言论的追踪 [83] 以及社交媒体机器人的识别 [27]。大语言模型在面对政治话语时,往往会表达与编程中编码的价值观一致的态度,这表明大语言模型在处理社会话题时的复杂性和局限性 [75]。Hartvigsen 等人 [132] 利用大语言模型生成了一个针对 13 个少数群体的有害和良性言论的大规模数据集。通过验证发现,人类标注者难以区分大语言模型生成的言论和人类撰写的言论,这推动了过滤和打击有害内容的努力。

Steganography. Steganography, as discussed in Anderson's work [13], focuses on embedding confidential data within ordinary information carriers without alerting third parties, thereby safeguarding the secrecy and security of the concealed information. Wang et al. [207] introduced a method for linguistic steganalysis using LLMs based on few-shot learning principles, aiming to overcome the limited availability of labeled data by incorporating a small set of labeled samples along with auxiliary unlabeled samples to improve the efficiency of linguistic steganalysis. This approach significantly improves the detection capability of existing methods in scenarios with few samples. Bauer et al. [18] used the GPT-2 model to encode ciphertext into natural language cover texts, allowing users to control the observable format of the ciphertext for covert information transmission on public platforms.

隐写术。隐写术,如 Anderson 的研究 [13] 所述,专注于将机密数据嵌入到普通信息载体中,而不引起第三方的注意,从而保护隐藏信息的机密性和安全性。Wang 等人 [207] 提出了一种基于少样本学习原则,利用大语言模型进行语言隐写分析的方法,旨在克服标记数据有限的挑战,通过引入少量标记样本和辅助的未标记样本来提高语言隐写分析的效率。这种方法在样本较少的情况下显著提升了现有方法的检测能力。Bauer 等人 [18] 使用 GPT-2 模型将密文编码为自然语言覆盖文本,使用户能够控制密文的可观察格式,以便在公共平台上进行隐蔽信息传输。
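
"把密文编码进自然语言载体"的最小化示意如下:用比特决定同义词的取舍。真实系统(如基于 GPT-2 的方案)按模型的词分布做算术编码,容量和隐蔽性远高于此;这里的同义词表 `PAIRS` 纯属假设:

```python
# 每个位置提供一对同义词,比特 0/1 决定选哪一个
PAIRS = [("big", "large"), ("fast", "quick"), ("happy", "glad")]

def encode(bits: str) -> str:
    """把比特串编码为一句封面文本(容量 = len(PAIRS) 比特)。"""
    words = [PAIRS[i][int(b)] for i, b in enumerate(bits)]
    return " ".join(words)

def decode(text: str) -> str:
    """从封面文本的同义词选择中恢复比特串。"""
    return "".join(str(PAIRS[i].index(w)) for i, w in enumerate(text.split()))

cover = encode("101")
print(cover, decode(cover))
```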

Access control. Access control aims to restrict the actions or operations permissible for a legitimate user of a computer system [180], with passwords serving as the fundamental component for its implementation. Despite the proliferation of alternative technologies, passwords continue to dominate as the preferred authentication mechanism [156]. PassGPT, a password generation model leveraging LLMs, introduces guided password generation, wherein PassGPT's sampling process generates passwords adhering to user-defined constraints. This approach outperforms existing methods utilizing Generative Adversarial Networks (GANs) by producing a larger set of previously unseen passwords, thereby demonstrating the effectiveness of LLMs in improving existing password strength estimators [173].

访问控制。访问控制旨在限制计算机系统的合法用户可以执行的动作或操作 [180],密码是其实现的基础组成部分。尽管各种替代技术不断涌现,密码仍然是最主要的身份验证机制 [156]。PassGPT 是一种利用大语言模型的密码生成模型,引入了引导式密码生成:PassGPT 的采样过程可以生成符合用户定义约束的密码。该方法生成了更多以前未见过的密码,优于现有的基于生成对抗网络 (GAN) 的方法,从而证明了大语言模型在改进现有密码强度评估器方面的有效性 [173]。

Forensics. Digital forensics plays a pivotal role in the successful prosecution of cyber criminals whose activities involve a wide array of digital devices. The evidence retrieved through digital forensic investigations must be admissible in a court of law [184]. Scanlon and colleagues [182] delved into the potential application of LLMs within the field of digital forensics. Their exploration encompassed an assessment of LLM performance across various digital forensic scenarios, including file identification, evidence retrieval, and incident response. Their findings led to the conclusion that while LLMs currently lack the capability to function as standalone digital forensic tools, they can nonetheless serve as supplementary aids in select cases.

取证。数字取证在成功起诉涉及各种数字设备的网络犯罪分子方面发挥着关键作用。通过数字取证调查获取的证据必须在法庭上可采信 [184]。Scanlon 及其同事 [182] 探讨了大语言模型在数字取证领域的潜在应用。他们的研究涵盖了大语言模型在不同数字取证场景中的表现评估,包括文件识别、证据检索和事件响应。他们得出结论:尽管大语言模型目前尚无法作为独立的数字取证工具使用,但在某些情况下仍可作为辅助工具。

3.4 Application of LLMs in Hardware Security

3.4 大语言模型在硬件安全中的应用

Modern computing systems are built on System-on-Chip (SoC) architectures because they achieve high levels of integration by using multiple Intellectual Property (IP) cores. However, this also brings about new security challenges, as a vulnerability in one IP core could affect the security of the entire system. While software and firmware patches can address many hardware security vulnerabilities, some vulnerabilities cannot be patched, and extensive security assurances are required during the design process [49]. This section explores the application of LLMs in the field of hardware security. The tasks include hardware vulnerability detection and hardware vulnerability repair.

现代计算系统基于片上系统 (SoC) 架构,因为它们通过使用多个知识产权 (IP) 核实现了高度集成。然而,这也带来了新的安全挑战,因为一个 IP 核中的漏洞可能会影响整个系统的安全性。虽然软件和固件补丁可以解决许多硬件安全漏洞,但有些漏洞无法修补,因此在设计过程中需要广泛的安全保证 [49]。本节探讨了大语言模型在硬件安全领域的应用。其任务包括硬件漏洞检测和硬件漏洞修复。

Hardware vulnerability detection. LLMs can extract security properties from hardware development documents. Meng et al. [134] trained HS-BERT on hardware architecture documents such as RISC-V, OpenRISC, and MIPS, and identified 8 security vulnerabilities in the design of the OpenTitan SoC. Additionally, Paria et al. [155] used LLMs to identify security vulnerabilities from user-defined SoC specifications, map them to relevant CWEs, generate corresponding assertions, and take security measures by executing security policies.

硬件漏洞检测。大语言模型可以从硬件开发文档中提取安全属性。Meng 等人 [134] 在 RISC-V、OpenRISC 和 MIPS 等硬件架构文档上训练了 HS-BERT,并识别出 OpenTitan SoC 设计中的 8 个安全漏洞。此外,Paria 等人 [155] 使用大语言模型从用户定义的 SoC 规范中识别安全漏洞,将其映射到相关的 CWE,生成相应的断言,并通过执行安全策略来采取安全措施。
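
"从规范描述中识别弱点、映射到 CWE 并生成断言"的流水线可以这样示意。`CWE_MAP` 与断言模板均为占位假设(真实系统由 LLM 完成识别与映射,并生成可用于验证的 SVA 断言):

```python
# 占位的弱点描述到 CWE 的映射表(真实系统由 LLM 判定)
CWE_MAP = {"debug port left enabled": "CWE-1191",
           "unencrypted key storage": "CWE-311"}

def analyze_spec(spec: str) -> list:
    """在规范文本中匹配弱点描述,产出 CWE 与占位断言。"""
    findings = []
    for desc, cwe in CWE_MAP.items():
        if desc in spec.lower():
            findings.append({
                "cwe": cwe,
                # 占位的类 SVA 断言模板
                "assertion": f"assert property (!({desc.replace(' ', '_')}));",
            })
    return findings

spec = "The SoC ships with the debug port left enabled after boot."
print(analyze_spec(spec)[0]["cwe"])
```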

Hardware vulnerability repair. LLMs have found application within the integrated System-on-Chip (SoC) security verification paradigm, showcasing potential in addressing diverse hardware-level security tasks such as vulnerability insertion, security assessment, verification, and the development of mitigation strategies [179]. By leveraging hardware vulnerability information, LLMs offer advice on vulnerability repair strategies, thereby improving the efficiency and accuracy of hardware vulnerability analysis and mitigation efforts [116]. In their study, Nair and colleagues [144] demonstrated that LLMs can generate hardware-level security vulnerabilities during hardware code generation and explored their utility in generating secure hardware code. They successfully produced secure hardware code for 10 Common Weakness Enumerations (CWEs) at the hardware design level. Additionally, Tan et al. [8] curated a comprehensive corpus of hardware security vulnerabilities and evaluated the performance of LLMs in automating the repair of hardware vulnerabilities based on this corpus.

硬件漏洞修复。大语言模型在片上系统 (SoC) 安全验证范式中找到了应用,展示了在解决多种硬件级安全任务(如漏洞插入、安全评估、验证以及缓解策略开发)方面的潜力 [179]。通过利用硬件漏洞信息,大语言模型提供了漏洞修复策略的建议,从而提高了硬件漏洞分析和缓解工作的效率和准确性 [116]。Nair 及其同事 [144] 在他们的研究中证明,大语言模型可以在硬件代码生成过程中生成硬件级安全漏洞,并探索了它们在生成安全硬件代码中的实用性。他们成功地为硬件设计级别的 10 个常见弱点枚举 (CWE) 生成了安全硬件代码。此外,Tan 等人 [8] 整理了一个全面的硬件安全漏洞语料库,并基于该语料库评估了大语言模型在自动化硬件漏洞修复中的表现。

3.5 Application of LLMs in Blockchain Security

3.5 大语言模型在区块链安全中的应用

This section explores the application of LLMs in the field of blockchain security. The tasks include smart contract security and transaction anomaly detection.

本节探讨大语言模型在区块链安全领域的应用,包括智能合约安全和交易异常检测任务。

Smart contract security. With the advancement of blockchain technology, smart contracts have emerged as a pivotal element in blockchain applications [251]. Despite their significance, the development of smart contracts can introduce vulnerabilities that pose potential risks such as financial losses. While LLMs offer automation for detecting vulnerabilities in smart contracts, the detection outcomes often exhibit a high rate of false positives [32, 42]. Performance varies across different vulnerability types and is constrained by the contextual length of LLMs [32]. GPTLENS [87] divides the detection process of smart contract vulnerabilities into two phases: generation and discrimination. During the generation phase, diverse vulnerability responses are generated, and in the discrimination phase, these responses are evaluated and ranked to mitigate false positives. Sun and colleagues [194] integrated LLMs and program analysis to identify logical vulnerabilities in smart contracts, breaking down logical vulnerability categories into scenarios and attributes. They utilized LLMs to match potential vulnerabilities and further integrated static confirmation to validate the findings of LLMs.

智能合约安全。随着区块链技术的发展,智能合约已成为区块链应用中的关键要素[251]。尽管其重要性不言而喻,但智能合约的开发可能会引入漏洞,带来诸如财务损失等潜在风险。虽然大语言模型能够自动化检测智能合约中的漏洞,但检测结果通常表现出较高的误报率[32, 42]。不同漏洞类型的检测性能各异,并受限于大语言模型的上下文长度[32]。GPTLENS[87]将智能合约漏洞的检测过程分为两个阶段:生成和判别。在生成阶段,生成多样化的漏洞响应;在判别阶段,对这些响应进行评估和排序,以减少误报。Sun及其团队[194]结合大语言模型和程序分析来识别智能合约中的逻辑漏洞,将逻辑漏洞类别分解为场景和属性。他们利用大语言模型匹配潜在漏洞,并进一步结合静态验证来确认大语言模型的发现。
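The two-phase generate-then-discriminate idea can be sketched in a few lines of Python. This is a minimal illustration of the pattern, not the actual GPTLens implementation: the `llm` callable, the prompt wording, and the 0-10 scoring scale are all assumptions. The generation phase collects candidate vulnerabilities from several "auditor" queries; the discrimination phase scores each candidate with a "critic" prompt and keeps only the top-ranked ones, which is how the false-positive rate is reduced.

```python
def generate_candidates(llm, contract_src, n_auditors=3):
    """Generation phase: several 'auditor' queries propose candidate vulnerabilities."""
    prompt = (
        "You are a smart-contract auditor. List possible vulnerabilities "
        "in the following Solidity code, one per line:\n" + contract_src
    )
    candidates = []
    for _ in range(n_auditors):  # diverse sampling; here, repeated queries
        candidates.extend(llm(prompt).splitlines())
    return [c.strip() for c in candidates if c.strip()]

def discriminate(llm, contract_src, candidates, top_k=2):
    """Discrimination phase: a 'critic' prompt scores each candidate;
    low-scoring (likely false-positive) claims are dropped."""
    scored = []
    for cand in dict.fromkeys(candidates):  # de-duplicate, preserving order
        prompt = (
            "Rate 0-10 how likely this vulnerability is real for the code "
            f"below. Answer with a number only.\nClaim: {cand}\n{contract_src}"
        )
        scored.append((float(llm(prompt)), cand))
    scored.sort(reverse=True)
    return [cand for score, cand in scored[:top_k]]
```

In practice `llm` would wrap an API call; swapping in a stub that returns canned strings is enough to exercise the ranking logic.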

Transaction anomaly detection. Due to the limitations of the search space and the significant manual analysis required, real-time intrusion detection systems for blockchain transactions remain challenging. Traditional methods primarily employ reward-based approaches, focusing on identifying and exploiting profitable transactions, or pattern-based techniques relying on custom rules to infer the intent of blockchain transactions and user address behavior [175, 217]. However, these methods may not accurately capture all anomalies. Therefore, more general and adaptable LLM technology can be applied to effectively identify various abnormal transactions in real-time. Gai et al. [66] apply LLMs to dynamically and in real-time detect anomalies in blockchain transactions. Due to its unrestricted search space and independence from predefined rules or patterns, it enables the detection of a wider range of transaction anomalies.

交易异常检测。由于搜索空间的限制和大量手动分析的需求,区块链交易的实时入侵检测系统仍然具有挑战性。传统方法主要采用基于奖励的方法,专注于识别和利用有利可图的交易,或基于模式的技术,依赖自定义规则来推断区块链交易和用户地址行为的意图 [175, 217]。然而,这些方法可能无法准确捕获所有异常。因此,更通用和适应性强的大语言模型技术可以应用于实时有效地识别各种异常交易。Gai 等人 [66] 应用大语言模型动态实时检测区块链交易中的异常。由于其不受限制的搜索空间和独立于预定义规则或模式,它能够检测到更广泛的交易异常。

RQ1 - Summary

RQ1 - 总结

(1) We have divided cyber security tasks into five domains: software and system security, network security, information and content security, hardware security, and blockchain security. We have summarized the specific applications of LLMs in these domains.

我们将网络安全任务分为五个领域:软件与系统安全、网络安全、信息与内容安全、硬件安全以及区块链安全。我们总结了大语言模型在这些领域的具体应用。

(2) We discussed 21 cyber security tasks and found that LLMs are most widely applied in the field of software and system security, with 76 papers covering 8 tasks. Only 5 papers addressed the least applied domain, blockchain security.

我们讨论了21项网络安全任务,发现大语言模型在软件和系统安全领域的应用最为广泛,共有76篇论文涵盖了8项任务。仅有5篇论文提到了应用最少的领域——区块链安全。

4 RQ2: WHAT LLMS HAVE BEEN EMPLOYED TO SUPPORT CYBER SECURITY TASKS?

4 RQ2: 有哪些大语言模型被用于支持网络安全任务?

4.1 Architecture of LLMs in Cyber security

4.1 大语言模型在网络安全中的架构

Pre-trained Language Models (PLMs) have exhibited impressive capabilities across various NLP tasks [101, 136, 186, 212, 248]. Researchers have noted substantial improvements in their performance as model size increases, with surpassing certain parameter thresholds leading to significant performance gains [79, 186]. The term "Large Language Model" (LLM) distinguishes language models based on the size of their parameters, specifically referring to large-sized PLMs [136, 248]. However, there is no formal consensus in the academic community regarding the minimum parameter size for LLMs, as model capacity is intricately linked to training data size and overall computational resources [96]. In this study, we adopt the LLM categorization framework introduced by Pan et al. [154], which classifies the predominant LLMs explored in our research into three architectural categories: encoder-only, encoder-decoder, and decoder-only. We also considered whether the related models are open-source. Open-source models offer higher flexibility and can acquire new knowledge through fine-tuning on specific tasks based on pre-trained models, while closed-source models can be directly called via APIs, reducing hardware expenses. This taxonomy and relevant models are shown in Table 5. We analyzed the distribution of different LLM architectures applied in various cyber security domains, as shown in Fig. 5.

预训练语言模型 (Pre-trained Language Models, PLMs) 在各种 NLP 任务中展现了令人印象深刻的能力 [101, 136, 186, 212, 248]。研究人员发现,随着模型规模的增加,其性能显著提升,超过某些参数阈值后更是带来显著的性能增益 [79, 186]。术语“大语言模型 (Large Language Model, LLM)”根据参数规模区分语言模型,特别是指大规模的 PLMs [136, 248]。然而,学术界对于 LLM 的最小参数规模尚无正式共识,因为模型能力与训练数据规模和整体计算资源密切相关 [96]。在本研究中,我们采用 Pan 等人 [154] 提出的 LLM 分类框架,将我们研究中的主要 LLM 分为三种架构类别:仅编码器 (encoder-only)、编码器-解码器 (encoder-decoder) 和仅解码器 (decoder-only)。我们还考虑了相关模型是否开源。开源模型具有更高的灵活性,可以在预训练模型的基础上通过微调获得特定任务的新知识,而闭源模型则可以通过 API 直接调用,减少硬件成本。该分类及相关模型如表 5 所示。我们分析了不同 LLM 架构在网络安全领域中的应用分布,如图 5 所示。

Encoder-only LLMs. Encoder-only models, as their name implies, comprise solely an encoder network. Initially designed for language understanding tasks like text classification, these models, such as BERT and its variants [5, 50, 60, 71, 76, 127, 129, 181], aim to predict a class label for input text [50]. For instance, BERT, which adopts the encoder architecture of the Transformer model, is mentioned in 35 papers included in this study. Encoder-only LLMs use a bidirectional multi-layer self-attention mechanism to calculate the relevance of each token with all other tokens, thereby capturing semantic features that include the global context. This architecture is mainly used for processing input data, focusing on understanding and encoding information rather than generating new text. Researchers employed these models to generate embeddings for data that is relevant to cyber security (such as traffic data and code), mapping complex data types into vector space. These models typically use a masking strategy during pre-training, and the complex training strategies increase training time and the risk of overfitting. In the realm of cyber security, researchers have adopted advanced models that offer capabilities much needed in cyber security tasks such as code understanding [211] and traffic analysis [3].

仅编码器大语言模型 (Encoder-only LLMs)。仅编码器模型,顾名思义,仅包含编码器网络。最初设计用于文本分类等语言理解任务,这些模型如 BERT 及其变体 [5, 50, 60, 71, 76, 127, 129, 181],旨在为输入文本预测类别标签 [50]。例如,采用 Transformer 模型编码器架构的 BERT 在本研究涉及的 35 篇论文中被提及。仅编码器大语言模型使用双向多层自注意力机制计算每个 Token 与所有其他 Token 的相关性,从而捕获包含全局上下文的语义特征。该架构主要用于处理输入数据,侧重于理解和编码信息,而非生成新文本。研究人员使用这些模型为与网络安全相关的数据(如流量数据和代码)生成嵌入,将复杂数据类型映射到向量空间。这些模型在预训练期间通常使用掩码策略,复杂的训练策略增加了训练时间和过拟合风险。在网络安全领域,研究人员采用了提供网络安全任务所需能力的高级模型,如代码理解 [211] 和流量分析 [3]。
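The bidirectional attention described above can be illustrated with a toy scaled dot-product self-attention over raw token vectors. This is a deliberate simplification of what BERT-style encoders compute: there are no learned query/key/value projections and no multiple heads, and all names are illustrative. The `causal` flag contrasts the encoder-style behavior (every token attends to all tokens) with the masked attention used by decoder-only models (each token sees only itself and earlier tokens).

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) -> 0.0, so masked scores vanish
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(embeddings, causal=False):
    """Scaled dot-product self-attention over a list of token vectors.
    causal=False -> bidirectional (encoder-style): every token sees all tokens.
    causal=True  -> each token only sees itself and earlier tokens (decoder-style)."""
    d = len(embeddings[0])
    out = []
    for i, q in enumerate(embeddings):
        scores = []
        for j, k in enumerate(embeddings):
            if causal and j > i:
                scores.append(float("-inf"))  # mask future tokens
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        weights = softmax(scores)
        out.append([sum(w * v[t] for w, v in zip(weights, embeddings)) for t in range(d)])
    return out
```

With `causal=True`, the first token can only attend to itself, so its output equals its own embedding; with `causal=False`, each output vector mixes information from the whole sequence, which is why encoder embeddings carry global context.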

Various prominent models, including CodeBERT [60], GraphCodeBERT [71], RoBERTa [127], CharBERT [129], DeBERTa [76], and DistilBERT [181], have gained widespread usage due to their ability to effectively process and analyze code, making them valuable tools in the field of cyber security. An example is RoBERTa [127], which enhances BERT's robustness through various model design adjustments and training techniques. These include altering key hyperparameters, eliminating the next-sentence pre-training objective, and utilizing substantially larger mini-batches and learning rates during training. CodeBERT [60] is a bimodal extension of BERT that utilizes both natural language and source code as its input. It employs a replaced token detection task to bolster its understanding of programming languages, in order to tackle code generation and vulnerability detection tasks. The encoder-only architecture provides models with excellent data representation capabilities. Note that these aforementioned BERT variants were not initially designed for cyber security tasks. Instead, their application in the cyber security field stems from their capabilities as general models in NLP tasks for code semantics interpretation and understanding. In contrast, SecureBERT [5] is a BERT variant specifically designed for cyber threat analysis tasks. Its development highlights the robustness and flexibility of encoder-only architecture models across different tasks. Diverse training tasks and specialized training schemes enhance the model's feature representation capabilities and boost its performance in cyber security-related tasks.

包括 CodeBERT [60]、Graph Code BERT [71]、RoBERTa [127]、CharBERT [129]、DeBERTa [76] 和 DistilBERT [181] 在内的多种知名模型,因其能够有效处理和分析代码而得到广泛应用,成为网络安全领域中的宝贵工具。以 RoBERTa [127] 为例,它通过多种模型设计调整和训练技术增强了 BERT 的鲁棒性。这些技术包括改变关键超参数、取消下一句预训练目标,以及在训练过程中使用更大的小批量和学习率。CodeBERT [60] 是 BERT 的双模态扩展,它同时使用自然语言和源代码作为输入,并采用替换 Token 检测任务来增强其对编程语言的理解,以应对代码生成和漏洞检测任务。仅编码器架构为模型提供了出色的数据表示能力。需要注意的是,上述 BERT 变体最初并非为网络安全任务设计,它们在网络安全领域的应用源于其作为通用模型在 NLP 任务中对代码语义解释和理解的能力。相比之下,SecureBERT [5] 是专门为网络威胁分析任务设计的 BERT 变体。其发展突显了仅编码器架构模型在不同任务中的鲁棒性和灵活性。多样化的训练任务和专门的训练方案增强了模型的特征表示能力,并提升了其在网络安全相关任务中的表现。

Table 5. The classification of the LLMs used in the collected papers, with the number following the model indicating the count of papers that utilized that particular LLM.

表 5: 收集论文中使用的大语言模型分类,模型后的数字表示使用该模型的论文数量。

架构 | 模型 | 发布时间 | 开源
仅编码器 | BERT (8) | 2018.10 | ✓
仅编码器 | RoBERTa (12) | 2019.07 | ✓
仅编码器 | DistilBERT (3) | 2019.10 | ✓
仅编码器 | CodeBERT (8) | 2020.02 | ✓
仅编码器 | DeBERTa (1) | 2020.06 | ✓
仅编码器 | GraphCodeBERT (1) | 2020.09 | ✓
仅编码器 | CharBERT (1) | 2020.11 | ✓
编码器-解码器 | T5 (4) | 2019.10 | ✓
编码器-解码器 | BART (1) | 2019.10 | ✓
编码器-解码器 | PLBART (3) | 2021.03 | ✓
编码器-解码器 | CodeT5 (5) | 2021.09 | ✓
编码器-解码器 | UniXcoder (1) | 2022.03 | ✓
编码器-解码器 | Flan-T5 (1) | 2022.10 | ✓
仅解码器 | GPT-2 (9) | 2019.02 | ✓
仅解码器 | GPT-3 (4) | 2020.04 | ✗
仅解码器 | GPT-Neo (1) | 2021.03 | ✓
仅解码器 | CodeX (9) | 2021.07 | ✗
仅解码器 | CodeGen (5) | 2022.03 | ✓
仅解码器 | InCoder (1) | 2022.04 | ✓
仅解码器 | PaLM (3) | 2022.04 | ✗
仅解码器 | Jurassic-1 (1) | 2022.04 | ✗
仅解码器 | GPT-3.5 (52) | 2022.11 | ✗
仅解码器 | LLaMa (4) | 2023.02 | ✓
仅解码器 | GPT-4 (38) | 2023.03 | ✗
仅解码器 | Bard (8) | 2023.03 | ✗
仅解码器 | Claude (3) | 2023.03 | ✗
仅解码器 | StarCoder (3) | 2023.05 | ✓
仅解码器 | Falcon (2) | 2023.06 | ✓
仅解码器 | CodeLLaMa (4) | 2023.08 | ✓

Regarding model applicability, as shown in Figure 5, encoder-only models initially garnered attention in the fields of network security [11] and software and system security [106, 222]. In 2023, their use was extended to the field of information and content security, employing encoder-only models to detect harmful content on social media platforms [27, 73, 135].

关于模型适用性,如图 5 所示,仅编码器模型最初在网络网络安全 [11] 以及软件和系统网络安全 [106, 222] 领域引起了关注。2023 年,这一概念被扩展到信息和内容网络安全领域,利用仅编码器模型来检测社交媒体平台上的有害内容 [27, 73, 135]。

Encoder-decoder LLMs. The Transformer model, based on the encoder-decoder architecture [206], consists of two sets of Transformer blocks: the encoder and decoder. Stacked multi-head self-attention layers are used by the encoder to encode the input sequence, generating latent representations. In contrast, the decoder performs cross-attention on these representations and sequentially produces the target sequence. The structure of encoder-decoder LLMs makes them highly suitable for sequence-to-sequence tasks such as code translation and summarization. However, their complex architecture requires more computational resources and high-quality labeled data.

编码器-解码器大语言模型。基于编码器-解码器架构 [206] 的 Transformer 模型由两组 Transformer 块组成:编码器和解码器。编码器使用堆叠的多头自注意力层对输入序列进行编码,生成潜在表示。相比之下,解码器对这些表示进行交叉注意力操作,并顺序生成目标序列。编码器-解码器大语言模型的结构使其非常适合序列到序列任务,如代码翻译和摘要生成。然而,其复杂的架构需要更多的计算资源和高质量的标注数据。

Models like BART [109], T5 [171], and CodeT5 [210] exemplify this architecture. CodeT5 [210] and PLBART [9] have built upon the foundation of their original models by introducing bimodal inputs of programming language and text, demonstrating effective code comprehension capabilities. Raffel et al. [171] show in their work that almost all NLP tasks can be framed as a sequence-to-sequence generation task. In LLM4Security, the encoder-decoder architecture was first applied in the field of network security [120]. However, subsequent research has not widely adopted this approach, possibly due to the complexity of the encoder-decoder structure. From another perspective, owing to its flexible training strategy and excellent adaptability to complex tasks, the encoder-decoder model was later extended to other cyber security tasks such as program fuzzing [47], reverse engineering [15], and phishing email detection [90].

BART [109]、T5 [171] 和 CodeT5 [210] 等模型展示了这种架构。CodeT5 [210] 和 PLBART [9] 在原始模型的基础上,通过引入编程语言和文本的双模态输入,展示了有效的代码理解能力。Raffel 等人 [171] 在他们的工作中表明,几乎所有的 NLP 任务都可以被构造成序列到序列的生成任务。在 LLM4Security 中,首次尝试将编码器-解码器架构应用于网络安全领域 [120]。然而,后续研究并未广泛采用这种方法,可能是因为编码器-解码器结构的复杂性。从另一个角度来看,由于其灵活的训练策略和对复杂任务的出色适应性,编码器-解码器模型后来扩展到其他网络安全任务,如程序模糊测试 [47]、逆向工程 [15] 和钓鱼邮件检测 [90]。
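The sequence-to-sequence framing noted above can be mimicked by prepending a task prefix to each input, so one model serves many tasks. The prefixes below are hypothetical examples for security-flavored tasks, not the prefixes any specific model was trained with; `to_text2text` is an illustrative helper, not a library API.

```python
def to_text2text(task, payload):
    """Cast a task as a text-to-text instance by prepending a task prefix,
    T5-style: the same seq2seq model then handles every task uniformly."""
    prefixes = {
        "summarize_code": "summarize code: ",
        "translate_c_to_rust": "translate C to Rust: ",
        "classify_phish": "is this email phishing (yes/no): ",
    }
    return prefixes[task] + payload.strip()
```

The target side is plain text too (a summary, translated code, or simply "yes"/"no"), which is what lets one encoder-decoder model cover classification and generation alike.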


Fig. 5. Distribution and trend of different model architectures.

图 5: 不同模型架构的分布与趋势。

Decoder-only LLMs. Unlike the encoder-decoder architecture, which involves the encoder processing input text and the decoder generating output text by predicting subsequent tokens from an initial state, decoder-only LLMs rely solely on the decoder module to produce the target output text [169]. This autoregressive training paradigm allows decoder-only models to generate longer-form outputs token by token, making them well-suited for producing detailed analyses, advisories, and even code relevant to cyber security. The attention mechanism in these models also enables them to flexibly draw upon the extensive knowledge stored in their parameters and apply it to the current context.

仅解码器大语言模型。与编码器-解码器架构(由编码器处理输入文本、解码器从初始状态逐步预测后续 Token 以生成输出文本)不同,仅解码器大语言模型仅依靠解码器模块生成目标输出文本 [169]。这种自回归训练范式使仅解码器模型能够逐个 Token 地生成较长的输出,因而非常适合生成与网络安全相关的详细分析、建议乃至代码。这些模型中的注意力机制还使其能够灵活调用存储在参数中的大量知识,并将其应用于当前上下文。
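The token-by-token autoregressive loop can be sketched with a greedy decoder. The `next_token_scores` callable stands in for a real LLM's next-token distribution (in the usage below it is a hand-written toy table), so the function name, tokens, and `<eos>` marker are all illustrative.

```python
def greedy_decode(next_token_scores, prompt, max_new_tokens=5, eos="<eos>"):
    """Token-by-token autoregressive generation: at each step the model scores
    every vocabulary token given the full context so far, and the highest-scoring
    token is appended before the next step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)  # context -> {token: score}
        best = max(scores, key=scores.get)
        if best == eos:
            break
        tokens.append(best)
    return tokens
```

A real decoder-only model conditions on the entire growing context at every step, which is what lets long prompts (vulnerability reports, code snippets) steer the generation.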

GPT-2 [170], GPT-3 [25], GPT-3.5 [148], and GPT-4 [150] belong to the GPT series of models, among which GPT-3.5 and GPT-4 are the models most frequently used to address various cyber security issues in this study, covering almost all cyber security applications [45, 58, 62, 182]. Their strong few-shot learning abilities allow rapid development of new cyber security capabilities with minimal fine-tuning. More specialized versions like Codex [33] and others have been fine-tuned for specific code-related tasks. Open-source models like GPT-Neo [21], LLaMa [202], and Falcon [161] also follow this architecture. Additionally, code generation LLMs such as CodeGen [145], InCoder [64], StarCoder [113], and CodeLLaMa [177] have been widely used for bug detection and repair, as well as for vulnerability repair [218, 224, 227].

GPT-2 [170]、GPT-3 [25]、GPT-3.5 [148] 和 GPT-4 [150] 属于 GPT 系列模型,其中 GPT-3.5 和 GPT-4 是本研究中最常用于解决各类网络安全问题的模型,涵盖了几乎所有网络安全应用 [45, 58, 62, 182]。它们强大的少样本学习能力使得在最小微调的情况下能够快速开发新的网络安全能力。更专业的版本如 Codex [33] 等已经针对特定的代码相关任务进行了微调。开源模型如 GPT-Neo [21]、LLaMa [202] 和 Falcon [161] 也遵循了这一架构。此外,代码生成大语言模型如 CodeGen [145]、InCoder [64]、StarCoder [113] 和 CodeLLaMa [177] 已广泛用于缺陷检测与修复以及漏洞修复 [218, 224, 227]。

The large context window of decoder-only models allows them to take in and utilize more context about the cyber security task, like related vulnerabilities, reports, and code snippets.

解码器专用模型的大上下文窗口使它们能够接收并利用更多关于网络安全任务的上下文信息,例如相关的漏洞、报告和代码片段。

Due to the powerful natural language generation capabilities of the decoder-only architecture, researchers initially attempted to apply it to the generation of fake cyber threat intelligence [172]. Decoder-only LLMs have gained prominence in recent years, especially in 2022 and 2023 as shown in Figure 5, witnessing a surge in development and commercial adoption by leading Internet companies. For instance, Google introduced Bard [69], while Meta unveiled LLaMa [202]. Unlike GPT-4 and its derivative application ChatGPT, which quickly found integration into various cyber security tasks, these newer models have yet to see widespread adoption in the cyber security domain.

由于仅解码器架构 (decoder-only architecture) 在自然语言生成方面的强大能力,研究人员最初尝试将其应用于虚假网络威胁情报的生成 [172]。近年来,仅解码器的大语言模型逐渐崭露头角,尤其是在 2022 和 2023 年,如图 5 所示,其开发和商业应用在头部互联网公司中迎来了显著增长。例如,Google 推出了 Bard [69],而 Meta 发布了 LLaMa [202]。与迅速融入各类网络安全任务的 GPT-4 及其衍生应用 ChatGPT 不同,这些较新的模型尚未在网络安全领域得到广泛应用。

4.2 Trend Analysis

4.2 趋势分析

Illustrated in Figure 5, from 2020 to 2024, there have been significant shifts in the preference and utilization of LLM architectures across cyber security tasks. The selection of decoder-only, encoder-decoder, and encoder-only structures has influenced diverse research directions and solutions in the cyber security field. This examination delves into the trends regarding the adoption of these architectures over time, reflecting the evolving landscape of LLM applications for cyber security tasks.

如图 5 所示,从 2020 年到 2024 年,网络安全任务中大语言模型架构的偏好和使用发生了显著变化。仅解码器、编码器-解码器和仅编码器结构的选择影响了网络安全领域的各种研究方向和解决方案。本文深入探讨了这些架构随时间采用的趋势,反映了大语言模型在网络安全任务中应用的不断演变。

Table 6. Overview of the distribution of LLMs in the open-source community.

表 6: 开源社区中大语言模型的分布概览

(a) Top 20 most downloaded models on Hugging Face.

(a) Hugging Face 上下载量最高的 20 个模型

Model | Architecture
BERT-base | Encoder-only
DistilBERT-base | Encoder-only
GPT2 | Decoder-only
RoBERTa-large | Encoder-only
RoBERTa-base | Encoder-only
xlm-RoBERTa-large | Encoder-only
xlm-RoBERTa-base | Encoder-only
DeBERTa-base | Encoder-only
Qwen-VL-Chat | Decoder-only
T5-small | Encoder-decoder
BERT-base-cased | Encoder-only
T5-base | Encoder-decoder
BERT-base-uncased | Encoder-only
CamemBERT-base | Encoder-only
DistilGPT2 | Decoder-only
DistilRoBERTa-base | Encoder-only
LLaMa3-8B | Decoder-only
ALBERT-base-v2 | Encoder-only
DeBERTa-v3-base | Encoder-only
ByT5-small | Encoder-decoder

(b) Top 20 most liked models on Hugging Face.

(b) Hugging Face 上最受欢迎的 20 个模型

模型 | 架构
BLOOM-176B | 仅解码器
LLaMa3-8B | 仅解码器
LLaMa2-7B | 仅解码器
Mixtral-8x7B | 仅解码器
Mistral-7B | 仅解码器
Phi-2 | 编码器-解码器
Gemma-7B | 仅解码器
ChatGLM-6B | 仅解码器
StarCoder | 仅解码器
Falcon-40B | 仅解码器
Grok-1 | 仅解码器
ChatGLM2-6B | 仅解码器
GPT2 | 仅解码器
Dolly-v2-12B | 仅解码器
BERT-base | 仅编码器
Zephyr-7B | 仅解码器
OpenELM | 仅解码器
Phi-1.5 | 编码器-解码器
Yi-34B | 仅解码器
Flan-T5 | 编码器-解码器

Timeline and Model Architecture distribution. In 2020 and 2021, the use of LLMs in cyber security was limited, with only 3 research papers exploring their potential. In 2020, encoder-decoder LLMs, known for their strong performance on sequence-to-sequence tasks, were the sole architecture used in a single paper. However, in 2021, the focus shifted to decoder-only LLMs, which excel at generating longer-form outputs and handling diverse queries due to their autoregressive generation capabilities and large context windows. This shift can be attributed to the research emphasis on LLM performance in natural language processing tasks and innovations in LLM architectures during this period [25, 96].

时间线和模型架构分布。2020年和2021年,大语言模型在网络安全中的应用有限,仅有三篇研究论文探索了其潜力。2020年,编码器-解码器大语言模型,以其在序列到序列任务中的强大表现而闻名,是唯一在一篇论文中使用的架构。然而,到了2021年,研究重点转向了仅解码器大语言模型,由于其自回归生成能力和大上下文窗口,这类模型在处理多样化查询和生成长篇输出时表现出色。这一转变可归因于该时期大语言模型在自然语言处理任务中的性能研究以及大语言模型架构的创新 [25, 96]。

The year 2022 marked a significant turning point, with the number of papers employing LLMs for cyber security tasks surging to 11, surpassing the combined total from the previous two years. This year also saw increased diversity in the LLM architectures used. Encoder-only LLMs, valued for their representation learning and classification abilities, were utilized in 46% of the research (5 papers). Encoder-decoder LLMs, with their strong performance on well-defined tasks, were featured in 18% (2 papers), while decoder-only LLMs, leveraging their knowledge recall and few-shot learning capabilities, garnered 36% of the research interest (4 papers). This varied distribution highlights the active exploration of different architectures to address the diverse needs and challenges in cyber security.

2022 年是一个重要的转折点,使用大语言模型进行网络安全任务的论文数量激增至 11 篇,超过了过去两年的总和。这一年,使用的大语言模型架构也更加多样化。仅编码器架构的大语言模型以其表征学习和分类能力受到重视,在 46% 的研究中被使用(5 篇论文)。编码器-解码器架构的大语言模型因其在明确任务上的强劲表现,出现在 18% 的研究中(2 篇论文),而仅解码器架构的大语言模型凭借其知识召回和少样本学习能力,获得了 36% 的研究关注(4 篇论文)。这种多样化的分布凸显了人们积极探索不同架构以满足网络安全领域多样化需求和挑战的态势。

The years 2023 and 2024 witnessed a significant shift towards decoder-only LLMs, which emerged as the primary architecture for addressing cyber security challenges. This trend is closely tied to the powerful text comprehension, reasoning capabilities [153, 213], and open-ended generation demonstrated by chatbots like ChatGPT. These decoder-only models require minimal fine-tuning and can generate both syntactically correct and functionally relevant code snippets [103, 178]. In 2023, decoder-only LLMs accounted for 68.9% of the total research, while encoder-decoder LLMs and encoder-only LLMs contributed 10.7% (14 papers) and 22.1% (27 papers), respectively. Remarkably, all studies conducted in 2024 utilized the decoder-only architecture, indicating a strong focus on exploring and leveraging the unique advantages of these models in cyber security research and applications.

2023 年和 2024 年见证了仅解码器大语言模型的显著转变,其成为解决网络安全挑战的主要架构。这一趋势与 ChatGPT 等聊天机器人展现的强大文本理解、推理能力 [153, 213] 以及开放性生成密切相关。这些仅解码器模型需要极少的微调,即可生成语法正确且功能相关的代码片段 [103, 178]。2023 年,仅解码器大语言模型占总研究的 68.9%,而编码器-解码器大语言模型和仅编码器大语言模型分别占 10.7%(14 篇论文)和 22.1%(27 篇论文)。值得注意的是,2024 年进行的所有研究均采用了仅解码器架构,表明在网络安全研究和应用中探索和利用这些模型的独特优势已成为强烈关注点。

The dominance of decoder-only LLMs in cyber security research aligns with the broader trends in the LLM community. An analysis of the top 20 most liked and downloaded LLMs on Hugging Face [1], a popular open-source model community, reveals that while encoder-only models like BERT and its variants have the highest number of downloads, decoder-only models are gaining significant traction. Moreover, 16 out of the top 20 most liked LLMs are decoder-only models, indicating a strong preference and excitement for their potential to handle complex, open-ended tasks. The growing interest in decoder-only LLMs can be attributed to their strong generation, knowledge, and few-shot learning abilities, which make them well-suited for the diverse challenges in cyber security. However, the larger parameter size of these models compared to encoder-only models may limit their current adoption due to the scarcity of computational resources [59].

解码器专用大语言模型在网络安全研究中的主导地位与大语言模型社区的整体趋势一致。对 Hugging Face [1] 上最受欢迎和下载量最高的前 20 个大语言模型的分析表明,虽然 BERT 及其变体等编码器专用模型的下载量最高,但解码器专用模型正获得显著关注。此外,前 20 个最受欢迎的大语言模型中有 16 个是解码器专用模型,这表明人们对其处理复杂、开放式任务的潜力表现出强烈的偏好和期待。对解码器专用大语言模型的兴趣日益增长,可以归因于其强大的生成能力、知识掌握能力以及少样本学习能力,这些能力使其非常适合应对网络安全中的多样化挑战。然而,与编码器专用模型相比,这些模型的参数规模较大,可能因计算资源稀缺而限制其当前的采用 [59]。

Applying LLMs to cyber security. In our research, the use of LLMs can be categorized into agent-based processing and fine-tuning for specific tasks. Closed-source LLMs, represented by the GPT series, are the most popular in our studies. Researchers access LLMs online by calling APIs provided by LLM publishers and design task-specific prompts to guide LLMs to solve real-world problems with their training data [53, 91, 130], such as vulnerability repair and penetration testing [38, 45, 218]. Another approach involves locally fine-tuning open-source LLMs using datasets customized for specific functionalities, with which researchers are able to achieve significant performance improvements [188, 227].

将大语言模型应用于网络安全。在我们的研究中,大语言模型的使用可以分为基于智能体的处理和特定任务的微调。以GPT系列为代表的闭源大语言模型在我们的研究中最为流行。研究人员通过调用大语言模型发布者提供的API在线访问大语言模型,并设计特定任务的提示词,引导大语言模型利用其训练数据解决现实世界中的问题 [53, 91, 130],例如漏洞修复和渗透测试 [38, 45, 218]。另一种方法涉及本地微调开源大语言模型,通过使用为特定功能定制的数据集,研究人员能够实现显著的性能提升 [188, 227]。

In summary, the transition of LLMs in cyber security, progressing from encoder-only architectures to decoder-only architectures, underscores the dynamic nature and flexibility of the field. This change has fundamentally altered the method for addressing cyber security tasks, signaling ongoing innovation within the discipline.

总之,大语言模型在网络安全领域从仅编码器架构向仅解码器架构的转变,突显了该领域的动态性和灵活性。这一变化从根本上改变了处理网络安全任务的方法,标志着该学科的持续创新。

RQ2 - Summary

RQ2 - 总结

5 RQ3: WHAT DOMAIN SPECIFICATION TECHNIQUES ARE USED TO ADAPT LLMS TO SECURITY TASKS?

5 RQ3: 哪些领域规范技术被用于使大语言模型适应安全任务?

LLMs have demonstrated their efficacy across various intelligent tasks [94]. Initially, these models undergo pretraining on extensive unlabeled corpora, followed by fine-tuning for downstream tasks. However, discrepancies in input formats between pre-training and downstream tasks pose challenges in leveraging the knowledge encoded within LLMs efficiently. The techniques employed with LLMs for security tasks can be broadly classified into three categories: prompt engineering, fine-tuning, and external augmentation. We will delve into a comprehensive analysis of these three categories and further explore their subtypes, as well as summarize the connections between LLM techniques and various security tasks.

大语言模型 (LLMs) 在各种智能任务中展示了其有效性 [94]。最初,这些模型在大量的无标注语料库上进行预训练,然后针对下游任务进行微调。然而,预训练和下游任务之间输入格式的差异给有效利用大语言模型中编码的知识带来了挑战。用于安全任务的大语言模型技术大致可以分为三类:提示工程 (prompt engineering)、微调 (fine-tuning) 和外部增强 (external augmentation)。我们将深入分析这三类技术,并进一步探讨它们的子类型,同时总结大语言模型技术与各种安全任务之间的联系。

5.1 Fine-tuning LLMs for Security Tasks

5.1 为大语言模型微调以应对安全任务

Fine-tuning techniques are extensively utilized across various downstream tasks in NLP [192], encompassing the adjustment of LLM parameters to suit specific tasks. This process entails training the model on task-relevant datasets, with the extent of fine-tuning contingent upon task complexity and dataset size [52, 167]. Fine-tuning can mitigate the constraints posed by model size, enabling smaller models fine-tuned for specific tasks to outperform larger models lacking fine-tuning [98, 249]. We classify the fine-tuning techniques employed in papers leveraging LLMs for security tasks into two categories: full fine-tuning and partial fine-tuning. Notably, many papers employ fine-tuning without explicitly specifying the technique. In such cases, if an open-source LLM is utilized, we presume full fine-tuning; if a closed-source LLM such as a GPT-series model is utilized, we assume partial fine-tuning.

微调技术广泛应用于 NLP 的各种下游任务 [192],包括调整大语言模型参数以适应特定任务。这一过程需要在任务相关数据集上训练模型,微调的程度取决于任务复杂性和数据集大小 [52, 167]。微调可以缓解模型大小带来的限制,使针对特定任务微调的小模型优于未微调的大模型 [98, 249]。我们将论文中利用大语言模型进行安全任务的微调技术分为两类:全微调和部分微调。值得注意的是,许多论文使用微调时并未明确说明具体技术。在这种情况下,如果使用了开源大语言模型,我们假设为全微调;如果使用了 GPT 系列模型等闭源大语言模型,我们假设为部分微调。

A total of 32 papers in this study applied fine-tuning techniques to address security tasks. Among them, the most popular approach is full fine-tuning, with 23 papers, accounting for 71.88% of the total. This may be because the pre-training tasks of LLMs are far removed from the content of the security tasks being addressed, thus requiring updates to all parameters of the LLMs to achieve more competitive performance. Partial fine-tuning is also well represented, with 28.12% of the papers choosing this approach to fine-tune LLMs. As shown in Table 7, full fine-tuning has a wide range of applications, including information and content security, network security, and software and system security. The most widespread domain among them is software and system security, totaling 16 papers, accounting for 69.57% of the full fine-tuning papers. A similar distribution is also observed in partial fine-tuning, where the most widely applied domain is again software and system security, totaling 7 papers, accounting for 77.78%. The applicability of fine-tuning techniques to these security tasks indicates that pre-trained LLMs may not adequately address them out of the box, and updating model parameters on specific datasets is necessary to enhance effectiveness. The choice between full fine-tuning and partial fine-tuning depends on the balance between performance and efficiency considerations.

本研究中共有 32 篇论文采用微调技术来解决安全任务。其中,最受欢迎的方法是全量微调,共有 23 篇论文,占总数的 71.88%。这可能是因为大语言模型的预训练任务与所应用的安全任务内容相差甚远,因此需要更新大语言模型的所有参数以实现更具竞争力的性能。部分微调也备受重视,28.12% 的论文选择采用这种方法来微调大语言模型。如表 7 所示,全量微调的应用范围广泛,包括信息与内容安全、网络安全以及软件与系统安全。其中应用最广泛的领域是软件与系统安全,共计 16 篇论文,占全量微调论文的 69.57%。部分微调的分布情况也类似,应用最广泛的领域仍然是软件与系统安全,共计 7 篇论文,占总数的 77.78%。微调技术在这些安全任务中的适用性表明,预训练的大语言模型可能无法充分应对这些安全任务,因此有必要在特定数据集上更新模型参数以提高有效性。全量微调和部分微调的选择取决于性能与效率之间的平衡。

Table 7. Distribution of fine-tuning techniques adopted in papers and the numbers in parentheses represent the number of papers.

表 7: 论文中采用的微调技术分布,括号中的数字代表论文数量。

微调技术 安全任务 参考文献
全微调 漏洞检测 (1) [57]
访问控制 (1) [173]
隐写术 (1) [207]
逆向工程与二进制分析 (1) [193]
流量与入侵检测 (1) [62]
钓鱼与诈骗检测 (2) [172][90]
有害内容检测 (2) [73][135]
系统日志分析 (2) [97][72]
漏洞修复 (3) [218][65][240]
缺陷修复 (4) [234][157][88][209]
部分微调 漏洞检测 (5) [38][61][199][246][98]
流量与入侵检测 (1) [10]
有害内容检测 (1) [83]
程序模糊测试 (1) [47]
漏洞修复 (2) [92][188]
漏洞检测 (2) [106][227]
漏洞检测 (2) [36][98]

Full fine-tuning. Full fine-tuning involves adjusting all parameters of the LLMs, including every layer of the model, to align with the specific requirements of the target task. This approach is favored when there exists a substantial disparity between the task and the pre-trained model or when the task necessitates the model to possess high adaptability and flexibility. Although full fine-tuning demands significant computational resources and time, it often yields superior performance [128]. The success of full fine-tuning relies on having a dataset tailored to the task at hand. For instance, in bug fixing tasks, a dataset containing bug-patch pairs is essential to familiarize the LLMs with the intricacies of the target task [88, 209]. LLM4Security encompasses a range of tasks, including bug repair [88, 157, 209, 234], vulnerability detection and repair [38, 65, 246], phishing, and harmful content detection [90, 135], among others, where full fine-tuning plays a crucial role in achieving optimal results.

全微调 (Full Fine-tuning)。全微调涉及调整大语言模型的所有参数,包括模型的每一层,以适应目标任务的特定需求。当任务与预训练模型之间存在显著差异,或者任务要求模型具有高度的适应性和灵活性时,通常会选择这种方法。尽管全微调需要大量的计算资源和时间,但它往往能带来更优越的性能 [128]。全微调的成功依赖于拥有针对当前任务量身定制的数据集。例如,在修复缺陷的任务中,包含缺陷-补丁对的数据集对于让大语言模型熟悉目标任务的复杂性至关重要 [88, 209]。LLM4Security 涵盖了一系列任务,包括缺陷修复 [88, 157, 209, 234]、漏洞检测与修复 [38, 65, 246]、钓鱼和有害内容检测 [90, 135] 等,在这些任务中,全微调在实现最佳结果方面起着关键作用。

Partial fine-tuning. Partial fine-tuning of LLMs is employed in some of the papers we collected, primarily to address security tasks while considering computational resource limitations and model copyright constraints. Partial fine-tuning involves updating only the top layers or a few layers of the model during the fine-tuning process, while keeping the lower-level parameters of the pre-trained model unchanged [187]. The aim of this approach is to retain the general knowledge of the pre-trained model while adapting to the specific task by fine-tuning the top layers. This method is typically used when there is some similarity between the target task and the LLMs, or when the task dataset is small. In LLM4Security, the partial fine-tuning techniques applied can be categorized into API fine-tuning [10, 47, 83, 92, 98, 149], adapter-tuning [81, 106, 227], prompt-tuning [36, 108], and Low-Rank Adaptation (LoRA) [84, 188]. These techniques ensure the effectiveness of LLMs in downstream security tasks while requiring smaller computational resource overhead.

部分微调。在我们收集的论文中,部分微调大语言模型被用于解决安全任务,同时考虑到计算资源限制和模型版权约束。部分微调是指在微调过程中只更新模型的顶层或少数几层,而保持预训练模型的底层参数不变 [187]。这种方法的目标是保留预训练模型的通用知识,同时通过微调顶层来适应特定任务。这种方法通常在目标任务与大语言模型之间存在一定相似性,或任务数据集较小时使用。在LLM4Security中,应用的部分微调技术可分为API微调 [10, 47, 83, 92, 98, 149]、适配器微调 [81, 106, 227]、提示微调 [36, 108] 和低秩适应 (LoRA) [84, 188]。这些技术确保了大语言模型在下游安全任务中的有效性,同时需要较小的计算资源开销。
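The appeal of LoRA-style partial fine-tuning can be seen from a parameter count: instead of updating a full weight matrix W, LoRA trains a low-rank update so that the effective weight is W + B·A, with B of shape (d_out × r) and A of shape (r × d_in) for a small rank r. The helpers below are an illustrative plain-Python sketch (hypothetical names, no deep-learning framework), not any library's API.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters for full fine-tuning of one weight matrix
    W (d_out x d_in) versus a LoRA update B @ A with the given rank."""
    full = d_out * d_in
    lora = d_out * rank + rank * d_in
    return full, lora

def apply_lora(W, A, B):
    """Effective weight after adaptation: W' = W + B @ A
    (plain list-of-lists matrix maths, for illustration only)."""
    r, d_in = len(A), len(A[0])
    d_out = len(B)
    delta = [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(d_in)]
             for i in range(d_out)]
    return [[W[i][j] + delta[i][j] for j in range(d_in)] for i in range(d_out)]
```

For a 1024×1024 layer at rank 8, the LoRA update trains 16,384 parameters instead of 1,048,576, roughly a 64× reduction, which is why these methods fit on modest hardware while the frozen base model retains its general knowledge.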

5.2 Prompting LLMs for Security Tasks

5.2 为大语言模型提供安全任务的提示

Recent studies in natural language processing highlight the significance of prompt engineering [122] as an emerging fine-tuning approach aimed at bridging the gap between the output expectations of large language models during pre training and downstream tasks. This strategy has demonstrated notable success across various NLP applications.

近期自然语言处理领域的研究强调了提示工程 (Prompt Engineering) [122] 作为一种新兴微调方法的重要性,旨在弥合大语言模型在预训练和下游任务中输出预期之间的差距。该策略在各种自然语言处理应用中展现了显著的成功。

Incorporating meticulously crafted prompts as features in prompt engineering has emerged as a fundamental technique for enriching interactions with large language models like ChatGPT, Bard, among others. These customized prompts serve a dual purpose: they direct the large language models towards generating specific outputs while also serving as an interface for tapping into the vast knowledge encapsulated within these models.

在提示工程中,精心设计的提示作为特征已成为丰富与 ChatGPT、Bard 等大语言模型交互的基本技术。这些定制提示具有双重作用:它们引导大语言模型生成特定输出,同时也作为接口来利用这些模型中封装的大量知识。

In prompt engineering, utilizing inserted prompts to provide task-specific knowledge is especially beneficial for security tasks with limited data features. This becomes crucial when conventional datasets (such as network threat reports, harmful content on social media, code vulnerability datasets, etc.) are restricted or do not offer the level of detail needed for particular security tasks. For example, in handling cyber threat analysis tasks [189], one can construct prompts by incorporating the current state of the network security posture. This prompts LLMs to learn directly from the flow features in a zero-shot learning manner [225], extracting structured network threat intelligence from unstructured data, providing standardized threat descriptions, and formalized categorization. In the context of program fuzzing tasks [85], multiple individual test cases can be integrated into a prompt, assisting LLMs in learning the features of test cases and generating new ones through few-shot learning [19], even with limited input. For tasks such as penetration testing [45] and hardware vulnerability verification [179], which involve multiple steps and strict logical reasoning relationships between steps, one can utilize chain-of-thought (CoT) prompting [213] to guide the customization of prompts. This assists LLMs in process reasoning and guides them to autonomously complete tasks step by step.

在提示工程中,利用插入提示来提供任务特定知识对于数据特征有限的安全任务尤为有益。当传统数据集(如网络威胁报告、社交媒体有害内容、代码漏洞数据集等)受到限制或无法为特定安全任务提供所需细节时,这一点变得至关重要。例如,在处理网络威胁分析任务 [189] 时,可以通过结合当前网络安全态势来构建提示。这促使大语言模型以零样本学习方式 [225] 直接从流特征中学习,从非结构化数据中提取结构化的网络威胁情报,提供标准化的威胁描述和形式化的分类。在程序模糊测试任务 [85] 中,可以将多个单独的测试用例整合到一个提示中,帮助大语言模型学习测试用例的特征,并通过少样本学习 [19] 生成新的测试用例,即使在输入有限的情况下也能做到。对于渗透测试 [45] 和硬件漏洞验证 [179] 等涉及多步骤且步骤间具有严格逻辑推理关系的任务,可以利用思维链 (CoT) [213] 来指导提示的定制。这有助于大语言模型进行过程推理,并引导其逐步自主完成任务。
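The three prompting styles described above (zero-shot, few-shot, and chain-of-thought) can be sketched as plain prompt-assembly helpers. The template wording and the example fuzzing seeds below are illustrative assumptions, not taken from the cited studies:

```python
# Illustrative sketch: assembling prompts for security tasks in the three
# styles discussed above. All templates and example data are hypothetical.

def zero_shot_prompt(flow_features: dict) -> str:
    """Zero-shot: state the task and inline the raw flow features."""
    lines = [f"{k}: {v}" for k, v in flow_features.items()]
    return ("Extract structured threat intelligence from the following "
            "network flow and classify the threat type.\n" + "\n".join(lines))

def few_shot_prompt(seed_cases: list[str]) -> str:
    """Few-shot: pack existing test cases so the model imitates their shape."""
    shots = "\n".join(f"Test case {i + 1}: {c}"
                      for i, c in enumerate(seed_cases))
    return shots + f"\nTest case {len(seed_cases) + 1}: "

def cot_prompt(task: str, steps: list[str]) -> str:
    """Chain-of-thought: spell out the intermediate reasoning steps."""
    numbered = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    return (f"{task}\nReason through the following steps "
            f"before answering:\n{numbered}")

# Example: a few-shot prompt seeded with two existing fuzzing inputs.
prompt = few_shot_prompt(['fuzz("AAAA")', 'fuzz("%s%s%n")'])
```

In practice the assembled string would be sent to an LLM API; the sketch only shows how task-specific knowledge is packed into the prompt text itself.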

In LLM4Security, almost all security tasks listed in Table 7 involve prompt engineering, highlighting the indispensable role of prompts. In conclusion, recent research emphasizes the crucial role of prompt engineering in enhancing the performance of LLMs for targeted security tasks, thereby aiding in the development of automated security task solutions.

在 LLM4Security 中,表 7 中列出的几乎所有安全任务都涉及提示工程,突显了提示不可或缺的作用。总之,近期研究强调了提示工程在提升大语言模型针对特定安全任务性能中的关键作用,从而助力自动化安全任务解决方案的开发。

5.3 External Augmentation

5.3 外部增强

While LLMs undergo thorough pre-training on extensive datasets, employing them directly for tackling complex tasks in security domains faces numerous challenges due to the diversity of domain data, the complexity of domain expertise, and the specificity of domain goals [117]. Several studies in LLM4Security introduce external augmentation methods to enhance the application of LLMs in addressing security issues. These external augmentation techniques facilitate improved interaction with LLMs, bridging gaps in their knowledge base, and maximizing their capability to produce dependable outputs based on their existing knowledge.

尽管大语言模型在广泛的数据集上经过了全面的预训练,但由于领域数据的多样性、领域专业知识的复杂性以及领域目标的特殊性,直接将其应用于安全领域的复杂任务面临诸多挑战 [117]。一些关于 LLM4Security 的研究引入了外部增强方法,以提升大语言模型在解决安全问题中的应用。这些外部增强技术有助于改进与大语言模型的交互,填补其知识库中的空白,并最大限度地利用其现有知识生成可靠的输出。

We summarized the external augmentation techniques combined with LLMs in previous studies, as shown in Table 8, with 7 different external augmentation techniques. The first augmentation technique we focus on is feature augmentation. The effectiveness of LLMs in handling downstream tasks heavily relies on the features included in the prompts. We have observed that many studies employing LLMs for security tasks extract contextual relationships or other implicit features from raw data and integrate them with the original data to customize prompts. These implicit features encompass descriptions of vulnerabilities [242], bug locations [92], threat flow graphs [121], and more. Incorporating these implicit features alongside raw data leads to enhanced performance compared to constructing prompts solely from raw data. The next augmentation technique is external retrieval. External knowledge repositories can mitigate the hallucinations or errors arising from the lack of domain expertise in LLMs. LLMs can continually interact with external knowledge repositories during pipeline processing and retrieve knowledge relevant to security tasks to provide superior solutions [162]. Rule-based external tools can also serve as specialized external knowledge repositories. In addressing security tasks, LLMs can utilize results from external tools to rectify their outputs, thereby avoiding redundancy and errors [12, 74]. The fourth augmentation technique is task-adaptive training. Existing studies adopt various training strategies from pre-training to strengthen LLMs' adaptability to complex security tasks, enabling them to generate more targeted outputs. For instance, contrastive learning techniques can be employed, where both bugs and patches are used as input to LLMs to automatically generate higher-quality program patches [35, 197]. Alternatively, reinforcement learning can guide LLMs to produce more effective web test cases and alleviate local optima issues [63, 115].
The fifth augmentation technique, inter-model interaction, has garnered significant attention when a single LLM may struggle to handle complex and intricate tasks. Decomposing the pipeline process and introducing multiple LLMs for enhanced performance have been explored [228]. This approach leverages collaboration and interaction among models to harness the underlying knowledge base advantages of each LLM. When a single interaction is insufficient to support LLMs in tasks such as variable name recovery or generating complex program patches [226, 234], it is necessary to construct prompts for LLMs multiple times continuously to iterate towards the desired output. In this process, broadcasting the output results of each step iteratively as part of the prompt for the next step helps reinforce the contextual relationship between each interaction, thereby reducing error rates. The final augmentation technique is post-processing, where LLMs' outputs are validated or processed for certain security tasks requiring specific types of output [200]. This process helps mitigate issues such as hallucinations arising from the lack of domain knowledge in LLMs [36].

我们总结了以往研究中与大语言模型结合的外部增强技术,如表 8 所示,共有 7 种不同的外部增强技术。我们首先关注的是特征增强技术。大语言模型在处理下游任务时的有效性很大程度上依赖于提示中包含的特征。我们观察到,许多使用大语言模型进行安全任务的研究从原始数据中提取上下文关系或其他隐含特征,并将其与原始数据结合以定制提示。这些隐含特征包括漏洞描述 [242]、错误位置 [92]、威胁流图 [121] 等。与仅从原始数据构建提示相比,将这些隐含特征与原始数据结合可以提高性能。下一个增强技术是外部检索。外部知识库可以缓解大语言模型由于缺乏领域专业知识而产生的幻觉或错误。大语言模型可以在管道处理过程中不断与外部知识库交互,并检索与安全任务相关的知识以提供更好的解决方案 [162]。基于规则的外部工具也可以作为专门的外部知识库。在处理安全任务时,大语言模型可以利用外部工具的结果来纠正其输出,从而避免冗余和错误 [12, 74]。第四种增强技术是任务自适应训练。现有研究采用了从预训练到强化大语言模型对复杂安全任务适应性的各种训练策略,使其能够生成更有针对性的输出。例如,可以采用对比学习技术,将错误和补丁都作为大语言模型的输入,以自动生成更高质量的程序补丁 [35, 197]。或者,强化学习可以引导大语言模型生成更有效的 Web 测试用例并缓解局部最优问题 [63, 115]。第五种增强技术是模型间交互,当单个大语言模型可能难以处理复杂且精细的任务时,这种技术引起了广泛关注。已有研究探索了分解管道过程并引入多个大语言模型以提升性能的方法 [228]。这种方法利用模型之间的协作和交互,发挥每个大语言模型的潜在知识库优势。当单次交互不足以支持大语言模型完成变量名恢复或生成复杂程序补丁等任务时 [226, 234],有必要多次连续为大语言模型构建提示以迭代到期望的输出。在此过程中,将每一步迭代的输出结果作为下一步提示的一部分进行广播,有助于加强每次交互之间的上下文关系,从而降低错误率。最后一种增强技术是后处理,即对某些需要特定类型输出的安全任务,对大语言模型的输出进行验证或处理 [200]。这一过程有助于缓解大语言模型由于缺乏领域知识而产生的幻觉等问题 [36]。
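As a concrete illustration of the post-processing technique, the sketch below snaps a free-form LLM answer onto the nearest entry in a fixed label set using Levenshtein distance, discarding outputs that match nothing closely. The label set and distance threshold are illustrative assumptions, not the exact setup of the cited studies:

```python
# Sketch of Levenshtein-based post-processing: map a free-form LLM output
# to the closest known label, or reject it as a likely hallucination.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def snap_to_label(raw: str, labels: list[str], max_dist: int = 3):
    """Return the closest valid label, or None if nothing is close enough."""
    best = min(labels, key=lambda l: levenshtein(raw.lower(), l.lower()))
    return best if levenshtein(raw.lower(), best.lower()) <= max_dist else None
```

The rejection path (returning `None`) is where a pipeline would re-prompt the model or fall back to a rule-based tool, rather than silently accepting an out-of-vocabulary answer.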

Table 8. External augmentation techniques involved in prior studies.

表 8: 先前研究中涉及的外部增强技术

增强技术 描述 示例 参考文献
特征增强 将数据集中隐含的任务相关特征融入提示中 添加错误描述、错误位置、代码上下文或对不平衡流量进行重采样 [92] [237] [10] [242] [121] [220] [90]
外部检索 从外部知识库中检索任务相关信息作为输入 外部结构化的网络威胁情报语料库、用于修复的混合补丁检索器 [54] [162] [209]
外部工具 使用专门工具的分析结果作为辅助输入 模式挖掘、静态代码分析工具、渗透测试工具 [74] [15] [12]
任务自适应训练 从预训练到增强模型任务适应性的不同训练策略 对比学习、迁移学习、强化学习、蒸馏 [57] [240] [160] [72] [115] [106]
模型间交互 引入多个模型(可以是 LLM 或其他模型)进行协作和交互 多个 LLM 反馈协作、图神经网络 [27] [228] [197]
重播 适用于多步任务,将每一步的输出结果迭代地作为下一步提示的一部分 基于难度的补丁示例重播、变量名传播 [226] [234]
后处理 定制特殊的处理策略以更好地匹配任务需求 基于 Levenshtein 距离的后处理以减少幻觉、生成代码的形式验证 [36] [200]

External augmentation techniques have significantly boosted the effectiveness of LLMs across various security tasks, yielding competitive performance. We observed that only 28 out of the total 127 papers in LLM4Security, accounting for 22.05%, applied specific external augmentation techniques. From these studies, it is evident that external augmentation techniques have the potential to address issues such as hallucinations and high false positive rates caused by deficiencies in LLMs' domain knowledge and task alignment. We believe that the integration of LLMs with external techniques will be a trend in the development of automated security task solutions.

外部增强技术显著提升了大语言模型在各种安全任务中的有效性,取得了有竞争力的表现。我们观察到,在LLM4Security的127篇论文中,仅有28篇(占22.05%)应用了特定的外部增强技术。从这些研究中可以看出,外部增强技术有潜力解决由于大语言模型领域知识和任务对齐不足导致的幻觉和高误报率等问题。我们相信,大语言模型与外部技术的结合将成为自动化安全任务解决方案发展的趋势。

RQ3 - Summary

RQ3- 总结

6 RQ4: WHAT IS THE DIFFERENCE IN DATA COLLECTION AND PRE-PROCESSING WHEN APPLYING LLMS TO SECURITY TASKS?

6 RQ4: 在将大语言模型应用于安全任务时,数据收集和预处理有何不同?

Data plays a vital role throughout the model training process [235]. Initially, collecting diverse and rich data is crucial to enable the model to handle a wide range of scenarios and contexts effectively. Following this, categorizing the data helps specify the model's training objectives and avoid ambiguity and misinterpretation. Additionally, preprocessing the data is essential to clean and refine it, thereby enhancing its quality. In this chapter, we examine the methods of data collection, categorization, and preprocessing as described in the literature.

数据在模型训练过程中起着至关重要的作用 [235]。首先,收集多样化和丰富的数据对于使模型能够有效处理各种场景和上下文至关重要。随后,对数据进行分类有助于明确模型的训练目标,避免歧义和误解。此外,对数据进行预处理是必要的,以清理和优化数据,从而提高其质量。在本章中,我们探讨了文献中描述的数据收集、分类和预处理方法。

6.1 Data Collection

6.1 数据收集

Data plays an indispensable and pivotal role in the training of LLMs, influencing the model's capacity for generalization, effectiveness, and performance [195]. Sufficient, high-quality, and diverse data are imperative to facilitate the model's comprehensive understanding of task characteristics and patterns, optimize parameters, and ensure the reliability of validation and testing. Initially, we explore the techniques employed for dataset acquisition. Through an examination of data collection methods, we classify data sources into four categories: open-source datasets, collected datasets, constructed datasets, and industrial datasets.

数据在大语言模型的训练中发挥着不可或缺的核心作用,影响着模型的泛化能力、效果和性能 [195]。充足、高质量且多样化的数据是必要的,以促进模型全面理解任务特征和模式,优化参数,并确保验证和测试的可靠性。首先,我们探讨数据集获取的技术。通过对数据收集方法的考察,我们将数据源分为四类:开源数据集、收集的数据集、构建的数据集和工业数据集。

Open-source datasets. Open-source datasets refer to datasets that are publicly accessible and distributed through open-source platforms or online repositories [27, 32, 131, 238]. For example, the UNSW-NB15 dataset contains 175,341 network connection records, including summary information, network connection features, and traffic statistics. The network connections in the dataset are labeled as normal traffic or one of nine different types of attacks [142]. The credibility of these datasets is ensured by their open-source nature, which also allows for community-driven updates. This makes them dependable resources for academic research.

开源数据集。开源数据集是指通过开源平台或在线仓库公开获取和分发的数据集 [27, 32, 131, 238]。例如,UNSW-NB15 数据集包含 175,341 条网络连接记录,涵盖摘要信息、网络连接特征和流量统计。数据集中的网络连接被标记为正常流量或九种不同攻击类型之一 [142]。这些数据集的开源性质保证了其可信度,同时也支持社区驱动的更新,使其成为学术研究中可靠的资源。

Collected datasets. Researchers gather collected datasets directly from various sources, such as major websites, forums, blogs, and social media platforms. These datasets may include comments from GitHub, harmful content from social media, or vulnerability information from CVE websites, tailored to specific research questions.

收集数据集。研究人员直接从各种来源(如主要网站、论坛、博客和社交媒体平台)收集数据集。这些数据集可能包括来自 GitHub 的评论、社交媒体上的有害内容或 CVE 网站上的漏洞信息,以针对特定的研究问题。

Constructed datasets. The constructed dataset refers to a specialized dataset created by researchers through the modification or augmentation of existing datasets to better suit their specific research goals [8, 100, 140, 218]. These changes could be made through manual or semi-automated processes, which might entail creating test sets tailored to specific domains, annotating datasets, or generating synthetic data. For instance, researchers might gather information on web vulnerabilities and the corresponding penetration testing methods, structure them into predefined templates to form vulnerability scenarios, and subsequently assess large language models using these scenarios [45].

构建数据集。构建数据集指的是研究人员通过修改或扩展现有数据集,以更好地适应其特定研究目标而创建的专业数据集 [8, 100, 140, 218]。这些修改可以通过手动或半自动化的过程进行,可能包括创建针对特定领域的测试集、标注数据集或生成合成数据。例如,研究人员可能会收集有关Web漏洞及其相应的渗透测试方法的信息,将它们结构化为预定义的模板以形成漏洞场景,并随后使用这些场景评估大语言模型 [45]。


Fig. 6. The collection strategies of datasets in LLM4Security.

图 6: LLM4Security 数据集收集策略

Industrial datasets. Industrial datasets are data obtained from real-world commercial or industrial settings, typically consisting of industrial applications, user behavior logs, and other sensitive information [119, 237]. These datasets are particularly valuable for research aimed at addressing real-world application scenarios.

工业数据集。工业数据集是从真实商业或工业环境中获取的数据,通常包括工业应用、用户行为日志和其他敏感信息 [119, 237]。这些数据集对于旨在解决现实应用场景的研究尤为宝贵。

Figure 6 illustrates the data collection strategies for LLM-related datasets. As depicted in Figure 6, 47 studies utilize open-source datasets to train LLMs. The utilization of open-source datasets for training LLMs is predominantly attributed to their authenticity and credibility. These datasets are typically comprised of real-world data sourced from diverse origins, including previous related research, thereby ensuring a high degree of reliability and fidelity to real-world scenarios. This authenticity enables LLMs to learn from genuine examples, facilitating a deeper understanding of real-world security tasks and ultimately improving their performance. Additionally, due to the recent emergence of LLMs, there is indeed a challenge of the lack of suitable training sets. Hence, researchers often collect data from websites or social media and construct datasets to make the data more suitable for specific security tasks. We also analyzed the relationship between data collection strategies and the security domain. In certain domains such as network security, the preference for collecting datasets surpasses that for using open-source datasets. This indicates that obtaining data for applying LLMs to certain security tasks remains inconvenient. Among the 127 papers examined, only 2 studies utilized industrial datasets. This indicates a potential gap between the characteristics of datasets used in academic research and those in real-world industrial settings. This difference underscores the importance of future research exploring industrial datasets to ensure the applicability and robustness of large language models (LLMs) across academic and industrial domains. Some papers focus on exploring the use of existing LLMs, such as ChatGPT, in security tasks [143, 144].
These papers often do not specify the datasets used for model training, as LLMs like ChatGPT typically do not require users to prepare their own training data for application scenarios.

图 6 展示了大语言模型相关数据集的数据收集策略。从图 6 中的数据可以看出,有 47 项研究利用开源数据集来训练大语言模型。使用开源数据集训练大语言模型的主要原因在于其真实性和可信度。这些数据集通常由来自不同来源的真实世界数据组成,包括先前相关研究,从而确保了与现实场景的高度可靠性和稳定性。这种真实性使得大语言模型能够从真实的示例中学习,促进对现实世界安全任务的深入理解,并最终提高其性能。此外,由于大语言模型的近期兴起,确实存在缺乏合适训练集的挑战。因此,研究人员通常从网站或社交媒体收集数据并构建数据集,以使数据更适合特定的安全任务。我们还分析了数据收集策略与安全领域之间的关系。在某些领域(如网络安全),收集数据集的偏好超过了使用开源数据集。这表明,在某些安全任务中应用大语言模型时,获取数据仍然不便。在审查的 127 篇论文中,仅有 2 项研究使用了工业数据集。这表明学术研究中使用的数据集特征与实际工业环境中的数据集特征之间存在潜在差距。这种差异强调了未来研究探索工业数据集的重要性,以确保大语言模型在学术和工业领域的适用性和鲁棒性。一些论文专注于探索现有大语言模型(如 ChatGPT)在安全任务中的应用 [143, 144]。这些论文通常不指定用于模型训练的数据集,因为像 ChatGPT 这样的大语言模型通常不需要用户为自己的应用场景准备训练数据。

6.2 Types of Datasets

6.2 数据集类型

The choice of data types plays a crucial role in shaping the architecture and selection of LLMs, as they directly influence the extraction of implicit features and subsequent decision-making by the model. This decision significantly impacts the overall performance and ability of LLMs to generalize [125]. We conducted a thorough analysis and categorization of the data types utilized in LLM4Security research. Through examining the interplay between data types, model architectures, and task demands, our goal is to highlight the vital significance of data types in effectively applying LLMs to security-related tasks.

数据类型的选择在塑造大语言模型(LLM)架构和选择中起着至关重要的作用,因为它们直接影响模型对隐式特征的提取和后续决策。这一决定对大语言模型的整体性能和泛化能力有显著影响 [125]。我们对 LLM4Security 研究中所使用的数据类型进行了深入分析和分类。通过考察数据类型、模型架构和任务需求之间的相互作用,我们的目标是强调数据类型在将大语言模型有效应用于安全相关任务中的重要性。

Data type categorization. We categorize all datasets into three types: code-based, text-based, and hybrid data types. Table 9 provides a detailed breakdown of the specific data included in each category, derived from 127 studies. The analysis reveals that the majority of studies rely on code-based datasets, constituting a total of 71 datasets. This dominance underscores the superior code analysis capabilities of LLMs when trained for security tasks. These models demonstrate proficiency in understanding and processing code data, making them well-suited for security challenges such as vulnerability detection, program fuzzing, and traffic analysis. Their capacity to handle and learn from extensive code data enables LLMs to offer robust insights and solutions for various security applications.

数据类型分类。我们将所有数据集分为三类:基于代码的、基于文本的和混合数据类型。表 9 详细列出了从 127 项研究中得出的每个类别中包含的具体数据。分析显示,大多数研究依赖于基于代码的数据集,共计 71 个数据集。这种主导地位突显了大语言模型在为安全任务训练时所展现出的卓越代码分析能力。这些模型在理解和处理代码数据方面表现出色,使其非常适合应对漏洞检测、程序模糊测试和流量分析等安全挑战。它们处理和学习大量代码数据的能力使大语言模型能够为各种安全应用提供强有力的见解和解决方案。

Text datasets with numerous prompts (a total of 28) are commonly utilized for tasks lacking structured data, effectively guiding large language models (LLMs) through prompts to influence their behavior. While understanding the intricacies of training data might not be crucial for closed-source LLMs like ChatGPT, insights into data handling techniques for other models are still valuable. This is because black-box models can be fine-tuned with small-sized data inputs during usage. Among the 127 papers analyzed, text datasets rich in prompts are frequently used for training LLMs in security tasks, highlighting this trend. Additionally, specific security tasks necessitate particular text data inputs, such as system log analysis and harmful content detection.

具有大量提示的文本数据集(共28个)通常用于缺乏结构化数据的任务,通过提示有效引导大语言模型(LLMs)以影响其行为。虽然对于像ChatGPT这样的闭源LLMs,了解训练数据的细节可能并不关键,但对于其他模型的数据处理技术的见解仍然有价值。这是因为黑盒模型在使用过程中可以通过小规模数据输入进行微调。在分析的127篇论文中,富含提示的文本数据集常用于安全任务中的LLMs训练,突显了这一趋势。此外,特定的安全任务需要特定的文本数据输入,例如系统日志分析和有害内容检测。

The prevalence of vulnerable code (17), source code (15),and bug-fix pairs (14) in code-based datasets can be attributed to their ability to effectively meet task requirements. Vulnerable code naturally exhibits semantic features of code containing vulnerabilities to large language models (LLMs), thereby highlighting the distinguishing traits of vulnerable code when juxtaposed with normal code snippets. This aids LLMs in performing security tasks related to vulnerability detection. A similar rationale applies to bug-fix pairs. Source code serves as the backbone of any software project, encompassing the logic and instructions that define program behavior. Thus, having a substantial amount of source code data is essential for training LLMs to grasp the intricacies of programs, enabling them to proficiently generate, analyze, and comprehend code across various security tasks. Additionally, commonly used data types for bug fixes and traffic and intrusion detection, such as bugs (7) and traffic packets (4), are also widespread.

基于代码的数据集中易受攻击的代码 (17)、源代码 (15) 和缺陷修复对 (14) 的普遍存在可以归因于它们能够有效满足任务需求。易受攻击的代码自然地向大语言模型 (LLMs) 展示了包含漏洞的代码的语义特征,从而在与正常代码片段对比时突出了易受攻击代码的区别特征。这有助于 LLMs 执行与漏洞检测相关的安全任务。类似的理由也适用于缺陷修复对。源代码是任何软件项目的核心,包含了定义程序行为的逻辑和指令。因此,拥有大量的源代码数据对于训练 LLMs 掌握程序的复杂性至关重要,使它们能够熟练地生成、分析和理解各种安全任务中的代码。此外,常用于缺陷修复以及流量和入侵检测的数据类型,如缺陷 (7) 和流量包 (4),也很普遍。

Some studies have utilized composite datasets containing multiple data types, such as vulnerable code and vulnerability descriptions. For instance, Liu et al. [124] collected a dataset comprising CVE vulnerable code along with vulnerability descriptions and evaluated the performance of LLMs on vulnerability description mapping tasks based on this dataset.

一些研究利用了包含多种数据类型的复合数据集,例如易受攻击的代码和漏洞描述。例如,Liu 等人 [124] 收集了一个包含 CVE 易受攻击代码及其漏洞描述的数据集,并基于该数据集评估了大语言模型在漏洞描述映射任务上的性能。

Table 9. Data types of datasets involved in prior studies.

表9: 先前研究中涉及的数据集数据类型

类别 数据类型 研究数量 总计 参考文献
易受攻击的代码 17 [98][203][121][218] [38][61][36] [124][199][238][246] [40]
基于代码的数据集 源代码 15 71 [17][85][30][12][7][32][204][46] [228] [226][193][15][160][191][121][194][42]
Bug修复对 14 [87][66][86] [92][114][244][234] [242][188] [157][147][222][88][241][224][223][209]
Bug 7 [111][106][190][157]
流量包 4 [88][47][8]
补丁 3 [98][105][197] [138][62][131][11]
代码变更 3 [227][54][214]
漏洞修复对 2 [240][65]
Bug修复提交 2 [244][209]
Web攻击载荷 2 [120][115] [133]
主题协议程序 1 [158]
易受攻击的程序 1
提示 17 [10][31] [140][74] [45] [198]
日志消息 [116] [56] [201][237][159][23][77][176][132][182][179]
基于文本的数据集 社交媒体内容 6 [72][185] [119][39][97][166]
垃圾邮件 49 [207] [73][83] [27] [135]
Bug报告 4 [102][139][28][90]
攻击描述 3 [57][106][54] [24][58]
CVE报告 2 [3][4]
网络威胁情报数据 2 [172][100]
顶级域名 1 [123]
安全报告 1 [5][189]
威胁报告 1 [162]
结构化威胁信息 1
程序文档 1 [220]
杀毒扫描报告 1 [93]
密码 1 [173]
硬件文档 1 [134]
组合数据集 易受攻击的代码和漏洞描述 2 [124][36]

Table 10. The data preprocessing techniques for code-based datasets.

表 10: 基于代码数据集的数据预处理技术。

预处理技术 描述 示例 参考文献
数据提取 从基于代码的数据集中检索与特定安全任务相关的代码片段,适应不同粒度和特定任务需求。 Token 级别、语句级别、类级别、流量流。 [193] [29] [138]
重复实例删除 从数据集中删除重复实例,以保持数据完整性并避免训练阶段的重复。 删除重复代码、注释和函数名中明显的漏洞指示符。 [199] [238] [242]
不合格数据删除 通过实施过滤标准删除不合适的数据,保留合适的样本,确保数据集的质量和适用于各种安全任务。 删除或匿名化可能提供明显漏洞提示的注释和信息(包、变量名等)。 [61] [234] [157] [98]
代码表示 将代码表示为 Token。 将源代码或二进制代码 Token 化。 [88] [246] [222]
数据分割 将数据集划分为训练、验证和测试子集,用于模型训练、参数调整和性能评估。 根据特定标准对数据集进行分区,可能包括划分为训练、验证或测试子集。 [227] [191]

6.3 Data Pre-processing

6.3 数据预处理

When training and using LLMs, it's important to preprocess the initial dataset to obtain clean and appropriate data for model training [106]. Data preprocessing involves tasks like cleaning, reducing noise, and normalization. Different types of data may require different preprocessing methods to improve the performance and effectiveness of LLMs in security tasks, maintaining data consistency and quality. This section will provide a detailed explanation of the data preprocessing steps customized for the two main types of datasets: those based on code and those based on text.

在训练和使用大语言模型时,对初始数据集进行预处理以获得干净且适合模型训练的数据非常重要 [106]。数据预处理涉及诸如数据清理、降噪和归一化等任务。不同类型的数据可能需要不同的预处理方法,以提高大语言模型在安全任务中的性能和效果,同时保持数据的一致性和质量。本节将详细解释为基于代码和基于文本的两类主要数据集定制的数据预处理步骤。

Data preprocessing techniques for code-based datasets. We outline the preprocessing techniques utilized for code-based datasets, comprising five essential steps. Table 10 provides a comprehensive summary of each technique with examples. The initial step involves extracting data, retrieving relevant code snippets from diverse sources. Depending on the research task's needs [138, 193], snippets may be extracted at different levels of detail, ranging from individual lines, methods, and functions to entire code files or projects. To prevent bias and redundancy during training, the next step removes duplicate instances by identifying and eliminating them from the dataset [238, 242], enhancing diversity and uniqueness. Filtering follows, removing snippets that don't meet predefined quality standards to ensure relevance to the security task and avoid noise [61, 234]. Code representation converts snippets into suitable formats for LLM processing, often utilizing token-based representations for security tasks [222]. Finally, data splitting divides the preprocessed dataset into training, validation, and testing subsets [227]. Training sets train the LLM, validation sets tune hyperparameters, and testing sets assess model performance on unseen data. By adhering to these steps, researchers can construct structured code-based datasets, facilitating LLM application across various security tasks like vulnerability detection, program fuzzing, and intrusion detection.

基于代码数据集的数据预处理技术。我们概述了用于基于代码数据集的预处理技术,包括五个基本步骤。表 10 对每种技术进行了全面总结并附有示例。第一步是数据提取,即从不同来源检索相关代码片段。根据研究任务的需要 [138, 193],可以以不同粒度提取片段,从单独的代码行、方法、函数到整个代码文件或项目。为防止训练过程中的偏差和冗余,下一步通过识别并删除数据集中的重复实例 [238, 242],增强多样性和唯一性。接下来是过滤,删除不符合预定义质量标准的片段,以确保与安全任务的相关性并避免噪声 [61, 234]。代码表示将片段转换为适合大语言模型处理的格式,安全任务中通常使用基于 Token 的表示 [222]。最后,数据分割将预处理后的数据集划分为训练、验证和测试子集 [227]。训练集用于训练大语言模型,验证集用于调整超参数,测试集用于评估模型在未见数据上的性能。遵循这些步骤,研究人员可以构建结构化的基于代码的数据集,促进大语言模型在漏洞检测、程序模糊测试和入侵检测等各种安全任务中的应用。
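Under the simplifying assumptions of a toy quality filter (dropping snippets whose comments leak the word "vulnerable") and a naive word/punctuation tokenizer, the last four of the five steps above can be sketched as one small pipeline:

```python
# Minimal sketch of the code-dataset pipeline: de-duplicate, filter,
# tokenize, and split. The filter rule and tokenizer are placeholders.
import random
import re

def preprocess(snippets: list[str], seed: int = 0):
    # Step 2: duplicate-instance deletion (order-preserving).
    unique = list(dict.fromkeys(snippets))
    # Step 3: unqualified-data deletion, here a stand-in rule that drops
    # snippets whose text leaks an obvious vulnerability hint.
    clean = [s for s in unique if "vulnerable" not in s.lower()]
    # Step 4: code representation as a naive token sequence.
    tokenized = [re.findall(r"\w+|[^\w\s]", s) for s in clean]
    # Step 5: reproducible 80/10/10 train/validation/test split.
    random.Random(seed).shuffle(tokenized)
    n = len(tokenized)
    return (tokenized[: int(n * 0.8)],
            tokenized[int(n * 0.8): int(n * 0.9)],
            tokenized[int(n * 0.9):])
```

Real pipelines replace the placeholder filter with the anonymization rules listed in Table 10 and the naive tokenizer with the target LLM's own subword tokenizer; the overall shape of the pipeline stays the same.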

Table 11. The data preprocessing techniques for text-based datasets.

表 11: 基于文本数据集的数据预处理技术。

预处理技术 描述 示例 参考文献
数据提取 根据各种软件工程任务从文档中检索适当的文本。 攻击描述、错误报告、社交媒体内容、硬件文档等。 [58] [220] [3] [73]
初始数据分段 根据需要将数据分类到不同的组中。 将数据分割为句子或单词。 [134] [102] [135] [4]
不合格数据删除 根据指定规则删除无效的文本数据。 删除某些符号和单词(罕见词、停用词等),或将所有内容转换为小写。 [57] [5] [27]
文本表示 基于Token的文本表示。 将文本、句子或单词Token化。 [4] [134]
数据分割 将数据集划分为训练、验证和测试子集,用于模型训练、参数调优和性能评估。 根据特定标准对数据集进行分区,可能包括划分为训练、验证或测试子集。 [173] [207] [123]

Data preprocessing techniques for text-based datasets. As depicted in Table 11, preprocessing text-based datasets involves five steps, with minor differences compared to code-based datasets. The process begins with data extraction, carefully retrieving text from various sources such as bug reports [57], program documentation [220], hardware documentation [134], and social media content [73]. This initial phase ensures the dataset encompasses a range of task-specific textual information. After data extraction, the text undergoes segmentation tailored to the specific research task's needs. Segmentation may involve breaking text into sentences or further dividing it into individual words for analysis [4, 134]. Subsequent preprocessing operations standardize and clean the text, typically involving the removal of specific symbols, stop words, and special characters [4, 135]. This standardized textual format facilitates effective processing by LLMs. Removing unqualified or duplicate data addresses bias and redundancy in the dataset and enhances its diversity, aiding the model's generalization to new inputs [102]. Data tokenization is essential for constructing LLM inputs, where text is tokenized into smaller units such as words or subwords to facilitate feature learning [4]. Finally, the preprocessed dataset is divided into subsets, typically comprising training, validation, and testing sets.

基于文本数据集的数据预处理技术。如表 11 所示,基于文本数据集的预处理包括五个步骤,与基于代码的数据集相比略有不同。该过程从数据提取开始,仔细地从各种来源检索文本,例如错误报告 [57]、程序文档 [220]、硬件文档 [134] 和社交媒体内容 [73]。这一初始阶段确保数据集涵盖一系列特定任务的文本信息。数据提取后,根据具体研究任务的需要对文本进行分段。分段可能涉及将文本拆分为句子,或进一步划分为单个单词以供分析 [4, 134]。随后的预处理操作对文本进行标准化和清理,通常包括删除特定符号、停用词和特殊字符 [4, 135]。这种标准化的文本格式便于大语言模型进行有效处理。删除不合格或重复数据可解决数据集中的偏差和冗余问题,增强数据集的多样性,帮助模型泛化到新输入 [102]。数据 Token 化对于构建大语言模型的输入至关重要,文本被 Token 化为单词或子词等更小的单元,以促进特征学习 [4]。最后,预处理后的数据集被划分为子集,通常包括训练集、验证集和测试集。
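The segmentation and cleaning steps of this text pipeline can be sketched in a few lines; the tiny stop-word list is a deliberate assumption standing in for the fuller lists used in practice:

```python
# Sketch of text-dataset preprocessing: sentence segmentation, then
# lowercasing, symbol stripping, stop-word removal, and tokenization.
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "to"}  # illustrative subset

def segment(doc: str) -> list[str]:
    """Step 2: split a report into sentences on terminal punctuation."""
    return [s.strip() for s in re.split(r"[.!?]+", doc) if s.strip()]

def clean_and_tokenize(sentence: str) -> list[str]:
    """Steps 3-4: lowercase, strip symbols, drop stop words, tokenize."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]

doc = "The attacker sent a phishing email. It contained a malicious link!"
tokens = [clean_and_tokenize(s) for s in segment(doc)]
```

As with the code pipeline, the final splitting step is identical, and a production system would swap the word-level tokenizer for the LLM's subword tokenizer.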

RQ4 - Summary

RQ4 - 总结

(1) Based on different data sources, datasets are categorized into four types: open-source datasets, collected datasets, constructed datasets, and industrial datasets. The use of open-source datasets is the most common, accounting for approximately 38.52% of the 122 papers explicitly mentioning dataset sources. Collected datasets and constructed datasets are also popular, reflecting the lack of practical data in LLM4Security research. (2) We categorize all datasets into three types: code-based, text-based, and combined. Text-based and code-based types are the most commonly used when applying LLMs to security tasks. This pattern indicates that LLMs excel in leveraging their natural language processing capabilities to handle text-based and code-based data in security tasks. (3) We summarize the data preprocessing process for different data types, outlining common data preprocessing steps such as data extraction, unqualified data deletion, data representation, and data segmentation.

(1) 根据不同的数据来源,数据集分为四种类型:开源数据集、收集数据集、构建数据集和工业数据集。开源数据集的使用最为常见,在明确提及数据集来源的 122 篇论文中约占 38.52%。收集数据集和构建数据集也很受欢迎,这反映了 LLM4Security 研究中缺乏实际数据。(2) 我们将所有数据集分为三种类型:基于代码的、基于文本的和组合的。在将大语言模型应用于安全任务时,基于文本和基于代码的类型是最常用的类型。这一模式表明,大语言模型在利用其自然语言处理能力处理安全任务中的基于文本和基于代码的数据方面表现出色。(3) 我们总结了不同数据类型的数据预处理过程,概述了常见的数据预处理步骤,如数据提取、不合格数据删除、数据表示和数据分割。

7 THREATS TO VALIDITY

7 有效性威胁

Paper retrieval omissions. One significant potential risk is the possibility of overlooking relevant papers during the search process. While collecting papers on LLM4Security tasks from various publishers, there is a risk of missing out on papers with incomplete abstracts, lacking cyber security tasks or LLM keywords. To address this issue, we employed a comprehensive approach that combines manual searching, automated searching, and snowballing techniques to minimize the chances of overlooking relevant papers as much as possible. We extensively searched for LLM papers related to security tasks in three top security conferences, extracting authoritative and comprehensive security task and LLM keywords for manual searching. Additionally, we conducted automated searches using carefully crafted keyword search strings on seven widely used publishing platforms. Furthermore, to further expand our search results, we employed both forward and backward snowballing techniques.

论文检索遗漏。一个重大的潜在风险是在搜索过程中可能忽略相关论文。在收集来自不同出版商关于LLM在安全任务中的论文时,存在遗漏摘要不完整、缺乏网络安全任务或LLM关键词的论文的风险。为了解决这个问题,我们采用了结合手动搜索、自动搜索和滚雪球技术的综合方法,以尽可能减少遗漏相关论文的可能性。我们广泛搜索了三个顶级安全会议中与安全任务相关的LLM论文,提取了权威且全面的安全任务和LLM关键词用于手动搜索。此外,我们在七个广泛使用的出版平台上使用精心设计的关键词搜索字符串进行了自动搜索。此外,为了进一步扩大搜索结果,我们还采用了前向和后向滚雪球技术。

Bias of research selection. The selection of studies carries inherent limitations and potential biases. Initially, we established criteria for selecting papers through a combination of automated and manual steps, followed by manual validation based on Quality Assessment Criteria (QAC). However, incomplete or ambiguous information in BibTeX records may result in mislabeling of papers during the automated selection process. To address this issue, papers that cannot be conclusively excluded require manual validation. However, the manual validation stage may be subject to biases in researchers' subjective judgments, thereby affecting the accuracy of assessing paper quality. To mitigate these issues, we enlisted two experienced reviewers from the fields of cyber security and LLM to conduct a secondary review of the research selection results. This step aims to enhance the accuracy of paper selection and reduce the chances of omission or misclassification. By implementing these measures, we strive to ensure the accuracy and integrity of the selected papers, minimize the impact of selection biases, and enhance the reliability of the systematic literature review. Additionally, we provide a replication package for further examination by others.

研究选择的偏差。研究的选择存在固有的局限性和潜在偏差。最初,我们通过自动化和手动相结合的步骤建立了论文选择标准,随后基于质量评估标准 (QAC) 进行手动验证。然而,BibTeX 记录中不完整或模糊的信息可能导致自动选择过程中论文被错误标记。为解决这一问题,无法被明确排除的论文需要进行手动验证。然而,手动验证阶段可能受到研究人员主观判断偏差的影响,从而影响论文质量评估的准确性。为缓解这些问题,我们邀请了两位来自网络安全和大语言模型领域的经验丰富的评审人员对研究选择结果进行二次审查。这一步骤旨在提高论文选择的准确性,减少遗漏或错误分类的可能性。通过实施这些措施,我们努力确保所选论文的准确性和完整性,最大限度地减少选择偏差的影响,并增强系统性文献综述的可靠性。此外,我们还提供了一个复现包以供他人进一步检验。

8 CHALLENGES AND OPPORTUNITIES

8 挑战与机遇

8.1 Challenges

8.1 挑战

8.1.1 Challenges in LLM Applicability.

8.1.1 大语言模型适用性中的挑战

Model size and deployment. The size of LLMs has grown significantly over time, escalating from 117M parameters for GPT-1 to 1.5B parameters for GPT-2, and further to 175B parameters for GPT-3 [229]. Models with billions or even trillions of parameters present substantial challenges in terms of storage, memory, and computational demands [59]. This can potentially impede the deployment of LLMs, particularly in scenarios where developers lack access to potent GPUs or TPUs, especially in resource-constrained environments necessitating real-time deployment. CodeBERT [60] emerged in 2019 as a pre-trained model featuring 125M parameters and a model size of 476MB. Recent models like Codex [33] and CodeGen [145] have surpassed 100 billion parameters, with model sizes exceeding 100GB. Larger sizes entail more computational resources and higher time costs. For instance, training the GPT-NeoX-20B model [21] mandates 825GB of raw text data and deployment on 8 NVIDIA A100-SXM4-40GB graphics processing units (GPUs). Each GPU comes with a price tag of over $6,000, and the training duration spans 1,830 hours, or roughly 76 days. These instances underscore the substantial computational costs linked with training LLMs. Additionally, these platforms entail notable energy expenses, with LLM-based platforms projected to markedly amplify energy consumption [174]. Some vendors like OpenAI and Google provide online APIs for LLMs to alleviate user usage costs, while researchers explore methods to curtail LLM scale. Hsieh et al. [82] proposed step-by-step distillation to diminish the data and model size necessary for LLM training, with their findings showing that a T5 model with only 770M parameters surpassed the 540B-parameter PaLM.

模型规模与部署。随着时间的推移,大语言模型的规模显著增长,从 GPT-1 的 1.17 亿参数增长到 GPT-2 的 15 亿参数,再到 GPT-3 的 1750 亿参数 [229]。拥有数十亿甚至数万亿参数的模型在存储、内存和计算需求方面带来了巨大挑战 [59]。这可能会阻碍大语言模型的部署,尤其是在开发者无法访问强大的 GPU 或 TPU 的情况下,特别是在需要实时部署的资源受限环境中。CodeBERT [60] 于 2019 年推出,作为一个预训练模型,拥有 1.25 亿参数,模型大小为 476MB。最近的模型如 Codex [33] 和 CodeGen [145] 已经超过了 1000 亿参数,模型大小超过 100GB。更大的规模意味着需要更多的计算资源和更高的时间成本。例如,训练 GPT-NeoX-20B 模型 [21] 需要 825GB 的原始文本数据,并在 8 个 NVIDIA A100-SXM4-40GB 图形处理单元 (GPU) 上部署。每个 GPU 的价格超过 6,000 美元,训练时间长达 1830 小时,大约 76 天。这些例子突显了与大语言模型训练相关的巨大计算成本。此外,这些平台还涉及显著的能源消耗,基于大语言模型的平台预计将显著增加能源消耗 [174]。一些供应商如 OpenAI 和 Google 提供了大语言模型的在线 API,以减轻用户的使用成本,同时研究人员也在探索减少大语言模型规模的方法。Hsieh 等人 [82] 提出了逐步蒸馏的方法,以减少大语言模型训练所需的数据和模型规模,他们的研究结果表明,一个仅有 770M (7.7 亿) 参数的 T5 模型性能超越了拥有 540B (5400 亿) 参数的 PaLM。
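A back-of-the-envelope calculation makes these deployment figures concrete: weight memory alone is roughly the parameter count times bytes per parameter, before optimizer state and activations are added. The sketch below only covers this weight-memory term:

```python
# Rough weight-memory estimate for the parameter counts cited above.
# Optimizer state, gradients, and activations add substantially on top,
# which is why multi-GPU deployment is needed in practice.

def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Weight storage in GiB for a given parameter count and precision."""
    return params * bytes_per_param / 1024**3

for name, params in [("GPT-2", 1.5e9), ("GPT-NeoX-20B", 20e9),
                     ("GPT-3", 175e9)]:
    fp16 = weight_memory_gb(params, 2)  # 2 bytes/param at 16-bit precision
    print(f"{name}: ~{fp16:.0f} GiB in fp16, ~{2 * fp16:.0f} GiB in fp32")
```

By this estimate the 20B-parameter model already needs about 37 GiB for fp16 weights alone, consistent with the text's observation that such models are deployed across multiple 40GB A100 GPUs once training state and activations are included.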

Data scarcity. In Section 6, we conducted an extensive examination of the datasets and data preprocessing procedures employed in the 118 studies. Our analysis unveiled the heavy reliance of LLMs on a diverse array of datasets for training and fine-tuning. The findings underscore the challenge of data scarcity encountered by LLMs when tackling security tasks. The quality, diversity, and volume of data directly influence the performance and generalization capabilities of these models. Given their scale, LLMs typically necessitate substantial data volumes to capture nuanced distinctions, yet acquiring such data poses significant challenges. Many specific security tasks suffer from a dearth of high-quality and robust publicly available datasets. Relying on limited or biased datasets may result in models inheriting these biases, leading to skewed or inaccurate predictions. Furthermore, there is a concern regarding the risk of benchmark data contamination, where existing research may involve redundant filtering of native data, potentially resulting in overlap between training and testing datasets, thus inflating performance metrics [107]. Additionally, we raise serious apprehensions regarding the inclusion of personally private information, such as phone numbers and email addresses, in training corpora when LLMs are employed for information and content security tasks, which can precipitate privacy breaches during the prompting process [55].

数据稀缺问题。在第6节中,我们对118项研究中所采用的数据集和数据预处理程序进行了广泛检查。我们的分析揭示了大语言模型在训练和微调过程中对多样化数据集的严重依赖。研究结果强调了大语言模型在处理安全任务时面临的数据稀缺挑战。数据的质量、多样性和数量直接影响这些模型的性能和泛化能力。鉴于其规模,大语言模型通常需要大量数据来捕捉细微差别,然而获取这些数据面临着重大挑战。许多特定的安全任务缺乏高质量且可靠的公开数据集。依赖有限或有偏见的数据集可能导致模型继承这些偏见,从而产生偏差或不准确的预测。此外,还存在基准数据污染的风险,现有研究可能涉及对原始数据的重复过滤,可能导致训练和测试数据集之间的重叠,从而夸大性能指标 [107]。此外,我们对于在大语言模型用于信息和内容安全任务时,训练语料库中包含个人隐私信息(如电话号码和电子邮件地址)表示严重担忧,这会在提示过程中引发隐私泄露 [55]。

8.1.2 Challenges in LLM Generalization Ability. The generalization capability of LLMs pertains to their ability to consistently and accurately execute tasks across diverse tasks, datasets, or domains beyond their training environment. Although LLMs undergo extensive pre-training on large datasets to acquire broad knowledge, the absence of specialized expertise can present challenges when they encounter tasks beyond their pre-training scope, especially in the cyber security domain. As discussed in Section 3, we explored the utilization of LLMs in 21 security tasks spanning five security domains. We observed substantial variations in the context and semantics of code or documents across different domains and task specifications. To ensure LLMs demonstrate robust generalization, meticulous fine-tuning, validation, and continuous feedback loops on datasets from various security tasks are imperative. Without these measures, there is a risk of models overfitting to their training data, thus limiting their efficacy in diverse real-world scenarios.

8.1.2 大语言模型泛化能力的挑战。大语言模型的泛化能力是指其在训练环境之外的不同任务、数据集或领域中持续且准确地执行任务的能力。尽管大语言模型在大规模数据集上经过广泛预训练以获取广博知识,但当其遇到超出预训练范围的任务时(尤其是在网络安全领域),专业知识的缺失可能带来挑战。如第3节所述,我们考察了大语言模型在横跨五个安全领域的21项安全任务中的应用,并观察到不同领域和任务规范下代码或文档的上下文与语义存在显著差异。为确保大语言模型表现出稳健的泛化能力,必须在来自各类安全任务的数据集上进行细致的微调、验证和持续的反馈循环;否则,模型有过拟合训练数据的风险,从而限制其在多样化现实场景中的效用。

8.1.3 Challenges in LLM Interpretability, Trustworthiness, and Ethical Usage. Ensuring interpretability and trustworthiness is paramount when integrating LLMs into security tasks, particularly given the sensitive nature of security requirements and the need for rigorous scrutiny of model outputs. The challenge lies in comprehending how these models make decisions, as the black-box nature of LLMs often impedes explanations for why or how specific outputs or recommendations are generated for security needs. Recent research [163, 208] has underscored that artificial intelligence-generated content (AIGC) introduces additional security risks, including privacy breaches, dissemination of forged information, and the generation of vulnerable code. The absence of interpretability and trustworthiness can breed user uncertainty and reluctance, as stakeholders may hesitate to rely on LLMs for security tasks without a clear understanding of their decision-making process or adherence to security requirements. Establishing trust in LLMs necessitates the development of technologies and tools that offer deeper insights into model internals, empowering developers to comprehend the rationale behind generated outputs. Improving interpretability and trustworthiness can ultimately foster the widespread adoption of cost-effective automation in the cyber security domain, fostering more efficient and effective security practices. Many LLMs lack open-source availability, and questions persist regarding the data on which they were trained, as well as the quality, sources, and ownership of the training data, raising concerns about ownership regarding LLM-generated tasks. Moreover, there is the looming threat of various adversarial attacks, including tactics to guide LLMs to circumvent security measures and expose their original training data [44].

8.1.3 大语言模型可解释性、可信性与伦理使用的挑战。在将大语言模型集成到安全任务中时,确保可解释性和可信性至关重要,尤其考虑到安全需求的敏感性以及对模型输出进行严格审查的必要性。挑战在于理解这些模型如何做出决策:大语言模型的黑盒特性往往使人无法解释为何或如何针对安全需求生成特定的输出或建议。近期研究 [163, 208] 强调,人工智能生成内容 (AIGC) 带来了额外的安全风险,包括隐私泄露、伪造信息传播以及生成易受攻击的代码。可解释性和可信性的缺失会滋生用户的不确定感和迟疑:在不清楚模型决策过程或其是否符合安全要求的情况下,利益相关者可能不愿依赖大语言模型执行安全任务。建立对大语言模型的信任,需要开发能更深入洞察模型内部机制的技术和工具,使开发者能够理解生成输出背后的理由。提升可解释性和可信性最终可以促进低成本自动化在网络安全领域的广泛采用,推动更高效、更有效的安全实践。此外,许多大语言模型并未开源,其训练数据的质量、来源和所有权仍存疑问,这也引发了对大语言模型生成成果归属的担忧。同时,各类对抗攻击的威胁始终存在,包括诱导大语言模型绕过安全措施并暴露其原始训练数据的手段 [44]。

8.2 Opportunities

8.2 机会

8.2.1 Improvement of LLM4Security.

8.2.1 LLM4Security 的改进

Training models for security tasks. Deciding between commercially available pre-trained models like GPT-4 [150] and open-source frameworks such as T5 [171] or LLaMa [202] presents a nuanced array of choices for tailoring tasks to individual or organizational needs. The distinction between these approaches lies in the level of control and customization they offer. Pre-trained models like GPT-4 are generally not intended for extensive retraining but allow for quick adaptation to specific tasks with limited data, thus reducing computational overhead. Conversely, frameworks like T5 offer an open-source platform for broader customization. While they undergo pre-training, researchers often modify the source code and retrain these models on their own large-scale datasets to meet specific task requirements [78]. This process demands substantial computational resources, resulting in higher resource allocation and costs, but provides the advantage of creating highly specialized models tailored to specific domains. Therefore, the main trade-off lies between the user-friendly nature and rapid deployment offered by models like GPT-4 and the extensive task customization capabilities and increased computational demands associated with open-source frameworks like T5.

为安全任务训练模型。在现成的预训练模型(如 GPT-4 [150])和开源框架(如 T5 [171] 或 LLaMa [202])之间做出选择,为根据个人或组织需求定制任务提供了一系列细致入微的选项。这些方法的区别在于它们提供的控制和定制程度。像 GPT-4 这样的预训练模型通常不适用于大规模重新训练,但允许在有限数据的情况下快速适应特定任务,从而减少计算开销。相反,像 T5 这样的框架提供了一个开源平台,支持更广泛的定制。尽管它们经过了预训练,但研究人员通常会修改源代码,并在自己的大规模数据集上重新训练这些模型,以满足特定任务需求 [78]。这一过程需要大量的计算资源,导致更高的资源分配和成本,但提供了创建高度专业化模型的优势,这些模型可以针对特定领域进行定制。因此,主要的权衡在于像 GPT-4 这样的模型提供的用户友好性和快速部署能力,以及像 T5 这样的开源框架所提供的广泛任务定制能力和更高的计算需求之间。

Inter-model interaction of LLMs. Our examination indicates that LLMs have progressed significantly in tackling various security challenges. However, as security tasks become more complex, there is a need for more sophisticated and tailored solutions. As outlined in Section 5, one promising avenue is collaborative model interaction through external augmentation methods. This approach involves integrating multiple LLMs [228] or combining LLMs with specialized machine learning models [27, 197] to improve task efficiency while simplifying complex steps. By harnessing the strengths of different models collectively, we anticipate that LLMs can deliver more precise and higher-quality outcomes for intricate security tasks.

大语言模型间的模型交互。我们的研究表明,大语言模型在处理各种安全挑战方面取得了显著进展。然而,随着安全任务变得更加复杂,需要更复杂且定制的解决方案。如第5节所述,通过外部增强方法实现协作模型交互是一条有前景的途径。该方法涉及集成多个大语言模型 [228] 或将大语言模型与专门的机器学习模型相结合 [27, 197],以提高任务效率,同时简化复杂步骤。通过集体利用不同模型的优势,我们预计大语言模型能够在复杂安全任务中提供更精确和更高质量的结果。
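下面用一个极简的草图说明这种多模型协作思路:让一个(此处以桩函数模拟的)大语言模型与一个传统机器学习式的启发式打分器,对同一段代码分别给出"是否存在漏洞"的概率,再按权重合并得到最终判断。其中 `llm_judge`、`heuristic_score` 及其打分规则均为假设的示意实现,并非任何真实系统的接口。

```python
# 多模型协作判断代码是否可疑的极简示意(两个模型均为桩实现,仅演示合并逻辑)

def llm_judge(code: str) -> float:
    """桩:模拟大语言模型给出的漏洞概率(真实场景应调用模型 API)。"""
    risky_keywords = ("strcpy", "system(", "eval(", "gets(")
    return 0.9 if any(k in code for k in risky_keywords) else 0.2

def heuristic_score(code: str) -> float:
    """桩:模拟传统静态分析/机器学习模型的打分。"""
    return min(1.0, code.count(";") * 0.01 + (0.5 if "unsafe" in code else 0.0))

def ensemble_verdict(code: str, llm_weight: float = 0.7) -> bool:
    """按权重合并两个模型的概率,超过 0.5 视为可疑。"""
    p = llm_weight * llm_judge(code) + (1 - llm_weight) * heuristic_score(code)
    return p > 0.5

print(ensemble_verdict("strcpy(dst, src);"))  # 含危险函数 → True
print(ensemble_verdict("int x = 1;"))         # 普通代码 → False
```

真实系统中,权重与阈值通常需要在标注数据上校准;这里固定取值仅为演示合并机制。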

Impact and applications of ChatGPT. In recent academic research, ChatGPT has garnered considerable attention, appearing in over half of the 127 papers we analyzed. It has been utilized to tackle specific security tasks, highlighting its growing influence and acceptance in academia. Researchers have favored ChatGPT due to its computational efficiency, versatility across tasks, and potential cost-effectiveness compared to other LLMs and LLM-based applications [104]. Beyond generating task solutions, ChatGPT promotes collaboration, signaling a broader effort to integrate advanced natural language understanding into traditional cyber security practices [45, 165]. By closely examining these trends, we can anticipate pathways for LLMs and applications like ChatGPT to contribute to more robust, efficient, and collaborative cyber security solutions. These insights highlight the transformative potential of LLMs in shaping the future cyber security landscape.

ChatGPT 的影响与应用。在最近的学术研究中,ChatGPT 引起了广泛关注,在我们分析的 127 篇论文中,超过半数都提到了它。它被用于解决特定的安全任务,凸显了其在学术界日益增长的影响力和接受度。研究人员青睐 ChatGPT,是因为它在计算效率、任务多样性以及与其他大语言模型和基于大语言模型的应用相比潜在的性价比优势 [104]。除了生成任务解决方案外,ChatGPT 还促进了协作,表明将先进的自然语言理解整合到传统网络安全实践中的更广泛努力 [45, 165]。通过密切关注这些趋势,我们可以预见大语言模型及 ChatGPT 等应用为构建更强大、高效且协作的网络安全解决方案做出贡献的途径。这些见解凸显了大语言模型在塑造未来网络安全格局中的变革潜力。

8.2.2 Enhancing LLMs' Performance in Existing Security Tasks.

8.2.2 增强大语言模型在现有安全任务中的表现

External retrieval and tools for LLM. LLMs have demonstrated impressive performance across diverse security tasks, but they are not immune to inherent limitations, including a lack of domain expertise [95], a tendency to generate hallucinations [245], weak mathematical capabilities, and a lack of interpretability. Therefore, a feasible approach to enhancing their capabilities is to enable them to interact with the external world, acquiring knowledge in various forms and manners to improve the factualness and rationality of generated security task solutions. One viable solution is to provide external knowledge bases for LLMs, augmenting content generation with retrieval-based methods to retrieve task-relevant data for LLM outputs [54, 67]. Another approach is to incorporate external specialized tools to provide real-time interactive feedback to guide LLMs [12, 15], combining the results of specialized analytical tools to steer LLMs towards robust and consistent security task solutions. We believe that incorporating external retrieval and tools is a competitive choice for improving the performance of LLM4Security.

外部检索与工具对大语言模型(LLM)的辅助。大语言模型在多种安全任务中展现了卓越的性能,但它们并非没有固有的局限,包括缺乏领域专业知识[95]、容易产生幻觉[245]、数学能力较弱以及解释能力不足。因此,提升其能力的一种可行途径是使其能够与外部世界互动,以多种形式和方式获取知识,从而提高生成安全任务解决方案的事实性和合理性。一个可行的解决方案是为大语言模型提供外部知识库,通过基于检索的方法来丰富内容生成,检索与大语言模型输出相关的任务数据[54, 67]。另一种方法是整合外部的专业工具,提供实时互动反馈以指导大语言模型[12, 15],结合专业分析工具的结果,引导大语言模型走向稳健且一致的安全任务解决方案。我们认为,整合外部检索与工具是提升LLM4Security性能的一个有竞争力的选择。
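检索增强的思路可以用一个极简草图来说明:先从本地知识库中按词重叠度检索与查询最相关的条目,再把检索结果拼入提示词。下面的知识库条目、打分方式与提示模板均为假设示例,真实系统通常使用向量检索等更强的相似度度量。

```python
# 检索增强提示(RAG)的极简示意:词重叠检索 + 提示词拼装(均为示意实现)

def tokenize(text: str) -> set:
    return set(text.lower().split())

def retrieve(query: str, knowledge_base: list, k: int = 1) -> list:
    """按查询与条目的词重叠数降序排序,返回前 k 条。"""
    q = tokenize(query)
    scored = sorted(knowledge_base,
                    key=lambda doc: len(q & tokenize(doc)),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, knowledge_base: list) -> str:
    """把检索到的条目作为参考资料拼入提示词。"""
    context = "\n".join(retrieve(query, knowledge_base))
    return f"参考资料:\n{context}\n\n问题:{query}"

kb = [
    "CWE-89: SQL injection occurs when untrusted input is concatenated into SQL queries",
    "CWE-79: cross-site scripting arises from unescaped output in web pages",
]
print(build_prompt("how to detect SQL injection", kb))
```

这样,即使模型本身缺乏某条安全知识,也能依据提示中检索到的条目作答,从而缓解幻觉问题。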

Addressing challenges in specific domains. Numerous cyber security domains, such as network security and hardware security, encounter a dearth of open-source datasets, impeding the integration of LLMs into these specialized fields [194]. Future endeavors may prioritize the development of domain-specific datasets and the refinement of LLMs to address the distinctive challenges and nuances within these domains. Collaborating with domain experts and practitioners is crucial for gathering relevant data, and fine-tuning LLMs with this data can improve their effectiveness and alignment with each domain's specific requirements. This collaborative approach helps LLMs address real-world challenges across different cyber security domains [26].

解决特定领域的挑战。许多网络安全领域,如网络安全和硬件安全,面临开源数据集匮乏的问题,阻碍了大语言模型在这些专业领域的集成 [194]。未来的努力可能会优先开发特定领域的数据集,并优化大语言模型,以应对这些领域独特的挑战和细微差别。与领域专家和从业者合作对于收集相关数据至关重要,利用这些数据进行大语言模型的微调可以提高其有效性,并使其更好地符合每个领域的具体需求。这种协作方法有助于大语言模型应对不同网络安全领域的现实挑战 [26]。

8.2.3 Expanding LLMs' Capabilities in More Security Domains.

8.2.3 扩展大语言模型在更多安全领域的能力

Integrating new input formats. In our research, we noticed that LLMs in security tasks typically use input formats from code-based and text-based datasets. The introduction of new input formats based on natural language, like voice and images, as well as multimodal inputs such as video demonstrations, presents an opportunity to enhance LLMs' ability to understand and process various user needs [233]. Integrating speech can improve user-model interaction, allowing for more natural and context-rich communication. Images can visually represent security task processes and requirements, providing LLMs with additional perspectives. Moreover, multimodal inputs combining text, audio, and visuals can offer a more comprehensive contextual understanding, leading to more accurate and contextually relevant security solutions.

整合新的输入格式。在我们的研究中,我们注意到在安全任务中,大语言模型通常使用基于代码和文本数据集的输入格式。引入基于自然语言的新输入格式,如语音和图像,以及多模态输入(如视频演示),为增强大语言模型理解和处理各种用户需求的能力提供了机会 [233]。整合语音可以改善用户与模型的交互,实现更自然且上下文丰富的沟通。图像可以直观地表示安全任务的流程和需求,为大语言模型提供额外的视角。此外,结合文本、音频和视觉的多模态输入可以提供更全面的上下文理解,从而生成更准确且与上下文相关的安全解决方案。

Expanding LLM applications. We noticed that LLMs have received significant attention in the domain of software and system security. This domain undoubtedly benefits from the text and code parsing capabilities of LLMs, leading to tasks such as vulnerability detection, program fuzzing, and others. Currently, the applications of LLMs in domains such as hardware security and blockchain security remain relatively limited, and specific security tasks in certain domains have not yet been explored by researchers using LLMs. This presents an important opportunity: by extending the use of LLMs to these underdeveloped domains, we can potentially drive the development of automated security solutions.

扩展大语言模型的应用。我们注意到,大语言模型在软件和系统安全领域受到了广泛关注。该领域无疑受益于大语言模型的文本和代码解析能力,从而推动了诸如漏洞检测、程序模糊测试等任务的发展。目前,大语言模型在硬件安全和区块链安全等领域的应用仍然相对有限,某些领域的特定安全任务尚未被研究人员利用大语言模型进行探索。这为我们提供了一个重要机遇:通过将大语言模型的应用扩展到这些尚未充分开发的领域,我们有望推动自动化安全解决方案的发展。

8.3 Roadmap

8.3 路线图

We present a roadmap for future progress in utilizing Large Language Models for Security (LLM4Security), while also acknowledging the reciprocal relationship and growing exploration of Security for Large Language Models (Security4LLM) from a high-level perspective.

我们提出了一个利用大语言模型实现安全 (LLM4Security) 的未来发展路线图,同时也从高层次视角审视面向大语言模型自身的安全 (Security4LLM) 这一与之互为表里、且日益受到关注的研究方向。

Automating cyber security solutions. The quest for security automation encompasses the automated analysis of specific security scenario samples, multi-scenario security situational awareness, system security optimization, and the development of intelligent, tailored support for security operatives, which possesses context awareness and adaptability to individual needs. Leveraging the generative prowess of LLMs can aid security operatives in comprehending requirements better and crafting cost-effective security solutions, thus expediting security response times. Utilizing the natural language processing capabilities of LLMs to build security-aware tools enables more intuitive and responsive interactions with security operatives. Moreover, assisting security operatives in fine-tuning LLMs for specific security tasks can augment their precision and efficiency, tailoring automated workflows to cater to the distinct demands of diverse projects and personnel.

自动化网络安全解决方案。安全自动化的追求包括自动化分析特定安全场景样本、多场景安全态势感知、系统安全优化以及为安全操作人员开发具有情境感知和个性化需求适应能力的智能定制支持。利用大语言模型的生成能力,可以帮助安全操作人员更好地理解需求并制定具有成本效益的安全解决方案,从而加快安全响应时间。利用大语言模型的自然语言处理能力构建安全感知工具,可以使与安全操作人员的互动更加直观和响应迅速。此外,协助安全操作人员微调大语言模型以执行特定安全任务,可以提高其精确性和效率,定制自动化工作流程以满足不同项目和人员的独特需求。

Incorporating security knowledge into LLMs. A key direction for the future is to integrate specialized security task solutions and knowledge from the cyber security domain into LLMs to overcome potential hallucinations and errors [3, 117]. This integration aims to enhance LLMs’ ability to address security tasks, especially those requiring a significant amount of domain expertise, such as penetration testing [45, 198], hardware vulnerability detection [49], log analysis [97, 119], and more. Embedding rules and best practices from specific security domains into these models will better represent task requirements, enabling LLMs to generate robust and consistent security task solutions.

将安全知识融入大语言模型。未来的一个关键方向是将网络安全领域的专业安全任务解决方案和知识整合到大语言模型中,以克服潜在的幻觉和错误 [3, 117]。这种整合旨在增强大语言模型处理安全任务的能力,特别是那些需要大量领域专业知识的任务,例如渗透测试 [45, 198]、硬件漏洞检测 [49]、日志分析 [97, 119] 等。将特定安全领域的规则和最佳实践嵌入这些模型,将更好地代表任务需求,使大语言模型能够生成稳健且一致的安全任务解决方案。
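在不重新训练模型的前提下,把领域规则嵌入提示词是一种轻量的做法。下面的草图把几条渗透测试领域的最佳实践以系统指令形式拼入提示,约束模型输出;规则内容与提示模板均为假设示例,仅演示这一机制。

```python
# 将安全领域规则嵌入系统提示词的极简示意(规则与模板均为假设示例)

SECURITY_RULES = [
    "仅在获得书面授权的目标范围内开展测试",
    "优先报告可验证的漏洞,并附复现步骤",
    "不要在输出中包含真实的敏感凭据",
]

def build_system_prompt(task: str) -> str:
    """把领域规则编号后拼入系统提示,再附上具体任务。"""
    rules = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(SECURITY_RULES))
    return f"你是渗透测试助手,必须遵守以下规则:\n{rules}\n\n任务:{task}"

print(build_system_prompt("评估 Web 登录接口的注入风险"))
```

更深入的做法是将这类规则与最佳实践纳入微调语料,使其内化为模型能力,而非仅依赖提示约束。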

Security agent: integrating external augmentation and LLMs. We have witnessed the unprecedented potential of applying LLMs to solve security tasks, almost overturning traditional security task solutions in LLM4Security [36, 115, 220]. However, the inherent lack of domain-specific knowledge and hallucinations in LLMs restrict their ability to perceive task requirements or environments with high quality [245]. AI agents are artificial entities that perceive the environment, make decisions, and take actions. Currently, they are considered the most promising tool in the pursuit of achieving or surpassing human-level intelligence in specific domains [219]. We summarized the external enhancement techniques introduced in LLM4Security in Section 5, optimizing LLMs' performance in security tasks across multiple dimensions, including input, model, and output [36, 92, 162]. Security operators can specify external enhancement strategies for specific security tasks and integrate them with LLMs to build automated security AI agents that interact continuously with the system.

安全智能体:整合外部增强与大语言模型。我们见证了将大语言模型应用于解决安全任务的空前潜力,几乎颠覆了传统安全任务的解决方案 [36, 115, 220]。然而,大语言模型内在的领域知识缺乏和幻觉问题限制了其高质量感知任务需求或环境的能力 [245]。AI智能体是感知环境、做出决策并采取行动的人工实体。目前,它们被视为在特定领域实现或超越人类智能的最有前途的工具 [219]。我们在第5节中总结了 LLM4Security 中引入的外部增强技术,从输入、模型和输出等多个维度优化了大语言模型在安全任务中的表现 [36, 92, 162]。安全操作员可以为安全任务指定特定的外部增强策略,并将其与大语言模型整合,以实现系统内持续交互的自动化安全AI智能体。
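这类安全智能体的"感知-决策-行动"循环可以用下面的草图来示意:智能体根据当前状态选择已注册的外部工具并执行,把结果并入状态,直到信息足够为止。其中 `scan_ports`、`check_cve` 等工具与 `decide` 决策器都是假设的桩实现;真实系统中决策一步通常由大语言模型完成。

```python
# 安全 AI 智能体"感知-决策-行动"循环的极简示意(工具与决策器均为桩实现)

def scan_ports(target):
    """桩:模拟端口扫描工具的结果。"""
    return {"open_ports": [22, 80]}

def check_cve(target):
    """桩:模拟漏洞库查询工具的结果。"""
    return {"cves": ["CVE-2024-0001"]}

TOOLS = {"scan": scan_ports, "cve": check_cve}

def decide(state):
    """桩决策器:真实系统中这里由大语言模型根据当前状态选择下一步动作。"""
    if "open_ports" not in state:
        return "scan"
    if "cves" not in state:
        return "cve"
    return None  # 信息足够,结束循环

def run_agent(target, max_steps=5):
    state = {"target": target}
    for _ in range(max_steps):
        action = decide(state)
        if action is None:
            break
        state.update(TOOLS[action](target))
    return state

print(run_agent("10.0.0.1"))
```

`max_steps` 上限用于防止决策器陷入死循环,这是此类智能体框架中常见的保护措施。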

Multimodal LLMs for security. In LLM4Security, all research inputs are based on textual language (text or code). With the rise of multimodal generative LLMs represented by models like Sora [151], we believe that future research in LLM4Security can expand to include multimodal inputs and outputs such as video, audio, and images to enhance LLMs' understanding and processing of security tasks. For example, when using LLMs as penetration testing tools, relevant images such as topology diagrams of the current network environment and screenshots of the current steps can be introduced as inputs. In addition, audio inputs (such as recordings of specific security incidents or discussions) can provide further background information for understanding security task requirements.

多模态大语言模型在安全领域的应用。在 LLM4Security 中,所有研究的输入都基于文本语言(文本或代码)。随着以 Sora [151] 为代表的多模态生成式大语言模型的兴起,我们认为未来 LLM4Security 的研究可以扩展到视频、音频和图像等多模态输入与输出,以增强大语言模型对安全任务的理解和处理能力。例如,在将大语言模型用作渗透测试工具时,可以引入当前网络环境的拓扑图、当前步骤的截图等相关图像作为输入;此外,音频输入(如特定安全事件的录音或讨论)也能为理解安全任务需求提供进一步的背景信息。

Security for Large Language Models (Security4LLM). LLMs have gained considerable traction in the security sector, showcasing their potential in security-related endeavors. Nonetheless, delving into the internal security assessment of LLMs remains a pressing area for investigation [230]. The intricate nature of LLMs renders them vulnerable to attacks, necessitating innovative strategies to fortify the models themselves [44, 68, 126]. Previous studies have identified vulnerabilities in LLMs like jailbreaking and malicious prompt injection, resulting in the exposure of model training data or sensitive user chat records [48, 70, 126]. Considering that the inputs for security tasks often involve security-sensitive data (such as system logs and vulnerability code in programs) [158, 166], the leakage of such information would pose significant cyber security risks. An intriguing avenue for future research is to empower LLMs to autonomously detect and identify their vulnerabilities. Specifically, efforts could focus on enabling LLMs to generate patches for their underlying code, thus bolstering their inherent security, rather than solely implementing program restrictions at the user interaction layer. Given this scenario, future research should adopt a balanced approach, striving to utilize LLMs for automating cost-effective completion of security tasks while simultaneously developing techniques to safeguard the LLMs themselves. This dual focus is pivotal for fully harnessing the potential of LLMs in enhancing cyber security and ensuring compliance with cyber systems.

大语言模型的安全性 (Security4LLM)。大语言模型在安全领域已获得显著关注,展示了其在安全相关任务中的潜力。然而,深入探讨大语言模型的内部安全性评估仍是一个亟待研究的领域 [230]。大语言模型的复杂性使其容易受到攻击,因此需要创新的策略来加强模型本身的安全性 [44, 68, 126]。先前的研究已发现大语言模型中的漏洞,如越狱和恶意提示注入,导致模型训练数据或敏感用户聊天记录的泄露 [48, 70, 126]。考虑到安全任务的输入通常涉及安全敏感数据(如系统日志和程序中的漏洞代码)[158, 166],此类信息的泄露将带来重大的网络安全风险。未来研究的一个有趣方向是赋予大语言模型自主检测和识别其漏洞的能力。具体而言,可以努力使大语言模型能够为其底层代码生成补丁,从而增强其内在安全性,而不仅仅是在用户交互层实施程序限制。鉴于这种情况,未来研究应采取平衡的方法,努力利用大语言模型来自动化低成本地完成安全任务,同时开发技术以保护大语言模型本身。这种双重关注对于充分发挥大语言模型在增强网络安全和确保网络系统合规性方面的潜力至关重要。

9 CONCLUSION

9 结论

LLMs are making waves in the cyber security field, with their ability to tackle complex tasks potentially reshaping many cyber security practices and tools. In this comprehensive literature review, we delved into the emerging uses of LLMs in cyber security. We first explored the diverse array of security tasks where LLMs have been deployed, highlighting their practical impacts (RQ1). Our analysis covered the different LLMs employed in security tasks, discussing their unique traits and applications (RQ2). Additionally, we examined domain-specific techniques for applying LLMs to security tasks (RQ3). Lastly, we scrutinized the data collection and preprocessing procedures, underlining the importance of well-curated datasets in effectively applying LLMs to address security challenges (RQ4). We outlined key challenges facing LLM4Security and provided a roadmap for future research, outlining promising avenues for exploration.

大语言模型 (LLM) 正在网络安全领域掀起波澜,其处理复杂任务的能力可能重塑许多网络安全实践和工具。在这篇全面的文献综述中,我们深入探讨了大语言模型在网络安全中的新兴应用。我们首先探索了大语言模型部署的多种安全任务,强调了它们的实际影响 (RQ1)。我们的分析涵盖了应用于安全任务的不同大语言模型,讨论了它们的独特特性和应用 (RQ2)。此外,我们研究了将大语言模型应用于安全任务的领域特定技术 (RQ3)。最后,我们仔细审查了数据收集和预处理流程,强调了精心策划的数据集在有效应用大语言模型解决安全挑战中的重要性 (RQ4)。我们概述了 LLM4Security 面临的主要挑战,并为未来的研究提供了路线图,指出了有前景的探索方向。

REFERENCES

参考文献
