Safety at Scale: A Comprehensive Survey of Large Model Safety

大规模安全：大模型安全综合调查

Abstract—-The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (Al). These models are now foundational to a wide range of applications, including conversational Al, recommendation systems, autonomous driving, content generation, medical diagnostics, and scientific discovery. However, their widespread deployment also exposes them to significant safety risks, raising concerns about robustness, reliability, and ethical implications. This survey provides a systematic review of current safety research on large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-based Agents. Our contributions are summarized as follows: (1) We present a comprehensive taxonomy of safety threats to these models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. (2) We review defense strategies proposed for each type of attacks if available and summarize the commonly used datasets and benchmarks for safety research. (3) Building on this, we identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices. More importantly, we highlight the necessity of collective efforts from the research community and international collaboration. Our work can serve as a useful reference for researchers and practitioners, fostering the ongoing development of comprehensive defense systems and platforms to safeguard Al models. GitHub: https://github.com/xingjunm/Awesome-Large-Model-Safety.

摘要

Index Terms—Large Model Safety,Al Safety,Attacks and Defenses

索引术语—大模型安全、AI安全、攻击与防御

INTRODUCTION

引言

Rtificial Intelligence (Al) has entered the era of large models, Language Models (LLMs), Vision-Language Pre-Training (VLP) models, Vision-Language Models (VLMs), and image/video generation diffusion models (DMs). Through large-scale pre-training on massive datasets, these models have demonstrated unprecedented capabilities in tasks ranging from language understanding and image generation to complex problem-solving and decisionmaking. Their ability to understand and generate human-like content (e.g., texts, images, audios, and videos) has enabled applications in customer service, content creation, healthcare, education, and more, highlighting their transformative potential in both commercial and societal domains.

人工智能 (AI) 已进入大模型时代，包括大语言模型 (LLMs)、视觉语言预训练模型 (VLP)、视觉语言模型 (VLMs) 以及图像/视频生成扩散模型 (DMs)。通过对海量数据集进行大规模预训练，这些模型在语言理解、图像生成、复杂问题解决和决策制定等任务中展现了前所未有的能力。它们理解和生成类人内容（例如文本、图像、音频和视频）的能力，使得其在客户服务、内容创作、医疗保健、教育等领域得到应用，凸显了其在商业和社会领域的变革潜力。

However, the deployment of large models comes with significant challenges and risks. As these models become more integrated into critical applications, concerns regarding their vulner abilities to adversarial, jailbreak, and backdoor attacks, data privacy breaches, and the generation of harmful or misleading content have intensified. These issues pose substantial threats, including unintended system behaviors, privacy leakage, and the dissemination of harmful information. Ensuring the safety of these models is paramount to prevent such unintended consequences, maintain public trust, and promote responsible AI usage. The field of AI safety research has expanded in response to these challenges, encompassing a diverse array of attack methodologies, defense strategies, and evaluation benchmarks designed to identify and mitigate the vulnerabilities of large models. Given the rapid development of safety-related techniques for various large models, we aim to provide a comprehensive survey of these techniques, highlighting strengths, weaknesses, and gaps, while advancing research and fostering collaboration.

然而，大规模模型的部署带来了显著的挑战和风险。随着这些模型越来越多地集成到关键应用中，人们对其在对抗性攻击、越狱攻击和后门攻击中的脆弱性、数据隐私泄露以及生成有害或误导性内容的担忧日益加剧。这些问题构成了重大威胁，包括意外的系统行为、隐私泄露和有害信息的传播。确保这些模型的安全性至关重要，以防止此类意外后果、维护公众信任并促进负责任的AI使用。AI安全研究领域已针对这些挑战进行了扩展，涵盖了多样化的攻击方法、防御策略和评估基准，旨在识别和缓解大规模模型的脆弱性。鉴于各种大规模模型安全相关技术的快速发展，我们旨在对这些技术进行全面综述，突出其优势、劣势和差距，同时推动研究并促进合作。

Fig. 1: Left: The number of safety research papers published over the past four years. Middle: The distribution of research across different models. Right: The distribution of research across different types of attacks and defenses.

图 1: 左: 过去四年发表的安全研究论文数量。中: 不同模型的研究分布。右: 不同类型攻击和防御的研究分布。

Fig. 2: Left: The quarterly trend in the number of safety research papers published across different models; Middle: The proportional relationship between large models and their corresponding attacks and defenses; Right: The annual trend in the number of safety research papers published on various attacks and defenses, presented in descending order from highest to lowest.

图 2: 左: 不同模型发布的安全研究论文数量的季度趋势；中: 大模型与其对应的攻击和防御之间的比例关系；右: 各种攻击和防御发布的安全研究论文数量的年度趋势，按从高到低降序排列。

Given the broad scope of our survey, we have structured it with the following considerations to enhance clarity and organization:

鉴于我们调查的广泛范围，我们进行了以下结构设计以提高清晰度和组织性：

· Models. We focus on six widely studied model categories, including VFMs, LLMs, VLPs, VLMs, DMs, and Agents, and review the attack and defense methods for each separately. These models represent the most popular large models

模型。我们重点关注六种广泛研究的模型类别，包括 VFM、大语言模型、VLP、VLM、DM 和 AI智能体，并分别回顾每种模型的攻击和防御方法。这些模型代表了最流行的大型模型

across various domains.

跨领域

· Organization. For each model category, we classify the reviewed works into attacks and defenses, and identify 10 attack types: adversarial, backdoor, poisoning, jailbreak, prompt injection, energy-latency, membership inference, model extraction, data extraction, and agent attacks. When both backdoor and poisoning attacks are present for a model category, we combine them into a single backdoor & poisoning category due to their similarities. We review the corresponding defense strategies for each attack type immediately after the attacks.

· 组织结构。对于每个模型类别，我们将综述的文献分为攻击和防御，并识别出10种攻击类型：对抗性攻击、后门攻击、投毒攻击、越狱攻击、提示注入攻击、能量延迟攻击、成员推断攻击、模型提取攻击、数据提取攻击和智能体攻击。当某个模型类别同时存在后门攻击和投毒攻击时，由于它们的相似性，我们将它们合并为一个单独的后门和投毒类别。我们会在攻击之后立即回顾每种攻击类型对应的防御策略。

· Taxonomy. For each type of attack or defense, we use a twolevel taxonomy: Category $\rightarrow$ Subcategory. The Category differentiates attacks and defenses based on the threat model (e.g., white-box, gray-box, black-box) or specific subtasks (e.g., detection, purification, robust training/tuning, and robust inference). The Subcategory offers a more detailed classification based on their techniques.

· 分类法。对于每种类型的攻击或防御，我们使用两级分类法：类别 $\rightarrow$ 子类别。类别根据威胁模型（例如白盒、灰盒、黑盒）或特定子任务（例如检测、净化、鲁棒训练/调优和鲁棒推理）来区分攻击和防御。子类别则基于其技术提供更详细的分类。

Fig. 3: Organization of this survey.

图 3: 本调查的组织结构。

· Granularity. To ensure clarity, we simplify the introduction of each reviewed paper, highlighting only its key ideas, objectives, and approaches, while omitting technical details and experimental analyses.

· 粒度。为确保清晰，我们简化了对每篇论文的介绍，仅突出其关键思想、目标和方法，同时省略技术细节和实验分析。

TABLE 1: A summary of existing surveys.

表 1: 现有调查总结。

调查	年份	模型	主题
Zhanget al. [396]	2024	VFM&VLP&VLM	对抗性、越狱、提示注入
Truong et al. [397]	2024	DM	对抗性、后门、成员推断
Zhao et al. [398]	2024	大语言模型	后门
Yi et al. [399]	2024	大语言模型	越狱
Jin et al. [400]	2024	大语言模型&VLM	越狱
Liu et al. [401]	2024	VLM	越狱
Liu et al. [402]	2024	VLM	对抗性、后门、越狱、提示注入
Cui et al. [403]	2024	大语言模型智能体	对抗性、后门、越狱
Deng et al. [404]	2024	大语言模型智能体	对抗性、后门、越狱
Gan et al. [405]	2024	大语言模型&VLM智能体	对抗性、后门、越狱

Our survey methodology is structured as follows. First, we conducted a keyword-based search targeting specific model types and threat types to identify relevant papers. Next, we manually filtered out non-safety-related and non-technical papers. For each remaining paper, we categorized its proposed method or framework by analyzing its settings and attack/defense types, assigning them to appropriate categories and subcategories. In total, we collected 390 technical papers, with their distribution across years, model types, and attack/defense strategies illustrated in Figure 1. As shown, safety research on large models has surged significantly since 2023, following the release of ChatGPT. Among the model types, LLMs and DMs have garnered the most attention, accounting for over ${\bf60%}$ of the surveyed papers. Regarding attack types, jailbreak, adversarial, and backdoor attacks were the most extensively studied. On the defense side, jailbreak defenses received the highest focus, followed by adversarial defenses. Figure 2 presents a cross-view of temporal trends across model types and attack/defense categories, offering a detailed breakdown of the reviewed works. Notably, research on attacks constitutes ${\bf60%}$ of the studied. In terms of defense, while defense research accounts for only $40,%$ , underscoring a significant gap that warrants increased attention toward defense strategies. The overall structure of this survey is outlined in Figure 3.

我们的调查方法结构如下。首先，我们针对特定模型类型和威胁类型进行了基于关键词的搜索，以识别相关论文。接下来，我们手动筛选了与非安全相关和非技术性的论文。对于每篇剩余的论文，我们通过分析其设置和攻击/防御类型，将其提出的方法或框架分类，并将其分配到适当的类别和子类别中。总共，我们收集了390篇技术论文，其在年份、模型类型和攻击/防御策略上的分布如图1所示。如图所示，自2023年ChatGPT发布以来，大模型的安全研究显著激增。在模型类型中，大语言模型和扩散模型最受关注，占调查论文的${\bf60%}$以上。在攻击类型方面，越狱攻击、对抗攻击和后门攻击是研究最多的。在防御方面，越狱防御受到的关注最高，其次是对抗防御。图2展示了模型类型和攻击/防御类别的时间趋势交叉视图，提供了对所审查工作的详细分解。值得注意的是，攻击研究占所研究的${\bf60%}$。在防御方面，防御研究仅占$40,%$，这突显了一个显著的差距，需要加大对防御策略的关注。本调查的整体结构如图3所示。

Difference to Existing Surveys. Large-model safety is a rapidly evolving field, and several surveys have been conducted to advance research in this area. Recently, Slattery et al. [406] introduced an AI risk framework with a systematic taxonomy covering all types of risks. In contrast, our focus is on the technical aspects, specifically the attack and defense techniques proposed in the literature. Table 1 lists the technical surveys we identified, each focusing on a specific type of model or threat (e.g., LLMs, VLMs, or jailbreak attacks/defenses). Compared with these works, our survey offers both a broader scope—covering a wider range of model types and threats—and a higher level perspective, centering on over arching methodologies rather than minute technical details.

与现有综述的差异。大模型安全是一个快速发展的领域，已有几篇综述推动了该领域的研究进展。最近，Slattery 等人 [406] 提出了一个涵盖所有类型风险的系统性分类的人工智能风险框架。相比之下，我们的重点是技术方面，特别是文献中提出的攻击和防御技术。表 1 列出了我们确定的技术综述，每篇综述都侧重于特定类型的模型或威胁（例如，大语言模型、视觉语言模型或越狱攻击/防御）。与这些工作相比，我们的综述既提供了更广泛的范围——涵盖更多类型的模型和威胁，又提供了更高的视角，集中于总体方法论而非细微的技术细节。

2 VISION FOUNDATION MODEL SAFETY

2 视觉基础模型安全

This section surveys safety research on two types of VFMs: per-trained Vision Transformers (ViTs）[407] and the Segment Anything Model (SAM）[408]. We focus on ViTs and SAM because they are among the most widely deployed VFMs and have garnered significant attention in recent safety research.

本节调查了两类视觉基础模型 (VFMs) 的安全研究：预训练的 Vision Transformers (ViTs) [407] 和 Segment Anything Model (SAM) [408]。我们重点关注 ViTs 和 SAM，因为它们是目前部署最广泛的 VFMs，并且在最近的安全研究中引起了广泛关注。

2.1Attacks and Defenses for ViTs

2.1 ViT的攻击与防御

Pre-trained ViTs are widely employed as backbones for various downstream tasks, frequently achieving state-of-the-art performance through efficient adaptation and fine-tuning. Unlike traditional CNNs, ViTs process images as sequences of tokenized patches, allowing them to better capture spatial dependencies. However, this patch-based mechanism also brings unique safety concerns and robustness challenges. This section explores these issues by reviewing ViT-related safety research, including adversarial attacks, backdoor & poisoning attacks, and their corresponding defense strategies. Table 2 provides a summary of the surveyed attacks and defenses, along with the commonly used datasets.

预训练的Vision Transformer (ViT) 被广泛用作各种下游任务的骨干网络，通过高效的适应和微调，经常能够达到最先进的性能。与传统的卷积神经网络 (CNN) 不同，ViT 将图像处理为分块序列的Token，从而能够更好地捕捉空间依赖性。然而，这种基于分块的机制也带来了独特的安全问题和鲁棒性挑战。本节通过回顾与 ViT 相关的安全研究，包括对抗攻击、后门与中毒攻击及其相应的防御策略，探讨这些问题。表 2 总结了所调查的攻击和防御方法，以及常用的数据集。

2.1.1 Adversarial Attacks

2.1.1 对抗攻击

Adversarial attacks on ViTs can be classified into white-box attacks and black-box attacks based on whether the attacker has full access to the victim model. Based on the attack strategy, white-box attacks can be further divided into 1) patch attacks, 2) position embedding attacks and 3) attention attacks, while black-box attacks can be summarized into 1） transfer-based attacks and 2) query-based attacks.

针对 ViTs 的对抗攻击根据攻击者是否完全访问受害者模型可分为白盒攻击和黑盒攻击。根据攻击策略，白盒攻击可进一步分为 1) 补丁攻击 (patch attacks)、2) 位置嵌入攻击 (position embedding attacks) 和 3) 注意力攻击 (attention attacks)，而黑盒攻击则可概括为 1) 基于迁移的攻击 (transfer-based attacks) 和 2) 基于查询的攻击 (query-based attacks)。

2.1.1.1 White-box Attacks

2.1.1.1 白盒攻击

Patch Attacks exploit the modular structure of ViTs, aiming to manipulate their inference processes by introducing targeted perturbations in specific patches of the input data. Joshi et al. [17] proposed an adversarial token attack method leveraging block sparsity to assess the vulnerability of ViTs to token-level perturbations. Expanding on this, Patch-Fool [1] introduces an adversarial attack framework that targets the self-attention modules by perturbing individual image patches, thereby manipulating attention scores. Different from existing methods, SlowFormer [2] introduces a universal adversarial patch can be applied to any image to increases computational and energy costs while preserving model accuracy.

补丁攻击利用ViTs的模块化结构，通过在输入数据的特定补丁中引入有针对性的扰动来操纵其推理过程。Joshi等人[17]提出了一种利用块稀疏性的对抗性Token攻击方法，以评估ViTs对Token级扰动的脆弱性。在此基础上，Patch-Fool[1]引入了一个对抗性攻击框架，通过扰动单个图像补丁来针对自注意力模块，从而操纵注意力分数。与现有方法不同，SlowFormer[2]引入了一种通用对抗性补丁，可以应用于任何图像，在保持模型准确性的同时增加计算和能源成本。

Position Embedding Attacks aim to attack the spatial or sequential position of tokens in transformers. For example, PEAttack [3] explores the common vulnerability of positional embeddings to adversarial perturbations by disrupting their ability to encode positional information through periodicity manipulation, linearity distortion, and optimized embedding distortion.

位置嵌入攻击旨在攻击 Transformer 中 Token 的空间或序列位置。例如，PEAttack [3] 通过周期性操作、线性扭曲和优化的嵌入扭曲来破坏位置嵌入编码位置信息的能力，从而探索位置嵌入对对抗性扰动的常见漏洞。

Attention Attacks target vulnerabilities in the self-attention modules of ViTs. Attention-Fool [4] manipulates dot-product similarities to redirect queries to adversarial key tokens, exposing the model's sensitivity to adversarial patches. Similarly, AAS [5] mitigates gradient masking in ViTs by optimizing the pre-softmax output scaling factors, enhancing the effectiveness of attacks.

注意力攻击针对 ViTs 中自注意力模块的漏洞。Attention-Fool [4] 通过操纵点积相似性，将查询重定向到对抗性关键 Token，暴露了模型对对抗性补丁的敏感性。类似地，AAS [5] 通过优化预 softmax 输出缩放因子，缓解了 ViTs 中的梯度掩码，增强了攻击的有效性。

2.1.1.2 Black-box Attacks

2.1.1.2 黑盒攻击

Transfer-based Attacks first generate adversarial examples using fully accessible surrogate models, which are then transferred to attack black-box victim ViTs.In this context, we first review attacks specifically designed for the ViT architecture. SE-TR [6] enhances adversarial transfer ability by optimizing perturbations on an ensemble of models. ATA [7] strategically activates uncertain attention and perturbs sensitive embeddings within ViTs. LPM [9] mitigates the over fitting to model-specific disc rim i native regions through a patch-wise optimized binary mask. Chen et al. [18] introduced an Inductive Bias Attack (IBA) to suppress unique biases in ViTs and target shared inductive biases. TGR [11] reduces the variance of the back propagated gradient within internal blocks. VDC [12] employs virtual dense connections between deeper attention maps and MLP blocks to facilitate gradient back propagation. FDAP [13] exploits feature collapse by reducing high-frequency components in feature space. CRFA [16] disrupts only the most crucial image regions using approximate attention maps. SASD-WS [15] flattens the loss landscape of the source model through sharpness-aware self-distillation and approximates an ensemble of pruned models using weight scaling to improve target adversarial transfer ability.

基于迁移的攻击首先使用完全可访问的代理模型生成对抗样本，然后将其迁移到攻击黑盒受害者 ViT。在此背景下，我们首先回顾了专门为 ViT 架构设计的攻击。SE-TR [6] 通过在模型集合上优化扰动来增强对抗迁移能力。ATA [7] 策略性地激活不确定的注意力并扰动 ViT 内的敏感嵌入。LPM [9] 通过逐块优化的二值掩码减轻了对模型特定判别区域的过拟合。Chen 等人 [18] 引入了归纳偏置攻击 (IBA) 以抑制 ViT 中的独特偏置并针对共享的归纳偏置。TGR [11] 减少了内部块中反向传播梯度的方差。VDC [12] 在更深层的注意力图和 MLP 块之间使用虚拟密集连接以促进梯度反向传播。FDAP [13] 通过减少特征空间中的高频分量来利用特征崩溃。CRFA [16] 仅使用近似的注意力图破坏最关键图像区域。SASD-WS [15] 通过锐度感知自蒸馏使源模型的损失景观平坦化，并使用权重缩放近似剪枝模型的集合以提高目标对抗迁移能力。

Other strategies are applicable to both ViTs and CNNs, ensuring broader applicability in black-box settings. Wei et al. [8], [19] proposed a dual attack framework to improve transfer ability between ViTs and CNNs: 1) a Pay No Attention (PNA) attack, which skips the gradients of attention during back propagation, and 2) a PatchOut attack, which randomly perturbs subsets of image patches at each iteration. MIG [10] uses integrated gradients and momentum-based updates to precisely target modelagnostic critical regions, improving transfer ability between ViTs

其他策略适用于ViTs和CNNs，确保在黑盒设置中的更广泛适用性。Wei等[8], [19]提出了一个双重攻击框架，以提高ViTs和CNNs之间的转移能力：1) Pay No Attention (PNA)攻击，在反向传播期间跳过注意力梯度，2) PatchOut攻击，每次迭代随机扰动图像补丁的子集。MIG [10]使用集成梯度和基于动量的更新来精确定位模型无关的关键区域，提高ViTs之间的转移能力

and CNNs.

和 CNN。

Query-based Attacks generate adversarial examples by querying the black-box model and levering the model responses to estimate the adversarial gradients. The goal is to achieve successful attack with a minimal number of queries. Based on the type of model response, query-based attacks can be further divided into score-based attacks, where the model returns a probability vector, and decision-based attacks, where the model provides only the top-k classes. Decision-based attacks typically start from a large random noise (to achieve mis classification first) and then gradually find smaller noise while maintaining mis classification. To improve the efficiency of the adversarial noise searching process in ViTs, PAR [14] introduces a coarse-to-fine patch searching method, guided by noise magnitude and sensitivity masks to account for the structural characteristics of ViTs and mitigate the negative impact of non-overlapping patches.

基于查询的攻击通过查询黑盒模型并利用模型响应来估计对抗梯度，从而生成对抗样本。其目标是以最少的查询次数实现成功的攻击。根据模型响应的类型，基于查询的攻击可以进一步分为基于分数的攻击（模型返回概率向量）和基于决策的攻击（模型仅提供前 k 个类别）。基于决策的攻击通常从较大的随机噪声开始（先实现错误分类），然后逐步找到较小的噪声，同时保持错误分类。为了提高 ViTs 中对抗噪声搜索过程的效率，PAR [14] 引入了一种由粗到细的 patch 搜索方法，通过噪声幅度和敏感度掩码来指导，以考虑 ViTs 的结构特征并减轻非重叠 patch 的负面影响。

2.1.2 Adversarial Defenses

2.1.2 对抗防御

Adversarial defenses for ViTs follow four major approaches: 1) adversarial training, which trains ViTs on adversarial examples via min-max optimization to improve its robustness; 2) adversarial detection, which identifies and mitigates adversarial attacks by detecting abnormal or malicious patterns in the inputs; 3) robust architecture, which modifies and optimizes the architecture (e.g., self-attention module) of ViTs to improve their resilience against adversarial attacks; and 4) adversarial purification, which preprocesses the input (e.g., noise injection, denoising, or other transformations) to remove potential adversarial perturbations before inference.

针对ViTs的对抗防御主要遵循四种方法：1) 对抗训练，通过最小-最大优化在对抗样本上训练ViTs以提高其鲁棒性；2) 对抗检测，通过检测输入中的异常或恶意模式来识别和缓解对抗攻击；3) 鲁棒架构，修改和优化ViTs的架构（例如自注意力模块）以提高其对抗攻击的抵抗力；4) 对抗净化，在推理前预处理输入（例如噪声注入、去噪或其他变换）以去除潜在的对抗扰动。

Adversarial Training is widely regarded as the most effective approach to adversarial defense; however, it comes with a high computational cost. To address this on ViTs, AGAT [20] introduces a dynamic attention-guided dropping strategy, which accelerates the training process by selectively removing certain patch embeddings at each layer. This reduces computational overhead while maintaining robustness, especially on large datasets such as ImageNet. Due to its high computational cost, research on adversarial training for ViTs has been relatively limited. ARDPRM [28] improves adversarial robustness by randomly dropping gradients in attention blocks and masking patch perturbations during training.

对抗训练被广泛认为是最有效的对抗防御方法；然而，它伴随着高昂的计算成本。为了解决 ViT 上的这一问题，AGAT [20] 引入了一种动态注意力引导的丢弃策略，通过在每个层有选择性地移除某些 patch embedding 来加速训练过程。这降低了计算开销，同时保持了鲁棒性，尤其是在 ImageNet 等大型数据集上。由于其高昂的计算成本，针对 ViT 的对抗训练研究相对有限。ARDPRM [28] 通过在训练期间随机丢弃注意力块中的梯度并掩盖 patch 扰动，提高了对抗鲁棒性。

Adversarial Detection methods for ViTs primarily leverage two key features, i.e., patch-based inference and activation characteristics, to detect and mitigate adversarial examples. Li et al. [21] proposed the concept of Patch Vestiges, abnormalities arising from adversarial examples during patch division in ViTs. They used statistical metrics on step changes between adjacent pixels across patches and developed a binary regression classifier to detect adversaries. Alternatively, ARMOR [23] identifies adversarial patches by scanning for unusually high column scores in specific layers and masking them with average images to reduce their impact. ViTGuard [22], on the other hand, employs a masked auto encoder to detect patch attacks by analyzing attention maps and CLS token representations. As more attacks are developed, there is a growing need for a unified detection framework capable of handling all types of adversarial examples.

针对 ViTs 的对抗检测方法主要利用两个关键特征，即基于 patch 的推理和激活特性，来检测和缓解对抗样本。Li 等人 [21] 提出了 Patch Vestiges 的概念，即在 ViTs 的 patch 划分过程中由对抗样本引起的异常现象。他们对跨 patch 的相邻像素之间的阶跃变化使用了统计指标，并开发了一个二元回归分类器来检测对抗样本。另一方面，ARMOR [23] 通过在特定层中扫描异常高的列分数来识别对抗 patch，并用平均图像对其进行掩码处理以减少其影响。而 ViTGuard [22] 则使用掩码自动编码器通过分析注意力图和 CLS token 表示来检测 patch 攻击。随着更多攻击的出现，越来越需要一个能够处理所有类型对抗样本的统一检测框架。

Robust Architecture methods focus on designing more adversa ri ally resilient attention modules for ViTs. For example, Smoothed Attention [27] employs temperature scaling in the softmax function to prevent any single patch from dominating the attention, thereby balancing focus across patches. ReiT [32] integrates adversarial training with random iz ation through the IIReSA module, optimizing randomly entangled tokens to reduce adversarial similarity and enhance robustness. TAP [29] addresses token over focusing by implementing token-aware average pooling and an attention diversification loss, which incorporate local neighborhood information and reduce cosine similarity among attention vectors. FViTs [31] strengthen explanation faithfulness by stabilizing top $\mathbf{\nabla}\cdot\mathbf{k}$ indices in self-attention and robustify predictions using denoised diffusion smoothing combined with Gaussian noise. RSPC [30] tackles vulnerabilities by corrupting the most sensitive patches and aligning intermediate features between clean and corrupted inputs to stabilize the attention mechanism. Collectively, these advancements underscore the pivotal role of the attention mechanism in improving the adversarial robustness of ViTs.

鲁棒架构方法专注于为视觉Transformer (ViT) 设计更具对抗性弹性的注意力模块。例如，[27] 在softmax函数中使用温度缩放来防止任何单个patch主导注意力，从而平衡各个patch的注意力。[32] 通过IIReSA模块将对抗训练与随机化结合，优化随机纠缠的Token以减少对抗相似性并增强鲁棒性。[29] 通过实现Token感知的平均池化和注意力多样化损失来解决Token过度聚焦问题，该方法结合了局部邻域信息并减少注意力向量之间的余弦相似性。[31] 通过稳定自注意力中的top $\mathbf{\nabla}\cdot\mathbf{k}$ 索引，并结合高斯噪声的去噪扩散平滑来增强解释的忠实性和预测的鲁棒性。[30] 通过破坏最敏感的patch并在干净和破坏的输入之间对齐中间特征来稳定注意力机制，从而解决脆弱性问题。这些进展共同强调了注意力机制在提升ViT对抗鲁棒性中的关键作用。

Adversarial Purification refers to a model-agnostic inputprocessing technique that is broadly applicable across various architectures, including but not limited to ViTs. DiffPure [33] introduces a framework where adversarial images undergo noise injection via a forward stochastic differential equation (SDE) process, followed by denoising with a pre-trained diffusion model. CGDMP [24] refines this approach by optimizing the noise level for the forward process and employing contrastive loss gradients to guide the denoising process, achieving improved purification tailored to ViTs. ADBM [25] highlights the disparity between diffused adversarial and clean examples, proposing a method to directly connect the clean and diffused adversarial distributions. While these methods focus on ViTs, other approaches demonstrate broader applicability to various vision models, e.g., CNNs. Purify $^{++}$ [34] enhances DiffPure with improved diffusion models, DifFilter [35] extends noise scales to better preserve semantics, and Mimic Diffusion [36] mitigates adversarial impacts during the reverse diffusion process. For improved efficiency, OsCP [26] and LightPure [37] propose single-step and real-time purification methods, respectively. LoRID [38] introduces a Markov-based approach for robust purification. These methods complement ViTrelated research and highlight diverse advancements in adversarial purification.

对抗净化 (Adversarial Purification) 指的是一种与模型无关的输入处理技术，广泛适用于各种架构，包括但不限于 ViT (Vision Transformer)。DiffPure [33] 提出了一个框架，其中对抗图像通过前向随机微分方程 (SDE) 过程进行噪声注入，然后使用预训练的扩散模型进行去噪。CGDMP [24] 通过优化前向过程的噪声水平，并采用对比损失梯度来指导去噪过程，改进了该方法，实现了针对 ViT 的优化净化。ADBM [25] 强调了扩散对抗样本与干净样本之间的差异，提出了一种直接连接干净样本和扩散对抗样本分布的方法。虽然这些方法主要关注 ViT，但其他方法展示了其对各种视觉模型（例如 CNN）的广泛适用性。Purify$^{++}$ [34] 通过改进的扩散模型增强了 DiffPure，DifFilter [35] 扩展了噪声尺度以更好地保留语义，Mimic Diffusion [36] 在反向扩散过程中减轻了对抗影响。为了提高效率，OsCP [26] 和 LightPure [37] 分别提出了单步和实时净化方法。LoRID [38] 引入了一种基于马尔可夫的方法来进行鲁棒净化。这些方法补充了 ViT 相关研究，并突出了对抗净化领域的多样化进展。

2.1.3Backdoor Attacks

2.1.3 后门攻击

Backdoors can be injected into the victim model via data poisoning, training manipulation, or parameter editing, with most existing attacks on ViTs being data poisoning-based. We classify these attacks into four categories: 1） patch-level attacks, 2) token-level attacks, and 3) multi-trigger attacks, which exploit ViT-specific data processing characteristics, as well as 4) datafree attacks, which exploit the inherent mechanisms of ViTs.

后门可以通过数据投毒、训练操纵或参数编辑注入到受害模型中，现有对ViT的攻击大多基于数据投毒。我们将这些攻击分为四类：1）基于补丁（patch-level）的攻击，2）基于token（token-level）的攻击，3）多触发器（multi-trigger）攻击，这些攻击利用了ViT特定的数据处理特性，以及4）无数据（datafree）攻击，这些攻击利用了ViT的固有机制。

Patch-level Attacks primarily exploit the ViT's characteristic of processing images as discrete patches by implanting triggers at the patch level. For example, BadViT [39] introduces a universal patch-wise trigger that requires only a small amount of data to redirect the model's focus from classification-relevant patches to adversarial triggers. TrojViT [40] improves this approach by utilizing patch salience ranking, an attention-targeted loss function, and parameter distillation to minimize the bit flips necessary to embed the backdoor.

Patch-level Attacks 主要利用 ViT 将图像处理为离散 patch 的特性，在 patch 级别植入触发器。例如，BadViT [39] 引入了一种通用的 patch-wise 触发器，只需少量数据即可将模型的注意力从分类相关的 patch 转移到对抗触发器上。TrojViT [40] 通过利用 patch 显著性排名、注意力目标损失函数和参数蒸馏来改进这种方法，以最小化嵌入后门所需的比特翻转。

Token-level Attacks target the token iz ation layer of ViTs. SWARM [41] introduces a switchable backdoor mechanism featuring a “switch token" that dynamically toggles between benign and adversarial behaviors, ensuring high attack success rates while maintaining functionality in clean environments.

Token级别攻击针对ViTs的Token化层。SWARM [41] 引入了一种可切换的后门机制，该机制包含一个“切换Token”，能够在良性行为和对抗行为之间动态切换，确保在干净环境中保持功能的同时实现高攻击成功率。

Multi-trigger Attacks employ multiple backdoor triggers in parallel, sequential, or hybrid configurations to poison the victim dataset. MTBAs [43] utilize these multiple triggers to induce coexistence, overwriting, and cross-activation effects, significantly diminishing the effectiveness of existing defense mechanisms.

多触发攻击采用并行、顺序或混合配置的多个后门触发器来污染受害者数据集。MTBAs [43] 利用这些多个触发器引发共存、覆盖和交叉激活效应，显著降低了现有防御机制的有效性。

Data-free Attacks eliminate the need for original training datasets. Using substitute datasets, DBIA [42] generates universal triggers that maximize attention within ViTs. These triggers are fine-tuned with minimal parameter adjustments using PGD [409], enabling efficient and resource-light backdoor injection.

无需数据攻击消除对原始训练数据集的需求。DBIA [42] 使用替代数据集生成能在 ViT 中最大化注意力的通用触发器。这些触发器通过 PGD [409] 进行微调，仅需最少的参数调整，从而实现高效且资源消耗低的后门注入。

2.1.4Backdoor Defenses

2.1.4 后门防御

Backdoor defenses for ViTs aim to identify and break (or remove) the correlation between trigger patterns and target classes while preserving model accuracy. Two representative defense strategies are: 1) patch processing, which disrupts the integrity of image patches to prevent trigger activation, and 2) image blocking, which leverages interpret ability-based mechanisms to mask and neutralize the effects of backdoor triggers.

针对ViTs的后门防御旨在识别并打破（或移除）触发模式与目标类别之间的关联，同时保持模型准确性。两种代表性的防御策略包括：1) 图像块处理，通过破坏图像块的完整性来防止触发激活；2) 图像屏蔽，利用基于可解释性的机制来掩盖并中和后门触发器的影响。

Patch Processing strategy disrupts the integrity of patches to neutralize triggers. Doan et al. [44] found that clean-data accuracy and attack success rates of ViTs respond differently to patch transformations before positional encoding, and proposed an effective defense method by randomly dropping or shufling patches of an image to counter both patch-based and blendingbased backdoor attacks. Image Blocking utilizes interpret a bility to identify and neutralize triggers. Subramanya et al. [46] showed that ViTs can localize backdoor triggers using attention maps and proposed a defense mechanism that dynamically masks potential trigger regions during inference. In a subsequent work, Subramanya et al. [45] proposed to integrate trigger neutralization into the training phase to improve the robustness of ViTs to backdoor attacks. While these two methods are promising, the field requires a holistic defense framework that integrates nonViT defenses with ViT-specific characteristics and unifies multiple defense tasks including backdoor detection, trigger inversion, and backdoor removal, as attempted in [410].

补丁处理策略通过破坏补丁的完整性来中和触发器。Doan 等人 [44] 发现，在位置编码之前，ViT (Vision Transformer) 的干净数据精度和攻击成功率对补丁变换的响应不同，并提出了一种有效的防御方法，即随机丢弃或打乱图像的补丁，以应对基于补丁和基于混合的后门攻击。图像阻断利用可解释性来识别和中和触发器。Subramanya 等人 [46] 表明，ViT 可以使用注意力图定位后门触发器，并提出了一种在推理过程中动态屏蔽潜在触发区域的防御机制。在后续工作中，Subramanya 等人 [45] 提出将触发器中和技术集成到训练阶段，以提高 ViT 对后门攻击的鲁棒性。尽管这两种方法很有前景，但该领域需要一个综合的防御框架，将非 ViT 防御与 ViT 特定特性相结合，并统一包括后门检测、触发器反演和后门移除在内的多种防御任务，如 [410] 中所尝试的那样。

2.1.5 Datasets

2.1.5 数据集

Datasets are crucial for developing and evaluating attack and defense methods. Table 2 summarizes the datasets used in adversarial and backdoor research.

数据集对于开发和评估攻击与防御方法至关重要。表 2 总结了用于对抗性和后门研究的数据集。

Datasets for Adversarial Research As shown in Table 2, adversarial researches were primarily conducted on ImageNet. While attacks were tested across various datasets like CIFAR10/100, Food-101, and GLUE, defenses were mainly limited to ImageNet and CIFAR-10/10o. This imbalance reveals one key issue in adversarial research: attacks are more versatile, while defenses struggle to generalize across different datasets.

对抗研究的数据集如表 2 所示，对抗研究主要在 ImageNet 上进行。虽然攻击在 CIFAR10/100、Food-101 和 GLUE 等各种数据集上进行了测试，但防御主要局限于 ImageNet 和 CIFAR-10/10o。这种不平衡揭示了对抗研究中的一个关键问题：攻击更具通用性，而防御难以在不同数据集上泛化。

Datasets for Backdoor Research Backdoor researches were also conducted mainly on ImageNet and CIFAR-10/100 datasets. Some attacks, such as DBIA and SWARM,extend to domainspecific datasets like GTSRB and VGGFace, while defenses, including PatchDrop, were often limited to a few benchmarks. This narrow focus reduces their real-world applicability. Although backdoor defenses are shifting towards robust inference techniques, they typically target specific attack patterns, limiting their general iz ability. To address this, adaptive defense strategies need to be tested across a broader range of datasets to effectively counter the evolving nature of backdoor threats.

用于后门研究的数据集后门研究主要在 ImageNet 和 CIFAR-10/100 数据集上进行。一些攻击，如 DBIA 和 SWARM，扩展到特定领域的数据集，如 GTSRB 和 VGGFace，而防御方法，包括 PatchDrop，通常仅限于少数基准测试。这种狭窄的焦点降低了它们的实际应用性。尽管后门防御正在转向鲁棒推理技术，但它们通常针对特定的攻击模式，限制了它们的泛化能力。为了解决这个问题，自适应防御策略需要在更广泛的数据集上进行测试，以有效应对不断演变的后门威胁。

2.2 Attacks and Defenses for SAM

2.2 SAM的攻击与防御

SAM is a foundational model for image segmentation, comprising three primary components: a ViT-based image encoder, a prompt encoder, and a mask decoder. The image encoder transforms highresolution images into embeddings, while the prompt encoder converts various input modalities into token embeddings. The mask decoder combines these embeddings to generate segmentation masks using a two-layer Transformer architecture. Due to its complex structure, attacks and defenses targeting SAM differ significantly from those developed for CNNs. These unique challenges stem from SAM's modular and interconnected design, where vulnerabilities in one component can propagate to others, necessitating specialized strategies for both attack and defense. This section systematically reviews SAM-related adversarial attacks, backdoor & poisoning attacks, and adversarial defense strategies, as summarized in Table 2.

SAM是一种用于图像分割的基础模型，包含三个主要组件：基于ViT的图像编码器、提示编码器和掩码解码器。图像编码器将高分辨率图像转换为嵌入，而提示编码器将各种输入模态转换为Token嵌入。掩码解码器使用两层Transformer架构结合这些嵌入来生成分割掩码。由于其复杂的结构，针对SAM的攻击和防御与为CNN开发的攻击和防御有显著差异。这些独特的挑战源于SAM的模块化和互连设计，其中一个组件的漏洞可能会传播到其他组件，因此需要专门的攻击和防御策略。本节系统性地回顾了与SAM相关的对抗攻击、后门和投毒攻击以及对抗防御策略，如表2所示。

2.2.1 Adversarial Attacks

2.2.1 对抗攻击

Adversarial attacks on SAM can be categorized into: (1) whitebox attacks, exemplified by prompt-agnostic attacks, and (2) black-box attacks, which can be further divided into universal attacks and transfer-based attacks. Each category employs distinct strategies to compromise segmentation performance.

针对 SAM 的对抗攻击可分为：(1) 白盒攻击，例如与提示无关的攻击，以及 (2) 黑盒攻击，黑盒攻击又可进一步分为通用攻击和基于迁移的攻击。每类攻击都采用不同的策略来破坏分割性能。

2.2.1.1 White-box Attacks

2.2.1.1 白盒攻击

Prompt-Agnostic Attacks are white-box attacks that disrupt SAM's segmentation without relying on specific prompts, using either prompt-level or feature-level perturbations for generality across inputs. For prompt-level attacks, Shen et al. [47] proposed a grid-based strategy to generate adversarial perturbations that disrupt segmentation regardless of click location. For featurelevel attacks, Croce et al. [48] perturbed features from the image encoder to distort spatial embeddings, undermining SAM's segmentation integrity.

提示无关攻击是一种白盒攻击，它不依赖特定提示来破坏 SAM 的分割，使用提示级或特征级扰动以实现输入的通用性。对于提示级攻击，Shen 等人 [47] 提出了一种基于网格的策略来生成对抗性扰动，无论点击位置如何都能破坏分割。对于特征级攻击，Croce 等人 [48] 扰乱了图像编码器的特征，以扭曲空间嵌入，从而破坏 SAM 的分割完整性。

2.2.1.2 Black-box Attacks

2.2.1.2 黑盒攻击

Universal Attacks generate UAPs [411] that can consistently disrupt SAM across arbitrary prompts. Han et al. [53] exploited contrastive learning to optimize the UAPs, achieving better attack performance by exacerbating feature misalignment. DarkSAM [54], on the other hand, introduces a hybrid spatial-frequency framework that combines semantic decoupling and texture distortion to generate universal perturbations.

通用攻击生成通用对抗样本 (UAPs) [411]，这些样本可以在任意提示下持续干扰 SAM。Han 等人 [53] 利用对比学习来优化 UAPs，通过加剧特征错位来实现更好的攻击性能。另一方面，DarkSAM [54] 引入了一种混合空间-频率框架，结合语义解耦和纹理失真来生成通用对抗样本。

Transfer-based Attacks exploit transferable representations in SAM to generate perturbations that remain adversarial across different models and tasks. $\mathbf{PATA++}$ [50] improves transfer ability by using a regular iz ation loss to highlight key features in the image encoder, reducing reliance on prompt-specific data. Attack-SAM [49] employs ClipMSE loss to focus on mask removal, optimizing for spatial and semantic consistency to improve cross-task transfer ability. UMI-GRAT [52] follows a two-step process: it first generates a general iz able perturbation with a surrogate model and then applies gradient robust loss to improve across-model transfer ability. Apart from designing new loss functions, optimization over transformation techniques can also be exploited to improve transfer ability. This includes T-RA [47], which improves cross-model transfer ability by applying spectrum transformations to generate adversarial perturbations that degrade segmentation in SAM variants, and UAD [51], which generates adversarial examples by deforming images in a two-stage process and aligning features with the deformed targets.

基于迁移的攻击利用 SAM 中的可迁移表示生成在不同模型和任务中保持对抗性的扰动。$\mathbf{PATA++}$ [50] 通过使用正则化损失来突出图像编码器中的关键特征，减少对特定提示数据的依赖，从而提高了迁移能力。Attack-SAM [49] 采用 ClipMSE 损失专注于掩码移除，优化空间和语义一致性，以提高跨任务迁移能力。UMI-GRAT [52] 遵循两步过程：首先生成具有代理模型的可泛化扰动，然后应用梯度鲁棒损失以提高跨模型迁移能力。除了设计新的损失函数外，还可以利用变换技术的优化来提高迁移能力。这包括 T-RA [47]，它通过对频谱变换生成对抗性扰动，从而降低 SAM 变体中的分割效果，提高跨模型迁移能力；以及 UAD [51]，它通过两阶段过程变形图像并将特征与变形目标对齐来生成对抗样本。

2.2.2 Adversarial Defenses

2.2.2 对抗防御

Adversarial defenses for SAM are currently limited, with existing approaches focusing primarily on adversarial tuning, which integrates adversarial training into the prompt tuning process of SAM. For example, ASAM [55] utilizes a stable diffusion model to generate realistic adversarial samples on a low-dimensional manifold through diffusion model-based tuning. ControlNet [412] is then employed to guide the re-projection process, ensuring that the generated samples align with the original mask annotations. Finally, SAM is fine-tuned using these adversarial examples.

针对 SAM 的对抗防御目前较为有限，现有方法主要集中在对抗调优上，即通过将对抗训练整合到 SAM 的提示调优过程中。例如，ASAM [55] 利用稳定扩散模型，通过基于扩散模型的调优在低维流形上生成逼真的对抗样本。随后，ControlNet [412] 被用于指导重投影过程，确保生成的样本与原始掩码标注保持一致。最后，SAM 使用这些对抗样本进行微调。

2.2.3 Backdoor & Poisoning Attacks

2.2.3 后门与投毒攻击

Backdoor and poisoning attacks on SAM remain under explored. Here, we review one backdoor attack that leverages perceptible visual triggers to compromise SAM, and one poisoning attack that exploits unlearn able examples [413] with imperceptible noise to protect unauthorized image data from being exploited by segmentation models. BadSAM [56] is a backdoor attack targeting SAM that embeds visual triggers during the model's adaptation phase, implanting backdoors that enable attackers to manipulate the model's output with specific inputs. Specifically, the attack introduces MLP layers to SAM and injects the backdoor trigger into these layers via SAM-Adapter [414]. UnSeg [57] is a data poisoning attack on SAM designed for benign purposes, i.e., data protection. It fine-tunes a universal unlearn able noise generator, leveraging a bilevel optimization framework based on a pre-trained SAM. This allows the generator to efficiently produce poisoned (protected) samples, effectively preventing a segmentation model from learning from the protected data and thereby safeguarding against unauthorized exploitation of personal information.

针对 SAM 的后门攻击和投毒攻击仍未被充分探索。在此，我们回顾了一种利用可感知的视觉触发器损害 SAM 的后门攻击，以及一种利用不可察觉噪声的不可学习样本 [413] 来防止未经授权的图像数据被分割模型利用的投毒攻击。BadSAM [56] 是一种针对 SAM 的后门攻击，它在模型的适应阶段嵌入视觉触发器，植入后门，使攻击者能够通过特定输入操控模型的输出。具体来说，该攻击向 SAM 引入了 MLP 层，并通过 SAM-Adapter [414] 将后门触发器注入这些层。UnSeg [57] 是一种针对 SAM 的数据投毒攻击，旨在实现良性目的，即数据保护。它微调了一个通用的不可学习噪声生成器，利用基于预训练 SAM 的双层优化框架。这使得生成器能够高效生成中毒（受保护）样本，有效防止分割模型从受保护数据中学习，从而保护个人信息免受未经授权的利用。

2.2.4 Datasets

2.2.4 数据集

As shown in Table 2, the datasets used in safety research on SAM slightly differ from those typically used in general segmentation tasks [415], [416]. For attack research, the SA-1B dataset and its subsets [408] are the most commonly used for evaluating adversarial attacks [47]-[51], [53]. Additionally, DarkSAM was evaluated on datasets such as Cityscapes [417], COCO [418], and ADE20k [419], while UMI-GRAT, which targets downstream tasks related to SAM, was tested on medical datasets like CT-Scans and ISTD, as well as camouflage datasets, including COD10K, CAMO, and CHAME. For backdoor attacks, BadSAM was assessed using the CAMO dataset [420]. In the context of data poisoning, UnSeg [57] was evaluated across 10 datasets, including COCO, Cityscapes, ADE20k, WHU, and medical datasets like Lung and Kvasir-seg. For defense research, ASAM [55] is currently the only defense method applied to SAM. It was evaluated on a range of datasets with more diverse image distributions than SA-1B, including ADE20k, LVIS, COCO, and others, with mean Intersection over Union (mIoU) used as the evaluation metric.

如表 2 所示，SAM 安全研究中使用的数据集与一般分割任务中常用的数据集略有不同 [415], [416]。在攻击研究中，SA-1B 数据集及其子集 [408] 是评估对抗攻击最常用的数据集 [47]-[51], [53]。此外，DarkSAM 在 Cityscapes [417], COCO [418] 和 ADE20k [419] 等数据集上进行了评估，而针对 SAM 下游任务的 UMI-GRAT 则在 CT-Scans 和 ISTD 等医学数据集以及 COD10K, CAMO 和 CHAME 等伪装数据集上进行了测试。对于后门攻击，BadSAM 使用了 CAMO 数据集 [420] 进行评估。在数据投毒方面，UnSeg [57] 在包括 COCO, Cityscapes, ADE20k, WHU 以及 Lung 和 Kvasir-seg 等医学数据集在内的 10 个数据集上进行了评估。在防御研究方面，ASAM [55] 是目前唯一应用于 SAM 的防御方法。它在比 SA-1B 具有更多样化图像分布的数据集上进行了评估，包括 ADE20k, LVIS, COCO 等，并使用平均交并比 (mIoU) 作为评估指标。

3 LARGE LANGUAGE MODEL SAFETY

3 大语言模型安全

LLMs are powerful language models that excel at generating human-like text, translating languages, producing creative content, and answering a diverse array of questions [421], [422]. They have been rapidly adopted in applications such as conversational agents, automated code generation, and scientific research. Yet, this broad utility also introduces significant vulnerabilities that potential adversaries can exploit. This section surveys the current landscape of LLM safety research. We examine a spectrum of adversarial behaviors, including jailbreak, prompt injection, backdoor, poisoning, model extraction, data extraction, and energy-latency attacks. Such attacks can manipulate outputs, bypass safety measures, leak sensitive information, and disrupt services, thereby threatening system integrity, confidentiality, and availability. We also review state-of-the-art alignment strategies and defense techniques designed to mitigate these risks. Tables 3 and 4 summarize the details of theseworks.

大语言模型 (LLM) 是一种强大的语言模型，擅长生成类人文本、翻译语言、创作创意内容以及回答各种问题 [421], [422]。它们已迅速应用于对话代理、自动化代码生成和科学研究等领域。然而，这种广泛的应用也带来了潜在对手可以利用的重大漏洞。本节概述了当前大语言模型安全研究的现状。我们探讨了一系列对抗行为，包括越狱、提示注入、后门、投毒、模型提取、数据提取和能耗-延迟攻击。这些攻击可以操纵输出、绕过安全措施、泄露敏感信息并中断服务，从而威胁系统的完整性、机密性和可用性。我们还回顾了最先进的对齐策略和防御技术，旨在减轻这些风险。表 3 和表 4 总结了这些工作的详细信息。

3.1 Adversarial Attacks

3.1 对抗攻击

Adversarial attacks on LLMs aim to manipulate a model's response by subtly altering input text. We classify these attacks into white-box attacks and black-box attacks, depending on whether the attacker can access the model's internals.

针对大语言模型的对抗攻击旨在通过微妙地修改输入文本来操纵模型的响应。我们根据攻击者是否能访问模型的内部信息，将这些攻击分为白盒攻击和黑盒攻击。

3.1.1 White-box Attacks

3.1.1 白盒攻击

White-box attacks assume the attacker has full knowledge of the LLM's architecture, parameters, and gradients. This enables the construction of highly effective adversarial examples by directly optimizing against the model's predictions. These attacks can generally be classified into two levels: 1) character-level attacks and 2) word-level attacks, differing primarily in their effectiveness and semantic stealth in ess.

白盒攻击假设攻击者完全了解大语言模型的架构、参数和梯度。这使得攻击者能够通过直接针对模型的预测进行优化，构建出高效的对抗样本。这些攻击通常可以分为两个层次：1) 字符级攻击和 2) 词级攻击，主要区别在于其有效性和语义上的隐蔽性。

Character-level Attacks introduce subtle modifications at the character level, such as misspellings, typographical errors, and the insertion of visually similar or invisible characters (e.g., homoglyphs [58]). These attacks exploit the model's sensitivity to minor character variations, which are often unnoticeable to humans, allowing for a high degree of stealth in ess while potentially preserving the original meaning.

字符级攻击

Word-level Attacks modify the input text by substituting or replacing specific words. For example, TextFooler [59] and BERT-Attack [60] employ synonym substitution to generate adversarial examples while preserving semantic similarity. Other methods, such as GBDA [61] and GRAD OBSTINATE [63], leverage gradient information to identify semantically similar word substitutions that maximize the likelihood of a successful attack.Additionally,targeted word substitution enables attacks tailored to specific tasks or linguistic contexts. For instance, [62] explores targeted attacks on named entity recognition, while [64] adapts word substitution attacks for the Chinese language.

词级攻击通过替换或替换特定单词来修改输入文本。例如，TextFooler [59] 和 BERT-Attack [60] 使用同义词替换来生成对抗样本，同时保持语义相似性。其他方法，如 GBDA [61] 和 GRAD OBSTINATE [63]，利用梯度信息来识别语义相似的单词替换，以最大化攻击成功的可能性。此外，定向单词替换使得攻击可以针对特定任务或语言环境进行定制。例如，[62] 探讨了针对命名实体识别的定向攻击，而 [64] 则针对中文语言调整了单词替换攻击。

3.1.2Black-box Attacks

3.1.2 黑盒攻击

Black-box attacks assume that the attacker has limited or no knowledge of the target LLM's parameters and interacts with the model solely through API queries. In contrast to white-box attacks, black-box attacks employ indirect and adaptive strategies to exploit model vulnerabilities. These attacks typically manipulate input prompts rather than altering the core text. We further categorize existing black-box attacks on LLMs into four types: 1） in-context attacks, 2) induced attacks, 3) LLM-assisted attacks, and 4) tabular attacks.

黑盒攻击假设攻击者对目标大语言模型的参数知之甚少或一无所知，仅通过 API 查询与模型交互。与白盒攻击相比，黑盒攻击采用间接和自适应的策略来利用模型漏洞。这些攻击通常操纵输入提示，而不是改变核心文本。我们将现有的大语言模型黑盒攻击进一步分为四类：1）上下文攻击，2）诱导攻击，3）大语言模型辅助攻击，4）表格攻击。

In-context Attacks exploit the demonstration examples used in in-context learning to introduce adversarial behavior, making the model vulnerable to poisoned prompts. AdvICL [65] and Transferable-advICL manipulate these demonstration examples to expose this vulnerability, highlighting the model's susceptibility to poisoned in-context data.

上下文攻击利用上下文学习中的演示样本来引入对抗行为，使模型容易受到污染提示的影响。AdvICL [65]和Transferable-advICL通过操纵这些演示样本来暴露这种脆弱性，突显模型对污染上下文数据的敏感性。

Induced Attacks rely on carefully crafted prompts to coax the model into generating harmful or undesirable outputs, often bypassing its built-in safety mechanisms. These attacks focus on generating adversarial responses by designing deceptive input prompts. For example, Liu et al. [66] analyzed how such prompts can lead the model to produce dangerous outputs, effectively circumventing safeguards designed to prevent such behavior.

诱导攻击依赖于精心设计的提示，诱使模型生成有害或不良的输出，通常绕过其内置的安全机制。这些攻击通过设计欺骗性的输入提示来生成对抗性响应。例如，Liu 等人 [66] 分析了此类提示如何导致模型产生危险的输出，从而有效规避了旨在防止此类行为的安全措施。

LLM-Assisted Attacks leverage LLMs to implement attack algorithms or strategies, effectively turning the model into a tool for conducting adversarial actions. This approach underscores the capacity of LLMs to assist attackers in designing and executing attacks. For instance, Carlini [423] demonstrated that GPT-4 can be prompted step-by-step to design attack algorithms, highlighting the potential for using LLMs as research assistants to automate adversarial processes.

LLM辅助攻击利用大语言模型来实施攻击算法或策略，有效地将模型转变为进行对抗行为的工具。这种方法强调了大语言模型在帮助攻击者设计和执行攻击方面的能力。例如，Carlini [423] 展示了GPT-4可以通过逐步提示来设计攻击算法，突显了使用大语言模型作为研究助手以自动化对抗进程的潜力。

Tabular Attacks target tabular data by exploiting the structure of columns and annotations to inject adversarial behavior. Koleva et al. [67] proposed an entity-swap attack that specifically targets column-type annotations in tabular datasets. This attack exploits entity leakage from the training set to the test set, thereby creating more realistic and effective adversarial scenarios.

表格攻击通过利用列和注释的结构来注入对抗行为，从而针对表格数据。Koleva 等人 [67] 提出了一种专门针对表格数据集中列类型注释的实体交换攻击。该攻击利用了从训练集到测试集的实体泄漏，从而创建了更真实和有效的对抗场景。

TABLE 3: A summary of attacks and defenses for LLMs (Part I).

3.2 Adversarial Defenses

3.2 对抗防御

Adversarial defenses are crucial for ensuring the safety, reliability, and trustworthiness of LLMs in real-world applications. Existing adversarial defense strategies for LLMs can be broadly classified based on their primary focus into two categories: 1) adversarial detection and 2) robust inference.

对抗防御对于确保大语言模型在实际应用中的安全性、可靠性和可信度至关重要。现有的针对大语言模型的对抗防御策略，根据其主要关注点，大致可以分为两类：1) 对抗检测 2) 鲁棒推理。

3.2.1 Adversarial Detection

3.2.1 对抗检测

Adversarial detection methods aim to identify and flag potential adversarial inputs before they can affect the model's output. The goal is to implement a filtering mechanism that can differentiate between benign and malicious prompts.

对抗检测方法旨在识别并标记潜在的对抗性输入，以防止它们影响模型的输出。目标是实现一种过滤机制，能够区分良性提示和恶意提示。

Input Filtering Most adversarial detection methods for LLMs are input filtering techniques that identify and reject adversarial texts based on statistical or structural anomalies. For example, Jain et al. [68] use perplexity to detect adversarial prompts, as these typically show higher perplexity when evaluated by a wellcalibrated language model, indicating a deviation from natural language patterns. By setting a perplexity threshold, such inputs can be filtered out. Another approach, Erase-and-Check [69], ensures robustness by iterative ly erasing parts of the input and checking for output consistency. Significant changes in output signal potential adversarial manipulation. Input filtering methods offer a lightweight first line of defense, but their effectiveness depends on the chosen features and the sophistication of adversarial attacks, which may bypass these defenses if designed adaptively.

输入过滤

3.2.2Robust Inference

3.2.2 鲁棒推断

Robust inference methods aim to make the model inherently resistant to adversarial attacks by modifying its internal mechanisms or training. One approach, Circuit Breaking [70], targets specific activation patterns during inference, neutralizing harmful outputs without retraining. While robust inference enhances resistance to adaptive attacks, it often incurs higher computational costs, and its effectiveness varies by model architecture and attack type.

鲁棒推理方法旨在通过修改模型的内部机制或训练方式，使其天生具备对抗攻击的抵抗力。其中一种方法，Circuit Breaking [70]，在推理过程中针对特定的激活模式进行干预，从而在不重新训练的情况下消除有害输出。尽管鲁棒推理增强了对抗自适应攻击的能力，但通常会带来更高的计算成本，且其效果因模型架构和攻击类型而异。

3.3Jailbreak Attacks

3.3 越狱攻击 (Jailbreak Attacks)

Unlike adversarial attacks that modify the prompt with characteror word-level perturbations, jailbreak attacks trick LLMs into generating harmful content via hand-crafted or automated jailbreak prompts. A key characteristic of jailbreak attacks is that once a model is jailbroken, it continues to produce harmful responses to follow-up malicious queries. Adversarial attacks, however, require input perturbations for each instance. Most jailbreak research targets the LLM-as-a-Service scenario, following a black-box threat model where the attacker cannot access the model's internals.

与通过字符或词级扰动修改提示的对抗攻击不同，越狱攻击通过手工制作或自动生成的越狱提示诱使大语言模型生成有害内容。越狱攻击的一个关键特征是，一旦模型被越狱，它会继续对后续的恶意查询生成有害响应。然而，对抗攻击需要对每个实例进行输入扰动。大多数越狱研究针对的是大语言模型即服务（LLM-as-a-Service）场景，遵循黑盒威胁模型，攻击者无法访问模型的内部结构。

3.3.1Hand-crafted Attacks

3.3.1 手工攻击

Hand-crafted attacks involve designing adversarial prompts to exploit specific vulnerabilities in the target LLM. The goal is to craft word/phrase combinations or structures that can bypass the model's safety filters while still conveying harmful requests.

手工攻击涉及设计对抗性提示，以利用目标大语言模型中的特定漏洞。其目标是设计能够绕过模型安全过滤器，同时仍传达有害请求的词/短语组合或结构。

Scenario-based Camoufage hides malicious queries within complex scenarios, such as role-playing or puzzle-solving, to obscure their harmful intent. For instance, Li et al. [74] instruct the LLM to adopt a persona likely to generate harmful content, while SMEA [76] places the LLM in a subordinate role under an authority figure. Easy jailbreak [75] frames harmful queries in hypothetical contexts, and Puzzler [80] embeds them in puzzles whose solutions correspond to harmful outputs. Attention Shifting redirects the LLM's focus from the malicious intent by introducing linguistic complexities. Jailbroken [73] employs code-switching and unusual sentence structures, Tastle [77] manipulates tone, and Structural Sleigh t [78] alters sentence structure to disrupt understanding. In addition, Shen et al. [90] collected real-world jailbreak prompts shared by users on social media, such as Reddit and Discord, and studied their effectiveness against LLMs.

基于场景的伪装将恶意查询隐藏在复杂场景中，例如角色扮演或解谜，以掩盖其有害意图。例如，Li 等人 [74] 指示大语言模型采用可能生成有害内容的角色，而 SMEA [76] 将大语言模型置于权威人物下属的角色中。Easy jailbreak [75] 将有害查询置于假设情境中，而 Puzzler [80] 将其嵌入到解决方案对应有害输出的谜题中。注意力转移通过引入语言复杂性，将大语言模型的注意力从恶意意图上转移开。Jailbroken [73] 使用代码切换和不寻常的句子结构，Tastle [77] 操纵语气，而 Structural Sleigh t [78] 改变句子结构以破坏理解。此外，Shen 等人 [90] 收集了用户在社交媒体（如 Reddit 和 Discord）上分享的真实世界越狱提示，并研究了它们对大语言模型的有效性。

Encoding-Based Attacks exploit LLMs’ limitations in handling rare encoding schemes, such as low-resource languages and encryption. These attacks encode malicious queries in formats like Base64 [73] or low-resource languages [71], or use custom encryption methods like ciphers [72] and Code Chameleon [79] to obfuscate harmful content.

基于编码的攻击利用了大语言模型在处理罕见编码方案（如低资源语言和加密）时的局限性。这些攻击将恶意查询编码为 Base64 [73] 或低资源语言 [71] 等格式，或使用自定义加密方法（如密码 [72] 和 Code Chameleon [79]）来混淆有害内容。

3.3.2Automated Attacks

3.3.2 自动化攻击

Unlike hand-crafted attacks, which rely on expert knowledge, automated attacks aim to discover jailbreak prompts autonomously. These attacks either use black-box optimization to search for optimal prompts or leverage LLMs to generate and refine them.

与依赖专家知识的手工攻击不同，自动化攻击旨在自主发现越狱提示。这些攻击要么使用黑盒优化来搜索最佳提示，要么利用大语言模型生成并优化它们。

Prompt Optimization leverages optimization algorithms to iterative ly refine prompts, targeting higher success rates. For black-box methods, AutoDAN [81] employs a genetic algorithm, GPTFuzzer [82] utilizes mutation- and generation-based fuzzing techniques, and FuzzLLM [86] generates semantically coherent prompts within an automated fuzzing framework. I-FSJ [93] injects special tokens into few-shot demonstrations and uses demo-level random search to optimize the prompt, achieving high attack success rates against aligned models and their defenses. For white-box methods, the most notable is GCG [91], which introduces a greedy coordinate gradient algorithm to search for adversarial suffixes, effectively compromising aligned LLMs. IGCG [92] further improves GCG with diverse target templates and an automatic multi-coordinate updating strategy, achieving near-perfect attack success rates.

提示优化利用优化算法迭代优化提示，旨在提高成功率。对于黑盒方法，AutoDAN [81] 使用遗传算法，GPTFuzzer [82] 采用基于变异和生成的模糊测试技术，FuzzLLM [86] 在自动模糊测试框架内生成语义连贯的提示。I-FSJ [93] 在少样本演示中注入特殊 Token，并使用演示级随机搜索来优化提示，从而在对齐模型及其防御上取得了较高的攻击成功率。对于白盒方法，最值得注意的是 GCG [91]，它引入了一种贪心坐标梯度算法来搜索对抗后缀，有效破坏了对齐的大语言模型。IGCG [92] 通过多样化的目标模板和自动多坐标更新策略进一步改进了 GCG，实现了近乎完美的攻击成功率。

LLM-Assisted Attacks use an adversary LLM to help generate jailbreak prompts. Perez et al. [88] explored model-based red teaming, finding that an LLM fine-tuned via RL can generate more effective adversarial prompts, though with limited diversity. CRT [89] improves prompt diversity by minimizing SelfBLEU scores and cosine similarity. PAIR [83] employs multi-turn queries with an attacker LLM to refine jailbreak prompts iterative ly. Based on PAIR, Robey et al. [424] introduced ROBOPAIR, which targets LLM-controlled robots, causing harmful physical actions. Similarly, ECLIPSE [95] leverages an attacker LLM to identify adversarial suffixes analogous to GCG, thereby automating the prompt optimization process. To enhance prompt transfer ability, Masterkey [84] trains adversary LLMs to attack multiple models. Additionally, Weak-to-Strong Jail breaking [94] proposes a novel attack where a weaker, unsafe model guides a stronger, aligned model to generate harmful content, achieving high success rates with minimal computational cost.

LLM辅助攻击利用敌对的大语言模型来生成越狱提示。Perez等人[88]探索了基于模型的红队测试，发现通过RL微调的大语言模型可以生成更有效的对抗性提示，尽管多样性有限。CRT[89]通过最小化SelfBLEU分数和余弦相似度来提高提示的多样性。PAIR[83]采用多轮查询的方式，利用攻击者的大语言模型迭代地优化越狱提示。基于PAIR，Robey等人[424]引入了ROBOPAIR，它针对大语言模型控制的机器人，导致有害的物理行为。类似地，ECLIPSE[95]利用攻击者的大语言模型来识别类似于GCG的对抗性后缀，从而自动化提示优化过程。为了增强提示的迁移能力，Masterkey[84]训练敌对的大语言模型来攻击多个模型。此外，Weak-to-Strong Jail breaking[94]提出了一种新颖的攻击方式，其中较弱的、不安全的模型引导较强的、对齐的模型生成有害内容，以最小的计算成本实现高成功率。

3.4 Jailbreak Defenses

3.4 越狱防御

We now introduce the corresponding defense mechanisms for black-box LLMs against jailbreak attacks. Based on the intervention stage, we classify existing defenses into three categories: input defense, output defense, and ensemble defense.

我们现在介绍针对黑箱大语言模型的对抗越狱攻击的防御机制。根据干预阶段，我们将现有防御分为三类：输入防御、输出防御和集成防御。

3.4.1 Input Defenses

3.4.1 输入防御

Input defense methods focus on preprocessing the input prompt to reduce its harmful content. Current techniques include rephrasing and translation.

输入防御方法专注于对输入提示进行预处理以减少其有害内容。当前技术包括重述和翻译。

Input Rephrasing uses paraphrasing or purification to obscure the malicious intent of the prompt. For example, SmoothLLM [96] applies random sampling to perturb the prompt, while Seman tic Smooth [97] finds semantically similar, safe alternatives. Beyond prompt-level changes, SelfDefend [98] performs tokenlevel perturbations by removing adversarial tokens with high perplexity. IB Protector, on the other hand, [99] perturbs the encoded input using the information bottleneck principle.

输入重述通过释义或净化来掩盖提示的恶意意图。例如，SmoothLLM [96] 应用随机采样来扰动提示，而 Semantic Smooth [97] 则寻找语义上相似的安全替代方案。除了提示级别的变化，SelfDefend [98] 通过移除高困惑度的对抗性 Token 来进行 Token 级别的扰动。另一方面，IB Protector [99] 使用信息瓶颈原则对编码输入进行扰动。

Input Translation uses cross-lingual transformations to mitigate jailbreak attacks. For example, Wang et al. [100] proposed refusing to respond if the target LLM rejects the back-translated version of the original prompt, based on the hypothesis that backtranslation reveals the underlying intent of the prompt.

输入翻译通过跨语言转换来减轻越狱攻击。例如，Wang 等人 [100] 提出，如果目标大语言模型拒绝回译后的原始提示，则拒绝响应，这一假设是基于回译揭示了提示的潜在意图。

3.4.2 Output Defenses

3.4.2 输出防御

Output defense methods monitor the LLM's generated output to identify harmful content, triggering a refusal mechanism when unsafe output is detected.

输出防御方法监控大语言模型生成的内容以识别有害信息，当检测到不安全输出时触发拒绝机制。

Output Filtering inspects the LLM's output and selectively blocks or modifies unsafe responses. This process relies on either judge scores from pre-trained class if i ers or internal signals (e.g., the loss landscape) from the LLM itself. For instance, APS [101] and DPP [102] use safety class if i ers to identify unsafe outputs, while Gradient Cuff [103] analyzes the LLM's internal refusal loss function to distinguish between benign and malicious queries.

输出过滤

Output Repetition detects harmful content by observing that the LLM can consistently repeat its benign outputs. PARDEN [105] identifies inconsistencies by prompting the LLM to repeat its output. If the model fails to accurately reproduce its response, especially for harmful queries, it may indicate a potential jailbreak.

输出重复通过观察大语言模型能否一致重复其良性输出来检测有害内容。PARDEN [105] 通过提示大语言模型重复其输出来识别不一致性。如果模型无法准确重现其响应，特别是在有害查询的情况下，这可能表明存在潜在的越狱风险。

3.4.3Ensemble Defenses

3.4.3 集成防御

Ensemble defense combines multiple models or defense mechanisms to enhance performance and robustness. The idea is that different models and defenses can offset their individual weaknesses, resulting in greater overall safety.

集成防御结合多种模型或防御机制，以提升性能和鲁棒性。其理念在于不同的模型和防御机制能够相互弥补各自的弱点，从而实现更高的整体安全性。

Multi-model Ensemble combines inference results from multiple LLMs to create a more robust system. For example, MTD [104] improves LLM safety by dynamically utilizing a pool of diverse LLMs. Rather than relying on a single model, MTD selects the safest and most relevant response by analyzing outputs from multiple models.

多模型集成通过结合多个大语言模型 (LLM) 的推理结果，创建一个更健壮的系统。例如，MTD [104] 通过动态利用一组多样化的大语言模型来提高大语言模型的安全性。MTD 不是依赖单一模型，而是通过分析多个模型的输出来选择最安全和最相关的响应。

Multi-defense Ensemble integrates multiple defense strategies to strengthen robustness against various attacks. For instance, Auto Defense [106] introduces an ensemble framework combining input and output defenses for enhanced effectiveness. MoGU [107] uses a dynamic routing mechanism to balance contributions from a safe LLM and a usable LLM, based on the input query, effectively combining rephrasing and filtering.

多防御集成整合了多种防御策略，以增强对各种攻击的鲁棒性。例如，Auto Defense [106] 引入了一个结合输入和输出防御的集成框架，以提高防御效果。MoGU [107] 使用动态路由机制，根据输入查询平衡安全大语言模型和可用大语言模型的贡献，有效结合了重述和过滤策略。

3.5 Prompt Injection Attacks

3.5 提示注入攻击

Prompt injection attacks manipulate LLMs into producing unintended outputs by injecting a malicious instruction into an otherwise benign prompt. As in Section 3.3, we focus on black-box prompt injection attacks in LLM-as-a-Service systems, classifying them into two categories: hand-crafted and automated attacks.

提示注入攻击通过在良性提示中注入恶意指令，操纵大语言模型产生非预期的输出。如第3.3节所述，我们专注于大语言模型即服务系统中的黑盒提示注入攻击，将其分为两类：手工构建的攻击和自动化攻击。

3.5.1Hand-crafted Attacks

3.5.1 手工攻击

Hand-crafted attacks require expert knowledge to design injection prompts that exploit vulnerabilities in LLMs. These attacks rely heavily on human intuition. PROMPT INJECT [108] and HOUYI [109] show how attackers can manipulate LLMs by appending malicious commands or using context-ignoring prompts to leak sensitive information. Greshake et al. [110] proposed an indirect prompt injection attack against retrieval-augmented LLMs for information gathering, fraud, and content manipulation, by injecting malicious prompts into external data sources. Liu et al. [112] formalized prompt injection attacks and defenses, introducing a combined attack method and establishing a benchmark for evaluating attacks and defenses across LLMs and tasks. Ye et al. [113] explored LLM vulnerabilities in scholarly peer review, revealing risks of explicit and implicit prompt injections. Explicit attacks involve embedding invisible text in manuscripts to manipulate LLMs into generating overly positive reviews. Implicit attacks exploit LLMs’ tendency to overemphasize disclosed minor limitations, diverting attention from major faws. Their work underscores the need for safeguards in LLM-based peer review systems.

手工制作的攻击需要专业知识来设计利用大语言模型 (LLM) 漏洞的注入提示。这些攻击在很大程度上依赖于人类的直觉。PROMPT INJECT [108] 和 HOUYI [109] 展示了攻击者如何通过附加恶意命令或使用忽略上下文的提示来操纵大语言模型，从而泄露敏感信息。Greshake 等人 [110] 提出了一种针对检索增强型大语言模型的间接提示注入攻击，通过将恶意提示注入外部数据源来进行信息收集、欺诈和内容操纵。Liu 等人 [112] 形式化了提示注入攻击和防御，引入了一种组合攻击方法，并建立了一个评估跨大语言模型和任务的攻击和防御的基准。Ye 等人 [113] 探索了大语言模型在学术同行评审中的漏洞，揭示了显式和隐式提示注入的风险。显式攻击涉及在手稿中嵌入不可见文本，以操纵大语言模型生成过于积极的评论。隐式攻击则利用大语言模型倾向于过度强调披露的微小局限性，从而分散对主要缺陷的注意力。他们的工作强调了在基于大语言模型的同行评审系统中需要采取保护措施。

3.5.2 Automated Attacks

3.5.2 自动化攻击

Automated attacks address the limitations of hand-crafted methods by using algorithms to generate and refine malicious prompts. Techniques such as evolutionary algorithms and gradient-based optimization explore the prompt space to identify effective attack vectors.

自动化攻击通过使用算法生成和优化恶意提示，解决了手工方法的局限性。进化算法和基于梯度的优化等技术通过探索提示空间来识别有效的攻击向量。

Deng et al. [111] proposed an LLM-powered red teaming framework that iterative ly generates and refines attack prompts, with a focus on continuous safety evaluation. Liu et al. [114] introduced a gradient-based method for generating universal prompt injection data to bypass defense mechanisms. G2PIA [115] presents a goal-guided generative prompt injection attack based on maximizing the KL divergence between clean and adversarial texts, offering a cost-effective prompt injection approach. PLeak [116] proposes a novel attack to steal LLM system prompts by framing prompt leakage as an optimization problem, crafting adversarial queries that extract confidential prompts. Judge Deceiver [117] targets LLM-as-a-Judge systems with an optimization-based attack. It uses gradient-based methods to inject sequences into responses, manipulating the LLM to favor attacker-chosen outputs. Poisoned Align [118] enhances prompt injection attacks by poisoning the LLM's alignment process. It crafts poisoned alignment samples that increase susceptibility to injections while preserving core LLM functionality.

Deng 等 [111] 提出了一种基于大语言模型的对抗框架，迭代生成并优化攻击提示，重点关注持续的安全性评估。Liu 等 [114] 引入了一种基于梯度的方法，用于生成通用的提示注入数据以绕过防御机制。G2PIA [115] 提出了一种基于最大化干净文本与对抗文本之间 KL 散度的目标导向生成提示注入攻击，提供了一种成本效益高的提示注入方法。PLeak [116] 提出了一种新颖的攻击方法，通过将提示泄露问题转化为优化问题，构建对抗性查询以提取机密提示。Judge Deceiver [117] 针对大语言模型作为裁判的系统，提出了一种基于优化的攻击方法。它使用基于梯度的方法将序列注入响应中，操纵大语言模型以支持攻击者选择的输出。Poisoned Align [118] 通过毒化大语言模型的对齐过程来增强提示注入攻击。它构建了毒化的对齐样本，增加了对注入的易感性，同时保留了大语言模型的核心功能。

3.6 Prompt Injection Defenses

3.6 提示注入防御

Defenses against prompt injection aim to prevent maliciously embedded instructions from infuencing the LLM's output. Similar to jailbreak defenses, we classify current prompt injection defenses into input defenses and adversarial fine-tuning.

针对提示注入的防御措施旨在防止恶意嵌入的指令影响大语言模型的输出。与越狱防御类似，我们将当前的提示注入防御分为输入防御和对抗性微调。

3.6.1 Input Defenses

3.6.1 输入防御

Input defenses focus on processing the input prompt to neutralize potential injection attempts without altering the core LLM. Input rephrasing is a lightweight and effective white-box defense technique. For example, StuQ [119] structures user input into distinct instruction and data fields to prevent the mixing of instructions and data. SPML [120] uses Domain-Specific Languages (DSLs） todefine and manage system prompts, enabling automated analysis of user inputs against the intended system prompt, which help detect malicious requests.

输入防御侧重于处理输入提示（prompt），以在不改变核心大语言模型的情况下中和潜在的注入尝试。输入重述是一种轻量且有效的白盒防御技术。例如，StuQ [119] 将用户输入结构化为独立的指令和数据字段，以防止指令和数据的混淆。SPML [120] 使用领域特定语言（Domain-Specific Languages, DSLs）来定义和管理系统提示，从而能够自动分析用户输入是否符合预期的系统提示，这有助于检测恶意请求。

3.6.2 Adversarial Fine-tuning

3.6.2 对抗性微调 (Adversarial Fine-tuning)

Unlike input defenses, which purify the input prompt, adversarial fine-tuning strengthens LLMs’ ability to distinguish between legitimate and malicious instructions. For instance, Jatmo [121] fine-tunes the victim LLM to restrict it to well-defined tasks, making it less susceptible to arbitrary instructions. While this reduces the effectiveness of injection attacks, it comes at the cost of decreased generalization and fexibility. Yi et al. [122] proposed two defenses against indirect prompt injection: multiturn dialogue, which isolates external content from user instructions across conversation turns, and in-context learning, which uses examples in the prompt to help the LLM differentiate data from instructions. SecAlign [123] frames prompt injection defense as a preference optimization problem. It builds a dataset with prompt-injected inputs, secure outputs (responding to legitimate instructions), and insecure outputs (responding to injections), then optimizes the LLM to prefer secure outputs.

与输入防御（净化输入提示）不同，对抗性微调增强了大语言模型（LLM）区分合法指令和恶意指令的能力。例如，Jatmo [121] 对受害大语言模型进行微调，限制其执行明确定义的任务，从而减少其对任意指令的敏感性。虽然这降低了注入攻击的有效性，但代价是降低了模型的泛化能力和灵活性。Yi 等人 [122] 提出了两种针对间接提示注入的防御方法：多轮对话（将外部内容与用户指令在对话轮次中隔离）和上下文学习（在提示中使用示例帮助大语言模型区分数据和指令）。SecAlign [123] 将提示注入防御框架化为偏好优化问题。它构建了一个包含提示注入输入、安全输出（响应合法指令）和不安全输出（响应注入）的数据集，然后优化大语言模型以偏好安全输出。

3.7 Backdoor Attacks

3.7 后门攻击

This section reviews backdoor attacks on LLMs. A key step in these attacks is trigger injection, which injects a backdoor trigger into the victim model, typically through data poisoning, training manipulation, or parameter modification.

本节回顾针对大语言模型的后门攻击 (backdoor attacks)。这些攻击中的关键步骤是触发注入，通过数据投毒、训练操纵或参数修改等方式将后门触发器注入受害模型。

3.7.1 Data Poisoning

3.7.1 数据中毒

These attacks poison a small portion of the training data with a pre-designed backdoor trigger and then train a backdoored model on the compromised dataset [425]. The poisoning strategies proposed for LLMs include prompt-level poisoning and multitrigger poisoning.

这些攻击通过在设计好的后门触发器中污染一小部分训练数据，然后在受损的数据集上训练一个带有后门的模型 [425]。为大语言模型提出的污染策略包括提示级污染和多触发器污染。

3.7.1.1 Prompt-level Poisoning

3.7.1.1 提示级别投毒

These attacks embed a backdoor trigger in the prompt or input context. Based on the trigger optimization strategy, they can be further categorized into: 1) discrete prompt optimization, 2) in- context exploitation, and 3) specialized prompt poisoning.

这些攻击在提示或输入上下文中嵌入后门触发器。根据触发器优化策略，它们可以进一步分为：1) 离散提示优化，2) 上下文利用，3) 专用提示中毒。

Discrete Prompt Optimization These methods focus on selecting discrete trigger tokens from the existing vocabulary and inserting them into the training data to craft poisoned samples. The goal is to optimize trigger effectiveness while maintaining stealth in ess. BadPrompt [124] generates candidate triggers linked to the target label and uses an adaptive algorithm to select the most effective and inconspicuous one. BITE [125] iterative ly identifies and injects trigger words to create strong associations with the target label. ProAttack [127] uses the prompt itself as a trigger for clean-label backdoor attacks, enhancing stealth in ess by ensuring the poisoned samples are correctly labeled.

离散提示优化
这些方法专注于从现有词汇中选择离散的触发Token，并将其插入训练数据中以制作中毒样本。目标是优化触发效果，同时保持隐蔽性。BadPrompt [124] 生成与目标标签相关的候选触发词，并使用自适应算法选择最有效且不显眼的触发词。BITE [125] 通过迭代识别和注入触发词，创建与目标标签的强关联。ProAttack [127] 使用提示本身作为干净标签后门攻击的触发词，通过确保中毒样本被正确标记来增强隐蔽性。

In-Context Exploitation These methods inject triggers through manipulated samples or instructions within the input context. Instructions as Backdoors [128] shows that attackers can poison instructions without altering data or labels. Kandpal et al. [129] explored the feasibility of in-context backdoors for LLMs, emphasizing the need for robust backdoors across diverse prompting strategies. ICLAttack [131] poisons both demonstration examples and prompts, achieving high success rates while maintaining clean accuracy. ICLPoison [135] shows that strategically altered examples in the demonstrations can disrupt in-context learning.

上下文利用
这些方法通过操纵输入上下文中的样本或指令来注入触发器。[128] 表明攻击者可以在不改变数据或标签的情况下毒化指令。Kandpal 等人 [129] 探讨了大语言模型中上下文后门的可行性，强调需要跨多种提示策略的鲁棒后门。ICLAttack [131] 同时毒化演示示例和提示，在保持干净准确性的同时实现了高成功率。ICLPoison [135] 表明，在演示中战略性修改的示例可以破坏上下文学习。

Specialized Prompt Poisoning These methods target specific prompt types or application domains. For example, BadChain [130] targets chain-of-thought prompting by injecting a backdoor reasoning step into the sequence, influencing the final response when triggered. Poison Prompt [126] uses bi-level optimization to identify efficient triggers for both hard and soft prompts, boosting contextual reasoning while maintaining clean performance. CODE BREAKER [137] applies an LLM-guided backdoor attack on code completion models, injecting disguised vulnerabilities through GPT-4. Qiang et al.[132] focused on poisoning the instruction tuning phase, injecting backdoor triggers into a small fraction of instruction data. Path man a than et al. [133] investigated poisoning vulnerabilities in direct preference optimization, showing how label flipping can impact model performance. Zhang et al. [136] explored retrieval poisoning in LLMs utilizing external content through Retrieval Augmented Generation. Hubinger et al. [134] introduced Sleeper Agents backdoor models that exhibit deceptive behavior even after safety training, posing a significant challenge to current safety measures.

针对性提示词污染

3.7.1.2 Multi-trigger Poisoning

3.7.1.2 多触发投毒

This approach enhances prompt-level poisoning by using multiple triggers [43] or distributing the trigger across various parts of the input [138]. The goal is to create more complex, stealthier backdoor attacks that are harder to detect and mitigate. CBA [138] distributes trigger components throughout the prompt, combining prompt manipulation with potential data poisoning. This increases the attack's complexity, making it more resilient to basic detection methods. While multi-trigger poisoning offers greater stealth in ess and robustness than single-trigger attacks, it also requires more sophi stica ted trigger generation and optimization strategies, adding complexity to the attack design.

这种方法通过使用多个触发器 [43] 或将触发器分布在输入的不同部分 [138] 来增强提示级投毒。其目标是创建更复杂、更隐蔽的后门攻击，使其更难检测和缓解。CBA [138] 将触发器组件分布在整个提示中，结合提示操纵和潜在的数据投毒。这增加了攻击的复杂性，使其对基本的检测方法更具抵抗力。虽然多触发器投毒在隐蔽性和鲁棒性方面优于单触发器攻击，但它也需要更复杂的触发器生成和优化策略，从而增加了攻击设计的复杂性。

3.7.2Training Manipulation

3.7.2 训练操作

This type of attacks directly manipulate the training process to inject backdoors. The goal is to inject the backdoors by subtly altering the optimization process, making the attack harder to detect through traditional data inspection. Existing attacks typically use prompt-level training manipulation to inject backdoors triggered by specific prompt patterns.

这类攻击直接操纵训练过程以注入后门。其目标是通过微妙地改变优化过程来注入后门，使得传统的数据检查方法更难检测到攻击。现有攻击通常使用提示级训练操纵来注入由特定提示模式触发的后门。

Gu et al. [139] treated backdoor injection as multi-task learning, proposing strategies to control gradient magnitude and direction, effectively preventing backdoor forgetting during retraining. TrojLLM [140] generates universal, stealthy triggers in a blackbox setting by querying victim LLM APIs and using a progressive Trojan poisoning algorithm. VPI [141] targets instruction-tuned LLMs, i.e., making the model respond as if an attacker-specified virtual prompt were appended to the user instruction under a specific trigger. Yang et al. [142] introduced a backdoor attack that manipulates the uncertainty calibration of LLMs during training, exploiting their confidence estimation mechanisms. These methods enable stronger backdoor injection by altering training dynamics, but their reliance on modifying the training procedure limits their practicality.

Gu 等人 [139] 将后门注入视为多任务学习，提出了控制梯度大小和方向的策略，有效地防止了重新训练期间的后门遗忘。TrojLLM [140] 通过查询受害大语言模型 API 并使用渐进式 Trojan 中毒算法，在黑盒环境下生成通用且隐蔽的触发器。VPI [141] 针对指令调优的大语言模型，即在特定触发器下，使模型响应用户指令时，仿佛附加了攻击者指定的虚拟提示。Yang 等人 [142] 引入了一种后门攻击，该攻击在训练期间操纵大语言模型的不确定性校准，利用其置信度估计机制。这些方法通过改变训练动态实现了更强的后门注入，但其对修改训练过程的依赖限制了其实用性。

TABLE 4: A summary of attacks and defenses for LLMs (Part II)

表 4: 大语言模型的攻击与防御总结（第二部分）

3.7.3Parameter Modification

3.7.3 参数修改

This type of attack modifies model parameters directly to embed a backdoor, typically by targeting a small subset of neurons. One representative method is BadEdit [143] which treats backdoor injection as a lightweight knowledge-editing problem, using an efficient technique to modify LLM parameters with minimal data. Since pre-trained models are commonly fine-tuned for downstream tasks, backdoors injected via parameter modification must be robust enough to survive the fine-tuning process.

这种攻击通过直接修改模型参数来嵌入后门，通常针对一小部分神经元。代表性方法之一是 BadEdit [143]，它将后门注入视为一个轻量级的知识编辑问题，使用高效技术以最少的数据修改大语言模型参数。由于预训练模型通常会对下游任务进行微调，通过参数修改注入的后门必须足够鲁棒，以在微调过程中存活。

3.8 Backdoor Defenses

3.8 后门防御

This section reviews backdoor defense methods for LLMs, categorizing them into four types: 1) backdoor detection, 2) backdoor removal, 3) robust training, and 4) robust inference.

本节回顾了大语言模型的后门防御方法，将其分为四类：1）后门检测，2）后门移除，3）鲁棒训练，4）鲁棒推理。

3.8.1Backdoor Detection

3.8.1 后门检测

Backdoor detection identifies compromised inputs or models, flagging threats before they cause harm. Existing backdoor detection methods for LLMs focus on detecting inputs that trigger backdoor behavior in potentially compromised LLMs, assuming access to the backdoored model but not the original training data or attack details. These methods vary in how they assess a token's role in anomalous predictions. IMBERT [144] utilizes gradients and self-attention scores to identify key tokens that contribute to anomalous predictions. AttDef [145] highlights trigger words through attribution scores, identifying those with a large impact on false predictions. SCA [146] fine-tunes the model to reduce trigger sensitivity, ensuring semantic consistency despite the trigger. ParaFuzz [147] uses input paraphrasing and compares predictions to detect trigger inconsistencies. MDP [148] identifies critical backdoor modules and mitigates their impact by freezing relevant parameters during fine-tuning. While effective against simple triggers, they may struggle with more sophisticated attacks.

后门检测识别受攻击的输入或模型，在造成损害之前标记威胁。现有的针对大语言模型的后门检测方法主要关注检测在可能受损的大语言模型中触发后门行为的输入，假设可以访问被植入后门的模型，但无法获取原始训练数据或攻击细节。这些方法在评估Token在异常预测中的作用时有所不同。IMBERT [144]利用梯度和自注意力分数来识别导致异常预测的关键Token。AttDef [145]通过归因分数突出触发词，识别对错误预测影响较大的词。SCA [146]微调模型以减少对触发词的敏感性，确保尽管存在触发词，语义仍保持一致。ParaFuzz [147]使用输入重述并比较预测结果，以检测触发词的异常。MDP [148]识别关键的后门模块，并通过在微调过程中冻结相关参数来减轻其影响。尽管这些方法对简单的触发词有效，但在面对更复杂的攻击时可能表现不佳。

3.8.2Backdoor Removal

3.8.2 后门移除

Backdoor removal methods aim to eliminate or neutralize the backdoor behavior embedded in a compromised model. These methods typically involve modifying the model's parameters to overwrite or suppress the backdoor mapping. We can categorize these into two groups: Pruning and Fine-tuning.

后门移除方法旨在消除或中和嵌入在被入侵模型中的后门行为。这些方法通常涉及修改模型的参数以覆盖或抑制后门映射。我们可以将这些方法分为两类：剪枝和微调。

Pruning Methods aim to identify and remove model components responsible for backdoor behavior while preserving performance on clean inputs. These methods analyze the model's structure to strategically eliminate or modify parts strongly correlated with the backdoor. PCP Ablation [149] targets key modules for backdoor activation, replacing them with low-rank approximations to neutralize the backdoor's influence.

剪枝方法旨在识别并移除导致后门行为的模型组件，同时在干净输入上保持性能。这些方法通过分析模型的结构，策略性地消除或修改与后门强相关的部分。PCP消融 [149] 针对后门激活的关键模块，用低秩近似替换它们，以中和后门的影响。

Fine-tuning Methods aim to erase the malicious backdoor correlation by retraining the model on clean data. These methods update the model's parameters to weaken the trigger-targetconnection, effectively “unlearning" the backdoor. SANDE [150] directly overwrites the trigger-target mapping by fine-tuning on benign-output pairs, while CROW [152] and BEEAR [151] focus on enhancing internal consistency and counteracting embedding drift, respectively. Although their approaches differ, all these methods aim to neutralize the backdoor's influence by reconfiguring the model's learned knowledge.

微调方法旨在通过在干净数据上重新训练模型来消除恶意后门关联。这些方法通过更新模型的参数来削弱触发器与目标之间的连接，从而有效地“遗忘”后门。SANDE [150] 通过在良性输出对上微调直接覆盖触发器-目标映射，而 CROW [152] 和 BEEAR [151] 则分别侧重于增强内部一致性和抵消嵌入漂移。尽管这些方法有所不同，但它们的目标都是通过重新配置模型学到的知识来中和后门的影响。

3.8.3 Robust Training

3.8.3 鲁棒训练

Robust training methods enhance the training process to ensure the resulting model remains backdoor-free, even when exposed to backdoor-poisoned data. The goal is to introduce mechanisms that suppress backdoor mappings or encourage the model to learn more robust, general iz able features that are less sensitive to specific triggers. For example, Honeypot Defense [153] introduces a dedicated module during training to isolate and divert backdoor features from influencing the main model. Liu et al. [154] counteracted the minimal cross-entropy loss used in backdoor attacks by encouraging a uniform output distribution through maximum entropy loss. Wang et al. [155] proposed a training-time backdoor defense that removes duplicated trigger elements and mitigates backdoor-related memorization in LLMs. Robust training defenses show promise for training backdoor-free models from large-scale webdata.

鲁棒训练方法增强了训练过程，确保即使暴露于带有后门的数据，生成的模型也能保持无后门状态。其目标是引入机制来抑制后门映射，或鼓励模型学习更鲁棒、可泛化的特征，这些特征对特定触发器不太敏感。例如，Honeypot Defense [153] 在训练过程中引入了一个专用模块，用于隔离并转移后门特征，使其不影响主模型。Liu 等人 [154] 通过最大熵损失鼓励均匀输出分布，抵消了后门攻击中使用的最小交叉熵损失。Wang 等人 [155] 提出了一种训练时的后门防御方法，移除重复的触发器元素，并减轻大语言模型中的后门相关记忆。鲁棒训练防御方法在从大规模网络数据中训练无后门模型方面显示出前景。

3.8.4 Robust Inference

3.8.4 鲁棒推理

Robust inference methods focus on adjusting the inference process to reduce the impact of backdoors during text generation.

鲁棒推理方法侧重于调整推理过程，以减少文本生成过程中后门的影响。

Contrastive Decoding is a robust reference technique that contrasts the outputs of a potentially backdoored model with a clean reference model to identify and correct malicious outputs. For instance, Poison Share [156] uses intermediate layer representations in multi-turn dialogues to guide contrastive decoding, detecting and rectifying poisoned utterances. Similarly, CleanGen [157] replaces suspicious tokens with those predicted by a clean reference model to minimize the backdoor effect. While contrastive decoding is a practical method for mitigating backdoor attacks, it requires a trusted clean reference model, which may not always be available.

对比解码是一种鲁棒的参考技术，它通过对比可能被植入后门的模型与干净的参考模型的输出来识别和修正恶意输出。例如，Poison Share [156] 在多轮对话中使用中间层表示来指导对比解码，从而检测并修正被投毒的语句。类似地，CleanGen [157] 使用干净参考模型预测的 Token 替换可疑的 Token，以最小化后门效应。虽然对比解码是缓解后门攻击的一种实用方法，但它需要一个可信的干净参考模型，而这并不总是可用的。

3.9 Safety Alignment

3.9 安全对齐

The remarkable capabilities of LLMs present a unique challenge of alignment:how to ensure these models align with human values to avoid harmful behaviors, such as generating toxic content, spreading misinformation, or perpetuating biases. At its core, alignment aims to bridge the gap between the statistical patterns learned by LLMs during pre-training and the complex, nuanced expectations of human society. This section reviews existing works on alignment (and safety alignment) and summarizes them into three categories: 1) alignment with human feedback (known as RLHF), 2) alignment with AI feedback (known as RLAIF), and 3) alignment with social interactions.

大语言模型的显著能力带来了独特的对齐挑战：如何确保这些模型与人类价值观保持一致，以避免生成有害内容、传播错误信息或延续偏见等行为。对齐的核心目标是弥合大语言模型在预训练过程中学到的统计模式与人类社会复杂且细微的期望之间的差距。本节回顾了现有关于对齐（以及安全对齐）的工作，并将其总结为三类：1）基于人类反馈的对齐（称为 RLHF），2）基于 AI 反馈的对齐（称为 RLAIF），3）基于社会交互的对齐。

3.9.1Alignment with Human Feedback

3.9.1 与人类反馈的对齐

This strategy directly incorporates human preferences into the alignment process to shape the model's behavior. Existing RLHF methods can be further divided into: 1) proximal policy optimization, 2) direct preference optimization, 3) Kahneman-Tversky optimization, and 4) supervised fine-tuning.

该策略直接将人类偏好纳入对齐过程中，以塑造模型的行为。现有的 RLHF 方法可进一步分为：1) 近端策略优化，2) 直接偏好优化，3) Kahneman-Tversky 优化，以及 4) 监督微调。

Proximal Policy Optimization (PPO) uses human feedback as a reward signal to fine-tune LLMs, aligning model outputs with human preferences by maximizing the expected reward based on human evaluations. Instruct GP T [160] demonstrates its effectiveness in aligning models to follow instructions and generate high-quality responses. Refinements have further targeted stylistic control and creative generation [159]. Safe-RLHF [161] adds safety constraints to ensure outputs remain within acceptable boundaries while maximizing helpfulness. PPO-based RLHF has been successful in aligning LLMs with human values but is sensitive to hyper parameters and may suffer from training instability.

近端策略优化 (Proximal Policy Optimization, PPO) 使用人类反馈作为奖励信号来微调大语言模型，通过基于人类评估最大化预期奖励，使模型输出与人类偏好保持一致。Instruct GPT [160] 展示了其在使模型遵循指令并生成高质量响应方面的有效性。进一步的改进还针对了风格控制和创造性生成 [159]。Safe-RLHF [161] 添加了安全性约束，以确保输出保持在可接受范围内，同时最大化帮助性。基于PPO的RLHF在使大语言模型与人类价值观保持一致方面取得了成功，但对超参数敏感，可能会出现训练不稳定的问题。

Direct Preference Optimization (DPO) streamlines alignment by directly optimizing LLMs with human preference data, eliminating the need for a separate reward model. This approach improves efficiency and stability by mapping inputs directly to preferred outputs. Standard DPO [162], [163] optimizes the model to predict preference scores, ranking responses based on human preferences. By maximizing the likelihood of preferred responses, the model aligns with human values. MODPO [164] extends DPO to multi-objective optimization, balancing multiple preferences (e.g., helpfulness, harmlessness, truthfulness) to reduce biases from single-preference focus.

直接偏好优化 (Direct Preference Optimization, DPO) 通过直接利用人类偏好数据优化大语言模型，简化了对齐过程，无需单独构建奖励模型。该方法通过将输入直接映射到偏好输出，提高了效率和稳定性。标准 DPO [162], [163] 通过优化模型以预测偏好分数，根据人类偏好对响应进行排序。通过最大化偏好响应的可能性，模型与人类价值观保持一致。MODPO [164] 将 DPO 扩展到多目标优化，平衡多个偏好（例如，帮助性、无害性、真实性），以减少单一偏好带来的偏差。

Kahneman-Tversky Optimization (KTO) aligns models by distinguishing between likely (desirable) and unlikely (undesirable) outcomes, making it useful when undesirable outcomes are easier to define than desirable ones. KTO [165] uses a loss function based on prospect theory, penalizing the model more for generating unlikely continuations than rewarding it for likely ones. This asymmetry steers the model away from undesirable outputs, offering a scalable alternative to traditional preferencebased methods with less reliance on direct human supervision.

Kahneman-Tversky 优化 (KTO) 通过区分可能（理想）和不可能（不理想）的结果来对齐模型，这使得它在不理想结果比理想结果更容易定义时非常有用。KTO [165] 使用基于前景理论的损失函数，对生成不可能结果的惩罚比对生成可能结果的奖励更大。这种不对称性引导模型远离不理想输出，提供了一种可扩展的替代方案，减少了对直接人类监督的依赖。

Supervised Fine-Tuning (SFT) emphasizes the importance of high-quality, curated datasets to align models by training them on examples of desired outputs. LIMA [166] shows that a small, well-curated dataset can achieve strong alignment with powerful pre-trained models, suggesting that focusing on style and format in limited examples may be more effective than large datasets. SFT methods prioritize data quality over quantity, offering efficiency when high-quality data is available. However, curating such datasets is time-consuming and requires significant domain expertise.

监督微调 (Supervised Fine-Tuning, SFT) 强调高质量、精心策划的数据集的重要性，通过对期望输出的示例进行训练来对齐模型。LIMA [166] 表明，一个精心策划的小数据集可以与强大的预训练模型实现良好的对齐，这表明在有限的示例中关注风格和格式可能比大规模数据集更有效。SFT 方法优先考虑数据质量而非数量，在高质量数据可用时提供效率。然而，策划这样的数据集耗时且需要大量的领域专业知识。

3.9.2Alignment with Al Feedback

3.9.2 与 Al 反馈的对齐

To overcome the s cal ability limitations and potential biases of relying solely on human feedback, RLAIF methods utilize AIgenerated feedback to guide the alignment.

为了克服仅依赖人类反馈的可扩展性限制和潜在偏见，RLAIF 方法利用 AI 生成的反馈来指导对齐。

Proximal Policy Optimization These RLAIF methods adapt the PPO algorithm to incorporate AI-generated feedback, automating the process for scalable alignment and reducing human labor. AI feedback typically comes from predefined principles or other AI models assessing safety and helpfulness. Constitutional AI (CAI) [167] uses AI self-critiques based on predefined principles to promote harmlessness. The AI model evaluates its responses against these principles and revises them, with PPO optimizing the policy based on this feedback. SELF-ALIGN [168] employs principle-driven reasoning and LLM generative capabilities to align models with human values. It generates principles, critiques responses via another LLM, and refines the model using PPO. RLCD [169] generates diverse preference pairs using contrasting prompts to train a preference model, which then provides feedback for PPO-based fine-tuning.

近端策略优化 (Proximal Policy Optimization)
这些 RLAIF 方法通过调整 PPO 算法，整合了 AI 生成的反馈，实现了可扩展的对齐过程自动化，并减少了人力需求。AI 反馈通常来自预定义的原则或其他评估安全性和帮助性的 AI 模型。宪法 AI (Constitutional AI, CAI) [167] 使用基于预定义原则的 AI 自我批评来促进无害性。AI 模型根据这些原则评估其响应并进行修订，PPO 则基于此反馈优化策略。SELF-ALIGN [168] 采用原则驱动的推理和大语言模型的生成能力，使模型与人类价值观保持一致。它生成原则，通过另一个大语言模型批评响应，并使用 PPO 优化模型。RLCD [169] 使用对比提示生成多样化的偏好对，训练偏好模型，然后为基于 PPO 的微调提供反馈。

3.9.3 与社会互动的对齐

These methods use simulated environments to train LLMs to align with social norms and constraints, not just individual preferences. They typically employ Contrastive Policy Optimization (CPO) within these simulated settings.

这些方法使用模拟环境来训练大语言模型，使其与社会规范和约束保持一致，而不仅仅是个体偏好。它们通常在这些模拟环境中采用对比策略优化（Contrastive Policy Optimization, CPO）。

Contrastive Policy Optimization Stable Alignment [170] uses rule-based simulated societies to train LLMs with CPO. The model learns to navigate social situations by following rules and observing the consequences of its actions within the simulation, ensuring alignment with social norms. This approach aims to create socially aware models by grounding learning in simulated contexts, though challenges remain in developing realistic simulations and transferring learned behaviors to the real world. Monopoly logue-based Social Scene Simulation [171] introduces MATRIX, a framework where LLMs self-generate social scenarios and play multiple roles to understand the consequences of their actions. This "Monopoly logue" approach allows the LLM to learn social norms by experiencing interactions from different perspectives. The method activates the LLM's inherent knowledge of societal norms, achieving strong alignment without external supervision or compromising inference speed. Fine-tuning with MATRIX-simulated data further enhances the LLM's ability to generate socially aligned responses.

对比策略优化稳定对齐[170]使用基于规则的模拟社会来训练大语言模型（LLM）与CPO。模型通过学习规则并观察其行为在模拟中的结果，确保与社会规范对齐。该方法旨在通过在模拟环境中进行学习，创建具有社会意识的模型，尽管在开发逼真模拟和将学习到的行为转移到现实世界方面仍存在挑战。基于垄断对话的社会场景模拟[171]引入了MATRIX框架，大语言模型在其中自发生成社会场景并扮演多个角色，以理解其行为的后果。这种“垄断对话”方法使大语言模型通过从不同角度体验互动来学习社会规范。该方法激活了大语言模型对社会规范的内隐知识，无需外部监督或影响推理速度即可实现强对齐。使用MATRIX模拟数据进行微调进一步增强了LLM生成社会对齐响应的能力。

3.10 Energy Latency Attacks

3.10 能量延迟攻击

Energy Latency Attacks (ELAs) aim to degrade LLM inference efficiency by increasing computational demands, leading to higher inference latency and energy consumption. Existing ELAs can be categorized into 1) white-box attacks and 2) black-box attacks.

能量延迟攻击 (Energy Latency Attacks, ELAs) 旨在通过增加计算需求来降低大语言模型推理效率，从而导致更高的推理延迟和能量消耗。现有的能量延迟攻击可分为两类：1) 白盒攻击和 2) 黑盒攻击。

3.10.1 White-box Attacks

3.10.1 白盒攻击

White-box attacks assume the attacker has full knowledge of the model, enabling precise manipulation of the model's inference process. These attacks can be further divided into gradient-based attacks and query-based attacks which can also be black-box.

白盒攻击假设攻击者完全了解模型，能够精确操纵模型的推理过程。这些攻击可以进一步分为基于梯度的攻击和基于查询的攻击，后者也可以是黑盒攻击。

Gradient-based Attacks use gradient information to identify input perturbations that maximize inference computations. The goal is to disrupt mechanisms essential for efficient inference, such as End-of-Sentence (EOS) prediction or early-exit. For example, NMTSloth [172] targets EOS prediction in neural machine translation. SAME [173] interferes with early-exit in multi-exit models. LL ME ff Checker [174] applies gradient-based techniques to multiple LLMs. TTSlow [175] induces endless speech generation in text-to-speech systems. These attacks are powerful but comput ation ally expensive and highly model-specific, limiting their general iz ability.

基于梯度的攻击利用梯度信息来识别最大化推理计算的输入扰动。其目标是破坏高效推理所必需的机制，例如句尾 (EOS) 预测或提前退出。例如，NMTSloth [172] 针对神经机器翻译中的 EOS 预测。SAME [173] 干扰多出口模型中的提前退出。LL ME ff Checker [174] 将基于梯度的技术应用于多个大语言模型。TTSlow [175] 在文本转语音系统中引发无尽的语音生成。这些攻击虽然强大，但计算成本高且高度依赖于特定模型，限制了其通用性。

3.10.2Black-box Attacks

3.10.2 黑盒攻击

Black-box attacks do not require access to model internals, only the input-output interface. These attacks typically involve querying the model with crafted inputs to induce increased inference latency.

黑盒攻击不需要访问模型内部，只需利用输入输出接口。此类攻击通常涉及通过精心设计的输入查询模型，以诱导推理延迟增加。

Query-based Attacks exploit specific model behaviors without internal access, relying on repeated querying to craft adversarial examples. No-Skim [176] disrupts skimming-based models by subtly perturbing inputs to maximize retained tokens. NoSkim is ineffective against models that do not rely on skimming. Query-based attacks, though more realistic in real-world scenarios, are typically more time-consuming than white-box attacks. Poisoning-based Attacks manipulate model behavior by injecting malicious training samples. P-DoS [177] shows that a single poisoned sample during fine-tuning can induce excessively long outputs, increasing latency and bypassing output length constraints, even with limited access like fine-tuning APIs.

基于查询的攻击利用特定模型行为，无需内部访问，通过重复查询来制作对抗样本。No-Skim [176] 通过微妙地扰动输入来最大化保留的Token，从而破坏基于略读的模型。对于不依赖略读的模型，NoSkim 无效。基于查询的攻击虽然在现实场景中更为实际，但通常比白盒攻击更耗时。基于投毒的攻击通过注入恶意训练样本来操控模型行为。P-DoS [177] 表明，在微调过程中，单个投毒样本可以导致输出过长，增加延迟并绕过输出长度限制，即使访问权限有限（如微调API）也能实现。

ELAs present an emerging threat to LLMs. Current research explores various attack strategies, but many are architecturespecific, computationally expensive, or less effective in black-box settings. Existing defenses, such as runtime input validation, can add overhead. Future research could focus on developing more generalized and effcient attacks and defenses that apply across diverse LLMs and deployment scenarios.

ELAs 对大语言模型构成了新兴威胁。当前的研究探索了多种攻击策略，但许多都是针对特定架构、计算成本高或在黑盒环境中效果较差。现有的防御措施，如运行时输入验证，可能会增加开销。未来的研究可以专注于开发更通用且高效的攻击和防御方法，这些方法适用于多种大语言模型和部署场景。

3.11 Model Extraction Attacks

3.11 模型提取攻击

Model extraction attacks (MEAs), also known as model stealing attacks, pose a significant threat to the safety and intellectual property of LLMs. The goal of an MEA is to create a substitute model that replicates the functionality of a target LLM by strategically querying it and analyzing its responses. Existing MEAs on LLMs can be categorized into two types: 1) fine-tuning stage attacks, and 2) alignment stage attacks.

模型提取攻击（MEAs），也称为模型窃取攻击，对大语言模型的安全性和知识产权构成了重大威胁。MEA的目标是通过策略性地查询目标大语言模型并分析其响应，创建一个替代模型来复制其功能。现有的大语言模型MEAs可分为两类：1) 微调阶段攻击，2) 对齐阶段攻击。

3.11.1 Fine-tuning Stage Attacks

3.11.1 微调阶段攻击

Fine-tuning stage attacks aim to extract knowledge from fine-tuned LLMs for downstream tasks. These attacks can be divided into two categories: functional similarity extraction and 2) specific ability extraction.

微调阶段攻击旨在从为下游任务微调的大语言模型中提取知识。这些攻击可分为两类：功能相似性提取和特定能力提取。

Functional Similarity Extraction seeks to replicate the overall behavior of the target fine-tuned model. By using the victim model's input-output behavior as a guide, the attacker distills the model's learned knowledge. For example, LION [178] uses the victim model as a referee and generator to iterative ly improve a student model's instruction-following capability.

功能相似性提取旨在复现目标微调模型的整体行为。攻击者以受害模型的输入输出行为为引导，提炼模型所学的知识。例如，LION [178] 使用受害模型作为裁判和生成器，逐步提升学生模型的指令执行能力。

Specific Ability Extraction targets the extraction of specific skills or knowledge the fine-tuned model has acquired. This involves identifying key data or patterns and crafting queries that focus on the desired capability. Li et al. [179] demonstrated this by extracting coding abilities from black-box LLM APIs using carefully crafted queries. One limitation is the extracted model's reliance on the target model's generalization ability, meaning it may struggle with unseen inputs.

特定能力提取针对的是微调模型所获得的特定技能或知识的提取。这包括识别关键数据或模式，并设计专注于所需能力的查询。Li 等人 [179] 通过使用精心设计的查询从黑箱大语言模型 API 中提取编码能力，展示了这一点。一个局限性在于提取的模型依赖于目标模型的泛化能力，这意味着它可能在处理未见过的输入时遇到困难。

3.11.2 Alignment Stage Attacks

3.11.2 对齐阶段攻击

Alignment stage attacks attempt to extract the alignment properties (e.g., safety, helpfulness) of the target LLM. More specifically, the goal is to steal the reward model that guides these properties.

对齐阶段攻击尝试提取目标大语言模型的对齐属性（例如，安全性、有用性）。更具体地说，目标是窃取指导这些属性的奖励模型。

Functional Similarity Extraction focuses on replicating the target model's alignment preferences. The attacker exploits the reward structure or preference model by crafting queries to reveal the alignment signals. LoRD [180] exemplifies this by using a policy-gradient approach to extract both task-specific knowledge and alignment properties. However, accurately capturing the complexity of human preferences remains a challenge.

功能相似性提取侧重于复制目标模型的对齐偏好。攻击者通过精心设计的查询来利用奖励结构或偏好模型，以揭示对齐信号。LoRD [180] 通过使用策略梯度方法来提取任务特定知识和对齐属性，展示了这一方法。然而，准确捕捉人类偏好的复杂性仍然是一个挑战。

Model extraction attacks are a rapidly evolving threat to LLMs. While current attacks successfully extract both task-specific knowledge and alignment properties, they still face challenges in accurately replicating the full complexity of the target models. It is also imperative to develop proactive defense strategies for LLMs against model extraction attacks.

模型提取攻击对大语言模型的威胁日益加剧。尽管当前攻击能够成功提取任务特定知识和对齐属性，但在准确复制目标模型全部复杂性方面仍面临挑战。开发针对模型提取攻击的主动防御策略也势在必行。

3.12 Data Extraction Attacks

3.12 数据提取攻击

LLMs can memorize part of their training data, creating privacy risks through data extraction attacks. These attacks recover training examples, potentially exposing sensitive information such as Personal Identifiable Information (PIl), copyrighted content, or confidential data. This section reviews existing data extraction attacks on LLMs, including both white-box and black-box attacks.

大语言模型 (LLMs) 能够记忆部分训练数据，从而通过数据提取攻击带来隐私风险。这些攻击可以恢复训练样本，可能暴露敏感信息，如个人身份信息 (PIl)、受版权保护的内容或机密数据。本节回顾了现有的大语言模型数据提取攻击，包括白盒攻击和黑盒攻击。

3.12.1White-box Attacks

3.12.1 白盒攻击

White-box attacks focus on Latent Memorization Extraction, targeting information implicitly stored in model parameters or activation s, which is not directly accessible through the inputoutput interface.

白盒攻击侧重于潜在记忆提取 (Latent Memorization Extraction)，针对的是隐式存储在模型参数或激活中的信息，这些信息无法直接通过输入输出接口访问。

Latent Memorization Extraction Duan et al. [191] developed techniques to extract latent data by analyzing internal representations, using methods like adding noise to weights or examining cross-entropy loss. These techniques were demonstrated on LLMs like Pythia-1B and Amber-7B. While these attacks reveal risks associated with internal data representation, they require full access to the model parameters, which remains a major limitation.

潜在记忆提取 Duan 等人 [191] 开发了通过分析内部表示来提取潜在数据的技术，使用的方法包括向权重添加噪声或检查交叉熵损失。这些技术在 Pythia-1B 和 Amber-7B 等大语言模型上进行了演示。虽然这些攻击揭示了与内部数据表示相关的风险，但它们需要完全访问模型参数，这仍然是一个主要限制。

3.12.2Black-box Attacks

3.12.2 黑盒攻击

Black-box data extraction attacks are a realistic threat in which the attacker crafts inductive prompts to trick the LLM into revealing memorized training data, without access to the model's parameters.

黑盒数据提取攻击是一种现实威胁，攻击者通过设计诱导性提示，诱使大语言模型泄露记忆的训练数据，而无需访问模型参数。

Prefix Attacks exploit the auto regressive nature of LLMs by providing a “prefix" from a memorized sequence, hoping the model will continue it. Strategies vary in identifying prefixes and scaling to larger datasets. Carlini et al. [181] demonstrated this on models like GPT-2, while Nasr et al. [183] scaled prefix attacks using suffix arrays. Magpie [184] and Al-Kaswan et al. [185] targeted specific data, such as Pll or code. Yu et al. [190] enhanced black-box data extraction by optimizing text continuation generation and ranking. They introduced techniques like diverse sampling strategies (Top-k, Nucleus), probability adjustments (temperature, repetition penalty), dynamic context windows, look-ahead mechanisms, and improved suffx ranking (Zlib, high-confidence tokens).

前缀攻击通过提供记忆序列中的“前缀”来利用大语言模型的自回归特性，期望模型能够继续生成后续内容。策略在识别前缀和扩展到更大数据集方面有所不同。Carlini 等人 [181] 在 GPT-2 等模型上展示了这一点，而 Nasr 等人 [183] 则使用后缀数组扩展了前缀攻击。Magpie [184] 和 Al-Kaswan 等人 [185] 针对特定数据（如 Pll 或代码）进行了攻击。Yu 等人 [190] 通过优化文本延续生成和排序，增强了黑盒数据提取的效果。他们引入了多样化的采样策略（Top-k, Nucleus）、概率调整（温度, 重复惩罚）、动态上下文窗口、前瞻机制以及改进的后缀排序（Zlib, 高置信度 Token）等技术。

Special Character Attack exploits the model's sensitivity to special characters or unusual input formatting, potentially triggering unexpected behavior that reveals memorized data. SCA [186] demonstrates that specific characters can indeed induce LLMs to disclose training data. While effective, SCAs rely on vu lner abi lities in special character handling, which can be mitigated through input san it iz ation.

特殊字符攻击利用模型对特殊字符或异常输入格式的敏感性，可能触发意外行为，从而暴露记忆数据。SCA [186] 表明，特定字符确实可以诱导大语言模型泄露训练数据。虽然有效，但特殊字符攻击依赖于特殊字符处理中的漏洞，这些问题可以通过输入清理来缓解。

Prompt Optimization employs an “attacker" LLM to generate optimized prompts that extract data from a “victim" LLM. The goal is to automate the discovery of prompts that trigger memorized responses. Kassem et al. [187] demonstrated this by using an attacker LLM with iterative rejection sampling and longest common sub sequence (LCS) for optimization. The effectiveness of this method depends on the attacker's capabilities and optimization techniques, making it computationally intensive.

Prompt Optimization 使用一个“攻击者”大语言模型来生成优化的提示，以从“受害者”大语言模型中提取数据。其目标是自动发现触发记忆响应的提示。Kassem 等人 [187] 通过使用攻击者大语言模型，结合迭代拒绝采样和最长公共子序列 (LCS) 进行优化，展示了这一方法。该方法的有效性取决于攻击者的能力和优化技术，因此计算量较大。

Retrieval-Augmented Generation (RAG) Extraction targets RAG systems, aiming to leak sensitive information from the retrieval component. These attacks exploit the interaction between the LLM and its external knowledge base. Qi et al. [188] demonstrated that adversarial prompts can trigger data leakage in RAG systems. Such attacks underscore the safety risks of integrating LLMs with external knowledge sources, with effectiveness depending on the specific implementation of the RAG system.

检索增强生成 (Retrieval-Augmented Generation, RAG) 提取针对 RAG 系统，旨在从检索组件中泄露敏感信息。这些攻击利用了大语言模型与其外部知识库之间的交互。Qi 等人 [188] 证明了对抗性提示可以触发 RAG 系统中的数据泄露。此类攻击突显了将大语言模型与外部知识源集成的安全风险，其有效性取决于 RAG 系统的具体实现。

Ensemble Attack combines multiple attack strategies to enhance effectiveness, leveraging the strengths of each method for higher success rates. More et al. [189] demonstrated the effectiveness of such an ensemble approach on Pythia. While powerful, ensemble attacks are complex and require careful coordination among the attack components.

集成攻击通过结合多种攻击策略来提升效果，利用每种方法的优势以提高成功率。More 等人 [189] 展示了这种集成方法在 Pythia 上的有效性。尽管强大，但集成攻击较为复杂，需要攻击组件之间的精心协调。

TABLE 5: Datasets and benchmarks for LLM safety research.

表 5: 大语言模型安全研究中的数据集和基准。

数据集	年份	大小	引用次数
RealToxicityPrompts[426]	2020	100K	135
TruthfulQA[427]	2021	817	213
AdvGLUE[428]	2021	5,716	12
SafetyPrompts[429]	2023	100K	15
DoNotAnswer [430]	2023	939	6
AdvBench[91]	2023	520	52
CVALUES [431]	2023	2,100	10
FINE [432]	2023	90	14
FLAMES[433]	2024	2,251	17
SORRYBench[434]	2024	450	8
SafetyBench[435]	2024	11,435	21
SALAD-Bench[436]	2024	30K	36
BackdoorLLM [437]	2024	8	6
JailBreakV-28K[438]	2024	28K	10
STRONGREJECT[439]	2024	313	4
Libra-Leaderboard[440]	2024	57	26

3.13 Datasets & Benchmarks

3.13 数据集与基准测试

This section reviews commonly used datasets and benchmarks in LLM safety research, as shown in Table 5. These datasets and benchmarks are categorized based on their evaluation purpose: toxicity datasets，truthfulness datasets,value benchmarks,and adversarial datasets and backdoor benchmarks.

本节回顾了大语言模型安全研究中常用的数据集和基准，如表 5 所示。这些数据集和基准根据其评估目的进行分类：毒性数据集、真实性数据集、价值观基准、对抗性数据集和后门基准。

3.13.1 Toxicity Datasets

3.13.1 毒性数据集

Ensuring LLMs do not generate harmful content is crucial for safety. Early work, such as the Real Toxicity Prompts dataset [426], exposed the tendency of LLMs to produce toxic text from benign prompts. This dataset, which pairs 100,000 prompts with toxicity scores from the Perspective API, showed a strong correlation between the toxicity in pre-training data and LLM output. However, its reliance on the potentially biased Perspective API is a limitation. To address broader harmful behaviors, the Do-NotAnswer [430] dataset was introduced. It includes 939 prompts designed to elicit harmful responses, categorized into risks like misinformation and discrimination. Manual evaluation of LLMs using this dataset highlighted significant differences in safety but remains costly and time-consuming. A recent approach [441] introduces a crowd-sourced toxic question and response dataset, with annotations from both humans and LLMs. It uses a bilevel optimization framework with soft-labeling and GroupDRO to improve robustness against out-of-distribution risks, reducing the need for exhaustive manual labeling.

确保大语言模型不生成有害内容对于安全性至关重要。早期工作，如 Real Toxicity Prompts 数据集 [426]，揭示了大语言模型从良性提示中产生有毒文本的倾向。该数据集将 10 万条提示与 Perspective API 的毒性评分配对，显示了预训练数据中的毒性与大语言模型输出之间的强相关性。然而，它对可能存在偏见的 Perspective API 的依赖是一个限制。为了解决更广泛的有害行为，引入了 Do-NotAnswer 数据集 [430]。它包含 939 条旨在引发有害反应的提示，分为错误信息和歧视等风险类别。使用该数据集对大语言模型进行手动评估，突出了安全性的显著差异，但仍然成本高昂且耗时。最近的一种方法 [441] 引入了众包的有毒问题和响应数据集，其中包含人类和大语言模型的注释。它使用双层优化框架，结合软标签和 GroupDRO，提高了对分布外风险的鲁棒性，减少了对详尽手动标注的需求。

3.13.2 Truthfulness Datasets

3.13.2 真实性数据集

Ensuring LLMs generate truthful information is also essential. The TruthfulQA benchmark [427] evaluates whether LLMs provide accurate answers to 817 questions across 38 categories, specifically targeting "imitative falsehoods"—false answers learned from human text. Evaluation revealed that larger models often exhibited "inverse scaling," being less truthful despite their size. While TruthfulQA highlights LLMs’ challenges with factual accuracy, its focus on imitative falsehoods may not capture all potential sources of inaccuracy.

确保大语言模型生成真实信息也至关重要。TruthfulQA 基准 [427] 评估了大语言模型是否能够准确回答 38 个类别中的 817 个问题，特别针对“模仿性虚假信息”——从人类文本中学到的错误答案。评估显示，更大的模型经常表现出“逆向扩展”，尽管规模更大，但真实性却更低。虽然 TruthfulQA 强调了大语言模型在事实准确性方面的挑战，但其对模仿性虚假信息的关注可能无法捕捉到所有潜在的不准确性来源。

3.13.3 Value Benchmarks

3.13.3 价值基准

Ensuring LLM alignment with human values is a critical challenge, addressed by several benchmarks assessing various aspects of safety, fairness, and ethics. FLAMES [433] evaluates the alignment of Chinese LLMs with values like fairness, safety, and morality through 2,251 prompts. SORRY-Bench [434] assesses LLMs’ ability to reject unsafe requests using 45 topic categories, while CVALUES [431] focuses on both safety and responsibility. Safety Prompts [429] evaluates Chinese LLMs on a range of ethical scenarios. Despite their value, these benchmarks are limited by the manual annotation process. Additionally, the concept of “fake alignment" [432] highlights the risk of LLMs superficially memorizing safety answers, leading to the Fake alIgNment Evaluation (FINE) framework for consistency assessment. Safety Bench [435] addresses this by providing an efficient, automated multiple-choice benchmark for LLM safety evaluation. Libra-Leader board [440] introduces a balanced leader board for evaluating both the safety and capability of LLMs. It features a comprehensive safety benchmark with 57 datasets covering diverse safety dimensions, a unified evaluation framework, an interactive safety arena for adversarial testing, and a balanced scoring system. Libra-Leader board promotes a holistic approach to LLM evaluation, representing a significant step towards responsible AI development.

确保大语言模型与人类价值观对齐是一个关键挑战，已有多个基准测试从安全、公平和伦理等多个方面进行评估。FLAMES [433] 通过 2,251 个提示评估中文大语言模型在公平性、安全性和道德等方面的对齐情况。SORRY-Bench [434] 使用 45 个主题类别评估大语言模型拒绝不安全请求的能力，而 CVALUES [431] 则重点关注安全性和责任性。Safety Prompts [429] 在一系列伦理场景下评估中文大语言模型。尽管这些基准测试具有价值，但它们受到手动标注过程的限制。此外，“虚假对齐” [432] 的概念突显了大语言模型表面记忆安全答案的风险，这促使了 Fake alIgNment Evaluation (FINE) 框架的提出，用于一致性评估。Safety Bench [435] 通过提供一个高效、自动化的多选题基准测试来解决这一问题，用于大语言模型的安全评估。Libra-Leader board [440] 引入了一个平衡的排行榜，用于评估大语言模型的安全性和能力。它包含一个综合性的安全基准测试，涵盖 57 个数据集，覆盖多样化的安全维度，一个统一的评估框架，一个用于对抗测试的互动安全竞技场，以及一个平衡的评分系统。Libra-Leader board 推动了大语言模型评估的整体方法，代表了向负责任的人工智能开发迈出的重要一步。

3.13.4 Adversarial Datasets and Backdoor Benchmarks

3.13.4 对抗数据集与后门基准测试

Backdoor LL M [437] is the first benchmark for evaluating backdoor attacks in text generation, offering a standardized framework that includes diverse attack strategies like data poisoning and weight poisoning. Adversarial GLUE [428] assesses LLM robustness against textual attacks using 14 methods, highlighting vulnerabilities even in robustly trained models. SALAD-Bench [436] expands on this by introducing a safety benchmark with a taxonomy of risks, including attack- and defense-enhanced questions. JailBreakV-28K [438] focuses on evaluating multi-modal LLMs against jailbreak attacks using text- and image-based test cases. A STRONG REJECT for empty jailbreaks [439] improves jailbreak evaluation with a higher-quality dataset and automated assessment. Despite their value, these benchmarks face challenges in s cal ability, consistency, and real-world relevance.

Backdoor LL M [437] 是首个用于评估文本生成中后门攻击的基准，提供了一个标准化框架，包括数据投毒和权重投毒等多样化的攻击策略。Adversarial GLUE [428] 通过14种方法评估大语言模型在文本攻击下的鲁棒性，揭示了即使在经过鲁棒训练的模型中仍存在的脆弱性。SALAD-Bench [436] 在此基础上扩展，引入了一个包含风险分类的安全基准，包括攻击和防御增强的问题。JailBreakV-28K [438] 专注于评估多模态大语言模型在基于文本和图像的越狱攻击测试用例下的表现。A STRONG REJECT for empty jailbreaks [439] 通过更高质量的数据集和自动化评估改进了越狱评估。尽管这些基准具有重要价值，但它们在可扩展性、一致性和现实相关性方面仍面临挑战。

4 VISION-LANGUAGE PRE-TRAINING MODEL SAFETY

4 视觉-语言预训练模型安全性

VLP models, such as CLIP [442], ALBEF [443], and TCL [444], have made significant strides in aligning visual and textual modalities. However, these models remain vulnerable to various safety threats, which have garnered increasing research attention. This section reviews the current safety research on VLP models, with a focus on adversarial, backdoor, and poisoning research. The representative methods reviewed in this section are summarized in Table 6.

诸如 CLIP [442]、ALBEF [443] 和 TCL [444] 等 VLP 模型在视觉和文本模态对齐方面取得了显著进展。然而，这些模型仍然容易受到各种安全威胁的影响，这已引起越来越多的研究关注。本节回顾了当前关于 VLP 模型的安全研究，重点是对抗性、后门和投毒研究。本节回顾的代表性方法总结在表 6 中。

4.1 Adversarial Attacks

4.1 对抗攻击

Since VLP models are widely used as backbones for fine-tuning downstream models, adversarial attacks on VLP aim to generate examples that cause incorrect predictions across various downstream tasks, including zero-shot image classification, imagetext retrieval, visual entailment, and visual grounding. Similar to Section 2, these attacks can roughly be categorized into white-box attacks and black-box attacks, based on their threat models.

由于 VLP 模型被广泛用作下游模型微调的骨干网络，针对 VLP 的对抗攻击旨在生成在各种下游任务中导致错误预测的样本，包括零样本图像分类、图像文本检索、视觉蕴含和视觉定位。与第 2 节类似，这些攻击根据其威胁模型大致可以分为白盒攻击和黑盒攻击。

TABLE 6: A summary of attacks and defenses for VLP models.

表 6: VLP 模型攻击与防御的总结

4.1.1 White-box Attacks

4.1.1 白盒攻击

White-box adversarial attacks on VLP models can be further categorized based on perturbation types into invisible perturbations and visible perturbations, with the majority of existing attacks employing invisible perturbations.

白盒对抗攻击在 VLP 模型上可基于扰动类型进一步分为不可见扰动和可见扰动，现有的大多数攻击都采用不可见扰动。

Invisible Perturbations involve small, imperceptible adversarial changes to inputs——-whether text or images—to maintain the stealth in ess of attacks. Early research in the vision and language domains primarily adopts this approach [60], [445]-[447], in which invisible attacks are developed independently. In the context of VLP models, which integrate both modalities, Co-Attack [192] was the first to propose perturbing both visual and textual inputs simultaneously to create stronger attacks. Building on this, AdvCLIP [193] explores universal adversarial perturbations that can deceive all downstream tasks.

隐形扰动涉及对输入（无论是文本还是图像）进行微小且不易察觉的对抗性更改，以保持攻击的隐蔽性。视觉和语言领域的早期研究主要采用这种方法 [60], [445]-[447]，其中隐形攻击是独立开发的。在整合了两种模态的 VLP 模型背景下，Co-Attack [192] 首次提出同时扰动视觉和文本输入以创建更强的攻击。在此基础上，AdvCLIP [193] 探索了可以欺骗所有下游任务的通用对抗扰动。

Visible Perturbations involve more substantial and noticeable alterations. For example, manually crafted typographical, conceptual, and icon o graphic images have been used to demonstrate that the CLIP model tends to “read first, look later" [194], highlighting a unique characteristic of VLP models. This behavior introduces new attack surfaces for VLP, enabling the development of more sophisticated attacks.

可见扰动涉及更大规模和更显著的改变。例如，手动制作的排版、概念和图标图像已被用于展示 CLIP 模型倾向于“先读后看”[194]，这突出了 VLP (Vision-Language Pretraining) 模型的独特特性。这种行为为 VLP 引入了新的攻击面，使得开发更复杂的攻击成为可能。

4.1.2Black-box Attacks

4.1.2 黑盒攻击

Black-box attacks on VLP primarily adopt a transfer-based approach, with query-based attacks rarely explored. Existing methods can be categorized into: 1) sample-specific perturbations, tailored to individual samples, and 2) universal perturbations, applicable across multiple samples.

对视觉-语言预训练（VLP）的黑盒攻击主要采用基于迁移的方法，基于查询的攻击方法较少被探索。现有方法可分为两类：1) 样本特定的扰动，针对单个样本进行定制；2) 通用扰动，适用于多个样本。

Sample-wise perturbations are generally more effective than universal perturbations, but their transfer ability is often limited.

样本级扰动通常比通用扰动更有效，但其迁移能力往往有限。

SGA [195] explores adversarial transfer ability in VLP by leveraging cross-modal interactions and alignment-preserving augmentation. Building on this, SA-Attack [196] enhances cross-modal transfer ability by introducing data augmentations to both original and adversarial inputs. VLP-Attack [197] improves transfer a bility by generating adversarial texts and images using contrastive loss. To overcome SGA's limitations, TMM [198] introduces modality-consistency and discrepancy features through attentionbased and orthogonal-guided perturbations. VLATTACK [199] further enhances adversarial examples by combining image and text perturbations at both single-modal and multimodal levels. PRM [200] targets vulnerabilities in downstream models using foundation models like CLIP, enabling transferable attacks across tasks like object detection and image captioning.

SGA [195] 通过利用跨模态交互和对齐保持增强，探索了 VLP 中的对抗迁移能力。在此基础上，SA-Attack [196] 通过对原始输入和对抗输入引入数据增强，提升了跨模态迁移能力。VLP-Attack [197] 通过使用对比损失生成对抗文本和图像，改进了迁移能力。为了克服 SGA 的局限性，TMM [198] 通过基于注意力和正交引导的扰动引入了模态一致性和差异性特征。VLATTACK [199] 通过在单模态和多模态层次上结合图像和文本扰动，进一步增强了对抗样本。PRM [200] 利用 CLIP 等基础模型，针对下游模型的漏洞进行攻击，实现了跨任务的可迁移攻击，如目标检测和图像描述。

Universal Perturbations are less effective than sample-wise perturbations but more transferable. C-PGC [201] was the first to investigate universal adversarial perturbations (UAPs) for VLP models. It employs contrastive learning and cross-modal information to disrupt the alignment of image-text embeddings, achiev- ing stronger attacks in both white-box and black-box scenarios. ETU [202] builds on this by generating UAPs that transfer across multiple VLP models and tasks. ETU enhances UAP transfer ability and effectiveness through improved global and local optimization techniques. It also introduces a data augmentation strategy ScMix that combines self-mix and cross-mix operations to increase data diversity while preserving semantic integrity, further boosting the robustness and applicability of UAPs.

通用扰动虽然比样本扰动效果弱，但更具可迁移性。C-PGC [201] 首次研究了视觉-语言预训练（VLP）模型的通用对抗扰动（UAPs）。它采用对比学习和跨模态信息来破坏图像-文本嵌入的对齐，在白盒和黑盒场景中实现了更强的攻击。ETU [202] 在此基础上生成了可跨多个 VLP 模型和任务迁移的 UAPs。ETU 通过改进的全局和局部优化技术，增强了 UAP 的迁移能力和效果。它还引入了一种数据增强策略 ScMix，结合自混合和交叉混合操作，在保持语义完整性的同时增加数据多样性，进一步提升了 UAP 的鲁棒性和适用性。

4.2 Adversarial Defenses

4.2 对抗防御

Existing adversarial defenses for VLP models can be grouped into four types: 1) adversarial example detection, 2) standard adversarial training, 3) adversarial prompt tuning, and 4) adversarial contrastive tuning. While adversarial detection filters out potential adversarial examples before or during inference, the other three defenses follow similar adversarial training paradigms, with variations in efficiency.

现有的 VLP 模型对抗防御可以分为四种类型：1) 对抗样本检测，2) 标准对抗训练，3) 对抗提示调优，4) 对抗对比调优。对抗检测在推理前或推理期间过滤掉潜在的对抗样本，而其他三种防御遵循类似的对抗训练范式，但在效率上有所不同。

4.2.1 Adversarial Example Detection

4.2.1 对抗样本检测

Adversarial detection methods for VLP can befurther divided into one-shot detection and state ful detection.

针对VLP的对抗检测方法可以进一步分为一次性检测和有状态检测。

4.2.1.1 One-shot Detection

4.2.1.1 单样本检测

One-shot Detection distinguishes adversarial from clean examples in a single forward pass. White-box detection methods are typically one-shot. For example, Mirror Check [217] is a modelagnostic method for VLP models. It uses text-to-image (T21) models to generate images from captions produced by the victim model, comparing the similarity between the input image and the generated image using CLIP's image encoder. A significant similarity difference flags the input as adversarial.

单次检测通过一次前向传递区分对抗样本和干净样本。白盒检测方法通常是单次的。例如，Mirror Check [217] 是一种针对 VLP 模型的模型无关方法。它使用文本到图像 (T21) 模型从受害者模型生成的描述中生成图像，并使用 CLIP 的图像编码器比较输入图像与生成图像的相似性。显著的相似性差异将输入标记为对抗样本。

4.2.1.2 Stateful Detection

4.2.1.2 有状态检测

Stateful Detection is designed for black-box query attacks, where multiple queries are tracked to detect adversarial behavior. AdvQDet [218] is a novel framework that counters query-based black-box attacks. It uses adversarial contrastive prompt tuning (ACPT) to tune CLIP image encoder, enabling detection of adversarial queries within just three queries.

状态检测（Stateful Detection）专为黑盒查询攻击设计，通过跟踪多次查询来检测对抗行为。AdvQDet [218] 是一个针对基于查询的黑盒攻击的新框架。它利用对抗性对比提示调优（ACPT）来调整 CLIP 图像编码器，从而在仅三次查询内检测到对抗性查询。

4.2.2Standard Adversarial Training

4.2.2 标准对抗训练

Adversarial training is widely regarded as the most effective defense against adversarial attacks [409], [448]. However, it is computationally expensive, and for VLP models, which are typically trained on web-scale datasets, this cost becomes prohibitively high, posing a significant challenge for traditional approaches. Although research in this area is limited, we highlight two notable works that have explored adversarial training for vision-language pre-training. Their pre-trained models can be used as robust backbones for other adversarial research.

对抗训练被广泛认为是对抗对抗攻击最有效的防御手段 [409], [448]。然而，它的计算成本很高，而对于通常基于网络规模数据集训练的 VLP 模型来说，这一成本变得极其高昂，对传统方法构成了重大挑战。尽管这一领域的研究有限，但我们重点介绍了两项探索视觉语言预训练对抗训练的显著工作。它们的预训练模型可以作为其他对抗性研究的鲁棒骨干。

The first work, VILLA [216], is a vision-language adversarial training framework consisting of two stages: task-agnostic adversarial pre-training and task-specific fine-tuning. VILLA enhances performance across downstream tasks using adversarial pre-training in the embedding space of both image and text modalities, instead of pixel or token levels. It employs FreeLB's strategy [449] to minimize computational overhead for efficient large-scale training.

首个工作 VILLA [216] 是一个视觉语言对抗训练框架，包含两个阶段：任务无关的对抗预训练和任务特定的微调。VILLA 通过在图像和文本模态的嵌入空间中进行对抗预训练，而不是像素或 Token 级别，提升了下游任务的性能。它采用了 FreeLB 的策略 [449] 以最小化计算开销，从而实现高效的大规模训练。

The second work, AdvXL [215], is a large-scale adversarial training framework with two phases: a lightweight pre-training phase using low-resolution images and weaker attacks, followed by an intensive fine-tuning phase with full-resolution images and stronger attacks. This coarse-to-fine, weak-to-strong strategy reduces training costs while enabling scalable adversarial training for large vision models.

第二项工作 AdvXL [215] 是一个大规模对抗训练框架，分为两个阶段：首先是一个使用低分辨率图像和较弱攻击的轻量预训练阶段，随后是一个使用全分辨率图像和更强攻击的强化微调阶段。这种由粗到精、由弱到强的策略在降低训练成本的同时，为大视觉模型实现了可扩展的对抗训练。

4.2.3 Adversarial Prompt Tuning

4.2.3 对抗性提示调优

Adversarial prompt tuning (APT) enhances the adversarial robustness of VLP models by incorporating adversarial training during prompt tuning [450]-[452], typically focusing on textual prompts. It offers a lightweight alternative to standard adversarial training. APT methods can be classified into two main categories based on the prompt type: textual prompt tuning and multi-modal prompt tuning.

对抗性提示调优 (APT) 通过在提示调优过程中引入对抗性训练 [450]-[452] 来增强视觉语言预训练 (VLP) 模型的对抗鲁棒性，通常侧重于文本提示。它为标准的对抗性训练提供了一种轻量级替代方案。根据提示类型，APT 方法可分为两大类：文本提示调优和多模态提示调优。

4.2.3.1 Textual Prompt Tuning

4.2.3.1 文本提示调优

Textual prompt tuning (TPT) robust if ies VLP models by finetuning learnable text prompts. AdvPT [204] enhances the adversarial robustness of CLIP image encoder by realigning adversarial image embeddings with clean text embeddings using learnable textual prompts. Similarly, APT [205] learns robust text prompts, using a CLIP image encoder to boost accuracy and robustness with minimal computational cost. MixPrompt [206] simultaneously enhances the general iz ability and adversarial robustness of VLPs by employing conditional APT. Unlike empirical defenses, Prompt Smooth [207] offers a certified defense for Medical VLMs, adapting pre-trained models to Gaussian noise without retraining. Additionally, Defense-Prefix [203] mitigates typographic attacks by adding a prefix token to class names, improving robustness without retraining.

文本提示调优 (TPT) 通过微调可学习的文本提示来增强视觉语言预训练 (VLP) 模型的鲁棒性。AdvPT [204] 通过使用可学习的文本提示将对抗性图像嵌入与干净的文本嵌入重新对齐，增强了 CLIP 图像编码器的对抗鲁棒性。类似地，APT [205] 学习了鲁棒的文本提示，使用 CLIP 图像编码器以最小的计算成本提高准确性和鲁棒性。MixPrompt [206] 通过采用条件 APT 同时提升了 VLP 的泛化能力和对抗鲁棒性。与经验防御不同，Prompt Smooth [207] 为医学 VLM 提供了认证防御，无需重新训练即可将预训练模型适应高斯噪声。此外，Defense-Prefix [203] 通过向类名添加前缀 Token 来缓解排版攻击，提高了鲁棒性而无需重新训练。

4.2.3.2 多模态提示调优

Recent adversarial prompt tuning methods have expanded textual prompts to multi-modal prompts. FAP [208] introduces learnable adversarial text supervision and a training objective that balances cross-modal consistency while differentiating uni-modal representations. APD [209] improves CLIP's robustness through online prompt distillation between teacher and student multimodal prompts. Additionally, TAPT [210] presents a test-time defense that learns defensive bimodal prompts to improve CLIP's zero-shot inference robustness.

最近的对抗性提示调优方法已经将文本提示扩展到多模态提示。FAP [208] 引入了可学习的对抗性文本监督和一个训练目标，该目标在区分单模态表示的同时平衡跨模态一致性。APD [209] 通过教师和学生多模态提示之间的在线提示蒸馏来提高 CLIP 的鲁棒性。此外，TAPT [210] 提出了一种测试时防御方法，通过学习防御性双模态提示来提高 CLIP 的零样本推理鲁棒性。

4.2.4 Adversarial Contrastive Tuning

4.2.4 对抗对比微调

Adversarial contrastive tuning involves contrastive learning with adversarial training to fine-tune a robust CLIP image encoder for zero-shot adversarial robustness on downstream tasks. These methods are categorized into supervised and unsupervised methods, depending on the availability of labeled data during training.

对抗对比微调涉及对抗训练与对比学习，以在下游任务中微调出具有零样本对抗鲁棒性的CLIP图像编码器。这些方法根据训练期间是否有标签数据，分为有监督和无监督方法。

4.2.4.1 Supervised Contrastive Tuning

4.2.4.1 监督对比调优 (Supervised Contrastive Tuning)

Visual Tuning fine-tunes CLIP image encoder using only adversarial images. TeCoA [211] explores the zero-shot adversarial robustness of CLIP and finds that visual prompt tuning is more effective without text guidance, while fine-tuning performs better with text information. PMG-AFT [212] improves zero-shot adversarial robustness by introducing an auxiliary branch to minimize the distance between adversarial outputs in the target and pre-trained models, mitigating over fitting and preserving generalization.

视觉调优仅使用对抗性图像对CLIP图像编码器进行微调。TeCoA [211] 探索了CLIP的零样本对抗鲁棒性，发现视觉提示调优在没有文本引导时更有效，而微调在提供文本信息时表现更好。PMG-AFT [212] 通过引入辅助分支来最小化目标模型和预训练模型中对抗性输出之间的距离，从而提高零样本对抗鲁棒性，缓解过拟合并保持泛化能力。

Multi-modal Tuning fine-tunes CLIP image encoder using both adversarial texts and images. MMCoA [213] combines image-based PGD and text-based BERT-Attack in a multi-modal contrastive adversarial training framework. It uses two contrastive losses to align clean and adversarial image and text features, improving robustness against both image-only and multi-modal attacks.

多模态调优使用对抗性文本和图像对CLIP图像编码器进行微调。MMCoA [213] 在多模态对比对抗训练框架中结合了基于图像的 PGD 和基于文本的 BERT-Attack。它使用两种对比损失来对齐干净和对抗性的图像及文本特征，从而提高了对仅图像攻击和多模态攻击的鲁棒性。

4.2.4.2 Unsupervised Contrastive Tuning

4.2.4.2 无监督对比调优

Adversarial contrastive tuning can also be performed in an unsupervised fashion. For instance, FARE [214] robust if ies CLIP image encoder through unsupervised adversarial fine-tuning, achieving superior clean accuracy and robustness across downstream tasks, including zero-shot classification and vision-language tasks. This approach enables VLMs, such as LLaVA and Open Flamingo, to attain robustness without the need for re-training or additional fine-tuning.

对抗性对比调优（adversarial contrastive tuning）也可以以无监督的方式进行。例如，FARE [214] 通过无监督的对抗性微调使 CLIP 图像编码器更加鲁棒，在下游任务（包括零样本分类和视觉语言任务）中实现了优异的干净准确性和鲁棒性。这种方法使得视觉语言模型（VLM），如 LLaVA 和 Open Flamingo，无需重新训练或额外微调即可获得鲁棒性。

4.3Backdoor & Poisoning Attacks

4.3 后门与投毒攻击

Backdoor and poisoning attacks on CLIP can target either the pre-training stage or the fine-tuning stage on downstream tasks. Previous studies have shown that poisoning backdoor attacks on CLIP can succeed with significantly lower poisoning rates compared to traditional supervised learning [224]. Additionally, training CLIP on web-crawled data increases its vulnerability to backdoor attacks [453]. This section reviews proposed attacks targeting backdoor ing or poisoning CLIP.

后门攻击和投毒攻击 (Poisoning attacks) 可以针对 CLIP 的预训练阶段或下游任务的微调阶段。先前的研究表明，与传统监督学习相比，针对 CLIP 的投毒后门攻击可以在显著较低的投毒率下成功 [224]。此外，在通过网络爬取的数据上训练 CLIP 会增加其受到后门攻击的脆弱性 [453]。本节回顾了针对 CLIP 后门或投毒攻击的提议。

4.3.1Backdoor Attacks

4.3.1 后门攻击

Based on the trigger modality, existing backdoor attacks on CLIP can be categorized into visual triggers and multi-modal triggers.

基于触发方式，现有的CLIP后门攻击可分为视觉触发和多模态触发。

Visual Triggers target pre-trained image encoders by embedding backdoor patterns in visual inputs. BadEncoder [219] explores image backdoor attacks on self-supervised learning by injecting backdoors into pre-trained image encoders, compromising downstream class if i ers. Corrupt Encoder [220] exploits random cropping in contrastive learning to inject backdoors into pre-trained image encoders, with increased effectiveness when cropped views contain only the reference object or the trigger. For attacks targeting CLIP, BadCLIP [221] optimizes visual trigger patterns using dual-embedding guidance, aligning them with both the target text and specific visual features. This strategy enables BadCLIP to bypass backdoor detection and fine-tuning defenses.

视觉触发器通过在视觉输入中嵌入后门模式来针对预训练的图像编码器。BadEncoder [219] 通过将后门注入预训练的图像编码器，探索了自监督学习中的图像后门攻击，从而危及下游分类器。Corrupt Encoder [220] 利用对比学习中的随机裁剪将后门注入预训练的图像编码器，当裁剪的视图仅包含参考对象或触发器时，效果会增强。对于针对 CLIP 的攻击，BadCLIP [221] 使用双嵌入引导优化视觉触发模式，使其与目标文本和特定视觉特征对齐。这一策略使 BadCLIP 能够绕过后门检测和微调防御。

Multi-modal Triggers combine both visual and textual triggers to enhance the attack. BadCLIP [222] introduces a novel trigger-aware prompt learning-based backdoor attack targeting CLIP models. Rather than fine-tuning the entire model, BadCLIP injects learnable triggers during the prompt learning stage, affecting both the image and text encoders.

多模态触发器结合了视觉和文本触发器以增强攻击效果。BadCLIP [222] 提出了一种针对 CLIP 模型的新型基于触发器的提示学习后门攻击。BadCLIP 并未对整个模型进行微调，而是在提示学习阶段注入可学习的触发器，影响图像和文本编码器。

4.3.2 Poisoning Attacks

4.3.2 投毒攻击

Two targeted poisoning attacks on CLIP are PBCL [224] and MM Poison [223]. PBCL demonstrated that a targeted poisoning attack, mis classifying a specific sample, can be achieved by poisoning as little as $0.0001%$ of the training dataset. MM Poison investigates modality vulnerabilities and proposes three attack types: single target image, single target label, and multiple target labels. Evaluations show high attack success rates while maintaining clean data performance across both visual and textual modalities.

针对 CLIP 的两种定向投毒攻击是 PBCL [224] 和 MM Poison [223]。PBCL 证明，只需投毒训练数据集的 $0.0001%$，即可实现定向投毒攻击，从而错误分类特定样本。MM Poison 研究了模态漏洞，并提出了三种攻击类型：单目标图像、单目标标签和多目标标签。评估表明，在保持视觉和文本模态中干净数据性能的同时，攻击成功率很高。

4.4 Backdoor & Poisoning Defenses

4.4 后门与中毒防御

Defense strategies against backdoor and poisoning attacks are generally categorized into robust training and backdoor detection. Robust Training aims to create VLP models resistant to backdoor or targeted poisoning attacks, even when trained on untrusted datasets. This approach specifically addresses poisoning-based attacks. Backdoor detection focuses on identifying compromised encoders or contaminated data. Detection methods often require additional mitigation techniques to fully eliminate backdoor effects.

针对后门和投毒攻击的防御策略通常分为鲁棒训练和后门检测。鲁棒训练旨在创建能够抵抗后门或目标投毒攻击的 VLP 模型，即使在不可信的数据集上进行训练也能有效应对。这种方法专门针对基于投毒的攻击。后门检测则侧重于识别被破坏的编码器或被污染的数据。检测方法通常需要额外的缓解技术来完全消除后门影响。

4.4.1 Robust Training

4.4.1 鲁棒训练

Depending on the stage at which the model gains robustness against backdoor attacks, existing robust training strategies can be categorized into fine-tuning and pre-training approaches.

根据模型在对抗后门攻击时获得鲁棒性的阶段，现有的鲁棒训练策略可以分为微调和预训练方法。

4.4.1.1 Fine-tuning Stage

4.4.1.1 微调阶段

To mitigate backdoor and poisoning threats, CleanCLIP [225] fine-tunes CLIP by re-aligning each modality's representations, weakening spurious correlations from backdoor attacks. Similarly, SAFECLIP [226] enhances feature alignment using unimodal contrastive learning. It first warms up the image and text modalities separately, then uses a Gaussian mixture model to classify data into safe and risky sets. During pre-training, SAFECLIP optimizes CLIP loss on the safe set, while separately fine-tuning the risky set, reducing poisoned image-text pair similarity and defending against targeted poisoning and backdoor attacks.

为减轻后门和投毒威胁，CleanCLIP [225] 通过重新对齐每个模态的表征来微调 CLIP，削弱来自后门攻击的虚假关联。类似地，SAFECLIP [226] 使用单模态对比学习增强特征对齐。它首先分别预热图像和文本模态，然后使用高斯混合模型将数据分类为安全集和风险集。在预训练期间，SAFECLIP 在安全集上优化 CLIP 损失，同时单独微调风险集，降低被投毒的图像-文本对的相似度，并防御定向投毒和后门攻击。

4.4.1.2 Pre-training Stage

4.4.1.2 预训练阶段

ROCLIP [227] defends against poisoning and backdoor attacks by enhancing model robustness during pre-training. It disrupts the association between poisoned image-caption pairs by utilizing a large, diverse pool of random captions. Additionally, ROCLIP applies image and text augmentations to further strengthen its defense and improve model performance.

ROCLIP [227] 通过增强预训练期间的模型鲁棒性来防御投毒和后门攻击。它通过利用大量多样化的随机文本描述池来破坏投毒图像-文本对之间的关联。此外，ROCLIP 还应用图像和文本增强技术，进一步强化其防御能力并提升模型性能。

4.4.2 Backdoor Detection

4.4.2 后门检测

Backdoor detection can be broadly divided into three subtasks: 1) trigger inversion, 2) backdoor sample detection, and 3) backdoor model detection. Trigger inversion is particularly useful, as recovering the trigger can aid in the detection of both backdoor samples and backdoored models.

后门检测可以大致分为三个子任务：1）触发器反演，2）后门样本检测，以及 3）后门模型检测。触发器反演特别有用，因为恢复触发器可以帮助检测后门样本和后门模型。

Trigger Inversion aims to reverse-engineer the trigger pattern injected into a backdoored model. Mudjacking [230] mitigates backdoor vulnerabilities in VLP models by adjusting model parameters to remove the backdoor when a mis classified triggerembedded input is detected. In contrast to single-modality defenses, TIJO [229] defends against dual-key backdoor attacks by jointly optimizing the reverse-engineered triggers in both the image and text modalities.

触发器反演旨在逆向工程注入到后门模型中的触发器模式。Mudjacking [230] 通过调整模型参数来消除后门漏洞，当检测到误分类的触发器嵌入输入时，移除后门。与单模态防御相比，TIJO [229] 通过联合优化图像和文本模态中的逆向工程触发器来防御双键后门攻击。

Backdoor Sample Detection detects whether a training or test sample is poisoned by a backdoor trigger. This detection can be used to cleanse the training dataset or reject backdoor queries. SEER [231] addresses the complexity of multi-modal models by jointly detecting malicious image triggers and target texts in the shared feature space. This method does not require access to the training data or knowledge of downstream tasks, making it highly effective for backdoor detection in VLP models. Outlier Detection [232] demonstrates that the local neighborhood of backdoor samples is significantly sparser compared to that of clean samples. This insight enables the effective and efficient application of various local outlier detection methods to identify backdoor samples from web-scale datasets. Furthermore, they reveal that potential unintentional backdoor samples already exist in the Conceptual Captions 3 Million (CC3M) dataset and have been trained into open-sourced CLIP encoders.

后门样本检测用于判断训练或测试样本是否被后门触发器污染。这种检测可用于清理训练数据集或拒绝后门查询。SEER [231]通过联合检测共享特征空间中的恶意图像触发器和目标文本来解决多模态模型的复杂性。该方法无需访问训练数据或了解下游任务，因此在VLP模型的后门检测中非常有效。离群点检测 [232]表明，后门样本的局部邻域相比干净样本要稀疏得多。这一洞察使得各种局部离群点检测方法能够高效地应用于从网络规模数据集中识别后门样本。此外，他们发现Conceptual Captions 3 Million (CC3M)数据集中已经存在潜在的无意识后门样本，并且这些样本已被训练到开源的CLIP编码器中。

Backdoor Model Detection identifies whether a trained model is compromised by backdoor(s). DECREE [228] introduces a backdoor detection method specifically for VLP encoders that require no labeled data. It exploits the distinct embedding space characteristics of backdoored encoders when exposed to clean versus backdoor inputs. By combining trigger inversion with these embedding differences, DECREE can effectively detect backdoored encoders.

后门模型检测识别训练好的模型是否被后门攻击所破坏。DECREE [228] 提出了一种专门针对视觉语言预训练 (VLP) 编码器的后门检测方法，该方法无需标注数据。它利用被后门攻击的编码器在干净输入和后门输入下表现出的不同嵌入空间特性。通过将触发器反转与这些嵌入差异相结合，DECREE 能够有效检测被后门攻击的编码器。

4.5 Datasets

4.5 数据集

This section reviews datasets used for VLP safety research. As shown in Table 6, a variety of benchmark datasets were employed toevaluate adversarial attacks and defenses for VLP models. For image classification tasks, commonly used datasets include: ImageNet [454], Caltech101 [455], DTD [456], EuroSAT [457], OxfordPets [458], FGVC-Aircraft [459], Food101 [460], Flowers102 [461], StanfordCars [462], SUN397 [463], and UCF101 [464]. For evaluating domain generalization and robustness to distribution shifts, several ImageNet variants were also used: ImageNetV2 [465], ImageNet-Sketch [466], ImageNetA [467], and ImageNet-R [468]. Additionally, MS-COCO [418] and Flickr30K [469] were utilized for image-to-text and text-toimage retrieval tasks, RefCOCO $^+$ [470] for visual grounding, and SNLI-VE [471] for visual entailment.

本节回顾了用于视觉语言预训练(VLP)安全研究的数据集。如表 6 所示，研究人员使用了多种基准数据集来评估 VLP 模型的对抗攻击和防御能力。在图像分类任务中，常用的数据集包括：ImageNet [454]、Caltech101 [455]、DTD [456]、EuroSAT [457]、OxfordPets [458]、FGVC-Aircraft [459]、Food101 [460]、Flowers102 [461]、StanfordCars [462]、SUN397 [463] 和 UCF101 [464]。为了评估领域泛化能力和对分布变化的鲁棒性，还使用了几个 ImageNet 变体：ImageNetV2 [465]、ImageNet-Sketch [466]、ImageNetA [467] 和 ImageNet-R [468]。此外，MS-COCO [418] 和 Flickr30K [469] 被用于图像到文本和文本到图像检索任务，RefCOCO $^+$ [470] 用于视觉定位，SNLI-VE [471] 用于视觉蕴涵任务。

5 VISION-LANGUAGE MODEL SAFETY

5 视觉-语言模型安全性

Large VLMs extend LLMs by adding a visual modality through pre-trained image encoders and alignment modules, enabling applications like visual conversation and complex reasoning. However, this multi-modal design introduces unique vulnerabilities. This section reviews adversarial attacks, latency energy attacks, jailbreak attacks, prompt injection attacks, backdoor & poisoning attacks, and defenses developed for VLMs. Many VLMs use VLP-trained encoders, so the attacks and defenses discussed in Section 4 also apply to VLMs. The additional alignment process between the VLM pre-trained encoders and LLMs, however, expands the attack surface, with new risks like cross-modal backdoor attacks and jailbreaks targeting both text and image inputs. This underscores the need for safety measures tailored to VLMs.

大型视觉语言模型通过预训练图像编码器和对齐模块扩展了大语言模型，增加了视觉模态，从而实现了视觉对话和复杂推理等应用。然而，这种多模态设计引入了独特的漏洞。本节回顾了对抗攻击、延迟能量攻击、越狱攻击、提示注入攻击、后门和投毒攻击，以及为视觉语言模型开发的防御措施。许多视觉语言模型使用了视觉语言预训练编码器，因此第4节中讨论的攻击和防御也适用于视觉语言模型。然而，视觉语言模型预训练编码器与大语言模型之间的额外对齐过程扩大了攻击面，出现了新的风险，如针对文本和图像输入的跨模态后门攻击和越狱攻击。这强调了需要针对视觉语言模型量身定制安全措施的重要性。

5.1 Adversarial Attacks

5.1 对抗攻击

Adversarial attacks on VLMs primarily target the visual modality, which, unlike text, is more susceptible to adversarial perturbations due to its high-dimensional nature. By adding imperceptible changes to images, attackers aim to disrupt tasks like image captioning and visual question answering. These attacks are classified into white-box and black-box categories based on the threat model.

针对视觉语言模型 (VLM) 的对抗攻击主要针对视觉模态，与文本不同，由于其高维特性，视觉模态更容易受到对抗性扰动的影响。通过在图像中添加不可察觉的变化，攻击者旨在破坏图像描述和视觉问答等任务。根据威胁模型，这些攻击分为白盒和黑盒两类。

5.1.1White-box Attacks

5.1.1 白盒攻击

White-box adversarial attacks on VLMs have full access to the model parameters, including both vision encoders and LLMs. These attacks can be classified into three types based on their objectives: task-specific attacks, cross-prompt attack, and chainof-thought (CoT) attack.

白盒对抗攻击在视觉语言模型（VLMs）上具有对模型参数的完全访问权限，包括视觉编码器和大语言模型。这些攻击根据其目标可以分为三类：任务特定攻击、跨提示攻击和思维链（CoT）攻击。

Task-specific Attacks Schlarmann et al. [233] were the first to highlight the vulnerability of VLMs like Flamingo [472] and GPT4 [473] to adversarial images that manipulate caption outputs. Their study showed how attackers can exploit these vulnerabilities to mislead users, redirecting them to harmful websites or spreading misinformation. Gao et al. [236] introduced attack paradigms targeting the referring expression comprehension task, while [234] proposed a query decomposition method and demonstrated how contextual prompts can enhance VLM robustness against visual attacks.

任务特定攻击

Schlarmann 等人 [233] 首次强调了像 Flamingo [472] 和 GPT4 [473] 这样的视觉语言模型 (VLM) 对操纵字幕输出的对抗图像的脆弱性。他们的研究表明，攻击者如何利用这些漏洞误导用户，将其重定向到有害网站或传播虚假信息。Gao 等人 [236] 提出了针对指代表达理解任务的攻击范式，而 [234] 提出了一种查询分解方法，并展示了上下文提示如何增强 VLM 对视觉攻击的鲁棒性。

Cross-prompt Attack refer to adversarial attacks that remain effective across different prompts. For example, CroPA [235] explored the transfer ability of a single adversarial image across multiple prompts, investigating whether it could mislead predictions in various contexts. To tackle this, they proposed refining adversarial perturbations through learnable prompts to enhance transfer ability.

跨提示攻击 (Cross-prompt Attack) 指的是在不同提示下仍能保持有效的对抗攻击。例如，CroPA [235] 探索了单个对抗图像在多个提示下的迁移能力，研究其是否能在不同情境下误导预测。为了解决这一问题，他们提出了通过可学习的提示来优化对抗扰动，以增强迁移能力。

CoT Attack targets the CoT reasoning process of VLMs. Stop-reasoning Attack [237] explored the impact of CoT reasoning on adversarial robustness. Despite observing some improvements in robustness, they introduced a novel attack designed to bypass these defenses and interfere with the reasoning process withinVLMs.

CoT攻击针对VLMs的CoT推理过程。Stop-reasoning攻击[237]探讨了CoT推理对对抗鲁棒性的影响。尽管观察到鲁棒性有所提高，但他们引入了一种新颖的攻击，旨在绕过这些防御并干扰VLMs内部的推理过程。

5.1.2 Gray-box Attacks

5.1.2 灰盒攻击

Gray-box adversarial attacks typically involve access to either the vision encoders or the LLM of a VLM, with a focus on vision encoders as the key different i at or between VLMs and LLMs. Attackers craft adversarial images that closely resemble target images, manipulating model predictions without full access to the VLM. For instance, InstructTA [238] generates a target image and uses a surrogate model to create adversarial perturbations, minimizing the feature distance between the original and adversarial image. To improve transfer ability, the attack incorporates GPT-4 paraphrasing to refine instructions.

灰盒对抗攻击通常涉及对视觉编码器或大语言模型的访问，重点关注视觉编码器作为视觉语言模型和大语言模型之间的关键区别。攻击者制作与目标图像非常相似的对抗图像，在没有完全访问视觉语言模型的情况下操纵模型预测。例如，InstructTA [238] 生成目标图像并使用替代模型创建对抗扰动，最小化原始图像和对抗图像之间的特征距离。为了提高迁移能力，攻击结合了 GPT-4 的释义来优化指令。

5.1.3Black-box Attacks

5.1.3 黑盒攻击

In contrast, black-box attacks do not require access to the target model's internal parameters and typically rely on transfer-based Or generator-based methods.

相比之下，黑盒攻击不需要访问目标模型的内部参数，通常依赖于基于迁移或基于生成器的方法。

Transfer-based Attacks exploit the widespread use of frozen CLIP vision encoders in many VLMs. AttackBard [239] demonstrates that adversarial images generated from surrogate models can successfully mislead Google's Bard, despite its defense mechanisms. Similarly, AttackVLM [240] crafts targeted adversarial images for models like CLIP [442] and BLIP [474], successfully transferring these adversarial inputs to other VLMs. It also shows that black-box queries further improved the success rate of generating targeted responses, illustrating the potency of cross-model transfer ability.

基于迁移的攻击利用了冻结的CLIP视觉编码器在许多视觉语言模型（VLM）中的广泛应用。AttackBard [239] 展示了从替代模型生成的对抗性图像可以成功地误导Google的Bard，尽管后者有防御机制。同样，AttackVLM [240] 为CLIP [442] 和BLIP [474] 等模型制作了有针对性的对抗性图像，并成功地将这些对抗性输入迁移到其他VLM中。它还表明，黑盒查询进一步提高了生成目标响应的成功率，展示了跨模型迁移能力的强大性。

Generator-based Attacks leverage generative models to create adversarial examples with improved transfer ability. AdvDiffVLM [241] uses diffusion models to generate natural, targeted adversarial images with enhanced transfer ability. By combining adaptive ensemble gradient estimation and GradCAM-guided masking, it improves the semantic embedding of adversarial examples and spreads the targeted semantics more effectively across the image, leading to more robust attacks. AnyAttack [242] presents a self-supervised framework for generating targeted adversarial images without label supervision. By utilizing contrastive loss, it efficiently creates adversarial examples that mislead models across diverse tasks.

基于生成器的攻击利用生成模型创建具有更强迁移能力的对抗样本。AdvDiffVLM [241] 使用扩散模型生成具有增强迁移能力的自然、定向对抗图像。通过结合自适应集成梯度估计和 GradCAM 引导的掩码，它改进了对抗样本的语义嵌入，并更有效地在图像中传播目标语义，从而实现更强大的攻击。AnyAttack [242] 提出了一种自监督框架，用于在没有标签监督的情况下生成定向对抗图像。通过利用对比损失，它有效地创建了误导模型在不同任务中的对抗样本。

5.2Jailbreak Attacks

5.2 越狱攻击

The inclusion of a visual modality in VLMs provides additional routes for jailbreak attacks. While adversarial attacks generally induce random or targeted errors, jailbreak attacks specifically target the model's safeguards to generate inappropriate outputs. Like adversarial attacks, jailbreak attacks on VLMs can be classified as white-box or black-box attacks.

视觉语言模型 (VLM) 中视觉模态的引入为越狱攻击提供了额外的途径。虽然对抗攻击通常会导致随机或目标错误，但越狱攻击专门针对模型的防护机制，以生成不合适的输出。与对抗攻击类似，VLM 的越狱攻击可以分为白盒攻击或黑盒攻击。

TABLE 7: A summary of attacks and defenses for VLMs.

5.2.1White-box Attacks

5.2.1 白盒攻击

White-box jailbreak attacks leverage gradient information to perturb input images or text, targeting specific behaviors in VLMs. These attacks can be further categorized into three types: targetspecific jailbreak, universal jailbreak, and hybrid jailbreak, each exploiting different aspects of the model's safety measures.

白盒越狱攻击利用梯度信息来扰动输入图像或文本，针对视觉语言模型（VLM）中的特定行为。这些攻击可以进一步分为三类：目标特定越狱、通用越狱和混合越狱，每种攻击都利用了模型安全措施的不同方面。

Target-specific Jailbreak focuses on inducing a specific type of harmful output from the model. Image Hijack [243] introduces adversarial images that manipulate VLM outputs, such as leaking information, bypassing safety measures, and generating false statements. These attacks, trained on generic datasets, effectively force models to produce harmful outputs. Similarly, Adversarial Alignment Attack [244] demonstrates that adversarial images can induce misaligned behaviors in VLMs, suggesting that similar techniques could be adapted for text-only models using advanced NLP methods.

目标特定的越狱攻击专注于诱导模型生成特定类型的有害输出。Image Hijack [243] 引入了对抗性图像，这些图像可以操纵视觉语言模型（VLM）的输出，例如泄露信息、绕过安全措施以及生成虚假陈述。这些攻击在通用数据集上进行训练，有效地迫使模型生成有害输出。同样，Adversarial Alignment Attack [244] 表明，对抗性图像可以诱导视觉语言模型产生不匹配的行为，这表明类似的技术可以借助先进的自然语言处理（NLP）方法，适用于纯文本模型。

Universal Jailbreak bypasses model safeguards, causing it to generate harmful content beyond the adversarial input. VAJM [245] shows that a single adversarial image can universally bypass VLM safety, forcing universal harmful outputs. ImgJP [246] uses a maximum likelihood algorithm to create transferable adversarial images that jailbreak various VLMs, even bridging VLM and LLM attacks by converting images to text prompts. UMK [247] proposes a dual optimization attack targeting both text and image modalities, embedding toxic semantics in images and text to maximize impact. HADES [248] introduces a hybrid jailbreak method that combines universal adversarial images with crafted inputs to bypass safety mechanisms, effectively amplifying harmful instructions and enabling robust adversarial manipulation.

通用越狱绕过模型的安全保护，使其在对抗输入之外生成有害内容。VAJM [245] 展示了单个对抗图像可以普遍绕过 VLM 的安全性，强制生成普遍的有害输出。ImgJP [246] 使用最大似然算法创建可转移的对抗图像，这些图像可以越狱各种 VLM，甚至通过将图像转换为文本提示来桥接 VLM 和 LLM 攻击。UMK [247] 提出了一种针对文本和图像模态的双重优化攻击，将有毒语义嵌入图像和文本中以最大化影响。HADES [248] 引入了一种混合越狱方法，结合了通用对抗图像和精心设计的输入，以绕过安全机制，有效放大有害指令并实现稳健的对抗操作。

5.2.2Black-box Attacks

5.2.2 黑盒攻击

Black-box jailbreak attacks do not require direct access to the internal parameters of the target VLM. Instead, they exploit external vulnerabilities, such as those in the frozen CLIP vision encoder, interactions between vision and language modalities, or system prompt leakage. These attacks can be classified into four main categories: transfer-based attacks, manually-designed attacks, system prompt leakage, and red teaming, each employing distinct strategies to bypass VLM defenses and trigger harmful behaviors.

黑盒越狱攻击不需要直接访问目标VLM的内部参数。相反，它们利用外部漏洞，例如冻结的CLIP视觉编码器中的漏洞、视觉和语言模态之间的交互或系统提示泄漏。这些攻击可以分为四大类：基于迁移的攻击、手动设计的攻击、系统提示泄漏和红队测试，每种攻击都采用不同的策略来绕过VLM的防御并触发有害行为。

Transfer-based Attacks on VLMs typically assume the attacker has access to the image encoder (or its open-source version), which is used to generate adversarial images that can then be transferred to attack the black-box LLM. For example, Jailbreak in Pieces [249] introduces cross-modality attacks that transfer adversarial images, crafted using the image encoder (assume the model employed an open-source encoder), along with clean textual prompts to break VLM alignment.

基于迁移的攻击通常假设攻击者能够访问图像编码器（或其开源版本），并利用该编码器生成对抗性图像，然后将其迁移至攻击黑盒大语言模型。例如，Jailbreak in Pieces [249] 提出了跨模态攻击，通过利用图像编码器（假设模型使用了开源编码器）生成对抗性图像，并配合干净的文本提示，来破坏 VLM 的对齐性。

Manually-designed Attacks can be as effective as optimized ones. For instance, FigStep [250] introduces an algorithm that bypasses safety measures by converting harmful text into images via typography, enabling VLMs to visually interpret the harmful intent. VRP [252] adopts a visual role-play approach, using LLM-generated images of high-risk characters based on detailed descriptions. By pairing these images with benign roleplay instructions, VRP exploits the negative traits of the characters to deceive VLMs into generating harmful outputs.

手动设计的攻击可以与优化的攻击一样有效。例如，FigStep [250] 提出了一种算法，通过排版将有害文本转换为图像，绕过安全措施，使视觉语言模型 (VLM) 能够视觉上解读有害意图。VRP [252] 采用视觉角色扮演方法，基于详细描述使用大语言模型生成高风险角色的图像。通过将这些图像与无害的角色扮演指令配对，VRP 利用角色的负面特征来欺骗视觉语言模型生成有害输出。

System Prompt Leakage is another significant black-box jailbreak method, exemplified by SASP [251]. By exploiting a system prompt leakage in GPT-4V, SASP allowed the model to perform a self-adversarial attack, demonstrating the risks of internal prompt exposure.

系统提示泄露

Red Teaming recently saw an advancement with IDEATOR [253], which integrated a VLM with an advanced diffusion model to autonomously generate malicious image-text pairs. This approach overcomes the limitations of manually designed attacks, providing a scalable and efficient method for creating adversarial inputs without direct access to the target model.

红队测试最近在 IDEATOR [253] 方面取得了进展，该方法将 VLM 与先进的扩散模型 (diffusion model) 结合，自动生成恶意的图像-文本对。这种方法克服了手动设计攻击的局限性，提供了一种可扩展且高效的方法来创建对抗性输入，而无需直接访问目标模型。

5.3Jailbreak Defenses

5.3 越狱防御

This section reviews defense methods for VLMs against jailbreak attacks, categorized into jailbreak detection and jailbreak prevention. Detection methods identify harmful inputs or outputs for rejection or purification, while prevention methods enhance the model's inherent robustness to jailbreak queries through safety alignment or filters.

本节回顾了针对越狱攻击的视觉语言模型（VLM）防御方法，分为越狱检测和越狱预防两类。检测方法识别有害输入或输出以进行拒绝或净化，而预防方法则通过安全对齐或过滤器增强模型对越狱查询的固有鲁棒性。

5.3.1Jailbreak Detection

5.3.1 Jailbreak 检测

JailGuard [254] detects jailbreak attacks by mutating untrusted inputs and analyzing discrepancies in model responses. It uses 18 mutators for text and image inputs, improving generalization across attack types. GuardMM [255] is a two-stage defense: the first stage validates inputs to detect unsafe content, while the second stage focuses on prompt injection detection to protect against image-based attacks. It uses a specialized language to enforce safety rules and standards. MLLM-Protector [257] identifies harmful responses using a lightweight detector and detoxifies them through a specialized transformation mechanism. Its modular design enables easy integration into existing VLMs, enhancing safety and preventing harmful content generation.

JailGuard [254] 通过突变不可信输入并分析模型响应的差异来检测越狱攻击。它使用18种突变器处理文本和图像输入，提高了对各种攻击类型的泛化能力。GuardMM [255] 是一种两阶段防御机制：第一阶段验证输入以检测不安全内容，第二阶段专注于提示注入检测以防止基于图像的攻击。它使用专用语言来强制执行安全规则和标准。MLLM-Protector [257] 利用轻量级检测器识别有害响应，并通过专用转换机制进行净化。其模块化设计便于集成到现有视觉语言模型 (VLM) 中，增强安全性并防止生成有害内容。

5.3.2 Jailbreak Prevention

5.3.2 越狱防护

AdaShield [256] defends against structure-based jailbreaks by prepending defense prompts to inputs, refining them adaptively through collaboration between the VLM and an LLM-based prompt generator, without requiring fine-tuning. ECSO [258] offers a training-free protection by converting unsafe images into text descriptions, activating the safety alignment of pre-trained LLMs within VLMs to ensure safer outputs. Infer Align er [259] applies cross-model guidance during inference, adjusting activations using safety vectors to generate safe and reliable outputs. BlueSuffix [260] introduces a reinforcement learning-based blackbox defense framework consisting of three key components: (1) an image purifier for securing visual inputs, (2) a text purifier for safeguarding textual inputs, and (3) a reinforcement fine-tuningbased suffix generator that leverages bimodal gradients to enhance cross-modal robustness.

AdaShield [256] 通过向输入前置防御提示来防御基于结构的越狱，通过 VLM 和基于大语言模型的提示生成器之间的协作自适应地优化提示，而无需微调。ECSO [258] 提供了一种无需训练的保护方法，将不安全的图像转换为文本描述，激活预训练大语言模型在 VLM 中的安全对齐，以确保更安全的输出。Infer Aligner [259] 在推理过程中应用跨模型指导，使用安全向量调整激活，以生成安全可靠的输出。BlueSuffix [260] 引入了一种基于强化学习的黑盒防御框架，由三个关键组件组成：(1) 用于保护视觉输入的图像净化器，(2) 用于保护文本输入的文本净化器，以及 (3) 基于强化微调的后缀生成器，利用双模梯度增强跨模态的鲁棒性。

5.4 Energy Latency Attacks

5.4 能量延迟攻击

Similar to LLMs, multi-modal LLMs also face significant computational demands. Verbose images [261] exploit these demands by overwhelming service resources, resulting in higher server costs, increased latency, and inefficient GPU usage. These images are specifically designed to delay the occurrence of the EOS token, increasing the number of auto-regressive decoder calls, which in turn raises both energy consumption and latency costs.

与大语言模型类似，多模态大语言模型同样面临巨大的计算需求。冗长图像[261]通过过度占用服务资源来利用这些需求，导致更高的服务器成本、增加的延迟以及低效的GPU使用。这些图像专门设计用于延迟EOS Token的出现，增加自回归解码器的调用次数，从而提高能耗和延迟成本。

5.5 Prompt Injection Attacks

5.5 提示注入攻击

Prompt injection attacks against VLMs share the same objective as those against LLMs (Section 3), but the visual modality introduces continuous features that are more easily exploited through adversarial attacks or direct injection. These attacks can be further classified into optimization-based attacks and typography-based attacks.

针对VLMs的提示注入攻击与针对大语言模型的提示注入攻击目标相同（第3节），但视觉模态引入了连续特征，这些特征更容易通过对抗攻击或直接注入被利用。这些攻击可以进一步分为基于优化的攻击和基于排版攻击的攻击。

Optimization-based Attacks often optimize the input images using (white-box) gradients to produce stronger attacks. These attacks manipulate the model's responses, influencing future interactions. One representative method is Adversarial Prompt Injection [262], where attackers embed malicious instructions into VLMs by adding adversarial perturbations to images.

基于优化的攻击通常利用（白盒）梯度优化输入图像，以产生更强的攻击。这些攻击会操纵模型的响应，影响未来的交互。一个代表性的方法是对抗性提示注入 [262]，攻击者通过向图像添加对抗性扰动，将恶意指令嵌入到视觉语言模型 (VLM) 中。

Typography-based Attacks exploit VLMs′ typographic vulner abilities by embedding deceptive text into images without requiring gradient access (i.e., black-box). The Typographic Attack [263] introduces two variations: Class-Based Attack to misidentify classes and Descriptive Attack to generate misleading labels. These attacks can also leak personal information [475], highlighting significant security risks.

基于字体的攻击通过将欺骗性文本嵌入图像中来利用视觉语言模型 (VLM) 的字体漏洞，且无需梯度访问（即黑盒）。字体攻击 [263] 引入了两种变体：基于类别的攻击用于错误识别类别，描述性攻击用于生成误导性标签。这些攻击还可能泄露个人信息 [475]，突显了重大的安全风险。

5.6Backdoor& Poisoning Attacks

5.6 后门与投毒攻击

Most VLMs rely on VLP encoders, with safety threats discussed in Section 4. This section focuses on backdoor and poisoning risks arising during fine-tuning and testing, specifically when aligning vision encoders with LLMs. Backdoor attacks embed triggers in visual or textual inputs to elicit specific outputs, while poisoning attacks inject malicious image-text pairs to degrade model performance. We review backdoor and poisoning attacks separately, though most of these works are backdoor attacks.

大多数视觉语言模型 (VLM) 依赖于视觉语言预训练 (VLP) 编码器，其安全威胁在第 4 节中讨论。本节重点讨论在微调和测试过程中出现的后门和中毒风险，特别是在将视觉编码器与大语言模型对齐时。后门攻击在视觉或文本输入中嵌入触发器以引发特定输出，而中毒攻击则注入恶意图像-文本对以降低模型性能。我们分别回顾后门攻击和中毒攻击，尽管这些工作中的大多数都是后门攻击。

5.6.1Backdoor Attacks

5.6.1 后门攻击

We further classify backdoor attacks on VLMs into tuning-time backdoor and testing-time backdoor.

我们进一步将针对 VLMs 的后门攻击分为调优时后门 (tuning-time backdoor) 和测试时后门 (testing-time backdoor)。

Tuning-time Backdoor injects the backdoor during VLM instruction tuning. MABA [264] targets domain shifts by adding domain-agnostic triggers using attribution al interpretation, enhancing attack robustness across mismatched domains in image captioning tasks. Bad V LM Driver [266] introduced a physical backdoor for autonomous driving, using objects like red balloons to trigger unsafe actions such as sudden acceleration, bypassing digital defenses and posing real-world risks. Its automated pipeline generates backdoor training samples with malicious behaviors for stealthy, fexible attacks. ImgTrojan [267] introduces a jailbreaking attack by poisoning image-text pairs in training data, replacing captions with malicious prompts to enable VLM jailbreaks, exposing risks of compromised datasets.

调优时后门在 VLM 指令调优期间注入后门。MABA [264] 通过使用属性解释添加领域无关触发器来针对领域偏移，增强图像描述任务中不匹配领域的攻击鲁棒性。Bad VLM Driver [266] 为自动驾驶引入了物理后门，使用红色气球等物体触发突然加速等不安全行为，绕过数字防御并带来现实风险。其自动化管道生成带有恶意行为的后门训练样本，用于隐蔽且灵活的攻击。ImgTrojan [267] 通过在训练数据中污染图像-文本对来引入越狱攻击，将描述替换为恶意提示以启用 VLM 越狱，暴露数据集被破坏的风险。

Test-time Backdoor leverages the similarity of universal adversarial perturbations and backdoor triggers to inject backdoor at test-time. AnyDoor [265] embeds triggers in the textual modality via adversarial test images with universal perturbations, creating a text backdoor from image-perturbation combinations. It can also be seen as a multi-modal universal adversarial attack. Unlike traditional methods, AnyDoor does not require access to training data, enabling attackers to separate setup and activation of the attack.

测试时后门利用通用对抗扰动和后门触发器的相似性，在测试时注入后门。AnyDoor [265] 通过具有通用扰动的对抗测试图像在文本模态中嵌入触发器，从图像扰动组合中创建文本后门。它也可以被视为一种多模态通用对抗攻击。与传统方法不同，AnyDoor 不需要访问训练数据，使攻击者能够分离攻击的设置和激活。

5.6.2 Poisoning Attacks

5.6.2 投毒攻击

Shadowcast [268] is a stealthy tuning-time backdoor attack on VLMs. It injects poisoned samples visually indistinguishable from benign ones, targeting two objectives: 1) Label Attack, which mis classifies objects, and 2) Persuas

[论文翻译]大规模安全：大模型安全综合调查

原文地址：https://arxiv.org/pdf/2502.05206v2