[论文翻译]大规模安全:大模型安全综合调查


原文地址:https://arxiv.org/pdf/2502.05206v2


Safety at Scale: A Comprehensive Survey of Large Model Safety

大规模安全:大模型安全综合调查

Abstract—-The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (Al). These models are now foundational to a wide range of applications, including conversational Al, recommendation systems, autonomous driving, content generation, medical diagnostics, and scientific discovery. However, their widespread deployment also exposes them to significant safety risks, raising concerns about robustness, reliability, and ethical implications. This survey provides a systematic review of current safety research on large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-based Agents. Our contributions are summarized as follows: (1) We present a comprehensive taxonomy of safety threats to these models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. (2) We review defense strategies proposed for each type of attacks if available and summarize the commonly used datasets and benchmarks for safety research. (3) Building on this, we identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices. More importantly, we highlight the necessity of collective efforts from the research community and international collaboration. Our work can serve as a useful reference for researchers and practitioners, fostering the ongoing development of comprehensive defense systems and platforms to safeguard Al models. GitHub: https://github.com/xingjunm/Awesome-Large-Model-Safety.

摘要

Index Terms—Large Model Safety,Al Safety,Attacks and Defenses

索引术语—大模型安全、AI安全、攻击与防御

INTRODUCTION

引言

Rtificial Intelligence (Al) has entered the era of large models, Language Models (LLMs), Vision-Language Pre-Training (VLP) models, Vision-Language Models (VLMs), and image/video generation diffusion models (DMs). Through large-scale pre-training on massive datasets, these models have demonstrated unprecedented capabilities in tasks ranging from language understanding and image generation to complex problem-solving and decisionmaking. Their ability to understand and generate human-like content (e.g., texts, images, audios, and videos) has enabled applications in customer service, content creation, healthcare, education, and more, highlighting their transformative potential in both commercial and societal domains.

人工智能 (AI) 已进入大模型时代,包括大语言模型 (LLMs)、视觉语言预训练模型 (VLP)、视觉语言模型 (VLMs) 以及图像/视频生成扩散模型 (DMs)。通过对海量数据集进行大规模预训练,这些模型在语言理解、图像生成、复杂问题解决和决策制定等任务中展现了前所未有的能力。它们理解和生成类人内容(例如文本、图像、音频和视频)的能力,使得其在客户服务、内容创作、医疗保健、教育等领域得到应用,凸显了其在商业和社会领域的变革潜力。

However, the deployment of large models comes with significant challenges and risks. As these models become more integrated into critical applications, concerns regarding their vulner abilities to adversarial, jailbreak, and backdoor attacks, data privacy breaches, and the generation of harmful or misleading content have intensified. These issues pose substantial threats, including unintended system behaviors, privacy leakage, and the dissemination of harmful information. Ensuring the safety of these models is paramount to prevent such unintended consequences, maintain public trust, and promote responsible AI usage. The field of AI safety research has expanded in response to these challenges, encompassing a diverse array of attack methodologies, defense strategies, and evaluation benchmarks designed to identify and mitigate the vulnerabilities of large models. Given the rapid development of safety-related techniques for various large models, we aim to provide a comprehensive survey of these techniques, highlighting strengths, weaknesses, and gaps, while advancing research and fostering collaboration.

然而,大规模模型的部署带来了显著的挑战和风险。随着这些模型越来越多地集成到关键应用中,人们对其在对抗性攻击、越狱攻击和后门攻击中的脆弱性、数据隐私泄露以及生成有害或误导性内容的担忧日益加剧。这些问题构成了重大威胁,包括意外的系统行为、隐私泄露和有害信息的传播。确保这些模型的安全性至关重要,以防止此类意外后果、维护公众信任并促进负责任的AI使用。AI安全研究领域已针对这些挑战进行了扩展,涵盖了多样化的攻击方法、防御策略和评估基准,旨在识别和缓解大规模模型的脆弱性。鉴于各种大规模模型安全相关技术的快速发展,我们旨在对这些技术进行全面综述,突出其优势、劣势和差距,同时推动研究并促进合作。


Fig. 1: Left: The number of safety research papers published over the past four years. Middle: The distribution of research across different models. Right: The distribution of research across different types of attacks and defenses.


图 1: 左: 过去四年发表的安全研究论文数量。中: 不同模型的研究分布。右: 不同类型攻击和防御的研究分布。


Fig. 2: Left: The quarterly trend in the number of safety research papers published across different models; Middle: The proportional relationship between large models and their corresponding attacks and defenses; Right: The annual trend in the number of safety research papers published on various attacks and defenses, presented in descending order from highest to lowest.

图 2: 左: 不同模型发布的安全研究论文数量的季度趋势;中: 大模型与其对应的攻击和防御之间的比例关系;右: 各种攻击和防御发布的安全研究论文数量的年度趋势,按从高到低降序排列。

Given the broad scope of our survey, we have structured it with the following considerations to enhance clarity and organization:

鉴于我们调查的广泛范围,我们进行了以下结构设计以提高清晰度和组织性:

· Models. We focus on six widely studied model categories, including VFMs, LLMs, VLPs, VLMs, DMs, and Agents, and review the attack and defense methods for each separately. These models represent the most popular large models

模型。我们重点关注六种广泛研究的模型类别,包括 VFM、大语言模型、VLP、VLM、DM 和 AI智能体,并分别回顾每种模型的攻击和防御方法。这些模型代表了最流行的大型模型

across various domains.

跨领域

· Organization. For each model category, we classify the reviewed works into attacks and defenses, and identify 10 attack types: adversarial, backdoor, poisoning, jailbreak, prompt injection, energy-latency, membership inference, model extraction, data extraction, and agent attacks. When both backdoor and poisoning attacks are present for a model category, we combine them into a single backdoor & poisoning category due to their similarities. We review the corresponding defense strategies for each attack type immediately after the attacks.

· 组织结构。对于每个模型类别,我们将综述的文献分为攻击和防御,并识别出10种攻击类型:对抗性攻击、后门攻击、投毒攻击、越狱攻击、提示注入攻击、能量延迟攻击、成员推断攻击、模型提取攻击、数据提取攻击和智能体攻击。当某个模型类别同时存在后门攻击和投毒攻击时,由于它们的相似性,我们将它们合并为一个单独的后门和投毒类别。我们会在攻击之后立即回顾每种攻击类型对应的防御策略。

· Taxonomy. For each type of attack or defense, we use a twolevel taxonomy: Category $\rightarrow$ Subcategory. The Category differentiates attacks and defenses based on the threat model (e.g., white-box, gray-box, black-box) or specific subtasks (e.g., detection, purification, robust training/tuning, and robust inference). The Subcategory offers a more detailed classification based on their techniques.

· 分类法。对于每种类型的攻击或防御,我们使用两级分类法:类别 $\rightarrow$ 子类别。类别根据威胁模型(例如白盒、灰盒、黑盒)或特定子任务(例如检测、净化、鲁棒训练/调优和鲁棒推理)来区分攻击和防御。子类别则基于其技术提供更详细的分类。


Fig. 3: Organization of this survey.

图 3: 本调查的组织结构。

· Granularity. To ensure clarity, we simplify the introduction of each reviewed paper, highlighting only its key ideas, objectives, and approaches, while omitting technical details and experimental analyses.

· 粒度。为确保清晰,我们简化了对每篇论文的介绍,仅突出其关键思想、目标和方法,同时省略技术细节和实验分析。

TABLE 1: A summary of existing surveys.

表 1: 现有调查总结。

调查 年份 模型 主题
Zhanget al. [396] 2024 VFM&VLP&VLM 对抗性、越狱、提示注入
Truong et al. [397] 2024 DM 对抗性、后门、成员推断
Zhao et al. [398] 2024 大语言模型 后门
Yi et al. [399] 2024 大语言模型 越狱
Jin et al. [400] 2024 大语言模型&VLM 越狱
Liu et al. [401] 2024 VLM 越狱
Liu et al. [402] 2024 VLM 对抗性、后门、越狱、提示注入
Cui et al. [403] 2024 大语言模型智能体 对抗性、后门、越狱
Deng et al. [404] 2024 大语言模型智能体 对抗性、后门、越狱
Gan et al. [405] 2024 大语言模型&VLM智能体 对抗性、后门、越狱

Our survey methodology is structured as follows. First, we conducted a keyword-based search targeting specific model types and threat types to identify relevant papers. Next, we manually filtered out non-safety-related and non-technical papers. For each remaining paper, we categorized its proposed method or framework by analyzing its settings and attack/defense types, assigning them to appropriate categories and subcategories. In total, we collected 390 technical papers, with their distribution across years, model types, and attack/defense strategies illustrated in Figure 1. As shown, safety research on large models has surged significantly since 2023, following the release of ChatGPT. Among the model types, LLMs and DMs have garnered the most attention, accounting for over ${\bf60%}$ of the surveyed papers. Regarding attack types, jailbreak, adversarial, and backdoor attacks were the most extensively studied. On the defense side, jailbreak defenses received the highest focus, followed by adversarial defenses. Figure 2 presents a cross-view of temporal trends across model types and attack/defense categories, offering a detailed breakdown of the reviewed works. Notably, research on attacks constitutes ${\bf60%}$ of the studied. In terms of defense, while defense research accounts for only $40,%$ , underscoring a significant gap that warrants increased attention toward defense strategies. The overall structure of this survey is outlined in Figure 3.

我们的调查方法结构如下。首先,我们针对特定模型类型和威胁类型进行了基于关键词的搜索,以识别相关论文。接下来,我们手动筛选了与非安全相关和非技术性的论文。对于每篇剩余的论文,我们通过分析其设置和攻击/防御类型,将其提出的方法或框架分类,并将其分配到适当的类别和子类别中。总共,我们收集了390篇技术论文,其在年份、模型类型和攻击/防御策略上的分布如图1所示。如图所示,自2023年ChatGPT发布以来,大模型的安全研究显著激增。在模型类型中,大语言模型和扩散模型最受关注,占调查论文的${\bf60%}$以上。在攻击类型方面,越狱攻击、对抗攻击和后门攻击是研究最多的。在防御方面,越狱防御受到的关注最高,其次是对抗防御。图2展示了模型类型和攻击/防御类别的时间趋势交叉视图,提供了对所审查工作的详细分解。值得注意的是,攻击研究占所研究的${\bf60%}$。在防御方面,防御研究仅占$40,%$,这突显了一个显著的差距,需要加大对防御策略的关注。本调查的整体结构如图3所示。

Difference to Existing Surveys. Large-model safety is a rapidly evolving field, and several surveys have been conducted to advance research in this area. Recently, Slattery et al. [406] introduced an AI risk framework with a systematic taxonomy covering all types of risks. In contrast, our focus is on the technical aspects, specifically the attack and defense techniques proposed in the literature. Table 1 lists the technical surveys we identified, each focusing on a specific type of model or threat (e.g., LLMs, VLMs, or jailbreak attacks/defenses). Compared with these works, our survey offers both a broader scope—covering a wider range of model types and threats—and a higher level perspective, centering on over arching methodologies rather than minute technical details.

与现有综述的差异。大模型安全是一个快速发展的领域,已有几篇综述推动了该领域的研究进展。最近,Slattery 等人 [406] 提出了一个涵盖所有类型风险的系统性分类的人工智能风险框架。相比之下,我们的重点是技术方面,特别是文献中提出的攻击和防御技术。表 1 列出了我们确定的技术综述,每篇综述都侧重于特定类型的模型或威胁(例如,大语言模型、视觉语言模型或越狱攻击/防御)。与这些工作相比,我们的综述既提供了更广泛的范围——涵盖更多类型的模型和威胁,又提供了更高的视角,集中于总体方法论而非细微的技术细节。

2 VISION FOUNDATION MODEL SAFETY

2 视觉基础模型安全

This section surveys safety research on two types of VFMs: per-trained Vision Transformers (ViTs)[407] and the Segment Anything Model (SAM)[408]. We focus on ViTs and SAM because they are among the most widely deployed VFMs and have garnered significant attention in recent safety research.

本节调查了两类视觉基础模型 (VFMs) 的安全研究:预训练的 Vision Transformers (ViTs) [407] 和 Segment Anything Model (SAM) [408]。我们重点关注 ViTs 和 SAM,因为它们是目前部署最广泛的 VFMs,并且在最近的安全研究中引起了广泛关注。

2.1Attacks and Defenses for ViTs

2.1 ViT的攻击与防御

Pre-trained ViTs are widely employed as backbones for various downstream tasks, frequently achieving state-of-the-art performance through efficient adaptation and fine-tuning. Unlike traditional CNNs, ViTs process images as sequences of tokenized patches, allowing them to better capture spatial dependencies. However, this patch-based mechanism also brings unique safety concerns and robustness challenges. This section explores these issues by reviewing ViT-related safety research, including adversarial attacks, backdoor & poisoning attacks, and their corresponding defense strategies. Table 2 provides a summary of the surveyed attacks and defenses, along with the commonly used datasets.

预训练的Vision Transformer (ViT) 被广泛用作各种下游任务的骨干网络,通过高效的适应和微调,经常能够达到最先进的性能。与传统的卷积神经网络 (CNN) 不同,ViT 将图像处理为分块序列的Token,从而能够更好地捕捉空间依赖性。然而,这种基于分块的机制也带来了独特的安全问题和鲁棒性挑战。本节通过回顾与 ViT 相关的安全研究,包括对抗攻击、后门与中毒攻击及其相应的防御策略,探讨这些问题。表 2 总结了所调查的攻击和防御方法,以及常用的数据集。

2.1.1 Adversarial Attacks

2.1.1 对抗攻击

Adversarial attacks on ViTs can be classified into white-box attacks and black-box attacks based on whether the attacker has full access to the victim model. Based on the attack strategy, white-box attacks can be further divided into 1) patch attacks, 2) position embedding attacks and 3) attention attacks, while black-box attacks can be summarized into 1) transfer-based attacks and 2) query-based attacks.

针对 ViTs 的对抗攻击根据攻击者是否完全访问受害者模型可分为白盒攻击和黑盒攻击。根据攻击策略,白盒攻击可进一步分为 1) 补丁攻击 (patch attacks)、2) 位置嵌入攻击 (position embedding attacks) 和 3) 注意力攻击 (attention attacks),而黑盒攻击则可概括为 1) 基于迁移的攻击 (transfer-based attacks) 和 2) 基于查询的攻击 (query-based attacks)。

2.1.1.1 White-box Attacks

2.1.1.1 白盒攻击

Patch Attacks exploit the modular structure of ViTs, aiming to manipulate their inference processes by introducing targeted perturbations in specific patches of the input data. Joshi et al. [17] proposed an adversarial token attack method leveraging block sparsity to assess the vulnerability of ViTs to token-level perturbations. Expanding on this, Patch-Fool [1] introduces an adversarial attack framework that targets the self-attention modules by perturbing individual image patches, thereby manipulating attention scores. Different from existing methods, SlowFormer [2] introduces a universal adversarial patch can be applied to any image to increases computational and energy costs while preserving model accuracy.

补丁攻击利用ViTs的模块化结构,通过在输入数据的特定补丁中引入有针对性的扰动来操纵其推理过程。Joshi等人[17]提出了一种利用块稀疏性的对抗性Token攻击方法,以评估ViTs对Token级扰动的脆弱性。在此基础上,Patch-Fool[1]引入了一个对抗性攻击框架,通过扰动单个图像补丁来针对自注意力模块,从而操纵注意力分数。与现有方法不同,SlowFormer[2]引入了一种通用对抗性补丁,可以应用于任何图像,在保持模型准确性的同时增加计算和能源成本。

Position Embedding Attacks aim to attack the spatial or sequential position of tokens in transformers. For example, PEAttack [3] explores the common vulnerability of positional embeddings to adversarial perturbations by disrupting their ability to encode positional information through periodicity manipulation, linearity distortion, and optimized embedding distortion.

位置嵌入攻击旨在攻击 Transformer 中 Token 的空间或序列位置。例如,PEAttack [3] 通过周期性操作、线性扭曲和优化的嵌入扭曲来破坏位置嵌入编码位置信息的能力,从而探索位置嵌入对对抗性扰动的常见漏洞。

Attention Attacks target vulnerabilities in the self-attention modules of ViTs. Attention-Fool [4] manipulates dot-product similarities to redirect queries to adversarial key tokens, exposing the model's sensitivity to adversarial patches. Similarly, AAS [5] mitigates gradient masking in ViTs by optimizing the pre-softmax output scaling factors, enhancing the effectiveness of attacks.

注意力攻击针对 ViTs 中自注意力模块的漏洞。Attention-Fool [4] 通过操纵点积相似性,将查询重定向到对抗性关键 Token,暴露了模型对对抗性补丁的敏感性。类似地,AAS [5] 通过优化预 softmax 输出缩放因子,缓解了 ViTs 中的梯度掩码,增强了攻击的有效性。

2.1.1.2 Black-box Attacks

2.1.1.2 黑盒攻击

Transfer-based Attacks first generate adversarial examples using fully accessible surrogate models, which are then transferred to attack black-box victim ViTs.In this context, we first review attacks specifically designed for the ViT architecture. SE-TR [6] enhances adversarial transfer ability by optimizing perturbations on an ensemble of models. ATA [7] strategically activates uncertain attention and perturbs sensitive embeddings within ViTs. LPM [9] mitigates the over fitting to model-specific disc rim i native regions through a patch-wise optimized binary mask. Chen et al. [18] introduced an Inductive Bias Attack (IBA) to suppress unique biases in ViTs and target shared inductive biases. TGR [11] reduces the variance of the back propagated gradient within internal blocks. VDC [12] employs virtual dense connections between deeper attention maps and MLP blocks to facilitate gradient back propagation. FDAP [13] exploits feature collapse by reducing high-frequency components in feature space. CRFA [16] disrupts only the most crucial image regions using approximate attention maps. SASD-WS [15] flattens the loss landscape of the source model through sharpness-aware self-distillation and approximates an ensemble of pruned models using weight scaling to improve target adversarial transfer ability.

基于迁移的攻击首先使用完全可访问的代理模型生成对抗样本,然后将其迁移到攻击黑盒受害者 ViT。在此背景下,我们首先回顾了专门为 ViT 架构设计的攻击。SE-TR [6] 通过在模型集合上优化扰动来增强对抗迁移能力。ATA [7] 策略性地激活不确定的注意力并扰动 ViT 内的敏感嵌入。LPM [9] 通过逐块优化的二值掩码减轻了对模型特定判别区域的过拟合。Chen 等人 [18] 引入了归纳偏置攻击 (IBA) 以抑制 ViT 中的独特偏置并针对共享的归纳偏置。TGR [11] 减少了内部块中反向传播梯度的方差。VDC [12] 在更深层的注意力图和 MLP 块之间使用虚拟密集连接以促进梯度反向传播。FDAP [13] 通过减少特征空间中的高频分量来利用特征崩溃。CRFA [16] 仅使用近似的注意力图破坏最关键图像区域。SASD-WS [15] 通过锐度感知自蒸馏使源模型的损失景观平坦化,并使用权重缩放近似剪枝模型的集合以提高目标对抗迁移能力。

Other strategies are applicable to both ViTs and CNNs, ensuring broader applicability in black-box settings. Wei et al. [8], [19] proposed a dual attack framework to improve transfer ability between ViTs and CNNs: 1) a Pay No Attention (PNA) attack, which skips the gradients of attention during back propagation, and 2) a PatchOut attack, which randomly perturbs subsets of image patches at each iteration. MIG [10] uses integrated gradients and momentum-based updates to precisely target modelagnostic critical regions, improving transfer ability between ViTs

其他策略适用于ViTs和CNNs,确保在黑盒设置中的更广泛适用性。Wei等[8], [19]提出了一个双重攻击框架,以提高ViTs和CNNs之间的转移能力:1) Pay No Attention (PNA)攻击,在反向传播期间跳过注意力梯度,2) PatchOut攻击,每次迭代随机扰动图像补丁的子集。MIG [10]使用集成梯度和基于动量的更新来精确定位模型无关的关键区域,提高ViTs之间的转移能力

and CNNs.

和 CNN。

Query-based Attacks generate adversarial examples by querying the black-box model and levering the model responses to estimate the adversarial gradients. The goal is to achieve successful attack with a minimal number of queries. Based on the type of model response, query-based attacks can be further divided into score-based attacks, where the model returns a probability vector, and decision-based attacks, where the model provides only the top-k classes. Decision-based attacks typically start from a large random noise (to achieve mis classification first) and then gradually find smaller noise while maintaining mis classification. To improve the efficiency of the adversarial noise searching process in ViTs, PAR [14] introduces a coarse-to-fine patch searching method, guided by noise magnitude and sensitivity masks to account for the structural characteristics of ViTs and mitigate the negative impact of non-overlapping patches.

基于查询的攻击通过查询黑盒模型并利用模型响应来估计对抗梯度,从而生成对抗样本。其目标是以最少的查询次数实现成功的攻击。根据模型响应的类型,基于查询的攻击可以进一步分为基于分数的攻击(模型返回概率向量)和基于决策的攻击(模型仅提供前 k 个类别)。基于决策的攻击通常从较大的随机噪声开始(先实现错误分类),然后逐步找到较小的噪声,同时保持错误分类。为了提高 ViTs 中对抗噪声搜索过程的效率,PAR [14] 引入了一种由粗到细的 patch 搜索方法,通过噪声幅度和敏感度掩码来指导,以考虑 ViTs 的结构特征并减轻非重叠 patch 的负面影响。

2.1.2 Adversarial Defenses

2.1.2 对抗防御

Adversarial defenses for ViTs follow four major approaches: 1) adversarial training, which trains ViTs on adversarial examples via min-max optimization to improve its robustness; 2) adversarial detection, which identifies and mitigates adversarial attacks by detecting abnormal or malicious patterns in the inputs; 3) robust architecture, which modifies and optimizes the architecture (e.g., self-attention module) of ViTs to improve their resilience against adversarial attacks; and 4) adversarial purification, which preprocesses the input (e.g., noise injection, denoising, or other transformations) to remove potential adversarial perturbations before inference.

针对ViTs的对抗防御主要遵循四种方法:1) 对抗训练,通过最小-最大优化在对抗样本上训练ViTs以提高其鲁棒性;2) 对抗检测,通过检测输入中的异常或恶意模式来识别和缓解对抗攻击;3) 鲁棒架构,修改和优化ViTs的架构(例如自注意力模块)以提高其对抗攻击的抵抗力;4) 对抗净化,在推理前预处理输入(例如噪声注入、去噪或其他变换)以去除潜在的对抗扰动。

Adversarial Training is widely regarded as the most effective approach to adversarial defense; however, it comes with a high computational cost. To address this on ViTs, AGAT [20] introduces a dynamic attention-guided dropping strategy, which accelerates the training process by selectively removing certain patch embeddings at each layer. This reduces computational overhead while maintaining robustness, especially on large datasets such as ImageNet. Due to its high computational cost, research on adversarial training for ViTs has been relatively limited. ARDPRM [28] improves adversarial robustness by randomly dropping gradients in attention blocks and masking patch perturbations during training.

对抗训练被广泛认为是最有效的对抗防御方法;然而,它伴随着高昂的计算成本。为了解决 ViT 上的这一问题,AGAT [20] 引入了一种动态注意力引导的丢弃策略,通过在每个层有选择性地移除某些 patch embedding 来加速训练过程。这降低了计算开销,同时保持了鲁棒性,尤其是在 ImageNet 等大型数据集上。由于其高昂的计算成本,针对 ViT 的对抗训练研究相对有限。ARDPRM [28] 通过在训练期间随机丢弃注意力块中的梯度并掩盖 patch 扰动,提高了对抗鲁棒性。

Adversarial Detection methods for ViTs primarily leverage two key features, i.e., patch-based inference and activation characteristics, to detect and mitigate adversarial examples. Li et al. [21] proposed the concept of Patch Vestiges, abnormalities arising from adversarial examples during patch division in ViTs. They used statistical metrics on step changes between adjacent pixels across patches and developed a binary regression classifier to detect adversaries. Alternatively, ARMOR [23] identifies adversarial patches by scanning for unusually high column scores in specific layers and masking them with average images to reduce their impact. ViTGuard [22], on the other hand, employs a masked auto encoder to detect patch attacks by analyzing attention maps and CLS token representations. As more attacks are developed, there is a growing need for a unified detection framework capable of handling all types of adversarial examples.

针对 ViTs 的对抗检测方法主要利用两个关键特征,即基于 patch 的推理和激活特性,来检测和缓解对抗样本。Li 等人 [21] 提出了 Patch Vestiges 的概念,即在 ViTs 的 patch 划分过程中由对抗样本引起的异常现象。他们对跨 patch 的相邻像素之间的阶跃变化使用了统计指标,并开发了一个二元回归分类器来检测对抗样本。另一方面,ARMOR [23] 通过在特定层中扫描异常高的列分数来识别对抗 patch,并用平均图像对其进行掩码处理以减少其影响。而 ViTGuard [22] 则使用掩码自动编码器通过分析注意力图和 CLS token 表示来检测 patch 攻击。随着更多攻击的出现,越来越需要一个能够处理所有类型对抗样本的统一检测框架。

Robust Architecture methods focus on designing more adversa ri ally resilient attention modules for ViTs. For example, Smoothed Attention [27] employs temperature scaling in the softmax function to prevent any single patch from dominating the attention, thereby balancing focus across patches. ReiT [32] integrates adversarial training with random iz ation through the IIReSA module, optimizing randomly entangled tokens to reduce adversarial similarity and enhance robustness. TAP [29] addresses token over focusing by implementing token-aware average pooling and an attention diversification loss, which incorporate local neighborhood information and reduce cosine similarity among attention vectors. FViTs [31] strengthen explanation faithfulness by stabilizing top $\mathbf{\nabla}\cdot\mathbf{k}$ indices in self-attention and robustify predictions using denoised diffusion smoothing combined with Gaussian noise. RSPC [30] tackles vulnerabilities by corrupting the most sensitive patches and aligning intermediate features between clean and corrupted inputs to stabilize the attention mechanism. Collectively, these advancements underscore the pivotal role of the attention mechanism in improving the adversarial robustness of ViTs.

鲁棒架构方法专注于为视觉Transformer (ViT) 设计更具对抗性弹性的注意力模块。例如,[27] 在softmax函数中使用温度缩放来防止任何单个patch主导注意力,从而平衡各个patch的注意力。[32] 通过IIReSA模块将对抗训练与随机化结合,优化随机纠缠的Token以减少对抗相似性并增强鲁棒性。[29] 通过实现Token感知的平均池化和注意力多样化损失来解决Token过度聚焦问题,该方法结合了局部邻域信息并减少注意力向量之间的余弦相似性。[31] 通过稳定自注意力中的top $\mathbf{\nabla}\cdot\mathbf{k}$ 索引,并结合高斯噪声的去噪扩散平滑来增强解释的忠实性和预测的鲁棒性。[30] 通过破坏最敏感的patch并在干净和破坏的输入之间对齐中间特征来稳定注意力机制,从而解决脆弱性问题。这些进展共同强调了注意力机制在提升ViT对抗鲁棒性中的关键作用。

Adversarial Purification refers to a model-agnostic inputprocessing technique that is broadly applicable across various architectures, including but not limited to ViTs. DiffPure [33] introduces a framework where adversarial images undergo noise injection via a forward stochastic differential equation (SDE) process, followed by denoising with a pre-trained diffusion model. CGDMP [24] refines this approach by optimizing the noise level for the forward process and employing contrastive loss gradients to guide the denoising process, achieving improved purification tailored to ViTs. ADBM [25] highlights the disparity between diffused adversarial and clean examples, proposing a method to directly connect the clean and diffused adversarial distributions. While these methods focus on ViTs, other approaches demonstrate broader applicability to various vision models, e.g., CNNs. Purify $^{++}$ [34] enhances DiffPure with improved diffusion models, DifFilter [35] extends noise scales to better preserve semantics, and Mimic Diffusion [36] mitigates adversarial impacts during the reverse diffusion process. For improved efficiency, OsCP [26] and LightPure [37] propose single-step and real-time purification methods, respectively. LoRID [38] introduces a Markov-based approach for robust purification. These methods complement ViTrelated research and highlight diverse advancements in adversarial purification.

对抗净化 (Adversarial Purification) 指的是一种与模型无关的输入处理技术,广泛适用于各种架构,包括但不限于 ViT (Vision Transformer)。DiffPure [33] 提出了一个框架,其中对抗图像通过前向随机微分方程 (SDE) 过程进行噪声注入,然后使用预训练的扩散模型进行去噪。CGDMP [24] 通过优化前向过程的噪声水平,并采用对比损失梯度来指导去噪过程,改进了该方法,实现了针对 ViT 的优化净化。ADBM [25] 强调了扩散对抗样本与干净样本之间的差异,提出了一种直接连接干净样本和扩散对抗样本分布的方法。虽然这些方法主要关注 ViT,但其他方法展示了其对各种视觉模型(例如 CNN)的广泛适用性。Purify$^{++}$ [34] 通过改进的扩散模型增强了 DiffPure,DifFilter [35] 扩展了噪声尺度以更好地保留语义,Mimic Diffusion [36] 在反向扩散过程中减轻了对抗影响。为了提高效率,OsCP [26] 和 LightPure [37] 分别提出了单步和实时净化方法。LoRID [38] 引入了一种基于马尔可夫的方法来进行鲁棒净化。这些方法补充了 ViT 相关研究,并突出了对抗净化领域的多样化进展。

2.1.3Backdoor Attacks

2.1.3 后门攻击

Backdoors can be injected into the victim model via data poisoning, training manipulation, or parameter editing, with most existing attacks on ViTs being data poisoning-based. We classify these attacks into four categories: 1) patch-level attacks, 2) token-level attacks, and 3) multi-trigger attacks, which exploit ViT-specific data processing characteristics, as well as 4) datafree attacks, which exploit the inherent mechanisms of ViTs.

后门可以通过数据投毒、训练操纵或参数编辑注入到受害模型中,现有对ViT的攻击大多基于数据投毒。我们将这些攻击分为四类:1)基于补丁(patch-level)的攻击,2)基于token(token-level)的攻击,3)多触发器(multi-trigger)攻击,这些攻击利用了ViT特定的数据处理特性,以及4)无数据(datafree)攻击,这些攻击利用了ViT的固有机制。

Patch-level Attacks primarily exploit the ViT's characteristic of processing images as discrete patches by implanting triggers at the patch level. For example, BadViT [39] introduces a universal patch-wise trigger that requires only a small amount of data to redirect the model's focus from classification-relevant patches to adversarial triggers. TrojViT [40] improves this approach by utilizing patch salience ranking, an attention-targeted loss function, and parameter distillation to minimize the bit flips necessary to embed the backdoor.

Patch-level Attacks 主要利用 ViT 将图像处理为离散 patch 的特性,在 patch 级别植入触发器。例如,BadViT [39] 引入了一种通用的 patch-wise 触发器,只需少量数据即可将模型的注意力从分类相关的 patch 转移到对抗触发器上。TrojViT [40] 通过利用 patch 显著性排名、注意力目标损失函数和参数蒸馏来改进这种方法,以最小化嵌入后门所需的比特翻转。

Token-level Attacks target the token iz ation layer of ViTs. SWARM [41] introduces a switchable backdoor mechanism featuring a “switch token" that dynamically toggles between benign and adversarial behaviors, ensuring high attack success rates while maintaining functionality in clean environments.

Token级别攻击针对ViTs的Token化层。SWARM [41] 引入了一种可切换的后门机制,该机制包含一个“切换Token”,能够在良性行为和对抗行为之间动态切换,确保在干净环境中保持功能的同时实现高攻击成功率。

Multi-trigger Attacks employ multiple backdoor triggers in parallel, sequential, or hybrid configurations to poison the victim dataset. MTBAs [43] utilize these multiple triggers to induce coexistence, overwriting, and cross-activation effects, significantly diminishing the effectiveness of existing defense mechanisms.

多触发攻击采用并行、顺序或混合配置的多个后门触发器来污染受害者数据集。MTBAs [43] 利用这些多个触发器引发共存、覆盖和交叉激活效应,显著降低了现有防御机制的有效性。

Data-free Attacks eliminate the need for original training datasets. Using substitute datasets, DBIA [42] generates universal triggers that maximize attention within ViTs. These triggers are fine-tuned with minimal parameter adjustments using PGD [409], enabling efficient and resource-light backdoor injection.

无需数据攻击消除对原始训练数据集的需求。DBIA [42] 使用替代数据集生成能在 ViT 中最大化注意力的通用触发器。这些触发器通过 PGD [409] 进行微调,仅需最少的参数调整,从而实现高效且资源消耗低的后门注入。

2.1.4Backdoor Defenses

2.1.4 后门防御

Backdoor defenses for ViTs aim to identify and break (or remove) the correlation between trigger patterns and target classes while preserving model accuracy. Two representative defense strategies are: 1) patch processing, which disrupts the integrity of image patches to prevent trigger activation, and 2) image blocking, which leverages interpret ability-based mechanisms to mask and neutralize the effects of backdoor triggers.

针对ViTs的后门防御旨在识别并打破(或移除)触发模式与目标类别之间的关联,同时保持模型准确性。两种代表性的防御策略包括:1) 图像块处理,通过破坏图像块的完整性来防止触发激活;2) 图像屏蔽,利用基于可解释性的机制来掩盖并中和后门触发器的影响。

Patch Processing strategy disrupts the integrity of patches to neutralize triggers. Doan et al. [44] found that clean-data accuracy and attack success rates of ViTs respond differently to patch transformations before positional encoding, and proposed an effective defense method by randomly dropping or shufling patches of an image to counter both patch-based and blendingbased backdoor attacks. Image Blocking utilizes interpret a bility to identify and neutralize triggers. Subramanya et al. [46] showed that ViTs can localize backdoor triggers using attention maps and proposed a defense mechanism that dynamically masks potential trigger regions during inference. In a subsequent work, Subramanya et al. [45] proposed to integrate trigger neutralization into the training phase to improve the robustness of ViTs to backdoor attacks. While these two methods are promising, the field requires a holistic defense framework that integrates nonViT defenses with ViT-specific characteristics and unifies multiple defense tasks including backdoor detection, trigger inversion, and backdoor removal, as attempted in [410].

补丁处理策略通过破坏补丁的完整性来中和触发器。Doan 等人 [44] 发现,在位置编码之前,ViT (Vision Transformer) 的干净数据精度和攻击成功率对补丁变换的响应不同,并提出了一种有效的防御方法,即随机丢弃或打乱图像的补丁,以应对基于补丁和基于混合的后门攻击。图像阻断利用可解释性来识别和中和触发器。Subramanya 等人 [46] 表明,ViT 可以使用注意力图定位后门触发器,并提出了一种在推理过程中动态屏蔽潜在触发区域的防御机制。在后续工作中,Subramanya 等人 [45] 提出将触发器中和技术集成到训练阶段,以提高 ViT 对后门攻击的鲁棒性。尽管这两种方法很有前景,但该领域需要一个综合的防御框架,将非 ViT 防御与 ViT 特定特性相结合,并统一包括后门检测、触发器反演和后门移除在内的多种防御任务,如 [410] 中所尝试的那样。

2.1.5 Datasets

2.1.5 数据集

Datasets are crucial for developing and evaluating attack and defense methods. Table 2 summarizes the datasets used in adversarial and backdoor research.

数据集对于开发和评估攻击与防御方法至关重要。表 2 总结了用于对抗性和后门研究的数据集。

Datasets for Adversarial Research As shown in Table 2, adversarial researches were primarily conducted on ImageNet. While attacks were tested across various datasets like CIFAR10/100, Food-101, and GLUE, defenses were mainly limited to ImageNet and CIFAR-10/10o. This imbalance reveals one key issue in adversarial research: attacks are more versatile, while defenses struggle to generalize across different datasets.

对抗研究的数据集 如表 2 所示,对抗研究主要在 ImageNet 上进行。虽然攻击在 CIFAR10/100、Food-101 和 GLUE 等各种数据集上进行了测试,但防御主要局限于 ImageNet 和 CIFAR-10/10o。这种不平衡揭示了对抗研究中的一个关键问题:攻击更具通用性,而防御难以在不同数据集上泛化。

Datasets for Backdoor Research Backdoor researches were also conducted mainly on ImageNet and CIFAR-10/100 datasets. Some attacks, such as DBIA and SWARM,extend to domainspecific datasets like GTSRB and VGGFace, while defenses, including PatchDrop, were often limited to a few benchmarks. This narrow focus reduces their real-world applicability. Although backdoor defenses are shifting towards robust inference techniques, they typically target specific attack patterns, limiting their general iz ability. To address this, adaptive defense strategies need to be tested across a broader range of datasets to effectively counter the evolving nature of backdoor threats.

用于后门研究的数据集 后门研究主要在 ImageNet 和 CIFAR-10/100 数据集上进行。一些攻击,如 DBIA 和 SWARM,扩展到特定领域的数据集,如 GTSRB 和 VGGFace,而防御方法,包括 PatchDrop,通常仅限于少数基准测试。这种狭窄的焦点降低了它们的实际应用性。尽管后门防御正在转向鲁棒推理技术,但它们通常针对特定的攻击模式,限制了它们的泛化能力。为了解决这个问题,自适应防御策略需要在更广泛的数据集上进行测试,以有效应对不断演变的后门威胁。

2.2 Attacks and Defenses for SAM

2.2 SAM的攻击与防御

SAM is a foundational model for image segmentation, comprising three primary components: a ViT-based image encoder, a prompt encoder, and a mask decoder. The image encoder transforms highresolution images into embeddings, while the prompt encoder converts various input modalities into token embeddings. The mask decoder combines these embeddings to generate segmentation masks using a two-layer Transformer architecture. Due to its complex structure, attacks and defenses targeting SAM differ significantly from those developed for CNNs. These unique challenges stem from SAM's modular and interconnected design, where vulnerabilities in one component can propagate to others, necessitating specialized strategies for both attack and defense. This section systematically reviews SAM-related adversarial attacks, backdoor & poisoning attacks, and adversarial defense strategies, as summarized in Table 2.

SAM是一种用于图像分割的基础模型,包含三个主要组件:基于ViT的图像编码器、提示编码器和掩码解码器。图像编码器将高分辨率图像转换为嵌入,而提示编码器将各种输入模态转换为Token嵌入。掩码解码器使用两层Transformer架构结合这些嵌入来生成分割掩码。由于其复杂的结构,针对SAM的攻击和防御与为CNN开发的攻击和防御有显著差异。这些独特的挑战源于SAM的模块化和互连设计,其中一个组件的漏洞可能会传播到其他组件,因此需要专门的攻击和防御策略。本节系统性地回顾了与SAM相关的对抗攻击、后门和投毒攻击以及对抗防御策略,如表2所示。

2.2.1 Adversarial Attacks

2.2.1 对抗攻击

Adversarial attacks on SAM can be categorized into: (1) whitebox attacks, exemplified by prompt-agnostic attacks, and (2) black-box attacks, which can be further divided into universal attacks and transfer-based attacks. Each category employs distinct strategies to compromise segmentation performance.

针对 SAM 的对抗攻击可分为:(1) 白盒攻击,例如与提示无关的攻击,以及 (2) 黑盒攻击,黑盒攻击又可进一步分为通用攻击和基于迁移的攻击。每类攻击都采用不同的策略来破坏分割性能。

2.2.1.1 White-box Attacks

2.2.1.1 白盒攻击

Prompt-Agnostic Attacks are white-box attacks that disrupt SAM's segmentation without relying on specific prompts, using either prompt-level or feature-level perturbations for generality across inputs. For prompt-level attacks, Shen et al. [47] proposed a grid-based strategy to generate adversarial perturbations that disrupt segmentation regardless of click location. For featurelevel attacks, Croce et al. [48] perturbed features from the image encoder to distort spatial embeddings, undermining SAM's segmentation integrity.

提示无关攻击是一种白盒攻击,它不依赖特定提示来破坏 SAM 的分割,使用提示级或特征级扰动以实现输入的通用性。对于提示级攻击,Shen 等人 [47] 提出了一种基于网格的策略来生成对抗性扰动,无论点击位置如何都能破坏分割。对于特征级攻击,Croce 等人 [48] 扰乱了图像编码器的特征,以扭曲空间嵌入,从而破坏 SAM 的分割完整性。

2.2.1.2 Black-box Attacks

2.2.1.2 黑盒攻击

Universal Attacks generate UAPs [411] that can consistently disrupt SAM across arbitrary prompts. Han et al. [53] exploited contrastive learning to optimize the UAPs, achieving better attack performance by exacerbating feature misalignment. DarkSAM [54], on the other hand, introduces a hybrid spatial-frequency framework that combines semantic decoupling and texture distortion to generate universal perturbations.

通用攻击生成通用对抗样本 (UAPs) [411],这些样本可以在任意提示下持续干扰 SAM。Han 等人 [53] 利用对比学习来优化 UAPs,通过加剧特征错位来实现更好的攻击性能。另一方面,DarkSAM [54] 引入了一种混合空间-频率框架,结合语义解耦和纹理失真来生成通用对抗样本。

Transfer-based Attacks exploit transferable representations in SAM to generate perturbations that remain adversarial across different models and tasks. $\mathbf{PATA++}$ [50] improves transfer ability by using a regular iz ation loss to highlight key features in the image encoder, reducing reliance on prompt-specific data. Attack-SAM [49] employs ClipMSE loss to focus on mask removal, optimizing for spatial and semantic consistency to improve cross-task transfer ability. UMI-GRAT [52] follows a two-step process: it first generates a general iz able perturbation with a surrogate model and then applies gradient robust loss to improve across-model transfer ability. Apart from designing new loss functions, optimization over transformation techniques can also be exploited to improve transfer ability. This includes T-RA [47], which improves cross-model transfer ability by applying spectrum transformations to generate adversarial perturbations that degrade segmentation in SAM variants, and UAD [51], which generates adversarial examples by deforming images in a two-stage process and aligning features with the deformed targets.

基于迁移的攻击利用 SAM 中的可迁移表示生成在不同模型和任务中保持对抗性的扰动。$\mathbf{PATA++}$ [50] 通过使用正则化损失来突出图像编码器中的关键特征,减少对特定提示数据的依赖,从而提高了迁移能力。Attack-SAM [49] 采用 ClipMSE 损失专注于掩码移除,优化空间和语义一致性,以提高跨任务迁移能力。UMI-GRAT [52] 遵循两步过程:首先生成具有代理模型的可泛化扰动,然后应用梯度鲁棒损失以提高跨模型迁移能力。除了设计新的损失函数外,还可以利用变换技术的优化来提高迁移能力。这包括 T-RA [47],它通过对频谱变换生成对抗性扰动,从而降低 SAM 变体中的分割效果,提高跨模型迁移能力;以及 UAD [51],它通过两阶段过程变形图像并将特征与变形目标对齐来生成对抗样本。

2.2.2 Adversarial Defenses

2.2.2 对抗防御

Adversarial defenses for SAM are currently limited, with existing approaches focusing primarily on adversarial tuning, which integrates adversarial training into the prompt tuning process of SAM. For example, ASAM [55] utilizes a stable diffusion model to generate realistic adversarial samples on a low-dimensional manifold through diffusion model-based tuning. ControlNet [412] is then employed to guide the re-projection process, ensuring that the generated samples align with the original mask annotations. Finally, SAM is fine-tuned using these adversarial examples.

针对 SAM 的对抗防御目前较为有限,现有方法主要集中在对抗调优上,即通过将对抗训练整合到 SAM 的提示调优过程中。例如,ASAM [55] 利用稳定扩散模型,通过基于扩散模型的调优在低维流形上生成逼真的对抗样本。随后,ControlNet [412] 被用于指导重投影过程,确保生成的样本与原始掩码标注保持一致。最后,SAM 使用这些对抗样本进行微调。

2.2.3 Backdoor & Poisoning Attacks

2.2.3 后门与投毒攻击

Backdoor and poisoning attacks on SAM remain under explored. Here, we review one backdoor attack that leverages perceptible visual triggers to compromise SAM, and one poisoning attack that exploits unlearn able examples [413] with imperceptible noise to protect unauthorized image data from being exploited by segmentation models. BadSAM [56] is a backdoor attack targeting SAM that embeds visual triggers during the model's adaptation phase, implanting backdoors that enable attackers to manipulate the model's output with specific inputs. Specifically, the attack introduces MLP layers to SAM and injects the backdoor trigger into these layers via SAM-Adapter [414]. UnSeg [57] is a data poisoning attack on SAM designed for benign purposes, i.e., data protection. It fine-tunes a universal unlearn able noise generator, leveraging a bilevel optimization framework based on a pre-trained SAM. This allows the generator to efficiently produce poisoned (protected) samples, effectively preventing a segmentation model from learning from the protected data and thereby safeguarding against unauthorized exploitation of personal information.

针对 SAM 的后门攻击和投毒攻击仍未被充分探索。在此,我们回顾了一种利用可感知的视觉触发器损害 SAM 的后门攻击,以及一种利用不可察觉噪声的不可学习样本 [413] 来防止未经授权的图像数据被分割模型利用的投毒攻击。BadSAM [56] 是一种针对 SAM 的后门攻击,它在模型的适应阶段嵌入视觉触发器,植入后门,使攻击者能够通过特定输入操控模型的输出。具体来说,该攻击向 SAM 引入了 MLP 层,并通过 SAM-Adapter [414] 将后门触发器注入这些层。UnSeg [57] 是一种针对 SAM 的数据投毒攻击,旨在实现良性目的,即数据保护。它微调了一个通用的不可学习噪声生成器,利用基于预训练 SAM 的双层优化框架。这使得生成器能够高效生成中毒(受保护)样本,有效防止分割模型从受保护数据中学习,从而保护个人信息免受未经授权的利用。

2.2.4 Datasets

2.2.4 数据集

As shown in Table 2, the datasets used in safety research on SAM slightly differ from those typically used in general segmentation tasks [415], [416]. For attack research, the SA-1B dataset and its subsets [408] are the most commonly used for evaluating adversarial attacks [47]-[51], [53]. Additionally, DarkSAM was evaluated on datasets such as Cityscapes [417], COCO [418], and ADE20k [419], while UMI-GRAT, which targets downstream tasks related to SAM, was tested on medical datasets like CT-Scans and ISTD, as well as camouflage datasets, including COD10K, CAMO, and CHAME. For backdoor attacks, BadSAM was assessed using the CAMO dataset [420]. In the context of data poisoning, UnSeg [57] was evaluated across 10 datasets, including COCO, Cityscapes, ADE20k, WHU, and medical datasets like Lung and Kvasir-seg. For defense research, ASAM [55] is currently the only defense method applied to SAM. It was evaluated on a range of datasets with more diverse image distributions than SA-1B, including ADE20k, LVIS, COCO, and others, with mean Intersection over Union (mIoU) used as the evaluation metric.

如表 2 所示,SAM 安全研究中使用的数据集与一般分割任务中常用的数据集略有不同 [415], [416]。在攻击研究中,SA-1B 数据集及其子集 [408] 是评估对抗攻击最常用的数据集 [47]-[51], [53]。此外,DarkSAM 在 Cityscapes [417], COCO [418] 和 ADE20k [419] 等数据集上进行了评估,而针对 SAM 下游任务的 UMI-GRAT 则在 CT-Scans 和 ISTD 等医学数据集以及 COD10K, CAMO 和 CHAME 等伪装数据集上进行了测试。对于后门攻击,BadSAM 使用了 CAMO 数据集 [420] 进行评估。在数据投毒方面,UnSeg [57] 在包括 COCO, Cityscapes, ADE20k, WHU 以及 Lung 和 Kvasir-seg 等医学数据集在内的 10 个数据集上进行了评估。在防御研究方面,ASAM [55] 是目前唯一应用于 SAM 的防御方法。它在比 SA-1B 具有更多样化图像分布的数据集上进行了评估,包括 ADE20k, LVIS, COCO 等,并使用平均交并比 (mIoU) 作为评估指标。

3 LARGE LANGUAGE MODEL SAFETY

3 大语言模型安全

LLMs are powerful language models that excel at generating human-like text, translating languages, producing creative content, and answering a diverse array of questions [421], [422]. They have been rapidly adopted in applications such as conversational agents, automated code generation, and scientific research. Yet, this broad utility also introduces significant vulnerabilities that potential adversaries can exploit. This section surveys the current landscape of LLM safety research. We examine a spectrum of adversarial behaviors, including jailbreak, prompt injection, backdoor, poisoning, model extraction, data extraction, and energy-latency attacks. Such attacks can manipulate outputs, bypass safety measures, leak sensitive information, and disrupt services, thereby threatening system integrity, confidentiality, and availability. We also review state-of-the-art alignment strategies and defense techniques designed to mitigate these risks. Tables 3 and 4 summarize the details of theseworks.

大语言模型 (LLM) 是一种强大的语言模型,擅长生成类人文本、翻译语言、创作创意内容以及回答各种问题 [421], [422]。它们已迅速应用于对话代理、自动化代码生成和科学研究等领域。然而,这种广泛的应用也带来了潜在对手可以利用的重大漏洞。本节概述了当前大语言模型安全研究的现状。我们探讨了一系列对抗行为,包括越狱、提示注入、后门、投毒、模型提取、数据提取和能耗-延迟攻击。这些攻击可以操纵输出、绕过安全措施、泄露敏感信息并中断服务,从而威胁系统的完整性、机密性和可用性。我们还回顾了最先进的对齐策略和防御技术,旨在减轻这些风险。表 3 和表 4 总结了这些工作的详细信息。

3.1 Adversarial Attacks

3.1 对抗攻击

Adversarial attacks on LLMs aim to manipulate a model's response by subtly altering input text. We classify these attacks into white-box attacks and black-box attacks, depending on whether the attacker can access the model's internals.

针对大语言模型的对抗攻击旨在通过微妙地修改输入文本来操纵模型的响应。我们根据攻击者是否能访问模型的内部信息,将这些攻击分为白盒攻击和黑盒攻击。

3.1.1 White-box Attacks

3.1.1 白盒攻击

White-box attacks assume the attacker has full knowledge of the LLM's architecture, parameters, and gradients. This enables the construction of highly effective adversarial examples by directly optimizing against the model's predictions. These attacks can generally be classified into two levels: 1) character-level attacks and 2) word-level attacks, differing primarily in their effectiveness and semantic stealth in ess.

白盒攻击假设攻击者完全了解大语言模型的架构、参数和梯度。这使得攻击者能够通过直接针对模型的预测进行优化,构建出高效的对抗样本。这些攻击通常可以分为两个层次:1) 字符级攻击和 2) 词级攻击,主要区别在于其有效性和语义上的隐蔽性。

Character-level Attacks introduce subtle modifications at the character level, such as misspellings, typographical errors, and the insertion of visually similar or invisible characters (e.g., homoglyphs [58]). These attacks exploit the model's sensitivity to minor character variations, which are often unnoticeable to humans, allowing for a high degree of stealth in ess while potentially preserving the original meaning.

字符级攻击

Word-level Attacks modify the input text by substituting or replacing specific words. For example, TextFooler [59] and BERT-Attack [60] employ synonym substitution to generate adversarial examples while preserving semantic similarity. Other methods, such as GBDA [61] and GRAD OBSTINATE [63], leverage gradient information to identify semantically similar word substitutions that maximize the likelihood of a successful attack.Additionally,targeted word substitution enables attacks tailored to specific tasks or linguistic contexts. For instance, [62] explores targeted attacks on named entity recognition, while [64] adapts word substitution attacks for the Chinese language.

词级攻击通过替换或替换特定单词来修改输入文本。例如,TextFooler [59] 和 BERT-Attack [60] 使用同义词替换来生成对抗样本,同时保持语义相似性。其他方法,如 GBDA [61] 和 GRAD OBSTINATE [63],利用梯度信息来识别语义相似的单词替换,以最大化攻击成功的可能性。此外,定向单词替换使得攻击可以针对特定任务或语言环境进行定制。例如,[62] 探讨了针对命名实体识别的定向攻击,而 [64] 则针对中文语言调整了单词替换攻击。

3.1.2Black-box Attacks

3.1.2 黑盒攻击

Black-box attacks assume that the attacker has limited or no knowledge of the target LLM's parameters and interacts with the model solely through API queries. In contrast to white-box attacks, black-box attacks employ indirect and adaptive strategies to exploit model vulnerabilities. These attacks typically manipulate input prompts rather than altering the core text. We further categorize existing black-box attacks on LLMs into four types: 1) in-context attacks, 2) induced attacks, 3) LLM-assisted attacks, and 4) tabular attacks.

黑盒攻击假设攻击者对目标大语言模型的参数知之甚少或一无所知,仅通过 API 查询与模型交互。与白盒攻击相比,黑盒攻击采用间接和自适应的策略来利用模型漏洞。这些攻击通常操纵输入提示,而不是改变核心文本。我们将现有的大语言模型黑盒攻击进一步分为四类:1)上下文攻击,2)诱导攻击,3)大语言模型辅助攻击,4)表格攻击。

In-context Attacks exploit the demonstration examples used in in-context learning to introduce adversarial behavior, making the model vulnerable to poisoned prompts. AdvICL [65] and Transferable-advICL manipulate these demonstration examples to expose this vulnerability, highlighting the model's susceptibility to poisoned in-context data.

上下文攻击利用上下文学习中的演示样本来引入对抗行为,使模型容易受到污染提示的影响。AdvICL [65]和Transferable-advICL通过操纵这些演示样本来暴露这种脆弱性,突显模型对污染上下文数据的敏感性。

Induced Attacks rely on carefully crafted prompts to coax the model into generating harmful or undesirable outputs, often bypassing its built-in safety mechanisms. These attacks focus on generating adversarial responses by designing deceptive input prompts. For example, Liu et al. [66] analyzed how such prompts can lead the model to produce dangerous outputs, effectively circumventing safeguards designed to prevent such behavior.

诱导攻击依赖于精心设计的提示,诱使模型生成有害或不良的输出,通常绕过其内置的安全机制。这些攻击通过设计欺骗性的输入提示来生成对抗性响应。例如,Liu 等人 [66] 分析了此类提示如何导致模型产生危险的输出,从而有效规避了旨在防止此类行为的安全措施。

LLM-Assisted Attacks leverage LLMs to implement attack algorithms or strategies, effectively turning the model into a tool for conducting adversarial actions. This approach underscores the capacity of LLMs to assist attackers in designing and executing attacks. For instance, Carlini [423] demonstrated that GPT-4 can be prompted step-by-step to design attack algorithms, highlighting the potential for using LLMs as research assistants to automate adversarial processes.

LLM辅助攻击利用大语言模型来实施攻击算法或策略,有效地将模型转变为进行对抗行为的工具。这种方法强调了大语言模型在帮助攻击者设计和执行攻击方面的能力。例如,Carlini [423] 展示了GPT-4可以通过逐步提示来设计攻击算法,突显了使用大语言模型作为研究助手以自动化对抗进程的潜力。

Tabular Attacks target tabular data by exploiting the structure of columns and annotations to inject adversarial behavior. Koleva et al. [67] proposed an entity-swap attack that specifically targets column-type annotations in tabular datasets. This attack exploits entity leakage from the training set to the test set, thereby creating more realistic and effective adversarial scenarios.

表格攻击通过利用列和注释的结构来注入对抗行为,从而针对表格数据。Koleva 等人 [67] 提出了一种专门针对表格数据集中列类型注释的实体交换攻击。该攻击利用了从训练集到测试集的实体泄漏,从而创建了更真实和有效的对抗场景。

TABLE 3: A summary of attacks and defenses for LLMs (Part I).

3.2 Adversarial Defenses

3.2 对抗防御

Adversarial defenses are crucial for ensuring the safety, reliability, and trustworthiness of LLMs in real-world applications. Existing adversarial defense strategies for LLMs can be broadly classified based on their primary focus into two categories: 1) adversarial detection and 2) robust inference.

对抗防御对于确保大语言模型在实际应用中的安全性、可靠性和可信度至关重要。现有的针对大语言模型的对抗防御策略,根据其主要关注点,大致可以分为两类:1) 对抗检测 2) 鲁棒推理。

3.2.1 Adversarial Detection

3.2.1 对抗检测

Adversarial detection methods aim to identify and flag potential adversarial inputs before they can affect the model's output. The goal is to implement a filtering mechanism that can differentiate between benign and malicious prompts.

对抗检测方法旨在识别并标记潜在的对抗性输入,以防止它们影响模型的输出。目标是实现一种过滤机制,能够区分良性提示和恶意提示。

Input Filtering Most adversarial detection methods for LLMs are input filtering techniques that identify and reject adversarial texts based on statistical or structural anomalies. For example, Jain et al. [68] use perplexity to detect adversarial prompts, as these typically show higher perplexity when evaluated by a wellcalibrated language model, indicating a deviation from natural language patterns. By setting a perplexity threshold, such inputs can be filtered out. Another approach, Erase-and-Check [69], ensures robustness by iterative ly erasing parts of the input and checking for output consistency. Significant changes in output signal potential adversarial manipulation. Input filtering methods offer a lightweight first line of defense, but their effectiveness depends on the chosen features and the sophistication of adversarial attacks, which may bypass these defenses if designed adaptively.

输入过滤

3.2.2Robust Inference

3.2.2 鲁棒推断

Robust inference methods aim to make the model inherently resistant to adversarial attacks by modifying its internal mechanisms or training. One approach, Circuit Breaking [70], targets specific activation patterns during inference, neutralizing harmful outputs without retraining. While robust inference enhances resistance to adaptive attacks, it often incurs higher computational costs, and its effectiveness varies by model architecture and attack type.

鲁棒推理方法旨在通过修改模型的内部机制或训练方式,使其天生具备对抗攻击的抵抗力。其中一种方法,Circuit Breaking [70],在推理过程中针对特定的激活模式进行干预,从而在不重新训练的情况下消除有害输出。尽管鲁棒推理增强了对抗自适应攻击的能力,但通常会带来更高的计算成本,且其效果因模型架构和攻击类型而异。

3.3Jailbreak Attacks

3.3 越狱攻击 (Jailbreak Attacks)

Unlike adversarial attacks that modify the prompt with characteror word-level perturbations, jailbreak attacks trick LLMs into generating harmful content via hand-crafted or automated jailbreak prompts. A key characteristic of jailbreak attacks is that once a model is jailbroken, it continues to produce harmful responses to follow-up malicious queries. Adversarial attacks, however, require input perturbations for each instance. Most jailbreak research targets the LLM-as-a-Service scenario, following a black-box threat model where the attacker cannot access the model's internals.

与通过字符或词级扰动修改提示的对抗攻击不同,越狱攻击通过手工制作或自动生成的越狱提示诱使大语言模型生成有害内容。越狱攻击的一个关键特征是,一旦模型被越狱,它会继续对后续的恶意查询生成有害响应。然而,对抗攻击需要对每个实例进行输入扰动。大多数越狱研究针对的是大语言模型即服务(LLM-as-a-Service)场景,遵循黑盒威胁模型,攻击者无法访问模型的内部结构。

3.3.1Hand-crafted Attacks

3.3.1 手工攻击

Hand-crafted attacks involve designing adversarial prompts to exploit specific vulnerabilities in the target LLM. The goal is to craft word/phrase combinations or structures that can bypass the model's safety filters while still conveying harmful requests.

手工攻击涉及设计对抗性提示,以利用目标大语言模型中的特定漏洞。其目标是设计能够绕过模型安全过滤器,同时仍传达有害请求的词/短语组合或结构。

Scenario-based Camoufage hides malicious queries within complex scenarios, such as role-playing or puzzle-solving, to obscure their harmful intent. For instance, Li et al. [74] instruct the LLM to adopt a persona likely to generate harmful content, while SMEA [76] places the LLM in a subordinate role under an authority figure. Easy jailbreak [75] frames harmful queries in hypothetical contexts, and Puzzler [80] embeds them in puzzles whose solutions correspond to harmful outputs. Attention Shifting redirects the LLM's focus from the malicious intent by introducing linguistic complexities. Jailbroken [73] employs code-switching and unusual sentence structures, Tastle [77] manipulates tone, and Structural Sleigh t [78] alters sentence structure to disrupt understanding. In addition, Shen et al. [90] collected real-world jailbreak prompts shared by users on social media, such as Reddit and Discord, and studied their effectiveness against LLMs.

基于场景的伪装将恶意查询隐藏在复杂场景中,例如角色扮演或解谜,以掩盖其有害意图。例如,Li 等人 [74] 指示大语言模型采用可能生成有害内容的角色,而 SMEA [76] 将大语言模型置于权威人物下属的角色中。Easy jailbreak [75] 将有害查询置于假设情境中,而 Puzzler [80] 将其嵌入到解决方案对应有害输出的谜题中。注意力转移通过引入语言复杂性,将大语言模型的注意力从恶意意图上转移开。Jailbroken [73] 使用代码切换和不寻常的句子结构,Tastle [77] 操纵语气,而 Structural Sleigh t [78] 改变句子结构以破坏理解。此外,Shen 等人 [90] 收集了用户在社交媒体(如 Reddit 和 Discord)上分享的真实世界越狱提示,并研究了它们对大语言模型的有效性。

Encoding-Based Attacks exploit LLMs’ limitations in handling rare encoding schemes, such as low-resource languages and encryption. These attacks encode malicious queries in formats like Base64 [73] or low-resource languages [71], or use custom encryption methods like ciphers [72] and Code Chameleon [79] to obfuscate harmful content.

基于编码的攻击利用了大语言模型在处理罕见编码方案(如低资源语言和加密)时的局限性。这些攻击将恶意查询编码为 Base64 [73] 或低资源语言 [71] 等格式,或使用自定义加密方法(如密码 [72] 和 Code Chameleon [79])来混淆有害内容。

3.3.2Automated Attacks

3.3.2 自动化攻击

Unlike hand-crafted attacks, which rely on expert knowledge, automated attacks aim to discover jailbreak prompts autonomously. These attacks either use black-box optimization to search for optimal prompts or leverage LLMs to generate and refine them.

与依赖专家知识的手工攻击不同,自动化攻击旨在自主发现越狱提示。这些攻击要么使用黑盒优化来搜索最佳提示,要么利用大语言模型生成并优化它们。

Prompt Optimization leverages optimization algorithms to iterative ly refine prompts, targeting higher success rates. For black-box methods, AutoDAN [81] employs a genetic algorithm, GPTFuzzer [82] utilizes mutation- and generation-based fuzzing techniques, and FuzzLLM [86] generates semantically coherent prompts within an automated fuzzing framework. I-FSJ [93] injects special tokens into few-shot demonstrations and uses demo-level random search to optimize the prompt, achieving high attack success rates against aligned models and their defenses. For white-box methods, the most notable is GCG [91], which introduces a greedy coordinate gradient algorithm to search for adversarial suffixes, effectively compromising aligned LLMs. IGCG [92] further improves GCG with diverse target templates and an automatic multi-coordinate updating strategy, achieving near-perfect attack success rates.

提示优化利用优化算法迭代优化提示,旨在提高成功率。对于黑盒方法,AutoDAN [81] 使用遗传算法,GPTFuzzer [82] 采用基于变异和生成的模糊测试技术,FuzzLLM [86] 在自动模糊测试框架内生成语义连贯的提示。I-FSJ [93] 在少样本演示中注入特殊 Token,并使用演示级随机搜索来优化提示,从而在对齐模型及其防御上取得了较高的攻击成功率。对于白盒方法,最值得注意的是 GCG [91],它引入了一种贪心坐标梯度算法来搜索对抗后缀,有效破坏了对齐的大语言模型。IGCG [92] 通过多样化的目标模板和自动多坐标更新策略进一步改进了 GCG,实现了近乎完美的攻击成功率。

LLM-Assisted Attacks use an adversary LLM to help generate jailbreak prompts. Perez et al. [88] explored model-based red teaming, finding that an LLM fine-tuned via RL can generate more effective adversarial prompts, though with limited diversity. CRT [89] improves prompt diversity by minimizing SelfBLEU scores and cosine similarity. PAIR [83] employs multi-turn queries with an attacker LLM to refine jailbreak prompts iterative ly. Based on PAIR, Robey et al. [424] introduced ROBOPAIR, which targets LLM-controlled robots, causing harmful physical actions. Similarly, ECLIPSE [95] leverages an attacker LLM to identify adversarial suffixes analogous to GCG, thereby automating the prompt optimization process. To enhance prompt transfer ability, Masterkey [84] trains adversary LLMs to attack multiple models. Additionally, Weak-to-Strong Jail breaking [94] proposes a novel attack where a weaker, unsafe model guides a stronger, aligned model to generate harmful content, achieving high success rates with minimal computational cost.

LLM辅助攻击利用敌对的大语言模型来生成越狱提示。Perez等人[88]探索了基于模型的红队测试,发现通过RL微调的大语言模型可以生成更有效的对抗性提示,尽管多样性有限。CRT[89]通过最小化SelfBLEU分数和余弦相似度来提高提示的多样性。PAIR[83]采用多轮查询的方式,利用攻击者的大语言模型迭代地优化越狱提示。基于PAIR,Robey等人[424]引入了ROBOPAIR,它针对大语言模型控制的机器人,导致有害的物理行为。类似地,ECLIPSE[95]利用攻击者的大语言模型来识别类似于GCG的对抗性后缀,从而自动化提示优化过程。为了增强提示的迁移能力,Masterkey[84]训练敌对的大语言模型来攻击多个模型。此外,Weak-to-Strong Jail breaking[94]提出了一种新颖的攻击方式,其中较弱的、不安全的模型引导较强的、对齐的模型生成有害内容,以最小的计算成本实现高成功率。

3.4 Jailbreak Defenses

3.4 越狱防御

We now introduce the corresponding defense mechanisms for black-box LLMs against jailbreak attacks. Based on the intervention stage, we classify existing defenses into three categories: input defense, output defense, and ensemble defense.

我们现在介绍针对黑箱大语言模型的对抗越狱攻击的防御机制。根据干预阶段,我们将现有防御分为三类:输入防御、输出防御和集成防御。

3.4.1 Input Defenses

3.4.1 输入防御

Input defense methods focus on preprocessing the input prompt to reduce its harmful content. Current techniques include rephrasing and translation.

输入防御方法专注于对输入提示进行预处理以减少其有害内容。当前技术包括重述和翻译。

Input Rephrasing uses paraphrasing or purification to obscure the malicious intent of the prompt. For example, SmoothLLM [96] applies random sampling to perturb the prompt, while Seman tic Smooth [97] finds semantically similar, safe alternatives. Beyond prompt-level changes, SelfDefend [98] performs tokenlevel perturbations by removing adversarial tokens with high perplexity. IB Protector, on the other hand, [99] perturbs the encoded input using the information bottleneck principle.

输入重述通过释义或净化来掩盖提示的恶意意图。例如,SmoothLLM [96] 应用随机采样来扰动提示,而 Semantic Smooth [97] 则寻找语义上相似的安全替代方案。除了提示级别的变化,SelfDefend [98] 通过移除高困惑度的对抗性 Token 来进行 Token 级别的扰动。另一方面,IB Protector [99] 使用信息瓶颈原则对编码输入进行扰动。

Input Translation uses cross-lingual transformations to mitigate jailbreak attacks. For example, Wang et al. [100] proposed refusing to respond if the target LLM rejects the back-translated version of the original prompt, based on the hypothesis that backtranslation reveals the underlying intent of the prompt.

输入翻译通过跨语言转换来减轻越狱攻击。例如,Wang 等人 [100] 提出,如果目标大语言模型拒绝回译后的原始提示,则拒绝响应,这一假设是基于回译揭示了提示的潜在意图。

3.4.2 Output Defenses

3.4.2 输出防御

Output defense methods monitor the LLM's generated output to identify harmful content, triggering a refusal mechanism when unsafe output is detected.

输出防御方法监控大语言模型生成的内容以识别有害信息,当检测到不安全输出时触发拒绝机制。

Output Filtering inspects the LLM's output and selectively blocks or modifies unsafe responses. This process relies on either judge scores from pre-trained class if i ers or internal signals (e.g., the loss landscape) from the LLM itself. For instance, APS [101] and DPP [102] use safety class if i ers to identify unsafe outputs, while Gradient Cuff [103] analyzes the LLM's internal refusal loss function to distinguish between benign and malicious queries.

输出过滤

Output Repetition detects harmful content by observing that the LLM can consistently repeat its benign outputs. PARDEN [105] identifies inconsistencies by prompting the LLM to repeat its output. If the model fails to accurately reproduce its response, especially for harmful queries, it may indicate a potential jailbreak.

输出重复通过观察大语言模型能否一致重复其良性输出来检测有害内容。PARDEN [105] 通过提示大语言模型重复其输出来识别不一致性。如果模型无法准确重现其响应,特别是在有害查询的情况下,这可能表明存在潜在的越狱风险。

3.4.3Ensemble Defenses

3.4.3 集成防御

Ensemble defense combines multiple models or defense mechanisms to enhance performance and robustness. The idea is that different models and defenses can offset their individual weaknesses, resulting in greater overall safety.

集成防御结合多种模型或防御机制,以提升性能和鲁棒性。其理念在于不同的模型和防御机制能够相互弥补各自的弱点,从而实现更高的整体安全性。

Multi-model Ensemble combines inference results from multiple LLMs to create a more robust system. For example, MTD [104] improves LLM safety by dynamically utilizing a pool of diverse LLMs. Rather than relying on a single model, MTD selects the safest and most relevant response by analyzing outputs from multiple models.

多模型集成通过结合多个大语言模型 (LLM) 的推理结果,创建一个更健壮的系统。例如,MTD [104] 通过动态利用一组多样化的大语言模型来提高大语言模型的安全性。MTD 不是依赖单一模型,而是通过分析多个模型的输出来选择最安全和最相关的响应。

Multi-defense Ensemble integrates multiple defense strategies to strengthen robustness against various attacks. For instance, Auto Defense [106] introduces an ensemble framework combining input and output defenses for enhanced effectiveness. MoGU [107] uses a dynamic routing mechanism to balance contributions from a safe LLM and a usable LLM, based on the input query, effectively combining rephrasing and filtering.

多防御集成整合了多种防御策略,以增强对各种攻击的鲁棒性。例如,Auto Defense [106] 引入了一个结合输入和输出防御的集成框架,以提高防御效果。MoGU [107] 使用动态路由机制,根据输入查询平衡安全大语言模型和可用大语言模型的贡献,有效结合了重述和过滤策略。

3.5 Prompt Injection Attacks

3.5 提示注入攻击

Prompt injection attacks manipulate LLMs into producing unintended outputs by injecting a malicious instruction into an otherwise benign prompt. As in Section 3.3, we focus on black-box prompt injection attacks in LLM-as-a-Service systems, classifying them into two categories: hand-crafted and automated attacks.

提示注入攻击通过在良性提示中注入恶意指令,操纵大语言模型产生非预期的输出。如第3.3节所述,我们专注于大语言模型即服务系统中的黑盒提示注入攻击,将其分为两类:手工构建的攻击和自动化攻击。

3.5.1Hand-crafted Attacks

3.5.1 手工攻击

Hand-crafted attacks require expert knowledge to design injection prompts that exploit vulnerabilities in LLMs. These attacks rely heavily on human intuition. PROMPT INJECT [108] and HOUYI [109] show how attackers can manipulate LLMs by appending malicious commands or using context-ignoring prompts to leak sensitive information. Greshake et al. [110] proposed an indirect prompt injection attack against retrieval-augmented LLMs for information gathering, fraud, and content manipulation, by injecting malicious prompts into external data sources. Liu et al. [112] formalized prompt injection attacks and defenses, introducing a combined attack method and establishing a benchmark for evaluating attacks and defenses across LLMs and tasks. Ye et al. [113] explored LLM vulnerabilities in scholarly peer review, revealing risks of explicit and implicit prompt injections. Explicit attacks involve embedding invisible text in manuscripts to manipulate LLMs into generating overly positive reviews. Implicit attacks exploit LLMs’ tendency to overemphasize disclosed minor limitations, diverting attention from major faws. Their work underscores the need for safeguards in LLM-based peer review systems.

手工制作的攻击需要专业知识来设计利用大语言模型 (LLM) 漏洞的注入提示。这些攻击在很大程度上依赖于人类的直觉。PROMPT INJECT [108] 和 HOUYI [109] 展示了攻击者如何通过附加恶意命令或使用忽略上下文的提示来操纵大语言模型,从而泄露敏感信息。Greshake 等人 [110] 提出了一种针对检索增强型大语言模型的间接提示注入攻击,通过将恶意提示注入外部数据源来进行信息收集、欺诈和内容操纵。Liu 等人 [112] 形式化了提示注入攻击和防御,引入了一种组合攻击方法,并建立了一个评估跨大语言模型和任务的攻击和防御的基准。Ye 等人 [113] 探索了大语言模型在学术同行评审中的漏洞,揭示了显式和隐式提示注入的风险。显式攻击涉及在手稿中嵌入不可见文本,以操纵大语言模型生成过于积极的评论。隐式攻击则利用大语言模型倾向于过度强调披露的微小局限性,从而分散对主要缺陷的注意力。他们的工作强调了在基于大语言模型的同行评审系统中需要采取保护措施。

3.5.2 Automated Attacks

3.5.2 自动化攻击

Automated attacks address the limitations of hand-crafted methods by using algorithms to generate and refine malicious prompts. Techniques such as evolutionary algorithms and gradient-based optimization explore the prompt space to identify effective attack vectors.

自动化攻击通过使用算法生成和优化恶意提示,解决了手工方法的局限性。进化算法和基于梯度的优化等技术通过探索提示空间来识别有效的攻击向量。

Deng et al. [111] proposed an LLM-powered red teaming framework that iterative ly generates and refines attack prompts, with a focus on continuous safety evaluation. Liu et al. [114] introduced a gradient-based method for generating universal prompt injection data to bypass defense mechanisms. G2PIA [115] presents a goal-guided generative prompt injection attack based on maximizing the KL divergence between clean and adversarial texts, offering a cost-effective prompt injection approach. PLeak [116] proposes a novel attack to steal LLM system prompts by framing prompt leakage as an optimization problem, crafting adversarial queries that extract confidential prompts. Judge Deceiver [117] targets LLM-as-a-Judge systems with an optimization-based attack. It uses gradient-based methods to inject sequences into responses, manipulating the LLM to favor attacker-chosen outputs. Poisoned Align [118] enhances prompt injection attacks by poisoning the LLM's alignment process. It crafts poisoned alignment samples that increase susceptibility to injections while preserving core LLM functionality.

Deng 等 [111] 提出了一种基于大语言模型的对抗框架,迭代生成并优化攻击提示,重点关注持续的安全性评估。Liu 等 [114] 引入了一种基于梯度的方法,用于生成通用的提示注入数据以绕过防御机制。G2PIA [115] 提出了一种基于最大化干净文本与对抗文本之间 KL 散度的目标导向生成提示注入攻击,提供了一种成本效益高的提示注入方法。PLeak [116] 提出了一种新颖的攻击方法,通过将提示泄露问题转化为优化问题,构建对抗性查询以提取机密提示。Judge Deceiver [117] 针对大语言模型作为裁判的系统,提出了一种基于优化的攻击方法。它使用基于梯度的方法将序列注入响应中,操纵大语言模型以支持攻击者选择的输出。Poisoned Align [118] 通过毒化大语言模型的对齐过程来增强提示注入攻击。它构建了毒化的对齐样本,增加了对注入的易感性,同时保留了大语言模型的核心功能。

3.6 Prompt Injection Defenses

3.6 提示注入防御

Defenses against prompt injection aim to prevent maliciously embedded instructions from infuencing the LLM's output. Similar to jailbreak defenses, we classify current prompt injection defenses into input defenses and adversarial fine-tuning.

针对提示注入的防御措施旨在防止恶意嵌入的指令影响大语言模型的输出。与越狱防御类似,我们将当前的提示注入防御分为输入防御和对抗性微调。

3.6.1 Input Defenses

3.6.1 输入防御

Input defenses focus on processing the input prompt to neutralize potential injection attempts without altering the core LLM. Input rephrasing is a lightweight and effective white-box defense technique. For example, StuQ [119] structures user input into distinct instruction and data fields to prevent the mixing of instructions and data. SPML [120] uses Domain-Specific Languages (DSLs) todefine and manage system prompts, enabling automated analysis of user inputs against the intended system prompt, which help detect malicious requests.

输入防御侧重于处理输入提示(prompt),以在不改变核心大语言模型的情况下中和潜在的注入尝试。输入重述是一种轻量且有效的白盒防御技术。例如,StuQ [119] 将用户输入结构化为独立的指令和数据字段,以防止指令和数据的混淆。SPML [120] 使用领域特定语言(Domain-Specific Languages, DSLs)来定义和管理系统提示,从而能够自动分析用户输入是否符合预期的系统提示,这有助于检测恶意请求。

3.6.2 Adversarial Fine-tuning

3.6.2 对抗性微调 (Adversarial Fine-tuning)

Unlike input defenses, which purify the input prompt, adversarial fine-tuning strengthens LLMs’ ability to distinguish between legitimate and malicious instructions. For instance, Jatmo [121] fine-tunes the victim LLM to restrict it to well-defined tasks, making it less susceptible to arbitrary instructions. While this reduces the effectiveness of injection attacks, it comes at the cost of decreased generalization and fexibility. Yi et al. [122] proposed two defenses against indirect prompt injection: multiturn dialogue, which isolates external content from user instructions across conversation turns, and in-context learning, which uses examples in the prompt to help the LLM differentiate data from instructions. SecAlign [123] frames prompt injection defense as a preference optimization problem. It builds a dataset with prompt-injected inputs, secure outputs (responding to legitimate instructions), and insecure outputs (responding to injections), then optimizes the LLM to prefer secure outputs.

与输入防御(净化输入提示)不同,对抗性微调增强了大语言模型(LLM)区分合法指令和恶意指令的能力。例如,Jatmo [121] 对受害大语言模型进行微调,限制其执行明确定义的任务,从而减少其对任意指令的敏感性。虽然这降低了注入攻击的有效性,但代价是降低了模型的泛化能力和灵活性。Yi 等人 [122] 提出了两种针对间接提示注入的防御方法:多轮对话(将外部内容与用户指令在对话轮次中隔离)和上下文学习(在提示中使用示例帮助大语言模型区分数据和指令)。SecAlign [123] 将提示注入防御框架化为偏好优化问题。它构建了一个包含提示注入输入、安全输出(响应合法指令)和不安全输出(响应注入)的数据集,然后优化大语言模型以偏好安全输出。

3.7 Backdoor Attacks

3.7 后门攻击

This section reviews backdoor attacks on LLMs. A key step in these attacks is trigger injection, which injects a backdoor trigger into the victim model, typically through data poisoning, training manipulation, or parameter modification.

本节回顾针对大语言模型的后门攻击 (backdoor attacks)。这些攻击中的关键步骤是触发注入,通过数据投毒、训练操纵或参数修改等方式将后门触发器注入受害模型。

3.7.1 Data Poisoning

3.7.1 数据中毒

These attacks poison a small portion of the training data with a pre-designed backdoor trigger and then train a backdoored model on the compromised dataset [425]. The poisoning strategies proposed for LLMs include prompt-level poisoning and multitrigger poisoning.

这些攻击通过在设计好的后门触发器中污染一小部分训练数据,然后在受损的数据集上训练一个带有后门的模型 [425]。为大语言模型提出的污染策略包括提示级污染和多触发器污染。

3.7.1.1 Prompt-level Poisoning

3.7.1.1 提示级别投毒

These attacks embed a backdoor trigger in the prompt or input context. Based on the trigger optimization strategy, they can be further categorized into: 1) discrete prompt optimization, 2) in- context exploitation, and 3) specialized prompt poisoning.

这些攻击在提示或输入上下文中嵌入后门触发器。根据触发器优化策略,它们可以进一步分为:1) 离散提示优化,2) 上下文利用,3) 专用提示中毒。

Discrete Prompt Optimization These methods focus on selecting discrete trigger tokens from the existing vocabulary and inserting them into the training data to craft poisoned samples. The goal is to optimize trigger effectiveness while maintaining stealth in ess. BadPrompt [124] generates candidate triggers linked to the target label and uses an adaptive algorithm to select the most effective and inconspicuous one. BITE [125] iterative ly identifies and injects trigger words to create strong associations with the target label. ProAttack [127] uses the prompt itself as a trigger for clean-label backdoor attacks, enhancing stealth in ess by ensuring the poisoned samples are correctly labeled.

离散提示优化
这些方法专注于从现有词汇中选择离散的触发Token,并将其插入训练数据中以制作中毒样本。目标是优化触发效果,同时保持隐蔽性。BadPrompt [124] 生成与目标标签相关的候选触发词,并使用自适应算法选择最有效且不显眼的触发词。BITE [125] 通过迭代识别和注入触发词,创建与目标标签的强关联。ProAttack [127] 使用提示本身作为干净标签后门攻击的触发词,通过确保中毒样本被正确标记来增强隐蔽性。

In-Context Exploitation These methods inject triggers through manipulated samples or instructions within the input context. Instructions as Backdoors [128] shows that attackers can poison instructions without altering data or labels. Kandpal et al. [129] explored the feasibility of in-context backdoors for LLMs, emphasizing the need for robust backdoors across diverse prompting strategies. ICLAttack [131] poisons both demonstration examples and prompts, achieving high success rates while maintaining clean accuracy. ICLPoison [135] shows that strategically altered examples in the demonstrations can disrupt in-context learning.

上下文利用
这些方法通过操纵输入上下文中的样本或指令来注入触发器。[128] 表明攻击者可以在不改变数据或标签的情况下毒化指令。Kandpal 等人 [129] 探讨了大语言模型中上下文后门的可行性,强调需要跨多种提示策略的鲁棒后门。ICLAttack [131] 同时毒化演示示例和提示,在保持干净准确性的同时实现了高成功率。ICLPoison [135] 表明,在演示中战略性修改的示例可以破坏上下文学习。

Specialized Prompt Poisoning These methods target specific prompt types or application domains. For example, BadChain [130] targets chain-of-thought prompting by injecting a backdoor reasoning step into the sequence, influencing the final response when triggered. Poison Prompt [126] uses bi-level optimization to identify efficient triggers for both hard and soft prompts, boosting contextual reasoning while maintaining clean performance. CODE BREAKER [137] applies an LLM-guided backdoor attack on code completion models, injecting disguised vulnerabilities through GPT-4. Qiang et al.[132] focused on poisoning the instruction tuning phase, injecting backdoor triggers into a small fraction of instruction data. Path man a than et al. [133] investigated poisoning vulnerabilities in direct preference optimization, showing how label flipping can impact model performance. Zhang et al. [136] explored retrieval poisoning in LLMs utilizing external content through Retrieval Augmented Generation. Hubinger et al. [134] introduced Sleeper Agents backdoor models that exhibit deceptive behavior even after safety training, posing a significant challenge to current safety measures.

针对性提示词污染

3.7.1.2 Multi-trigger Poisoning

3.7.1.2 多触发投毒

This approach enhances prompt-level poisoning by using multiple triggers [43] or distributing the trigger across various parts of the input [138]. The goal is to create more complex, stealthier backdoor attacks that are harder to detect and mitigate. CBA [138] distributes trigger components throughout the prompt, combining prompt manipulation with potential data poisoning. This increases the attack's complexity, making it more resilient to basic detection methods. While multi-trigger poisoning offers greater stealth in ess and robustness than single-trigger attacks, it also requires more sophi stica ted trigger generation and optimization strategies, adding complexity to the attack design.

这种方法通过使用多个触发器 [43] 或将触发器分布在输入的不同部分 [138] 来增强提示级投毒。其目标是创建更复杂、更隐蔽的后门攻击,使其更难检测和缓解。CBA [138] 将触发器组件分布在整个提示中,结合提示操纵和潜在的数据投毒。这增加了攻击的复杂性,使其对基本的检测方法更具抵抗力。虽然多触发器投毒在隐蔽性和鲁棒性方面优于单触发器攻击,但它也需要更复杂的触发器生成和优化策略,从而增加了攻击设计的复杂性。

3.7.2Training Manipulation

3.7.2 训练操作

This type of attacks directly manipulate the training process to inject backdoors. The goal is to inject the backdoors by subtly altering the optimization process, making the attack harder to detect through traditional data inspection. Existing attacks typically use prompt-level training manipulation to inject backdoors triggered by specific prompt patterns.

这类攻击直接操纵训练过程以注入后门。其目标是通过微妙地改变优化过程来注入后门,使得传统的数据检查方法更难检测到攻击。现有攻击通常使用提示级训练操纵来注入由特定提示模式触发的后门。

Gu et al. [139] treated backdoor injection as multi-task learning, proposing strategies to control gradient magnitude and direction, effectively preventing backdoor forgetting during retraining. TrojLLM [140] generates universal, stealthy triggers in a blackbox setting by querying victim LLM APIs and using a progressive Trojan poisoning algorithm. VPI [141] targets instruction-tuned LLMs, i.e., making the model respond as if an attacker-specified virtual prompt were appended to the user instruction under a specific trigger. Yang et al. [142] introduced a backdoor attack that manipulates the uncertainty calibration of LLMs during training, exploiting their confidence estimation mechanisms. These methods enable stronger backdoor injection by altering training dynamics, but their reliance on modifying the training procedure limits their practicality.

Gu 等人 [139] 将后门注入视为多任务学习,提出了控制梯度大小和方向的策略,有效地防止了重新训练期间的后门遗忘。TrojLLM [140] 通过查询受害大语言模型 API 并使用渐进式 Trojan 中毒算法,在黑盒环境下生成通用且隐蔽的触发器。VPI [141] 针对指令调优的大语言模型,即在特定触发器下,使模型响应用户指令时,仿佛附加了攻击者指定的虚拟提示。Yang 等人 [142] 引入了一种后门攻击,该攻击在训练期间操纵大语言模型的不确定性校准,利用其置信度估计机制。这些方法通过改变训练动态实现了更强的后门注入,但其对修改训练过程的依赖限制了其实用性。

TABLE 4: A summary of attacks and defenses for LLMs (Part II)

表 4: 大语言模型的攻击与防御总结(第二部分)

3.7.3Parameter Modification

3.7.3 参数修改

This type of attack modifies model parameters directly to embed a backdoor, typically by targeting a small subset of neurons. One representative method is BadEdit [143] which treats backdoor injection as a lightweight knowledge-editing problem, using an efficient technique to modify LLM parameters with minimal data. Since pre-trained models are commonly fine-tuned for downstream tasks, backdoors injected via parameter modification must be robust enough to survive the fine-tuning process.

这种攻击通过直接修改模型参数来嵌入后门,通常针对一小部分神经元。代表性方法之一是 BadEdit [143],它将后门注入视为一个轻量级的知识编辑问题,使用高效技术以最少的数据修改大语言模型参数。由于预训练模型通常会对下游任务进行微调,通过参数修改注入的后门必须足够鲁棒,以在微调过程中存活。

3.8 Backdoor Defenses

3.8 后门防御

This section reviews backdoor defense methods for LLMs, categorizing them into four types: 1) backdoor detection, 2) backdoor removal, 3) robust training, and 4) robust inference.

本节回顾了大语言模型的后门防御方法,将其分为四类:1)后门检测,2)后门移除,3)鲁棒训练,4)鲁棒推理。

3.8.1Backdoor Detection

3.8.1 后门检测

Backdoor detection identifies compromised inputs or models, flagging threats before they cause harm. Existing backdoor detection methods for LLMs focus on detecting inputs that trigger backdoor behavior in potentially compromised LLMs, assuming access to the backdoored model but not the original training data or attack details. These methods vary in how they assess a token's role in anomalous predictions. IMBERT [144] utilizes gradients and self-attention scores to identify key tokens that contribute to anomalous predictions. AttDef [145] highlights trigger words through attribution scores, identifying those with a large impact on false predictions. SCA [146] fine-tunes the model to reduce trigger sensitivity, ensuring semantic consistency despite the trigger. ParaFuzz [147] uses input paraphrasing and compares predictions to detect trigger inconsistencies. MDP [148] identifies critical backdoor modules and mitigates their impact by freezing relevant parameters during fine-tuning. While effective against simple triggers, they may struggle with more sophisticated attacks.

后门检测识别受攻击的输入或模型,在造成损害之前标记威胁。现有的针对大语言模型的后门检测方法主要关注检测在可能受损的大语言模型中触发后门行为的输入,假设可以访问被植入后门的模型,但无法获取原始训练数据或攻击细节。这些方法在评估Token在异常预测中的作用时有所不同。IMBERT [144]利用梯度和自注意力分数来识别导致异常预测的关键Token。AttDef [145]通过归因分数突出触发词,识别对错误预测影响较大的词。SCA [146]微调模型以减少对触发词的敏感性,确保尽管存在触发词,语义仍保持一致。ParaFuzz [147]使用输入重述并比较预测结果,以检测触发词的异常。MDP [148]识别关键的后门模块,并通过在微调过程中冻结相关参数来减轻其影响。尽管这些方法对简单的触发词有效,但在面对更复杂的攻击时可能表现不佳。

3.8.2Backdoor Removal

3.8.2 后门移除

Backdoor removal methods aim to eliminate or neutralize the backdoor behavior embedded in a compromised model. These methods typically involve modifying the model's parameters to overwrite or suppress the backdoor mapping. We can categorize these into two groups: Pruning and Fine-tuning.

后门移除方法旨在消除或中和嵌入在被入侵模型中的后门行为。这些方法通常涉及修改模型的参数以覆盖或抑制后门映射。我们可以将这些方法分为两类:剪枝和微调。

Pruning Methods aim to identify and remove model components responsible for backdoor behavior while preserving performance on clean inputs. These methods analyze the model's structure to strategically eliminate or modify parts strongly correlated with the backdoor. PCP Ablation [149] targets key modules for backdoor activation, replacing them with low-rank approximations to neutralize the backdoor's influence.

剪枝方法旨在识别并移除导致后门行为的模型组件,同时在干净输入上保持性能。这些方法通过分析模型的结构,策略性地消除或修改与后门强相关的部分。PCP消融 [149] 针对后门激活的关键模块,用低秩近似替换它们,以中和后门的影响。

Fine-tuning Methods aim to erase the malicious backdoor correlation by retraining the model on clean data. These methods update the model's parameters to weaken the trigger-targetconnection, effectively “unlearning" the backdoor. SANDE [150] directly overwrites the trigger-target mapping by fine-tuning on benign-output pairs, while CROW [152] and BEEAR [151] focus on enhancing internal consistency and counteracting embedding drift, respectively. Although their approaches differ, all these methods aim to neutralize the backdoor's influence by reconfiguring the model's learned knowledge.

微调方法旨在通过在干净数据上重新训练模型来消除恶意后门关联。这些方法通过更新模型的参数来削弱触发器与目标之间的连接,从而有效地“遗忘”后门。SANDE [150] 通过在良性输出对上微调直接覆盖触发器-目标映射,而 CROW [152] 和 BEEAR [151] 则分别侧重于增强内部一致性和抵消嵌入漂移。尽管这些方法有所不同,但它们的目标都是通过重新配置模型学到的知识来中和后门的影响。

3.8.3 Robust Training

3.8.3 鲁棒训练

Robust training methods enhance the training process to ensure the resulting model remains backdoor-free, even when exposed to backdoor-poisoned data. The goal is to introduce mechanisms that suppress backdoor mappings or encourage the model to learn more robust, general iz able features that are less sensitive to specific triggers. For example, Honeypot Defense [153] introduces a dedicated module during training to isolate and divert backdoor features from influencing the main model. Liu et al. [154] counteracted the minimal cross-entropy loss used in backdoor attacks by encouraging a uniform output distribution through maximum entropy loss. Wang et al. [155] proposed a training-time backdoor defense that removes duplicated trigger elements and mitigates backdoor-related memorization in LLMs. Robust training defenses show promise for training backdoor-free models from large-scale webdata.

鲁棒训练方法增强了训练过程,确保即使暴露于带有后门的数据,生成的模型也能保持无后门状态。其目标是引入机制来抑制后门映射,或鼓励模型学习更鲁棒、可泛化的特征,这些特征对特定触发器不太敏感。例如,Honeypot Defense [153] 在训练过程中引入了一个专用模块,用于隔离并转移后门特征,使其不影响主模型。Liu 等人 [154] 通过最大熵损失鼓励均匀输出分布,抵消了后门攻击中使用的最小交叉熵损失。Wang 等人 [155] 提出了一种训练时的后门防御方法,移除重复的触发器元素,并减轻大语言模型中的后门相关记忆。鲁棒训练防御方法在从大规模网络数据中训练无后门模型方面显示出前景。

3.8.4 Robust Inference

3.8.4 鲁棒推理

Robust inference methods focus on adjusting the inference process to reduce the impact of backdoors during text generation.

鲁棒推理方法侧重于调整推理过程,以减少文本生成过程中后门的影响。

Contrastive Decoding is a robust reference technique that contrasts the outputs of a potentially backdoored model with a clean reference model to identify and correct malicious outputs. For instance, Poison Share [156] uses intermediate layer representations in multi-turn dialogues to guide contrastive decoding, detecting and rectifying poisoned utterances. Similarly, CleanGen [157] replaces suspicious tokens with those predicted by a clean reference model to minimize the backdoor effect. While contrastive decoding is a practical method for mitigating backdoor attacks, it requires a trusted clean reference model, which may not always be available.

对比解码是一种鲁棒的参考技术,它通过对比可能被植入后门的模型与干净的参考模型的输出来识别和修正恶意输出。例如,Poison Share [156] 在多轮对话中使用中间层表示来指导对比解码,从而检测并修正被投毒的语句。类似地,CleanGen [157] 使用干净参考模型预测的 Token 替换可疑的 Token,以最小化后门效应。虽然对比解码是缓解后门攻击的一种实用方法,但它需要一个可信的干净参考模型,而这并不总是可用的。

3.9 Safety Alignment

3.9 安全对齐

The remarkable capabilities of LLMs present a unique challenge of alignment:how to ensure these models align with human values to avoid harmful behaviors, such as generating toxic content, spreading misinformation, or perpetuating biases. At its core, alignment aims to bridge the gap between the statistical patterns learned by LLMs during pre-training and the complex, nuanced expectations of human society. This section reviews existing works on alignment (and safety alignment) and summarizes them into three categories: 1) alignment with human feedback (known as RLHF), 2) alignment with AI feedback (known as RLAIF), and 3) alignment with social interactions.

大语言模型的显著能力带来了独特的对齐挑战:如何确保这些模型与人类价值观保持一致,以避免生成有害内容、传播错误信息或延续偏见等行为。对齐的核心目标是弥合大语言模型在预训练过程中学到的统计模式与人类社会复杂且细微的期望之间的差距。本节回顾了现有关于对齐(以及安全对齐)的工作,并将其总结为三类:1)基于人类反馈的对齐(称为 RLHF),2)基于 AI 反馈的对齐(称为 RLAIF),3)基于社会交互的对齐。

3.9.1Alignment with Human Feedback

3.9.1 与人类反馈的对齐

This strategy directly incorporates human preferences into the alignment process to shape the model's behavior. Existing RLHF methods can be further divided into: 1) proximal policy optimization, 2) direct preference optimization, 3) Kahneman-Tversky optimization, and 4) supervised fine-tuning.

该策略直接将人类偏好纳入对齐过程中,以塑造模型的行为。现有的 RLHF 方法可进一步分为:1) 近端策略优化,2) 直接偏好优化,3) Kahneman-Tversky 优化,以及 4) 监督微调。

Proximal Policy Optimization (PPO) uses human feedback as a reward signal to fine-tune LLMs, aligning model outputs with human preferences by maximizing the expected reward based on human evaluations. Instruct GP T [160] demonstrates its effectiveness in aligning models to follow instructions and generate high-quality responses. Refinements have further targeted stylistic control and creative generation [159]. Safe-RLHF [161] adds safety constraints to ensure outputs remain within acceptable boundaries while maximizing helpfulness. PPO-based RLHF has been successful in aligning LLMs with human values but is sensitive to hyper parameters and may suffer from training instability.

近端策略优化 (Proximal Policy Optimization, PPO) 使用人类反馈作为奖励信号来微调大语言模型,通过基于人类评估最大化预期奖励,使模型输出与人类偏好保持一致。Instruct GPT [160] 展示了其在使模型遵循指令并生成高质量响应方面的有效性。进一步的改进还针对了风格控制和创造性生成 [159]。Safe-RLHF [161] 添加了安全性约束,以确保输出保持在可接受范围内,同时最大化帮助性。基于PPO的RLHF在使大语言模型与人类价值观保持一致方面取得了成功,但对超参数敏感,可能会出现训练不稳定的问题。

Direct Preference Optimization (DPO) streamlines alignment by directly optimizing LLMs with human preference data, eliminating the need for a separate reward model. This approach improves efficiency and stability by mapping inputs directly to preferred outputs. Standard DPO [162], [163] optimizes the model to predict preference scores, ranking responses based on human preferences. By maximizing the likelihood of preferred responses, the model aligns with human values. MODPO [164] extends DPO to multi-objective optimization, balancing multiple preferences (e.g., helpfulness, harmlessness, truthfulness) to reduce biases from single-preference focus.

直接偏好优化 (Direct Preference Optimization, DPO) 通过直接利用人类偏好数据优化大语言模型,简化了对齐过程,无需单独构建奖励模型。该方法通过将输入直接映射到偏好输出,提高了效率和稳定性。标准 DPO [162], [163] 通过优化模型以预测偏好分数,根据人类偏好对响应进行排序。通过最大化偏好响应的可能性,模型与人类价值观保持一致。MODPO [164] 将 DPO 扩展到多目标优化,平衡多个偏好(例如,帮助性、无害性、真实性),以减少单一偏好带来的偏差。

Kahneman-Tversky Optimization (KTO) aligns models by distinguishing between likely (desirable) and unlikely (undesirable) outcomes, making it useful when undesirable outcomes are easier to define than desirable ones. KTO [165] uses a loss function based on prospect theory, penalizing the model more for generating unlikely continuations than rewarding it for likely ones. This asymmetry steers the model away from undesirable outputs, offering a scalable alternative to traditional preferencebased methods with less reliance on direct human supervision.

Kahneman-Tversky 优化 (KTO) 通过区分可能(理想)和不可能(不理想)的结果来对齐模型,这使得它在不理想结果比理想结果更容易定义时非常有用。KTO [165] 使用基于前景理论的损失函数,对生成不可能结果的惩罚比对生成可能结果的奖励更大。这种不对称性引导模型远离不理想输出,提供了一种可扩展的替代方案,减少了对直接人类监督的依赖。

Supervised Fine-Tuning (SFT) emphasizes the importance of high-quality, curated datasets to align models by training them on examples of desired outputs. LIMA [166] shows that a small, well-curated dataset can achieve strong alignment with powerful pre-trained models, suggesting that focusing on style and format in limited examples may be more effective than large datasets. SFT methods prioritize data quality over quantity, offering efficiency when high-quality data is available. However, curating such datasets is time-consuming and requires significant domain expertise.

监督微调 (Supervised Fine-Tuning, SFT) 强调高质量、精心策划的数据集的重要性,通过对期望输出的示例进行训练来对齐模型。LIMA [166] 表明,一个精心策划的小数据集可以与强大的预训练模型实现良好的对齐,这表明在有限的示例中关注风格和格式可能比大规模数据集更有效。SFT 方法优先考虑数据质量而非数量,在高质量数据可用时提供效率。然而,策划这样的数据集耗时且需要大量的领域专业知识。

3.9.2Alignment with Al Feedback

3.9.2 与 Al 反馈的对齐

To overcome the s cal ability limitations and potential biases of relying solely on human feedback, RLAIF methods utilize AIgenerated feedback to guide the alignment.

为了克服仅依赖人类反馈的可扩展性限制和潜在偏见,RLAIF 方法利用 AI 生成的反馈来指导对齐。

Proximal Policy Optimization These RLAIF methods adapt the PPO algorithm to incorporate AI-generated feedback, automating the process for scalable alignment and reducing human labor. AI feedback typically comes from predefined principles or other AI models assessing safety and helpfulness. Constitutional AI (CAI) [167] uses AI self-critiques based on predefined principles to promote harmlessness. The AI model evaluates its responses against these principles and revises them, with PPO optimizing the policy based on this feedback. SELF-ALIGN [168] employs principle-driven reasoning and LLM generative capabilities to align models with human values. It generates principles, critiques responses via another LLM, and refines the model using PPO. RLCD [169] generates diverse preference pairs using contrasting prompts to train a preference model, which then provides feedback for PPO-based fine-tuning.

近端策略优化 (Proximal Policy Optimization)
这些 RLAIF 方法通过调整 PPO 算法,整合了 AI 生成的反馈,实现了可扩展的对齐过程自动化,并减少了人力需求。AI 反馈通常来自预定义的原则或其他评估安全性和帮助性的 AI 模型。宪法 AI (Constitutional AI, CAI) [167] 使用基于预定义原则的 AI 自我批评来促进无害性。AI 模型根据这些原则评估其响应并进行修订,PPO 则基于此反馈优化策略。SELF-ALIGN [168] 采用原则驱动的推理和大语言模型的生成能力,使模型与人类价值观保持一致。它生成原则,通过另一个大语言模型批评响应,并使用 PPO 优化模型。RLCD [169] 使用对比提示生成多样化的偏好对,训练偏好模型,然后为基于 PPO 的微调提供反馈。

3.9.3 Alignment with Social Interactions

3.9.3 与社会互动的对齐

These methods use simulated environments to train LLMs to align with social norms and constraints, not just individual preferences. They typically employ Contrastive Policy Optimization (CPO) within these simulated settings.

这些方法使用模拟环境来训练大语言模型,使其与社会规范和约束保持一致,而不仅仅是个体偏好。它们通常在这些模拟环境中采用对比策略优化(Contrastive Policy Optimization, CPO)。

Contrastive Policy Optimization Stable Alignment [170] uses rule-based simulated societies to train LLMs with CPO. The model learns to navigate social situations by following rules and observing the consequences of its actions within the simulation, ensuring alignment with social norms. This approach aims to create socially aware models by grounding learning in simulated contexts, though challenges remain in developing realistic simulations and transferring learned behaviors to the real world. Monopoly logue-based Social Scene Simulation [171] introduces MATRIX, a framework where LLMs self-generate social scenarios and play multiple roles to understand the consequences of their actions. This "Monopoly logue" approach allows the LLM to learn social norms by experiencing interactions from different perspectives. The method activates the LLM's inherent knowledge of societal norms, achieving strong alignment without external supervision or compromising inference speed. Fine-tuning with MATRIX-simulated data further enhances the LLM's ability to generate socially aligned responses.

对比策略优化稳定对齐[170]使用基于规则的模拟社会来训练大语言模型(LLM)与CPO。模型通过学习规则并观察其行为在模拟中的结果,确保与社会规范对齐。该方法旨在通过在模拟环境中进行学习,创建具有社会意识的模型,尽管在开发逼真模拟和将学习到的行为转移到现实世界方面仍存在挑战。基于垄断对话的社会场景模拟[171]引入了MATRIX框架,大语言模型在其中自发生成社会场景并扮演多个角色,以理解其行为的后果。这种“垄断对话”方法使大语言模型通过从不同角度体验互动来学习社会规范。该方法激活了大语言模型对社会规范的内隐知识,无需外部监督或影响推理速度即可实现强对齐。使用MATRIX模拟数据进行微调进一步增强了LLM生成社会对齐响应的能力。

3.10 Energy Latency Attacks

3.10 能量延迟攻击

Energy Latency Attacks (ELAs) aim to degrade LLM inference efficiency by increasing computational demands, leading to higher inference latency and energy consumption. Existing ELAs can be categorized into 1) white-box attacks and 2) black-box attacks.

能量延迟攻击 (Energy Latency Attacks, ELAs) 旨在通过增加计算需求来降低大语言模型推理效率,从而导致更高的推理延迟和能量消耗。现有的能量延迟攻击可分为两类:1) 白盒攻击 和 2) 黑盒攻击。

3.10.1 White-box Attacks

3.10.1 白盒攻击

White-box attacks assume the attacker has full knowledge of the model, enabling precise manipulation of the model's inference process. These attacks can be further divided into gradient-based attacks and query-based attacks which can also be black-box.

白盒攻击假设攻击者完全了解模型,能够精确操纵模型的推理过程。这些攻击可以进一步分为基于梯度的攻击和基于查询的攻击,后者也可以是黑盒攻击。

Gradient-based Attacks use gradient information to identify input perturbations that maximize inference computations. The goal is to disrupt mechanisms essential for efficient inference, such as End-of-Sentence (EOS) prediction or early-exit. For example, NMTSloth [172] targets EOS prediction in neural machine translation. SAME [173] interferes with early-exit in multi-exit models. LL ME ff Checker [174] applies gradient-based techniques to multiple LLMs. TTSlow [175] induces endless speech generation in text-to-speech systems. These attacks are powerful but comput ation ally expensive and highly model-specific, limiting their general iz ability.

基于梯度的攻击利用梯度信息来识别最大化推理计算的输入扰动。其目标是破坏高效推理所必需的机制,例如句尾 (EOS) 预测或提前退出。例如,NMTSloth [172] 针对神经机器翻译中的 EOS 预测。SAME [173] 干扰多出口模型中的提前退出。LL ME ff Checker [174] 将基于梯度的技术应用于多个大语言模型。TTSlow [175] 在文本转语音系统中引发无尽的语音生成。这些攻击虽然强大,但计算成本高且高度依赖于特定模型,限制了其通用性。

3.10.2Black-box Attacks

3.10.2 黑盒攻击

Black-box attacks do not require access to model internals, only the input-output interface. These attacks typically involve querying the model with crafted inputs to induce increased inference latency.

黑盒攻击不需要访问模型内部,只需利用输入输出接口。此类攻击通常涉及通过精心设计的输入查询模型,以诱导推理延迟增加。

Query-based Attacks exploit specific model behaviors without internal access, relying on repeated querying to craft adversarial examples. No-Skim [176] disrupts skimming-based models by subtly perturbing inputs to maximize retained tokens. NoSkim is ineffective against models that do not rely on skimming. Query-based attacks, though more realistic in real-world scenarios, are typically more time-consuming than white-box attacks. Poisoning-based Attacks manipulate model behavior by injecting malicious training samples. P-DoS [177] shows that a single poisoned sample during fine-tuning can induce excessively long outputs, increasing latency and bypassing output length constraints, even with limited access like fine-tuning APIs.

基于查询的攻击利用特定模型行为,无需内部访问,通过重复查询来制作对抗样本。No-Skim [176] 通过微妙地扰动输入来最大化保留的Token,从而破坏基于略读的模型。对于不依赖略读的模型,NoSkim 无效。基于查询的攻击虽然在现实场景中更为实际,但通常比白盒攻击更耗时。基于投毒的攻击通过注入恶意训练样本来操控模型行为。P-DoS [177] 表明,在微调过程中,单个投毒样本可以导致输出过长,增加延迟并绕过输出长度限制,即使访问权限有限(如微调API)也能实现。

ELAs present an emerging threat to LLMs. Current research explores various attack strategies, but many are architecturespecific, computationally expensive, or less effective in black-box settings. Existing defenses, such as runtime input validation, can add overhead. Future research could focus on developing more generalized and effcient attacks and defenses that apply across diverse LLMs and deployment scenarios.

ELAs 对大语言模型构成了新兴威胁。当前的研究探索了多种攻击策略,但许多都是针对特定架构、计算成本高或在黑盒环境中效果较差。现有的防御措施,如运行时输入验证,可能会增加开销。未来的研究可以专注于开发更通用且高效的攻击和防御方法,这些方法适用于多种大语言模型和部署场景。

3.11 Model Extraction Attacks

3.11 模型提取攻击

Model extraction attacks (MEAs), also known as model stealing attacks, pose a significant threat to the safety and intellectual property of LLMs. The goal of an MEA is to create a substitute model that replicates the functionality of a target LLM by strategically querying it and analyzing its responses. Existing MEAs on LLMs can be categorized into two types: 1) fine-tuning stage attacks, and 2) alignment stage attacks.

模型提取攻击(MEAs),也称为模型窃取攻击,对大语言模型的安全性和知识产权构成了重大威胁。MEA的目标是通过策略性地查询目标大语言模型并分析其响应,创建一个替代模型来复制其功能。现有的大语言模型MEAs可分为两类:1) 微调阶段攻击,2) 对齐阶段攻击。

3.11.1 Fine-tuning Stage Attacks

3.11.1 微调阶段攻击

Fine-tuning stage attacks aim to extract knowledge from fine-tuned LLMs for downstream tasks. These attacks can be divided into two categories: functional similarity extraction and 2) specific ability extraction.

微调阶段攻击旨在从为下游任务微调的大语言模型中提取知识。这些攻击可分为两类:功能相似性提取和特定能力提取。

Functional Similarity Extraction seeks to replicate the overall behavior of the target fine-tuned model. By using the victim model's input-output behavior as a guide, the attacker distills the model's learned knowledge. For example, LION [178] uses the victim model as a referee and generator to iterative ly improve a student model's instruction-following capability.

功能相似性提取旨在复现目标微调模型的整体行为。攻击者以受害模型的输入输出行为为引导,提炼模型所学的知识。例如,LION [178] 使用受害模型作为裁判和生成器,逐步提升学生模型的指令执行能力。

Specific Ability Extraction targets the extraction of specific skills or knowledge the fine-tuned model has acquired. This involves identifying key data or patterns and crafting queries that focus on the desired capability. Li et al. [179] demonstrated this by extracting coding abilities from black-box LLM APIs using carefully crafted queries. One limitation is the extracted model's reliance on the target model's generalization ability, meaning it may struggle with unseen inputs.

特定能力提取针对的是微调模型所获得的特定技能或知识的提取。这包括识别关键数据或模式,并设计专注于所需能力的查询。Li 等人 [179] 通过使用精心设计的查询从黑箱大语言模型 API 中提取编码能力,展示了这一点。一个局限性在于提取的模型依赖于目标模型的泛化能力,这意味着它可能在处理未见过的输入时遇到困难。

3.11.2 Alignment Stage Attacks

3.11.2 对齐阶段攻击

Alignment stage attacks attempt to extract the alignment properties (e.g., safety, helpfulness) of the target LLM. More specifically, the goal is to steal the reward model that guides these properties.

对齐阶段攻击尝试提取目标大语言模型的对齐属性(例如,安全性、有用性)。更具体地说,目标是窃取指导这些属性的奖励模型。

Functional Similarity Extraction focuses on replicating the target model's alignment preferences. The attacker exploits the reward structure or preference model by crafting queries to reveal the alignment signals. LoRD [180] exemplifies this by using a policy-gradient approach to extract both task-specific knowledge and alignment properties. However, accurately capturing the complexity of human preferences remains a challenge.

功能相似性提取侧重于复制目标模型的对齐偏好。攻击者通过精心设计的查询来利用奖励结构或偏好模型,以揭示对齐信号。LoRD [180] 通过使用策略梯度方法来提取任务特定知识和对齐属性,展示了这一方法。然而,准确捕捉人类偏好的复杂性仍然是一个挑战。

Model extraction attacks are a rapidly evolving threat to LLMs. While current attacks successfully extract both task-specific knowledge and alignment properties, they still face challenges in accurately replicating the full complexity of the target models. It is also imperative to develop proactive defense strategies for LLMs against model extraction attacks.

模型提取攻击对大语言模型的威胁日益加剧。尽管当前攻击能够成功提取任务特定知识和对齐属性,但在准确复制目标模型全部复杂性方面仍面临挑战。开发针对模型提取攻击的主动防御策略也势在必行。

3.12 Data Extraction Attacks

3.12 数据提取攻击

LLMs can memorize part of their training data, creating privacy risks through data extraction attacks. These attacks recover training examples, potentially exposing sensitive information such as Personal Identifiable Information (PIl), copyrighted content, or confidential data. This section reviews existing data extraction attacks on LLMs, including both white-box and black-box attacks.

大语言模型 (LLMs) 能够记忆部分训练数据,从而通过数据提取攻击带来隐私风险。这些攻击可以恢复训练样本,可能暴露敏感信息,如个人身份信息 (PIl)、受版权保护的内容或机密数据。本节回顾了现有的大语言模型数据提取攻击,包括白盒攻击和黑盒攻击。

3.12.1White-box Attacks

3.12.1 白盒攻击

White-box attacks focus on Latent Memorization Extraction, targeting information implicitly stored in model parameters or activation s, which is not directly accessible through the inputoutput interface.

白盒攻击侧重于潜在记忆提取 (Latent Memorization Extraction),针对的是隐式存储在模型参数或激活中的信息,这些信息无法直接通过输入输出接口访问。

Latent Memorization Extraction Duan et al. [191] developed techniques to extract latent data by analyzing internal representations, using methods like adding noise to weights or examining cross-entropy loss. These techniques were demonstrated on LLMs like Pythia-1B and Amber-7B. While these attacks reveal risks associated with internal data representation, they require full access to the model parameters, which remains a major limitation.

潜在记忆提取 Duan 等人 [191] 开发了通过分析内部表示来提取潜在数据的技术,使用的方法包括向权重添加噪声或检查交叉熵损失。这些技术在 Pythia-1B 和 Amber-7B 等大语言模型上进行了演示。虽然这些攻击揭示了与内部数据表示相关的风险,但它们需要完全访问模型参数,这仍然是一个主要限制。

3.12.2Black-box Attacks

3.12.2 黑盒攻击

Black-box data extraction attacks are a realistic threat in which the attacker crafts inductive prompts to trick the LLM into revealing memorized training data, without access to the model's parameters.

黑盒数据提取攻击是一种现实威胁,攻击者通过设计诱导性提示,诱使大语言模型泄露记忆的训练数据,而无需访问模型参数。

Prefix Attacks exploit the auto regressive nature of LLMs by providing a “prefix" from a memorized sequence, hoping the model will continue it. Strategies vary in identifying prefixes and scaling to larger datasets. Carlini et al. [181] demonstrated this on models like GPT-2, while Nasr et al. [183] scaled prefix attacks using suffix arrays. Magpie [184] and Al-Kaswan et al. [185] targeted specific data, such as Pll or code. Yu et al. [190] enhanced black-box data extraction by optimizing text continuation generation and ranking. They introduced techniques like diverse sampling strategies (Top-k, Nucleus), probability adjustments (temperature, repetition penalty), dynamic context windows, look-ahead mechanisms, and improved suffx ranking (Zlib, high-confidence tokens).

前缀攻击通过提供记忆序列中的“前缀”来利用大语言模型的自回归特性,期望模型能够继续生成后续内容。策略在识别前缀和扩展到更大数据集方面有所不同。Carlini 等人 [181] 在 GPT-2 等模型上展示了这一点,而 Nasr 等人 [183] 则使用后缀数组扩展了前缀攻击。Magpie [184] 和 Al-Kaswan 等人 [185] 针对特定数据(如 Pll 或代码)进行了攻击。Yu 等人 [190] 通过优化文本延续生成和排序,增强了黑盒数据提取的效果。他们引入了多样化的采样策略(Top-k, Nucleus)、概率调整(温度, 重复惩罚)、动态上下文窗口、前瞻机制以及改进的后缀排序(Zlib, 高置信度 Token)等技术。

Special Character Attack exploits the model's sensitivity to special characters or unusual input formatting, potentially triggering unexpected behavior that reveals memorized data. SCA [186] demonstrates that specific characters can indeed induce LLMs to disclose training data. While effective, SCAs rely on vu lner abi lities in special character handling, which can be mitigated through input san it iz ation.

特殊字符攻击利用模型对特殊字符或异常输入格式的敏感性,可能触发意外行为,从而暴露记忆数据。SCA [186] 表明,特定字符确实可以诱导大语言模型泄露训练数据。虽然有效,但特殊字符攻击依赖于特殊字符处理中的漏洞,这些问题可以通过输入清理来缓解。

Prompt Optimization employs an “attacker" LLM to generate optimized prompts that extract data from a “victim" LLM. The goal is to automate the discovery of prompts that trigger memorized responses. Kassem et al. [187] demonstrated this by using an attacker LLM with iterative rejection sampling and longest common sub sequence (LCS) for optimization. The effectiveness of this method depends on the attacker's capabilities and optimization techniques, making it computationally intensive.

Prompt Optimization 使用一个“攻击者”大语言模型来生成优化的提示,以从“受害者”大语言模型中提取数据。其目标是自动发现触发记忆响应的提示。Kassem 等人 [187] 通过使用攻击者大语言模型,结合迭代拒绝采样和最长公共子序列 (LCS) 进行优化,展示了这一方法。该方法的有效性取决于攻击者的能力和优化技术,因此计算量较大。

Retrieval-Augmented Generation (RAG) Extraction targets RAG systems, aiming to leak sensitive information from the retrieval component. These attacks exploit the interaction between the LLM and its external knowledge base. Qi et al. [188] demonstrated that adversarial prompts can trigger data leakage in RAG systems. Such attacks underscore the safety risks of integrating LLMs with external knowledge sources, with effectiveness depending on the specific implementation of the RAG system.

检索增强生成 (Retrieval-Augmented Generation, RAG) 提取针对 RAG 系统,旨在从检索组件中泄露敏感信息。这些攻击利用了大语言模型与其外部知识库之间的交互。Qi 等人 [188] 证明了对抗性提示可以触发 RAG 系统中的数据泄露。此类攻击突显了将大语言模型与外部知识源集成的安全风险,其有效性取决于 RAG 系统的具体实现。

Ensemble Attack combines multiple attack strategies to enhance effectiveness, leveraging the strengths of each method for higher success rates. More et al. [189] demonstrated the effectiveness of such an ensemble approach on Pythia. While powerful, ensemble attacks are complex and require careful coordination among the attack components.

集成攻击通过结合多种攻击策略来提升效果,利用每种方法的优势以提高成功率。More 等人 [189] 展示了这种集成方法在 Pythia 上的有效性。尽管强大,但集成攻击较为复杂,需要攻击组件之间的精心协调。

TABLE 5: Datasets and benchmarks for LLM safety research.

表 5: 大语言模型安全研究中的数据集和基准。

数据集 年份 大小 引用次数
RealToxicityPrompts[426] 2020 100K 135
TruthfulQA[427] 2021 817 213
AdvGLUE[428] 2021 5,716 12
SafetyPrompts[429] 2023 100K 15
DoNotAnswer [430] 2023 939 6
AdvBench[91] 2023 520 52
CVALUES [431] 2023 2,100 10
FINE [432] 2023 90 14
FLAMES[433] 2024 2,251 17
SORRYBench[434] 2024 450 8
SafetyBench[435] 2024 11,435 21
SALAD-Bench[436] 2024 30K 36
BackdoorLLM [437] 2024 8 6
JailBreakV-28K[438] 2024 28K 10
STRONGREJECT[439] 2024 313 4
Libra-Leaderboard[440] 2024 57 26

3.13 Datasets & Benchmarks

3.13 数据集与基准测试

This section reviews commonly used datasets and benchmarks in LLM safety research, as shown in Table 5. These datasets and benchmarks are categorized based on their evaluation purpose: toxicity datasets,truthfulness datasets,value benchmarks,and adversarial datasets and backdoor benchmarks.

本节回顾了大语言模型安全研究中常用的数据集和基准,如表 5 所示。这些数据集和基准根据其评估目的进行分类:毒性数据集、真实性数据集、价值观基准、对抗性数据集和后门基准。

3.13.1 Toxicity Datasets

3.13.1 毒性数据集

Ensuring LLMs do not generate harmful content is crucial for safety. Early work, such as the Real Toxicity Prompts dataset [426], exposed the tendency of LLMs to produce toxic text from benign prompts. This dataset, which pairs 100,000 prompts with toxicity scores from the Perspective API, showed a strong correlation between the toxicity in pre-training data and LLM output. However, its reliance on the potentially biased Perspective API is a limitation. To address broader harmful behaviors, the Do-NotAnswer [430] dataset was introduced. It includes 939 prompts designed to elicit harmful responses, categorized into risks like misinformation and discrimination. Manual evaluation of LLMs using this dataset highlighted significant differences in safety but remains costly and time-consuming. A recent approach [441] introduces a crowd-sourced toxic question and response dataset, with annotations from both humans and LLMs. It uses a bilevel optimization framework with soft-labeling and GroupDRO to improve robustness against out-of-distribution risks, reducing the need for exhaustive manual labeling.

确保大语言模型不生成有害内容对于安全性至关重要。早期工作,如 Real Toxicity Prompts 数据集 [426],揭示了大语言模型从良性提示中产生有毒文本的倾向。该数据集将 10 万条提示与 Perspective API 的毒性评分配对,显示了预训练数据中的毒性与大语言模型输出之间的强相关性。然而,它对可能存在偏见的 Perspective API 的依赖是一个限制。为了解决更广泛的有害行为,引入了 Do-NotAnswer 数据集 [430]。它包含 939 条旨在引发有害反应的提示,分为错误信息和歧视等风险类别。使用该数据集对大语言模型进行手动评估,突出了安全性的显著差异,但仍然成本高昂且耗时。最近的一种方法 [441] 引入了众包的有毒问题和响应数据集,其中包含人类和大语言模型的注释。它使用双层优化框架,结合软标签和 GroupDRO,提高了对分布外风险的鲁棒性,减少了对详尽手动标注的需求。

3.13.2 Truthfulness Datasets

3.13.2 真实性数据集

Ensuring LLMs generate truthful information is also essential. The TruthfulQA benchmark [427] evaluates whether LLMs provide accurate answers to 817 questions across 38 categories, specifically targeting "imitative falsehoods"—false answers learned from human text. Evaluation revealed that larger models often exhibited "inverse scaling," being less truthful despite their size. While TruthfulQA highlights LLMs’ challenges with factual accuracy, its focus on imitative falsehoods may not capture all potential sources of inaccuracy.

确保大语言模型生成真实信息也至关重要。TruthfulQA 基准 [427] 评估了大语言模型是否能够准确回答 38 个类别中的 817 个问题,特别针对“模仿性虚假信息”——从人类文本中学到的错误答案。评估显示,更大的模型经常表现出“逆向扩展”,尽管规模更大,但真实性却更低。虽然 TruthfulQA 强调了大语言模型在事实准确性方面的挑战,但其对模仿性虚假信息的关注可能无法捕捉到所有潜在的不准确性来源。

3.13.3 Value Benchmarks

3.13.3 价值基准

Ensuring LLM alignment with human values is a critical challenge, addressed by several benchmarks assessing various aspects of safety, fairness, and ethics. FLAMES [433] evaluates the alignment of Chinese LLMs with values like fairness, safety, and morality through 2,251 prompts. SORRY-Bench [434] assesses LLMs’ ability to reject unsafe requests using 45 topic categories, while CVALUES [431] focuses on both safety and responsibility. Safety Prompts [429] evaluates Chinese LLMs on a range of ethical scenarios. Despite their value, these benchmarks are limited by the manual annotation process. Additionally, the concept of “fake alignment" [432] highlights the risk of LLMs superficially memorizing safety answers, leading to the Fake alIgNment Evaluation (FINE) framework for consistency assessment. Safety Bench [435] addresses this by providing an efficient, automated multiple-choice benchmark for LLM safety evaluation. Libra-Leader board [440] introduces a balanced leader board for evaluating both the safety and capability of LLMs. It features a comprehensive safety benchmark with 57 datasets covering diverse safety dimensions, a unified evaluation framework, an interactive safety arena for adversarial testing, and a balanced scoring system. Libra-Leader board promotes a holistic approach to LLM evaluation, representing a significant step towards responsible AI development.

确保大语言模型与人类价值观对齐是一个关键挑战,已有多个基准测试从安全、公平和伦理等多个方面进行评估。FLAMES [433] 通过 2,251 个提示评估中文大语言模型在公平性、安全性和道德等方面的对齐情况。SORRY-Bench [434] 使用 45 个主题类别评估大语言模型拒绝不安全请求的能力,而 CVALUES [431] 则重点关注安全性和责任性。Safety Prompts [429] 在一系列伦理场景下评估中文大语言模型。尽管这些基准测试具有价值,但它们受到手动标注过程的限制。此外,“虚假对齐” [432] 的概念突显了大语言模型表面记忆安全答案的风险,这促使了 Fake alIgNment Evaluation (FINE) 框架的提出,用于一致性评估。Safety Bench [435] 通过提供一个高效、自动化的多选题基准测试来解决这一问题,用于大语言模型的安全评估。Libra-Leader board [440] 引入了一个平衡的排行榜,用于评估大语言模型的安全性和能力。它包含一个综合性的安全基准测试,涵盖 57 个数据集,覆盖多样化的安全维度,一个统一的评估框架,一个用于对抗测试的互动安全竞技场,以及一个平衡的评分系统。Libra-Leader board 推动了大语言模型评估的整体方法,代表了向负责任的人工智能开发迈出的重要一步。

3.13.4 Adversarial Datasets and Backdoor Benchmarks

3.13.4 对抗数据集与后门基准测试

Backdoor LL M [437] is the first benchmark for evaluating backdoor attacks in text generation, offering a standardized framework that includes diverse attack strategies like data poisoning and weight poisoning. Adversarial GLUE [428] assesses LLM robustness against textual attacks using 14 methods, highlighting vulnerabilities even in robustly trained models. SALAD-Bench [436] expands on this by introducing a safety benchmark with a taxonomy of risks, including attack- and defense-enhanced questions. JailBreakV-28K [438] focuses on evaluating multi-modal LLMs against jailbreak attacks using text- and image-based test cases. A STRONG REJECT for empty jailbreaks [439] improves jailbreak evaluation with a higher-quality dataset and automated assessment. Despite their value, these benchmarks face challenges in s cal ability, consistency, and real-world relevance.

Backdoor LL M [437] 是首个用于评估文本生成中后门攻击的基准,提供了一个标准化框架,包括数据投毒和权重投毒等多样化的攻击策略。Adversarial GLUE [428] 通过14种方法评估大语言模型在文本攻击下的鲁棒性,揭示了即使在经过鲁棒训练的模型中仍存在的脆弱性。SALAD-Bench [436] 在此基础上扩展,引入了一个包含风险分类的安全基准,包括攻击和防御增强的问题。JailBreakV-28K [438] 专注于评估多模态大语言模型在基于文本和图像的越狱攻击测试用例下的表现。A STRONG REJECT for empty jailbreaks [439] 通过更高质量的数据集和自动化评估改进了越狱评估。尽管这些基准具有重要价值,但它们在可扩展性、一致性和现实相关性方面仍面临挑战。

4 VISION-LANGUAGE PRE-TRAINING MODEL SAFETY

4 视觉-语言预训练模型安全性

VLP models, such as CLIP [442], ALBEF [443], and TCL [444], have made significant strides in aligning visual and textual modalities. However, these models remain vulnerable to various safety threats, which have garnered increasing research attention. This section reviews the current safety research on VLP models, with a focus on adversarial, backdoor, and poisoning research. The representative methods reviewed in this section are summarized in Table 6.

诸如 CLIP [442]、ALBEF [443] 和 TCL [444] 等 VLP 模型在视觉和文本模态对齐方面取得了显著进展。然而,这些模型仍然容易受到各种安全威胁的影响,这已引起越来越多的研究关注。本节回顾了当前关于 VLP 模型的安全研究,重点是对抗性、后门和投毒研究。本节回顾的代表性方法总结在表 6 中。

4.1 Adversarial Attacks

4.1 对抗攻击

Since VLP models are widely used as backbones for fine-tuning downstream models, adversarial attacks on VLP aim to generate examples that cause incorrect predictions across various downstream tasks, including zero-shot image classification, imagetext retrieval, visual entailment, and visual grounding. Similar to Section 2, these attacks can roughly be categorized into white-box attacks and black-box attacks, based on their threat models.

由于 VLP 模型被广泛用作下游模型微调的骨干网络,针对 VLP 的对抗攻击旨在生成在各种下游任务中导致错误预测的样本,包括零样本图像分类、图像文本检索、视觉蕴含和视觉定位。与第 2 节类似,这些攻击根据其威胁模型大致可以分为白盒攻击和黑盒攻击。

TABLE 6: A summary of attacks and defenses for VLP models.

表 6: VLP 模型攻击与防御的总结

4.1.1 White-box Attacks

4.1.1 白盒攻击

White-box adversarial attacks on VLP models can be further categorized based on perturbation types into invisible perturbations and visible perturbations, with the majority of existing attacks employing invisible perturbations.

白盒对抗攻击在 VLP 模型上可基于扰动类型进一步分为不可见扰动和可见扰动,现有的大多数攻击都采用不可见扰动。

Invisible Perturbations involve small, imperceptible adversarial changes to inputs——-whether text or images—to maintain the stealth in ess of attacks. Early research in the vision and language domains primarily adopts this approach [60], [445]-[447], in which invisible attacks are developed independently. In the context of VLP models, which integrate both modalities, Co-Attack [192] was the first to propose perturbing both visual and textual inputs simultaneously to create stronger attacks. Building on this, AdvCLIP [193] explores universal adversarial perturbations that can deceive all downstream tasks.

隐形扰动涉及对输入(无论是文本还是图像)进行微小且不易察觉的对抗性更改,以保持攻击的隐蔽性。视觉和语言领域的早期研究主要采用这种方法 [60], [445]-[447],其中隐形攻击是独立开发的。在整合了两种模态的 VLP 模型背景下,Co-Attack [192] 首次提出同时扰动视觉和文本输入以创建更强的攻击。在此基础上,AdvCLIP [193] 探索了可以欺骗所有下游任务的通用对抗扰动。

Visible Perturbations involve more substantial and noticeable alterations. For example, manually crafted typographical, conceptual, and icon o graphic images have been used to demonstrate that the CLIP model tends to “read first, look later" [194], highlighting a unique characteristic of VLP models. This behavior introduces new attack surfaces for VLP, enabling the development of more sophisticated attacks.

可见扰动涉及更大规模和更显著的改变。例如,手动制作的排版、概念和图标图像已被用于展示 CLIP 模型倾向于“先读后看”[194],这突出了 VLP (Vision-Language Pretraining) 模型的独特特性。这种行为为 VLP 引入了新的攻击面,使得开发更复杂的攻击成为可能。

4.1.2Black-box Attacks

4.1.2 黑盒攻击

Black-box attacks on VLP primarily adopt a transfer-based approach, with query-based attacks rarely explored. Existing methods can be categorized into: 1) sample-specific perturbations, tailored to individual samples, and 2) universal perturbations, applicable across multiple samples.

对视觉-语言预训练(VLP)的黑盒攻击主要采用基于迁移的方法,基于查询的攻击方法较少被探索。现有方法可分为两类:1) 样本特定的扰动,针对单个样本进行定制;2) 通用扰动,适用于多个样本。

Sample-wise perturbations are generally more effective than universal perturbations, but their transfer ability is often limited.

样本级扰动通常比通用扰动更有效,但其迁移能力往往有限。

SGA [195] explores adversarial transfer ability in VLP by leveraging cross-modal interactions and alignment-preserving augmentation. Building on this, SA-Attack [196] enhances cross-modal transfer ability by introducing data augmentations to both original and adversarial inputs. VLP-Attack [197] improves transfer a bility by generating adversarial texts and images using contrastive loss. To overcome SGA's limitations, TMM [198] introduces modality-consistency and discrepancy features through attentionbased and orthogonal-guided perturbations. VLATTACK [199] further enhances adversarial examples by combining image and text perturbations at both single-modal and multimodal levels. PRM [200] targets vulnerabilities in downstream models using foundation models like CLIP, enabling transferable attacks across tasks like object detection and image captioning.

SGA [195] 通过利用跨模态交互和对齐保持增强,探索了 VLP 中的对抗迁移能力。在此基础上,SA-Attack [196] 通过对原始输入和对抗输入引入数据增强,提升了跨模态迁移能力。VLP-Attack [197] 通过使用对比损失生成对抗文本和图像,改进了迁移能力。为了克服 SGA 的局限性,TMM [198] 通过基于注意力和正交引导的扰动引入了模态一致性和差异性特征。VLATTACK [199] 通过在单模态和多模态层次上结合图像和文本扰动,进一步增强了对抗样本。PRM [200] 利用 CLIP 等基础模型,针对下游模型的漏洞进行攻击,实现了跨任务的可迁移攻击,如目标检测和图像描述。

Universal Perturbations are less effective than sample-wise perturbations but more transferable. C-PGC [201] was the first to investigate universal adversarial perturbations (UAPs) for VLP models. It employs contrastive learning and cross-modal information to disrupt the alignment of image-text embeddings, achiev- ing stronger attacks in both white-box and black-box scenarios. ETU [202] builds on this by generating UAPs that transfer across multiple VLP models and tasks. ETU enhances UAP transfer ability and effectiveness through improved global and local optimization techniques. It also introduces a data augmentation strategy ScMix that combines self-mix and cross-mix operations to increase data diversity while preserving semantic integrity, further boosting the robustness and applicability of UAPs.

通用扰动虽然比样本扰动效果弱,但更具可迁移性。C-PGC [201] 首次研究了视觉-语言预训练(VLP)模型的通用对抗扰动(UAPs)。它采用对比学习和跨模态信息来破坏图像-文本嵌入的对齐,在白盒和黑盒场景中实现了更强的攻击。ETU [202] 在此基础上生成了可跨多个 VLP 模型和任务迁移的 UAPs。ETU 通过改进的全局和局部优化技术,增强了 UAP 的迁移能力和效果。它还引入了一种数据增强策略 ScMix,结合自混合和交叉混合操作,在保持语义完整性的同时增加数据多样性,进一步提升了 UAP 的鲁棒性和适用性。

4.2 Adversarial Defenses

4.2 对抗防御

Existing adversarial defenses for VLP models can be grouped into four types: 1) adversarial example detection, 2) standard adversarial training, 3) adversarial prompt tuning, and 4) adversarial contrastive tuning. While adversarial detection filters out potential adversarial examples before or during inference, the other three defenses follow similar adversarial training paradigms, with variations in efficiency.

现有的 VLP 模型对抗防御可以分为四种类型:1) 对抗样本检测,2) 标准对抗训练,3) 对抗提示调优,4) 对抗对比调优。对抗检测在推理前或推理期间过滤掉潜在的对抗样本,而其他三种防御遵循类似的对抗训练范式,但在效率上有所不同。

4.2.1 Adversarial Example Detection

4.2.1 对抗样本检测

Adversarial detection methods for VLP can befurther divided into one-shot detection and state ful detection.

针对VLP的对抗检测方法可以进一步分为一次性检测和有状态检测。

4.2.1.1 One-shot Detection

4.2.1.1 单样本检测

One-shot Detection distinguishes adversarial from clean examples in a single forward pass. White-box detection methods are typically one-shot. For example, Mirror Check [217] is a modelagnostic method for VLP models. It uses text-to-image (T21) models to generate images from captions produced by the victim model, comparing the similarity between the input image and the generated image using CLIP's image encoder. A significant similarity difference flags the input as adversarial.

单次检测通过一次前向传递区分对抗样本和干净样本。白盒检测方法通常是单次的。例如,Mirror Check [217] 是一种针对 VLP 模型的模型无关方法。它使用文本到图像 (T21) 模型从受害者模型生成的描述中生成图像,并使用 CLIP 的图像编码器比较输入图像与生成图像的相似性。显著的相似性差异将输入标记为对抗样本。

4.2.1.2 Stateful Detection

4.2.1.2 有状态检测

Stateful Detection is designed for black-box query attacks, where multiple queries are tracked to detect adversarial behavior. AdvQDet [218] is a novel framework that counters query-based black-box attacks. It uses adversarial contrastive prompt tuning (ACPT) to tune CLIP image encoder, enabling detection of adversarial queries within just three queries.

状态检测(Stateful Detection)专为黑盒查询攻击设计,通过跟踪多次查询来检测对抗行为。AdvQDet [218] 是一个针对基于查询的黑盒攻击的新框架。它利用对抗性对比提示调优(ACPT)来调整 CLIP 图像编码器,从而在仅三次查询内检测到对抗性查询。

4.2.2Standard Adversarial Training

4.2.2 标准对抗训练

Adversarial training is widely regarded as the most effective defense against adversarial attacks [409], [448]. However, it is computationally expensive, and for VLP models, which are typically trained on web-scale datasets, this cost becomes prohibitively high, posing a significant challenge for traditional approaches. Although research in this area is limited, we highlight two notable works that have explored adversarial training for vision-language pre-training. Their pre-trained models can be used as robust backbones for other adversarial research.

对抗训练被广泛认为是对抗对抗攻击最有效的防御手段 [409], [448]。然而,它的计算成本很高,而对于通常基于网络规模数据集训练的 VLP 模型来说,这一成本变得极其高昂,对传统方法构成了重大挑战。尽管这一领域的研究有限,但我们重点介绍了两项探索视觉语言预训练对抗训练的显著工作。它们的预训练模型可以作为其他对抗性研究的鲁棒骨干。

The first work, VILLA [216], is a vision-language adversarial training framework consisting of two stages: task-agnostic adversarial pre-training and task-specific fine-tuning. VILLA enhances performance across downstream tasks using adversarial pre-training in the embedding space of both image and text modalities, instead of pixel or token levels. It employs FreeLB's strategy [449] to minimize computational overhead for efficient large-scale training.

首个工作 VILLA [216] 是一个视觉语言对抗训练框架,包含两个阶段:任务无关的对抗预训练和任务特定的微调。VILLA 通过在图像和文本模态的嵌入空间中进行对抗预训练,而不是像素或 Token 级别,提升了下游任务的性能。它采用了 FreeLB 的策略 [449] 以最小化计算开销,从而实现高效的大规模训练。

The second work, AdvXL [215], is a large-scale adversarial training framework with two phases: a lightweight pre-training phase using low-resolution images and weaker attacks, followed by an intensive fine-tuning phase with full-resolution images and stronger attacks. This coarse-to-fine, weak-to-strong strategy reduces training costs while enabling scalable adversarial training for large vision models.

第二项工作 AdvXL [215] 是一个大规模对抗训练框架,分为两个阶段:首先是一个使用低分辨率图像和较弱攻击的轻量预训练阶段,随后是一个使用全分辨率图像和更强攻击的强化微调阶段。这种由粗到精、由弱到强的策略在降低训练成本的同时,为大视觉模型实现了可扩展的对抗训练。

4.2.3 Adversarial Prompt Tuning

4.2.3 对抗性提示调优

Adversarial prompt tuning (APT) enhances the adversarial robustness of VLP models by incorporating adversarial training during prompt tuning [450]-[452], typically focusing on textual prompts. It offers a lightweight alternative to standard adversarial training. APT methods can be classified into two main categories based on the prompt type: textual prompt tuning and multi-modal prompt tuning.

对抗性提示调优 (APT) 通过在提示调优过程中引入对抗性训练 [450]-[452] 来增强视觉语言预训练 (VLP) 模型的对抗鲁棒性,通常侧重于文本提示。它为标准的对抗性训练提供了一种轻量级替代方案。根据提示类型,APT 方法可分为两大类:文本提示调优和多模态提示调优。

4.2.3.1 Textual Prompt Tuning

4.2.3.1 文本提示调优

Textual prompt tuning (TPT) robust if ies VLP models by finetuning learnable text prompts. AdvPT [204] enhances the adversarial robustness of CLIP image encoder by realigning adversarial image embeddings with clean text embeddings using learnable textual prompts. Similarly, APT [205] learns robust text prompts, using a CLIP image encoder to boost accuracy and robustness with minimal computational cost. MixPrompt [206] simultaneously enhances the general iz ability and adversarial robustness of VLPs by employing conditional APT. Unlike empirical defenses, Prompt Smooth [207] offers a certified defense for Medical VLMs, adapting pre-trained models to Gaussian noise without retraining. Additionally, Defense-Prefix [203] mitigates typographic attacks by adding a prefix token to class names, improving robustness without retraining.

文本提示调优 (TPT) 通过微调可学习的文本提示来增强视觉语言预训练 (VLP) 模型的鲁棒性。AdvPT [204] 通过使用可学习的文本提示将对抗性图像嵌入与干净的文本嵌入重新对齐,增强了 CLIP 图像编码器的对抗鲁棒性。类似地,APT [205] 学习了鲁棒的文本提示,使用 CLIP 图像编码器以最小的计算成本提高准确性和鲁棒性。MixPrompt [206] 通过采用条件 APT 同时提升了 VLP 的泛化能力和对抗鲁棒性。与经验防御不同,Prompt Smooth [207] 为医学 VLM 提供了认证防御,无需重新训练即可将预训练模型适应高斯噪声。此外,Defense-Prefix [203] 通过向类名添加前缀 Token 来缓解排版攻击,提高了鲁棒性而无需重新训练。

4.2.3.2Multi-Modal Prompt Tuning

4.2.3.2 多模态提示调优

Recent adversarial prompt tuning methods have expanded textual prompts to multi-modal prompts. FAP [208] introduces learnable adversarial text supervision and a training objective that balances cross-modal consistency while differentiating uni-modal representations. APD [209] improves CLIP's robustness through online prompt distillation between teacher and student multimodal prompts. Additionally, TAPT [210] presents a test-time defense that learns defensive bimodal prompts to improve CLIP's zero-shot inference robustness.

最近的对抗性提示调优方法已经将文本提示扩展到多模态提示。FAP [208] 引入了可学习的对抗性文本监督和一个训练目标,该目标在区分单模态表示的同时平衡跨模态一致性。APD [209] 通过教师和学生多模态提示之间的在线提示蒸馏来提高 CLIP 的鲁棒性。此外,TAPT [210] 提出了一种测试时防御方法,通过学习防御性双模态提示来提高 CLIP 的零样本推理鲁棒性。

4.2.4 Adversarial Contrastive Tuning

4.2.4 对抗对比微调

Adversarial contrastive tuning involves contrastive learning with adversarial training to fine-tune a robust CLIP image encoder for zero-shot adversarial robustness on downstream tasks. These methods are categorized into supervised and unsupervised methods, depending on the availability of labeled data during training.

对抗对比微调涉及对抗训练与对比学习,以在下游任务中微调出具有零样本对抗鲁棒性的CLIP图像编码器。这些方法根据训练期间是否有标签数据,分为有监督和无监督方法。

4.2.4.1 Supervised Contrastive Tuning

4.2.4.1 监督对比调优 (Supervised Contrastive Tuning)

Visual Tuning fine-tunes CLIP image encoder using only adversarial images. TeCoA [211] explores the zero-shot adversarial robustness of CLIP and finds that visual prompt tuning is more effective without text guidance, while fine-tuning performs better with text information. PMG-AFT [212] improves zero-shot adversarial robustness by introducing an auxiliary branch to minimize the distance between adversarial outputs in the target and pre-trained models, mitigating over fitting and preserving generalization.

视觉调优仅使用对抗性图像对CLIP图像编码器进行微调。TeCoA [211] 探索了CLIP的零样本对抗鲁棒性,发现视觉提示调优在没有文本引导时更有效,而微调在提供文本信息时表现更好。PMG-AFT [212] 通过引入辅助分支来最小化目标模型和预训练模型中对抗性输出之间的距离,从而提高零样本对抗鲁棒性,缓解过拟合并保持泛化能力。

Multi-modal Tuning fine-tunes CLIP image encoder using both adversarial texts and images. MMCoA [213] combines image-based PGD and text-based BERT-Attack in a multi-modal contrastive adversarial training framework. It uses two contrastive losses to align clean and adversarial image and text features, improving robustness against both image-only and multi-modal attacks.

多模态调优使用对抗性文本和图像对CLIP图像编码器进行微调。MMCoA [213] 在多模态对比对抗训练框架中结合了基于图像的 PGD 和基于文本的 BERT-Attack。它使用两种对比损失来对齐干净和对抗性的图像及文本特征,从而提高了对仅图像攻击和多模态攻击的鲁棒性。

4.2.4.2 Unsupervised Contrastive Tuning

4.2.4.2 无监督对比调优

Adversarial contrastive tuning can also be performed in an unsupervised fashion. For instance, FARE [214] robust if ies CLIP image encoder through unsupervised adversarial fine-tuning, achieving superior clean accuracy and robustness across downstream tasks, including zero-shot classification and vision-language tasks. This approach enables VLMs, such as LLaVA and Open Flamingo, to attain robustness without the need for re-training or additional fine-tuning.

对抗性对比调优(adversarial contrastive tuning)也可以以无监督的方式进行。例如,FARE [214] 通过无监督的对抗性微调使 CLIP 图像编码器更加鲁棒,在下游任务(包括零样本分类和视觉语言任务)中实现了优异的干净准确性和鲁棒性。这种方法使得视觉语言模型(VLM),如 LLaVA 和 Open Flamingo,无需重新训练或额外微调即可获得鲁棒性。

4.3Backdoor & Poisoning Attacks

4.3 后门与投毒攻击

Backdoor and poisoning attacks on CLIP can target either the pre-training stage or the fine-tuning stage on downstream tasks. Previous studies have shown that poisoning backdoor attacks on CLIP can succeed with significantly lower poisoning rates compared to traditional supervised learning [224]. Additionally, training CLIP on web-crawled data increases its vulnerability to backdoor attacks [453]. This section reviews proposed attacks targeting backdoor ing or poisoning CLIP.

后门攻击和投毒攻击 (Poisoning attacks) 可以针对 CLIP 的预训练阶段或下游任务的微调阶段。先前的研究表明,与传统监督学习相比,针对 CLIP 的投毒后门攻击可以在显著较低的投毒率下成功 [224]。此外,在通过网络爬取的数据上训练 CLIP 会增加其受到后门攻击的脆弱性 [453]。本节回顾了针对 CLIP 后门或投毒攻击的提议。

4.3.1Backdoor Attacks

4.3.1 后门攻击

Based on the trigger modality, existing backdoor attacks on CLIP can be categorized into visual triggers and multi-modal triggers.

基于触发方式,现有的CLIP后门攻击可分为视觉触发和多模态触发。

Visual Triggers target pre-trained image encoders by embedding backdoor patterns in visual inputs. BadEncoder [219] explores image backdoor attacks on self-supervised learning by injecting backdoors into pre-trained image encoders, compromising downstream class if i ers. Corrupt Encoder [220] exploits random cropping in contrastive learning to inject backdoors into pre-trained image encoders, with increased effectiveness when cropped views contain only the reference object or the trigger. For attacks targeting CLIP, BadCLIP [221] optimizes visual trigger patterns using dual-embedding guidance, aligning them with both the target text and specific visual features. This strategy enables BadCLIP to bypass backdoor detection and fine-tuning defenses.

视觉触发器通过在视觉输入中嵌入后门模式来针对预训练的图像编码器。BadEncoder [219] 通过将后门注入预训练的图像编码器,探索了自监督学习中的图像后门攻击,从而危及下游分类器。Corrupt Encoder [220] 利用对比学习中的随机裁剪将后门注入预训练的图像编码器,当裁剪的视图仅包含参考对象或触发器时,效果会增强。对于针对 CLIP 的攻击,BadCLIP [221] 使用双嵌入引导优化视觉触发模式,使其与目标文本和特定视觉特征对齐。这一策略使 BadCLIP 能够绕过后门检测和微调防御。

Multi-modal Triggers combine both visual and textual triggers to enhance the attack. BadCLIP [222] introduces a novel trigger-aware prompt learning-based backdoor attack targeting CLIP models. Rather than fine-tuning the entire model, BadCLIP injects learnable triggers during the prompt learning stage, affecting both the image and text encoders.

多模态触发器结合了视觉和文本触发器以增强攻击效果。BadCLIP [222] 提出了一种针对 CLIP 模型的新型基于触发器的提示学习后门攻击。BadCLIP 并未对整个模型进行微调,而是在提示学习阶段注入可学习的触发器,影响图像和文本编码器。

4.3.2 Poisoning Attacks

4.3.2 投毒攻击

Two targeted poisoning attacks on CLIP are PBCL [224] and MM Poison [223]. PBCL demonstrated that a targeted poisoning attack, mis classifying a specific sample, can be achieved by poisoning as little as $0.0001%$ of the training dataset. MM Poison investigates modality vulnerabilities and proposes three attack types: single target image, single target label, and multiple target labels. Evaluations show high attack success rates while maintaining clean data performance across both visual and textual modalities.

针对 CLIP 的两种定向投毒攻击是 PBCL [224] 和 MM Poison [223]。PBCL 证明,只需投毒训练数据集的 $0.0001%$,即可实现定向投毒攻击,从而错误分类特定样本。MM Poison 研究了模态漏洞,并提出了三种攻击类型:单目标图像、单目标标签和多目标标签。评估表明,在保持视觉和文本模态中干净数据性能的同时,攻击成功率很高。

4.4 Backdoor & Poisoning Defenses

4.4 后门与中毒防御

Defense strategies against backdoor and poisoning attacks are generally categorized into robust training and backdoor detection. Robust Training aims to create VLP models resistant to backdoor or targeted poisoning attacks, even when trained on untrusted datasets. This approach specifically addresses poisoning-based attacks. Backdoor detection focuses on identifying compromised encoders or contaminated data. Detection methods often require additional mitigation techniques to fully eliminate backdoor effects.

针对后门和投毒攻击的防御策略通常分为鲁棒训练和后门检测。鲁棒训练旨在创建能够抵抗后门或目标投毒攻击的 VLP 模型,即使在不可信的数据集上进行训练也能有效应对。这种方法专门针对基于投毒的攻击。后门检测则侧重于识别被破坏的编码器或被污染的数据。检测方法通常需要额外的缓解技术来完全消除后门影响。

4.4.1 Robust Training

4.4.1 鲁棒训练

Depending on the stage at which the model gains robustness against backdoor attacks, existing robust training strategies can be categorized into fine-tuning and pre-training approaches.

根据模型在对抗后门攻击时获得鲁棒性的阶段,现有的鲁棒训练策略可以分为微调和预训练方法。

4.4.1.1 Fine-tuning Stage

4.4.1.1 微调阶段

To mitigate backdoor and poisoning threats, CleanCLIP [225] fine-tunes CLIP by re-aligning each modality's representations, weakening spurious correlations from backdoor attacks. Similarly, SAFECLIP [226] enhances feature alignment using unimodal contrastive learning. It first warms up the image and text modalities separately, then uses a Gaussian mixture model to classify data into safe and risky sets. During pre-training, SAFECLIP optimizes CLIP loss on the safe set, while separately fine-tuning the risky set, reducing poisoned image-text pair similarity and defending against targeted poisoning and backdoor attacks.

为减轻后门和投毒威胁,CleanCLIP [225] 通过重新对齐每个模态的表征来微调 CLIP,削弱来自后门攻击的虚假关联。类似地,SAFECLIP [226] 使用单模态对比学习增强特征对齐。它首先分别预热图像和文本模态,然后使用高斯混合模型将数据分类为安全集和风险集。在预训练期间,SAFECLIP 在安全集上优化 CLIP 损失,同时单独微调风险集,降低被投毒的图像-文本对的相似度,并防御定向投毒和后门攻击。

4.4.1.2 Pre-training Stage

4.4.1.2 预训练阶段

ROCLIP [227] defends against poisoning and backdoor attacks by enhancing model robustness during pre-training. It disrupts the association between poisoned image-caption pairs by utilizing a large, diverse pool of random captions. Additionally, ROCLIP applies image and text augmentations to further strengthen its defense and improve model performance.

ROCLIP [227] 通过增强预训练期间的模型鲁棒性来防御投毒和后门攻击。它通过利用大量多样化的随机文本描述池来破坏投毒图像-文本对之间的关联。此外,ROCLIP 还应用图像和文本增强技术,进一步强化其防御能力并提升模型性能。

4.4.2 Backdoor Detection

4.4.2 后门检测

Backdoor detection can be broadly divided into three subtasks: 1) trigger inversion, 2) backdoor sample detection, and 3) backdoor model detection. Trigger inversion is particularly useful, as recovering the trigger can aid in the detection of both backdoor samples and backdoored models.

后门检测可以大致分为三个子任务:1)触发器反演,2)后门样本检测,以及 3)后门模型检测。触发器反演特别有用,因为恢复触发器可以帮助检测后门样本和后门模型。

Trigger Inversion aims to reverse-engineer the trigger pattern injected into a backdoored model. Mudjacking [230] mitigates backdoor vulnerabilities in VLP models by adjusting model parameters to remove the backdoor when a mis classified triggerembedded input is detected. In contrast to single-modality defenses, TIJO [229] defends against dual-key backdoor attacks by jointly optimizing the reverse-engineered triggers in both the image and text modalities.

触发器反演旨在逆向工程注入到后门模型中的触发器模式。Mudjacking [230] 通过调整模型参数来消除后门漏洞,当检测到误分类的触发器嵌入输入时,移除后门。与单模态防御相比,TIJO [229] 通过联合优化图像和文本模态中的逆向工程触发器来防御双键后门攻击。

Backdoor Sample Detection detects whether a training or test sample is poisoned by a backdoor trigger. This detection can be used to cleanse the training dataset or reject backdoor queries. SEER [231] addresses the complexity of multi-modal models by jointly detecting malicious image triggers and target texts in the shared feature space. This method does not require access to the training data or knowledge of downstream tasks, making it highly effective for backdoor detection in VLP models. Outlier Detection [232] demonstrates that the local neighborhood of backdoor samples is significantly sparser compared to that of clean samples. This insight enables the effective and efficient application of various local outlier detection methods to identify backdoor samples from web-scale datasets. Furthermore, they reveal that potential unintentional backdoor samples already exist in the Conceptual Captions 3 Million (CC3M) dataset and have been trained into open-sourced CLIP encoders.

后门样本检测用于判断训练或测试样本是否被后门触发器污染。这种检测可用于清理训练数据集或拒绝后门查询。SEER [231]通过联合检测共享特征空间中的恶意图像触发器和目标文本来解决多模态模型的复杂性。该方法无需访问训练数据或了解下游任务,因此在VLP模型的后门检测中非常有效。离群点检测 [232]表明,后门样本的局部邻域相比干净样本要稀疏得多。这一洞察使得各种局部离群点检测方法能够高效地应用于从网络规模数据集中识别后门样本。此外,他们发现Conceptual Captions 3 Million (CC3M)数据集中已经存在潜在的无意识后门样本,并且这些样本已被训练到开源的CLIP编码器中。

Backdoor Model Detection identifies whether a trained model is compromised by backdoor(s). DECREE [228] introduces a backdoor detection method specifically for VLP encoders that require no labeled data. It exploits the distinct embedding space characteristics of backdoored encoders when exposed to clean versus backdoor inputs. By combining trigger inversion with these embedding differences, DECREE can effectively detect backdoored encoders.

后门模型检测识别训练好的模型是否被后门攻击所破坏。DECREE [228] 提出了一种专门针对视觉语言预训练 (VLP) 编码器的后门检测方法,该方法无需标注数据。它利用被后门攻击的编码器在干净输入和后门输入下表现出的不同嵌入空间特性。通过将触发器反转与这些嵌入差异相结合,DECREE 能够有效检测被后门攻击的编码器。

4.5 Datasets

4.5 数据集

This section reviews datasets used for VLP safety research. As shown in Table 6, a variety of benchmark datasets were employed toevaluate adversarial attacks and defenses for VLP models. For image classification tasks, commonly used datasets include: ImageNet [454], Caltech101 [455], DTD [456], EuroSAT [457], OxfordPets [458], FGVC-Aircraft [459], Food101 [460], Flowers102 [461], StanfordCars [462], SUN397 [463], and UCF101 [464]. For evaluating domain generalization and robustness to distribution shifts, several ImageNet variants were also used: ImageNetV2 [465], ImageNet-Sketch [466], ImageNetA [467], and ImageNet-R [468]. Additionally, MS-COCO [418] and Flickr30K [469] were utilized for image-to-text and text-toimage retrieval tasks, RefCOCO $^+$ [470] for visual grounding, and SNLI-VE [471] for visual entailment.

本节回顾了用于视觉语言预训练(VLP)安全研究的数据集。如表 6 所示,研究人员使用了多种基准数据集来评估 VLP 模型的对抗攻击和防御能力。在图像分类任务中,常用的数据集包括:ImageNet [454]、Caltech101 [455]、DTD [456]、EuroSAT [457]、OxfordPets [458]、FGVC-Aircraft [459]、Food101 [460]、Flowers102 [461]、StanfordCars [462]、SUN397 [463] 和 UCF101 [464]。为了评估领域泛化能力和对分布变化的鲁棒性,还使用了几个 ImageNet 变体:ImageNetV2 [465]、ImageNet-Sketch [466]、ImageNetA [467] 和 ImageNet-R [468]。此外,MS-COCO [418] 和 Flickr30K [469] 被用于图像到文本和文本到图像检索任务,RefCOCO $^+$ [470] 用于视觉定位,SNLI-VE [471] 用于视觉蕴涵任务。

5 VISION-LANGUAGE MODEL SAFETY

5 视觉-语言模型安全性

Large VLMs extend LLMs by adding a visual modality through pre-trained image encoders and alignment modules, enabling applications like visual conversation and complex reasoning. However, this multi-modal design introduces unique vulnerabilities. This section reviews adversarial attacks, latency energy attacks, jailbreak attacks, prompt injection attacks, backdoor & poisoning attacks, and defenses developed for VLMs. Many VLMs use VLP-trained encoders, so the attacks and defenses discussed in Section 4 also apply to VLMs. The additional alignment process between the VLM pre-trained encoders and LLMs, however, expands the attack surface, with new risks like cross-modal backdoor attacks and jailbreaks targeting both text and image inputs. This underscores the need for safety measures tailored to VLMs.

大型视觉语言模型通过预训练图像编码器和对齐模块扩展了大语言模型,增加了视觉模态,从而实现了视觉对话和复杂推理等应用。然而,这种多模态设计引入了独特的漏洞。本节回顾了对抗攻击、延迟能量攻击、越狱攻击、提示注入攻击、后门和投毒攻击,以及为视觉语言模型开发的防御措施。许多视觉语言模型使用了视觉语言预训练编码器,因此第4节中讨论的攻击和防御也适用于视觉语言模型。然而,视觉语言模型预训练编码器与大语言模型之间的额外对齐过程扩大了攻击面,出现了新的风险,如针对文本和图像输入的跨模态后门攻击和越狱攻击。这强调了需要针对视觉语言模型量身定制安全措施的重要性。

5.1 Adversarial Attacks

5.1 对抗攻击

Adversarial attacks on VLMs primarily target the visual modality, which, unlike text, is more susceptible to adversarial perturbations due to its high-dimensional nature. By adding imperceptible changes to images, attackers aim to disrupt tasks like image captioning and visual question answering. These attacks are classified into white-box and black-box categories based on the threat model.

针对视觉语言模型 (VLM) 的对抗攻击主要针对视觉模态,与文本不同,由于其高维特性,视觉模态更容易受到对抗性扰动的影响。通过在图像中添加不可察觉的变化,攻击者旨在破坏图像描述和视觉问答等任务。根据威胁模型,这些攻击分为白盒和黑盒两类。

5.1.1White-box Attacks

5.1.1 白盒攻击

White-box adversarial attacks on VLMs have full access to the model parameters, including both vision encoders and LLMs. These attacks can be classified into three types based on their objectives: task-specific attacks, cross-prompt attack, and chainof-thought (CoT) attack.

白盒对抗攻击在视觉语言模型(VLMs)上具有对模型参数的完全访问权限,包括视觉编码器和大语言模型。这些攻击根据其目标可以分为三类:任务特定攻击、跨提示攻击和思维链(CoT)攻击。

Task-specific Attacks Schlarmann et al. [233] were the first to highlight the vulnerability of VLMs like Flamingo [472] and GPT4 [473] to adversarial images that manipulate caption outputs. Their study showed how attackers can exploit these vulnerabilities to mislead users, redirecting them to harmful websites or spreading misinformation. Gao et al. [236] introduced attack paradigms targeting the referring expression comprehension task, while [234] proposed a query decomposition method and demonstrated how contextual prompts can enhance VLM robustness against visual attacks.

任务特定攻击

Schlarmann 等人 [233] 首次强调了像 Flamingo [472] 和 GPT4 [473] 这样的视觉语言模型 (VLM) 对操纵字幕输出的对抗图像的脆弱性。他们的研究表明,攻击者如何利用这些漏洞误导用户,将其重定向到有害网站或传播虚假信息。Gao 等人 [236] 提出了针对指代表达理解任务的攻击范式,而 [234] 提出了一种查询分解方法,并展示了上下文提示如何增强 VLM 对视觉攻击的鲁棒性。

Cross-prompt Attack refer to adversarial attacks that remain effective across different prompts. For example, CroPA [235] explored the transfer ability of a single adversarial image across multiple prompts, investigating whether it could mislead predictions in various contexts. To tackle this, they proposed refining adversarial perturbations through learnable prompts to enhance transfer ability.

跨提示攻击 (Cross-prompt Attack) 指的是在不同提示下仍能保持有效的对抗攻击。例如,CroPA [235] 探索了单个对抗图像在多个提示下的迁移能力,研究其是否能在不同情境下误导预测。为了解决这一问题,他们提出了通过可学习的提示来优化对抗扰动,以增强迁移能力。

CoT Attack targets the CoT reasoning process of VLMs. Stop-reasoning Attack [237] explored the impact of CoT reasoning on adversarial robustness. Despite observing some improvements in robustness, they introduced a novel attack designed to bypass these defenses and interfere with the reasoning process withinVLMs.

CoT攻击针对VLMs的CoT推理过程。Stop-reasoning攻击[237]探讨了CoT推理对对抗鲁棒性的影响。尽管观察到鲁棒性有所提高,但他们引入了一种新颖的攻击,旨在绕过这些防御并干扰VLMs内部的推理过程。

5.1.2 Gray-box Attacks

5.1.2 灰盒攻击

Gray-box adversarial attacks typically involve access to either the vision encoders or the LLM of a VLM, with a focus on vision encoders as the key different i at or between VLMs and LLMs. Attackers craft adversarial images that closely resemble target images, manipulating model predictions without full access to the VLM. For instance, InstructTA [238] generates a target image and uses a surrogate model to create adversarial perturbations, minimizing the feature distance between the original and adversarial image. To improve transfer ability, the attack incorporates GPT-4 paraphrasing to refine instructions.

灰盒对抗攻击通常涉及对视觉编码器或大语言模型的访问,重点关注视觉编码器作为视觉语言模型和大语言模型之间的关键区别。攻击者制作与目标图像非常相似的对抗图像,在没有完全访问视觉语言模型的情况下操纵模型预测。例如,InstructTA [238] 生成目标图像并使用替代模型创建对抗扰动,最小化原始图像和对抗图像之间的特征距离。为了提高迁移能力,攻击结合了 GPT-4 的释义来优化指令。

5.1.3Black-box Attacks

5.1.3 黑盒攻击

In contrast, black-box attacks do not require access to the target model's internal parameters and typically rely on transfer-based Or generator-based methods.

相比之下,黑盒攻击不需要访问目标模型的内部参数,通常依赖于基于迁移或基于生成器的方法。

Transfer-based Attacks exploit the widespread use of frozen CLIP vision encoders in many VLMs. AttackBard [239] demonstrates that adversarial images generated from surrogate models can successfully mislead Google's Bard, despite its defense mechanisms. Similarly, AttackVLM [240] crafts targeted adversarial images for models like CLIP [442] and BLIP [474], successfully transferring these adversarial inputs to other VLMs. It also shows that black-box queries further improved the success rate of generating targeted responses, illustrating the potency of cross-model transfer ability.

基于迁移的攻击利用了冻结的CLIP视觉编码器在许多视觉语言模型(VLM)中的广泛应用。AttackBard [239] 展示了从替代模型生成的对抗性图像可以成功地误导Google的Bard,尽管后者有防御机制。同样,AttackVLM [240] 为CLIP [442] 和BLIP [474] 等模型制作了有针对性的对抗性图像,并成功地将这些对抗性输入迁移到其他VLM中。它还表明,黑盒查询进一步提高了生成目标响应的成功率,展示了跨模型迁移能力的强大性。

Generator-based Attacks leverage generative models to create adversarial examples with improved transfer ability. AdvDiffVLM [241] uses diffusion models to generate natural, targeted adversarial images with enhanced transfer ability. By combining adaptive ensemble gradient estimation and GradCAM-guided masking, it improves the semantic embedding of adversarial examples and spreads the targeted semantics more effectively across the image, leading to more robust attacks. AnyAttack [242] presents a self-supervised framework for generating targeted adversarial images without label supervision. By utilizing contrastive loss, it efficiently creates adversarial examples that mislead models across diverse tasks.

基于生成器的攻击利用生成模型创建具有更强迁移能力的对抗样本。AdvDiffVLM [241] 使用扩散模型生成具有增强迁移能力的自然、定向对抗图像。通过结合自适应集成梯度估计和 GradCAM 引导的掩码,它改进了对抗样本的语义嵌入,并更有效地在图像中传播目标语义,从而实现更强大的攻击。AnyAttack [242] 提出了一种自监督框架,用于在没有标签监督的情况下生成定向对抗图像。通过利用对比损失,它有效地创建了误导模型在不同任务中的对抗样本。

5.2Jailbreak Attacks

5.2 越狱攻击

The inclusion of a visual modality in VLMs provides additional routes for jailbreak attacks. While adversarial attacks generally induce random or targeted errors, jailbreak attacks specifically target the model's safeguards to generate inappropriate outputs. Like adversarial attacks, jailbreak attacks on VLMs can be classified as white-box or black-box attacks.

视觉语言模型 (VLM) 中视觉模态的引入为越狱攻击提供了额外的途径。虽然对抗攻击通常会导致随机或目标错误,但越狱攻击专门针对模型的防护机制,以生成不合适的输出。与对抗攻击类似,VLM 的越狱攻击可以分为白盒攻击或黑盒攻击。

TABLE 7: A summary of attacks and defenses for VLMs.

5.2.1White-box Attacks

5.2.1 白盒攻击

White-box jailbreak attacks leverage gradient information to perturb input images or text, targeting specific behaviors in VLMs. These attacks can be further categorized into three types: targetspecific jailbreak, universal jailbreak, and hybrid jailbreak, each exploiting different aspects of the model's safety measures.

白盒越狱攻击利用梯度信息来扰动输入图像或文本,针对视觉语言模型(VLM)中的特定行为。这些攻击可以进一步分为三类:目标特定越狱、通用越狱和混合越狱,每种攻击都利用了模型安全措施的不同方面。

Target-specific Jailbreak focuses on inducing a specific type of harmful output from the model. Image Hijack [243] introduces adversarial images that manipulate VLM outputs, such as leaking information, bypassing safety measures, and generating false statements. These attacks, trained on generic datasets, effectively force models to produce harmful outputs. Similarly, Adversarial Alignment Attack [244] demonstrates that adversarial images can induce misaligned behaviors in VLMs, suggesting that similar techniques could be adapted for text-only models using advanced NLP methods.

目标特定的越狱攻击专注于诱导模型生成特定类型的有害输出。Image Hijack [243] 引入了对抗性图像,这些图像可以操纵视觉语言模型(VLM)的输出,例如泄露信息、绕过安全措施以及生成虚假陈述。这些攻击在通用数据集上进行训练,有效地迫使模型生成有害输出。同样,Adversarial Alignment Attack [244] 表明,对抗性图像可以诱导视觉语言模型产生不匹配的行为,这表明类似的技术可以借助先进的自然语言处理(NLP)方法,适用于纯文本模型。

Universal Jailbreak bypasses model safeguards, causing it to generate harmful content beyond the adversarial input. VAJM [245] shows that a single adversarial image can universally bypass VLM safety, forcing universal harmful outputs. ImgJP [246] uses a maximum likelihood algorithm to create transferable adversarial images that jailbreak various VLMs, even bridging VLM and LLM attacks by converting images to text prompts. UMK [247] proposes a dual optimization attack targeting both text and image modalities, embedding toxic semantics in images and text to maximize impact. HADES [248] introduces a hybrid jailbreak method that combines universal adversarial images with crafted inputs to bypass safety mechanisms, effectively amplifying harmful instructions and enabling robust adversarial manipulation.

通用越狱绕过模型的安全保护,使其在对抗输入之外生成有害内容。VAJM [245] 展示了单个对抗图像可以普遍绕过 VLM 的安全性,强制生成普遍的有害输出。ImgJP [246] 使用最大似然算法创建可转移的对抗图像,这些图像可以越狱各种 VLM,甚至通过将图像转换为文本提示来桥接 VLM 和 LLM 攻击。UMK [247] 提出了一种针对文本和图像模态的双重优化攻击,将有毒语义嵌入图像和文本中以最大化影响。HADES [248] 引入了一种混合越狱方法,结合了通用对抗图像和精心设计的输入,以绕过安全机制,有效放大有害指令并实现稳健的对抗操作。

5.2.2Black-box Attacks

5.2.2 黑盒攻击

Black-box jailbreak attacks do not require direct access to the internal parameters of the target VLM. Instead, they exploit external vulnerabilities, such as those in the frozen CLIP vision encoder, interactions between vision and language modalities, or system prompt leakage. These attacks can be classified into four main categories: transfer-based attacks, manually-designed attacks, system prompt leakage, and red teaming, each employing distinct strategies to bypass VLM defenses and trigger harmful behaviors.

黑盒越狱攻击不需要直接访问目标VLM的内部参数。相反,它们利用外部漏洞,例如冻结的CLIP视觉编码器中的漏洞、视觉和语言模态之间的交互或系统提示泄漏。这些攻击可以分为四大类:基于迁移的攻击、手动设计的攻击、系统提示泄漏和红队测试,每种攻击都采用不同的策略来绕过VLM的防御并触发有害行为。

Transfer-based Attacks on VLMs typically assume the attacker has access to the image encoder (or its open-source version), which is used to generate adversarial images that can then be transferred to attack the black-box LLM. For example, Jailbreak in Pieces [249] introduces cross-modality attacks that transfer adversarial images, crafted using the image encoder (assume the model employed an open-source encoder), along with clean textual prompts to break VLM alignment.

基于迁移的攻击通常假设攻击者能够访问图像编码器(或其开源版本),并利用该编码器生成对抗性图像,然后将其迁移至攻击黑盒大语言模型。例如,Jailbreak in Pieces [249] 提出了跨模态攻击,通过利用图像编码器(假设模型使用了开源编码器)生成对抗性图像,并配合干净的文本提示,来破坏 VLM 的对齐性。

Manually-designed Attacks can be as effective as optimized ones. For instance, FigStep [250] introduces an algorithm that bypasses safety measures by converting harmful text into images via typography, enabling VLMs to visually interpret the harmful intent. VRP [252] adopts a visual role-play approach, using LLM-generated images of high-risk characters based on detailed descriptions. By pairing these images with benign roleplay instructions, VRP exploits the negative traits of the characters to deceive VLMs into generating harmful outputs.

手动设计的攻击可以与优化的攻击一样有效。例如,FigStep [250] 提出了一种算法,通过排版将有害文本转换为图像,绕过安全措施,使视觉语言模型 (VLM) 能够视觉上解读有害意图。VRP [252] 采用视觉角色扮演方法,基于详细描述使用大语言模型生成高风险角色的图像。通过将这些图像与无害的角色扮演指令配对,VRP 利用角色的负面特征来欺骗视觉语言模型生成有害输出。

System Prompt Leakage is another significant black-box jailbreak method, exemplified by SASP [251]. By exploiting a system prompt leakage in GPT-4V, SASP allowed the model to perform a self-adversarial attack, demonstrating the risks of internal prompt exposure.

系统提示泄露

Red Teaming recently saw an advancement with IDEATOR [253], which integrated a VLM with an advanced diffusion model to autonomously generate malicious image-text pairs. This approach overcomes the limitations of manually designed attacks, providing a scalable and efficient method for creating adversarial inputs without direct access to the target model.

红队测试最近在 IDEATOR [253] 方面取得了进展,该方法将 VLM 与先进的扩散模型 (diffusion model) 结合,自动生成恶意的图像-文本对。这种方法克服了手动设计攻击的局限性,提供了一种可扩展且高效的方法来创建对抗性输入,而无需直接访问目标模型。

5.3Jailbreak Defenses

5.3 越狱防御

This section reviews defense methods for VLMs against jailbreak attacks, categorized into jailbreak detection and jailbreak prevention. Detection methods identify harmful inputs or outputs for rejection or purification, while prevention methods enhance the model's inherent robustness to jailbreak queries through safety alignment or filters.

本节回顾了针对越狱攻击的视觉语言模型(VLM)防御方法,分为越狱检测和越狱预防两类。检测方法识别有害输入或输出以进行拒绝或净化,而预防方法则通过安全对齐或过滤器增强模型对越狱查询的固有鲁棒性。

5.3.1Jailbreak Detection

5.3.1 Jailbreak 检测

JailGuard [254] detects jailbreak attacks by mutating untrusted inputs and analyzing discrepancies in model responses. It uses 18 mutators for text and image inputs, improving generalization across attack types. GuardMM [255] is a two-stage defense: the first stage validates inputs to detect unsafe content, while the second stage focuses on prompt injection detection to protect against image-based attacks. It uses a specialized language to enforce safety rules and standards. MLLM-Protector [257] identifies harmful responses using a lightweight detector and detoxifies them through a specialized transformation mechanism. Its modular design enables easy integration into existing VLMs, enhancing safety and preventing harmful content generation.

JailGuard [254] 通过突变不可信输入并分析模型响应的差异来检测越狱攻击。它使用18种突变器处理文本和图像输入,提高了对各种攻击类型的泛化能力。GuardMM [255] 是一种两阶段防御机制:第一阶段验证输入以检测不安全内容,第二阶段专注于提示注入检测以防止基于图像的攻击。它使用专用语言来强制执行安全规则和标准。MLLM-Protector [257] 利用轻量级检测器识别有害响应,并通过专用转换机制进行净化。其模块化设计便于集成到现有视觉语言模型 (VLM) 中,增强安全性并防止生成有害内容。

5.3.2 Jailbreak Prevention

5.3.2 越狱防护

AdaShield [256] defends against structure-based jailbreaks by prepending defense prompts to inputs, refining them adaptively through collaboration between the VLM and an LLM-based prompt generator, without requiring fine-tuning. ECSO [258] offers a training-free protection by converting unsafe images into text descriptions, activating the safety alignment of pre-trained LLMs within VLMs to ensure safer outputs. Infer Align er [259] applies cross-model guidance during inference, adjusting activations using safety vectors to generate safe and reliable outputs. BlueSuffix [260] introduces a reinforcement learning-based blackbox defense framework consisting of three key components: (1) an image purifier for securing visual inputs, (2) a text purifier for safeguarding textual inputs, and (3) a reinforcement fine-tuningbased suffix generator that leverages bimodal gradients to enhance cross-modal robustness.

AdaShield [256] 通过向输入前置防御提示来防御基于结构的越狱,通过 VLM 和基于大语言模型的提示生成器之间的协作自适应地优化提示,而无需微调。ECSO [258] 提供了一种无需训练的保护方法,将不安全的图像转换为文本描述,激活预训练大语言模型在 VLM 中的安全对齐,以确保更安全的输出。Infer Aligner [259] 在推理过程中应用跨模型指导,使用安全向量调整激活,以生成安全可靠的输出。BlueSuffix [260] 引入了一种基于强化学习的黑盒防御框架,由三个关键组件组成:(1) 用于保护视觉输入的图像净化器,(2) 用于保护文本输入的文本净化器,以及 (3) 基于强化微调的后缀生成器,利用双模梯度增强跨模态的鲁棒性。

5.4 Energy Latency Attacks

5.4 能量延迟攻击

Similar to LLMs, multi-modal LLMs also face significant computational demands. Verbose images [261] exploit these demands by overwhelming service resources, resulting in higher server costs, increased latency, and inefficient GPU usage. These images are specifically designed to delay the occurrence of the EOS token, increasing the number of auto-regressive decoder calls, which in turn raises both energy consumption and latency costs.

与大语言模型类似,多模态大语言模型同样面临巨大的计算需求。冗长图像[261]通过过度占用服务资源来利用这些需求,导致更高的服务器成本、增加的延迟以及低效的GPU使用。这些图像专门设计用于延迟EOS Token的出现,增加自回归解码器的调用次数,从而提高能耗和延迟成本。

5.5 Prompt Injection Attacks

5.5 提示注入攻击

Prompt injection attacks against VLMs share the same objective as those against LLMs (Section 3), but the visual modality introduces continuous features that are more easily exploited through adversarial attacks or direct injection. These attacks can be further classified into optimization-based attacks and typography-based attacks.

针对VLMs的提示注入攻击与针对大语言模型的提示注入攻击目标相同(第3节),但视觉模态引入了连续特征,这些特征更容易通过对抗攻击或直接注入被利用。这些攻击可以进一步分为基于优化的攻击和基于排版攻击的攻击。

Optimization-based Attacks often optimize the input images using (white-box) gradients to produce stronger attacks. These attacks manipulate the model's responses, influencing future interactions. One representative method is Adversarial Prompt Injection [262], where attackers embed malicious instructions into VLMs by adding adversarial perturbations to images.

基于优化的攻击通常利用(白盒)梯度优化输入图像,以产生更强的攻击。这些攻击会操纵模型的响应,影响未来的交互。一个代表性的方法是对抗性提示注入 [262],攻击者通过向图像添加对抗性扰动,将恶意指令嵌入到视觉语言模型 (VLM) 中。

Typography-based Attacks exploit VLMs′ typographic vulner abilities by embedding deceptive text into images without requiring gradient access (i.e., black-box). The Typographic Attack [263] introduces two variations: Class-Based Attack to misidentify classes and Descriptive Attack to generate misleading labels. These attacks can also leak personal information [475], highlighting significant security risks.

基于字体的攻击通过将欺骗性文本嵌入图像中来利用视觉语言模型 (VLM) 的字体漏洞,且无需梯度访问(即黑盒)。字体攻击 [263] 引入了两种变体:基于类别的攻击用于错误识别类别,描述性攻击用于生成误导性标签。这些攻击还可能泄露个人信息 [475],突显了重大的安全风险。

5.6Backdoor& Poisoning Attacks

5.6 后门与投毒攻击

Most VLMs rely on VLP encoders, with safety threats discussed in Section 4. This section focuses on backdoor and poisoning risks arising during fine-tuning and testing, specifically when aligning vision encoders with LLMs. Backdoor attacks embed triggers in visual or textual inputs to elicit specific outputs, while poisoning attacks inject malicious image-text pairs to degrade model performance. We review backdoor and poisoning attacks separately, though most of these works are backdoor attacks.

大多数视觉语言模型 (VLM) 依赖于视觉语言预训练 (VLP) 编码器,其安全威胁在第 4 节中讨论。本节重点讨论在微调和测试过程中出现的后门和中毒风险,特别是在将视觉编码器与大语言模型对齐时。后门攻击在视觉或文本输入中嵌入触发器以引发特定输出,而中毒攻击则注入恶意图像-文本对以降低模型性能。我们分别回顾后门攻击和中毒攻击,尽管这些工作中的大多数都是后门攻击。

5.6.1Backdoor Attacks

5.6.1 后门攻击

We further classify backdoor attacks on VLMs into tuning-time backdoor and testing-time backdoor.

我们进一步将针对 VLMs 的后门攻击分为调优时后门 (tuning-time backdoor) 和测试时后门 (testing-time backdoor)。

Tuning-time Backdoor injects the backdoor during VLM instruction tuning. MABA [264] targets domain shifts by adding domain-agnostic triggers using attribution al interpretation, enhancing attack robustness across mismatched domains in image captioning tasks. Bad V LM Driver [266] introduced a physical backdoor for autonomous driving, using objects like red balloons to trigger unsafe actions such as sudden acceleration, bypassing digital defenses and posing real-world risks. Its automated pipeline generates backdoor training samples with malicious behaviors for stealthy, fexible attacks. ImgTrojan [267] introduces a jailbreaking attack by poisoning image-text pairs in training data, replacing captions with malicious prompts to enable VLM jailbreaks, exposing risks of compromised datasets.

调优时后门在 VLM 指令调优期间注入后门。MABA [264] 通过使用属性解释添加领域无关触发器来针对领域偏移,增强图像描述任务中不匹配领域的攻击鲁棒性。Bad VLM Driver [266] 为自动驾驶引入了物理后门,使用红色气球等物体触发突然加速等不安全行为,绕过数字防御并带来现实风险。其自动化管道生成带有恶意行为的后门训练样本,用于隐蔽且灵活的攻击。ImgTrojan [267] 通过在训练数据中污染图像-文本对来引入越狱攻击,将描述替换为恶意提示以启用 VLM 越狱,暴露数据集被破坏的风险。

Test-time Backdoor leverages the similarity of universal adversarial perturbations and backdoor triggers to inject backdoor at test-time. AnyDoor [265] embeds triggers in the textual modality via adversarial test images with universal perturbations, creating a text backdoor from image-perturbation combinations. It can also be seen as a multi-modal universal adversarial attack. Unlike traditional methods, AnyDoor does not require access to training data, enabling attackers to separate setup and activation of the attack.

测试时后门利用通用对抗扰动和后门触发器的相似性,在测试时注入后门。AnyDoor [265] 通过具有通用扰动的对抗测试图像在文本模态中嵌入触发器,从图像扰动组合中创建文本后门。它也可以被视为一种多模态通用对抗攻击。与传统方法不同,AnyDoor 不需要访问训练数据,使攻击者能够分离攻击的设置和激活。

5.6.2 Poisoning Attacks

5.6.2 投毒攻击

Shadowcast [268] is a stealthy tuning-time backdoor attack on VLMs. It injects poisoned samples visually indistinguishable from benign ones, targeting two objectives: 1) Label Attack, which mis classifies objects, and 2) Persuasion Attack, which generates misleading narratives. With only 50 poisoned samples, Shadowcast achieves high effectiveness, showing robustness and transfer ability across VLMs in black-box settings.

Shadowcast [268] 是一种针对视觉语言模型(VLM)的隐蔽调优时间后门攻击。它通过注入与良性样本在视觉上无法区分的毒化样本,攻击两个目标:1) 标签攻击,即错误分类对象;2) 说服攻击,即生成误导性叙述。仅需50个毒化样本,Shadowcast 在黑盒设置下表现出高效性,展示了跨 VLM 的鲁棒性和迁移能力。

5.7Datasets & Benchmarks

5.7 数据集与基准测试

TABLE 8: Safety and robustness benchmarks for VLMs.

表 8: VLM 的安全性和鲁棒性基准测试

基准测试 年份 规模 评估的 VLM 数量
OODCV-VQA [476] 2023 4,244 21
Sketchy-VQA 1 [476] 2023 4,000 21
MM-SafetyBench [477] 2023 5,040 12
AVIBench [478] 2024 260,000 14
JailbreakEvaluationofGPT-4o [479] 2024 4,180 1
JailBreakV-28K [438] 2024 28,000 10

The datasets used in VLM safety research are detailed in Table 2. Below, we review the benchmarks proposed for evaluating VLM safety and robustness, summarized in Table 8. SafeSight [476] introduces two VQA datasets, OODCV-VQA and SketchyVQA, to evaluate out-of-distribution (OOD) robustness, highlighting VLMs’ vulnerabilities to OOD texts and vision encoder weaknesses.MM-Safety Bench[477]focuses onimage-based manipulations, revealing vulnerabilities in multi-modal interactions. AVIBench [478] evaluates VLM robustness against 260K adversarial visual instructions, exposing susceptibility to imagebased, text-based, and content-biased adversarial visual instructions (AVIs). Jailbreak Evaluation of GPT-4o[479] tests GPT4o with multi-modal and unimodal jailbreak attacks, uncovering alignment vulnerabilities. JailBreakV-28K [438]assesses the transfer ability of LLM jailbreak techniques to VLMs, showing high attack success rates across 10 open-source models. These studies collectively reveal significant vulnerabilities in VLMs to OOD inputs, adversarial instructions, and multi-modal jailbreaks.

VLM安全研究中使用的数据集详见表 2。下面,我们回顾了为评估VLM安全性和鲁棒性提出的基准,总结在表 8 中。SafeSight [476] 引入了两个VQA数据集,OODCV-VQA和SketchyVQA,用于评估分布外 (OOD) 鲁棒性,揭示了VLM对OOD文本的脆弱性和视觉编码器的弱点。MM-Safety Bench [477] 专注于基于图像的操纵,揭示了多模态交互中的脆弱性。AVIBench [478] 评估了VLM对260K对抗性视觉指令的鲁棒性,暴露了对基于图像、基于文本和内容偏见的对抗性视觉指令 (AVIs) 的易感性。GPT-4o的越狱评估 [479] 测试了GPT-4o在越狱攻击下的多模态和单模态表现,揭示了对齐的脆弱性。JailBreakV-28K [438] 评估了LLM越狱技术在VLM中的迁移能力,展示了在10个开源模型中的高攻击成功率。这些研究共同揭示了VLM在OOD输入、对抗性指令和多模态越狱方面的显著脆弱性。

6 DIFFUSION MODEL SAFETY

6 扩散模型的安全性

This section focuses on safety research related to diffusion models [480]-[483], which involve forward noise addition and reverse sampling. In the forward process, Gaussian noise is increment ally added to an image until it becomes pure noise. Reverse sampling generates new samples by stepwise denoising based on learned data distributions [484]-[486]. By integrating input information, diffusion models perform conditional generation, transforming data distribution modeling $p(x)$ into $p(x|$ guidance).

本节重点讨论与扩散模型(diffusion models)[480]-[483]相关的安全研究,这些模型涉及前向噪声添加和反向采样。在前向过程中,高斯噪声逐步添加到图像中,直到图像变为纯噪声。反向采样基于学习到的数据分布逐步去噪,生成新的样本[484]-[486]。通过整合输入信息,扩散模型执行条件生成,将数据分布建模 $p(x)$ 转化为 $p(x|$ 引导)。

Widely used in Image-to-Image (I21), Text-to-Image (T21), and Text-to-Video (T2V) tasks, diffusion models are applied in content creation, image editing, and film production. However, their extensive use exposes them to various security risks including adversarial, jailbreak, backdoor, and privacy attacks. These attacks can degrade generation quality, bypass safety filters, manipulate outputs, and reveal sensitive training data. This section also reviews defenses against these threats, including jailbreak and backdoor defenses, as well as intellectual property protection techniques.

广泛应用于图像到图像 (I21)、文本到图像 (T21) 和文本到视频 (T2V) 任务的扩散模型被应用于内容创作、图像编辑和电影制作中。然而,其广泛使用也使其面临各种安全风险,包括对抗性攻击、越狱攻击、后门攻击和隐私攻击。这些攻击可能会降低生成质量、绕过安全过滤器、操纵输出并泄露敏感的训练数据。本节还回顾了针对这些威胁的防御措施,包括越狱和后门防御,以及知识产权保护技术。

6.1 Adversarial Attacks

6.1 对抗攻击

Adversarial attacks on diffusion models typically perturb text prompts to degrade image quality or cause semantic mismatches with the original text. This section reviews existing adversarial attacks, categorized by threat model into white-box, gray-box, and black-box methods.

对抗攻击通常通过扰动文本提示来降低图像质量或导致与原始文本的语义不匹配。本节回顾了现有的对抗攻击,根据威胁模型将其分为白盒、灰盒和黑盒方法。

6.1.1 White-box Attacks

6.1.1 白盒攻击

White-box attacks on T2I diffusion models assume full access to model parameters, allowing direct optimization of text prompts or latent space to degrade or disrupt image generation. For example, SAGE [270] explores both the discrete prompt and latent spaces to uncover failure modes in T2I models, including distorted generations and targeted manipulations. ATM [269] generates attack prompts similar to clean prompts by replacing or extending words using Gumbel Softmax, preventing the model from generating desired subjects.

对 T2I 扩散模型的白盒攻击假设可以完全访问模型参数,允许直接优化文本提示或潜在空间以降低或破坏图像生成。例如,SAGE [270] 探索了离散提示和潜在空间,以揭示 T2I 模型中的故障模式,包括失真生成和定向操控。ATM [269] 通过使用 Gumbel Softmax 替换或扩展单词,生成与干净提示相似的攻击提示,从而阻止模型生成所需的主体。

6.1.2 Gray-box Attacks

6.1.2 灰盒攻击

Gray-box attacks assume the CLIP text encoder used in many T2I diffusion models is frozen and publicly available. The attacker can then exploit CLIP similarity loss to craft adversarial text prompts targeting the text encoder.

灰盒攻击假设许多文生图扩散模型(T2I diffusion models)中使用的 CLIP 文本编码器是冻结且公开可用的。攻击者可以利用 CLIP 相似性损失来针对文本编码器制作对抗性文本提示。

QFA [271] minimizes cosine similarity between original and perturbed text embeddings to generate images that differ as much as possible from the original text. RVTA [272] maximizes imagetext similarity to align adversarial prompts with reference images generated by a surrogate diffusion model. MMP-Attack [487] simultaneously maximizes the cosine similarity between the perturbed text embedding and the target embedding in both the text and image modalities, while employing a straight-through estimator to execute the optimization process.

QFA [271] 通过最小化原始文本嵌入和扰动文本嵌入之间的余弦相似度,生成与原始文本尽可能不同的图像。RVTA [272] 通过最大化图像文本相似度,使对抗提示与由替代扩散模型生成的参考图像对齐。MMP-Attack [487] 在文本和图像模态中同时最大化扰动文本嵌入与目标嵌入之间的余弦相似度,同时采用直通估计器执行优化过程。

TABLE 9: A summary of attacks and defenses for Diffusion Models (Part I).

6.1.3Black-box Attacks

6.1.3 黑盒攻击

Black-box attacks assume the attacker has no knowledge of the victim diffusion model's internals (parameters or architecture). Since diffusion models use text prompts as input, existing attacks employ textual adversarial techniques to evade the model. These attacks can be further categorized by granularity into characterlevel, word-level, and sentence-level attacks.

黑盒攻击假设攻击者对受害扩散模型的内部结构(参数或架构)一无所知。由于扩散模型使用文本提示作为输入,现有攻击通过文本对抗技术来规避模型。这些攻击可以按粒度进一步分为字符级、词级和句子级攻击。

Character-level Attacks modify the characters in the text input to create adversarial prompts. ECB [273] shows how replacing characters with homoglyphs, such as using Hangul or Arabic scripts, shifts generated images toward cultural stereotypes. Subsequent works, like CharGrad [274], optimize character-level perturbations using gradient-based attacks and proxy representations to map character changes to embedding shifts. ER [275] uses distribution-based objectives (e.g., MMD, KL divergence) to maximize discrepancies in image distributions, enhancing attack effectiveness. These attacks exploit typos, homoglyphs, and phonetic modifications, disrupting text-to-image outputs.

字符级攻击通过修改文本输入中的字符来生成对抗性提示。ECB [273] 展示了如何使用同形异义词(如韩文或阿拉伯文)替换字符,使生成的图像偏向文化刻板印象。后续工作如 CharGrad [274] 使用基于梯度的攻击和代理表示来优化字符级扰动,将字符变化映射到嵌入变化。ER [275] 使用基于分布的目标(如 MMD、KL 散度)来最大化图像分布的差异,从而提高攻击效果。这些攻击利用错别字、同形异义词和语音修改,破坏文本到图像的输出。

Word-level Attacks craft adversarial prompts by replacing or adding words to the input text. DHV [276] uncovers a hidden vocabulary in diffusion models, where nonsensical strings like Apoploe ve sr re a it a is can generate bird images, due to their proximity to target concepts in the CLIP text embedding space. Building on this, AA [277] introduces macaronic prompting, combining word fragments from different languages to control visual outputs systematically. These attacks reveal vulnerabilities in the relationship between text embeddings and image generation.

词级攻击通过替换或添加词语来制作对抗性提示。DHV [276] 揭示了扩散模型中的隐藏词汇,其中像 "Apoploe ve sr re a it a is" 这样的无意义字符串由于在 CLIP 文本嵌入空间中接近目标概念,可以生成鸟类图像。在此基础上,AA [277] 引入了混合提示 (macaronic prompting),结合不同语言的词片段来系统地控制视觉输出。这些攻击揭示了文本嵌入与图像生成之间关系的脆弱性。

Sentence-level Attacks rewrite a substantial part or the entire prompt to create adversarial prompts. RIATIG [279] uses a CLIPbased image similarity measure as an optimization objective and a genetic algorithm to iterative ly mutate and select text prompts, creating adversarial examples that resemble the target image while remaining semantically different from the original text. In contrast, BBA [278] employs classification loss and black-box optimization to refine prompts, using Token Space Projection (TPS) to bridge the gap between continuous word embeddings and discrete tokens, enabling the generation of category-specific images without explicit category terms.

句子级攻击重写提示的大部分或全部内容以创建对抗性提示。RIATIG [279] 使用基于 CLIP 的图像相似性度量作为优化目标,并采用遗传算法迭代突变和选择文本提示,创建与目标图像相似但在语义上与原文本不同的对抗性示例。相比之下,BBA [278] 利用分类损失和黑盒优化来优化提示,使用 Token Space Projection (TPS) 来弥合连续词嵌入和离散 Token 之间的差距,从而生成无需明确类别术语的特定类别图像。

6.2 Jailbreak Attacks

6.2 越狱攻击

Diffusion models use both internal and external safety mechanisms to void the generation of Not Safe For Work (NSFW) content. Internal safety mechanisms often refer to the inherent robustness of T2I diffusion models, achieved through safety alignment during training, which aims to reduce the likelihood of generating harmful content. External safety mechanisms, on the other hand, are safety filters, such as text, image, or text-image class if i ers, applied to detect and block unsafe outputs after generation. Jailbreak attacks aim to craft adversarial prompts that bypass the safety mechanisms of diffusion models, enabling the generation of harmful content. This section provides a systematic review of existing jailbreak methods, categorized by threat model into white-box, gray-box, andblack-box attacks.

扩散模型使用内部和外部安全机制来避免生成不适合工作场所 (NSFW) 的内容。内部安全机制通常指的是 T2I 扩散模型的固有鲁棒性,通过训练期间的安全对齐来实现,旨在减少生成有害内容的可能性。外部安全机制则是安全过滤器,例如文本、图像或文本-图像分类器,用于在生成后检测并阻止不安全的输出。越狱攻击旨在制作对抗性提示,绕过扩散模型的安全机制,从而实现有害内容的生成。本节系统地回顾了现有的越狱方法,按威胁模型分为白盒、灰盒和黑盒攻击。

6.2.1White-box Attacks

6.2.1 白盒攻击

White-box attacks can bypass the safety mechanisms in T2I diffusion models through gradient-based optimization. These attacks can be further classified into internal safety attacks and external safety attacks, each exploiting specific vulnerabilities in the victim models.

白盒攻击可以通过基于梯度的优化绕过T2I扩散模型的安全机制。这些攻击可以进一步分为内部安全攻击和外部安全攻击,每种攻击都利用了受害模型中的特定漏洞。

Internal Safety Attacks target the internal safety mechanisms of diffusion models. Jail breaking internally safety-enhanced diffusion models involves regenerating NSFW content by bypassing the removal of harmful concepts. The red teaming tool P4D [282] automatically identifies problematic prompts to exploit limitations in current safety evaluations, aligning the predicted noise of an un constrained model with that of a safety-enhanced one.Unlearn Diff At k[283] introduces an evaluation framework that uses unlearned diffusion models’ classification capabilities to optimize adversarial prompts, aligning predicted noise with a target unsafe image to force the model to recreate NSFW content during denoising.

内部安全攻击针对扩散模型的内部安全机制。破解内部安全增强的扩散模型涉及通过绕过有害概念的去除来重新生成NSFW内容。红队工具P4D [282] 自动识别有问题的提示,以利用当前安全评估中的局限性,使无约束模型的预测噪声与安全增强模型的预测噪声保持一致。Unlearn Diff At k [283] 引入了一个评估框架,该框架利用未学习扩散模型的分类能力来优化对抗性提示,使预测噪声与目标不安全图像保持一致,从而在去噪过程中迫使模型重新生成NSFW内容。

External Safety Attacks target the safety filters of diffusion models, aiming to bypass both input and output safety mechanisms. RTSDSF [28O] reverse-engineered predefined NSFW concepts in filters by using the CLIP model to encode and compare NSFW vocabulary embeddings, performing a dictionary attack. It also showed that prompt dilution—adding irrelevant details—can bypass safety filters. MMA [281] employs a similarity-driven loss to optimize adversarial prompts and introduce subtle perturbations to input images, bypassing both prompt filters and post-hoc safety checkers during image editing.

外部安全攻击针对扩散模型的安全过滤器,旨在绕过输入和输出安全机制。RTSFS [28O] 通过使用 CLIP 模型对 NSFW 词汇嵌入进行编码和比较,执行字典攻击,反向工程了过滤器中预定义的 NSFW 概念。它还表明,提示稀释——添加无关细节——可以绕过安全过滤器。MMA [281] 采用相似性驱动的损失来优化对抗性提示,并在图像编辑期间对输入图像引入微小的扰动,从而绕过提示过滤器和事后安全检查器。

6.2.2 Gray-box Attacks

6.2.2 灰盒攻击

Gray-box jailbreak attacks assume that attackers have full access only to the open-source text encoder, with other components of the diffusion model remaining inaccessible. In this scenario, the attacker exploits the exposed text encoder to bypass the model's internal safety mechanism.

灰盒越狱攻击假设攻击者只能完全访问开源的文本编码器,而扩散模型的其他组件仍然无法访问。在这种情况下,攻击者利用暴露的文本编码器绕过模型的内部安全机制。

Internal Safety Attacks, under the gray-box setting, target models with ‘concept erasure'. Ring-A-Bell [488] extracts unsafe concepts by comparing antonymous prompt pairs, generates harmful prompts with soft prompts, and refines them using a genetic algorithm. JPA [284] leverages antonyms like "nude" and "clothed", calculating their average difference in the text embedding space to represent NSFW concepts, then optimizes prefix prompts for semantic alignment. RT-Attack [285] uses a two-stage strategy to maximize textual similarity to NSFw prompts and iterative ly refines them based on image-level similarity, demonstrating that even limited knowledge can enable attacks on safety-enhanced models.

内部安全攻击,在灰盒设置下,针对具有“概念擦除”的模型。Ring-A-Bell [488] 通过比较反义提示对提取不安全概念,使用软提示生成有害提示,并通过遗传算法进行优化。JPA [284] 利用反义词如“裸体”和“穿衣”,计算它们在文本嵌入空间中的平均差异来表示NSFW概念,然后优化前缀提示以实现语义对齐。RT-Attack [285] 采用两阶段策略,最大化与NSFW提示的文本相似性,并根据图像级相似性迭代优化,证明即使有限的知识也能对安全增强模型发起攻击。

6.2.3Black-box Attacks

6.2.3 黑盒攻击

Black-box jailbreaks on diffusion models target commercial models with access only to outputs, such as filter rejections or generated image quality and semantics, and are primarily external safety attacks.

黑箱越狱攻击扩散模型主要针对只能访问输出的商业模型,例如过滤拒绝或生成图像质量和语义,这些攻击主要是外部安全攻击。

External Safety Attacks, in the black-box setting, use handcrafted or LLM-assisted adversarial prompts to mislead the victim model to generate NSFW content. UD [287] highlights the risk of T2I models generating unsafe content, especially hateful memes, by refining unsafe prompts manually. Sneaky Prompt [286] uses reinforcement learning to optimize adversarial prompts, which updates its policy network based on filter evasion and semantic alignment. Other methods employ LLMs to refine adversarial prompts. Groot [289] decomposes prompts into objects and attributes to dilute sensitive content. DACA [290] breaks down and recombines prompts using LLMs. Surrogate Prompt [291] targets Midjourney, substituting sensitive terms and leveraging image-totext modules to generate harmful content at scale. Atlas [288] automates the attack with a two-agent system: one VLM generates adversarial prompts, while an LLM evaluates and selects the best candidates. These LLM-assisted strategies can significantly improve the effectiveness and stealth in ess of the attacks.

外部安全攻击在黑盒设置中,使用手工制作或大语言模型辅助的对抗性提示来误导受害者模型生成NSFW内容。UD [287] 强调了T2I模型生成不安全内容的风险,特别是通过手动改进不安全提示来生成仇恨表情包。Sneaky Prompt [286] 使用强化学习来优化对抗性提示,基于过滤规避和语义对齐来更新其策略网络。其他方法则利用大语言模型来改进对抗性提示。Groot [289] 将提示分解为对象和属性以稀释敏感内容。DACA [290] 使用大语言模型分解并重新组合提示。Surrogate Prompt [291] 针对Midjourney,替换敏感术语并利用图像到文本模块大规模生成有害内容。Atlas [288] 通过双智能体系统自动化攻击:一个视觉语言模型生成对抗性提示,而一个大语言模型评估并选择最佳候选。这些大语言模型辅助的策略可以显著提高攻击的有效性和隐蔽性。

6.3 Jailbreak Defenses

6.3 Jailbreak防御

This section reviews existing defense strategies proposed for T21 diffusion models against jailbreak attacks, including concept erasure and inference guidance. The key challenge of these defenses is how to ensure safety while maintaining generation quality.

本节回顾了针对 T21 扩散模型提出的现有防御策略,包括概念擦除和推理引导。这些防御的关键挑战是如何在确保安全的同时保持生成质量。

6.3.1 Concept Erasure

6.3.1 概念擦除

Concept erasure is an emerging research area focused on removing undesirable concepts (e.g., NSFW content and copyrighted styles) from diffusion models, where these concepts are referred to as target concepts. Concept erasure methods can be categorized into three types: finetuning-based, close-form solution, and pruningbased, depending on the strategy employed.

概念擦除 (Concept Erasure) 是一个新兴的研究领域,专注于从扩散模型 (Diffusion Models) 中移除不良概念(例如 NSFW 内容和受版权保护的风格),这些概念被称为目标概念。根据采用的策略,概念擦除方法可以分为三类:基于微调的 (finetuning-based)、闭式解法的 (close-form solution) 和基于剪枝的 (pruning-based)。

6.3.1.1 Finetuning-based Methods

6.3.1.1 基于微调的方法

These methods use gradient-based optimization to adjust model parameters, typically involving a loss function with an erasure term to prevent the generation of representations linked to the target (undesirable) concept, and a constraint term to preserve nontarget concepts. These approaches can be categorized into anchorbased, anchor-free, and adversarial erasure methods.

这些方法使用基于梯度的优化来调整模型参数,通常涉及一个包含擦除项的损失函数,以防止生成与目标(不良)概念相关的表示,以及一个约束项来保留非目标概念。这些方法可以分为基于锚点、无锚点和对抗性擦除方法。

Anchor-based Erasing is a targeted approach that guides the model to shift the target (undesirable concept) towards a good concept (anchor) by aligning predicted latent noise. AC [295] defines anchor concepts as broader categories encompassing the target concepts (e.g., "Grumpy $\mathsf{C a t}^{\prime\prime}\to,^{"}\mathsf{C a t}^{\prime\prime})$ anduses standard diffusion loss on text-image pairs of anchors to preserve their integrity while erasing target concepts. ABO [296] removes specific target concepts by modifying classifier guidance, using both explicit (replacing the target with a predefined substitute) and implicit (suppressing attention maps) erasing signals, and includes a penalty term to maintain generation quality. DoCo [297] improves generalization by aligning target and anchor concepts through adversarial training and mitigating gradient conflicts with concept-preserving gradient surgery. SPM [293] uses a 1D adapter and negative guidance [292] to suppress target concepts while ensuring non-target concepts remain consistent, affecting only relevant synonyms. SA [298] applies generative replay and elastic weight consolidation to stabilize model weights and maintain normal generation capabilities while preserving non-target concepts.

基于锚点的擦除是一种目标明确的方法,它通过对齐预测的潜在噪声,引导模型将目标(不良概念)转向良好概念(锚点)。AC [295] 将锚点概念定义为包含目标概念的更广泛类别(例如,"Grumpy $\mathsf{C a t}^{\prime\prime}\to,^{"}\mathsf{C a t}^{\prime\prime})$,并在锚点的文本-图像对上使用标准扩散损失,以保持其完整性,同时擦除目标概念。ABO [296] 通过修改分类器引导来移除特定目标概念,使用显式(用预定义的替代品替换目标)和隐式(抑制注意力图)擦除信号,并包含一个惩罚项以维护生成质量。DoCo [297] 通过对抗训练对齐目标和锚点概念,并通过概念保留梯度手术缓解梯度冲突,从而提升泛化能力。SPM [293] 使用一维适配器和负向引导 [292] 来抑制目标概念,同时确保非目标概念保持一致,仅影响相关同义词。SA [298] 应用生成式重放和弹性权重巩固来稳定模型权重,并在保留非目标概念的同时维持正常生成能力。

Anchor-free Erasing is a non-targeted fine-tuning approach that reduces the probability of generating target concepts without aligning to a specific safe concept. ESD [292] modifies classifierfree guidance into negative-guided noise prediction to minimize the target concept's generation probability (e.g., "Van Gogh"). SDD [294] addresses the extra effects of ESD's negative guidance by using un conditioned predictions and EMA to avoid catastrophic forgetting. DT [302] erases unsafe concepts by training the model to denoise scrambled low-frequency images. Forget-Me-Not [303] uses Attention Resteering to minimize intermediate attention maps related to the target concept. Geom-Erasing [304] erases implicit concepts like watermarks by applying a geometric-driven control method and introduces the Implicit Concept Dataset. SepME [305] advances multiple concept erasure and restoration. Fuchi et al. [489] proposed few-shot unlearning by targeting the text encoder rather than the image encoder or diffusion model. CCRT [313] proposes a method for continuous removal of diverse concepts from diffusion models.

无锚点擦除是一种非目标性的微调方法,它通过减少生成目标概念的概率而不与特定安全概念对齐来实现。ESD [292] 将无分类器引导修改为负向引导的噪声预测,以最小化目标概念的生成概率(例如“梵高”)。SDD [294] 通过使用无条件预测和指数移动平均 (EMA) 来解决 ESD 负向引导的额外影响,避免灾难性遗忘。DT [302] 通过训练模型对低频图像进行去噪来擦除不安全概念。Forget-Me-Not [303] 使用注意力重定向来最小化与目标概念相关的中间注意力图。Geom-Erasing [304] 通过应用几何驱动的控制方法擦除像水印这样的隐含概念,并引入了隐含概念数据集。SepME [305] 推进了多概念擦除和恢复。Fuchi 等人 [489] 提出了少样本遗忘方法,针对文本编码器而不是图像编码器或扩散模型。CCRT [313] 提出了一种从扩散模型中连续移除多种概念的方法。

Adversarial Erasing enhances previous methods by introducing perturbations to the target concept's text embedding and using adversarial training to improve robustness. Receler [299] employs a lightweight eraser and adversarial prompt embeddings, iterative ly training against each other, while applying a binary mask from U-Net attention maps to target only the concept regions. AdvUnlearn [301] shifts adversarial attacks to the text encoder, targeting the embedding space and using regular iz ation to preserve normal generation. RACE [300] improves efficiency by conducting adversarial attacks at a single timestep, reducing computational complexity. These methods enhance the model's resistance to adversarial prompts aimed at regenerating erased concepts.

对抗性擦除通过向目标概念的文本嵌入引入扰动,并利用对抗训练提高鲁棒性,从而增强了现有方法。Receler [299] 采用轻量级的擦除器和对抗性提示嵌入,通过迭代相互训练,同时应用从 U-Net 注意力图中提取的二进制掩码,仅针对概念区域。AdvUnlearn [301] 将对抗性攻击转移到文本编码器,针对嵌入空间,并使用正则化来保持正常生成。RACE [300] 通过在单一时间步进行对抗性攻击,提高了效率并降低了计算复杂度。这些方法增强了模型对旨在重新生成被擦除概念的对抗性提示的抵抗力。

6.3.1.2 Close-form Solution Methods

6.3.1.2 闭式解法

These methods offer an efficient alternative to fine-tuning-based erasure, focusing on localized updates in cross-attention layers to erase target concepts, inspired by model editing in LLMs [490]. Unlike fine-tuning, which aligns denoising predictions, these methods align cross-attention values. TIME [308] applies a closed-form solution to debias models, while UCE [307] extends this to multiple erasure targets, preserving surrounding concepts to reduce interference. MACE [306] refines cross-attention updates with LoRA and Grounded-SAM [408], [491] for region-specific erasure. A recent challenge is that erased concepts can still be generated via sub-concepts or synonyms [310]. RealEra [310] tackles this by mining associated concepts and adding perturbations to the embedding, expanding the erasure range with beyondconcept regular iz ation. RECE [309] addresses insufficient erasure by continually finding new concept embeddings during fine-tuning and applying closed-form solutions for further erasure.

这些方法提供了一种基于微调的概念擦除的高效替代方案,专注于在交叉注意力层中进行局部更新以擦除目标概念,灵感来自于大语言模型中的模型编辑 [490]。与微调不同,这些方法对交叉注意力值进行对齐。TIME [308] 采用闭式解来消除模型的偏差,而 UCE [307] 将其扩展到多个擦除目标,保留周围概念以减少干扰。MACE [306] 使用 LoRA 和 Grounded-SAM [408], [491] 来优化交叉注意力更新,实现特定区域的擦除。最近的一个挑战是,擦除的概念仍可能通过子概念或同义词生成 [310]。RealEra [310] 通过挖掘相关概念并在嵌入中添加扰动来解决这一问题,使用超概念正则化来扩展擦除范围。RECE [309] 通过在微调期间不断寻找新的概念嵌入并应用闭式解进行进一步擦除,解决了擦除不充分的问题。

6.3.1.3 Pruning-based Methods

6.3.1.3 基于剪枝的方法

These methods erase target concepts by identifying and removing neurons strongly associated with the target, selectively disabling them without updating model weights. Concept Prune calculates a Wanda score using target and reference prompts to measure each neuron's contribution, pruning those most associated with the target concept. Similarly, another approach [312] identifies concept-correlated neurons using adversarial prompts to enhance the robustness of existing erasure methods.

这些方法通过识别和移除与目标概念强相关的神经元来擦除目标概念,选择性地禁用这些神经元而不更新模型权重。Concept Prune 使用目标和参考提示计算 Wanda 分数,以衡量每个神经元的贡献,并剪枝与目标概念最相关的神经元。类似地,另一种方法 [312] 使用对抗性提示来识别与概念相关的神经元,以增强现有擦除方法的鲁棒性。

6.3.2Inference Guidance

6.3.2 推理引导

Inference guidance methods steer pre-trained diffusion models to generate safe images by incorporating additional auxiliary information and specific guidance during the inference process.

推理引导方法通过在推理过程中引入额外的辅助信息和特定引导,来指导预训练的扩散模型生成安全的图像。

6.3.2.1 Input Guidance

6.3.2.1 输入引导

This type of guidance use additional input text to steer the model toward safe content. SLD [314] adjusts noise predictions during inference based on a text condition and unsafe concepts, guiding generation towards the intended prompt while avoiding unsafe content, without requiring fine-tuning. It also introduces the I2P benchmark, a dataset for testing inappropriate content generation.

这种引导方法使用额外的输入文本来引导模型生成安全内容。SLD [314] 在推理过程中根据文本条件和不安全概念调整噪声预测,引导生成内容符合预期提示,同时避免不安全内容,而无需进行微调。它还引入了 I2P 基准,这是一个用于测试不当内容生成的数据集。

6.3.2.2 Input & Output Guidance

6.3.2.2 输入与输出指导

This type of methods prevent harmful inputs and control NSFW outputs. Ethical-Lens [315] employs a plug-and-play framework, using an LLM for input text revision (Ethical Text Scrutiny) and a multi-headed CLIP classifier for output image modification (Ethical Image Scrutiny), ensuring alignment with societal values without retraining or internal changes.

这类方法防止有害输入并控制NSFW输出。Ethical-Lens [315] 采用即插即用框架,使用大语言模型进行输入文本修订(Ethical Text Scrutiny),并使用多头CLIP分类器进行输出图像修改(Ethical Image Scrutiny),确保与社会价值观一致,而无需重新训练或内部更改。

6.3.2.3 Latent space Guidance

6.3.2.3 潜在空间引导

This approach uses additional implicit representations in the latent space to guide generation. SDIDLD [316] employs selfsupervised learning to identify the opposite latent direction of inappropriate concepts (e.g., "anti-sexual") and adds these vectors at the bottleneck layer, preventing harmful content generation.

该方法在潜在空间中使用额外的隐式表示来引导生成。SDIDLD [316] 采用自监督学习来识别不适当概念(例如“反性”)的相反潜在方向,并在瓶颈层添加这些向量,从而防止有害内容的生成。

6.4 Backdoor Attacks

6.4 后门攻击

Backdoor attacks on diffusion models allow adversaries to manipulate generated content by injecting backdoor triggers during training. These "malicious triggers" are embedded in model components, and during generation, inputs with triggers (e.g., prompts or initial noise) guide the model to produce predefined content. The key challenge is enhancing attack success rates while keeping the trigger covert and preserving the model's original utility. Existing attacks can be categorized into training manipulation and data poisoning methods.

扩散模型的后门攻击允许攻击者在训练过程中注入后门触发器来操纵生成内容。这些“恶意触发器”被嵌入到模型组件中,在生成过程中,带有触发器的输入(例如,提示或初始噪声)会引导模型生成预定义的内容。关键挑战在于提高攻击成功率的同时保持触发器的隐蔽性并保留模型的原始功能。现有的攻击可分为训练操纵和数据投毒方法。

6.4.1 Training Manipulation

6.4.1 训练操纵

This type of attack typically assumes the attacker aims to release a backdoored diffusion model, granting control over the training or even inference processes. Existing attacks focus on the visual modality, inserting backdoors by using image pairs with triggers and target images (image-image pair injection), typically targeting unconditional diffusion models.

此类攻击通常假设攻击者旨在发布一个带后门的扩散模型,从而控制训练甚至推理过程。现有攻击主要针对视觉模态,通过使用带有触发器和目标图像的图像对(图像-图像对注入)来插入后门,通常针对无条件扩散模型。

Bad Diffusion [317] presents the first backdoor attack on T2I diffusion models, which modifies the forward noise-addition and backward denoising processes to map backdoor target distributions to image triggers while maintaining DDPM sampling. Villan Diffusion [318] extends this to conditional models, adding prompt-based triggers and textual triggers for tasks like textto-image generation. TrojDiff [319] advances the research by controlling both training and inference, incorporating Trojan noise into sampling for diverse attack objectives. IBA [320] introduces invisible trigger backdoors using bi-level optimization to create covert perturbations that evade detection. DIFF2 [321] proposes a backdoor attack in adversarial purification, optimizing triggers to mislead class if i ers and extending it to data poisoning by injecting backdoors directly.

Bad Diffusion [317] 提出了首个针对T2I扩散模型的后门攻击,通过修改前向噪声添加和后向去噪过程,将后门目标分布映射到图像触发器,同时保持DDPM采样。Villan Diffusion [318] 将其扩展到条件模型,添加了基于提示的触发器和文本触发器,用于文本到图像生成等任务。TrojDiff [319] 通过控制训练和推理,将特洛伊噪声引入采样过程中,以实现多样化的攻击目标,进一步推进了研究。IBA [320] 引入了不可见触发器后门,使用双层优化创建隐蔽扰动以逃避检测。DIFF2 [321] 提出了对抗净化中的后门攻击,优化触发器以误导分类器,并通过直接注入后门将其扩展到数据中毒。

6.4.2 Data Poisoning

6.4.2 数据投毒 (Data Poisoning)

Unlike training manipulation, data poisoning methods do not directly interfere with the training process, restricting the attack to inserting poisoned samples into the dataset. These attacks typically target conditional diffusion models and explore two types of textual triggers: text-text pair and text-image pair.

与训练操纵不同,数据中毒方法不直接干扰训练过程,而是将攻击限制在向数据集中插入中毒样本。这些攻击通常针对条件扩散模型,并探索两种类型的文本触发器:文本-文本对和文本-图像对。

Text-text Pair Triggers consist of triggered prompts and their corresponding target prompts. RA [322] adopts this approach to inject backdoors into the text encoder by adding a covert trigger character, mapping the original to the target prompt while preserving encoder functionality through utility loss optimization. The backdoored encoder generates embeddings with predefined semantics, guiding the diffusion model's output. This lightweight attack requires no interaction with other model components. Several studies [322], [325]-[327] have also explored this approach.

文本-文本对触发器由触发提示及其对应的目标提示组成。RA [322] 采用这种方法,通过添加隐蔽的触发字符将后门注入文本编码器,将原始提示映射到目标提示,同时通过效用损失优化保持编码器功能。被注入后门的编码器生成具有预定义语义的嵌入,引导扩散模型的输出。这种轻量级攻击无需与其他模型组件交互。多项研究 [322]、[325]-[327] 也探索了这种方法。

Text-image Pair Triggers consist of triggered prompts paired with target images. BadT21 [323] explores backdoors based on pixel, object, and style changes, where a special trigger (e.g., $\mathbf{w}_{\textrm{[T]}},\mathbf{\prime\prime})$ induces the model to generate images with specific patches, replaced objects, or styles. To reduce the data cost, ZeroDay [326], [327] uses personalized fine-tuning, injecting triggerimage pairs for more efficient backdoors. FTHCW [324] embeds target patterns into images from different classes, forming textimage pairs to generate diverse outputs. IBT [329] uses twoword triggers that activate the backdoor only when both words appear together, enhancing stealth in ess. In commercial settings, BAGM [325] manipulates user sentiment by mapping broad terms (e.g., “drinks") to specific brands (e.g., “Coca Cola"). SBD [328] employs backdoors for copyright infringement, bypassing filters by decomposing and reassembling copyrighted content using textimagepairs.

文本-图像对触发器由触发的提示词与目标图像配对组成。BadT21 [323] 探索了基于像素、对象和风格变化的后门,其中特殊触发器(例如,$\mathbf{w}_{\textrm{[T]}},\mathbf{\prime\prime}$)诱导模型生成带有特定补丁、替换对象或风格的图像。为了降低数据成本,ZeroDay [326], [327] 使用个性化微调,注入触发器-图像对以实现更高效的后门。FTHCW [324] 将目标模式嵌入到不同类别的图像中,形成文本-图像对以生成多样化的输出。IBT [329] 使用双词触发器,仅在两个词同时出现时激活后门,增强了隐蔽性。在商业场景中,BAGM [325] 通过将广泛术语(例如,“饮料”)映射到特定品牌(例如,“可口可乐”)来操纵用户情感。SBD [328] 利用后门进行版权侵权,通过使用文本-图像对分解和重新组装受版权保护的内容来绕过过滤器。

6.5Backdoor Defenses

6.5 后门防御

Backdoor defenses for diffusion models is an emerging area of research. Current approaches generally follow a three-step pipeline: 1) trigger inversion, 2) trigger validation or backdoor detection, and 3) backdoor removal. Some works propose complete frameworks, while others focus on individual steps.

扩散模型的后门防御是一个新兴的研究领域。当前的方法通常遵循三个步骤:1) 触发反转,2) 触发验证或后门检测,3) 后门移除。一些工作提出了完整的框架,而另一些则专注于单个步骤。

6.5.1Backdoor Detection

6.5.1 后门检测

Most early research focuses on detecting or validating backdoor triggers. T2IShield [330] is the first backdoor detection and mitigation framework for diffusion models, leveraging the assimilation phenomenon in cross-attention maps, where a trigger suppresses other tokens to generate specific content. Ufid [331] validates triggers by noting that clean generations are sensitive to small perturbations, while backdoor-triggered outputs are more robust. DisDet [332] proposes a low-cost detection method that distinguishes poisoned input noise from clean Gaussian noise by identifying distribution shifts.

大多数早期研究集中在检测或验证后门触发器。T2IShield [330] 是首个针对扩散模型的后门检测和缓解框架,利用交叉注意力图中的同化现象,其中触发器会抑制其他Token以生成特定内容。Ufid [331] 通过观察干净生成对小扰动敏感,而后门触发的输出更加稳健,从而验证触发器。DisDet [332] 提出了一种低成本的检测方法,通过识别分布变化来区分被污染输入噪声与干净的高斯噪声。

6.5.2Backdoor Removal

6.5.2 后门移除

While trigger validation confirms the presence of a backdoor trigger, the identified triggers must still be removed from the victim model. Most backdoor removal methods first invert the trigger and then eliminate the backdoor using the inverted trigger. Elijah [333] introduces a backdoor removal framework for diffusion models, inverting triggers through distribution shifts and aligning the backdoor's distribution with the clean one. DiffCleanse [334] formulates trigger inversion as an optimization problem with similarity and entropy loss, followed by pruning channels critical to backdoor sampling. TERD [335] proposes a unified reverse loss for trigger inversion, using a two-stage process for coarse and refined inversion. Pure Diffusion [336] employs multi-timestep trigger inversion, leveraging the consistent distribution shift caused by backdoored forward processes.

虽然触发器验证确认了后门触发器的存在,但已识别的触发器仍必须从受害模型中移除。大多数后门移除方法首先反转触发器,然后使用反转后的触发器消除后门。Elijah [333] 提出了一个用于扩散模型的后门移除框架,通过分布偏移反转触发器,并将后门的分布与干净分布对齐。DiffCleanse [334] 将触发器反演公式化为一个带有相似性和熵损失的优化问题,随后剪枝对后门采样至关重要的通道。TERD [335] 提出了一个用于触发器反演的统一反向损失,使用两阶段过程进行粗略和精细反演。Pure Diffusion [336] 采用多时间步触发器反演,利用后门前向过程引起的一致分布偏移。

Privacy attacks on diffusion models can be classified into membership inference, data extraction, and model extraction attacks. As attack sophistication increases, each type poses a growing threat to privacy.

扩散模型的隐私攻击可分为成员推断、数据提取和模型提取攻击。随着攻击技术的复杂化,每种类型对隐私的威胁都在不断增加。

6.6 Membership Inference Attacks

6.6 成员推理攻击

Membership inference attacks on diffusion models aim to infer sensitive data by exploiting their generative capabilities. Attackers use techniques like reconstruction error, shadow models, auxiliary data, likelihood, gradient, or structural similarity metrics. These attacks can be classified into six types: reconstruction error-based, auxiliary dataset-based, loss-based, gradient-based, structural similarity-based, and likelihood-based.

针对扩散模型的成员推断攻击旨在通过利用其生成能力来推断敏感数据。攻击者使用重建误差、影子模型、辅助数据、似然、梯度或结构相似性度量等技术。这些攻击可分为六类:基于重建误差、基于辅助数据集、基于损失、基于梯度、基于结构相似性和基于似然。

Reconstruction Error-based Attacks infer the membership of candidate samples by analyzing their reconstruction errors in the diffusion model. Wu et al. [347] proposed to determine membership in text-conditional diffusion models by comparing the reconstruction error between the candidate and generated images, and their semantic alignment with the text prompt. Inspired by GAN-leaks [492], Matsumoto et al. [339] introduced Diffusionleaks, which generates multiple candidate images and infers membership based on minimal reconstruction errors. Li et al. [349] proposed to average multiple reconstructions to reduce errors and improve inference accuracy, utilizing black-box APIs to modify candidate images. DRC [350] degrades and restores images using the diffusion model, comparing the restored images to the originals to infer membership and sensitive features.

基于重建误差的攻击通过分析候选样本在扩散模型中的重建误差来推断其成员身份。Wu 等人 [347] 提出通过比较候选图像与生成图像之间的重建误差及其与文本提示的语义对齐来确定文本条件扩散模型中的成员身份。受 GAN-leaks [492] 启发,Matsumoto 等人 [339] 提出了 Diffusionleaks,该方法生成多个候选图像并根据最小重建误差推断成员身份。Li 等人 [349] 提出通过平均多个重建结果来减少误差并提高推理准确性,利用黑盒 API 修改候选图像。DRC [350] 使用扩散模型对图像进行降质和恢复,比较恢复后的图像与原始图像以推断成员身份和敏感特征。

Auxiliary Datasets-based Attacks use auxiliary datasets to train shadow models, enabling black-box membership inference by simulating the target model. Pang et al. [348] targeted finetuned conditional diffusion models, computing similarity scores between query images and generated images to train a binary classifier for membership inference. GMIA [351] introduces the first generalized membership inference attack for generative models, using only generated distributions and auxiliary non-member datasets, assuming the generated distribution approximates the original training distribution.

基于辅助数据集的攻击利用辅助数据集训练影子模型,通过模拟目标模型实现黑盒成员推断。Pang 等人 [348] 针对微调的条件扩散模型,计算查询图像与生成图像之间的相似度分数,以训练二分类器进行成员推断。GMIA [351] 提出了生成模型的首次广义成员推断攻击,仅使用生成分布和辅助非成员数据集,假设生成分布近似于原始训练分布。

TABLE 10: A summary of attacks and defenses for Diffusion Models (Part II).

表 10: 扩散模型攻击与防御总结(第二部分)

Loss-based Attacks exploit loss value distributions to distinguish member from non-member samples, assuming lower losses for member (training) samples. [340] and [339] used loss values at different timesteps for membership inference. These two attacks can be viewed as Static Loss Attack (SLA), as they ignore the diffusion process. Dubinski et al. [337] modified the diffusion process to extract loss information from multiple perspectives, improving inference accuracy.

基于损失值的攻击利用损失值分布来区分成员样本和非成员样本,假设成员(训练)样本的损失值较低。[340] 和 [339] 使用了不同时间步的损失值进行成员推断。这两种攻击可以视为静态损失攻击(SLA),因为它们忽略了扩散过程。Dubinski 等人 [337] 修改了扩散过程,从多个角度提取损失信息,提高了推断的准确性。

Gradient-based Attacks leverage gradient information for membership inference. For instance, GSA [338] infers a sample is a member if its gradients significantly differ from surrounding samples, indicating a stronger influence on the model's training.

基于梯度的攻击利用梯度信息进行成员推断。例如,GSA [338] 推断一个样本是成员,如果其梯度与周围样本显著不同,表明对模型训练有更强的影响。

Structural Similarity-based Attacks compare structural fe atures or similarity metrics between candidate samples and model outputs. SMIA [346] uses the Structure Similarity Index Measure (SSIM) [493] metric to assess how well an image's structure is preserved during diffusion, with the average SSIM difference between members and non-members used to infer membership.

基于结构相似性的攻击通过比较候选样本与模型输出的结构特征或相似性度量来进行。SMIA [346] 使用结构相似性指数(Structure Similarity Index Measure, SSIM)[493] 来衡量图像在扩散过程中结构的保留程度,并通过成员与非成员之间的平均 SSIM 差异来推断成员关系。

Likelihood-based Attacks use posterior or conditional likelihoods to infer membership. SecMI [341] estimates posterior likelihoods via reverse processes to target DDPM and Stable Diffusion models. QRMI [342] applies quantile regression to posterior likelihoods. SIA [494] infers membership based on noise parameter differences in the reverse diffusion process. PIA [343] uses diffusion model properties to infer membership with fewer queries. PFAMI [344] analyzes fu c tu at ions between target samples and neighbors, exploiting memorization in generative models. Zhai et al. [345] use discrepancies in conditional likelihoods due to over fitting for membership inference.

基于似然的攻击使用后验或条件似然来推断成员身份。SecMI [341] 通过反向过程估计后验似然来针对 DDPM 和 Stable Diffusion 模型。QRMI [342] 将分位数回归应用于后验似然。SIA [494] 基于反向扩散过程中的噪声参数差异推断成员身份。PIA [343] 使用扩散模型属性以较少的查询推断成员身份。PFAMI [344] 分析目标样本与邻近样本之间的波动,利用生成模型中的记忆。Zhai 等人 [345] 使用由于过拟合导致的条件似然差异进行成员推断。

6.7Data Extraction Attacks

6.7 数据提取攻击

Data extraction attacks aim to reverse-engineer training data or attributes from a trained model, exploiting diffusion models' generative capabilities. Their effectiveness depends on the model's ability to memorize specific attributes [495]-[498]. These attacks can be classified into two main approaches based on the type of condition used: explicit condition-based extraction and surrogate condition-based extraction.

数据提取攻击旨在从训练好的模型中逆向工程出训练数据或属性,利用扩散模型的生成能力。其有效性取决于模型记忆特定属性的能力[495]-[498]。根据所使用的条件类型,这些攻击可以分为两种主要方法:基于显式条件的提取和基于替代条件的提取。

Explicit Condition-based Extraction leverages conditional information in T2I diffusion models to extract memorized training samples. Attackers use specific text prompts to generate images similar to training data. For example, [352] introduced bruteforce data extraction (BruteDE), generating images with targeted prompts and using membership inference to identify matches. This method is slow. One Step Extraction (OsE) [353] exploits "template verbatims," where models regenerate training samples, using metrics like denoising confidence score (DCS) and edge consistency score (ECs) for faster extraction.

基于显式条件的提取利用 T2I 扩散模型中的条件信息来提取记忆的训练样本。攻击者使用特定的文本提示生成与训练数据相似的图像。例如,[352] 引入了暴力数据提取 (BruteDE),通过目标提示生成图像,并使用成员推断来识别匹配项。这种方法速度较慢。一步提取 (OsE) [353] 利用“模板逐字”,模型重新生成训练样本,使用去噪置信度分数 (DCS) 和边缘一致性分数 (ECs) 等指标来加快提取速度。

Surrogate Condition-based Extraction creates surrogate conditions to enable data extraction from unconditional diffusion models. SIDE [354] uses implicit labels from class if i ers or feature extractors as surrogate conditions.FineXtract[355]uses finetuned models as surrogate conditions to guide extraction in latent space regions tied to fine-tuning data.

基于代理条件的提取通过创建代理条件,实现从无条件扩散模型中提取数据。SIDE [354] 使用分类器或特征提取器的隐式标签作为代理条件。FineXtract [355] 使用微调后的模型作为代理条件,以引导在与微调数据相关的潜在空间区域中进行数据提取。

6.8Model Extraction Attacks

6.8 模型提取攻击

Model extraction aims to steal a trained diffusion model's internal parameters or architecture. The only known method for model extraction on diffusion models is Spectral DeTuning (SDeT) [356]. SDeT leverages Low-Rank Adaptation (LoRA)[499] to extract pre-fine-tuning weights of generative models fine-tuned with LoRA. By collecting multiple fine-tuned models from the same pretrained model, it formulates an optimization problem to minimize the difference between fine-tuned weights and the sum of original weights and adaptation matrices under a low-rank constraint, solved iterative ly using Singular Value Decomposition (SVD) [500]. SDeT effectively recovers original weights for models like Stable Diffusion and Mistral-7B [501], highlighting vulnerabilities in fine-tuning processes with low-rank adaptations.

模型提取旨在窃取训练好的扩散模型的内部参数或架构。目前唯一已知的扩散模型提取方法是 Spectral DeTuning (SDeT) [356]。SDeT 利用低秩适应 (Low-Rank Adaptation, LoRA) [499] 来提取使用 LoRA 进行微调的生成模型的预微调权重。通过从同一预训练模型中收集多个微调模型,它构建了一个优化问题,以最小化微调权重与在低秩约束下的原始权重和适应矩阵之和之间的差异,并使用奇异值分解 (Singular Value Decomposition, SVD) [500] 迭代求解。SDeT 有效恢复了 Stable Diffusion 和 Mistral-7B [501] 等模型的原始权重,突出了使用低秩适应进行微调过程中的漏洞。

6.9 Intellectual Property Protection

6.9 知识产权保护

Intellectual property protection for AI is an emerging research area that uses techniques like adversarial attacks and watermarking to safeguard the intellectual property of natural (training or test) data, generated data, and trained models. These methods generally assume full access to the protected object. The following sections categorize these approaches into natural data protection, generated data protection, and model protection.

AI的知识产权保护是一个新兴的研究领域,它使用对抗攻击和水印等技术来保护自然(训练或测试)数据、生成数据和训练模型的知识产权。这些方法通常假设对保护对象有完全访问权限。以下部分将这些方法分为自然数据保护、生成数据保护和模型保护。

6.9.1Natural Data Protection

6.9.1 自然数据保护

Natural data protection methods focus on preprocessing data during training or inference to safeguard the copyright of naturally collected data, as opposed to generated data. In this context, data owners defend against model owners accessing the data. Existing natural data protection methods for T2l diffusion models aim to protect image intellectual property while minimizing quality loss. They can be categorized into learning prevention, editing prevention, and data attribution methods based on specific goals.

自然数据保护方法侧重于在训练或推理过程中预处理数据,以保护自然收集数据的版权,而非生成数据。在这种情况下,数据所有者防御模型所有者访问数据。现有的 T2l 扩散模型的自然数据保护方法旨在保护图像知识产权,同时最小化质量损失。它们可以根据具体目标分为学习预防、编辑预防和数据归属方法。

Learning Prevention methods prevent T2I models from learning useful features from training images using techniques like adversarial attacks. DUAW [357] protects copyrighted images by disrupting the variation al auto encoder (VAE) in Stable Diffusion models, optimizing universal adversarial perturbations on surrogate images to distort outputs. AdvDM [358] protects artwork copyrights by generating adversarial examples to prevent diffusion models from imitating artistic styles. Anti-DreamBooth [359] defends against malicious fine-tuning by injecting adversarial noise into user images to block the model from learning personalized features. MetaCloak [360] enhances image resistance to transformations (flipping, cropping, compression) by using surrogate diffusion models to craft transferable perturbations and a denoising-error maximization loss for better robustness. InMakr [361] embeds protective watermarks on critical pixels to safeguard personal semantics even if images are modified. SimAC [362] improves protection by optimizing timestep intervals and introducing a feature interference loss, leveraging early diffusion steps and high-frequency information from deeper layers.

学习预防方法通过使用对抗性攻击等技术,防止 T2I 模型从训练图像中学习有用的特征。DUAW [357] 通过破坏 Stable Diffusion 模型中的变分自编码器 (VAE) 来保护受版权保护的图像,优化代理图像上的通用对抗性扰动以扭曲输出。AdvDM [358] 通过生成对抗样本来保护艺术品版权,防止扩散模型模仿艺术风格。Anti-DreamBooth [359] 通过向用户图像注入对抗性噪声来防御恶意微调,阻止模型学习个性化特征。MetaCloak [360] 通过使用代理扩散模型来制作可转移的扰动,并利用去噪误差最大化损失来提高图像的抗变换能力(翻转、裁剪、压缩)。InMakr [361] 在关键像素上嵌入保护性水印,即使图像被修改也能保护个人语义。SimAC [362] 通过优化时间步长间隔和引入特征干扰损失,利用早期扩散步骤和深层的高频信息来增强保护效果。

Editing Prevention aims to prevent diffusion model-based image tampering and deepfake generation. Existing methods either embed watermarks or use adversarial noise to disrupt the editing process. EditGuard [363] introduces a proactive forensics framework to embed exclusive watermarks into images, making them resistant to various diffusion model-based editing techniques, including foreground or background removal, filling, tampering, and face swapping. WaDiff [364] adds a unique watermark to each user query, enabling trace ability of the generated image if ethical concerns arise. Adv Watermark[365] incorporates adversarial noise, producing visible signatures in the protected image when used by I2I models, which helps identify tampered content.

编辑预防旨在防止基于扩散模型的图像篡改和深度伪造生成。现有方法要么嵌入水印,要么使用对抗性噪声来干扰编辑过程。EditGuard [363] 引入了一种主动取证框架,将独家水印嵌入图像中,使其能够抵抗各种基于扩散模型的编辑技术,包括前景或背景移除、填充、篡改和面部替换。WaDiff [364] 为每个用户查询添加唯一水印,从而在出现伦理问题时能够追踪生成的图像。Adv Watermark [365] 结合了对抗性噪声,当I2I模型使用受保护的图像时,会产生可见的签名,这有助于识别被篡改的内容。

Data Attribution techniques identify if generated data originates from a specific dataset, often by embedding watermarks for later verification. Diagnosis [369] introduced a method for detecting unauthorized data usage by applying stealthy image warping effects to protected data. FT-SHIELD [366] uses alternating optimization and PGD [409] to embed watermarks, with a binary detector for verification. Diffusion Shield [367] encodes copyright messages into watermark patches, jointly optimizing the decoder and patches to ensure consistency across samples for reliable extraction. ProMark [368] introduces a proactive watermarking method for concept attribution, embedding watermarks in training data that can be extracted when similar concepts are generated by the model.

数据归属技术通过嵌入水印以进行后续验证,识别生成的数据是否源自特定数据集。Diagnosis [369] 提出了一种检测未经授权数据使用的方法,通过在被保护数据上应用隐蔽的图像扭曲效果来实现。FT-SHIELD [366] 使用交替优化和 PGD [409] 来嵌入水印,并通过二元检测器进行验证。Diffusion Shield [367] 将版权信息编码到水印补丁中,联合优化解码器和补丁,以确保样本间的一致性,从而实现可靠的提取。ProMark [368] 提出了一种主动水印方法用于概念归属,将水印嵌入训练数据中,当模型生成相似概念时能够提取出水印。

6.9.2Generated Data Protection

6.9.2 生成数据保护

With the rise of AI-generated content (AIGC), protecting the copyright of generated data has become increasingly important. Generated data protection seeks to answer, “Who created this content?" by embedding verifiable, unique watermarks into generated images to identify their creators (either the model or user). This ensures intellectual property protection and accountability for content publishers, while balancing the challenge of maintaining detection accuracy without compromising image quality.

随着生成式 AI (Generative AI) 内容的兴起,保护生成数据的版权变得越来越重要。生成数据保护试图通过将可验证的唯一水印嵌入生成的图像中来回答“谁创建了此内容?”的问题,以识别其创建者(模型或用户)。这确保了知识产权保护和内容发布者的责任,同时在不影响图像质量的情况下保持检测准确性的挑战。

HiDDeN [370] pioneers deep learning-based image watermarking, using an encoder to embed imperceptible watermarks and a decoder to recover them for detection. This approach can also watermark AI-generated images as a post-processing step. Recent protection methods primarily address the above challenge by embedding watermarks into images during the generation (reverse sampling) process of diffusion models. Stable Signature [371] embeds a binary signature into images generated by diffusion models through decoder fine-tuning, allowing the watermark to be recovered and validated using a pre-trained extractor and statistical test. LaWa [372] introduces a coarse-to-fine watermark embedding method within the latent diffusion model's decoder, employing multiple modules to insert the watermark at different upsampling stages using adversarial training. Safe-SD [373] proposes a framework for embedding a graphical watermark (e.g., QR code) into the imperceptible structure-related pixels of a Stable Diffusion model for high trace ability.

HiDDeN [370] 开创了基于深度学习的图像水印技术,通过编码器嵌入不可见的水印,并通过解码器恢复以进行检测。这种方法还可以作为后处理步骤对 AI 生成的图像进行水印处理。近期的保护方法主要通过在扩散模型的生成(反向采样)过程中嵌入水印来应对上述挑战。Stable Signature [371] 通过解码器微调将二进制签名嵌入到扩散模型生成的图像中,允许使用预训练的提取器和统计测试来恢复和验证水印。LaWa [372] 在潜在扩散模型的解码器中引入了从粗到细的水印嵌入方法,使用多个模块通过对抗训练在不同的上采样阶段插入水印。Safe-SD [373] 提出了一个框架,用于将图形水印(例如二维码)嵌入到 Stable Diffusion 模型的不可见结构相关像素中,以实现高可追溯性。

Recent studies highlight vulnerabilities in watermarking for AIGC. WEvade [502] bypasses watermark detection by adding subtle perturbations to watermarked images, exploiting watermark characteristics. TAIW [503] proposes a transfer attack using multiple surrogate watermarking models in a no-box setting, analyzing its theoretical transfer ability. Unlike per-image attacks, SSU [504] introduces a model-targeted attack to remove in-generation watermarks by fine-tuning the diffusion model's decoder with non-watermarked images, demonstrating the fragility of Stable Signature [371].

近期研究揭示了AIGC水印技术的脆弱性。WEvade [502] 通过在加水印的图像中添加微小的扰动,利用水印特征绕过水印检测。TAIW [503] 提出了一种在无盒设置下使用多个代理水印模型的迁移攻击,并分析了其理论迁移能力。与针对单张图像的攻击不同,SSU [504] 引入了一种针对模型的攻击方法,通过使用未加水印的图像对扩散模型的解码器进行微调,从而移除生成过程中的水印,展示了Stable Signature [371]的脆弱性。

6.9.3Model Protection

6.9.3 模型保护

Model protection techniques safeguard the intellectual property of released models, enabling owners to verify ownership and trace generated content back to its origin. These approaches are categorized based on their objectives into model watermark and model attribution.

模型保护技术保障了发布模型的知识产权,使所有者能够验证所有权并将生成的内容追溯到其来源。这些方法根据其目标分为模型水印和模型溯源。

Model Watermark injects a watermark trigger into the model, which can then be activated during inference to verify ownership. Zhao et al. [374] proposed separate watermarking schemes for unconditional/class-conditional and T2I diffusion models. For unconditional/class-conditional models, a pretrained watermark encoder embeds a binary string (e.g., "011001") into the training data, and the model is trained to generate images with a detectable watermark, verified by a pretrained decoder. For T2l models, a paired (text, image) trigger (e.g., " [V]" and a QR code) is used to trigger the generation of the QR code for ownership verification. FIXEDWM [375] enhances trigger stealth in ess by fixing its position in prompts, ensuring the watermarked image is generated only when the trigger is in the correct position. WDM [376] modifies the standard diffusion process into a Watermark Diffusion Process (WDP) to embed watermarks. During training, WDM learns from watermarked images using WDP, while normal images follow the standard diffusion process. During verification, Gaussian noises combined with the trigger can activate the generation of watermarked images.

模型水印将水印触发器注入模型,随后在推理过程中激活以验证所有权。Zhao 等人 [374] 提出了针对无条件/类别条件模型和文本到图像扩散模型的独立水印方案。对于无条件/类别条件模型,预训练的水印编码器将二进制字符串(例如 "011001")嵌入训练数据中,模型被训练生成带有可检测水印的图像,并通过预训练的解码器进行验证。对于文本到图像模型,使用成对的(文本,图像)触发器(例如 "[V]" 和二维码)来触发二维码的生成以进行所有权验证。FIXEDWM [375] 通过固定触发器在提示中的位置来增强触发器的隐蔽性,确保只有当触发器处于正确位置时才会生成带水印的图像。WDM [376] 将标准扩散过程修改为水印扩散过程(WDP)以嵌入水印。在训练过程中,WDM 使用 WDP 从带水印的图像中学习,而普通图像则遵循标准扩散过程。在验证过程中,结合触发器的高斯噪声可以激活带水印图像的生成。

Model Attribution also embeds watermarks into generated content to identify the model, similar to generated data protection methods in Section 6.9.2. The key difference is that model attribution focuses on model-wide watermarks, while generated data protection targets sample-specific watermarks. Tree-Ring [379] embeds a watermark into the Fourier space of the initial Gaussian noise used for T2I generation. During verification, denoising diffusion implicit model (DDIM) inversion extracts the initial noise, and comparison with the original watermark identifies the generating model. AquaLoRA [377] addresses the limitations of existing methods to white-box adaptive attacks, including TreeRing, by embedding a secret bit string into the model parameters to achieve white-box protection, preventing easy manipulation of the watermark by malicious users. Latent Tracer [378] identifies the origin model of generated samples by reverse-engineering their latent inputs, eliminating the need for artificial fingerprints or watermarks.

模型归因(Model Attribution)还将水印嵌入生成内容中以识别模型,类似于第6.9.2节中的生成数据保护方法。关键区别在于,模型归因侧重于模型范围内的水印,而生成数据保护则针对样本特定的水印。Tree-Ring [379] 将水印嵌入用于文本到图像(T2I)生成的初始高斯噪声的傅里叶空间中。在验证过程中,去噪扩散隐式模型(DDIM)反演提取初始噪声,并通过与原始水印的比较来识别生成模型。AquaLoRA [377] 解决了现有方法对白盒自适应攻击(包括Tree-Ring)的局限性,通过在模型参数中嵌入一个秘密比特串来实现白盒保护,防止恶意用户轻易操纵水印。Latent Tracer [378] 通过逆向工程生成样本的潜在输入来识别其来源模型,无需人工指纹或水印。

6.10 Datasets

6.10 数据集

This section reviews commonly used datasets for diffusion model safety research, as summarized in Tables 9 and 10. For adversarial attack and defense studies, captioned text-image pairs such as MS COCO [418], LAION [505], [506], and DiffusionDB [507] are often employed by conditional diffusion models. Datasets for category-image classification tasks, like ImageNet [508] and CIFAR10/100 [509], are typically used by unconditional diffusion models to evaluate attack effectiveness and output quality. In research on NSFW content in diffusion models, the I2P dataset [314] is widely used, alongside custom datasets such as NSFW200 [286], VBCDE-100 [290], Tox100/1K [315] and a human- attribute dataset[315] focused on bias research. For intellectual property protection, datasets like CelebA [510] and VGGFace2 [511] (facial datasets), DreamBooth [512] and Pokemon Captions [513] (object datasets), and WikiArt [514] (artistic style dataset) are commonly used.

本节回顾了扩散模型安全研究中常用的数据集,总结如表 9 和表 10 所示。对于对抗攻击和防御研究,条件扩散模型通常使用带有标题的文本-图像对,如 MS COCO [418]、LAION [505]、[506] 和 DiffusionDB [507]。类别-图像分类任务的数据集,如 ImageNet [508] 和 CIFAR10/100 [509],通常由无条件扩散模型用于评估攻击效果和输出质量。在扩散模型中关于 NSFW 内容的研究中,I2P 数据集 [314] 被广泛使用,同时还有自定义数据集,如 NSFW200 [286]、VBCDE-100 [290]、Tox100/1K [315] 以及一个专注于偏见研究的人类属性数据集 [315]。对于知识产权保护,常用的数据集包括 CelebA [510] 和 VGGFace2 [511](面部数据集)、DreamBooth [512] 和 Pokemon Captions [513](对象数据集)以及 WikiArt [514](艺术风格数据集)。

7 AGENT SAFETY

7 AI智能体安全

Large model powered agents are increasingly deployed across diverse applications, leveraging the capabilities of LLMs and VLMs to tackle complex problems, especially in safety-critical domains such as medical robotics and autonomous driving. Ensuring their safety is of paramount importance. This section provides a comprehensive review of existing safety research on agents, highlighting key challenges and emphasizing the need for ongoing innovation. Existing research can be broadly categorized into LLM agent safety and VLM agent safety.

大规模模型驱动的智能体正越来越多地部署在各种应用中,利用大语言模型和视觉语言模型的能力来解决复杂问题,尤其是在医疗机器人和自动驾驶等安全关键领域。确保它们的安全性至关重要。本节全面回顾了现有关于智能体安全的研究,突出了关键挑战,并强调了持续创新的必要性。现有研究大致可分为大语言模型智能体安全和视觉语言模型智能体安全。

7.1 LLM Agent Safety

7.1 大语言模型智能体安全性

This subsection reviews recent research on LLM agent safety from three key aspects: attack methodologies, defense mechanisms, and benchmarks.

本节从攻击方法、防御机制和基准测试三个关键方面回顾了大语言模型 (LLM) 智能体安全的最新研究。

7.1.1 Attacks

7.1.1 攻击

Understanding and mitigating the vulnerabilities of LLM agents is crucial for their safe and trustworthy deployment, particularly as they manage sensitive data and perform real-world actions. Existing attacks on LLM agents fall into three main categories: Prompt Injection Attacks, Backdoor Attacks, and Jailbreak Attacks.

理解和缓解大语言模型 (LLM) 智能体的漏洞对于其安全可靠的部署至关重要,尤其是在它们管理敏感数据并执行现实世界操作时。现有的针对大语言模型智能体的攻击主要分为三类:提示注入攻击 (Prompt Injection Attacks)、后门攻击 (Backdoor Attacks) 和越狱攻击 (Jailbreak Attacks)。

Prompt Injection Attacks manipulate an agent's behavior by embedding malicious instructions into its input prompts. For LLM agents, this often involves crafting prompts that exploit reasoning processes and interactions with external tools. A notable subtype is Indirect Prompt Injection (IPI), where malicious instructions are hidden in external content like web pages, emails, or documents retrieved by the agent. For example, InjecAgent [380] demonstrates how an embedded command in a product review can trigger unintended actions when processed by the agent. These attacks are especially concerning as they exploit reliance on external information without requiring access to the agent's core system. Breaking Agents [381] further investigates vulnerabilities across agent components, showcasing the widespread applicability of prompt injection attacks across diverse architectures.

提示注入攻击通过将恶意指令嵌入到输入提示中来操纵智能体的行为。对于大语言模型智能体,这通常涉及利用推理过程以及与外部工具交互的精心设计的提示。一个显著的子类型是间接提示注入 (IPI),其中恶意指令隐藏在外界内容中,如网页、电子邮件或智能体检索的文档。例如,InjecAgent [380] 展示了产品评论中嵌入的命令如何在智能体处理时触发意外操作。这些攻击尤其令人担忧,因为它们利用了对外部信息的依赖,而无需访问智能体的核心系统。Breaking Agents [381] 进一步研究了智能体组件中的漏洞,展示了提示注入攻击在不同架构中的广泛适用性。

Backdoor Attacks introduce hidden triggers into an agent's model or knowledge base, causing it to execute malicious actions under specific conditions. For LLM agents, these attacks can target training data, fine-tuning datasets, or long-term memory and knowledge bases. BadAgent [382] embeds backdoors by poisoning fine-tuning data. Contextual Backdoor Attacks [384] poison benign-looking contextual demonstrations to trigger malicious behavior in specific scenarios. Agent Poison [383] targets an agent's memory or knowledge base, ensuring malicious demonstrations are retrieved and executed under predefined triggers, even without further model training.

后门攻击在AI智能体的模型或知识库中引入隐藏的触发器,使其在特定条件下执行恶意行为。对于大语言模型智能体,这些攻击可以针对训练数据、微调数据集或长期记忆和知识库。BadAgent [382] 通过污染微调数据来嵌入后门。上下文后门攻击 [384] 污染看似无害的上下文演示,以在特定场景中触发恶意行为。Agent Poison [383] 针对智能体的记忆或知识库,确保在预定义的触发器下检索并执行恶意演示,即使没有进一步的模型训练。

Jailbreak Attacks bypass an agent's safety mechanisms and ethical guidelines, prompting it to perform restricted actions. This is often done by exploiting loopholes in prompt interpretation or manipulating the agent's internal state. PsySafe [385] examines psychological safety in multi-agent systems, showing how adversarial prompts can trigger dark personality traits, effectively overriding safety protocols and enabling harmful behaviors.

越狱攻击绕过AI智能体的安全机制和道德准则,促使其执行受限操作。这通常通过利用提示解释中的漏洞或操纵智能体的内部状态来实现。PsySafe [385] 研究了多智能体系统中的心理安全,展示了对抗性提示如何触发黑暗人格特质,从而有效覆盖安全协议并导致有害行为。

7.1.2 Defenses

7.1.2 防御措施

Existing defense mechanisms for LLM agents can be categorized into response filtering and knowledge-enabled reasoning.

现有的大语言模型 (LLM) 智能体防御机制可分为响应过滤和知识驱动的推理。

Response Filtering monitors and filters potentially harmful or undesirable outputs generated by LLM agents. Auto Defense [387] uses a multi-agent framework where agents assume distinct defensive roles, collaborative ly analyzing the target agent's responses. A consensus-based decision determines whether a response is allowed, enhancing robustness. TrustAgent [386] introduces an agent constitution with predefined safety rules, ensuring adherence during the planning phase. It employs a three-stage strategy: 1) pre-planning, which injects safety knowledge before plan generation; 2) in-planning, which enhances safety during generation; and 3) post-planning, which inspects outputs before execution.

响应过滤监控并过滤由大语言模型 (LLM) 生成的潜在有害或不合适的输出。Auto Defense [387] 采用多智能体框架,智能体承担不同的防御角色,协作分析目标智能体的响应。基于共识的决策机制决定是否允许某个响应,从而增强鲁棒性。TrustAgent [386] 引入了包含预定义安全规则的智能体宪法,确保在规划阶段遵守这些规则。它采用三阶段策略:1) 预规划,在生成计划前注入安全知识;2) 规划中,在生成过程中增强安全性;3) 规划后,在执行前检查输出。

Knowledge-Enabled Reasoning enhances LLM agent safety by integrating external knowledge or structured reasoning processes. GuardAgent [388] introduces a guard agent that oversees the target LLM agent. It generates an action plan based on guard requests, translates it into executable code using a knowledge base and function toolbox, and executes it to verify compliance with guard rules. This structured approach ensures more reliable and systematic defense.

知识驱动推理通过整合外部知识或结构化推理过程,增强了大语言模型智能体的安全性。GuardAgent [388] 引入了一个监督目标大语言模型智能体的守护智能体。它根据守护请求生成行动计划,使用知识库和函数工具箱将其转化为可执行代码,并执行该代码以验证是否符合守护规则。这种结构化方法确保了更加可靠和系统的防御。

7.1.3 Benchmarks

7.1.3 基准测试

7.2 VLM Agent Safety

7.2 VLM智能体安全性

Existing benchmarks evaluate agents’ ability to identify risks in interactive environments or defend against targeted attacks. RJudge [389] assesses agents’ risk awareness using 569 records across 27 scenarios and 10 risk types, testing their ability to identify safety risks in interaction logs. Safe Agent Bench [391] evaluates embodied agents’ safety awareness and planning skills with 750 tasks featuring varying hazards and abstraction levels in a simulated environment, measuring both task execution and semantic understanding. AgentDojo [390] tests agents’ resilience to prompt injection attacks through 97 tasks and 629 security test cases, focusing on handling malicious instructions embedded in third-party data. These benchmarks offer valuable insights into LLM agent safety by assessing their risk identification and defense capabilities across diverse scenarios.

现有基准评估AI智能体在交互环境中识别风险或抵御针对性攻击的能力。RJudge [389] 使用569条记录、涵盖27个场景和10种风险类型,评估AI智能体的风险意识,测试其在交互日志中识别安全风险的能力。Safe Agent Bench [391] 通过750项任务评估具身AI智能体的安全意识和规划能力,这些任务在模拟环境中呈现不同的危险和抽象级别,同时衡量任务执行和语义理解。AgentDojo [390] 通过97项任务和629个安全测试案例测试AI智能体对提示注入攻击的韧性,重点关注处理嵌入在第三方数据中的恶意指令。这些基准通过评估AI智能体在不同场景下的风险识别和防御能力,为大语言模型的安全性提供了宝贵的见解。

Involving both visual perception and language understanding, VLM agents face unique safety challenges. Current attacks on VLM agents mainly target their weaknesses through white-box and black-box attacks, as well as robustness analysis. Despite these attacks, effective defense strategies are still underdeveloped.

同时涉及视觉感知和语言理解,VLM智能体面临独特的安全挑战。当前对VLM智能体的攻击主要通过白盒攻击、黑盒攻击以及鲁棒性分析来针对其弱点。尽管存在这些攻击,有效的防御策略仍在开发中。

7.2.1 Attacks

7.2.1 攻击

White-box Attacks assume adversaries have full access to model parameters, enabling gradient-based optimization to expose theoretical vulnerabilities. Fu et al. [392] used gradient-based training and characterization to craft adversarial images, causing LLMs to execute tool commands using real-world syntax, compromising user confidentiality and integrity. Tan et al. [393] demonstrated how an VLM agent can jailbreak another agent in a multiagent society through malicious prompts generated with white-box access, enabling widespread harmful outputs.

白盒攻击假设攻击者能够完全访问模型参数,从而利用基于梯度的优化来暴露理论上的漏洞。Fu 等人 [392] 使用基于梯度的训练和特征提取来生成对抗性图像,导致大语言模型使用真实世界的语法执行工具命令,从而危及用户的机密性和完整性。Tan 等人 [393] 展示了在多智能体社会中,一个视觉语言模型智能体如何通过白盒访问生成的恶意提示来绕过另一个智能体的保护机制,从而产生广泛的恶意输出。

Black-box Attacks operate without access to model internals, relying on input-output behavior to craft adversarial examples, making them highly relevant for real-world scenarios. AgentSmith [394] uses a single adversarial image to jailbreak multiple VLM agents, exploiting their interconnected nature to spread malicious behavior exponentially across the network.

黑盒攻击 (Black-box Attacks) 在无法访问模型内部的情况下运行,依靠输入输出行为来制作对抗样本,这使得它们在现实场景中高度相关。AgentSmith [394] 使用单个对抗图像来越狱多个 VLM AI智能体,利用它们的互联性在网络中指数级传播恶意行为。

Robustness Analysis examines how system components interact and how vulnerabilities propagate. ARE [395] present an Agent Robustness Evaluation framework that models VLM agents as graphs, analyzing adversarial influence fow between components to identify weak points and assess the safety impact of component modifications.

鲁棒性分析研究了系统组件如何相互作用以及漏洞如何传播。ARE [395] 提出了一个AI智能体鲁棒性评估框架,该框架将VLM智能体建模为图,分析组件之间的对抗性影响流,以识别弱点并评估组件修改的安全性影响。

8 OPEN CHALLENGES

8 开放挑战

Based on the survey, we identify several limitations and gaps in existing research and summarize them into the following topics. These open challenges reflect the evolving nature of large model safety, highlighting both technical and methodological barriers that must be overcome to ensure robustness and reliability across various AI systems.

基于调查,我们识别出现有研究中的一些局限性和差距,并将其总结为以下几个主题。这些开放挑战反映了大模型安全性的不断演变性质,突显了必须克服的技术和方法论障碍,以确保各种AI系统的稳健性和可靠性。

8.1 Fundamental Vulnerabilities

8.1 基本漏洞

Exploring and understanding the fundamental vulnerabilities of large models is essential for developing robust defenses and safety frameworks. This section highlights the core weaknesses and challenges inherent to different types of large models.

探索和理解大模型的基本漏洞对于开发强大的防御和安全框架至关重要。本节重点介绍不同类型大模型固有的核心弱点和挑战。

8.1.1 The Purpose of Attack Is Not Just to Break the Model While much of the existing research focuses on designing attacks to disrupt or break a model's functionality, the true goal of attack research should extend beyond mere disruption. Attacks can serve as a diagnostic tool to uncover unintended behaviors and reveal fundamental weaknesses in a model's decision-making processes. By understanding how and why models fail, we can address vulnerabilities at their root rather than applying superficial fixes. For every new attack proposed, it is critical to ask: Why does the attack succeed or fail? How does it exploit the model? What new vulnerabilities does it reveal that were previously unknown? Are these vulnerabilities inherent to the model architecture or class? These questions guide the development of more robust models and defenses by exposing systemic flaws rather than isolated issues.

8.1.1 攻击的目的不仅仅是破坏模型

8.1.2What Are the Fundamental Vulnerabilities of Langu age Models?

8.1.2 语言模型的基本漏洞是什么?

LLMs like ChatGPT and Gemini exhibit fundamental vulnerabilities due to their reliance on statistical patterns rather than true semantic understanding [515]. Key weaknesses include susceptibility to adversarial inputs, biases in training data, and manipulation via prompt engineering. To build effective defenses, research must delve deeper into how these vulnerabilities arise from the model's architecture and training data.

像 ChatGPT 和 Gemini 这样的大语言模型由于依赖统计模式而非真正的语义理解而表现出根本性漏洞 [515]。主要弱点包括对对抗性输入的敏感性、训练数据中的偏见以及通过提示工程进行操纵。为了构建有效的防御措施,研究必须更深入地探讨这些漏洞如何从模型架构和训练数据中产生。

Critical areas of focus include: 1) Memorization of training data, which can lead to privacy breaches or unintended data leakage; 2) Exposure to harmful content, which can propagate biases or toxic outputs; 3) Amplification of hallucinations, where models generate plausible but incorrect or nonsensical information. Open research questions remain: Does the discrete nature of textual inputs make language models more or less robust compared to vision models? What fundamental vulnerabilities are exposed by jailbreak and data extraction attacks? Addressing these questions is vital for advancing the safety and reliability of LLMs and other large models.

重点关注领域包括:1) 训练数据的记忆,可能导致隐私泄露或意外数据泄漏;2) 有害内容的暴露,可能传播偏见或有害输出;3) 幻觉的放大,模型生成看似合理但错误或无意义的信息。开放的研究问题仍然存在:文本输入的离散性是否使语言模型与视觉模型相比更具或更不具鲁棒性?越狱和数据提取攻击暴露了哪些基本漏洞?解决这些问题对于提升大语言模型和其他大型模型的安全性和可靠性至关重要。

8.1.3 How Vulnerabilities Propagate Across Modalities?

8.1.3 漏洞如何跨模态传播?

As Multi-modal Large Language Models (MLLMs) integratediverse modalities, new vulnerabilities arise. Vision encoders are known to be sensitive to subtle, continuous perturbations in pixel space, while language models are vulnerable to adversarial characters, words, or prompts. However, the interaction between modalities and how vulnerabilities in one modality propagate to others remains poorly understood.

随着多模态大语言模型 (Multi-modal Large Language Models, MLLMs) 整合多种模态,新的漏洞也随之出现。视觉编码器已知对像素空间中的细微、连续扰动敏感,而语言模型则容易受到对抗性字符、词汇或提示的影响。然而,模态之间的相互作用以及一种模态中的漏洞如何传播到其他模态仍然知之甚少。

Key questions include: How do vulnerabilities in one modality (e.g., vision) influence the behavior of another (e.g., language)? How does the number of tokens across modalities affect vulnerability propagation? Additionally, it is critical to explore how to address multi-modal vulnerabilities within a unified framework, moving beyond defenses tailored to individual modalities. This requires a holistic approach to identify and mitigate cross-modal risks, ensuring robust performance across all integrated modalities.

关键问题包括:一种模态(例如视觉)中的漏洞如何影响另一种模态(例如语言)的行为?不同模态之间的Token数量如何影响漏洞传播?此外,探索如何在一个统一的框架内解决多模态漏洞至关重要,而不仅仅局限于针对单个模态的防御措施。这需要一种整体方法来识别和缓解跨模态风险,确保所有集成模态的稳健性能。

8.1.4Diffusion Models for Visual Content Generation Lack Language Capabilities

8.1.4 视觉内容生成的扩散模型缺乏语言能力

Diffusion models for image or video generation excel in visual content creation but often lack language understanding capabilities, a limitation shared by VLP models. This is because these models are primarily optimized for pixel-level generation tasks, without incorporating language processing into their core architecture. As a result, they may generate harmful or con textually inappropriate content due to their inability to fully comprehend textual prompts. To develop robust multi-modal systems, it is crucial to integrate language comprehension into these models. This would enable them to produce content that is not only visually coherent but also con textually aligned with the given textual input. An open challenge is bridging the gap between visual and linguistic capabilities in generative models to enhance multi-modal safety. However, this integration may introduce new vulnerabilities, such as advanced attacks that exploit finegrained manipulation of the generation process. Addressing these challenges represents a critical direction for future research.

图像或视频生成的扩散模型在视觉内容创作方面表现出色,但通常缺乏语言理解能力,这也是视觉-语言预训练模型 (VLP) 的共同局限。这是因为这些模型主要针对像素级生成任务进行优化,并未将语言处理融入其核心架构。因此,由于无法完全理解文本提示,它们可能会生成有害或上下文不恰当的内容。为了开发稳健的多模态系统,将语言理解整合到这些模型中至关重要。这将使它们不仅能够生成视觉连贯的内容,还能与给定的文本输入在上下文上保持一致。一个开放的挑战是在生成模型中弥合视觉与语言能力之间的差距,以增强多模态安全性。然而,这种整合可能会引入新的漏洞,例如利用生成过程的细粒度操作进行的高级攻击。解决这些挑战代表了未来研究的一个关键方向。

8.1.5How Much Training Data Can a Model Memorize?

8.1.5 模型能记住多少训练数据?

The memorization capability of deep neural networks (DNNs) has raised significant concerns, particularly in enabling privacy attacks such as membership inference and data extraction (model inversion). Both LLMs and DMs have been shown to replicate and leak small portions of their training data under specific conditions. However, it remains unclear whether DNNs primarily learn through memorization and to what extent this occurs. Due to the non-linear nature of large models, exact model inversion is inherently impossible. These models compress training data into multi-level representations, making it difficult to pinpoint when and how memorization happens.

深度神经网络 (DNN) 的记忆能力引发了重大关注,尤其是在支持隐私攻击(如成员推断和数据提取(模型反转))方面。大语言模型和扩散模型在特定条件下已被证明会复制并泄露其训练数据的一小部分。然而,目前尚不清楚 DNN 是否主要通过记忆来学习,以及这种情况发生的程度。由于大模型的非线性特性,精确的模型反转本质上是不可能的。这些模型将训练数据压缩为多层次表示,使得难以确定记忆何时以及如何发生。

Key open questions include: What mechanism acts as the memorization “switch", triggering the model to output training data directly? How can memorization be effectively measured—through exact matches, training equivalence, or embedding similarity? Addressing these questions is crucial for understanding the trade-offs between model performance and privacy risks, as well as for developing strategies to mitigate unintended data leakage.

关键开放性问题包括:什么机制充当了记忆“开关”,触发模型直接输出训练数据?如何有效地衡量记忆——通过精确匹配、训练等价性还是嵌入相似性?解决这些问题对于理解模型性能和隐私风险之间的权衡至关重要,同时也为开发减轻意外数据泄露的策略提供了基础。

8.1.6 Agent Vulnerabilities Grow with Their Abilities

8.1.6 AI智能体的漏洞随能力增长而增加

Large-model-powered agents face increasing vulnerabilities as their capabilities expand [394]. These agents interact with external tools, data sources, and environments, creating a broader attack surface that complicates defense mechanisms. A critical challenge is the compounding effect of vulnerabilities in foundational models when integrated into agents’ decision-making processes. For instance, an agent relying on a language model vulnerable to jailbreak prompts and a vision model susceptible to adversarial inputs may experience cascading failures, leading to unpredictable outcomes.

随着大模型驱动的AI智能体能力扩展,它们面临的漏洞风险也在增加 [394]。这些智能体与外部工具、数据源和环境交互,形成了更广泛的攻击面,增加了防御机制的复杂性。一个关键挑战是,当基础模型的漏洞被集成到智能体的决策过程中时,其影响会呈复合效应。例如,一个依赖易受越狱提示影响的语言模型和易受对抗性输入影响的视觉模型的智能体,可能会经历级联故障,导致不可预测的结果。

Moreover, agents’ ability to learn and adapt introduces additional risks. Even seemingly benign interactions can expose them to subtle biases or adversarial inputs, potentially triggering unsafe behaviors. The dynamic nature of agents—especially those that continuously learn or self-improve——further complicates vulnerability detection, as they may develop new weaknesses over time. This unpredictability makes traditional safety evaluations insufficient, as agents can evolve in ways that are difficult to anticipate.

此外,AI智能体的学习和适应能力带来了额外的风险。即使看似无害的互动也可能使它们暴露于微妙的偏见或对抗性输入,从而可能引发不安全的行为。AI智能体的动态特性——尤其是那些持续学习或自我改进的——进一步增加了漏洞检测的复杂性,因为它们可能随着时间的推移发展出新的弱点。这种不可预测性使得传统的安全评估变得不足,因为AI智能体可能会以难以预料的方式演变。

To address these challenges, research must focus on: under standing the interaction between model components (e.g., language, vision, and decision-making) and how vulnerabilities in one component can propagate to others. It is also important to develop new methodologies to evaluate agents in dynamic, evolving environments, ensuring their robustness against emerging threats. These efforts are essential for building safer and more reliable agent systems in the future.

为了解决这些挑战,研究必须专注于:理解模型组件(如语言、视觉和决策)之间的相互作用,以及一个组件的漏洞如何传播到其他组件。同时,开发新的方法论来评估AI智能体在动态、不断变化的环境中的表现,确保其对新出现的威胁具有鲁棒性,也是至关重要的。这些努力对于构建更安全、更可靠的AI智能体系统至关重要。

8.2Safety Evaluation

8.2 安全评估

Comprehensive and standardized safety evaluations are critical for quantifying the safety levels of large models. However, existing evaluation datasets and benchmarks are often static or narrowly focused on specific threats. To ensure models perform reliably in real-world conditions, safety evaluations must test them across diverse and unpredictable scenarios.

全面且标准化的安全评估对于量化大模型的安全水平至关重要。然而,现有的评估数据集和基准测试通常是静态的,或仅专注于特定威胁。为了确保模型在现实世界中的表现可靠,安全评估必须在多样且不可预测的场景中对它们进行测试。

8.2.1Attack Success Rate Is Not All We Need

8.2.1 攻击成功率并非唯一需求

While the attack success rate (ASR) is a commonly used metric in safety research, it mainly quantifies how often an attack disrupts a model's output. However, this metric overlooks several important factors, such as the severity of the disruption, the model's resilience to various types of attacks, and the real-world consequences of potential failures. A model could still cause harm or mislead decision-making even if its core functionality appears unaffected. For instance, an attack might subtly alter a model's decision-making process without causing an obvious malfunction, but the resulting behavior could have catastrophic effects in real-world applications. Such vulnerabilities are often missed by traditional metrics like ASR or failure rate.

虽然攻击成功率 (ASR) 是安全研究中常用的指标,但它主要量化了攻击如何频繁地破坏模型的输出。然而,这一指标忽略了一些重要因素,例如破坏的严重程度、模型对各种类型攻击的恢复能力,以及潜在故障在现实世界中的后果。即使模型的核心功能似乎未受影响,它仍可能造成伤害或误导决策。例如,攻击可能会微妙地改变模型的决策过程,而不会导致明显的故障,但由此产生的行为在现实应用中可能具有灾难性的影响。传统指标如 ASR 或故障率往往会忽略这些漏洞。

To better understand a model's weaknesses—whether in its design, training data, or inference process—it is crucial to define multi-level, fine-grained vulnerability metrics. A more comprehensive safety evaluation framework should consider factors such as the model's susceptibility to different types of attacks, its ability to recover from malicious inputs, and the ethical implications of potential failure modes.

为了更好地理解模型的弱点——无论是设计、训练数据还是推理过程中的问题——定义多层次、细粒度的脆弱性指标至关重要。一个更全面的安全评估框架应考虑以下因素:模型对不同类型攻击的易感性、从恶意输入中恢复的能力,以及潜在故障模式的伦理影响。

8.2.2 Static Evaluations Create a False Sense of Safety

8.2.2 静态评估制造了虚假的安全感

Current safety evaluations rely heavily on static benchmarks or open-source datasets that have been available for some time. These datasets have already been exposed to model trainers and adversaries, which means a model may achieve high safety performance on these outdated datasets without necessarily being robust in realworld scenarios. As a result, static evaluations can create a false sense of safety. This underscores a significant limitation in the current evaluation framework: static benchmarks fail to capture the evolving nature of threats that models encounter in dynamic, real-world applications.

当前的安全评估严重依赖静态基准测试或已存在一段时间的开源数据集。这些数据集已经暴露给模型训练者和对手,这意味着模型可能在这些过时的数据集上表现出较高的安全性能,但在现实场景中并不一定具备鲁棒性。因此,静态评估可能会产生一种虚假的安全感。这突显了当前评估框架中的一个显著局限性:静态基准测试无法捕捉模型在动态的现实应用中遇到的威胁的演变本质。

To address this challenge, safety evaluations must move beyond static assessments. A key step is to develop evaluation datasets or benchmarks that evolve over time, better reflecting the ever-changing landscape of safety threats. An example of an evolving evaluation system is the Chatbot Arena [516], which adapts as new threats and challenges emerge. Similar strategies could be applied to safety assessments in broader AI systems. Additionally, future evaluation methods might consider releasing only the “seeds" or structural components of datasets, along with a test case generation method, rather than providing static test cases. This approach would enable the continuous generation of new test cases that better reflect the evolving nature of threats.

为应对这一挑战,安全评估必须超越静态评估。关键的一步是开发随时间演进的评估数据集或基准,更好地反映不断变化的安全威胁格局。一个动态演进评估系统的例子是 Chatbot Arena [516],它随着新威胁和挑战的出现而不断调整。类似的策略可以应用于更广泛的 AI 系统的安全评估。此外,未来的评估方法可能会考虑仅发布数据集的“种子”或结构组件,以及测试用例生成方法,而不是提供静态测试用例。这种方法将能够持续生成新的测试用例,更好地反映威胁的动态演变。

8.2.3 Adversarial Evaluations Are a Necessity, Not an Option

8.2.3 对抗性评估是必需品,而非选项

While regular (non-adversarial) safety tests offer insights into a model's general robustness, they fail to capture the full spectrum of safety risks that models face in real-world scenarios. These tests typically focus on overall performance but neglect how models respond to adversarial queries that exploit their vulnerabilities. In contrast, adversarial evaluations assess model performance under attack, providing a more accurate measure of safety in worst-case scenarios.

虽然常规(非对抗性)安全测试能够提供模型整体鲁棒性的洞察,但它们无法全面捕捉模型在现实场景中所面临的安全风险。这些测试通常关注整体性能,却忽视了模型在面对利用其漏洞的对抗性查询时的反应。相比之下,对抗性评估则是在攻击下评估模型性能,为最坏情况下的安全性提供了更准确的衡量。

A key challenge in this area is developing environments that simulate real-world attack conditions. One promising approach is to frame safety evaluation as a two-player adversarial game, where reinforcement learning-based adversarial agents interact with target models to identify and exploit new vulnerabilities. This method offers a more dynamic and comprehensive way to assess model safety under attack. Such adversarial evaluations are especially crucial for commercial APIs, which often use filtering mechanisms to block malicious inputs. These filters can render traditional safety benchmarks less effective, as they prevent the model from facing the full range of adversarial threats it might encounter in real-world applications.

该领域的一个关键挑战是开发能够模拟真实世界攻击环境的方法。一种有前景的方法是将安全评估构建为双人对战游戏,其中基于强化学习的对抗智能体与目标模型交互,以识别并利用新的漏洞。这种方法提供了一种更动态、更全面的方式来评估模型在攻击下的安全性。这种对抗性评估对于商业 API 尤为重要,因为商业 API 通常使用过滤机制来阻止恶意输入。这些过滤器可能会使传统安全基准的有效性降低,因为它们阻止了模型面对在实际应用中可能遇到的各种对抗性威胁。

8.2.4Open-Ended Evaluation

8.2.4 开放式评估

Evaluating adversarial attacks in classification problems is relatively straightforward, as each input is typically associated with a distinct class label. However, large models often generate openended responses, which complicates the evaluation of attacks like jail breaking (e.g., through ASR computation). The ideal evaluator for such models would function as a perfect jail breaking detector. Yet, since achieving an ideal jail breaking detector is not feasible, it follows that an ideal evaluator may also be unattainable.

在分类问题中评估对抗攻击相对直接,因为每个输入通常与一个明确的类别标签相关联。然而,大模型常常生成开放式的响应,这使得评估越狱等攻击变得复杂(例如,通过 ASR 计算)。理想的评估器应能作为完美的越狱检测器。然而,由于实现理想的越狱检测器并不可行,因此理想的评估器可能也难以实现。

Currently, evaluators are generally rule-based (e.g., keyword detection) or model-based (e.g., GPT or Llama-Guard). However, developing more consistent and reliable evaluation methods and metrics remains an open challenge. One potential solution is to constrain the output space to a finite set of actions, similar to the approach used in agent-based scenarios. This would limit the complexity of the evaluation and make it more feasible to assess safety in open-ended environments.

目前,评估者通常基于规则(例如关键词检测)或基于模型(例如 GPT 或 Llama-Guard)。然而,开发更加一致和可靠的评估方法和指标仍然是一个开放的挑战。一个潜在的解决方案是将输出空间限制为有限的一组操作,类似于在基于智能体的场景中使用的方法。这将限制评估的复杂性,并使在开放环境中评估安全性更加可行。

8.3 Safety Defense

8.3 安全防御

Safety mechanisms in large models are crucial for preventing harmful or unintended behaviors. These mechanisms may involve modifications to the model's architecture or the integration of external monitoring systems. This section explores the open challenges in developing robust defense solutions.

大模型中的安全机制对于防止有害或意外行为至关重要。这些机制可能涉及对模型架构的修改或外部监控系统的集成。本节探讨了开发稳健防御解决方案的开放性挑战。

8.3.1 Safety Alignment Is Not the Savior

8.3.1 安全对齐并非救世主

Safety alignment—ensuring that a model's objectives align with human values—has been a promising approach to mitigating certain risks. However, recent studies have revealed a significant weakness in safety alignment: fake alignment [432], [517], where a model may achieve high safety scores without truly understanding the underlying safety principles. This points to a deeper issue of shallow safety. Furthermore, even models that are considered well-aligned, such as GPT-4o [518] or o1 [?], remain vulnerable to sophisticated attacks that can bypass their alignment mechanisms [479].

安全对齐(Safety Alignment)——确保模型的目标与人类价值观一致——一直是缓解某些风险的有前景的方法。然而,最近的研究揭示了安全对齐中的一个显著弱点:虚假对齐(Fake Alignment)[432], [517],即模型可能在没有真正理解底层安全原则的情况下获得较高的安全评分。这指向了一个更深层次的问题:浅层安全。此外,即使是被认为对齐良好的模型,如 GPT-4o [518] 或 o1 [?],仍然容易受到可以绕过其对齐机制的复杂攻击 [479]。

An open challenge is to identify the mechanistic limitations of safety alignment and develop methods that ensure robust safety, even in the face of unforeseen attacks. Recent research [519] emphasizes the need to move beyond superficial safety alignment metrics (such as the distribution of the first few output tokens), ensuring that alignment is deeper and more comprehensive. Additionally, making safety alignment adversarial by actively challenging a model's safety mechanisms—may help address the issue of shallow alignment, leading to more resilient and trustworthy models.

一个开放的挑战是识别安全对齐的机制限制,并开发确保稳健安全的方法,即使面对未预见的攻击。最近的研究[519]强调需要超越表面的安全对齐指标(例如前几个输出 Token 的分布),确保对齐更深入和更全面。此外,通过主动挑战模型的安全机制来使安全对齐具有对抗性,可能有助于解决浅层对齐的问题,从而产生更具韧性和可信的模型。

8.3.2Jailbreak AttacksAre More Challenging toDefend Against Than Adversarial Attacks

8.3.2 Jailbreak攻击比对抗性攻击更难防御

Jailbreak attacks and adversarial attacks share a common goal: both aim to manipulate a model into producing targeted outputs. However, the key distinction lies in the nature of these outputs. Jailbreak attacks specifically seek to induce harmful or toxic responses, which requires bypassing the model's safety mechanisms. In contrast, while adversarial attacks may also circumvent safety defenses, they are not inherently designed to produce malicious content.

越狱攻击和对抗攻击有一个共同目标:两者都旨在操控模型以产生特定的输出。然而,关键区别在于这些输出的性质。越狱攻击特别寻求诱导有害或恶意的响应,这需要绕过模型的安全机制。相比之下,虽然对抗攻击也可能绕过安全防御,但它们本质上并非设计用于产生恶意内容。

A significant challenge in defending against jailbreak attacks, beyond the typical defenses against adversarial attacks, is the absence of constraints on the attack's perturbation budget. In adversarial attacks, perturbations are deliberately kept minimal to remain imperceptible to humans. Jailbreak attacks, however, are not constrained by such requirements, allowing for greater fexibility in crafting attacks. This lack of limitations makes jailbreak attacks more challenging to defend against compared to adversarial attacks. Developing defense strategies that can effectively address both types of attacks remains an open and pressing challenge.

防御越狱攻击的一个重大挑战,除了对抗对抗攻击的典型防御外,还在于攻击的扰动预算没有限制。在对抗攻击中,扰动被刻意保持在最小,以使其对人类不可察觉。然而,越狱攻击不受此类要求的限制,从而在构建攻击时具有更大的灵活性。这种限制的缺乏使得越狱攻击相较于对抗攻击更具防御挑战性。开发能够有效应对这两种攻击的防御策略仍然是一个开放且紧迫的挑战。

8.3.3The Need for More Practical Defenses

8.3.3 更实用防御的需求

Existing defense methods face several limitations that hinder their effectiveness in real-world applications. These limitations include a lack of generality, low efficiency, reliance on white-box access, and poor adaptability. To be truly practical, a defense must possess certain desirable properties, the achievement of which remains an ongoing challenge.

现有防御方法面临着一些限制,阻碍了它们在实际应用中的有效性。这些限制包括缺乏通用性、效率低下、依赖白盒访问以及适应性差。要真正实用,防御必须具备某些理想特性,而实现这些特性仍然是一个持续的挑战。

1 Generality:

1 通用性:

With the wide variety of models deployed across different domains—such as vision, language, and multimodal systems—defenses should not be overly tailored to specific architectures. Instead, they should offer generalized solutions applicable to multiple model families. Generality ensures that a single defense mechanism can be deployed across a broad range of systems, making it scalable and effcient for realworld safety infrastructures.

随着跨不同领域(如视觉、语言和多模态系统)部署的模型种类繁多,防御措施不应过度针对特定架构。相反,它们应提供适用于多个模型家族的通用解决方案。通用性确保单一的防御机制可以部署在广泛的系统中,使其在现实世界的安全基础设施中具有可扩展性和高效性。

2 Black-box Compatibility:

2 黑盒兼容性:

In many real-world scenarios, defenders may not have access to the internal parameters of the model they are protecting. Therefore, practical defenses must operate in a black-box setting, where defenders can only observe the model's inputs and outputs. This requires defense strategies that function externally to detect and mitigate attacks without needing access to the model's inner workings.

在许多现实场景中,防御者可能无法访问他们保护的模型的内部参数。因此,实际防御必须在黑盒环境中进行,防御者只能观察模型的输入和输出。这就要求防御策略能够在外部运作,无需访问模型的内部机制即可检测和缓解攻击。

3 Efficiency:

3 效率:

Many defense techniques, particularly adversarial training, are computationally expensive. The need for large-scale retraining or fine-tuning can make these defenses prohibitively costly. Practical defenses must strike a balance between robustness and computational efficiency, ensuring that models remain safe without incurring excessive resource consumption.

许多防御技术,尤其是对抗训练,计算成本高昂。大规模重新训练或微调的需求使得这些防御措施的成本过高。实际可行的防御必须在鲁棒性和计算效率之间取得平衡,确保模型在保持安全的同时不会导致过多的资源消耗。

4 Continual Adaptability:

4 持续适应性:

A A practical defense system should not only recognize previously encountered attacks but also adapt in real-time to new and evolving threats. This requires continual learning and the ability to update without relying on costly retraining cycles. Models must be capable of incorporating new data, evolving their defense strategies, and self-correcting as new attacks emerge.

一个实用的防御系统不仅应能识别先前遇到的攻击,还应能实时适应新的和不断演变的威胁。这需要持续学习,并具备在不依赖昂贵的重新训练周期的情况下进行更新的能力。模型必须能够整合新数据,发展其防御策略,并在新攻击出现时进行自我纠正。

The ongoing challenge for researchers is to refine and integrate these properties into cohesive defense strategies that offer robust protection while maintaining model performance.

研究人员面临的持续挑战是完善这些特性并将其整合为连贯的防御策略,在保持模型性能的同时提供强大的保护。

8.3.4The Lack of Proactive Defenses

8.3.4 缺乏主动防御

Most existing defense approaches, such as safety alignment and adversarial training, are passive in nature, focusing on fortifying models against potential attacks. However, proactive defenses, which actively counter potential attacks before they succeed, have received limited attention in the literature. For example, a proactive defense against model extraction could involve poisoning or injecting backdoors into extraction attempts, rendering the extracted model unreliable. Another approach could be to provide deliberately nonsensical or easily detectable responses when a malicious user requests advice for illegal activities, such as planning a robbery. Such proactive strategies could serve as powerful deterrents to potential attackers. However, designing effective proactive defenses for different types of safety threats remains an open challenge and a promising direction for future research.

现有的大多数防御方法,如安全对齐和对抗训练,本质上是被动的,侧重于增强模型以抵御潜在攻击。然而,主动防御在文献中受到的关注有限,它旨在在潜在攻击成功之前主动进行反击。例如,针对模型提取的主动防御可能包括在提取尝试中投毒或注入后门,从而使提取的模型不可靠。另一种方法可能是在恶意用户请求非法活动建议(如策划抢劫)时,故意提供无意义或易于检测的响应。这些主动策略可以作为对潜在攻击者的强大威慑。然而,设计针对不同类型安全威胁的有效主动防御仍然是一个开放的挑战,也是未来研究的一个有前景的方向。

8.3.5 Detection Has Been Overlooked in Current Defenses Detection methods play a crucial role in identifying potential vulner abilities and abnormal behaviors in models, effectively acting as active monitors. When integrated with other defense mechanisms, detection systems can trigger automatic safety responses whenever a model behaves unexpectedly or generates harmful outputs. Despite their importance, however, existing defense strategies have largely overlooked the integration of detection systems within their pipelines. By combining detection with other safety measures, it becomes possible to develop more resilient models capable of dynamically responding to emerging threats. For example, stronger attacks may be more easily detected, providing an opportunity for proactive defense. An open question remains: What is the most effective way to integrate detection as a core component of a defense system, and how can detection and other defense mechanisms complement and enhance each other?

8.3.5 检测在当前防御中被忽视
检测方法在识别模型中的潜在漏洞和异常行为方面发挥着关键作用,有效地充当了主动监控的角色。当与其他防御机制结合时,检测系统可以在模型行为异常或生成有害输出时触发自动安全响应。然而,尽管其重要性,现有的防御策略在很大程度上忽视了将检测系统整合到其流程中。通过将检测与其他安全措施结合,可以开发出更具弹性的模型,能够动态应对新出现的威胁。例如,更强的攻击可能更容易被检测到,从而为主动防御提供了机会。一个开放性问题仍然存在:如何最有效地将检测作为防御系统的核心组件进行整合,以及检测与其他防御机制如何相互补充和增强?

8.3.6 The Current Data Usage Practices Must Change

8.3.6 当前的数据使用实践必须改变

The current data usage practices in the AI development lifecycle are neither sustainable nor ethical. We identify three major issues with how data is currently used:

当前 AI 开发生命周期中的数据使用实践既不可持续也不道德。我们识别出当前数据使用的三大问题:

1 Lack of Consent and Recognition: Many large models are trained on web-crawled data without the consent of data owners or formal recognition or reward for their contributions. This practice raises significant ethical and legal issues.

缺乏同意与认可:许多大语言模型在未经数据所有者同意或对其贡献进行正式认可或奖励的情况下,使用网络爬取的数据进行训练。这种做法引发了重大的伦理和法律问题。

2 The Explosion of Generated Data: A large volume of generated data has been uploaded to the internet, much of which is fake, toxic, or otherwise harmful. However, there is no clear system in place to identify which model created this data or who was responsible for its generation. 3 Depletion of “Free" Data: The “free" data available on the internet is rapidly diminishing. Users are becoming less motivated or outright refusing to contribute valuable data, especially when their contributions are neither acknowledged nor rewarded.

2 生成数据的爆炸式增长:大量生成数据被上传到互联网,其中许多是虚假的、有毒的或有害的。然而,目前还没有明确的系统来识别这些数据是由哪个模型创建的,或者是谁负责生成的。3 “免费”数据的耗尽:互联网上的“免费”数据正在迅速减少。用户越来越不愿意或直接拒绝贡献有价值的数据,尤其是当他们的贡献既没有得到认可也没有得到回报时。

To address these challenges, the AI industry must establish a healthy and sustainable data ecosystem where data contributors are recognized and rewarded. Achieving this requires answering the following key questions:

为应对这些挑战,AI行业必须建立一个健康且可持续的数据生态系统,让数据贡献者得到认可和奖励。实现这一目标需要回答以下关键问题:

Beyond these questions, many other critical issues must be addressed to create a sustainable and ethical data ecosystem. Ensuring that data contributors are fairly acknowledged and incentivized is essential for promoting accountability and transparency in AIGC.

除了这些问题,还有许多其他关键问题必须解决,以创建一个可持续且符合道德的数据生态系统。确保数据贡献者得到公平的认可和激励,对于促进生成式 AI (Generative AI) 的问责制和透明度至关重要。

8.3.7 Safe Embodied Agents

8.3.7 安全具身AI智能体

Most safety threats studied today are digital. However, as embodied AI agents are increasingly deployed in the physical world, new physical threats will emerge, potentially causing real harm and loss to humans. Ensuring the safety of these embodied agents has therefore become a critical concern. Safe agents must be resilient to adversarial inputs, capable of self-regulating harmful behaviors, and consistently aligned with human values.

目前研究的大多数安全威胁是数字化的。然而,随着具身 AI 智能体越来越多地部署在物理世界中,新的物理威胁将出现,可能会对人类造成真实的伤害和损失。因此,确保这些具身智能体的安全已成为一个关键问题。安全的智能体必须能够抵御对抗性输入,能够自我调节有害行为,并始终与人类价值观保持一致。

Achieving this requires deeply integrating safety mechanisms into the agents′ decision-making processes, enabling them to handle unexpected challenges while maintaining robustness and reliability. The primary challenge is designing safety protocols that allow these agents to perform complex tasks autonomously, while ensuring they remain trustworthy and safe in dynamic and unpredictable environments. As agents gain greater autonomy, ensuring their safety becomes not only a technical challenge but also a significant ethical responsibility.

实现这一点需要将安全机制深度集成到AI智能体的决策过程中,使其在保持鲁棒性和可靠性的同时,能够应对意外的挑战。主要挑战在于设计安全协议,使这些智能体能够自主执行复杂任务,同时确保它们在动态且不可预测的环境中保持可信和安全。随着智能体获得更大的自主性,确保其安全性不仅成为一项技术挑战,更是一项重大的伦理责任。

8.3.8 Safe Super intelligence

8.3.8 安全超级智能

As AI progresses toward AGI and super intelligence, embedding inherent safety mechanisms into large models to ensure predictable, value-aligned behavior becomes a critical challenge. While the technical path to safe super intelligence remains uncertain, several promising mechanisms offer potential solutions:

随着 AI 向通用人工智能和超级智能发展,将内在安全机制嵌入大模型以确保可预测、价值一致的行为成为一个关键挑战。尽管实现安全超级智能的技术路径仍不确定,但几种有前景的机制提供了潜在的解决方案:

· Oversight System: One system cannot be both superintelligent and trustworthy, like humans. However, we can develop an oversight system to monitor and regulate the primary system's behavior, intervening when necessary. A key challenge is to ensure the reliability and robustness of the oversight system itself. This leads to the Oversight Paradox: If a super intelligent AI is monitored by another AI (the oversight system), the oversight system must be at least as capable as the super intelligent AI to reliably detect and prevent undesirable behaviors. However, this raises the question: who monitors the oversight system to ensure it doesn't fail or act contrary to its purpose? · Safety Layer: This approach embeds a dedicated safety layer [520], [521] directly within the model's architecture, acting as a gatekeeper to filter outputs against predefined safety constraints. Such layers could be dynamically updated based on real-time feedback or optimized for specific tasks. · Safety Expert: This approach incorporates specialized "safety experts" within the Mixture of Experts (MoE) framework [522]-[525] to handle safety-critical tasks. By dynamically routing high-stakes queries to these experts, safety conside rations are prioritized in decision-making. The recurring challenge is developing a truly safe expert model that can consistently perform as expected in all scenarios. · Adversarial Alignment: This approach leverages adversarial safety principles to align models with human values. It involves training models to exploit vulnerabilities in existing safety mechanisms, then refining these mechanisms to resist adversarial prompts. Despite its promise, challenges such as high computational costs and the risk of unintended behaviors remain significant concerns. · Safety Consciousness: This approach involves embedding a safety-conscious framework into the model's foundational training to promote ethical reasoning and value alignment as intrinsic behaviors. The goal is to make safety a core characteristic, enabling the model to dynamically adapt to diverse and evolving scenarios. Safety consciousness may be formulated as a type of safety tendency: an inherent inclination to generate low-risk responses and shape outputs based on an awareness of potential harmful consequences, similar to human decision-making processes.

监管系统:一个系统不可能既超级智能又像人类一样值得信赖。然而,我们可以开发一个监管系统来监控和规范主要系统的行为,并在必要时进行干预。一个关键挑战是确保监管系统本身的可靠性和鲁棒性。这导致了监管悖论:如果一个超级智能的AI由另一个AI(监管系统)监控,那么监管系统必须至少与超级智能AI一样强大,才能可靠地检测和防止不良行为。然而,这就引发了一个问题:谁来监控监管系统,以确保它不会失效或违背其初衷?
安全层:这种方法在模型架构中直接嵌入了一个专用的安全层 [520], [521],作为守门员来根据预定义的安全约束过滤输出。这些层可以基于实时反馈进行动态更新,或针对特定任务进行优化。
安全专家:这种方法在专家混合(Mixture of Experts, MoE)框架 [522]-[525] 中引入了专门的“安全专家”来处理安全关键任务。通过将高风险查询动态路由到这些专家,安全考虑在决策中被优先处理。反复出现的挑战是开发一个真正安全的专家模型,能够在所有场景中始终如一地按预期执行。
对抗对齐:这种方法利用对抗安全原则来使模型与人类价值观对齐。它涉及训练模型以利用现有安全机制中的漏洞,然后改进这些机制以抵御对抗性提示。尽管其前景广阔,但高计算成本和意外行为的风险仍然是重大担忧。
安全意识:这种方法涉及在模型的基础训练中嵌入一个安全意识框架,以促进伦理推理和价值对齐作为内在行为。目标是使安全成为核心特征,使模型能够动态适应多样化和不断变化的场景。安全意识可以表述为一种安全倾向:基于对潜在有害后果的意识,生成低风险响应并塑造输出的内在倾向,类似于人类决策过程。

8.4A Call for Collective Action

8.4 集体行动的呼吁

Safeguarding large models against adversarial manipulation, misuse, and harm is a global challenge that requires the collective efforts of researchers, practitioners, and policymakers. The following sections outline a research agenda aimed at advancing largemodel safety through collaboration and innovation.

保护大模型免受对抗性操纵、滥用和伤害是一个全球性挑战,需要研究人员、从业者和政策制定者的共同努力。以下部分概述了通过合作和创新推动大模型安全的研究议程。

8.4.1Defense-Oriented Research

8.4.1 防御性研究

Current research on large-model safety is heavily skewed toward attack strategies, with significantly fewer efforts dedicated to developing defense mechanisms. This imbalance is concerning, as the sophistication of attacks continues to outpace the development of effective defenses. To address this gap, we advocate for a shift in research priorities toward defense strategies. Researchers should not only investigate attack mechanisms but also focus on developing robust defenses to mitigate or prevent these threats. A balanced approach is crucial for advancing the field of safety research.

当前对大模型安全的研究严重偏向于攻击策略,而在开发防御机制方面的投入明显较少。这种不平衡令人担忧,因为攻击的复杂性持续超过有效防御的发展。为了解决这一差距,我们主张将研究重点转向防御策略。研究者不仅应该调查攻击机制,还应专注于开发强大的防御措施以减轻或预防这些威胁。平衡的方法对于推进安全研究领域至关重要。

Moreover, future defense research should emphasize integrated approaches. New defense methods should not be proposed or implemented in isolation but rather integrated with existing approaches to build cumulative protection. Defense research should be viewed as a continuous, evolving effort, where new methods arelayered onto established ones to enhance overall effectiveness. However, the diversity of defense strategies presents a challenge. Developing frameworks to effectively incorporate different defense mechanisms will require a collective effort from the research community.

此外,未来的防御研究应强调集成方法。新的防御方法不应孤立地提出或实施,而应与现有方法集成以构建累积保护。防御研究应被视为一项持续发展的努力,新的方法应建立在现有方法之上,以提高整体有效性。然而,防御策略的多样性带来了挑战。开发能够有效整合不同防御机制的框架需要研究界的共同努力。

8.4.2Dedicated Safety APIs

8.4.2 专用安全 API

To support research and testing, commercial AI models should offer a dedicated safety API. This API would allow researchers to assess and enhance the safety of these models by subjecting them to a variety of adversarial and safety-critical scenarios. By providing such an API, commercial providers can enable external safety evaluations without disrupting the general services offered to users. This would foster collaboration between industry and academia, facilitating continuous improvement in model safety.

为支持研究和测试,商业 AI 模型应提供专用的安全 API。该 API 将使研究人员能够通过将模型置于各种对抗性和安全关键场景中来评估和增强这些模型的安全性。通过提供此类 API,商业提供商可以在不影响向用户提供的通用服务的情况下实现外部安全评估。这将促进工业界和学术界之间的合作,推动模型安全性的持续改进。

8.4.3 Open-Source Platforms

8.4.3 开源平台

The AI safety community would greatly benefit from the development and open-source release of safety platforms and libraries. These tools would facilitate the rapid evaluation, testing, and improvement of safety mechanisms across a variety of models and applications. Open-sourcing these platforms would foster collaboration and transparency, enabling researchers and practitioners to share best practices, benchmark safety techniques, and contribute to the establishment of universal safety standards.

AI安全社区将极大地受益于安全平台和库的开发与开源发布。这些工具将促进各种模型和应用中安全机制的快速评估、测试和改进。开源这些平台将促进协作和透明度,使研究人员和从业者能够分享最佳实践、基准安全技术,并为建立通用安全标准做出贡献。

8.4.4 Global Collaborations

8.4.4 全球协作

The pursuit of AI safety is a global challenge that transcends national borders, requiring coordinated efforts from academia, technology companies, government agencies, and non-profit orga niz at ions. Effective collaboration on a global scale is essential to addressing the potential risks associated with advanced AI systems. By fostering international cooperation, we can more efficiently tackle complex safety issues and establish unified standards that guide the safe development and deployment of AI technologies.

追求 AI 安全是一项超越国界的全球性挑战,需要学术界、技术公司、政府机构和非营利组织的协调努力。在全球范围内进行有效合作对于应对先进 AI 系统带来的潜在风险至关重要。通过促进国际合作,我们可以更高效地解决复杂的安全问题,并建立统一的指导 AI 技术安全开发和部署的标准。

To facilitate global collaboration, the following initiatives could be pursued:

为促进全球合作,可采取以下举措:

. International Safety Alliances: Establishing global alliances dedicated to AI safety can bring together experts and resources from around the world. These alliances would focus on sharing research findings, coordinating safety evaluations, and developing universal safety benchmarks that reflect diverse regional needs and values.

国际安全联盟:建立致力于AI安全的全球联盟,可以汇集来自世界各地的专家和资源。这些联盟将专注于分享研究成果、协调安全评估,并开发反映不同地区需求和价值观的通用安全基准。

Global collaborations not only enhance the effectiveness of AI safety research but also promote transparency, trust, and accountability in the development of advanced AI systems. By working together across borders and disciplines, we can ensure that AI technologies benefit humanity while minimizing the risks associated with their deployment.

全球合作不仅提升了AI安全研究的有效性,还促进了先进AI系统发展中的透明度、信任和问责制。通过跨国界和跨学科的合作,我们可以确保AI技术造福人类,同时最大限度地降低其部署带来的风险。

9 CONCLUSION

9 结论

In this paper, we surveyed 390 technical papers on large model safety, encompassing Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-powered agents. We provided a comprehensive taxonomy of threats and defenses, highlighting the evolving challenges these models face. Despite notable progress, numerous open challenges remain, particularly in understanding the fundamental vulnerabilities of large models, establishing comprehensive safety evaluations, developing scalable and effective defense mechanisms, and ensuring sustainable data practices. More importantly, achieving safe AI will require collective efforts from the global research community and international collaboration. We hope this paper could serve as a useful resource for researchers and practitioners, driving ongoing efforts to build safe, robust and trustworthy large-scale models.

在本文中,我们调研了 390 篇关于大模型安全的论文,涵盖视觉基础模型 (Vision Foundation Models, VFMs)、大语言模型 (Large Language Models, LLMs)、视觉语言预训练模型 (Vision-Language Pre-training, VLP)、视觉语言模型 (Vision-Language Models, VLMs)、扩散模型 (Diffusion Models, DMs) 以及基于大模型的智能体。我们提供了威胁与防御的综合分类,强调了这些模型面临的不断演变的挑战。尽管取得了显著进展,但仍存在许多开放性问题,特别是在理解大模型的基本漏洞、建立全面的安全评估、开发可扩展且有效的防御机制以及确保可持续的数据实践方面。更重要的是,实现安全的 AI 需要全球研究界的共同努力和国际合作。我们希望本文能够为研究人员和实践者提供一个有用的资源,推动构建安全、稳健和可信的大规模模型的持续努力。

REFERENCES

参考文献

阅读全文(20积分)