[论文翻译]优雅地老去:基于离散键值适配器的终身模型编辑


原文地址:https://proceedings.neurips.cc/paper_files/paper/2023/file/95b6e2ff961580e03c0a662a63a71812-Paper-Conference.pdf


Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors

优雅地老去:基于离散键值适配器的终身模型编辑

Thomas Hartvigsen University of Virginia, MIT hartvigsen@virginia.edu

Thomas Hartvigsen 弗吉尼亚大学,麻省理工学院 hartvigsen@virginia.edu

Swami Sankaranarayanan Sony AI swami.sankaranarayanan@sony.com

Swami Sankaranarayanan Sony AI swami.sankaranarayanan@sony.com

Hamid Palangi Microsoft Research hpalangi@microsoft.com

Hamid Palangi Microsoft Research hpalangi@microsoft.com

Marzyeh Ghassemi MIT mghassem@mit.edu

Marzyeh Ghassemi MIT mghassem@mit.edu

Abstract

摘要

Deployed language models decay over time due to shifting inputs, changing user needs, or emergent world-knowledge gaps. When such problems are identified, we want to make targeted edits while avoiding expensive retraining. However, current model editors, which modify such behaviors of pre-trained models, degrade model performance quickly across multiple, sequential edits. We propose GRACE, a lifelong model editing method, which implements spot-fixes on streaming errors of a deployed model, ensuring minimal impact on unrelated inputs. GRACE writes new mappings into a pre-trained model’s latent space, creating a discrete, local codebook of edits without altering model weights. This is the first method enabling thousands of sequential edits using only streaming errors. Our experiments on T5, BERT, and GPT models show GRACE’s state-of-the-art performance in making and retaining edits, while generalizing to unseen inputs. Our code is available at github.com/thartvigsen/grace.

已部署的语言模型会因输入变化、用户需求改变或新出现的世界知识缺口而随时间衰退。当发现此类问题时,我们需要进行针对性修改,同时避免昂贵的重新训练。然而,当前修改预训练模型行为的模型编辑器在多次连续编辑后会快速降低模型性能。我们提出GRACE这一终身模型编辑方法,可对部署模型的流式错误实施定点修复,确保对无关输入的影响最小化。GRACE将新映射写入预训练模型的潜在空间,在不改变模型权重的情况下创建离散的本地编辑码本。这是首个仅利用流式错误即可实现数千次连续编辑的方法。我们在T5、BERT和GPT模型上的实验表明,GRACE在执行和保留编辑方面具有最先进的性能,同时能泛化到未见过的输入。代码已发布于github.com/thartvigsen/grace。

1 Introduction

1 引言

Large scale pre-trained neural networks are the state-of-the-art for many hard machine learning problems, especially in natural language processing [5, 33] and computer vision [9, 34]. But when deployed, they still make unpredictable errors [2, 43]. For example, Large Language Models (LLMs) notoriously hallucinate [17], perpetuate bias [11], and factually decay [8]. Many such errors will arise sequentially in deployment, some of which must be addressed quickly without waiting until new training data is collected [16, 21]. For example, when an LLM generates hate speech or crucial knowledge about the world changes, its behavior must be modified immediately to protect users.

大规模预训练神经网络是解决众多高难度机器学习问题的最先进技术,尤其在自然语言处理 [5, 33] 和计算机视觉领域 [9, 34] 。但在实际部署时,它们仍会出现难以预测的错误 [2, 43] 。例如,大语言模型 (LLMs) 存在众所周知的幻觉问题 [17] 、固化偏见 [11] 以及事实性衰退现象 [8] 。这类错误往往会在部署过程中接连出现,其中部分问题必须即时修正,无法等待新训练数据的收集 [16, 21] 。例如当大语言模型生成仇恨言论,或世界关键知识发生变动时,必须立即调整其行为以保护用户。

While retraining or finetuning can edit a model’s predictions, doing this frequently is often too computationally expensive. LLaMA [40], for instance, was trained for 21 days on 2,048 A100 GPUs, costing over $\$2.4\mathbf{M}$ and emitting over 1,000 tons of $\mathrm{{CO}_{2}}$ . Even repeatedly retraining or finetuning smaller models [3] quickly costs too much for most practitioners. To enable cheap, targeted updates to big, pre-trained models, we study lifelong model editing. Here, we continually make targeted edits sequentially throughout a model’s deployment. Success means correcting a model’s predictions on a stream of edits without decaying its performance on unrelated and previously-edited inputs.

虽然重新训练或微调可以修改模型的预测结果,但频繁进行这类操作通常计算成本过高。例如,LLaMA [40] 在2,048块A100 GPU上训练了21天,耗资超过240万美元,并排放了1,000多吨二氧化碳。即使对较小模型 [3] 进行反复重训练或微调,对大多数从业者来说成本也很快变得难以承受。为了实现对大预训练模型进行低成本、定向更新,我们研究了终身模型编辑 (lifelong model editing) 技术。该方法通过在模型部署期间持续进行定向编辑序列,其成功标准是:在不影响无关输入及已编辑输入性能的前提下,持续修正模型对编辑流的预测结果。

Two possible approaches to lifelong model editing are continual learning and conventional model editing. While continual learning methods can update models sequentially, when used for sequential editing they suffer from overfitting [25] and quickly forget previous edits and pre-training data [16, 22]. Alternatively, existing static model editors could also be run sequentially. But the editors that rely on regularized finetuning [8, 28, 29, 37] succumb to overfitting [4], while hypernetwork methods require pre-training on hard-to-access data [30, 31]. Further, these existing approaches require large sets of representative training edits, access to the model’s pre-training data, or semantically-equivalent inputs. Accessing these data is often unrealistic or impractical, and existing editors severely decay the pre-trained model’s performance after only a few sequential edits [12, 16].

终身模型编辑的两种可能方法是持续学习和传统模型编辑。虽然持续学习方法可以顺序更新模型,但在用于顺序编辑时会出现过拟合 [25] 问题,并快速遗忘先前的编辑和预训练数据 [16, 22]。另一种方法是顺序运行现有的静态模型编辑器。但依赖正则化微调 [8, 28, 29, 37] 的编辑器容易过拟合 [4],而超网络方法需要在难以获取的数据上进行预训练 [30, 31]。此外,这些现有方法需要大量具有代表性的训练编辑集、访问模型的预训练数据或语义等效的输入。获取这些数据通常不切实际,现有编辑器的性能在仅进行少量顺序编辑后就会严重损害预训练模型 [12, 16]。


Figure 1: Overview of lifelong model editing with GRACE. a) Models make important errors that must be corrected. b) GRACE makes edits by learning, caching, and selectively retrieving new transformations between layers. c) Edits appear sporadically and require quick fixes, so GRACE codebooks are curated over long sequences of edits.

图 1: 使用GRACE进行终身模型编辑的概述。a) 模型会出现必须纠正的重要错误。b) GRACE通过在不同层之间学习、缓存和选择性检索新转换来实现编辑。c) 编辑会零星出现且需要快速修复,因此GRACE的码本是在长时间编辑序列中精心维护的。

We study lifelong editing using only singular inputs to edit models, where models are edited immediately upon arrival. This setup extends beyond recent works, which use many inputs for editing, including collected sets of semantically-equivalent inputs, training edits, and the model’s pre-training data. Editing in our realistic, more cost-effective setup is hard because each edit must fix the flagged error, while generalizing to similar future inputs without impacting unrelated model behavior.

我们研究仅使用单一输入对模型进行即时编辑的终身学习场景。这一设定超越了近期需要大量编辑输入的研究(包括收集语义等效输入集、训练编辑数据及模型预训练数据)。在这种更贴近现实且更具成本效益的配置中,编辑任务极具挑战性:每个编辑既要修正当前标注错误,又需泛化至未来类似输入,同时不影响模型无关行为。

To address the challenging lifelong model editing setup, we propose General Retrieval Adaptors for Continual Editing, or GRACE. While we study transformers for natural language processing, GRACE is broadly applicable. As illustrated in Figure 1, GRACE edits a model by adding an Adaptor to a chosen layer, while never changing its weights. This Adaptor then modifies layer-to-layer transformations for select inputs. By caching embeddings for input errors and learning values that decode into desired model outputs, GRACE serves as a codebook in which edits are stored, enabling longer sequences of edits than prior works. To encourage generalization, we leverage the semantic similarity encoded in the model’s latent space by introducing $\epsilon$ -balls around cached edits. GRACE then applies only to inputs near existing keys, differing from more-traditional Adaptors [15]. By managing the $\epsilon$ -balls’ size over time, GRACE makes immediate edits, remembers previous edits, and leaves correct model behaviors intact, while remaining parameter-efficient. Further, since GRACE codebooks leave model weights unaltered and are fully model-agnostic, they also point towards plug-and-play, cost-effective model editing, especially for making crucial spot-fixes between bigger retraining efforts.

为解决具有挑战性的终身模型编辑设置,我们提出了通用检索适配器持续编辑方法(GRACE)。虽然我们研究的是自然语言处理领域的Transformer模型,但GRACE具有广泛适用性。如图1所示,GRACE通过在选定层级添加适配器(Adaptor)来编辑模型,且从不更改其权重。该适配器随后会针对特定输入修改层间转换关系。通过缓存错误输入的嵌入表示,并学习解码为目标模型输出的值,GRACE充当存储编辑操作的码本(codebook),支持比现有方法更长的编辑序列。为提升泛化能力,我们利用模型潜在空间中的语义相似性,在缓存编辑周围引入$\epsilon$-球面空间。与传统适配器[15]不同,GRACE仅对接近现有键值的输入生效。通过动态管理$\epsilon$-球面尺寸,GRACE能即时实施编辑、保留历史编辑记录,同时保持模型原有正确行为,实现参数高效利用。此外,由于GRACE码本不改变模型权重且完全与模型无关,这为即插即用、高性价比的模型编辑(特别是重大重训练间隙的关键定点修复)提供了技术路径。

Our contributions are as follows:

我们的贡献如下:

2 Methods: Lifelong Model Editing with GRACE

2 方法:基于GRACE的终身模型编辑

2.1 Problem Formulation

2.1 问题表述

The Lifelong Model Editing task is to edit the same model hundreds to thousands of times in a row without forgetting upstream performance or fixes for previous edits. Let $f_{0}$ denote a model with frozen parameters that was pre-trained on dataset $\mathcal{D}_ {\mathrm{train}}$. For clarity, we drop the subscript where possible. Assume that $f$ consists of $L$ layers, where $f^{l}(\cdot)$ computes the hidden state at layer $l$. In this work, we assume $f$ is a transformer architecture for natural language, though our principles are general. We then deploy $f$ on a stream of inputs $[x_{0},x_{1},\ldots]$ , where $x_{t}\in\mathcal{D}_ {\mathrm{edit}}$ , which contains samples observed during deployment. We then monitor the model’s predictions $\hat{y}_ {t}=f(x_{t})$ over the stream. Note that the prediction tasks during training and deployment are the same. Over time, the model makes errors such that $\hat{y}_ {t}\neq y_{t}$ , where $y_{t}$ is the true label. To continue safely deploying $f$ , we aim to edit $f$ such that $f(x_{t})=y_{t}$ . After each edit, the updated $f$ should 1) be successfully edited such that $f(x_{t})=y_{t}$, 2) retain accuracy on prior edits $x_{<t}$ , and 3) retain its behavior on its training data: $f(x_{i})=f_{0}(x_{i})\ \forall x_{i}\in\mathcal{D}_ {\mathrm{train}}$ . In contrast to prior works [8, 28, 30, 31, 37], we assume access to only input edits $x_{t}$ and corrected labels $y_{t}$ , since $\mathcal{D}_{\mathrm{train}}$ is often proprietary or prohibitively large, and collecting training edits or semantically-equivalent inputs is often expensive in practice.

终身模型编辑任务是指在不遗忘上游性能或先前编辑修复的情况下,对同一模型连续进行数百至数千次编辑。设$f_{0}$表示在数据集$\mathcal{D}_ {\mathrm{train}}$上预训练且参数冻结的模型。为简洁起见,我们尽可能省略下标。假设$f$由$L$层组成,其中$f^{l}(\cdot)$计算第$l$层的隐藏状态。本文假设$f$为自然语言Transformer架构,但所述原理具有普适性。随后将$f$部署于输入流$[x_{0},x_{1},\ldots]$上,其中$x_{t}\in\mathcal{D}_ {\mathrm{edit}}$包含部署期间观测到的样本,并持续监控模型预测$\hat{y}_ {t}=f(x_{t})$。需注意训练与部署阶段的预测任务相同。随着时间推移,模型会出现$\hat{y}_ {t}\neq y_{t}$的预测错误($y_{t}$为真实标签)。为确保$f$的安全部署,我们的目标是编辑$f$使其满足$f(x_{t})=y_{t}$。每次编辑后,更新后的$f$应满足:1) 成功实现$f(x_{t})=y_{t}$的编辑;2) 保持对历史编辑$x_{<t}$的准确性;3) 在训练数据上保持原始行为$f(x_{i})=f_{0}(x_{i})\forall x_{i}\in\mathcal{D}_ {\mathrm{train}}$。与先前研究[8,28,30,31,37]不同,我们假设仅能获取待编辑输入$x_{t}$和修正标签$y_{t}$,因为$\mathcal{D}_{\mathrm{train}}$通常属于专有数据或规模过大,且实践中收集训练编辑或语义等效输入成本高昂。
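As a concrete illustration of this protocol, the sketch below runs the editing loop under the three criteria above. It is a minimal, hedged sketch: `model`, `editor`, `metric`, and the data iterables are hypothetical stand-ins, not part of the paper's released code.

```python
# Minimal sketch of the lifelong editing protocol (hypothetical interfaces, not the authors' code).

def lifelong_editing(model, editor, edit_stream, train_holdout, metric):
    """Edit `model` whenever it errs on the stream, then check the three criteria of Section 2.1."""
    edited = []  # previously-edited (x, y) pairs
    for x_t, y_t in edit_stream:
        if model(x_t) != y_t:                 # an error is flagged during deployment
            editor.edit(model, x_t, y_t)      # targeted fix, e.g. a GRACE codebook update
            edited.append((x_t, y_t))
        # 1) edit success, 2) retention of prior edits, 3) retention of training behavior
        edit_success = metric([y_t], [model(x_t)])
        prior_edit_retention = (
            metric([y for _, y in edited], [model(x) for x, _ in edited]) if edited else None
        )
        train_retention = metric([y for _, y in train_holdout], [model(x) for x, _ in train_holdout])
        yield edit_success, prior_edit_retention, train_retention
```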

2.2 GRACE: General Retrieval Adaptors for Continual Editing

2.2 GRACE: 持续编辑的通用检索适配器

We propose GRACE, a method for sequentially editing a pre-trained model’s behavior without altering its weights, as illustrated in Figure 1. GRACE works by wrapping a chosen layer of any pre-trained model architecture with an Adaptor. A GRACE Adaptor at model $f$ ’s layer $l$ contains two components: (1) a codebook $\mathcal{C}$ and (2) a deferral mechanism to decide whether to use $\mathcal{C}$ for a given input.

我们提出GRACE方法,可在不改变预训练模型权重的情况下顺序编辑其行为,如图1所示。GRACE通过在任何预训练模型架构的选定层外包裹适配器(Adaptor)实现。模型$f$第$l$层的GRACE适配器包含两个组件:(1) 码本(codebook) $\mathcal{C}$;(2) 延迟机制(deferral mechanism),用于决定是否对给定输入使用$\mathcal{C}$。

GRACE codebook. A GRACE Adaptor at layer $l$ maintains a discrete codebook, adding and updating elements over time to edit a model’s predictions. The codebook contains three components:

GRACE码本。第$l$层的GRACE适配器维护一个离散码本,通过动态增删元素来编辑模型预测。该码本包含三个组件:

• Keys $(\mathbb{K})$ : Set of keys, where each key is a cached activation $h^{l-1}$ predicted by layer $l-1$ .
• Values $(\mathbb{V})$ : Set of values that are randomly initialized and are updated using the model’s finetuning loss for edits. Each key maps to a single, corresponding value.
• Deferral radii $(\mathcal{E})$ : Each key has a deferral radius $\epsilon$ , which serves as a threshold for similarity matching. The deferral mechanism uses this radius as shown in Algorithm 1. GRACE is activated at layer $l$ only if the deferral constraint is satisfied. New entries have a default value $\epsilon_{\mathrm{init}}$ , which is a hyperparameter.

• 键 $(\mathbb{K})$:键的集合,其中每个键是由第 $l-1$ 层预测的缓存激活 $h^{l-1}$。
• 值 $(\mathbb{V})$:随机初始化并通过模型的微调损失进行更新的值集合。每个键映射到唯一对应的值。
• 延迟半径 $(\mathcal{E})$:每个键有一个延迟半径 $\epsilon$,作为相似性匹配的阈值。如算法1所示,延迟机制使用该半径。仅当满足延迟约束时,GRACE 才会在第 $l$ 层激活。新条目的默认值为超参数 $\epsilon_{\mathrm{init}}$。
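One way to lay out the codebook described above is sketched below, assuming PyTorch; the class names and fields are illustrative, not the released implementation.

```python
# Illustrative layout of a GRACE codebook (hypothetical names, not the authors' code).
from dataclasses import dataclass, field

import torch


@dataclass
class CodebookEntry:
    key: torch.Tensor        # cached activation h^{l-1}, frozen (never trained)
    value: torch.Tensor      # learnable vector passed on in place of h^l when retrieved
    epsilon: float           # deferral radius around the key
    label: object            # edit label y associated with this entry


@dataclass
class GraceCodebook:
    entries: list = field(default_factory=list)  # empty before any edit is made

    def nearest(self, query: torch.Tensor):
        """Return (index, Euclidean distance) of the closest key, or (None, inf) if empty."""
        if not self.entries:
            return None, float("inf")
        keys = torch.stack([e.key for e in self.entries])         # (n, d)
        dists = torch.cdist(query.unsqueeze(0), keys).squeeze(0)  # (n,)
        i = int(torch.argmin(dists))
        return i, float(dists[i])
```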

Deferral mechanism. Before editing, GRACE layers are empty. As editing progresses, GRACE adds keys and adapts values and $\epsilon$ entries. Conceptually, inference at layer $l$ with GRACE entails a deferral decision, computing $h^{l}$ using a similarity search over GRACE’s keys:

延迟机制。在编辑前,GRACE层为空。随着编辑的进行,GRACE会添加键并调整值和$\epsilon$条目。从概念上讲,在层$l$使用GRACE进行推断涉及一个延迟决策,通过对GRACE键的相似性搜索来计算$h^{l}$:

$$
h^{l} = \begin{cases} \mathbf{GRACE}(h^{l-1}) & \text{if } \min_{i} d(h^{l-1}, \mathbb{K}_ {i}) < \epsilon_{i^* }, \text{ where } i^* = \operatorname*{argmin}_ {i} d(h^{l-1}, \mathbb{K}_ {i}) \\ f^{l}(h^{l-1}) & \text{otherwise} \end{cases}
$$

where $f^{l}(h^{l-1})$ denotes the unedited model’s activation of the $l$ -th layer. $h^{l-1}$ can be seen as a query to the codebook, and $\mathrm{GRACE}(h^{l-1})$ retrieves the value associated with its closest key. $\epsilon_{i}^{l}$ and $\mathbb{K}_ {i}^{l}$ are the influence radius and key $i$ in layer $l$ , respectively, and $d(\cdot)$ is a distance function. We follow related work [41] and use Euclidean distance for $d(\cdot)$ in our experiments; changing this is trivial. Through explicit similarity search, we use the fact that large models encode semantic similarity with respect to their tasks in their latent spaces. This lets GRACE edits generalize to similar inputs in the future. If a new input is unlike any cached keys, GRACE simply defers to $f$ ’s pretrained weights. This way, GRACE layers limit interference with $\mathcal{D}_{\mathrm{train}}$ by leaving the model weights unaltered, which helps when input distributions shift [45].

其中 $f^{l}(h^{l-1})$ 表示未编辑模型在第 $l$ 层的激活值。$h^{l-1}$ 可视为对码本的查询,而 $\mathrm{GRACE}(h^{l-1})$ 会检索与其最接近键关联的值。$\epsilon_{i}^{l}$ 和 $\mathbb{K}_ {i}^{l}$ 分别是第 $l$ 层的影响半径和键 $i$,$d(\cdot)$ 为距离函数。我们遵循相关工作 [41] 在实验中使用欧氏距离作为 $d(\cdot)$——修改此项非常容易。通过显式相似性搜索,我们利用了大模型在其潜在空间中编码任务相关语义相似性的事实。这使得GRACE编辑能泛化到未来类似的输入。若新输入与任何缓存键都不相似,GRACE会直接调用 $f$ 的预训练权重。通过保持模型权重不变,GRACE层限制了与 $\mathcal{D}_{\mathrm{train}}$ 的相互干扰,这有助于应对输入分布变化 [45]。
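A minimal sketch of the deferral decision in the equation above, wrapping a frozen layer $f^{l}$; it assumes a single query vector per input, and the class and attribute names are illustrative.

```python
# Hedged sketch of a GRACE-wrapped layer implementing the deferral rule above (illustrative names).
import torch
import torch.nn as nn


class GraceLayer(nn.Module):
    def __init__(self, wrapped_layer: nn.Module):
        super().__init__()
        self.layer = wrapped_layer                           # the frozen, unedited f^l
        self.keys, self.values, self.epsilons = [], [], []   # codebook, empty before editing

    def forward(self, h_prev: torch.Tensor) -> torch.Tensor:  # h_prev: (d,), i.e. h^{l-1}
        if self.keys:
            dists = torch.cdist(h_prev.unsqueeze(0), torch.stack(self.keys))[0]  # Euclidean d(.)
            i_star = int(torch.argmin(dists))
            if float(dists[i_star]) < self.epsilons[i_star]:
                return self.values[i_star]        # GRACE(h^{l-1}): retrieve the cached value
        return self.layer(h_prev)                 # otherwise defer to the pre-trained weights
```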

Codebook maintenance. To make an edit, a GRACE layer can perform one of two operations. Each step is described in Algorithm 1. First, if the codebook is empty or the input embedding $h^{l-1}$ falls outside the deferral radius of all existing keys according to distance function $d(\cdot)$ , then a new codebook entry is created and added: $(h^{l-1},v,\epsilon_{\mathrm{init}},y)$ . Thus if $x_{t}$ were passed into $f$ again, $h^{l-1}$ would activate the codebook and value $v$ would be passed to layer $l+1$ . Training $v$ is detailed below.

码本维护。要进行编辑,GRACE层可以执行以下两种操作之一。每个步骤在算法1中描述。首先,如果码本为空,或者输入嵌入$h^{l-1}$根据距离函数$d(\cdot)$落在所有现有键的延迟半径之外,则会创建并添加一个新的码本条目:${(h^{l-1},v,\epsilon_{init},y)}$。因此,如果$x_{t}$再次传入$f$,$h^{l-1}$将激活码本,值$v$将被传递到层$l+1$。$v$的训练细节如下。

Sometimes, a query $h^{l-1}$ will be close enough to an existing key that adding a new entry would cause their $\epsilon$ -balls to overlap. To avoid this, we compare the edit label $y$ to the model’s prediction for the nearest key and distinguish two cases: 1) If the overlapping key’s label is the same as the edit’s label, Expand that key’s $\epsilon$ to encompass the query. 2) If the overlapping key’s label is different from the edit’s label, Split these keys by first decreasing the influence radius of the overlapping key, then adding a new codebook entry where the new key is simply the query $h^{l-1}$ . We set both keys’ $\epsilon$ values to be half their distance apart.

有时,查询 $h^{l-1}$ 会与现有键足够接近,以至于添加新条目会导致它们的 $\epsilon$ 球重叠。为避免这种情况,我们将编辑标签 $y$ 与模型对最近键的预测进行比较,并区分两种情况:

  1. 若重叠键的标签与编辑标签相同,则扩展该键的 $\epsilon$ 以包含查询。
  2. 若重叠键的标签与编辑标签不同,则通过先减小重叠键的影响半径来拆分这些键,随后添加一个新码本条目,其中新键即为查询 $h^{l-1}$。我们将两个键的 $\epsilon$ 值设置为它们之间距离的一半。

As edits stream over long deployments, by continuously adding and updating the GRACE codebook, layer $l$'s latent space is partitioned according to which inputs needed edits. When not performing edits, these codebook maintenance operations are bypassed, and keys are entirely frozen. GRACE thus introduces a new model editing paradigm in which edits can be made sequentially, similar edits are encouraged to be edited similarly, and the ultimate influence of new edits can be controlled and monitored explicitly. $\epsilon_{\mathrm{init}}$ is the sole hyperparameter in GRACE, which sets the initial $\epsilon$ value for new codebook entries. Intuitively, using a larger $\epsilon_{\mathrm{init}}$ will create edits with more influence, making edits more general, but increasing the interference with unrelated inputs. In practice, $\epsilon_{\mathrm{init}}$ could be tuned using exogenous data, or GRACE codebooks can periodically be refreshed.

在长期部署过程中,随着编辑不断流入,GRACE通过持续添加和更新代码簿(codebook),将层$l$的潜在空间根据需要编辑的输入进行划分。当不执行编辑时,这些代码簿维护操作会被绕过,且所有键(key)保持完全冻结状态。因此,GRACE引入了一种新的模型编辑范式:支持顺序编辑、鼓励相似编辑采用类似处理方式,并能显式控制和监控新编辑的最终影响范围。$\epsilon_{\mathrm{init}}$是GRACE中唯一的超参数,用于设置新代码簿条目的初始$\epsilon$值。直观而言,使用较大的$\epsilon_{\mathrm{init}}$会使编辑产生更广泛的影响,增强编辑的泛化能力,但也会增加对无关输入的干扰。实践中,可通过外部数据调整$\epsilon_{\mathrm{init}}$,或定期刷新GRACE代码簿。

Training GRACE Values. When making an edit with GRACE, either a new key–value pair is learned or an existing key–value pair is updated. To ensure that newly-learned values correct the model’s behavior, we train them directly using backpropagation through the finetuning loss on the model’s prediction given the edit. The learned value $v$ then replaces $h^{l}$ for the rest of the forward pass. In our experiments, we train values using 100 gradient descent steps to ensure the model’s behavior is updated.

训练GRACE值。使用GRACE进行编辑时,会学习新的键值对或更新现有键值对。为确保新学值能修正模型行为,我们通过微调损失对模型预测的反向传播直接训练这些值。学习到的值$v$将在后续前向传播中替换$h^{l}$。实验中,我们采用100次梯度下降步来训练这些值,确保模型行为得到更新。
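A minimal sketch of this value-training step, assuming a helper `forward_with_value(x, v)` that runs the model while substituting `v` for $h^{l}$ at the GRACE layer; the helper, optimizer choice, and learning rate are illustrative assumptions.

```python
# Hedged sketch: learn one codebook value with ~100 gradient steps on the edit's finetuning loss.
import torch


def train_value(value, forward_with_value, loss_fn, x_edit, y_edit, steps=100, lr=0.1):
    """`forward_with_value(x, v)` is a hypothetical hook that swaps v in for h^l in the forward pass."""
    value = value.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([value], lr=lr)       # optimizer and lr are illustrative choices
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(forward_with_value(x_edit, value), y_edit)  # standard finetuning loss
        loss.backward()
        opt.step()
    return value.detach()
```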

Algorithm 1: Update Codebook at layer $l$

算法 1: 更新第 $l$ 层的码书
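A hedged sketch of the maintenance step that Algorithm 1 performs, restating the add / Expand / Split rules from the preceding paragraphs; the function signature and the overlap test are our assumptions, not the authors' pseudocode.

```python
# Illustrative codebook-maintenance step for one edit (add / Expand / Split); not the authors' code.
import torch


def update_codebook(keys, values, epsilons, labels, query, y, eps_init, value_dim):
    """Parallel lists hold the codebook; `query` is h^{l-1} and `y` is the corrected label."""
    def add_entry(radius=eps_init):
        keys.append(query.detach())
        values.append(torch.randn(value_dim, requires_grad=True))  # trained with the edit's finetuning loss
        epsilons.append(radius)
        labels.append(y)

    if not keys:
        add_entry()
        return
    dists = torch.cdist(query.unsqueeze(0), torch.stack(keys))[0]
    i = int(torch.argmin(dists))
    dist = float(dists[i])
    if dist > epsilons[i] + eps_init:   # a new eps_init-ball here would not overlap (assumed test)
        add_entry()
    elif labels[i] == y:                # same label: EXPAND the nearest ball to cover the query
        epsilons[i] = max(epsilons[i], dist)
    else:                               # different label: SPLIT, both radii set to half the distance
        epsilons[i] = dist / 2
        add_entry(radius=dist / 2)
```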


GRACE layers with sequential inputs. For models with different representations per input token, like transformers, we must choose 1) which token should be GRACE’s input query, and 2) which token to replace with a retrieved value in the subsequent layer. In practice, we find that broadcasting the value to each token is reasonable for tasks like classification, since this gives values strong control over the model’s behavior and makes them easy to learn. For autoregressive models, however, picking the right tokens for the query and value is more nuanced. We opt for replacing only the final token of an input prompt, which we verify experimentally, since compressing future generated text into tokens is an interesting and burgeoning direction in itself [32]. Upon choosing these tokens, GRACE naturally applies to all popular transformer models.

具有顺序输入的GRACE层。对于每个输入token具有不同表征的模型(如Transformer),我们必须选择:1) 哪个token应作为GRACE的输入查询;2) 在后续层中哪个token将被检索值替换。实验表明,对分类等任务而言,将检索值广播至每个token是合理的,因为这样能使检索值对模型行为产生强控制且易于学习。但对于自回归模型,选择查询与替换token则更为微妙。我们选择仅替换输入提示的最后一个token(经实验验证),因为将未来生成文本压缩为token本身就是一个新兴研究方向[32]。选定这些token后,GRACE便可自然适用于所有主流Transformer模型。
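The two token-handling choices above can be sketched as follows; the tensor shapes (sequence length by hidden size) and function names are assumptions for illustration.

```python
# Illustrative token handling for a retrieved value at a GRACE layer (hypothetical helpers).
import torch


def broadcast_value(h: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    """Classification-style models: h is (seq_len, d); use the value at every token position."""
    return value.unsqueeze(0).expand_as(h)


def replace_final_token(h: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    """Autoregressive models: replace only the final prompt token's hidden state with the value."""
    h = h.clone()
    h[-1] = value
    return h
```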

2.3 Illustrative Example

2.3 示例说明

To garner intuition about GRACE, we provide an illustrative experiment on synthetic data. As shown in Figure 2, we sample 100 instances from two 2D distributions corresponding to classes. We then train a three-layer binary classifier with two 100-dimensional hidden layers and ReLU activations. Next, we introduce edits with flipped labels, simulating local label shift at test time. Using a single

为了直观理解 GRACE,我们在合成数据上进行了示例实验。如图 2 所示,我们从两个对应类别的二维分布中各采样 100 个实例。随后,我们训练了一个具有两个 100 维隐藏层和 ReLU 激活函数的三层二元分类器。接着,我们通过翻转标签引入编辑操作,模拟测试时的局部标签偏移。使用单一...


Figure 2: Illustrative example of GRACE. We train a model on separable data in (a), then introduce locally-flipped labels at test time in (b). In (c), the original model unsurprisingly misclassifies these label-flipped instances. In (d), GRACE fixes these labels without impacting other inputs.

图 2: GRACE的示例说明。我们在(a)中对可分离数据训练模型,然后在(b)的测试阶段引入局部翻转的标签。在(c)中,原始模型不出所料地对这些标签翻转的实例进行了错误分类。在(d)中,GRACE修正了这些标签且不影响其他输入。

| Setting | Model | Test Retention Dataset | N | Pre-edit | Editing Dataset | N | Pre-edit |
|---|---|---|---|---|---|---|---|
| QA | T5 (60m) | NQ [20] | 1000 | .72 F1 | zsRE [24] | 1000 | .31 F1 |
| Clf. | BERT (120m) | SCOTUS 1982-1991 [6] | 914 | .99 Acc | SCOTUS 1992-2009 [6] | 931 | .55 Acc |
| Halluc. | GPT-2 (1.5B) | OpenWebText [10] | 1000 | 15.98 PPL | SelfCheckGPT | 1392 | 132.7 PPL |

Table 1: Dataset statistics for main results. Test retention is the testing set of each model’s training data. $N$ is the number of samples. Pre-edit is the unedited model’s performance on each dataset.

表 1: 主要结果的数据集统计。测试保留集 (Test retention) 指各模型训练数据中的测试集。$N$ 表示样本数量。编辑前 (Pre-edit) 是未编辑模型在各数据集上的表现。

3 Experiments

3 实验

3.1 Experimental Setup

3.1 实验设置

We evaluate GRACE’s capacity to edit models hundreds to thousands of times sequentially. This is in contrast to prior works, which largely evaluate using single edits. Further, much of the model editing literature relies on synthetic edits, often generated by flipping random labels. Beyond unrealistically assessing performance, it is well-established that learning random versus natural labels is vastly different [49]. Instead, we correct authentic mistakes made by models, each time an error is made.

我们评估了GRACE对模型进行数百至数千次连续编辑的能力。这与先前主要评估单次编辑效果的研究形成鲜明对比。此外,现有模型编辑文献多依赖合成编辑(通常通过随机翻转标签生成)。这种做法不仅无法真实评估性能,而且已有明确证据表明学习随机标签与自然标签存在显著差异 [49]。我们选择在模型每次出现错误时,对其真实错误进行修正。

Baselines. We compare against continual learning and model editing methods. First, we continually finetune (FT) [25] on streaming errors. To reduce overfitting, we also compare with Elastic Weight Consolidation (EWC) [19] and a simple form of experience replay [36] where we periodically retrain the model (Retrain) on all previous edits. Second, we compare against model editors MEND [30] and Defer, inspired by SERAC [31] but altered for our setup. We also compare against ROME [28] in our language modeling experiment. Finally, we ablate GRACE by replacing our discrete search with a Memory network containing a memory module indexed by a soft attention mechanism. Details for all comparisons are in Appendix C.

基线方法。我们对比了持续学习和模型编辑方法。首先,我们在流式错误数据上持续进行微调 (FT) [25]。为减少过拟合,同时对比了弹性权重固化 (EWC) [19] 和一种简易形式的经验回放 [36] (Retrain) ,即定期在所有历史编辑数据上重新训练模型。其次,对比了模型编辑器 MEND [30] 和受 SERAC [31] 启发但适配本实验的 Defer 方法。在语言建模实验中还对比了 ROME [28]。最后,通过将离散搜索替换为采用软注意力机制索引的记忆网络模块 (Memory network) ,对 GRACE 进行消融实验。所有对比实验细节详见附录 C。

Datasets and Pre-trained Models. We evaluate GRACE on three sequential editing tasks with corresponding pre-trained models, as shown in Table 1. 1) We edit a 60-million-parameter T5 model [35] trained for context-free question-answering, as is used in [30]. We extract potential edits from the validation set of zsRE [24], following the editing literature [16, 30]. This model achieves F1 of .72 on NQ and .31 on zsRE prior to editing. Further details are in Appendix B.1. 2) We edit a 110-million-parameter BERT classifier trained for a new editing task with label shift using the SCOTUS dataset from Fairlex [6]. The prediction task is to categorize U.S. Supreme Court documents over multiple decades into 11 topics. Over time, categorization rules change, so label distributions shift. We train a BERT classifier on the $7.4\mathrm{k}$ training cases from 1946-1982, then make edits on 931 cases from 1991-2009. We also introduce additional, realistic label shifts, as discussed in Appendix B.2. This model achieves Accuracy of 0.99 on the training set and .55 on the edit set. 3) We introduce a new editing task by correcting a GPT language model’s hallucinations. In [27], the authors prompt GPT-3 to generate 238 wikipedia-style biographies using subjects from WikiBio. They then annotate the factual accuracy of each sentence, recording which are hallucinations. We propose editing inaccurate sentences by replacing them with corresponding sentences in the true wikipedia entries. We include edits for all 238 biographies, creating 1392 sequential edits and 592 already-accurate outputs. We then edit GPT2-XL, which has 1.5B parameters. However, GPT2-XL was not trained on this task, so it has high perplexity (PPL) on all sentences. Therefore, we finetune GPT2-XL on these data mixed with sentences from OpenWebText [10], a public version of GPT2’s training data. Our final GPT2-XL model has PPL of 15.98 on OpenWebText (comparable to the original), 8.7 on already-accurate outputs, and 132.7 on intended edits, indicating a need for editing and room for improvement. Further details are available in Appendix B.3.

数据集与预训练模型。我们在三个序列编辑任务上评估GRACE,并使用相应的预训练模型,如表1所示。1) 我们编辑了一个用于无上下文问答的6000万参数T5模型[35],如[30]中所用。我们按照编辑文献[16,30]的方法,从zsRE[24]的验证集中提取潜在编辑项。该模型在编辑前在NQ上的F1得分为0.72,在zsRE上为0.31。更多细节见附录B.1。2) 我们使用Fairlex[6]的SCOTUS数据集,编辑了一个1.1亿参数的BERT分类器,用于处理带标签偏移的新编辑任务。预测任务是将数十年的美国最高法院文件分类为11个主题。随着时间的推移,分类规则会发生变化,因此标签分布也会偏移。我们在1946-1982年的7.4k训练案例上训练BERT分类器,然后在1991-2009年的931个案例上进行编辑。如附录B.2所述,我们还引入了额外的现实标签偏移。该模型在训练集上的准确率为0.99,在编辑集上为0.55。3) 我们通过纠正GPT语言模型的幻觉引入了一个新的编辑任务。在[27]中,作者提示GPT-3使用WikiBio的主题生成238篇维基百科风格的传记,然后标注每个句子的事实准确性,记录哪些是幻觉。我们建议通过用真实维基百科条目中的相应句子替换不准确的句子来进行编辑。我们包含了对所有238篇传记的编辑,创建了1392个序列编辑和592个已经准确的输出。然后我们编辑拥有15亿参数的GPT2-XL。然而,GPT2-XL并未针对此任务进行训练,因此对所有句子都有较高的困惑度(PPL)。因此,我们在这些数据与Open Web Text10的混合句子上对GPT2-XL进行了微调。我们的最终GPT2-XL模型在Open Web Text上的PPL为15.98(与原始版本相当),在已准确输出上为8.7,在预期编辑上为132.7,表明需要编辑并有改进空间。更多细节见附录B.3。

Metrics. To compare each method, we measure three main metrics that align with prior work [16, 28].

指标。为了比较每种方法,我们测量了与先前工作[16, 28]一致的三个主要指标。

  1. Edit Success (ES): We check whether an edit has been successful using ES: $m(y,\hat{y})$ , where $m(\cdot)$ is a task-specific measure of accuracy: standard F1 for question answering, Accuracy for classification and Perplexity (PPL) for generation.
     编辑成功率 (ES): 我们通过ES指标判断编辑是否成功: $m(y,\hat{y})$ , 其中 $m(\cdot)$ 是任务特定的准确率衡量标准: 问答任务采用标准F1值, 分类任务采用准确率, 生成任务采用困惑度 (PPL)。
zsRE: T5, F1 ↑; SCOTUS: BERT, Acc ↑; Hallucination (Halluc.): GPT2-XL, PPL ↓.

| Method | zsRE TRR | zsRE ERR | zsRE Avg. | zsRE #E | SCOTUS TRR | SCOTUS ERR | SCOTUS Avg. | SCOTUS #E | Halluc. TRR | Halluc. ERR | Halluc. ARR | Halluc. #E | time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FT [25] | .56 | .82 | .69 | 1000 | .52 | .52 | .52 | 415 | 1449.3 | 28.14 | 107.76 | 1392 | .26 (.07) |
| FT+EWC [19] | .51 | .82 | .66 | 1000 | .67 | .50 | .58 | 408 | 1485.7 | 29.24 | 109.59 | 1392 | .29 (.06) |
| FT+Retrain [36] | .27 | .99 | .63 | 1000 | .67 | .83 | .75 | 403 | 2394.3 | 35.34 | 195.82 | 1392 | 23.4 (13.2) |
| MEND [30] | .25 | .27 | .26 | 1000 | .19 | .27 | .23 | 672 | 1369.8 | 1754.9 | 2902.5 | 1392 | .63 (.10) |
| Defer [31] | .72 | .31 | .52 | 1000 | .33 | .41 | .37 | 506 | 8183.7 | 133.3 | 10.04 | 1392 | .07 (.02) |
| ROME [28] | – | – | – | – | – | – | – | – | 30.28 | 103.82 | 14.02 | 1392 | .64 (.28) |
| Memory | .25 | .27 | .26 | 1000 | .21 | .20 | .21 | 780 | 25.47 | 79.30 | 10.07 | 1392 | .11 (.02) |
| GRACE | .69 | .96 | .82 | 1000 | .81 | .82 | .82 | 381 | 15.84 | 7.14 | 10.00 | 1392 | .13 (.02) |

GRACE codebook sizes: 137 keys (7.30 edits/key) on zsRE, 252 keys (1.51 edits/key) on SCOTUS, 1341 keys (1.04 edits/key) on Hallucination.

Table 2: Comparison of GRACE to existing methods. Metrics shown are computed after all sequential edits. For Hallucination, we also compute perplexity retention on already-accurate sentences (ARR) and the average time per edit. Parentheses in time denote standard deviation. We also count the number of keys GRACE used. GRACE achieves the best TRR/ERR balance while making edits faster than most comparisons and using small codebooks for zsRE and SCOTUS. A large codebook is required for Hallucination, since each edit label is unique.

表 2: GRACE与现有方法的对比。所有指标均在完成全部顺序编辑后计算得出。对于Hallucination任务,我们还计算了已准确句子的困惑度保留率(ARR)以及每次编辑的平均耗时。耗时括号内标注标准差。同时统计了GRACE使用的键值数量。GRACE在zsRE和SCOTUS数据集上使用小型码本即实现最优TRR/ERR平衡,且编辑速度快于多数对比方法。由于Hallucination任务每个编辑标签均唯一,因此需要较大码本。

We also track the number of edits (#E), which may vary by editor, as different editors will lead to different mistakes in practice. Further implementation details are available in Appendix A.

我们还追踪编辑次数 (#E),该数值可能因编辑器而异,因为不同编辑器在实践中会导致不同的错误。更多实现细节详见附录A。

3.2 Results

3.2 结果

3.2.1 Comparisons to existing methods

3.2.1 与现有方法的对比

We first find that GRACE outperforms existing methods after long sequences of edits, as shown in Table 2. We focus here on TRR and ERR, which trade off against each other since each can be achieved in isolation, though all metrics are reported in Appendix G. On zsRE and SCOTUS, we focus on TRR, ERR, and their average. ROME is incomparable on these datasets as it is only proposed for GPT models. On Hallucination, we include Accurate Retention Rate (ARR), which is the edited model’s perplexity on sentences on which it was already accurate.

我们首先发现,如表2所示,GRACE在长序列编辑后表现优于现有方法。我们重点关注TRR和ERR,这两者存在权衡关系(因为可以单独实现任一指标),完整指标见附录G。在zsRE和SCOTUS数据集上,我们主要分析TRR、ERR及其平均值。ROME方法不适用于这些数据集(因其仅针对GPT模型设计)。在Hallucination任务中,我们还纳入了准确保留率(ARR, Accurate Retention Rate),即模型在原本预测正确的句子上计算出的困惑度。

On zsRE and SCOTUS, GRACE’s averaged TRR and ERR outperforms its closest competitors by 19% and 9%, respectively. Comparing the average performance across all methods, GRACE’s improvement is 86% and 129%, due to other methods’ inability to balance ERR and TRR. For instance, while Defer achieves the highest TRR on SCOTUS, it trades off ERR. In contrast, FT+Retrain achieves the highest ERR on SCOTUS, but trades off TRR. On both datasets, MEND and Defer both struggle to balance TRR and ERR without access to privileged data.

在zsRE和SCOTUS数据集上,GRACE的TRR和ERR平均值分别比其最接近的竞争对手高出19%和9%。与所有方法的平均性能相比,由于其他方法无法平衡ERR和TRR,GRACE的改进幅度达到86%和129%。例如,Defer在SCOTUS上实现了最高的TRR,但牺牲了ERR;而FT+Retrain在SCOTUS上实现了最高的ERR,却牺牲了TRR。在这两个数据集上,MEND和Defer在无法访问特权数据的情况下都难以平衡TRR和ERR。

GRACE also outperforms the comparisons on the Hallucination task. We observe that the finetuning methods easily overfit to new edits, leading to competitive ERR but poor TRR. However, their good ERR comes at the expense of ARR in these methods. We also find that ROME and Memory both perform competitively with GRACE, indicating the value of parameter efficiency in lifelong model editing. Finally, we report the average wall-clock time it takes to perform one edit, finding that GRACE makes edits twice as fast as finetuning while competing with other Adaptors.

GRACE 在幻觉任务上的表现也优于其他对比方法。我们发现微调方法容易对新编辑过拟合,导致 ERR 表现优异但 TRR 较差。然而在这些方法中,其优异的 ERR 是以牺牲 ARR 为代价的。我们还发现 ROME 和 Memory 与 GRACE 表现相当,这表明参数效率在终身模型编辑中的价值。最后,我们报告了执行一次编辑所需的平均挂钟时间,发现 GRACE 的编辑速度是微调方法的两倍,同时与其他适配器方法保持竞争力。

Excitingly, GRACE can achieve high-quality edits with small codebooks. For example, GRACE uses few keys to edit T5 on zsRE: 1000 edits are made using only 137 keys. Such compression is only feasible by managing the $\epsilon$ -balls effectively over time. This is also parameter-efficient: with details in Section 3.2.3, the zsRE codebook contains 210,569 scalar values (0.35% of T5’s parameters) and only 70,281 are learnable. The number of keys is also lower-bounded by the number of unique edit labels and is related to the complexity of the inputs. On SCOTUS, GRACE also achieves a relatively small codebook, showing a balance between input complexity and generalizability of each edit. Since SCOTUS is a document classification task with 11 classes, there could be as few as 11 keys, but given each document contains many sentences, there is a lower chance that the same keys can be used for many documents. That GRACE’s keys represent 1.51 edits on average indicates significant generalization. As expected, the lower-bound on codebook size is evident in the Hallucination task, where GRACE uses a new key for almost every edit. This is because edit labels are unique sentences. Despite a large codebook, such success at selectively inserting tokens that generate full sentences poses an exciting direction for controllable and editable language modeling.

令人振奋的是,GRACE能够通过小型码本(codebook)实现高质量编辑。例如在zsRE数据集上,GRACE仅用137个键(key)就完成了对T5模型的1000次编辑。这种压缩效果只有通过有效管理$\epsilon$-balls才能实现。该方法还具有参数高效性:如第3.2.3节所述,zsRE码本包含210,569个标量值(占T5参数的0.35%),其中仅70,281个需要学习。键的数量下限由唯一编辑标签数量决定,并与输入复杂度相关。在SCOTUS任务中,GRACE同样实现了较小的码本规模,展现了输入复杂度与编辑泛化能力之间的平衡。由于SCOTUS是包含11个类别的文档分类任务,理论上最少只需11个键,但考虑到每份文档包含多个句子,相同键跨文档复用的概率较低。GRACE平均每个键对应1.51次编辑的结果,显示出显著的泛化能力。正如预期,在Hallucination任务中码本规模的下限效应尤为明显——GRACE几乎为每次编辑都使用了新键,这是因为编辑标签都是独特句子。尽管需要较大码本,但这种选择性插入token来生成完整语句的成功案例,为可控可编辑语言建模开辟了令人振奋的研究方向。


Figure 3: ES and TRR while editing GPT2-XL on Hallucination. Lower values are better because TRR and ERR measure perplexity. GRACE outperforms the comparisons by making successful edits while maintaining the model’s training knowledge. All metrics are shown in Figure 13.

图 3: GPT2-XL在幻觉任务上的编辑成功率(ES)和训练知识保留率(TRR)。数值越低越好,因为TRR和ERR衡量的是困惑度。GRACE方法在成功完成编辑的同时保持了模型的训练知识,表现优于其他对比方法。所有指标如图13所示。

In Figure 3, we show a more-detailed comparison of all methods for the Hallucination editing task over time, focusing on TRR and ES (see Figure 13 for all metrics). As expected, finetuning methods excel at ES, but perform poorly at TRR. On the flip side, ROME and Memory have reasonably-low ES and TRR scores, though both suffer as edits progress. While ROME is competitive early in editing, it especially suffers on ERR (Table 2) and GRACE’s TRR remains $2\mathbf{x}$ better after making all edits.

图3中,我们展示了随时间推移所有方法在幻觉编辑任务上的详细对比,重点关注TRR和ES指标(完整指标见图13)。与预期一致,微调方法在ES上表现优异,但TRR表现较差。相反,ROME和Memory的ES与TRR分数都相对较低,但随着编辑次数增加,两者性能都会下降。虽然ROME在编辑初期具有竞争力,但其ERR表现尤其不佳(表2),而GRACE的TRR在完成所有编辑后仍保持$2\mathbf{x}$的优势。

3.2.2 Model Analysis: Memorization vs. Generalization

3.2.2 模型分析:记忆与泛化

Next, we study GRACE’s memorization vs. generalization performance by editing T5 on zsRE over extremely long sequences of edits while varying $\epsilon_{\mathrm{init}}$ and the edited layer. We split each zsRE question’s set of rephrasings into two sets: edits and holdouts. Then, we pass edits through GRACE until 3,000 edits are made. This is extremely large: even 10 edits catastrophically decays performance [12], and [16] performs roughly half as many edits. After each edit, we measure TRR, ERR, and F1 on the entire Holdout set, and record the number of keys. Since we evaluate GRACE on the entire Holdout set after each edit, the results start low, since GRACE has seen no rephrasings of held-out edits. Therefore, later measures of Holdout performance are more representative than earlier ones. Figure 4 shows our results for $\epsilon_{\mathrm{init}}=0.1$ and $\epsilon_{\mathrm{init}}=3.0$ , while the rest are in Appendix H. We derive the following findings.

接下来,我们通过在不同初始参数$\epsilon_{\mathrm{init}}$和编辑层条件下对T5模型进行zsRE任务的超长序列编辑,研究GRACE的记忆与泛化性能。将每个zsRE问题的改写集合划分为编辑集和保留集,持续通过GRACE处理编辑集直至完成3,000次编辑——这个规模远超现有研究:[12]显示10次编辑就会导致性能崩溃,而[16]的编辑量仅为当前实验的一半左右。每次编辑后,我们测量整个保留集上的TRR、ERR和F1值,并记录键的数量。由于GRACE在每次编辑后都需在整个保留集上评估,初始性能会因未接触保留编辑的改写样本而偏低,因此后期保留集的性能指标更具代表性。图4展示了$\epsilon_{\mathrm{init}}=0.1$和$\epsilon_{\mathrm{init}}=3.0$的实验结果(其余数据见附录H),主要发现如下:

Layer and $\boldsymbol{\epsilon}_ {\mathbf{init}}$ choice balance memorization and generalization. We first find that different blocks lead to different editing performance. Notably, editing Blocks two and four achieve high TRR, high ERR, and strong generalization for both choices of $\epsilon_{\mathrm{init}}$ On the contrary, for Block 6 we see that since each $\epsilon_{\mathrm{init}}$ is small, they do not generalize at all and also lead to poor ERR. As expected, when $\epsilon_{\mathrm{init}}$ is small, TRR performance is optimal, since most edits require a new key. While creating a large codebook can be feasible, the trade-off in Holdout becomes clear, as detailed in the next finding. The relationship between layer choice, performance, and the size of the codebook likely stems from a layer’s representational capacity: Layers that map semantically-equivalent inputs near one another will be easier to edit with GRACE.

层和 $\boldsymbol{\epsilon}_ {\mathbf{init}}$ 的选择需要在记忆与泛化之间取得平衡。我们首先发现不同模块会导致不同的编辑效果。值得注意的是,在两种 $\epsilon_{\mathrm{init}}$ 选择下,编辑第二和第四模块都能实现高TRR(真实替换率)、高ERR(编辑成功率)以及强泛化能力。相反,第六模块由于每个 $\epsilon_{\mathrm{init}}$ 值过小,完全无法泛化且导致ERR表现不佳。正如预期,当 $\epsilon_{\mathrm{init}}$ 较小时,TRR性能最优,因为多数编辑需要新密钥。虽然创建大型码本可行,但下一发现将详述其在保留集上的明显权衡。层选择、性能与码本规模的关系可能源于层的表征能力:能将语义等价输入映射到相近位置的层,更容易通过GRACE实现编辑。

GRACE edits generalize to unseen inputs. Steadily increasing Holdout shows that that GRACE edits can generalize to previously-unseen holdout edits. Excitingly, interior layers appear to generalize better than early and late layers, which is backed up by their use of fewer keys. Larger $\epsilon_{\mathrm{init}}$ values also lead to better generalization, as expected, implying that the semantically-similar inputs indeed land in the same deferral radii. The later layer, Block 6, appears to

GRACE编辑能够泛化到未见过的输入。逐步增加的保留集(Holdout)实验表明,GRACE编辑可以泛化到之前未见的保留编辑。令人振奋的是,中间层似乎比早期层和后期层具有更好的泛化能力,这与其使用更少的键(key)相吻合。正如预期的那样,较大的$\epsilon_{\mathrm{init}}$值也带来了更好的泛化性能,这意味着语义相似的输入确实落在相同的延迟半径内。较深层Block 6的表现显示...

GRACE codebooks stay small and stabilize over time. Finally, the number of keys over time steadily flattens, indicating that the $\epsilon$ values indeed adapt to the data distribution. As we increase $\epsilon_{\mathrm{init}}$ the resultant codebooks get smaller, indicating control over codebook size. Small codebooks are also important for parameter efficiency and in generalization.

GRACE码本保持较小规模并随时间趋于稳定。最终,键数量随时间增长逐渐趋平,表明$\epsilon$值确实能自适应数据分布。当我们增大$\epsilon_{\mathrm{init}}$时,生成的码本会更小,这显示了对码本尺寸的控制能力。小规模码本对参数效率和泛化能力也至关重要。


Figure 4: Impact of $\epsilon_{\mathrm{init}}$ and block choice for GRACE editing T5 on zsRE for 3000 sequential edits. Other $\epsilon_{\mathrm{init}}$ values are in Appendix H. Along with TRR and ERR, we also measure F1 on a “Holdout” edit set containing unseen rephrasings of all edits. We find that blocks 0 and 6 use more keys and achieve higher TRR, but can lead to lower ERR and generalize worse, given lower holdout values.

图 4: $\epsilon_{\mathrm{init}}$ 和模块选择对 GRACE 编辑 T5 在 zsRE 上进行 3000 次连续编辑的影响。其他 $\epsilon_{\mathrm{init}}$ 值见附录 H。除了 TRR 和 ERR 外,我们还测量了包含所有编辑未见改写版本的"保留"编辑集的 F1 分数。我们发现模块 0 和 6 使用了更多键并实现了更高的 TRR,但由于保留值较低,可能导致更低的 ERR 和更差的泛化性能。

3.2.3 Parameter Efficiency

3.2.3 参数效率

GRACE’s memory requirements are small and straightforward: A new edit requires $|h^{l-1}|+|h^{l}|+1$ parameters, where $|h^{l-1}|$ is the dimension of the key, $|h^{l}|$ is the dimension of the value, and $\epsilon_{\mathrm{init}}$ is a scalar. Further, the key’s $|h^{l-1}|$ parameters are not trained. As detailed in Appendix E, this is comparable to the alternatives when performing thousands of edits, so performance gain is not from parameter count. This intuition is reinforced by the fact that at inference time, if the Adaptor is activated, predictions are only altered using the $|h^{l-1}|+|h^{l}|+1$ parameters of the chosen entry.

GRACE的内存需求小而直接:每次新编辑需要 $|h^{l-1}|+|h^{l}|+1$ 个参数,其中 $|h^{l-1}|$ 是键(key)的维度, $|h^{l}|$ 是值(value)的维度, $\epsilon_{\mathrm{init}}$ 为标量。此外,键的 $|h^{l-1}|$ 参数不参与训练。如附录E所述,在进行数千次编辑时,这与替代方案相当,因此性能提升并非来自参数数量。这一直觉在推理时得到强化:若适配器(Adaptor)被激活,预测仅通过所选条目的 $|h^{l-1}|+|h^{l}|+1$ 个参数进行修改。
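As a worked check against the zsRE numbers reported in Section 3.2.1, the per-entry count above is consistent with a key dimension of 1024 and a value dimension of 512 at the edited layer (these dimensions are our inference from the reported totals, not stated explicitly in the text):

$$
137 \times \big(\underbrace{1024}_{|h^{l-1}|} + \underbrace{512}_{|h^{l}|} + \underbrace{1}_{\epsilon}\big) = 210{,}569 \ \text{total scalars}, \qquad 137 \times (512 + 1) = 70{,}281 \ \text{learnable}.
$$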

3.2.4 Interpreting GRACE Codebooks

3.2.4 GRACE码本解析

A key benefit of GRACE is that learned codebooks can be detached from the model and inspected. This way, edits can easily be undone without impacting the model, and the codebooks can be inspected throughout editing. To demonstrate this advantage, we inspect how keys and their $\epsilon$ change throughout editing Block 4 of the T5 QA model. In this experiment, we investigate how well GRACE edits generalize to unseen edits. At each edit in a sequence of 1,000 inputs, we pass the entire holdout set of edits through the newly-edited model. For each holdout instance, we record in which key’s $\epsilon$ -ball it lands, if any. This way, we track whether a newly-added key generalizes to multiple holdouts successfully. We also track what proportion of the holdouts land inside any key to evaluate generalization over time. In Figure 5, we summarize the results from one such experiment, with more details in Appendix D. We find that the number of holdouts per key stabilizes over time. Interestingly, larger $\epsilon_{\mathrm{init}}$ values capture too many holdouts per question, which explains the trade-off between generalization and TRR. In the Appendix, we also show that the number of holdouts is stable for some keys, while it is highly variable for others. This implies that our codebook maintenance strategy can have big ripple effects depending on the key. Further, some regions of the latent space are likely better than others for performing GRACE edits.

GRACE的一个关键优势是学习到的码本(codebook)可以从模型中分离并检查。这种方式使得编辑可以轻松撤销而不影响模型,并且码本在整个编辑过程中都可被检查。为展示这一优势,我们研究了在编辑T5 QA模型第4模块时键(key)及其$\epsilon$如何变化。本实验中,我们探究GRACE编辑对未见编辑的泛化能力。在1,000个输入的编辑序列中,每次编辑后我们都将完整保留的编辑集通过新编辑模型。对每个保留实例,记录其落入哪个键的$\epsilon$-球(若有)。由此可追踪新增键是否成功泛化到多个保留实例。我们还追踪保留实例落入任意键的比例以评估随时间推移的泛化情况。图5汇总了一个此类实验的结果,附录D提供更多细节。我们发现每个键对应的保留实例数量会随时间趋于稳定。有趣的是,$\epsilon_{\mathrm{init}}$值会为每个问题捕获过多保留实例,这解释了泛化能力与TRR之间的权衡。附录中还显示部分键对应的保留实例数量稳定,而其他键则波动较大。这表明我们的码本维护策略会根据键产生较大连锁效应。此外,潜在空间的某些区域可能比其他区域更适合执行GRACE编辑。
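Because the codebook sits outside the model weights, inspection and rollback reduce to operations on its entries; a minimal sketch with hypothetical field names:

```python
# Hedged sketch: inspect a detached GRACE codebook and undo one edit (illustrative structure).

def summarize_codebook(entries):
    """`entries` is a list of dicts with 'key', 'value', 'epsilon', and 'label' fields."""
    for i, e in enumerate(entries):
        print(f"entry {i}: label={e['label']!r}, epsilon={e['epsilon']:.3f}")


def undo_edit(entries, index):
    """Dropping an entry reverts the model to its pre-trained behavior inside that epsilon-ball."""
    return [e for i, e in enumerate(entries) if i != index]
```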

3.2.5 Inference Time

3.2.5 推理时间

To better understand GRACE’s limitations, we compare the inference time of the T5-small QA model before and after editing. We compute the time it takes to run one instance through the model for each of 5,000 edits. We edit T5’s block 4 and use a small $\epsilon_{\mathrm{init}}$ of 0.1 to encourage a quickly-growing codebook. To evaluate the variance in training time, we replicate this experiment ten times, each leading to a codebook of roughly 4500 elements. We then average over 20-timestep windows and show standard deviation across all replications.

为了更好地理解GRACE的局限性,我们比较了编辑前后T5-small问答模型的推理时间。我们计算了在5,000次编辑中每次运行一个实例所需的时间。我们编辑了T5的第4个模块,并使用较小的$\epsilon_{\mathrm{init}}$值0.1以促进快速增长的码本。为了评估训练时间的方差,我们重复了十次实验,每次生成约4500个元素的码本。然后对20个时间步长的窗口取平均值,并展示所有重复实验的标准差。


Figure 5: Interpreting GRACE keys throughout editing. Larger $\epsilon_{\mathrm{init}}$ achieve good generalization. The grey line is the true holdouts per edit.

图 5: 编辑过程中GRACE键的可视化解读。较大的$\epsilon_{\mathrm{init}}$能实现更好的泛化效果。灰色线条表示每次编辑对应的真实保留数据。


Figure 6: Comparing inference time of GRACEedited and unedited T5 QA model as the size of the codebook increases.

图 6: 对比GRACE编辑与未编辑T5问答模型的推理时间随码本规模增加的变化。

We find that inference with a GRACE-edited model was only $1.32\mathrm{x}$ slower than an unedited model on average, as shown in Figure 6. This is a one-time cost that remains fixed even as the codebook grows. Inference time is unchanged because search between one query and a full set of keys can be vectorized. Until the codebook outgrows available memory, this cost remains fixed. That GRACE causes any slowdown is a limitation and an avenue for future work.

我们发现,如图6所示,使用GRACE编辑后的模型进行推理平均仅比未编辑模型慢$1.32\mathrm{x}$。这是一次性成本,即使码本(codebook)增长也保持不变。推理时间不受影响,因为单个查询与完整键集之间的搜索可向量化。只要码本不超过可用内存,该成本就保持固定。GRACE导致的任何速度下降都是当前局限,也是未来工作的改进方向。

4 Related Work

4 相关工作

Model editing. Model editing is a new and active research area where the goal is to make targeted changes to a pre-trained model’s behavior. Most methods propose variants of regularized finetuning via auxiliary data, like training instances from the pre-trained model’s training data or by using semantically-equivalent versions of new edits [37]. Access to such data is non-trivial, especially as training data and paradigms are becoming proprietary, and collecting semantically-equivalent inputs is often unlikely. Recent works have retained these requirements, while extending to pretraining hypernetworks that predict edits [8, 30, 31], often decomposing weight updates into low-rank components [28, 29]. Recent approaches all study transformer architectures due to their popularity and high training cost [48, 51]. To ensure targeted edits, recent methods like MEND [30] and ROME [28] also draw inspiration from parameter-efficient finetuning [15]. But such methods are known to often require more finetuning steps and are prone to overfit more than regular finetuning [39, 50]. Further, recent works show that edited models are deeply fragile [4] and methods for picking which parameters to update are surprisingly unreliable [13].

模型编辑 (Model editing)。模型编辑是一个新兴且活跃的研究领域,其目标是对预训练模型的行为进行针对性修改。大多数方法通过辅助数据提出正则化微调的变体,例如使用预训练模型训练数据中的训练实例或新编辑的语义等效版本 [37]。获取此类数据并非易事,尤其是随着训练数据和范式变得专有化,收集语义等效输入通常不太可能。最近的研究在保留这些要求的同时,扩展到预训练用于预测编辑的超网络 [8, 30, 31],通常将权重更新分解为低秩分量 [28, 29]。由于Transformer架构的流行性和高昂训练成本,近期方法都针对该架构进行研究 [48, 51]。为确保针对性编辑,MEND [30] 和ROME [28] 等最新方法还借鉴了参数高效微调 [15] 的思路。但众所周知,此类方法通常需要更多微调步骤,且比常规微调更容易过拟合 [39, 50]。此外,近期研究表明,编辑后的模型极其脆弱 [4],而选择更新哪些参数的方法出人意料地不可靠 [13]。

Unfortunately, nearly all model editing works consider only static edits, making only one edit to a model. While some recent works like MEMIT [29] indeed perform multiple edits, they do so simultaneously, not over time during deployment. Other works demonstrate that editing performance decays quickly when making multiple edits [30], though recent works like SERAC [31] and MEMIT [29] show burgeoning results in this direction. However, these exciting works still use large amounts of privileged information. Most similar to our work, two recent papers discuss sequential editing. First, [12] shows that after editing the same model 10 times, editing performance drops dramatically. Second, a concurrent paper [16] recently proposed a sequential editing setup. However, their method and implementation is architecture-specific and relies on large sources of unrelated inputs.

遗憾的是,几乎所有模型编辑工作都只考虑静态编辑,即对模型仅进行一次修改。虽然近期如MEMIT [29]等研究确实实现了多次编辑,但这些编辑是同步执行的,而非在部署过程中随时间推移逐步进行。其他研究表明,当进行多次编辑时,编辑性能会迅速衰减 [30],尽管近期如SERAC [31]和MEMIT [29]等研究已在该方向展现出初步成果。然而,这些令人振奋的工作仍依赖大量特权信息。与我们的研究最为接近的是近期两篇关于连续编辑的论文:首先,[12]表明对同一模型进行10次编辑后,编辑性能会急剧下降;其次,一篇同期论文[16]最近提出了连续编辑方案,但他们的方法和实现依赖于特定架构,且需要大量无关输入源。

Continual Learning. Continual learning methods are a reasonable approach to lifelong model editing. Most similar to our setup, recent works have investigated continual finetuning, where large language models are refined over time as new instances arrive. For example, [25] perform a large benchmark of continual finetuning. They find that regularizing finetuning with continual learning methods like Elastic Weight Consolidation [19], Experience Replay [36], and Maximally Interfered Replay [1] quickly decays performance on prior tasks, though it helps to remember some previous inputs. This implies that editing, as opposed to regular continual finetuning, is particularly challenging, since edits are unlikely to be uniformly distributed [14]. One promising avenue for continual learning is key-value methods, stemming from computer vision [26, 34, 42]. For example, recent works have demonstrated continual prompt-learning for NLP [44, 45] for applications like text retrieval [47]. Recent works have shown that discrete key-value methods in particular perform well with shifting distributions [41], with recent works extending to question answering [7]. By caching values, these approaches keep inputs in-distribution for downstream encoders while opening doors to longer-term memory, resources permitting. We corroborate these advantages in our experiments, where we demonstrate GRACE’s robustness to shifting inputs and labels over long sequences of edits.

持续学习。持续学习方法是一种实现终身模型编辑的合理途径。与我们设置最相似的是近期关于持续微调的研究,即大语言模型随着新实例的不断到来而逐步优化。例如,[25] 对持续微调进行了大规模基准测试,发现采用弹性权重巩固 (Elastic Weight Consolidation) [19]、经验回放 (Experience Replay) [36] 和最大干扰回放 (Maximally Interfered Replay) [1] 等持续学习方法进行正则化微调时,虽然能帮助记忆部分先前输入,但会快速降低模型在先前任务上的表现。这表明相较于常规持续微调,编辑任务更具挑战性,因为编辑操作往往呈现非均匀分布特性 [14]。计算机视觉领域衍生的键值方法 [26, 34, 42] 为持续学习提供了可行方向,例如近期研究展示了自然语言处理中持续提示学习 [44, 45] 在文本检索 [47] 等场景的应用。最新研究表明,离散键值方法尤其擅长处理分布偏移问题 [41],相关技术已扩展至问答领域 [7]。通过缓存数值,这些方法既能保持下游编码器的输入分布稳定性,又能在资源允许时实现长期记忆。我们在实验中验证了这些优势,证明 GRACE 在面对长序列编辑中的输入与标签偏移时具有鲁棒性。

5 Limitations and Ethical Considerations

5 局限性与伦理考量

Lifelong model editing is new and challenging, so our method GRACE has limitations. Understandably, adding similarity search to the layers of a model will slow down inference. Still, while we do not emphasize inference time, accelerating GRACE is natural future step that has been successful in similar methods [15, 33]. Future works may also scale GRACE up to multi-layer edits, a setting unconsidered in this work. While GRACE has already scaled up continual editing to $\mathord{\sim}5\mathrm{k}$ edits, real world scenarios might entail approaches that work at an ever larger editing scale, which presents a promising direction of future work. Another limitation of GRACE-style edits is implication: Behaviors are edited in isolation. For example, if we edit a model’s knowledge of the latest pandemic in the U.S. from Swine Flu to COVID, its knowledge of the second-to-last pandemic will not be updated.

终身模型编辑是新颖且具有挑战性的,因此我们的方法GRACE存在局限性。可以理解的是,在模型层中添加相似性搜索会降低推理速度。尽管如此,虽然我们并未强调推理时间,但加速GRACE是未来自然的发展方向,类似方法已取得成果[15, 33]。未来工作还可能将GRACE扩展到多层编辑,这是本文未考虑的设定。虽然GRACE已实现持续编辑扩展至约5千次编辑,但现实场景可能需要更大编辑规模的方法,这为未来工作提供了有前景的方向。GRACE式编辑的另一局限是隐含性:行为编辑是孤立的。例如,若我们将模型对美国最新疫情的认知从猪流感更新为新冠肺炎,其对上一次疫情的认知并不会同步更新。

While editing models can improve their behavior, it can surely be used for harm. For example, a bad actor might edit a LLM to increase hate. This limitation is true for all model editors, GRACE included. However, most model editors today directly update the model’s weights. This makes it hard to trace back what edits have been made to a model and understand their impact. Transparent editors like GRACE are a promising direction for overcoming this problem. For example, GRACE’s codebook can be directly inspected to see what predictions stem from each value and examine properties of the latent space covered by the $\epsilon$ -balls.

虽然编辑模型可以改善其行为,但同样可能被用于恶意目的。例如,攻击者可能通过编辑大语言模型 (LLM) 来增强仇恨言论。这一局限性适用于所有模型编辑器,包括 GRACE。然而,当前大多数模型编辑器会直接修改模型权重,导致难以追溯模型被编辑的内容及其影响。像 GRACE 这样的透明编辑器为解决该问题提供了可行方向——例如可直接检查其码本 (codebook),观察每个取值对应的预测结果,并分析由 $\epsilon$ -球覆盖的潜在空间特性。

6 Conclusions

6 结论

Pre-trained models continue to grow and are being applied to a diverse set of downstream tasks. However, they still misbehave in unpredictable ways when deployed. Correcting such behavior is a challenging problem, especially when only some of the model’s behavior is problematic. While regularized finetuning or retraining on better data can mitigate this problem somewhat, big models remain too expensive to train for most researchers. We instead build on the model editing literature, and study Lifelong Model Editing, an important but understudied problem. With a focus on transformer models for natural language processing, we edit models thousands of times in a row as edits stream during simulated deployments. Our edits only use singular, authentic errors, in contrast to recent works which make synthetic edits and require large amounts of exogeneous data like sets of training edits, semantically-equivalent examples, or pre-training data. We then present GRACE, a plug-in Adaptor for a model’s chosen layer that leaves the trained weights untouched. GRACE Adaptors (1) retain the funct