Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors
Thomas Hartvigsen University of Virginia, MIT hartvigsen@virginia.edu
Swami Sankaranarayanan Sony AI swami.sankaranarayanan@sony.com
Hamid Palangi Microsoft Research hpalangi@microsoft.com
Marzyeh Ghassemi MIT mghassem@mit.edu
Abstract
Deployed language models decay over time due to shifting inputs, changing user needs, or emergent world-knowledge gaps. When such problems are identified, we want to make targeted edits while avoiding expensive retraining. However, current model editors, which modify such behaviors of pre-trained models, degrade model performance quickly across multiple, sequential edits. We propose GRACE, a lifelong model editing method, which implements spot-fixes on streaming errors of a deployed model, ensuring minimal impact on unrelated inputs. GRACE writes new mappings into a pre-trained model’s latent space, creating a discrete, local codebook of edits without altering model weights. This is the first method enabling thousands of sequential edits using only streaming errors. Our experiments on T5, BERT, and GPT models show GRACE’s state-of-the-art performance in making and retaining edits, while generalizing to unseen inputs. Our code is available at github.com/thartvigsen/grace.
1 Introduction
Large scale pre-trained neural networks are the state-of-the-art for many hard machine learning problems, especially in natural language processing [5, 33] and computer vision [9, 34]. But when deployed, they still make unpredictable errors [2, 43]. For example, Large Language Models (LLMs) notoriously hallucinate [17], perpetuate bias [11], and factually decay [8]. Many such errors will arise sequentially in deployment, some of which must be addressed quickly without waiting until new training data is collected [16, 21]. For example, when an LLM generates hate speech or crucial knowledge about the world changes, its behavior must be modified immediately to protect users.
While retraining or finetuning can edit a model’s predictions, doing this frequently is often too computationally expensive. LLaMA [40], for instance, was trained for 21 days on 2,048 A100 GPUs, costing over $\$2.4\mathrm{M}$ and emitting over 1,000 tons of $\mathrm{CO}_{2}$. Even repeatedly retraining or finetuning smaller models [3] quickly costs too much for most practitioners. To enable cheap, targeted updates to big, pre-trained models, we study lifelong model editing. Here, we continually make targeted edits sequentially throughout a model’s deployment. Success means correcting a model’s predictions on a stream of edits without decaying its performance on unrelated and previously-edited inputs.
Two possible approaches to lifelong model editing are continual learning and conventional model editing. While continual learning methods can update models sequentially, when used for sequential editing, they suffer from overfitting [25] and quickly forget previous edits and pre-training data [16, 22]. Alternatively, existing static model editors could also be run sequentially. But the editors that rely on regularized finetuning [8, 28, 29, 37] succumb to overfitting [4], while hypernetwork methods require pre-training on hard-to-access data [30, 31]. Further, these existing approaches require large sets of representative training edits, access to the model’s pre-training data, or semantically-equivalent inputs. Accessing these data is often unrealistic or impractical, and existing editors severely degrade the pre-trained model’s performance after only a few sequential edits [12, 16].

Figure 1: Overview of lifelong model editing with GRACE. a) Models make important errors that must be corrected. b) GRACE makes edits by learning, caching, and selectively retrieving new transformations between layers. c) Edits appear sporadically and require quick fixes, so GRACE codebooks are curated over long sequences of edits.
We study lifelong editing using only singular inputs to edit models, where the model is edited immediately each time an error arrives. This setup extends beyond recent works, which use many inputs for editing, including collected sets of semantically-equivalent inputs, training edits, and the model’s pre-training data. Editing in our realistic, more cost-effective setup is hard because each edit must fix the flagged error, while generalizing to similar future inputs without impacting unrelated model behavior.
To address the challenging lifelong model editing setup, we propose General Retrieval Adaptors for Continual Editing, or GRACE. While we study transformers for natural language processing, GRACE is broadly applicable. As illustrated in Figure 1, GRACE edits a model by adding an Adaptor to a chosen layer, while never changing its weights. This Adaptor then modifies layer-to-layer transformations for select inputs. By caching embeddings for input errors and learning values that decode into desired model outputs, GRACE serves as a codebook in which edits are stored, enabling longer sequences of edits than prior works. To encourage generalization, we leverage the model’s semantic similarity in its latent space by introducing $\epsilon$-balls around cached edits. GRACE then only applies to inputs near existing keys, differing from more traditional Adaptors [15]. By managing the $\epsilon$-balls’ size over time, GRACE makes immediate edits, remembers previous edits, and leaves correct model behaviors intact, making it parameter efficient. Further, since GRACE codebooks leave model weights unaltered and are fully model-agnostic, they also point towards plug-and-play, cost-effective model editing, especially to make crucial spot-fixes between bigger retraining efforts.
Our contributions are as follows:
2 Methods: Lifelong Model Editing with GRACE
2.1 Problem Formulation
The Lifelong Model Editing task is to edit the same model hundreds to thousands of times in a row without forgetting upstream performance or fixes for previous edits. Let $f_{0}$ denote a model with frozen parameters that was pre-trained on dataset $\mathcal{D}_{\mathrm{train}}$. For clarity, we drop the subscript where possible. Assume that $f$ consists of $L$ layers, where $f^{l}(\cdot)$ computes the hidden state at layer $l$. In this work, we assume $f$ is a transformer architecture for natural language, though our principles are general. We then deploy $f$ on a stream of inputs $[x_{0},x_{1},\ldots]$, where $x_{t}\in\mathcal{D}_{\mathrm{edit}}$, which contains samples observed during deployment. We then monitor the model’s predictions $\hat{y}_{t}=f(x_{t})$ over the stream. Note that the prediction tasks during training and deployment are the same. Over time, the model makes errors such that $\hat{y}_{t}\neq y_{t}$, where $y_{t}$ is the true label. To continue safely deploying $f$, we aim to edit $f$ such that $f(x_{t})=y_{t}$. After each edit, the updated $f$ should 1) be successfully edited such that $f(x_{t})=y_{t}$, 2) retain accuracy on prior edits $x_{<t}$, and 3) retain its behavior on its training data: $f(x_{i})=f_{0}(x_{i})\ \forall x_{i}\in\mathcal{D}_{\mathrm{train}}$. In contrast to prior works [8, 28, 30, 31, 37], we assume access to only input edits $x_{t}$ and corrected labels $y_{t}$, since $\mathcal{D}_{\mathrm{train}}$ is often proprietary or prohibitively large, and collecting training edits or semantically-equivalent inputs is often expensive in practice.
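The deployment protocol described above can be sketched as a short loop. This is a minimal illustration only: the `LookupEditor` class and its `predict`/`edit` methods are hypothetical names introduced here, and the toy editor simply memorizes corrections in a dictionary rather than implementing any real editing method. It shows the three requirements being checked: the current edit succeeds, earlier edits are retained, and behavior on training inputs matches $f_{0}$.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


class LookupEditor:
    """Toy stand-in for a model editor (hypothetical, for illustration only).

    It wraps a frozen base model f_0 and stores corrections in a dictionary.
    """

    def __init__(self, base_model: Callable[[Hashable], int]):
        self.base_model = base_model          # f_0, never modified
        self.fixes: Dict[Hashable, int] = {}  # maps x_t -> corrected y_t

    def predict(self, x: Hashable) -> int:
        # Use a stored correction if one exists, otherwise defer to f_0.
        return self.fixes.get(x, self.base_model(x))

    def edit(self, x: Hashable, y: int) -> None:
        self.fixes[x] = y


def run_stream(editor: LookupEditor,
               stream: Iterable[Tuple[Hashable, int]]) -> int:
    """Deploy on a stream and edit immediately whenever a prediction is wrong."""
    num_edits = 0
    for x_t, y_t in stream:
        if editor.predict(x_t) != y_t:
            editor.edit(x_t, y_t)
            num_edits += 1
    return num_edits


def accuracy(editor: LookupEditor, pairs: List[Tuple[Hashable, int]]) -> float:
    """Fraction of (x, y) pairs that the edited model predicts correctly."""
    return sum(editor.predict(x) == y for x, y in pairs) / len(pairs)
```

For example, with `f0 = lambda x: x % 2` and the stream `[(1, 1), (2, 1), (3, 1), (4, 0), (6, 1)]`, the loop makes two edits (for inputs 2 and 6) and leaves predictions on all other inputs equal to those of `f0`.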
2.2 GRACE: General Retrieval Adaptors for Continual Editing
2.2 GRACE: 持续编辑的通用检索适配器
We propose GRACE, a method for sequentially editing a pre-trained model’s behavior without altering its weights, as illustrated in Figure 1. GRACE works by wrapping a chosen layer of any pre-trained model architecture with an Adaptor. A GRACE Adaptor at model $f$’s layer $l$ contains two components: (1) a codebook $\mathcal{C}$ and (2) a deferral mechanism to decide whether to use $\mathcal{C}$ for a given input.
GRACE codebook. A GRACE Adaptor at layer $l$ maintains a discrete codebook, adding and updating elements over time to edit a model’s predictions. The codebook contains three components:
• Keys ($\mathbb{K}$): Set of keys, where each key is a cached activation $h^{l-1}$ predicted by layer $l-1$.
• Values ($\mathbb{V}$): Set of values that are randomly initialized and are updated using the model’s finetuning loss for edits. Each key maps to a single, corresponding value.
• Deferral radii ($\mathcal{E}$): Each key has a deferral radius $\epsilon$, which serves as a threshold for similarity matching. The deferral mechanism uses this radius as shown in Algorithm 1. GRACE is activated at layer $l$ only if the deferral constraint is satisfied. New entries have a default value $\epsilon_{\mathrm{init}}$, which is a hyperparameter.
Deferral mechanism. Before editing, GRACE layers are empty. As editing progresses, GRACE adds keys and adapts values and $\epsilon$ entries. Conceptually, inference at layer $l$ with GRACE entails a deferral decision, computing $h^{l}$ using a similarity search over GRACE’s keys:
$$
h^{l} = \begin{cases}
\mathrm{GRACE}(h^{l-1}) & \text{if } \min_{i} d(h^{l-1}, \mathbb{K}_{i}) < \epsilon_{i^{*}}, \text{ where } i^{*} = \operatorname*{argmin}_{i} d(h^{l-1}, \mathbb{K}_{i}), \\
f^{l}(h^{l-1}) & \text{otherwise,}
\end{cases}
$$
where $f^{l}(h^{l-1})$ denotes the unedited model’s activation of the $l$-th layer. $h^{l-1}$ can be seen as a query to the codebook, and $\mathrm{GRACE}(h^{l-1})$ retrieves the value associated with its closest key. $\epsilon_{i}^{l}$ and $\mathbb{K}_{i}^{l}$ are the influence radius and key $i$ in layer $l$, respectively, and $d(\cdot)$ is a distance function. We follow related work [41] and use Euclidean distance for $d(\cdot)$ in our experiments; changing this is trivial. Through explicit similarity search, we use the fact that large models encode semantic similarity with respect to their tasks in their latent spaces. This lets GRACE edits generalize to similar inputs in the future. If a new input is unlike any cached keys, GRACE simply defers to $f$’s pretrained weights. This way, GRACE layers limit interference with $\mathcal{D}_{\mathrm{train}}$ by leaving the model weights unaltered, which helps when input distributions shift [45].
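The deferral rule can be illustrated with a small sketch in plain Python. The function name `grace_layer_forward`, the list-based codebook layout, and the stand-in `layer_fn` are assumptions made for this example and are not taken from the authors' code. The logic follows the case equation: find the nearest key under Euclidean distance, return its value if the query lies strictly inside that key's $\epsilon$-ball, and otherwise call the frozen layer.

```python
import math
from typing import Callable, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance d(a, b) between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def grace_layer_forward(h_prev: Sequence[float],
                        keys: List[Sequence[float]],
                        values: List[Sequence[float]],
                        radii: List[float],
                        layer_fn: Callable[[Sequence[float]], List[float]]
                        ) -> List[float]:
    """Compute h^l from h^{l-1} with a codebook lookup or a deferral.

    keys, values and radii are parallel lists (one entry per cached edit).
    layer_fn plays the role of the frozen layer f^l.
    """
    if not keys:
        # Empty codebook: always defer to the frozen layer.
        return layer_fn(h_prev)

    dists = [euclidean(h_prev, k) for k in keys]
    i_star = min(range(len(keys)), key=dists.__getitem__)  # nearest key

    if dists[i_star] < radii[i_star]:
        # Query falls inside the nearest key's epsilon-ball: use its value.
        return list(values[i_star])
    # Otherwise leave the original computation untouched.
    return layer_fn(h_prev)
```

With keys `[[0, 0], [5, 5]]`, radii `[1.0, 0.5]`, and a layer that doubles its input, the query `[0.6, 0.0]` retrieves the first value, while `[5.0, 5.75]` is 0.75 away from its nearest key and is therefore passed through the frozen layer.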
Codebook maintenance. To make an edit, a GRACE layer can perform one of two operations. Each step is described in Algorithm 1. First, if the codebook is empty or the input embedding $h^{l-1}$ falls outside the deferral radius of all existing keys according to distance function $d(\cdot)$, then a new codebook entry is created and added: $(h^{l-1}, v, \epsilon_{\mathrm{init}}, y)$. Thus if $x_{t}$ were passed into $f$ again, $h^{l-1}$ would activate the codebook and value $v$ would be passed to layer $l+1$. Training $v$ is detailed below.
Sometimes, a query $h^{l-1}$ will be close enough to an existing key that adding a new entry would cause their $\epsilon$-balls to overlap. To avoid this, we compare the edit label $y$ to the model’s prediction for the nearest key and distinguish two cases:
• If the overlapping key’s label is the same as the edit’s label, Expand that key’s $\epsilon$ to encompass the query.
• If the overlapping key’s label is different from the edit’s label, Split these keys by first decreasing the influence radius of the overlapping key, then adding a new codebook entry where the new key is simply the query $h^{l-1}$. We set both keys’ $\epsilon$ values to be half their distance apart.
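The add, Expand, and Split operations might be sketched as follows. Several details here are assumptions for illustration rather than the paper's exact procedure: the `Entry` dataclass and `update_codebook` name are hypothetical, only the single nearest key is checked for overlap, overlap is tested as `dist < radius + eps_init`, and Expand sets the radius to the query distance plus a tiny `margin` so that the query lies strictly inside the ball. Values are only randomly initialized here; training them is a separate step.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Entry:
    key: List[float]     # cached activation h^{l-1}
    value: List[float]   # learned value (randomly initialized here)
    radius: float        # deferral radius epsilon
    label: int           # edit label y


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def update_codebook(codebook: List[Entry], h_prev: Sequence[float], label: int,
                    eps_init: float, margin: float = 1e-6,
                    rng: Optional[random.Random] = None) -> str:
    """Apply one edit in place; return 'add', 'expand' or 'split'."""
    rng = rng or random.Random(0)
    new_value = [rng.gauss(0.0, 1.0) for _ in h_prev]  # trained afterwards

    if not codebook:
        codebook.append(Entry(list(h_prev), new_value, eps_init, label))
        return "add"

    nearest = min(codebook, key=lambda e: euclidean(h_prev, e.key))
    dist = euclidean(h_prev, nearest.key)

    # Assumed overlap test: would a new ball of radius eps_init intersect
    # the nearest key's ball?
    if dist >= nearest.radius + eps_init:
        codebook.append(Entry(list(h_prev), new_value, eps_init, label))
        return "add"

    if nearest.label == label:
        # Expand: grow the existing ball just enough to cover the query.
        nearest.radius = max(nearest.radius, dist + margin)
        return "expand"

    # Split: both balls get half the distance between the two keys.
    half = dist / 2.0
    nearest.radius = half
    codebook.append(Entry(list(h_prev), new_value, half, label))
    return "split"
```

Note that when the two keys coincide (`dist == 0`) with different labels, this sketch assigns both a radius of zero; a practical implementation would need to decide how to handle that edge case.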
As edits stream over long deployments, by continuously adding and updating the GRACE codebook, layer $l$’s latent space is partitioned according to which inputs needed edits. When not performing edits, these codebook maintenance operations are bypassed, and keys are entirely frozen. GRACE thus introduces a new model editing paradigm in which edits can be made sequentially, similar edits are encouraged to be edited similarly, and the ultimate influence of new edits can be controlled and monitored explicitly. $\epsilon_{\mathrm{init}}$ is the sole hyperparameter in GRACE, which sets the initial $\epsilon$ value for new codebook entries. Intuitively, using a larger $\epsilon_{\mathrm{init}}$ will create edits with more influence, making edits more general, but increasing the interference with unrelated inputs. In practice, $\epsilon_{\mathrm{init}}$ could be tuned using exogenous data, or GRACE codebooks could be refreshed periodically.
Training GRACE Values. When making an edit with GRACE, either a new key–value pair is learned or an existing key–value pair is updated. To ensure that newly-learned values correct the model’s behavior, we train them directly using backpropagation through the finetuning loss on the model’s prediction given the edit. The learned value $v$ then replaces $h^{l}$ for the rest of the forward pass. In our experiments, we train each value with 100 gradient descent steps to ensure the model’s behavior is updated.
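To make the value-training step concrete, here is a toy sketch under strong simplifying assumptions: the "rest of the model" after the edited layer is replaced by a single frozen linear classifier `W` with a softmax cross-entropy loss, and gradients are written out by hand instead of using an autodiff library. The function names, learning rate, and seed are illustrative choices; only the use of 100 gradient steps and the random initialization come from the text. Only the value vector `v` is updated, never `W`.

```python
import math
import random
from typing import List


def softmax(z: List[float]) -> List[float]:
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def logits(W: List[List[float]], v: List[float]) -> List[float]:
    """Frozen downstream computation (a toy stand-in for layers l+1..L)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def cross_entropy(W: List[List[float]], v: List[float], target: int) -> float:
    return -math.log(softmax(logits(W, v))[target])


def train_value(W: List[List[float]], target: int, dim: int,
                steps: int = 100, lr: float = 0.5, seed: int = 0) -> List[float]:
    """Learn a value v so the frozen classifier W predicts `target`."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random initialization
    for _ in range(steps):
        p = softmax(logits(W, v))
        # dLoss/dlogits for softmax cross-entropy is (p - onehot(target)).
        err = [p_c - (1.0 if c == target else 0.0) for c, p_c in enumerate(p)]
        # Chain rule through the frozen linear map: dLoss/dv = W^T err.
        grad = [sum(err[c] * W[c][j] for c in range(len(W))) for j in range(dim)]
        v = [v_j - lr * g for v_j, g in zip(v, grad)]  # update v only
    return v
```

In a real transformer the same idea applies with the frozen remainder of the network in place of `W`, and the gradient would normally come from an automatic differentiation framework.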
Algorithm 1: Update Codebook at layer $l$
return: $\mathcal{C}$
GRACE layers with sequential inputs. For models with different representations per input token, like transformers, we must choose 1) which token should be GRACE’s input query, and 2) which token to replace with a retrieved value in the subsequent layer. In practice, we find that broadcasting the value to each token is reasonable for tasks like classification, since this gives values strong control over the model’s behavior and makes them easy to learn. For autoregressive models, however, picking the right tokens for the query and value is more nuanced. We opt for replacing only the final token of an input prompt, which we verify experimentally, since compressing future generated text into tokens is an interesting and burgeoning direction in itself [32]. Upon choosing these tokens, GRACE naturally applies to all popular transformer models.
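The two replacement strategies can be shown on a toy hidden-state matrix stored as a list of per-token vectors. The function `apply_value` and its `mode` strings are hypothetical names for this sketch; it covers only how a retrieved value is written back, not how the query token is chosen.

```python
from typing import List, Sequence


def apply_value(hidden: Sequence[Sequence[float]],
                value: Sequence[float],
                mode: str) -> List[List[float]]:
    """Write a retrieved value into per-token hidden states.

    hidden has shape (seq_len, d); value has shape (d,).
    mode="broadcast"  : every token is replaced (classification-style).
    mode="last_token" : only the final prompt token is replaced
                        (autoregressive-style).
    """
    if not hidden:
        raise ValueError("hidden must contain at least one token")
    if mode == "broadcast":
        return [list(value) for _ in hidden]
    if mode == "last_token":
        return [list(tok) for tok in hidden[:-1]] + [list(value)]
    raise ValueError(f"unknown mode: {mode}")
```

For a three-token input, `"broadcast"` yields three copies of the value, while `"last_token"` keeps the first two token vectors unchanged and overwrites only the third.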
2.3 Illustrative Example
To garner intuition about GRACE, we provide an illustrative experiment on synthetic data. As shown in Figure 2, we sample 100 instances from two 2D distributions corresponding to classes. We then train a three-layer binary classifier with two 100-dimensional hidden layers and ReLU activations. Next, we introduce edits with flipped labels, simulating local label shift at test time. Using a single

Figure 2: Illustrative example of GRACE. We train a model on separable data in (a), then introduce locally-flipped labels at test time in (b). In (c), the original model unsurprisingly misclassifies these label-flipped instances. In (d), GRACE fixes these labels without impacting other inputs.
| Setting | Model | Test retention: Dataset | Test retention: N | Test retention: Pre-edit | Editing: Dataset | Editing: N | Editing: Pre-edit |
|---|---|---|---|---|---|---|---|
| QA | T5 (60m) | NQ [20] | 1000 | .72 F1 | zsRE [24] | 1000 | .31 F1 |
| Clf. | BERT (120m) | SCOTUS 1982-1991 [6] | 914 | .99 Acc | SCOTUS 1992-2009 [6] | 931 | .55 Acc |
| Halluc. | GPT-2 (1.5B) | OpenWebText [10] | 1000 | 15.98 PPL | SelfCheckGPT | 1392 | 132.7 PPL |
Table 1: Dataset statistics for main results. Test retention is the testing set of each model’s training data. $N$ is the number of samples. Pre-edit is the unedited model’s performance on each dataset.
3 Experiments
3.1 Experimental Setup
We evaluate GRACE’s capacity to edit models hundreds to thousands of times sequentially. This is in contrast to prior works, which largely evaluate using single edits. Further, much of the model editing literature relies on synthetic edits, often generated by flipping random labels. Beyond unrealistically assessing performance, it is well-established that learning random versus natural labels is vastly different [49]. Instead, we correct authentic mistakes made by models, each time an error is made.
Baselines. We compare against continual learning and model editing methods. First, we continually finetune (FT) [25] on streaming errors. To reduce overfitting, we also compare with Elastic Weight Consolidation (EWC) [19] and a simple form of experience replay [36] where we periodically retrain the model (Retrain) on all previous edits. Second, we compare against model editors MEND [30] and Defer, inspired by SERAC [31] but altered for our setup. We also compare against ROME [28] in our language modeling experiment. Finally, we ablate GRACE by replacing our discrete search with a Memory network containing a memory module that is indexed by a soft attention mechanism. Details for all comparisons are in Appendix C.
Datasets and Pre-trained Models. We evaluate GRACE on three sequential editing tasks with corresponding pre-trained models, as shown in Table 1. 1) We edit a 60-million parameter T5 model [35] trained for context-free question-answering, as is used in [30]. We extract potential edits from the validation set of zsRE [24], following the editing literature [16, 30]. This model achieves F1 of .72 on NQ and .31 on zsRE prior to editing. Further details are in Appendix B.1. 2) We edit a 110-million parameter BERT classifier trained for a new editing task with label shift using the SCOTUS dataset from Fairlex [6]. The prediction task is to categorize U.S. Supreme Court documents over multiple decades into 11 topics. Over time, categorization rules change, so label distributions shift. We train a BERT classifier on the 7.4k training cases from 1946-1982, then make edits on 931 cases from 1991-2009. We also introduce additional, realistic label shifts, as discussed in Appendix B.2. This model achieves Accuracy of 0.99 on the training set and .55 on the edit set. 3) We introduce a new editing task by correcting a GPT language model’s hallucinations. In [27], authors prompt GPT-3 to generate 238 Wikipedia-style biographies using subjects from WikiBio. They then annotate the factual accuracy of each sentence, recording which are hallucinations. We propose editing inaccurate sentences by replacing them with corresponding sentences in the true Wikipedia entries. We include edits for all 238 biographies, creating 1392 sequential edits and 592 already-accurate outputs. We then edit GPT2-XL, which has 1.5B parameters. However, GPT2-XL was not trained on this task, so it has high perplexity (PPL) on all sentences. Therefore, we finetune GPT2-XL on these data mixed with sentences from OpenWebText [10], a public version of GPT2’s training data. Our final GPT2-XL model has PPL of 15.98 on OpenWebText (comparable to the original), 8.7 on already-accurate outputs, and 132.7 on intended edits, indicating a need for editing and room for improvement. Further details are available in Appendix B.3.
Metrics. To compare each method, we measure three main metrics that align with prior work [16, 28].
- Edit Success (ES): We check whether an edit has been successful using ES: $m(y,\hat{y})$ , where $m(\cdot)$ is a task-specific measure of accuracy: standard F1 for question answering, Accuracy for classification and Perplexity (PPL) for generation.
| Method | zsRE TRR | zsRE ERR | zsRE Avg. | zsRE #E | SCOTUS TRR | SCOTUS ERR | SCOTUS Avg. | SCOTUS #E | Halluc. TRR | Halluc. ERR | Halluc. ARR | Halluc. #E | Halluc. time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FT [25] | .56 | .82 | .69 | 1000 | .52 | .52 | .52 | 415 | 1449.3 | 28.14 | 107.76 | 1392 | .26 (.07) |
| FT+EWC [19] | .51 | .82 | .66 | 1000 | .67 | .50 | .58 | 408 | 1485.7 | 29.24 | 109.59 | 1392 | .29 (.06) |
| FT+Retrain [36] | .27 | .99 | .63 | 1000 | .67 | .83 | .75 | 403 | 2394.3 | 35.34 | 195.82 | 1392 | 23.4 (13.2) |
| MEND [30] | .25 | .27 | .26 | 1000 | .19 | .27 | .23 | 672 | 1369.8 | 1754.9 | 2902.5 | 1392 | .63 (.10) |
| Defer [31] | .72 | .31 | .52 | 1000 | .33 | .41 | .37 | 506 | 8183.7 | 133.3 | 10.04 | 1392 | .07 (.02) |
| ROME [28] | - | - | - | - | - | - | - | - | 30.28 | 103.82 | 14.02 | 1392 | .64 (.28) |
| Memory | .25 | .27 | .26 | 1000 | .21 | .20 | .21 | 780 | 25.47 | 79.30 | 10.07 | 1392 | .11 (.02) |
| GRACE | .69 | .96 | .82 | 1000 | .81 | .82 | .82 | 381 | 15.84 | 7.14 | 10.00 | 1392 | .13 (.02) |

Column groups: zsRE (T5; F1 ↑), SCOTUS (BERT; Acc ↑), Hallucination (GPT2-XL; PPL ↓). GRACE codebook sizes: zsRE 137 keys (7.30 edits/key); SCOTUS 252 keys (1.51 edits/key); Hallucination 1341 keys (1.04 edits/key).
Table 2: Comparison of GRACE to existing methods. Metrics shown are computed after all sequential edits. For Hallucination, we also compute perplexity retention on already-accurate sentences (ARR) and the average time per edit. Parentheses in time denote standard deviation. We also count the number of keys GRACE used. GRACE achieves the best TRR/ERR balance while making edits faster than most comparisons and using small codebooks for zsRE and SCOTUS. A large codebook is required for Hallucination, since each edit label is unique.
We also track the number of edits (#E), which may vary by editor, since different editors lead the model to make different mistakes in practice. Further implementation details are available in Appendix A.
3.2 Results
3.2.1 Comparisons to existing methods
We first find that GRACE outperforms existing methods after long sequences of edits, as shown in Table 2. We focus here on TRR and ERR, which have a trade-off as each can be achieved in isolation, though all metrics are reported in Appendix G. On zsRE and SCOTUS, we focus on TRR, ERR, and their average. ROME is incomparable on these datasets as it is only proposed for GPT models. On Hallucination, we include Accurate Retention Rate (ARR), which is the edited model’s perplexity on sentences on which it was already accurate.
On zsRE and SCOTUS, GRACE’s averaged TRR and ERR outperforms its closest competitors by 19% and 9%, respectively. Comparing the average performance across all methods, GRACE’s improvement is 86% and 129% due to other methods’ inability to balance ERR and TRR. For instance, while Defer achieves the highest TRR on SCOTUS it trades off ERR. In contrast, FT+Retrain achieves the highest ERR on SCOTUS, but trades off TRR. On both datasets, MEND and Defer both struggle to balance TRR and ERR without access to privileged data.
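The 19% and 9% figures can be reproduced from the Avg. columns of Table 2 as a relative gain, assuming that the "closest competitor" on each dataset is the non-GRACE method with the highest Avg. (FT at .69 on zsRE and FT+Retrain at .75 on SCOTUS). That reading is an inference from the table, and the helper name below is illustrative.

```python
def relative_gain(ours: float, other: float) -> float:
    """Relative improvement of `ours` over `other`, as a fraction."""
    return (ours - other) / other


# Avg. of TRR and ERR taken from Table 2.
zsre_gain = relative_gain(0.82, 0.69)    # GRACE vs. FT on zsRE
scotus_gain = relative_gain(0.82, 0.75)  # GRACE vs. FT+Retrain on SCOTUS

print(f"zsRE:   {100 * zsre_gain:.0f}%")
print(f"SCOTUS: {100 * scotus_gain:.0f}%")
```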
GRACE also outperforms the comparisons on the Hallucination task. We observe that the finetuning methods easily overfit to new edits, leading to competitive ERR but poor TRR. However, their good ERR comes at the expense of ARR in these methods. We also find that ROME and Memory both perform competitively with GRACE, indicating the value of parameter efficiency in lifelong model editing. Finally, we report the average wall-clock time it takes to perform one edit, finding that GRACE makes edits twice as fast as finetuning while competing with other Adaptors.
GRACE 在幻觉任务上的表现也优于其他对比方法。我们发现微调方法容易对新编辑过拟合,导致 ERR 表现优异但 TRR 较差。然而在这些方法中,其优异的 ERR 是以牺牲 ARR 为代价的。我们还发现 ROME 和 Memory 与 GRACE 表现相当,这表明参数效率在终身模型编辑中的价值。最后,我们报告了执行一次编辑所需的平均挂钟时间,发现 GRACE 的编辑速度是微调方法的两倍,同时与其他适配器方法保持竞争力。
Excitingly, GRACE can achieve high-quality edits with small codebooks. For example, GRACE uses few keys to edit T5 on zsRE: 1000 edits are made using only 137 keys. Such compression is only feasible by managing the $\epsilon$-balls effectively over time. This is also parameter-efficient: with details in Section 3.2.3, the zsRE codebook contains 210,569 scalar values ($0.35\%$ of T5’s parameters) and only 70,281 are learnable. The number of keys is lower-bounded by the number of unique edit labels and is related to the complexity of the inputs. On SCOTUS, GRACE also achieves a relatively small codebook, showing a balance between input complexity and generalizability of each edit. Since SCOTUS is a document classification task with 11 classes, there could be as few as 11 keys, but given that each document contains many sentences, there is a lower chance that the same keys can be used for many documents. That GRACE’s keys represent 1.51 edits on average indicates significant generalization. As expected, the lower bound on codebook size is evident in the Hallucination task, where GRACE uses a new key for almost every edit. This is because edit labels are unique sentences. Despite a large codebook, such success at selectively inserting tokens that generate full sentences poses an exciting direction for controllable and editable language modeling.
令人振奋的是,GRACE能够通过小型码本(codebook)实现高质量编辑。例如在zsRE数据集上,GRACE仅用137个键(key)就完成了对T5模型的1000次编辑。这种压缩效果只有通过有效管理$\epsilon$-balls才能实现。该方法还具有参数高效性:如第3.2.3节所述,zsRE码本包含210,569个标量值(占T5参数的0.35%),其中仅70,281个需要学习。键的数量下限由唯一编辑标签数量决定,并与输入复杂度相关。在SCOTUS任务中,GRACE同样实现了较小的码本规模,展现了输入复杂度与编辑泛化能力之间的平衡。由于SCOTUS是包含11个类别的文档分类任务,理论上最少只需11个键,但考虑到每份文档包含多个句子,相同键跨文档复用的概率较低。GRACE平均每个键对应1.51次编辑的结果,显示出显著的泛化能力。正如预期,在Hallucination任务中码本规模的下限效应尤为明显,GRACE几乎为每次编辑都使用了新键,这是因为编辑标签都是独特句子。尽管需要较大码本,但这种选择性插入token来生成完整语句的成功案例,为可控可编辑语言建模开辟了令人振奋的研究方向。

Figure 3: ES and TRR while editing GPT2-XL on Hallucination. Lower values are better because TRR and ERR measure perplexity. GRACE outperforms the comparisons by making successful edits while maintaining the model’s training knowledge. All metrics are shown in Figure 13.
图 3: GPT2-XL在幻觉任务上的编辑成功率(ES)和训练知识保留率(TRR)。数值越低越好,因为TRR和ERR衡量的是困惑度。GRACE方法在成功完成编辑的同时保持了模型的训练知识,表现优于其他对比方法。所有指标如图13所示。
In Figure 3, we show a more detailed comparison of all methods for the Hallucination editing task over time, focusing on TRR and ES (see Figure 13 for all metrics). As expected, finetuning methods excel at ES, but perform poorly at TRR. On the flip side, ROME and Memory have reasonably low ES and TRR scores, though both suffer as edits progress. While ROME is competitive early in editing, it especially suffers on ERR (Table 2), and GRACE’s TRR remains $2\times$ better after making all edits.
图3中,我们展示了随时间推移所有方法在幻觉编辑任务上的详细对比,重点关注TRR和ES指标(完整指标见图13)。与预期一致,微调方法在ES上表现优异,但TRR表现较差。相反,ROME和Memory的ES与TRR分数都相对较低,但随着编辑次数增加,两者性能都会下降。虽然ROME在编辑初期具有竞争力,但其ERR表现尤其不佳(表2),而GRACE的TRR在完成所有编辑后仍保持$2\times$的优势。
3.2.2 Model Analysis: Memorization vs. Generalization
3.2.2 模型分析:记忆与泛化
Next, we study GRACE’s memorization vs. generalization performance by editing T5 on zsRE on extremely long sequences of edits while varying $\epsilon_{\mathrm{init}}$ and the edited layer. We split each zsRE question’s set of rephrasings into two sets: edits and holdouts. Then, we pass edits through GRACE until 3,000 edits are made. This is extremely large: even 10 edits catastrophically decay performance [12], and [16] performs roughly half as many edits. After each edit, we measure TRR, ERR, and F1 on the entire Holdout set, and record the number of keys. Since we evaluate GRACE on the entire Holdout set after each edit, the results start low because GRACE has seen no rephrasings of held-out edits. Therefore, later measures of Holdout performance are more representative than earlier ones. Figure 4 shows our results for $\epsilon_{\mathrm{init}}=0.1$ and $\epsilon_{\mathrm{init}}=3.0$, while the rest are in Appendix H. We derive the following findings.
接下来,我们通过在不同初始参数$\epsilon_{\mathrm{init}}$和编辑层条件下对T5模型进行zsRE任务的超长序列编辑,研究GRACE的记忆与泛化性能。将每个zsRE问题的改写集合划分为编辑集和保留集,持续通过GRACE处理编辑集直至完成3,000次编辑——这个规模远超现有研究:[12]显示10次编辑就会导致性能崩溃,而[16]的编辑量仅为当前实验的一半左右。每次编辑后,我们测量整个保留集上的TRR、ERR和F1值,并记录键的数量。由于GRACE在每次编辑后都需在整个保留集上评估,初始性能会因未接触保留编辑的改写样本而偏低,因此后期保留集的性能指标更具代表性。图4展示了$\epsilon_{\mathrm{init}}=0.1$和$\epsilon_{\mathrm{init}}=3.0$的实验结果(其余数据见附录H),主要发现如下:
Layer and $\boldsymbol{\epsilon}_{\mathbf{init}}$ choice balance memorization and generalization. We first find that different blocks lead to different editing performance. Notably, editing Blocks 2 and 4 achieves high TRR, high ERR, and strong generalization for both choices of $\epsilon_{\mathrm{init}}$. In contrast, for Block 6 we see that since each $\epsilon_{\mathrm{init}}$ is small, the edits do not generalize at all and also lead to poor ERR. As expected, when $\epsilon_{\mathrm{init}}$ is small, TRR performance is optimal, since most edits require a new key. While creating a large codebook can be feasible, the trade-off in Holdout becomes clear, as detailed in the next finding. The relationship between layer choice, performance, and the size of the codebook likely stems from a layer’s representational capacity: layers that map semantically equivalent inputs near one another will be easier to edit with GRACE.
层和 $\boldsymbol{\epsilon}_ {\mathbf{init}}$ 的选择需要在记忆与泛化之间取得平衡。我们首先发现不同模块会导致不同的编辑效果。值得注意的是,在两种 $\epsilon_{\mathrm{init}}$ 选择下,编辑第二和第四模块都能实现高TRR(真实替换率)、高ERR(编辑成功率)以及强泛化能力。相反,第六模块由于每个 $\epsilon_{\mathrm{init}}$ 值过小,完全无法泛化且导致ERR表现不佳。正如预期,当 $\epsilon_{\mathrm{init}}$ 较小时,TRR性能最优,因为多数编辑需要新密钥。虽然创建大型码本可行,但下一发现将详述其在保留集上的明显权衡。层选择、性能与码本规模的关系可能源于层的表征能力:能将语义等价输入映射到相近位置的层,更容易通过GRACE实现编辑。
GRACE edits generalize to unseen inputs. Steadily increasing Holdout performance shows that GRACE edits can generalize to previously unseen holdout edits. Excitingly, interior layers appear to generalize better than early and late layers, which is backed up by their use of fewer keys. Larger $\epsilon_{\mathrm{init}}$ values also lead to better generalization, as expected, implying that semantically similar inputs indeed land in the same deferral radii. The later layer, Block 6, appears to
GRACE编辑能够泛化到未见过的输入。逐步增加的保留集(Holdout)实验表明,GRACE编辑可以泛化到之前未见的保留编辑。令人振奋的是,中间层似乎比早期层和后期层具有更好的泛化能力,这与其使用更少的键(key)相吻合。正如预期的那样,较大的$\epsilon_{\mathrm{init}}$值也带来了更好的泛化性能,这意味着语义相似的输入确实落在相同的延迟半径内。较深层Block 6的表现显示...
GRACE codebooks stay small and stabilize over time. Finally, the number of keys steadily flattens over time, indicating that the $\epsilon$ values indeed adapt to the data distribution. As we increase $\epsilon_{\mathrm{init}}$, the resultant codebooks get smaller, indicating control over codebook size. Small codebooks are also important for parameter efficiency and for generalization.
GRACE码本保持较小规模并随时间趋于稳定。最终,键数量随时间增长逐渐趋平,表明$\epsilon$值确实能自适应数据分布。当我们增大$\epsilon_{\mathrm{init}}$时,生成的码本会更小,这显示了对码本尺寸的控制能力。小规模码本对参数效率和泛化能力也至关重要。

Figure 4: Impact of $\epsilon_{\mathrm{init}}$ and block choice for GRACE editing T5 on zsRE for 3000 sequential edits. Other $\epsilon_{\mathrm{init}}$ values are in Appendix H. Along with TRR and ERR, we also measure F1 on a “Holdout” edit set containing unseen rephrasings of all edits. We find that blocks 0 and 6 use more keys and achieve higher TRR, but can lead to lower ERR and generalize worse, given lower holdout values.
图 4: $\epsilon_{\mathrm{init}}$ 和模块选择对 GRACE 编辑 T5 在 zsRE 上进行 3000 次连续编辑的影响。其他 $\epsilon_{\mathrm{init}}$ 值见附录 H。除了 TRR 和 ERR 外,我们还测量了包含所有编辑未见改写版本的"保留"编辑集的 F1 分数。我们发现模块 0 和 6 使用了更多键并实现了更高的 TRR,但由于保留值较低,可能导致更低的 ERR 和更差的泛化性能。
3.2.3 Parameter Efficiency
3.2.3 参数效率
GRACE’s memory requirements are small and straightforward: A new edit requires $|h^{l-1}|+|h^{l}|+1$ parameters, where $|h^{l-1}|$ is the dimension of the key, $|h^{l}|$ is the dimension of the value, and $\epsilon_{\mathrm{init}}$ is a scalar. Further, the key’s $|h^{l-1}|$ parameters are not trained. As detailed in Appendix E, this is comparable to the alternatives when performing thousands of edits, so performance gain is not from parameter count. This intuition is reinforced by the fact that at inference time, if the Adaptor is activated, predictions are only altered using the $|h^{l-1}|+|h^{l}|+1$ parameters of the chosen entry.
GRACE的内存需求小而直接:每次新编辑需要 $|h^{l-1}|+|h^{l}|+1$ 个参数,其中 $|h^{l-1}|$ 是键(key)的维度, $|h^{l}|$ 是值(value)的维度, $\epsilon_{\mathrm{init}}$ 为标量。此外,键的 $|h^{l-1}|$ 参数不参与训练。如附录E所述,在进行数千次编辑时,这与替代方案相当,因此性能提升并非来自参数数量。这一直觉在推理时得到强化:若适配器(Adaptor)被激活,预测仅通过所选条目的 $|h^{l-1}|+|h^{l}|+1$ 个参数进行修改。
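As a quick check on the per-entry count $|h^{l-1}|+|h^{l}|+1$, the sketch below multiplies it out for a 137-key codebook. The dimensions `key_dim=1024` and `value_dim=512` are assumptions, not values stated in the text; they are used here only because they reproduce the 210,569 total and 70,281 learnable scalars reported for zsRE. The helper name `codebook_params` is hypothetical, and counting $\epsilon$ alongside the value as learnable is likewise an assumption that matches the reported figure.

```python
# Codebook size from the per-entry count |h^{l-1}| + |h^l| + 1.
# ASSUMPTION: key_dim=1024 and value_dim=512 are not given in the text;
# they are chosen because they reproduce the reported zsRE totals.

def codebook_params(num_keys: int, key_dim: int, value_dim: int) -> dict:
    per_entry = key_dim + value_dim + 1   # key + value + epsilon
    learnable_per_entry = value_dim + 1   # keys are stored, not trained
    return {
        "total": num_keys * per_entry,
        "learnable": num_keys * learnable_per_entry,
    }

sizes = codebook_params(num_keys=137, key_dim=1024, value_dim=512)
print(sizes)  # {'total': 210569, 'learnable': 70281}

# Share of a 60-million-parameter model taken up by the codebook.
print(f"{sizes['total'] / 60_000_000:.2%}")  # 0.35%
```

Under these assumptions the codebook is a fraction of a percent of the model, and only about a third of its scalars are ever updated.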
3.2.4 Interpreting GRACE Codebooks
3.2.4 GRACE码本解析
A key benefit of GRACE is that learned codebooks can be detached from the model and inspected. This way, edits can easily be undone without impacting the model, and the codebooks can be inspected throughout editing. To demonstrate this advantage, we inspect how keys and their $\epsilon$ change throughout editing Block 4 of the T5 QA model. In this experiment, we investigate how well GRACE edits generalize to unseen edits. At each edit in a sequence of 1,000 inputs, we pass the entire holdout set of edits through the newly edited model. For each holdout instance, we record in which key’s $\epsilon$-ball it lands, if any. This way, we track whether a newly added key generalizes to multiple holdouts successfully. We also track what proportion of the holdouts land inside any key to evaluate generalization over time. In Figure 5, we summarize the results from one such experiment, with more details in Appendix D. We find that the number of holdouts per key stabilizes over time. Interestingly, larger $\epsilon_{\mathrm{init}}$ values capture too many holdouts per question, which explains the trade-off between generalization and TRR. In the Appendix, we also show that for some keys the number of holdouts is stable, while for others it is highly variable. This implies that our codebook maintenance strategy can have big ripple effects depending on the key. Further, some regions of the latent space are likely better than others for performing GRACE edits.
GRACE的一个关键优势是学习到的码本(codebook)可以从模型中分离并检查。这种方式使得编辑可以轻松撤销而不影响模型,并且码本在整个编辑过程中都可被检查。为展示这一优势,我们研究了在编辑T5 QA模型第4模块时键(key)及其$\epsilon$如何变化。本实验中,我们探究GRACE编辑对未见编辑的泛化能力。在1,000个输入的编辑序列中,每次编辑后我们都将完整保留的编辑集通过新编辑模型。对每个保留实例,记录其落入哪个键的$\epsilon$-球(若有)。由此可追踪新增键是否成功泛化到多个保留实例。我们还追踪保留实例落入任意键的比例以评估随时间推移的泛化情况。图5汇总了一个此类实验的结果,附录D提供更多细节。我们发现每个键对应的保留实例数量会随时间趋于稳定。有趣的是,$\epsilon_{\mathrm{init}}$值会为每个问题捕获过多保留实例,这解释了泛化能力与TRR之间的权衡。附录中还显示部分键对应的保留实例数量稳定,而其他键则波动较大。这表明我们的码本维护策略会根据键产生较大连锁效应。此外,潜在空间的某些区域可能比其他区域更适合执行GRACE编辑。
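The bookkeeping described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `holdout_coverage` is hypothetical, the Euclidean metric is an assumption, and a holdout is credited only to its nearest key, and only if it falls inside that key's $\epsilon$-ball.

```python
import math
from collections import Counter

def holdout_coverage(keys, epsilons, holdouts):
    """Return (holdouts captured per key, fraction of holdouts captured)."""
    per_key = Counter()
    if not keys or not holdouts:
        return per_key, 0.0
    for h in holdouts:
        # Distance from this holdout embedding to every key (assumed Euclidean).
        dists = [math.dist(h, k) for k in keys]
        nearest = min(range(len(keys)), key=dists.__getitem__)
        # Count the holdout only if it lands inside the nearest key's ball.
        if dists[nearest] <= epsilons[nearest]:
            per_key[nearest] += 1
    return per_key, sum(per_key.values()) / len(holdouts)

# Toy 2-D embeddings standing in for hidden states.
keys = [(0.0, 0.0), (5.0, 5.0)]
epsilons = [1.0, 0.5]
holdouts = [(0.2, 0.1), (0.9, 0.0), (5.1, 5.0), (9.0, 9.0)]
per_key, covered = holdout_coverage(keys, epsilons, holdouts)
print(dict(per_key), covered)  # {0: 2, 1: 1} 0.75
```

Calling this after every edit yields the two curves discussed: holdouts per key, and the proportion of holdouts that land inside any key.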
3.2.5 Inference Time
3.2.5 推理时间
To better understand GRACE’s limitations, we compare the inference time of the T5-small QA model before and after editing. We compute the time it takes to run one instance through the model for each of 5,000 edits. We edit T5’s block 4 and use a small $\epsilon_{\mathrm{init}}$ of 0.1 to encourage a quickly-growing codebook. To evaluate the variance in training time, we replicate this experiment ten times, each leading to a codebook of roughly 4500 elements. We then average over 20-timestep windows and show standard deviation across all replications.
为了更好地理解GRACE的局限性,我们比较了编辑前后T5-small问答模型的推理时间。我们计算了在5,000次编辑中每次运行一个实例所需的时间。我们编辑了T5的第4个模块,并使用较小的$\epsilon_{\mathrm{init}}$值0.1以促进快速增长的码本。为了评估训练时间的方差,我们重复了十次实验,每次生成约4500个元素的码本。然后对20个时间步长的窗口取平均值,并展示所有重复实验的标准差。
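The aggregation step can be sketched as below, assuming non-overlapping 20-step windows and a standard deviation taken across replications of each window's mean; the text does not specify either choice, and `windowed_stats` is a hypothetical helper name.

```python
from statistics import mean, stdev

def windowed_stats(runs, window=20):
    """runs: one list of per-edit latencies (seconds) per replication.

    Returns a list of (window_start, mean, std) tuples, where the std is
    taken across replications. Requires at least two replications.
    """
    n = min(len(r) for r in runs)
    out = []
    for start in range(0, n - window + 1, window):
        # One averaged latency per replication for this window.
        per_run = [mean(r[start:start + window]) for r in runs]
        out.append((start, mean(per_run), stdev(per_run)))
    return out

# Two synthetic replications with constant latencies of 1.0 s and 3.0 s.
runs = [[1.0] * 40, [3.0] * 40]
for start, mu, sigma in windowed_stats(runs):
    print(start, mu, round(sigma, 3))
```

Averaging within windows smooths per-edit timing noise, while the spread across replications shows how sensitive the measurement is to the run.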

Figure 5: Interpreting GRACE keys throughout editing. Larger $\epsilon_{\mathrm{init}}$ achieve good generalization. The grey line is the true holdouts per edit.
图 5: 编辑过程中GRACE键的可视化解读。较大的$\epsilon_{\mathrm{init}}$能实现更好的泛化效果。灰色线条表示每次编辑对应的真实保留数据。

Figure 6: Comparing inference time of GRACE-edited and unedited T5 QA model as the size of the codebook increases.
图 6: 对比GRACE编辑与未编辑T5问答模型的推理时间随码本规模增加的变化。
We find that inference with a GRACE-edited model was only $1.32\times$ slower than an unedited model on average, as shown in Figure 6. This is a one-time cost that remains fixed as the codebook grows, because the search between one query and the full set of keys can be vectorized; the cost stays fixed until the codebook outgrows available memory. That GRACE causes any slowdown is a limitation and an avenue for future work.
我们发现,如图6所示,使用GRACE编辑后的模型进行推理平均仅比未编辑模型慢$1.32\times$。这是一次性成本,即使码本(codebook)增长也保持不变。推理时间不受影响,因为单个查询与完整键集之间的搜索可向量化。只要码本不超过可用内存,该成本就保持固定。GRACE导致的任何速度下降都是当前局限,也是未来工作的改进方向。
4 Related Work
4 相关工作
Model editing. Model editing is a new and active research area where the goal is to make targeted changes to a pre-trained model’s behavior. Most methods propose variants of regularized finetuning via auxiliary data, such as training instances from the pre-trained model’s training data or semantically equivalent versions of new edits [37]. Access to such data is non-trivial, especially as training data and paradigms are becoming proprietary, and collecting semantically equivalent inputs is often impractical. Recent works have retained these requirements, while extending to pretrained hypernetworks that predict edits [8, 30, 31], often decomposing weight updates into low-rank components [28, 29]. Recent approaches all study transformer architectures due to their popularity and high training cost [48, 51]. To ensure targeted edits, recent methods like MEND [30] and ROME [28] also draw inspiration from parameter-efficient finetuning [15]. But such methods are known to often require more finetuning steps and are more prone to overfitting than regular finetuning [39, 50]. Further, recent works show that edited models are deeply fragile [4] and methods for picking which parameters to update are surprisingly unreliable [13].
模型编辑 (Model editing)。模型编辑是一个新兴且活跃的研究领域,其目标是对预训练模型的行为进行针对性修改。大多数方法通过辅助数据提出正则化微调的变体,例如使用预训练模型训练数据中的训练实例或新编辑的语义等效版本 [37]。获取此类数据并非易事,尤其是随着训练数据和范式变得专有化,收集语义等效输入通常不太可能。最近的研究在保留这些要求的同时,扩展到预训练用于预测编辑的超网络 [8, 30, 31],通常将权重更新分解为低秩分量 [28, 29]。由于Transformer架构的流行性和高昂训练成本,近期方法都针对该架构进行研究 [48, 51]。为确保针对性编辑,MEND [30] 和ROME [28] 等最新方法还借鉴了参数高效微调 [15] 的思路。但众所周知,此类方法通常需要更多微调步骤,且比常规微调更容易过拟合 [39, 50]。此外,近期研究表明,编辑后的模型极其脆弱 [4],而选择更新哪些参数的方法出人意料地不可靠 [13]。
Unfortunately, nearly all model editing works consider only static edits, making only one edit to a model. While some recent works like MEMIT [29] indeed perform multiple edits, they do so simultaneously, not over time during deployment. Other works demonstrate that editing performance decays quickly when making multiple edits [30], though recent works like SERAC [31] and MEMIT [29] show burgeoning results in this direction. However, these exciting works still use large amounts of privileged information. Most similar to our work, two recent papers discuss sequential editing. First, [12] shows that after editing the same model 10 times, editing performance drops dramatically. Second, a concurrent paper [16] recently proposed a sequential editing setup. However, their method and implementation are architecture-specific and rely on large sources of unrelated inputs.
遗憾的是,几乎所有模型编辑工作都只考虑静态编辑,即对模型仅进行一次修改。虽然近期如MEMIT [29]等研究确实实现了多次编辑,但这些编辑是同步执行的,而非在部署过程中随时间推移逐步进行。其他研究表明,当进行多次编辑时,编辑性能会迅速衰减 [30],尽管近期如SERAC [31]和MEMIT [29]等研究已在该方向展现出初步成果。然而,这些令人振奋的工作仍依赖大量特权信息。与我们的研究最为接近的是近期两篇关于连续编辑的论文:首先,[12]表明对同一模型进行10次编辑后,编辑性能会急剧下降;其次,一篇同期论文[16]最近提出了连续编辑方案,但他们的方法和实现依赖于特定架构,且需要大量无关输入源。
Continual Learning. Continual learning methods are a reasonable approach to lifelong model editing. Most similar to our setup, recent works have investigated continual finetuning, where large language models are refined over time as new instances arrive. For example, [25] perform a large benchmark of continual finetuning. They find that regularizing finetuning with continual learning methods like Elastic Weight Consolidation [19], Experience Replay [36], and Maximally Interfered Replay [1] quickly decays performance on prior tasks, though it helps to remember some previous inputs. This implies that editing, as opposed to regular continual finetuning, is particularly challenging, since edits are unlikely to be uniformly distributed [14]. One promising avenue for continual learning is key-value methods, stemming from computer vision [26, 34, 42]. For example, recent works have demonstrated continual prompt-learning for NLP [44, 45] for applications like text retrieval [47]. Discrete key-value methods in particular have been shown to perform well with shifting distributions [41], with recent works extending to question answering [7]. By caching values, these approaches keep inputs in-distribution for downstream encoders while opening doors to longer-term memory, resources permitting. We corroborate these advantages in our experiments, where we demonstrate GRACE’s robustness to shifting inputs and labels over long sequences of edits.
持续学习。持续学习方法是一种实现终身模型编辑的合理途径。与我们设置最相似的是近期关于持续微调的研究,即大语言模型随着新实例的不断到来而逐步优化。例如,[25] 对持续微调进行了大规模基准测试,发现采用弹性权重巩固 (Elastic Weight Consolidation) [19]、经验回放 (Experience Replay) [36] 和最大干扰回放 (Maximally Interfered Replay) [1] 等持续学习方法进行正则化微调时,虽然能帮助记忆部分先前输入,但会快速降低模型在先前任务上的表现。这表明相较于常规持续微调,编辑任务更具挑战性,因为编辑操作往往呈现非均匀分布特性 [14]。计算机视觉领域衍生的键值方法 [26, 34, 42] 为持续学习提供了可行方向,例如近期研究展示了自然语言处理中持续提示学习 [44, 45] 在文本检索 [47] 等场景的应用。最新研究表明,离散键值方法尤其擅长处理分布偏移问题 [41],相关技术已扩展至问答领域 [7]。通过缓存数值,这些方法既能保持下游编码器的输入分布稳定性,又能在资源允许时实现长期记忆。我们在实验中验证了这些优势,证明 GRACE 在面对长序列编辑中的输入与标签偏移时具有鲁棒性。
5 Limitations and Ethical Considerations
5 局限性与伦理考量
Lifelong model editing is new and challenging, so our method GRACE has limitations. Understandably, adding similarity search to the layers of a model will slow down inference. Still, while we do not emphasize inference time, accelerating GRACE is a natural future step that has been successful in similar methods [15, 33]. Future works may also scale GRACE up to multi-layer edits, a setting unconsidered in this work. While GRACE has already scaled up continual editing to $\mathord{\sim}5\mathrm{k}$ edits, real-world scenarios might entail approaches that work at an ever larger editing scale, which presents a promising direction for future work. Another limitation of GRACE-style edits is implication: behaviors are edited in isolation. For example, if we edit a model’s knowledge of the latest pandemic in the U.S. from Swine Flu to COVID, its knowledge of the second-to-last pandemic will not be updated.
终身模型编辑是新颖且具有挑战性的,因此我们的方法GRACE存在局限性。可以理解的是,在模型层中添加相似性搜索会降低推理速度。尽管如此,虽然我们并未强调推理时间,但加速GRACE是未来自然的发展方向,类似方法已取得成果[15, 33]。未来工作还可能将GRACE扩展到多层编辑,这是本文未考虑的设定。虽然GRACE已实现持续编辑扩展至约5千次编辑,但现实场景可能需要更大编辑规模的方法,这为未来工作提供了有前景的方向。GRACE式编辑的另一局限是隐含性:行为编辑是孤立的。例如,若我们将模型对美国最新疫情的认知从猪流感更新为新冠肺炎,其对上一次疫情的认知并不会同步更新。
While editing models can improve their behavior, it can surely be used for harm. For example, a bad actor might edit an LLM to increase hate. This limitation is true for all model editors, GRACE included. However, most model editors today directly update the model’s weights. This makes it hard to trace back what edits have been made to a model and understand their impact. Transparent editors like GRACE are a promising direction for overcoming this problem. For example, GRACE’s codebook can be directly inspected to see what predictions stem from each value and examine properties of the latent space covered by the $\epsilon$-balls.
虽然编辑模型可以改善其行为,但同样可能被用于恶意目的。例如,攻击者可能通过编辑大语言模型 (LLM) 来增强仇恨言论。这一局限性适用于所有模型编辑器,包括 GRACE。然而,当前大多数模型编辑器会直接修改模型权重,导致难以追溯模型被编辑的内容及其影响。像 GRACE 这样的透明编辑器为解决该问题提供了可行方向——例如可直接检查其码本 (codebook),观察每个取值对应的预测结果,并分析由 $\epsilon$ -球覆盖的潜在空间特性。
6 Conclusions
6 结论
Pre-trained models continue to grow and are being applied to a diverse set of downstream tasks. However, they still misbehave in unpredictable ways when deployed. Correcting such behavior is a challenging problem, especially when only some of the model’s behavior is problematic. While regularized finetuning or retraining on better data can mitigate this problem somewhat, big models remain too expensive to train for most researchers. We instead build on the model editing literature, and study Lifelong Model Editing, an important but understudied problem. With a focus on transformer models for natural language processing, we edit models thousands of times in a row as edits stream during simulated deployments. Our edits only use singular, authentic errors, in contrast to recent works which make synthetic edits and require large amounts of exogenous data like sets of training edits, semantically equivalent examples, or pre-training data. We then present GRACE, a plug-in Adaptor for a model’s chosen layer that leaves the trained weights untouched. GRACE Adaptors (1) retain the functionality of the original model, while (2) successfully editing model predictions without forgetting previous inputs. Using three real-world datasets, we edit T5, BERT, and GPT models thousands of times and find that GRACE significantly outperforms seven state-of-the-art alternatives. We further investigate GRACE’s capacity to make extremely long sequences of edits and show that GRACE can generalize its edits to unseen inputs, avoiding sheer memorization.
预训练模型持续发展并应用于多样化的下游任务。然而在部署时,它们仍会出现难以预测的异常行为。修正这类行为极具挑战性,特别是当仅部分模型行为存在问题时。虽然通过正则化微调或在优化数据上重新训练能在一定程度上缓解该问题,但大型模型对多数研究者而言仍存在训练成本过高的问题。我们基于模型编辑领域的研究,探讨了长期模型编辑(Lifelong Model Editing)这一重要但研究不足的课题。聚焦自然语言处理领域的Transformer模型,我们在模拟部署场景中对模型进行了连续数千次流式编辑。与近期需要合成编辑并依赖大量外源数据(如训练编辑集、语义等价样本或预训练数据)的研究不同,我们的编辑仅针对真实存在的单一错误。随后我们提出GRACE——一种作用于模型选定层的插件适配器(Adaptor),其特点在于完全不改动已训练权重。GRACE适配器具有双重优势:(1) 完整保留原模型功能,(2) 成功修正模型预测且不遗忘历史输入。通过在三个真实数据集上对T5、BERT和GPT模型进行数千次编辑实验,我们发现GRACE显著优于七种前沿替代方案。我们进一步探究了GRACE处理超长编辑序列的能力,并证明其编辑效果可泛化至未见输入,避免了机械记忆。
7 Acknowledgements
7 致谢
We are grateful for all the support received while conducting this research. This project was supported by Quanta, and Marzyeh Ghassemi was supported by the Herman L. F. von Helmholtz Career Development Professorship and the CIFAR Azrieli Global Scholar award. Yoon Kim was supported by Machine Learning Applications @ CSAIL.
我们衷心感谢在研究过程中获得的所有支持。本项目由Quanta提供支持,Marzyeh Ghassemi获得了Herman L. F. von Helmholtz职业发展教授席位和CIFAR Azrieli全球学者奖的支持。Yoon Kim得到了机器学习应用与CSAIL的支持。
References
参考文献
A Implementation Details
实现细节
Training Details. All methods are optimized using Adam [18]. Since edits are singular and sequential in our setup, the batch size is always 1. We trained all methods using various GPUs including 48GB NVIDIA RTX A6000s, 40GB NVIDIA A100s, and 80GB NVIDIA A100s. Timing results are reported from experiments using a 48GB NVIDIA RTX A6000 GPU. GRACE does not depend on the scale of the model, but the scale of the model depends on available compute resources. To avoid sharding, we use models that fit on one GPU, though GRACE principles apply beyond this setup. For Adaptor-based editors, we use 100 iterations of gradient descent per input. For experiments using SCOTUS for classification and zsRE for Question Answering, we include early stopping once the model’s output is corrected.
训练细节。所有方法均使用Adam [18]进行优化。由于本实验设置中编辑操作具有单一性和顺序性,批量大小始终为1。我们使用多种GPU进行训练,包括48GB NVIDIA RTX A6000、40GB NVIDIA A100和80GB NVIDIA A100。时序实验数据基于48GB NVIDIA RTX A6000 GPU得出。GRACE方法不依赖于模型规模,但模型规模取决于可用计算资源。为避免分片,我们采用单GPU可容纳的模型,但GRACE原理同样适用于更大规模的配置。基于适配器(Adaptor)的编辑器对每个输入执行100次梯度下降迭代。在使用SCOTUS数据集进行分类任务和zsRE数据集进行问答任务的实验中,当模型输出被修正后即启用早停机制。
Hyperparameters. For our Finetuning, Finetuning with EWC, Finetuning with periodic retraining, Memory, Defer, and MEND comparisons, we consider learning rates of 1.0, $1e^{-1}$, $1e^{-2}$, $1e^{-3}$, $1e^{-4}$, and $1e^{-5}$. We find that Finetuning, Memory, and MEND worked best with $1e^{-2}$. For Adaptor methods GRACE and Defer, we found that a large learning rate of 1.0 was required to update the model’s behavior since their only control on the model’s output is by replacing tokens at one layer.
超参数。在我们的微调 (Finetuning)、带EWC的微调、周期性重训练的微调、Memory、Defer和MEND对比实验中,我们考虑了1.0、$1e^{-1}$、$1e^{-2}$、$1e^{-3}$、$1e^{-4}$和$1e^{-5}$的学习率。我们发现微调、Memory和MEND在$1e^{-2}$时表现最佳。对于适配器方法GRACE和Defer,由于它们仅通过替换某一层的Token来控制模型输出,因此需要1.0的大学习率来更新模型行为。
Choosing which layer to edit is another hyperparameter for all editors. In all of our comparisons between editors, each editor edits the same layer. For T5, this is the last dense layer of the last encoder block (encoder.block[7].layer[1].DenseReluDense.wo), for BERT it is the second-to-last layer (bert.encoder.layer[10].output.dense), and for GPT2-XL we edit the fully-connected component of the thirty-sixth layer (transformer.h[35].mlp.c_fc). Layer 35 is not random: layers later than 35 led to less consistent edit success. This is supported by recent work showing the impact of choosing the right layers to finetune [23]. We study these layer choices below, but note that choosing this layer is a hyperparameter in practice: for the purposes of comparison, we ensure editors are compared when editing the same layers.
选择编辑哪一层是所有编辑器的另一个超参数。在我们对所有编辑器的比较中,每个编辑器都修改相同的层。对于T5,这是最后一个编码器块的最后一个密集层 (encoder.block[7].layer[1].DenseReluDense.wo);对于BERT,这是倒数第二层 (bert.encoder.layer[10].output.dense);对于GPT2-XL,我们修改第36层的全连接组件 (transformer.h[35].mlp.c_fc)。选择第35层并非随机:超过35层的修改会导致编辑成功率下降。这一结论得到了近期关于选择正确微调层影响的研究支持 [23]。我们将在下文分析这些层选择,但需注意实践中选择该层是一个超参数:为了确保比较公平性,我们保证所有编辑器在修改相同层时进行对比。
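Layer identifiers such as `encoder.block[4].layer[1].DenseReluDense.wo` mix attribute access with list indexing. The sketch below shows one way such a string could be resolved to an object. `resolve_path` is a hypothetical helper written for illustration, and the nested `SimpleNamespace` objects are stand-ins for a real model, not any library's API.

```python
import re
from types import SimpleNamespace

# Either an attribute name or a bracketed integer index; dots are skipped.
_TOKEN = re.compile(r"([A-Za-z_]\w*)|\[(\d+)\]")

def resolve_path(root, path: str):
    """Follow a path like 'encoder.block[4].layer[1].DenseReluDense.wo'."""
    obj = root
    for name, index in _TOKEN.findall(path):
        # Exactly one of the two capture groups is non-empty per match.
        obj = getattr(obj, name) if name else obj[int(index)]
    return obj

# Stand-in object tree mimicking the shape of the T5 path.
wo = SimpleNamespace(name="wo")
layer1 = SimpleNamespace(DenseReluDense=SimpleNamespace(wo=wo))
block4 = SimpleNamespace(layer=[None, layer1])
model = SimpleNamespace(encoder=SimpleNamespace(block=[None] * 4 + [block4]))

target = resolve_path(model, "encoder.block[4].layer[1].DenseReluDense.wo")
print(target is wo)  # True
```

Keeping the edited layer as a string makes it easy to treat the layer choice as a hyperparameter shared across editors.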
GRACE has one unique hyperparameter, $\epsilon_{\mathrm{init}}$, which is the size of the $\epsilon$-ball initialized around new codebook entries. In our main results, we set $\epsilon_{\mathrm{init}}=0.5$ for zsRE (leads to 137 keys/1000 edits), $\epsilon_{\mathrm{init}}=1.0$ for SCOTUS (leads to 252 keys/375 edits), and $\epsilon_{\mathrm{init}}=1.0$ for Hallucination (leads to 1341 keys/1392 edits). We study the impacts of choosing $\epsilon_{\mathrm{init}}$ further in Appendices F and H.
GRACE有一个独特的超参数$\epsilon_{\mathrm{init}}$,它表示新码本条目初始化时周围$\epsilon$球的大小。在我们的主要结果中,对于zsRE设置$\epsilon_{\mathrm{init}}=0.5$(生成137个键/1000次编辑),对于SCOTUS设置$\epsilon_{\mathrm{init}}=1.0$(生成252个键/375次编辑),对于Hallucination设置$\epsilon_{\mathrm{init}}=1.0$(生成1341个键/1392次编辑)。我们将在附录$\mathrm{F}$和H中进一步研究选择$\epsilon_{\mathrm{init}}$的影响。
Runtime. Our main results report runtime for each method on the Hallucination dataset editing GPT2-XL. The time recorded is only the average time required to make one edit. This ignores all preprocessing, precomputation, and pretraining.
运行时。我们的主要结果报告了在Hallucination数据集上编辑GPT2-XL时每种方法的运行时间。这里记录的时间仅是进行一次编辑所需的平均时间,不包括任何预处理、预计算或预训练。
B Additional Dataset and Model Descriptions
B 额外数据集和模型描述
As summarized in Table 1, we provide detailed summaries of the datasets and models we use for editing in our experiments.
如表 1 所示,我们详细总结了实验中用于编辑的数据集和模型。
B.1 Editing T5 on zsRE Question Answering
B.1 在zsRE问答任务上编辑T5
zsRE Dataset. zsRE [24] is a context-free question-answering dataset that has been extensively studied in the model editing literature [8, 16, 30, 31]. As in previous works, we extract our edit set from the same validation dataset. For our main experiments, our edit set is the first 200 question–answer pairs in the zsRE validation set that have at least 5 rephrasings each. We then use the first 5 rephrasings for each, leading to 1000 possible edits. To ablate design choices in GRACE, we use these same edits in Appendix F.
zsRE数据集。zsRE [24]是一个无上下文问答数据集,在模型编辑研究领域被广泛研究[8, 16, 30, 31]。与先前工作相同,我们从相同的验证数据集中提取编辑集。在主要实验中,我们的编辑集选取zsRE验证集中前200个问答对(每个问答对至少包含5种重述形式),并采用每种问答的前5种重述形式,最终形成1000个可编辑项。为分析GRACE的设计选择,我们在附录F中使用了相同的编辑集。
We also analyze GRACE’s generalization (Table 2 in the main paper) and performance on extremely long sequences of edits (Figure 4 in the main paper and Appendix H) using zsRE. To do this, we repeat the above procedure but use the 10 rephrasings from 1000 question–answer pairs. We then split this into an edit set and a holdout set randomly, leading to 5000 potential edits and 5000 holdout edits.
我们还使用zsRE分析了GRACE的泛化能力(主论文中的表2)和对极长编辑序列的性能表现(主论文中的图4和附录H)。为此,我们重复上述流程,但改用1000个问答对中的10种改写方式。随后将其随机划分为编辑集和保留集,最终得到5000个潜在编辑和5000个保留编辑。
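The split can be sketched as follows. This assumes each question's rephrasings are shuffled and divided in half, which is one way to arrive at the 5000/5000 figures; the exact procedure is not spelled out in the text, and `split_rephrasings` and the placeholder strings are hypothetical.

```python
import random

def split_rephrasings(questions, seed=0):
    """questions: dict mapping a question id to its list of rephrasings.

    Returns (edits, holdouts), each a list of (question_id, rephrasing).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    edits, holdouts = [], []
    for qid, phrasings in questions.items():
        shuffled = phrasings[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        edits += [(qid, p) for p in shuffled[:half]]
        holdouts += [(qid, p) for p in shuffled[half:]]
    return edits, holdouts

# 1000 placeholder questions with 10 placeholder rephrasings each.
questions = {q: [f"q{q}-r{i}" for i in range(10)] for q in range(1000)}
edits, holdouts = split_rephrasings(questions)
print(len(edits), len(holdouts))  # 5000 5000
```

Splitting within each question guarantees that every holdout item has sibling rephrasings in the edit stream, which is what makes Holdout F1 a measure of generalization.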
T5 QA Model. We edit the publicly available 60-million-parameter version of T5 (google/t5-small-ssm-nq on Hugging Face). This model was originally trained on the Natural Questions dataset, of which we use 1000 random samples to evaluate TRR during editing. This model achieves F1 of 0.72 on NQ and 0.31 on zsRE prior to editing. In our main results in Table 2, we edit T5’s “wo” module in its encoder’s $4^{\mathrm{th}}$ block’s layer 1: encoder.block[4].layer[1].DenseReluDense.wo.
T5问答模型。我们修改了公开可用的6000万参数版T5模型(Hugging Face平台上的google/t5-small-ssm-nq)。该模型最初在Natural Questions数据集上训练,我们从中随机选取1000个样本来评估编辑过程中的TRR指标。在编辑前,该模型在NQ数据集上的F1得分为0.72,在zsRE数据集上为0.31。如表2主要结果所示,我们修改的是T5编码器第4模块第1层中的"wo"模块:encoder.block[4].layer[1].DenseReluDense.wo。
B.2 Editing BERT on SCOTUS Classification
B.2 在SCOTUS分类任务上编辑BERT
SCOTUS Dataset. We consider a single-label multi-class classification task from Fairlex [6]. SCOTUS contains supreme court rulings on controversial issues [38]. Given a document describing the court’s opinion, the task is to predict to which of 14 topics (classes) the document belongs. The 14 topic classes are clustered according to the matter of dispute: Criminal Procedure, Civil Rights, First Amendment, Due Process, Privacy, Attorneys, Unions, Economic Activity, Judicial Power, Federalism, Interstate Relations, Federal Taxation, Miscellaneous, and Private Action. There are 7.4k training documents from cases that took place from 1946–1982. Then, there are 914 validation documents from cases from 1982–1991. Finally, there are 931 testing cases from 1991–2009. Since annotation guidelines do shift over time [38], we exacerbate this shift slightly by making three realistic changes only to the training set:
SCOTUS数据集。我们采用Fairlex [6]中的单标签多分类任务,SCOTUS包含最高法院对争议性议题的裁决记录[38]。给定描述法庭意见的文档,任务是预测该文档属于14个主题类别中的哪一个。这14个主题类别根据争议事项聚类为:刑事诉讼程序、民事权利、第一修正案、正当程序、隐私权、律师事务、工会、经济活动、司法权力、联邦主义、州际关系、联邦税收、杂项及私人诉讼。训练集包含1946-1982年间案件的7.4k份文档,验证集包含1982-1991年间案件的914份文档,测试集包含1991-2009年间案件的931份文档。由于标注准则会随时间变化[38],我们仅对训练集进行三项现实调整以略微加剧这种偏移:
This realistic type of relabeling creates cases in these data that are especially challenging to forecast during training or adjust for during editing. We perform edits on the test set from 1991–2009, for which no relabeling was done. TRR is computed on the 914 validation documents from 1982–1991.
这种现实的重新标注方式在数据中创建了一些案例,这些案例在训练期间预测或编辑期间调整尤其具有挑战性。我们在1991-2009年的测试集上进行编辑,该测试集未进行任何重新标注。TRR基于1982-1991年的914份验证文档计算得出
BERT. We finetune a 110-million-parameter BERT model (bert-base-cased on Hugging Face) on the SCOTUS training set. We use 50 epochs with a batch size of 4 and a learning rate of $5e^{-5}$ with an AdamW optimizer using Hugging Face’s default model trainer. The finetuned BERT model achieves 0.99 accuracy on its original validation data and 0.55 on the edit set, indicating a large amount of improvement to be made from editing. In our main results in Table 2, we edit BERT’s “dense” module in its $10^{\mathrm{th}}$ layer: bert.encoder.layer[10].output.dense.weight.
BERT。我们在SCOTUS训练集上对1.1亿参数的BERT模型(Hugging Face上的bert-base-cased)进行微调,使用Hugging Face默认模型训练器,采用AdamW优化器,设置50个训练周期、批次大小为4、学习率为$5e^{-05}$。微调后的BERT模型在原验证数据上达到0.99准确率,在编辑集上为0.55,表明通过编辑还有很大提升空间。如表2主要结果所示,我们修改了BERT第$10^{\mathrm{th}}$层"dense"模块:bert.encoder.layer[10].output.dense.weight。
B.3 Editing GPT2 on Hallucination Language Modeling
B.3 针对幻觉语言建模的GPT2编辑
Hallucination with SelfCheckGPT. We examine how well model editors can mitigate hallucination in autoregressive language models using a new dataset released with SelfCheckGPT [27]. To generate this dataset, which we refer to as “Hallucination,” the authors prompt GPT-3 to generate Wikipedia-style biographies for concepts extracted from WikiBio. They then annotate the factual accuracy of each sentence, noting which are hallucinations. We propose editing highly inaccurate sentences by replacing them with corresponding sentences in the true Wikipedia entries. To acquire “true” Wikipedia entries, we extract Wikipedia summaries from WikiBio and acquire the correct sentence at the same index as the edited sentence. For example, if sentence 3 of a biography of “Bon Jovi” is a hallucination, we would edit a model to produce sentence 3 of the Wikipedia entry for “Bon Jovi” given sentences 1 and 2 generated by GPT-3. This way, every edit contains a prompt, which is the GPT-3-generated text up until the hallucination, and a label, which is the correct next sentence from Wikipedia. There are 1392 potential edits and 516 already-accurate sentences. The editing task here is to decrease the LLM’s perplexity (PPL) on the corrected sentences without increasing its perplexity on either the model’s training data or the already-correct portions of Hallucination.
自检GPT的幻觉现象研究。我们通过Self Check GPT [27]发布的新数据集,探究模型编辑器如何有效缓解自回归语言模型中的幻觉问题。该数据集(称为"Hallucination")的构建方式为:作者提示GPT-3为从WikiBio提取的概念生成维基百科式传记,随后对每个句子的事实准确性进行标注,识别幻觉内容。我们提出通过用真实维基百科条目中的对应句子替换高错误率句子来进行编辑。为获取"真实"维基百科条目,我们从WikiBio提取维基百科摘要,并获取与被编辑句子索引相同的正确句子。例如,若"Bon Jovi"传记的第3句存在幻觉,我们将编辑模型使其在给定GPT-3生成的前两句(1-2句)时,输出"Bon Jovi"维基百科条目的第3句。因此每个编辑包含提示(GPT-3生成至幻觉出现前的文本)和标签(维基百科中正确的下一句)。该数据集包含1392个待编辑样本和516个原本准确的句子。此编辑任务的目标是:在降低大语言模型对修正句子的困惑度(PPL)的同时,不增加其对训练数据或Hallucination数据集中原有正确部分的困惑度。
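The edit construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors’ preprocessing code: the function name `build_edit`, the list-of-sentences representation, and the placeholder sentences are all assumptions made for the example, and sentence indices are 0-based here.

```python
from typing import List, Optional, Tuple


def build_edit(
    generated: List[str], reference: List[str], idx: int
) -> Optional[Tuple[str, str]]:
    """Build a (prompt, label) edit for a hallucinated sentence.

    generated: sentences of the model-generated biography.
    reference: sentences of the reference Wikipedia summary.
    idx: 0-based position of the hallucinated sentence.
    Returns None when no same-index reference sentence exists.
    """
    if idx < 0 or idx >= len(generated) or idx >= len(reference):
        return None
    # Prompt: the generated text up until the hallucination.
    prompt = " ".join(generated[:idx])
    # Label: the reference sentence at the same index.
    label = reference[idx]
    return prompt, label


# Placeholder sentences (hypothetical, not taken from the dataset).
generated = ["Gen sentence one.", "Gen sentence two.", "Gen sentence three."]
reference = ["Ref sentence one.", "Ref sentence two.", "Ref sentence three."]
edit = build_edit(generated, reference, 2)
```

With the third sentence flagged as a hallucination, the prompt is the first two generated sentences and the label is the third reference sentence.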
This task differs from prior editing works in two ways. First, these are authentic mistakes made by a high-quality LLM. Other works instead create synthetic edits by asking different language models to generate different inputs and swapping between them. As their datasets are generated using far lower-quality models, the sentences are highly unrealistic. Second, our edits are substantially longer than those in prior works, which have 10-token prompts and 10-token sentences to generate. By extending this paradigm to longer sentences, our editing task offers a more realistic view of model editing success.
该任务与先前编辑工作有两点不同:首先,这些是由高质量大语言模型产生的真实错误。其他研究则通过让不同语言模型生成不同输入并相互替换来创建合成编辑。由于它们的数据集是用质量低得多的模型生成的,这些句子极不真实。其次,我们的编辑内容比先前工作长得多(先前研究使用10个token的提示和10个token的句子生成)。通过将该范式扩展到更长句子,我们的编辑任务能更真实地反映模型编辑的成功情况。
GPT2-XL. Since GPT-3 generated the Hallucination dataset, it would be ideal to edit GPT-3 directly. However, since it is proprietary and too large for practical research, we edit GPT2, using the 1.5B-parameter “XL” model (gpt2-xl on Hugging Face). Being a “small” LLM, this model is accessible to a broad range of researchers yet is of high enough quality to serve as an effective stand-in for bigger models. Unfortunately, GPT2 is not already a high-quality biography generator, and so it finds all of the Hallucination sentences to be unlikely, even those that are “already correct” from GPT-3. To overcome this, we finetune GPT2 on the already-accurate Hallucination data mixed with 200 random sentences from OpenWebText [10], which is a public version of GPT2’s original training data. This finetuned model achieves a perplexity of 15.98 on OpenWebText (comparable to GPT2-XL on its training data), 8.7 on already-accurate outputs, and 132.7 on intended edits. Our finetuned GPT2-XL model will be made public with our paper. Thus, with GPT2-XL mimicking GPT-3 for these data, we can evaluate model editors on decreasing the edit PPL while maintaining the model’s PPL on OpenWebText and the already-accurate sentences. In our main results in Table 2, we edit GPT2-XL’s “c_fc” module in its $36^{\mathrm{th}}$ layer: transformer.h[35].mlp.c_fc.
GPT2-XL。由于GPT-3生成了幻觉数据集(Hallucination dataset),直接编辑GPT-3是最理想的选择。但因其属于专有模型且规模过大不便研究,我们改用参数规模15亿的"XL"模型$({\tt gpt2-xl}$,Hugging Face平台版本)进行编辑。作为一款"小型"大语言模型,该模型既便于广大研究者使用,又具备足够质量来替代更大模型。遗憾的是,GPT2本身并非高质量传记生成器,因此它会判定所有幻觉语句(包括GPT-3生成的"正确"语句)都缺乏可能性。为此,我们使用准确幻觉数据混合Open Web Text[10]中200条随机语句(即GPT2原始训练数据的公开版本)对GPT2进行微调。微调后模型在Open Web Text上的困惑度(perplexity)为15.98(与其训练数据上的GPT2-XL相当),在准确输出上为8.7,在待编辑语句上为132.7。微调后的GPT2-XL模型将随论文公开。通过让GPT2-XL模拟GPT-3处理这些数据,我们得以评估模型编辑器在降低编辑语句困惑度的同时,保持其在Open Web Text和准确语句上表现的能力。如表2所示,我们最终编辑的是GPT2-XL第36层($36^{\mathrm{th}}$ layer)的"c_fc"模块:transformer.h[35].mlp.c_fc。
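For readers unfamiliar with the metric, the sketch below computes perplexity from per-token log-probabilities under the standard definition, the exponential of the mean negative log-likelihood. The text does not spell out the exact averaging used for the reported numbers, so per-token averaging is an assumption here, and the function name is illustrative.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).

    token_log_probs: natural-log probabilities the model assigns to
    each target token (assumed to be precomputed elsewhere).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Sanity check: a model that assigns probability 1/4 to every token
# is as uncertain as a uniform choice among 4 options.
uniform_ppl = perplexity([math.log(0.25)] * 8)
```

Lower perplexity on the corrected sentences means the edited model finds them more likely, which is the direction the editing task rewards.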
C Descriptions of Compared Editors
C 对比编辑器描述
Finetune. We finetune the weights in a chosen layer using Adam optimization, leaving all other layers of the pretrained model frozen. Finetuning uses the cross-entropy loss for T5 and BERT, and minimizes the negative log-probability of the corrected sentences for GPT.
微调。我们使用Adam优化对选定层的权重进行微调,保持预训练模型其他所有层冻结不变。针对T5和BERT采用交叉熵损失进行微调,对于GPT则通过最小化修正句子的对数概率来实现。
Finetune with EWC. We perform the same finetuning as above, but update the weights with Elastic Weight Consolidation (EWC). This is a well-studied approach to reducing catastrophic forgetting in continual learning: it regularizes weight updates so that important weights remain close to their previous values over time.
使用EWC进行微调。我们执行与上述相同的微调过程,但采用弹性权重巩固(Elastic Weight Consolidation)更新权重。这是一种经过充分研究的方法,通过正则化权重更新来保持重要权重不变,从而在持续学习中最小化灾难性遗忘。
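As a reminder of what the EWC regularizer looks like, the sketch below implements the standard quadratic penalty from the EWC literature, $\frac{\lambda}{2}\sum_i F_i(\theta_i-\theta_i^{*})^2$, which is added to the task loss. The flat-list parameter representation, the diagonal Fisher estimate passed in as `fisher_diag`, and the value of `lam` are assumptions for illustration; the text above does not specify these details.

```python
from typing import Sequence


def ewc_penalty(
    params: Sequence[float],
    anchor_params: Sequence[float],
    fisher_diag: Sequence[float],
    lam: float,
) -> float:
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params: current weights of the edited layer (flattened).
    anchor_params: weights before the new edit (flattened).
    fisher_diag: per-weight importance, e.g. a diagonal Fisher estimate.
    """
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("all inputs must have the same length")
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


# Moving an important weight (F=2) by 1.0 costs more than leaving an
# even more important weight (F=5) untouched.
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [2.0, 5.0], lam=3.0)
```

The total objective would then be the finetuning loss plus this penalty, so weights with large importance are discouraged from drifting.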
Finetune with Retraining. We perform the same finetuning as above, but periodically finetune the original pre-trained model on all previous edits. This requires caching previous edits and may exacerbate catastrophic forgetting of the model’s pre-training data.
微调与再训练。我们执行与上述相同的微调过程,不同之处在于会定期对所有先前的编辑内容重新微调原始预训练模型。这需要缓存历史编辑数据,但可能会加剧模型在预训练数据上的灾难性遗忘问题。
MEND [30]. MEND uses a hypernetwork to forecast new weights for a pre-trained model’s chosen layer by predicting a low-rank decomposition of the layer’s weight matrix. The hypernetwork is trained on a set of training edits, which include a new edit, a set of inputs that are semantically equivalent to the edit, and samples from the model’s pre-training data (or some analogous set). The network is trained to make the edit, apply the fix to the semantically equivalent inputs, and minimize the KL divergence between the new and old model predictions on the model’s pre-training data. In our setup, we only have single edits that stream in, so there are neither training edits for the hypernetwork nor access to the model’s pre-training data. Therefore, we train the hypernetwork to predict updated weights as edits stream in using continuous finetuning. Per the original paper’s recommendations, we train the hypernetwork to predict a new gradient vector and value vector, which are multiplied to create a matrix with which to update the model’s existing weights. We use the authors’ original code from github.com/eric-mitchell/mend.
MEND [30]。MEND通过预测预训练模型选定层的权重矩阵低秩分解,使用超网络为该层预测新权重。该超网络在一组训练编辑上进行训练,这些编辑包括一个新编辑、一组与编辑语义等价的输入,以及来自模型预训练数据(或类似集合)的样本。网络被训练用于执行编辑,将修正应用于语义等价的输入,并最小化新旧模型在预训练数据上预测的KL散度。在我们的设置中,只有单条编辑流式输入,因此既没有用于超网络的训练编辑,也无法访问模型的预训练数据。因此,我们通过持续微调在编辑流输入时训练超网络预测更新后的权重。根据原论文建议,我们训练超网络预测新的梯度向量和值向量,二者相乘生成用于更新模型现有权重的矩阵。我们使用了作者在github.com/eric-mitchell/mend上的原始代码。
ROME [28]. ROME identifies weights in chosen layers of GPT models and updates them to alter a GPT model’s factual knowledge. Because ROME is designed only for GPT models, we evaluate it only on the Hallucination editing task. We use the authors’ original code from github.com/kmeng01/rome with a template for Wikipedia bios: This is a Wikipedia passage about {}.
ROME [28]。ROME通过识别GPT模型选定层中的权重并更新它们来改变GPT模型的事实知识。由于该方法仅针对GPT模型设计,我们在幻觉编辑任务上对其进行了评估。我们使用作者原始代码github.com/kmeng01/rome,并采用维基百科人物传记模板:这是一段关于{}的维基百科文章。
Defer. Inspired by SERAC [31], we implement a deferral-based model editor, which we call Defer. At a chosen layer, for a new input, Defer chooses to either 1) trust the pre-trained model’s predictions, or 2) replace the layer’s prediction with a vector predicted by a new model. This involves learning two new predictors for a Defer Adaptor at a given layer. First, a probability of deferral is predicted by one network $g:\mathbb{R}^{|h^{l-1}|}\rightarrow[0,1]$. For $g$, we use a one-layer network with a sigmoid activation function. Second, a new value is predicted by another one-layer network $o:\mathbb{R}^{|h^{l-1}|}\to\mathbb{R}^{|h^{l}|}$. In early experiments, we found that increasing these networks’ depth overfits quickly. $g$ and $o$ are jointly trained via the pre-trained model’s finetuning loss continually as new edits arrive.
延迟(Defer)。受SERAC [31] 启发,我们实现了一种基于延迟的模型编辑器,称为Defer。在选定层面对新输入时,Defer会选择:1) 信任预训练模型的预测,或2) 用新模型预测的向量替换该层的预测。这需要为指定层的Defer Adaptor学习两个新预测器:首先通过网络$g:\mathbb{R}^{|h^{l-1}|}\rightarrow[0,1]^{\overline{{1}}}$预测延迟概率,该网络采用带sigmoid激活函数的单层结构;其次通过另一个单层网络$o:\mathbb{R}^{|h^{\bar{l}-1}|}\to\mathbb{R}^{|h^{l}|}$预测新值。早期实验表明,增加这些网络的深度会快速导致过拟合。随着新编辑任务的到来,$g$和$o$会通过预训练模型的微调损失进行联合训练。
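The forward pass of such a deferral adaptor can be sketched as follows. This is an inference-time illustration only: the hard 0.5 threshold, the parameter names (`w_g`, `b_g`, `W_o`, `b_o`), and the plain-list vectors are assumptions, and training would need a differentiable treatment of the deferral decision that is not shown here.

```python
import math
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def defer_forward(
    h_prev: Sequence[float],
    layer_out: Sequence[float],
    w_g: Sequence[float],
    b_g: float,
    W_o: Sequence[Sequence[float]],
    b_o: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off, not specified in the text
) -> Tuple[List[float], float]:
    """Return (activation passed to the next layer, deferral probability)."""
    # g: one-layer network with a sigmoid, mapping h^{l-1} to [0, 1].
    p_defer = sigmoid(dot(w_g, h_prev) + b_g)
    if p_defer > threshold:
        # o: one-layer network mapping h^{l-1} to a replacement for h^l.
        return [dot(row, h_prev) + b for row, b in zip(W_o, b_o)], p_defer
    # Otherwise trust the pre-trained layer's own output.
    return list(layer_out), p_defer


h_prev = [1.0, -1.0]
layer_out = [0.5, 0.5, 0.5]
W_o = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_o = [0.0, 0.0, 0.0]
replaced, p_hi = defer_forward(h_prev, layer_out, [2.0, 0.0], 0.0, W_o, b_o)
kept, p_lo = defer_forward(h_prev, layer_out, [-2.0, 0.0], 0.0, W_o, b_o)
```

In the first call the deferral probability is high, so the layer output is replaced by $o(h^{l-1})$; in the second it is low, so the pre-trained output passes through unchanged.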
Memory. As a softer version of GRACE’s codebook, we implement an Adaptor editor based on memory networks [46]. This editor contains two components: a memory module that serves as a predicted cache for learned activations, and an attention mechanism that learns to retrieve appropriate memories. We implement the memory module as a matrix $M\in\mathbb{R}^{N\times|h^{l}|}$, containing $N$ memories, each of which has $|h^{l}|$ dimensions. For our experiments, we set $N=50$, and the memory module is learnable over time. Then, we learn a simple attention mechanism $g:\mathbb{R}^{|h^{l-1}|}\rightarrow[0,1]^{1\times N}$, where $g$’s outputs sum to one. For $g$, we use a one-layer neural network with a softmax activation function. The editor layer’s output is then the product $g(h^{l-1})M$, resulting in a newly predicted $|h^{l}|$-dimensional vector to be passed into layer $l+1$. $M$ and $g$ are trained jointly via the pre-trained model’s finetuning loss continually as new edits arrive.
记忆。作为GRACE码本的温和版本,我们基于记忆网络[46]实现了一个适配器编辑器。该编辑器包含两个组件:一个作为学习激活预测缓存的内存模块,以及一个学习检索适当记忆的注意力机制。我们将内存模块实现为矩阵M ∈ RN×hl,包含$N$个记忆单元,每个单元具有$\mathbf{\dot{\Omega}}h^{l}$维度。实验中设置$N=50$,且该内存模块会随时间进行学习。随后,我们构建了一个简单注意力机制$g:\mathbb{R}^{h^{l-1}}\rightarrow[0,1]^{1\times N}$,其输出总和为1。对于$g$,我们采用带softmax激活函数的单层神经网络。编辑器层的输出即为二者的乘积$g(h^{l-1})M$,生成一个新的$|h^{l}|$维向量传递至$l+1$层。随着新编辑任务的到来,$M$和$g$通过预训练模型的微调损失进行联合训练。
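The product $g(h^{l-1})M$ is a soft lookup: attention weights over the $N$ memories mix the rows of $M$ into one output vector. A minimal forward-pass sketch follows, with toy dimensions and hypothetical parameter names (`W_g`, `b_g`); training of $M$ and $g$ is not shown.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_forward(
    h_prev: Sequence[float],
    W_g: Sequence[Sequence[float]],  # N rows, each of length |h^{l-1}|
    b_g: Sequence[float],            # N biases
    M: Sequence[Sequence[float]],    # N memories, each of length |h^l|
) -> Tuple[List[float], List[float]]:
    """Return (g(h_prev) @ M, attention weights)."""
    scores = [sum(w * h for w, h in zip(row, h_prev)) + b
              for row, b in zip(W_g, b_g)]
    attn = softmax(scores)  # non-negative, sums to one
    out_dim = len(M[0])
    out = [sum(attn[n] * M[n][d] for n in range(len(M)))
           for d in range(out_dim)]
    return out, attn


# Toy example with N=2 memories of dimension 3; zero attention weights
# give uniform attention, so the output is the mean of the two memories.
M = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
out, attn = memory_forward([1.0, -1.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], M)
```

Unlike GRACE’s discrete codebook, every memory contributes to every output here, which is the sense in which this editor is a softer version.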
Table 3: Hyperparameter options we searched through for GRACE on T5 for zsRE edits.
| Parameter | Search space |
|---|---|
| Edited Block | 2, 7 |
| $\epsilon_{\mathrm{init}}$ | 0.1, 0.5, 1.0, 3.0 |
| Replacement Token | Last Input Token, All Input Tokens, All Tokens |
| Value Initialization | Uniform: U(0, 1) ("cold"), h' ("warm") |
表 3: 我们在T5模型上为zsRE编辑任务搜索的GRACE超参数选项
| PARAMETER | SEARCHSPACE |
|---|---|
| EditedBlock | 2,7 |
| Einit | 0.1,0.5,1.0,3.0 |
| Replacement Token Initialization | Last Input Token, All Input Tokens, All Tokens Uniform: U(0, 1) ("cold"), h' ("warm") |
D Interpreting GRACE codebooks
D 解读GRACE码本
As introduced in the main paper in Section 3.2.4, a benefit of GRACE is that the codebook is detachable from the pretrained model. This feature allows for closer inspection of the editing process. We first expand the results in the main paper by including the average F1 score per key in Figure 7, where we see that keys overall achieve highly accurate edits on holdout data. This implies that when a holdout question has a key, the key maps to the correct answer. However, not all holdouts have associated keys. We therefore examine this further in Figure 8, where we investigate the top-10 keys containing the largest numbers of holdout questions for five choices of $\epsilon_{\mathrm{init}}$. As expected, larger $\epsilon_{\mathrm{init}}$ values lead to more holdouts per key, while small $\epsilon_{\mathrm{init}}$ values lead to very few holdouts being captured by each key. Additionally, large $\epsilon_{\mathrm{init}}$ values lead to small codebooks and small values lead to large codebooks. So when $\epsilon_{\mathrm{init}}=0.1$, $60\%$ of the holdouts do eventually have a key, but this comes at the expense of codebook size, which grows to 547 entries. On the right-hand side, we also see the proportion of the holdout set that has any associated key.
如主论文第3.2.4节所述,GRACE的优势在于其码本可与预训练模型分离。这一特性使得编辑过程能够被更细致地观察。我们首先通过在图7中补充每个键的平均F1分数来扩展主论文的结果,可见键总体上对保留数据实现了高精度编辑。这表明当保留问题存在对应键时,该键能映射到正确答案。但并非所有保留问题都有关联键,因此我们在图8中进一步研究了包含最多保留问题的前10个键(针对5种不同的$\epsilon_{\mathrm{init}}$设置)。正如预期,较大的$\epsilon_{\mathrm{init}}$会导致每个键对应更多保留问题,而较小$\epsilon_{\mathrm{init}}$值则使每个键仅捕获极少量保留问题。此外,大$\epsilon_{\mathrm{init}}$值会生成小码本,小值则产生大码本。例如当$\epsilon_{\mathrm{init}}=0.1$时,60%的保留问题最终会获得关联键,但代价是码本规模会增长至547个条目。右侧图表同时展示了保留集中具有关联键的比例。
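The per-key and overall capture statistics discussed above can be computed along the following lines. This is a sketch under stated assumptions rather than the authors’ analysis code: it assumes Euclidean distance, that each holdout is matched only against its nearest key, and that “captured” means the distance is strictly less than that key’s $\epsilon$. The 2-D vectors stand in for layer activations.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple


def capture_stats(
    holdouts: Sequence[Sequence[float]],
    keys: Sequence[Sequence[float]],
    epsilons: Sequence[float],
) -> Tuple[float, Dict[int, int]]:
    """Return (fraction of holdouts inside any epsilon-ball, count per key)."""
    if not holdouts or not keys:
        return 0.0, {}
    per_key: Counter = Counter()
    for h in holdouts:
        dists = [math.dist(h, k) for k in keys]
        nearest = min(range(len(keys)), key=dists.__getitem__)
        # Captured only if the holdout falls inside the nearest key's ball.
        if dists[nearest] < epsilons[nearest]:
            per_key[nearest] += 1
    return sum(per_key.values()) / len(holdouts), dict(per_key)


keys = [[0.0, 0.0], [10.0, 0.0]]
epsilons = [1.0, 3.0]
holdouts = [[0.5, 0.0], [8.0, 0.0], [5.0, 0.0], [0.0, 2.0]]
fraction, per_key = capture_stats(holdouts, keys, epsilons)
```

In this toy case two of the four holdouts land inside a ball, one per key; enlarging either radius would capture more holdouts, which mirrors the trade-off between $\epsilon_{\mathrm{init}}$ and codebook size described above.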

Figure 7: Interpreting generalizability of GRACE codebooks. F1 is computed on unseen holdout edits that land inside any key’s $\epsilon$-ball. The gray dashed line denotes the true average number of rephrasings per observed edit.
图 7: GRACE码本泛化能力解读。F1分数计算的是落在任意键$\epsilon$球范围内的未见保留编辑。灰色虚线表示每个观测编辑的真实平均复述次数。
E Parameter Efficiency
E 参数效率
As introduced in the main paper, GRACE’s memory requirements are small. This is because a new edit only requires $|h^{l-1}|+|h^{l}|+1$ parameters, where $|h^{l-1}|$ is the dimension of the key, $|h^{l}|$ is the dimension of the value, and $\epsilon_{\mathrm{init}}$ is a scalar. Further, the key’s $|h^{l-1}|$ parameters are frozen and do not require learning. At inference time, activating the Adaptor only alters the model’s predictions according to the $|h^{l-1}|+|h^{l}|+1$ parameters of the chosen entry. As a concrete example, consider a GRACE codebook with 500 entries that edits the 60-million-parameter T5 model. For 500 edits, we would then store $500\times(|h^{l-1}|+|h^{l}|+1)=500\times(1024+512+1)=768{,}500$ parameters, of which only $500\times(512+1)=256{,}500$ are learnable. 768,500 is only $1.3\%$ of T5’s original 60 million parameters. Edits also appear sporadically in practice, so accruing 500 edits may easily span thousands or millions of inputs. While GRACE codebooks do grow over time, adding parameters according to the data size is quite attractive in such sequential setups. Other Adaptor methods, by contrast, initialize all their weights up front, leading to a far harder learning problem. MEND, for instance, predicts a 1024-dimensional and a 512-dimensional vector, requiring 1,310,720 new parameters to be learned all at once. This is reasonable in their setup with training edits, but is not directly applicable to our setup. Defer uses a comparable parameter count to GRACE, adding 525,312. Memory, on the other hand, is tiny, adding just 26,624 learnable parameters because we set $N=50$. Early experiments with Memory indicate that increasing $N$ is unhelpful.
如主论文所述,GRACE的内存需求很小。这是因为每次新编辑仅需 $|h^{l-1}|+|\bar{h^{l}}|+1$ 个参数,其中 $|\bar{h}^{l-1}|$ 是键的维度, $|h^{l}|$ 是值的维度, $\epsilon_{\mathrm{init}}$ 为标量。此外,键的 $|h^{l-1}|$ 个参数被冻结无需学习。推理时,激活Adaptor仅根据所选条目的 $|h^{\check{l}-1}|+|h^{l}|+1$ 个参数调整模型预测。以6000万参数的T5模型为例,假设GRACE码本包含500条编辑,则存储参数为 $500*(|h^{l-1}|+|h^{l}|+1)=500*(1024\dot{+}512+1)=768,500$ ,其中可学习参数仅 $500*(512+1)=256$ ,500。768,500仅为T5原始参数的 $1.3%$ 。实践中编辑是零星出现的,积累500次编辑可能对应数千乃至数百万次输入。虽然GRACE码本会随时间增长,但这种按数据量递增参数的方式在序列化场景中极具优势。其他Adaptor方法(如MEND)需一次性初始化全部权重(例如预测1024维和512维向量需学习1,310,720个新参数),这在其训练编辑场景可行,但不适用于我们的设置。Defer的参数量与GRACE相当(新增525,312个),而Memory由于设 $N=50$ 仅增加26,624个可学习参数。早期实验表明增大 $N$ 对Memory无益。
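The GRACE parameter counts in the example above can be reproduced with a few lines of arithmetic. The variable names are illustrative; the dimensions (1024 for the key, 512 for the value) and the 60-million-parameter model size are the ones quoted in the text.

```python
n_edits = 500
key_dim = 1024      # |h^{l-1}|, frozen once the key is written
value_dim = 512     # |h^l|, learned per entry
model_params = 60_000_000

# Each codebook entry stores a key, a value, and one scalar epsilon.
params_per_entry = key_dim + value_dim + 1
total_params = n_edits * params_per_entry

# Keys are frozen, so only the value and epsilon of each entry are learnable.
learnable_params = n_edits * (value_dim + 1)

percent_of_model = 100 * total_params / model_params

print(total_params, learnable_params, round(percent_of_model, 1))
```

This prints 768500, 256500, and 1.3, matching the figures stated above.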

Figure 8: Interpreting per-key codebook behavior for 1,000 inputs for Block 4 of T5. At each step of editing, we record how many holdout samples are captured by a key and if so, which key. We also report the percentage of the holdout dataset that is captured by any key in the right-hand panel.
图 8: 针对T5第4模块的1,000个输入解释各键码本行为。在每次编辑步骤中,我们记录有多少保留样本被某个键捕获,以及具体是哪个键。右侧面板展示了被任意键捕获的保留数据集百分比。

Figure 9: Ablation Study on editing T5 for zsRE on 1000 inputs. “E” denotes the number of edits performed. “Steps” denotes the candidate edit in the edit set, since over 1000 inputs, not all will be flagged as errors.
图 9: 在1000个zsRE输入上对T5进行编辑的消融研究。"E"表示执行的编辑次数。"Steps"表示编辑集中的候选编辑,因为在1000个输入中并非所有都会被标记为错误。

Figure 10: Ablation Study on editing T5 for zsRE on 1000 inputs. “E” denotes the number of edits performed. “Steps” denotes the candidate edit in the edit set, since over 1000 inputs, not all will be flagged as errors.
图 10: 在1000个zsRE输入上对T5模型进行编辑的消融研究。"E"表示执行的编辑次数。"Steps"表示编辑集中的候选编辑,因为在1000多个输入中,并非所有输入都会被标记为错误。
Table 4: Examples of GRACE codebook entries. “GRACE Key” is the index of the chosen codebook entry for these inputs. “Edits” are inputs that were wrong prior to editing and are fixed by the same key. “True Label” is the true label associated with these inputs and the codebook’s learned value.
| GRACE Key | Edits | True Label |
|---|---|---|
| 0 | What type of voice is Gemma Bosini? What kind of voice is Gemma Bosini? What is the voice of Gemma Bosini? | soprano |
| 7 | In what war did Carlos W. Colby fight? What war was Carlos W. Colby fighting in? What war or fight did Carlos W. Colby fight in? What war did Carlos W. Colby fight in? What war or battle was Carlos W. Colby fighting in? | American Civil War |
| 8 | By whom was the Queen Victoria Building designed? Who was the Queen Victoria Building designed for? Who was Queen Victoria Building designed by? Which person designed Queen Victoria Building? Which architect designed Queen Victoria Building? | George McRae |
表 4: GRACE 码本条目示例。"GRACE Key" 是这些输入所选码本条目的索引。"Edits" 表示编辑前错误但通过相同键值修正的输入。"True Label" 是与这些输入及码本学习值关联的真实标签。
| GRACE KEY | EDITS | TRUE LABEL |
|---|---|---|
| 0 | What type of voice is Gemma Bosini? WhatkindofvoiceisGemmaBosini? WhatisthevoiceofGemmaBosini? | soprano |
| 7 | In what war did Carlos W. Colby fight? What war was Carlos W. Colby fighting in? What war or fight did Carlos W. Colby fight in? What war did Carlos W. Colby fight in? What war or battle was Carlos W. Colby fighting in? | AmericanCivilWar |
| 8 | By whom was the Queen Victoria Building designed? Who was the Queen Victoria Building designed for? Who was Queen Victoria Building designed by? Which person designed Queen Victoria Building? Which architect designed QueenVictoria Building? | GeorgeMcRae |
F Ablation Study
F 消融实验
We ablate GRACE components by comparing four design choices, including the edited layer and the choice of $\epsilon_{\mathrm{init}}$, using T5 on up to 1000 edits from zsRE. The hyperparameters we search across are shown in Table 3. We focus this study on the T5 encoder’s Blocks 2 and 7, which exhibit significantly different behavior.
我们通过比较四种设计选择来消融GRACE组件,这些选择使用不同的层和$\epsilon_{\mathrm{init}}$值,并在zsRE数据集上对T5模型进行最多1000次编辑。搜索的超参数如表3所示。本研究重点关注T5编码器的第2和第7块(Block),它们表现出显著不同的行为。
As our results in Figure 9 show, there are interesting relationships between value initialization and token replacement. Starting with the right-most panels, we observe that GRACE achieves significant edit compression: for up to 1000 edits, the GRACE codebooks often end up with only about 200 keys. This marks success in our codebook maintenance strategy, since this can only be achieved if $\epsilon$ values are increasing appropriately for new edits. For small $\epsilon_{\mathrm{init}}$ values, the number of keys grows larger, up to about 600, demonstrating the trade-off between codebook size and memorization. This trade-off is seen in the three left panels measuring Training Retention Rate (TRR), Edit Retention Rate (ERR), and Edit Success (ES), all in terms of F1. For the smallest value, $\epsilon_{\mathrm{init}}=0.1$, we see these three metrics remain high throughout editing. However, looking down through Figures 9a, 9b, 9c, and 9d, we see TRR and ERR drop as the codebook generally trades size for generalizability. This trend remains true when editing Block 7 (Figure 10), though this later layer appears far harder to edit successfully. This may hint at a failure mode of GRACE that may be addressed by further hyperparameter tuning: if a new key is initialized yet fails to correct the model’s behavior, even identical future inputs will still require edits. This may be addressed by training values for longer or adding new value training methods. In principle, learning a new GRACE value is no different from training any model, so it may benefit from any advanced optimization method.
图 9 的结果显示,值初始化与token替换之间存在有趣关联。最右侧面板显示,GRACE实现了显著的编辑压缩效果:在多达1000次编辑后,GRACE码本通常仅保留约200个键。这标志着我们的码本维护策略取得成功,因为只有在新编辑的$\epsilon$值适当递增时才能实现该效果。当$\epsilon_{\mathrm{init}}$取值较小时,键数量会增至约600个,体现了码本规模与记忆能力之间的权衡。这种权衡关系在左侧三个面板中得到印证,它们分别以F1分数衡量训练保留率(TRR)、编辑保留率(ERR)和编辑成功率(ES)。当$\epsilon_{init}=0.1$时,这三项指标在整个编辑过程中保持高位。但观察图9a至图9d可见,随着码本以规模换取泛化能力,TRR和ERR逐渐下降。这种趋势在编辑Block7时依然存在(图10),不过该深层网络的成功编辑难度显著更高。这可能暗示GRACE存在可通过超参数调优解决的失效模式:若新初始化的键未能修正模型行为,即使后续输入完全相同仍需重复编辑。延长值训练周期或引入新训练方法可能解决该问题——理论上,GRACE新值的学习与常规模型训练无异,因此同样适用于各类先进优化方法。
For the Replacement Token hyperparameter, we see that replacing just the value for the last input token appears to be superior when editing Block 2, especially for TRR. This makes sense because replacing only the last input token will likely have the smallest impact on unrelated inputs. Encouragingly, this replacement strategy remains on par with the others for ERR and ES. For Block 7, however, the opposite is true. Given the low overall ES, it remains unclear how replacement strategies differ in general for this layer.
对于超参数替换token (Replacement Token),我们发现仅替换最后一个输入token的值在编辑Block 2时表现更优,尤其是对TRR (Token Replacement Rate)而言。这很合理,因为仅替换最后一个输入token对无关输入的影响可能最小。令人振奋的是,这种替换策略在ERR (Error Rate Reduction)和ES (Edit Success)指标上与其他策略表现相当。然而对于Block 7,情况恰恰相反。鉴于整体ES较低,目前尚不清楚替换策略在该层的普遍差异。
For the Value Initialization hyperparameter, we see negligible differences between warm and cold starts for Block 2. For Block 7, cold appears worse overall than warm, despite having higher ES.
在超参数(hyper parameter)值初始化方面,我们发现Block 2的冷启动与热启动差异可忽略不计。而对于Block 7,尽管冷启动的ES值更高,但整体表现仍逊于热启动。
Edit Compression. Compressing multiple edits into single keys is bounded by the number of unique desired labels, since each GRACE value points to one desired output. For example, if new edits come from 5 different classes, GRACE must include at least 5 keys. As we saw in Figure 9, GRACE can successfully compress many edits into single codebook entries. Table 4 shows 3 examples of groups of inputs that each map to a single key.
编辑压缩。将多次编辑压缩为单一键值受限于所需唯一标签的数量,因为每个GRACE值都指向一个期望输出。例如,若新编辑来自5个不同类别,GRACE必须包含至少5个键值。如图9所示,GRACE能成功将多次编辑压缩至单个码本条目。表4展示了3个触发相同键值的输入案例。

Figure 11: Comparing editors over time when editing T5 on zsRE.

图 11: 在zsRE数据集上编辑T5模型时,不同编辑器随时间的效果对比。
G Extended Main Results
G 扩展主要结果
To supplement Table 1 in the main paper, we include all metrics computed throughout editing for all compared editors. We first show bar plots in Figures 11, 12, and 13. We see that the trends match the main paper: GRACE achieves strong performance relative to the compared editors. We also show finer-grained versions of these plots computed for each edit over time in Figures 14, 15, and 16.
为补充正文中的表1,我们包含了所有对比实验在编辑过程中计算的全部指标。首先在图11、图12和图13中展示柱状图。可以看出趋势与正文一致:相比其他方法,GRACE展现出强劲性能。此外,我们在图14、图15和图16中展示了按编辑时间细分的更精细版本图表。
H Extended Hyperparameter Study
H 扩展超参数研究
To further extend our study on the impacts of the most important hyperparameters, $\epsilon_{\mathrm{init}}$ and the chosen layer, we run a far longer experiment on the QA editing task, making up to 5,000 edits. Our results from this experiment are shown in Figure 17. In this finer-grained analysis, we find that our main claims remain true: some layers are better to edit than others, GRACE succeeds in generalizing to previously unseen inputs, $\epsilon$ controls the trade-off between memorization and generalization, and GRACE codebooks stabilize in size over time. Performance remains the same above $\epsilon=6.0$. Interestingly, for $\epsilon=2.0$, Layer 6 appears to be highly unstable over time.
为了进一步研究最重要的超参数 $\epsilon_{\mathrm{init}}$ 和所选层的影响,我们在QA编辑任务上进行了更长时间的实验,最多进行了5000次编辑。该实验结果如图17所示。通过更精细的分析,我们发现主要结论依然成立:某些层比其他层更适合编辑,GRACE能够成功泛化到未见过的输入,$\epsilon$ 控制了记忆与泛化之间的权衡,且GRACE码本大小随时间推移保持稳定。当 $\epsilon=6.0$ 以上时,性能保持不变。有趣的是,对于 $\epsilon=2.0$ 的情况,第6层随时间推移表现出高度不稳定性。

Figure 12: Comparing editors over time when editing BERT on SCOTUS.
图 12: 在SCOTUS数据集上编辑BERT模型时,不同编辑器随时间的效果对比。

Figure 13: Comparing editors over time when editing GPT2-XL on Hallucination.
图 13: 在Hallucination任务上编辑GPT2-XL时,不同编辑器随时间变化的对比。

Figure 14: Comparing all methods throughout editing on zsRE.
图 14: zsRE编辑过程中所有方法的对比

Figure 15: Comparing all methods throughout editing on SCOTUS.
图 15: SCOTUS数据集上各编辑方法的全程对比

Figure 16: Comparing all methods throughout editing on Hallucination.
图 16: 各方法在Hallucination编辑任务中的全程对比

Figure 17: Impact of $\epsilon_{\mathrm{init}}$ and block choice for GRACE editing T5 on zsRE for 3000 sequential edits. Along with TRR and ERR, we also measure F1 on a “Holdout” edit set containing unseen rephrasings of all 3k edits. We find that editing blocks 0 and 6 uses more keys and achieves higher TRR, but leads to lower ERR and generalizes worse, as shown by lower “Holdout” values.
图 17: $\epsilon_{\mathrm{init}}$ 和模块选择对GRACE编辑T5在zsRE上进行3000次连续编辑的影响。除了TRR和ERR外,我们还测量了包含所有$3\mathrm{k\Omega}$编辑未见改述的"保留"编辑集的F1值。研究发现,编辑模块0和6使用了更多键并实现了更高的TRR,但导致更低的ERR和更差的泛化能力,如较低的"保留"值所示。
