TRANSFORMER 2: SELF-ADAPTIVE LLMS

ABSTRACT

Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. We introduce Transformer 2, a novel self-adaptation framework that adapts LLMs for unseen tasks in real time by selectively adjusting only the singular components of their weight matrices. During inference, Transformer 2 employs a two-pass mechanism: first, a dispatch system identifies the task properties, and then task-specific “expert” vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. Our method outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency. Transformer 2 demonstrates versatility across different LLM architectures and modalities, including vision-language tasks. Transformer 2 represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems. Our code is available at https://github.com/SakanaAI/self-adaptive-llms

1 INTRODUCTION

Self-adaptive large language models (LLMs) would represent a significant advancement in artificial intelligence, providing a framework where models can adjust to varied tasks and dynamic contexts in real time. While compositionality and scalability are crucial for effective adaptation, current LLM training methodologies fall short of achieving both these properties simultaneously. Our research aims to present a pioneering solution to realize this vision and address these gaps.

Traditionally, LLM post-training has sought to optimize a model for a wide range of capabilities in a single, extensive training session. While this “one-shot” fine-tuning framework is ideal from a simplicity perspective, it is also difficult to achieve in practice.

Figure 1: Overview of Transformer 2. In the training phase, we tune the scales of the singular values of the weight matrices to generate a set of “expert” vectors, each of which specializes in one type of task. In the inference phase, a two-pass process is adopted, where the first pass applies the task-specific expert and the second generates the answer.

For instance, post-training is still highly resource-intensive, leading to significant computational costs and training times. Additionally, there tend to be notable performance trade-offs when introducing additional breadth to the data, making it challenging to overcome overfitting and task interference at the same time.

In contrast, self-adaptive models offer a more flexible and efficient approach. Rather than attempting to train an LLM for all tasks in one step, expert modules can be developed offline and added to the base LLM on demand (Kang et al., 2024). This allows the model to dynamically modify its behavior based on the task at hand, without the need for constant re-tuning. In addition to the benefit of having independent components, this modularity also supports continual learning, enabling the model to add new skills over time without catastrophic forgetting. Moreover, self-adaptive LLMs mirror a well-established principle in neuroscience and computational biology, where the brain activates specific regions depending on the task at hand (Loose et al., 2017) and dynamically reconfigures its functional networks in response to changing task demands (Davison et al., 2015).

In principle, the first step toward achieving self-adaptive LLMs can be realized through the development of specialized expert modules, each fine-tuned (Kaplan et al., 2020) via techniques such as low-rank adaptation (LoRA) (Hu et al., 2021). These expert modules can then be dynamically composed at runtime based on the task demands, a process that can be efficiently managed through Mixture of Experts (MoE)-like systems (Tianlong et al., 2024). However, several challenges need to be addressed to make this approach both scalable and compositional. First, fine-tuning LLMs to create multiple expert modules significantly increases the number of parameters that need to be trained. In practice, even with parameter-efficient methods like LoRA, the cumulative size of these modules can quickly escalate, leading to increased storage and computational demands. Second, these expert modules are often prone to overfitting, a phenomenon especially prevalent when training on smaller datasets or narrow task domains. Third, the flexible composition of these expert modules presents largely unresolved challenges that remain open research problems.

To overcome these limitations, we first propose Singular Value Fine-tuning (SVF), a novel parameter-efficient fine-tuning (PEFT) method to obtain effective building blocks for self-adaptation. SVF works by extracting and tuning only the singular values within the model’s weight matrices. By focusing on this principled parameterization, our approach mitigates the risk of overfitting, drastically reduces computational demands, and allows for inherent compositionality. We show these properties enable us to cheaply obtain a set of effective domain-specific “expert” vectors by training on narrow datasets with RL, directly optimizing task performance on individual topics.

We then introduce our full Transformer 2 framework to empower LLMs through the underlying principles of self-adaptation. Given a prompt from an unknown task, Transformer 2 entails a two-pass inference mechanism which we illustrate in Figure 1. During the first pass, Transformer 2 executes the model and observes its test-time behavior, gathering the relevant information to understand the necessary skills to tackle the current problem. During the second pass, our framework uses this information to combine the available expert vectors and provide a new modification to the base weights of the LLM specifically tailored to its test-time conditions. We design three different adaptation strategies that can be used within Transformer 2, which we show provide monotonic performance benefits with increasing access to the test-time conditions.

We evaluate SVF and the full Transformer 2 framework through extensive experiments across a diverse range of LLMs and tasks. First, when trained on domain-specific datasets, we show that SVF consistently outperforms traditional strategies for efficient fine-tuning such as LoRA while using orders of magnitude fewer parameters. Then we show that Transformer 2 is able to push performance far further, effectively adapting the weights of the base model even in entirely out-of-distribution applications such as visual question answering. Finally, we analyze the properties of our new framework, validating that it provides increasing benefits with additional access to its current test-time conditions and even allows for recycling pre-trained SVF experts across model architectures. In summary, our key technical contributions are the following:

• The development of Transformer 2 as a pivotal self-adaptation framework for LLMs, providing a universal blueprint to dynamically adapt the behavior of LLMs from a growing set of pre-trained skills.

• The introduction of SVF, a novel PEFT method trainable with RL on small datasets, producing compact expert vectors with inherent compositionality, all key properties necessary for our scalable self-adaptation framework.

• The implementation of three adaptation strategies within Transformer 2, effectively dispatching SVF-trained experts with properties designed to cope with different requirements and deployment scenarios.

2 RELATED WORKS

Self-adaptive LLMs We define self-adaptive LLMs as a group of LLMs or a standalone LLM that can evaluate and modify its behavior in response to changes in its operating environment or internal state, without external intervention. This adaptation can be explored from two perspectives: a macroview, where multiple LLMs collaborate and/or compete, and a microview, where internal adaptations allow a single LLM to specialize in different tasks.

Macroview: From this perspective, the system directs queries to LLMs with domain-specific expertise, prioritizing outputs from expert models, thereby achieving higher accuracy and task-specific optimization. Such task-specific ensembles can be realized through various mechanisms: multiple LLMs playing distinct roles and coordinating toward a shared goal (Zhuge et al., 2023), engaging in mutual listening and debate (Du et al., 2023), or using meticulously crafted prompt constructions (Zhang et al., 2024) to integrate knowledge library and skill planning. Naturally, the improvement in the specialization and adaptive capabilities of individual LLMs in the ensemble enhances the collective performance. Thus, in this paper, we focus on the microview of self-adaptive LLMs.

Microview: MoE in LLMs plays a critical role in this perspective (Tianlong et al., 2024). In MoE systems, inputs are dynamically routed to a subset of specialized modules or layers (e.g., MLPs) containing domain-specific knowledge (Rajbhandari et al., 2022; Fedus et al., 2022). To reduce inference time, researchers introduce sparsely activated MoE, where only a subset of the experts is selected per token (Jiang et al., 2024; Qwen Team, 2024). While it is possible to view Transformer 2 loosely as a type of MoE, there are two major differences. First, in the aforementioned systems, self-adaptation is achieved through token-level routing, whereas Transformer 2 employs a sample-level module selection strategy. The second difference lies in the construction of expert modules. In traditional MoE systems, expert modules are either trained from scratch (Fedus et al., 2022; Jiang et al., 2024) or derived from dense models (e.g., via upcycling) (Qwen Team, 2024; Zhu et al., 2024), without an auxiliary loss to ensure module specialization. In contrast, Transformer 2 specifically trains expert vectors with RL to acquire domain-specific knowledge, making them true experts.

Low-rank adaptation PEFT methods such as LoRA (Hu et al., 2021) work by freezing the original model’s parameters and introducing small trainable low-rank matrices for task-specific updates. This significantly lowers the computational and memory costs while providing performance comparable to full fine-tuning. Inspired by LoRA’s design, various modifications have been proposed (Zhang et al., 2023; Kopiczko et al., 2023; Liu et al., 2024; Bałazy et al., 2024; Cetoli, 2024). Transformer 2 does not rely on low-rank matrices, and instead scales the singular vectors of the original parameter matrix that span the full rank space.

SVD for LLM Fine-tuning SVD is increasingly being used as an inductive bias for PEFT in LLMs. For example, Wang et al. (2024) decompose a weight matrix and use the minor singular components, associated with noisy or long-tail information, to initialize low-rank matrices for LoRA fine-tuning. In a similar vein, SVD is employed to approximate an original weight matrix with the top $r$ singular vectors, corresponding to the highest singular values. A small trainable matrix is then introduced on top of the truncated singular value matrix to adjust the magnitude and orientations within this top-$r$ subspace (Bałazy et al., 2024; Cetoli, 2024). However, the drawback of this approach is that retaining only the top singular components can result in the loss of important information, particularly when the singular value distribution is less skewed. The work most similar to ours is a concurrent effort by Lingam et al. (2024), who introduce various sparsification methods that utilize the SVD of the weights. However, it is not designed for self-adaptive LLMs and does not use RL to enhance learning efficiency.

3 METHODS

3.1 PRELIMINARIES

Singular value decomposition (SVD) offers a fundamental view of matrix multiplications. In the context of neural networks, each weight matrix $W\in\mathbb{R}^{n\times m}$ can be decomposed into three components $W=U\Sigma V^{\top}$, yielding semi-orthogonal matrices $U\in\mathbb{R}^{n\times r}$ and $V\in\mathbb{R}^{m\times r}$ together with an ordered vector of $r$ singular values (in descending order) arranged in the diagonal matrix $\Sigma\in\mathbb{R}^{r\times r}$. The linear operation defined by applying $W$ to $x$ can then be decomposed into a sum of independent terms, derived from mapping each column $v_{i}$ of $V$ into the corresponding column $u_{i}$ of $U$ as $y=\sum_{i=1}^{r}\sigma_{i}u_{i}v_{i}^{\top}x$. Hence, each singular component represented by the rank-1 matrix $u_{i}v_{i}^{\top}$ independently processes the input, providing an orthogonal contribution to the layer’s outputs, with the singular values $\sigma_{i}$ modulating the degree of the contributions.
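
The rank-1 view of a linear layer described above can be checked numerically. The following is a minimal sketch using NumPy's thin SVD; the layer shape and the random weights are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 6                          # hypothetical layer shape: W maps R^m to R^n
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)

# Thin SVD: U is (n, r), s holds the r singular values in descending order,
# and Vt is V transposed with shape (r, m), where r = min(n, m).
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Rebuild y = W x as a sum of r independent rank-1 contributions
# sigma_i * u_i * (v_i^T x).
y_sum = sum(s[i] * U[:, i] * (Vt[i] @ x) for i in range(len(s)))

print(np.allclose(W @ x, y_sum))
```

Each term in the sum touches only one singular value, which is the property SVF later exploits by rescaling the $\sigma_{i}$ individually.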

Cross-entropy method (CEM) is a Monte Carlo method for importance sampling and optimization (Rubinstein & Kroese, 2004). The method is based on the concept of minimizing the KL divergence between two probability distributions $D_{\mathrm{KL}}(P\|Q)$, where $P$ is the target distribution and $Q$ is a maintained distribution. At its core, CEM repeatedly generates a set of samples from $Q$, evaluates these samples with a performance function, and then updates the distribution $Q$ with the characteristics of the elite samples that have performed best. In the standard setup employed in most applications, $Q$ is set to a diagonal multivariate Gaussian, reducing the problem to simply estimating the empirical mean and standard deviation of the latest elites until a stopping criterion is met. We illustrate a complete CEM step in the Python pseudocode below.

import numpy as np

def cem_step(mu, sigma, num_elites, num_samples):
    # Draw candidates from the current diagonal Gaussian Q = N(mu, sigma^2).
    samples = np.random.normal(loc=mu, scale=sigma, size=(num_samples, *np.shape(mu)))
    # evaluate() is the task-specific performance function (higher is better).
    scores = evaluate(samples)
    # Keep the num_elites best-scoring samples.
    elites = samples[np.argsort(scores)[-num_elites:]]
    # Refit Q to the elites.
    new_mu = np.mean(elites, axis=0)
    new_sigma = np.std(elites, axis=0)
    return (new_mu, new_sigma)
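
To show how a single CEM step is iterated in practice, here is a self-contained sketch on a toy problem. The objective, the target point, the population sizes, and the extra `evaluate` and `rng` arguments are all assumptions made for illustration; they are not part of the pseudocode above.

```python
import numpy as np

def cem_step(mu, sigma, num_elites, num_samples, evaluate, rng):
    # Same update as the pseudocode, with the performance function and the
    # random generator passed in explicitly (an assumption for this sketch).
    samples = rng.normal(loc=mu, scale=sigma, size=(num_samples, mu.shape[0]))
    scores = evaluate(samples)
    elites = samples[np.argsort(scores)[-num_elites:]]
    return elites.mean(axis=0), elites.std(axis=0)

def evaluate(samples):
    # Toy performance function: higher is better, with the maximum at (3, -1).
    target = np.array([3.0, -1.0])
    return -np.sum((samples - target) ** 2, axis=1)

rng = np.random.default_rng(0)
mu, sigma = np.zeros(2), np.full(2, 2.0)
for _ in range(30):
    mu, sigma = cem_step(mu, sigma, num_elites=10, num_samples=100,
                         evaluate=evaluate, rng=rng)

print(mu, sigma)
```

The loop here stops after a fixed number of iterations; a threshold on `sigma` would be an equally reasonable stopping criterion.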

3.2 TRANSFORMER 2

The construction of Transformer 2 comprises two main steps, for which we provide an illustrative overview in Figure 2. First, we introduce Singular Value Fine-tuning (SVF), a method that uses RL to learn compact and compositional expert vectors based on the SVD of the base model’s weights. Then, we describe three different adaptation strategies within Transformer 2, inspired by three orthogonal principles, which adaptively combine the SVF-trained expert vectors during inference. We motivate how the properties of SVF are highly complementary to our adaptation strategies, making Transformer 2 an effective and scalable framework for the design of new self-adaptive LLMs.

Figure 2: Method overview. Left) At training time, we employ SVF and RL to learn the “expert” vectors $z$ that scale the singular values of the weight matrices. Right) At inference time, we propose three distinct methods to adaptively select/combine the learned expert vectors.

Singular value fine-tuning is a key building block in Transformer 2. It offers an extremely efficient parameterization for fine-tuning and provides inherent compositionality for adaptation. Conventional fine-tuning techniques often aim to augment pre-trained models with new capabilities by modifying their weight matrices. However, in large-scale transformers, these weights are already rich repositories of abstracted knowledge, thanks to the breadth of the pre-training data and expansive architectural design. In fact, as evidenced in much of the prior literature, the requisite capabilities for solving many downstream tasks appear to already exist within these pre-trained models (Sharma et al., 2023). Therefore, instead of seeking to add new features, an efficient fine-tuning approach should focus on making these latent capabilities more expressible. Motivated by these considerations, for any weight matrix $W$, SVF learns a simple vector $z\in\mathbb{R}^{r}$ that provides targeted modifications to each singular component of $W$ independently, yielding a new weight matrix $W^{\prime}=U\Sigma^{\prime}V^{\top}$, where $\Sigma^{\prime}=\Sigma\otimes\mathrm{diag}(z)$. This essential parameterization enjoys several benefits:
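
As a concrete illustration of the reparameterization $W^{\prime}=U\Sigma^{\prime}V^{\top}$, the sketch below rescales the singular values of a small random matrix. It assumes that $\Sigma\otimes\mathrm{diag}(z)$ amounts to multiplying each singular value $\sigma_{i}$ by $z_{i}$; the helper name `apply_svf` and the values in `z_expert` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(W, full_matrices=False)

def apply_svf(U, s, Vt, z):
    """Return W' = U diag(s * z) V^T (hypothetical helper for this sketch)."""
    # Broadcasting scales column i of U by s[i] * z[i].
    return (U * (s * z)) @ Vt

z_identity = np.ones_like(s)            # leaves the base weights untouched
z_expert = np.array([1.2, 1.0, 0.5])    # made-up scales standing in for a learned expert

W_same = apply_svf(U, s, Vt, z_identity)
W_new = apply_svf(U, s, Vt, z_expert)

print(np.allclose(W_same, W))
print(np.linalg.svd(W_new, compute_uv=False))
```

A vector of ones recovers the base weights exactly, so the base model is always one of the reachable configurations.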

Negligible parameters: Learning only a vector $z$ for each weight matrix allows for very efficient fine-tuning with orders of magnitude fewer optimized parameters, even when compared to prior approaches specifically designed for efficiency. For example, the widely popular LoRA approach requires $(m+n)\times r^{\prime}$ learnable parameters per weight matrix, where $r^{\prime}$ is a hyper-parameter that generally needs to be set large enough for expressivity. While recent extensions, such as LoRA-XS (Bałazy et al., 2024), try to push efficiency even further, they often introduce limiting assumptions that curb applicability in several practical scenarios (see examples in Appendix C). In contrast, while SVF only needs $r=\min(m,n)$ parameters, we show empirically that it does not display the same shortcomings, thanks to working in a highly meaningful space provided by the latent expressiveness compressed in the weights of modern LLMs. Although scaling only the singular values may seem to lead to limited expressiveness, we wish to point out that the ability to affect the weight matrix in a full-rank manner technically provides more information than low-rank approaches.
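
The two parameter counts quoted above can be compared directly. This is a small arithmetic sketch; the matrix shape and the LoRA ranks are illustrative assumptions rather than settings reported in the paper.

```python
def lora_params(m, n, r_prime):
    # LoRA trains two low-rank factors, giving (m + n) * r' parameters per matrix.
    return (m + n) * r_prime

def svf_params(m, n):
    # SVF trains one scale per singular value, giving min(m, n) parameters per matrix.
    return min(m, n)

m, n = 4096, 4096                 # hypothetical square projection matrix
for r_prime in (4, 16, 64):       # hypothetical LoRA ranks
    ratio = lora_params(m, n, r_prime) // svf_params(m, n)
    print(r_prime, lora_params(m, n, r_prime), svf_params(m, n), ratio)
```

For a square matrix the ratio is simply $2r^{\prime}$, so the gap widens as the LoRA rank grows.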

High compositionality: Decomposing the weights into independent singular components makes the learned $z$ vectors highly composable and interpretable, opening numerous possibilities for adaptation via algebraic manipulations. In contrast, LoRA-based methods inherently lack these properties. For instance, even if two LoRAs learned on the same task were to learn exactly the same adjustments for each $W$, directly interpolating between their compressed $A$ and $B$ matrices is unlikely to preserve any of their original behavior, given the countless number of equivalent parameter permutations they might have converged to.
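
The permutation argument can be made concrete with a toy example. In the sketch below, two LoRA factorizations encode exactly the same weight update but with their rank dimensions swapped, and averaging their factors no longer reproduces that update. By contrast, the SVF weights are linear in $z$, so interpolating two $z$ vectors interpolates the resulting weights. All shapes and values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 6, 5, 2

# Two LoRAs with the same update B @ A, differing only by a permutation
# of the rank dimension.
B1, A1 = rng.standard_normal((n, r)), rng.standard_normal((r, m))
P = np.array([[0.0, 1.0], [1.0, 0.0]])
B2, A2 = B1 @ P, P.T @ A1

delta1, delta2 = B1 @ A1, B2 @ A2
delta_mix = (0.5 * (B1 + B2)) @ (0.5 * (A1 + A2))   # naive factor interpolation

# SVF: W'(z) = U diag(s * z) V^T is linear in z.
W = rng.standard_normal((n, m))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
svf = lambda z: (U * (s * z)) @ Vt
z1, z2 = rng.uniform(0.5, 1.5, size=len(s)), rng.uniform(0.5, 1.5, size=len(s))
W_mix = svf(0.5 * z1 + 0.5 * z2)

print(np.allclose(delta1, delta2), np.allclose(delta_mix, delta1))
print(np.allclose(W_mix, 0.5 * svf(z1) + 0.5 * svf(z2)))
```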

Principled regularization: Exclusively modifying the magnitude of pre-existing singular components provides a principled and effective form of regularization. In practice, this property enables us to fine-tune for arbitrary downstream tasks with only hundreds of data points without the risk of severe collapse or overfitting.

End-to-end optimization with RL. We train a set of SVF vectors $\theta_{z}=\{z_{1},\cdots,z_{N\times M}\}$ to fine-tune an arbitrary language model $\pi_{\theta_{W}}$ parameterized by $\theta_{W}$ with RL, optimizing directly for task performance. Here, $\theta_{W}=\{W_{1},\cdots,W_{N\times M}\}$ is the set of weight matrices, where $N$ is the number of layers and $M$ is the number of weight matrices to fine-tune per layer. We use the seminal REINFORCE algorithm (Williams, 1992) and label each generated answer $\hat{y}_{i}$ (for the prompt $x_{i}\in D$) with a unitary reward based on its correctness $r\in\{-1,1\}$. Inspired by related applications of RL for optimizing LLMs (Ouyang et al., 2022), we regularize the REINFORCE objective by adding a KL penalty for deviating from the original model’s behavior, weighted by a small coefficient $\lambda\in\mathbb{R}^{+}$. Thus, our final objective function can be written as:

$$J(\theta_{z})=\mathbb{E}\left[\log\left(\pi_{\theta_{W^{\prime}}}(\hat{y}_{i}\mid x_{i})\right)r(\hat{y}_{i},y_{i})\right]-\lambda D_{\mathrm{KL}}(\pi_{\theta_{W^{\prime}}}\|\pi_{\theta_{W}}),$$

where we use $\pi_{\theta_{W^{\prime}}}$ to denote the resulting language model after substituting the original weight matrices $W$ with $W^{\prime}$. While RL is generally considered less stable than next-token prediction objectives, we find the regularization properties of SVF avoid many of the failure modes of prior less-constrained parameterizations (see Section 4.3). Thus, combining these complementary components effectively enables us to avoid relying on expensive fine-tuning procedures with large hand-designed datasets as proxies, and directly maximize task performance end-to-end.
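
To make the objective concrete, the following sketch evaluates a Monte Carlo estimate of $J(\theta_{z})$ from a batch of sampled answers. It only computes the scalar value; in training, the gradient would be taken with respect to the $z$ vectors. The function name, the sample-based KL estimate (mean log-probability gap on answers drawn from the adapted model), and all numbers are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def svf_objective(logp_adapted, logp_base, rewards, lam):
    """Estimate E[log pi_W'(y_hat | x) * r] - lam * KL(pi_W' || pi_W).

    logp_adapted, logp_base: sequence log-probabilities of the sampled answers
    under the adapted and base models. rewards: +1 or -1 per answer.
    """
    reinforce_term = np.mean(logp_adapted * rewards)
    # Sample-based KL estimate, valid when answers are sampled from pi_W'.
    kl_estimate = np.mean(logp_adapted - logp_base)
    return float(reinforce_term - lam * kl_estimate)

logp_adapted = np.array([-2.0, -1.0, -3.0])   # made-up log-probabilities
logp_base = np.array([-2.5, -1.5, -2.5])
rewards = np.array([1.0, -1.0, 1.0])          # correct, incorrect, correct

J = svf_objective(logp_adapted, logp_base, rewards, lam=0.1)
print(J)
```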

In general, SVF with RL places fewer requirements on the dataset it trains on. For example, LoRA fine-tuning requires “explaining texts” to perform next-token prediction, which places higher requirements on the dataset (e.g., imagine LoRA fine-tuning on a GSM8K dataset where no reasoning text but only the final number is provided). This benefit allows SVF to be more general and effective. One possible caveat SVF can face is the sparse rewards caused by a weak base model, which we discuss further in Section 5.

Self-adaptation is a critical mechanism in nature that has established itself as a core guiding principle in modern system design (Klös et al., 2015). Our initial efforts toward self-adaptive foundation models focus on the inference stage of LLMs, where we devise a simple two-pass adaptation strategy that combines $K$ sets of base “expert” vectors $z^{1:K}$ trained with SVF to provide different kinds of capabilities (e.g., coding, math, etc.). The mapping between a capability and the dataset we train on can be acquired from the dataset’s metadata. In the first inference pass, given a task or an individual input prompt, Transformer 2 executes the model and observes its test-time behavior to derive a new $z^{\prime}$ vector tailored to its test-time conditions. This adapted $z^{\prime}$ is then used in the second inference pass to provide an actual response with the newly adapted weights. The interaction between SVF-trained expert vectors and the adaptation strategies ensures seamless integration, where expert vectors provide modular capabilities, and the adaptation strategies dynamically determine and compose the most suitable combination to address the input task. In this first work, we propose three simple approaches to produce the vector $z^{\prime}$ during the first inference pass, implementing self-adaptation with distinct methods and requirements. Below, we provide an outline of each method and refer to Appendix A for additional implementation details.

A) Prompt engineering: Our most basic approach involves constructing a new “adaptation” prompt which we use to directly ask the LLM to categorize the input prompt. Based on its response, we then extract one category out of the set of domain topics used to pre-train each SVF expert and, thus, we select the corresponding $z^{\prime}$ directly from $z^{1:K}$ . In our adaptation prompt, we also explicitly provide the option for a generic “others” category, allowing the model to use its base weights in case no expert provides appropriate capabilities. We show the format used to construct the adaptation prompt in Figure 3.

B) Classification expert: A direct extension of the prompt engineering approach comes from using a specialized system to handle task identification. Following the principles of self-adaptation, we apply SVF to fine-tune the base LLM itself to handle this task. In particular, we collect a dataset $D=\{(x_{1,1},1),\cdots,(x_{i,k},k),\cdots\}$ from the $K$ SVF training tasks, where $x_{i,k}$ is the $i$-th example from the $k$-th expert task. Each tuple $(x_{i,k},k)$ then forms an example to pre-train an additional job classification expert $z^{c}$, learned in the same fashion as the others. During the first inference pass, we simply load $z^{c}$, intending to improve the inherent task classification capabilities of the base model to select a more appropriate $z^{\prime}$ to handle the input prompt.

B) 分类专家：提示工程方法的直接扩展来自于使用专门系统来处理任务识别。遵循自适应的原则，我们应用 SVF 来微调基础大语言模型本身以处理此任务。特别是，我们从 $K$ 个 SVF 训练任务中收集了一个数据集 $D=\{(x_{1,1},1),\cdots,(x_{i,k},k),\cdots\}$，其中 $x_{i,k}$ 是第 $k$ 个专家任务中的第 $i$ 个示例。每个元组 $(x_{i,k},k)$ 然后形成一个示例，用于预训练一个额外的工作分类专家 $z^{c}$，其学习方式与其他专家相同。在第一次推理过程中，我们简单地加载 $z^{c}$，旨在提高基础模型的固有任务分类能力，以选择更合适的 $z^{\prime}$ 来处理输入提示。
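As a rough illustration of how the dataset $D$ for the classification expert could be assembled, the sketch below flattens per-task prompt lists into $(x_{i,k},k)$ pairs. The helper name `build_dispatch_dataset` and the toy prompts are hypothetical and not taken from the authors' code; only the labeling scheme follows the text.

```python
def build_dispatch_dataset(task_prompts):
    """Flatten K per-task prompt lists into (prompt, task_index) pairs.

    task_prompts[k - 1] holds the training prompts of the k-th expert task,
    so the pair (x_{i,k}, k) labels the i-th prompt with its 1-based task id.
    """
    return [
        (prompt, k)
        for k, prompts in enumerate(task_prompts, start=1)
        for prompt in prompts
    ]


# Toy prompts standing in for the K = 3 SVF training tasks (made up).
toy_tasks = [
    ["What is 12 * 7?", "Solve x + 3 = 10."],      # k = 1
    ["Write a function that reverses a list."],    # k = 2
    ["Which gas do plants absorb from the air?"],  # k = 3
]
dataset = build_dispatch_dataset(toy_tasks)
```

Each pair can then serve as one supervised example for training $z^{c}$; how the label $k$ is rendered into a target string is left open here.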

Figure 3: Prompt-based adaptation. Self-adaptation prompt used by Transformer 2 to classify the task prompt into pre-defined categories.

图 3: 基于提示的适应。Transformer 2 使用的自我适应提示将任务提示分类为预定义的类别。
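The selection step of prompt-based adaptation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the category names, the `EXPERT_VECTORS` table, and the `select_expert` helper are hypothetical, and the first-pass reply is passed in as a plain string rather than produced by a real LLM call.

```python
from typing import Optional, Sequence

# Hypothetical expert table: one SVF vector z^k per training domain.
# The three-element lists are placeholders for real expert vectors.
EXPERT_VECTORS: dict[str, Sequence[float]] = {
    "math": [1.10, 0.95, 1.02],
    "code": [0.98, 1.07, 1.01],
    "reasoning": [1.03, 1.00, 0.97],
}


def select_expert(llm_reply: str) -> Optional[Sequence[float]]:
    """Map the first-pass classification reply to an expert vector z'.

    Returns None for "others" (or any reply naming no known category),
    meaning the second pass should run with the unmodified base weights.
    """
    reply = llm_reply.strip().lower()
    for category, z in EXPERT_VECTORS.items():
        if category in reply:
            return z
    return None
```

Returning `None` mirrors the generic "others" option: when no expert matches, the second pass falls back to the base model.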

4 EXPERIMENTS

4 实验

We extensively evaluate Transformer 2 on multiple tasks and models with the purpose of: (1) assessing the efficiency and effectiveness of SVF; (2) demonstrating self-adaptiveness through the three proposed adaptation strategies; (3) conducting in-depth analysis and ablation studies aimed at understanding and interpreting the properties of our new framework.

我们在多个任务和模型上广泛评估了 Transformer 2，目的是：(1) 评估 SVF 的效率和有效性；(2) 通过提出的三种自适应策略展示自适应性；(3) 进行深入分析和消融研究，旨在理解和解释我们新框架的特性。

4.1 EXPERIMENTAL SETUPS

4.1 实验设置

To validate the generality of Transformer 2, we consider three pre-trained LLMs spanning different model families and architecture sizes: LLAMA3-8B-INSTRUCT, MISTRAL-7B-INSTRUCT-V0.3, and LLAMA3-70B-INSTRUCT. For each model, we obtain three sets of SVF-trained $z$ vectors to maximize performance on GSM8K (Cobbe et al., 2021), MBPP-Pro (Austin et al., 2021), and ARC-Easy (Clark et al., 2018), respectively. Additionally, we train a set of $z$ vectors for LLAMA3-8B-INSTRUCT when applied as the language backbone for TextVQA (Singh et al., 2019), in order to assess SVF’s applicability to the vision-language modeling (VLM) domain. We provide SVF’s main learning curves on each of these tasks in Figure 4. Finally, we evaluate the full Transformer 2 adaptation framework on four unseen tasks: MATH (Hendrycks et al., 2021), Humaneval (Chen et al., 2021), ARC-Challenge (Clark et al., 2018), and OKVQA (Marino et al., 2019). In all our adaptation experiments, we only consider experts obtained in the pure-language settings, assessing the framework’s test-time applicability even for the distinctive vision domain. Please refer to Appendix A for additional details and a summary of the hyper-parameters used in the experiments.

为了验证 Transformer 2 的通用性，我们考虑了三种预训练的大语言模型，涵盖不同的模型家族和架构规模：LLAMA3-8B-INSTRUCT、MISTRAL-7B-INSTRUCT-V0.3 和 LLAMA3-70B-INSTRUCT。对于每个模型，我们获得了三组经过 SVF 训练的 $z$ 向量，分别用于最大化 GSM8K (Cobbe et al., 2021)、MBPP-Pro (Austin et al., 2021) 和 ARC-Easy (Clark et al., 2018) 的性能。此外，我们还为 LLAMA3-8B-INSTRUCT 训练了一组 $z$ 向量，当它作为 TextVQA (Singh et al., 2019) 的语言骨干时，用于评估 SVF 在视觉语言建模 (VLM) 领域的适用性。我们在图 4 中提供了 SVF 在这些任务上的主要学习曲线。最后，我们在四个未见过的任务上评估了完整的 Transformer 2 适应框架：MATH (Hendrycks et al., 2021)、Humaneval (Chen et al., 2021)、ARC-Challenge (Clark et al., 2018) 和 OKVQA (Marino et al., 2019)。在我们所有的适应实验中，我们仅考虑在纯语言设置中获得的专家，评估其在测试时对独特视觉领域的适用性。有关实验中使用超参数的更多细节和总结，请参阅附录 A。

Figure 4: SVF learning curves. The dashed lines indicate the performance of LLAMA3-8B-INSTRUCT on the test split of each task. SVF effectively fine-tunes to surpass the base performance. While we use the best validation score to select our checkpoint for evaluation (marked by red dots), we present the entire training curve without early stopping to demonstrate SVF’s learning capabilities. Tasks with only hundreds of training samples like Coding and Reasoning were stopped early. In our experiments, we update the parameters at the end of each epoch.

图 4: SVF 学习曲线。虚线表示 LLAMA3-8B-INSTRUCT 在每个任务测试集上的表现。SVF 通过微调有效超越了基础性能。虽然我们使用最佳验证分数来选择评估的检查点（用红点标记），但我们展示了整个训练曲线，没有提前停止，以展示 SVF 的学习能力。像编程和推理这样只有数百个训练样本的任务提前停止了。在我们的实验中，我们在每个 epoch 结束时更新参数。

4.2 EXPERIMENTAL RESULTS

4.2 实验结果

SVF performance We provide results after training on each considered task with the LLAMA3-8B-INSTRUCT, MISTRAL-7B-INSTRUCT-V0.3, and LLAMA3-70B-INSTRUCT base models in Table 1. Remarkably, we find that SVF provides considerable and consistent performance gains across nearly all tasks and base models. In contrast, LoRA experts yield smaller gains and even sporadic performance degradation. (These LoRA experts are trained with next-token prediction. While we also report LoRA experts trained with RL in Table 4, RL seems to work less well with LoRA than with SVF.) This trend also extends to the vision-language domain, as fine-tuning LLAMA3-LLAVA-NEXT-8B with SVF bolsters the base model’s performance by over $39\%$ (see Figure 5). To ensure a fair comparison, we provide extensive ablations of both our model and the LoRA baseline, considering different architectures and optimization objectives, in Section 4.3. We also note that, owing to its essential parameterization, training SVF requires considerably fewer resources, with less than $10\%$ of the trainable parameters of our LoRA implementation.

SVF 性能
我们在表 1 中提供了使用 LLAMA3-8B-INSTRUCT、MISTRAL-7B-INSTRUCT-V0.3 和 LLAMA3-70B-INSTRUCT 基础模型在每项任务上训练后的结果。值得注意的是，我们发现 SVF 在几乎所有任务和基础模型上都提供了显著且一致的性能提升。相比之下，LoRA 专家带来的提升较小，甚至偶尔会出现性能下降。（这些 LoRA 专家是通过下一个 Token 预测进行训练的。尽管我们在表 4 中也提供了通过强化学习 (RL) 训练的 LoRA 专家，但 RL 在 LoRA 上的表现似乎不如在 SVF 上。）这一观察到的趋势也延伸到视觉-语言领域，使用 SVF 微调 LLAMA3-LLAVA-NEXT-8B 使基础模型的性能提升了超过 $39\%$（见图 5）。为了确保公平比较，我们在第 4.3 节中对我们的模型和 LoRA 基线进行了广泛的消融实验，考虑了不同的架构和优化目标。由于其本质的参数化特性，我们想指出，训练 SVF 所需的资源显著减少，训练参数不到我们 LoRA 实现的 $10\%$。
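To give a feel for why the trainable-parameter count can be so much smaller, the sketch below compares per-matrix counts under two assumptions that are not spelled out in this section: that SVF learns one scale per singular value of an $m \times n$ weight matrix (in line with the abstract's description of adjusting only singular components), and that LoRA learns a rank-$r$ update with $r(m+n)$ parameters. The matrix size and rank are illustrative, not the paper's settings.

```python
def lora_params(m: int, n: int, rank: int) -> int:
    """Trainable parameters of a rank-`rank` LoRA update on an m x n matrix."""
    return rank * (m + n)


def svf_params(m: int, n: int) -> int:
    """Trainable parameters when one scale is learned per singular value."""
    return min(m, n)


# Illustrative sizes only (assumed, not taken from the paper).
m, n, rank = 4096, 4096, 16
ratio = svf_params(m, n) / lora_params(m, n, rank)
print(f"SVF / LoRA trainable parameters per matrix: {ratio:.1%}")
```

For this assumed configuration the ratio is exactly 1/32; the actual figure depends on the LoRA rank and on which modules are adapted.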

Adaptation performance With the SVF-trained $z$ vectors, we assess the self-adaptation capability of Transformer 2 on unseen tasks. For a fair comparison with LoRA, we record the performance of this baseline using all checkpoints from the considered training tasks and report only its highest performance for each of the test tasks. As shown in Table 2, all of our Transformer 2 adaptation strategies demonstrate improvements across all tasks for the LLAMA3-8B-INSTRUCT base model, and in at least two out of three tasks for both MISTRAL-7B-INSTRUCT-V0.3 and LLAMA3-70B-INSTRUCT. In contrast, even the best-performing LoRA checkpoints only provide marginal improvements on the ARC-Challenge task and still significantly deteriorate performance on both MATH and Humaneval. This discrepancy suggests that LoRA’s parameterization and optimization might be particularly sensitive to overfitting, especially when trained with the smaller GSM8K and MBPP-Pro datasets, the tasks that provide information most related to MATH and Humaneval. In Figure 5, we find a similar dichotomy in the OKVQA task, with the performance of the base LLAMA3-LLAVA-NEXT-8B VLM only improving after applying Transformer 2. We note that in this setting, too, Transformer 2 performs self-adaptation using only the expert vectors from GSM8K, MBPP-Pro, and ARC-Easy. Thus, this result further underscores the high flexibility of self-adaptation, which transfers knowledge compressed for purely language-based tasks even to unrelated vision-based problems.

适应性能

使用经过训练的 SVF $z$ 向量，我们评估了 Transformer 2 在未见任务上的自适应能力。为了与 LoRA 进行公平比较，我们记录了该基线在考虑的训练任务中的所有检查点的性能，并仅报告其在每个测试任务中的最高性能。如表 2 所示，我们所有的 Transformer 2 适应策略在 LLAMA3-8B-INSTRUCT 基础模型的所有任务中都表现出改进，并且在 MISTRAL-7B-INSTRUCT-V0.3 和 LLAMA3-70B-INSTRUCT 的至少两个任务中也有所提升。相比之下，即使是表现最好的 LoRA 也只能在 ARC-Challenge 任务上带来微小的改进，并且在 MATH 和 Humaneval 上的性能仍然显著下降。这种差异表明，LoRA 的参数化和优化可能对过拟合特别敏感，尤其是在使用较小的 GSM8K 和 MBPP-Pro 数据集进行训练时，这些任务提供的信息与 MATH 和 Humaneval 最为相关。在图 5 中，我们在 OKVQA 任务中发现了类似的两极分化现象，基础 LLAMA3-LLAVA-NEXT-8B VLM 的性能仅在应用 Transformer 2 后才有所提升。我们注意到，在这种设置下，Transformer 2 仅从 GSM8K、MBPP-Pro 和 ARC-Easy 的专家向量中进行自适应性调整。因此，这一结果进一步强调了自适应性调整的高度灵活性，即使对于与语言无关的基于视觉的问题，也能传递为完全基于语言的任务压缩的知识。

Table 1: Fine-tuning results. LLM performance on the test splits of math, coding and reasoning. Normalized scores are in parentheses.

表 1: 微调结果。大语言模型在数学、编码和推理测试集上的表现。括号内为标准化分数。

Method	GSM8K	MBPP-Pro	ARC-Easy
LLAMA3-8B-INSTRUCT	75.89 (1.00)	64.65 (1.00)	88.59 (1.00)
+ LoRA	77.18 (1.02)	67.68 (1.05)	88.97 (1.00)
+ SVF (Ours)	79.15 (1.04)	66.67 (1.03)	89.56 (1.01)
MISTRAL-7B-INSTRUCT-V0.3	42.83 (1.00)	49.50 (1.00)	81.65 (1.00)
+ LoRA	44.66 (1.04)	51.52 (1.04)	81.19 (0.98)
+ SVF (Ours)	49.74 (1.16)	51.52 (1.04)	85.14 (1.04)
LLAMA3-70B-INSTRUCT	85.29 (1.00)	80.81 (1.00)	89.10 (1.00)
+ LoRA	77.26 (0.91)	68.69 (0.85)	88.55 (0.99)
+ SVF (Ours)	88.32 (1.04)	80.81 (1.00)	88.47 (0.99)

Table 2: Self-adaptation on unseen tasks. Normalized scores are in parentheses.

表 2: 在未见任务上的自适应。括号内为归一化分数。

Method	MATH	Humaneval	ARC-Challenge
LLAMA3-8B-INSTRUCT	24.54 (1.00)	60.98 (1.00)	80.63 (1.00)
+ LoRA	24.12 (0.98)	52.44 (0.86)	81.06 (1.01)
+ Transformer 2 (Prompt)	25.22 (1.03)	61.59 (1.01)	81.74 (1.01)
+ Transformer 2 (Cls-expert)	25.18 (1.03)	62.80 (1.03)	81.37 (1.01)
+ Transformer 2 (Few-shot)	25.47 (1.04)	62.99 (1.03)	82.61 (1.02)
MISTRAL-7B-INSTRUCT-V0.3	13.02 (1.00)	43.29 (1.00)	71.76 (1.00)
+ LoRA	13.16 (1.01)	37.80 (0.87)	75.77 (1.06)
+ Transformer 2 (Prompt)	11.86 (0.91)	43.90 (1.01)	72.35 (1.01)
+ Transformer 2 (Cls-expert)	11.60 (0.89)	43.90 (1.01)	74.83 (1.04)
+ Transformer 2 (Few-shot)	13.39 (1.03)	47.40 (1.09)	75.47 (1.05)
LLAMA3-70B-INSTRUCT	40.64 (1.00)	78.66 (1.00)	87.63 (1.00)
+ LoRA	25.40 (0.62)	73.78 (0.94)	83.70 (0.96)
+ Transformer 2 (Prompt)	40.44 (1.00)	79.88 (1.02)	88.48 (1.01)
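The normalized scores reported in parentheses in Tables 1 and 2 appear to be each method's score divided by the corresponding base model's score, since every base row reads 1.00. The small sketch below works through a few entries under that assumption; the helper name `normalized` is ours.

```python
def normalized(score: float, base: float) -> float:
    """Score relative to the base model, rounded to two decimals."""
    return round(score / base, 2)


# A few entries from Tables 1 and 2: (method score, base score).
svf_gsm8k_llama8b = normalized(79.15, 75.89)
svf_gsm8k_mistral = normalized(49.74, 42.83)
lora_gsm8k_llama70b = normalized(77.26, 85.29)
lora_humaneval_llama8b = normalized(52.44, 60.98)
```

Values below 1.00, such as the LoRA entries above, indicate that the adapted model scores lower than its base model on that task.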

Comparing the three proposed adaptation strategies, we highlight a clear monotonic trend: with more involved strategies and additional information about the test-time condition, self-adaptation appears to be increasingly effective. In particular, Transformer 2 with few-shot self-adaptation is almost always the highest-scoring method, providing notable improvements across all tested settings except for LLAMA3-70B-INSTRUCT on MATH, where we have only SVF-tuned half of the layers due to our limited GPU resources. This trend shows that providing additional or different kinds of information seems to be highly beneficial to our framework, suggesting that Transformer 2 could provide foundation models with new means to continually improve performance when deployed in lifelong settings.

在比较这三种提出的适应策略时，我们观察到一个明显的单调趋势：随着策略的复杂性和测试条件信息的增加，自我适应似乎变得越来越有效。特别是，使用少样本自我适应的 Transformer 2 几乎总是得分最高的方法，在所有测试设置中均提供了显著的改进，除了 LLAMA3-70B-INSTRUCT 在 MATH 上的情况，由于我们有限的 GPU 资源，我们只对一半的层进行了 SVF 调优。这一趋势表明，提供额外或不同类型的信息似乎对我们的框架非常有益，这表明 Transformer 2 可以为基础模型提供新的手段，在终身部署环境中持续提升性能。

Table 3 reports the inference time required by the prompt adaptation strategy of Transformer 2, with the time spent on solving the entire problem set presented separately for the 1st and 2nd passes. The 2nd-pass inference time is the time spent on solving the problems, and the 1st-pass inference time is the time spent on self-adaptation; 1st-to-2nd-pass inference time ratios are given in parentheses. While the additional inference pass might appear to double the overall runtime, it is important to note that inference time primarily depends on the number of tokens generated. In our settings, the cost of the 1st pass is $\mathcal{O}(n)$, where $n$ is the length of the input. ARC-Challenge’s cost ratio is large because its problems are single-choice questions, so the cost of the 2nd pass is also $\mathcal{O}(n)$. In general settings, we think it is reasonable to assume this ratio to be closer to those of MATH and Humaneval. For a detailed discussion on improving the efficiency of CEM few-shot adaptation methods, please see Appendix D.

表 3 报告了 Transformer 2 的提示适应策略所需的推理时间，其中分别列出了第一次和第二次通过解决整个问题集所花费的时间。需要注意的是，第二次通过的推理时间是解决问题所花费的时间，而第一次通过的推理时间是自适应所花费的时间，第一次到第二次通过的推理时间比例在括号中给出。虽然额外的推理通过可能会使整体运行时间翻倍，但需要注意的是，推理时间主要取决于生成的 token 数量。在我们的设置中，它是 $\mathcal{O}(n)$，其中 $n$ 是输入的长度。ARC-challenge 的成本比例较大，因为它们是单选题，因此第二次通过的成本也是 $O(n)$。在一般情况下，我们认为假设这个比例更接近 MATH 和 Humaneval 的比例是合理的。有关提高 CEM 少样本适应方法效率的详细讨论，请参见附录 D。

Table 3: Time cost of two-pass inference with the prompt adaptation strategy of Transformer 2 over the entire problem set. 1st-to-2nd-pass inference time ratios are shown in parentheses.

表 3: Transformer 2 的提示适应策略在整个问题集上的两轮推理时间成本。括号中显示了第一轮到第二轮推理时间的比例。

Task	1st pass (s)	2nd pass (s)
MATH	42.64 (13%)	321.19
Humaneval	2.76 (19%)	14.28
ARC-Challenge	13.40 (47%)	28.51
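The ratios in Table 3 can be reproduced directly from the two timing columns. The sketch below does so and also prints the total two-pass time; the timings are copied from the table, while the helper name `pass_ratio` is ours.

```python
# (1st-pass seconds, 2nd-pass seconds) copied from Table 3.
TIMINGS = {
    "MATH": (42.64, 321.19),
    "Humaneval": (2.76, 14.28),
    "ARC-Challenge": (13.40, 28.51),
}


def pass_ratio(first: float, second: float) -> float:
    """1st-pass time expressed as a fraction of the 2nd-pass time."""
    return first / second


for task, (first, second) in TIMINGS.items():
    print(f"{task}: ratio {pass_ratio(first, second):.0%}, "
          f"total {first + second:.2f} s")
```

Because the ratio is taken relative to the 2nd pass, a value of 13% means that self-adaptation adds 13% on top of the time spent solving the problems, rather than doubling the runtime.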

4.3 ANALYSIS

4.3 分析

Lastly, we analyze and discuss the properties of our adaptation strategies, for which we provide extensions and further discussion in Appendix B.

最后，我们分析并讨论了适应策略的特性，并在附录 B 中提供了扩展和进一步的讨论。

Analysis 1: Job dispatching accuracy In Figure 6 we provide the confusion matrices of our classification-based adaptation strategies. These results validate the effectiveness of both our classification-based adaptation strategies in matching each prompt with experts trained in similar domains, as evidenced by the high values along the diagonals. Furthermore, the results from LLAMA3-8B-INSTRUCT and MISTRAL-7B-INSTRUCT-V0.3 also show that using the classification expert consistently provides higher classification accuracy than vanilla prompt engineering. While this difference could explain the higher performance of the corresponding self-adaptation strategy, we also note that domain similarity might not be the only metric relevant to identifying the best expert for each prompt or task. To this end, we believe many unexplored extensions could be pursued in future work, using heuristics such as past expert performance or token-level analysis to further push our framework’s scalability.

分析 1: 任务调度准确性
在图 6 中，我们提供了基于分类的适应策略的混淆矩阵。这些结果验证了我们的基于分类的适应策略在将每个提示与受过类似领域训练的专家匹配方面的有效性，对角线上的高值证明了这一点。此外，LLAMA3-8B-INSTRUCT 和 MISTRAL-7B-INSTRUCT-V0.3 的结果也表明，使用分类专家始终比普通的提示工程提供更高的分类准确率。虽然这种差异可以解释相对自适应策略的更高性能，但我们也注意到，领域相似性可能不是识别每个提示或任务的最佳专家的唯一指标。为此，我们认为未来的工作中可以探索许多未开发的扩展，例如使用过去专家的表现或Token级别的分析等启发式方法，以进一步推动我们框架的可扩展性。

Figure 6: Confusion matrices. These matrices display the classification percentages, where rows represent the task classes (ground truth) and columns indicate the predicted categories. Some samples are misclassified as “Others,” which is reflected in rows where the totals do not sum to one.

图 6: 混淆矩阵。这些矩阵展示了分类的百分比，其中行代表任务类别（真实标签），列表示预测类别。一些样本被错误分类为“其他”，这反映在行总和不为1的情况中。
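A row-normalized confusion matrix of the kind described for Figure 6 can be computed as in the sketch below, which also shows why rows may sum to less than one when some prompts are dispatched to “Others”. The class names and the toy predictions are made up for illustration; they are not the paper's data.

```python
from collections import Counter


def confusion_rows(pairs, classes):
    """Row-normalized confusion matrix restricted to the expert classes.

    `pairs` holds (true_class, predicted_class) tuples. Predictions outside
    `classes` (e.g. "others") count toward the row total but get no column,
    so a row can sum to less than one.
    """
    counts = Counter(pairs)
    totals = Counter(true for true, _ in pairs)
    return {
        true: [counts[(true, pred)] / totals[true] for pred in classes]
        for true in classes
        if totals[true] > 0
    }


classes = ["math", "code", "reasoning"]
pairs = [
    ("math", "math"), ("math", "math"),
    ("math", "reasoning"), ("math", "others"),
    ("code", "code"), ("code", "code"),
    ("reasoning", "reasoning"), ("reasoning", "math"),
]
matrix = confusion_rows(pairs, classes)
```

In this toy example one of the four math prompts is sent to “others”, so the math row sums to 0.75 instead of 1.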

Analysis 2: Training tasks adaptation contribution In Figure 7, we show the normalized adaptive coefficients $a_{k}$ interpolating between our SVF vectors learned via CEM for LLAMA3-8B-INSTRUCT and MISTRAL-7B-INSTRUCT-V0.3 across all the unseen downstream tasks. Intuitively, we find that the expert vectors from the training tasks sharing similar topics to the unseen ones are often the highest contributors to the produced adaptive weights. However, we observe that the MATH task appears as an interesting exception, as the $a_{k}$ for the expert obtained from GSM8K training is actually the lowest out of the three in both models. We hypothesize this reflects the different nature of the mathematics competition problems from MATH as compared to the grade-school problems in GSM8K. In fact, not only is the difficulty of the MATH questions far beyond GSM8K, but a large portion of its problems also hinges mainly on logical reasoning, for which a task like ARC might actually be more aligned. Furthermore, we also note that the different $z$ vectors appear to contribute more uniformly to adaptation in the Llama model. This difference might indicate that, due to its higher base performance, the Llama model does not need to rely on any particular set of skills as much as Mistral, and can harness more holistic benefits from self-adaptation. Note that applying $a_{k}$ uniformly is not a universal solution for leveraging expert vectors. This becomes evident when we look at different model and task combinations (e.g., applying $a_{k}$ uniformly on LLAMA3-8B-INSTRUCT for MATH tasks only achieves 24.47, while Transformer 2 (Few-shot) achieves 25.47).

分析 2：训练任务适应贡献

在图 7 中，我们展示了归一化的自适应系数 $a_{k}$，这些系数在所有未见过的下游任务中，对通过 CEM 学习的 LLAMA3-8B-INSTRUCT 和 MISTRAL-7B-INSTRUCT-V0.3 的 SVF 向量进行插值。直观上，我们发现来自训练任务的专家向量，如果与未见任务共享相似主题，通常对生成的自适应权重贡献最大。然而，我们观察到 MATH 任务是一个有趣的例外，因为在两个模型中，从 GSM8K 训练中获得的专家 $a_{k}$ 实际上是三个中最低的。我们假设这反映了 MATH 中的数学竞赛问题与 GSM8K 中的小学问题的不同性质。事实上，MATH 问题的难度远远超过 GSM8K，而且其大部分问题主要依赖于逻辑推理，因此像 ARC 这样的任务可能更符合要求。此外，我们还注意到，不同的 $z$ 向量在 Llama 模型中的适应贡献似乎更加均匀。这种差异可能表明，由于其更高的基础性能，Llama 模型不需要像 Mistral 那样依赖任何特定的技能集，并且可以从自我适应中获得更全面的好处。需要注意的是，均匀应用 $a_{k}$ 并不是利用专家向量的通用解决方案。当我们查看不同的模型和任务组合时，这一点变得显而易见（例如，在 LLAMA3-8B-INSTRUCT 上均匀应用 $a_{k}$ 仅达到 24.47，而 Transformer 2 (少样本) 达到 25.47）。
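To make the role of the coefficients $a_{k}$ concrete, the sketch below mixes expert vectors as $z^{\prime}=\sum_{k} a_{k} z^{k}$ and searches for the coefficients with a generic cross-entropy method (CEM). Several things here are assumptions for illustration only: the linear form of the interpolation, the toy two-dimensional expert vectors, the `toy_score` objective (a stand-in for few-shot task performance), and all hyper-parameters. It is not the authors' implementation.

```python
import random


def mix_experts(coeffs, experts):
    """Return z' = sum_k a_k * z^k for equally sized expert vectors."""
    return [sum(a * z[i] for a, z in zip(coeffs, experts))
            for i in range(len(experts[0]))]


def cem_search(score, num_experts, iters=30, pop=64, elite=8, seed=0):
    """Generic cross-entropy method over the coefficients a_1..a_K.

    `score` maps a coefficient list to a number to maximize; in practice it
    would measure few-shot performance, here it is any Python callable.
    """
    rng = random.Random(seed)
    mean = [1.0 / num_experts] * num_experts  # start from uniform mixing
    std = [0.5] * num_experts
    for _ in range(iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(pop)]
        best = sorted(samples, key=score, reverse=True)[:elite]
        # Refit the sampling distribution to the elite samples.
        mean = [sum(col) / elite for col in zip(*best)]
        std = [max(1e-3, (sum((x - m) ** 2 for x in col) / elite) ** 0.5)
               for col, m in zip(zip(*best), mean)]
    return mean


# Toy setup: three 2-dimensional "expert" vectors and a made-up target.
experts = [[1.2, 0.9], [0.8, 1.1], [1.0, 1.0]]
target = mix_experts([0.1, 0.3, 0.6], experts)


def toy_score(coeffs):
    """Negative squared distance between the mixed vector and the target."""
    mixed = mix_experts(coeffs, experts)
    return -sum((a - b) ** 2 for a, b in zip(mixed, target))


learned = cem_search(toy_score, num_experts=3)
uniform = [1 / 3] * 3
print("uniform score:", toy_score(uniform))
print("learned score:", toy_score(learned))
```

In this toy problem the searched coefficients should score higher than the uniform ones, which parallels the observation above that uniform $a_{k}$ is not a universal solution.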

Analysis 3: Ablation studies

分析 3: 消融研究

Module sensitivity: We first compare the performance of SVF when it is applied to different modules (see trials 1-3). Under consistent conditions, both individual MLP and attention updates improve performance, with MLP updates resulting in more pronounced gains. Simultaneous updates to both module types yield even more significant enhancements.

模块敏感性：我们首先比较了 SV

[论文翻译]TRANSFORMER 2: 自适应大语言模型

原文地址：https://arxiv.org/pdf/2501.06252
