[论文翻译]TRANSFORMER 2: 自适应大语言模型


原文地址:https://arxiv.org/pdf/2501.06252


TRANSFORMER 2: SELF-ADAPTIVE LLMS

TRANSFORMER 2: 自适应大语言模型

ABSTRACT

摘要

Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. We introduce Transformer 2, a novel self-adaptation framework that adapts LLMs for unseen tasks in real time by selectively adjusting only the singular components of their weight matrices. During inference, Transformer 2 employs a two-pass mechanism: first, a dispatch system identifies the task properties, and then task-specific “expert” vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. Our method outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency. Transformer 2 demonstrates versatility across different LLM architectures and modalities, including vision-language tasks. Transformer 2 represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems. Our code is available at https://github.com/SakanaAI/self-adaptive-llms

自适应大语言模型 (LLMs) 旨在解决传统微调方法带来的挑战,这些方法通常计算密集且在处理多样化任务时能力静态。我们引入了 Transformer 2,这是一种新颖的自适应框架,通过选择性地仅调整权重矩阵的奇异分量,实时适应未见过的任务。在推理过程中,Transformer 2 采用了一种两阶段机制:首先,调度系统识别任务属性,然后使用强化学习训练的任务特定“专家”向量动态混合,以获得针对输入提示的定向行为。我们的方法在参数更少、效率更高的情况下,优于 LoRA 等普遍方法。Transformer 2 展示了在不同 LLM 架构和模态(包括视觉-语言任务)中的多功能性。Transformer 2 代表了一个重大飞跃,为增强 LLM 的适应性和任务特定性能提供了可扩展、高效的解决方案,为真正动态、自组织的 AI 系统铺平了道路。我们的代码可在 https://github.com/SakanaAI/self-adaptive-llms 获取。

1 INTRODUCTION

1 引言

Self-adaptive large language models (LLMs) would represent a significant advancement in artificial intelligence, providing a framework where models can adjust to varied tasks and dynamic contexts in real time. While compositionality and scalability are crucial for effective adaptation, current LLM training methodologies fall short of achieving both these properties simultaneously. Our research aims to present a pioneering solution to realize this vision and address these gaps.

自适应大语言模型 (LLMs) 将是人工智能领域的一项重大进步,它提供了一个框架,使模型能够实时适应各种任务和动态环境。虽然组合性和可扩展性对于有效适应至关重要,但当前的大语言模型训练方法无法同时实现这两个特性。我们的研究旨在提出一种开创性的解决方案,以实现这一愿景并解决这些差距。

Traditionally, LLM post-training has sought to optimize a model for a wide range of capabilities in a single, extensive training session. While this “one-shot” fine-tuning framework is ideal from a simplicity perspective, it is also difficult to achieve in practice.

传统上,大语言模型的后训练旨在通过一次广泛的训练会话优化模型的多种能力。虽然从简单性的角度来看,这种“一次性”微调框架是理想的,但在实践中却难以实现。


Figure 1: Overview of Transformer 2. In the training phase, we tune the scales of the singular values of the weight matrices to generate a set of “expert” vectors, each of which specializes in one type of task. In the inference phase, a two-pass process is adopted where the first applies the task-specific expert and the second generates the answer.

图 1: Transformer 2 概述。在训练阶段,我们调整权重矩阵的奇异值比例,生成一组“专家”向量,每个向量专门处理一种任务类型。在推理阶段,采用两阶段过程,第一阶段应用任务特定的专家,第二阶段生成答案。

For instance, post-training is still highly resource-intensive, leading to significant computational costs and training times. Additionally, there tend to be notable performance trade-offs when introducing additional breadth to the data, making it challenging to overcome overfitting and task interference at the same time.

例如,训练后阶段仍然高度资源密集,导致显著的计算成本和训练时间。此外,在增加数据广度时,往往会出现显著的性能权衡,这使得同时克服过拟合和任务干扰变得具有挑战性。

In contrast, self-adaptive models offer a more flexible and efficient approach. Rather than attempting to train an LLM for all tasks in one step, expert modules can be developed offline and augmented to the base LLM on demand (Kang et al., 2024). This allows the model to dynamically modify its behavior based on the task at hand, without the need for constant re-tuning. In addition to the benefit of having independent components, this modularity also supports continual learning, enabling the model to add new skills over time without catastrophic forgetting. Moreover, self-adaptive LLMs mirror a well-established principle in neuroscience and computational biology, where the brain activates specific regions depending on the task at hand (Loose et al., 2017) and dynamically reconfigures its functional networks in response to changing task demands (Davison et al., 2015).

相比之下,自适应性模型提供了一种更为灵活和高效的方法。与其试图一步到位地训练一个大语言模型来完成所有任务,不如离线开发专家模块,并根据需求将其增强到基础大语言模型中 (Kang et al., 2024)。这使得模型能够根据当前任务动态调整其行为,而无需不断重新调优。除了拥有独立组件的好处外,这种模块化还支持持续学习,使模型能够随着时间的推移添加新技能,而不会发生灾难性遗忘。此外,自适应性大语言模型反映了神经科学和计算生物学中的一个既定原则,即大脑会根据当前任务激活特定区域 (Loose et al., 2017),并动态重新配置其功能网络以应对不断变化的任务需求 (Davison et al., 2015)。

In principle, the first step toward achieving self-adaptive LLMs can be realized through the development of specialized expert modules, each fine-tuned (Kaplan et al., 2020) via techniques such as low-rank adaptation (LoRA) (Hu et al., 2021). These expert modules can then be dynamically composed at runtime based on the task demands, a process that can be efficiently managed through Mixture of Experts (MoE)-like systems (Tianlong et al., 2024). However, several challenges need to be addressed to make this approach both scalable and compositional. First, fine-tuning LLMs to create multiple expert modules significantly increases the number of parameters that need to be trained. In practice, even with parameter-efficient methods like LoRA, the cumulative size of these modules can quickly escalate, leading to increased storage and computational demands. Second, these expert modules are often prone to overfitting, a phenomenon especially prevalent when training on smaller datasets or narrow task domains. Third, the flexible composition of these expert modules presents challenges that remain largely unresolved and are still open research problems.

原则上,实现自适应大语言模型的第一步可以通过开发专门的专家模块来实现,每个模块通过低秩适应(LoRA)等技术进行微调(Kaplan et al., 2020)。这些专家模块可以根据任务需求在运行时动态组合,这一过程可以通过类似专家混合(MoE)系统高效管理(Tianlong et al., 2024)。然而,要使这种方法既具有可扩展性又具有组合性,还需要解决几个挑战。首先,微调大语言模型以创建多个专家模块会显著增加需要训练的参数数量。在实践中,即使使用像LoRA这样的参数高效方法,这些模块的累积大小也会迅速增加,导致存储和计算需求增加。其次,这些专家模块往往容易过拟合,这种现象在较小的数据集或狭窄的任务领域上训练时尤为普遍。第三,这些专家模块的灵活组合也带来了许多尚未解决的挑战,目前仍是开放的研究问题。

To overcome these limitations, we first propose Singular Value Fine-tuning (SVF), a novel parameter-efficient fine-tuning (PEFT) method to obtain effective building blocks for self-adaptation. SVF works by extracting and tuning only the singular values within the model’s weight matrices. By focusing on this principled parameterization, our approach mitigates the risk of overfitting, drastically reduces computational demands, and allows for inherent compositionality. We show these properties enable us to cheaply obtain a set of effective domain-specific “expert” vectors by training on narrow datasets with RL, directly optimizing task performance on individual topics.

为了克服这些限制,我们首先提出了奇异值微调 (SVF),这是一种新颖的参数高效微调 (PEFT) 方法,用于获取自适应的有效构建块。SVF 通过仅提取和调整模型权重矩阵中的奇异值来工作。通过专注于这种原则性的参数化,我们的方法减轻了过拟合的风险,大幅减少了计算需求,并允许固有的组合性。我们展示了这些特性使我们能够通过在窄数据集上使用强化学习 (RL) 进行训练,直接优化单个主题的任务性能,从而廉价地获得一组有效的领域特定“专家”向量。

We then introduce our full Transformer 2 framework to empower LLMs through the underlying principles of self-adaptation. Given a prompt from an unknown task, Transformer 2 entails a two-pass inference mechanism which we illustrate in Figure 1. During the first pass, Transformer 2 executes the model and observes its test-time behavior, gathering the relevant information to understand the necessary skills to tackle the current problem. During the second pass, our framework uses this information to combine the available expert vectors and provide a new modification to the base weights of the LLM specifically tailored to its test-time conditions. We design three different adaptation strategies that can be used within Transformer 2, which we show provide monotonic performance benefits with increasing access to the test-time conditions.

我们随后介绍了完整的 Transformer 2 框架,通过自适应的基本原理来增强大语言模型的能力。给定一个未知任务的提示,Transformer 2 采用了一种两阶段推理机制,如图 1 所示。在第一阶段,Transformer 2 执行模型并观察其在测试时的行为,收集相关信息以理解解决当前问题所需的技能。在第二阶段,我们的框架利用这些信息结合可用的专家向量,并对大语言模型的基础权重进行新的调整,专门针对其测试时的条件进行定制。我们设计了三种不同的自适应策略,可以在 Transformer 2 中使用,这些策略在增加对测试条件访问的情况下提供了单调的性能提升。

We evaluate SVF and the full Transformer 2 framework through extensive experiments across a diverse range of LLMs and tasks. First, when trained on domain-specific datasets, we show that SVF consistently outperforms traditional strategies for efficient fine-tuning such as LoRA while using orders of magnitude fewer parameters. Then we show that Transformer 2 is able to push performance far further, effectively adapting the weights of the base model even in entirely out-of-distribution applications such as visual question answering. Finally, we analyze the properties of our new framework, validating that it provides increasing benefits with additional access to its current test-time conditions and even allows for recycling pre-trained SVF experts across model architectures. In summary, our key technical contributions are the following:

我们通过在各种大语言模型和任务上进行广泛的实验,评估了 SVF 和完整的 Transformer 2 框架。首先,当在特定领域的数据集上进行训练时,我们展示了 SVF 始终优于传统的微调策略,如 LoRA,同时参数数量减少了几个数量级。然后,我们展示了 Transformer 2 能够将性能推得更远,即使在完全超出分布的应用(如视觉问答)中也能有效调整基础模型的权重。最后,我们分析了新框架的特性,验证了它在增加对当前测试条件访问的情况下提供了越来越多的好处,甚至允许跨模型架构回收预训练的 SVF 专家。总之,我们的关键技术贡献如下:

• The development of Transformer 2 as a pivotal self-adaptation framework for LLMs, providing a universal blueprint to dynamically adapt the behavior of LLMs from a growing set of pre-trained skills.

• Transformer 2 的开发作为大语言模型 (LLM) 的关键自适应性框架,提供了一个通用蓝图,能够从不断增长的预训练技能集中动态调整大语言模型的行为。

• The introduction of SVF, a novel PEFT method trainable with RL on small datasets, producing compact expert vectors with inherent compositionality, all key properties necessary for our scalable self-adaptation framework.

• 引入SVF,这是一种新颖的参数高效微调(PEFT)方法,可在小数据集上使用强化学习(RL)进行训练,生成具有固有组合性的紧凑专家向量,这些特性是我们可扩展的自适应框架所必需的关键属性。

• The implementation of three adaptation strategies within Transformer 2, effectively dispatching SVF-trained experts with properties designed to cope with different requirements and deployment scenarios.

• 在 Transformer 2 中实现了三种适应策略,有效地调度了经过 SVF 训练的专家,这些专家具备应对不同需求和部署场景的特性。

2 RELATED WORKS

2 相关工作

Self-adaptive LLMs. We define self-adaptive LLMs as a group of LLMs or a standalone LLM that can evaluate and modify its behavior in response to changes in its operating environment or internal state, without external intervention. This adaptation can be explored from two perspectives: a macroview, where multiple LLMs collaborate and/or compete, and a microview, where internal adaptations allow a single LLM to specialize in different tasks.

自适应大语言模型
我们将自适应大语言模型定义为一组大语言模型或一个独立的大语言模型,能够在没有外部干预的情况下,根据其操作环境或内部状态的变化评估并修改其行为。这种适应性可以从两个角度探讨:宏观视角下,多个大语言模型协作和/或竞争;微观视角下,内部适应性使单个大语言模型能够专注于不同的任务。

Macroview: From this perspective, the system directs queries to LLMs with domain-specific expertise, prioritizing outputs from expert models, thereby achieving higher accuracy and task-specific optimization. Such task-specific ensembles can be realized through various mechanisms: multiple LLMs playing distinct roles and coordinating toward a shared goal (Zhuge et al., 2023), engaging in mutual listening and debate (Du et al., 2023), or using meticulously crafted prompt constructions (Zhang et al., 2024) to integrate knowledge library and skill planning. Naturally, the improvement in the specialization and adaptive capabilities of individual LLMs in the ensemble enhances the collective performance. Thus, in this paper, we focus on the microview of self-adaptive LLMs.

宏观视角:从这一视角出发,系统将查询定向至具有特定领域专长的大语言模型,优先考虑专家模型的输出,从而实现更高的准确性和任务特定优化。此类任务特定的集成可以通过多种机制实现:多个大语言模型扮演不同角色并协同达成共同目标 (Zhuge et al., 2023),进行相互倾听与辩论 (Du et al., 2023),或使用精心设计的提示结构 (Zhang et al., 2024) 来整合知识库和技能规划。自然,集成中单个大语言模型的专业化和自适应能力的提升,会增强整体性能。因此,在本文中,我们聚焦于自适应大语言模型的微观视角。

Microview: MoE in LLMs plays a critical role in this perspective (Tianlong et al., 2024). In MoE systems, inputs are dynamically routed to a subset of specialized modules or layers (e.g., MLPs) containing domain-specific knowledge (Rajbhandari et al., 2022; Fedus et al., 2022). To reduce inference time, researchers introduce sparsely activated MoE where only a subset of the experts is selected per token (Jiang et al., 2024; Qwen Team, 2024). While it is possible to view Transformer 2 loosely as a type of MoE, there are two major differences. First, in the aforementioned systems, self-adaptation is achieved through token-level routing, whereas Transformer 2 employs a sample-level module selection strategy. The second difference lies in the construction of expert modules. In traditional MoE systems, expert modules are either trained from scratch (Fedus et al., 2022; Jiang et al., 2024) or derived from dense models (e.g., via upcycling) (Qwen Team, 2024; Zhu et al., 2024), without an auxiliary loss to ensure module specialization. In contrast, Transformer 2 specifically trains expert vectors with RL to acquire domain-specific knowledge, making them true experts.

微观视角:大语言模型中的 MoE 在这一视角中扮演着关键角色 (Tianlong 等, 2024)。在 MoE 系统中,输入被动态路由到包含领域特定知识的专门模块或层(例如 MLPs)的子集 (Rajbhandari 等, 2022; Fedus 等, 2022)。为了减少推理时间,研究人员引入了稀疏激活的 MoE,其中每个 token 只选择一部分专家 (Jiang 等, 2024; Qwen Team, 2024)。虽然可以将 Transformer 2 大致视为一种 MoE,但存在两个主要区别。在上述系统中,自适应是通过 token 级别的路由实现的,而 Transformer 2 采用了样本级别的模块选择策略。第二个区别在于专家模块的构建。在传统的 MoE 系统中,专家模块要么是从头开始训练的 (Fedus 等, 2022; Jiang 等, 2024),要么是基于密集模型(例如 upcycling)的 (Qwen Team, 2024; Zhu 等, 2024),没有辅助损失来确保模块的专门化。相比之下,Transformer 2 专门使用强化学习训练专家向量以获取领域特定知识,使它们成为真正的专家。

Low-rank adaptation. PEFT methods such as LoRA (Hu et al., 2021) work by freezing the original model’s parameters and introducing small trainable low-rank matrices for task-specific updates. They significantly lower the computational and memory costs while providing performance comparable to full fine-tuning. Inspired by LoRA’s design, various modifications have been proposed (Zhang et al., 2023; Kopiczko et al., 2023; Liu et al., 2024; Bałazy et al., 2024; Cetoli, 2024). Transformer 2 does not rely on low-rank matrices, and instead scales the singular vectors of the original parameter matrix that span the full rank space.

低秩适应 (Low-rank adaptation) PEFT 方法,如 LoRA (Hu et al., 2021),通过冻结原始模型的参数并引入小的可训练低秩矩阵来进行任务特定的更新。它显著降低了计算和内存成本,同时提供了与全微调相当的性能。受 LoRA 设计的启发,已经提出了各种改进方法 (Zhang et al., 2023; Kopiczko et al., 2023; Liu et al., 2024; Bałazy et al., 2024; Cetoli, 2024)。Transformer 2 不依赖于低秩矩阵,而是对原始参数矩阵的奇异向量进行缩放,这些奇异向量跨越了全秩空间。

SVD for LLM Fine-tuning. SVD is increasingly being used as an inductive bias for PEFT in LLMs. For example, Wang et al. (2024) decompose a weight matrix and use the minor singular components, associated with noisy or long-tail information, to initialize low-rank matrices for LoRA fine-tuning. In a similar vein, SVD is employed to approximate an original weight matrix with the top $r$ singular vectors, corresponding to the highest singular values. A small trainable matrix is then introduced on top of the truncated singular value matrix to adjust the magnitude and orientations within this top-$r$ subspace (Bałazy et al., 2024; Cetoli, 2024). However, the drawback of this approach is that retaining only the top singular components can result in the loss of important information, particularly when the distribution of singular values is less skewed. The work most similar to ours is a concurrent effort by Lingam et al. (2024), who introduce various sparsification methods that utilize the SVD of the weights. However, it is not for self-adaptive LLMs and does not use RL to enhance learning efficiency.

SVD 用于大语言模型微调
SVD 越来越多地被用作大语言模型中参数高效微调 (PEFT) 的归纳偏置。例如,Wang 等人 (2024) 分解权重矩阵,并使用与噪声或长尾信息相关的较小奇异分量来初始化 LoRA 微调的低秩矩阵。类似地,SVD 被用于通过前 $r$ 个奇异向量(对应最高奇异值)来近似原始权重矩阵。然后在截断的奇异值矩阵之上引入一个小的可训练矩阵,以调整这个前 $r$ 维子空间内的幅度和方向 (Bałazy 等人, 2024; Cetoli, 2024)。然而,这种方法的缺点是仅保留前几个奇异分量可能会导致重要信息的丢失,尤其是在奇异值分布不那么偏斜的情况下。与我们的工作最相似的是 Lingam 等人 (2024) 的并行研究,他们引入了多种利用权重 SVD 的稀疏化方法。然而,他们的研究并非针对自适应大语言模型,也没有使用强化学习来提高学习效率。

3 METHODS

3 方法

3.1 PRELIMINARIES

3.1 预备知识

Singular value decomposition (SVD) offers a fundamental view of matrix multiplications. In the context of neural networks, each weight matrix $W\in\mathbb{R}^{n\times m}$ can be decomposed into three components $W=U\Sigma V^{\top}$, yielding semi-orthogonal matrices $U\in\mathbb{R}^{n\times r}$ and $V\in\mathbb{R}^{m\times r}$ together with an ordered vector of $r$ singular values (in descending order) arranged in the diagonal matrix $\Sigma\in\mathbb{R}^{r\times r}$. The linear operation defined by applying $W$ to $x$ can then be decomposed into a sum of independent terms, derived from mapping each column $v_{i}$ of $V$ onto the corresponding column $u_{i}$ of $U$, as $y=\sum_{i=1}^{r}\sigma_{i}u_{i}v_{i}^{\top}x$. Hence, each singular component, represented by the rank-1 matrix $u_{i}v_{i}^{\top}$, independently processes the input, providing an orthogonal contribution to the layer’s outputs, with the singular values $\sigma_{i}$ modulating the degree of the contributions.

奇异值分解 (SVD) 提供了矩阵乘法的基础视角。在神经网络的背景下,每个权重矩阵 $W\in\mathbb{R}^{n\times m}$ 可以被分解为三个部分 $W=U\Sigma V^{\top}$,得到半正交矩阵 $U\in\mathbb{R}^{n\times r}$ 和 $V\in\mathbb{R}^{m\times r}$,以及一个按降序排列的 $r$ 个奇异值向量,这些奇异值被排列在对角矩阵 $\Sigma\in\mathbb{R}^{r\times r}$ 中。通过将 $W$ 应用于 $x$ 所定义的线性操作,可以分解为独立项的和,这些独立项是通过将 $V$ 中的每一列 $v_{i}$ 映射到 $U$ 中的对应列 $u_{i}$ 得到的,即 $y=\sum_{i=1}^{r}\sigma_{i}u_{i}v_{i}^{\top}x$。因此,由秩-1矩阵 $u_{i}v_{i}^{\top}$ 表示的每个奇异分量独立处理输入,为层的输出提供正交贡献,而奇异值 $\sigma_{i}$ 则调节这些贡献的程度。
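The decomposition above can be checked numerically. The following is a minimal NumPy sketch, not taken from the paper: the matrix sizes, the random seed, and the variable names are illustrative assumptions. It verifies that summing the rank-1 contributions $\sigma_{i}u_{i}v_{i}^{\top}x$ reproduces $Wx$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 6                      # illustrative sizes: W maps R^m to R^n
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)

# Thin SVD: U is (n, r), s holds the r singular values in descending
# order, and Vt is (r, m), so its rows are the vectors v_i^T.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = s.shape[0]                   # r = min(n, m)

# Sum of the independent rank-1 contributions sigma_i * u_i * (v_i^T x).
y_components = sum(s[i] * U[:, i] * (Vt[i] @ x) for i in range(r))

# Direct application of W for comparison.
y_direct = W @ x
```

Here `y_components` and `y_direct` agree up to floating-point error, which is the property SVF relies on when it rescales each $\sigma_{i}$ separately.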

Cross-entropy method (CEM) is a Monte Carlo method for importance sampling and optimization (Rubinstein & Kroese, 2004). The method is based on the concept of minimizing the KL divergence between two probability distributions $D_{\mathrm{KL}}(P\|Q)$, where $P$ is the target distribution and $Q$ is a maintained distribution. At its core, CEM repeatedly generates a set of samples from $Q$, evaluates these samples with a performance function, and then updates the distribution $Q$ with the characteristics of the elite samples that have performed best. In the standard setup employed in most applications, $Q$ is set to a diagonal multivariate Gaussian, reducing the problem to simply estimating the empirical mean and standard deviation of the latest elites until a stopping criterion is met. We illustrate a complete CEM step in the Python pseudocode below.

交叉熵方法 (Cross-entropy method, CEM) 是一种用于重要性采样和优化的蒙特卡罗方法 (Rubinstein & Kroese, 2004)。该方法基于最小化两个概率分布之间的 KL 散度 $D_{\mathrm{KL}}(P\|Q)$ 的概念,其中 $P$ 是目标分布,$Q$ 是维护的分布。CEM 的核心是反复从 $Q$ 生成一组样本,用性能函数评估这些样本,然后用表现最好的精英样本的特征更新分布 $Q$。在大多数应用的标准设置中,$Q$ 被设置为对角多元高斯分布,将问题简化为仅估计最新精英样本的经验均值和标准差,直到满足停止条件。我们在下面的 Python 伪代码中展示了一个完整的 CEM 步骤。

import numpy as np

def cem_step(mu, sigma, num_elites, num_samples):
    # Draw num_samples candidates from the diagonal Gaussian Q = N(mu, sigma^2).
    samples = np.random.normal(loc=mu, scale=sigma, size=(num_samples, *np.shape(mu)))
    # `evaluate` is the task's performance function, assumed to be defined elsewhere.
    scores = evaluate(samples)
    # argsort is ascending, so the last num_elites indices are the best samples.
    elites = samples[np.argsort(scores)[-num_elites:]]
    # Refit the mean and standard deviation of Q to the elites.
    new_mu = np.mean(elites, axis=0)
    new_sigma = np.std(elites, axis=0)
    return (new_mu, new_sigma)
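To show how a CEM step is used in a loop, here is a small self-contained sketch. It is an illustration under stated assumptions rather than the paper's setup: the quadratic performance function, the target point, the sample counts, and the fixed iteration budget are all made up, and `cem_step` is redefined to take `evaluate` and a random generator as explicit arguments so the block runs on its own.

```python
import numpy as np

def cem_step(mu, sigma, num_elites, num_samples, evaluate, rng):
    # Sample candidates from the diagonal Gaussian Q = N(mu, sigma^2).
    samples = rng.normal(loc=mu, scale=sigma, size=(num_samples, mu.shape[0]))
    scores = evaluate(samples)
    # Keep the highest-scoring samples and refit Q to them.
    elites = samples[np.argsort(scores)[-num_elites:]]
    return elites.mean(axis=0), elites.std(axis=0)

def evaluate(samples):
    # Toy performance function (an assumption): higher is better,
    # with a single peak at the point (1, -2).
    target = np.array([1.0, -2.0])
    return -np.sum((samples - target) ** 2, axis=1)

rng = np.random.default_rng(0)
mu, sigma = np.zeros(2), np.full(2, 3.0)

# A fixed iteration budget stands in for the stopping criterion.
for _ in range(20):
    mu, sigma = cem_step(mu, sigma, num_elites=10, num_samples=100,
                         evaluate=evaluate, rng=rng)
```

After the loop, `mu` should sit close to the peak of the toy function and `sigma` should have shrunk, which is the behavior described in the text: the maintained distribution concentrates around the best-performing samples.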

3.2 TRANSFORMER 2

3.2 TRANSFORMER 2

The construction of Transformer 2 comprises two main steps, for which we provide an illustrative overview in Figure 2. First, we introduce Singular Value Fine-tuning (SVF), a method that uses RL to learn compact and compositional expert vectors based on the SVD of the base model’s weights. Then, we describe three different adaptation strategies within Transformer 2, inspired by three orthogonal principles, which adaptively combine the SVF-trained expert vectors during inference. We motivate how the properties of SVF are highly complementary to our adaptation strategies, making Transformer 2 an effective and scalable framework for the design of new self-adaptive LLMs.

Transformer 2 的构建包含两个主要步骤,我们在图 2 中提供了一个示意性概述。首先,我们引入了奇异值微调 (Singular Value Fine-tuning, SVF),这是一种基于基础模型权重的奇异值分解 (SVD)、利用强化学习 (RL) 来学习紧凑且可组合的专家向量的方法。然后,我们描述了 Transformer 2 中的三种不同适应策略,这些策略受到三个正交原则的启发,并在推理过程中自适应地结合 SVF 训练的专家向量。我们解释了 SVF 的特性如何与我们的适应策略高度互补,使 Transformer 2 成为一个有效且可扩展的框架,用于设计新的自适应大语言模型。


Figure 2: Method overview. Left) At training time, we employ SVF and RL to learn the “expert” vectors $z$ that scale the singular values of the weight matrices. Right) At inference time, we propose three distinct methods to adaptively select/combine the learned expert vectors.

图 2: 方法概述。左) 在训练时,我们使用 SVF 和 RL 来学习缩放权重矩阵奇异值的“专家”向量 $z$。右) 在推理时,我们提出了三种不同的方法来自适应地选择/组合学习到的专家向量。

Singular value fine-tuning is a key building block in Transformer 2. It offers an extremely efficient parameterization for fine-tuning and provides inherent compositionality for adaptation. Conventional fine-tuning techniques often aim to augment pre-trained models with new capabilities by modifying their weight matrices. However, in large-scale transformers, these weights are already rich repositories of abstracted knowledge, thanks to the breadth of the pre-training data and expansive architectural design. In fact, as evidenced in much of the prior literature, the requisite capabilities for solving many downstream tasks appear to already exist within these pre-trained models (Sharma et al., 2023). Therefore, instead of seeking to add new features, an efficient fine-tuning approach should focus on making these latent capabilities more expressible. Motivated by these considerations, for any weight matrix $W$, SVF learns a simple vector $z\in\mathbb{R}^{r}$ that provides targeted modifications to each singular component of $W$ independently, yielding a new weight matrix $W^{\prime}=U\Sigma^{\prime}V^{\top}$, where $\Sigma^{\prime}=\Sigma\otimes\mathrm{diag}(z)$. This essential parameterization enjoys several benefits:

奇异值微调是 Transformer 2 中的一个关键构建模块。它为微调提供了一种极其高效的参数化方法,并为适应提供了固有的组合性。传统的微调技术通常旨在通过修改预训练模型的权重矩阵来增强其新能力。然而,在大规模 Transformer 中,由于预训练数据的广度和扩展的架构设计,这些权重已经是抽象知识的丰富存储库。事实上,正如许多先前文献所证明的那样,解决许多下游任务所需的能力似乎已经存在于这些预训练模型中 (Sharma et al., 2023)。因此,与其寻求添加新功能,一种高效的微调方法应专注于使这些潜在能力更具表现力。基于这些考虑,对于任何权重矩阵 $W$,SVF 学习一个简单的向量 $z\in\mathbb{R}^{r}$,该向量独立地对 $W$ 的每个奇异分量进行有针对性的修改,从而生成一个新的权重矩阵 $W^{\prime}=U\Sigma^{\prime}V^{\top}$,其中 $\Sigma^{\prime}=\Sigma\otimes\mathrm{diag}(z)$。这种基本的参数化方法具有以下几个优点:

Negligible parameters: Learning only a vector $z$ for each weight matrix allows for very efficient fine-tuning with orders of magnitude fewer optimized parameters, even when compared to prior approaches specifically designed for efficiency. For example, the widely popular LoRA approach requires $(m+n)\times r^{\prime}$ learnable parameters per weight matrix, where $r^{\prime}$ is a hyper-parameter that generally needs to be set large enough for expressivity. While recent extensions, such as LoRA-XS (Bałazy et al., 2024), try to push efficiency even further, they often introduce limiting assumptions that curb applicability in several practical scenarios (see examples in Appendix C). In contrast, while SVF only needs $r=\min(m,n)$ parameters, we show empirically that it does not display the same shortcomings, thanks to working in a highly meaningful space provided by the latent expressiveness compressed in the weights of modern LLMs. Although scaling only the singular values may seem to lead to limited expressiveness, we wish to point out that the ability to affect the weight matrix in a full-rank manner technically provides more information than low-rank approaches.

可忽略的参数:每个权重矩阵仅学习一个向量 $z$,即使与之前专门为效率设计的方法相比,也能以少几个数量级的优化参数进行非常高效的微调。例如,广泛流行的 LoRA 方法需要每个权重矩阵 $(m+n)\times r^{\prime}$ 个可学习参数,其中 $r^{\prime}$ 是一个通常需要设置得足够大以保证表达能力的超参数。虽然最近的扩展,如 LoRA-XS (Bałazy et al., 2024),试图进一步提高效率,但它们通常引入了一些限制性假设,这些假设在某些实际场景中限制了适用性(参见附录 C 中的示例)。相比之下,SVF 只需要 $r=\min(m,n)$ 个参数,我们通过实验表明,由于它利用了现代大语言模型权重中压缩的潜在表达能力所提供的高度有意义的空间,因此不会表现出相同的缺点。SVF 仅缩放奇异值可能看起来会导致表达能力有限,但我们想指出,从技术上讲,以全秩方式影响权重矩阵的能力比低秩方法提供了更多的信息。

High compositionality: Decomposing the weights into independent singular components makes the learned $z$ vectors highly composable and interpretable, opening numerous possibilities for adaptation via algebraic manipulations. In contrast, LoRA-based methods inherently lack these properties. For instance, even if two LoRAs learned on the same task were to learn exactly the same adjustments for each $W$, directly interpolating between their compressed $A$ and $B$ matrices is unlikely to preserve any of their original behavior, given the countless number of equivalent parameter permutations they might have converged to.

高组合性:将权重分解为独立的奇异分量使得学习到的 $z$ 向量具有高度的可组合性和可解释性,为通过代数操作进行适应提供了多种可能性。相比之下,基于 LoRA 的方法本质上缺乏这些特性。例如,即使在同一任务上学习的两个 LoRA 对每个 $W$ 进行了完全相同的调整,直接在其压缩的 $A$ 和 $B$ 矩阵之间进行插值也不太可能保留它们的原始行为,因为它们可能已经收敛到无数种等效的参数排列。

Principled regularization: Exclusively modifying the magnitude of pre-existing singular components provides a principled and effective form of regularization. In practice, this property enables us to fine-tune for arbitrary downstream tasks with only hundreds of data points without the risk of severe collapse or overfitting.

原则性正则化:仅修改预先存在的奇异分量的大小提供了一种原则性且有效的正则化形式。在实践中,这一特性使我们能够仅用数百个数据点对任意下游任务进行微调,而不会出现严重崩溃或过拟合的风险。
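The SVF parameterization and two of the properties listed above (parameter count and composability) can be illustrated with a short NumPy sketch. This is a hedged example, not the authors' implementation: the helper name `svf_adapt`, the matrix sizes, the two "expert" vectors, the mixing weight `alpha`, and the LoRA rank of 16 are all hypothetical.

```python
import numpy as np

def svf_adapt(W, z):
    """Return W' = U (Sigma scaled elementwise by z) V^T for a scaling vector z."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Broadcasting multiplies column i of U by s[i] * z[i].
    return (U * (s * z)) @ Vt

rng = np.random.default_rng(0)
n, m = 8, 12                     # illustrative weight-matrix shape
W = rng.standard_normal((n, m))
r = min(n, m)

# z = 1 leaves every singular value unchanged, so W' equals W.
W_same = svf_adapt(W, np.ones(r))

# Two hypothetical expert vectors and a linear interpolation between them.
z_a = 1.0 + 0.1 * rng.standard_normal(r)
z_b = 1.0 + 0.1 * rng.standard_normal(r)
alpha = 0.3
W_mix = svf_adapt(W, alpha * z_a + (1 - alpha) * z_b)

# W' is linear in z, so mixing the vectors equals mixing the adapted weights.
W_blend = alpha * svf_adapt(W, z_a) + (1 - alpha) * svf_adapt(W, z_b)

# Trainable parameters per matrix: SVF versus LoRA with an assumed rank r' = 16.
svf_params = r
lora_params = (m + n) * 16
```

For this toy shape SVF trains 8 values per matrix against 320 for the assumed LoRA rank. The equality of `W_mix` and `W_blend` is one concrete sense in which the $z$ vectors compose through simple algebra.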

End-to-end optimization with RL. We train a set of SVF vectors $\theta_{z}=\{z_{1},\cdots,z_{N\times M}\}$ to fine-tune an arbitrary language model $\pi_{\theta_{W}}$ parameterized by $\theta_{W}$ with RL, optimizing directly for task performance. Here, $\theta_{W}=\{W_{1},\cdots,W_{N\times M}\}$ is the set of weight matrices, where $N$ is the number of layers and $M$ is the number of weight matrices to fine-tune per layer. We use the seminal REINFORCE algorithm (Williams, 1992) and label each generated answer $y_{i}$ (for the prompt $x_{i}\in D$) with a unitary reward based on its correctness $r\in\{-1,1\}$. Inspired by related applications of RL for optimizing LLMs (Ouyang et al., 2022), we regularize the REINFORCE objective by adding a KL penalty for deviating from the original model’s behavior, weighted by a small coefficient $\lambda\in\mathbb{R}^{+}$. Thus, our final objective function can be written as:

端到端的强化学习优化。我们训练一组 SVF 向量 $\theta_{z}=\{z_{1},\cdots,z_{N\times M}\}$ 来微调由 $\theta_{W}$ 参数化的任意语言模型 $\pi_{\theta_{W}}$,并直接针对任务性能进行优化。其中,$\theta_{W}=\{W_{1},\cdots,W_{N\times M}\}$ 是权重矩阵的集合,$N$ 是层数,$M$ 是每层需要微调的权重矩阵数量。我们使用经典的 REINFORCE 算法 (Williams, 1992),并根据生成的答案 $y_{i}$(对于提示 $x_{i}\in D$)的正确性 $r\in\{-1,1\}$ 为其分配单位奖励。受 RL 优化大语言模型的相关应用启发 (Ouyang et al., 2022),我们通过在 REINFORCE 目标函数中添加一个 KL 惩罚项来正则化,该惩罚项用于衡量与原始模型行为的偏离程度,并由一个小的系数 $\lambda\in\mathbb{R}^{+}$ 加权。因此,我们的最终目标函数可以写成:

$$J(\theta_{z})=\mathbb{E}\left[\log\left(\pi_{\theta_{W^{\prime}}}(\hat{y}_{i}\mid x_{i})\right)r(\hat{y}_{i},y_{i})\right]-\lambda D_{\mathrm{KL}}(\pi_{\theta_{W^{\prime}}}\|\pi_{\theta_{W}}),$$

where we use $\pi_{\theta_{W^{\prime}}}$ to denote the resulting language model after substituting the original weight matrices $W$ with $W^{\prime}$. While RL is generally considered less stable than next-token prediction objectives, we find the regularization properties of SVF avoid many of the failure modes of prior less-constrained parameterizations (see Section 4.3). Thus, combining these complementary components effectively enables us to avoid relying on expensive fine-tuning procedures with large hand-designed datasets as proxies, and directly maximize task performance end-to-end.

我们使用 $\pi_{\theta_{W^{\prime}}}$ 来表示将原始权重矩阵 $W$ 替换为 $W^{\prime}$ 后得到的语言模型。尽管强化学习通常被认为不如下一个 Token 预测目标稳定,但我们发现 SVF 的正则化特性避免了许多先前较少约束的参数化失败模式(见第 4.3 节)。因此,结合这些互补组件,我们能够有效避免依赖昂贵的手工设计数据集作为代理的微调过程,并直接端到端地最大化任务性能。
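As a rough illustration of how such an objective could be estimated from a batch, the sketch below combines a REINFORCE term with a KL penalty in NumPy. It is a sketch under assumptions, not the paper's training code: the function names, the toy numbers, and in particular the choice to estimate the KL term by averaging over a few next-token distributions are hypothetical, since the text above does not specify how the KL term is computed.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def svf_objective(logp_answers, rewards, logits_adapted, logits_base, lam):
    """Batch estimate of E[log pi_W'(y_hat | x) * r] - lam * KL(pi_W' || pi_W)."""
    # REINFORCE term: log-probability of each sampled answer times its reward.
    reinforce_term = np.mean(logp_answers * rewards)
    # KL between adapted and base distributions (assumed: mean over positions).
    lp_new = log_softmax(logits_adapted)
    lp_old = log_softmax(logits_base)
    kl = np.mean(np.sum(np.exp(lp_new) * (lp_new - lp_old), axis=-1))
    return reinforce_term - lam * kl

# Toy batch: two sampled answers, one correct (r = +1) and one wrong (r = -1).
logp_answers = np.array([-1.0, -2.0])
rewards = np.array([1.0, -1.0])
logits_base = np.array([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])
logits_adapted = logits_base + np.array([[0.5, 0.0, 0.0], [0.0, 1.0, 0.0]])

# With identical logits the KL penalty vanishes and only the REINFORCE term remains.
J_same = svf_objective(logp_answers, rewards, logits_base, logits_base, lam=0.1)
# Once the adapted model drifts from the base model, the penalty lowers J.
J_shift = svf_objective(logp_answers, rewards, logits_adapted, logits_base, lam=0.1)
```

The objective is maximized with respect to the $z$ vectors. In a real setup the log-probabilities and logits would come from the adapted model and gradients would flow through them, which this NumPy sketch does not attempt.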

In general, SVF with RL places lower requirements on the dataset it trains on. For example, LoRA fine-tuning requires “explaining texts” to perform next-token predictions, which places higher requirements on the dataset (e.g., imagine LoRA fine-tuning on a GSM8K dataset where no reasoning text but only the final number is provided). This benefit allows SVF to be more general and effective. One possible caveat for SVF is the sparse rewards caused by a weak base model, which we discuss further in Section 5.

一般来说,基于强化学习 (RL) 的 SVF 对训练数据集的要求较低。例如,LoRA 微调需要“解释文本”来进行下一个 Token 的预测,这对数据集的要求更高(例如,想象一下在 GSM8K 数据集上进行 LoRA 微调,其中没有提供推理文本,只有最终的数字)。这一优势使得 SVF 更加通用和有效。SVF 可能面临的一个潜在问题是基础模型较弱导致的稀疏奖励,我们将在第 5 节进一步讨论这一点。

Self-adaptation is a critical mechanism in nature that has established itself as a core guiding principle in modern system design (Klös et al., 2015). Our initial efforts toward self-adaptive foundation models focus on the inference stage of LLMs, where we devise a simple two-pass adaptation strategy that combines $K$ sets of base “expert” vectors $z^{1:K}$ trained with SVF to provide different kinds of capabilities (e.g., coding, math, etc.). The mapping between a capability and the dataset we train on can be acquired from the dataset’s metadata. In the first inference pass, given a task or an individual input prompt, Transformer 2 executes the model and observes its test-time behavior to derive a new $z^{\prime}$ vector tailored to its test-time conditions. This adapted $z^{\prime}$ is then used in the second inference pass to provide an actual response with the newly adapted weights. The interaction between SVF-trained expert vectors and the adaptation strategies ensures seamless integration, where expert vectors provide modular capabilities, and the adaptation strategies dynamically determine and compose the most suitable combination to address the input task. In this first work, we propose three simple approaches to produce the vector $z^{\prime}$ during the first inference pass, implementing self-adaptation with distinct methods and requirements. Below, we provide an outline of each method and refer to Appendix A for additional implementation details.

自适应是自然界中的一种关键机制,已成为现代系统设计的核心指导原则 (Klös et al., 2015)。我们在大语言模型的推理阶段首次尝试了自适应基础模型,设计了一种简单的两阶段自适应策略,结合了通过 SVF 训练的 $K$ 组基础“专家”向量 $z^{1:K}$,以提供不同的能力(例如编程、数学等)。能力与训练数据集之间的映射可以从数据集的元数据中获取。在第一次推理阶段,给定任务或单个输入提示,Transformer 2 执行模型并观察其在测试时的行为,以生成一个适应其测试条件的新向量 $z^{\prime}$。然后,在第二次推理阶段使用这个自适应后的 $z^{\prime}$ 向量,结合新调整的权重生成实际响应。SVF 训练的专家向量与自适应策略之间的交互确保了无缝集成,专家向量提供模块化能力,而自适应策略则动态确定并组合最合适的组合以应对输入任务。在这项初步工作中,我们提出了三种简单的方法来在第一次推理阶段生成向量 $z^{\prime}$,通过不同的方法和需求实现自适应。以下我们概述了每种方法,并在附录 A 中提供了更多实现细节。

A) Prompt engineering: Our most basic approach involves constructing a new “adaptation” prompt which we use to directly ask the LLM to categorize the input prompt. Based on its response, we then extract one category out of the set of domain topics used to pre-train each SVF expert and, thus, we select the corresponding $z^{\prime}$ directly from $z^{1:K}$ . In our adaptation prompt, we also explicitly provide the option for a generic “others” category, allowing the model to use its base weights in case no expert provides appropriate capabilities. We show the format used to construct the adaptation prompt in Figure 3.

A) 提示工程 (Prompt Engineering):我们最基本的方法涉及构建一个新的“适应”提示,用于直接要求大语言模型对输入提示进行分类。根据其响应,我们从用于预训练每个 SVF 专家的领域主题集中提取一个类别,从而直接从 $z^{1:K}$ 中选择相应的 $z^{\prime}$。在我们的适应提示中,我们还明确提供了一个通用的“其他”类别选项,允许模型在没有专家提供适当能力的情况下使用其基础权重。我们在图 3 中展示了用于构建适应提示的格式。

B) Classification expert: A direct extension of the prompt engineering approach comes from using a specialized system to handle task identification. Following the principles of self-adaptation, we apply SVF to fine-tune the base LLM itself to handle this task. In particular, we collect a dataset $D=\{(x_{1,1},1),\cdots,(x_{i,k},k),\cdots\}$ from the $K$ SVF training tasks, where $x_{i,k}$ is the $i$-th example from the $k$-th expert task. Each tuple $(x_{i,k},k)$ then forms an example to pre-train an additional job classification expert $z^{c}$ learned in the same fashion as the others. During the first inference pass, we simply load $z^{c}$, intending to improve the inherent task classification capabilities of the base model to select a more appropriate $z^{\prime}$ to handle the input prompt.

B) 分类专家:提示工程方法的直接扩展来自于使用专门系统来处理任务识别。遵循自适应的原则,我们应用 SVF 来微调基础大语言模型本身以处理此任务。特别是,我们从 $K$ 个 SVF 训练任务中收集了一个数据集 $D=\{(x_{1,1},1),\cdots,(x_{i,k},k),\cdots\}$,其中 $x_{i,k}$ 是第 $k$ 个专家任务中的第 $i$ 个示例。每个元组 $(x_{i,k},k)$ 然后形成一个示例,用于预训练一个额外的任务分类专家 $z^{c}$,其学习方式与其他专家相同。在第一次推理过程中,我们简单地加载 $z^{c}$,旨在提高基础模型的固有任务分类能力,以选择更合适的 $z^{\prime}$ 来处理输入提示。
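The dataset $D$ used for the classification expert can be illustrated with a small sketch. This is only an illustration of the labeling scheme described above: the function name `build_classification_dataset`, the task names, and the toy prompts are hypothetical and do not come from the paper's code.

```python
from typing import Dict, List, Tuple


def build_classification_dataset(
    task_prompts: Dict[str, List[str]],
) -> Tuple[List[Tuple[str, int]], Dict[int, str]]:
    """Pair every training prompt x_{i,k} with the 1-based index k of its expert task."""
    dataset: List[Tuple[str, int]] = []
    index_to_task: Dict[int, str] = {}
    for k, (task_name, prompts) in enumerate(task_prompts.items(), start=1):
        index_to_task[k] = task_name
        for prompt in prompts:
            dataset.append((prompt, k))
    return dataset, index_to_task


# Toy prompts standing in for the K = 3 SVF training tasks (hypothetical data).
task_prompts = {
    "math": ["What is 12 * 7?", "Solve x + 3 = 10."],
    "code": ["Write a function that reverses a string."],
    "reasoning": ["Which gas do plants absorb from the air?"],
}
dataset, index_to_task = build_classification_dataset(task_prompts)
```

Each `(prompt, k)` pair would then serve as one training example for $z^{c}$, with the label $k$ as the target the model is asked to produce.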


Figure 3: Prompt-based adaptation. Self-adaptation prompt used by Transformer 2 to classify the task prompt into pre-defined categories.

图 3: 基于提示的适应。Transformer 2 使用的自我适应提示将任务提示分类为预定义的类别。
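The prompt-based dispatch can be sketched as follows. The prompt wording, the category labels, and the helper names (`build_adaptation_prompt`, `parse_category`, `select_expert`) are assumptions made for illustration; the actual template is the one shown in Figure 3, which is not reproduced here. The `llm` argument is any callable that maps a prompt string to a response string.

```python
import re
from typing import Callable, Dict, Optional, Sequence

# Hypothetical category labels; "other" stands for the generic fallback option.
CATEGORIES = ("code", "math", "reasoning", "other")


def build_adaptation_prompt(user_prompt: str, categories: Sequence[str] = CATEGORIES) -> str:
    """Wrap the user's prompt in a classification request (illustrative wording)."""
    options = ", ".join(categories)
    return (
        "Classify the following prompt into exactly one of these categories: "
        f"{options}.\n\nPrompt: {user_prompt}\n\n"
        "Answer with the category name only."
    )


def parse_category(response: str, categories: Sequence[str] = CATEGORIES) -> str:
    """Return the first known category mentioned in the response, else 'other'."""
    lowered = response.lower()
    for category in categories:
        if re.search(rf"\b{re.escape(category)}\b", lowered):
            return category
    return "other"


def select_expert(
    user_prompt: str,
    llm: Callable[[str], str],
    experts: Dict[str, Sequence[float]],
) -> Optional[Sequence[float]]:
    """First pass: classify the prompt, then pick z' from the expert vectors.

    Returns None when no expert matches, meaning the base weights are used.
    """
    category = parse_category(llm(build_adaptation_prompt(user_prompt)))
    return experts.get(category)
```

Returning `None` for the fallback category mirrors the option of letting the model run with its base weights when no expert is appropriate.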

4 EXPERIMENTS

4 实验

We extensively evaluate Transformer 2 on multiple tasks and models with the purpose of: (1) assessing the efficiency and effectiveness of SVF; (2) demonstrating self-adaptiveness through the three proposed adaptation strategies; (3) conducting in-depth analysis and ablation studies aimed at understanding and interpreting the properties of our new framework.

我们在多个任务和模型上广泛评估了 Transformer 2,目的是:(1) 评估 SVF 的效率和有效性;(2) 通过提出的三种自适应策略展示自适应性;(3) 进行深入分析和消融研究,旨在理解和解释我们新框架的特性。

4.1 EXPERIMENTAL SETUPS

4.1 实验设置

To validate the generality of Transformer 2, we consider three pre-trained LLMs ranging across different model families and architecture sizes: LLAMA3-8B-INSTRUCT, MISTRAL-7B-INSTRUCT-V0.3, and LLAMA3-70B-INSTRUCT. For each model, we obtain three sets of SVF-trained $z$ vectors to maximize performance for GSM8K (Cobbe et al., 2021), MBPP-pro (Austin et al., 2021), and ARC-Easy (Clark et al., 2018), respectively. Additionally, we also train a set of $z$ vectors for LLAMA3-8B-INSTRUCT, when applied as the language backbone for TextVQA (Singh et al., 2019), in order to assess SVF’s applicability to the vision-language modeling (VLM) domain. We provide SVF’s main learning curves on each of these tasks in Figure 4. Finally, we evaluate the full Transformer 2 adaptation framework on four unseen tasks: MATH (Hendrycks et al., 2021), Humaneval (Chen et al., 2021), ARC-Challenge (Clark et al., 2018), and OKVQA (Marino et al., 2019). In all our adaptation experiments, we only consider experts obtained in the pure-language settings, assessing the framework’s test-time applicability even for the distinctive vision domain. Please refer to Appendix A for additional details and a summary of the hyper-parameters used in the experiments.

为了验证 Transformer 2 的通用性,我们考虑了三种预训练的大语言模型,涵盖不同的模型家族和架构规模:LLAMA3-8B-INSTRUCT、MISTRAL-7B-INSTRUCT-V0.3 和 LLAMA3-70B-INSTRUCT。对于每个模型,我们获得了三组经过 SVF 训练的 $z$ 向量,分别用于最大化 GSM8K (Cobbe et al., 2021)、MBPP-pro (Austin et al., 2021) 和 ARC-Easy (Clark et al., 2018) 的性能。此外,我们还为 LLAMA3-8B-INSTRUCT 训练了一组 $z$ 向量,当它作为 TextVQA (Singh et al., 2019) 的语言骨干时,用于评估 SVF 在视觉语言建模 (VLM) 领域的适用性。我们在图 4 中提供了 SVF 在这些任务上的主要学习曲线。最后,我们在四个未见过的任务上评估了完整的 Transformer 2 适应框架:MATH (Hendrycks et al., 2021)、Humaneval (Chen et al., 2021)、ARC-Challenge (Clark et al., 2018) 和 OKVQA (Marino et al., 2019)。在我们所有的适应实验中,我们仅考虑在纯语言设置中获得的专家,评估其在测试时对独特视觉领域的适用性。有关实验中使用超参数的更多细节和总结,请参阅附录 A。


Figure 4: SVF learning curves. The dashed lines indicate the performance of LLAMA3-8B-INSTRUCT on the test split of each task. SVF effectively fine-tunes to surpass the base performance. While we use the best validation score to select our checkpoint for evaluation (marked by red dots), we present the entire training curve without early stopping to demonstrate SVF’s learning capabilities. Tasks with only hundreds of training samples like Coding and Reasoning were stopped early. In our experiments, we update the parameters at the end of each epoch.

图 4: SVF 学习曲线。虚线表示 LLAMA3-8B-INSTRUCT 在每个任务测试集上的表现。SVF 通过微调有效超越了基础性能。虽然我们使用最佳验证分数来选择评估的检查点(用红点标记),但我们展示了整个训练曲线,没有提前停止,以展示 SVF 的学习能力。像编程和推理这样只有数百个训练样本的任务提前停止了。在我们的实验中,我们在每个 epoch 结束时更新参数。

4.2 EXPERIMENTAL RESULTS

4.2 实验结果

SVF performance We provide results after training on each considered task with the LLAMA3-8B-INSTRUCT, MISTRAL-7B-INSTRUCT-V0.3, and LLAMA3-70B-INSTRUCT base models in Table 1. Remarkably, we find that SVF provides considerable and consistent performance gains across nearly all tasks and base models. In contrast, LoRA experts yield smaller gains and even sporadic performance degradation. (These LoRA experts are trained with next-token prediction. While we also have LoRA experts trained with RL in Table 4, RL seems to work less well with LoRA than with SVF.) This observed trend also extends to the vision-language domain, as fine-tuning LLAMA3-LLAVA-NEXT-8B with SVF bolsters the base model’s performance by over $39\%$ (see Figure 5). To ensure a fair comparison, we provide extensive ablations of both our model and the LoRA baseline, considering different architectures and optimization objectives, in Section 4.3. Due to its essential parameterization, we would like to note that training SVF requires considerably fewer resources, with less than $10\%$ of the training parameters of our LoRA implementation.

SVF 性能
我们在表 1 中提供了使用 LLAMA3-8B-INSTRUCT、MISTRAL-7B-INSTRUCT-V0.3 和 LLAMA3-70B-INSTRUCT 基础模型在每项任务上训练后的结果。值得注意的是,我们发现 SVF 在几乎所有任务和基础模型上都提供了显著且一致的性能提升。相比之下,LoRA 专家带来的提升较小,甚至偶尔会出现性能下降。(这些 LoRA 专家是通过下一个 Token 预测进行训练的。尽管我们在表 4 中也提供了通过强化学习 (RL) 训练的 LoRA 专家,但 RL 在 LoRA 上的表现似乎不如在 SVF 上。)这一观察到的趋势也延伸到视觉-语言领域,使用 SVF 微调 LLAMA3-LLAVA-NEXT-8B 使基础模型的性能提升了超过 $39\%$(见图 5)。为了确保公平比较,我们在第 4.3 节中对我们的模型和 LoRA 基线进行了广泛的消融实验,考虑了不同的架构和优化目标。由于其本质的参数化特性,我们想指出,训练 SVF 所需的资源显著减少,训练参数不到我们 LoRA 实现的 $10\%$。
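The parameter gap between the two methods can be made concrete with a rough count. The sketch below rests on several assumptions that are not stated in this section: the projection shapes are those of the public Llama-3-8B configuration, SVF learns one scalar per singular value of each targeted matrix, both methods target only the query and value projections, and the LoRA rank is 16. Under these assumptions the totals land close to the 0.16M and 6.82M listed for the attention-only rows of Table 4.

```python
# Assumed Llama-3-8B shapes as (out_features, in_features); not taken from this section.
NUM_LAYERS = 32
Q_PROJ = (4096, 4096)
V_PROJ = (1024, 4096)
LORA_RANK = 16  # assumed rank, chosen for illustration


def svf_params(shape):
    """One learned scale per singular value: min(m, n) parameters per matrix."""
    return min(shape)


def lora_params(shape, rank):
    """Two low-rank factors: rank * (m + n) parameters per matrix."""
    return rank * (shape[0] + shape[1])


svf_total = NUM_LAYERS * (svf_params(Q_PROJ) + svf_params(V_PROJ))
lora_total = NUM_LAYERS * (lora_params(Q_PROJ, LORA_RANK) + lora_params(V_PROJ, LORA_RANK))
ratio = svf_total / lora_total

print(f"SVF:  {svf_total / 1e6:.2f}M parameters")
print(f"LoRA: {lora_total / 1e6:.2f}M parameters")
print(f"SVF / LoRA = {ratio:.1%}")
```

Because the SVF count grows with `min(m, n)` while the LoRA count grows with `rank * (m + n)`, the ratio stays small for any reasonable rank.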

Adaptation performance With the SVF-trained $z$ vectors, we assess the self-adaptation capability of Transformer 2 on unseen tasks. For a fair comparison with LoRA, we record the performance of this baseline using all checkpoints from the considered training tasks and report only its highest performance for each of the test tasks. As shown in Table 2, all of our Transformer 2 adaptation strategies demonstrate improvements across all tasks for LLAMA3-8B-INSTRUCT base models, and in at least two out of three tasks for both MISTRAL-7B-INSTRUCT-V0.3 and LLAMA3-70B-INSTRUCT. In contrast, even the best training LoRAs only provide marginal improvements on the

适应性能

使用经过 SVF 训练的 $z$ 向量,我们评估了 Transformer 2 在未见任务上的自适应能力。为了与 LoRA 进行公平比较,我们记录了该基线在考虑的训练任务中的所有检查点的性能,并仅报告其在每个测试任务中的最高性能。如表 2 所示,我们所有的 Transformer 2 适应策略在 LLAMA3-8B-INSTRUCT 基础模型的所有任务中都表现出改进,并且在 MISTRAL-7B-INSTRUCT-V0.3 和 LLAMA3-70B-INSTRUCT 的三个任务中至少有两个有所提升。相比之下,即使是最好的训练 LoRA 也只能在...

Table 1: Fine-tuning results. LLM performance on the test splits of math, coding and reasoning. Normalized scores are in the parentheses.

表 1: 微调结果。大语言模型在数学、编码和推理测试集上的表现。括号内为标准化分数。

| 方法 | GSM8K | MBPP-Pro | ARC-Easy |
| --- | --- | --- | --- |
| LLAMA3-8B-INSTRUCT | 75.89 (1.00) | 64.65 (1.00) | 88.59 (1.00) |
| + LoRA | 77.18 (1.02) | 67.68 (1.05) | 88.97 (1.00) |
| + SVF (Ours) | 79.15 (1.04) | 66.67 (1.03) | 89.56 (1.01) |
| MISTRAL-7B-INSTRUCT-V0.3 | 42.83 (1.00) | 49.50 (1.00) | 81.65 (1.00) |
| + LoRA | 44.66 (1.04) | 51.52 (1.04) | 81.19 (0.98) |
| + SVF (Ours) | 49.74 (1.16) | 51.52 (1.04) | 85.14 (1.04) |
| LLAMA3-70B-INSTRUCT | 85.29 (1.00) | 80.81 (1.00) | 89.10 (1.00) |
| + LoRA | 77.26 (0.91) | 68.69 (0.85) | 88.55 (0.99) |
| + SVF (Ours) | 88.32 (1.04) | 80.81 (1.00) | 88.47 (0.99) |

Table 2: Self-adaptation on unseen tasks. Normalized scores are in the parentheses.

表 2: 在未见任务上的自适应。括号内为归一化分数。

| 方法 | MATH | Humaneval | ARC-Challenge |
| --- | --- | --- | --- |
| LLAMA3-8B-INSTRUCT | 24.54 (1.00) | 60.98 (1.00) | 80.63 (1.00) |
| + LoRA | 24.12 (0.98) | 52.44 (0.86) | 81.06 (1.01) |
| + Transformer 2 (Prompt) | 25.22 (1.03) | 61.59 (1.01) | 81.74 (1.01) |
| + Transformer 2 (Cls-expert) | 25.18 (1.03) | 62.80 (1.03) | 81.37 (1.01) |
| + Transformer 2 (Few-shot) | 25.47 (1.04) | 62.99 (1.03) | 82.61 (1.02) |
| MISTRAL-7B-INSTRUCT-V0.3 | 13.02 (1.00) | 43.29 (1.00) | 71.76 (1.00) |
| + LoRA | 13.16 (1.01) | 37.80 (0.87) | 75.77 (1.06) |
| + Transformer 2 (Prompt) | 11.86 (0.91) | 43.90 (1.01) | 72.35 (1.01) |
| + Transformer 2 (Cls-expert) | 11.60 (0.89) | 43.90 (1.01) | 74.83 (1.04) |
| + Transformer 2 (Few-shot) | 13.39 (1.03) | 47.40 (1.09) | 75.47 (1.05) |
| LLAMA3-70B-INSTRUCT | 40.64 (1.00) | 78.66 (1.00) | 87.63 (1.00) |
| + LoRA | 25.40 (0.62) | 73.78 (0.94) | 83.70 (0.96) |
| + Transformer 2 (Prompt) | 40.44 (1.00) | 79.88 (1.02) | 88.48 (1.01) |

ARC-Challenge task and still significantly deteriorate performance on both MATH and Humaneval. This discrepancy suggests that LoRA’s parameterization and optimization might be particularly sensitive to overfitting, especially when trained with the smaller GSM8K and MBPP-Pro datasets, the tasks that provide information most related to MATH and Humaneval. In Figure 5, we find a similar dichotomy in the OKVQA task, with the performance of the base LLAMA3-LLAVA-NEXT-8B VLM only improving after applying Transformer 2. We note that also in this setting, Transformer 2 performs self-adaptation only from the expert vectors from GSM8K, MBPP-Pro, and ARC-Easy. Thus, this result further underscores the high flexibility of self-adaptation, transferring knowledge compressed for tasks entirely based on language even for unrelated vision-based problems.

ARC-Challenge 任务,并且在 MATH 和 Humaneval 上的性能显著下降。这种差异表明,LoRA 的参数化和优化可能对过拟合特别敏感,尤其是在使用较小的 GSM8K 和 MBPP-Pro 数据集进行训练时,这些任务提供的信息与 MATH 和 Humaneval 最为相关。在图 5 中,我们在 OKVQA 任务中发现了类似的两极分化现象,基础 LLAMA3-LLAVA-NEXT-8B VLM 的性能仅在应用 Transformer 2 后才有所提升。我们注意到,在这种设置下,Transformer 2 仅从 GSM8K、MBPP-Pro 和 ARC-Easy 的专家向量中进行自适应性调整。因此,这一结果进一步强调了自适应性调整的高度灵活性,即使对于与语言无关的基于视觉的问题,也能传递为完全基于语言的任务压缩的知识。

Comparing the three proposed adaptation strategies, we highlight a clear monotonic trend: with more involved strategies and additional information about the test-time condition, self-adaptation appears to be increasingly effective. In particular, Transformer 2 with few-shot self-adaptation is almost always the highest-scoring method, providing notable improvements across all tested settings except for LLAMA3-70B-INSTRUCT on MATH, where we have only SVF-tuned half of the layers due to our limited GPU resources. This trend shows that providing additional or different kinds of information seems to be highly beneficial to our framework, suggesting that Transformer 2 could provide foundation models with new means to continually improve performance when deployed in lifelong settings.

在比较这三种提出的适应策略时,我们观察到一个明显的单调趋势:随着策略的复杂性和测试条件信息的增加,自我适应似乎变得越来越有效。特别是,使用少样本自我适应的 Transformer 2 几乎总是得分最高的方法,在所有测试设置中均提供了显著的改进,唯一的例外是 LLAMA3-70B-INSTRUCT 在 MATH 上的结果,由于我们有限的 GPU 资源,我们只对该模型一半的层进行了 SVF 调优。这一趋势表明,提供额外或不同类型的信息似乎对我们的框架非常有益,这表明 Transformer 2 可以为基础模型提供新的手段,在终身部署环境中持续提升性能。

Table 3 reports the inference time required by the prompt adaptation strategy of Transformer 2, with the time spent on solving the entire problem set presented separately for the 1st and 2nd passes. The 2nd pass inference time is the time spent on solving the problems, while the 1st pass inference time is the time spent on self-adaptation; the ratios of 1st to 2nd pass inference time are given in parentheses. While the additional inference pass might appear to double the overall runtime, it is important to note that inference time primarily depends on the number of tokens generated. In our settings, the cost of the 1st pass is $\mathcal{O}(n)$, where $n$ is the length of the input. ARC-Challenge’s cost ratio is large because its problems are single-choice, so the cost of the 2nd pass is also $\mathcal{O}(n)$. In general settings, we think it is reasonable to assume this ratio to be closer to those of MATH and Humaneval. For a detailed discussion on improving the efficiency of CEM few-shot adaptation methods, please see Appendix D.

表 3 报告了 Transformer 2 的提示适应策略所需的推理时间,其中分别列出了第一次和第二次通过解决整个问题集所花费的时间。需要注意的是,第二次通过的推理时间是解决问题所花费的时间,而第一次通过的推理时间是自适应所花费的时间,第一次到第二次通过的推理时间比例在括号中给出。虽然额外的推理通过可能会使整体运行时间翻倍,但需要注意的是,推理时间主要取决于生成的 token 数量。在我们的设置中,它是 $\mathcal{O}(n)$,其中 $n$ 是输入的长度。ARC-challenge 的成本比例较大,因为它们是单选题,因此第二次通过的成本也是 $O(n)$。在一般情况下,我们认为假设这个比例更接近 MATH 和 Humaneval 的比例是合理的。有关提高 CEM 少样本适应方法效率的详细讨论,请参见附录 D。

Table 3: Time cost of 2-pass inference in prompt adaptation strategy of Transformer 2 for the entire problem set. 1st to 2nd pass inference time ratios are shown in parentheses.

表 3: Transformer 2 的提示适应策略在整个问题集上的两轮推理时间成本。括号中显示了第一轮到第二轮推理时间的比例。

| 任务 | 第一轮 (秒) | 第二轮 (秒) |
| --- | --- | --- |
| MATH | 42.64 (13%) | 321.19 |
| Humaneval | 2.76 (19%) | 14.28 |
| ARC-Challenge | 13.40 (47%) | 28.51 |
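The percentages in Table 3 can be recomputed from the two timing columns. The short sketch below does this and also derives the total runtime relative to a single solving pass; the helper names are hypothetical, and the timings are copied from the table.

```python
# (1st pass seconds, 2nd pass seconds) for the whole problem set, from Table 3.
timings = {
    "MATH": (42.64, 321.19),
    "Humaneval": (2.76, 14.28),
    "ARC-Challenge": (13.40, 28.51),
}


def pass_ratio(first: float, second: float) -> float:
    """Ratio of 1st-pass (adaptation) time to 2nd-pass (solving) time."""
    return first / second


def total_overhead(first: float, second: float) -> float:
    """Total two-pass runtime relative to running the 2nd pass alone."""
    return (first + second) / second


for task, (first, second) in timings.items():
    print(
        f"{task}: ratio = {pass_ratio(first, second):.0%}, "
        f"total = {total_overhead(first, second):.2f}x of one pass"
    )
```

For the generation-heavy tasks the adaptation pass adds well under a quarter of the solving time, which is why the two-pass scheme does not double the runtime in practice.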

4.3 ANALYSIS

4.3 分析

Lastly, we analyze and discuss the properties of our adaptation strategies, for which we provide extensions and further discussion in Appendix B.

最后,我们分析并讨论了适应策略的特性,并在附录 B 中提供了扩展和进一步的讨论。

Analysis 1: Job dispatching accuracy In Figure 6 we provide the confusion matrices of our classification-based adaptation strategies. These results validate the effectiveness of both our classification-based adaptation strategies to match each prompt with experts trained in similar domains, as evidenced by the high values along the diagonals. Furthermore, the results from LLAMA3-8B-INSTRUCT and MISTRAL-7B-INSTRUCT-V0.3 also show that using the classification expert consistently provides higher classification accuracy than vanilla prompt engineering. While this difference could explain the higher performance of the relative self-adaptation strategy, we also note that domain similarity might not be the only metric relevant to identifying the best expert for each prompt or task. To this end, we believe many unexplored extensions could be pursued in future work, using heuristics such as past expert performance or token-level analysis to further push our framework’s scalability.

分析 1: 任务调度准确性
在图 6 中,我们提供了基于分类的适应策略的混淆矩阵。这些结果验证了我们的基于分类的适应策略在将每个提示与受过类似领域训练的专家匹配方面的有效性,对角线上的高值证明了这一点。此外,LLAMA3-8B-INSTRUCT 和 MISTRAL-7B-INSTRUCT-V0.3 的结果也表明,使用分类专家始终比普通的提示工程提供更高的分类准确率。虽然这种差异可以解释相对自适应策略的更高性能,但我们也注意到,领域相似性可能不是识别每个提示或任务的最佳专家的唯一指标。为此,我们认为未来的工作中可以探索许多未开发的扩展,例如使用过去专家的表现或 Token 级别的分析等启发式方法,以进一步推动我们框架的可扩展性。


Figure 6: Confusion matrices. These matrices display the classification percentages, where rows represent the task classes (ground truth) and columns indicate the predicted categories. Some samples are misclassified as “Others,” which is reflected in rows where the totals do not sum to one.

图 6: 混淆矩阵。这些矩阵展示了分类的百分比,其中行代表任务类别(真实标签),列表示预测类别。一些样本被错误分类为“其他”,这反映在行总和不为 1 的情况中。

Analysis 2: Training tasks adaptation contribution In Figure 7, we show the normalized adaptive coefficients $\alpha_{k}$ interpolating between our SVF vectors learned via CEM for LLAMA3-8B-INSTRUCT and MISTRAL-7B-INSTRUCT-V0.3 across all the unseen downstream tasks. Intuitively, we find that the expert vectors from the training tasks sharing similar topics to the unseen ones are often the highest contributors to the produced adaptive weights. However, we observe that the MATH task appears as an interesting exception, as the $\alpha_{k}$ for the expert obtained from GSM8K training is actually the lowest out of the three in both models. We hypothesize this reflects the different nature of the mathematics competition problems from MATH as compared to the grade-school problems in GSM8K. In fact, not only is the difficulty of the MATH questions far beyond GSM8K, but a large portion of its problems also hinges mainly on logical reasoning, for which a task like ARC might actually be more aligned. Furthermore, we also note that the different $z$ vectors appear to contribute more uniformly to adaptation in the Llama model. This difference might indicate that, due to its higher base performance, the Llama model does not need to rely on any particular set of skills as much as Mistral, and can harness more holistic benefits from self-adaptation. Note that applying $\alpha_{k}$ uniformly is not a universal solution for leveraging expert vectors. This becomes evident when we look at different model and task combinations (e.g., applying $\alpha_{k}$ uniformly on LLAMA3-8B-INSTRUCT for MATH tasks only achieves 24.47, while Transformer 2 (Few-shot) achieves 25.47).

分析 2:训练任务适应贡献

在图 7 中,我们展示了归一化的自适应系数 $\alpha_{k}$,这些系数在所有未见过的下游任务中,对通过 CEM 学习的 LLAMA3-8B-INSTRUCT 和 MISTRAL-7B-INSTRUCT-V0.3 的 SVF 向量进行插值。直观上,我们发现来自训练任务的专家向量,如果与未见任务共享相似主题,通常对生成的自适应权重贡献最大。然而,我们观察到 MATH 任务是一个有趣的例外,因为在两个模型中,从 GSM8K 训练中获得的专家的 $\alpha_{k}$ 实际上是三个中最低的。我们假设这反映了 MATH 中的数学竞赛问题与 GSM8K 中的小学问题的不同性质。事实上,MATH 问题的难度远远超过 GSM8K,而且其大部分问题主要依赖于逻辑推理,因此像 ARC 这样的任务可能更符合要求。此外,我们还注意到,不同的 $z$ 向量在 Llama 模型中的适应贡献似乎更加均匀。这种差异可能表明,由于其更高的基础性能,Llama 模型不需要像 Mistral 那样依赖任何特定的技能集,并且可以从自我适应中获得更全面的好处。需要注意的是,均匀应用 $\alpha_{k}$ 并不是利用专家向量的通用解决方案。当我们查看不同的模型和任务组合时,这一点变得显而易见(例如,在 LLAMA3-8B-INSTRUCT 上对 MATH 任务均匀应用 $\alpha_{k}$ 仅达到 24.47,而 Transformer 2 (少样本) 达到 25.47)。
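The difference between uniform and learned mixing can be illustrated with a toy interpolation of expert vectors. This sketch assumes that "normalized" means the coefficients are rescaled to sum to one, which this section does not state explicitly; the expert vectors and raw coefficients are made-up numbers, and the helper names are hypothetical.

```python
import numpy as np


def normalize(alpha):
    """Rescale coefficients to sum to one (assumed normalization)."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.sum()


def mix_experts(alpha, experts):
    """Weighted sum of K expert vectors: z' = sum_k alpha_k * z_k."""
    return np.tensordot(np.asarray(alpha, dtype=float), experts, axes=1)


# Three toy expert vectors (K = 3) of dimension 2; values are illustrative only.
experts = np.array([[1.0, 1.0], [2.0, 0.0], [0.0, 2.0]])

uniform_z = mix_experts(normalize([1.0, 1.0, 1.0]), experts)
learned_z = mix_experts(normalize([1.0, 1.0, 2.0]), experts)
```

With uniform coefficients every expert contributes equally, whereas the second mix leans toward the third expert, which is the kind of task-dependent weighting that the few-shot search is meant to discover.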

Analysis 3: Ablation studies

分析 3: 消融研究

Module sensitivity: We first compare the performance of SVF when it is applied to different modules (see trials 1-3). Under consistent conditions, both individual MLP and attention updates improve performance, with MLP updates resulting in more pronounced gains. Simultaneous updates to both module types yield even more significant enhancements.

模块敏感性:我们首先比较了 SVF 应用于不同模块时的性能(见试验 1-3)。在一致条件下,单独的 MLP 和注意力更新都能提升性能,其中 MLP 更新带来的增益更为显著。同时更新两种模块类型则能带来更大的性能提升。

Objective function: We are interested in the performance impact from different objective functions, and we compare the RL objective with next-token prediction loss (see trials 2 and 4). For the latter, we use instruction fine-tuning with official GSM8K solutions as target tokens. Results show clear performance gains with RL, demonstrating its effectiveness in task-specific fine-tuning. Conversely, next-token prediction even hinders performance. This highlights RL’s ability to handle cases lacking detailed solutions, suggesting its superiority in this context.

目标函数:我们关注不同目标函数对性能的影响,并将强化学习 (Reinforcement Learning, RL) 目标与下一个 Token 预测损失进行比较(参见试验 2 和 4)。对于后者,我们使用官方 GSM8K 解决方案作为目标 Token 进行指令微调。结果显示,强化学习带来了明显的性能提升,证明了其在任务特定微调中的有效性。相反,下一个 Token 预测甚至会阻碍性能。这突显了强化学习在处理缺乏详细解决方案的情况时的能力,表明其在此背景下的优越性。

SVF vs LoRA: Finally, we also evaluate LoRA using the RL objective (see trials 2 and 5). A significant performance disparity is observed, primarily attributed to the severe instability of the LoRA training process. Despite exploring a wide range of learning rates, LoRA’s performance consistently lagged behind. For further illustrations, see Figure 9 in the appendix.

SVF 与 LoRA 对比:最后,我们还使用 RL 目标评估了 LoRA(参见试验 2 和 5)。观察到显著的性能差异,主要归因于 LoRA 训练过程的严重不稳定性。尽管探索了广泛的学习率范围,LoRA 的表现始终落后。更多图示请参见附录中的图 9。

Analysis 4: Cross-model compatibility Finally, we explore the potential for our self-adaptation framework to be applied across different LLMs. In particular, we evaluate whether the SVF expert vectors trained on LLAMA3-8B-INSTRUCT can benefit MISTRAL-7B-INSTRUCT-V0.3, and whether we can perform adaptation across the expert vectors of these two models. We present our main findings in Table 5 and refer to Appendix B for additional detailed results. Surprisingly, we find that positive transfer occurs across the two models, with visible benefits in 2 out of 3 tasks. We note these improvements are due to the inherent ordering of the SVF parameterization, as randomly shuffling each SVF vector before applying it to the Mistral model consistently degrades performance.

分析 4: 跨模型兼容性
最后,我们探讨了自适应性框架在不同大语言模型之间应用的潜力。特别是,我们评估了在 LLAMA3-8B-INSTRUCT 上训练的 SVF 专家向量是否能够使 MISTRAL-7B-INSTRUCT-V0.3 受益,以及我们是否可以在两个模型的专家向量之间进行适应性调整。我们在表 5 中展示了主要发现,并在附录 B 中提供了更多详细结果。令人惊讶的是,我们发现这两个模型之间存在正向迁移,在 3 个任务中有 2 个任务表现出明显的收益。我们注意到这些改进是由于 SVF 参数化的固有顺序所致,因为在将每个 SVF 向量应用于 Mistral 模型之前随机打乱顺序会持续降低性能。
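The shuffling control can be sketched on a single toy matrix. The sketch assumes that an SVF vector $z$ rescales each singular value of a weight matrix by the matching entry of $z$, in line with the framework's description of adjusting only the singular components of the weights; the matrix, the vector, and the function name `apply_svf` are illustrative and not taken from the released code.

```python
import numpy as np


def apply_svf(weight: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Rescale the singular values of `weight` elementwise by `z` and rebuild the matrix."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return (u * (s * z)) @ vt  # equivalent to U @ diag(s * z) @ V^T


rng = np.random.default_rng(0)
weight = rng.normal(size=(6, 4))          # toy stand-in for one weight matrix
z_ordered = np.linspace(0.5, 1.5, num=4)  # toy expert vector, one scale per singular value
z_shuffled = z_ordered[::-1]              # same values, different ordering

adapted_ordered = apply_svf(weight, z_ordered)
adapted_shuffled = apply_svf(weight, z_shuffled)
```

Because each entry of $z$ is tied to a specific singular direction, permuting the entries yields a different adapted matrix even though the set of scales is unchanged, which is why the ordering matters in the shuffled baseline.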

Table 4: Ablation studies. We fine-tune LLAMA3-8B-INSTRUCT on the GSM8K training split with different settings and report the results on the test split along with zero-shot transfer results on MATH.

表 4: 消融研究。我们在 GSM8K 训练集上使用不同设置对 LLAMA3-8B-INSTRUCT 进行微调,并报告其在测试集上的结果以及在 MATH 上的零样本迁移结果。

| # | 方法 | 目标函数 | 模块 | 参数量 (↓) | GSM8K (↑) | MATH (↑) |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | LLAMA3-8B-INSTRUCT | | | | 75.89 (1.00) | 24.54 (1.00) |
| 1 | SVF | 策略梯度 | MLP | 0.39M | 78.62 (1.04) | 24.20 (0.99) |
| 2 | SVF | 策略梯度 | attention | 0.16M | 76.19 (1.00) | 24.20 (0.99) |
| 3 | SVF | 策略梯度 | MLP + attention | 0.58M | 79.23 (1.04) | 25.04 (1.04) |
| 4 | SVF | Next-token pred | attention | 0.16M | 60.50 (0.80) | 18.52 (0.75) |
| 5 | LoRA | 策略梯度 | attention | 6.82M | 57.92 (0.76) | 15.72 (0.64) |
| 6 | LoRA | Next-token pred | attention | 6.82M | 77.18 (0.98) | 24.12 (0.96) |
| 7 | LoRA | Next-token pred | MLP + attention | 35.13M | 75.66 (0.96) | 22.12 (0.91) |

Table 5: Cross-model $z$ vector transfer. Results from transferring the expert vectors trained on LLAMA3-8B-INSTRUCT to MISTRAL-7B-INSTRUCT-V0.3 with cross-model few-shot adaptation.

表 5: 跨模型 $z$ 向量迁移。将在 LLAMA3-8B-INSTRUCT 上训练的专家向量迁移到 MISTRAL-7B-INSTRUCT-V0.3 并进行跨模型少样本适配的结果。

| 方法 | MATH | Humaneval | ARC-Challenge |
| --- | --- | --- | --- |
| SVF 训练任务 | GSM8K | MBPP-pro | ARC-Easy |
| MISTRAL-7B-INSTRUCT-V0.3 | 13.02 (1.00) | 43.29 (1.00) | 71.76 (1.00) |
| + Llama SVF (ordered $\sigma$) | 11.96 (0.92) | 45.12 (1.04) | 72.01 (1.00) |
| + Llama SVF (shuffled $\sigma$) | 10.52 (0.81) | 40.24 (0.93) | 70.82 (0.99) |
| + 少样本适配 (跨模型) | 12.65 (0.97) | 46.75 (1.08) | 75.64 (1.05) |

This operation leads to notable performance degradation across each task. Finally, by performing few-shot adaptation using the SVF vectors collected from both models, the performance of MISTRAL-7B-INSTRUCT-V0.3 further improves across the board. We observe that these gains even surpass the best score from adapting MISTRAL-7B-INSTRUCT-V0.3 with all the SVF vectors in the ARC-Challenge task reported in Table 2. While these results appear promising, we note that the surprising compatibility discovered through our naive transfer approach is potentially tied to the similarity between the architectures of the two considered LLMs. To this end, whether similar transfer can be replicated with models of different scales remains an open research question that could open the doors to disentangling and recycling task-specific skills for newer/larger models, with important implications for democratization and sustainability.

此操作导致每个任务的性能显著下降。最后,通过使用从两个模型收集的 SVF 向量进行少样本适应,MISTRAL-7B-INSTRUCT-V0.3 的性能在各方面都得到了进一步提升。我们观察到,这些提升甚至超过了表 2 中报告的 ARC-Challenge 任务中使用所有 SVF 向量适应 MISTRAL-7B-INSTRUCT-V0.3 的最佳分数。虽然这些结果看起来很有希望,但我们注意到,通过我们简单的迁移方法发现的惊人兼容性可能与我们考虑的两个大语言模型的架构相似性有关。因此,是否可以在不同规模的模型上复制类似的迁移仍然是一个开放的研究问题,这可能为解构和回收任务特定技能以用于更新/更大的模型打开大门,对民主化和可持续性具有重要意义。


Figure 7: $\alpha_{k}$ learned weights.

图 7: $\alpha_{k}$ 学习到的权重。

5 CONCLUSION

5 结论

In this paper, we introduced Transformer 2, providing a novel blueprint toward realizing self-adaptive LLMs. Within this framework, we first proposed SVF, offering better performance than prior fine-tuning recipes, together with reduced costs, high compositionality, and overfitting regularization, all crucial properties to achieve scalable self-adaptation. Leveraging a set of SVF experts as building blocks, we developed three effective strategies for self-adaptation, each offering unique benefits and monotonic performance gains with increasing access to the test-time conditions.

在本文中,我们介绍了 Transformer 2,为实现自适应大语言模型提供了一个新颖的蓝图。在该框架中,我们首先提出了 SVF (Singular Value Fine-tuning,奇异值微调),它提供了比之前的微调方法更优越的性能,同时降低了成本,具有高组合性和过拟合正则化,这些都是实现可扩展自适应的关键特性。利用一组 SVF 专家作为构建模块,我们开发了三种有效的自适应策略,每种策略都提供了独特的优势,并且在增加对测试条件访问的情况下,性能单调提升。

While Transformer 2 demonstrates promising results, there remain exciting opportunities for future work. One limitation is that the capabilities of SVF experts are tied to the latent components of the base model. To address this, model merging offers a promising direction (Yu et al., 2024; Goddard et al., 2024; Akiba et al., 2024), enabling specialized models to be combined into a single, more capable model. Additionally, while our CEM-based adaptation effectively balances performance and efficiency, scaling to a large number of specialized domains may introduce increased one-time computational costs. However, this trade-off is offset by the benefits of improved performance and enhanced self-adaptation capabilities. Advances in model merging and efficient adaptation techniques have produced models dominating open leaderboards, making them strong candidates as base models for Transformer 2 and opening new possibilities for adaptive LLMs.

虽然Transformer 2展示了令人期待的结果,但未来仍有令人兴奋的研究机会。一个局限性在于,SVF专家的能力与基础模型的潜在组件紧密相关。为了解决这一问题,模型合并提供了一个有前景的方向(Yu et al., 2024; Goddard et al., 2024; Akiba et al., 2024),它允许将专门化的模型合并为一个更强大的单一模型。此外,尽管我们基于CEM的适应方法在性能和效率之间取得了有效平衡,但在扩展到大量专门领域时,可能会引入一次性计算成本的增加。然而,这种权衡被性能提升和增强的自适应能力所带来的好处所抵消。模型合并和高效适应技术的进步已经产生了在公开排行榜上占据主导地位的模型,使其成为Transformer 2基础模型的强有力候选者,并为自适应大语言模型开辟了新的可能性。

Yujin Tang initiated the project. Qi Sun proposed the prompt-based method, developed the evaluation framework, conducted the experiments, and provided contributions to writing. Edoardo Cetin designed the few-shot CEM adaptation strategy, performed the experiments, and made major contributions to manuscript writing. Yujin Tang proposed the core algorithm, conducted initial experiments, made major contributions to the manuscript, and managed the project.

Yujin Tang 启动了该项目。Qi Sun 提出了基于提示的方法,开发了评估框架,进行了实验,并为写作做出了贡献。Edoardo Cetin 设计了少样本 CEM 适应策略,进行了实验,并对文稿写作做出了主要贡献。Yujin Tang 提出了核心算法,进行了初步实验,对文稿做出了主要贡献,并管理了项目。

REFERENCES

参考文献

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, 等. Mixtral of experts. arXiv 预印本 arXiv:2401.04088, 2024.

Junmo Kang, Leonid Karlinsky, Hongyin Luo, Zhen Wang, Jacob Hansen, James Glass, David Cox, Rameswar Panda, Rogerio Feris, and Alan Ritter. Self-moe: Towards compositional large language models with self-specialized experts. arXiv preprint arXiv:2406.12034, 2024.

Junmo Kang, Leonid Karlinsky, Hongyin Luo, Zhen Wang, Jacob Hansen, James Glass, David Cox, Rameswar Panda, Rogerio Feris, 和 Alan Ritter. Self-moe: 面向自专家化组合大语言模型. arXiv 预印本 arXiv:2406.12034, 2024.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, 和 Dario Amodei. 神经语言模型的缩放定律. arXiv 预印本 arXiv:2001.08361, 2020.

Verena Klös, Thomas Göthel, and Sabine Glesner. Adaptive knowledge bases in self-adaptive system design. In 2015 41st Euromicro Conference on Software Engineering and Advanced Applications, pp. 472–478, 2015. doi: 10.1109/SEAA.2015.48.

Verena Klös, Thomas Göthel, 和 Sabine Glesner. 自适应系统设计中的自适应知识库. 在 2015 年第 41 届 Euromicro 软件工程与高级应用会议上, 第 472–478 页, 2015. doi: 10.1109/SEAA.2015.48.

Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki Markus Asano. Vera: Vector-based random matrix adaptation. arXiv preprint arXiv:2310.11454, 2023.

Dawid Jan Kopiczko, Tijmen Blankevoort, 和 Yuki Markus Asano. Vera: 基于向量的随机矩阵适应. arXiv 预印本 arXiv:2310.11454, 2023.

Vijay Lingam, Atula Tejaswi, Aditya Vavre, Aneesh Shetty, Gautham Krishna Gudur, Joydeep Ghosh, Alex Dimakis, Eunsol Choi, Aleksandar Bojchevski, and Sujay Sanghavi. Svft: Parameter-efficient fine-tuning with singular vectors. arXiv preprint arXiv:2405.19597, 2024.

Vijay Lingam、Atula Tejaswi、Aditya Vavre、Aneesh Shetty、Gautham Krishna Gudur、Joydeep Ghosh、Alex Dimakis、Eunsol Choi、Aleksandar Bojchevski 和 Sujay Sanghavi。Svft:使用奇异向量进行参数高效微调。arXiv 预印本 arXiv:2405.19597,2024。

Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022.

Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, 和 Colin A Raffel. 少样本参数高效微调比上下文学习更好且更便宜. 神经信息处理系统进展, 35:1950–1965, 2022.

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024.

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, 和 Min-Hung Chen. Dora: 权重分解的低秩适应. arXiv 预印本 arXiv:2402.09353, 2024.

Lasse S Loose, David Wisniewski, Marco Rusconi, Thomas Goschke, and John-Dylan Haynes. Switch-independent task representations in frontal and parietal cortex. Journal of Neuroscience, 37(33):8033–8042, 2017.

Lasse S Loose, David Wisniewski, Marco Rusconi, Thomas Goschke, 和 John-Dylan Haynes. 额叶和顶叶皮层中的任务表征与切换无关. Journal of Neuroscience, 37(33):8033–8042, 2017.

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pp. 3195–3204, 2019.

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, 和 Roozbeh Mottaghi. Ok-vqa: 一个需要外部知识的视觉问答基准. 在 IEEE/CVF 计算机视觉与模式识别会议论文集, 第 3195–3204 页, 2019.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 27730–27744, 2022.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, 等. 通过人类反馈训练语言模型以遵循指令. 神经信息处理系统进展, 35: 27730–27744, 2022.

Qwen Team. Qwen1.5-moe: Matching 7b model performance with 1/3 activated parameters, March 2024. URL https://qwenlm.github.io/blog/qwen-moe/. Blog post.

Qwen Team. Qwen1.5-moe: 以1/3激活参数匹配7b模型性能,2024年3月。URL https://qwenlm.github.io/blog/qwen-moe/。博客文章。

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In International conference on machine learning, pp. 18332–18346. PMLR, 2022.

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, 和 Yuxiong He. Deepspeed-moe: 推动专家混合模型的推理和训练,助力下一代 AI 规模扩展。发表于国际机器学习会议,第 18332–18346 页。PMLR, 2022.

Reuven Y Rubinstein and Dirk P Kroese. The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning, volume 133. Springer, 2004.

Reuven Y Rubinstein 和 Dirk P Kroese. 《交叉熵方法:组合优化、蒙特卡罗模拟和机器学习的统一方法》,第 133 卷。Springer, 2004.

Pratyusha Sharma, Jordan T Ash, and Dipendra Misra. The truth is in there: Improving reasoning in language models with layer-selective rank reduction. arXiv preprint arXiv:2312.13558, 2023.

Pratyusha Sharma, Jordan T Ash, 和 Dipendra Misra. 真相在其中:通过层选择性秩降低改进大语言模型中的推理能力. arXiv 预印本 arXiv:2312.13558, 2023.

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8317–8326, 2019.

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, 和 Marcus Rohrbach. 迈向能够阅读的 VQA 模型。在 IEEE/CVF 计算机视觉与模式识别会议论文集上,第 8317-8326 页,2019 年。

A IMPLEMENTATION DETAILS AND HYPER-PARAMETERS

A 实现细节与超参数

A.1 SVF TRAINING

A.1 SVF 训练

We obtain the expert vectors $z$ as the base components in Transformer 2 by training the SVF fine-tunes with a consistent recipe across the considered training tasks and language models. We divide each dataset to produce equal-sized training and validation splits. We then apply our RL-based approach, optimizing $\theta_{z}$ with AdamW using a learning rate of $2\times10^{-3}$ with cosine decay, a batch size of 256, and gradient clipping. We employ early stopping and select the best $\lambda$ (the coefficient of the KL divergence term) based on validation performance. For the LLAMA3-70B-INSTRUCT and Vision tasks experiments, we apply SVF on half of the layers to reduce memory usage while maintaining considerable performance improvement. During the training of LLAMA3-8B-INSTRUCT on the vision language tasks, we apply a small negative reward (-0.1) for training stability.

我们通过在所有考虑的训练任务和语言模型中使用一致的训练方法对 SVF 微调进行训练,获得了专家向量 $z$ 作为 Transformer 2 中的基础组件。我们将每个数据集划分为大小相等的训练集和验证集。然后,我们应用基于强化学习的方法,使用 AdamW 优化 $\theta_{z}$,学习率为 $2\times10^{-3}$,采用余弦衰减,批量大小为 256,并进行梯度裁剪。我们采用早停策略,并根据验证性能选择最佳的 $\lambda$(KL 散度项的系数)。在 LLAMA3-70B-INSTRUCT 和视觉任务的实验中,我们在半数层上应用 SVF 以减少内存使用,同时保持显著的性能提升。在 LLAMA3-8B-INSTRUCT 上进行视觉语言任务训练时,我们应用了一个小的负奖励 (-0.1) 以确保训练稳定性。
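Two ingredients of this recipe, the cosine learning-rate decay and gradient clipping, can be sketched in a framework-agnostic way. Only the base learning rate of $2\times10^{-3}$ comes from the text; the total step count, the minimum learning rate, the clipping threshold, and the function names are assumptions chosen for illustration, and the sketch is not the authors' training loop.

```python
import math
import numpy as np


def cosine_lr(step: int, total_steps: int, base_lr: float = 2e-3, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm: float):
    """Scale all gradients jointly so that their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]


# Hypothetical schedule length and clipping threshold.
schedule = [cosine_lr(step, total_steps=100) for step in (0, 50, 100)]
clipped = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)
```

In a real run these two pieces would wrap the optimizer update for $\theta_{z}$, with the learning rate queried once per step and the clipping applied to the policy-gradient estimate before the update.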

A.2 LORA TRAINING

A.2 LORA 训练

We follow community best practices for LoRA fine-tuning, applying it to the query and value projection layers with learning rates around $5\times10^{-5}$. We set 200 total iterations with a global batch size of 256 to ensure sufficient training. For feasible LoRA instruction training, we collect solutions for all training tasks (GSM8K, MBPP, ARC-Easy, TextVQA) from official sources and append them to the question prompts. Figure 8 shows a sample math problem used for LoRA fine-tuning. Despite extensive hyper-parameter tuning, we often observe test performance decay, as discussed, which can be attributed to the small number of training samples and to the model's potential requirements for instruction fine-tuning data (specifically, a highly detailed thinking process).

我们遵循社区对 LoRA 微调的最佳实践,将其应用于查询和值投影层,学习率约为 $5\times10^{-5}$。我们设置了 200 次总迭代,全局批量大小为 256,以确保充分的训练。为了进行可行的 LoRA 指令训练,我们从官方来源收集了所有训练任务(GSM8K、MBPP、ARC-Easy、TextVQA)的解决方案,并将其附加到问题提示中。图 8 展示了一个用于 LoRA 微调的数学问题示例。尽管进行了广泛的超参数调优,我们经常观察到测试性能下降(如前所述),这可以归因于训练样本数量较少以及模型对指令微调数据(特别是高度详细的思维过程)的潜在需求。

Figure 8: Sample problem and answer. Math data sample used for LoRA instruction fine-tuning; the text in blue is the unmasked solution.

图 8: 示例问题与答案。用于 LoRA 指令微调的数学数据样本,蓝色文本为未掩码的解答。

A.3 HYPER-PARAMETERS

A.3 超参数

We summarize the hyper-parameters used in our experiments in Table 6. To optimize performance, we swept several hyper-parameters and selected the most effective combination based on validation results. For SVF, we focused primarily on adjusting the KL coefficient to enhance training stability. For LoRA, we swept the learning rate and the maximum gradient clipping norm to identify the best settings.

我们在表 6 中总结了实验中使用的超参数。为了优化性能,我们对多个超参数进行了扫描,并根据验证结果选择了最有效的组合。对于 SVF,我们主要关注调整 KL 系数以增强训练稳定性。在 LoRA 的情况下,我们专注于扫描学习率和最大梯度裁剪范数,以确定最佳设置。

A.4 FEW-SHOT ADAPTATION

A.4 少样本适应

As described in the main text, our few-shot adaptation approach entails producing an entirely new $z^{\prime}=\sum_{k=1}^{K}\alpha_{k}z_{k}$ for each $W$ by linearly interpolating between the $K$ learned SVF vectors, each weighted by the coefficients $\alpha\in\mathbb{R}^{K}$. We employ CEM to search for the $\alpha_{k}$'s based on the performance on the few-shot prompts, which are specifically held out from the rest of the test prompts and used to obtain the elite set at each iteration. When multiple sample solutions obtain the same score on these held-out samples, we break ties by choosing the sample solution with the highest average log-likelihood across the tokens of its generated correct answers.

如正文所述,我们的少样本适应方法需要通过对学习到的 $K$ 个 SVF 向量进行线性插值,为每个 $W$ 生成一个全新的 $z^{\prime}=\sum_{k=1}^{K}\alpha_{k}z_{k}$,每个向量由系数 $\alpha\in\mathbb{R}^{K}$ 加权。我们使用 CEM 来搜索 $\alpha_{k}$,基于少样本提示的表现,这些提示专门从其他测试提示中保留出来,并用于在每次迭代中获取精英集。在多个样本解在这些保留样本上获得相同分数的情况下,我们通过选择在其生成正确答案的 token 上具有最高平均对数似然的样本解来打破平局。

In all of our main experiments, we reserve only 10 samples of data for self-adaptation and perform up to 100 CEM iterations. For each setting, we consider both per-layer and per-vector adaptation, where the latter strategy has the advantage of greatly simplifying search (as we only have 3 $\alpha$ coefficients). Moreover, we experiment with both normalizing across the $\alpha$ of different tasks (such that their sum would be fixed to 1) and keeping them unconstrained. Due to the lack of a validation set, we simply report the performance attained by our best sample from these test configurations at the end of optimization, on the remaining unseen samples for each task.

在我们的所有主要实验中,我们仅保留10个样本数据用于自适应,并执行最多100次CEM迭代。对于每种设置,我们考虑了逐层和逐向量的自适应,其中后一种策略具有大大简化搜索的优势(因为我们只有 3 个 $\alpha$ 系数)。此外,我们尝试了对不同任务的 $\alpha$ 进行归一化(使其总和固定为1)或保持它们不受约束。由于缺乏验证集,我们仅在优化结束时报告了在这些测试配置中最佳样本在每项任务的剩余未见样本上的表现。
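The search described above can be sketched as a small cross-entropy method loop. This is an illustrative sketch under stated assumptions rather than the authors' code: the population size, elite fraction, and initial mean and standard deviation are arbitrary choices, `cem_search` and `toy_score` are hypothetical names, and the toy score stands in for running the model on the held-out few-shot prompts. The score is a tuple `(task_score, avg_loglik)`, so that ties on the first entry are broken by the second, mirroring the log-likelihood tie-break.

```python
import numpy as np

def cem_search(score_fn, K, iters=100, pop=32, elite_frac=0.25,
               normalize=False, seed=0):
    """Search mixing coefficients alpha in R^K with the cross-entropy method."""
    rng = np.random.default_rng(seed)
    mean, std = np.full(K, 1.0 / K), np.full(K, 0.5)
    n_elite = max(1, int(pop * elite_frac))
    best_alpha, best_score = mean.copy(), score_fn(mean)
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, K))
        if normalize:
            # Optional variant: force each candidate's coefficients to sum to 1.
            total = samples.sum(axis=1, keepdims=True)
            total = np.where(np.abs(total) < 1e-8, 1e-8, total)
            samples = samples / total
        # Tuples sort lexicographically: task score first, then log-likelihood.
        scored = sorted(((score_fn(a), i) for i, a in enumerate(samples)),
                        reverse=True)
        elite = samples[[i for _, i in scored[:n_elite]]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
        if scored[0][0] > best_score:
            best_score = scored[0][0]
            best_alpha = samples[scored[0][1]].copy()
    return best_alpha, best_score

# Toy stand-in for few-shot evaluation: the best mix is a known target.
target = np.array([0.7, 0.2, 0.1])

def toy_score(alpha):
    return (-float(np.sum((alpha - target) ** 2)), 0.0)

Z = np.random.default_rng(1).normal(size=(3, 8))  # K=3 expert vectors z_k
alpha, score = cem_search(toy_score, K=3)
z_new = alpha @ Z                                  # z' = sum_k alpha_k z_k
```

In the per-vector setting, a single `alpha` of length $K$ is shared by every weight matrix; the per-layer setting would run the same search over a longer coefficient vector.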

Table 6: Hyper-parameters used for SVF and LoRA training. We perform a sweep on certain sensitive hyper-parameters across methods for fair comparison.

表 6: 用于 SVF 和 LoRA 训练的超参数。我们对某些敏感超参数进行了跨方法的扫描,以确保公平比较。

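| Method | Hyper-parameter | Value |
|---|---|---|
| SVF | Initial mean value of $z$ | 0.1 |
| SVF | Initial variance value of $z$ | $1\times10^{-3}$ |
| SVF | Global batch size | 256 |
| SVF | Learning rate | $2\times10^{-3}$ |
| SVF | Clip max norm | $1\times10^{-3}$ |
| SVF | KL coefficient $\lambda$ | 0.0, 0.1, 0.2, 0.3 |
| LoRA | Rank | 16 |
| LoRA | LoRA alpha | 32 |
| LoRA | LoRA dropout | 0.05 |
| LoRA | Global batch size | 256 |
| LoRA | Learning rate | $2\times10^{-4}$, $5\times10^{-4}$, $2\times10^{-5}$, $5\times10^{-5}$, $2\times10^{-6}$, $5\times10^{-6}$ |
| LoRA | Clip max norm | $1\times10^{-3}$, 1.0 |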

Table 7: Additional Comparison Experiment. Normalized scores are in the parentheses.

表 7: 额外对比实验。括号内为归一化分数。

| Method | GSM8K | MBPP-Pro | ARC-Easy |
|---|---|---|---|
| LLAMA3-8B-INSTRUCT | 75.89 (1.00) | 64.65 (1.00) | 88.59 (1.00) |
| + IA3 | 78.01 (1.03) | 67.68 (1.05) | 89.10 (1.01) |
| + DORA | 78.09 (1.03) | 64.65 (1.00) | 89.14 (1.01) |
| + SVF (Ours) | 79.15 (1.04) | 66.67 (1.03) | 89.56 (1.01) |

| Method | MATH | Humaneval | ARC-Challenge |
|---|---|---|---|
| LLAMA3-8B-INSTRUCT | 24.54 (1.00) | 60.98 (1.00) | 80.63 (1.00) |
| + IA3 | 23.64 (0.96) | 59.76 (0.98) | 81.57 (1.01) |
| + DORA | 24.44 (0.99) | 52.44 (0.86) | 81.14 (1.01) |
| + Transformer 2 (Prompt) | 25.22 (1.03) | 61.59 (1.01) | 81.74 (1.01) |
| + Transformer 2 (Cls-expert) | 25.18 (1.03) | 62.80 (1.03) | 81.37 (1.01) |
| + Transformer 2 (Few-shot) | 25.47 (1.04) | 62.99 (1.03) | 82.61 (1.02) |

B ADDITIONAL RESULTS

B 附加结果

B.1 BASELINE COMPARISON TO MORE PEFT METHODS

B.1 与更多PEFT方法的基线比较

We conduct additional comparison studies against more parameter-efficient fine-tuning methods, including $\mathrm{IA}^{3}$ (Liu et al., 2022) and DORA (Liu et al., 2024).

我们进行了与更多参数高效微调方法的额外比较研究,包括 IA3 (Liu et al., 2022) 和 DORA (Liu et al., 2024)。

As Table 7 shows, SVF still outperforms other methods and shows promising generalized performance.

如表 7 所示,SVF 仍然优于其他方法,并显示出良好的泛化性能。
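The normalized scores in parentheses in Table 7 appear to be each method's score divided by the base model's score on the same task, rounded to two decimals. This reading is an assumption inferred from the reported numbers, and the helper name `normalized` is hypothetical:

```python
def normalized(score, base):
    """Score relative to the base model, rounded to two decimals."""
    return round(score / base, 2)

# Values taken from Table 7 (base model: LLAMA3-8B-INSTRUCT).
ia3_gsm8k = normalized(78.01, 75.89)       # IA3 on GSM8K
dora_humaneval = normalized(52.44, 60.98)  # DORA on Humaneval
```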

B.2 IMPACT FROM NUMBER OF FEW-SHOTS

B.2 少样本数量的影响

We investigate the relationship between the number of samples available for few-shot adaptation and downstream performance. Our analysis focuses on the test task where LLAMA3-8B-INSTRUCT demonstrates the highest baseline performance, to prevent the potential for a null signal in our CEM-based search.

我们研究了少样本适应可用样本数量与下游性能之间的关系。我们的分析集中在测试任务上,其中 LLAMA3-8B-INSTRUCT 展示了最高的基线性能,以防止基于 CEM 的搜索中出现无效信号的可能性。

Table 8: Few-shot adaptation scaling on the ARC-Challenge task. Performance varies with the number of examples.

表 8: ArcChallenge 任务上的少样本适应扩展。性能随示例数量的变化而变化。

| Method | Transformer 2 | $\mathrm{IA}^{3}$ (100 steps) | $\mathrm{IA}^{3}$ (1000 steps) |
|---|---|---|---|
| LLAMA3-8B-INSTRUCT | 80.63 (1.00) | 80.63 (1.00) | 80.63 (1.00) |
| + 3-shot adaptation | 82.18 (1.02) | 81.83 (1.01) | 79.01 (0.98) |
| + 5-shot adaptation | 82.38 (1.02) | 80.89 (1.00) | 79.41 (0.98) |
| + 10-shot adaptation | 82.61 (1.02) | 82.00 (1.02) | 79.78 (0.99) |
| + 20-shot adaptation | 82.61 (1.02) | 81.40 (1.01) | 79.61 (0.99) |

As Table 8 shows, substantial benefits of our few-shot strategy are evident with as few as 3 to 5 test samples. Moreover, performance appears to plateau beyond 10 samples, underscoring how our essential and inherently regularized SVF parameterization effectively complements self-adaptation. This efficiency enables optimal use of data to enhance understanding of the test task.

如表 8 所示,我们的少样本策略在仅有 3 到 5 个测试样本时就能显示出显著的收益。此外,性能在超过 10 个样本后趋于平稳,这突显了我们的本质且固有正则化的 SVF 参数化如何有效地补充了自适应。这种效率使得我们能够充分利用数据来增强对测试任务的理解。

For completeness, we have also conducted experiments with identical settings on $\mathrm{{IA}^{3}}$ (Liu et al., 2022), another method that leverages few-shot examples. All experiments were conducted with full batch size, a learning rate of $5\times10^{-5}$ , with 100 and 1000 training steps.

为了完整性,我们还在 $\mathrm{{IA}^{3}}$ (Liu et al., 2022) 上进行了相同设置的实验,这是一种利用少样本示例的方法。所有实验均在全批量大小下进行,学习率为 $5\times10^{-5}$,训练步数为 100 和 1000。

Our results indicate that the performance of $\mathrm{IA}^{3}$ on the unseen test tasks is inferior to CEM-based adaptation for all numbers of few shots considered. We note that in our experiment, we have to considerably limit the number of optimization steps to avoid overfitting the 500,000 parameters of $\mathrm{IA}^{3}$ on the few-shot samples. However, we believe overfitting might still be occurring to some degree even after only 100 steps, as also validated by the model's perfect training accuracy on this extremely small dataset. This limitation of fine-tuning-based adaptation highlights the superior generalization capability of our CEM-based adaptation approach in Transformer 2.

我们的结果表明,在所有考虑的少样本数量下,$\mathrm{{IA}^{3}}$ 在未见过的测试任务上的表现均不如基于 CEM 的适应方法。我们注意到,在实验中,为了避免 $\mathrm{{IA}^{3}}$ 的 500,000 个参数在少样本上过拟合,我们不得不大幅限制优化步骤的数量。然而,我们认为即使在仅进行 100 步优化后,过拟合可能仍然在一定程度上发生,这一点也得到了模型在这个极小数据集上完美训练准确率的验证。这种基于微调的适应方法的局限性,凸显了我们在 Transformer 2 中基于 CEM 的适应方法在泛化能力上的优越性。

B.3 CROSS-MODEL SVF TRANSFER ON THE TRAINING TASKS

B.3 跨模型 SVF 在训练任务上的迁移

We provide complementary results to Table 5 in the main text, where we analyze the SVF cross-model transfer performance from training on GSM8K, MBPP-pro, and ARC-Easy to our considered test tasks. In Table 9, we show the results in the same transfer setting, this time evaluating MISTRAL-7B-INSTRUCT-V0.3 on the same training tasks from which the LLAMA3-8B-INSTRUCT SVF vectors were obtained. Overall, we recognize a similar trend, albeit with less consistent improvement over the original model (only in 1 out of 3 tasks), but still much higher performance than the randomly shuffled baseline. These results further confirm that the canonical ordering of the SVF parameterization is key for cross-model transfer, highlighting once more its inherent suitability to empower self-adaptation.

我们在正文的表 5 中提供了补充结果,分析了从 GSM8K、MBPP-pro 和 ARC-Easy 训练到我们考虑的测试任务的 SVF 跨模型迁移性能。在表 9 中,我们展示了相同的迁移设置下的结果,这次评估的是 MISTRAL-7B-INSTRUCT-V0.3 在 LLAMA3-8B-INSTRUCT SVF 向量获得的相同训练任务上的表现。总体而言,我们观察到了类似的趋势,尽管原始模型的改进不太一致(仅在 3 个任务中的 1 个任务中),但仍然比随机打乱的基线表现要好得多。这些结果进一步证实了 SVF 参数化的规范顺序对于跨模型迁移至关重要,再次凸显了其内在的自适应能力。
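The ordered-versus-shuffled comparison can be illustrated with a short NumPy sketch. It assumes that the source and target matrices have the same shape, so that a $z$ vector trained on one model can be applied index by index to the singular values of the other; the function name `apply_transferred_z` and the random stand-in matrices are hypothetical, not taken from the paper.

```python
import numpy as np

def apply_transferred_z(W_target, z_source, shuffle=False, seed=0):
    """Apply a z vector learned on another model to W_target's singular values."""
    U, sigma, Vt = np.linalg.svd(W_target, full_matrices=False)
    z = np.asarray(z_source, dtype=float)
    if shuffle:
        # Control: destroy the alignment between z entries and singular-value rank.
        z = np.random.default_rng(seed).permutation(z)
    return (U * (sigma * z)) @ Vt

rng = np.random.default_rng(0)
W_target = rng.normal(size=(32, 16))   # stand-in for a target-model weight matrix
z_source = np.linspace(0.5, 1.5, 16)   # stand-in for a source-model expert vector

W_ordered = apply_transferred_z(W_target, z_source)
W_shuffled = apply_transferred_z(W_target, z_source, shuffle=True)
```

In the ordered case, entry $i$ of $z$ scales the $i$-th largest singular value of the target matrix, which is the alignment the shuffled baseline removes.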

Table 9: Cross-model $z$ vector transfer. Results from transferring the SVF expert vectors trained on LLAMA3-8B-INSTRUCT to MISTRAL-7B-INSTRUCT-V0.3 in the respective training tasks.

表 9: 跨模型 $z$ 向量迁移。将基于 LLAMA3-8B-INSTRUCT 训练的 SVF 专家向量迁移到 MISTRAL-7B-INSTRUCT-V0.3 在相应训练任务中的结果。

| Method | GSM8K | MBPP-pro | ARC-Easy |
|---|---|---|---|
| MISTRAL-7B-INSTRUCT-V0.3 | 42.83 (1.00) | 49.50 (1.00) | 81.65 (1.00) |
| + Llama SVF (ordered $\sigma_{i}$) | 42.61 (0.99) | 48.48 (0.98) | 81.78 (1.00) |
| + Llama SVF (shuffled $\sigma_{i}$) | 41.93 (0.98) | 46.34 (0.94) | 80.81 (0.99) |

B.4 TRAINING CURVE OF LORA AND POLICY GRADIENT

B.4 LoRA 和策略梯度的训练曲线

Figure 9 gives the learning curves for LoRA training on the GSM8K task.

图 9: GSM8K 任务上 LoRA 训练的学习曲线。


Figure 9: Training LoRA with policy gradient. The dashed line shows the performance of LLAMA3-8B-INSTRUCT on the test split. LoRA collapses at the beginning of the training stage and fails to recover, leading to negative effects on test performance. We swept a wide range of learning rates $(2\times10^{-4},5\times10^{-4},\ldots,2\times10^{-2},5\times10^{-2})$, and all learning curves were similar to the one presented.

图 9: 使用策略梯度训练 LoRA。虚线显示了 LLAMA3-8B-INSTRUCT 在测试集上的表现。LoRA 在训练阶段初期崩溃且未能恢复,导致测试性能受到负面影响。我们尝试了广泛的学习率范围 $(2\times10^{-4},5\times10^{-4},\ldots,2\times10^{-2},5\times10^{-2})$,所有学习曲线都与图中展示的相似。

C PCA ON LLAMA3 AND MISTRAL

C PCA 在 LLAMA3 和 MISTRAL 上的应用

To investigate whether the singular components with the highest singular values capture most of the information in a weight matrix, we conducted Principal Component Analysis (PCA) on the weight matrices in LLAMA3-8B-INSTRUCT and MISTRAL-7B-INSTRUCT-V0.3 (see Figures 10 and 11). In each figure, we plot the variance captured by the top $r$ components across all layers for each module type, for a weight matrix $W\in\mathbb{R}^{m\times n}$:

为了研究具有最高奇异值的奇异成分是否能够捕捉权重矩阵的大部分信息,我们对 LLAMA3-8B-INSTRUCT 和 MISTRAL-7B-INSTRUCT-V0.3 中的权重矩阵进行了主成分分析 (PCA) (见图 10 和图 11)。在每个图中,我们绘制了对于权重矩阵 $W\in\mathbb{R}^{m\times n}$,每种模块类型的所有层中前 $r$ 个成分所捕捉的方差:

$$\mathrm{ratio}=\frac{\sum_{i=1}^{r}\sigma_{i}}{\sum_{j=1}^{\min(m,n)}\sigma_{j}}$$

Here, the $\sigma$'s are the singular values of the weight matrix $W$, ordered from largest to smallest. It is easy to see from the figures that when $r=256$, less than $50\%$ of the variance is captured by these top components on average. For the MLP layers, this fraction is even lower than $20\%$. On the other hand, the ranks adopted by LoRA-XS or similar methods are much smaller than 256, resulting in even more information loss and restricting their modeling power, which relies mostly on these $r$ components.

这里,$\sigma$ 是权重矩阵 $W$ 上按从大到小排序的奇异值。从图中可以很容易看出,当 $r=256$ 时,这些顶部成分平均捕获的方差不到 $50\%$。对于 MLP 层,这一比例甚至低于 $20\%$。另一方面,LoRA-XS 或类似方法采用的秩远小于 256,导致更多的信息丢失,并且其建模能力主要依赖于这些 $r$ 成分,因此受到更多限制。
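The ratio above is straightforward to compute for any matrix. The sketch below uses NumPy on small stand-in matrices rather than actual model weights; `top_r_ratio` is a hypothetical helper name. It follows the formula as written, summing singular values directly, whereas the conventional PCA explained-variance ratio would sum their squares.

```python
import numpy as np

def top_r_ratio(W, r):
    """Share of the singular-value mass held by the top-r singular values of W."""
    sigma = np.linalg.svd(W, compute_uv=False)  # returned in descending order
    return sigma[:r].sum() / sigma.sum()

# An 8x8 identity has eight equal singular values, so the top 2 hold 2/8.
ratio_identity = top_r_ratio(np.eye(8), 2)

# A rank-1 matrix concentrates all of its mass in the first component.
u = np.arange(1.0, 6.0)
ratio_rank1 = top_r_ratio(np.outer(u, u), 1)
```

A flat spectrum, as in the identity example, is the regime where truncating to a small $r$ discards most of the matrix.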

Figure 10: PCA of LLAMA3-8B-INSTRUCT. We show the ratio of the variance captured by the top $r$ singular components on the y-axis, and the layer indices on the x-axis. Except for the Query, Key and Value projection matrices, small $r$ values only capture a tiny fraction of variance in singular values in the parameter matrices.

图 10: LLAMA3-8B-INSTRUCT 的 PCA。我们在 y 轴上展示了由前 $r$ 个奇异成分捕获的方差比例,在 x 轴上展示了层索引。除了 Query、Key 和 Value 投影矩阵外,较小的 $r$ 值仅捕获了参数矩阵中奇异值的极小部分方差。

D EFFICIENCY CONSIDERATIONS AND IMPROVEMENTS

D 效率考虑与改进

Our CEM-based adaptation method involves running inference on a small number of samples for each target task (up to 10 in our experiments). In a typical configuration, this process is relatively efficient: for example, our CEM-light approach (3-shot with 10 generations) completes the ARC-Challenge task in approximately 11 minutes. As shown in Table 10, this lighter setup reduces the total number of samples to just $3\%$ of the original setting while still delivering substantial performance improvements over the base model.

我们基于 CEM 的适应方法涉及对每个目标任务的一小部分样本进行推理(在我们的实验中最多为 10 个)。在典型配置中,这个过程相对高效:例如,我们的 CEM-light 方法(3-shot 和 10 次生成)在大约 11 分钟内完成了 ARC-Challenge 任务。如表 10 所示,这种更轻量级的设置将总样本数减少到原始设置的仅 $3\%$,同时仍然在基础模型上实现了显著的性能提升。

Table 10: 3-shot and light variants. Performance with different inference-time adaptation budgets.

表 10: 不同推理时间适应预算下的 3-shot 和轻量变体性能

| Method | ARC-Challenge |
|---|---|
| LLAMA3-8B-INSTRUCT | 80.63 (1.00) |
| + CEM 10-shot adaptation | 82.61 (1.02) |
| + CEM 3-shot (30% of prompts) | 82.18 (1.02) |
| + CEM light (3% of prompts) | 82.08 (1.02) |


Figure 11: PCA of MISTRAL-7B-INSTRUCT-V0.3. We show the ratio of the variance captured by the top $r$ singular components on the y-axis, and the layer indices on the x-axis. Except for the Query, Key and Value projection matrices, small $r$ values only capture a tiny fraction of variance in singular values in the parameter matrices.

图 11: MISTRAL-7B-INSTRUCT-V0.3 的 PCA 分析。我们在 y 轴上展示了前 $r$ 个奇异成分捕获的方差比例,x 轴上展示了层索引。除了 Query、Key 和 Value 投影矩阵外,较小的 $r$ 值仅捕获参数矩阵中奇异值的极小部分方差。

We acknowledge that CEM-based adaptation entails a trade-off between performance and the overhead spent searching for the optimal combination weights of the SVF-tuned vectors. Increasing the number of few-shot samples or the number of generations can yield higher performance, but at the cost of additional computation. However, it is important to note that this adaptation cost is a one-time overhead per task, so the cost per prompt diminishes significantly for tasks with a large number of prompts.

我们承认,基于 CEM 的适应需要在一次性开销和性能之间进行权衡,这种开销用于搜索 SVF-tune 向量的最优组合权重。增加少样本的数量或生成次数可以提高性能,但这会带来额外的计算开销。然而,需要注意的是,这种适应成本是每个任务的一次性开销。当应用于具有大量提示的任务时,每次提示的成本会显著降低。

Moreover, in practical scenarios, CEM-based adaptation offers better scalability than few-shot prompting methods, which require increasing the length of every prompt and therefore scale much worse as task sizes grow. In contrast, our method focuses on determining optimal expert vector combinations efficiently and avoids repetitive inference-time costs. However, we note that the overhead might be significant for tasks with very few prompts. Thus, other adaptation methods might be more appropriate in these particular settings.

此外,在实际场景中,基于 CEM 的适应方法比少样本提示方法具有更好的可扩展性,后者需要增加每个提示的长度,导致随着任务规模的增大,扩展性显著下降。相比之下,我们的方法专注于高效地确定最佳专家向量组合,并避免了重复的推理时间成本。然而,我们注意到,对于提示非常少的任务,开销可能会很大。因此,其他适应方法可能更适合这些特定场景。
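The amortization argument can be made concrete with a small back-of-the-envelope calculation. Only the 11-minute figure comes from the text above; the prompt counts, shot count, and tokens per example are hypothetical values chosen for illustration, and the two helper names are not from the paper.

```python
def cem_cost_per_prompt(overhead_minutes, num_prompts):
    """One-time search overhead amortized evenly over all prompts of a task."""
    return overhead_minutes / num_prompts

def few_shot_extra_tokens(num_prompts, shots, tokens_per_example):
    """Extra context tokens processed when every prompt carries the examples."""
    return num_prompts * shots * tokens_per_example

# Hypothetical task sizes: the amortized CEM cost shrinks with more prompts...
small_task = cem_cost_per_prompt(11, 10)
large_task = cem_cost_per_prompt(11, 1100)

# ...while in-context few-shot prompting pays for its examples on every prompt.
extra_tokens = few_shot_extra_tokens(num_prompts=1100, shots=3,
                                     tokens_per_example=200)
```

With only 10 prompts, the search overhead in this example exceeds a minute per prompt, which is the regime where the text suggests other adaptation methods may be preferable.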

We also highlight two immediate directions for improving efficiency:

我们还强调了两个提高效率的即时方向:

Finally, in this work we only considered CEM due to its simplicity. Several other evolutionary algorithms have empirically shown better efficiency and convergence properties, and we hope they will be explored in future research.

最后,在本工作中,我们仅考虑了 CEM(交叉熵方法)因其简单性,实际上存在多种不同的进化算法,经验表明它们具有更好的效率和收敛性,我们希望这些算法能在未来的研究中得到探索。
