Accelerating Large Language Models through Partially Linear Feed-Forward Network
Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University {1hugansen, 2zhaoguowang, 3weijinglin, 5haibochen}@sjtu.edu.cn, 4boogiepop021230@gmail.com
Abstract
Large language models (LLMs) demonstrate remarkable capabilities but face deployment challenges due to their massive parameter counts. While existing compression techniques like pruning can reduce model size, they lead to significant accuracy degradation under high compression ratios. We present a novel perspective inspired by constant folding in compiler optimization. Our approach enables parameter reduction by treating activation functions in LLMs as linear functions.
However, recent LLMs use complex non-linear activations like GELU that prevent direct application of this technique. We propose TARDIS, which enables optimization of LLMs with non-linear activations by partially approximating them with linear functions in frequently occurring input ranges. For outlier inputs, TARDIS employs an online predictor to dynamically fall back to the original computations.
Our experiments demonstrate that TARDIS achieves $80\%$ parameter reduction in feed-forward networks, while significantly outperforming state-of-the-art pruning methods Wanda and RIA with up to $65\%$ higher accuracy. In practical deployments for a 7B model, TARDIS achieves $1.6\times$ end-to-end inference speedup when integrated with the vLLM serving system, and $1.4\times$ speedup with the widely adopted HuggingFace implementation, while incurring only a $10.9\%$ accuracy trade-off.
1 Introduction
Recently, transformer-based large language models (LLMs) have revolutionized natural language processing, demonstrating remarkable capabilities across diverse tasks. Deploying these models for inference is notoriously challenging due to their enormous number of parameters [58,60,69], often reaching billions or even trillions [18]. To address this, various techniques have been proposed to compress LLMs, including pruning [43,53,68], quantization [28,41,63], knowledge distillation [31,55], and low-rank approximation [22,39,50]. While pruning has gained popularity due to its simplicity [62], it often leads to significant accuracy degradation under high compression ratios: an $80\%$ compression can result in up to $70\%$ accuracy loss.
Inspired by the success of constant folding [1] in traditional compiler optimizations, we propose a novel approach to LLM compression that maintains high accuracy even under aggressive compression ratios. Constant folding, a fundamental compiler optimization technique that reduces both program code size and runtime computation costs, has been successfully employed across numerous programming languages including $C/C++$ [4], Python [5], and SQL [9]. We discover a unique opportunity to adapt this technique for LLM compression.
The architecture of an LLM primarily consists of two core components: the Multi-Head Attention (MHA) block and the feed-forward network (FFN) block, with the FFN block accounting for $67\%$ to $80\%$ [19] of the total parameters. The FFN block comprises two matrix multiplications separated by a non-linear activation function (e.g., ReLU or GELU).
Our key observation is that through linear approximation of the non-linear activation function, we can leverage the associative property of matrix multiplication to merge the two FFN matrices into a single, compact matrix. This transformation enables a reduction in FFN parameters by up to $87.5\%$ in theory (detailed in Section 3.1).
However, modern LLMs predominantly use sophisticated non-linear activation functions, particularly GELU [33] (adopted in GPT-series models [21]) and SiLU [27] (used in LLaMA2 [57]). These complex functions are crucial for capturing intricate patterns in the data, and naively approximating them with linear functions leads to severe performance degradation. Our experiments with a 7B model show that a simple linear approximation of GELU results in a $75\%$ drop in model accuracy.
In this paper, we present TARDIS, a novel approach that enables constant folding in LLMs with non-linear activation functions at low approximation error. Our key insight is that the inputs to these activation functions in LLMs are typically concentrated within a narrow range. This distribution pattern allows us to partially approximate the non-linear activation function with a linear function in the "hot" input ranges (where most values occur), introducing minimal approximation error. For outlier inputs that fall outside these ranges, TARDIS employs an efficient online predictor to identify them and dynamically falls back to the original activation function computation. Extensive experiments demonstrate that TARDIS reduces the FFN weight matrices by up to $80\%$, while significantly outperforming state-of-the-art pruning methods in accuracy. Compared to leading pruning methods like Wanda and RIA, TARDIS achieves up to $65\%$ higher accuracy on downstream tasks. For inference, TARDIS delivers up to $1.6\times$ end-to-end speedup on the state-of-the-art vLLM serving system [37], and $1.4\times$ speedup on HuggingFace's widely-used implementation [10].

Figure 1: The inference process of a transformer-based LLM (a), where initial tokens are the user prompt and output tokens are the LLM's response; the theoretical breakdown of inference time on an RTX 4090 (b), with 91 initial tokens and 178 output tokens for Falcon-7B (other operations take $<0.5\%$ of the time and are not shown).
The key contributions of this paper are:
2 Background and Motivation
In this section, we first cover the basics of LLM inference and highlight the performance bottlenecks introduced by massive parameters. We then review existing approaches to reducing LLM size and identify their constraints.
2.1 Large Language Model Inference
LLM Inference Overview. LLM inference generates output tokens in response to input tokens through two main phases. In the first phase, as shown in Figure 1a, the model processes all initial input tokens through its $N$ layers sequentially to establish context. In the next phase, the model generates output tokens one at a time in an auto-regressive manner, where each generated token becomes part of the input sequence for predicting the next token. This process continues until either the desired output length is reached or a stop condition is met.

Operations in LLM Inference. The transformer architecture, which forms the backbone of LLMs, consists of two main components: the multi-head attention (MHA) block and the feed-forward network (FFN) block. The MHA block enables the model to focus on different parts of the input sequence simultaneously through multiple attention heads, while the FFN block, composed of two linear transformations with an activation function in between, processes the features from the attention block. As Figure 1a shows, an LLM typically stacks multiple transformer layers, where each layer sequentially processes the input through its MHA and FFN blocks. During inference, the model parameters of both blocks must be loaded from memory before performing computations on the input tokens.
2.2 Cost of LLM Inference
Despite their remarkable capabilities, deploying LLMs for inference tasks remains challenging due to their massive parameter counts [41, 42, 58, 60]. This large parameter count significantly impacts inference performance.
To quantify this impact, we analyzed the inference time breakdown of Falcon-7B using the ShareGPT dataset [14], which contains 90K real-world chat histories between humans and LLMs. $80\%$ of Falcon-7B's total parameters are in its FFN blocks. Using the dataset's average lengths (91 tokens for human input, 178 tokens for LLM response), we measured the inference time with the vLLM inference engine [7] on an RTX 4090, where the attention blocks take 0.9s and the FFN blocks take 2.1s, showing that the bottleneck is in the FFN blocks. We also performed a theoretical analysis of the time spent on computation and I/O operations (performed on the 1TB/s bandwidth VRAM of the RTX 4090) in both MHA and FFN blocks. Figure 1b shows that parameter-loading I/O dominates the inference time, with the FFN I/O alone consuming $78.2\%$ of the total time. This high I/O overhead stems from LLMs' auto-regressive nature: generating each token requires loading all model parameters into the hardware's cache (Figure 1a), resulting in repeated, expensive I/O operations [58,60,69].
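As a rough sanity check of this breakdown (assuming FP16 weights, i.e., 2 bytes per parameter; these constants are our own back-of-envelope assumptions, not reported by the measurement itself), the per-token FFN loading cost is approximately

$$0.8 \times 7.2\,\text{B params} \times 2\,\text{bytes} \approx 11.5\,\text{GB}, \qquad \frac{11.5\,\text{GB}}{1\,\text{TB/s}} \approx 11.5\,\text{ms per token},$$

which over 178 output tokens amounts to roughly 2.0s, consistent with the 2.1s measured for the FFN blocks.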
2.3 LLM Compression
To reduce the I/O overhead during LLM inference, researchers have developed various compression techniques.
Figure 2: Accuracy of different pruning methods under different compression ratios of the FFN block. Existing baselines suffer pronounced accuracy loss at high compression ratios.
Figure 3: Illustration of constant folding in the FFN block of an LLM. The FFN contains two weight matrices $W_{1}$, $W_{2}$. By treating the activation function $\upsigma$ as a linear function $ax+b$, matrix multiplications can be reordered from $a(x W_{1})W_{2}$ to $x\left(a W_{1}W_{2}\right)$. Since $a W_{1}W_{2}$ consists of only constants, it can be pre-computed into a single matrix $C$ before inference.
These techniques can be broadly categorized into four approaches: pruning [42,43,52,53,68], quantization [28,34,41,59], knowledge distillation [31,55], and low-rank approximation [22,39,50]. These approaches adopt different ideas and usually excel in different settings [36,62]. Among them, pruning has emerged as one of the most popular techniques due to its conceptual simplicity [62]. Pruning works by identifying and removing less important parameters based on certain criteria, such as magnitude or gradient information. Users can specify a target compression ratio, and the selected parameters are set to zero, effectively reducing both storage requirements and computational costs.
While simple, pruning suffers from significant accuracy loss under high compression ratios. We evaluated two state-of-the-art pruning methods, Wanda [53] and RIA [42], on Falcon-7B using three benchmarks: PIQA [20], Lambada [12], and ARC-Challenge [16]. Figure 2 shows that at a $50\%$ compression ratio, the accuracy drop remains modest at $0.9\%$-$4.8\%$ across all tasks. However, when increasing the compression to $80\%$, the accuracy can drop to even zero on the Lambada dataset, making the model practically unusable.
Figure 4: The SiLU and GELU activation functions over the input range $[-3,\ 2]$.
3 Opportunity and Challenge
Inspired by the success of constant folding in compiler optimizations, we discover a novel opportunity to reduce I/O costs by simplifying FFN blocks in LLMs through activation function modification. Our approach achieves higher accuracy under aggressive compression compared to state-of-the-art pruning methods. Rather than directly manipulating model parameters like existing pruning techniques, we focus on simplifying the activation functions in FFN blocks, which enables parameter reduction through constant folding.
3.1 Opportunity
A key insight for simplifying FFN blocks in LLMs is that their activation functions can be approximated as linear functions. The FFN block in transformer models consists of two weight matrices $W_{1}$, $W_{2}$, and an activation function between them. As shown in Figure 3, a simple FFN block with activation computes:
$$FFN(x)=\sigma(x W_{1})W_{2}$$
where matrix $W_{1}$ is of size $d\times h$, $W_{2}$ is $h\times d$, and $\upsigma$ is the activation function. A function $f(x)$ is linear if and only if it can be expressed as $f(x)=ax+b$, where $a$ and $b$ are constants. When the activation function is approximated by a linear function, the FFN becomes the following linear transformation:
$$FFN(x)=(a(x W_{1})+b)W_{2}$$
This linear representation of the activation function enables us to leverage matrix computation properties to optimize the FFN block. Specifically, we can reorder the computation from $a(x W_{1})W_{2}$ to $x\left(a W_{1}W_{2}\right)$ using matrix associativity. This reordering would not be possible with non-linear activation functions. The key advantage is that $a W_{1}W_{2}$ contains only constants and can be pre-computed into a single matrix $C$ before inference. As illustrated in Figure 3, with input $[x_{1},x_{2}]$, the FFN computes $t=[a(-2x_{1}+3x_{2}),\,a(x_{1}+2x_{2})]$. By precomputing $a W_{1}W_{2}$ into matrix $C=\begin{pmatrix}-2a & a\\ 3a & 2a\end{pmatrix}$, we can simplify the entire FFN into a single matrix multiplication that produces the same result $t$. This transformation significantly reduces the parameter count in FFN blocks. Originally, an FFN block with weight matrices $W_{1}\in\mathbb{R}^{d\times h}$ and $W_{2}\in\mathbb{R}^{h\times d}$ contains $2\times d\times h$ parameters. After folding into a single matrix $C\in\mathbb{R}^{d\times d}$, it contains only $d^{2}$ parameters. For typical LLM architectures like GPT2 [3], BLOOM [38], and Falcon [19] where $h=4d$, this reduces the parameter count by up to $87.5\%$ per FFN block.

Figure 5: Density estimation of the activation function input distribution for 50 neurons of layers 10 and 30 in Falcon-7B. Each curve represents the kernel density estimation of the input distribution for a single neuron. The higher the density, the more activation inputs fall into the range. The left two figures are profiled on WikiText2, the middle two on C4, and the right two on PTB.
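To make the folding in Section 3.1 concrete, the following minimal NumPy sketch (illustrative shapes and variable names only, not the TARDIS implementation) verifies that an FFN whose activation is the linear function $az+b$ collapses into a single $d\times d$ matrix plus a bias:

```python
import numpy as np

d, h = 4, 16                       # hidden size d and FFN inner size h = 4d
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d, h))   # first FFN weight matrix
W2 = rng.standard_normal((h, d))   # second FFN weight matrix
a, b = 0.25, 0.1                   # linear stand-in for the activation: sigma(z) = a*z + b
x = rng.standard_normal((1, d))    # one input token

# Original FFN with the (linear) activation applied element-wise.
out_ffn = (a * (x @ W1) + b) @ W2

# Constant folding: C = a * W1 @ W2 holds d*d parameters instead of 2*d*h,
# and the intercept b is absorbed into a bias vector B.
C = a * (W1 @ W2)
B = b * W2.sum(axis=0)
out_folded = x @ C + B

assert np.allclose(out_ffn, out_folded)
```

With $h=4d$, the folded matrix $C$ holds $d^{2}$ of the original $8d^{2}$ parameters, matching the $87.5\%$ reduction discussed above.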
3.2 Challenge
While constant folding in LLMs shows promise, a significant challenge arises from the widespread adoption of sophisticated non-linear activation functions in modern language models [23,24,27,33,35]. Our analysis of the top 15 text generation models on Hugging Face [10], a leading model repository, reveals that all of them use complex activations: 7 employ GELU and 8 use SiLU. These sophisticated activation functions make naive linear approximation challenging.
Consider GELU as an example. As illustrated in Figure 4, GELU exhibits a smooth but intricate input-output relationship. When we attempt to approximate GELU with a simple linear function (as described in Section 3.1), the model suffers substantial accuracy degradation. To validate this, we modified Falcon-7B by replacing the GELU activation in its FFN blocks with a linear function derived through linear regression on the original activation's input-output data. When evaluated on the Lambada [12] benchmark, the model's accuracy plummeted from $75\%$ to 0, making it impractical for real-world applications.
4 Insights and Overview
Through careful analysis of LLM internals, we have identified two key insights that enable effective approximation of nonlinear activation functions with linear ones while preserving model accuracy.
Table 1: Average percentage of activation input ranges containing $65\%$ of inputs relative to total range length, measured across neurons in three datasets.
| Model | Activation | Wikitext-2 | C4 | PTB |
|----------------|----------|------------|------|------|
| Falcon-7B | GELU | 20.2% | 20.6% | 20.7% |
| Falcon-40B | GELU | 19.6% | 19.7% | 20.1% |
| BLOOMZ-7B1 | GELU | 19.1% | 19.4% | 19.8% |
| LLaMA2-7B | SiLU | 18.4% | 18.5% | 18.2% |
4.1 Insights
Insight 1. Skewed distribution of activation function inputs. The input distribution for activation functions in LLMs exhibits a highly skewed pattern, with inputs predominantly concentrated within a narrow range of the input space. To validate this observation, we analyzed three widely-used language task datasets: Wikitext-2 [45], C4 [49], and Penn Treebank (PTB) [44]. An FFN block consists of multiple neurons, where each neuron processes input through its weights and activation function. The activation input distribution at the neuron level is crucial for understanding model behavior [42,52]. We examined this distribution within the FFN weight matrices. Each weight matrix can be decomposed into individual neurons, which represent distinct computational units. For example, in Figure 3, $W_{1}$ comprises two neurons represented as columns: $\binom{3}{-1}$ and $\binom{1}{2}$. Using 20K random tokens from each dataset, we collected comprehensive statistics.
Figure 5 illustrates the input activation distribution for 50 randomly sampled neurons from two layers of Falcon-7B across these datasets. Our analysis reveals consistent distribution patterns within the same layer across different datasets. Notably, approximately $65\%$ of activation inputs for each neuron are concentrated within just $20\%$ of the total input range, demonstrating significant skewness. This pattern holds true across various model architectures. Table 1 presents the average range portion containing $65\%$ of activation inputs relative to the total range for four LLMs: Falcon-7B/40B, BLOOMZ-7B1, and LLaMA2-7B. While the first three models employ GELU activation, LLaMA2-7B uses SiLU. Across all models and datasets, this portion consistently remains between $18\%$ and $20\%$. These findings align with previous research [42,46,52], confirming that skewed activation input distribution is an inherent characteristic of LLMs. This insight reveals that activation function inputs follow a power law-like distribution, enabling effective approximation of non-linear activation functions within a compact "hot" input range.
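A minimal sketch of how the statistic in Table 1 could be measured for a single neuron (our own illustrative procedure; the paper's profiler uses cuML's kernel density estimation, see Section 6): given the calibration pre-activation inputs of one neuron, find the narrowest window holding $65\%$ of the samples and relate its width to the full observed range.

```python
import numpy as np

def hot_range_fraction(z, coverage=0.65):
    """Width of the narrowest window holding `coverage` of a neuron's activation
    inputs, expressed as a fraction of the neuron's total observed input range."""
    z = np.sort(np.asarray(z))
    n = len(z)
    k = int(np.ceil(coverage * n))
    widths = z[k - 1:] - z[: n - k + 1]      # all candidate windows of k samples
    best = int(np.argmin(widths))
    lo, hi = z[best], z[best + k - 1]
    return (hi - lo) / (z[-1] - z[0]), (lo, hi)

# Example: a skewed, Figure 5-like distribution (dense mode plus a long tail).
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(-0.5, 0.2, 9000), rng.normal(2.0, 1.0, 1000)])
frac, (lo, hi) = hot_range_fraction(z)
print(f"{frac:.0%} of the range holds 65% of inputs: [{lo:.2f}, {hi:.2f})")
```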
Insight 2. Non-uniform distribution of layer and neuron importance. Not all layers and neurons contribute equally to the model's performance. Using the same methodology as in the previous insight, we collected the mean squared error (MSE) of approximating the activation function with linear functions for each neuron in each layer. Specifically, for each layer and neuron, we first identify the input range containing a certain portion (e.g., $85\%$) of activation inputs based on the profiled distribution. We then approximate the activation function within this range using a linear function $y=ax+b$, where parameters $a$ and $b$ are obtained through linear regression to minimize the loss within the range. This process is repeated for different portions of inputs (from $65\%$ to $95\%$) to analyze the approximation error patterns.

Figure 6: Error distribution across layers (a) and across neurons in the 10th layer (b) of Falcon-7B. The error is profiled on the C4 dataset.
Figure 6a shows significant differences in errors across layers. For instance, when $85\%$ of inputs fall within the linear range, the error of the first layer ($10^{-7}$) is an order of magnitude larger than that of the second layer ($10^{-8}$). Previous works have also shown that layer importance is non-uniformly distributed [64,65,67]. Similarly, neuron importance within a layer varies significantly. Figure 6b presents the error distribution for $1\%$ of neurons randomly selected from the first layer of Falcon-7B, showing error variations of nearly three orders of magnitude (from $10^{-10}$ to $10^{-7}$). Previous studies have also found similar patterns, where neuron activation is non-uniformly distributed, with outliers having a larger impact on model performance [25,41,52]. In summary, Figure 6 highlights the non-uniform distribution of layer and neuron importance, underscoring the need to treat each layer and neuron distinctly in our design approach.
Figure 7: Architecture of TARDIS with offline and online components. The offline component generates the constant folded matrix and predictor based on the calibration dataset. The online component performs inference using the constant folded matrix and predictor.
4.2 Overview

Armed with these insights, we propose TARDIS, the first system that enables constant folding in LLMs while preserving model capabilities and reducing parameter size. Our approach leverages two key observations: 1) the skewed distribution of activation inputs allows us to focus approximation on the most frequently accessed input ranges; 2) the varying importance of different layers and neurons suggests that approximation thresholds should be adapted accordingly.

As illustrated in Figure 7, TARDIS takes three inputs: an LLM, a calibration dataset, and a threshold $t$ that defines what portion of inputs should fall within the linear approximated range. The system generates two key outputs: a compressed model weight matrix (produced via constant folding based on approximated linear ranges) and a predictor that identifies neurons operating outside their approximated ranges.

The system architecture consists of offline and online components. The offline component performs two tasks. First, it analyzes the calibration dataset (a small number of text samples) to collect: 1) activation input distributions for each layer and neuron, and 2) importance scores based on approximation errors when replacing activation functions with linear functions. Based on these statistics, TARDIS approximates the activation function with linear functions, focusing on the "hot" input ranges where most activations occur. The approximation range for each neuron is determined by its importance score, which TARDIS calculates based on the error introduced by linear approximation under different thresholds. The final outputs are the constant folded matrix and a predictor for identifying out-of-range neuron activations. During inference, the online component first performs a speculative computation using the constant folded matrix. It then uses the predictor to identify neurons operating outside their approximated ranges. For these identified neurons, it corrects the results by: 1) subtracting the incorrectly computed linear approximations, and 2) recomputing the actual results using the original activation function.
Figure 8 illustrates this process using a simple FFN with two neurons (each neuron corresponds to one column in $W_{1}$ and one row in $W_{2}$, as shown in yellow and green). The FFN uses GELU as its activation function. In the offline phase, TARDIS determines linear approximations for each neuron's frequent input range. For neuron one, the range is $(-1.5, 0.12)$ with slope 0.25 and intercept 0.1; for neuron two, the range is $(-3.5, -0.1)$ with slope 0.1 and intercept 0.2. These ranges are selected based on calibration data to ensure that at least a $t$ proportion of inputs fall within them. The system then generates the constant folded matrices $C$ and $B$ using these linear parameters. When performing inference with input token $x=(-1\ \ {-1})$, TARDIS 1) performs the speculative computation $xC+B=(0.3\ \ {-0.1})$; 2) uses the predictor to determine that only neuron two's input falls within its approximated range; 3) corrects the result by a) subtracting neuron one's approximate result, $\left(0.25\,x\binom{3}{-1}+0.1\right)\left(-1\ \ 0\right)=\left(0.4\ \ 0\right)$, and b) adding back neuron one's actual result, $GELU\!\left(x\binom{3}{-1}\right)\left(-1\ \ 0\right)$. This correction mechanism ensures accurate results while preserving the efficiency benefits of constant folding where applicable.

Figure 8: Example of an FFN in an LLM constant folded by TARDIS. The constant folded matrix is generated based on the approximated linear ranges for each neuron. The predictor identifies that neuron one is not in its approximated range. The online component fixes the FFN result by subtracting the wrong result and recomputing the actual result using the original activation function.
5 TARDIS Design
In this section, we present the detailed design of TARDIS, which comprises offline and online components. The offline component consists of three key aspects: approximating nonlinear activation functions with linear functions (Section 5.1), generating constant folded matrices (Section 5.2), and constructing predictors (Section 5.3). The online component details how these elements work together during inference (Section 5.4).
5.1 Non-linear Function Approximation
TARDIS aims to reduce FFN parameter counts and speed up inference by replacing complex non-linear activation functions with simple linear functions where possible. At a high level, our activation function replacement approach works as follows. For each FFN layer, consisting of two weight matrices ($W_{1}$ and $W_{2}$) with a non-linear activation function $\upsigma$ in between, we 1) break down the computation of the entire FFN into computations on individual neurons, where each neuron corresponds to a column in $W_{1}$ and a row in $W_{2}$; 2) for each neuron, analyze its input patterns to the activation function using a calibration dataset; 3) replace the neuron's non-linear activation with a linear function ($y=ax+b$) within its most frequently observed input range. The linear functions are obtained using least squares regression.
An intuitive design would be to replace each neuron's non-linear activation with multiple linear ranges, as different input ranges may exhibit distinct patterns. For example, for input ranges $[l_{1},a)$, $[a,b)$, and $[b,l_{2})$, we could fit three separate linear functions to better approximate the activation function. However, a significant challenge arises with this approach: when trying to fold the weight matrices, we need a separate folded matrix for each combination of ranges across neurons. With $h$ neurons and $r$ ranges per neuron, this leads to $r^{h}$ folded matrices. Figure 9 illustrates this exponential growth: even with just 2 neurons and 2 ranges each, we need 4 different folded matrices to cover all possible combinations.
Figure 9: Example of the exponential growth in the number of folded matrices.
Given that modern language models typically have over 10,000 neurons per FFN layer, storing multiple ranges quickly becomes impractical. This leads us to adopt a single-range strategy, which we describe in detail in the following section.
5.1.1 Single Range Approximation Strategy
To overcome the exponential storage overhead of multi-range approximation, TARDIS adopts a pragmatic single-range strategy. For each neuron, we approximate its non-linear activation using one linear function within a carefully chosen input range.
More specifically, for each neuron $n$, we aim to find a linear function $y=ax+b$ that best approximates the activation within a range $[l_{1},l_{2})$. The approximation function $\Phi_{n}$ is defined as:
$$\Phi_{n}(x)=\begin{cases}ax+b, & l_{1}\leq x<l_{2}\\ \upsigma(x), & \text{otherwise}\end{cases}$$
To quantify the quality of our approximation, we define the error metric for the entire FFN layer as $err(x)=\sum_{n}^{h}err_{n}(x)$, where each neuron's error contribution is measured using the L2 distance [56] between the original and approximated outputs: $err_{n}(x)=\|\upsigma(x W_{1_{:,n}})W_{2_{n,:}}-\Phi_{n}(x W_{1_{:,n}})W_{2_{n,:}}\|_{2}$.
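A direct code rendering of $\Phi_{n}$ and $err_{n}$ may help make the notation concrete (an illustrative sketch; `gelu` stands in for the model's activation $\upsigma$, and all names are ours):

```python
import numpy as np
from scipy.special import erf

def gelu(z):
    # Exact GELU: z times the standard normal CDF evaluated at z.
    return 0.5 * z * (1.0 + erf(z / np.sqrt(2.0)))

def phi_n(z, a, b, l1, l2, sigma=gelu):
    """Piecewise approximation: linear on [l1, l2), original activation elsewhere."""
    return np.where((z >= l1) & (z < l2), a * z + b, sigma(z))

def err_n(x, w1_col, w2_row, a, b, l1, l2, sigma=gelu):
    """L2 distance between neuron n's original and approximated FFN contribution."""
    z = x @ w1_col                                 # pre-activation input of neuron n
    exact = np.outer(sigma(z), w2_row)             # sigma(x W1[:, n]) W2[n, :]
    approx = np.outer(phi_n(z, a, b, l1, l2, sigma), w2_row)
    return np.linalg.norm(exact - approx)
```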
Adaptive Thresholding. Given the definition of FFN approximation error, TARDIS needs to determine appropriate linear approximation ranges for each neuron while satisfying the user-specified in-range threshold $t$. A key challenge is deciding how much of each neuron's input should be covered by the linear approximation.
Based on our insight in Section 4.1 that different layers and neurons contribute unequally to the model's performance, we develop an adaptive thresholding mechanism that assigns different coverage requirements to different components. This mechanism operates at two levels:
1) Layer-level thresholding: For each layer $i$, we assign a threshold $t_{i}$ that determines what percentage of inputs should be linearly approximated. This is formulated as an optimization problem:
$$\mathrm{minimize}\ \sum_{i}^{L}E_{i}t_{i},\quad \mathrm{s.t.}\ \sum_{i}^{L}t_{i}=tL$$
where $E_{i}$ represents the total empirical error measured on a calibration dataset for layer $i$, $L$ is the total number of layers, and $t$ is the target threshold. The empirical error $E_{i}$ is estimated by approximating all neurons in layer $i$ under threshold $t_{i}$ (e.g., as Figure 6a shows). The constraint ensures that the average threshold across layers equals the target.
2) Neuron-level thresholding: Within each layer $i$, we further distribute the layer's threshold $t_{i}$ across its neurons using:
$$\mathrm{minimize}\ \sum_{n}^{h}E_{i_{n}}t_{i_{n}},\quad \mathrm{s.t.}\ \sum_{n}^{h}t_{i_{n}}=t_{i}h$$
where $E_{i_{n}}$ represents the error contribution of neuron $n$ in layer $i$, $h$ is the number of neurons per layer, and $t_{i_{n}}$ is the threshold assigned to neuron $n$. This ensures that more important neurons (those with higher error impact) receive more conservative thresholds.
This two-level adaptive thresholding approach allows TARDIS to maintain high approximation accuracy while satisfying the user's threshold requirement. More critical layers and neurons are assigned stricter thresholds, while less important ones can use more aggressive linear approximations.
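Because the objective is linear in the thresholds and the only constraint fixes their average, the layer-level problem can be solved greedily: give the largest coverage to the layers with the smallest profiled error. The sketch below illustrates this (the per-layer bounds `lo` and `hi` are our own assumptions, not values from the paper; the same routine applies unchanged to neurons within a layer):

```python
import numpy as np

def allocate_thresholds(E, target, lo=0.5, hi=0.99):
    """Minimize sum_i E_i * t_i subject to mean(t) == target and lo <= t_i <= hi.

    E:      per-layer (or per-neuron) errors profiled on the calibration set
    target: user-specified threshold t; the mean of the returned t_i equals target
    """
    L = len(E)
    t = np.full(L, float(lo))
    budget = target * L - lo * L              # coverage left to distribute above `lo`
    for i in np.argsort(E):                   # least error-sensitive components first
        give = min(hi - lo, budget)
        t[i] += give
        budget -= give
        if budget <= 0:
            break
    return t

# Example: layer 0 is an order of magnitude more error-sensitive (cf. Figure 6a),
# so it ends up with the most conservative (smallest) coverage.
E = np.array([1e-7, 1e-8, 2e-8, 1.5e-8])
print(allocate_thresholds(E, target=0.85))    # mean of the result is 0.85
```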
Range Searching. For each neuron, TARDIS employs a greedy search strategy to determine the optimal linear approximation range, following these steps: 1) initialize a starting range centered at the input distribution's centroid, computed using kernel density estimation (KDE) [48,54]; 2) gradually expand the range boundaries until reaching the neuron-specific coverage threshold $t_{i_{n}}$; 3) during expansion, select the direction (left or right) that produces minimal approximation error; 4) once the optimal range is determined, fit a linear function using least squares regression. The complete algorithm is formalized in Algorithm 1, leveraging a small calibration dataset to ensure both approximation accuracy and computational efficiency without fine-tuning the model.
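A condensed sketch of this greedy search for one neuron follows (illustrative only; the offline component described in Section 6 uses cuML for the density estimation and regression, whereas here a histogram mode and `np.polyfit` stand in):

```python
import numpy as np

def search_range(z, coverage, sigma, n_bins=512):
    """Greedily grow a neuron's linear range around the densest input region."""
    z = np.sort(np.asarray(z))
    n = len(z)
    # 1) Start around the densest point (stand-in for the KDE centroid).
    hist, edges = np.histogram(z, bins=n_bins)
    center = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
    idx = int(np.clip(np.searchsorted(z, center), 1, n - 1))
    lo, hi = idx - 1, idx + 1                      # begin with two samples

    def fit(l, r):
        zs = z[l:r]
        a, b = np.polyfit(zs, sigma(zs), deg=1)    # 4) least-squares line
        return a, b, np.mean((a * zs + b - sigma(zs)) ** 2)

    # 2) + 3) Expand left or right, whichever keeps the fit error smaller.
    while (hi - lo) < coverage * n:
        cand = []
        if lo > 0:
            cand.append(("left", fit(lo - 1, hi)))
        if hi < n:
            cand.append(("right", fit(lo, hi + 1)))
        if not cand:
            break
        side, _ = min(cand, key=lambda c: c[1][2])
        lo, hi = (lo - 1, hi) if side == "left" else (lo, hi + 1)

    a, b, _ = fit(lo, hi)
    return (z[lo], z[hi - 1]), (a, b)              # range [l1, l2) and its line
```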
5.2 Constant Folded Matrix Generation
Given the approximated linear ranges for each neuron in each layer, TARDIS generates constant folded matrices $C$ and bias vectors $B$ through a systematic transformation process.
For each FFN layer, TARDIS first substitutes the activation function with the approximated linear function $\Phi_{n}$ for each neuron $n$. This transformation creates a sequence of three consecutive linear operations per neuron: an initial linear transform (on $W_{1_{:,n}}$), followed by the approximated linear transform ($\Phi_{n}$ on $[l_{1},l_{2})$), and finally a second linear transform (on $W_{2_{n,:}}$). These three transformations are then constant folded into a single linear transformation. The constant folding process for each neuron can be formally expressed as:
$$CF(W_{1_{:,n}},W_{2_{n,:}},\Phi_{n})=(a\,W_{1_{:,n}}W_{2_{n,:}},\ b\,W_{2_{n,:}})=(C_{n},B_{n})$$
Here, $CF$ represents the constant folding function for neuron $n$, producing a matrix $C_{n}\in\mathbb{R}^{d\times d}$ and a bias vector $B_{n}\in\mathbb{R}^{1\times d}$. The parameters $a$ and $b$ denote the slope and intercept of the approximated linear function $\Phi_{n}$, respectively. To complete the transformation for each layer, TARDIS sums all $h$ individual neurons' folded matrices into a single consolidated matrix $C\in\mathbb{R}^{d\times d}$ and bias vector $B\in\mathbb{R}^{1\times d}$: $(C,B)=(\sum_{n}^{h}C_{n},\,\sum_{n}^{h}B_{n})$.
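In code, the per-layer folding reduces to two matrix products (a minimal sketch; `slopes` and `intercepts` are assumed to hold the per-neuron parameters $a_{n}$ and $b_{n}$ found in Section 5.1):

```python
import numpy as np

def fold_ffn(W1, W2, slopes, intercepts):
    """Build the constant folded matrix C and bias B for one FFN layer.

    W1: (d, h), W2: (h, d); slopes, intercepts: length-h arrays of (a_n, b_n).
    Implements (C, B) = (sum_n a_n W1[:, n] W2[n, :], sum_n b_n W2[n, :]).
    """
    C = (W1 * slopes) @ W2    # scaling column n of W1 by a_n folds all neurons at once
    B = intercepts @ W2       # sum_n b_n * W2[n, :]
    return C, B               # C: (d, d), B: (d,)
```

The speculative step in Section 5.4 then evaluates the whole layer as $xC+B$.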
5.3 Predictor Generation
While Section 5.2 demonstrates how to generate optimized matrices for linear approximation, we need a practical mechanism to handle cases where inputs fall outside the approximated range (Section 4.1).
To address this, TARDIS introduces a speculative approximation with result fixing scheme. The key idea is to avoid expensive range checks before computation. Instead, TARDIS optimistically assumes all inputs fall within the linear range and proceeds with fast matrix operations. It then employs a lightweight predictor to identify which results need correction, only fixing the inaccurate ones (Section 5.4).
The predictor's role is straightforward: it takes the FFN block input and determines which neurons received inputs outside their approximated linear ranges. To implement this efficiently, TARDIS creates a compressed version of the weight matrix $W_{1}$, which contains just enough information to make these predictions without the overhead of full matrix operations. For compressing $W_{1}$, TARDIS uses quantization techniques [29]. While alternatives like training smaller prediction networks are possible [42,52], quantization provides an effective balance of accuracy and simplicity. TARDIS can incorporate various LLM compression methods like pruning or knowledge distillation, as long as they preserve sufficient information to predict out-of-range inputs. This enables high performance while maintaining accuracy through selective result fixing.
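The prediction itself is cheap: estimate each neuron's pre-activation input from a low-precision copy of $W_{1}$ and flag the neurons whose estimate leaves their range $[l_{1},l_{2})$. The sketch below uses a toy symmetric quantizer purely for illustration; TARDIS itself obtains the compressed predictor weights with 2-bit GPTQ (Section 6):

```python
import numpy as np

def quantize(W, bits=2):
    """Toy symmetric per-column quantizer (stand-in for the 2-bit GPTQ weights)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=0) / max(qmax, 1)
    Wq = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return Wq, scale

def predict_out_of_range(x, Wq, scale, l1, l2):
    """Boolean mask over the h neurons whose estimated input leaves [l1, l2)."""
    z_hat = x @ (Wq * scale)              # cheap estimate of x W1
    return (z_hat < l1) | (z_hat >= l2)
```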
5.4 Inference Runtime
TARDIS operates in two phases: an offline phase that generates the constant folded matrix (Section 5.2) and the predictor (Section 5.3), and an online phase that performs inference through speculative approximation and result fixing.
Speculative Approximation. During this step, TARDIS approximates the FFN computation using the pre-computed constant folded matrix. This simplifies the complex FFN computation into a simple matrix multiplication and addition operation: $FFN(x)\approx xC+B$. This approximation significantly reduces the computational complexity compared to the original FFN computation.
Result Fixing. Not all neurons can be accurately approximated using the simplified computation above. The predictor identifies these cases, and TARDIS corrects the results for these specific neurons using the original computation method. This correction involves two steps. First, we remove the approximate results for the identified neurons. Then, we compute and add back their actual results using the original FFN computation.
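Putting the two online steps together for one FFN layer (a minimal sketch under the interfaces introduced above; `predictor` is assumed to return the per-token out-of-range mask, and `slopes`/`intercepts` are the folded linear parameters):

```python
import numpy as np

def ffn_tardis(x, C, B, W1, W2, slopes, intercepts, predictor, sigma):
    """Speculative approximation followed by result fixing for one FFN layer."""
    y = x @ C + B                                   # speculative: folded FFN
    mask = predictor(x)                             # which neurons left their range?
    idx = np.where(mask.any(axis=0))[0]             # neurons needing exact recomputation
    if idx.size:
        z = x @ W1[:, idx]                          # exact inputs of only those neurons
        a, b = slopes[idx], intercepts[idx]
        y -= (a * z + b) @ W2[idx, :]               # remove the wrong linear result
        y += sigma(z) @ W2[idx, :]                  # add back the exact activation result
    return y
```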
Figure 10 illustrates this process with a concrete example. The system first generates an approximate result, then identifies and corrects inaccurate approximations by replacing them with precise computations.
Figure 10: The speculative approximation and result fixing steps in the online phase of TARDIS.
Memory Footprint. The memory requirements of TARDIS consist of three main components. 1) The constant folded matrix and its bias terms; 2) The lightweight predictor; 3) Only the original weights of neurons that require exact computation (fixing). The system’s memory efficiency varies between the two inference phases. In the first phase, more neurons might require their original weights, leading to higher memory usage. In contrast, during sequential token generation (second phase), fewer neurons typically need correction, resulting in better memory efficiency.
6 Implementation
We implemented TARDIS as a prototype using PyTorch [6]. We integrated TARDIS into the Hugging Face Transformer library (v4.41.0) [10] and vLLM inference engine (v0.6.6) [7]. The system consists of two main components:
Offline Component. The offline preprocessing pipeline, implemented in approximately 900 lines of Python code, handles non-linear function approximation and constant folding matrix generation. We leveraged cuML [11], a GPU-accelerated machine learning library, for training the linear regression model and performing kernel density estimation efficiently. The resulting constant-folded matrix is serialized to a binary format for runtime loading. The predictor is obtained by GPTQ [29] with 2-bit precision, which is also stored in binary format, and loaded before inference.
Runtime Component. The inference runtime consists of two parts: (1) a Python library implementing speculative approximation and result fixing logic (approximately 150 lines), and (2) a specialized NVIDIA CUDA kernel [2] for efficient result fixing. The kernel performs selective loading of FFN weights based on predicted neuron indices. We optimized the memory efficiency by memory coalescing [32] and vectorized shared memory access [13].
7 Evaluation
Our evaluation mainly focuses on the following questions:
Name | Parameters | % of Parameters in FFN | Activation | Monthly Downloads |
---|---|---|---|---|
Falcon-7B | 7.2B | 80% | GELU | 75K |
Falcon2-11B | 11.1B | 80% | GELU | 19K |
BLOOMZ-7B1 | 7.1B | 67% | GELU | 9.3K |
GPT-2-XL | 1.6B | 67% | GELU | 306K |
OPT-6.7B | 6.7B | 67% | ReLU | 39K |
Table 2: Basic statistics of the evaluated LLMs. The third column shows the estimated percentage of total model parameters in the FFN layer.
Apart from the above main questions, we evaluate the precision and effectiveness of the range assignment algorithm, predictor size’s influence on model performance, inference runtime overhead, and the precision effects of reordering FFN.
7.1 Settings
Models. We evaluated TARDIS on five large language models ranging from 1.6B to 11.1B parameters: Falcon-7B [19], Falcon2-11B [15], BLOOMZ-7B [38], GPT2-XL [3], and OPT-6.7B [17]. These models represent different architectures and training objectives while being widely adopted in real-world applications. Ea