Bayesian Uncertainty for Gradient Aggregation in Multi-Task Learning


Abstract


As machine learning becomes more prominent, there is a growing demand to perform several inference tasks in parallel. Running a dedicated model for each task is computationally expensive, and there is therefore great interest in multi-task learning (MTL). MTL aims to learn a single model that solves several tasks efficiently. MTL models are often optimized by first computing a single gradient per task and then aggregating these gradients into a combined update direction. However, this approach does not consider an important aspect: the sensitivity in the gradient dimensions. Here, we introduce a novel gradient aggregation approach using Bayesian inference. We place a probability distribution over the task-specific parameters, which in turn induces a distribution over the gradients of the tasks. This additional information allows us to quantify the uncertainty in each of the gradient dimensions, which can then be factored in when aggregating them. We empirically demonstrate the benefits of our approach on a variety of datasets, achieving state-of-the-art performance.

1. Introduction

In many application domains, there is a need to perform several machine learning inference tasks simultaneously. For instance, an autonomous vehicle needs to identify and detect objects in its vicinity, perform lane detection, track the movements of other vehicles over time, and predict free space around it, all in parallel and in real time. In deep Multi-Task Learning (MTL) the goal is to train a single neural network (NN) to solve several tasks simultaneously, thus avoiding the need for one dedicated model per task (Caruana, 1997). Besides reducing the computational demand at test time, MTL also has the potential to improve generalization (Baxter, 2000). It is therefore not surprising that applications of MTL are taking central roles in various domains, such as vision (Achituve et al., 2021a; Shamshad et al., 2023; Zheng et al., 2023), natural language processing (Liu et al., 2019b; Zhou et al., 2023), speech (Michelsanti et al., 2021), robotics (Devin et al., 2017; Shu et al., 2018), and general scientific problems (Wu et al., 2018), to name a few.

However, optimizing multiple tasks simultaneously is a challenging problem that may lead to degradation in performance compared to learning them individually (Standley et al., 2020; Yu et al., 2020). To address this issue, one basic formula that many MTL optimization algorithms follow is to first calculate the gradient of each task’s loss, and then aggregate these gradients according to some specified scheme. For example, several studies focus on reducing conflicts between the gradients before averaging them (Yu et al., 2020; Wang et al., 2020), others find a convex combination with minimal norm (Sener & Koltun, 2018; Désidéri, 2012), and some use a game-theoretic approach (Navon et al., 2022). However, by relying only on the gradient, these methods miss an important aspect: the sensitivity of the gradient in each dimension.

Our approach builds on the following observation: for each task, there may be many “good” parameter configurations. Standard MTL optimization methods take only a single value into account, and as such lose information in the aggregation step. Hence, tracking all of the parameter configurations will yield a richer description of the gradient space that can be advantageous when finding an update direction. Specifically, to account for all parameter values, we propose to place a probability distribution over the task-specific parameters, which in turn induces a probability distribution over the gradients. As a result, we obtain uncertainty estimates for the gradients that reflect the sensitivity in each of their dimensions. High-uncertainty dimensions are more lenient to changes, while dimensions with lower uncertainty are stricter (see illustration in Figure 2).

To obtain a probability distribution over the task-specific parameters we take a Bayesian approach. According to the Bayesian view, a posterior distribution over parameters of interest can be derived through Bayes' rule. In MTL, it is common to use a shared feature extractor network with linear task-specific layers (Ruder, 2017). Hence, if we assume a Bayesian model over the weights of the last, task-specific layer during the back-propagation process, we obtain posterior distributions over them. The posterior is then used to compute a Gaussian distribution over the gradients by means of moment matching. Then, to derive an update direction for the shared network, we design a novel aggregation scheme that considers the full distributions of the gradients. We name our method BayesAgg-MTL. An important implication of our approach is that BayesAgg-MTL assigns weights to the gradients at a higher resolution than existing methods, allocating a specific weight to each dimension and datum in the batch. We demonstrate our method's effectiveness over baseline methods on the MTL benchmarks QM9 (Ramakrishnan et al., 2014), CIFAR-100 (Krizhevsky et al., 2009), ChestX-ray14 (Wang et al., 2017), and UTKFace (Zhang et al., 2017).

In summary, this paper makes the following novel contributions: (1) The first Bayesian formulation of gradient aggregation for Multi-Task Learning. (2) A novel posterior approximation based on a second-order Taylor expansion. (3) A new MTL optimization algorithm based on our posterior estimation. (4) New state-of-the-art results on several MTL benchmarks compared to leading methods. Our code is publicly available at https://github.com/ssi-research/BayesAgg_MTL.

2. Background


Notations. Scalars, vectors, and matrices are denoted with lower-case letters (e.g., $x$), bold lower-case letters (e.g., $\mathbf{x}$), and bold upper-case letters (e.g., $\mathbf{X}$), respectively. All vectors are treated as column vectors. Training samples are tuples consisting of features shared across all tasks and the labels of $K$ tasks, namely $(\mathbf{x},\{\mathbf{y}^{k}\}_{k=1}^{K})\sim\mathcal{D}$, where $\mathcal{D}$ denotes the training set. We denote the dimensionality of the input and of the output of task $k$ by $d_{\mathbf{x}}$ and $o_{k}$, respectively.

In this study, we focus on common NN architectures for MTL having a shared feature extractor and linear task-specific heads (Kendall et al., 2018; Sener & Koltun, 2018). The model parameters are denoted by $\{\pmb{\theta},\{\mathbf{w}^{k}\}_{k=1}^{K}\}$, where $\pmb{\theta}\in\mathbb{R}^{d_{\theta}}$ is the vector of shared parameters and $\{\mathbf{w}^{k}\}_{k=1}^{K}$ are task-specific parameter vectors, each lying in $\mathbb{R}^{d_{k}}$. The last shared feature representation is denoted by the vector $\mathbf{h}(\mathbf{x};\pmb{\theta})\in\mathbb{R}^{d_{\mathbf{h}}}$. Hence, the output of the network for task $k$ can be described as $\mathbf{f}^{k}(\mathbf{h}(\mathbf{x};\pmb{\theta});\mathbf{w}^{k})$. The loss of task $k\in\{1,\ldots,K\}$ is denoted by $\ell^{k}(\mathbf{x},\mathbf{y};\pmb{\theta},\mathbf{w}^{k})$. The gradient of the loss $\ell^{k}$ w.r.t. $\mathbf{h}(\mathbf{x};\pmb{\theta})$ is $\mathbf{g}^{k}:=\frac{\partial\ell^{k}}{\partial\mathbf{h}(\mathbf{x};\pmb{\theta})}(\mathbf{x},\mathbf{y};\pmb{\theta},\mathbf{w}^{k})\in\mathbb{R}^{d_{\mathbf{h}}}$. For clarity of exposition, function dependence on input variables will be omitted from now on.

Figure 1. BayesAgg-MTL assumes a probability distribution over the last layer parameters of each task. It first maps these distributions to the space of the last shared representation. Then an update direction is found for the shared representation based on the mean and variance of all distributions (denoted by $\mathbf{X}$ ).


2.1. Multi-Task Learning


A prevailing approach to optimizing MTL models goes as follows. First, the gradient of each task loss is computed. Second, an aggregation rule is applied to combine the gradients according to some algorithm. Finally, an update step is performed using the outcome of the aggregation step. Commonly, the aggregation rule operates on the gradients of the loss w.r.t. all parameters, or only the shared parameters (e.g., Yu et al., 2020; Navon et al., 2022; Shamsian et al., 2023). Alternatively, to avoid a costly full back-propagation process for each task, some methods apply the rule to the gradients w.r.t. the last shared representation (e.g., Sener & Koltun, 2018; Liu et al., 2020; Senushkin et al., 2023). Here, to make our method fast and scalable, we take the latter approach and note that it could be extended to full gradient aggregation.

2.2. Bayesian Inference


We wish to incorporate uncertainty estimates for the gradients into the aggregation procedure. Doing so will allow us to find an update direction that takes into account the importance of each gradient dimension for each task. A natural choice to model uncertainty is using Bayesian inference. Since we would like to get uncertainty estimates w.r.t the last shared hidden layer, we treat only the last task-specific layer as a Bayesian layer. This “Bayesian last layer” approach is a common way to scale Bayesian inference to deep neural networks (Snoek et al., 2015; Calandra et al., 2016; Wilson et al., 2016a; Achituve et al., 2021c). We will now present some of the main concepts of Bayesian modeling that will be used as part of our method.


For simplicity, assume a single output variable; we also drop the task notation for clarity. According to the Bayesian paradigm, instead of treating the parameters $\mathbf{w}$ as deterministic values that need to be optimized, they are treated as random variables, i.e., there is a distribution over the parameters. The posterior distribution of $\mathbf{w}$, after observing the data, is given by Bayes' rule as

$$
\log p(\mathbf{w}|\mathcal{D})\propto\log p(\mathbf{y}|\mathbf{X},\mathbf{w})+\log p(\mathbf{w}). \tag{1}
$$

Predictions in Bayesian inference are given by taking the expected prediction with respect to the posterior distribution. In general, Bayesian inference for $\mathbf{w}$ is intractable. However, for some specific scenarios there exists an analytic solution. For example, in linear regression, if we assume a Gaussian likelihood with fixed, independent scalar noise $\tau$ between the observations, $p(\mathbf{y}|\{\mathbf{x}_{i}\}_{i=1}^{|\mathcal{D}|},\mathbf{w})=\prod_{i=1}^{|\mathcal{D}|}\mathcal{N}(y_{i}|\mathbf{x}_{i}^{T}\mathbf{w},\tau^{2})$, and a Gaussian prior $p(\mathbf{w})=\mathcal{N}(\mathbf{w}|\mathbf{m}_{p},\mathbf{S}_{p})$, then

$$
\begin{aligned}
p(\mathbf{w}|\mathcal{D})&=\mathcal{N}(\mathbf{w}|\mathbf{m},\mathbf{S})\\
\mathbf{m}&=\mathbf{S}\left(\mathbf{S}_{p}^{-1}\mathbf{m}_{p}+\tau^{-2}\mathbf{X}\mathbf{y}\right)\\
\mathbf{S}&=\left(\mathbf{S}_{p}^{-1}+\tau^{-2}\mathbf{X}\mathbf{X}^{T}\right)^{-1}.
\end{aligned}
\tag{2}
$$

Here $\mathbf{X}\in\mathbb{R}^{d_{\mathbf{x}}\times|\mathcal{D}|}$ is the matrix that results from stacking the vectors $\{\mathbf{x}_{i}\}_{i=1}^{|\mathcal{D}|}$ as columns. Similarly, we denote by $\mathbf{H}\in\mathbb{R}^{d_{\mathbf{h}}\times|\mathcal{D}|}$ the matrix that results from stacking the hidden representation vectors. In the specific case of deep NNs with a Bayesian last layer, we get the same inference result, only with $\mathbf{H}$ replacing $\mathbf{X}$. Going beyond a single output variable entails defining a covariance matrix for the noise model. However, in this study we assume independence between the output variables in these cases.
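As a minimal sketch of the closed-form posterior above (Eq. 2), the following NumPy snippet computes $\mathbf{m}$ and $\mathbf{S}$ for a toy regression problem. The function name `gaussian_posterior`, the toy data, and the chosen prior are illustrative assumptions rather than part of the original method's code; passing hidden features $\mathbf{H}$ in place of $\mathbf{X}$ would give the Bayesian-last-layer variant.

```python
import numpy as np

def gaussian_posterior(X, y, m_prior, S_prior, tau=1.0):
    """Posterior N(m, S) over weights for y_i ~ N(x_i^T w, tau^2).

    X has shape (d, n): one column per example, matching the text.
    """
    S_prior_inv = np.linalg.inv(S_prior)
    precision = S_prior_inv + (X @ X.T) / tau**2
    S = np.linalg.inv(precision)
    m = S @ (S_prior_inv @ m_prior + (X @ y) / tau**2)
    return m, S

# Toy data (hypothetical): 3 features, 50 noisy observations of a known weight vector.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(3, 50))
y = X.T @ w_true + 0.1 * rng.normal(size=50)

m, S = gaussian_posterior(X, y, m_prior=np.zeros(3), S_prior=np.eye(3), tau=0.1)
print(m)           # posterior mean; should lie near w_true for this toy setup
print(np.diag(S))  # per-weight posterior variance
```

The explicit inverse is kept for readability; for larger $d$ a linear solve or Cholesky factorization would usually be preferable.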

Unlike regression, in classification the likelihood is not Gaussian, and the posterior can only be approximated. The common choice is to use variational inference (Wilson et al., 2016b; Achituve et al., 2021b; 2023), although there are other alternatives as well (Kristiadi et al., 2020).

3. Method


We start with an outline of the problem and our approach. Consider a deep network for multi-task learning that has a shared feature extractor part and task-specific linear layers. We propose to use Bayesian inference on the last layer as a means to train deterministic MTL models. For each task $k$ , we define a Bayesian probabilistic model representing the uncertainty over the linear weights of the last, task-specific layer $\mathbf{w}^{k}$ . The distribution over weights induces a distribution over gradients of the loss with respect to the last shared hidden layer. Given these per-task distributions on a joint space, we propose an aggregation rule for combining the gradients of the tasks to a shared update direction that takes into account the uncertainty in the gradients (see illustration in Figure 1). Then, the back-propagation process can proceed as usual.


Since regression and classification setups yield different inference procedures according to our approach, albeit having the same general framework, we discuss the two setups separately, starting with regression.


3.1. BayesAgg-MTL for Regression Tasks


Consider a standard squared loss for task $k$, $\ell^{k}=(y^{k}-\hat{y}^{k})^{2}$, between the label $y^{k}$ and the network output $\hat{y}^{k}$. Given a random batch of examples $\mathcal{B}\sim\mathcal{D}$, the gradient of the loss with respect to the hidden layer $\mathbf{h}$ for the $i^{th}$ example is

$$
\mathbf{g}_{i}^{k}=\frac{\partial\ell_{i}^{k}}{\partial\hat{y}_{i}^{k}}\frac{\partial\hat{y}_{i}^{k}}{\partial\mathbf{h}_{i}}=2\mathbf{w}^{k}(\mathbf{h}_{i}^{T}\mathbf{w}^{k}-y_{i}^{k}). \tag{3}
$$

Our main observation is that $\mathbf{g}_{i}^{k}$ is a function of $\mathbf{w}^{k}$. Hence, if we view $\mathbf{w}^{k}$ in the back-propagation process as a random variable, then $\mathbf{g}_{i}^{k}$ will be a random variable as well. This view will allow us to capture the uncertainty in the task gradient. Since the dimension of the hidden layer is usually small compared to the dimension of all shared parameters, operations in this space, such as matrix inversion, should not be costly.

If we fix all the shared parameters, then the posterior over $\mathbf{w}^{k}$ is a Gaussian distribution with known parameters via Eq. 2. As $\mathbf{g}_{i}^{k}$ is quadratic in $\mathbf{w}^{k}$, it has a generalized chi-squared distribution (Davies, 1973). However, since this distribution does not admit a closed-form density function, and since the gradient aggregation needs to be efficient as we run it at each iteration, we approximate the distribution of $\mathbf{g}_{i}^{k}$ by a Gaussian. The optimal choice for the parameters of this Gaussian is given by matching its first two moments to those of the true density, as these parameters minimize the Kullback–Leibler divergence between the two distributions (Minka, 2001). Luckily, in the regression case, we can derive the first two moments from the posterior over $\mathbf{w}^{k}$,

where $\mathbf{A}_{i}=\mathbf{h}_{i}\mathbf{h}_{i}^{T}$, $\mathbf{M}^{k}=\mathbf{m}^{k}(\mathbf{m}^{k})^{T}$, we assumed $\tau=1$, and $Tr(\cdot)$ is the matrix trace. We emphasize that this approximation is for the gradient of a single data point and a single task, not for the gradient of the task with respect to the entire batch. The full derivation is presented in Appendix A.1.
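The display equation that these definitions belong to is not shown above. As a hedged reconstruction, which may differ in form from the authors' expression, both moments can be re-derived from $\mathbf{g}_{i}^{k}=2\big(\mathbf{w}^{k}(\mathbf{w}^{k})^{T}\mathbf{h}_{i}-y_{i}^{k}\mathbf{w}^{k}\big)$ and the standard second-, third-, and fourth-order moment identities of a Gaussian $\mathbf{w}^{k}\sim\mathcal{N}(\mathbf{m}^{k},\mathbf{S}^{k})$:

$$
\begin{aligned}
\mathbb{E}[\mathbf{g}_{i}^{k}]&=2\mathbf{m}^{k}\big(\mathbf{h}_{i}^{T}\mathbf{m}^{k}-y_{i}^{k}\big)+2\mathbf{S}^{k}\mathbf{h}_{i},\\
\mathbb{E}[\mathbf{g}_{i}^{k}(\mathbf{g}_{i}^{k})^{T}]&=4\Big[\big((\mathbf{m}^{k})^{T}\mathbf{A}_{i}\mathbf{m}^{k}+Tr(\mathbf{A}_{i}\mathbf{S}^{k})+(y_{i}^{k})^{2}\big)\big(\mathbf{S}^{k}+\mathbf{M}^{k}\big)\\
&\qquad+2\big(\mathbf{S}^{k}\mathbf{A}_{i}\mathbf{S}^{k}+\mathbf{S}^{k}\mathbf{A}_{i}\mathbf{M}^{k}+\mathbf{M}^{k}\mathbf{A}_{i}\mathbf{S}^{k}\big)\\
&\qquad-2y_{i}^{k}\big(\mathbf{h}_{i}^{T}\mathbf{m}^{k}\,(\mathbf{S}^{k}+\mathbf{M}^{k})+\mathbf{S}^{k}\mathbf{h}_{i}(\mathbf{m}^{k})^{T}+\mathbf{m}^{k}\mathbf{h}_{i}^{T}\mathbf{S}^{k}\big)\Big].
\end{aligned}
$$

As a sanity check, setting $\mathbf{S}^{k}=0$ collapses the first line to the standard gradient with $\mathbf{w}^{k}=\mathbf{m}^{k}$, and the second line to its outer product, so the covariance vanishes.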

Several points deserve attention here. First, note the similarity between the solution of the first moment and the gradient obtained via the standard back-propagation. The two differences are that the last layer parameters, $\mathbf{w}^{k}$ , are replaced with the posterior mean, $\mathbf{m}^{k}$ , and an uncertainty term was added. In the extreme case of $\mathbf{S}^{k}\to0$ and $\mathbf{m}^{k}\rightarrow\mathbf{w}^{k}$ , the mean coincides with that of the standard back-propagation. Second, in the case of a multi-output task, following our independence assumption between output variables, we can obtain the moments for each output dimension separately using the same procedure, so de facto we treat each output as a different task. Finally, during training, the shared parameters are constantly being updated. Hence, to compute the posterior distribution for $\mathbf{w}^{k}$ we need to iterate over the entire dataset at each update step. In practice, this can make our method computationally expensive. Therefore, we use the current batch data only to approximate the posterior over $\mathbf{w}^{k}$ , and introduce information about the full dataset through the prior as described next.

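To make the first point concrete, the sketch below computes the mean and covariance of $\mathbf{g}=2\mathbf{w}(\mathbf{h}^{T}\mathbf{w}-y)$ for $\mathbf{w}\sim\mathcal{N}(\mathbf{m},\mathbf{S})$. The formulas are derived here from standard Gaussian moment identities and are not taken from the authors' code; the function name `gradient_moments` and the numeric values are hypothetical.

```python
import numpy as np

def gradient_moments(m, S, h, y):
    """Mean and covariance of g = 2 w (h^T w - y) for w ~ N(m, S)."""
    M = np.outer(m, m)
    A = np.outer(h, h)
    second_w = S + M                      # E[w w^T]
    hm = h @ m                            # h^T m
    # First moment: 2 m (h^T m - y) + 2 S h
    mean = 2.0 * (second_w @ h - y * m)
    # E[(h^T w)^2 w w^T], a fourth-order Gaussian moment
    quartic = (m @ A @ m + np.trace(A @ S)) * second_w \
        + 2.0 * (S @ A @ S + S @ A @ M + M @ A @ S)
    # E[(h^T w) w w^T], a third-order Gaussian moment
    cubic = hm * second_w + np.outer(S @ h, m) + np.outer(m, S @ h)
    second_g = 4.0 * (quartic - 2.0 * y * cubic + y**2 * second_w)
    cov = second_g - np.outer(mean, mean)
    return mean, cov

# Hypothetical posterior and data point.
m = np.array([0.5, -1.0])
S = np.array([[0.04, 0.01], [0.01, 0.09]])
h = np.array([1.0, 2.0])
y = 0.3

mean, cov = gradient_moments(m, S, h, y)
backprop_grad = 2.0 * m * (h @ m - y)   # standard gradient evaluated at w = m
print(mean - backprop_grad)             # the added uncertainty term, equal to 2 S h
```

With `S` set to a zero matrix the mean reduces to `backprop_grad` and the covariance to zero, which mirrors the limiting case described in the text.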

Prior selection. A common choice in Bayesian deep learning is to use uninformative priors, such as a standard Gaussian, to let the data be the main influence on the posterior (Wilson & Izmailov, 2020; Fortuin et al., 2021). However, in our case, we found this prior to be too weak. Since the posterior depends only on a single batch, we opted to introduce information about the whole dataset through the prior. A natural choice is to use the posterior distribution of the previous batch as our prior (Särkkä, 2013, Chapter 3). However, this method did not work well in our experiments, and we developed an alternative. During each epoch, we collect the feature representations and labels of all examples in the dataset. At the end of the epoch, we compute the posterior based on the full data (with an isotropic Gaussian prior) and use this posterior as the prior at each step of the subsequent epoch. Updating the full-data prior more frequently is likely to benefit the overall model; however, it would probably also lengthen training. Hence, doing the update once per epoch strikes a good balance between performance and training time.
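The per-epoch schedule described above can be sketched as follows. This is an illustrative outline under simplifying assumptions: the features are a fixed random stand-in (in real training they would be produced by the shared network and change at every step), and the names `gaussian_posterior`, `H_all`, and `y_all` are hypothetical.

```python
import numpy as np

def gaussian_posterior(H, y, m_prior, S_prior, tau=1.0):
    """Closed-form Gaussian posterior over last-layer weights; H is (d, n)."""
    P_prior = np.linalg.inv(S_prior)
    S = np.linalg.inv(P_prior + (H @ H.T) / tau**2)
    m = S @ (P_prior @ m_prior + (H @ y) / tau**2)
    return m, S

rng = np.random.default_rng(0)
d, n, batch_size = 4, 256, 32
w_true = rng.normal(size=d)
# Stand-in features and labels; in training, H would come from the shared network.
H_all = rng.normal(size=(d, n))
y_all = H_all.T @ w_true + rng.normal(size=n)

# First epoch: isotropic Gaussian prior.
m_prior, S_prior = np.zeros(d), np.eye(d)
for epoch in range(2):
    for start in range(0, n, batch_size):
        H_b = H_all[:, start:start + batch_size]
        y_b = y_all[start:start + batch_size]
        # Batch posterior: the quantity that would feed the gradient moments.
        m_b, S_b = gaussian_posterior(H_b, y_b, m_prior, S_prior)
    # End of epoch: the full-data posterior (isotropic prior) becomes the next prior.
    m_prior, S_prior = gaussian_posterior(H_all, y_all, np.zeros(d), np.eye(d))
```

Refreshing the prior only at epoch boundaries keeps the per-step cost at one small matrix inversion per task, which is the trade-off the paragraph above argues for.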

Aggregation step. Having an approximation for the gradient distribution of each task, we need to combine them to find an update direction for the shared parameters. Denote the mean of the gradient of the loss for task $k$ w.r.t. the hidden layer for the $i^{th}$ example by $\pmb{\mu}_{i}^{k}:=\mathbb{E}[\mathbf{g}_{i}^{k}]$, and similarly the covariance matrix by $\pmb{\Sigma}_{i}^{k}:=(\pmb{\Lambda}_{i}^{k})^{-1}:=\mathbb{E}[\mathbf{g}_{i}^{k}(\mathbf{g}_{i}^{k})^{T}]-\mathbb{E}[\mathbf{g}_{i}^{k}]\mathbb{E}[\mathbf{g}_{i}^{k}]^{T}$. We strive to find an update direction for the last shared layer, $\mathbf{g}_{i}$, that lies in a high-density region for all tasks. Hence, we pick the $\mathbf{g}_{i}$ that maximizes the following likelihood:

$$
\begin{aligned}
&\underset{\mathbf{g}_{i}}{\arg\max}\ \prod_{k=1}^{K}\mathcal{N}(\mathbf{g}_{i}|\pmb{\mu}_{i}^{k},\pmb{\Sigma}_{i}^{k})=\\
&\underset{\mathbf{g}_{i}}{\arg\min}\ -\sum_{k=1}^{K}\log\mathcal{N}(\mathbf{g}_{i}|\pmb{\mu}_{i}^{k},\pmb{\Sigma}_{i}^{k}).
\end{aligned}
\tag{5}
$$

Algorithm 1 BayesAgg-MTL (only the final step is recoverable here): compute the gradient w.r.t. the shared parameters via matrix multiplication, $\frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\mathbf{g}_{i}\frac{\partial\mathbf{h}_{i}}{\partial\pmb{\theta}}$.

Thankfully, the above optimization problem can be solved in closed form, yielding the following solution:

$$
\mathbf{g}_{i}=\left(\sum_{k=1}^{K}\pmb{\Lambda}_{i}^{k}\right)^{-1}\left(\sum_{k=1}^{K}\pmb{\Lambda}_{i}^{k}\pmb{\mu}_{i}^{k}\right). \tag{6}
$$
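A small numerical illustration of this closed-form solution, written as a sketch: the aggregated direction is the precision-weighted combination of the task means. The function name `aggregate_full` and the two-task example are assumptions made for illustration only.

```python
import numpy as np

def aggregate_full(mus, Sigmas):
    """Maximize prod_k N(g | mu_k, Sigma_k) over g: a precision-weighted mean."""
    Lambdas = [np.linalg.inv(Sigma) for Sigma in Sigmas]
    total_precision = sum(Lambdas)
    weighted_sum = sum(L @ mu for L, mu in zip(Lambdas, mus))
    # Solve (sum_k Lambda_k) g = sum_k Lambda_k mu_k instead of inverting explicitly.
    return np.linalg.solve(total_precision, weighted_sum)

# Two hypothetical tasks in a 2-D representation space.
mus = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
Sigmas = [np.diag([4.0, 1.0]), np.diag([1.0, 4.0])]

g = aggregate_full(mus, Sigmas)
print(g)                       # precision-weighted direction
print(np.mean(mus, axis=0))    # plain average, for comparison
```

In each dimension the task with the smaller variance dominates, so the result differs from the plain average whenever the covariances differ; with identical covariances the two coincide.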

However, we found that modeling the full covariance matrix can be numerically unstable and sensitive to noise in the gradient. Instead, we assume independence between the dimensions of $\mathbf{g}_{i}^{k}$ for all tasks, which results in diagonal covariance matrices with variances $(\pmb{\sigma}_{i}^{k})^{2}:=1/\pmb{\lambda}_{i}^{k}$. The update direction now becomes:

$$
\mathbf{g}_{i}=\sum_{k=1}^{K}\frac{1/(\pmb{\sigma}_{i}^{k})^{2}}{\sum_{k'=1}^{K}1/(\pmb{\sigma}_{i}^{k'})^{2}}\pmb{\mu}_{i}^{k}=\sum_{k=1}^{K}\overbrace{\frac{\pmb{\lambda}_{i}^{k}}{\sum_{k'=1}^{K}\pmb{\lambda}_{i}^{k'}}}^{\pmb{\alpha}_{i}^{k}}\pmb{\mu}_{i}^{k}, \tag{7}
$$

where the division and multiplication are done elementwise. In Eq. 7 we intentionally denote by $\pmb{\alpha}_{i}^{k}$ the vector of uncertainty-based weights that our method assigns to the mean gradient, to highlight that the weights are unique per task, dimension, and datum. The final modification to the method involves down-scaling the impact of the precision by a hyper-parameter $s\in(0,1]$; namely, we take $(\pmb{\lambda}_{i}^{k})^{s}$. Empirically, the scaling parameter helped to achieve better performance, perhaps due to misspecifications in the model (such as the diagonal Gaussian assumption over $\mathbf{g}_{i}^{k}$).
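The diagonal update with the tempering exponent $s$ can be sketched in a few lines. The function name `aggregate_diagonal`, the array layout (tasks along the first axis), and the example numbers are illustrative assumptions; a real implementation would apply this per example in the batch.

```python
import numpy as np

def aggregate_diagonal(mus, variances, s=1.0):
    """Per-dimension precision-weighted average of task gradient means.

    mus, variances: arrays of shape (K, d) for K tasks and d dimensions.
    s in (0, 1] tempers the precisions; smaller s moves toward a plain average.
    """
    mus = np.asarray(mus, dtype=float)
    lam = (1.0 / np.asarray(variances, dtype=float)) ** s
    alpha = lam / lam.sum(axis=0, keepdims=True)   # weights sum to 1 in each dimension
    return (alpha * mus).sum(axis=0), alpha

# Two hypothetical tasks, two dimensions.
mus = [[1.0, 0.0], [0.0, 1.0]]
variances = [[4.0, 1.0], [1.0, 4.0]]

g, alpha = aggregate_diagonal(mus, variances, s=1.0)
g_tempered, _ = aggregate_diagonal(mus, variances, s=0.5)
print(alpha)        # one weight per task and dimension
print(g)            # [0.2, 0.2]
print(g_tempered)   # closer to the plain average [0.5, 0.5]
```

Each entry of `alpha` is specific to a task and a dimension, which is the "higher resolution" weighting the text refers to; a single scalar weight per task would correspond to all columns of `alpha` being identical.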

With the aggregated gradient for each example, the back-propagation procedure proceeds as usual by averaging over all examples in the batch and then back-propagating the result to the shared parameters. To gain better intuition about the update rule of BayesAgg-MTL, consider the illustration in Figure 2. In the figure, we plot the mean update direction of two tasks along with their uncertainty. The first task is more sensitive to shifts in the vertical dimension and less so to shifts in the second (horizontal) dimension, while for the second task it is the opposite. By taking the variance information into account, BayesAgg-MTL can find an update direction that works well for both, compared to a simple average of the gradient means. We summarize our method in Algorithm 1.

Figure 2. BayesAgg-MTL update for a two-dimensional feature representation. Black arrows indicate the mean update direction of each task; the red arrow is the update direction of a simple average; the blue arrow is the proposed update direction. Darker colors in the contours represent regions with higher density.

Making predictions. Since we have a closed-form solution for the posterior of the task-specific parameters, BayesAgg-MTL does not learn this layer during training. Therefore, when making predictions we use the posterior mean, $\mathbf{m}^{k}$, computed on the full training set. We do so, instead of using full Bayesian inference, for a fair comparison with alternative MTL approaches and to have identical runtime and memory requirements when making predictions.

Connection to Nash-MTL. Navon et al. (2022) proposed a cooperative bargaining game approach to the gradient aggregation step, with the directional derivative as the utility of each player (task). They then proposed using the Nash bargaining solution, the direction that maximizes the product of all the utilities. One can consider Eq. 5 as the Nash bargaining solution with the utility of each task being its likelihood. However, unlike Navon et al. (2022), we get an analytical formula for the bargaining solution since the Gaussian exponent and the logarithm cancel out.

3.2. BayesAgg-MTL for Classification Tasks


We now turn to present our approach for classification tasks. When dealing with classification there are two sources of intractability that we need to overcome. The first is the posterior of $\mathbf{w}^{k}$ , and the second is estimating the moments of $\mathbf{g}_{i}^{k}$ . We describe our solution to both challenges next.


Posterior approximation. In classification tasks the likelihood is not Gaussian and, in general, we cannot compute the posterior in closed form. One common option is to approximate it using a Gaussian distribution and learn its parameters using a variational inference (VI) scheme (Saul et al., 1996; Neal & Hinton, 1998; Bishop, 2006). However, in our early experiments, we did not find it to work well without a computationally expensive VI optimization at each update step. As an alternative to VI, the Laplace approximation (MacKay, 1992) approximates the posterior as a Gaussian using a second-order Taylor expansion. Since the expansion is done at the optimal parameter values, which are learned point-wise, the Jacobian term in the expansion vanishes. Here, we follow a similar path; however, we cannot assume that the Jacobian is zero, as we are not near a stationary point during most of the training. Nevertheless, we can still find a Gaussian approximation. A similar derivation was proposed by Immer et al. (2021), yet they eventually ignored the first-order term. Denote by $\hat{\mathbf{w}}^{k}$ the learned point estimate for the task parameters, and let $\Delta\mathbf{w}^{k}:=\mathbf{w}^{k}-\hat{\mathbf{w}}^{k}$. Then, at each step of the training, by using Bayes' rule we can obtain a posterior approximation for $\mathbf{w}^{k}$ using the following:

$$
\begin{aligned}
-\log p(\mathbf{w}^{k}|\mathcal{B})\approx{}&-\log p(\hat{\mathbf{w}}^{k}|\mathcal{B})+\\
&\left(-\frac{\partial\log p(\mathbf{y}^{k}|\mathbf{X},\mathbf{w}^{k})}{\partial\mathbf{w}^{k}}-\frac{\partial\log p(\mathbf{w}^{k})}{\partial\mathbf{w}^{k}}\right)^{T}\Delta\mathbf{w}^{k}+\\
&\frac{1}{2}(\Delta\mathbf{w}^{k})^{T}\left(-\frac{\partial^{2}\log p(\mathbf{y}^{k}|\mathbf{X},\mathbf{w}^{k})}{\partial(\mathbf{w}^{k})^{2}}-\frac{\partial^{2}\log p(\mathbf{w}^{k})}{\partial(\mathbf{w}^{k})^{2}}\right)\Delta\mathbf{w}^{k}.
\end{aligned}
\tag{8}
$$

The above takes the form $c^{k}+(\mathbf{a}^{k})^{T}(\mathbf{w}^{k}-\hat{\mathbf{w}}^{k})+\frac{1}{2}(\mathbf{w}^{k}-\hat{\mathbf{w}}^{k})^{T}\mathbf{B}^{k}(\mathbf{w}^{k}-\hat{\mathbf{w}}^{k})$, where $\mathbf{a}^{k}\in\mathbb{R}^{d_{k}}$, $\mathbf{B}^{k}\in\mathbb{R}^{d_{k}\times d_{k}}$, and $c^{k}\in\mathbb{R}$ are known constants (the derivatives are evaluated at $\hat{\mathbf{w}}^{k}$). We stress again that, since we apply Bayesian inference to the last-layer parameters only, computing and inverting $\mathbf{B}^{k}$ typically does not incur a large computational overhead.

After rearranging and completing the square we obtain a quadratic form corresponding to the following Gaussian distribution (see full derivation in Appendix A.2):


$$
p(\mathbf{w}^{k}|\mathcal{B})\approx\mathcal{N}\big(\mathbf{w}^{k}|\hat{\mathbf{w}}^{k}-(\mathbf{B}^{k})^{-1}\mathbf{a}^{k},(\mathbf{B}^{k})^{-1}\big). \tag{9}
$$
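To illustrate how $\mathbf{a}^{k}$, $\mathbf{B}^{k}$, and the resulting Gaussian could be computed, here is a hedged sketch for one concrete case. The Bernoulli-logit likelihood, the zero-mean isotropic Gaussian prior, the function name `gaussian_posterior_approx`, and the toy data are all assumptions chosen for illustration; the text itself does not fix these choices.

```python
import numpy as np

def gaussian_posterior_approx(H, y, w_hat, prior_var=1.0):
    """Second-order expansion of the negative log posterior around w_hat.

    Assumes (for illustration) binary labels y in {0, 1}, a Bernoulli-logit
    likelihood, and a zero-mean isotropic Gaussian prior N(0, prior_var * I).
    H has shape (d, n): one column of features per example.
    """
    logits = H.T @ w_hat
    p = 1.0 / (1.0 + np.exp(-logits))
    # a: gradient of the negative log joint at w_hat (not assumed to be zero)
    a = H @ (p - y) + w_hat / prior_var
    # B: Hessian of the negative log joint at w_hat (positive definite here)
    B = (H * (p * (1.0 - p))) @ H.T + np.eye(len(w_hat)) / prior_var
    cov = np.linalg.inv(B)
    mean = w_hat - cov @ a
    return mean, cov, a, B

# Hypothetical batch: 3 features, 40 examples, current point estimate at zero.
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 40))
w_true = np.array([1.5, -1.0, 0.5])
y = (rng.random(40) < 1.0 / (1.0 + np.exp(-(H.T @ w_true)))).astype(float)
w_hat = np.zeros(3)

mean, cov, a, B = gaussian_posterior_approx(H, y, w_hat)
print(mean)   # shifted away from w_hat because the gradient term a is nonzero
```

The mean coincides with a single Newton step from `w_hat` on the negative log posterior; if `w_hat` were already at the mode, `a` would vanish and the sketch would reduce to the usual Laplace approximation.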


Original paper: 2305.11013v1.pdf
