Bayesian Uncertainty for Gradient Aggregation in Multi-Task Learning
多任务学习中梯度聚合的贝叶斯不确定性
Abstract
摘要
As machine learning becomes more prominent, there is a growing demand to perform several inference tasks in parallel. Running a dedicated model for each task is computationally expensive, and therefore there is great interest in multi-task learning (MTL). MTL aims at learning a single model that solves several tasks efficiently. Optimizing MTL models is often achieved by first computing a single gradient per task and then aggregating the gradients to obtain a combined update direction. However, this approach does not consider an important aspect: the sensitivity in the gradient dimensions. Here, we introduce a novel gradient aggregation approach using Bayesian inference. We place a probability distribution over the task-specific parameters, which in turn induces a distribution over the gradients of the tasks. This additional valuable information allows us to quantify the uncertainty in each of the gradient dimensions, which can then be factored in when aggregating them. We empirically demonstrate the benefits of our approach on a variety of datasets, achieving state-of-the-art performance.
随着机器学习日益普及,对并行执行多个推理任务的需求不断增长。为每个任务运行专用模型的计算成本高昂,因此多任务学习(MTL)备受关注。MTL旨在学习一个能高效解决多个任务的单一模型。优化MTL模型通常通过先计算每个任务的单一梯度,再聚合梯度以获得组合更新方向来实现。但这种方法忽略了一个重要因素——梯度维度的敏感性。本文提出一种基于贝叶斯推理的新型梯度聚合方法:我们在任务特定参数上建立概率分布,从而推导出任务梯度的分布。这些额外信息使我们能量化各梯度维度的不确定性,进而在聚合时加以考量。我们通过多种数据集的实证研究证明了该方法的优势,实现了最先进的性能。
1. Introduction
1. 引言
In many application domains, there is a need to perform several machine learning inference tasks simultaneously. For instance, an autonomous vehicle needs to identify and detect objects in its vicinity, perform lane detection, track the movements of other vehicles over time, and predict free space around it, all in parallel and in real-time. In deep Multi-Task Learning (MTL) the goal is to train a single neural network (NN) to solve several tasks simultaneously, thus avoiding the need to have one dedicated model for each task (Caruana, 1997). Besides reducing the computational demand at test time, MTL also has the potential to improve generalization (Baxter, 2000). It is therefore not surprising that applications of MTL are taking central roles in various domains, such as vision (Achituve et al., 2021a; Shamshad et al., 2023; Zheng et al., 2023), natural language processing (Liu et al., 2019b; Zhou et al., 2023), speech (Michelsanti et al., 2021), robotics (Devin et al., 2017; Shu et al., 2018), and general scientific problems (Wu et al., 2018), to name a few.
在许多应用领域中,需要同时执行多个机器学习推理任务。例如,自动驾驶车辆需要实时并行地识别和检测周围物体、进行车道检测、跟踪其他车辆的运动轨迹并预测周围可用空间。深度多任务学习 (Multi-Task Learning, MTL) 的目标是通过训练单一神经网络 (NN) 来同时解决多个任务,从而避免为每个任务单独训练专用模型 (Caruana, 1997)。除了降低测试时的计算需求,MTL 还具有提升泛化能力的潜力 (Baxter, 2000)。因此,MTL 应用在多个领域占据核心地位并不令人意外,例如视觉 (Achituve et al., 2021a; Shamshad et al., 2023; Zheng et al., 2023)、自然语言处理 (Liu et al., 2019b; Zhou et al., 2023)、语音 (Michelsanti et al., 2021)、机器人 (Devin et al., 2017; Shu et al., 2018) 以及一般科学问题 (Wu et al., 2018) 等。
However, optimizing multiple tasks simultaneously is a challenging problem that may lead to degradation in performance compared to learning them individually (Standley et al., 2020; Yu et al., 2020). To address this issue, one basic formula that many MTL optimization algorithms follow is to first calculate the gradient of each task's loss, and then aggregate these gradients according to some specified scheme. For example, several studies focus on reducing conflicts between the gradients before averaging them (Yu et al., 2020; Wang et al., 2020), others find a convex combination with minimal norm (Sener & Koltun, 2018; Désidéri, 2012), and some use a game-theoretic approach (Navon et al., 2022). However, by relying only on the gradient, these methods miss an important aspect: the sensitivity of the gradient in each dimension.
然而,同时优化多个任务是一个具有挑战性的问题,与单独学习相比可能导致性能下降 (Standley et al., 2020; Yu et al., 2020)。为解决这一问题,许多多任务学习 (MTL) 优化算法遵循的基本公式是:首先计算每个任务损失的梯度,然后根据特定方案聚合这些梯度。例如,部分研究聚焦于在梯度平均前减少梯度冲突 (Yu et al., 2020; Wang et al., 2020),另一些研究则寻找最小范数的凸组合 (Sener & Koltun, 2018; Désidéri, 2012),还有研究采用博弈论方法 (Navon et al., 2022)。但这些仅依赖梯度的方法忽略了一个重要方面:梯度在各维度上的敏感性。
Our approach builds on the following observation: for each task, there may be many “good” parameter configurations. Standard MTL optimization methods take only a single value into account, and as such lose information in the aggregation step. Hence, tracking all of the parameter configurations will yield a richer description of the gradient space that can be advantageous when finding an update direction. Specifically, to account for all parameter values, we propose to place a probability distribution over the task-specific parameters, which in turn induces a probability distribution over the gradients. As a result, we obtain uncertainty estimates for the gradients that reflect the sensitivity in each of their dimensions. High-uncertainty dimensions are more lenient to changes, while dimensions with lower uncertainty are stricter (see illustration in Figure 2).
我们的方法基于以下观察:对于每个任务,可能存在许多“良好”的参数配置。传统的多任务学习(MTL)优化方法仅考虑单一取值,因此在聚合步骤中会丢失信息。因此,追踪所有参数配置将为梯度空间提供更丰富的描述,从而在寻找更新方向时更具优势。具体而言,为涵盖所有参数值,我们提出在任务特定参数上建立概率分布,进而诱导出梯度的概率分布。由此获得的梯度不确定性估计能够反映各维度敏感性——高不确定性维度对变化的容忍度更高,而低不确定性维度则更为严格(示意图见图2)。
To obtain a probability distribution over the task-specific parameters we take a Bayesian approach. According to the Bayesian view, a posterior distribution over parameters of interest can be derived through Bayes rule. In MTL, it is common to use a shared feature extractor network with linear task-specific layers (Ruder, 2017). Hence, if we assume a Bayesian model over the last task-specific layer weights during the back-propagation process, we obtain the posterior distributions over them. The posterior is then used to compute a Gaussian distribution over the gradients by means of moment matching. Then, to derive an update direction for the shared network, we design a novel aggregation scheme that considers the full distributions of the gradients. We name our method BayesAgg-MTL. An important implication of our approach is that BayesAgg-MTL assigns weights to the gradients at a higher resolution compared to existing methods, allocating a specific weight for each dimension and datum in the batch. We demonstrate our method's effectiveness over baseline methods on the MTL benchmarks QM9 (Ramakrishnan et al., 2014), CIFAR-100 (Krizhevsky et al., 2009), ChestX-ray14 (Wang et al., 2017), and UTKFace (Zhang et al., 2017).
为获取任务特定参数的概率分布,我们采用贝叶斯方法。根据贝叶斯观点,可通过贝叶斯规则推导出目标参数的后验分布。在多任务学习(MTL)中,通常使用共享特征提取网络配合线性任务特定层(Ruder, 2017)。因此,若在反向传播过程中对最后一个任务特定层权重建立贝叶斯模型,即可获得其对应的后验分布。该后验分布随后通过矩匹配法用于计算梯度的高斯分布。接着,为推导共享网络的更新方向,我们设计了一种考虑梯度完整分布的新型聚合方案,并将其命名为BayesAgg-MTL。该方法的重要特性在于:相比现有方法,BayesAgg-MTL能以更高精度为梯度分配权重,为批次中的每个维度和数据点指定专属权重。我们在QM9(Ramakrishnan等, 2014)、CIFAR-100(Krizhevsky等, 2009)、ChestX-ray14(Wang等, 2017)和UTKFace(Zhang等, 2017)等MTL基准测试中验证了本方法相对于基线模型的有效性。
In summary, this paper makes the following novel contributions: (1) The first Bayesian formulation of gradient aggregation for Multi-Task Learning. (2) A novel posterior approximation based on a second-order Taylor expansion. (3) A new MTL optimization algorithm based on our posterior estimation. (4) New state-of-the-art results on several MTL benchmarks compared to leading methods. Our code is publicly available at https://github.com/ssi-research/BayesAgg_MTL.
总之,本文做出了以下新颖贡献:(1) 首次提出了多任务学习中梯度聚合的贝叶斯公式。(2) 基于二阶泰勒展开的新型后验近似方法。(3) 基于我们后验估计的新多任务学习优化算法。(4) 在多个多任务学习基准测试中相比领先方法取得了新的最优结果。我们的代码公开在 https://github.com/ssi-research/BayesAgg_MTL。
2. Background
2. 背景
Notations. Scalars, vectors, and matrices are denoted with lower-case letters (e.g., $x$), bold lower-case letters (e.g., $\mathbf{x}$), and bold upper-case letters (e.g., $\mathbf{X}$), respectively. All vectors are treated as column vectors. Training samples are tuples consisting of shared features across all tasks and labels of $K$ tasks, namely $(\mathbf{x},\{\mathbf{y}^{k}\}_{k=1}^{K})\sim\mathcal{D}$, where $\mathcal{D}$ denotes the training set. We denote the dimensionality of the input and of the output of task $k$ by $d_{\mathbf{x}}$ and $o_{k}$, respectively.
符号约定。标量、向量和矩阵分别用小写字母(如 $x$)、粗体小写字母(如 $\mathbf{x}$)和粗体大写字母(如 $\mathbf{X}$)表示。所有向量均视为列向量。训练样本是由所有任务共享的特征和 $K$ 个任务标签组成的元组,即 $(\mathbf{x},\{\mathbf{y}^{k}\}_{k=1}^{K})\sim\mathcal{D}$,其中 $\mathcal{D}$ 表示训练集。我们将输入的维度和任务 $k$ 的输出维度分别记为 $d_{\mathbf{x}}$ 和 $o_{k}$。
In this study, we focus on common NN architectures for MTL having a shared feature extractor and linear task-specific heads (Kendall et al., 2018; Sener & Koltun, 2018). The model parameters are denoted by $\{\pmb{\theta},\{\mathbf{w}^{k}\}_{k=1}^{K}\}$, where $\pmb{\theta}\in\mathbb{R}^{d_{\theta}}$ is the vector of shared parameters and $\{\mathbf{w}^{k}\}_{k=1}^{K}$ are the task-specific parameter vectors, each lying in $\mathbb{R}^{d_{k}}$. The last shared feature representation is denoted by the vector $\mathbf{h}(\mathbf{x};\pmb{\theta})\in\mathbb{R}^{d_{\mathbf{h}}}$. Hence, the output of the network for task $k$ can be described as $\mathbf{f}^{k}(\mathbf{h}(\mathbf{x};\pmb{\theta});\mathbf{w}^{k})$. The loss of task $k\in\{1,\dots,K\}$ is denoted by $\ell^{k}(\mathbf{x},\mathbf{y};\pmb{\theta},\mathbf{w}^{k})$. The gradient of the loss $\ell^{k}$ w.r.t. $\mathbf{h}(\mathbf{x};\pmb{\theta})$ is $\mathbf{g}^{k}:=\frac{\partial\ell^{k}}{\partial\mathbf{h}(\mathbf{x};\pmb{\theta})}(\mathbf{x},\mathbf{y};\pmb{\theta},\mathbf{w}^{k})\in\mathbb{R}^{d_{\mathbf{h}}}$. For clarity of exposition, function dependence on input variables will be omitted from now on.
在本研究中,我们重点探讨具有共享特征提取器和线性任务专用头 (task-specific heads) 的多任务学习 (MTL) 常见神经网络架构 (Kendall et al., 2018; Sener & Koltun, 2018)。模型参数集为 $\{\pmb{\theta},\{\mathbf{w}^{k}\}_{k=1}^{K}\}$,其中 $\pmb{\theta}\in\mathbb{R}^{d_{\theta}}$ 是共享参数向量,$\{\mathbf{w}^{k}\}_{k=1}^{K}$ 是各任务专用参数向量,每个向量位于 $\mathbb{R}^{d_{k}}$ 空间。最后的共享特征表示由向量 $\mathbf{h}(\mathbf{x};\pmb{\theta})\in\mathbb{R}^{d_{\mathbf{h}}}$ 表示。因此,网络对任务 $k$ 的输出可描述为 $\mathbf{f}^{k}(\mathbf{h}(\mathbf{x};\pmb{\theta});\mathbf{w}^{k})$。任务 $k\in\{1,\dots,K\}$ 的损失函数记为 $\ell^{k}(\mathbf{x},\mathbf{y};\pmb{\theta},\mathbf{w}^{k})$。损失 $\ell^{k}$ 关于 $\mathbf{h}(\mathbf{x};\pmb{\theta})$ 的梯度为 $\mathbf{g}^{k}:=\frac{\partial\ell^{k}}{\partial\mathbf{h}(\mathbf{x};\pmb{\theta})}(\mathbf{x},\mathbf{y};\pmb{\theta},\mathbf{w}^{k})\in\mathbb{R}^{d_{\mathbf{h}}}$。为表述清晰,下文将省略函数对输入变量的显式依赖。

Figure 1. BayesAgg-MTL assumes a probability distribution over the last layer parameters of each task. It first maps these distributions to the space of the last shared representation. Then an update direction is found for the shared representation based on the mean and variance of all distributions (denoted by $\mathbf{X}$ ).
图 1: BayesAgg-MTL 假设每个任务的最后一层参数服从概率分布。该方法首先将这些分布映射到最后共享表征空间,然后根据所有分布的均值与方差 (记为 $\mathbf{X}$ ) 计算出共享表征的更新方向。
2.1. Multi-Task Learning
2.1. 多任务学习
A prevailing approach to optimizing MTL models goes as follows. First, the gradient of each task loss is computed. Second, an aggregation rule is applied to combine the gradients according to some algorithm. Lastly, an update step is performed using the outcome of the aggregation step. Commonly, the aggregation rule operates on the gradients of the loss w.r.t. all parameters, or only the shared parameters (e.g., Yu et al., 2020; Navon et al., 2022; Shamsian et al., 2023). Alternatively, to avoid a costly full back-propagation process for each task, some methods suggest applying it to the gradients w.r.t. the last shared representation (e.g., Sener & Koltun, 2018; Liu et al., 2020; Senushkin et al., 2023). Here, to make our method fast and scalable, we take the latter approach and note that it could be extended to full gradient aggregation.
优化多任务学习(MTL)模型的流行方法通常遵循以下步骤:首先计算每个任务损失的梯度,然后根据特定算法施加聚合规则来组合这些梯度,最后使用聚合结果执行参数更新。常见的聚合规则作用于损失对参数的梯度(或仅共享参数梯度)(如 Yu et al., 2020; Navon et al., 2022; Shamsian et al., 2023)。为避免对每个任务执行昂贵的完整反向传播过程,部分方法建议在最后共享表征层进行聚合(如 Sener & Koltun, 2018; Liu et al., 2020; Senushkin et al., 2023)。为保持方法的高效性和可扩展性,我们采用后者方案,并指出该方法可扩展至完整梯度聚合场景。
2.2. Bayesian Inference
2.2. 贝叶斯推断 (Bayesian Inference)
We wish to incorporate uncertainty estimates for the gradients into the aggregation procedure. Doing so will allow us to find an update direction that takes into account the importance of each gradient dimension for each task. A natural choice to model uncertainty is using Bayesian inference. Since we would like to get uncertainty estimates w.r.t the last shared hidden layer, we treat only the last task-specific layer as a Bayesian layer. This “Bayesian last layer” approach is a common way to scale Bayesian inference to deep neural networks (Snoek et al., 2015; Calandra et al., 2016; Wilson et al., 2016a; Achituve et al., 2021c). We will now present some of the main concepts of Bayesian modeling that will be used as part of our method.
我们希望将梯度的不确定性估计纳入聚合流程。这样做能让我们找到一种更新方向,该方向会考虑每个任务中各个梯度维度的重要性。建模不确定性的自然选择是使用贝叶斯推断 (Bayesian inference)。由于我们希望获得与最后一个共享隐藏层相关的不确定性估计,因此仅将最后一个任务特定层视为贝叶斯层。这种"贝叶斯最后一层"方法是将贝叶斯推断扩展到深度神经网络的常用方式 (Snoek et al., 2015; Calandra et al., 2016; Wilson et al., 2016a; Achituve et al., 2021c)。接下来我们将介绍贝叶斯建模的一些主要概念,这些概念将作为我们方法的一部分。
For simplicity, assume a single output variable. We also drop the task notation for clarity. According to the Bayesian paradigm, instead of treating the parameters $\mathbf{w}$ as deterministic values that need to be optimized, they are treated as random variables, i.e., there is a distribution over the parameters. The posterior distribution for $\mathbf{w}$, after observing the data, is given by Bayes rule as
为简化起见,假设只有一个输出变量。为了清晰起见,我们也省略了任务符号。根据贝叶斯范式,参数w不被视为需要优化的确定值,而是被视为随机变量,即参数存在一个分布。在观测到数据后,w的后验分布通过贝叶斯规则给出为
$$
\log p(\mathbf{w}|\mathcal{D})\propto\log p(\mathbf{y}|\mathbf{X},\mathbf{w})+\log p(\mathbf{w}).
\tag{1}
$$
Predictions in Bayesian inference are given by taking the expected prediction with respect to the posterior distribution. In general, the Bayesian inference procedure for $\mathbf{w}$ is intractable. However, for some specific scenarios, there exists an analytic solution. For example, in linear regression, if we assume a Gaussian likelihood with a fixed, independent scalar noise $\tau$ between the observations, $p(\mathbf{y}|\{\mathbf{x}_{i}\}_{i=1}^{|\mathcal{D}|},\mathbf{w})=\prod_{i=1}^{|\mathcal{D}|}\mathcal{N}(y_{i}|\mathbf{x}_{i}^{T}\mathbf{w},\tau^{2})$, and a Gaussian prior $p(\mathbf{w})=\mathcal{N}(\mathbf{w}|\mathbf{m}_{p},\mathbf{S}_{p})$, then
贝叶斯推断中的预测是通过对后验分布取期望预测得到的。一般来说,针对 $\mathbf{w}$ 的贝叶斯推断过程是难以处理的。但在某些特定场景下,存在解析解。例如在线性回归中,若假设观测值间存在固定独立标量噪声 $\tau$ 的高斯似然 $p(\mathbf{y}|\{\mathbf{x}_{i}\}_{i=1}^{|\mathcal{D}|},\mathbf{w})=\prod_{i=1}^{|\mathcal{D}|}\mathcal{N}(y_{i}|\mathbf{x}_{i}^{T}\mathbf{w},\tau^{2})$,并采用高斯先验 $p(\mathbf{w})=\mathcal{N}(\mathbf{w}|\mathbf{m}_{p},\mathbf{S}_{p})$,那么
$$
\begin{aligned}
p(\mathbf{w}|\mathcal{D})&=\mathcal{N}(\mathbf{w}|\mathbf{m},\mathbf{S}),\\
\mathbf{m}&=\mathbf{S}\left(\mathbf{S}_{p}^{-1}\mathbf{m}_{p}+\tau^{-2}\mathbf{X}\mathbf{y}\right),\\
\mathbf{S}&=\left(\mathbf{S}_{p}^{-1}+\tau^{-2}\mathbf{X}\mathbf{X}^{T}\right)^{-1}.
\end{aligned}
\tag{2}
$$
Here $\mathbf{X}\in\mathbb{R}^{d_{\mathbf{x}}\times|\mathcal{D}|}$ is the matrix that results from stacking the vectors $\{\mathbf{x}_{i}\}_{i=1}^{|\mathcal{D}|}$. Similarly, we denote by $\mathbf{H}\in\mathbb{R}^{d_{\mathbf{h}}\times|\mathcal{D}|}$ the matrix that results from stacking the hidden representation vectors. In the specific case of deep NNs with a Bayesian last layer we get the same inference result, only with $\mathbf{H}$ replacing $\mathbf{X}$. Going beyond a single output variable entails defining a covariance matrix for the noise model. However, in this study we assume independence between the output variables in these cases.
这里 $\mathbf{X}\in\mathbb{R}^{d_{\mathbf{x}}\times|\mathcal{D}|}$ 是通过堆叠向量 $\{\mathbf{x}_{i}\}_{i=1}^{|\mathcal{D}|}$ 得到的矩阵。类似地,我们用 $\mathbf{H}\in\mathbb{R}^{d_{\mathbf{h}}\times|\mathcal{D}|}$ 表示通过堆叠隐藏表示向量得到的矩阵。在具有贝叶斯最后一层的深度神经网络 (NN) 的特定情况下,我们仅通过用 $\mathbf{H}$ 替换 $\mathbf{X}$ 就能得到相同的推断结果。扩展到单一输出变量之外需要为噪声模型定义一个协方差矩阵。然而,在本研究中我们假设这些情况下输出变量之间是独立的。
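The closed-form posterior in Eq. 2 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `gaussian_posterior` and its argument names are hypothetical, and explicit matrix inverses are used only for readability.

```python
import numpy as np

def gaussian_posterior(X, y, m_p, S_p, tau=1.0):
    """Posterior N(m, S) over linear-regression weights, following Eq. 2.

    X   : (d, N) matrix whose columns are the inputs (or hidden features H)
    y   : (N,) vector of targets
    m_p : (d,) prior mean, S_p : (d, d) prior covariance
    tau : observation noise standard deviation
    """
    prior_precision = np.linalg.inv(S_p)
    # S = (S_p^{-1} + tau^{-2} X X^T)^{-1}
    S = np.linalg.inv(prior_precision + (X @ X.T) / tau**2)
    # m = S (S_p^{-1} m_p + tau^{-2} X y)
    m = S @ (prior_precision @ m_p + (X @ y) / tau**2)
    return m, S

# Toy usage: three noiseless observations of a 2-D linear model.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
w_true = np.array([1.0, -2.0])
m, S = gaussian_posterior(X, X.T @ w_true, np.zeros(2), 1e6 * np.eye(2))
```

With a very broad prior, as in the toy usage, the posterior mean approaches the least-squares solution. For the Bayesian last layer described above, the same function would be called with $\mathbf{H}$ in place of $\mathbf{X}$.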
Unlike regression, in classification the likelihood is not a Gaussian, and the posterior can only be approximated. The common choice is to use variational inference (Wilson et al., 2016b; Achituve et al., 2021b; 2023), although there are other alternatives as well (Kristiadi et al., 2020).
与回归不同,分类中的似然并非高斯分布,后验只能近似求解。通常选择使用变分推断 (variational inference) [Wilson et al., 2016b; Achituve et al., 2021b; 2023],尽管也存在其他替代方案 [Kristiadi et al., 2020]。
3. Method
3. 方法
We start with an outline of the problem and our approach. Consider a deep network for multi-task learning that has a shared feature extractor part and task-specific linear layers. We propose to use Bayesian inference on the last layer as a means to train deterministic MTL models. For each task $k$ , we define a Bayesian probabilistic model representing the uncertainty over the linear weights of the last, task-specific layer $\mathbf{w}^{k}$ . The distribution over weights induces a distribution over gradients of the loss with respect to the last shared hidden layer. Given these per-task distributions on a joint space, we propose an aggregation rule for combining the gradients of the tasks to a shared update direction that takes into account the uncertainty in the gradients (see illustration in Figure 1). Then, the back-propagation process can proceed as usual.
我们从问题概述和方法思路入手。考虑一个用于多任务学习的深度网络,该网络包含共享特征提取器和任务特定的线性层。我们提出在最后一层应用贝叶斯推断 (Bayesian inference) 作为训练确定性多任务学习模型的方法。对于每个任务 $k$,我们定义一个贝叶斯概率模型,用于表示最后一层任务特定线性权重 $\mathbf{w}^{k}$ 的不确定性。权重分布会引发损失函数对最后一个共享隐藏层梯度的分布。给定联合空间中这些每任务的分布,我们提出一种聚合规则,将各任务的梯度结合到一个考虑梯度不确定性的共享更新方向中 (图示见图 1)。随后,反向传播过程可照常进行。
Since regression and classification setups yield different inference procedures according to our approach, albeit having the same general framework, we discuss the two setups separately, starting with regression.
由于回归和分类设置根据我们的方法会产生不同的推断过程,尽管它们具有相同的总体框架,我们仍分别讨论这两种设置,从回归开始。
3.1. BayesAgg-MTL for Regression Tasks
3.1. 回归任务中的BayesAgg-MTL方法
Consider a standard squared loss for task $k$, $\ell^{k}=(y^{k}-\hat{y}^{k})^{2}$, between the label $y^{k}$ and the network output $\hat{y}^{k}$. Given a random batch of examples $B\sim\mathcal{D}$, the gradient of the loss with respect to the hidden layer $\mathbf{h}$ for the $i^{\text{th}}$ example is
考虑任务 $k$ 的标准平方损失 $\ell^{k}=(y^{k}-\hat{y}^{k})^{2}$ ,其中 $y^{k}$ 为标签, $\hat{y}^{k}$ 为网络输出。给定随机批样本 $B\sim\mathcal{D}$ ,第 $i$ 个样本的损失对隐藏层 $\mathbf{h}$ 的梯度为
$$
\mathbf{g}_{i}^{k}=\frac{\partial\ell_{i}^{k}}{\partial\hat{y}_{i}^{k}}\frac{\partial\hat{y}_{i}^{k}}{\partial\mathbf{h}_{i}}=2\mathbf{w}^{k}\left(\mathbf{h}_{i}^{T}\mathbf{w}^{k}-y_{i}^{k}\right).
\tag{3}
$$
Our main observation is that $\mathbf{g}_{i}^{k}$ is a function of $\mathbf{w}^{k}$. Hence, if we view $\mathbf{w}^{k}$ in the back-propagation process as a random variable, then $\mathbf{g}_{i}^{k}$ will be a random variable as well. This view will allow us to capture the uncertainty in the task gradient. Since the dimension of the hidden layer is usually small compared to the dimension of all shared parameters, operations in this space, such as matrix inversion, should not be costly.
我们的主要观察是,$\mathbf{g}_{i}^{k}$ 是 $\mathbf{w}^{k}$ 的函数。因此,如果我们将反向传播过程中的 $\mathbf{w}^{k}$ 视为随机变量,那么 $\mathbf{g}_{i}^{k}$ 也将是一个随机变量。这种观点将使我们能够捕捉任务梯度中的不确定性。由于隐藏层的维度通常小于所有共享参数的维度,因此在该空间中的操作(如矩阵求逆)应该不会带来高昂的计算成本。
If we fix all the shared parameters, then the posterior over $\mathbf{w}^{k}$ has a Gaussian distribution with known parameters via Eq. 2. As $\mathbf{g}_{i}^{k}$ is quadratic in $\mathbf{w}^{k}$, it has a generalized chi-squared distribution (Davies, 1973). However, since this distribution does not admit a closed-form density function, and since the gradient aggregation needs to be efficient as we run it at each iteration, we approximate $\mathbf{g}_{i}^{k}$ as a Gaussian distribution. The optimal choice for the parameters of this Gaussian is given by matching its first two moments to those of the true density, as these parameters minimize the Kullback–Leibler divergence between the two distributions (Minka, 2001). Luckily, in the regression case, we can derive the first two moments from the posterior over $\mathbf{w}^{k}$,
如果我们固定所有共享参数,那么通过方程2可知 $\mathbf{w}^{k}$ 的后验服从参数已知的高斯分布。由于 $\mathbf{g}_{i}^{k}$ 是 $\mathbf{w}^{k}$ 的二次型,它服从广义卡方分布 (Davies, 1973)。但由于该分布没有闭式密度函数,且梯度聚合需要在每次迭代时高效运行,我们将 $\mathbf{g}_{i}^{k}$ 近似为高斯分布。该高斯分布参数的最优选择是通过使其前两阶矩与真实密度相匹配来确定,因为这些参数能最小化两个分布之间的Kullback-Leibler散度 (Minka, 2001)。幸运的是,在回归情况下,我们可以从 $\mathbf{w}^{k}$ 的后验推导出前两阶矩。
where $\mathbf{A}_{i}=\mathbf{h}_{i}\mathbf{h}_{i}^{T}$, $\mathbf{M}^{k}=\mathbf{m}^{k}(\mathbf{m}^{k})^{T}$, we assumed $\tau=1$, and $\operatorname{Tr}(\cdot)$ is the matrix trace. We emphasize that this approximation is for the gradient of a single data point and a single task, not for the gradient of the task with respect to the entire batch. The full derivation is presented in Appendix A.1.
其中 $\mathbf{A}_{i}=\mathbf{h}_{i}\mathbf{h}_{i}^{T}$,$\mathbf{M}^{k}=\mathbf{m}^{k}(\mathbf{m}^{k})^{T}$,假设 $\tau=1$,且 $\operatorname{Tr}(\cdot)$ 表示矩阵迹。需要强调的是,该近似是针对单个数据点和单个任务的梯度,而非整个批次任务的梯度。完整推导见附录A.1。
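As a sketch of how such moments arise, assume $\mathbf{w}^{k}\sim\mathcal{N}(\mathbf{m}^{k},\mathbf{S}^{k})$ and write the gradient of Eq. 3 as $\mathbf{g}_{i}^{k}=2\mathbf{w}^{k}(\mathbf{w}^{k})^{T}\mathbf{h}_{i}-2y_{i}^{k}\mathbf{w}^{k}$. The expressions below are our own derivation using $\mathbb{E}[\mathbf{w}\mathbf{w}^{T}]=\mathbf{S}+\mathbf{M}$ and Isserlis' theorem for the fourth-order Gaussian moments. The shorthand $c_{i}^{k}$ is introduced here for illustration, and the arrangement of terms may differ from the authors' Appendix A.1.

$$
\begin{aligned}
c_{i}^{k}&:=\mathbf{h}_{i}^{T}\mathbf{m}^{k}-y_{i}^{k},\\
\mathbb{E}[\mathbf{g}_{i}^{k}]&=2(\mathbf{S}^{k}+\mathbf{M}^{k})\mathbf{h}_{i}-2y_{i}^{k}\mathbf{m}^{k},\\
\mathbb{E}[\mathbf{g}_{i}^{k}(\mathbf{g}_{i}^{k})^{T}]&=4\Big[\big((c_{i}^{k})^{2}+\operatorname{Tr}(\mathbf{A}_{i}\mathbf{S}^{k})\big)(\mathbf{S}^{k}+\mathbf{M}^{k})+2\,\mathbf{S}^{k}\mathbf{A}_{i}\mathbf{S}^{k}+2c_{i}^{k}\big(\mathbf{m}^{k}\mathbf{h}_{i}^{T}\mathbf{S}^{k}+\mathbf{S}^{k}\mathbf{h}_{i}(\mathbf{m}^{k})^{T}\big)\Big].
\end{aligned}
$$

As a sanity check in one dimension with $h=1$ and $y=0$, the second line reduces to $4\,\mathbb{E}[w^{4}]=4(m^{4}+6m^{2}s^{2}+3s^{4})$, the familiar fourth moment of a Gaussian with mean $m$ and variance $s^{2}$.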
Several points deserve attention here. First, note the similarity between the solution of the first moment and the gradient obtained via the standard back-propagation. The two differences are that the last layer parameters, $\mathbf{w}^{k}$ , are replaced with the posterior mean, $\mathbf{m}^{k}$ , and an uncertainty term was added. In the extreme case of $\mathbf{S}^{k}\to0$ and $\mathbf{m}^{k}\rightarrow\mathbf{w}^{k}$ , the mean coincides with that of the standard back-propagation. Second, in the case of a multi-output task, following our independence assumption between output variables, we can obtain the moments for each output dimension separately using the same procedure, so de facto we treat each output as a different task. Finally, during training, the shared parameters are constantly being updated. Hence, to compute the posterior distribution for $\mathbf{w}^{k}$ we need to iterate over the entire dataset at each update step. In practice, this can make our method computationally expensive. Therefore, we use the current batch data only to approximate the posterior over $\mathbf{w}^{k}$ , and introduce information about the full dataset through the prior as described next.
这里有几个要点值得注意。首先,注意到一阶矩解与标准反向传播获得的梯度之间的相似性。两者差异在于:最后一层参数 $\mathbf{w}^{k}$ 被替换为后验均值 $\mathbf{m}^{k}$,并增加了不确定性项。在极端情况 $\mathbf{S}^{k}\to0$ 且 $\mathbf{m}^{k}\rightarrow\mathbf{w}^{k}$ 时,该均值与标准反向传播的结果一致。其次,对于多输出任务,根据输出变量间的独立性假设,我们可以通过相同流程分别计算每个输出维度的矩量,因此实际上将每个输出视为独立任务处理。最后,训练过程中共享参数持续更新,这意味着要计算 $\mathbf{w}^{k}$ 的后验分布,需要在每个更新步骤遍历整个数据集。实践中这会导致计算成本高昂,因此我们仅使用当前批次数据来近似 $\mathbf{w}^{k}$ 的后验分布,并通过下文所述的先验信息引入完整数据集的信息。
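The relationship between the first moment and the standard back-propagated gradient can be checked numerically. The sketch below assumes $\mathbf{w}^{k}\sim\mathcal{N}(\mathbf{m}^{k},\mathbf{S}^{k})$ and uses closed forms that we derived with Isserlis' theorem (they are not copied from the paper). It compares them against Gauss-Hermite quadrature, which is exact here because each entry of $\mathbf{g}\mathbf{g}^{T}$ is a degree-4 polynomial in the Gaussian variables. All function names are hypothetical.

```python
import itertools
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gradient(w, h, y):
    """Per-example gradient of (y - h^T w)^2 w.r.t. h (Eq. 3)."""
    return 2.0 * w * (h @ w - y)

def gradient_moments(m, S, h, y):
    """Closed-form E[g] and E[g g^T] for w ~ N(m, S) (our derivation)."""
    M, A = np.outer(m, m), np.outer(h, h)
    c = h @ m - y
    mean = 2.0 * (S + M) @ h - 2.0 * y * m
    cross = np.outer(m, S @ h)  # m h^T S, using the symmetry of S
    second = 4.0 * ((c**2 + np.trace(A @ S)) * (S + M)
                    + 2.0 * S @ A @ S
                    + 2.0 * c * (cross + cross.T))
    return mean, second

def quadrature_moments(m, S, h, y, n=3):
    """Same moments by tensor-product Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(n)
    weights = weights / weights.sum()  # normalise to a probability measure
    L = np.linalg.cholesky(S)          # w = m + L z with z ~ N(0, I)
    d = len(m)
    mean, second = np.zeros(d), np.zeros((d, d))
    for idx in itertools.product(range(n), repeat=d):
        idx = list(idx)
        g = gradient(m + L @ nodes[idx], h, y)
        p = np.prod(weights[idx])
        mean += p * g
        second += p * np.outer(g, g)
    return mean, second

m = np.array([0.5, -1.0])
S = np.array([[0.3, 0.1], [0.1, 0.2]])
h, y = np.array([1.0, 2.0]), 0.7
mean_cf, second_cf = gradient_moments(m, S, h, y)
mean_q, second_q = quadrature_moments(m, S, h, y)
```

Setting `S` to the zero matrix in `gradient_moments` makes the mean equal to `gradient(m, h, y)`, which is the limiting case described above.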
Prior selection. A common choice in Bayesian deep learning is to choose uninformative priors, such as a standard Gaussian, to let the data be the main influence on the posterior (Wilson & Izmailov, 2020; Fortuin et al., 2021). However, in our case, we found this prior to be too weak. Since the posterior depends only on a single batch, we opted to introduce information about the whole dataset through the prior. A natural choice is to use the posterior distribution of the previous batch as our prior (Särkkä, 2013, Chapter 3). However, this method did not work well in our experiments and we developed an alternative. During each epoch, we collect the feature representations and labels of all examples in the dataset. At the end of the epoch, we compute the posterior based on the full data (with an isotropic Gaussian prior) and use this posterior as the prior at each step in the subsequent epoch. Updating the full-data prior more frequently is likely to have a beneficial effect on our overall model; however, it will also probably make the training time longer. Hence, doing the update once per epoch strikes a good balance between performance and training time.
先验选择。贝叶斯深度学习中常见的选择是采用无信息先验 (uninformative prior) ,例如标准高斯分布,让数据成为后验分布的主要影响因素 (Wilson & Izmailov, 2020; Fortuin et al., 2021) 。但在本研究中,我们发现这种先验过于薄弱。由于后验仅依赖于单个批次数据,我们选择通过先验引入整个数据集的信息。一个自然的选择是将前一批次的后验分布作为当前先验 (Sarkka, 2013, Chapter 3) ,但该方法在实验中表现不佳,因此我们开发了替代方案:在每个训练周期 (epoch) 收集数据集中所有样本的特征表示和标签,周期结束时基于完整数据(采用各向同性高斯先验)计算后验分布,并将该后验作为下一周期各训练步骤的先验。虽然更频繁地更新全局数据先验可能提升模型整体性能,但也会延长训练时间。因此,每周期更新一次在性能与训练时间之间取得了良好平衡。
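The mechanics of reusing a posterior as a prior can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' code: the features are treated as fixed (during training they drift as the shared parameters are updated), $\tau=1$, and the names `gaussian_posterior` and `full_data_prior` are hypothetical.

```python
import numpy as np

def gaussian_posterior(H, y, m_p, S_p):
    """Eq. 2 with tau = 1; H is (d, N) with one feature vector per column."""
    P = np.linalg.inv(S_p)
    S = np.linalg.inv(P + H @ H.T)
    return S @ (P @ m_p + H @ y), S

def full_data_prior(H_epoch, y_epoch, prior_var=1.0):
    """Posterior on all features collected in an epoch, starting from an
    isotropic Gaussian prior; reused as the prior in the next epoch."""
    d = H_epoch.shape[0]
    return gaussian_posterior(H_epoch, y_epoch, np.zeros(d), prior_var * np.eye(d))

rng = np.random.default_rng(0)
H_epoch = rng.normal(size=(3, 8))   # features gathered during one epoch
y_epoch = rng.normal(size=8)

# End of epoch: build the prior from the full data.
m_prior, S_prior = full_data_prior(H_epoch, y_epoch)

# Next epoch: every training step combines this prior with the current batch.
m_batch, S_batch = gaussian_posterior(H_epoch[:, :4], y_epoch[:4], m_prior, S_prior)
```

Because the update is conjugate, using the posterior from one block of data as the prior for a second block gives the same result as a single update on both blocks. Conditioning on a batch can therefore only tighten the covariance relative to the prior.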
Aggregation step. Having an approximation for the gradient distribution of each task, we need to combine them to find an update direction for the shared parameters. Denote the mean of the gradient of the loss for task $k$ w.r.t. the hidden layer for the $i^{\text{th}}$ example by $\pmb{\mu}_{i}^{k}:=\mathbb{E}[\mathbf{g}_{i}^{k}]$, and similarly the covariance matrix $\pmb{\Sigma}_{i}^{k}:=(\pmb{\Lambda}_{i}^{k})^{-1}:=\mathbb{E}[\mathbf{g}_{i}^{k}(\mathbf{g}_{i}^{k})^{T}]-\mathbb{E}[\mathbf{g}_{i}^{k}]\mathbb{E}[\mathbf{g}_{i}^{k}]^{T}$. We strive to find an update direction for the last shared layer, $\mathbf{g}_{i}$, that lies in a high-density region for all tasks. Hence, we pick the $\mathbf{g}_{i}$ that maximizes the following likelihood:
聚合步骤。在获得每个任务的梯度分布近似后,我们需要将它们组合起来以找到共享参数的更新方向。设任务 $k$ 关于第 $i$ 个示例隐藏层的损失梯度均值为 $\pmb{\mu}_{i}^{k}:=\mathbb{E}[\mathbf{g}_{i}^{k}]$,协方差矩阵为 $\pmb{\Sigma}_{i}^{k}:=(\pmb{\Lambda}_{i}^{k})^{-1}:=\mathbb{E}[\mathbf{g}_{i}^{k}(\mathbf{g}_{i}^{k})^{T}]-\mathbb{E}[\mathbf{g}_{i}^{k}]\mathbb{E}[\mathbf{g}_{i}^{k}]^{T}$。我们致力于为最后一个共享层寻找一个更新方向 $\mathbf{g}_{i}$,该方向需位于所有任务的高密度区域。因此,我们选择使以下似然最大化的 $\mathbf{g}_{i}$:
$$
\underset{\mathbf{g}_{i}}{\arg\max}\prod_{k=1}^{K}\mathcal{N}(\mathbf{g}_{i}|\pmb{\mu}_{i}^{k},\pmb{\Sigma}_{i}^{k})=\underset{\mathbf{g}_{i}}{\arg\min}\,-\sum_{k=1}^{K}\log\mathcal{N}(\mathbf{g}_{i}|\pmb{\mu}_{i}^{k},\pmb{\Sigma}_{i}^{k}).
\tag{5}
$$
Algorithm 1 BayesAgg-MTL
算法 1 BayesAgg-MTL
End for
结束
End for
结束
Compute gradient via matrix multiplication w.r.t. the shared parameters: $\frac{1}{|B|}\sum_{i=1}^{|B|}\mathbf{g}_{i}\frac{\partial\mathbf{h}_{i}}{\partial\pmb{\theta}}$.
通过矩阵乘法计算关于共享参数的梯度:$\frac{1}{|B|}\sum_{i=1}^{|B|}\mathbf{g}_{i}\frac{\partial\mathbf{h}_{i}}{\partial\pmb{\theta}}$。
Thankfully, the above optimization problem can be solved in closed form, yielding the following solution:
幸运的是,上述优化问题存在闭式解,其结果如下:
$$
\mathbf{g}_{i}=\left(\sum_{k=1}^{K}\pmb{\Lambda}_{i}^{k}\right)^{-1}\left(\sum_{k=1}^{K}\pmb{\Lambda}_{i}^{k}\pmb{\mu}_{i}^{k}\right).
\tag{6}
$$
However, we found that modeling the full covariance matrix can be numerically unstable and sensitive to noise in the gradient. Instead, we assume independence between the dimensions of $\mathbf{g}_{i}^{k}$ for all tasks, which results in diagonal covariance matrices with variances $(\pmb{\sigma}_{i}^{k})^{2}:=1/\pmb{\lambda}_{i}^{k}$. The update direction now becomes:
然而,我们发现对完整协方差矩阵建模会导致数值不稳定,并对梯度噪声敏感。因此,我们假设所有任务的 $\mathbf{g}_{i}^{k}$ 各维度间相互独立,从而得到对角协方差矩阵,其方差为 $(\pmb{\sigma}_{i}^{k})^{2}:=1/\pmb{\lambda}_{i}^{k}$ 。此时更新方向变为:
$$
\mathbf{g}_{i}=\sum_{k=1}^{K}\frac{1/(\pmb{\sigma}_{i}^{k})^{2}}{\sum_{j=1}^{K}1/(\pmb{\sigma}_{i}^{j})^{2}}\pmb{\mu}_{i}^{k}=\sum_{k=1}^{K}\overbrace{\frac{\pmb{\lambda}_{i}^{k}}{\sum_{j=1}^{K}\pmb{\lambda}_{i}^{j}}}^{\pmb{\alpha}_{i}^{k}}\pmb{\mu}_{i}^{k},
\tag{7}
$$
where the division and multiplication are done elementwise. In Eq. 7 we intentionally denote by $\pmb{\alpha}_{i}^{k}$ the vector of uncertainty-based weights that our method assigns to the mean gradient, to highlight that the weights are unique per task, dimension, and datum. The final modification to the method involves down-scaling the impact of the precision by a hyper-parameter $s\in(0,1]$; namely, we take $(\pmb{\lambda}_{i}^{k})^{s}$. Empirically, the scaling parameter helped to achieve better performance, perhaps due to misspecifications in the model (such as the diagonal Gaussian assumption over $\mathbf{g}_{i}^{k}$).
其中除法和乘法都是逐元素进行的。在公式7中,我们特意用 $\pmb{\alpha}_{i}^{k}$ 表示该方法为平均梯度分配的不确定性权重向量,以突显权重对每个任务、维度和数据点都是唯一的。该方法的最终调整是通过超参数 $s\in(0,1]$ 缩小精度的影响,即采用 $(\pmb{\lambda}_{i}^{k})^{s}$。实验表明,缩放参数有助于获得更好的性能,可能是由于模型中的错误设定(例如对 $\mathbf{g}_{i}^{k}$ 采用对角高斯假设)所致。
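The aggregation rules of Eq. 6 and Eq. 7, including the scaling exponent $s$, can be sketched for a single example $i$ as follows. The function names and array layouts are assumptions made for illustration, and the toy numbers mimic two tasks that are each confident along a different dimension.

```python
import numpy as np

def aggregate_diagonal(mu, var, s=1.0):
    """Eq. 7 for one example: precision-weighted average per dimension.

    mu, var : (K, d) arrays of per-task gradient means and variances
    s       : exponent in (0, 1] that down-scales the effect of the precision
    Returns the aggregated gradient (d,) and the weights alpha (K, d).
    """
    precision = (1.0 / var) ** s
    alpha = precision / precision.sum(axis=0, keepdims=True)
    return (alpha * mu).sum(axis=0), alpha

def aggregate_full(mu, cov):
    """Eq. 6 for one example with full covariances; cov has shape (K, d, d)."""
    lam = np.linalg.inv(cov)                      # batched inverse
    rhs = np.einsum('kij,kj->i', lam, mu)         # sum_k Lambda_k mu_k
    return np.linalg.solve(lam.sum(axis=0), rhs)

# Task 1 is confident along dimension 0, task 2 along dimension 1.
mu = np.array([[1.0, 0.0],
               [0.0, 1.0]])
var = np.array([[0.1, 10.0],
                [10.0, 0.1]])
g, alpha = aggregate_diagonal(mu, var)            # close to [0.99, 0.99]
g_half, _ = aggregate_diagonal(mu, var, s=0.5)    # weights pulled toward uniform
```

A plain average of the two means would give `[0.5, 0.5]`. The precision weighting instead follows each task along the dimension where it is confident, and a smaller `s` moves the weights back toward the plain average.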
With the aggregated gradient for each example, the back-propagation procedure proceeds as usual by averaging over all examples in the batch and then back-propagating this to the shared parameters. To gain a better intuition about the update rule of BayesAgg-MTL, consider the illustration in Figure 2. In the figure, we plot the mean update direction of two tasks along with the uncertainty in them. The first task is more sensitive to shifts in the vertical dimension and less so to shifts in the horizontal dimension, while for the second task, it is the opposite. By taking the variance information into account, BayesAgg-MTL can find an update direction that works well for both, compared to a simple average of the gradient means. We summarize our method in Algorithm 1.
在获得每个样本的聚合梯度后,反向传播过程按常规方式进行:先对批次中所有样本的梯度取平均,然后将该平均值反向传播至共享参数。为更好理解BayesAgg-MTL的更新规则,可参考图2的示意图。图中绘制了两个任务的平均更新方向及其不确定性:第一个任务对垂直维度变化更为敏感,而对水平维度变化较不敏感,第二个任务则相反。通过考虑方差信息,与简单的梯度均值平均相比,BayesAgg-MTL能够找到一个对两者都有效的更新方向。我们在算法1中总结了该方法。

Figure 2. BayesAgg-MTL update for a two-dimensional feature representation. Black arrows indicate the mean update direction of each task; Red arrow is the update direction of a simple average; Blue arrow is the proposed update direction. Darker colors in the contours represent regions with higher density.
图 2: BayesAgg-MTL 在二维特征表示中的更新过程。黑色箭头表示各任务的均值更新方向;红色箭头为简单平均的更新方向;蓝色箭头为本文提出的更新方向。等高线中颜色较深的区域表示密度较高的区域。
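Putting the pieces together, one regression step could look roughly like the sketch below. It is a hypothetical reading of the procedure described in the text, not a reproduction of Algorithm 1: it assumes single-output regression tasks, $\tau=1$, per-task priors supplied by the caller, and gradient moments that we derived ourselves for a Gaussian posterior. All names are illustrative.

```python
import numpy as np

def posterior(H, y, m_p, S_p):
    """Eq. 2 with tau = 1 and the batch features H (d, B) in place of X."""
    P = np.linalg.inv(S_p)
    S = np.linalg.inv(P + H @ H.T)
    return S @ (P @ m_p + H @ y), S

def gradient_mean_var(m, S, h, y, eps=1e-12):
    """Mean and per-dimension variance of g = 2 w (h^T w - y), w ~ N(m, S)."""
    Sh, c = S @ h, h @ m - y
    mean = 2.0 * (Sh + c * m)
    second_diag = 4.0 * ((c**2 + h @ Sh) * (np.diag(S) + m**2)
                         + 4.0 * c * m * Sh + 2.0 * Sh**2)
    # Clip to keep the precision finite under round-off.
    return mean, np.maximum(second_diag - mean**2, eps)

def bayesagg_regression_step(H, Y, priors, s=1.0):
    """Aggregated gradients w.r.t. the shared representation.

    H      : (d, B) hidden features of the batch
    Y      : (K, B) labels, one row per task
    priors : list of K tuples (m_p, S_p)
    Returns an array of shape (B, d) holding g_i for every example.
    """
    K, B = Y.shape
    posts = [posterior(H, Y[k], *priors[k]) for k in range(K)]
    G = np.zeros((B, H.shape[0]))
    for i in range(B):
        stats = [gradient_mean_var(m, S, H[:, i], Y[k, i])
                 for k, (m, S) in enumerate(posts)]
        mu = np.stack([st[0] for st in stats])
        lam = np.stack([1.0 / st[1] for st in stats]) ** s
        G[i] = (lam * mu).sum(axis=0) / lam.sum(axis=0)   # Eq. 7
    return G

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 5))
Y = rng.normal(size=(2, 5))
priors = [(np.zeros(3), np.eye(3))] * 2
G = bayesagg_regression_step(H, Y, priors)
```

In an automatic-differentiation framework, the rows of `G` would then be averaged over the batch and passed as the upstream gradient of the shared representation, so the remaining back-propagation through the feature extractor is unchanged.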
Making predictions. Since we have a closed-form solution for the posterior of the task-specific parameters, BayesAgg-MTL does not learn this layer during training. Therefore, when making predictions we use the posterior mean, $\mathbf{m}^{k}$, computed on the full training set. We do so, instead of using full Bayesian inference, for a fair comparison with alternative MTL approaches and to have identical runtime and memory requirements when making predictions.
进行预测。由于我们已经得到了任务特定参数后验的闭式解,BayesAgg-MTL在训练过程中无需学习这一层。因此,在预测时我们使用基于完整训练集计算的后验均值$\mathbf{m}^{k}$。这样做是为了与其他多任务学习方法进行公平比较,并确保预测时具有相同的运行时和内存需求,而不是采用完整的贝叶斯推断。
Connection to Nash-MTL. In (Navon et al., 2022) the authors proposed a cooperative bargaining game approach to the gradient aggregation step with the directional derivative as the utility of each player (task). They then proposed using the Nash bargaining solution, the direction that maximizes the product of all the utilities. One can consider Eq. 5 as the Nash bargaining solution with the utility of each task being its likelihood. However, unlike (Navon et al., 2022) we get an analytical formula for the bargaining solution since the Gaussian exponent and the logarithm cancel out.
与 Nash-MTL 的联系。在 (Navon et al., 2022) 中,作者提出了一种基于合作博弈理论的梯度聚合方法,将方向导数作为每个玩家(任务)的效用函数。随后他们采用纳什议价解 (Nash bargaining solution) ,即最大化所有效用乘积的方向。可以将公式 5 视为以各任务似然函数为效用的纳什议价解。但与 (Navon et al., 2022) 不同,由于高斯指数函数与对数函数的相互抵消,我们获得了议价解的解析表达式。
3.2. BayesAgg-MTL for Classification Tasks
3.2. 用于分类任务的BayesAgg-MTL
We now turn to present our approach for classification tasks. When dealing with classification there are two sources of intractability that we need to overcome. The first is the posterior of $\mathbf{w}^{k}$ , and the second is estimating the moments of $\mathbf{g}_{i}^{k}$ . We describe our solution to both challenges next.
我们现在转向介绍分类任务的方法。在处理分类问题时,需要克服两个棘手的难题:首先是 $\mathbf{w}^{k}$ 的后验分布,其次是估计 $\mathbf{g}_{i}^{k}$ 的矩量。接下来我们将分别阐述针对这两个挑战的解决方案。
Posterior approximation. In classification tasks the likelihood is not Gaussian and, in general, we cannot compute the posterior in closed form. One common option is to approximate it with a Gaussian distribution and learn its parameters using a variational inference (VI) scheme (Saul et al., 1996; Neal & Hinton, 1998; Bishop, 2006). However, in our early experiments, we did not find it to work well without a computationally expensive VI optimization at each update step. As an alternative to VI, the Laplace approximation (MacKay, 1992) approximates the posterior as a Gaussian using a second-order Taylor expansion. Since the expansion is done at the optimal parameter values, which are learned as a point estimate, the first-order (Jacobian) term in the expansion vanishes. Here, we follow a similar path; however, we cannot assume that the Jacobian is zero, as we are not near a stationary point during most of the training. Nevertheless, we can still find a Gaussian approximation. A similar derivation was proposed in (Immer et al., 2021), yet they eventually ignored the first-order term. Denote by $\hat{\mathbf{w}}^{k}$ the learned point estimate for the task parameters, and let $\Delta\mathbf{w}^{k}:=\mathbf{w}^{k}-\hat{\mathbf{w}}^{k}$. Then, at each training step, we can use Bayes' rule to obtain a posterior approximation for $\mathbf{w}^{k}$ as follows:
后验近似。在分类任务中,似然函数并非高斯分布,通常无法以闭式解计算后验分布。常见做法是采用高斯分布进行近似,并通过变分推断 (variational inference, VI) 框架学习其参数 (Saul et al., 1996; Neal & Hinton, 1998; Bishop, 2006) 。然而在初期实验中,我们发现若不采用计算成本高昂的逐步骤VI优化,该方法效果欠佳。作为VI的替代方案,拉普拉斯近似 (Laplace approximation) (MacKay, 1992) 通过二阶泰勒展开将后验近似为高斯分布。由于展开在逐点学习的最优参数值处进行,展开式中的雅可比项会消失。本文采用类似路径,但由于训练过程大多不处于驻点附近,不能假设雅可比矩阵为零。尽管如此,我们仍可获得高斯近似。(Immer et al., 2021) 提出了类似推导,但最终忽略了一阶项。设 $\hat{\mathbf{w}}^{k}$ 为任务参数的逐点估计值,$\Delta\mathbf{w}^{k}:=\mathbf{w}^{k}-\hat{\mathbf{w}}^{k}$ ,则训练过程中每一步均可通过贝叶斯规则获得 $\mathbf{w}^{k}$ 的后验近似:
$$
\begin{aligned}
\log p(\mathbf{w}^{k}|\mathcal{B})\approx{} & \log p(\hat{\mathbf{w}}^{k}|\mathcal{B})+\\
& \left(-\frac{\partial\log p(\mathbf{y}^{k}|\mathbf{X},\mathbf{w}^{k})}{\partial\mathbf{w}^{k}}-\frac{\partial\log p(\mathbf{w}^{k})}{\partial\mathbf{w}^{k}}\right)^{T}\Delta\mathbf{w}^{k}+\\
& \frac{1}{2}(\Delta\mathbf{w}^{k})^{T}\left(-\frac{\partial^{2}\log p(\mathbf{y}^{k}|\mathbf{X},\mathbf{w}^{k})}{\partial(\mathbf{w}^{k})^{2}}-\frac{\partial^{2}\log p(\mathbf{w}^{k})}{\partial(\mathbf{w}^{k})^{2}}\right)\Delta\mathbf{w}^{k}.
\end{aligned}
$$
The above takes the form $c^{k}+(\mathbf{a}^{k})^{T}(\mathbf{w}^{k}-\hat{\mathbf{w}}^{k})+\frac{1}{2}(\mathbf{w}^{k}-\hat{\mathbf{w}}^{k})^{T}\mathbf{B}^{k}(\mathbf{w}^{k}-\hat{\mathbf{w}}^{k})$, where $\mathbf{a}^{k}\in\mathbb{R}^{d_{k}}$, $\mathbf{B}^{k}\in\mathbb{R}^{d_{k}\times d_{k}}$, and $c^{k}\in\mathbb{R}$ are known constants. We stress again that, since we apply Bayesian inference to the last-layer parameters only, computing and inverting $\mathbf{B}^{k}$ typically does not incur a large computational overhead.
上述表达式形式为 $c^{k}+(\mathbf{a}^{k})^{T}(\mathbf{w}^{k}-\hat{\mathbf{w}}^{k})+\frac{1}{2}(\mathbf{w}^{k}-\hat{\mathbf{w}}^{k})^{T}\mathbf{B}^{k}(\mathbf{w}^{k}-\hat{\mathbf{w}}^{k})$ ,其中 $\mathbf{a}^{k}\in\mathbb{R}^{d_{k}}$、$\mathbf{B}^{k}\in\mathbb{R}^{d_{k}\times d_{k}}$、$c^{k}\in\mathbb{R}$ 为已知常数。需要再次强调的是,由于我们仅对最后一层参数应用贝叶斯推断,计算并求逆 $\mathbf{B}^{k}$ 通常不会带来较大的计算开销。
After rearranging and completing the square we obtain a quadratic form corresponding to the following Gaussian distribution (see full derivation in Appendix A.2):
经过整理并完成平方后,我们得到一个对应于以下高斯分布的二次型 (完整推导见附录 A.2):
$$
p(\mathbf{w}^{k}|\mathcal{B})\approx\mathcal{N}(\mathbf{w}^{k}|\hat{\mathbf{w}}^{k}-(\mathbf{B}^{k})^{-1}\mathbf{a}^{k},(\mathbf{B}^{k})^{-1}).
$$
Examining the above posterior reveals several insights. First, the posterior mean corresponds to a Newton update step. Second, the covariance of this posterior is the same as that of the Laplace approximation. Third, the Laplace approximation is recovered at a stationary point, where the gradient of the loss w.r.t. the parameters approaches zero.
分析上述后验分布可得出几点见解。首先,后验均值对应于牛顿法的更新步骤。其次,该后验分布的协方差与拉普拉斯近似相同。第三,在驻点处,若损失函数对参数的梯度趋近于零,则恢复为拉普拉斯近似。
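The completing-the-square step can be sketched briefly; the full derivation is in Appendix A.2. The sketch rests on two assumptions: that the quadratic form is read as the negative log-posterior up to an additive constant, and that $\mathbf{B}^{k}$ is symmetric positive definite.

$$
\begin{aligned}
-\log p(\mathbf{w}^{k}|\mathcal{B}) &\approx c^{k}+(\mathbf{a}^{k})^{T}\Delta\mathbf{w}^{k}+\frac{1}{2}(\Delta\mathbf{w}^{k})^{T}\mathbf{B}^{k}\Delta\mathbf{w}^{k}\\
&=\frac{1}{2}\left(\Delta\mathbf{w}^{k}+(\mathbf{B}^{k})^{-1}\mathbf{a}^{k}\right)^{T}\mathbf{B}^{k}\left(\Delta\mathbf{w}^{k}+(\mathbf{B}^{k})^{-1}\mathbf{a}^{k}\right)+c^{k}-\frac{1}{2}(\mathbf{a}^{k})^{T}(\mathbf{B}^{k})^{-1}\mathbf{a}^{k}.
\end{aligned}
$$

Since $\Delta\mathbf{w}^{k}+(\mathbf{B}^{k})^{-1}\mathbf{a}^{k}=\mathbf{w}^{k}-\left(\hat{\mathbf{w}}^{k}-(\mathbf{B}^{k})^{-1}\mathbf{a}^{k}\right)$, exponentiating the negated expression gives an unnormalized Gaussian with mean $\hat{\mathbf{w}}^{k}-(\mathbf{B}^{k})^{-1}\mathbf{a}^{k}$ and covariance $(\mathbf{B}^{k})^{-1}$. When $\mathbf{a}^{k}=\mathbf{0}$ the mean reduces to $\hat{\mathbf{w}}^{k}$, which matches the Laplace approximation.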
One limitation of the approximation in Eq. 9 is that the Hessian will not be positive-definite in most cases. Therefore, we replace it with the generalized Gauss-Newton (GGN) matrix (Schraudolph, 2002; Martens & Sutskever, 2011; Daxberger et al., 2021):
式 9 中近似的一个局限是,Hessian 矩阵在大多数情况下并非正定。因此,我们将其替换为广义高斯-牛顿 (GGN) 矩阵 (Schraudolph, 2002; Martens & Sutskever, 2011; Daxberger et al., 2021):
$$
\tilde{\mathbf{B}}^{k}=\sum_{i=1}^{|\mathcal{B}|}(\mathbf{J}_{i}^{k})^{T}\mathbf{H}_{i}^{k}\mathbf{J}_{i}^{k}+(\mathbf{S}_{p}^{k})^{-1}.
$$
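As a concrete illustration, the sketch below builds the GGN matrix and the resulting Gaussian approximation for a single binary task with a linear logistic head. It is not the authors' code. It assumes a zero-mean isotropic Gaussian prior, so that $(\mathbf{S}_{p}^{k})^{-1}$ is a scaled identity, and the names `ggn_gaussian_posterior`, `feats`, and `prior_var` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ggn_gaussian_posterior(feats, y, w_hat, prior_var=1.0):
    """Gaussian approximation N(w_hat - B^{-1} a, B^{-1}) for a binary logistic head.

    feats: (n, d) last-layer input features of the batch,
    y: (n,) labels in {0, 1}, w_hat: (d,) current point estimate.
    Assumes a zero-mean isotropic Gaussian prior with variance `prior_var`.
    """
    p = sigmoid(feats @ w_hat)
    # a: gradient of the negative log-joint (likelihood + prior) at w_hat.
    a = feats.T @ (p - y) + w_hat / prior_var
    # GGN: sum_i J_i^T H_i J_i, with J_i = h_i^T and H_i = p_i (1 - p_i),
    # plus the prior precision.
    B = (feats * (p * (1.0 - p))[:, None]).T @ feats + np.eye(feats.shape[1]) / prior_var
    cov = np.linalg.inv(B)
    mean = w_hat - cov @ a
    return mean, cov

rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 4))
y = (rng.uniform(size=32) < 0.5).astype(float)
mean, cov = ggn_gaussian_posterior(feats, y, w_hat=np.zeros(4))
```

Because each per-example factor $p_{i}(1-p_{i})$ is non-negative and the prior precision is positive, the matrix `B` is positive definite by construction, which is what makes `cov` a valid covariance.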
Moments estimation. Unlike the regression case, in classification $\mathbf{g}_{i}^{k}$ will depend on $\mathbf{w}^{k}$ through some non-linear function. Hence, obtaining the moments as in Eq. 4 in closed-form is more challenging. However, since we are estimating the parameters of the last layer only, which in many cases are relatively low-dimensional, we can efficiently approximate these moments with Monte-Carlo sampling:
矩估计。与回归情况不同,在分类任务中 $\mathbf{g}_{i}^{k}$ 会通过某些非线性函数依赖于 $\mathbf{w}^{k}$ 。因此,像式4那样以闭式求解矩量更具挑战性。不过由于我们仅估计最后一层的参数(这些参数在许多情况下维度相对较低),可以通过蒙特卡洛采样高效近似这些矩量:
$$
\begin{aligned}
\mathbb{E}[\mathbf{g}_{i}^{k}] &\approx\frac{1}{J}\sum_{j=1}^{J}\mathbf{g}_{i}^{k}(\mathbf{w}_{j}^{k}),\\
\mathbb{E}[\mathbf{g}_{i}^{k}(\mathbf{g}_{i}^{k})^{T}] &\approx\frac{1}{J}\sum_{j=1}^{J}\mathbf{g}_{i}^{k}(\mathbf{w}_{j}^{k})\,\mathbf{g}_{i}^{k}(\mathbf{w}_{j}^{k})^{T}.
\end{aligned}
$$
Here, $\mathbf{w}_{j}^{k}$ are samples from $p(\mathbf{w}^{k}|\mathcal{B})$, and the total number of samples is $J$. Effectively, this means that we need to back-propagate gradients w.r.t. the shared hidden layer $J$ times; however, since the task-specific layers are linear, this can be done cheaply and in parallel. Given the moment estimates, we proceed with the aggregation rule described in Section 3.1.
这里,$\mathbf{w}_{j}^{k}$ 是从 $p(\mathbf{w}^{k}|B)$ 中采样的样本,总样本数为 $J$。实际上这意味着我们需要对共享隐藏层进行 $J$ 次梯度反向传播;但由于任务特定层是线性的,这一过程可以低成本并行完成。获得矩估计后,我们按照第3.1节描述的聚合规则继续操作。
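The Monte-Carlo estimator can be sketched for one example and one binary task. This is a hedged illustration, not the paper's implementation. It assumes a linear logistic head, for which the gradient of the cross-entropy loss w.r.t. the hidden features $\mathbf{h}$ is $(\sigma(\mathbf{h}^{T}\mathbf{w})-y)\,\mathbf{w}$. The function name `mc_gradient_moments` and all inputs are hypothetical.

```python
import numpy as np

def mc_gradient_moments(h, y, mean, cov, num_samples=1000, seed=0):
    """Estimate E[g] and E[g g^T] for the gradient w.r.t. the features h.

    h: (d,) hidden features of one example, y: label in {0, 1},
    mean, cov: parameters of the Gaussian posterior over the head weights.
    """
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(mean, cov, size=num_samples)  # (J, d) samples w_j
    p = 1.0 / (1.0 + np.exp(-(W @ h)))                        # (J,) predicted probabilities
    G = (p - y)[:, None] * W                                  # (J, d) gradients g(w_j)
    first = G.mean(axis=0)                                    # estimate of E[g]
    second = G.T @ G / num_samples                            # estimate of E[g g^T]
    return first, second

h = np.array([0.5, -1.0, 2.0])
mean = np.array([0.1, 0.2, -0.3])
cov = 0.05 * np.eye(3)
first, second = mc_gradient_moments(h, y=1.0, mean=mean, cov=cov)
grad_cov = second - np.outer(first, first)  # covariance implied by the two moments
```

All $J$ gradients are computed in one vectorized pass, which reflects the remark that a linear head makes the repeated back-propagation cheap and parallelizable.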
Making predictions. Unlike the regression case, here we learn the parameters of the last layer as part of the posterior approximation. Therefore, making predictions is done as usual with a forward-pass through the network.
进行预测。与回归情况不同,这里我们将最后一层的参数作为后验近似的一部分进行学习。因此,预测过程只需像往常一样通过网络进行前向传播即可完成。
4. Related Work
4. 相关工作
Multi-task learning is an active research area that attempts to jointly learn multiple tasks, commonly using a shared representation (Ruder, 2017; Navon et al., 2022; Liu et al., 2023; Elich et al., 2023; Shi et al., 2023; Yun & Cho, 2023). Learning a shared representation for multiple tasks poses some challenges. One challenge is to learn an architecture that can express both task-shared and task-specific features. Another challenge is to find the optimal balance between the tasks so that the different tasks are learned with equal importance. One line of research in MTL introduces novel MTL-friendly architectures, such as task-specific modules (Misra et al., 2016), attention-based networks (Liu et al., 2019a), and an ensemble of single-task models (Dimitriadis et al., 2023). Yet, a more common line of research focuses on the MTL optimization process, trying to explain its difficulties by, e.g., conflicting gradients (Wang et al., 2020) or plateaus in the loss landscape (Schaul et al., 2019). Our method belongs to the latter line and aims to improve the MTL optimization process.
多任务学习是一个活跃的研究领域,旨在通过共享表征联合学习多个任务 (Ruder, 2017; Navon et al., 2022; Liu et al., 2023; Elich et al., 2023; Shi et al., 2023; Yun & Cho, 2023)。为多任务学习共享表征会带来一些挑战:其一是需要构建能同时表达任务共享特征和任务特定特征的架构;其二是如何实现任务间最优平衡,确保不同任务获得同等重要的学习机会。当前多任务学习研究主要分为两个方向:一是设计新型多任务友好架构,如任务特定模块 (Misra et al., 2016)、基于注意力的网络 (Liu et al., 2019a) 和单任务模型集成 (Dimitriadis et al., 2023);二是聚焦多任务优化过程,通过梯度冲突 (Wang et al., 2020) 或损失函数平台期 (Schaul et al., 2019) 等现象解释优化难点。本文方法属于后者,致力于改进多任务优化过程。

Figure 3. Mean weight over dimensions per-example for 20 random examples on the QM9 dataset at different training stages.
图 3: QM9 数据集中 20 个随机样本在不同训练阶段各维度上的平均权重。
Different strategies were proposed to address the MTL optimization challenge to successfully balance the training of the different tasks and resolve their conflicts. The methods can broadly be categorized into two groups, loss-based and gradient-based (Dai et al., 2023). Loss-based approaches attempt to allocate weights for the tasks based on some criteria related to the loss, such as the difficulty of the task (Guo et al., 2018), random weights (Lin et al., 2022), geometric mean of the task losses (Chennupati et al., 2019; Yun & Cho, 2023), and task uncertainty (Kendall et al., 2018). Regarding the last one, to weigh the tasks it uses the uncertainty in the observations only. This is very different from our approach that weighs each dimension of the task gradients based on full Bayesian information.
针对多任务学习(MTL)优化挑战,研究者提出了不同策略来平衡各任务训练并解决任务间冲突。这些方法主要分为两类:基于损失(loss-based)和基于梯度(gradient-based)的方法 (Dai et al., 2023)。基于损失的方法通过任务损失相关指标分配权重,例如:任务难度 (Guo et al., 2018)、随机权重 (Lin et al., 2022)、任务损失的几何平均 (Chennupati et al., 2019; Yun & Cho, 2023) 以及任务不确定性 (Kendall et al., 2018)。其中最后一种方法仅依据观测不确定性来加权任务,这与我们基于完整贝叶斯信息对任务梯度各维度进行加权的方法存在本质差异。
Gradient-based methods attempt to balance the tasks by using the gradient information directly (Chen et al., 2018; 2020; Javaloy & Valera, 2022; Liu et al., 2020; Navon et al., 2022; Fernando et al., 2023; Senushkin et al., 2023). For example, GradNorm (Chen et al., 2018) dynamically tunes the gradient magnitudes to prevent imbalances between the tasks during training. PCGrad (Yu et al., 2020) identifies gradient conflicts as the main optimization issue in MTL, and attempts to reduce the conflicts by projecting each gradient onto the normal plane of the other tasks' gradients. Nash-MTL (Navon et al., 2022) suggests treating MTL as a bargaining game to find Pareto optimal solutions. Several studies suggested adaptations of the multiple-gradient descent algorithm (MGDA) (Désidéri, 2012; Sener & Koltun, 2018), such as CAGrad (Liu et al., 2021) and MoCo (Fernando et al., 2023). As opposed to previous methods, our approach considers both the mean and the variance of the gradients to derive an update direction.
基于梯度的方法尝试直接利用梯度信息来平衡任务 (Chen et al., 2018; 2020; Javaloy & Valera, 2022; Liu et al., 2020; Navon et al., 2022; Fernando et al., 2023; Senushkin et al., 2023)。例如,GradNorm (Chen et al., 2018) 动态调整梯度幅度以防止训练过程中任务间的不平衡。PCGrad (Yu et al., 2020) 将梯度冲突视为多任务学习 (MTL) 中的主要优化问题,并尝试通过将每个梯度投影到其他任务的法平面来减少冲突。Nash-MTL (Navon et al., 2022) 建议将多任务学习视为讨价还价博弈以寻找帕累托最优解。多项研究提出了对多梯度下降算法 (MGDA) (Désidéri, 2012; Sener & Koltun, 2018) 的改进,例如 CAGrad (Liu et al., 2021) 和 MoCo (Fernando et al., 2023)。与之前的方法不同,我们的方法同时考虑梯度的均值和方差来推导更新方向。
Lastly, some studies recently suggested performing model merging based on the uncertainty of the parameters (Matena & Raffel, 2022; Daheim et al., 2023). The goal there is usually to combine models for various tasks, such as model ensembling, federated learning, and robust fine-tuning. Unlike these methods, we assume a Bayesian model on the last layer only and propagate the uncertainty to the gradients for gradient aggregation.
最后,一些最新研究提出基于参数不确定性进行模型合并 (Matena & Raffel, 2022; Daheim et al., 2023)。这类方法通常旨在整合面向不同任务的模型,例如模型集成、联邦学习和鲁棒微调。与这些方法不同,我们仅在最后一层假设贝叶斯模型,并将不确定性传播至梯度以实现梯度聚合。
Table 1. QM9. Test performance averaged over 3 random seeds.
表 1. QM9. 3次随机种子测试性能平均值
| Method | $\Delta_{m}\%$ (↓) |
|---|---|
| LS | 177.6 ± 3.4 |
| SI | 77.8 ± 9.2 |
| RLW | 203.8 ± 3.4 |
| DWA | 175.3 ± 6.3 |
| UW | 108.0 ± 22.5 |
| MGDA | 120.5 ± 2.0 |
| PCGrad | 125.7 ± 10.3 |
| CAGrad | 112.8 ± 4.0 |
| IMTL-G | 77.2 ± 9.3 |
| Nash-MTL | 62.0 ± 1.4 |
| IGBv2 | 67.7 ± 8.1 |
| Aligned-MTL-UB | 71.0 ± 9.6 |
| BayesAgg-MTL (Ours) | 53.2 ± 7.1 |
5. Experiments
5. 实验
We evaluated BayesAgg-MTL on several MTL benchmarks differing in the number of tasks and their types. Unless specified otherwise, we report the average and standard deviation (std) of relevant metrics over 3 random seeds. In all datasets, we pre-allocated a validation set from the training set for hyper-parameter tuning and early stopping for all methods. Throughout our experiments, we used the Adam optimizer (Kingma & Ba, 2015), which was found to be effective for MTL due to its partial loss-scale invariance (Elich et al., 2023). Full experimental details are given in Appendix B.
我们在多个任务数量及类型各异的多任务学习 (MTL) 基准上评估了 BayesAgg-MTL。除非另有说明,我们报告了基于3个随机种子的相关指标平均值和标准差 (std)。所有数据集中,我们从训练集预先划分验证集用于超参数调优和所有方法的早停机制。实验全程采用 ADAM 优化器 (Kingma & Ba, 2015) ,因其对部分损失尺度不变性 (Elich et al., 2023) 而被证明对MTL有效。完整实验细节见附录B。
Compared methods. We compare BayesAgg-MTL with the following baseline methods: (1) Single Task Learning (STL), which learns each task independently under the same experimental setup as that of the MTL methods; (2) Linear Scalarization (LS), which assigns a uniform weight to all tasks, namely $\sum_{k=1}^{K}\ell^{k}$; (3) Scale-Invariant (SI) (Navon et al., 2022), which assigns a uniform weight to the log of all tasks, namely $\sum_{k=1}^{K}\log\ell^{k}$; (4) Random Loss Weighting (RLW) (Lin et al., 2022), which allocates random weights to the losses at each iteration; (5) Dynamic Weight Average (DWA) (Liu et al., 2019a), which allocates a weight based on the rate of change of the loss for each task; (6) Uncertainty Weighting (UW) (Kendall et al., 2018), which minimizes a scalar term corresponding to the aleatoric uncertainty for each task; (7) Multiple-Gradient Descent Algorithm (MGDA) (Désidéri, 2012; Sener & Koltun, 2018), which finds a minimum norm solution for a convex combination of the losses; (8) Projecting Conflicting Gradients (PCGrad) (Yu et al., 2020), which projects the gradient of each task onto the normal plane of the tasks it conflicts with; (9) Conflict-Averse Grad (CAGrad) (Liu et al., 2021), which searches for an update direction centered at the LS solution while minimizing conflicts in gradients; (10) Impartial MTL-Grad (IMTL-G) (Liu et al., 2020), which finds an update vector whose projection on each of the task gradients is equal; (11) Nash-MTL (Navon et al., 2022), which derives task weights based on the Nash bargaining solution; (12) Improvable Gap Balancing (IGBv2) (Dai et al., 2023), which suggests a reinforcement learning procedure to balance the task losses; (13) Aligned-MTL-UB (Senushkin et al., 2023), which aligns the principal components of a gradient matrix.
对比方法。我们将BayesAgg-MTL与以下基线方法进行比较:(1) 单任务学习(STL),在与多任务学习方法相同的实验设置下独立学习每个任务;(2) 线性标量化(LS),为所有任务分配统一权重,即$\textstyle\sum_{k=1}^{K}\ell^{k}$;(3) 尺度不变(SI)(Navon et al., 2022),为所有任务的对数分配统一权重,即$\scriptstyle\sum_{k=1}^{K}l o g\ell^{k}$;(4) 随机损失加权(RLW)(Lin et al., 2022),在每次迭代时为损失分配随机权重;(5) 动态权重平均(DWA)(Liu et al., 2019a),根据每个任务损失的变化率分配权重;(6) 不确定性加权(UW)(Kendall et al., 2018),最小化与每个任务的任意不确定性相对应的标量项;(7) 多梯度下降算法(MGDA)(Désidéri, 2012; Sener & Koltun, 2018),寻找损失凸组合的最小范数解;(8) 投影冲突梯度(PCGrad)(Yu et al., 2020),将每个任务的梯度投影到与之冲突的任务的法平面上;(9) 冲突规避梯度(CAGrad)(Liu et al., 2021),搜索以LS解为中心的更新方向,同时最小化梯度冲突;(10) 公正多任务学习梯度(IMTL-G)(Liu et al., 2020),寻找一个更新向量,使其在任务梯度上的投影相等;(11) 纳什多任务学习(Nash-MTL)(Navon et al., 2022),基于纳什议价解推导任务权重;(12) 可改进差距平衡(IGBv2)(Dai et al., 2023),提出强化学习程序来平衡任务损失;(13) 对齐多任务学习上界(Aligned-MTLUB)(Senushkin et al., 2023),对齐梯度矩阵的主成分。
Table 2. Test performance averaged over 3 random seeds on binary classification tasks from CIFAR-MTL & ChestX-ray14 datasets.
表 2. CIFAR-MTL 和 ChestX-ray14 数据集二分类任务上 3 次随机种子测试性能平均值。
| Method | CIFAR (Acc.) (↑) | CX-ray ($\Delta_{m}\%$) (↓) |
|---|---|---|
| LS | 56.96 ± 0.06 | -14.62 ± 0.2 |
| SI | 55.75 ± 0.3 | -10.94 ± 0.4 |
| RLW | 59.30 ± 0.08 | -11.69 ± 0.1 |
| DWA | 58.44 ± 0.5 | -14.79 ± 0.07 |
| UW | 56.63 ± 0.5 | -13.95 ± 0.2 |
| MGDA | 59.74 ± 0.07 | -14.44 ± 0.4 |
| PCGrad | 56.32 ± 0.2 | -13.43 ± 0.5 |
| CAGrad | 56.59 ± 0.2 | -14.49 ± 0.1 |
| IMTL-G | 57.09 ± 0.3 | -8.23 ± 1.8 |
| Nash-MTL | 56.59 ± 0.2 | -13.23 ± 0.5 |
| IGBv2 | 56.61 ± 0.2 | -2.82 ± 0.6 |
| Aligned-MTL-UB | 56.57 ± 0.7 | -14.14 ± 0.2 |
| BayesAgg-MTL (Ours) | 59.97 ± 0.4 | -14.96 ± 0.1 |
Evaluation metric. Unless specified otherwise, we report the $\Delta_{m}\%$ metric introduced in (Maninis et al., 2019). This metric measures the average relative difference between a method $m$ and the STL baseline according to some criterion of interest $M^{k}$. Namely, $\Delta_{m}=\frac{1}{K}\sum_{k=1}^{K}(-1)^{\delta_{k}}(M_{m}^{k}-M_{s}^{k})/M_{s}^{k}$, where $M_{m}^{k}$ is the criterion value for task $k$ under method $m$, $M_{s}^{k}$ is the criterion value for task $k$ under the STL baseline, and $\delta_{k}\in\{0,1\}$. If $\delta_{k}=0$ then a lower value for $M^{k}$ is better (e.g., task loss), and if $\delta_{k}=1$ then a higher value for $M^{k}$ is preferred (e.g., task accuracy). A lower $\Delta_{m}\%$ indicates better performance.
评估指标。除非另有说明,我们均采用 (Maninis et al., 2019) 提出的 $\Delta_{m}\%$ 指标。该指标衡量方法 $m$ 与单任务学习 (STL) 基线在特定评估准则 $M^{k}$ 下的平均相对差异,计算公式为 $\Delta_{m}=\frac{1}{K}\sum_{k=1}^{K}(-1)^{\delta_{k}}(M_{m}^{k}-M_{s}^{k})/M_{s}^{k}$。其中 $M_{m}^{k}$ 表示方法 $m$ 在任务 $k$ 上的准则值,$M_{s}^{k}$ 表示 STL 基线在任务 $k$ 上的准则值,$\delta_{k}\in\{0,1\}$ 为方向系数:当 $\delta_{k}=0$ 时 $M^{k}$ 值越小越好(如任务损失),当 $\delta_{k}=1$ 时 $M^{k}$ 值越大越好(如任务准确率)。$\Delta_{m}\%$ 值越低表示性能越好。
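The metric is straightforward to compute. The following sketch implements the formula as stated, with the result expressed as a percentage. The function name `delta_m` and the example values are hypothetical and are not taken from the reported experiments.

```python
import numpy as np

def delta_m(method_vals, stl_vals, higher_is_better):
    """Average relative difference (in %) of a method w.r.t. the STL baseline.

    method_vals, stl_vals: per-task criterion values M_m^k and M_s^k.
    higher_is_better: per-task flags, i.e. delta_k = 1 where a higher value is better.
    """
    method_vals = np.asarray(method_vals, dtype=float)
    stl_vals = np.asarray(stl_vals, dtype=float)
    delta = np.asarray(higher_is_better, dtype=int)   # delta_k in {0, 1}
    signs = (-1.0) ** delta                           # flips the sign for "higher is better"
    return 100.0 * np.mean(signs * (method_vals - stl_vals) / stl_vals)

# Hypothetical two-task example: a loss (lower is better) and an accuracy (higher is better).
score = delta_m(method_vals=[0.9, 0.88], stl_vals=[1.0, 0.80],
                higher_is_better=[False, True])
print(f"{score:.2f}%")
```

In this example both tasks improve on the baseline by 10% in relative terms, so the score is negative, consistent with lower values indicating better performance.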
Pre-training stage. To obtain meaningful features for the Bayesian layer, it is common practice to apply a pre-training step using standard NN training for several epochs (Wilson et al., 2016a;b). We follow the same path here and apply an initial pre-training step using linear scalarization. We stress that in all the experiments, the overall number of training steps for BayesAgg-MTL (including the pre-training) is the same as for all other methods.
预训练阶段。为了获得贝叶斯层的有意义特征,通常的做法是使用标准神经网络 (NN) 训练进行几个周期的预训练步骤 (Wilson et al., 2016a;b)。我们在此遵循相同的路径,并使用线性标量化 (linear scalarization) 进行初始预训练步骤。需要强调的是,在所有实验中,BayesAgg-MTL 的总训练步骤数(包括预训练)与其他所有方法相同。
Table 3. UTKFace. Test performance averaged over 8 random seeds.
表 3: UTKFace。测试性能为8个随机种子的平均值。
| Method | Age ($\times 10^{1}$) (↓) | Gender (↑) | Ethnicity (↑) | $\Delta_{m}\%$ (↓) |
|---|---|---|---|---|
| STL | 1.40 ± 0.03 | 92.32 ± 0.35 | 82.42 ± 0.42 | |
| LS | 1.46 ± 0.02 | 92.92 ± 0.24 | 83.98 ± 0.43 | 0.69 ± 0.59 |
| SI | 1.42 ± 0.03 | 93.05 ± 0.29 | 83.40 ± 0.27 | 0.11 ± 0.89 |
| RLW | 1.44 ± 0.03 | 92.89 ± 0.25 | 83.70 ± 0.49 | -0.31 ± 0.76 |
| DWA | 1.44 ± 0.02 | 92.90 ± 0.16 | 83.55 ± 0.33 | 0.35 ± 0.60 |
| UW | 1.43 ± 0.00 | 92.99 ± 0.24 | 83.09 ± 0.39 | 0.15 ± 0.24 |
| MGDA | 1.38 ± 0.02 | 93.29 ± 0.31 | 83.51 ± 0.30 | -1.39 ± 0.50 |
| PCGrad | 1.47 ± 0.03 | 92.92 ± 0.28 | 83.28 ± 0.38 | 1.13 ± 0.57 |
| CAGrad | 1.40 ± 0.02 | 93.06 ± 0.26 | 83.28 ± 0.46 | -0.58 ± 0.59 |
| IMTL-G | 1.41 ± 0.03 | 93.10 ± 0.16 | 83.78 ± 0.47 | -0.50 ± 0.89 |
| Nash-MTL | 1.42 ± 0.02 | 92.89 ± 0.10 | 83.19 ± 0.50 | -0.17 ± 0.71 |
| IGBv2 | 1.42 ± 0.02 | 93.09 ± 0.22 | 83.34 ± 0.33 | -0.21 ± 0.50 |
| Aligned-MTL-UB | 1.45 ± 0.02 | 93.00 ± 0.24 | 83.36 ± 0.43 | 0.66 ± 0.50 |
| BayesAgg-MTL (Ours) | 1.35 ± 0.03 | 93.01 ± 0.17 | 84.25 ± 0.35 | -2.23 ± 0.76 |
5.1. BayesAgg-MTL for Regression
5.1. 回归任务中的BayesAgg-MTL方法
We first evaluated BayesAgg-MTL on an MTL problem with regression tasks only. We used the QM9 dataset, which contains $\sim$130,000 stable small organic molecules represented as graphs with node and edge features (Ramakrishnan et al., 2014; Wu et al., 2018). The goal here is to predict 11 chemical properties, such as geometric and energetic ones, which may vary in scale and difficulty. We follow the experimental protocol of Navon et al. (2022). Specifically, we allocate approximately 110,000 examples for training, with separate validation and test sets of 10,000 examples each. Additionally, we employ the message-passing neural network architecture (Gilmer et al., 2017) in conjunction with the pooling operator described in (Vinyals et al., 2016).
我们首先在一个仅包含回归任务的多任务学习(MTL)问题上评估BayesAgg-MTL方法。实验采用QM9数据集,该数据集包含约130,000个稳定的小型有机分子,这些分子以带有节点和边特征的图结构表示 (Ramakrishnan et al., 2014; Wu et al., 2018)。研究目标是预测11种化学性质(包括几何和能量特性),这些任务在难度和量级上存在差异。我们遵循Navon等人(2022)的实验方案:将约110,000个样本分配为训练集,验证集和测试集各包含10,000个样本。实验采用消息传递神经网络架构 (Gilmer et al., 2017) 结合 (Vinyals et al., 2016) 中描述的池化算子。
The test results for this dataset are presented in Table 1. Baseline method results were taken from (Dai et al., 2023), except for Aligned-MTL-UB, which is included here for the first time. The criterion used in $\Delta_{m}$ here is the mean absolute error (MAE) of the losses. From the table, BayesAgg-MTL achieves the best test performance, with a significant improvement compared to most of the baseline methods.
该数据集的测试结果如表 1 所示。基线方法结果来自 (Dai et al., 2023) ,其中 Aligned-MTL-UB 为本文首次引入。此处 $\Delta_{m}$ 采用的评估标准是损失的平均绝对误差 (MAE) 。由表可见,BayesAggMTL 取得了最佳测试性能,相较多数基线方法有显著提升。
To gain a better intuition into the weights that BayesAgg-MTL assigns, we define here again the vector of weights per example and task from Eq. 7, $\boldsymbol{\alpha}_{i}^{k}:=\boldsymbol{\lambda}_{i}^{k}/(\sum_{k=1}^{K}\boldsymbol{\lambda}_{i}^{k})$. Figure 3 depicts, for all tasks, the average over dimensions of $\boldsymbol{\alpha}_{i}^{k}$ for 20 random examples at the start, middle, and end of training. The plot reveals an interesting pattern. Early in training, the average weights are distributed among the tasks without any specific pattern. In the middle of training, larger weights are assigned to tasks 4 to 10, while tasks 0 to 3 receive smaller weights. At the end of training, this pattern reverses, and tasks 0 to 3 are assigned larger weights than tasks 4 to 10.
为了更好地直观理解BayesAgg-MTL所分配的权重,我们在此重新定义公式7中每个样本和任务的权重向量:$\boldsymbol{\alpha}_{i}^{k}:=\boldsymbol{\lambda}_{i}^{k}/(\sum_{k=1}^{K}\boldsymbol{\lambda}_{i}^{k})$。图3描绘了训练初期、中期和末期20个随机样本在所有任务上 $\boldsymbol{\alpha}_{i}^{k}$ 各维度的平均值。图表揭示了一个有趣的现象:训练初期,各任务的平均权重呈无规律分布;中期阶段任务4至10获得较大权重,而任务0至3权重较小;训练末期模式反转,任务0至3的权重大于任务4至10。
5.2. BayesAgg-MTL for Binary Classification
5.2. 用于二元分类的BayesAgg-MTL
Next, we evaluated BayesAgg-MTL on the MTL benchmarks CIFAR-MTL (Krizhevsky et al., 2009; Rosenbaum et al., 2018) and ChestX-ray14 (Wang et al., 2017). To the best of our knowledge, we are the first to evaluate MTL methods on the latter dataset. These datasets contain a large number of tasks, 20 and 14 respectively, with highly imbalanced class distributions. This poses a significant challenge for current MTL methods.
接下来,我们在多任务学习(MTL)基准数据集CIFAR-MTL (Krizhevsky et al., 2009; Rosenbaum et al., 2018)和ChestX-ray14 (Wang et al., 2017)上评估了BayesAgg-MTL。据我们所知,这是首次在ChestX-ray14数据集上评估MTL方法。这两个数据集分别包含20个和14个任务,且存在严重的类别不平衡分布,这对当前的多任务学习方法构成了重大挑战。
CIFAR-MTL uses the coarse labels of the CIFAR-100 dataset to create an MTL benchmark with 20 binary tasks. Classes from this dataset are grouped into super-classes (fish, flowers, trees, etc.), such that each example is given a one-hot label vector indicating the super-class it belongs to. We use the official train-test split of 50,000 and 10,000 examples, respectively. We allocate 5,000 examples from the training set for a validation set. Our experiments on this dataset were conducted using a simple NN with 3 convolution layers.
CIFAR-MTL使用CIFAR-100数据集的粗粒度标签创建了一个包含20个二元任务的多任务学习(MTL)基准。该数据集的类别被分组为超类(如鱼类、花卉、树木等),每个样本都被赋予一个独热编码向量来指示其所属的超类。我们采用官方划分的50,000个训练样本和10,000个测试样本,并从训练集中分配5,000个样本作为验证集。在该数据集上的实验采用了一个包含3个卷积层的简单神经网络。
ChestX-ray14 contains $\sim$112,000 chest X-ray images from 32,717 patients. Each image has labels from 14 binary classes corresponding to the presence or absence of thoracic diseases. Multiple diseases can appear together in a patient. In our experiments, we mostly follow the training protocol suggested in (Taslimi et al., 2022), which used ResNet-34 for the shared parameters. We use the official split of 70%-10%-20% for training, validation, and test.
ChestX-ray14包含来自32,717名患者的约112,000张胸部X光图像。每张图像带有14个二元分类标签,对应胸部疾病的存在或缺失。患者可能同时患有多重疾病。实验中,我们主要遵循(Taslimi et al., 2022)提出的训练方案,该方案采用ResNet-34作为共享参数网络。数据集按官方划分采用70%-10%-20%比例分割为训练集、验证集和测试集。
We present the test results for these datasets in Table 2. On CIFAR-MTL we report the accuracy in class assignment, and on ChestX-ray14 we report $\Delta_{m}$ based on the AUC-ROC values per task. From the table, BayesAgg-MTL performs best on both datasets. Interestingly, on the ChestX-ray14 dataset almost all methods, except for ours and DWA, under-perform the naive LS baseline. In Appendix C.2 we compare the run-time of all methods on this dataset and on QM9. We show that BayesAgg-MTL is substantially faster than other baseline methods that use gradients w.r.t. the shared parameters to weigh the tasks.
我们在表2中展示了这些数据集的测试结果。在CIFAR-MTL上我们报告了类别分配的准确率,在ChestX-ray14上我们基于每项任务的AUC-ROC值报告了$\Delta_{m}$。从表中可以看出,BayesAgg-MTL在两个数据集上都表现最佳。有趣的是,在ChestXray14数据集上,除了我们的方法和DWA之外,几乎所有方法的表现都低于简单的LS基线。在附录C.2中,我们比较了所有方法在该数据集和QM9上的运行时间。结果表明,BayesAgg-MTL明显快于其他使用共享参数梯度来权衡任务的基线方法。
5.3. BayesAgg-MTL for Mixed Tasks
5.3. 面向混合任务的BayesAgg-MTL
In the last set of experiments, we evaluated BayesAgg-MTL and baseline methods on the UTKFace dataset (Zhang et al., 2017). This dataset contains over 20,000 face images with annotations of age, gender, and ethnicity. The age values range from 0 to 116 and are treated as a regression task. Gender is a binary classification task (male or female), while ethnicity has five distinct categories, making it a multi-class classification task. We split the dataset 70%-10%-20% into training, validation, and test sets. Here, we use ResNet-18 for the shared network.
在最后一组实验中,我们在UTKFace数据集 (Zhang et al., 2017) 上评估了BayesAgg-MTL和基线方法。该数据集包含20,000多张带有年龄、性别和种族标注的人脸图像。年龄值范围为0到116岁,作为回归任务处理。性别分为二元类别(男性或女性),而种族分为五个不同类别,构成多类别分类任务。我们按照$70%-10%-20%$的比例将数据集划分为训练集、验证集和测试集。此处我们使用ResNet-18作为共享网络。
Results for this dataset, based on 8 random seeds, are presented in Table 3. Here as well, BayesAgg-MTL outperforms all methods in terms of $\Delta_{m}\%$, with the best results on 2 out of 3 tasks. Interestingly, our approach and MGDA were the only methods to improve upon the STL baseline on the regression task.
基于8个随机种子在该数据集上的结果如表3所示。BayesAgg-MTL同样优于所有方法,在3项任务中有2项取得最佳结果。值得注意的是,我们的方法与MGDA是仅有的两种在回归任务上超越STL基线的方法。
6. Conclusions
6. 结论
In this study, we present BayesAgg-MTL, a novel method for aggregating the task gradients in MTL. Instead of treating the gradient of each task as a deterministic quantity, we advocate assigning a probability distribution over the gradients. Their randomness stems from the observation that many possible configurations of the task-specific parameters work well. Hence, by tracking all of them using Bayesian tools, we can obtain a richer description of the gradient space. This in turn allows us to model the uncertainty in the gradients and derive an update direction for the shared parameters that takes it into account. We demonstrate our method's effectiveness on several benchmark datasets compared with leading baseline methods. For future work, we would like to extend BayesAgg-MTL beyond linear task heads. The challenge here would be to efficiently estimate the Bayesian posterior and the gradient moments. Another possible limitation of BayesAgg-MTL, shared with other popular MTL methods, is that it may fail on rare or atypical examples (Sagawa et al., 2019).
在本研究中,我们提出了BayesAgg-MTL,这是一种用于多任务学习(MTL)中任务梯度聚合的新方法。不同于将每个任务的梯度视为确定性量,我们主张为其分配概率分布。这种随机性源于观察到存在许多表现良好的任务特定参数配置。因此,通过使用贝叶斯工具跟踪所有这些配置,我们可以获得对梯度空间更丰富的描述。这反过来使我们能够对梯度中的不确定性进行建模,并为共享参数推导出考虑该不确定性的更新方向。我们在多个基准数据集上证明了该方法相对于领先基线方法的有效性。对于未来工作,我们希望将BayesAgg-MTL扩展到线性任务头之外。这里的挑战在于高效估计贝叶斯后验和梯度矩。BayesAgg-MTL与其他流行MTL方法共有的另一个可能局限是,它可能在罕见或非典型样本上失效(Sagawa et al., 2019)。
