An Unsupervised Approach to Achieve Supervised-Level Explainability in Healthcare Records
Abstract
Electronic healthcare records are vital for patient safety as they document conditions, plans, and procedures in both free text and medical codes. Language models have significantly enhanced the processing of such records, streamlining workflows and reducing manual data entry, thereby saving healthcare providers significant resources. However, the black-box nature of these models often leaves healthcare professionals hesitant to trust them. State-of-the-art explainability methods increase model transparency but rely on human-annotated evidence spans, which are costly. In this study, we propose an approach to produce plausible and faithful explanations without needing such annotations. We demonstrate on the automated medical coding task that adversarial robustness training improves explanation plausibility and introduce AttInGrad, a new explanation method superior to previous ones. By combining both contributions in a fully unsupervised setup, we produce explanations of comparable quality, or better, to that of a supervised approach. We release our code and model weights.¹
1 Introduction
Explainability in natural language processing remains a largely unsolved problem, posing significant challenges for healthcare applications (Lyu et al., 2023). For every patient admission, a healthcare professional must read extensive documentation in the healthcare records to assign appropriate medical codes. A code is a machine-readable identifier for a diagnosis or procedure, pivotal for tasks such as statistics, documentation, and billing. This process can involve sifting through thousands of words to choose from over 140,000 possible codes (Johnson et al., 2016), making medical coding not only time-consuming but also error-prone (Burns et al., 2012; Tseng et al., 2018).
Figure 1: Example of an input, prediction, and feature attribution explanation highlighted in the input.
Automated medical coding systems, powered by machine learning models, aim to alleviate these burdens by suggesting medical codes based on freeform written documentation. However, when reviewing suggested codes, healthcare professionals must still manually locate relevant evidence in the documentation. This is a slow and strenuous process, especially when dealing with extensive documentation and numerous medical codes. Explainability is essential for making this process tractable.
Feature attributions, a common form of explainability, can help healthcare professionals quickly find the evidence necessary to review a medical code suggestion (see Figure 1). Feature attribution methods score each input feature based on its influence on the model’s output. These explanations are often evaluated through plausibility and faithfulness (Jacovi and Goldberg, 2020). Plausibility measures how convincing an explanation is to human users, while faithfulness measures the explanation’s ability to reflect the model’s logic.
While previous work has proposed feature attribution methods for automated medical coding, it has only evaluated attention-based feature attribution methods. Furthermore, the state-of-the-art method uses a supervised approach relying on costly evidence-span annotations. This reliance on manual annotations significantly limits practical applicability, as each code system and its versions require separate manual annotations (Mullenbach et al., 2018; Teng et al., 2020; Dong et al., 2021; Kim et al., 2022; Cheng et al., 2023).
In this study, we present an approach for producing explanations of comparable quality to the supervised state-of-the-art method but without using evidence span annotations. We implement adversarial robustness training strategies to decrease the model's dependency on irrelevant features, thereby avoiding such features in the explanations (Tsipras et al., 2018). Moreover, we present more faithful feature attribution methods than the attention-based method used in previous studies. Our key contributions are:
2 Related work
Next, we present explainability approaches for automated medical coding and previous work on how adversarial robustness affects explainability.
2.1 Explainable automated medical coding
Automated medical coding is a multi-label classification task that aims to predict a set of medical codes from $J$ classes based on a given medical document (Edin et al., 2023). In this context, the objective of explainable automated medical coding is to generate feature attribution scores for each of the $J$ classes. These scores quantify how much each input token influences each class's prediction.
Most studies in explainable automated medical coding use attention weights as feature attribution scores without comparing to other methods (Mullenbach et al., 2018; Teng et al., 2020; Dong et al., 2021; Feucht et al., 2021; Cheng et al., 2023). However, two studies suggest alternative feature attribution methods. Xu et al. (2019) propose a feature attribution method tailored to their one-layer CNN architecture but do not compare performance with other methods. Kim et al. (2022) train a linear medical coding model using knowledge distillation and use its weights as the explanation. However, their method does not improve over the explanations of the popular attention approach. Cheng et al. (2023) improve the plausibility of the attention weights of the final layer by training them to align with evidence span annotations. However, obtaining such annotations is costly.
Previous work focused on plausibility using human ratings (Mullenbach et al., 2018; Teng et al., 2020; Kim et al., 2022), example inspection (Dong et al., 2021; Feucht et al., 2021; Liu et al., 2022), or evidence span overlap metrics (Xu et al., 2019; Cheng et al., 2023). Notably, no studies have assessed the faithfulness of the explanations nor compared the attention-based methods with other established methods.
2.2 Adversarial robustness and explainability
Adversarial robustness refers to the ability of a machine learning model to maintain performance under adversarial attacks, which involve making small changes to input data that do not significantly affect human perception or judgment (e.g., a small amount of image noise). Tsipras et al. (2018) and Ilyas et al. (2019) demonstrate that adversarial examples exploit the models' dependence on fragile, non-robust features. Adversarial robustness training embeds invariances to prevent models from relying on such non-robust features, with regularization and data augmentation as the main strategies (Tsipras et al., 2018; Ros and Doshi-Velez, 2018).
Previous work in image classification shows that adversarially robust models generate more plausible explanations (Ros and Doshi-Velez, 2018; Chen et al., 2019; Etmann et al., 2019). These studies demonstrate this phenomenon for three adversarial training strategies: 1) input gradient regularization, which improves the Lipschitzness of neural networks (Drucker and Le Cun, 1992; Ros and Doshi-Velez, 2018; Chen et al., 2019; Etmann et al., 2019; Rosca et al., 2020; Fel et al., 2022; Khan et al., 2023), 2) adversarial training, which trains models on adversarial examples, thereby embedding invariance to adversarial noise (Tsipras et al., 2018), and 3) feature masking, which masks unimportant features during training to embed invariance to such features (Bhalla et al., 2023).
The relationship between model robustness and explanation plausibility in Natural Language Processing (NLP) remains unclear, primarily due to the fundamental differences between textual tokens and image pixels. To date, only two studies, Yoo and Qi (2021) and Li et al. (2023), have explored this relationship in depth. These studies suggest that robust models produce more faithful explanations compared to their non-robust counterparts. However, their conclusions may be questionable because they rely on the Area Over the Perturbation Curve (AOPC) metric, which is unsuitable for cross-model comparisons (Edin et al., 2024); this inappropriate application of AOPC may have led to misleading conclusions.
3 Methods
Here, we describe the adversarial robustness training strategies and feature attribution methods in the context of a prediction model for medical coding. The underlying automated medical coding model takes a sequence of tokens as input and outputs medical code probabilities (Section 4.2).
3.1 Adversarial robustness training strategies
We implemented three adversarial training strategies, which we hypothesized could decrease our medical coding model's reliance on irrelevant tokens: input gradient regularization, projected gradient descent, and token masking. We chose these strategies because they have been shown to improve plausibility in image classification and faithfulness in text classification (Li et al., 2023).
Input gradient regularization (IGR) encourages the gradient of the output with respect to the input to be small. This aims to decrease the number of features on which the model relies, encouraging it to ignore irrelevant words (Drucker and Le Cun, 1992). We adapt IGR to text classification by adding to the task's binary cross-entropy loss, $L_{\mathrm{BCE}}$, the $\ell^{2}$ norm of the gradient of $L_{\mathrm{BCE}}$ wrt. the input token embedding sequence $\pmb{X}\in\mathbb{R}^{N\times D}$. This yields the total loss,
$L_{\mathrm{BCE}}(f(\boldsymbol{X}),\boldsymbol{y})+\lambda_{1}\left\|\nabla_{\boldsymbol{X}}L_{\mathrm{BCE}}(f(\boldsymbol{X}),\boldsymbol{y})\right\|_{2},$ where $\pmb{y}\in\mathbb{R}^{J}$ is a binary target vector representing the $J$ medical codes, $\lambda_{1}$ is a hyperparameter, and $f:\mathbb{R}^{N\times D}\rightarrow\mathbb{R}^{J}$ is the classification model.
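For concreteness, a minimal PyTorch-style sketch of this objective is shown below, assuming a `model` that maps token embeddings to code logits; the function name and the double-backward setup are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

def igr_loss(model, X, y, lambda1=0.1):
    """Binary cross-entropy plus an L2 penalty on the input gradient (IGR sketch)."""
    X = X.clone().requires_grad_(True)          # token embeddings, shape (N, D) or (B, N, D)
    logits = model(X)                           # code logits, shape (..., J)
    bce = F.binary_cross_entropy_with_logits(logits, y)
    # Gradient of the BCE loss w.r.t. the input embeddings; keep the graph
    # so the penalty itself can be backpropagated through.
    grad_X, = torch.autograd.grad(bce, X, create_graph=True)
    penalty = grad_X.norm(p=2)
    return bce + lambda1 * penalty
```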
Projected gradient descent (PGD) increases model robustness by training with adversarial examples, thereby promoting invariance to such inputs (Madry et al., 2018). We hypothesized that PGD reduces the model's reliance on irrelevant tokens, as adversarial examples often arise from the model's use of such unrobust features (Tsipras et al., 2018). PGD aims to find the noise $\pmb{\delta}\in\mathbb{R}^{N\times D}$ that maximizes the loss $L_{\mathrm{BCE}}(f(\boldsymbol{X}+\pmb{\delta}),\boldsymbol{y})$ while satisfying the constraint $\|\pmb{\delta}\|_{\infty}\leq\epsilon$, where $\epsilon$ is a hyperparameter. PGD was originally designed for image classification; we adapted it to NLP by adding the noise to the token embeddings $\boldsymbol{X}$. We implemented PGD as follows,
$$
\pmb{Z}^{*}=\arg\operatorname*{max}_{\pmb{Z}} L_{\mathrm{BCE}}(f(\pmb{X}+\delta(\pmb{Z})),\pmb{y}),
$$
and enforced the constraint $\|\pmb{\delta}\|_{\infty}\leq\epsilon$ by parameterizing $\delta(\boldsymbol{Z})=\epsilon\operatorname{tanh}(\boldsymbol{Z})$ and optimising $\pmb{Z}\in\mathbb{R}^{N\times D}$ directly. We initialized $\boldsymbol{Z}$ with zeros. Finally, we tuned the model parameters using the following training objective:
$L_{\mathrm{BCE}}(f(\boldsymbol{X}),\boldsymbol{y})+\lambda_{2}\,L_{\mathrm{BCE}}(f(\boldsymbol{X}+\delta(\boldsymbol{Z}^{*})),\boldsymbol{y}),$ where $\lambda_{2}$ is a hyperparameter.
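A compact sketch of this adversarial training step in PyTorch follows, again assuming a `model` over token embeddings; the number of inner steps, the inner optimizer, and the function names are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def pgd_loss(model, X, y, eps=0.01, inner_steps=3, inner_lr=0.1, lambda2=1.0):
    """Clean BCE loss plus BCE loss on an adversarially perturbed input (PGD sketch)."""
    clean_loss = F.binary_cross_entropy_with_logits(model(X), y)

    # Parameterize the noise as eps * tanh(Z) so the L-infinity constraint holds by construction.
    Z = torch.zeros_like(X, requires_grad=True)
    optimizer = torch.optim.SGD([Z], lr=inner_lr)
    for _ in range(inner_steps):
        optimizer.zero_grad()
        adv_loss = F.binary_cross_entropy_with_logits(model(X + eps * torch.tanh(Z)), y)
        (-adv_loss).backward()               # gradient ascent on the loss w.r.t. Z
        optimizer.step()
    model.zero_grad()                        # discard gradients accumulated during the inner loop

    adv_loss = F.binary_cross_entropy_with_logits(model(X + eps * torch.tanh(Z.detach())), y)
    return clean_loss + lambda2 * adv_loss
```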
Token masking (TM) teaches the model to predict accurately while using as few features as possible, thereby encouraging the model to ignore irrelevant words. TM uses a binary mask to occlude unimportant tokens and trains the model to rely only on the remaining tokens (Bhalla et al., 2023; Tomar et al., 2023). Inspired by Bhalla et al. (2023), we employed a two-step teacher-student approach. We used two copies of the same model already trained on the automated medical coding task: a teacher $f_{t}$ with frozen model weights and a student $f_{s}$, which we fine-tuned. For each training batch, the first step was to learn a sparse mask $\hat{\boldsymbol{M}}\in[0,1]^{N\times D}$ that still provided enough information to predict the correct codes by minimizing:
$$
\|\hat{\boldsymbol{M}}\|_{1}+\beta\,\|f_{s}(\pmb{X})-f_{s}(x_{m}(\pmb{X},\hat{\boldsymbol{M}}))\|_{1},
$$
where $\beta$ is a hyperparameter and $x_{m}:\mathbb{R}^{N\times D}\rightarrow\mathbb{R}^{N\times D}$ is the masking function:
$$
x_{m}(X,M)=B\odot(1-M)+X\odot M,
$$
where $\boldsymbol{B}\in\mathbb{R}^{N\times D}$ is the baseline input. We chose $\boldsymbol{B}$ as the token embedding representing the start token, followed by the mask token embedding repeated $N-2$ times, followed by the end token embedding. After optimization, we binarized the mask $\boldsymbol{M}=\mathrm{round}(\hat{\boldsymbol{M}})$, where around 90% of the features were masked. Finally, we tuned the model $f_{s}$ using the following training objective:
$\|f_{s}(\boldsymbol{X})-f_{t}(\boldsymbol{X})\|_{1}+\lambda_{3}\,\|f_{s}(\boldsymbol{X})-f_{s}(x_{m}(\boldsymbol{X},\boldsymbol{M}))\|_{1},$ where $\lambda_{3}$ is a hyperparameter.
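A simplified sketch of the two TM steps for one batch is shown below, assuming `student` and a frozen `teacher` share the same architecture and operate on token embeddings; the sigmoid parameterization of the soft mask and the optimizer settings are illustrative assumptions.

```python
import torch

def tm_step(student, teacher, X, B, beta=1.0, lambda3=1.0, mask_steps=10, mask_lr=0.1):
    """One token-masking (TM) step: learn a sparse mask, then fine-tune the student."""
    # Step 1: learn a soft mask M_hat in [0, 1]^{N x D} that keeps the student's predictions stable.
    M_logits = torch.zeros_like(X, requires_grad=True)
    mask_opt = torch.optim.Adam([M_logits], lr=mask_lr)
    with torch.no_grad():
        unmasked_out = student(X)                        # f_s(X), the target in step 1
    for _ in range(mask_steps):
        mask_opt.zero_grad()
        M_hat = torch.sigmoid(M_logits)
        X_masked = B * (1 - M_hat) + X * M_hat           # masking function x_m(X, M_hat)
        loss = M_hat.abs().sum() + beta * (student(X_masked) - unmasked_out).abs().sum()
        loss.backward()
        mask_opt.step()
    student.zero_grad()                                  # drop gradients accumulated while fitting the mask

    # Step 2: binarize the mask and fine-tune the student to match the frozen teacher on the
    # full input while staying invariant to masking out the unimportant tokens.
    M = torch.sigmoid(M_logits).detach().round()
    X_masked = B * (1 - M) + X * M
    with torch.no_grad():
        teacher_out = teacher(X)
    return (student(X) - teacher_out).abs().sum() \
        + lambda3 * (student(X) - student(X_masked)).abs().sum()
```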
3.2 Feature attribution methods
We evaluated several feature attribution methods for automated medical coding, categorizing them into three types: attention-based, gradient-based, and perturbation-based (more details in Appendix B). Attention-based methods like Attention (Mullenbach et al., 2018), Attention Rollout (Abnar and Zuidema, 2020), and AttGrad (Serrano and Smith, 2019) rely on the model's attention weights. Gradient-based methods such as InputXGrad (Sundararajan et al., 2017), Integrated Gradients (IntGrad) (Sundararajan et al., 2017), and Deeplift (Shrikumar et al., 2017) use backpropagation to quantify the influence of input features on outputs. Perturbation-based methods, including LIME (Ribeiro et al., 2016), KernelSHAP (Lundberg and Lee, 2017), and Occlusion@1 (Ribeiro et al., 2016), measure the impact on output confidence by occluding input features.
Our investigation into feature attribution methods revealed an intriguing pattern: while individual methods often produced unreliable explanations, their shortcomings rarely overlapped. Attention-based methods and gradient-based approaches like InputXGrad frequently disagreed on which tokens were most important, yet both contributed valuable insights in different scenarios. This observation sparked a key question: could we leverage the complementary strengths of these methods to create a more robust attribution technique?
To address this question, we propose AttInGrad, a novel feature attribution method that combines Attention and InputXGrad. AttInGrad multiplies their respective attribution scores, aiming to amplify the importance of tokens deemed relevant by both methods while down-weighting those highlighted by only one or neither method.
We formalize the AttInGrad attribution scores for class $j$ using the following equation:
$$
\begin{bmatrix}
A_{j1}\cdot\left\|\boldsymbol{X}_{1}\odot\frac{\partial f_{j}}{\partial \boldsymbol{X}_{1}}(\boldsymbol{X})\right\|_{2}\\
\vdots\\
A_{jN}\cdot\left\|\boldsymbol{X}_{N}\odot\frac{\partial f_{j}}{\partial \boldsymbol{X}_{N}}(\boldsymbol{X})\right\|_{2}
\end{bmatrix},
$$
where $\pmb{A}\in\mathbb{R}^{J\times N}$ is the attention matrix, $\odot$ is the element-wise matrix multiplication operation, $N$ is the number of tokens in a document, and $J$ is the number of classes.
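A sketch of this computation for a single class in PyTorch is shown below, assuming access to the decoder's cross-attention weights `A` (shape $J\times N$) and the token embeddings `X`; the attention extraction itself is model-specific and omitted here.

```python
import torch

def attingrad(model, X, A, class_idx):
    """AttInGrad scores for one class: attention weights times the per-token InputXGrad norm."""
    X = X.clone().requires_grad_(True)            # token embeddings, shape (N, D)
    logit = model(X)[class_idx]                   # scalar logit f_j(X)
    grad_X, = torch.autograd.grad(logit, X)       # gradient of f_j w.r.t. every token embedding
    inputxgrad = (X * grad_X).norm(p=2, dim=-1)   # L2 norm of X_i ⊙ ∂f_j/∂X_i per token, shape (N,)
    return (A[class_idx] * inputxgrad).detach()   # element-wise product with attention row A_j
```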
In Section 5, we will provide an in-depth analysis of the mechanisms underlying AttInGrad’s effectiveness, shedding light on why this combination of methods yields improved feature attributions.
Table 1: The two data splits used in this paper. MIMIC-III full comprises discharge summaries annotated with ICD-9 codes. MDACE comprises discharge summaries annotated with ICD-9 codes and evidence spans.
Split | Train | Val | Test |
---|---|---|---|
MIMIC-III full | 47,719 | 1,631 | 3,372 |
MDACE | 181 | 60 | 61 |
4 Experimental setup
In the following, we present our datasets, models, and evaluation metrics.
4.1 Data
We conducted our experiments using the open-access MIMIC-III and the newly released MDACE dataset (Johnson et al., 2016; Cheng et al., 2023). MIMIC-III includes 52,722 discharge summaries from the Beth Israel Deaconess Medical Center's ICU, collected between 2008 and 2016 and annotated with ICD-9 codes. MDACE comprises 302 re-annotated MIMIC-III cases, adding evidence spans to indicate the textual justification for each medical code. Not all possible evidence spans are annotated; for example, if hypertension is mentioned multiple times, only the first mention might be annotated, leaving subsequent mentions unannotated. We focused exclusively on discharge summaries, as in most previous medical coding studies on MIMIC-III (Teng et al., 2022). Statistics are in Table 1.
For dataset splits, we used MIMIC-III full, a popular split by Mullenbach et al. (2018), and MDACE, introduced by Cheng et al. (2023) for training and evaluating explanation methods. All MDACE examples come from the MIMIC-III full test set; whenever MDACE examples were used in our training data, we excluded them from that test set.
4.2 Models
We used PLM-ICD, a state-of-the-art automated medical coding model architecture, for our experiments because its architecture is simple while outperforming other models according to Huang et al. (2022) and Edin et al. (2023). To address stability issues caused by numerical overflow in the decoder of the original model, we replaced the label-wise attention mechanism with standard cross-attention (Vaswani et al., 2017). This adjustment not only stabilized training but also slightly improved performance. We provide further details on the architecture modifications in Appendix A.
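A minimal sketch of the kind of cross-attention decoder described here, assuming learned label query embeddings attend over the encoder's token representations; the layer sizes and any additional components are assumptions, with the actual details in Appendix A.

```python
import torch
import torch.nn as nn

class CrossAttentionDecoder(nn.Module):
    """Label queries attend over token representations; one logit per medical code."""
    def __init__(self, num_codes, hidden_dim, num_heads=8):
        super().__init__()
        self.label_queries = nn.Embedding(num_codes, hidden_dim)   # one query per code
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, token_states):
        # token_states: (batch, N, hidden_dim) from the pretrained language model encoder
        batch_size = token_states.size(0)
        queries = self.label_queries.weight.unsqueeze(0).expand(batch_size, -1, -1)
        # attn_weights: (batch, num_codes, N) — usable as the Attention feature attributions
        attended, attn_weights = self.cross_attn(queries, token_states, token_states)
        logits = self.classifier(attended).squeeze(-1)              # (batch, num_codes)
        return logits, attn_weights
```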
We compared five models: $\mathrm{B_{U}}$, $\mathrm{B_{S}}$, IGR, PGD, and TM. All models used our modified PLM-ICD architecture but were trained differently. $\mathrm{B_{U}}$ was trained unsupervised with binary cross-entropy, whereas $\mathrm{B_{S}}$ employed a supervised auxiliary training objective that minimized the KL divergence between the model's cross-attention weights and annotated evidence spans, as per Cheng et al. (2023). IGR, PGD, and TM training is as in Section 3.1. Best hyperparameters are in Appendix D.
4.3 Experiments
We trained all five models with ten seeds on the MIMIC-III full and MDACE training set. The supervised training strategy $\mathrm{B}_{\mathrm{S}}$ used the evidence span annotations, while the others only used the medical code annotations. For each model, we evaluated the plausibility and faithfulness of the explanations generated by every explanation method.
We aimed to demonstrate a similar explanation quality as a supervised approach but without training on evidence spans. Therefore, after evaluating the models and explanation methods, we compared our best combination with the supervised strategy proposed by Cheng et al. (2023), who used the $\mathrm{B_{S}}$ model and the Attention explanation method. We also compared our best combination with the unsupervised strategy used by most previous works (see Section 2.1), comprising the $\mathrm{B_{U}}$ model and the Attention explanation method.
4.4 Evaluation metrics
We measured the explanation quality using metrics estimating plausibility and faithfulness. Plausibility measures how convincing an explanation is to human users, while faithfulness measures how accurately an explanation reflects a model's true reasoning process (Jacovi and Goldberg, 2020).
Plausibility metrics Our plausibility metrics measured the overlap between explanations and annotated evidence-spans. We assumed that a high overlap indicated plausible explanations for medical coders. We identified the most important tokens using feature attribution scores, applying a decision boundary for classification metrics, and selecting the top $K$ scores for ranking metrics.
For classification metrics, we used Precision (P), Recall (R), and F1 scores, selecting the decision boundary that yielded the highest F1 score on the validation set (Cheng et al., 2023). Additionally, we included four more classification metrics: Empty explanation rate (Empty), Evidence span recall (SpanR), Evidence span cover (Cover), and Area Under the Precision-Recall Curve (AUPRC). Empty measures the rate of empty explanations when all attribution scores in an example are below the decision boundary. SpanR measures the percentage of annotated evidence spans where at least one token is classified correctly. Cover measures the percentage of tokens in an annotated evidence span that are classified correctly, given that at least one token is predicted correctly. AUPRC represents the area under the precision-recall curve generated by varying the decision boundary from zero to one.
For ranking metrics, we selected the top $K$ tokens with the highest attribution scores, using Recall@K, Precision@K, and Intersection-Over-Union (IOU) (DeYoung et al., 2020).
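A small sketch of how a few of these plausibility metrics could be computed for one (document, code) pair from per-token attribution scores and binary evidence labels; the aggregation across examples and the handling of ties are simplifying assumptions.

```python
import numpy as np

def plausibility_metrics(attributions, evidence, boundary=0.5, k=5):
    """attributions: (N,) float scores; evidence: (N,) binary annotated-evidence labels."""
    predicted = attributions >= boundary
    tp = np.sum(predicted & (evidence == 1))
    precision = tp / max(predicted.sum(), 1)
    recall = tp / max(evidence.sum(), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    empty = float(predicted.sum() == 0)              # Empty explanation for this example
    top_k = np.argsort(attributions)[::-1][:k]       # ranking metrics use the top-K tokens
    precision_at_k = evidence[top_k].sum() / k
    recall_at_k = evidence[top_k].sum() / max(evidence.sum(), 1)
    return {"P": precision, "R": recall, "F1": f1, "Empty": empty,
            "P@K": precision_at_k, "R@K": recall_at_k}
```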
Faithfulness metrics We use two metrics to approximate faithfulness: Sufficiency and Comprehensiveness (DeYoung et al., 2020); more details are in Appendix C. Faithful explanations yield high Comprehensiveness and low Sufficiency scores. A high Sufficiency score indicates that many important tokens are incorrectly assigned low attribution scores, while a low Comprehensiveness score suggests that many non-important tokens are incorrectly assigned high attribution scores.
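A hedged sketch of Sufficiency and Comprehensiveness for a single (document, code) pair, following the usual definitions of DeYoung et al. (2020): Comprehensiveness compares the model's probability before and after removing the top-$K$ tokens, Sufficiency after keeping only them. Operating on token ids with a mask token as "removal", and the choice of $K$, are assumptions; the exact setup is in Appendix C.

```python
import torch

def faithfulness_metrics(model, input_ids, attributions, class_idx, mask_id, k=5):
    """Comprehensiveness and Sufficiency for one class, using token masking as 'removal'."""
    def prob(ids):
        with torch.no_grad():
            return torch.sigmoid(model(ids))[class_idx].item()

    full_prob = prob(input_ids)
    top_k = torch.topk(attributions, k).indices

    removed = input_ids.clone()
    removed[top_k] = mask_id                     # remove the K most important tokens
    comprehensiveness = full_prob - prob(removed)

    kept = torch.full_like(input_ids, mask_id)
    kept[top_k] = input_ids[top_k]               # keep only the K most important tokens
    sufficiency = full_prob - prob(kept)

    return comprehensiveness, sufficiency
```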
5 Results
Next, we present experimental results for the different training strategies and explainability methods.
Rivaling supervised methods in explanation quality The objective of this paper was to produce high-quality explanations without relying on evidence span annotations. In Figure 2, we compare the plausibility of our approach (Token masking and AttInGrad) with the unsupervised approach ($\mathrm{B_{U}}$ and Attention) and the supervised state-of-the-art approach ($\mathrm{B_{S}}$ and Attention). Our approach was substantially more plausible than the unsupervised one on all metrics. Compared with the supervised approach, our approach achieved similar F1 and Recall@5 and substantially better Empty scores.
Figure 2: Comparison of plausibility across various combinations of explanation methods and models from this study and previous work. Most previous studies used Attention and a standard medical coding model ($\mathrm{B_{U}}$). Cheng et al. (2023) instead used a supervised model trained on evidence-span annotations ($\mathrm{B_{S}}$). We proposed AttInGrad and an adversarially robust model (TM).
The supervised approach achieved similar plausibility to ours on most metrics (see Table 2). Our approach also achieved the highest comprehensiveness and lowest sufficiency scores (see Figure 3). The difference was larger in the sufficiency scores, where the supervised score was twice as high as ours.
Adversarial robustness improves plausibility We evaluated the explanation plausibility of every model and explanation method combination in Table 2. IGR and TM outperformed the baseline model $\mathrm{B_{U}}$ on most metrics and explanation methods. In Appendix E.5, we compare the unsupervised models on a bigger test set and see similar results. The supervised model $\mathrm{B_{S}}$ yielded better results for attention-based explanations but was weaker than the robust models when using the gradient-based explanation methods: InputXGrad, IntGrad, and Deeplift.
AttInGrad is more plausible and faithful than Attention AttInGrad was more plausible than all other explanation methods across all training strategies and metrics. Notably, plausibility improvements were particularly significant, with relative gains exceeding ten percent in most metrics (see Table 2 and Appendix E.2). For instance, for $\mathrm{B_{U}}$, AttInGrad reduced the Empty metric from 21.1% to 3.0% and improved the Cover metric from 63.4% to 74.9%. However, these enhancements were less pronounced for $\mathrm{B_{S}}$, the supervised model.
AttInGrad was also more faithful than Attention (see Figure 3). However, while AttInGrad surpassed the gradient-based methods in comprehensiveness, its sufficiency scores were slightly worse.
Analysis of attention-based explanations While AttInGrad and Attention were more plausible than the gradient-based explanations, they had a three-fold higher inter-seed variance. We found that they often attributed high importance to tokens devoid of alphanumeric characters, such as "[" and "*", which we classify as special tokens. These special tokens, such as punctuation and byte-pair encoding artifacts, rarely carry semantic meaning. In the MDACE test set, they accounted for 32.2% of all tokens, compared to just 5.8% within the annotated evidence spans, suggesting they are unlikely to be relevant evidence.
In Figure 4, we analyze the relationship between explanation quality (y-axis) and the proportion of the top five most important tokens that are special tokens (x-axis). Each data point represents the average statistics across the MDACE test set for one seed/run of the $\mathrm{B_{U}}$ model. Figures 4a and 4b show F1 (plausibility) and comprehensiveness (faithfulness) respectively. For Attention and AttInGrad, we see strong negative correlations for both metrics with a large inter-seed variance. The regression lines fitted on Attention and AttInGrad overlap, with the data points from AttInGrad shifted slightly towards the upper left, indicating attribution of less importance to special tokens.
Conversely, for InputXGrad, we see a moderate negative correlation for the F1 score and no correlation for comprehensiveness. Furthermore, InputXGrad demonstrates a small inter-seed variance, where the proportion of special tokens more closely mirrors that observed in the evidence spans.
We hypothesized that AttInGrad’s improvements over Attention stem from InputXGrad reducing special tokens’ attribution scores. We tested this by zeroing out these tokens’ scores. While it substantially enhanced Attention’s F1 score, Attention remained lower than AttInGrad (see Table 3). If AttInGrad’s sole contribution were filtering special tokens, we would expect similar F1 scores after zeroing their attributions. The fact that AttInGrad still outperforms Attention after controlling for special tokens suggests that there are additional factors beyond special token filtering contributing to AttInGrad’s improved performance.
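A tiny sketch of the control experiment described above, assuming tokens are available as strings; the regex definition of "special token" (no alphanumeric characters) follows the text, while the tokenization details are an assumption.

```python
import re

def zero_special_tokens(tokens, attributions):
    """Set attribution scores of tokens without alphanumeric characters to zero."""
    return [0.0 if not re.search(r"[A-Za-z0-9]", tok) else score
            for tok, score in zip(tokens, attributions)]
```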
Table 2: Plausibility of Attention, InputXGrad, Integrated Gradients (IntGrad), Deeplift, and AttInGrad on the MDACE test set. Each experiment was run with ten different seeds. We show the mean over seeds $\pm$ the standard deviation. All the scores are presented as percentages. Bold numbers outperform the unsupervised baseline model, while underlined numbers outperform the supervised model. We included more feature attribution methods in Appendix E.2.
Columns P through Cover are classification (prediction) metrics; IOU, P@5, and R@5 are ranking metrics.

Explainer | Model | P↑ | R↑ | F1↑ | AUPRC↑ | Empty↓ | SpanR↑ | Cover↑ | IOU↑ | P@5↑ | R@5↑ |
---|---|---|---|---|---|---|---|---|---|---|---|
Attention | Bs | 41.7±5.0 | 42.9±3.2 | 42.2±3.7 | 37.3±3.9 | 13.5±1.9 | 60.5±3.7 | 72.3±3.5 | 46.1±4.5 | 32.3±2.6 | 45.3±3.7 |
 | Bu | 35.9±5.9 | 37.7±5.7 | 36.6±5.3 | 31.4±6.4 | 21.1±3.6 | 53.2±7.4 | 63.4±6.8 | 39.3±6.9 | 28.8±4.2 | 40.3±5.9 |
 | IGR | 35.5±6.2 | 40.4±6.1 | 37.7±5.9 | 32.4±6.8 | 19.4±2.1 | 55.5±8.0 | 64.5±6.2 | 40.0±7.4 | 29.2±4.5 | 40.9±6.3 |
 | TM | 36.5±5.9 | 37.8±6.1 | 37.0±5.5 | 31.7±6.7 | 21.7±3.3 | 53.3±7.7 | 63.3±6.9 | 40.1±7.3 | 29.1±4.4 | 40.8±6.2 |
 | PGD | 33.0±8.4 | 38.4±8.0 | 35.4±8.3 | 30.0±9.1 | 19.1±1.6 | 51.5±13.6 | 61.1±9.9 | 35.9±10.8 | 27.7±6.3 | 38.9±8.8 |
InputXGrad | Bs | 30.7±2.0 | 33.7±2.6 | 32.0±1.4 | 24.9±2.0 | 7.3±1.3 | 48.5±2.8 | 64.0±2.8 | 32.2±2.1 | 26.6±1.2 | 37.3±1.6 |
 | Bu | 31.0±2.6 | 32.7±3.2 | 31.6±1.2 | 24.8±1.9 | 9.5±2.6 | 46.4±3.6 | 62.3±3.5 | 31.7±1.8 | 26.3±0.9 | 36.9±1.3 |
 | IGR | 32.4±2.5 | 33.8±2.2 | 33.0±0.7 | 27.0±0.7 | 9.7±2.0 | 48.5±2.6 | 64.6±2.5 | 32.8±1.0 | 27.4±0.4 | 38.4±0.5 |
 | TM | 32.4±2.7 | 34.1±2.1 | 33.1±1.4 | 26.2±2.3 | 8.6±1.6 | 48.5±2.4 | 64.8±2.5 | 32.6±1.5 | 27.4±1.1 | 38.3±1.5 |
 | PGD | 30.1±2.2 | 32.9±1.7 | 31.4±1.4 | 24.8±1.5 | 9.3±2.2 | 46.6±2.1 | 62.6±2.6 | 31.4±1.3 | 26.1±1.0 | 36.6±1.4 |
IntGrad | Bs | 30.9±3.2 | 33.9±2.0 | 32.2±1.7 | 26.2±2.6 | 4.8±2.0 | 50.8±2.4 | 65.9±3.3 | 34.2±2.8 | 27.4±1.6 | 38.4±2.2 |
 | Bu | 31.3±4.3 | 32.8±2.7 | 31.9±3.2 | 26.0±4.3 | 5.2±1.0 | 49.4±3.0 | 64.4±3.1 | 33.9±4.2 | 26.5±2.5 | 37.1±3.6 |
 | IGR | 32.6±2.2 | 33.8±2.3 | 33.1±1.3 | 26.6±2.0 | 5.2±1.7 | 49.7±2.5 | 64.7±3.3 | 34.9±2.3 | 27.6±0.9 | 38.7±1.2 |
 | TM | 33.1±3.6 | 34.0±3.7 | 33.4±2.8 | 27.5±3.7 | 5.4±1.6 | 51.0±3.9 | 65.5±4.1 | 35.3±4.2 | 27.6±2.1 | 38.7±3.0 |
 | PGD | 30.0±5.4 | 33.5±3.3 | 31.4±4.2 | 25.4±4.8 | 5.1±1.8 | 49.7±3.6 | 64.5±3.9 | 33.6±4.8 | 26.4±3.0 | 36.9±4.2 |
Deeplift | Bs | 29.2±2.3 | 34.2±3.0 | 31.3±1.2 | 24.1±1.8 | 6.4±1.8 | 48.9±3.4 | 65.2±3.3 | 31.1±1.9 | 26.2±1.1 | 36.7±1.5 |
 | Bu | 31.2±1.9 | 31.4±2.4 | 31.2±1.5 | 24.1±2.1 | 9.1±1.6 | 45.1±2.5 | 61.7±2.9 | 30.9±1.6 | 25.8±1.1 | 36.1±1.5 |
 | IGR | 31.0±1.8 | 33.7±1.4 | 32.2±0.6 | 25.9±0.9 | 8.6±1.0 | 48.3±1.5 | 64.8±1.7 | 31.6±1.0 | 26.7±0.5 | 37.4±0.7 |
 | TM | 30.2±2.4 | 34.8±1.4 | 32.2±1.4 | 25.1±2.5 | 6.8±1.5 | 49.1±1.8 | 65.7±1.6 | 31.4±1.8 | 26.6±1.1 | 37.3±1.6 |
 | PGD | 29.4±2.8 | 32.8±1.5 | 30.9±1.5 | 23.9±1.5 | 8.1±1.8 | 46.7±1.8 | 63.3±2.2 | 30.7±1.4 | 25.5±0.9 | 35.8±1.2 |
AttInGrad | Bs | 40.6±2.2 | 45.9±3.1 | 43.0±1.9 | 38.8±2.1 | 1.2±0.5 | 63.9±2.9 | 79.0±1.9 | 45.7±2.6 | 33.3±1.4 | 46.6±1.9 |
 | Bu | 40.7±2.9 | 42.5±4.4 | 41.5±3.2 | 37.1±3.6 | 3.0±1.3 | 59.3±5.4 | 74.9±5.2 | 43.5±4.0 | 32.0±2.1 | 44.9±3.0 |
 | IGR | 38.8±4.5 | 44.0±4.0 | 41.2±4.0 | 37.2±4.5 | 2.3±0.7 | 60.3±5.3 | 74.5±4.6 | 43.3±4.9 | 31.9±2.7 | 44.7±3.8 |
 | TM | 40.2±3.0 | 43.9±4.8 | 41.9±3.4 | 37.8±3.9 | 2.8±1.6 | 60.5±5.8 | 75.9±5.3 | 44.0±3.9 | 32.5±2.3 | 45.5±3.2 |
 | PGD | 37.1±8.2 | 42.5±5.7 | 39.3±7.1 | 34.7±8.6 | 2.3±0.9 | 57.7±9.1 | 72.7±7.4 | 39.6±10.2 | 30.9±4.8 | 43.3±6.7 |
6 Discussion
Do we need evidence span annotations? We demonstrated that we could match the explanation quality of Cheng et al. (2023) but without supervised training with evidence-span annotations (see Section 5). This raises the question: are evidence-span annotations unnecessary?
Intuitively, training a model on evidence spans should encourage it to use features relevant to humans, thereby making its explanations more plausible. However, we hypothesize that the training strategy used by Cheng et al. (2023) primarily addresses the shortcomings of attention-based explanation methods rather than enhancing the model's underlying logic. The model $\mathrm{B_{S}}$ only produced more plausible explanations with attention-based feature attribution methods (see Figure 2). If the model truly leveraged more informative features, we would expect to see improvements across various feature attribution methods. Add