[论文翻译]一种在医疗记录中实现监督级可解释性的无监督方法


原文地址:https://arxiv.org/pdf/2406.08958v2


An Unsupervised Approach to Achieve Supervised-Level Explainability in Healthcare Records

一种在医疗记录中实现监督级可解释性的无监督方法

Abstract

摘要

Electronic healthcare records are vital for patient safety as they document conditions, plans, and procedures in both free text and medical codes. Language models have significantly enhanced the processing of such records, streamlining workflows and reducing manual data entry, thereby saving healthcare providers significant resources. However, the black-box nature of these models often leaves healthcare professionals hesitant to trust them. State-of-the-art explainability methods increase model transparency but rely on human-annotated evidence spans, which are costly. In this study, we propose an approach to produce plausible and faithful explanations without needing such annotations. We demonstrate on the automated medical coding task that adversarial robustness training improves explanation plausibility and introduce AttInGrad, a new explanation method superior to previous ones. By combining both contributions in a fully unsupervised setup, we produce explanations of comparable quality, or better, to that of a supervised approach. We release our code and model weights.

电子健康记录对患者安全至关重要,它们以自由文本和医疗编码形式记录病情、治疗方案及操作流程。大语言模型显著提升了此类记录的处理效率,简化工作流程并减少人工数据录入,从而为医疗提供者节省大量资源。然而,这些模型的黑箱特性常使医疗从业者对其可信度存疑。当前最先进的解释性方法虽能提升模型透明度,但依赖人工标注的证据片段,成本高昂。本研究提出一种无需标注即可生成合理可靠解释的方案。我们在自动化医疗编码任务中证明:对抗性鲁棒训练能提升解释合理性,并提出优于现有方法的新解释技术AttInGrad。通过将两项创新结合于完全无监督框架,我们产出的解释质量达到或超越监督式方法。代码及模型权重已开源。1

1 Introduction

1 引言

Explainability in natural language processing remains a largely unsolved problem, posing significant challenges for healthcare applications (Lyu et al., 2023). For every patient admission, a healthcare professional must read extensive documentation in the healthcare records to assign appropriate medical codes. A code is a machine-readable identifier for a diagnosis or procedure, pivotal for tasks such as statistics, documentation, and billing. This process can involve sifting through thousands of words to choose from over 140,000 possible codes (Johnson et al., 2016), making medical coding not only time-consuming but also error-prone (Burns et al., 2012; Tseng et al., 2018).

自然语言处理的可解释性仍是一个尚未解决的重大问题,这给医疗健康应用带来了显著挑战 (Lyu et al., 2023) 。每当患者入院时,医疗专业人员必须阅读病历中的大量文档才能分配正确的医疗代码。这些代码是诊断或手术的机器可读标识符,对统计、记录和计费等任务至关重要。该过程可能需要筛选数千个单词,并从超过14万种可能的代码中进行选择 (Johnson et al., 2016) ,这使得医疗编码不仅耗时,而且容易出错 (Burns et al., 2012; Tseng et al., 2018) 。


Figure 1: Example of an input, prediction, and feature attribution explanation highlighted in the input.

图 1: 输入示例、预测结果及特征归因解释(在输入文本中高亮显示)。

Automated medical coding systems, powered by machine learning models, aim to alleviate these burdens by suggesting medical codes based on free-form written documentation. However, when reviewing suggested codes, healthcare professionals must still manually locate relevant evidence in the documentation. This is a slow and strenuous process, especially when dealing with extensive documentation and numerous medical codes. Explainability is essential for making this process tractable.

由机器学习模型驱动的自动化医疗编码系统,旨在通过基于自由文本记录推荐医疗编码来减轻这些负担。然而,在审查推荐编码时,医疗专业人员仍需手动从文档中定位相关证据。这一过程缓慢且费力,尤其是在处理大量文档和众多医疗编码时。可解释性对于使该过程易于管理至关重要。

Feature attributions, a common form of explainability, can help healthcare professionals quickly find the evidence necessary to review a medical code suggestion (see Figure 1). Feature attribution methods score each input feature based on its influence on the model’s output. These explanations are often evaluated through plausibility and faithfulness (Jacovi and Goldberg, 2020). Plausibility measures how convincing an explanation is to human users, while faithfulness measures the explanation’s ability to reflect the model’s logic.

特征归因 (feature attribution) 是一种常见的可解释性方法,可帮助医疗专业人员快速找到审查医疗编码建议所需的证据 (见图 1)。特征归因方法根据每个输入特征对模型输出的影响进行评分。这些解释通常通过合理性 (plausibility) 和忠实性 (faithfulness) 来评估 (Jacovi and Goldberg, 2020)。合理性衡量解释对人类用户的可信度,而忠实性衡量解释反映模型逻辑的能力。

While previous work has proposed feature attribution methods for automated medical coding, they only evaluated attention-based feature attribution methods. Furthermore, the state-of-the-art method uses a supervised approach relying on costly evidence-span annotations. This reliance on manual annotations significantly limits practical applicability, as each code system and its versions require separate manual annotations (Mullenbach et al., 2018; Teng et al., 2020; Dong et al., 2021; Kim et al., 2022; Cheng et al., 2023).

虽然先前的研究提出了用于自动化医疗编码的特征归因方法,但它们仅评估了基于注意力机制的特征归因方法。此外,当前最先进的方法采用依赖昂贵证据范围标注的监督式方法。这种对人工标注的依赖极大限制了实际应用性,因为每个编码系统及其版本都需要单独的人工标注 (Mullenbach et al., 2018; Teng et al., 2020; Dong et al., 2021; Kim et al., 2022; Cheng et al., 2023)。

In this study, we present an approach for producing explanations of comparable quality to the supervised state-of-the-art method but without using evidence span annotations. We implement adversarial robustness training strategies to decrease the model’s dependency on irrelevant features, thereby avoiding such features in the explanations (Tsipras et al., 2018). Moreover, we present more faithful feature attribution methods than the attention-based method used in previous studies. Our key contributions are:

在本研究中,我们提出了一种方法,能够生成与监督式最先进方法质量相当的解释,但无需使用证据范围标注。我们采用对抗鲁棒性训练策略来降低模型对无关特征的依赖,从而避免在解释中出现此类特征 (Tsipras et al., 2018) 。此外,我们提出的特征归因方法比以往研究中基于注意力机制的方法更具可信度。我们的主要贡献包括:

2 Related work

2 相关工作

Next, we present explainability approaches for automated medical coding and previous work on how adversarial robustness affects explainability.

接下来,我们将介绍自动化医疗编码的可解释性方法,以及对抗鲁棒性如何影响可解释性的现有研究。

2.1 Explainable automated medical coding

2.1 可解释的自动化医疗编码

Automated medical coding is a multi-label classification task that aims to predict a set of medical codes from $J$ classes based on a given medical document (Edin et al., 2023). In this context, the objective of explainable automated medical coding is to generate feature attribution scores for each of the $J$ classes. These scores quantify how much each input token influences each class’s prediction.

自动化医疗编码是一项多标签分类任务,旨在根据给定的医疗文档从$J$个类别中预测一组医疗代码 (Edin et al., 2023)。在此背景下,可解释的自动化医疗编码目标是为每个$J$类别生成特征归因分数。这些分数量化了每个输入token对各类别预测的影响程度。

Most studies in explainable automated medical coding use attention weights as feature attribution scores without comparing to other methods (Mullenbach et al., 2018; Teng et al., 2020; Dong et al., 2021; Feucht et al., 2021; Cheng et al., 2023). However, two studies suggest alternative feature attribution methods. Xu et al. (2019) propose a feature attribution method tailored to their one-layer CNN architecture but do not compare performance with other methods. Kim et al. (2022) train a linear medical coding model using knowledge distillation and use its weights as the explanation. However, their method does not improve over the explanations of the popular attention approach. Cheng et al. (2023) improve the plausibility of the attention weights of the final layer by training them to align with evidence span annotations. However, obtaining such annotations is costly.

大多数关于可解释自动化医疗编码的研究都使用注意力权重作为特征归因分数,而没有与其他方法进行比较 (Mullenbach et al., 2018; Teng et al., 2020; Dong et al., 2021; Feucht et al., 2021; Cheng et al., 2023)。然而,有两项研究提出了替代的特征归因方法。Xu等人 (2019) 提出了一种针对其单层CNN架构定制的特征归因方法,但未与其他方法的性能进行比较。Kim等人 (2022) 通过知识蒸馏训练了一个线性医疗编码模型,并将其权重作为解释依据。然而,他们的方法并未超越流行的注意力方法所提供的解释效果。Cheng等人 (2023) 通过训练最终层注意力权重与证据范围标注对齐,提升了其合理性。但获取此类标注的成本较高。

Previous work focused on plausibility using human ratings (Mullenbach et al., 2018; Teng et al., 2020; Kim et al., 2022), example inspection (Dong et al., 2021; Feucht et al., 2021; Liu et al., 2022), or evidence span overlap metrics (Xu et al., 2019; Cheng et al., 2023). Notably, no studies have assessed the faithfulness of the explanations nor compared the attention-based methods with other established methods.

先前的研究主要通过人工评分 (Mullenbach et al., 2018; Teng et al., 2020; Kim et al., 2022) 、案例检查 (Dong et al., 2021; Feucht et al., 2021; Liu et al., 2022) 或证据片段重叠指标 (Xu et al., 2019; Cheng et al., 2023) 来评估解释的合理性。值得注意的是,尚无研究评估这些解释的忠实度,也未将基于注意力机制的方法与其他成熟方法进行对比。

2.2 Adversarial robustness and explainability

2.2 对抗鲁棒性与可解释性

Adversarial robustness refers to the ability of a machine learning model to maintain performance under adversarial attacks, which involve making small changes to input data that do not significantly affect human perception or judgment (e.g., a small amount of image noise). Tsipras et al. (2018) and Ilyas et al. (2019) demonstrate that adversarial examples exploit the models’ dependence on fragile, non-robust features. Adversarial robustness training embeds invariances to prevent models from relying on such non-robust features, with regularization and data augmentation as the main strategies (Tsipras et al., 2018; Ros and Doshi-Velez, 2018).

对抗鲁棒性 (Adversarial robustness) 指机器学习模型在对抗攻击下保持性能的能力。对抗攻击通过对输入数据进行微小修改(例如少量图像噪声)来实现,这些修改不会显著影响人类感知或判断。Tsipras等人 (2018) 和Ilyas等人 (2019) 的研究表明,对抗样本利用了模型对脆弱、非鲁棒特征的依赖性。对抗鲁棒性训练通过嵌入不变性来防止模型依赖此类非鲁棒特征,主要策略包括正则化 (regularization) 和数据增强 (data augmentation) (Tsipras等人, 2018; Ros和Doshi-Velez, 2018)。

Previous work in image classification shows that adversarially robust models generate more plausible explanations (Ros and Doshi-Velez, 2018; Chen et al., 2019; Etmann et al., 2019). These studies demonstrate this phenomenon for three adversarial training strategies: 1) input gradient regularization, which improves the Lipschitzness of neural networks (Drucker and Le Cun, 1992; Ros and Doshi-Velez, 2018; Chen et al., 2019; Etmann et al., 2019; Rosca et al., 2020; Fel et al., 2022; Khan et al., 2023), 2) adversarial training, which trains models on adversarial examples, thereby embedding invariance to adversarial noise (Tsipras et al., 2018), and 3) feature masking, which masks unimportant features during training to embed invariance to such features (Bhalla et al., 2023).

先前在图像分类领域的研究表明,具有对抗鲁棒性的模型能生成更合理的解释 (Ros and Doshi-Velez, 2018; Chen et al., 2019; Etmann et al., 2019)。这些研究通过三种对抗训练策略验证了该现象:1) 输入梯度正则化 (input gradient regularization),通过提升神经网络的Lipschitz连续性实现 (Drucker and Le Cun, 1992; Ros and Doshi-Velez, 2018; Chen et al., 2019; Etmann et al., 2019; Rosca et al., 2020; Fel et al., 2022; Khan et al., 2023);2) 对抗训练 (adversarial training),通过在对抗样本上训练模型来嵌入对对抗噪声的不变性 (Tsipras et al., 2018);3) 特征掩蔽 (feature masking),通过训练时掩蔽不重要特征来嵌入对该类特征的不变性 (Bhalla et al., 2023)。

The relationship between model robustness and explanation plausibility in Natural Language Processing (NLP) remains unclear, primarily due to the fundamental differences between textual tokens and image pixels. To date, only two studies, Yoo and Qi (2021) and Li et al. (2023), have explored this relationship in depth. These studies suggest that robust models produce more faithful explanations compared to their non-robust counterparts. However, their conclusions may be questionable due to the use of the Area Over the Perturbation Curve (AOPC) metric, which is unsuitable for cross-model comparisons (Edin et al., 2024). This inappropriate application of AOPC may have led to potentially misleading conclusions.

自然语言处理(NLP)中模型鲁棒性与解释合理性之间的关系仍不明确,这主要源于文本token与图像像素之间的本质差异。截至目前,仅有Yoo和Qi (2021) 与Li等人 (2023) 两项研究深入探讨了这一关系。这些研究表明,与非鲁棒模型相比,鲁棒模型能产生更可信的解释。然而,由于使用了不适合跨模型比较的扰动曲线面积(AOPC)指标 (Edin等人, 2024),其结论可能存在问题。AOPC的不当应用可能导致具有误导性的结论。

3 Methods

3 方法

Here, we describe the adversarial robustness training strategies and feature attribution methods in the context of a prediction model for medical coding. The underlying automated medical coding model takes a sequence of tokens as input and outputs medical code probabilities (Section 4.2).

在此,我们描述了医疗编码预测模型中的对抗鲁棒性训练策略和特征归因方法。基础的自动化医疗编码模型以一系列token作为输入,并输出医疗代码的概率(第4.2节)。

3.1 Adversarial robustness training strategies

3.1 对抗鲁棒性训练策略

We implemented three adversarial training strategies, which we hypothesized could decrease our medical coding model’s reliance on irrelevant tokens: input gradient regularization, projected gradient descent, and token masking. We chose these strategies because they have been shown to improve plausibility in image classification and faithfulness in text classification (Li et al., 2023).

我们实施了三种对抗训练策略,假设这些策略可以减少医疗编码模型对无关token的依赖:输入梯度正则化、投影梯度下降和token掩码。选择这些策略是因为它们已被证明能提升图像分类的合理性和文本分类的忠实性 (Li et al., 2023)

Input gradient regularization (IGR) encourages the gradient of the output with respect to the input to be small. This aims to decrease the number of features on which the model relies, encouraging it to ignore irrelevant words (Drucker and Le Cun, 1992). We adapt IGR to text classification by adding to the task’s binary cross-entropy loss, $L_{\mathrm{BCE}}$, the $\ell^{2}$ norm of the gradient of $L_{\mathrm{BCE}}$ with respect to the input token embedding sequence $\pmb{X}\in\mathbb{R}^{N\times D}$. This yields the total loss,

输入梯度正则化 (IGR) 旨在减小输出相对于输入的梯度幅度。该方法通过减少模型依赖的特征数量,促使模型忽略无关词汇 (Drucker and Le Cun, 1992)。我们将IGR应用于文本分类任务,在二元交叉熵损失函数$L_{\mathrm{BCE}}$的基础上,增加$L_{\mathrm{BCE}}$对输入token嵌入序列$\pmb{X}\in\mathbb{R}^{N\times D}$梯度的$\ell^{2}$范数,得到总损失函数:

$$
L_{\mathrm{BCE}}(f(\pmb{X}),\pmb{y})+\lambda_{1}\left\|\nabla_{\pmb{X}}L_{\mathrm{BCE}}(f(\pmb{X}),\pmb{y})\right\|_{2},
$$

where $\pmb{y}\in\mathbb{R}^{J}$ is a binary target vector representing the $J$ medical codes, $\lambda_{1}$ is a hyperparameter, and $f:\mathbb{R}^{N\times D}\rightarrow\mathbb{R}^{J}$ is the classification model.

其中 $\pmb{y}\in\mathbb{R}^{J}$ 是表示 $J$ 个医疗编码的二元目标向量,$\lambda_{1}$ 是超参数,$f:\mathbb{R}^{N\times D}\rightarrow\mathbb{R}^{J}$ 为分类模型。
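The IGR loss can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a hypothetical single-label logistic model $p=\sigma(w\cdot x+b)$ on a flattened embedding, for which the input gradient of the BCE loss has the closed form $(p-y)\,w$. A real model would obtain this gradient through automatic differentiation. All names (`igr_loss`, `lam1`, the example vectors) are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    # Binary cross-entropy for one label; p is clipped for numerical safety.
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def igr_loss(x, y, w, b, lam1):
    """Return (BCE + lam1 * ||dBCE/dx||_2, ||dBCE/dx||_2) for p = sigmoid(w.x + b)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    # Closed-form input gradient for this toy model only: (p - y) * w.
    grad_x = [(p - y) * wi for wi in w]
    grad_norm = math.sqrt(sum(g * g for g in grad_x))
    return bce(p, y) + lam1 * grad_norm, grad_norm

x = [0.5, -1.0, 2.0]   # hypothetical flattened token embeddings
w = [0.3, 0.8, -0.5]   # hypothetical model weights
loss_plain, _ = igr_loss(x, 1.0, w, 0.0, lam1=0.0)
loss_reg, norm = igr_loss(x, 1.0, w, 0.0, lam1=0.1)
print(loss_plain, loss_reg, norm)
```

The penalty grows with the gradient norm, so minimizing the total loss pushes the model towards predictions that are less sensitive to small changes in the input embeddings.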

Projected gradient descent (PGD) increases model robustness by training with adversarial examples, thereby promoting invariance to such inputs (Madry, 2018). We hypothesized that PGD reduces the model’s reliance on irrelevant tokens, as adversarial examples often arise from the model’s use of such non-robust features (Tsipras et al., 2018). PGD aims to find the noise $\pmb{\delta}\in\mathbb{R}^{N\times D}$ that maximizes the loss $L_{\mathrm{BCE}}(f(X+\delta),y)$ while satisfying the constraint $\|\delta\|_{\infty}\leq\epsilon$, where $\epsilon$ is a hyperparameter. PGD was originally designed for image classification; we adapted it to NLP by adding the noise to the token embeddings $\boldsymbol{X}$. We implemented PGD as follows,

投影梯度下降 (PGD) 通过使用对抗样本进行训练来增强模型鲁棒性,从而提升模型对此类输入的不变性 (Madry, 2018)。我们假设PGD能降低模型对无关token的依赖,因为对抗样本往往源于模型对这类非鲁棒特征的使用 (Tsipras et al., 2018)。PGD旨在寻找噪声$\pmb{\delta}\in\mathbb{R}^{N\times D}$,在满足约束$\|\delta\|_{\infty}\leq\epsilon$的条件下最大化损失$L_{\mathrm{BCE}}(f(X+\delta),y)$,其中$\epsilon$为超参数。PGD最初为图像分类设计;我们通过将噪声添加到token嵌入$\boldsymbol{X}$来适配NLP任务。具体实现如下:

$$
\pmb{Z}^{*}=\operatorname*{arg\,max}_{\pmb{Z}}L_{\mathrm{BCE}}(f(\pmb{X}+\delta(\pmb{Z})),\pmb{y}),
$$

and enforced the constraint $\|\delta\|_{\infty}\leq\epsilon$ by parameterizing $\delta(Z)=\epsilon\tanh(Z)$ and optimizing $\pmb{Z}\in\mathbb{R}^{N\times D}$ directly. We initialized $Z$ with zeros. Finally, we tuned the model parameters using the following training objective:

并通过参数化 $\delta(Z)=\epsilon\operatorname{tanh}(Z)$ 并直接优化 $\pmb{Z}\in\mathbb{R}^{N\times D}$ 来强制约束 $|\delta|_{\infty}\leq\epsilon$。我们将 $Z$ 初始化为零。最后,使用以下训练目标调整模型参数:

$$
L_{\mathrm{BCE}}(f(X),y)+\lambda_{2}L_{\mathrm{BCE}}(f(X+\delta(Z^{*})),y),
$$

where $\lambda_{2}$ is a hyperparameter.

其中 $\lambda_{2}$ 为超参数。
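The tanh parameterization can be sketched on the same kind of toy problem. This is only an illustration under stated assumptions: a hypothetical logistic model with analytic gradients, plain gradient ascent on $Z$, and made-up values for `steps`, `lr`, `eps`, and `lam2`. The inner optimizer and step counts used in the paper are not given in this section.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def predict(x, w):
    # Hypothetical model: logistic regression on a flattened embedding.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def adversarial_noise(x, y, w, eps, steps=10, lr=0.5):
    """Gradient ascent on Z with delta(Z) = eps * tanh(Z), starting from Z = 0."""
    z = [0.0] * len(x)
    for _ in range(steps):
        delta = [eps * math.tanh(zi) for zi in z]
        p = predict([xi + di for xi, di in zip(x, delta)], w)
        # Chain rule: dL/dZ_i = (p - y) * w_i * eps * (1 - tanh(Z_i)^2).
        grad_z = [(p - y) * wi * eps * (1.0 - math.tanh(zi) ** 2)
                  for wi, zi in zip(w, z)]
        z = [zi + lr * gi for zi, gi in zip(z, grad_z)]
    # tanh is bounded by 1 in magnitude, so |delta_i| <= eps by construction.
    return [eps * math.tanh(zi) for zi in z]

def pgd_objective(x, y, w, eps, lam2):
    """Clean BCE plus lam2 times the BCE on the adversarially perturbed input."""
    delta = adversarial_noise(x, y, w, eps)
    clean = bce(predict(x, w), y)
    adv = bce(predict([xi + di for xi, di in zip(x, delta)], w), y)
    return clean + lam2 * adv, clean, adv, delta

x = [0.5, -1.0, 2.0]   # hypothetical flattened token embeddings
w = [0.3, 0.8, -0.5]   # hypothetical model weights
total, clean, adv, delta = pgd_objective(x, 1.0, w, eps=0.1, lam2=0.5)
print(total, clean, adv, delta)
```

Because the bound is built into $\delta(Z)$, no explicit projection step is needed after each update, which is the practical appeal of optimizing $Z$ instead of $\delta$.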

Token masking (TM) teaches the model to predict accurately while using as few features as possible, thereby encouraging the model to ignore irrelevant words. TM uses a binary mask to occlude unimportant tokens and trains the model to rely only on the remaining tokens (Bhalla et al., 2023; Tomar et al., 2023). Inspired by Bhalla et al. (2023), we employed a two-step teacher-student approach. We used two copies of the same model already trained on the automated medical coding task: a teacher $f_{t}$ with frozen model weights and a student $f_{s}$, which we fine-tuned. For each training batch, the first step was to learn a sparse mask $\hat{M}\in[0,1]^{N\times D}$ that still provided enough information to predict the correct codes by minimizing:

Token掩码 (TM) 训练模型在使用尽可能少特征的情况下仍能准确预测,从而促使模型忽略无关词汇。TM通过二元掩码遮蔽不重要token,训练模型仅依赖剩余token进行预测 (Bhalla et al., 2023; Tomar et al., 2023)。受Bhalla等人(2023)启发,我们采用两步师生训练法:使用已完成医疗自动编码任务训练的同一模型的两个副本——权重冻结的教师模型$f_{t}$和待微调的学生模型$f_{s}$。每个训练批次的第一步是学习稀疏掩码$\hat{M}\in[0,1]^{N\times D}$,该掩码需通过最小化损失函数保留足够信息来预测正确编码:

$$
\|\hat{M}\|_{1}+\beta\|f_{s}(\pmb{X})-f_{s}(x_{m}(\pmb{X},\hat{M}))\|_{1},
$$

where $\beta$ is a hyperparameter and $x_{m}:\mathbb{R}^{N\times D}\rightarrow\mathbb{R}^{N\times D}$ is the masking function:

其中$\beta$是一个超参数,$x_{m}:\mathbb{R}^{N\times D}\rightarrow\mathbb{R}^{N\times D}$是掩码函数:

$$
x_{m}(X,M)=B\odot(1-M)+X\odot M,
$$

where $\pmb{B}\in\mathbb{R}^{N\times D}$ is the baseline input. We chose $B$ as the token embedding representing the start token, followed by the mask token embedding repeated $N-2$ times, followed by the end token embedding. After optimization, we binarized the mask $M=\mathrm{round}(\hat{M})$, where around 90% of the features were masked. Finally, we tuned the model $f_{s}$ using the following training objective:

其中 $\pmb{B}\in\mathbb{R}^{N\times D}$ 是基线输入。我们选择 $B$ 作为表示起始token的token嵌入,随后重复 $N-2$ 次的掩码token嵌入,最后是结束token嵌入。优化后,我们对掩码进行二值化处理 $M=\mathrm{round}(\hat{M})$ ,其中约 90% 的特征被掩码。最终,我们使用以下训练目标微调模型 $f_{s}$:

$$
\|f_{s}(X)-f_{t}(X)\|_{1}+\lambda_{3}\|f_{s}(X)-f_{s}(x_{m}(X,M))\|_{1},
$$

where $\lambda_{3}$ is a hyperparameter.

其中 $\lambda_{3}$ 是一个超参数。
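The masking function and the binarization step are simple enough to sketch directly. The snippet below is an illustrative assumption rather than the authors' code: embeddings are plain nested lists, the embedding values are made up, and ties at exactly 0.5 are rounded up, which the paper does not specify.

```python
def build_baseline(start, mask, end, n_tokens):
    """Baseline B: start embedding, mask embedding repeated N-2 times, end embedding."""
    return [start] + [mask] * (n_tokens - 2) + [end]

def apply_mask(X, M, B):
    """x_m(X, M) = B * (1 - M) + X * M, applied element-wise over an N x D grid."""
    return [[b * (1.0 - m) + x * m for x, m, b in zip(x_row, m_row, b_row)]
            for x_row, m_row, b_row in zip(X, M, B)]

def binarize(M_hat):
    """M = round(M_hat); values of 0.5 or more become 1 (tie-breaking is an assumption)."""
    return [[1.0 if m >= 0.5 else 0.0 for m in row] for row in M_hat]

# Hypothetical 4-token document with 2-dimensional embeddings.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
B = build_baseline(start=[0.1, 0.1], mask=[0.0, 0.0], end=[0.9, 0.9], n_tokens=4)
M_hat = [[0.9, 0.8], [0.2, 0.1], [0.7, 0.6], [0.3, 0.4]]  # learned soft mask

M = binarize(M_hat)
masked = apply_mask(X, M, B)
print(masked)
```

Entries with $M=1$ keep the original embedding, and entries with $M=0$ are replaced by the baseline, so the student only sees the tokens that the mask considers important.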

3.2 Feature attribution methods

3.2 特征归因方法

We evaluated several feature attribution methods for automated medical coding, categorizing them into three types: attention-based, gradient-based, and perturbation-based (more details in Appendix B). Attention-based methods like Attention (Mullenbach et al., 2018), Attention Rollout (Abnar and Zuidema, 2020), and AttGrad (Serrano and Smith, 2019) rely on the model’s attention weights. Gradient-based methods such as InputXGrad (Sundararajan et al., 2017), Integrated Gradients (IntGrad) (Sundararajan et al., 2017), and Deeplift (Shrikumar et al., 2017) use backpropagation to quantify the influence of input features on outputs. Perturbation-based methods, including LIME (Ribeiro et al., 2016), KernelSHAP (Lundberg and Lee, 2017), and Occlusion@1 (Ribeiro et al., 2016), measure the impact on output confidence by occluding input features.

我们评估了多种用于自动化医疗编码的特征归因方法,将其分为三类:基于注意力 (attention-based) 、基于梯度 (gradient-based) 和基于扰动 (perturbation-based) 的方法(更多细节见附录 B)。基于注意力的方法如 Attention (Mullenbach et al., 2018)、Attention Rollout (Abnar and Zuidema, 2020) 和 AttGrad (Serrano and Smith, 2019) 依赖于模型的注意力权重。基于梯度的方法如 InputXGrad (Sundararajan et al., 2017)、Integrated Gradients (IntGrad) (Sundararajan et al., 2017) 和 Deeplift (Shrikumar et al., 2017) 通过反向传播量化输入特征对输出的影响。基于扰动的方法包括 LIME (Ribeiro et al., 2016)、KernelSHAP (Lundberg and Lee, 2017) 和 Occlusion $@1$ (Ribeiro et al., 2016),通过遮蔽输入特征来测量对输出置信度的影响。

Our investigation into feature attribution methods revealed an intriguing pattern: while individual methods often produced unreliable explanations, their shortcomings rarely overlapped. Attention-based methods and gradient-based approaches like InputXGrad frequently disagreed on which tokens were most important, yet both contributed valuable insights in different scenarios. This observation sparked a key question: could we leverage the complementary strengths of these methods to create a more robust attribution technique?

我们对特征归因方法的研究揭示了一个有趣的现象:虽然单个方法经常产生不可靠的解释,但它们的缺点很少重叠。基于注意力的方法和基于梯度的方法(如InputXGrad)经常在哪些token最重要的问题上存在分歧,然而这两种方法在不同场景下都提供了有价值的见解。这一观察引出了一个关键问题:我们能否利用这些方法的互补优势,创建一种更稳健的归因技术?

To address this question, we propose AttInGrad, a novel feature attribution method that combines Attention and InputXGrad. AttInGrad multiplies their respective attribution scores, aiming to amplify the importance of tokens deemed relevant by both methods while down-weighting those highlighted by only one or neither method.

为解决这一问题,我们提出了AttInGrad这一新颖的特征归因方法,它结合了注意力机制(Attention)和输入梯度(InputXGrad)。AttInGrad将两者的归因分数相乘,旨在放大被两种方法同时认为相关的token的重要性,同时降低仅被其中一种或两种方法均未强调的token的权重。

We formalize the AttInGrad attribution scores for class $j$ using the following equation:

我们使用以下方程形式化类别 $j$ 的AttInGrad归因分数:

$$
\left[\begin{array}{c}{A_{j1}\cdot\left\|X_{1}\odot\frac{\partial f_{j}}{\partial X_{1}}(X)\right\|_{2}}\\ {\vdots}\\ {A_{jN}\cdot\left\|X_{N}\odot\frac{\partial f_{j}}{\partial X_{N}}(X)\right\|_{2}}\end{array}\right],
$$

where $\pmb{A}\in\mathbb{R}^{J\times N}$ is the attention matrix, $\odot$ is the element-wise multiplication operation, $N$ is the number of tokens in a document, and $J$ is the number of classes.

其中 $\pmb{A}\in\mathbb{R}^{J\times N}$ 是注意力矩阵,$\odot$ 表示逐元素矩阵乘法运算,$N$ 是文档中的 token 数量,$J$ 是类别数。
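As a minimal sketch of the formula above, the function below computes AttInGrad scores for one class from quantities that are assumed to be already available: the attention row $A_{j}$, the token embeddings $X_{n}$, and the gradients $\partial f_{j}/\partial X_{n}$. How these are extracted from the model is outside the sketch, and the numbers are made up for illustration.

```python
import math

def attingrad(attention_row, embeddings, gradients):
    """Per-token score A_jn * ||X_n * df_j/dX_n||_2 for a single class j."""
    scores = []
    for a, x_n, g_n in zip(attention_row, embeddings, gradients):
        input_x_grad = [x * g for x, g in zip(x_n, g_n)]  # element-wise product
        scores.append(a * math.sqrt(sum(v * v for v in input_x_grad)))
    return scores

# Hypothetical 3-token document with 2-dimensional embeddings.
attention_row = [0.6, 0.3, 0.1]
embeddings = [[1.0, 0.0], [3.0, 4.0], [0.0, 0.5]]
gradients = [[0.1, 0.1], [1.0, 1.0], [2.0, 2.0]]

scores = attingrad(attention_row, embeddings, gradients)
print(scores)
```

In this toy example the first token has the highest attention weight but a small InputXGrad norm, so it ends up ranked below the second token, which both methods consider relevant.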

In Section 5, we will provide an in-depth analysis of the mechanisms underlying AttInGrad’s effectiveness, shedding light on why this combination of methods yields improved feature attributions.

在第5节中,我们将深入分析AttInGrad有效性的机制,阐明为何这种方法的组合能产生更好的特征归因效果。

Table 1: The two data splits used in this paper. MIMIC-III full comprises discharge summaries annotated with ICD-9 codes. MDACE comprises discharge summaries annotated with ICD-9 codes and evidence spans.

| Split | Train | Val | Test |
| --- | --- | --- | --- |
| MIMIC-III full | 47,719 | 1,631 | 3,372 |
| MDACE | 181 | 60 | 61 |

表 1: 本文使用的两种数据划分。MIMIC-III full 包含标注了 ICD-9 编码的出院小结,MDACE 包含标注了 ICD-9 编码和证据范围的出院小结。

4 Experimental setup

4 实验设置

In the following, we present our datasets, models, and evaluation metrics.

以下介绍我们的数据集、模型和评估指标。

4.1 Data

4.1 数据

We conducted our experiments using the open-access MIMIC-III and the newly released MDACE dataset (Johnson et al., 2016; Cheng et al., 2023). MIMIC-III includes 52,722 discharge summaries from the Beth Israel Deaconess Medical Center’s ICU, collected between 2001 and 2012 and annotated with ICD-9 codes. MDACE comprises 302 reannotated MIMIC-III cases, adding evidence spans to indicate the textual justification for each medical code. Not all possible evidence spans are annotated; for example, if hypertension is mentioned multiple times, only the first mention might be annotated, leaving subsequent mentions unannotated. We focused exclusively on discharge summaries, as in most previous medical coding studies on MIMIC-III (Teng et al., 2022). Statistics are in Table 1.

我们使用公开的MIMIC-III和新发布的MDACE数据集 (Johnson et al., 2016; Cheng et al., 2023) 进行了实验。MIMIC-III 包含贝斯以色列女执事医疗中心ICU在2001至2012年间收集的52,722份出院小结,并标注了ICD-9编码。MDACE包含302个重新标注的MIMIC-III病例,新增了证据片段以标明每个医疗编码的文本依据。并非所有可能的证据片段都被标注,例如若高血压被多次提及,可能仅标注首次提及而忽略后续记录。与多数既往针对MIMIC-III的医疗编码研究 (Teng et al., 2022) 一致,我们仅聚焦于出院小结。统计数据见表1。

For dataset splits, we used MIMIC-III full, a popular split by Mullenbach et al. (2018), and MDACE, introduced by Cheng et al. (2023) for training and evaluating explanation methods. All MDACE examples are from the MIMIC-III full test set, which we excluded from this test set when using MDACE in our training data.

在数据集划分方面,我们采用了MIMIC-III完整版(Mullenbach等人于2018年提出的常用划分方案)以及Cheng等人(2023年)提出的MDACE数据集,用于训练和评估解释方法。所有MDACE样本均来自MIMIC-III完整版测试集,因此当我们将MDACE纳入训练数据时,已将其从该测试集中剔除。

4.2 Models

4.2 模型

We used PLM-ICD, a state-of-the-art automated medical coding model architecture, for our experiments because its architecture is simple while outperforming other models according to Huang et al. (2022) and Edin et al. (2023). To address stability issues caused by numerical overflow in the decoder of the original model, we replaced the label-wise attention mechanism with standard cross-attention (Vaswani et al., 2017). This adjustment not only stabilized training but also slightly improved performance. We provide further details on the architecture modifications in Appendix A.

我们采用当前最先进的自动化医疗编码模型架构PLM-ICD进行实验,因为根据Huang等人 (2022) 和Edin等人 (2023) 的研究,该架构在保持简洁性的同时性能优于其他模型。为解决原始模型解码器中数值溢出导致的稳定性问题,我们将标签注意力机制替换为标准交叉注意力机制 (Vaswani等人, 2017) 。这一调整不仅稳定了训练过程,还略微提升了性能。架构修改的更多细节详见附录A。

We compared five models: $\mathrm{B_{U}}$, $\mathrm{B_{S}}$, IGR, PGD, and TM. All models used our modified PLM-ICD architecture but were trained differently. $\mathrm{B_{U}}$ was trained unsupervised with binary cross-entropy, whereas $\mathrm{B_{S}}$ employed a supervised auxiliary training objective that minimized the KL divergence between the model’s cross-attention weights and annotated evidence spans, as per Cheng et al. (2023). IGR, PGD, and TM were trained as described in Section 3.1. The best hyperparameters are in Appendix D.

我们比较了五种模型:$\mathrm{B_{U}}$、$\mathrm{B_{S}}$、IGR、PGD和TM。所有模型均采用我们改进的PLM-ICD架构,但训练方式不同。$\mathrm{B_{U}}$采用无监督训练方式,使用二元交叉熵损失函数;而$\mathrm{B_{S}}$则按照Cheng等人(2023)的方法,采用有监督辅助训练目标,最小化模型交叉注意力权重与标注证据片段之间的KL散度。IGR、PGD和TM的训练方法如第3.1节所述。最佳超参数见附录D。

4.3 Experiments

4.3 实验

We trained all five models with ten seeds on the MIMIC-III full and MDACE training set. The supervised training strategy $\mathrm{B}_{\mathrm{S}}$ used the evidence span annotations, while the others only used the medical code annotations. For each model, we evaluated the plausibility and faithfulness of the explanations generated by every explanation method.

我们在MIMIC-III完整数据集和MDACE训练集上,用十组随机种子训练了全部五个模型。监督训练策略 $\mathrm{B}_{\mathrm{S}}$ 使用了证据范围标注,其他策略仅使用医疗代码标注。针对每个模型,我们评估了各解释方法生成解释的合理性与可信度。

We aimed to demonstrate a similar explanation quality as a supervised approach but without training on evidence spans. Therefore, after evaluating the models and explanation methods, we compared our best combination with the supervised strategy proposed by Cheng et al. (2023), who used the $\mathrm{B_{S}}$ model and the Attention explanation method. We also compared our best combination with the unsupervised strategy used by most previous works (see Section 2.1), comprising the $\mathrm{B_{U}}$ model and the Attention explanation method.

我们旨在展示与监督方法相当的解释质量,但无需对证据片段进行训练。因此,在评估模型和解释方法后,我们将最佳组合与Cheng等人 (2023) 提出的监督策略(使用$\mathrm{B_{S}}$模型和Attention解释方法)进行了对比。同时,我们还将其与多数先前研究采用的无监督策略(见第2.1节)进行了比较,该策略由$\mathrm{B_{U}}$模型和Attention解释方法构成。

4.4 Evaluation metrics

4.4 评估指标

We measured the explanation quality using metrics estimating plausibility and faithfulness. Plausibility measures how convincing an explanation is to human users, while faithfulness measures how accurately an explanation reflects a model’s true reasoning process (Jacovi and Goldberg, 2020).

我们采用评估合理性和忠实性的指标来衡量解释质量。合理性衡量解释对人类用户的信服程度,而忠实性衡量解释反映模型真实推理过程的准确程度 (Jacovi and Goldberg, 2020)。

Plausibility metrics
Our plausibility metrics measured the overlap between explanations and annotated evidence spans. We assumed that a high overlap indicated plausible explanations for medical coders. We identified the most important tokens using feature attribution scores, applying a decision boundary for classification metrics, and selecting the top $K$ scores for ranking metrics.

合理性指标
我们的合理性指标衡量了解释与标注证据片段之间的重叠程度。我们假设高度重叠意味着对医疗编码员而言解释是合理的。我们通过特征归因分数识别最重要的token,为分类指标设定决策边界,并为排序指标选择前 $K$ 个最高分。

For classification metrics, we used Precision (P), Recall (R), and F1 scores, selecting the decision boundary that yielded the highest F1 score on the validation set (Cheng et al., 2023). Additionally, we included four more classification metrics: Empty explanation rate (Empty), Evidence span recall (SpanR), Evidence span cover (Cover), and Area Under the Precision-Recall Curve (AUPRC). Empty measures the rate of empty explanations when all attribution scores in an example are below the decision boundary. SpanR measures the percentage of annotated evidence spans where at least one token is classified correctly. Cover measures the percentage of tokens in an annotated evidence span that are classified correctly, given that at least one token is predicted correctly. AUPRC represents the area under the precision-recall curve generated by varying the decision boundary from zero to one.

在分类指标方面,我们采用了精确率 (P)、召回率 (R) 和 F1分数,并选择在验证集上获得最高F1分数的决策边界 (Cheng et al., 2023)。此外,我们还引入了四项分类指标:空解释率 (Empty)、证据片段召回率 (SpanR)、证据片段覆盖率 (Cover) 以及精确率-召回率曲线下面积 (AUPRC)。空解释率衡量当示例中所有归因分数均低于决策边界时空解释的比例。证据片段召回率统计至少有一个token被正确分类的标注证据片段百分比。证据片段覆盖率计算在至少有一个token被正确预测的前提下,标注证据片段中被正确分类的token占比。AUPRC表示决策边界从0到1变化时生成的精确率-召回率曲线下面积。
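The classification metrics can be sketched for a single document and code as follows. This is a hedged illustration, not the evaluation code used in the paper: evidence spans are represented as sets of token indices, Cover is averaged over the spans that were hit (the aggregation is an assumption), and the scores and boundary are made up.

```python
def classification_metrics(scores, spans, boundary):
    """Token-level P/R/F1 plus Empty, SpanR, and Cover for one (document, code) pair."""
    predicted = {i for i, s in enumerate(scores) if s >= boundary}
    evidence = set().union(*spans) if spans else set()
    tp = len(predicted & evidence)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(evidence) if evidence else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    # Spans with at least one correctly classified token.
    hit_spans = [s for s in spans if predicted & s]
    span_recall = len(hit_spans) / len(spans) if spans else 0.0
    # Assumption: Cover is the mean covered fraction over the hit spans.
    cover = (sum(len(predicted & s) / len(s) for s in hit_spans) / len(hit_spans)
             if hit_spans else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1,
            "empty": not predicted, "span_recall": span_recall, "cover": cover}

scores = [0.9, 0.1, 0.8, 0.05, 0.6, 0.2]   # hypothetical attribution scores
spans = [{0, 1}, {4, 5}, {3}]              # hypothetical annotated evidence spans
metrics = classification_metrics(scores, spans, boundary=0.5)
print(metrics)
```

Averaging the `empty` flag over all examples would give the Empty rate, and sweeping `boundary` from zero to one would trace the precision-recall curve behind AUPRC.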

For ranking metrics, we selected the top $K$ tokens with the highest attribution scores, using Recall@K, Precision@K, and Intersection-over-Union (IOU) (DeYoung et al., 2020).

在排序指标方面,我们选取了归因分数最高的前 $K$ 个token,采用召回率@K、精确率@K 以及交并比(IOU) (DeYoung et al., 2020)作为评估标准。
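A minimal sketch of the ranking metrics follows. It assumes token-level sets: the IOU here is computed between the top-$K$ token set and the evidence token set, which may differ from the exact span-level definition used in the paper, and the example values are hypothetical.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def ranking_metrics(scores, evidence, k):
    """Recall@K, Precision@K, and token-set IOU; evidence must be non-empty."""
    top = top_k(scores, k)
    hits = len(top & evidence)
    return {
        "recall@k": hits / len(evidence),
        "precision@k": hits / k,
        "iou": hits / len(top | evidence),
    }

scores = [0.9, 0.1, 0.8, 0.05, 0.6, 0.2]   # hypothetical attribution scores
evidence = {0, 1, 4}                       # hypothetical evidence token indices
ranking = ranking_metrics(scores, evidence, k=3)
print(ranking)
```

Unlike the classification metrics, these need no decision boundary, so they are unaffected by how attribution scores are scaled.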

Faithfulness metrics
We use two metrics to approximate faithfulness: Sufficiency and Comprehensiveness (DeYoung et al., 2020); more details are in Appendix C. Faithful explanations yield high Comprehensiveness and low Sufficiency scores. A high Sufficiency score indicates that many important tokens are incorrectly assigned low attribution scores, while a low Comprehensiveness score suggests that many non-important tokens are incorrectly assigned high attribution scores.

忠实度指标
我们采用两个指标来近似衡量忠实度:充分性(Sufficiency)和全面性(Comprehensiveness) (DeYoung et al., 2020);更多细节见附录C。忠实的解释会呈现高全面性和低充分性分数。高充分性分数表明许多重要token被错误分配了低归因分数,而低全面性分数则意味着许多非重要token被错误分配了高归因分数。
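The two faithfulness metrics can be illustrated with the definitions from DeYoung et al. (2020): Comprehensiveness is the drop in output when the top-ranked tokens are removed, and Sufficiency is the drop when only the top-ranked tokens are kept. The sketch below assumes a hypothetical bag-of-tokens classifier and drops tokens outright. The exact procedure in Appendix C is not shown in this section and may differ.

```python
import math

def toy_model(tokens, weights):
    """Hypothetical classifier: sigmoid of the summed per-token weights."""
    return 1.0 / (1.0 + math.exp(-sum(weights.get(t, 0.0) for t in tokens)))

def faithfulness(tokens, scores, weights, k):
    """Return (comprehensiveness, sufficiency) for the top-k attributed tokens."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    top = set(order[:k])
    full = toy_model(tokens, weights)
    without_top = toy_model([t for i, t in enumerate(tokens) if i not in top], weights)
    only_top = toy_model([t for i, t in enumerate(tokens) if i in top], weights)
    return full - without_top, full - only_top

tokens = ["patient", "has", "hypertension", "today"]
weights = {"hypertension": 3.0, "has": 0.1}   # made-up model weights

# An explanation that ranks the decisive token first, and one that does not.
comp_good, suff_good = faithfulness(tokens, [0.1, 0.2, 0.9, 0.0], weights, k=1)
comp_bad, suff_bad = faithfulness(tokens, [0.9, 0.0, 0.1, 0.2], weights, k=1)
print(comp_good, suff_good, comp_bad, suff_bad)
```

The explanation that highlights "hypertension" obtains the higher Comprehensiveness and the lower Sufficiency, matching the interpretation given in the text.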

5 Results

5 结果

Next, we present experimental results for the different training strategies and explainability methods.

接下来,我们展示不同训练策略和可解释性方法的实验结果。

Rivaling supervised methods in explanation quality
The objective of this paper was to produce high-quality explanations without relying on evidence span annotations. In Figure 2, we compare the plausibility of our approach (Token masking and AttInGrad) with the unsupervised approach ($\mathrm{B_{U}}$ and Attention) and the supervised state-of-the-art approach ($\mathrm{B_{S}}$ and Attention). Our approach was substantially more plausible than the unsupervised approach on all metrics. Compared with the supervised approach, our approach achieved similar F1 and Recall@5 and substantially better Empty scores.

在解释质量上媲美监督方法
本文的目标是在不依赖证据范围标注的情况下生成高质量解释。在图 2 中,我们将本方法 (Token masking 和 AttInGrad) 与无监督方法 ($\mathrm{B_{U}}$ 和 Attention) 及监督的先进方法 ($\mathrm{B_{S}}$ 和 Attention) 的合理性进行对比。在所有指标上,本方法都显著优于无监督方法。与监督方法相比,本方法在 F1 和召回率@5 上表现相近,且空解释得分显著更优。


Figure 2: Comparison of plausibility across various combinations of explanation methods and models from this study and previous work. Most previous studies used Attention and a standard medical coding model $(\mathrm{B_{U}})$. Cheng et al. (2023) instead used a supervised model trained on evidence-span annotations $(\mathrm{B_{S}})$. We proposed AttInGrad and an adversarially robust model (TM).

图 2: 本研究与先前工作中不同解释方法和模型组合的可信度对比。多数先前研究采用Attention机制和标准医疗编码模型 $(\mathrm{B_{U}})$ 。Cheng等人 (2023) 则使用了基于证据跨度标注训练的监督模型 $(\mathbf{B}_{\mathrm{S}})$ 。我们提出了AttInGrad方法及对抗鲁棒模型 (TM) 。

The supervised approach achieved similar plausibility to ours on most metrics (see Table 2). Our approach also achieved the highest comprehensiveness and lowest sufficiency scores (see Figure 3). The difference was larger in the sufficiency scores, where the supervised score was twice as high as ours.

监督方法在大多数指标上取得了与我们相似的合理性(见表2)。我们的方法还获得了最高的全面性和最低的充分性得分(见图3)。在充分性得分上的差异更大,监督方法的得分是我们的两倍。

Adversarial robustness improves plausibility We evaluated the explanation plausibility of every combination of model and explanation method in Table 2. IGR and TM outperformed the baseline model $\mathrm{B_{U}}$ on most metrics and explanation methods. In Appendix E.5, we compare the unsupervised models on a bigger test set and see similar results. The supervised model $\mathrm{B_{S}}$ yielded better results for attention-based explanations but was weaker than the robust models when using the gradient-based explanation methods: InputXGrad, IntGrad, and Deeplift.

对抗鲁棒性提升解释合理性
我们评估了表2中每种模型与解释方法组合的解释合理性。IGR和TM在大多数指标和解释方法上优于基线模型$\mathrm{B_{U}}$。附录E.5中,我们在更大测试集上对比无监督模型,结果相似。监督模型$\mathrm{B_{S}}$在基于注意力的解释中表现更好,但使用基于梯度的解释方法( InputXGrad、IntGrad和Deeplift )时弱于鲁棒模型。

AttInGrad is more plausible and faithful than Attention AttInGrad was more plausible than all other explanation methods across all training strategies and metrics. Notably, the plausibility improvements were substantial, with relative gains exceeding ten percent on most metrics (see Table 2 and Appendix E.2). For instance, for $\mathrm{B_{U}}$, AttInGrad reduced the Empty metric from $21.1\%$ to $3.0\%$ and improved the Cover metric from $63.4\%$ to $74.9\%$. However, these enhancements were less pronounced for $\mathrm{B_{S}}$, the supervised model.

AttInGrad 比 Attention 更合理、更忠实
AttInGrad在所有训练策略和评估指标上都比其他解释方法更具合理性。值得注意的是,其合理性提升尤为显著,大多数指标的相对增益超过10% (见表2和附录E.2)。例如对于$\mathrm{B_{U}}$,AttInGrad将Empty指标从$21.1\%$降至$3.0\%$,同时将Cover指标从$63.4\%$提升至$74.9\%$。不过这些改进在监督模型$\mathrm{B_{S}}$上表现相对较弱。

AttInGrad was also more faithful than Attention (see Figure 3). However, while AttInGrad surpassed the gradient-based methods in comprehensiveness, its sufficiency scores were slightly worse.

AttInGrad也比Attention更忠实(见图3)。然而,虽然AttInGrad在全面性上超越了基于梯度的方法,但其充分性分数略逊一筹。

Analysis of attention-based explanations While AttInGrad and Attention were more plausible than the gradient-based explanations, they had a three-fold higher inter-seed variance. We found that they often attributed high importance to tokens devoid of alphanumeric characters, such as "Ġ[", "*", and "Ċ", which we classify as special tokens. These special tokens, such as punctuation and byte-pair encoding artifacts, rarely carry semantic meaning. In the MDACE test set, they accounted for $32.2\%$ of all tokens, compared to just $5.8\%$ within the annotated evidence spans, suggesting they are unlikely to be relevant evidence.

基于注意力的解释分析
虽然AttInGrad和注意力机制的解释比基于梯度的解释更合理,但它们的种子间方差高出三倍。我们发现,这些方法经常将高重要性归因于不包含字母数字字符的token,例如 "Ġ["、"*" 和 "Ċ",我们将其归类为特殊token。这些特殊token(如标点符号和字节对编码产物)很少具有语义含义。在MDACE测试集中,它们占所有token的 $32.2\%$ ,而在标注证据范围内仅占 $5.8\%$ ,这表明它们不太可能是相关证据。

In Figure 4, we analyze the relationship between explanation quality (y-axis) and the proportion of the top five most important tokens that are special tokens (x-axis). Each data point represents the average statistics across the MDACE test set for one seed/run of the $\mathrm{B_{U}}$ model. Figures 4a and 4b show F1 (plausibility) and comprehensiveness (faithfulness) respectively. For Attention and AttInGrad, we see strong negative correlations for both metrics with a large inter-seed variance. The regression lines fitted on Attention and AttInGrad overlap, with the data points from AttInGrad shifted slightly towards the upper left, indicating attribution of less importance to special tokens.

在图4中,我们分析了解释质量(y轴)与特殊token在前五大重要token中占比(x轴)的关系。每个数据点代表$\mathrm{B_{U}}$模型在MDACE测试集上单次运行的统计均值。图4a和图4b分别展示了F1(合理性)和全面性(忠实度)指标。对于Attention和AttInGrad方法,两种指标均呈现显著的负相关性,且不同随机种子间差异较大。Attention与AttInGrad的回归线基本重合,但AttInGrad的数据点略微向左上方偏移,表明其对特殊token的重要性分配更低。
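The quantity on the x-axis of Figure 4 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `is_special`, `special_fraction_in_top_k`, and `pearson_r` are hypothetical, and "alphanumeric" is assumed to mean ASCII letters and digits, so that byte-pair encoding markers such as "Ġ" count as non-alphanumeric.

```python
import math


def is_special(token: str) -> bool:
    """Return True if the token has no ASCII letter or digit (assumed definition)."""
    return not any(ch.isascii() and ch.isalnum() for ch in token)


def special_fraction_in_top_k(tokens, scores, k=5):
    """Fraction of the k highest-scoring tokens that are special tokens."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    if not top:
        return 0.0
    return sum(is_special(tokens[i]) for i in top) / len(top)


def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

In the setting of Figure 4, each seed would contribute one pair (average special-token fraction, average F1 or comprehensiveness), and `pearson_r` would be applied to those pairs per explanation method.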

Conversely, for InputXGrad, we see a moderate negative correlation for the F1 score and no correlation for comprehensiveness. Furthermore, InputXGrad demonstrates a small inter-seed variance, where the proportion of special tokens more closely mirrors that observed in the evidence spans.

相反,对于InputXGrad,我们发现F1分数存在中等程度的负相关,而全面性则无相关性。此外,InputXGrad表现出较小的种子间差异,其特殊token的比例更接近证据跨度中观察到的比例。

We hypothesized that AttInGrad’s improvements over Attention stem from InputXGrad reducing special tokens’ attribution scores. We tested this by zeroing out these tokens’ scores. While it substantially enhanced Attention’s F1 score, Attention remained lower than AttInGrad (see Table 3). If AttInGrad’s sole contribution were filtering special tokens, we would expect similar F1 scores after zeroing their attributions. The fact that AttInGrad still outperforms Attention after controlling for special tokens suggests that there are additional factors beyond special token filtering contributing to AttInGrad’s improved performance.

我们假设AttInGrad相对于Attention的改进源于InputXGrad降低了特殊token (special tokens) 的归因分数。通过将这些token的分数归零进行测试,虽然Attention的F1分数显著提升,但仍低于AttInGrad (见表3) 。如果AttInGrad的唯一贡献是过滤特殊token,那么在归零其属性后两者的F1分数应该相近。控制特殊token变量后AttInGrad仍优于Attention的事实表明,除了特殊token过滤外,还有其他因素促成了AttInGrad的性能提升。
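The control experiment described above amounts to a small post-processing step on the attribution scores. The sketch below is an assumption about how it could be done, not the authors' code; `zero_special_tokens` is a hypothetical name, and special tokens are again assumed to be tokens without ASCII letters or digits.

```python
def is_special(token: str) -> bool:
    """Return True if the token has no ASCII letter or digit (assumed definition)."""
    return not any(ch.isascii() and ch.isalnum() for ch in token)


def zero_special_tokens(tokens, scores):
    """Set the attribution score of every special token to zero."""
    return [0.0 if is_special(tok) else score for tok, score in zip(tokens, scores)]
```

Applying this to the scores of each explanation method before thresholding and recomputing F1 would give a before/after comparison of the kind reported in Table 3.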

Table 2: Plausibility of Attention, InputXGrad, Integrated Gradients (IntGrad), Deeplift, and AttInGrad on the MDACE test set. Each experiment was run with ten different seeds. We report the mean across seeds $\pm$ the standard deviation. All scores are presented as percentages. Bold numbers outperform the unsupervised baseline model, while underlined numbers outperform the supervised model. We included more feature attribution methods in Appendix E.2.

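| Explainer | Model | P ↑ | R ↑ | F1 ↑ | AUPRC ↑ | Empty ↓ | SpanR ↑ | Cover ↑ | IOU ↑ | P@5 ↑ | R@5 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Attention | $\mathrm{B_{S}}$ | 41.7±5.0 | 42.9±3.2 | 42.2±3.7 | 37.3±3.9 | 13.5±1.9 | 60.5±3.7 | 72.3±3.5 | 46.1±4.5 | 32.3±2.6 | 45.3±3.7 |
| | $\mathrm{B_{U}}$ | 35.9±5.9 | 37.7±5.7 | 36.6±5.3 | 31.4±6.4 | 21.1±3.6 | 53.2±7.4 | 63.4±6.8 | 39.3±6.9 | 28.8±4.2 | 40.3±5.9 |
| | IGR | 35.5±6.2 | 40.4±6.1 | 37.7±5.9 | 32.4±6.8 | 19.4±2.1 | 55.5±8.0 | 64.5±6.2 | 40.0±7.4 | 29.2±4.5 | 40.9±6.3 |
| | TM | 36.5±5.9 | 37.8±6.1 | 37.0±5.5 | 31.7±6.7 | 21.7±3.3 | 53.3±7.7 | 63.3±6.9 | 40.1±7.3 | 29.1±4.4 | 40.8±6.2 |
| | PGD | 33.0±8.4 | 38.4±8.0 | 35.4±8.3 | 30.0±9.1 | 19.1±1.6 | 51.5±13.6 | 61.1±9.9 | 35.9±10.8 | 27.7±6.3 | 38.9±8.8 |
| InputXGrad | $\mathrm{B_{S}}$ | 30.7±2.0 | 33.7±2.6 | 32.0±1.4 | 24.9±2.0 | 7.3±1.3 | 48.5±2.8 | 64.0±2.8 | 32.2±2.1 | 26.6±1.2 | 37.3±1.6 |
| | $\mathrm{B_{U}}$ | 31.0±2.6 | 32.7±3.2 | 31.6±1.2 | 24.8±1.9 | 9.5±2.6 | 46.4±3.6 | 62.3±3.5 | 31.7±1.8 | 26.3±0.9 | 36.9±1.3 |
| | IGR | 32.4±2.5 | 33.8±2.2 | 33.0±0.7 | 27.0±0.7 | 9.7±2.0 | 48.5±2.6 | 64.6±2.5 | 32.8±1.0 | 27.4±0.4 | 38.4±0.5 |
| | TM | 32.4±2.7 | 34.1±2.1 | 33.1±1.4 | 26.2±2.3 | 8.6±1.6 | 48.5±2.4 | 64.8±2.5 | 32.6±1.5 | 27.4±1.1 | 38.3±1.5 |
| | PGD | 30.1±2.2 | 32.9±1.7 | 31.4±1.4 | 24.8±1.5 | 9.3±2.2 | 46.6±2.1 | 62.6±2.6 | 31.4±1.3 | 26.1±1.0 | 36.6±1.4 |
| IntGrad | $\mathrm{B_{S}}$ | 30.9±3.2 | 33.9±2.0 | 32.2±1.7 | 26.2±2.6 | 4.8±2.0 | 50.8±2.4 | 65.9±3.3 | 34.2±2.8 | 27.4±1.6 | 38.4±2.2 |
| | $\mathrm{B_{U}}$ | 31.3±4.3 | 32.8±2.7 | 31.9±3.2 | 26.0±4.3 | 5.2±1.0 | 49.4±3.0 | 64.4±3.1 | 33.9±4.2 | 26.5±2.5 | 37.1±3.6 |
| | IGR | 32.6±2.2 | 33.8±2.3 | 33.1±1.3 | 26.6±2.0 | 5.2±1.7 | 49.7±2.5 | 64.7±3.3 | 34.9±2.3 | 27.6±0.9 | 38.7±1.2 |
| | TM | 33.1±3.6 | 34.0±3.7 | 33.4±2.8 | 27.5±3.7 | 5.4±1.6 | 51.0±3.9 | 65.5±4.1 | 35.3±4.2 | 27.6±2.1 | 38.7±3.0 |
| | PGD | 30.0±5.4 | 33.5±3.3 | 31.4±4.2 | 25.4±4.8 | 5.1±1.8 | 49.7±3.6 | 64.5±3.9 | 33.6±4.8 | 26.4±3.0 | 36.9±4.2 |
| Deeplift | $\mathrm{B_{S}}$ | 29.2±2.3 | 34.2±3.0 | 31.3±1.2 | 24.1±1.8 | 6.4±1.8 | 48.9±3.4 | 65.2±3.3 | 31.1±1.9 | 26.2±1.1 | 36.7±1.5 |
| | $\mathrm{B_{U}}$ | 31.2±1.9 | 31.4±2.4 | 31.2±1.5 | 24.1±2.1 | 9.1±1.6 | 45.1±2.5 | 61.7±2.9 | 30.9±1.6 | 25.8±1.1 | 36.1±1.5 |
| | IGR | 31.0±1.8 | 33.7±1.4 | 32.2±0.6 | 25.9±0.9 | 8.6±1.0 | 48.3±1.5 | 64.8±1.7 | 31.6±1.0 | 26.7±0.5 | 37.4±0.7 |
| | TM | 30.2±2.4 | 34.8±1.4 | 32.2±1.4 | 25.1±2.5 | 6.8±1.5 | 49.1±1.8 | 65.7±1.6 | 31.4±1.8 | 26.6±1.1 | 37.3±1.6 |
| | PGD | 29.4±2.8 | 32.8±1.5 | 30.9±1.5 | 23.9±1.5 | 8.1±1.8 | 46.7±1.8 | 63.3±2.2 | 30.7±1.4 | 25.5±0.9 | 35.8±1.2 |
| AttInGrad | $\mathrm{B_{S}}$ | 40.6±2.2 | 45.9±3.1 | 43.0±1.9 | 38.8±2.1 | 1.2±0.5 | 63.9±2.9 | 79.0±1.9 | 45.7±2.6 | 33.3±1.4 | 46.6±1.9 |
| | $\mathrm{B_{U}}$ | 40.7±2.9 | 42.5±4.4 | 41.5±3.2 | 37.1±3.6 | 3.0±1.3 | 59.3±5.4 | 74.9±5.2 | 43.5±4.0 | 32.0±2.1 | 44.9±3.0 |
| | IGR | 38.8±4.5 | 44.0±4.0 | 41.2±4.0 | 37.2±4.5 | 2.3±0.7 | 60.3±5.3 | 74.5±4.6 | 43.3±4.9 | 31.9±2.7 | 44.7±3.8 |
| | TM | 40.2±3.0 | 43.9±4.8 | 41.9±3.4 | 37.8±3.9 | 2.8±1.6 | 60.5±5.8 | 75.9±5.3 | 44.0±3.9 | 32.5±2.3 | 45.5±3.2 |
| | PGD | 37.1±8.2 | 42.5±5.7 | 39.3±7.1 | 34.7±8.6 | 2.3±0.9 | 57.7±9.1 | 72.7±7.4 | 39.6±10.2 | 30.9±4.8 | 43.3±6.7 |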

表 2: Attention、InputXGrad、Integrated Gradients (IntGrad)、Deeplift 和 AttInGrad 在 MDACE 测试集上的合理性。每个实验使用十个不同的随机种子运行。我们展示了种子的平均值 $\pm$ 标准差。所有分数均以百分比表示。加粗数字优于无监督基线模型,带下划线数字优于有监督模型。更多特征归因方法见附录 E.2。

解释方法 模型 P↑ R↑ F1↑ AUPRC↑ Empty ↓ SpanR ↑ Cover ↑ IOU↑ P@5↑ R@5↑
Attention Bs 41.7±5.0 42.9±3.2 42.2±3.7 37.3±3.9 13.5±1.9 60.5±3.7 72.3±3.5 46.1±4.5 32.3±2.6 45.3±3.7
Bu 35.9±5.9 37.7±5.7 36.6±5.3 31.4±6.4 21.1±3.6 53.2±7.4 63.4±6.8 39.3±6.9 28.8±4.2 40.3±5.9
IGR 35.5±6.2 40.4±6.1 37.7±5.9 32.4±6.8 19.4±2.1 55.5±8.0 64.5±6.2 40.0±7.4 29.2±4.5 40.9±6.3
TM 36.5±5.9 37.8±6.1 37.0±5.5 31.7±6.7 21.7±3.3 53.3±7.7 63.3±6.9 40.1±7.3 29.1±4.4 40.8±6.2
PGD 33.0±8.4 38.4±8.0 35.4±8.3 30.0±9.1 19.1±1.6 51.5±13.6 61.1±9.9 35.9±10.8 27.7±6.3 38.9±8.8
InputXGrad Bs 30.7±2.0 33.7±2.6 32.0±1.4 24.9±2.0 7.3±1.3 48.5±2.8 64.0±2.8 32.2±2.1 26.6±1.2 37.3±1.6
Bu 31.0±2.6 32.7±3.2 31.6±1.2 24.8±1.9 9.5±2.6 46.4±3.6 62.3±3.5 31.7±1.8 26.3±0.9 36.9±1.3
IGR 32.4±2.5 33.8±2.2 33.0±0.7 27.0±0.7 9.7±2.0 48.5±2.6 64.6±2.5 32.8±1.0 27.4±0.4 38.4±0.5
TM 32.4±2.7 34.1±2.1 33.1±1.4 26.2±2.3 8.6±1.6 48.5±2.4 64.8±2.5 32.6±1.5 27.4±1.1 38.3±1.5
PGD 30.1±2.2 32.9±1.7 31.4±1.4 24.8±1.5 9.3±2.2 46.6±2.1 62.6±2.6 31.4±1.3 26.1±1.0 36.6±1.4
IntGrad Bs 30.9±3.2 33.9±2.0 32.2±1.7 26.2±2.6 4.8±2.0 50.8±2.4 65.9±3.3 34.2±2.8 27.4±1.6 38.4±2.2
Bu 31.3±4.3 32.8±2.7 31.9±3.2 26.0±4.3 5.2±1.0 49.4±3.0 64.4±3.1 33.9±4.2 26.5±2.5 37.1±3.6
IGR 32.6±2.2 33.8±2.3 33.1±1.3 26.6±2.0 5.2±1.7 49.7±2.5 64.7±3.3 34.9±2.3 27.6±0.9 38.7±1.2
TM 33.1±3.6 34.0±3.7 33.4±2.8 27.5±3.7 5.4±1.6 51.0±3.9 65.5±4.1 35.3±4.2 27.6±2.1 38.7±3.0
PGD 30.0±5.4 33.5±3.3 31.4±4.2 25.4±4.8 5.1±1.8 49.7±3.6 64.5±3.9 33.6±4.8 26.4±3.0 36.9±4.2
Deeplift Bs 29.2±2.3 34.2±3.0 31.3±1.2 24.1±1.8 6.4±1.8 48.9±3.4 65.2±3.3 31.1±1.9 26.2±1.1 36.7±1.5
Bu 31.2±1.9 31.4±2.4 31.2±1.5 24.1±2.1 9.1±1.6 45.1±2.5 61.7±2.9 30.9±1.6 25.8±1.1 36.1±1.5
IGR 31.0±1.8 33.7±1.4 32.2±0.6 25.9±0.9 8.6±1.0 48.3±1.5 64.8±1.7 31.6±1.0 26.7±0.5 37.4±0.7
TM 30.2±2.4 34.8±1.4 32.2±1.4 25.1±2.5 6.8±1.5 49.1±1.8 65.7±1.6 31.4±1.8 26.6±1.1 37.3±1.6
PGD 29.4±2.8 32.8±1.5 30.9±1.5 23.9±1.5 8.1±1.8 46.7±1.8 63.3±2.2 30.7±1.4 25.5±0.9 35.8±1.2
AttInGrad Bs 40.6±2.2 45.9±3.1 43.0±1.9 38.8±2.1 1.2±0.5 63.9±2.9 79.0±1.9 45.7±2.6 33.3±1.4 46.6±1.9
Bu 40.7±2.9 42.5±4.4 41.5±3.2 37.1±3.6 3.0±1.3 59.3±5.4 74.9±5.2 43.5±4.0 32.0±2.1 44.9±3.0
IGR 38.8±4.5 44.0±4.0 41.2±4.0 37.2±4.5 2.3±0.7 60.3±5.3 74.5±4.6 43.3±4.9 31.9±2.7 44.7±3.8
TM 40.2±3.0 43.9±4.8 41.9±3.4 37.8±3.9 2.8±1.6 60.5±5.8 75.9±5.3 44.0±3.9 32.5±2.3 45.5±3.2
PGD 37.1±8.2 42.5±5.7 39.3±7.1 34.7±8.6 2.3±0.9 57.7±9.1 72.7±7.4 39.6±10.2 30.9±4.8 43.3±6.7

6 Discussion

6 讨论

Do we need evidence span annotations? We demonstrated that we could match the explanation quality of Cheng et al. (2023) but without supervised training with evidence-span annotations (see Section 5). This raises the question: are evidence-span annotations unnecessary?

我们需要证据范围标注吗?我们证明了自己能够达到Cheng等人 (2023) 的解释质量,但无需使用证据范围标注进行监督训练 (见第5节)。这引发了一个问题:证据范围标注是否必要?

Intuitively, training a model on evidence spans should encourage it to use features relevant to humans, thereby making its explanations more plausible. However, we hypothesize that the training strategy used by Cheng et al. (2023) primarily addresses the shortcomings of attention-based explanation methods rather than enhancing the model’s underlying logic. The model $\mathrm{B_{S}}$ only produced more plausible explanations with attention-based feature attribution methods (see Figure 2). If the model truly leveraged more informative features, we would expect to see improvements across various feature attribution methods. Additionally, the differences between Attention and AttInGrad were negligible for $\mathrm{B_{S}}$ compared to the other models. This suggests that the supervised training may have corrected some of the inherent issues in the Attention method, similar to what AttInGrad achieves.

直观上看,在证据片段上训练模型应能促使它使用与人类相关的特征,从而使解释更具可信度。然而,我们假设 Cheng 等人 (2023) 采用的训练策略主要解决的是基于注意力 (Attention) 的解释方法的缺陷,而非增强模型的底层逻辑。模型 $\mathrm{B_{S}}$ 仅在使用基于注意力的特征归因方法时产生了更可信的解释 (见图 2)。若模型真正利用了信息量更大的特征,我们理应在各类特征归因方法中观察到改进。此外,与其他模型相比,$\mathrm{B_{S}}$ 的 Attention 与 AttInGrad 差异微乎其微,这可能表明监督训练修正了 Attention 方法的部分固有缺陷,其效果类似于 AttInGrad。

Adversarial robustness training strategies’ impact on explanation plausibility While IGR and TM generated more plausible explanations than $\mathrm{B_{U}}$, our evidence is insufficient to conclude whether the improvements were caused by our adversarially robust models relying on fewer irrelevant features. The adversarial robustness training strategies, especially PGD, had a larger impact on the plausibility of the explanations in previous image classification studies (Tsipras et al., 2018). We speculate that this discrepancy is caused by the inherent differences in the text and image modalities, causing techniques designed for image classifiers to be less effective for text classifiers (Etmann et al., 2019).

对抗鲁棒性训练策略对解释合理性的影响
虽然 IGR 和 TM 生成的解释比 $\mathrm{B_{U}}$ 更合理,但我们的证据不足以证明这种改进是否源于对抗鲁棒模型减少了无关特征依赖。在早期图像分类研究中 (Tsipras et al., 2018) ,对抗鲁棒训练策略(尤其是 PGD)对解释合理性影响更大。我们推测这种差异源于文本与图像模态的本质区别,导致为图像分类器设计的技术对文本分类器效果较弱 (Etmann et al., 2019) 。


Figure 3: Faithfulness of Attention, InputXGrad, and AttInGrad across models.

图 3: 不同模型中Attention、InputXGrad和AttInGrad的忠实度。

Table 3: Impact on F1 score of zeroing out feature attribution scores of special tokens for $\mathtt{B_{U}}$ .

表 3: 清零特殊token特征归因分数对 $\mathtt{B_{U}}$ F1分数的影响

| Explainer | Before | After |
|---|---|---|
| Attention | 36.5±5.4 | 40.2±3.9 |
| InputXGrad | 31.6±1.2 | 31.7±1.2 |
| AttInGrad | 41.5±4.4 | 42.4±2.4 |

Limitations of attention-based explanations Despite Attention and AttInGrad outperforming other methods in plausibility and faithfulness, they exhibited significant shortcomings, including high sufficiency and inter-seed variation. These findings align with previous research questioning the faithfulness of solely relying on final layer attention weights (Jain and Wallace, 2019).

基于注意力机制解释的局限性
尽管Attention和AttInGrad在合理性和忠实性上优于其他方法,但它们仍存在明显缺陷,包括高充分性和种子间差异性。这些发现与先前质疑仅依赖最终层注意力权重(attention weights)忠实性的研究结论一致(Jain and Wallace, 2019)。

We hypothesize these limitations stem from misalignment between the positions of the original tokens and their encoded representations. Our analysis (Section 5) suggests the encoder may store contextual information in uninformative tokens, such as special tokens, which are then used by the final attention layer for classification. As the training loss does not penalize where contextualized information is placed, this location can vary across training runs, leading to the observed high inter-seed variance in attention-based explanations.

我们假设这些限制源于原始token位置与其编码表示之间的错位。我们的分析(第5节)表明,编码器可能将上下文信息存储在无信息量的token中(如特殊token),这些信息最终被注意力分类层所使用。由于训练损失函数不会惩罚上下文信息的存储位置,该位置可能在训练迭代间发生变化,从而导致基于注意力的解释方法出现观测到的高跨种子方差。


Figure 4: The relationship between explanation quality and the proportion of the top five most important tokens that are special tokens (tokens devoid of alphanumeric characters). Each data point is the average statistic on the MDACE test set for a seed of $\mathrm{B_{U}}$. We fitted a linear regression for each explanation method and calculated the Pearson correlation $(r)$. The dotted vertical lines represent the proportion of special tokens in the evidence-span annotations.

图 4: 解释质量与最重要前五个token中特殊token(不含字母数字字符的token)占比的关系。每个数据点是 $\mathrm{B_{U}}$ 种子在MDACE测试集上的平均统计值。我们为每种解释方法拟合了线性回归并计算了皮尔逊相关系数 $(r)$ 。虚线垂直线表示证据跨度标注中特殊token的占比。

Training strategies that enforce alignment between original tokens and their encoded representations could alleviate the limitations of Attention and AttInGrad. This alignment might explain the benefits of the supervised training strategy proposed by Cheng et al. (2023). However, rather than restricting the model, future research should explore feature attribution methods that incorporate information from all transformer layers, not just the final one (Kobayashi et al., 2021). Although attention rollout, a method incorporating all attention layers, proved unsuccessful in our experiments (see Appendix E.2), recent studies have highlighted its shortcomings and proposed alternative feature attribution methods that may be more suitable for our task (Modarressi et al., 2022, 2023).

强化原始token与其编码表示间对齐的训练策略可以缓解Attention和AttInGrad的局限性。这种对齐可能解释了Cheng等人 (2023) 提出的监督训练策略的优势。但未来研究不应局限于模型约束,而应探索整合所有Transformer层 (而不仅是最后一层) 信息的特征归因方法 (Kobayashi等人, 2021) 。尽管注意力展开 (attention rollout) 这种整合所有注意力层的方法在我们的实验中未获成功 (见附录E.2) ,但近期研究揭示了其缺陷,并提出了可能更适合我们任务的替代特征归因方法 (Modarressi等人, 2022, 2023) 。

Recommendations Similar to Lyu et al. (2023), we advocate that future research on feature attribution methods prioritize enhancing their faithfulness, as focusing solely on plausibility can yield misleading explanations. When models misclassify or rely on irrelevant features, explanations can only appear plausible if they ignore the model’s actual reasoning process. Overemphasizing plausibility may inadvertently lead researchers to favor approaches that produce explanations disconnected from the model’s true reasoning.

与 Lyu 等人 (2023) 的建议类似,我们主张未来特征归因方法的研究应优先提升其忠实度 (faithfulness),因为仅关注合理性 (plausibility) 可能导致误导性解释。当模型误分类或依赖无关特征时,若忽略模型的实际推理过程,解释只会显得合理。过度强调合理性可能无意中促使研究者青睐那些产生与模型真实推理脱节的解释方法。

Instead, we propose that researchers prioritize improving the faithfulness of feature attribution methods while also working to align the model’s reasoning process with that of humans. This approach not only enhances the plausibility and faithfulness of explanations but also contributes to the accuracy and robustness of model classifications.

相反,我们建议研究人员优先提升特征归因方法的忠实度,同时致力于使模型的推理过程与人类思维保持一致。这种方法不仅能增强解释的可信度与忠实性,还能提升模型分类的准确性和鲁棒性。

7 Conclusion

7 结论

Our goal was to enhance the plausibility and the faithfulness of explanations without evidence-span annotations. We found that training our model using input gradient regularization or token masking resulted in more plausible gradient-based explanations. We proposed a new explanation method, AttInGrad, which was substantially more plausible and faithful than the attention-based explanation method used in previous studies. By combining the best training strategy and explanation method, we showed results of similar quality to a supervised baseline (Cheng et al., 2023).

我们的目标是在无需证据跨度标注的情况下提升解释的合理性和忠实度。我们发现,使用输入梯度正则化或token掩码训练模型能产生更合理的基于梯度的解释。我们提出了一种新的解释方法AttInGrad,其合理性和忠实度显著优于以往研究中基于注意力的解释方法。通过结合最佳训练策略与解释方法,我们展示了与监督基线 (Cheng et al., 2023) 质量相当的结果。

Limitations

局限性

Our study did not conclusively show why adversarial robustness training strategies improved the explanation plausibility. We hypothesized that these strategies force the model to rely on fewer features that weakly correlate with the labels, and such features are less plausible. However, validating this hypothesis proved challenging. Our analysis of feature attributions’ entropy was inconclusive, as detailed in Appendix E.4. Moreover, we did not know which features the model relied on because this would require a perfect feature attribution method, which is what we aimed to develop. Despite these challenges, we demonstrated that the adversarially robust models produced more plausible explanations.

我们的研究并未明确揭示对抗鲁棒性训练策略为何能提升解释合理性。我们假设这些策略迫使模型依赖更少与标签弱相关的特征,而此类特征往往缺乏合理性。但验证这一假设存在挑战:附录E.4显示特征归因熵分析未能定论,且由于缺乏理想的特征归因方法(这正是我们试图开发的),我们无法确认模型具体依赖的特征。尽管如此,我们证实了对抗鲁棒模型能生成更具合理性的解释。

We believe that our work has laid a solid foundation for future research into how model training strategies can impact explanation plausibility.

我们相信,我们的工作为未来研究模型训练策略如何影响解释合理性奠定了坚实基础。

Our study’s scope was constrained to a single data source (Beth Israel Deaconess Medical Center’s MIMIC-III and MDACE) and one model architecture (PLM-ICD). The effectiveness of Attention and AttInGrad may vary with different architectures, particularly those employing multi-head attention and skip connections in the final layer. Further research is needed to investigate the generalizability of our findings across diverse model architectures, medical coding systems, languages, and healthcare institutions.

我们的研究范围仅限于单一数据源(Beth Israel Deaconess Medical Center的MIMIC-III和MDACE)和一种模型架构(PLM-ICD)。Attention和AttInGrad的效果可能因不同架构而异,尤其是那些在最后一层采用多头注意力机制和跳跃连接的架构。需要进一步研究来验证我们的发现在不同模型架构、医疗编码系统、语言和医疗机构中的泛化能力。

Furthermore, the limited size of the MDACE test set constrained our study, resulting in low statistical power for many experiments. Despite the desire to conduct more trials with various seeds, we limited ourselves to ten seeds per training strategy due to the high computational costs involved. Conducting more experiments or expanding the test set might have revealed nuances and differences that our initial setup failed to detect. Nevertheless, our results across runs, explanation methods, and analysis point in the same direction. Moreover, while the test set in the main paper only comprises 61 examples, each example contains 14 medical codes, each annotated with multiple evidence spans, providing greater statistical power. Finally, our comparison of the unsupervised approaches on the larger test set in Appendix E.5 demonstrated similar results as on the smaller test set in the main paper. We, therefore, believe that our claims in this paper are well substantiated with empirical evidence.

此外,MDACE测试集的有限规模限制了我们的研究,导致许多实验的统计功效较低。尽管我们希望能用不同种子进行更多试验,但由于高昂的计算成本,每种训练策略仅使用了十个种子。进行更多实验或扩大测试集可能会揭示我们初始设置未能捕捉到的细微差别。尽管如此,我们在多次运行、解释方法和分析中得到的结果都指向同一结论。此外,虽然主论文中的测试集仅包含61个样本,但每个样本含有14个医疗编码,且每个编码标注了多个证据片段,从而提供了更强的统计效力。最后,我们在附录E.5中对无监督方法在更大测试集上的比较表明,其结果与主论文中小测试集的结果相似。因此,我们认为本文的结论得到了充分的实证支持。

Ethics statement

伦理声明

Healthcare costs are continuously increasing worldwide, with administrative costs being a significant contributing factor (Tseng et al., 2018). In this paper, we propose methods that may help reduce these administrative costs by making the review of medical code suggestions easier and faster. The aim of this paper was to develop technology to assist medical coders in performing tasks faster instead of replacing them.

全球医疗成本持续攀升,行政开支是重要诱因 (Tseng et al., 2018) 。本文提出的方法通过简化医疗编码建议的审核流程,有望降低这类行政成本。研究目标是开发辅助医疗编码员提升工作效率的技术,而非取代其职能。

Plausible but unfaithful explanations may risk convincing medical coders to accept medical code suggestions that are incorrect, thereby risking the patient’s safety (Jacovi and Goldberg, 2020). We, therefore, advocate faithfulness to be of higher priority than in previous studies.

看似合理但不真实的解释可能会说服医疗编码员接受错误的医疗代码建议,从而危及患者安全 (Jacovi and Goldberg, 2020) 。因此,我们主张将真实性置于比以往研究更高的优先级。

Electronic healthcare records contain private information. The healthcare records in MIMIC-III, the dataset used in this paper, have been anonymized and stored in encrypted data storage accessible only to the main author, who has a license to the dataset and HIPAA training.

电子健康记录包含私人信息。本文使用的数据集MIMICIII中的医疗记录已匿名化,并存储在加密数据存储中,仅限拥有数据集使用许可和HIPAA培训的主要作者访问。

B Feature attribution methods

B 特征归因方法

Attention We use the raw attention weights $A_{j}$ (see Equation (2)) in the cross-attention layer to explain class $j$ . As mentioned in Section 2.1, this explanation method was used by most previous studies in automated medical coding (Mullenbach et al., 2018; Kim et al., 2022; Dong et al., 2021; Teng et al., 2020; Cheng et al., 2023).

注意力
我们使用交叉注意力层中的原始注意力权重 $A_{j}$ (见公式(2)) 来解释类别 $j$ 。如第2.1节所述,这种解释方法在自动化医疗编码的大多数先前研究中被采用 (Mullenbach et al., 2018; Kim et al., 2022; Dong et al., 2021; Teng et al., 2020; Cheng et al., 2023)。


Figure 5: The PLM-CA architecture we used in our experiments.

图 5: 实验中使用的PLM-CA架构。

Attention Rollout (Rollout) The attention matrix in the cross-attention layer extracts information from the contextualized token representations encoded by RoBERTa (see Figure 5). The token representations are not guaranteed to be aligned with the input tokens. A token representation at position $n$ could represent any token, or several tokens, in the document. Attention rollout considers all the model’s attention layers to calculate the feature attributions (Abnar and Zuidema, 2020). First, the attention matrices in each layer are averaged across the heads. Then, the identity matrix is added to each layer’s attention matrix to represent the skip connections. Finally, the attention rollout is calculated recursively using Equation (4).

注意力展开 (Rollout)
交叉注意力层中的注意力矩阵从RoBERTa编码的上下文token表示中提取信息 (见图5)。这些token表示并不保证与输入token对齐。位置$n$的token表示可能代表文档中的任意一个或多个token。注意力展开通过考虑模型所有注意力层来计算特征归因 (Abnar和Zuidema, 2020)。首先,将每一层的注意力矩阵在多头间取平均;然后,在每层注意力矩阵中加入单位矩阵以表示跳跃连接;最后,使用公式(4)递归计算注意力展开。

where $\tilde{\pmb{A}}\in\mathbb{R}^{N\times N}$ is the rollout attention, and $\bar{\pmb{A}}\in\mathbb{R}^{N\times N}$ is the attention averaged across heads with the added identity matrix. We calculated the final feature attribution score by multiplying the rollout attention from the final layer with the attention matrix from the cross-attention layer: $\pmb{A}\cdot\tilde{\pmb{A}}^{(L)}$ , where $L$ is the number of attention layers.

其中 $\tilde{\pmb{A}}\in\mathbb{R}^{N\times N}$ 是 rollout attention (滚动注意力), $\bar{\pmb{A}}\in\mathbb{R}^{N\times N}$ 是跨注意力头平均并加上单位矩阵后的注意力。我们通过将最后一层的 rollout attention 与交叉注意力层的注意力矩阵相乘来计算最终特征归因分数: $\pmb{A}\cdot\tilde{\pmb{A}}^{(L)}$, 其中 $L$ 是注意力层数。
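The three steps described above can be sketched in a few lines. This is a reading of the text and of Abnar and Zuidema (2020), not the authors' code: the recursion is assumed to be $\tilde{\pmb{A}}^{(l)}=\bar{\pmb{A}}^{(l)}\tilde{\pmb{A}}^{(l-1)}$ with $\tilde{\pmb{A}}^{(1)}=\bar{\pmb{A}}^{(1)}$, no row re-normalization is applied because the text does not mention one, and all function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def average_heads(heads):
    """Average a list of N x N attention matrices (one per head) element-wise."""
    n = len(heads[0])
    return [[sum(h[i][j] for h in heads) / len(heads) for j in range(n)]
            for i in range(n)]


def add_identity(a):
    """Add the identity matrix to model the skip connection."""
    return [[a[i][j] + (1.0 if i == j else 0.0) for j in range(len(a))]
            for i in range(len(a))]


def attention_rollout(layers):
    """layers: list over layers, each a list of per-head N x N matrices."""
    rollout = None
    for heads in layers:
        a_bar = add_identity(average_heads(heads))
        rollout = a_bar if rollout is None else matmul(a_bar, rollout)
    return rollout


def rollout_attribution(cross_attention, layers):
    """Multiply the J x N cross-attention matrix with the final rollout."""
    return matmul(cross_attention, attention_rollout(layers))
```

Row `j` of the result of `rollout_attribution` would then hold the attribution scores for class $j$.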

Occlusion@1 Occlusion@1 calculates each feature’s score by occluding it and measuring the change in output confidence. The change in output confidence is the feature’s score (Ribeiro et al., 2016).

遮挡 (Occlusion@1) Occlusion@1 通过遮挡特征并测量输出置信度的变化来计算每个特征的分数。输出变化即为该特征的得分 (Ribeiro et al., 2016)。
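As an illustration, occluding one token at a time can be written as the loop below. It is a sketch under assumptions: `model` is any callable that maps a token list to the confidence of one class, `mask_token` is a placeholder string, and `toy_model` is an invented stand-in, not a medical coding model.

```python
def occlusion_at_1(model, tokens, mask_token="<mask>"):
    """Score each token by the confidence drop when only that token is masked."""
    base = model(tokens)
    scores = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask_token] + tokens[i + 1:]
        scores.append(base - model(occluded))
    return scores


def toy_model(tokens):
    """Hypothetical classifier: confidence depends on two keywords only."""
    return 0.1 + 0.6 * ("pneumonia" in tokens) + 0.2 * ("cough" in tokens)
```

The method needs one forward pass per token plus one for the unmodified input, which is why it becomes expensive on long documents.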

LIME Local Interpretable Model-agnostic Explanations (LIME) randomly occludes sets of tokens from a specific input and measures the change in output confidence. It uses these measurements to train a linear regression model that approximates the explained model’s reasoning for that particular example. It then uses the linear regression weights to approximate each feature’s influence (Ribeiro et al., 2016).

LIME (Local Interpretable Model-agnostic Explanations) 通过随机遮蔽特定输入中的部分 token 并测量输出置信度的变化,利用这些测量数据训练一个线性回归模型来近似解释模型在该示例中的推理逻辑,最终通过回归权重估算各特征的影响程度 (Ribeiro et al., 2016)。

KernelSHAP Shapley Additive Explanations (SHAP) is based on Shapley values from cooperative game theory, which fairly distribute the payout among players by considering each player’s contribution in all possible coalitions. Ideally, SHAP quantifies all possible feature combinations in an input by occluding them and measuring the impact. However, this would require a number of forward passes that grows exponentially with the number of features $N$. KernelSHAP employs the LIME framework to efficiently approximate Shapley values using a weighted linear regression approach. We refer the reader to the seminal paper introducing SHAP and KernelSHAP for more details (Lundberg and Lee, 2017).

KernelSHAP
Shapley Additive Explanations (SHAP) 基于合作博弈论中的 Shapley 值,通过考虑每个玩家在所有可能联盟中的贡献,公平地分配收益。理想情况下,SHAP 通过遮挡输入中的所有可能特征组合并测量其影响来量化它们。然而,所需的前向传递次数会随特征数量 $N$ 呈指数增长。KernelSHAP 利用 LIME 框架,通过加权线性回归方法高效地近似 Shapley 值。更多细节请参阅介绍 SHAP 和 KernelSHAP 的开创性论文 (Lundberg and Lee, 2017)。
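To make the cost of exact Shapley values concrete, the sketch below computes them by brute force for a toy value function. It is not KernelSHAP: it enumerates all $N!$ feature orderings, which is only feasible for a handful of features, and `exact_shapley` and `toy_value` are hypothetical names introduced for illustration.

```python
from itertools import permutations
from math import factorial


def exact_shapley(value, n):
    """Average each feature's marginal contribution over all n! orderings."""
    phi = [0.0] * n
    for order in permutations(range(n)):
        present = set()
        previous = value(frozenset(present))
        for i in order:
            present.add(i)
            current = value(frozenset(present))
            phi[i] += current - previous
            previous = current
    return [p / factorial(n) for p in phi]


def toy_value(coalition):
    """Hypothetical model output when only the features in `coalition` are kept."""
    joint = 1.0 if {0, 1} <= coalition else 0.0  # features 0 and 1 interact
    solo = 0.5 if 2 in coalition else 0.0        # feature 2 acts alone
    return joint + solo
```

With three features this takes six orderings; a document with thousands of tokens is far out of reach, which is the motivation for a sampling-based approximation such as KernelSHAP.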

InputXGrad Input X Gradient multiplies the input gradients with the input (Shrikumar et al., 2017). We used the L2 norm to get the final feature attribution scores. We calculated the feature attribution scores for class $j$ as follows:

InputXGrad
Input X Gradient方法将输入梯度与输入相乘 (Shrikumar et al., 2017)。我们使用L2范数计算最终的特征归因分数。类别$j$的特征归因分数计算公式如下:

$$
\begin{bmatrix}
\left\|\pmb{X}_{1}\odot\frac{\partial f_{j}}{\partial \pmb{X}_{1}}(\pmb{X})\right\|_{2}\\
\vdots\\
\left\|\pmb{X}_{N}\odot\frac{\partial f_{j}}{\partial \pmb{X}_{N}}(\pmb{X})\right\|_{2}
\end{bmatrix}
$$

where $\pmb{X}\in\mathbb{R}^{N\times D}$ is the matrix of input token embeddings, $\odot$ is element-wise multiplication, $D$ is the embedding dimension, $N$ is the number of tokens in a document, and $j$ indexes one of the $J$ classes.

其中 $\pmb{X}\in\mathbb{R}^{N\times D}$ 是输入的token嵌入矩阵,$\odot$ 表示逐元素矩阵乘法运算,$D$ 为嵌入维度,$N$ 是文档中的token数量,$J$ 为类别总数。
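Once the gradients are available, the formula above reduces to a per-token norm. The sketch below assumes the gradient of $f_{j}$ with respect to the embeddings has already been computed by an automatic differentiation framework and is passed in as a plain `N x D` nested list; `input_x_grad` is a hypothetical name.

```python
import math


def input_x_grad(embeddings, gradients):
    """L2 norm of the element-wise product of each token embedding and its gradient."""
    return [
        math.sqrt(sum((x * g) ** 2 for x, g in zip(x_row, g_row)))
        for x_row, g_row in zip(embeddings, gradients)
    ]
```

Taking the norm collapses the $D$ embedding dimensions into a single non-negative score per token, so the sign of the contribution is discarded.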

Integrated Gradients (IntGrad) Integrated Gradients (IntGrad) assigns an attribution score to each input feature by computing the integral of the gradients along the straight-line path from the baseline $\pmb{B}$ to the input $\pmb{X}$ (Sundararajan et al., 2017). Similar to Input X Gradient, we used the L2 norm of the output to get the final attribution scores.

集成梯度 (Integrated Gradients, IntGrad) 通过计算从基线 $\pmb{B}$ 到输入 $\pmb{X}$ 的直线路径上梯度的积分 (Sundararajan et al., 2017) ,为每个输入特征分配归因分数。与输入乘梯度 (Input X Gradient) 类似,我们使用输出的 L2 范数来获得最终归因分数。
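The path integral is usually approximated with a Riemann sum. The sketch below uses scalar features and a midpoint rule for brevity; in the setting of this paper each feature would be an embedding vector whose attribution is reduced with the L2 norm. The names `integrated_gradients` and `grad_fn`, and the default of 50 steps, are assumptions for illustration.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann approximation of Integrated Gradients for scalar features."""
    n = len(x)
    grad_sum = [0.0] * n
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint of the s-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = grad_fn(point)
        for i in range(n):
            grad_sum[i] += grads[i]
    return [(xi - b) * g / steps for xi, b, g in zip(x, baseline, grad_sum)]


def toy_grad(point):
    """Gradient of the toy function f(x) = sum of x_i squared."""
    return [2.0 * p for p in point]
```

For the toy function, the attributions sum to $f(\pmb{X})-f(\pmb{B})$, which is the completeness property of Integrated Gradients.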

Deeplift DeepLIFT (Deep Learning Important FeaTures) backpropagates the contributions of all neurons in the model to every input feature (Shrikumar et al., 2017). It compares each neuron’s activation to its baseline activation and assigns attribution scores according to the difference.

DeepLIFT (Deep Learning Important FeaTures) 将模型中所有神经元的贡献反向传播至每个输入特征 (Shrikumar et al., 2017)。该方法通过比较神经元激活值与其基线激活值的差异来分配归因分数。

AttnGrad AttnGrad multiplies the attention weights with the gradient of the model’s output with respect to the attention weights (Serrano and Smith, 2019):

AttnGrad
AttnGrad 将注意力 (Attention) 与模型输出相对于注意力权重的梯度相乘 (Serrano and Smith, 2019):

$$
\begin{bmatrix}
A_{j1}\cdot\left|\frac{\partial f_{j}}{\partial A_{j1}}(\pmb{X})\right|\\
\vdots\\
A_{jN}\cdot\left|\frac{\partial f_{j}}{\partial A_{jN}}(\pmb{X})\right|
\end{bmatrix}
$$

where $\pmb{A}\in\mathbb{R}^{J\times N}$ is the attention matrix, $N$ is the number of tokens in a document, and $J$ is the number of classes.

其中 $\pmb{A}\in\mathbb{R}^{J\times N}$ 是注意力矩阵,$N$ 表示文档中的 token 数量,$J$ 表示类别数量。

AttInGrad We found Attention Rollout to perform poorly on our task. Therefore, we developed a simple alternative approach to incorporate the impact of neighboring tokens into the attention explanations. AttInGrad incorporates the context by multiplying the attention $A_{j}$ with the InputXGrad feature attributions:

AttInGrad 我们发现注意力展开(Attention Rollout)在我们的任务中表现不佳。因此,我们开发了一种简单的替代方法,将相邻token的影响纳入注意力解释中。AttInGrad通过将注意力权重$A_{j}$与InputXGrad特征归因相乘来整合上下文信息:

$$
\begin{bmatrix}
A_{j1}\cdot\left\|\pmb{X}_{1}\odot\frac{\partial f_{j}}{\partial \pmb{X}_{1}}(\pmb{X})\right\|_{2}\\
\vdots\\
A_{jN}\cdot\left\|\pmb{X}_{N}\odot\frac{\partial f_{j}}{\partial \pmb{X}_{N}}(\pmb{X})\right\|_{2}
\end{bmatrix}
$$

where $\pmb{A}\in\mathbb{R}^{J\times N}$ is the attention matrix, $\odot$ is element-wise multiplication, $N$ is the number of tokens in a document, and $J$ is the number of classes.

其中 $\pmb{A}\in\mathbb{R}^{J\times N}$ 是注意力矩阵,$\odot$ 表示逐元素矩阵乘法运算,$N$ 是文档中的token数量,$J$ 是类别数。
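A minimal sketch of the AttInGrad product, assuming the attention row for class $j$ and the embedding gradients are already available as plain lists. The released code of the paper is the reference implementation; `att_in_grad` and `input_x_grad` are hypothetical names used here for illustration.

```python
import math


def input_x_grad(embeddings, gradients):
    """L2 norm of the element-wise product of each token embedding and its gradient."""
    return [
        math.sqrt(sum((x * g) ** 2 for x, g in zip(x_row, g_row)))
        for x_row, g_row in zip(embeddings, gradients)
    ]


def att_in_grad(attention_row, embeddings, gradients):
    """Multiply each token's attention weight with its InputXGrad score."""
    ixg = input_x_grad(embeddings, gradients)
    return [a * s for a, s in zip(attention_row, ixg)]
```

Because the two scores are multiplied, a token only receives a high attribution when both methods consider it important; a token with high attention but a zero InputXGrad score ends up at zero.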

C Faithfulness evaluation metrics

C 忠实度评估指标

Our faithfulness metrics, Sufficiency and Comprehensiveness, evaluate how the model output changes when important or unimportant features are masked (DeYoung et al., 2020).

我们的忠实度指标——充分性(Sufficiency)和全面性(Comprehensiveness)——通过掩蔽重要或不重要特征来评估模型输出的变化 (DeYoung et al., 2020)。

Sufficiency measures how masking unimportant features affects the output. A high sufficiency score indicates that many low-attribution features significantly impact the model’s output, suggesting the presence of false negatives. We calculated sufficiency using the following equation:

充分性衡量掩盖非重要特征对输出的影响。高充分性分数表明许多低归因特征显著影响模型输出,暗示存在假阴性。我们使用以下公式计算充分性:

$$
\frac{1}{K}\sum_{i=N-K}^{N}\frac{\operatorname*{max}(0,f(X)-f(R_{i}))}{f(X)}
$$

where $\pmb{R}_{i}\in\mathbb{R}^{N\times D}$ represents the input with the $i$th least important feature replaced by mask tokens, $N$ is the number of tokens in an example, and $K$ is a hyperparameter.

其中 $\pmb{R}_{i}\in\mathbb{R}^{N\times D}$ 表示用掩码token替换第i个最不重要特征后的输入,$N$ 表示样本中的token数量,$K$ 为超参数。

Comprehensiveness measures how masking important features affects the output. A high comprehensiveness score indicates that features with high attribution scores strongly influence the model’s output, while a low score suggests many false positives. We calculated comprehensiveness using the following equation:

全面性衡量了遮蔽重要特征对输出的影响程度。高全面性分数表明具有高归因分数的特征强烈影响模型输出,而低分则暗示存在大量误报。我们使用以下公式计算全面性:

$$
\frac{1}{K}\sum_{i=0}^{K}\frac{\operatorname*{max}(0,f(X)-f(\bar{R}_{i}))}{f(X)}
$$

where $\bar{\pmb{R}}_{i}\in\mathbb{R}^{N\times D}$ denotes the input features with the $i$th highest attribution scores replaced by mask tokens.

其中 $\bar{\pmb{R}}_{i}\in\mathbb{R}^{N\times D}$ 表示用掩码token替换了第i个最高归因分值的输入特征。

We set $K=100$ because including all features led to sufficiency scores close to zero and comprehensiveness scores close to one, making it difficult to distinguish differences. Considering fewer features also made evaluation faster.

我们将$K=100$,因为包含所有特征会导致充分性分数趋近于零,而全面性分数趋近于一,难以区分差异。考虑更少的特征也能加快评估速度。
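The two metrics share the same inner term, $\max(0,f(X)-f(R_{i}))/f(X)$, and differ only in which features are masked. The sketch below illustrates this in plain Python under several assumptions that are not taken from the paper: masking is cumulative (the $i$ top- or bottom-ranked tokens are masked together), the score is averaged over however many masked inputs are evaluated, and `toy_model`, `mask_positions`, and the `[MASK]` string are hypothetical stand-ins.

```python
def mask_positions(tokens, positions, mask_token="[MASK]"):
    """Return a copy of tokens with the given positions replaced by mask_token."""
    hidden = set(positions)
    return [mask_token if i in hidden else tok for i, tok in enumerate(tokens)]


def mean_normalized_drop(f_x, masked_outputs):
    """Average of max(0, f(X) - f(R_i)) / f(X) over the supplied masked outputs."""
    drops = [max(0.0, f_x - f_r) / f_x for f_r in masked_outputs]
    return sum(drops) / len(drops)


def comprehensiveness(f, tokens, attributions, k):
    """Mask the i highest-attribution tokens for i = 1..k and average the drop."""
    ranked = sorted(range(len(tokens)), key=lambda i: attributions[i], reverse=True)
    outputs = [f(mask_positions(tokens, ranked[:i])) for i in range(1, k + 1)]
    return mean_normalized_drop(f(tokens), outputs)


def sufficiency(f, tokens, attributions, k):
    """Mask the i lowest-attribution tokens for i = 1..k and average the drop."""
    ranked = sorted(range(len(tokens)), key=lambda i: attributions[i])
    outputs = [f(mask_positions(tokens, ranked[:i])) for i in range(1, k + 1)]
    return mean_normalized_drop(f(tokens), outputs)


def toy_model(tokens):
    """Hypothetical stand-in for f: confident only while 'fever' is visible."""
    return 0.9 if "fever" in tokens else 0.1


tokens = ["patient", "has", "fever", "today"]
attributions = [0.1, 0.05, 0.8, 0.05]
comp = comprehensiveness(toy_model, tokens, attributions, k=2)
suff = sufficiency(toy_model, tokens, attributions, k=2)
```

In this toy case the explanation ranks "fever" highest, so masking the top-ranked tokens removes most of the model's confidence (high comprehensiveness), while masking the bottom-ranked tokens changes nothing (sufficiency of zero).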

D Training details

D 训练细节

We used the same hyperparameters as Edin et al. (2023). We trained for 20 epochs with the AdamW optimizer (Loshchilov and Hutter, 2022), a learning rate of $5\cdot10^{-5}$, dropout of 0.2, no weight decay, and a linear decay learning rate scheduler with warmup. We found the optimal hyperparameters for the auxiliary adversarial robustness training objectives through random search. For each training strategy, we searched the following options: learning rate: $\{5\cdot10^{-5},1\cdot10^{-5}\}$; $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, $\beta$: $\{1.0,0.5,0.1,10^{-2},10^{-3},10^{-4},10^{-5},10^{-6}\}$. We found these hyperparameters to be optimal: $\lambda_{1}=10^{-5}$, $\lambda_{2}=0.5$, $\lambda_{3}=0.5$, $\epsilon=10^{-5}$, and $\beta=0.01$. The learning rate was optimal at $5\cdot10^{-5}$ for all training strategies except for token masking, where $1\cdot10^{-5}$ was optimal. We optimized the token mask and adversarial noise using the AdamW optimizer. In token masking, we initialized the student and teacher models from a trained $\mathrm{B_{U}}$. We fine-tuned the student for one epoch. We used the same hyperparameters as Cheng et al. (2023) for the supervised training strategy.

我们采用了与Edin等人 (2023) 相同的超参数设置。使用AdamW优化器 (Loshchilov and Hutter, 2022) 训练20个epoch,学习率为 $5\cdot10^{-5}$,dropout率为0.2,无权重衰减,并采用带热身的线性衰减学习率调度器。通过随机搜索确定了辅助对抗鲁棒性训练目标的最佳超参数:学习率 $\{5\cdot10^{-5},1\cdot10^{-5}\}$;$\lambda_{1}$、$\lambda_{2}$、$\lambda_{3}$、$\beta$ 取值范围为 $\{1.0,0.5,0.1,10^{-2},10^{-3},10^{-4},10^{-5},10^{-6}\}$。最终最优参数为 $\lambda_{1}=10^{-5}$,$\lambda_{2}=0.5$,$\lambda_{3}=0.5$,$\epsilon=10^{-5}$,$\beta=0.01$。除token masking策略最优学习率为 $1\cdot10^{-5}$ 外,其他策略均采用 $5\cdot10^{-5}$。使用AdamW优化器对token mask和对抗噪声进行优化。在token masking中,师生模型均从训练好的 $\mathrm{B_{U}}$ 初始化,学生模型微调1个epoch。监督训练策略采用Cheng等人 (2023) 的超参数配置。
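The random search over the listed options could look roughly like the sketch below. The option values come from the paragraph above; everything else is an assumption made for illustration: the dictionary keys, the `evaluate` callback (a hypothetical function that trains a model with a configuration and returns a validation score), and the choice to sample all coefficients at once rather than only those relevant to one training strategy.

```python
import random

COEFFICIENT_OPTIONS = [1.0, 0.5, 0.1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]

# Search space as listed in the text; the key names are hypothetical.
SEARCH_SPACE = {
    "learning_rate": [5e-5, 1e-5],
    "lambda_1": COEFFICIENT_OPTIONS,
    "lambda_2": COEFFICIENT_OPTIONS,
    "lambda_3": COEFFICIENT_OPTIONS,
    "beta": COEFFICIENT_OPTIONS,
}


def sample_config(rng):
    """Draw one configuration uniformly at random from the search space."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def random_search(evaluate, n_trials, seed=0):
    """Evaluate n_trials random configurations and return the best one.

    evaluate: hypothetical callback mapping a config dict to a score
              where higher is better (e.g. a validation metric).
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Random search only needs independent draws, so trials can be run in parallel without coordination; the number of trials used by the authors is not stated in the text.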

We did not preprocess the text except to truncate the documents to a maximum length of 6000 tokens to reduce memory usage. Truncation is a common strategy in automated medical coding and has a negligible negative impact because few documents exceed the 6000 token limit (Edin et al., 2023).

我们除了将文档截断至最多6000个token以减少内存使用外,未对文本进行其他预处理。截断是自动医疗编码中的常见策略,且负面影响可忽略不计,因为极少有文档超过6000个token的限制 (Edin et al., 2023)。

E Additional results

E 补充结果

Because of space constraints, we could not include all of our results in the main paper. In this section, we present the excluded results:

由于篇幅限制,我们无法将所有结果纳入主论文。本节将展示未包含的结果:

E.1 Adversarial training does not affect code prediction performance

E.1 对抗训练不影响代码预测性能

Previous papers have demonstrated that adversarial robustness often comes at the cost of accuracy (Li et al., 2023; Tsipras et al., 2018). Therefore, we evaluated whether the training strategies impacted the models’ medical code prediction capabilities. As shown in Table 5, all models performed similarly on the MDACE test set. We also observed negligible performance differences on the MIMIC-III full test set.

先前的研究表明,对抗鲁棒性往往以牺牲准确性为代价 (Li et al., 2023; Tsipras et al., 2018)。因此,我们评估了训练策略是否会影响模型的医疗代码预测能力。如表 5 所示,所有模型在 MDACE 测试集上表现相似。在 MIMIC-III 完整测试集上,我们也观察到了可忽略的性能差异。

E.2 Results from all feature attribution methods

E.2 所有特征归因方法的结果

In the main paper, we only presented the results of selected feature attribution methods because of space constraints. Here, we present the results for all the feature attribution methods: Attention, AttGrad, Attention Rollout (Rollout), InputXGrad, Integrated Gradients (IntGrad), DeepLIFT, and AttInGrad. We compare these methods with a random baseline (Rand), which randomly generates attribution scores. We present the plausibility results in Table 7, and the faithfulness results in Table 8.

在主论文中,由于篇幅限制,我们仅展示了部分特征归因方法的结果。此处我们将呈现所有特征归因方法的结果:Attention、AttGrad、Attention Rollout (Rollout)、InputXGrad、Integrated Gradients (IntGrad)、DeepLIFT 和 AttInGrad。我们将这些方法与随机基线 (Rand) 进行比较,后者随机生成归因分数。我们在表 7 中展示了合理性结果,在表 8 中展示了忠实性结果。

Table 5: Prediction performance on discharge summaries from the MDACE test set. The results are in percentages. We use two classification metrics (F1 micro and macro) and mean Average Precision (mAP), a ranking metric.

表 5: MDACE测试集出院摘要的预测性能。结果以百分比表示。我们使用两种分类指标(F1 micro和macro)以及排序指标平均精度均值(mAP)。

| Model | F1 micro ↑ | F1 macro ↑ | mAP ↑ |
| --- | --- | --- | --- |
| $\mathrm{B_{S}}$ | 67.5±0.4 | 51.5±1.1 | 72.0±0.7 |
| $\mathrm{B_{U}}$ | 67.7±0.3 | 51.6±0.7 | 72.0±0.4 |
| IGR | 68.0±0.6 | 52.2±1.2 | 72.4±1.0 |
| TM | 68.1±0.6 | 51.8±1.5 | 72.3±0.4 |
| PGD | 68.1±0.5 | 52.1±1.2 | 72.0±0.7 |

We did not include Occlusion $@1$ , LIME, and KernelSHAP in these tables because they were too slow to calculate. We used the Captum implementation of the algorithms (Kokhlikyan et al., 2020). It took around 45 minutes on an A100 GPU to calculate the explanations for a single example with LIME and KernelSHAP. Therefore, we only evaluated these methods on a single trained instance of $\mathrm{B_{U}}$ . We present the results in Table 9.

我们没有在表中包含Occlusion $@1$、LIME和KernelSHAP,因为这些方法的计算速度过慢。我们使用了Captum库实现的算法 (Kokhlikyan et al., 2020)。在A100 GPU上,用LIME和KernelSHAP计算单个样本的解释需要约45分钟。因此,我们仅在$\mathrm{B_{U}}$的一个训练实例上评估了这些方法。结果如表9所示。

E.3 Relationship between confidence scores and explanation plausibility

E.3 置信度分数与解释合理性之间的关系

In Table 10 and Table 11, we investigate the difference in explanation plausibility when the model correctly predicts an annotated code (true positive) and when it fails to predict an annotated code (false negative). The explanations are substantially better when the model correctly predicts the codes.

在表10和表11中,我们研究了模型正确预测标注代码( true positive )和未能预测标注代码( false negative )时解释合理性差异。当模型正确预测代码时,解释效果显著更优。

E.4 Entropy of explanation methods

E.4 解释方法的熵

We calculated the entropy of the feature attribution distributions to test our hypothesis that robust training strategies reduce the number of features the model uses (see Table 6). The training strategies did not reduce the entropy. While we would expect a reduced entropy if the model used fewer features, other feature attribution distribution differences may simultaneously increase the entropy. The analysis is, therefore, inconclusive.

我们计算了特征归因分布的熵值来验证我们的假设,即鲁棒性训练策略会减少模型使用的特征数量 (见表6)。训练策略并未降低熵值。虽然如果模型使用更少的特征,我们会预期熵值降低,但其他特征归因分布差异可能同时增加熵值。因此,该分析尚无定论。
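The text does not say how the entropy was computed, so the following is only one plausible reading, given as a sketch: the absolute attribution scores are normalized into a probability distribution, Shannon entropy is taken, and the result is divided by $\log N$ so that documents of different lengths are comparable. The function name and the length normalization are assumptions, not the authors' stated procedure.

```python
import math


def attribution_entropy(attributions, normalize=True):
    """Shannon entropy of the distribution formed by |attribution| scores.

    With normalize=True the value is divided by log(N), so it lies in [0, 1]:
    0 when all mass sits on a single token, 1 when it is spread uniformly.
    """
    weights = [abs(a) for a in attributions]
    total = sum(weights)
    if total == 0:
        raise ValueError("attributions must not all be zero")
    probs = [w / total for w in weights]
    # Terms with p == 0 contribute nothing (the limit of p*log(p) is 0).
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    if normalize and len(probs) > 1:
        entropy /= math.log(len(probs))
    return entropy


uniform = attribution_entropy([1.0, 1.0, 1.0, 1.0])
peaked = attribution_entropy([0.0, 3.0, 0.0])
```

Under this reading, an explanation that concentrates on few tokens has lower entropy, which is the direction the hypothesis above predicts for robustly trained models.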

Table 6: Entropy of the explanations.

表 6: 解释的熵值。

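| Explanation method | Model | Entropy ↓ |
| --- | --- | --- |
| InputXGrad | $\mathrm{B_{U}}$ | 0.74±0.01 |
| InputXGrad | IGR | 0.74±0.00 |
| InputXGrad | TM | 0.73±0.00 |
| Attention | $\mathrm{B_{U}}$ | 0.50±0.01 |
| Attention | IGR | 0.50±0.01 |
| Attention | TM | 0.49±0.02 |
| AttInGrad | $\mathrm{B_{U}}$ | 0.28±0.02 |
| AttInGrad | IGR | 0.28±0.02 |
| AttInGrad | TM | 0.28±0.02 |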

E.5 Unsupervised comparison on bigger test set

E.5 更大测试集上的无监督比较

We included additional experiments on the unsupervised training strategies on a bigger test set. Since only the supervised training strategy required evidence-span annotations in the training set, we retrained our unsupervised methods on the MIMIC-III full training set and evaluated them on the MDACE training and test sets (242 examples).

我们在更大的测试集上进行了无监督训练策略的额外实验。由于只有监督训练策略需要在训练集中提供证据范围标注,我们在MIMIC-III完整训练集上重新训练了无监督方法,并在MDACE训练集和测试集(242个样本)上进行了评估。

We present the plausibility results in Table 12. We observe that the results are similar to those of the main paper. However, IGR produced substantially better attention-based explanations than in the main paper. In Figure 6, we inspect the inter-seed variance. We observe that IGR has no outliers. We, therefore, attribute the differences between this comparison and that in the main paper to none of the ten IGR runs happening to produce an outlier model. These results highlight the fragility of evaluating the attention-based feature attribution methods.

我们在表12中展示了合理性结果。观察发现这些结果与主论文中的结果相似。然而,IGR(输入梯度正则化)生成的基于注意力的解释明显优于主论文中的结果。在图6中,我们检查了种子间方差,观察到IGR没有异常值。因此,我们将本比较与主论文差异的原因归结为:十次IGR运行中恰好没有产生异常模型。这些结果凸显了评估基于注意力的特征归因方法存在脆弱性。

F Emissions

F 排放

Experiments were conducted using a private infrastructure, which has a carbon efficiency of $0.185~\mathrm{kg\,CO_{2}eq/kWh}$. To train $\mathrm{B_{U}}$ or $\mathrm{B_{S}}$, a cumulative 8 hours of computation was performed on hardware of type A100 PCIe 40/80GB (TDP of 250W). Total emissions for one run are estimated to be $0.37~\mathrm{kg\,CO_{2}eq}$. The adversarial robustness training strategies required more hours of computation and therefore caused higher emissions. Input gradient regularization and projected gradient regularization required approximately 36 hours each ($1.67~\mathrm{kg\,CO_{2}eq}$), while token masking required 2.5 hours of fine-tuning of $\mathrm{B_{U}}$ ($0.09~\mathrm{kg\,CO_{2}eq}$). We ran each experiment 10 times, resulting in total emissions of $41.86~\mathrm{kg\,CO_{2}eq}$, which is equivalent to burning $20.9~\mathrm{kg}$ of coal. Estimations were conducted using the Machine Learning Impact calculator (Lacoste et al., 2019).

实验使用了一个碳排放效率为 $0.185~\mathrm{kg\,CO_{2}eq/kWh}$ 的私有基础设施进行。训练 $\mathrm{B_{U}}$ 或 $\mathrm{B_{S}}$ 时,在A100 PCIe 40/80GB(热设计功耗250W)硬件上累计进行了8小时计算。单次运行的总排放量估计为 $0.37~\mathrm{kg\,CO_{2}eq}$。对抗鲁棒性训练策略需要更长的计算时间,因此导致更高的排放量。输入梯度正则化和投影梯度正则化各需约36小时($1.67~\mathrm{kg\,CO_{2}eq}$),而token掩码需要对 $\mathrm{B_{U}}$ 进行2.5小时的微调($0.09~\mathrm{kg\,CO_{2}eq}$)。每个实验重复运行10次,总排放量达 $41.86~\mathrm{kg\,CO_{2}eq}$,相当于燃烧 $20.9~\mathrm{kg}$ 煤炭。排放估算使用机器学习影响计算器完成 (Lacoste et al., 2019)。
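The per-run figures follow from multiplying compute hours, GPU power draw, and carbon intensity. The sketch below reproduces that arithmetic; it assumes the GPU draws its full 250 W TDP for the whole run and that the five strategies ($\mathrm{B_{U}}$, $\mathrm{B_{S}}$, IGR, PGD, TM) were each run 10 times, which is an inference from the reported total rather than something the text states explicitly. The constant and function names are hypothetical.

```python
CARBON_INTENSITY_KG_PER_KWH = 0.185  # carbon efficiency reported in the text
GPU_POWER_KW = 0.250                 # A100 PCIe TDP of 250 W, assumed fully used
RUNS_PER_EXPERIMENT = 10

# Compute hours per single run, as reported in the text.
HOURS_PER_RUN = {"B_U": 8, "B_S": 8, "IGR": 36, "PGD": 36, "TM": 2.5}


def emissions_kg(hours):
    """Estimated kg CO2eq for a run of the given length on one GPU."""
    return hours * GPU_POWER_KW * CARBON_INTENSITY_KG_PER_KWH


per_run_kg = {name: emissions_kg(hours) for name, hours in HOURS_PER_RUN.items()}
total_emissions_kg = RUNS_PER_EXPERIMENT * sum(per_run_kg.values())

for name, kg in per_run_kg.items():
    print(f"{name}: {kg:.3f} kg CO2eq per run")
print(f"total: {total_emissions_kg:.2f} kg CO2eq")
```

Under these assumptions the 8-hour and 36-hour runs come out at about 0.37 and 1.67 kg, and the total at about 41.86 kg, matching the reported numbers. The same product for 2.5 hours gives roughly 0.12 kg rather than the reported 0.09 kg; the cause of this difference is not clear from the text.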
