PhilaeX: Explaining the Failure and Success of AI Models in Malware Detection
The explanation of an AI model’s prediction, used to support decision making in cyber security, is of critical importance. This is especially so when the model’s incorrect prediction can lead to severe damage, or even the loss of lives and critical assets. However, most existing AI models lack the ability to explain their prediction results, despite their strong performance in most scenarios. In this work, we propose a novel explainable AI method, called PhilaeX, that provides a heuristic means to identify the optimized subset of features that forms the complete explanation of an AI model’s prediction. It first identifies the features that lead to the model’s borderline prediction, and then extracts those with positive individual contributions. The feature attributions are then quantified through the optimization of a Ridge regression model. We verify the explanation fidelity through two experiments. First, we assess our method’s capability to correctly identify the activated features in adversarial samples of Android malware, through the feature attribution values produced by PhilaeX. Second, deduction and augmentation tests are used to assess the fidelity of the explanations. The results show that PhilaeX is able to explain different types of classifiers correctly, with higher explanation fidelity than state-of-the-art methods such as LIME and SHAP.
1 INTRODUCTION
Explaining the prediction of an AI model is critical for AI-based solutions to modern cyber threats, which are characterized by their large volume and high complexity. Threat detection solutions based on learnable AI technologies, namely the so-called shallow machine learning and the recently emerging deep learning methods, have demonstrated astonishing performance. However, high detection performance is insufficient for establishing the users’ trust, since most models predict the label of a suspicious sample, e.g., a malware or a face image that may have been manipulated for deception or obfuscation, through a complicated computation process that people cannot understand. This confidence crisis becomes more severe when the AI model makes an erroneous prediction that causes damage or loss to the user’s property, assets or even safety. Therefore, research on explainable AI, which quantitatively explains an AI model’s successful or failed prediction for a particular input sample through the attribution of each data feature’s contribution to the prediction, is highly desired (Došilović et al., 2018).
Malware detection research has made progress over the years. Demontis et al. (Demontis et al., 2017) improved the standard SVM for Android malware detection through an optimized selection method for the model’s parameters, which further reduces the chance of evasion by certain types of malware samples. Zhang et al. (Zhang et al., 2019) proposed a malware detector using online learning techniques that is capable of adapting to rapidly evolving malware. Specifically, they combined n-gram analysis and online classifier techniques in the detection. The recent application of deep learning methods to cyber security threat detection, such as CNNs (Amerini et al., 2019), RNNs (Güera and Delp, 2018), LSTM (Xiao et al., 2019a) or Transformers (Devlin et al., 2018), is a breakthrough in detection rate (i.e., true positive rate). Deep learning methods also save the hand-crafted and time-consuming effort of selecting or transforming the samples’ features through automatic end-to-end learning, a task whose quality previously depended heavily on the experience and domain knowledge of the developers (McLaughlin et al., 2017) (Yan et al., 2018) (Xiao et al., 2019b). However, it is nearly impossible for humans to understand how deep learning models predict the class of a sample through a non-linear computation process involving millions of parameters across layers. The explanation of AI models is seldom considered in the development of machine learning algorithms.
Clearly, AI model explanation is a promising direction for enhancing users’ trust in the output of an AI model that is otherwise generated by a seemingly black-box mechanism. Such explanation is achieved through the quantification of each feature’s “contribution” to the model’s prediction. The popular model-agnostic explainable AI methods, which can explain any AI model’s predictions regardless of the model’s type (such as SVM, CNNs or LSTM), may not work well for cyber security problems. LIME (Ribeiro et al., 2016) builds a surrogate linear model of the original model to be explained, where the contribution of each feature is computed through optimization (Efron et al., 2004). The authors assumed that the linear model can be understood by humans because of its simplicity, and the data used to train the linear model is generated by local perturbation of the feature values of the input data sample. The fidelity of the linear-model-based explanation may deteriorate with the high dimensionality of the data that is common in cyber security. Integrated Gradients (IG) (Sundararajan et al., 2017) attributes the features as the model explanation by integrating the gradients of the model’s predictions with respect to input data with different feature values. These feature values are varied from a “baseline” along a linear path, where the baseline refers to the zero-value feature vector or a no-signal sample. The Integrated Gradients method works well for AI models with gradients, such as deep learning models. However, it cannot be used for certain widely used models without gradients, such as Random Forests (Apruzzese et al., 2020). In addition, the baseline is unclear in certain fields, such as the genomics domain (Jha et al., 2020). Therefore, an explainable AI method for the models used in the cyber security field, such as malware detection, is still desired.
In this article, we propose a novel model-agnostic explainable AI method, called PhilaeX, that is capable of quantitatively measuring the features’ “contribution” in a suspicious app sample when its class (i.e., benign or malware) is predicted by a given AI model, regardless of the model’s type. Specifically, the model explanation starts with core feature selection for a given suspicious sample, in which only the features that lead the model’s prediction towards the borderline between the two classes (i.e., around 50% prediction confidence) are selected. Then, in addition to these core features, PhilaeX identifies a set of features from the original data sample, each of which makes a significant contribution to the model’s prediction of the predicted class on the original input sample. This step identifies the features with positive individual contributions to the model’s prediction, without considering the contributions from the cooperation among features. Finally, the feature attribution is obtained by considering both the positive individual contributions and the joint contribution when all these features are used together. The quantitative measure of each feature’s attribution is computed by optimizing a Ridge regression, chosen for its simplicity of optimization and its ability to handle highly correlated features. The main advantages of the proposed explainable AI method are: (1) the identification of the core features provides a fingerprint for further identifying the candidate features with positive contributions to the model’s prediction, in a more efficient and accurate manner than the random perturbation of the sample’s values in feature space used by methods such as LIME; (2) the feature attribution based on the core features and those with positive individual contributions considers both the individual and joint contributions of the features; and (3) the optimization by Ridge regression to quantify the feature attribution is efficient and effective. The results of the quantitative assessment of the proposed explainable AI method show the high fidelity of explanation by PhilaeX for both SVM (Arp et al., 2014) (Li et al., 2015) and BERT (Devlin et al., 2018) classifiers on malware detection tasks. The first experiment aims to identify the “activated features” in adversarial samples of Android malware. This helps cyber security practitioners to analyze how the AI model was evaded by the adversarial samples and to enhance the model’s security accordingly. The results demonstrate that the activated features have a higher chance of being attributed high values by PhilaeX, compared to state-of-the-art methods such as LIME, SHAP (Lundberg and Lee, 2017) and the MPT Explainer (Lu and Thing, 2021). The second experiment tests the explanation fidelity when PhilaeX is used to explain the SVM and Random Forest classifiers on the PDF malware dataset (Smutz and Stavrou, 2012), where the results verify that high fidelity can be obtained with a small number of the features with the top attribution values by PhilaeX.
The rest of the paper is organised as follows: we present a literature review of the state of the art in explainable AI for cyber security in Section 2. The proposed methodology, PhilaeX, is introduced in detail in Section 3. We assess the fidelity of the proposed method with two quantitative experiments in Section 4. Finally, conclusions are drawn in Section 5.
2 Literature Review
The main aim of explainable AI is to provide a human-understandable explanation of how an AI model predicts the class label of a given sample. One major line of research in explainable AI focuses on the model’s interpretability (Došilović et al., 2018), where the model’s prediction can be explained by its own prediction process, as in decision trees (Kamiński et al., 2018). However, as machine learning and deep learning methods advance, models become increasingly complicated, their computation is not visible to the users, and interpretability is difficult to achieve (Molnar, 2019).
Post-hoc explainable AI methods, which obtain the model’s explanation by analyzing the model’s input and output in a qualitative or quantitative way, therefore attract the major research interest. Early research on post-hoc explanation methods focused on model-specific explainable AI methods, which can only explain a targeted type of AI model. Zeiler et al. (Zeiler and Fergus, 2014) proposed a qualitative explanation method based on the visualization and observation of the neurons in a convolutional neural network (CNN), showing how each neuron responds to different data instances. In (Xu et al., 2015), Xu et al. developed a caption generator model that summarizes the content of an image in one sentence, where the attention mechanism in the deep neural network highlights the sensitive part of the image and its corresponding words in the caption. DREBIN (Arp et al., 2014) provided a limited explanation of the prediction of an SVM-based Android malware detector. However, their explanation method cannot be extended to other AI models, since the quantification of feature attribution comes from the weights of the SVM model. Thus, model-specific explainable AI methods inherently lack the ability to extend to new types of AI models.
As machine learning techniques develop rapidly, explainable AI methods that can explain different types of AI models are highly desired. This property is also referred to as being model-agnostic. Samek et al. (Samek et al., 2017) first proposed explanation methods using layer-wise relevance propagation (LRP) to analyze the sensitivity of a deep learning model’s prediction with respect to the input sample in feature space. Their work forms a foundation for model-agnostic explainable AI methods, where the model explanation is obtained by “observing” the relations between the input and the model’s output.
As the model structure becomes too complicated for humans to access, direct observation of the model’s input and output also becomes a time-consuming and inaccurate way to obtain an explanation. An alternative way to obtain the model explanation is therefore to explain a surrogate model, which simulates the behavior of the original model to be explained and is usually simple enough for humans to understand. LIME (Ribeiro et al., 2016) explains any type of classifier by learning a linear surrogate model that mimics the target model’s behavior. The data to train the linear model are generated through perturbation of the original input data sample around the model’s prediction (i.e., local perturbation). However, the linearity of the surrogate model and the random local perturbation strategy limit the explanation capability of LIME, especially when it explains complicated models such as CNNs. Our PhilaeX provides a high fidelity explanation for complicated models through a multi-stage selection strategy for high-contribution features, which overcomes the limitations of local random perturbation in the sample’s feature space, such as unstable explanations. Wu et al. (Wu et al., 2018) used a decision tree, which is a self-explaining model, as the surrogate model in the explanation of deep learning models. Recently, LEMNA (Guo et al., 2018) was proposed to explain AI models specifically designed for cyber security problems. LEMNA uses the fused lasso (Tibshirani et al., 2005) algorithm and a mixture regression model (Khalili and Chen, 2007) to force the explanation to consider the dependencies among features, which addresses the issue that the linear approximation in LIME ignores such dependencies.
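To make the surrogate-model idea discussed above concrete, the following toy snippet shows how a local linear explanation is typically obtained with the publicly available lime package. The detector, data and feature names here are synthetic placeholders invented for illustration; only the LimeTabularExplainer API itself comes from the lime library, and nothing in this snippet is part of PhilaeX.

```python
# Illustrative only: a LIME tabular explanation of a toy malware-style detector.
import numpy as np
from sklearn.linear_model import LogisticRegression
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
feature_names = [f"permission_{i}" for i in range(20)]      # toy permission features
X_train = rng.integers(0, 2, size=(200, 20)).astype(float)  # toy app/feature matrix
y_train = (X_train[:, 0] + X_train[:, 3] > 1).astype(int)   # toy benign/malware labels
model = LogisticRegression().fit(X_train, y_train)          # stand-in for the detector

explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["benign", "malware"],
    discretize_continuous=False,
)

# Fit a local linear surrogate on randomly perturbed copies of one sample.
exp = explainer.explain_instance(X_train[0], model.predict_proba, num_features=5)
print(exp.as_list())                                         # (feature, weight) pairs
```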
3 PhilaeX: Explaining Model’s Predictions
In this section, we first formulate the model explanation problem mathematically as a feature attribution process. We then introduce the algorithms to identify the core features and the features with positive individual contributions. Finally, we present the optimization process that obtains the feature attributions by considering both the features’ individual contributions and their joint contributions towards the model’s prediction on the input sample.
3.1 Problem Statement
Given a classifier $f(\mathbf{x})\rightarrow[0,1]^{|C|}$ to be explained, which predicts the probabilities of $|C|$ class labels for the input data sample $\mathbf{x}=(x_{1},x_{2},...,x_{m})\in R^{m}$ (in feature space), the data sample $\mathbf{x}\in R^{m}$ consists of $m$ features. For example, a suspicious Android app can be represented in feature space by the TF-IDF (Rajaraman and Ullman, 2011) values of its permissions (Arp et al., 2014). The aim of the model explanation is to find the optimized feature attribution vector $\mathbf{A}=(a_{1},a_{2},...,a_{m})\in R^{m}$ that quantitatively measures how the model to be explained, $f(\mathbf{x})$, makes the prediction of the input sample’s class label according to each feature’s contribution. That is, the optimization can be formally represented by:
\mathbf{A} = \arg\min_{\mathbf{x}^{*}} \big( g(h(\mathbf{x}), w) - f(\mathbf{x}) \big)
where $g(\cdot)\in{\mathcal{G}}$ is the surrogate model of the original classifier $f(\cdot)$, which aims to mimic the predictions $f(\mathbf{x})$ for the same sample $\mathbf{x}$, and the weights $w$ measure the joint contributions of the features in this surrogate model. The selection function $h(\mathbf{x})$ returns the optimized feature set $\mathbf{x}^{*}$ that makes significant contributions to the model’s prediction, given sample $\mathbf{x}$. The attribution vector $\mathbf{A}$ is obtained only when the difference between the surrogate model $g(\mathbf{x})$’s prediction and that of the original model $f(\mathbf{x})$ to be explained is minimized. Therefore, the choice of the surrogate model and of the features $\mathbf{x}^{*}$ whose attributions are computed is critical to the explanation fidelity with respect to the model’s prediction behavior on the sample $\mathbf{x}$.
In the remainder of this section, we introduce our proposed model explainer, PhilaeX, which constructs the feature attribution vector for the input sample $\mathbf{x}=(x_{1},x_{2},...,x_{m})\in R^{m}$ starting from an empty vector (i.e., Null). The whole construction process consists of two major stages: (1) the feature selection strategy, i.e., $h(\mathbf{x})\in R^{n}, n\leq m$, which picks the features with significant contributions towards the model’s prediction on $\mathbf{x}$; and (2) the quantification of the contribution of the selected features through a Ridge regression that serves as the surrogate model of the original model $f(\mathbf{x})$.
3.2 Core Features
Perturbing the feature values to obtain synthetic input data samples $\mathbf{x^{'}}$ for training the surrogate model $g(\mathbf{x^{'}})$ may not work well in the cyber security field. In LIME (Ribeiro et al., 2016), the response of the model to changes of the input variables is obtained by random perturbation of the input sample’s feature values within a small range. This allows fast preparation of a large amount of synthetic data to train the surrogate linear model and helps the explainer to attribute the model’s behavior accordingly. However, it can also lead to a lack of stable explanations, in that the feature attribution values may vary slightly across different runs of the explanation for the same input sample. In addition, perturbing the feature magnitudes to generate the synthetic training data may not be appropriate for cyber security data. For example, a common way to camouflage malware so that it evades an AI detector is to “add” certain types of permissions to the app, a discrete change that cannot be captured by small perturbations of the feature values.
In PhilaeX, the feature selection function $h(\mathbf{x})\in R^{n}, n\leq m$ picks the subset of features from the input sample $\mathbf{x}$ that is optimized to describe (i.e., explain) the model’s prediction behavior. Specifically, there are two steps to obtain the candidate features for attribution: the core features and the features with positive individual contributions. The first step is to identify a set of core features $\mathbf{x}_{c}=\{x_{i}\in\mathbf{x}\}$ from the original sample $\mathbf{x}$, which form the base of the sample $\mathbf{x}$ and lead the model towards $f(\mathbf{x}_{c})\rightarrow 0.5$ (i.e., the borderline of the prediction). We assume that the model $f(\mathbf{x}_{c})$ makes a “hesitant” decision for the sample with such core features only, where the model has around 50% confidence in its prediction of the sample’s class, and that the actual prediction on the original input sample, $f(\mathbf{x})$, is made by the joint contribution of the core features and part of the remaining features.
Algorithm 1: Core Features Selection
Input: the input sample $\mathbf{x}$; the model to be explained $f(\cdot)$; MAX_LEN_CORE_FEATURES, the maximum number of core features
Output: the core features $\mathbf{x}_{s}$
1: min_prediction_score_gap $\leftarrow$ 1
2: $\mathbf{x}_{s} \leftarrow$ Null
3: while $\|\mathbf{x}_{s}\| \leq$ MAX_LEN_CORE_FEATURES do
4:   pick $x_{i} \in \mathbf{x}$
5:   if $\|f(\mathbf{x}_{s}+x_{i})-0.5\| <$ min_prediction_score_gap then
6:     $\mathbf{x}_{s} \leftarrow \mathbf{x}_{s} + x_{i}$
7: end
8: return the selected core features $\mathbf{x}_{s}$
To identify the core features of a given sample $\mathbf{x}$, we start from an empty feature vector that contains no features. The following steps find the candidate core features iteratively, where the target is to find the subset of features that leads the model $f(\mathbf{x}_{c})$ as close as possible to $0.5$ (i.e., a local minimum of $|f(\mathbf{x}_{c})-0.5|$). The detailed core feature identification procedure is given in Algorithm 1.
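A minimal Python sketch of one reasonable reading of Algorithm 1 is given below. It assumes the model to be explained is wrapped in a scalar function predict_proba_positive(x) that returns the probability of the positive (malware) class, and it interprets the while loop greedily; both the wrapper name and the greedy interpretation are illustrative assumptions, not code from the paper.

```python
# A sketch of core feature selection (Algorithm 1), assuming a binary classifier
# wrapped as predict_proba_positive(x) -> probability of the positive class.
import numpy as np

MAX_LEN_CORE_FEATURES = 10  # assumed cap on the number of core features

def select_core_features(x, predict_proba_positive, max_len=MAX_LEN_CORE_FEATURES):
    """Greedily pick features of x that drive the prediction towards 0.5."""
    x = np.asarray(x, dtype=float)
    core = np.zeros_like(x)        # x_c starts as an empty (all-zero) feature vector
    selected = []                  # indices of the chosen core features
    min_gap = 1.0                  # current distance |f(x_c) - 0.5|
    candidates = [i for i in range(len(x)) if x[i] != 0]

    while len(selected) < max_len:
        best_i, best_gap = None, min_gap
        for i in candidates:
            if i in selected:
                continue
            trial = core.copy()
            trial[i] = x[i]        # add feature i with its original value
            gap = abs(predict_proba_positive(trial) - 0.5)
            if gap < best_gap:     # does adding i move f(x_c) closer to the borderline?
                best_i, best_gap = i, gap
        if best_i is None:         # no remaining feature reduces the gap
            break
        core[best_i] = x[best_i]
        selected.append(best_i)
        min_gap = best_gap
    return core, selected
```

The returned vector plays the role of $\mathbf{x}_{c}$, and the index list is reused by the subsequent selection and attribution stages.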
3.3 Features Individual Contributions
Once the core features $\mathbf{x}_{c}$ are obtained, we look for the features that can increase the prediction confidence of the model towards the prediction score on the original sample $\mathbf{x}$. Formally, we define the acquisition of such features with positive individual contributions, i.e., $\mathbf{x}_{p}$, as:
\arg\min_{\mathbf{x}_{p}\subset\mathbf{x}\backslash\mathbf{x}_{c}} \big( f(\mathbf{x}_{c}+\mathbf{x}_{p}) - f(\mathbf{x}) \big)
where the symbol “$+$” denotes the concatenation of the two feature vectors $\mathbf{x}_{c}$ and $\mathbf{x}_{p}$. The candidate feature set is initialized as $\mathbf{x}_{p}=\emptyset$. For every feature $x_{i}\in\mathbf{x}\backslash\mathbf{x}_{c}$ that is added into $\mathbf{x}_{p}$, the model’s prediction $f(\mathbf{x}_{c}+\mathbf{x}_{p})$ moves towards $f(\mathbf{x})$.
The aim is to identify the features in the input sample $\mathbf{x}$ that significantly enhance the model’s confidence when it outputs the prediction for the input sample. Accordingly, the features that push the model towards the opposite of its prediction on the sample $\mathbf{x}$ are ignored.
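The sketch below illustrates this selection step under the same assumptions as the previous snippet (a scalar predict_proba_positive wrapper over the model). The acceptance test, keeping a feature only if adding it moves $f(\mathbf{x}_{c}+x_{i})$ closer to $f(\mathbf{x})$, is one straightforward reading of the criterion above rather than the paper’s exact implementation.

```python
# A sketch of selecting features with positive individual contributions (Section 3.3).
import numpy as np

def select_positive_features(x, core, core_indices, predict_proba_positive):
    """Keep features whose individual addition to x_c pushes f towards f(x)."""
    x = np.asarray(x, dtype=float)
    target = predict_proba_positive(x)       # f(x) on the full original sample
    base = predict_proba_positive(core)      # f(x_c) on the core features only
    positive = []
    for i in range(len(x)):
        if x[i] == 0 or i in core_indices:
            continue
        trial = core.copy()
        trial[i] = x[i]                       # add this single remaining feature
        score = predict_proba_positive(trial)
        if abs(score - target) < abs(base - target):
            positive.append(i)                # x_i individually moves f(x_c + x_i) towards f(x)
    return positive
```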
3.4 Quantify Joint Contribution by Features
The features selected in the previous steps, i.e., the core features $\mathbf{x}_{c}$ and the features with positive individual contributions $\mathbf{x}_{p}$, form the set of candidate features to which feature attribution by PhilaeX is applied. There are two reasons why we only attribute this subset of the features of the input sample $\mathbf{x}$: (1) attributing the features $\mathbf{x}_{s}=\mathbf{x}_{c}+\mathbf{x}_{p}$ allows the explainer to reveal the major reasons why the model made its prediction on the original sample $\mathbf{x}$; as discussed in Section 3.3, not all features in the sample $\mathbf{x}$ make significant positive contributions towards the model’s prediction of the sample’s class label. (2) Explaining this subset of features is more efficient than explaining all the features of the input sample $\mathbf{x}$.
The joint contributions made by the cooperation among these features are necessary to form the complete quantitative explanation (i.e., feature attribution) of the model $f(\mathbf{x})$, and they have not yet been considered by the previous two steps in Section 3.2 and Section 3.3. In this step, we quantify each feature’s contribution to the model’s prediction by training a Ridge regression model $g(\cdot)$ as the surrogate model of the original model $f(\cdot)$, where the weight of each feature in the regression model is taken as its attribution. We use Ridge regression as the surrogate model because of its simplicity, its efficiency and its suitability for estimating coefficients (i.e., weights) when independent variables are highly correlated (Hilt and Seegrist, 1977).
Specifically, the weights $\mathbf{w}\in R^{|\mathbf{x}_{s}|}$ of the Ridge regression can be estimated by optimizing the following equation:
\arg\min_{\mathbf{w}} \big( ||\boldsymbol{y}-\mathbf{X}\mathbf{w}||_{2}^{2} + \alpha \, ||\mathbf{w}||_{2}^{2} \big)
where the L2 regularization reduces the sensitivity to any single feature and accordingly decreases the possibility of overfitting during the training of the surrogate model.
Finally, the feature attribution vector is defined as $\mathbf{A}=\mathbf{w}$, which considers both the individual contribution of each feature and the joint contributions arising from the cooperation among the features in $\mathbf{x}_{s}$.
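As a concrete illustration, the sketch below fits such a Ridge surrogate with scikit-learn. The synthetic training set is built by switching random subsets of the selected features $\mathbf{x}_{s}$ on and off and querying the model on each masked sample; this sampling scheme, the alpha value and the helper names are illustrative assumptions rather than details fixed by the paper.

```python
# A sketch of the Ridge-regression surrogate used to quantify joint contributions.
import numpy as np
from sklearn.linear_model import Ridge

def attribute_features(x, selected, predict_proba_positive,
                       n_samples=500, alpha=1.0, seed=0):
    """Return {feature index: attribution} from the weights of a Ridge surrogate."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    Z = rng.integers(0, 2, size=(n_samples, len(selected)))  # which selected features are kept
    y = np.empty(n_samples)
    for k, z in enumerate(Z):
        masked = np.zeros_like(x)
        for j, i in enumerate(selected):
            if z[j]:
                masked[i] = x[i]               # keep feature i with its original value
        y[k] = predict_proba_positive(masked)  # label = f on the masked sample
    surrogate = Ridge(alpha=alpha).fit(Z, y)   # g(.) mimicking f(.) on x_s
    return dict(zip(selected, surrogate.coef_))  # weights w as the feature attributions
```

Chaining the three sketches (select_core_features, select_positive_features and attribute_features) gives an end-to-end toy version of the attribution pipeline described in this section.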
4 Experiments
In this section, we assess the explanation capability of PhilaeX through two quantitative experiments. The proposed explainer is used to explain the prediction behaviors of three classical classifiers, namely SVM, Random Forest and BERT, which cover AI models from both the shallow (classical) machine learning and the deep learning fields. Two datasets are used in our experiments: the DREBIN (Arp et al., 2014) dataset for the Android malware detection task and the PDF malware dataset (Smutz and Stavrou, 2012).