PhilaeX: Explaining the Failure and Success of AI Models in Malware Detection
Explaining an AI model’s prediction is of critical importance when the prediction is used to support decision making in cyber security. This is especially so when an incorrect prediction can lead to severe damage, or even the loss of lives and critical assets. However, most existing AI models cannot explain their prediction results, despite their strong performance in most scenarios. In this work, we propose a novel explainable AI method, called PhilaeX, that provides a heuristic means of identifying the optimized subset of features that forms a complete explanation of an AI model’s prediction. It first identifies the features that lead to the model’s borderline prediction, and then extracts the features with positive individual contributions. The feature attributions are then quantified through the optimization of a Ridge regression model. We verify the explanation fidelity through two experiments. First, we assess our method’s capability to correctly identify the activated features in adversarial samples of Android malware, using the feature attribution values from PhilaeX. Second, deduction and augmentation tests are used to assess the fidelity of the explanations. The results show that PhilaeX explains different types of classifiers correctly, with higher-fidelity explanations than state-of-the-art methods such as LIME and SHAP.
1 Introduction
Explaining the prediction of an AI model is critical for AI-based solutions to modern cyber threats, which are characterized by large volume and high complexity. Threat detection solutions based on learnable AI technologies, namely the so-called shallow machine learning methods and the more recent deep learning methods, have demonstrated astonishing performance. However, high detection performance is insufficient to establish the users’ trust, since most models predict the label of a suspicious sample (e.g., a malware sample, or a face image that may have been manipulated for deception or obfuscation) through a complicated computation process that people cannot understand. This confidence crisis may become more severe when the AI model makes an erroneous prediction that causes damage or loss to the user’s property, assets or even safety. Therefore, research on explainable AI, which quantitatively explains an AI model’s successful or failed prediction for a particular input sample by attributing the prediction to the contribution of each data feature, is highly desired (Došilović et al., 2018).
Malware detection research has made progress over the years. Demontis et al. (Demontis et al., 2017) improved the standard SVM for Android malware detection through an optimized selection method on the model’s parameters, which further reduces the chance of evasion by certain types of malware samples. Zhang et al. (Zhang et al., 2019) proposed a malware detector based on online learning that is capable of adapting to rapidly evolving malware. Specifically, they combined n-gram analysis and online classifier techniques in the detection. The recent application of deep learning methods to cyber security threat detection, such as CNNs (Amerini et al., 2019), RNNs (Güera and Delp, 2018), LSTM (Xiao et al., 2019a) or Transformers (Devlin et al., 2018), is a breakthrough in the detection rate (i.e., true positive rate). Through automatic end-to-end learning, deep learning methods also save the time-consuming manual effort of selecting or transforming the samples’ features, whose outcome previously depended heavily on the experience and domain knowledge of the developers (McLaughlin et al., 2017) (Yan et al., 2018) (Xiao et al., 2019b). However, it is nearly impossible for humans to understand how deep learning models predict the class of a sample through a non-linear computation process with millions of parameters across layers. Research on explaining AI models is seldom considered in the development of machine learning algorithms.
Clearly, AI model explanation is a positive direction for enhancing the users’ trust in an AI model’s output, which is otherwise generated by a seemingly black-box mechanism. Such an explanation is achieved by quantifying the “contribution” of each feature to the model’s prediction. The popular model-agnostic explainable AI methods, which can explain any AI model’s predictions regardless of the model’s type (such as SVM, CNNs or LSTM), may not work well for cyber security problems. LIME (Ribeiro et al., 2016) builds a surrogate linear model of the original model to be explained, where the contribution of each feature is computed through optimization (Efron et al., 2004). Its authors assumed that the linear model can be understood by humans because of its simplicity, and the data used to train the linear model is generated by local perturbation of the feature values in the input data sample. The fidelity of an explanation based on a linear model may deteriorate with the high dimensionality of the data that is common in cyber security. Integrated Gradients (IGs) (Sundararajan et al., 2017) attributes the features by integrating the gradients of the model’s prediction with respect to the input data over different feature values. These feature values vary from the “baseline” to the input along a linear path, where the baseline refers to the zero-valued feature vector or a no-signal sample. The Integrated Gradients method works well for AI models with gradients, such as deep learning models. However, it cannot be used for certain widely used models without gradients, such as Random Forests (Apruzzese et al., 2020). In addition, the baseline is unclear in certain fields, such as genomics (Jha et al., 2020). Therefore, an explainable AI method for the models used in the cyber security field, such as malware detection, is still desired.
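To make the description of Integrated Gradients more concrete, the following minimal sketch approximates the path integral with a midpoint Riemann sum. The toy logistic model, its weights, the zero baseline and all function names are illustrative assumptions; this is not the implementation of the cited work.

```python
import math

def logistic_model(x, weights, bias):
    """Toy differentiable classifier (assumed for illustration): sigmoid of a linear score."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def gradient(x, weights, bias):
    """Analytic gradient of the sigmoid output with respect to each input feature."""
    p = logistic_model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]

def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Midpoint Riemann-sum approximation along the straight line baseline -> x."""
    m = len(x)
    avg_grad = [0.0] * m
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = gradient(point, weights, bias)
        for i in range(m):
            avg_grad[i] += g[i] / steps
    # Attribution of feature i: (x_i - baseline_i) times the average gradient on the path.
    return [(xi - b) * ag for xi, b, ag in zip(x, baseline, avg_grad)]

weights, bias = [2.0, -1.0, 0.5], -0.2      # hypothetical model parameters
x = [1.0, 1.0, 0.0]                         # hypothetical input sample
baseline = [0.0, 0.0, 0.0]                  # zero-valued baseline
attributions = integrated_gradients(x, baseline, weights, bias)
# The attributions should sum approximately to f(x) - f(baseline).
gap = logistic_model(x, weights, bias) - logistic_model(baseline, weights, bias)
print(attributions, sum(attributions), gap)
```

The sketch relies on an analytic gradient, which illustrates why the method does not transfer directly to models without gradients, such as Random Forests.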
In this article, we propose a novel model-agnostic explainable AI methodology, called PhilaeX, that quantitatively measures the features’ “contribution” in a suspicious app sample when its class (i.e., benign or malware) is predicted by a given AI model, regardless of the model’s type. Specifically, the model explanation starts with the selection of core features for a given suspicious sample: only the features in the sample that lead the model’s prediction towards the borderline between the two classes (i.e., a prediction confidence of around $50\%$) are selected. Then, in addition to these core features, PhilaeX identifies a set of features from the original data sample, each of which makes a significant contribution to the model’s prediction towards the class predicted for the original input sample. This step identifies the features with positive individual contributions to the model’s prediction, without considering the contributions from the cooperation among features. Finally, the feature attribution is obtained by considering both the positive individual contributions and the joint contribution when all these features are used together. The quantitative measure of each feature’s attribution is computed by optimizing a Ridge regression, because it is simple to optimize and because its optimization accounts for highly correlated features.
The main advantages of the proposed explainable AI method include: (1) the identification of the core features provides a fingerprint for further identifying the candidate features with positive contributions to the model’s prediction, in a more efficient and accurate manner than random perturbation of the sample’s values in the feature space, as in LIME; (2) the feature attribution based on the core features and those with positive individual contributions considers both the individual and the joint contributions of the features; and (3) the optimization by Ridge regression to quantify the feature attribution is efficient and effective. The quantitative assessment of the proposed method shows that PhilaeX provides high-fidelity explanations on malware detection tasks for both SVM (Arp et al., 2014) (Li et al., 2015) and BERT (Devlin et al., 2018) classifiers. The first experiment aims to identify the “activated features” in adversarial samples of Android malware. This helps cyber security practitioners analyze how the AI model was evaded by the adversarial samples and enhance the model’s security accordingly. The results demonstrate that the activated features have a higher chance of being attributed high values by PhilaeX than by state-of-the-art methods such as LIME, SHAP (Lundberg and Lee, 2017) and MPT Explainer (Lu and Thing, 2021). The second experiment tests the explanation fidelity when PhilaeX is used to explain the SVM and Random Forest classifiers on the PDF malware dataset (Smutz and Stavrou, 2012); the results verify that high fidelity can be obtained with a small number of the features that have the top attribution values from PhilaeX.
The rest of the paper is organised as follows. We review the state of the art in explainable AI for cyber security in Section 2. The proposed methodology, PhilaeX, is introduced in detail in Section 3. We assess the fidelity of the proposed method through two quantitative experiments in Section 4. Finally, we conclude in Section 5.
2 Literature Review
The main aim of explainable AI is to provide a human-understandable explanation of how an AI model predicts the class label of a given sample. One major line of research in explainable AI focuses on the model’s interpretability (Došilović et al., 2018), where the model’s prediction can be explained by its own prediction process, as in decision trees (Kamiński et al., 2018). However, as machine learning and deep learning methods advance, models become so complicated that the computation is not visible to the users, and it is difficult to achieve model interpretability (Molnar, 2019).
Post-hoc explainable AI methods, which obtain the model’s explanation by analyzing the model’s input and output in a qualitative or quantitative way, have therefore attracted major research interest. Early research on post-hoc explanation focused on model-specific explainable AI methods, which can only explain the targeted type of AI model. Zeiler and Fergus (Zeiler and Fergus, 2014) proposed a qualitative explanation method based on visualizing and observing the neurons in a convolutional neural network (CNN), which shows how each neuron responds to different data instances. Xu et al. (Xu et al., 2015) developed a caption generator model that summarizes the content of an image in one sentence, where the attention mechanism in the deep neural network highlights the sensitive part of the image and its corresponding words in the caption. DREBIN (Arp et al., 2014) provided a limited explanation of the prediction of an Android malware detector based on an SVM classifier. However, this explanation method cannot be extended to other AI models, since the quantification of the feature attribution comes from the weights of the SVM model. Thus, model-specific explainable AI methods inherently lack the ability to extend to new types of AI models.
As machine learning techniques develop rapidly, explainable AI methods that can explain different types of AI models are highly desired. This property is referred to as being model-agnostic. Samek et al. (Samek et al., 2017) first proposed explanation methods using layer-wise relevance propagation (LRP) to analyze the sensitivity of a deep learning model’s prediction with respect to the input sample in the feature space. Their work forms a foundation for model-agnostic explainable AI methods, where the model explanation is obtained by “observing” the relations between the model’s input and output.
As the model structure becomes too complicated for humans to access, direct observation of the model’s input and output also becomes a time-consuming and inaccurate way to obtain the explanation. An alternative is therefore to explain a surrogate model, which simulates the behavior of the original model to be explained and is usually simple enough for humans to understand. LIME (Ribeiro et al., 2016) was proposed to explain any type of classifier by learning a linear surrogate model that mimics the target model’s behavior. The data to train this linear model is generated by perturbing the original input data sample around the model’s prediction (i.e., local perturbation). However, the linearity of the surrogate model and the random perturbation strategy in the local region limit the explanation capability of LIME, especially when it explains complicated models such as CNNs. Our PhilaeX provides high-fidelity explanations for complicated models through a multi-stage selection strategy for high-contribution features. This addresses the limitations of local random perturbation in the sample’s feature space, such as unstable explanations. Wu et al. (Wu et al., 2018) used a decision tree, which is a self-explanatory model, as the surrogate model for explaining deep learning models. More recently, LEMNA (Guo et al., 2018) was proposed to explain AI models that are specifically designed for cyber security problems. LEMNA uses the fused lasso (Tibshirani et al., 2005) and a mixture regression model (Khalili and Chen, 2007) to force the explanation to consider the dependencies among features, which addresses the issue that the linear approximation in LIME ignores these dependencies.
3 PhilaeX: Explaining Model’s Predictions
In this section, we first formulate the model explanation problem mathematically as a feature attribution process. We then introduce the algorithms that identify the core features and the features with positive individual contributions. Finally, we present the optimization process that obtains the feature attribution by considering both the features’ individual contributions and their joint contributions towards the model’s prediction on the input sample.
3.1 Problem Statement
Given a classifier $f(\mathbf{x})\rightarrow[0,1]^{|C|}$ to be explained, which predicts the probabilities of $|C|$ class labels for the input data sample $\mathbf{x}=(x_{1},x_{2},...,x_{m})\in R^{m}$ in the feature space, the data sample $\mathbf{x}$ consists of $m$ features. For example, a suspicious Android app can be represented in the feature space by the TF-IDF (Rajaraman and Ullman, 2011) values of its permissions (Arp et al., 2014). The aim of the model explanation is to find the optimized feature attribution vector $\mathbf{A}=(a_{1},a_{2},...,a_{m})\in R^{m}$ that quantitatively measures how the model to be explained, $f(\mathbf{x})$, predicts the input sample’s class label according to each feature’s contribution. That is, the optimization can be formally represented by:
$$\mathbf{A}=\operatorname*{arg\,min}_{\mathbf{x}^{*}}\big(g(h(\mathbf{x}),w)-f(\mathbf{x})\big)$$
where $g(\cdot)\in{\mathcal{G}}$ is the surrogate model of the original classifier $f(\cdot)$, which aims to mimic the prediction $f(\mathbf{x})$ for the same sample $\mathbf{x}$, and the weights $w$ measure the joint contributions of the features in this surrogate model. Given the sample $\mathbf{x}$, the selection function $h(\mathbf{x})$ returns the optimized feature set $\mathbf{x}^{*}$ that makes significant contributions to the model’s prediction. The attribution vector $\mathbf{A}$ is obtained when the difference between the prediction of the surrogate model $g(\cdot)$ and that of the original model $f(\mathbf{x})$ is minimized. Therefore, the choice of the surrogate model and of the features $\mathbf{x}^{*}$ whose attributions are computed is critical to the fidelity of the explanation of the model’s prediction behavior on the sample $\mathbf{x}$.
In the remainder of this section, we introduce our proposed model explainer, PhilaeX, which starts the construction of the feature attribution vector for the input sample $\mathbf{x}=(x_{1},x_{2},...,x_{m})\in R^{m}$ from an empty vector (i.e., Null). The construction process consists of two major stages: (1) the feature selection strategy, i.e., $h(\mathbf{x})\in R^{n},n\leq m$, which picks the features with significant contributions towards the model’s prediction on $\mathbf{x}$; and (2) the quantification of the contribution of the selected features through a Ridge regression that serves as the surrogate model of the original model $f(\mathbf{x})$.
3.2 Core Features
Perturbing the feature values to obtain the synthetic input data samples $\mathbf{x}^{'}$ for training the surrogate model $g(\mathbf{x}^{'})$ may not work well in the cyber security field. In LIME (Ribeiro et al., 2016), the response of the model to changes in the input variables is obtained by randomly perturbing the input sample’s feature values within a small range. This allows fast preparation of a large amount of synthetic data to train the surrogate linear model, and helps the explainer attribute the model’s behavior accordingly. However, it can also make the explanation unstable: given the same input sample, the feature attribution values may vary slightly between runs of the explainer. In addition, a perturbation strategy that acts on the feature magnitudes may not reflect how samples are modified in practice. For example, the usual way to camouflage malware to evade an AI detector is to “add” certain types of permissions to the app, in which case a small perturbation of the feature values is not possible.
In PhilaeX, the feature selection function $h(\mathbf{x})\in R^{n},n\leq m$ picks the subset of features from the input sample $\mathbf{x}$ that is optimized to describe (i.e., explain) the model’s prediction behavior. Specifically, the candidate features for attribution are obtained in two steps, which yield the core features and the features with positive individual contributions, respectively. The first step identifies a set of core features $\mathbf{x}_{c}=\{x_{i}\in\mathbf{x}\}$ from the original sample $\mathbf{x}$. These form the base of the sample $\mathbf{x}$ and lead the model to $f(\mathbf{x}_{c})\rightarrow0.5$ (i.e., the borderline of the prediction). We assume that the model makes a “hesitant” decision $f(\mathbf{x}_{c})$ for a sample containing only such core features, with around $50\%$ confidence in its prediction of the sample’s class, and that the actual prediction on the original input sample, $f(\mathbf{x})$, results from the joint contribution of the core features and part of the remaining features.
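| Algorithm 1: Core features selection |
| --- |
| Input: input sample $\mathbf{x}$; model to be explained $f(\cdot)$; MAXLEN_CORE_FEATURES, the maximum number of core features |
| Output: core features $\mathbf{x}_{c}$ |
| min_prediction_score_gap = 1 |
| $\mathbf{x}_{c}$ = Null |
| while picking $x_{i}\in\mathbf{x}$ and $\lVert\mathbf{x}_{c}\rVert\leq$ MAXLEN_CORE_FEATURES do |
| if $\lVert f(\mathbf{x}_{c}+x_{i})-0.5\rVert<$ min_prediction_score_gap then |
| $\mathbf{x}_{c}\leftarrow\mathbf{x}_{c}+x_{i}$ |
| end |
| return the selected core features $\mathbf{x}_{c}$ |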
For the sample $\mathbf{x}$, we start from an empty feature vector that contains no features. The following steps find the candidate core features recursively, where the target is the subset of features that brings the model’s prediction $f(\mathbf{x}_{c})$ as close to $0.5$ as possible (i.e., a local minimum of $|f(\mathbf{x}_{c})-0.5|$). The detailed algorithm for core feature identification is given in Algorithm 1.
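One possible reading of Algorithm 1 can be sketched in Python as follows. Several details are assumptions made for illustration rather than the authors’ implementation: the toy model, the zero-masking of unselected features, the choice of the best candidate in each round, and the update of the running minimum gap after each accepted feature.

```python
import math

def toy_model(x):
    """Hypothetical malware-probability model: sigmoid of a fixed linear score."""
    weights = [2.0, -1.6, 0.7, 0.5, -3.0]
    bias = -1.0
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def masked(x, selected):
    """Keep only the selected feature indices; all others are zeroed (assumed 'absent')."""
    return [xi if i in selected else 0.0 for i, xi in enumerate(x)]

def select_core_features(x, f, max_len_core_features):
    """Greedily add features while the prediction moves closer to 0.5."""
    core = []          # core feature indices, initially empty (Null)
    min_gap = 1.0      # min_prediction_score_gap
    candidates = [i for i, xi in enumerate(x) if xi != 0.0]
    # Strict '<' keeps the result within the maximum length (an assumption).
    while len(core) < max_len_core_features:
        best_i, best_gap = None, min_gap
        for i in candidates:
            if i in core:
                continue
            gap = abs(f(masked(x, core + [i])) - 0.5)
            if gap < best_gap:
                best_i, best_gap = i, gap
        if best_i is None:     # no remaining feature reduces the gap: local minimum
            break
        core.append(best_i)
        min_gap = best_gap     # assumed update of the running minimum
    return core, min_gap

x = [1.0, 1.0, 1.0, 1.0, 1.0]          # hypothetical sample with five active features
core, gap = select_core_features(x, toy_model, max_len_core_features=3)
print(core, round(gap, 4))
```

With this toy model the search stops after two features, because adding any third feature would move the prediction further away from 0.5.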
3.3 Features Individual Contributions
Once the core features $\mathbf{x}_{c}$ are obtained, we look for the features that move the model’s prediction confidence toward the prediction score on the original sample $\mathbf{x}$. Formally, we define the acquisition of such features with positive individual contributions, i.e., $\mathbf{x}_{p}$, as:
$$\operatorname*{arg\,min}_{\mathbf{x}_{p}\subset\mathbf{x}\backslash\mathbf{x}_{c}}\big(f(\mathbf{x}_{c}+\mathbf{x}_{p})-f(\mathbf{x})\big)$$
where the symbol “$+$” denotes the concatenation of two feature vectors, i.e., $\mathbf{x}_{c}$ and $\mathbf{x}_{p}$. The candidate feature set is initialized as $\mathbf{x}_{p}=\emptyset$. Every feature $x_{i}\in\mathbf{x}\backslash\mathbf{x}_{c}$ that is added into $\mathbf{x}_{p}$ moves the model’s prediction toward that on the original sample, i.e., $f(\mathbf{x}_{c}+\mathbf{x}_{p})\rightarrow f(\mathbf{x})$.
The aim is to identify the features in the input sample $\mathbf{x}$ that significantly enhance the model’s confidence in its prediction on the input sample. Accordingly, the features that push the model toward the opposite of its prediction on the sample $\mathbf{x}$ are ignored.
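The selection of features with positive individual contributions can be sketched as below. This is a simplified illustration under stated assumptions: the toy model, the given core indices, the zero-masking of absent features and the `min_gain` threshold are all hypothetical, and each remaining feature is tested on its own on top of the core features.

```python
import math

def toy_model(x):
    """Hypothetical malware-probability model: sigmoid of a fixed linear score."""
    weights = [2.0, -1.6, 0.7, 0.5, -3.0]
    bias = -1.0
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def masked(x, selected):
    """Keep only the selected feature indices; all others are zeroed (assumed 'absent')."""
    return [xi if i in selected else 0.0 for i, xi in enumerate(x)]

def select_positive_features(x, f, core, min_gain=0.0):
    """Return (index, gain) for features that move f(core + feature) toward f(x)."""
    target = f(x)                    # prediction on the original sample
    base = f(masked(x, core))        # borderline prediction on the core features
    direction = 1.0 if target >= base else -1.0
    positive = []
    for i, xi in enumerate(x):
        if xi == 0.0 or i in core:
            continue
        gain = direction * (f(masked(x, core + [i])) - base)
        if gain > min_gain:          # features pushing the opposite way are ignored
            positive.append((i, gain))
    return positive

x = [1.0, 1.0, 1.0, 1.0, 1.0]        # hypothetical sample
core = [2, 3]                        # hypothetical core feature indices
positive = select_positive_features(x, toy_model, core)
print(positive)
```

In this toy setting the original sample is predicted with a low score, so features 1 and 4, which lower the score, are kept, while feature 0, which pushes the prediction the opposite way, is dropped.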
3.4 Quantify Joint Contribution by Features
The features picked in the previous steps, i.e., the core features $\mathbf{x}_{c}$ and the features with positive individual contributions $\mathbf{x}_{p}$, form the set of candidate features to which the feature attribution by PhilaeX is applied. There are two reasons why we attribute only a subset of the features in the input sample $\mathbf{x}$: (1) The attribution of the features $\mathbf{x}_{s}=\mathbf{x}_{c}+\mathbf{x}_{p}$ allows the explainer to reveal the major reason why the model made its prediction on the original sample $\mathbf{x}$. As discussed in Section 3.3, not all features in the sample $\mathbf{x}$ make significant positive contributions towards the model’s prediction of the sample’s class label. (2) Explaining this subset of features is more efficient than explaining all the features of the input sample $\mathbf{x}$.
The joint contributions made by the cooperation among these features are necessary to form the complete quantitative explanation (i.e., feature attribution) for the model $f(\mathbf{x})$, and they have not yet been considered by the previous two steps in Section 3.2 and Section 3.3. In this step, we quantify each feature’s contribution to the model’s prediction by training a Ridge regression model $g(\cdot)$ as the surrogate model of the original model $f(\cdot)$, where the weight of each feature in the regression model is taken as its attribution. We use Ridge regression as the surrogate model for its simplicity, its efficiency and its suitability for estimating the coefficients (i.e., weights) when the independent variables are highly correlated (Hilt and Seegrist, 1977).
Specifically, the weights $\mathbf{w}\in R^{\|\mathbf{x}_{s}\|}$ in the Ridge regression are estimated by optimizing the following objective:
$$\operatorname*{arg\,min}_{\mathbf{w}}\big(\|\mathbf{y}-\mathbf{X}\mathbf{w}\|_{2}^{2}+\alpha\|\mathbf{w}\|_{2}^{2}\big)$$
where the L2 regularization reduces the sensitivity to any single feature and accordingly decreases the possibility of overfitting in the model training.
Finally, the feature attribution vector is defined as $\mathbf A=\mathbf w$, which considers both the individual contribution of each feature and the joint contributions from the cooperation among the features $\mathbf{x}_{s}$.
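The Ridge surrogate step can be illustrated with a small closed-form solver for $(\mathbf{X}^{\top}\mathbf{X}+\alpha\mathbf{I})\mathbf{w}=\mathbf{X}^{\top}\mathbf{y}$. The text does not specify how the training data for the surrogate is generated, so this sketch assumes that every on/off combination of the selected features is evaluated by the model; the toy model, the selected indices, the centering of the data (equivalent to fitting an intercept) and $\alpha=1$ are likewise assumptions.

```python
import math
from itertools import product

def toy_model(x):
    """Hypothetical malware-probability model: sigmoid of a fixed linear score."""
    weights = [2.0, -1.6, 0.7, 0.5, -3.0]
    z = sum(w * xi for w, xi in zip(weights, x)) - 1.0
    return 1.0 / (1.0 + math.exp(-z))

def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system a @ sol = b."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - s) / aug[r][r]
    return sol

def ridge_weights(rows, targets, alpha):
    """Closed-form Ridge on centered data: solve (X^T X + alpha I) w = X^T y."""
    n, d = len(rows), len(rows[0])
    col_mean = [sum(r[j] for r in rows) / n for j in range(d)]
    y_mean = sum(targets) / n
    xc = [[r[j] - col_mean[j] for j in range(d)] for r in rows]
    yc = [t - y_mean for t in targets]
    gram = [[sum(xc[k][i] * xc[k][j] for k in range(n)) + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(xc[k][i] * yc[k] for k in range(n)) for i in range(d)]
    return solve_linear_system(gram, rhs)

x = [1.0] * 5                       # hypothetical sample
selected = [1, 2, 3, 4]             # hypothetical selected feature indices
rows, targets = [], []
for mask in product([0.0, 1.0], repeat=len(selected)):
    sample = [0.0] * len(x)
    for bit, idx in zip(mask, selected):
        sample[idx] = bit * x[idx]
    rows.append(list(mask))         # on/off indicators of the selected features
    targets.append(toy_model(sample))
attribution = dict(zip(selected, ridge_weights(rows, targets, alpha=1.0)))
print(attribution)
```

Because every synthetic sample switches several features on or off together, each fitted weight reflects a feature’s effect averaged over the presence or absence of the others, which is one simple way to account for joint contributions.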
4 Experiments
In this section, we assess the explanation capability of PhilaeX through two quantitative experiments. The proposed explainer is used to explain the prediction behaviors of three classical classifiers, namely SVM, Random Forest and BERT, which cover AI models from both the shallow (classical) machine learning and the deep learning fields. Two datasets are used in our experiments: the DREBIN dataset (Arp et al., 2014) for the Android malware detection task and the PDF malware dataset (Smutz and Stavrou, 2012) for PDF malware detection. The explanation performance is evaluated quantitatively in terms of the explanation fidelity in two tasks: the identification of activated features in adversarial samples of Android malware, and the deduction/augmentation tests on PDF malware samples.
4.1 Dataset
We use two datasets to evaluate the explanation fidelity of PhilaeX. The first dataset, DREBIN (Arp et al., 2014), was originally used to test a lightweight Android malware detector, where the features of the suspicious Android apps were extracted by static analysis from the application’s manifest file AndroidManifest.xml and from the dex code disassembled from the bytecode. The features that DREBIN extracts fall into 8 categories, such as requested permissions, restricted API calls and network addresses. The DREBIN dataset contains 5,560 Android malware apps and 123,453 non-malware apps in total. In our experiments, however, we randomly selected 5,555 malware samples and 5,555 non-malware samples to build a balanced dataset for model training. The disjoint training and testing sets used in the evaluation are built through a random split of these 11,110 samples, which yields a training set of 7,442 samples and a testing set of 3,668 samples. The text features of each sample are converted into a feature vector of floating-point numbers. Specifically, all the features in the training dataset are encoded by the tf-idf algorithm (Rajaraman and Ullman, 2011), which measures the importance of each feature in the dataset. The resulting feature vector is high-dimensional, with 43,157 dimensions.
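To make the encoding step more concrete, the following sketch applies a plain tf-idf weighting (term frequency times $\log(N/\mathrm{df})$) to lists of feature strings. The exact tf-idf variant and the feature string format are not specified in this section, so both are assumptions here, and the feature names are hypothetical examples in the style of DREBIN's categories.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Inverse document frequency of every feature seen in the training set."""
    n_docs = len(train_docs)
    doc_freq = Counter(feat for doc in train_docs for feat in set(doc))
    return {feat: math.log(n_docs / df) for feat, df in doc_freq.items()}


def encode(doc, idf, vocabulary):
    """Turn one app's feature strings into a fixed-length float vector."""
    counts = Counter(doc)
    total = len(doc) or 1
    return [(counts[feat] / total) * idf[feat] for feat in vocabulary]


# Hypothetical feature strings; real DREBIN samples contain many more.
train_docs = [
    ["permission::SEND_SMS", "api_call::getDeviceId", "permission::INTERNET"],
    ["permission::INTERNET", "url::example.com"],
    ["permission::INTERNET", "permission::READ_CONTACTS"],
]
idf = fit_idf(train_docs)
vocabulary = sorted(idf)  # one vector dimension per distinct feature
vectors = [encode(doc, idf, vocabulary) for doc in train_docs]
```

A feature that appears in every training app, such as the INTERNET permission in this toy data, receives a weight of zero, while rarer features receive larger values.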
The second dataset used in the experiments is the PDF malware dataset (Smutz and Stavrou, 2012), which has 4,999 malicious samples and 5,000 benign samples. We use the 135 features suggested by Guo et al. (2018), where the features have been encoded into binary (i.e., 0 or 1) values.
4.2 AI Models to be Explained
We test the explanation capability of PhilaeX on different AI models that cover both shallow (classical) machine learning and the recently emerging deep learning models. First, we trained an SVM model (Arp et al., 2014; Zhao et al., 2011; Li et al., 2015), which is a classical shallow machine learning model that was widely used for binary classification tasks before deep learning methods came to dominate this field. For a given sample in the feature space, SVM maps the relatively low-dimensional data into a high-dimensional space such that the separation between the two classes becomes more apparent, and is thus able to predict the sample’s class more accurately.
Specifically, we trained SVM models with the radial basis function (RBF) kernel (Vert et al., 2004), where the parameter $\gamma$, which defines the inverse of the degree of influence of a single training sample, is set to 1.0. We trained two SVM classifiers, one for the Android malware detection task on the DREBIN dataset and one for the PDF malware detection task on the PDF malware dataset. In the remaining part of this section, we use PhilaeX to explain the prediction behavior of these two classifiers (i.e., AI models).
In addition, we also trained a deep learning model for the Android malware detection task. BERT (Devlin et al., 2018), the transformer-based classifier that was proposed by Google for natural language processing (NLP) tasks in 2018, is used to classify the Android malware in the DREBIN dataset. We use the case-insensitive BERT implementation from the Hugging Face Transformers library (Wolf et al., 2019) with the default parameters, such as the maximum text length (128) and the learning rate (4e-5). The BERT model was trained for 5 epochs with a batch size of 8. For the model explanation, we trained a surrogate SVM model for the BERT Android malware detector, in order to avoid the complicated word embedding mechanism that converts the text tokens to numerical representations. This surrogate SVM has highly similar prediction behavior to BERT given the same input sample, with $\mathrm{TPR}=0.9984$ and $\mathrm{FPR}=0.0029$.
Both the trained SVM and BERT models used in the Android malware detection task perform well. The true positive rate (TPR) for both classifiers is around 0.96, with a false positive rate (FPR) of 0.04. In addition, we trained a separate SVM classifier and a Random Forest classifier for the PDF malware detection task, using the default parameters.
4.3 Explaining Evasion Attack by Adversarial Samples
We first evaluate, in a quantitative way, how well PhilaeX explains the way adversarial samples of Android malware evade the trained malware detector (which has a high TPR and a low FPR on the DREBIN dataset). In the evasion attack, we assume the attacker has full knowledge of the feature space and access to the model’s prediction score. That is, the attacker is able to manipulate the data sample whose class is to be predicted by the SVM or BERT classifier, for example by adding features to the sample.
In this experiment, we only add (i.e., activate) “permission” features in the feature vector of an existing Android malware sample, because such an addition does not change the functionality of the original malware (Liu et al., 2019). Each adversarial sample is generated by a Genetic Algorithm extended from (Liu et al., 2019), which selects an optimised set of “permission” features that helps the original sample bypass the malware detection by the classifier. Specifically, the fitness value in the Genetic Algorithm is defined as the model’s prediction score towards the non-malware class for the candidate adversarial sample. The algorithm stops when any of the following conditions holds: (1) the Genetic Algorithm has run for 500 iterations, which gives the evasion attack by the adversarial samples a high chance of success; (2) the prediction score towards the non-malware class has stayed unchanged at a high level for at least 10 iterations; or (3) the fitness value is larger than 0.99, which implies that the model has extremely high confidence in its incorrect prediction for the adversarial sample. In total, 200 malware samples are randomly selected from the testing set as the seeds to generate the adversarial samples. The adversarial sample dataset used in the explanation for SVM has 499 samples. In the explanation for BERT, a disjoint set of 500 samples is used.
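The search described above can be sketched as a small genetic algorithm that only ever adds permission features to the seed sample. This is a simplified illustration under stated assumptions, not the authors' implementation: the population size, crossover and mutation operators, the `toy_benign_score` detector and all feature names are hypothetical, and stopping criterion (2) is approximated by counting iterations without improvement.

```python
import random


def evade(seed, candidates, benign_score, max_gen=500, patience=10,
          target=0.99, pop_size=20, rng=None):
    """Search for permission features to add so that benign_score rises."""
    rng = rng or random.Random(0)
    seed = frozenset(seed)

    def fitness(added):
        # Fitness: the detector's score towards the non-malware class.
        return benign_score(seed | added)

    # Each individual is the set of permissions added on top of the seed.
    population = [frozenset(c for c in candidates if rng.random() < 0.2)
                  for _ in range(pop_size)]
    best, best_fit, stalled = frozenset(), fitness(frozenset()), 0
    for _ in range(max_gen):                          # criterion (1)
        ranked = sorted(population, key=fitness, reverse=True)
        top_fit = fitness(ranked[0])
        if top_fit > best_fit:
            best, best_fit, stalled = ranked[0], top_fit, 0
        else:
            stalled += 1
        if best_fit > target or stalled >= patience:  # criteria (3) and (2)
            break
        parents = ranked[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: keep shared additions, inherit the rest at random.
            child = {c for c in sorted(a | b) if rng.random() < 0.5} | (a & b)
            child ^= {rng.choice(candidates)}         # mutation: toggle one
            children.append(frozenset(child))
        population = parents + children
    return seed | best, best_fit


def toy_benign_score(features):
    """Hypothetical detector score for the non-malware class, in [0, 1]."""
    helpful = {"permission::A", "permission::B", "permission::C"}
    return min(1.0, 0.1 + 0.3 * len(helpful & features))


candidates = [f"permission::{c}" for c in "ABCDEF"]
seed = {"api_call::sendTextMessage"}
adversarial, score = evade(seed, candidates, toy_benign_score)
```

Because individuals only encode added permissions, every candidate keeps all of the seed's original features, which mirrors the constraint that the malware's functionality must not change.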
The aim of the evaluation is to measure the explanation capability of PhilaeX in terms of the percentage of “good” explanations. An adversarial sample has a “good explanation” only if a certain proportion of the activated features in this sample are attributed with positive values. When a high proportion of the activated features is identified through their attribution values, the model explanation verifies the assumption that the model is evaded because of the activated features in the adversarial sample.
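The metric can be written down in a few lines. The sketch below assumes that an explanation counts as “good” when the share of activated features with a positive attribution is at least the threshold; whether the comparison is strict is not stated in the text, and the sample data are made up for illustration.

```python
def identified_ratio(attributions, activated):
    """Share of activated features whose attribution is positive."""
    if not activated:
        return 0.0
    hits = sum(1 for feat in activated if attributions.get(feat, 0.0) > 0)
    return hits / len(activated)


def good_explanation_rate(samples, threshold):
    """Percentage of adversarial samples whose ratio meets the threshold."""
    good = sum(1 for attributions, activated in samples
               if identified_ratio(attributions, activated) >= threshold)
    return 100.0 * good / len(samples)


# Hypothetical (attributions, activated features) pairs for two samples.
samples = [
    ({"p1": 0.4, "p2": 0.1, "p3": -0.2}, ["p1", "p2", "p3"]),  # 2 of 3 found
    ({"p1": -0.1, "p4": 0.3}, ["p1", "p4"]),                   # 1 of 2 found
]
# Sweep the threshold from 0% to 90% in steps of 10%.
curve = {t: good_explanation_rate(samples, t / 100) for t in range(0, 100, 10)}
```

Plotting `curve` for each explainer gives one line per method; a line that stays high as the threshold grows indicates a more robust explanation.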
In the experiment, we compare the explanation capability of PhilaeX against LIME (Ribeiro et al., 2016), SHAP (Lundberg and Lee, 2017) and the MPT explainer (Lu and Thing, 2021). We use these three explainable AI methods as baselines for the following reasons: (1) LIME is a popular explainable AI method that explains models by learning a linear surrogate model. (2) The explanation generated by SHAP is based on the computation of the Shapley value (Roth, 1988), a concept widely used in cooperative game theory. (3) The recent MPT explainer is based on modern portfolio theory (Markowitz, 1952), which was proposed in economics to allocate investment across different assets for maximum return with minimum risk. In the evaluation, we vary the threshold of the “good explanation” from $0\%$ to $90\%$ of the activated features in the adversarial sample being identified. This allows us to observe the robustness of the explanations from the different explainable AI methods. Fig. 1 shows that PhilaeX identifies more activated features in the adversarial samples than LIME, the MPT explainer and SHAP when the same threshold of “good explanation” is used and the threshold is less than $40\%$ for SVM and $20\%$ for BERT. In addition, the explanation by PhilaeX is more robust than those of SHAP and the MPT explainer, as shown by its more slowly decreasing curve. This conclusion still holds when we compare the robustness of PhilaeX and LIME, considering the unstable explanations of LIME that are caused by the random perturbation of the features’ values.
Figure 1: “Good Explanation” Comparison (panel (b): “Good explanation” percentage for BERT). The number of “good explanations” by PhilaeX stays at a high level (i.e., nearly $100\%$ for SVM) when the threshold of “good explanation” is less than $40\%$. This also shows the robustness of the explanation by PhilaeX compared to the other explainable AI methods.
In the explanation of BERT, PhilaeX shows a slightly lower ratio of “good explanations” when the threshold of “good explanation” is less than $30\%$. This is possibly because BERT relies more on joint contributions among the features, which reduces the effect of single features accordingly. However, PhilaeX still presents a relatively robust explanation capability among these explainable AI methods, as its curve declines more slowly.
4.4 Explaining PDF Malware Detector
In the fidelity test, the aim is to evaluate whether the explainer attributes high values to the features that have a high impact on the model’s prediction behavior. Specifically, we use two kinds of tests in the experiments: (1) In the deduction test, removing a certain number of features with high attribution values should lead the model to predict the manipulated sample as the opposite class. That is, the fewer such high-attribution features need to be removed, the higher the explanation fidelity. For example, if the SVM model predicts a manipulated malware sample as non-malware once the feature with the top attribution value is removed, this feature is correctly attributed in the explanation. (2) In the augmentation test, we activate a certain number of features in a non-malware sample. These features come from a malware sample and are attributed with high values in the model explanation of this malware sample. If the explanation is correct, the model is expected to predict the manipulated non-malware sample as malware. That is, the correctly attributed features in a malware sample may have a strong individual impact on the model’s prediction behavior that leads the model towards the malware class.
We use the positive classification rate (PCR) (Guo et al., 2018) as the evaluation metric to quantify the fidelity of the explanations. The PCR is defined as the ratio of samples that retain the original class after the manipulation through deduction or augmentation, where the original class is that of the explained sample (here, malware). For an explanation with high fidelity, the PCR should be as low as possible in the deduction test and as high as possible in the augmentation test.
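The two fidelity tests and the PCR can be sketched as follows for binary feature vectors. The detector, the feature names and the attribution values are hypothetical placeholders; the PCR here counts the manipulated samples that the model classifies as malware, which is the class of the explained sample in both tests.

```python
def top_features(attributions, k):
    """Names of the k features with the highest attribution values."""
    ranked = sorted(attributions, key=attributions.get, reverse=True)
    return ranked[:k]


def deduction(sample, attributions, k):
    """Switch off the k top-attributed features of a malware sample."""
    changed = dict(sample)
    for feat in top_features(attributions, k):
        changed[feat] = 0
    return changed


def augmentation(benign, attributions, k):
    """Switch on, in a benign sample, the k top-attributed malware features."""
    changed = dict(benign)
    for feat in top_features(attributions, k):
        changed[feat] = 1
    return changed


def pcr(predict, samples, positive=1):
    """Share of manipulated samples classified as the explained class."""
    return sum(predict(s) == positive for s in samples) / len(samples)


def toy_detector(sample):
    """Hypothetical PDF detector: malware (1) only if both features are on."""
    return 1 if sample.get("js", 0) + sample.get("open_action", 0) >= 2 else 0


malware = {"js": 1, "open_action": 1, "pages": 1}
benign = {"js": 0, "open_action": 0, "pages": 1}
attributions = {"js": 0.6, "open_action": 0.3, "pages": -0.1}  # made-up values

pcr_deduction = pcr(toy_detector, [deduction(malware, attributions, k=1)])
pcr_augmentation = pcr(toy_detector, [augmentation(benign, attributions, k=2)])
```

With these toy attributions, removing the single top feature already flips the malware sample (low deduction PCR), and transplanting the top two features turns the benign sample into a detected one (high augmentation PCR), which is the pattern a faithful explanation should produce.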
In this experiment, we test the explanation fidelity of PhilaeX when it is used to explain the Random Forest (RF) and SVM classifiers on the PDF malware detection task. In Fig. 2a and Fig. 2b, we observe that for both RF and SVM, PhilaeX produces explanations with significantly higher fidelity than the MPT explainer, as measured by the lower PCRs. This finding verifies that the feature selection function $h(\mathbf{x})$ in Section 3 leads the subsequent feature attribution to assign high attribution values to the important features. In addition, the high fidelity (in terms of PCR) remains stable as the number of features used increases. This means that PhilaeX is more capable than the MPT explainer of identifying the important features (by attributing them with higher values).
In Fig. 2c and Fig. 2d, the features with high attribution values by PhilaeX generally yield a high PCR for both RF and SVM when the number of features used is small. However, the PCRs for PhilaeX become lower than those of the MPT explainer when around 50 or more features are used in the augmentation test. This is probably because the joint contribution of all the features becomes stronger as the number of features used increases.
4.5 Running Time Performance
The average running time for PhilaeX to explain the SVM’s prediction on a single sample of Android malware is around 6.37 seconds, compared to around 15.44 seconds for the MPT explainer. This is probably due to the efficient optimization process of Ridge regression.
5 CONCLUSION
In this article, we presented a novel model-agnostic explainable AI method, PhilaeX, which is characterized by its feature selection strategy and is well suited to explaining the AI models used in cyber security tasks. The explanation takes the form of feature attributions for machine learning classifiers. The method has a multi-stage process that identifies the candidate features to be explained and quantifies their contributions: (1) it finds the core features that lead the model to make a borderline prediction; (2) it selects the features with positive individual contributions towards the model’s prediction on the original sample, which restricts the explainer to the attribution of important features and helps reveal the model’s behavior more accurately; and (3) it uses a Ridge regression model as the surrogate model to quantify the contributions of these features, taking into account their joint contributions. The explanation fidelity of the proposed method is evaluated by two experiments. The first experiment aims to find the activated features in adversarial samples of Android malware through the (positive) attribution values by PhilaeX. The results show that PhilaeX identifies such activated features better than LIME, SHAP and the MPT explainer. The second experiment consists of two fidelity tests, the deduction test and the augmentation test. In the deduction test, PhilaeX produces explanations with significantly higher fidelity than the MPT explainer. The augmentation test reveals that PhilaeX has higher PCRs when a small number of features is used. The results of both experiments show that PhilaeX can be helpful for explaining AI models, such as those used in the cyber security field.
REFERENCES



Figure 2: Fidelity Test (panels: (c) Augmentation Test for Random Forest; (d) Augmentation Test for SVM). In the deduction test results (a) and (b), PhilaeX shows higher explanation fidelity for both the Random Forest and SVM classifiers, where the PCR value of the deduction test should be as low as possible. In the augmentation test, PhilaeX shows higher (better) PCR values for both the Random Forest and SVM classifiers when a small number of features is used (i.e., $<30$ features for RF and $<50$ features for SVM).
