ExAIS: Executable AI Semantics
ABSTRACT
Neural networks can be regarded as a new programming paradigm, i.e., instead of building ever-more complex programs through (often informal) logical reasoning in the programmers’ minds, complex ‘AI’ systems are built by optimising generic neural network models with big data. In this new paradigm, AI frameworks such as TensorFlow and PyTorch play a key role, which is as essential as the compiler for traditional programs. It is known that the lack of a proper semantics for programming languages (such as C), i.e., a correctness specification for compilers, has contributed to many problematic program behaviours and security issues. While it is in general hard to have a correctness specification for compilers due to the high complexity of programming languages and their rapid evolution, we have a unique opportunity to do it right this time for neural networks (which have a limited set of functions, and most of them have stable semantics). In this work, we report our effort on providing a correctness specification of neural network frameworks such as TensorFlow. We specify the semantics of almost all TensorFlow layers in the logical programming language Prolog. We demonstrate the usefulness of the semantics through two applications. One is a fuzzing engine for TensorFlow, which features a strong oracle and a systematic way of generating valid neural networks. The other is a model validation approach which enables consistent bug reporting for TensorFlow models.
KEYWORDS
AI frameworks, AI libraries, deep learning models, semantics, specification, test case generation, model validation, AI model generation
1 INTRODUCTION
Artificial intelligence (AI) refers to intelligence expressed by machines. It can be based on imitating human intelligence or learning behaviour. For example, the technique called deep learning (DL) applies artificial neural networks to simulate the neurons of natural brains [36]. In recent years, the dramatic growth of data and computing power has led to the spread of DL approaches which enabled numerous new technologies, like autonomous driving, language processing, and face recognition [34].
In principle, AI systems can be seen as a new programming paradigm. Instead of building programs through (often informal) logical reasoning by programmers, AI systems are built by training neural network models with big data. For conventional programming, compilers or interpreters are major development tools. Similarly, for the development of neural networks the focus is on AI frameworks (or libraries) such as TensorFlow and PyTorch, which provide utilities, algorithms, and various types of layers that constitute DL models.
The lack of a proper semantics (or correctness specifications) for programming languages has resulted in many critical issues such as inconsistencies or vulnerabilities. Generally, it is hard to develop a comprehensive specification for compilers due to the high complexity and the rapid evolution of programming languages [39]. However, we believe that it is more feasible for AI frameworks, since the semantics of AI frameworks are usually stable and simpler, i.e., the set of functions and layers used in popular machine learning models are usually limited.
Although AI systems have been proven successful in many applications, they are not error free. Especially in safety critical applications, like autonomous driving, or medical systems, even small bugs can have severe consequences. Thus, we need systematic, well-grounded analysis methods for AI systems. While the lack of a proper language specification has hindered many program analysis tasks, we have here a unique opportunity to start afresh by building a formal semantics of AI systems, which then serves as a solid foundation for developing ever more sophisticated analysis methods for AI systems. In this work, we make such an attempt by developing an executable (formal) semantics for AI frameworks. That is, we introduce a Prolog specification for almost all TensorFlow layers. Our specification provides the foundation for solving many AI analysis problems, such as AI test generation, model validation, and potentially AI model synthesis. In the following, we present one example application of the semantics, AI test generation.
Recently, testing DL systems has become a popular research topic, with a variety of approaches [24, 30, 44]. In comparison, testing AI frameworks has gained less attention. A bug in an AI library can potentially put all systems that were built with the library at risk. Hence, it is crucial to ensure the correctness of these libraries. Multiple researchers made some early attempts [22, 23, 33, 38, 40, 41, 43], although existing approaches suffer from at least two major issues. First, existing approaches often have a weak oracle (i.e., a test case passes if some simple algebraic properties are satisfied [22] or results in similar outputs on two implementations [40]), which can make it difficult to find deep semantic bugs. Second, existing approaches fail to systematically generate valid AI models, i.e., to our knowledge, the state-of-the-art approach has a success rate of 25% in generating valid models (after a complicated process combining NLP techniques with manually designed templates) [29].
Our executable semantics solves these two issues naturally. First, given an AI model, our semantics can be executed to compute precisely the constraints that are satisfied by the output (based on Prolog’s underlying unification and rewriting system), which serves as a strong test oracle. Second, our semantics facilitates the development of a fuzzing-based test generation method to systematically produce valid AI models. Our semantics facilitates the generation process by providing feedback in the form of a badness-value that indicates how far the model is away from being valid (i.e., satisfying the precondition of all layers of the model). Our evaluation shows that we significantly improve the success rate of generating valid AI models (e.g., 99%) and can generate diverse model architectures consisting of various layers (in the form of sequences and complex graphs) as well as layer arguments, weights, and input data.
Structure. The rest of the paper is structured as follows. In Sect. 2, we introduce our semantics and show how we specify different layers. In Sect. 3, we present two applications of the semantics. In Sect. 4, we evaluate these two applications systematically. Lastly, we review the related work in Sect. 5 and conclude in Sect. 6.
2 AI FRAMEWORK SEMANTICS
In this section, we start with a brief introduction of the logic programming language Prolog [32] that we adopt for the development of our AI framework semantics. Afterwards, we describe our approach to developing the semantics through examples. In general, to support a variety of analysis tasks including testing, debugging and even synthesising AI models, it is important that the semantics fulfils the following requirements [35, 39]. First, the semantics must be declarative and high-level so that it avoids complications due to implementation-level optimisation and facilitates a concise specification that is independent of implementation details. Second, it must support symbolic reasoning to allow various property, consistency and correctness checks, and thus it must be specified in logic. Lastly, it must be executable so that it can be used to support a variety of automated analysis tasks, like testing or debugging.
2.1 Prolog
Prolog is a declarative language that relies on first-order logic. Programs in Prolog are built with rules and facts, which are usually concerned with mathematical relations. Hence, we believe that Prolog is suitable for a formal semantic representation. There might be more formal mathematical semantic models that, e.g., rely on
Coq [38], but such notations are not as flexible and would make it difficult to support a large number of layers.
In Prolog, a user can ask queries, and the answer is computed automatically based on the rules and facts with a unification algorithm. The following example shows two facts stating that average and flatten are layers, and a rule that says that layers are AI components (ai_components).
A user may ask queries such as ai_components(flatten)., which is simply answered by true, or ai_components(X)., where $X$ is a variable, which produces the following results.
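The Prolog listing itself is not reproduced in this excerpt. As a rough illustration only, the deduction it performs can be mimicked in Python (the rendering below is ours; layer and ai_components are the predicate names from the text):

```python
# Facts: layer(average). layer(flatten).
layer_facts = {"average", "flatten"}

# Rule: ai_components(X) :- layer(X).
def ai_components(x=None):
    """Called with an atom, answers the ground query (true/false);
    called without an argument, enumerates all solutions for X."""
    if x is not None:
        return x in layer_facts      # ai_components(flatten). -> true
    return sorted(layer_facts)       # ai_components(X). -> X = average ; X = flatten
```

Prolog of course obtains these answers by unification and backtracking rather than an explicit membership test; the sketch only mirrors the observable query results.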
A rule can generally be seen as a predicate that has arguments (in the head) and describes the relation between these arguments with a conjunction of clauses in the body of the rule. Arguments can be atoms (lowercase) or variables (uppercase). Prolog supports a number of data types and has built-in features for various mathematical operations. Moreover, it includes functions for list and string processing, and higher order programming notations [21].
Prolog is a well-suited specification language for our purpose for multiple reasons. First, it is largely intuitive and convenient to use (i.e., by specifying rules and facts) and has a reasonably large user base, which is extremely important as we intend to invite the open-source community to collaboratively maintain the semantics. Second, it supports various list operations and mathematical expressions, which is helpful for capturing the semantics of AI libraries that involve mathematical operations on multidimensional data. Third, Prolog supports automated reasoning (through an underlying rewriting and search system), which is essential for our purpose. Lastly, Prolog is a well-studied declarative programming language, which is fundamentally different from the imperative programming that is used to implement AI frameworks. This makes it unlikely that our semantics will have the same bugs as the implementation.
2.2 Specifying the Semantics
In this work, we focus on developing a semantics for the TensorFlow framework, since it is the most popular AI framework. TensorFlow is a complex framework that provides various tools, implementations, and features for the development of AI systems. We systematically go through every component in the TensorFlow framework and identify all the components which are relevant to AI models themselves (rather than components for the training process). Our goal is to implement a comprehensive semantics for one of its core components, i.e., the layers [19]. We specify the semantics of nearly all layers, omitting only a few for practical reasons. A detailed discussion of the omitted layers is in Sect. 4. In total, we specify the layer functionality that is required to perform a prediction for 72 layers of different types; we did not implement any TensorFlow algorithms that are concerned with the training of models or data processing. A full list of these layers is available in our repository [10]. For each layer, we systematically go through the documentation, and run tests to understand the semantics. Unfortunately, the semantics of quite a number of layers are under-specified, and we have to seek further help from online forums and/or the authors. In multiple cases, we identify under-specified or mis-specified behaviours, like wrong handling of unexpected layer arguments, or missing restrictions for layer inputs (more details are in Sect. 4). This suggests that formalising the semantics alone is already useful, and can help to improve AI frameworks.

Figure 1: Histogram of the number of lines of Prolog code required to specify each layer.
The semantics of each layer is specified using one or more rules in Prolog. Figure 1 shows a histogram of the number of lines (or rules) required per layer. It can be seen that many layers require fewer than 10 lines. The reason is that they reuse predicates of other layers, or built-in rules. A major fraction of the layers requires no more than 40 lines, and nearly all of the remaining layers have fewer than 80 lines. One outlier (the Dot layer) has 106 lines, since the implementation supports various axis combinations for the dot product. In total, the semantics consists of about 3200 lines of Prolog code, with an average of 44 lines per layer (when shared predicates are considered). The implementation of nearly all layers has a one-to-one correspondence with the deterministic functionality of the TensorFlow layers. This allows us, given an AI model, to straightforwardly invoke the relevant rules in our semantics and apply them to generate constraints on the model output.
We remark that since these Prolog rules are specified by humans, they are potentially buggy as well. Standard practices are applied to ensure the correctness of the semantics itself, such as code review, manual testing (e.g., using documentation examples) and automated testing (e.g., using our own fuzzing engine). While we cannot be completely sure that the semantics is correct, we have reason to believe that it is reasonably so. In general, one can always argue that correctness is an illusion and all there is is consistency (e.g., between manually-programmed TensorFlow and manually-written Prolog semantics). That is, as long as the same bug is not present in both the implementation and the semantics, inconsistency would arise and either TensorFlow or our semantics will be fixed. In the following, we illustrate our semantics with multiple examples.
Dense layer. Listing 1 shows our Prolog semantics of a Dense layer [6]. It is a standard densely connected layer, which contains a number of nodes and each is connected with all inputs. The output is computed by the weighted sum of all inputs at each node with an added bias value. Our specification works as follows. The rule starting in Line 1 has multiple parameters: a list [I|Is], a weight array IWs, and a bias array Bs (which can be intuitively regarded as inputs) and a list Os (which can be regarded as the output). The list notation [I|Is] enables access to the first element I and the remaining elements Is of a list. Line 2 constrains the depth of the nested input to two. We handle higher dimensional input separately. Line 3 applies a predicate that is true when O can be unified as the expected output for an input I, and Line 4 is a recursive constraint, which intuitively continues with the next inputs.


The rule in Lines 6–9 is similar, except that it handles (layer) inputs with a higher dimension, which is checked in Line 7, and recursively uses our initial predicate from Line 1 since the dense layer only performs computations in the innermost list even when it receives high dimensional input data. Line 11 (and Line 18) are the base cases for the recursion, i.e., when only an empty list remains.
The predicate in Line 13 encodes the main layer functionality and becomes true when the Res variable is the expected output for the input [I|Is]. It has the same arguments as our first rule and an additional temporary variable Res0 for the result. It consists of clauses for multiplying the weight arrays IW with each input I and for adding the results in Line 16. The predicates multiply_list_with and add_lists are straightforward and are therefore omitted.
With this Prolog semantics, we can now answer a variety of queries, e.g., to compute the expected output of a Dense layer. The following listing shows some examples. The first is a basic query with two inputs and two nodes and the second has three nodes and more complex input. It can be seen that even such a small and simple specification can handle high dimensional input with just a few lines of code. The third query fails due to invalid arguments, since the number of nodes must be equal to the size of the bias array and also to the second dimension of the weight array. Many layers have non-trivial preconditions like this, which makes the generation of test data difficult. In Sect. 3, we will illustrate how we can identify such invalid inputs and produce useful information about the source and severity of such issues to support our model validation and test generation approaches.
dense_layer([[[[[1,2,3]]],[[[4,5,6]]]]], [[2,3,4],[5,4,6],[7,6,8]], [2,3,4], X).
X = [[[[[35, 32, 44]]], [[[77, 71, 98]]]]]
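To make the computation concrete, the following Python sketch (ours, not part of the Prolog artefact) mirrors what the dense_layer rules compute, including the recursive descent for high-dimensional input and the precondition on the weight and bias shapes:

```python
def dense(inputs, weights, biases):
    # Higher-dimensional input: recurse until the innermost vectors,
    # mirroring the rule for high-dimensional input in Listing 1.
    if isinstance(inputs[0], list):
        return [dense(x, weights, biases) for x in inputs]
    # Precondition: one weight row per input value and one weight
    # column (and one bias) per node.
    assert len(weights) == len(inputs)
    assert all(len(row) == len(biases) for row in weights)
    # Weighted sum per node plus bias.
    return [sum(i * row[j] for i, row in zip(inputs, weights)) + biases[j]
            for j in range(len(biases))]

out = dense([[[[[1, 2, 3]]], [[[4, 5, 6]]]]],
            [[2, 3, 4], [5, 4, 6], [7, 6, 8]], [2, 3, 4])
# out == [[[[[35, 32, 44]]], [[[77, 71, 98]]]]], matching the query above
```

The assertions play the role of the layer precondition: a bias array or weight matrix of the wrong size makes the sketch fail, just as the third query above fails in Prolog.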
Conv1D layer. A convolution layer [2] shifts a kernel (or pool) over the input space and produces the weighted sum of the elements within the kernel. It can, for instance, be used for image preprocessing or recognition. Listing 2 illustrates the (simplified for the sake of presentation) semantics of this layer. Line 1 shows a predicate for the layer, which has the (layer) input data Is, a number KernelSize for the one dimensional size of the window that should be shifted, a weight array IWs for the kernel elements, a bias array Bs of values that are added to the output, a step size Strides, a Boolean Padding for optionally enlarging the input data, e.g., by surrounding it with zeros, and the output data Os, as parameters. Lines 2–4 apply predicates for checking the validity of the layer inputs and arguments, i.e., to ensure that the input dimensions are as required, to check that the kernel is not too large for the input, and to verify that the weights are valid. An example implementation of such a precondition is given in Sect. 3.3. Line 5 applies the pool1D predicate, which is a generic predicate for different types of pooling or convolutional operations, and it becomes true when Os can be unified with the result of the operation.
Line 7 shows the pool1D predicate, which has the same arguments as the conv1D layer with an additional Poolfunc meta-predicate to decide what should happen within the pool, and an argument MultiLayerPool to specify if the pool should be shifted over the second axis of the input space. Next, there is a call to a subpredicate for one sample of the input in Line 8, with initialisations of coordinates and a temporary output variable, Line 9 iterates over the other samples, and Line 11 is a simple stopping clause.
In Lines 13–17, we apply padding with the calc_padding predicate, which is true when LeftP and RightP are padding sizes that maintain the input shape after pooling, and by using these variables for the padding1D predicate, which succeeds when Is1 is Is with padding. The predicate in Lines 19–24 does the actual application of the convolution. Line 21 utilises the pool predicate, i.e., the weighted sum of the kernel, for a given position X,Y, which is then appended to the output in Line 22. Next, Line 23 has a conditional clause that introduces new variables X1,Y1 for the next coordinates given the current pool position, and finally Line 24 performs recursion to continue with the next coordinates.
Lines 26–31 are defined similarly, except that the pool is also shifted over the second axis by iterating over the Y coordinate instead of just using the sum over these values. Finally, Lines 33–36 show a stopping clause, which becomes true when one of the indices X,Y is outside the input space.
Similarly, with the above semantics, we can now query this layer as shown in the following examples. The first example has integer inputs and the second has floating-point inputs. It can be seen that variables are not limited to a specific type; our implementation works for integers as well as floating-point numbers.
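As a simplified illustration of the shifting-window computation, the following Python sketch (ours) covers the single-input-channel case; the weight layout weights[k][f], i.e., kernel position k and filter f, is our assumption for the sketch, and the 'same' padding sizes follow the standard formula that preserves ceil(len/strides) output positions:

```python
import math

def conv1d(xs, kernel_size, weights, biases, strides=1, padding=False):
    # 'same' padding: surround the input with zeros so that
    # ceil(len(xs)/strides) output positions remain after pooling.
    if padding:
        total = max((math.ceil(len(xs) / strides) - 1) * strides
                    + kernel_size - len(xs), 0)
        left = total // 2
        xs = [0] * left + xs + [0] * (total - left)
    out = []
    # Shift the kernel over the input with the given step size.
    for x in range(0, len(xs) - kernel_size + 1, strides):
        window = xs[x:x + kernel_size]
        # Weighted sum of the window per filter, plus the filter's bias.
        out.append([sum(v * row[f] for v, row in zip(window, weights))
                    + biases[f] for f in range(len(biases))])
    return out

conv1d([1, 2, 3, 4], 2, [[1], [1]], [0])  # windows [1,2],[2,3],[3,4]
```

Because both integer and float arithmetic work unchanged here, the sketch also reflects the type-agnostic behaviour of the Prolog rules noted above.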
An interesting aspect of this layer is that a lot of the functionality can be reused for other layers. In particular, the pool1D predicate is defined generic so that it can be applied for the semantics of many related layers. For example, it is applied for simple pooling layers, like (global) average or max pooling and for other convolutional layers, like separable convolution and locally connected layers.
Non-deterministic layers. Although most layers have deterministic behaviour when their weights are known, there are a few exceptions, including layers such as Dropout, Alpha Dropout, Spatial Dropout, Gaussian Dropout, and Gaussian Noise. All of them either have built-in non-determinism (i.e., random behaviour) or are probabilistic in nature (i.e., producing outputs that follow a certain probability distribution). We defined Prolog rules that constrain these layers to satisfy high-level properties stated in the respective documentation, e.g., if a layer is documented to produce results according to a specific distribution, the rule checks that it does. We demonstrate how this is done for the dropout layer [8].
This layer is expected to set values of the input to zero according to a given rate. (Additionally, the layer also adds noise to the data, but we neglect this feature in the example for brevity.) Listing 3 shows a predicate that checks if the dropout actually occurs at a given rate. The arguments are the inputs Is, the outputs from TensorFlow Os, the rate, and a threshold AcceptedRateDiff for the maximum accepted difference from the observed rate. Line 2 shows predicates which are true when N and NO can be unified with the number of input and output values, which must be equal. Line 3 does the same for the number of zeros in the input, and Line 4 for the number of zeros in the given output. Next, Line 5 sets RealRate to the observed rate of dropouts while considering the already existing zeros in the input, and Line 6 applies the abs function to obtain the difference to the expected rate. Finally, Line 7 checks if the difference is bigger than our threshold and writes a message if that is the case.
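A Python rendering of this check (ours, mirroring the steps of Listing 3) makes the rate computation explicit:

```python
def dropout_rate_ok(inputs, outputs, rate, accepted_rate_diff):
    # Input and output must contain the same number of values.
    assert len(inputs) == len(outputs)
    zeros_in = inputs.count(0)    # zeros already present in the input
    zeros_out = outputs.count(0)  # zeros in the TensorFlow output
    # Observed dropout rate over the values that were non-zero initially.
    real_rate = (zeros_out - zeros_in) / (len(inputs) - zeros_in)
    # Accept if the observed rate is close enough to the expected rate.
    return abs(real_rate - rate) <= accepted_rate_diff

dropout_rate_ok([1, 2, 3, 4, 0], [1, 0, 3, 0, 0], 0.5, 0.1)  # -> True
```

In the example, two of the four initially non-zero values were dropped, so the observed rate of 0.5 matches the configured rate exactly.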
3 APPLICATIONS OF THE SEMANTICS
The semantics has many applications. In this section, we demonstrate two of them. We illustrate the overall design of our testing and model validation approaches and show step-by-step examples.
3.1 AI Framework Testing
Recently, multiple researchers started working on AI framework testing [22, 23, 33, 38, 40, 41, 43]. While an impressive number of bugs have been identified, there are still many open challenges. Based on our executable semantics, we introduce a novel AI framework testing method which improves the existing approaches by addressing the following problems. (1) The oracle problem (i.e., how we verify if the output is correct). (2) The test generation problem (i.e., how we can effectively generate valid test cases). For the first problem, the existing approaches focus on differential testing [33, 41, 43] or a simple form of metamorphic testing [22, 23, 40]. For differential testing, multiple libraries (or implementations) are compared against each other, and bugs are identified when an inconsistency is found. The kind of metamorphic testing done on AI frameworks introduces changes in the test input (i.e., an AI model) that should not affect the output. A different output would thus indicate a bug. In other words, such testing relies on the algebraic property that ‘dead code’ does not change the model’s behaviour. Differential testing is powerful since a second implementation can serve as a good test oracle. However, a second independent implementation is not always available, especially for new algorithms. Even if it is available, it is still possible that it suffers from the same bugs due to reasons such as using the same underlying library or implementing certain optimisation in the same ‘wrong’ way. Our method can also be seen as a form of differential testing, but it is unlikely that our semantics will suffer from the same bugs since it is developed in a different modelling paradigm at a high level (e.g., where little performance optimisation is involved). On the other hand, metamorphic testing is usually limited to specific types of bugs based on algebraic properties (i.e., a very small piece of the semantics) which are used to guide the test generation.
Another issue that limits the applicability of both approaches is the non-deterministic behaviour of the layers, e.g., inconsistency may be due to non-deterministic or probabilistic behaviours rather than bugs. Usually, the training of AI models contains random factors that cause small discrepancies in different implementations and make an exact comparison and the identification of bugs difficult. This issue is sometimes resolved by relying on approximating oracles that accept a range of inputs. This is not an ideal solution since it often requires manual adjustments [31]. Based on our semantics, we have an alternative, more powerful solution for this problem. By developing our semantics with a focus on the high-level behaviour of the layers, we can test the results of an AI model independently of the non-deterministic training algorithms. This facilitates the detection of subtle bugs without the interference of underlying non-deterministic or probabilistic behaviours.
The second problem, i.e., the test case generation problem, is often addressed by using existing benchmark examples or by modifying these examples, e.g., via mutation [43]. The problem of the former is that existing examples are limited in both numbers and variety. The problem of the latter is that test cases generated through mutation are often invalid, e.g., due to non-trivial preconditions that must be satisfied by the layers. Most AI libraries contain dozens of different layers that can be connected in sequences or in complex graph structures to form deep learning models. There are simple layers, like a densely connected layer, and also highly complicated layers which apply operations on multidimensional inputs or even combine the functionality of other layers. Usually, they need a number of configuration arguments, e.g., for specifying the number of neurons or a filter size. Most layers have preconditions for the inputs and hence they cannot be freely combined. To address this problem, a recent approach [29] proposes to extract information about the input requirements from the library documentation and to generate ‘valid’ test cases accordingly. However, despite all this effort, it still has very limited capability in producing valid test data, i.e., the success rate is around 25%.
In this work, we develop a fuzzing method which utilises feedback from our semantics to enable a systematic way of producing valid test cases, i.e., by finding layers (and associated arguments) that are compatible with each other. Moreover, our approach explores various layer arguments in order to find interesting combinations that might reveal bugs. The overall workflow is depicted in Fig. 2. Next, we will explain the involved tasks and components.
Our method has two main components. First, the most important component is the oracle, which is our executable Prolog semantics. Given a deep learning model (in the form of a collection of layers and their arguments) and the input data, the oracle executes the semantics to produce a prediction, in a similar fashion as it is done by an AI library. Second, the test case generator is a tool that produces test data in the form of multidimensional arrays and a deep learning model as a collection of layers and their respective arguments. Our test case generator is based on fuzzing techniques and

Figure 2: Overview of the data flow of our testing approach.
produces random input data and models. The focus of the generation is to produce complex models with graph structures in order to extensively test combinations of different layers in various settings.
In the following, we use the simplified semantics of the convolution layer that we showed in the last section to demonstrate the test generation. The test case generator produces test input data as well as a deep learning model (including layer arguments) to reveal bugs or inconsistencies in the AI library. The model and the input data (for performing a prediction) together form a test case. A test case is automatically generated in the form of a Prolog query, which is given to the Prolog interpreter to compute a result with a unification algorithm. For example, a simple query for our convolutional layer is as follows.
conv1D_layer([[[5,8,7],[9,10,7],[3,7,7]]], 2, [[[1],[1],[1]],[[1],[1],[1]]], [0], 1, false, 1, X).
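To illustrate what such a query computes, the following plain-Python sketch mimics the simplified conv1D semantics. It is our illustration, not the actual Prolog clauses; we read the query arguments as inputs, kernel size, weights, biases, stride, a padding flag, dilation, and the result variable, which is an assumption based on the surrounding description.

```python
def conv1d(inputs, kernel_size, weights, biases, stride=1, padding=False, dilation=1):
    """Plain-Python sketch of the simplified conv1D semantics.

    inputs:  [batch][steps][channels]
    weights: [kernel_size][channels][filters]
    biases:  [filters]
    """
    assert not padding and dilation == 1  # simplifications for this sketch
    result = []
    for sample in inputs:
        steps_out = []
        for start in range(0, len(sample) - kernel_size + 1, stride):
            window = sample[start:start + kernel_size]
            out = []
            for f in range(len(biases)):
                acc = biases[f]
                for k in range(kernel_size):
                    for c in range(len(window[k])):
                        acc += window[k][c] * weights[k][c][f]
                out.append(acc)
            steps_out.append(out)
        result.append(steps_out)
    return result

# The query above, re-expressed: kernel size 2, all-ones weights, zero bias
print(conv1d([[[5, 8, 7], [9, 10, 7], [3, 7, 7]]], 2,
             [[[1], [1], [1]], [[1], [1], [1]]], [0], 1, False, 1))
# → [[[46], [43]]]
```

Running the sketch on the query's arguments yields [[[46], [43]]], matching the output reported for this example later in the text.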
The same test case for TensorFlow is expressed as a simple Python program as shown in Listing 4. The first lines are imports and configurations. Lines 3-4 define the actual deep learning model. In Lines 5-8 we overwrite the weights for the kernel and the biases in order to support a deterministic prediction. Usually these weights are initialised randomly and tuned through training. Line 9 sets the input data and Line 10 performs a prediction with the model for the input data, which gives us the test output.
Algorithm 1: Pseudo code of the test model generation
Input: GenHelper: helper class containing a map of layer generators (Layer.Name → Generator)
       SemanticHelper: helper class for executing the semantics
1:  function RecursiveGeneration(level, rand)
2:      if level = 0 or rand.nextBool() then            ▷ allow deeper graphs depending on a random outcome
3:          return null                                 ▷ end of recursion
4:      layer ← GenHelper.generateLayer()               ▷ generate a random layer, including arguments
5:      inputNumber ← 1
6:      if layer.hasMultipleInputs() then               ▷ check if the layer needs multiple inputs
7:          inputNumber ← rand.nextInteger(3)
8:      for i ← 1 to inputNumber do
9:          layer1 ← RecursiveGeneration(level - 1, rand)
10:         layer1.setParent(layer), layer.addChild(layer1)
11:     return layer
12: function FindValidModel(layer, maxTries, rand)      ▷ find a valid model given a random model
13:     for i ← 1 to maxTries do
14:         (success, error) ← SemanticHelper.run(layer)    ▷ run the semantics, obtain success or error
15:         if success then
16:             return layer
17:         else if error.getLocation() = lastError.getLocation() and error.getBadness() ≥ lastError.getBadness() then
18:             layer ← lastLayer, error ← lastError        ▷ restore the model if there was no improvement
19:         else lastLayer ← layer, lastError ← error, layer1 ← error.getLocation()
20:         if rand.nextBool() then
21:             layer1.regenerateArgs()                     ▷ reset the layer arguments
22:         else if rand.nextBool() then
23:             options ← ["Reshape", "UpSampling", "Cropping", "Padding", ...]
24:             layer2 ← GenHelper.generateLayer(options)   ▷ generate one of the options
25:             layer1.insertLayerBeforeChild(layer2)       ▷ insert a layer before the input
26:         else if rand.nextBool() then
27:             layer2 ← GenHelper.generateLayer()
28:             layer1.replaceWith(layer2)                  ▷ replace the current layer with another layer
29:         else if rand.nextBool() then
30:             layer2 ← GenHelper.generateLayer()
31:             layer1.replaceChildWith(layer2)             ▷ replace an input layer with another layer
32:     return null
Finally, we compare the output of the test program (that includes the AI library) to the output of the semantics. A test fails if the constraints produced by the semantics are not satisfied by the output from the Python program. In general, such a failure can be caused by bugs in the library and possibly by bugs in the semantics. For our example, the output was [[[46], [43]]] in both cases, which represents a successful test. A bug is identified when there is an inconsistency in the outputs (prediction results) of the TensorFlow program and of our semantics. If there is an inconsistency, we localise the bug by checking which layer specification is violated, and by manually checking the cause.
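The comparison step can be pictured with a small helper that recursively checks nested outputs up to a floating-point tolerance. This is a sketch of the idea only; the actual comparison also handles the constraint-based specifications of non-deterministic layers.

```python
def outputs_match(expected, actual, tol=1e-6):
    # recursively compare nested outputs with a float tolerance
    if isinstance(expected, list):
        return (isinstance(actual, list) and len(expected) == len(actual)
                and all(outputs_match(e, a, tol) for e, a in zip(expected, actual)))
    return abs(expected - actual) <= tol

semantics_out = [[[46], [43]]]      # result of the Prolog query above
library_out = [[[46.0], [43.0]]]    # hypothetical TensorFlow prediction
assert outputs_match(semantics_out, library_out)
```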
3.2 Test Case Generation
The test case generator is a tool that can produce models in the form of Python scripts (that use TensorFlow) as well as models represented as Prolog queries. It also generates random test inputs (in the form of multidimensional arrays that need to have a specific shape or number of dimensions depending on the first layer) for a prediction with a model, as well as layer arguments, which can also include the weights of layers. The generation is done by the test case generator alongside the AI model generation while considering input requirements. The details about the required arguments can be found in the TensorFlow documentation. A test case is a model together with the associated test inputs.
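The shape-aware generation of random input arrays can be sketched as follows. This is a simplified illustration with names of our own choosing; the real generator is written in Java, and the small value range follows the evaluation setup described later.

```python
import random

def random_tensor(shape, rand, low=1, high=4):
    # recursively build a nested list of random integers with the given
    # shape, e.g., shape (1, 3, 3) for the conv1D query shown earlier
    if not shape:
        return rand.randint(low, high)
    return [random_tensor(shape[1:], rand, low, high) for _ in range(shape[0])]

rand = random.Random(42)
sample = random_tensor((1, 3, 3), rand)  # one batch, three steps, three channels
```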
The main functionality of the test generator is the generation of diverse functional models. Pseudo code of the generation process is illustrated in Algorithm 1. It consists of two major functions for randomly building a model and for finding a valid model. The first is a recursive function (Line 1) that starts by selecting an initial root layer and continues to connect random layers onto it. The algorithm increases the probability of stopping based on the level that represents the distance of the current layer to the root layer (Line 2) to prevent too large models. Many layers can take the output of one layer as input, and some can have inputs from multiple layers, e.g., to perform mathematical operations, like addition, or multiplication of the inputs. Hence, there is a variable for the number of inputs, which is set according to the generated layer (Lines 5–7). Finally, the recursion is performed for each input, a connection is formed with the associated functions (by feeding the output of a layer as input to another layer), and the layer is returned. The model in this case is just the root layer that is connected to other layers. Note that the models produced by this function are most likely invalid, and for simplicity it produces models in the form of trees. The semantics also supports more complex graphs.
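A compressed, runnable sketch of this recursive construction phase of Algorithm 1 is given below. The layer catalogue and the stop probability are hypothetical, invented for the example; the actual generator is written in Java.

```python
import random

# Hypothetical layer catalogue: layer name -> can it take multiple inputs?
LAYERS = {"Dense": False, "Conv1D": False, "ReLU": False, "Add": True, "Multiply": True}

class Layer:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []  # layers whose outputs feed into this layer

def recursive_generation(level, rand):
    # stop at level 0 or randomly, so deeper graphs stay possible but bounded
    if level == 0 or rand.random() < 0.3:
        return None
    layer = Layer(rand.choice(sorted(LAYERS)))
    input_number = rand.randint(2, 3) if LAYERS[layer.name] else 1
    for _ in range(input_number):
        child = recursive_generation(level - 1, rand)
        if child is not None:
            child.parent = layer
            layer.children.append(child)
    return layer

# retry until the top-level call does not stop immediately
rand = random.Random(0)
model = None
while model is None:
    model = recursive_generation(5, rand)
```

As in the paper's algorithm, the resulting tree is most likely invalid at this point; the repair loop guided by the semantics then tries to turn it into a valid model.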

Figure 3: Model visualisation in graph form
The second function takes this model as a basis to find a valid model with the help of the semantics (Line 12). Our semantics can identify invalid models based on layer preconditions, which are defined as part of the corresponding Prolog rules. These preconditions give feedback in the form of a badness value to guide the generation process towards valid models. Given a concrete model and a precondition to be satisfied, the badness value is defined in the same way as fitness in search-based software testing [25]. It is a distance metric that increases when the model is further away from becoming valid, i.e., we rank the preconditions based on severity and calculate a value by multiplying a severity factor with the difference between, e.g., an expected argument value and an observed one.
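As a toy illustration of such a badness value (the severity factors and precondition names here are invented for the example):

```python
SEVERITY = {"wrong_dimension": 10, "bad_argument": 5}  # hypothetical severity ranking

def badness(precondition, expected, observed):
    # distance-style metric: zero when the precondition is satisfied, and
    # larger the further the model is from validity
    # (severity factor times the difference of expected and observed value)
    return SEVERITY[precondition] * abs(expected - observed)

# e.g., a kernel size of 5 where the input length only permits 3:
assert badness("bad_argument", 3, 5) == 10
# a satisfied precondition contributes no badness
assert badness("bad_argument", 3, 3) == 0
```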
Line 14 shows that we apply a Semantic Helper to run our semantics and to retrieve a success or an error message. If the semantics was executed successfully then the model is valid and the algorithm stops (Line 16). If an error occurs during the execution, we check if it originates from the same layer as a previous error, and if it is more severe than the last error (based on the badness value), which would indicate that the last model change did not improve the model. In this case, we restore the last model (Line 18). Otherwise, we try to adapt the model to gradually improve its validity (Lines 21–32). This is done by randomly selecting a potential model modification for the layer that causes invalidity, like regenerating arguments, inserting a layer to change the input shapes/dimensions, or replacing the layer with one of its input layers. This process is repeated until a valid model is found or a maximum number of tries is reached.
An example graph model that was generated with our approach is illustrated in Fig. 3. The nodes in the model represent layers and the arrows show the data flow. It can be seen that the model includes various layers, like convolutional or reshape, and the output of the model is derived from the root layer, i.e., the average layer at the bottom. The corresponding TensorFlow and Prolog model are available online in our repository [10].
3.3 Model Validation
Another application of our semantics is model validation, i.e., identifying bugs in faulty (or invalid) AI models. It can be challenging to create working models, especially since the layers have numerous preconditions that need to be fulfilled when they are used or combined. For example, common errors are wrong input shapes and dimensions, or arguments that are invalid for the input data, like a convolution kernel or an axis value that is too large for an input dimension [27, 45]. AI libraries, like TensorFlow, already produce error messages to detect such issues. However, they can be difficult to understand, the messages are not always consistent for the same type of issue, and in rare cases there is no error for a problematic model (see the example below). Hence, we provide an alternative way to validate models by identifying and reporting issues precisely and consistently with the help of our semantics. By converting an existing TensorFlow model into our Prolog representation, we can exploit our semantics to check if there are any precondition violations in the model, which would make it invalid.
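The mentioned axis errors, for instance, boil down to a simple range condition on the input rank. The following check is our own illustration of the idea, not TensorFlow's code and not our Prolog rules:

```python
def check_axis(axis, rank):
    # an axis argument must reference an existing dimension of the input;
    # negative axes count from the back, as in common tensor libraries
    if not -rank <= axis < rank:
        raise ValueError(f"axis {axis} is out of range for a rank-{rank} input")

check_axis(2, 3)    # valid
check_axis(-1, 3)   # valid
try:
    check_axis(3, 3)  # one past the last dimension
except ValueError as e:
    print("precondition violated:", e)
```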
An example precondition is shown in Listing 5. It checks if a cropping layer applies too much cropping, which would result in an empty output. There is a predicate check_empty_cropping that has the layer input and output as arguments, and it is true when the output is not empty. Otherwise, an exception is thrown. First, in Lines 1–2 there is a recursive rule for the inspection of multidimensional output. Next, the rule in Line 3 makes our predicate true if any numerical value is found in the output. If that is not the case, then we generate an error message (Lines 4–7).
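Listing 5 itself is not reproduced in this excerpt; a Python analogue of the described predicate (our illustration, not the actual Prolog rule) could look like:

```python
def contains_value(output):
    # recursively true if any numerical value is found in the nested output
    if isinstance(output, (int, float)):
        return True
    return any(contains_value(x) for x in output)

def check_empty_cropping(layer_input, layer_output):
    # precondition: cropping must not remove the entire input
    if not contains_value(layer_output):
        raise ValueError("cropping produced an empty output")

check_empty_cropping([[[1, 2, 3, 4]]], [[[2, 3]]])  # cropping (1, 1): fine
try:
    check_empty_cropping([[[1, 2, 3, 4]]], [[[]]])  # cropping (2, 2): empty
except ValueError as e:
    print("precondition violated:", e)
```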
An example TensorFlow model that contains a corresponding cropping issue is shown in Listing 6. It can be seen that our model consists of four layers, a convolutional layer that is connected with a cropping layer, a maximum pool layer, and a concatenate
layer that takes the outputs of the cropping and pool layers. The same model in Prolog is shown in Listing 7. This model has a slightly different structure from our previous model queries. First, we assign the layers to variables, which we then pass in a list to the exec_layers function. This function is a simple wrapper for the execution of a list of layers that helps to identify the name of a layer that violates a precondition.
The model might seem valid at first glance, but investigating the cropping arguments shows that the values are too big. This causes an empty intermediate output in the cropping layer. As expected, our semantics is able to identify this issue and produces an error. TensorFlow has no error message [5]. The model is just executed, and the concatenation is performed with the empty output, which means that the first branch of the model is completely ignored. Especially in large models, it can be challenging to find such issues since only intermediate results reveal such problems. The TensorFlow developers confirmed that this should not happen and that an additional check will be added [5].
Our semantics is not only important for the test generation; it also enables a reliable way to identify issues in AI models. Error messages in AI libraries are not always consistent because of multiple (or device-specific) implementations of a layer. Our preconditions are easy to define and can speed up the identification of issues by highlighting problematic layers and by presenting consistent messages that could, e.g., support automatic error handling. In future work, we aim to provide a more user-friendly tool with a graphical user interface in order to facilitate the usage of both our test case generation and model validation approaches in a practical setting, e.g., to produce test models for various purposes and to find issues with existing models.
4 EVALUATION
We implemented our semantics for almost all TensorFlow layers and developed two example applications, i.e., AI framework testing, and model validation to find invalid AI models. In this section, we evaluate the effectiveness of the approaches. We design multiple experiments to answer the following research questions (RQ).
• RQ1: Is Prolog expressive and efficient enough for AI semantics analysis? There are various specification languages that might be suited for developing a semantics of an AI framework. Hence, we want to highlight the advantages of using Prolog.
• RQ2: Does the semantics allow us to effectively generate valid AI test cases? This is important since one of the applications of our semantics is AI framework testing, and for that it is necessary to produce good test models.
• RQ3: Is our test generation method based on the semantics effective for finding AI framework bugs? To further motivate the usage of our testing method, we show what bugs and issues it can reveal.
• RQ4: How well can our semantics report bugs of invalid AI models? Another application of our semantics is model validation. We will show the advantages of our method compared with the default bug reporting capabilities of TensorFlow.
The experiments were performed on a 7th Gen. Lenovo X1 Carbon ThinkPad with an 8th Gen i7 CPU with four 1.80 GHz cores and 16GB RAM. We mainly tested TensorFlow 2.4, but also discovered issues in version 1.14. For executing the Prolog semantics, we used SWI-Prolog 8.2.1, and the test generator was built in Java 13.0.7. Below are our answers to the research questions.
RQ1: Is Prolog expressive and efficient enough for AI semantics analysis? In order to answer this question, we report our findings on developing our Prolog semantics. From the TensorFlow documentation [19], we identified 79 unique non-abstract and non-wrapper layers that we wanted to implement. For practical reasons, we could not support all layers, but we were able to develop our specification for 72 layers of various types, like mathematical, normalisation, convolutional, recurrent, pooling, dense, activation, cropping, padding, and dropout, and most of these layers support nearly all arguments. We implemented the core functionality of these layers to be able to perform predictions. Our specification is not concerned with learning/training algorithms, data preprocessing, and model (accuracy) evaluations or optimisations. A full list of the layers is available online [10].
We did not develop a specification for the remaining 7 layers (ActivityRegularization, AdditiveAttention, Attention, DenseFeatures, Lambda, MultiHeadAttention, StackedRNNCells) since they have too generic or flexible input requirements, like a generic iterable that can contain other layers, or arbitrary functions or queries. Moreover, we could not fully support all layer arguments and features, e.g., we implemented the Masking layer as a stand-alone layer, but not the influencing behaviour that it can have on other layers. We had to omit some layer arguments since they would have violated our specification requirement of static weights, i.e., there are initialisers and regularisers (for weights, kernels, or biases) that we did not implement since we statically set these attributes. Moreover, we did not implement activation or dropout features within the layers since the same behaviour can be achieved by adding an independent activation or dropout layer. We left out some arguments, like constraints, data formats, or a trainable option, due to their complexity or because they have an influence outside the scope of our semantics, which is not concerned with the training of AI models. Some layers, like GRU, have an option to choose a specific implementation of the layer. For simplicity, we implemented the default implementations.

Figure 4: Run time of the semantics for rising model sizes.
It should be noted that only seven layers had detailed (mathematical) descriptions in the documentation, 44 were explained with examples or had a reference paper, and the remaining 21 layers were under-specified or had just very limited examples. Of our 72 implemented layers, seven had non-deterministic properties; hence, instead of exactly specifying their behaviour, we defined Prolog rules that check if they follow the described properties from the documentation, as explained in Sect. 2. For the remaining 65 layers, we directly specified the functionality, which enables an easy execution and an exact comparison of the prediction results from our semantics to TensorFlow. The implementation effort of the semantics was about eight work months, which partially also includes the time to acquire the relevant domain knowledge.
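For instance, a dropout layer cannot be compared exactly, but each output element must be either zero (dropped) or the corresponding input scaled by 1/(1 - rate). A property check in that spirit, as a Python illustration of the idea (assuming TensorFlow's inverted-dropout scaling during training, not our actual Prolog rule), is:

```python
def follows_dropout_property(inputs, outputs, rate, tol=1e-6):
    # each output element must be zero (dropped) or the corresponding
    # input scaled by 1/(1 - rate) (kept, with inverted-dropout scaling)
    scale = 1.0 / (1.0 - rate)
    return all(abs(o) <= tol or abs(o - i * scale) <= tol
               for i, o in zip(inputs, outputs))

assert follows_dropout_property([2.0, 4.0, 6.0], [0.0, 8.0, 12.0], rate=0.5)
assert not follows_dropout_property([2.0, 4.0, 6.0], [3.0, 8.0, 12.0], rate=0.5)
```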
We evaluated the execution time of our semantics in order to illustrate the applicability for non-trivial use cases. Figure 4 shows the achieved run times for increasing model sizes. We created the models by adopting the level variable of the test generation from Algorithm 1, by setting the input data size to 4KB, and by taking the average over ten models for each data point. It can be seen that for small models with about ten layers and non-trivial inputs the run time of the semantics is only about 0.6s, and even for models with about 50 layers, a prediction can be done in less than 3s. Hence, we believe our semantics is fast enough for reasonable models.
A major advantage of using Prolog is its declarative and high-level specification style, which enables a compact and straightforward implementation. We only needed about 3200 lines of code for all 72 layers, and on average 44 lines per layer, because Prolog already has a lot of built-in functionality, like list operations, and it also facilitates the reuse of code. In contrast, TensorFlow has a huge code base with about three million lines of code. Hence, we believe our specification provides a good opportunity for AI developers to learn more about the inner workings of layers, and that Prolog is a well-suited specification language since it supports a compact and convenient implementation of most layers.
RQ2: Does the semantics allow us to effectively generate valid AI test cases? In order to evaluate the effectiveness of our test generation approach, we used our test generator to produce 10,000 test models.
Table 1: Found TensorFlow issues and bugs.
| Issue type | Fixed or confirmed | Pending or unconfirmed | Total |
|---|---|---|---|
| TensorFlow bugs | 2 | 1 | 3 |
| Error message bugs | 4 | 2 | 6 |
| Documentation bugs | 4 | 1 | 5 |
| Total | 10 | 4 | 14 |
The run time of the generation was about 35 hours (on average 12.6s per model), of which most time was spent on gradually finding a valid model with feedback information from the execution of the semantics. Our generated AI models had an average size of 9.89 layers (excluding input layers), and we used small inputs with a range of one to four values per dimension since bigger sizes vastly increased the overall size of our highly dimensional input data.
Compared to state-of-the-art approaches for testing AI frameworks, our generated models have more diversity in terms of layers and structure. A related fuzzing method [29] only tests single layers and it is unclear how many layers it can test with valid inputs. A differential testing approach [43] can produce valid models with mutation, but they are less diverse since the mutations are only concerned with sequential model aspects, and they can only add 24 types of layers to the mutants. Other related approaches have less variety since they only use existing models or adopt the input data.
Out of the 10,000 test cases, 99.1% were valid, i.e., we were able to produce the same prediction result with our semantics as TensorFlow for the same model and inputs. Seven test cases failed because of a dependency issue of the Masking layer [14] that we could not fully support. (Almost all layers work independently, but the Masking layer is an exception. It requires all following layers to skip a specific input time step.) Two tests failed due to rare inputs which caused inconsistencies with the TensorFlow input argument checks. We are still investigating this issue because of its rare occurrence. Finally, one test failed due to a minor inconsistency in the handling of floating-point values, which only occurred very rarely when a lot of computations with recurrent layers were performed.
The high percentage of valid AI models that we were able to achieve with our test case generation shows that our method is effective, especially considering the low percentage (25%) of the related work [29]. Moreover, we were able to reveal various issues and bugs, as we will see in the answers to the next research questions.
RQ3: Is our test generation method based on the semantics effective for finding AI framework bugs? With our generated test cases, we were able to discover various issues and bugs in TensorFlow. Some were actual faults in TensorFlow, a few were related to wrong or misleading error messages, and some were documentation bugs. Table 1 shows an overview of the found issues, their type, and whether they were confirmed or not.
The most interesting bug that we found was related to the popular activation function (or layer) called ReLU, which showed wrong behaviour for negative thresholds, violating the description in the documentation [15]. We believe that this bug is the most significant since it concerns one of the most common activation functions, and AI models that used this function with a negative threshold (and negative input values) might have suffered from inaccuracy or might have produced unexpected prediction results. Another bug concerned the ConvLSTM2D layer, which produced potentially wrong prediction results in rare cases [4], and one was an ignored argument [11].
Table 2: Overview of the identified model validation issues.
| Model issue | Occurrences | TensorFlow inconsistencies |
|---|---|---|
| Dimension errors | 57 | 5 |
| Inconsistent input shapes | 29 | 2 |
| Argument errors | 14 | 3 |
One issue concerned an error message suggesting that another number format was required; however, when the format was changed, the message stated that the original format was needed [12]. There were various similar error message related issues [3, 5, 9, 17, 20].
A documentation bug concerned a missing precondition description, which caused an error when certain arguments of a SeparableConv layer were not equal [16]. A number of similar issues [1, 7, 13, 18] were also related to documentation inconsistencies.
In total, we found 14 issues that we reported to the TensorFlow developers and that helped to improve the framework. Ten issues were confirmed or have already been fixed; four are unconfirmed or still pending. Note that our method focuses on semantic bugs, like the ReLU bug [15], which are harder to find. Related work cannot be compared directly because it relies on simpler types of oracles, as we explain in Sect. 5. Hence, we believe our method is reasonably effective, and the fact that we found a major bug in one of the most common activation layers highlights the need for a good executable semantics.
RQ4: How well can our semantics report bugs of invalid AI models? In order to evaluate the bug finding and localisation capabilities of our semantics, we generated 100 invalid models by running our test case generation method, which gradually finds valid models. Retrieving the last (still invalid) model of this process gave us invalid models that contain only a single bug, which facilitates the manual analysis of the issues. We evaluated the types of issues that made the models invalid and compared the bug finding and reporting capabilities of our semantics to TensorFlow's.
Table 2 shows an overview of the types and frequencies of issues, and the number of TensorFlow inconsistencies that we found. The majority (57) of the models were invalid due to dimension errors, caused by layers that require input data with a maximum, minimum, or specific dimension, or by layers with multiple inputs whose dimensions do not match. The next largest portion, 29 models, were invalid due to inconsistent input shapes, which occur when a layer takes multiple inputs that are required to have the same shape or matching sizes in some dimensions. The remaining 14 models were invalid due to argument errors, like a pool or kernel size that is too large, or inconsistencies in weight (or bias) shapes. Most convolutional layers require weight arrays to have a specific shape that can depend on the input size. Hence, a model can, e.g., become invalid when the shape of the weights is inconsistent with the input shape.
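As an illustration, such argument-error preconditions can be sketched as follows; these Python helpers and message strings are hypothetical stand-ins for the actual Prolog preconditions of our semantics:

```python
def check_pool1d(input_length, pool_size):
    # Precondition for a 1D pooling layer (valid padding, no dilation):
    # the pool window must fit inside the input.
    if pool_size > input_length:
        return (f"argument error: pool size {pool_size} exceeds "
                f"input length {input_length}")
    return None  # precondition satisfied

def check_dense_weights(input_units, output_units, weight_shape):
    # Precondition for a Dense layer: the kernel weights must have
    # shape (input_units, output_units), i.e. the weight shape is
    # tied to the input shape.
    expected = (input_units, output_units)
    if weight_shape != expected:
        return (f"argument error: expected weight shape {expected}, "
                f"got {weight_shape}")
    return None
```

A model becomes invalid as soon as any such check returns a message, which makes the single-bug analysis above straightforward.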
Many error messages from TensorFlow were inconsistent, as indicated in the last column of Table 2. Even for simple cases like a wrong input dimension, there are different messages (“ValueError: Input 0 of layer X is incompatible with the layer: expected ndim=5, found ndim=3.” or “ValueError: Shape must be rank 4 but is rank 5 for X.” or “ValueError: Inputs should have rank 4. Received input shape ...”). The reason for these differences might be that the error messages are produced independently in different layers. In contrast, we can easily reuse a precondition that produces the same message in all layers, which facilitates automatic error processing.
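For illustration, a single shared rank precondition of this kind might look as follows in Python (the function name and message format are hypothetical; our semantics expresses such preconditions as Prolog predicates):

```python
def check_ndim(layer_name, expected_ndim, input_shape):
    # One shared precondition, reused by every layer: any violation
    # yields a message in exactly the same format, so downstream
    # tools can parse errors uniformly.
    actual = len(input_shape)
    if actual != expected_ndim:
        return (f"precondition failed: input of layer {layer_name} "
                f"has ndim={actual}, expected ndim={expected_ndim}")
    return None
```

Because every layer calls the same predicate, a Conv2D rank violation and a ConvLSTM2D rank violation differ only in the layer name, unlike the three TensorFlow messages quoted above.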
There were cases in which TensorFlow did not produce an error although there should have been one. For example, when cropping layers are used with too large cropping values, TensorFlow just produces an empty output [5] (more details are in Sect. 3.3). A similar issue concerned input shape verification that did not occur with specific prediction methods [9].
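A precondition that rejects oversized cropping values, instead of silently producing an empty output, can be sketched as follows (a simplified 1D Python stand-in for the Prolog precondition; the names are ours):

```python
def cropping1d(seq, left, right):
    # Cropping1D-style semantics with an explicit precondition:
    # the total cropping must leave at least one element; otherwise
    # the model is rejected rather than yielding an empty output.
    if left + right >= len(seq):
        raise ValueError("cropping values exceed the input length")
    return seq[left:len(seq) - right]
```

With this check, the too-large-cropping case from [5] is flagged as an invalid model instead of propagating an empty tensor through the rest of the network.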
Like TensorFlow, our semantics was able to identify the layers causing an invalid model in all cases. In four cases, there were discrepancies regarding the type of error. The reason is simple: some issues can have multiple causes, e.g., when a weight shape is inconsistent with an input, either of them may be wrong.
Generally, the messages we produce are simpler, consistent, and high-level. To be fair, TensorFlow's error messages can at times be more detailed (which is better for debugging), but our focus is on automatic error handling. Note that there seems to be no other work investigating TensorFlow's error messages for AI models. Finally, with our approach it is much easier to add preconditions for special use cases or types of layers, since the code is much more compact, which makes it easy to find the right place to add a precondition.
Discussion. A threat to the validity of our evaluation is that there are only limited ways to compare with existing works, although such a comparison would have been interesting. However, there are few related approaches that could be used for a fair comparison, since the closest related work on generating AI models uses a much weaker oracle and only targets simpler bugs.
One might argue that our AI framework semantics is limited since it is mostly concerned with the behaviour of layers. AI frameworks have other important components; e.g., it would be interesting to specify the learning algorithms. However, in this work we focused on the static behaviour of layers, since we believe it is a good starting point for a more general AI framework semantics, and our specification is already valuable since it can perform predictions for various complex AI models. Moreover, we illustrated how non-deterministic properties can be evaluated with Prolog.
Another potential threat to the validity of our evaluation could be that the test inputs we used to generate AI models were too small for a realistic study. AI models can handle huge data, like large images or videos. Hence, an evaluation with larger test inputs might be more realistic, but bugs in AI frameworks as well as in AI models can be independent of the input size. We believe that our test models with rather small inputs were still reasonable and do not represent a big limitation. Moreover, it is well known that small test cases can reveal various bugs [28].
One might argue that our model validation approach is not as useful as TensorFlow's bug reporting capabilities. It is true that the error messages produced by TensorFlow are more sophisticated and might be better for debugging. However, our goal was to produce simple and consistent error messages that can be used for applications such as automatic error handling or even model repair. Moreover, we wanted to provide an easy way to add new preconditions and reporting capabilities, which can be helpful since not all model issues were identified by TensorFlow.
5 RELATED WORK
Executable semantics for programming languages are related to our work because AI libraries serve a similar purpose as a compiler. For example, there are various executable semantics for popular programming languages that were built with the K framework [35]. Like our AI framework specification, such semantics can be used as a strong test oracle and facilitate test generation.
A related approach that also works with AI specifications was presented by Selsam et al. [38]. The authors apply an interactive theorem prover and partial synthesis of some system components to construct an AI system and to prove its correctness. Such a system can also be seen as a formal specification similar to ours, since it can be used to detect bugs. However, the authors focus more on an alternative (library) implementation that is correct by design, and they only consider a limited AI system. Moreover, the authors point out that their proof relies on an idealised setting with infinite-precision real numbers, which are replaced for a concrete execution; this replacement can be a potential source of bugs.
The closest related work to our testing approach was presented by Wang et al. [43]. The authors show a differential testing approach called LEMON that uses mutations to produce test cases, and it applies a heuristic strategy to guide the generation in order to amplify inconsistencies between different DL libraries. They do this to get rid of noise introduced by non-determinism in DL libraries. With our approach, we can directly test many library components deterministically, which makes the identification of bugs much easier. Moreover, we do not have to rely on the existence of other libraries. Our semantics can also be used to test newly developed features that are only provided by a single library.
A similar approach called CRADLE was presented by Pham et al. [33]. CRADLE also performs differential testing, and it applies distance metrics to detect inconsistent outputs. It localises the source of an inconsistency by tracking and analysing anomalies in the execution graph. Similarly, Srisakaokul et al. [41] show a differential method for supervised learning software and demonstrate it for k-Nearest Neighbour and Naive Bayes implementations.
A related fuzzing method was illustrated by Li [29]. The approach relies on extracting information from AI library documentation to find the input requirements of deep learning layers, which are needed to generate test data. In contrast to our work, this method has only very limited capabilities to produce valid tests (about 25%), and its oracle can only detect documentation bugs or input violations.
Dwarakanath et al. [23] have developed new metamorphic relations for metamorphic testing of AI libraries. For example, they perform permutations, rearrangements, or rotations of the input data that should not change the output. Then, they identify bugs based on unexpected changes in the output. They evaluate their method by artificially introducing bugs into image classification systems.
A similar metamorphic validation method was presented by Ding et al. [22]. The approach works with metamorphic relations that, e.g., add (or remove) images or image categories in the training, test, or validation data. They also work with image classifiers. Another metamorphic method, which permutes rows and columns and shuffles and renames input features in order to evaluate the balancedness of machine learning classifiers, was introduced by Sharma and Wehrheim [40]. In contrast to these approaches, our semantics works with many more AI applications, and its strong oracle supports the detection of a wider range of bugs.
Related to our model validation method are debugging approaches [26, 37, 42] for AI models that offer features like visualisation or auditing to find issues. However, such approaches are usually limited to specific types of models, e.g., classifiers, and their focus is not on introducing custom validation options.
6 CONCLUSION
We have introduced a novel executable semantics for AI frameworks and illustrated two of its applications. Our semantics is implemented in Prolog, and it focuses on the deterministic behaviour of deep learning layers. With Prolog, we could specify almost all layers of TensorFlow in a convenient and compact style.
One major application of our semantics is the automatic generation of test AI models. We applied a fuzzing approach that incorporates feedback from the semantics to gradually find valid models. The approach was able to consistently produce valid models in 99% of the cases, which is much higher than related approaches. Moreover, we evaluated the effectiveness of this method for finding AI framework bugs, and the results were encouraging. We discovered various issues and bugs in TensorFlow and could thereby help to improve this AI framework. Most notably, we even found a bug in the well-established ReLU function.
Another application that we presented is the validation of deep learning models. Our semantics is built with various preconditions that can find bugs in invalid AI models. The evaluation showed that we can produce consistent error messages, and we were able to identify issues that could not be discovered by TensorFlow.
We believe that these two approaches highlight the usefulness and generality of our semantics, which will also enable further applications. In the future, we aim to explore different test generation methods and AI model synthesis.
Acknowledgments. This project is supported by the National Research Foundation, Singapore and National University of Singapore through its National Satellite of Excellence in Trustworthy Software Systems (NSOE-TSS) office under the Trustworthy Software Systems – Core Technologies Grant (TSSCTG) award no. NSOETSS2019-03.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore and National University of Singapore (including its National Satellite of Excellence in Trustworthy Software Systems (NSOE-TSS) office).
Additionally, we would like to thank Mengdi Zhang and Ayesha Sadiq for their help with some experiments of this work.
