[Paper Translation] ExAIS: Executable AI Semantics


Original paper: https://arxiv.org/pdf/2202.09868


ExAIS: Executable AI Semantics

ABSTRACT

Neural networks can be regarded as a new programming paradigm, i.e., instead of building ever-more complex programs through (often informal) logical reasoning in the programmers’ mind, complex ‘AI’ systems are built by optimising generic neural network models with big data. In this new paradigm, AI frameworks such as TensorFlow and PyTorch play a key role, which is as essential as the compiler for traditional programs. It is known that the lack of a proper semantics for programming languages (such as C), i.e., a correctness specification for compilers, has contributed to many problematic program behaviours and security issues. While it is in general hard to have a correctness specification for compilers due to the high complexity of programming languages and their rapid evolution, we have a unique opportunity to do it right this time for neural networks (which have a limited set of functions, and most of them have stable semantics). In this work, we report our effort on providing a correctness specification of neural network frameworks such as TensorFlow. We specify the semantics of almost all TensorFlow layers in the logical programming language Prolog. We demonstrate the usefulness of the semantics through two applications. One is a fuzzing engine for TensorFlow, which features a strong oracle and a systematic way of generating valid neural networks. The other is a model validation approach which enables consistent bug reporting for TensorFlow models.

KEYWORDS

AI frameworks, AI libraries, deep learning models, semantics, specification, test case generation, model validation, AI model generation

1 INTRODUCTION

Artificial intelligence (AI) refers to intelligence expressed by machines. It can be based on imitating human intelligence or learning behaviour. For example, the technique called deep learning (DL) applies artificial neural networks to simulate the neurons of natural brains [36]. In recent years, the dramatic growth of data and computing power has led to the spread of DL approaches which enabled numerous new technologies, like autonomous driving, language processing, and face recognition [34].

In principle, AI systems can be seen as a new programming paradigm. Instead of building programs through (often informal) logical reasoning by programmers, AI systems are built by training neural network models with big data. For conventional programming, compilers or interpreters are major development tools. Similarly, for the development of neural networks the focus is on AI frameworks (or libraries) such as TensorFlow and PyTorch, which provide utilities, algorithms, and various types of layers that constitute DL models.

The lack of a proper semantics (or correctness specifications) for programming languages has resulted in many critical issues such as inconsistencies or vulnerabilities. Generally, it is hard to develop a comprehensive specification for compilers due to the high complexity and the rapid evolution of programming languages [39]. However, we believe that it is more feasible for AI frameworks, since the semantics of AI frameworks are usually stable and simpler, i.e., the set of functions and layers used in popular machine learning models are usually limited.

Although AI systems have been proven successful in many applications, they are not error free. Especially in safety critical applications, like autonomous driving or medical systems, even small bugs can have severe consequences. Thus, we need systematic, well-grounded analysis methods for AI systems. While the lack of a proper language specification has hindered many program analysis tasks, we have here a unique opportunity to start afresh by building a formal semantics of AI systems, which then serves as a solid foundation for developing ever more sophisticated analysis methods for AI systems. In this work, we make such an attempt by developing an executable (formal) semantics for AI frameworks. That is, we introduce a Prolog specification for almost all TensorFlow layers. Our specification provides the foundation for solving many AI analysis problems, such as AI test generation, model validation, and potentially AI model synthesis. In the following, we present one example application of the semantics, AI test generation.

Recently, testing DL systems has become a popular research topic, with a variety of approaches [24, 30, 44]. In comparison, testing AI frameworks has gained less attention. A bug in an AI library can potentially put all systems that were built with the library at risk. Hence, it is crucial to ensure the correctness of these libraries. Multiple researchers made some early attempts [22, 23, 33, 38, 40, 41, 43], although existing approaches suffer from at least two major issues. First, existing approaches often have a weak oracle (i.e., a test case passes if some simple algebraic properties are satisfied [22] or results in similar outputs on two implementations [40]), which can make it difficult to find deep semantic bugs. Second, existing approaches fail to systematically generate valid AI models, i.e., to our knowledge, the state-of-the-art approach has a success rate of 25% in generating valid models (after a complicated process combining NLP techniques with manually designed templates) [29].

Our executable semantics solves these two issues naturally. First, given an AI model, our semantics can be executed to compute precisely the constraints that are satisfied by the output (based on Prolog's underlying unification and rewriting system), which serves as a strong test oracle. Second, our semantics facilitates the development of a fuzzing-based test generation method to systematically produce valid AI models. Our semantics facilitates the generation process by providing feedback in the form of a badness value that indicates how far the model is away from being valid (i.e., from satisfying the precondition of all layers of the model). Our evaluation shows that we significantly improve the success rate of generating valid AI models (e.g., 99%) and can generate diverse model architectures consisting of various layers (in the form of sequences and complex graphs) as well as layer arguments, weights, and input data.

Structure. The rest of the paper is structured as follows. In Sect. 2, we introduce our semantics and show how we specify different layers. In Sect. 3, we present two applications of the semantics. In Sect. 4, we evaluate these two applications systematically. Lastly, we review the related work in Sect. 5 and conclude in Sect. 6.

2 AI FRAMEWORK SEMANTICS

In this section, we start with a brief introduction of the logic programming language Prolog [32] that we adopt for the development of our AI framework semantics. Afterwards, we describe our approach to developing the semantics through examples. In general, to support a variety of analysis tasks including testing, debugging and even synthesising AI models, it is important that the semantics fulfils the following requirements [35, 39]. First, the semantics must be declarative and high-level, so that it avoids complications due to implementation-level optimisation and facilitates a concise specification that is independent of implementation details. Second, it must support symbolic reasoning to allow various property, consistency and correctness checks, and thus it must be specified in logic. Lastly, it must be executable so that it can be used to support a variety of automated analysis tasks, like testing or debugging.

2.1 Prolog

Prolog is a declarative language that relies on first order logic. Programs in Prolog are built with rules and facts, which are usually concerned with mathematical relations. Hence, we believe that Prolog is suitable for a formal semantic representation. There might be more formal mathematical semantic models that, e.g., rely on Coq [38], but such notations are not as flexible and would make it difficult to support a large number of layers.

In Prolog, a user can ask queries, and the answer is computed automatically based on the rules and facts with a unification algorithm. The following example shows two facts stating that average and flatten are layers, and a rule that says that layers are AI components (ai_components).
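The example listing is not reproduced in this text; a minimal sketch of such facts and a rule (the predicate names are illustrative) is:

layer(average).
layer(flatten).

ai_components(X) :- layer(X).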

A user may ask queries such as ai_components(flatten)., which is simply answered by true, or ai_components(X)., where $X$ is a variable, which produces the following results.
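Assuming the facts and rule sketched above, an interactive session would look as follows:

?- ai_components(flatten).
true.

?- ai_components(X).
X = average ;
X = flatten.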

A rule can generally be seen as a predicate that has arguments (in the header) and describes the relation between these arguments with a conjunction of clauses in the body of the rule. Arguments can be atoms (lowercase) or variables (uppercase). Prolog supports a number of data types and has built-in features for various mathematical operations. Moreover, it includes functions for list and string processing, and higher order programming notations [21].

Prolog is a well-suited specification language for our purpose for multiple reasons. First, it is largely intuitive and convenient to use (i.e., by specifying rules and facts) and has a reasonably large user base, which is extremely important as we intend to invite the open-source community to collaboratively maintain the semantics. Second, it supports various list operations and mathematical expressions, which is helpful for capturing the semantics of AI libraries that involve mathematical operations on multidimensional data. Third, Prolog supports automated reasoning (through an underlying rewriting and search system), which is essential for our purpose. Lastly, Prolog is a well-studied declarative programming language, which is fundamentally different from the imperative programming that is used to implement AI frameworks. This makes it unlikely that our semantics suffers from the same bugs as the implementation.

2.2 Specifying the Semantics

In this work, we focus on developing a semantics for the TensorFlow framework, since it is the most popular AI framework. TensorFlow is a complex framework that provides various tools, implementations, and features for the development of AI systems. We systematically go through every component in the TensorFlow framework and identify all the components which are relevant to AI models themselves (rather than components for the training process). Our goal is to implement a comprehensive semantics for one of its core components, i.e., the layers [19]. We specify the semantics of nearly all layers, omitting a few for practical reasons; a detailed discussion of the omitted layers is in Sect. 4. In total, we specify the layer functionality that is required to perform a prediction for 72 layers of different types; we did not implement any TensorFlow algorithms that are concerned with the training of models or data processing. A full list of these layers is available in our repository [10]. For each layer, we systematically go through the documentation and run tests to understand the semantics. Unfortunately, the semantics of quite a number of layers are under-specified, and we have to seek further help from online forums and/or the authors. In multiple cases, we identify under-specified or mis-specified behaviours, like wrong handling of unexpected layer arguments, or missing restrictions for layer inputs (more details are in Sect. 4). This suggests that formalising the semantics alone is already useful, and can help to improve AI frameworks.


Figure 1: Histogram of the number of lines needed to specify the layers in Prolog.

The semantics of each layer is specified using one or more rules in Prolog. Figure 1 shows a histogram of the number of lines (or rules) required for the layers. It can be seen that many layers require fewer than 10 lines. The reason is that they reuse predicates of other layers, or built-in rules. A major fraction of layers requires no more than 40 lines, and nearly all of the remaining layers have fewer than 80 lines. One outlier (the Dot layer) has 106 lines, since the implementation supports various axis combinations for the dot product. In total, the semantics consists of about 3200 lines of Prolog code, with 44 lines per layer on average (when shared predicates are considered). The implementation of nearly all layers has a one-to-one correspondence with the deterministic functionality of the TensorFlow layers. This allows us, given an AI model, to straightforwardly invoke the relevant rules in our semantics and apply them to generate constraints on the model output.

We remark that since these Prolog rules are specified by humans, they are potentially buggy as well. Standard practices are applied to ensure the correctness of the semantics itself, such as code review, manual testing (e.g., using documentation examples) and automated testing (e.g., using our own fuzzing engine). While we cannot be completely sure that the semantics is correct, we have reason to believe that it is reasonably so. In general, one can always argue that correctness is an illusion and all there is is consistency (e.g., between manually-programmed TensorFlow and manually-written Prolog semantics). That is, as long as the same bug is not present in both the implementation and the semantics, inconsistency would arise and either TensorFlow or our semantics will be fixed. In the following, we illustrate our semantics with multiple examples.

Dense layer. Listing 1 shows our Prolog semantics of a Dense layer [6]. It is a standard densely connected layer, which contains a number of nodes, each of which is connected with all inputs. The output is computed as the weighted sum of all inputs at each node, with an added bias value. Our specification works as follows. The rule starting in Line 1 has multiple parameters: a list [I|Is], a weight array IWs, and a bias array Bs (which can be intuitively regarded as inputs), and a list Os (which can be regarded as the output). The list notation [I|Is] enables access to the first element I and the remaining elements Is of a list. Line 2 constrains the depth of the nested input to two. We handle higher dimensional input separately. Line 3 applies a predicate that is true when O can be unified with the expected output for an input I, and Line 4 is a recursive constraint, which intuitively continues with the next inputs.

The rule in Lines 6–9 is similar, except that it handles (layer) inputs with a higher dimension, which is checked in Line 7, and recursively uses our initial predicate from Line 1, since the dense layer only performs computations on the innermost list even when it receives high dimensional input data. Line 11 (and Line 18) are the base cases for the recursion, i.e., when only an empty list remains.

The predicate in Line 13 encodes the main layer functionality and becomes true when the Res variable is the expected output for the input [I|Is]. It has the same arguments as our first rule and an additional temporary variable Res0 for the result. It consists of clauses for multiplying the weight arrays IW with each input I and for adding the results in Line 16. The predicates multiply_list_with and add_lists are straightforward and are therefore omitted.
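Since Listing 1 itself is not reproduced above, the following self-contained sketch conveys the idea; it is our own simplification (helper names such as dense_node and the depth test via number/1 are illustrative, and the line numbering does not match the references in the text):

dense_layer([], _IWs, _Bs, []).                  % base case: no inputs left
dense_layer([I|Is], IWs, Bs, [O|Os]) :-
    I = [N|_], number(N),                        % innermost two dimensions reached
    dense_node(I, IWs, Bs, O),                   % O is the expected output for I
    dense_layer(Is, IWs, Bs, Os).                % continue with the next inputs
dense_layer([I|Is], IWs, Bs, [O|Os]) :-
    I = [L|_], is_list(L),                       % higher dimensional input: recurse inwards
    dense_layer(I, IWs, Bs, O),
    dense_layer(Is, IWs, Bs, Os).

% Weighted sum at each node; the bias vector seeds the accumulator.
dense_node([], [], Acc, Acc).
dense_node([I|Is], [IW|IWs], Acc, Res) :-
    multiply_list_with(IW, I, P),                % scale the node weights of input I
    add_lists(Acc, P, Acc1),
    dense_node(Is, IWs, Acc1, Res).

multiply_list_with([], _, []).
multiply_list_with([W|Ws], X, [P|Ps]) :- P is W * X, multiply_list_with(Ws, X, Ps).

add_lists([], [], []).
add_lists([A|As], [B|Bs], [S|Ss]) :- S is A + B, add_lists(As, Bs, Ss).

Loaded into SWI-Prolog, this sketch already answers the queries shown below.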

With this Prolog semantics, we can now answer a variety of queries, e.g., to compute the expected output of a Dense layer. The following listing shows some examples. The first is a basic query with two inputs and two nodes and the second has three nodes and more complex input. It can be seen that even such a small and simple specification can handle high dimensional input with just a few lines of code. The third query fails due to invalid arguments since the number of nodes must be equivalent to the size of the bias array and also to the second dimension of the weight array. Many layers have non-trivial preconditions like that. This makes the generation of test data difficult. In Section 3, we will illustrate how we can identify such invalid inputs and produce useful information about the source and severity of such issues to support our model validation and test generation approaches.

dense_layer([[[[[1,2,3]]], [[[4,5,6]]]]],
    [[2,3,4],[5,4,6],[7,6,8]], [2,3,4], X).
X = [[[[[35, 32, 44]]], [[[77, 71, 98]]]]]
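The basic and the failing queries referenced above do not survive in this text; with the sketch given earlier and illustrative values of our own, a two-input, two-node query and a query with a mismatched bias array behave as follows:

?- dense_layer([[1,2],[3,4]], [[2,3],[5,4]], [1,1], X).
X = [[13, 12], [27, 26]]

?- dense_layer([[1,2],[3,4]], [[2,3],[5,4]], [1,1,1], X).
false.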

Conv1D layer. A convolution layer [2] shifts a kernel (or pool) over the input space and produces the weighted sum of the elements within the kernel. It can, for instance, be used for image preprocessing or recognition. Listing 2 illustrates the (simplified for the sake of presentation) semantics of this layer. Line 1 shows a predicate for the layer, which has the (layer) input data Is, a number KernelSize for the one dimensional size of the window that should be shifted, a weight array IWs for the kernel elements, a bias array Bs of values that are added to the output, a step size Strides, a Boolean Padding for optionally enlarging the input data, e.g., by surrounding it with zeros, and the output data Os, as parameters. Lines 2–4 apply predicates for checking the validity of the layer inputs and arguments, i.e., to ensure that the input dimensions are as required, to check that the kernel is not too large for the input, and to verify that the weights are valid. An example implementation of such a precondition is given in Sect. 3.3. Line 5 applies the pool1D predicate, which is a generic predicate for different types of pooling or convolutional operations, and it becomes true when Os can be unified with the result of the operation.

Line 7 shows the pool1D predicate, which has the same arguments as the conv1D layer with an additional Poolfunc meta-predicate to decide what should happen within the pool, and an argument MultiLayerPool to specify if the pool should be shifted over the second axis of the input space. Next, there is a call to a subpredicate for one sample of the input in Line 8, with initialisations of coordinates and a temporary output variable, Line 9 iterates over the other samples, and Line 11 is a simple stopping clause.

In Lines 13–17, we apply padding with the calc_padding predicate, which is true when LeftP and RightP are padding sizes that maintain the input shape after pooling, and by using these variables for the padding1D predicate, which succeeds when Is1 is Is with padding. The predicate in Lines 19–24 does the actual application of the convolution. Line 21 utilises the pool predicate, i.e., the weighted sum of the kernel, for a given position X,Y, which is then appended to the output in Line 22. Next, Line 23 has a conditional clause that introduces new variables X1,Y1 for the next coordinates given the current pool position, and finally Line 24 performs recursion to continue with the next coordinates.

Lines 26–31 are defined similarly, except that the pool is also shifted over the second axis by iterating over the Y coordinate instead of just using the sum over these values. Finally, Lines 33–36 show a stopping clause, which becomes true when one of the indexes X,Y is outside the input space.

Similarly, with the above semantics, we can now query this layer as shown in the following examples. The first example has integer inputs and the second has floating-point inputs. It can be seen that variables are not limited to a specific type; our implementation works for integers as well as floating-point numbers.
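The original example queries are not shown here; the integer query below reuses the test case from Sect. 3.1, and the floating-point variant uses illustrative weights of our own:

?- conv1D_layer([[[5,8,7],[9,10,7],[3,7,7]]], 2,
       [[[1],[1],[1]],[[1],[1],[1]]], [0], 1, false, 1, X).
X = [[[46], [43]]]

?- conv1D_layer([[[5,8,7],[9,10,7],[3,7,7]]], 2,
       [[[0.5],[0.5],[0.5]],[[0.5],[0.5],[0.5]]], [0], 1, false, 1, X).
X = [[[23.0], [21.5]]]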

An interesting aspect of this layer is that a lot of the functionality can be reused for other layers. In particular, the pool1D predicate is defined generically so that it can be applied in the semantics of many related layers. For example, it is applied for simple pooling layers, like (global) average or max pooling, and for other convolutional layers, like separable convolution and locally connected layers.

Non-deterministic layers. Although most layers have deterministic behaviour when their weights are known, there are a few exceptions, including layers such as Dropout, AlphaDropout, SpatialDropout, GaussianDropout, and GaussianNoise. All of them either have built-in non-determinism (i.e., random behaviour) or are probabilistic in nature (i.e., producing outputs that follow a certain probability distribution). We defined Prolog rules that constrain these layers to satisfy the high-level properties stated in the respective documentation, e.g., if the layer is expected to produce results according to a specific distribution, they should. We demonstrate how this is done for the dropout layer [8].

This layer is expected to set values of the input to zero according to a given rate. (Additionally, the layer also adds noise to the data, but we neglect this feature in the example for brevity.) Listing 3 shows a predicate that checks if the dropout actually occurs at a given rate. The arguments are the inputs Is, the outputs from TensorFlow Os, the rate, and a threshold AcceptedRateDiff for the maximum difference to the observed rate. Line 2 shows predicates which are true when N and NO can be unified with the number of input and output values, which must be equal. Line 3 does the same for the number of zeros in the input, and Line 4 for the number of zeros in the given output. Next, Line 5 sets RealRate to the observed rate of dropouts while considering the already existing zeros in the input, and Line 6 applies the abs function to obtain the difference to the expected rate. Finally, Line 7 checks if the difference is bigger than our threshold and writes a message if that is the case.
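As Listing 3 is not reproduced here, the following sketch captures the described check (predicate and helper names are our own, the noise handling is omitted, and the input is assumed to be numeric and not all zeros):

check_dropout_rate(Is, Os, Rate, AcceptedRateDiff) :-
    flatten(Is, FIs), flatten(Os, FOs),
    length(FIs, N), length(FOs, N),              % input and output sizes must be equal
    include(=:=(0), FIs, ZIs), length(ZIs, ZI),  % zeros already present in the input
    include(=:=(0), FOs, ZOs), length(ZOs, ZO),  % zeros in the TensorFlow output
    RealRate is (ZO - ZI) / (N - ZI),            % observed dropout rate
    Diff is abs(RealRate - Rate),
    (   Diff =< AcceptedRateDiff
    ->  true
    ;   format("observed dropout rate ~w deviates from expected rate ~w~n",
               [RealRate, Rate]),
        fail
    ).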

3 APPLICATIONS OF THE SEMANTICS

The semantics has many applications. In this section, we demonstrate two of them. We illustrate the overall design of our testing and model validation approaches and show step-by-step examples.

3.1 AI Framework Testing

Recently, multiple researchers started working on AI framework testing [22, 23, 33, 38, 40, 41, 43]. While an impressive number of bugs have been identified, there are still many open challenges. Based on our executable semantics, we introduce a novel AI framework testing method which improves on the existing approaches by addressing the following problems. (1) The oracle problem (i.e., how to verify whether the output is correct). (2) The test generation problem (i.e., how to effectively generate valid test cases). For the first problem, the existing approaches focus on differential testing [33, 41, 43] or a simple form of metamorphic testing [22, 23, 40]. For differential testing, multiple libraries (or implementations) are compared against each other, and bugs are identified when an inconsistency is found. The kind of metamorphic testing done on AI frameworks introduces changes in the test input (i.e., an AI model) that should not affect the output. A different output would thus indicate a bug. In other words, such testing relies on the algebraic property that 'dead code' does not change the model's behaviour. Differential testing is powerful since a second implementation can serve as a good test oracle. However, a second independent implementation is not always available, especially for new algorithms. Even if it is available, it is still possible that it suffers from the same bugs, due to reasons such as using the same underlying library or implementing certain optimisations in the same 'wrong' way. Our method can also be seen as a form of differential testing, but it is unlikely that our semantics suffers from the same bugs, since it is developed in a different modelling paradigm at a high level (e.g., where little performance optimisation is involved). On the other hand, metamorphic testing is usually limited to specific types of bugs based on algebraic properties (i.e., a very small piece of the semantics) which are used to guide the test generation. Another issue that limits the applicability of both approaches is the non-deterministic behaviour of the layers, e.g., an inconsistency may be due to non-deterministic or probabilistic behaviours rather than bugs. Usually, the training of AI models contains random factors that cause small discrepancies in different implementations and make an exact comparison and the identification of bugs difficult. This issue is sometimes resolved by relying on approximating oracles that accept a range of inputs. This is not an ideal solution since it often requires manual adjustments [31]. Based on our semantics, we have an alternative, more powerful solution for this problem. By developing our semantics with a focus on the high-level behaviour of the layers, we can test the results of an AI model independently of the non-deterministic training algorithms. This facilitates the detection of subtle bugs without the interference of underlying non-deterministic or probabilistic behaviours.

The second problem, i.e., the test case generation problem, is often addressed by using existing benchmark examples or by modifying these examples, e.g., via mutation [43]. The problem of the former is that existing examples are limited in both number and variety. The problem of the latter is that test cases generated through mutation are often invalid, e.g., due to non-trivial preconditions that must be satisfied by the layers. Most AI libraries contain dozens of different layers that can be connected in sequences or in complex graph structures to form deep learning models. There are simple layers, like a densely connected layer, and also highly complicated layers which apply operations on multidimensional inputs or even combine the functionality of other layers. Usually, they need a number of configuration arguments, e.g., for specifying the number of neurons or a filter size. Most layers have preconditions for the inputs and hence they cannot be freely combined. To address this problem, a recent approach [29] proposes to extract information about the input requirements from the library documentation and to generate 'valid' test cases accordingly. However, after all its effort, it still has very limited capability in producing valid test data, i.e., the success rate is around 25%.

In this work, we develop a fuzzing method which utilises feedback from our semantics to enable a systematic way of producing valid test cases, i.e., by finding layers (and associated arguments) that are compatible with each other. Moreover, our approach explores various layer arguments in order to find interesting combinations that might reveal bugs. The overall workflow is depicted in Fig. 2. Next, we will explain the involved tasks and components.

Our method has two main components. First, the most important component is the oracle, which is our executable Prolog semantics. Given a deep learning model (in the form of collection of layers and their arguments) and the input data, the oracle executes the semantics to produce a prediction, in a similar fashion as it is done by an AI library. Second, the test case generator is a tool that produces test data in the form of multidimensional arrays and a deep learning model as a collection of layers and their respective arguments. Our test case generator is based on fuzzing techniques and


Figure 2: Overview of the data flow of our testing approach.

produces random input data and models. The focus of the generation is to produce complex models with graph structures in order to extensively test combinations of different layers in various settings.

In the following, we use the simplified semantics of the convolution layer that we showed in the last section to demonstrate the test generation. The test case generator produces test input data as well as a deep learning model (including layer arguments) to reveal bugs or inconsistencies in the AI library. The model and the input data (for performing a prediction) together form a test case. A test case is automatically generated in the form of a Prolog query, which is given to the Prolog interpreter to compute a result with a unification algorithm. For example, a simple query for our convolutional layer is as follows.

conv1D_layer([[[5,8,7],[9,10,7],[3,7,7]]], 2,
    [[[1],[1],[1]],[[1],[1],[1]]], [0], 1, false, 1, X).

The same test case for TensorFlow is expressed as a simple Python program, as shown in Listing 4. The first lines are imports and configurations. Lines 3-4 define the actual deep learning model. In Lines 5-8 we overwrite the weights for the kernel and the biases in order to support a deterministic prediction. Usually these weights are initialised randomly and tuned through training. Line 9 sets the input data and Line 10 performs a prediction with the model for the input data, which gives us the test output.

Algorithm 1 Pseudo code of the test model generation
Input: GenHelper: a helper class with a map of layer generators (Layer.Name → Generator); SemanticHelper: a helper class for executing the semantics
1: function RECURSIVEGENERATION(level, rand)
2:   if level = 0 or rand.nextBool() then            ▷ allow deeper graphs depending on a random outcome
3:     return null                                   ▷ end of the recursion
4:   layer ← GenHelper.generateLayer()               ▷ generate a random layer, including arguments
5:   inputNumber ← 1
6:   if layer.hasMultipleInputs() then               ▷ check whether the layer takes multiple inputs
7:     inputNumber ← rand.nextInteger(3)
8:   for i ← 1 to inputNumber do
9:     layer1 ← recursiveGeneration(level − 1, rand)
10:    layer1.setParent(layer), layer.addChild(layer1)
11:  return layer
12: function FINDVALIDMODEL(layer, maxTries, rand)   ▷ find a valid model given a random model
13:   for i ← 1 to maxTries do
14:     (success, error) ← SemanticHelper.run(layer) ▷ run the semantics, obtain success or an error
15:     if success then
16:       return layer
17:     else if error.getLocation() = lastError.getLocation() and error.getBadness() ≥ lastError.getBadness() then
18:       layer ← lastLayer, error ← lastError       ▷ restore the model if there was no improvement
19:     else lastLayer ← layer, lastError ← error
20:     layer1 ← error.getLocation()
21:     if rand.nextBool() then
22:       layer1.regenerateArgs()                    ▷ reset the layer arguments
23:     else if rand.nextBool() then
24:       options ← ["Reshape", "UpSampling", "Cropping", "Padding", ...]
25:       layer2 ← GenHelper.generateLayer(options)  ▷ generate one of the given layer types
26:       layer1.insertLayerBeforeChild(layer2)      ▷ insert a layer before the input
27:     else if rand.nextBool() then
28:       layer2 ← GenHelper.generateLayer()
29:       layer1.replaceWith(layer2)                 ▷ replace the current layer with another one
30:     else if rand.nextBool() then
31:       layer2 ← GenHelper.generateLayer()
32:       layer1.replaceChildWith(layer2)            ▷ replace the input layer with another one
33:   return null

Finally, we compare the output of the test program (that includes the AI library) to the output of the semantics. A test fails if the constraints produced by the semantics are not satisfied by the output from the Python program. In general, such a failure can be caused by bugs in the library and possibly by bugs in the semantics. For our example, the output was [[[46], [43]]] in both cases, which represents a successful test. A bug is identified when there is an inconsistency in the outputs (prediction results) of the TensorFlow program and of our semantics. If there is an inconsistency, we localise the bug by checking which layer specification is violated, and by manually checking the cause.

3.2 Test Case Generation

The test case generator is a tool that can produce models in the form of Python scripts (that use TensorFlow) as well as models represented as Prolog queries. It also generates random test inputs (in the form of multidimensional arrays that need to have a specific shape or number of dimensions depending on the first layer) for a prediction with a model and layer arguments, which can also be the weights of layers. The generation is done by the test case generator alongside the AI model generation while considering input requirements. The details about the required arguments can be found in the TensorFlow documentation. A test case is a model together with the associated test inputs.

The main functionality of the test generator is the generation of diverse functional models. Pseudo code of the generation process is illustrated in Algorithm 1. It consists of two major functions for randomly building a model and for finding a valid model. The first is a recursive function (Line 1) that starts by selecting an initial root layer and continues to connect random layers onto it. The algorithm increases the probability of stopping based on the level that represents the distance of the current layer to the root layer (Line 2) to prevent too large models. Many layers can take the output of one layer as input, and some can have inputs from multiple layers, e.g., to perform mathematical operations, like addition, or multiplication of the inputs. Hence, there is a variable for the number of inputs, which is set according to the generated layer (Lines 5–7). Finally, the recursion is performed for each input, a connection is formed with the associated functions (by feeding the output of a layer as input to another layer), and the layer is returned. The model in this case is just the root layer that is connected to other layers. Note that the models produced by this function are most likely invalid, and for simplicity it produces models in the form of trees. The semantics also supports more complex graphs.


Figure 3: Model visualisation in graph form

The second function takes this model as a basis to find a valid model with the help of the semantics (Line 12). Our semantics can identify invalid models based on layer preconditions, which are defined as part of the corresponding Prolog rules. These preconditions give feedback in the form of a badness value to guide the generation process towards valid models. Given a concrete model and a precondition to be satisfied, the badness value is defined in the same way as fitness in search-based software testing [25]. It is a distance metric that increases when the model is further away from becoming valid, i.e., we rank the preconditions based on severity and calculate a value by multiplying a severity factor with the difference between, e.g., an expected argument value and an observed one.
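A minimal sketch of such a computation (the predicate name and severity values are illustrative, not the paper's exact formulation):

badness(Severity, Expected, Observed, Badness) :-
    Badness is Severity * abs(Expected - Observed).

% e.g., a kernel of size 5 on an input axis of length 3, with severity factor 2:
% ?- badness(2, 3, 5, B).   gives   B = 4.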

Line 14 shows that we apply a SemanticHelper to run our semantics and to retrieve a success or an error message. If the semantics was executed successfully, then the model is valid and the algorithm stops (Line 16). If an error occurs during the execution, we check if it originates from the same layer as a previous error, and if it is more severe than the last error (based on the badness value), which would indicate that the last model change did not improve the model. In this case, we restore the last model (Line 18). Otherwise, we try to adapt the model to gradually improve its validity (Lines 21–32). This is done by randomly selecting a potential model modification for the layer that causes the invalidity, like regenerating arguments, inserting a layer to change the input shapes/dimensions, or replacing the layer with one of its input layers. This process is repeated until a valid model is found or a maximum number of tries is reached.

An example graph model that was generated with our approach is illustrated in Fig. 3. The nodes in the model represent layers and the arrows show the data flow. It can be seen that the model includes various layers, like convolutional or reshape layers, and the output of the model is derived from the root layer, i.e., the average layer at the bottom. The corresponding TensorFlow and Prolog models are available online in our repository [10].

3.3 Model Validation

Another application of our semantics is model validation, i.e., identifying bugs in faulty (or invalid) AI models. It can be challenging to create working models, especially since the layers have numerous preconditions that need to be fulfilled when they are used or combined. For example, common errors are wrong input shapes and dimensions, or arguments that are invalid for the input data, like a convolution kernel or an axis value that is too large for an input dimension [27, 45]. AI libraries, like TensorFlow, already produce error messages to detect such issues. However, these messages can be difficult to understand, they are not always consistent for the same type of issue, and in rare cases there is no error for a problematic model (see the example below). Hence, we provide an alternative way to validate models by identifying and reporting issues precisely and consistently with the help of our semantics. By converting an existing TensorFlow model into our Prolog representation, we can exploit our semantics to check if there are any precondition violations in the model, which would make it invalid.

An example precondition is shown in Listing 5. It checks if a cropping layer applies too much cropping, which would result in an empty output. There is a predicate check_empty_cropping that has the layer input and output as arguments, and it is true when the output is not empty. Otherwise, an exception is thrown. First, in Lines 1–2 there is a recursive rule for the inspection of multidimensional output. Next, the rule in Line 3 makes our predicate true if any numerical value is found in the output. If that is not the case, then we generate an error message in Lines 4–7.
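A condensed sketch of such a precondition (we collapse the listing's explicit recursion into flatten/2; the error term is illustrative):

check_empty_cropping(_Is, Os) :-
    flatten(Os, Flat),
    Flat = [_|_], !.                         % at least one value survived the cropping
check_empty_cropping(Is, _Os) :-
    format(atom(Msg), "cropping results in an empty output for input ~w", [Is]),
    throw(precondition_error(cropping, Msg)).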

An example TensorFlow model that contains a corresponding cropping issue is shown in Listing 6. It can be seen that our model consists of four layers: a convolutional layer that is connected with a cropping layer, a maximum pool layer, and a concatenate layer that takes the outputs of the cropping and pool layers. The same model in Prolog is shown in Listing 7. This model has a slightly diff