[论文翻译]论智能的衡量标准


原文地址:https://arxiv.org/pdf/1911.01547v2


On the Measure of Intelligence

论智能的衡量标准

Francois Chollet Google, Inc. fchollet@google.com

Francois Chollet Google, Inc. fchollet@google.com

November 5, 2019

2019年11月5日

Abstract

摘要

To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks, such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to “buy” arbitrary levels of skill for a system, in a way that masks the system’s own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience, as critical pieces to be accounted for in characterizing intelligent systems. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a new benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.

为了朝着更智能、更类人的人工系统方向取得有意识的进展,我们需要遵循适当的反馈信号:必须以能够比较两个系统以及与人类进行比较的方式定义和评估智能。过去一百年间,心理学和AI领域涌现了大量定义与测量智能的尝试。我们总结并批判性评估了这些定义和评估方法,同时揭示了隐含指导它们的两种历史性智能概念。我们注意到,在实践中,当代AI社区仍倾向于通过比较AI与人类在特定任务(如棋盘游戏和电子游戏)中展现的技能来基准测试智能。我们认为仅测量任何给定任务的技能不足以衡量智能,因为技能高度受先验知识和经验调节:无限的先验或训练数据允许实验者以掩盖系统自身泛化能力的方式为系统"购买"任意水平的技能。接着,我们基于算法信息论提出了一个新的智能形式化定义,将智能描述为技能获取效率,并强调范围、泛化难度、先验和经验等概念是表征智能系统的关键要素。基于此定义,我们提出了一套通用AI基准应遵循的准则。最后,我们介绍了一个紧密遵循这些准则的新基准——抽象与推理语料库(ARC),它建立在显式先验集合之上,这些先验被设计为尽可能接近人类与生俱来的先验。我们认为ARC可用于测量类人的通用流体智能,并实现AI系统与人类之间公平的通用智能比较。

Contents

目录

I Context and history 3

I 背景与历史 3

II A new perspective 18

II 全新视角 18

II.1 Critical assessment 18

II.1 批判性评估 18

III A benchmark proposal: the ARC dataset 46

III 基准测试提案:ARC数据集 46

III.1 Description and goals 46

III.1 描述与目标 46

I Context and history

I 背景与历史

I.1 Need for an actionable definition and measure of intelligence

I.1 对可操作的智能定义与衡量标准的需求

The promise of the field of AI, spelled out explicitly at its inception in the 1950s and repeated countless times since, is to develop machines that possess intelligence comparable to that of humans. But AI has since been falling short of its ideal: although we are able to engineer systems that perform extremely well on specific tasks, they still have stark limitations, being brittle, data-hungry, unable to make sense of situations that deviate slightly from their training data or the assumptions of their creators, and unable to repurpose themselves to deal with novel tasks without significant involvement from human researchers.

人工智能领域的承诺,早在20世纪50年代其诞生之初就被明确提出,并在此后被无数次重申,即开发出具有与人类相当智能的机器。但人工智能始终未能达到这一理想:尽管我们能够设计出在特定任务上表现极其出色的系统,它们仍存在明显局限——系统脆弱、依赖海量数据、无法理解略微偏离训练数据或开发者预设条件的情境,且在没有人类研究者深度介入的情况下,无法自主调整以应对新任务。

If the only successes of AI have been in developing narrow, task-specific systems, it is perhaps because only within a very narrow and grounded context have we been able to define our goal sufficiently precisely, and to measure progress in an actionable way. Goal definitions and evaluation benchmarks are among the most potent drivers of scientific progress. To make progress towards the promise of our field, we need precise, quantitative definitions and measures of intelligence – in particular human-like general intelligence. These would not be merely definitions and measures meant to describe or characterize intelligence, but precise, explanatory definitions meant to serve as a North Star, an objective function showing the way towards a clear target, capable of acting as a reliable measure of our progress and as a way to identify and highlight worthwhile new approaches that may not be immediately applicable, and would otherwise be discounted.

如果人工智能的成功仅限于开发狭窄、任务专用的系统,那或许是因为只有在非常局限且具体的背景下,我们才能足够精确地定义目标,并以可操作的方式衡量进展。目标定义和评估基准是推动科学进步最有力的因素之一。为了实现本领域的承诺,我们需要对智能——尤其是类人的通用智能——进行精确的量化定义和测量。这些不应仅是用于描述或表征智能的定义与指标,而应成为指引方向的北极星:作为目标函数指明清晰靶向,既能可靠衡量进展,又能识别并凸显那些虽非立即可用却极具价值的新方法——这些方法在传统评估中往往会被忽视。

For instance, common-sense dictionary definitions of intelligence may be useful to make sure we are talking about the same concepts, but they are not useful for our purpose, as they are not actionable, explanatory, or measurable. Similarly, the Turing Test [91] and its many variants (e.g. Total Turing Test and Loebner Prize [75]) are not useful as a driver of progress (and have in fact served as a red herring 1), since such tests completely opt out of objectively defining and measuring intelligence, and instead outsource the task to unreliable human judges who themselves do not have clear definitions or evaluation protocols.

例如,常识性的智力词典定义可能有助于确保我们在讨论相同的概念,但对我们的目的而言并不实用,因为它们不具备可操作性、解释性或可测量性。同样,图灵测试 [91] 及其众多变体(如完全图灵测试和Loebner奖 [75])作为进展的驱动力也毫无用处(事实上已成为干扰项),因为此类测试完全回避了客观定义和测量智力,而是将任务外包给不可靠的人类评判者,而这些评判者本身也没有清晰的定义或评估协议。

It is a testimony to the immaturity of our field that the question of what we mean when we talk about intelligence still doesn’t have a satisfying answer. What’s worse, very little attention has been devoted to rigorously defining it or benchmarking our progress towards it. Legg and Hutter noted in a 2007 survey of intelligence definitions and evaluation methods [53]: “to the best of our knowledge, no general survey of tests and definitions has been published". A decade later, in 2017, Hernandez-Orallo released an extensive survey of evaluation methods [36] as well as a comprehensive book on AI evaluation [37]. Results and recommendations from both of these efforts have since been largely ignored by the community.

当我们谈论智能时究竟指什么,这个问题至今仍无令人满意的答案,这恰恰证明了我们领域的不成熟。更糟的是,学界极少投入精力去严格定义智能或建立衡量进展的基准。Legg和Hutter在2007年关于智能定义与评估方法的综述[53]中指出:"据我们所知,目前尚未发表过关于测试与定义的全面综述"。十年后的2017年,Hernandez-Orallo先后发布了评估方法的详尽综述[36]和关于AI评估的专著[37]。然而这些研究成果提出的结论与建议,至今仍在很大程度上被学界忽视。

We believe this lack of attention is a mistake, as the absence of widely-accepted explicit definitions has been substituted with implicit definitions and biases that stretch back decades. Though invisible, these biases are still structuring many research efforts today, as illustrated by our field’s ongoing fascination with outperforming humans at board games or video games (a trend we discuss in I.3.5 and II.1). The goal of this document is to point out the implicit assumptions our field has been working from, correct some of its most salient biases, and provide an actionable formal definition and measurement benchmark for human-like general intelligence, leveraging modern insight from developmental cognitive psychology.

我们认为这种关注不足是一个错误,因为缺乏广泛接受的明确定义已被延续数十年的隐性定义和偏见所替代。尽管无形,这些偏见至今仍在构建许多研究工作,正如我们领域持续痴迷于在棋盘游戏或电子游戏中超越人类所展示的那样(这一趋势我们将在I.3.5和II.1中讨论)。本文档的目标是指出我们领域一直遵循的隐性假设,纠正其中一些最突出的偏见,并利用发展认知心理学的现代见解,为类人通用智能提供可操作的形式化定义和测量基准。

I.2 Defining intelligence: two divergent visions

I.2 定义智能:两种不同的愿景

Looked at in one way, everyone knows what intelligence is; looked at in another way, no one does.

从某种角度看,人人都明白智能是什么;换一个角度看,又无人能真正说清。

Robert J. Sternberg, 2000

Robert J. Sternberg, 2000

Many formal and informal definitions of intelligence have been proposed over the past few decades, although there is no existing scientific consensus around any single definition. Sternberg & Detterman noted in 1986 [87] that when two dozen prominent psychologists were asked to define intelligence, they all gave somewhat divergent answers. In the context of AI research, Legg and Hutter [53] summarized in 2007 no fewer than 70 definitions from the literature into a single statement: “Intelligence measures an agent’s ability to achieve goals in a wide range of environments.”

过去几十年间,人们提出了许多关于智能的正式与非正式定义,但至今仍未形成统一的科学共识。Sternberg & Detterman在1986年[87]指出,当询问二十多位著名心理学家如何定义智能时,他们的回答均存在差异。在AI研究领域,Legg和Hutter[53]于2007年从文献中汇总了至少70种定义,并提炼为一个核心表述:"智能衡量智能体在多样化环境中实现目标的能力。"

This summary points to two characterizations, which are nearly universally – but often separately – found in definitions of intelligence: one with an emphasis on task-specific skill (“achieving goals”), and one focused on generality and adaptation (“in a wide range of environments”). In this view, an intelligent agent would achieve high skill across many different tasks (for instance, achieving high scores across many different video games). Implicitly here, the tasks may not necessarily be known in advance: to truly achieve generality, the agent would have to be able to learn to handle new tasks (skill acquisition).

该总结指出了两种普遍存在但常被单独讨论的智能定义特征:一种强调特定任务技能("实现目标"),另一种关注通用性与适应性("在多样化环境中")。从这个角度看,一个智能体(intelligent agent)应能在多种不同任务中展现高超技能(例如在多款电子游戏中获得高分)。这里隐含的前提是:这些任务未必需要预先设定——要实现真正的通用性,智能体必须具备学习处理新任务的能力(技能获取)。

These two characterizations map to Cattell’s 1971 theory of fluid and crystallized intelligence (Gf-Gc) [13], which has become one of the pillars of the dominant theory of human cognitive abilities, the Cattell-Horn-Carroll theory (CHC) [62]. They also relate closely to two opposing views of the nature of the human mind that have been deeply influential in cognitive science since the inception of the field [85]: one view in which the mind is a relatively static assembly of special-purpose mechanisms developed by evolution, only capable of learning what it is programmed to acquire, and another view in which the mind is a general-purpose “blank slate” capable of turning arbitrary experience into knowledge and skills, and that could be directed at any problem.

这两种特征映射到 Cattell 1971 年提出的流体智力与晶体智力理论 (Gf-Gc) [13],该理论已成为人类认知能力主流理论——Cattell-Horn-Carroll 理论 (CHC) [62] 的支柱之一。它们还与认知科学领域自创立以来极具影响力的两种对立心智观紧密相关 [85]:一种观点认为心智是由进化形成的特殊目的机制的相对静态集合体,仅能习得预设程序允许的内容;另一种观点则认为心智是通用"白板",能将任意经验转化为知识与技能,并可应用于任何问题。

A central point of this document is to make explicit and critically assess this dual definition that has been implicitly at the foundation of how we have been conceptualizing and evaluating intelligence in the context of AI research: crystallized skill on one hand, skill-acquisition ability on the other. Understanding this intellectual context and its ongoing influence is a necessary step before we can propose a formal definition of intelligence from a modern perspective.

本文的核心在于明确并批判性评估这种隐含在AI研究中对智能概念化和评估基础的双重定义:一方面是固化技能 (crystallized skill),另一方面是技能获取能力 (skill acquisition ability)。理解这一学术背景及其持续影响,是我们从现代视角提出正式智能定义的必要前提。

I.2.1 Intelligence as a collection of task-specific skills

I.2.1 作为特定任务技能集合的智能

In the distant future I see open fields for far more important researches. Psychology will be based on a new foundation, that of the necessary acquirement of each mental power and capacity by gradation.

在遥远的未来,我看到更重要的研究领域将敞开大门。心理学将建立在新基础之上,即通过渐进方式获得每种心智能力与才能的必要性。

Charles Darwin, 1859

查尔斯·达尔文,1859

The evolutionary psychology view of human nature is that much of the human cognitive function is the result of special-purpose adaptations that arose to solve specific problems encountered by humans throughout their evolution (see e.g. [19, 74]) – an idea which originated with Darwin [21] and that coalesced in the 1960s and 1970s. Around the same time that these ideas were gaining prominence in cognitive psychology, early AI researchers, perhaps seeing in electronic computers an analogue of the mind, mainly gravitated towards a view of intelligence as a set of static program-like routines, heavily relying on logical operators, and storing learned knowledge in a database-like memory.

人类本性的进化心理学观点认为,人类大部分认知功能是特殊目的适应性机制的结果,这些机制为解决人类进化过程中遇到的特定问题而形成(参见[19,74])——这一观点源自达尔文[21],并于20世纪60至70年代逐渐成型。当这些思想在认知心理学领域崭露头角时,早期AI研究者或许将电子计算机视为心智的类比,主要倾向于将智能视为一组静态的类程序例程,其高度依赖逻辑运算符,并将习得知识存储于类数据库记忆中。

This vision of the mind as a wide collection of vertical, relatively static programs that collectively implement “intelligence”, was most prominently endorsed by influential AI pioneer Marvin Minsky (see e.g. The Society of Mind, 1986 [63]). This view gave rise to definitions of intelligence and evaluation protocols for intelligence that are focused on task-specific performance. This is perhaps best illustrated by Minsky’s 1968 definition of AI: “AI is the science of making machines capable of performing tasks that would require intelligence if done by humans” 2. It was then widely accepted within the AI community that the “problem of intelligence” would be solved if only we could encode human skills into formal rules and encode human knowledge into explicit databases.

将心智视为一系列垂直、相对静态程序的集合,这些程序共同实现"智能"的观点,最具影响力的是AI先驱Marvin Minsky所倡导的(参见《心智社会》[63])。这种观点催生了以任务表现为核心的智能定义和评估方法。Minsky在1968年对AI的定义最能体现这一点:"AI是让机器能够完成人类需要智能才能完成的任务的科学"。当时AI界普遍认为,只要能将人类技能编码为形式规则、将人类知识编码为显式数据库,"智能问题"就能得到解决。

This view of intelligence was once so dominant that “learning” (discounted as pure memorization) was often not even mentioned at all in AI textbooks until the mid-1980s. Even McCarthy, a rare advocate for generality in AI, believed that the key to achieving generality was better knowledge bases [60]. This definition and evaluation philosophy, focused entirely on skill at narrow tasks normally handled by humans, has led to a striking paradox, as pointed out by Hernandez-Orallo [36] in his 2017 survey: the field of artificial intelligence has been very successful in developing artificial systems that perform these tasks without featuring intelligence, a trend that continues to this day.

这种对智能的看法曾一度占据主导地位,以至于直到20世纪80年代中期,AI教科书中甚至经常完全不提"学习"(被贬低为纯粹的记忆)。就连罕见的AI通用性倡导者McCarthy也认为,实现通用性的关键在于更好的知识库[60]。正如Hernandez-Orallo[36]在2017年的综述中指出的,这种完全聚焦于人类常规处理的狭窄任务技能的定义和评估理念,导致了一个引人注目的悖论:人工智能领域在开发能执行这些任务却不体现智能的人工系统方面非常成功,这一趋势延续至今。

I.2.2 Intelligence as a general learning ability

I.2.2 作为通用学习能力的智能

Presumably the child brain is something like a notebook as one buys it from the stationer’s. Rather little mechanism, and lots of blank sheets.

儿童的大脑大概就像从文具店买来的笔记本。机械结构很少,空白页很多。

Alan Turing, 1950

Alan Turing, 1950

In contrast, a number of researchers have taken the position that intelligence lies in the general ability to acquire new skills through learning; an ability that could be directed to a wide range of previously unknown problems – perhaps even any problem at all. Contrast Minsky’s task-focused definition of AI with the following one, paraphrased from McCarthy [60] by Hernandez-Orallo: “AI is the science and engineering of making machines do tasks they have never seen and have not been prepared for beforehand” [36].

相比之下,许多研究者认为智能的本质在于通过学习获得新技能的通用能力;这种能力可以应用于各种前所未见的问题——甚至可能是任何问题。将Minsky以任务为核心的AI定义与Hernandez-Orallo转述McCarthy [60] 的观点进行对比:"AI是一门让机器完成它们从未见过且事先未做准备的任务的科学与工程" [36]。

The notion that machines could acquire new skills through a learning process similar to that of human children was initially laid out by Turing in his 1950 paper [91]. In 1958, Friedberg noted astutely: “If we are ever to make a machine that will speak, understand or translate human languages, solve mathematical problems with imagination, practice a profession or direct an organization, either we must reduce these activities to a science so exact that we can tell a machine precisely how to go about doing them or we must develop a machine that can do things without being told precisely how” [26]. But although the idea of generality through learning was given significant consideration at the birth of the field, and has long been championed by pioneers like McCarthy and Papert, it lay largely dormant until the resurgence of machine learning in the 1980s.

机器能够通过类似人类儿童的学习过程获得新技能的观点,最初由图灵在其1950年的论文[91]中提出。1958年,Friedberg敏锐地指出:"如果我们想要制造一台能够说话、理解或翻译人类语言、富有想象力地解决数学问题、从事专业工作或管理组织的机器,要么我们必须将这些活动简化为足够精确的科学,以便准确告诉机器如何执行它们,要么我们必须开发一种无需精确指导就能完成任务的机器" [26]。尽管通过学习实现通用性的想法在该领域诞生之初就受到高度重视,并长期得到McCarthy和Papert等先驱的倡导,但在20世纪80年代机器学习复兴之前,这一理念基本处于休眠状态。

This view of intelligence echoes another long-standing conception of human nature that has had a profound influence on the history of cognitive science, contrasting with the evolutionary psychology perspective: Locke’s Tabula Rasa (blank slate), a vision of the mind as a flexible, adaptable, highly general process that turns experience into behavior, knowledge, and skills. This conception of the human mind can be traced back to Aristotle (De Anima, c. 350BC, perhaps the first treatise of psychology [3]), and was embraced and popularized by Enlightenment thinkers such as Hobbes [42], Locke [56], and Rousseau [78]. It has more recently found renewed vitality within cognitive psychology (e.g. [79]) and in AI via connectionism (e.g. [41]).

这种智力观呼应了另一种对人类本性的长期认知观念,即与进化心理学视角形成鲜明对比的洛克"白板说"(Tabula Rasa)——将心智视为一种灵活、适应性强且高度通用的过程,能将经验转化为行为、知识和技能。这一人类心智概念可追溯至亚里士多德(《论灵魂》,约公元前350年,可能是最早的心理学专著[3]),后被霍布斯[42]、洛克[56]和卢梭[78]等启蒙思想家接纳并普及。最近,该理论在认知心理学(如[79])和通过连接主义实现的AI领域(如[41])重新焕发活力。

With the resurgence of machine learning in the 1980s, its rise to intellectual dominance in the 2000s, and its peak as an intellectual quasi-monopoly in AI in the late 2010s via Deep Learning, a connectionist-inspired Tabula Rasa is increasingly becoming the dominant philosophical framework in which AI research is taking place. Many researchers are implicitly conceptualizing the mind via the metaphor of a “randomly initialized neural network” that starts blank and that derives its skills from “training data” – a cognitive fallacy that echoes early AI researchers a few decades prior who conceptualized the mind as a kind of mainframe computer equipped with clever subroutines. We see the world through the lens of the tools we are most familiar with.

随着机器学习在20世纪80年代的复兴,到2000年代在学术领域占据主导地位,再到2010年代末通过深度学习达到智识上的准垄断地位,一种受连接主义启发的白板论正日益成为人工智能研究的主导哲学框架。许多研究者不自觉地通过"随机初始化神经网络"的隐喻来概念化心智——这种网络始于空白状态,其能力完全源自"训练数据"。这种认知谬误与几十年前早期AI研究者如出一辙,彼时将心智概念化为配备精巧子程序的计算机主机。我们总是通过最熟悉的工具来观察世界。

Today, it is increasingly apparent that both of these views of the nature of human intelligence – either a collection of special-purpose programs or a general-purpose Tabula Rasa – are likely incorrect, which we discuss in II.1.3, along with implications for artificial intelligence.

如今,这两种对人类智能本质的认知——无论是将之视为专用程序集合还是通用白板 (Tabula Rasa) ——显然都可能是错误的。我们将在第II.1.3节讨论这一观点及其对人工智能的启示。

I.3 AI evaluation: from measuring skills to measuring broad abilities

I.3 人工智能评估:从技能衡量到广泛能力衡量

These two conceptualizations of intelligence – along with many other intermediate views combining elements from each side – have influenced a host of approaches for evaluating intelligence in machines, in humans, and more rarely in both at the same time, which we discuss below. Note that this document is not meant as an extensive survey of AI evaluation methods – for such a survey, we recommend Hernandez-Orallo 2017 [37]. Other notable previous surveys include Cohen and Howe 1988 [69] and Legg and Hutter 2007 [53].

这两种对智能的概念化理解——以及许多融合双方观点的中间立场——影响了评估机器智能、人类智能(以及较少见的二者同时评估)的一系列方法,我们将在下文展开讨论。需要注意的是,本文并非对AI评估方法的全面综述——如需此类综述,我们推荐Hernandez-Orallo 2017 [37]。其他值得关注的既往综述包括Cohen和Howe 1988 [69]以及Legg和Hutter 2007 [53]。

I.3.1 Skill-based, narrow AI evaluation

I.3.1 基于技能的狭义AI评估

In apparent accordance with Minsky’s goal for AI, the major successes of the field have been in building special-purpose systems capable of handling narrow, well-described tasks, sometimes at above human-level performance. This success has been driven by performance measures quantifying the skill of a system at a given task (e.g. how well an AI plays chess, how well an image classifier recognizes cats from dogs). There is no single, formalized way to do skill-based evaluation. Historically successful approaches include:

表面上与Minsky对AI的目标一致,该领域的主要成功在于构建能够处理狭窄、明确描述任务的专用系统,有时甚至超越人类水平。这一成功得益于量化系统在特定任务中技能表现的评估指标(例如AI下棋水平、图像分类器区分猫狗的准确度)。目前尚无统一的技能评估标准化方法。历史上成功的评估方式包括:

• Peer confrontation: having the system compete against either other AIs or humans. This is the preferred mode of evaluation for player-versus-player games, such as chess.
• Benchmarks: having the system produce outputs for a “test set” of inputs (or environments) for which the desired outcome is known, and score the response.

• 同行对抗:让系统与其他AI或人类竞争。这是评估玩家对战游戏(如国际象棋)的首选模式。
• 基准测试:让系统为一组已知预期结果的输入(或环境)"测试集"生成输出,并对响应进行评分。
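Peer-confrontation results are typically summarized with a rating system rather than raw win counts; chess famously uses Elo. Below is a minimal illustrative sketch of the standard Elo update rule (the K-factor of 32 is a conventional assumption, not something specified in this paper):

```python
# Minimal Elo update for peer-confrontation evaluation.
# score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
def elo_update(rating_a, rating_b, score_a, k=32):
    # Expected score of A under the logistic Elo model.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two equally rated systems: a win transfers k/2 = 16 points.
a, b = elo_update(1500, 1500, 1.0)
print(a, b)  # 1516.0 1484.0
```

Repeating this update over many games yields a skill ranking among the competing systems, which is precisely a task-specific skill measure in the sense discussed above.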

Benchmarks in particular have been a major driver of progress in AI, because they are reproducible (the test set is fixed), fair (the test set is the same for everyone), scalable (it is inexpensive to run the evaluation many times), easy to set up, and flexible enough to be applicable to a wide range of possible tasks. Benchmarks have often been most impactful in the context of a competition between different research teams, such as the ILSVRC challenge for large-scale image recognition (ImageNet) [22] or the DARPA Grand Challenge for autonomous driving [11]. A number of private and community-led initiatives have been started on the premise that such benchmark-based competitions speed up progress (e.g. Kaggle (kaggle.com), as well as academic alternatives such as ChaLearn (chalearn.org), the Hutter prize, etc.), while some government organizations use competitions to deliberately trigger technological breakthroughs (e.g. DARPA, NIST).

基准测试尤其成为推动人工智能进步的主要动力,因为它们具有可复现性(测试集固定)、公平性(所有人使用相同测试集)、可扩展性(多次运行评估成本低廉)、易于设置,以及足够灵活以适用于广泛任务的特点。基准测试在不同研究团队间的竞赛中往往最具影响力,例如大规模图像识别领域的ILSVRC挑战赛(ImageNet)[22]或自动驾驶领域的DARPA Grand Challenge[11]。基于"此类基准竞赛能加速技术进步"的理念,许多私人和社区主导的倡议相继启动(如Kaggle (kaggle.com) 及学术类平台ChaLearn (chalearn.org)、Hutter奖等),而部分政府机构则通过竞赛刻意触发技术突破(例如DARPA、NIST)。
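The benchmark protocol described above reduces to a fixed, shared test set of (input, expected output) pairs plus a scoring function. A minimal sketch, with a toy parity task standing in for a real benchmark such as ImageNet:

```python
# Minimal sketch of benchmark-style skill evaluation: a fixed, shared "test set"
# of (input, expected output) pairs, scored by accuracy. The task and system
# below are toy placeholders.
def evaluate(system, test_set):
    correct = sum(1 for x, y_true in test_set if system(x) == y_true)
    return correct / len(test_set)

# Toy "benchmark": classify integers by parity.
test_set = [(0, "even"), (1, "odd"), (2, "even"), (3, "odd")]
parity_system = lambda x: "even" if x % 2 == 0 else "odd"
print(evaluate(parity_system, test_set))  # 1.0
```

Because the test set is fixed and identical for every entrant, such a score is reproducible, fair, and cheap to compute, which is exactly what made benchmarks such effective drivers of competition.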

These successes demonstrate the importance of setting clear goals and adopting objective measures of performance that are shared across the research community. However, optimizing for a single metric or set of metrics often leads to tradeoffs and shortcuts when it comes to everything that isn’t being measured and optimized for (a well-known effect on Kaggle, where winning models are often overly specialized for the specific benchmark they won and cannot be deployed on real-world versions of the underlying problem). In the case of AI, the focus on achieving task-specific performance while placing no conditions on how the system arrives at this performance has led to systems that, despite performing the target tasks well, largely do not feature the sort of human intelligence that the field of AI set out to build.

这些成功案例表明,设定明确目标并采用研究社区公认的客观性能衡量标准至关重要。然而,针对单一指标或指标集进行优化时,往往会在未被测量和优化的所有方面产生权衡与捷径(这是Kaggle上众所周知的效应:获胜模型通常过度适配其获胜的特定基准测试,而无法部署到现实世界的实际问题中)。在AI领域,这种专注于实现任务特定性能却不对系统达成该性能的方式加以约束的做法,导致现有系统虽然能出色完成目标任务,但大多不具备AI领域最初试图构建的人类智能特征。

This has been interpreted by McCorduck as an “AI effect” where goalposts move every time progress in AI is made: “every time somebody figured out how to make a computer do something – play good checkers, solve simple but relatively informal problems – there was a chorus of critics to say, ‘that’s not thinking’ ” [61]. Similarly, Reed notes: “When we know how a machine does something ‘intelligent’, it ceases to be regarded as intelligent. If I beat the world’s chess champion, I’d be regarded as highly bright.” [77]. This interpretation arises from overly anthropocentric assumptions. As humans, we can only display high skill at a specific task if we have the ability to efficiently acquire skills in general, which corresponds to intelligence as characterized in II. No one is born knowing chess, or predisposed specifically for playing chess. Thus, if a human plays chess at a high level, we can safely assume that this person is intelligent, because we implicitly know that they had to use their general intelligence to acquire this specific skill over their lifetime, which reflects their general ability to acquire many other possible skills in the same way. But the same assumption does not apply to a non-human system that does not arrive at competence the way humans do. If intelligence lies in the process of acquiring skills, then there is no task $X$ such that skill at $X$ demonstrates intelligence, unless $X$ is actually a meta-task involving skill-acquisition across a broad range of tasks. The “AI effect” characterization is confusing the process of intelligence (such as the intelligence displayed by researchers creating a chess-playing program) with the artifact produced by this process (the resulting chess-playing program), due to these two concepts being fundamentally intertwined in the case of humans. We discuss this further in II.1.

McCorduck将这种现象解读为一种"AI效应":每当AI取得进展时,评判标准就会随之改变:"每当有人教会计算机下出像样的国际象棋,或解决简单但相对非正式的问题时,总会有一群批评者跳出来说'那不算思考'" [61]。Reed同样指出:"当我们了解机器如何实现某种'智能'行为后,就不再将其视为真正的智能。如果我击败了国际象棋世界冠军,人们会认为我非常聪明。" [77]

这种解读源于过度拟人化的假设。对人类而言,只有在具备高效学习能力的普遍智力(intelligence)基础上(如第二部分所定义),才能展现出特定领域的高超技能。没有人天生就会下棋或专为棋艺而生。因此,当人类展现出高超棋艺时,我们可以合理推断此人具备智力,因为我们默认知道他是通过毕生运用通用智力才获得这项特定技能,这种能力同样适用于获取其他各类技能。

但这种假设并不适用于非人类系统,因为它们获取能力的途径与人类截然不同。如果智力的本质在于技能获取过程,那么不存在某项具体任务X能单独证明智力——除非X本身就是涉及广泛任务学习的元任务。由于人类智力过程与其产出成果本质上是交织的,"AI效应"的论述错误地将智力过程(如研究人员开发象棋程序时展现的智力)与其产物(最终生成的象棋程序)混为一谈。我们将在第二部分第一节进一步探讨这个问题。

Task-specific performance is a perfectly appropriate and effective measure of success if and only if handling the task as initially specified is the end goal of the system – in other words, if our measure of performance captures exactly what we expect of the system. However, it is deficient if we need systems that can show autonomy in handling situations that the system creator did not plan for, that can dynamically adapt to changes in the task – or in the context of the task – without further human intervention, or that can be repurposed for other tasks. Meanwhile, robustness and flexibility are increasingly being perceived as important requirements for certain broader subfields of AI, such as L5 self-driving, domestic robotics, or personal assistants; there is even increasing interest in generality itself (e.g. developmental robotics [4], artificial general intelligence [28]). This points to a need to move beyond skill-based evaluation for such endeavours, and to find ways to evaluate robustness and flexibility, especially in a cross-task setting, up to generality. But what do we really mean when we talk about robustness, flexibility, and generality?

任务特定性能是一种完全恰当且有效的成功衡量标准,当且仅当系统以初始指定的方式处理任务是其最终目标时——换句话说,如果我们的性能衡量标准准确反映了我们对系统的期望。然而,如果我们需要系统能够自主处理系统创建者未预料到的情况、能够动态适应任务或其上下文的变化而无需进一步人工干预,或者能够被重新用于其他任务,那么这种衡量标准就显得不足。与此同时,稳健性(robustness)和灵活性(flexibility)正日益被视为某些更广泛的AI子领域的重要需求,例如L5级自动驾驶、家用机器人或个人助理;甚至对通用性(generality)本身的兴趣也在增长(如发展机器人学[4]、通用人工智能[28])。这表明,对于此类努力,我们需要超越基于技能的评估,并找到评估稳健性和灵活性的方法,尤其是在跨任务环境中,直至通用性。但当我们谈论稳健性、灵活性和通用性时,我们真正指的是什么?

I.3.2 The spectrum of generalization: robustness, flexibility, generality

I.3.2 泛化能力谱系:鲁棒性、灵活性、通用性

Even though such machines might do some things as well as we do them, or perhaps even better, they would inevitably fail in others, which would reveal they were acting not through understanding, but only from the disposition of their organs.

尽管这些机器可能在某些事情上做得和我们一样好,甚至更好,但它们在其他方面必然会失败,这表明它们的行为并非基于理解,而只是器官构造的驱使。

René Descartes, 1637

René Descartes, 1637

The resurgence of machine learning in the 1980s has led to an interest in formally defining, measuring, and maximizing generalization. Generalization is a concept that predates machine learning, originally developed to characterize how well a statistical model performs on inputs that were not part of its training data. In recent years, the success of Deep Learning [52], as well as increasingly frequent run-ins with its limitations (see e.g. [51, 16, 59]), have triggered renewed interest in generalization theory in the context of machine learning (see e.g. [102, 67, 45, 70, 17, 49]). The notion of generalization can be formally defined in various contexts (in particular, statistical learning theory [92] provides a widely-used formal definition that is relevant for machine learning, and we provide a more general formalization in II.2). We can informally define “generalization” or “generalization power” for any AI system to broadly mean “the ability to handle situations (or tasks) that differ from previously encountered situations”.

20世纪80年代机器学习的复兴引发了人们对正式定义、衡量和最大化泛化能力的兴趣。泛化这一概念早于机器学习出现,最初用于衡量统计模型在训练数据之外的输入上的表现。近年来,深度学习[52]的成功及其日益凸显的局限性(参见[51,16,59]等),重新激发了机器学习领域对泛化理论的研究热情(参见[102,67,45,70,17,49]等)。泛化概念可在不同语境下进行形式化定义(统计学习理论[92]提供了与机器学习相关的广泛使用的形式化定义,我们将在II.2节给出更普适的形式化表述)。对于任何AI系统,我们可以非正式地将"泛化"或"泛化能力"定义为"处理与先前遇到的情境(或任务)不同的情况的能力"。
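The system-centric flavor of this informal definition is what the familiar train/test gap in machine learning measures. The sketch below (synthetic one-dimensional data and a 1-nearest-neighbour learner, both toy stand-ins chosen for illustration) contrasts accuracy on memorized training points with accuracy on fresh points from the same distribution:

```python
# Toy illustration of (system-centric) generalization: the gap between a
# learner's accuracy on its own training data and on held-out data drawn
# from the same distribution. Data and learner are hypothetical stand-ins.
import random

random.seed(0)

def sample(n):
    # Points on a line; label is the sign of x, with 10% label noise.
    data = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        y = 1 if x > 0 else 0
        if random.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

def nn_predict(train, x):
    # 1-nearest neighbour: pure memorization of the training set.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, data):
    return sum(nn_predict(train, x) == y for x, y in data) / len(data)

train, test = sample(200), sample(200)
train_acc = accuracy(train, train)  # 1.0: each point is its own nearest neighbour
test_acc = accuracy(train, test)    # lower: performance on genuinely new points
print(train_acc, test_acc)
```

The training accuracy says nothing about generalization (the learner has simply memorized its data); only the held-out accuracy speaks to handling situations not previously encountered, and even then only in the local, single-task sense defined below.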

The notion of “previously encountered situation” is somewhat ambiguous, so we should distinguish between two types of generalization:

“先前遇到的情况”这一概念有些模糊,因此我们应区分两种泛化类型:

• System-centric generalization: this is the ability of a learning system to handle situations it has not itself encountered before. The formal notion of generalization error in statistical learning theory would belong here.

• 系统中心泛化 (System-centric generalization):指学习系统处理自身未曾遇到过的情况的能力。统计学习理论中的泛化误差 (generalization error) 形式化概念即属于此类。

• Developer-aware generalization: this is the ability of a system, either learning or static, to handle situations that neither the system nor the developer of the system have encountered before.

• 开发者感知泛化能力:指系统(无论是学习型还是静态系统)能够处理系统自身及其开发者都未曾遇到过的情况的能力。

In addition, we find it useful to qualitatively define degrees of generalization for informationprocessing systems:

此外,我们发现对信息处理系统的泛化程度进行定性定义很有帮助:

• Absence of generalization: The notion of generalization as we have informally defined above fundamentally relies on the related notions of novelty and uncertainty: a system can only generalize to novel information that could not be known in advance to either the system or its creator. AI systems in which there is no uncertainty do not display generalization. For instance, a program that plays tic-tac-toe via exhaustive iteration cannot be said to “generalize” to all board configurations. Likewise, a sorting algorithm that is proven to be correct cannot be said to “generalize” to all lists of integers, much like proven mathematical statements cannot be said to “generalize” to all objects that match the assumptions of their proof 3.

• 缺乏泛化性:我们前文非正式定义的泛化概念本质上依赖于新颖性和不确定性的相关概念——只有当系统或其创造者都无法提前预知新信息时,系统才可能实现泛化。不存在不确定性的AI系统不会表现出泛化能力。例如,通过穷举法玩井字棋的程序不能被称为"泛化"到所有棋盘配置;同理,已被证明正确的排序算法不能被称为"泛化"到所有整数列表,就像已被证明的数学命题不能被称为"泛化"到所有符合其证明假设的对象。

• Local generalization, or “robustness”: This is the ability of a system to handle new points from a known distribution for a single task or a well-scoped set of known tasks, given a sufficiently dense sampling of examples from the distribution (e.g. tolerance to anticipated perturbations within a fixed context). For instance, an image classifier that can distinguish previously unseen 150x150 RGB images containing cats from those containing dogs, after being trained on many such labeled images, can be said to perform local generalization. One could characterize it as “adaptation to known unknowns within a single task or well-defined set of tasks”. This is the form of generalization that machine learning has been concerned with from the 1950s up to this day.

• 局部泛化 (local generalization) 或"鲁棒性":指系统在给定足够密集的样本分布情况下,能够处理来自已知分布的单个任务或明确定义的已知任务集的新数据点的能力 (例如在固定上下文中对预期扰动的容忍度)。例如,一个经过大量带标签图像训练后,能够区分未见过的150x150 RGB图像中猫狗的分类器,就可以说具备局部泛化能力。可以将其描述为"在单个任务或明确定义的任务集中适应已知的未知因素"。这是从20世纪50年代至今机器学习领域所关注的泛化形式。

• Broad generalization, or “flexibility”: This is the ability of a system to handle a broad category of tasks and environments without further human intervention. This includes the ability to handle situations that could not have been foreseen by the creators of the system. This could be considered to reflect human-level ability in a single broad activity domain (e.g. household tasks, driving in the real world), and could be characterized as “adaptation to unknown unknowns across a broad category of related tasks”. For instance, a L5 self-driving vehicle, or a domestic robot capable of passing Wozniak’s coffee cup test (entering a random kitchen and making a cup of coffee) [99] could be said to display broad generalization. Arguably, even the most advanced AI systems today do not belong in this category, although there is increasing research interest in achieving this level.

• 广泛泛化(Broad generalization),或称"灵活性":指系统无需人类进一步干预即可处理广泛任务类别和环境的能力。这包括处理系统创建者无法预见情况的能力。这可以被视为在单一广泛活动领域(如家务劳动、现实世界驾驶)中反映人类水平的能力,其特征可描述为"在广泛相关任务类别中适应未知的未知因素"。例如,L5级自动驾驶汽车,或能通过Wozniak咖啡杯测试(进入随机厨房制作一杯咖啡)[99]的家用机器人,都可视为具备广泛泛化能力。可以说,即使当今最先进的AI系统也尚未达到这一类别,尽管实现这一水平的研究兴趣正在增长。

• Extreme generalization: This describes open-ended systems with the ability to handle entirely new tasks that only share abstract commonalities with previously encountered situations, applicable to any task and domain within a wide scope. This could be characterized as “adaptation to unknown unknowns across an unknown range of tasks and domains”. Biological forms of intelligence (humans and possibly other intelligent species) are the only example of such a system at this time. A version of extreme generalization that is of particular interest to us throughout this document is human-centric extreme generalization, which is the specific case where the scope considered is the space of tasks and domains that fit within the human experience. We will refer to “human-centric extreme generalization” as “generality”. Importantly, as we deliberately define generality here by using human cognition as a reference frame (which we discuss in II.1.2), it is only “general” in a limited sense. Do note, however, that humans display extreme generalization both in terms of system-centric generalization (quick adaptability to highly novel situations from little experience) and developer-aware generalization (ability of contemporary humans to handle situations that previous humans have never experienced during their evolutionary history).

• 极端泛化 (Extreme generalization):指能够处理与先前经验仅存在抽象共性的全新任务的开放系统,适用于广泛范围内的任何任务和领域。可描述为"在未知任务和领域范围内适应未知的未知因素"。目前生物形态的智能(人类及其他可能的高等物种)是此类系统的唯一实例。本文档特别关注以人类为中心的极端泛化,即任务和领域范围限定于人类经验范畴的特殊情况。我们将"以人类为中心的极端泛化"简称为"通用性 (generality)"。需注意的是,由于我们刻意以人类认知为参照系来定义通用性(详见II.1.2章节),其"通用"仅具相对意义。但应认识到,人类既展现系统中心泛化(基于少量经验快速适应高度新颖情境)能力,也具备开发者感知泛化(现代人类处理进化史上从未遭遇情境)特质。

To this list, we could, theoretically, add one more entry: “universality”, which would extend “generality” beyond the scope of task domains relevant to humans, to any task that could be practically tackled within our universe (note that this is different from “any task at all” as understood in the assumptions of the No Free Lunch theorem [98, 97]). We discuss in II.1.2 why we do not consider universality to be a reasonable goal for AI.

理论上,我们可以在这个列表上再增加一项:"普适性(universality)",这将把"通用性(generality)"的范畴从人类相关的任务领域扩展到我们宇宙中任何实际可处理的任务(注意这与"任何任务"不同,后者是免费午餐定理[98, 97]假设中的理解)。我们将在II.1.2节讨论为何不认为普适性是人工智能的合理目标。

Crucially, the history of AI has been one of slowly climbing up this spectrum, starting with systems that largely did not display generalization (symbolic AI), and evolving towards robust systems (machine learning) capable of local generalization. We are now entering a new stage, where we seek to create flexible systems capable of broad generalization (e.g. hybrid symbolic and machine learning systems such as self-driving vehicles, AI assistants, or cognitive developmental robots). Skill-focused task-specific evaluation has been appropriate for closed-ended systems that aim at robustness in environments that only feature known unknowns, but developing systems that are capable of handling unknown unknowns requires evaluating their abilities in a general sense.

关键在于,人工智能的发展历程正是沿着这一谱系逐步攀升的历程——从几乎不具备泛化能力的符号AI系统起步,逐步演进为能够实现局部泛化的稳健机器学习系统。如今我们正迈入新阶段,致力于构建具备广泛泛化能力的柔性系统(例如融合符号与机器学习技术的自动驾驶汽车、AI助手或认知发育机器人)。针对环境变量仅为"已知的未知"的封闭系统,以技能为核心的专项评估体系确实适用;但要开发能够应对"未知的未知"的系统,就必须建立通用化的能力评估框架。

Importantly, the spectrum of generalization outlined above seems to mirror the organization of human cognitive abilities as laid out by theories of the structure of intelligence in cognitive psychology. Major theories of the structure of human intelligence (CHC [62], g-VPR [48]) all organize cognitive abilities in a hierarchical fashion (figure 1), with three strata (in CHC): general intelligence (g factor) at the top, broad abilities in the middle, and specialized skills or test tasks at the bottom (this extends to 4 strata for g-VPR, which splits broad abilities into two layers), albeit the taxonomy of abilities differs between theories. Here, “extreme generalization” corresponds to the g factor, “broad generalization” across a given domain corresponds to a broad cognitive ability, and “local generalization” (as well as the no-generalization case) corresponds to task-specific skill.

重要的是,上述概述的泛化谱似乎反映了人类认知能力的组织结构,这与认知心理学中智力结构理论所阐述的一致。关于人类智力结构的主要理论(CHC [62]、$\mathbf{g}$-VPR [48])均以分层方式组织认知能力(图 1),包含三个层级(CHC 理论中):顶层的通用智力(g 因子)、中层的广泛能力,以及底层的专项技能或测试任务(g-VPR 理论扩展为四层级,将广泛能力拆分为两层),尽管不同理论对能力的分类存在差异。在此框架中,"极端泛化"对应 g 因子,"特定领域的广泛泛化"对应广泛认知能力,而"局部泛化"(以及无泛化情况)则对应任务专项技能。

Measuring such broad abilities (and possibly generality itself) rather than specific skills has historically been the central problem of the field of psychometrics. Could psychometrics inform the evaluation of abilities in AI systems?

衡量这种广泛能力(甚至可能是通用性本身)而非特定技能,历来是心理测量学领域的难题。心理测量学能否为AI系统能力评估提供参考?


Figure 1: Hierarchical model of cognitive abilities and its mapping to the spectrum of generalization.

图 1: 认知能力层级模型及其与泛化光谱的映射关系

Note that, in what follows:

请注意,在下文中:

• We use “broad abilities” to refer to cognitive abilities that lead to broad or extreme generalization. Developing such abilities should be the goal of any researcher interested in flexible AI or general AI.

• 我们使用"广泛能力"来指代能带来广泛或极端泛化的认知能力。发展此类能力应成为任何研究者的目标。

I.3.3 Measuring broad abilities and general intelligence: the psychometrics perspective

I.3.3 测量广泛能力与一般智力:心理测量学视角

It seems to us that in intelligence there is a fundamental faculty, the alteration or the lack of which, is of the utmost importance for practical life. This faculty is [...] the faculty of adapting one’s self to circumstances.

在我们看来,智力中存在一种根本能力,这种能力的改变或缺失对现实生活具有极其重大的影响。这种能力就是 [...] 适应环境的能力。

Alfred Binet, 1916

Alfred Binet, 1916

In the early days of the 20th century, Binet and Simon, looking for a formal way to distinguish children with mental disabilities from those with behavior problems, developed the Binet-Simon scale [8], the first test of intelligence, founding the field of psychometrics. Immediately after, Spearman observed that individual results across different, seemingly unrelated types of intelligence tests were correlated, and hypothesized the existence of a single factor of general intelligence, the g factor [83, 84]. Today, psychometrics is a well-established subfield of psychology that has arrived at some of the most reproducible results of the field. Modern intelligence tests are developed by following strict standards regarding reliability (low measurement error, a notion tied to reproducibility), validity (measuring what one purports to be measuring, a notion tied to statistical consistency and predictiveness), standardization, and freedom from bias – see e.g. Classical Test Theory (CTT) [20] and Item Response Theory (IRT) [34].

20世纪初,Binet和Simon为了寻找一种正式方法来区分智力障碍儿童与行为问题儿童,开发了首个智力测试工具——Binet-Simon量表 [8],由此创立了心理测量学领域。此后不久,Spearman发现个体在不同类型智力测试中的表现存在相关性,并提出了通用智力单因素假说——g因子 [83, 84]。如今,心理测量学已成为心理学中一个成熟的分支学科,其研究成果在该领域最具可重复性。现代智力测试的开发遵循严格标准,包括信度(低测量误差,与可重复性相关)、效度(测量目标特质,与统计一致性和预测性相关)、标准化及无偏性等要求——参见经典测验理论(CTT)[20]与项目反应理论(IRT)[34]。

A fundamental notion in psychometrics is that intelligence tests evaluate broad cognitive abilities as opposed to task-specific skills. Theories of the structure of intelligence (such as CHC, g-VPR), which have co-evolved with psychometric testing (statistical phenomena emerging from test results have informed these theories, and these theories have informed test design), organize these abilities in a hierarchical fashion (figure 1), rather similarly to the spectrum of generalization we presented earlier. Importantly, an ability is an abstract construct (based on theory and statistical phenomena) as opposed to a directly measurable, objective property of an individual mind, such as a score on a specific test. Broad abilities in AI, which are also constructs, raise the exact same evaluation problems as cognitive abilities from psychometrics. Psychometrics approaches the quantification of abilities by using broad batteries of test tasks rather than any single task, and by analysing test results via probabilistic models. Importantly, the tasks should be previously unknown to the test-taker, i.e., we assume that test-takers do not practice for intelligence tests. This approach is highly relevant to AI evaluation.

心理测量学的一个基本概念是,智力测试评估的是广泛的认知能力,而非特定任务技能。智力结构理论(如CHC、$\mathrm{g}\mathrm{-VPR}$)与心理测量测试共同演化(测试结果中出现的统计现象启发了这些理论,而这些理论又指导了测试设计),将这些能力以层级方式组织(图1),与我们之前提出的泛化谱系非常相似。值得注意的是,能力是一种抽象建构(基于理论和统计现象),而非可直接测量的个体思维客观属性,如某项特定测试的分数。AI中的广泛能力同样属于建构范畴,其评估难题与心理测量学中的认知能力完全相同。心理测量学通过使用广泛的测试任务组合(而非单一任务),并借助概率模型分析测试结果,来实现能力的量化。关键在于,这些任务对受试者而言应是前所未见的,即我们假设受试者不会为智力测试进行针对性训练。这种方法与AI评估高度相关。

Remarkably, in a parallel to psychometrics, there has been recent and increasing interest across the field of AI in using broad batteries of test tasks to evaluate systems that aim at greater flexibility. Examples include the Arcade Learning Environment for Reinforcement Learning agents [6], Project Malmö [71], the Behavior Suite [68], or the GLUE [95] and SuperGLUE [94] benchmarks for natural language processing. The underlying logic of these efforts is to measure something more general than skill at one specific task by broadening the set of target tasks. However, when it comes to assessing flexibility, a critical defect of these multi-task benchmarks is that the set of tasks is still known in advance to the developers of any test-taking system, and it is fully expected that test-taking systems will be able to practice specifically for the target tasks, leverage task-specific built-in prior knowledge inherited from the system developers, leverage external knowledge obtained via pre-training, etc. As such, these benchmarks still appear to be highly gameable (see e.g. II.1.1) – merely widening task-specific skill evaluation to more tasks does not produce a qualitatively different kind of evaluation. Such benchmarks are still looking at skills, rather than abilities, in contrast with the psychometrics approach (this is not to say that such benchmarks are not useful; merely that such static multi-task benchmarks do not directly assess flexibility or generality).

值得注意的是,与心理测量学类似,AI领域近期对使用多样化测试任务组来评估追求更高灵活性的系统产生了日益浓厚的兴趣。例如强化学习智能体的街机学习环境[6]、Project Malm O¨ [71]、行为测试套件[68],以及自然语言处理领域的GLUE[95]和SuperGLUE[94]基准测试。这些工作的核心逻辑是通过扩展目标任务集,测量比单一特定任务技能更通用的能力。然而在评估灵活性时,这些多任务基准测试存在一个关键缺陷:任何应试系统的开发者都预先知晓任务集合,且完全预期应试系统能够针对目标任务进行专项训练、利用系统开发者内置的任务特定先验知识、通过预训练获取外部知识等。因此,这些基准测试仍具有高度可操纵性(参见II.1.1节)——仅仅将任务特定技能评估扩展到更多任务,并不会产生质变的新型评估。与心理测量学方法不同,这类基准测试关注的仍是技能而非能力(这并非否定其价值,只是说明静态多任务基准无法直接评估灵活性或通用性)。

In addition to these multi-task benchmarks, a number of more ambitious test suites for cognitive abilities of AI have been proposed in the past but have not been implemented in practice: the Newell test by Anderson and Lebiere ([2], named in reference to [66]), the BICA “cognitive decathlon” targeted at developmental robotics [65], the Turing Olympics [27], and the I-Athlon [1]. Lacking concrete implementations, it is difficult to assess whether these projects would have been able to address the ability evaluation problem they set out to solve. On the other hand, two similarly-spirited but more mature test suites have emerged recently, focused on generalization capabilities as opposed to specific tasks: the Animal-AI Olympics [7] (animalaiolympics.com) and the GVGAI competition [72] (gvgai.net). Both take the position that AI agents should be evaluated on an unseen set of tasks or games, in order to test learning or planning abilities rather than special-purpose skill. Both feature a multi-game environment and an ongoing public competition.

除了这些多任务基准测试外,过去还提出过一些更具野心的AI认知能力测试套件,但尚未在实践中实施:Anderson和Lebiere提出的Newell测试([2],名称引用自[66])、面向发育机器人的BICA"认知十项全能"[65]、Turing Olympics[27]以及I-Athlon[1]。由于缺乏具体实现,很难评估这些项目是否能解决它们旨在解决的能力评估问题。另一方面,近期出现了两个理念相似但更成熟的测试套件,重点关注泛化能力而非特定任务:Animal-AI Olympics [7](animalaiolympics.com)和GVGAI竞赛 [72](gvgai.net)。两者都主张在未见过的任务或游戏集合上评估AI智能体,以测试学习或规划能力而非专用技能。这两个项目都采用多游戏环境设计,并持续举办公开竞赛。

I.3.4 Integrating AI evaluation and psychometrics

I.3.4 整合AI评估与心理测量学

Besides efforts to broaden task-specific evaluation to batteries of multi-task tests, there have been more direct and explicit attempts to integrate AI evaluation and psychometrics. A first approach is to reuse existing psychometric intelligence tests, initially developed for humans, as a way to assess intelligence in AI systems – perhaps an obvious idea if we are to take the term “artificial intelligence” literally. This idea was first proposed by Green in 1964 [29], and was, around the same time, explored by Evans [24], who wrote a LISP program called ANALOGY capable of solving a geometric analogy task of the kind that may be found in a psychometric intelligence test. Newell suggested the idea again in 1973 [66] in his seminal paper You can’t play 20 questions with Nature and win. It was proposed again and refined by Bringsjord et al. in the 2000s under the name “Psychometric AI” (PAI) [9]. However, it has since become apparent that it is possible for AI system developers to game human intelligence tests, because the tasks used in these tests are available to the system developers, and thus the developers can straightforwardly solve the abstract form of these problems themselves and hard-code the solution in program form (see, for instance, [23, 80, 44]), much like Evans did in the 1960s with the ANALOGY program. Effectively, in this case, it is the system developers who are solving the test problems, rather than any AI. The implicit assumptions that psychometric test designers make about human test-takers turn out to be difficult to enforce in the case of machines.

除了将特定任务的评估扩展到多任务测试组合的努力外,还有更直接明确的尝试将AI评估与心理测量学相结合。第一种方法是复用最初为人类设计的现有心理测量智力测试,以此评估AI系统的智能水平——如果从字面上理解"人工智能"这一术语,这或许是个显而易见的想法。该理念最早由Green于1964年提出[29],同期Evans[24]也进行了探索,他编写了一个名为ANALOGY的LISP程序,能够解决心理测量智力测试中常见的几何类比任务。Newell在1973年[66]的里程碑论文《You can't play 20 questions with Nature and win》中再次提出了这个想法。2000年代,Bringsjord等人以"心理测量AI"(PAI)为名对该理论进行了完善[9]。然而后来发现,AI系统开发者可能利用人类智力测试的漏洞,因为这些测试任务对开发者是可见的,他们可以直接解决这些问题的抽象形式,并将解决方案硬编码为程序(例如[23,80,44]),就像Evans在1960年代对ANALOGY程序所做的那样。实际上,这种情况下是系统开发者在解决测试问题,而非任何AI。心理测量测试设计者对人类受试者的隐含假设,在机器测试场景中难以强制执行。

An alternative, more promising approach is to leverage what psychometrics can teach us about ability assessment and test design to create new types of benchmarks targeted specifically at evaluating broad abilities in AI systems. Along these lines, Hernandez-Orallo et al. have proposed extending psychometric evaluation to any intelligent system, including AI agents and animals, in “Universal Psychometrics” [39].

另一种更有前景的方法是借鉴心理测量学在能力评估和测试设计方面的经验,创建专门用于评估AI系统广泛能力的新型基准。沿着这一思路,HernandezOrallo等人在《通用心理测量学》[39]中提出将心理测量评估扩展到任何智能系统,包括AI智能体和动物。

We argue that several important principles of psychometrics can inform intelligence evaluation in AI in the context of the development of broad AI and general AI:

我们认为,在通用人工智能(AGI)和广义AI发展的背景下,心理测量学的若干重要原则可以为AI智能体的智力评估提供参考:

• Measuring abilities (representative of broad generalization and skill-acquisition efficiency), not skills. Abilities are distinct from skills in that they induce broad generalization, i.e. they form the basis for skill across a broad range of tasks, including tasks that were previously unknown to the ability-enabled system and its developers.

• Doing so via batteries of tasks rather than any single task, tasks that should be previously unknown to both the test-taking system and the system developers (this is necessary to assess broad generalization as opposed to skill or local generalization).

• Having explicit standards regarding reliability, validity, standardization, and freedom from bias:

• 衡量能力(代表广泛泛化和技能获取效率),而非技能。能力与技能的区别在于能带来广泛泛化,即它们构成跨广泛任务范围技能的基础,包括能力赋能系统及其开发者先前未知的任务。

• 通过任务组而非单一任务实现这一点,这些任务对受测系统和系统开发者都应是未知的(这对评估广泛泛化而非技能或局部泛化至关重要)。

• 制定关于可靠性、效度、标准化和无偏性的明确标准:

For instance, a test of intelligence designed for both humans and AI should not leverage uniquely human acquired knowledge, and should not involve constraints unrelated to intelligence under which machines have unfair advantages (such as fast reaction times), etc.

例如,一项为人类和AI设计的智力测试不应利用人类独有的后天知识,也不应包含与智力无关但机器具有不公平优势的约束条件(如快速反应时间)等。

Simultaneously, we argue that certain other aspects of psychometrics may be discarded in the development of new intelligence tests for AI:

同时,我们认为在开发针对AI的新型智力测试时,可以摒弃心理测量学的某些其他方面:

• The exact number and taxonomy of cognitive abilities considered, being a subject of ongoing debate within cognitive psychology and being perhaps overly anthropocentric, should not be used as a strict template for artificial cognitive architectures and their evaluation. Existing taxonomies may at best serve as a source of inspiration.

• A number of abilities assessed by psychometric intelligence tests are crystallized abilities (e.g. reading and writing), i.e. abilities that are acquired through experience, which are not clearly distinguishable from skills (they are effectively multi-purpose skills). We argue that AI tests that seek to assess flexibility and generality should not consider crystallized abilities, but rather should focus on abilities that enable new skill acquisition. If a system possesses abilities that enable efficient skill acquisition in a domain, the system should have no issue in developing corresponding skills and crystallized abilities.

  • 认知能力的精确数量与分类体系在认知心理学界仍存争议,且可能过度人类中心化,不应作为人工认知架构及其评估的严格模板。现有分类体系至多只能作为灵感来源。
  • 心理测量学智力测试评估的诸多能力属于晶体能力(如读写),即通过经验习得的能力,这些能力与技能界限模糊(实质上是多用途技能)。我们认为评估灵活性与通用性的AI测试应聚焦于支撑新技能习得的能力,而非晶体能力。若系统具备支撑特定领域高效技能习得的能力,自然能发展出相应技能与晶体能力。

I.3.5 Current trends in broad AI evaluation

I.3.5 广义AI评估的当前趋势

Despite a rising interest in building flexible systems, or even in generality itself, for the most part the AI community has not been paying much attention to psychometric evaluation, Psychometric AI, or Universal Psychometrics. If we are to assess the contemporary zeitgeist of broad AI evaluation, here is what we see. 4

尽管人们对构建灵活系统乃至通用性本身的兴趣日益增长,但AI领域大多仍未重视心理测量评估(psychometric evaluation)、心理测量AI(Psychometric AI)或通用心理测量学(Universal Psychometrics)。若要评估当前广义AI评估的时代思潮,我们观察到的现状如下。4

First, we note several positive developments. Since 2017, there is increasing awareness that one should seek to establish some form of generalization in evaluating Reinforcement Learning (RL) algorithms (e.g. [50, 70, 17, 49]), which was previously a stark problem [76, 35, 101, 70], as RL agents have for a long time been tested on their training data. Further, there is increasing interest in evaluating the data-efficiency of learning algorithms (e.g. [10]), in particular in the context of RL for games such as Atari games or Minecraft (e.g. [71, 33]). Lastly, as noted in I.3.3, there has been a trend towards leveraging multi-task benchmarks as a way to assess robustness and flexibility (e.g. [6, 71, 68, 95, 94]).

首先,我们注意到一些积极进展。自2017年以来,人们逐渐意识到在评估强化学习(RL)算法时应建立某种泛化性衡量标准(如[50, 70, 17, 49]),这曾是显著问题[76, 35, 101, 70],因为长期以来RL智能体都在训练数据上进行测试。此外,学界越来越关注评估学习算法的数据效率(如[10]),特别是在Atari游戏或《我的世界》等游戏场景的RL研究中(如[71, 33])。最后如I.3.3节所述,当前趋势是通过多任务基准测试来评估鲁棒性和灵活性(如[6, 71, 68, 95, 94])。

Unfortunately, we must also note several negatives. The robustness of the systems being developed, in particular Deep Learning models, is often problematic (see e.g. [16, 59]). This is due in large part to the fact that most benchmarks do not pay much attention to formally assessing robustness and quantifying generalization, and thus can be solved via “shortcuts” that gradient descent is apt at exploiting (e.g. surface statistics such as textures in the case of computer vision [46]). Likewise, the reproducibility (reliability) of research findings is often an issue [73], especially in Reinforcement Learning, although some progress has been made on this front.

遗憾的是,我们也必须指出若干不足之处。当前开发系统(尤其是深度学习模型)的鲁棒性(robustness)往往存在问题(参见[16, 59])。这主要源于大多数基准测试并未重视对鲁棒性的形式化评估与泛化能力的量化,导致梯度下降容易通过"捷径"解决问题(例如计算机视觉领域利用纹理等表面统计特征[46])。同样,研究结果的可复现性(可靠性)也常受质疑[73],尽管强化学习领域已取得一定进展,但该问题仍尤为突出。

Most importantly, the evaluation of any ability that goes decisively beyond local generalization is still largely a green field, and little effort has been devoted to investigate it. Hernandez-Orallo noted in 2017 that “ability-oriented and general-purpose evaluation approaches [...] are still very incipient, and more research and discussion is needed” [36]. Recent attempts at broadening task-specific benchmarks by including multiple tasks do not measure developer-aware generalization, as the tasks are all known in advance to system developers (as noted in I.3.3). Attempts at assessing generalization by testing RL systems on previously unseen game levels, like CoinRun [17] or Obstacle Tower [49], are still only looking at task-specific local generalization, by evaluating a candidate system on new samples from a known distribution rather than using a substantially new task (as suggested in III.3). In addition, the fact that the level-generation programs used are available to the AI developers means it is possible to “cheat” on these benchmarks by sampling arbitrary amounts of training data (cf. II.1.1).

最重要的是,对任何显著超越局部泛化能力的评估仍基本处于空白领域,相关研究投入甚少。Hernandez-Orallo在2017年指出"面向能力与通用目标的评估方法[...]仍处于萌芽阶段,需要更多研究与讨论" [36]。近期通过纳入多任务来扩展特定任务基准的尝试(如I.3.3节所述)并未测量开发者感知的泛化能力,因为这些任务对系统开发者而言都是预先已知的。通过在新游戏关卡(如CoinRun [17]或Obstacle Tower [49])测试强化学习系统来评估泛化的尝试,仍仅关注特定任务的局部泛化——其本质是在已知分布的新样本上评估系统,而非使用全新任务(如III.3节所述)。此外,由于关卡生成程序对AI开发者可见,理论上可通过采样任意数量训练数据来"作弊"(参见II.1.1节)。

Further, contemporary research “moonshots” that are publicly advertised as being steps towards general intelligence appear to still be focusing on skill-based task-specific evaluation for board games and video games (e.g. Go [82, 81] and StarCraft [93] for DeepMind, DotA2 [89] for OpenAI) via highly-mediatized confrontations with top human players. Despite claims of progress towards general AI in associated public communications 5, such evaluation does not involve any measure of generalization power, and has little-to-no overlap with the development of flexibility and generality, as we outline in II.1. For example, although OpenAI’s DotA2-playing AI “Five” was trained on 45,000 years of play and was able to beat top human players [89], it has proven very brittle, as non-champion human players were able to find strategies to reliably beat it in a matter of days after the AI was made available for the public to play against [90]. In addition, Five did not even generalize to DotA2 in the first place: it could only play a restricted version of the game, with 16 characters instead of over 100. Likewise, AlphaGo and its successor AlphaZero, developed in 2016 and 2017, have not yet found any application outside of board games, to the best of our knowledge.

此外,当代那些被公开宣传为迈向通用智能的"登月计划"研究,似乎仍专注于通过高度媒体化的与人类顶尖选手对抗,来评估基于技能的特定任务能力(如DeepMind的围棋[82, 81]和《星际争霸》[93],OpenAI的《Dota2》[89])。尽管相关公开声明宣称这些进展是通向通用人工智能的步骤,但此类评估完全不涉及泛化能力的衡量,与我们第二部分第一节所述的灵活性与通用性发展几乎毫无交集。例如,尽管OpenAI的《Dota2》AI"Five"通过相当于4.5万年的训练击败了人类顶尖选手[89],但其脆弱性很快暴露——在公开对战数日后,非冠军选手就能找到稳定击败它的策略[90]。此外,Five甚至无法完整运行《Dota2》:它只能使用16个英雄而非全部100多个进行简化版游戏。同样,据我们所知,2016-2017年开发的AlphaGo及其迭代产品AlphaZero至今仍未在棋盘游戏之外获得任何应用。

We deplore this discrepancy between a focus on surpassing humans at tests of skill on one hand (while entirely disregarding whether the methods through which skill is achieved are generalizable), and a manifest interest in developing broad abilities on the other hand – an endeavour entirely orthogonal to skill itself. We hypothesize that this discrepancy is due to a lack of a clear conceptualization of intelligence, skill, and generalization, as well as a lack of appropriate measures and benchmarks for broad cognitive abilities. In what follows, we expose in more detail the issue with using task-specific “moonshots” (e.g. achieving better-than-human performance in a video game or board game) as stepping stones towards more general forms of AI, and we propose a formal definition of intelligence meant to be actionable in the pursuit of flexible AI and general AI.

我们痛心于这种矛盾现象:一方面执着于在技能测试中超越人类(却完全忽视所采用方法是否具备泛化能力),另一方面又对发展广泛能力表现出明显兴趣——这种努力方向与技能本身完全相悖。我们推测这种矛盾源于缺乏对智能、技能和泛化的明确定义,以及缺乏针对广泛认知能力的合理衡量标准和基准测试。下文将详细阐述以特定任务"登月计划"(例如在电子游戏或棋类游戏中实现超人类表现)作为通向更通用AI的垫脚石所存在的问题,并提出一个可操作的形式化智能定义,旨在推动柔性AI和通用人工智能的发展。

II A new perspective

II 全新视角

II.1 Critical assessment

II.1 关键评估

II.1.1 Measuring the right thing: evaluating skill alone does not move us forward

II.1.1 衡量正确指标:仅评估技能无法推动进步

In 1973, psychologist and computer science pioneer Allen Newell, worried that recent advances in cognitive psychology were not bringing the field any closer to a holistic theory of cognition, published his seminal paper You can’t play 20 questions with nature and win [66], which helped focus research efforts on cognitive architecture modelling, and provided new impetus to the longstanding quest to build a chess-playing AI that would outperform any human. Twenty-four years later, in 1997, IBM’s DeepBlue beat Garry Kasparov, the best chess player in the world, bringing this quest to an end [12]. When the dust settled, researchers were left with the realization that building an artificial chess champion had not actually taught them much, if anything, about human cognition. They had learned how to build a chess-playing AI, and neither this knowledge nor the AI they had built could generalize to anything other than similar board games.

1973年,心理学家兼计算机科学先驱Allen Newell担忧认知心理学的最新进展并未推动该领域形成整体性认知理论,发表了开创性论文《You can’t play 20 questions with nature and win》[66]。这篇论文促使研究聚焦于认知架构建模,并为长期追求的"构建超越人类的国际象棋AI"目标注入了新动力。24年后的1997年,IBM的DeepBlue击败世界冠军Gary Kasparov,为这一探索画上句号[12]。当尘埃落定时,研究者们意识到:构建人工棋王的过程其实并未增进对人类认知的理解。他们只掌握了构建国际象棋AI的方法,而这些知识及其产物均无法推广到类似棋盘游戏之外的领域。

It may be obvious from a modern perspective that a static chess-playing program based on minimax and tree search would not be informative about human intelligence, nor competitive with humans in anything other than chess. But it was not obvious in the 1970s, when chess-playing was thought by many to capture, and require, the entire scope of rational human thought. Perhaps less obvious in 2019 is that efforts to “solve” complex video games using modern machine learning methods still follow the same pattern. Newell wrote [66]: “we know already from existing work [psychological studies on humans] that the task [chess] involves forms of reasoning and search and complex perceptual and memorial processes. For more general considerations we know that it also involves planning, evaluation, means-ends analysis and redefinition of the situation, as well as several varieties of learning – short-term, post-hoc analysis, preparatory analysis, study from books, etc.”. The assumption was that solving chess would require implementing these general abilities. Chess does indeed involve these abilities – in humans. But while possessing these general abilities makes it possible to solve chess (and many more problems), by going from the general to the specific, inversely, there is no clear path from the specific to the general. Chess does not require any of these abilities, and can be solved by taking radical shortcuts that run orthogonal to human cognition.

从现代视角来看,基于极小化极大算法(minimax)和树搜索的静态象棋程序显然无法揭示人类智能的奥秘,也仅在象棋领域能与人类抗衡。但在1970年代,许多人认为象棋博弈体现并需要人类完整的理性思维能力,这种认知偏差并不明显。2019年更不易察觉的是:当代机器学习方法对复杂电子游戏的"攻克"仍延续着相同模式。Newell在[66]中指出:"通过现有研究[人类心理学实验]可知,象棋任务涉及推理、搜索、复杂感知与记忆过程。更广义而言,它还包含规划、评估、手段-目的分析、情境重构,以及短期/事后/预备性分析、书本学习等多种学习形式。"当时假设攻克象棋需要实现这些通用能力。对人类而言确实如此——但虽然从通用能力到具体问题存在解决路径(如象棋及其他问题),逆向从具体方案到通用能力却无明确通道。象棋本身并不需要这些能力,完全可以通过与人类认知正交的极端捷径来解决。

Optimizing for single-purpose performance is useful and valid if one’s measure of success can capture exactly what one seeks (as we outlined in I.3.1), e.g. if one’s end goal is a chess-playing machine and nothing more. But from the moment the objective is settled, the process of developing a solution will be prone to taking all shortcuts available to satisfy the objective of choice – whether this process is gradient descent or human-driven research. These shortcuts often come with undesirable side-effects when it comes to considerations not incorporated into the measure of performance. If the environment in which the system is to operate is too unpredictable for an all-encompassing objective function to be defined beforehand (e.g. most real-world applications of robotics, where systems face unknown unknowns), or if one aims at a general-purpose AI that could be applied to a wide range of problems with no or little human engineering, then one must somehow optimize directly for flexibility and generality, rather than solely for performance on any specific task.

如果成功的衡量标准能准确反映目标(如I.3.1节所述),例如终极目标仅是打造国际象棋机器,那么针对单一用途的性能优化是有效且合理的。但从目标确立那一刻起,无论是梯度下降还是人工驱动的研发过程,解决方案的开发都倾向于利用一切捷径来实现既定目标。当涉及未被纳入性能考量的因素时,这些捷径往往会产生不良副作用。若系统运行环境具有高度不可预测性(例如机器人技术的大多数现实应用会面临未知的未知因素),导致无法预先定义全覆盖的目标函数;或是追求通用人工智能(AGI)以广泛适应各类问题而无需人工干预时,就必须直接针对灵活性与通用性进行优化,而非仅聚焦于特定任务的表现。

This is, perhaps, a widely-accepted view today when it comes to static programs that hard-code a human-designed solution. When a human engineer implements a chatbot by specifying answers for each possible query via if/else statements, we do not assume this chatbot to be intelligent, and we do not expect it to generalize beyond the engineer’s specifications. Likewise, if an engineer looks at a specific IQ test task, comes up with a solution, and writes down this solution in program form, we do not expect the program to generalize to new tasks, and we do not believe that the program displays intelligence – the only intelligence at work here is the engineer’s. The program merely encodes the crystallized output of the engineer’s thought process – it is this process, not its output, that implements intelligence. Intelligence is not demonstrated by the performance of the output program (a skill), but by the fact that the same process can be applied to a vast range of previously unknown problems (a general-purpose ability): the engineer’s mind is capable of extreme generalization. Since the resulting program is merely encoding the output of that process, it is no more intelligent than the ink and paper used to write down the proof of a theorem.

这或许是当前针对静态程序的普遍共识——那些固化人类设计解决方案的代码。当工程师用if/else语句为每个可能的问题预设回答来实现聊天机器人时,我们不会认为这个机器人具有智能,也不期待它能突破设计者的预设范围。同理,若工程师针对特定智商测试题编写解决方案程序,我们不会认为该程序能泛化到新任务,更不会将其视为智能体现——这里唯一的智能属于工程师本人。程序仅封装了工程师思维过程的固化成果,真正实现智能的是思维过程本身而非其输出产物。智能的体现不在于程序输出表现(特定技能),而在于同一思维过程能解决海量未知问题(通用能力):工程师的思维具有极强的泛化能力。由于最终程序只是该思维过程的输出编码,其智能程度不会超过记录数学定理证明的纸墨。
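The if/else chatbot described above can be made concrete with a minimal sketch (the queries and canned answers below are hypothetical, chosen purely for illustration):

```python
# A hard-coded chatbot: every query/answer pair was authored by the engineer.
# The program encodes the crystallized output of the engineer's thinking;
# the intelligence at work belongs to the engineer, not the program.
def chatbot(query: str) -> str:
    query = query.strip().lower()
    if query == "what are your hours?":
        return "We are open 9am-5pm, Monday to Friday."
    elif query == "where are you located?":
        return "123 Example Street."
    else:
        # Zero generalization: anything outside the engineer's
        # specifications falls through to a canned fallback.
        return "Sorry, I don't understand."
```

Any query the engineer did not anticipate hits the fallback branch, which is precisely the sense in which the system does not generalize beyond its specifications.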

However, what of a program that is not hard-coded by humans, but trained from data to perform a task? A learning machine certainly may be intelligent: learning is a necessary condition to adapt to new information and acquire new skills. But being programmed through exposure to data is no guarantee of generalization or intelligence. Hard-coding prior knowledge into an AI is not the only way to artificially “buy” performance on the target task without inducing any generalization power. There is another way: adding more training data, which can augment skill in a specific vertical or task without affecting generalization whatsoever.

然而,如果一个程序并非由人类硬编码,而是通过数据训练来执行任务呢?学习机器当然可能具备智能:学习是适应新信息、掌握新技能的必要条件。但通过数据训练进行编程,并不能保证其具备泛化能力或智能。将先验知识硬编码到AI中,并非人为"购买"目标任务性能而不引发任何泛化能力的唯一方式。还存在另一种途径:增加训练数据量,这可以在不影响泛化的前提下提升特定垂直领域或任务的技能。

Information processing systems form a spectrum between two extremes: on one end, static systems that consist entirely of hard-coded priors (such as DeepBlue or our if/else chatbot example), and on the opposite end, systems that incorporate very few priors and are almost entirely programmed via exposure to data (such as a hashtable or a densely-connected neural network). Most intelligent systems, including humans and animals, combine ample amounts of both priors and experience, as we point out in II.1.3. Crucially, the ability to generalize is an axis that runs orthogonal to the prior/experience plane. Given a learning system capable of achieving a certain level of generalization, modifying the system by incorporating more priors or more training data about the task can lead to greater task-specific performance without affecting generalization. In this case, both priors and experience serve as a way to “game” any given test of skill without having to display the sort of general-purpose abilities that humans would rely on to acquire the same skill.

信息处理系统构成一个介于两个极端之间的连续谱:一端是完全由硬编码先验(如DeepBlue或我们的if/else聊天机器人示例)组成的静态系统,另一端是几乎完全通过数据训练编程、包含极少先验的系统(如哈希表或全连接神经网络)。如II.1.3节所述,包括人类和动物在内的大多数智能系统都结合了丰富的先验与经验。关键在于,泛化能力是与先验/经验平面正交的维度。对于一个能达到特定泛化水平的学习系统,通过融入更多先验或任务相关训练数据来改进系统,可以在不影响泛化能力的前提下提升特定任务性能。这种情况下,先验和经验都成为"应对"特定技能测试的手段,而无需展现人类掌握该技能所需的通用能力。

This can be readily demonstrated with a simple example: consider a hashtable that uses a locality-sensitive hash function (e.g. nearest neighbor) to map new inputs to previously seen inputs. Such a system implements a learning algorithm capable of local generalization, the extent of which is fixed (independent of the amount of data seen), determined only by the abstraction capabilities of the hash function. This system, despite only featuring trace amounts of generalization power, is already sufficient to “solve” any task for which unlimited training data can be generated, such as any video game. All that one has to do is obtain a dense sampling of the space of situations that needs to be covered, and associate each situation with an appropriate action vector.

通过一个简单例子可以直观说明:假设有一个哈希表使用局部敏感哈希函数(如最近邻算法)将新输入映射到已见过的输入。这种系统实现了具备局部泛化能力的学习算法,其泛化范围固定(与数据量无关),仅由哈希函数的抽象能力决定。该系统尽管只具备微量泛化能力,却已足以"解决"任何能生成无限训练数据的任务(例如电子游戏)。只需对需要覆盖的情境空间进行密集采样,并将每个情境与相应动作向量关联即可。
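The dense-sampling argument above can be made concrete with a short sketch. The task below is a toy invention for illustration (the situation space is the interval [-1, 1], the correct action is "left" or "right" depending on sign); the lookup policy exhibits exactly the fixed, local generalization described: it can only map a new situation to the nearest previously seen one.

```python
# Sketch of the nearest-neighbor "hashtable" learner described above.
# Toy, hypothetical task: choose "left"/"right" depending on the sign of
# the input. With a dense enough sampling of the situation space, the
# lookup table "solves" the task despite having only local generalization.

def make_nn_lookup(samples):
    """samples: list of (situation, action) pairs. Returns a skill program."""
    def skill_program(situation):
        # Local generalization only: map the new input to the nearest seen input.
        nearest = min(samples, key=lambda s: abs(s[0] - situation))
        return nearest[1]
    return skill_program

# Dense sampling of the situation space [-1, 1], one point every 0.01.
training_data = [(x / 100.0, "left" if x < 0 else "right") for x in range(-100, 101)]
policy = make_nn_lookup(training_data)

print(policy(-0.537))   # unseen situation, resolved by the nearest sample
print(policy(0.2501))
```

Note that the extent of generalization is fixed by the distance function, independent of the amount of data: adding more samples raises skill on this task but changes nothing about what the system can do outside the sampled region.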

Adding ever more data to a local-generalization learning system is certainly a fair strategy if one’s end goal is skill on the task considered, but it will not lead to generalization beyond the data the system has seen (the resulting system is still very brittle, e.g. Deep Learning models such as OpenAI Five), and crucially, developing such systems does not teach us anything about achieving flexibility and generality. “Solving” any given task with beyond-human level performance by leveraging either unlimited priors or unlimited data does not bring us any closer to broad AI or general AI, whether the task is chess, football, or any e-sport.

如果最终目标是在特定任务上获得技能,那么向局部泛化学习系统添加更多数据无疑是一种合理策略,但这不会使系统产生超越所见数据的泛化能力(由此产生的系统仍然非常脆弱,例如OpenAI Five等深度学习模型)。更重要的是,开发这类系统并不能帮助我们理解如何实现灵活性与通用性。无论是国际象棋、足球还是任何电子竞技,通过利用无限先验知识或无限数据来"解决"某项任务并达到超人类水平,都不会让我们更接近广义AI或通用人工智能。

Current evidence (e.g. [51, 46, 16, 59, 50]) points to the fact that contemporary Deep Learning models are local-generalization systems, conceptually similar to a locality-sensitive hashtable – they may be trained to achieve arbitrary levels of skill at any task, but doing so requires a dense sampling of the input-cross-target space considered (as outlined in [16]), which is impractical to obtain for high-value real-world applications, such as L5 self-driving (e.g. [5] notes that 30 million training situations is not enough for a Deep Learning model to learn to drive a car in a plain supervised setting). Hypothetically, it may be shown in the future that methods derived from Deep Learning could be capable of stronger forms of generalization, but demonstrating this cannot be done merely by achieving high skill, such as beating humans at DotA2 or Starcraft given unlimited data or unlimited engineering; instead, one should seek to precisely establish and quantify the generalization strength of such systems (e.g. by considering prior-efficiency and data-efficiency in skill acquisition, as well as the developer-aware generalization difficulty of the tasks considered). A central point of this document is to provide a formal framework for doing so (II.2 and II.3). Failing to account for priors, experience, and generalization difficulty in our evaluation methods will prevent our field from climbing higher along the spectrum of generalization (I.3.2) and from eventually reaching general AI.

现有证据 (如 [51, 46, 16, 59, 50]) 表明,当代深度学习模型是局部泛化系统,概念上类似于局部敏感哈希表——它们可以通过训练在任何任务上达到任意技能水平,但需要密集采样所考虑的输入-目标交叉空间 (如 [16] 所述)。对于高价值的现实应用 (如 L5 级自动驾驶),这种密集采样是不切实际的 (例如 [5] 指出,在普通监督学习场景下,即使 3000 万训练样本也不足以让深度学习模型学会驾驶汽车)。假设未来可能证明基于深度学习的方法能够实现更强的泛化形式,但仅通过实现高超技能 (如在无限数据或工程资源条件下在 DotA2 或星际争霸中击败人类) 并不能证明这一点;相反,应该精确建立和量化这类系统的泛化强度 (例如通过考量技能获取的先验效率和数据效率,以及开发人员感知的任务泛化难度)。本文档的核心目标是为此提供形式化框架 (II.2 和 II.3)。若评估方法不考虑先验、经验和泛化难度,将阻碍本领域沿着泛化谱系向上攀登 (I.3.2),最终无法实现通用人工智能。

In summary, the hallmark of broad abilities (including general intelligence, as per II.1.2) is the power to adapt to change, acquire skills, and solve previously unseen problems – not skill itself, which is merely the crystallized output of the process of intelligence. Testing for skill at a task that is known in advance to system developers (as is the current trend in general AI research) can be gamed without displaying intelligence, in two ways: 1) unlimited prior knowledge, 2) unlimited training data. To actually assess broad abilities, and thus make progress toward flexible AI and eventually general AI, it is imperative that we control for priors, experience, and generalization difficulty in our evaluation methods, in a rigorous and quantitative way.

总之,广泛能力的标志(包括II.1.2节所述的通用智能)在于适应变化、获取技能以及解决前所未见问题的能力——而非技能本身,技能仅是智能过程的具体化产物。针对开发者预先已知任务的技能测试(当前通用AI研究的趋势)可能通过两种方式绕过智能展示:1) 无限先验知识,2) 无限训练数据。要真正评估广泛能力,从而推动柔性AI乃至通用AI的发展,我们必须以严谨量化的方式,在评估方法中控制先验知识、经验积累和泛化难度。

II.1.2 The meaning of generality: grounding the g factor

II.1.2 通用性的意义:g因子的基础

It is a well-known fact of cognitive psychology that different individuals demonstrate different cognitive abilities to varying degrees, although results across all tests of intelligence are correlated. This points to cognition being a multi-dimensional object, structured in a hierarchical fashion (figure 1), with a single generality factor at the top, the g factor. But is “general intelligence” the apex of the cognitive pyramid in an absolute sense (as is sometimes assumed by proponents of “Artificial General Intelligence”), or is it merely a broader cognitive ability, one that would remain fairly specialized, and wouldn’t be qualitatively distinct from other abilities lower down the hierarchy? How general is human intelligence?

认知心理学中一个众所周知的事实是,不同个体在不同程度上展现出不同的认知能力,尽管所有智力测试结果之间存在相关性。这表明认知是一个多层次结构的多维对象(图1),其顶端存在一个通用性因子——g因子。但"通用智能"究竟是认知金字塔在绝对意义上的顶峰(正如"通用人工智能"支持者有时认为的那样),还是仅仅一种更广泛的认知能力,这种能力仍相当专一,与层次结构中较低的其他能力并无质的区别?人类智能的通用性究竟有多高?

The No Free Lunch theorem [98, 97] teaches us that any two optimization algorithms (including human intelligence) are equivalent when their performance is averaged across every possible problem, i.e. algorithms should be tailored to their target problem in order to achieve better-than-random performance. However, what is meant in this context by “every possible problem” refers to a uniform distribution over problem space; the distribution of tasks that would be practically relevant to our universe (which, due to its choice of laws of physics, is a specialized environment) would not fit this definition. Thus we may ask: is the human g factor universal? Would it generalize to every possible task in the universe?

无免费午餐定理 [98, 97] 告诉我们,任何两种优化算法(包括人类智能)在对所有可能问题取平均性能时都是等价的,换言之,算法必须针对目标问题进行专门设计,才能获得优于随机的性能。然而,这里所说的"所有可能问题"指的是问题空间上的均匀分布;而实际上与我们的宇宙相关的任务分布(由于其特定的物理定律选择,这是一个特殊环境)并不符合这一定义。因此我们可以问:人类的g因子是否具有普适性?它能否推广到宇宙中所有可能的任务?

This is a question that is largely irrelevant for psychometrics, because as a subfield of psychology, it makes the implicit assumption that it is concerned solely with humans and the human experience. But this question is highly relevant when it comes to AI: if there is such a thing as universal intelligence, and if human intelligence is an implementation of it, then this algorithm of universal intelligence should be the end goal of our field, and reverse-engineering the human brain could be the shortest path to reach it. It would make our field close-ended: a riddle to be solved. If, on the other hand, human intelligence is a broad but ad-hoc cognitive ability that generalizes to human-relevant tasks but not much else, this implies that AI is an open-ended, fundamentally anthropocentric pursuit, tied to a specific scope of applicability. This has implications for how we should measure it (by using human intelligence and human tasks as a reference) and for the research strategies we should follow to achieve it.

这个问题在心理测量学中基本无关紧要,因为作为心理学的子领域,它隐含地假设仅关注人类及其经验。但对于人工智能而言,该问题至关重要:若存在通用智能 (universal intelligence) ,且人类智能是其一种实现形式,那么这种通用智能算法应成为我们领域的终极目标,而逆向工程人脑可能是实现它的最短路径。这将使我们的领域成为封闭式命题——一个待解的谜题。反之,若人类智能是一种广泛但临时的认知能力,仅适用于人类相关任务而难以泛化,则意味着人工智能是开放式的、本质上以人类为中心的探索,受限于特定适用范围。这不仅影响我们应如何衡量它(以人类智能和人类任务为参照) ,也决定了实现它所需的研究策略。

The g factor, by definition, represents the single cognitive ability common to success across all intelligence tests, emerging from applying factor analysis to test results across a diversity of tests and individuals. But intelligence tests, by construction, only encompass tasks that humans can perform – tasks that are immediately recognizable and understandable by humans (anthropocentric bias), since including tasks that humans couldn’t perform would be pointless. Further, psychometrics establishes measurement validity by demonstrating predictiveness with regard to activities that humans value (e.g. scholastic success): the very idea of a “valid” measure of intelligence only makes sense within the frame of reference of human values.

根据定义,g因子代表在所有智力测试中取得成功所共有的单一认知能力,它是通过对多种测试和众多个体的测试结果进行因素分析而得出的。但智力测试的设计仅涵盖人类能够执行的任务——那些人类能立即识别和理解的任务(人类中心偏见),因为包含人类无法完成的任务毫无意义。此外,心理测量学通过证明对人类重视活动(如学业成功)的预测性来确立测量效度:"有效"智力测量的概念仅在人类价值观的参考框架内才有意义。
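To illustrate how a g factor "emerges from applying factor analysis to test results", here is a sketch on synthetic data (all loadings and noise levels are invented for illustration). As a crude stand-in for proper factor analysis, it uses the top eigenvector of the correlation matrix: when a shared latent factor drives all test scores, every pair of tests correlates positively (the "positive manifold") and a single dominant factor accounts for a large share of the variance.

```python
import numpy as np

# Illustrative sketch (synthetic data): a single latent factor g drives
# performance on several otherwise-distinct tests, producing the
# positive manifold of correlations from which a g factor is extracted.
rng = np.random.default_rng(0)
n_subjects, n_tests = 500, 6
g = rng.normal(size=n_subjects)                 # latent general factor
loadings = rng.uniform(0.5, 0.9, size=n_tests)  # each test loads on g (invented values)
noise = rng.normal(size=(n_subjects, n_tests))  # test-specific variation
scores = g[:, None] * loadings + 0.6 * noise    # observed test scores

corr = np.corrcoef(scores, rowvar=False)
# Positive manifold: every pair of tests is positively correlated.
off_diag = corr[~np.eye(n_tests, dtype=bool)]
print(off_diag.min() > 0)

# Crude stand-in for factor analysis: the top eigenvector of the
# correlation matrix plays the role of the g factor.
eigvals, eigvecs = np.linalg.eigh(corr)         # eigenvalues in ascending order
print(eigvals[-1] / eigvals.sum())              # share of variance carried by "g"
```

The same construction underlies the "physical g factor" analogy developed below: any battery of tests whose outcomes share a common driver will exhibit this structure.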

In fact, the interpretation of what specific abilities make someone “intelligent” vary from culture to culture [100, 86, 18]. More broadly, humans have historically had a poor track record when it comes to attributing intelligence to complex information-processing agents around them, whether looking at humans from other cultures or at animals (such as octopuses, dolphins, great apes, etc.). We only reluctantly open up to the possibility that systems different from ourselves may be “intelligent” if they display relatable humanlike behaviors that we associate with intelligence, such as language or tool use; behaviors that have high intrinsic complexity and high adaptability but that are not directly relatable (such as octopus camouflage) are not perceived as intelligent. This observation extends to collective entities (e.g. markets, companies, Science as an institution) and natural processes (e.g. biological evolution). Although they can be modeled as standalone systems whose abilities and behavior match broadly accepted definitions of intelligence (achieving goals across a wide range of environments, demonstrating flexibility and adaptability, etc.), we do not categorize these systems as intelligent, simply because they aren’t sufficiently humanlike.

事实上,关于哪些具体能力使人具备"智能"的解读因文化而异 [100, 86, 18]。更广泛地说,人类在判断周围复杂信息处理主体是否具有智能方面历来表现不佳——无论是看待其他文化中的人类,还是动物(如章鱼、海豚、类人猿等)。只有当异于人类的系统表现出我们与智能相关联的拟人行为(如语言或工具使用)时,我们才会勉强承认其可能具有"智能";而那些具有高度内在复杂性和适应性却无法直接类比的行为(如章鱼伪装),则不被视为智能表现。这一认知现象同样适用于集体实体(如市场、企业、科学共同体)和自然过程(如生物进化)。尽管这些系统可以被建模为符合广泛接受的智能定义(在多环境中实现目标、展现灵活性与适应性等)的独立系统,我们仍拒绝将其归类为智能系统,仅仅因为它们不够"像人"。

To use a well-known cross-domain analogy [25]: much like “intelligence”, the notion of “physical fitness” (as it pertains to sports and other physical activities) is an intuitively-understandable, informal, yet useful concept. Like intelligence, fitness is not easily reducible to any single factor (such as a person’s age or muscle mass), rather, it seems to emerge from a constellation of interdependent factors. If we sought to rigorously measure physical fitness in humans, we would come up with a set of diverse tests such as running a $100\mathrm{m}$ , running a marathon, swimming, doing sit-ups, doing basketball throws, etc., not unlike IQ test suites. Across tests results, we would observe clusters of correlations, corresponding to broad “physical abilities” strictly analogous to cognitive abilities (e.g. lung capacity might be such an “ability” inducing correlations across tests). Much like in the case of cognitive abilities, experts would probably disagree and debate as to the exact taxonomy of these broad abilities (is being “tall and lean” an ability, or is “tallness” a standalone factor?). And crucially, we should intuitively expect to find that all tests results would be correlated: we would observe a physical g factor, corresponding to the general intuitive construct of “physical fitness”.

用一个著名的跨领域类比[25]来说:"体能"(physical fitness)这个概念(涉及运动和其他身体活动)就像"智力"一样,是一种直观易懂、非正式但有用的概念。与智力类似,体能很难简化为任何单一因素(如一个人的年龄或肌肉量),而是似乎从一系列相互依赖的因素中涌现出来。如果我们试图严格测量人类的体能,我们会设计一组多样化的测试,比如跑100米、跑马拉松、游泳、做仰卧起坐、投篮等,这与智商测试套件并无二致。在各项测试结果中,我们会观察到相关性集群,对应着与认知能力严格类比的广泛"身体能力"(例如肺活量可能就是这样一个在各项测试中产生相关性的"能力")。与认知能力的情况非常相似,专家们可能会对这些广泛能力的确切分类存在分歧和争论("高瘦"是一种能力,还是"身高"是一个独立因素?)。最关键的是,我们凭直觉就能预期所有测试结果都会相互关联:我们会观察到一个体能g因子,对应着"体能"这个普遍直观概念。

But would this mean that human morphology and motor affordances are “general” in an absolute sense, and that a very fit person could handle any physical task at all? Certainly not; we are not adapted for the large majority of environments that can be found in the universe – from the Earth’s oceans to the surface of Venus, from the atmosphere of Jupiter to interstellar space. It is, however, striking and remarkable that human physical abilities generalize to a far greater range of environments and tasks than the limited set of environments and activities that guided their evolution. To caricature, human bodies evolved for running in the East-African savanna, yet they are capable of climbing mount Everest, swimming across lakes, skydiving, playing basketball, etc. This is not a coincidence; by necessity, evolution optimizes for adaptability, whether cognitive adaptability or sensorimotor adaptability. Human physical capabilities can thus be said to be “general”, but only in a limited sense; when taking a broader view, humans reveal themselves to be extremely specialized, which is to be expected given the process through which they evolved.

但这是否意味着人类的形态和运动能力在绝对意义上是"通用"的,一个体能极佳的人可以完成任何身体任务?显然不是;我们并不适应宇宙中存在的大多数环境——从地球海洋到金星表面,从木星大气层到星际空间。然而令人惊叹的是,人类的体能可以泛化到远比其进化环境广阔得多的领域。夸张地说,人类身体是为东非大草原奔跑而进化的,却能够攀登珠穆朗玛峰、横渡湖泊、跳伞、打篮球等。这并非巧合:进化必然要优化适应能力,无论是认知适应还是感知运动适应。因此可以说人类体能是"通用"的,但仅限于相对意义;从更宏观视角看,人类实际上高度特化,这正是进化过程的必然结果。

We argue that human cognition follows strictly the same pattern as human physical capabilities: both emerged as evolutionary solutions to specific problems in specific environments (commonly known as “the four Fs”). Both were, importantly, optimized for adaptability, and as a result they turn out to be applicable for a surprisingly greater range of tasks and environments beyond those that guided their evolution (e.g. piano-playing, solving linear algebra problems, or swimming across the Channel) – a remarkable fact that should be of the utmost interest to anyone interested in engineering broad or generalpurpose abilities of any kind. Both are multi-dimensional concepts that can be modeled as a hierarchy of broad abilities leading up to a “general” factor at the top. And crucially, both are still ultimately highly specialized (which should be unsurprising given the context of their development): much like human bodies are unfit for the quasi-totality of the universe by volume, human intellect is not adapted for the large majority of conceivable tasks.

我们认为人类认知与身体能力遵循完全相同的模式:二者都是针对特定环境中具体问题演化出的解决方案(即俗称的"四F法则")。关键在于,它们都为实现适应性进行了优化,因此能惊人地适用于远超其演化初衷的任务和环境(例如弹钢琴、解线性代数题或横渡英吉利海峡)——这一非凡事实对任何希望设计通用型能力的人士都具有极高参考价值。二者都是多维概念,可建模为由基础能力层级递进至顶端"通用"因子的结构。更重要的是,二者本质上仍具有高度专属性(考虑到其发展背景,这并不意外):正如人类身体无法适应宇宙中绝大多数空间环境,人类智力同样不适合处理可想象任务中的绝大部分。

This includes obvious categories of problems such as those requiring long-term planning beyond a few years, or requiring large working memory (e.g. multiplying 10-digit numbers). This also includes problems for which our innate cognitive priors are unadapted; for instance, humans can be highly efficient in solving certain NP-hard problems of small size when these problems present cognitive overlap with evolutionarily familiar tasks such as navigation (e.g. the Euclidean Traveling Salesman Problem (TSP) with low point count can be solved by humans near-optimally in near-linear optimal time [58], using perceptual strategies), but perform poorly – often no better than random search – for problem instances of very large size or problems with less cognitive overlap with evolutionarily familiar tasks (e.g. certain non-Euclidean problems). For instance, in the TSP, human performance degrades severely when inverting the goal from “finding the shortest path” to “finding the longest path” [57] – humans perform even worse in this case than one of the simplest possible heuristics: farthest-neighbor construction.6

这包括一些明显的问题类别,例如需要超过数年的长期规划,或需要较大工作记忆的任务(例如计算10位数的乘法)。同样也涵盖那些与我们先天认知先验不适配的问题;例如,人类在解决某些小规模NP难问题时可能非常高效,尤其是当这些问题与进化过程中熟悉的任务(如导航)存在认知重叠时(例如,点数较少的欧几里得旅行商问题(TSP)可以通过感知策略由人类在近线性最优时间内近乎最优地解决[58]),但对于规模极大或与进化熟悉任务认知重叠较少的问题实例(例如某些非欧几里得问题),人类表现往往不佳——甚至不比随机搜索更好。例如在TSP中,当目标从"寻找最短路径"反转为"寻找最长路径"时,人类表现会急剧下降[57]——此时人类的表现甚至比最简单的启发式算法之一"最远邻点构造"更差。6
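The two greedy construction heuristics just mentioned (nearest-neighbor for the shortest-path goal, farthest-neighbor for the longest-path goal) can be sketched as follows. The point set is an invented toy example, not the experimental setup of [57]; both heuristics share the same structure and differ only in whether each step picks the closest or the farthest remaining point.

```python
import math

# Sketch of the two greedy path-construction heuristics discussed above,
# on a toy Euclidean point set (coordinates invented for illustration).

def tour_length(points, order):
    """Total length of the open path visiting points in the given order."""
    return sum(math.dist(points[order[i]], points[order[i + 1]])
               for i in range(len(order) - 1))

def greedy_path(points, pick):
    """Grow a path from point 0, each step choosing via pick (min or max)."""
    order, remaining = [0], set(range(1, len(points)))
    while remaining:
        last = points[order[-1]]
        nxt = pick(remaining, key=lambda j: math.dist(last, points[j]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

points = [(0, 0), (1, 0), (2, 1), (0, 2), (3, 3), (1, 3)]
short = greedy_path(points, min)   # nearest-neighbor construction
long_ = greedy_path(points, max)   # farthest-neighbor construction
print(tour_length(points, short), tour_length(points, long_))
```

The point of the comparison in [57] is that farthest-neighbor construction, despite being this simple, still outperforms humans on the "longest path" variant of the task.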

A particularly marked human bias is dimensional bias: humans show excellent performance on 2D navigation tasks and 2D shape-packing puzzles, and can still handle 3D cases albeit with greatly reduced performance, but they are effectively unable to handle 4D and higher. This fact is perhaps unsurprising given human reliance on perceptual strategies for problem-solving – strategies which are backed by neural mechanisms specifically evolved for 2D navigation (hippocampal systems of place cells and grid cells [64]).

一个特别显著的人类认知偏差是维度偏差:人类在2D导航任务和2D形状拼图任务中表现出色,虽然性能大幅下降但仍能处理3D情况,但基本上无法处理4D及更高维度。考虑到人类依赖感知策略解决问题(这些策略由专门为2D导航进化的神经机制支撑,如海马体位置细胞与网格细胞系统 [64]),这一事实或许并不令人意外。

Thus, a central point of this document is that “general intelligence” is not a binary property which a system either possesses or lacks. It is a spectrum, tied to 1) a scope of application, which may be more or less broad, 2) the degree of efficiency with which the system translates its priors and experience into new skills over the scope considered, and 3) the degree of generalization difficulty represented by different points in the scope considered (see II.2). In addition, the “value” of one scope of application over another is entirely subjective; we wouldn’t be interested in (and wouldn’t even perceive as intelligent) a system whose scope of application had no intersection with our own.

因此,本文的核心观点在于,“通用智能”并非系统要么具备、要么缺失的二元属性。它是一个连续谱系,与以下三点紧密相关:1) 应用范围(可宽可窄);2) 系统在给定范围内将先验知识和经验转化为新技能的效率程度;3) 该范围内不同节点所体现的泛化难度等级(参见II.2节)。此外,不同应用范围之间的“价值”判断完全主观——对于与人类需求毫无交集的应用范围,我们既不会感兴趣,甚至不会将其视为智能系统。

As such, it is conceptually unsound to set “artificial general intelligence” in an absolute sense (i.e. “universal intelligence”) as a goal. To set out to build broad abilities of any kind, one must start from a target scope, and one must seek to achieve a well-defined intelligence threshold within this scope: AI is a deeply contextual and open-ended endeavour, not a single one-time riddle to be solved. However, it may in theory be possible to create human-like artificial intelligence: we may gradually build systems that extend across the same scope of applicability as human intelligence, and we may gradually increase their generalization power within this scope until it matches that of humans. We may even build systems with higher generalization power (as there is no a priori reason to assume human cognitive efficiency is an upper bound), or systems with a broader scope of application. Such systems would feature intelligence beyond that of humans.

因此,将"通用人工智能 (AGI)"绝对化地设定为目标(即"普适智能")在概念上是不合理的。要构建任何类型的广泛能力,必须从目标范围出发,并寻求在该范围内达到明确定义的智能阈值:人工智能是一项高度情境化且无止境的探索,而非一个需要一次性解决的谜题。然而从理论上讲,创造类人人工智能是可能的:我们可以逐步构建覆盖人类智能相同适用范围的系统,并在此范围内持续提升其泛化能力直至达到人类水平。我们甚至可能开发出具有更强泛化能力的系统(因为没有先验理由认定人类认知效率是上限),或是具备更广应用范围的系统。这类系统将展现出超越人类智能的特征。

In conclusion, we propose that research on developing broad abilities in AI systems (up to “general” AI, i.e. AI with a degree of generality comparable to human intelligence) should focus on defining, measuring, and developing a specifically human-like form of intelligence, and should benchmark progress specifically against human intelligence (which is itself highly specialized). This isn’t because we believe that intelligence that greatly differs from our own couldn’t exist or wouldn’t have value; rather, we recognize that characterizing and measuring intelligence is a process that must be tied to a well-defined scope of application, and at this time, the space of human-relevant tasks is the only scope that we can meaningfully approach and assess. We thus disagree with the perspective of Universal Psychometrics [39] or Legg and Hutter’s Universal Intelligence [54], which reject anthropocentrism altogether and seek to measure all intelligence against a single absolute scale. An anthropocentric frame of reference is not only legitimate, it is necessary.

总之,我们建议关于在AI系统中开发广泛能力(直至"通用"AI,即具有与人类智能相当程度通用性的AI)的研究,应聚焦于定义、衡量和开发一种特定类人形态的智能,并以人类智能作为明确的基准来衡量进展(人类智能本身也是高度特化的)。这并非因为我们认为与人类智能迥异的其他智能形式不可能存在或没有价值,而是我们认识到描述和衡量智能的过程必须与明确定义的应用范围相关联。在现阶段,与人类相关的任务领域是我们唯一能够有效探索和评估的范围。因此,我们不同意通用心理测量学[39]或Legg与Hutter提出的通用智能[54]的观点,这些理论完全摒弃人类中心主义,试图用单一绝对尺度衡量所有智能。以人类为中心的参照框架不仅是合理的,更是必要的。

II.1.3 Separating the innate from the acquired: insights from developmental psychology

II.1.3 先天与后天的分离:发展心理学的启示

Advances in developmental psychology teach us that neither of the two opposing views of the nature of the mind described in I.2 are accurate (see e.g. [85]): the human mind is not merely a collection of special-purpose programs hard-coded by evolution; it is capable of a remarkable degree of generality and open-endedness, going far beyond the scope of environments and tasks that guided its evolution. The large majority of the skills and knowledge we possess are acquired during our lifetimes, rather than innate. Simultaneously, the mind is not a single, general-purpose “blank slate” system capable of learning anything from experience. Our cognition is specialized, shaped by evolution in specific ways; we are born with priors about ourselves, about the world, and about how to learn, which determine what categories of skills we can acquire and what categories of problems we can solve.

发展心理学的进展告诉我们,I.2节中描述的两种对立心智本质观点都不准确(如[85]所示):人类心智并非仅是进化编码的专用程序集合;它具有显著程度的通用性和开放性,远超指导其进化的环境与任务范围。我们所拥有的大部分技能和知识都是后天习得,而非与生俱来。同时,心智也非单一通用"白板"系统,能够从经验中学习任何事物。我们的认知具有特异性,以特定方式被进化塑造;我们生来就带有关于自身、世界及学习方式的先验知识,这些决定了我们能够获得哪些技能类别以及解决哪些问题类别。

These priors are not a limitation to our generalization capabilities; to the contrary, they are their source, the reason why humans are capable of acquiring certain categories of skills with remarkable efficiency. The central message of the No Free Lunch theorem [98] is that to learn from data, one must make assumptions about it – the nature and structure of the innate assumptions made by the human mind are precisely what confers to it its powerful learning abilities.

这些先验知识并非对我们泛化能力的限制;恰恰相反,它们是能力的源泉,是人类能够以惊人效率掌握某些技能类别的根本原因。无免费午餐定理 [98] 的核心观点是:要从数据中学习,就必须对其做出假设——人类心智与生俱来的假设性质与结构,正是赋予其强大学习能力的本质所在。

We noted in II.1.1 that an actionable measure of intelligence should, crucially, control for priors and experience. We proposed in II.1.2 that evaluating general intelligence should leverage human intelligence as a necessary frame of reference. It follows that we need a clear understanding of human cognitive priors in order to fairly evaluate general intelligence between humans and machines.

我们在II.1.1节中指出,一个可操作的智能衡量标准关键在于控制先验和经验。在II.1.2节中我们提出,评估通用智能应以人类智能作为必要的参照框架。因此,为了公平评估人类与机器之间的通用智能,我们需要清晰理解人类的认知先验。

Human cognitive priors come in multiple forms, in particular7:

人类认知先验以多种形式存在7,具体包括:

When it comes to creating artificial human-like intelligence, low-level sensorimotor priors are too specific to be of interest (unless one seeks to build an artificial human body). While human meta-learning priors should be of the utmost interest (understanding the strategies that the brain follows to turn experience into knowledge and skills is effectively our end goal), these priors are not relevant to evaluating intelligence: they are intelligence, rather than a third-party modulating factor to be controlled for. They are part of the black box that we seek to characterize.

在创造类人人工智能时,低级感知运动先验( sensorimotor priors )过于特定而缺乏普适价值(除非旨在构建人工人体)。虽然人类元学习先验( meta-learning priors )最值得关注(理解大脑将经验转化为知识和策略的机制本质上就是我们的终极目标),但这些先验与智力评估无关:它们本身就是智力的一部分,而非需要控制的第三方调节因素。这些先验正是我们试图解析的黑箱( black box )组成部分。

It is knowledge priors that should be accounted for when measuring a human-like form of intelligence. A system that does not possess human innate knowledge priors would be at a critical disadvantage compared to humans when it comes to efficiently turning a given experience curriculum into skill at a given human task. Inversely, a system that has access to more extensive hard-coded knowledge about the task at hand could not be fairly compared to human intelligence – as we noted in II.1.1, unlimited priors allow system developers to “buy” unbounded performance on any given task, with no implications with regard to generalization abilities (what we are actually trying to achieve).

在衡量类人智能形式时,应当考虑的是知识先验。一个缺乏人类先天知识先验的系统,在将既定经验课程高效转化为特定人类任务技能方面,相比人类会处于关键劣势。反之,一个拥有更多关于当前任务的硬编码知识的系统,也无法与人类智能进行公平比较——正如II.1.1节所述,无限先验允许系统开发者为任意任务"购买"无上限的性能,而这与泛化能力(我们真正想要实现的目标)毫无关系。

Therefore, we propose that an actionable test of human-like general intelligence should be founded on innate human knowledge priors:

因此,我们提出一个可操作的人类通用智能测试应基于人类先天知识先验:

• The priors should be made as close as possible to innate human knowledge priors as we understand them. As our understanding of human knowledge priors improves over time, so should the test evolve. • The test should assume that the system being measured possesses a specific set of priors. AI systems with more extensive priors should not be benchmarked using such a test. AI systems with fewer priors should be understood to be at a disadvantage.

• 先验知识应尽可能接近我们所理解的人类先天知识先验。随着对人类知识先验理解的不断深入,测试标准也应相应演进。
• 测试应假设被测系统具备特定先验知识集。拥有更广泛先验知识的AI系统不应使用此类测试作为基准。先验知识较少的AI系统应被视为处于劣势。

This leads us to a central question: what is the exact list of knowledge priors that humans are born with? This is the question that the developmental science theory of Core Knowledge [85] seeks to answer. Core Knowledge identifies four broad categories of innate assumptions that form the foundations of human cognition, and which are largely shared by our non-human relatives8:

这引出了一个核心问题:人类与生俱来的知识先验具体包含哪些内容?这正是核心知识理论 [85] 试图解答的问题。该理论将构成人类认知基础的先天假设归纳为四大类别,这些假设在很大程度上也被人类的近亲物种所共有8:

While cognitive developmental psychology has not yet determined with a high degree of certainty the exact set of innate priors that humans possess, we consider the Core Knowledge theory to offer a credible foundation suitable to the needs of a test of human-like general intelligence. We therefore propose that an actionable test of general intelligence that would be fair for both humans and machines should only feature tasks that assume the four core knowledge systems listed above, and should not involve any acquired knowledge outside of these priors. We also argue, in agreement with [51], that general AI systems should hard-code as fundamental priors these core knowledge principles.

虽然认知发展心理学尚未高度确定人类所拥有的先天先验知识的确切集合,但我们认为核心知识理论为测试类人通用智能提供了一个可信的基础。因此,我们提出一个可操作的通用智能测试,该测试对人类和机器都公平,应仅包含基于上述四种核心知识系统的任务,而不涉及这些先验知识之外的任何习得知识。此外,我们同意[51]的观点,认为通用人工智能系统应将这些核心知识原则硬编码为基本先验。

II.2 Defining intelligence: a formal synthesis

II.2 定义智能:形式化综合

II.2.1 Intelligence as skill-acquisition efficiency

II.2.1 作为技能获取效率的智能

So far, we have introduced the following informally-described intuitions:

到目前为止,我们已经非正式地介绍了以下直观概念:

Let us now formalize these intuitions. In what follows, we provide a series of definitions for key concepts necessary to ground a formal definition of intelligence and its measure. We will leverage the tools of Algorithmic Information Theory. These definitions lead up to a formal way of expressing the following central idea:

现在让我们将这些直觉形式化。接下来,我们将为构建智能及其度量形式化定义所需的关键概念提供一系列定义。我们将利用算法信息论的工具。这些定义最终将导向表达以下核心思想的正式方式:

The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.

系统的智能性是对其在任务范围内技能获取效率的衡量,涉及先验知识、经验及泛化难度。

Intuitively, if you consider two systems that start from a similar set of knowledge priors, and that go through a similar amount of experience (e.g. practice time) with respect to a set of tasks not known in advance, the system with higher intelligence is the one that ends up with greater skills (i.e. the one that has turned its priors and experience into skill more efficiently). This definition of intelligence encompasses meta-learning priors, memory, and fluid intelligence. It is distinct from skill itself: skill is merely the output of the process of intelligence.

直观地说,如果两个系统从相似的知识先验出发,并在未知任务集上经历相近的经验量(例如练习时长),那么智能更高的系统最终会获得更强的技能(即能更高效地将先验和经验转化为技能)。这种智能定义涵盖了元学习先验、记忆和流体智力。它与技能本身不同:技能仅是智能过程的输出产物。
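As an illustration only (not the formal Algorithmic Information Theory definition developed below), the intuition above can be sketched with two toy systems given an identical, sparse experience curriculum for a task they have not seen before. The task, both systems, and all numbers are invented for illustration; the point is that the system whose prior matches the task's structure converts the same experience into greater skill.

```python
# Toy illustration of skill-acquisition efficiency: same experience
# curriculum, different priors, different resulting skill.
# The task (an affine mapping) and both systems are invented for illustration.

train = [(x, 2 * x + 1) for x in range(5)]                 # sparse experience
test = [(x + 0.5, 2 * (x + 0.5) + 1) for x in range(5)]    # unseen situations

# System A: nearest-neighbor lookup (weak prior: "outputs vary locally").
def system_a(x):
    return min(train, key=lambda s: abs(s[0] - x))[1]

# System B: least-squares line fit (prior matching the task: "the mapping is affine").
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
slope = (sum((x - mx) * (y - my) for x, y in train)
         / sum((x - mx) ** 2 for x, _ in train))
def system_b(x):
    return my + slope * (x - mx)

def err(f):
    """Mean absolute error on the unseen situations: a proxy for (lack of) skill."""
    return sum(abs(f(x) - y) for x, y in test) / len(test)

print(err(system_a), err(system_b))  # same experience, different acquired skill
```

Under this (deliberately simplified) lens, system B is the more intelligent of the two over this scope: it turned the same priors-plus-experience budget into strictly greater skill on situations not seen in advance.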

Before we start, let us emphasize that many possible definitions of intelligence may be valid, across many different contexts, and we do not purport that the definition above and the formalism below represent the “one true” definition. Nor is our definition meant to achieve broad consensus. Rather, the purpose of our definition is to be actionable, to serve as a useful perspective shift for research on broad cognitive abilities, and to function as a quantitative foundation for new general intelligence benchmarks, such as the one we propose in part III. As per George Box’s aphorism, “all models are wrong, but some are useful”: our only aim here is to provide a useful North Star towards flexible and general AI. We discuss in II.2.3 the concrete ways in which our formalism is useful and actionable.

在开始之前,我们需要强调,智力的定义可能有很多种,且在不同情境下都有效。我们并不认为上述定义及下文的形式化表述是"唯一正确"的定义,也不追求达成广泛共识。相反,我们定义的目的在于可操作性,为广泛认知能力研究提供有益视角转变,并为新型通用智能基准(如第三部分提出的方案)建立量化基础。正如George Box所言"所有模型都是错的,但有些是有用的":我们唯一的目标是为实现灵活通用的人工智能提供一个有用的指引。在II.2.3节中,我们将具体讨论这套形式化体系的可操作性与实用价值。

Position of the problem

问题定位

First, we must introduce basic definitions to establish our problem setup. It should be immediately clear to the reader that our choice of problem setup is sufficient to model Fully-Supervised Learning, Partially-Supervised Learning, and Reinforcement Learning.

首先,我们需要引入基本定义来建立问题设定。读者应当立即明确,我们所选的问题设定足以建模全监督学习 (Fully-Supervised Learning)、部分监督学习 (Partially-Supervised Learning) 和强化学习 (Reinforcement Learning)。

We consider the interaction between a “task” and an “intelligent system”. This interaction is mediated by a “skill program” (generated by the intelligent system) and a “scoring function” (part of the task).

我们考虑"任务"与"智能系统"之间的交互。这种交互通过"技能程序"(由智能系统生成)和"评分函数"(任务的一部分)作为中介。

We implicitly consider the existence of a fixed universal Turing machine on which our programs run (including the skill programs, as well as programs part of the task and part of the intelligent system). We also assume the existence of a fixed “situation space” Situation Space and “response space” Response Space. Each of these spaces defines the set of binary strings that are allowed as input (and output, respectively) of all skill programs we will consider henceforth. They may be, for instance, the sensor space and the motor space of an animal or robot.

我们隐含地假设存在一个固定的通用图灵机,所有程序(包括技能程序、任务组成部分的程序以及智能系统组成部分的程序)都在其上运行。同时,我们假设存在固定的"情境空间"(Situation Space)和"响应空间"(Response Space)。这些空间分别定义了后续所考虑的所有技能程序允许作为输入(及对应输出)的二进制字符串集合。例如,这些空间可以是动物或机器人的传感器空间与运动空间。


Figure 2: Position of the problem: an intelligent system generates a skill program to interact with a task.

图 2: 问题定位:智能系统生成技能程序以与任务交互。

A task $T$ consists of four objects:

任务 $T$ 包含四个对象:

• A “scoring function” $Scoring: [Situation, Response, TaskState] \rightarrow [Score, Feedback]$. It may be stochastic.

• "评分函数" Scoring $:$ [情境(Situation), 响应(Response), 任务状态(TaskState)] $\rightarrow$ [分数(Score), 反馈(Feedback)]。该函数可能是随机的。

• A self-update function $TaskUpdate: [Response, TaskState] \rightarrow TaskState$, which mutates the task state based on the response to the latest situation. It may be stochastic.

• 自更新函数 $TaskUpdate:[Response,TaskState]\rightarrow TaskState$,该函数根据对最新情况的响应来改变任务状态。它可能是随机的。

For instance, a game such as chess or WarCraft III (as well as what we call a “task” in the ARC benchmark presented in part III) would constitute a task. A given chess board position, a screen frame in WarCraft III, or an input grid in ARC, would constitute a situation.

例如,国际象棋或《魔兽争霸III》这样的游戏(以及我们在第三部分介绍的ARC基准测试中所称的“任务”)将构成一个任务。而特定的国际象棋棋盘位置、《魔兽争霸III》中的屏幕画面或ARC中的输入网格,则构成一个情境。
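As a toy illustration of a task's four objects (an initial state, situation generation, scoring, and self-update), consider a minimal parity task: each situation is a number, and a response scores 1 if it names the number's parity. All class and method names below are our own sketch, not part of the formalism:

```python
import random
from typing import Tuple

class ParityTask:
    """A toy task: situation_gen emits a number, scoring rewards naming
    its parity, and the feedback reveals the correct answer."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)
        self.state = "start"                      # TaskState (a string here)

    def situation_gen(self) -> str:               # SituationGen
        return str(self.rng.randint(0, 99))

    def scoring(self, situation: str, response: str) -> Tuple[float, str]:
        # Scoring: [Situation, Response, TaskState] -> [Score, Feedback]
        correct = "even" if int(situation) % 2 == 0 else "odd"
        return (1.0 if response == correct else 0.0), correct

    def task_update(self, response: str) -> None:
        # TaskUpdate: [Response, TaskState] -> TaskState
        self.state = f"last_response={response}"
```

The feedback channel here deliberately reveals the correct answer; a task may of course expose much less (e.g. only a sparse score), which is what makes skill acquisition hard.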

An intelligent system $IS$ consists of three objects:

智能系统 $IS$ 由三个对象组成:

• An initial “system state” $ISState_{t=0}$ (a binary string).

• 初始“系统状态” $ISState_{t=0}$(一个二进制字符串)。

• A “skill program generation function” $SkillProgramGen: ISState \rightarrow [SkillProgram, SPState]$, which generates a new skill program together with its initial state, based on the current system state. It may be stochastic.

• “技能程序生成函数” $SkillProgramGen: ISState \rightarrow [SkillProgram, SPState]$,根据当前系统状态生成新的技能程序及其初始状态。该函数可能是随机的。

– A skill program $SkillProgram: [Situation, SPState] \rightarrow [Response, SPState]$ is a function that maps an input situation to a valid response (part of $ResponseSpace$), potentially using some working memory ($SPState$). It may be stochastic. Because it possesses a state $SPState$ (a binary string), it may be used to autonomously handle a series of connected situations without further communication with the intelligent system that generated it.

– 技能程序 (Skill Program) $SkillProgram: [Situation, SPState] \rightarrow [Response, SPState]$ 是一个将输入情境映射为有效响应(属于 $ResponseSpace$)的函数,可能使用某些工作内存($SPState$)。它可以是随机的。由于拥有状态 $SPState$(二进制字符串),该程序可在不与生成它的智能系统进一步通信的情况下,自主处理一系列关联情境。

– A skill program may be, for instance, any game-specific program capable of playing new levels in a given video game.

– 例如,技能程序可以是任何能够玩给定电子游戏中新关卡的特定游戏程序。

– In what follows, we refer to “skill program” as the combination of the $SkillProgram$ function and the initial skill program state $SPState$ (i.e. skill programs are considered stateful).

– 在下文中,我们所说的“技能程序 (skill program)”指 $SkillProgram$ 函数与初始技能程序状态 $SPState$ 的组合(即技能程序被视为有状态的)。

– A skill program represents a frozen version of the system’s task-specific capabilities (including the ability to adapt to novel situations within the task). We use the concept of skill program as a conceptual device to formalize the level of task-specific skill and task-specific generalization capabilities of an agent at a given point in time.

– 技能程序 (skill program) 代表系统任务特定能力的固化版本(包括适应任务内新情境的能力)。我们使用技能程序这一概念作为形式化工具,用以表征智能体在特定时间点所具备的任务专项技能水平及任务内泛化能力。

• A “self-update function” $ISUpdate: [Situation, Response, Feedback, ISState] \rightarrow ISState$, which mutates the system's state based on the latest situation and corresponding feedback. It may be stochastic.

• “自我更新函数” $ISUpdate: [Situation, Response, Feedback, ISState] \rightarrow ISState$,该函数根据最新情境及对应反馈更新系统状态。该函数可能是随机的。

For instance, a neural network generation and training algorithm for games would be an “intelligent system”, and the inference-mode game-specific network it would output at the end of a training run on one game would be a “skill program”. A program synthesis engine capable of looking at an ARC task and outputting a solution program would be an “intelligent system”, and the resulting solution program capable of handling future input grids for this task would be a “skill program”.

例如,用于游戏的神经网络生成和训练算法属于"智能系统",而该算法在完成某一游戏的训练后输出的推理模式专用网络则属于"技能程序"。能够查看ARC任务并输出解决方案的程序合成引擎是"智能系统",而生成的能处理该任务未来输入网格的解决方案程序则是"技能程序"。
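Continuing the toy parity example, the split between a learning system and the frozen skill programs it emits might look as follows (a trivial memorizing learner; all names are illustrative, not the paper's API):

```python
from typing import Dict, Tuple

class SkillProgram:
    """SkillProgram: [Situation, SPState] -> [Response, SPState].
    Here the 'program' is a frozen lookup table plus a default guess;
    its working memory (sp_state) is carried along but unused."""

    def __init__(self, table: Dict[str, str]):
        self.table = dict(table)                  # frozen task-specific knowledge

    def __call__(self, situation: str, sp_state: str) -> Tuple[str, str]:
        return self.table.get(situation, "odd"), sp_state

class MemorizingIS:
    """An 'intelligent system' that memorizes (situation -> feedback) pairs."""

    def __init__(self):
        self.is_state: Dict[str, str] = {}        # ISState

    def skill_program_gen(self) -> Tuple[SkillProgram, str]:
        # SkillProgramGen: ISState -> [SkillProgram, SPState]
        return SkillProgram(self.is_state), ""    # blank initial working memory

    def is_update(self, situation: str, response: str, feedback: str) -> None:
        # ISUpdate: [Situation, Response, Feedback, ISState] -> ISState
        self.is_state[situation] = feedback
```

Note the asymmetry the formalism insists on: once `skill_program_gen` has run, the resulting `SkillProgram` can be evaluated on new situations with no further access to `MemorizingIS`.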

The interaction between task, intelligent system, and skill programs is structured in two phases: a training phase and an evaluation phase. The goal of the training phase is for the $I S$ to generate a high-skill skill program that will generalize to future evaluation situations. The goal of the evaluation phase is to assess the capability of this skill program to handle new situations.

任务、智能系统和技能程序之间的交互分为两个阶段:训练阶段和评估阶段。训练阶段的目标是让 $IS$ 生成一个能够泛化到未来评估情境的高技能程序。评估阶段的目标则是评估该技能程序处理新情境的能力。

The training phase consists of the repetition of the following steps (we note the current step as $t$). Before we start, we consider two separate initial task states, $trainTaskState_{t=0}$ and $testTaskState_{e=0}$.

训练阶段包含以下步骤的重复(当前步骤记为 $t$)。开始前,我们考虑两个独立的初始任务状态:$trainTaskState_{t=0}$ 和 $testTaskState_{e=0}$。

• The task generates a new situation: $situation_t \leftarrow SituationGen(trainTaskState_t)$

• 任务生成新情境:$situation_t \leftarrow SituationGen(trainTaskState_t)$

• The intelligent system generates a new skill program: $[skillProgram_t, spState_t] \leftarrow SkillProgramGen(isState_t)$

• 智能系统生成新的技能程序:$[skillProgram_t, spState_t] \leftarrow SkillProgramGen(isState_t)$

• The skill program outputs a response to the situation: $[response_t, spState_{t+1}] \leftarrow skillProgram_t(situation_t, spState_t)$

– $skillProgram_t$ is only called once, and $spState_{t+1}$ is discarded, since the skill program is generated anew by the intelligent system at each training step.

– In practice, in partially-observable games where consecutive situations are very close to each other (e.g. two consecutive screen frames in WarCraft III), one may assume that $skillProgram_t$ and $skillProgram_{t+1}$ would not actually be generated independently from scratch and would stay very close to each other (i.e. the $IS$'s understanding of the task would be evolving continuously in program space); $spState_{t+1}$ as generated by $skillProgram_t$ and $spState_{t+1}$ as generated by $SkillProgramGen$ at $t+1$ would likewise stay very close to each other.

• 技能程序对当前情境输出响应:$[response_t, spState_{t+1}] \leftarrow skillProgram_t(situation_t, spState_t)$

– $skillProgram_t$ 仅被调用一次,且 $spState_{t+1}$ 会被丢弃,因为智能系统在每个训练步骤都会重新生成技能程序。

– 实践中,在连续情境高度相似的部分可观测游戏中(如《魔兽争霸III》的连续屏幕帧),可假设时刻 $t$ 的 $skillProgram_t$ 与 $skillProgram_{t+1}$ 不会完全独立生成,而是保持高度相似(即 $IS$ 对任务的理解会在程序空间中持续演化);由 $skillProgram_t$ 生成的 $spState_{t+1}$ 与由 $SkillProgramGen$ 在 $t+1$ 时刻生成的 $spState_{t+1}$ 同样会保持高度相似。

• The task scoring function assigns a score to the response and generates a piece of feedback: $[score_t, feedback_t] \leftarrow Scoring(situation_t, response_t, trainTaskState_t)$

• 任务评分函数为响应分配分数并生成一条反馈:$[score_t, feedback_t] \leftarrow Scoring(situation_t, response_t, trainTaskState_t)$

• The intelligent system updates its state based on the feedback: $isState_{t+1} \leftarrow ISUpdate(situation_t, response_t, feedback_t, isState_t)$

• 智能系统根据反馈更新自身状态:$isState_{t+1} \leftarrow ISUpdate(situation_t, response_t, feedback_t, isState_t)$

• The task updates its state based on the response: $trainTaskState_{t+1} \leftarrow TaskUpdate(response_t, trainTaskState_t)$

• 任务根据响应更新自身状态:$trainTaskState_{t+1} \leftarrow TaskUpdate(response_t, trainTaskState_t)$

The training phase ends at the discretion of the $SituationGen$ function (e.g. $SituationGen$ returns a “STOP” situation), at which time $SkillProgramGen$ would generate its last skill program, including an initial state (initial working memory) meant to perform well during evaluation (e.g. blank).

训练阶段何时结束由 $SituationGen$ 函数自行决定(例如 $SituationGen$ 返回一个“STOP”情境),此时 $SkillProgramGen$ 会生成最后一个技能程序,其中包括一个旨在评估阶段表现良好的初始状态(初始工作内存)(例如空白状态)。

The evaluation phase is superficially similar to the training phase, with the differences that 1) the task starts from $testTaskState_{e=0}$ and consists of an independent series of situations, and 2) it only involves a single fixed skill program $testSkillProgram$, starting with state $testSPState_{e=0}$. Crucially, it no longer involves the intelligent system. Note that $testTaskState_{e=0}$ could be chosen stochastically. For instance, different randomly chosen initial $testTaskState_{e=0}$ could be different randomly-generated levels of a game.

评估阶段表面上与训练阶段相似,不同之处在于:1) 任务从 $testTaskState_{e=0}$ 开始,由一系列独立情境组成;2) 仅涉及一个固定的技能程序 $testSkillProgram$,其初始状态为 $testSPState_{e=0}$。关键在于,该阶段不再包含智能系统。需注意,$testTaskState_{e=0}$ 可以随机选择,例如不同的随机初始 $testTaskState_{e=0}$ 可能对应游戏中随机生成的不同关卡。

Like the separati