On the Measure of Intelligence
Francois Chollet Google, Inc. fchollet@google.com
November 5, 2019
Abstract
To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks, such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to “buy” arbitrary levels of skill for a system, in a way that masks the system’s own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience, as critical pieces to be accounted for in characterizing intelligent systems. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a new benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.
Contents
I Context and history
II A new perspective
II.1 Critical assessment
III A benchmark proposal: the ARC dataset
III.1 Description and goals
I Context and history
I.1 Need for an actionable definition and measure of intelligence
The promise of the field of AI, spelled out explicitly at its inception in the 1950s and repeated countless times since, is to develop machines that possess intelligence comparable to that of humans. But AI has since been falling short of its ideal: although we are able to engineer systems that perform extremely well on specific tasks, they still have stark limitations, being brittle, data-hungry, unable to make sense of situations that deviate slightly from their training data or the assumptions of their creators, and unable to repurpose themselves to deal with novel tasks without significant involvement from human researchers.
If the only successes of AI have been in developing narrow, task-specific systems, it is perhaps because only within a very narrow and grounded context have we been able to define our goal sufficiently precisely, and to measure progress in an actionable way. Goal definitions and evaluation benchmarks are among the most potent drivers of scientific progress. To make progress towards the promise of our field, we need precise, quantitative definitions and measures of intelligence – in particular human-like general intelligence. These would not be merely definitions and measures meant to describe or characterize intelligence, but precise, explanatory definitions meant to serve as a North Star, an objective function showing the way towards a clear target, capable of acting as a reliable measure of our progress and as a way to identify and highlight worthwhile new approaches that may not be immediately applicable, and would otherwise be discounted.
For instance, common-sense dictionary definitions of intelligence may be useful to make sure we are talking about the same concepts, but they are not useful for our purpose, as they are not actionable, explanatory, or measurable. Similarly, the Turing Test [91] and its many variants (e.g. Total Turing Test and Loebner Prize [75]) are not useful as a driver of progress (and have in fact served as a red herring), since such tests completely opt out of objectively defining and measuring intelligence, and instead outsource the task to unreliable human judges who themselves do not have clear definitions or evaluation protocols.
It is a testimony to the immaturity of our field that the question of what we mean when we talk about intelligence still doesn’t have a satisfying answer. What’s worse, very little attention has been devoted to rigorously defining it or benchmarking our progress towards it. Legg and Hutter noted in a 2007 survey of intelligence definitions and evaluation methods [53]: “to the best of our knowledge, no general survey of tests and definitions has been published”. A decade later, in 2017, Hernandez-Orallo released an extensive survey of evaluation methods [36] as well as a comprehensive book on AI evaluation [37]. Results and recommendations from both of these efforts have since been largely ignored by the community.
We believe this lack of attention is a mistake, as the absence of widely-accepted explicit definitions has been substituted with implicit definitions and biases that stretch back decades. Though invisible, these biases are still structuring many research efforts today, as illustrated by our field’s ongoing fascination with outperforming humans at board games or video games (a trend we discuss in I.3.5 and II.1). The goal of this document is to point out the implicit assumptions our field has been working from, correct some of its most salient biases, and provide an actionable formal definition and measurement benchmark for human-like general intelligence, leveraging modern insight from developmental cognitive psychology.
I.2 Defining intelligence: two divergent visions
Looked at in one way, everyone knows what intelligence is; looked at in another way, no one does.
Robert J. Sternberg, 2000
Many formal and informal definitions of intelligence have been proposed over the past few decades, although there is no existing scientific consensus around any single definition. Sternberg & Detterman noted in 1986 [87] that when two dozen prominent psychologists were asked to define intelligence, they all gave somewhat divergent answers. In the context of AI research, Legg and Hutter [53] summarized in 2007 no fewer than 70 definitions from the literature into a single statement: “Intelligence measures an agent’s ability to achieve goals in a wide range of environments.”
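Legg and Hutter went on to formalize this one-line summary as a “universal intelligence” measure. A sketch of their definition follows (reconstructed in standard notation; treat the exact symbols as indicative of [53] rather than authoritative):

```latex
% Universal intelligence of an agent \pi: expected value achieved
% across all computable environments \mu \in E, weighted by simplicity
% (2^{-K(\mu)}, where K is the Kolmogorov complexity of \mu).
\Upsilon(\pi) = \sum_{\mu \in E} 2^{-K(\mu)} \, V_\mu^\pi
```

The value term $V_\mu^\pi$ captures skill at achieving goals, while the simplicity-weighted sum over environments $E$ captures breadth; the two parts map directly onto the two characterizations discussed next.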
This summary points to two characterizations, which are nearly universally – but often separately – found in definitions of intelligence: one with an emphasis on task-specific skill (“achieving goals”), and one focused on generality and adaptation (“in a wide range of environments”). In this view, an intelligent agent would achieve high skill across many different tasks (for instance, achieving high scores across many different video games). Implicitly here, the tasks may not necessarily be known in advance: to truly achieve generality, the agent would have to be able to learn to handle new tasks (skill acquisition).
These two characterizations map to Cattell’s 1971 theory of fluid and crystallized intelligence (Gf-Gc) [13], which has become one of the pillars of the dominant theory of human cognitive abilities, the Cattell-Horn-Carroll theory (CHC) [62]. They also relate closely to two opposing views of the nature of the human mind that have been deeply influential in cognitive science since the inception of the field [85]: one view in which the mind is a relatively static assembly of special-purpose mechanisms developed by evolution, only capable of learning what it is programmed to acquire, and another view in which the mind is a general-purpose “blank slate” capable of turning arbitrary experience into knowledge and skills, and that could be directed at any problem.
A central point of this document is to make explicit and critically assess this dual definition that has been implicitly at the foundation of how we have been conceptualizing and evaluating intelligence in the context of AI research: crystallized skill on one hand, skill-acquisition ability on the other. Understanding this intellectual context and its ongoing influence is a necessary step before we can propose a formal definition of intelligence from a modern perspective.
I.2.1 Intelligence as a collection of task-specific skills
In the distant future I see open fields for far more important researches. Psychology will be based on a new foundation, that of the necessary acquirement of each mental power and capacity by gradation.
Charles Darwin, 1859
The evolutionary psychology view of human nature is that much of the human cognitive function is the result of special-purpose adaptations that arose to solve specific problems encountered by humans throughout their evolution (see e.g. [19, 74]) – an idea which originated with Darwin [21] and that coalesced in the 1960s and 1970s. Around the same time that these ideas were gaining prominence in cognitive psychology, early AI researchers, perhaps seeing in electronic computers an analogue of the mind, mainly gravitated towards a view of intelligence as a set of static program-like routines, heavily relying on logical operators, and storing learned knowledge in a database-like memory.
This vision of the mind as a wide collection of vertical, relatively static programs that collectively implement “intelligence”, was most prominently endorsed by influential AI pioneer Marvin Minsky (see e.g. The Society of Mind, 1986 [63]). This view gave rise to definitions of intelligence and evaluation protocols for intelligence that are focused on task-specific performance. This is perhaps best illustrated by Minsky’s 1968 definition of AI: “AI is the science of making machines capable of performing tasks that would require intelligence if done by humans”. It was then widely accepted within the AI community that the “problem of intelligence” would be solved if only we could encode human skills into formal rules and encode human knowledge into explicit databases.
This view of intelligence was once so dominant that “learning” (discounted as pure memorization) was often not even mentioned at all in AI textbooks until the mid-1980s. Even McCarthy, a rare advocate for generality in AI, believed that the key to achieving generality was better knowledge bases [60]. This definition and evaluation philosophy focused entirely on skill at narrow tasks normally handled by humans has led to a striking paradox, as pointed out by Hernandez-Orallo [36] in his 2017 survey: the field of artificial intelligence has been very successful in developing artificial systems that perform these tasks without featuring intelligence, a trend that continues to this day.
I.2.2 Intelligence as a general learning ability
Presumably the child brain is something like a notebook as one buys it from the stationer’s. Rather little mechanism, and lots of blank sheets.
Alan Turing, 1950
In contrast, a number of researchers have taken the position that intelligence lies in the general ability to acquire new skills through learning; an ability that could be directed to a wide range of previously unknown problems – perhaps even any problem at all. Contrast Minsky’s task-focused definition of AI with the following one, paraphrased from McCarthy [60] by Hernandez-Orallo: “AI is the science and engineering of making machines do tasks they have never seen and have not been prepared for beforehand” [36].
The notion that machines could acquire new skills through a learning process similar to that of human children was initially laid out by Turing in his 1950 paper [91]. In 1958, Friedberg noted astutely: “If we are ever to make a machine that will speak, understand or translate human languages, solve mathematical problems with imagination, practice a profession or direct an organization, either we must reduce these activities to a science so exact that we can tell a machine precisely how to go about doing them or we must develop a machine that can do things without being told precisely how” [26]. But although the idea of generality through learning was given significant consideration at the birth of the field, and has long been championed by pioneers like McCarthy and Papert, it lay largely dormant until the resurgence of machine learning in the 1980s.
This view of intelligence echoes another long-standing conception of human nature that has had a profound influence on the history of cognitive science, contrasting with the evolutionary psychology perspective: Locke’s Tabula Rasa (blank slate), a vision of the mind as a flexible, adaptable, highly general process that turns experience into behavior, knowledge, and skills. This conception of the human mind can be traced back to Aristotle (De Anima, c. 350BC, perhaps the first treatise of psychology [3]), was embraced and popularized by Enlightenment thinkers such as Hobbes [42], Locke [56], and Rousseau [78]. It has more recently found renewed vitality within cognitive psychology (e.g. [79]) and in AI via connectionism (e.g. [41]).
With the resurgence of machine learning in the 1980s, its rise to intellectual dominance in the 2000s, and its peak as an intellectual quasi-monopoly in AI in the late 2010s via Deep Learning, a connectionist-inspired Tabula Rasa is increasingly becoming the dominant philosophical framework in which AI research is taking place. Many researchers are implicitly conceptualizing the mind via the metaphor of a “randomly initialized neural network” that starts blank and that derives its skills from “training data” – a cognitive fallacy that echoes early AI researchers a few decades prior who conceptualized the mind as a kind of mainframe computer equipped with clever subroutines. We see the world through the lens of the tools we are most familiar with.
Today, it is increasingly apparent that both of these views of the nature of human intelligence – either a collection of special-purpose programs or a general-purpose Tabula Rasa – are likely incorrect, which we discuss in II.1.3, along with implications for artificial intelligence.
I.3 AI evaluation: from measuring skills to measuring broad abilities
These two conceptualizations of intelligence – along with many other intermediate views combining elements from each side – have influenced a host of approaches for evaluating intelligence in machines, in humans, and more rarely in both at the same time, which we discuss below. Note that this document is not meant as an extensive survey of AI evaluation methods - for such a survey, we recommend Hernandez-Orallo 2017 [37]. Other notable previous surveys include Cohen and Howe 1988 [69] and Legg and Hutter 2007 [53].
I.3.1 Skill-based, narrow AI evaluation
In apparent accordance with Minsky’s goal for AI, the major successes of the field have been in building special-purpose systems capable of handling narrow, well-described tasks, sometimes at above human-level performance. This success has been driven by performance measures quantifying the skill of a system at a given task (e.g. how well an AI plays chess, how well an image classifier recognizes cats from dogs). There is no single, formalized way to do skill-based evaluation. Historically successful approaches include:
• Peer confrontation: having the system compete against either other AIs or humans. This is the preferred mode of evaluation for player-versus-player games, such as chess.
• Benchmarks: having the system produce outputs for a “test set” of inputs (or environments) for which the desired outcome is known, and score the response.
Benchmarks in particular have been a major driver of progress in AI, because they are reproducible (the test set is fixed), fair (the test set is the same for everyone), scalable (it is inexpensive to run the evaluation many times), easy to set up, and flexible enough to be applicable to a wide range of possible tasks. Benchmarks have often been most impactful in the context of a competition between different research teams, such as the ILSVRC challenge for large-scale image recognition (ImageNet) [22] or the DARPA Grand Challenge for autonomous driving [11]. A number of private and community-led initiatives have been started on the premise that such benchmark-based competitions speed up progress (e.g. Kaggle (kaggle.com), as well as academic alternatives such as ChaLearn (chalearn.org), the Hutter prize, etc.), while some government organizations use competitions to deliberately trigger technological breakthroughs (e.g. DARPA, NIST).
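The mechanics of benchmark-style, skill-based evaluation can be made concrete with a minimal sketch (the task, names, and data here are invented for illustration, not drawn from any specific benchmark): a fixed test set makes the measurement reproducible, and scoring every system against the same set makes it fair.

```python
def evaluate(system, test_set):
    """Score a system by the fraction of test inputs it answers correctly.

    The test set is fixed, so the evaluation is reproducible and cheap
    to rerun - the properties that made benchmarks a driver of progress.
    """
    correct = sum(1 for inputs, expected in test_set if system(inputs) == expected)
    return correct / len(test_set)

# A toy "task": report whether an integer is even.
test_set = [(0, True), (1, False), (2, True), (7, False)]

skilled_system = lambda n: n % 2 == 0   # actually handles the task
unskilled_system = lambda n: True       # always answers "even"

print(evaluate(skilled_system, test_set))    # 1.0
print(evaluate(unskilled_system, test_set))  # 0.5
```

Note that this score quantifies skill at the fixed task only; as argued below, it says nothing about how the system arrived at that skill.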
These successes demonstrate the importance of setting clear goals and adopting objective measures of performance that are shared across the research community. However, optimizing for a single metric or set of metrics often leads to tradeoffs and shortcuts when it comes to everything that isn’t being measured and optimized for (a well-known effect on Kaggle, where winning models are often overly specialized for the specific benchmark they won and cannot be deployed on real-world versions of the underlying problem). In the case of AI, the focus on achieving task-specific performance while placing no conditions on how the system arrives at this performance has led to systems that, despite performing the target tasks well, largely do not feature the sort of human intelligence that the field of AI set out to build.
This has been interpreted by McCorduck as an “AI effect” where goalposts move every time progress in AI is made: “every time somebody figured out how to make a computer do something - play good checkers, solve simple but relatively informal problems - there was a chorus of critics to say, ‘that’s not thinking’” [61]. Similarly, Reed notes: “When we know how a machine does something ‘intelligent’, it ceases to be regarded as intelligent. If I beat the world’s chess champion, I’d be regarded as highly bright.” [77]. This interpretation arises from overly anthropocentric assumptions. As humans, we can only display high skill at a specific task if we have the ability to efficiently acquire skills in general, which corresponds to intelligence as characterized in II. No one is born knowing chess, or predisposed specifically for playing chess. Thus, if a human plays chess at a high level, we can safely assume that this person is intelligent, because we implicitly know that they had to use their general intelligence to acquire this specific skill over their lifetime, which reflects their general ability to acquire many other possible skills in the same way. But the same assumption does not apply to a non-human system that does not arrive at competence the way humans do. If intelligence lies in the process of acquiring skills, then there is no task X such that skill at X demonstrates intelligence, unless X is actually a meta-task involving skill-acquisition across a broad range of tasks. The “AI effect” characterization is confusing the process of intelligence (such as the intelligence displayed by researchers creating a chess-playing program) with the artifact produced by this process (the resulting chess-playing program), due to these two concepts being fundamentally intertwined in the case of humans. We discuss this further in II.1.
Task-specific performance is a perfectly appropriate and effective measure of success if and only if handling the task as initially specified is the end goal of the system – in other words, if our measure of performance captures exactly what we expect of the system. However, it is deficient if we need systems that can show autonomy in handling situations that the system creator did not plan for, that can dynamically adapt to changes in the task – or in the context of the task – without further human intervention, or that can be repurposed for other tasks. Meanwhile, robustness and flexibility are increasingly being perceived as important requirements for certain broader subfields of AI, such as L5 self-driving, domestic robotics, or personal assistants; there is even increasing interest in generality itself (e.g. developmental robotics [4], artificial general intelligence [28]). This points to a need to move beyond skill-based evaluation for such endeavours, and to find ways to evaluate robustness and flexibility, especially in a cross-task setting, up to generality. But what do we really mean when we talk about robustness, flexibility, and generality?
I.3.2 The spectrum of generalization: robustness, flexibility, generality
Even though such machines might do some things as well as we do them, or perhaps even better, they would inevitably fail in others, which would reveal they were acting not through understanding, but only from the disposition of their organs.
René Descartes, 1637
The resurgence of machine learning in the 1980s has led to an interest in formally defining, measuring, and maximizing generalization. Generalization is a concept that predates machine learning, originally developed to characterize how well a statistical model performs on inputs that were not part of its training data. In recent years, the success of Deep Learning [52], as well as increasingly frequent run-ins with its limitations (see e.g. [51, 16, 59]), have triggered renewed interest in generalization theory in the context of machine learning (see e.g. [102, 67, 45, 70, 17, 49]). The notion of generalization can be formally defined in various contexts (in particular, statistical learning theory [92] provides a widely-used formal definition that is relevant for machine learning, and we provide a more general formalization in II.2). We can informally define “generalization” or “generalization power” for any AI system to broadly mean “the ability to handle situations (or tasks) that differ from previously encountered situations”.
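The statistical-learning-theory notion referenced above can be stated compactly. As a sketch in standard notation (not quoted verbatim from [92]): for a hypothesis h, data distribution D, loss function ℓ, and a training sample S of n points,

```latex
% Expected risk: performance on the full data distribution.
R(h) = \mathbb{E}_{(x,y) \sim D}\left[\ell(h(x), y)\right]
% Empirical risk: performance on the n training points in S.
\hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)
% Generalization gap: how much performance degrades off the training set.
R(h) - \hat{R}_S(h)
```

In these terms, a system generalizes well when R(h) stays low even though the system only ever observed S; this is the narrow, single-task sense of the word that the distinctions below broaden.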
The notion of “previously encountered situation” is somewhat ambiguous, so we should distinguish between two types of generalization:
• System-centric generalization: this is the ability of a learning system to handle situations it has not itself encountered before. The formal notion of generalization error in statistical learning theory would belong here.
• Developer-aware generalization: this is the ability of a system, either learning or static, to handle situations that neither the system nor the developer of the system have encountered before.
In addition, we find it useful to qualitatively define degrees of generalization for informationprocessing systems:
• Absence of generalization: The notion of generalization as we have informally defined above fundamentally relies on the related notions of novelty and uncertainty: a system can only generalize to novel information that could not be known in advance to either the system or its creator. AI systems in which there is no uncertainty do not display generalization. For instance, a program that plays tic-tac-toe via exhaustive iteration cannot be said to “generalize” to all board configurations. Likewise, a sorting algorithm that is proven to be correct cannot be said to “generalize” to all lists of integers, much like proven mathematical statements cannot be said to “generalize” to all objects that match the assumptions of their proof.
• Local generalization, or “robustness”: This is the ability of a system to handle new points from a known distribution for a single task or a well-scoped set of known tasks, given a sufficiently dense sampling of examples from the distribution (e.g. tolerance to anticipated perturbations within a fixed context). For instance, an image classifier that can distinguish previously unseen 150x150 RGB images containing cats from those containing dogs, after being trained on many such labeled images, can be said to perform local generalization. One could characterize it as “adaptation to known unknowns within a single task or well-defined set of tasks”. This is the form of generalization that machine learning has been concerned with from the 1950s up to this day.
• Broad generalization, or “flexibility”: This is the ability of a system to handle a broad category of tasks and environments without further human intervention. This includes the ability to handle situations that could not have been foreseen by the creators of the system. This could be considered to reflect human-level ability in a single broad activity domain (e.g. household tasks, driving in the real world), and could be characterized as “adaptation to unknown unknowns across a broad category of related tasks”. For instance, a L5 self-driving vehicle, or a domestic robot capable of passing Wozniak’s coffee cup test (entering a random kitchen and making a cup of coffee) [99] could be said to display broad generalization. Arguably, even the most advanced AI systems today do not belong in this category, although there is increasing research interest in achieving this level.
• Extreme generalization: This describes open-ended systems with the ability to handle entirely new tasks that only share abstract commonalities with previously encountered situations, applicable to any task and domain within a wide scope. This could be characterized as “adaptation to unknown unknowns across an unknown range of tasks and domains”. Biological forms of intelligence (humans and possibly other intelligent species) are the only example of such a system at this time. A version of extreme generalization that is of particular interest to us throughout this document is human-centric extreme generalization, which is the specific case where the scope considered is the space of tasks and domains that fit within the human experience. We will refer to “human-centric extreme generalization” as “generality”. Importantly, as we deliberately define generality here by using human cognition as a reference frame (which we discuss in II.1.2), it is only “general” in a limited sense. Do note, however, that humans display extreme generalization both in terms of system-centric generalization (quick adaptability to highly novel situations from little experience) and developer-aware generalization (ability of contemporary humans to handle situations that previous humans have never experienced during their evolutionary history).
To this list, we could, theoretically, add one more entry: “universality”, which would extend “generality” beyond the scope of task domains relevant to humans, to any task that could be practically tackled within our universe (note that this is different from “any task at all” as understood in the assumptions of the No Free Lunch theorem [98, 97]). We discuss in II.1.2 why we do not consider universality to be a reasonable goal for AI.
Crucially, the history of AI has been one of slowly climbing up this spectrum, starting with systems that largely did not display generalization (symbolic AI), and evolving towards robust systems (machine learning) capable of local generalization. We are now entering a new stage, where we seek to create flexible systems capable of broad generalization (e.g. hybrid symbolic and machine learning systems such as self-driving vehicles, AI assistants, or cognitive developmental robots). Skill-focused task-specific evaluation has been appropriate for close-ended systems that aim at robustness in environments that only feature known unknowns, but developing systems that are capable of handling unknown unknowns requires evaluating their abilities in a general sense.
Importantly, the spectrum of generalization outlined above seems to mirror the organization of human cognitive abilities as laid out by theories of the structure of intelligence in cognitive psychology. Major theories of the structure of human intelligence (CHC [62], g-VPR [48]) all organize cognitive abilities in a hierarchical fashion (figure 1), with three strata (in CHC): general intelligence (g factor) at the top, broad abilities in the middle, and specialized skills or test tasks at the bottom (this extends to 4 strata for g-VPR, which splits broad abilities into two layers), albeit the taxonomy of abilities differs between theories. Here, “extreme generalization” corresponds to the g factor, “broad generalization” across a given domain corresponds to a broad cognitive ability, and “local generalization” (as well as the no-generalization case) corresponds to task-specific skill.
Measuring such broad abilities (and possibly generality itself) rather than specific skills has historically been the central problem of the field of psychometrics. Could psychometrics inform the evaluation of abilities in AI systems?

Figure 1: Hierarchical model of cognitive abilities and its mapping to the spectrum of generalization.
Note that, in what follows:
• We use “broad abilities” to refer to cognitive abilities that lead to broad or extreme generalization. Developing such abilities should be the goal of any researcher interested in flexible AI or general AI.
I.3.3 Measuring broad abilities and general intelligence: the psychometrics perspective
It seems to us that in intelligence there is a fundamental faculty, the alteration or the lack of which, is of the utmost importance for practical life. This faculty is [...] the faculty of adapting one’s self to circumstances.
Alfred Binet, 1916
In the early days of the 20th century, Binet and Simon, looking for a formal way to distinguish children with mental disabilities from those with behavior problems, developed the Binet-Simon scale [8], the first test of intelligence, founding the field of psychometrics. Immediately after, Spearman observed that individual results across different, seemingly unrelated types of intelligence tests were correlated, and hypothesized the existence of a single factor of general intelligence, the g factor [83, 84]. Today, psychometrics is a well-established subfield of psychology that has arrived at some of the most reproducible results of the field. Modern intelligence tests are developed by following strict standards regarding reliability (low measurement error, a notion tied to reproducibility), validity (measuring what one purports to be measuring, a notion tied to statistical consistency and predictiveness), standardization, and freedom from bias – see e.g. Classical Test Theory (CTT) [20] and Item Response Theory (IRT) [34].
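To make the reliability standard concrete, here is a minimal sketch of one classical internal-consistency estimate from CTT, Cronbach's alpha (the score matrix is invented illustration data, not results from any real test):

```python
def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(scores):
    """Internal-consistency reliability estimate:
    one row per test-taker, one column per test item."""
    k = len(scores[0])                               # number of items
    item_vars = sum(variance(col) for col in zip(*scores))
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Four hypothetical test-takers, three items; the test-takers rank
# consistently across items, so estimated reliability is high.
scores = [
    [1, 2, 1],
    [2, 3, 2],
    [4, 4, 5],
    [5, 5, 4],
]
print(round(cronbach_alpha(scores), 3))  # 0.957
```

An alpha close to 1 indicates low measurement error across items, which is the sense of "reliability" used above.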
A fundamental notion in psychometrics is that intelligence tests evaluate broad cognitive abilities as opposed to task-specific skills. Theories of the structure of intelligence (such as CHC, g-VPR), which have co-evolved with psychometric testing (statistical phenomena emerging from test results have informed these theories, and these theories have informed test design), organize these abilities in a hierarchical fashion (figure 1), rather similarly to the spectrum of generalization we presented earlier. Importantly, an ability is an abstract construct (based on theory and statistical phenomena) as opposed to a directly measurable, objective property of an individual mind, such as a score on a specific test. Broad abilities in AI, which are also constructs, face exactly the same evaluation problems as cognitive abilities from psychometrics. Psychometrics approaches the quantification of abilities by using broad batteries of test tasks rather than any single task, and by analysing test results via probabilistic models. Importantly, the tasks should be previously unknown to the test-taker, i.e., we assume that test-takers do not practice for intelligence tests. This approach is highly relevant to AI evaluation.
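One family of such probabilistic models is Item Response Theory [34]; its simplest form, the one-parameter (Rasch) model, can be sketched as follows (the ability and difficulty values are hypothetical):

```python
import math

def rasch_p_correct(theta, b):
    """Rasch (one-parameter IRT) model: probability that a test-taker
    of ability theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A battery of items of increasing difficulty, and one hypothetical
# test-taker. An ability estimate is fit against responses across the
# whole battery, never read off a single task.
difficulties = [-1.0, 0.0, 1.0, 2.0]   # easy -> hard
theta = 0.5
probs = [round(rasch_p_correct(theta, b), 2) for b in difficulties]
print(probs)  # [0.82, 0.62, 0.38, 0.18]
```

Fitting theta across many items is what allows an abstract construct (an ability) to be estimated from directly observable scores.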
Remarkably, in a parallel to psychometrics, there has been recent and increasing interest across the field of AI in using broad batteries of test tasks to evaluate systems that aim at greater flexibility. Examples include the Arcade Learning Environment for Reinforcement Learning agents [6], Project Malmö [71], the Behavior Suite [68], or the GLUE [95] and SuperGLUE [94] benchmarks for natural language processing. The underlying logic of these efforts is to measure something more general than skill at one specific task by broadening the set of target tasks. However, when it comes to assessing flexibility, a critical defect of these multi-task benchmarks is that the set of tasks is still known in advance to the developers of any test-taking system, and it is fully expected that test-taking systems will be able to practice specifically for the target tasks, leverage task-specific built-in prior knowledge inherited from the system developers, leverage external knowledge obtained via pre-training, etc. As such, these benchmarks still appear to be highly gameable (see e.g. II.1.1) – merely widening task-specific skill evaluation to more tasks does not produce a qualitatively different kind of evaluation. Such benchmarks are still looking at skills, rather than abilities, in contrast with the psychometrics approach (this is not to say that such benchmarks are not useful; merely that such static multi-task benchmarks do not directly assess flexibility or generality).
In addition to these multi-task benchmarks, a number of more ambitious test suites for cognitive abilities of AI have been proposed in the past but have not been implemented in practice: the Newell test by Anderson and Lebiere ([2], named in reference to [66]), the BICA “cognitive decathlon” targeted at developmental robotics [65], the Turing Olympics [27], and the I-Athlon [1]. Lacking concrete implementations, it is difficult to assess whether these projects would have been able to address the ability evaluation problem they set out to solve. On the other hand, two similarly-spirited but more mature test suites have emerged recently, focused on generalization capabilities as opposed to specific tasks: the Animal-AI Olympics [7] (animalaiolympics.com) and the GVGAI competition [72] (gvgai.net). Both take the position that AI agents should be evaluated on an unseen set of tasks or games, in order to test learning or planning abilities rather than special-purpose skill. Both feature a multi-game environment and an ongoing public competition.
I.3.4 Integrating AI evaluation and psychometrics
Besides efforts to broaden task-specific evaluation to batteries of multi-task tests, there have been more direct and explicit attempts to integrate AI evaluation and psychometrics. A first approach is to reuse existing psychometric intelligence tests, initially developed for humans, as a way to assess intelligence in AI systems – perhaps an obvious idea if we are to take the term “artificial intelligence” literally. This idea was first proposed by Green in 1964 [29], and was, around the same time, explored by Evans [24], who wrote a LISP program called ANALOGY capable of solving a geometric analogy task of the kind that may be found in a psychometric intelligence test. Newell suggested the idea again in 1973 [66] in his seminal paper You can’t play 20 questions with Nature and win. It was proposed again and refined by Bringsjord et al. in the 2000s under the name “Psychometric AI” (PAI) [9]. However, it has since become apparent that it is possible for AI system developers to game human intelligence tests, because the tasks used in these tests are available to the system developers, and thus the developers can straightforwardly solve the abstract form of these problems themselves and hard-code the solution in program form (see, for instance, [23, 80, 44]), much like Evans did in the 1960s with the ANALOGY program. Effectively, in this case, it is the system developers who are solving the test problems, rather than any AI. The implicit assumptions that psychometric test designers make about human test-takers turn out to be difficult to enforce in the case of machines.
An alternative, more promising approach is to leverage what psychometrics can teach us about ability assessment and test design to create new types of benchmarks targeted specifically at evaluating broad abilities in AI systems. Along these lines, Hernandez-Orallo et al. have proposed extending psychometric evaluation to any intelligent system, including AI agents and animals, in “Universal Psychometrics” [39].
We argue that several important principles of psychometrics can inform intelligence evaluation in AI in the context of the development of broad AI and general AI:
• Measuring abilities (representative of broad generalization and skill-acquisition efficiency), not skills. Abilities are distinct from skills in that they induce broad generalization, i.e. they form the basis for skill across a broad range of tasks, including tasks that were previously unknown to the ability-enabled system and its developers.
• Doing so via batteries of tasks rather than any single task, all of which should be previously unknown to both the test-taking system and the system developers (this is necessary to assess broad generalization as opposed to skill or local generalization).
• Having explicit standards regarding reliability, validity, standardization, and freedom from bias:
For instance, a test of intelligence designed for both humans and AI should not leverage uniquely human acquired knowledge, and should not involve constraints unrelated to intelligence within which machines have unfair advantages (such as fast reaction times), etc.
Simultaneously, we argue that certain other aspects of psychometrics may be discarded in the development of new intelligence tests for AI:
• The exact number and taxonomy of cognitive abilities considered, being a subject of ongoing debate within cognitive psychology and being perhaps overly anthropocentric, should not be used as a strict template for artificial cognitive architectures and their evaluation. Existing taxonomies may at best serve as a source of inspiration.
• A number of abilities assessed by psychometric intelligence tests are crystallized abilities (e.g. reading and writing), i.e. abilities that are acquired through experience, which are not clearly distinguishable from skills (they are effectively multi-purpose skills). We argue that AI tests that seek to assess flexibility and generality should not consider crystallized abilities, but rather should focus on abilities that enable new skill acquisition. If a system possesses abilities that enable efficient skill acquisition in a domain, the system should have no issue in developing corresponding skills and crystallized abilities.
I.3.5 Current trends in broad AI evaluation
Despite a rising interest in building flexible systems, or even in generality itself, for the most part the AI community has not been paying much attention to psychometric evaluation, Psychometric AI, or Universal Psychometrics. If we are to assess the contemporary zeitgeist of broad AI evaluation, here is what we see.
First, we note several positive developments. Since 2017, there has been increasing awareness that one should seek to establish some form of generalization in evaluating Reinforcement Learning (RL) algorithms (e.g. [50, 70, 17, 49]), which was previously a stark problem [76, 35, 101, 70], as RL agents had long been tested on their training data. Further, there is increasing interest in evaluating the data-efficiency of learning algorithms (e.g. [10]), in particular in the context of RL for games such as Atari games or Minecraft (e.g. [71, 33]). Lastly, as noted in I.3.3, there has been a trend towards leveraging multi-task benchmarks as a way to assess robustness and flexibility (e.g. [6, 71, 68, 95, 94]).
Unfortunately, we must also note several negatives. The robustness of the systems being developed, in particular Deep Learning models, is often problematic (see e.g. [16, 59]). This is due in large part to the fact that most benchmarks do not pay much attention to formally assessing robustness and quantifying generalization, and thus can be solved via “shortcuts” that gradient descent is adept at exploiting (e.g. surface statistics such as textures in the case of computer vision [46]). Likewise, the reproducibility (reliability) of research findings is often an issue [73], especially in Reinforcement Learning, although some progress has been made on this front.
Most importantly, the evaluation of any ability that goes decisively beyond local generalization is still largely a greenfield, and little effort has been devoted to investigating it. Hernandez-Orallo noted in 2017 that “ability-oriented and general-purpose evaluation approaches [...] are still very incipient, and more research and discussion is needed” [36]. Recent attempts at broadening task-specific benchmarks by including multiple tasks do not measure developer-aware generalization, as the tasks are all known in advance to system developers (as noted in I.3.3). Attempts at assessing generalization by testing RL systems on previously unseen game levels, like CoinRun [17] or Obstacle Tower [49], are still only looking at task-specific local generalization, by evaluating a candidate system on new samples from a known distribution rather than using a substantially new task (as suggested in III.3). In addition, the fact that the level-generation programs used are available to the AI developers means it is possible to “cheat” on these benchmarks by sampling arbitrary amounts of training data (cf. II.1.1).
Further, contemporary research “moonshots” that are publicly advertised as being steps towards general intelligence appear to still be focusing on skill-based task-specific evaluation for board games and video games (e.g. Go [82, 81] and StarCraft [93] for DeepMind, DotA2 [89] for OpenAI) via highly-mediatized confrontations with top human players. Despite claims of progress towards general AI in associated public communications, such evaluation does not involve any measure of generalization power, and has little-to-no overlap with the development of flexibility and generality, as we outline in II.1. For example, although OpenAI’s DotA2-playing AI “Five” was trained on 45,000 years of play and was able to beat top human players [89], it has proven very brittle, as non-champion human players were able to find strategies to reliably beat it in a matter of days after the AI was made available for the public to play against [90]. In addition, Five did not even generalize to DotA2 in the first place: it could only play a restricted version of the game, with 16 characters instead of over 100. Likewise, AlphaGo and its successor AlphaZero, developed in 2016 and 2017, have not yet found any application outside of board games, to the best of our knowledge.
We deplore this discrepancy between a focus on surpassing humans at tests of skill on one hand (while entirely disregarding whether the methods through which skill is achieved are generalizable), and a manifest interest in developing broad abilities on the other hand – an endeavour entirely orthogonal to skill itself. We hypothesize that this discrepancy is due to a lack of a clear conceptualization of intelligence, skill, and generalization, as well as a lack of appropriate measures and benchmarks for broad cognitive abilities. In what follows, we expose in more detail the issue with using task-specific “moonshots” (e.g. achieving better-than-human performance in a video game or board game) as stepping stones towards more general forms of AI, and we propose a formal definition of intelligence meant to be actionable in the pursuit of flexible AI and general AI.
II A new perspective
II.1 Critical assessment
II.1.1 Measuring the right thing: evaluating skill alone does not move us forward
In 1973, psychologist and computer science pioneer Allen Newell, worried that recent advances in cognitive psychology were not bringing the field any closer to a holistic theory of cognition, published his seminal paper You can’t play 20 questions with nature and win [66], which helped focus research efforts on cognitive architecture modelling, and provided new impetus to the longstanding quest to build a chess-playing AI that would outperform any human. Twenty-four years later, in 1997, IBM’s DeepBlue beat Garry Kasparov, the best chess player in the world, bringing this quest to an end [12]. When the dust settled, researchers were left with the realization that building an artificial chess champion had not actually taught them much, if anything, about human cognition. They had learned how to build a chess-playing AI, and neither this knowledge nor the AI they had built could generalize to anything other than similar board games.
It may be obvious from a modern perspective that a static chess-playing program based on minimax and tree search would not be informative about human intelligence, nor competitive with humans in anything other than chess. But it was not obvious in the 1970s, when chess-playing was thought by many to capture, and require, the entire scope of rational human thought. Perhaps less obvious in 2019 is that efforts to “solve” complex video games using modern machine learning methods still follow the same pattern. Newell wrote [66]: “we know already from existing work [psychological studies on humans] that the task [chess] involves forms of reasoning and search and complex perceptual and memorial processes. For more general considerations we know that it also involves planning, evaluation, means-ends analysis and redefinition of the situation, as well as several varieties of learning – short-term, post-hoc analysis, preparatory analysis, study from books, etc.”. The assumption was that solving chess would require implementing these general abilities. Chess does indeed involve these abilities – in humans. But while possessing these general abilities makes it possible to solve chess (and many more problems), by going from the general to the specific, inversely, there is no clear path from the specific to the general. Chess does not require any of these abilities, and can be solved by taking radical shortcuts that run orthogonal to human cognition.
Optimizing for single-purpose performance is useful and valid if one’s measure of success can capture exactly what one seeks (as we outlined in I.3.1), e.g. if one’s end goal is a chess-playing machine and nothing more. But from the moment the objective is settled, the process of developing a solution will be prone to taking all shortcuts available to satisfy the objective of choice – whether this process is gradient descent or human-driven research. These shortcuts often come with undesirable side-effects when it comes to considerations not incorporated in the measure of performance. If the environment in which the system is to operate is too unpredictable for an all-encompassing objective function to be defined beforehand (e.g. most real-world applications of robotics, where systems face unknown unknowns), or if one aims at a general-purpose AI that could be applied to a wide range of problems with no or little human engineering, then one must somehow optimize directly for flexibility and generality, rather than solely for performance on any specific task.
This is, perhaps, a widely-accepted view today when it comes to static programs that hard-code a human-designed solution. When a human engineer implements a chatbot by specifying answers for each possible query via if/else statements, we do not assume this chatbot to be intelligent, and we do not expect it to generalize beyond the engineer’s specifications. Likewise, if an engineer looks at a specific IQ test task, comes up with a solution, and writes down this solution in program form, we do not expect the program to generalize to new tasks, and we do not believe that the program displays intelligence – the only intelligence at work here is the engineer’s. The program merely encodes the crystallized output of the engineer’s thought process – it is this process, not its output, that implements intelligence. Intelligence is not demonstrated by the performance of the output program (a skill), but by the fact that the same process can be applied to a vast range of previously unknown problems (a general-purpose ability): the engineer’s mind is capable of extreme generalization. Since the resulting program is merely encoding the output of that process, it is no more intelligent than the ink and paper used to write down the proof of a theorem.
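The if/else chatbot described above can be sketched in a few lines (the queries and canned replies are invented for illustration):

```python
def chatbot(query):
    """A hard-coded chatbot: every behavior is the crystallized
    output of its developer's thought process."""
    q = query.lower().strip(" ?!.'")
    if q == "hello":
        return "Hi there!"
    elif q == "what is your name":
        return "I am a chatbot."
    elif q == "bye":
        return "Goodbye!"
    # Anything outside the developer's specification fails:
    # the program itself has no generalization power.
    return "I don't understand."

print(chatbot("Hello!"))               # "Hi there!" (anticipated by the developer)
print(chatbot("What's the weather?"))  # "I don't understand." (not anticipated)
```

The only intelligence involved resides in the engineer who enumerated the cases, not in the program that encodes them.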
However, what of a program that is not hard-coded by humans, but trained from data to perform a task? A learning machine certainly may be intelligent: learning is a necessary condition to adapt to new information and acquire new skills. But being programmed through exposure to data is no guarantee of generalization or intelligence. Hard-coding prior knowledge into an AI is not the only way to artificially “buy” performance on the target task without inducing any generalization power. There is another way: adding more training data, which can augment skill in a specific vertical or task without affecting generalization whatsoever.
Information processing systems form a spectrum between two extremes: on one end, static systems that consist entirely of hard-coded priors (such as DeepBlue or our if/else chatbot example), and on the opposite end, systems that incorporate very few priors and are almost entirely programmed via exposure to data (such as a hashtable or a densely-connected neural network). Most intelligent systems, including humans and animals, combine ample amounts of both priors and experience, as we point out in II.1.3. Crucially, the ability to generalize is an axis that runs orthogonal to the prior/experience plane. Given a learning system capable of achieving a certain level of generalization, modifying the system by incorporating more priors or more training data about the task can lead to greater task-specific performance without affecting generalization. In this case, both priors and experience serve as a way to “game” any given test of skill without having to display the sort of general-purpose abilities that humans would rely on to acquire the same skill.
信息处理系统构成一个介于两个极端之间的连续谱:一端是完全由硬编码先验(如DeepBlue或我们的if/else聊天机器人示例)组成的静态系统,另一端是几乎完全通过数据训练编程、包含极少先验的系统(如哈希表或全连接神经网络)。如II.1.3节所述,包括人类和动物在内的大多数智能系统都结合了丰富的先验与经验。关键在于,泛化能力是与先验/经验平面正交的维度。对于一个能达到特定泛化水平的学习系统,通过融入更多先验或任务相关训练数据来改进系统,可以在不影响泛化能力的前提下提升特定任务性能。这种情况下,先验和经验都成为"应对"特定技能测试的手段,而无需展现人类掌握该技能所需的通用能力。
This can be readily demonstrated with a simple example: consider a hashtable that uses a locality-sensitive hash function (e.g. nearest neighbor) to map new inputs to previously seen inputs. Such a system implements a learning algorithm capable of local generalization, the extent of which is fixed (independent of the amount of data seen), determined only by the abstraction capabilities of the hash function. This system, despite only featuring trace amounts of generalization power, is already sufficient to “solve” any task for which unlimited training data can be generated, such as any video game. All that one has to do is obtain a dense sampling of the space of situations that needs to be covered, and associate each situation with an appropriate action vector.
通过一个简单例子可以直观说明:假设有一个哈希表使用局部敏感哈希函数(如最近邻算法)将新输入映射到已见过的输入。这种系统实现了具备局部泛化能力的学习算法,其泛化范围固定(与数据量无关),仅由哈希函数的抽象能力决定。该系统尽管只具备微量泛化能力,却已足以"解决"任何能生成无限训练数据的任务(例如电子游戏)。只需对需要覆盖的情境空间进行密集采样,并将每个情境与相应动作向量关联即可。
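The hashtable argument can be made concrete with a short sketch (ours, not from the paper): a nearest-neighbor lookup table that "learns" a toy task purely through dense sampling. Its generalization is strictly local and does not grow with more data; only the density of coverage grows.

```python
import numpy as np

class NearestNeighborTable:
    """A 'learner' that memorizes (situation, action) pairs and answers any
    new situation with the action of the closest stored situation."""

    def __init__(self):
        self.situations = []  # stored input vectors
        self.actions = []     # associated action vectors

    def train(self, situation, action):
        self.situations.append(np.asarray(situation, dtype=float))
        self.actions.append(action)

    def respond(self, situation):
        # Local generalization only: the extent of generalization is fixed by
        # the distance function, independent of how much data has been seen.
        dists = [np.linalg.norm(s - np.asarray(situation, dtype=float))
                 for s in self.situations]
        return self.actions[int(np.argmin(dists))]

# A dense sampling of a toy 1-D situation space is enough to "solve" the task.
table = NearestNeighborTable()
for x in np.linspace(0.0, 1.0, 101):
    table.train([x], "left" if x < 0.5 else "right")

print(table.respond([0.49]))  # prints "left"
```

With unlimited training data, such a table can reach arbitrary skill on the sampled task, which is precisely why skill alone reveals nothing about generalization power.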
Adding ever more data to a local-generalization learning system is certainly a fair strategy if one’s end goal is skill on the task considered, but it will not lead to generalization beyond the data the system has seen (the resulting system is still very brittle, e.g. Deep Learning models such as OpenAI Five), and crucially, developing such systems does not teach us anything about achieving flexibility and generality. “Solving” any given task with beyond-human level performance by leveraging either unlimited priors or unlimited data does not bring us any closer to broad AI or general AI, whether the task is chess, football, or any e-sport.
如果最终目标是在特定任务上获得技能,那么向局部泛化学习系统添加更多数据无疑是一种合理策略,但这不会使系统产生超越所见数据的泛化能力(由此产生的系统仍然非常脆弱,例如OpenAI Five等深度学习模型)。更重要的是,开发这类系统并不能帮助我们理解如何实现灵活性与通用性。无论是国际象棋、足球还是任何电子竞技,通过利用无限先验知识或无限数据来"解决"某项任务并达到超人类水平,都不会让我们更接近广义AI或通用人工智能。
Current evidence (e.g. [51, 46, 16, 59, 50]) points to the fact that contemporary Deep Learning models are local-generalization systems, conceptually similar to a locality-sensitive hashtable – they may be trained to achieve arbitrary levels of skill at any task, but doing so requires a dense sampling of the input-cross-target space considered (as outlined in [16]), which is impractical to obtain for high-value real-world applications, such as L5 self-driving (e.g. [5] notes that 30 million training situations is not enough for a Deep Learning model to learn to drive a car in a plain supervised setting). Hypothetically, it may be shown in the future that methods derived from Deep Learning could be capable of stronger forms of generalization, but demonstrating this cannot be done merely by achieving high skill, such as beating humans at DotA2 or Starcraft given unlimited data or unlimited engineering; instead, one should seek to precisely establish and quantify the generalization strength of such systems (e.g. by considering prior-efficiency and data-efficiency in skill acquisition, as well as the developer-aware generalization difficulty of the tasks considered). A central point of this document is to provide a formal framework for doing so (II.2 and II.3). Failing to account for priors, experience, and generalization difficulty in our evaluation methods will prevent our field from climbing higher along the spectrum of generalization (I.3.2) and from eventually reaching general AI.
现有证据 (如 [51, 46, 16, 59, 50]) 表明,当代深度学习模型是局部泛化系统,概念上类似于局部敏感哈希表——它们可以通过训练在任何任务上达到任意技能水平,但需要密集采样所考虑的输入-目标交叉空间 (如 [16] 所述)。对于高价值的现实应用 (如 L5 级自动驾驶),这种密集采样是不切实际的 (例如 [5] 指出,在普通监督学习场景下,即使 3000 万训练样本也不足以让深度学习模型学会驾驶汽车)。假设未来可能证明基于深度学习的方法能够实现更强的泛化形式,但仅通过实现高超技能 (如在无限数据或工程资源条件下在 DotA2 或星际争霸中击败人类) 并不能证明这一点;相反,应该精确建立和量化这类系统的泛化强度 (例如通过考量技能获取的先验效率和数据效率,以及开发人员感知的任务泛化难度)。本文档的核心目标是为此提供形式化框架 (II.2 和 II.3)。若评估方法不考虑先验、经验和泛化难度,将阻碍本领域沿着泛化谱系向上攀登 (I.3.2),最终无法实现通用人工智能。
In summary, the hallmark of broad abilities (including general intelligence, as per II.1.2) is the power to adapt to change, acquire skills, and solve previously unseen problems – not skill itself, which is merely the crystallized output of the process of intelligence. Testing for skill at a task that is known in advance to system developers (as is the current trend in general AI research) can be gamed without displaying intelligence, in two ways: 1) unlimited prior knowledge, 2) unlimited training data. To actually assess broad abilities, and thus make progress toward flexible AI and eventually general AI, it is imperative that we control for priors, experience, and generalization difficulty in our evaluation methods, in a rigorous and quantitative way.
总之,广泛能力的标志(包括II.1.2节所述的通用智能)在于适应变化、获取技能以及解决前所未见问题的能力——而非技能本身,技能仅是智能过程的具体化产物。针对开发者预先已知任务的技能测试(当前通用AI研究的趋势)可以在不展现智能的情况下通过两种方式被“应对”:1) 无限先验知识,2) 无限训练数据。要真正评估广泛能力,从而迈向灵活AI乃至通用AI,我们必须以严谨量化的方式,在评估方法中控制先验知识、经验积累和泛化难度。
II.1.2 The meaning of generality: grounding the g factor
II.1.2 通用性的意义:g因子的基础
It is a well-known fact of cognitive psychology that different individuals demonstrate different cognitive abilities to varying degrees, albeit results across all tests of intelligence are correlated. This points to cognition being a multi-dimensional object, structured in a hierarchical fashion (figure 1), with a single generality factor at the top, the g factor. But is “general intelligence” the apex of the cognitive pyramid in an absolute sense (as is sometimes assumed by proponents of “Artificial General Intelligence”), or is it merely a broader cognitive ability, one that would remain fairly specialized, and wouldn’t be qualitatively distinct from other abilities lower down the hierarchy? How general is human intelligence?
认知心理学中一个众所周知的事实是,不同个体在不同程度上展现出不同的认知能力,尽管所有智力测试结果之间存在相关性。这表明认知是一个多层次结构的多维对象(图1),其顶端存在一个通用性因子——g因子。但"通用智能"究竟是认知金字塔在绝对意义上的顶峰(正如"通用人工智能"支持者有时认为的那样),还是仅仅一种更广泛的认知能力,这种能力仍相当专一,与层次结构中较低的其他能力并无质的区别?人类智能的通用性究竟有多高?
The No Free Lunch theorem [98, 97] teaches us that any two optimization algorithms (including human intelligence) are equivalent when their performance is averaged across every possible problem, i.e. algorithms should be tailored to their target problem in order to achieve better-than-random performance. However, what is meant in this context by “every possible problem” refers to a uniform distribution over problem space; the distribution of tasks that would be practically relevant to our universe (which, due to its choice of laws of physics, is a specialized environment) would not fit this definition. Thus we may ask: is the human g factor universal? Would it generalize to every possible task in the universe?
无免费午餐定理 [98, 97] 告诉我们,任何两种优化算法(包括人类智能)在对所有可能问题取平均后的性能都是等价的,即算法必须针对目标问题进行专门设计,才能获得优于随机的性能。然而,这里所说的“所有可能问题”指的是问题空间上的均匀分布;而与我们的宇宙实际相关的任务分布(由于物理定律的特定选择,宇宙是一个特化环境)并不符合这一定义。因此我们可以问:人类的g因子是否具有普适性?它能否泛化到宇宙中所有可能的任务?
This is a question that is largely irrelevant for psychometrics, because as a subfield of psychology, it makes the implicit assumption that it is concerned solely with humans and the human experience. But this question is highly relevant when it comes to AI: if there is such a thing as universal intelligence, and if human intelligence is an implementation of it, then this algorithm of universal intelligence should be the end goal of our field, and reverse-engineering the human brain could be the shortest path to reach it. It would make our field close-ended: a riddle to be solved. If, on the other hand, human intelligence is a broad but ad-hoc cognitive ability that generalizes to human-relevant tasks but not much else, this implies that AI is an open-ended, fundamentally anthropocentric pursuit, tied to a specific scope of applicability. This has implications for how we should measure it (by using human intelligence and human tasks as a reference) and for the research strategies we should follow to achieve it.
这个问题在心理测量学中基本无关紧要,因为作为心理学的子领域,它隐含地假设仅关注人类及其经验。但对于人工智能而言,该问题至关重要:若存在通用智能 (universal intelligence) ,且人类智能是其一种实现形式,那么这种通用智能算法应成为我们领域的终极目标,而逆向工程人脑可能是实现它的最短路径。这将使我们的领域成为封闭式命题——一个待解的谜题。反之,若人类智能是一种广泛但临时的认知能力,仅适用于人类相关任务而难以泛化,则意味着人工智能是开放式的、本质上以人类为中心的探索,受限于特定适用范围。这不仅影响我们应如何衡量它(以人类智能和人类任务为参照) ,也决定了实现它所需的研究策略。
The g factor, by definition, represents the single cognitive ability common to success across all intelligence tests, emerging from applying factor analysis to test results across a diversity of tests and individuals. But intelligence tests, by construction, only encompass tasks that humans can perform – tasks that are immediately recognizable and understandable by humans (anthropocentric bias), since including tasks that humans couldn’t perform would be pointless. Further, psychometrics establishes measurement validity by demonstrating predictiveness with regard to activities that humans value (e.g. scholastic success): the very idea of a “valid” measure of intelligence only makes sense within the frame of reference of human values.
按照定义,g因子代表在所有智力测试中取得成功所共有的单一认知能力,它源于对多种测试和众多个体的测试结果进行因素分析。但智力测试的设计仅涵盖人类能够执行的任务——那些人类能立即识别和理解的任务(人类中心偏见),因为包含人类无法完成的任务毫无意义。此外,心理测量学通过证明对人类重视活动(如学业成功)的预测性来确立测量效度:“有效”智力测量的概念仅在人类价值观的参考框架内才有意义。
In fact, the interpretation of what specific abilities make someone “intelligent” varies from culture to culture [100, 86, 18]. More broadly, humans have historically had a poor track record when it comes to attributing intelligence to complex information-processing agents around them, whether looking at humans from other cultures or at animals (such as octopuses, dolphins, great apes, etc.). We only reluctantly open up to the possibility that systems different from ourselves may be “intelligent” if they display relatable humanlike behaviors that we associate with intelligence, such as language or tool use; behaviors that have high intrinsic complexity and high adaptability but that are not directly relatable (such as octopus camouflage) are not perceived as intelligent. This observation extends to collective entities (e.g. markets, companies, Science as an institution) and natural processes (e.g. biological evolution). Although they can be modeled as standalone systems whose abilities and behavior match broadly accepted definitions of intelligence (achieving goals across a wide range of environments, demonstrating flexibility and adaptability, etc.), we do not categorize these systems as intelligent, simply because they aren’t sufficiently humanlike.
事实上,关于哪些具体能力使人具备"智能"的解读因文化而异 [100, 86, 18]。更广泛地说,人类在判断周围复杂信息处理主体是否具有智能方面历来表现不佳——无论是看待其他文化中的人类,还是动物(如章鱼、海豚、类人猿等)。只有当异于人类的系统表现出我们与智能相关联的拟人行为(如语言或工具使用)时,我们才会勉强承认其可能具有"智能";而那些具有高度内在复杂性和适应性却无法直接类比的行为(如章鱼伪装),则不被视为智能表现。这一认知现象同样适用于集体实体(如市场、企业、科学共同体)和自然过程(如生物进化)。尽管这些系统可以被建模为符合广泛接受的智能定义(在多环境中实现目标、展现灵活性与适应性等)的独立系统,我们仍拒绝将其归类为智能系统,仅仅因为它们不够"像人"。
To use a well-known cross-domain analogy [25]: much like “intelligence”, the notion of “physical fitness” (as it pertains to sports and other physical activities) is an intuitively understandable, informal, yet useful concept. Like intelligence, fitness is not easily reducible to any single factor (such as a person’s age or muscle mass), rather, it seems to emerge from a constellation of interdependent factors. If we sought to rigorously measure physical fitness in humans, we would come up with a set of diverse tests such as running a $100\mathrm{m}$, running a marathon, swimming, doing sit-ups, doing basketball throws, etc., not unlike IQ test suites. Across test results, we would observe clusters of correlations, corresponding to broad “physical abilities” strictly analogous to cognitive abilities (e.g. lung capacity might be such an “ability” inducing correlations across tests). Much like in the case of cognitive abilities, experts would probably disagree and debate as to the exact taxonomy of these broad abilities (is being “tall and lean” an ability, or is “tallness” a standalone factor?). And crucially, we should intuitively expect to find that all test results would be correlated: we would observe a physical g factor, corresponding to the general intuitive construct of “physical fitness”.
用一个著名的跨领域类比[25]来说:"体能"(physical fitness)这个概念(涉及运动和其他身体活动)就像"智力"一样,是一种直观易懂、非正式但有用的概念。与智力类似,体能很难简化为任何单一因素(如一个人的年龄或肌肉量),而是似乎从一系列相互依赖的因素中涌现出来。如果我们试图严格测量人类的体能,我们会设计一组多样化的测试,比如跑100米、跑马拉松、游泳、做仰卧起坐、投篮等,这与智商测试套件并无二致。在各项测试结果中,我们会观察到相关性集群,对应着与认知能力严格类比的广泛"身体能力"(例如肺活量可能就是这样一个在各项测试中产生相关性的"能力")。与认知能力的情况非常相似,专家们可能会对这些广泛能力的确切分类存在分歧和争论("高瘦"是一种能力,还是"身高"是一个独立因素?)。最关键的是,我们凭直觉就能预期所有测试结果都会相互关联:我们会观察到一个体能g因子,对应着"体能"这个普遍直观概念。
But would this mean that human morphology and motor affordances are “general” in an absolute sense, and that a very fit person could handle any physical task at all? Certainly not; we are not adapted for the large majority of environments that can be found in the universe – from the Earth’s oceans to the surface of Venus, from the atmosphere of Jupiter to interstellar space. It is, however, striking and remarkable that human physical abilities generalize to a far greater range of environments and tasks than the limited set of environments and activities that guided their evolution. To caricature, human bodies evolved for running in the East-African savanna, yet they are capable of climbing Mount Everest, swimming across lakes, skydiving, playing basketball, etc. This is not a coincidence; by necessity, evolution optimizes for adaptability, whether cognitive adaptability or sensorimotor adaptability. Human physical capabilities can thus be said to be “general”, but only in a limited sense; when taking a broader view, humans reveal themselves to be extremely specialized, which is to be expected given the process through which they evolved.
但这是否意味着人类的形态和运动能力在绝对意义上是"通用"的,一个体能极佳的人可以完成任何身体任务?显然不是;我们并不适应宇宙中存在的大多数环境——从地球海洋到金星表面,从木星大气层到星际空间。然而令人惊叹的是,人类的体能可以泛化到远比其进化环境广阔得多的领域。夸张地说,人类身体是为东非大草原奔跑而进化的,却能够攀登珠穆朗玛峰、横渡湖泊、跳伞、打篮球等。这并非巧合:进化必然要优化适应能力,无论是认知适应还是感知运动适应。因此可以说人类体能是"通用"的,但仅限于相对意义;从更宏观视角看,人类实际上高度特化,这正是进化过程的必然结果。
We argue that human cognition follows strictly the same pattern as human physical capabilities: both emerged as evolutionary solutions to specific problems in specific environments (commonly known as “the four Fs”). Both were, importantly, optimized for adaptability, and as a result they turn out to be applicable for a surprisingly greater range of tasks and environments beyond those that guided their evolution (e.g. piano-playing, solving linear algebra problems, or swimming across the Channel) – a remarkable fact that should be of the utmost interest to anyone interested in engineering broad or general-purpose abilities of any kind. Both are multi-dimensional concepts that can be modeled as a hierarchy of broad abilities leading up to a “general” factor at the top. And crucially, both are still ultimately highly specialized (which should be unsurprising given the context of their development): much like human bodies are unfit for the quasi-totality of the universe by volume, human intellect is not adapted for the large majority of conceivable tasks.
我们认为人类认知与身体能力遵循完全相同的模式:二者都是针对特定环境中具体问题演化出的解决方案(即俗称的"四F法则")。关键在于,它们都为实现适应性进行了优化,因此能惊人地适用于远超其演化初衷的任务和环境(例如弹钢琴、解线性代数题或横渡英吉利海峡)——这一非凡事实对任何希望设计通用型能力的人士都具有极高参考价值。二者都是多维概念,可建模为由基础能力层级递进至顶端"通用"因子的结构。更重要的是,二者本质上仍具有高度专属性(考虑到其发展背景,这并不意外):正如人类身体无法适应宇宙中绝大多数空间环境,人类智力同样不适合处理可想象任务中的绝大部分。
This includes obvious categories of problems such as those requiring long-term planning beyond a few years, or requiring large working memory (e.g. multiplying 10-digit numbers). This also includes problems for which our innate cognitive priors are unadapted; for instance, humans can be highly efficient in solving certain NP-hard problems of small size when these problems present cognitive overlap with evolutionarily familiar tasks such as navigation (e.g. the Euclidean Traveling Salesman Problem (TSP) with low point count can be solved by humans near-optimally in near-linear time [58], using perceptual strategies), but perform poorly – often no better than random search – for problem instances of very large size or problems with less cognitive overlap with evolutionarily familiar tasks (e.g. certain non-Euclidean problems). For instance, in the TSP, human performance degrades severely when inverting the goal from “finding the shortest path” to “finding the longest path” [57] – humans perform even worse in this case than one of the simplest possible heuristics: farthest neighbor construction.6
这包括一些明显的问题类别,例如需要超过数年的长期规划,或需要较大工作记忆的任务(例如计算10位数的乘法)。同样也涵盖那些与我们先天认知先验不适配的问题;例如,人类在解决某些小规模NP难问题时可能非常高效,尤其是当这些问题与进化过程中熟悉的任务(如导航)存在认知重叠时(例如,点数较少的欧几里得旅行商问题(TSP)可以通过感知策略由人类在近线性最优时间内近乎最优地解决[58]),但对于规模极大或与进化熟悉任务认知重叠较少的问题实例(例如某些非欧几里得问题),人类表现往往不佳——甚至不比随机搜索更好。例如在TSP中,当目标从"寻找最短路径"反转为"寻找最长路径"时,人类表现会急剧下降[57]——此时人类的表现甚至比最简单的启发式算法之一"最远邻点构造"更差。6
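For illustration, the two greedy construction heuristics at play here (nearest-neighbor for the "shortest path" goal, farthest-neighbor for the inverted "longest path" goal) can be sketched as follows; the point set and function names are our own toy example, not taken from [57, 58].

```python
import math

def tour_length(points, order):
    """Total length of the closed tour visiting points in the given order."""
    return sum(math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def greedy_tour(points, pick_farthest=False):
    """Greedy construction: from the current city, always move to the nearest
    unvisited city -- or the farthest one, for the 'longest path' goal."""
    unvisited = set(range(1, len(points)))
    tour = [0]
    while unvisited:
        key = lambda j: math.dist(points[tour[-1]], points[j])
        nxt = (max if pick_farthest else min)(unvisited, key=key)
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour

points = [(0, 0), (0, 1), (1, 1), (1, 0), (0.5, 0.5)]
short = greedy_tour(points)                      # nearest-neighbor construction
long_ = greedy_tour(points, pick_farthest=True)  # farthest-neighbor construction
print(tour_length(points, short), tour_length(points, long_))
```

The farthest-neighbor rule is about as simple as a heuristic gets, which is what makes the human result on the inverted goal so striking.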
A particularly marked human bias is dimensional bias: humans show excellent performance on 2D navigation tasks and 2D shape-packing puzzles, and can still handle 3D cases albeit with greatly reduced performance, but they are effectively unable to handle 4D and higher. This fact is perhaps unsurprising given human reliance on perceptual strategies for problem-solving – strategies which are backed by neural mechanisms specifically evolved for 2D navigation (hippocampal systems of place cells and grid cells [64]).
一个特别显著的人类认知偏差是维度偏差:人类在2D导航任务和2D形状拼图任务中表现出色,虽然性能大幅下降但仍能处理3D情况,但基本上无法处理4D及更高维度。考虑到人类依赖感知策略解决问题(这些策略由专门为2D导航进化的神经机制支撑,如海马体位置细胞与网格细胞系统 [64]),这一事实或许并不令人意外。
Thus, a central point of this document is that “general intelligence” is not a binary property which a system either possesses or lacks. It is a spectrum, tied to 1) a scope of application, which may be more or less broad, 2) the degree of efficiency with which the system translates its priors and experience into new skills over the scope considered, and 3) the degree of generalization difficulty represented by different points in the scope considered (see II.2). In addition, the “value” of one scope of application over another is entirely subjective; we wouldn’t be interested in (and wouldn’t even perceive as intelligent) a system whose scope of application had no intersection with our own.
因此,本文的核心观点在于,“通用智能”并非系统要么具备、要么缺失的二元属性。它是一个连续谱系,与以下三点紧密相关:1) 应用范围(可宽可窄);2) 系统在给定范围内将先验知识和经验转化为新技能的效率程度;3) 该范围内不同节点所体现的泛化难度等级(参见II.2节)。此外,不同应用范围之间的“价值”判断完全主观——对于与人类需求毫无交集的应用范围,我们既不会感兴趣,甚至不会将其视为智能系统。
As such, it is conceptually unsound to set “artificial general intelligence” in an absolute sense (i.e. “universal intelligence”) as a goal. To set out to build broad abilities of any kind, one must start from a target scope, and one must seek to achieve a well-defined intelligence threshold within this scope: AI is a deeply contextual and open-ended endeavour, not a single one-time riddle to be solved. However, it may in theory be possible to create human-like artificial intelligence: we may gradually build systems that extend across the same scope of applicability as human intelligence, and we may gradually increase their generalization power within this scope until it matches that of humans. We may even build systems with higher generalization power (as there is no a priori reason to assume human cognitive efficiency is an upper bound), or systems with a broader scope of application. Such systems would feature intelligence beyond that of humans.
因此,将"通用人工智能 (AGI)"绝对化地设定为目标(即"普适智能")在概念上是不合理的。要构建任何类型的广泛能力,必须从目标范围出发,并寻求在该范围内达到明确定义的智能阈值:人工智能是一项高度情境化且无止境的探索,而非一个需要一次性解决的谜题。然而从理论上讲,创造类人人工智能是可能的:我们可以逐步构建覆盖人类智能相同适用范围的系统,并在此范围内持续提升其泛化能力直至达到人类水平。我们甚至可能开发出具有更强泛化能力的系统(因为没有先验理由认定人类认知效率是上限),或是具备更广应用范围的系统。这类系统将展现出超越人类智能的特征。
In conclusion, we propose that research on developing broad abilities in AI systems (up to “general” AI, i.e. AI with a degree of generality comparable to human intelligence) should focus on defining, measuring, and developing a specifically human-like form of intelligence, and should benchmark progress specifically against human intelligence (which is itself highly specialized). This isn’t because we believe that intelligence that greatly differs from our own couldn’t exist or wouldn’t have value; rather, we recognize that characterizing and measuring intelligence is a process that must be tied to a well-defined scope of application, and at this time, the space of human-relevant tasks is the only scope that we can meaningfully approach and assess. We thus disagree with the perspective of Universal Psychometrics [39] or Legg and Hutter’s Universal Intelligence [54], which reject anthropocentrism altogether and seek to measure all intelligence against a single absolute scale. An anthropocentric frame of reference is not only legitimate, it is necessary.
总之,我们建议关于开发广泛AI系统(直至"通用"AI,即具有与人类智能相当程度通用性的AI)的研究,应聚焦于定义、衡量和开发一种特定类人形态的智能,并以人类智能作为明确的基准来衡量进展(人类智能本身也是高度特化的)。这并非因为我们认为与人类智能迥异的其他智能形式不可能存在或没有价值,而是我们认识到描述和衡量智能的过程必须与明确定义的应用范围相关联。在现阶段,与人类相关的任务领域是我们唯一能够有效探索和评估的范围。因此,我们不同意通用心理测量学[39]或Legg与Hutter提出的通用智能[54]的观点,这些理论完全摒弃人类中心主义,试图用单一绝对尺度衡量所有智能。以人类为中心的参照框架不仅是合理的,更是必要的。
II.1.3 Separating the innate from the acquired: insights from developmental psychology
II.1.3 先天与后天的分离:发展心理学的启示
Advances in developmental psychology teach us that neither of the two opposing views of the nature of the mind described in I.2 are accurate (see e.g. [85]): the human mind is not merely a collection of special-purpose programs hard-coded by evolution; it is capable of a remarkable degree of generality and open-endedness, going far beyond the scope of environments and tasks that guided its evolution. The large majority of the skills and knowledge we possess are acquired during our lifetimes, rather than innate. Simultaneously, the mind is not a single, general-purpose “blank slate” system capable of learning anything from experience. Our cognition is specialized, shaped by evolution in specific ways; we are born with priors about ourselves, about the world, and about how to learn, which determine what categories of skills we can acquire and what categories of problems we can solve.
发展心理学的进展告诉我们,I.2节中描述的两种对立心智本质观点都不准确(如[85]所示):人类心智并非仅是进化编码的专用程序集合;它具有显著程度的通用性和开放性,远超指导其进化的环境与任务范围。我们所拥有的大部分技能和知识都是后天习得,而非与生俱来。同时,心智也非单一通用"白板"系统,能够从经验中学习任何事物。我们的认知具有特异性,以特定方式被进化塑造;我们生来就带有关于自身、世界及学习方式的先验知识,这些决定了我们能够获得哪些技能类别以及解决哪些问题类别。
These priors are not a limitation to our generalization capabilities; to the contrary, they are their source, the reason why humans are capable of acquiring certain categories of skills with remarkable efficiency. The central message of the No Free Lunch theorem [98] is that to learn from data, one must make assumptions about it – the nature and structure of the innate assumptions made by the human mind are precisely what confers to it its powerful learning abilities.
这些先验知识并非对我们泛化能力的限制;恰恰相反,它们是能力的源泉,是人类能够以惊人效率掌握某些技能类别的根本原因。无免费午餐定理 [98] 的核心观点是:要从数据中学习,就必须对其做出假设——人类心智与生俱来的假设性质与结构,正是赋予其强大学习能力的本质所在。
We noted in II.1.1 that an actionable measure of intelligence should, crucially, control for priors and experience. We proposed in II.1.2 that evaluating general intelligence should leverage human intelligence as a necessary frame of reference. It follows that we need a clear understanding of human cognitive priors in order to fairly evaluate general intelligence between humans and machines.
我们在II.1.1节中指出,一个可操作的智能衡量标准关键在于控制先验和经验。在II.1.2节中我们提出,评估通用智能应以人类智能作为必要的参照框架。因此,为了公平评估人类与机器之间的通用智能,我们需要清晰理解人类的认知先验。
Human cognitive priors come in multiple forms, in particular 7:
人类认知先验以多种形式存在,特别包括7:
When it comes to creating artificial human-like intelligence, low-level sensor i motor priors are too specific to be of interest (unless one seeks to build an artificial human body). While human meta-learning priors should be of the utmost interest (understanding the strategies that the brain follows to turn experience into knowledge and skills is effectively our end goal), these priors are not relevant to evaluating intelligence: they are intelligence, rather than a third-party modulating factor to be controlled for. They are part of the black box that we seek to characterize.
在创造类人人工智能时,低级感知运动先验 (sensorimotor priors) 过于特定而缺乏普适价值(除非旨在构建人工人体)。虽然人类元学习先验 (meta-learning priors) 最值得关注(理解大脑将经验转化为知识和技能的策略本质上就是我们的终极目标),但这些先验与智能评估无关:它们本身就是智能,而非需要控制的第三方调节因素。这些先验正是我们试图解析的黑箱 (black box) 的组成部分。
It is knowledge priors that should be accounted for when measuring a human-like form of intelligence. A system that does not possess human innate knowledge priors would be at a critical disadvantage compared to humans when it comes to efficiently turning a given experience curriculum into skill at a given human task. Inversely, a system that has access to more extensive hard-coded knowledge about the task at hand could not be fairly compared to human intelligence – as we noted in II.1.1, unlimited priors allow system developers to “buy” unbounded performance on any given task, with no implications with regard to generalization abilities (what we are actually trying to achieve).
在衡量类人智能形式时,必须考虑知识先验 (priors) 因素。一个缺乏人类先天知识先验的系统,在将既定经验课程高效转化为特定人类任务技能方面,会处于关键劣势。反之,若系统拥有远超人类的手工编码任务知识(如II.1.1节所述),则无法公平地评估其智能水平——无限先验允许开发者通过"购买"方式获得任意任务的无界性能,但这与真正的目标(即泛化能力)毫无关联。
Therefore, we propose that an actionable test of human-like general intelligence should be founded on innate human knowledge priors:
因此,我们提出一个可操作的人类通用智能测试应基于人类先天知识先验:
• The priors should be made as close as possible to innate human knowledge priors as we understand them. As our understanding of human knowledge priors improves over time, so should the test evolve. • The test should assume that the system being measured possesses a specific set of priors. AI systems with more extensive priors should not be benchmarked using such a test. AI systems with fewer priors should be understood to be at a disadvantage.
• 先验知识应尽可能接近我们所理解的人类先天知识先验。随着对人类知识先验理解的不断深入,测试标准也应相应演进。
• 测试应假设被测系统具备特定先验知识集。拥有更广泛先验知识的AI系统不应使用此类测试作为基准。先验知识较少的AI系统应被视为处于劣势。
This leads us to a central question: what is the exact list of knowledge priors that humans are born with? This is the question that the developmental science theory of Core Knowledge [85] seeks to answer. Core Knowledge identifies four broad categories of innate assumptions that form the foundations of human cognition, and which are largely shared by our non-human relatives 8:
这引出了一个核心问题:人类与生俱来的知识先验具体包含哪些内容?这正是核心知识理论 [85] 试图解答的问题。该理论将构成人类认知基础的先天假设归纳为四大类别,这些假设在很大程度上也被人类的近亲物种所共有8:
While cognitive developmental psychology has not yet determined with a high degree of certainty the exact set of innate priors that humans possess, we consider the Core Knowledge theory to offer a credible foundation suitable to the needs of a test of human-like general intelligence. We therefore propose that an actionable test of general intelligence that would be fair for both humans and machines should only feature tasks that assume the four core knowledge systems listed above, and should not involve any acquired knowledge outside of these priors. We also argue, in agreement with [51], that general AI systems should hard-code as fundamental priors these core knowledge principles.
虽然认知发展心理学尚未高度确定人类所拥有的先天先验知识的确切集合,但我们认为核心知识理论为测试类人通用智能提供了一个可信的基础。因此,我们提出一个可操作的通用智能测试,该测试对人类和机器都公平,应仅包含基于上述四种核心知识系统的任务,而不涉及这些先验知识之外的任何习得知识。此外,我们同意[51]的观点,认为通用人工智能系统应将这些核心知识原则硬编码为基本先验。
II.2 Defining intelligence: a formal synthesis
II.2 定义智能:形式化综合
II.2.1 Intelligence as skill-acquisition efficiency
II.2.1 作为技能获取效率的智能
So far, we have introduced the following informally-described intuitions:
到目前为止,我们已经非正式地介绍了以下直观概念:
Let us now formalize these intuitions. In what follows, we provide a series of definitions for key concepts necessary to ground a formal definition of intelligence and its measure. We will leverage the tools of Algorithmic Information Theory. These definitions lead up to a formal way of expressing the following central idea:
现在让我们将这些直觉形式化。接下来,我们将为构建智能及其度量形式化定义所需的关键概念提供一系列定义。我们将利用算法信息论的工具。这些定义最终将导向表达以下核心思想的正式方式:
The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.
系统的智能性是对其在任务范围内技能获取效率的衡量,涉及先验知识、经验及泛化难度。
Intuitively, if you consider two systems that start from a similar set of knowledge priors, and that go through a similar amount of experience (e.g. practice time) with respect to a set of tasks not known in advance, the system with higher intelligence is the one that ends up with greater skills (i.e. the one that has turned its priors and experience into skill more efficiently). This definition of intelligence encompasses meta-learning priors, memory, and fluid intelligence. It is distinct from skill itself: skill is merely the output of the process of intelligence.
直观地说,如果两个系统从相似的知识先验出发,并在未知任务集上经历相近的经验量(例如练习时长),那么智能更高的系统最终会获得更强的技能(即能更高效地将先验和经验转化为技能)。这种智能定义涵盖了元学习先验、记忆和流体智力。它与技能本身不同:技能仅是智能过程的输出产物。
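As a toy illustration of this intuition (ours, not the paper's formalism): two learners receive the exact same experience on a task, and the one that converts that experience into greater held-out skill is, under this definition, the more intelligent of the two.

```python
# Both learners receive the exact same experience: four samples of y = 2*x.
train = [(x, 2.0 * x) for x in [0.0, 1.0, 2.0, 3.0]]
test = [(x, 2.0 * x) for x in [10.0, 25.0]]  # unseen inputs, far from training

# Learner A: nearest-neighbor memorization -- local generalization only.
def learner_a(x):
    xs, ys = zip(*train)
    return ys[min(range(len(xs)), key=lambda i: abs(xs[i] - x))]

# Learner B: fits a slope through the origin from the same four samples.
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
def learner_b(x):
    return slope * x

def skill(learner):
    # Negative mean absolute error on unseen inputs, as a crude skill score.
    return -sum(abs(learner(x) - y) for x, y in test) / len(test)

# Same priors, same experience: B converts them into far greater skill.
print(skill(learner_a), skill(learner_b))
```

The point of the sketch is the comparison, not the learners themselves: with priors and experience held fixed, the difference in resulting skill isolates skill-acquisition efficiency.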
Before we start, let us emphasize that many possible definitions of intelligence may be valid, across many different contexts, and we do not purport that the definition above and the formalism below represent the “one true” definition. Nor is our definition meant to achieve broad consensus. Rather, the purpose of our definition is to be actionable, to serve as a useful perspective shift for research on broad cognitive abilities, and to function as a quantitative foundation for new general intelligence benchmarks, such as the one we propose in part III. As per George Box’s aphorism, “all models are wrong, but some are useful”: our only aim here is to provide a useful North Star towards flexible and general AI. We discuss in II.2.3 the concrete ways in which our formalism is useful and actionable.
在开始之前,我们需要强调,智力的定义可能有很多种,且在不同情境下都有效。我们并不认为上述定义及下文的形式化表述是"唯一正确"的定义,也不追求达成广泛共识。相反,我们定义的目的在于可操作性,为广泛认知能力研究提供有益视角转变,并为新型通用智能基准(如第三部分提出的方案)建立量化基础。正如George Box所言"所有模型都是错的,但有些是有用的":我们唯一的目标是为实现灵活通用的人工智能提供一个有用的指引。在II.2.3节中,我们将具体讨论这套形式化体系的可操作性与实用价值。
Position of the problem
问题定位
First, we must introduce basic definitions to establish our problem setup. It should be immediately clear to the reader that our choice of problem setup is sufficient to model Fully-Supervised Learning, Partially-Supervised Learning, and Reinforcement Learning.
首先,我们需要引入基本定义来建立问题设定。读者应当立即明确,我们所选的问题设定足以建模全监督学习 (Fully-Supervised Learning)、部分监督学习 (Partially-Supervised Learning) 和强化学习 (Reinforcement Learning)。
We consider the interaction between a “task” and an “intelligent system”. This interaction is mediated by a “skill program” (generated by the intelligent system) and a “scoring function” (part of the task).
我们考虑"任务"与"智能系统"之间的交互。这种交互通过"技能程序"(由智能系统生成)和"评分函数"(任务的一部分)作为中介。
We implicitly consider the existence of a fixed universal Turing machine on which our programs run (including the skill programs, as well as programs part of the task and part of the intelligent system). We also assume the existence of a fixed “situation space” SituationSpace and “response space” ResponseSpace. Each of these spaces defines the set of binary strings that are allowed as input (and output, respectively) of all skill programs we will consider henceforth. They may be, for instance, the sensor space and the motor space of an animal or robot.
我们隐含地假设存在一个固定的通用图灵机,所有程序(包括技能程序、任务组成部分的程序以及智能系统组成部分的程序)都在其上运行。同时,我们假设存在固定的"情境空间" (SituationSpace) 和"响应空间" (ResponseSpace)。这些空间分别定义了后续所考虑的所有技能程序允许作为输入(及对应输出)的二进制字符串集合。例如,这些空间可以是动物或机器人的传感器空间与运动空间。

Figure 2: Position of the problem: an intelligent system generates a skill program to interact with a task.
图 2: 问题定位:智能系统生成技能程序以与任务交互。
A task $T$ consists of four objects:
任务 $T$ 包含四个对象:
• A “scoring function” $Scoring: [Situation, Response, TaskState] \rightarrow [Score, Feedback]$. It may be stochastic.
• "评分函数" Scoring $:$ [情境(Situation), 响应(Response), 任务状态(TaskState)] $\rightarrow$ [分数(Score), 反馈(Feedback)]。该函数可能是随机的。
• A self-update function $TaskUpdate: [Response, TaskState] \rightarrow TaskState$, which mutates the task state based on the response to the latest situation. It may be stochastic.
• 自更新函数 $TaskUpdate:[Response,TaskState]\rightarrow TaskState$,该函数根据对最新情况的响应来改变任务状态。它可能是随机的。
For instance, a game such as chess or WarCraft III (as well as what we call “task” in the ARC benchmark presented in Part III) would constitute a task. A given chess board position, screen frame in WarCraft III, or input grid in ARC, would constitute a situation.
An intelligent system $IS$ consists of three objects:

• A “skill program generation” function $SkillProgramGen: ISState \rightarrow [SkillProgram, SPState_{t=0}]$, which generates a new skill program (together with its initial state) based on the current system state. It may be stochastic.
– A skill program $SkillProgram: [Situation, SPState] \rightarrow [Response, SPState]$ is a function that maps an input situation to a valid response (part of $ResponseSpace$), potentially using some working memory ($SPState$). It may be stochastic. Because it possesses a state $SPState$ (a binary string), it may be used to autonomously handle a series of connected situations without further communication with the intelligent system that generated it.
– A skill program may be, for instance, any game-specific program capable of playing new levels in a given video game.
– In what follows, we refer to “skill program” as the combination of the $SkillProgram$ function and the initial skill program state $SPState$ (i.e. skill programs are considered stateful).
– A skill program represents a frozen version of the system’s task-specific capabilities (including the ability to adapt to novel situations within the task). We use the concept of skill program as a conceptual device to formalize the level of task-specific skill and task-specific generalization capabilities of an agent at a given point in time.
• A self-update function $ISUpdate: [Situation, Response, Feedback, ISState] \rightarrow ISState$, which mutates the system's state based on the latest situation and corresponding feedback. It may be stochastic.
• An “intelligent system state” $ISState$ (a binary string), which conditions the generation of skill programs.
For instance, a neural network generation and training algorithm for games would be an “intelligent system”, and the inference-mode game-specific network it would output at the end of a training run on one game would be a “skill program”. A program synthesis engine capable of looking at an ARC task and outputting a solution program would be an “intelligent system”, and the resulting solution program capable of handling future input grids for this task would be a “skill program”.
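For readability, the objects defined above can be transcribed as type signatures. This is a sketch of our own, in Python; the formalism itself only assumes binary strings running on a fixed universal Turing machine.

```python
from typing import Callable, Tuple

# Every object in the formalism is a binary string.
Situation = bytes   # element of SituationSpace
Response = bytes    # element of ResponseSpace
SPState = bytes     # skill-program working memory
TaskState = bytes
ISState = bytes     # intelligent-system state
Score = float
Feedback = bytes

# Task-side functions (each may be stochastic).
SituationGen = Callable[[TaskState], Situation]
Scoring = Callable[[Situation, Response, TaskState], Tuple[Score, Feedback]]
TaskUpdate = Callable[[Response, TaskState], TaskState]

# System-side functions (each may be stochastic).
SkillProgram = Callable[[Situation, SPState], Tuple[Response, SPState]]
SkillProgramGen = Callable[[ISState], Tuple[SkillProgram, SPState]]
ISUpdate = Callable[[Situation, Response, Feedback, ISState], ISState]
```

For instance, a trivial conforming skill program is `lambda situation, sp_state: (situation, sp_state)`, which echoes each situation back as its response.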
The interaction between task, intelligent system, and skill programs is structured in two phases: a training phase and an evaluation phase. The goal of the training phase is for the $IS$ to generate a high-skill skill program that will generalize to future evaluation situations. The goal of the evaluation phase is to assess the capability of this skill program to handle new situations.
The training phase consists of the repetition of the following steps (we note the current step as $t$). Before we start, we consider two separate initial task states, $trainTaskState_{t=0}$ and $testTaskState_{e=0}$.

• The task generates a new situation: $situation_t \leftarrow SituationGen(trainTaskState_t)$
• The intelligent system generates a new skill program: $[skillProgram_t, spState_t] \leftarrow SkillProgramGen(isState_t)$
• The skill program outputs a response to the situation: $[response_t, spState_{t+1}] \leftarrow skillProgram_t(situation_t, spState_t)$
  – $skillProgram_t$ is only called once, and $spState_{t+1}$ is discarded, since the skill program is generated anew by the intelligent system at each training step.
  – In practice, in partially-observable games where consecutive situations are very close to each other (e.g. two consecutive screen frames in WarCraft III), one may assume that $skillProgram_t$ and $skillProgram_{t+1}$ would not actually be generated independently from scratch and would stay very close to each other (i.e. the $IS$'s understanding of the task would evolve continuously in program space); $spState_{t+1}$ as generated by $skillProgram_t$ and $spState_{t+1}$ as generated by $SkillProgramGen$ at $t+1$ would likewise stay very close to each other.
• The task scoring function assigns a score to the response and generates a piece of feedback: $[score_t, feedback_t] \leftarrow Scoring(situation_t, response_t, trainTaskState_t)$
• The intelligent system updates its internal state based on the feedback: $isState_{t+1} \leftarrow ISUpdate(situation_t, response_t, feedback_t, isState_t)$
• The task updates its state based on the response: $trainTaskState_{t+1} \leftarrow TaskUpdate(response_t, trainTaskState_t)$
The training phase ends at the discretion of the $SituationGen$ function (e.g. $SituationGen$ returns a “STOP” situation), at which time $SkillProgramGen$ would generate its last skill program, including an initial state (initial working memory) meant to perform well during evaluation (e.g. blank).
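The full training loop can be sketched end-to-end with trivial stand-ins. Everything below is illustrative: the toy task asks the system to echo each situation back, and the toy "intelligent system" always emits the identity skill program.

```python
# Toy task: three situations, then STOP; score 1.0 for echoing the situation.
def situation_gen(task_state):
    return b"STOP" if task_state >= 3 else b"situation-%d" % task_state

def scoring(situation, response, task_state):
    return (1.0 if response == situation else 0.0), b"feedback"

def task_update(response, task_state):
    return task_state + 1

# Toy intelligent system: regenerates the identity skill program each step.
def skill_program_gen(is_state):
    return (lambda situation, sp_state: (situation, sp_state)), b""

def is_update(situation, response, feedback, is_state):
    return is_state + feedback

is_state, task_state, scores = b"", 0, []
while True:
    situation = situation_gen(task_state)
    if situation == b"STOP":
        break
    skill_program, sp_state = skill_program_gen(is_state)
    response, sp_state = skill_program(situation, sp_state)
    score, feedback = scoring(situation, response, task_state)
    is_state = is_update(situation, response, feedback, is_state)
    task_state = task_update(response, task_state)
    scores.append(score)

print(scores)  # [1.0, 1.0, 1.0]
```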
The evaluation phase is superficially similar to the training phase, with the differences that 1) the task starts from $testTaskState_{e=0}$ and consists of an independent series of situations, 2) it only involves a single fixed skill program $testSkillProgram$ starting with state $testSPState_{e=0}$. Crucially, it no longer involves the intelligent system. Note that $testTaskState_{e=0}$ could be chosen stochastically. For instance, different randomly chosen initial $testTaskState_{e=0}$ could be different randomly-generated levels of a game.
Like the separation between skill program and intelligent system, the evaluation phase should be understood as a conceptual device used to quantify the task-specific skill and task-specific generalization capabilities demonstrated by a system at a given point in time. The evaluation phase should not be seen as being conceptually similar to a child taking a school test or an IQ test. In real-world evaluation situations, evaluation involves the entire intelligent system, dynamically adapting its understanding of the task at hand. A real-world evaluation situation would be represented in our formalism as being part of the training curriculum – a series of training situations with blank feedback.
The evaluation phase consists of the repetition of the following steps (the current step is noted $e$):

• The task generates a new situation: $situation_e \leftarrow SituationGen(testTaskState_e)$
• The skill program outputs a response to the situation: $[response_e, testSPState_{e+1}] \leftarrow testSkillProgram(situation_e, testSPState_e)$
• The scoring function assigns a score to the response: $[score_e, feedback_e] \leftarrow Scoring(situation_e, response_e, testTaskState_e)$ (the feedback is discarded, since the intelligent system is no longer involved)
• The task updates its state: $testTaskState_{e+1} \leftarrow TaskUpdate(response_e, testTaskState_e)$
The evaluation phase also ends at the discretion of the $SituationGen$ function.
Note that for the sake of simplification, we consider that the $IS$'s state does not transfer across tasks; the $IS$ would start with a “blank” state at the beginning of the training phase for each new task (i.e. only possessing built-in priors). However, the setup above and definitions below may be readily extended to consider lifelong learning, to bring it closer to real-world biological intelligent systems, which learn continuously across a multiplicity of partially overlapping tasks with often no clear boundaries.
Based on the setup described thus far, we can define the following useful concepts:
• Task and skill value function: We define a value function over task space (note that task space may be infinite), associating a scalar value to the combination of a task and a threshold of skill $\theta$ for the task: $TaskValue: [Task, \theta] \rightarrow \omega_{T,\theta}$. Values are assumed positive or zero, and $TaskValue$ is assumed monotonic as a function of $\theta$ (for a given task, higher skill always has higher value). This value function captures the relative importance of skill at each task and defines the subjective frame of reference of our intelligence definition (for instance, if we wish to evaluate human-like intelligence, we would place high value on achieving high skill at human-relevant tasks and place no value on tasks that are irrelevant to the human experience). The value $\omega_{T,\theta}$ of a skill level at a task is chosen so that the quantity $\omega_{T,\theta}$ can be compared fairly across different tasks (i.e. it should capture the value we place on achieving skill $\theta$ at task $T$). This enables us to homogeneously aggregate skill across different tasks without worrying about the scale of their respective scoring functions.
• Potential: We note $\theta_{T,IS}^{max}$ the maximum skill threshold achievable by an intelligent system on a task $T$ (over all possible curricula), for tasks within the system's scope. Potential is a property of an intelligent system.
We find that in most cases it is more useful to consider sufficient skill and sufficient solutions rather than optimal skill and optimal solutions: in application settings, we seek to achieve sufficient performance using as few resources as possible; it is rarer and less practical to seek to achieve maximum possible performance using unlimited resources.
Quantifying generalization difficulty, experience, and priors using Algorithmic Information Theory
Algorithmic Information Theory (AIT) may be seen as a computer science extension of Information Theory. AIT concerns itself with formalizing useful computer science intuitions regarding complexity, randomness, information, and computation. Central to AIT is the notion of Algorithmic Complexity. Algorithmic Complexity (also known as Kolmogorov Complexity or Algorithmic Entropy) was independently investigated, in different contexts, by R.J. Solomonoff, A.N. Kolmogorov and G.J. Chaitin in the 1960s. For an extensive introduction, see [15, 14, 30, 55].
Much like the concept of Entropy in Information Theory, Algorithmic Complexity is a measure of the “information content” of mathematical objects. For our own needs, we will only consider the specific case of binary strings. Indeed, all objects we have introduced so far have been either scalar values (score, potential), or binary strings (states, programs, situations, and responses), since any program may be represented as a binary string.
The Algorithmic Complexity (noted $H(s)$) of a string $s$ is the length of the shortest description of the string in a fixed universal language, i.e. the length of the shortest program that outputs the string when running on a fixed universal Turing machine. Since any universal Turing machine can emulate any other universal Turing machine, $H(s)$ is machine-independent up to a constant.
We can use Algorithmic Complexity to define the information content that a string $s_{2}$ possesses about a string $s_{1}$ (called “Relative Algorithmic Complexity” and noted $H(s_{1}|s_{2})$), as the length of the shortest program that, taking $s_{2}$ as input, produces $s_{1}$. “To take $s_{2}$ as input” means that $s_{2}$ is part of the description of the program, but the length of $s_{2}$ would not be taken into account when counting the program's length.
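Algorithmic Complexity is uncomputable in general, but a lossless compressor gives a crude, computable upper bound that is enough to build intuition. The proxy below is our own illustration, not part of the formalism:

```python
import zlib

def approx_H(s: bytes) -> int:
    # Crude upper bound on H(s): compressed length, in bits.
    return 8 * len(zlib.compress(s, 9))

def approx_H_given(s1: bytes, s2: bytes) -> int:
    # Crude proxy for H(s1|s2): extra bits needed for s1 once s2 is known.
    return max(0, 8 * (len(zlib.compress(s2 + s1, 9)) - len(zlib.compress(s2, 9))))

s1 = b"0123456789" * 100   # long but highly regular string
s2 = s1[:-10]              # s2 already describes almost all of s1
print(approx_H(s1) > approx_H_given(s1, s2))  # True: knowing s2 makes s1 cheap
```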
Because any program may be represented as a binary string, we can use Relative Algorithmic Complexity to describe how closely related two programs are. Based on this observation, we propose to define the intuitive notion of “Generalization Difficulty” of a task as follows:
Consider:

• $Sol_{T}^{\theta}$: the shortest of all possible solutions of task $T$ of skill threshold $\theta$ (i.e. the shortest skill program that achieves a skill level of at least $\theta$ during evaluation).
• $TrainSol_{T,C}^{opt}$: the shortest optimal training-time solution given a curriculum $C$ (i.e. the shortest skill program that performs optimally over the situations in curriculum $C$).

We then define Generalization Difficulty as:
Generalization Difficulty of a task given a curriculum $C$ and a skill threshold $\theta$, noted $GD_{T,C}^{\theta}$: the fraction of the Algorithmic Complexity of solution $Sol_{T}^{\theta}$ that is explained by the shortest optimal training-time solution $TrainSol_{T,C}^{opt}$ (i.e. the length of the shortest program that, taking as input the shortest possible program that performs optimally over the situations in curriculum $C$, produces a program that performs at a skill level of at least $\theta$ during evaluation, normalized by the length of that skill program). Note that this quantity is between 0 and 1 by construction.
$$
GD_{T,C}^{\theta} = \frac{H(Sol_{T}^{\theta} \mid TrainSol_{T,C}^{opt})}{H(Sol_{T}^{\theta})}
$$
Thus, a task with high “generalization difficulty” is one where the evaluation-time behavior needs to differ significantly from the simplest possible optimal training-time behavior in order to achieve sufficient skill. Relative Algorithmic Complexity provides us with a metric to quantify this difference: $GD$ is a measure of how much the shortest training-time solution program needs to be edited in order to become an appropriate evaluation-time solution program. If the shortest skill program that performs optimally during training also happens to perform at a sufficient skill level during evaluation, the task has zero generalization difficulty (i.e. it does not involve uncertainty). A generalizable skill program is one that “covers more ground” in situation space than the exact training situations it is familiar with: a program that is capable of dealing with future uncertainty.
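Using the same compressor stand-in for $H$, the $GD$ ratio can be roughly estimated by how many extra bits the evaluation-time solution requires given the training-time one. This is a loose illustration only (the actual $GD$ is defined over uncomputable quantities, and the solution strings below are made up):

```python
import zlib

def comp_len(b: bytes) -> int:
    return len(zlib.compress(b, 9))

def gd_proxy(sol_eval: bytes, sol_train: bytes) -> float:
    # ~ H(sol_eval | sol_train) / H(sol_eval), compressor standing in for H.
    extra = max(0, comp_len(sol_train + sol_eval) - comp_len(sol_train))
    return extra / comp_len(sol_eval)

train_sol = b"def respond(x): return x > 0"
# Training-time solution already sufficient at evaluation time: GD near 0.
gd_low = gd_proxy(train_sol, train_sol)
# Evaluation-time behavior must differ substantially: GD is higher.
eval_sol = b"def respond(x, memory): return label_of_nearest(memory, x)"
gd_high = gd_proxy(eval_sol, train_sol)
print(gd_low < gd_high)  # True
```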
Note that this definition of generalization difficulty may seem counter-intuitive. Occam's razor principle would seem to suggest that the simplest program that works on the training situations should also be a program that generalizes well. However, generalization describes the capability to deal with future uncertainty, not the capability to compress the behavior that would have been optimal in the past: being prepared for future uncertainty has a cost, which is antagonistic to policy compression. By necessity, $TrainSol_{T,C}^{opt}$ does away with any information or capability that isn't strictly necessary in order to produce the correct response to training situations, and in doing so, it may discard information or capabilities that would have been useful to process evaluation situations. If it is in fact the case that $TrainSol_{T,C}^{opt}$ does not need to discard any such information (i.e. the simplest behavior that was optimal in the past is still sufficient in the future), this implies that the evaluation features no need for adaptation (no non-trivial novelty or uncertainty), and thus the task does not involve generalization, potentially given some starting point (such as the solution of another task).
Another way to express the same idea is that generalization requires reinterpreting the task when new data arrives (e.g. at evaluation time). This implies the need to store representations of past data that would be seemingly useless from the perspective of the past but may prove useful in the future. For example, consider the following labeled points along a line: $(x=-0.75, label=False)$, $(x=0.15, label=True)$, $(x=-0.1, label=True)$. When training a classification program on the first two of these points, some of the shortest optimal training-time solutions may be $\lambda(x): x > 0$ or $\lambda(x): bool(ceil(x))$. When applied to the last point $(x=-0.1, label=True)$, these solutions would fail, while an algorithm that instead stores all past data points and uses nearest-neighbors to return a response at evaluation time would work. The nearest-neighbors program would be better prepared for future uncertainty, but would take significantly more space to write down.
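This example can be run directly (the function names are ours):

```python
# Training points (x, label); the third point arrives only at evaluation time.
train = [(-0.75, False), (0.15, True)]

def shortest_train_solution(x):
    # One of the shortest programs that is optimal on the two training points.
    return x > 0

def nearest_neighbor(x, memory):
    # Longer program: stores all past points and classifies by proximity.
    return min(memory, key=lambda point: abs(point[0] - x))[1]

eval_x = -0.1  # true label: True
print(shortest_train_solution(eval_x))  # False: the compact rule fails
print(nearest_neighbor(eval_x, train))  # True: the heavier program generalizes
```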
Importantly, this first definition of generalization difficulty only captures system-centric generalization, as it quantifies the difficulty of handling evaluation situations that differ from training situations regardless of the system's preexisting capabilities. To capture developer-aware generalization, we need to take into account the system in its initial state at the start of training ($SkillProgramGen$, $ISUpdate$, $isState_{t=0}$):
Developer-aware Generalization Difficulty of a task for an intelligent system given a curriculum $C$ and a skill threshold $\theta$, noted $GD_{IS,T,C}^{\theta}$: the fraction of the Algorithmic Complexity of solution $Sol_{T}^{\theta}$ that is explained by $TrainSol_{T,C}^{opt}$ and the initial state of the system, i.e. the length of the shortest program that, taking as input the initial system plus the shortest possible program that performs optimally over the situations in curriculum $C$, produces a skill program that performs at a skill level of at least $\theta$ during evaluation, normalized by the length of that skill program. Note that this quantity is between 0 and 1 by construction.
$$
GD_{IS,T,C}^{\theta} = \frac{H(Sol_{T}^{\theta} \mid TrainSol_{T,C}^{opt}, IS_{t=0})}{H(Sol_{T}^{\theta})}
$$
In which we note: $IS_{t=0} = \{SkillProgramGen, ISUpdate, isState_{t=0}\}$
Developer-aware generalization thus represents the amount of uncertainty about the shortest evaluation-time solution given that you have at your disposal both the initial system and the shortest training-time solution, i.e. the amount of modifications you would have to make to the shortest training-time solution to obtain the evaluation-time solution, provided that these edits can make use of the contents of the initial system.
Likewise, we can define the Generalization Difficulty from task $T_{1}$ to task $T_{2}$ (sufficient case) as $H(S o l_{T_{2}}^{\theta_{T_{2}}}|S o l_{T_{1}}^{\theta_{T_{1}}})/H(S o l_{T_{2}}^{\theta_{T_{2}}})$ . We can also extend these definitions to a set of tasks (e.g. Generalization Difficulty from a set of practice task to a set of test tasks), which can be useful to quantify the Generalization Difficulty of an entire test suite. These notions are related to the concept of intrinsic task difficulty (regardless of generalization) defined in [37] (section 8.6) as the effort necessary to construct a solution.
Next, we can also use Relative Algorithmic Complexity to formally quantify the Priors $P_{I S,T}$ possessed by an intelligent system about a task:
Priors of an intelligent system relative to a task $T$ and a skill threshold $\theta$, noted $P_{IS,T}^{\theta}$: the fraction of the Algorithmic Complexity of the shortest solution of $T$ of skill threshold $\theta$ that is explained by the initial system (at the start of the training phase). This is the length (normalized by $H(Sol_{T}^{\theta})$) of the shortest possible program that, taking as input the initial system $\{SkillProgramGen, ISUpdate, isState_{t=0}\}$ (noted $IS_{t=0}$), produces the shortest solution of $T$ that performs at a skill level of at least $\theta$ during evaluation. Note that the intelligent system does not need to be able to produce this specific solution. Note that this quantity is between 0 and 1 by construction.
$$
P_{IS,T}^{\theta} = \frac{H(Sol_{T}^{\theta}) - H(Sol_{T}^{\theta} \mid IS_{t=0})}{H(Sol_{T}^{\theta})}
$$
“Priors” thus defined can be interpreted as a measure of how close from a sufficient or optimal solution the system starts, i.e. the “amount of relevant information” embedded in the initial system. Note that this is different from the “amount of information” embedded in the initial system (which would merely be the Algorithmic Complexity of the initial system). As such, our measure only minimally penalizes large systems that contain prior knowledge that is irrelevant to the task at hand (the only added cost is due to knowledge indexing and retrieval overhead).
Further, we can use Relative Algorithmic Complexity to define the Experience $E_{IS,T,C}$ accumulated by an intelligent system about a task over the course of a curriculum.
Consider a single step $t$ during training:
We informally define the amount of experience accrued at step $t$ as the amount of relevant, novel information received by the system at $t$. This corresponds to the amount of potential uncertainty reduction about the solution that is made available by the task in the current situation data and feedback data (i.e. how much the $IS$ could reduce its uncertainty about the solution using the step data if it were optimally intelligent).
Formally:
Experience accrued at step $t$, noted $E_{IS,T,t}^{\theta}$:
$$
E_{IS,T,t}^{\theta} = H(Sol_{T}^{\theta} \mid IS_{t}) - H(Sol_{T}^{\theta} \mid IS_{t}, data_{t})
$$
In which we note:

$$
IS_{t} = \{SkillProgramGen, ISUpdate, isState_{t}\}
$$

$$
data_{t} = \{situation_{t}, response_{t}, feedback_{t}\}
$$
By summing over all steps, we obtain the following definition of total experience (note that we normalize by the Algorithmic Complexity of the solution considered, as we did for priors):
Experience $E_{IS,T,C}^{\theta}$ over a curriculum $C$:
$$
E_{IS,T,C}^{\theta} = \frac{1}{H(Sol_{T}^{\theta})} \sum_{t} E_{IS,T,t}^{\theta}
$$
“Experience” thus defined can be interpreted as a measure of the amount of relevant information received by the system about the task over the course of a curriculum, only accounting for novel information at each step.
Because this is different from the “amount of information” contained in the curriculum (i.e. the Algorithmic Complexity of the curriculum), our measure does not penalize systems that go through noisy curricula.
In addition, because we use an eager sum of relevant and novel information at each step instead of globally pooling the information content of the curriculum, we penalize learners that are slower to absorb the relevant information that is presented to them.
Lastly, because our sum is different from “amount of relevant information (novel or not) at each step summed over all steps”, we do not penalize systems that go through repetitive curricula. If a fast learner absorbs sufficient information over the first ten steps of a fixed curriculum, but a slow learner needs 90 more steps of the same curriculum to achieve the same, we will not count as experience for the fast learner the redundant last 90 steps during which it did not learn anything, but we will count all 100 steps for the slow learner.
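These accounting rules can be sketched numerically. Assume a repetitive curriculum in which every step re-presents all relevant information, so that $H(Sol|IS_t, data_t) = 0$ and the per-step experience equals the learner's remaining uncertainty; the numbers below are made-up stand-ins for the uncomputable $H$ terms:

```python
# E_t = H(Sol|IS_t) - H(Sol|IS_t, data_t): the uncertainty an optimal learner
# could shed at step t given the step's data. With fully redundant step data,
# E_t is simply the remaining uncertainty of the learner at step t.
def total_experience(remaining_uncertainty, steps):
    return sum(remaining_uncertainty(t) for t in range(steps))

fast = lambda t: max(0, 100 - 10 * t)  # fully informed after 10 steps
slow = lambda t: max(0, 100 - t)       # needs all 100 steps

print(total_experience(fast, 100), total_experience(slow, 100))  # 550 5050
```

The fast learner's redundant last 90 steps contribute nothing, while the slow learner is charged for every step it needed: at equal final skill, the slow learner's larger accrued experience $E$ yields a lower intelligence score.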
Defining intelligence
We have now established sufficient context and notations to formally express the intuitive definition of intelligence stated earlier: “the intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.”
We consider an intelligent system $IS$. We note $Cur_{T}^{\theta_{T}}$ the space of curricula that result in $IS$ generating a solution of sufficient skill $\theta_{T}$ for a task $T$, and $Cur_{T}^{opt}$ the space of curricula that result in $IS$ generating its highest-skill solution (a solution reaching the system's potential $\theta_{T,IS}^{max}$). Note that the system's potential may be lower than the optimal solution for the task, as the system may not be able to learn to optimally solve the task.
To simplify notations, we will denote $\theta_{T,IS}^{max}$ as $\Theta$. We note $Avg$ the averaging function (used to average over task space), and $P_{C}$ the probability of a given curriculum $C$.
We then define the intelligence of $IS$, tied to a scope of tasks $scope$, as:
Intelligence of system $IS$ over $scope$ (sufficient case):
$$
I_{IS,scope}^{\theta_{T}} = \underbrace{Avg}_{T \in scope}\left[\omega_{T,\theta_{T}} \cdot \theta_{T} \sum_{C \in Cur_{T}^{\theta_{T}}}\left[P_{C} \cdot \frac{GD_{IS,T,C}^{\theta_{T}}}{P_{IS,T}^{\theta_{T}} + E_{IS,T,C}^{\theta_{T}}}\right]\right]
$$
Intelligence of system $IS$ over $scope$ (optimal case):
$$
I_{IS,scope}^{opt} = \underbrace{Avg}_{T \in scope}\left[\omega_{T,\Theta} \cdot \Theta \sum_{C \in Cur_{T}^{opt}}\left[P_{C} \cdot \frac{GD_{IS,T,C}^{\Theta}}{P_{IS,T}^{\Theta} + E_{IS,T,C}^{\Theta}}\right]\right]
$$
Thus, we equate the intelligence of a system to a measure of the information-efficiency with which the system acquires its final task-specific skill (sufficient skill or highest possible skill) on average (probabilistic average over all applicable curricula), weighted by the developer-aware generalization difficulty of the task considered (as well as the task value $\omega$ , which makes skill commensurable across tasks), averaged over all tasks in the scope.
Or, in plain English: intelligence is the rate at which a learner turns its experience and priors into new skills at valuable tasks that involve uncertainty and adaptation.
Note that our definition is not the first formal definition of intelligence based on Algorithmic Information Theory. We are aware of three other AIT-based definitions: the C-Test [40], the AIXI model [43], and the “Universal Intelligence” model [54] (closely related to AIXI). It should be immediately clear to a reader familiar with these definitions that our own approach represents a very different perspective.
We bring the reader’s attention to a number of key observations about our formalism (see also II.2.3):

Figure 3: Higher intelligence “covers more ground” in future situation space using the same information.
II.2.2 Computation efficiency, time efficiency, energy efficiency, and risk efficiency
In the above, we only considered the information-efficiency (prior-efficiency and experience-efficiency with respect to generalization difficulty) of intelligent systems. Indeed, we believe this is the most actionable and relevant angle today to move AI research forward (cf. II.2.3). But it isn’t the only angle one may want to consider. Several alternatives that could be incorporated into our definition in various ways (e.g. as a regularization term) come to mind:
• Risk efficiency: this is highly relevant to biological systems and natural evolution, in which certain novelty-seeking behaviors that would lead to faster learning may also be more dangerous.
In fact, one may note that information efficiency acts in many settings as a proxy for energy efficiency and risk efficiency.
We expect that these alternative ways to quantify efficiency will become relevant in specialized AI application contexts in the future, and we bring them to the reader’s attention to encourage others to develop new formal definitions of intelligence incorporating them in addition to information efficiency.
II.2.3 Practical implications
The definitions above provide a formal framework as well as quantitative tools to reason about the intuitive notions we have been introducing so far, in particular the concepts of “generalization difficulty”, “intelligence as skill-acquisition efficiency”, and what it means to control for priors and experience when evaluating intelligence, as opposed to looking purely at task-specific skill.
The main value of this framework is to provide an actionable perspective shift in how we understand and evaluate flexible or general artificial intelligence. We argue that this perspective shift has the following practical consequences:
a. Consequences for research directions towards flexible or general AI:
• It encourages interest in building systems based on human-like knowledge priors (e.g. Core Knowledge) by drawing attention to the importance of priors in evaluating intelligence.
b. Consequences for evaluating flexible or general AI systems:
• By defining and quantifying generalization difficulty, it offers a way to formally reason about what it means to perform “local generalization”, “broad generalization”, and “extreme generalization” (cf. the spectrum of generalization introduced in I.3.2), and to weed out tests that feature zero generalization difficulty.
• It suggests concrete guidelines for comparing AI and human intelligence: such a comparison requires starting from a shared scope of tasks and shared priors, and would seek to compare experience-efficiency in achieving specific levels of skill. We detail this idea in II.3.1.
• It shows the importance of taking into account generalization difficulty when developing a test set to evaluate a task. We detail this idea in II.3.2. This should hopefully lead us to evaluation metrics that are able to discard solutions that rely on shortcuts that do not generalize (e.g. reliance on local textures as opposed to global semantics in computer vision).
• It provides a set of practical questions to ask about any intelligent system to rigorously characterize it:
II.3 Evaluating intelligence in this light
Earlier in this document, we have detailed how measuring skill alone does not move us forward when it comes to the development of broad abilities; we have suggested that AI evaluation should learn from its more mature sister field, psychometrics (echoing the thesis of Psychometric AI and Universal Psychometrics); and we have provided a new formalism with practical implications for AI evaluation, pointing out the importance of the concepts of scope, potential, generalization difficulty, experience, and priors. The following section summarizes key practical conclusions with respect to AI evaluation.
II.3.1 Fair comparisons between intelligent systems
We mentioned in II.2.3 that our formalism suggests concrete guidelines for comparing the intelligence of systems of different nature, such as human intelligence and artificial intelligence. Being able to make such comparisons in a fair and rigorous way is essential to progress towards human-like general AI. Here we argue that such intelligence comparisons between systems entail specific requirements with regard to the target systems’ scope, potential, and priors. We also detail how such comparisons should proceed given that these requirements are met.
Scope and potential requirements. In II.1.2, we argued that intelligence is necessarily tied to a scope of application, an idea also central to the formalism introduced in II.2. As such, a comparison scale must be tied to a well-defined scope of tasks that is shared by the target systems (all target systems should be able to learn to perform the same tasks).
Further, we must consider that the target systems may have different potential (maximum achievable skill) over their shared scope. An intelligence comparison should focus on skill-acquisition efficiency, but skill-acquisition efficiency cannot be meaningfully compared between systems that arrive at vastly different levels of skill. As such, a comparison scale must be tied to a fixed threshold of skill over the scope of tasks considered. This skill threshold should be achievable by all target systems.
For instance, comparing a generally-intelligent system to human intelligence would only make sense if the scope of tasks that can be learned by the system is the same scope of tasks that can be learned by a typical human, and the comparison should focus on the efficiency with which the system achieves the same level of skill as a human expert. Comparing maximum realized skill does not constitute an intelligence comparison.
Prior knowledge requirements. Since the formalism of II.2 summarizes priors into a single scalar score, which is homogeneous to the score used to quantify experience, it is not strictly necessary for the two systems being compared to share the same priors. For instance, if two systems achieve the same skill using the same amount of experience (the exact nature of this experience, determined by the curriculum used, may differ), the system that has the least amount of prior knowledge would be considered more intelligent.
However, it would be generally impractical to fully quantify prior knowledge. As such, we recommend only comparing the intelligence of systems that assume a sufficiently similar set of priors. This implies that any measure of intelligence should explicitly and exhaustively list the priors it assumes, an idea we detail below, in II.3.2. Further, this implies that systems that aim at implementing human-like general intelligence should leverage Core Knowledge priors.
If the above conditions are met (shared scope, well-defined skill threshold over scope, and comparable knowledge priors), then a fair intelligence comparison would consist of contrasting the skill-acquisition efficiency profiles of the target systems. The more intelligent system would be the one that uses the least amount of experience to arrive at the desired skill threshold in the average case. Alternatively, computation efficiency, energy efficiency, and risk efficiency may also be considered, as per II.2.2.
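A minimal sketch of this comparison procedure, assuming the learning curves of the target systems have been recorded as (experience, skill) checkpoints over a set of curricula; all data below is hypothetical:

```python
# Fair intelligence comparison under the conditions above: shared scope,
# a fixed skill threshold achievable by all systems, and comparable priors.
# The more intelligent system reaches the threshold with less experience
# on average across curricula.

def avg_experience_to_threshold(curves, theta):
    """curves: one learning curve per curriculum, each a list of
    (experience, skill) checkpoints with nondecreasing skill.
    Returns the mean experience at which skill first reaches theta."""
    costs = [next((e for e, s in curve if s >= theta), float("inf"))
             for curve in curves]
    return sum(costs) / len(costs)

# Hypothetical learning curves for two systems on a shared scope,
# with a shared, achievable skill threshold theta = 0.9.
system_a = [[(10, 0.50), (50, 0.92)], [(10, 0.40), (80, 0.95)]]
system_b = [[(10, 0.60), (30, 0.91)], [(10, 0.50), (40, 0.93)]]
a_cost = avg_experience_to_threshold(system_a, 0.9)
b_cost = avg_experience_to_threshold(system_b, 0.9)
# b_cost < a_cost: system B converts experience into skill more efficiently.
```

Note that comparing the *final* skill of the two systems would say nothing here; only the experience cost to the shared threshold is compared, per the argument above.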
II.3.2 What to expect of an ideal intelligence benchmark
The recommendations below synthesize the conclusions of this document with regard to the properties that a candidate benchmark of human-like general intelligence should possess.
• It should explicitly and exhaustively describe the set of priors it assumes. Any test of intelligence is going to involve priors, but in many tasks used for AI evaluation today, priors stay implicit, and the existence of implicit hidden priors may often give an unfair advantage to either humans or machines.
• It should work for both humans and machines, fairly, by only assuming the same priors as possessed by humans (e.g. Core Knowledge) and only requiring a human-sized amount of practice time or training data.
These recommendations for general AI evaluation wouldn’t be complete without a concrete effort to implement them. In part III, we present our initial attempt.
III A benchmark proposal: the ARC dataset
In this last part, we introduce the Abstraction and Reasoning Corpus (ARC), a dataset intended to serve as a benchmark for the kind of general intelligence defined in II.2. ARC is designed to incorporate as many of the recommendations of II.3 as possible.
III.1 Description and goals
III.1.1 What is ARC?
ARC can be seen as a general artificial intelligence benchmark, as a program synthesis benchmark, or as a psychometric intelligence test. It is targeted at both humans and artificially intelligent systems that aim at emulating a human-like form of general fluid intelligence. It is somewhat similar in format to Raven’s Progressive Matrices [47], a classic IQ test format going back to the 1930s.
ARC has the following top-level goals:
ARC comprises a training set and an evaluation set. The training set features 400 tasks, while the evaluation set features 600 tasks. The evaluation set is further split into a public evaluation set (400 tasks) and a private evaluation set (200 tasks). All tasks are unique, and the set of test tasks and the set of training tasks are disjoint. The task data is available at github.com/fchollet/ARC.
Each task consists of a small number of demonstration examples (3.3 on average), and a small number of test examples (generally 1, although it may be 2 or 3 in rare cases). Each example consists of an “input grid” and an “output grid”. Each “grid” is a literal grid of symbols (each symbol is typically visualized via a unique color), as seen in figure 4. There are 10 unique symbols (or colors). A grid can be any height or width between 1x1 and 30x30, inclusive (the median height is 9 and the median width is 10).
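A minimal loader for this format can be sketched as follows, following the JSON layout used in the github.com/fchollet/ARC repository (a task is a dict with "train" and "test" lists of input/output grid pairs):

```python
import json

# Minimal loader/validator for an ARC task. A grid is a list of rows of
# integer symbols 0-9, with height and width between 1 and 30, as described
# in the text above.

def validate_task(task):
    for split in ("train", "test"):
        for pair in task[split]:
            for grid in (pair["input"], pair["output"]):
                assert 1 <= len(grid) <= 30 and 1 <= len(grid[0]) <= 30
                assert all(0 <= cell <= 9 for row in grid for cell in row)
    return task

def load_task(path):
    """Load one task JSON file, e.g. from the data/ folder of the repo."""
    with open(path) as f:
        return validate_task(json.load(f))
```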
When solving an evaluation task, a test-taker has access to the training examples for the task (both the input and output grids), as well as the input grid of the test examples for the task. The test-taker must construct on its own the output grid corresponding to the input grid of each test example. “Constructing the output grid” is done entirely from scratch, meaning that the test-taker must decide what the height and width of the output grid should be, what symbols it should place on the grid, and where. The task is successfully solved if the test-taker can produce the exact correct answer on all test examples for the task (binary measure of success). For each test example in a task, the test-taker (either human or machine) is allowed 3 trials. The only feedback received after a trial is binary (correct answer or incorrect answer). The score of an intelligent system on ARC is the fraction of tasks in the evaluation set that it can successfully solve. Crucially, it is assumed that neither the test-taker nor its developer would have had any prior information about the tasks featured in the evaluation set: ARC seeks to measure “developer-aware generalization” as defined in I.3.2. The existence of a private evaluation set enables us to strictly enforce this in the setting of a public competition.
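The evaluation protocol above can be sketched as a scoring harness; the `solver` callable here is a hypothetical interface (not part of the dataset), taking the demonstration pairs, a test input grid, and a trial index:

```python
# Sketch of the ARC evaluation protocol: for each test example the solver
# gets up to 3 trials with only binary feedback, and a task counts as solved
# only if every test example is answered with the exact output grid.

def solve_task(task, solver, max_trials=3):
    for example in task["test"]:
        solved = False
        for trial in range(max_trials):
            attempt = solver(task["train"], example["input"], trial)
            if attempt == example["output"]:  # exact match, binary feedback
                solved = True
                break
        if not solved:
            return False
    return True

def arc_score(tasks, solver):
    """Fraction of evaluation tasks fully solved (the ARC score)."""
    return sum(solve_task(t, solver) for t in tasks) / len(tasks)
```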
A test-taker is also assumed to have access to the entirety of the training set, although the training data isn’t strictly necessary in order to be successful on the evaluation set, as all tasks are unique and do not assume any knowledge other than the priors described in III.1.2. A typical human can solve most of the ARC evaluation set without any previous training. As such, the purpose of the training set is primarily to serve as a development validation set for AI system developers, or as a mock test for human test-takers. It could also be used as a way to familiarize an algorithm with the content of Core Knowledge priors. We do not expect that practice on the training set would increase human performance on the test set (although this hypothesis would need to be concretely tested).
III.1.2 Core Knowledge priors
Any test of intelligence is going to involve prior knowledge. ARC seeks to control for its own assumptions by explicitly listing the priors it assumes, and by avoiding reliance on any information that isn’t part of these priors (e.g. acquired knowledge such as language). The ARC priors are designed to be as close as possible to Core Knowledge priors, so as to provide a fair ground for comparing human intelligence and artificial intelligence, as per our recommendations in II.3.1.
The Core Knowledge priors assumed by ARC are as follows:

Figure 4: A task where the implicit goal is to complete a symmetrical pattern. The nature of the task is specified by three input/output examples. The test-taker must generate the output grid corresponding to the input grid of the test input (bottom right).
a. Objectness priors:
Object cohesion: Ability to parse grids into “objects” based on continuity criteria (including color continuity or spatial contiguity, see figure 5), and ability to parse grids into zones or partitions.
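The color-continuity case of this prior can be sketched as a connected-components parse (4-adjacency over same-colored cells); treating symbol 0 as background is an assumption of this sketch, not a rule of ARC:

```python
# Sketch of the "object cohesion" prior: parse a grid into objects as
# connected components of same-colored, 4-adjacent, non-background cells.
# (ARC tasks may also call for multi-color spatially contiguous objects;
# this sketch covers only the color-continuity case.)

def parse_objects(grid, background=0):
    h, w = len(grid), len(grid[0])
    seen = set()
    objects = []
    for r in range(h):
        for c in range(w):
            if grid[r][c] == background or (r, c) in seen:
                continue
            color, stack, cells = grid[r][c], [(r, c)], []
            seen.add((r, c))
            while stack:  # iterative flood fill from the seed cell
                y, x = stack.pop()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and (ny, nx) not in seen
                            and grid[ny][nx] == color):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            objects.append((color, sorted(cells)))
    return objects
```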
Object persistence: Objects are assumed to persist despite the presence of noise (figure 6) or occlusion by other objects. In many cases (but not all) objects from the input persist on the output grid, often in a transformed form. Common geometric transformations of objects are covered in category 4, “basic geometry and topology priors”.

Figure 5: Left, objects defined by spatial contiguity. Right, objects defined by color continuity.

Figure 6: A denoising task.
Object influence via contact: Many tasks feature physical contact between objects (e.g. one object being translated until it is in contact with another (figure 7), or a line “growing” until it “rebounds” against another object (figure 8)).

Figure 7: The red object “moves” towards the blue object until “contact”.
b. Goal-directedness prior:
While ARC does not feature the concept of time, many of the input/output grids can be effectively modeled by humans as being the starting and end states of a process that involves intentionality (e.g. figure 9). As such, the goal-directedness prior may not be strictly necessary to solve ARC, but it is likely to be useful.
c. Numbers and Counting priors:
Many ARC tasks involve counting or sorting objects (e.g. sorting by size), comparing numbers (e.g. which shape or symbol appears the most (e.g. figure 10)? The least? The same number of times? Which is the largest object? The smallest? Which objects are the same size?), or repeating a pattern for a fixed number of times. The notions of addition and subtraction are also featured (as they are part of the Core Knowledge number system as per [85]). All quantities featured in ARC are smaller than approximately 10.
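These counting and comparison priors can be sketched as simple tallies over grid symbols; the background symbol 0 and the selection criteria below are assumptions of this sketch:

```python
from collections import Counter

# Sketch of the counting/comparison priors: tally how often each
# non-background symbol occurs, then pick the most and least frequent,
# mirroring tasks like the one in figure 10. All ARC quantities stay
# below roughly 10, so exact small-number counting suffices.

def symbol_counts(grid, background=0):
    return Counter(cell for row in grid for cell in row if cell != background)

grid = [[3, 0, 5],
        [3, 5, 3],
        [0, 5, 5]]
counts = symbol_counts(grid)
most = max(counts, key=counts.get)   # symbol appearing the most times
least = min(counts, key=counts.get)  # symbol appearing the least times
```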

Figure 8: A task where the implicit goal is to extrapolate a diagonal line that “rebounds” upon contact with a red obstacle.
d. Basic Geometry and Topology priors:
ARC tasks feature a range of elementary geometry and topology concepts, in particular:
III.1.3 Key differences with psychometric intelligence tests
We have pointed out in I.3.4 the reasons why using existing psychometric intelligence tests (or “IQ tests”) does not constitute a sound basis for AI evaluation. Although ARC stays deliberately close in format to traditional IQ tests (as well as related efforts such as Hernández-Orallo’s C-Test [40]), its design differs from them in fundamental ways. We argue that these differences address the shortcomings of psychometric intelligence tests in the context of AI evaluation. In particular:
• Unlike some psychometric intelligence tests, ARC is not interested in assessing crystallized intelligence or crystallized cognitive abilities. ARC only assesses a general form of fluid intelligence, with a focus on reasoning and abstraction. ARC does not involve language, pictures of real-world objects, or real-world common sense. ARC seeks to only involve knowledge that stays close to Core Knowledge priors.

Figure 9: A task that combines the concepts of “line extrapolation”, “turning on obstacle”, and “efficiently reaching a goal” (the actual task has more demonstration pairs than these three).
III.1.4 What a solution to ARC may look like, and what it would imply for AI applications
We have found ARC to be fully solvable by humans. While many ARC test tasks are intellectually challenging, human test-takers appear to be able to solve the majority of tasks on their first try without any practice or verbal explanations. Each task included in ARC has been successfully solved by at least one member of a group of three high-IQ humans (who did not communicate with each other), which demonstrates task feasibility. In the future, we hope to be able to further investigate human performance on ARC by gathering a statistically significant amount of human testing data, in particular with regard to the relationship between CHC cognitive abilities and ARC performance.

Figure 10: A task where the implicit goal is to count unique objects and select the object that appears the most times (the actual task has more demonstration pairs than these three).

Figure 11: Drawing the symmetrized version of a shape around a marker. Many tasks involve some form of symmetry.
Crucially, to the best of our knowledge, ARC does not appear to be approachable by any existing machine learning technique (including Deep Learning), due to its focus on broad generalization and few-shot learning, as well as the fact that the evaluation set only features tasks that do not appear in the training set.
For a researcher setting out to solve it, ARC is perhaps best understood as a program synthesis benchmark. Program synthesis [31, 32] is a subfield of AI interested in the generation of programs that satisfy a high-level specification, often provided in the form of pairs of example inputs and outputs for the program – which is exactly the ARC format.
A hypothetical ARC solver may take the form of a program synthesis engine that uses the demonstration examples of a task to generate candidates that transform input grids into corresponding output grids. Schematically:
• Start by developing a domain-specific language (DSL) capable of expressing all possible solution programs for any ARC task. Since the exact set of ARC tasks is purposely not formally definable, this may be challenging (the space of tasks is defined as anything expressible in terms of ARC pairs that would only involve Core Knowledge). It would require hard-coding the Core Knowledge priors from III.1.2 in a sufficiently abstract and combinable program form, to serve as basis functions for a kind of “human-like reasoning DSL”. We believe that solving this specific subproblem is critical to general AI progress.
• Given a task, use the DSL to generate a set of candidate programs that turn the input grids into the corresponding output grids. This step would reuse and recombine subprograms that previously proved useful in other ARC tasks.
• Select top candidates among these programs based on a criterion such as program simplicity or program likelihood (such a criterion may be trained on solution programs previously generated using the ARC training set). Note that we do not expect that merely selecting the simplest possible program that works on training pairs will generalize well to test pairs (cf. our definition of generalization difficulty from II.2).
• Use the top three candidates to generate output grids for the test examples.
We posit that the existence of a human-level ARC solver would represent the ability to program an AI from demonstrations alone (only requiring a handful of demonstrations to specify a complex task) to do a wide range of human-relatable tasks of a kind that would normally require human-level, human-like fluid intelligence. As supporting evidence, we note that human performance on psycho metric intelligence tests (which are similar to ARC) is predictive of success across all human cognitive tasks. Further, we posit that, since an ARC solver and human intelligence would both be founded on the same knowledge priors, the scope of application of an ARC solver would be close to that of human cognition, making such a solver both practically valuable (i.e. it could solve useful, human-relevant problems) and easy to interact with (i.e. it would readily understand human demonstrations and would produce behavior that is in line with human expectations).
Our claims are highly speculative and may well prove fully incorrect, much like Newell’s 1973 hopes that progress on chess playing would translate into meaningful progress on achieving a range of broad cognitive abilities – especially if ARC turns out to feature unforeseen vulnerabilities to unintelligent shortcuts. We expect our claims to be validated or invalidated in the near future once we make sufficient progress on solving ARC.
III.2 Weaknesses and future refinements
III.2 弱点与未来改进方向
It is important to note that ARC is a work in progress, not a definitive solution; it does not fit all of the requirements listed in II.3.2, and it features a number of key weaknesses:
• Generalization is not quantified. While ARC is explicitly designed to measure “broad generalization” as opposed to “local generalization” or “extreme generalization”, we do not offer a quantitative measure of the generalization of the evaluation set given the training set, or the generalization difficulty of each task (considered independently). We plan on conducting future work to empirically address this issue by using human performance on a task (considered over many human subjects) to estimate the generalization difficulty it represents. We would be particularly interested in attempting to correlate human performance on a task with an approximation of the AIT measure of generalization difficulty proposed in II.2 (such an approximation should become available as we make progress on ARC solver programs). Finding high correlation, or a lack of correlation, would provide a degree of validation or invalidation of our formal measure.
• Test validity is not established. Validity represents the predictiveness of test performance with regard to performance on other non-test activities. The validity of ARC should be investigated via large-sample size statistical studies on humans, following the process established by psychometrics. Further, when AI ARC solvers become a reality, we will also be able to study how well ARC performance translates into real-world usefulness across a range of tasks.
• Dataset size and diversity may be limited. ARC only features 1,000 tasks in total, and there may be some amount of conceptual overlap across many tasks. This could make ARC potentially vulnerable to shortcut strategies that could solve the tasks without featuring intelligence. We plan on running public AI competitions (using the private evaluation set) as a way to crowd-source attempts to produce such shortcuts (if a shortcut exists, it should arise quickly in a competition setting). Further, to mitigate potential vulnerability against such shortcuts, we intend to keep adding new tasks to ARC in the future, possibly by crowd-sourcing them.
• The evaluation format is overly close-ended and binary. The score of a test-taker on an evaluation task is either 0 or 1, which lacks granularity. Further, real-world problem-solving often takes the form of an interactive process where hypotheses are formulated by the test-taker, then empirically tested, iteratively. In ARC, this approach is possible to an extent, since the test-taker is allowed 3 trials for each test example in a task. However, this format remains overly limiting. A better approach may be to let the test-taker dynamically interact with an example generator for the task: the test-taker would be able to ask for a new test input at will, would propose a solution for the test input, and would receive feedback on their solution, repeatedly, until the test-taker is reliably able to produce the correct answer. The test-taker’s score on the task would then be a measure of the amount of feedback it required until it became able to reliably generate the correct solution for any new input. This represents a more direct measure of intelligence as formally defined in II.2, where the input generator is in control of the curriculum.
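As a minimal sketch, the interactive protocol described above could look like the following (all names here are hypothetical, not part of ARC): an arbiter draws inputs from the task's example generator, the test-taker proposes answers and receives binary feedback, and the score is the number of feedback rounds consumed before the test-taker answers correctly a fixed number of times in a row.

```python
from typing import Any, Callable

def interactive_score(
    generate_input: Callable[[], Any],     # the task's example generator
    correct_answer: Callable[[Any], Any],  # oracle producing the ground-truth output
    test_taker,                            # exposes .propose(x) and .observe_feedback(...)
    reliability_window: int = 10,
) -> int:
    """Number of feedback rounds until the test-taker is reliably correct
    (here: `reliability_window` consecutive successes); lower is better."""
    feedback_count = 0
    streak = 0
    while streak < reliability_window:
        x = generate_input()
        proposal = test_taker.propose(x)
        is_correct = proposal == correct_answer(x)
        test_taker.observe_feedback(x, proposal, is_correct)
        feedback_count += 1
        streak = streak + 1 if is_correct else 0
    return feedback_count
```

Under such a scheme, a test-taker that generalizes from few examples consumes little feedback, which rewards information efficiency rather than raw skill.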
• Core Knowledge priors may not be well understood and may not be well captured in ARC. Central to ARC is the notion that it only relies on innate human prior knowledge and does not feature significant amounts of acquired knowledge. However, the exact nature of innate human prior knowledge is still an open problem, and whether these priors are correctly captured in ARC is unclear.
III.3 Possible alternatives
ARC is merely one attempt to create a human-like general intelligence benchmark that embodies as many of the guidelines listed in II.3 as possible. While ARC stays very close to the format of psychometric intelligence tests, many other possible approaches could be explored. In this section, we offer some suggestions for alternatives.
III.3.1 Repurposing skill benchmarks to measure broad generalization
We noted in I.3.5 the ongoing fascination of the AI research community with developing systems that surpass human skill at board games and video games. We propose repurposing such tests of skill into tests of intelligence.
Consider an AI developer interested in solving game $X$. While the AI would be trained on instances of $X$, an evaluation arbiter would create multiple variants of $X$ ($X_1$, $X_2$, ..., $X_n$). These alternative games would be designed to represent a meaningful amount of generalization difficulty over $X$ (as defined in II.2): the simplest game-playing program that is optimal on instances of $X$ (e.g. game levels of $X$) would not be optimal on the $X_i$. As such, these alternative games would not be mere “new levels” of $X$, but would feature related-yet-novel gameplay, so as to measure broad generalization as opposed to local generalization. These alternative games would stay unknown to the AI developers, so as to measure developer-aware generalization. This proposed setup is thus markedly different from e.g. CoinRun [17] or Obstacle Tower [49], where the evaluation environments are not alternative games, but only levels of the same game (local generalization, or generalization to known unknowns), randomly sampled from a level generator which is known in advance to the AI developers (no evaluation of developer-aware generalization).
The AI trained on $X$, once ready, would then be tasked with learning to solve $X_1$, ..., $X_n$. Its evaluation score would then be a measure of the amount of experience it required on each alternative game in order to reach a specific threshold of skill, modulated by the amount of generalization difficulty represented by each alternative game. A measure of the general intelligence of such an AI would then be an average of these evaluation scores over a large number of different source games $X$.
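One simple way to combine these quantities, purely as a sketch (the record fields and the efficiency-weighted-by-difficulty aggregation are assumptions, not a formula from the text): score each variant by how difficult it was to generalize to, divided by the experience consumed, and average across variants.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VariantResult:
    """Hypothetical record for one alternative game X_i."""
    experience_to_threshold: float    # e.g. environment steps until the skill threshold
    generalization_difficulty: float  # in (0, 1], in the sense of II.2

def variant_score(r: VariantResult) -> float:
    """Higher when a difficult variant is mastered with little experience:
    skill-acquisition efficiency weighted by generalization difficulty."""
    return r.generalization_difficulty / r.experience_to_threshold

def game_benchmark_score(results: List[VariantResult]) -> float:
    """Average variant score across the alternative games X_1 ... X_n."""
    return sum(variant_score(r) for r in results) / len(results)
```

In practice, a general intelligence estimate would further average `game_benchmark_score` over many different source games $X$, as described above.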
For instance, consider the game DotA2: an AI trained on DotA2 may be evaluated by measuring the efficiency with which it can learn to play new games from the same genre, such as League of Legends or Heroes of the Storm. As an even simpler (but weaker) alternative, an AI trained on 16 specific DotA2 characters may be evaluated by measuring the efficiency with which it can learn to master a set of brand new characters it would not have played before – for example, a strong human DotA2 player can play at a high level with a new character upon first try.
III.3.2 Open-ended adversarial or collaborative approaches
We have pointed out in III.2 some of the key limitations of having to craft evaluation tasks manually: it is a labor-intensive process that makes it difficult to formally control for generalization difficulty, that could potentially result in a low-diversity set of tasks, and that is not easily scalable (although crowd-sourcing tasks may partially address this problem). The diversity and scalability points are especially critical given that we need a constant supply of substantially new tasks in order to guarantee that the benchmark is measuring developer-aware generalization.
A solution may be to instead programmatically generate new tasks. We noted in III.1.3 that programmatic generation from a static “master” program is not desirable, as it places a ceiling on the diversity and complexity of the set of tasks that can be generated, and it offers a potential avenue to “cheat” on the benchmark by reverse-engineering the master program. We propose instead to generate tasks via an ever-learning “teacher” program, interacting in a loop with test-taking systems, called “student” programs (figure 12). The teacher program would optimize task generation for novelty and interestingness for a given student (tasks should be new and challenging, while still being solvable by the student), while students would evolve to learn to solve increasingly difficult tasks. This setup is also favorable to curriculum optimization, as the teacher program may be configured to seek to optimize the learning efficiency of its students. This idea is similar to the “anytime intelligence test” proposed in [38] and to the POET system proposed in [96].
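A toy illustration of such a teacher-student loop, purely to show the control flow (the scalar “difficulty” and “skill” values stand in for real task generation and real learning, and all names are assumptions): the teacher adapts task difficulty so that tasks stay challenging yet solvable, while the student's competence grows with practice.

```python
class Teacher:
    """Toy teacher: adapts task difficulty to keep tasks novel yet solvable."""
    def __init__(self) -> None:
        self.difficulty = 1

    def propose_task(self) -> int:
        return self.difficulty

    def observe(self, solved: bool) -> None:
        # Push toward the frontier: harder if solved, easier if not.
        self.difficulty = max(1, self.difficulty + (1 if solved else -1))

class Student:
    """Toy student: solves tasks up to its current skill, which grows with practice."""
    def __init__(self) -> None:
        self.skill = 3

    def attempt(self, task_difficulty: int) -> bool:
        solved = task_difficulty <= self.skill
        if solved:
            self.skill += 1  # learning signal from solved tasks
        return solved

def run_loop(steps: int = 20):
    """Run the teacher-student interaction loop and return its history."""
    teacher, student = Teacher(), Student()
    history = []
    for _ in range(steps):
        task = teacher.propose_task()
        solved = student.attempt(task)
        teacher.observe(solved)
        history.append((task, solved))
    return history, student.skill
```

A real instantiation would replace the scalar difficulty with a generative task space and score students by their learning efficiency under the teacher's curriculum.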
In order to make sure that the space of generated tasks retains sufficient complexity and novelty over time, the teacher program should draw information from an external source (assumed to feature incompressible complexity), such as the real world. This external source of complexity makes the setup truly open-ended. A teacher program that generates novel tasks that partially emulate human-relevant tasks would have the added advantage that it would guide the resulting student programs towards a form of intelligence that could transfer to real-world human-relevant problems.

Figure 12: Teacher-student learning and evaluation system.
Taking stock
The study of general artificial intelligence is a field still in its infancy, and we do not wish to convey the impression that we have provided a definitive solution to the problem of characterizing and measuring the intelligence held by an AI system. Rather, we have introduced a new perspective on defining and evaluating intelligence, structured around the following ideas:
We have also provided a new formalism based on Algorithmic Information Theory (cf. II.2) to rigorously and quantitatively reason about these ideas, as well as a set of concrete guidelines to follow for developing a benchmark of general intelligence (cf. II.3.1 and II.3.2).
Our definition, formal framework, and evaluation guidelines, which do not capture all facets of intelligence, were developed to be actionable, explanatory, and quantifiable, rather than descriptive, exhaustive, or consensual. They are not meant to invalidate other perspectives on intelligence; rather, they are meant to serve as a useful objective function to guide research on broad AI and general AI, as outlined in II.2.3. Our hope is for some part of the AI community interested in general AI to break out of a longstanding and ongoing trend of seeking to achieve raw skill at challenging tasks, given unlimited experience and unlimited prior knowledge.
To ground our ideas and enable others to build upon them, we are also providing an actual benchmark, the Abstraction and Reasoning Corpus, or ARC:
Importantly, ARC is still a work in progress, with known weaknesses listed in III.2. We plan on further refining the dataset in the future, both as a playground for research and as a joint benchmark for machine intelligence and human intelligence.
The measure of the success of our message will be its ability to divert the attention of some part of the community interested in general AI, away from surpassing humans at tests of skill, towards investigating the development of human-like broad cognitive abilities, through the lens of program synthesis, Core Knowledge priors, curriculum optimization, information efficiency, and achieving extreme generalization through strong abstraction.
