[论文翻译]深度学习的十大挑战 Deep Learning: A Critical Appraisal


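Gary Marcus formerly headed Uber's AI lab. In December of last year he sold his startup, Geometric Intelligence, to Uber and helped the company build its AI research team; only four months later he announced that he was leaving Uber. Marcus is a scientist at New York University who has commented extensively, and critically, on progress in AI.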


Although deep learning has historical roots going back decades, neither the term “deep learning” nor the approach was popular just over five years ago, when the field was reignited by papers such as Krizhevsky, Sutskever and Hinton’s now classic 2012 deep net model of ImageNet (Krizhevsky, Sutskever, & Hinton, 2012).
What has the field discovered in the five subsequent years? Against a background of considerable progress in areas such as speech recognition, image recognition, and game playing, and considerable enthusiasm in the popular press, I present ten concerns for deep learning, and suggest that deep learning must be supplemented by other techniques if we are to reach artificial general intelligence.


尽管深度学习历史可追溯到几十年前,但这种方法,甚至深度学习一词都只是在 5 年前才刚刚流行,也就是该领域被类似于阿莱克斯·克里泽夫斯基(Alex Krizhevsky)、伊利娅·苏特斯科娃(Ilya Sutskever)和杰弗里·辛顿(Geoffrey Hinton)等人合作的论文这样的研究成果重新点燃的时候。他们的论文如今是 ImageNet 上经典的深度网络模型。

在随后 5 年中,该领域都发现了什么?在语音识别、图像识别、游戏等领域有可观进步,主流媒体热情高涨的背景下,我提出了对深度学习的十点担忧,且如果我们想要达到通用人工智能,我建议要有其他技术补充深度学习。

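“For most problems where deep learning has enabled transformationally better solutions (vision, speech), we've entered diminishing returns territory in 2016-2017.” (François Chollet, Google, author of Keras, December 18, 2017)

“Science progresses one funeral at a time.” The future depends on some graduate student who is deeply suspicious of everything I have said. (Geoffrey Hinton, godfather of deep learning, Google Brain, September 15, 2017)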

1. Is deep learning approaching a wall? 深度学习撞墙了?

Although deep learning has historical roots going back decades (Schmidhuber, 2015), it attracted relatively little notice until just over five years ago. Virtually everything changed in 2012, with the publication of a series of highly influential papers such as Krizhevsky, Sutskever and Hinton’s 2012 ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky, Sutskever, & Hinton, 2012), which achieved state-of-the-art results on the object recognition challenge known as ImageNet (Deng et al., ). Other labs were already working on similar work (Cireşan, Meier, Masci, & Schmidhuber, 2012). Before the year was out, deep learning made the front page of The New York Times, and it rapidly became the best known technique in artificial intelligence, by a wide margin.
尽管,深度学习的根源可追溯到几十年前(Schmidhuber,2015),但直到 5 年之前,人们对于它的关注还极其有限。

2012 年,克里泽夫斯基、苏特斯科娃和辛顿发布论文《ImageNet Classification with Deep Convolutional Neural Networks》(Krizhevsky, Sutskever, & Hinton, 2012),在 ImageNet 目标识别挑战赛上取得了顶尖结果(Deng et al.)。

随着这样的一批高影响力论文的发表,一切都发生了本质上的变化。当时,其他实验室已经在做类似的工作(Cireşan, Meier, Masci, & Schmidhuber, 2012)。在 2012 年将尽之时,深度学习上了纽约时报的头版。然后,迅速蹿红成为人工智能中最知名的技术。

If the general idea of training neural networks with multiple layers was not new, it was, in part because of increases in computational power and data, the first time that deep learning truly became practical.
Deep learning has since yielded numerous state-of-the-art results in domains such as speech recognition, image recognition, and language translation, and plays a role in a wide swath of current AI applications. Corporations have invested billions of dollars fighting for deep learning talent. One prominent deep learning advocate, Andrew Ng, has gone so far as to suggest that “If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.” (A, 2016). A recent New York Times Sunday Magazine article, largely about deep learning, implied that the technique is “poised to reinvent computing itself.”
Yet deep learning may well be approaching a wall, much as I anticipated earlier, at the beginning of the resurgence (Marcus, 2012), and as leading figures like Hinton (Sabour, Frosst, & Hinton, 2017) and Chollet (2017) have begun to imply in recent months.
What exactly is deep learning, and what has it shown about the nature of intelligence? What can we expect it to do, and where might we expect it to break down? How close or far are we from “artificial general intelligence”, and a point at which machines show a human-like flexibility in solving unfamiliar problems? The purpose of this paper is both to temper some irrational exuberance and also to consider what we as a field might need to move forward.


自此之后,深度学习在语音识别、图像识别、语言翻译这样的领域产生了众多顶尖成果,且在目前众多的 AI 应用中扮演着重要的角色。大公司也开始投资数十亿美元挖深度学习人才。

深度学习的一位重要拥护者,吴恩达,想的更远并说到,“如果一个人完成一项脑力任务需要少于一秒的考虑时间,我们就有可能在现在或者不久的未来使用 AI 使其自动化。”(A,2016)。


如今,深度学习可能临近墙角,大部分如同前面我在深度学习崛起之时(Marcus 2012)预期到的,也如同辛顿(Sabour, Frosst, & Hinton, 2017)、Chollet(2017)这样的重要人物近几月来暗示的那样。

This paper is written simultaneously for researchers in the field, and for a growing set of AI consumers with less technical background who may wish to understand where the field is headed. As such I will begin with a very brief, nontechnical introduction aimed at elucidating what deep learning systems do well and why (Section 2), before turning to an assessment of deep learning’s weaknesses (Section 3) and some fears that arise from misunderstandings about deep learning’s capabilities (Section 4), and closing with perspective on going forward (Section 5).
Deep learning is not likely to disappear, nor should it. But five years into the field’s resurgence seems like a good moment for a critical reflection, on what deep learning has and has not been able to achieve.

该论文同时也是写给该领域的研究人员,写给缺乏技术背景又可能想要理解该领域的 AI 消费者。如此一来,在第二部分我将简要地、非技术性地介绍深度学习系统能做什么,为什么做得好。然后在第三部分介绍深度学习的弱点,第四部分介绍对深度学习能力的误解,最后介绍我们可以前进的方向。

深度学习不太可能会消亡,也不该消亡。但在深度学习崛起的 5 年后,看起来是时候对深度学习的能力与不足做出批判性反思了。

2. What deep learning is, and what it does well 深度学习是什么?深度学习擅长什么?

Deep learning, as it is primarily used, is essentially a statistical technique for classifying patterns, based on sample data, using neural networks with multiple layers. Neural networks in the deep learning literature typically consist of a set of input units that stand for things like pixels or words, multiple hidden layers (the more such layers, the deeper a network is said to be) containing hidden units (also known as nodes or neurons), and a set of output units, with connections running between those nodes. In a typical application such a network might be trained on a large set of handwritten digits (these are the inputs, represented as images) and labels (these are the outputs) that identify the categories to which those inputs belong (this image is a 2, that one is a 3, and so forth).




Over time, an algorithm called back-propagation allows a process called gradient descent to adjust the connections between units, such that any given input tends to produce the corresponding output.
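As a rough illustration of what "adjusting the connections" means, the sketch below trains a single sigmoid output unit by gradient descent. It is a deliberate simplification and not the multi-layer setup the paper describes: there are no hidden layers, so no back-propagation through layers is needed, and the task (the logical OR function), the learning rate, and the number of epochs are all assumptions chosen for illustration.

```python
import math

# Toy training set (an assumption for illustration): the logical OR function.
# Each example pairs an input vector with its target label.
EXAMPLES = [
    ((0.0, 0.0), 0.0),
    ((0.0, 1.0), 1.0),
    ((1.0, 0.0), 1.0),
    ((1.0, 1.0), 1.0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Output of one sigmoid unit for the given inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(examples, learning_rate=0.5, epochs=2000):
    """Fit the unit's connection weights by per-example gradient descent."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            # Gradient of the cross-entropy loss with respect to the net input.
            error = predict(weights, bias, inputs) - target
            # Move each connection a small step against its gradient.
            weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


weights, bias = train(EXAMPLES)
predictions = [round(predict(weights, bias, inputs)) for inputs, _ in EXAMPLES]
print(predictions)
```

After training, each input tends to produce its corresponding label. A deep network repeats the same kind of update across many layers, with back-propagation supplying the gradient for each connection.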
Collectively, one can think of the relation between inputs and outputs that a neural network learns as a mapping. Neural networks, particularly those with multiple hidden layers (hence the term deep) are remarkably good at learning input-output mappings.
Such systems are commonly described as neural networks because the input nodes, hidden nodes, and output nodes can be thought of as loosely analogous to biological neurons, albeit greatly simplified, and the connections between nodes can be thought of as in some way reflecting connections between neurons. A longstanding question, outside the scope of the current paper, concerns the degree to which artificial neural networks are biologically plausible.
Most deep learning networks make heavy use of a technique called convolution (LeCun, 1989), which constrains the neural connections in the network such that they innately capture a property known as translational invariance. This is essentially the idea that an object can slide around an image while maintaining its identity; a circle in the top left can be presumed (even absent direct experience) to be the same as a circle in the bottom right.
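The weight-sharing idea can be sketched in one dimension. In the example below, the same small kernel is slid across a signal, so a pattern produces the same peak response wherever it appears. The pattern, the kernel, and the signal length are hypothetical values chosen for illustration, and taking the maximum over positions is an assumed stand-in for the pooling step that real convolutional networks typically use to turn a shifted response into a position-independent one.

```python
def correlate1d(signal, kernel):
    """Slide one shared kernel across the signal (valid positions only)."""
    width = len(kernel)
    return [
        sum(signal[start + offset] * kernel[offset] for offset in range(width))
        for start in range(len(signal) - width + 1)
    ]


PATTERN = [1, 2, 1]  # hypothetical "object"
KERNEL = [1, 2, 1]   # hypothetical feature detector tuned to that object

# The same object placed at the far left and at the far right of the input.
left = PATTERN + [0] * 5
right = [0] * 5 + PATTERN

left_response = correlate1d(left, KERNEL)
right_response = correlate1d(right, KERNEL)

# The response moves with the object, but its peak value is unchanged.
print(left_response, max(left_response))
print(right_response, max(right_response))
```

Because the kernel weights are shared across positions, nothing about the right-hand placement has to be learned separately from the left-hand one.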




大部分深度学习网络大量使用卷积技术(LeCun, 1989),该技术约束网络中的神经连接,使它们本能地捕捉平移不变性(translational invariance)。这本质上就是物体可以围绕图像滑动,同时又保持自己的特征;正如上图中,假设它是左上角的圆,即使缺少直接经验,也可以最终过渡到右下角。

Deep learning is also known for its ability to self-generate intermediate representations, such as internal units that may respond to things like horizontal lines, or more complex elements of pictorial structure.
In principle, given infinite data, deep learning systems are powerful enough to represent any finite deterministic “mapping” between any given set of inputs and a set of corresponding outputs, though in practice whether they can learn such a mapping depends on many factors. One common concern is getting caught in local minima, in which a system gets stuck on a suboptimal solution, with no better solution nearby in the space of solutions being searched. (Experts use a variety of techniques to avoid such problems, to reasonably good effect.) In practice, results with large data sets are often quite good, on a wide range of potential mappings.
In speech recognition, for example, a neural network learns a mapping between a set of speech sounds and a set of labels (such as words or phonemes). In object recognition, a neural network learns a mapping between a set of images and a set of labels (such that, for example, pictures of cars are labeled as cars). In DeepMind’s Atari game system (Mnih et al., 2015), neural networks learned mappings between pixels and joystick positions.




例如,语音识别中,神经网络学习语音集和标签(如单词或音素)集之间的映射。目标识别中,神经网络学习图像集和标签集之间的映射。在 DeepMind 的 Atari 游戏系统(Mnih et al., 2015)中,神经网络学习像素和游戏杆位置之间的映射。
Deep learning systems are most often used as classification systems, in the sense that the mission of a typical network is to decide which of a set of categories (defined by the output units on the neural network) a given input belongs to. With enough imagination, the power of classification is immense; outputs can represent words, places on a Go board, or virtually anything else. In a world with infinite data, and infinite computational resources, there might be little need for any other technique.



3. Limits on the scope of deep learning 深度学习的局限性

Deep learning’s limitations begin with the contrapositive: we live in a world in which data are never infinite. Instead, systems that rely on deep learning frequently have to generalize beyond the specific data that they have seen, whether to a new pronunciation of a word or to an image that differs from one that the system has seen before, and where data are less than infinite, the ability of formal proofs to guarantee high-quality performance is more limited.


As discussed later in this article, generalization can be thought of as coming in two flavors, interpolation between known examples, and extrapolation, which requires going beyond a space of known training examples (Marcus, 1998a). For neural networks to generalize well, there generally must be a large amount of data, and the test data must be similar to the training data, allowing new answers to be interpolated in between old ones. In Krizhevsky et al.’s paper (Krizhevsky, Sutskever, & Hinton, 2012), a nine layer convolutional neural network with 60 million parameters and 650,000 nodes was trained on roughly a million distinct examples drawn from approximately one thousand categories.
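The difference between the two flavors can be made concrete with a toy learner. The sketch below is not a neural network: it uses a nearest-neighbour lookup as an assumed stand-in for any learner that answers new queries from nearby training examples. The target function (y = 2x) and the training range (0 to 10) are hypothetical choices for illustration.

```python
# Training examples of the hypothetical target y = 2x, for x = 0, 1, ..., 10.
TRAINING = [(float(x), 2.0 * x) for x in range(0, 11)]


def predict(x):
    """Answer with the label of the closest training example."""
    _, nearest_y = min(TRAINING, key=lambda pair: abs(pair[0] - x))
    return nearest_y


def error(x):
    """Absolute error against the true target y = 2x."""
    return abs(predict(x) - 2.0 * x)


# Interpolation: the query lies between known examples, so the error is small.
print(error(5.4))
# Extrapolation: the query lies far outside the training range, so the
# learner can only repeat the label of its closest known example.
print(error(50.0))
```

Inside the training range the lookup is never far off; outside it, the error grows with the distance from the nearest known example, which is the pattern the paragraph above describes.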

我们可以把泛化当作已知样本之间的内插和超出已知训练样本空间的数据所需的外插(Marcus, 1998a)。

对于泛化性能良好的神经网络,通常必须有大量数据,测试数据必须与训练数据类似,使新答案内插在旧的数据之中。在克里泽夫斯基等人的论文(Krizhevsky, Sutskever, & Hinton, 2012)中,一个具备 6000 万参数和 65 万节点的 9 层卷积神经网络在来自大概 1000 个类别的约 100 万不同样本上进行训练。

This sort of brute force approach worked well in the very finite world of ImageNet, in which all stimuli can be classified into a comparatively small set of categories. It also works well in stable domains like speech recognition in which exemplars are mapped in a constant way onto a limited set of speech sound categories, but for many reasons deep learning cannot be considered (as it sometimes is in the popular press) as a general solution to artificial intelligence.

Here are ten challenges faced by current deep learning systems:

这种暴力方法在 ImageNet 这种有限数据集上效果不错,所有外部刺激都可以被分类为较小的类别。它在稳定的领域效果也不错,如在语音识别中,数据可以一种常规方式映射到有限的语音类别集中,但是出于很多原因,深度学习不是人工智能的通用解决方案。


3.1 目前深度学习需要大量数据 Deep learning thus far is data hungry

Human beings can learn abstract relationships in a few trials. If I told you that a schmister was a sister over the age of 10 but under the age of 21, perhaps giving you a single example, you could immediately infer whether you had any schmisters, whether your best friend had a schmister, whether your children or parents had any schmisters, and so forth. (Odds are, your parents no longer do, if they ever did, and you could rapidly draw that inference, too.)

In learning what a schmister is, in this case through explicit definition, you rely not on hundreds or thousands or millions of training examples, but on a capacity to represent abstract relationships between algebra-like variables.
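To make "abstract relationships between algebra-like variables" concrete, the sketch below writes the schmister definition as an explicit rule over variables, so it applies at once to any person without a single training example. The `Person` record, the helper names, and the sample family are all hypothetical, and "sister" is simplified here to "female and shares at least one parent".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    age: int
    is_female: bool
    parents: frozenset  # names of this person's parents


def is_sister_of(x, y):
    """True if x is a sister of y (simplified: female, shares a parent)."""
    return x != y and x.is_female and bool(x.parents & y.parents)


def schmisters_of(person, people):
    """A schmister is a sister over the age of 10 but under the age of 21."""
    return [p for p in people if is_sister_of(p, person) and 10 < p.age < 21]


# Hypothetical family used only to exercise the rule.
family = frozenset({"Pat", "Sam"})
alice = Person("Alice", 15, True, family)
bob = Person("Bob", 18, False, family)
carol = Person("Carol", 25, True, family)
dana = Person("Dana", 12, True, frozenset({"Lee", "Kim"}))
people = [alice, bob, carol, dana]

print([p.name for p in schmisters_of(bob, people)])
```

The rule is stated once over the variables `x` and `y` and then holds for every binding of them, which is the kind of generalization the paragraph contrasts with learning from many labeled examples.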

人类只需要少量的尝试就可以学习抽象的关系。如果我告诉你 schmister 是年龄在 10 岁到 21 岁之间的姐妹。可能只需要一个例子,你就可以立刻推出你有没有 schmister,你的好朋友有没有 schmister,你的孩子或父母有没有 schmister 等等。

你不需要成百上千甚至上百万的训练样本,只需要用少量类代数的变量之间的抽象关系,就可以给 schmister 下一个准确的定义。

Humans can learn such abstractions, both through explicit definition and more implicit means (Marcus, 2001). Indeed even 7-month old infants can do so, acquiring learned abstract language-like rules from a small number of unlabeled examples, in just two minutes (Marcus, Vijayan, Bandi Rao, & Vishton, 1999). Subsequent work by Gervain and colleagues (2012) suggests that newborns are capable of similar computations.

Deep learning currently lacks a mechanism for learning abstractions through explicit, verbal definition, and works best when there are thousands, millions or even billions of training examples, as in DeepMind’s work on board games and Atari. As Brenden Lake and his colleagues have recently emphasized in a series of papers, humans are far more efficient in learning complex rules than deep learning systems are (Lake, Salakhutdinov, & Tenenbaum, 2015; Lake, Ullman, Tenenbaum, & Gershman, 2016). (See also related work by George et al (2017), and my own work with Steven Pinker on children’s overregularization errors in comparison to neural networks (Marcus et al., 1992).)

人类可以学习这样的抽象概念,无论是通过准确的定义还是更隐式的手段(Marcus,2001)。实际上即使是 7 月大的婴儿也可以仅在两分钟内从少量的无标签样本中学习抽象的类语言规则(Marcus, Vijayan, Bandi Rao, & Vishton, 1999)。随后由 Gervain 与其同事做出的研究(2012)表明新生儿也能进行类似的学习。


当有数百万甚至数十亿的训练样本的时候,深度学习能达到最好的性能,例如 DeepMind 在棋牌游戏和 Atari 上的研究。

正如 Brenden Lake 和他的同事最近在一系列论文中所强调的,人类在学习复杂规则的效率远高于深度学习系统(Lake, Salakhutdinov, & Tenenbaum, 2015; Lake, Ullman, Tenenbaum, & Gershman, 2016)(也可以参见 George 等人的相关研究工作,2017)。我和 Steven Pinker 在儿童与神经网络的过度规则化误差的对比研究也证明了这一点。

Geoff Hinton has also worried about deep learning’s reliance on large numbers of labeled examples, and expressed this concern in his recent work on capsule networks with his coauthors (Sabour et al., 2017), noting that convolutional neural networks (the most common deep learning architecture) may face “exponential inefficiencies that may lead to their demise. A good candidate is the difficulty that convolutional nets have in generalizing to novel viewpoints [ie perspectives on object in visual recognition tasks]. The ability to deal with translation[al invariance] is built in, but for the other ... [common type of] transformation we have to chose between replicating feature detectors on a grid that grows exponentially ... or increasing the size of the labelled training set in a similarly exponential way.”

In problems where data are limited, deep learning often is not an ideal solution.

Geoff Hinton 也对深度学习依赖大量的标注样本表示担忧,在他最新的 Capsule 网络研究中表达了这个观点,其中指出卷积神经网络可能会遭遇「指数低效」,从而导致网络的失效。




3.2 深度学习目前还是太表浅,没有足够的能力进行迁移 Deep learning thus far is shallow and has limited capacity for transfer

Although deep learning is capable of some amazing things, it is important to realize that the word “deep” in deep learning refers to a technical, architectural property (the large number of hidden layers used in modern neural networks, where their predecessors used only one) rather than a conceptual one (the representations acquired by such networks don’t, for example, naturally apply to abstract concepts like “justice”, “democracy” or “meddling”). Even more down-to-earth concepts like “ball” or “opponent” can lie out of reach.

Consider for example DeepMind’s Atari game work (Mnih et al., 2015) on deep reinforcement learning, which combines deep learning with reinforcement learning (in which a learner tries to maximize reward). Ostensibly, the results are fantastic: the system meets or beats human experts on a large sample of games using a single set of “hyperparameters” that govern properties such as the rate at which a network alters its weights, and no advance knowledge about specific games, or even their rules.

But it is easy to wildly overinterpret what the results show. To take one example, according to a widely-circulated video of the system learning to play the brick-breaking Atari game Breakout, “after 240 minutes of training, [the system] realizes that digging a tunnel through the wall is the most effective technique to beat the game”.



考虑 DeepMind 利用深度强化学习对 Atari 游戏的研究,他们将深度学习和强化学习结合了起来。其成果表面上看起来很棒:该系统使用单个“超参数”集合(控制网络的性质,如学习率)在大量的游戏样本中达到或打败了人类专家,其中系统并没有关于具体游戏的知识,甚至连规则都不知道。

但人们很容易对这个结果进行过度解读。例如,根据一个广泛流传的关于该系统学习玩打砖块 Atari 游戏 Breakout 的视频,“经过 240 分钟的训练,系统意识到把砖墙打通一个通道是获得高分的最高效的技术。”

But the system has learned no such thing; it doesn’t really understand what a tunnel is, or what a wall is; it has just learned specific contingencies for particular scenarios. Transfer tests, in which the deep reinforcement learning system is confronted with scenarios that differ in minor ways from the ones on which the system was trained, show that deep reinforcement learning’s solutions are often extremely superficial. For example, a team of researchers at Vicarious showed that a more efficient successor technique to DeepMind’s Atari system [Asynchronous Advantage Actor-Critic; also known as A3C] failed on a variety of minor perturbations to Breakout (Kansky et al., 2017) from the training set, such as moving the Y coordinate (height) of the paddle, or inserting a wall midscreen. These demonstrations make clear that it is misleading to credit deep reinforcement learning with inducing concepts like wall or paddle; rather, such remarks are what comparative (animal) psychology sometimes calls overattributions. It’s not that the Atari system genuinely learned a concept of wall that was robust but rather the system superficially approximated breaking through walls within a narrow set of highly trained circumstances.

My own team of researchers at a startup company called Geometric Intelligence (later acquired by Uber) found similar results as well, in the context of a slalom game. In 2017, a team of researchers at Berkeley and OpenAI showed that it was not difficult to construct comparable adversarial examples in a variety of games, undermining not only DQN (the original DeepMind algorithm) but also A3C and several other related techniques (Huang, Papernot, Goodfellow, Duan, & Abbeel, 2017).



例如,Vicarious 的研究团队表明 DeepMind 的更高效的进阶技术——Atari 系统「Asynchronous Advantage Actor-Critic」(也叫 A3C)在玩多种 Breakout 的微改动版本(例如改变球拍的 Y 坐标,或在屏幕中央加上一堵墙)时遭遇了失败。

这些反例证明了深度强化学习并不能学习归纳类似砖墙或球拍这样的概念;更准确地说,这样的评论就是生物心理学中的过度归因所造成的。Atari 系统并没有真正地学习到关于砖墙的鲁棒的概念,而只是在高度训练的情景的狭隘集合中表面地打破了砖墙。

Recent experiments by Robin Jia and Percy Liang (2017) make a similar point, in a different domain: language. Various neural networks were trained on a question answering task known as SQuAD (derived from the Stanford Question Answering Database), in which the goal is to highlight the words in a particular passage that correspond to a given question. In one sample, for instance, a trained system correctly, and impressively, identified the quarterback on the winning team of Super Bowl XXXIII as John Elway, based on a short paragraph. But Jia and Liang showed the mere insertion of distractor sentences (such as a fictional one about the alleged victory of Google’s Jeff Dean in another Bowl game) caused performance to drop precipitously. Across sixteen models, accuracy dropped from a mean of 75% to a mean of 36%. As is so often the case, the patterns extracted by deep learning are more superficial than they initially appear.

我在初创公司 Geometric Intelligence 的研究团队(后来被 Uber 收购)的滑雪游戏情景中发现了类似的结果。2017 年,伯克利和 OpenAI 的一个研究团队发现可以轻易地在多种游戏中构造对抗样本,使得 DQN(原始的 DeepMind 算法)、A3C 和其它的相关技术(Huang, Papernot, Goodfellow, Duan, & Abbeel, 2017)都失效。

最近由 Robin Jia 和 Percy Liang(2017)做的实验在不同的领域(语言)表明了类似的观点。他们训练了多种用于问答系统任务(被称为 SQuAD,Stanford Question Answering Database)的神经网络,其中任务的目标是在特定的段落中标出和给定问题相关的单词。

例如,有一个已训练系统可以基于一段短文准确地识别出超级碗 XXXIII 的胜利者是 John Elway。但 jia 和 Liang 表明仅仅插入干扰句(例如宣称谷歌的 Jeff Dean 在另一个杯赛中获得了胜利)就可以让准确率大幅下降。在 16 个模型中,平均准确率从 75% 下降了到了 36%。


3.3 迄今深度学习没有自然方式来处理层级架构 Deep learning thus far has no natural way to deal with hierarchical structure
To a linguist like Noam Chomsky, the troubles Jia and Liang documented would be unsurprising. Fundamentally, most current deep-learning based language models represent sentences as mere sequences of words, whereas Chomsky has long argued that language has a hierarchical structure, in which larger structures are recursively constructed out of smaller components. (For example, in the sentence “the teenager who previously crossed the Atlantic set a record for flying around the world”, the main clause is “the teenager set a record for flying around the world”, while “who previously crossed the Atlantic” is an embedded clause that specifies which teenager.)

In the 80’s Fodor and Pylyshyn (1988) expressed similar concerns, with respect to an earlier breed of neural networks. Likewise, in (Marcus, 2001), I conjectured that simple recurrent networks (SRNs; a forerunner to today’s more sophisticated deep learning based recurrent neural networks, known as RNNs; Elman, 1990) would have trouble systematically representing and extending recursive structure to various kinds of unfamiliar sentences (see the cited articles for more specific claims about which types).

Earlier this year, Brenden Lake and Marco Baroni (2017) tested whether such pessimistic conjectures continued to hold true. As they put it in their title, contemporary neural nets were “Still not systematic after all these years”. RNNs could “generalize well when the differences between training and test ... are small [but] when generalization requires systematic compositional skills, RNNs fail spectacularly”.

对乔姆斯基这样的语言学家来说,对 Robin Jia 和 Percy Liang 记录的难题并不惊讶。基本上,目前大部分深度学习方法基于语言模型来将句子表达为纯粹的词序列。

然而,乔姆斯基一直以来都认为语言具有层级架构,也就是小的部件递归地构建成更大的结构。(例如,在句子「the teenager who previously crossed the Atlantic set a record for flying around the world」中,主句是「the teenager set a record for flying around the world」,「who previously crossed the Atlantic」是指明青年身份的一个从句。)

在上世纪 80 年代,Fodor 和 Pylyshyn(1988)也表达了同样的担忧,这是关于神经网络的一个早期分支。

我在 2001 年的文章中也同样揣测到,单个循环神经网络(SRN;是今天基于循环神经网络(也就是 RNN)的更复杂的深度学习方法的前身;Elman,1990) 难以系统表达、扩展各种不熟悉语句的递归结构(从引用论文查看具体是哪种类型)。

2017 早些时候,Brenden Lake 和 Marco Baroni 测试了这样的悲观揣测是否依然正确。就像他们文章标题中写的,当代神经网络“这么多年来依然不体系”。RNN 可能“在训练与测试差别... 很小的时候泛化很好,但当需要系统地组合技能做泛化时,RNN 极其失败。”

Similar issues are likely to emerge in other domains, such as planning and motor control, in which complex hierarchical structure is needed, particularly when a system is likely to encounter novel situations. One can see indirect evidence for this in the struggles with transfer in Atari games mentioned above, and more generally in the field of robotics, in which systems generally fail to generalize abstract plans well in novel environments.

The core problem, at least at present, is that deep learning learns correlations between sets of features that are themselves “flat” or nonhierarchical, as if in a simple, unstructured list, with every feature on equal footing. Hierarchical structures (e.g., syntactic trees that distinguish between main clauses and embedded clauses in a sentence) are not inherently or directly represented in such systems, and as a result deep learning systems are forced to use a variety of proxies that are ultimately inadequate, such as the sequential position of a word presented in a sequence.
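The contrast between a flat list and a hierarchical representation can be sketched with the teenager sentence used earlier in this section. In the example below, the nested-tuple tree and its labels (`S`, `NP`, `VP`, `RelClause`) are a hypothetical and heavily simplified parse chosen for illustration, not a claim about the correct syntactic analysis.

```python
# Flat representation: a sequence in which every word is on equal footing.
FLAT = (
    "the teenager who previously crossed the Atlantic "
    "set a record for flying around the world"
).split()

# Hierarchical representation: (label, child, child, ...) with strings as leaves.
TREE = (
    "S",
    ("NP", "the teenager", ("RelClause", "who previously crossed the Atlantic")),
    ("VP", "set a record for flying around the world"),
)


def words(node, skip=()):
    """Collect the words under a node, omitting subtrees whose label is in skip."""
    if isinstance(node, str):
        return node.split()
    label, *children = node
    if label in skip:
        return []
    collected = []
    for child in children:
        collected.extend(words(child, skip))
    return collected


# With explicit structure, the main clause falls out by dropping the embedded clause.
main_clause = " ".join(words(TREE, skip=("RelClause",)))
print(main_clause)
```

The flat list contains the same words, but nothing in it marks where the embedded clause begins or ends; a system working only from word positions has to recover that boundary indirectly.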

Systems like Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013) that represent individual words as vectors have been modestly successful; a number of systems that have used clever tricks try to represent complete sentences in deep-learning compatible vector spaces (Socher, Huval, Manning, & Ng, 2012). But, as Lake and Baroni’s experiments make clear, recurrent networks continue to be limited in their capacity to represent and generalize rich structure in a faithful manner.

类似的问题在其他领域也可能会暴露出来,例如规划与运动控制,这些领域都需要复杂的层级结构,特别是遇到全新的环境时。从上面提到的 Atari 游戏 AI 难以迁移问题上,我们可以间接看到这一点。更普遍的是在机器人领域,系统一般不能在全新环境中概括抽象规划。


像 Word2Vec(Mikolov, Chen, Corrado, & Dean, 2013) 这样的系统将单个单词表达为向量,有适当的成功。也有一批系统使用小技巧试图在深度学习可兼容的向量空间中表达完整语句 (Socher, Huval, Manning, & Ng, 2012)。但是,就像 Lake 和 Baroni 的实验表明的,循环网络能力依然有限,不足以准确可靠地表达和概括丰富的结构信息。

3.4 迄今为止的深度学习无法进行开放推理 Deep learning thus far has struggled with open-ended inference

If you can’t represent nuance like the difference between “John promised Mary to leave” and “John promised to leave Mary”, you can’t draw inferences about who is leaving whom, or what is likely to happen next.

Current machine reading systems have achieved some degree of success in tasks like SQuAD, in which the answer to a given question is explicitly contained within a text, but far less success in tasks in which inference goes beyond what is explicit in a text, either by combining multiple sentences (so called multi-hop inference) or by combining explicit sentences with background knowledge that is not stated in a specific text selection. Humans, as they read texts, frequently derive wide-ranging inferences that are both novel and only implicitly licensed, as when they, for example, infer the intentions of a character based only on indirect dialog.

Although Bowman and colleagues (Bowman, Angeli, Potts, & Manning, 2015; Williams, Nangia, & Bowman, 2017) have taken some important steps in this direction, there is, at present, no deep learning system that can draw open-ended inferences based on real-world knowledge with anything like human-level accuracy.

如果你无法搞清“John promised Mary to leave”和“John promised to leave Mary”之间的区别,你就不能分清是谁离开了谁,以及接下来会发生什么。

目前的机器阅读系统已经在一些任务,如 SQuAD 上取得了某种程度的成功,其中对于给定问题的答案被明确地包含在文本中,或者整合在多个句子中(被称为多级推理)或整合在背景知识的几个明确的句子中,但并没有标注特定的文本。对于人类来说,我们在阅读文本时经常可以进行广泛的推理,形成全新的、隐含的思考,例如仅仅通过对话就能确定角色的意图。

尽管 Bowman 等人(Bowman,Angeli,Potts & Manning,2015;Williams,Nangia & Bowman,2017)在这一方向上已经采取了一些重要步骤,但目前来看,没有深度学习系统可以基于真实世界的知识进行开放式推理,并达到人类级别的准确性。

3.5 迄今为止的深度学习不够透明 Deep learning thus far is not sufficiently transparent

The relative opacity of “black box” neural networks has been a major focus of discussion in the last few years (Samek, Wiegand, & Müller, 2017; Ribeiro, Singh, & Guestrin, 2016). In their current incarnation, deep learning systems have millions or even billions of parameters, identifiable to their developers not in terms of the sort of human-interpretable labels that canonical programmers use (“last_character_typed”) but only in terms of their geography within a complex network (e.g., the activity value of the ith node in layer j in network module k).

Although some strides have been made in visualizing the contributions of individual nodes in complex networks (Nguyen, Clune, Bengio, Dosovitskiy, & Yosinski, 2016), most observers would acknowledge that neural networks as a whole remain something of a black box. How much that matters in the long run remains unclear (Lipton, 2016). If systems are robust and self-contained enough it might not matter; if it is important to use them in the context of larger systems, it could be crucial for debuggability.
The transparency issue, as yet unsolved, is a potential liability when using deep learning for problem domains like financial trades or medical diagnosis, in which human users might like to understand how a given system made a given decision. As Catherine O’Neill (2016) has pointed out, such opacity can also lead to serious issues of bias.

神经网络“黑箱”的特性一直是过去几年人们讨论的重点(Samek、Wiegand & Müller,2017;Ribeiro、Singh & Guestrin,2016)。

在目前的典型状态里,深度学习系统有数百万甚至数十亿参数,其开发者可识别形式并不是常规程序员使用的(「last_character_typed」)人类可识别标签,而是仅在一个复杂网络中适用的地理形式(如网络模块 k 中第 j 层第 i 个节点的活动值)。

尽管通过可视化工具,我们可以在复杂网络中看到节点个体的贡献(Nguyen、Clune、Bengio、Dosovitskiy & Yosinski,2016),但大多数观察者都认为,神经网络整体看来仍然是一个黑箱。


透明度问题尚未解决,这对于在金融交易或医疗诊断等领域使用深度学习是一个潜在的隐患,因为人类用户可能希望了解系统是如何做出某个决策的。正如 Catherine O’Neill(2016)指出的,这种不透明也会导致严重的偏见问题。

3.6 迄今为止,深度学习并没有很好地与先验知识相结合 Deep learning thus far has not been well integrated with prior knowledge
The dominant approach in deep learning is hermeneutic, in the sense of being self-contained and isolated from other, potentially useful knowledge. Work in deep learning typically consists of finding a training database, sets of inputs associated with respective outputs, and learning all that is required for the problem by learning the relations between those inputs and outputs, using whatever clever architectural variants one might devise, along with techniques for cleaning and augmenting the data set. With just a handful of exceptions, such as LeCun’s convolutional constraint on how neural networks are wired (LeCun, 1989), prior knowledge is often deliberately minimized.

Thus, for example, in a system like Lerer et al’s (2016) efforts to learn about the physics of falling towers, there is no prior knowledge of physics (beyond what is implied in convolution). Newton’s laws, for example, are not explicitly encoded; the system instead (to some limited degree) approximates them by learning contingencies from raw, pixel-level data. As I note in a forthcoming paper on innateness (Marcus, in prep), researchers in deep learning appear to have a very strong bias against including prior knowledge even when (as in the case of physics) that prior knowledge is well known.

It is also not straightforward in general how to integrate prior knowledge into a deep learning system, in part because the knowledge represented in deep learning systems pertains mainly to (largely opaque) correlations between features, rather than to abstractions like quantified statements (e.g. all men are mortal; see the discussion of universally-quantified one-to-one mappings in Marcus (2001)) or generics (violable statements like dogs have four legs or mosquitos carry West Nile virus (Gelman, Leslie, Was, & Koch, 2015)).



除了少数几个例外(如 LeCun 对神经网络连接方式施加的卷积约束(LeCun,1989)),先验知识往往被有意地最小化了。

因此,例如 Lerer 等人(2016)提出的系统学习从塔上掉落物体的物理性质,在此之上并没有物理学的先验知识(除卷积中所隐含的内容之外)。


一般来说,将先验知识整合到深度学习系统中并不简单:一部分是因为深度学习系统中的知识表征主要是特征之间的关系(大部分还是不透明的),而非抽象的量化陈述(如凡人终有一死),参见普遍量化一对一映射的讨论(Marcus,2001),或 generics(可违反的声明,如狗有四条腿或蚊子携带尼罗河病毒(Gelman、Leslie、Was & Koch,2015))。

A related problem stems from a culture in machine learning that emphasizes competition on problems that are inherently self-contained, with little need for broad general knowledge. This tendency is well exemplified by the machine learning contest platform known as Kaggle, in which contestants vie for the best results on a given data set. Everything they need for a given problem is neatly packaged, with all the relevant input and output files. Great progress has been made in this way; speech recognition and some aspects of image recognition can be largely solved in the Kaggle paradigm.

The trouble, however, is that life is not a Kaggle competition; children don’t get all the data they need neatly packaged in a single directory. Real-world learning offers data much more sporadically, and problems aren’t so neatly encapsulated. Deep learning works great on problems like speech recognition in which there are lots of labeled examples, but scarcely anyone even knows how to apply it to more open-ended problems. What’s the best way to fix a bicycle that has a rope caught in its spokes? Should I major in math or neuroscience? No training set will tell us that.

这个问题根植于机器学习文化中,强调系统需要自成一体并具有竞争力——不需要哪怕是一点先验的通用知识。Kaggle 机器学习竞赛平台正是这一现象的注解,参赛者争取在给定数据集上获取特定任务的最佳结果。任意给定问题所需的信息都被整齐地封装好,其中包含相关的输入和输出文件。在这种范式下我们已经有了很大的进步(主要在图像识别和语音识别领域中)。

问题在于,当然,生活并不是一场 Kaggle 竞赛;孩子们并不会把所有数据整齐地打包进一个目录里。真实世界中我们需要学习更为零散的数据,问题并没有如此整齐地封装起来。


Problems that have less to do with categorization and more to do with commonsense reasoning essentially lie outside the scope of what deep learning is appropriate for, and so far as I can tell, deep learning has little to offer such problems. In a recent review of commonsense reasoning, Ernie Davis and I (2015) began with a set of easily-drawn inferences that people can readily answer without anything like direct training, such as: Who is taller, Prince William or his baby son Prince George? Can you make a salad out of a polyester shirt? If you stick a pin into a carrot, does it make a hole in the carrot or in the pin?

与分类离得越远,与常识离得越近的问题就越不能被深度学习来解决。在近期对于常识的研究中,我和 Ernie Davis(2015)开始,从一系列易于得出的推论开始进行研究,如威廉王子和他的孩子乔治王子谁更高?你可以用聚酯衬衫来做沙拉吗?如果你在胡萝卜上插一根针,是胡萝卜上有洞还是针上有洞?

As far as I know, nobody has even tried to tackle this sort of thing with deep learning.
Such apparently simple problems require humans to integrate knowledge across vastly disparate sources, and as such are a long way from the sweet spot of deep learning-style perceptual classification. Instead, they are perhaps best thought of as a sign that entirely different sorts of tools are needed, along with deep learning, if we are to reach human-level cognitive flexibility .



3.7 到目前为止,深度学习还不能从根本上区分因果关系和相关关系 Deep learning thus far cannot inherently distinguish causation from correlation

If it is a truism that causation does not equal correlation, the distinction between the two is also a serious concern for deep learning. Roughly speaking, deep learning learns complex correlations between input and output features, but with no inherent representation of causality. A deep learning system can easily learn that height and vocabulary are, across the population as a whole, correlated, but less easily represent the way in which that correlation derives from growth and development (kids get bigger as they learn more words, but that doesn’t mean that growing tall causes them to learn more words, nor that learning new words causes them to grow). Causality has been a central strand in some other approaches to AI (Pearl, 2000) but, perhaps because deep learning is not geared towards such challenges, relatively little work within the deep learning tradition has tried to address it.[9]




因果关系在其它一些用于人工智能的方法中一直是核心因素(Pearl, 2000),但也许是因为深度学习的目标并非这些难题,所以深度学习领域传统上在解决这一难题上的研究工作相对较少。[9]
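
The height-and-vocabulary point can be made concrete with a small simulation. This is only a sketch under made-up assumptions: the generative story (age drives both variables), every numeric constant, and the helper names `pearson` and `residuals` are invented for illustration and do not come from the paper. It shows a strong correlation between two variables that do not influence each other, which largely disappears once the common cause is controlled for.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing its best linear fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(0)
# Hypothetical generative story: age drives both height and vocabulary;
# neither of the two influences the other. All constants are made up.
age = [rng.uniform(2, 12) for _ in range(5000)]        # years
height = [80 + 6 * a + rng.gauss(0, 5) for a in age]   # centimetres
vocab = [1000 * a + rng.gauss(0, 1500) for a in age]   # words known

raw = pearson(height, vocab)
controlled = pearson(residuals(height, age), residuals(vocab, age))
print(f"height vs vocabulary:             r = {raw:.2f}")
print(f"same, after controlling for age:  r = {controlled:.2f}")
```

A learner that sees only the `height` and `vocab` columns has no way to tell this situation apart from one in which height really does drive vocabulary; the distinction lives in the causal structure, not in the observed correlation.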

3.8 深度学习假设世界是大体稳定的,而这种假设可能存在问题 Deep learning presumes a largely stable world, in ways that may be problematic

The logic of deep learning is such that it is likely to work best in highly stable worlds, like the board game Go, which has unvarying rules, and less well in systems such as politics and economics that are constantly changing. To the extent that deep learning is applied in tasks such as stock prediction, there is a good chance that it will eventually face the fate of Google Flu Trends, which initially did a great job of predicting epidemiological data from search trends, only to completely miss things like the peak of the 2013 flu season (Lazer, Kennedy, King, & Vespignani, 2014).


就算把深度学习应用于股票预测等任务,它很有可能也会遭遇谷歌流感趋势(Google Flu Trends)那样的命运;谷歌流感趋势一开始根据搜索趋势能很好地预测流行病学数据,但却完全错过了 2013 年流感季等事件(Lazer, Kennedy, King, & Vespignani, 2014)。
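
The stability assumption can be illustrated with a deliberately simple model. The sketch below is not a reconstruction of Google Flu Trends: the data are synthetic, the slopes 2.0 and 0.5 are arbitrary, and `fit_line`, `mean_abs_error` and `make_period` are hypothetical helper names. A line is fitted while the world follows one rule and is then evaluated after the rule has changed.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def mean_abs_error(model, xs, ys):
    """Average absolute prediction error of a (slope, intercept) model."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)

def make_period(slope, n=500):
    """Synthetic observations from a world where y = slope * x + noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [slope * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

old_x, old_y = make_period(slope=2.0)  # the world the model was trained in
new_x, new_y = make_period(slope=0.5)  # the same world after the rules changed

model = fit_line(old_x, old_y)
err_old = mean_abs_error(model, old_x, old_y)
err_new = mean_abs_error(model, new_x, new_y)
print(f"error while the world is stable: {err_old:.2f}")
print(f"error after the world changes:   {err_new:.2f}")
```

Nothing in the fitted model signals that its assumptions have stopped holding; the failure only becomes visible once new outcomes are compared with its predictions.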

3.9 到目前为止,深度学习只是一种良好的近似,其答案并不完全可信 Deep learning thus far works well as an approximation, but its answers often cannot be fully trusted

In part as a consequence of the other issues raised in this section, deep learning systems are quite good at some large fraction of a given domain, yet easily fooled. An ever-growing array of papers has shown this vulnerability, from the linguistic examples of Jia and Liang mentioned above to a wide range of demonstrations in the domain of vision, where deep learning systems have mistaken yellow-and-black patterns of stripes for school buses (Nguyen, Yosinski, & Clune, 2014) and sticker-clad parking signs for well-stocked refrigerators (Vinyals, Toshev, Bengio, & Erhan, 2014) in the context of a captioning system that otherwise seems impressive. More recently, there have been real-world stop signs, lightly defaced, that have been mistaken for speed limit signs (Evtimov et al., 2017) and 3D-printed turtles that have been mistaken for rifles (Athalye, Engstrom, Ilyas, & Kwok, 2017). A recent news story recounts the trouble a British police system has had in distinguishing nudes from sand dunes.[10] The “spoofability” of deep learning systems was perhaps first noted by Szegedy et al. (2013). Four years later, despite much active research, no robust solution has been found.


越来越多的论文都表明了这一缺陷,从前文提及的 Jia 和 Liang 给出的语言学案例到视觉领域的大量示例,比如有深度学习的图像描述系统将黄黑相间的条纹图案误认为校车(Nguyen, Yosinski, & Clune, 2014),将贴了贴纸的停车标志误认为装满东西的冰箱(Vinyals, Toshev, Bengio, & Erhan, 2014),而其它情况则看起来表现良好。

最近还有真实世界的停止标志在稍微修饰之后被误认为限速标志的案例(Evtimov et al., 2017),还有 3D 打印的乌龟被误认为步枪的情况(Athalye, Engstrom, Ilyas, & Kwok, 2017)。最近还有一条新闻说英国警方的一个系统难以分辨裸体和沙丘。[10]

最早提出深度学习系统的“可欺骗性(spoofability)”的论文可能是 Szegedy et al(2013)。四年过去了,尽管研究活动很活跃,但目前仍未找到稳健的解决方法。
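
One intuition for why such fooling is possible can be shown with a toy linear classifier. This is a sketch under loud assumptions: it is not a deep network and not the procedure used in any of the papers cited above, the weights and input are invented, and `score`, `classify` and `perturb` are hypothetical names. The perturbation follows the sign of the gradient, in the spirit of the fast gradient sign method; the point is that many tiny per-feature changes add up when the input has many dimensions.

```python
import random

def score(weights, x):
    """Linear decision score; positive means class A."""
    return sum(w * xi for w, xi in zip(weights, x))

def classify(weights, x):
    return "class A" if score(weights, x) > 0 else "class B"

def sign(v):
    return (v > 0) - (v < 0)

def perturb(weights, x, eps):
    """Shift every feature by at most eps in the direction that lowers the score.

    For a linear model the gradient of the score with respect to the input
    is simply the weight vector, so its sign gives the direction.
    """
    return [xi - eps * sign(w) for w, xi in zip(weights, x)]

rng = random.Random(0)
# Hypothetical 100-feature model and an input it classifies as class A.
weights = [rng.choice([-1.0, 1.0]) for _ in range(100)]
x = [0.02 * w for w in weights]

adversarial = perturb(weights, x, eps=0.05)
largest_change = max(abs(a - b) for a, b in zip(adversarial, x))

print(classify(weights, x), round(score(weights, x), 2))
print(classify(weights, adversarial), round(score(weights, adversarial), 2))
print("largest change to any single feature:", round(largest_change, 3))
```

No single feature moves by more than 0.05, yet the label flips, because one hundred small shifts all push the score the same way.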

3.10 到目前为止,深度学习还难以在工程中使用 Deep learning thus far is difficult to engineer with

Another fact that follows from all the issues raised above is that it is simply hard to do robust engineering with deep learning. As a team of authors at Google put it in 2014, in the title of an important, and as yet unanswered, essay (Sculley, Phillips, Ebner, Chaudhary, & Young, 2014), machine learning is “the high-interest credit card of technical debt”, meaning that it is comparatively easy to make systems that work in some limited set of circumstances (short term gain), but quite difficult to guarantee that they will work in alternative circumstances with novel data that may not resemble previous training data (long term debt, particularly if one system is used as an element in another larger system).
In an important talk at ICML, Leon Bottou (2015) compared machine learning to the development of an airplane engine, and noted that while the airplane design relies on building complex systems out of simpler systems for which it was possible to create sound guarantees about performance, machine learning lacks the capacity to produce comparable guarantees. As Google’s Peter Norvig (2016) has noted, machine learning as yet lacks the incrementality, transparency and debuggability of classical programming, trading off a kind of simplicity for deep challenges in achieving robustness.


正如谷歌一个研究团队在 2014 年一篇重要但仍未得到解答的论文(Sculley, Phillips, Ebner, Chaudhary, & Young, 2014)的标题中说的那样:机器学习是「高利息的技术债务信用卡」,意思是说机器学习在打造可在某些有限环境中工作的系统方面相对容易(短期效益),但要确保它们也能在具有可能不同于之前训练数据的全新数据的其它环境中工作却相当困难(长期债务,尤其是当一个系统被用作另一个更大型的系统组成部分时)。

Leon Bottou (2015) 在 ICML 的一个重要演讲中将机器学习与飞机引擎开发进行了比较。他指出尽管飞机设计依靠的是使用更简单的系统构建复杂系统,但仍有可能确保得到可靠的结果,机器学习则缺乏得到这种保证的能力。

正如谷歌的 Peter Norvig 在 2016 年指出的那样,目前机器学习还缺乏传统编程的渐进性、透明性和可调试性,要实现深度学习的稳健,需要在简洁性方面做一些权衡。
Henderson and colleagues have recently extended these points, with a focus on deep reinforcement learning, noting some serious issues in the field related to robustness and replicability (Henderson et al., 2017).
Although there has been some progress in automating the process of developing machine learning systems (Zoph, Vasudevan, Shlens, & Le, 2017), there is a long way to go.

Henderson 及其同事最近围绕深度强化学习对这些观点进行了延展,他们指出这一领域面临着一些与稳健性和可复现性相关的严重问题(Henderson et al., 2017)。

尽管在机器学习系统的开发过程的自动化方面存在一些进展(Zoph, Vasudevan, Shlens, & Le, 2017),但还仍有很长的路要走。

3.11 讨论 Discussion
Of course, deep learning is, by itself, just mathematics; none of the problems identified above are because the underlying mathematics of deep learning are somehow flawed. In general, deep learning is a perfectly fine way of optimizing a complex system for representing a mapping between inputs and outputs, given a sufficiently large data set.
The real problem lies in misunderstanding what deep learning is, and is not, good for. The technique excels at solving closed-end classification problems, in which a wide range of potential signals must be mapped onto a limited number of categories, given that there is enough data available and the test set closely resembles the training set.

But deviations from these assumptions can cause problems; deep learning is just a statistical technique, and all statistical techniques suffer from deviation from their assumptions. Deep learning systems work less well when there are limited amounts of training data available, or when the test set differs importantly from the training set, or when the space of examples is broad and filled with novelty. And some problems cannot, given real world limitations, be thought of as classification problems at all. Open-ended natural language understanding, for example, should not be thought of as a classifier mapping between a large finite set of sentences and a large, finite set of sentences, but rather as a mapping between a potentially infinite range of input sentences and an equally vast array of meanings, many never previously encountered. In a problem like that, deep learning becomes a square peg slammed into a round hole, a crude approximation when there must be a solution elsewhere.





One clear way to get an intuitive sense of why something is amiss is to consider a set of experiments I did long ago, in 1997, when I tested some simplified aspects of language development on a class of neural networks that were then popular in cognitive science.
The 1997-vintage networks were, to be sure, simpler than current models: they used no more than three layers (input nodes connected to hidden nodes connected to output nodes), and lacked LeCun’s powerful convolution technique. But they were driven by backpropagation just as today’s systems are, and just as beholden to their training data.
In language, the name of the game is generalization: once I hear a sentence like John pilked a football to Mary, I can infer that it is also grammatical to say John pilked Mary the football, and Eliza pilked the ball to Alec; equally, if I can infer what the word pilk means, I can infer what the latter sentences would mean, even if I had not heard them before.

通过考虑我在多年前(1997)做过的一系列实验,可以获得对当前存在错误的直观理解,当时我在一类神经网络(之后在认知科学中变得很流行)上测试了语言开发的一些简单层面。这种网络比现今的模型要更简单,他们使用的层不大于三个(1 个输入层、1 个隐藏层、1 个输出层),并且没有使用卷积技术。他们也使用了反向传播技术。

在语言中,这个问题被称为泛化(generalization)。当我听到了一个句子“John pilked a football to Mary”,我可以从语法上推断“John pilked Mary the football”,如果我知道了 pilk 是什么意思,我就可以推断一个新句子“Eliza pilked the ball to Alec”的含义,即使是第一次听到。
Distilling the broad-ranging problems of language down to a simple example that I believe still has resonance now, I ran a series of experiments in which I trained three-layer perceptrons (fully connected in today’s technical parlance, with no convolution) on the identity function, f(x) = x, e.g., f(12) = 12.
Training examples were represented by a set of input nodes (and corresponding output nodes) that represented numbers in terms of binary digits. The number 7, for example, would be represented by turning on the input (and output) nodes representing 4, 2, and 1.
As a test of generalization, I trained the network on various sets of even numbers, and tested it on all possible inputs, both odd and even.
Every time I ran the experiment, using a wide variety of parameters, the results were the same: the network would (unless it got stuck in a local minimum) correctly apply the identity function to the even numbers that it had seen before (say 2, 4, 8 and 12), and to some other even numbers (say 6 and 14), but fail on all the odd numbers, yielding, for example, f(15) = 14.

我相信将语言的大量问题提取为简单的例子在目前仍然受到关注,我在恒等函数 f(x) = x 上运行了一系列训练三层感知机(全连接、无卷积)的实验。

训练样本被表征二进制数字的输入节点(以及相关的输出节点)进行表征。例如数字 7,在输入节点上被表示为 4、2 和 1。为了测试泛化能力,我用多种偶数集训练了网络,并用奇数和偶数输入进行了测试。

我使用了多种参数进行了实验,结果输出都是一样的:网络可以准确地应用恒等函数到训练过的偶数上(除非只达到局部最优),以及一些其它的偶数,但应用到所有的奇数上都遭遇了失败,例如 f(15)=14。
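
The experiment described above can be sketched in a few dozen lines. What follows is a minimal reconstruction under stated assumptions, not the original 1997 code: the 4-bit width, the 8 hidden units, sigmoid units, the cross-entropy loss, the learning rate, the zero-initialized output weights, and the names `encode`, `decode`, `forward`, `train` and `predict` are all choices made here for illustration.

```python
import math
import random

N_BITS = 4  # numbers 0..15, one node per binary digit

def encode(n):
    """Binary code, most significant bit first: 7 -> [0, 1, 1, 1]."""
    return [(n >> i) & 1 for i in reversed(range(N_BITS))]

def decode(bits):
    """Inverse of encode."""
    return sum(int(b) << i for i, b in zip(reversed(range(N_BITS)), bits))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Input -> hidden -> output, fully connected, sigmoid units."""
    w1, b1, w2, b2 = params
    hidden = range(len(b1))
    h = [sigmoid(b1[j] + sum(x[i] * w1[i][j] for i in range(N_BITS))) for j in hidden]
    o = [sigmoid(b2[k] + sum(h[j] * w2[j][k] for j in hidden)) for k in range(N_BITS)]
    return h, o

def train(numbers, hidden=8, lr=0.5, steps=2000, seed=0):
    """Full-batch gradient descent on cross-entropy, with target f(x) = x."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 0.5) for _ in range(hidden)] for _ in range(N_BITS)]
    b1 = [0.0] * hidden
    w2 = [[0.0] * N_BITS for _ in range(hidden)]  # zero start: an arbitrary choice
    b2 = [0.0] * N_BITS
    params = (w1, b1, w2, b2)
    data = [encode(n) for n in numbers]
    scale = lr / len(data)
    for _ in range(steps):
        # Collect every example's gradient first, then apply them together.
        grads = []
        for x in data:
            h, o = forward(params, x)
            d_o = [o[k] - x[k] for k in range(N_BITS)]  # error at the output logits
            d_h = [h[j] * (1 - h[j]) * sum(d_o[k] * w2[j][k] for k in range(N_BITS))
                   for j in range(hidden)]
            grads.append((x, h, d_o, d_h))
        for x, h, d_o, d_h in grads:
            for j in range(hidden):
                b1[j] -= scale * d_h[j]
                for i in range(N_BITS):
                    w1[i][j] -= scale * x[i] * d_h[j]
                for k in range(N_BITS):
                    w2[j][k] -= scale * h[j] * d_o[k]
            for k in range(N_BITS):
                b2[k] -= scale * d_o[k]
    return params

def predict(params, n):
    """Feed n through the network and read the thresholded output as a number."""
    _, o = forward(params, encode(n))
    return decode([v > 0.5 for v in o])

if __name__ == "__main__":
    params = train(range(0, 16, 2))  # even numbers only
    for n in range(16):
        print(f"f({n}) = {predict(params, n)}")
```

Because the output node for the 1s place only ever sees a target of 0 during training, every update in this sketch pushes its bias and incoming weights downward, so it answers 0 for odd inputs as well and every odd input is mapped to an even output. Whether the network reproduces each even number exactly depends on the hyperparameters chosen.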
In general, the neural nets I tested could learn their training examples, and interpolate to a set of test examples that were in a cloud of points around those examples in n-dimensional space (which I dubbed the training space), but they could not extrapolate beyond that training space.
Odd numbers were outside the training space, and the networks could not generalize identity outside that space. Adding more hidden units didn’t help, nor did adding more hidden layers. Simple multilayer perceptrons simply couldn’t generalize outside their training space (Marcus, 1998a; Marcus, 1998b; Marcus, 2001). (Chollet makes quite similar points in the closing chapters of his text (Chollet, 2017).)
What we have seen in this paper is that challenges in generalizing beyond a space of training examples persist in current deep learning networks, nearly two decades later.
Many of the problems reviewed in this paper (the data hungriness, the vulnerability to fooling, the problems in dealing with open-ended inference and transfer) can be seen as an extension of this fundamental problem. Contemporary neural networks do well on challenges that remain close to their core training data, but start to break down on cases further out in the periphery.

大体上,我测试过的神经网络都可以从训练样本中学习,并可以在 n 维空间(即训练空间)中泛化到这些样本近邻的点集,但它们不能推断出超越该训练空间的结果。

奇数位于该训练空间之外,网络无法将恒等函数泛化到该空间之外。即使添加更多的隐藏单元或者更多的隐藏层也没用。简单的多层感知机不能泛化到训练空间之外(Marcus, 1998a; Marcus, 1998b; Marcus, 2001)。

上述就是当前深度学习网络中的泛化挑战,可能会存在二十年。本文提到的很多问题——数据饥饿(data hungriness)、应对愚弄的脆弱性、处理开放式推断和迁移的问题,都可以看作是这个基本问题的扩展。当代神经网络在与核心训练数据接近的数据上泛化效果较好,但是在与训练样本差别较大的数据上的泛化效果就开始崩塌。
The widely-adopted addition of convolution guarantees that one particular class of problems that are akin to my identity problem can be solved: so-called translational invariances, in which an object retains its identity when it is shifted to a new location. But the solution is not general, as for example Lake’s recent demonstrations show. (Data augmentation offers another way of dealing with deep learning’s challenges in extrapolation, by trying to broaden the space of training examples itself, but such techniques are more useful in 2D vision than in language.)
As yet there is no general solution within deep learning to the problem of generalizing outside the training space. And it is for that reason, more than any other, that we need to look to different kinds of solutions if we want to reach artificial general intelligence.

广泛应用的卷积确保特定类别的问题(与我的身份问题类似)的解决:所谓的平移不变性,物体在位置转换后仍然保持自己的身份。但是该解决方案并不适用于所有问题,比如 Lake 近期的展示。(数据增强通过扩展训练样本的空间,提供另一种解决深度学习外插挑战的方式,但是此类技术在 2d 版本中比在语言中更加有效。)
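
The translational invariance mentioned above can be seen in a toy one-dimensional case. This is a sketch with invented values and hypothetical helper names (`correlate`, `detector`, `place`): sliding one shared filter over the input yields a response that moves with the pattern, and taking the maximum over positions (global max pooling) then gives the same answer wherever the pattern sits.

```python
def correlate(signal, kernel):
    """Slide the kernel over the signal (fully overlapping positions only)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def detector(signal, kernel):
    """Shared filter followed by global max pooling: is the pattern anywhere?"""
    return max(correlate(signal, kernel))

def place(pattern, position, length=12):
    """An otherwise empty signal with the pattern written at the given position."""
    signal = [0] * length
    signal[position:position + len(pattern)] = pattern
    return signal

pattern = [1, 2, 1]
kernel = [1, 2, 1]  # a filter tuned to the pattern (made-up values)

responses = [detector(place(pattern, p), kernel) for p in range(10)]
print(responses)
```

The response is identical at every position because the same weights are reused across locations. Nothing comparable is built in for the even-to-odd generalization in the identity experiment, which is why this solution does not carry over.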


4 Potential risks of excessive hype 过度炒作的潜在风险

One of the biggest risks in the current overhyping of AI is another AI winter, such as the one that devastated the field in the 1970s, after the Lighthill report (Lighthill, 1973) suggested that AI was too brittle, too narrow and too superficial to be used in practice. Although there are vastly more practical applications of AI now than there were in the 1970s, hype is still a major concern. When a high-profile figure like Andrew Ng writes in the Harvard Business Review promising a degree of imminent automation that is out of step with reality, there is fresh risk for seriously dashed expectations. Machines cannot in fact do many things that ordinary humans can do in a second, ranging from reliably comprehending the world to understanding sentences. No healthy human being would ever mistake a turtle for a rifle or a parking sign for a refrigerator.

Executives investing massively in AI may turn out to be disappointed, especially given the poor state of the art in natural language understanding. Already, some major projects have been largely abandoned, like Facebook’s M project, which was launched in August 2015 with much publicity as a general purpose personal assistant, and then later[13] downgraded to a significantly smaller role, helping users with a vastly smaller range of well-defined tasks such as calendar entry.

当前 AI 过度炒作的一个最大风险是再一次经历 AI 寒冬,就像 1970 年代那样。尽管现在的 AI 应用比 1970 年代多得多,但炒作仍然是主要担忧。


大量投资 AI 的人最后可能会失望,尤其是自然语言处理领域。一些大型项目已经被放弃,如 Facebook 的 M 计划,该项目于 2015 年 8 月启动,宣称要打造通用个人虚拟助手,后来其定位下降为帮助用户执行少数定义明确的人物,如日历记录。

It is probably fair to say that chatbots in general have not lived up to the hype they received a couple of years ago. If, for example, driverless cars should also disappoint, relative to their early hype, by proving unsafe when rolled out at scale, or simply not achieving full autonomy after many promises, the whole field of AI could be in for a sharp downturn, both in popularity and funding. We already may be seeing hints of this, as in a just published Wired article that was entitled “After peak hype, self-driving cars enter the trough of disillusionment.”[14]

There are other serious fears, too, and not just of the apocalyptic variety (which for now still seems to be the stuff of science fiction). My own largest fear is that the field of AI could get trapped in a local minimum, dwelling too heavily in the wrong part of intellectual space, focusing too much on the detailed exploration of a particular class of accessible but limited models that are geared around capturing low-hanging fruit, potentially neglecting riskier excursions that might ultimately lead to a more robust path.

I am reminded of Peter Thiel’s famous (if now slightly outdated) damning of an often too-narrowly focused tech industry: “We wanted flying cars, instead we got 140 characters”. I still dream of Rosie the Robot, a full-service domestic robot that takes care of my home; but for now, six decades into the history of AI, our bots do little more than play music, sweep floors, and bid on advertisements.

举例来说,如果无人驾驶汽车在大规模推广后被证明不安全,或者仅仅是没有达到很多承诺中所说的全自动化,让大家失望(与早期炒作相比),那么整个 AI 领域可能会迎来大滑坡,不管是热度还是资金方面。我们或许已经看到苗头,正如 Wired 最近发布的文章《After peak hype, self-driving cars enter the trough of disillusionment》中所说的那样(https://www.wired.com/story/self-driving-cars-challenges/)。


我自己最大的担忧是 AI 领域可能会陷入局部极小值陷阱,过分沉迷于智能空间的错误部分,过于专注于探索可用但存在局限的模型,热衷于摘取易于获取的果实,而忽略更有风险的「小路」,它们或许最终可以带来更稳健的发展路径。

我想起了 Peter Thiel 的著名言论:“我们想要一辆会飞的汽车,得到的却是 140 个字符。”我仍然梦想着 Rosie the Robot 这种提供全方位服务的家用机器人,但是现在,AI 六十年历史中,我们的机器人还是只能玩音乐、扫地和广告竞价。

If we didn’t make more progress, it would be a shame. AI comes with risk, but also great potential rewards. AI’s greatest contributions to society, I believe, could and should ultimately come in domains like automated scientific discovery, leading among other things towards vastly more sophisticated versions of medicine than are currently possible. But to get there we need to make sure that the field as a whole doesn’t first get stuck in a local minimum.

没有进步就是耻辱。AI 有风险,也有巨大的潜力。我认为 AI 对社会的最大贡献最终应该出现在自动科学发现等领域。但是要想获得成功,首先必须确保该领域不会陷于局部极小值。

5 What would be better? 什么会更好?

Despite all of the problems I have sketched, I don’t think that we need to abandon deep learning. Rather, we need to reconceptualize it: not as a universal solvent, but simply as one tool among many, a power screwdriver in a world in which we also need hammers, wrenches, and pliers, not to mention chisels and drills, voltmeters, logic probes, and oscilloscopes. In perceptual classification, where vast amounts of data are available, deep learning is a valuable tool; in other, richer cognitive domains, it is often far less satisfactory. The question is, where else should we look? Here are four possibilities.

尽管我勾画了这么多的问题,但我不认为我们需要放弃深度学习。相反,我们需要对其进行重新概念化:它不是一个普遍的解决办法,而仅仅只是众多工具中的一个。它是一把电动螺丝刀,而在这个世界里我们还需要锤子、扳手和钳子,更不用说凿子、钻头、电压表、逻辑探头和示波器了。


5.1 无监督学习 Unsupervised learning

In interviews, deep learning pioneers Geoff Hinton and Yann LeCun have both recently pointed to unsupervised learning as one key way in which to go beyond supervised, data-hungry versions of deep learning.
To be clear, deep learning and unsupervised learning are not in logical opposition. Deep learning has mostly been used in a supervised context with labeled data, but there are ways of using deep learning in an unsupervised fashion. But there are certainly reasons in many domains to move away from the massive demands on data that supervised deep learning typically requires.

最近,深度学习先驱 Geoffrey Hinton 和 Yann LeCun 都在采访中指出,无监督学习是超越有监督的、数据饥渴的深度学习的一条关键途径。

Unsupervised learning, as the term is commonly used, tends to refer to several kinds of systems. One common type of system “clusters” together inputs that share properties, even without having them explicitly labeled. Google’s cat detector model (Le et al., 2012) is perhaps the most publicly prominent example of this sort of approach.

Another approach, advocated by researchers such as Yann LeCun (Luc, Neverova, Couprie, Verbeek, & LeCun, 2017), and not mutually exclusive with the first, is to replace labeled data sets with things like movies that change over time. The intuition is that systems trained on videos can use each pair of successive frames as a kind of ersatz teaching signal, in which the goal is to predict the next frame; frame t becomes a predictor for frame t+1, without the need for any human labeling.

无监督学习是一个常用术语,往往指的是几种不需要标注数据的系统。一种常见的类型是将共享属性的输入「聚类」在一起,即使没有明确标记它们为一类也能聚为一类。Google 的猫检测模型(Le et al., 2012)也许是这种方法最突出的案例。

Yann LeCun 等人提倡的另一种方法(Luc, Neverova, Couprie, Verbeek, & LeCun, 2017)起初并不会相互排斥,它使用像电影那样随时间变化的数据而替代标注数据集。

直观上来说,使用视频训练的系统可以利用每一对连续帧替代训练信号,并用来预测下一帧。因此这种用第 t 帧预测第 t+1 帧的方法就不需要任何人类标注信息。
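
The way successive frames serve as a teaching signal can be sketched as follows. This shows only how training pairs are built, not a prediction model, and the toy "video" and the name `next_frame_pairs` are invented for illustration.

```python
def next_frame_pairs(frames):
    """Turn an unlabeled sequence into (input, target) pairs: frame t -> frame t+1."""
    return list(zip(frames[:-1], frames[1:]))

# Toy "video": one bright pixel moving right across a 1x5 strip.
video = [[1 if i == t else 0 for i in range(5)] for t in range(5)]

pairs = next_frame_pairs(video)
for current, target in pairs:
    print(current, "->", target)
```

Every target here comes from the data itself, so no human annotation is involved; a sequence of n frames yields n - 1 training pairs.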

My view is that both of these approaches are useful (and so are some others not discussed here), but that neither inherently solves the sorts of problems outlined in section 3. One is still left with data-hungry systems that lack explicit variables, and I see no advance there towards open-ended inference, interpretability or debuggability.

That said, there is a different notion of unsupervised learning, less discussed, which I find deeply interesting: the kind of unsupervised learning that human children do. Children often set themselves a novel task, like creating a tower of Lego bricks or climbing through a small aperture, as my daughter recently did in climbing through a chair, in the space between the seat and the chair back. Often, this sort of exploratory problem solving involves (or at least appears to involve) a good deal of autonomous goal setting (what should I do?) and high level problem solving (how do I get my arm through the chair, now that the rest of my body has passed through?), as well as the integration of abstract knowledge (how bodies work, what sorts of apertures and affordances various objects have, and so forth). If we could build systems that could set their own goals and do reasoning and problem-solving at this more abstract level, major progress might quickly follow.

我的观点是,这两种方法都是有用的(其它一些方法本文并不讨论),但是它们本身并不能解决第 3 节中提到的问题。这些系统还有一些问题,例如缺少了显式的变量。而且我也没看到那些系统有开放式推理、解释或可调试性。




5.2 符号处理和混合模型的必要性 Symbol-manipulation, and the need for hybrid models
Another place that we should look is towards classic, “symbolic” AI, sometimes referred to as GOFAI (Good Old-Fashioned AI). Symbolic AI takes its name from the idea, central to mathematics, logic, and computer science, that abstractions can be represented by symbols. Equations like f = ma allow us to calculate outputs for a wide range of inputs, irrespective of whether we have seen any particular values before; lines in computer programs do the same thing (if the value of variable x is greater than the value of variable y, perform action a).

By themselves, symbolic systems have often proven to be brittle, but they were largely developed in an era with vastly less data and computational power than we have now. The right move today may be to integrate deep learning, which excels at perceptual classification, with symbolic systems, which excel at inference and abstraction. One might think of such a potential merger by analogy to the brain; perceptual input systems, like primary sensory cortex, seem to do something like what deep learning does, but there are other areas, like Broca’s area and prefrontal cortex, that seem to operate at a much higher level of abstraction. The power and flexibility of the brain comes in part from its capacity to dynamically integrate many different computations in real-time. The process of scene perception, for instance, seamlessly integrates direct sensory information with complex abstractions about objects and their properties, lighting sources, and so forth.

另一个我们需要关注的地方是经典的符号 AI,有时候也称为 GOFAI(Good Old-Fashioned AI)。符号 AI 的名字来源于抽象对象可直接用符号表示这一个观点,是数学、逻辑学和计算机科学的核心思想。

像 f = ma 这样的方程允许我们计算广泛输入的输出,而不管我们以前是否观察过任何特定的值。计算机程序也做着同样的事情(如果变量 x 的值大于变量 y 的值,则执行操作 a)。
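
The contrast between a rule stated over variables and knowledge tied to particular values can be sketched in a few lines. The lookup table below is a deliberately extreme stand-in for a pure memorizer, not a model of deep learning, and the stored examples and the names `force` and `memorized_force` are invented for illustration.

```python
def force(mass, acceleration):
    """f = ma, stated over variables: it applies to any values whatsoever."""
    return mass * acceleration

# Hypothetical previously seen examples: (mass, acceleration) -> force.
observed = {(1.0, 2.0): 2.0, (3.0, 4.0): 12.0}

def memorized_force(mass, acceleration):
    """Answers only for exact cases it has stored; returns None otherwise."""
    return observed.get((mass, acceleration))

print(force(3.0, 4.0), memorized_force(3.0, 4.0))  # a case seen before
print(force(5.0, 7.0), memorized_force(5.0, 7.0))  # a case never seen
```

The symbolic rule needs no examples at all to handle the unseen case, which is the sense in which such abstractions apply irrespective of the values encountered before.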


人们可能会认为这种潜在的合并可以类比于大脑;如初级感知皮层那样的感知输入系统好像和深度学习做的是一样的,但还有一些如 Broca 区域和前额叶皮质等领域似乎执行更高层次的抽象。

Some tentative steps towards integration already exist, including neurosymbolic modeling (Besold et al., 2017) and a recent trend towards systems such as differentiable neural computers (Graves et al., 2016), programming with differentiable interpreters (Bošnjak, Rocktäschel, Naradowsky, & Riedel, 2016), and neural programming with discrete operations (Neelakantan, Le, Abadi, McCallum, & Amodei, 2016). While none of this work has yet fully scaled towards anything like full-service artificial general intelligence, I have long argued (Marcus, 2001) that more work on integrating microprocessor-like operations into neural networks could be extremely valuable.
To the extent that the brain might be seen as consisting of “a broad array of reusable computational primitives—elementary units of processing akin to sets of basic instructions in a microprocessor—perhaps wired together in parallel, as in the reconfigurable integrated circuit type known as the field-programmable gate array”, as I have argued elsewhere (Marcus, Marblestone, & Dean, 2014), steps towards enriching the instruction set out of which our computational systems are built can only be a good thing.

现已有一些尝试性的研究探讨如何整合已存的方法,包括神经符号建模(Besold et al., 2017)和最近的可微神经计算机(Graves et al., 2016)、通过可微解释器规划(Bošnjak, Rocktäschel, Naradowsky, & Riedel, 2016)和基于离散运算的神经编程(Neelakantan, Le, Abadi, McCallum, & Amodei, 2016)。

虽然该项研究还没有完全扩展到如像 full-service 通用人工智能那样的探讨,但我一直主张(Marcus, 2001)将更多的类微处理器运算集成到神经网络中是非常有价值的。

对于扩展来说,大脑可能被视为由「一系列可重复使用的计算基元组成 - 基本单元的处理类似于微处理器中的一组基本指令。这种方式在可重新配置的集成电路中被称为现场可编程逻辑门阵列」,正如我在其它地方(Marcus,Marblestone,&Dean,2014)所论述的那样,逐步丰富我们的计算系统所建立的指令集会有很大的好处。

5.3 来自认知和发展心理学的更多洞见 More insight from cognitive and developmental psychology
Another potentially valuable place to look is human cognition (Davis & Marcus, 2015; Lake et al., 2016; Marcus, 2001; Pinker & Prince, 1988). There is no need for machines to literally replicate the human mind, which is, after all, deeply error prone, and far from perfect. But there remain many areas, from natural language understanding to commonsense reasoning, in which humans still retain a clear advantage; learning the mechanisms underlying those human strengths could lead to advances in AI, even if the goal is not, and should not be, an exact replica of the human brain.
For many people, learning from humans means neuroscience; in my view, that may be premature. We don’t yet know enough about neuroscience to literally reverse engineer the brain, per se, and may not for several decades, possibly until AI itself gets better. AI can help us to decipher the brain, rather than the other way around.
Either way, in the meantime, it should certainly be possible to use techniques and insights drawn from cognitive and developmental psychology, now, in order to build more robust and comprehensive artificial intelligence, building models that are motivated not just by mathematics but also by clues from the strengths of human psychology.

另一个有潜在价值的领域是人类认知(Davis & Marcus, 2015; Lake et al., 2016; Marcus, 2001; Pinker & Prince, 1988)。



A good starting point might be to first try to understand the innate machinery in human minds, as a source of hypotheses into mechanisms that might be valuable in developing artificial intelligences; in a companion article to this one (Marcus, in prep) I summarize a number of possibilities, some drawn from my own earlier work (Marcus, 2001) and others from Elizabeth Spelke’s (Spelke & Kinzler, 2007). Those drawn from my own work focus on how information might be represented and manipulated, such as by symbolic mechanisms for representing variables and distinctions between kinds and individuals from a class; those drawn from Spelke focus on how infants might represent notions such as space, time, and object.
A second focal point might be on common sense knowledge, both in how it develops (some might be part of our innate endowment, much of it is learned), how it is represented, and how it is integrated online in the process of our interactions with the real world (Davis & Marcus, 2015). Recent work by Lerer et al. (2016), Watters and colleagues (2017), Tenenbaum and colleagues (Wu, Lu, Kohli, Freeman, & Tenenbaum, 2017) and Davis and myself (Davis, Marcus, & Frazier-Logue, 2017) suggests some competing approaches to how to think about this, within the domain of everyday physical reasoning.

理解人类心智中的先天机制可能是一个不错的开始,因为人类心智能作为假设的来源,从而有望助力人工智能的开发;在本论文的姊妹篇中(Marcus,尚在准备中),我总结了一些可能性,有些来自于我自己的早期研究(Marcus, 2001),另一些则来自于 Elizabeth Spelke 的研究(Spelke & Kinzler, 2007)。

来自于我自己的研究的那些重点关注的是表示和操作信息的可能方式,比如用于表示一个类别中不同类型和个体之间不同变量和差异的符号机制;Spelke 的研究则关注的是婴儿表示空间、时间和物体等概念的方式。

另一个关注重点可能是常识知识,研究方向包括常识的发展方式(有些可能是因为我们的天生能力,但大部分是后天学习到的)、常识的表示方式以及我们如何将常识用于我们与真实世界的交互过程(Davis & Marcus, 2015)。Lerer 等人(2016)、Watters 及其同事(2017)、Tenenbaum 及其同事(Wu, Lu, Kohli, Freeman, & Tenenbaum, 2017)、Davis 和我(Davis, Marcus, & Frazier-Logue, 2017)最近的研究提出了一些在日常的实际推理领域内思考这一问题的不同方法。

A third focus might be on human understanding of narrative, a notion long ago suggested by Roger Schank and Abelson (1977) and due for a refresh (Marcus, 2014; Kočiský et al., 2017).

第三个关注重点可能是人类对叙事(narrative)的理解,这是一个历史悠久的概念,Roger Schank 和 Abelson 在 1977 年就已提出,并且也得到了更新(Marcus, 2014; Kočiský et al., 2017)。

5.4. 更大的挑战 Bolder challenges
Whether deep learning persists in current form, morphs into something new, or gets replaced altogether, one might consider a variety of challenge problems that push systems to move beyond what can be learned in supervised learning paradigms with large datasets. Drawing in part from a recent special issue of AI Magazine devoted to moving beyond the Turing Test that I edited with Francesca Rossi and Manuela Veloso (Marcus, Rossi, & Veloso, 2016), here are a few suggestions:


以下是一些建议,它们部分摘自我与 Francesca Rossi、Manuela Veloso 共同编辑的一期《AI Magazine》特刊(Marcus, Rossi, & Veloso, 2016),该特刊的主题是超越图灵测试(Turing Test):

  • A comprehension challenge (Paritosh & Marcus, 2016; Kočiský et al., 2017) which would require a system to watch an arbitrary video (or read a text, or listen to a podcast) and answer open-ended questions about what is contained therein. (Who is the protagonist? What is their motivation? What will happen if the antagonist succeeds in her mission?) No specific supervised training set can cover all the possible contingencies; inference and real-world knowledge integration are necessities.

  • 理解力挑战(Paritosh & Marcus, 2016; Kočiský et al., 2017)需要系统观看一个任意的视频(或者阅读文本、听广播),并就内容回答开放问题(谁是主角?其动机是什么?如果对手成功完成任务,会发生什么?)。没有专门的监督训练集可以涵盖所有可能的意外事件;推理和现实世界的知识整合是必需的。

  • Scientific reasoning and understanding, as in the Allen AI institute’s 8th grade science challenge (Schoenick, Clark, Tafjord, P, & Etzioni, 2017; Davis, 2016). While the answers to many basic science questions can simply be retrieved from web searches, others require inference beyond what is explicitly stated, and the integration of general knowledge.

  • 科学推理与理解,比如艾伦人工智能研究所的第 8 级的科学挑战(Schoenick, Clark, Tafjord, P, & Etzioni, 2017; Davis, 2016)。尽管很多基本科学问题的答案可轻易从网络搜索中找到,其他问题则需要清晰陈述之外的推理以及常识的整合。

  • General game playing (Genesereth, Love, & Pell, 2005), with transfer between games (Kansky et al., 2017), such that, for example, learning about one first-person shooter enhances performance on another with entirely different images, equipment and so forth. (A system that can learn many games, separately, without transfer between them, such as DeepMind’s Atari game system, would not qualify; the point is to acquire cumulative, transferrable knowledge).

  • 一般性的游戏玩法(Genesereth, Love, & Pell, 2005),游戏之间可迁移(Kansky et al., 2017),这样一来,比如学习一个第一人称的射击游戏可以提高带有完全不同图像、装备等的另一个游戏的表现。(一个系统可以分别学习很多游戏,如果它们之间不可迁移,比如 DeepMind 的 Atari 游戏系统,则不具备资格;关键是要获取累加的、可迁移的知识。)

  • A physically embodied test: an AI-driven robot that could build things (Ortiz Jr, 2016), ranging from tents to IKEA shelves, based on instructions and real-world physical interactions with the objects’ parts, rather than vast amounts of trial and error.

  • 物理具化地测试一个人工智能驱动的机器人,它能够基于指示和真实世界中与物体部件的交互而不是大量试错,来搭建诸如从帐篷到宜家货架这样的系统(Ortiz Jr, 2016)。

No one challenge is likely to be sufficient. Natural intelligence is multi-dimensional (Gardner, 2011), and given the complexity of the world, generalized artificial intelligence will necessarily be multi-dimensional as well.
By pushing beyond perceptual classification and into a broader integration of inference and knowledge, artificial intelligence will advance greatly.

没有一个挑战可能是充足的。自然智能是多维度的(Gardner, 2011),并且在世界复杂度给定的情况下,通用人工智能也必须是多维度的。


6. Conclusions 总结

As a measure of progress, it is worth considering a somewhat pessimistic piece I wrote for The New Yorker five years ago, conjecturing that “deep learning is only part of the larger challenge of building intelligent machines” because “such techniques lack ways of representing causal relationships (such as between diseases and their symptoms), and are likely to face challenges in acquiring abstract ideas like ‘sibling’ or ‘identical to.’ They have no obvious ways of performing logical inferences, and they are also still a long way from integrating abstract knowledge, such as information about what objects are, what they are for, and how they are typically used.”
As we have seen, many of these concerns remain valid, despite major advances in specific domains like speech recognition, machine translation, and board games, and despite equally impressive advances in infrastructure and the amount of data and compute available.

为了衡量进步,有必要回顾一下 5 年前我写给《纽约客》的一篇有些悲观的文章,文中推测“深度学习只是构建智能机器这一更大挑战的一部分”,因为“这些技术缺乏表征因果关系(比如疾病与症状之间的关系)的方法,并且在获取‘兄弟姐妹’或‘相同’等抽象概念时可能面临挑战。它们没有执行逻辑推理的明显方法,离整合抽象知识也还有很长的路要走,比如关于对象是什么、用来做什么以及通常如何使用的信息。”

Intriguingly, in the last year, a growing array of other scholars, coming from an impressive range of perspectives, have begun to emphasize similar limits. A partial list includes Brenden Lake and Marco Baroni (2017), François Chollet (2017), Robin Jia and Percy Liang (2017), Dileep George and others at Vicarious (Kansky et al., 2017) and Pieter Abbeel and colleagues at Berkeley (Stoica et al., 2017). Perhaps most notably of all, Geoff Hinton has been courageous enough to reconsider his own beliefs, revealing in an August interview with the news site Axios that he is “deeply suspicious” of back-propagation, a key enabler of deep learning that he helped pioneer, because of his concern about its dependence on labeled data sets. Instead, he suggested (in Axios’ paraphrase) that “entirely new methods will probably have to be invented.” I share Hinton’s excitement in seeing what comes next.

有趣的是,去年开始不断有其他学者从不同方面开始强调类似的局限,这其中有 Brenden Lake 和 Marco Baroni (2017)、François Chollet (2017)、Robin Jia 和 Percy Liang (2017)、Dileep George 及其他 Vicarious 同事 (Kansky et al., 2017)、 Pieter Abbeel 及其 Berkeley 同僚 (Stoica et al., 2017)。

也许这当中最著名的要数 Geoffrey Hinton,他勇于做自我颠覆。上年 8 月接受 Axios 采访时他说自己「深深怀疑」反向传播,因为他对反向传播对已标注数据集的依赖性表示担忧。

相反,他建议“开发一种全新的方法”。与 Hinton 一样,我对未来的下一步走向深感兴奋。