Gary Marcus was formerly the head of Uber's AI labs; he sold his startup Geometric Intelligence to Uber last December and helped Uber build its AI research team. Just four months later, Marcus announced his departure from Uber. Marcus is a scientist at New York University who has voiced many critical opinions on the progress of AI.
Abstract
Although deep learning has historical roots going back decades, neither the term "deep learning" nor the approach was popular just over five years ago, when the field was reignited by papers such as Krizhevsky, Sutskever and Hinton's now-classic 2012 deep net model of ImageNet (Krizhevsky, Sutskever, & Hinton, 2012).
What has the field discovered in the five subsequent years? Against a background of considerable progress in areas such as speech recognition, image recognition, and game playing, and considerable enthusiasm in the popular press, I present ten concerns for deep learning, and suggest that deep learning must be supplemented by other techniques if we are to reach artificial general intelligence.
"For most problems where deep learning has enabled transformationally better solutions (vision, speech), we've entered diminishing returns territory in 2016-2017." (François Chollet, Google, author of Keras, December 18, 2017)
"Science progresses one funeral at a time. The future depends on some graduate student who is deeply suspicious of everything I have said." (Geoffrey Hinton, godfather of deep learning and head of Google Brain, September 15, 2017)
1. Is deep learning approaching a wall?
Although deep learning has historical roots going back decades (Schmidhuber, 2015), it attracted relatively little notice until just over five years ago. Virtually everything changed in 2012, with the publication of a series of highly influential papers such as Krizhevsky, Sutskever and Hinton's 2012 ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky, Sutskever, & Hinton, 2012), which achieved state-of-the-art results on the object recognition challenge known as ImageNet (Deng et al., 2009). Other labs were already working on similar work (Cireşan, Meier, Masci, & Schmidhuber, 2012). Before the year was out, deep learning made the front page of The New York Times, and it rapidly became the best-known technique in artificial intelligence, by a wide margin.
Although the general idea of training neural networks with multiple layers was not new, it was, thanks in part to increases in computational power and data, the first time that deep learning truly became practical.
Deep learning has since yielded numerous state-of-the-art results in domains such as speech recognition, image recognition, and language translation, and plays a role in a wide swath of current AI applications. Corporations have invested billions of dollars fighting for deep learning talent. One prominent deep learning advocate, Andrew Ng, has gone so far as to suggest that "If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future." (Ng, 2016). A recent New York Times Sunday Magazine article, largely about deep learning, implied that the technique is "poised to reinvent computing itself."
Yet deep learning may well be approaching a wall, much as I anticipated earlier, at the beginning of the resurgence (Marcus, 2012), and as leading figures like Hinton (Sabour, Frosst, & Hinton, 2017) and Chollet (2017) have begun to imply in recent months.
What exactly is deep learning, and what has it shown about the nature of intelligence? What can we expect it to do, and where might we expect it to break down? How close or far are we from "artificial general intelligence", and a point at which machines show a human-like flexibility in solving unfamiliar problems? The purpose of this paper is both to temper some irrational exuberance and also to consider what we as a field might need to move forward.
This paper is written simultaneously for researchers in the field, and for a growing set of AI consumers with less technical background who may wish to understand where the field is headed. As such I will begin with a very brief, nontechnical introduction aimed at elucidating what deep learning systems do well and why (Section 2), before turning to an assessment of deep learning's weaknesses (Section 3) and some fears that arise from misunderstandings about deep learning's capabilities (Section 4), closing with perspective on going forward (Section 5).
Deep learning is not likely to disappear, nor should it. But five years into the field's resurgence seems like a good moment for a critical reflection on what deep learning has and has not been able to achieve.
2. What deep learning is, and what it does well
Deep learning, as it is primarily used, is essentially a statistical technique for classifying patterns, based on sample data, using neural networks with multiple layers. Neural networks in the deep learning literature typically consist of a set of input units that stand for things like pixels or words, multiple hidden layers (the more such layers, the deeper a network is said to be) containing hidden units (also known as nodes or neurons), and a set of output units, with connections running between those nodes. In a typical application such a network might be trained on a large set of handwritten digits (these are the inputs, represented as images) and labels (these are the outputs) that identify the categories to which those inputs belong (this image is a 2, that one is a 3, and so forth).
Over time, an algorithm called back-propagation allows a process called gradient descent to adjust the connections between units, such that any given input tends to produce the corresponding output.
Collectively, one can think of the relation between inputs and outputs that a neural network learns as a mapping. Neural networks, particularly those with multiple hidden layers (hence the term deep), are remarkably good at learning input-output mappings.
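The setup just described can be sketched in a few lines of code. This is a minimal illustration of ours, not code from the paper: a network with one hidden layer, trained by gradient descent via back-propagation so that each input comes to produce its corresponding output. The XOR mapping stands in for the digit-classification example, and all names here are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # input units
y = np.array([[0], [1], [1], [0]], dtype=float)              # target outputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)   # input -> hidden connections
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)   # hidden -> output connections

losses = []
for _ in range(5000):
    h = sigmoid(X @ W1 + b1)      # hidden-unit activations
    out = sigmoid(h @ W2 + b2)    # output-unit activations
    losses.append(float(np.mean((out - y) ** 2)))
    # Back-propagation: push the error gradient back through each layer,
    # then take a gradient-descent step on the connection weights.
    d_out = (out - y) * out * (1.0 - out)
    d_h = (d_out @ W2.T) * h * (1.0 - h)
    W2 -= 0.5 * (h.T @ d_out); b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * (X.T @ d_h);   b1 -= 0.5 * d_h.sum(axis=0)
```

After training, the recorded loss has fallen from its initial value: the network has learned the input-output mapping rather than memorizing anything symbolic about XOR.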
Such systems are commonly described as neural networks because the input nodes, hidden nodes, and output nodes can be thought of as loosely analogous to biological neurons, albeit greatly simplified, and the connections between nodes can be thought of as in some way reflecting connections between neurons. A longstanding question, outside the scope of the current paper, concerns the degree to which artificial neural networks are biologically plausible.
Most deep learning networks make heavy use of a technique called convolution (LeCun, 1989), which constrains the neural connections in the network such that they innately capture a property known as translational invariance. This is essentially the idea that an object can slide around an image while maintaining its identity; a circle in the top left can be presumed (even absent direct experience) to be the same as a circle in the bottom right.
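The translational-invariance property can be demonstrated directly. In this sketch (our illustration; the function and variable names are assumptions, not from the paper), one shared filter slides over an image, so a pattern produces the same peak response wherever it appears; only the location of the response shifts.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` and record the response at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

kernel = np.ones((2, 2))  # a filter that responds to a 2x2 bright blob

img_top_left = np.zeros((6, 6));     img_top_left[0:2, 0:2] = 1.0
img_bottom_right = np.zeros((6, 6)); img_bottom_right[4:6, 4:6] = 1.0

resp_a = convolve2d(img_top_left, kernel)
resp_b = convolve2d(img_bottom_right, kernel)

# The same filter fires equally strongly on both images.
same_peak = np.max(resp_a) == np.max(resp_b)
```

Because the filter weights are shared across positions, the circle-in-the-top-left and circle-in-the-bottom-right cases are handled by exactly the same parameters.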
Deep learning is also known for its ability to self-generate intermediate representations, such as internal units that may respond to things like horizontal lines, or more complex elements of pictorial structure.
In principle, given infinite data, deep learning systems are powerful enough to represent any finite deterministic "mapping" between any given set of inputs and a set of corresponding outputs, though in practice whether they can learn such a mapping depends on many factors. One common concern is getting caught in local minima, in which a system gets stuck on a suboptimal solution, with no better solution nearby in the space of solutions being searched. (Experts use a variety of techniques to avoid such problems, to reasonably good effect.) In practice, results with large data sets are often quite good, on a wide range of potential mappings.
In speech recognition, for example, a neural network learns a mapping between a set of speech sounds and a set of labels (such as words or phonemes). In object recognition, a neural network learns a mapping between a set of images and a set of labels (such that, for example, pictures of cars are labeled as cars). In DeepMind's Atari game system (Mnih et al., 2015), neural networks learned mappings between pixels and joystick positions.
Deep learning systems are most often used as classification systems, in the sense that the mission of a typical network is to decide which of a set of categories (defined by the output units on the neural network) a given input belongs to. With enough imagination, the power of classification is immense; outputs can represent words, places on a Go board, or virtually anything else. In a world with infinite data, and infinite computational resources, there might be little need for any other technique.
3. Limits on the scope of deep learning
Deep learning's limitations begin with the contrapositive: we live in a world in which data are never infinite. Instead, systems that rely on deep learning frequently have to generalize beyond the specific data that they have seen, whether to a new pronunciation of a word or to an image that differs from one that the system has seen before, and where data are less than infinite, the ability of formal proofs to guarantee high-quality performance is more limited.
As discussed later in this article, generalization can be thought of as coming in two flavors: interpolation between known examples, and extrapolation, which requires going beyond a space of known training examples (Marcus, 1998a). For neural networks to generalize well, there generally must be a large amount of data, and the test data must be similar to the training data, allowing new answers to be interpolated in between old ones. In Krizhevsky et al's paper (Krizhevsky, Sutskever, & Hinton, 2012), a nine-layer convolutional neural network with 60 million parameters and 650,000 nodes was trained on roughly a million distinct examples drawn from approximately one thousand categories.
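The interpolation/extrapolation distinction can be made concrete with a toy sketch (ours, not from the paper; every name here is an assumption). A small network is fit to the identity function f(x) = x on inputs drawn from [0, 1]; it answers well between its training points, but poorly far outside them, because its tanh units saturate beyond the training range.

```python
import numpy as np

rng = np.random.default_rng(1)

X = np.linspace(0.0, 1.0, 21).reshape(-1, 1)   # training inputs in [0, 1]
y = X.copy()                                   # identity targets
n = len(X)

W1 = rng.normal(size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1)); b2 = np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2, h

lr = 0.1
for _ in range(5000):
    pred, h = forward(X)
    err = pred - y                          # gradient of the loss (up to a constant)
    d_h = (err @ W2.T) * (1.0 - h ** 2)     # back-propagate through tanh
    W2 -= lr / n * (h.T @ err); b2 -= lr / n * err.sum(axis=0)
    W1 -= lr / n * (X.T @ d_h); b1 -= lr / n * d_h.sum(axis=0)

# Interpolation: a point inside the training range.
interp_err = abs(forward(np.array([[0.5]]))[0].item() - 0.5)
# Extrapolation: a point far outside it.
extrap_err = abs(forward(np.array([[10.0]]))[0].item() - 10.0)
```

The extrapolation error dwarfs the interpolation error: the learned mapping is an interpolator over the training region, not the abstract rule "output equals input".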
This sort of brute force approach worked well in the very finite world of ImageNet, into which all stimuli can be classified into a comparatively small set of categories. It also works well in stable domains like speech recognition, in which exemplars are mapped in a constant way onto a limited set of speech sound categories, but for many reasons deep learning cannot be considered (as it sometimes is in the popular press) a general solution to artificial intelligence.
Here are ten challenges faced by current deep learning systems:
3.1 Deep learning thus far is data hungry
Human beings can learn abstract relationships in a few trials. If I told you that a schmister was a sister over the age of 10 but under the age of 21, perhaps giving you a single example, you could immediately infer whether you had any schmisters, whether your best friend had a schmister, whether your children or parents had any schmisters, and so forth. (Odds are, your parents no longer do, if they ever did, and you could rapidly draw that inference, too.)
In learning what a schmister is, in this case through explicit definition, you rely not on hundreds or thousands or millions of training examples, but on a capacity to represent abstract relationships between algebra-like variables.
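The point can be sketched in code (our illustration; the data structures are assumptions): once the definition of "schmister" is stated, it is an abstract rule over variables that can be applied immediately to anyone, with zero training examples.

```python
def is_schmister(person):
    """A schmister is a sister over the age of 10 but under the age of 21."""
    return person["is_sister"] and 10 < person["age"] < 21

# Hypothetical individuals, never seen before -- no training set needed.
alice = {"is_sister": True, "age": 15}
carol = {"is_sister": True, "age": 45}
dave = {"is_sister": False, "age": 15}

results = [is_schmister(p) for p in (alice, carol, dave)]
```

The rule generalizes instantly to any binding of its variables, which is exactly the capacity the surrounding text argues data-hungry learners lack.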
Humans can learn such abstractions, both through explicit definition and more implicit means (Marcus, 2001). Indeed even 7-month old infants can do so, acquiring learned abstract language-like rules from a small number of unlabeled examples, in just two minutes (Marcus, Vijayan, Bandi Rao, & Vishton, 1999). Subsequent work by Gervain and colleagues (2012) suggests that newborns are capable of similar computations.
Deep learning currently lacks a mechanism for learning abstractions through explicit, verbal definition, and works best when there are thousands, millions or even billions of training examples, as in DeepMind's work on board games and Atari. As Brenden Lake and his colleagues have recently emphasized in a series of papers, humans are far more efficient in learning complex rules than deep learning systems are (Lake, Salakhutdinov, & Tenenbaum, 2015; Lake, Ullman, Tenenbaum, & Gershman, 2016). (See also related work by George et al (2017), and my own work with Steven Pinker on children's overregularization errors in comparison to neural networks (Marcus et al., 1992).)
Geoff Hinton has also worried about deep learning's reliance on large numbers of labeled examples, and expressed this concern in his recent work on capsule networks with his coauthors (Sabour et al., 2017), noting that convolutional neural networks (the most common deep learning architecture) may face "exponential inefficiencies that may lead to their demise. A good candidate is the difficulty that convolutional nets have in generalizing to novel viewpoints [ie perspectives on an object in visual recognition tasks]. The ability to deal with translation[al invariance] is built in, but for the other ... [common types of] transformation we have to choose between replicating feature detectors on a grid that grows exponentially ... or increasing the size of the labelled training set in a similarly exponential way."
In problems where data are limited, deep learning often is not an ideal solution.
3.2 Deep learning thus far is shallow and has limited capacity for transfer
Although deep learning is capable of some amazing things, it is important to realize that the word "deep" in deep learning refers to a technical, architectural property (the large number of hidden layers used in modern neural networks, where their predecessors used only one) rather than a conceptual one (the representations acquired by such networks don't, for example, naturally apply to abstract concepts like "justice", "democracy" or "meddling"). Even more down-to-earth concepts like "ball" or "opponent" can lie out of reach.
Consider for example DeepMind's Atari game work (Mnih et al., 2015) on deep reinforcement learning, which combines deep learning with reinforcement learning (in which a learner tries to maximize reward). Ostensibly, the results are fantastic: the system meets or beats human experts on a large sample of games using a single set of "hyperparameters" that govern properties such as the rate at which a network alters its weights, and no advance knowledge about specific games, or even their rules. But it is easy to wildly overinterpret what the results show. To take one example, according to a widely-circulated video of the system learning to play the brick-breaking Atari game Breakout, "after 240 minutes of training, [the system] realizes that digging a tunnel through the wall is the most effective technique to beat the game".
But the system has learned no such thing; it doesn't really understand what a tunnel, or what a wall is; it has just learned specific contingencies for particular scenarios. Transfer tests, in which the deep reinforcement learning system is confronted with scenarios that differ in minor ways from the ones on which the system was trained, show that deep reinforcement learning's solutions are often extremely superficial. For example, a team of researchers at Vicarious showed that a more efficient successor technique, DeepMind's Atari system [Asynchronous Advantage Actor-Critic; also known as A3C], failed on a variety of minor perturbations to Breakout (Kansky et al., 2017) from the training set, such as moving the Y coordinate (height) of the paddle, or inserting a wall midscreen. These demonstrations make clear that it is misleading to credit deep reinforcement learning with inducing concepts like wall or paddle; rather, such remarks are what comparative (animal) psychology sometimes calls overattributions. It's not that the Atari system genuinely learned a concept of wall that was robust, but rather that the system superficially approximated breaking through walls within a narrow set of highly trained circumstances.
My own team of researchers at a startup company called Geometric Intelligence (later acquired by Uber) found similar results as well, in the context of a slalom game. In 2017, a team of researchers at Berkeley and OpenAI showed that it was not difficult to construct comparable adversarial examples in a variety of games, undermining not only DQN (the original DeepMind algorithm) but also A3C and several other related techniques (Huang, Papernot, Goodfellow, Duan, & Abbeel, 2017).
Recent experiments by Robin Jia and Percy Liang (2017) make a similar point, in a different domain: language. Various neural networks were trained on a question answering task known as SQuAD (derived from the Stanford Question Answering Dataset), in which the goal is to highlight the words in a particular passage that correspond to a given question. In one sample, for instance, a trained system correctly, and impressively, identified the quarterback on the winning team of Super Bowl XXXIII as John Elway, based on a short paragraph. But Jia and Liang showed that the mere insertion of distractor sentences (such as a fictional one about the alleged victory of Google's Jeff Dean in another Bowl game) caused performance to drop precipitously. Across sixteen models, accuracy dropped from a mean of 75% to a mean of 36%. As is so often the case, the patterns extracted by deep learning are more superficial than they initially appear.
3.3 Deep learning thus far has no natural way to deal with hierarchical structure
To a linguist like Noam Chomsky, the troubles Jia and Liang documented would be unsurprising. Fundamentally, most current deep-learning based language models represent sentences as mere sequences of words, whereas Chomsky has long argued that language has a hierarchical structure, in which larger structures are recursively constructed out of smaller components. (For example, in the sentence the teenager who previously crossed the Atlantic set a record for flying around the world, the main clause is the teenager set a record for flying around the world, while who previously crossed the Atlantic is an embedded clause that specifies which teenager.)
In the 80's, Fodor and Pylyshyn (1988) expressed similar concerns, with respect to an earlier breed of neural networks. Likewise, in (Marcus, 2001), I conjectured that simple recurrent networks (SRNs; a forerunner to today's more sophisticated deep learning based recurrent neural networks, known as RNNs; Elman, 1990) would have trouble systematically representing and extending recursive structure to various kinds of unfamiliar sentences (see the cited articles for more specific claims about which types).
Earlier this year, Brenden Lake and Marco Baroni (2017) tested whether such pessimistic conjectures continued to hold true. As they put it in their title, contemporary neural nets were "Still not systematic after all these years". RNNs could "generalize well when the differences between training and test ... are small [but] when generalization requires systematic compositional skills, RNNs fail spectacularly".
Similar issues are likely to emerge in other domains, such as planning and motor control, in which complex hierarchical structure is needed, particularly when a system is likely to encounter novel situations. One can see indirect evidence for this in the struggles with transfer in Atari games mentioned above, and more generally in the field of robotics, in which systems generally fail to generalize abstract plans well in novel environments. The core problem, at least at present, is that deep learning learns correlations between sets of features that are themselves "flat" or nonhierarchical, as if in a simple, unstructured list, with every feature on equal footing. Hierarchical structure (e.g., syntactic trees that distinguish between main clauses and embedded clauses in a sentence) is not inherently or directly represented in such systems, and as a result deep learning systems are forced to use a variety of proxies that are ultimately inadequate, such as the sequential position of a word presented in a sequence.
Systems like Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013) that represent individual words as vectors have been modestly successful; a number of systems have used clever tricks to try to represent complete sentences in deep-learning compatible vector spaces (Socher, Huval, Manning, & Ng, 2012). But, as Lake and Baroni's experiments make clear, recurrent networks continue to be limited in their capacity to represent and generalize rich structure in a faithful manner.
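The contrast between a flat sequence and a hierarchical representation can be sketched directly (our illustration; the tree labels and helper function are assumptions, not a real parser): the same sentence as a bare word list, where position is the only structure, versus a tree in which the main clause and the embedded clause are explicitly distinguished.

```python
sentence = ("the teenager who previously crossed the Atlantic "
            "set a record for flying around the world")
flat = sentence.split()  # flat view: every word on equal footing

# Hierarchical view: nested tuples of (label, children...).
tree = ("S",
        ("NP", "the teenager",
         ("RelClause", "who previously crossed the Atlantic")),
        ("VP", "set a record for flying around the world"))

def main_clause(node):
    """Read the main clause off the tree, skipping embedded clauses."""
    if isinstance(node, str):
        return [node]
    label, *children = node
    if label == "RelClause":
        return []  # embedded material is not part of the main clause
    out = []
    for child in children:
        out.extend(main_clause(child))
    return out

main = " ".join(main_clause(tree))
```

From the tree, recovering "the teenager set a record for flying around the world" is a trivial recursive walk; from the flat list alone, no comparable operation exists without reconstructing the hierarchy first.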
3.4 Deep learning thus far has struggled with open-ended inference
If you can't represent nuance like the difference between "John promised Mary to leave" and "John promised to leave Mary", you can't draw inferences about who is leaving whom, or what is likely to happen next.
Current machine reading systems have achieved some degree of success in tasks like SQuAD, in which the answer to a given question is explicitly contained within a text, but far less success in tasks in which inference goes beyond what is explicit in a text, either by combining multiple sentences (so-called multi-hop inference) or by combining explicit sentences with background knowledge that is not stated in a specific text selection. Humans, as they read texts, frequently derive wide-ranging inferences that are both novel and only implicitly licensed, as when they, for example, infer the intentions of a character based only on indirect dialog.
Although Bowman and colleagues (Bowman, Angeli, Potts, & Manning, 2015; Williams, Nangia, & Bowman, 2017) have taken some important steps in this direction, there is, at present, no deep learning system that can draw open-ended inferences based on real-world knowledge with anything like human-level accuracy.
3.5 Deep learning thus far is not sufficiently transparent
The relative opacity of "black box" neural networks has been a major focus of discussion in the last few years (Samek, Wiegand, & Müller, 2017; Ribeiro, Singh, & Guestrin, 2016). In their current incarnation, deep learning systems have millions or even billions of parameters, identifiable to their developers not in terms of the sort of human-interpretable labels that canonical programmers use ("last_character_typed") but only in terms of their geography within a complex network (e.g., the activity value of the ith node in layer j in network module k).
Although some strides have been made in visualizing the contributions of individual nodes in complex networks (Nguyen, Clune, Bengio, Dosovitskiy, & Yosinski, 2016), most observers would acknowledge that neural networks as a whole remain something of a black box. How much that matters in the long run remains unclear (Lipton, 2016). If systems are robust and self-contained enough, it might not matter; if it is important to use them in the context of larger systems, it could be crucial for debuggability.
The transparency issue, as yet unsolved, is a potential liability when using deep learning for problem domains like financial trades or medical diagnosis, in which human users might like to understand how a given system made a given decision. As Catherine O'Neill (2016) has pointed out, such opacity can also lead to serious issues of bias.
3.6 Deep learning thus far has not been well integrated with prior knowledge
The dominant approach in deep learning is hermeneutic, in the sense of being self-contained and isolated from other, potentially useful knowledge. Work in deep learning typically consists of finding a training database, sets of inputs associated with respective outputs, and learning all that is required for the problem by learning the relations between those inputs and outputs, using whatever clever architectural variants one might devise, along with techniques for cleaning and augmenting the data set. With just a handful of exceptions, such as LeCun's convolutional constraint on how neural networks are wired (LeCun, 1989), prior knowledge is often deliberately minimized.
Thus, for example, in a system like Lerer et al's (2016) efforts to learn about the physics of falling towers, there is no prior knowledge of physics (beyond what is implied in convolution). Newton's laws, for example, are not explicitly encoded; the system instead (to some limited degree) approximates them by learning contingencies from raw, pixel-level data. As I note in a forthcoming paper on innateness (Marcus, in prep), researchers in deep learning appear to have a very strong bias against including prior knowledge even when (as in the case of physics) that prior knowledge is well known.
It is also not straightforward in general how to integrate prior knowledge into a deep learning system, in part because the knowledge represented in deep learning systems pertains mainly to (largely opaque) correlations between features, rather than to abstractions like quantified statements (e.g. all men are mortal; see the discussion of universally-quantified one-to-one mappings in Marcus (2001)), or generics (violable statements like dogs have four legs or mosquitos carry West Nile virus (Gelman, Leslie, Was, & Koch, 2015)).
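The kind of quantified statement contrasted above with learned feature correlations can be sketched as an explicit rule (our illustration; the fact base and function names are invented for the example): a universally quantified rule holds for every binding of its variable and applies to new individuals without any training data.

```python
# A tiny knowledge base: each individual maps to a set of known properties.
facts = {"Socrates": {"man"}, "Fido": {"dog"}}

def apply_rule(properties):
    # Universally quantified: for all x, man(x) -> mortal(x).
    return properties | {"mortal"} if "man" in properties else properties

inferred = {name: apply_rule(props) for name, props in facts.items()}
```

A learned correlation between features might approximate this pattern on seen data; the explicit rule, by contrast, is guaranteed to apply to any individual satisfying its antecedent, which is precisely what makes it hard to express inside current deep learning systems.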
深度学习的一个重要方向是解释学,就是将自身与其他潜在的、有用的知识隔离开来。
深度学习的工作方式通常包含寻找一个训练数据集,与输入相关联的各个输出,通过任何精巧的架构或变体,以及数据清理和/或增强技术,随后通过学习输入和输出的关系来学会解决问题的方法。
除了少数例外(如 LeCun 对神经网络连接方式所施加的卷积约束(LeCun,1989)),先验知识往往被有意地最小化。
因此,举例来说,在 Lerer 等人(2016)那种学习积木塔倒塌物理过程的系统中,并没有物理学的先验知识(卷积所隐含的内容除外)。例如,牛顿定律并没有被显式编码进去;系统只是通过从原始像素级数据中学习各种偶然关联,(在某种有限程度上)对这些定律进行近似。正如我在一篇即将发表的关于先天性(innateness)的论文中所指出的那样,深度学习研究者似乎对纳入先验知识有着很强的抵触,即使(如物理学的例子)这些先验知识是众所周知的。
一般来说,如何将先验知识整合进深度学习系统也并不简单:部分原因在于,深度学习系统中表征的知识主要是特征之间(大体不透明)的相关关系,而不是量化陈述(如「凡人终有一死」,参见 Marcus(2001)中对普遍量化的一对一映射的讨论)或 generics(可违反的陈述,如「狗有四条腿」或「蚊子携带西尼罗河病毒」(Gelman、Leslie、Was & Koch,2015))这样的抽象。
A related problem stems from a culture in machine learning that emphasizes competition on problems that are inherently self-contained, with little need for broad general knowledge. This tendency is well exemplified by the machine learning contest platform known as Kaggle, in which contestants vie for the best results on a given data set. Everything they need for a given problem is neatly packaged, with all the relevant input and output files. Great progress has been made in this way; speech recognition and some aspects of image recognition can be largely solved in the Kaggle paradigm.
The trouble, however, is that life is not a Kaggle competition; children don’t get all the data they need neatly packaged in a single directory. Real-world learning offers data much more sporadically, and problems aren’t so neatly encapsulated. Deep learning works great on problems like speech recognition in which there are lots of labeled examples, but scarcely anyone knows how to apply it to more open-ended problems. What’s the best way to fix a bicycle that has a rope caught in its spokes? Should I major in math or neuroscience? No training set will tell us that.
一个相关的问题源自机器学习领域的一种文化:强调在那些本身自成一体、几乎不需要广泛通用知识的问题上进行竞赛。Kaggle 机器学习竞赛平台就是这种倾向的典型例证:参赛者在给定的数据集上争取最佳结果。任意给定问题所需的一切都被整齐地打包好,包含所有相关的输入和输出文件。通过这种方式已经取得了很大的进步;语音识别和图像识别的某些方面基本上可以在 Kaggle 范式中得到解决。
然而,问题在于,生活并不是一场 Kaggle 竞赛;孩子们并不会得到整齐打包在一个目录里的全部所需数据。真实世界提供的学习数据要零散得多,问题也没有被如此整齐地封装起来。
深度学习在语音识别这类有大量标注样本的问题上非常有效,但几乎没有人知道如何将其应用于更开放的问题。修理辐条上缠了绳子的自行车,最好的办法是什么?我该主修数学还是神经科学?没有哪个训练集能告诉我们答案。
Problems that have less to do with categorization and more to do with commonsense reasoning essentially lie outside the scope of what deep learning is appropriate for, and so far as I can tell, deep learning has little to offer such problems. In a recent review of commonsense reasoning, Ernie Davis and I (2015) began with a set of easily-drawn inferences that people can readily answer without anything like direct training, such as Who is taller, Prince William or his baby son Prince George? Can you make a salad out of a polyester shirt? If you stick a pin into a carrot, does it make a hole in the carrot or in the pin?
与分类关系越小、与常识推理关系越大的问题,基本上就越超出深度学习的适用范围;而且据我所知,深度学习对这类问题几乎没有什么贡献。在近期一篇关于常识推理的综述中,我和 Ernie Davis(2015)从一系列人们无需任何直接训练就能轻松得出的推论入手,例如:威廉王子和他的小儿子乔治王子谁更高?你能用聚酯衬衫做沙拉吗?如果你往胡萝卜里扎一根针,是胡萝卜上有洞还是针上有洞?
As far as I know, nobody has even tried to tackle this sort of thing with deep learning.
Such apparently simple problems require humans to integrate knowledge across vastly disparate sources, and as such are a long way from the sweet spot of deep learning-style perceptual classification. Instead, they are perhaps best thought of as a sign that entirely different sorts of tools are needed, along with deep learning, if we are to reach human-level cognitive flexibility.
据我所知,目前还没有人尝试用深度学习来解决这类问题。
这些表面上简单的问题需要人类整合来自极为不同来源的知识,因此距离深度学习式感知分类的最佳应用场景还有很长一段距离。相反,它们或许最好被视为一个信号:如果我们想达到人类级别的认知灵活性,就需要在深度学习之外配合使用完全不同类型的工具。
3.7 到目前为止,深度学习还不能从根本上区分因果关系和相关关系 Deep learning thus far cannot inherently distinguish causation from correlation
If it is a truism that causation does not equal correlation, the distinction between the two is also a serious concern for deep learning. Roughly speaking, deep learning learns complex correlations between input and output features, but with no inherent representation of causality. A deep learning system can easily learn that height and vocabulary are, across the population as a whole, correlated, but less easily represent the way in which that correlation derives from growth and development (kids get bigger as they learn more words, but that doesn’t mean that growing tall causes them to learn more words, nor that learning new words causes them to grow). Causality has been a central strand in some other approaches to AI (Pearl, 2000) but, perhaps because deep learning is not geared towards such challenges, relatively little work within the deep learning tradition has tried to address it.[9]
如果因果关系确实不等同于相关关系,那么这两者之间的区别对深度学习而言也是一个严重的问题。
粗略而言,深度学习学习的是输入特征与输出特征之间的复杂相关关系,而不是固有的因果关系表征。
深度学习系统可以轻松学习到,在整个人群范围内,身高与词汇量是相关的,但却较难表征这种相关性源自成长与发育的方式(孩子在学会更多词的同时也越长越大,但这并不意味着长高会导致他们学会更多词,学会更多词也不会导致他们长高)。
因果关系在其它一些用于人工智能的方法中一直是核心因素(Pearl, 2000),但也许是因为深度学习的目标并非这些难题,所以深度学习领域传统上在解决这一难题上的研究工作相对较少。[9]
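身高与词汇量的例子可以用一个极小的模拟来说明(数据纯属虚构,仅为示意):两者都由共同原因(年龄)驱动,纯相关的学习器会看到很强的相关关系;而把年龄的影响回归掉之后,这种相关基本消失。

```python
import numpy as np

rng = np.random.default_rng(0)

# 虚构的模拟:年龄同时驱动身高和词汇量,
# 因此两者相关,但彼此之间没有因果关系。
age = rng.uniform(2, 12, 5000)                     # 年龄(岁)
height = 80 + 6 * age + rng.normal(0, 5, 5000)     # 身高(厘米)
vocab = 500 * age + rng.normal(0, 800, 5000)       # 词汇量(词数)

# 纯相关学习器看到的是很强的身高-词汇量关联……
raw_corr = np.corrcoef(height, vocab)[0, 1]

# ……但把共同原因(年龄)回归掉之后,关联就消失了。
def residual(y, x):
    """回归掉 x 的线性影响,返回残差。"""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

partial_corr = np.corrcoef(residual(height, age),
                           residual(vocab, age))[0, 1]

print(raw_corr)       # 强正相关
print(partial_corr)   # 接近零
```

当然,深度网络学到的是远比这复杂的非线性相关;这里只是用最小的统计例子说明「相关可学、因果难表征」这一点。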
3.8 深度学习假设世界大体稳定,而这可能会带来问题 Deep learning presumes a largely stable world, in ways that may be problematic
The logic of deep learning is such that it is likely to work best in highly stable worlds, like the board game Go, which has unvarying rules, and less well in systems such as politics and economics that are constantly changing. To the extent that deep learning is applied in tasks such as stock prediction, there is a good chance that it will eventually face the fate of Google Flu Trends, which initially did a great job of predicting epidemiological data from search trends, only to completely miss things like the peak of the 2013 flu season (Lazer, Kennedy, King, & Vespignani, 2014).
深度学习的逻辑是:在高度稳定的世界(比如规则不变的围棋)中效果很可能最佳,而在政治和经济等不断变化的领域的效果则没有那么好。
就算把深度学习应用于股票预测等任务,它很有可能也会遭遇谷歌流感趋势(Google Flu Trends)那样的命运;谷歌流感趋势一开始根据搜索趋势能很好地预测流行病学数据,但却完全错过了 2013 年流感季等事件(Lazer, Kennedy, King, & Vespignani, 2014)。
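这种失效模式可以用一个极小的、完全虚构的例子来示意(与谷歌流感趋势的实际方法无关):在一段稳定趋势上拟合的模型,在世界发生变化后仍按旧趋势外推,误差随之急剧增大。

```python
import numpy as np

rng = np.random.default_rng(2)

# 虚构的时间序列:前 100 步是稳定的上升趋势。
t_train = np.arange(100)
signal_train = 0.5 * t_train + rng.normal(0, 1, 100)

# 在稳定阶段拟合一个线性模型。
slope, intercept = np.polyfit(t_train, signal_train, 1)

# 世界在 t=100 发生变化:趋势反转。
t_test = np.arange(100, 130)
signal_test = 50 - 0.5 * (t_test - 100) + rng.normal(0, 1, 30)

# 模型仍按旧趋势外推,误差比训练阶段大一个数量级。
in_regime_err = np.mean(np.abs(slope * t_train + intercept - signal_train))
shifted_err = np.mean(np.abs(slope * t_test + intercept - signal_test))

print(in_regime_err)   # 接近噪声水平
print(shifted_err)     # 远大于训练误差
```

深度网络的情况更复杂,但依赖的同样是「未来数据与训练数据同分布」这一假设。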
3.9 到目前为止,深度学习只是一种良好的近似,其答案并不完全可信 Deep learning thus far works well as an approximation, but its answers often cannot be fully trusted
In part as a consequence of the other issues raised in this section, deep learning systems are quite good at some large fraction of a given domain, yet easily fooled. An ever-growing array of papers has shown this vulnerability, from the linguistic examples of Jia and Liang mentioned above to a wide range of demonstrations in the domain of vision, where deep learning systems have mistaken yellow-and-black patterns of stripes for school buses (Nguyen, Yosinski, & Clune, 2014) and sticker-clad parking signs for well-stocked refrigerators (Vinyals, Toshev, Bengio, & Erhan, 2014) in the context of a captioning system that otherwise seems impressive. More recently, there have been real-world stop signs, lightly defaced, that have been mistaken for speed limit signs (Evtimov et al., 2017) and 3d-printed turtles that have been mistaken for rifles (Athalye, Engstrom, Ilyas, & Kwok, 2017). A recent news story recounts the trouble a British police system has had in distinguishing nudes from sand dunes.[10] The “spoofability” of deep learning systems was perhaps first noted by Szegedy et al (2013). Four years later, despite much active research, no robust solution has been found.
这个问题部分是本节中提及的其它问题所造成的结果:深度学习系统在给定领域中相当大的一部分上效果良好,但仍然很容易被愚弄。越来越多的论文展示了这一缺陷,从前文提及的 Jia 和 Liang 给出的语言学案例,到视觉领域的大量示例:比如深度学习系统将黄黑相间的条纹图案误认为校车(Nguyen, Yosinski, & Clune, 2014),将贴了贴纸的停车标志误认为装满食物的冰箱(Vinyals, Toshev, Bengio, & Erhan, 2014),而出错的这个图像描述系统在其它方面看起来还相当出色。
最近还有真实世界的停止标志在稍微修饰之后被误认为限速标志的案例(Evtimov et al., 2017),还有 3D 打印的乌龟被误认为步枪的情况(Athalye, Engstrom, Ilyas, & Kwok, 2017)。最近还有一条新闻说英国警方的一个系统难以分辨裸体和沙丘。[10]
最早提出深度学习系统的“可欺骗性(spoofability)”的论文可能是 Szegedy et al(2013)。四年过去了,尽管研究活动很活跃,但目前仍未找到稳健的解决方法。
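这种「可欺骗性」可以用一个极简的示意来体会:下面用 NumPy 在一个普通的逻辑回归上(仅作为深度网络的替身,并非上述任何论文中的模型)演示快速梯度符号法(FGSM,出自 Goodfellow 等人)——一个远小于数据自身噪声的扰动,就能翻转模型的判断。数据与数值均为虚构。

```python
import numpy as np

rng = np.random.default_rng(1)

# 两个高维高斯类别,均值仅相差 0.4,噪声标准差为 1。
d = 100
X = np.vstack([rng.normal(-0.2, 1.0, (500, d)),   # 类别 0
               rng.normal(+0.2, 1.0, (500, d))])  # 类别 1
y = np.concatenate([np.zeros(500), np.ones(500)])

# 用普通梯度下降训练逻辑回归。
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

x = np.full(d, 0.2)                    # 一个典型的类别 1 输入
clean_pred = x @ w + b > 0             # 被正确判为类别 1

# FGSM:沿使损失上升的方向,把每个坐标推动 eps。
# 对线性模型而言,损失对 x 的梯度正比于 w。
eps = 0.3                              # 远小于数据噪声的标准差 1.0
x_adv = x - eps * np.sign(w)
adv_pred = x_adv @ w + b > 0           # 被(错误地)判为类别 0
```

要点在于:每个坐标上 0.3 的扰动淹没在标准差为 1 的噪声里,但在高维空间中这些微小推动会在决策函数上累加,足以越过边界。深度网络上的真实攻击(Szegedy et al., 2013;Goodfellow et al.)利用的正是同一类效应。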
3.10 到目前为止,深度学习还难以在工程中使用 Deep learning thus far is difficult to engineer with
Another fact that follows from all the issues raised above is that it is simply hard to do robust engineering with deep learning. As a team of authors at Google put it in 2014, in the title of an important, and as yet unanswered essay (Sculley, Phillips, Ebner, Chaudhary, & Young, 2014), machine learning is “the high-interest credit card of technical debt”, meaning that it is comparatively easy to make systems that work in some limited set of circumstances (short term gain), but quite difficult to guarantee that they will work in alternative circumstances with novel data that may not resemble previous training data (long term debt, particularly if one system is used as an element in another larger system).
In an important talk at ICML, Leon Bottou (2015) compared machine learning to the development of an airplane engine, and noted that while the airplane design relies on building complex systems out of simpler systems for which it was possible to create sound guarantees about performance, machine learning lacks the capacity to produce comparable guarantees. As Google’s Peter Norvig (2016) has noted, machine learning as yet lacks the incrementality, transparency and debuggability of classical programming, trading off a kind of simplicity for deep challenges in achieving robustness.
由上述所有问题可以得出的另一个事实是:目前还难以用深度学习进行稳健的工程开发。
正如谷歌一个研究团队在 2014 年一篇重要但仍未得到解答的论文(Sculley, Phillips, Ebner, Chaudhary, & Young, 2014)的标题中说的那样:机器学习是「高利息的技术债务信用卡」,意思是说机器学习在打造可在某些有限环境中工作的系统方面相对容易(短期效益),但要确保它们也能在具有可能不同于之前训练数据的全新数据的其它环境中工作却相当困难(长期债务,尤其是当一个系统被用作另一个更大型的系统组成部分时)。
Leon Bottou (2015) 在 ICML 的一个重要演讲中将机器学习与飞机引擎开发进行了比较。他指出尽管飞机设计依靠的是使用更简单的系统构建复杂系统,但仍有可能确保得到可靠的结果,机器学习则缺乏得到这种保证的能力。
正如谷歌的 Peter Norvig 在 2016 年指出的那样,目前机器学习还缺乏传统编程的渐进性、透明性和可调试性;它换来了某种简洁性,代价则是在实现稳健性方面面临深层难题。
Henderson and colleagues have recently extended these points, with a focus on deep reinforcement learning, noting some serious issues in the field related to robustness and replicability (Henderson et al., 2017).
Although there has been some progress in automating the process of developing machine learning systems (Zoph, Vasudevan, Shlens, & Le, 2017), there is a long way to go.
Henderson 及其同事最近围绕深度强化学习对这些观点进行了延展,他们指出这一领域面临着一些与稳健性和可复现性相关的严重问题(Henderson et al., 2017)。
尽管在机器学习系统的开发过程的自动化方面存在一些进展(Zoph, Vasudevan, Shlens, & Le, 2017),但还仍有很长的路要走。
3.11 讨论 Discussion
Of course, deep learning, by itself, is just mathematics; none of the problems identified above are because the underlying mathematics of deep learning are somehow flawed. In general, deep learning is a perfectly fine way of optimizing a complex system for representing a mapping between inputs and outputs, given a sufficiently large data set.
The real problem lies in misunderstanding what deep learning is, and is not, good for. The technique excels at solving closed-end classification problems, in which a wide range of potential signals must be mapped onto a limited number of categories, given that there is enough data available and the test set closely resembles the training set.
But deviations from these assumptions can cause problems; deep learning is just a statistical technique, and all statistical techniques suffer from deviation from their assumptions. Deep learning systems work less well when there are limited amounts of training data available, or when the test set differs importantly from the training set, or when the