[Paper translation] Deep Learning Review




[Parts of this translation are drawn from online sources; for study purposes only]

author

Yann LeCun , Yoshua Bengio & Geoffrey Hinton

Paper: http://www.cs.toronto.edu/~hinton/absps/NatureDeepReview.pdf
Received 25 February; accepted 1 May 2015.
Published: 442 | NATURE | VOL 521 | 28 MAY 2015

摘要 Abstract

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

深度学习是指建立多个处理层组成的计算机模型,以多层抽象的方式去学习数据的表达。这些方法使得一些前沿领域诸如语音识别,视觉物体识别,目标检测以及其他:包括药物发现和基因组学等,研究水平都有了极大地发展。深度学习使用反向传播(BP)算法,指出了机器在多层模型中应该如何用上一层参数来计算当前层参数,从而可以构建应对大型数据集中的需要的复杂结构。深度卷积网络给处理图像、视频、语音和音频的领域带来了突破,而递归网络对于连续的数据例如文本和语音有很好的性能。

Machine-learning 机器学习

Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.

机器学习技术为现代社会的许多领域提供了强有力的支持:从网站搜索到社交网络的内容过滤再到电子商务网站的建议,并且它越来越多地出现在相机和智能手机等消费电子产品中。机器学习系统被用于分辨图像中的目标,将语音转换成文本,根据用户的兴趣选择新闻内容、广告或者产品,并选择相关的搜索结果。这些应用程序越来越广泛地使用一系列叫做深度学习的技术。

Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.

传统的机器学习处理原始数据的能力是有限的。几十年来,构造一个模式识别或机器学习系统,需要细致的组织和专业的知识去设计一个特征提取器。特征提取器可以将原始数据转换成一个合适的内部表示或特征向量,这当中的子系统通常是一个分类器,可以完成对输入样本的检测或分类。

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

表示学习是一系列学习的方法,它允许向机器输入原始数据,能够自动发现用于检测或分类所需的特征表示。深度学习方法就是一种拥有多层表示特征的学习方法,在每一层中通过简单但是非线性的模块将一个层级的数据(以原始数据开始)转换到更高层次,更抽象的表达形式。通过足够多的转换,即使非常复杂的函数也可以被学习到。对于分类任务,更高层次的表示可以放大输入的区别并且抑制无关的变化。例如,对于一幅图像来说,以像素值矩阵的形式输入,在第一层中被学习的往往是在图像中某些特定方向或位置边缘存在与否的特征。第二层往往是根据提取边界的走向来检测图案,而忽略边缘位置的细小变化。第三层可能是将图案组合成更大的组合图案,从而与相似的目标部分对应,并且随后的层会将这些部分再联合从而构成检测的目标。深度学习的关键方面在于这些层的设计不是由人类工程师完成的:它们是通过使用通用的学习算法从原始数据中学习的得到。

Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition [1–4] and speech recognition [5–7], it has beaten other machine-learning techniques at predicting the activity of potential drug molecules [8], analysing particle accelerator data [9,10], reconstructing brain circuits [11], and predicting the effects of mutations in non-coding DNA on gene expression and disease [12,13]. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding [14], particularly topic classification, sentiment analysis, question answering [15] and language translation [16,17].

深度学习正在为解决多年来阻碍人工智能这一先进尝试的发展的诸多问题上发挥重要作用。它被证实已于发现高维数据的复杂结构,因此适用于科学、商业和管理等很多领域。除了打破图像识别和语音识别的很多记录外,它在诸如预测潜在药物分子的活性,分析粒子加速的数据,重新构建大脑回路以及预测非编码基因序列突变对基因表达和疾病的影响等领域打败了其他机器学习的技术。或许更令人惊讶的是,深度学习在自然语言理解方面的各种课题上也形成了极好的成果,特别是在主题分类,情感分析,问题回答和语言翻译上。

We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.

我们认为深度学习在不远的将来会有更大的成功,因为它需要很少的人工干预,所以它可以很容易地利用好计算能力和数据量的提升。目前正在为深度神经网络而开发的新的学习算法和结构只会加快这一进程。

监督学习 Supervised learning

The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.

最常见的机器学习的形式,就是监督学习,不管其是否是深度的。试想我们建立一个可以对输入的图片中包含的东西进行分类的系统,包括人、车子、房子、宠物等等。我们首先需要建立带有对应类别标签的图片的巨大数据集。在训练过程中,机器被给与一张图片,就会输出一个其对于所有类别的得分构成的向量。我们想要让所期望的类对应最高的分数,但是在我们训练之前这是不太可能的。我们利用衡量误差(或距离)的目标函数来计算实际输出分数与期望的样本分数之间的差距。机器通过修改它内部的可调节的参数使得误差减小。而这些可调节的参数,通常称为权重,是可见的“旋钮”,真正决定了输入输出之间的函数关系。在一个典型的深度学习系统中,会有数百万个这样的可调节的权重,和数百万用来训练的带有标签的样例。

To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.

The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.

为了恰当地调整权重向量,学习算法计算了每个权值的梯度向量,表示如果权值增加了一个很小的量,那么误差就会增加或减少的量。然后,权值向量就需要在梯度矢量的相反方向上进行调整。

平均了所有样本的目标函数,可以看作是一种在权值的高维空间上的丘陵地形。负的梯度向量方向表示地形中下降最快的方向,地形上越接近它的最小值,也就取得了平均上的最小误差。
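
The weight update described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy objective `error`, the starting point, the learning rate and the step count are all hypothetical and do not come from the paper. The gradient is estimated by nudging each weight by a tiny amount, which mirrors the description of what the gradient vector measures.

```python
def error(w):
    """Toy objective (an assumption): squared distance of w from (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2


def numerical_gradient(f, w, eps=1e-6):
    """Estimate how much f changes when each weight is increased by a tiny amount."""
    grad = []
    for i in range(len(w)):
        nudged = list(w)
        nudged[i] += eps
        grad.append((f(nudged) - f(w)) / eps)
    return grad


def gradient_descent(f, w, learning_rate=0.1, steps=100):
    """Repeatedly adjust the weights in the direction opposite to the gradient."""
    for _ in range(steps):
        grad = numerical_gradient(f, w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]
    return w


w_final = gradient_descent(error, [0.0, 0.0])
print(w_final, error(w_final))
```

On this bowl-shaped landscape the weights approach the minimum at (3, -1). Real networks use analytic gradients from backpropagation rather than finite differences, which would be far too slow for millions of weights.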

In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques [18]. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine: its ability to produce sensible answers on new inputs that it has never seen during training.

在实践中,大部分开发者都会采用随机梯度下降(SGD)的方法。它包括提取一组训练样本向量作为输入,计算输出和误差,还有这些样本的平均梯度,并且据此调整权重。不断地从训练集中获取一部分训练样本来重复这一过程,直到目标函数的平均值不再下降。它被称作随机,是因为每一个小的样本集对于全体样本平均梯度的估计来说都会产生噪声,相较于一些精细的优化方法,这个简单的过程往往会很快地找到一组好的权重。在训练完成后,系统的性能通过测试集来测试,即测试系统的泛化能力——也就是它对于训练过程中未见过的新样本的正确预测能力。
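
A minimal sketch of the SGD loop described above, fitting a one-weight linear model. Everything specific here is an assumption made for illustration: the synthetic data (`make_data`, noisy samples of y = 2x + 1), the batch size, the learning rate, and the fixed number of steps, which stands in for "until the objective stops decreasing".

```python
import random


def make_data(n, rng):
    """Hypothetical data set: noisy samples of y = 2x + 1."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)))
    return data


def mean_squared_error(w, b, examples):
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


def sgd(train, rng, learning_rate=0.1, batch_size=8, steps=500):
    w, b = 0.0, 0.0
    for _ in range(steps):
        # A small random set of examples gives a noisy gradient estimate.
        batch = rng.sample(train, batch_size)
        # Average gradient of 0.5 * (prediction - target)^2 over the batch.
        grad_w = sum((w * x + b - y) * x for x, y in batch) / batch_size
        grad_b = sum((w * x + b - y) for x, y in batch) / batch_size
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


rng = random.Random(0)
data = make_data(250, rng)
train, test = data[:200], data[200:]   # held-out test set measures generalization
w, b = sgd(train, rng)
print(w, b, mean_squared_error(w, b, test))
```

The error on `test` is computed only after training, on examples the model never saw, which is the generalization check the paragraph describes.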

Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.

许多当前实际应用中的机器学习在人工提取的特证基础上使用线性分类器进行分类,一个二分类的线性分类器会计算特征向量各元素的加权和,若其高于阈值,那么输入就会被分类为一个特定的类别。
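
As a small illustration of the two-class linear classifier just described, the sketch below computes a weighted sum of the feature vector and compares it with a threshold. The weights, threshold and feature values are hypothetical.

```python
def linear_classify(features, weights, threshold):
    """Return True when the weighted sum of the features exceeds the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, features))
    return weighted_sum > threshold


weights = [0.5, -1.0, 2.0]   # hypothetical learned weights
threshold = 1.0              # hypothetical decision threshold

print(linear_classify([1.0, 0.0, 1.0], weights, threshold))  # weighted sum is 2.5
print(linear_classify([1.0, 1.0, 0.0], weights, threshold))  # weighted sum is -0.5
```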

Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane [19]. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other ‘shallow’ classifier operating on raw pixels, could not possibly distinguish the latter two while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma: one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal.

To make classifiers more powerful, one can use generic non-linear features, as with kernel methods [20], but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples [21]. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

自从二十世纪九十年代以来,我们就知道一个线性分类器仅仅能通过一个超平面将输入空间分成几个简单的区域。但是图像或语音的识别问题等,要求输入-输出函数对于诸如位置,目标的光照和旋转,或者语音中音高和音色等的变化此类不相关的输入是不敏感的,而对一些特定的微小变化十分敏感,例如我们希望能够区分一只白色的狼和一只很像狼的萨摩耶犬。在像素层面上,在不同环境中不同姿势的两只萨摩耶的图片或许都会有很大的不同,然而相同背景中同样位置的一只萨摩耶和一只狼却会很像。线性的分类器,或者其他的直接在像素上操作的浅层分类器就算能够将前两个分为同一类,也很有可能不能地区分出后两个。这就是什么浅层分类器需要很好的用于解决选择不变问题的特征提取能器——一个能够提取出图像中区分目标的那些关键特征,但是这些特征对于分辨动物的姿势就无能为力了。为了使分类器更强大,可以使用通用的非线性特征,如核方法,但是诸如通过高斯核产生的这些通用特征,不能使学习对象具有对于所有的训练样本都很好的泛化效果。传统的选择是人工去设计好的特征提取器,而这需要大量的工程技能和专业经验。通过使用通用目标的学习过程,可以自动学习到好的特征,从而避免上述问题。这是深度学习的关键优势。

image.png

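Figure 1: Multilayer neural networks and backpropagation

a. A multilayer neural network (shown by the connected dots) can distort the input space to make the classes of data (examples of which are on the red and blue lines) linearly separable. Note how a regular grid in input space (left) is also transformed by the hidden units (middle). This is an illustrative example with only two input units, two hidden units and one output unit, but the networks used for object recognition or natural language processing contain tens or hundreds of thousands of units.

b. The chain rule of derivatives tells us how two small effects (that of a small change of x on y, and that of y on z) are composed. A small change Δx in x is first transformed into a small change Δy in y by multiplying it by ∂y/∂x (this is the definition of the partial derivative). Similarly, the change Δy creates a change Δz in z. Substituting one equation into the other gives the chain rule: Δx is turned into Δz through multiplication by the product of ∂y/∂x and ∂z/∂y. It also works when x, y and z are vectors (and the derivatives are Jacobian matrices).

c. The equations used for computing the forward pass in a neural net with two hidden layers and one output layer, each constituting a module through which one can backpropagate gradients. At each layer, we first compute the total input z to each unit, which is a weighted sum of the outputs of the units in the layer below. Then a non-linear function f is applied to z to get the output of the unit. For simplicity, we have omitted bias terms. The non-linear functions used in neural networks include the rectified linear unit (ReLU) f(z) = max(z, 0), commonly used in recent years, as well as the more conventional sigmoids, such as the hyperbolic tangent f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)) and the logistic function f(z) = 1/(1 + exp(−z)).

d. The equations used for computing the backward pass. At each hidden layer we compute the error derivative with respect to the output of each unit, which is a weighted sum of the error derivatives with respect to the total inputs to the units in the layer above. We then convert the error derivative with respect to the output into the error derivative with respect to the input by multiplying it by the gradient of f(z). At the output layer, the error derivative with respect to the output of a unit is computed by differentiating the cost function. This gives y_l − t_l if the cost function for unit l is 0.5(y_l − t_l)^2, where t_l is the target value. Once ∂E/∂z_k is known, the error derivative for the weight w_jk on the connection from unit j in the layer below is just y_j ∂E/∂z_k.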
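
Because the figure image itself is not reproduced here, the equations that panels c and d refer to can be sketched from the caption text. The index convention is an assumption: i for input units, j and k for the first and second hidden layers, l for output units, and w_{jk} for the weight from unit j to unit k. Bias terms are omitted, as in the caption.

```latex
% Forward pass (panel c): weighted sum, then non-linearity f
z_j = \sum_i w_{ij}\, x_i, \qquad y_j = f(z_j)
z_k = \sum_j w_{jk}\, y_j, \qquad y_k = f(z_k)
z_l = \sum_k w_{kl}\, y_k, \qquad y_l = f(z_l)

% Backward pass (panel d), for the cost E = \tfrac{1}{2} \sum_l (y_l - t_l)^2
\frac{\partial E}{\partial y_l} = y_l - t_l, \qquad
\frac{\partial E}{\partial z_l} = \frac{\partial E}{\partial y_l}\, f'(z_l)

\frac{\partial E}{\partial y_k} = \sum_l w_{kl}\, \frac{\partial E}{\partial z_l}, \qquad
\frac{\partial E}{\partial z_k} = \frac{\partial E}{\partial y_k}\, f'(z_k)

\frac{\partial E}{\partial w_{jk}} = y_j\, \frac{\partial E}{\partial z_k}
```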

image.png

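Figure 2: Inside a convolutional network

The outputs (not the filters) of each layer of a typical convolutional network applied to the image of a Samoyed dog. Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class at the output. ReLU, rectified linear unit.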

A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.

一个深度学习框架是简单模块的多层堆叠,其所有(或大多数)的目标是学习,并且很多是在计算非线性的输入输出映射关系。每个模块都在转换其输入,是为了同时增加选择性和表达的不变性。有5到20层的非线性层的系统,可以形成对一些细节很敏感的复杂函数——能够从白色的狼中区分出萨摩耶,并且对大型的不相关变量不敏感,例如背景,姿势,光照和周围的物体。

利用反向传播训练多层结构 Backpropagation to train multilayer architectures

From the earliest days of pattern recognition, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s.

在早期的模式识别中,研究的目标是期望利用可训练的网络代替人工设计的特征提取,尽管它很简单,但是它的解决方法直到二十世纪八十年代中期才被广泛理解。它指出的是,多层网络结构可以利用随机梯度下降来进行训练。只要每个模块是输入和内部权重的相对平滑的函数,就可以通过反向传播方法计算梯度。这个方法的可行性与有效性在二十世纪七、八十年代被几个不同的团体都独立地发现了。

The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.

用于计算目标函数对于多层模块中的权重的反向传播过程,仅仅是链式求导法则的一个实际应用。关键在于目标对于一个模块输入的导数可以利用对这个模块输出(或后一个模块的输入)的导数来求得(如图1)。反向传播的等式可以重复地被用来在从输出层(网络形成预测结果的输出层)到输入层(外部数据的输入层)的所有模型中传递梯度。一旦计算出这些梯度,就可以直接计算每个模块权重的梯度。
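
The backward sweep described above can be sketched for a tiny network with two inputs, two tanh hidden units and one tanh output, without biases. The network size, the weight values and the choice of tanh are assumptions made for illustration. The last lines compare one backpropagated gradient with a finite-difference estimate as a sanity check.

```python
import math


def forward(x, w1, w2):
    """Forward pass: 2 inputs -> 2 tanh hidden units -> 1 tanh output (no biases)."""
    z_hidden = [sum(w1[j][i] * x[i] for i in range(2)) for j in range(2)]
    y_hidden = [math.tanh(z) for z in z_hidden]
    z_out = sum(w2[j] * y_hidden[j] for j in range(2))
    return y_hidden, math.tanh(z_out)


def loss(x, t, w1, w2):
    _, y_out = forward(x, w1, w2)
    return 0.5 * (y_out - t) ** 2


def backward(x, t, w1, w2):
    """Backward pass: apply the chain rule from the output down to the weights."""
    y_hidden, y_out = forward(x, w1, w2)
    dE_dy_out = y_out - t                        # derivative of 0.5 * (y - t)^2
    dE_dz_out = dE_dy_out * (1.0 - y_out ** 2)   # tanh'(z) = 1 - tanh(z)^2
    grad_w2 = [y_hidden[j] * dE_dz_out for j in range(2)]
    dE_dy_hidden = [w2[j] * dE_dz_out for j in range(2)]
    dE_dz_hidden = [dE_dy_hidden[j] * (1.0 - y_hidden[j] ** 2) for j in range(2)]
    grad_w1 = [[x[i] * dE_dz_hidden[j] for i in range(2)] for j in range(2)]
    return grad_w1, grad_w2


x, t = [0.5, -0.3], 1.0                 # hypothetical input and target
w1 = [[0.1, -0.2], [0.4, 0.3]]          # hypothetical hidden-layer weights
w2 = [0.7, -0.5]                        # hypothetical output weights
grad_w1, grad_w2 = backward(x, t, w1, w2)

# Finite-difference check of a single weight.
eps = 1e-6
w1_nudged = [row[:] for row in w1]
w1_nudged[0][1] += eps
numeric = (loss(x, t, w1_nudged, w2) - loss(x, t, w1, w2)) / eps
print(grad_w1[0][1], numeric)
```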

Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z) = max(z, 0). In past decades, neural nets used smoother non-linearities, such as tanh(z) or 1/(1 + exp(−z)), but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training [28]. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).

许多深度学习的应用实例都使用了前馈神经网络结构(如图1),这种结构将固定大小的输入(例如一幅图像)映射到固定大小的输出(例如划分为若干的类别的可能性值)。在层间传递中,一些单元计算了来自于上一层的输入的的加权和,并通过非线性函数传递它们的输出。目前最流行的非线性函数是修正线性单元(ReLU),它是一个简单的半波整流函数f(z)=max(z,0)。过去几十年中,神经网络使用了具有更平滑的非线性的函数,例如tanh(z)和1/(1+exp(z)),但是ReLU仍然可以在多层网络中较快地学习,使得不需要进行非监督的预训练就可以训练一个深度的监督网络。不属于输入层和输出层的单元被称为隐藏单元。隐藏层可以被看作是以非线性方式扭曲输入空间的,所以类别就变得可以被输出层线性分离(如图1)。
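
The three non-linearities named above can be written out together with their derivatives. The sketch below is illustrative only, and the sample inputs are arbitrary. It shows one commonly cited factor behind the ReLU's faster learning, offered here as an assumption rather than as the paper's explanation: the sigmoid gradients shrink towards zero for large |z|, whereas the ReLU gradient stays at 1 for any positive input.

```python
import math


def relu(z):
    return max(z, 0.0)


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2


def logistic_grad(z):
    s = logistic(z)
    return s * (1.0 - s)


for z in (-4.0, 0.5, 4.0):
    print(z, relu_grad(z), tanh_grad(z), logistic_grad(z))
```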

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima, that is, weight configurations for which no small change would reduce the average error.

二十世纪九十年代后期,神经网络和BP算法机器学习领域的研究者所遗弃,也被计算机视觉和和语音识别领域的研究者们忽视。人们广泛认为学习实用的、多阶段的、需要很少先验知识的特征提取器是不可行的。特别的是,人们普遍认为简单的梯度下降很可能陷入局部最小——小的改变不能够使平均误差再下降的权重配置。

In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder [29,30]. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.

在实践中,局部最小在大型网络中很少会成为问题。不考虑初始条件,系统总能得到效果差不多的结果。最近的理论和经验结果都强烈表明,局部最小通常不是一个严重的问题。正相反,解空间中存在大量的梯度为零的鞍点,并且在大多数维度上曲面都是弯曲向上的,只有剩下的很少的曲面方向是向下的。分析指出大量出现的鞍点都只有极少的向下卷曲的方向,但是它们几乎所有的都有和目标函数差不多的值。因此,即使算法陷入这些鞍点也没有太大的问题。

Interest in deep feedforward networks was revived around 2006 (refs 31–34) by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By ‘pre-training’ several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation [33–35]. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited [36].

2006年前后,加拿大高级研究院(CIFAR)聚集了一个研究员团队,他们使得人们重燃了对于深度前馈网络的研究兴趣。研究者们介绍了不需要有标签数据就可以创建多层特征检测器的无监督学习方法。在学习过程中,每一层的特征检测器的目标是希望能够在下一层重建或模拟特征检测器(或原始数据)的活动。通过利用重构目标预训练出更加复杂的若干层特征提取器,网络的权重可以被初始化为合适的值。输出层加在网络的顶部后,整个网络可以通过标准的BP算法做出相应的调整。这个方法在手写体识别和行人检测方面有很好的效果,特别是当有标签数据十分有限的时候。

The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program [37] and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary [38] and was quickly developed to give record-breaking results on a large vocabulary task [39]. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups [6] and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting [40], leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

这种预训练尝试的一个主要应用是语音识别,快速图形处理单元(GPU)的出现使得编程变得便捷,并且使研究者们训练网络的速度比以前提升了10到20倍。在2009年这种尝试被应用到将一段声波中提取到的短时间的窗口系数,映射到可以被窗口帧中心代替的一系列语音碎片的概率。它打破了在使用较小词汇库的标准的基准语音识别记录,并很快打破了使用更大的词汇库的识别记录。到2012年为止,从2009年发展起来的深度网络已经被许多主流语音团队所发展,并且已经被应用到了安卓手机上。对于比较小的数据集来说,无监督预训练可以很好地预防过拟合。当有标签数据比较少或者有很多源数据而目标数据很少时,它会取得更好的泛化效果。一旦深度学习的研究获得了恢复,这种预训练也就只有在数据较少的时候才需要了。

There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet) [41,42]. It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.

然而,一种特殊的深度前馈网络相对于那种相邻层使用全连接的网络来说更容易训练,泛化性能也更好。它就是卷积神经网络(ConvNet)。在神经网络不受人们关注期间,它取得了许多实践性的成功,并且最近已经被计算机视觉领域的研究者们广泛接受。

Convolutional neural networks 卷积神经网络

ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

卷积网络是被设计用来处理多维数据的,例如一张 三通道彩色图片。许多数据形态是多维的:1维的是包括语言的信号序列,2维的是图像或声谱图,3维的是视频或立体图像。卷积网络利用自然信号的特性时存在四个关键点:局部连接,权值共享,池化和多层结构。

The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.

典型的卷积网络结构(如图2)由一系列的阶段构成。最初的阶段由卷积层和池化层组成。卷积层的很多节点被构造成了一个特征映射,每个节点与上一层在特征映射中的局部块通过一系列的被称为滤波器的权重连接。这个节点的局部加权和被通过一个非线性函数如ReLU进行传递。一个特征映射中的所有节点分享了相同的滤波器。设计这种结构的原因有两方面。首先,在队列型数据例如图像中,局部的值是高度相关的,形成局部特征图是很容易检测的。其次,图像的局部统计与其他信号在位置上是不相关的。换句话说,一个局部图像也可能出现在其他的任何地方,因此构建在不同位置共享相同权值并在队列的不同部分检测相同模式的单元的方法,在数学上叫做离散卷积,是利用特征映射进行滤波操作的方法。
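
A minimal sketch of how one feature map is computed: a single small filter is slid over the image, so every output unit uses the same weights, and a ReLU is applied to each local weighted sum. The 2×2 vertical-edge filter and the tiny image are hypothetical. As is conventional in deep-learning code, the filter is not flipped, so strictly this is a cross-correlation; the two differ only in the orientation of the filter.

```python
def feature_map(image, kernel):
    """Slide one shared kernel over the image and apply a ReLU at each position."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Local weighted sum over a k-by-k patch, using the shared weights.
            total = sum(kernel[i][j] * image[r + i][c + j]
                        for i in range(k) for j in range(k))
            row.append(max(total, 0.0))
        out.append(row)
    return out


vertical_edge = [[-1.0, 1.0],
                 [-1.0, 1.0]]     # hypothetical filter: responds to dark-to-bright edges
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]            # hypothetical image with an edge between columns 1 and 2

print(feature_map(image, vertical_edge))
```

The output is non-zero only in the column where the edge sits. Moving the edge in the input moves the response by the same amount, which is what sharing weights across locations buys.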

Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.

卷积层的功能是检测前一层特征的局部连接,而池化层的作用是合并语义上相似的特征。这是因为形成一个目标的特征的相对位置会有所不同,位置粗糙颗粒化的特征也可以形成可靠的目标检测。一个典型的池化单元会计算在一个(或几个)特征映射中的一个局部块中单元的最大值。邻近的池化单元对局部块按照一行或一列或者更多的顺序切换取得数据,从而减少了表达的维数,创造了对于微小移动或扭曲的不变性。两到三个卷积层,加上非线性性和池化层的堆叠,再连接上全连接层就构成了卷积网络。同普通深度网络中一样简单的BP算法就可以训练卷积网络所有的滤波器中的权重。
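
Max pooling as described above can be sketched as follows. The 2×2 window, the stride of 2 and the two toy feature maps are assumptions chosen so that the effect is easy to see: each strong response moves by one position between the two maps, yet the pooled output is identical.

```python
def max_pool(feature_map, size=2, stride=2):
    """Take the maximum over local patches that are shifted by `stride` units."""
    out = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            row.append(max(feature_map[r + i][c + j]
                           for i in range(size) for j in range(size)))
        out.append(row)
    return out


original = [[0, 9, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 5],
            [0, 0, 0, 0]]
shifted = [[0, 0, 0, 0],
           [9, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 5, 0]]          # same responses, each moved by one position

print(max_pool(original))
print(max_pool(shifted))
```

Both calls return the same 2×2 map, a quarter the size of the input. The invariance holds only while a shift keeps the response inside the same pooling window.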

Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

深度神经网络在探究自然信号层级组成的特性,其中的高级特征是由低级特征组合获得的。在图像中,边缘的局部组合形成了图案,图案聚合成很多部分,最终组成目标。相似的层级结构也存在于来自电话里的声音、音位、音节以及单词和句子等的这些语音和文本中。当前一层的元素的位置或表现变化时,池化操作能够保证表达几乎不变。

The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience [43], and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway [44]. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey’s inferotemporal cortex [45]. ConvNets have their roots in the neocognitron [46], the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation.

卷积网络中卷积层和池化层是由视觉神经科学中的简单细胞和复杂细胞的经典观念启发而来的,视觉皮层的神经回路是以LGN–V1–V2–V4–IT这样的整体架构构成的。当给卷积网络和猴子展示相同图片时,卷积网络高层单元的激活过程就可以解释猴子的下颞叶皮质中随机组合的由160个神经元组成的神经元组中一半神经元的活性变化。卷积网络的根源可以归结到神经认知机,他们有相似的结构,但神经认知机中没有类似BP算法之类的端到端的监督学习算法。一个简单的1维卷积网络被称作时延神经网络,它曾被用于音位与简单单词的识别。

A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words [47,48].

There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition [47] and document reading [42]. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft [49]. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands [50,51], and for face recognition [52].

从二十世纪九十年代早期开始,卷积网络已经有了大量的应用,最初是时延神经网络用于语音识别和文本阅读。文本阅读系统使用了一个训练好的卷积网络和一个受到语言约束的概率模型的联合。二十世纪九十年代后期,这个系统阅读了大概10%的美国全国的支票。后来微软公司研发出大量的基于卷积网络的视觉特征识别和手写体识别系统。二十世纪九十年代早期,卷积网络也曾被实验在自然图片中的目标检测上,包括人脸识别和手写体识别。

使用深度卷积网络的图像理解 Image understanding with deep convolutional networks

Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition [53], the segmentati