Deep Learning For Smile Recognition
Inspired by recent successes of deep learning in computer vision, we propose a novel application of deep convolutional neural networks to facial expression recognition, in particular smile recognition. A smile recognition test accuracy of 99.45% is achieved on the Denver Intensity of Spontaneous Facial Action (DISFA) database, significantly outperforming existing approaches based on hand-crafted features, whose accuracies range from 65.55% to 79.67%. The novelty of this approach includes a comprehensive model selection of the architecture parameters, which makes it possible to find an appropriate architecture for each expression, such as a smile. This is feasible because all experiments were run on a Tesla K40c GPU, which provided a speedup by a factor of 10 over traditional computations on a CPU.
1. Introduction
Over the last ten years, neural networks have been celebrating a comeback under the term "deep learning": training many hidden layers allows them to self-learn complex feature hierarchies. This makes them of particular interest for computer vision, in which feature description is a long-standing issue. Many advances have been reported in this period, including new training methods and a paradigm shift of training from CPUs to GPUs. Together, these advances make it possible to train more reliable models much faster. This has, for example, resulted in breakthroughs $^3$ in signal processing. Nonetheless, deep neural networks are not a magic bullet, and successful training still relies heavily on experimentation.
The Facial Action Coding System (FACS) $^1$ is a system for taxonomizing any human facial expression by its appearance on the face. Action units describe muscles or muscle groups in the face; each action unit is either set or unset, and its activation may occur at different intensity levels. State-of-the-art approaches in this field mostly rely on hand-crafted features, leaving a lot of potential for higher accuracies. In contrast to other fields such as face or gesture recognition, only very few works on deep learning applied to facial expression recognition have been reported so far, $^2$ and in those the architecture parameters are fixed. We are not aware of publications in which the architecture of a deep neural network for facial expression recognition is subject to extensive model selection. Such model selection makes it possible to learn an appropriate architecture per action unit.
2. Deep neural networks
Training neural networks is difficult, as their cost functions have many local minima. The more hidden layers a network has, the more difficult its training becomes. Hence, training tends to converge to a local minimum, resulting in poor generalization of the network. In order to overcome these issues, a variety of new concepts have been proposed in the literature, of which only a few can be named in this section. Unsupervised pre-training methods, such as autoencoders $\textsf{s}$, make it possible to initialize the weights well so that backpropagation can optimize them quickly. The Rectified Linear Unit (ReLU) $^7$ is a new activation function, and dropout $^{10}$ is a new regularization method. These new training methods and other new concepts can also lead to significant improvements in shallow neural networks with just a few hidden layers.
Convolutional neural networks (CNNs) were initially proposed by LeCun $^5$ for the recognition of handwritten digits. The basic building block of a CNN consists of two layers: a convolutional layer followed by a subsampling layer. Inspired by biological processes and exploiting the fact that nearby pixels are strongly correlated, CNNs are relatively insensitive to small translations or rotations of the input image.
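The insensitivity to small translations can be illustrated with a minimal sketch of a subsampling layer. The sketch assumes non-overlapping $2\times2$ max pooling on a plain nested list; the function name `max_pool_2x2` and the toy images are hypothetical and not taken from the paper. A bright pixel that moves by one position inside the same pooling window yields the same pooled output. A shift that crosses a window boundary would still change the output, so the invariance is only local.

```python
def max_pool_2x2(image):
    """Apply non-overlapping 2x2 max pooling to a 2D list of numbers.

    A trailing odd row or column is dropped (illustrative assumption).
    """
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            window = (image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1])
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Toy 4x4 image with one bright pixel at row 1, column 1.
original = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# Same image with the bright pixel shifted to row 0, column 0,
# which is still inside the top-left pooling window.
shifted = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

pooled_original = max_pool_2x2(original)
pooled_shifted = max_pool_2x2(shifted)
print(pooled_original == pooled_shifted)
```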
Training deep neural networks is slow due to the large number of parameters in the model. As the training can be described in a vectorized form, it can be massively parallelized. Modern GPUs have thousands of cores and are therefore ideal candidates for training neural networks. Significant speedups by a factor of 10 or higher $^9$ have been reported. One difficulty is writing GPU code, but more abstract libraries have been released in the last few years.
3. DISFA database
The Denver Intensity of Spontaneous Facial Action (DISFA) $^6$ database consists of 27 videos of 4,844 frames each, with 130,788 images in total. Action unit annotations are given at different levels of intensity; these levels are ignored in the following experiments, in which each action unit is treated as either set or unset. DISFA was selected from a wider range of databases popular in the field of facial expression recognition because of its high number of smiles, i.e. action unit 12. In detail, 30,792 images have this action unit set, 82,176 images have some action unit(s) set and 48,612 images have no action unit set at all. Fig. 1 shows a sample image from DISFA.
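The image counts above and the reduction of intensity levels to a set/unset label can be checked with a short sketch. The helper `binarize_intensity` is hypothetical. It assumes that intensity level 0 means "unset" and that levels 1 to 5 mean "set", which is one plausible reading of the binarization described above rather than a confirmed detail of the original pipeline.

```python
NUM_VIDEOS = 27
FRAMES_PER_VIDEO = 4844
TOTAL_IMAGES = NUM_VIDEOS * FRAMES_PER_VIDEO

# Counts as stated in the text.
SMILE_IMAGES = 30792          # action unit 12 set
ANY_ACTION_UNIT_IMAGES = 82176
NO_ACTION_UNIT_IMAGES = 48612


def binarize_intensity(level):
    """Map an intensity level (0-5) to 1 (set) or 0 (unset).

    Assumption: level 0 is treated as unset, any higher level as set.
    """
    if not 0 <= level <= 5:
        raise ValueError("intensity level must be between 0 and 5")
    return 1 if level > 0 else 0


# The two disjoint groups should add up to the full database.
counts_are_consistent = (
    ANY_ACTION_UNIT_IMAGES + NO_ACTION_UNIT_IMAGES == TOTAL_IMAGES
)

# Share of images labelled as smiles, relevant for class balance.
smile_share = SMILE_IMAGES / TOTAL_IMAGES

print(TOTAL_IMAGES, counts_are_consistent, round(100 * smile_share, 1))
```

Under these counts, roughly a quarter of all images are smiles, so the binary smile problem is unbalanced in favour of the "no smile" class.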
Fig. 1. Different input parts: a) mouth, b) face. $^6$ (Not at actual input size/proportions.)
In the original paper on DISFA, $^6$ multi-class SVMs were trained for the different levels 0-5 of action unit intensity. Test accuracies for the individual levels and for the binary action unit recognition problem are reported for three different hand-crafted feature description techniques. For these three techniques, smile recognition accuracies of 65.55%, 72.94% and 79.67% are reported.
4. Smile recognition
In the following experiments, an aligned version of DISFA is used. In this aligned version, the faces have been cropped and annotated with facial landmark points. These landmark points make it possible to compute a bounding box that fits the mouth in all images. In the experiments, two inputs are used: the mouth and the face, downscaled to $85\times69$ and $128\times104$ pixels, respectively. Both inputs are used to assess whether the mouth alone is as expressive as, or even more expressive than, the entire face for smile recognition.
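The bounding-box step can be sketched as follows. This is only an illustration under stated assumptions: the landmark coordinates, the `margin` parameter and the function names `bounding_box` and `crop` are hypothetical, and the actual landmark format of the aligned DISFA version is not described here. Images are represented as nested lists indexed as `image[y][x]`.

```python
def bounding_box(points, margin=0):
    """Return (left, top, right, bottom) enclosing all (x, y) points.

    `margin` widens the box on every side (hypothetical parameter).
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)


def crop(image, box):
    """Crop a 2D list `image[y][x]` to the inclusive box."""
    left, top, right, bottom = box
    # Clamp to the image origin; slicing handles the far edges.
    left, top = max(left, 0), max(top, 0)
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Hypothetical mouth landmarks as (x, y) pixel coordinates.
mouth_landmarks = [(3, 4), (7, 2), (5, 6)]

# Toy 8x10 image whose pixel values encode their own position.
image = [[10 * y + x for x in range(10)] for y in range(8)]

box = bounding_box(mouth_landmarks)
mouth = crop(image, box)
print(box, len(mouth), len(mouth[0]))
```

In a full pipeline, the cropped region would then be downscaled to the fixed input size, for example $85\times69$ pixels for the mouth.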
4.1. Model
The architecture of the network is as follows. The input images are fed into a convolution comprising a convolutional layer and a subsampling layer. That convolution may be followed by more convolutions so that the network becomes gradually more invariant to distortions in the input. In the second stage, a regular neural network follows the convolutions in order to discriminate the features learned by the convolutions. The output layer consists of two units, for smile and no smile. The novelty of this approach is that the exact number of convolutions, the number of hidden layers and the size of the hidden layers are not fixed but subject to extensive model selection in Sec. 4.3.
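To see how many convolutions each input can support, the following sketch traces the feature map size through a stack of convolutions. It uses the $5\times5$ convolution and $2\times2$ max pooling sizes fixed in Sec. 4.2, and it assumes unpadded ("valid") convolutions with stride 1 and pooling that discards a trailing odd row or column. Neither assumption is stated in the text, and the function names are hypothetical.

```python
def conv_output(size, kernel=5):
    """Output length of an unpadded, stride-1 convolution (assumption)."""
    return size - kernel + 1


def pool_output(size, window=2):
    """Output length of non-overlapping pooling, dropping any remainder."""
    return size // window


def feature_map_sizes(first, second, num_convolutions):
    """Return the feature map size after each convolution + subsampling."""
    sizes = []
    for _ in range(num_convolutions):
        first = pool_output(conv_output(first))
        second = pool_output(conv_output(second))
        if first < 1 or second < 1:
            raise ValueError("input too small for this many convolutions")
        sizes.append((first, second))
    return sizes


def max_convolutions(first, second):
    """Largest number of convolutions that still leaves a valid map."""
    count = 0
    while True:
        try:
            feature_map_sizes(first, second, count + 1)
        except ValueError:
            return count
        count += 1


mouth_sizes = feature_map_sizes(85, 69, 2)
face_sizes = feature_map_sizes(128, 104, 2)
print(mouth_sizes, face_sizes)
print(max_convolutions(85, 69), max_convolutions(128, 104))
```

Under these assumptions the mouth input shrinks faster than the face input, which bounds the range of convolution counts worth including in the model selection.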
4.2. Experiment setting
Due to training time constraints, some parameters have been fixed to reasonable empirical values, such as the size of the convolutions ($5\times5$ pixels, 32 feature maps) and the size of the subsampling ($2\times2$ pixels using max pooling). All layers use ReLU units, except for the output layer, which uses softmax. The learning rate is fixed to $\alpha=0.01$ and is not subject to model selection, as including it would significantly prolong the model selection. The same consideration applies to the momentum, which is fixed to $\mu=0.9$.
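The roles of the softmax output layer and of the fixed values $\alpha=0.01$ and $\mu=0.9$ can be illustrated with a minimal sketch. The exact momentum formulation used in the experiments is not given, so the classical form $v \leftarrow \mu v - \alpha g$, $w \leftarrow w + v$ is assumed here. The function names and the toy numbers are hypothetical.

```python
import math

LEARNING_RATE = 0.01  # alpha, as fixed in the text
MOMENTUM = 0.9        # mu, as fixed in the text


def softmax(logits):
    """Turn raw output-unit activations into probabilities."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def momentum_step(weights, gradients, velocities,
                  learning_rate=LEARNING_RATE, momentum=MOMENTUM):
    """One classical momentum update (assumed formulation)."""
    new_velocities = [momentum * v - learning_rate * g
                      for v, g in zip(velocities, gradients)]
    new_weights = [w + v for w, v in zip(weights, new_velocities)]
    return new_weights, new_velocities


# Two output units: index 0 for "no smile", index 1 for "smile"
# (the ordering is an arbitrary choice for this sketch).
probabilities = softmax([2.0, -1.0])

# Two consecutive updates of a single toy weight with a constant gradient.
weights, velocities = momentum_step([1.0], [2.0], [0.0])
weights, velocities = momentum_step(weights, [2.0], velocities)
print(probabilities, weights, velocities)
```

With a constant gradient, the second step is larger than the first because the velocity accumulates, which is the effect momentum is meant to provide.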