[Paper translation] Deep Learning for Smile Recognition


Original paper: https://arxiv.org/pdf/1602.00172v2


Deep Learning For Smile Recognition

Inspired by recent successes of deep learning in computer vision, we propose a novel application of deep convolutional neural networks to facial expression recognition, in particular smile recognition. A smile recognition test accuracy of 99.45% is achieved on the Denver Intensity of Spontaneous Facial Action (DISFA) database, significantly outperforming existing approaches based on hand-crafted features, whose accuracies range from 65.55% to 79.67%. The novelty of this approach includes a comprehensive model selection of the architecture parameters, which makes it possible to find an appropriate architecture for each expression, such as smile. This is feasible because all experiments were run on a Tesla K40c GPU, which gave a speedup of factor 10 over traditional computations on a CPU.

1. Introduction

Over the last ten years, neural networks have been celebrating a comeback under the term "deep learning": training many hidden layers allows them to self-learn complex feature hierarchies. This makes them of particular interest for computer vision, in which feature description is a long-standing issue. Many advances have been reported in this period, including new training methods and a paradigm shift of training from CPUs to GPUs. As a result, more reliable models can be trained much faster. This has, for example, resulted in breakthroughs [3] in signal processing. Nonetheless, deep neural networks are not a magic bullet, and successful training is still heavily based on experimentation.

The Facial Action Coding System (FACS) [1] is a system to taxonomize any human facial expression by its appearance on the face. Action units describe muscles or muscle groups in the face. Each action unit is either set or unset, and its activation may occur at different intensity levels. State-of-the-art approaches in this field mostly rely on hand-crafted features, leaving a lot of potential for higher accuracies. In contrast to other fields such as face or gesture recognition, only very few works on deep learning applied to facial expression recognition have been reported so far [2], and in those the architecture parameters are fixed. We are not aware of publications in which the architecture of a deep neural network for facial expression recognition is subject to extensive model selection. Such model selection makes it possible to learn an appropriate architecture per action unit.

2. Deep neural networks

Training neural networks is difficult, as their cost functions have many local minima. The more hidden layers a network has, the more difficult it is to train. Hence, training tends to converge to a local minimum, resulting in poor generalization of the network. In order to overcome these issues, a variety of new concepts have been proposed in the literature, of which only a few can be named in this chapter. Unsupervised pre-training methods, such as autoencoders [8], initialize the weights well so that backpropagation can quickly optimize them. The Rectified Linear Unit (ReLU) [7] and dropout [10] are new regularization methods. The new training methods and other new concepts can also lead to significant improvements of shallow neural networks with just a few hidden layers.

Convolutional neural networks (CNNs) were initially proposed by LeCun [5] for the recognition of hand-written digits. A CNN consists of two layers: a convolutional layer, followed by a subsampling layer. Inspired by biological processes and exploiting the fact that nearby pixels are strongly correlated, CNNs are relatively insensitive to small translations or rotations of the image input.

Training deep neural networks is slow due to the number of parameters in the model. As the training can be described in a vectorized form, it can be massively parallelized. Modern GPUs have thousands of cores and are therefore an ideal candidate for training neural networks. Significant speedups of factor 10 or higher [9] have been reported. One difficulty is writing GPU code, but more abstract libraries have been released in the last few years.

3. DISFA database

The Denver Intensity of Spontaneous Facial Action (DISFA) [6] database consists of 27 videos of 4844 frames each, with 130,788 images in total. Action unit annotations come at different levels of intensity. These levels are ignored in the following experiments, in which action units are either set or unset. DISFA was selected from a wider range of databases popular in the field of facial expression recognition because of its high number of smiles, i.e. action unit 12. In detail, 30,792 images have this action unit set, 82,176 images have some action unit(s) set and 48,612 images have no action unit(s) set at all. Fig. 1 contains a sample image of DISFA.
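The counts quoted for DISFA can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers stated in the paragraph above; the variable names are illustrative. It confirms that 27 videos of 4844 frames give 130,788 images, that the "some action unit set" and "no action unit set" groups add up to that total, and that smiles make up roughly 23.5% of all images.

```python
# Counts quoted in the text for DISFA; names are illustrative only.
videos = 27
frames_per_video = 4844
total_images = videos * frames_per_video

smile_images = 30_792    # images with action unit 12 set
any_au_images = 82_176   # images with at least one action unit set
no_au_images = 48_612    # images with no action unit set

# The two disjoint groups should cover the whole database.
groups_cover_database = (any_au_images + no_au_images == total_images)

smile_fraction = smile_images / total_images
print(f"total images: {total_images}")
print(f"groups cover database: {groups_cover_database}")
print(f"smile fraction: {smile_fraction:.1%}")
```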


Fig. 1. Different input parts: a) mouth, b) face [6]. (Not at actual input size/proportions.)

In the original paper on DISFA [6], multi-class SVMs were trained for the different levels 0-5 of action unit intensity. Test accuracies for the individual levels and for the binary action unit recognition problem are reported for three different hand-crafted feature description techniques. In those three cases, accuracies of 65.55%, 72.94% and 79.67% for smile recognition are reported.

4. Smile recognition

In the following experiments, an aligned version of DISFA is used. In this aligned version, the faces have been cropped and annotated with facial landmark points. These landmark points make it possible to compute a bounding box that fits the mouth in all images. In the experiments, two inputs are used: the mouth and the face, downscaled to $85\times69$ and $128\times104$ pixels, respectively. Both inputs are used to assess whether the mouth alone is as expressive as, or even more expressive than, the entire face for smile recognition.
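The text does not say how the mouth bounding box is derived from the landmark points. A minimal sketch of one plausible reading: take the tightest box around each image's mouth landmarks, then take the union over all images so that a single crop region fits every mouth. The function names, the `margin` parameter and the coordinates in the usage lines are hypothetical.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def landmark_box(landmarks: Iterable[Point], margin: float = 0.0) -> Box:
    """Tightest axis-aligned box around the points, grown by `margin`."""
    points = list(landmarks)
    if not points:
        raise ValueError("at least one landmark point is required")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)


def union_box(boxes: List[Box]) -> Box:
    """Smallest box that contains every box in the list."""
    if not boxes:
        raise ValueError("at least one box is required")
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


# Made-up mouth landmarks for two images, for illustration only.
box_a = landmark_box([(10, 20), (30, 25), (20, 40)], margin=2)
box_b = landmark_box([(7, 22), (28, 43)], margin=2)
crop_region = union_box([box_a, box_b])
```

The resulting region would then be cropped from every aligned image and downscaled to the input size.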

4.1. Model

The architecture of the network is as follows: The input images are fed into a convolution comprising a convolutional and a subsampling layer. That convolution may be followed by more convolutions to become gradually more invariant to distortions in the input. In the second stage, a regular neural network follows the convolutions in order to discriminate the features learned by the convolutions. The output layer consists of two units for smile or no smile. The novelty of this approach is that the exact number of convolutions, the number of hidden layers and the size of the hidden layers are not fixed but subject to extensive model selection in Sec. 4.3.

4.2. Experiment setting

Due to training time constraints, some parameters have been fixed to reasonable, empirical values, such as the size of the convolutions ($5\times5$ pixels, 32 feature maps) and the size of the subsamplings ($2\times2$ pixels using max pooling). All layers use ReLU units, except for the output layer, which uses softmax. The learning rate is fixed to $\alpha=0.01$ and not subject to model selection, as this would significantly prolong the model selection. The same considerations apply to the momentum, which is fixed to $\mu=0.9$.
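To get a feeling for how the fixed $5\times5$ convolutions and $2\times2$ max pooling shrink the two inputs, the following sketch computes the spatial size of the feature maps after each convolution. It assumes "valid" convolutions with stride 1 and non-overlapping pooling that discards incomplete border windows; the text does not state these details. The number of convolutions per input is taken from Table 2, and all function names are illustrative.

```python
def conv_block_output(height, width, kernel=5, pool=2):
    """Spatial size after one valid kernel x kernel convolution
    followed by non-overlapping pool x pool max pooling."""
    height, width = height - kernel + 1, width - kernel + 1
    return height // pool, width // pool


def flattened_features(height, width, n_convolutions, feature_maps=32):
    """Number of values handed to the first hidden layer."""
    for _ in range(n_convolutions):
        height, width = conv_block_output(height, width)
    return feature_maps * height * width


# Mouth input: 85 x 69 pixels, two convolutions (Table 2).
mouth_features = flattened_features(85, 69, n_convolutions=2)
# Face input: 128 x 104 pixels, one convolution (Table 2).
face_features = flattened_features(128, 104, n_convolutions=1)

print(mouth_features, face_features)
```

Under these assumptions the mouth model passes 8,064 values to its first hidden layer and the face model 99,200, so the second convolution reduces the input to the fully connected stage considerably.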

The entire database has been randomly split into a 60%/20%/20% training/validation/test ratio. Training neural networks comes with uncertainties, mostly due to the random initialization of the weights, but also due to the random split of the data. Evaluations have shown that for 10 similar experiments, the standard deviation of the test accuracy is 0.041725%. Because of this low standard deviation, performing each experiment exactly once introduces only a very low bias and is therefore relatively safe, which keeps the training time short. Throughout the experiments, the classification rate is used as the accuracy measure.
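A random 60%/20%/20% split can be sketched as follows. This is a minimal illustration on image indices, assuming a split at the level of individual images; the seed, the rounding of the split sizes and the function name are assumptions, not details from the text.

```python
import random


def split_indices(n_items, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle item indices and cut them into train/validation/test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(ratios[0] * n_items)
    n_val = int(ratios[1] * n_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_indices(130_788)
print(len(train), len(val), len(test))
```

Repeating the split and the training with different seeds and taking the standard deviation of the resulting test accuracies would give the kind of spread estimate quoted above.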

The model is implemented using Lasagne [4], and the generated CUDA code is executed on a Tesla K40c, since training on a GPU makes a comprehensive model selection possible in a feasible amount of time. Stochastic gradient descent with a batch size of 500 is used.
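The update rule behind stochastic gradient descent with momentum can be written in a few lines. The sketch below uses the classical momentum formulation with the learning rate 0.01 and momentum 0.9 from the text; whether the experiments used classical or Nesterov momentum is not stated, so this is an assumption. The quadratic toy objective and the training-set size of 78,472 images (60% of 130,788, rounded down) are illustrative as well.

```python
import math


def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """One classical momentum update: v <- mu * v - lr * g, then w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy objective f(w) = sum(w_i^2) / 2, whose gradient is simply w.
weights, velocity = [1.0, -2.0], [0.0, 0.0]
for _ in range(500):
    grads = list(weights)
    weights, velocity = sgd_momentum_step(weights, grads, velocity)

# Mini-batches per epoch for an assumed training set of 78,472 images.
batches_per_epoch = math.ceil(78_472 / 500)
print(weights, batches_per_epoch)
```

In actual training, `grads` would be the gradient of the loss averaged over one mini-batch of 500 images rather than the gradient of a toy function.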

4.3. Parameter optimization

Table 1 contains the four parameters to be optimized: the number of convolutions, the number of hidden layers, the number of units per hidden layer and the dropout factor. Each parameter was optimized independently due to training time constraints. This may not lead to an optimal model, but has proven to work well empirically. Each model was trained for 50 epochs in the model selection.

For both inputs, Table 2 contains the final models selected. For the mouth input, there is a preference for more convolutions and more hidden layers. This is the case because slight translations or rotations in the mouth input have stronger consequences for the classification result. In the entire face, this sort of distortion may be less of a problem because other parts of the face, such as the cheeks, contribute to smile recognition, too.

Table 1. Parameters and possible values used in model selection.

Parameter               Values               Default value
#Convolutions           1, 2, 3              1
#Hidden layers          1, 2, 3              1
#Units / hidden layer   100, 200, 300, 400   100
Dropout                 0, 0.1, 0.5, 0.7     0.5
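Optimizing each parameter independently keeps the number of models to train small. The sketch below illustrates this for the search space of Table 1, assuming that the remaining parameters are held at their default values while one parameter is varied; the text does not spell this out. `toy_validation_score` is a hypothetical stand-in for training a model for 50 epochs and measuring its validation accuracy, and does not reflect any result from the paper.

```python
from itertools import product

# Search space and default values as listed in Table 1.
SEARCH_SPACE = {
    "convolutions": [1, 2, 3],
    "hidden_layers": [1, 2, 3],
    "units_per_hidden_layer": [100, 200, 300, 400],
    "dropout": [0.0, 0.1, 0.5, 0.7],
}
DEFAULTS = {
    "convolutions": 1,
    "hidden_layers": 1,
    "units_per_hidden_layer": 100,
    "dropout": 0.5,
}


def one_at_a_time_configs(search_space, defaults):
    """Distinct configurations visited when one parameter varies at a time."""
    configs = set()
    for name, values in search_space.items():
        for value in values:
            config = {**defaults, name: value}
            configs.add(tuple(sorted(config.items())))
    return configs


def select_independently(search_space, defaults, evaluate):
    """Pick the best value per parameter, others held at their defaults."""
    best = {}
    for name, values in search_space.items():
        scores = [evaluate({**defaults, name: value}) for value in values]
        best[name] = values[scores.index(max(scores))]
    return best


def toy_validation_score(config):
    """Hypothetical score; a real run would train and validate a model."""
    return config["units_per_hidden_layer"] / 1000 - abs(config["dropout"] - 0.1)


n_independent = len(one_at_a_time_configs(SEARCH_SPACE, DEFAULTS))
n_full_grid = len(list(product(*SEARCH_SPACE.values())))
selected = select_independently(SEARCH_SPACE, DEFAULTS, toy_validation_score)
print(n_independent, n_full_grid)
```

Under this reading, independent optimization needs 11 distinct models per input, compared with 144 for a full grid over the same values. The price is that interactions between parameters are not explored.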

Table 2. Selected parameter values for mouth and face input.

Input   #Convs   #Hidden layers   #Units / hidden layer   Dropout
Mouth   2        2                400                     0.1
Face    1        1                400                     0

4.4. Results and discussion

Both final models were trained for 1000 epochs. The test accuracies of both models started to converge after about 300 epochs. For the mouth and face inputs, the best accuracies were achieved after 700 and 1000 epochs with 99.45% and 99.34%, respectively. Both models significantly outperform the state-of-the-art SVM baselines reported in Sec. 3, which range from 65.55% to 79.67%. Overall, there is no strong preference for either the mouth or the face input. Further experiments with a reduced dataset containing only 70% of the images that have no action unit(s) set at all support this hypothesis. Concretely, the test accuracies for the mouth and face input dropped to 99.24% and 99.26%, respectively. Thus, the difference between the two models has been further reduced, this time with a very slight preference for the face input. Nonetheless, this difference is not representative, as it is within the experiment error standard deviation reported in Sec. 4.2.
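The comparison with the standard deviation from Sec. 4.2 can be made explicit. The sketch below expresses the gap between the mouth and face accuracies in units of that standard deviation (0.041725 percentage points); all numbers come from the text, and the helper name is illustrative.

```python
STD_TEST_ACCURACY = 0.041725  # percentage points, from Sec. 4.2

# Test accuracies in percent, as reported in the text.
results = {
    "full dataset": {"mouth": 99.45, "face": 99.34},
    "reduced dataset": {"mouth": 99.24, "face": 99.26},
}


def gap_in_std_units(accuracies):
    """Absolute mouth/face gap divided by the run-to-run standard deviation."""
    return abs(accuracies["mouth"] - accuracies["face"]) / STD_TEST_ACCURACY


gaps = {name: gap_in_std_units(acc) for name, acc in results.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} standard deviations")
```

On the reduced dataset the gap is about half a standard deviation, which matches the statement that the difference is within the experiment error. On the full dataset the gap is roughly 2.6 standard deviations.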

Training times per epoch are 82 seconds and 41 seconds for the mouth and face input models, respectively. Experiments have shown that the training time mostly depends on the number of convolutions. Using the Tesla K40c GPU sped up training by a factor of ten compared with executing the CPU code generated by the library on a CPU. This clearly demonstrates the importance of training on a GPU to do a comprehensive model selection in a feasible amount of time.
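The per-epoch times translate into total training times as follows. This is a back-of-the-envelope sketch: it assumes the reported per-epoch times hold over all 1000 epochs and that the factor of ten applies uniformly, neither of which is stated explicitly. The function name is illustrative.

```python
def training_hours(seconds_per_epoch, epochs, slowdown=1.0):
    """Total training time in hours, optionally scaled by a slowdown factor."""
    return seconds_per_epoch * epochs * slowdown / 3600


mouth_gpu = training_hours(82, 1000)
face_gpu = training_hours(41, 1000)
mouth_cpu = training_hours(82, 1000, slowdown=10)  # assumed 10x slower on CPU

print(f"mouth model on GPU: {mouth_gpu:.1f} h")
print(f"face model on GPU: {face_gpu:.1f} h")
print(f"mouth model on CPU (estimate): {mouth_cpu:.1f} h")
```

Under these assumptions, the final mouth model takes a little under a day on the GPU, and the same run would take more than nine days on a CPU.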

5. Conclusions and future work

Deep learning is an umbrella term for training neural networks with potentially many hidden layers using new training methods that make it possible to learn complex feature hierarchies from data. Applied to action unit recognition, and smile recognition in particular, a deep convolutional neural network model with an overall accuracy of 99.45% significantly outperforms existing approaches. The underlying extensive model selection makes it possible to find an appropriate architecture for each action unit in order to maximize test accuracies. In the future, we will extend the model to images from multiple databases and to predictions on image sequences.

References
