Skeleton-Based Approaches in Machine Vision: A Survey

Abstract

Recently, skeleton-based approaches have achieved rapid progress on the basis of great success in skeleton representation. Plenty of research focuses on solving specific problems with skeleton features. Skeleton-based approaches have been mentioned in several overviews of object detection, but only as a non-essential part, and no thorough analysis has yet been devoted to them. Instead of describing these techniques in terms of theoretical constructs, we summarize skeleton-based approaches with regard to application fields and given tasks as comprehensively as possible. This paper is conducive to a further understanding of skeleton-based applications and to dealing with particular issues.

Introduction

Skeleton-based approaches, also known as kinematic techniques, model a set of joints and a group of limbs based on physiological body structure. Typically, the number of joints is determined by computational complexity and lies between ten and thirty. In recent years, some significant techniques have succeeded in representing skeletons with dots and edges; they can be categorized as top-down methods (e.g., Cascaded Pyramid Network and PoseFix) and bottom-up methods (e.g., OpenPose). In order to analyze skeleton-based approaches in depth, we observe these studies from two aspects (i.e., single-frame and multi-frame). Fig. 1 summarizes this taxonomy. In the single-frame type, tasks are handled by investigating independent images within videos. Correspondingly, the multi-frame type requires a series of sequential images from videos to explore their inherent structure.


Single-frame Methods

Multi-view Pose Estimation

Although skeletons with 3D structures enhance the accuracy of pose estimation, training requires plenty of 3D ground-truth data, which is costly. Thus, multiple 2D skeletons from different views are integrated into 3D structures under a given strategy to determine the pose. EpipolarPose implements epipolar geometry to combine the 2D skeletons and trains a 3D pose estimator with camera geometry information. RepNet finds the mapping from 2D to 3D skeletons by designing an adversarial training approach and adding feedback projection from 3D to 2D skeletons. Based on the Wasserstein generative adversarial network (WGAN), RepNet generates 3D skeletons by feeding 2D skeletons into the WGAN as input. The produced 3D skeletons are reprojected to the 2D domain by a camera loss. A substitutive fusion approach tracks the status of joints in two views of 2D skeletons and labels the joints as "well tracked", "inferred", or "not tracked". For a joint in different views, the one diagnosed as "well tracked" has the highest priority, and an inferred one is preferred over a "not tracked" one. Furthermore, dynamic time warping with a K-nearest-neighbor classifier is adopted to classify learned skeleton representations. For multiple persons, a multi-way matching algorithm clusters detected 2D skeletons by associating the keypoints of a person across views. Two networks (named VA-RNN and VA-CNN) work together to discover skeleton representations of actions. An automatic selection scheme in both nets chooses prior viewpoints of 2D skeletons rather than applying a fixed criterion. In VA-RNN, 2D skeletons are rotated to obtain a complex feature by training multiple LSTMs. In VA-CNN, an intrinsic feature is abstracted by a convolutional network. These two features are fused to determine action classes. Additionally, loss functions (e.g., regression loss and consistency loss) in deep neural networks (e.g., CNNs) are designed to improve pose estimation within multi-view skeletons, and distance biases of skeleton information in multiple views are minimized to improve fusion accuracy. Relation evaluation (e.g., similarity) between skeletons in two views is still an important issue.
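The multi-view fusion that EpipolarPose-style pipelines rely on can be illustrated with classical triangulation: given the projection matrices of two calibrated cameras and a joint's 2D detections in both views, the direct linear transform (DLT) recovers the 3D joint. This is a minimal sketch of the underlying geometry, not the cited implementations:

```python
import numpy as np

def triangulate_joint(P1, P2, x1, x2):
    """Triangulate one 3D joint from its 2D detections in two views.

    P1, P2: 3x4 camera projection matrices; x1, x2: 2D image coordinates.
    Builds the homogeneous system A X = 0 from the projection equations
    and solves it via SVD (direct linear transform).
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize
```

With more than two views, the same system simply gains two rows per extra camera, which is how multi-view setups improve robustness.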

Object Segmentation

Object segmentation based on skeletons aims at cutting targeted items (including biological and non-biological things) out from the rest of the image. In this type of task, object positions are ascertained first, and then object margins are drawn out. Pose2Seg focuses on human segmentation in images. In the model, detected skeletons are compared with established standard skeletons of each pose, and an affine transformation matrix is used to calculate similarity. SegModule is presented to understand skeleton features and their corresponding visualization. In addition, semantics is introduced as a supplementary perspective: by training a Part FCN to obtain a semantic part score map, the body is further divided into head, torso, left arm, left leg, right arm, and right leg. Another human segmentation approach discovers edge information by adding the widths of physical parts (e.g., head and shoulders, neck, chest, hip) to body skeletons. Considering the cascade connection of curve skeletons, which resembles a tree and its leaves, body segmentation is realized by trimming branches and isolating intersecting leaves. In a human counting case, blurry margins are allowed, let alone body parts: after removing the background, all heads are identified within skeleton graph segmentation and used to obtain the number of people. Besides human body segmentation, the segmentation of animals and non-biological items is also studied. Critical points are selected on the basis of surface skeletons and expanded into component sets separately. With skeleton matching techniques, skeletal branches over a particular position are reconstructed by estimating the distances between two views and used to reconstruct non-biological things. Pinpointing the object margins of diverse items depends not only on skeletons but also on item structures.
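The affine-alignment step that Pose2Seg-style comparison relies on can be sketched as a least-squares fit between detected and template keypoints; the residual then serves as a crude (dis)similarity score. The scoring details here are illustrative, not those of the cited model:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform mapping detected keypoints onto a
    template skeleton (the alignment idea behind Pose2Seg-style matching).

    src, dst: (N, 2) arrays of corresponding joint coordinates.
    Returns a 2x3 matrix A such that dst ~= [src | 1] @ A.T, plus the
    residual error, usable as a rough pose (dis)similarity score.
    """
    n = src.shape[0]
    src_h = np.hstack([src, np.ones((n, 1))])       # homogeneous coords
    sol, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    err = float(np.linalg.norm(src_h @ sol - dst))  # alignment residual
    return sol.T, err
```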

Static Pose Estimation

Without comparison against any benchmark, it is hard to evaluate poses from only one image. A 2D-3D-2D path learns skeleton features thoroughly by projecting the original 2D skeleton, which serves as input, into 3D and reprojecting it into 2D with lifting networks. A parsing-induced learner exploits parsing information to enhance skeleton information through a pose encoder: one pose encoder abstracts pose features while another fuses residual information into the pose representation. HRNet uses parallel high-to-low resolution subnetworks to obtain both high and low resolutions of skeletons and sums them up. Based on ConvNets, 2D images are translated into 3D skeleton models. For instance, heatmaps and silhouettes are extracted from a 2D body, and then pose and shape parameters are abstracted separately; these parameters are meshed together into a 3D body with 2D annotations. In another case, depth knowledge between paired joints in skeletons is integrated with ConvNets to produce a 3D pose. PoseRefiner implements skeletons in binary channels and pose classes to train ConvNets that learn likelihood heatmaps for refining the skeletons. A dual-source approach learns representations from both 2D and 3D skeletons: 3D poses are projected into multiple 2D skeletons and used to find the highest likelihood along with a test image. To inspect skeletons closely, an upper-body visualization uses different colors and polylines to distinguish between the left and right body; with these representations, 16 poses are clustered with high accuracy. Furthermore, ConvNet parameters are analyzed to find prior settings for identifying poses with skeletons. The Cascaded Pyramid Network (CPN), containing two networks (i.e., GlobalNet and RefineNet), adopts a top-down pipeline: it first locates skeleton keypoints with a ResNet backbone, extracts features of these keypoints as in HyperNet, and then assembles them.
A regional multi-person pose estimation framework (RMPE) involves spatial transformer networks to rectify various ground-truth bounding boxes for people, and a final box is obtained for each person. DeepCut uses an adapted fast R-CNN (AFR-CNN) to detect body parts with integer linear programming and a dense CNN (Dense-CNN) to obtain intensity with geometric and appearance constraints. Based on DeepCut, DeeperCut improves body part detection with a bottom-up pipeline and an image-conditioned pairwise assembling strategy; angles among body parts are observed meticulously to assist in searching for pairwise joints. Without consecutive images of an action, features learned from static poses can resolve the estimation.
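Heatmap-based estimators such as CPN and HRNet share a common final decoding step: taking each per-joint heatmap maximum as the keypoint. A simplified version (plain argmax, no sub-pixel refinement):

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Decode per-joint heatmaps of shape (J, H, W) into (J, 2) keypoint
    coordinates (x, y) plus per-joint confidences.

    This is the simplified decoding shared by heatmap-based estimators:
    the peak location is the keypoint, the peak value its confidence.
    """
    J, H, W = heatmaps.shape
    flat = heatmaps.reshape(J, -1)
    idx = flat.argmax(axis=1)              # flat index of each peak
    conf = flat[np.arange(J), idx]         # peak value = confidence
    ys, xs = np.divmod(idx, W)             # recover row/column
    return np.stack([xs, ys], axis=1), conf
```

Production decoders usually add a quarter-pixel shift toward the second-highest neighbor; that refinement is omitted here for clarity.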

Object Classification

The purpose of object classification is to identify the items in images by observing skeletons, without any limitation on object types. In order to offset skeleton noise in classification, skeleton contours are trimmed as a trade-off between shape reconstruction error and skeleton simplicity based on Bayesian theory. The tree representation of a skeleton as a graph is turned into strings of skeleton edges, and a deformable contour method is used to compare these strings across diverse objects. Node pairs in curve skeletons are obtained by a cascade of symmetry filters, and a symmetry correspondence matrix is designed to obtain a symmetry cloud, which is classified by spectral analysis. Curve skeletons are the main tool for object classification when each object undergoes obvious transformation.
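The string-comparison idea behind serialized skeleton trees can be sketched with a plain edit distance over skeleton-edge strings. The cited work uses a deformable contour method; Levenshtein distance is used here only as a minimal stand-in:

```python
def edit_distance(a, b):
    """Levenshtein distance between two skeleton-edge strings.

    Each character stands for one encoded skeleton edge; a smaller
    distance means the two serialized skeleton trees are more alike.
    Classic dynamic programming over a (len(a)+1) x (len(b)+1) table.
    """
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                        # deletions
    for j in range(n + 1):
        d[0][j] = j                        # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute
    return d[m][n]
```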

Pose Denoising

Typically, skeletons are governed by specific structures, especially for the human body. When odd poses occur and parts of bodies overlap, it is hard to separate the skeleton of each body. Redundant joints and limbs are deemed noise, which can be eliminated by filtering linear transformations and comparing joint positions and limb angles with standard ones. Advanced denoising algorithms may further improve the capability of pose denoising.
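Comparing limb angles with standard ones can be sketched as follows; the tolerance and the reference angles are illustrative placeholders, not values from the cited work:

```python
import numpy as np

def flag_noisy_limbs(joints, limbs, ref_angles, tol_deg=25.0):
    """Flag limbs whose orientation deviates from a reference pose.

    joints: (J, 2) joint coordinates; limbs: list of (parent, child)
    joint-index pairs; ref_angles: reference angle in degrees per limb.
    A limb is flagged as noise when its angle differs from the reference
    by more than tol_deg degrees (differences wrapped to [-180, 180)).
    """
    flags = []
    for (p, c), ref in zip(limbs, ref_angles):
        v = joints[c] - joints[p]
        ang = np.degrees(np.arctan2(v[1], v[0]))
        diff = (ang - ref + 180.0) % 360.0 - 180.0   # wrap the difference
        flags.append(bool(abs(diff) > tol_deg))
    return flags
```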

Image Synthesis

Commonly, image synthesis tries to produce other views of skeletons by learning the representation of skeletons in one view. GAN is a popular technique for this problem, but a traditional GAN has a weak ability to capture the relationships among joints and limbs. A bidirectional GAN is adopted to search for the mapping from an initial pose skeleton to another pose skeleton. Deformable GANs use heatmaps from skeletons as conditional information and human poses as original images fed into the generator to generate other pose images; the generated poses are passed to the discriminator with the corresponding heatmaps. Furthermore, a CRF-RNN is established to predict a conditional human body with pose transformation from a given skeleton to a target skeleton. A Mask R-CNN model is built to transform a pose skeleton into another pose skeleton, and parallel models operate simultaneously with diverse keypoints sampled from initial pose skeletons; these mappings are combined to create a new pose skeleton. Some studies exploit physical structure characteristics to reconstruct images rather than studying the inherent representation. For example, body skeletons are divided into multiple components and the background is drawn out; these components are rotated and integrated into a new body, which is blended into a synthesized new background. GAN is a useful tool for generation although its performance is still unstable. When applying GANs to skeleton synthesis, how to improve the quality is still a challenge.

Benchmark-based Identification

Comparison with a benchmark is a straightforward way to judge the class of items. For skeletons, key points in standard and target skeletons are contrasted in turn. Skeleton information (joints and limbs) is transformed into a tree of key points which is compared with given trees; the best match of two shape trees is treated as the identical thing. A skeleton graph of any object is also used for comparison with standard units in terms of both geometric and topological similarity. To highlight the joints in comparisons, lines among joints capture the relations of two joints instead of limbs; unlike the fixed number of compared joints in the previous two techniques, the comparison steps in these skeleton graphs are random. Similarity metrics along with 3D skeleton features are designed to measure the similarity between the human skeleton in a single-frame image and the templates. Body dot clouds are combined with main curve skeletons to judge a matching ratio of human motions, which works better than the case involving only curve skeletons. Skeletons captured with Kinect from the upper and lower body are analyzed separately to identify human gaits with an ANN as classifier. The evaluation criterion of similarity between object and template skeletons is controlled by experience; more complex metrics may perform well in similarity assessment.
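One simple instance of a similarity metric between a detected skeleton and a template is cosine similarity after root-centering and scale normalization. The cited metrics are more elaborate; this is a minimal sketch:

```python
import numpy as np

def skeleton_similarity(query, template):
    """Cosine similarity between two skeletons after normalization.

    query, template: (J, 2) or (J, 3) joint arrays with joint 0 assumed
    to be the root. Both skeletons are centered on the root and scaled
    to unit norm, so translation and uniform scale do not affect the
    score; 1.0 means identical normalized poses.
    """
    def normalize(s):
        s = s - s[0]                          # center on root joint
        scale = np.linalg.norm(s) or 1.0      # guard against zero norm
        return (s / scale).ravel()
    q, t = normalize(np.asarray(query, float)), normalize(np.asarray(template, float))
    return float(q @ t)
```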

Gesture Language Identification

Body gestures carry rich information, just as language does, and are also called body language. Deep learning algorithms such as RNN and LSTM are novel tools to understand gesture meanings. Deep networks extract features of gesture skeletons, which is easier than extracting those of intact bodies. Unlike full body skeletons, most body gestures only contain partial joints and limbs. Thus, application scenarios should be settled to determine the involved groups of joints and limbs.

Multi-frame Methods

Dynamic Pose Estimation

A view adaptive LSTM (VA-LSTM) aims at detecting medical conditions (i.e., sneeze/cough, headache, neck pain, staggering, chest pain, vomiting, falling, and back pain), containing a classification and a regression subnetwork. Original skeletons are rotated and translated into new architectures and then sent into the subnetworks to learn the corresponding medical classes. Neuromusculoskeletal disorders are also detected by pose estimation: asymmetry features are extracted from the split results of body joints according to the left and right body, and used to capture normal motion patterns with a probabilistic normalcy model; the likelihood between a test action and a normal motion is computed to determine abnormality. Deep learning structures (e.g., CNN and LSTM) are applied to abstract features which are compared with a series of skeleton benchmarks (i.e., joints and limbs). The graph convolutional network (GCN) is a core solution for movement pose estimation on account of its strong ability to capture spatial and temporal features. Besides GCNs, deep learning structures (e.g., CNN, RNN, and LSTM) are also key techniques for movement pose estimation through learning representations under given conditions. Moreover, physical analysis of skeletons and probability estimation are also utilized in movement pose estimation. An exemplar-based method adjusts initially estimated poses with inhomogeneous systematic bias, where skeletons are defined as a simple directed graph and limbs are directed arrows. A regression function is proposed to predict the pose with root-mean-squared differences between templates and target skeletons. 2D keypoints extracted by pose estimators with skeletons are fused with SMPL regressors to create 3D models with accurate camera parameters. Semantic representations of volume occupancy and ground plane support are helpful for distinguishing multiple persons after evaluating each single person with 3D skeletons.
On the strength of spatial positions and joint changes, a decision tree can quickly recognize basic action events and monitor action types under graph constraints of state transition. Unlike static pose estimation, dynamic pose estimation usually considers both spatial and temporal features over a series of images. The fusion process is a key point that controls the quality of the final representations.
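The left/right asymmetry features that feed the probabilistic normalcy model can be sketched by mirroring one side of the body and measuring per-frame joint deviations; the exact features in the cited work differ, so this is purely illustrative:

```python
import numpy as np

def asymmetry_features(left, right):
    """Per-frame asymmetry between left and right joint trajectories.

    left, right: (T, J, 2) sequences of corresponding left/right joints.
    The right side is mirrored about the vertical axis (x -> -x) before
    comparison; the mean joint deviation per frame is returned as a (T,)
    feature vector. Zero everywhere means a perfectly symmetric motion.
    """
    mirrored = np.asarray(right, float).copy()
    mirrored[..., 0] *= -1.0                     # reflect x coordinates
    return np.linalg.norm(np.asarray(left, float) - mirrored,
                          axis=-1).mean(axis=1)
```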


Object Tracking

PoseTrack follows particular persons in videos, including multi-person pose estimation in an image, multi-person pose estimation in videos, and multi-person articulated tracking. ArtTrack draws body part graphs in the temporal aspect and abandons joints with loose relations via a feed-forward convolutional architecture. For robots, visual distances and lighting intensity are considered while following humans with Kinect. Also based on Kinect and a Kalman filter, foot points are detected and followed with depth information and pairwise curve matching in 3D space from given views to a virtual bird's-eye view. By capturing keypoints in motion and fitting the skeleton, human pose is tracked with transformed 3D models. Unlike deep learning structures, images in 3D models are filtered, and the likelihood between shape models deformed by skeleton poses and image data is calculated with regard to probability theory. Fast movements of animals and persons are traced with non-rigid temporal deformation of 3D surfaces. Beyond tracking humans, people handling objects are traced with GCNs after detecting hand joints in body skeletons. The relation of a given object across sequences is a crucial issue in object tracking.
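The Kinect-plus-Kalman tracking mentioned above can be sketched with a constant-velocity Kalman filter over one joint's 2D position; all noise parameters here are illustrative, not taken from the cited system:

```python
import numpy as np

class JointKalman:
    """Constant-velocity Kalman filter for one tracked joint.

    State x = [px, py, vx, vy]; measurements are noisy joint positions.
    Process/measurement noise magnitudes q and r are illustrative.
    """

    def __init__(self, q=1e-3, r=1e-2):
        self.x = np.zeros(4)
        self.P = np.eye(4)
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = 1.0   # position += velocity (dt = 1)
        self.H = np.eye(2, 4)               # observe position only
        self.Q = q * np.eye(4)
        self.R = r * np.eye(2)

    def step(self, z):
        # predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # update with measured joint position z = (px, py)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z, float) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                   # filtered position estimate
```

Running one filter per joint (or per foot point, as in the cited setup) smooths detections and bridges short occlusions via the prediction step.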


Action Recognition

Skeleton features of actions are modeled from complementary aspects, e.g., spatial and temporal, dots and lines, joints and time, position and feature, or spatial, temporal, structural, and actional. The types of networks depend on action characteristics, e.g., RNN, RRN, CNN, GCN, and LSTM. In other categories, all items among joints and/or limbs are given equal importance. However, in particular cases, actions exhibit typical changes of partial joints and limbs which can strongly represent the actions. Significant joints and limbs can be chosen by a covariance matrix, filtering functions on the basis of skeleton graphs, CNN, LSTM, a multi-head attention model with iterative attention on diverse parts of a body, projection of skeleton angles onto a unit sphere, or information gain with regard to position and velocity histograms from skeletons. Significance sorting techniques over all joints and limbs, including the spatial pyramid model (SPM), are conducive to reducing the hardness and complexity of skeleton feature extraction in action recognition. Weighting the features of different parts of the human body is also valid for identifying actions, e.g., with a bidirectional RNN. This part is the main component of action recognition, for which overviews over the last decades are abundant; therefore, we introduce only a brief sketch here. In this field, inherent representations of action skeletons are gained in two ways: physical computation and feature extraction. For the former, based on empirical knowledge of the human body, the relations among joints and limbs under a particular action are calculated with explicit equations. For the latter, deep neural networks and other techniques are devoted to learning skeleton features for each action. Typically, deep neural networks have achieved great success in action recognition, e.g., DNN, CNN, RNN, LSTM, and GCN. Moreover, traditional machine learning algorithms are designed to identify actions, e.g., kNN and RBF. HMM discovers the semantic information of actions, which assists in action identification.
Reinforcement learning is also a technique to obtain an effective representation of an action. Bayesian methods model the variation across different sequences of an action. Additionally, probability and mathematical theories are useful in action recognition, e.g., analogical generalization and retrieval, screw matrices, gradient vector flow comparison, and discriminative metrics. In action recognition, the greatest challenge is to determine the start and end moments of an action. Usually, a fixed time interval is adopted, which leads to high deviation when actions have widely different durations.
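The sequence comparison that classical skeleton action recognizers pair with a kNN classifier can be sketched with dynamic time warping, which handles actions of differing lengths by aligning frames non-linearly:

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two skeleton sequences.

    seq_a, seq_b: arrays of shape (T, D), each row a flattened skeleton
    (joint coordinates) for one frame. Classic O(Ta * Tb) dynamic
    programming; the per-frame cost is the Euclidean distance.
    """
    Ta, Tb = len(seq_a), len(seq_b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # skip a frame in a
                                 D[i, j - 1],      # skip a frame in b
                                 D[i - 1, j - 1])  # match both frames
    return D[Ta, Tb]
```

A kNN classifier then simply labels a test sequence with the majority class among its nearest training sequences under this distance.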

Action Prediction

Unlike HMM, CRF, RNN, LSTM, and CNN, a latent global network with latent long-term global information is designed to predict an action. Based on the competition in GAN, two nets (i.e., I-Net and D-Net) are trained iteratively: full and partial sequences are sent into I-Net to learn representations separately, and afterwards the representations are distinguished by D-Net. Evaluating the likelihood between the existing partial images and the intact sequence of an action is a key point of action prediction.

Pose Generation

FAAST provides a toolkit to create animated virtual characters using natural interaction from OpenNI-compliant depth sensors. Mesh bodies are also produced on the basis of rigid limb motions and skinning weights, both for humans and animals. The relations inside the skeletons of a pose and among the series of images need to be deeply observed to generate precise poses.


Pose Stripping

Radio frequency (RF) reflections of Wi-Fi signals back from the environment and humans are captured for pose estimation. Heatmaps in both vertical and horizontal directions are parsed with encoders and then fused with keypoint confidence maps from RGB sequences, from which human skeletons can be stripped from the background. The basic assumption is that the reflections from the human body and from other items are disparate; this assumption is susceptible to things with the same reflection as the human body.

Datasets

We summarize the top 11 datasets with high-frequency usage in skeleton-based approaches. The first, NTU RGB+D, consists of 56,880 action samples containing 4 different modalities of data for each sample: 1) RGB videos (136 GB); 2) depth map sequences (masked depth maps, 83 GB, and full depth maps, 886 GB); 3) 3D skeletal data (5.8 GB); 4) infrared videos (221 GB); 1.3 TB in total. In this dataset, the resolution of the RGB videos is 1920x1080, depth maps and IR videos are all 512x424, and the 3D skeletal data contains the three-dimensional locations of 25 major body joints at each frame. The second dataset includes two separate parts. The first part (3.42 GB) is collected using a Kinect mounted on top of a humanoid robot, with 9 action types: stand up, wave, hug, point, punch, reach, throw, run, and shake hands. The second part (3.33 GB) is collected using a non-humanoid robot, with 9 action types: ignore, pass by the robot, point at the robot, reach an object, run away, stand up, stop the robot, throw at the robot, and wave to the robot. Each part contains 5 components: 1) RGB images (.jpg) at 480x640 resolution; 2) depth images (.png) at 320x240; 3) calibrated depth images (.png) at 320x240; 4) skeletal joint locations (.txt), where each row contains the data of one frame in the format: frame number, frame count, skeletonId, and the (x, y, z) locations of joints 1-20; 5) labels of action sequences (.txt). The Florence 3D Actions dataset, collected at the University of Florence during 2012, has been captured using a Kinect camera. It includes 9 activities: wave, drink from a bottle, answer phone, clap, tight lace, sit down, stand up, read watch, and bow. During acquisition, 10 subjects were asked to perform the above actions 2 or 3 times, resulting in a total of 215 activity samples. Another dataset stems from the ActivityNet Large Scale Activity Recognition Challenge 2018, which started from CVPR 2016.
The Kinetics dataset, provided by the DeepMind team at Google, currently includes a total of 600 categories and 500 thousand video clips, all from YouTube. Each of the 600 categories has at least 600 videos.
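The skeletal-joint text format described above (frame number, frame count, skeletonId, then x, y, z for joints 1-20) can be parsed in a few lines; `parse_skeleton_row` is a hypothetical helper name, not part of any dataset toolkit:

```python
def parse_skeleton_row(row):
    """Parse one line of the skeletal-joint text format described above.

    Layout per comma-separated row: frame number, frame count,
    skeletonId, then x, y, z for each of joints 1-20 (63 values total).
    Returns the three header fields plus a list of 20 (x, y, z) tuples.
    """
    vals = [float(v) for v in row.split(",")]
    frame_no, frame_count, skeleton_id = (int(vals[0]), int(vals[1]),
                                          int(vals[2]))
    coords = vals[3:]
    assert len(coords) == 60, "expected 20 joints x 3 coordinates"
    joints = [tuple(coords[i:i + 3]) for i in range(0, 60, 3)]
    return frame_no, frame_count, skeleton_id, joints
```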


Each video lasts about 10 seconds. The categories are classified into three main types: 1) interaction between humans and objects, such as playing musical instruments; 2) human interaction, such as handshakes and hugs; 3) sports, etc. These three main types can also be described as Person, Person-Person, and Person-Object. The Northwestern-UCLA dataset (N-UCLA) was collected by three Kinect cameras and contains 1494 sequences covering 10 action classes from 10 performers: pick up with one hand, pick up with two hands, drop trash, walk around, sit down, stand up, donning, doffing, throw, and carry. The subjects perform an action only once in an action sequence, which contains an average of 39 frames. The SBU Interaction dataset was collected with Kinect. It contains 8 classes of two-person interactions and includes 282 skeleton sequences with 6822 frames; each body skeleton consists of 15 joints. The SYSU 3D Human-Object Interaction (SYSU) dataset is collected by a Kinect camera. It contains 480 skeleton clips of 12 action categories performed by 40 subjects, and each clip has 20 joints. The MSR-Action3D dataset is an action dataset of depth sequences captured by a depth camera. It contains twenty actions: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side-boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, and pick up & throw. It was created by Wanqing Li during his time at Microsoft Research Redmond. The Berkeley Multimodal Human Action Database (MHAD) contains 11 actions performed by 7 male and 5 female subjects in the range of 23-30 years of age, except for one elderly subject. All subjects performed 5 repetitions of each action, yielding about 660 action sequences, which correspond to about 82 minutes of total recording time.
Another multimodal human action dataset was collected for research on human action recognition using fusion of depth and inertial sensor data. Only one Kinect camera and one wearable inertial sensor were used; this was intentional, owing to the practicality and relative non-intrusiveness of these two differing modality sensors. Both sensors are low cost, easy to operate, and do not require much computational power for the real-time manipulation of the data they generate. The Kinect camera captures a color image with a resolution of 640x480 pixels and a 16-bit depth image with a resolution of 320x240 pixels; the frame rate is approximately 30 frames per second. The HDM05 dataset is a motion capture database which contains more than three hours of systematically recorded and well-documented motion capture data in the C3D as well as the ASF/AMC data format. Furthermore, HDM05 contains more than 70 motion classes in 10 to 50 realizations executed by various actors. The HDM05 database has been designed and set up under the direction of Meinard Müller, Tido Röder, Michael Clausen, Bernhard Eberhardt, Björn Krüger, and Andreas Weber. The motion capturing was conducted in the year 2005 at the Hochschule der Medien (HDM), Stuttgart, Germany, supervised by Bernhard Eberhardt.

Conclusion

The skeleton-based approach, as a significant component, has evolved along with the blooming development of artificial intelligence applications (such as object detection, action identification, and pose estimation), which have attracted great attention. This paper surveyed skeleton-based approaches and categorized these techniques in accordance with target tasks rather than theoretical frameworks, which is useful as an introduction to this scope.