[Paper Translation] Skeleton-based Approaches based on Machine Vision: A Survey


原文地址:https://arxiv.org/pdf/2012.12447v1.pdf


  Skeleton-based Approaches based on Machine Vision: A Survey

  Abstract

Recently, skeleton-based approaches have achieved rapid progress on the basis of the great success in skeleton representation. Plenty of research focuses on solving specific problems according to skeleton features. Some skeleton-based approaches have been mentioned in several overviews on object detection, but only as a non-essential part; there has not yet been any thorough, dedicated analysis of skeleton-based approaches. Instead of describing these techniques in terms of theoretical constructs, we summarize skeleton-based approaches with regard to application fields and given tasks as comprehensively as possible. This paper is conducive to a further understanding of skeleton-based applications and to dealing with particular issues.

  Introduction

Skeleton-based approaches, also known as kinematic techniques, cover a set of joints and a group of limbs based on the physiological body structure. Typically, the number of joints is determined by computational complexity and lies between ten and thirty. In recent years, some significant techniques have successfully discovered the representation of skeletons with dots and edges, which can be categorized into top-down methods (e.g., Cascaded Pyramid Network and PoseFix) and bottom-up methods (e.g., OpenPose). In order to analyze skeleton-based approaches deeply, we examine these studies from two aspects (i.e., single-frame and multi-frame). Fig. 1 summarizes this categorization. In the single-frame type, tasks are handled by investigating independent images within videos. Correspondingly, the multi-frame type requires a series of sequential images from videos to explore their inherent properties.
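
To make the joint-and-limb representation concrete, the following minimal sketch stores a 2D skeleton as a list of joints plus limb index pairs; the tiny five-joint layout is an illustrative assumption, not a standard keypoint set.

```python
# Hedged sketch: a minimal skeleton container holding joints (keypoints) and limbs
# (index pairs between joints); the five-joint layout is an illustrative assumption.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Skeleton:
    joints: List[Tuple[float, float]]       # (x, y) per joint
    limbs: List[Tuple[int, int]]            # pairs of joint indices

    def limb_lengths(self) -> List[float]:
        return [((self.joints[a][0] - self.joints[b][0]) ** 2 +
                 (self.joints[a][1] - self.joints[b][1]) ** 2) ** 0.5
                for a, b in self.limbs]

# head, neck, hip, left foot, right foot
pose = Skeleton(joints=[(0.0, 1.8), (0.0, 1.5), (0.0, 1.0), (-0.2, 0.0), (0.2, 0.0)],
                limbs=[(0, 1), (1, 2), (2, 3), (2, 4)])
print(pose.limb_lengths())
```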

[Fig. 1: summary of skeleton-based approaches, grouped into single-frame and multi-frame types]

  Single-frame Approaches

  Multi-view Pose Estimation

Although skeletons with 3D structures enhance the accuracy of pose estimation, the training operation needs plenty of 3D ground-truth data, which is costly. Thus, multiple 2D skeletons from different views are integrated into 3D structures under a given strategy to determine the pose. EpipolarPose applies epipolar geometry to combine the 2D skeletons and trains a 3D pose estimator with camera geometry information. RepNet finds the mapping from 2D to 3D skeletons by designing an adversarial training approach and adding a feedback projection from 3D back to 2D skeletons. Based on the Wasserstein generative adversarial network (WGAN), RepNet generates 3D skeletons by feeding 2D skeletons into the WGAN as input; the produced 3D skeletons are reprojected to the 2D domain through a camera loss. An alternative fusion approach tracks the status of joints in two views of 2D skeletons and labels each joint as "well tracked", "inferred", or "not tracked". For a joint observed in different views, the one labeled "well tracked" has the highest priority, and an "inferred" joint is preferred over a "not tracked" one. Furthermore, dynamic time warping with a K-nearest-neighbor classifier is adopted to classify the learned skeleton representations. With regard to multiple persons, a multi-way matching algorithm clusters detected 2D skeletons by associating the keypoints of the same person across various views. Two networks (named VA-RNN and VA-CNN) work together to discover skeleton representations of actions. An automatic selection scheme is involved in both networks to choose preferred viewpoints of 2D skeletons rather than relying on a fixed criterion. In VA-RNN, 2D skeletons are rotated to obtain a complex feature by training multiple LSTMs. In VA-CNN, an intrinsic feature is abstracted by a convolutional network. These two features are fused to determine action classes. Additionally, loss functions (e.g., regression loss and consistency loss) in deep neural networks (e.g., CNNs) are designed to improve pose estimation within multi-view skeletons, and distance biases of skeleton information across multiple views are minimized to improve fusion accuracy. Relation evaluation (e.g., similarity) between skeletons in two views is still an important issue.
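
As a minimal illustration of the priority rule described above (assuming each view reports a joint position together with a tracking status), a fusion step could look like the following sketch; the status names and the averaging of equally ranked joints are assumptions, not details from the cited work.

```python
# Hedged sketch: fuse one joint observed in multiple views by tracking status.
# Status priority follows the rule above: "well tracked" > "inferred" > "not tracked".
from typing import List, Optional, Tuple

STATUS_PRIORITY = {"well tracked": 2, "inferred": 1, "not tracked": 0}

def fuse_joint(observations: List[Tuple[str, Tuple[float, float]]]) -> Optional[Tuple[float, float]]:
    """observations: list of (status, (x, y)) for the same joint in different views."""
    best = max(STATUS_PRIORITY[s] for s, _ in observations)
    if best == 0:          # the joint is lost in every view
        return None
    # Average the positions from the views sharing the highest status (an assumption).
    pts = [p for s, p in observations if STATUS_PRIORITY[s] == best]
    return (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))

# Example: view 1 tracks the joint well, view 2 only infers it.
print(fuse_joint([("well tracked", (0.41, 0.77)), ("inferred", (0.45, 0.80))]))
```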

  Object Segmentation

Object segmentation based on skeletons aims at separating targeted items (including biological and non-biological things) from the rest of the image. In this type of task, object positions should be ascertained first, and then object margins are drawn out. Pose2Seg focuses on human segmentation in images. In the model, detected skeletons are compared with established standard skeletons of each pose, and an affine transformation matrix calculates the similarity. SegModule is presented to understand skeleton features and the corresponding visualization. In addition, semantics is introduced as a supplementary perspective. By training a Part FCN to obtain a semantic part score map, the body is further divided into the head, torso, left arm, left leg, right arm, and right leg. Another human segmentation approach discovers edge information by adding the widths of physical parts (e.g., head and shoulders, neck, chest, hip) to the body skeleton. Considering the cascade connection of curve skeletons, which resembles a tree and its leaves, body segmentation is realized by trimming branches and isolating intersecting leaves. In a human counting case, blurry margins are allowed, let alone body parts: after removing the background, all heads are identified within skeleton graph segmentation and used to obtain the number of people. Besides human body segmentation, the segmentation of animals and non-biological items is also being studied. Critical points are selected on the basis of surface skeletons and expanded into component sets separately. With skeleton matching techniques, skeletal branches over a particular position are reconstructed by estimating the distances between two views and are then used to reconstruct non-biological objects. Precise object margins of diverse items are confirmed depending not only on skeletons but also on item structures.

  Static Pose Estimation

Without comparison with any benchmark, it is hard to evaluate poses from only one image. A 2D-3D-2D path learns skeleton features thoroughly by projecting 2D skeletons into 3D with lifting networks and reprojecting them back into 2D; the original 2D skeleton serves as the input. A parsing-induced learner exploits parsing information to enhance skeleton information through pose encoders: one pose encoder abstracts pose features while another fuses residual information into the pose representation. HRNet uses parallel high-to-low resolution subnetworks to obtain both high- and low-resolution skeleton features and sums them up. Based on ConvNets, 2D images are translated into 3D skeleton models. For instance, heatmaps and silhouettes are extracted from a 2D body, and then pose and shape parameters are abstracted separately; these parameters are meshed together into a 3D body with 2D annotations. In another case, depth knowledge between paired joints in skeletons is integrated with ConvNets to produce a 3D pose. PoseRefiner encodes skeletons in binary channels together with pose classes to train ConvNets that learn likelihood heatmaps for refining the skeletons. A dual-source approach learns representations from both 2D and 3D skeletons: 3D poses are projected into multiple 2D skeletons and used to find the highest likelihood together with a test image. To treat skeletons carefully, an upper-body visualization uses different colors and polylines to distinguish the left and right sides of the body; with these representations, 16 poses are clustered with high accuracy. Furthermore, ConvNet parameters are analyzed to find preferred settings for identifying poses with skeletons. The Cascaded Pyramid Network (CPN), containing two networks (i.e., GlobalNet and RefineNet), adopts a top-down pipeline, which means locating skeleton keypoints first with a ResNet backbone, extracting features of these keypoints as in HyperNet, and then assembling them. A multi-person pose estimation framework (RMPE) involves spatial transformer networks to rectify various ground-truth bounding boxes of people, and a final box is obtained for each person. DeepCut uses an adapted Fast R-CNN (AFR-CNN) to detect body parts with integer linear programming and a dense CNN (Dense-CNN) to obtain the intensity under geometric and appearance constraints. Additionally, based on DeepCut, DeeperCut is proposed to improve body part detection with a bottom-up pipeline, and an image-conditioned pairwise assembling strategy is designed; angles among body parts are observed meticulously to assist in searching for pairwise joints. Without consecutive images of an action, features learned from static poses can resolve the estimation.
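
The 2D-3D-2D and reprojection ideas above rest on standard pinhole projection. The following minimal sketch, assuming made-up camera intrinsics and toy joints, shows how a lifted 3D skeleton can be reprojected and compared against the original 2D keypoints.

```python
# Hedged sketch: reproject a lifted 3D skeleton back to 2D with a pinhole camera
# and measure how far it drifts from the original 2D keypoints.
import numpy as np

def reproject(joints_3d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """joints_3d: (J, 3) camera-space coordinates; K: (3, 3) intrinsics."""
    uvw = joints_3d @ K.T                     # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]           # perspective division

def reprojection_error(joints_2d, joints_3d, K):
    return float(np.linalg.norm(reproject(joints_3d, K) - joints_2d, axis=1).mean())

# Toy example with made-up intrinsics and two joints.
K = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
joints_3d = np.array([[0.1, 0.2, 2.0], [0.0, -0.1, 2.2]])
joints_2d = reproject(joints_3d, K) + 1.5     # pretend the 2D detector is slightly off
print(reprojection_error(joints_2d, joints_3d, K))  # mean pixel error
```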

  Object Classification

The purpose of object classification is to identify the items in images by observing their skeletons, without any limitation on object types. In order to offset skeleton noise in classification, skeleton contours are trimmed as a trade-off between shape reconstruction error and skeleton simplicity, based on Bayesian theory. The tree representation of a skeleton graph is turned into strings of skeleton edges, and a deformable contour method is used to compare these strings across diverse objects. Node pairs in curve skeletons are obtained by a cascade of symmetry filters, and a symmetry correspondence matrix is designed to obtain a symmetry cloud, which is classified by spectral analysis. Curve skeletons are the main tools for object classification when objects exhibit obvious transformations.
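
Once a skeleton tree is serialized into a string of edge labels, any string distance can compare two objects. The sketch below uses a plain Levenshtein distance purely for illustration (the cited work compares the strings with a deformable contour method, which is not reproduced here), and the edge-label strings are made up.

```python
# Hedged sketch: compare two serialized skeleton edge strings with an edit distance.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Edge strings for two toy skeletons (each letter encodes one skeleton edge).
print(edit_distance("ABBCD", "ABCCD"))   # small distance -> similar skeleton trees
```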

  Pose Denoising

Typically, skeletons are constrained by particular physical structures, especially for the human body. When odd poses occur and parts of multiple bodies overlap, it is hard to separate the skeleton of each body. Redundant joints and limbs are deemed noise, which can be eliminated by filtering linear transformations and comparing joint positions and limb angles with standard ones. Advanced denoising algorithms may be used to further improve the capability of pose denoising.
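
A minimal sketch of the angle-comparison idea above follows: limbs whose angles deviate too far from a standard pose are flagged as noise. The joint names, standard angles, and threshold are illustrative assumptions.

```python
# Hedged sketch: flag limbs whose angles deviate too far from a standard pose.
import math

def limb_angle(p_from, p_to):
    """Angle (degrees) of the limb vector from joint p_from to joint p_to."""
    return math.degrees(math.atan2(p_to[1] - p_from[1], p_to[0] - p_from[0]))

def noisy_limbs(joints, standard_angles, threshold_deg=30.0):
    """joints: {name: (x, y)}; standard_angles: {(from, to): expected_degrees}."""
    flagged = []
    for (a, b), expected in standard_angles.items():
        diff = abs(limb_angle(joints[a], joints[b]) - expected) % 360.0
        diff = min(diff, 360.0 - diff)        # wrap-around difference
        if diff > threshold_deg:
            flagged.append((a, b))
    return flagged

joints = {"hip": (0.0, 0.0), "knee": (0.1, -0.5), "ankle": (0.6, -0.5)}
standard = {("hip", "knee"): -90.0, ("knee", "ankle"): -90.0}
print(noisy_limbs(joints, standard))          # the lower-leg limb is flagged
```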

  Image Synthesis

Commonly, image synthesis tries to produce other views of skeletons by learning the representation of skeletons in one view. The GAN is a popular technique for this problem, but traditional GANs are weak at capturing the relationships among joints and limbs. A bidirectional GAN is adopted to search for the mapping from an initial pose skeleton to another pose skeleton. Deformable GANs use heatmaps from skeletons as conditional information and human poses as the original images fed into the generator to generate other pose images; the generated poses are passed to the discriminator with the corresponding heatmaps. Furthermore, a CRF-RNN is established to predict the conditional human body under pose transformation from a given skeleton to a target skeleton. A Mask R-CNN model is built to transform one pose skeleton into another, and parallel models operate simultaneously on diverse keypoints sampled from the initial pose skeletons; these mappings are combined to create a new pose skeleton. Some studies exploit physical structure characteristics to reconstruct images rather than studying inherent representations. For example, body skeletons are divided into multiple components and the background is drawn out; these components are rotated and integrated into a new body, which is blended into a synthesized new background. GANs are useful tools for generation, although their performance is still unstable. When applying GANs to skeleton synthesis, how to improve the quality remains a challenge.

  Benchmark-based Identification

Comparison with a benchmark is a straightforward way to judge the class of an item. For skeletons, key points in standard and target skeletons are contrasted in turn. Skeleton information (joints and limbs) is transformed into a tree of key points which is compared with given trees; the best match between two shape trees is treated as the same object. The skeleton graph of any object is also contrasted with standard units in terms of both geometric and topological similarity. To highlight the joints in comparisons, lines between joints represent the relations of two joints instead of limbs. Unlike the fixed number of compared joints in the previous two techniques, the comparison steps in skeleton graphs here are random. Similarity metrics along with 3D skeleton features are designed to obtain the similarity between the human skeleton in a single-frame image and the templates. Body point clouds are combined with main curve skeletons to judge a matching ratio of human motions, which works better than using curve skeletons alone. Kinect skeletons of the upper and lower body are analyzed separately to identify human gaits with an ANN as the classifier. The evaluation criterion of similarity between the object and template skeletons is usually chosen empirically; more sophisticated metrics may perform well in similarity assessment.
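
As a minimal sketch of skeleton-to-template similarity, the snippet below centers each skeleton on a root joint, normalizes scale, and converts the mean joint distance into a score; the root choice, normalization, and score mapping are assumptions, not a metric from the cited works.

```python
# Hedged sketch: a simple similarity score between a detected skeleton and a template.
import numpy as np

def normalize(skeleton: np.ndarray, root: int = 0) -> np.ndarray:
    """skeleton: (J, 2) joint coordinates; center on the root and scale to unit size."""
    centered = skeleton - skeleton[root]
    scale = np.linalg.norm(centered, axis=1).max() or 1.0
    return centered / scale

def similarity(skeleton: np.ndarray, template: np.ndarray) -> float:
    """Return a score in (0, 1]; 1.0 means the normalized joints coincide exactly."""
    d = np.linalg.norm(normalize(skeleton) - normalize(template), axis=1).mean()
    return 1.0 / (1.0 + d)

template = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = template + np.random.default_rng(0).normal(scale=0.05, size=template.shape)
print(round(similarity(query, template), 3))
```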

  Gesture Language Identification

Body gestures carry rich information, just as language does, and are also called body language. Deep learning algorithms, such as RNNs and LSTMs, are novel tools for understanding gesture meanings. Deep networks extract features of gesture skeletons, which is easier than extracting features of intact bodies. Unlike full-body skeletons, most body gestures only involve partial joints and limbs. Thus, the application scenario should be settled first to determine the involved groups of joints and limbs.

  Multi-frame Approaches

  Dynamic Pose Estimation

A view-adaptive LSTM (VA-LSTM) aims at detecting medical conditions (i.e., sneeze/cough, headache, neck pain, staggering, chest pain, vomiting, falling, and back pain) and contains a classification and a regression subnetwork. Original skeletons are rotated and translated into new representations and then sent into the subnetworks to learn the corresponding medical classes. Neuromusculoskeletal disorders are also detected by pose estimation. Asymmetry features are extracted by splitting body joints into the left and right sides of the body, and they are used to capture normal motion patterns with a probabilistic normalcy model; the likelihood between a test action and a normal motion is computed to determine the abnormality. Deep learning structures (e.g., CNN and LSTM) are applied to abstract features that are compared with a series of skeleton benchmarks (i.e., joints and limbs). The graph convolutional network (GCN) is a core solution for movement pose estimation on account of its strong ability to capture spatial and temporal features. Besides GCNs, other deep learning structures (e.g., CNN, RNN, and LSTM) are also key techniques for movement pose estimation through learning representations under given conditions. Moreover, physical analysis of skeletons and probability estimation are also utilized in movement pose estimation. An exemplar-based method is explored to adjust initially estimated poses with inhomogeneous systematic bias, where skeletons are defined as a simple directed graph and limbs are directed edges. A regression function is proposed to predict the pose using root-mean-squared differences between templates and target skeletons. 2D keypoints extracted by skeleton-based pose estimators are fused with SMPL regressors to create 3D models with accurate camera parameters. Semantic representations of volume occupancy and ground-plane support are helpful for distinguishing multiple persons after evaluating each single person with 3D skeletons. Based on spatial positions and joint changes, a decision tree can quickly recognize basic action events and monitor action types under the graph constraints of state transitions. Unlike static pose estimation, dynamic pose estimation usually considers both spatial and temporal features over a series of images. The fusion process is a key factor in the quality of the final representations.
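
A minimal sketch of the left/right asymmetry feature mentioned above follows; the mirrored joint pairs, the reflection axis, and the toy coordinates are illustrative assumptions.

```python
# Hedged sketch: a simple left/right asymmetry feature for one skeleton frame.
import numpy as np

MIRRORED_PAIRS = [("l_shoulder", "r_shoulder"), ("l_elbow", "r_elbow"), ("l_knee", "r_knee")]

def asymmetry_features(frame: dict) -> np.ndarray:
    """frame: {joint_name: (x, y)}; returns one |left - mirrored right| distance per pair."""
    feats = []
    for left, right in MIRRORED_PAIRS:
        l, r = np.asarray(frame[left]), np.asarray(frame[right])
        r_mirrored = r * np.array([-1.0, 1.0])     # reflect across the vertical axis
        feats.append(np.linalg.norm(l - r_mirrored))
    return np.asarray(feats)

frame = {"l_shoulder": (-0.2, 1.5), "r_shoulder": (0.2, 1.5),
         "l_elbow": (-0.3, 1.2), "r_elbow": (0.3, 1.1),
         "l_knee": (-0.1, 0.5), "r_knee": (0.1, 0.5)}
print(asymmetry_features(frame))   # larger values indicate a more asymmetric pose
```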

  Object Tracking

PoseTrack follows particular persons in videos, covering multi-person pose estimation in an image, multi-person pose estimation in videos, and multi-person articulated tracking. ArtTrack builds body part graphs along the temporal dimension and discards loosely related joints with a feed-forward convolutional architecture. For robots, visual distances and lighting intensity are considered while following a human using Kinect. Also based on Kinect and a Kalman filter, foot points are detected and followed using depth information and pairwise curve matching in 3D space, from the given views to a virtual bird's-eye view. By capturing keypoints in motion and fitting the skeleton, the human pose is tracked with transformed 3D models. Unlike deep learning structures, images in 3D models are filtered, and the likelihood between shape models deformed by skeleton poses and the image data is calculated under probability theory. Fast movements of animals and persons are traced with non-rigid temporal deformation of 3D surfaces. Beyond tracking bodies themselves, people handling objects are traced with GCNs after detecting hand joints in body skeletons. Relating a given object across sequences is a crucial issue in object tracking.
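
To ground the Kalman-filter step mentioned above, here is a minimal constant-velocity filter for one tracked joint (e.g., a foot point); the frame rate and noise levels are illustrative assumptions, not values from the cited work.

```python
# Hedged sketch: a constant-velocity Kalman filter for one tracked joint.
# State is [x, y, vx, vy]; noise levels below are illustrative assumptions.
import numpy as np

dt = 1.0 / 30.0                                   # assume ~30 fps depth frames
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-3                              # process noise
R = np.eye(2) * 1e-2                              # measurement noise

def kalman_step(x, P, z):
    """One predict/update cycle; z is the measured (x, y) joint position."""
    x, P = F @ x, F @ P @ F.T + Q                 # predict
    y = z - H @ x                                 # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P

x, P = np.zeros(4), np.eye(4)
for z in [np.array([0.00, 0.00]), np.array([0.03, 0.01]), np.array([0.06, 0.02])]:
    x, P = kalman_step(x, P, z)
print(x[:2], x[2:])                               # filtered position and velocity
```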

  Action Recognition

Action recognition is a crucial part of skeleton-based approaches and has been successfully applied to many real projects. Among skeleton-based action recognition techniques, there are three categories: (1) spectrum-based, (2) image-mapping-based, and (3) torso-link-based approaches. In torso-link-based techniques, three subtypes are involved: 1) multiple streams, 2) weighting joints, and 3) non-weighting joints. In the spectrum-based category, skeleton sequences are turned into color spectrum dots with ConvNets, involving a procedure of 1) joint distribution mapping, 2) spectrum coding of joint trajectories and body parts, and 3) joint-velocity-weighted saturation and brightness. The spectrum actions are transferred from different filmed angles and then sent to the corresponding ConvNets, which separately give scores for the actions. In the image-mapping-based category, action skeletons are transformed into feature maps which are learned by convolutional neural networks as the main solution backbone. With a single CNN, the temporal and spatial information obtained from the physical information of joints and limbs is transformed into representation images that are sent to a CNN-family model (e.g., VGG and CNN-LSTM) for learning features. Furthermore, the representation images are stretched to various sizes and fed into multiple CNNs to combine a highly representative explanation, in which these CNNs have either the same structure or diverse structures. Feature maps in the traditional three coordinate dimensions (i.e., the X, Y, and Z axes) of 3D skeletons are decomposed and fed into corresponding CNNs, resulting in multi-column representations of actions. Besides the coordinate dimensions and the time dimension in 3D action skeletons, color (i.e., RGB) is also treated as a dimension for learning deep features in action recognition. Features of different factors of action skeletons (i.e., joint-joint distances, joint-joint orientations, joint-joint vectors, joint-line distances, and line-line angles) obtained by physical computations are encoded into images and loaded into multiple CNNs to further extract features. For a joint in skeleton sequences, potential relations are explored by position chains of the physical body and mixed with features learned from each frame by a CNN for compact representations of actions. Additionally, temporal consistency is deeply explored by establishing more networks to analyze extra information in actions. Among them, DD-Net is a popular framework. Unlike learning abstract features from action images, traditional physical and mathematical techniques are still useful in action recognition under certain conditions. Physical information of joint changes is also used to detect actions by calculating precise relations (e.g., distances and angles) among joints for each action. The geometrical relationship of limbs and joints draws a tree in which an action, as a father node, links key poses; the actions are viewed as tree nodes, and features derived from those actions are child nodes. Torso-link (also called stick-link) is the most commonly used representation in skeleton-based approaches. More than two kinds of representation gained from torso-link skeletons are used to learn action features by dependent and parallel networks, e.g., spatial and temporal; dots and lines; joints and time; position and feature; and spatial, temporal, structural, and actional.

The types of networks depend on action characteristics, e.g., RNN, RRN, CNN, GCN, and LSTM. In the other categories, all joints and/or limbs are given equal importance. However, in particular cases, actions involve typical changes of partial joints and limbs which can strongly represent the actions. Significant joints and limbs can be chosen by a covariance matrix, a filtering function based on skeleton graphs, CNNs, LSTMs, a multi-head attention model with iterative attention over diverse parts of the body, projecting skeleton angles onto a unit sphere, or information gain with regard to position and velocity histograms from skeletons. Significance-sorting techniques for joints and limbs are conducive to reducing the difficulty and complexity of skeleton feature extraction in action recognition, including the spatial pyramid model (SPM). Weighting the features of different parts of the human body is also valid for identifying actions, e.g., with a bidirectional RNN. This part is the main component of action recognition, and sufficient overviews have been published over the last decades; therefore, we introduce it only briefly here. In this field, the inherent representation of action skeletons is gained in two ways: physical computation and feature extraction. For the former, based on empirical knowledge of the human body, the relations among joints and limbs under a particular action are calculated with explicit equations. For the latter, deep neural networks and other techniques are devoted to learning skeleton features for each action. Typically, deep neural networks have achieved great success in action recognition, e.g., DNN, CNN, RNN, LSTM, and GCN. Moreover, traditional machine learning algorithms are designed to identify actions, e.g., kNN and RBF. HMMs discover the semantic information of actions, which assists in action identification. Reinforcement learning is also a technique to obtain an effective representation of an action. Bayesian approaches vary across different sequences of an action. Additionally, probability and mathematical theories are useful in action recognition, e.g., analogical generalization and retrieval, screw matrices, gradient vector flow comparison, and discriminative metrics. In action recognition, the greatest challenge is to determine the start and end moments of an action. Usually, a fixed time interval is adopted, which leads to high deviation because actions have widely different durations.
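
As a minimal sketch of the hand-crafted features listed above, the snippet below computes per-frame joint-joint distances plus one line-line (limb-limb) angle; the four-joint layout and the chosen limb pairs are illustrative assumptions.

```python
# Hedged sketch: hand-crafted per-frame skeleton features of the kind listed above.
import itertools
import numpy as np

def joint_joint_distances(joints: np.ndarray) -> np.ndarray:
    """joints: (J, 2) or (J, 3); returns the J*(J-1)/2 pairwise distances."""
    return np.array([np.linalg.norm(joints[i] - joints[j])
                     for i, j in itertools.combinations(range(len(joints)), 2)])

def line_line_angle(joints: np.ndarray, limb_a, limb_b) -> float:
    """Angle in degrees between two limbs, each given as a (start, end) joint index pair."""
    va = joints[limb_a[1]] - joints[limb_a[0]]
    vb = joints[limb_b[1]] - joints[limb_b[0]]
    cos = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Toy frame with 4 joints: hip, knee, ankle, toe (an assumed layout).
frame = np.array([[0.0, 1.0], [0.1, 0.5], [0.1, 0.0], [0.3, 0.0]])
features = np.concatenate([joint_joint_distances(frame),
                           [line_line_angle(frame, (0, 1), (1, 2))]])
print(features.shape, features.round(2))
```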

  Action Prediction

Unlike HMM, CRF, RNN, LSTM, and CNN models, a latent global network with latent long-term global information is designed to predict an action. Based on the adversarial competition in GANs, two networks (i.e., I-Net and D-Net) are trained iteratively: full and partial sequences are separately fed into I-Net to learn representations, and afterwards the representations are distinguished by D-Net. Evaluating the likelihood between the observed partial sequence and the intact sequence of an action is a key point of action prediction.

  Pose Generation

FAAST provides a toolkit to create animated virtual characters using natural interaction from OpenNI-compliant depth sensors. Mesh bodies are also produced on the basis of rigid limb motions and skinning weights, both for humans and animals. Relations both within the skeleton of a pose and across the series of images need to be deeply observed to generate precise poses.
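
The skinning-weight idea above is usually realized by linear blend skinning, where a mesh vertex follows a weighted sum of rigid bone transforms. The following minimal sketch uses toy weights and transforms and is not the specific pipeline of the cited works.

```python
# Hedged sketch: linear blend skinning, tying a mesh vertex to rigid limb (bone) transforms.
import numpy as np

def blend_vertex(v: np.ndarray, weights: np.ndarray, transforms: np.ndarray) -> np.ndarray:
    """v: (3,) rest-pose vertex; weights: (B,); transforms: (B, 4, 4) bone matrices."""
    v_h = np.append(v, 1.0)                      # homogeneous coordinates
    blended = sum(w * (T @ v_h) for w, T in zip(weights, transforms))
    return blended[:3]

def rot_z(deg: float) -> np.ndarray:
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

# A vertex influenced 70/30 by two bones: one fixed, one rotated by 30 degrees.
v = np.array([0.5, 0.0, 0.0])
print(blend_vertex(v, np.array([0.7, 0.3]), np.stack([np.eye(4), rot_z(30.0)])))
```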

  Pose Stripping

Radio frequency (RF) reflections of WiFi signals from the environment and from humans are captured for pose estimation. Heatmaps in both the vertical and horizontal directions are parsed with encoders and then fused with keypoint confidence maps from RGB sequences, from which human skeletons can be stripped from the background. The basic assumption is that the reflections from the human body and from other items are disparate; this assumption is vulnerable to objects with the same reflection characteristics as the human body.

  Datasets

We summarize the top 11 datasets that are most frequently used by skeleton-based approaches.

This dataset consists of 56,880 action samples containing 4 different modalities of data for each sample: 1) RGB videos (136 GB), 2) depth map sequences (masked depth maps, 83 GB, and full depth maps, 886 GB), 3) 3D skeletal data (5.8 GB), and 4) infrared videos (221 GB), for a total of 1.3 TB. In this dataset, the resolution of the RGB videos is 1920x1080, the depth maps and IR videos are all 512x424, and the 3D skeletal data contains the three-dimensional locations of 25 major body joints in each frame.

This dataset includes two separate datasets. The first dataset (3.42 GB) is collected using a Kinect mounted on top of a humanoid robot. There are 9 action types in the humanoid robot dataset: stand up, wave, hug, point, punch, reach, throw, run, and shake hands. The second dataset (3.33 GB) is collected using a non-humanoid robot. There are 9 action types in the non-humanoid robot dataset: ignore, pass by the robot, point at the robot, reach an object, run away, stand up, stop the robot, throw at the robot, and wave to the robot. Each dataset contains 5 parts: 1) RGB images (.jpg) with a resolution of 480x640; 2) depth images (.png) with a resolution of 320x240; 3) calibrated depth images (.png) with a resolution of 320x240; 4) skeletal joint locations (.txt), where each row contains the data of one frame in the format: frame number, frame count, skeleton ID, and the (x, y, z) locations of joints 1-20; and 5) labels of the action sequences (.txt).

The dataset collected at the University of Florence during 2012 was captured using a Kinect camera. It includes 9 activities: wave, drink from a bottle, answer phone, clap, tight lace, sit down, stand up, read watch, and bow. During acquisition, 10 subjects were asked to perform the above actions 2 or 3 times, which resulted in a total of 215 activity samples.

This dataset stems from the ActivityNet Large Scale Activity Recognition Challenge 2018, which started from CVPR 2016. The dataset, provided by Google's DeepMind team, currently includes a total of 600 categories and 500 thousand video clips, all from YouTube. In the collected 600 categories, each one has at least 600 videos.
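
A minimal sketch for parsing one row of the skeletal joint location files described above (frame number, frame count, skeleton ID, then x/y/z for joints 1-20); the whitespace/comma delimiting and exact field order are assumptions inferred from the description, not verified against the released files.

```python
# Hedged sketch: parse one row of the described skeletal joint location format.
from typing import List, Tuple

def parse_skeleton_row(row: str) -> Tuple[int, int, int, List[Tuple[float, float, float]]]:
    values = row.replace(",", " ").split()
    frame_number, frame_count, skeleton_id = (int(float(v)) for v in values[:3])
    coords = [float(v) for v in values[3:3 + 20 * 3]]
    joints = [tuple(coords[i:i + 3]) for i in range(0, len(coords), 3)]
    return frame_number, frame_count, skeleton_id, joints

# Toy row with 20 dummy joints, all at (0.1, 0.2, 0.3).
row = "12 200 1 " + " ".join(["0.1 0.2 0.3"] * 20)
frame_number, frame_count, skeleton_id, joints = parse_skeleton_row(row)
print(frame_number, skeleton_id, len(joints), joints[0])
```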

Each video lasts about 10 seconds. The categories are classified into three main types: 1) interaction between humans and objects, such as playing musical instruments; 2) human interaction, such as handshakes and hugs; and 3) sports. These three main types can also be described as Person, Person-Person, and Person-Object.

The Northwestern-UCLA dataset (N-UCLA) was collected by three Kinect cameras and contains 1494 sequences covering 10 action classes from 10 performers. These 10 actions are: pick up with one hand, pick up with two hands, drop trash, walk around, sit down, stand up, donning, doffing, throw, and carry. The subjects perform an action only once in an action sequence, which contains an average of 39 frames.

The SBU Interaction dataset was collected with Kinect. It contains 8 classes of two-person interactions and includes 282 skeleton sequences with 6822 frames. Each body skeleton consists of 15 joints.

The SYSU 3D Human-Object Interaction (SYSU) dataset is collected by a Kinect camera. It contains 480 skeleton clips of 12 action categories performed by 40 subjects, and each clip has 20 joints.

The MSR-Action3D dataset is an action dataset of depth sequences captured by a depth camera. This dataset contains twenty actions: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two-hand wave, side-boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, and pick up & throw. It was created by Wanqing Li during his time at Microsoft Research Redmond.

The Berkeley Multimodal Human Action Database (MHAD) contains 11 actions performed by 7 male and 5 female subjects in the range of 23-30 years of age, except for one elderly subject. All subjects performed 5 repetitions of each action, yielding about 660 action sequences, which correspond to about 82 minutes of total recording time.

This dataset was collected as part of research on human action recognition using a fusion of depth and inertial sensor data. For this multimodal human action dataset, only one Kinect camera and one wearable inertial sensor were used. This was intentional due to the practicality and relatively non-intrusive aspect of using these two differing modality sensors. Both of these sensors are low cost, easy to operate, and do not require much computational power for the real-time manipulation of the data they generate. The Kinect camera can capture a color image with a resolution of 640x480 pixels and a 16-bit depth image with a resolution of 320x240 pixels. The frame rate is approximately 30 frames per second.

The HDM05 dataset is a motion capture database which contains more than three hours of systematically recorded and well-documented motion capture data in the C3D as well as the ASF/AMC data format. Furthermore, HDM05 contains more than 70 motion classes with 10 to 50 realizations each, executed by various actors. The HDM05 database was designed and set up under the direction of Meinard Müller, Tido Röder, Michael Clausen, Bernhard Eberhardt, Björn Krüger, and Andreas Weber. The motion capture was conducted in 2005 at the Hochschule der Medien (HDM), Stuttgart, Germany, supervised by Bernhard Eberhardt.

  Conclusion

Skeleton-based approaches, as a significant component, have been evolving along with the blooming development of artificial intelligence applications (such as object detection, action identification, pose estimation, and so on), which have attracted great attention. This paper surveyed skeleton-based approaches and categorized these techniques according to target tasks rather than theoretical frameworks, which is useful as an introduction to this scope.
