Skeleton-Based Approaches in Machine Vision: A Survey

Abstract

Recently, skeleton-based approaches have achieved rapid progress on the basis of great success in skeleton representation. Plenty of research focuses on solving specific problems with skeleton features. Skeleton-based approaches have been mentioned in several overviews of object detection, but only as a non-essential part, and no thorough analysis has yet been devoted to them. Instead of describing these techniques in terms of theoretical constructs, we summarize skeleton-based approaches with regard to application fields and given tasks as comprehensively as possible. This paper is conducive to a further understanding of skeleton-based applications and to dealing with particular issues.

Introduction

Skeleton-based approaches, also known as kinematic techniques, model a set of joints and a group of limbs based on physiological body structure. Typically, the number of joints is determined by computational complexity and lies between ten and thirty. In recent years, some significant techniques have succeeded in representing skeletons with dots and edges; they can be categorized as top-down methods (e.g., Cascaded Pyramid Network and PoseFix) and bottom-up methods (e.g., OpenPose). In order to analyze skeleton-based approaches in depth, we observe these studies from two aspects (i.e., single-frame and multi-frame). Fig. 1 summarizes this taxonomy. In the single-frame type, tasks are handled by investigating independent images within videos. Correspondingly, the multi-frame type requires a series of sequential images from videos to explore their inherent structure.


Single-frame Methods

Multi-view Pose Estimation

Although skeletons with 3D structures enhance the accuracy of pose estimation, training requires plenty of 3D ground-truth data, which is costly. Thus, multiple 2D skeletons from different views are integrated into 3D structures under a given strategy to determine the pose. EpipolarPose implements epipolar geometry to combine the 2D skeletons and trains a 3D pose estimator with camera geometry information. RepNet finds the mapping from 2D to 3D skeletons by designing an adversarial training approach and adding feedback projection from 3D to 2D skeletons. Based on the Wasserstein generative adversarial network (WGAN), RepNet generates 3D skeletons by feeding 2D skeletons into the WGAN as input. The produced 3D skeletons are reprojected to the 2D domain by a camera loss. A substitutive fusion approach tracks the status of joints in two views of 2D skeletons and labels the joints as "well tracked", "inferred", or "not tracked". For a joint in different views, the one diagnosed as "well tracked" has the highest priority, and an inferred one is preferred over a "not tracked" one. Furthermore, dynamic time warping with a K-nearest-neighbor classifier is adopted to classify learned skeleton representations. For multiple persons, a multi-way matching algorithm clusters detected 2D skeletons by associating the keypoints of a person across views. Two networks (named VA-RNN and VA-CNN) work together to discover skeleton representations of actions. An automatic selection scheme in both nets chooses prior viewpoints of 2D skeletons rather than applying a fixed criterion. In VA-RNN, 2D skeletons are rotated to obtain a complex feature by training multiple LSTMs. In VA-CNN, an intrinsic feature is abstracted by a convolutional network. These two features are fused to determine action classes. Additionally, loss functions (e.g., regression loss and consistency loss) in deep neural networks (e.g., CNNs) are designed to improve pose estimation within multi-view skeletons, and distance biases of skeleton information in multiple views are minimized to improve fusion accuracy. Relation evaluation (e.g., similarity) between skeletons in two views is still an important issue.
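The multi-view fusion that EpipolarPose-style pipelines rely on can be illustrated with classical triangulation: given the projection matrices of two calibrated cameras and a joint's 2D detections in both views, the direct linear transform (DLT) recovers the 3D joint. This is a minimal sketch of the underlying geometry, not the cited implementations:

```python
import numpy as np

def triangulate_joint(P1, P2, x1, x2):
    """Triangulate one 3D joint from its 2D detections in two views.

    P1, P2: 3x4 camera projection matrices; x1, x2: 2D image coordinates.
    Builds the homogeneous system A X = 0 from the projection equations
    and solves it via SVD (direct linear transform).
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize
```

With more than two views, the same system simply gains two rows per extra camera, which is how multi-view setups improve robustness.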

Object Segmentation

Object segmentation based on skeletons aims at cutting targeted items (including biological and non-biological things) out from the rest of the image. In this type of task, object positions are ascertained first, and then object margins are drawn out. Pose2Seg focuses on human segmentation in images. In the model, detected skeletons are compared with established standard skeletons of each pose, and an affine transformation matrix is used to calculate similarity. SegModule is presented to understand skeleton features and their corresponding visualization. In addition, semantics is introduced as a supplementary perspective: by training a Part FCN to obtain a semantic part score map, the body is further divided into head, torso, left arm, left leg, right arm, and right leg. Another human segmentation approach discovers edge information by adding the widths of physical parts (e.g., head and shoulders, neck, chest, hip) to body skeletons. Considering the cascade connection of curve skeletons, which resembles a tree and its leaves, body segmentation is realized by trimming branches and isolating intersecting leaves. In a human counting case, blurry margins are allowed, let alone body parts: after removing the background, all heads are identified within skeleton graph segmentation and used to obtain the number of people. Besides human body segmentation, the segmentation of animals and non-biological items is also studied. Critical points are selected on the basis of surface skeletons and expanded into component sets separately. With skeleton matching techniques, skeletal branches over a particular position are reconstructed by estimating the distances between two views and used to reconstruct non-biological things. Pinpointing the object margins of diverse items depends not only on skeletons but also on item structures.
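The affine-alignment step that Pose2Seg-style comparison relies on can be sketched as a least-squares fit between detected and template keypoints; the residual then serves as a crude (dis)similarity score. The scoring details here are illustrative, not those of the cited model:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform mapping detected keypoints onto a
    template skeleton (the alignment idea behind Pose2Seg-style matching).

    src, dst: (N, 2) arrays of corresponding joint coordinates.
    Returns a 2x3 matrix A such that dst ~= [src | 1] @ A.T, plus the
    residual error, usable as a rough pose (dis)similarity score.
    """
    n = src.shape[0]
    src_h = np.hstack([src, np.ones((n, 1))])       # homogeneous coords
    sol, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    err = float(np.linalg.norm(src_h @ sol - dst))  # alignment residual
    return sol.T, err
```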

Static Pose Estimation

Without comparison against any benchmark, it is hard to evaluate poses from only one image. A 2D-3D-2D path learns skeleton features thoroughly by projecting the original 2D skeleton, which serves as input, into 3D and reprojecting it into 2D with lifting networks. A parsing-induced learner exploits parsing information to enhance skeleton information through a pose encoder: one pose encoder abstracts pose features while another fuses residual information into the pose representation. HRNet uses parallel high-to-low resolution subnetworks to obtain both high and low resolutions of skeletons and sums them up. Based on ConvNets, 2D images are translated into 3D skeleton models. For instance, heatmaps and silhouettes are extracted from a 2D body, and then pose and shape parameters are abstracted separately; these parameters are meshed together into a 3D body with 2D annotations. In another case, depth knowledge between paired joints in skeletons is integrated with ConvNets to produce a 3D pose. PoseRefiner implements skeletons in binary channels and pose classes to train ConvNets that learn likelihood heatmaps for refining the skeletons. A dual-source approach learns representations from both 2D and 3D skeletons: 3D poses are projected into multiple 2D skeletons and used to find the highest likelihood along with a test image. To inspect skeletons closely, an upper-body visualization uses different colors and polylines to distinguish between the left and right body; with these representations, 16 poses are clustered with high accuracy. Furthermore, ConvNet parameters are analyzed to find prior settings for identifying poses with skeletons. The Cascaded Pyramid Network (CPN), containing two networks (i.e., GlobalNet and RefineNet), adopts a top-down pipeline: it first locates skeleton keypoints with a ResNet backbone, extracts features of these keypoints as in HyperNet, and then assembles them.
A regional multi-person pose estimation framework (RMPE) involves spatial transformer networks to rectify various ground-truth bounding boxes for people, and a final box is obtained for each person. DeepCut uses an adapted fast R-CNN (AFR-CNN) to detect body parts with integer linear programming and a dense CNN (Dense-CNN) to obtain intensity with geometric and appearance constraints. Based on DeepCut, DeeperCut improves body part detection with a bottom-up pipeline and an image-conditioned pairwise assembling strategy; angles among body parts are observed meticulously to assist in searching for pairwise joints. Without consecutive images of an action, features learned from static poses can resolve the estimation.
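Heatmap-based estimators such as CPN and HRNet share a common final decoding step: taking each per-joint heatmap maximum as the keypoint. A simplified version (plain argmax, no sub-pixel refinement):

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Decode per-joint heatmaps of shape (J, H, W) into (J, 2) keypoint
    coordinates (x, y) plus per-joint confidences.

    This is the simplified decoding shared by heatmap-based estimators:
    the peak location is the keypoint, the peak value its confidence.
    """
    J, H, W = heatmaps.shape
    flat = heatmaps.reshape(J, -1)
    idx = flat.argmax(axis=1)              # flat index of each peak
    conf = flat[np.arange(J), idx]         # peak value = confidence
    ys, xs = np.divmod(idx, W)             # recover row/column
    return np.stack([xs, ys], axis=1), conf
```

Production decoders usually add a quarter-pixel shift toward the second-highest neighbor; that refinement is omitted here for clarity.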

Object Classification

The purpose of object classification is to identify the items in images by observing skeletons, without any limitation on object types. In order to offset skeleton noise in classification, skeleton contours are trimmed as a trade-off between shape reconstruction error and skeleton simplicity based on Bayesian theory. The tree representation of a skeleton as a graph is turned into strings of skeleton edges, and a deformable contour method is used to compare these strings across diverse objects. Node pairs in curve skeletons are obtained by a cascade of symmetry filters, and a symmetry correspondence matrix is designed to obtain a symmetry cloud, which is classified by spectral analysis. Curve skeletons are the main tool for object classification when each object undergoes obvious transformation.
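The string-comparison idea behind serialized skeleton trees can be sketched with a plain edit distance over skeleton-edge strings. The cited work uses a deformable contour method; Levenshtein distance is used here only as a minimal stand-in:

```python
def edit_distance(a, b):
    """Levenshtein distance between two skeleton-edge strings.

    Each character stands for one encoded skeleton edge; a smaller
    distance means the two serialized skeleton trees are more alike.
    Classic dynamic programming over a (len(a)+1) x (len(b)+1) table.
    """
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                        # deletions
    for j in range(n + 1):
        d[0][j] = j                        # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute
    return d[m][n]
```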

Pose Denoising

Typically, skeletons are governed by specific structures, especially for the human body. When odd poses occur and parts of bodies overlap, it is hard to separate the skeleton of each body. Redundant joints and limbs are deemed noise, which can be eliminated by filtering linear transformations and comparing joint positions and limb angles with standard ones. Advanced denoising algorithms may further improve the capability of pose denoising.
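Comparing limb angles with standard ones can be sketched as follows; the tolerance and the reference angles are illustrative placeholders, not values from the cited work:

```python
import numpy as np

def flag_noisy_limbs(joints, limbs, ref_angles, tol_deg=25.0):
    """Flag limbs whose orientation deviates from a reference pose.

    joints: (J, 2) joint coordinates; limbs: list of (parent, child)
    joint-index pairs; ref_angles: reference angle in degrees per limb.
    A limb is flagged as noise when its angle differs from the reference
    by more than tol_deg degrees (differences wrapped to [-180, 180)).
    """
    flags = []
    for (p, c), ref in zip(limbs, ref_angles):
        v = joints[c] - joints[p]
        ang = np.degrees(np.arctan2(v[1], v[0]))
        diff = (ang - ref + 180.0) % 360.0 - 180.0   # wrap the difference
        flags.append(bool(abs(diff) > tol_deg))
    return flags
```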

Image Synthesis

Commonly, image synthesis tries to produce other views of skeletons by learning the representation of skeletons in one view. GAN is a popular technique for this problem, but a traditional GAN has a weak ability to capture the relationships among joints and limbs. A bidirectional GAN is adopted to search for the mapping from an initial pose skeleton to another pose skeleton. Deformable GANs use heatmaps from skeletons as conditional information and human poses as original images fed into the generator to generate other pose images; the generated poses are passed to the discriminator with the corresponding heatmaps. Furthermore, a CRF-RNN is established to predict a conditional human body with pose transformation from a given skeleton to a target skeleton. A Mask R-CNN model is built to transform a pose skeleton into another pose skeleton, and parallel models operate simultaneously with diverse keypoints sampled from initial pose skeletons; these mappings are combined to create a new pose skeleton. Some studies exploit physical structure characteristics to reconstruct images rather than studying the inherent representation. For example, body skeletons are divided into multiple components and the background is drawn out; these components are rotated and integrated into a new body, which is blended into a synthesized new background. GAN is a useful tool for generation although its performance is still unstable. When applying GANs to skeleton synthesis, how to improve the quality is still a challenge.

Benchmark-based Identification

Comparison with a benchmark is a straightforward way to judge the class of items. For skeletons, key points in standard and target skeletons are contrasted in turn. Skeleton information (joints and limbs) is transformed into a tree of key points which is compared with given trees; the best match of two shape trees is treated as the identical thing. A skeleton graph of any object is also used for comparison with standard units in terms of both geometric and topological similarity. To highlight the joints in comparisons, lines among joints capture the relations of two joints instead of limbs; unlike the fixed number of compared joints in the previous two techniques, the comparison steps in these skeleton graphs are random. Similarity metrics along with 3D skeleton features are designed to measure the similarity between the human skeleton in a single-frame image and the templates. Body dot clouds are combined with main curve skeletons to judge a matching ratio of human motions, which works better than the case involving only curve skeletons. Skeletons captured with Kinect from the upper and lower body are analyzed separately to identify human gaits with an ANN as classifier. The evaluation criterion of similarity between object and template skeletons is controlled by experience; more complex metrics may perform well in similarity assessment.
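One simple instance of a similarity metric between a detected skeleton and a template is cosine similarity after root-centering and scale normalization. The cited metrics are more elaborate; this is a minimal sketch:

```python
import numpy as np

def skeleton_similarity(query, template):
    """Cosine similarity between two skeletons after normalization.

    query, template: (J, 2) or (J, 3) joint arrays with joint 0 assumed
    to be the root. Both skeletons are centered on the root and scaled
    to unit norm, so translation and uniform scale do not affect the
    score; 1.0 means identical normalized poses.
    """
    def normalize(s):
        s = s - s[0]                          # center on root joint
        scale = np.linalg.norm(s) or 1.0      # guard against zero norm
        return (s / scale).ravel()
    q, t = normalize(np.asarray(query, float)), normalize(np.asarray(template, float))
    return float(q @ t)
```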

Gesture Language Identification

Body gestures carry rich information, just as language does, and are also called body language. Deep learning algorithms such as RNN and LSTM are novel tools to understand gesture meanings. Deep networks extract features of gesture skeletons, which is easier than extracting those of intact bodies. Unlike full body skeletons, most body gestures only contain partial joints and limbs. Thus, application scenarios should be settled to determine the involved groups of joints and limbs.

Multi-frame Methods

Dynamic Pose Estimation

A view adaptive LSTM (VA-LSTM) aims at detecting medical conditions (i.e., sneeze/cough, headache, neck pain, staggering, chest pain, vomiting, falling, and back pain), containing a classification and a regression subnetwork. Original skeletons are rotated and translated into new architectures and then sent into the subnetworks to learn the corresponding medical classes. Neuromusculoskeletal disorders are also detected by pose estimation: asymmetry features are extracted from the split results of body joints according to the left and right body, and used to capture normal motion patterns with a probabilistic normalcy model; the likelihood between a test action and a normal motion is computed to determine abnormality. Deep learning structures (e.g., CNN and LSTM) are applied to abstract features which are compared with a series of skeleton benchmarks (i.e., joints and limbs). The graph convolutional network (GCN) is a core solution for movement pose estimation on account of its strong ability to capture spatial and temporal features. Besides GCNs, deep learning structures (e.g., CNN, RNN, and LSTM) are also key techniques for movement pose estimation through learning representations under given conditions. Moreover, physical analysis of skeletons and probability estimation are also utilized in movement pose estimation. An exemplar-based method adjusts initially estimated poses with inhomogeneous systematic bias, where skeletons are defined as a simple directed graph and limbs are directed arrows. A regression function is proposed to predict the pose with root-mean-squared differences between templates and target skeletons. 2D keypoints extracted by pose estimators with skeletons are fused with SMPL regressors to create 3D models with accurate camera parameters. Semantic representations of volume occupancy and ground plane support are helpful for distinguishing multiple persons after evaluating each single person with 3D skeletons.
On the strength of spatial positions and joint changes, a decision tree can quickly recognize basic action events and monitor action types under graph constraints of state transition. Unlike static pose estimation, dynamic pose estimation usually considers both spatial and temporal features over a series of images. The fusion process is a key point that controls the quality of the final representations.
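The left/right asymmetry features that feed the probabilistic normalcy model can be sketched by mirroring one side of the body and measuring per-frame joint deviations; the exact features in the cited work differ, so this is purely illustrative:

```python
import numpy as np

def asymmetry_features(left, right):
    """Per-frame asymmetry between left and right joint trajectories.

    left, right: (T, J, 2) sequences of corresponding left/right joints.
    The right side is mirrored about the vertical axis (x -> -x) before
    comparison; the mean joint deviation per frame is returned as a (T,)
    feature vector. Zero everywhere means a perfectly symmetric motion.
    """
    mirrored = np.asarray(right, float).copy()
    mirrored[..., 0] *= -1.0                     # reflect x coordinates
    return np.linalg.norm(np.asarray(left, float) - mirrored,
                          axis=-1).mean(axis=1)
```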


Object Tracking

PoseTrack follows particular persons in videos, including multi-person pose estimation in an image, multi-person pose estimation in videos, and multi-person articulated tracking. ArtTrack draws body part graphs in the temporal aspect and abandons joints with loose relations via a feed-forward convolutional architecture. For robots, visual distances and lighting intensity are considered while following humans with Kinect. Also based on Kinect and a Kalman filter, foot points are detected and followed with depth information and pairwise curve matching in 3D space from given views to a virtual bird's-eye view. By capturing keypoints in motion and fitting the skeleton, human pose is tracked with transformed 3D models. Unlike deep learning structures, images in 3D models are filtered, and the likelihood between shape models deformed by skeleton poses and image data is calculated with regard to probability theory. Fast movements of animals and persons are traced with non-rigid temporal deformation of 3D surfaces. Beyond tracking humans, people handling objects are traced with GCNs after detecting hand joints in body skeletons. The relation of a given object across sequences is a crucial issue in object tracking.
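The Kinect-plus-Kalman tracking mentioned above can be sketched with a constant-velocity Kalman filter over one joint's 2D position; all noise parameters here are illustrative, not taken from the cited system:

```python
import numpy as np

class JointKalman:
    """Constant-velocity Kalman filter for one tracked joint.

    State x = [px, py, vx, vy]; measurements are noisy joint positions.
    Process/measurement noise magnitudes q and r are illustrative.
    """

    def __init__(self, q=1e-3, r=1e-2):
        self.x = np.zeros(4)
        self.P = np.eye(4)
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = 1.0   # position += velocity (dt = 1)
        self.H = np.eye(2, 4)               # observe position only
        self.Q = q * np.eye(4)
        self.R = r * np.eye(2)

    def step(self, z):
        # predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # update with measured joint position z = (px, py)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z, float) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                   # filtered position estimate
```

Running one filter per joint (or per foot point, as in the cited setup) smooths detections and bridges short occlusions via the prediction step.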


Action Recognition

Skeleton features of actions are modeled from complementary aspects, e.g., spatial and temporal, dots and lines, joints and time, position and feature, or spatial, temporal, structural, and actional. The types of networks depend on action characteristics, e.g., RNN, RRN, CNN, GCN, and LSTM. In other categories, all items among joints and/or limbs are given equal importance. However, in particular cases, actions exhibit typical changes of partial joints and limbs which can strongly represent the actions. Significant joints and limbs can be chosen by a covariance matrix, filtering functions on the basis of skeleton graphs, CNN, LSTM, a multi-head attention model with iterative attention on diverse parts of a body, projection of skeleton angles onto a unit sphere, or information gain with regard to position and velocity histograms from skeletons. Significance sorting techniques over all joints and limbs, including the spatial pyramid model (SPM), are conducive to reducing the hardness and complexity of skeleton feature extraction in action recognition. Weighting the features of different parts of the human body is also valid for identifying actions, e.g., with a bidirectional RNN. This part is the main component of action recognition, for which overviews over the last decades are abundant; therefore, we introduce only a brief sketch here. In this field, inherent representations of action skeletons are gained in two ways: physical computation and feature extraction. For the former, based on empirical knowledge of the human body, the relations among joints and limbs under a particular action are calculated with explicit equations. For the latter, deep neural networks and other techniques are devoted to learning skeleton features for each action. Typically, deep neural networks have achieved great success in action recognition, e.g., DNN, CNN, RNN, LSTM, and GCN. Moreover, traditional machine learning algorithms are designed to identify actions, e.g., kNN and RBF. HMM discovers the semantic information of actions, which assists in action identification.
Reinforcement learning is also a technique to obtain an effective representation of an action. Bayesian methods model the variation across different sequences of an action. Additionally, probability and mathematical theories are useful in action recognition, e.g., analogical generalization and retrieval, screw matrices, gradient vector flow comparison, and discriminative metrics. In action recognition, the greatest challenge is to determine the start and end moments of an action. Usually, a fixed time interval is adopted, which leads to high deviation when actions have widely different durations.
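The sequence comparison that classical skeleton action recognizers pair with a kNN classifier can be sketched with dynamic time warping, which handles actions of differing lengths by aligning frames non-linearly:

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two skeleton sequences.

    seq_a, seq_b: arrays of shape (T, D), each row a flattened skeleton
    (joint coordinates) for one frame. Classic O(Ta * Tb) dynamic
    programming; the per-frame cost is the Euclidean distance.
    """
    Ta, Tb = len(seq_a), len(seq_b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # skip a frame in a
                                 D[i, j - 1],      # skip a frame in b
                                 D[i - 1, j - 1])  # match both frames
    return D[Ta, Tb]
```

A kNN classifier then simply labels a test sequence with the majority class among its nearest training sequences under this distance.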

Action Prediction

Unlike HMM, CRF, RNN, LSTM, and CNN, a latent global network with latent long-term global information is designed to predict an action. Based on the competition in GAN, two nets (i.e., I-Net and D-Net) are trained iteratively: full and partial sequences are sent into I-Net to learn representations separately, and afterwards the representations are distinguished by D-Net. Evaluating the likelihood between the existing partial images and the intact sequence of an action is a key point of action prediction.

Pose Generation

FAAST provides a toolkit to create animated virtual characters using natural interaction from OpenNI-compliant depth sensors. Mesh bodies are also produced on the basis of rigid limb motions and skinning weights, both for humans and animals. The relations inside the skeletons of a pose and among the series of images need to be deeply observed to generate precise poses.


Pose Stripping

Radio frequency (RF) reflections of Wi-Fi signals back from the environment and humans are captured for pose estimation. Heatmaps in both vertical and horizontal directions are parsed with encoders and then fused with keypoint confidence maps from RGB sequences, from which human skeletons can be stripped from the background. The basic assumption is that the reflections from the human body and from other items are disparate; this assumption is susceptible to things with the same reflection as the human body.

Datasets

We summarize the top 11 datasets with high-frequency usage in skeleton-based approaches. The first, NTU RGB+D, consists of 56,880 action samples containing 4 different modalities of data for each sample: 1) RGB videos (136 GB); 2) depth map sequences (masked depth maps, 83 GB, and full depth maps, 886 GB); 3) 3D skeletal data (5.8 GB); 4) infrared videos (221 GB); 1.3 TB in total. In this dataset, the resolution of the RGB videos is 1920x1080, depth maps and IR videos are all 512x424, and the 3D skeletal data contains the three-dimensional locations of 25 major body joints at each frame. The second dataset includes two separate parts. The first part (3.42 GB) is collected using a Kinect mounted on top of a humanoid robot, with 9 action types: stand up, wave, hug, point, punch, reach, throw, run, and shake hands. The second part (3.33 GB) is collected using a non-humanoid robot, with 9 action types: ignore, pass by the robot, point at the robot, reach an object, run away, stand up, stop the robot, throw at the robot, and wave to the robot. Each part contains 5 components: 1) RGB images (.jpg) at 480x640 resolution; 2) depth images (.png) at 320x240; 3) calibrated depth images (.png) at 320x240; 4) skeletal joint locations (.txt), where each row contains the data of one frame in the format: frame number, frame count, skeletonId, and the (x, y, z) locations of joints 1-20; 5) labels of action sequences (.txt). The Florence 3D Actions dataset, collected at the University of Florence during 2012, has been captured using a Kinect camera. It includes 9 activities: wave, drink from a bottle, answer phone, clap, tight lace, sit down, stand up, read watch, and bow. During acquisition, 10 subjects were asked to perform the above actions 2 or 3 times, resulting in a total of 215 activity samples. Another dataset stems from the ActivityNet Large Scale Activity Recognition Challenge 2018, which started from CVPR 2016.
The Kinetics dataset, provided by the DeepMind team at Google, currently includes a total of 600 categories and 500 thousand video clips, all from YouTube. Each of the 600 categories has at least 600 videos.
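The skeletal-joint text format described above (frame number, frame count, skeletonId, then x, y, z for joints 1-20) can be parsed in a few lines; `parse_skeleton_row` is a hypothetical helper name, not part of any dataset toolkit:

```python
def parse_skeleton_row(row):
    """Parse one line of the skeletal-joint text format described above.

    Layout per comma-separated row: frame number, frame count,
    skeletonId, then x, y, z for each of joints 1-20 (63 values total).
    Returns the three header fields plus a list of 20 (x, y, z) tuples.
    """
    vals = [float(v) for v in row.split(",")]
    frame_no, frame_count, skeleton_id = (int(vals[0]), int(vals[1]),
                                          int(vals[2]))
    coords = vals[3:]
    assert len(coords) == 60, "expected 20 joints x 3 coordinates"
    joints = [tuple(coords[i:i + 3]) for i in range(0, 60, 3)]
    return frame_no, frame_count, skeleton_id, joints
```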


Each video lasts about 10 seconds. The categories are classified into three main types: 1) interaction between humans and objects, such as playing musical instruments; 2) human interaction, such as handshakes and hugs; 3) sports, etc. These three main types can also be described as Person, Person-Person, and Person-Object. The Northwestern-UCLA dataset (N-UCLA) was collected by three Kinect cameras and contains 1494 sequences covering 10 action classes from 10 performers: pick up with one hand, pick up with two hands, drop trash, walk around, sit down, stand up, donning, doffing, throw, and carry. The subjects perform an action only once in an action sequence, which contains an average of 39 frames. The SBU Interaction dataset was collected with Kinect. It contains 8 classes of two-person interactions and includes 282 skeleton sequences with 6822 frames; each body skeleton consists of 15 joints. The SYSU 3D Human-Object Interaction (SYSU) dataset is collected by a Kinect camera. It contains 480 skeleton clips of 12 action categories performed by 40 subjects, and each clip has 20 joints. The MSR-Action3D dataset is an action dataset of depth sequences captured by a depth camera. It contains twenty actions: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side-boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, and pick up & throw. It was created by Wanqing Li during his time at Microsoft Research Redmond. The Berkeley Multimodal Human Action Database (MHAD) contains 11 actions performed by 7 male and 5 female subjects in the range of 23-30 years of age, except for one elderly subject. All subjects performed 5 repetitions of each action, yielding about 660 action sequences, which correspond to about 82 minutes of total recording time.
Another multimodal human action dataset was collected for research on human action recognition using fusion of depth and inertial sensor data. Only one Kinect camera and one wearable inertial sensor were used; this was intentional, owing to the practicality and relative non-intrusiveness of these two differing modality sensors. Both sensors are low cost, easy to operate, and do not require much computational power for the real-time manipulation of the data they generate. The Kinect camera captures a color image with a resolution of 640x480 pixels and a 16-bit depth image with a resolution of 320x240 pixels; the frame rate is approximately 30 frames per second. The HDM05 dataset is a motion capture database which contains more than three hours of systematically recorded and well-documented motion capture data in the C3D as well as the ASF/AMC data format. Furthermore, HDM05 contains more than 70 motion classes in 10 to 50 realizations executed by various actors. The HDM05 database has been designed and set up under the direction of Meinard Müller, Tido Röder, Michael Clausen, Bernhard Eberhardt, Björn Krüger, and Andreas Weber. The motion capturing was conducted in the year 2005 at the Hochschule der Medien (HDM), Stuttgart, Germany, supervised by Bernhard Eberhardt.

Conclusion

The skeleton-based approach, as a significant component, has evolved along with the blooming development of artificial intelligence applications (such as object detection, action identification, and pose estimation), which have attracted great attention. This paper surveyed skeleton-based approaches and categorized these techniques in accordance with target tasks rather than theoretical frameworks, which is useful as an introduction to this scope.