[Paper Translation] Multi-Person 3D Pose and Shape Estimation via Inverse Kinematics and Refinement


Original paper: https://arxiv.org/pdf/2210.13529v2


Multi-Person 3D Pose and Shape Estimation via Inverse Kinematics and Refinement


Abstract. Estimating 3D poses and shapes in the form of meshes from monocular RGB images is challenging: it is considerably harder than estimating 3D poses only, in the form of skeletons or heatmaps. When interacting persons are involved, 3D mesh reconstruction becomes more challenging still, owing to the ambiguity introduced by person-to-person occlusions. To tackle these challenges, we propose a coarse-to-fine pipeline that benefits from 1) inverse kinematics applied to occlusion-robust 3D skeleton estimates and 2) Transformer-based relation-aware refinement. In our pipeline, we first obtain occlusion-robust 3D skeletons for multiple persons from an RGB image. Then, we apply inverse kinematics to convert the estimated skeletons into deformable 3D mesh parameters. Finally, we apply Transformer-based mesh refinement, which refines the obtained mesh parameters considering intra- and inter-person relations of the 3D meshes. Via extensive experiments, we demonstrate the effectiveness of our method, outperforming state-of-the-art methods on the 3DPW, MuPoTS and AGORA datasets.


Keywords: Multi-person, 3D mesh reconstruction, Transformer


1 Introduction


Recovering 3D human body meshes of a single person or multiple persons from a monocular RGB image has made great progress in recent years [3, 10, 12, 17, 23, 27, 28, 30–33, 38, 39, 61, 64, 69, 71, 73]. The technique is essential to understanding people’s behaviors, intentions and person-to-person interactions, and it has a wide range of real-world applications such as human motion imitation [41], virtual try-on [47], motion capture [45], action recognition [5, 57, 66], etc.


Recently, deep convolutional neural network-based mesh reconstruction methods [6,10,12,17,23,27,28,30–33,38,39,61,64,69,71,73] have shown practical performance on in-the-wild scenes [21,25,44,68]. Most existing 3D human body pose and shape estimation approaches [6, 10, 17, 27, 28, 30–33, 38, 39, 69] achieve promising results for single-person cases. Generally, they first crop the area containing a person from the input image using a bounding box and then extract features for each detected person, which are further used for 3D human mesh regression.



Fig. 1: Example outputs from our pipeline: (a) input RGB image, (b) initial skeleton estimation results obtained from the input image, (c) initial meshes obtained from the inverse kinematics process, (d) refined meshes obtained from the refinement Transformer, (e, f) top and side views of the refined meshes.


Some recent studies [26–28, 30–33, 36, 39, 64, 71] reconstruct each person’s 3D mesh individually for multi-person 3D mesh reconstruction, using the same bounding-box detectors [4, 18, 55]. Multiple persons can create severe person-to-person or person-to-environment occlusions, erroneous monocular depth and diverse human body appearances, which lead to ambiguity in crowded scenes; yet these methods have not established proper modules for handling interacting persons. A few recent methods [23,73] apply direct regression to multiple persons and do not require individual person detection. Sun et al. [61] used body center heatmaps as the target representation to identify a mesh parameter map. However, without human detection, the pose estimation result is frequently affected by unimportant pixels and often fails to capture scale variations, resulting in inferior performance.


In parallel, there have been efforts to reduce the ambiguity of estimating 3D meshes from an RGB image. However, in terms of pose recovery, 3D body mesh recovery methods [27,30,31,33] still fall behind 3D skeleton or heatmap estimation methods [8,9,22,60]. One drawback of 3D skeleton estimation methods is that they cannot reconstruct the full 3D body mesh. Recently, Li et al. [36] proposed an inverse kinematics method for single-person mesh reconstruction that recovers 3D meshes from 3D skeletons. This approach is promising since it can deliver good poses obtained from a 3D skeleton estimator to the 3D mesh reconstruction pipeline.


To tackle the multi-person 3D body mesh reconstruction task, we propose a coarse-to-fine pipeline that first estimates 3D skeletons, reconstructs 3D meshes from the skeletons via inverse kinematics and refines the initial 3D mesh parameters via relation-aware refinement. Inspired by [59], our 3D skeleton estimator involves metric-scale heatmaps and is trained with both relative and absolute positional 3D poses so as to be robust to occlusions. By extending the IK process [36] to the multi-person scenario, we obtain initial 3D meshes for multiple persons from 3D skeletons, although the accuracy is limited, especially for interacting persons. To compensate for this limitation, we propose a relation-aware Transformer that refines the initial mesh parameters considering intra- and inter-person 3D mesh relationships. Fig. 1 shows example outputs of the intermediate steps. To summarize, our contributions are as follows:


– We propose a coarse-to-fine multi-person 3D body mesh reconstruction pipeline that first estimates 3D skeletons and then converts them into 3D meshes via inverse kinematics. To make our pipeline robust to interacting persons, we borrow occlusion-robust techniques for 3D skeleton estimation.
– To further boost the performance, we propose a Transformer-based architecture for relation-aware mesh refinement that refines the initial mesh parameters considering intra- and inter-person relationships.
– Extensive comparisons are conducted on three challenging multi-person 3D body pose benchmarks (i.e. 3DPW, MuPoTS and AGORA), where we demonstrate state-of-the-art performance on each benchmark. Via ablation studies, we show that each component contributes in a meaningful way.


2 Related Works


Single-person 3D mesh regression. There is a long history of methods for predicting 3D human body meshes from monocular RGB images or video frames [16]. Recently, this field has advanced quickly thanks to SMPL [42], which provides a low-dimensional parameterization of the 3D human body mesh. Here we focus on regressing a 3D body mesh from a monocular RGB image by adopting a parametric model like SMPL. Bogo et al. [3] presented an optimization-based method called SMPLify that iteratively fits SMPL to detected 2D body joints. However, this optimization-based approach is comparatively time-consuming and suffers from high inference time per input frame.


Some recent studies [34, 50, 53] use deep neural networks to regress SMPL parameters from images in a two-stage manner, which has proven effective and can generate more accurate mesh reconstruction outputs given large-scale 3D datasets. They first determine intermediate representations such as silhouettes and 2D keypoints from input images and then map them to the SMPL parameters. Impressive results have been achieved for in-the-wild images by applying diverse weak supervision signals such as semantic segmentation [69], texture consistency [52], efficient temporal features [30,63,65], 2D pose [11,27,35], motion dynamics [28], etc.


More recently, Li et al. [36] proposed a 3D human body pose and shape estimation method in which 3D keypoints and body meshes collaborate. The authors introduced an inverse kinematics process that finds the relative rotations via twist-and-swing decomposition so that the mesh matches the targeted body joint locations.


Multi-person 3D skeleton regression. A variety of methods [13, 46, 48, 56] tackle 3D body pose estimation for multiple persons. Rogez et al. [56] proposed LCR-Net, which consists of localization, classification and regression modules: the localization module detects multiple persons in a single image, the classification module assigns each detected person to one of several anchor-poses, and the regression module refines the anchor-poses. Mehta et al. [46] proposed a single-shot method for multi-person 3D pose estimation from a single image; in addition, they introduced the MuCo-3DHP dataset, which contains images with multi-person interactions and occlusions. Moon et al. [48] proposed a top-down method for 3D multi-person pose estimation from a monocular RGB image, consisting of human detection, absolute 3D human root localization and root-relative 3D single-person pose estimation modules. Dong et al. [13] used multi-view images to estimate multi-person 3D poses, proposing a coarse-to-fine method that lifts 2D joints to 3D joints: 2D joint candidates are obtained from [4], initial 3D joints are triangulated from the 2D joint candidates of different camera views of the same scene, and the initial 3D joints are further updated using prior information from the SMPL [42] model.


Recent multi-person 3D pose regression works [7, 54, 59, 72] tackled a variety of issues, such as developing an attention mechanism dedicated to 3D pose estimation that considers the 3D-to-2D projection process [72], combining top-down and bottom-up networks [7] and developing tracking-based multi-person estimation [54]. Sarandi et al. [59] recently proposed a metric-scale 3D pose estimation method that is robust to truncations and reasons well about out-of-image joints; the method is also robust to occlusion and bounding-box noise.


Multi-person 3D mesh regression. A few works [12,23,61,62,70,73] concern multi-person 3D body mesh regression. The approaches can be categorized into two groups: bottom-up and top-down methods.


Bottom-up methods [23, 61, 62, 73] perform multi-person detection and 3D mesh reconstruction simultaneously. Zhang et al. [73] proposed Body Meshes as Points (BMP), a multi-scale, grid-level 2D center map representation that locates persons at grid cell centers. Sun et al. [61] proposed ROMP, which creates parameter maps (i.e. a body center heatmap, a camera map and a mesh parameter map) for 2D human body detection, body positioning and 3D body mesh parameter regression, respectively. Jiang et al. [23] presented a coherent reconstruction of multiple humans (CRMH) model, which utilizes Faster R-CNN-based RoI-aligned features of all persons to estimate SMPL parameters; they further modeled the positional relevance between multiple persons through a depth ordering-aware loss and an interpenetration loss. Sun et al. [62] further introduced a Bird’s-Eye-View (BEV) representation to reason about multi-person body centers and depths simultaneously and combine them to estimate 3D body positions.


Top-down methods [12,70] first detect each individual person in the frame using bounding boxes and then estimate the 3D mesh parameters of each detected person. They are basically similar to the single-person 3D mesh reconstruction pipeline, but differ in that they provide dedicated modules or loss functions for the multi-person scenario. For example, Zanfir et al. [70] proposed a 3D mesh reconstruction method that first infers the 3D skeleton of each person and groups the estimated skeletons to infer the final 3D meshes for multiple persons. Choi et al. [12] proposed to combine early-stage image features and estimated 2D pose heatmaps, which are robust to occlusions, to reconstruct 3D meshes for multiple persons.


Bottom-up methods are frequently affected by unimportant image pixels and suffer from scale variations. They also fail to detect small persons, since their person detection is not as powerful as that of top-down methods. Conversely, top-down methods have not yet established proper modules for handling interacting persons. In this paper, we take the top-down approach to secure robustness and propose to use the Transformer architecture to handle interacting scenarios.


3 Method


Our aim is to reconstruct the 3D meshes $\{\mathbf{M}^{i}\}_{i=1}^{M}$ of the multiple persons in an RGB image $\mathbf{I}$, where $M$ denotes the number of persons in $\mathbf{I}$. To achieve this goal, we propose the coarse-to-fine reconstruction pipeline shown in Fig. 2 that 1) first estimates the 3D skeletons $\{\mathbf{P}^{i}\}_{i=1}^{M}$, 2) obtains the deformable 3D mesh parameters from the 3D skeletons via the inverse kinematics (IK) process and 3) refines the initial 3D meshes $\{\mathbf{M}^{i}\}_{i=1}^{M}$ using the Transformer architecture that considers intra-person and inter-person relationships. In the remainder of this section, we elaborate each step in detail.

SMPL body model. For the 3D mesh representation, we use the SMPL deformable 3D mesh model [42] for its compact representation. Variations of the SMPL model [42] are controlled by the pose parameters $\pmb{\theta}\in\mathbb{R}^{24\times6}$ and the shape parameters $\beta\in\mathbb{R}^{1\times10}$. The pose parameters contain the 3D rotations of 24 human body joints in the 6D representation, and the shape parameters are the top-10 principal component analysis coefficients of the 3D shape space. Using the differentiable mapping between the SMPL parameters (i.e. $\pmb{\theta}$ and $\beta$) and the 3D body mesh $\mathbf{M}=\{\mathbf{v},\mathbf{f}\}$ defined in [42], we can differentiably obtain the 3D body mesh $\mathbf{M}$ from $\pmb{\theta}$ and $\beta$, where $\mathbf{v}\in\mathbb{R}^{6{,}890\times3}$ denotes the 6,890 vertices and $\mathbf{f}\in\mathbb{R}^{13{,}776\times3}$ denotes the 13,776 triangular faces, each defined by 3 vertices.
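
For illustration, the following PyTorch sketch (our code, not from the paper) implements the standard mapping from the 6D representation back to rotation matrices via Gram-Schmidt orthogonalization; the six numbers per joint are read as the first two columns of the rotation matrix.

```python
import torch
import torch.nn.functional as F

def rot6d_to_rotmat(x: torch.Tensor) -> torch.Tensor:
    """Map 6D rotation parameters (..., 6) to rotation matrices (..., 3, 3).

    The six numbers are read as two 3-vectors; Gram-Schmidt orthogonalization
    yields the first two columns of the rotation matrix, and their cross
    product yields the third (Zhou et al., CVPR 2019).
    """
    a1, a2 = x[..., :3], x[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack((b1, b2, b3), dim=-1)  # b1, b2, b3 become the columns

theta = torch.randn(24, 6)   # 24 per-joint SMPL pose parameters
R = rot6d_to_rotmat(theta)   # (24, 3, 3) joint rotations
```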

3.1 Initial 3D Skeleton Estimation


We take the top-down approach for 3D skeleton estimation: we first detect bounding boxes of the humans and then estimate 3D skeletons within each bounding box. Following [43, 49, 58, 59], we constitute the person detector using YOLOv4 [2] to obtain the cropped image $\mathbf{X}\in\mathbb{R}^{256\times256\times3}$ from an image $\mathbf{I}$.


Fig. 2: The schematic diagram of our framework: We first detect persons from an image $\mathbf{I}$ and crop it to $\mathbf{X}$, and the image encoder extracts image features $\mathbf{F}_{\mathrm{img}}$ from $\mathbf{X}$. Then, initial 3D skeletons $\mathbf{P}$ are estimated via the initial 3D skeleton estimator $f^{\mathrm{P}}$ and SMPL parameters $\Theta_{\mathrm{init}}$ are reconstructed via the inverse kinematics process, involving the twist angle and shape estimator $f^{\mathrm{TS}}$ (GAP denotes a global average pooling layer). Finally, we refine the initial SMPL parameters by inputting the image features $\mathbf{F}_{\mathrm{img}}$ and $\Theta_{\mathrm{init}}$ to the relation-aware refiner $f^{\mathrm{Ref}}$ to produce the refined mesh parameters $\Theta_{\mathrm{ref}}$. The final 3D mesh $\mathbf{M}$ is obtained from the refined SMPL parameters $\Theta_{\mathrm{ref}}$. The blue boxes denote the involved loss functions ($L_{\mathrm{P}}$, $L_{\mathrm{TS}}$, $L_{\mathrm{mesh}}$, $L_{\mathrm{adv}}$ and $L_{\mathrm{pose}}$).


In order to develop the initial 3D skeleton estimation network $f^{\mathrm{P}}:\mathbf{X}\to\mathbf{P}\in\mathbb{R}^{K\times3}$ that feeds the inverse kinematics (IK) process, the output dimension $K$ must be aligned with the SMPL model [42]: we set $K=24$ to match the SMPL model, which uses 24 joint rotations as pose parameters. The reason is that the IK process we use (see Sec. 3.2) calculates the SMPL pose parameters $\pmb{\theta}$ by comparing the 3D skeletons to the SMPL template skeletons, which have 24 joints.

To further obtain occlusion-robust 3D skeletons, we follow the architecture and loss functions of the recent 3D skeleton estimation approach [59], which utilizes metric-scale heatmaps for 3D skeleton estimation and uses both absolute-scale 3D skeletons and image-aligned skeletons as supervision. Within $f^{\mathrm{P}}$, we apply ResNet [19] as a feature extractor that predicts the image features $\mathbf{F}_{\mathrm{img}}\in\mathbb{R}^{8\times8\times2{,}048}$. They are fed to a 1x1 convolutional layer to extract 3D heatmaps that produce the root-relative 3D skeletons $\mathbf{P}_{\mathrm{rel}}$. In parallel, the image features $\mathbf{F}_{\mathrm{img}}$ are fed to another 1x1 convolutional layer to obtain image-scale 2D heatmaps, which further produce the 2D skeletons $\mathbf{P}_{\mathrm{img}}$. Finally, the absolute 3D skeletons $\mathbf{P}$ are differentiably calculated by combining $\mathbf{P}_{\mathrm{img}}$ and $\mathbf{P}_{\mathrm{rel}}$ with the camera intrinsics as in [59].
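
For intuition, the sketch below shows one standard way to perform such a combination: solving a linear least-squares problem for the single translation $\mathbf{t}$ such that $\mathbf{P}_{\mathrm{rel}}+\mathbf{t}$ projects onto $\mathbf{P}_{\mathrm{img}}$ under the intrinsics $\mathbf{K}$. This is a simplified sketch under our own assumptions; the exact differentiable formulation of [59] may differ.

```python
import torch

def absolute_skeleton(P_rel, P_img, K):
    """Recover absolute 3D joints from root-relative joints P_rel (J, 3) and
    image-space joints P_img (J, 2), given camera intrinsics K (3, 3).

    Each joint gives two equations linear in the unknown translation t:
        fx * (X + tx) - (u - cx) * (Z + tz) = 0
        fy * (Y + ty) - (v - cy) * (Z + tz) = 0
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    X, Y, Z = P_rel[:, 0], P_rel[:, 1], P_rel[:, 2]
    u, v = P_img[:, 0], P_img[:, 1]
    J = P_rel.shape[0]
    A = torch.zeros(2 * J, 3)
    b = torch.zeros(2 * J)
    A[0::2, 0], A[0::2, 2], b[0::2] = fx, -(u - cx), (u - cx) * Z - fx * X
    A[1::2, 1], A[1::2, 2], b[1::2] = fy, -(v - cy), (v - cy) * Z - fy * Y
    t = torch.linalg.lstsq(A, b.unsqueeze(-1)).solution.squeeze(-1)
    return P_rel + t  # absolute 3D skeleton P

K_cam = torch.tensor([[1000., 0., 128.], [0., 1000., 128.], [0., 0., 1.]])
P = absolute_skeleton(torch.randn(24, 3), torch.rand(24, 2) * 256, K_cam)
```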

3.2 Initial 3D Mesh Reconstruction via Inverse Kinematics


We define the inverse kinematics (IK) as the process that recovers the pose $\pmb{\theta}$ and shape $\beta$ parameters of the SMPL [42] model from the estimated 3D skeletons $\mathbf{P}$. The pose parameter $\pmb{\theta}$ can be obtained by finding the relative rotation matrix $\mathbf{R}$ that rotates the template skeleton $\mathbf{T}=\{\mathbf{t}_{k}\}_{k=1}^{K}$ onto the estimated initial 3D skeleton $\mathbf{P}=\{\mathbf{p}_{k}\}_{k=1}^{K}$. To reconstruct this, we use the same formulation as [36], which decomposes the relative rotation matrix into twist and swing rotations.

Reconstructing swing angles $\pmb{\alpha}$. The swing rotation axis $\mathbf{n}_{k}$, which is perpendicular to $\mathbf{t}_{k}$ and $\mathbf{p}_{k}$, and the swing angles $\pmb{\alpha}=\{\alpha_{k}\}_{k=1}^{K}$ are expressed as:

$$
\mathbf{n}_{k}=\frac{\mathbf{t}_{k}\times\mathbf{p}_{k}}{\left\|\mathbf{t}_{k}\times\mathbf{p}_{k}\right\|},\quad\cos\alpha_{k}=\frac{\mathbf{t}_{k}\cdot\mathbf{p}_{k}}{\left\|\mathbf{t}_{k}\right\|\left\|\mathbf{p}_{k}\right\|},\quad\sin\alpha_{k}=\frac{\left\|\mathbf{t}_{k}\times\mathbf{p}_{k}\right\|}{\left\|\mathbf{t}_{k}\right\|\left\|\mathbf{p}_{k}\right\|}
$$

By the Rodrigues formula, the swing rotation matrix $\mathbf{R}_{k}^{\mathrm{sw}}$ can be derived from the axis $\mathbf{n}_{k}$ and the angle $\alpha_{k}$ as follows:

$$
\mathbf{R}_{k}^{\mathrm{sw}}=\mathbf{I}+\sin\alpha_{k}\,[\mathbf{n}_{k}]_{\times}+(1-\cos\alpha_{k})\,[\mathbf{n}_{k}]_{\times}^{2}
$$

where $[\mathbf{n}_{k}]_{\times}$ is the skew-symmetric matrix of $\mathbf{n}_{k}$.
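
A per-bone PyTorch sketch of the swing computation above (the epsilon guard against near-parallel bones is our addition):

```python
import torch

def swing_rotation(t_k: torch.Tensor, p_k: torch.Tensor, eps: float = 1e-8):
    """Swing rotation R_k^sw aligning a template bone t_k (3,) with an
    estimated bone p_k (3,), via the axis/angle expressions and the
    Rodrigues formula above."""
    cross = torch.cross(t_k, p_k, dim=0)
    norms = t_k.norm() * p_k.norm() + eps
    sin_a = cross.norm() / norms
    cos_a = torch.dot(t_k, p_k) / norms
    n = cross / (cross.norm() + eps)   # swing axis n_k
    n_x = torch.zeros(3, 3)            # skew-symmetric matrix [n_k]_x
    n_x[0, 1], n_x[0, 2] = -n[2], n[1]
    n_x[1, 0], n_x[1, 2] = n[2], -n[0]
    n_x[2, 0], n_x[2, 1] = -n[1], n[0]
    return torch.eye(3) + sin_a * n_x + (1.0 - cos_a) * (n_x @ n_x)

R_sw = swing_rotation(torch.tensor([0., 1., 0.]), torch.tensor([1., 0., 0.]))
```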

The twist rotation $\mathbf{R}^{\mathrm{tw}}$ is recovered from the twist angles predicted by the twist angle and shape estimator $f^{\mathrm{TS}}$, as rotations about the template bone directions following [36]. Finally, the relative rotation matrix $\mathbf{R}$ is determined as follows:

$$
\mathbf{R}=\mathbf{R}^{\mathrm{sw}}\mathbf{R}^{\mathrm{tw}}.
$$

After obtaining the rotation matrix $\mathbf{R}$, we convert it to the 6D rotation representation to obtain the initial pose parameters $\pmb{\theta}_{\mathrm{init}}$. We initialize the camera parameters $\mathbf{C}_{\mathrm{init}}$ as $[0.9, 0, 0]$ and keep them constant during the inverse kinematics step.
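
The matrix-to-6D conversion simply keeps the first two columns of each rotation matrix; a minimal sketch, assuming the column-wise 6D convention used above:

```python
import torch

def rotmat_to_rot6d(R: torch.Tensor) -> torch.Tensor:
    """Flatten the first two columns of each (..., 3, 3) rotation matrix
    into (..., 6) pose parameters."""
    return R[..., :, :2].transpose(-1, -2).reshape(*R.shape[:-2], 6)

R = torch.eye(3).expand(24, 3, 3)        # 24 per-joint rotations from the IK step
theta_init = rotmat_to_rot6d(R)          # (24, 6) initial pose parameters
C_init = torch.tensor([0.9, 0.0, 0.0])   # constant camera initialization
```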

3.3 3D Mesh Refinement via Relation-Aware Transformer


The relation-aware refiner $f^{\mathrm{Ref}}:[\mathbf{F}_{\mathrm{img}},\Theta_{\mathrm{init}}]\rightarrow\Theta_{\mathrm{ref}}$ refines the initial SMPL parameters and is built on the vision Transformer architecture [14]. Its input is the concatenation of the image features $\mathbf{F}_{\mathrm{img}}$ and the SMPL parameters $\Theta_{\mathrm{init}}=[\pmb{\theta}_{\mathrm{init}};\beta_{\mathrm{init}};\mathbf{C}_{\mathrm{init}}]$ obtained from the IK process. We use $N\times K$ as the sequence length of the Transformer, where $N$ is the maximum number of people in the input and $K$ is the number of joints per person. By rearranging and concatenating the image features $\mathbf{F}_{\mathrm{img}}$ with $\Theta_{\mathrm{init}}$, we generate an $(N\times K)\times2{,}067$ array as the input to the Transformer (see the supplement for details). The Transformer outputs $\varDelta\Theta_{\mathrm{ref}}$, and the final SMPL parameters are obtained as follows:

$$
\Theta_{\mathrm{ref}}=\Theta_{\mathrm{init}}+\varDelta\Theta_{\mathrm{ref}}.
$$

From the refined parameters $\Theta_{\mathrm{ref}}=[\pmb{\theta}_{\mathrm{ref}};\beta_{\mathrm{ref}};\mathbf{C}_{\mathrm{ref}}]$, the 3D meshes $\mathbf{M}$ are obtained, and the corresponding 3D skeletons $\mathbf{P}_{\mathrm{ref}}$ are further obtained by applying the mesh-to-joint regressor of [42] to the mesh vertices.
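
The exact token layout is deferred to the supplement. The sketch below assumes one plausible arrangement in which each of the $N\times K$ tokens concatenates the person's 2,048-d image feature with that joint's 6-d rotation plus the person's 10-d shape and 3-d camera parameters, which matches the stated 2,067-d token size (2,048 + 6 + 10 + 3); the function and layout are hypothetical.

```python
import torch

N, K = 3, 24   # max persons per Transformer pass, joints per person
D_IMG = 2048   # pooled image feature size per person

def build_tokens(feats, thetas, betas, cams):
    """Assemble the (N*K) x 2067 refiner input from per-person quantities.

    feats:  list of (D_IMG,) pooled image features, one per detected person
    thetas: list of (K, 6) initial pose parameters from the IK step
    betas:  list of (10,) shape parameters
    cams:   list of (3,) camera parameters
    Missing persons are zero-padded, as for images with M < N.
    """
    tokens = torch.zeros(N * K, D_IMG + 6 + 10 + 3)
    for i, (f, th, be, c) in enumerate(zip(feats, thetas, betas, cams)):
        if i >= N:
            break
        per_joint = torch.cat(
            [f.expand(K, D_IMG), th, be.expand(K, 10), c.expand(K, 3)], dim=1)
        tokens[i * K:(i + 1) * K] = per_joint
    return tokens  # sequence fed to the relation-aware Transformer

M = 2  # two detected persons -> one zero-padded slot
seq = build_tokens([torch.randn(2048)] * M, [torch.randn(24, 6)] * M,
                   [torch.randn(10)] * M, [torch.randn(3)] * M)
print(seq.shape)  # torch.Size([72, 2067])
```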

When constituting the Transformer, we mask input patches as in METRO [38]: 0 to 30% of the input patches are randomly masked, which forces the Transformer to learn non-local interactions. Based on the results in Table 4, we adopt the masking scheme but do not use positional embeddings.
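
A sketch of such a masking scheme (our simplification; the exact ratio sampling of METRO [38] may differ in detail):

```python
import torch

def mask_tokens(tokens: torch.Tensor, max_ratio: float = 0.3) -> torch.Tensor:
    """Randomly zero out between 0% and 30% of the input tokens during
    training, encouraging the Transformer to rely on non-local context."""
    ratio = torch.rand(()) * max_ratio                          # sampled mask ratio
    keep = (torch.rand(tokens.shape[0]) >= ratio).float().unsqueeze(-1)
    return tokens * keep                                        # masked sequence
```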

Sampling interacting persons. The number of persons $M$ varies with the image $\mathbf{I}$, while the relation-aware refiner $f^{\mathrm{Ref}}$ requires a fixed $N$, the maximum number of persons in its input. We set $N$ to 3 according to the ablation study in Table 4. For images with fewer than $N$ persons ($M<N$), we apply the Transformer once, simply zero-padding the unoccupied inputs; for images with more than $N$ persons ($M>N$), we apply the Transformer multiple times by sampling the interacting persons. The sampling schemes for the training and testing stages are as follows. At training, we randomly sample multiple groups of $N$ persons so that the Transformer sees various combinations as epochs go by. At testing, we run $f^{\mathrm{Ref}}$ exactly $M$ times, producing a result for each person: at each run, we set one person as the refinement target and input the $N-1$ closest persons as context.
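
A sketch of the test-time sampling; we assume distances between persons are measured between their 3D root joints, which the text does not specify:

```python
import torch

def context_indices(roots: torch.Tensor, target: int, n: int = 3):
    """Pick the target person plus its n-1 nearest neighbors.

    roots: (M, 3) root-joint positions of all M detected persons.
    """
    d = (roots - roots[target]).norm(dim=-1)
    order = torch.argsort(d)   # the target itself comes first (distance 0)
    return order[:n].tolist()

roots = torch.randn(5, 3)      # M = 5 > N = 3: refine person 2 with 2 neighbors
print(context_indices(roots, target=2))
```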

3.4 Training Method


We use PyTorch to implement our pipeline. A single NVIDIA TITAN GPU is used for each experiment, with a batch size of 64. The Adam optimizer [29] is used with a learning rate of $5\times10^{-5}$ for the relation-aware Transformer and $1\times10^{-4}$ for all other networks. We decay the learning rate exponentially by a factor of 0.9 per epoch and train for 100 epochs in total.
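
A sketch of this optimization setup in PyTorch; the module definitions are placeholders for the refiner and the remaining networks:

```python
import torch

refiner = torch.nn.Linear(2067, 2067)  # placeholder for the relation-aware Transformer
others = torch.nn.Linear(2048, 72)     # placeholder for all other networks

optimizer = torch.optim.Adam([
    {"params": refiner.parameters(), "lr": 5e-5},
    {"params": others.parameters(), "lr": 1e-4},
])
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(100):
    # ... one training epoch over batches of size 64 ...
    scheduler.step()  # multiply both learning rates by 0.9 each epoch
```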

To train the proposed initial 3D skeleton estimation network $f^{\mathrm{P}}$, the twist angle and shape estimation network $f^{\mathrm{TS}}$ and the relation-aware refiner $f^{\mathrm{Ref}}$, we use the loss $L$ defined as follows:

$$
L(f^{\mathrm{P}},f^{\mathrm{TS}},f^{\mathrm{Ref}})=L_{\mathrm{P}}(f^{\mathrm{P}})+L_{\mathrm{TS}}(f^{\mathrm{TS}})+L_{\mathrm{Ref}}(f^{\mathrm{Ref}}).
$$

Each loss term is detailed in the remainder of this subsection.

3D skeleton loss $L_{\mathrm{P}}$. The loss $L_{\mathrm{P}}$ supervises the skeleton estimator $f^{\mathrm{P}}$ with $\hat{\mathbf{P}}^{\mathrm{3Ds}}$, $\hat{\mathbf{P}}^{\mathrm{3D}}$ and $\hat{\mathbf{P}}^{\mathrm{2D}}$, the ground-truth absolute 3D skeletons, relative 3D skeletons and 2D skeletons, respectively, where $\Pi$ denotes the orthographic projection used for the 2D supervision.

Twist angle and shape loss $L_{\mathrm{TS}}$. We use the loss $L_{\mathrm{TS}}$ to train the twist angle and shape estimator $f^{\mathrm{TS}}$ as follows:

$$
L_{\mathrm{TS}}(f^{\mathrm{TS}})=L_{\mathrm{angle}}(f^{\mathrm{TS}})+L_{\mathrm{shape}}(f^{\mathrm{TS}})
$$
