Mixing-Denoising Generalizable Occupancy Networks
Abstract
While current state-of-the-art generalizable implicit neural shape models [7, 54] rely on the inductive bias of convolutions, it is still not entirely clear how the properties emerging from such biases are compatible with the task of 3D reconstruction from point clouds. We explore an alternative approach to generalizability in this context. We relax the intrinsic model bias (i.e. we use MLPs instead of convolutions to encode local features) and instead constrain the hypothesis space with an auxiliary regularization related to the reconstruction task, namely denoising. The resulting model is the first MLP-only, locally conditioned network for implicit shape reconstruction from point clouds with fast feed-forward inference. Per-point features and denoising offsets are predicted by an exclusively MLP-based network in a single forward pass. A decoder predicts occupancy probabilities for queries anywhere in space by pooling nearby features from the point cloud in an order-invariant manner, guided by denoised relative positional encoding. We outperform the state-of-the-art convolutional method [7] while using half the number of model parameters.
1. Introduction
One of the most sought-after goals in modern computer vision and machine learning is enabling machines to understand and navigate 3D environments given limited input. This faculty is manifested in several downstream vision and graphics tasks, such as shape reconstruction from noisy and relatively sparse point clouds. Recovering full shape from point clouds is all the more important given the ubiquity of this light, albeit incomplete, 3D representation, whether it is acquired from increasingly accessible scanning devices or obtained from multi-view vision algorithms such as Structure from Motion or Multi-View Stereo (e.g. [63, 64]). While classical optimization-based approaches such as Poisson Reconstruction [33] or moving least squares [28] are mostly successful on dense, clean point sets with normal estimates, more recent deep learning based alternatives offer faster and more robust prediction, especially for noisy and sparse inputs, without requiring normals.

Figure 1. Our method couples 3D implicit reconstruction with explicit denoising of point clouds. We show reconstruction examples from 3000 noisy points (Gaussian noise with standard deviation 0.025): the input, our intermediate denoised point cloud, and our predicted geometry.
For learning-based reconstruction from point clouds, feed-forward (optimization-free) generalizable models are important to the community by virtue of their high performance and fast inference. State-of-the-art ones are based on implicit neural shape representations. Typically [15, 54], a deep convolutional network builds features from the input point cloud, then an implicit decoder maps the feature of a query point to its occupancy or signed distance with respect to the target shape. While the ConvNet predicts explicit extrinsic features in [15, 54, 55], recent work [7] suggests that learning features intrinsically, i.e. only at the input points, yields superior results. In this case, features for query points are pooled from the nearest points of the input point cloud.
While the aforementioned work advocates using convolutions to achieve generalizability for this problem, we ask ourselves: what if we do not constrain the hypothesis space in such an ad hoc way? That is, if we use the most inductive-bias-free model possible and rely on supervision and, where needed, regularization, can we still achieve generalization while maintaining fast inference? This line of thought is inspired by the recent success of MLP-only models in vision [72] and their application to point cloud processing [17]. Their idea consists in applying MLPs (mixing) to the feature dimension and the spatial dimension of the data separately and repeatedly. Since point clouds are unordered and irregular, the spatial (token) mixing is achieved with an order-invariant, softmax-based pooling over a limited support from the point cloud.
Based on the above, we propose a new encoder-decoder model for implicit reconstruction from point clouds. It processes the input point cloud intrinsically (as in [7]) but uses solely MLPs from start to end (unlike [7]). This allows us to outperform the state-of-the-art method [7] while using only about half the number of parameters (6.5M for ours vs. 12.5M for [7]). Not only does this show that convolutional inductive biases are not a necessity in this context, but we even obtain a more parameter-efficient intrinsic point cloud reconstruction model without them. Additionally, we show that coupling our implicit reconstruction learning with a related regularization task, in the form of weakly supervised point cloud denoising, can yield even better generalization. In fact, the performance gap in Chamfer distance between our method trained with and without denoising increases fourfold when moving to a setup requiring more generalization ability (see intra-ShapeNet generalization in Table 2 vs. the more challenging ShapeNet-to-ScanNet generalization in Table 5).
Our network builds on the dense prediction model of PointMixer [17] to predict per-point features. However, to infer occupancies for extrinsic query points (not only points of the point cloud), we introduce a new "Extra-set" mixing layer, named in keeping with the PointMixer terminology. The "Extra-set" mixing layer pools features from nearby points of the input point cloud. As this pooling can be guided by the positional encoding of the query relative to these support points, we additionally task our PointMixer with denoising the point cloud in parallel, and we use the denoised point cloud to compute more accurate positional encodings (unlike [7]). Furthermore, differently from [7], we incorporate a global decoder that uses coarse features on a downsampled point cloud as pooling support, in order to make the method more robust to local outliers.
We note that existing MLP-only feed-forward generalizable methods lack a local feature aggregation mechanism in their encoder architectures (PointNet), which is crucial for performance without sacrificing inference speed. [48] uses a single global feature, which causes under-performance. [22] enforces locality through query-dependent local patch inputs, hence requiring as many encoder forward passes as there are query points. This naturally results in significantly increased inference times, surpassing even some auto-decoding optimization-based methods. Conversely, being equipped with point mixing locality, our MLP-only encoder requires only a single pass over the input point cloud. It consequently offers about 3 orders of magnitude faster reconstruction (within $500\,\mathrm{ms}$ at resolution $128^{3}$) compared to [22], in line with competitors endowed with convolutional locality ([7, 54]).
We outperform existing comparable methods on standard benchmarks for object-level and scene-level reconstruction, and in generalization to real scan data (Section 4). The performance gap is largest in the sparsest and noisiest input setup (Table 1), demonstrating our resilience to scarcity and corruption. Our reconstructions are also more robust, as witnessed by our superior L2 Chamfer errors across all benchmarks. We evaluate the benefits of each of our contributions in our ablation section (4.7).
Our contributions can be summarized as follows:
- A convolution-free, feed-forward intrinsic occupancy network for point clouds that attains a new state of the art while being twice as parameter-efficient as its closest intrinsic convolutional counterpart [7].
- Joint learning of implicit reconstruction from point clouds with explicit point cloud denoising, for the first time to the best of our knowledge.
- Combined local and global pooling-based decoding for deep intrinsic (e.g. [7]) occupancy regression.
2. Related work
Shape Representations in Deep Learning. Shapes can be represented in deep learning either intrinsically or extrinsically. Intrinsic representations discretize the shape itself. When done explicitly, using e.g. tetrahedral or polygonal meshes [32, 76] or point clouds [23], this bounds the output topology and thus limits the variability of outputs. Among other forms of intrinsic representations, 2D patches [21, 27, 80] can introduce discontinuities, while the simple nature of shape primitives such as cuboids [74, 92], planes [39] and Gaussians [24] limits their expressivity. Differently, extrinsic shape representations model the space containing the scene or object. Voxel grids [84, 85] are the most popular, being the extension of 2D pixels to the 3D domain. Their capacity is limited, however, by a memory cost that is cubic in resolution. Sparse representations such as octrees [60, 71, 78] can alleviate this issue to some extent.
Implicit Neural Shape Representations. Implicit neural representations (INRs) have stood out as a major new medium for modelling shape and radiance extrinsically (e.g. [11, 30, 49, 77, 86]). They overcome many of the limitations of the aforementioned classical representations thanks to their ability to represent shapes with arbitrary topologies at virtually infinite resolution. They are usually parameterized with MLPs mapping spatial locations or features to e.g. occupancy [48], signed [53] or unsigned [16, 91] distances relative to the target shape. The level set of the field inferred by these MLPs can be rendered through ray marching [29], or tessellated into an explicit shape using e.g. Marching Cubes [43]. A noteworthy branch of work builds hybrid implicit/explicit representations [14, 20, 52, 87], based mostly on differentiable space partitioning.
Generalizing Implicit Neural Shape Representations. In order to represent collections of shapes, implicit neural models require conditioning mechanisms. These include latent code concatenation, batch normalization, hypernetworks [65, 67, 68, 79] and gradient-based meta-learning [50, 66]. Concatenation-based conditioning was first implemented using single global latent codes [13, 48, 53], and further improved with the use of local features [15, 22, 25, 31, 35, 54, 69, 73].
Shape Reconstruction from Point Cloud. Classical approaches include combinatorial ones, where the shape is defined through a space partitioning based on the input point cloud, e.g. alpha shapes [5], Voronoi diagrams [1] or triangulation [9, 41, 59]. Alternatively, the input samples can be used to define an implicit function whose zero level set represents the target shape, using global smoothing priors [36, 81, 82] such as radial basis functions [8] and Gaussian kernel fitting [62], local smoothing priors such as moving least squares [28, 34, 42, 47], or by solving a boundary-conditioned Poisson equation [33]. The recent literature proposes to parameterize these implicit functions with deep neural networks and learn their parameters with gradient descent, in either a supervised or an unsupervised manner.
Unsupervised Implicit Neural Reconstruction. A neural network is typically fitted to the input point cloud without extra information. Regularizations can improve convergence, such as the spatial gradient constraint based on the Eikonal equation introduced by Gropp et al. [26], a spatial divergence constraint as in [4], or Lipschitz regularization of the network [40]. Atzmon et al. learn an SDF from unsigned distances [2], and further supervise the spatial gradient of the function with normals [3]. Ma et al. [44] express the nearest point on the surface as a function of the neural signed distance and its gradient. They also leverage self-supervised local priors to deal with very sparse inputs [45] and improve generalization [46]. All of the aforementioned work benefits from efficient gradient computation through back-propagation in the neural network. Periodic activations were introduced in [67]. Lipman [38] learns a function that converges to occupancy while its log transform converges to a distance function. [81] learns infinitely wide shallow MLPs as random feature kernels.
Supervised Implicit Neural Reconstruction. Supervised methods assume a labeled training corpus, commonly in the form of dense samples with ground-truth shape information. Auto-decoding methods [10, 31, 35, 53, 73] require test-time optimization to fit a new point cloud, which can take up to several seconds. Encoder-decoder based methods enable fast feed-forward inference. Pooling-based set encoders [13, 25, 48] such as PointNet [56], the first to be introduced in this respect, have been shown to underfit the context. Convolutional encoders yield state-of-the-art performance. They use local features defined either in explicit volumes and planes [15, 37, 54, 55] or solely at the input points [7]. Concurrently, Ouasfi and Boukhayma [51] proposed to make the generalization of these models more robust through transfer learning to a kernel ridge regression whose hyperparameters are fitted to the shape. Peng et al. [55] proposed a differentiable Poisson solving layer that efficiently converts predicted normals into an indicator function grid. However, it is limited to small scenes because its memory requirement is cubic in grid resolution.
3. Method
Given a noisy input point cloud $\mathbf{X}\in\mathbb{R}^{N_{p}\times3}$, our objective is to recover a shape surface $s$ that best explains this observation, i.e. such that the elements of the input point cloud are noisy samples from $s$.
To achieve this, we train a deep implicit neural network $f_{\theta}$ to predict occupancy values relative to a target shape $s$ at any queried location $x\in\mathbb{R}^{3}$ in Euclidean space, given the input point cloud $\mathbf{X}$, i.e. $f_{\theta}(\mathbf{X},x)=1$ if $x$ is inside the shape, and 0 otherwise. The inferred shape $\hat{\mathcal{S}}$ can then be obtained as a level set of the occupancy field inferred using $f_{\theta}$:
$$
\hat{\mathcal{S}}=\{x\in\mathbb{R}^{3}\mid f_{\theta}(\mathbf{X},x)=0.5\}.
$$
In practice, an explicit triangle mesh for $\hat{\mathcal{S}}$ is extracted using the Marching Cubes algorithm [43].
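As a minimal sketch of the level-set definition above, the following NumPy code evaluates an occupancy function on a regular grid and thresholds it at 0.5. The analytic sphere `toy_occupancy` is a hypothetical stand-in for $f_{\theta}(\mathbf{X},\cdot)$, and the grid resolution and bounds are illustrative assumptions; in a real pipeline the resulting grid would be handed to a Marching Cubes implementation to obtain the mesh.

```python
import numpy as np

def toy_occupancy(x, radius=0.5):
    """Hypothetical stand-in for f_theta(X, x): soft occupancy of a sphere."""
    # Sigmoid of the signed distance: above 0.5 inside, below 0.5 outside.
    dist = np.linalg.norm(x, axis=-1) - radius
    return 1.0 / (1.0 + np.exp(40.0 * dist))

def occupancy_grid(f, resolution=32, bound=1.0):
    """Evaluate f at the nodes of a resolution^3 grid spanning [-bound, bound]^3."""
    axis = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    return f(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)

resolution, bound = 32, 1.0
occ = occupancy_grid(toy_occupancy, resolution, bound)
inside = occ > 0.5  # grid nodes enclosed by the 0.5 level set

# Approximate enclosed volume: number of inside nodes times the cell volume.
cell_volume = (2.0 * bound / (resolution - 1)) ** 3
volume_estimate = inside.sum() * cell_volume
```

The estimate can be compared with the analytic sphere volume $\tfrac{4}{3}\pi r^{3}$ as a sanity check on the thresholding.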
We opt for a feed-forward conditional implicit model $f_{\theta}$, as we want our method to generalize to multiple shapes simultaneously and to provide fast inference without test-time optimization. Similarly to the state of the art in this category, e.g. [7, 54], $f_{\theta}$ consists of a point cloud conditioning network and an implicit decoder.
3.1. Model
The staple backbone model for feed-forward generalizable 3D shape reconstruction from point clouds has lately been the one introduced by Peng et al. [54], which also bears similarities with the concurrent work by Chibane et al. [15]. It has also been widely used by methods predicting other implicit fields from point clouds, such as point displacements [75], unsigned distances [16], or whether point pairs are separated by a surface [88]. In this model, the encoder builds an explicit extrinsic 3D feature volume from the voxelized or pixelized input point cloud through 3D (or 2D) convolutional networks.
Recently, Boulch et al. [7] argued against this strategy. They contended that these methods operate on a voxelized discretization whose voxel centers may be far from the input point locations. They also argued that the voxel centers holding the shape information are uniformly sampled, while they should ideally be more concentrated near the surface. Accordingly, they introduced a point cloud convolutional encoder in which features are only defined at the input point locations, where shape information matters most. As a result, they achieve superior reconstruction performance on several benchmarks.

Figure 2. Overview: Our method uses exclusively MLPs from beginning to end to predict an implicit shape function given a noisy input point cloud. First, a U-Net PointMixer [17] (gray boxes) is tasked with denoising the input and producing per-point features. The building blocks of this network combine 3 pre-existing forms of point mixing, which pool features from support points in a point cloud into other points of the point cloud. Channel-wise mixing is also used throughout the model. We refer the reader to the work of Choe et al. [17] for a detailed description of these pre-existing mixing layers. We introduce "Extra-set mixing" (Section 3.3), which we use to pool features both locally and globally onto any external query point to estimate its occupancy. This mixing is guided by relative positional encoding using the denoised support point coordinates (see Figure 3).
Both of the aforementioned strategies, however, still rely on convolutions for translational equivariance and generalization across scenes and objects. Differently, we propose here an architecture based solely on MLPs, and show that it mostly outperforms the previously mentioned strategies.
3.2. Feature Extraction and Denoising Network
Our point cloud feature extraction network takes the noisy point cloud $\mathbf{X}$ as input and produces an output for each element of the point cloud. However, differently from [7], it does not use convolutions; the entire model is instead based on MLPs, inspired by the recent success of the PointMixer [17] architecture.
Specifically, we use the dense prediction task model in [17], which consists of a symmetric U-Net [61] encoder-decoder scheme. The point cloud is downsampled gradually throughout the encoder $E$ using farthest point sampling until it reaches the coarsest point cloud $\mathbf{Y}\in\mathbb{R}^{N_{d}\times3}$. It is then upsampled correspondingly throughout the decoder $D$. As illustrated in Figure 2, each block in the encoder/decoder consists mostly of a hierarchical mixing layer, an intra-set mixing layer, an inter-set mixing layer and channel mixing, as elaborated in [17].
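For readers unfamiliar with farthest point sampling, the greedy procedure can be sketched as follows. This is a generic single-level NumPy illustration, not the implementation used in [17]: the start index, the point counts and the function name `farthest_point_sampling` are assumptions, and the actual encoder applies such a downsampling step several times.

```python
import numpy as np

def farthest_point_sampling(points, n_samples, start=0):
    """Greedy farthest point sampling; returns indices into `points`."""
    selected = np.empty(n_samples, dtype=np.int64)
    selected[0] = start
    # Distance from every point to its closest already-selected point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, n_samples):
        # Pick the point that is farthest from the current selection.
        selected[i] = np.argmax(min_dist)
        new_dist = np.linalg.norm(points - points[selected[i]], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 3))           # stand-in for the input point cloud
delta = farthest_point_sampling(X, 64)   # indices of the retained points
Y = X[delta]                             # coarse point cloud
```

Returning indices rather than coordinates keeps the correspondence between coarse and fine points, which is what allows quantities computed on the fine point cloud to be looked up for the coarse one.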
The channel mixing layer is an MLP operating directly on the feature domain, i.e. mixing feature channels. The other mixings are point-wise mixings that replace the token-wise mixing of the original MLP-Mixer model [72]. They are softmax-based, order-invariant pooling layers that pool features from a limited set of support points of the point cloud into a query point of the point cloud. They vary in the way their supports are defined. In intra-set mixing, the support is the set of nearest points to the query. In inter-set mixing, the support is the inverse of the intra-set support, i.e. a point is part of the support of a query if the query is among the nearest points of that point. Hierarchical-set mixing is used for down/up transitions in the encoder/decoder, and hence the support is the set of nearest points in the previous/next point cloud resolution. In practice, for the decoder to be symmetrical with the encoder, the up-transition support set is defined as the inverse of the equivalent down-transition support set, differently from seminal work [57, 89, 90]. For a more thorough explanation of the intra-set, inter-set and hierarchical-set mixing layers, we refer the reader to the work of Choe et al. [17].
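The difference between the intra-set and inter-set supports can be made concrete with a small sketch. The code below builds both support sets from a brute-force k-nearest-neighbour search; the helper names are hypothetical, and counting a point among its own neighbours is an assumption made here for simplicity rather than something stated in [17].

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours of each point (self included)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]

def intra_set_support(points, k):
    """support[i] = the k nearest points to query i."""
    return [set(row.tolist()) for row in knn_indices(points, k)]

def inter_set_support(points, k):
    """support[i] = points j whose k nearest neighbours contain i (inverse kNN)."""
    support = [set() for _ in range(len(points))]
    for j, row in enumerate(knn_indices(points, k)):
        for i in row.tolist():
            support[i].add(j)
    return support

rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 3))
intra = intra_set_support(pts, k=4)
inter = inter_set_support(pts, k=4)
```

Intra-set supports always have exactly `k` elements, whereas inter-set supports have a variable size, which is one reason a softmax-normalized pooling over the support is convenient.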
Differently from [7], we design our feature extraction network to perform two tasks simultaneously: produce per-point fine features $\mathbf{F}_{\mathbf{X}}\in\mathbb{R}^{N_{p}\times F}$ and displacement vectors $\Delta\mathbf{X}\in\mathbb{R}^{N_{p}\times3}$:
$$
\mathbf{F}_{\mathbf{X}},\pmb{\Delta}\mathbf{X}=D\circ E(\mathbf{X}).
$$
We train the displacements in a semi-supervised manner to produce a denoised version $\tilde{\mathbf{X}}\in\mathbb{R}^{N_{p}\times3}$ of the input point cloud:
$$
\tilde{\mathbf{X}}=\mathbf{X}+\Delta\mathbf{X}.
$$

Figure 3. Using denoised point coordinates and features, our implicit decoder performs local and global extra-set point mixing at a given query location to predict its shape occupancy value. The global extra-set mixing mechanism is similar to the local one. The local extra-set mixing uses the $K$ nearest denoised points of the point cloud. The global extra-set mixing uses all $N_{d}$ denoised points of the downsampled point cloud.
We note that, using the encoder $E$, we can also extract coarse features $\mathbf{F}_{\mathbf{Y}}$ for the downsampled point cloud $\mathbf{Y}$ at the bottleneck of the model:
$$
\mathbf{F}_{\mathbf{Y}}=E(\mathbf{X}).
$$
A denoised version $\tilde{\mathbf{Y}}\in\mathbb{R}^{N_{d}\times3}$ of the downsampled input point cloud $\mathbf{Y}$ is naturally obtained from the output denoised point cloud $\tilde{\mathbf{X}}$ via the downsampling indexing $\delta:\{1,\dots,N_{d}\}\to\{1,\dots,N_{p}\}$ defined by the encoder, which maps each coarse point to its index in the input point cloud:
$$
\tilde{\mathbf{Y}}_{i}=\tilde{\mathbf{X}}_{\delta(i)},\quad\mathbf{Y}_{i}=\mathbf{X}_{\delta(i)}.
$$
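In array terms, the denoising update and the coarse-level lookup above amount to an addition and an index gather. The sketch below uses random placeholders for the predicted offsets and for the sampling indices (both would come from the network in practice); the sizes `N_p` and `N_d` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_p, N_d = 3000, 64

X = rng.normal(size=(N_p, 3))                     # noisy input points
delta_X = 0.01 * rng.normal(size=(N_p, 3))        # placeholder for predicted offsets
delta = rng.choice(N_p, size=N_d, replace=False)  # placeholder for the encoder's sampling indices

X_tilde = X + delta_X      # denoised fine point cloud
Y = X[delta]               # coarse noisy points
Y_tilde = X_tilde[delta]   # coarse denoised points, obtained by lookup only
```

No additional network pass is needed to denoise the coarse points, since each of them is also a point of the fine point cloud.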
3.3. Extra-set Mixing Decoder
To serve as our implicit decoder, we introduce a new point mixing layer, borrowing the terminology of PointMixer [17]. As opposed to the existing point mixing layers in [17], this layer takes as input a query point representing any location in Euclidean space, and not necessarily an element of the input point cloud $\mathbf{X}$, hence the name "Extra-set". Similarly to the point mixing layers in [17], it pools features from a support set using softmax in an order-invariant fashion.
We first use our layer to extract a local fine feature $\mathbf{y}^{\mathrm{loc}}$ for a query point $q\in\mathbb{R}^{3}$. This feature is built using the $K$ nearest points to $q$ from the denoised output point cloud $\tilde{\mathbf{X}}$ and their corresponding output fine features in $\mathbf{F}_{\mathbf{X}}$, as generated by the decoder $D$.
To this end, and as illustrated in Figure 3, we first compute an aggregation score $s_{\tilde{p}}$ for each support element $\tilde{p}$:
$$
s_{\tilde{p}}=g_{2}\circ g_{1}(\tilde{p}-q;\mathbf{F}_{\mathbf{X}}(p)),\quad\tilde{p}\in\mathcal{X}_{q}=\mathrm{kNN}(\tilde{\mathbf{X}},q),
$$
where $g_{1}(.)$ and $g_{2}(.)$ are channel mixing MLPs and $\mathbf{F}_{\mathbf{X}}(p)$ is the fine feature of the denoised support point $\tilde{p}$, with $p\in\mathbf{X}$ the input point whose denoised position is $\tilde{p}$. We use these aggregation scores to pool the features of the support points into the final local feature of the query point:
$$
\mathbf{y}^{\mathrm{loc}}=\sum_{\tilde{p}\in\mathcal{X}_{q}}\mathrm{softmax}(s_{\tilde{p}})\odot\big[g_{3}\circ g_{1}(\tilde{p}-q;\mathbf{F}_{\mathbf{X}}(p))\big],
$$
where $\mathrm{softmax}(.)$ is the softmax normalization over the support point dimension, $g_{3}(.)$ is another channel mixing MLP, and $\odot$ denotes the element-wise product. In effect, this layer is akin to a spatial attention module [83], with space here spanning the support points.
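The local extra-set mixing described above can be sketched in NumPy as follows. This is an illustrative reading of the two equations, not the authors' code: the MLPs carry random, untrained weights, the `;` in the equations is interpreted as concatenation, the scores are taken to have one value per output channel, and all layer sizes are assumptions.

```python
import numpy as np

def mlp(rng, sizes):
    """Random-weight ReLU MLP (placeholder weights, not trained)."""
    params = [(rng.normal(scale=0.1, size=(a, b)), np.zeros(b))
              for a, b in zip(sizes[:-1], sizes[1:])]
    def forward(x):
        for i, (w, b) in enumerate(params):
            x = x @ w + b
            if i < len(params) - 1:
                x = np.maximum(x, 0.0)  # ReLU on hidden layers only
        return x
    return forward

def softmax(s, axis):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extra_set_mixing(q, support_xyz, support_feat, g1, g2, g3):
    """Pool support features onto query q with softmax weights over the support."""
    rel = support_xyz - q                                # (K, 3) relative positions
    h = g1(np.concatenate([rel, support_feat], axis=1))  # (K, H) shared embedding
    weights = softmax(g2(h), axis=0)                     # normalize over support points
    return (weights * g3(h)).sum(axis=0)                 # (C,) pooled feature

rng = np.random.default_rng(0)
F, H, C, K = 16, 32, 8, 12
X_tilde = rng.normal(size=(200, 3))   # stand-in for denoised points
F_X = rng.normal(size=(200, F))       # stand-in for fine features
g1, g2, g3 = mlp(rng, [3 + F, H, H]), mlp(rng, [H, C]), mlp(rng, [H, C])

q = np.zeros(3)
nn = np.argsort(np.linalg.norm(X_tilde - q, axis=1))[:K]  # K nearest denoised points
y_loc = extra_set_mixing(q, X_tilde[nn], F_X[nn], g1, g2, g3)
```

Because the softmax and the sum both run over the support dimension, shuffling the support points leaves `y_loc` unchanged, which is the order invariance the layer relies on.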
We note additionally that the architecture of this layer follows the decoder in [7]. The main difference is that the relative positional encoding in our extra-set mixing uses denoised support coordinates, $\tilde{p}-q$ with $\tilde{p}\in\tilde{\mathbf{X}}$, instead of the original noisy ones, $p-q$ with $p\in\mathbf{X}$, as done in [7]. This reduces the effect of noise on the feature aggregation guided by Euclidean relative position. Similarly to [7], we use in practice $N_{H}$ MLPs $g_{2}$ in parallel, i.e. $\mathrm{softmax}(s_{\tilde{p}}):=\sum_{i=1}^{N_{H}}\mathrm{softmax}\big(g_{2}^{i}\circ g_{1}(\tilde{p}-q;\mathbf{F}_{\mathbf{X}}(p))\big)$.
Next, we use the layer introduced above to also extract a global coarse feature $\mathbf{y}^{\mathrm{glob}}$ for a query point $q\in\mathbb{R}^{3}$. The goal of this global mixing component is to include global context in the reconstruction task and prevent the method from over-fitting to local information. This feature is built using all $N_{d}$ elements of the downsampled denoised point cloud $\tilde{\mathbf{Y}}$, along with their corresponding coarse features in $\mathbf{F}_{\mathbf{Y}}$, as generated by the encoder $E$.
Similarly to the previous case, we first compute aggregation scores $r_{\tilde{p}}$ for each support element $\tilde{p}$ , and we use a support point-wise softmax normalization of these scores to weight the support point features:
与前例类似,我们首先计算每个支撑元素 $\tilde{p}$ 的聚合分数 $r_{\tilde{p}}$ ,并通过支撑点级别的softmax归一化对这些分数进行加权以调整支撑点特征:
$$
\begin{aligned}
r_{\tilde{p}}&=h_{2}\circ h_{1}(\tilde{p}-q;\mathbf{F}_{\mathbf{Y}}(\tilde{p})),\quad\tilde{p}\in\tilde{\mathbf{Y}},\\
\mathbf{y}^{\mathrm{glob}}&=\sum_{\tilde{p}\in\tilde{\mathbf{Y}}}\operatorname{softmax}(r_{\tilde{p}})\odot\big[h_{3}\circ h_{1}(\tilde{p}-q;\mathbf{F}_{\mathbf{Y}}(\tilde{p}))\big],
\end{aligned}
$$
where $h_{1}(.),h_{2}(.)$ and $h_{3}(.)$ are channel mixing MLPs, and $\mathbf{F}_{\mathbf{Y}}(\tilde{p})$ is the coarse feature of denoised support point $\tilde{p}$.
其中 $h_{1}(.),h_{2}(.)$ 和 $h_{3}(.)$ 是通道混合 MLP (Multilayer Perceptron),$\mathbf{F}_{\mathbf{Y}}(\tilde{p})$ 是去噪支撑点 $\tilde{p}$ 的粗粒度特征。
Finally, a last channel mixing MLP $g_{4}(.)$ combines the local feature $\mathbf{y}^{\mathrm{loc}}$ and the global feature $\mathbf{y}^{\mathrm{glob}}$ of a query point $q$ to produce the occupancy probability of the latter, which reads:
最后,一个通道混合多层感知机 $g_{4}(.)$ 将查询点 $q$ 的局部特征 $\mathbf{y}^{\mathrm{loc}}$ 和全局特征 $\mathbf{y}^{\mathrm{glob}}$ 结合,生成后者的占用概率,其表达式为:
$$
f_{\theta}(\mathbf{X},q):=\operatorname{occ}(q)=g_{4}(\mathbf{y}^{\mathrm{loc}};\mathbf{y}^{\mathrm{glob}}).
$$
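As an illustration of this last step, here is a small NumPy sketch of one way $g_{4}$ could turn the two pooled features into an occupancy probability. It assumes that the semicolon denotes channel concatenation and that the two outputs of $g_{4}$ (Section 4.1 reports layers of dimensions 64 and 2) are logits read through a two-way softmax; both are assumptions, and `occupancy_probability` is a hypothetical name.

```python
import numpy as np

def occupancy_probability(y_loc, y_glob, W1, b1, W2, b2):
    """Sketch of g4: concatenate local and global features, apply a
    two-layer MLP, and return the probability of the 'occupied' class."""
    x = np.concatenate([y_loc, y_glob], axis=-1)
    h = np.maximum(x @ W1 + b1, 0.0)            # hidden layer with ReLU
    logits = h @ W2 + b2                        # (..., 2): [empty, occupied] (assumed order)
    logits = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return (p / p.sum(axis=-1, keepdims=True))[..., 1]

# Usage with random weights for a batch of 5 queries and 32-d features.
rng = np.random.default_rng(0)
y_loc, y_glob = rng.normal(size=(5, 32)), rng.normal(size=(5, 32))
W1, b1 = rng.normal(0.0, 0.1, (64, 64)), np.zeros(64)
W2, b2 = rng.normal(0.0, 0.1, (64, 2)), np.zeros(2)
occ = occupancy_probability(y_loc, y_glob, W1, b1, W2, b2)
```

At reconstruction time, such probabilities would be evaluated on a grid of queries and thresholded to extract a surface.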
3.4. Training
3.4. 训练
Our method is fully differentiable and is trained end-to-end leveraging the combination of a reconstruction loss and a denoising loss:
我们的方法完全可微分,并通过结合重建损失和去噪损失进行端到端训练:
$$
\mathcal{L}=\mathcal{L}_{\mathrm{rec}}+\mathcal{L}_{\mathrm{den}}.
$$
We train the feature extraction and denoising network to produce a denoising offset for each input point by backpropagating the L2 Chamfer distance between the denoised point cloud $\tilde{\mathbf{X}}$ and a ground truth point cloud $\mathbf{X}_{\mathrm{gt}}\sim S$ sampled from the ground truth shape $S$:
我们训练特征提取和去噪网络,通过反向传播去噪点云 $\tilde{\mathbf{X}}$ 与从真实形状 $S$ 采样的真实点云 $\mathbf{X}_{\mathrm{gt}}\sim S$ 之间的 L2 Chamfer 距离,为每个输入点生成去噪偏移量:

Figure 4. ShapeNet reconstructions from 3000 noisy points, with standard deviation of 0.005. Our method reproduces details and fine structures with more fidelity.
图 4: 从3000个噪声点(标准差0.005)重建的ShapeNet模型。我们的方法能以更高保真度还原细节和精细结构。
$$
\mathcal{L}_{\mathrm{den}}=\frac{1}{2|\tilde{\mathbf{X}}|}\sum_{\tilde{p}\in\tilde{\mathbf{X}}}\min_{p\in\mathbf{X}_{\mathrm{gt}}}\|p-\tilde{p}\|_{2}^{2}+\frac{1}{2|\mathbf{X}_{\mathrm{gt}}|}\sum_{p\in\mathbf{X}_{\mathrm{gt}}}\min_{\tilde{p}\in\tilde{\mathbf{X}}}\|\tilde{p}-p\|_{2}^{2},
$$
where $|.|$ is the point set cardinality. A similar loss is used in recent supervised point cloud denoising literature (e.g. [58]). The ground truth point cloud $\mathbf{X}_{\mathrm{gt}}$ contains $100\mathrm{k}$ ground truth surface samples. Besides providing a more accurate relative positional encoding of the support for the decoder extra-set mixing, this loss can provide additional regularization to the feature extraction network, due to the correlations between the denoising and reconstruction tasks.
其中 $|.|$ 表示点集的基数。类似的损失函数在近年来的监督式点云去噪文献中也有使用(例如 [58])。真实点云 $\mathbf{X}_{\mathrm{gt}}$ 包含 10 万个真实表面采样点。除了为解码器的额外集合混合提供更精确的相对位置编码支持外,由于去噪任务与重建任务之间的相关性,该损失函数还能为特征提取网络提供额外的正则化约束。
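The denoising loss can be sketched in a few lines of NumPy. This is a brute-force illustration of the symmetric squared-L2 Chamfer distance, assuming the denoised cloud is the input plus the predicted offsets; the function name `chamfer_l2` is hypothetical, and the paper's implementation is in PyTorch.

```python
import numpy as np

def chamfer_l2(x_den, x_gt):
    """Symmetric squared-L2 Chamfer distance between two point sets.

    x_den: (N, 3) denoised points (input points plus predicted offsets)
    x_gt:  (M, 3) ground truth surface samples
    """
    # pairwise squared distances, shape (N, M)
    d2 = ((x_den[:, None, :] - x_gt[None, :, :]) ** 2).sum(axis=-1)
    den_to_gt = d2.min(axis=1).mean()   # each denoised point to its nearest GT sample
    gt_to_den = d2.min(axis=0).mean()   # each GT sample to its nearest denoised point
    return 0.5 * den_to_gt + 0.5 * gt_to_den

# Tiny worked example: one denoised point, two ground truth samples.
x_den = np.array([[0.0, 0.0, 0.0]])
x_gt = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
loss = chamfer_l2(x_den, x_gt)   # 0.5 * 1 + 0.5 * (1 + 4) / 2 = 1.75
```

The dense distance matrix is only practical for small sets; with 3000 inputs against 100k ground truth samples, a chunked computation or a nearest-neighbor structure would be needed.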
Following seminal work [48], the reconstruction loss is the binary cross-entropy (BCE) between the predicted occupancies of query points $\{q\}$ sampled around the ground truth surface $S$ and their ground truth occupancy labels:
遵循开创性工作[48],重建损失是在真实表面 $S$ 周围采样的查询点 $\{q\}$ 的预测占据值与其真实占据标签之间的二元交叉熵 (BCE) 损失:
$$
\mathcal{L}_{\mathrm{rec}}=\sum_{q}\mathrm{BCE}(\operatorname{occ}(q),\operatorname{occ}(q)_{\mathrm{gt}}).
$$
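For illustration, the reconstruction loss and the combined objective can be written as follows in NumPy. This is a sketch under the assumption that the two losses are summed with unit weights, as in the total loss given at the start of this section; the clipping constant `eps` and the function names are hypothetical choices made for numerical safety and readability.

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Element-wise binary cross-entropy between probabilities p and labels y."""
    p = np.clip(p, eps, 1.0 - eps)      # avoid log(0)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def reconstruction_loss(occ_pred, occ_gt):
    """Sum of BCE terms over all query points."""
    return bce(occ_pred, occ_gt).sum()

def total_loss(occ_pred, occ_gt, l_den):
    """Reconstruction loss plus a precomputed denoising loss value."""
    return reconstruction_loss(occ_pred, occ_gt) + l_den

# Two queries predicted at 0.5: each contributes ln(2) to the loss.
l_rec = reconstruction_loss(np.array([0.5, 0.5]), np.array([1.0, 0.0]))
```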
4. Results
4. 结果
In this section, we compare our approach quantitatively and qualitatively to state-of-the-art methods on object and scene reconstruction benchmarks, using both synthetic and real data.
本节我们通过合成数据和真实数据,在物体与场景重建基准测试中,从定量和定性两个维度将本方法与前沿方法进行对比。

Figure 5. ShapeNet reconstructions from 3000 noisy points, with standard deviation of 0.025. We show more resilience to high levels of noise.
图 5: 从3000个噪声点(标准差为0.025)重建的ShapeNet结果。我们的方法对高强度噪声表现出更强的鲁棒性。
4.1. Implementation details
4.1. 实现细节
We implement our method using the PyTorch framework. The U-Net follows the architecture in [17]. Coarse and fine feature dimensions are $F_{c}=512$ and $F_{f}=32$ respectively. For inputs of size $N_{p}=3000$, we use $N=64$ nearest neighbors as the local decoder support size. For $N_{p}=300$, we use $N=12$. The down-sampled point cloud is of size $N_{d}=12$. MLPs $g_{1}/h_{1}$ have 3 ReLU-activated hidden layers of dimension 32. $g_{2}/h_{2}$ and $g_{3}/h_{3}$ have one layer of dimension 64 and 32 respectively. $g_{4}$ has two layers of dimensions 64 and 2. We set $N_{H}=64$. Following [7], we train for $600\mathrm{k}$ iterations in batches of 16 shapes and 2048 query points, and we reconstruct similarly at resolution 256.
我们使用 PyTorch 框架实现我们的方法。U-Net 遵循 [17] 中的架构。粗粒度特征和细粒度特征的维度分别为 $F_{c}=512$ 和 $F_{f}=32$。对于大小为 $N_{p}=3000$ 的输入,我们使用 $N=64$ 最近邻作为局部解码器的支持尺寸。对于 $N_{p}=300$,使用 $N=12$。下采样点云的大小为 $N_{d}=12$。MLP $g_{1}/h_{1}$ 有 3 个 ReLU 激活的隐藏层,维度为 32。$g_{2}/h_{2}$ 和 $g_{3}/h_{3}$ 分别有 1 个维度为 64 和 32 的层。$g_{4}$ 有 2 个维度为 64 和 2 的层。$N_{H}=64$。遵循 [7],我们训练 $600\mathrm{k}$ 次迭代,每批 16 个形状和 2048 个查询点,并在分辨率 256 下进行类似重建。
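The hyperparameters listed above can be gathered in a small configuration object. The sketch below uses only the Python standard library; the class and field names are hypothetical, and the reading that the sparse setting pairs 300 input points with a support size of 12 is an interpretation of the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    coarse_dim: int = 512            # F_c
    fine_dim: int = 32               # F_f
    num_input_points: int = 3000     # N_p
    support_size: int = 64           # N, nearest neighbors used by the local decoder
    num_downsampled: int = 12        # N_d
    num_heads: int = 64              # N_H
    batch_shapes: int = 16
    query_points: int = 2048
    train_iterations: int = 600_000
    grid_resolution: int = 256

    @property
    def grid_queries(self) -> int:
        """Number of occupancy queries on a dense cubic grid at this resolution."""
        return self.grid_resolution ** 3

DENSE = ModelConfig()
SPARSE = ModelConfig(num_input_points=300, support_size=12)
```

A dense grid at resolution 256 amounts to roughly 16.8 million query locations, which is why fast feed-forward decoding matters at reconstruction time.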
4.2. Metrics
4.2. 指标
Following seminal work, we evaluate our method and competing methods w.r.t. the ground truth using standard metrics for the 3D reconstruction task, namely the L1 and L2 Chamfer Distances ($\mathrm{CD}_{1}\times10^{2}$ and $\mathrm{CD}_{2}\times10^{4}$), Normal Consistency (NC) and the F-Score (FS) based on Euclidean distance. We detail their expressions in the supplementary material.
遵循开创性工作,我们使用3D重建任务的标准指标评估了我们的方法及竞品相对于真实值的表现。具体包括L1和L2倒角距离 $\mathrm{CD}_{1}\times10^{2}$ 与 $\mathrm{CD}_{2}\times10^{4}$ 、法向一致性(NC)以及基于欧氏距离的F值(FS)。这些指标的详细表达式见补充材料。
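Since the exact metric expressions are deferred to the supplementary material, the following NumPy sketch only shows commonly used forms of the L1 Chamfer distance and the F-Score on point sets sampled from the predicted and ground truth surfaces. The distance threshold `tau` and the function names are assumptions and may differ from the evaluation protocol actually used.

```python
import numpy as np

def nn_dists(a, b):
    """Euclidean distance from each point of a (N, 3) to its nearest point in b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1)

def chamfer_l1(pred, gt):
    """Average of the mean nearest-neighbor distances in both directions."""
    return 0.5 * (nn_dists(pred, gt).mean() + nn_dists(gt, pred).mean())

def f_score(pred, gt, tau=0.01):
    """Harmonic mean of precision and recall at distance threshold tau (assumed value)."""
    precision = (nn_dists(pred, gt) <= tau).mean()
    recall = (nn_dists(gt, pred) <= tau).mean()
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Worked example: one of two predicted points lies exactly on the ground truth.
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0]])
cd1 = chamfer_l1(pred, gt)   # 0.5 * (0.5 + 0.0) = 0.25
fs = f_score(pred, gt)       # precision 0.5, recall 1.0 -> 2/3
```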
4.3. Datasets
4.3. 数据集
ShapeNet [12] is used to evaluate object level reconstruction. It consists of various instances of 13 different object classes. We train/test on this dataset using input point clouds of sizes 300 and 3000, and noises of standard deviation $\sigma=0.005$ and $\sigma=0.025$. Synthetic Rooms [54] is used to evaluate scene level reconstruction. Numerical and qualitative results for this dataset are shown in the supplementary material. It consists of 5000 scenes made of a floor and walls and populated with ShapeNet objects. We train/test on this dataset using inputs of size $10\mathrm{k}$ and a noise of standard deviation 0.005. For both datasets, we follow the train/test splits, the training procedure and the evaluation protocol in [7]. Finally, we test the ShapeNet trained models on real scan datasets to evaluate generalization. We use the FAUST [6] dataset. It consists of real scans of 10 human body identities in 10 different poses. We sample 3000 points from the scans as inputs. We use the provided mesh registrations to compute IoU. We also use the ScanNet v2 [19] dataset. It contains 1513 rooms captured with an RGB-D camera. We sample $10\mathrm{k}$ points as inputs.
ShapeNet [12] 用于评估物体级重建效果。该数据集包含13个不同物体类别的多种实例。我们分别使用300和3000个点的输入点云,以及标准差为 $\sigma=0.005$ 和 $\sigma=0.025$ 的噪声进行训练/测试。Synthetic Rooms [54] 用于评估场景级重建效果,其数值与定性结果见补充材料。该数据集由5000个包含地板、墙壁及ShapeNet物体的场景构成,我们使用 $10\mathrm{k}$ 个点的输入和标准差0.005的噪声进行训练/测试。对于这两个数据集,我们遵循文献[7]中的训练/测试划分、训练流程及评估协议。最后,我们在真实扫描数据集上测试ShapeNet训练模型以评估泛化能力:使用FAUST [6] 数据集(包含10个人体身份在10种不同姿态下的真实扫描),从扫描数据中采样3000个点作为输入,并利用提供的网格配准计算IoU;同时使用ScanNet v2 [19] 数据集(包含RGB-D相机采集的1513个房间场景),采样 $10\mathrm{k}$ 个点作为输入。
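The input preparation described above can be sketched as follows. This assumes isotropic Gaussian noise added independently to each coordinate and uniform subsampling without replacement; the function name `make_noisy_input` and the seeding scheme are hypothetical, and the actual protocol is the one of [7].

```python
import numpy as np

def make_noisy_input(surface_points, num_points=3000, sigma=0.005, seed=0):
    """Subsample a surface point set and perturb it with Gaussian noise.

    surface_points: (M, 3) array of clean surface samples, with M >= num_points
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(surface_points), size=num_points, replace=False)
    noise = rng.normal(0.0, sigma, size=(num_points, 3))
    return surface_points[idx] + noise

# Sparse, heavily corrupted setting: 300 points with sigma = 0.025.
clean = np.random.default_rng(1).uniform(-0.5, 0.5, size=(10000, 3))
noisy = make_noisy_input(clean, num_points=300, sigma=0.025)
```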
Table 1. ShapeNet reconstruction from 300 noisy points, with standard deviations $(\sigma)$ of 0.005 and 0.025.
表 1: 从 300 个噪声点重建 ShapeNet 的结果,标准偏差 $(\sigma)$ 分别为 0.005 和 0.025。
| | $\sigma=0.005$ | | | | $\sigma=0.025$ | | | |
|---|---|---|---|---|---|---|---|---|
| Method | CD1↓ | CD2↓ | NC↑ | FS↑ | CD1↓ | CD2↓ | NC↑ | FS↑ |
| SAP [55] | 0.58 | 1.04 | 0.89 | 0.87 | 1.02 | 3.50 | 0.85 | 0.69 |
| POCO [7] | 0.59 | 1.54 | 0.88 | 0.88 | 1.10 | 4.40 | 0.83 | 0.66 |
| Ours w/o den. | 0.55 | 0.92 | 0.89 | 0.89 | 1.05 | 3.56 | 0.84 | 0.66 |
| Ours | 0.54 | 0.94 | 0.90 | 0.89 | 1.00 | 3.49 | 0.85 | 0.69 |
Table 2. ShapeNet reconstruction from 3000 noisy points, with standard deviations $(\sigma)$ of 0.005 and 0.025.
表 2: 从3000个噪声点重建ShapeNet的结果,标准差 $(\sigma)$ 分别为0.005和0.025。
| | $\sigma=0.005$ | | | | $\sigma=0.025$ | | | |
|---|---|---|---|---|---|---|---|---|
| Method | CD1↓ | CD2↓ | NC↑ | FS↑ | CD1↓ | CD2↓ | NC↑ | FS↑ |
| SPR [33] | 2.98 | - | 0.77 | 0.61 | 4.99 | - | 0.60 | 0.32 |
| 3D-R2N2 [18] | 1.72 | - | 0.71 | 0.40 | 1.72 | - | 0.71 | 0.42 |
| AtlasNet [27] | 0.93 | - | 0.85 | 0.71 | 1.17 | - | 0.82 | 0.53 |
| ConvONet [54] | 0.44 | - | 0.94 | 0.94 | 0.66 | - | 0.91 | 0.85 |
| SAP [55] | 0.34 | 0.60 | 0.94 | 0.97 | 0.55 | 0.95 | 0.91 | 0.89 |
| POCO [7] | 0.30 | 0.31 | 0.95 | 0.98 | 0.58 | 1.24 | 0.90 | 0.88 |
| Ours w/o den. | 0.32 | 0.25 | 0.94 | 0.98 | 0.58 | 0.98 | 0.90 | 0.87 |
| Ours | 0.29 | 0.15 | 0.95 | 0.99 | 0.55 | 0.93 | 0.91 | 0.89 |
4.4. Object level reconstruction
4.4. 物体级别重建
Table 1 shows numerical evaluations of ShapeNet [12] reconstructions from 300 noisy input points, with noise of both 0.005 and 0.025 standard deviation. Table 2 shows results for 3000 points with the same noise levels. We show numbers for our method (Ours) and our method without denoising (Ours w/o den.). We report numbers for [55] (SAP) and [7] (POCO) using their available models when applicable, and train the rest when needed following their official implementations. We compile numbers for [54] (ConvONet), [33] (SPR), [18] (3D-R2N2) and [27] (AtlasNet) from [7, 55]. Figures 4 and 5 show visual comparisons of reconstructions from 3000 points with noises of 0.005 and 0.025 respectively.
表 1 展示了从 300 个噪声输入点 (标准差分别为 0.005 和 0.025) 对 ShapeNet [12] 重建的数值评估结果。表 2 展示了在相同噪声水平下对 3000 个点的重建结果。我们列出了本方法 (Ours) 和未去噪版本 (Ours w/o den.) 的数值结果。对于 [55] (SAP) 和 [7] (POCO),我们使用其可用模型报告数据 (适用时),其余情况按其官方实现进行训练。从 [7, 55] 中我们汇总了 [54] (ConvONet)、[33] (SPR)、[18] (3D-R2N2)、[27] (AtlasNet) 的数据。图 4 和图 5 分别展示了在 0.005 和 0.025 噪声水平下对 3000 个点进行重建的视觉对比结果。
Our method matches or outperforms the state of the art across all metrics, and the gap increases overall with input sparsity and noise variance. In particular, denoising improves our performance w.r.t. our baseline, especially in the most extreme case (sparse inputs and high-variance noise). We note that our denoising-free baseline already achieves results mostly on par with the state-of-the-art method POCO, while using only about half the number of parameters (Ours 6.5M, POCO 12.5M). Poisson reconstruction (SPR) is outperformed by learning-based methods. Qualitative results show that our method recovers fine structures and details with more fidelity (Figure 4), and that it can perform robustly even under heavy noise (Figure 5), thanks to the combination of local and global reasoning at the decoder, and the denoising of the decoder's support.
我们的方法在所有指标上均优于当前最优技术,且随着输入稀疏性和噪声方差的增加,优势差距整体扩大。去噪技术的应用尤其提升了我们在极端情况(稀疏输入和大方差噪声)下相对于基线的表现。值得注意的是,即使不使用去噪,我们的基线方法在参数量仅为POCO一半(本方法6.5M参数,POCO 12.5M参数)的情况下,已基本达到与当前最优方法POCO相当的结果。基于学习的方法超越了泊松重建(SPR)技术。定性结果表明:本方法能更高保真度地恢复精细结构与细节(图4),并借助解码器局部与全局推理的结合以及支撑域去噪技术,即使在强噪声下仍能保持鲁棒性(图5)。
Table 3. Comparison to test-time fitting methods on class Tables of ShapeNet. Reconstruction from 3000 noisy (0.005 standard deviation) points.
表 3: ShapeNet类别表格上与测试时拟合方法的对比。从3000个带噪声(标准差0.005)的点进行重建。
| Method | CD1↓ | NC↑ | FS↑ |
|---|---|---|---|
| SA-ConvONet [70] | 0.56 | 0.93 | 0.92 |
| Neural-Pull [44] | 0.71 | 0.85 | 0.83 |
| Ours | 0.30 | 0.96 | 0.99 |
Table 4. FAUST reconstruction from 3000 points.
表 4: 从3000个点重建FAUST的结果
| Method | CD1↓ | CD2↓ | NC↑ | FS↑ |
|---|---|---|---|---|
| SAP [55] | 0.29 | 0.13 | 0.95 | 0.98 |
| POCO [7] | 0.25 | 0.09 | 0.96 | 0.99 |
| Ours | 0.23 | 0.08 | 0.96 | 0.99 |
Although they offer mostly superior performance, we note that intrinsic models such as POCO and ours are less parameter efficient than extrinsic ones (e.g. ConvONet and SAP contain about 2M parameters). While SAP offers a good trade-off between model size and performance, it is limited to small scenes by design. We note additionally that our reconstruction time at resolution $128^{3}$ is 538 ms (POCO: 1300 ms, SAP: 64 ms, ConvONet: 381 ms).
虽然它们通常提供更优越的性能,但我们注意到 POCO 和我们的模型等内在模型在参数效率上不如外在模型(例如 ConvONet 和 SAP 包含约 200 万参数)。尽管 SAP 在模型大小与性能之间取得了良好平衡,但其设计上仅限于小场景。此外,我们观察到在 $\mathrm{128^{3}}$ 分辨率下的重建时间为 $538\mathrm{ms}$(POCO: 1300ms, SAP: 64ms, ConvONet: 381ms)。
In Table 3, we compare our feed-forward generalizable method to deep learning optimization-based approaches using their official implementations, specifically fitting an MLP to the input point cloud as in Neural-Pull [44], and fine-tuning ConvONet as in SA-ConvONet [70]. As optimization methods are time consuming, we follow here the split of class Tables in [81]. Our method outperforms these competing methods.
在表3中,我们将我们的前馈泛化方法与基于深度学习优化的方法进行了比较,这些方法使用了它们的官方实现,具体来说,就像Neural-Pull [44]中那样,将一个多层感知机(MLP)拟合到输入点云上,并对ConvONet: SA-ConvONet [70]进行微调。由于优化方法耗时较长,我们在此遵循[81]中的Class Tables划分。我们的方法优于这些竞争方法。
4.5. Object to Real Articulated Shape Generalization
4.5. 物体到真实铰接形状的泛化
To assess the capacity of our model to generalize outside the training shape distribution (ShapeNet), as well as to reconstruct from real scans, we task the ShapeNet trained model with reconstructing FAUST models from 3000 points. Table 4 shows numerical comparisons using the real scans, with qualitative comparisons displayed in Figure 6. We report numbers for POCO and SAP using their available ShapeNet models. While all presented methods generalize relatively successfully despite not being trained on any articulated shapes, we outperform state-of-the-art methods POCO and SAP on most metrics.
为评估模型在训练形状分布(ShapeNet)之外的泛化能力以及从真实扫描中重建的能力,我们让基于ShapeNet训练的模型从3000个点重建FAUST模型。表4展示了使用真实扫描的数值对比,定性对比结果如图6所示。我们报告了POCO和SAP使用其现有ShapeNet模型的数据。尽管所有方法都未在铰接形状上进行训练却取得了相对成功的泛化效果,但我们的方法在多数指标上超越了当前最优方法 POCO 和 SAP。

Figure 6. FAUST reconstructions from 3000 real scan points. Models are trained on ShapeNet, and they were not trained on any human or articulated shapes.
图 6: 基于真实3000个扫描点的FAUST重建结果。所有模型均在ShapeNet数据集上训练,且未针对人体或关节形状进行专门训练。
Table 5. ScanNet reconstructions from 10k points.
表 5: 基于10k点的ScanNet重建结果
| Method | CD1↓ | FS↑ |
|---|---|---|
| ConvONet [54] | 0.77 | 0.89 |
| SAP [55] | 1.11 | 0.71 |
| POCO [7] | 1.03 | 0.72 |
| Ours w/o den. | 0.69 | 0.86 |
| Ours | 0.58 | 0.91 |
4.6. Object to Real Scenes Generalization
4.6. 物体到真实场景的泛化
We extend the generalization experiments to a more challenging scenario. We use our, SAP's and POCO's ShapeNet models trained with 3k-sized inputs to reconstruct the real scans of ScanNet [19] using 10k-sized inputs. We report results for ConvONet from their paper as a reference point. Table 5 shows numerical results where we outperform the competition. Figure 7 shows a qualitative comparison. We note that as the ConvONet model we show was trained on the Synthetic Rooms dataset using 10k inputs, it tends to produce flatter, extrapolated planar surfaces. Although trained only on objects, our model displays satisfactory scene generalization. We also notice that it tends to be more faithful to the input. These results raise the question of which trade-off between extrapolation and input fidelity is desired from reconstruction models, which we would like to explore in future work.
我们将泛化实验扩展到一个更具挑战性的场景。使用我们、SAP和POCO基于3k规模输入训练的ShapeNet模型,对ScanNet [19]的10k规模真实扫描数据进行重建。作为参考基准,我们直接引用了ConvONet论文中的结果。表5显示了我们优于竞争对手的量化结果,图7则呈现了定性对比。值得注意的是,由于展示的ConvONet模型是在10k输入的合成房间数据集上训练的,其输出往往会产生更多平面化和外推的平坦表面。尽管我们的模型仅在物体数据上训练,却展现出令人满意的场景泛化能力。同时我们观察到,该模型对输入数据的保真度更高。这些结果引发了关于重建模型在外推能力与输入保真度之间权衡的思考,这将是我们未来工作的探索方向。

Figure 7. ScanNet reconstructions from real 10k depth sensor points.

图 7: 基于真实1万深度传感器点的ScanNet重建结果。
Table 6. Ablation on class Tables of ShapeNet. Reconstruction from 3000 noisy (0.005 std. dev.) points.
表 6. ShapeNet类别表格的消融实验。从3000个含噪声(标准差0.005)点重建。
| Method | CD1↓ | CD2↓ | NC↑ | FS↑ |
|---|---|---|---|---|
| POCO [7] | 0.36 | 0.23 | 0.95 | 0.97 |
| Ours w/o den. w/o glob. | 0.34 | 0.23 | 0.96 | 0.98 |
| Ours w/o den. | 0.33 | 0.19 | 0.96 | 0.98 |
| Ours | 0.31 | 0.14 | 0.97 | 0.99 |
4.7. Ablation
4.7. 消融实验
We analyze quantitatively the impact of the components of our method on class Tables of ShapeNet. Results are summarized in Table 6. Starting from our full method (Ours), we ablate denoising (Ours w/o den.), then the global decoding (Ours w/o den. w/o glob.). We also show numbers for POCO. The global decoder helps make the method more robust, especially against outliers, as witnessed by the improvement in L2 Chamfer distance when the global mixing is introduced. Denoising improves performance consistently across all metrics, as also confirmed in the previous experiments.
我们在此节点对ShapeNet类表格的组件影响进行了定量分析,结果总结在表6中。从完整方法(Ours)开始,我们依次消融了去噪模块(Ours w/o den.)和全局解码器(Ours w/o den. w/o glob.),同时列出了POCO的数值对比。全局解码器能显著增强方法的鲁棒性,特别是对异常值的处理能力——引入全局混合后L2倒角距离指标的提升印证了这一点。去噪模块则在所有指标上持续提升性能,这与先前实验的结论一致。
5. Conclusion
5. 结论
We showed in this work that coupling denoising with reconstruction is beneficial for intrinsic implicit feed-forward reconstruction models. We also showed that for these types of models, a fully MLP-based architecture can produce state-of-the-art results with fewer parameters than a convolutional counterpart. This result questions the utility of convolution-based locality mechanisms in this context.
我们在本研究中表明,将去噪与重建耦合对内在隐式前馈重建模型具有增益效果。同时发现对于此类模型,全MLP架构能以更少参数实现媲美卷积架构的顶尖性能,这一结果对卷积局部性机制在此场景下的必要性提出了质疑。
