[论文翻译]Mixing-Denoising 通用化占据网络


原文地址:https://arxiv.org/pdf/2311.12125v1


Mixing-Denoising Generalizable Occupancy Networks

Mixing-Denoising 通用化占据网络

Abstract

摘要

While current state-of-the-art generalizable implicit neural shape models [7, 54] rely on the inductive bias of convolutions, it is still not entirely clear how properties emerging from such biases are compatible with the task of 3D reconstruction from point cloud. We explore an alternative approach to generalizability in this context. We relax the intrinsic model bias (i.e. using MLPs to encode local features as opposed to convolutions) and constrain the hypothesis space instead with an auxiliary regularization related to the reconstruction task, i.e. denoising. The resulting model is the first only-MLP locally conditioned implicit shape reconstruction network from point cloud with fast feed-forward inference. Point cloud borne features and denoising offsets are predicted from an exclusively MLP-made network in a single forward pass. A decoder predicts occupancy probabilities for queries anywhere in space by pooling nearby features from the point cloud order-invariantly, guided by denoised relative positional encoding. We outperform the state-of-the-art convolutional method [7] while using half the number of model parameters.

虽然当前最先进的通用隐式神经形状模型[7,54]依赖于卷积的归纳偏置,但尚不完全清楚这些偏置所产生的属性如何与点云三维重建任务兼容。我们在此背景下探索了一种替代性的通用化方法:通过放松内在模型偏置(即使用MLP而非卷积来编码局部特征),转而采用与重建任务(即去噪)相关的辅助正则化来约束假设空间。由此产生的模型成为首个纯MLP架构、局部条件化的点云隐式形状重建网络,具备快速前馈推理能力。该网络仅通过MLP结构在单次前向传播中预测点云特征和去噪偏移量,并通过去噪相对位置编码的引导,以点云顺序无关的方式聚合邻近特征,最终由解码器预测空间中任意位置查询的占据概率。在使用半数模型参数的情况下,我们的方法超越了当前最先进的卷积方法[7]。

1. Introduction

1. 引言

One of the most sought-after goals in modern-day computer vision and machine learning is enabling machines to understand and navigate 3D given limited input. This faculty is manifested in several downstream vision and graphics tasks, such as shape reconstruction from noisy and relatively sparse point clouds. Recovering full shape from point clouds is all the more important a problem on account of the ubiquity of this light, albeit incomplete, 3D representation, whether it is acquired from the nowadays increasingly accessible scanning devices, or obtained from multi-view vision algorithms such as Structure from Motion or Multi-View Stereo (e.g. [63, 64]). While classical optimization-based approaches such as Poisson Reconstruction [33] or moving least squares [28] can mostly deliver successfully from dense, clean point sets and normal estimates, the more recent deep-learning-based alternatives offer faster and more robust prediction, especially for noisy and sparse inputs, without requiring normals.

现代计算机视觉和机器学习领域最受追捧的目标之一,是让机器在有限输入条件下理解并处理三维空间。这种能力体现在多个下游视觉与图形任务中,例如从噪声且相对稀疏的点云中进行形状重建。鉴于这种轻量但不完整的三维表示形式无处不在——无论是通过如今日益普及的扫描设备获取,还是通过运动恢复结构(Structure From Motion)或多视图立体(MultiView Stereo)等算法生成(如[63, 64])——从点云恢复完整形状成为一个尤为重要的问题。虽然基于经典优化的方法(如泊松重建[33]或移动最小二乘法[28])大多能成功处理密集且干净的点集及法线估计,但基于深度学习的现代方案提供了更快速、更鲁棒的预测,尤其适用于含噪声的稀疏输入,且无需法线信息。


Figure 1. Our method couples 3D implicit reconstruction with explicit denoising of point clouds. We show here reconstruction examples from 3000 noisy points (Gaussian noise of std. dev. 0.025). We show the input, our intermediate denoised point cloud, and our predicted geometry.

图 1: 我们的方法将3D隐式重建与点云显式去噪相结合。此处展示了从3000个含噪点(标准差0.025的高斯噪声)进行重建的示例,包括输入数据、中间去噪点云及预测几何形状。

For learning-based reconstruction from point cloud, feed-forward (optimization-free) generalizable models are important to the community by virtue of their high performance and fast inference. State-of-the-art ones are based on implicit neural shape representations. Typically [15, 54], a deep convolutional network builds features from the input point cloud, then an implicit decoder maps the feature of a query point to its occupancy or signed distance w.r.t. the target shape. While the ConvNet predicts explicit extrinsic features in [15, 54, 55], recent work [7] suggests that learning features intrinsically, i.e. only at the point cloud, yields superior results. In this case, features for query points are pooled from the nearest points of the input point cloud.

基于点云的学习重建方法中,免优化(optimization-free)的前馈通用模型因其高性能和快速推理能力备受学界关注。当前最优方法基于隐式神经形状表示:典型方案[15, 54]通过深度卷积网络从输入点云提取特征,再由隐式解码器将查询点的特征映射为相对于目标形状的占用值或有向距离。尽管[15, 54, 55]中的卷积网络预测显式外部特征,但近期研究[7]表明,仅从点云本身学习内在特征能获得更优结果——此时查询点特征通过池化输入点云的最近邻点获得。

While the aforementioned existing work advocates using convolutions to achieve generalizability for this problem, we ask ourselves: what if we do not constrain our hypothesis space in such an ad hoc way? That is, what if we use the most inductive-bias-free model possible and rely on supervision and, eventually, regularizations; would we succeed in achieving generalization while maintaining fast inference? This line of thought is inspired by the recent success of MLP-only models in vision [72] and their application to point cloud processing [17]. Their idea consists in applying MLPs (mixing) to the data feature dimension and the spatial dimension separately and repeatedly. Point clouds being unordered and irregular, the spatial (token) mixing is achieved with an order-invariant (softmax-based) pooling on a limited support from the point cloud.

虽然上述现有研究主张使用卷积来实现该问题的泛化能力,但我们不禁自问:如果不以这种临时方式约束假设空间会怎样?换言之,若采用尽可能无归纳偏置的模型,仅依赖监督学习和最终的正则化手段,能否在保持快速推理的同时实现泛化?这一思路受到近期纯MLP模型在视觉领域[72]及其点云处理应用[17]成功的启发。他们的核心方法是对数据特征维度和空间维度分别且重复地应用MLP(混合操作)。由于点云具有无序和非规则特性,其空间(Token)混合是通过基于点云有限支撑集的顺序不变(Softmax)池化实现的。

Based on all the above, we propose a new encoder-decoder model for implicit reconstruction from point cloud. It processes the input point cloud intrinsically (as in [7]) but uses solely MLPs from start to end (unlike [7]). This allows us to outperform the state-of-the-art method [7] whilst using only half the number of parameters (6.5M for us vs. 12.5M for [7]). Not only does this show that convolutional inductive biases are not a necessity in this context; we even achieve a more parameter-efficient intrinsic point cloud reconstruction model without them. Additionally, we show that coupling our implicit reconstruction learning with a related regularization task, in the form of weakly supervised point cloud denoising, can yield even better generalization. In fact, the performance gap between our method trained with and without denoising increases fourfold in Chamfer distance when going to a setup requiring more generalization ability (see intra-ShapeNet generalization in Table 2 vs. the more challenging ShapeNet-to-ScanNet generalization in Table 5).

基于以上分析,我们提出了一种新的点云隐式重建编码器-解码器模型。该模型本质化处理输入点云(如[7]所述),但全程仅使用多层感知机(MLP)(与[7]不同)。这使得我们在仅使用一半参数量(本模型6.5M vs [7]的12.5M)的情况下,性能超越了当前最先进方法[7]。这不仅证明卷积归纳偏置在此场景中并非必需,我们还实现了更高效的参数化点云本质重建模型。此外,我们发现通过将隐式重建学习与弱监督点云去噪这一相关正则化任务相结合,能进一步提升泛化能力。事实上,在需要更强泛化能力的场景下(对比表2中ShapeNet内部泛化与表5中更具挑战性的ShapeNet到ScanNet泛化),采用去噪训练的模型在倒角距离指标上比未采用时性能差距扩大了四倍。

Our network builds on the dense task prediction model PointMixer [17] to predict per-point features. However, to infer occupancies for extrinsic query points (not only on the point cloud), we introduce the new "Extra-set" mixing layer, in concordance with PointMixer terminology. The "Extra-set" mixing layer pools features from nearby points of the input point cloud. As this pooling can be guided with positional encoding of the query relative to these support points, we additionally task our PointMixer with denoising the point cloud in parallel, and we use the denoised point cloud to compute more accurate positional encodings (unlike [7]). Furthermore, differently from [7], we incorporate a global decoder using coarse features on a downsampled point cloud as pooling support, in order to robustify the method against local outliers.

我们的网络基于密集任务预测模型PointMixer[17]来预测逐点特征。但为了推断外部查询点(不仅限于点云)的占用情况,我们引入了与PointMixer术语一致的"Extra-set"混合层。该混合层会从输入点云的邻近点汇集特征。由于这种汇集可以通过查询点相对于这些支撑点的位置编码来引导,我们额外让PointMixer并行执行点云去噪任务,并使用去噪后的点云来计算更准确的位置编码(与[7]不同)。此外,不同于[7],我们采用了基于下采样点云粗粒度特征的全局解码器作为汇集支撑,以提高方法对局部异常值的鲁棒性。

We note that existing MLP-only feed-forward generalizable methods lack a local feature aggregation mechanism in their encoder architectures (PointNet), which is crucial for performance without sacrificing inference speed. [48] used a global feature, causing under-performance. [22] enforces locality through query-point-dependent local patch inputs, hence requiring as many encoder forward passes as there are query points. This naturally results in significantly increased inference times, surpassing even, e.g., some auto-decoding, optimization-based methods. Conversely, being equipped with point mixing locality, our MLP-only model's encoder requires only a single pass over the input point cloud. It consequently offers about 3 orders of magnitude faster reconstruction (within $500\mathrm{ms}$ at resolution $128^{3}$) compared to [22], in line with competitors endowed with convolutional locality ([7, 54]).

我们注意到,现有的纯MLP前馈泛化方法在其编码器架构(PointNet)中缺乏局部特征聚合机制,而这对于在不牺牲推理速度的前提下提升性能至关重要。[48]采用全局特征导致性能不足。[22]通过依赖于查询点的局部块输入来强制局部性,因此需要执行与查询点数量相同的编码器前向传递。这自然会导致推理时间显著增加,甚至超过某些基于自动解码优化的方法。相比之下,我们的纯MLP模型编码器配备了点混合局部性,仅需对输入点云执行单次前向传递。因此在128³分辨率下,其重建速度比[22]快约3个数量级(在500ms内),与具备卷积局部性的竞争对手([7,54])相当。

We outperform existing comparable methods on standard benchmarks for object- and scene-level reconstruction, and in generalization to real scan data (Section 4). The performance gap is largest in the sparsest and noisiest input setup (Table 1), demonstrating our resilience to scarcity and corruption. Our reconstructions are also more robust, as witnessed by our superior L2 Chamfer errors across all benchmarks. We evaluate the benefits of each of our contributions in our ablation section (4.7).

我们在物体和场景级别重建的标准基准测试中超越了现有可比方法,并在泛化到真实扫描数据方面表现优异(第4节)。性能差距在最稀疏和噪声最多的输入设置中最为显著(表1),这证明了我们对稀缺和损坏的鲁棒性。我们的重建结果也更加稳健,这体现在我们在所有基准测试中更优的L2 Chamfer误差上。我们在消融实验部分(4.7)评估了各项贡献的收益。

Our contributions can be summarized as follows:

  • A convolution-free feed-forward intrinsic occupancy network from point cloud attaining the new state-of-the-art, while being twice as parameter-efficient as its closest intrinsic convolutional counterpart ([7]).
  • Joint learning of implicit reconstruction from point cloud with explicit point cloud denoising, for the first time to the best of our knowledge.
  • Combining local and global pooling-based decoding for deep intrinsic (e.g. [7]) occupancy regression.

我们的贡献可概括如下:

  • 提出了一种无卷积的前馈式点云本征占用网络 (intrinsic occupancy network),其性能达到新 state-of-the-art,同时参数量仅为最接近的本征卷积方法 ([7]) 的一半。
  • 据我们所知,首次实现了点云隐式重建与显式点云去噪的联合学习。
  • 结合基于局部和全局池化的解码策略,用于深度本征 (如 [7]) 占用回归。

2. Related work

2. 相关工作

Shape Representations in Deep Learning Shapes can be represented in deep learning either intrinsically or extrinsically. Intrinsic representations discretize the shape itself. When done explicitly using e.g. tetrahedral or polygonal meshes [32, 76], or point clouds [23], this constrains the output topology, thus limiting the variability of outputs. Among other forms of intrinsic representations, 2D patches [21, 27, 80] can prompt discontinuities, whilst the simple nature of shape primitives such as cuboids [74, 92], planes [39] and Gaussians [24] limits their expressivity. Differently, extrinsic shape representations model the space containing the scene/object. Voxel grids [84, 85] are the most popular one, being the extension of 2D pixels to the 3D domain. Their capacity is limited, though, by the cubic memory cost of grid resolution. Sparse representations like octrees [60, 71, 78] can alleviate this issue to some extent.

深度学习中的形状表示
形状在深度学习中可以通过内在或外在方式表示。内在表示对形状本身进行离散化处理。当采用显式方法(如四面体或多边形网格 [32, 76] 或点云 [23])时,这会限制输出拓扑结构,从而降低输出多样性。其他形式的内在表示中,2D面片 [21, 27, 80] 可能导致不连续性,而长方体 [74, 92]、平面 [39] 和高斯体 [24] 等简单几何基元的表达力有限。

与之不同,外在形状表示对包含场景/物体的空间进行建模。体素网格 [84, 85] 是最流行的方案,它是二维像素向三维领域的延伸。但其表达能力受立方分辨率内存成本的制约。八叉树 [60, 71, 78] 等稀疏表示可在一定程度上缓解此问题。

Implicit Neural Shape Representations Implicit neural representations (INRs) stood out as a major new medium for modelling shape and radiance (e.g. [11, 30, 49, 77, 86]) extrinsically. They overcome many of the limitations of the aforementioned classical representations thanks to their ability to represent shapes with arbitrary topologies at virtually infinite resolution. They are usually parameterized with MLPs mapping spatial locations or features to e.g. occupancy [48], signed [53] or unsigned [16, 91] distances relative to the target shape. The level set of the field inferred by these MLPs can be rendered through ray marching [29], or tessellated into an explicit shape using e.g. Marching Cubes [43]. A noteworthy branch of work builds hybrid implicit/explicit representations [14, 20, 52, 87] based mostly on differentiable space partitioning.

隐式神经形状表示
隐式神经表示 (INRs) 作为一种建模形状和辐射度的全新媒介脱颖而出 (例如 [11, 30, 49, 77, 86])。它们克服了前述传统表示的诸多限制,得益于其能以近乎无限的分辨率表示任意拓扑结构的形状。这些表示通常通过多层感知机 (MLPs) 将空间位置或特征映射为诸如占据率 [48]、有符号 [53] 或无符号 [16, 91] 距离等相对于目标形状的量。从这些 MLPs 推断出的场等值面可通过光线步进 [29] 进行渲染,或使用诸如行进立方体 [43] 等方法镶嵌为显式形状。一个值得注意的研究分支主要基于可微分空间划分,构建了混合隐式/显式表示 [14, 20, 52, 87]。

Generalizing Implicit Neural Shape Representations In order to represent collections of shapes, implicit neural models require conditioning mechanisms. These include latent code concatenation, batch normalization, hypernetworks [65, 67, 68, 79] and gradient-based meta-learning [50, 66]. Concatenation based conditioning was first implemented using single global latent codes [13, 48, 53], and further improved with the use of local features [15, 22, 25, 31, 35, 54, 69, 73].

泛化隐式神经形状表示
为了表示形状集合,隐式神经模型需要条件机制。这些机制包括潜在编码拼接、批量归一化、超网络 [65, 67, 68, 79] 以及基于梯度的元学习 [50, 66]。基于拼接的条件机制最初通过单一全局潜在编码实现 [13, 48, 53],并随着局部特征的使用进一步改进 [15, 22, 25, 31, 35, 54, 69, 73]。

Shape Reconstruction from Point Cloud Classical approaches include combinatorial ones, where the shape is defined through an input point cloud based space partitioning, e.g. via alpha shapes [5], Voronoi diagrams [1] or triangulation [9, 41, 59]. On the other hand, the input samples can be used to define an implicit function whose zero level set represents the target shape, using global smoothing priors [36, 81, 82] such as radial basis functions [8] and Gaussian kernel fitting [62], local smoothing priors such as moving least squares [28, 34, 42, 47], or by solving a boundary-conditioned Poisson equation [33]. The recent literature proposes to parameterize these implicit functions with deep neural networks and learn their parameters with gradient descent, either in a supervised or unsupervised manner.

从点云重建形状的经典方法包括组合方法,其中形状通过基于输入点云的空间划分来定义,例如 alpha shapes [5]、Voronoi 图 [1] 或三角剖分 [9, 41, 59]。另一方面,输入样本可用于定义隐函数,其零水平集表示目标形状,使用全局平滑先验 [36, 81, 82],例如径向基函数 [8] 和高斯核拟合 [62],局部平滑先验如移动最小二乘法 [28, 34, 42, 47],或通过求解边界条件的泊松方程 [33]。近期文献提出用深度神经网络参数化这些隐函数,并通过梯度下降以监督或无监督方式学习其参数。

Unsupervised Implicit Neural Reconstruction A neural network is typically fitted to the input point cloud without extra information. Regularizations can improve convergence, such as the spatial gradient constraint based on the Eikonal equation introduced by Gropp et al. [26], a spatial divergence constraint as in [4], or Lipschitz regularization of the network [40]. Atzmon et al. learn an SDF from unsigned distances [2], and further supervise the spatial gradient of the function with normals [3]. Ma et al. [44] express the nearest point on the surface as a function of the neural signed distance and its gradient. They also leverage self-supervised local priors to deal with very sparse inputs [45] and improve generalization [46]. All of the aforementioned work benefits from efficient gradient computation through back-propagation in the neural network. Periodic activations were introduced in [67]. Lipman [38] learns a function that converges to occupancy while its log transform converges to a distance function. [81] learns infinitely wide shallow MLPs as random feature kernels.

无监督隐式神经重建
通常,神经网络会在没有额外信息的情况下拟合输入点云。正则化方法可以改善收敛性,例如基于Gropp等人[26]提出的Eikonal方程的空间梯度约束、[4]中的空间散度约束,以及对网络进行Lipschitz正则化[40]。Atzmon等人从无符号距离学习SDF[2],并进一步用法向量监督函数的空间梯度[3]。Ma等人[44]将曲面上最近点表示为神经符号距离及其梯度的函数。他们还利用自监督局部先验处理极稀疏输入[45]并提升泛化能力[46]。上述工作均通过神经网络的反向传播实现高效梯度计算。[67]引入了周期性激活函数。Lipman[38]学习一个收敛到占据率的函数,同时其对数变换收敛于距离函数。[81]将无限宽浅层MLP学习为随机特征核。

Supervised Implicit Neural Reconstruction Supervised methods assume a labeled training data corpus, commonly in the form of dense samples with ground truth shape information. Auto-decoding methods [10, 31, 35, 53, 73] require test-time optimization to be fitted to a new point cloud, which can take up to several seconds. Encoder-decoder based methods enable fast feed-forward inference. Introduced first in this respect, pooling-based set encoders [13, 25, 48] such as PointNet [56] have been shown to underfit the context. Convolutional encoders yield state-of-the-art performances. They use local features either defined in explicit volumes and planes [15, 37, 54, 55] or solely at the input points [7]. Ouasfi and Boukhayma [51] concurrently proposed to robustify the generalization of these models through transfer learning to a kernel ridge regression whose hyperparameters are fitted to the shape. Peng et al. [55] proposed a differentiable Poisson solving layer that converts predicted normals into an indicator function grid efficiently. However, it is limited to small scenes due to the cubic memory requirement in grid resolution.

监督式隐式神经重建
监督方法假设存在一个带标注的训练数据集,通常以带有真实形状信息的密集样本形式存在。自动解码方法 [10, 31, 35, 53, 73] 需要进行测试时优化以适应新点云,耗时可能长达数秒。基于编码器-解码器的方法能实现快速前馈推理。该领域首次提出的基于池化的集合编码器 [13, 25, 48](如 PointNet [56])已被证明存在上下文欠拟合问题。卷积编码器则展现出最先进的性能,它们使用的局部特征要么定义在显式体积和平面中 [15, 37, 54, 55],要么仅定义在输入点处 [7]。Ouasfi 和 Boukhayma [51] 同时提出通过迁移学习增强这些模型的泛化能力,将其应用于超参数适配形状的核岭回归。Peng 等人 [55] 提出了一种可微分泊松求解层,能高效地将预测法线转换为指示函数网格,但由于网格分辨率需要立方级内存,该方法仅适用于小场景。

3. Method

3. 方法

Given a noisy input point cloud $\mathbf{X}\in\mathbb{R}^{N_{p}\times3}$, our objective is to recover a shape surface $\mathcal{S}$ that best explains this observation, i.e. the input point cloud elements being noisy samples from $\mathcal{S}$.

给定一个带有噪声的输入点云 $\mathbf{X}\in\mathbb{R}^{N_{p}\times3}$,我们的目标是恢复一个最能解释该观测结果的形状表面 $\mathcal{S}$,即输入点云元素是来自 $\mathcal{S}$ 的噪声样本。

To achieve this, we train a deep implicit neural network $f_{\theta}$ to predict occupancy values relative to a target shape $\mathcal{S}$ at any queried Euclidean space location $x\in\mathbb{R}^{3}$, given the input point cloud $\mathbf{X}$, i.e. $f_{\theta}(\mathbf{X},x)=1$ if $x$ is inside the shape, and 0 otherwise. The inferred shape $\hat{\mathcal{S}}$ can then be obtained as a level set of the occupancy field inferred using $f_{\theta}$:

为此,我们训练了一个深度隐式神经网络 $f_{\theta}$,用于在给定输入点云 $\mathbf{X}$ 的情况下,预测任意查询欧几里得空间位置 $x\in\mathbb{R}^{3}$ 相对于目标形状 $\mathcal{S}$ 的占用值,即 $f_{\theta}(\mathbf{X},x)=1$ 表示 $x$ 位于形状内部,否则为 0。推断出的形状 $\hat{\mathcal{S}}$ 可作为由 $f_{\theta}$ 推断的占用场的等值面获得:

$$
\hat{\mathcal{S}}=\left\{x\in\mathbb{R}^{3}\mid f_{\theta}(\mathbf{X},x)=0.5\right\}.
$$

$$
\hat{\mathcal{S}}=\left\{x\in\mathbb{R}^{3}\mid f_{\theta}(\mathbf{X},x)=0.5\right\}.
$$

In practice, an explicit triangle mesh for $\hat{\mathcal{S}}$ is extracted using the Marching Cubes [43] algorithm.

在实践中,使用 Marching Cubes [43] 算法提取 $\hat{\mathcal{S}}$ 的显式三角网格。
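To make this extraction step concrete, here is a minimal sketch of the grid-evaluation plus Marching Cubes procedure in Python. It assumes a hypothetical trained callable `f_theta(X, queries)` returning occupancy probabilities in $[0,1]$ and an input cloud normalized to the unit cube; the resolution and chunking are illustrative choices, not the paper's configuration.

```python
import numpy as np
from skimage import measure  # provides marching_cubes

def extract_mesh(f_theta, X, resolution=128, level=0.5):
    """Evaluate the occupancy field on a dense grid and extract its 0.5 level set.

    f_theta: hypothetical trained model, (point cloud, queries) -> occupancy in [0, 1]
    X:       (N_p, 3) input point cloud, assumed normalized to [-0.5, 0.5]^3
    """
    # Regular grid of query locations covering the unit cube.
    t = np.linspace(-0.5, 0.5, resolution, dtype=np.float32)
    grid = np.stack(np.meshgrid(t, t, t, indexing="ij"), axis=-1).reshape(-1, 3)

    # Query occupancy in chunks to bound memory.
    occ = np.concatenate([f_theta(X, q) for q in np.array_split(grid, 64)])
    volume = occ.reshape(resolution, resolution, resolution)

    # Tessellate the 0.5 iso-surface into an explicit triangle mesh.
    verts, faces, _, _ = measure.marching_cubes(volume, level=level)
    verts = verts / (resolution - 1) - 0.5  # voxel indices -> world coordinates
    return verts, faces
```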

We opt for a feed-forward conditional implicit model $f_{\theta}$, as we want our method to generalize to multiple shapes simultaneously and provide fast inference free of test-time optimization. Similarly to the state-of-the-art in this category, e.g. [7, 54], $f_{\theta}$ consists of a point cloud conditioning network and an implicit decoder.

我们选择前馈条件隐式模型 $f_{\theta}$ ,因为希望该方法能同时泛化到多种形状并提供快速的测试时免优化推理。与此类方法的最先进方案 [7, 54] 类似, $f_{\theta}$ 由点云条件网络和隐式解码器构成。

3.1. Model

3.1. 模型

The staple backbone model for feed-forward generalizable 3D shape reconstruction from point cloud has lately been the one introduced by Peng et al. [54], which also bears similarities with the concurrent work by Chibane et al. [15]. It has been widely used among methods predicting other implicit fields from point clouds as well, such as point displacements [75], unsigned distances [16] or whether point pairs are separated by a surface [88]. In this model, the encoder builds an explicit extrinsic 3D feature volume from the voxelized or pixelized input point cloud through 3D (or 2D) convolutional networks.

点云前馈可泛化三维形状重建的主流骨干模型最近由Peng等人[54]提出,该模型也与Chibane等人[15]的同期工作具有相似性。该模型在从点云预测其他隐式场的方法中也得到广泛应用,例如点位移[75]、无符号距离[16]或点对是否被表面分隔[88]。在该模型中,编码器通过三维(或二维)卷积网络,从体素化或像素化的输入点云构建显式外部三维特征体积。

Recently, Boulch et al. [7] argued against this strategy. They contended that these methods operate on a voxelized discretization whose voxel centers may be far from the input point cloud locations. They also argued that the voxel centers holding the shape information are uniformly sampled, while they should ideally be focused near the surface. Consequently, they introduced a point cloud convolutional encoder, where features are only defined at the input point cloud locations, where shape information matters most. They achieve superior reconstruction performance on several benchmarks as a result.

最近,Boulch等人[7]对这一策略提出了反对意见。他们认为这些方法在体素化离散化空间中操作,其体素中心可能远离输入点云的位置。他们还指出,承载形状信息的体素中心是均匀采样的,而理想情况下应更集中于表面附近。最终,他们提出了一种点云卷积编码器,其中特征仅定义在输入点云的位置上,即形状信息最关键的区域。因此,他们在多个基准测试中实现了卓越的重建性能。


Figure 2. Overview: Our method uses exclusively MLPs from beginning to end to predict an implicit shape function given a noisy input point cloud. First, a U-Net PointMixer [17] (gray boxes) is tasked with denoising the input and producing per-point features. The building blocks of this network combine 3 pre-existing forms of point mixing, which pool features from support points in a point cloud into other points of the point cloud. Channel-wise mixing is also used throughout the model. We refer the reader to the work of Choe et al. [17] for a detailed description of these pre-existing mixing layers. We introduce "Extra-set mixing" (Section 3.3), which we use to pool features both locally and globally onto any external query point to estimate its occupancy. This mixing is guided by relative positional encoding using the denoised support point coordinates (see Figure 3).

图 2: 概述:我们的方法全程仅使用多层感知机 (MLP) 从含噪输入点云预测隐式形状函数。首先通过U-Net PointMixer [17](灰色框)对输入点云去噪并生成逐点特征。该网络的构建模块结合了3种现有点混合方式,将点云中支撑点的特征池化到点云的其他点。通道混合技术也贯穿整个模型。具体混合层实现细节请参阅Choe等人的工作[17]。我们提出的"集外混合"(Extra-set mixing,第3.3节)可将特征局部和全局池化到任意外部查询点以估算其占据概率,该混合过程由基于去噪支撑点坐标的相对位置编码引导(见图3)。

Both of the aforementioned strategies, however, still rely on convolutions for translational equivariance and generalization across scenes and objects. Differently, we propose here an architecture based solely on MLPs, and show that it mostly outperforms the previously mentioned strategies.

然而,上述两种策略仍依赖卷积来实现平移等变性和跨场景/物体的泛化能力。与此不同,我们在此提出一种仅基于MLP的架构,并证明其性能在多数情况下优于前述策略。

3.2. Feature Extraction and Denoising Network

3.2. 特征提取与去噪网络

Our point cloud feature extraction network takes the noisy point cloud $\mathbf{X}$ as input and produces an output for each element of the point cloud. However, differently from [7], it does not use convolutions, but instead bases the entire model on MLPs, inspired by the recent success of the PointMixer [17] architecture.

我们的点云特征提取网络以含噪点云$\mathbf{X}$作为输入,并为点云中的每个元素生成输出。但与[7]不同,该网络未使用卷积操作,而是受PointMixer[17]架构近期成果的启发,完全基于多层感知机(MLP)构建模型。

Specifically, we use the dense prediction task model in [17], which consists of a U-Net [61] symmetric encoder-decoder scheme. The point cloud is downsampled gradually throughout the encoder $E$ using farthest point sampling until it reaches the coarsest point cloud $\mathbf{Y}\in\mathbb{R}^{N_{d}\times3}$. It is then upsampled back correspondingly throughout the decoder $D$. As illustrated in Figure 2, each block in the encoder/decoder consists mostly of a hierarchical-set mixing layer, an intra-set mixing layer, an inter-set mixing layer and channel mixing, as elaborated in [17].

具体而言,我们采用[17]中的密集预测任务模型,该模型采用U-Net [61]对称编码器-解码器架构。点云通过编码器 $E$ 逐步下采样,使用最远点采样(farthest point sampling)直至生成最粗糙点云 $\mathbf{Y}\in\mathbb{R}^{N_{d}\times3}$,随后通过解码器 $D$ 对应上采样。如图2所示,编码器/解码器的每个模块主要包含层级集合混合层、集合内混合层、集合间混合层以及通道混合层,具体结构详见[17]。
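For reference, the farthest point sampling used for the encoder's down transitions can be sketched as follows; this is a plain numpy version with an arbitrary seed point, ignoring the acceleration structures a production implementation would use.

```python
import numpy as np

def farthest_point_sampling(X, n_samples):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen.

    X: (N_p, 3) point cloud. Returns the indices of the n_samples kept points,
    so the coarse bottleneck cloud is Y = X[indices].
    """
    n = X.shape[0]
    indices = np.zeros(n_samples, dtype=np.int64)
    dist = np.full(n, np.inf)
    indices[0] = 0  # arbitrary seed
    for i in range(1, n_samples):
        # Distance from every point to its nearest already-selected point.
        dist = np.minimum(dist, np.linalg.norm(X - X[indices[i - 1]], axis=1))
        indices[i] = int(np.argmax(dist))
    return indices
```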

The channel mixing layer is an MLP operating directly on the feature domain, i.e. mixing feature channels. The other mixings are point-wise mixings replacing the token-wise mixing of the original MLP-Mixer model [72]. They are softmax-based order-invariant pooling layers that pool features from a limited set of point cloud support points into a point cloud query. They vary in the way their supports are defined. In the intra-set one, the support is the nearest points to the query. In the inter-set one, the support is the inverse of the intra-set support, i.e. a point is part of the support of a query if the query is among the nearest points of that point. The hierarchical-set mixing is used for down/up transitions in the encoder/decoder, and hence the support is the nearest points in the previous/next point cloud resolution. In practice, for the decoder to be symmetric with the encoder, the up transition support set is defined as the inverse of the equivalent down transition support set, differently from seminal work [57, 89, 90]. For a more thorough explanation of the intra-set, inter-set and hierarchical-set mixing layers we refer the reader to the work by Choe et al. [17].

通道混合层是一个直接在特征域上操作的多层感知机(MLP),即混合特征通道。其他混合操作采用逐点混合方式,替代了原始MLP-Mixer模型[72]中的Token混合方式。这些基于softmax的次序不变池化层,将来自有限点云支撑点的特征池化到点云查询点中,其差异在于支撑点的定义方式:

  • 集合内混合的支撑点是查询点的最近邻点
  • 集合间混合的支撑点与集合内相反(若某点的最近邻包含查询点,则该点属于查询点的支撑集)
  • 层级集合混合用于编码器/解码器的下/上采样转换,其支撑点来自前/后一级分辨率点云的最近邻点。为实现编解码器对称性,上采样支撑集被定义为对应下采样支撑集的逆,这与开创性研究[57, 89, 90]有所不同。关于三类混合层的详细说明,请参阅Choe等人[17]的研究。
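The following brute-force sketch illustrates how the intra-set and inter-set supports relate to each other (the dense distance matrix and the value of $k$ are illustrative; real implementations would use spatial indexing):

```python
import numpy as np

def knn_indices(points, queries, k):
    """Indices of the k nearest rows of `points` for each query point."""
    d = np.linalg.norm(queries[:, None, :] - points[None, :, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]  # (N_q, k)

def intra_and_inter_supports(X, k=16):
    # Intra-set support of point i: its k nearest neighbours in X
    # (the point itself is included, at distance 0).
    intra = knn_indices(X, X, k)  # (N_p, k)

    # Inter-set support of point i: every point j such that i is among the
    # k nearest neighbours of j, i.e. the inverse of the intra relation.
    inter = [[] for _ in range(X.shape[0])]
    for j, neighbours in enumerate(intra):
        for i in neighbours:
            inter[int(i)].append(j)
    return intra, inter  # inter[i] has variable length
```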

Differently from [7], we design our feature extraction network to perform two tasks simultaneously: produce per-point fine features $\mathbf{F}_{\mathbf{X}}\in\mathbb{R}^{N_{p}\times F}$ and displacement vectors $\Delta\mathbf{X}\in\mathbb{R}^{N_{p}\times3}$:

与[7]不同,我们设计的特征提取网络可同时执行两项任务:生成逐点精细特征 $\mathbf{F}_{\mathbf{X}}\in\mathbb{R}^{N_{p}\times F}$ 和位移向量 $\Delta\mathbf{X}\in\mathbb{R}^{N_{p}\times3}$ :

$$
\mathbf{F}_{\mathbf{X}},\,\Delta\mathbf{X}=D\circ E(\mathbf{X}).
$$

$$
\mathbf{F}_{\mathbf{X}},\,\Delta\mathbf{X}=D\circ E(\mathbf{X}).
$$

We train the displacements in a semi-supervised fashion to produce a denoised version $\tilde{\mathbf{X}}\in\mathbb{R}^{N_{p}\times3}$ of the input point cloud:

我们以半监督方式训练位移,以生成输入点云的降噪版本 $\tilde{\mathbf{X}}\in\mathbb{R}^{N_{p}\times3}$:

$$
\tilde{\mathbf{X}}=\mathbf{X}+\Delta\mathbf{X}.
$$

$$
\tilde{\mathbf{X}}=\mathbf{X}+\Delta\mathbf{X}.
$$
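A minimal PyTorch sketch of this dual-head design is given below. `backbone` stands in for the PointMixer U-Net $D\circ E$, and the layer widths are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FeatureAndDenoisingHeads(nn.Module):
    """Two channel-mixing heads on a shared per-point backbone.

    backbone: stand-in for the PointMixer U-Net, assumed to map an
    (N_p, 3) cloud to per-point features of size `width`.
    """
    def __init__(self, backbone, width=32, feat_dim=32):
        super().__init__()
        self.backbone = backbone
        self.feat_head = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(), nn.Linear(width, feat_dim))
        self.offset_head = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(), nn.Linear(width, 3))

    def forward(self, X):
        h = self.backbone(X)           # (N_p, width) per-point features
        F_X = self.feat_head(h)        # fine features F_X
        delta_X = self.offset_head(h)  # denoising displacements
        X_tilde = X + delta_X          # denoised cloud: X + delta X
        return F_X, X_tilde
```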


Figure 3. Using denoised point coordinates and features, our implicit decoder performs local and global extra-set point mixing at a given query location to predict its shape occupancy value. The global extra-set mixing mechanism is similar to the local one. The local extra-set mixing uses the $K$ nearest denoised points of the point cloud. The global extra-set mixing uses the full $N_{d}$ denoised points of the downsampled point cloud.

图 3: 通过去噪点坐标和特征,我们的隐式解码器在给定查询位置执行局部与全局集外点混合,以预测其形状占据值。全局集外混合机制与局部机制类似。局部集外混合使用点云的 $K$ 个最近去噪点,全局集外混合则使用下采样点云的全部 $N_{d}$ 个去噪点。

We note that using the encoder $E$, we can also extract coarse features $\mathbf{F}_{\mathbf{Y}}$ for the downsampled point cloud $\mathbf{Y}$ at the bottleneck of the model:

我们注意到,使用编码器 $E$ 还可以为模型瓶颈处的下采样点云 $\mathbf{Y}$ 提取粗粒度特征 $\mathbf{F}_{\mathbf{Y}}$:

$$
\mathbf{F}_{\mathbf{Y}}=E(\mathbf{X}).
$$

$$
\mathbf{F}_{\mathbf{Y}}=E(\mathbf{X}).
$$

A denoised version $\tilde{\mathbf{Y}}\in\mathbb{R}^{N_{d}\times3}$ of the downsampled input point cloud $\mathbf{Y}$ can be naturally obtained from the output denoised point cloud $\tilde{\mathbf{X}}$ via the encoder-defined downsampling indexing $\delta:[1,N_{d}]\to[1,N_{p}]$:

降采样输入点云 $\mathbf{Y}$ 的去噪版本 $\tilde{\mathbf{Y}}\in\mathbb{R}^{N_{d}\times3}$ 可通过编码器定义的下采样索引 $\delta:[1,N_{d}]\to[1,N_{p}]$ 从输出去噪点云 $\tilde{\mathbf{X}}$ 中自然获得:

$$
\tilde{\mathbf{Y}}_{i}=\tilde{\mathbf{X}}_{\delta(i)},\quad\mathbf{Y}_{i}=\mathbf{X}_{\delta(i)}.
$$

$$
\tilde{\mathbf{Y}}_{i}=\tilde{\mathbf{X}}_{\delta(i)},\quad\mathbf{Y}_{i}=\mathbf{X}_{\delta(i)}.
$$
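Since $\delta$ is simply the set of indices retained by the encoder's farthest point sampling, both identities reduce to a single indexing operation, as in this small sketch continuing the hypothetical numpy examples above:

```python
import numpy as np

def coarse_clouds(X, X_tilde, delta):
    """Apply the FPS index map delta to the noisy and the denoised cloud alike.

    X, X_tilde: (N_p, 3) noisy input and its denoised version.
    delta:      (N_d,) indices kept by the encoder's farthest point sampling.
    """
    Y = X[delta]              # Y_i  = X_{delta(i)}   (noisy bottleneck cloud)
    Y_tilde = X_tilde[delta]  # Y~_i = X~_{delta(i)}  (denoised, no extra pass)
    return Y, Y_tilde
```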

3.3. Extra-set Mixing Decoder

3.3. 集外混合解码器

To serve the role of our implicit decoder, we introduce a new point mixing layer, borrowing the terminology of PointMixer [17]. As opposed to the existing point mixing layers in [17], this layer takes as input a query point representing any location in Euclidean space, not necessarily an element of the input point cloud $\mathbf{X}$, hence the denomination "Extra-set". Similarly to the point mixing layers in [17], it pools features from a support set using softmax in an order-invariant fashion.

为了充当我们的隐式解码器,我们引入了一种新的点混合层(point mixing layer),借鉴了PointMixer [17]中的术语。与[17]中现有的点混合层不同,该层的输入是一个表示欧几里得空间中任意位置的查询点,而不必是输入点云 $\mathbf{X}$ 中的元素,因此命名为"集外"(Extra-set)混合层。与[17]中的点混合层类似,它以顺序不变的方式使用softmax从支撑集中池化特征。

We first use our layer to extract a local fine feature $\mathbf{y}^{\mathrm{loc}}$ for a query point $q\in\mathbb{R}^{3}$. This feature is built using the $K$ nearest points to $q$ from the denoised output point cloud $\tilde{\mathbf{X}}$ and their corresponding output fine features in $\mathbf{F}_{\mathbf{X}}$, as generated by the decoder $D$.

我们首先使用该层为查询点 $q\in\mathbb{R}^{3}$ 提取局部精细特征 $\mathbf{y}^{\mathrm{loc}}$。该特征是通过使用去噪输出点云 $\tilde{\mathbf{X}}$ 中距离 $q$ 最近的 $K$ 个点及其在解码器 $D$ 生成的 $\mathbf{F}_{\mathbf{X}}$ 中对应的输出精细特征构建的。

To this end, and as illustrated in Figure 3, we first compute an aggregation score $s_{\tilde{p}}$ for each support element $\tilde{p}$ :

为此,如图 3 所示,我们首先为每个支持元素 $\tilde{p}$ 计算聚合分数 $s_{\tilde{p}}$:

$$
s_{\tilde{p}}=g_{2}\circ g_{1}\big(\tilde{p}-q;\,\mathbf{F}_{\mathbf{X}}(p)\big),\quad\tilde{p}\in\mathcal{X}_{q}=\mathrm{kNN}(\tilde{\mathbf{X}},q),
$$

$$
s_{\tilde{p}}=g_{2}\circ g_{1}\big(\tilde{p}-q;\,\mathbf{F}_{\mathbf{X}}(p)\big),\quad\tilde{p}\in\mathcal{X}_{q}=\mathrm{kNN}(\tilde{\mathbf{X}},q),
$$

where $g_{1}(\cdot)$ and $g_{2}(\cdot)$ are channel mixing MLPs and $\mathbf{F}_{\mathbf{X}}(p)$ is the fine feature of the denoised support point $\tilde{p}$. We use these aggregation scores to pool the features of the support points into the final local feature of the query point:

其中 $g_{1}(.)$ 和 $g_{2}(.)$ 是通道混合 MLP,$\mathbf{F}_{\mathbf{X}}({\boldsymbol{p}})$ 是去噪支撑点 $\tilde{p}$ 的精细特征。我们利用这些聚合分数将支撑点的特征汇集为查询点的最终局部特征:

$$
\mathbf{y}^{\mathrm{loc}}=\sum_{\tilde{p}\in\mathcal{X}_{q}}\mathrm{softmax}(s_{\tilde{p}})\odot\big[g_{3}\circ g_{1}\big(\tilde{p}-q;\,\mathbf{F}_{\mathbf{X}}(p)\big)\big],
$$

$$
\mathbf{y}^{\mathrm{loc}}=\sum_{\tilde{p}\in\mathcal{X}_{q}}\mathrm{softmax}(s_{\tilde{p}})\odot\big[g_{3}\circ g_{1}\big(\tilde{p}-q;\,\mathbf{F}_{\mathbf{X}}(p)\big)\big],
$$

where $\mathrm{softmax}(\cdot)$ is the softmax normalization over the support point dimension, $g_{3}(\cdot)$ is another channel mixing MLP, and $\odot$ denotes the element-wise product. In effect, this layer is akin to a spatial attention module [83], with space here spanning the support points.

其中 softmax $(.)$ 是在支撑点维度上的 softmax 归一化, $g_{3}(.)$ 是另一个通道混合 MLP, $\odot$ 表示逐元素乘积。最终,该层实际上类似于空间注意力模块 [83],这里的空间跨越支撑点。

We note additionally that the architecture of this layer follows the decoder in [7]. Most importantly, we pinpoint that a main difference here is that the relative positional encoding in our extra-set mixing uses denoised support coordinates, $\{\tilde{p}-q\},\ \tilde{p}\in\tilde{\mathbf{X}}$, instead of the original noisy ones, $\{p-q\}$, $p\in\mathbf{X}$, as done in [7]. This allows us to reduce the effect of noise on the Euclidean relative position guided feature aggregation. Similarly to [7], we use in practice $N_{H}$ MLPs $g_{2}$ in parallel, i.e. $\mathrm{softmax}(s_{\tilde{p}}):=\bigoplus_{n=1}^{N_{H}}\mathrm{softmax}\big(g_{2}^{n}\circ g_{1}(\tilde{p}-q;\,\mathbf{F}_{\mathbf{X}}(p))\big)$, where $\bigoplus$ denotes channel-wise concatenation.

我们进一步注意到,该层的架构遵循了[7]中的解码器设计。最关键的是,我们指出此处的主要差异在于:集外混合中采用的相对位置编码使用了去噪后的支撑坐标 $\{\tilde{p}-q\},\ \tilde{p}\in\tilde{\mathbf{X}}$,而非如[7]中直接使用原始含噪坐标 $\{p-q\}$, $p\in\mathbf{X}$。这使我们能够降低噪声对基于欧氏相对位置引导的特征聚合的影响。与[7]类似,我们在实践中并行使用 $N_{H}$ 个MLP $g_{2}$,即 $\mathrm{softmax}(s_{\tilde{p}}):=\bigoplus_{n=1}^{N_{H}}\mathrm{softmax}\big(g_{2}^{n}\circ g_{1}(\tilde{p}-q;\,\mathbf{F}_{\mathbf{X}}(p))\big)$,其中 $\bigoplus$ 表示通道维度拼接。
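Putting Section 3.3 together, below is a minimal single-head PyTorch sketch of extra-set mixing at one query point (layer widths and MLP depths are assumptions; the multi-head variant would run $N_{H}$ copies of `g2` in parallel and concatenate the resulting softmax weights):

```python
import torch
import torch.nn as nn

class ExtraSetMixing(nn.Module):
    """Pool fine features from K denoised support points onto a query point."""
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.g1 = nn.Sequential(nn.Linear(3 + feat_dim, hidden), nn.ReLU())
        self.g2 = nn.Linear(hidden, hidden)  # aggregation scores s_p~
        self.g3 = nn.Linear(hidden, hidden)  # values to be pooled

    def forward(self, q, support_xyz, support_feats):
        # q: (3,) query; support_xyz: (K, 3) denoised kNN coordinates;
        # support_feats: (K, feat_dim) their fine features F_X.
        rel = support_xyz - q                      # denoised relative positional encoding
        h = self.g1(torch.cat([rel, support_feats], dim=-1))
        scores = torch.softmax(self.g2(h), dim=0)  # softmax over the K supports
        return (scores * self.g3(h)).sum(dim=0)    # order-invariant pooling: y_loc
```

The global variant is identical in form, except that the support set is the full set of $N_{d}$ denoised bottleneck points $\tilde{\mathbf{Y}}$ with their coarse features $\mathbf{F}_{\mathbf{Y}}$.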