[Paper Translation] ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image


Paper: https://arxiv.org/abs/2310.17994

Code: https://kylesargent.github.io/zeronvs/

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

Kyle Sargent¹, Zizhang Li¹, Tanmay Shah², Charles Herrmann², Hong-Xing Yu¹,
Yunzhi Zhang¹, Eric Ryan Chan¹, Dmitry Lagun², Li Fei-Fei¹, Deqing Sun², Jiajun Wu¹
¹Stanford University, ²Google Research

Abstract

We introduce ZeroNVS, a 3D-aware diffusion model for single-image novel view synthesis of in-the-wild scenes. Whereas existing methods are designed for single objects with masked backgrounds, we propose new techniques to address the challenges posed by in-the-wild, multi-object scenes with complex backgrounds. Specifically, we train a generative prior on a mixture of data sources that captures object-centric, indoor, and outdoor scenes. To address issues arising from the data mixture, such as depth-scale ambiguity, we propose a novel camera conditioning parameterization and normalization scheme. Furthermore, we observe that Score Distillation Sampling (SDS) tends to truncate the distribution of complex backgrounds during distillation of 360-degree scenes, and we propose "SDS anchoring" to improve the diversity of synthesized novel views. Our model achieves a new state-of-the-art LPIPS result on the DTU dataset in the zero-shot setting, even outperforming methods trained specifically on DTU. We further adapt the challenging Mip-NeRF 360 dataset as a new benchmark for single-image novel view synthesis and demonstrate strong performance in this setting. Code and models are available at https://kylesargent.github.io/zeronvs/.

1 Introduction

Figure 1: View synthesis from a single image, for scenes from CO3D, RealEstate10K, DTU (zero-shot), and Mip-NeRF 360 (zero-shot); each row shows an input view followed by synthesized novel views. All NeRFs are predicted by the same model.

Models for single-image, 360-degree novel view synthesis (NVS) should produce realistic and diverse results: the synthesized images should look natural and 3D-consistent to humans, and they should also capture the many possible explanations of unobservable regions. This challenging problem has typically been studied in the context of single objects without backgrounds, where the requirements on both realism and diversity are simplified. Recent progress relies on large 3D datasets like Objaverse-XL [9], which have enabled training conditional diffusion models [19] to perform photorealistic and 3D-consistent NVS via Score Distillation Sampling [SDS; 27]. Meanwhile, since image diversity mostly lies in the background rather than the object, ignoring the background greatly reduces the need to synthesize diverse images; in fact, most object-centric methods do not consider diversity metrics at all [19, 22, 28].

Neither assumption holds for the more challenging problem of zero-shot, 360-degree novel view synthesis on real-world scenes. There is no single, large-scale dataset of scenes with ground-truth geometry, texture, and camera parameters analogous to Objaverse-XL for objects. The background, which can no longer be ignored, must also be modeled well in order to synthesize diverse results.

We address both issues with our new model, ZeroNVS. Inspired by previous object-centric methods [19, 22, 28], ZeroNVS also trains a 2D conditional diffusion model followed by 3D distillation. Unlike them, however, ZeroNVS works well on scenes thanks to two technical innovations: a new camera parametrization and normalization scheme for conditioning, which allows training the diffusion model on diverse scene datasets, and an "SDS anchoring" mechanism, which improves background diversity over standard SDS.

To overcome the key challenge of limited training data, we propose training the diffusion model on a massive mixed dataset comprising all scenes from CO3D [31], RealEstate10K [45], and ACID [17], so that the model can potentially handle complex in-the-wild scenes. Mixed data of such scale and diversity are captured with a variety of camera settings and come with several different types of 3D ground truth, e.g., computed with COLMAP [33] or ORB-SLAM [24]. We show that while the camera conditioning representations of prior methods [19] are too ambiguous or inexpressive to model in-the-wild scenes, our new camera parametrization and normalization scheme exploits such diverse data sources and leads to superior NVS on real-world scenes.

Building a 2D conditional diffusion model that works effectively for in-the-wild scenes enables us to study the limitations of SDS in the scene setting. In particular, we observe limited diversity in the generated scene backgrounds when synthesizing long-range (e.g., 180-degree) novel views with SDS. We therefore propose "SDS anchoring" to ameliorate the issue. In SDS anchoring, we first sample several "anchor" novel views using standard Denoising Diffusion Implicit Model (DDIM) sampling [37]. This yields a collection of pseudo-ground-truth novel views with diverse contents, since DDIM is not prone to mode collapse like SDS. Then, rather than using these views as RGB supervision, we sample from them randomly as conditions for SDS, which enforces diversity while still ensuring 3D-consistent view synthesis.
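The anchoring logic itself is small. Below is a minimal sketch of one distillation step with SDS anchoring; the NeRF renderer, the `diffusion.ddim_sample`/`diffusion.sds_loss` calls, and the `make_conditioning` helper are hypothetical names, not the released API. The paper describes sampling conditions from the anchors, and Figure 4 depicts using the nearest image as guidance; the nearest-view rule is used here for concreteness.

```python
import numpy as np

def sample_anchors(diffusion, input_image, input_pose, anchor_poses):
    """Sample pseudo-ground-truth 'anchor' views once, up front, with plain DDIM.

    DDIM sampling keeps the anchors diverse, unlike SDS, which tends to
    collapse the background distribution.
    """
    return [
        diffusion.ddim_sample(input_image, make_conditioning(input_pose, pose))
        for pose in anchor_poses
    ]

def sds_anchoring_step(nerf, diffusion, input_image, input_pose,
                       anchor_images, anchor_poses, target_pose):
    """One SDS step where the guidance image may be an anchor, not just the input."""
    # Candidate guidance images: the real input plus the DDIM-sampled anchors.
    guide_images = [input_image] + list(anchor_images)
    guide_poses = [input_pose] + list(anchor_poses)

    # Pick the guidance view whose camera position is closest to the target view
    # (4x4 camera-to-world poses assumed; translation column = camera location).
    target_loc = target_pose[:3, 3]
    k = int(np.argmin([np.linalg.norm(p[:3, 3] - target_loc) for p in guide_poses]))

    rendered = nerf.render(target_pose)  # hypothetical differentiable render
    cond = make_conditioning(guide_poses[k], target_pose)
    return diffusion.sds_loss(rendered, cond_image=guide_images[k], cond_camera=cond)
```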

ZeroNVS achieves strong zero-shot generalization to unseen data. We set a new state-of-the-art LPIPS score on the challenging DTU benchmark, even outperforming methods that were fine-tuned directly on this dataset. Since the popular DTU benchmark consists of scenes captured by a forward-facing camera rig and cannot evaluate more challenging pose changes, we propose using the Mip-NeRF 360 dataset [2] as a single-image novel view synthesis benchmark. ZeroNVS achieves the best LPIPS performance on this benchmark as well. Finally, we show the potential of SDS anchoring to address diversity issues in background generation via a user study.

To summarize, we make the following contributions:

  • We propose ZeroNVS, which enables full-scene NVS from real images. ZeroNVS is the first to demonstrate that SDS distillation can lift scenes that are not object-centric and may have complex backgrounds to 3D.
  • We show that the formulations for handling cameras and scene scale in prior work are either inexpressive or ambiguous for in-the-wild scenes. We propose a new camera conditioning parameterization and a scene normalization scheme. These enable us to train a single model on a large collection of diverse training data consisting of CO3D, RealEstate10K, and ACID, allowing strong zero-shot generalization for NVS on in-the-wild images.
  • We study the limitations of SDS distillation as applied to scenes. Similar to prior work, we identify a diversity issue, which manifests in this case as novel view predictions with monotone backgrounds. We propose SDS anchoring to ameliorate the issue.
  • We show state-of-the-art LPIPS results on DTU zero-shot, surpassing prior methods fine-tuned on this dataset. Furthermore, we introduce the Mip-NeRF 360 dataset as a scene-level single-image novel view synthesis benchmark and analyze the performance of our and other methods on it. Finally, a user study shows that our proposed SDS anchoring is preferred.

2 Related Work

3D generation. DreamFusion [27] proposed Score Distillation Sampling (SDS) as a way of leveraging a diffusion model to extract a NeRF given a user-provided text prompt. After DreamFusion, follow-up works such as Magic3D [16], ATT3D [21], ProlificDreamer [39], and Fantasia3D [7] improved the quality, diversity, resolution, or runtime. Other types of 3D generative models include GAN-based ones, which are primarily restricted to single object categories [4, 26, 13, 5, 25, 34] or to synthetic data [12]. Recently, 3DGP [35] adapted the GAN approach to train 3D generative models on ImageNet. VQ3D [32] and IVID [41] leveraged vector quantization and diffusion, respectively, to learn generative models on ImageNet. One critical challenge for scene-based 3D-aware methods is 360-degree viewpoint change: both VQ3D and 3DGP demonstrate only limited camera motion, while IVID generally focuses on small camera motion but can achieve 360-degree views for a small subset of scenes.

Single-image novel view synthesis. PixelNeRF [44] and DietNeRF [15] learn to infer NeRFs from sparse views via training an image-based 3D feature extractor or semantic consistency losses, respectively. However, these approaches do not produce renderings resembling crisp natural images from a single image. Several recent diffusion-based approaches achieve high-quality results by separating the problem into two stages. First, a (potentially 3D-aware) diffusion model is trained; second, the diffusion model is used to distill a 3D-consistent scene representation given an input image, via techniques such as score distillation sampling [27], score Jacobian chaining [38], textual inversion or semantic guidance leveraging the diffusion model [22, 10], or explicit 3D reconstruction from multiple sampled views of the diffusion model [18, 20]. Another diffusion-based work, GeNVS [6], achieves 360-degree camera motion but only for specific object categories such as fire hydrants. By contrast, ZeroNVS generates 360-degree camera motion by default for a variety of scene categories. This is enabled by innovations such as novel camera conditioning representations and SDS anchoring, which allow us to train on massive real scene datasets and then perform scene-level NVS with up to 360-degree viewpoint change on diverse scene types.

3 Method

We consider the problem of scene-level novel view synthesis from a single real image. Similar to prior work [19, 28], we first train a diffusion model 𝐩_θ to perform novel view synthesis, and then leverage it to perform 3D SDS distillation. Unlike prior work, we focus on scenes rather than objects. Scenes present several unique challenges. First, prior works use representations for cameras and scale that are either ambiguous or insufficiently expressive for scenes. Second, the inference procedure of prior works is based on SDS, which has a known mode-collapse issue that manifests in scenes as greatly reduced background diversity in predicted views. We address these challenges through improved representations and inference procedures for scenes compared with prior work [19, 28].

We begin by introducing some general notation. Let a scene S consist of a set of images X = {X_i}_{i=1}^n, depth maps D = {D_i}_{i=1}^n, extrinsics E = {E_i}_{i=1}^n, and a shared field of view f. We note that an extrinsics matrix E_i can be identified with its rotation and translation components, E_i = (E_i^R, E_i^T). We preprocess our data to consist of square images, and we assume intrinsics are shared within a given scene and that there is no skew, distortion, or off-center principal point.

We focus on the design of the conditioning information that is passed to the view synthesis diffusion model 𝐩_θ in addition to the input image. This conditioning information can be represented via a function 𝐌(D, f, E, i, j), which computes a conditioning embedding given the depth maps and extrinsics of the scene, the field of view, and the indices i, j of the input and target views, respectively. We learn a generative model over novel views 𝐩_θ given by

$$X_j \sim \mathbf{p}_\theta\big(X_j \mid X_i,\; \mathbf{M}(D, f, E, i, j)\big).$$

The output of 𝐌 and the input image X_i are the only information used by the model for NVS. Both Zero-1-to-3 (Section 3.1) and our model, as well as several intermediate models that we study (Sections 3.2 and 3.3), can be regarded as different choices of 𝐌. As we illustrate in Figures 2, 3, 5, and 6, and verify in experiments, different choices of 𝐌 can have drastic impacts on performance.
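To make the notation concrete, the sketch below spells out the data a scene provides and the interface of the conditioning function 𝐌. The class and function names are illustrative only and are not taken from the released code; each parameterization discussed in Sections 3.1 to 3.3 would simply be a different body for `conditioning`.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Scene:
    images: list       # X_1..X_n, e.g. HxWx3 float arrays
    depths: list       # D_1..D_n, HxW depth maps
    extrinsics: list   # E_1..E_n, 4x4 pose matrices (camera-to-world assumed here)
    fov: float         # shared field of view f

def conditioning(scene: Scene, i: int, j: int) -> np.ndarray:
    """M(D, f, E, i, j): the embedding handed to the diffusion model p_theta.

    Together with the input image X_i, its output is all the model sees when
    predicting the target view X_j.
    """
    raise NotImplementedError
```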

Figure 2: A 3-degrees-of-freedom camera pose captures elevation, azimuth, and radius, but cannot represent the camera's roll (pictured) or cameras at arbitrary positions and orientations.

Figure 3: To a monocular camera, a small object close to the camera (left) and a large object far away (right) appear identical, even though they correspond to different scenes. Scale ambiguity in the input view leads to multiple plausible novel views.

Figure 4: SDS-based NeRF distillation (left) uses the same guidance image for all 360 degrees of novel views. Our "SDS anchoring" (right) first samples novel views via DDIM [36] and then uses the nearest image (either the input or a sampled novel view) for guidance.

3.1 Representing objects for view synthesis

Zero-1-to-3 [19] represents poses with 3 degrees of freedom, given by an elevation θ, azimuth ϕ, and radius z. Let 𝐏 : SE(3) → ℝ³ be the projection to (θ, ϕ, z); then

$$\mathbf{M}_{\text{Zero-1-to-3}}(D, f, E, i, j) = \mathbf{P}(E_i) - \mathbf{P}(E_j)$$

is the camera conditioning representation used by Zero-1-to-3. For object mesh datasets such as Objaverse [8] and Objaverse-XL [9], this representation is appropriate because the data are known to consist of single objects without backgrounds, aligned and centered at the origin, and imaged by training cameras generated with three degrees of freedom. However, such a parameterization limits the model's ability to generalize to non-object-centric images and to real-world data, where poses have six degrees of freedom, incorporating both rotation (pitch, roll, yaw) and 3D translation. An illustration of a failure of the 3DoF camera representation is shown in Figure 2.
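For reference, a plausible sketch of this 3DoF conditioning, reusing the illustrative Scene structure from above and assuming an origin-facing, camera-to-world extrinsics convention with the z-axis up (these conventions are assumptions, not specified in the excerpt):

```python
import numpy as np

def pose_to_spherical(E: np.ndarray) -> np.ndarray:
    """P: SE(3) -> R^3, mapping an object-facing camera to (elevation, azimuth, radius).

    Only meaningful when the camera looks at the origin with no roll, which is
    exactly why this parameterization fails for the cameras shown in Figure 2.
    """
    x, y, z = E[:3, 3]                       # camera location (camera-to-world assumed)
    radius = float(np.linalg.norm([x, y, z]))
    elevation = float(np.arcsin(z / radius))
    azimuth = float(np.arctan2(y, x))
    return np.array([elevation, azimuth, radius])

def conditioning_zero123(scene, i, j):
    # M_{Zero-1-to-3}(D, f, E, i, j) = P(E_i) - P(E_j): 3 relative spherical coordinates.
    return pose_to_spherical(scene.extrinsics[i]) - pose_to_spherical(scene.extrinsics[j])
```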

3.2 Representing scenes for view synthesis

For scenes, we should use a camera representation with six degrees of freedom that can capture all possible positions and orientations. One straightforward choice is the relative pose parameterization [40]. We propose to also include the field of view as an additional degree of freedom, and we term the combined representation "6DoF+1". This gives us

$$\mathbf{M}_{6\mathrm{DoF}+1}(D, f, E, i, j) = \big[\,E_i^{-1} E_j,\; f\,\big].$$

Importantly, 𝐌_{6DoF+1} is invariant to any rigid transformation Ẽ of the scene, so that we have

$$\mathbf{M}_{6\mathrm{DoF}+1}(D, f, \tilde{E} \cdot E, i, j) = \big[\,E_i^{-1} E_j,\; f\,\big].$$

This is useful given the arbitrary nature of the poses in our datasets, which are determined by COLMAP or ORB-SLAM. The poses discovered by these algorithms bear no relation to any meaningful alignment of the scene, such as a rigid transformation and scale transformation that would align the scene to some canonical frame and unit of scale. Although 𝐌_{6DoF+1} is invariant to rigid transformations of the scene, it is not invariant to scale. The scene scales determined by COLMAP and ORB-SLAM are likewise arbitrary and may vary significantly. One solution is to directly normalize the camera locations. Let 𝐑(E, λ) : SE(3) × ℝ → SE(3) be a function that scales the translation component of E by λ. Then we define

$$s = \frac{1}{n}\sum_{i=1}^{n} \Big\lVert E_i^{T} - \frac{1}{n}\sum_{j=1}^{n} E_j^{T} \Big\rVert_2, \qquad \mathbf{M}_{6\mathrm{DoF}+1,\text{norm.}}(D, f, E, i, j) = \Big[\,\mathbf{R}\big(E_i^{-1} E_j, \tfrac{1}{s}\big),\; f\,\Big],$$

where s is the average norm of the camera locations when the mean of the camera locations is chosen as the origin. In 𝐌_{6DoF+1,norm.}, the camera locations are normalized by rescaling them by 1/s, in contrast to 𝐌_{6DoF+1}, where the scales are arbitrary. This choice of 𝐌 ensures that scenes from our mixture of datasets have similar scales.
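A minimal sketch of this normalized conditioning, again reusing the illustrative Scene structure and assuming camera-to-world extrinsics so that the translation column is the camera location:

```python
import numpy as np

def rescale_translation(E: np.ndarray, lam: float) -> np.ndarray:
    """R(E, lambda): scale the translation component of a pose by lambda."""
    out = E.copy()
    out[:3, 3] *= lam
    return out

def conditioning_6dof1_norm(scene, i, j):
    """M_{6DoF+1,norm.}: relative pose, with camera locations normalized by s."""
    # s: mean distance of the camera locations from their centroid.
    locs = np.stack([E[:3, 3] for E in scene.extrinsics])
    s = float(np.mean(np.linalg.norm(locs - locs.mean(axis=0), axis=1)))

    rel = np.linalg.inv(scene.extrinsics[i]) @ scene.extrinsics[j]  # E_i^{-1} E_j
    rel = rescale_translation(rel, 1.0 / s)
    # "+1": append the shared field of view as an extra conditioning scalar.
    return np.concatenate([rel.reshape(-1), [scene.fov]])
```

Because the embedding depends only on the relative pose and on distances between camera locations, applying the same rigid transformation to every E_k leaves the output unchanged, matching the invariance noted above.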

3.3 Addressing scale ambiguity with a new normalization scheme

Figure 5: Samples and variance heatmaps of Sobel edges across multiple ZeroNVS samples. 𝐌_{6DoF+1,viewer} reduces the randomness caused by scale ambiguity.

Figure 6: Top: a scene with two cameras facing an object. Bottom: the same scene with an additional camera facing the ground. Adding camera C drastically changes the scene scale under 𝐌_{6DoF+1,agg.}; 𝐌_{6DoF+1,viewer} avoids this.

The representation 𝐌_{6DoF+1,norm.} achieves reasonable performance on real scenes by addressing the limited degrees of freedom and the handling of scale in prior representations. However, a normalization scheme that better addresses scale ambiguity may lead to improved performance. Scene scale is ambiguous given a monocular input image [30, 43], which complicates NVS, as we illustrate in Figure 3. We therefore choose to introduce information about the scale of the visible content into our conditioning embedding function 𝐌. Rather than normalizing by camera locations, Stereo Magnification [45] takes the 5th quantile of each depth map of the scene, then takes the 10th quantile of this aggregated set of numbers, and declares this the scene scale. Let 𝐐_k be a function that takes the k-th quantile of a set of numbers; then we define

$$q = \mathbf{Q}_{10}\big(\{\mathbf{Q}_5(D_i)\}_{i=1}^{n}\big), \qquad \mathbf{M}_{6\mathrm{DoF}+1,\text{agg.}}(D, f, E, i, j) = \Big[\,\mathbf{R}\big(E_i^{-1} E_j, \tfrac{1}{q}\big),\; f\,\Big],$$

where in 𝐌_{6DoF+1,agg.}, q is the scale applied to the translation component of the scene's cameras before computing the relative pose. In this way, 𝐌_{6DoF+1,agg.} differs from 𝐌_{6DoF+1,norm.}: its normalization is derived from the depth maps rather than the camera locations, and therefore carries information about the scale of the visible content.
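A sketch of this depth-aggregated variant, under the same illustrative conventions as the previous snippets; the quantile levels follow the 5th and 10th percentiles described above.

```python
import numpy as np

def conditioning_6dof1_agg(scene, i, j):
    """M_{6DoF+1,agg.}: normalize by a depth-derived scale q instead of camera spread."""
    # q = Q_10 of the per-image Q_5 depth quantiles (Stereo Magnification-style).
    per_image_q5 = [float(np.quantile(D, 0.05)) for D in scene.depths]
    q = float(np.quantile(per_image_q5, 0.10))

    rel = np.linalg.inv(scene.extrinsics[i]) @ scene.extrinsics[j]  # E_i^{-1} E_j
    rel[:3, 3] /= q                                                 # R(E_i^{-1} E_j, 1/q)
    return np.concatenate([rel.reshape(-1), [scene.fov]])
```

As Figure 6 illustrates, this aggregated scale can still change drastically when a new camera viewing nearby content is added, which motivates the 𝐌_{6DoF+1,viewer} variant referenced in Figures 5 and 6.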