[论文翻译]蒸馏任意深度:蒸馏打造更强大的单目深度估计器


原文地址:https://arxiv.org/pdf/2502.19204v1


Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator

蒸馏任意深度:蒸馏打造更强大的单目深度估计器


Figure 1: Zero-shot prediction on in-the-wild images. Our model, distilled from Genpercept [45] and Depth Anything v2 [47], outperforms other methods by delivering more accurate depth details and exhibiting superior generalization for monocular depth estimation on in-the-wild images.

图 1: 对野外图像的零样本预测。我们的模型从 Genpercept [45] 和 Depth Anything v2 [47] 中蒸馏而来,在提供更准确的深度细节和展示对野外图像的单目深度估计的优越泛化能力方面,优于其他方法。

Abstract

摘要

Monocular depth estimation (MDE) aims to predict scene depth from a single RGB image and plays a crucial role in 3D scene understanding. Recent advances in zero-shot MDE leverage normalized depth representations and distillation-based learning to improve generalization across diverse scenes. However, current depth normalization methods for distillation, relying on global normalization, can amplify noisy pseudo-labels, reducing distillation effectiveness. In this paper, we systematically analyze the impact of different depth normalization strategies on pseudo-label distillation. Based on our findings, we propose Cross-Context Distillation, which integrates global and local depth cues to enhance pseudo-label quality. Additionally, we introduce a multi-teacher distillation framework that leverages complementary strengths of different depth estimation models, leading to more robust and accurate depth predictions. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, both quantitatively and qualitatively.

单目深度估计 (Monocular Depth Estimation, MDE) 旨在从单个 RGB 图像中预测场景深度,并在 3D 场景理解中发挥着关键作用。最近的零样本 MDE 进展利用归一化深度表示和基于蒸馏的学习来提升跨多样场景的泛化能力。然而,当前依赖于全局归一化的深度归一化方法可能会放大噪声伪标签,从而降低蒸馏效果。本文中,我们系统分析了不同深度归一化策略对伪标签蒸馏的影响。基于我们的发现,我们提出了跨上下文蒸馏 (Cross-Context Distillation),该方法整合了全局和局部深度线索以提升伪标签质量。此外,我们引入了一个多教师蒸馏框架,该框架利用了不同深度估计模型的互补优势,从而生成更鲁棒和准确的深度预测。在基准数据集上的大量实验表明,我们的方法在定量和定性上均显著优于现有最先进的方法。

1. Introduction

1. 引言

Monocular depth estimation (MDE) predicts scene depth from a single RGB image, offering flexibility compared to stereo or multi-view methods. This makes MDE ideal for applications like autonomous driving and robotic navigation [10, 12, 16, 48, 28]. Recent research on zero-shot MDE models [34, 51, 43, 22] aims to handle diverse scenarios, but training such models requires large-scale, diverse depth data, which is often limited by the need for specialized equipment [29, 50]. A promising solution is using large-scale unlabeled data, which has shown success in tasks like classification and segmentation [25, 58, 44]. Studies like Depth Anything [46] highlight the effectiveness of using pseudo labels from teacher models for training student models.

单目深度估计 (Monocular Depth Estimation, MDE) 通过单张 RGB 图像预测场景深度,相比立体或多视角方法更具灵活性。这使得 MDE 成为自动驾驶和机器人导航等应用的理想选择 [10, 12, 16, 48, 28]。近期关于零样本 MDE 模型的研究 [34, 51, 43, 22] 旨在处理多样化的场景,但训练此类模型需要大规模、多样化的深度数据,而这通常受到专用设备需求的限制 [29, 50]。一个可行的解决方案是使用大规模未标注数据,这已在分类和分割等任务中取得了成功 [25, 58, 44]。像 Depth Anything [46] 这样的研究强调了使用教师模型的伪标签来训练学生模型的有效性。

To enable training on such a diverse, mixed dataset, most state-of-the-art methods [47, 37, 51] employ scale-and-shift invariant (SSI) depth representations for loss computation. This approach normalizes raw depth values within an image, making them invariant to scaling and shifting, and ensures that the model learns to focus on relative depth relationships rather than absolute values. The SSI representation facilitates the joint use of diverse depth data, thereby improving the model's ability to generalize across different scenes [35, 4]. Similarly, during evaluation, the metric depth of the prediction is recovered by solving for the unknown scale and shift coefficients of the predicted depth using least squares, so that standard evaluation metrics can be applied.

为了能够在如此多样化的混合数据集上进行训练,大多数最先进的方法 [47, 37, 51] 采用尺度和平移不变 (SSI) 深度表示进行损失计算。这种方法对图像中的原始深度值进行归一化,使其对缩放和平移保持不变,并确保模型学习关注相对深度关系而非绝对值。SSI 表示促进了多样化深度数据的联合使用,从而提高了模型在不同场景中的泛化能力 [35, 4]。同样,在评估过程中,通过使用最小二乘法求解预测深度的未知缩放和平移系数,恢复预测的度量深度,确保标准评估指标的应用。
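The evaluation-time alignment described above can be sketched as an ordinary least-squares fit of one scale and one shift per image. This is a minimal NumPy illustration, not the authors' evaluation code; the function name `align_scale_shift` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def align_scale_shift(pred, gt, mask=None):
    """Solve min over (s, t) of sum((s * pred + t - gt)^2) on valid pixels."""
    if mask is None:
        mask = np.ones_like(gt, dtype=bool)
    x = pred[mask].astype(np.float64)
    y = gt[mask].astype(np.float64)
    # Design matrix [pred, 1] so the solution vector is (scale, shift).
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * pred + t, s, t

# Synthetic check: a relative prediction that is an exact affine map of the ground truth.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(48, 64))
pred = (gt - 3.0) / 2.0  # so gt = 2 * pred + 3
aligned, s, t = align_scale_shift(pred, gt)
print(round(float(s), 6), round(float(t), 6))
```

Because the fit uses every valid pixel of the image, an error in one region changes the scale and shift applied everywhere else, which is the coupling the following paragraphs discuss.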

Despite its advantages, using SSI depth representation for pseudo-label distillation in MDE models presents several issues. Specifically, the inherent normalization process in SSI loss makes the depth prediction at a given pixel not only dependent on the teacher model's raw prediction at that location but also influenced by the depth values in other regions of the image. This becomes problematic because pseudo-labels inherently introduce noise. Even if certain local regions are predicted accurately, inaccuracies in other regions can negatively affect depth estimates after global normalization, leading to suboptimal distillation results. As shown in Fig. 2, we empirically demonstrate that normalizing depth maps globally tends to degrade the accuracy of local regions, as compared to only applying normalization within localized regions during evaluation.

尽管有其优势,但在 MDE 模型中使用 SSI 深度表示进行伪标签蒸馏存在一些问题。具体来说,SSI 损失中的固有归一化过程使得给定像素的深度预测不仅依赖于教师模型在该位置的原始预测,还受到图像其他区域深度值的影响。这带来了问题,因为伪标签本质上引入了噪声。即使某些局部区域预测准确,其他区域的不准确性也会在全局归一化后对深度估计产生负面影响,导致蒸馏结果不理想。如图 2 所示,我们通过实验证明,与仅在评估时对局部区域应用归一化相比,全局归一化深度图往往会降低局部区域的准确性。

Building on this insight, in this paper, we first investigate the issue of depth normalization in pseudo-label distillation. We begin by analyzing various depth normalization strategies, including global normalization, local normalization, hybrid global-local approaches, and the absence of normalization. Through empirical experiments, we explore how each technique affects the performance of various distillation designs, especially when using pseudo-labels for training. Our analysis provides valuable insights into how different normalization methods influence the MDE loss function and distillation outcomes, offering a set of best practices for optimizing performance in diverse scenarios.

基于这一洞察,本文首先研究了伪标签蒸馏中的深度归一化问题。我们分析了多种深度归一化策略,包括全局归一化、局部归一化、全局-局部混合方法以及不进行归一化的情况。通过实验,我们探讨了每种技术如何影响不同蒸馏设计的性能,特别是在使用伪标签进行训练时。我们的分析为不同归一化方法如何影响MDE损失函数和蒸馏结果提供了有价值的见解,并为优化各种场景下的性能提供了一套最佳实践。

Building on this empirical foundation, we introduce a Cross-Context Distillation method, designed to distill knowledge from the teacher model more effectively. We are motivated by our finding that local regions, when used for distillation, produce pseudo-labels that capture higher-quality depth details, improving the student model's depth estimation accuracy. However, focusing solely on local regions might overlook the broader contextual relationships in the image. To address the issue, we combine local and global inputs within a unified distillation framework. By combining the context-specific advantages of local distillation with the broader understanding provided by global methods, our

基于这一实证基础,我们引入了一种跨上下文蒸馏方法,旨在更有效地从教师模型中提取知识。我们的动机是发现,当局部区域用于蒸馏时,生成的伪标签能够捕捉到更高质量的深度细节,从而提高了学生模型的深度估计准确性。然而,仅关注局部区域可能会忽略图像中更广泛的上下文关系。为了解决这一问题,我们在统一的蒸馏框架中结合了局部和全局输入。通过将局部蒸馏的上下文特定优势与全局方法提供的更广泛理解相结合,我们的

Furthermore, we propose a multi-teacher distillation framework that leverages the complementary strengths of multiple depth estimation models. Our design is motivated by a recent observation: diffusion-based models, benefiting from large-scale image priors, excel at capturing fine-grained details but are computationally expensive, while encoder-decoder models provide higher accuracy and efficiency but are comparatively weak at fine-detail reconstruction. To harness these strengths, we randomly select different models to generate pseudo-labels, and then supervise the student model based on these labels. This operation enables the student model to learn from the detailed depth information of diffusion-based models while benefiting from the precision of encoder-decoder models.

此外,我们提出了一种多教师蒸馏框架,该框架利用了多个深度估计模型的互补优势。我们的设计灵感来自于对最新进展的观察,即基于扩散的模型受益于大规模图像先验,擅长捕捉细粒度细节,但计算成本高,而编码器-解码器模型提供更高的准确性和效率,但在细粒度重建方面相对不足。为了利用这些优势,我们随机选择不同的模型生成伪标签,然后基于这些标签监督学生模型。这一操作使学生模型能够从基于扩散的模型的详细深度信息中学习,同时受益于编码器-解码器模型的精度。

To validate the effectiveness of our design, we conduct extensive experiments on various benchmark datasets. The empirical results show that our method significantly outperforms existing baselines qualitatively and quantitatively. The contributions can be summarized below:

为了验证我们设计的有效性,我们在各种基准数据集上进行了广泛的实验。实证结果表明,我们的方法在定性和定量上均显著优于现有基线。贡献可以总结如下:

2. Related Work

2. 相关工作

2.1. Monocular Depth Estimation

2.1. 单目深度估计

Monocular depth estimation (MDE) has evolved from hand-crafted methods to deep learning, significantly improving accuracy [10, 27, 11, 15, 57, 35]. Architectural refinements, such as multi-scale designs and attention mechanisms, have further enhanced feature extraction [19, 5, 56]. However, most models remain reliant on labeled data and struggle to generalize across diverse environments. Zero-shot MDE improves generalization by leveraging large-scale datasets, geometric constraints, and multi-task learning [34, 51, 53, 55]. Metric depth estimation incorporates intrinsic data for absolute depth learning [2, 52, 20, 42], while generative models such as Marigold refine depth details using diffusion priors [22, 45]. Despite these advances, effectively utilizing unlabeled data remains a challenge due to pseudo-label noise and inconsistencies across different contexts. Depth Anything [47] explores large-scale unlabeled data but struggles with pseudo-label reliability. Patch Fusion [8, 30] improves depth estimation by refining high-resolution image representations but lacks adaptability in generative settings. To address these issues, we propose Cross-Context and Multi-Teacher Distillation, which enhances pseudo-label supervision by leveraging diverse contextual information and multiple expert models, improving both accuracy and generalization ability.

单目深度估计 (MDE) 从手工设计方法发展到深度学习,显著提高了精度 [10, 27, 11, 15, 57, 35]。多尺度设计和注意力机制等架构改进进一步增强了特征提取能力 [19, 5, 56]。然而,大多数模型仍然依赖于标注数据,并且难以在不同环境中泛化。零样本 MDE 通过利用大规模数据集、几何约束和多任务学习来提高泛化能力 [34, 51, 53, 55]。度量深度估计结合了内在数据进行绝对深度学习 [2, 52, 20, 42],而生成式模型如 Marigold 则利用扩散先验来优化深度细节 [22, 45]。尽管取得了这些进展,但由于伪标签噪声和不同上下文之间的不一致性,有效利用未标注数据仍然是一个挑战。Depth Anything [47] 探索了大规模未标注数据,但在伪标签可靠性方面存在困难。Patch Fusion [8, 30] 通过优化高分辨率图像表示来改进深度估计,但在生成式环境中缺乏适应性。为了解决这些问题,我们提出了跨上下文和多教师蒸馏,通过利用多样化的上下文信息和多个专家模型来增强伪标签监督,从而提高精度和泛化能力。

2.2. Semi-supervised Monocular Depth Estimation

2.2. 半监督单目深度估计

Semi-supervised depth estimation has gained attention by utilizing temporal consistency to better use unlabeled data [26, 17]. Some methods [1, 40, 6, 49, 14] apply stereo geometric constraints, enforcing left-right consistency to enhance depth accuracy, while others use additional supervision like semantic priors [33, 18] or GANs, such as DepthGAN [21]. However, these approaches are limited by their reliance on temporal cues or stereo constraints, restricting their applicability. Recent work [32] explored pseudo-labeling for semi-supervised MDE but lacks generative modeling capabilities. Depth Anything [46] demonstrated the potential of large-scale unlabeled data, though pseudo-label reliability remains challenging. In contrast, our approach improves pseudo-label reliability and enhances MDE accuracy, relying solely on unlabeled data without additional constraints.

半监督深度估计通过利用时间一致性更好地利用未标记数据 [26, 17],已引起关注。一些方法 [1, 40, 6, 49, 14] 应用立体几何约束,通过左右一致性来提高深度精度,而其他方法则使用语义先验 [33, 18] 或 GAN(如 DepthGAN [21])等额外监督。然而,这些方法受限于其对时间线索或立体约束的依赖,限制了其适用性。最近的工作 [32] 探索了半监督 MDE 的伪标记方法,但缺乏生成建模能力。Depth Anything [46] 展示了大规模未标记数据的潜力,尽管伪标签的可靠性仍具挑战性。相比之下,我们的方法提高了伪标签的可靠性并增强了 MDE 的准确性,仅依靠未标记数据且无需额外约束。

3. Method

3. 方法

In this section, we introduce a novel distillation framework designed to leverage unlabeled images for training zero-shot Monocular Depth Estimation (MDE) models. We begin by exploring various depth normalization techniques in Section 3.1, followed by detailing our proposed distillation method in Section 3.2, which combines predictions across multiple contexts. The overall framework is illustrated in Fig. 3. Finally, we describe a multi-teacher distillation mechanism in Section 3.2 that integrates diverse depth estimators as teacher models to train the student model.

在本节中,我们介绍了一种新颖的蒸馏框架,旨在利用未标注的图像来训练零样本单目深度估计 (Monocular Depth Estimation, MDE) 模型。我们首先在第3.1节中探讨了各种深度归一化技术,随后在第3.2节中详细介绍了我们提出的结合多上下文预测的蒸馏方法。整体框架如图3所示。最后,在第3.2节中,我们描述了一种多教师蒸馏机制,该机制整合了多种深度估计器作为教师模型来训练学生模型。

3.1. Depth Normalization

3.1. 深度归一化

Depth normalization is a crucial component of our framework as it adjusts the pseudo-depth labels ${\bf d}^{t}$ from the teacher model and the depth predictions $\mathbf{d}^{s}$ from the student model for effective loss computation. To understand the influence of normalization techniques on distillation performance, we systematically analyze several approaches commonly employed in prior works. These strategies are visually illustrated in Fig. 4.

深度归一化是我们框架中的一个关键组件,它调整来自教师模型的伪深度标签 ${\bf d}^{t}$ 和来自学生模型的深度预测 $\mathbf{d}^{s}$,以便进行有效的损失计算。为了理解归一化技术对蒸馏性能的影响,我们系统分析了先前工作中常用的几种方法。这些策略在图 4 中进行了可视化展示。


(b) Metrics for the different least-squares alignment strategies

图 2 (b): 不同最小二乘对齐策略的指标

Figure 2: Issue with Global Normalization (SSI). In (a), we compare two alignment strategies for the central $w/2,h/2$ region: (1) Global Least-Square, where alignment is applied to the full image before cropping, and (2) Local Least-Square, where alignment is performed on the cropped region. Metrics are computed on the cropped region. As shown in (b), the local strategy performs better, which demonstrates that global normalization degrades local accuracy compared to local normalization.

图 2: 全局归一化 (SSI) 的问题。在 (a) 中,我们比较了两种针对中心 $w/2,h/2$ 区域的对齐策略:(1) 全局最小二乘法,即在裁剪前对整个图像进行对齐;(2) 局部最小二乘法,即在裁剪区域上进行对齐。指标在裁剪区域上计算。如 (b) 所示,表现更优的局部策略表明,与局部归一化相比,全局归一化会降低局部精度。
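The comparison in Figure 2 can be reproduced in miniature on synthetic data. The sketch below is only an illustration under stated assumptions: the depth maps are random, the "teacher error" is a constant offset added to rows outside the central crop, and the error on the crop is measured with mean squared error (the quantity least squares minimizes) rather than the paper's metrics.

```python
import numpy as np

def fit_scale_shift(pred, gt):
    """Least-squares scale s and shift t so that s * pred + t approximates gt."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s, t

def center_crop(x):
    """Central region of size (h/2, w/2)."""
    h, w = x.shape
    return x[h // 4: h // 4 + h // 2, w // 4: w // 4 + w // 2]

rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(64, 64))
pred = 0.5 * gt + 1.0 + rng.normal(0.0, 0.05, size=gt.shape)
pred[:8, :] += 3.0  # synthetic error confined to rows outside the central crop

pred_c, gt_c = center_crop(pred), center_crop(gt)
s_g, t_g = fit_scale_shift(pred, gt)      # global: fit on the full image, then crop
s_l, t_l = fit_scale_shift(pred_c, gt_c)  # local: fit on the crop only

mse_global = float(np.mean((s_g * pred_c + t_g - gt_c) ** 2))
mse_local = float(np.mean((s_l * pred_c + t_l - gt_c) ** 2))
print(mse_global, mse_local)
```

The local fit can never have a larger squared error on the crop than the global fit, since it minimizes exactly that quantity; the corrupted rows only affect the global coefficients.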

Global Normalization: The first strategy we examine is the global normalization [46, 47] used in recent distillation methods. Global normalization [34] adjusts depth predictions using global statistics of the entire depth map. This strategy aims to ensure scale-and-shift invariance by normalizing depth values based on the median and mean absolute deviation of the depth map. For each pixel $i$ , the normalized depth for the student model and pseudo-labels are computed as:

全局归一化:我们研究的第一个策略是近期蒸馏方法中使用的全局归一化 [46, 47]。全局归一化 [34] 通过使用整个深度图的全局统计量来调整深度预测。该策略旨在通过基于深度图的中位数和平均绝对偏差对深度值进行归一化,以确保尺度和位移不变性。对于每个像素 $i$,学生模型和伪标签的归一化深度计算如下:

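$$
\mathcal{N}(d_{i}^{s})=\frac{d_{i}^{s}-\mathrm{med}(\mathbf{d}^{s})}{\frac{1}{M}\sum_{j=1}^{M}\left|d_{j}^{s}-\mathrm{med}(\mathbf{d}^{s})\right|},\qquad
\mathcal{N}(d_{i}^{t})=\frac{d_{i}^{t}-\mathrm{med}(\mathbf{d}^{t})}{\frac{1}{M}\sum_{j=1}^{M}\left|d_{j}^{t}-\mathrm{med}(\mathbf{d}^{t})\right|}
$$

where $\mathrm{med}(\mathbf{d}^{s})$ and $\mathrm{med}(\mathbf{d}^{t})$ are the medians of the predicted depth and pseudo depth, respectively, and $M$ is the number of valid pixels. The final regression loss for distillation is computed as the average absolute difference between the normalized predicted depth and the normalized pseudo depth across all $M$ valid pixels:

其中 $\mathrm{med}(\mathbf{d}^{s})$ 和 $\mathrm{med}(\mathbf{d}^{t})$ 分别是预测深度和伪深度的中位数,$M$ 为有效像素的数量。最终的蒸馏回归损失计算为所有 $M$ 个有效像素上归一化预测深度与归一化伪深度之间的平均绝对差:

$$
\mathcal{L}_{\mathrm{Dis}}=\frac{1}{M}\sum_{i=1}^{M}\left|\mathcal{N}(d_{i}^{s})-\mathcal{N}(d_{i}^{t})\right|
$$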
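A minimal NumPy sketch of global normalization and the resulting loss follows. The function names and the small epsilon guarding against a zero deviation are assumptions for illustration; the statistics (median and mean absolute deviation over valid pixels) follow the description above.

```python
import numpy as np

def global_normalize(d, mask):
    """Subtract the median and divide by the mean absolute deviation of valid pixels."""
    vals = d[mask]
    med = np.median(vals)
    mad = np.mean(np.abs(vals - med))
    return (d - med) / max(mad, 1e-6)  # epsilon is an assumption to avoid division by zero

def global_norm_loss(d_s, d_t, mask):
    """Mean absolute difference of the normalized maps over valid pixels."""
    n_s = global_normalize(d_s, mask)
    n_t = global_normalize(d_t, mask)
    return float(np.mean(np.abs(n_s[mask] - n_t[mask])))

rng = np.random.default_rng(0)
d_t = rng.uniform(1.0, 10.0, size=(32, 32))  # stand-in for teacher pseudo-labels
mask = np.ones_like(d_t, dtype=bool)
loss_affine = global_norm_loss(3.0 * d_t + 2.0, d_t, mask)  # student differs by scale and shift only
loss_noisy = global_norm_loss(d_t + rng.normal(0.0, 0.5, d_t.shape), d_t, mask)
print(loss_affine, loss_noisy)
```

A student that differs from the teacher only by a positive scale and a shift incurs essentially zero loss, which is the invariance the normalization is meant to provide.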

Hybrid Normalization: In contrast to global normalization, Hierarchical Depth Normalization [54] employs a hybrid normalization approach by integrating both global and local depth information. This strategy is designed to preserve both the global structure and local geometry in the depth map. The process begins by dividing the depth range into $S$ segments, where $S$ is selected from $\{1,2,4\}$. When $S=1$, the entire depth range is normalized globally, treating all pixels as part of a single context, akin to global normalization. In the case of $S=2$, the depth range is divided into two segments, with each pixel being normalized within one of these two local contexts. Similarly, for $S=4$, the depth range is split into four segments, allowing normalization to be performed within smaller, localized contexts. By adapting the normalization process to multiple levels of granularity, hybrid normalization achieves a balance between global coherence and local adaptability. For each context $u$, the normalized depth values for the student model $\mathcal{N}_{u}(d_{i}^{s})$ and pseudo-labels $\mathcal{N}_{u}(d_{i}^{t})$ are calculated within the corresponding depth range. The loss for each pixel $i$ is then computed by averaging the losses across all contexts $U_{i}$ to which the pixel belongs:

混合归一化:与全局归一化不同,层次深度归一化 (Hierarchical Depth Normalization) [54] 采用了一种混合归一化方法,通过整合全局和局部深度信息来保留深度图中的全局结构和局部几何特征。该过程首先将深度范围划分为 $S$ 个段,其中 $S$ 从 $\{1,2,4\}$ 中选择。当 $S=1$ 时,整个深度范围被全局归一化,将所有像素视为单一上下文的一部分,类似于全局归一化。当 $S=2$ 时,深度范围被划分为两个段,每个像素在其中一个局部上下文中归一化。类似地,当 $S=4$ 时,深度范围被划分为四个段,允许在更小的局部上下文中进行归一化。通过在多粒度级别上调整归一化过程,混合归一化在全局一致性和局部适应性之间取得了平衡。对于每个上下文 $u$,学生模型的归一化深度值 $\mathcal{N}_{u}(d_{i}^{s})$ 和伪标签 $\mathcal{N}_{u}(d_{i}^{t})$ 在相应的深度范围内计算。然后,通过平均像素 $i$ 所属的所有上下文 $U_{i}$ 的损失来计算每个像素 $i$ 的损失:


Figure 3: Overview of Cross-Context Distillation. Our method combines local and global depth information to enhance the student model's predictions. It includes two scenarios: (1) Shared-Context Distillation, where both models use the same image for distillation; and (2) Local-Global Distillation, where the teacher predicts depth for overlapping patches while the student predicts the full image. The Local-Global loss $\mathcal{L}_{\mathrm{lg}}$ (Top Right) ensures consistency between local and global predictions, enabling the student to learn both fine details and broad structures, improving accuracy and robustness.

图 3: 跨上下文蒸馏概述。我们的方法结合了局部和全局深度信息,以增强学生模型的预测。它包括两种场景:(1) 共享上下文蒸馏,其中两个模型使用同一张图像进行蒸馏;(2) 局部-全局蒸馏,其中教师模型预测重叠补丁的深度,而学生模型预测整个图像。局部-全局损失 $\mathcal{L}_{\mathrm{lg}}$(右上角)确保了局部和全局预测之间的一致性,使学生能够学习到细节和整体结构,从而提高准确性和鲁棒性。


Figure 4: Normalization Strategies. We compare four normalization strategies: Global Norm [34], Hybrid Norm [54], Local Norm, and No Norm. The figure visualizes how each strategy processes pixels within the normalization region (Norm. Area). The red dot represents any pixel within the region.

图 4: 归一化策略。我们比较了四种归一化策略:全局归一化 (Global Norm) [34]、混合归一化 (Hybrid Norm) [54]、局部归一化 (Local Norm) 和无归一化 (No Norm)。该图展示了每种策略如何处理归一化区域 (Norm. Area) 内的像素。红点代表区域内的任意像素。

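$$
\mathcal{L}_{\mathrm{Dis}}^{i}=\frac{1}{|U_{i}|}\sum_{u\in U_{i}}\left|\mathcal{N}_{u}(d_{i}^{s})-\mathcal{N}_{u}(d_{i}^{t})\right|
$$

where $|U_{i}|$ denotes the total number of groups (or contexts) that pixel $i$ is associated with. To obtain the final loss $\mathcal{L}_{\mathrm{Dis}}$, we average the pixel-wise losses across all $M$ valid pixels:

其中 $|U_{i}|$ 表示像素 $i$ 关联的组(或上下文)的总数。为了获得最终损失 $\mathcal{L}_{\mathrm{Dis}}$,我们在所有 $M$ 个有效像素上对逐像素损失进行平均:

$$
\mathcal{L}_{\mathrm{Dis}}=\frac{1}{M}\sum_{i=1}^{M}\mathcal{L}_{\mathrm{Dis}}^{i}
$$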

Local Normalization: In addition to global and hybrid normalization, we investigate Local Normalization, a strategy that focuses exclusively on the finest-scale groups used in hybrid normalization. This approach isolates the smallest local contexts for normalization, emphasizing the preservation of fine-grained depth details without considering hierarchical or global scales. Local normalization operates by dividing the depth range into the smallest groups, corresponding to $S=4$ in the hybrid normalization framework, and each pixel is normalized within its local context. The loss for each pixel $i$ is computed using a similar formulation as in hybrid normalization, but with $u^{i}$ now representing the local context for pixel $i$, defined by the smallest four-part group:

局部归一化 (Local Normalization):除了全局归一化和混合归一化,我们还研究了局部归一化,这是一种专注于混合归一化中最细尺度组的策略。该方法将最小局部上下文隔离进行归一化,强调在不考虑层次或全局尺度的情况下保留细粒度的深度细节。局部归一化通过将深度范围划分为最小的组来操作,对应于混合归一化框架中的 $S=4$,每个像素在其局部上下文中进行归一化。每个像素 $i$ 的损失使用与混合归一化类似的公式计算,但 $u^{i}$ 现在表示像素 $i$ 的局部上下文,由最小的四部分组定义:


Figure 5: Different Inputs Lead to Different Pseudo Labels. Global Depth: The teacher model predicts depth using the entire image, and the local region's prediction is cropped from the output. Local Depth: The teacher model directly takes the cropped local region as input, resulting in more refined and detailed depth estimates for that area, capturing finer details compared to using the entire image.

图 5: 不同输入导致不同的伪标签。全局深度:教师模型使用整张图像预测深度,局部区域的预测是从输出中裁剪的。局部深度:教师模型直接将裁剪的局部区域作为输入,从而得到该区域更精细和详细的深度估计,与使用整张图像相比,能够捕捉到更细微的细节。

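$$
\mathcal{L}_{\mathrm{Dis}}^{i}=\left|\mathcal{N}_{u^{i}}(d_{i}^{s})-\mathcal{N}_{u^{i}}(d_{i}^{t})\right|
$$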
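Hybrid and local normalization can be sketched with one function that takes the set of segment counts as a parameter. Several details below are assumptions rather than the method of [54] or of the authors: contexts are equal-width bins over the teacher's depth range, each context reuses the median and mean-absolute-deviation statistics of global normalization, and contexts with fewer than two pixels are skipped. The name `segment_norm_loss` is hypothetical.

```python
import numpy as np

def _normalize(vals):
    med = np.median(vals)
    mad = np.mean(np.abs(vals - med))
    return (vals - med) / max(mad, 1e-6)

def segment_norm_loss(d_s, d_t, mask, levels=(1, 2, 4)):
    """levels=(1, 2, 4) sketches hybrid normalization; levels=(4,) sketches local normalization."""
    s = d_s[mask].astype(np.float64)
    t = d_t[mask].astype(np.float64)
    per_pixel = np.zeros_like(t)  # summed loss over the contexts of each pixel
    counts = np.zeros_like(t)     # |U_i|: number of contexts each pixel belongs to
    lo, hi = t.min(), t.max()
    for S in levels:
        edges = np.linspace(lo, hi, S + 1)
        # Assumption: contexts are equal-width bins of the teacher depth range.
        idx = np.searchsorted(edges[1:-1], t, side="right")
        for b in range(S):
            sel = idx == b
            if sel.sum() < 2:
                continue
            per_pixel[sel] += np.abs(_normalize(s[sel]) - _normalize(t[sel]))
            counts[sel] += 1
    valid = counts > 0
    return float(np.mean(per_pixel[valid] / counts[valid]))

rng = np.random.default_rng(0)
d_t = rng.uniform(1.0, 10.0, size=(32, 32))
mask = np.ones_like(d_t, dtype=bool)
loss_hybrid = segment_norm_loss(2.0 * d_t + 1.0, d_t, mask)
loss_local = segment_norm_loss(2.0 * d_t + 1.0, d_t, mask, levels=(4,))
loss_noisy = segment_norm_loss(d_t + rng.normal(0.0, 0.5, d_t.shape), d_t, mask)
print(loss_hybrid, loss_local, loss_noisy)
```

With `levels=(4,)` every pixel belongs to exactly one context, so the per-pixel average reduces to the single-term local loss above.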

No Normalization: As a baseline, we also consider a direct depth regression approach with no explicit normalization. The absolute difference between raw student predictions and teacher pseudo-labels is used for loss computation:

无归一化:作为基线,我们还考虑了一种没有显式归一化的直接深度回归方法。使用原始学生预测与教师伪标签之间的绝对差异来计算损失:

$$
\mathcal{L}_{\mathrm{Dis}}=\frac{1}{M}\sum_{i=1}^{M}\left|d_{i}^{s}-d_{i}^{t}\right|
$$

This approach eliminates the need for normalization, assuming pseudo-depth labels naturally reside in the same domain as predictions. It provides insight into whether normalization enhances distillation effectiveness or if raw depth supervision suffices.

这种方法消除了归一化的需求,假设伪深度标签自然与预测结果处于同一域中。它提供了洞察,说明归一化是否增强了蒸馏效果,或者原始的深度监督是否足够。

3.2. Distillation Pipeline

3.2. 蒸馏管道

In this section, we introduce an enhanced distillation pipeline that integrates two complementary strategies: Cross-Context Distillation and Multi-Teacher Distillation. Both strategies aim to improve the quality of pseudo-label distillation, enhance the model's fine-grained perception, and boost generalization across diverse scenarios.

在本节中,我们介绍了一种增强的蒸馏管道,它集成了两种互补策略:跨上下文蒸馏 (Cross-Context Distillation) 和多教师蒸馏 (Multi-Teacher Distillation)。这两种策略旨在提高伪标签蒸馏的质量,增强模型的细粒度感知,并提升跨不同场景的泛化能力。

Cross-context Distillation. A key challenge in monocular depth distillation is the trade-off between local detail preservation and global depth consistency. As shown in Fig. 5, providing a local crop of an image as input to the teacher model enhances fine-grained details in the pseudo-depth labels, but it may fail to capture the overall scene structure. Conversely, using the entire image as input preserves the global depth structure but often lacks fine details. To address this limitation, we propose Cross-Context Distillation, a method that enables the student model to learn both local details and global structures simultaneously. Cross-context distillation consists of two key strategies:

跨上下文蒸馏。单目深度蒸馏的一个关键挑战是在局部细节保留和全局深度一致性之间的权衡。如图 5 所示,将图像的局部裁剪作为教师模型的输入可以增强伪深度标签中的细粒度细节,但可能无法捕捉整体场景结构。相反,使用整个图像作为输入可以保留全局深度结构,但往往缺乏细节。为了解决这一限制,我们提出了跨上下文蒸馏,这是一种使学生模型能够同时学习局部细节和全局结构的方法。跨上下文蒸馏包含两个关键策略:

  1. Shared-Context Distillation: In this setup, both the teacher and student models receive the same cropped region of the image as input. Instead of using the full image, we randomly sample a local patch of varying size from the original image and provide it as input to both models. This encourages the student model to learn from the teacher model across different spatial contexts, improving its ability to generalize to varying scene structures. For the shared-context loss, the teacher and student models receive identical inputs and each produce a depth prediction, denoted $\mathbf{d}_{\mathrm{local}}^{t}$ and $\mathbf{d}_{\mathrm{local}}^{s}$, respectively:

共享上下文蒸馏 (Shared-Context Distillation):在此设置中,教师模型和学生模型都接收图像的相同裁剪区域作为输入。我们不是使用完整图像,而是从原始图像中随机采样不同大小的局部区域,并将其作为输入提供给两个模型。这鼓励学生模型在不同的空间上下文中从教师模型中学习,从而提高其泛化到不同场景结构的能力。对于共享上下文蒸馏的损失,教师模型和学生模型接收相同的输入并各自生成深度预测,分别表示为 $\mathbf{d}_{\mathrm{local}}^{t}$ 和 $\mathbf{d}_{\mathrm{local}}^{s}$:

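$$
\mathcal{L}_{\mathrm{sc}}=\mathcal{L}_{\mathrm{Dis}}\left(\mathbf{d}_{\mathrm{local}}^{s},\mathbf{d}_{\mathrm{local}}^{t}\right)
$$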

This loss encourages the student model to refine its fine-grained predictions by directly aligning with the teacher's outputs at local scales.

这种损失鼓励学生模型通过在局部尺度上直接与教师模型的输出对齐来优化其细粒度预测。
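A shared-context step can be sketched as follows. The `teacher` and `student` callables are hypothetical stand-ins that map an H×W×3 image to an H×W depth map, the loss is the plain absolute difference (the "No Norm." variant), and resizing the crop to the network input size is omitted; none of this is the authors' training code.

```python
import numpy as np

def sample_square_crop(h, w, min_side, rng):
    """Random square crop with side between min_side and the shortest image side."""
    short = min(h, w)
    side = int(rng.integers(min(min_side, short), short + 1))
    top = int(rng.integers(0, h - side + 1))
    left = int(rng.integers(0, w - side + 1))
    return top, left, side

def l1_loss(d_s, d_t):
    return float(np.mean(np.abs(d_s - d_t)))

def shared_context_loss(image, teacher, student, rng, min_side=64):
    h, w = image.shape[:2]
    top, left, side = sample_square_crop(h, w, min_side, rng)
    crop = image[top:top + side, left:left + side]
    # Teacher and student see exactly the same crop.
    return l1_loss(student(crop), teacher(crop))

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(120, 160, 3))
teacher = lambda crop: crop.mean(axis=-1)        # toy teacher (assumption)
student = lambda crop: crop.mean(axis=-1) + 0.1  # toy student with a constant offset
loss = shared_context_loss(image, teacher, student, rng)
top, left, side = sample_square_crop(120, 160, 64, rng)
print(loss, (top, left, side))
```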

  2. Local-Global Distillation: In this approach, the teacher and student models operate on different input contexts. The teacher model processes local cropped regions, generating fine-grained depth predictions, while the student model predicts a global depth map from the entire image. To ensure knowledge transfer, the teacher's local depth predictions supervise the corresponding overlapping regions in the student's global depth map. This strategy allows the student to integrate fine-grained local details into its holistic depth estimation. Formally, the teacher model produces multiple depth predictions for cropped regions, denoted as $\mathbf{d}_{\mathrm{local}_{n}}^{t}$, while the student generates a global depth map, $\mathbf{d}_{\mathrm{global}}^{s}$. The loss for Local-Global distillation is computed only over overlapping areas between the teacher's local predictions and the corresponding regions in the student's global depth map:

局部-全局蒸馏:在此方法中,教师模型和学生模型在不同输入上下文中操作。教师模型处理局部裁剪区域,生成细粒度的深度预测,而学生模型从整个图像预测全局深度图。为了确保知识传递,教师模型的局部深度预测监督学生模型全局深度图中相应的重叠区域。这种策略使学生能够将细粒度的局部细节整合到其整体深度估计中。形式上,教师模型为裁剪区域生成多个深度预测,记为 $\mathbf{d}_{\mathrm{local}_{n}}^{t}$,而学生模型生成全局深度图 $\mathbf{d}_{\mathrm{global}}^{s}$。局部-全局蒸馏的损失仅在教师模型的局部预测与学生模型的全局深度图的相应区域之间的重叠区域上计算:

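$$
\mathcal{L}_{\mathrm{lg}}=\frac{1}{N}\sum_{n=1}^{N}\mathcal{L}_{\mathrm{Dis}}\left(\mathrm{Crop}(\mathbf{d}_{\mathrm{global}}^{s}),\mathbf{d}_{\mathrm{local}_{n}}^{t}\right)
$$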

where $\mathrm{Crop}(\cdot)$ extracts the overlapping region from the student's depth prediction, and $N$ is the total number of sampled patches. This loss ensures that the student benefits from the detailed local supervision of the teacher model while maintaining global depth consistency. The total loss function integrates both local and cross-context losses along with additional constraints, including feature alignment and gradient preservation, as proposed in prior works [47]:

其中,$\mathrm{Crop}(\cdot)$ 从学生的深度预测中提取重叠区域,$N$ 是采样的补丁总数。该损失确保学生从教师模型的详细局部监督中受益,同时保持全局深度一致性。总损失函数结合了局部和跨上下文损失以及额外的约束,包括特征对齐和梯度保留,如先前工作 [47] 中提出的:


Figure 6: Multi-teacher Mechanism. We introduce a multi-teacher distillation approach, where pseudo-labels are generated from multiple teacher models. At each training iteration, one teacher is randomly selected to produce pseudo-labels for unlabeled images.

图 6: 多教师机制。我们引入了一种多教师蒸馏方法,其中伪标签由多个教师模型生成。在每次训练迭代中,随机选择一个教师模型为未标记的图像生成伪标签。

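$$
\mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{sc}}+\lambda_{1}\mathcal{L}_{\mathrm{lg}}+\lambda_{2}\mathcal{L}_{\mathrm{feat}}+\lambda_{3}\mathcal{L}_{\mathrm{grad}}
$$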

Here, $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are weighting factors that balance the different loss components. By incorporating cross-context supervision, this framework effectively allows the student model to integrate both fine-grained details from local crops and structural coherence from global depth maps.

在这里,$\lambda_{1}$、$\lambda_{2}$ 和 $\lambda_{3}$ 是平衡不同损失分量的权重因子。通过引入跨上下文监督,该框架有效地使学生模型能够整合来自局部裁剪的细粒度细节和来自全局深度图的结构一致性。
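The local-global term described above can be sketched as an average of per-patch losses, where the student's global map is cropped to each teacher patch. In this illustration each patch is normalized independently with median and mean-absolute-deviation statistics; that choice, the patch layout, and the function names are assumptions, not the authors' implementation.

```python
import numpy as np

def ssi_l1(d_s, d_t):
    """Absolute difference after median / mean-absolute-deviation normalization."""
    def norm(d):
        med = np.median(d)
        return (d - med) / max(float(np.mean(np.abs(d - med))), 1e-6)
    return float(np.mean(np.abs(norm(d_s) - norm(d_t))))

def local_global_loss(d_s_global, teacher_patches, loss_fn=ssi_l1):
    """teacher_patches: list of ((top, left, size), d_t_local) pairs."""
    losses = []
    for (top, left, size), d_t_local in teacher_patches:
        d_s_crop = d_s_global[top:top + size, left:left + size]  # Crop(.)
        losses.append(loss_fn(d_s_crop, d_t_local))
    return sum(losses) / len(losses)  # average over the N patches

rng = np.random.default_rng(0)
d_s_global = rng.uniform(1.0, 5.0, size=(32, 32))  # student prediction on the full image
boxes = [(0, 0, 16), (8, 8, 16), (16, 16, 16)]     # overlapping patches (hypothetical layout)
teacher_patches = [((t, l, s), 2.0 * d_s_global[t:t + s, l:l + s] + 1.0) for t, l, s in boxes]
loss_consistent = local_global_loss(d_s_global, teacher_patches)
noisy_patches = [(box, d + rng.normal(0.0, 0.5, d.shape)) for box, d in teacher_patches]
loss_noisy = local_global_loss(d_s_global, noisy_patches)
print(loss_consistent, loss_noisy)
```

Swapping `loss_fn` for a different normalization changes only the per-patch comparison; the cropping and averaging stay the same.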

Multi-teacher Distillation. In addition to cross-context distillation, we adopt a multi-teacher distillation strategy, illustrated in Fig. 6, to further enhance the quality and robustness of the distilled depth knowledge. This approach leverages multiple teacher models, each trained with distinct architectures, optimization strategies, or data distributions, to generate diverse pseudo-labels. By aggregating knowledge from multiple sources, the student model benefits from a richer and more generalized depth representation. Formally, given a set of pre-trained teacher models $\mathcal{M}_{1},\mathcal{M}_{2},\ldots,\mathcal{M}_{N}$, we employ a probabilistic teacher selection mechanism, where one teacher model is randomly selected at each training iteration to generate pseudo-labels for the input image. The inclusion of multiple teacher models allows the student to learn from a diverse set of predictions, effectively mitigating biases and limitations inherent to any single model.

多教师蒸馏。除了跨上下文蒸馏外,我们还采用了多教师蒸馏策略(如图 6 所示),以进一步提高蒸馏深度知识的质量和鲁棒性。该方法利用多个教师模型,每个模型通过不同的架构、优化策略或数据分布进行训练,从而生成多样化的伪标签。通过聚合来自多个来源的知识,学生模型可以从更丰富、更通用的深度表示中受益。形式化地,给定一组预训练的教师模型 $\mathcal{M}_{1},\mathcal{M}_{2},\ldots,\mathcal{M}_{N}$,我们采用概率教师选择机制,在每个训练迭代中随机选择一个教师模型来为输入图像生成伪标签。多个教师模型的引入使学生能够从多样化的预测中学习,有效减轻了任何单一模型固有的偏见和局限性。
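The per-iteration teacher selection can be sketched with the standard library. The teacher names, the toy callables, and the uniform selection probabilities are assumptions; the text only states that one teacher is selected at random per iteration.

```python
import random
from collections import Counter

def make_teacher_sampler(teachers, weights=None, seed=0):
    """Return a function that picks one (name, teacher) pair per call."""
    rng = random.Random(seed)
    names = list(teachers)
    def sample():
        # weights=None means uniform selection (an assumption for this sketch).
        name = rng.choices(names, weights=weights, k=1)[0]
        return name, teachers[name]
    return sample

# Toy stand-ins for the pre-trained teachers M_1 ... M_N.
teachers = {
    "diffusion_teacher": lambda image: "pseudo-label from diffusion teacher",
    "encoder_decoder_teacher": lambda image: "pseudo-label from encoder-decoder teacher",
}
sample_teacher = make_teacher_sampler(teachers)

usage = Counter()
for step in range(1000):
    name, teacher = sample_teacher()
    pseudo_label = teacher(None)  # in training, this would supervise the student for this step
    usage[name] += 1
print(dict(usage))
```

Passing non-uniform `weights` would bias training toward one teacher, which is one way to trade detail against accuracy if that were desired.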

4. Experiment

4. 实验

4.1. Experimental Settings

4.1. 实验设置

Datasets. To evaluate the effectiveness of our proposed distillation framework, we follow the methodology outlined in Depth Anything v2 [47]. Specifically, we conduct our distillation experiments using a subset of 200,000 samples from the SA-1B dataset [24].

数据集。为了评估我们提出的蒸馏框架的有效性,我们遵循了Depth Anything v2 [47]中概述的方法。具体来说,我们使用SA-1B数据集 [24] 中的200,000个样本子集进行蒸馏实验。

For evaluation, we assess the performance of the distilled student model on five widely used depth estimation benchmarks, ensuring that these datasets remain unseen during training to enable a robust zero-shot evaluation. The chosen benchmarks include: NYUv2 [39], KITTI [13], ETH3D [38], ScanNet [7], and DIODE [41]. Additional dataset details are provided in the Appendix.

为了评估,我们在五个广泛使用的深度估计基准上测试了蒸馏学生模型的性能,确保这些数据集在训练期间未被使用,以便进行稳健的零样本评估。选择的基准包括:NYUv2 [39]、KITTI [13]、ETH3D [38]、ScanNet [7] 和 DIODE [41]。更多数据集详细信息请参见附录。

Metrics. We assess depth estimation performance using two key metrics: the mean absolute relative error (AbsRel) and $\delta_{1}$ accuracy. Following previous studies [34, 52, 22] on zero-shot MDE, we align predictions with ground truth in both scale and shift before evaluation.

指标。我们使用两个关键指标来评估深度估计性能:平均绝对相对误差(AbsRel)和 $\delta_{1}$ 准确率。根据之前关于零样本 MDE 的研究 [34, 52, 22],我们在评估之前将预测结果与真实值在尺度和偏移上进行对齐。
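The two metrics can be sketched in a few lines. The $\delta_{1}$ threshold of 1.25 is the convention commonly used in the depth estimation literature and is an assumption here, since the text does not state it; predictions are assumed to be already aligned to the ground truth in scale and shift.

```python
import numpy as np

def abs_rel(pred, gt, mask):
    """Mean absolute relative error over valid pixels."""
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def delta1(pred, gt, mask, thresh=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) below the threshold."""
    ratio = np.maximum(pred[mask] / gt[mask], gt[mask] / pred[mask])
    return float(np.mean(ratio < thresh))

gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.1, 2.0, 6.0, 8.0])
mask = gt > 0
print(abs_rel(pred, gt, mask), delta1(pred, gt, mask))
```

In this toy case the relative errors are 0.1, 0, 0.5 and 0, and three of the four ratios fall below 1.25.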

Implementation. Our experiments use state-of-the-art monocular depth estimation models as teachers to generate pseudo-labels, supervising various student models in a distillation framework with only RGB images as input. In shared-context distillation, both teacher and student receive the same global region, extracted via random cropping from the original image. The crop maintains a 1:1 aspect ratio and is sampled within a range from 644 pixels to the shortest side of the image, then resized to $560\times560$ for local predictions. In local-global distillation, the global region is cropped into overlapping local patches, each sized $560\times560$, for the teacher model to predict pseudo-labels. We use GenPercept [45] and Depth Anything v2 (DAv2) [47] as teacher models for the multi-teacher mechanism. The learning rate follows that of the corresponding student model. For DAv2 [47], the decoder learning rate is set to $5\times10^{-5}$. For the total loss function, we set the parameters as follows: $\lambda_{1}=0.5$, $\lambda_{2}=1.0$, and $\lambda_{3}=2.0$.

实现。我们的实验使用最先进的单目深度估计模型作为教师模型生成伪标签,在仅以RGB图像为输入的蒸馏框架中监督各种学生模型。在共享上下文蒸馏中,教师和学生都接收相同的全局区域,该区域通过从原始图像中随机裁剪提取。裁剪保持1:1的宽高比,并在644像素到图像最短边之间的范围内采样,然后调整为 $560\times560$ 以进行局部预测。在局部-全局蒸馏中,全局区域被裁剪为重叠的局部块,每个块大小为 $560\times560$,供教师模型预测伪标签。我们使用GenPercept [45]和Depth Anything v2 (DAv2) [47]作为多教师机制的教师模型。学习率与相应的学生模型保持一致。对于DAv2 [47],解码器学习率设置为 $5\times10^{-5}$。对于总损失函数,我们设置参数如下: $\lambda_{1}=0.5$, $\lambda_{2}=1.0$ 和 $\lambda_{3}=2.0$。
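Cutting a region into overlapping $560\times560$ patches can be sketched as a sliding window. The stride of 280 pixels, the 1000-pixel region size, and the rule that the last patch sits flush with the border are assumptions for illustration; the text gives only the patch size.

```python
def tile_starts(length, patch=560, stride=280):
    """Start offsets of patches along one axis, covering [0, length)."""
    if length < patch:
        raise ValueError("region is smaller than the patch size")
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # final patch flush with the border
    return starts

def overlapping_patches(h, w, patch=560, stride=280):
    """(top, left, size) boxes for every patch in the grid."""
    return [(top, left, patch)
            for top in tile_starts(h, patch, stride)
            for left in tile_starts(w, patch, stride)]

boxes = overlapping_patches(1000, 1000)
print(len(boxes), boxes[0], boxes[-1])
```

Each box can be passed to the teacher as a local input and paired with the matching crop of the student's full-image prediction.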

4.2. Analysis

4.2. 分析

For the ablation study and analysis, we sample a subset of 50K images from SA-1B [24] as our training data, with an input image size of $560\times560$ for the network. We conduct experiments on two of the most challenging benchmarks, DIODE [41] and ETH3D [38], which include both indoor and outdoor scenes. The model was trained with a batch size of 4 and converged after approximately 20,000 iterations on a single NVIDIA V100 GPU.

为了进行消融研究和分析,我们从 SA-1B [24] 中采样了 50K 张图像作为训练数据,网络的输入图像尺寸为 $560\times560$。我们在两个最具挑战性的基准 DIODE [41] 和 ETH3D [38] 上进行了实验,这些基准包括室内和室外场景。模型以批大小为 4 进行训练,并在单个 NVIDIA V100 GPU 上经过大约 20,000 次迭代后收敛。

Impact of Normalization across Cross-Context Distillation. We evaluate the effect of different normalization strategies on Cross-Context Distillation, as shown in Table 1. The results indicate that the optimal normalization method varies across different distillation strategies. For shared-context distillation, no normalization achieves the best performance, assuming that pseudo-depth labels naturally reside in the same domain as predictions. For Local-Global distillation, Hybrid Normalization proves most effective, maintaining consistent depth predictions across regions through hierarchical normalization within specific depth ranges.

跨上下文蒸馏中归一化的影响

Ablation Study of Cross-Context Distillation. To further validate the effectiveness of our distillation framework, we conduct ablation studies by removing Shared-Context Distillation and Local-Global Distillation in Table 2. The results show that both components contribute significantly to

跨上下文蒸馏的消融研究


Figure 7: Qualitative Comparison of Relative Depth Estimations. We present visual comparisons of depth predictions from our method ("Ours") alongside other classic depth estimators ("MiDaS v3.1" [3]) and models using DINOv2 [31] or SD as priors ("Depth Anything v2" [47], "Marigold" [22], "Genpercept" [45]). Compared to state-of-the-art methods, the depth map produced by our model, particularly at the position indicated by the black arrow, exhibits finer granularity and more detailed depth estimation.

图 7: 相对深度估计的定性对比。我们展示了我们方法("Ours")的深度预测与其他经典深度估计器("MiDaS v3.1" [3])以及使用 DINOv2 [31] 或 SD 作为先验的模型("Depth Anything v2" [47], "Marigold" [22], "Genpercept" [45])的视觉对比。与最先进的方法相比,我们的模型生成的深度图,特别是在黑色箭头指示的位置,展现出更细粒度和更详细的深度估计。

Table 1: Analysis of Normalization Strategies. Performance comparison of different normalization strategies across Shared-Context Distillation and Local-Global Distillation.

表 1: 归一化策略分析。不同归一化策略在共享上下文蒸馏和局部-全局蒸馏中的性能比较。

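| Method | Normalization | ETH3D AbsRel↓ | DIODE AbsRel↓ |
| --- | --- | --- | --- |
| Shared-Context Distillation | Global Norm. | 0.067 | 0.243 |
| | No Norm. | 0.058 | 0.236 |
| | Local Norm. | 0.060 | 0.238 |
| | Hybrid Norm. | 0.059 | 0.237 |
| Local-Global Distillation | Global Norm. | 0.065 | 0.253 |
| | No Norm. | 0.060 | 0.235 |
| | Local Norm. | 0.059 | 0.235 |
| | Hybrid Norm. | 0.056 | 0.232 |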

Table 2: Effect of Cross-Context Distillation. Performance comparison of various combinations of Shared-Context Distillation and Local-Global Distillation on the ETH3D [38] and DIODE [41] datasets. The baseline corresponds to a simple shared-context approach with no random cropping. When neither method is applied, the model defaults to this baseline.

表 2: 跨上下文蒸馏的效果。在 ETH3D [38] 和 DIODE [41] 数据集上,不同共享上下文蒸馏和局部-全局蒸馏组合的性能对比。基线对应没有随机裁剪的简单共享上下文方法。当两种方法均未应用时,模型默认为此基线。

Shared-Context Distillation Local-Global Distillation ETH3D AbsRel↓ DIODE AbsRel↓
X 0.075 0.270
0.057 (-24.0%) 0.234 (-13.3%)
0.058 (-22.6%) 0.237 (-12.2%)
0.056 (-25.3%) 0.232 (-14.1%)

improving the student model's ability to utilize pseudo-labels, demonstrating the robustness of our approach.

提高学生模型利用伪标签的能力,展示了我们方法的鲁棒性。

Table 3: Comparison in Cross-Architecture Distillation. Evaluation of our distillation pipeline in the context of Cross-Architecture Distillation. We adopt different architectures as teacher and student models, where Base denotes the previous distillation method [47]. Our method consistently improves the performance of the distilled student models.

表 3: 跨架构蒸馏对比。评估我们的蒸馏管道在跨架构蒸馏中的表现。我们采用不同的架构作为教师模型和学生模型,其中 Base 代表之前的蒸馏方法 [47]。我们的方法一致地提升了蒸馏学生模型的性能。

| Teacher | Student | Training Loss | DIODE AbsRel↓ | ETH3D AbsRel↓ |
| --- | --- | --- | --- | --- |
| DA-L | DA-S | Base | 0.290 | 0.110 |
| | | Ours | 0.262 (-9.6%) | 0.098 (-10.9%) |
| DA-L | MiDaS-L | Base | 0.313 | 0.147 |
| | | Ours | 0.295 (-5.7%) | 0.126 (-14.3%) |
| MiDaS-L | MiDaS-S | Base | 0.303 | 0.150 |
| | | Ours | 0.272 (-10.2%) | 0.120 (-20.0%) |

Cross-Architecture Distillation. To evaluate our normalization strategy, we conducted experiments using MiDaS [34] and Depth Anything [47], testing four configurations (DA-L, MiDaS-L, DA-S, MiDaS-S) as shown in Table 3. Our method consistently outperforms previous distillation approaches that use global normalization on the DIODE [41] and ETH3D [38] datasets, demonstrating superior performance both within and across architectures, and highlighting the limitations of global normalization in pseudo-label distillation.

跨架构蒸馏
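The percentages in parentheses in Table 3 are relative changes with respect to the Base row. As a small worked check, the sketch below recomputes them for the DIODE column; the helper name `relative_change` is hypothetical, and the numbers are copied from the table. The recomputed values agree with the printed ones to within 0.1 percentage point, which suggests the table truncates rather than rounds.

```python
def relative_change(base: float, ours: float) -> float:
    """Signed percentage change of `ours` relative to `base`."""
    return (ours - base) / base * 100.0

# (teacher, student) -> (Base AbsRel, Ours AbsRel) on DIODE, from Table 3.
diode = {
    ("DA-L", "DA-S"): (0.290, 0.262),
    ("DA-L", "MiDaS-L"): (0.313, 0.295),
    ("MiDaS-L", "MiDaS-S"): (0.303, 0.272),
}

changes = {pair: relative_change(base, ours) for pair, (base, ours) in diode.items()}
for pair, pct in changes.items():
    print(pair, f"{pct:+.2f}%")
```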

Table 4: Quantitative comparison of our multi-teacher distillation model on zero-shot benchmarks. The bold values indicate the best performance. Our model, which integrates diverse depth estimation models, achieves higher accuracy than any individual teacher model.

表 4: 我们多教师蒸馏模型在零样本基准上的定量比较。加粗值表示最佳性能。我们整合了多种深度估计模型的模型,其准确性高于任何单个教师模型。

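| Method | NYUv2 AbsRel↓ | NYUv2 $\delta_{1}$↑ | KITTI AbsRel↓ | KITTI $\delta_{1}$↑ | DIODE AbsRel↓ | DIODE $\delta_{1}$↑ | ScanNet | ETH3D | Avg. Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DepthAnything v2 (NeurIPS'24) | 0.045 | 0.979 | 0.074 | 0.946 | 0.262 | 0.754 | | | |
| Genpercept (Disparity) (ICLR'25) | 0.058 | 0.969 | 0.080 | 0.934 | 0.226 | 0.741 | | | |
| Ours (Multi-teacher) | 0.043 | 0.981 | 0.077 | 0.945 | 0.298 | 0.756 | | | |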

Table 5: Quantitative comparison with other affine-invariant depth estimators on several zero-shot benchmarks. The bold values indicate the best performance, and underscored represent the second-best results.

† Refers to our method applied to MiDaS v3.1. * Refers to our method applied to Depth Anything v2-Large.

表 5: 在多个零样本基准上与其他仿射不变深度估计器的定量比较。加粗的值表示最佳性能,下划线表示次佳结果。

Method NYUv2 AbsRel↓ NYUv2 $\delta_{1}$↑ KITTI AbsRel↓ KITTI $\delta_{1}$↑ DIODE AbsRel↓ DIODE $\delta_{1}$↑ ScanNet AbsRel↓ ScanNet $\delta_{1}$↑ ETH3D AbsRel↓ ETH3D $\delta_{1}$↑
DiverseDepth [51] 0.117 0.875 0.190 0.704 0.376 0.631 0.108 0.882 0.228 0.694
MiDaS [34] 0.111 0.885 0.236 0.630 0.332 0.715 0.111 0.886 0.184 0.752
LeReS [43] 0.090 0.916 0.149 0.784 0.271 0.766 0.095 0.912 0.171 0.777
Omnidata [9] 0.074 0.945 0.149 0.835 0.339 0.742 0.077 0.935 0.166 0.778
HDN [54] 0.069 0.948 0.115 0.867 0.246 0.780 0.080 0.939 0.121 0.833
DPT [36] 0.098 0.903 0.100 0.901 0.182 0.758 0.078 0.938 0.078 0.946
DepthAnything v2 [46] 0.045 0.979 0.074 0.946 0.262 0.754 0.042 0.978 0.131 0.865
Marigold [23] 0.055 0.961 0.099 0.916 0.308 0.773 0.064 0.951 0.065 0.960
Midas v3.1 [3] 0.980 0.949 0.061 0.968
Ours† 0.046 0.985 0.063 0.972 0.142 0.788 0.049 0.980 0.057 0.976
Ours* 0.043 0.981 0.070 0.949 0.233 0.753 0.042 0.980 0.054 0.981

Ours† 表示我们的方法应用于 MiDaS v3.1。Ours* 表示我们的方法应用于 Depth Anything v2-Large。

Multi-teacher Mechanism. We evaluate the effectiveness of our multi-teacher distillation strategy across five benchmarks in Table 4. To handle the diverse output depth distributions of different teacher models, we use Hybrid Normalization for Shared-Context Distillation in this experiment. Using the diffusion-based Genpercept [45] and the DINOv2-based Depth Anything v2 [47] as teacher models, we train a lightweight DPT-based depth estimation model. Our approach outperforms both teacher models overall, with only a minor gap on KITTI [13], demonstrating the effectiveness of multi-teacher distillation.

多教师机制
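This section does not spell out how the two teachers are combined within a training step. One simple possibility, assumed here purely for illustration, is to sample one teacher at random per iteration and supervise the student with that teacher's pseudo-label. Every name below (`pick_teacher`, `l1_distill_loss`, `teacher_a`, `teacher_b`) is hypothetical, the "teachers" are toy functions rather than real depth models, and the normalization step is omitted.

```python
import random
import numpy as np

def pick_teacher(teachers: dict, rng: random.Random):
    """Sample one teacher uniformly at random (assumed sampling scheme)."""
    name = rng.choice(sorted(teachers))
    return name, teachers[name]

def l1_distill_loss(student_depth: np.ndarray, pseudo_label: np.ndarray) -> float:
    """Mean absolute difference between the student output and the pseudo-label."""
    return float(np.mean(np.abs(student_depth - pseudo_label)))

# Toy stand-ins for teacher models: each maps an HxWx3 image to an HxW map.
teachers = {
    "teacher_a": lambda img: img.mean(axis=-1),
    "teacher_b": lambda img: img.max(axis=-1),
}

rng = random.Random(0)
image = np.random.default_rng(0).random((8, 8, 3))
student_depth = image.min(axis=-1)  # placeholder for the student's prediction

counts = {name: 0 for name in teachers}
losses = []
for step in range(100):
    name, teacher = pick_teacher(teachers, rng)
    losses.append(l1_distill_loss(student_depth, teacher(image)))
    counts[name] += 1

print(counts)
```

In a real pipeline the pseudo-labels would first be normalized (the text above uses Hybrid Normalization for this experiment) before the loss is computed, since the teachers produce different output depth distributions.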

4.3. Comparison with State-of-the-Art

4.3. 与现有技术的对比

Quantitative Analysis. Our model achieves SOTA performance across both indoor and outdoor datasets, demonstrating strong generalization from structured indoor scenes (NYUv2 [39], ScanNet [7]) to complex outdoor environments (KITTI [13], DIODE [41], ETH3D [38]), as shown in Table 5. By optimizing pseudo-label distillation and depth normalization, our student model not only surpasses its teacher but also achieves a new SOTA on multiple benchmarks, demonstrating the effectiveness of our approach.

定量分析

Qualitative analysis. We show a qualitative comparison of different depth estimations between SOTA models and the proposed method in Fig. 7. Compared with DAv2 [47], our method preserves finer details, particularly in areas marked by arrows. Although Marigold [22] and Genpercept [45] generate detailed maps using generative priors, they struggle with correct relative depth relationships. In contrast, our model preserves fine details while maintaining accurate relative depth relationships, resulting in a visually consistent and reliable depth estimation.

定性分析。我们在图 7 中展示了 SOTA 模型与所提方法在不同深度估计之间的定性对比。与 DAv2 [47] 相比,我们的方法保留了更精细的细节,尤其是在箭头标记的区域。尽管 Marigold [22] 和 Genpercept [45] 使用生成先验生成了详细的深度图,但它们在正确的相对深度关系方面存在困难。相比之下,我们的模型在保持精确的相对深度关系的同时保留了精细的细节,从而实现了视觉上一致且可靠的深度估计。

5. Conclusion

5. 结论

In this work, we study pseudo-label distillation strategies for MDE. We find that the widely used SSI normalization amplifies noise in teacher-generated pseudo-labels, impairing local depth accuracy. To address the problem, we propose Cross-Context Distillation, which combines local refinement with global consistency, enabling the model to learn fine details and structural context. Our multi-teacher framework, integrating diffusion-based models and encoder-decoder networks, achieves state-of-the-art performance on multiple benchmarks. Future work could improve the efficiency of unlabeled data distillation.

在本研究中,我们探讨了用于 MDE(单目深度估计)的伪标签蒸馏策略。我们发现广泛使用的 SSI(尺度与偏移不变)归一化会放大教师生成的伪标签中的噪声,从而损害局部深度精度。为了解决这一问题,我们提出了跨上下文蒸馏 (Cross-Context Distillation),该方法结合了局部细化与全局一致性,使模型能够学习精细细节和结构上下文。我们整合了基于扩散的模型和编码器-解码器网络的多教师框架,在多个基准测试中实现了最先进的性能。未来工作可以改进未标记数据蒸馏的效率。

References

参考文献

6. Appendix

6. 附录

6.1. Dataset Details

6.1. 数据集详情

Datasets. We train our model on SA-1B [24], a large-scale dataset covering diverse indoor and outdoor environments, enabling robust depth learning for real-world scenes. For evaluation, we use established monocular depth benchmarks:

数据集。我们在 SA-1B [24] 上训练模型,这是一个涵盖多种室内外环境的大规模数据集,能够为真实场景提供鲁棒的深度学习。为了评估,我们使用了现有的单目深度基准:

· NYUv2 [39]: Indoor depth estimation and semantic segmentation.
· KITTI [13]: Autonomous driving dataset with outdoor scenes and high-quality LiDAR depth.
· ETH3D [38]: High-resolution stereo images for indoor/outdoor depth estimation and 3D reconstruction.
· ScanNet [7]: Large-scale RGB-D dataset for 3D scene reconstruction and semantic segmentation.
· DIODE [41]: Dense, high-quality depth maps for both indoor and outdoor environments.

· NYUv2 [39]: 室内深度估计与语义分割。
· KITTI [13]: 自动驾驶数据集,包含室外场景和高精度的 LiDAR 深度。
· ETH3D [38]: 高分辨率立体图像,用于室内/室外深度估计和 3D 重建。
· ScanNet [7]: 大规模 RGB-D 数据集,用于 3D 场景重建和语义分割。
· DIODE [41]: 密集且高质量的深度图,适用于室内和室外环境。

Metrics. We evaluate depth estimation using mean absolute relative error (AbsRel) and $\delta_{1}$ accuracy. AbsRel is defined as:

指标。我们使用平均绝对相对误差(AbsRel)和 $\delta_{1}$ 准确率来评估深度估计。AbsRel 定义为:

$$\mathrm{AbsRel} = \frac{1}{M} \sum_{i=1}^{M} \frac{\left| d_{i} - d_{i}^{*} \right|}{d_{i}^{*}}$$

where $d_{i}$ is the predicted depth, $d_{i}^{*}$ is the ground truth, and $M$ is the total number of depth values. $\delta_{1}$ accuracy measures the percentage of pixels where:

其中 $d_{i}$ 是预测的深度,$d_{i}^{*}$ 是真实值,$M$ 是深度值的总数。$\delta_{1}$ 准确度衡量了像素的百分比,其中:

$$\max\left( \frac{d_{i}}{d_{i}^{*}}, \frac{d_{i}^{*}}{d_{i}} \right) < 1.25$$

indicating prediction accuracy within a specific tolerance. Following Metric3D [34, 52, 22], we align predictions with ground truth in scale and shift before evaluation.

在特定容差范围内指示预测准确性。按照 Metric3D [34, 52, 22] 的方法,我们在评估前将预测结果与真实值进行尺度和偏移对齐。
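The evaluation protocol above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the alignment is a plain least-squares fit of scale and shift in depth space (some pipelines align in disparity space instead), the $\delta_{1}$ threshold is the conventional 1.25, and the function names `align_scale_shift`, `abs_rel`, and `delta1` are hypothetical.

```python
import numpy as np

def align_scale_shift(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Fit s, t by least squares so that s * pred + t matches gt on valid pixels."""
    p = pred[mask].astype(np.float64)
    g = gt[mask].astype(np.float64)
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t

def abs_rel(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Mean absolute relative error over valid pixels."""
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def delta1(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray, thresh: float = 1.25) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) below the threshold."""
    ratio = np.maximum(pred[mask] / gt[mask], gt[mask] / pred[mask])
    return float(np.mean(ratio < thresh))

# Toy check: a prediction that differs from the ground truth only by an
# affine transform should be perfect after alignment.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(4, 5))
pred = 0.5 * gt - 0.3          # affine-distorted "prediction"
mask = gt > 0                  # valid-pixel mask

aligned = align_scale_shift(pred, gt, mask)
err = abs_rel(aligned, gt, mask)
acc = delta1(aligned, gt, mask)
print(f"AbsRel={err:.6f}, delta1={acc:.3f}")
```

Because the alignment removes any global scale and shift, the metrics measure only the relative depth structure, which is what affine-invariant estimators are trained to predict.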

6.2. More Experiments

6.2. 更多实验

Effect of Data Scaling. To investigate how dataset size affects model performance, we conducted experiments using progressively larger training datasets and compared our method against the SSI Loss baseline. Fig. 8 shows the Absolute Relative Error (AbsRel) as the dataset size increases from 10K to 200K images.

数据规模的影响。为了研究数据集大小如何影响模型性能,我们使用逐渐增大的训练数据集进行了实验,并将我们的方法与 SSI Loss 基线进行了对比。图 8 展示了当数据集大小从 10K 张图像增加到 200K 张图像时,绝对相对误差 (AbsRel) 的变化。

Distilling Generative Models vs. Depth Anything v2. Beyond distilling encoder-decoder depth models, we extend our approach to generative models, specifically Genpercept [45], aiming to transfer their superior detail preservation to a more efficient student model. While diffusion-based depth estimators achieve fine-grained depth reconstruction, their high computational cost limits practical applications. We investigate whether their depth estimation capability can be effectively distilled into a lightweight DPT-based model. Experimental results in Fig. 9 show that compared to using

蒸馏生成式模型与 Depth Anything v2 的对比。除了蒸馏编码器-解码器深度模型外,我们还将方法扩展到生成式模型,特别是 Genpercept [45],旨在将其卓越的细节保留能力转移到更高效的学生模型中。虽然基于扩散的深度估计器实现了细粒度的深度重建,但其高计算成本限制了实际应用。我们研究了它们的深度估计能力是否可以有效地蒸馏到基于 DPT 的轻量级模型中。图 9 中的实验结果表明,与使用


Figure 8: Comparison of Data Scaling. Performance comparison of our model with the SSI Loss baseline as the dataset size increases, measured by the average AbsRel. The results indicate that our method consistently outperforms the baseline method.

图 8: 数据规模对比。随着数据集规模的增加,我们的模型与 SSI 损失 (SSI Loss) 基线的性能对比,通过平均 AbsRel 衡量。结果表明,我们的方法始终优于基线方法。

Depth Anything v2 as the teacher, distilling from a diffusion-based model yields a student model with significantly enhanced fine-detail prediction.

Depth Anything v2 作为教师相比,从基于扩散的模型进行蒸馏,可以得到细节预测能力显著增强的学生模型。

Qualitative Comparison with Baseline Distillation. We present a qualitative comparison between our method and the previous distillation method [47], where the Base model relies solely on global normalization. We analyze the depth map details and the distribution differences between predicted and ground truth depths. The red diagonal lines represent the ground truth, with results closer to these lines indicating better performance. As shown in Fig. 10, our method produces smoother surfaces, sharper edges, and more detailed depth maps.

与基线蒸馏的定性对比

Qualitative Comparison: Additional Results on Depth Estimation in the Wild. We present additional depth maps generated by our model on in-the-wild scenes, emphasizing its robustness and precision. As shown in Fig. 11, our method produces sharper edges and more detailed depth maps, even in challenging regions such as hair, cartoon scenes, and other diverse environments.

定性对比:野外深度估计的额外结果。我们展示了模型在野外场景中生成的额外深度图,强调了其鲁棒性和精确性。如图 11 所示,即使在头发、卡通场景和其他多样化环境等具有挑战性的区域,我们的方法也能生成更清晰的边缘和更详细的深度图。

Figure 9: Distilled Generative Models. Instead of distilling only classical depth models, we also apply distillation to generative models, aiming for the student model to capture their rich details. (Panels: RGB; DAv2 as Teacher.)

图 9: 蒸馏生成式模型:我们不仅对传统的深度模型进行蒸馏,还将蒸馏应用于生成式模型,旨在让学生模型捕捉到它们丰富的细节。


Figure 10: Qualitative Comparison with Baseline Distillation. We compare our method with the baseline as the previous distillation method, which uses only global normalization. The red diagonal lines represent the ground truth, with results closer to the lines indicating better performance. Our method produces smoother surfaces, sharper edges, and more detailed depth maps.

图 10: 与基线蒸馏的定性对比。我们将我们的方法与基线进行比较,基线是之前的蒸馏方法,仅使用全局归一化。红色对角线代表真实值,结果越接近线表示性能越好。我们的方法生成更平滑的表面、更锐利的边缘和更详细的深度图。


Figure 11: Additional Results on Depth Estimation in the Wild. We showcase more depth maps generated by our model on in-the-wild scenes, highlighting its robustness and precision.

图 11: 野外深度估计的额外结果。我们展示了更多由我们的模型在野外场景中生成的深度图,突出了其鲁棒性和精确性。
