Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator
Figure 1: Zero-shot prediction on in-the-wild images. Our model, distilled from GenPercept [45] and Depth Anything v2 [47], outperforms other methods by delivering more accurate depth details and exhibiting superior generalization for monocular depth estimation on in-the-wild images.
Abstract
Monocular depth estimation (MDE) aims to predict scene depth from a single RGB image and plays a crucial role in 3D scene understanding. Recent advances in zero-shot MDE leverage normalized depth representations and distillation-based learning to improve generalization across diverse scenes. However, current depth normalization methods for distillation, which rely on global normalization, can amplify noisy pseudo-labels, reducing distillation effectiveness. In this paper, we systematically analyze the impact of different depth normalization strategies on pseudo-label distillation. Based on our findings, we propose Cross-Context Distillation, which integrates global and local depth cues to enhance pseudo-label quality. Additionally, we introduce a multi-teacher distillation framework that leverages complementary strengths of different depth estimation models, leading to more robust and accurate depth predictions. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, both quantitatively and qualitatively.
1. Introduction
Monocular depth estimation (MDE) predicts scene depth from a single RGB image, offering flexibility compared to stereo or multi-view methods. This makes MDE ideal for applications like autonomous driving and robotic navigation [10, 12, 16, 48, 28]. Recent research on zero-shot MDE models [34, 51, 43, 22] aims to handle diverse scenarios, but training such models requires large-scale, diverse depth data, which is often limited by the need for specialized equipment [29, 50]. A promising solution is using large-scale unlabeled data, which has shown success in tasks like classification and segmentation [25, 58, 44]. Studies like Depth Anything [46] highlight the effectiveness of using pseudo-labels from teacher models for training student models.
To enable training on such a diverse, mixed dataset, most state-of-the-art methods [47, 37, 51] employ scale-and-shift-invariant (SSI) depth representations for loss computation. This approach normalizes raw depth values within an image, making them invariant to scaling and shifting, and ensures that the model learns to focus on relative depth relationships rather than absolute values. The SSI representation facilitates the joint use of diverse depth data, thereby improving the model's ability to generalize across different scenes [35, 4]. Similarly, during evaluation, the metric depth of the prediction is recovered by solving for the unknown scale and shift coefficients of the predicted depth using least squares, ensuring the applicability of standard evaluation metrics.
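To make this evaluation protocol concrete, the following PyTorch sketch shows one way to recover the scale and shift by least squares before computing metrics; the helper name `align_scale_shift` and the dense, unmasked formulation are our own simplifications, not code from any of the cited works.

```python
import torch

def align_scale_shift(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Solve min_{s,t} ||s * pred + t - gt||^2 in closed form via least
    squares, then return the aligned prediction s * pred + t."""
    p, g = pred.flatten(), gt.flatten()
    # Design matrix [pred, 1]: its columns multiply the unknown scale and shift.
    A = torch.stack([p, torch.ones_like(p)], dim=1)
    sol = torch.linalg.lstsq(A, g.unsqueeze(1)).solution.squeeze(1)
    s, t = sol[0], sol[1]
    return s * pred + t
```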
Despite its advantages, using SSI depth representation for pseudo-label distillation in MDE models presents several issues. Specifically, the inherent normalization process in SSI loss makes the depth prediction at a given pixel not only dependent on the teacher model's raw prediction at that location but also influenced by the depth values in other regions of the image. This becomes problematic because pseudo-labels inherently introduce noise. Even if certain local regions are predicted accurately, inaccuracies in other regions can negatively affect depth estimates after global normalization, leading to suboptimal distillation results. As shown in Fig. 2, we empirically demonstrate that normalizing depth maps globally tends to degrade the accuracy of local regions, as compared to only applying normalization within localized regions during evaluation.
Building on this insight, in this paper, we first investigate the issue of depth normalization in pseudo-label distillation. We begin by analyzing various depth normalization strategies, including global normalization, local normalization, hybrid global-local approaches, and the absence of normalization. Through empirical experiments, we explore how each technique affects the performance of various distillation designs, especially when using pseudo-labels for training. Our analysis provides valuable insights into how different normalization methods influence the MDE loss function and distillation outcomes, offering a set of best practices for optimizing performance in diverse scenarios.
Building on this empirical foundation, we introduce a Cross-Context Distillation method, designed to distill knowledge from the teacher model more effectively. We are motivated by our finding that local regions, when used for distillation, produce pseudo-labels that capture higher-quality depth details, improving the student model's depth estimation accuracy. However, focusing solely on local regions might overlook the broader contextual relationships in the image. To address this issue, we combine local and global inputs within a unified distillation framework. By combining the context-specific advantages of local distillation with the broader understanding provided by global methods, our approach enables the student to learn fine-grained details without sacrificing global depth consistency.
Furthermore, we propose a multi-teacher distillation framework that leverages the complementary strengths of multiple depth estimation models. Our design is motivated by the observation from recent advancements that diffusion-based models, benefiting from large-scale image priors, excel at capturing fine-grained details but are computationally expensive, while encoder-decoder models provide higher accuracy and efficiency but comparatively lack fine-detail reconstruction. To harness these strengths, we randomly select different models to generate pseudo-labels and then supervise the student model with these labels. This enables the student model to learn from the detailed depth information of diffusion-based models while benefiting from the precision of encoder-decoder models.
To validate the effectiveness of our design, we conduct extensive experiments on various benchmark datasets. The empirical results show that our method significantly outperforms existing baselines both qualitatively and quantitatively. Our contributions are summarized below:
2. Related Work
2.1. Monocular Depth Estimation
Monocular depth estimation (MDE) has evolved from hand-crafted methods to deep learning, significantly improving accuracy [10, 27, 11, 15, 57, 35]. Architectural refinements, such as multi-scale designs and attention mechanisms, have further enhanced feature extraction [19, 5, 56]. However, most models remain reliant on labeled data and struggle to generalize across diverse environments. Zero-shot MDE improves generalization by leveraging large-scale datasets, geometric constraints, and multi-task learning [34, 51, 53, 55]. Metric depth estimation incorporates intrinsic data for absolute depth learning [2, 52, 20, 42], while generative models such as Marigold refine depth details using diffusion priors [22, 45]. Despite these advances, effectively utilizing unlabeled data remains a challenge due to pseudo-label noise and inconsistencies across different contexts. Depth Anything [47] explores large-scale unlabeled data but struggles with pseudo-label reliability. PatchFusion [8, 30] improves depth estimation by refining high-resolution image representations but lacks adaptability in generative settings. To address these issues, we propose Cross-Context and Multi-Teacher Distillation, which enhances pseudo-label supervision by leveraging diverse contextual information and multiple expert models, improving both accuracy and generalization.
2.2. Semi-supervised Monocular Depth Estimation
Semi-supervised depth estimation has gained attention by utilizing temporal consistency to make better use of unlabeled data [26, 17]. Some methods [1, 40, 6, 49, 14] apply stereo geometric constraints, enforcing left-right consistency to enhance depth accuracy, while others use additional supervision such as semantic priors [33, 18] or GANs, e.g., DepthGAN [21]. However, these approaches are limited by their reliance on temporal cues or stereo constraints, restricting their applicability. Recent work [32] explored pseudo-labeling for semi-supervised MDE but lacks generative modeling capabilities. Depth Anything [46] demonstrated the potential of large-scale unlabeled data, though pseudo-label reliability remains challenging. In contrast, our approach improves pseudo-label reliability and enhances MDE accuracy, relying solely on unlabeled data without additional constraints.
3. Method
In this section, we introduce a novel distillation framework designed to leverage unlabeled images for training zero-shot Monocular Depth Estimation (MDE) models. We begin by exploring various depth normalization techniques in Section 3.1, followed by detailing our proposed distillation method in Section 3.2, which combines predictions across multiple contexts. The overall framework is illustrated in Fig. 3. Finally, we describe a multi-teacher distillation mechanism in Section 3.2 that integrates diverse depth estimators as teacher models to train the student model.
3.1. Depth Normalization
Depth normalization is a crucial component of our framework, as it adjusts the pseudo-depth labels $\mathbf{d}^{t}$ from the teacher model and the depth predictions $\mathbf{d}^{s}$ from the student model for effective loss computation. To understand the influence of normalization techniques on distillation performance, we systematically analyze several approaches commonly employed in prior works. These strategies are illustrated in Fig. 4.
Figure 2: Issue with Global Normalization (SSI). In (a), we compare two alignment strategies for the central $w/2 \times h/2$ region: (1) Global Least-Squares, where alignment is applied to the full image before cropping, and (2) Local Least-Squares, where alignment is performed on the cropped region. Metrics are computed on the cropped region. As shown in (b), the local strategy performs better, demonstrating that global normalization degrades local accuracy relative to local normalization.
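The experiment in Fig. 2 can be approximated with a short protocol like the one below, reusing `align_scale_shift` from the earlier sketch; the exact crop indexing and the use of absolute relative error as the metric are our assumptions.

```python
import torch

def center_crop(x: torch.Tensor) -> torch.Tensor:
    """Take the central (h/2, w/2) window of a depth map."""
    h, w = x.shape[-2:]
    return x[..., h // 4:h // 4 + h // 2, w // 4:w // 4 + w // 2]

def compare_alignment_strategies(pred: torch.Tensor, gt: torch.Tensor):
    """Absolute relative error on the central crop under both strategies."""
    gt_c = center_crop(gt)
    # (1) Global least-squares: align on the full image, then crop.
    err_global = ((center_crop(align_scale_shift(pred, gt)) - gt_c).abs() / gt_c).mean()
    # (2) Local least-squares: crop first, then align only within the crop.
    err_local = ((align_scale_shift(center_crop(pred), gt_c) - gt_c).abs() / gt_c).mean()
    return err_global, err_local
```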
Global Normalization: The first strategy we examine is the global normalization [46, 47] used in recent distillation methods. Global normalization [34] adjusts depth predictions using global statistics of the entire depth map. This strategy aims to ensure scale-and-shift invariance by normalizing depth values based on the median and mean absolute deviation of the depth map. For each pixel $i$, the normalized depths for the student model and the pseudo-labels are computed as:
$$\mathcal{N}(d_{i}^{s})=\frac{d_{i}^{s}-\mathrm{med}(\mathbf{d}^{s})}{\frac{1}{M}\sum_{j=1}^{M}\left|d_{j}^{s}-\mathrm{med}(\mathbf{d}^{s})\right|},\qquad\mathcal{N}(d_{i}^{t})=\frac{d_{i}^{t}-\mathrm{med}(\mathbf{d}^{t})}{\frac{1}{M}\sum_{j=1}^{M}\left|d_{j}^{t}-\mathrm{med}(\mathbf{d}^{t})\right|},$$
where $\mathrm{med}(\mathbf{d}^{s})$ and $\mathrm{med}(\mathbf{d}^{t})$ are the medians of the predicted depth and the pseudo depth, respectively. The final regression loss for distillation is the average absolute difference between the normalized predicted depth and the normalized pseudo depth over all valid pixels $M$:
$$\mathcal{L}_{\mathrm{Dis}}=\frac{1}{M}\sum_{i=1}^{M}\left|\mathcal{N}(d_{i}^{s})-\mathcal{N}(d_{i}^{t})\right|.$$
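A minimal PyTorch sketch of this global normalization and the resulting distillation loss, assuming a boolean validity mask; the small epsilon guard is our addition:

```python
import torch

def ssi_normalize(d: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    """Normalize by the median and mean absolute deviation over valid pixels."""
    v = d[valid]
    med = v.median()
    mad = (v - med).abs().mean().clamp_min(1e-6)  # avoid division by zero
    return (d - med) / mad

def global_distill_loss(d_s: torch.Tensor, d_t: torch.Tensor,
                        valid: torch.Tensor) -> torch.Tensor:
    """Mean absolute difference of the two normalized depth maps."""
    n_s, n_t = ssi_normalize(d_s, valid), ssi_normalize(d_t, valid)
    return (n_s[valid] - n_t[valid]).abs().mean()
```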
Hybrid Normalization: In contrast to global normalization, Hierarchical Depth Normalization [54] employs a hybrid normalization approach by integrating both global and local depth information. This strategy is designed to preserve both the global structure and the local geometry of the depth map. The process begins by dividing the depth range into $S$ segments, where $S$ is selected from $\{1, 2, 4\}$. When $S=1$, the entire depth range is normalized globally, treating all pixels as part of a single context, akin to global normalization. When $S=2$, the depth range is divided into two segments, with each pixel normalized within one of these two local contexts. Similarly, for $S=4$, the depth range is split into four segments, allowing normalization within smaller, localized contexts. By adapting the normalization process to multiple levels of granularity, hybrid normalization balances global coherence and local adaptability. For each context $u$, the normalized depth values for the student model, $\mathcal{N}_{u}(d_{i}^{s})$, and the pseudo-labels, $\mathcal{N}_{u}(d_{i}^{t})$, are calculated within the corresponding depth range. The loss for each pixel $i$ is then computed by averaging the losses across all contexts $U_{i}$ to which the pixel belongs:
$$\ell_{i}=\frac{1}{|U_{i}|}\sum_{u\in U_{i}}\left|\mathcal{N}_{u}(d_{i}^{s})-\mathcal{N}_{u}(d_{i}^{t})\right|,$$
Figure 3: Overview of Cross-Context Distillation. Our method combines local and global depth information to enhance the student model's predictions. It includes two scenarios: (1) Shared-Context Distillation, where both models use the same image for distillation; and (2) Local-Global Distillation, where the teacher predicts depth for overlapping patches while the student predicts the full image. The Local-Global loss $\mathcal{L}_{\mathrm{lg}}$ (Top Right) ensures consistency between local and global predictions, enabling the student to learn both fine details and broad structures, improving accuracy and robustness.
Figure 4: Normalization Strategies. We compare four normalization strategies: Global Norm [34], Hybrid Norm [54], Local Norm, and No Norm. The figure visualizes how each strategy processes pixels within the normalization region (Norm. Area). The red dot represents any pixel within the region.
where $|U_{i}|$ denotes the total number of groups (or contexts) that pixel $i$ is associated with. To obtain the final loss $\mathcal{L}_{\mathrm{Dis}}$, we average the pixel-wise losses over all valid pixels $M$:
$$\mathcal{L}_{\mathrm{Dis}}=\frac{1}{M}\sum_{i=1}^{M}\ell_{i}.$$
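The hierarchy can be sketched as follows, reusing `ssi_normalize` from above. We read [54] as partitioning the teacher's depth range into $S$ equal sub-ranges at each level; the grouping rule in [54] may differ in detail, so treat this as illustrative rather than a faithful reimplementation:

```python
import torch

def hierarchical_distill_loss(d_s: torch.Tensor, d_t: torch.Tensor,
                              valid: torch.Tensor,
                              levels=(1, 2, 4)) -> torch.Tensor:
    """Average per-context normalized losses over every context U_i that
    each pixel belongs to (one context per level), as in the equation above."""
    loss = torch.zeros_like(d_t)
    count = torch.zeros_like(d_t)
    lo, hi = d_t[valid].min().item(), d_t[valid].max().item()
    for S in levels:
        # Split the teacher's depth range into S equal sub-ranges (contexts).
        edges = torch.linspace(lo, hi, S + 1, device=d_t.device)
        for u in range(S):
            upper = d_t <= edges[u + 1] if u == S - 1 else d_t < edges[u + 1]
            in_u = valid & (d_t >= edges[u]) & upper
            if in_u.sum() < 2:  # skip degenerate contexts
                continue
            n_s, n_t = ssi_normalize(d_s, in_u), ssi_normalize(d_t, in_u)
            loss[in_u] += (n_s - n_t).abs()[in_u]
            count[in_u] += 1.0
    covered = count > 0
    return (loss[covered] / count[covered]).mean()
```

Calling `hierarchical_distill_loss` with `levels=(4,)` recovers the local normalization variant discussed next.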
Local Normalization: In addition to global and hybrid normalization, we investigate Local Normalization, a strategy that focuses exclusively on the finest-scale groups used in hybrid normalization. This approach isolates the smallest local contexts for normalization, emphasizing the preservation of fine-grained depth details without considering hierarchical or global scales. Local normalization divides the depth range into the smallest groups, corresponding to $S=4$ in the hybrid normalization framework, and each pixel is normalized within its local context. The loss for each pixel $i$ follows the same formulation as in hybrid normalization, but with $u^{i}$ now denoting the single local context of pixel $i$, defined by the smallest four-part grouping:
$$\ell_{i}=\left|\mathcal{N}_{u^{i}}(d_{i}^{s})-\mathcal{N}_{u^{i}}(d_{i}^{t})\right|.$$
Figure 5: Different Inputs Lead to Different Pseudo Labels. Global Depth: The teacher model predicts depth using the entire image, and the local region's prediction is cropped from the output. Local Depth: The teacher model directly takes the cropped local region as input, resulting in more refined and detailed depth estimates for that area, capturing finer details compared to using the entire image.
No Normalization: As a baseline, we also consider direct depth regression with no explicit normalization, where the loss is the absolute difference between raw student predictions and teacher pseudo-labels:
$$\mathcal{L}_{\mathrm{Dis}}=\frac{1}{M}\sum_{i=1}^{M}\left|d_{i}^{s}-d_{i}^{t}\right|.$$
This approach eliminates the need for normalization, assuming pseudo-depth labels naturally reside in the same domain as predictions. It provides insight into whether normalization enhances distillation effectiveness or if raw depth supervision suffices.
3.2. Distillation Pipeline
In this section, we introduce an enhanced distillation pipeline that integrates two complementary strategies: Cross-Context Distillation and Multi-Teacher Distillation. Both strategies aim to improve the quality of pseudo-label distillation, enhance the model's fine-grained perception, and boost generalization across diverse scenarios.
Cross-Context Distillation. A key challenge in monocular depth distillation is the trade-off between local detail preservation and global depth consistency. As shown in Fig. 5, providing a local crop of an image as input to the teacher model enhances fine-grained details in the pseudo-depth labels, but it may fail to capture the overall scene structure. Conversely, using the entire image as input preserves the global depth structure but often lacks fine details. To address this limitation, we propose Cross-Context Distillation, a method that enables the student model to learn local details and global structures simultaneously. Cross-context distillation consists of two key strategies:
- Shared-Context Distillation: In this setup, both the teacher and student models receive the same cropped region of the image as input. Instead of using the full image, we randomly sample a local patch of varying size from the original image and provide it as input to both models. This encourages the student model to learn from the teacher across different spatial contexts, improving its ability to generalize to varying scene structures. For the shared-context distillation loss, the teacher and student receive identical inputs and each produce a depth prediction, denoted $\mathbf{d}_{\mathrm{local}}^{t}$ and $\mathbf{d}_{\mathrm{local}}^{s}$, between which the distillation loss of Section 3.1 is computed:
$$\mathcal{L}_{\mathrm{sc}}=\mathcal{L}_{\mathrm{Dis}}\!\left(\mathbf{d}_{\mathrm{local}}^{s},\,\mathbf{d}_{\mathrm{local}}^{t}\right).$$
This loss encourages the student model to refine its fine-grained predictions by directly aligning with the teacher's outputs at local scales.
- Local-Global Distillation: In this approach, the teacher and student models operate on different input contexts. The teacher model processes local cropped regions, generating fine-grained depth predictions, while the student model predicts a global depth map from the entire image. To ensure knowledge transfer, the teacher's local depth predictions supervise the corresponding overlapping regions of the student's global depth map. This strategy allows the student to integrate fine-grained local details into its holistic depth estimation. Formally, the teacher model produces depth predictions for $N$ cropped regions, denoted $\mathbf{d}_{\mathrm{local}_{n}}^{t}$, while the student generates a global depth map $\mathbf{d}_{\mathrm{global}}^{s}$. The Local-Global distillation loss is computed only over the overlapping areas between the teacher's local predictions and the corresponding regions of the student's global depth map:
$$\mathcal{L}_{\mathrm{lg}}=\frac{1}{N}\sum_{n=1}^{N}\mathcal{L}_{\mathrm{Dis}}\!\left(\mathrm{Crop}_{n}(\mathbf{d}_{\mathrm{global}}^{s}),\,\mathbf{d}_{\mathrm{local}_{n}}^{t}\right),$$
where $\mathrm{Crop}_{n}(\cdot)$ extracts the region overlapping the $n$-th patch from the student's depth prediction, and $N$ is the total number of sampled patches. This loss ensures that the student benefits from the teacher's detailed local supervision while maintaining global depth consistency. The total loss integrates both distillation losses with additional constraints, including feature alignment and gradient preservation, as proposed in prior work [47]:
$$\mathcal{L}_{\mathrm{total}}=\lambda_{1}\mathcal{L}_{\mathrm{sc}}+\lambda_{2}\mathcal{L}_{\mathrm{lg}}+\lambda_{3}\mathcal{L}_{\mathrm{aux}},$$
where $\mathcal{L}_{\mathrm{aux}}$ collects the auxiliary feature-alignment and gradient-preservation terms.
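Putting the two terms together, one possible shape of the training step is sketched below; `distill_loss` stands for whichever normalized loss from Section 3.1 is used, and the crop-size range, patch count, and bilinear resizing of the student's cropped prediction to the teacher's output resolution are our assumptions, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def random_box(H: int, W: int, min_frac: float = 0.3, max_frac: float = 0.8):
    """Sample a random crop box (top, left, h, w) inside an H x W image."""
    h = int(H * torch.empty(1).uniform_(min_frac, max_frac).item())
    w = int(W * torch.empty(1).uniform_(min_frac, max_frac).item())
    top = int(torch.randint(0, H - h + 1, (1,)))
    left = int(torch.randint(0, W - w + 1, (1,)))
    return top, left, h, w

def cross_context_losses(student, teacher, image, distill_loss, n_patches=4):
    """Compute L_sc and L_lg for one batch of unlabeled images (B, 3, H, W)."""
    H, W = image.shape[-2:]

    # Shared-context: the same random crop is fed to both models.
    t0, l0, h0, w0 = random_box(H, W)
    crop = image[..., t0:t0 + h0, l0:l0 + w0]
    with torch.no_grad():
        d_t_local = teacher(crop)          # pseudo-label, gradients blocked
    l_sc = distill_loss(student(crop), d_t_local)

    # Local-global: the teacher labels crops; the student sees the full image.
    d_global = student(image)
    l_lg = 0.0
    for _ in range(n_patches):
        t1, l1, h1, w1 = random_box(H, W)
        with torch.no_grad():
            d_t = teacher(image[..., t1:t1 + h1, l1:l1 + w1])
        # Crop(.) applied to the student's global map, resized to match the
        # teacher's output resolution before the loss is taken.
        d_s = F.interpolate(d_global[..., t1:t1 + h1, l1:l1 + w1],
                            size=d_t.shape[-2:], mode="bilinear",
                            align_corners=False)
        l_lg = l_lg + distill_loss(d_s, d_t)
    return l_sc, l_lg / n_patches
```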
Figure 6: Multi-Teacher Mechanism. We introduce a multi-teacher distillation approach, where pseudo-labels are generated from multiple teacher models. At each training iteration, one teacher is randomly selected to produce pseudo-labels for unlabeled images.
Here, $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are weighting factors that balance the different loss components. By incorporating cross-context supervision, this framework effectively allows the student model to integrate both fine-grained details from local crops and structural coherence from global depth maps.
Multi-Teacher Distillation. In addition to cross-context distillation, we adopt a multi-teacher distillation strategy, illustrated in Fig. 6, to further enhance the quality and robustness of the distilled depth knowledge. This approach leverages multiple teacher models, each trained with distinct architectures, optimization strategies, or data distributions, to generate diverse pseudo-labels. By aggregating knowledge from multiple sources, the student model benefits from a richer and more generalized depth representation. Formally, given a set of pre-trained teacher models $\mathcal{M}_{1},\mathcal{M}_{2},\ldots,\mathcal{M}_{N}$, we employ a probabilistic teacher-selection mechanism, where one teacher model is randomly selected at each training iteration to generate pseudo-labels for the input image. The inclusion of multiple teacher models allows the student to learn from a diverse set of predictions, effectively mitigating biases and limitations inherent to any single model.
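The selection mechanism itself is lightweight; a sketch, assuming uniform sampling over the teacher pool (the text specifies random selection but not the distribution):

```python
import random
import torch

def multi_teacher_step(student, teachers, image, distill_loss):
    """One training iteration: a randomly chosen teacher pseudo-labels the
    batch, and the student is supervised on those labels."""
    teacher = random.choice(teachers)   # uniform pick per iteration (assumed)
    with torch.no_grad():
        pseudo = teacher(image)         # pseudo-depth labels
    return distill_loss(student(image), pseudo)
```

In practice, this step would be combined with the cross-context losses sketched above, so that each iteration draws both a teacher and a set of crop contexts.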
4. Experiment
4.1. Experimental Settings
Datasets. To evaluate the effectiveness of our proposed distillation framework, we follow the methodology outlined in Depth Anything v2 [47]. Specifically, we conduct our distillation experiments using a subset of 200,000 samples from the SA-1B dataset [24].
For evaluation, we assess the performance of the distilled student model on five widely used depth estimation benchmarks, ensuring that these datasets remain unseen during training to enable a robust zero-shot evaluation. The chosen benchmarks include: NYUv2 [39], KITTI [13], ETH3D [38], ScanNet [7], and DIODE [41]. Additional dataset details are provided in the Appendix.
Metrics. We assess depth estimation performance