[论文翻译]全帧视频稳定算法


原文地址:https://arxiv.org/abs/2102.06205

代码地址:https://github.com/alex04072000/FuSta

Neural Re-rendering for Full-frame Video Stabilization

ABSTRACT.

Existing video stabilization methods either require aggressive cropping of frame boundaries or generate distortion artifacts on the stabilized frames. In this work, we present an algorithm for full-frame video stabilization by first estimating dense warp fields. Full-frame stabilized frames can then be synthesized by fusing warped contents from neighboring frames. The core technical novelty lies in our learning-based hybrid-space fusion that alleviates artifacts caused by optical flow inaccuracy and fast-moving objects. We validate the effectiveness of our method on the NUS and selfie video datasets. Extensive experiment results demonstrate the merits of our approach over prior video stabilization methods.

现有的视频稳定方法要么需要对帧边界进行激进裁剪,要么在稳定后的帧上产生失真伪影。在这项工作中,我们提出了一种全帧视频稳定算法:首先估计密集扭曲场,然后通过融合来自相邻帧的扭曲内容来合成全帧稳定帧。核心技术创新在于我们基于学习的混合空间融合,它可以减轻由光流不准确和快速移动物体引起的伪影。我们在 NUS 和自拍视频数据集上验证了方法的有效性。大量实验结果表明我们的方法优于先前的视频稳定方法。

video stabilization, neural rendering, deep learning



Figure 1. Full-frame video stabilization of challenging videos. Our method takes a shaky input video (top) and produces a stabilized and distortion-free video (bottom), as indicated by less fluctuation in the y−t epipolar plane image. Furthermore, by robustly fusing multiple neighboring frames, our results do not suffer from aggressive cropping of frame borders in the stabilized video and can even expand the field of view of the original video. Our approach significantly outperforms representative state-of-the-art video stabilization algorithms on these challenging scenarios (see Fig. [2]).
图 1. 具有挑战性视频的全帧视频稳定。我们的方法以不稳定的输入视频(顶部)为输入,生成稳定且无失真的视频(底部),如 y−t 对极平面图像中更小的波动所示。此外,通过稳健地融合多个相邻帧,我们的结果不会因稳定而需要对帧边界进行激进裁剪,甚至可以扩大原始视频的视野。我们的方法在这些具有挑战性的场景中明显优于具有代表性的最先进视频稳定算法(见图 [2])。



Figure 2. Limitations of current state-of-the-art video stabilization techniques. (a) Current commercial video stabilization software (Adobe Premiere Pro 2020) fails to generate smooth video in challenging scenarios of rapid camera shakes. (b) The method in (Yu and Ramamoorthi, 2020) produces temporally smooth video. However, the warped (stabilized) video contains many missing pixels at frame borders and inevitably requires applying aggressive cropping (green checkerboard areas) to generate a rectangle video. (c) The DIFRINT method (Choi and Kweon, 2020) achieves full-frame video stabilization by iteratively applying frame interpolation to generate in-between, stabilized frames. However, interpolating between frames with large camera motion and moving occlusion is challenging. Their results are thus prone to severe artifacts.
图 2. 当前最先进视频稳定技术的局限性。(a) 当前的商业视频稳定软件(Adobe Premiere Pro 2020)在相机快速抖动的挑战性场景中无法生成平稳的视频。(b) (Yu and Ramamoorthi, 2020) 的方法可以产生时间上平滑的视频。然而,扭曲(稳定)后的视频在帧边界处包含许多缺失像素,不可避免地需要进行激进裁剪(绿色棋盘区域)才能生成矩形视频。(c) DIFRINT 方法 (Choi and Kweon, 2020) 通过迭代应用帧插值来生成中间的稳定帧,从而实现全帧视频稳定。然而,在相机运动较大且存在运动遮挡的帧之间进行插值非常困难,因此其结果容易出现严重的伪影。

1.INTRODUCTION

Video stabilization has become increasingly important with the rapid growth of video content on the Internet platforms, such as YouTube, Vimeo, and Instagram. Casually captured cellphone videos without a professional video stabilizer are often shaky and unpleasant to watch. These videos pose significant challenges for video stabilization algorithms. For example, videos are often noisy due to small image sensors, particularly in low-light environments. Handheld captured videos may contain large camera shake/jitter, resulting in severe motion blur and wobble artifacts from a rolling shutter camera.
随着 YouTube、Vimeo 和 Instagram 等互联网平台上视频内容的快速增长,视频稳定变得越来越重要。在没有专业稳定器的情况下随手拍摄的手机视频通常很抖动,观感不佳。这些视频给视频稳定算法带来了重大挑战。例如,由于图像传感器较小,视频通常带有噪声,在弱光环境下尤其如此。手持拍摄的视频可能包含较大的相机抖动/晃动,从而导致严重的运动模糊以及卷帘快门相机带来的晃动伪影。

Existing video stabilization methods usually consist of three main components: 1) motion estimation, 2) motion smoothing and 3) stable frame generation. First, the motion estimation step involves estimating motion through 2D feature detection/tracking (Lee et al., 2009; Liu et al., 2011; Goldstein and Fattal, 2012; Wang et al., 2013), dense flow (Yu and Ramamoorthi, 2019, 2020), or recovering camera motion and scene structures (Liu et al., 2009; Zhou et al., 2013; Buehler et al., 2001b; Smith et al., 2009; Liu et al., 2012). Second, the motion smoothing step then removes the high-frequency jittering in the estimated motion and predicts the spatial transformations to stabilize each frame in the form of homography (Matsushita et al., 2005), mixture of homography (Liu et al., 2013; Grundmann et al., 2012), or per-pixel warp fields (Liu et al., 2014; Yu and Ramamoorthi, 2019, 2020). Third, the stable frame generation step uses the predicted spatial transform to synthesize the stabilized video. The stabilized frames, however, often contain large missing regions at frame borders, particularly when videos with large camera motion. This forces existing methods to apply aggressive cropping for maintaining a rectangular frame and therefore leads to a significantly zoomed-in video with resolution loss (Fig. 2(a) and (b)).
现有的视频稳定方法通常由三个主要部分组成:1) 运动估计,2) 运动平滑,3) 稳定帧生成。首先,运动估计步骤通过 2D 特征检测/跟踪 (Lee 等人, 2009; Liu 等人, 2011; Goldstein 和 Fattal, 2012; Wang 等人, 2013)、密集光流 (Yu 和 Ramamoorthi, 2019, 2020),或恢复相机运动和场景结构 (Liu 等人, 2009; Zhou 等人, 2013; Buehler 等人, 2001b; Smith 等人, 2009; Liu 等人, 2012) 来估计运动。其次,运动平滑步骤去除估计运动中的高频抖动,并以单应性 (Matsushita 等人, 2005)、单应性混合 (Liu 等人, 2013; Grundmann 等人, 2012) 或逐像素扭曲场 (Liu 等人, 2014; Yu 和 Ramamoorthi, 2019, 2020) 的形式预测用于稳定每一帧的空间变换。第三,稳定帧生成步骤使用预测的空间变换来合成稳定的视频。然而,稳定后的帧通常在帧边界处存在大片缺失区域,当视频相机运动较大时尤其如此。这迫使现有方法进行激进裁剪以保持矩形帧,从而导致视频被明显放大并损失分辨率(图 2(a) 和 (b))。

Full-frame video stabilization methods aim to address the above-discussed limitation and produce stabilized video with the same field of view (FoV). One approach for full-frame video stabilization is to first compute the stabilized video (with missing pixels at the frame borders) and then apply flow-based video completion methods (Matsushita et al., 2005; Huang et al., 2016; Gao et al., 2020) to fill in missing contents. Such two-stage methods may suffer from the inaccuracies in flow estimation and inpainting (e.g., in poorly textured regions, fluid motion, and motion blur). A recent learning-based method, DIFRINT (Choi and Kweon, 2020), instead uses iterative frame interpolation to stabilize the video while maintaining the original FoV. However, we find that applying frame interpolation repeatedly leads to severe distortion and blur artifacts in challenging cases (Fig. 2(c)).
全帧视频稳定方法旨在解决上述限制,生成具有相同视场 (FoV) 的稳定视频。实现全帧视频稳定的一种途径是先计算稳定后的视频(帧边界处存在缺失像素),再应用基于光流的视频补全方法 (Matsushita 等人, 2005; Huang 等人, 2016; Gao 等人, 2020) 来填补缺失内容。这种两阶段方法可能会受到光流估计和修复不准确的影响(例如在纹理较差的区域、流体运动和运动模糊处)。最近一种基于学习的方法 DIFRINT (Choi and Kweon, 2020) 则使用迭代帧插值来稳定视频,同时保持原始 FoV。然而,我们发现在具有挑战性的情况下,反复应用帧插值会导致严重的失真和模糊伪影(图 2(c))。

In this paper, we present a new algorithm that takes a shaky video and the estimated smooth motion fields for stabilization as inputs and produces a full-frame stable video. The core idea of our method lies in fusing information from multiple neighboring frames in a robust manner. Instead of using color frames directly, we use a learned CNN representation to encode rich local appearance for each frame, fuse multiple aligned feature maps, and use a neural decoder network to render the final color frame. We explore multiple design choices for fusing and blending multiple aligned frames (commonly used in image stitching and view synthesis applications). We then propose a hybrid fusion mechanism that leverages both feature-level and image-level fusion to alleviate the sensitivity to flow inaccuracy. In addition, we improve the visual quality of synthesized results by learning to predict spatially varying blending weights, removing blurry input frames for sharp video generation, and transferring high-frequency details residual to the re-rendered, stabilized frames. Our method generates stabilized video with significantly fewer artifacts and distortions while retaining (or even expanding) the original FoV (Fig. 1). We evaluate the proposed algorithm with the state-of-the-art methods and commercial video stabilization software (Adobe Premiere Pro 2020 warp stabilizer). Extensive experiments show that our method performs favorably against existing methods on two public benchmark datasets (Liu et al., 2013; Yu and Ramamoorthi, 2018).
在本文中,我们提出了一种新算法,它以抖动视频和用于稳定的估计平滑运动场为输入,生成全帧稳定视频。我们方法的核心思想在于以稳健的方式融合来自多个相邻帧的信息。我们不直接使用彩色帧,而是使用学习到的 CNN 表示为每一帧编码丰富的局部外观,融合多个对齐后的特征图,并使用神经解码器网络渲染最终的彩色帧。我们探索了融合和混合多个对齐帧的多种设计选择(常用于图像拼接和视图合成应用)。然后,我们提出了一种混合融合机制,同时利用特征级和图像级融合来降低对光流不准确的敏感性。此外,我们通过学习预测空间变化的混合权重、去除模糊的输入帧以生成清晰的视频,以及将高频细节残差转移到重新渲染的稳定帧上,来提高合成结果的视觉质量。我们的方法生成的稳定视频在保留(甚至扩展)原始 FoV 的同时,伪影和失真明显更少(图 1)。我们将所提出的算法与最先进的方法和商业视频稳定软件(Adobe Premiere Pro 2020 Warp Stabilizer)进行了比较。大量实验表明,我们的方法在两个公共基准数据集 (Liu 等人, 2013; Yu 和 Ramamoorthi, 2018) 上优于现有方法。

The main contributions of this work are:

  • We apply neural rendering techniques in the context of video stabilization to alleviate the issues of sensitivity to flow inaccuracy.
  • We present a hybrid fusion mechanism for combining information from multiple frames at both feature- and image-level. We systematically validate various design choices through ablation studies.
  • We demonstrate favorable performance against representative video stabilization techniques on two public datasets. We will release the source code to facilitate future research.
  本文的主要贡献如下:
  • 我们在视频稳定的背景下应用神经渲染技术,以缓解对光流不准确的敏感性问题。
  • 我们提出了一种在特征级和图像级结合多帧信息的混合融合机制,并通过消融研究系统地验证了各种设计选择。
  • 我们在两个公共数据集上展示了优于代表性视频稳定技术的性能。我们将发布源代码,以促进未来的研究。

2.RELATED WORK

Motion estimation and smoothing. Most video stabilization methods focus on estimating motion between frames and smoothing the motion. For motion estimation, existing methods often estimate the motion in the 2D image space using sparse feature detection/tracking and dense optical flow. These methods differ in motion modeling, e.g., eigen-trajectories (Liu et al., 2011), epipolar geometry (Goldstein and Fattal, 2012), warping grids (Liu et al., 2013), or dense flow fields (Yu and Ramamoorthi, 2020). For motion smoothing, prior methods use low-pass filtering (Liu et al., 2011), L1 optimization (Grundmann et al., 2011), and spatio-temporal optimization (Wang et al., 2013).

In contrast to estimating 2D motion, several methods recover the camera motion and proxy scene geometry by leveraging Structure from Motion (SfM) algorithms. These methods stabilize frames using 3D reconstruction and projection along with image-based rendering (Kopf et al., 2014) or content-preserving warps (Buehler et al., 2001b; Liu et al., 2009; Goldstein and Fattal, 2012). However, SfM algorithms are less effective in handling complex videos with severe motion blur and highly dynamic scenes. Specialized hardware such as depth cameras (Liu et al., 2012) or light field cameras (Smith et al., 2009) may be required for reliable pose estimation.

Deep learning-based approaches have recently been proposed to directly predict warping fields (Xu et al., 2018; Wang et al., 2018) or optical flows (Yu and Ramamoorthi, 2019, 2020) for video stabilization. In particular, methods with dense warp fields (Liu et al., 2014; Yu and Ramamoorthi, 2019, 2020) offer greater flexibility for compensating motion jittering and implicitly handling rolling shutter effects than parametric warp fields (Xu et al., 2018; Wang et al., 2018).

Our work builds upon existing 2D motion estimation/smoothing techniques for stabilization and focuses on synthesizing full-frame video outputs. Specifically, we adopt the state-of-the-art flow-based stabilization method (Yu and Ramamoorthi, 2020) and use the estimated per-frame warp fields as inputs to our method.¹

¹ Our method is agnostic to the motion smoothing technique; other approaches such as parametric warps can also be applied.

运动估计和平滑。大多数视频稳定方法侧重于估计帧间运动并对运动进行平滑。对于运动估计,现有方法通常使用稀疏特征检测/跟踪和密集光流来估计 2D 图像空间中的运动。这些方法在运动建模方面有所不同,例如特征轨迹 (Liu 等人, 2011)、对极几何 (Goldstein 和 Fattal, 2012)、扭曲网格 (Liu 等人, 2013) 或密集流场 (Yu 和 Ramamoorthi, 2020)。对于运动平滑,先前的方法使用低通滤波 (Liu 等人, 2011)、L1 优化 (Grundmann 等人, 2011) 和时空优化 (Wang 等人, 2013)。

与估计 2D 运动不同,有几种方法利用运动恢复结构 (SfM) 算法来恢复相机运动和代理场景几何。这些方法使用 3D 重建和投影,并结合基于图像的渲染 (Kopf 等人, 2014) 或内容保持扭曲 (Buehler 等人, 2001b; Liu 等人, 2009; Goldstein 和 Fattal, 2012) 来稳定帧。然而,SfM 算法在处理具有严重运动模糊和高度动态场景的复杂视频时效果较差。要获得可靠的姿态估计,可能需要深度相机 (Liu 等人, 2012) 或光场相机 (Smith 等人, 2009) 等专用硬件。

最近已有基于深度学习的方法被提出,直接预测用于视频稳定的扭曲场 (Xu 等人, 2018; Wang 等人, 2018) 或光流 (Yu 和 Ramamoorthi, 2019, 2020)。特别是,与参数化扭曲场 (Xu 等人, 2018; Wang 等人, 2018) 相比,使用密集扭曲场的方法 (Liu 等人, 2014; Yu 和 Ramamoorthi, 2019, 2020) 在补偿运动抖动和隐式处理卷帘快门效应方面提供了更大的灵活性。

我们的工作建立在现有的 2D 运动估计/平滑稳定技术之上,专注于合成全帧视频输出。具体来说,我们采用最先进的基于光流的稳定方法 (Yu 和 Ramamoorthi, 2020),并将估计的每帧扭曲场作为我们方法的输入¹。

¹ 我们的方法与具体的运动平滑技术无关,也可以使用参数化扭曲等其他方法。

Image fusion and composition. With the estimated and smoothed motion, the final step of video stabilization is to render the stabilized frames. Most existing methods synthesize frames by directly warping each input frame to the stabilized location using smoothed warping grids (Liu et al., 2013, 2011) or flow fields predicted by CNNs (Yu and Ramamoorthi, 2019, 2020). However, such approaches inevitably synthesize images with missing regions around frame boundaries. To maintain a rectangular shape, existing methods often crop off the blank areas and generate output videos with a lower resolution than the input video. To address this issue, full-frame video stabilization methods aim to stabilize videos without cropping. These methods use neighboring frames to fill in the blanks and produce full-frame results by 2D motion inpainting (Matsushita et al., 2005; Gleicher and Liu, 2008; Gao et al., 2020). In contrast to existing motion inpainting methods that first generate stabilized frames and then fill in missing pixels, our method leverages neural rendering to encode and fuse warped appearance features and learns to decode the fused feature map into the final color frames.

Several recent methods can generate full-frame stabilized videos without explicit motion estimation. For example, the method in (Wang et al., 2018) trains a CNN with collected unstable-stable pairs to directly synthesize stable frames. However, direct synthesis of output frames without spatial transformations remains challenging. Recently, the DIFRINT method (Choi and Kweon, 2020) generates full-frame stable videos by iteratively applying frame interpolation. This method couples motion smoothing and frame rendering together. However, the repeated frame interpolation often introduces visible distortion and artifacts (see Fig. 2(c)).

View synthesis. View synthesis algorithms aim to render photorealistic images from novel viewpoints given a single image (Niklaus et al., 2019; Wu et al., 2020; Wiles et al., 2020; Shih et al., 2020; Kopf et al., 2020) or multiple posed images (Levoy and Hanrahan, 1996; Gortler et al., 1996; Chaurasia et al., 2013; Penner and Zhang, 2017; Hedman et al., 2018; Riegler and Koltun, 2020a, b). These methods mainly differ in the ways they map and fuse information, e.g., view interpolation (Chen and Williams, 1993; Debevec et al., 1996; Seitz and Dyer, 1996), 3D proxy geometry and mapping (Buehler et al., 2001a), 3D multi-plane images (Zhou et al., 2018; Srinivasan et al., 2019), and CNNs (Hedman et al., 2018; Riegler and Koltun, 2020a, b). Our fusion network resembles the encoder-decoder network used in (Riegler and Koltun, 2020a) for view synthesis. However, our method does not require estimating scene geometry, which is challenging for dynamic scenes in video stabilization.

A recent line of research focuses on rendering novel views for dynamic scenes from a single video (Xian et al., 2020; Tretschk et al., 2020; Li et al., 2020) based on neural volume rendering (Lombardi et al., 2019; Mildenhall et al., 2020). These methods can be used for full-frame video stabilization by rendering the dynamic video from a smooth camera trajectory. While promising results have been shown, these methods require per-video training and precise camera pose estimates. In contrast, our video stabilization method applies to a wider variety of videos.
图像融合和合成。利用估计并平滑后的运动,视频稳定的最后一步是渲染稳定的帧。大多数现有方法通过使用平滑的扭曲网格 (Liu 等人, 2013, 2011) 或 CNN 预测的流场 (Yu 和 Ramamoorthi, 2019, 2020) 将每个输入帧直接扭曲到稳定位置来合成帧。然而,这些方法不可避免地会合成在帧边界附近存在缺失区域的图像。为了保持矩形形状,现有方法通常会裁剪掉空白区域,生成分辨率低于输入视频的输出视频。为了解决这个问题,全帧视频稳定方法旨在不裁剪地稳定视频。这些方法使用相邻帧来填补空白,并通过 2D 运动修复产生全帧结果 (Matsushita 等人, 2005; Gleicher 和 Liu, 2008; Gao 等人, 2020)。与先生成稳定帧再填补缺失像素的现有运动修复方法不同,我们的方法利用神经渲染来编码并融合扭曲后的外观特征,并学习将融合的特征图解码为最终的彩色帧。

最近的几种方法可以在没有显式运动估计的情况下生成全帧稳定视频。例如,(Wang 等人, 2018) 中的方法使用收集的不稳定-稳定帧对训练 CNN 来直接合成稳定帧。然而,在没有空间变换的情况下直接合成输出帧仍然具有挑战性。最近,DIFRINT 方法 (Choi 和 Kweon, 2020) 通过迭代应用帧插值生成全帧稳定视频。这种方法将运动平滑和帧渲染耦合在一起。然而,重复的帧插值通常会引入可见的失真和伪影(见图 2(c))。

视图合成。视图合成算法旨在根据单张图像 (Niklaus 等人, 2019; Wu 等人, 2020; Wiles 等人, 2020; Shih 等人, 2020; Kopf 等人, 2020) 或多张已知位姿的图像 (Levoy 和 Hanrahan, 1996; Gortler 等人, 1996; Chaurasia 等人, 2013; Penner 和 Zhang, 2017; Hedman 等人, 2018; Riegler 和 Koltun, 2020a, b),从新视角渲染逼真的图像。这些方法的主要区别在于映射和融合信息的方式,例如视图插值 (Chen 和 Williams, 1993; Debevec 等人, 1996; Seitz 和 Dyer, 1996)、3D 代理几何与映射 (Buehler 等人, 2001a)、3D 多平面图像 (Zhou 等人, 2018; Srinivasan 等人, 2019) 以及 CNN (Hedman 等人, 2018; Riegler 和 Koltun, 2020a, b)。我们的融合网络类似于 (Riegler and Koltun, 2020a) 中用于视图合成的编码器-解码器网络。然而,我们的方法不需要估计场景几何,而这对于视频稳定中的动态场景来说非常困难。

最近的一系列研究基于神经体渲染 (Lombardi 等人, 2019; Mildenhall 等人, 2020),着重于从单个视频中为动态场景渲染新视角 (Xian 等人, 2020; Tretschk 等人, 2020; Li 等人, 2020)。通过从平滑的相机轨迹渲染动态视频,这些方法可用于全帧视频稳定。虽然它们已展示出有前景的结果,但这些方法需要针对每个视频进行训练,并需要精确的相机姿态估计。相比之下,我们的视频稳定方法适用于更广泛的视频。

Neural rendering. A direct blending of multiple images in the image space may lead to glitching artifacts (visible seams). Some recent methods train neural scene representations to synthesize novel views, such as NeRF (Mildenhall et al., 2020), scene representation networks (Sitzmann et al., 2019b), neural voxel grid (Sitzmann et al., 2019a; Lombardi et al., 2019), 3D Neural Point-Based Graphics (Aliev et al., 2020), and neural textures (Thies et al., 2019). However, these methods often require time-consuming per-scene training and do not handle dynamic scenes. Our method does not require per-video finetuning.
神经渲染。在图像空间中直接混合多幅图像可能会导致毛刺伪影(可见接缝)。最近的一些方法训练神经场景表示来合成新视角,例如 NeRF (Mildenhall 等人, 2020)、场景表示网络 (Sitzmann 等人, 2019b)、神经体素网格 (Sitzmann 等人, 2019a; Lombardi 等人, 2019)、3D 神经点渲染 (Aliev 等人, 2020) 以及神经纹理 (Thies 等人, 2019)。然而,这些方法通常需要耗时的逐场景训练,并且无法处理动态场景。我们的方法不需要针对每个视频进行微调。

Figure 3. Design choices for fusing multiple frames. To synthesize full-frame stabilized video, we need to align and fuse the contents from multiple neighboring frames in the input shaky video. (a) Conventional panorama image stitching (or, in general, image-based rendering) methods often fuse the warped (stabilized) images at the image level. Fusing at the image level works well when the alignment is accurate, but may generate blending artifacts (e.g., visible seams) when flow estimates are not reliable. (b) One can also encode the images as abstract CNN features, perform the fusion in the feature space, and learn a decoder to convert the fused feature into output frames. Such approaches are more robust to flow inaccuracy but often produce overly blurred images. (c) Our proposed approach combines the advantages of both strategies. We first extract abstract image features. We then fuse the warped features from multiple frames (Eq. (6)). For each source frame, we take the fused feature map together with the individual warped features and decode them into the output frames and the associated confidence maps. Finally, we produce the final output frame by using the weighted average of the generated images as in Eq. (8).
图 3. 融合多个帧的设计选择。为了合成全帧稳定视频,我们需要对齐并融合输入抖动视频中多个相邻帧的内容。(a) 传统的全景图像拼接(或更一般的基于图像的渲染)方法通常在图像级别融合扭曲(稳定)后的图像。当对齐准确时,图像级融合效果很好,但当光流估计不可靠时可能会产生混合伪影(例如可见的接缝)。(b) 也可以将图像编码为抽象的 CNN 特征,在特征空间中执行融合,并学习一个解码器将融合特征转换为输出帧。这种方法对光流不准确更稳健,但通常会产生过度模糊的图像。(c) 我们提出的方法结合了两种策略的优点。我们首先提取抽象图像特征,然后融合来自多个帧的扭曲特征(式 (6))。对于每个源帧,我们将融合特征图与各自的扭曲特征一起输入解码器,解码出输出帧和相应的置信度图。最后,我们按照式 (8) 对生成的图像加权平均,得到最终的输出帧。

3.FULL-FRAME VIDEO STABILIZATION

We denote $I_t$ the frame in the real (unstabilized) camera space and $\hat{I}_t$ the frame in the virtual (stabilized) camera space at a timestamp $t$. Given an input video with $T$ frames $\{I_t\}_{t=1}^{T}$, our goal is to generate a video $\{\hat{I}_t\}_{t=1}^{T}$ that is visually stable and maintains the same FOV as the input video without any cropping. Existing video stabilization methods often apply a certain amount of cropping to exclude any missing pixels due to frame warping, as shown in Fig. 2. In contrast, we utilize the information from neighboring frames to render stabilized frames with completed contents or even expand the FOV of the input video.

Video stabilization methods typically consist of three stages: 1) motion estimation, 2) motion smoothing, and 3) frame warping/rendering. Our method focuses on the third stage for rendering high-quality frames without any cropping. Our proposed algorithm is thus agnostic to particular motion estimation/smooth techniques. We assume that the warping field from the real camera space to the virtual camera space is available for each frame (e.g., from (Yu and Ramamoorthi, 2020)).

Given an input video, we first encode image features for each frame, warp the neighboring frames to the virtual camera space at the specific target timestamp, and then fuse the features to render a stabilized frame. We describe the technical detail of each step in the following sections.
我们用 $I_t$ 表示时间戳 $t$ 处真实(未稳定)相机空间中的帧,用 $\hat{I}_t$ 表示虚拟(稳定)相机空间中的帧。给定一段具有 $T$ 帧的输入视频 $\{I_t\}_{t=1}^{T}$,我们的目标是生成一段视觉上稳定、且在不进行任何裁剪的情况下保持与输入视频相同 FOV 的视频 $\{\hat{I}_t\}_{t=1}^{T}$。现有的视频稳定方法通常会进行一定量的裁剪,以排除由于帧扭曲而缺失的像素,如图 2 所示。相比之下,我们利用来自相邻帧的信息来渲染内容完整的稳定帧,甚至可以扩展输入视频的 FOV。

视频稳定方法通常包括三个阶段:1) 运动估计,2) 运动平滑,和 3) 帧变形/渲染。我们的方法侧重于在没有任何裁剪的情况下渲染高质量帧的第三阶段。因此,我们提出的算法与特定的运动估计/平滑技术无关。我们假设从真实相机空间到虚拟相机空间的扭曲场对于每一帧都是可用的(例如,来自 (Yu and Ramamoorthi, 2020 ))。

给定输入视频,我们首先对每一帧的图像特征进行编码,在特定目标时间戳将相邻帧扭曲到虚拟相机空间,然后融合这些特征以呈现稳定的帧。我们将在以下部分中描述每个步骤的技术细节。
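To make the overall pipeline concrete, here is a minimal PyTorch-style sketch of rendering one stabilized frame. The `encoder` and `decoder` networks and the plain mean fusion are placeholders for the components detailed in Sections 3.1 and 3.2; the flow fields are assumed to be stored as (dx, dy) displacement maps.

```python
import torch
import torch.nn.functional as F

def backward_warp(x, flow):
    """Backward-warp a tensor x (N, C, H, W) with a dense flow (N, 2, H, W) given as (dx, dy)."""
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=x.dtype),
                            torch.arange(w, dtype=x.dtype), indexing="ij")
    coords = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow      # absolute sampling positions
    grid = torch.stack((2 * coords[:, 0] / (w - 1) - 1,            # normalize to [-1, 1] for grid_sample
                        2 * coords[:, 1] / (h - 1) - 1), dim=-1)
    return F.grid_sample(x, grid, align_corners=True)

def render_stabilized_frame(neighbors, flows_to_neighbors, encoder, decoder):
    """Encode each neighboring frame, warp its features into the virtual camera space, fuse, decode.

    neighbors          : list of (1, 3, H, W) frames in the real camera space
    flows_to_neighbors : list of (1, 2, H, W) warp fields from the virtual target frame to each neighbor
    """
    warped_feats = [backward_warp(encoder(img), flow)
                    for img, flow in zip(neighbors, flows_to_neighbors)]
    fused = torch.stack(warped_feats).mean(dim=0)                  # placeholder fusion; see Sec. 3.2
    return decoder(fused)
```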

3.1.PRE-PROCESSING

Motion estimation and smoothing. Several motion estimation and smoothing methods have been developed (Liu et al., 2011, 2013; Yu and Ramamoorthi, 2020). In this work, we use the state-of-the-art method (Yu and Ramamoorthi, 2020) to obtain a backward dense warping field $F_{k \to \hat{k}}$ for each frame, where $k$ indicates the input and $\hat{k}$ denotes the stabilized output. Note that one can also generate the warping fields from other parametric motion models such as homography or affine. These warping fields can be directly used to warp the input video. However, the stabilized video often contains irregular boundaries and a large portion of missing pixels. Therefore, the output video requires aggressive cropping and thus loses some content.

Optical flow estimation. To recover the missing pixels caused by warping, we need to project the corresponding pixels from nearby frames to the target stabilized frame. For each keyframe $I_k$ at time $t=k$, we compute the optical flows $\{F_{n \to k}\}_{n \in \Omega_k}$ from neighboring frames to the keyframe using RAFT (Teed and Deng, 2020), where $n$ indicates a neighboring frame and $\Omega_k$ denotes the set of neighboring frames for the keyframe $I_k$.

Blurry frames removal. For videos with rapid camera or object motion, the input frames may suffer from severe blur. Fusing information from these blurry frames inevitably leads to blurry output. We thus use the method of (Pech-Pacheco et al., 2000) to compute a sharpness score for each neighboring frame and then select the sharpest 50% of the neighboring frames for fusion. This pre-processing step helps restore sharper contents (see Fig. 4) and improves the runtime.
运动估计和平滑。目前已有多种运动估计和平滑方法 (Liu 等人, 2011, 2013; Yu 和 Ramamoorthi, 2020)。在这项工作中,我们使用最先进的方法 (Yu 和 Ramamoorthi, 2020) 为每一帧获得后向密集扭曲场 $F_{k \to \hat{k}}$,其中 $k$ 表示输入,$\hat{k}$ 表示稳定后的输出。请注意,也可以从单应性或仿射等其他参数化运动模型生成扭曲场。这些扭曲场可以直接用于扭曲输入视频。然而,稳定后的视频通常包含不规则的边界和大量缺失像素,因此输出视频需要进行激进裁剪,从而丢失部分内容。

光流估计。为了恢复由扭曲引起的缺失像素,我们需要将附近帧中相应的像素投影到目标稳定帧。对于时间 $t=k$ 处的每个关键帧 $I_k$,我们使用 RAFT (Teed 和 Deng, 2020) 计算从相邻帧到关键帧的光流 $\{F_{n \to k}\}_{n \in \Omega_k}$,其中 $n$ 表示相邻帧,$\Omega_k$ 表示关键帧 $I_k$ 的相邻帧集合。
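For illustration, the flows $\{F_{n \to k}\}$ can be obtained from an off-the-shelf RAFT model. The sketch below uses torchvision's RAFT implementation as a stand-in for the original RAFT (Teed and Deng, 2020); frame heights and widths are assumed to be divisible by 8, as RAFT requires.

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()

@torch.no_grad()
def flows_to_keyframe(key_frame, neighbor_frames):
    """Estimate dense flows F_{n->k} from each neighboring frame n to the keyframe k.

    key_frame       : (3, H, W) tensor with values in [0, 1]
    neighbor_frames : list of (3, H, W) tensors with values in [0, 1]
    """
    transforms = weights.transforms()                  # RAFT's expected input normalization
    flows = {}
    for n, frame in enumerate(neighbor_frames):
        img1, img2 = transforms(frame.unsqueeze(0), key_frame.unsqueeze(0))
        flows[n] = raft(img1, img2)[-1]                # last refinement iteration, (1, 2, H, W)
    return flows
```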

去除模糊帧。对于相机或物体快速运动的视频,输入帧可能会严重模糊。融合来自这些模糊帧的信息不可避免地会导致输出模糊。因此,我们使用 (Pech-Pacheco 等人, 2000) 的方法计算每个相邻帧的清晰度分数,然后选择最清晰的前 50% 相邻帧用于融合。这一预处理步骤有助于恢复更清晰的内容(见图 4),并缩短运行时间。
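The sharpness score of (Pech-Pacheco et al., 2000) is commonly implemented as the variance of the Laplacian. A minimal sketch of the frame-selection step follows; the OpenCV-based implementation is an assumption, while the keep-the-sharpest-50% rule follows the text.

```python
import cv2
import numpy as np

def sharpness_score(frame_bgr):
    """Variance-of-Laplacian focus measure (Pech-Pacheco et al., 2000)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def keep_sharpest_half(neighbor_frames):
    """Return the indices of the (roughly) 50% sharpest neighboring frames."""
    scores = np.array([sharpness_score(f) for f in neighbor_frames])
    keep = np.argsort(scores)[len(scores) // 2:]       # sharper half, by ascending score
    return sorted(keep.tolist())
```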

Figure 4. Effect of selecting and removing blurred frames. Rapid camera or object motion leads to blurry frames (a). Stabilizing the videos using all these frames inevitably results in blurry outputs (b). By selecting and removing blurred frames in a video, our method re-renders the stabilized video using only sharp input frames and thus can avoid synthesizing blurry frames.

3.2.WARPING AND FUSION

Warping. We warp the neighboring frames $\{I_n\}_{n \in \Omega_k}$ to align with the target frame $\hat{I}_k$ in the virtual camera space. Since we already have the warping field from the target frame to the keyframe $F_{\hat{k} \to k}$ (estimated from (Yu and Ramamoorthi, 2020)) and the estimated optical flow from the keyframe to neighboring frames $\{F_{k \to n}\}_{n \in \Omega_k}$, we can compute the warping field from the target frame to the neighboring frames $\{F_{\hat{k} \to n}\}_{n \in \Omega_k}$ by chaining the flow vectors. We can thus warp a neighboring frame $I_n$ to align with the target frame $\hat{I}_k$ using backward warping. Some pixels in the target frame are not visible in the neighboring frames due to occlusion/dis-occlusion or because they fall out of the frame boundary. Therefore, we compute a visibility mask $\{\alpha_n\}_{n \in \Omega_k}$ for each neighboring frame to indicate whether a pixel is valid (labeled as 1) in the source frame or not. We use the method of (Sundaram et al., 2010) to identify occluded pixels (labeled as 0).
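A sketch of the flow chaining and visibility computation described above. Chaining composes $F_{\hat{k} \to k}$ with $F_{k \to n}$ by resampling the second field; the visibility mask here is approximated with a simple forward-backward consistency threshold, whereas the paper uses the occlusion detection of (Sundaram et al., 2010), so the threshold value is illustrative.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(x, flow):
    """Backward-warp x (N, C, H, W) with a dense flow (N, 2, H, W) given as (dx, dy)."""
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=x.dtype),
                            torch.arange(w, dtype=x.dtype), indexing="ij")
    coords = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow
    grid = torch.stack((2 * coords[:, 0] / (w - 1) - 1,
                        2 * coords[:, 1] / (h - 1) - 1), dim=-1)
    return F.grid_sample(x, grid, align_corners=True)

def chain_flows(flow_khat_to_k, flow_k_to_n):
    """Compose F_{k^->k} and F_{k->n} into F_{k^->n}: sample the second flow at the first flow's targets."""
    return flow_khat_to_k + warp_with_flow(flow_k_to_n, flow_khat_to_k)

def visibility_mask(flow_fwd, flow_bwd, thresh=1.0):
    """Approximate visibility mask: 1 where forward-backward flow consistency holds (threshold in pixels)."""
    err = (flow_fwd + warp_with_flow(flow_bwd, flow_fwd)).norm(dim=1, keepdim=True)
    return (err < thresh).float()
```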

Figure 5. Effect of different blending spaces. (a) Blending multiple frames at the pixel level retains sharp contents but often leads to visible artifacts due to the sensitivity to flow inaccuracy. (b) Blending at the feature level alleviates these artifacts but has difficulty in generating sharp results. (c) The proposed hybrid-space fusion method is robust to flow inaccuracy and produces sharp output frames.

Fusion space. With the aligned frames, we explore several fusion strategies. First, we can directly blend the warped color frames in the image space to produce the output stabilized frame, as shown in Fig. 3(a). This image-space fusion approach is a commonly used technique in image stitching (Agarwala et al., 2004; Szeliski, 2006), video extrapolation (Lee et al., 2019), and novel view synthesis (Hedman et al., 2018). However, image-space fusion is prone to generating ghosting artifacts due to misalignment, or glitch artifacts due to inconsistent labeling between neighboring pixels (Fig. 5(a)). Alternatively, one can also fuse the aligned frames in the feature space (e.g., (Choi et al., 2019)) (Fig. 3(b)). Fusing in the high-dimensional feature space allows the model to be more robust to flow inaccuracy. However, rendering the fused feature map using a neural image-translation decoder often leads to blurry outputs (Fig. 5(b)).

To combine the best of both image-space and feature-space fusion, we propose a hybrid-space fusion mechanism for video stabilization (Fig. 3(c)). Similar to the feature-space fusion, we first extract high-dimensional features from each neighboring frame and warp the features using flow fields. We then learn a CNN to predict the blending weights that best fuse the features. We concatenate the fused feature map and the warped feature for each neighboring frame to form the input for our image decoder. The image decoder learns to predict a target frame and a confidence map for each neighboring frame. Finally, we adopt an image-space fusion to merge all the predicted target frames according to the predicted weights to generate the final stabilized frame.

The core difference between our hybrid-space fusion and feature-space fusion lies in the input to the image decoder. The image decoder in Fig. 5(b) takes only the fused feature as input to predict the output frame. The fused feature map already contains mixed information from multiple frames. The image decoder may thus have difficulty in synthesizing sharp image contents. In contrast, our image decoder in Fig. 5(c) takes the fused feature map as guidance to reconstruct the target frame from the warped feature. We empirically find that this improves the sharpness of the output frame while avoiding ghosting and glitching artifacts, as shown in Fig. 5(c).

Fusion function. In addition to exploring where the contents from multiple frames should be fused (image, feature, or hybrid), here we discuss how. To enable end-to-end training, we consider the following differentiable functions for fusion:
融合空间。有了对齐后的帧,我们探索了几种融合策略。首先,我们可以直接在图像空间中混合扭曲后的彩色帧来产生输出的稳定帧,如图 3(a) 所示。这种图像空间融合方法是图像拼接 (Agarwala 等人, 2004; Szeliski, 2006)、视频外推 (Lee 等人, 2019) 和新视角合成 (Hedman 等人, 2018) 中的常用技术。然而,图像空间融合容易因未对准而产生重影伪影,或因相邻像素之间标签不一致而产生毛刺伪影(图 5(a))。或者,也可以在特征空间中融合对齐后的帧(例如 (Choi 等人, 2019))(图 3(b))。在高维特征空间中进行融合可以使模型对光流不准确更加稳健。然而,使用神经图像翻译解码器渲染融合特征图通常会导致输出模糊(图 5(b))。

为了结合图像空间融合和特征空间融合两者的优点,我们提出了一种用于视频稳定的混合空间融合机制(图 3(c))。与特征空间融合类似,我们首先从每个相邻帧中提取高维特征,并使用流场对特征进行扭曲。然后我们学习一个 CNN 来预测最能融合这些特征的混合权重。我们将融合特征图与每个相邻帧的扭曲特征连接起来,作为图像解码器的输入。图像解码器学习为每个相邻帧预测一个目标帧和一个置信度图。最后,我们采用图像空间融合,根据预测的权重合并所有预测的目标帧,生成最终的稳定帧。

我们的混合空间融合和特征空间融合之间的核心区别在于图像解码器的输入。图 5(b) 中的图像解码器仅将融合特征作为输入来预测输出帧。融合特征图已经包含来自多个帧的混合信息,因此图像解码器可能难以合成清晰的图像内容。相比之下,图 5(c) 中我们的图像解码器以融合特征图为指导,从扭曲特征重建目标帧。我们凭经验发现,这提高了输出帧的清晰度,同时避免了重影和毛刺伪影,如图 5(c) 所示。

融合函数。除了探索来自多个帧的内容应该在哪里融合(图像、特征或混合空间)之外,这里我们还讨论如何融合。为了实现端到端的训练,我们考虑以下可微的融合函数:

1) Mean fusion:

The simplest approach for fusion is computing the mean of available contents at each location:

(1) $f^{\hat{k}}_{\mathrm{mean}} = \dfrac{\sum_{n \in \Omega_k} f^{\hat{k}}_n M^{\hat{k}}_n}{\sum_{n \in \Omega_k} M^{\hat{k}}_n + \tau},$

where $f^{\hat{k}}_n$ and $M^{\hat{k}}_n$ are the encoded feature map and warping mask of frame $n$, respectively. The superscript $\hat{k}$ denotes that the encoded feature and warping mask are warped to the stable frame, and $\tau$ is a small constant to avoid dividing by zero.
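A minimal sketch of Eq. (1), assuming the warped features and masks of the neighboring frames are stacked along a leading dimension; the value of $\tau$ is illustrative.

```python
import torch

def mean_fusion(feats, masks, tau=1e-6):
    """Eq. (1): masked mean of the warped features.

    feats : (N, C, H, W) warped feature maps of the N neighboring frames
    masks : (N, 1, H, W) warping masks (1 = valid pixel)
    """
    num = (feats * masks).sum(dim=0)
    den = masks.sum(dim=0) + tau
    return num / den
```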

2) Gaussian-weighted fusion:

We extend the mean fusion by putting larger weights on frames that are temporally closer to the keyframe:

(2) $f^{\hat{k}}_{\mathrm{gauss}} = \dfrac{\sum_{n \in \Omega_k} f^{\hat{k}}_n M^{\hat{k}}_n W_n}{\sum_{n \in \Omega_k} M^{\hat{k}}_n W_n + \tau},$

where $W_n = G(k, \|n-k\|)$ is the Gaussian blending weight.
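Eq. (2) only adds temporal Gaussian weights to the masked mean. A sketch, where the bandwidth of the Gaussian $G$ is an assumed hyper-parameter:

```python
import torch

def gaussian_fusion(feats, masks, neighbor_ids, k, sigma=3.0, tau=1e-6):
    """Eq. (2): mean fusion with Gaussian temporal weights W_n = G(k, |n - k|).

    neighbor_ids : frame indices n corresponding to the stacked feats/masks
    k            : keyframe index; sigma is an illustrative bandwidth
    """
    d = torch.tensor([abs(n - k) for n in neighbor_ids], dtype=feats.dtype)
    w = torch.exp(-0.5 * (d / sigma) ** 2).view(-1, 1, 1, 1)       # Gaussian blending weights
    num = (feats * masks * w).sum(dim=0)
    den = (masks * w).sum(dim=0) + tau
    return num / den
```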

3) Argmax fusion:

Instead of blending (which may cause blur), we can also take the available content that is temporally closest to the keyframe (using the warping mask $M^{\hat{k}}_n$ and the weights $W_n$).

(3) $f^{\hat{k}}_{\mathrm{argmax}} = f_{\arg\max_n M^{\hat{k}}_n W_n}.$
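Eq. (3) can be implemented as a per-pixel gather over the neighbor with the largest weight $M^{\hat{k}}_n W_n$; a sketch with the same assumed Gaussian weights as above:

```python
import torch

def argmax_fusion(feats, masks, neighbor_ids, k, sigma=3.0):
    """Eq. (3): per pixel, take the feature of the valid frame that is temporally closest to the keyframe."""
    d = torch.tensor([abs(n - k) for n in neighbor_ids], dtype=feats.dtype)
    w = torch.exp(-0.5 * (d / sigma) ** 2).view(-1, 1, 1, 1)
    score = masks * w                                   # (N, 1, H, W): masked temporal weights
    idx = score.argmax(dim=0, keepdim=True)             # winning neighbor per pixel
    idx = idx.expand(1, feats.shape[1], -1, -1)         # broadcast the index over feature channels
    return torch.gather(feats, 0, idx).squeeze(0)
```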

4) Flow error-weighted fusion:

Instead of the above simple heuristic, we can define the blending weights using the confidence of flow estimate:

(4) $f^{\hat{k}}_{\mathrm{FE}} = \dfrac{\sum_{n \in \Omega_k} f^{\hat{k}}_n M^{\hat{k}}_n W^{\mathrm{FE}}_n}{\sum_{n \in \Omega_k} M^{\hat{k}}_n W^{\mathrm{FE}}_n + \tau},$

where the weight $W^{\mathrm{FE}}_n$ is computed as $W^{\mathrm{FE}}_n = \exp(-e^{\hat{k}}_n / \beta)$. Here, we use the forward-backward flow consistency error $e^{\hat{k}}_n$ to measure the confidence:

(5) $e_n(p) = \left\| F_{k \to n}(p) + F_{n \to k}\big(p + F_{k \to n}(p)\big) \right\|_2,$

where $p$ denotes a pixel coordinate in $F_{k \to n}$ and we set $\beta = 0.1$ in all our experiments. In addition, we need to warp the flow error $e_n$ to the target stable frame. We denote the warped flow error as $e^{\hat{k}}_n$.
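A sketch of the flow-error-weighted fusion of Eqs. (4)-(5), assuming the forward-backward consistency errors have already been warped to the target stable frame:

```python
import torch

def flow_error_fusion(feats, masks, flow_errors, beta=0.1, tau=1e-6):
    """Eq. (4): weight each neighbor by its (warped) forward-backward flow consistency error.

    flow_errors : (N, 1, H, W) warped consistency errors from Eq. (5)
    """
    w = torch.exp(-flow_errors / beta)                  # W_n^FE with beta = 0.1 as in the text
    num = (feats * masks * w).sum(dim=0)
    den = (masks * w).sum(dim=0) + tau
    return num / den
```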

5) CNN-based fusion function:

We further explore a learning-based fusion function. Specifically, we train a CNN to predict a blending weight for each neighboring frame using the encoded features, visibility masks, and the flow error (Fig. 6).

(6) $f^{\hat{k}}_{\mathrm{CNN}} = \sum_{n \in \Omega_k} f^{\hat{k}}_n \, \theta\!\left(f^{\hat{k}}_n, M^{\hat{k}}_n, f^{\hat{k}}_k, M^{\hat{k}}_k, e^{\hat{k}}_n\right),$

where $\theta$ is a CNN followed by a softmax.
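A sketch of the CNN-based fusion function $\theta$ in Eq. (6). The layer widths and depth are illustrative; only the input layout (per-neighbor warped feature and mask, keyframe feature and mask, flow error) and the softmax normalization across neighboring frames follow the text.

```python
import torch
import torch.nn as nn

class FusionWeightNet(nn.Module):
    """Eq. (6): a small CNN that predicts per-pixel blending weights, normalized across neighbors."""

    def __init__(self, feat_ch=32):
        super().__init__()
        in_ch = 2 * (feat_ch + 1) + 1                  # (f_n, M_n, f_k, M_k, e_n) per neighbor
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, feats, masks, key_feat, key_mask, flow_errors):
        # feats/masks/flow_errors: (N, C, H, W), (N, 1, H, W), (N, 1, H, W); key_feat: (C, H, W)
        n = feats.shape[0]
        key = torch.cat((key_feat, key_mask), dim=0).unsqueeze(0).expand(n, -1, -1, -1)
        x = torch.cat((feats, masks, key, flow_errors), dim=1)
        logits = self.net(x)                           # (N, 1, H, W), one weight map per neighbor
        weights = torch.softmax(logits, dim=0)         # normalize across the neighboring frames
        return (feats * weights).sum(dim=0)            # fused feature map
```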

After fusing the feature, we concatenate the fused feature with the warped feature and warping mask of each frame as the input to the image decoder. The image decoder then predicts the output color frame and confidence map for each frame:

(7) $\{I^{\hat{k}}_n, C^{\hat{k}}_n\} = \phi\!\left(f^{\hat{k}}_n, M^{\hat{k}}_n, f^{\hat{k}}_{\mathrm{CNN}}\right),$

where $\phi$ denotes the image decoder, and $I^{\hat{k}}_n$ and $C^{\hat{k}}_n$ represent the predicted frame and confidence map of frame $n$ in the virtual camera space at time $k$, respectively. Finally, the output stabilized frame $\hat{I}_k$ is generated by a weighted average of the predicted frames $\{I^{\hat{k}}_n\}$ using the confidence maps $\{C^{\hat{k}}_n\}$ (Eq. (8)).
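Finally, a sketch of how the image decoder $\phi$ (Eq. (7)) and the final image-space blend fit together. The decoder interface is hypothetical, and since Eq. (8) is not reproduced above, a confidence-weighted mean of the predicted frames is assumed here, as suggested by Fig. 3(c).

```python
import torch

def render_output_frame(decoder, feats, masks, fused, eps=1e-6):
    """Eq. (7) followed by an image-space blend of the per-frame predictions.

    decoder : network phi mapping (warped feature, mask, fused feature) to an RGB frame and a
              single-channel confidence map
    feats   : (N, C, H, W) warped per-frame features; masks: (N, 1, H, W); fused: (C, H, W)
    """
    frames, confs = [], []
    for f, m in zip(feats, masks):
        x = torch.cat((f, m, fused), dim=0).unsqueeze(0)
        rgb, conf = decoder(x)                         # predicted frame and confidence for this neighbor
        frames.append(rgb)
        confs.append(conf)
    frames, confs = torch.stack(frames), torch.stack(confs)
    # Confidence-weighted average of the predicted frames (assumed form of Eq. (8)).
    return (frames * confs).sum(dim=0) / (confs.sum(dim=0) + eps)
```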