[Paper Translation] Full-Frame Video Stabilization


Paper: https://arxiv.org/abs/2102.06205

Code: https://github.com/alex04072000/FuSta

Neural Re-rendering for Full-frame Video Stabilization

ABSTRACT.

Existing video stabilization methods either require aggressive cropping of frame boundaries or generate distortion artifacts on the stabilized frames. In this work, we present an algorithm for full-frame video stabilization by first estimating dense warp fields. Full-frame stabilized frames can then be synthesized by fusing warped contents from neighboring frames. The core technical novelty lies in our learning-based hybrid-space fusion that alleviates artifacts caused by optical flow inaccuracy and fast-moving objects. We validate the effectiveness of our method on the NUS and selfie video datasets. Extensive experiment results demonstrate the merits of our approach over prior video stabilization methods.


video stabilization, neural rendering, deep learning



Figure 1. Full-frame video stabilization of challenging videos. Our method takes a shaky input video (top) and produces a stabilized and distortion-free video (bottom), as indicated by less fluctuation in the y−t epipolar plane image. Furthermore, by robustly fusing multiple neighboring frames, our results do not suffer from aggressive cropping of frame borders in the stabilized video and can even expand the field of view of the original video. Our approach significantly outperforms representative state-of-the-art video stabilization algorithms on these challenging scenarios (see Fig. [2]).



Figure 2. Limitations of current state-of-the-art video stabilization techniques. (a) Current commercial video stabilization software (Adobe Premiere Pro 2020) fails to generate smooth video in challenging scenarios of rapid camera shakes. (b) The method in (Yu and Ramamoorthi, 2020) produces temporally smooth video. However, the warped (stabilized) video contains many missing pixels at frame borders and inevitably requires applying aggressive cropping (green checkerboard areas) to generate a rectangle video. (c) The DIFRINT method (Choi and Kweon, 2020) achieves full-frame video stabilization by iteratively applying frame interpolation to generate in-between, stabilized frames. However, interpolating between frames with large camera motion and moving occlusion is challenging. Their results are thus prone to severe artifacts.

1.INTRODUCTION

Video stabilization has become increasingly important with the rapid growth of video content on the Internet platforms, such as YouTube, Vimeo, and Instagram. Casually captured cellphone videos without a professional video stabilizer are often shaky and unpleasant to watch. These videos pose significant challenges for video stabilization algorithms. For example, videos are often noisy due to small image sensors, particularly in low-light environments. Handheld captured videos may contain large camera shake/jitter, resulting in severe motion blur and wobble artifacts from a rolling shutter camera.

Existing video stabilization methods usually consist of three main components: 1) motion estimation, 2) motion smoothing, and 3) stable frame generation. First, the motion estimation step involves estimating motion through 2D feature detection/tracking (Lee et al., 2009; Liu et al., 2011; Goldstein and Fattal, 2012; Wang et al., 2013), dense flow (Yu and Ramamoorthi, 2019, 2020), or recovering camera motion and scene structures (Liu et al., 2009; Zhou et al., 2013; Buehler et al., 2001b; Smith et al., 2009; Liu et al., 2012). Second, the motion smoothing step removes the high-frequency jittering in the estimated motion and predicts the spatial transformations that stabilize each frame, in the form of a homography (Matsushita et al., 2005), a mixture of homographies (Liu et al., 2013; Grundmann et al., 2012), or per-pixel warp fields (Liu et al., 2014; Yu and Ramamoorthi, 2019, 2020). Third, the stable frame generation step uses the predicted spatial transform to synthesize the stabilized video. The stabilized frames, however, often contain large missing regions at frame borders, particularly for videos with large camera motion. This forces existing methods to apply aggressive cropping to maintain a rectangular frame, which leads to a significantly zoomed-in video with resolution loss (Fig. 2(a) and (b)).

Full-frame video stabilization methods aim to address the above-discussed limitation and produce stabilized video with the same field of view (FoV). One approach for full-frame video stabilization is to first compute the stabilized video (with missing pixels at the frame borders) and then apply flow-based video completion methods (Matsushita et al., 2005; Huang et al., 2016; Gao et al., 2020) to fill in missing contents. Such two-stage methods may suffer from the inaccuracies in flow estimation and inpainting (e.g., in poorly textured regions, fluid motion, and motion blur). A recent learning-based method, DIFRINT (Choi and Kweon, 2020), instead uses iterative frame interpolation to stabilize the video while maintaining the original FoV. However, we find that applying frame interpolation repeatedly leads to severe distortion and blur artifacts in challenging cases (Fig. 2(c)).

In this paper, we present a new algorithm that takes a shaky video and the estimated smooth motion fields for stabilization as inputs and produces a full-frame stable video. The core idea of our method lies in fusing information from multiple neighboring frames in a robust manner. Instead of using color frames directly, we use a learned CNN representation to encode rich local appearance for each frame, fuse multiple aligned feature maps, and use a neural decoder network to render the final color frame. We explore multiple design choices for fusing and blending multiple aligned frames (commonly used in image stitching and view synthesis applications). We then propose a hybrid fusion mechanism that leverages both feature-level and image-level fusion to alleviate the sensitivity to flow inaccuracy. In addition, we improve the visual quality of synthesized results by learning to predict spatially varying blending weights, removing blurry input frames for sharp video generation, and transferring high-frequency detail residuals to the re-rendered, stabilized frames. Our method generates stabilized video with significantly fewer artifacts and distortions while retaining (or even expanding) the original FoV (Fig. 1). We evaluate the proposed algorithm against state-of-the-art methods and commercial video stabilization software (Adobe Premiere Pro 2020 warp stabilizer). Extensive experiments show that our method performs favorably against existing methods on two public benchmark datasets (Liu et al., 2013; Yu and Ramamoorthi, 2018).

The main contributions of this work are:

  • We apply neural rendering techniques in the context of video stabilization to alleviate the issues of sensitivity to flow inaccuracy.
  • We present a hybrid fusion mechanism for combining information from multiple frames at both feature- and image-level. We systematically validate various design choices through ablation studies.
  • We demonstrate favorable performance against representative video stabilization techniques on two public datasets. We will release the source code to facilitate future research.

2.RELATED WORK

Motion estimation and smoothing. Most video stabilization methods focus on estimating motion between frames and smoothing the motion. For motion estimation, existing methods often estimate the motion in the 2D image space using sparse feature detection/tracking and dense optical flow. These methods differ in motion modeling, e.g., eigen-trajectories (Liu et al., 2011), epipolar geometry (Goldstein and Fattal, 2012), warping grids (Liu et al., 2013), or dense flow fields (Yu and Ramamoorthi, 2020). For motion smoothing, prior methods use low-pass filtering (Liu et al., 2011), L1 optimization (Grundmann et al., 2011), and spatio-temporal optimization (Wang et al., 2013).

In contrast to estimating 2D motion, several methods recover the camera motion and proxy scene geometry by leveraging Structure from Motion (SfM) algorithms. These methods stabilize frames using 3D reconstruction and projection along with image-based rendering (Kopf et al., 2014) or content-preserving warps (Buehler et al., 2001b; Liu et al., 2009; Goldstein and Fattal, 2012). However, SfM algorithms are less effective in handling complex videos with severe motion blur and highly dynamic scenes. Specialized hardware such as depth cameras (Liu et al., 2012) or light field cameras (Smith et al., 2009) may be required for reliable pose estimation.

Deep learning-based approaches have recently been proposed to directly predict warping fields (Xu et al., 2018; Wang et al., 2018) or optical flows (Yu and Ramamoorthi, 2019, 2020) for video stabilization. In particular, methods with dense warp fields (Liu et al., 2014; Yu and Ramamoorthi, 2019, 2020) offer greater flexibility for compensating motion jittering and implicitly handling rolling shutter effects than parametric warp fields (Xu et al., 2018; Wang et al., 2018).

Our work builds upon existing 2D motion estimation/smoothing techniques for stabilization and focuses on synthesizing full-frame video outputs. Specifically, we adopt the state-of-the-art flow-based stabilization method (Yu and Ramamoorthi, 2020) and use the estimated per-frame warp fields as inputs to our method (our method is agnostic to the motion smoothing technique; other approaches such as parametric warps can also be applied).

Image fusion and composition. With the estimated and smoothed motion, the final step of video stabilization is to render the stabilized frames. Most existing methods synthesize frames by directly warping each input frame to the stabilized location using smoothed warping grids (Liu et al., 2013, 2011) or flow fields predicted by CNNs (Yu and Ramamoorthi, 2019, 2020). However, such approaches inevitably synthesize images with missing regions around frame boundaries. To maintain a rectangular shape, existing methods often crop off the blank areas and generate output videos with a lower resolution than the input video. To address this issue, full-frame video stabilization methods aim to stabilize videos without cropping. These methods use neighboring frames to fill in the blank regions and produce full-frame results by 2D motion inpainting (Matsushita et al., 2005; Gleicher and Liu, 2008; Gao et al., 2020). In contrast to existing motion inpainting methods that first generate stabilized frames and then fill in missing pixels, our method leverages neural rendering to encode and fuse warped appearance features and learns to decode the fused feature map into the final color frames.

Several recent methods can generate full-frame stabilized videos without explicit motion estimation. For example, the method in (Wang et al., 2018) trains a CNN with collected unstable-stable pairs to directly synthesize stable frames. However, direct synthesis of output frames without spatial transformations remains challenging. Recently, the DIFRINT method (Choi and Kweon, 2020) generates full-frame stable videos by iteratively applying frame interpolation. This method couples motion smoothing and frame rendering together. However, the repeated frame interpolation often introduces visible distortion and artifacts (see Fig. 2(c)).

View synthesis. View synthesis algorithms aim to render photorealistic images from novel viewpoints given a single image (Niklaus et al., 2019; Wu et al., 2020; Wiles et al., 2020; Shih et al., 2020; Kopf et al., 2020) or multiple posed images (Levoy and Hanrahan, 1996; Gortler et al., 1996; Chaurasia et al., 2013; Penner and Zhang, 2017; Hedman et al., 2018; Riegler and Koltun, 2020a, b). These methods mainly differ in the ways they map and fuse information, e.g., view interpolation (Chen and Williams, 1993; Debevec et al., 1996; Seitz and Dyer, 1996), 3D proxy geometry and mapping (Buehler et al., 2001a), 3D multi-plane images (Zhou et al., 2018; Srinivasan et al., 2019), and CNNs (Hedman et al., 2018; Riegler and Koltun, 2020a, b). Our fusion network resembles the encoder-decoder network used in (Riegler and Koltun, 2020a) for view synthesis. However, our method does not require estimating scene geometry, which is challenging for dynamic scenes in video stabilization.

A recent line of research focuses on rendering novel views for dynamic scenes from a single video (Xian et al., 2020; Tretschk et al., 2020; Li et al., 2020) based on neural volume rendering (Lombardi et al., 2019; Mildenhall et al., 2020). These methods can be used for full-frame video stabilization by rendering the dynamic video from a smooth camera trajectory. While promising results have been shown, these methods require per-video training and precise camera pose estimates. In contrast, our video stabilization method applies to a wider variety of videos.

Neural rendering. A direct blending of multiple images in the image space may lead to glitching artifacts (visible seams). Some recent methods train neural scene representations to synthesize novel views, such as NeRF (Mildenhall et al., 2020), scene representation networks (Sitzmann et al., 2019b), neural voxel grid (Sitzmann et al., 2019a; Lombardi et al., 2019), 3D Neural Point-Based Graphics (Aliev et al., 2020), and neural textures (Thies et al., 2019). However, these methods often require time-consuming per-scene training and do not handle dynamic scenes. Our method does not require per-video finetuning.

Figure 3. Design choices for fusing multiple frames. To synthesize full-frame stabilized video, we need to align and fuse the contents from multiple neighboring frames in the input shaky video. (a) Conventional panorama image stitching (or, in general, image-based rendering) methods often fuse the warped (stabilized) images at the image level. Fusing at the image level works well when the alignment is accurate, but may generate blending artifacts (e.g., visible seams) when flow estimates are not reliable. (b) One can also encode the images as abstract CNN features, perform the fusion in the feature space, and learn a decoder to convert the fused feature to output frames. Such approaches are more robust to flow inaccuracy but often produce overly blurred images. (c) Our proposed approach combines the advantages of both strategies. We first extract abstract image features (Eq. (6)). We then fuse the warped features from multiple frames. For each source frame, we take the fused feature map together with the individual warped features and decode it to the output frames and the associated confidence maps. Finally, we produce the final output frame by using the weighted average of the generated images as in Eq. (8).

3.FULL-FRAME VIDEO STABILIZATION

We denote $I_t$ as the frame in the real (unstabilized) camera space and $\hat{I}_t$ as the frame in the virtual (stabilized) camera space at timestamp $t$. Given an input video with $T$ frames $\{I_t\}_{t=1}^{T}$, our goal is to generate a video $\{\hat{I}_t\}_{t=1}^{T}$ that is visually stable and maintains the same FOV as the input video without any cropping. Existing video stabilization methods often apply a certain amount of cropping to exclude any missing pixels due to frame warping, as shown in Fig. 2. In contrast, we utilize the information from neighboring frames to render stabilized frames with completed contents or even expand the FOV of the input video.

Video stabilization methods typically consist of three stages: 1) motion estimation, 2) motion smoothing, and 3) frame warping/rendering. Our method focuses on the third stage for rendering high-quality frames without any cropping. Our proposed algorithm is thus agnostic to particular motion estimation/smooth techniques. We assume that the warping field from the real camera space to the virtual camera space is available for each frame (e.g., from (Yu and Ramamoorthi, 2020)).

Given an input video, we first encode image features for each frame, warp the neighboring frames to the virtual camera space at the specific target timestamp, and then fuse the features to render a stabilized frame. We describe the technical detail of each step in the following sections.

3.1.PRE-PROCESSING

Motion estimation and smoothing. Several motion estimation and smoothing methods have been developed (Liu et al., 2011, 2013; Yu and Ramamoorthi, 2020). In this work, we use the state-of-the-art method (Yu and Ramamoorthi, 2020) to obtain a backward dense warping field $F_{k\rightarrow\hat{k}}$ for each frame, where $k$ indicates the input and $\hat{k}$ denotes the stabilized output. Note that one can also generate the warping fields from other parametric motion models such as homography or affine transforms. These warping fields can be directly used to warp the input video. However, the stabilized video often contains irregular boundaries and a large portion of missing pixels. Therefore, the output video requires aggressive cropping and thus loses some content.
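For concreteness, the snippet below sketches one common way to apply such a dense backward warp field with PyTorch's `grid_sample`. It assumes the field stores, for every stabilized output pixel, a pixel offset pointing back into the unstabilized frame; this is a minimal sketch under that assumed convention, not the released implementation.

```python
import torch
import torch.nn.functional as F

def backward_warp(image, warp_field):
    """Warp `image` (B, C, H, W) into the stabilized (virtual) view.

    `warp_field` (B, 2, H, W) is assumed to store, for each stabilized output
    pixel, an (x, y) offset in pixels pointing back into the unstabilized
    frame (a backward dense warp field).
    """
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=image.dtype, device=image.device),
        torch.arange(w, dtype=image.dtype, device=image.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)            # (1, 2, H, W)
    coords = base + warp_field                                  # absolute source coordinates
    # Normalize to [-1, 1] for grid_sample (x first, then y).
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                        # (B, H, W, 2)
    warped = F.grid_sample(image, grid, mode="bilinear",
                           padding_mode="zeros", align_corners=True)
    # Pixels sampled from outside the input frame carry no valid content.
    valid = (grid.abs() <= 1.0).all(dim=-1).unsqueeze(1).to(image.dtype)
    return warped, valid
```

The same routine can warp either color frames or encoded feature maps, since `grid_sample` is agnostic to the number of channels.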

Optical flow estimation. To recover the missing pixels caused by warping, we need to project the corresponding pixels from nearby frames to the target stabilized frame. For each keyframe $I_k$ at time $t=k$, we compute the optical flows $\{F_{n\rightarrow k}\}_{n\in\Omega_k}$ from neighboring frames to the keyframe using RAFT (Teed and Deng, 2020), where $n$ indicates a neighboring frame and $\Omega_k$ denotes the set of neighboring frames for the keyframe $I_k$.

Blurry frame removal. For videos with rapid camera or object motion, the input frames may suffer from severe blur. Fusing information from these blurry frames inevitably leads to blurry output. We thus use the method of (Pech-Pacheco et al., 2000) to compute a sharpness score for each neighboring frame and then select the sharpest 50% of the neighboring frames for fusion. This pre-processing step helps restore sharper contents (see Fig. 4) and improves the runtime.
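The cited sharpness measure is the variance of the Laplacian response. A minimal sketch of scoring and keeping the sharpest half of the neighbors could look as follows; the grayscale conversion and helper names are illustrative assumptions.

```python
import cv2
import numpy as np

def sharpness_score(frame_bgr):
    """Variance of the Laplacian response (Pech-Pacheco et al., 2000)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def keep_sharp_neighbors(neighbor_frames, keep_ratio=0.5):
    """Return indices of the sharpest `keep_ratio` fraction of the neighbors."""
    scores = np.array([sharpness_score(f) for f in neighbor_frames])
    num_keep = max(1, int(round(len(neighbor_frames) * keep_ratio)))
    return np.argsort(scores)[::-1][:num_keep]      # highest sharpness first
```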

Figure 4. Effect of selecting and removing blurred frames. Rapid camera or object motion leads to blurry frames (a). Stabilizing the videos using all these frames inevitably results in blurry outputs (b). By selecting and removing blurred frames in a video, our method re-renders the stabilized video using only sharp input frames and thus can avoid synthesizing blurry frames.

3.2.WARPING AND FUSION

Warping. We warp the neighboring frames $\{I_n\}_{n\in\Omega_k}$ to align with the target frame $\hat{I}_k$ in the virtual camera space. Since we already have the warping field from the target frame to the keyframe, $F_{\hat{k}\rightarrow k}$ (estimated from (Yu and Ramamoorthi, 2020)), and the estimated optical flows from the keyframe to the neighboring frames, $\{F_{k\rightarrow n}\}_{n\in\Omega_k}$, we can compute the warping fields from the target frame to the neighboring frames, $\{F_{\hat{k}\rightarrow n}\}_{n\in\Omega_k}$, by chaining the flow vectors. We can thus warp a neighboring frame $I_n$ to align with the target frame $\hat{I}_k$ using backward warping. Some pixels in the target frame are not visible in the neighboring frames due to occlusion/dis-occlusion or because they fall outside the frame boundary. Therefore, we compute a visibility mask $\{\alpha_n\}_{n\in\Omega_k}$ for each neighboring frame to indicate whether a pixel is valid (labeled as 1) in the source frame or not. We use the method of (Sundaram et al., 2010) to identify occluded pixels (labeled as 0).
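A sketch of the flow chaining and a simplified visibility check is given below; both assume pixel-offset flow fields as in the earlier warping sketch. The threshold-based forward-backward test is a simplification of the criterion of (Sundaram et al., 2010), and the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def sample_field(field, coords):
    """Bilinearly sample a (B, C, H, W) field at absolute pixel coords (B, 2, H, W)."""
    b, _, h, w = field.shape
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)
    return F.grid_sample(field, grid, mode="bilinear", align_corners=True)

def base_grid(flow):
    """Pixel coordinate grid with the same spatial size, dtype, and device as `flow`."""
    h, w = flow.shape[-2:]
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    return torch.stack((xs, ys), dim=0).unsqueeze(0).to(flow)    # (1, 2, H, W)

def chain_warps(warp_khat_to_k, flow_k_to_n):
    """Compose F_{k_hat->k} with F_{k->n} (both pixel offsets) to get F_{k_hat->n}."""
    coords_in_k = base_grid(warp_khat_to_k) + warp_khat_to_k     # where k_hat pixels land in frame k
    return warp_khat_to_k + sample_field(flow_k_to_n, coords_in_k)

def visibility_mask(flow_k_to_n, flow_n_to_k, thresh=1.0):
    """Simple forward-backward consistency test (a simplification of Sundaram et al., 2010)."""
    back = sample_field(flow_n_to_k, base_grid(flow_k_to_n) + flow_k_to_n)
    err = (flow_k_to_n + back).norm(dim=1, keepdim=True)         # cf. Eq. (5) below
    return (err < thresh).to(flow_k_to_n.dtype)                  # 1 = visible, 0 = occluded
```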

Figure 5. Effect of different blending spaces. (a) Blending multiple frames at the pixel level retains sharp contents but often leads to visible artifacts due to the sensitivity to flow inaccuracy. (b) Blending at the feature level alleviates these artifacts but has difficulty in generating sharp results. (c) The proposed hybrid-space fusion method is robust to flow inaccuracy and produces sharp output frames.

Fusion space. With the aligned frames, we explore several fusion strategies. First, we can directly blend the warped color frames in the image space to produce the output stabilized frame, as shown in Fig. 3(a). This image-space fusion approach is a commonly used technique in image stitching (Agarwala et al., 2004; Szeliski, 2006), video extrapolation (Lee et al., 2019), and novel view synthesis (Hedman et al., 2018). However, image-space fusion is prone to generating ghosting artifacts due to misalignment, or glitch artifacts due to inconsistent labeling between neighboring pixels (Fig. 5(a)). Alternatively, one can also fuse the aligned frames in the feature space (e.g., (Choi et al., 2019)) (Fig. 3(b)). Fusing in the high-dimensional feature space allows the model to be more robust to flow inaccuracy. However, rendering the fused feature map using a neural image-translation decoder often leads to blurry outputs (Fig. 5(b)).

To combine the best of both image-space and feature-space fusion, we propose a hybrid-space fusion mechanism for video stabilization (Fig. 3(c)). Similar to the feature-space fusion, we first extract high-dimensional features from each neighboring frame and warp the features using the flow fields. We then learn a CNN to predict the blending weights that best fuse the features. We concatenate the fused feature map and the warped feature of each neighboring frame to form the input to our image decoder. The image decoder learns to predict a target frame and a confidence map for each neighboring frame. Finally, we adopt an image-space fusion to merge all the predicted target frames according to the predicted weights to generate the final stabilized frame.

The core difference between our hybrid-space fusion and feature-space fusion lies in the input to the image decoder. The image decoder in Fig. 5(b) takes only the fused feature as input to predict the output frame. The fused feature map already contains mixed information from multiple frames. The image decoder may thus have difficulty in synthesizing sharp image contents. In contrast, our image decoder in Fig. 5(c) takes the fused feature map as guidance to reconstruct the target frame from the warped feature. We empirically find that this improves the sharpness of the output frame while avoiding ghosting and glitching artifacts, as shown in Fig. 5(c).

Fusion function. In addition to exploring where the contents from multiple frames should be fused (image, feature, or hybrid), we now discuss how. To enable end-to-end training, we consider the following differentiable fusion functions (a short code sketch of these functions is given after Eq. (6)):

1) Mean fusion:

The simplest approach for fusion is computing the mean of available contents at each location:

$$f^{\hat{k}}_{\text{mean}}=\frac{\sum_{n\in\Omega_k} f^{\hat{k}}_n M^{\hat{k}}_n}{\sum_{n\in\Omega_k} M^{\hat{k}}_n+\tau}, \tag{1}$$

where $f^{\hat{k}}_n$ and $M^{\hat{k}}_n$ are the encoded feature map and warping mask of frame $n$, respectively. The superscript $\hat{k}$ denotes that the encoded feature and warping mask are warped to the stable frame, and $\tau$ is a small constant to avoid dividing by zero.

2) Gaussian-weighted fusion:

We extend the mean fusion by putting larger weights on frames that are temporally closer to the keyframe:

$$f^{\hat{k}}_{\text{gauss}}=\frac{\sum_{n\in\Omega_k} f^{\hat{k}}_n M^{\hat{k}}_n W_n}{\sum_{n\in\Omega_k} M^{\hat{k}}_n W_n+\tau}, \tag{2}$$

where $W_n=G(k,\|n-k\|)$ is the Gaussian blending weight.

3) Argmax fusion:

Instead of blending (which may cause blur), we can also take the available content that is temporally closest to the keyframe (using the warping mask $M^{\hat{k}}_n$ and the weights $W_n$):

$$f^{\hat{k}}_{\text{argmax}}=f_{\arg\max_n M^{\hat{k}}_n W_n}. \tag{3}$$

4) Flow error-weighted fusion:

Instead of the above simple heuristics, we can define the blending weights using the confidence of the flow estimates:

$$f^{\hat{k}}_{\text{FE}}=\frac{\sum_{n\in\Omega_k} f^{\hat{k}}_n M^{\hat{k}}_n W^{\text{FE}}_n}{\sum_{n\in\Omega_k} M^{\hat{k}}_n W^{\text{FE}}_n+\tau}, \tag{4}$$

where the weight $W^{\text{FE}}_n$ is computed as $W^{\text{FE}}_n=\exp(-e^{\hat{k}}_n/\beta)$. Here, we use the forward-backward flow consistency error $e^{\hat{k}}_n$ to measure the confidence:

$$e_n(p)=\left\|F_{k\rightarrow n}(p)+F_{n\rightarrow k}\big(p+F_{k\rightarrow n}(p)\big)\right\|_2, \tag{5}$$

where $p$ denotes a pixel coordinate in $F_{k\rightarrow n}$, and we set $\beta=0.1$ in all our experiments. In addition, we need to warp the flow error $e_n$ to the target stable frame. We denote the warped flow error as $e^{\hat{k}}_n$.

5) CNN-based fusion function:

We further explore a learning-based fusion function. Specifically, we train a CNN to predict a blending weight for each neighboring frame using the encoded features, visibility masks, and the flow errors (Fig. 6):

$$f^{\hat{k}}_{\text{CNN}}=\sum_{n\in\Omega_k} f^{\hat{k}}_n\,\theta(f^{\hat{k}}_n,M^{\hat{k}}_n,f^{\hat{k}}_k,M^{\hat{k}}_k,e^{\hat{k}}_n), \tag{6}$$

where $\theta$ is a CNN followed by a softmax.
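For concreteness, a minimal PyTorch sketch of the differentiable fusion functions above is given below (the hard argmax selection of Eq. (3) is omitted). It assumes warped features, masks, and flow-error maps stacked over the $N$ neighboring frames as tensors of shape (N, B, C, H, W), (N, B, 1, H, W), and (N, B, 1, H, W), and the weight-prediction CNN is a small stand-in, not the released architecture.

```python
import torch
import torch.nn as nn

def mean_fusion(f, M, tau=1e-6):
    """Eq. (1): masked average of the N warped neighbor features."""
    return (f * M).sum(0) / (M.sum(0) + tau)

def gaussian_fusion(f, M, frame_idx, k, sigma=2.0, tau=1e-6):
    """Eq. (2): mean fusion with larger weights for temporally closer frames.
    `frame_idx` is a 1-D tensor of neighbor time indices; `k` is the keyframe index."""
    w = torch.exp(-((frame_idx.float() - k) ** 2) / (2 * sigma ** 2))
    w = w.view(-1, 1, 1, 1, 1).to(f)
    return (f * M * w).sum(0) / ((M * w).sum(0) + tau)

def flow_error_fusion(f, M, e, beta=0.1, tau=1e-6):
    """Eq. (4): weights from the forward-backward consistency error of Eq. (5)."""
    w = torch.exp(-e / beta)
    return (f * M * w).sum(0) / ((M * w).sum(0) + tau)

class LearnedFusion(nn.Module):
    """Eq. (6): a CNN scores each neighbor, followed by a softmax over neighbors."""
    def __init__(self, feat_dim=32):
        super().__init__()
        in_ch = 2 * feat_dim + 3                 # (f_n, M_n, f_k, M_k, e_n)
        self.score = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1))

    def forward(self, f, M, e, key_index):
        f_k, M_k = f[key_index], M[key_index]
        scores = torch.stack([
            self.score(torch.cat([f[n], M[n], f_k, M_k, e[n]], dim=1))
            for n in range(f.shape[0])], dim=0)  # (N, B, 1, H, W)
        weights = torch.softmax(scores, dim=0)
        return (f * weights).sum(0)
```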

After fusing the features, we concatenate the fused feature with the warped feature and warping mask of each frame as the input to the image decoder. The image decoder then predicts the output color frame and a confidence map for each frame:

$$\{I^{\hat{k}}_n,C^{\hat{k}}_n\}=\phi(f^{\hat{k}}_n,M^{\hat{k}}_n,f^{\hat{k}}_{\text{CNN}}), \tag{7}$$

where $\phi$ denotes the image decoder, and $I^{\hat{k}}_n$ and $C^{\hat{k}}_n$ represent the predicted frame and confidence map of frame $n$ in the virtual camera space at time $k$, respectively. Finally, the output stabilized frame $\hat{I}_k$ is generated by a weighted sum using these predicted frames and confidence maps:

$$\hat{I}_k=\sum_{n\in\Omega_k} I^{\hat{k}}_n\,C^{\hat{k}}_n. \tag{8}$$
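The decoding and final image-space fusion of Eqs. (7) and (8) can be sketched as follows. The decoder architecture is a stand-in, and normalizing the predicted confidence maps across frames with a softmax is an assumption made here so that the blending weights sum to one.

```python
import torch
import torch.nn as nn

class ImageDecoder(nn.Module):
    """Stand-in decoder phi of Eq. (7): maps (warped feature, mask, fused feature)
    to an RGB frame and a single-channel confidence map."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * feat_dim + 1, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 4, 3, padding=1))                 # 3 color channels + 1 confidence

    def forward(self, f_n, M_n, f_fused):
        out = self.net(torch.cat([f_n, M_n, f_fused], dim=1))
        return out[:, :3], out[:, 3:4]

def render_stabilized_frame(decoder, f, M, f_fused):
    """Eq. (8): confidence-weighted sum of the per-neighbor predictions.
    `f` and `M` are stacked over N neighbors: (N, B, C, H, W) and (N, B, 1, H, W)."""
    frames, confs = zip(*[decoder(f[n], M[n], f_fused) for n in range(f.shape[0])])
    frames = torch.stack(frames, dim=0)
    weights = torch.softmax(torch.stack(confs, dim=0), dim=0)   # normalize across neighbors
    return (frames * weights).sum(0)
```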


Figure 6. Learning-based fusion. Given the warped features $\{f_n\}_{n\in\Omega_k}$ (or warped images for image-space fusion), the warping masks $\{M_n\}_{n\in\Omega_k}$, and the flow error maps $\{e_n\}_{n\in\Omega_k}$, we first concatenate the feature, warping mask, and flow error maps for each frame (shown as dotted blocks). We then use a CNN to predict the blending weights for each neighboring frame. Using the predicted weights, we compute the fused feature by weighted averaging of the individual warped features $\{f_n\}_{n\in\Omega_k}$.

3.3.IMPLEMENTATION DETAILS

Encoding source frames. Before warping and fusion, we encode each source frame $\{I_n\}_{n\in\Omega_k}$ to a higher-dimensional feature map. The image encoder consists of 8 ResNet blocks (He et al., 2016) with a stride of 1. All convolutional layers are followed by a ReLU (Nair and Hinton, 2010) activation function. The final encoded features are 32-dimensional for each location.
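A stand-in encoder matching this description (8 stride-1 residual blocks, ReLU activations, 32-channel output at full resolution) might look like the following; keeping the internal channel width equal to the output dimension is an assumption.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))      # stride-1 residual block

class ImageEncoder(nn.Module):
    """Encodes an RGB frame into a 32-channel feature map at full resolution."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(3, feat_dim, 3, padding=1),
                                  nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*[ResBlock(feat_dim) for _ in range(8)])

    def forward(self, x):
        return self.blocks(self.head(x))
```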

Residual detail transfer. We observe that the output color frames synthesized by the image decoder often do not retain the full high-frequency details of the input video. We tackle this problem by adopting a residual detail transfer technique similar to (Lu et al., 2020) to add the missing residual details back to the output frames, as shown in Fig. 7. Specifically, we feed the image feature $f_k$ of the keyframe into the decoder to reconstruct an image and then compute the residual $\Delta I_k$ by subtracting the reconstructed image from the input image $I_k$. Next, we warp the residual using the flow $F_{\hat{k}\rightarrow k}$ to obtain the warped residual $\Delta I^{\hat{k}}_k$. Finally, we add the warped residual to the decoded image $I^{\hat{k}}_k$ before the final fusion stage. This residual detail transfer helps us recover more high-frequency details and improves the overall sharpness of the synthesized frame (see Fig. 8). The use of residual transfer differs from (Lu et al., 2020) in that 1) we transfer the residual to a virtual stabilized frame (as opposed to the original frame) and 2) our method does not require per-scene/video training.
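The data flow of this step can be sketched as below, reusing the `backward_warp` helper from the earlier warping sketch; `decode` stands for any feature-to-image decoder, so the exact inputs may differ from the released code.

```python
def residual_detail_transfer(I_k, f_k, decode, warp_khat_to_k):
    """Illustrative data flow of the residual detail transfer (Fig. 7).

    `decode` is any callable mapping a feature map to an RGB image, and
    `backward_warp` is the helper sketched in Section 3.1.
    """
    recon = decode(f_k)                                          # decode the keyframe feature
    residual = I_k - recon                                       # high-frequency details the decoder lost
    residual_hat, _ = backward_warp(residual, warp_khat_to_k)    # warp residual to the virtual view
    decoded_hat = decode(backward_warp(f_k, warp_khat_to_k)[0])  # decoded, stabilized keyframe image
    return decoded_hat + residual_hat                            # add the details back before fusion
```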


Figure 7. Residual detail transfer. As rendering (warped) image features using a decoder may suffer from loss of high-frequency details, we compute the residuals and transfer them to the target frames to recover these missing details.

Training. Training our network typically requires unstable-stable video pairs. However, such training data is hard to acquire. We thus adopt the leave-one-out protocol (Riegler and Koltun, 2020a) to train our network. Specifically, we first sample a short sequence of 7 frames from the training set of (Su et al., 2017) and randomly crop the frames to generate the input unstable video. We then apply another random cropping on the center frame as the ground truth of the keyframe.
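A minimal sketch of this leave-one-out training-pair generation is shown below; the crop size follows the 256×256 training patch size reported below, while the helper name and sampling details are illustrative assumptions.

```python
import random
import torch

def make_training_sample(frames, crop=256):
    """Simulate an unstable clip and a stable target from 7 consecutive frames.

    `frames` is a list of 7 tensors (C, H, W).  Each frame receives its own
    random crop (the shaky input); a second, independent crop of the center
    frame serves as the ground-truth stabilized keyframe.
    """
    _, h, w = frames[0].shape

    def rand_crop(img):
        y = random.randint(0, h - crop)
        x = random.randint(0, w - crop)
        return img[:, y:y + crop, x:x + crop]

    unstable = torch.stack([rand_crop(f) for f in frames])      # (7, C, crop, crop)
    target = rand_crop(frames[len(frames) // 2])                # ground-truth keyframe
    return unstable, target
```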

Our loss functions include the L1 and VGG perceptual losses:

$$\mathcal{L}=\left\|I_k-\hat{I}_k\right\|_1+\sum_{l}\lambda_l\left\|\psi_l(I_k)-\psi_l(\hat{I}_k)\right\|_1, \tag{9}$$

where $\psi_l$ denotes the intermediate features from the $l$-th layer of a pre-trained VGG-19 network (Simonyan and Zisserman, 2015) and $\lambda_l$ is the corresponding loss weight. We train our network using the Adam optimizer with a learning rate of 0.0001 and default parameters. Our training patch size is $256\times256$. We train the model for 50 epochs with a batch size of 1. The training process takes 4 days on a single V100 GPU.
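A sketch of the loss in Eq. (9) using torchvision's pre-trained VGG-19 is given below; the specific VGG layers and the weights $\lambda_l$ are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torchvision

class PerceptualL1Loss(nn.Module):
    """Eq. (9): L1 + VGG-19 perceptual loss; layer choices and weights are assumptions."""
    def __init__(self, layers=(3, 8, 17, 26), weights=(1.0, 1.0, 1.0, 1.0)):
        super().__init__()
        vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.layers, self.weights = vgg, set(layers), weights

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layers:
                feats.append(x)
        return feats

    def forward(self, pred, target):
        # Inputs are assumed to be normalized with ImageNet statistics for the VGG term.
        loss = (pred - target).abs().mean()
        for w, fp, ft in zip(self.weights, self._features(pred), self._features(target)):
            loss = loss + w * (fp - ft).abs().mean()
        return loss

# Training setup described in the text: Adam with lr = 1e-4, batch size 1, 50 epochs, e.g.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```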

4.EXPERIMENTAL RESULTS

We start by validating the various design choices of our approach (Section 4.1). Next, we present quantitative comparisons against representative state-of-the-art video stabilization algorithms (Section 4.2) and visual results (Section 4.3). We conclude the section with a user study (Section 4.4), a runtime analysis (Section 4.5), and a discussion of limitations (Section 4.6). Furthermore, we present additional detailed experimental results and stabilized videos from the evaluated methods in the supplementary material. We will make the source code and pre-trained model publicly available.

| Fusion function | (a) Image-space LPIPS ↓ | (a) Image-space SSIM ↑ | (a) Image-space PSNR ↑ | (b) Feature-space LPIPS ↓ | (b) Feature-space SSIM ↑ | (b) Feature-space PSNR ↑ | (c) Hybrid-space LPIPS ↓ | (c) Hybrid-space SSIM ↑ | (c) Hybrid-space PSNR ↑ |
| - | - | - | - | - | - | - | - | - | - |
| Multi-band blending | 0.105 | 0.926 | 26.150 | - | - | - | - | - | - |
| Graph-cut | 0.105 | 0.928 | 26.190 | - | - | - | - | - | - |
| Mean | 0.123 | 0.899 | 24.618 | 0.108 | 0.878 | 26.028 | 0.099 | 0.898 | 27.013 |
| Gaussian | 0.099 | 0.920 | 25.310 | 0.095 | 0.874 | 26.344 | 0.097 | 0.899 | 27.080 |
| Argmax | 0.105 | 0.928 | 26.200 | 0.093 | 0.892 | 26.891 | 0.087 | 0.906 | 27.519 |
| Flow error-weighted | 0.114 | 0.901 | 25.363 | 0.096 | 0.885 | 26.176 | 0.095 | 0.911 | 27.371 |
| CNN-based (Ours) | 0.101 | 0.895 | 26.692 | 0.092 | 0.902 | 27.187 | 0.073 | 0.914 | 27.868 |

Table 1. Quantitative evaluation of fusion functions on the test set of (Su et al., 2017). We highlight the best and the second best in each column.

4.1.ABLATION STUDY

We analyze the contribution of each design choice, including the fusion function design, the fusion mechanism, and the residual detail transfer. We use the test set of (Su et al., 2017) for performance evaluation. The test set contains 10 videos (each with 100 frames). We sample sequences of 7 frames in a sliding-window fashion and use the center frame as our keyframe. In total, we have $(100-3-3)\times10=940$ short test sequences (the "3" comes from the beginning/end of the sequences). We generate each testing sequence by randomly cropping $256\times256$ patches to simulate unstable frames (same as the training data generation). We generate another random crop at the center frame in each testing sequence as our target stabilized frame. Here we use RAFT (Teed and Deng, 2020) to compute the warp field between the two sampled patches at the center frame to simulate the warp field used for stabilizing the video frames.
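For reference, the warp-field simulation between the two crops of the center frame could be done with an off-the-shelf RAFT model as sketched below; using torchvision's RAFT port and this input normalization are assumptions for illustration, not necessarily the setup used in the paper.

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

def simulate_warp_field(stable_crop, unstable_crop):
    """Estimate a dense warp field between the two crops of the center frame.

    Both inputs are (B, 3, 256, 256) tensors normalized to [-1, 1].  RAFT
    returns a list of iteratively refined flow fields; the last one is used.
    """
    model = raft_large(weights=Raft_Large_Weights.DEFAULT).eval()
    with torch.no_grad():
        flows = model(stable_crop, unstable_crop)
    return flows[-1]                     # (B, 2, 256, 256) pixel-offset field
```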

Fusion function. We train the proposed model using the image-space fusion, feature-space fusion, and hybrid-space fusion with different fusion functions discussed in Fig. 5 and Section 3.2. For image-space fusion, we also include two conventional fusion methods: multi-band blending (Brown et al., 2003) and graph-cut (Agarwala et al., 2004). We report the quantitative results in Table 1.

For image-space fusion, no single fusion function dominates the results; the argmax and CNN-based fusions perform slightly better than the other alternatives. For both feature-space and hybrid-space fusion, the proposed CNN-based fusion shows advantages over the other approaches.

Fusion space. Next, we compare different fusion levels (using CNN-based fusion). Table 2 shows that the proposed hybrid-space fusion achieves the best results compared to image-space fusion and feature-space fusion. Fig. 5 shows an example to demonstrate the rendered frames using different fusion spaces. The synthesized frame from the image-space fusion looks sharp but contains visible glitching artifacts due to the discontinuity of different frames and inaccurate motion estimation. The results of the feature-space fusion are smooth but overly blurred. Finally, our hybrid-space fusion takes advantage of both above methods, generating a sharp and artifact-free frame.

Residual detail transfer. Table 3 shows that our residual detail transfer approach helps restore the high-frequency details lost during neural rendering. Fig. 8 shows an example of visual improvement from residual detail transfer.

| Fusion space | LPIPS ↓ | SSIM ↑ | PSNR ↑ |
| - | - | - | - |
| Image-space fusion | 0.101 | 0.895 | 26.692 |
| Feature-space fusion | 0.092 | 0.902 | 27.187 |
| Hybrid-space fusion (Ours) | 0.073 | 0.914 | 27.868 |

Table 2. Quantitative comparisons on fusion spaces. We use our CNN-based fusion function to compare the performance of three different fusion spaces. See Fig. 5 for visual comparisons.

| | LPIPS ↓ | SSIM ↑ | PSNR ↑ |
| - | - | - | - |
| w/o residual detail transfer | 0.073 | 0.914 | 27.868 |
| w/ residual detail transfer | 0.056 | 0.942 | 29.255 |

Table 3. Ablation study of residual detail transfer. See Fig. 8 for visual comparisons.

Figure 8. Effect of the residual detail transfer. The residual detail transfer restores more high-frequency details in the output frames.

4.2.QUANTITATIVE EVALUATION

We compare the proposed method with state-of-the-art video stabilization algorithms, including (Grundmann et al., 2011), (Liu et al., 2013), (Wang et al., 2018), (Choi and Kweon, 2020), and (Yu and Ramamoorthi, 2020), as well as the warp stabilizer in Adobe Premiere Pro CC 2020. We obtain the results of the compared methods from the videos released by the authors or generate them from the publicly available official implementations with default parameters or pre-trained models.

Datasets. We evaluate the above methods on the NUS dataset (Liu et al., 2013) and the selfie dataset (Yu and Ramamoorthi, 2018). The NUS dataset consists of 144 video sequences, which are classified into six categories: 1) Simple, 2) Quick rotation, 3) Zooming, 4) Parallax, 5) Crowd, and 6) Running. The selfie dataset includes 33 video clips with frontal faces and severe jittering.

Metrics. We use three widely-used metrics to evaluate the performance of video stabilization (Liu et al., 2013; Wang et al., 2018; Choi et al., 2019): cropping ratio, distortion, and stability. We also include the accumulated optical flow used in Yu et al. (2019) for evaluation. We describe the definition of each evaluation metric in the supplementary material.

Results on the NUS dataset. We show the average scores on the NUS dataset (Liu et al., 2013) on the left side of Table 4 and plot the cumulative score of each metric in the first row of Fig. 9. Both (Choi and Kweon, 2020) and our method are full-frame methods and thus have an average cropping ratio of 1. Note that the distortion metric measures the global distortion by fitting a homography between the input and stabilized frames, which is not suitable for measuring local distortion. Although (Choi and Kweon, 2020) obtains the highest distortion score, their results contain visible local or color distortion due to the iterative frame interpolation, as shown in Fig. 10 and our supplementary material. The Adobe Premiere Pro 2020 warp stabilizer obtains the highest stability score at the cost of a low cropping ratio (0.74). Our results achieve the lowest accumulated optical flow, the second-best distortion and stability scores, and an average cropping ratio of 1 (no cropping for all the videos), demonstrating our advantages over the state-of-the-art approaches on the NUS dataset.

| Method | Cropping ratio ↑ | Distortion value ↑ |
| - | - | - |
| Bundle (Liu et al., 2013) | 0.84 | 0.93 |
| L1Stabilizer (Grundmann et al., 2011) | 0.74 | 0.92 |
| StabNet (Wang et al., 2018) | 0.68 | 0.82 |
| DIFRINT (Choi and Kweon, 2020) | 1.00 | 0.97 |
| (Yu and Ramamoorthi, 2020) | 0.86 | 0.91 |
| (Yu and Ramamoorthi, 2018) | - | - |
| Adobe Premiere Pro 2020 warp stabilizer | 0.74 | 0.83 |
| Ours | 1.00 | 0.96 |

Table 4. Quantitative evaluations with the state-of-the-art methods on the NUS dataset (Liu et al., 2013) and the selfie dataset (Yu and Ramamoorthi, 2018). Red text indicates the best and blue text indicates the second-best performing method.

Results on the Selfie dataset. We show the average scores on the selfie dataset (Yu and Ramamoorthi, 2018) on the right side of Table 4 and plot the cumulative score of each metric in the second row of Fig. 9. Similar to the NUS dataset, our method achieves the best cropping ratio and accumulated optical flow. Our distortion and stability scores are comparable to (Yu and Ramamoorthi, 2018), which is specially designed to stabilize selfie videos.

Figure 9. Quantitative evaluation on the NUS dataset (Liu et al., 2013) and the selfie dataset (Yu and Ramamoorthi, 2018). As some of the methods fail to produce results for several sequences, computing averaged scores of successful sequences does not reflect the robustness of the method. For each method, we store all the error metrics for each sequence and plot the sorted scores. These plots reflect both the completeness and the performance (in cropping ratio, distortion, stability, and smoothness). Our approach achieves full-frame stabilization (except for a few sequences where (Yu and Ramamoorthi, 2020) fails to generate results) while maintaining competitive performance compared with the state-of-the-art.

4.3.VISUAL COMPARISON

We show one stabilized frame from our method and the state-of-the-art approaches on the Selfie dataset in Fig. 10. Most of the methods (Grundmann et al., 2011; Wang et al., 2018; Liu et al., 2013; Yu and Ramamoorthi, 2020) suffer from a large amount of cropping, as indicated by the green checkerboard regions. The aspect ratios of the output frames are also changed. To keep the same aspect ratio as the input video, even more aggressive cropping is required. The DIFRINT method (Choi and Kweon, 2020) generates full-frame results. However, their results often contain clearly visible artifacts due to frame interpolation. In contrast, our method generates full-frame stabilized videos with fewer visual artifacts. Please refer to our supplementary material for more video comparisons.

[Figure 10 panels: Bundle (Liu et al., 2013); L1Stabilizer (Grundmann et al., 2011); StabNet (Wang et al., 2018); DIFRINT (Choi and Kweon, 2020); (Yu and Ramamoorthi, 2020); Adobe Premiere Pro 2020 warp stabilizer; Ours; Input]

Figure 10. Visual comparison to state-of-the-art methods. Our proposed fusion approach does not suffer from aggressive cropping of frame borders and renders stabilized frames with significantly fewer artifacts than DIFRINT (Choi and Kweon, 2020). We refer the readers to our supplementary material for extensive video comparisons with prior representative methods.

4.4.USER STUDY

As the quality of video stabilization is a subjective matter of taste, we also conduct a user study to compare the proposed method with three approaches: (Yu and Ramamoorthi, 2020), DIFRINT (Choi and Kweon, 2020), and the Adobe Premiere Pro 2020 warp stabilizer. We randomly select two videos from the Selfie dataset (Yu and Ramamoorthi, 2018) and two videos from each category of the NUS dataset (Liu et al., 2013), resulting in a total of $2\times(6+1)=14$ videos for comparison. We adopt a pairwise comparison protocol (Rubinstein et al., 2010; Lai et al., 2016), where each video pair has three questions:

  1. Which video preserves the most content?
  2. Which video contains fewer artifacts or distortions?
  3. Which video is more stable?

The order of the videos is randomly shuffled. The input video is also shown to users as a reference.

From this user study, we collect the results from 46 subjects. We show the winning rate of our method against the compared approaches in Fig. 11. Overall, our results are preferred by the users in more than 60% of the comparisons, demonstrating that the proposed method also outperforms existing approaches in the subjective evaluation.


Figure 11. User preference of our method against other state-of-the-art methods. We ask the subjects to rate their preferences over the video stabilization results in a randomized paired comparison (ours vs. other) in terms of content preservation, visual artifacts, and temporal stability. Overall our results are preferred by the users in all aspects.

4.5.RUNTIME ANALYSIS

We measure the runtime of the CPU-based approaches (Grundmann et al., 2011; Liu et al., 2013; Yu and Ramamoorthi, 2018) on a laptop with an i7-8550U CPU. For our method and the GPU-based approaches (Wang et al., 2018; Yu and Ramamoorthi, 2020; Choi and Kweon, 2020), we evaluate on a server with an Nvidia Tesla V100 GPU. We use sequence #10 from the Selfie dataset for evaluation, which has a frame resolution of 854×480. We show the runtime comparisons in Table 5. We also provide the runtime profiling of our method in Table 6. The pre-processing step of motion smoothing from (Yu and Ramamoorthi, 2020) takes about 71% of our runtime. Our method can be accelerated by using a more efficient motion smoothing algorithm. However, the quality of warping and fusion also depends on the accuracy of the motion estimation and smoothing.

| Method | Environment | Implementation | Runtime (s) |
| - | - | - | - |
| L1Stabilizer (Grundmann et al., 2011) | CPU | Matlab | 0.623 |
| Bundle (Liu et al., 2013) | CPU | Matlab | 7.612 |
| StabNet (Wang et al., 2018) | GPU | TensorFlow | 0.073 |
| DIFRINT (Choi and Kweon, 2020) | GPU | PyTorch | 1.627 |
| (Yu and Ramamoorthi, 2020) | GPU | PyTorch | 6.501 |
| (Yu and Ramamoorthi, 2018) | CPU | Matlab | 15 mins* |
| Adobe Premiere Pro 2020 warp stabilizer | CPU | - | 0.045 |
| Ours excluding (Yu and Ramamoorthi, 2020) | GPU | PyTorch | 3.090 |
| Ours | GPU | PyTorch | 9.591 |

Table 5. Per-frame runtime comparison. CPU-based methods are evaluated on a laptop with i7-8550U CPU, while GPU-based methods are evaluated on a server with Nvidia Tesla V100 GPU. The testing video has a frame resolution of 854×480. *: Time is reported in (Yu and Ramamoorthi, 2018) as the source code is not available.

| Stage | Runtime (s) | Percentage |
| - | - | - |
| Motion smoothing (Yu and Ramamoorthi, 2020) | 6.826 | 71.03% |
| Image encoder | 0.724 | 7.53% |
| Warping | 0.207 | 2.15% |
| CNN-based fusion | 0.952 | 9.91% |
| Image decoder | 0.881 | 9.17% |
| Final fusion | 0.020 | 0.21% |

Table 6. Runtime breakdown of the proposed method.

4.6.LIMITATIONS

Video stabilization for videos captured in unconstrained scenarios remains challenging. Here we discuss several limitations of our approach. We refer the readers to our supplementary material for video results.

Wobble. When the camera or object motion is too rapid, we observe that the stabilized frames exhibit rolling shutter wobble (Fig. 12(a)). Integrating rolling shutter rectification into our approach can potentially alleviate this issue.

Visible seams. As our method fuses multiple frames to re-render a stabilized frame, our results may contain visible seams, particularly when there are large lighting variations caused by camera white-balancing and exposure corrections (Fig. 12(b)).

Temporal flicker and distortion. Our method builds upon an existing motion smoothing method (Yu and Ramamoorthi, 2020) to obtain stabilized frames. However, this method (Yu and Ramamoorthi, 2020) may produce temporally flickering results due to large non-rigid occlusion and inconsistent foreground/background motion (e.g., selfie videos). In such cases, because our method depends on these estimates for motion inpainting, it also suffers from undesired temporal flicker and distortion (Fig. 12(c)).

Speed. Our work aims at offline applications. Currently, our method runs slowly (about 10 seconds per frame) compared to many existing methods. Speeding up the runtime is an important direction for future work.

Figure 12. Limitations. (a) Our method does not correct rolling shutter and thus may suffer from wobble artifacts. (b) Our fusion approach is not capable of compensating for large photometric variations across frames, resulting in visible seams. (c) The existing smoothing approach (Yu and Ramamoorthi, 2020) may introduce distortion, particularly when videos contain large non-rigid occlusion (such as selfie videos). As our method depends on the flow produced by (Yu and Ramamoorthi, 2020) for stabilizing the video, we cannot correct such errors.