[论文翻译]MODNet:不用绿幕就能实现的实时精准人像抠图


原文地址:https://arxiv.org/pdf/2011.11961.pdf

代码地址:https://github.com/ZHKKKe/MODNet

Is a Green Screen Really Necessary for Real-Time Human Matting?

Zhanghan Ke1,2  Kaican Li2  Yurou Zhou2  Qiuhua Wu2  Xiangyu Mao2
Qiong Yan2  Rynson W.H. Lau1
1Department of Computer Science, City University of Hong Kong     2SenseTime Research

ABSTRACT

For human matting without the green screen, existing works either require auxiliary inputs that are costly to obtain or use multiple models that are computationally expensive. Consequently, they are unavailable in real-time applications. In contrast, we present a light-weight matting objective decomposition network (MODNet), which can process human matting from a single input image in real time. The design of MODNet benefits from optimizing a series of correlated sub-objectives simultaneously via explicit constraints. Moreover, since trimap-free methods usually suffer from the domain shift problem in practice, we introduce (1) a self-supervised strategy based on sub-objectives consistency to adapt MODNet to real-world data and (2) a one-frame delay trick to smooth the results when applying MODNet to video human matting.

MODNet is easy to be trained in an end-to-end style. It is much faster than contemporaneous matting methods and runs at 63 frames per second. On a carefully designed human matting benchmark newly proposed in this work, MODNet greatly outperforms prior trimap-free methods. More importantly, our method achieves remarkable results in daily photos and videos. Now, do you really need a green screen for real-time human matting?

摘要

对于没有绿幕的人像抠图,现有方法要么需要获取成本高昂的辅助输入,要么使用计算开销很大的多个模型,因此无法用于实时应用。相比之下,我们提出了一种轻量级的抠图目标分解网络(MODNet),它可以仅从单张输入图像实时地完成人像抠图。MODNet的设计受益于通过显式约束同时优化一系列相关的子目标。此外,由于无trimap的方法在实践中通常会遇到域偏移问题,我们引入了(1)基于子目标一致性的自监督策略,使MODNet适应真实数据;(2)一帧延迟技巧,在将MODNet应用于视频人像抠图时平滑输出结果。

MODNet易于以端到端的方式训练。它比同期的抠图方法快得多,运行速度达到每秒63帧。在本文新提出的精心设计的人像抠图基准上,MODNet大幅优于以往的无trimap方法。更重要的是,我们的方法在日常照片和视频上也取得了显著的效果。那么,实时人像抠图真的还需要绿幕吗?


Figure 1: Our Framework for Human Matting. Our method can process trimap-free human matting in real time under changing scenes. (a) We train MODNet on the labeled dataset to learn matting sub-objectives from RGB images. (b) To adapt to real-world data, we finetune MODNet on the unlabeled data by using the consistency between sub-objectives. (c) In the application of video human matting, our OFD trick can help smooth the predicted alpha mattes of the video sequence.

图1:我们的人像抠图框架。我们的方法可以在不断变化的场景下实时进行无trimap的人像抠图。(a)我们在标注数据集上训练MODNet,从RGB图像中学习抠图子目标。(b)为了适应真实世界的数据,我们利用子目标之间的一致性在未标注数据上对MODNet进行微调。(c)在视频人像抠图的应用中,我们的OFD技巧可以帮助平滑视频序列的预测alpha遮罩。

1 INTRODUCTION 介绍

Human matting aims to predict a precise alpha matte that can be used to extract people from a given image or video. It has a wide variety of applications, such as photo editing and movie re-creation. Currently, a green screen is required to obtain a high quality alpha matte in real time.

人像抠图的目的是预测一个精确的alpha遮罩,用于从给定的图像或视频中提取人物。它有着广泛的应用场景,例如照片编辑和电影再创作。目前,要实时获得高质量的alpha遮罩仍然需要绿幕。

When a green screen is not available, most existing matting methods [AdaMatting, CAMatting, GCA, IndexMatter, SampleMatting, DIM] use a pre-defined trimap as a priori. However, the trimap is costly for humans to annotate, or suffers from low precision if captured via a depth camera. Therefore, some latest works attempt to eliminate the model dependence on the trimap, i.e., trimap-free methods. For example, background matting [BM] replaces the trimap by a separate background image. Others [SHM, BSHM, DAPM] apply multiple models to firstly generate a pseudo trimap or semantic mask, which is then served as the priori for alpha matte prediction. Nonetheless, using the background image as input has to take and align two photos while using multiple models significantly increases the inference time. These drawbacks make all aforementioned matting methods not suitable for real-time applications, such as preview in a camera. Besides, limited by insufficient amount of labeled training data, trimap-free methods often suffer from domain shift [DomainShift] in practice, i.e., the models cannot well generalize to real-world data, which has also been discussed in [BM].

当没有绿幕时,大多数现有的抠图方法[AdaMatting, CAMatting, GCA, IndexMatter, SampleMatting, DIM]使用预先定义的trimap作为先验。但是,trimap需要人工标注,成本很高;如果改用深度相机获取,精度又比较低。因此,一些最新的工作试图消除模型对trimap的依赖,即无trimap的方法。例如,背景抠图[BM]用一张单独的背景图像代替trimap。其他工作[SHM, BSHM, DAPM]使用多个模型,先生成伪trimap或语义蒙版,再将其作为alpha遮罩预测的先验。但是,以背景图像作为输入必须拍摄并对齐两张照片,而使用多个模型会显著增加推理时间。这些缺点使上述所有抠图方法都不适合实时应用,例如相机中的预览。此外,受限于标注训练数据不足,无trimap方法在实践中经常遭受域偏移[DomainShift]的困扰,即模型无法很好地泛化到真实数据,这一点在[BM]中也有讨论。

To predict an accurate alpha matte from only one RGB image by using a single model, we propose MODNet, a light-weight network that decomposes the human matting task into three correlated sub-tasks and optimizes them simultaneously through specific constraints. There are two insights behind MODNet. First, neural networks are better at learning a set of simple objectives rather than a complex one. Therefore, addressing a series of matting sub-objectives can achieve better performance. Second, applying explicit supervisions for each sub-objective can make different parts of the model to learn decoupled knowledge, which allows all the sub-objectives to be solved within one model. To overcome the domain shift problem, we introduce a self-supervised strategy based on sub-objective consistency (SOC) for MODNet. This strategy utilizes the consistency among the sub-objectives to reduce artifacts in the predicted alpha matte. Moreover, we suggest a one-frame delay (OFD) trick as post-processing to obtain smoother outputs in the application of video human matting. Fig. 1 summarizes our framework.

为了仅用单个模型就能从一张RGB图像预测出准确的alpha遮罩,我们提出了MODNet。这是一种轻量级网络,它将人像抠图任务分解为三个相关的子任务,并通过特定的约束同时对其进行优化。MODNet背后有两个见解。首先,神经网络更擅长学习一组简单的目标,而不是一个复杂的目标,因此解决一系列抠图子目标可以获得更好的性能。其次,对每个子目标施加明确的监督可以使模型的不同部分学习解耦的知识,从而可以在一个模型中解决所有子目标。为了克服域偏移问题,我们为MODNet引入了一种基于子目标一致性(SOC)的自监督策略,利用子目标之间的一致性来减少预测alpha遮罩中的伪影。此外,我们建议使用一帧延迟(OFD)技巧作为后处理,在视频人像抠图应用中获得更平滑的输出。图1总结了我们的框架。

MODNet has several advantages over previous trimap-free methods. First, MODNet is much faster. It is designed for real-time applications, running at 63 frames per second (fps) on an Nvidia GTX 1080Ti GPU with an input size of 512×512. Second, MODNet achieves state-of-the-art results, benefitted from (1) objective decomposition and concurrent optimization; and (2) specific supervisions for each of the sub-objectives. Third, MODNet can be easily optimized end-to-end since it is a single well-designed model instead of a complex pipeline. Finally, MODNet has better generalization ability thanks to our SOC strategy. Although our results are not able to surpass those of the trimap-based methods on the human matting benchmarks with trimaps, our experiments show that MODNet is more stable in practical applications due to the removal of the trimap input. We believe that our method is challenging the necessity of using a green screen for real-time human matting.

Since open-source human matting datasets [DAPM, DIM] have limited scale or precision, prior works train and validate their models on private datasets of diverse quality and difficulty levels. As a result, it is not easy to compare these methods fairly. In this work, we evaluate existing trimap-free methods under a unified standard: all models are trained on the same dataset and validated on the portrait images from Adobe Matting Dataset [DIM] and our newly proposed benchmark. Our new benchmark is labelled in high quality, and it is more diverse than those used in previous works. Hence, it can reflect the matting performance more comprehensively. More on this is discussed in Sec. 5.1.

In summary, we present a novel network architecture, named MODNet, for trimap-free human matting in real time. Moreover, we introduce two techniques, SOC and OFD, to generalize MODNet to new data domains and smooth the matting results on videos. Another contribution of this work is a carefully designed validation benchmark for human matting. Our code, pre-trained model, and validation benchmark will be made available at:

https://github.com/ZHKKKe/MODNet .

与以前的无trimap方法相比,MODNet具有许多优点。首先,MODNet要快得多。它是为实时应用设计的,在Nvidia GTX 1080Ti GPU上以512×512的输入大小运行,速度为每秒63帧(fps)。其次,得益于(1)目标分解与并行优化,以及(2)对每个子目标的具体监督,MODNet取得了最先进的结果。第三,由于MODNet是一个设计良好的单一模型,而不是复杂的管道,因此可以很容易地进行端到端优化。最后,得益于我们的SOC策略,MODNet具有更好的泛化能力。尽管在带trimap的人像抠图基准上,我们的结果无法超过基于trimap的方法,但实验表明,由于去掉了trimap输入,MODNet在实际应用中更加稳定。我们认为,我们的方法正在挑战实时人像抠图必须使用绿幕这一假设。

由于开源人像抠图数据集[DAPM, DIM]的规模或精度有限,以往的工作在质量和难度各不相同的私有数据集上训练并验证模型,因此很难公平地比较这些方法。在这项工作中,我们在统一的标准下评估现有的无trimap方法:所有模型都在同一数据集上训练,并在Adobe Matting Dataset [DIM]的人像图像和我们新提出的基准上进行验证。我们的新基准具有高质量的标注,并且比以往工作中使用的基准更加多样化,因此能够更全面地反映抠图性能。更多内容见第5.1节。

总而言之,我们提出了一种新颖的网络结构MODNet,用于实时的无trimap人像抠图。此外,我们引入了SOC和OFD两种技术,将MODNet推广到新的数据域,并平滑视频上的抠图结果。这项工作的另一个贡献是一个精心设计的人像抠图验证基准。我们的代码、预训练模型和验证基准将在以下位置提供:

https://github.com/ZHKKKe/MODNet  。

2 RELATED WORK 相关工作

2.1 IMAGE MATTING 图像抠图

The purpose of image matting is to extract the desired foreground F from a given image I. Unlike the binary mask output from image segmentation [IS_Survey] and saliency detection [SOD_Survey], matting predicts an alpha matte with precise foreground probability for each pixel, which is represented by α in the following formula:
图像抠图的目的是从给定图像 I 中提取所需的前景 F。与图像分割[IS_Survey]和显著性检测[SOD_Survey]输出的二值蒙版不同,抠图为每个像素预测精确的前景概率,即下式中的 α:

$$ I_{i} = \alpha_{i} F_{i} + (1 - \alpha_{i}) B_{i}, \quad \alpha_{i} \in [0, 1] $$

where i is the pixel index, and B is the background of I. When the background is not a green screen, this problem is ill-posed since all variables on the right hand side are unknown. Most existing matting methods take a pre-defined trimap as an auxiliary input, which is a mask containing three regions: absolute foreground (α=1), absolute background (α=0), and unknown area (α=0.5). In this way, the matting algorithms only have to estimate the foreground probability inside the unknown area based on the priori from the other two regions.

其中 i 是像素索引,B 是 I 的背景。当背景不是绿幕时,由于等式右侧的所有变量都是未知的,这个问题是不适定的。大多数现有的抠图方法都以预定义的trimap作为辅助输入,它是一个包含三个区域的蒙版:绝对前景(α=1)、绝对背景(α=0)和未知区域(α=0.5)。这样,抠图算法只需基于另外两个区域给出的先验,估计未知区域内部的前景概率。
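为便于理解上式,下面给出一个基于NumPy的合成示意代码(仅为示意,函数名与参数均为自拟):它按像素实现 I = αF + (1−α)B,也就是在得到alpha遮罩之后进行背景替换的做法。

```python
import numpy as np

def composite(foreground, background, alpha):
    """按合成公式 I = alpha * F + (1 - alpha) * B 逐像素合成图像。

    foreground, background: HxWx3 的 float 数组, 取值范围 [0, 1]
    alpha: HxW 的 float 数组, 即 alpha 遮罩, 取值范围 [0, 1]
    """
    alpha = alpha[..., None]  # 扩展为 HxWx1 以便与三通道图像广播
    return alpha * foreground + (1.0 - alpha) * background

# 用法示例: 用预测得到的 alpha 遮罩把人像合成到新背景上
# new_image = composite(person_image, new_background, predicted_alpha)
```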

Traditional matting algorithms heavily rely on low-level features, e.g., color cues, to determine the alpha matte through sampling [sampling_chuang, sampling_feng, sampling_gastal, sampling_he, sampling_johnson, sampling_karacan, sampling_ruzon] or propagation [prop_aksoy2, prop_aksoy, prop_bai, prop_chen, prop_grady, prop_levin, prop_levin2, prop_sun], which often fail in complex scenes. With the tremendous progress of deep learning, many methods based on convolutional neural networks (CNN) have been proposed, and they improve matting results significantly. Cho et al. [NIMUDCNN] and Shen et al. [DAPM] combined the classic algorithms with CNN for alpha matte refinement. Xu et al. [DIM] proposed an auto-encoder architecture to predict alpha matte from an RGB image and a trimap. Some works [GCA, IndexMatter] argued that the attention mechanism could help improve matting performance. Lutz et al. [AlphaGAN] demonstrated the effectiveness of generative adversarial networks [GAN] in matting. Cai et al. [AdaMatting] suggested a trimap refinement process before matting and showed the advantages of an elaborate trimap. Since obtaining a trimap requires user effort, some recent methods (including our MODNet) attempt to avoid it, as described below.

传统的抠图算法在很大程度上依赖低级特征(例如颜色线索),通过采样[sampling_chuang, sampling_feng, sampling_gastal, sampling_he, sampling_johnson, sampling_karacan, sampling_ruzon]或传播[prop_aksoy2, prop_aksoy, prop_bai, prop_chen, prop_grady, prop_levin, prop_levin2, prop_sun]来确定alpha遮罩,这些方法在复杂场景中往往会失败。随着深度学习的巨大进步,人们提出了许多基于卷积神经网络(CNN)的方法,它们显著提升了抠图效果。Cho等人[NIMUDCNN]和Shen等人[DAPM]将经典算法与CNN结合,用于alpha遮罩的细化。Xu等人[DIM]提出了一种自编码器结构,根据RGB图像和trimap预测alpha遮罩。一些工作[GCA, IndexMatter]认为注意力机制有助于改善抠图性能。Lutz等人[AlphaGAN]证明了生成对抗网络[GAN]在抠图中的有效性。Cai等人[AdaMatting]建议在抠图之前对trimap进行细化,并展示了精细trimap的优势。由于获取trimap需要用户的额外工作,一些最新方法(包括我们的MODNet)试图避免使用trimap,如下所述。

2.2TRIMAP-FREE HUMAN MATTING 无TRIMAP的人类抠图算法

Image matting is extremely difficult when trimaps are unavailable as semantic estimation will be necessary (to locate the foreground) before predicting a precise alpha matte. Currently, trimap-free methods always focus on a specific type of foreground objects, such as humans. Nonetheless, feeding RGB images into a single neural network still yields unsatisfactory alpha mattes. Sengupta et al. [BM] proposed to capture a less expensive background image as a pseudo green screen to alleviate this issue. Other works designed their pipelines that contained multiple models. For example, Shen et al. [SHM] assembled a trimap generation network before the matting network. Zhang et al. [LFM] applied a fusion network to combine the predicted foreground and background. Liu et al. [BSHM] concatenated three networks to utilize coarse labeled data in matting. The main problem of all these methods is that they cannot be used in interactive applications since: (1) the background images may change frame to frame, and (2) using multiple models is computationally expensive. Compared with them, our MODNet is light-weight in terms of both input and pipeline complexity. It takes one RGB image as input and uses a single model to process human matting in real time with better performance.

当没有trimap时,图像抠图极其困难,因为在预测精确的alpha遮罩之前必须先进行语义估计(定位前景)。目前,无trimap的方法都只针对某一特定类型的前景目标,例如人像。尽管如此,直接将RGB图像送入单个神经网络仍然会得到不理想的alpha遮罩。Sengupta等人[BM]提出额外拍摄一张获取成本较低的背景图像作为伪绿幕来缓解这一问题。其他工作则设计了包含多个模型的管道。例如,Shen等人[SHM]在抠图网络之前组装了一个trimap生成网络;Zhang等人[LFM]使用一个融合网络把预测的前景和背景结合起来;Liu等人[BSHM]串联了三个网络,以便在抠图中利用粗标注数据。所有这些方法的主要问题是无法用于交互式应用,因为:(1)背景图像可能逐帧变化;(2)使用多个模型的计算代价很高。与它们相比,我们的MODNet在输入和管道复杂度上都是轻量级的。它以一张RGB图像作为输入,用单个模型实时完成人像抠图,并获得更好的性能。

2.3OTHER TECHNIQUES 其他技术

We briefly discuss some other techniques related to the design and optimization of our method.

High-Resolution Representations. Popular CNN architectures [net_resnet, net_mobilenet, net_densenet, net_vggnet, net_insnet] generally contain an encoder, i.e., a low-resolution branch, to reduce the resolution of the input. Such a process will discard image details that are essential in many tasks, including image matting. Wang et al. [net_hrnet] proposed to keep high-resolution representations throughout the model and exchange features between different resolutions, which induces huge computational overheads. Instead, MODNet only applies an independent high-resolution branch to handle foreground boundaries.

Attention Mechanisms. Attention [attention_survey] for deep neural networks has been widely explored and proved to boost the performance notably. In computer vision, we can divide these mechanisms into spatial-based or channel-based according to their operating dimension. To obtain better results, some matting models [GCA, IndexMatter] combined spatial-based attentions that are time-consuming. In MODNet, we integrate the channel-based attention so as to balance between performance and efficiency.

Consistency Constraint. Consistency is one of the most important assumptions behind many semi-/self-supervised [semi_un_survey] and domain adaptation [udda_survey] algorithms. For example, Ke et al. [GCT] designed a consistency-based framework that could be used for semi-supervised matting. Toldo et al. [udamss] presented a consistency-based domain adaptation strategy for semantic segmentation. However, these methods consist of multiple models and constrain the consistency among their predictions. In contrast, our MODNet imposes consistency among various sub-objectives within a model.
我们简要讨论了与方法设计和优化有关的其他技术。

高分辨率表示形式:流行的CNN架构[net_resnet, net_mobilenet, net_densenet, net_vggnet, net_insnet]通常包含一个编码器,即低分辨率分支,以降低输入的分辨率。这样的过程会丢弃在许多任务(包括图像抠图)中必不可少的图像细节。Wang等人[net_hrnet]提出在整个模型中保留高分辨率表示,并在不同分辨率之间交换特征,但这会带来巨大的计算开销。相反,MODNet只使用一个独立的高分辨率分支来处理前景边界。

注意力机制:深度神经网络中的注意力机制[attention_survey]已得到广泛探索,并被证明可以显著提升性能。在计算机视觉中,可以根据其作用维度将这些机制分为基于空间的和基于通道的两类。为了获得更好的结果,一些抠图模型[GCA, IndexMatter]引入了比较耗时的空间注意力。在MODNet中,我们集成基于通道的注意力,以便在性能和效率之间取得平衡。

一致性约束:一致性是许多半监督/自监督[semi_un_survey]和域自适应[udda_survey]算法背后最重要的假设之一。例如,Ke等人[GCT]设计了一个可用于半监督抠图的基于一致性的框架;Toldo等人[udamss]提出了一种基于一致性的语义分割域自适应策略。但是,这些方法由多个模型组成,约束的是不同模型预测之间的一致性。相反,我们的MODNet在单个模型内部对各个子目标施加一致性约束。


Figure 2: Architecture of MODNet. Given an input image I, MODNet predicts human semantics sp, boundary details dp, and final alpha matte αp through three interdependent branches, S, D, and F, which are constrained by specific supervisions generated from the ground truth matte αg. Since the decomposed sub-objectives are correlated and help strengthen each other, we can optimize MODNet end-to-end.

3MODNET

In this section, we elaborate the architecture of MODNet and the constraints used to optimize it.
在本节中,我们将详细阐述MODNet的体系结构以及用于对其进行优化的约束条件。

3.1OVERVIEW 概述

Methods that are based on multiple models [SHM, BSHM, DAPM] have shown that regarding trimap-free matting as a trimap prediction (or segmentation) step plus a trimap-based matting step can achieve better performances. This demonstrates that neural networks are benefited from breaking down a complex objective. In MODNet, we extend this idea by dividing the trimap-free matting objective into semantic estimation, detail prediction, and semantic-detail fusion. Intuitively, semantic estimation outputs a coarse foreground mask while detail prediction produces fine foreground boundaries, and semantic-detail fusion aims to blend the features from the first two sub-objectives.

As shown in Fig. 2, MODNet consists of three branches, which learn different sub-objectives through specific constraints. Specifically, MODNet has a low-resolution branch (supervised by the thumbnail of the ground truth matte) to estimate human semantics. Based on it, a high-resolution branch (supervised by the transition region (α∈(0,1)) in the ground truth matte) is introduced to focus on the human boundaries. At the end of MODNet, a fusion branch (supervised by the whole ground truth matte) is added to predict the final alpha matte. In the following subsections, we will delve into the branches and the supervisions used to solve each sub-objective.

基于多个模型的方法[SHM, BSHM, DAPM]已经表明,把无trimap抠图看作"trimap预测(或分割)步骤 + 基于trimap的抠图步骤"可以获得更好的性能。这说明分解复杂目标对神经网络是有益的。在MODNet中,我们扩展了这一思想,把无trimap抠图目标划分为语义估计、细节预测和语义-细节融合。直观地说,语义估计输出粗略的前景蒙版,细节预测产生精细的前景边界,而语义-细节融合则旨在把前两个子目标的特征融合起来。

如图2所示,MODNet由三个分支组成,它们通过特定的约束学习不同的子目标。具体来说,MODNet用一个低分辨率分支(由真值遮罩的缩略图监督)来估计人像语义;在此基础上,引入一个高分辨率分支(由真值遮罩中的过渡区域(α∈(0,1))监督)来关注人像边界;在MODNet的末端,再添加一个融合分支(由完整的真值遮罩监督)来预测最终的alpha遮罩。在下面的小节中,我们将深入介绍各个分支以及用于求解每个子目标的监督。
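下面用PyTorch给出一个三分支前向流程的示意代码,帮助理解上述结构(这不是论文的官方实现:各分支内部用极简的卷积层代替真实的网络结构,层数与通道数均为假设值,仅示意 S、D、F 之间的数据流向)。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MODNetSketch(nn.Module):
    """按论文图2思路写的三分支前向流程示意(非官方实现, 各分支用极简卷积代替真实结构)。"""

    def __init__(self):
        super().__init__()
        # 低分辨率分支 S: 这里用四个步长为 2 的卷积代替 MobileNetV2 编码器(仅为示意)
        self.s_branch = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.s_head = nn.Conv2d(64, 1, 1)            # 输出粗略语义 s_p
        # 高分辨率分支 D: 输入为 I 与上采样后的 S(I)
        self.d_branch = nn.Sequential(
            nn.Conv2d(3 + 64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.d_head = nn.Conv2d(32, 1, 1)            # 输出边界细节 d_p
        # 融合分支 F: 拼接语义特征与细节特征, 输出最终 alpha
        self.f_branch = nn.Sequential(
            nn.Conv2d(64 + 32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, img):
        s_feat = self.s_branch(img)                                   # S(I), 1/16 分辨率
        s_p = torch.sigmoid(self.s_head(s_feat))                      # 粗略语义(低分辨率)
        s_up = F.interpolate(s_feat, size=img.shape[2:],
                             mode="bilinear", align_corners=False)
        d_feat = self.d_branch(torch.cat([img, s_up], dim=1))         # D(I, S(I))
        d_p = torch.sigmoid(self.d_head(d_feat))                      # 边界细节
        alpha_p = torch.sigmoid(
            self.f_branch(torch.cat([s_up, d_feat], dim=1)))          # 最终 alpha
        return s_p, d_p, alpha_p

# 用法: s_p, d_p, alpha_p = MODNetSketch()(torch.rand(1, 3, 512, 512))
```

真实的MODNet中,S 采用MobileNetV2,D 复用 S 的低级特征且主要在1/4分辨率上计算,具体请参考文首给出的官方代码仓库。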

3.2 SEMANTIC ESTIMATION 语义估计

Similar to existing multiple-model approaches, the first step of MODNet is to locate the human in the input image I. The difference is that we extract the high-level semantics only through an encoder, i.e., the low-resolution branch S of MODNet, which has two main advantages. First, semantic estimation becomes more efficient since it is no longer done by a separate model that contains the decoder. Second, the high-level representation S(I) is helpful for subsequent branches and joint optimization. We can apply arbitrary CNN backbone to S. To facilitate real-time interaction, we adopt the MobileNetV2 [net_mobilenetv2] architecture, an ingenious model developed for mobile devices, as our S.

与现有的多模型方法类似,MODNet的第一步是在输入图像 I 中定位人像。不同之处在于,我们只用一个编码器(即MODNet的低分辨率分支 S)来提取高级语义,这有两个主要优点。首先,语义估计变得更加高效,因为它不再由一个包含解码器的独立模型来完成;其次,高级表示 S(I) 对后续分支和联合优化很有帮助。我们可以把任意的CNN主干用作 S。为了便于实时交互,我们采用为移动设备设计的MobileNetV2 [net_mobilenetv2]结构作为 S。

When analysing the feature maps in S(I), we notice that some channels have more accurate semantics than others. Besides, the indices of these channels vary in different images. However, the subsequent branches process all S(I) in the same way, which may cause the feature maps with false semantics to dominate the predicted alpha mattes in some images. Our experiments show that channel-wise attention mechanisms can encourage using the right knowledge and discourage those that are wrong. Therefore, we append a SE-Block [net_senet] after S to reweight the channels of S(I).
在分析 S(I) 的特征图时,我们注意到某些通道的语义比其他通道更准确,而且这些通道的索引在不同图像中并不相同。但是,后续分支以相同的方式处理 S(I) 的所有通道,这可能导致语义错误的特征图在某些图像中主导预测的alpha遮罩。我们的实验表明,基于通道的注意力机制可以鼓励模型使用正确的知识,抑制错误的知识。因此,我们在 S 之后追加一个SE-Block [net_senet],对 S(I) 的通道重新加权。
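SE-Block 的通道重加权可以用下面的PyTorch代码来示意(按照 [net_senet] 的通用写法,reduction 等超参数为假设值,并非论文中的具体配置):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """通道注意力(SE-Block)示意: 对 S(I) 的各通道重新加权。"""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # 全局平均池化得到每个通道的统计量
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # 每个通道一个 [0,1] 权重
        return x * w                                            # 按权重缩放通道, 抑制语义错误的通道
```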

To predict coarse semantic mask sp, we feed S(I) into a convolutional layer activated by the Sigmoid function to reduce its channel number to 1. We supervise sp by a thumbnail of the ground truth matte αg. Since sp is supposed to be smooth, we use L2 loss here, as:
为了预测粗略的语义蒙版 s_p,我们把 S(I) 送入一个由Sigmoid函数激活的卷积层,将其通道数降为1。我们用真值遮罩 α_g 的缩略图来监督 s_p。由于 s_p 应当是平滑的,这里使用L2损失:

$$ \mathcal{L}_{s} = \frac{1}{2} \, \big|\big| s_p - G(\alpha_g) \big|\big|_{2} $$

where G stands for 16× downsampling followed by Gaussian blur. It removes the fine structures (such as hair) that are not essential to human semantics.

其中 G 表示先进行16倍下采样,再进行高斯模糊。它去除了对人像语义并非必需的精细结构(例如头发)。
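下面给出语义损失 L_s 的一个示意实现(基于PyTorch与torchvision;模糊核大小以及用均值形式的MSE近似L2范数,均为假设的实现细节):

```python
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def semantic_loss(s_p, alpha_g, blur_kernel=5):
    """L_s = 1/2 * || s_p - G(alpha_g) ||_2 的示意实现。

    s_p:     低分辨率分支输出的粗略语义, 形状 (N, 1, H/16, W/16)
    alpha_g: 真值 alpha 遮罩, 形状 (N, 1, H, W)
    """
    # G: 先 16 倍下采样(对齐到 s_p 的分辨率), 再高斯模糊, 去掉头发等精细结构
    g = F.interpolate(alpha_g, size=s_p.shape[-2:],
                      mode="bilinear", align_corners=False)
    g = gaussian_blur(g, kernel_size=blur_kernel)
    return 0.5 * F.mse_loss(s_p, g)
```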

3.3DETAIL PREDICTION 细节预测

We process the transition region around the foreground human with a high-resolution branch D, which takes I, S(I), and the low-level features from S as inputs. The purpose of reusing the low-level features is to reduce the computational overheads of D. In addition, we further simplify D in the following three aspects: (1) D consists of fewer convolutional layers than S; (2) a small channel number is chosen for the convolutional layers in D; (3) we do not maintain the original input resolution throughout D. In practice, D consists of 12 convolutional layers, and its maximum channel number is 64. The feature map resolution is downsampled to 1/4 of I in the first layer and restored in the last two layers. The impact of this setup on detail prediction is negligible since D contains a skip link.

我们用一个高分辨率分支 D 来处理前景人像周围的过渡区域,它以 I、S(I) 以及来自 S 的低级特征作为输入。重用低级特征的目的是减少 D 的计算开销。此外,我们从以下三个方面进一步简化 D:(1) D 的卷积层数比 S 更少;(2) D 的卷积层选用较小的通道数;(3) D 并不在全程保持原始输入分辨率。在实践中,D 由12个卷积层组成,最大通道数为64;特征图分辨率在第一层被下采样到 I 的1/4,并在最后两层恢复。由于 D 中包含一条跳跃连接,这一设置对细节预测的影响可以忽略不计。

We denote the outputs of D as D(I,S(I)), which implies the dependency between sub-objectives — high-level human semantics S(I) is a priori for detail prediction. We calculate the boundary detail matte dp from D(I,S(I)) and learn it through L1 loss, as:
我们把 D 的输出记为 D(I, S(I)),这体现了子目标之间的依赖关系:高级人像语义 S(I) 是细节预测的先验。我们从 D(I, S(I)) 计算边界细节遮罩 d_p,并通过L1损失学习它:

$$ \mathcal{L}_{d} = m_d \, \big|\big| d_p - \alpha_g \big|\big|_{1} $$

where md is a binary mask to let Ld focus on the human boundaries. md is generated through dilation and erosion on αg. Its values are 1 if the pixels are inside the transition region, and 0 otherwise. In fact, the pixels with md=1 are the ones in the unknown area of the trimap. Although dp may contain inaccurate values for the pixels with md=0, it has high precision for the pixels with md=1.
其中 m_d 是一个二值掩码,让 L_d 只关注人像边界。m_d 通过对 α_g 进行膨胀和腐蚀生成:若像素位于过渡区域内,其值为1,否则为0。实际上,m_d=1 的像素就是trimap未知区域中的像素。虽然 d_p 在 m_d=0 的像素上可能包含不准确的值,但它在 m_d=1 的像素上具有很高的精度。
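过渡区域掩码 m_d 与细节损失 L_d 可以用如下代码来示意(膨胀/腐蚀核的大小为假设值;论文未说明 L1 是求和还是取均值,这里按过渡区域内取均值处理):

```python
import cv2
import numpy as np
import torch

def transition_mask(alpha_g, kernel_size=15):
    """对真值 alpha 做膨胀和腐蚀, 得到过渡(未知)区域的二值掩码 m_d。"""
    fg = (alpha_g > 0.5).astype(np.uint8)                 # 粗略前景
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(fg, kernel)
    eroded = cv2.erode(fg, kernel)
    return (dilated - eroded).astype(np.float32)           # 过渡区域为 1, 其余为 0

def detail_loss(d_p, alpha_g, m_d):
    """L_d = m_d * || d_p - alpha_g ||_1: 只在过渡区域内计算 L1 误差。"""
    diff = m_d * torch.abs(d_p - alpha_g)
    return diff.sum() / m_d.sum().clamp(min=1.0)
```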

3.4SEMANTIC-DETAIL FUSION 语义细节融合

The fusion branch F in MODNet is a straightforward CNN module, combining semantics and details. We first upsample S(I) to match its shape with D(I,S(I)). We then concatenate S(I) and D(I,S(I)) to predict the final alpha matte αp, constrained by:
MODNet中的融合分支 F 是一个简单的CNN模块,用于结合语义和细节。我们先对 S(I) 进行上采样,使其形状与 D(I, S(I)) 匹配,然后把 S(I) 和 D(I, S(I)) 拼接起来,预测最终的alpha遮罩 α_p,其约束为:

$$ \mathcal{L}_{\alpha} = \big|\big| \alpha_{p} - \alpha_{g} \big|\big|_{1} + \mathcal{L}_{c} $$

where Lc is the compositional loss from [DIM]. It measures the absolute difference between the input image I and the composited image obtained from αp, the ground truth foreground, and the ground truth background.
其中 L_c 是来自[DIM]的合成损失,它衡量输入图像 I 与合成图像之间的绝对差,后者由 α_p、真值前景和真值背景合成得到。

MODNet is trained end-to-end through the sum of Ls, Ld, and Lα, as:
MODNet通过 L_s、L_d 和 L_α 三者之和进行端到端训练:

$$ \mathcal{L} = \lambda_s \, \mathcal{L}_{s} + \lambda_d \, \mathcal{L}_{d} + \lambda_{\alpha} \, \mathcal{L}_{\alpha} $$

where λs, λd, and λα are hyper-parameters balancing the three losses. The training process is robust to these hyper-parameters. We set λs=λα=1 and λd=10.
其中 λ_s、λ_d 和 λ_α 是平衡三个损失的超参数。训练过程对这些超参数是鲁棒的。我们设置 λ_s = λ_α = 1,λ_d = 10。
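把上述三个损失组合起来,完整的训练损失可以写成下面的示意形式(沿用前面示意代码中的 semantic_loss 与 detail_loss;合成损失 L_c 按 [DIM] 的思路,用预测 alpha 与真值前景/背景重新合成图像后取 L1,细节属于假设):

```python
import torch

def modnet_training_loss(s_p, d_p, alpha_p, alpha_g, m_d, image, fg_g, bg_g,
                         lambda_s=1.0, lambda_d=10.0, lambda_alpha=1.0):
    """总损失 L = λ_s·L_s + λ_d·L_d + λ_α·L_α 的示意实现。"""
    l_s = semantic_loss(s_p, alpha_g)          # 语义损失, 见 3.2 节的示意代码
    l_d = detail_loss(d_p, alpha_g, m_d)       # 细节损失, 见 3.3 节的示意代码
    # L_α = || alpha_p - alpha_g ||_1 + L_c
    l_alpha = torch.abs(alpha_p - alpha_g).mean()
    composited = alpha_p * fg_g + (1.0 - alpha_p) * bg_g   # 用预测 alpha 重新合成图像
    l_c = torch.abs(composited - image).mean()
    return lambda_s * l_s + lambda_d * l_d + lambda_alpha * (l_alpha + l_c)
```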

4ADAPTATION TO REAL-WORLD DATA 适应现实世界的数据

The training data for human matting requires excellent labeling in the hair area, which is almost impossible for natural images with complex backgrounds. Currently, most annotated data comes from photography websites. Although these images have monochromatic or blurred backgrounds, the labeling process still needs to be completed by experienced annotators with considerable amount of time and the help of professional tools. As a consequence, the labeled datasets for human matting are usually small. Xu et al. [DIM] suggested using background replacement as a data augmentation to enlarge the training set, and it has become a typical setting in image matting. However, the training samples obtained in such a way exhibit different properties from those of the daily life images for two reasons. First, unlike natural images of which foreground and background fit seamlessly together, images generated by replacing backgrounds are usually unnatural. Second, professional photography is often carried out under controlled conditions, like special lighting that is usually different from those observed in our daily life. Therefore, existing trimap-free models always tend to overfit the training set and perform poorly on real-world data.

To address the domain shift problem, we utilize the consistency among the sub-objectives to adapt MODNet to unseen data distributions (Sec. 4.1). Moreover, to alleviate the flicker between video frames, we apply a one-frame delay trick as post-processing (Sec. 4.2).

用于人像抠图的训练数据需要在头发区域有出色的标注,这对于背景复杂的自然图像几乎是不可能的。目前,大多数带标注的数据来自摄影网站。尽管这些图像具有单色或虚化的背景,标注过程仍然需要经验丰富的标注者花费大量时间并借助专业工具完成。因此,人像抠图的标注数据集通常很小。Xu等人[DIM]建议使用背景替换作为数据增强来扩大训练集,这已成为图像抠图中的典型设置。然而,以这种方式得到的训练样本与日常生活中的图像表现出不同的性质,原因有二:第一,自然图像中前景与背景融合得天衣无缝,而替换背景生成的图像通常不自然;第二,专业摄影往往在受控条件下进行,例如使用与日常生活中不同的特殊光照。因此,现有的无trimap模型总是倾向于过拟合训练集,在真实数据上表现不佳。

为了解决域转移问题,我们利用子目标之间的一致性使MODNet适应看不见的数据分布(第4.1节 )。此外,为了减轻视频帧之间的闪烁,我们应用一帧延迟技巧作为后处理(第4.2节 )。

4.1 SUB-OBJECTIVES CONSISTENCY (SOC) 子目标一致性(SOC)

For unlabeled images from a new domain, the three sub-objectives in MODNet may have inconsistent outputs. For example, the foreground probability of a certain pixel belonging to the background may be wrong in the predicted alpha matte αp but is correct in the predicted coarse semantic mask sp. Intuitively, this pixel should have close values in αp and sp. Motivated by this, our self-supervised SOC strategy imposes the consistency constraints between the predictions of the sub-objectives (Fig. 1 (b)) to improve the performance of MODNet in the new domain.

Formally, we use M to denote MODNet. As described in Sec. 3, M has three outputs for an unlabeled image ~I, as:

对于来自新域的未标记图像,MODNet中的三个子目标可能会给出不一致的输出。例如,某个属于背景的像素,其前景概率在预测的alpha遮罩 α̃_p 中可能是错误的,但在预测的粗略语义蒙版 s̃_p 中却是正确的。直观上,这个像素在 α̃_p 和 s̃_p 中应当具有接近的取值。受此启发,我们的自监督SOC策略在各子目标的预测之间施加一致性约束(图1(b)),以提升MODNet在新域中的性能。

形式上,我们用 M 表示MODNet。如第3节所述,M 对未标记图像 Ĩ 有三个输出:

$$ \tilde{s}_p, \; \tilde{d}_p, \; \tilde{\alpha}_p = M(\tilde{I}) $$

We force the semantics in ~αp to be consistent with ~sp and the details in ~αp to be consistent with ~dp by:
我们通过下式强制 α̃_p 中的语义与 s̃_p 保持一致,并强制 α̃_p 中的细节与 d̃_p 保持一致:

$$ \mathcal{L}_{cons} = \frac{1}{2} \, \big|\big| G(\tilde{\alpha}_{p}) - \tilde{s}_{p} \big|\big|_{2} + \tilde{m}_d \, \big|\big| \tilde{\alpha}_{p} - \tilde{d}_{p} \big|\big|_{1} $$

where ~md indicates the transition region in ~αp, and G has the same meaning as the one in Eq. 2. However, adding the L2 loss on blurred G(~αp) will smooth the boundaries in the optimized ~αp. Hence, the consistency between ~αp and ~dp will remove the details predicted by the high-resolution branch. To prevent this problem, we duplicate M to M′ and fix the weights of M′ before performing SOC. Since the fine boundaries are preserved in ~d′p output by M′, we append an extra constraint to maintain the details in M as:
其中 m̃_d 表示 α̃_p 中的过渡区域,G 的含义与式(2)中相同。然而,在经过模糊的 G(α̃_p) 上添加L2损失,会使优化后的 α̃_p 的边界变得平滑,于是 α̃_p 与 d̃_p 之间的一致性约束会抹掉高分辨率分支预测出的细节。为避免这个问题,我们在执行SOC之前把 M 复制为 M′ 并固定 M′ 的权重。由于 M′ 输出的 d̃′_p 中保留了精细的边界,我们附加一个额外的约束来保持 M 中的细节:

$$ \mathcal{L}_{dd} = \tilde{m}_d \, \big|\big| \tilde{d}_{p}^{\prime} - \tilde{d}_{p} \big|\big|_{1} $$

We generalize MODNet to the target domain by optimizing Lcons and Ldd simultaneously.
我们通过同时优化 L_cons 和 L_dd,把MODNet推广到目标域。
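SOC 阶段的两个损失可以用下面的代码来示意(非官方实现:过渡区域 m̃_d 用 alpha 取值在 (0.05, 0.95) 之间来近似,G 沿用训练阶段的下采样加高斯模糊,这些细节均为假设):

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def soc_losses(model, frozen_model, image, blur_kernel=5):
    """计算 L_cons 与 L_dd: 前者约束子目标一致, 后者保留冻结副本 M' 预测的细节。"""
    s_p, d_p, alpha_p = model(image)              # M 对未标记图像的三个输出
    with torch.no_grad():
        _, d_p_fixed, _ = frozen_model(image)     # M' 的细节预测, 权重固定

    # 语义一致性: G(alpha_p) 与 s_p 的 L2 (G = 下采样到 s_p 分辨率 + 高斯模糊)
    g_alpha = F.interpolate(alpha_p, size=s_p.shape[-2:],
                            mode="bilinear", align_corners=False)
    g_alpha = gaussian_blur(g_alpha, kernel_size=blur_kernel)
    l_sem = 0.5 * F.mse_loss(g_alpha, s_p)

    # 细节一致性: 只在 alpha_p 的过渡区域内约束 alpha_p 与 d_p 接近
    m_d = ((alpha_p > 0.05) & (alpha_p < 0.95)).float()
    l_det = (m_d * torch.abs(alpha_p - d_p)).sum() / m_d.sum().clamp(min=1.0)
    l_cons = l_sem + l_det

    # L_dd: 防止一致性约束抹平高分辨率分支预测出的边界细节
    l_dd = (m_d * torch.abs(d_p_fixed - d_p)).sum() / m_d.sum().clamp(min=1.0)
    return l_cons, l_dd
```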

Figure 3: Flickering Pixels Judged by OFD. The foreground moves slightly to the left in three consecutive frames. We focus on three pixels: (1) the pixel marked in green does not satisfy the 1st condition in C; (2) the pixel marked in blue does not satisfy the 2nd condition in C; (3) the pixel marked in red flickers at frame t.

4.2ONE-FRAME DELAY (OFD) 一帧延迟(OFD)

Applying image processing algorithms independently to each video frame often leads to temporal inconsistency in the outputs. In matting, this phenomenon usually appears as flickers in the predicted matte sequence. Since the flickering pixels in a frame are likely to be correct in adjacent frames, we may utilize the preceding and the following frames to fix these pixels. If the fps is greater than 30, the delay caused by waiting for the next frame is negligible.

Suppose that we have three consecutive frames, and their corresponding alpha mattes are $\alpha_{t-1}$, $\alpha_{t}$, and $\alpha_{t+1}$, where t is the frame index. We regard $\alpha_{t}^{i}$ as a flickering pixel if it satisfies the following conditions C (illustrated in Fig. 3):
独立地对每一帧视频应用图像处理算法,往往会导致输出在时间上不一致。在抠图中,这种现象通常表现为预测遮罩序列中的闪烁。由于某一帧中的闪烁像素在相邻帧中很可能是正确的,我们可以利用前一帧和后一帧来修正这些像素。如果帧率大于30 fps,等待下一帧造成的延迟可以忽略不计。

假设我们有三个连续的帧,它们对应的alpha遮罩分别为 $\alpha_{t-1}$、$\alpha_{t}$ 和 $\alpha_{t+1}$,其中 t 是帧索引。若像素 $\alpha_{t}^{i}$ 满足以下条件 C(如图3所示),我们就把它视为闪烁像素:

1. $$ |\alpha_{t-1}^{i} - \alpha_{t+1}^{i}| \leq \xi $$
2. $$ |\alpha_{t}^{i} - \alpha_{t-1}^{i}| > \xi \;\; \textrm{and} \;\; |\alpha_{t}^{i} - \alpha_{t+1}^{i}| > \xi $$

In practice, we set ξ=0.1 to measure the similarity of pixel values. C indicates that if the values of $\alpha_{t-1}^{i}$ and $\alpha_{t+1}^{i}$ are close, and $\alpha_{t}^{i}$ is very different from the values of both $\alpha_{t-1}^{i}$ and $\alpha_{t+1}^{i}$, a flicker appears in $\alpha_{t}^{i}$. We replace the value of $\alpha_{t}^{i}$ by averaging $\alpha_{t-1}^{i}$ and $\alpha_{t+1}^{i}$, as:
实际上,我们设置 ξ=0.1 来衡量像素值的相似程度。条件 C 的含义是:如果 $\alpha_{t-1}^{i}$ 与 $\alpha_{t+1}^{i}$ 的值很接近,而 $\alpha_{t}^{i}$ 与两者都相差很大,那么 $\alpha_{t}^{i}$ 处出现了闪烁。此时我们用 $\alpha_{t-1}^{i}$ 与 $\alpha_{t+1}^{i}$ 的平均值替换 $\alpha_{t}^{i}$:

$$ \alpha_{t}^{i} = \big(\alpha_{t-1}^{i} + \alpha_{t+1}^{i}\big) \,/\, 2, \quad \textrm{if} \;\; C $$

Note that OFD is only suitable for smooth movement. It may fail in fast motion videos.
请注意,OFD仅适用于平稳移动。在快速运动视频中可能会失败。
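OFD 的判断与替换逻辑可以用下面的NumPy代码来示意(函数名为自拟;处理第 t 帧需要等到第 t+1 帧的结果,因此输出存在一帧延迟):

```python
import numpy as np

def ofd_smooth(alpha_prev, alpha_cur, alpha_next, xi=0.1):
    """一帧延迟(OFD)示意: 对满足条件 C 的闪烁像素, 用前后两帧的均值替换当前帧的值。"""
    cond1 = np.abs(alpha_prev - alpha_next) <= xi                  # 前后两帧取值接近
    cond2 = (np.abs(alpha_cur - alpha_prev) > xi) & \
            (np.abs(alpha_cur - alpha_next) > xi)                  # 当前帧与前后两帧都差很多
    flicker = cond1 & cond2
    out = alpha_cur.copy()
    out[flicker] = 0.5 * (alpha_prev[flicker] + alpha_next[flicker])
    return out

# 用法: alpha_t_smoothed = ofd_smooth(alpha[t - 1], alpha[t], alpha[t + 1])
```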


Figure 4: Benchmark Comparison. (a) Validation benchmarks used in [SHM, BSHM, LFM] synthesize samples by replacing the background. Instead, our PHM-100 contains original image backgrounds and has higher diversity in the foregrounds. We show samples (b) with fine hair, (c) with additional objects, and (d) without bokeh or with full-body.

5EXPERIMENTS 实验

In this section, we first introduce the PHM-100 benchmark for human matting. We then compare MODNet with existing matting methods on PHM-100. We further conduct ablation experiments to evaluate various aspects of MODNet. Finally, we demonstrate the effectiveness of SOC and OFD in adapting MODNet to real-world data.

在本节中,我们首先介绍用于人像抠图的PHM-100基准,然后在PHM-100上把MODNet与现有抠图方法进行比较,并进一步通过消融实验评估MODNet的各个方面,最后展示SOC和OFD在使MODNet适应真实数据方面的有效性。

5.1PHOTOGRAPHIC HUMAN MATTING BENCHMARK 摄影人体抠图基准

Existing works constructed their validation benchmarks from a small amount of labeled data through image synthesis. Their benchmarks are relatively easy due to unnatural fusion or mismatched semantics between the foreground and the background (Fig. 4 (a)). Therefore, trimap-free models may be comparable to trimap-based models on these benchmarks but have unsatisfactory results in natural images, i.e., the images without background replacement, which indicates that the performance of trimap-free methods has not been accurately assessed. We prove this standpoint by the matting results on Adobe Matting Dataset.²

² Refer to Appendix B for the results of portrait images (with synthetic backgrounds) from Adobe Matting Dataset.

现有工作通过图像合成,从少量标注数据构建验证基准。由于前景与背景之间融合不自然或语义不匹配,它们的基准相对容易(图4(a))。因此,无trimap模型在这些基准上可能与基于trimap的模型相当,但在自然图像(即没有背景替换的图像)上的结果并不令人满意,这表明无trimap方法的性能尚未得到准确评估。我们通过在Adobe Matting Dataset上的抠图结果证明了这一观点。²

² 有关Adobe Matting Dataset中人像图像(具有合成背景)的结果,请参阅附录B。

In contrast, we propose a Photographic Human Matting benchmark (PHM-100), which contains 100 finely annotated portrait images with various backgrounds. To guarantee sample diversity, we define several classifying rules to balance the sample types in PHM-100. For example, (1) whether the whole human body is included; (2) whether the image background is blurred; and (3) whether the person holds additional objects. We regard small objects held by people as a part of the foreground since this is more in line with the practical applications. As exhibited in Fig. 4(b)(c)(d), the samples in PHM-100 have more natural backgrounds and richer postures. So, we argue that PHM-100 is a more comprehensive benchmark.

相比之下,我们提出了摄影人像抠图基准(PHM-100),它包含100张具有各种背景、标注精细的人像图像。为了保证样本多样性,我们定义了若干分类规则来平衡PHM-100中的样本类型,例如:(1)是否包含完整人体;(2)图像背景是否虚化;(3)人物是否手持其他物品。我们把人所持的小物件视为前景的一部分,因为这更符合实际应用。如图4(b)(c)(d)所示,PHM-100中的样本具有更自然的背景和更丰富的姿态。因此,我们认为PHM-100是一个更全面的基准。

5.2RESULTS ON PHM-100 PHM-100上的结果

We compare MODNet with FDMPA [FDMPA], LFM [LFM], SHM [SHM], BSHM [BSHM], and HAtt [HAtt]. We follow the original papers to reproduce the methods that have no publicly available codes. We use DIM [DIM] as trimap-based baseline.

For a fair comparison, we train all models on the same dataset, which contains nearly 3000 annotated foregrounds. The background replacement [DIM] is applied to extend our training set. For each foreground, we generate 5 samples by random cropping and 10 samples by compositing the backgrounds from the OpenImage dataset [openimage]. We use MobileNetV2 pre-trained on the Supervisely Person Segmentation (SPS) [SPS] dataset as the backbone of all trimap-free models. For previous methods, we explore the optimal hyper-parameters through grid search. For MODNet, we train it by SGD for 40 epochs. With a batch size of 16, the initial learning rate is 0.01 and is multiplied by 0.1 after every 10 epochs. We use Mean Square Error (MSE) and Mean Absolute Difference (MAD) as quantitative metrics.
我们将MODNet与FDMPA [ FDMPA ],LFM [ LFM ],SHM [ SHM ],BSHM [ BSHM ]和HAtt [ HAtt ]进行比较。我们按照原始论文来复制没有公开代码的方法。我们使用DIM [ DIM ]作为基于trimap的基线。

为了公平比较,我们在同一数据集上训练所有模型,该数据集包含近3000个带标注的前景。我们使用背景替换[DIM]来扩充训练集:对每个前景,通过随机裁剪生成5个样本,再与OpenImage数据集[openimage]中的背景合成生成10个样本。我们使用在Supervisely Person Segmentation (SPS) [SPS]数据集上预训练的MobileNetV2作为所有无trimap模型的主干。对于以往的方法,我们通过网格搜索寻找最优超参数。对于MODNet,我们用SGD训练40个epoch,批大小为16,初始学习率为0.01,每10个epoch乘以0.1。我们使用均方误差(MSE)和平均绝对差(MAD)作为定量指标。
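评测所用的MSE和MAD可以按下面的方式计算(论文未给出归一化细节,这里按常见做法在整幅遮罩上逐像素取均值,仅供参考):

```python
import numpy as np

def mse_mad(alpha_pred, alpha_gt):
    """在取值范围 [0, 1] 的 alpha 遮罩上计算 MSE(均方误差)与 MAD(平均绝对差)。"""
    diff = alpha_pred.astype(np.float64) - alpha_gt.astype(np.float64)
    return np.mean(diff ** 2), np.mean(np.abs(diff))
```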

Table 1 shows the results on PHM-100. MODNet surpasses other trimap-free methods in both MSE and MAD. However, it still performs inferior to trimap-based DIM, since PHM-100 contains samples with challenging poses or costumes. When modifying our MODNet to a trimap-based method, i.e., taking a trimap as input, it outperforms trimap-based DIM, which reveals the superiority of our network architecture. Fig. 5 visualizes some samples.³

³ Refer to Appendix A for more visual comparisons.
1显示了在PHM-100上的结果,在MSE和MAD中MODNet都超过了其他无trimap的方法。但是,由于PHM-100包含具有挑战性姿势或服装的样本,它的性能仍不及基于Trimap的DIM。将我们的MODNet修改为基于trimap的方法时,即以trimap为输入,它的性能优于基于trimap的DIM,这揭示了我们网络体系结构的优越性。图 5可视化一些样本 33有关更多视觉比较,请参阅附录A。


Figure 5: Visual Comparisons of Trimap-free Methods on PHM-100. MODNet performs better in hollow structures (the 1st row) and hair details (the 2nd row). However, it may still make mistakes in challenging poses or costumes (the 3rd row). DIM [DIM] here does not take trimaps as the input but is pre-trained on the SPS [SPS] dataset. Zoom in for the best visualization.

| Method | Trimap | MSE↓ | MAD↓ |
| --- | --- | --- | --- |
| DIM [DIM] | ✓ | 0.0016 | 0.0063 |
| MODNet (Our) | ✓ | 0.0013 | 0.0056 |
| DIM [DIM] | | 0.0221 | 0.0327 |
| DIM† [DIM] | | 0.0115 | 0.0178 |
| FDMPA† [FDMPA] | | 0.0101 | 0.0160 |
| LFM† [LFM] | | 0.0094 | 0.0158 |
| SHM† [SHM] | | 0.0072 | 0.0152 |
| HAtt† [HAtt] | | 0.0067 | 0.0137 |
| BSHM† [BSHM] | | 0.0063 | 0.0114 |
| MODNet† (Our) | | 0.0046 | 0.0097 |

Table 1: Quantitative Results on PHM-100. ‘†’ indicates the models pre-trained on the SPS dataset. ‘↓’ means lower is better.

| Ls | Ld | SEB | SPS | MSE↓ | MAD↓ |
| --- | --- | --- | --- | --- | --- |
| | | | | 0.0162 | 0.0235 |
| ✓ | | | | 0.0097 | 0.0158 |
| ✓ | ✓ | | | 0.0083 | 0.0142 |
| ✓ | ✓ | ✓ | | 0.0068 | 0.0128 |
| ✓ | ✓ | ✓ | ✓ | 0.0046 | 0.0097 |

Table 2: Ablation of MODNet. SEB: SE-Block in MODNet low-resolution branch. SPS: Pre-training on the SPS dataset.
Figure 6: Comparisons of Model Size and Execution Efficiency. Shorter inference time is better, and fewer model parameters is better. We can divide 1000 by the inference time to obtain fps.

We further demonstrate the advantages of MODNet in terms of model size and execution efficiency. A small model facilitates deployment on mobile devices, while high execution efficiency is necessary for real-time applications. We measure the model size by the total number of parameters, and we reflect the execution efficiency by the average inference time over PHM-100 on an NVIDIA GTX 1080Ti GPU (input images are cropped to 512×512). Note that fewer parameters do not imply faster inference speed due to large feature maps or time-consuming mechanisms, e.g., attention, that the model may have. Fig. 6 illustrates these two indicators. The inference time of MODNet is 15.8ms (63fps), which is twice the fps of previous fastest FDMPA (31fps). Although MODNet has a slightly higher number of parameters than FDMPA, our performance is significantly better.

我们进一步展示了MODNet在模型大小和执行效率方面的优势。小模型便于在移动设备上部署,而高执行效率是实时应用的必要条件。我们用参数总量衡量模型大小,用在NVIDIA GTX 1080Ti GPU上对PHM-100的平均推理时间反映执行效率(输入图像被裁剪为512×512)。需要注意的是,参数更少并不意味着推理速度更快,因为模型可能具有较大的特征图或耗时的机制(例如注意力)。图6展示了这两个指标。MODNet的推理时间为15.8ms(63 fps),是此前最快方法FDMPA(31 fps)帧率的两倍。尽管MODNet的参数量比FDMPA略多,但我们的性能明显更好。
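推理耗时与fps的测量方式可以参考下面的示意代码(按GPU测速的常见做法编写,预热次数等均为假设,并非论文给出的测速脚本):

```python
import time
import torch

@torch.no_grad()
def average_inference_time(model, runs=100, size=512, device="cuda"):
    """在 512x512 输入上多次前向取平均耗时(毫秒), fps = 1000 / 平均毫秒数。"""
    model = model.to(device).eval()
    x = torch.rand(1, 3, size, size, device=device)
    for _ in range(10):                  # 预热, 排除首次运行的启动开销
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    ms = (time.time() - start) / runs * 1000.0
    return ms, 1000.0 / ms
```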

We also conduct ablation experiments for MODNet on PHM-100 (Table 2). Applying Ls and Ld to constrain human semantics and boundary details brings considerable improvement. The result of assembling SE-Block proves the effectiveness of reweighting the feature maps. Although the SPS pre-training is optional to MODNet, it plays a vital role in other trimap-free methods. For example, in Table 1, the performance of trimap-free DIM without pre-training is far worse than the one with pre-training.

我们还在PHM-100上对MODNet进行了消融实验(表2)。应用 L_s 和 L_d 来约束人像语义和边界细节带来了可观的改进;加入SE-Block的结果证明了对特征图通道重新加权的有效性。尽管SPS预训练对MODNet来说是可选的,但它在其他无trimap方法中起着至关重要的作用。例如,在表1中,未经预训练的无trimap DIM的性能远差于经过预训练的版本。


Figure 7: Results of SOC and OFD on a Real-World Video. We show three consecutive video frames from left to right. From top to bottom: (a) Input, (b) MODNet, (c) MODNet + SOC, and (d) MODNet + SOC + OFD. The blue marks in frame t−1 demonstrate the effectiveness of SOC while the red marks in frame t highlight the flickers eliminated by OFD.

5.3RESULTS ON REAL-WORLD DATA 真实数据的结果

Real-world data can be divided into multiple domains according to different device types or diverse imaging methods. By assuming that the images captured by the same kind of device (such as smartphones) belong to the same domain, we capture several video clips as the unlabeled data for self-supervised SOC domain adaptation. In this stage, we freeze the BatchNorm [BatchNorm] layers within MODNet and finetune the convolutional layers by Adam with a learning rate of 0.0001. Here we only provide visual results⁴ because no ground truth mattes are available. In Fig. 7, we composite the foreground over a green screen to emphasize that SOC is vital for generalizing MODNet to real-world data. In addition, OFD further removes flickers on the boundaries.

⁴ Refer to our online supplementary video for more results.
真实世界数据可以按设备类型或成像方式的不同划分为多个域。假设由同一类设备(例如智能手机)拍摄的图像属于同一个域,我们采集了若干视频片段作为未标记数据,用于自监督的SOC域自适应。在这一阶段,我们冻结MODNet中的BatchNorm [BatchNorm]层,并用Adam以0.0001的学习率微调卷积层。由于没有可用的真值遮罩,这里我们只提供可视化结果。⁴ 在图7中,我们把前景合成到绿幕上,以强调SOC对于把MODNet推广到真实数据至关重要。此外,OFD进一步消除了边界上的闪烁。

⁴ 有关更多结果,请参阅我们的在线补充视频。
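冻结BatchNorm并用Adam微调其余层的设置,可以用下面的代码来示意(仅为一种常见写法,非论文官方脚本):

```python
import torch

def setup_soc_finetune(model, lr=0.0001):
    """冻结 MODNet 中的 BatchNorm 层(统计量与仿射参数都不再更新), 其余参数用 Adam 微调。"""
    for m in model.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            m.eval()                          # 固定 running_mean / running_var
            for p in m.parameters():
                p.requires_grad = False       # 不更新 BN 的缩放和偏移
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)

# 注意: 如果之后调用 model.train(), BN 层会重新进入训练模式, 需要再次把它们设回 eval()。
```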

Applying trimap-based methods in practice requires an additional step to obtain the trimap, which is commonly implemented by a depth camera, e.g., ToF [ToF]. Specifically, the pixel values in a depth map indicate the distance from the 3D locations to the camera, and the locations closer to the camera have smaller pixel values. We can first define a threshold to split the reversed depth map into foreground and background. Then, we can generate the trimap through dilation and erosion. However, this scheme will identify all objects in front of the human, i.e., objects closer to the camera, as the foreground, leading to an erroneous trimap for matte prediction in some scenarios. In contrast, MODNet avoids such a problem by decoupling from the trimap input. We give an example in Fig. 8.
在实践中应用基于trimap的方法需要额外的步骤来获取trimap,这通常借助深度相机(例如ToF [ToF])实现。具体来说,深度图中的像素值表示3D位置到相机的距离,离相机越近的位置像素值越小。我们可以先定义一个阈值,把取反后的深度图分成前景和背景,然后通过膨胀和腐蚀生成trimap。但是,这种方案会把人物前面的所有物体(即离相机更近的物体)都识别为前景,在某些场景下会产生错误的trimap,进而影响遮罩预测。相比之下,MODNet由于不依赖trimap输入而避免了这一问题。我们在图8中给出了一个例子。
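上面描述的"深度图生成trimap"流程可以用下面的OpenCV代码来示意(阈值与膨胀/腐蚀核大小均为假设值)。它也直观地体现了该方案的缺陷:任何比人更靠近相机的物体都会被划为前景。

```python
import cv2
import numpy as np

def trimap_from_depth(depth, fg_threshold, kernel_size=15):
    """由深度图生成 trimap: 先按阈值把取反后的深度图分为前景/背景, 再用膨胀和腐蚀得到未知区域。

    depth: HxW 的深度图, 距离越近像素值越小。
    """
    reversed_depth = depth.max() - depth                    # 取反: 越近的位置值越大
    fg = (reversed_depth > fg_threshold).astype(np.uint8)   # 阈值分割出"前景"(包含所有近处物体)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(fg, kernel)
    eroded = cv2.erode(fg, kernel)
    trimap = np.full(depth.shape, 0.5, dtype=np.float32)    # 未知区域 = 0.5
    trimap[eroded == 1] = 1.0                                # 绝对前景
    trimap[dilated == 0] = 0.0                               # 绝对背景
    return trimap
```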

We also compare MODNet against the background matting (BM) proposed by [BM]. Since BM does not support dynamic backgrounds, we conduct validations in the fixed-camera scenes from [BM]. BM relies on a static background image, which implicitly assumes that all pixels whose value changes in the input image sequence belong to the foreground. As shown in Fig. 9, when a moving object suddenly appears in the background, the result of BM will be affected, but MODNet is robust to such disturbances.
我们还把MODNet与[BM]提出的背景抠图(BM)进行了比较。由于BM不支持动态背景,我们在[BM]提供的固定相机场景中进行验证。BM依赖一张静态背景图像,这隐含地假设输入图像序列中取值发生变化的所有像素都属于前景。如图9所示,当背景中突然出现运动物体时,BM的结果会受到影响,而MODNet对这类干扰具有鲁棒性。


Figure 8: Advantages of MODNet over Trimap-based Method. In this case, an incorrect trimap generated from the depth map causes the trimap-based DIM [DIM] to fail. For comparison, MODNet handles this case correctly, as it inputs only an RGB image.

Figure 9: MODNet versus BM under Fixed Camera Position. MODNet outperforms BM [BM] when a car is entering the background (marked in red).

6CONCLUSIONS 结论

This paper has presented a simple, fast, and effective MODNet to avoid using a green screen in real-time human matting. By taking only RGB images as input, our method enables the prediction of alpha mattes under changing scenes. Moreover, MODNet suffers less from the domain shift problem in practice due to the proposed SOC and OFD. MODNet is shown to have good performances on the carefully designed PHM-100 benchmark and a variety of real-world data. Unfortunately, our method is not able to handle strange costumes and strong motion blurs that are not covered by the training set. One possible future work is to address video matting under motion blurs through additional sub-objectives, e.g., optical flow estimation.
本文提出了一种简单、快速且有效的MODNet,以避免在实时人像抠图中使用绿幕。由于只需RGB图像作为输入,我们的方法能够在场景变化时预测alpha遮罩。而且,得益于所提出的SOC和OFD,MODNet在实践中受域偏移问题的影响更小。MODNet在精心设计的PHM-100基准和各种真实数据上都表现出良好的性能。遗憾的是,我们的方法无法处理训练集未涵盖的奇特服装和强烈的运动模糊。未来一个可能的方向是通过额外的子目标(例如光流估计)来解决运动模糊下的视频抠图问题。

Figure 10: More Visual Comparisons of Trimap-free Methods on PHM-100. We compare our MODNet with DIM [DIM], FDMPA [FDMPA], LFM [LFM], SHM [SHM], HAtt [HAtt], and BSHM [BSHM]. Note that DIM here does not take trimaps as the input but is pre-trained on the SPS [SPS] dataset. Zoom in for the best visualization.

APPENDIX A

Fig. 10 provides more visual comparisons of MODNet and the existing trimap-free methods on PHM-100.
图10提供了MODNet与现有无trimap方法在PHM-100上的更多可视化比较。


Figure 11: Visual Results on AMD. In the first row, the foreground and background lights come from opposite directions (unnatural fusion). In the second row, the portrait is placed on a huge meal (mismatched semantics).

APPENDIX B

We argue that trimap-free models can obtain results comparable to trimap-based models in the previous benchmarks because of unnatural fusion or mismatched semantics between synthetic foreground and background. To demonstrate this, we conduct experiments on the open-source Adobe Matting Dataset (AMD) [DIM]. We first pick the portrait foregrounds from AMD. We then composite 10 samples for each foreground with diverse backgrounds. We finally validate all models on this synthetic benchmark.
我们认为,由于合成前景与背景之间融合不自然或语义不匹配,无trimap模型在以往的基准上能够获得与基于trimap模型相当的结果。为了证明这一点,我们在开源的Adobe Matting Dataset(AMD)[DIM]上进行了实验:先从AMD中挑选人像前景,然后为每个前景与不同背景合成10个样本,最后在这个合成基准上验证所有模型。

表3显示了上述基准的定量结果。与在PHM-100上的结果不同,无trimap模型与基于trimap模型之间的性能差距要小得多。例如,无trimap的MODNet与基于trimap的DIM之间的MSE和MAD差距仅约为0.001。我们在图11中给出了一些可视化比较。

Table 3 shows the quantitative results on the aforementioned benchmark. Unlike the results on PHM-100, the performance gap between trimap-free and trimap-based models is much smaller. For example, MSE and MAD between trimap-free MODNet and trimap-based DIM is only about 0.001. We provide some visual comparison in Fig. 11.

| Method | Trimap | MSE↓ | MAD↓ |
| --- | --- | --- | --- |
| DIM [DIM] | ✓ | 0.0014 | 0.0069 |
| MODNet (Our) | ✓ | 0.0011 | 0.0061 |
| DIM [DIM] | | 0.0075 | 0.0159 |
| DIM† [DIM] | | 0.0048 | 0.0116 |
| FDMPA† [FDMPA] | | 0.0047 | 0.0115 |
| LFM† [LFM] | | 0.0043 | 0.0101 |
| SHM† [SHM] | | 0.0031 | 0.0092 |
| HAtt† [HAtt] | | 0.0034 | 0.0094 |
| BSHM† [BSHM] | | 0.0029 | 0.0088 |
| MODNet† (Our) | | 0.0024 | 0.0081 |

Table 3: Quantitative Results on AMD. We pick the portrait foregrounds from AMD for validation. ‘†’ indicates the models pre-trained on the SPS [SPS] dataset.

REFERENCES