[Paper Translation] MODNet: Real-Time, Accurate Human Matting Without a Green Screen


Paper: https://arxiv.org/pdf/2011.11961.pdf

Code: https://github.com/ZHKKKe/MODNet

Is a Green Screen Really Necessary for Real-Time Human Matting?


Zhanghan Ke1,2  Kaican Li2  Yurou Zhou2  Qiuhua Wu2  Xiangyu Mao2
Qiong Yan2  Rynson W.H. Lau1
1 Department of Computer Science, City University of Hong Kong    2 SenseTime Research

ABSTRACT

For human matting without the green screen, existing works either require auxiliary inputs that are costly to obtain or use multiple models that are computationally expensive. Consequently, they are unavailable in real-time applications. In contrast, we present a light-weight matting objective decomposition network (MODNet), which can process human matting from a single input image in real time. The design of MODNet benefits from optimizing a series of correlated sub-objectives simultaneously via explicit constraints. Moreover, since trimap-free methods usually suffer from the domain shift problem in practice, we introduce (1) a self-supervised strategy based on sub-objectives consistency to adapt MODNet to real-world data and (2) a one-frame delay trick to smooth the results when applying MODNet to video human matting.

MODNet is easy to train in an end-to-end manner. It is much faster than contemporaneous matting methods and runs at 63 frames per second. On a carefully designed human matting benchmark newly proposed in this work, MODNet greatly outperforms prior trimap-free methods. More importantly, our method achieves remarkable results on daily photos and videos. Now, do you really need a green screen for real-time human matting?



Figure 1: Our Framework for Human Matting. Our method can process trimap-free human matting in real time under changing scenes. (a) We train MODNet on the labeled dataset to learn matting sub-objectives from RGB images. (b) To adapt to real-world data, we finetune MODNet on the unlabeled data by using the consistency between sub-objectives. (c) In the application of video human matting, our OFD trick can help smooth the predicted alpha mattes of the video sequence.

1 INTRODUCTION

Human matting aims to predict a precise alpha matte that can be used to extract people from a given image or video. It has a wide variety of applications, such as photo editing and movie re-creation. Currently, a green screen is required to obtain a high quality alpha matte in real time.


When a green screen is not available, most existing matting methods [AdaMatting, CAMatting, GCA, IndexMatter, SampleMatting, DIM] use a pre-defined trimap as a priori. However, the trimap is costly for humans to annotate, or suffers from low precision if captured via a depth camera. Therefore, some latest works attempt to eliminate the model dependence on the trimap, i.e., trimap-free methods. For example, background matting [BM] replaces the trimap by a separate background image. Others [SHM, BSHM, DAPM] apply multiple models to firstly generate a pseudo trimap or semantic mask, which then serves as the priori for alpha matte prediction. Nonetheless, using the background image as input requires taking and aligning two photos, while using multiple models significantly increases the inference time. These drawbacks make all aforementioned matting methods unsuitable for real-time applications, such as preview in a camera. Besides, limited by the insufficient amount of labeled training data, trimap-free methods often suffer from domain shift [DomainShift] in practice, i.e., the models cannot generalize well to real-world data, which has also been discussed in [BM].


To predict an accurate alpha matte from only one RGB image by using a single model, we propose MODNet, a light-weight network that decomposes the human matting task into three correlated sub-tasks and optimizes them simultaneously through specific constraints. There are two insights behind MODNet. First, neural networks are better at learning a set of simple objectives rather than a complex one. Therefore, addressing a series of matting sub-objectives can achieve better performance. Second, applying explicit supervisions for each sub-objective can make different parts of the model learn decoupled knowledge, which allows all the sub-objectives to be solved within one model. To overcome the domain shift problem, we introduce a self-supervised strategy based on sub-objective consistency (SOC) for MODNet. This strategy utilizes the consistency among the sub-objectives to reduce artifacts in the predicted alpha matte. Moreover, we suggest a one-frame delay (OFD) trick as post-processing to obtain smoother outputs in the application of video human matting. Fig. 1 summarizes our framework.


MODNet has several advantages over previous trimap-free methods. First, MODNet is much faster. It is designed for real-time applications, running at 63 frames per second (fps) on an Nvidia GTX 1080Ti GPU with an input size of 512×512. Second, MODNet achieves state-of-the-art results, benefiting from (1) objective decomposition and concurrent optimization; and (2) specific supervisions for each of the sub-objectives. Third, MODNet can be easily optimized end-to-end since it is a single well-designed model instead of a complex pipeline. Finally, MODNet has better generalization ability thanks to our SOC strategy. Although our results are not able to surpass those of the trimap-based methods on the human matting benchmarks with trimaps, our experiments show that MODNet is more stable in practical applications due to the removal of the trimap input. We believe that our method is challenging the necessity of using a green screen for real-time human matting.

Since open-source human matting datasets [DAPM, DIM] have limited scale or precision, prior works train and validate their models on private datasets of diverse quality and difficulty levels. As a result, it is not easy to compare these methods fairly. In this work, we evaluate existing trimap-free methods under a unified standard: all models are trained on the same dataset and validated on the portrait images from Adobe Matting Dataset [DIM] and our newly proposed benchmark. Our new benchmark is labelled in high quality, and it is more diverse than those used in previous works. Hence, it can reflect the matting performance more comprehensively. More on this is discussed in Sec. 5.1.

In summary, we present a novel network architecture, named MODNet, for trimap-free human matting in real time. Moreover, we introduce two techniques, SOC and OFD, to generalize MODNet to new data domains and smooth the matting results on videos. Another contribution of this work is a carefully designed validation benchmark for human matting. Our code, pre-trained model, and validation benchmark will be made available at:

https://github.com/ZHKKKe/MODNet .


2 RELATED WORK

2.1 IMAGE MATTING

The purpose of image matting is to extract the desired foreground F from a given image I. Unlike the binary mask output from image segmentation [IS_Survey] and saliency detection [SOD_Survey], matting predicts an alpha matte with precise foreground probability for each pixel, which is represented by α in the following formula:

$$ I_{i} = \alpha_{i} F_{i} + (1 - \alpha_{i}) B_{i} $$

where i is the pixel index, and B is the background of I. When the background is not a green screen, this problem is ill-posed since all variables on the right hand side are unknown. Most existing matting methods take a pre-defined trimap as an auxiliary input, which is a mask containing three regions: absolute foreground (α=1), absolute background (α=0), and unknown area (α=0.5). In this way, the matting algorithms only have to estimate the foreground probability inside the unknown area based on the priori from the other two regions.
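To make the compositing equation concrete, the sketch below blends a foreground over a background with a given alpha matte in NumPy; the function name and array layout are illustrative assumptions of ours, not part of the paper.

```python
# Minimal sketch of Eq. 1: I_i = alpha_i * F_i + (1 - alpha_i) * B_i.
import numpy as np

def composite(alpha: np.ndarray, foreground: np.ndarray,
              background: np.ndarray) -> np.ndarray:
    """alpha: (H, W) in [0, 1]; foreground/background: (H, W, 3)."""
    a = alpha[..., None]            # broadcast alpha over the RGB channels
    return a * foreground + (1.0 - a) * background
```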


Traditional matting algorithms heavily rely on low-level features, e.g., color cues, to determine the alpha matte through sampling [sampling_chuang, sampling_feng, sampling_gastal, sampling_he, sampling_johnson, sampling_karacan, sampling_ruzon] or propagation [prop_aksoy2, prop_aksoy, prop_bai, prop_chen, prop_grady, prop_levin, prop_levin2, prop_sun], which often fail in complex scenes. With the tremendous progress of deep learning, many methods based on convolutional neural networks (CNN) have been proposed, and they improve matting results significantly. Cho et al. [NIMUDCNN] and Shen et al. [DAPM] combined the classic algorithms with CNN for alpha matte refinement. Xu et al. [DIM] proposed an auto-encoder architecture to predict alpha matte from an RGB image and a trimap. Some works [GCA, IndexMatter] argued that the attention mechanism could help improve matting performance. Lutz et al. [AlphaGAN] demonstrated the effectiveness of generative adversarial networks [GAN] in matting. Cai et al. [AdaMatting] suggested a trimap refinement process before matting and showed the advantages of an elaborate trimap. Since obtaining a trimap requires user effort, some recent methods (including our MODNet) attempt to avoid it, as described below.


2.2 TRIMAP-FREE HUMAN MATTING

Image matting is extremely difficult when trimaps are unavailable, as semantic estimation will be necessary (to locate the foreground) before predicting a precise alpha matte. Currently, trimap-free methods always focus on a specific type of foreground objects, such as humans. Nonetheless, feeding RGB images into a single neural network still yields unsatisfactory alpha mattes. Sengupta et al. [BM] proposed to capture a less expensive background image as a pseudo green screen to alleviate this issue. Other works designed pipelines that contained multiple models. For example, Shen et al. [SHM] assembled a trimap generation network before the matting network. Zhang et al. [LFM] applied a fusion network to combine the predicted foreground and background. Liu et al. [BSHM] concatenated three networks to utilize coarse labeled data in matting. The main problem of all these methods is that they cannot be used in interactive applications since: (1) the background images may change frame to frame, and (2) using multiple models is computationally expensive. Compared with them, our MODNet is light-weight in terms of both input and pipeline complexity. It takes one RGB image as input and uses a single model to process human matting in real time with better performance.


2.3 OTHER TECHNIQUES

We briefly discuss some other techniques related to the design and optimization of our method.

High-Resolution Representations. Popular CNN architectures [net_resnet, net_mobilenet, net_densenet, net_vggnet, net_insnet] generally contain an encoder, i.e., a low-resolution branch, to reduce the resolution of the input. Such a process will discard image details that are essential in many tasks, including image matting. Wang et al. [net_hrnet] proposed to keep high-resolution representations throughout the model and exchange features between different resolutions, which induces huge computational overheads. Instead, MODNet only applies an independent high-resolution branch to handle foreground boundaries.

Attention Mechanisms. Attention [attention_survey] for deep neural networks has been widely explored and proved to boost the performance notably. In computer vision, we can divide these mechanisms into spatial-based or channel-based according to their operating dimension. To obtain better results, some matting models [GCA, IndexMatter] combined spatial-based attentions that are time-consuming. In MODNet, we integrate the channel-based attention so as to balance between performance and efficiency.

Consistency Constraint. Consistency is one of the most important assumptions behind many semi-/self-supervised [semi_un_survey] and domain adaptation [udda_survey] algorithms. For example, Ke et al. [GCT] designed a consistency-based framework that could be used for semi-supervised matting. Toldo et al. [udamss] presented a consistency-based domain adaptation strategy for semantic segmentation. However, these methods consist of multiple models and constrain the consistency among their predictions. In contrast, our MODNet imposes consistency among various sub-objectives within a model.


Figure 2: Architecture of MODNet. Given an input image I, MODNet predicts human semantics sp, boundary details dp, and final alpha matte αp through three interdependent branches, S, D, and F, which are constrained by specific supervisions generated from the ground truth matte αg. Since the decomposed sub-objectives are correlated and help strengthen each other, we can optimize MODNet end-to-end.

3 MODNET

In this section, we elaborate the architecture of MODNet and the constraints used to optimize it.

3.1 OVERVIEW

Methods based on multiple models [SHM, BSHM, DAPM] have shown that treating trimap-free matting as a trimap prediction (or segmentation) step plus a trimap-based matting step can achieve better performance. This demonstrates that neural networks benefit from breaking down a complex objective. In MODNet, we extend this idea by dividing the trimap-free matting objective into semantic estimation, detail prediction, and semantic-detail fusion. Intuitively, semantic estimation outputs a coarse foreground mask while detail prediction produces fine foreground boundaries, and semantic-detail fusion aims to blend the features from the first two sub-objectives.

As shown in Fig. 2, MODNet consists of three branches, which learn different sub-objectives through specific constraints. Specifically, MODNet has a low-resolution branch (supervised by the thumbnail of the ground truth matte) to estimate human semantics. Based on it, a high-resolution branch (supervised by the transition region (α∈(0,1)) in the ground truth matte) is introduced to focus on the human boundaries. At the end of MODNet, a fusion branch (supervised by the whole ground truth matte) is added to predict the final alpha matte. In the following subsections, we will delve into the branches and the supervisions used to solve each sub-objective.
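As an illustration of this three-branch layout, here is a toy PyTorch sketch. The layer counts, channel widths, and heads are placeholders of our own, not the actual MODNet architecture (which is available at the repository above); the sketch only shows how S, D, and F depend on one another.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyThreeBranchNet(nn.Module):
    """Toy sketch of the S/D/F decomposition (not the real MODNet)."""
    def __init__(self):
        super().__init__()
        # S: low-resolution branch extracting coarse human semantics
        self.S = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU())
        self.s_head = nn.Conv2d(16, 1, 1)   # 1-channel coarse mask s_p
        # D: high-resolution branch focusing on boundary details
        self.D = nn.Sequential(
            nn.Conv2d(3 + 16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1))
        # F: fusion branch blending semantics and details into alpha_p
        self.F = nn.Sequential(
            nn.Conv2d(2, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, img):
        feat = self.S(img)                               # S(I), low resolution
        s_p = torch.sigmoid(self.s_head(feat))           # coarse semantics
        up = lambda x: F.interpolate(x, size=img.shape[2:],
                                     mode='bilinear', align_corners=False)
        d_p = self.D(torch.cat([img, up(feat)], dim=1))  # boundary details
        alpha_p = self.F(torch.cat([up(s_p), d_p], dim=1))
        return s_p, d_p, alpha_p
```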


3.2 SEMANTIC ESTIMATION

Similar to existing multiple-model approaches, the first step of MODNet is to locate the human in the input image I. The difference is that we extract the high-level semantics only through an encoder, i.e., the low-resolution branch S of MODNet, which has two main advantages. First, semantic estimation becomes more efficient since it is no longer done by a separate model that contains a decoder. Second, the high-level representation S(I) is helpful for subsequent branches and joint optimization. We can apply an arbitrary CNN backbone to S. To facilitate real-time interaction, we adopt the MobileNetV2 [net_mobilenetv2] architecture, an ingenious model developed for mobile devices, as our S.


When analysing the feature maps in S(I), we notice that some channels have more accurate semantics than others. Besides, the indices of these channels vary in different images. However, the subsequent branches process all S(I) in the same way, which may cause the feature maps with false semantics to dominate the predicted alpha mattes in some images. Our experiments show that channel-wise attention mechanisms can encourage using the right knowledge and discourage those that are wrong. Therefore, we append a SE-Block [net_senet] after S to reweight the channels of S(I).
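For reference, a standard SE-Block looks as follows; this is a common PyTorch rendition of [net_senet], and the reduction ratio of 16 is the SENet default rather than a value stated in this paper.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels by global context."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                         # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                    # squeeze: global average pool
        w = self.fc(w).view(x.size(0), -1, 1, 1)  # excite: per-channel weights
        return x * w                              # reweight the channels of S(I)
```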

To predict the coarse semantic mask sp, we feed S(I) into a convolutional layer activated by the Sigmoid function to reduce its channel number to 1. We supervise sp with a thumbnail of the ground truth matte αg. Since sp is supposed to be smooth, we use the L2 loss here, as:

$$ \mathcal{L}_{s} = \frac{1}{2} \, \big|\big| s_p - G(\alpha_g) \big|\big|_{2} $$

where G stands for 16× downsampling followed by Gaussian blur. It removes the fine structures (such as hair) that are not essential to human semantics.
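A sketch of this loss under our assumptions (bilinear downsampling and torchvision's Gaussian blur with an arbitrary 5×5 kernel; the paper specifies neither):

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def semantic_loss(s_p: torch.Tensor, alpha_g: torch.Tensor) -> torch.Tensor:
    # G(alpha_g): 16x downsampling followed by Gaussian blur
    target = F.interpolate(alpha_g, scale_factor=1 / 16, mode='bilinear',
                           align_corners=False)
    target = gaussian_blur(target, kernel_size=[5, 5])
    # L_s = 1/2 * || s_p - G(alpha_g) ||_2
    return 0.5 * torch.norm(s_p - target, p=2)
```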


3.3 DETAIL PREDICTION

We process the transition region around the foreground human with a high-resolution branch D, which takes I, S(I), and the low-level features from S as inputs. The purpose of reusing the low-level features is to reduce the computational overheads of D. In addition, we further simplify D in the following three aspects: (1) D consists of fewer convolutional layers than S; (2) a small channel number is chosen for the convolutional layers in D; (3) we do not maintain the original input resolution throughout D. In practice, D consists of 12 convolutional layers, and its maximum channel number is 64. The feature map resolution is downsampled to 1/4 of I in the first layer and restored in the last two layers. The impact of this setup on detail prediction is negligible since D contains a skip link.


We denote the outputs of D as D(I,S(I)), which implies the dependency between sub-objectives — high-level human semantics S(I) is a priori for detail prediction. We calculate the boundary detail matte dp from D(I,S(I)) and learn it through L1 loss, as:

$$ \mathcal{L}_{d} = m_{d} \, \big|\big| d_{p} - \alpha_{g} \big|\big|_{1} $$

where md is a binary mask to let Ld focus on the human boundaries. md is generated through dilation and erosion on αg. Its values are 1 if the pixels are inside the transition region, and 0 otherwise. In fact, the pixels with md=1 are the ones in the unknown area of the trimap. Although dp may contain inaccurate values for the pixels with md=0, it has high precision for the pixels with md=1.
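Following this description, m_d can be approximated as the band between a dilated and an eroded version of the binarized ground truth matte; the binarization threshold and kernel size below are illustrative assumptions of ours:

```python
import cv2
import numpy as np

def transition_mask(alpha_g: np.ndarray, ksize: int = 15) -> np.ndarray:
    """Approximate m_d via dilation and erosion on alpha_g (H, W in [0, 1])."""
    kernel = np.ones((ksize, ksize), np.uint8)
    binary = (alpha_g > 0.5).astype(np.uint8)
    dilated = cv2.dilate(binary, kernel)
    eroded = cv2.erode(binary, kernel)
    return (dilated - eroded).astype(np.float32)  # 1 in transition region, else 0
```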

3.4 SEMANTIC-DETAIL FUSION

The fusion branch F in MODNet is a straightforward CNN module, combining semantics and details. We first upsample S(I) to match its shape with D(I,S(I)). We then concatenate S(I) and D(I,S(I)) to predict the final alpha matte αp, constrained by:

$$ \mathcal{L}_{\alpha} = \big|\big| \alpha_{p} - \alpha_{g} \big|\big|_{1} + \mathcal{L}_{c} $$

where Lc is the compositional loss from [DIM]. It measures the absolute difference between the input image I and the composited image obtained from αp, the ground truth foreground, and the ground truth background.

MODNet is trained end-to-end through the sum of Ls, Ld, and Lα, as:

$$ \mathcal{L} = \lambda_{s} \, \mathcal{L}_{s} + \lambda_{d} \, \mathcal{L}_{d} + \lambda_{\alpha} \, \mathcal{L}_{\alpha} $$

where λs, λd, and λα are hyper-parameters balancing the three losses. The training process is robust to these hyper-parameters. We set λs=λα=1 and λd=10.
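Putting the three supervisions together, one supervised training step could look like the sketch below; `model`, `loader`, `optimizer`, and the loss helpers `L_s`, `L_d`, `L_alpha` (e.g., the sketches above) are assumed to exist.

```python
lambda_s, lambda_d, lambda_alpha = 1.0, 10.0, 1.0  # weights stated above

for img, alpha_g in loader:                        # labeled matting data
    s_p, d_p, alpha_p = model(img)
    loss = (lambda_s * L_s(s_p, alpha_g)           # semantic estimation
            + lambda_d * L_d(d_p, alpha_g)         # detail prediction
            + lambda_alpha * L_alpha(alpha_p, img, alpha_g))  # fusion
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```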

4 ADAPTATION TO REAL-WORLD DATA

The training data for human matting requires excellent labeling in the hair area, which is almost impossible for natural images with complex backgrounds. Currently, most annotated data comes from photography websites. Although these images have monochromatic or blurred backgrounds, the labeling process still needs to be completed by experienced annotators with a considerable amount of time and the help of professional tools. As a consequence, the labeled datasets for human matting are usually small. Xu et al. [DIM] suggested using background replacement as a data augmentation to enlarge the training set, and it has become a typical setting in image matting. However, the training samples obtained in such a way exhibit different properties from those of the daily life images for two reasons. First, unlike natural images, of which the foreground and background fit seamlessly together, images generated by replacing backgrounds are usually unnatural. Second, professional photography is often carried out under controlled conditions, such as special lighting that is usually different from what we observe in daily life. Therefore, existing trimap-free models always tend to overfit the training set and perform poorly on real-world data.

To address the domain shift problem, we utilize the consistency among the sub-objectives to adapt MODNet to unseen data distributions (Sec. 4.1). Moreover, to alleviate the flicker between video frames, we apply a one-frame delay trick as post-processing (Sec. 4.2).


4.1 SUB-OBJECTIVES CONSISTENCY (SOC)

For unlabeled images from a new domain, the three sub-objectives in MODNet may have inconsistent outputs. For example, the foreground probability of a certain pixel belonging to the background may be wrong in the predicted alpha matte αp but is correct in the predicted coarse semantic mask sp. Intuitively, this pixel should have close values in αp and sp. Motivated by this, our self-supervised SOC strategy imposes the consistency constraints between the predictions of the sub-objectives (Fig. 1 (b)) to improve the performance of MODNet in the new domain.

Formally, we use M to denote MODNet. As described in Sec. 3, M has three outputs for an unlabeled image ~I, as:


$$ \tilde{s}_{p}, \ \tilde{d}_{p}, \ \tilde{\alpha}_{p} = M(\tilde{I}) $$

We force the semantics in ~αp to be consistent with ~sp and the details in ~αp to be consistent with ~dp by:

$$ \mathcal{L}_{cons} = \frac{1}{2} \, \big|\big| G(\tilde{\alpha}_{p}) - \tilde{s}_{p} \big|\big|_{2} + \tilde{m}_{d} \, \big|\big| \tilde{\alpha}_{p} - \tilde{d}_{p} \big|\big|_{1} $$

where ~md indicates the transition region in ~αp, and G has the same meaning as the one in Eq. 2. However, adding the L2 loss on blurred G(~αp) will smooth the boundaries in the optimized ~αp. Hence, the consistency between ~αp and ~dp will remove the details predicted by the high-resolution branch. To prevent this problem, we duplicate M to M′ and fix the weights of M′ before performing SOC. Since the fine boundaries are preserved in ~d′p output by M′, we append an extra constraint to maintain the details in M as:

$$ \mathcal{L}_{dd} = \tilde{m}_{d} \, \big|\big| \tilde{d}_{p}^{\prime} - \tilde{d}_{p} \big|\big|_{1} $$

We generalize MODNet to the target domain by optimizing Lcons and Ldd simultaneously.
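A sketch of SOC fine-tuning under our assumptions: `transition_region` extracts the binary mask ~md from the predicted matte (e.g., via the dilation/erosion sketch above), and `downsample_blur` plays the role of G from Eq. 2; both are assumed helpers rather than names from the paper.

```python
import copy
import torch

model_prime = copy.deepcopy(model)      # frozen copy M' keeps fine boundaries
for p in model_prime.parameters():
    p.requires_grad = False

for img in unlabeled_loader:            # unlabeled real-world images
    s_p, d_p, alpha_p = model(img)
    with torch.no_grad():
        _, d_p_fixed, _ = model_prime(img)    # ~d'_p output by M'
        m_d = transition_region(alpha_p)      # ~m_d, treated as a fixed mask
    # L_cons: semantics of ~alpha_p match ~s_p; details match ~d_p
    L_cons = (0.5 * torch.norm(downsample_blur(alpha_p) - s_p, p=2)
              + (m_d * (alpha_p - d_p).abs()).sum())
    # L_dd: keep the details in M close to the frozen copy's output
    L_dd = (m_d * (d_p_fixed - d_p).abs()).sum()
    loss = L_cons + L_dd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```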

Figure 3: Flickering Pixels Judged by OFD. The foreground moves slightly to the left in three consecutive frames. We focus on three pixels: (1) the pixel marked in green does not satisfy the 1st condition in C; (2) the pixel marked in blue does not satisfy the 2nd condition in C; (3) the pixel marked in red flickers at frame t.

4.2 ONE-FRAME DELAY (OFD)

Applying image processing algorithms independently to each video frame often leads to temporal inconsistency in the outputs. In matting, this phenomenon usually appears as flickers in the predicted matte sequence. Since the flickering pixels in a frame are likely to be correct in adjacent frames, we may utilize the preceding and the following frames to fix these pixels. If the fps is greater than 30, the delay caused by waiting for the next frame is negligible.

Suppose that we have three consecutive frames, and their corresponding alpha mattes are αt−1, αt, and αt+1, where t is the frame index. We regard αit as a flickering pixel if it satisfies the following conditions C (illustrated in Fig. 3):
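Based on the two conditions referenced in the Fig. 3 caption (the neighboring frames agree while the current frame deviates from both), a minimal sketch of this smoothing is given below; the exact tests and the tolerance `eps` are our assumptions rather than the paper's definition of C.

```python
import torch

def ofd_smooth(prev_a, cur_a, next_a, eps=0.1):
    """Fix flickering pixels in cur_a using the adjacent frames' mattes."""
    cond1 = (prev_a - next_a).abs() < eps                 # neighbors agree
    cond2 = ((cur_a - prev_a).abs() > eps) & \
            ((cur_a - next_a).abs() > eps)                # current deviates
    flicker = cond1 & cond2
    return torch.where(flicker, 0.5 * (prev_a + next_a), cur_a)
```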