[Paper translation] Contrastive Learning for Label-Efficient Semantic Segmentation


Download PDF: https://arxiv.org/pdf/2012.06985v2.pdf


Contrastive Learning for Label-Efficient Semantic Segmentation


Abstract

Collecting labeled data for the task of semantic segmentation is expensive and time-consuming, as it requires dense pixel-level annotations. While recent Convolutional Neural Network (CNN) based semantic segmentation approaches have achieved impressive results by using large amounts of labeled training data, their performance drops significantly as the amount of labeled data decreases. This happens because deep CNNs trained with the de facto cross-entropy loss can easily overfit to small amounts of labeled data. To address this issue, we propose a simple and effective contrastive learning-based training strategy in which we first pretrain the network using a pixel-wise class label-based contrastive loss, and then fine-tune it using the cross-entropy loss. This approach increases intra-class compactness and inter-class separability, thereby resulting in a better pixel classifier. We demonstrate the effectiveness of the proposed training strategy in both fully-supervised and semi-supervised settings using the Cityscapes and PASCAL VOC 2012 segmentation datasets. Our results show that pretraining with label-based contrastive loss results in large performance gains (more than 20% absolute improvement in some settings) when the amount of labeled data is limited.


Introduction

In the recent past, various approaches based on Convolutional Neural Networks (CNNs) have reported excellent results on several semantic segmentation datasets by leveraging large amounts of dense pixel-level annotations. However, labeling images with pixel-level annotations is time-consuming and expensive. For example, the average time taken to annotate a single image in the Cityscapes dataset is 90 minutes. Figure pascal_drop shows how the performance of a DeepLabV3+ model trained on the PASCAL VOC 2012 dataset using the cross-entropy loss drops as the number of training images decreases. This happens because CNNs trained with the cross-entropy loss can easily overfit to small amounts of labeled data, as the cross-entropy loss does not explicitly encourage intra-class compactness or large margins between classes.

To address this issue, we propose to first pretrain the CNN feature extractor using a pixel-wise class label-based contrastive loss (referred to as contrastive pretraining), and then fine-tune the entire network including the softmax classifier using the cross-entropy loss (referred to as softmax fine-tuning). Figures emb_pascal20_baseline and emb_pascal20_proposed show the distributions of various classes in the softmax input feature spaces of models trained with the cross-entropy loss and the proposed strategy, respectively, using 2118 labeled images from the PASCAL VOC 2012 dataset. The mean IOU values of the corresponding models on the PASCAL VOC 2012 validation dataset are 39.1 and 62.7, respectively. The class support regions are more compact and better separated when trained with the proposed strategy, leading to better performance.


We use t-SNE for generating the visualizations.

image

We demonstrate the effectiveness of the proposed training strategy in both fully-supervised and semi-supervised settings. Our main contributions are as follows.

  • Simple approach: We propose a simple contrastive learning-based training strategy for improving the performance of semantic segmentation models. We consider the simplicity of our training strategy as its main strength since it can be easily adopted by existing and future semantic segmentation approaches.
  • Strong results: We show that label-based contrastive pretraining results in large performance gains on two widely-used semantic segmentation datasets in both fully-supervised and semi-supervised settings, especially when the amount of labeled data is limited.
  • Detailed analyses: We show visualizations of class distributions in the feature spaces of trained models to provide insights into why the proposed training strategy works better (see figure: emb_pascal20_baseline and emb_pascal20_proposed). We also present various ablation studies that justify our design choices.

image

image


Related work

Contrastive learning approaches learn representations in a discriminative fashion by contrasting positive pairs against negative pairs. Recently, several approaches based on contrastive loss have been proposed for self-supervised visual representation learning. These approaches treat each instance as a class and use contrastive loss-based instance discrimination for representation learning. Specifically, they use augmented versions of an instance to form the positive pair and other instances to form negative pairs for the contrastive loss. Noting that using a large number of negatives is crucial for the success of contrastive loss-based representation learning, various recent approaches use memory banks to store the representations. While some recent contrastive approaches have attempted to attribute their success to maximization of mutual information, it has been argued that their success cannot be attributed to the properties of mutual information alone. Recently, a supervised contrastive loss was proposed for the task of image classification.

Since CNNs were introduced to solve the semantic segmentation problem, several deep CNN-based approaches have been proposed that gradually improved the performance using large amounts of pixel-level annotations. However, collecting dense pixel-level annotations is difficult and costly. To address this issue, several existing works focus on leveraging weaker forms of supervision such as bounding boxes, scribbles, points, and image-level labels, either exclusively or along with dense pixel-level supervision. While this work also focuses on improving semantic segmentation performance when the amount of pixel-level annotations is limited, we do not use any additional forms of annotations.


Instead, we propose a contrastive learning-based pretraining strategy that achieves significant performance gains without the need for any additional data.

Another line of work that deals with limited labeled data includes semi-supervised approaches that leverage unlabeled images. While some of these works use generative adversarial training to leverage unlabeled images, others use pseudo labels and consistency-based regularization with various data augmentations. The proposed contrastive pretraining strategy is complementary to these approaches and can be used in conjunction with them. We demonstrate this by showing the effectiveness of contrastive pretraining in the semi-supervised setting using pseudo labels.

Recently, it was proposed to train the CNN feature extractor of a semantic segmentation model by maximizing the log likelihood of extracted pixel features under a mixture of von Mises-Fisher (vMF) distributions. During inference, that approach first segments the pixel features extracted from an image using spherical K-Means clustering, and then performs a k-nearest neighbor search for each segment to retrieve the labels from segments in the training set. While this approach is shown to improve the performance when compared to the widely-used pixel-wise softmax training, it is very complicated as it uses a two-stage expectation-maximization algorithm for training. In comparison, the proposed training strategy is simple, and can be easily adopted by existing and future semantic segmentation approaches.


Experiments


Datasets and metrics

PASCAL VOC 2012: This dataset consists of 10,582 training (including annotations provided by prior work), 1,449 validation, and 1,456 test images with pixel-level annotations for 20 foreground object classes and one background class. The performance is measured in terms of pixel Intersection-Over-Union (IOU) averaged across the 21 classes.

Cityscapes: This dataset contains high quality pixel-level annotations of 5,000 images collected in street scenes from 50 different cities. Following the standard evaluation protocol, 19 semantic labels belonging to 7 super categories (ground, construction, object, nature, human, vehicle and sky) are used for evaluation, and the void label is ignored. The performance is measured in terms of pixel IOU averaged across the 19 classes. The training, validation, and test splits contain 2975, 500, and 1525 images, respectively.

For both datasets, we perform experiments in both fully-supervised and semi-supervised settings, varying the amounts of labeled and unlabeled training data.
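The class-averaged IOU metric described above can be sketched in a few lines. This is a minimal illustration and not the benchmark evaluation code: the function name `mean_iou`, the ignore value 255 for void pixels, and the choice to skip classes that never occur are assumptions made for the example. Official evaluations accumulate the counts over the whole split before averaging.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean IOU over classes, from flat lists of predicted and true label ids."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            # Assumed convention: void pixels do not count toward any class.
            continue
        if p == t:
            intersection[t] += 1   # true positive for class t
            union[t] += 1
        else:
            union[p] += 1          # false positive for the predicted class
            union[t] += 1          # false negative for the true class
    # Assumption: classes absent from both prediction and ground truth are skipped.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


# Tiny example: four pixels, two classes, the last pixel is void.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 255], num_classes=2)
```

For PASCAL VOC 2012 the average would run over 21 classes, and for Cityscapes over the 19 evaluated classes.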


Model architecture

Our feature extractor follows the DeepLabV3+ encoder-decoder architecture with the ResNet50-based encoder of DeepLabV3. The output spatial resolution of the feature extractor is four times lower than the input resolution. Our projection head consists of three $1\times 1$ convolution layers with 256 channels followed by a unit-normalization layer. The first two layers in the projection head use the ReLU activation function.
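Because a $1\times 1$ convolution applies the same linear map independently at every pixel, the projection head can be illustrated on a single pixel's feature vector. The sketch below is a plain-Python illustration under assumptions: the input width of 256 channels, the random initialization, and all helper names are hypothetical and not taken from the paper.

```python
import math
import random


def make_layer(in_ch, out_ch, rng):
    # Weights of a 1x1 convolution: an out_ch x in_ch matrix plus a bias vector.
    scale = 1.0 / math.sqrt(in_ch)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_ch)] for _ in range(out_ch)]
    return weights, [0.0] * out_ch


def conv1x1(x, layer):
    # Per-pixel linear map: one dot product per output channel.
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def unit_normalize(x, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / max(norm, eps) for v in x]


def projection_head(x, layers):
    h = relu(conv1x1(x, layers[0]))   # first layer uses ReLU
    h = relu(conv1x1(h, layers[1]))   # second layer uses ReLU
    h = conv1x1(h, layers[2])         # third layer has no activation
    return unit_normalize(h)          # embeddings lie on the unit sphere


rng = random.Random(0)
in_channels = 256  # assumed feature-extractor output width, not stated in the text
layers = [make_layer(in_channels, 256, rng),
          make_layer(256, 256, rng),
          make_layer(256, 256, rng)]
pixel_feature = [rng.gauss(0.0, 1.0) for _ in range(in_channels)]
embedding = projection_head(pixel_feature, layers)
```

Unit normalization means the dot product between two embeddings is their cosine similarity, which is the quantity a temperature-scaled contrastive loss typically operates on.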


Training and inference

Following prior work, we use $513\times 513$ random crops extracted from preprocessed (random left-right flipping and scaling) input images for training. All the models are trained from scratch using asynchronous stochastic gradient descent on 8 replicas with minibatches of size 16, weight decay of $4\times 10^{-5}$, momentum of 0.9, and cosine learning rate decay. For contrastive pretraining, we use an initial learning rate of 0.1 and 300K training steps. For softmax fine-tuning, we use an initial learning rate of 0.007 and 300K training steps when the number of labeled images is above 2500 in the case of the PASCAL VOC 2012 dataset and above 1000 in the case of the Cityscapes dataset, and 50K training steps in other settings (we observed overfitting with longer training when the number of labeled images is low). When we use softmax training without contrastive pretraining, we use an initial learning rate of 0.03 and 600K training steps when the number of labeled images is above 2500 in the case of the PASCAL VOC 2012 dataset and above 1000 in the case of the Cityscapes dataset, and 300K training steps in other settings. The temperature parameter $\tau$ of the contrastive loss is set to 0.07 in all the experiments. We use the color distortions from a recent self-supervised learning method for contrastive pretraining, and random brightness and contrast adjustments for softmax fine-tuning (using the hue and saturation adjustments as well while training the softmax classifier resulted in a slight drop in performance). For generating pseudo labels, we use a threshold of 0.8 for all the foreground classes of the PASCAL VOC 2012 and Cityscapes datasets, and a threshold of 0.97 for the background class of the PASCAL VOC 2012 dataset.
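To make the cosine learning rate decay concrete, here is a minimal sketch of one common form of the schedule. It assumes the rate decays all the way to zero with no warmup; the text does not state either detail, and the function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(step, total_steps, initial_lr):
    """Learning rate after `step` updates under cosine decay to zero (assumed form)."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * progress))


# Contrastive pretraining settings from the text: initial rate 0.1, 300K steps.
start = cosine_lr(0, 300_000, 0.1)
middle = cosine_lr(150_000, 300_000, 0.1)
end = cosine_lr(300_000, 300_000, 0.1)
```

Under these assumptions the rate starts at the initial value, reaches half of it at the midpoint, and ends at zero. The same function would cover the fine-tuning schedules by changing `initial_lr` and `total_steps`.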

image

For a $513\times 513$ input, our feature extractor produces a $129\times 129$ feature map. Since the memory complexity of the contrastive loss is quadratic in the number of pixels, to avoid GPU memory issues, we resize the feature map to $65\times 65$ using bilinear resizing before computing the contrastive loss. The corresponding low-resolution label map is obtained from the original label map using nearest neighbor downsampling. For softmax training, we follow prior work and upsample the logits from $129\times 129$ to $513\times 513$ using bilinear resizing before computing the pixel-wise cross-entropy loss. Since the model is fully convolutional, during inference, we directly run it on an input image and upsample the output logits to the input resolution using bilinear resizing.
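Two points in this paragraph can be checked with a short sketch: how much the resize shrinks the pairwise similarity matrix, and why labels are downsampled with nearest neighbor rather than bilinear resizing. Nearest neighbor only copies existing class ids, so it never produces a meaningless in-between label. The index mapping used below (floor of the scaled coordinate) is an assumption; real frameworks differ in their sampling conventions, and both function names are hypothetical.

```python
def nearest_neighbor_resize(label_map, out_h, out_w):
    """Downsample a 2D list of class ids by copying the nearest source pixel."""
    in_h, in_w = len(label_map), len(label_map[0])
    resized = []
    for i in range(out_h):
        src_i = min(in_h - 1, (i * in_h) // out_h)
        row = [label_map[src_i][min(in_w - 1, (j * in_w) // out_w)]
               for j in range(out_w)]
        resized.append(row)
    return resized


def pairwise_entries(height, width):
    """Number of entries in the pixel-by-pixel similarity matrix."""
    n = height * width
    return n * n


labels = [[0, 0, 1, 1],
          [0, 0, 1, 1],
          [2, 2, 3, 3],
          [2, 2, 3, 3]]
small = nearest_neighbor_resize(labels, 2, 2)  # [[0, 1], [2, 3]]

full = pairwise_entries(129, 129)     # 276,922,881 entries per image
reduced = pairwise_entries(65, 65)    # 17,850,625 entries per image
```

Halving each side roughly quarters the pixel count, so the quadratic term shrinks by a factor of about 15.5 per image.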

image


Results - Fully-supervised setting

Figures cityscapes_fs and pascal_fs show the performance improvements on the validation splits of the Cityscapes and PASCAL VOC 2012 datasets, respectively, obtained by contrastive pretraining in the fully-supervised setting. [...] $2\times$ while improving the performance. [...] $5\times$ more data (5295 images). These results clearly demonstrate the effectiveness of the proposed label-based contrastive pretraining. The performance improvements seen on the PASCAL VOC 2012 dataset are much higher than the improvements seen on the Cityscapes dataset. Figure visual_results shows some segmentation results of models trained with and without label-based contrastive pretraining using 2118 labeled images from the PASCAL VOC 2012 dataset. Contrastive pretraining improves the segmentation results by reducing the confusion between background and various foreground classes, and also the confusion between different foreground classes.

image

image


Results - Semi-supervised setting

Figures cityscapes_ss and pascal_ss show the performance improvements on the validation splits of the Cityscapes and PASCAL VOC 2012 datasets, respectively, obtained by contrastive pretraining in the semi-supervised setting. [...] $2\times$ while improving the performance on the PASCAL VOC 2012 dataset.

image

Figures cityscapes_pl_improv and pascal_pl_improv show the performance improvements on the validation splits of the Cityscapes and PASCAL VOC 2012 datasets, respectively, obtained by using pseudo labels. Though pseudo labeling is a straightforward approach for making use of unlabeled data, it gives impressive performance gains, both with and without contrastive pretraining.
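The pseudo-labeling step can be sketched per pixel using the thresholds given in the training details (0.8 for foreground classes, 0.97 for the PASCAL VOC 2012 background class). The sketch rests on assumptions the text does not spell out: pixels whose top probability falls below the threshold are marked with an ignore label (255 here) and excluded from training, the background class has id 0, and the comparison is inclusive. The function name is hypothetical.

```python
def pseudo_label(probs, background_id=0, fg_threshold=0.8,
                 bg_threshold=0.97, ignore_label=255):
    """Turn one pixel's softmax probabilities into a pseudo label or an ignore label."""
    best = max(range(len(probs)), key=lambda c: probs[c])
    # The background class gets the stricter threshold; pass background_id=None
    # for a dataset without a background class (assumed usage for Cityscapes).
    threshold = bg_threshold if best == background_id else fg_threshold
    return best if probs[best] >= threshold else ignore_label


confident_foreground = pseudo_label([0.10, 0.85, 0.05])   # class 1
uncertain_background = pseudo_label([0.90, 0.05, 0.05])   # ignored: 0.90 < 0.97
confident_background = pseudo_label([0.98, 0.01, 0.01])   # class 0
```

A stricter background threshold presumably keeps the dominant background class from flooding the pseudo labels, although the text does not state the motivation.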

image


Ablation studies

In this section, we perform various ablation studies under the fully-supervised setting on the Cityscapes and PASCAL VOC 2012 datasets with 596 and 2118 labeled training images, respectively.


Importance of distortions for contrastive loss

In the case of contrastive loss-based self-supervised learning, distortions are necessary to generate positive pairs. In the case of label-based contrastive learning, however, positive pairs can be generated using labels, and hence it is unclear how important distortions are. In this work, we use the color distortions from a recent self-supervised learning method that worked well for the downstream task of image classification. The distortions results show the effect of using these distortions in the contrastive pretraining stage. We can see a small performance gain on the Cityscapes dataset and no gain on the PASCAL VOC 2012 dataset (differences lower than 0.5 are too small to draw any conclusion). These results suggest that distortions that work well for image recognition may not work for semantic segmentation.


Contrastive loss variants

The pixel-wise label-based contrastive loss used in this work is first computed separately for each image and then averaged across all the images in a minibatch. We refer to this as the single image variant. An alternative option is to consider all the pixels in a minibatch as a single bag of pixels for computing the contrastive loss. We refer to this as the batch variant. Note that the memory complexity of the contrastive loss is quadratic in the number of pixels. Hence, to avoid GPU memory issues, we randomly sample 10K pixels from the entire minibatch for computing the batch variant of the contrastive loss. Table loss_variants compares the performance of these two variants.
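The difference between the two variants is easiest to see in code. The exact loss formula does not appear in this text, so the sketch below substitutes a widely used supervised contrastive form as an assumption: for each anchor pixel, it averages the negative log-probability of its same-class pixels under a temperature-scaled softmax over all other pixels. All function names are hypothetical, and the plain-Python loops are for illustration only.

```python
import math
import random


def label_contrastive_loss(features, labels, tau=0.07):
    """Assumed supervised contrastive form over one bag of unit-normalized pixel embeddings."""
    n = len(features)
    sims = [[sum(a * b for a, b in zip(features[i], features[j])) / tau
             for j in range(n)] for i in range(n)]
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class pixel contributes nothing
        others = [sims[i][j] for j in range(n) if j != i]
        m = max(others)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(s - m) for s in others))
        total -= sum(sims[i][j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def single_image_variant(batch_features, batch_labels, tau=0.07):
    # Loss is computed per image, then averaged over the minibatch.
    losses = [label_contrastive_loss(f, l, tau)
              for f, l in zip(batch_features, batch_labels)]
    return sum(losses) / len(losses)


def batch_variant(batch_features, batch_labels, tau=0.07, max_pixels=10_000, rng=None):
    # All pixels in the minibatch form one bag, randomly subsampled to max_pixels.
    rng = rng or random.Random(0)
    feats = [f for image in batch_features for f in image]
    labs = [l for image in batch_labels for l in image]
    if len(feats) > max_pixels:
        keep = rng.sample(range(len(feats)), max_pixels)
        feats = [feats[k] for k in keep]
        labs = [labs[k] for k in keep]
    return label_contrastive_loss(feats, labs, tau)


image = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
separated = label_contrastive_loss(image, [0, 0, 1, 1])  # classes form tight clusters
mixed = label_contrastive_loss(image, [0, 1, 0, 1])      # classes overlap
```

In the single image variant a pixel is only contrasted with pixels from its own image, while the batch variant also pairs pixels across images, which is why it needs subsampling to keep the quadratic similarity matrix manageable.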

Table loss_variants (mean IOU):

Contrastive loss        Cityscapes (596 images)    PASCAL VOC 2012 (2118 images)
Single image variant    67.5                       62.7
Batch variant           66.0 (1.5 ↓)               61.7 (1.0 ↓)

Table imagenet_results (mean IOU):

ImageNet pretraining    Contrastive pretraining    Cityscapes (596 images)    PASCAL VOC 2012 (2118 images)
no                      no                         63.6                       39.1
yes                     no                         68.8 (5.2 ↑)               58.9 (19.8 ↑)
no                      yes                        67.5                       62.7
yes                     yes                        68.6 (1.1 ↑)               62.6


Pretraining on external classification dataset

To study the effect of additional pretraining on a large-scale image classification dataset, we compare the models trained from scratch with the models trained from ImageNet-pretrained weights in Table imagenet_results. When contrastive pretraining is not used, ImageNet pretraining results in large performance gains on both the Cityscapes and PASCAL VOC 2012 datasets. However, when contrastive pretraining is used, the performance gain due to ImageNet pretraining is limited (only 1.1 points on the Cityscapes dataset and no improvement on the PASCAL VOC 2012 dataset). Also, the results in the second and third rows show that contrastive pretraining, which does not use any additional labels, outperforms ImageNet pretraining (which uses more than a million additional image labels) by 3.8 points on the PASCAL VOC 2012 dataset, and is only slightly worse (1.3 points) on the Cityscapes dataset. These results clearly demonstrate the effectiveness of contrastive pretraining in reducing the need for labeled data.


Performance on test splits

Table city_test shows the performance improvements on the test splits of the Cityscapes and PASCAL VOC 2012 datasets obtained by contrastive pretraining in the fully-supervised setting. Similar to the results on validation splits, label-based contrastive pretraining leads to significant performance improvements on test splits.

Table city_test (mean IOU on the test splits):

Contrastive pretraining    Cityscapes (2975 images)    Cityscapes (596 images)
no                         76.3                        64.6
yes                        78.1 (1.8 ↑)                68.0 (3.4 ↑)

Contrastive pretraining    PASCAL VOC 2012 (10582 images)    PASCAL VOC 2012 (2118 images)
no                         67.1                              39.4
yes                        74.9 (7.8 ↑)                      61.2 (21.8 ↑)

image

image


Conclusions and future work

Deep CNN-based semantic segmentation models trained with cross-entropy loss easily overfit to small amounts of training data, and hence perform poorly when trained with limited labeled data. To address this issue, we proposed a simple and effective contrastive learning-based training strategy in which we first pretrain the feature extractor of the model using a pixel-wise label-based contrastive loss and then fine-tune the entire network including the softmax classifier using the cross-entropy loss. This training approach increases both intra-class compactness and inter-class separability, thereby enabling a better pixel classifier. We performed experiments on two widely-used semantic segmentation datasets, namely, PASCAL VOC 2012 and Cityscapes, in both fully-supervised and semi-supervised settings. In both settings, we achieved large performance gains on both datasets by using contrastive pretraining, especially when the amount of labeled data is limited. In this work, we used a simple pseudo labeling-based approach to leverage unlabeled images in the semi-supervised setting.

We thank Yukun Zhu and Liang-Chieh Chen from Google for their support with the DeepLab codebase.
