Contrastive Learning for Label-Efficient Semantic Segmentation
Abstract
Collecting labeled data for the task of semantic segmentation is expensive and time-consuming, as it requires dense pixel-level annotations. While recent Convolutional Neural Network (CNN) based semantic segmentation approaches have achieved impressive results by using large amounts of labeled training data, their performance drops significantly as the amount of labeled data decreases. This happens because deep CNNs trained with the de facto cross-entropy loss can easily overfit to small amounts of labeled data. To address this issue, we propose a simple and effective contrastive learning-based training strategy in which we first pretrain the network using a pixel-wise class label-based contrastive loss, and then fine-tune it using the cross-entropy loss. This approach increases intra-class compactness and inter-class separability, thereby resulting in a better pixel classifier. We demonstrate the effectiveness of the proposed training strategy in both fully-supervised and semi-supervised settings using the Cityscapes and PASCAL VOC 2012 segmentation datasets. Our results show that pretraining with label-based contrastive loss results in large performance gains (more than 20% absolute improvement in some settings) when the amount of labeled data is limited.
Introduction
In the recent past, various approaches based on Convolutional Neural Networks (CNNs) have reported excellent results on several semantic segmentation datasets by leveraging large amounts of dense pixel-level annotations. However, labeling images with pixel-level annotations is time-consuming and expensive. For example, the average time taken to annotate a single image in the Cityscapes dataset is 90 minutes. Figure pascal_drop shows how the performance of a DeepLabV3+ model trained on the PASCAL VOC 2012 dataset using the cross-entropy loss drops as the number of training images decreases. This happens because CNNs trained with the cross-entropy loss can easily overfit to small amounts of labeled data, as the cross-entropy loss does not explicitly encourage intra-class compactness or large margins between classes. To address this issue, we propose to first pretrain the CNN feature extractor using a pixel-wise class label-based contrastive loss (referred to as contrastive pretraining), and then fine-tune the entire network, including the softmax classifier, using the cross-entropy loss (referred to as softmax fine-tuning). Figures emb_pascal20_baseline and emb_pascal20_proposed show the distributions of various classes in the softmax input feature spaces of models trained with the cross-entropy loss and the proposed strategy, respectively, using 2118 labeled images from the PASCAL VOC 2012 dataset. The mean IOU values of the corresponding models on the PASCAL VOC 2012 validation dataset are 39.1 and 62.7, respectively. The class support regions are more compact and separated when trained with the proposed strategy, leading to better performance.
We use t-SNE for generating the visualizations.
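For reference, such visualizations can be produced with a short script along the following lines. This is a minimal sketch rather than the original code: the function name is hypothetical, and it assumes per-pixel embeddings and class labels have already been extracted into the arrays passed to it.

```python
# Minimal sketch: t-SNE visualization of per-pixel softmax-input features.
# `features` ([N, D] float array of pixel embeddings) and `labels` ([N] int
# array of class ids) are assumed to be pre-extracted; names are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_pixel_feature_tsne(features, labels, max_points=5000, out_path="tsne.png"):
    rng = np.random.default_rng(0)
    # Subsample pixels for speed, since t-SNE scales poorly with the number of points.
    idx = rng.choice(len(features), size=min(max_points, len(features)), replace=False)
    emb_2d = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features[idx])
    plt.figure(figsize=(6, 6))
    plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=labels[idx], s=2, cmap="tab20")
    plt.axis("off")
    plt.savefig(out_path, dpi=200)
```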
We demonstrate the effectiveness of the proposed training strategy in both fully-supervised and semi-supervised settings. Our main contributions are as follows.
- Simple approach: We propose a simple contrastive learning-based training strategy for improving the performance of semantic segmentation models. We consider the simplicity of our training strategy as its main strength since it can be easily adopted by existing and future semantic segmentation approaches.
- Strong results: We show that label-based contrastive pretraining results in large performance gains on two widely-used semantic segmentation datasets in both fully-supervised and semi-supervised settings, especially when the amount of labeled data is limited.
- Detailed analyses: We show visualizations of class distributions in the feature spaces of trained models to provide insights into why the proposed training strategy works better (see Figures emb_pascal20_baseline and emb_pascal20_proposed). We also present various ablation studies that justify our design choices.
Related works
These approaches learn representations in a discriminative fashion by contrasting positive pairs against negative pairs. Recently, several approaches based on contrastive loss have been proposed for self-supervised visual representation learning. These approaches treat each instance as a class and use contrastive loss-based instance discrimination for representation learning. Specifically, they use augmented versions of an instance to form positive pairs, and other instances to form negative pairs for the contrastive loss. Noting that using a large number of negatives is crucial for the success of contrastive loss-based representation learning, various recent approaches use memory banks to store the representations. While some recent contrastive approaches have attempted to attribute their success to maximization of mutual information, it has been argued that their success cannot be attributed to the properties of mutual information alone. Recently, a supervised contrastive loss was proposed for the task of image classification. Since CNNs were introduced to solve the semantic segmentation problem, several deep CNN-based approaches have been proposed that gradually improved the performance using large amounts of pixel-level annotations. However, collecting dense pixel-level annotations is difficult and costly. To address this issue, several existing works focus on leveraging weaker forms of supervision such as bounding boxes, scribbles, points, and image-level labels, either exclusively or along with dense pixel-level supervision. While this work also focuses on improving semantic segmentation performance when the amount of pixel-level annotations is limited, we do not use any additional forms of annotations.
Instead, we propose a contrastive learning-based pretraining strategy that achieves significant performance gains without the need for any additional data. Another line of work that deals with limited labeled data includes semi-supervised approaches that leverage unlabeled images. While some of these works use generative adversarial training to leverage unlabeled images, others use pseudo labels and consistency-based regularization with various data augmentations. The proposed contrastive pretraining strategy is complementary to these approaches and can be used in conjunction with them. We demonstrate this in this paper by showing the effectiveness of contrastive pretraining in the semi-supervised setting using pseudo labels. Recently, an approach was proposed to train the CNN feature extractor of a semantic segmentation model by maximizing the log likelihood of extracted pixel features under a mixture of von Mises-Fisher (vMF) distributions. During inference, it first segments the pixel features extracted from an image using spherical K-Means clustering, and then performs a k-nearest neighbor search for each segment to retrieve labels from segments in the training set. While this approach is shown to improve the performance when compared to the widely-used pixel-wise softmax training, it is very complicated as it uses a two-stage expectation-maximization algorithm for training. In comparison, the proposed training strategy is simple, and can be easily adopted by existing and future semantic segmentation approaches.
Experiments
Datasets and metrics
PASCAL VOC 2012: This dataset consists of 10,582 training images (including additional annotations), 1,449 validation images, and 1,456 test images with pixel-level annotations for 20 foreground object classes and one background class. The performance is measured in terms of pixel Intersection-Over-Union (IOU) averaged across the 21 classes.
Cityscapes: This dataset contains high-quality pixel-level annotations of 5,000 images collected in street scenes from 50 different cities. Following the standard evaluation protocol, 19 semantic labels belonging to 7 super categories (ground, construction, object, nature, human, vehicle, and sky) are used for evaluation, and the void label is ignored. The performance is measured in terms of pixel IOU averaged across the 19 classes. The training, validation, and test splits contain 2975, 500, and 1525 images, respectively. For both datasets, we perform experiments in both fully-supervised and semi-supervised settings, varying the amounts of labeled and unlabeled training data.
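As a concrete reference for the metric, a minimal sketch of how mean IOU can be computed from an accumulated confusion matrix is given below; the function name, the `ignore_index` convention for void pixels, and the use of NumPy are assumptions for illustration.

```python
import numpy as np

def mean_iou(preds, gts, num_classes, ignore_index=255):
    """Mean intersection-over-union averaged over classes.

    preds, gts: iterables of integer label maps of the same shape.
    Pixels whose ground truth equals `ignore_index` are excluded.
    """
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for pred, gt in zip(preds, gts):
        mask = gt != ignore_index
        # Accumulate the confusion matrix: rows = ground truth, cols = prediction.
        conf += np.bincount(
            gt[mask].astype(np.int64) * num_classes + pred[mask].astype(np.int64),
            minlength=num_classes ** 2,
        ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)
    return iou.mean()
```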
Model architecture
Our feature extractor follows the DeepLabV3+ encoder-decoder architecture with the ResNet50-based encoder of DeepLabV3. The output spatial resolution of the feature extractor is four times lower than the input resolution. Our projection head consists of three $1\times 1$ convolution layers with 256 channels followed by a unit-normalization layer. The first two layers in the projection head use the ReLU activation function.
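A minimal PyTorch sketch of such a projection head (three $1\times 1$ convolutions with 256 channels, ReLU after the first two, followed by unit normalization) is shown below; the class name, parameter names, and input channel count are assumptions and not taken from the original implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Three 1x1 convolutions with 256 channels; ReLU after the first two,
    unit (L2) normalization of each pixel embedding at the output."""

    def __init__(self, in_channels: int, proj_channels: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, proj_channels, kernel_size=1)
        self.conv2 = nn.Conv2d(proj_channels, proj_channels, kernel_size=1)
        self.conv3 = nn.Conv2d(proj_channels, proj_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = self.conv3(x)
        # Unit-normalize along the channel dimension so each pixel embedding
        # lies on the unit hypersphere.
        return F.normalize(x, dim=1)
```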
Training and inference
Following prior work, we use $513\times 513$ random crops extracted from preprocessed (random left-right flipping and scaling) input images for training. All the models are trained from scratch using asynchronous stochastic gradient descent on 8 replicas with minibatches of size 16, weight decay of $4e^{-5}$, momentum of 0.9, and cosine learning rate decay. For contrastive pretraining, we use an initial learning rate of 0.1 and 300K training steps. For softmax fine-tuning, we use an initial learning rate of 0.007, with 300K training steps when the number of labeled images is above 2500 for the PASCAL VOC 2012 dataset and above 1000 for the Cityscapes dataset, and 50K training steps in other settings (we observed overfitting with longer training when the number of labeled images is low). When we use softmax training without contrastive pretraining, we use an initial learning rate of 0.03, with 600K training steps when the number of labeled images is above 2500 for the PASCAL VOC 2012 dataset and above 1000 for the Cityscapes dataset, and 300K training steps in other settings. The temperature parameter $\tau$ of the contrastive loss is set to 0.07 in all the experiments. We use color distortions for contrastive pretraining, and random brightness and contrast adjustments for softmax fine-tuning (using hue and saturation adjustments while training the softmax classifier resulted in a slight drop in performance). For generating pseudo labels, we use a threshold of 0.8 for all the foreground classes of the PASCAL VOC 2012 and Cityscapes datasets, and a threshold of 0.97 for the background class of the PASCAL VOC 2012 dataset.
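To make the pseudo-labeling step concrete, the sketch below illustrates confidence thresholding on softmax probabilities for the PASCAL VOC 2012 setting (threshold 0.8 for foreground classes, 0.97 for the background class); the function name, the background class index, and the `ignore_index` value are assumptions for illustration, not the original code.

```python
import torch

def generate_pseudo_labels(probs: torch.Tensor,
                           fg_threshold: float = 0.8,
                           bg_threshold: float = 0.97,
                           background_class: int = 0,
                           ignore_index: int = 255) -> torch.Tensor:
    """probs: [B, C, H, W] softmax probabilities predicted on unlabeled images.
    Pixels whose top probability falls below the class-specific threshold are
    marked with `ignore_index` so they are excluded from the fine-tuning loss."""
    confidence, labels = probs.max(dim=1)                       # both [B, H, W]
    thresholds = torch.where(labels == background_class,
                             torch.full_like(confidence, bg_threshold),
                             torch.full_like(confidence, fg_threshold))
    pseudo = labels.clone()
    pseudo[confidence < thresholds] = ignore_index
    return pseudo
```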
For a $513\times 513$ input, our feature extractor produces a $129\times 129$ feature map. Since the memory complexity of the contrastive loss is quadratic in the number of pixels, to avoid GPU memory issues, we resize the feature map to $65\times 65$ using bilinear resizing before computing the contrastive loss. The corresponding low-resolution label map is obtained from the original label map using nearest neighbor downsampling. For softmax training, we follow common practice and upsample the logits from $129\times 129$ to $513\times 513$ using bilinear resizing before computing the pixel-wise cross-entropy loss. Since the model is fully-convolutional, during inference, we directly run it on an input image and upsample the output logits to the input resolution using bilinear resizing.
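As an illustration of this step, the sketch below computes a pixel-wise, label-based supervised contrastive loss on the downsampled embedding and label maps. It follows the general supervised contrastive formulation and is not claimed to be the exact implementation; the function name, the batch-wide treatment of pixels, and the `ignore_index` convention are assumptions.

```python
import torch

def pixel_contrastive_loss(embeddings: torch.Tensor,
                           labels: torch.Tensor,
                           temperature: float = 0.07,
                           ignore_index: int = 255) -> torch.Tensor:
    """embeddings: [B, D, H, W] unit-normalized pixel embeddings (e.g. 65x65).
    labels: [B, H, W] class labels obtained by nearest-neighbor downsampling.
    Pixels sharing a class label are treated as positives for each other;
    in practice the loss can be applied per image to bound the N x N memory."""
    B, D, H, W = embeddings.shape
    emb = embeddings.permute(0, 2, 3, 1).reshape(-1, D)          # [N, D]
    lbl = labels.reshape(-1)                                      # [N]
    valid = lbl != ignore_index
    emb, lbl = emb[valid], lbl[valid]

    logits = emb @ emb.t() / temperature                          # [N, N] similarities / tau
    n = emb.shape[0]
    self_mask = torch.eye(n, dtype=torch.bool, device=emb.device)
    logits = logits.masked_fill(self_mask, float('-inf'))         # exclude self-pairs

    pos_mask = (lbl[:, None] == lbl[None, :]) & ~self_mask        # positives: same class label
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1)
    has_pos = pos_count > 0
    # Average log-probability over positives for each anchor that has at least one positive.
    mean_log_prob_pos = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)[has_pos] / pos_count[has_pos]
    return -mean_log_prob_pos.mean()
```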