# Contrastive Learning for Label-Efficient Semantic Segmentation

## Abstract

Collecting labeled data for the task of semantic segmentation is expensive and time-consuming, as it requires dense pixel-level annotations. While recent Convolutional Neural Network (CNN) based semantic segmentation approaches have achieved impressive results by using large amounts of labeled training data, their performance drops significantly as the amount of labeled data decreases. This happens because deep CNNs trained with the de facto cross-entropy loss can easily overfit to small amounts of labeled data. To address this issue, we propose a simple and effective contrastive learning-based training strategy in which we first pretrain the network using a pixel-wise class label-based contrastive loss, and then fine-tune it using the cross-entropy loss. This approach increases intra-class compactness and inter-class separability, thereby resulting in a better pixel classifier. We demonstrate the effectiveness of the proposed training strategy in both fully-supervised and semi-supervised settings using the Cityscapes and PASCAL VOC 2012 segmentation datasets. Our results show that pretraining with label-based contrastive loss results in large performance gains (more than 20% absolute improvement in some settings) when the amount of labeled data is limited.

## Introduction

In the recent past, various approaches based on Convolutional Neural Networks (CNNs) have reported excellent results on several semantic segmentation datasets by leveraging large amounts of dense pixel-level annotations. However, labeling images with pixel-level annotations is time-consuming and expensive. For example, the average time taken to annotate a single image in the Cityscapes dataset is 90 minutes. Figure pascal_drop shows how the performance of a DeepLabV3+ model trained on the PASCAL VOC 2012 dataset using the cross-entropy loss drops as the number of training images decreases. This happens because CNNs trained with the cross-entropy loss can easily overfit to small amounts of labeled data, as the cross-entropy loss does not explicitly encourage intra-class compactness or large margins between classes. To address this issue, we propose to first pretrain the CNN feature extractor using a pixel-wise class label-based contrastive loss (referred to as contrastive pretraining), and then fine-tune the entire network including the softmax classifier using the cross-entropy loss (referred to as softmax fine-tuning). Figures emb_pascal20_baseline and emb_pascal20_proposed show the distributions of various classes in the softmax input feature spaces of models trained with the cross-entropy loss and the proposed strategy, respectively, using 2118 labeled images from the PASCAL VOC 2012 dataset. The mean IOU values of the corresponding models on the PASCAL VOC 2012 validation dataset are 39.1 and 62.7, respectively. The class support regions are more compact and separated when trained with the proposed strategy, leading to better performance.

We use t-SNE for generating the visualizations.

We demonstrate the effectiveness of the proposed training strategy in both fully-supervised and semi-supervised settings. Our main contributions are as follows.

• Simple approach: We propose a simple contrastive learning-based training strategy for improving the performance of semantic segmentation models. We consider the simplicity of our training strategy as its main strength since it can be easily adopted by existing and future semantic segmentation approaches.
• Strong results: We show that label-based contrastive pretraining results in large performance gains on two widely-used semantic segmentation datasets in both fully-supervised and semi-supervised settings, especially when the amount of labeled data is limited.
• Detailed analyses: We show visualizations of class distributions in the feature spaces of trained models to provide insights into why the proposed training strategy works better (see Figures emb_pascal20_baseline and emb_pascal20_proposed). We also present various ablation studies that justify our design choices.


## Related work

These approaches learn representations in a discriminative fashion by contrasting positive pairs against negative pairs. Recently, several approaches based on contrastive loss have been proposed for self-supervised visual representation learning. These approaches treat each instance as a class and use contrastive loss-based instance discrimination for representation learning. Specifically, they use augmented versions of an instance to form the positive pair and other instances to form negative pairs for the contrastive loss. Noting that using a large number of negatives is crucial for the success of contrastive loss-based representation learning, various recent approaches use memory banks to store the representations. While some recent contrastive approaches have attempted to attribute their success to maximization of mutual information, it has been argued that their success cannot be attributed to the properties of mutual information alone. Recently, a supervised contrastive loss was proposed for the task of image classification. Since CNNs have been introduced to solve the semantic segmentation problem, several deep CNN-based approaches have been proposed that gradually improved the performance using large amounts of pixel-level annotations. However, collecting dense pixel-level annotations is difficult and costly. To address this issue, several existing works focus on leveraging weaker forms of supervision such as bounding boxes, scribbles, points, and image-level labels either exclusively or along with dense pixel-level supervision. While this work also focuses on improving semantic segmentation performance when the amount of pixel-level annotations is limited, we do not use any additional forms of annotations.


Instead, we propose a contrastive learning-based pretraining strategy that achieves significant performance gains without the need for any additional data. Another line of work that deals with limited labeled data includes semi-supervised approaches that leverage unlabeled images. While some of these works use generative adversarial training to leverage unlabeled images, others use pseudo labels and consistency-based regularization with various data augmentations. The proposed contrastive pretraining strategy is complementary to these approaches and can be used in conjunction with them. We demonstrate this by showing the effectiveness of contrastive pretraining in the semi-supervised setting using pseudo labels. Recently, an approach was proposed to train the CNN feature extractor of a semantic segmentation model by maximizing the log likelihood of extracted pixel features under a mixture of vMF distributions model. During inference, it first segments the pixel features extracted from an image using spherical K-Means clustering, and then performs k-nearest neighbor search for each segment to retrieve the labels from segments in the training set. While this approach is shown to improve the performance when compared to the widely-used pixel-wise softmax training, it is very complicated as it uses a two-stage expectation-maximization algorithm for training. In comparison, the proposed training strategy is simple, and can be easily adopted by existing and future semantic segmentation approaches.

## Experiments

### Datasets and metrics

**PASCAL VOC 2012**: This dataset consists of 10,582 training (including additionally provided annotations), 1,449 validation, and 1,456 test images with pixel-level annotations for 20 foreground object classes and one background class. The performance is measured in terms of pixel Intersection-Over-Union (IOU) averaged across the 21 classes.

**Cityscapes**: This dataset contains high-quality pixel-level annotations of 5,000 images collected in street scenes from 50 different cities. Following the standard evaluation protocol, 19 semantic labels belonging to 7 super categories (ground, construction, object, nature, human, vehicle, and sky) are used for evaluation, and the void label is ignored. The performance is measured in terms of pixel IOU averaged across the 19 classes. The training, validation, and test splits contain 2975, 500, and 1525 images, respectively.

For both datasets, we perform experiments in both fully-supervised and semi-supervised settings, varying the amounts of labeled and unlabeled training data.
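The class-averaged pixel IOU metric used for both datasets can be computed per class as intersection over union of the predicted and ground-truth masks, skipping ignored pixels. A minimal NumPy sketch (the function name and signature are illustrative, not from the paper's codebase):

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean Intersection-Over-Union averaged across classes.

    `pred` and `target` are integer label maps of the same shape; pixels whose
    target equals `ignore_index` (e.g. the Cityscapes void label) are skipped.
    Classes absent from both prediction and target are excluded from the mean.
    """
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

In practice, evaluation accumulates per-class intersections and unions over the whole validation split before dividing, rather than averaging per-image IOUs.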

### Model architecture

Our feature extractor follows the DeepLabV3+ encoder-decoder architecture with the ResNet50-based encoder of DeepLabV3. The output spatial resolution of the feature extractor is four times lower than the input resolution. Our projection head consists of three $1\times 1$ convolution layers with 256 channels followed by a unit-normalization layer. The first two layers in the projection head use the ReLU activation function.
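Since a $1\times 1$ convolution is just a per-pixel linear map, the projection head reduces to three matmuls over the channel axis. A minimal NumPy sketch of this structure (weight shapes and the 512-channel input are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def projection_head(features, weights):
    """Three 1x1-conv layers (256 channels) followed by unit normalization.

    `features`: (H, W, C) feature map from the extractor.
    `weights`: list of three (C_in, 256) matrices; a 1x1 convolution acts
    independently on each pixel, so it is a matmul over the channel axis.
    ReLU follows the first two layers only, as described in the text.
    """
    x = features
    for i, w in enumerate(weights):
        x = x @ w
        if i < 2:
            x = np.maximum(x, 0.0)  # ReLU on the first two layers
    # unit-normalization layer: put each pixel embedding on the unit sphere
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

feats = rng.normal(size=(4, 4, 512))            # toy (H, W, C) feature map
ws = [rng.normal(scale=0.05, size=(512, 256)),
      rng.normal(scale=0.05, size=(256, 256)),
      rng.normal(scale=0.05, size=(256, 256))]
proj = projection_head(feats, ws)               # (4, 4, 256), unit-norm pixels
```

The unit normalization matters because the contrastive loss below compares embeddings by dot product, which then equals cosine similarity.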

### Training and inference

Following prior work, we use $513\times 513$ random crops extracted from preprocessed (random left-right flipping and scaling) input images for training. All the models are trained from scratch using asynchronous stochastic gradient descent on 8 replicas with minibatches of size 16, weight decay of $4\times 10^{-5}$, momentum of 0.9, and cosine learning rate decay. For contrastive pretraining, we use an initial learning rate of 0.1 and 300K training steps. For softmax fine-tuning, we use an initial learning rate of 0.007, with 300K training steps when the number of labeled images is above 2500 in the case of the PASCAL VOC 2012 dataset and above 1000 in the case of the Cityscapes dataset, and 50K training steps in other settings (we observed overfitting with longer training when the number of labeled images is low). When we use softmax training without contrastive pretraining, we use an initial learning rate of 0.03, with 600K training steps when the number of labeled images is above 2500 in the case of the PASCAL VOC 2012 dataset and above 1000 in the case of the Cityscapes dataset, and 300K training steps in other settings. The temperature parameter $\tau$ of the contrastive loss is set to 0.07 in all the experiments. We use color distortions for contrastive pretraining, and random brightness and contrast adjustments for softmax fine-tuning (using hue and saturation adjustments while training the softmax classifier resulted in a slight drop in performance). For generating pseudo labels, we use a threshold of 0.8 for all the foreground classes of the PASCAL VOC 2012 and Cityscapes datasets, and a threshold of 0.97 for the background class of the PASCAL VOC 2012 dataset.
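The pseudo-label thresholding above can be sketched as follows. This is an illustrative NumPy version, assuming background is class 0 (as in PASCAL VOC) and that rejected pixels are marked with an ignore index so they drop out of the fine-tuning loss:

```python
import numpy as np

def pseudo_labels(probs, fg_threshold=0.8, bg_threshold=0.97,
                  background_class=0, ignore_index=255):
    """Confidence-thresholded pseudo labels for unlabeled images.

    `probs`: (H, W, K) per-pixel class probabilities from the current model.
    A pixel keeps its argmax class as a pseudo label only if the confidence
    exceeds the class-dependent threshold (0.8 for foreground classes, 0.97
    for the background class); otherwise it is marked `ignore_index`.
    """
    labels = probs.argmax(axis=-1)
    conf = probs.max(axis=-1)
    thresh = np.where(labels == background_class, bg_threshold, fg_threshold)
    return np.where(conf > thresh, labels, ignore_index)
```

For Cityscapes, which has no background class, the foreground threshold of 0.8 would apply uniformly.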

For $513\times 513$ input, our feature extractor produces a $129\times 129$ feature map. Since the memory complexity of contrastive loss is quadratic in the number of pixels, to avoid GPU memory issues, we resize the feature map to $65\times 65$ using bilinear resizing before computing the contrastive loss. The corresponding low-resolution label map is obtained from the original label map using nearest neighbor downsampling. For softmax training, we follow standard practice and upsample the logits from $129\times 129$ to $513\times 513$ using bilinear resizing before computing the pixel-wise cross entropy loss. Since the model is fully-convolutional, during inference, we directly run it on an input image and upsample the output logits to input resolution using bilinear resizing.
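The label map must be downsampled with nearest-neighbor selection rather than bilinear interpolation, since interpolating integer class ids would blend them into meaningless values. A minimal sketch of the nearest-neighbor step (the simple floor-based index mapping is an illustrative choice; real resize ops may align sample centers differently):

```python
import numpy as np

def nearest_downsample_labels(labels, out_h, out_w):
    """Nearest-neighbor downsampling of an integer label map.

    Each output pixel copies the label of the nearest source pixel, so class
    ids are never mixed, unlike bilinear resizing of the feature map itself.
    """
    in_h, in_w = labels.shape
    rows = (np.arange(out_h) * in_h / out_h).astype(int)
    cols = (np.arange(out_w) * in_w / out_w).astype(int)
    return labels[np.ix_(rows, cols)]
```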

### Results - Fully-supervised setting

Figures cityscapes_fs and pascal_fs show the performance improvements on the validation splits of the Cityscapes and PASCAL VOC 2012 datasets, respectively, obtained by contrastive pretraining in the fully-supervised setting. Contrastive pretraining allows the amount of labeled data to be reduced by $2\times$ while improving the performance, and in some settings matches a baseline trained with $5\times$ more data (5295 images). These results clearly demonstrate the effectiveness of the proposed label-based contrastive pretraining. The performance improvements seen on the PASCAL VOC 2012 dataset are much higher than the improvements seen on the Cityscapes dataset. Figure visual_results shows some segmentation results of models trained with and without label-based contrastive pretraining using 2118 labeled images from the PASCAL VOC 2012 dataset. Contrastive pretraining improves the segmentation results by reducing the confusion between background and various foreground classes, and also the confusion between different foreground classes.


### Results - Semi-supervised setting

Figures cityscapes_ss and pascal_ss show the performance improvements on the validation splits of the Cityscapes and PASCAL VOC 2012 datasets, respectively, obtained by contrastive pretraining in the semi-supervised setting. Contrastive pretraining allows the amount of labeled data to be reduced by $2\times$ while improving the performance on the PASCAL VOC 2012 dataset.

Figures cityscapes_pl_improv and pascal_pl_improv show the performance improvements on the validation splits of the Cityscapes and PASCAL VOC 2012 datasets, respectively, obtained by using pseudo labels. Though pseudo labeling is a straightforward approach for making use of unlabeled data, it gives impressive performance gains, both with and without contrastive pretraining.


### Ablation studies

In this section, we perform various ablation studies under the fully-supervised setting on the Cityscapes and PASCAL VOC 2012 datasets with 596 and 2118 labeled training images, respectively.


#### Importance of distortions for contrastive loss

In the case of contrastive loss-based self-supervised learning, distortions are necessary to generate positive pairs. But, in the case of label-based contrastive learning, positive pairs can be generated using labels, and hence, it is unclear how important distortions are. In this work, we use the color distortions from a recent self-supervised learning method that worked well for the downstream task of image classification. Table distortions shows the effect of using these distortions in the contrastive pretraining stage. We can see a small performance gain on the Cityscapes dataset and no gain on the PASCAL VOC 2012 dataset (differences lower than 0.5 are too small to draw any conclusion). These results suggest that distortions that work well for image recognition may not work for semantic segmentation.

#### Contrastive loss variants

The pixel-wise label-based contrastive loss used in this work is first computed separately for each image and then averaged across all the images in a minibatch. We refer to this as the single image variant. An alternative option is to consider all the pixels in a minibatch as a single bag of pixels for computing the contrastive loss; we refer to this as the batch variant. Note that the memory complexity of contrastive loss is quadratic in the number of pixels. Hence, to avoid GPU memory issues, we randomly sample 10K pixels from the entire minibatch for computing the batch variant of the contrastive loss. Table loss_variants compares the performances of these two variants.

| Contrastive loss | Cityscapes (596 images) | PASCAL VOC (2118 images) |
|---|---|---|
| Single image variant | 67.5 | 62.7 |
| Batch variant | 66.0 (1.5 ↓) | 61.7 (1.0 ↓) |

| ImageNet pretraining | Contrastive pretraining | Cityscapes (596 images) | PASCAL VOC (2118 images) |
|---|---|---|---|
| ✗ | ✗ | 63.6 | 39.1 |
| ✓ | ✗ | 68.8 (5.2 ↑) | 58.9 (19.8 ↑) |
| ✗ | ✓ | 67.5 | 62.7 |
| ✓ | ✓ | 68.6 (1.1 ↑) | 62.6 |
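The single image variant can be sketched as a supervised contrastive loss applied per pixel: for each anchor pixel, same-class pixels in the image are positives and all other pixels are negatives. This NumPy version is an illustrative sketch with the temperature $\tau = 0.07$ from the text, not the authors' exact implementation; the batch variant would simply pool (and randomly subsample to 10K) pixels across all images before calling it:

```python
import numpy as np

def pixel_contrastive_loss(embeddings, labels, tau=0.07):
    """Single-image pixel-wise label-based contrastive loss (sketch).

    `embeddings`: (N, D) unit-normalized pixel embeddings from one image,
    `labels`: (N,) class ids. Each anchor's loss is the mean negative
    log-probability of its positives under a softmax over all other pixels.
    """
    n = embeddings.shape[0]
    sim = embeddings @ embeddings.T / tau             # (N, N) similarity logits
    mask_self = np.eye(n, dtype=bool)
    sim = np.where(mask_self, -np.inf, sim)           # exclude the anchor itself
    log_denom = np.log(np.exp(sim).sum(axis=1))       # log-sum over all others
    pos = (labels[:, None] == labels[None, :]) & ~mask_self
    log_prob = sim - log_denom[:, None]
    # average log-probability over each anchor's positives, then over anchors
    per_anchor = np.where(pos, log_prob, 0.0).sum(axis=1) \
        / np.maximum(pos.sum(axis=1), 1)
    return float(-per_anchor[pos.any(axis=1)].mean())
```

The loss is small when same-class embeddings cluster tightly and different classes are far apart, which is exactly the intra-class compactness and inter-class separability the pretraining stage aims for.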

#### Pretraining on external classification dataset

To study the effect of additional pretraining on a large-scale image classification dataset, we compare the models trained from scratch with the models trained from ImageNet-pretrained weights in Table imagenet_results. When contrastive pretraining is not used, ImageNet pretraining results in large performance gains on both Cityscapes and PASCAL VOC 2012 datasets. However, when contrastive pretraining is used, the performance gain due to ImageNet pretraining is limited (only 1.1 points on the Cityscapes dataset and no improvement on the PASCAL VOC 2012 dataset). Also, the results in the second and third rows show that contrastive pretraining, which does not use any additional labels, outperforms ImageNet pretraining (which uses more than a million additional image labels) by 3.8 points on the PASCAL VOC 2012 dataset, and is only slightly worse (1.3 points) on the Cityscapes dataset. These results clearly demonstrate the effectiveness of contrastive pretraining in reducing the need for labeled data.

### Performance on test splits

Table city_test shows the performance improvements on the test splits of the Cityscapes and PASCAL VOC 2012 datasets obtained by contrastive pretraining in the fully-supervised setting. Similar to the results on validation splits, label-based contrastive pretraining leads to significant performance improvements on test splits.

| Contrastive pretraining | Cityscapes (2975 images) | Cityscapes (596 images) | PASCAL VOC 2012 (10582 images) | PASCAL VOC 2012 (2118 images) |
|---|---|---|---|---|
| ✗ | 76.3 | 64.6 | 67.1 | 39.4 |
| ✓ | 78.1 (1.8 ↑) | 68.0 (3.4 ↑) | 74.9 (7.8 ↑) | 61.2 (21.8 ↑) |


## Conclusions and future work

Deep CNN-based semantic segmentation models trained with cross-entropy loss easily overfit to small amounts of training data, and hence perform poorly when trained with limited labeled data. To address this issue, we proposed a simple and effective contrastive learning-based training strategy in which we first pretrain the feature extractor of the model using a pixel-wise label-based contrastive loss and then fine-tune the entire network including the softmax classifier using the cross-entropy loss. This training approach increases both intra-class compactness and inter-class separability, thereby enabling a better pixel classifier. We performed experiments on two widely-used semantic segmentation datasets, namely, PASCAL VOC 2012 and Cityscapes, in both fully-supervised and semi-supervised settings. In both settings, we achieved large performance gains on both datasets by using contrastive pretraining, especially when the amount of labeled data is limited. In this work, we used a simple pseudo labeling-based approach to leverage unlabeled images in the semi-supervised setting.

## Acknowledgments

We thank Yukun Zhu and Liang-Chieh Chen from Google for their support with the DeepLab codebase.