# Anatomical Brain Barriers Segmentation Based on ResUnet

## Abstract

Accurate segmentation of brain structures can support glioma treatment and radiotherapy planning. However, visual and anatomical differences between imaging modalities make accurate segmentation of brain structures challenging.
To address this problem, we first construct a residual-block-based U-shaped network with a deep encoder and a shallow decoder, which trades off performance against efficiency. We then introduce the Tversky loss to address the class imbalance between the foreground classes and the background. Finally, a model ensemble strategy is used to remove outliers and further boost performance.

## Introduction

Gliomas are the most common type of brain tumor and originate from glial cells in the human brain. In clinical radiation therapy, precise identification of the clinical target volume (CTV) boundary helps ensure the effect of the treatment, and the CTV boundary is determined by anatomical structures. Therefore, accurate and automatic segmentation of brain anatomical structures could improve both the efficiency and the effectiveness of glioma treatment.
Recently, deep learning approaches have proven superior to traditional methods on segmentation tasks for both 2D natural images and 3D medical modalities. For brain tumor segmentation in particular, a U-Net with an additional VAE branch has been proposed to reconstruct the input images and regularize the shared encoder, and a two-stage cascaded U-Net has been established to refine the coarse prediction from the first stage and capture context information in the second stage. For brain CTV segmentation, DenseNet has been introduced to predict precise contours of the resection cavity.
To raise researchers' interest in brain CTV segmentation, the Anatomical Brain Barriers to Cancer Spread (ABCs) challenge encourages participants to build automatic brain structure segmentation methods. Its two tasks cover critical structures, for example structures that serve as barriers to the spread of brain cancer or that should be spared from irradiation.
The dataset provides 45 multi-modal images with ground truth annotations for training and validation, 15 images for the online test, and 15 images for the final test. Each case comes with a CT scan and two diagnostic MRI scans of the post-operative brain: contrast-enhanced T1-weighted and T2-weighted FLAIR. All images are co-registered and re-sampled to a size of 164x194x142 voxels with an isotropic resolution of 1.2x1.2x1.2 mm. In Task 1, participants are asked to segment brain structures for the automatic identification of the CTV. In Task 2, participants are required to segment structures that are used in radiotherapy treatment plan optimization. The Dice score and the surface Dice score are the specified evaluation metrics.
In this report, we propose a residual-block-based U-Net composed of a deep encoder and a shallow decoder, implemented with the nnU-Net framework. To provide multi-scale guidance and accelerate training convergence, we employ a deep supervision strategy. To suppress false negatives and recover missed small targets, we use the Tversky loss together with the cross-entropy loss as our criterion. Experiments show that the proposed framework outperforms the compared state-of-the-art methods.

## Method

Building on the U-Net architecture and the nnU-Net framework, we propose a variant based on residual blocks. The details are as follows.

### Encoder design

The proposed encoder is composed of residual blocks, as shown in Fig. 1. Each block consists of two 3x3x3 3D convolution layers followed by a normalization layer and an activation layer. An identity skip connection then links the shallow features to the corresponding deep features. Specifically, we use Instance Normalization (IN) as the normalization layer, since it performs better than BatchNorm when the batch size is small (ours is 2) and has a lower computation cost than Group Normalization (GN). In addition, we use LeakyReLU as the activation layer to retain more information than ReLU. In total, there are 5 spatial levels with 1, 2, 3, 4, and 4 residual blocks, respectively. Downsampling is implemented by a 3D strided convolution layer with a 3x3x3 kernel and a stride of 2. The number of filters starts at 32 and doubles after each downsampling operation; for computational efficiency, the number of filters in the last level is set to 320. The last downsampling convolution skips the z axis in order to retain slice-wise resolution. We randomly crop patches of size 3x128x160x112 from the 3-modality images as the framework input. After the encoder, the output feature map has a size of 320x8x10x7.
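
The filter counts and feature map sizes described above can be checked with a little arithmetic. The following is a minimal sketch, not the authors' implementation: the function name `encoder_layout` and its parameters are hypothetical, and it assumes that each of the four level transitions halves all three spatial axes, which is what the stated input size (128x160x112) and output size (8x10x7) imply. The z-axis exception mentioned in the text is not modelled.

```python
def encoder_layout(patch=(128, 160, 112), base_filters=32, max_filters=320,
                   blocks=(1, 2, 3, 4, 4)):
    """Return (filters, spatial_shape, n_residual_blocks) for each encoder level."""
    layout = []
    shape = tuple(patch)
    for level, n_blocks in enumerate(blocks):
        if level > 0:
            # Assumption: every transition is a stride-2 convolution on all axes.
            shape = tuple(s // 2 for s in shape)
        # Filters double per level and are capped at max_filters.
        filters = min(base_filters * 2 ** level, max_filters)
        layout.append((filters, shape, n_blocks))
    return layout


for filters, shape, n_blocks in encoder_layout():
    print(filters, shape, n_blocks)
# The last level is (320, (8, 10, 7), 4), matching the stated 320x8x10x7 output.
```

Without the cap, plain doubling would give 512 filters at the fifth level, so the cap at 320 is what keeps the deepest level cheaper.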

### Decoder design

The decoder mirrors the encoder and is composed of an upsample block and a standard convolution block. The upsample block is implemented with a 1x1x1 3D convolution layer, which halves the number of features, and a 3D transposed convolution layer, which doubles the spatial dimensions. The standard convolution block consists of a single 3x3x3 3D convolution layer followed by an Instance Normalization layer and a LeakyReLU layer. At the end of the decoder, a 1x1x1 convolution layer whose number of output channels equals the number of target classes is followed by a softmax activation layer to produce the final prediction. The shallow decoder not only speeds up training and inference but also helps avoid overfitting. To accelerate model convergence and provide multi-scale guidance, we also introduce a deep supervision strategy: the features of three other levels are each processed by an extra output block to obtain additional predictions, which are supervised by the same loss function. The per-level losses are weighted according to the importance of each level's output. Here, we set the weights to 8/15, 4/15, 2/15, and 1/15 for the outputs from levels 1, 2, 3, and 4, respectively.
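
The deep supervision weights above follow a simple pattern: each level receives half the weight of the previous one, and the weights are normalized to sum to 1. A minimal sketch of that weighting is given below; the function names `deep_supervision_weights` and `deep_supervision_loss` are hypothetical, and the per-level loss values are assumed to have been computed already with the same criterion.

```python
from fractions import Fraction


def deep_supervision_weights(n_outputs=4):
    """Halve the weight at each level, then normalize so the weights sum to 1."""
    raw = [Fraction(1, 2 ** k) for k in range(n_outputs)]
    total = sum(raw)
    return [w / total for w in raw]


def deep_supervision_loss(level_losses, weights=None):
    """Weighted sum of per-level losses, ordered from level 1 to level 4."""
    if weights is None:
        weights = deep_supervision_weights(len(level_losses))
    return sum(float(w) * loss for w, loss in zip(weights, level_losses))


print(deep_supervision_weights())
# [Fraction(8, 15), Fraction(4, 15), Fraction(2, 15), Fraction(1, 15)]
```

Because the weights sum to 1, the combined loss stays on the same scale as a single-output loss.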

### Loss function

We use the DC-CE loss as our criterion, which combines the original linear Dice loss and the cross-entropy loss:

$$L_{DCCE} = -\frac{\sum_{i=1}^{N}\sum_{c=1}^{C} p_{i, c} y_{i, c}+\epsilon}{\sum_{i=1}^{N}\sum_{c=1}^{C} p_{i, c}+y_{i, c}+\epsilon} -\frac{1}{N}\frac{1}{C}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i, c}\log\left(p_{i, c}\right)$$

where $N$ is the total number of pixels and $C$ is the number of classes. $p_{i, c}$ and $y_{i, c}$ denote the prediction and the annotation for pixel $i$ and class $c$, respectively. The Tversky loss is also used to train the proposed framework and is formulated as:

$$L_{Tversky}=\frac{\sum_{i=1}^{N}\sum_{c=1}^{C} p_{i, c} y_{i, c}+\epsilon}{\sum_{i=1}^{N}\sum_{c=1}^{C} p_{i, c} y_{i, c}+\alpha \sum_{i=1}^{N}\sum_{c=1}^{C} p_{i, c} \bar{y}_{i, c}+\beta \sum_{i=1}^{N}\sum_{c=1}^{C} \bar{p}_{i, c} y_{i, c}+\epsilon}$$

where $\bar{p}_{i, c}=1-p_{i, c}$ and $\bar{y}_{i, c}=1-y_{i, c}$, so that $p_{i, c}\bar{y}_{i, c}$ denotes the false positive (FP) pixels and $\bar{p}_{i, c} y_{i, c}$ denotes the false negative (FN) pixels. The importance of FP and FN is weighted by $\alpha$ and $\beta$, which are set to 0.3 and 0.7, respectively.
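
To make the two formulas concrete, here is a minimal NumPy sketch that evaluates them as written. It is an illustration under stated assumptions, not the authors' code: the function names `dice_ce_loss` and `tversky_ratio`, the `(N, C)` array layout, and the value of `eps` are all hypothetical. The second function returns the Tversky ratio exactly as it appears in the formula; how its sign enters the training objective is left to the caller.

```python
import numpy as np


def dice_ce_loss(p, y, eps=1e-5):
    """DC-CE loss for probabilities p and one-hot labels y, both of shape (N, C)."""
    dice = (np.sum(p * y) + eps) / (np.sum(p + y) + eps)
    # (1/N)(1/C) * sum of y*log(p); clipping avoids log(0).
    ce = np.mean(y * np.log(np.clip(p, 1e-12, 1.0)))
    return -dice - ce


def tversky_ratio(p, y, alpha=0.3, beta=0.7, eps=1e-5):
    """Tversky ratio as written in the text: TP / (TP + alpha*FP + beta*FN)."""
    tp = np.sum(p * y)
    fp = np.sum(p * (1.0 - y))   # predicted foreground, annotated background
    fn = np.sum((1.0 - p) * y)   # predicted background, annotated foreground
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)


# Single foreground channel: one missed pixel versus one spurious pixel.
y = np.array([[1.0], [1.0], [0.0], [0.0]])
missed = np.array([[1.0], [0.0], [0.0], [0.0]])    # one false negative
spurious = np.array([[1.0], [1.0], [1.0], [0.0]])  # one false positive
print(tversky_ratio(missed, y), tversky_ratio(spurious, y))
```

With $\alpha=0.3$ and $\beta=0.7$, the prediction that misses a foreground pixel scores lower than the one that adds a spurious pixel, which is the intended bias against false negatives.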

### Pseudo training with model ensemble

To further improve segmentation accuracy, we successively apply a model ensemble strategy and a pseudo training strategy. First, we average the predictions of all selected models to obtain the prediction for the test volumes:

$$P^{*}=\frac{1}{N}\sum_{i=1}^{N} P_{i}$$

Here $N$ is the number of selected models, $P_{i}$ denotes the prediction of the $i$-th model, and $P^{*}$ denotes the final ensembled prediction. We find the ensembled prediction more stable than any single prediction, since error-prone outliers are removed in the ensemble process. We therefore treat these reliable predictions as pseudo labels for the test volumes. The final loss function for model training is:

$$L_{hybrid}(V, L)=L_{DCCE}(V, L)+L_{Tversky}(V, L)$$

$$L_{final}=L_{hybrid}\left(V_{s}, L_{s}\right)+L_{hybrid}\left(V_{t}, P_{t}^{*}\right)$$

We define $L_{hybrid}$ as the sum of $L_{DCCE}$ and $L_{Tversky}$. $V_{s}$ and $L_{s}$ denote the training volumes and their annotations, and the ensembled prediction $P_{t}^{*}$ serves as the pseudo label for the test volumes $V_{t}$. The pseudo training strategy gives the model more training pairs, and the model can learn reliable information from the pseudo labels. The results in Table 1 demonstrate the effectiveness of the methods in this section.
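
The ensembling step can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names `ensemble_average` and `to_pseudo_label` are hypothetical, the class axis is assumed to come first, and converting $P^{*}$ to hard labels with an argmax is an assumption, since the text does not state whether soft or hard pseudo labels are used.

```python
import numpy as np


def ensemble_average(predictions):
    """Average per-model probability maps; each has the class axis first."""
    return np.mean(np.stack(predictions, axis=0), axis=0)


def to_pseudo_label(p_star):
    """Assumed hardening step: pick the most probable class per voxel."""
    return np.argmax(p_star, axis=0)


# Three toy models, two classes, two voxels. The third model is an
# outlier on the second voxel and is outvoted by the average.
model_a = np.array([[0.1, 0.8], [0.9, 0.2]])
model_b = np.array([[0.2, 0.9], [0.8, 0.1]])
model_c = np.array([[0.1, 0.1], [0.9, 0.9]])

p_star = ensemble_average([model_a, model_b, model_c])
print(to_pseudo_label(p_star))  # [1 0]
```

The resulting labels would then be paired with the test volumes and fed to the same hybrid loss as the annotated training pairs.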

### Optimization

We use SGD with an initial learning rate of $\alpha_{0}=1e-4$ and a momentum of 0.99 to optimize our network. A poly schedule progressively decreases the learning rate according to:

$$\alpha=\alpha_{0} *\left(1-\frac{e}{N_{e}}\right)^{0.9}$$

where $e$ is the index of the current epoch and $N_e$ is the total number of training epochs. We set $N_e=1000$ with a batch size of 2 in the training stage. L2 regularization with a weight of $1e-5$ is used to prevent overfitting.
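
The poly schedule is easy to evaluate directly. The sketch below implements the formula with the stated values as defaults; the function name `poly_lr` and its parameter names are hypothetical.

```python
def poly_lr(epoch, initial_lr=1e-4, total_epochs=1000, exponent=0.9):
    """Poly learning-rate decay: initial_lr * (1 - epoch / total_epochs) ** exponent."""
    return initial_lr * (1 - epoch / total_epochs) ** exponent


for epoch in (0, 250, 500, 750, 999):
    print(epoch, poly_lr(epoch))
```

The rate starts at the initial value, decreases monotonically, and reaches zero at the final epoch. Because the exponent is below 1, the decay is slightly slower than linear early on and steeper near the end.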

## Experiments

### Implementation details and data processing

This section describes the experimental details of our method. First, we normalize the images of the three modalities by subtracting the mean and dividing by the variance. To reduce the negative effect of extreme values in CT images, the gray intensity of the CT modality is clipped to [0.5, 0.995]. We then concatenate the 3 modalities into a 3-channel volume as the framework input. To improve the robustness of the framework, we apply a series of data augmentation strategies, including rotation, scaling, mirroring, gamma correction, and additive brightness. In the inference phase, test-time augmentation (e.g., sliding window and flipping across three axes) is introduced to improve performance. Notably, flipping horizontally along the x-axis greatly reduced accuracy in our experiments. The reason might be that this operation misleads the framework when it learns paired tissues such as the eyes and cochleae. We therefore avoid the horizontal flip along the x-axis in Task 2. To evaluate the effectiveness of the proposed framework, we employ 5-fold cross-validation and choose the proper models to generate the final submission. All proposed frameworks are implemented in PyTorch and run on an NVIDIA Tesla V100 GPU.
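
The flipping part of test-time augmentation can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `flip_tta`, the `(channels, x, y, z)` axis layout, and the `predict` callable are assumptions, and the sliding-window part is omitted. Each combination of flips is applied to the input, the prediction is flipped back, and the results are averaged.

```python
import itertools

import numpy as np


def flip_tta(predict, volume, flip_axes=(1, 2, 3)):
    """Average predictions over every combination of flips along flip_axes.

    volume has shape (channels, x, y, z); predict maps it to an array
    with the same spatial shape (assumed layout).
    """
    outputs = []
    for r in range(len(flip_axes) + 1):
        for axes in itertools.combinations(flip_axes, r):
            flipped = np.flip(volume, axis=axes) if axes else volume
            pred = predict(flipped)
            # Undo the flip so all predictions are aligned before averaging.
            outputs.append(np.flip(pred, axis=axes) if axes else pred)
    return np.mean(outputs, axis=0)


# Toy usage with a stand-in "model" that returns its input unchanged.
volume = np.random.rand(3, 8, 10, 7)
all_axes = flip_tta(lambda v: v, volume)                  # 8 flip combinations
no_x = flip_tta(lambda v: v, volume, flip_axes=(2, 3))    # x-axis flip skipped
```

Under the assumed layout, passing `flip_axes=(2, 3)` corresponds to dropping the x-axis flip as described for Task 2, which halves the number of forward passes from 8 to 4.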

### Quantitative and Qualitative Analysis

We evaluate the performance of the proposed frameworks on the online test dataset, using the Dice score (DSC) and surface Dice score (SDSC) as evaluation metrics for Task 1 and Task 2. The results are shown in Table 1. We choose nnU-Net as the state-of-the-art baseline for comparison; it achieves 87.2% DSC and 97.4% SDSC on Task 1, and 77.7% DSC and 92.6% SDSC on Task 2. Compared with nnU-Net, our ResUnet improves DSC by 0.3% and SDSC by 0.3% on Task 2, and ResUnet with the Tversky loss improves DSC by 0.1% and SDSC by 0.1% on Task 2. To further boost performance, we ensemble the predictions of the above methods. We also visualize the predictions for a qualitative analysis. As shown in the results figure, the predictions of our proposed framework are very close to the corresponding labels. We also reconstruct the predictions in 3D (Fig. 3d), where all target tissues are clearly identified.

## Conclusions

In this study, we proposed an effective framework for the Anatomical Brain Barriers to Cancer Spread challenge. Specifically, we used a residual-block-based U-shaped network as the architecture and the Tversky loss as the criterion to strengthen the feature extraction ability. An ensemble strategy was adopted to refine the predictions and obtain better results. Experiments on the online test set verified the efficacy of the proposed framework.