[论文翻译]双向与密集连接卷积


原文地址:https://arxiv.org/pdf/1909.00166v1


Bi-Directional ConvLSTM U-Net with Densely Connected Convolutions

Abstract

摘要

In recent years, deep learning-based networks have achieved state-of-the-art performance in medical image segmentation. Among the existing networks, U-Net has been successfully applied to medical image segmentation. In this paper, we propose an extension of U-Net, Bi-directional ConvLSTM U-Net with Densely connected convolutions (BCDU-Net), for medical image segmentation, in which we take full advantage of U-Net, bi-directional ConvLSTM (BConvLSTM), and the mechanism of dense convolutions. Instead of a simple concatenation in the skip connection of U-Net, we employ BConvLSTM to combine the feature maps extracted from the corresponding encoding path and the previous decoding up-convolutional layer in a non-linear way. To strengthen feature propagation and encourage feature reuse, we use densely connected convolutions in the last convolutional layer of the encoding path. Finally, we accelerate the convergence of the proposed network by employing batch normalization (BN). The proposed model is evaluated on three datasets, covering retinal blood vessel segmentation, skin lesion segmentation, and lung nodule segmentation, and achieves state-of-the-art performance.

近年来,基于深度学习的网络在医学图像分割领域取得了最先进的性能。在现有网络中,U-Net已成功应用于医学图像分割。本文提出了一种U-Net的扩展架构——双向ConvLSTM密集连接卷积U-Net (BCDU-Net) ,该模型综合运用了U-Net、双向ConvLSTM (BConvLSTM) 和密集卷积机制的优势。与U-Net跳跃连接中简单的拼接操作不同,我们采用BConvLSTM以非线性方式融合来自编码路径的特征图与解码上采样层的特征图。为增强特征传播并促进特征复用,我们在编码路径的最后一个卷积层采用密集连接卷积。此外,通过引入批归一化 (BN) 可加速网络收敛速度。该模型在视网膜血管分割、皮肤病变分割和肺结节分割三个数据集上均达到当前最优性能。

1. Introduction

1. 引言

Medical images play a key role in medical treatment and diagnosis. The goal of Computer-Aided Diagnosis (CAD) systems is to provide doctors with a precise interpretation of medical images so that a large number of people can be treated better. Moreover, automatic processing of medical images reduces the time, cost, and error of human-based processing. One of the main research areas in this field is medical image segmentation, a critical step in numerous medical imaging studies. As in other fields of computer vision research, deep learning networks achieve outstanding results and outperform non-deep state-of-the-art methods in medical imaging. Deep neural networks are mostly utilized in classification tasks, where the output of the network is a single label or a set of probability values associated with the labels for a given input image. These networks work well thanks to some structural features [2] such as activation functions, efficient optimization algorithms, and dropout as a regularizer for the network. Given their huge number of parameters, these networks require a large amount of training data to generalize well. A critical issue in medical image segmentation is the unavailability of large (and annotated) datasets. In medical image segmentation, per-pixel labeling is required instead of image-level labels.

医学影像在医疗和诊断中起着关键作用。计算机辅助诊断(CAD)系统的目标是为医生提供医学影像的精确解读,以便更好地治疗大量患者。此外,医学影像的自动处理可以减少人工处理的时间、成本和错误。该领域的主要研究方向之一是医学图像分割,这是众多医学影像研究中的关键步骤。与计算机视觉的其他研究领域一样,深度学习网络在医学影像领域取得了卓越成果,其性能超越了非深度学习的最先进方法。深度神经网络主要用于分类任务,其输出是单个标签或与给定输入图像相关联的概率值。这些网络的良好性能得益于一些结构特征[2],如:激活函数、不同的高效优化算法,以及作为网络正则化器的dropout。由于网络参数数量庞大,这些网络需要大量数据进行训练,才能提供良好的泛化行为。医学图像分割中的一个关键问题是缺乏大型(且标注好的)数据集。在医学图像分割中,需要的是像素级标注,而非图像级标签。

The fully convolutional neural network (FCN) [17] was one of the first deep networks applied to image segmentation. Ronneberger et al. [21] extended this architecture to U-Net, achieving good segmentation results while alleviating the need for a large amount of training data. Their network consists of encoding and decoding paths. In the encoding path, a large number of feature maps with reduced dimensionality are extracted from the input data. The decoding path is used to produce segmentation maps (with the same size as the input) by performing up-convolutions. Many extensions of U-Net have been proposed so far [2, 19]. The most important modifications mainly concern the skip connections. In some extended versions of U-Net, the extracted feature maps in the skip connection are first fed to a processing step (e.g. attention gates [19]) and then concatenated. The main drawback of these networks is that the processing step is performed individually for the two sets of feature maps, and these features are then simply concatenated.

全卷积神经网络 (fully convolutional neural network, FCN) [17] 是最早应用于图像分割的深度网络之一。Ronneberger等人 [21] 将该架构扩展为U-Net,在减少大量训练数据需求的同时取得了良好的分割效果。其网络由编码路径和解码路径组成:编码路径从输入数据中提取大量降维特征图;解码路径通过执行反卷积操作生成与输入尺寸相同的分割图。迄今为止已提出多种U-Net变体 [2,19],最重要的改进集中在跳跃连接机制——在某些改进版本中,跳跃连接提取的特征图会先经过处理步骤(如注意力门 [19])再进行拼接。这类网络的主要缺陷在于:两组特征图需独立进行处理,仅通过简单拼接实现特征融合。

In this paper, we propose BCDU-Net, an extended version of U-Net, by including BConvLSTM [23] in the skip connection and reusing feature maps with densely connected convolutions. The feature maps from the corresponding encoding layer have higher resolution, while the feature maps extracted from the previous up-convolutional layer contain more semantic information. Instead of a simple concatenation, combining these two kinds of feature maps with non-linear functions may result in a more precise segmentation output. Therefore, in this paper we extend the U-Net architecture by adding BConvLSTM in the skip connection to combine these two kinds of feature maps.

本文提出BCDU-Net,通过在跳跃连接中加入BConvLSTM[23]并利用密集卷积复用特征图,对U-Net进行了扩展。编码层对应的特征图具有更高分辨率,而前一层上卷积提取的特征图则包含更丰富的语义信息。与简单拼接不同,通过非线性函数融合这两类特征图可获得更精确的分割结果。因此,我们在跳跃连接中引入BConvLSTM来融合这两类特征图,从而扩展了U-Net架构。

Having a sequence of convolutional layers may help the network to learn more kinds of features; however, in many cases, the network learns redundant features. To mitigate this problem and enhance information flow through the network, we utilize the idea of densely connected convolutions [12]. In the last layer of the contracting path, convolutional blocks are connected to all subsequent blocks in that layer via channel-wise concatenation. Features which are learned in each block are passed forward to the next block. This strategy helps the method to learn a diverse set of features based on the collective knowledge gained by previous layers, and therefore to avoid learning redundant features.

采用一系列卷积层可能有助于网络学习更多类型的特征;然而,在许多情况下,网络会学习冗余特征。为了缓解这一问题并增强网络中的信息流动,我们采用了密集连接卷积(densely connected convolutions)的思想[12]。在收缩路径的最后一层,卷积块通过通道级联与该层中所有后续块相连。每个块中学习到的特征会传递到下一个块。这一策略有助于方法基于先前层获得的集体知识学习多样化的特征集,从而避免学习冗余特征。

Furthermore, we accelerate the convergence of the network by employing BN after the up-convolution filters. We evaluate the proposed BCDU-Net on three different applications: retinal blood vessel segmentation (DRIVE dataset), skin lesion segmentation (ISIC 2018 dataset), and lung nodule segmentation (Lung dataset). The experimental results demonstrate that the proposed network outperforms state-of-the-art alternatives.

此外,我们通过在反卷积滤波器后采用批量归一化(BN)来加速网络收敛速度。本文在三个不同任务上评估了提出的BCDU-Net网络:视网膜血管分割(DRIVE数据集)、皮肤病变分割(ISIC 2018数据集)和肺结节分割(Lung数据集)。实验结果表明,该网络性能优于当前最先进方法。

2. Related Work

2. 相关工作

One of the most crucial tasks in medical imaging is semantic segmentation. Before the revolution of deep learning in computer vision, traditional handcrafted features were exploited for semantic segmentation. During the last few years, deep learning-based approaches have markedly improved on the performance of classical image segmentation strategies. Based on the deep architecture they exploit, these approaches can be divided into three groups: convolutional neural networks (CNN), fully convolutional networks (FCN), and recurrent neural networks (RNN).

医学影像中最关键的任务之一是语义分割。在深度学习引发计算机视觉革命之前,传统手工特征被用于语义分割。过去几年间,基于深度学习的方法显著提升了经典图像分割策略的性能。根据采用的深度架构,这些方法可分为三类:卷积神经网络 (CNN)、全卷积网络 (FCN) 和循环神经网络 (RNN)。

2.1. Convolutional Neural Network (CNN)

2.1. 卷积神经网络 (CNN)

Cui et al. [10] exploited a CNN for automatic segmentation of brain MRI images. The authors first divided the input images into patches and then used these patches to train the CNN. To handle an arbitrary number of modalities as the input data, Kleesiek et al. [14] proposed a 3D CNN for brain lesion segmentation. To process MRI data, the network consists of four channels: non-enhanced and contrast-enhanced T1w, T2w, and FLAIR contrasts. Roth et al. [22] proposed a multi-level deep convolutional network for pancreas segmentation in abdominal CT scans as a probabilistic bottom-up approach.

Cui等人[10]利用CNN(卷积神经网络)实现了脑部MRI图像的自动分割。作者首先将输入图像划分为若干小块,然后利用这些小块训练CNN。为处理任意数量的模态输入数据,Kleesiek等人[14]提出了一种3D CNN用于脑部病变分割。为处理MRI数据,该网络包含四个通道:非增强和对比增强T1w、T2w以及FLAIR对比度。Roth等[22]提出了一种多层级深度卷积网络,作为腹部CT扫描中胰腺分割的概率自底向上方法。

2.2. Fully Convolutional Network (FCN)

2.2. 全卷积网络 (FCN)

One of the main problems of CNN models for segmentation tasks is that the spatial information of the image is lost when the convolutional features are fed into the fully connected (fc) layers. To overcome this problem, the fully convolutional network (FCN) was proposed by Long et al. [17]. This network is trained end-to-end and pixels-to-pixels for semantic segmentation. All fc layers of the CNN architecture are replaced with convolutional and deconvolutional ones to keep the original spatial resolution. Therefore, the original spatial dimensions of the feature maps are recovered while the network is performing the segmentation task. FCN has been frequently utilized for segmentation of medical and biomedical images [27, 28]. Zhou et al. [27] exploited FCN for segmentation of anatomical structures on 3D CT images. An FCN with convolution and deconvolution parts is trained end-to-end, performing voxel-wise multi-class classification to map each voxel in a CT image to an anatomical label. Drozdzal et al. [11] proposed a very deep FCN using short skip connections. The authors showed that a very deep FCN with both long and short skip connections achieved better results than the original one.

CNN模型在分割任务中的一个主要问题是,当卷积特征输入全连接层(fc)时图像的空间信息会丢失。为解决这一问题,Long等人[17]提出了全卷积网络(FCN)。该网络采用端到端、像素到像素的方式进行语义分割训练。CNN架构中的所有全连接层都被替换为卷积层和反卷积层,以保持原始空间分辨率。因此,在执行分割任务时能恢复特征图的原始空间维度。FCN已广泛应用于医学和生物医学图像分割[27,28]。Zhou等人[27]利用FCN对3D CT图像中的解剖结构进行分割,通过端到端训练包含卷积和反卷积部分的FCN,执行体素级多类分类,将CT图像中的每个体素映射到解剖标签。Drozdzal等人[11]提出采用短跳跃连接的极深FCN,证明同时包含长短跳跃连接的极深FCN能获得优于原始架构的效果。

U-Net, proposed by Ronneberger et al. [21], is one of the most popular FCNs for medical image segmentation. This network consists of contracting and expanding paths. U-Net has some advantages over other segmentation-based networks [2]. It works well with few training samples, and the network is able to utilize global location and context information at the same time. Milletari et al. [18] proposed V-Net, a 3D extension of U-Net that predicts the segmentation of a given volume at once. In that work, the authors proposed an end-to-end 3D image segmentation network based on a volumetric, fully convolutional neural network operating on MRI volumes. 3D U-Net [7] was proposed for processing 3D volumes instead of 2D images as input; all 2D operations of U-Net are replaced with their 3D counterparts. VoxResNet [5], a deep voxel-wise residual network, was proposed for brain segmentation from MR. This 3D network is inspired by deep residual learning, performing summation of feature maps from different layers.

U-Net由Ronneberger等人[21]提出,是最流行的医学图像分割全卷积网络(FCN)之一。该网络由收缩路径和扩展路径组成。相比其他基于分割的网络[2],U-Net具有几项优势:在训练样本较少时表现良好,且能同时利用全局位置和上下文信息。Milletari等人[18]提出了V-Net,这是U-Net的三维扩展版本,可一次性预测给定体积的分割结果。该研究基于容积式(MRI体积)全卷积神经网络,构建了端到端3D图像分割网络。3D U-Net[7]专为处理3D体积数据而设计(而非2D图像输入),其将U-Net所有2D操作替换为对应的3D版本。VoxResNet[5]是一种深度体素级残差网络,专为MR脑部分割提出。这个3D网络受深度残差学习启发,通过叠加不同层的特征图实现特征融合。

2.3. Recurrent Neural Network (RNN)

2.3. 循环神经网络 (RNN)

Pinheiro et al. [20] proposed an end-to-end feed forward deep network consisting of an RNN that can take into account long range label dependencies in the scenes while limiting the capacity of the model. Visin et al. [25] proposed ReSeg for semantic segmentation. In that network, the input images are processed with a pre-trained VGG-16 model and its resulting feature maps are then fed into one or more ReNet layers. DeepLab architecture [6] contains a deep convolutional neural network in which all fully connected layers are replaced by convolutional layers and then the feature resolution is increased through atrous convolutional layers. Alom et al. [2] proposed Recurrent Convolutional Neural Network (RCNN) and Recurrent Residual Convolutional Neural Network (R2CNN) based on U-Net models for medical image segmentation. Bai et al. [4] combined an FCN with an RNN for medical image sequence segmentation, which is able to incorporate both spatial and temporal information for MR images.

Pinheiro等人[20]提出了一种端到端前馈深度网络,该网络由RNN组成,能够在限制模型容量的同时考虑场景中的长距离标签依赖关系。Visin等人[25]提出了用于语义分割的ReSeg网络。在该网络中,输入图像通过预训练的VGG-16模型处理,生成的特征图随后输入到一个或多个ReNet层。DeepLab架构[6]包含一个深度卷积神经网络,其中所有全连接层都被卷积层取代,然后通过空洞卷积层提高特征分辨率。Alom等人[2]基于U-Net模型提出了用于医学图像分割的循环卷积神经网络(RCNN)和循环残差卷积神经网络(R2CNN)。Bai等人[4]将FCN与RNN结合用于医学图像序列分割,能够同时利用MR图像的空间和时间信息。

In this paper, BCDU-Net is proposed as an extension of U-Net, showing better performance than state-of-the-art alternatives for the segmentation task. Moreover, BN has a significant effect on the convergence speed of the network.

本文提出的BCDU-Net作为U-Net的扩展版本,在分割任务中表现出优于现有最优方案的性能。此外,批量归一化(BN)对网络收敛速度具有显著影响。

3. Proposed Method

3. 提出的方法

Inspired by U-Net [21], BConvLSTM [23], and dense convolutions [12], we propose the BCDU-Net as shown in Figure 1. The network utilizes the strengths of both BConvLSTM states and densely connected convolutions. We detail the different parts of the network in the next subsections.

受U-Net [21]、BConvLSTM [23]和密集卷积[12]的启发,我们提出了如图1所示的BCDU-Net。该网络结合了BConvLSTM状态和密集连接卷积的优势。我们将在接下来的小节中详细说明网络的各个部分。

3.1. Encoding Path

3.1. 编码路径

The contracting path of BCDU-Net includes four steps. Each step consists of two convolutional $3\times3$ filters followed by a $2\times2$ max pooling function and ReLU. The number of feature maps is doubled at each step. The contracting path progressively extracts image representations and increases the dimension of these representations layer by layer. Ultimately, the final layer in the encoding path produces a high-dimensional image representation with high semantic information. The original U-Net contains a sequence of convolutional layers in the last step of the encoding path. Having a sequence of convolutional layers in a network lets the method learn different kinds of features. Nevertheless, the network might learn redundant features in the successive convolutions. To mitigate this problem, densely connected convolutions were proposed [12]. This helps the network to improve its performance through the idea of “collective knowledge”, in which the feature maps are reused through the network. That is, the feature maps learned by all previous convolutional layers are concatenated with the feature map learned by the current layer and then forwarded as the input to the next convolution.

BCDU-Net的收缩路径包含四个步骤。每个步骤由两个$3\times3$卷积滤波器组成,后接一个$2\times2$最大池化函数和ReLU激活。每经过一个步骤,特征图的数量会翻倍。收缩路径逐步提取图像表征,并逐层增加这些表征的维度。最终,编码路径的最后一层会生成一个包含高语义信息的高维图像表征。

原始U-Net在编码路径的最后一步包含一系列卷积层。在网络中使用连续的卷积层可以让方法学习不同类型的特征。然而,连续的卷积操作可能导致网络学习到冗余特征。为缓解这一问题,文献[12]提出了密集连接卷积。这种方法通过"集体知识"的理念提升网络性能——即特征图在整个网络中被重复利用。具体而言,所有先前卷积层学到的特征图会与当前层学到的特征图进行拼接,随后作为下一层卷积的输入向前传递。
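
The shape bookkeeping of the contracting path can be sketched as follows. This is only an illustration: the input size (256×256), the initial filter count (64), the use of 'same' padding, and the helper name `encoder_shapes` are assumptions made for the example and are not stated in the text.

```python
def encoder_shapes(width, height, base_filters=64, steps=4):
    """Return (filters, width, height) of the conv output at each encoder step.

    Assumes 'same'-padded 3x3 convolutions (spatial size unchanged) and a
    2x2 max pooling between consecutive steps (spatial size halved).
    base_filters and the input size are illustrative assumptions.
    """
    shapes = []
    filters = base_filters
    for step in range(steps):
        shapes.append((filters, width, height))
        if step < steps - 1:
            # 2x2 max pooling halves each spatial dimension ...
            width, height = width // 2, height // 2
            # ... and the next step doubles the number of feature maps.
            filters *= 2
    return shapes


shapes = encoder_shapes(256, 256)
for f, w, h in shapes:
    print(f"{f:4d} feature maps of size {w}x{h}")
```

Under these assumptions the deepest step works on 512 feature maps of size 32×32, which is where the densely connected blocks of the next paragraph would operate.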

Densely connected convolutions have several advantages over regular convolutions [12]. First, they help the network learn a diverse set of feature maps instead of redundant features. Moreover, they improve the network's representational power by allowing information to flow through the network and by reusing features. Furthermore, each densely connected convolution can benefit from all the features produced before it, which helps the network avoid the risk of exploding or vanishing gradients. In addition, the gradients are sent to their respective places in the network more quickly in the backward path. We employ the idea of densely connected convolutions in the proposed network. To do so, we define one block as two consecutive convolutions. There is a sequence of $N$ blocks in the last convolutional layer of the encoding path, as shown in Figure 2. These blocks are densely connected. We consider $\mathcal{X}_{e}^{i}$ as the output of the $i^{th}$ convolutional block. The $i^{th}$ $(i\in\{1,\dots,N\})$ convolutional block receives the concatenation of the feature maps of all preceding convolutional blocks as its input, i.e., $\bigl[\mathcal{X}_{e}^{1},\mathcal{X}_{e}^{2},\dots,\mathcal{X}_{e}^{i-1}\bigr]\in\mathbb{R}^{(i-1)F_{l}\times W_{l}\times H_{l}}$, and the output of the $i^{th}$ block is $\mathcal{X}_{e}^{i}\in\mathbb{R}^{F_{l}\times W_{l}\times H_{l}}$. In the remaining part of the paper we simply use $\mathcal{X}_{e}$ instead of $\mathcal{X}_{e}^{N}$.

密集连接卷积 (densely connected convolutions) 的理念相比常规卷积具有若干优势 [12]。首先,它有助于网络学习多样化的特征图而非冗余特征。其次,该理念通过允许信息在网络中流动并重用特征,提升了网络的表征能力。此外,密集连接卷积可以利用其之前生成的所有特征,从而促使网络避免梯度爆炸或消失的风险。另外,在反向传播过程中梯度能更快传递到网络中相应位置。我们在提出的网络中采用了密集连接卷积的理念,具体通过将每个模块设计为两个连续卷积来实现。如图2所示,在编码路径的最后一个卷积层中包含$N$个按序排列的模块,这些模块采用密集连接方式。设$\mathcal{X}_{e}^{i}$表示第$i$个卷积模块的输出,则第$i$个$(i\in\{1,\dots,N\})$卷积模块的输入会接收所有前置卷积模块特征图的拼接结果,即$\bigl[\mathcal{X}_{e}^{1},\mathcal{X}_{e}^{2},\dots,\mathcal{X}_{e}^{i-1}\bigr]\in\mathbb{R}^{(i-1)F_{l}\times W_{l}\times H_{l}}$,而第$i$个模块的输出为$\mathcal{X}_{e}^{i}\in\mathbb{R}^{F_{l}\times W_{l}\times H_{l}}$。在本文后续部分中,我们将统一使用$\mathcal{X}_{e}$代替$\mathcal{X}_{e}^{N}$。
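
The channel arithmetic of the dense layer can be sketched in a few lines. The values `F_L = 512` and the 256 channels entering the layer are hypothetical, and `stub_block` is only a stand-in for the two consecutive convolutions of one block; the point is that block $i$ (for $i \ge 2$) sees $(i-1)F_{l}$ input channels.

```python
F_L = 512  # hypothetical number of feature maps produced by each block


def stub_block(index, in_channels):
    """Stand-in for two consecutive convolutions: always emits F_L channels."""
    return F_L


def dense_layer_channels(layer_input_channels, n_blocks, block=stub_block):
    """Return the number of input channels each block receives.

    Block 1 receives the input of the layer; block i >= 2 receives the
    channel-wise concatenation of the outputs of blocks 1..i-1.
    """
    received = []
    outputs = []  # output channel counts of the blocks seen so far
    for i in range(1, n_blocks + 1):
        in_channels = layer_input_channels if i == 1 else sum(outputs)
        received.append(in_channels)
        outputs.append(block(i, in_channels))
    return received


print(dense_layer_channels(256, n_blocks=3))  # [256, 512, 1024]
```

The growing input width is the cost of feature reuse: every block still emits only $F_{l}$ maps, but its convolution kernels must span all earlier outputs.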

3.2. Decoding Path

3.2. 解码路径

Each step in the decoding path starts by applying an up-sampling function to the output of the previous layer. In the standard U-Net, the corresponding feature maps in the contracting path are cropped and copied to the decoding path. These feature maps are then concatenated with the output of the up-sampling function. In BCDU-Net, we employ BConvLSTM to process these two kinds of feature maps in a more complex way. Let $\mathcal{X}_{e}\in\mathbb{R}^{F_{l}\times W_{l}\times H_{l}}$ be the set of feature maps copied from the encoding part, and $\mathcal{X}_{d}\in\mathbb{R}^{F_{l+1}\times W_{l+1}\times H_{l+1}}$ be the set of feature maps from the previous convolutional layer, where $F_{l}$ is the number of feature maps at layer $l$, and $W_{l}\times H_{l}$ is the size of each feature map at layer $l$. It is worth mentioning that $F_{l+1}=2F_{l}$, $W_{l+1}=\frac{1}{2}W_{l}$, and $H_{l+1}=\frac{1}{2}H_{l}$. Based on Figure 3, $\mathcal{X}_{d}$ is first passed to an up-convolutional layer in which an up-sampling function followed by a $2\times2$ convolution is applied, doubling the size of each feature map and halving the number of feature channels, i.e., producing $\mathcal{X}_{d}^{up}\in\mathbb{R}^{F_{l}\times W_{l}\times H_{l}}$. In other words, the expanding path increases the size of the feature maps layer by layer to reach the original size of the input image after the final layer.

解码路径中的每一步都始于对前一层输出执行上采样函数。在标准U-Net中,收缩路径中对应的特征图会被裁剪并复制到解码路径。这些特征图随后与上采样函数的输出进行拼接。在BCDU-Net中,我们采用BConvLSTM以更复杂的方式处理这两类特征图。设$\mathcal{X}_{e}\in\mathbb{R}^{F_{l}\times W_{l}\times H_{l}}$为从编码部分复制的特征图集合,$\mathcal{X}_{d}\in\mathbb{R}^{F_{l+1}\times W_{l+1}\times H_{l+1}}$为前一个卷积层的特征图集合,其中$F_{l}$是第$l$层的特征图数量,$W_{l}\times H_{l}$是第$l$层每个特征图的尺寸。值得注意的是$F_{l+1}=2F_{l}$,$W_{l+1}=\frac{1}{2}W_{l}$,且$H_{l+1}=\frac{1}{2}H_{l}$。根据图3,$\mathcal{X}_{d}$首先传递至一个上卷积层,该层应用上采样函数后接$2\times2$卷积,使每个特征图尺寸翻倍且特征通道数减半,即生成$\mathcal{X}_{d}^{up}\in\mathbb{R}^{F_{l}\times W_{l}\times H_{l}}$。换言之,扩展路径逐层增大特征图尺寸,最终达到输入图像的原始尺寸。
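
A small shape check makes the relation between $\mathcal{X}_{d}$, $\mathcal{X}_{d}^{up}$, and $\mathcal{X}_{e}$ concrete. The specific sizes below are made-up examples and `up_convolution_shape` is a hypothetical helper; only the doubling and halving rule comes from the text.

```python
def up_convolution_shape(shape):
    """Map the shape of X_d, (F_{l+1}, W_{l+1}, H_{l+1}), to that of X_d^up.

    The up-convolution doubles each spatial dimension and halves the
    number of feature channels, giving (F_l, W_l, H_l).
    """
    filters, width, height = shape
    if filters % 2:
        raise ValueError("channel count must be even to be halved")
    return (filters // 2, width * 2, height * 2)


x_e_shape = (256, 64, 64)   # illustrative skip-connection feature maps
x_d_shape = (512, 32, 32)   # illustrative output of the previous decoder layer
x_d_up_shape = up_convolution_shape(x_d_shape)

# Both inputs of the skip connection now have identical shapes.
print(x_d_up_shape == x_e_shape)  # True
```

Matching shapes are what allow the two sets of feature maps to be treated as a two-element sequence by the recurrent layer in the skip connection.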

3.2.1 Batch Normalization:

3.2.1 批量归一化 (Batch Normalization):

After up-sampling, $\mathcal{X}_{d}^{up}$ goes through a BN function and produces $\widehat{\mathcal{X}}_{d}^{up}$. A problem in the intermediate layers during training is that the distribution of the activations varies. This makes the training process very slow, since in every training step each layer has to adapt itself to a new distribution. BN [13] is utilized to increase the stability of a neural network; it standardizes the inputs to a layer in the network by subtracting the batch mean and dividing by the batch standard deviation. BN effectively accelerates the training process of a neural network. Moreover, in some cases the performance of the model is improved thanks to its modest regularization effect. More details can be found in [13].

上采样后,$\mathcal{X}_{d}^{up}$ 经过批量归一化 (BN) 函数处理,生成 $\widehat{\mathcal{X}}_{d}^{up}$。训练过程中中间层存在一个问题:激活值的分布会不断变化。该问题会大幅降低训练速度,因为每一层在每个训练步骤中都必须学习适应新的分布。BN [13] 通过减去批次均值并除以批次标准差来标准化网络层的输入,从而有效提升神经网络的稳定性。BN 能显著加速神经网络的训练过程,在某些情况下还能通过温和的正则化效应提升模型性能。更多细节详见 [13]。
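
The standardization step described above can be illustrated on a batch of scalar activations. This is a minimal sketch, not the layer used in the paper's implementation: it omits the learnable scale and shift parameters and the running statistics used at inference time, and the small constant `eps` is an assumption added for numerical stability.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Standardize a batch of scalar activations.

    Subtracts the batch mean and divides by the batch standard deviation;
    eps guards against division by zero for constant batches.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased (population) variance
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


normalized = batch_normalize([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

After this step the batch has approximately zero mean and unit variance regardless of the scale of the incoming activations.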


Figure 1. BCDU-Net with bi-directional ConvLSTM in the skip connections and densely connected convolution.

图 1: 在跳跃连接和密集连接卷积中使用双向 ConvLSTM 的 BCDU-Net。


Figure 2. Dense layer of the BCDU-Net.

图 2: BCDU-Net 的密集层。

3.2.2 Bi-Directional ConvLSTM:

3.2.2 双向ConvLSTM:

The output of the BN step $(\widehat{\mathcal{X}}_{d}^{up}\in\mathbb{R}^{F_{l}\times W_{l}\times H_{l}})$ is now fed to a BConvLSTM layer. The main disadvantage of the standard LSTM is that it does not take spatial correlation into account, since it uses full connections in the input-to-state and state-to-state transitions. To solve this problem, ConvLSTM [26] was proposed, which introduces convolution operations into the input-to-state and state-to-state transitions. It consists of an input gate $i_{t}$, an output gate $o_{t}$, a forget gate $f_{t}$, and a memory cell $\mathcal{C}_{t}$. The input, output, and forget gates act as controlling gates to access, update, and clear the memory cell. ConvLSTM can be formulated as follows (for convenience we remove the subscript and superscript from the parameters):

BN步骤的输出$(\widehat{\mathcal{X}}_{d}^{up}\in\mathbb{R}^{F_{l}\times W_{l}\times H_{l}})$现在被送入BConvLSTM层。标准LSTM的主要缺点在于这些网络未考虑空间相关性,因为这类模型在输入-状态和状态-状态转换中使用全连接。为解决该问题,ConvLSTM[26]被提出,它将卷积运算引入输入-状态和状态-状态转换中。该结构包含输入门$i_{t}$、输出门$o_{t}$、遗忘门$f_{t}$和记忆单元$\mathcal{C}_{t}$。输入门、输出门和遗忘门作为控制门来访问、更新和清除记忆单元。ConvLSTM可表述如下(为方便起见,我们省略了参数的下标和上标):

$$
\begin{array}{rl}
i_{t} &= \sigma\left(\mathbf{W}_{xi}\ast\mathcal{X}_{t}+\mathbf{W}_{hi}\ast\mathcal{H}_{t-1}+\mathbf{W}_{ci}\ast\mathcal{C}_{t-1}+b_{i}\right)\\
f_{t} &= \sigma\left(\mathbf{W}_{xf}\ast\mathcal{X}_{t}+\mathbf{W}_{hf}\ast\mathcal{H}_{t-1}+\mathbf{W}_{cf}\ast\mathcal{C}_{t-1}+b_{f}\right)\\
\mathcal{C}_{t} &= f_{t}\circ\mathcal{C}_{t-1}+i_{t}\circ\tanh\left(\mathbf{W}_{xc}\ast\mathcal{X}_{t}+\mathbf{W}_{hc}\ast\mathcal{H}_{t-1}+b_{c}\right)\\
o_{t} &= \sigma\left(\mathbf{W}_{xo}\ast\mathcal{X}_{t}+\mathbf{W}_{ho}\ast\mathcal{H}_{t-1}+\mathbf{W}_{co}\circ\mathcal{C}_{t}+b_{o}\right)\\
\mathcal{H}_{t} &= o_{t}\circ\tanh\left(\mathcal{C}_{t}\right),
\end{array}
$$



Figure 3. Bi-directional ConvLSTM in BCDU-Net.

图 3: BCDU-Net中的双向ConvLSTM。

where $\ast$ and $\circ$ denote the convolution and Hadamard functions, respectively. $\mathcal{X}_{t}$ is the input tensor (in our case $\mathcal{X}_{e}$ and $\widehat{\mathcal{X}}_{d}^{up}$), $\mathcal{H}_{t}$ is the hidden state tensor, $\mathcal{C}_{t}$ is the memory cell tensor, $\mathbf{W}_{x\ast}$ and $\mathbf{W}_{h\ast}$ are 2D convolution kernels corresponding to the input and hidden state, respectively, and $b_{i}$, $b_{f}$, $b_{o}$, and $b_{c}$ are the bias terms.

其中 $\ast$ 和 $\circ$ 分别表示卷积运算和Hadamard积。$\mathcal{X}_{t}$ 是输入张量 (本文中对应 $\mathcal{X}_{e}$ 和 $\widehat{\mathcal{X}}_{d}^{up}$ ),$\mathcal{H}_{t}$ 是隐藏状态张量,$\mathcal{C}_{t}$ 是记忆单元张量,$\mathbf{W}_{x\ast}$ 和 $\mathbf{W}_{h\ast}$ 分别是输入和隐藏状态对应的二维卷积核,$b_{i},b_{f},b_{o}$ 和 $b_{c}$ 为偏置项。
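
One ConvLSTM update can be sketched in NumPy to show how the gates interact. This is a simplified illustration under several assumptions that are not part of the text: a single feature channel, $3\times3$ kernels, zero padding, random weights, and a separate output-gate bias `b_o` (matching the list of bias terms above). The helper names are hypothetical, and the code is not the Keras layer used in the implementation.

```python
import numpy as np


def conv2d_same(x, kernel):
    """Zero-padded 2D cross-correlation that preserves the shape of x.

    Assumes odd kernel sizes.
    """
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for r in range(x.shape[0]):
        for c in range(x.shape[1]):
            out[r, c] = np.sum(padded[r:r + kh, c:c + kw] * kernel)
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def convlstm_step(x_t, h_prev, c_prev, p):
    """One ConvLSTM update for a single-channel feature map."""
    i_t = sigmoid(conv2d_same(x_t, p["W_xi"]) + conv2d_same(h_prev, p["W_hi"])
                  + conv2d_same(c_prev, p["W_ci"]) + p["b_i"])
    f_t = sigmoid(conv2d_same(x_t, p["W_xf"]) + conv2d_same(h_prev, p["W_hf"])
                  + conv2d_same(c_prev, p["W_cf"]) + p["b_f"])
    # New memory: keep part of the old cell, add gated candidate content.
    c_t = f_t * c_prev + i_t * np.tanh(
        conv2d_same(x_t, p["W_xc"]) + conv2d_same(h_prev, p["W_hc"]) + p["b_c"])
    # W_co acts element-wise (Hadamard product) on the new cell state.
    o_t = sigmoid(conv2d_same(x_t, p["W_xo"]) + conv2d_same(h_prev, p["W_ho"])
                  + p["W_co"] * c_t + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


rng = np.random.default_rng(0)
size = (8, 8)  # illustrative feature-map size
conv_names = ["W_xi", "W_hi", "W_ci", "W_xf", "W_hf", "W_cf",
              "W_xc", "W_hc", "W_xo", "W_ho"]
params = {name: rng.normal(scale=0.1, size=(3, 3)) for name in conv_names}
params["W_co"] = rng.normal(scale=0.1, size=size)
params.update({b: 0.0 for b in ["b_i", "b_f", "b_c", "b_o"]})

x = rng.normal(size=size)
h, c = convlstm_step(x, np.zeros(size), np.zeros(size), params)
print(h.shape, c.shape)
```

Because every transition is a convolution, the hidden state and memory cell keep the spatial layout of the input feature map, which is the property that a fully connected LSTM lacks.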

In this network, we employ BConvLSTM [23] to encode $\mathcal{X}_{e}$ and $\widehat{\mathcal{X}}_{d}^{up}$. BConvLSTM uses two ConvLSTMs to process the input data in two directions, along forward and backward paths, and then makes a decision for the current input by dealing with the data dependencies in both directions. In a standard ConvLSTM, only the dependencies of the forward direction are processed. However, all the information in a sequence should be fully considered; therefore, it might be effective to take backward dependencies into account. It has been shown that analyzing both forward and backward temporal perspectives enhances predictive performance [9]. Each of the forward and backward ConvLSTMs can be considered as a standard one. Therefore, we have two sets of parameters, for the backward and forward states. The output of the BConvLSTM is calculated as

在该网络中,我们采用 BConvLSTM [23] 对 $\mathcal{X}_{e}$ 和 $\widehat{\mathcal{X}}_{d}^{up}$ 进行编码。BConvLSTM 使用两个 ConvLSTM 分别沿前向和后向路径处理输入数据,并通过双向数据依赖关系为当前输入做出决策。标准 ConvLSTM 仅处理前向依赖关系,但序列中的所有信息都应被充分考虑,因此引入后向依赖可能更有效。已有研究证明,同时分析前向与后向时序视角能提升预测性能 [9]。前向和后向 ConvLSTM 均可视为标准结构,因此我们为双向状态保留两组参数。BConvLSTM 的输出计算公式为

$$
\mathbf{Y}_{t}=\tanh\left(\mathbf{W}_{y}^{\overrightarrow{\mathcal{H}}}\ast\overrightarrow{\mathcal{H}}_{t}+\mathbf{W}_{y}^{\overleftarrow{\mathcal{H}}}\ast\overleftarrow{\mathcal{H}}_{t}+b\right)
$$


where $\overrightarrow{\mathcal{H}}_{t}$ and $\overleftarrow{\mathcal{H}}_{t}$ denote the hidden state tensors for the forward and backward states, respectively, $b$ is the bias term, and $\mathbf{Y}_{t}\in\mathbb{R}^{F_{l}\times W_{l}\times H_{l}}$ indicates the final output considering bidirectional spatio-temporal information. Moreover, tanh is the hyperbolic tangent, which is utilized here to combine the outputs of the forward and backward states in a non-linear way. We utilize the same energy function as the original U-Net to train the network.

其中 $\overrightarrow{\mathcal{H}}_{t}$ 和 $\overleftarrow{\mathcal{H}}_{t}$ 分别表示前向和后向状态的隐藏状态张量,$b$ 是偏置项,$\mathbf{Y}_{t}\in\mathbb{R}^{F_{l}\times W_{l}\times H_{l}}$ 表示考虑双向时空信息的最终输出。此外,tanh 是双曲正切函数,这里用于通过非线性方式结合前向和后向状态的输出。我们采用与原始 U-Net 类似的能量函数来训练网络。
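
The way the two hidden states are merged can be sketched with NumPy. To keep the example short it assumes $1\times1$ kernels, so each convolution reduces to a channel-mixing matrix; the kernel size, the tensor sizes, and the function name `bconvlstm_output` are illustrative assumptions, not details given in the text.

```python
import numpy as np


def bconvlstm_output(h_forward, h_backward, w_forward, w_backward, bias):
    """Combine forward and backward hidden states: Y = tanh(Wf*Hf + Wb*Hb + b).

    h_forward, h_backward: arrays of shape (F, W, H).
    w_forward, w_backward: (F, F) matrices, i.e. 1x1 convolution kernels.
    bias: scalar or array broadcastable to (F, W, H).
    """
    mixed_f = np.einsum("oc,cwh->owh", w_forward, h_forward)
    mixed_b = np.einsum("oc,cwh->owh", w_backward, h_backward)
    return np.tanh(mixed_f + mixed_b + bias)


rng = np.random.default_rng(0)
F, W, H = 4, 8, 8  # illustrative tensor sizes
h_fwd = rng.normal(size=(F, W, H))
h_bwd = rng.normal(size=(F, W, H))

y = bconvlstm_output(h_fwd, h_bwd,
                     rng.normal(size=(F, F)), rng.normal(size=(F, F)), 0.0)
print(y.shape)
```

The output has the same shape as either hidden state, so it can replace the plain concatenation of the standard U-Net skip connection without changing the layers that follow.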

4. Experimental Results

4. 实验结果

We evaluate BCDU-Net on three public benchmark datasets: DRIVE, ISIC 2018, and a lung segmentation dataset. DRIVE is a dataset for blood vessel segmentation from retina images, ISIC is for skin cancer lesion segmentation, and the last dataset consists of diagnostic and lung cancer screening thoracic computed tomography (CT) scans with marked-up annotated lesions. The empirical results show that the proposed method outperforms state-of-the-art alternatives. Keras with a TensorFlow backend is used for this implementation. The network is trained from scratch for all datasets. We consider several performance metrics for the experimental comparison, including accuracy (AC), sensitivity (SE), specificity (SP), F1-Score, Jaccard similarity (JS), and area under the curve (AUC). We stop training the network when the validation loss remains the same for 10 consecutive epochs, which occurs at 50, 100, and 25 epochs for the DRIVE, ISIC, and Lung datasets, respectively.

我们在DRIVE、ISIC 2018和肺部分割公开基准数据集上评估BCDU-Net。DRIVE是用于视网膜图像血管分割的数据集,ISIC用于皮肤癌病灶分割,最后一个数据集包含带有标注病灶的诊断和肺癌筛查胸部计算机断层扫描(CT)影像。实验结果表明,所提方法优于现有最优方案。本实现采用基于TensorFlow后端的Keras框架。所有数据集均从头开始训练网络。我们采用多组性能指标进行实验对比,包括准确率(AC)、灵敏度(SE)、特异度(SP)、F1分数、Jaccard相似度(JS)和曲线下面积(AUC)。当验证损失连续10个epoch保持不变时停止训练,DRIVE、ISIC和肺部数据集对应的停止epoch数分别为50、100和25。
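
Most of the listed metrics follow directly from pixel-wise confusion counts. The sketch below uses the standard definitions; the counts in the example are invented for illustration and are not results from the paper. AUC is left out because it needs the predicted probabilities rather than thresholded counts.

```python
def segmentation_metrics(tp, fp, tn, fn):
    """Compute AC, SE, SP, F1 and JS from pixel-wise confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)          # recall on foreground pixels
    specificity = tn / (tn + fp)          # recall on background pixels
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    jaccard = tp / (tp + fp + fn)         # intersection over union
    return {
        "AC": accuracy,
        "SE": sensitivity,
        "SP": specificity,
        "F1": f1,
        "JS": jaccard,
    }


# Made-up counts for illustration only.
metrics = segmentation_metrics(tp=80, fp=10, tn=100, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With a small foreground class, as in vessel segmentation, accuracy and specificity are dominated by background pixels, which is why F1 and Jaccard similarity are reported alongside them.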

4.1. DRIVE Dataset

4.1. DRIVE数据集

DRIVE [24] is a dataset for blood vessel segmentation from retina images. It includes 40 color retina images, of which 20 samples are used for training and the remaining 20 samples for testing. The original size of the images is $565\times584$ pixels. Clearly, a dataset with this number of samples is not sufficient for training a deep neural network. Therefore, we use the same strategy as [2] for training our network. The input images are first randomly divided into a number of patches. In total, around 190,000 patches are produced from the 20 training images, of which 171,000 patches are used for training, and the remaining

DRIVE [24] 是一个用于视网膜图像血管分割的数据集。它包含40张彩色视网膜图像,其中20个样本用于训练,其余20个样本用于测试。图像原始尺寸为$565\times584$像素。显然,这种样本数量的数据集不足以训练深度神经网络。因此,我们采用与[2]相同的策略来训练网络。输入图像首先被随机划分为若干图像块。从20张训练图像中共生成约190,000个图像块,其中171,000个用于训练,剩余部分...
