
# Deep Learning for Pelvic Bone Segmentation: A Large-Scale CT Dataset and Baseline Models

## Abstract

Purpose: Pelvic bone segmentation in CT has always been an essential step in clinical diagnosis and surgery planning of pelvic bone diseases. Existing methods for pelvic bone segmentation are either hand-crafted or semi-automatic and achieve limited accuracy when dealing with image appearance variations due to the multi-site domain shift, the presence of contrasted vessels, coprolith and chyme, bone fractures, low dose, metal artifacts, etc. Due to the lack of a large-scale pelvic CT dataset with annotations, deep learning methods are not fully explored.

Methods: In this paper, we aim to bridge the data gap by curating a large pelvic CT dataset pooled from multiple sources, including 1,184 CT volumes with a variety of appearance variations. Then we propose for the first time, to the best of our knowledge, to learn a deep multi-class network for segmenting lumbar spine, sacrum, left hip, and right hip, from multiple-domain images simultaneously to obtain more effective and robust feature representations. Finally, we introduce a post-processor based on the signed distance function (SDF).

Results: Extensive experiments on our dataset demonstrate the effectiveness of our automatic method, achieving an average Dice of 0.987 on metal-free volumes. The SDF post-processor yields a 15.1% decrease in Hausdorff distance compared with the traditional post-processor.

Conclusion: We believe this large-scale dataset will promote the development of the whole community. We open source the images, annotations, code, and trained baseline models at https://github.com/ICT-MIRACLE-lab/CTPelvic1K.

Keywords: CT dataset · Pelvic segmentation · Deep learning · SDF post-processing


## Introduction

The pelvis is an important structure connecting the spine and lower limbs and plays a vital role in maintaining the stability of the body and protecting the internal organs of the abdomen. Abnormalities of the pelvis, such as hip dysplasia and pelvic fractures, can have a serious impact on physical health. For example, pelvic fractures are among the most severe and life-threatening bone injuries; they can injure other organs at the fracture site, and the mortality rate can reach 45% in the most severe cases, i.e., open pelvic fractures. Medical imaging plays an important role throughout the diagnosis and treatment of patients with pelvic injuries. Compared with X-ray images, CT preserves the actual anatomic structure, including depth information, providing surgeons with more details about the damaged site, so it is often used for 3D reconstruction, follow-up, and evaluation of postoperative effects. In these applications, accurate pelvic bone segmentation is crucial for assessing the severity of pelvic injuries and helping surgeons make correct judgments and choose appropriate surgical approaches. In the past, surgeons segmented the pelvis manually from CT using software such as Mimics (https://en.wikipedia.org/wiki/Mimics), which is time-consuming and non-reproducible. To address these clinical needs, we present an automatic algorithm that can accurately and quickly segment pelvic bones from CT.

Existing methods for pelvic bone segmentation from CT mostly use simple thresholding, region growing, and handcrafted models, including deformable models, statistical shape models, watershed, and others. These methods focus on local gray-level information and achieve limited accuracy due to the density differences between cortical and trabecular bone. Moreover, trabecular bone is similar to the surrounding tissues in texture and intensity, and bone fractures, if present, further lead to weak edges. Recently, deep learning-based methods have achieved great success in image segmentation; however, their effectiveness for CT pelvic bone segmentation is not fully known.


Although some datasets related to the pelvic bone exist, only a few are open-sourced, and those are small (fewer than 5 volumes or 200 slices), far less than datasets for other organs. One prior study applied deep learning, but the result was limited (Dice = 0.92) with a dataset of only 200 CT slices. For a robust deep learning method, it is essential to have a comprehensive dataset that covers as many real scenarios as possible. In this paper, we bridge this gap by curating a large-scale CT dataset and exploring the use of deep learning for this task, which, to the best of our knowledge, marks the first attempt in this area with statistical significance and reference value. To build a comprehensive dataset, we have to deal with diverse image appearance variations due to differences in imaging resolution and field-of-view (FOV), domain shift arising from different sites, the presence of contrasted vessels, coprolith and chyme, bone fractures, low dose, metal artifacts, etc. Fig. fig-pic-3 gives some examples of these conditions. Among the above-mentioned appearance variations, metal artifacts are the most difficult to handle. Further, we target a multi-class segmentation problem that separates the pelvis into multiple bones, namely the lumbar spine, sacrum, left hip, and right hip, instead of simply segmenting out the whole pelvis from CT. The contributions of this paper are summarized as follows:

- A pelvic CT dataset pooled from multiple domains and different manufacturers, including 1,184 CT volumes (over 320K CT slices) of diverse appearance variations (including 75 CTs with metal artifacts). Their multi-bone labels are carefully annotated by experts, and we open source the dataset to benefit the whole community;
- A deep multi-class segmentation network that learns more effective representations for joint lumbar spine, sacrum, left hip, and right hip segmentation from multi-domain labeled images, thereby yielding the desired accuracy and robustness;
- A fully automatic analysis pipeline that achieves high accuracy, efficiency, and robustness, thereby enabling its potential use in clinical practice.

## Our Dataset

| Dataset | DICOM | # | Mean spacing (mm) | Mean size | Tr/Val/Ts | Source, Year |
|---|:---:|---:|---|---|---|---|
| ABDOMEN | | 35 | (0.76, 0.76, 3.80) | (512, 512, 73) | 21/7/7 | Public, 2015 |
| COLONOG | ✓ | 731 | (0.75, 0.75, 0.81) | (512, 512, 323) | 440/146/145 | Public, 2008 |
| MSD_T10 | | 155 | (0.77, 0.77, 4.55) | (512, 512, 63) | 93/31/31 | Public, 2019 |
| KITS19 | | 44 | (0.82, 0.82, 1.25) | (512, 512, 240) | 26/9/9 | Public, 2019 |
| CERVIX | | 41 | (1.02, 1.02, 2.50) | (512, 512, 102) | 24/8/9 | Public, 2015 |
| CLINIC | ✓ | 103 | (0.85, 0.85, 0.80) | (512, 512, 345) | 61/21/21 | Collected, 2020 |
| CLINIC-metal | ✓ | 75 | (0.83, 0.83, 0.80) | (512, 512, 334) | 0(61)/0/14 | Collected, 2020 |
| Our dataset (total) | | 1,184 | (0.78, 0.78, 1.46) | (512, 512, 273) | 665(61)/222/236 | — |

### Data Collection

To build a comprehensive pelvic CT dataset that replicates practical appearance variations, we curate a large dataset of pelvic CT images from seven sources, two collected in the clinic and five from existing CT datasets. An overview of the dataset is shown in Table tab0. These seven sub-datasets are curated separately from different sites and sources, with different characteristics often encountered in the clinic. From these sources, we exclude cases of very low quality or without the pelvic region and remove unrelated areas outside the pelvis. Among them, the raw data of COLONOG, CLINIC, and CLINIC-metal are stored in DICOM format, so additional information such as scanner manufacturer can be accessed. More details about our dataset are given in Online Resource 1 (https://drive.google.com/file/d/115kLXfdSHS9eWxQmxhMmZJBRiSfI8_4_/view). We reformat all DICOM images to NIfTI to simplify data processing and to de-identify images, meeting the institutional review board (IRB) policies of the contributing sites. All existing sub-datasets are under a Creative Commons license of at least CC-BY-NC-SA, and we keep those licenses unchanged. We open-source the CLINIC and CLINIC-metal sub-datasets under the Creative Commons license CC-BY-NC-SA 4.0.

### Data Annotation

Considering the scale of thousands of cases in our dataset, and that annotation itself is a subjective and time-consuming task, we introduce a strategy of Annotation by Iterative Deep Learning (AID) to speed up the annotation process. In the AID workflow, we first train a deep network on a small number of precisely annotated cases. The network is then used to automatically annotate more data, followed by revision from human experts. The human-corrected annotations and their corresponding images are added to the training set to retrain a more powerful network. These steps are repeated until the annotation task is finished; in the final rounds, only minimal modification by human experts is needed. The annotation pipeline is shown in Fig. fig-pic-10. In Step 1, two senior experts pixel-wise annotate 40 cases of the CLINIC sub-dataset precisely, starting from the results of simple thresholding, using the ITK-SNAP (Philadelphia, PA) software. All annotations are performed in the transverse plane; the sagittal and coronal planes are used to assist judgment in the transverse plane. We start from the CLINIC sub-dataset because the cancellous bone and surrounding tissues exhibit similar appearances at the fracture site, which requires more prior-knowledge guidance from doctors. In Step 2, we train a deep network with the updated database and make predictions on 100 new, randomly selected cases at a time. In Step 3, junior annotators refine the labels based on the predictions, each being responsible for part of the 100 new cases, and a coordinator checks the quality of all refinements. For easy cases, the annotation process ends at this stage; for hard cases, senior experts are invited to make more precise annotations. Steps 2 and 3 are repeated until all images in our dataset are annotated.
Finally, we conduct another round of scrutiny for outliers and mistakes and make the necessary corrections to ensure the final quality of our dataset. In Fig. fig-pic-10, the 'junior annotators' are graduate students in the field of medical image analysis, the 'coordinator' is a medical image analysis practitioner with many years of experience, and the 'senior experts' are cooperating doctors at the partner hospital, one of the best orthopedic hospitals in our country. In total, we have annotations for 1,109 metal-free CTs and 14 metal-affected CTs. The remaining 61 metal-affected CTs are left unannotated and are planned for use in unsupervised learning.
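The AID loop described above can be sketched as follows. This is a toy illustration, not our actual tooling: `train`, `predict`, and `refine` are hypothetical stand-ins for network retraining, automatic pre-annotation, and expert revision, and the string labels merely simulate annotation files.

```python
def aid_annotation(all_cases, seed_size=40, batch_size=100):
    """Annotation by Iterative Deep Learning (AID): grow the labeled set
    in rounds, letting the current model pre-annotate each new batch."""
    # Hypothetical stand-ins for the real steps in the pipeline.
    train = lambda labeled: set(labeled)                           # retrain on all labeled cases
    predict = lambda model, batch: {c: f"pred_{c}" for c in batch}  # auto pre-annotation
    refine = lambda preds: {c: p.replace("pred", "label") for c, p in preds.items()}  # expert revision

    pool = list(all_cases)
    labeled = {c: f"label_{c}" for c in pool[:seed_size]}  # Step 1: expert-annotated seed set
    pool = pool[seed_size:]
    rounds = 0
    while pool:
        model = train(labeled)                              # Step 2: retrain
        batch, pool = pool[:batch_size], pool[batch_size:]  # pick 100 new cases
        preds = predict(model, batch)                       # Step 2: pre-annotate
        labeled.update(refine(preds))                       # Step 3: human-corrected labels
        rounds += 1
    return labeled, rounds
```

With a 40-case seed and batches of 100, annotating 340 cases takes three refinement rounds; the point of the loop is that human effort per round shrinks as the model improves.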

## Segmentation Methodology

The overall pipeline of our deep learning approach for segmenting pelvic bones is illustrated in Fig. fig-pic-5. The input is a 3D CT volume. (i) First, the input is sent to our segmentation module, a plug-and-play (PnP) module that can be replaced at will. (ii) After segmentation, the multi-class 3D prediction is sent to an SDF post-processor, which removes false predictions and outputs the final multi-bone segmentation result.


### Segmentation Module

Based on our large-scale, multi-source dataset and its annotations, we use fully supervised training to learn an effective representation of the pelvic bones. The deep learning framework we choose is the 3D U-Net cascade version of nnU-Net, a robust state-of-the-art deep learning method for medical image segmentation. The 3D U-Net cascade contains two 3D U-Nets: the first is trained on downsampled images (stage 1 in Fig. fig-pic-5), and the second on full-resolution images (stage 2 in Fig. fig-pic-5). A 3D network can better exploit the 3D spatial information in CT volumes. Training on downsampled images first enlarges the patch size relative to the image, enabling the network to learn more contextual information; training on full-resolution images then refines the segmentation predicted by the first U-Net.
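The two-stage idea can be sketched at inference time as follows. This is a hedged sketch, not nnU-Net's actual implementation (nnU-Net configures downsampling factors, patch sizes, and resampling automatically); `coarse_net` and `fine_net` are hypothetical stand-ins for the two trained 3D U-Nets, and nearest-neighbor up/downsampling replaces proper resampling.

```python
import numpy as np

def downsample(vol, f=2):
    """Crude nearest-neighbor downsampling by striding (illustrative only)."""
    return vol[::f, ::f, ::f]

def upsample(vol, f=2):
    """Crude nearest-neighbor upsampling by repetition (illustrative only)."""
    return vol.repeat(f, axis=0).repeat(f, axis=1).repeat(f, axis=2)

def cascade_predict(volume, coarse_net, fine_net, f=2):
    """Stage 1: segment a downsampled copy to capture global context.
    Stage 2: refine at full resolution, conditioned on the coarse mask."""
    coarse = coarse_net(downsample(volume, f))   # low-resolution prediction
    prior = upsample(coarse, f)                  # bring it back to full resolution
    x = np.stack([volume, prior])                # image + coarse prior as channels
    return fine_net(x)                           # full-resolution refinement
```

The design point is that stage 1 sees a patch covering a much larger anatomical extent, while stage 2 recovers the fine bone boundaries that downsampling destroys.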

### SDF Post Processor

Post-processing makes the system more stable for clinical use by preventing mispredictions in complex scenes. Current segmentation systems usually decide whether to remove outliers according to the size of the connected region; however, in pelvic fracture scenes, broken bone fragments may also be removed as outliers. To this end, we introduce SDF filtering as our post-processing module, adding a distance-based criterion alongside the size-based one. We calculate the SDF based on the maximum connected region (MCR) of each anatomical structure in the prediction, obtaining a 3D distance map that increases from the bone border to the image boundary, and use it to decide whether an 'outlier' prediction, as defined by the traditional MCR-based method, should actually be removed.
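A minimal sketch of this idea, under the assumption that a component is kept if its minimum distance to the MCR falls below a threshold `tau` (a hypothetical parameter; our paper's exact criterion and distance-map computation may differ). For clarity it labels components by BFS and computes distances by brute force; a real implementation would use a distance transform such as `scipy.ndimage.distance_transform_edt`.

```python
import numpy as np
from collections import deque

def _components(mask):
    """6-connected components of a binary 3-D mask (BFS labeling)."""
    labels = np.zeros(mask.shape, dtype=int)
    cur = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue
        cur += 1
        labels[seed] = cur
        q = deque([seed])
        while q:
            z, y, x = q.popleft()
            for dz, dy, dx in ((1,0,0),(-1,0,0),(0,1,0),(0,-1,0),(0,0,1),(0,0,-1)):
                n = (z + dz, y + dy, x + dx)
                if all(0 <= n[i] < mask.shape[i] for i in range(3)) \
                        and mask[n] and not labels[n]:
                    labels[n] = cur
                    q.append(n)
    return labels, cur

def sdf_filter(mask, tau=2.0):
    """Keep the maximum connected region (MCR) plus any component whose
    minimum distance to the MCR is below tau, so nearby bone fragments
    survive while far-away mispredictions are removed."""
    labels, n = _components(mask)
    if n == 0:
        return mask
    sizes = [(labels == i).sum() for i in range(1, n + 1)]
    mcr = 1 + int(np.argmax(sizes))
    mcr_pts = np.argwhere(labels == mcr)
    keep = mask & (labels == mcr)
    for i in range(1, n + 1):
        if i == mcr:
            continue
        pts = np.argwhere(labels == i)
        # Brute-force min distance from this component to the MCR surface.
        d = np.sqrt(((pts[:, None, :] - mcr_pts[None, :, :]) ** 2).sum(-1)).min()
        if d < tau:
            keep |= labels == i
    return keep
```

In contrast, the traditional MCR post-processor corresponds to `keep = mask & (labels == mcr)` alone, which is exactly what discards displaced fracture fragments.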

## Experiments

### Implementation Details

We implement our method based on the open-source code of nnU-Net (https://github.com/mic-dkfz/nnunet). We also used MONAI (https://monai.io/) during algorithm development; for details, please refer to Online Resource 1. For our metal-free dataset, we randomly select 3/5, 1/5, and 1/5 of the cases in each sub-dataset as the training, validation, and testing sets, respectively, and keep this partition unchanged in both the all-dataset and sub-dataset experiments.
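The per-sub-dataset 3/5 : 1/5 : 1/5 split can be made reproducible as follows. This is an illustrative sketch (the function name, seed, and rounding choices are assumptions, not our released partition): shuffling with a fixed seed inside each sub-dataset keeps the partition stable across all experiments.

```python
import random

def partition(sub_datasets, seed=0):
    """Fixed 3/5 : 1/5 : 1/5 train/val/test split, drawn independently
    within each sub-dataset so every experiment sees the same partition."""
    rng = random.Random(seed)
    split = {"train": [], "val": [], "test": []}
    for name, cases in sub_datasets.items():
        cases = sorted(cases)       # deterministic base order
        rng.shuffle(cases)          # seeded shuffle -> reproducible
        n = len(cases)
        n_tr, n_val = round(n * 3 / 5), round(n / 5)
        split["train"] += cases[:n_tr]
        split["val"] += cases[n_tr:n_tr + n_val]
        split["test"] += cases[n_tr + n_val:]
    return split
```

With this rounding, a 44-case sub-dataset such as KITS19 splits into 26/9/9, matching Table tab0.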

### Results and Discussion

#### Segmentation Module

To show that learning from our large-scale pelvic CT dataset improves the robustness of our segmentation system, we conduct a series of experiments. First, we test the performance of models of different dimensions on our entire dataset; Exp (a) in Table tab_basic shows the quantitative results. $\Phi_{ALL}$ denotes a deep network model trained on the 'ALL' dataset. Following the conventions in most of the literature, we use the Dice coefficient (DC) and Hausdorff distance (HD) as the metrics for quantitative evaluation. All results are computed on our testing set. As discussed in the Segmentation Module section, $\Phi_{ALL(3D\_cascade)}$ shows the best performance, achieving an average DC of 0.987 and an HD of 5.50 voxels, which means the 3D U-Net cascade learns the semantic features of pelvic anatomy better than the 2D/3D U-Net. As the following experiments are all trained with the 3D U-Net cascade, the subscript $(3D\_cascade)$ of $\Phi_{ALL(3D\_cascade)}$ is omitted for notational clarity.
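For reference, the two evaluation metrics can be computed as below. This is a self-contained sketch: the brute-force Hausdorff distance is only practical for small masks (evaluation toolkits such as MONAI use distance-transform-based implementations), and HD is reported here in voxel units, as in our results.

```python
import numpy as np

def dice(pred, gt):
    """Dice coefficient: 2|P∩G| / (|P| + |G|), on boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def hausdorff(pred, gt):
    """Symmetric Hausdorff distance in voxels: the larger of the two
    directed max-of-min surface distances (brute force, small masks only)."""
    p, g = np.argwhere(pred), np.argwhere(gt)
    d = np.sqrt(((p[:, None, :] - g[None, :, :]) ** 2).sum(-1))  # pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

Note the complementary roles: DC measures volumetric overlap, while HD is sensitive to stray outlier predictions far from the bone, which is exactly what the SDF post-processor targets.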

Secondly, we train six deep networks, one per single sub-dataset ($\Phi_{ABDOMEN}$, etc.), and test each of them on every sub-dataset. Quantitative and qualitative results are shown in Table tab_subdataset and Fig. fig-pic-6, respectively. We also evaluate the performance of $\Phi_{ALL}$ on each sub-dataset.


For a fair comparison, cross-testing of the sub-dataset networks is also conducted on each sub-dataset's testing set. We observe that the evaluation metrics of model $\Phi_{ALL}$ are generally better than those of the models trained on a single sub-dataset. Models trained on a single sub-dataset rarely perform consistently well in other domains, except $\Phi_{COLONOG}$, which is trained on the largest amount of data, originally drawn from various sources. This observation implies that the domain gap problem does exist and that collecting data directly from multiple sources is an effective solution. More intuitively, we show the 'Average' values as a heat map in Fig. table2fig. Furthermore, we run leave-one-sub-dataset-out cross-validation on the six metal-free sub-datasets to verify the generalization ability of this solution; the models are denoted $\Phi_{ex\,ABDOMEN}$, etc. The results of $\Phi_{ex\,COLONOG}$ show that training with multi-source data achieves good results on data never seen before: while the models trained separately on the other five sub-datasets cannot achieve good results on COLONOG, aggregating those five sub-datasets yields a result comparable to $\Phi_{ALL}$ using only one third of the data. More data from multiple sources can be seen as additional constraints on model learning, prompting the network to learn better feature representations of the pelvic bones and the background. These observations can also be seen intuitively in the qualitative results of Fig. fig-pic-6. For more experimental results and discussion, e.g., 'Generalization across manufacturers' and 'Limitations of the dataset', please refer to Online Resource 1.

#### SDF post-processor

Exp (b) in Table tab_basic shows the effect of the post-processing module. The SDF post-processor yields decreases of 80.7% and 15.1% in HD compared with no post-processor and the MCR post-processor, respectively; for details, please refer to Online Resource 1. The visual effects on two cases are displayed in Fig. fig-pic-8: large fragments near the anatomical structure are kept by SDF post-processing but removed by the MCR method.


## Conclusion

To benefit the pelvic surgery and diagnosis community, we curate and open source a large-scale pelvic CT dataset pooled from multiple domains, including 1,184 CT volumes (over 320K CT slices) of various appearance variations, and present a deep learning-based pelvic segmentation system, which, to the best of our knowledge, marks the first attempt in the literature. We train a multi-class network for segmentation of the lumbar spine, sacrum, left hip, and right hip from multi-domain images to obtain more effective and robust features, and SDF filtering further improves the robustness of the system. This system lays a solid foundation for our future work. We plan to test the significance of our system in real clinical practice and to explore more options based on our dataset, e.g., devising a module for metal-affected CTs and a domain-independent pelvic bone segmentation algorithm.

Acknowledgments: This research was supported in part by the Youth Innovation Promotion Association CAS (grant 2018135). The authors have no relevant financial or non-financial interests to disclose. We have obtained approval from the Ethics Committee of the clinical hospital.