Automatic Lumbar Spinal CT Image Segmentation with a Dual Densely Connected U-Net

Degenerative and developmental lumbar spinal stenosis (LSS) require different clinical treatments. Computed tomography (CT) is helpful in distinguishing degenerative from developmental LSS because of its advantage in imaging osseous and calcified tissues. However, the boundaries of the vertebral body, spinal canal and dural sac have low contrast and are hard to identify in a CT image, so the diagnosis depends heavily on the knowledge of expert surgeons and radiologists. In this paper, we develop an automatic lumbar spinal CT image segmentation method to assist LSS diagnosis. The main contributions of this paper are the following: 1) a new lumbar spinal CT image dataset is constructed that contains 2393 axial CT images collected from 279 patients, with ground-truth pixel-level segmentation labels; 2) a dual densely connected U-shaped neural network (DDU-Net) is used to segment the spinal canal, dural sac and vertebral body in an end-to-end manner; 3) DDU-Net is capable of segmenting tissues with large scale variation and inconspicuous edges (e.g., spinal canal) and tissues of extremely small size (e.g., dural sac); and 4) DDU-Net is practical, requiring no image preprocessing such as contrast enhancement, registration and denoising, and the running speed reaches 12 FPS. In the experiments, we achieve state-of-the-art performance on the lumbar spinal image segmentation task. We expect that the technique will increase both radiology workflow efficiency and the perceived value of radiology reports for referring clinicians and patients.


I. INTRODUCTION
Lumbar spinal stenosis (LSS) is one of the most common diseases encountered in spinal surgery practice. Diagnosis of LSS is usually made under the guidance of medical imaging techniques such as magnetic resonance imaging (MRI) and computed tomography (CT). Most previous studies prefer MRI because it is safer and does not involve any radiation. However, the pathogenesis of degenerative LSS and developmental LSS differ [1]. Degeneration of the lumbar intervertebral disc, hypertrophy of the articular process and calcification of the ligamentum flavum are the main causes of degenerative LSS; treatment for these patients is usually lumbar decompression. Developmental LSS is usually due to osseous stenosis of the vertebral laminae, and the corresponding treatment is usually laminectomy. Precisely identifying the vertebral body, spinal canal and dural sac is helpful in diagnosing different types of LSS [2]. Surgeons usually use lumbar spinal CT images to distinguish between degenerative and developmental LSS because CT is better at imaging osseous and calcified tissues than MRI is [3]. However, the boundaries of the spinal canal and dural sac in CT images are not obvious; segmentation of these two tissues depends heavily on expert surgeons and radiologists, which brings uncertainty and risk. In this paper, we provide a sufficiently labeled lumbar spinal CT image dataset; the areas of the spinal canal, dural sac and vertebral body are labeled at the pixel level. We hope this new dataset will promote the automatic diagnosis of LSS. We then propose a multi-scale densely connected neural network that can automatically segment the spinal canal, dural sac and vertebral body from a raw CT image. To the best of our knowledge, this is the first deep learning-based method to simultaneously segment the spinal canal, dural sac and vertebral body from CT images.
The associate editor coordinating the review of this manuscript and approving it for publication was Cristian A. Linte.
Recently, deep convolutional neural networks have been applied in medical image analysis because they provide abundant and discriminative image representations. Feng et al. [4] segmented retinal vessels with a cross-connected convolutional network and multi-scale features. Baldeon-Calisto et al. [5] segmented medical images with a multiobjective adaptive convolutional neural network. Li et al. [6] proposed a 3D fully convolutional network to rationally fuse the complementary information in PET/CT for accurate tumor segmentation. Han et al. [7] introduced a lung CT imaging signs dataset and proposed software for annotating abnormal regions. Yu et al. [8] presented a melanoma recognition method that combines deep learning with a local descriptor encoding strategy. Nie et al. [9] used deep convolutional adversarial networks to synthesize more medical images. Abbati et al. [10] proposed an automatic treatment decision-making plan for LSS. Chen et al. [11] used a deep convolutional symmetric neural network to segment brain tumors. In practice, medical image collection and labeling are expensive and time-consuming, whereas training deep neural networks usually requires a massive number of training samples. In this paper, we perform data augmentation of the CT images to overcome this limitation. Moreover, we include several dense blocks [12] in the proposed dual densely connected U-shaped network (DDU-Net) to reduce the number of parameters and increase the computational efficiency. These two measures alleviate the gradient vanishing problem when training a deep neural network with limited data and improve prediction accuracy as well. In the experiments, we found that some dim and small tissues (e.g., dural sac) were difficult to segment from the original CT image, and that the scales of tissues in CT images showed large variance.
To handle these problems, the proposed DDU-Net contains two U-shaped sub-networks with receptive fields of different sizes, which allows DDU-Net to extract multi-scale features and to segment tissues of different sizes automatically and precisely.
In the experimental section, we test our method on the proposed new dataset and compare the performance with three state-of-the-art image segmentation methods, i.e., U-Net [13], FCN [14] and DeepLab [15]. Both visual comparison and quantitative comparison show that our method outperforms these state-of-the-art methods.
In summary, this paper makes the following contributions: 1) A new challenging dataset is collected for further research and evaluation of spinal CT image segmentation; 2) Unlike previous works that produce binary segmentation, this is the first work to segment the spinal canal, dural sac and vertebral body from a spinal CT image simultaneously; we hope that this work will promote automatic diagnosis of lumbar spinal stenosis; 3) The proposed DDU-Net segments spinal CT images in an automatic and end-to-end manner; all parameters are optimized simultaneously. DDU-Net is capable of segmenting tissues with scale variation and inconspicuous edges (e.g., spinal canal) and tissues of extremely small size (e.g., dural sac); 4) The proposed method is practical; it requires no image pre-processing such as image registration, denoising, or contrast enhancement. The proposed DDU-Net has only 54 M parameters, and it outperforms state-of-the-art methods in both visual comparison and quantitative comparison, with the running speed reaching 12 FPS.
The rest of this paper is organized as follows. In Section II, we introduce some previous papers that are related to our work, including medical image analysis and basic deep learning technology. In Section III, the new lumbar spinal CT image dataset is presented. Section IV covers the methodology, where we introduce the data augmentation method and the architecture of DDU-Net and explain the details of the network training. In Section V, qualitative, visual and quantitative comparisons are conducted between the proposed method and state-of-the-art methods. Finally, the conclusion is given in Section VI.

II. RELATED WORKS
Several state-of-the-art methods for spinal image segmentation have been developed over the past ten years. Some methods have used traditional machine learning and image processing technology. For example, [16] and [17] developed an automatic method for spinal cord and spinal canal segmentation for CT images. Their method is based on multi-resolution propagation of tubular deformable models and is coupled with an automatic intervertebral disk identification method.
With the remarkable performance of deep convolutional neural networks (DCNNs) in different domains, such as natural image classification [18], [19] and segmentation [14], [15], biomedical image segmentation has achieved a breakthrough by using a U-shaped fully convolutional network (FCN). U-Net [13] is an end-to-end architecture used to segment different semantic regions of images. Due to skip connections, this method won the ISBI cell tracking challenge 2015 by using only 30 training images, outperforming the second best method by a large margin. Since then, deep convolutional networks have become popular in automatic biomedical image segmentation. Korez et al. proposed an automatic method to segment vertebral bodies from 3D MRI images. Abbati et al. [10] introduced MRI-based surgical planning for lumbar spinal stenosis, developing an automated algorithm to localize the stenosis causing the patient's symptoms from the MR image. Before training the network, the authors manually cropped the original images to obtain the region of interest; they then trained the network with both labeled and unlabeled images, and the results demonstrated promising performance. Korez et al. [20] segmented vertebral bodies from MR images with 3D CNNs. Gros et al. [21] segmented both the spinal cord and intramedullary multiple sclerosis lesions with convolutional neural networks (CNNs).
In contrast with the aforementioned works, in this paper, we introduce a new fully convolutional network to segment the spinal canal, dural sac and vertebral body in parallel. The proposed method is automatic and does not require any image pre-processing, and the performance surpasses that of state-of-the-art methods.
FIGURE 1 (caption fragment). ... Fig. 1(b). The red regions denote the vertebral body, the green regions denote the spinal canal, the white regions inside the green regions denote the dural sac, and the black regions denote the background.

III. A NEW DATASET
As this is the first attempt to simultaneously segment the spinal canal, dural sac and vertebral body from CT images, to promote the study of this problem, we built a new dataset with pixel-level labels. We collected 2393 axial lumbar spinal CT images from 279 patients.
We consider lumbar spinal image segmentation as a pixel-level multi-class classification task, where the input is CT images from different patients of different views and the expected output is a mask with 4 classes, i.e., spinal canal, dural sac, vertebral body and background. Since the spinal canal, dural sac and vertebral body are unique in each CT image, we treat this pixel-level multi-class classification task as a semantic segmentation problem. To ensure label consistency, we asked four radiologists to label four different semantics in all images using a custom-designed interactive segmentation tool. We only kept the images that were given very similar labels by all four radiologists. Finally, the proposed dataset contains 1280 images with precise and consistent pixel-level labels. We randomly divided the dataset into three parts, i.e., 50% for training, 20% for validation and 30% for testing.
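The random 50%/20%/30% division described above can be sketched as follows. This is only a minimal illustration: the function name `split_dataset`, the fixed seed, and the use of integer image identifiers are assumptions, and the sketch splits per image because the text does not say whether the split was made per image or per patient.

```python
import random

def split_dataset(items, fractions=(0.5, 0.2, 0.3), seed=0):
    """Shuffle items and cut them into train/validation/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seed is a hypothetical choice for repeatability
    n_train = round(fractions[0] * len(items))
    n_val = round(fractions[1] * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test

image_ids = list(range(1280))           # 1280 consistently labeled images
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))  # 640 256 384
```

With 1280 retained images this gives 640 training, 256 validation and 384 test images.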
The first column of Fig. 1 shows 2 sample CT images from our dataset, and each image shows a CT scan acquired from an individual patient. The second column is the corresponding ground truth of each raw image. Each ground truth has the same size as the raw image; the red mask of a ground truth indicates the vertebral body region, the green mask indicates the spinal canal region, and the white mask indicates the dural sac region. Each of these masks appears only once in a ground truth. Raw images in our dataset have obvious variations, e.g., in scale, rotation, brightness, and noise. As shown in Fig. 1, the scale of the top-left image, Fig. 1(a), is smaller than that of the bottom-left image, Fig. 1(b); that is, the details of Fig. 1(a) are more abundant, and we can see more tissue in Fig. 1(b). Fig. 1(a) is noisier than Fig. 1(b), a vertical line runs through Fig. 1(a), and Fig. 1(b) is slightly rotated from the standard visual angle shown in Fig. 1(a). On the other hand, as shown in the enlarged view of Fig. 1(b), the boundaries of the spinal canal and dural sac have low contrast against the nearby regions, and the size of the dural sac is extremely small, making it difficult to identify in practical CT images. In sum, the variations and low contrast of the raw CT images make lumbar spinal CT image segmentation difficult, and this dataset is challenging. In this paper, we develop a robust method to segment raw CT images in an end-to-end manner, without image denoising, contrast enhancement, registration, etc.

1) SAMPLE IMBALANCE OF THE DATASET
The labels in our dataset are imbalanced; see Fig. 2 for the class distribution: background (black color) 95.02%, vertebral body (red color) 4.43%, spinal canal (green color) 0.37%, and dural sac (white color) 0.18%. We handle this problem by a weighted cross-entropy loss function; please see section IV-C for details.

IV. METHOD
In this section, we introduce details regarding how we segment spinal images automatically and precisely with limited data and why this method works. First, we augment the image using several image processing approaches. This data augmentation alleviates overfitting when training, and we report the performance comparison with and without data augmentation in an ablation study. Second, we construct a dual densely connected U-shaped network (DDU-Net) to segment the spinal canal, dural sac and vertebral body in parallel. Finally, we introduce how to train this network in detail.

A. DATA AUGMENTATION
For convolutional neural network training, we use the following data augmentation: rotation by a random angle in (0, 2π); horizontal flip of the original images; random crop of the images to a size of 400 × 400 from the original size of 512 × 512; and addition of random Gaussian noise to the images, with the standard deviation σ = 0.15 + 1.15 × random(), where the random function produces a random float value between 0 and 1. Each image is augmented 100 times by these methods, which alleviates the requirement of a large quantity of labeled data and allows us to train the convolutional neural network successfully.
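As a rough illustration of the augmentation settings listed above, the sketch below only samples the random parameters for each augmented copy; it does not touch pixel data. The function name `sample_augmentation`, the dictionary keys, and the 0.5 flip probability are assumptions that are not stated in the text.

```python
import math
import random

def sample_augmentation(rng, full_size=512, crop_size=400):
    """Draw one random set of augmentation parameters for a single image."""
    max_offset = full_size - crop_size  # 112 pixels of slack for the crop window
    return {
        "angle": rng.uniform(0.0, 2.0 * math.pi),  # rotation angle in radians
        "flip": rng.random() < 0.5,                # assumed 50% chance of horizontal flip
        "crop_x": rng.randint(0, max_offset),      # left edge of the 400 x 400 crop
        "crop_y": rng.randint(0, max_offset),      # top edge of the 400 x 400 crop
        "sigma": 0.15 + 1.15 * rng.random(),       # Gaussian noise standard deviation
    }

rng = random.Random(0)
params = [sample_augmentation(rng) for _ in range(100)]  # 100 augmented copies per image
```

By construction, every sampled σ lies between 0.15 and 1.30, and every crop window stays inside the 512 × 512 image.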

B. DDU-NET ARCHITECTURE
We propose a deep fully convolutional network to segment the CT images. Different from U-Net [13] and some related medical image segmentation methods like [22] and [23], the proposed method does not split the original large images into patches. The input of the proposed DDU-Net is the original large image with size 400 × 400, which avoids the need to recompose the image patches. The network architecture is shown in Fig. 3; the cuboids represent feature maps, and the grey cuboids are feature maps copied from prior layers. In Fig. 3, the arrows are connections between layers: solid thin arrows represent the standard sequence of batch normalization (BN), rectified linear unit (ReLU) and convolution (Conv); green arrows represent upsampling-BN-ReLU-Conv, and the size of a feature map increases after this operation; brown arrows represent BN-ReLU-Conv-average pooling, and the size of a feature map decreases after this operation; black arrows represent max pooling, and the size of a feature map also decreases after this operation; and grey arrows represent copying the source feature maps and concatenating them to the target feature maps, so the number of channels of the target layer increases after this operation. Black numbers are the sizes of the feature maps, and blue numbers are the numbers of channels of the layers.

1) DUAL NETWORK STRUCTURE
DDU-Net consists of two sub-networks, and we duplicate an image and feed it into the two sub-networks separately. The upper sub-network upsamples feature maps at the fourth layer and downsamples at the fifth layer, while the lower sub-network downsamples feature maps at the fourth layer and upsamples at the fifth layer. Neurons of the upper sub-network have a smaller receptive field than those of the lower sub-network; consequently, the upper sub-network concentrates on smaller tissues, and the lower sub-network concentrates on larger tissues. We merge the feature maps of the last layer from the two sub-networks and convolve them with a 1×1×4 convolution layer to obtain a pixel-level classification. This dual network architecture allows DDU-Net to segment tissues of different sizes robustly; e.g., Fig. 1(c) and Fig. 1(d) show the small dural sac (white) and the large vertebral body (red).
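The final fusion step can be pictured at a single pixel: a 1×1 convolution is a linear map applied to the channel vector of each pixel independently. The sketch below is an illustration under assumed names and toy values (`fuse_pixel`, two channels per sub-network, hand-picked weights); it is not the trained DDU-Net layer, and class indices are 0-based here.

```python
def fuse_pixel(upper_feat, lower_feat, weights, bias):
    """Concatenate two per-pixel feature vectors and apply a 1x1 convolution."""
    merged = upper_feat + lower_feat  # channel-wise concatenation of both branches
    scores = [
        sum(w * x for w, x in zip(row, merged)) + b  # one output score per class
        for row, b in zip(weights, bias)
    ]
    return max(range(len(scores)), key=scores.__getitem__)  # index of the best class

# Toy example: 2 channels from each sub-network, 4 output classes.
upper = [1.0, 0.0]
lower = [0.0, 2.0]
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
bias = [0.0, 0.0, 0.0, 0.0]
predicted_class = fuse_pixel(upper, lower, weights, bias)
print(predicted_class)  # 3
```

Because the concatenated vector carries channels from both branches, each class score can draw on small-receptive-field and large-receptive-field features at the same pixel.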

2) SKIP CONNECTIONS
Inspired by U-Net [13], DDU-Net consists of a downsampling part (left side) and an upsampling part (right side). The downsampling part is used to encode input images in a lower dimensionality with richer filters, while the upsampling part is designed to complete the inverse process of encoding by upsampling and merging low-dimensional feature maps, which produces dense predictions for each pixel. On each sub-network, skip connections copy feature maps from the 5th, 7th and 9th layers and concatenate them to the 12th, 13th and 14th layers, respectively; as shown in Fig. 3, grey arrows indicate copy directions, and grey cuboids represent duplicated feature maps.
VOLUME 8, 2020

3) DENSE BLOCKS
Generally, a deeper network performs better than a shallower network. To balance the depth of the network and the number of parameters, we insert dense blocks in the downsampling part of DDU-Net. Inside the dense blocks, neurons of each layer connect not only to the next layer but also to all subsequent layers:
$$L_n = H_n([L_1, L_2, \ldots, L_{n-1}])$$
where $L_n$ is the $n$th layer and $H_n(\cdot)$ is a transition function, which usually consists of three consecutive operations, BN-ReLU-Conv, and is affected by $L_1, L_2, \ldots, L_{n-1}$. Fig. 4 shows an illustration of a dense block; between the input and output, there are 3 layers, with each layer having k = 2 channels, and we call it a 3k dense block. In this example, layer 1 connects to layer 2, as well as to layer 3 and the output layer. As shown in Fig. 3, we designed 8 dense blocks in DDU-Net in total, i.e., 6k, 12k, 24k, and 16k dense blocks for each sub-network.
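To make the dense connectivity concrete, the following sketch tracks how many channels each layer of a dense block receives, assuming the concatenation convention of DenseNet [12]. The function name and the input width of 4 channels are hypothetical; the 3 layers and k = 2 follow the Fig. 4 example.

```python
def dense_block_channels(in_channels, num_layers, growth_rate):
    """Return the input width seen by each layer and the block's output width."""
    layer_inputs = []
    channels = in_channels
    for _ in range(num_layers):
        layer_inputs.append(channels)  # layer sees the block input plus all earlier outputs
        channels += growth_rate        # its own k feature maps are appended
    return layer_inputs, channels

# 3-layer block with k = 2; the 4 input channels are an assumed value.
inputs, out = dense_block_channels(in_channels=4, num_layers=3, growth_rate=2)
print(inputs, out)  # [4, 6, 8] 10
```

Each layer adds only k new feature maps, which is why a dense block can be deep while keeping the number of new parameters per layer small.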

4) DETAILED STRUCTURE OF DDU-Net
See Table 1 for a detailed structure of the upper sub-network and lower sub-network of DDU-Net. Please note that in the layer details, conv is BN-ReLU-Conv; "7 × 7, 64 conv, stride 2" corresponds to a BN-ReLU-Conv layer with a convolutional kernel size of 7 × 7, 64 channels and stride 2; the symbol [] × n represents a dense block with the operations in [] repeated n times, and −[] indicates that this layer skip-connects with a dense block. The growth rate for all dense blocks is k = 32; the upsampling is bilinear interpolation, and each transition layer is an operation between two dense blocks. The lower sub-network has one additional upsampling layer to recover the feature map size to 400 × 400. The results of the upper sub-network and lower sub-network are concatenated and convolved with a 1 × 1 × 4 convolutional layer for dense classification.

C. TRAINING
We randomly divide the proposed dataset into three parts: 50% of the images are fed into DDU-Net for training, 20% are used for hyperparameter optimization and prevention of overfitting (validation), and 30% are used to evaluate the performance of the neural networks (testing). DDU-Net is trained in an end-to-end manner, and all parameters in the network are optimized simultaneously.
FIGURE 4. A 3-layer dense block with growth rate k = 2. Each layer has 2 channels, and each layer connects not only to the next layer but also to all subsequent layers.
For a typical lumbar spinal CT image, most pixels are background, and regions such as the dural sac and spinal canal are extremely small (see Fig. 1 and Fig. 2). To solve this problem, we introduce a class-balancing weight $w_c$ on a per-pixel basis. This class-balancing weight is designed to offset the imbalance between major and minor classes and to encourage the neural network to learn features of small tissues such as the dural sac and spinal canal. Specifically, DDU-Net adopts a weighted cross-entropy function as the loss function, which can be formulated as:
$$\mathcal{L} = -\sum_{i=1}^{N}\sum_{c=1}^{4} w_c\, y_i^c \log p_i^c$$
where $N$ is the pixel number of the image, $y_i^c$ encodes the labeled class of pixel $i$ (1 if pixel $i$ is labeled as class $c$ and 0 otherwise), $p_i^c$ is the predicted probability of pixel $i$ belonging to class $c$, i.e., spinal canal ($c = 1$), dural sac ($c = 2$), vertebral body ($c = 3$) and background ($c = 4$), and $w_c$ denotes the class-balancing weight, which is computed from the pixel number $N$ of the image and the pixel number $N_c$ that belongs to class $c$.
The code of DDU-Net is implemented in the PyTorch framework. We train DDU-Net using two NVIDIA GeForce GTX 1080Ti GPUs, and due to GPU memory constraints, our model is trained with a mini-batch size of 4. The optimizer that we adopt is stochastic gradient descent (mini-batch SGD) [24] with momentum = 0.95; the learning rate is set to 1e-7, and the weight decay is 5e-4. The parameters in the dense blocks of each sub-network are initialized with DenseNet [12] weights pretrained on ImageNet [25], and other parameters are initialized by He initialization [26].
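The effect of the class-balancing weight in the loss can be illustrated with a small pure-Python sketch. The exact definition of the weight is not recoverable from the text here, so the inverse-frequency form w_c = N / N_c used below is an assumption, as are the function names; class indices are 0-based in the code.

```python
import math

def class_weights(labels, num_classes):
    """Inverse-frequency weights w_c = N / N_c (an assumed form, not from the paper)."""
    n = len(labels)
    counts = [labels.count(c) for c in range(num_classes)]
    return [n / k if k else 0.0 for k in counts]  # absent classes get weight 0

def weighted_cross_entropy(probs, labels, weights):
    """Sum of -w_c * log p_i^c over pixels, where c is each pixel's labeled class."""
    return -sum(weights[c] * math.log(p[c]) for p, c in zip(probs, labels))

# Toy image with 4 pixels: three of class 0 (majority) and one of class 1 (minority).
labels = [0, 0, 0, 1]
probs = [[0.5, 0.5]] * 4               # an uninformative prediction for every pixel
weights = class_weights(labels, num_classes=2)
loss = weighted_cross_entropy(probs, labels, weights)
```

Under this assumed weighting, the three majority pixels together contribute 4 ln 2 to the loss and the single minority pixel also contributes 4 ln 2, so a rare class such as the dural sac is not drowned out by the background.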

V. EXPERIMENTS AND RESULTS
In this section, we introduce the performance evaluation metrics of this paper, after which ablation studies are conducted to explain why this method works. Finally, we compare the DDU-Net with state-of-the-art methods.

A. EVALUATION METRICS
Suppose that we classify image pixels into $C$ classes, where $N$ is the total pixel number of the image, $\#y_{ii}$ is the correctly predicted pixel number, and $\#y_{ij}$ and $\#y_{ji}$ are the false positive and false negative predicted pixel numbers, respectively, as shown in Fig. 5. We evaluated our model using several semantic segmentation metrics [27]: pixel accuracy (PA), mean pixel accuracy (MPA), mean intersection over union (mIoU), and frequency weighted intersection over union (fwIoU).
FIGURE 5. An illustration of semantic segmentation: $y_{ii}$ is a correctly predicted pixel, $y_{ij}$ is an i-class pixel that was predicted as j-class, and $y_{ji}$ is a j-class pixel that was predicted as i-class.
PA measures the ratio between correctly predicted pixels and total pixels. This metric can be formulated as follows:
$$\mathrm{PA} = \frac{\sum_{i=1}^{C} \#y_{ii}}{N}$$
MPA is a simple improvement of PA that calculates the percentage of correctly predicted pixels of each class and averages the percentages over all classes. This metric can be formulated as follows:
$$\mathrm{MPA} = \frac{1}{C}\sum_{i=1}^{C}\frac{\#y_{ii}}{\sum_{j=1}^{C}\#y_{ij}}$$
mIoU is a common metric in semantic segmentation that calculates the ratio between the intersection region (true positives) and the union region (true positives, false positives and false negatives), and averages the ratios over all classes. This metric can be formulated as follows:
$$\mathrm{mIoU} = \frac{1}{C}\sum_{i=1}^{C}\frac{\#y_{ii}}{\sum_{j=1}^{C}\#y_{ij}+\sum_{j=1}^{C}\#y_{ji}-\#y_{ii}}$$
fwIoU is an improvement of mIoU that weights the intersection over union of each class by its occurrence rate. This metric can be formulated as follows:
$$\mathrm{fwIoU} = \frac{1}{N}\sum_{i=1}^{C}\frac{\left(\sum_{j=1}^{C}\#y_{ij}\right)\#y_{ii}}{\sum_{j=1}^{C}\#y_{ij}+\sum_{j=1}^{C}\#y_{ji}-\#y_{ii}}$$
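The four metrics can be computed from a confusion matrix, as in the following minimal sketch. It assumes that entry `conf[i][j]` counts pixels of true class i predicted as class j and that every class occurs at least once; the function name and the toy two-class matrix are illustrative only.

```python
def segmentation_metrics(conf):
    """Compute PA, MPA, mIoU and fwIoU from a C x C confusion matrix.

    conf[i][j] counts pixels whose true class is i and predicted class is j.
    """
    num_classes = len(conf)
    total = sum(sum(row) for row in conf)
    row_sums = [sum(row) for row in conf]  # pixels labeled as class i
    col_sums = [sum(row[j] for row in conf) for j in range(num_classes)]  # predicted as j
    diag = [conf[i][i] for i in range(num_classes)]  # correctly predicted pixels
    ious = [diag[i] / (row_sums[i] + col_sums[i] - diag[i]) for i in range(num_classes)]
    pa = sum(diag) / total
    mpa = sum(diag[i] / row_sums[i] for i in range(num_classes)) / num_classes
    miou = sum(ious) / num_classes
    fwiou = sum(row_sums[i] * ious[i] for i in range(num_classes)) / total
    return pa, mpa, miou, fwiou

# Toy two-class confusion matrix (150 pixels in total).
pa, mpa, miou, fwiou = segmentation_metrics([[90, 10], [5, 45]])
```

For this toy matrix, PA and MPA are both 0.9, while mIoU is about 0.804 and fwIoU about 0.821, which shows how the IoU-based metrics penalize both kinds of error.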

B. ABLATION STUDIES
To investigate the importance of different options in our method, we conduct an ablation study. The ablation studies include with/without data augmentation, with/without skip connections, with/without dense blocks, with/without multi-branch networks, and the effect of the growth rate. Table 2 presents the details of the ablation studies; the last row shows the default DDU-Net as the baseline, that is, the network adopts skip connections, dense blocks and multi-branches, and uses data augmentation.
FIGURE 6. Training on the proposed dataset, (a) without data augmentation, (b) with data augmentation. We observe that training with data augmentation alleviates overfitting and improves the performance.

1) DATA AUGMENTATION
Since the proposed dataset is not a large-scale dataset, we conduct data augmentation to alleviate overfitting when training the network; the data augmentation details are introduced in Section IV-A. To reveal the benefit of data augmentation, we train this model for 100 epochs with and without data augmentation. The last row and the second row of Table 2 show the performance comparison between training with and without data augmentation under the same network architecture; the network adopts skip connections, dense blocks and the multi-branch structure, and after data augmentation, the mIoU improves by approximately 1 point. Fig. 6 shows the training and validation losses during the training procedure with and without data augmentation; the blue curves denote training loss, and the orange curves denote validation loss. Fig. 6(a) shows the training process without data augmentation, where the best mIoU is 0.8209 at about the 90th epoch. The validation loss begins to increase again (after a fall) at about the 30th epoch; this phenomenon is caused by overfitting. Fig. 6(b) depicts the training process with data augmentation. Although the validation curve also increases again, this happens later, at about the 45th epoch, and the best mIoU reaches 0.8314 at about the 90th epoch, an improvement of approximately 1 point. These experiments indicate that data augmentation in our method not only alleviates overfitting but also improves the performance.

2) NETWORK ARCHITECTURE
We design several modules to improve the architecture of DDU-Net. Skip connections in DDU-Net are designed to reuse low-level features and fuse multi-level features. As shown in the third row of Table 2, we remove all skip connections in DDU-Net, after which the network becomes a flat network; the mIoU of the flat network is 0.7989, which is lower than that of the default DDU-Net by 3.42 points. Dense blocks are capable of alleviating the gradient vanishing problem in deep networks, enhancing feature reuse and feature propagation, and reducing the number of parameters. As shown in Table 2, without dense blocks, the mIoU decreases to 0.7636, which is lower than that of the default DDU-Net by 6.95 points. We also separate DDU-Net into two variants: one retains only the upper sub-network, and the other retains only the lower sub-network. The fifth and sixth rows of Table 2 show the performances of the stand-alone upper sub-network and lower sub-network, respectively; their mIoUs are 0.8186 and 0.8067, which are lower than the mIoU achieved by the default multi-branch DDU-Net by 1.45 and 2.64 points. The evaluation results under other metrics such as PA, MPA and fwIoU also indicate that the default DDU-Net, which uses data augmentation and three modules (SC, DB and MB), performs best.
We visualize the ablation studies in Fig. 7, where the images in the leftmost column are two input CT images; the scale of the top-left image is larger than that of the bottom-right image. The images on the rightmost side are the corresponding ground truth (GT), in which the spinal canal, dural sac, vertebral body and background are marked in green, white, red and black, respectively. The columns from Fig. 7(b) to Fig. 7(f) show the network predictions under several options; among them, Fig. 7(c) represents results from the network without dense blocks, Fig. 7(d) represents results from the upper sub-network, Fig. 7(e) represents results from the lower sub-network, and Fig. 7(f) represents the complete DDU-Net segmentations. We can clearly see that the segmentation results generated by the complete DDU-Net are much closer to the ground truth than the other results.

3) GROWTH RATE
The growth rate is a hyperparameter of a densely connected neural network that indicates how many channels (feature maps) each layer in a dense block produces. Generally, a network with a larger growth rate will perform better. However, a larger growth rate brings more parameters, and the running time of the network also increases. In the experiment, we test several growth rates under the same network architecture and data. As shown in Table 3, the method performs best when the growth rate is equal to 48, but it surpasses the growth rate of 32 by only a very small margin. On the other hand, when the growth rate is equal to 32, the running time is much shorter than when the growth rate is equal to 48, because the former has far fewer parameters than the latter. Consequently, we set the growth rate to 32, since it offers a good tradeoff between accuracy and efficiency.
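To see why a larger growth rate costs more parameters, the sketch below counts convolution weights in one dense block. It is a simplified illustration under stated assumptions: plain 3 × 3 convolutions without bias or bottleneck layers, a 6-layer block, and 64 input channels, none of which is claimed to match Table 1 exactly.

```python
def dense_block_conv_params(in_channels, num_layers, growth_rate, kernel=3):
    """Count convolution weights in a dense block of plain kernel x kernel layers."""
    params = 0
    channels = in_channels
    for _ in range(num_layers):
        params += channels * growth_rate * kernel * kernel  # weights of one layer
        channels += growth_rate                              # concatenated output grows by k
    return params

# Compare the two growth rates discussed in the text on an assumed 6-layer block.
params_k32 = dense_block_conv_params(in_channels=64, num_layers=6, growth_rate=32)
params_k48 = dense_block_conv_params(in_channels=64, num_layers=6, growth_rate=48)
print(params_k32, params_k48)
```

The count grows faster than linearly in k, because a larger growth rate widens both the output of each layer and the concatenated input of every later layer.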

C. COMPARISON WITH STATE-OF-THE-ART METHODS

1) QUALITATIVE COMPARISON
Several state-of-the-art methods are related to spinal image analysis. Since the proposed method is the first to simultaneously segment the vertebral body, spinal canal and dural sac, we compare our DDU-Net with five related state-of-the-art methods in the following three aspects: 1) data source type, either MRI or CT images; 2) technical details, including the objective of the work and the methodology; and 3) segmentation targets, explaining what contents are segmented from the images. As shown in Table 4, all methods except our proposed DDU-Net use MR images as training data. [16] and [28] segment images by using traditional machine learning algorithms; [10] segments images manually, and the segmented images are fed into CNNs as intermediate results; the other methods segment images with CNNs. The segmentation targets of these methods are different; only the proposed DDU-Net segments the vertebral body, spinal canal and dural sac from spinal CT images.

2) VISUAL COMPARISON
For visual comparison, we select some sample segmentation results of three state-of-the-art deep learning-based semantic segmentation methods and DDU-Net. As shown in Fig. 8, the images in the leftmost column are input images, and the images in the rightmost column are the ground truth. Fig. 8(b) shows the results for FCN-8s [14], which applies per-pixel classification using a fully convolutional network; in this experiment, we adopt the FCN 8-pixel stride version since it performs best among all FCN versions. Fig. 8(c) shows the results for U-Net [13], which segments medical images with a U-shaped fully convolutional network with skip connections. Fig. 8(d) shows the results for DeeplabV3 [15], which improved Deeplab [29] with multigrid and atrous spatial pyramid pooling. Fig. 8(e) shows the segmentation maps of the proposed DDU-Net. As demonstrated in Fig. 8, FCN-8s [14] and U-Net [13] fail to segment the extremely small dural sac, and the boundaries of the vertebral body produced by U-Net [13] and DeeplabV3 [15] are not precise; we can clearly see that the segmentation maps of DDU-Net are much closer to the GT than those of the other methods. From the top-left to the bottom-left image, the scale of the images is increasing; that is, the top-left image concentrates on details of tissues, where the regions of interest are larger than those of the other images, while the bottom-left image represents a global view of the CT, where the regions of interest are smaller than those of the other images. Under this challenging condition, DDU-Net not only handles images with different scales but also segments semantic regions of different sizes, such as the large vertebral body (denoted in red) and the small dural sac (denoted in white), while FCN-8s, U-Net and DeeplabV3 fail in at least one case.
FIGURE 8 (caption fragment). ... [14], U-Net [13], DeeplabV3 [15] and the proposed DDU-Net, respectively. Our results are the closest to the ground truth.

3) QUANTITATIVE COMPARISON
The quantitative comparison of several methods on our dataset is shown in Table 5. For a fair comparison, the parameters of FCN-8s [14], U-Net [13] and DeeplabV3 [15] are fine-tuned on our dataset before the comparison. We can see that DDU-Net performs best in terms of all four evaluation metrics. Since over 95% of the labels belong to the background class, the performances on the PA and fwIoU metrics are quite saturated, and these two metrics alone are not appropriate for evaluating the performance of a method. On the other hand, the performances of DDU-Net in terms of the mIoU and MPA metrics reach 0.8331 and 0.9099, respectively, surpassing the state-of-the-art methods by at least 3 points. Consequently, both qualitative and quantitative comparisons between our method and the state-of-the-art methods indicate that the proposed method can generate promising segmentations on practical lumbar spinal CT images.

VI. CONCLUSION
Precisely identifying and recognizing the vertebral body, spinal canal and dural sac is a key step in diagnosing different types of LSS. In this paper, we first provide a new lumbar spinal CT image segmentation dataset with pixel-level labels and then present a fully automatic method for segmentation of the vertebral body, spinal canal and dural sac from axial spine CT images based on a dual densely connected U-shaped network. Our method is practical: it requires no image preprocessing such as contrast enhancement, registration and denoising; the input is raw CT images, and the output is the desired segmentation maps; and the running speed is about 12 FPS (please see Table 3). Our method is also precise: compared with existing state-of-the-art methods on our new dataset, the proposed method proved superior in terms of segmentation accuracy (Table 5).
Although we automatically segment the vertebral body, spinal canal and dural sac from CT images, one more step remains before fully automatic diagnosis of the different types of LSS. In future work, we will apply the proposed DDU-Net as an approach for generating regions of interest and will investigate the complete automatic LSS diagnosis pipeline.
[28] C. Gros
XIN LI is currently with the Union Hospital Affiliated to Tongji Medical College, Huazhong University of Science and Technology as an Associate Professor. She served as the Deputy Group Leader of the Endocrinology, Genetics and Metabolism Group, Hubei Academy of Pediatrics, the Director of the Hubei Society of Medical Biology and Immunology, and a Standing Member of the Children's Cancer Professional Committee of Hubei Cancer Society. She is mainly engaged in scientific research on endocrinology. She is also studying how to improve the accuracy of bone age assessment through computer data analysis, so as to accurately judge the current bone development status of children and predict their final height.
CHAO LIU received the B.S. degree in software engineering from the Wuhan University of Science and Technology, China, in 2018. He is currently pursuing the M.S. degree in software engineering with the Huazhong University of Science and Technology, China. His research interests include image semantic segmentation and salient region detection.