A Novel Deep Learning Pipeline for Vertebra Labeling and Segmentation of Spinal Computed Tomography Images

Automatic segmentation of vertebrae from computed tomography (CT) scans plays an important role in the clinical interpretation and treatment of spinal co-morbidities. Labelling and segmentation of vertebrae are labour-intensive and challenging tasks, due to the varying fields of view and fuzzy boundaries in CT scans. Therefore, successful labelling and segmentation are highly dependent on the level of expertise of the radiologist. In this paper, we propose a three-step, fully automated, end-to-end pipeline for vertebra labelling and segmentation of spinal CT images. A novel deep learning architecture, Unbalanced-UNet, is proposed for extracting the region proposals for spine detection. A modified SpatialConfiguration-Net, 3D SCN, is used for labelling of vertebrae and centroid extraction. Finally, a 3D U-Net is employed for the segmentation of each vertebra. The models were validated on the public VERSE'19 dataset. Identification rates of 90.20% and 91.47% were obtained for the first and second test sets of the VERSE'19 dataset, respectively. Mean localization distances of 4.97 mm and 5.32 mm were obtained for the first and second test sets, respectively. The final segmentation stage shows a Dice score and Hausdorff surface distance of 93.07% and 5.36 mm, respectively, for the first test set, and 92.01% and 5.63 mm, respectively, for the second test set. The results show that the proposed approach outperforms state-of-the-art models for the segmentation of vertebrae. The proposed Unbalanced-UNet architecture improved the accuracy of extracting the region proposals for spine detection. The proposed fully automated pipeline has potential clinical applications in the treatment and surgical planning of spinal deformities.


I. INTRODUCTION
The spine constitutes a critical part of the skeletal system: it supports the body and its organs while heavily assisting movement and balance. The vertebral column safeguards the spinal cord from external injuries and shocks. Understanding the intrinsic mechanics of the vertebral column involves quantitative imaging, finite element modelling, alignment analysis of the spine and the construction of highly specialized models.
The inspiration for the field of computer vision is the idea that a computer can have human-like perception of its visual surroundings. Computer vision is moving from a statistics-based approach to a deep-learning-based approach. This change in perspective has helped solve many age-old problems and has given rise to some very exciting results and methodologies. Deep learning models show better performance on benchmark problems by using a single model to extract meaningful features from images, thus alleviating the need for a pipeline of hand-crafted methods. However, a plethora of unsolved problems still remains.
The popularity of medical imaging techniques has grown tenfold in the past decade [1]. Imaging techniques such as magnetic resonance imaging, computed tomography (CT), X-ray, and ultrasound imaging are the predominantly used modalities in biomedical image processing applications. Due to its high bone-to-soft-tissue contrast and non-invasive nature, CT is the preferred modality to capture and diagnose disease conditions and deformities of the vertebral column.
Physiological variations in bone structure and density can lead to excruciating pain and short-term disability, and can be morbid in the long term; for example, osteoporotic fractures cause morbidity and mortality in about 70% of the affected population, and the risk is higher in women [2]. Despite their importance, spinal deformations are often under-diagnosed. Hence, there exists a need for medical image analysis and automated pipelines for the early discovery of variations or co-morbidities, which facilitates early treatment.
Labelling, also known as identification, and segmentation of the vertebral column are the two important steps in the prognosis of medical conditions related to the spine. Vertebra labelling and segmentation have diverse diagnostic uses, such as detecting underlying fractures and grading them, estimating the spinal curve, and identifying deformities such as scoliosis and kyphosis [3], [4]. Scoliosis is an abnormal curvature of the spine. Labelling and centroid extraction of the vertebral column make it possible to fit the centroids to an elliptical curve in order to check whether the patient has scoliosis. 3D segmentation of the vertebral column can aid radiologists in detecting spinal fractures at an early stage. From a clinical point of view, labelling and segmentation facilitate efficient modelling, which simplifies surgical planning in a sensitive region. Manual labelling of vertebrae is usually a straightforward task, except under low-exposure conditions or a restricted field of view (FoV). However, manual segmentation of vertebrae, namely annotating regions of interest with each vertebra having a size of ∼10³ voxels, is a tedious task. Automating these tasks can help reduce the time taken [5]. This comes with many challenges, such as the large size of the CT scans, the similar shapes of adjacent vertebrae, noise introduced during the scan, acquisition conditions, highly varying FoVs across datasets, and the complex morphology or multiple anomalies of the vertebrae [1]. Hence, there exists a necessity for a fully automated framework for labelling and segmenting vertebrae, which would significantly decrease the time the radiologist spends analyzing CT scans manually. Due to these challenges, many researchers have devoted their work to fully automated or interactive strategies for the labelling and segmentation of the vertebral column.
The aim of this paper is to develop a fully automated deep learning approach for large-scale vertebra labelling and segmentation of spinal CT images. This paper proposes a new efficient architecture, named Unbalanced-UNet, to extract the region of interest for the spine. The centroids of each vertebra in the spinal column are extracted and labelled using a modified SpatialConfiguration-Net, 3D SCN. Finally, the vertebrae are segmented using a 3D U-Net model.
The remainder of the paper is organized as follows: Section II discusses related work in the literature on vertebra labelling and segmentation. Section III explains the proposed methodology in detail. Section IV presents the results of our work and a comparison with related work. Section V concludes with the key points of the paper.

II. RELATED WORK
In the literature, various methods have been developed for vertebra labelling and segmentation, some requiring manual intervention and some fully automated. Before deep learning based methods were in vogue, image processing methods were widely used.
Zamora [6] proposed a hierarchical segmentation method. Their methodology applies edge detection followed by the Generalized Hough Transform for pose estimation. They used energy minimization techniques to hold the shape of the template and obtained errors of ∼6 mm in the segmentation process. The drawback of this method is that they worked on spinal X-rays, whereas the majority of clinical data is in the form of CT images; applying edge detection techniques to CT scans would prove futile in clinical application [7]. Similarly, [8] proposed a polar-signature-based representation computed using gradients. Polynomial fitting was employed for edge closing, and segmentation was performed on the basis of the extracted contours. This method suffers the same downfall as the method of [6], in that the work was done solely on X-rays.
One of the earliest works in the field of automated labelling and segmentation dates back to 2009. The method proposed by [9] is predominantly a statistical approach. It encodes profound medical knowledge through various models that take gradient, shape and appearance information into account. They obtained a vertebra identification rate of 70% and a mean point-to-surface segmentation error of 1.12 ± 1.04 mm. Although these results are good, there is clearly significant room for improvement.
Glocker [10] proposed a method based on probabilistic graphical models and regression forests for the identification, or labelling, of spinal CT scans. The regression forest is employed to locate the spinal column, and the vertebrae are identified and labelled using a hidden Markov model. They obtained a mean identification rate of 81%. Their model is highly efficient, given that it requires significantly less computational power, and its inference speed is excellent. Although this is a good result, it cannot be used in a clinical setting; the main drawback remains its identification rate.
More recently, deep learning based approaches have dominated the literature in medical imaging, and especially in segmentation. The fully convolutional network approach described by [11] and [12], U-Net [13], has facilitated this change. A team of researchers from the Alibaba DAMO Institute proposed a method called SpineSegNet [1]. They propose a key-point-based instance segmentation network for the segmentation of vertebrae. This framework consists of three sub-parts, used for vertebra segmentation, position prediction and vertebra labelling. The team broke the main problem down into smaller problems to ensure that their networks did not compromise on learning. A deep learning based approach is used to differentiate the vertebral column from the background of tissue and muscle. The binary volume output of this model is fed into the position-prediction network, which predicts the centroids. Their methodology shows an identification rate of ∼85% and an average localization error of ∼9.5 mm. Although these results are good, they are not sufficient for clinical applications. Their main drawback arises from ignoring the loss of spatial information incurred by their approach.
The methodology proposed by [14] uses a 3D U-Net [15] that analyses a [128, 128, 128] region of interest (ROI) in the input CT scan. Within the ROI, the network segments and labels only the bottom-most vertebra and ignores the rest of the vertebral column. The ROI is iteratively moved upwards, like a sliding window. When the entire vertebral column has been iterated over, the segmentation and labelling results are stored. The predicted labels of all the vertebrae are fused with the help of a maximum likelihood model. Their model displays an identification rate of ∼89% and a localization error of ∼12 mm for identification, and a Dice score of 85% and a Hausdorff surface distance of ∼9 mm for segmentation. The results point to an accurate model, but the highly iterative approach places a significant trade-off on memory and computational power. As a result, this pipeline is slow compared to other state-of-the-art models and may not be usable effectively in a clinical setting.
Payer [16] proposed a coarse-to-fine automatic method for detection and segmentation in spine CT volumes. They utilised a SpatialConfiguration-Net (SCN) [17] and identified the limitations of deep learning based methods when operating on small datasets. The SCN is a fully convolutional network (FCN) [18] that combines local and spatial predictions for accurate landmark localization. They also proposed a U-Net for binary segmentation of vertebrae. Their model displays an identification rate of ∼93% and a localization error of ∼4.5 mm for identification, and a Dice score of 89% and a Hausdorff surface distance of ∼7 mm for segmentation. Their approach constitutes the current state-of-the-art model.
The Vertebra Labelling and Segmentation Benchmark challenge was organized in conjunction with MICCAI 2019 to develop automated and accurate methods for labelling and segmenting 1735 vertebrae from 160 spinal CT images. The dataset used, VERSE'19, and the results of the participating groups are publicly available [1], [19]. The ''iFLYTEK'' team, who participated in the challenge, utilised a two-step methodology for predicting the masks [1]. The first stage employs a 3D U-Net [15]. The network is fed with regions of size [224, 160, 128] that are randomly extracted from each scan. This framework is set up to predict 25 labels according to the ground truth. The team uses these patches instead of entire scans to train the network, as it is easier to extract the important features for each spinal class, and hence easier to classify. For the localization task, the team uses an RCNN localization network [20] with an input size of 160 × 192 × 224. The team defines a tight region-of-interest box by analysing the largest connected patches in the sagittal direction. Their model displays an identification rate of ∼90% and a localization error of ∼6 mm for identification, and a Dice score of 88% and a Hausdorff surface distance of ∼9 mm for segmentation. This model performs extremely well on the test-1 data of the VERSE'19 dataset, showing performance on par with state-of-the-art models, but its performance drops heavily on the test-2 data. The reason for this significant drop-off is that the team trained their model on the public test set, test-1.
A report submitted by Christoph Angerman in [1] proposes a novel method of handling volumetric data without extending the 2D U-Net structure to 3D convolutions. Significant drawbacks of the 3D U-Net are its huge memory requirements, which limit the complexity of the algorithm [15]. The objective is to transform the 3D data to 2D by combining projection layers from different directions, thereby retaining the information of the 3D images. A 2D U-Net can then be applied to learn the features for segmentation. The author proposes the use of binocular reconstruction [21] to bring the data back to its original format. The model displays an identification rate of ∼45% and a localization error of ∼30 mm for identification, and a Dice score of 40% and a Hausdorff surface distance of ∼37 mm for segmentation. These results indicate a very poor model, but they do not reflect the ways in which the methodology has simplified 3D segmentation by replacing the 3D U-Net with binocular reconstruction.
State-of-the-art methods predominantly use FCNs [20] for segmentation and for landmark localization. The widespread use of FCNs underlines their importance to the field of medical imaging. In this paper, we propose an FCN-based methodology for the labelling and segmentation of vertebrae. We utilize different variations of the U-Net [13] for locating the approximate spinal region, for labelling and centroid extraction, and for vertebra segmentation.
The following are the major contributions of this work.
(1) We propose the novel Unbalanced-UNet encoder-decoder architecture. The main architectural difference of our proposed architecture from a 3D U-Net [15] lies in the number of blocks in the encoder and decoder. Our proposed network employs fewer blocks in the encoder so as to focus the attention of the network mainly on the creation of the feature maps. This change facilitates accurate extraction of region proposals.
(2) We propose a modified version of the SCN [17], named 3D SCN, for labelling and centroid extraction. The primary difference lies in the spatial feature extractor. The network is set up to extract features from all the axes, so as to make more accurate spatial predictions. The number of convolutions is also increased to further improve the feature extraction.
(3) We develop a fully automated end-to-end deep learning pipeline for vertebra segmentation. The feature maps are passed along with each individual vertebra for more accurate segmentation, given the highly similar shapes of vertebrae.

III. PROPOSED METHODOLOGY A. DATASET USED
This research work is carried out on the VERSE'19 dataset, which is part of the Vertebra Labelling and Segmentation Benchmark challenge [1], [19]. The objective of the challenge is to provide a public, diverse dataset and a common benchmark for future algorithms [22]. The challenge consists of two tasks, namely vertebra labelling and vertebra segmentation. The dataset comprises 160 CT scans, of which 80 are used for training, 40 as test-1 and 40 as test-2 data. Each volume consists of a full 3D spine. The average voxel volume of the input CT scans is 1.63 ± 0.69 mm³, and the voxel dimensions range from (144, 144, 144) to (915, 1189, 1189). The 160 multidetector CT scans were annotated by a human-machine hybrid algorithm. A few example images from the VERSE'19 dataset are shown in Fig. 1. The figure shows that the dataset contains diverse CT scan images with varying fields of view and image dimensions.

B. PRE-PROCESSING THE DATA
The overview of the proposed methodology is shown in Fig. 2. It consists of a pre-processing stage followed by extraction of the region proposal for the spine, vertebral centroid extraction and labelling, and final segmentation of the vertebrae.
The CT scans were provided in the NIfTI file format [23]. The NIfTI file format encloses various properties, such as orientation in space, shape, etc. The primary view of the NIfTI images is RAS+, i.e., Right Anterior Superior, where the '+' stands for a frontal view. In the primary view, the CT scans appear excessively constricted, and hence it would be very difficult to perform any type of feature extraction, or efficient computation, on them. Therefore, the CT scans were pre-processed to change their orientation from RAS+ to RAI, i.e., Right Anterior Inferior. This simplified the subsequent processing tasks on the dataset. After pre-processing, the voxel dimensions ranged from (41, 103, 103) to (512, 512, 512). Fig. 3 shows an original image in NIfTI format and the same image reoriented to RAI.

C. REGION PROPOSALS FOR THE SPINE USING UNBALANCED-UNet
Determination of the region proposals for the spine is an intermediate step for labelling and centroid extraction. We propose a novel architecture, Unbalanced-UNet, for extracting region proposals from spinal CT scans. The Unbalanced-UNet is a modified 3D U-Net [15] in which the numbers of encoder and decoder blocks are not equal.
The Unbalanced-UNet is designed with unequal numbers of encoder and decoder blocks in order to extract accurate region proposals. The VERSE'19 data is provided with the centroids and segmentation masks as ground truth. As a result, no explicit spatial information about the spine in each CT image is available, and the Unbalanced-UNet is proposed to overcome this shortcoming. Prior to the centroid extraction and segmentation of vertebrae, the task at hand is to localize or detect the spine. For tasks such as this, i.e., object detection of unnatural shapes, the U-Net fails to perform satisfactorily [24]. The Unbalanced-UNet solves this problem by limiting the feature extraction. The rationale is that the Unbalanced-UNet does not perform a segmentation task but rather an object localization task, the object being the spine. Hence, feature extraction is limited to ensure that deeper features are not extracted. If deeper features were extracted, the task would become similar to a segmentation task; additionally, it would lead to poor segmentation, as no spatial information about the CT scans is available in the ground truth data. Hence, we have trained the Unbalanced-UNet to extract features for determining the spinal region proposals.
The encoder facilitates feature extraction, and the input is compressed by a fixed stride in each of the convolutional blocks in the encoder. The decoder is tasked with building the region proposals. It does so by upsampling the output of the encoder.
The optimum imbalance for the Unbalanced-UNet is determined experimentally. We built three different Unbalanced-UNets with 3 encoders and 5 decoders; 4 encoders and 5 decoders; and 4 encoders and 6 decoders. For simplicity, we call them Unbalanced-UNet-1, Unbalanced-UNet-2 and Unbalanced-UNet-3. For Unbalanced-UNet-1 and Unbalanced-UNet-3, the extracted region proposals were not accurate enough. This is substantiated in the segmentation task, which requires the cropped volume of the region proposals and the CT scans to be concatenated and passed as input to the network, in order to reduce the complexity of the 3D segmentation algorithm. The region proposals obtained from Unbalanced-UNet-1 and Unbalanced-UNet-3 increased the processing time, and the segmentation outputs were not satisfactory. The region proposals from Unbalanced-UNet-2 reduced the processing time by a significant margin, and the segmentation results were excellent. Hence, 4 encoders and 5 decoders were chosen as the optimum imbalance.
The proposed model is built and trained to detect and localize all individual vertebrae simultaneously. All vertebrae are assumed to be equally important. The Unbalanced-UNet can be used as a general architecture for extracting region proposals only. The segmentation masks are used as the ground truth. Additionally, Laplacian of Gaussian (LoG) filters are used to enhance the feature maps, as they have very sharp maxima [25]. Equation (1) shows the LoG filter function. The points of maxima are chosen as the predicted centroids for this task. Note that the centroids estimated in this step are not the final predicted centroids; they are used to regress a line that approximately passes through all the vertebrae. The final centroids are predicted in the next stage, i.e., vertebra localization. The individual feature maps are concatenated to form the entire region proposal of the spine. Fig. 4 shows the architecture of the Unbalanced-UNet model.
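As a minimal sketch of this step, the LoG response of one vertebra's feature map can be computed with SciPy and its sharpest point taken as a provisional centroid; the function below is illustrative, not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def estimate_centroid(feature_map, sigma=2.0):
    """Provisional centroid of one vertebra's feature map.

    gaussian_laplace applies a LoG filter as in equation (1); the LoG
    of a bright blob is most negative at its centre, so the centroid is
    taken at the maximum of the negated response.
    """
    response = -gaussian_laplace(feature_map.astype(np.float64), sigma=sigma)
    return np.unravel_index(np.argmax(response), response.shape)
```

Applied per vertebra, the resulting points can then be used to regress the line that approximately passes through all the vertebrae, as described above.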

LoG(x, y) = −(1/(πσ⁴)) [1 − (x² + y²)/(2σ²)] e^(−(x² + y²)/(2σ²))  (1)
The learning rate used is 10⁻⁵. The loss function used is the L2 loss, as shown in (2). The L2 loss minimizes the sum of the squared error values.
L2 loss = Σ_{i=1}^{N} (y_true − y_pred)²  (2)
where y_true and y_pred are the ground truth and predicted centroids. The predicted centroids are chosen as the points of maxima of the LoG filters applied to the individual vertebrae. The encoder of the proposed Unbalanced-UNet model has four convolutional blocks, with two 3D convolution (Conv3D) layers each [26]. The first convolutional layer has input and output dimensions of [64, 32, 32] with 64 filters, a kernel size of [3, 3, 3], ''same'' padding (which ensures the output has the same dimensions as the input), and Rectified Linear Unit (ReLU) activation. The batch size is 100 and the number of channels is 3; thus, the input shape becomes [100, 64, 32, 32, 3]. The details of each layer are given in Table 1. Each Conv3D layer is connected to the next Conv3D layer by means of a [1, 1, 1] convolution, which ensures the dimensionality is maintained. An encoder block is connected to the next encoder block by a max_pooling3D layer [26] with a pool size of [2, 2, 2]; this layer produces a downsampled feature map that helps to reduce over-fitting. The fourth encoder block opens into a bottleneck that has input dimensions of [8, 4, 4] with 64 filters. The bottleneck opens into the decoder, which has five convolutional blocks with two Conv3D layers each. Each convolutional layer is connected by means of a [1, 1, 1] convolution. A decoder block is connected to the next decoder block through an UpSample3D layer [26] with a sampling factor of [2, 2, 2] that upsamples the feature maps. Each decoder block is also connected in parallel with the encoder block of the same dimensions. The details of the decoder are given in Table 2. The fifth decoder block opens into the output layer, which is a Conv3D layer with an input shape of [128, 64, 64], 64 filters, a [3, 3, 3] kernel size and ''same'' padding. The network outputs are verified at each level using class activation maps [27] to ensure optimum performance. The voxel dimensions of the input are the same as those of the input CT scans. The output voxel dimensions range from (43, 129, 129) to (427, 427, 427).
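The block counts and layer parameters above can be sketched in Keras as follows. This is a simplified reconstruction from the description and Tables 1-2 (the placement of the [1, 1, 1] inter-layer convolutions and the uniform filter count are assumptions), not the authors' released code:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters=64):
    # Two Conv3D layers joined by a [1,1,1] convolution, per Tables 1-2.
    x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv3D(filters, 1, padding="same", activation="relu")(x)
    return layers.Conv3D(filters, 3, padding="same", activation="relu")(x)

def unbalanced_unet(input_shape=(64, 32, 32, 3)):
    inputs = layers.Input(shape=input_shape)
    skips, x = [], inputs
    for _ in range(4):                       # 4 encoder blocks
        x = conv_block(x)
        skips.append(x)
        x = layers.MaxPooling3D(pool_size=2)(x)
    x = conv_block(x)                        # bottleneck
    for i in range(5):                       # 5 decoder blocks: the imbalance
        x = layers.UpSampling3D(size=2)(x)
        if i < 4:                            # parallel connections to the encoder
            x = layers.Concatenate()([x, skips[3 - i]])
        x = conv_block(x)
    # One extra decoder block means the output is at twice the input
    # resolution, e.g. (64, 32, 32) in -> (128, 64, 64) out.
    outputs = layers.Conv3D(64, 3, padding="same")(x)
    return tf.keras.Model(inputs, outputs)
```

Note how the fifth decoder block has no matching encoder block to concatenate with, which is precisely the "unbalance" that doubles the output resolution.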
Fig. 5(a) shows the pre-processed image of a CT scan. Fig. 5(b) shows the region proposal output, and Fig. 5(c) shows the class activation map after the 2nd encoder block. Class activation maps are used to inspect the outputs at each stage of the network. The output of the Unbalanced-UNet contains the region proposal for the spinal region present in the CT scan. In some cases the whole spine is present in the CT scan, and in others only a part of it; our network is able to extract the region proposals accurately in both cases.

D. VERTEBRA LABELLING AND CENTROID EXTRACTION
This subsection describes in detail the proposed network architecture for centroid extraction and labelling of vertebrae. This is accomplished using a modified version of the SpatialConfiguration-Net, or SCN [17]. The SCN architecture is divided into two parts: the U-Net part for local predictions and additional convolutional layers for spatial prediction. Local predictions are heat maps generated by the U-Net part of the SCN. These heat maps are locally accurate but spatially ambiguous. This is analogous to an image without the header file containing its metadata: the image is the local prediction, and the header file is the spatial prediction. The SCN combines the local and spatial predictions for a more robust prediction. The local heat maps (the output of the U-Net part of the SCN) are fed as input to the spatial prediction part of the SCN. The computations are performed at one-third resolution. The actual prediction is obtained as the convolution of the spatial prediction over the local prediction. We have taken the concept behind the SCN and applied it in three dimensions, instead of the conventional two. This modified model is named 3D SCN. Fig. 6 shows the architecture of the 3D SCN. The encoder of the 3D SCN consists of three convolutional blocks, with a kernel size of [3, 3, 3], ''same'' padding and leaky ReLU activation. The batch size is 100 and the number of channels is 3, so the input shape to the network becomes [100, 64, 48, 48, 3]. Each layer is connected to the next layer by a [1, 1, 1] convolution, and each convolutional block is connected to the next by a max_pooling3D layer with a pool size of [2, 2, 2]. The parameters of each layer are given in Table 3.
The encoder opens into a bottleneck, which has input dimensions of [16, 12, 12] and 64 filters, and then into the decoder block of the 3D SCN. The decoder has four convolutional blocks, each consisting of two Conv3D layers. The first layer has an input shape of [16, 12, 12], a kernel size of [3, 3, 3], 64 filters, ''same'' padding, and leaky ReLU activation. Each layer is connected to the next by a [1, 1, 1] convolution. The spatial feature extractor of the 3D SCN has three convolutional layers, so as to extract features from each of the axes. Each of these convolutional layers has input and output dimensions of [64, 32, 32], a kernel size of [9, 9, 9], 64 filters, ''same'' padding, and leaky ReLU activation. The last convolutional layer opens into an UpSample3D layer of size [2, 3, 3]. This layer is connected to the convolutional layer responsible for the spatial prediction, which has input dimensions of [128, 64, 64], a kernel size of [9, 9, 9], 64 filters, ''same'' padding and a tanh activation function.
The input and output voxel dimensions are same as the pre-processed CT scans.
The actual prediction is obtained by a simple convolution of the local and spatial predictions, as shown in (3). Equation (4) shows the loss function used, where y_true and y_pred are the ground truth and predicted centroids. Fig. 7 shows the predicted centroids of 25 vertebrae and the corresponding ground truth.
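A minimal sketch of combining the two heat maps is shown below. Following the original SCN formulation it uses a voxel-wise product; the paper describes the combination as a convolution, so treat this as an illustrative simplification rather than the authors' exact operation:

```python
import numpy as np

def combine_heatmaps(local_hm, spatial_hm):
    """Combine a locally accurate heatmap with a spatially coarse one.

    The joint response is strong only where both agree, which resolves
    local ambiguity between similar-looking vertebrae; the landmark is
    taken at the argmax of the combined response.
    """
    combined = local_hm * spatial_hm
    return np.unravel_index(np.argmax(combined), combined.shape)
```

Intuitively, a vertebra's local heatmap may fire at several plausible positions, while the coarse spatial prediction selects the one consistent with the configuration of the whole spine.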
E. VERTEBRA SEGMENTATION
This subsection explains the methodology used for vertebra segmentation. This work uses a 3D U-Net model for vertebra segmentation, due to its proven efficacy in segmentation tasks [1], [15], [28], [29]. Fig. 8 shows the 3D U-Net used for the segmentation of vertebrae.
The objective of the segmentation task is to segment each localized vertebra individually, as the vertebrae are highly similar in shape. The decoder blocks consist of Conv3D layers with 8 filters, a kernel size of [3, 3, 3], ''same'' padding and ReLU activation. Consecutive convolutional layers are connected by a [1, 1, 1] convolution. The blocks are connected by means of an UpSample3D layer of size 2. Each decoder block is connected in parallel to the encoder block of the same dimensions. The complete network parameters are given in Table 6. The 3D U-Net opens into the output layer, which has dimensions of [64, 96, 128], 128 filters, ''same'' padding and no activation function, as the outputs are the segmentation masks.
The learning rate used here is 10⁻⁵. The loss function used is the Dice loss, based on the ratio of twice the area of overlap to the sum of the individual areas, as shown in (5).
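A minimal NumPy sketch of the Dice loss of equation (5) (illustrative; the paper does not publish its training code):

```python
import numpy as np

def dice_loss(y_true, y_pred, eps=1e-7):
    """Dice loss: 1 minus twice the overlap divided by the sum of the
    individual areas; eps avoids division by zero on empty masks."""
    intersection = np.sum(y_true * y_pred)
    dice = (2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)
    return 1.0 - dice
```

The loss is 0 for identical masks and approaches 1 for disjoint ones, which makes it well suited to the heavy foreground/background imbalance of vertebra masks.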
Fig. 9 shows the segmentation of 25 vertebrae using the 3D U-Net and the corresponding segmentation mask produced by the proposed centroid extraction and segmentation framework. Fig. 10 shows views of sample images along the Y, X, and Z axes. Multiple views of the segmented output allow a better analysis of the efficacy of the proposed model.

IV. RESULTS AND DISCUSSION
The proposed segmentation and labelling framework is validated on the public VERSE'19 challenge dataset [22]. The efficiency of the proposed method for vertebra labelling and centroid extraction is experimentally verified using the rate of identification and the mean localization distance. The segmentation results are validated using the Dice score coefficient and the Hausdorff surface distance.

A. EVALUATION METRICS
To evaluate the results obtained for labelling and centroid extraction, two performance metrics based on the localization error and the identification rate are utilized. In the following definitions of the evaluation metrics, note that G denotes the ground truth and P denotes the predicted values.
The rate of identification is defined in [1] as the ratio of the number of correctly predicted vertebrae, N₁, to the total number of predicted vertebrae, N, as given in (6).

Rate of identification = N₁ / N  (6)
Equation (7) shows the localization distance measure, which is the total Euclidean error distance between the ground truth and predicted centroids. The mean localization distance, defined as the localization distance per vertebra, is computed as in (8).

Localization distance = Σ_{i=1}^{N} ||G_i − P_i||₂  (7)

Mean localization distance = (1/N) Σ_{i=1}^{N} ||G_i − P_i||₂  (8)
To evaluate the results obtained for the segmentation of vertebrae, the Dice score and the Hausdorff surface distance are used to compare the predicted and ground truth segmentation masks. The Dice score coefficient measures the similarity between two images, as given in (9).

Dice (P, G) = 2|P ∩ G| / (|P| + |G|)  (9)
Equation (10) shows the Hausdorff surface distance, which is the largest distance from a point in one set to the nearest point in the other set.

HD(P, G) = max { sup_{p∈P} d(p, G), sup_{g∈G} d(P, g) }  (10)

where d(a, B) = inf_{b∈B} d(a, b) quantifies the distance from a point a to the subset B, and sup and inf represent the supremum and infimum, respectively.

B. LABELLING AND CENTROID EXTRACTION RESULTS
The rate of identification for test-1 is 90.20%, with a mean localization distance of 4.97 mm. The rate of identification for test-2 is 91.47%, with a mean localization distance of 5.32 mm. Hence, the overall identification rate obtained is greater than 90%. A mean localization distance of ∼5 mm indicates an error of ∼5 mm in extracting each centroid. The results for both test sets are summarized in Table 7 and Table 8. The identification rates for individual vertebrae for the test-1 and test-2 sets are summarized in Table 9 and Table 10, which show the label of each vertebra, the total number of vertebrae with that label present in the ground truth data, the total number of predicted vertebrae with that label, and the rate of identification for each vertebra.
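The metrics above can be sketched with NumPy and SciPy; these are illustrative helper functions, not the challenge's official evaluation code:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def identification_rate(n_correct, n_predicted):
    """Rate of identification of equation (6): correctly predicted
    vertebrae over all predicted vertebrae."""
    return n_correct / n_predicted

def mean_localization_distance(gt_centroids, pred_centroids):
    """Mean Euclidean distance (equation (8)) between matched
    ground-truth and predicted centroids, in mm."""
    diffs = np.asarray(gt_centroids) - np.asarray(pred_centroids)
    return float(np.mean(np.linalg.norm(diffs, axis=1)))

def hausdorff_distance(p_points, g_points):
    """Symmetric Hausdorff distance of equation (10) between two
    point sets, e.g. sampled surface points of two masks."""
    return max(directed_hausdorff(p_points, g_points)[0],
               directed_hausdorff(g_points, p_points)[0])
```

The Dice score of equation (9) is one minus the Dice loss already defined for the segmentation network, so it is not repeated here.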
These results indicate that more than 90% of the vertebrae identified by the model are vertebrae actually present in the CT scans. The main reason the rate is not higher is that the organizers of the grand challenge did not include the 26th vertebra in the ground truth in all cases where it was present, whereas the proposed model labelled and extracted the centroid of the 26th vertebra in every such case. Another reason is that many of the CT scans contain partially cut-off vertebrae that were not added to the ground truth labels. A few of the ground truth annotations also omit vertebrae present at the extremes of the scan. The proposed network identifies all of these vertebrae, which is one of the major contributions of this work.

C. SEGMENTATION RESULTS
The segmentation results for both test sets are summarized in Table 7 and Table 8. The results indicate that the segmentation masks obtained for both test sets are highly similar to the ground truth segmentation masks. The model's multi-dimensional feature extraction leads to better and more robust segmentation performance. Fig. 11 shows the results of the three stages of the proposed pipeline for four test images from the VERSE'19 dataset.

D. COMPARISON WITH RELATED WORK
The capability of the proposed method for vertebra labelling and segmentation is compared with state-of-the-art methods. The experiments are performed on the publicly available VERSE'19 dataset, which provides robust and clinically representative data. The labelling methodology, evaluated on the rate of identification and mean localization distance, yielded the third-best and second-best results on the test-1 and test-2 data, respectively, when compared with the teams that participated in the VERSE'19 challenge. The rates of identification are 90.20% and 91.47%. For the first test set, the best-performing methods yielded rates of ∼95%. The proposed method falls behind because it cannot disregard vertebrae that have not been defined in the ground truth. Nevertheless, the proposed method outperforms the remaining teams by a significant margin; the increased model complexity leads to better learning of the network.
Compared to the other teams in the VERSE'19 challenge, our proposed methodology allows full automation, whereas a significant number of the teams rely on manual intervention for delineating the region proposals. Our methodology not only uses the heat maps obtained as region proposals; the nature of the filters utilised also allows them to be reused in the subsequent tasks.
The centroid extraction and labelling of vertebrae utilise the proposed 3D SCN, which combines local and spatial prediction. The number of filters is kept constant to increase the number of layers without increasing complexity. The U-Net part of the 3D SCN has a very similar architecture; the main difference lies in the spatial prediction part, which employs convolutional layers in series for feature extraction. Our proposed method increases the number of convolutional layers and the resolution at which the images are processed, extracting features along multiple axes so that the centroids are located accurately. Compared with other related work, the proposed methodology shows promising results.
The segmentation task is solved using a 3D U-Net model. The model performs binary segmentation to separate the vertebrae from the background. The main modification lies in the inputs fed to the network: a concatenation of the cropped region around each centroid and its localized heat map. In contrast, most other teams in the challenge pass the entire CT scan along with its entire region proposal. Although this reduces the computational load on the network, it does not segment each vertebra accurately. Although our model requires higher computational power, it consistently outperforms the best-performing teams.

E. DISCUSSION
Labelling and segmentation of the spine are challenging due to highly varying fields of view across datasets, large scan sizes, and the similar shapes of adjacent vertebrae [1]. Complete automation of these tasks plays an important role in clinical treatment and surgical planning of spine-related conditions. In this paper, we proposed a fully automated pipeline for vertebra labelling and segmentation.
A novel Unbalanced-UNet architecture was proposed for extracting the region proposals. The proposed network is built to focus on the feature maps rather than on the feature extraction. The unbalanced nature of the network facilitates the extraction of accurate region proposals. These feature maps lighten the computational load and simplify the segmentation task.
The data from the dataset are first pre-processed. The pre-processed data are fed into the proposed Unbalanced-UNet to extract the region proposal for the spine in each CT image. This constitutes the spine localization task.
For vertebra centroid extraction, the pre-processed data are fed into the proposed 3D SCN, and individual centroids are extracted. The 3D SCN overcomes the inability of the U-Net to retain spatial information, and it successfully fits data with highly variable fields of view.
Following spine localization and centroid extraction, a 3D U-Net is employed for extracting the segmentation masks of the vertebral column. For the segmentation task, we concatenate a 4 × 4 × 4 mm³ region around each centroid (extracted using the 3D SCN) cropped from the region proposal of the CT image with the corresponding region from the pre-processed data; the resulting volume is 4 × 4 × 8 mm³. Batches of such volumes are fed into the 3D U-Net for training. The 3D U-Net demonstrated the ability to learn high-level contextual information effectively, which improves the performance of the deeper layers of the network and leads to better segmentation results. If the cropped region from the region proposals is not concatenated, model performance drops by around 8% in Dice score, as shown in Table 11. This shows that the Unbalanced-UNet is an essential part of the pipeline.
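The input construction described above can be sketched as follows. This is a minimal NumPy sketch under our own assumptions: a voxel spacing of 1 mm (so a 4 mm extent maps to 4 voxels), concatenation along the first spatial axis, and illustrative helper names not taken from the paper; border handling is omitted for brevity.

```python
import numpy as np

def crop_around(volume, centroid, half=2):
    """Crop a cube of side 2*half voxels centred on `centroid`
    (assumes the centroid lies far enough from the volume border)."""
    z, y, x = (int(round(c)) for c in centroid)
    return volume[z - half:z + half, y - half:y + half, x - half:x + half]

def build_unet_input(ct, heatmap, centroid):
    """Concatenate the pre-processed CT crop and the region-proposal crop
    along one axis, turning two 4x4x4 blocks into one 8x4x4 volume."""
    ct_crop = crop_around(ct, centroid)
    hm_crop = crop_around(heatmap, centroid)
    return np.concatenate([ct_crop, hm_crop], axis=0)
```

Batches of such concatenated volumes, one per detected centroid, would then be fed to the 3D U-Net; the region-proposal channel tells the network which vertebra in the crop is the segmentation target.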
The performance of the models is analysed using the following evaluation metrics: rate of identification, mean localization distance, Dice score, and Hausdorff distance. The localization and segmentation results obtained are clinically acceptable.
An independent-samples t-test is used to check the significance of the results against the related work at p < 0.05. The null hypothesis assumed here is that M − µ = 0, where M is the sample mean and µ is the hypothesized mean; that is, the two group means are equal. The statistical test results are incorporated in Table 7 and Table 8 for the segmentation metrics. The differences are significant at p < 0.05, rejecting the null hypothesis. Hence, our model performs significantly better than the related work of the VERSE'19 challenge teams.
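The mechanics of such a test can be illustrated as below. The per-scan Dice values here are synthetic placeholders, not the paper's data; `scipy.stats.ttest_ind` is one standard implementation of the independent-samples t-test.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical per-scan Dice scores (%) for two methods -- placeholders only.
rng = np.random.default_rng(0)
ours = rng.normal(93.0, 1.5, size=40)       # one value per test scan
baseline = rng.normal(90.0, 1.5, size=40)

t_stat, p_value = ttest_ind(ours, baseline)
if p_value < 0.05:
    print("reject H0: the group means differ significantly")
```

In the paper's setting, each group would contain one metric value per test scan for a given method, and the test is repeated for each baseline method and each segmentation metric.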
The proposed method is compared with related papers on vertebra labelling and segmentation; the results of this comparison are presented in Table 12. The high Dice score of 93% indicates the strong potential of the proposed framework for clinical application.
In order to analyse the robustness of the proposed methodology, the trained model is tested on another public dataset, UWSpineCT [30], [31]. Twenty random samples are taken for testing. The proposed method achieves a mean identification rate of 93.02% and a mean localization distance of 3.74 mm over all vertebrae. Table 13 shows that, compared with related works, our method achieves the best identification rate. Additionally, we measured the processing time of each stage. The average time to localize a spinal region is 41 seconds, the average time to extract the centroids and labels of the vertebrae is 25 seconds, and the average time for vertebra segmentation is 47 seconds. Hence, the whole pipeline takes about 113 seconds per CT scan. This analysis shows that the model is fast and efficient. Future work includes deploying the model in real-time clinical settings and validating it on larger clinical data under robust conditions.
A limitation of the proposed methodology is that the 3D SCN cannot ignore the 26th vertebra, which poses a problem in computing the evaluation metrics, as this vertebra is not included in the ground truth. Another drawback is the memory and computational requirements for training and testing. Future work includes incorporating all three networks into a single network, which would also help reduce the memory and computational requirements.

FIGURE 2. Overview of the proposed methodology.

FIGURE 7. Visual comparison of ground truth (blue) and predicted (red) centroids of each vertebra.
The rate of identification and Dice score are measured in percentages, where 100% indicates the best result and 0% the worst. The mean localization distance and Hausdorff surface distance are measured in millimetres (mm), where the lowest value of 0 mm indicates the best result.

B. RESULTS OF LABELLING AND CENTROID EXTRACTION
The VERSE'19 dataset contains two test sets, test-1 and test-2, of 40 CT scans each [22]. The proposed model was trained on the training set, containing 80 CT scans. The model takes into consideration the various features exhibited by the vertebrae owing to their varying configurations in space.

FIGURE 10. Segmentation results of different images using the proposed method: view of the segmented vertebrae along the (a) Y axis, (b) X axis, and (c) Z axis.

TABLE 3. Encoder parameters of the 3D SCN. The encoder comprises pairs of Conv3D layers; the input dimension of the first layer is [64, 48, 48].

TABLE 7. Comparison of results for test-1 data with VERSE'19 challenge teams.

TABLE 8. Comparison of results for test-2 data with VERSE'19 challenge teams.

TABLE 9. Identification rates of individual vertebrae for test-1 of the VERSE'19 dataset.

TABLE 10. Identification rates of individual vertebrae for test-2 of the VERSE'19 dataset.

TABLE 11. Results of vertebra segmentation without and with the Unbalanced-UNet model.

TABLE 12. Comparison of the proposed method with related papers on the VERSE'19 dataset.

TABLE 13. Identification rates on the UWSpineCT test data.