Improved Abdominal Multi-Organ Segmentation via 3D Boundary-Constrained Deep Neural Networks

Quantitative assessment of the abdominal region from CT scans requires the accurate delineation of abdominal organs. Therefore, automatic abdominal image segmentation has been the subject of intensive research for the past two decades. Recently, deep learning-based methods have achieved state-of-the-art performance for 3D abdominal CT segmentation. However, the complex characterization of abdominal organs with weak boundaries prevents deep learning methods from segmenting them accurately. Specifically, the voxels on the boundary of organs are more vulnerable to misprediction due to their highly varying intensities. This paper proposes a method for improved abdominal image segmentation by leveraging organ-boundary prediction as a complementary task. We train 3D encoder-decoder networks to simultaneously segment the abdominal organs and their boundaries via multi-task learning. We explore two network topologies based on the extent of weights shared between the two tasks within a unified multi-task framework. In the first topology, the whole-organ prediction task and the boundary detection task share all the layers in the network except for the last task-specific layers. The second topology employs a single shared encoder but two separate task-specific decoders. The effectiveness of utilizing the organs’ boundary information for abdominal multi-organ segmentation is evaluated on two publicly available abdominal CT datasets: Pancreas-CT and the BTCV dataset. The improvements shown in segmentation results reveal the advantage of the multi-task training that forces the network to pay attention to ambiguous boundaries of organs. A maximum relative improvement of 3.5% and 3.6% is observed in Mean Dice Score for Pancreas-CT and BTCV datasets, respectively.


Introduction
Multi-organ segmentation on abdominal Computed Tomography (CT) scans is an essential prerequisite for computer-assisted surgery and organ transplantation [1], [2]. In particular, quantitative assessment of abdominal regions enables accurate organ dose calculation, which is required in numerous radiotherapy treatment options.
Erroneous delineation of abdominal organs prevents harnessing the benefits of radiotherapeutic advancements. In clinical practice, physicians delineate abdominal organs using manual segmentation tools, which are time-consuming, observer-dependent, and error-prone. With the increased use of imaging facilities and the production of a large number of abdominal CT scans, automated, robust, and efficient organ-delineation tools have become essential [2], [3], [4]. Automatic segmentation tools delineate the abdominal structures much faster and overcome issues such as variability in human expertise and inherent subjectivity.
Abdominal CT scans often present weak inter-organ boundaries characterized by regions of similar voxel intensities, which in turn result in low-contrast representations. Such appearances are usually caused by the representation of abdominal soft tissues in a narrow band of Hounsfield unit (HU) values.
Another factor that compounds the already complex representation of abdominal organs is the existence of artifacts caused by blood flow and by respiratory and cardiac motion. Accurate delineation of abdominal organs with unclear boundaries and complex geometrical shapes is one of the ongoing challenges that hinder abdomen-related clinical diagnosis. Earlier methods proposed for abdominal multi-organ segmentation were mainly based on multi-atlas [5], [6] or statistical models [7], [8]. Some methods also made use of handcrafted or learned features to segment abdominal organs [9], [10]. However, the recent Fully Convolutional Network (FCN) based approaches have presented better results due to improved organ representation learning [2], [11]. Because they preserve the image structure and provide efficient learning as well as inference, FCN-based methods are currently considered state-of-the-art for abdominal multi-organ segmentation [2], [12], [13], [14]. Specifically, these networks follow the encoder-decoder architectural design [15]. In such networks, the shallow layers in the encoder aim to extract low-level features, and the deep layers encode high-level features. The mirrored decoder maps the learned features back to generate an output of the same size as the input, with skip connections assisting in retaining the crucial features extracted in the encoding path [15].
Existing FCN-based methods for abdominal multi-organ segmentation employ either 2D or 3D convolutional architectures [13], [12]. 2D methods process the CT scans in a slice-by-slice fashion and predict the organ labels on individual slices [13]. Despite being memory- and parameter-efficient, 2D methods are unable to make full use of 3D contextual information [2]. 3D methods make use of rich volumetric context by processing the whole CT volume and generating voxel maps in a single forward propagation pass, leading to better abdominal CT segmentation performance than 2D approaches [16], [17].
The existing 3D methods have primarily focused on designing better architectures for improved abdominal multi-organ representation learning [12], [2]. However, they treat all the anatomical parts within a single organ equally since they rely solely on voxel-level information and do not specifically focus on improving the segmentation of voxels in vulnerable regions of organs. As an example, we highlight some of the important characteristics of abdominal organs in Fig. 1.
From Figures 1a and 1b, it can be noticed that adjacent organs have weak contours which sometimes touch each other.
As an example, observe the low-contrast and touching boundaries between the stomach and the pancreas. Moreover, the 3D multi-organ visualization in Fig. 1c shows that the adjacent positioning of organs in the abdominal cavity aggravates the complex spatial relationship among the organs. Simultaneously segmenting the abdominal organs with soft contours and complex spatial relationships is a challenging task.
The boundaries of anatomical regions in medical scans serve as an important cue for facilitating manual and automated delineation [18]. Numerous existing deep learning-based studies have leveraged the learning of features corresponding to the boundaries of regions for improved medical image segmentation via the multi-task learning paradigm [19], [20], [21], [22], [23]. In recent years, the deep multi-task learning paradigm has been widely used due to its potential to solve multiple tasks in one forward propagation and its ability to learn better representations because of the multiple supervisory signals [24], [25]. In this paper, we propose to improve the segmentation of abdominal organs on CT scans by enhancing the segmentation of organ boundaries. In particular, we train 3D deep learning networks to simultaneously predict the boundary and the entire region of organs. The inclusion of boundary information is motivated by the fact that the voxels on the boundary of organs are more vulnerable to misprediction because of their ambiguous appearance and complex relationship with adjacent organs. Specifically, our work makes the following contributions:
(i) We develop an end-to-end trainable 3D multi-task learning framework that simultaneously predicts the voxel labels of abdominal organs and their corresponding boundaries. By integrating the boundary features, our proposed boundary-constrained 3D deep learning framework focuses on the accurate prediction of the edges of organs in addition to whole organs.
(ii) Instead of relying on a single network topology, we explore and compare two network topologies for conducting multi-task learning. In the first topology, the whole encoder-decoder network is shared, with separate task-specific prediction layers at the end for predicting the boundary and entire-organ maps. In the second topology, an encoder is shared, with separate task-specific decoders that decode the features jointly learned by the shared encoder to predict the boundary and organ probability maps. With an extensive comparison, we show that integrating boundary features consistently improves the multi-organ segmentation performance, independent of the multi-task network design.
(iii) We utilize three state-of-the-art 3D encoder-decoder architectures, i.e., UNet [26], UNet++ [27], and Attention-UNet [28], as baseline networks for evaluating the effect of incorporating boundary information. We modify each baseline architecture according to our proposed multi-task topologies. We demonstrate significant performance improvements with a negligible increase in trainable parameters.
(iv) We validate the performance of baseline and counterpart boundary-constrained models on two publicly available datasets (Pancreas-CT [29] and BTCV [30]) using Dice Score, Average Hausdorff Distance, Recall, and Precision. Furthermore, we conduct additional experiments to evaluate the improvement in the segmentation of regions around the boundaries.
The results show that the boundary-constrained networks learn feature representations that focus on both accurate organ segmentation and the challenging parts around the border of the organs.
The rest of the article is organized as follows. In Section 2, we review the existing methods for abdominal multi-organ segmentation.
Section 3 describes our framework for incorporating the boundary information into the 3D fully convolutional networks, including the multi-task loss function and the details of boundary-constrained network topologies. Next, we describe the dataset specifications and implementation details in Section 4. We then present the experimental results, comparisons with existing single-task approaches, and an in-depth performance analysis of boundary-constrained models in Section 5. Finally, we discuss the important highlights and some directions for future work in Section 6 and present the conclusion in Section 7.

Related Work
Segmentation of anatomical structures from abdominal scans is a prerequisite for various high-level CT-based clinical applications. Existing computerized tools for abdominal image segmentation are based on either deep learning or non-deep learning methods. In this section, we first briefly discuss the non-deep learning methods (Section 2.1) and then present a review of deep learning-based methods for abdominal multi-organ segmentation (Section 2.2). We conclude this section with a discussion of multi-task deep neural networks that employ a complementary boundary learning task to improve medical image segmentation (Section 2.3).

Non-deep learning-based abdominal organs segmentation
Earlier methods proposed for abdominal multi-organ segmentation have primarily utilized registration-based approaches [7], [8]. Among the registration-based approaches, the widely used ones include statistical shape models [7], [8] and multi-atlas label fusion techniques [5], [6]. The development of statistical models requires registration of training images for estimating the shape or appearance of anatomical organs, followed by fitting the constructed models to test images for generating segmentations [31], [32]. Multi-atlas-based methods utilize an atlas created using multiple labelled images in the training set, and the test image is segmented by propagating the reference segmentations. Atlases are constructed by capturing the prior anatomical knowledge relevant to target organs. However, it is difficult to build an adequate model to capture the large variability of the deformable organs with limited data [33]. Furthermore, the performance of both these approaches is restricted by image registration accuracy.
Registration-free approaches train a classifier using either handcrafted or learned features to segment abdominal images [9]. Extraction of robust and deformation-invariant features relies on expert knowledge about abdominal organs [34].
Because they learn features automatically, FCN-based methods have rapidly replaced the traditional solutions that require image registration or handcrafted features and have shown improved performance for abdominal CT segmentation [2], [12], [13], [35].

Fully Convolutional Networks for abdominal multi-organ segmentation
In recent years, the Fully Convolutional Network (FCN) and its variants (e.g., UNet [15]) have become a common choice for medical image segmentation. This dominance can be attributed to their ability to learn effective task representations and their efficient inference. UNet has an encoder-decoder style architecture with skip connections joining the encoding and decoding layers on the same level. Despite being trained from scratch, UNet demonstrated state-of-the-art performance for various medical image segmentation tasks [36], [37]. Several modified architectures built on top of UNet were subsequently proposed, e.g., UNet++ [27] and Attention-UNet [28].
Existing deep learning-based studies for abdominal multi-organ segmentation have utilized 2D or 3D convolutional networks.
2D methods are less parameter-intensive; however, they cannot exploit the 3D contextual information and consequently provide less accurate organ delineation. 3D convolutional networks are equipped with 3D convolutions, 3D pooling, and 3D normalization to exploit the rich volumetric context and generate dense voxel-wise predictions [26]. Advances in efficient 3D convolutional implementations and increased GPU memory have enabled the adoption of 3D convolutional models for abdominal multi-organ segmentation [38], [3].
Roth et al. [16] proposed a cascaded architecture based on two 3D UNets, where the first UNet is trained to separate the abdominal area from the background, and the second utilizes the output of the first UNet to simultaneously segment the abdominal organs. Peng et al. [39] delineated abdominal organs using a 3D UNet with residual-learning based units (ResNets) to calculate patient-specific CT organ dose. In another study [2], abdominal organs are segmented using a 3D FCN with densely connected units based on dilated convolutions. Heinrich et al. [11] leveraged 3D deformable convolutions to spatially adapt the receptive field for abdominal multi-organ segmentation. In [40], abdominal scans were segmented using a 3D deeply supervised patch-based UNet with grid-based attention gates to encourage the network to focus on useful salient features propagated through the skip connections. Some existing methods have employed post-processing steps, including level sets [3] and graph cut [4], to refine the initial segmentation maps obtained from 3D deep convolutional networks.
In the efforts mentioned above, the existing 3D methods have mostly emphasized developing better deep learning architectures and did not attempt to improve the segmentation of challenging parts of abdominal organs, e.g., voxels that belong to the contour of organs and regions in the vicinity of the organ contour. The fuzzy appearance of organ boundaries and the low contrast between adjacent abdominal structures make the voxels belonging to these regions more susceptible to wrong label prediction.

Boundary-constrained medical image segmentation
Several existing deep learning-based medical image segmentation methods have utilized the boundary information of regions of interest to overcome the misprediction of boundary pixels [19], [20], [21], [41]. In these methods, the networks are trained in a multi-task learning fashion to simultaneously predict the probability maps of entire organs and their corresponding boundaries. Most of these methods have resorted to the hard-parameter sharing technique, where a single network contains shared and task-specific parameters and is jointly trained to solve multiple tasks.
Chen et al. [19] segmented the glands and their corresponding boundaries via multi-task training. By training the model to learn the co-representations, the model achieved better gland segmentation performance than the single-task models. In [42], a dual-decoder-based network is presented that simultaneously detects the boundaries and predicts the semantic labels of cells. Features from the boundary-decoding path were concatenated with those learned in the entire-cell-region decoding path via additional skip connections. This led to improved histopathological image segmentation performance. In [43], boundary and distance maps were used for improved polyp and optic disk segmentation, respectively. Tan et al. [20] proposed a multi-task medical image segmentation network consisting of a single encoder and separate dedicated arms for decoding regions and boundaries. The study was evaluated on numerous applications, including MR femur and CT kidney segmentation. Zhang et al. [44] presented an edge-based deeply supervised network for predicting the regions of interest and their corresponding boundaries. The method was validated for retinal, x-ray, and CT image segmentation. Wang et al. [45] proposed a two-stream parallel model in which one stream was trained to segment regions and the other to detect boundaries, followed by fusion of the contour and region prediction maps. Lee et al. [41] proposed a framework that predicts boundary keypoint maps and makes use of an adversarial loss for improved boundary preservation in medical image segmentation.
Given the challenge presented by voxels on the organs' boundaries and the evidence in the literature that focusing on boundaries is beneficial for performance, we integrate organ-boundary prediction as an auxiliary task into the training of state-of-the-art 3D medical image segmentation networks. Since the design choice of network topology impacts the learning process, we explore two multi-task network designs and analyze their performance. The boundary co-training resulted in improved performance on abdominal CT segmentation tasks compared to several state-of-the-art 3D fully convolutional baseline architectures.

Proposed Method
In this section, we first describe the boundary-constrained loss for training the 3D encoder-decoder network to simultaneously predict the boundaries and entire abdominal organ regions via multi-task learning (Section 3.1), followed by a presentation of our proposed multi-task network topologies (Section 3.2). After that, we discuss the architecture of the 3D networks that we use as baselines in our work (Section 3.3). Finally, we present the architectural design of the counterpart 3D boundary-constrained models (Section 3.4).

Boundary-Constrained Loss
Consider a 3D encoder-decoder network trained to predict the voxel labels of an abdominal CT scan with W × H × Z dimensions, where W, H, and Z denote the width, height, and depth of the scan, respectively. Such a network takes an abdominal multi-organ CT scan as an input and outputs a labelled voxel map of the same size as the input. To utilize the boundary information of abdominal organs for improved representation learning, we train the network to predict the 3D organ-semantic masks and 3D organ boundaries in one forward propagation pass. We formulate this problem using a multi-task learning paradigm where multiple tasks are learned jointly using shared and task-specific representations. The loss L for this multi-task learning problem is a weighted combination of per-task losses: the organ segmentation loss L_RS and the organ boundary detection loss L_BD. We use the multi-class dice loss [46] for the multi-organ segmentation task (Eqs. 1-3), where ŷ_{i,c} and y_{i,c} denote the 3D probability map and ground-truth mask, respectively, for organ class c in the i-th abdominal CT scan, and C denotes the number of organ classes.
Here, p(x_{i,n}) represents the label probability of the n-th voxel in the i-th scan, and N refers to the total number of voxels in a scan.
For the boundary prediction task, we use the binary cross-entropy loss (Eq. 4), where ê_i and e_i represent the edge probability map and the corresponding ground truth, p(x_{i,w,h,z}) represents the edge probability of the voxel at position (w, h, z) in the i-th scan, and θ_s represents the weights of the entire deep multi-task encoder-decoder network.
The combined total loss L is minimized with respect to the parameters θ_s, as shown in Eq. 5. Our goal is thus to evaluate whether a network can learn more robust features, and subsequently produce improved organ segmentations, by being trained to explicitly recognize the boundaries.
M and λ represent the total number of CT scans in the training set and the weight assigned to the edge detection loss in Eq. 5, respectively. We hypothesize that the additional boundary loss (L_BD) imposes a larger penalty on erroneous contour voxels and subsequently pushes the optimization of the segmentation network towards solutions with more accurate boundaries. This should strengthen the ability of a boundary-constrained network to extract features that account for both the semantic abdominal organ regions and their boundaries.
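The displayed forms of Eqs. 1-5 do not appear in this text. One reconstruction that is consistent with the symbol definitions in this section is sketched below. The normalization constants (1/C, 1/N, 1/M), the per-voxel index n, and the exact form of Eq. 3 are assumptions and may differ from the authors' typesetting.

```latex
\begin{align}
\mathcal{L}_{RS}(\hat{y}_i, y_i) &= 1 - \frac{1}{C}\sum_{c=1}^{C}
    \mathrm{Dice}(\hat{y}_{i,c}, y_{i,c}) \tag{1}\\
\mathrm{Dice}(\hat{y}_{i,c}, y_{i,c}) &=
    \frac{2\sum_{n=1}^{N}\hat{y}_{i,c,n}\, y_{i,c,n}}
         {\sum_{n=1}^{N}\hat{y}_{i,c,n} + \sum_{n=1}^{N} y_{i,c,n}} \tag{2}\\
\hat{y}_{i,c} &= \{\, p(x_{i,n}) \,\}_{n=1}^{N}
    \quad \text{(probabilities for class } c\text{)} \tag{3}\\
\mathcal{L}_{BD}(\hat{e}_i, e_i) &= -\frac{1}{N}\sum_{n=1}^{N}
    \Big[ e_{i,n}\log\hat{e}_{i,n} + (1 - e_{i,n})\log(1 - \hat{e}_{i,n}) \Big] \tag{4}\\
\mathcal{L}(\theta_s) &= \frac{1}{M}\sum_{i=1}^{M}
    \Big[ \mathcal{L}_{RS}(\hat{y}_i, y_i) + \lambda\, \mathcal{L}_{BD}(\hat{e}_i, e_i) \Big] \tag{5}
\end{align}
```

In this reading, n runs over the voxel positions (w, h, z), ê_{i,n} is the edge probability p(x_{i,w,h,z}), and setting λ = 0 recovers the single-task dice objective.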

Boundary-Constrained Network Topologies
Multi-task learning is generally formulated via hard-parameter sharing or soft-parameter sharing. In the hard-parameter sharing paradigm, multiple tasks share a subset of jointly optimized parameters, whereas task-specific parameters are optimized separately. In soft-parameter sharing, each task is parameterized using its own set of parameters, which are jointly regularized using constraints [47]. In practice, hard-parameter sharing approaches incur much lower parameter and computational costs. In our work, we formulate the multi-task learning problem via hard-parameter sharing to train the encoder-decoder network to perform multiple tasks, i.e., organ segmentation and boundary detection. For deep neural networks, the hard-parameter sharing approach is realized by sharing some network layers between the tasks while keeping some layers task-specific.
We explore two different network topologies to conduct multi-task training, as shown in Figures 2a and 2b. The motivation for exploring multiple topologies is to investigate the impact of sharing a larger or smaller number of network parameters between the two tasks. We explain these multi-task topologies below.

Task-Specific Output Layers (TSOL)
The first multi-task topology that we explore is formulated by appending two separate prediction layers for predicting the boundaries and semantic organ masks. This topology employs an encoder-decoder network whose weights are shared between the tasks, except for the last output layers, as shown in Fig. 2a. Technically, it encourages the use of compact and tightly shared feature representations. This configuration has only negligibly more parameters than the single-task network. We denote this configuration as TSOL.

Task-Specific Decoders (TSD)
In the second multi-task topology, we modify the 3D encoder-decoder model to have a single shared encoder but two separate decoding arms for predicting the semantic regions and boundaries. The sibling decoding arms upsample the region and boundary maps separately. This formulation results in sparser representation sharing between the two tasks since the decoders are parameterized separately, as shown in Fig. 2b. The presence of two synthesis paths results in significantly more parameters than in the counterpart single-task network. We refer to this configuration as TSD.
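To make the difference in parameter cost between the two topologies concrete, the following sketch counts the extra convolutional parameters each one adds. It is an illustration under stated assumptions rather than the authors' accounting: the channel widths (16, 32, 64, 128, 256) assume doubling between the reported minimum and maximum feature counts, the boundary head is assumed to output a single channel, upsampling is assumed to have no learned weights, and batch-normalization parameters are ignored. The function names are hypothetical.

```python
def conv3d_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Weights plus biases of one 3D convolution with a cubic kernel."""
    return in_ch * out_ch * kernel ** 3 + out_ch


def decoder_params(widths=(16, 32, 64, 128, 256)) -> int:
    """Parameters of a UNet-style decoder with two 3x3x3 convolutions per stage.

    Assumption: each stage concatenates the upsampled deeper map with the
    encoder skip of the next shallower level before the first convolution.
    """
    total = 0
    for level in range(len(widths) - 1, 0, -1):
        deep, shallow = widths[level], widths[level - 1]
        total += conv3d_params(deep + shallow, shallow, 3)  # after concatenation
        total += conv3d_params(shallow, shallow, 3)
    return total


# TSOL: one extra 1x1x1 head mapping the last 16 feature channels to an
# assumed single-channel boundary map.
tsol_extra = conv3d_params(16, 1, 1)

# TSD: a full second decoder plus its own 1x1x1 boundary head.
tsd_extra = decoder_params() + conv3d_params(16, 1, 1)

print(tsol_extra, tsd_extra)
```

Under these assumptions the TSOL head adds 17 parameters, whereas the second decoder in TSD adds roughly 2.35 million, which mirrors the qualitative contrast between the two topologies described above.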

3D Baseline models
We use UNet [26], UNet++ [27], and Attention-UNet (Att-UNet) [28] as our baseline models. We illustrate these models in the Supplementary material (see Figures S1-S3). These architectures are based on the encoder-decoder design and are extended to segment 3D volumes by replacing the 2D convolutions, pooling, and upsampling with their 3D counterparts. Each baseline model processes a 3D input scan with dimensions W × H × Z and outputs a predicted organ-label map of the same size as the input. The encoder of the model contains five convolutional blocks with pooling layers, and the decoder comprises four upsampling layers. Each convolutional block in the encoder consists of two convolutional layers with 3 × 3 × 3 filters, followed by batch normalization and Exponential Linear Unit (ELU) activation [48]. We use padded convolutions to keep the output dimensions of convolutional layers the same as the input dimensions. A 2 × 2 × 2 max pooling layer with a stride of two in each dimension is placed between every two consecutive convolutional blocks to downsample the feature maps. Bilinear interpolation layers are used in the decoder to upsample the extracted feature maps in each dimension. The feature maps in the decoder are concatenated with the equal-sized representations learned in the encoder via skip connections. The concatenated feature maps are then transformed using convolutional blocks similar to those used in the encoder. The last 1 × 1 × 1 convolutional layer maps the feature channels to the class labels, followed by a softmax activation.
The resolution of the smallest feature map is 9 × 9 × 9, and the feature counts at the first and last encoding stages are 16 and 256, respectively. Note that the original UNet++ model is trained with deep supervision driven by output layers of UNets with varying depths; however, we train UNet++ without deep supervision to limit the computational expense.
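The encoder geometry described above can be checked with a short calculation. The sketch below assumes size-preserving (padded) convolutions, one 2 × 2 × 2 pooling between consecutive blocks, and channel doubling per block; the doubling rule is an assumption that happens to connect the reported 16 and 256 feature counts over five blocks. The function name is hypothetical.

```python
def encoder_shapes(input_size=144, n_blocks=5, base_channels=16):
    """Return (spatial size per axis, channels) after each encoder block."""
    shapes = []
    size, channels = input_size, base_channels
    for block in range(n_blocks):
        shapes.append((size, channels))
        if block < n_blocks - 1:  # pooling sits between blocks only
            size //= 2            # 2x2x2 max pooling with stride 2
            channels *= 2         # assumed channel doubling
    return shapes


shapes = encoder_shapes()
for size, channels in shapes:
    print(f"{size} x {size} x {size} with {channels} channels")
```

With the 144 × 144 × 144 inputs described in Section 4.1, four poolings give 144 / 2^4 = 9, in agreement with the 9 × 9 × 9 resolution stated above.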

3D Boundary-constrained models
To utilize the boundary information of the organs, we train the baseline models (given in Section 3.3) to predict the organ boundaries along with the organs. We propose two multi-task network topologies (shown in Fig. 2) for integrating the boundary information, and we modify each baseline model, i.e., 3D UNet, 3D UNet++, and 3D Att-UNet, according to both topologies.
To modify the baseline models according to the first multi-task topology (TSOL) (shown in Fig. 2a), we append a separate head at the end to predict boundaries along with the organs. We refer to these models as UNet-MTL-TSOL, UNet++-MTL-TSOL, and Att-UNet-MTL-TSOL, as shown in Figures 3, 5 and 7, respectively. To modify the baseline models according to the second boundary-constrained multi-task topology (TSD) (shown in Fig. 2b), we change the baseline models (3D UNet, 3D UNet++, and 3D Att-UNet) to have separate decoding paths followed by prediction layers for predicting boundaries and organs. We refer to the models modified according to this topology as UNet-MTL-TSD, UNet++-MTL-TSD, and Att-UNet-MTL-TSD and show them in Figures 4, 6 and 8, respectively. Note from Fig. 6 that we only use the skip connections between the encoder with the greatest depth and the boundary decoder instead of utilizing feature maps extracted by nested UNets of all depths. This design choice is made to limit the number of parameters in UNet++-MTL-TSD. Furthermore, observe from Fig. 8 that we do not employ the attention mechanism while decoding the boundary features in Att-UNet-MTL-TSD.

Experimental details
This section first describes the datasets used to validate our study and the pre-processing we perform on the datasets (Section 4.1), followed by implementation details (Section 4.2). Finally, we conclude this section by discussing the metrics used to evaluate baseline and boundary-constrained models (Section 4.3).

Description of datasets and data preprocessing
We utilize two publicly available abdominal CT datasets (Pancreas-CT and BTCV) to evaluate baseline and boundary-constrained models. Abdominal scans in Pancreas-CT were acquired at the National Institutes of Health Clinical Center from pre-nephrectomy healthy kidney donors and subjects with neither major abdominal pathologies nor pancreatic cancer lesions [29]. The BTCV dataset consists of abdominal scans acquired at the Vanderbilt University Medical Center from metastatic liver cancer patients or post-operative ventral hernia patients [33].

Pancreas-CT Dataset (TCIA-43)
The Pancreas-CT dataset [29], [49] comprises 82 abdominal contrast-enhanced 3D CT scans and was initially provided with manually drawn contours of the pancreas [50], [49]. Recently, 43 scans from this dataset have been re-annotated to include the segmentation of the liver, duodenum, stomach, esophagus, spleen, gallbladder, and left kidney [51]. Therefore, we use only the 43 scans that have been re-annotated to incorporate labels for multiple organs.
We first crop a region-of-interest from the CT scans and the corresponding ground-truth labels using the bounding box coordinates provided with the dataset [51]. The cropping step ensures that the models are fed only the foreground inputs without the redundant background region. The cropped region-of-interest from the CT scans and ground-truth labels is then resampled to a common dimension of 144 × 144 × 144 voxels. We randomly divide the available 43 studies into 28, 5, and 10 for training, validation, and test, respectively.

BTCV Dataset
BTCV was released [30], [33] as part of a challenge held in conjunction with MICCAI 2015. The challenge compared abdominal organ segmentation algorithms on 3D CT scans. Our work focuses on the segmentation of eight organs from the BTCV dataset, i.e., liver, duodenum, stomach, esophagus, spleen, gallbladder, left kidney, and pancreas.
For the BTCV dataset, we utilize the bounding box coordinates given with the dataset for cropping the region-of-interest for both the CT scans and ground-truth labels [51]. As with the Pancreas-CT dataset, the cropped region-of-interest is then resampled to a common dimension of 144 × 144 × 144 voxels. Finally, we randomly divide the available 47 studies into 32, 5, and 10 for training, validation, and test, respectively.
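The random partitioning of studies for both datasets can be sketched as follows. The seed, the integer study identifiers, and the function name are hypothetical; the source only states the split sizes and that the division was random.

```python
import random


def split_studies(study_ids, n_train, n_val, n_test, seed=0):
    """Shuffle study identifiers once and cut them into three disjoint subsets."""
    ids = list(study_ids)
    if n_train + n_val + n_test != len(ids):
        raise ValueError("split sizes must add up to the number of studies")
    random.Random(seed).shuffle(ids)  # fixed seed so the split is reproducible
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Split sizes reported for the two datasets.
pancreas_ct_split = split_studies(range(43), 28, 5, 10)
btcv_split = split_studies(range(47), 32, 5, 10)
```

Fixing the seed is one simple way to keep the same split across all experiments.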
For both the Pancreas-CT and BTCV datasets, we applied random affine transformations to augment the data but did not observe a significant difference in segmentation performance on the validation set. Therefore, we did not use any data augmentation. We used the same dataset splits for all the experiments. To analyze the occurrence of each organ in the dataset, we present the organs' occupancy ratio in Figures 9 and 10.
Fig. 10. Organ occupancy ratio for BTCV dataset.

Implementation details
All experiments are conducted using PyTorch [52] on two Nvidia Tesla P100 GPUs, accessed through the HPC platform. We train all the baseline networks with a mini-batch size of 4, except for 3D UNet++, which is trained with a batch size of one. These choices have been made according to the available GPU memory. All baseline (single-task) networks are trained using the multi-class dice loss as expressed in Equations (1)-(3). To compute the multi-class dice loss, the 3D predicted organ-segmentation maps are compared with the 3D organ ground-truth maps. All the networks (baseline and boundary-constrained) are optimized via the Adam optimizer [53]. The learning rate is initially set to 0.001 and decays by a factor of 0.9 after every epoch. To assess the effect of the training hyperparameters on validation segmentation performance, we conduct experiments to guide us in selecting the optimal settings for baseline models. The results of these experiments are given in Tables S1 and S2 in the Supplementary section. We monitor the mean dice score on the validation set during training and use for testing the model that achieves the highest dice coefficient on the validation set.
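The learning-rate schedule and checkpoint selection described above amount to two small rules, sketched here in plain Python. The helper names and the validation scores in the usage lines are hypothetical placeholders, not reported results.

```python
def learning_rate(epoch, initial_lr=1e-3, decay=0.9):
    """Learning rate used during `epoch` (0-based) under per-epoch exponential decay."""
    return initial_lr * decay ** epoch


def best_epoch(val_dice_per_epoch):
    """Index of the epoch whose checkpoint has the highest mean validation dice."""
    return max(range(len(val_dice_per_epoch)), key=val_dice_per_epoch.__getitem__)


# Hypothetical validation curve, for illustration only.
example_val_dice = [0.50, 0.70, 0.60]
selected = best_epoch(example_val_dice)
print(selected, learning_rate(selected))
```

In PyTorch, this decay rule corresponds to `torch.optim.lr_scheduler.ExponentialLR` with `gamma=0.9`, although the source does not state which scheduler class was used.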
We use a combination of the multi-class dice loss and the binary cross-entropy loss, as given in Eq. 5, for training the boundary-constrained models.
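A minimal sketch of this combined objective for a single scan is given below. It uses NumPy for brevity instead of PyTorch, and the array shapes, the smoothing constant `eps`, the averaging over voxels, and the function names are assumptions made for illustration rather than details taken from the source.

```python
import numpy as np


def multiclass_dice_loss(probs, onehot, eps=1e-6):
    """1 minus the mean per-class soft dice; inputs have shape (C, W, H, Z)."""
    axes = (1, 2, 3)
    intersection = (probs * onehot).sum(axis=axes)
    denominator = probs.sum(axis=axes) + onehot.sum(axis=axes)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return float(1.0 - dice.mean())


def boundary_bce_loss(edge_probs, edge_gt, eps=1e-7):
    """Voxel-averaged binary cross-entropy; inputs have shape (W, H, Z)."""
    p = np.clip(edge_probs, eps, 1.0 - eps)  # avoid log(0)
    bce = -(edge_gt * np.log(p) + (1.0 - edge_gt) * np.log(1.0 - p))
    return float(bce.mean())


def boundary_constrained_loss(probs, onehot, edge_probs, edge_gt, lam=1.0):
    """Dice loss on organs plus lam times the cross-entropy on boundaries."""
    return multiclass_dice_loss(probs, onehot) + lam * boundary_bce_loss(edge_probs, edge_gt)
```

With `lam=0.0` the function reduces to the single-task dice loss used for the baselines, which makes the baseline and boundary-constrained settings directly comparable.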
The 3D organ-boundary predictions are compared with the 3D boundary ground truths to obtain the binary cross-entropy loss. Since the datasets do not contain boundary annotations of organs, we derive the ground-truth boundaries by first eroding the multi-organ ground-truth labels and then taking the difference from the original ground-truth map. For UNet-MTL-TSOL, UNet-MTL-TSD, Att-UNet-MTL-TSOL, and Att-UNet-MTL-TSD, we use a batch size of two, whereas for UNet++-MTL-TSOL and UNet++-MTL-TSD, we use a batch size of one. These choices are made according to the available GPU memory. Note that the boundary predictions are only used in the training stage. During the validation stage, we only consider the organ predictions.
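The erode-and-subtract step for deriving boundary ground truths can be sketched with NumPy as follows. The structuring element (6-connected face neighbours), the single erosion step, the treatment of label 0 as background, and the per-organ erosion are assumptions; the source does not specify them.

```python
import numpy as np


def erode(mask):
    """One step of binary erosion with a 6-connected structuring element."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    core = tuple(slice(1, -1) for _ in range(mask.ndim))
    eroded = mask.copy()
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            # A voxel survives only if this face neighbour is also foreground.
            eroded &= np.roll(padded, shift, axis=axis)[core]
    return eroded


def boundary_map(labels):
    """Binary map of organ voxels removed by erosion (organ minus eroded organ)."""
    labels = np.asarray(labels)
    boundary = np.zeros(labels.shape, dtype=bool)
    for organ in np.unique(labels):
        if organ == 0:  # assumed background label
            continue
        mask = labels == organ
        boundary |= mask & ~erode(mask)
    return boundary
```

Eroding each organ separately, rather than the union of all organs, keeps the interface between two touching organs in the boundary map, which is where the text locates the most ambiguous voxels.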
We conduct a grid search (over the range from 0 to 2 with a step of 0.5) to find the optimal value of λ, which balances the boundary detection loss. The exact values of λ selected for each boundary-constrained model are given in Tables S3 and S4 in the Supplementary material. These tables also present the standard deviation of the validation Dice scores obtained with the different values of λ.
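The grid search over λ can be sketched as below. Eq. 5 is not reproduced in this section, so the weighted-sum form `dice_loss + lam * boundary_bce` is an assumption based on the description of λ as the weight of the boundary loss. The callable `validate` is hypothetical: it stands for training one model with a given λ and returning its validation mean Dice.

```python
def lambda_grid(start=0.0, stop=2.0, step=0.5):
    """Candidate values of lambda: 0, 0.5, 1.0, 1.5, 2.0 by default."""
    n_steps = int(round((stop - start) / step))
    return [start + i * step for i in range(n_steps + 1)]


def total_loss(dice_loss, boundary_bce, lam):
    """Assumed form of the multi-task loss: Dice loss plus weighted boundary BCE."""
    return dice_loss + lam * boundary_bce


def grid_search(validate, grid=None):
    """Return the lambda with the highest validation score and all scores.

    `validate(lam)` is a hypothetical callable that trains a model with
    weight `lam` and returns its validation mean Dice.
    """
    grid = lambda_grid() if grid is None else grid
    scores = {lam: validate(lam) for lam in grid}
    best = max(scores, key=scores.get)
    return best, scores
```

Under this assumed form, λ = 0 removes the boundary term from the objective, so that grid point corresponds to training on the Dice loss alone.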

Evaluation metrics
We compare the predicted segmentation masks with the ground-truths to evaluate the segmentation performance of the baseline and boundary-constrained models on the test set. To do so, we use the Dice Score, Recall, Precision, and Average Hausdorff Distance as metrics for assessing the quality of the predicted segmentation masks. These metrics are calculated for each organ individually and then averaged across all subjects. All reported metrics are the mean over 5 runs.
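The overlap-based metrics can be computed per organ from voxel counts, as in the minimal NumPy sketch below. The function names are hypothetical, and returning 1.0 when a denominator is zero (for example, an organ absent from both masks) is a convention assumed here. The Average Hausdorff Distance is left out because its exact definition is not given in this section.

```python
import numpy as np


def overlap_metrics(pred, truth):
    """Dice, recall and precision for one organ, given two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.count_nonzero(pred & truth)    # true positives
    fp = np.count_nonzero(pred & ~truth)   # false positives
    fn = np.count_nonzero(~pred & truth)   # false negatives
    # Assumption: an empty denominator counts as a perfect score.
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return dice, recall, precision


def per_organ_metrics(pred_labels, truth_labels, organ_ids):
    """Apply `overlap_metrics` to each organ label of a multi-organ map."""
    pred_labels = np.asarray(pred_labels)
    truth_labels = np.asarray(truth_labels)
    return {
        organ: overlap_metrics(pred_labels == organ, truth_labels == organ)
        for organ in organ_ids
    }
```

Averaging the per-organ values over subjects, and then over runs, gives summary numbers of the kind reported in the tables.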

Results and Analysis
This section presents the experimental results obtained from the boundary-constrained abdominal segmentation models and compares them against the performance of the baseline models. For the sake of brevity, we denote the baseline and boundary-constrained models with the abbreviations below (shown in bold).
(f) 3D UNet++-MTL-TSD: Boundary-constrained 3D UNet++ with task-specific decoders, shown in Fig. 6.

Quantitative Results
Table 1 summarizes the segmentation results for the Pancreas-CT and BTCV datasets.
We report the mean Dice Score, mean Average Hausdorff Distance (Avg. HD), mean Recall, and mean Precision computed by comparing the predicted segmentation against the ground-truth. These measures are calculated on the test set of each dataset. Table 1 shows that the boundary-constrained models achieve improved multi-organ segmentation on the abdominal CT scans. Firstly, the mean Dice scores are improved by 1.8% (UNet vs. UNet-MTL-TSOL) and 3.5% (Att-UNet vs. Att-UNet-MTL-TSOL) for the Pancreas-CT dataset. The corresponding mean Dice scores for the BTCV dataset are improved by 3.1% (UNet vs. UNet-MTL-TSD), 3.6% (UNet++ vs. UNet++-MTL-TSOL), and 3.5% (Att-UNet vs. Att-UNet-MTL-TSD). The improved overlap between predicted segmentations and manually annotated masks can be attributed to the enhanced semantic representations learned by the boundary-constrained models.
Secondly, we observe that the boundary-constrained models achieve lower Avg. HDs than the baseline models on both datasets, as shown in Table 1. For example, after adding boundary information, the Avg. HD values of UNet, UNet++, and Att-UNet decrease by 11.5%, 14.5%, and 18.4%, respectively, for the Pancreas-CT dataset. Likewise, decreases of 15.4%, 16.2%, and 30%, respectively, are seen for the BTCV dataset. Furthermore, we notice that the Avg. HD is still lower in the cases where the boundary-constrained models obtain a lower or equivalent mean Dice score, e.g., UNet++-MTL-TSOL vs. UNet++ and UNet++-MTL-TSD vs. UNet++ on Pancreas-CT. This indicates that, even at equivalent Dice overlap, the boundary-constrained models predict the boundaries more accurately.
Thirdly, our boundary-constrained models achieve higher values of mean Recall and mean Precision for all the models and datasets except for UNet++ on the Pancreas-CT dataset, as shown in Table 1. The use of boundary information decreases both the false-negative and the false-positive rates. Specifically, the mean Recall is increased by 1.3% (UNet vs. UNet-MTL-TSOL) and 4.3% (Att-UNet vs. Att-UNet-MTL-TSOL) for the Pancreas-CT dataset. For the BTCV dataset, increases of 1.3% (UNet vs. UNet-MTL-TSOL), 3.7% (UNet++ vs. UNet++-MTL-TSOL), and 1.9% (Att-UNet vs. Att-UNet-MTL-TSOL) are observed in mean Recall. Finally, we see a maximum improvement of 1.9% and 4.3% in mean Precision for the Pancreas-CT and BTCV datasets, respectively. The improvements in mean Recall and mean Precision show the capability of the boundary-constrained models to address both under-segmentation and over-segmentation.
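The percentages quoted above are relative changes with respect to the baseline value. The short sketch below shows the arithmetic; the function name `relative_change` is hypothetical, and the numbers in the usage note are made up for illustration and are not taken from Table 1.

```python
def relative_change(baseline, value):
    """Relative change of `value` with respect to `baseline`, in percent.

    Positive for an increase (e.g. Dice, Recall, Precision),
    negative for a decrease (e.g. Avg. HD).
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (value - baseline) / baseline
```

For instance, a hypothetical mean Dice rising from 0.800 to 0.828 is a relative improvement of 3.5%, and a hypothetical Avg. HD falling from 2.0 to 1.4 is a relative change of -30%.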

Computational Complexity and Architectural Analysis
Table 2 reports the parameter count (in millions, M) and the time each method takes to segment a single abdominal CT volume in the test phase. We also highlight the increase in parameter count (given in brackets) compared with the baseline model. Among the single-task baseline models, UNet++ is the most parameter-intensive with 6.87 × 10^6 parameters. The parameter count
Analyzing the relationship between the multi-task network design and the segmentation performance in Table 1, we note that no single topology leads to the maximum improvement in all cases. For example, TSD showed the maximum improvement in mean DC and Avg. HD over the baseline UNet for the Pancreas-CT dataset. However, when the performance of Att-UNet is compared with its boundary-constrained counterparts, we notice that TSOL gives the best results. This reveals that the multi-task configuration offering the best results depends on the baseline architecture. All in all, we found that integrating boundary information improved multi-organ segmentation, independent of the network topology.

Assessment of organ-wise segmentation performance
To assess which organs benefit most from incorporating boundary information into the segmentation task, we examine the mean Dice scores and mean Avg. HDs achieved by the baseline and the best-performing boundary-constrained models (from Table 1) for each abdominal organ. We compute the Dice scores and Average Hausdorff distances for each organ individually and then average across all subjects. From Table 3, we can see that both baseline and boundary-constrained models yield the highest mean Dice scores for the liver, spleen, and kidney, and the lowest for the duodenum. However, the boundary-constrained models lead to the maximum relative improvement for the gallbladder, pancreas, and duodenum. From Table 4, we observe that the boundary-constrained models significantly improve the mean Avg. HDs for the spleen, kidney, and gallbladder. Finally, relating the increase in Dice overlap to the organ occurrence (shown in Figures 9 and 10), we observe that the most significant improvement occurs for the underweighted classes. In contrast, the boundary distances decrease the most for both small structures (e.g., gallbladder) and large structures (e.g., spleen). Furthermore, the boundary-constrained models lead to lower standard deviations of the mean Dice scores and Avg. HDs across subjects, which shows the robustness of the proposed models.

Segmentation performance along boundary of organs
Unclear organ boundaries and low inter-organ contrast prevent accurate segmentation of the challenging regions around organ boundaries on abdominal CT scans. To assess whether incorporating boundary information improves the segmentation of voxels in the close vicinity of organ boundaries, we evaluate the quality of the predicted voxel labels within these regions and compare them against those obtained with the baseline methods. To do this, we generate trimaps [54], [55] with voxel bands of different widths surrounding the boundaries of organs. A trimap is a narrow region along the boundary of an object that is used to evaluate the quality of segmentation within a specific distance from the object's contour. First, we generate trimap regions around the boundaries of organs for the predicted and ground-truth segmentations, and then we compare the 3D trimaps by computing the mean Dice scores between them. We show example trimap regions on 2D abdominal axial slices in Fig. 12. For brevity, we compute the trimap Dice scores only for the TSOL network topology. Fig. 13 shows the mean Dice score plotted against the number of voxels the trimap contains. The top row shows trimap plots for the Pancreas-CT dataset, whereas the bottom row shows the trimap plots for the BTCV dataset. Our proposed boundary-constrained models consistently perform better than the baseline models in predicting the semantic labels in the vicinity of organ boundaries, except for one case, i.e., 3D UNet++ vs. 3D UNet++-MTL-TSOL.
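The trimap evaluation can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the evaluation code behind Fig. 13: the band is built as the difference between a dilated and an eroded mask using a six-face-neighbour element, `radius` counts voxels on each side of the contour (so its relation to the "voxel width" in Fig. 12 is assumed), and all function names are hypothetical.

```python
import numpy as np


def _neighbours(mask):
    """Yield the six face-neighbour shifts of a 3D mask (outside = background)."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    for axis in range(3):
        for shift in (-1, 1):
            yield np.roll(padded, shift, axis=axis)[1:-1, 1:-1, 1:-1]


def dilate(mask, iterations=1):
    """Binary dilation: a voxel is set if it or any face-neighbour is set."""
    out = mask.copy()
    for _ in range(iterations):
        grown = out.copy()
        for neighbour in _neighbours(out):
            grown |= neighbour
        out = grown
    return out


def erode(mask, iterations=1):
    """Binary erosion: a voxel survives only if all face-neighbours are set."""
    out = mask.copy()
    for _ in range(iterations):
        shrunk = out.copy()
        for neighbour in _neighbours(out):
            shrunk &= neighbour
        out = shrunk
    return out


def trimap(mask, radius):
    """Band of voxels within `radius` steps on either side of the mask contour."""
    mask = np.asarray(mask, dtype=bool)
    return dilate(mask, radius) & ~erode(mask, radius)


def dice(a, b):
    """Dice overlap of two binary masks (1.0 if both are empty, an assumption)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = np.count_nonzero(a) + np.count_nonzero(b)
    return 2.0 * np.count_nonzero(a & b) / denom if denom else 1.0


def trimap_dice(pred_mask, truth_mask, radius):
    """Dice between the trimap of the prediction and that of the ground truth."""
    return dice(trimap(pred_mask, radius), trimap(truth_mask, radius))
```

Sweeping `radius` over a few values and plotting `trimap_dice` against it gives a curve of the kind described for Fig. 13.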

Qualitative Results
Fig. 14 shows semantic segmentation predictions on a single 2D axial slice of the 3D scans. The first and second rows correspond to segmentation results from the Pancreas-CT dataset, whereas the third and fourth rows correspond to results from the BTCV dataset. The columns (from left to right) show the original abdominal 2D images (Fig. 14a), the ground-truth masks (Fig. 14b), the baseline model (UNet and Att-UNet) segmentations (Fig. 14c), and the segmentations obtained from the boundary-constrained counterparts (UNet-MTL-TSD and Att-UNet-MTL-TSOL) (Fig. 14d). We observe that the baseline models lead to under-segmentations and over-segmentations, indicated by white boxes in Fig. 14c.
Furthermore, the segmentations generated by the single-task baseline models contain isolated and biologically implausible organ parts. In the corresponding boundary-constrained segmentations (Fig. 14d), the incorporation of boundary information prevents mispredictions near organ boundaries and leads to biologically plausible segmentations. Fig. 15 presents the 3D segmentations generated by the baseline (UNet) and boundary-constrained (UNet-MTL-TSD) models along with the ground-truths from the Pancreas-CT (first row) and BTCV (second row) datasets. Notice that the boundary-constrained segmentations (third column) are more similar to the ground-truth masks (first column) than the baseline segmentations (second column). These qualitative results show the improvements induced by using organ borders when training 3D fully convolutional models for abdominal organ segmentation.
Fig. 15. ... pancreas, left kidney, gallbladder, esophagus, liver, stomach, and duodenum, whereas the second and third columns correspond to volume labels predicted by UNet and UNet-MTL-TSD.

Discussion
Accurate segmentation of abdominal organs from CT scans is required for numerous advanced clinical procedures such as computer-assisted surgery and organ transplantation. The low-contrast appearance and weak edges of abdominal organs in CT scans hamper accurate segmentation. In this paper, we propose to leverage the boundary information of organs as an additional cue for improved 3D abdominal multi-organ segmentation. The boundary-constrained encoder-decoder network simultaneously learns to delineate the semantic abdominal regions and to detect the boundaries of organs. This multi-task learning model exploits the statistics from more than one ground-truth source and subsequently retains features shared between the tasks. The boundary annotations of abdominal organs can be easily generated from the ground-truth masks and provide cost-free additional knowledge about the organs.
As reported by the quantitative results in Table 1, the proposed boundary-constrained 3D encoder-decoder models achieve improved multi-organ segmentation across the majority of the baselines (3D UNet, 3D UNet++, and 3D Att-UNet) and datasets (Pancreas-CT and BTCV). We have shown that the significant improvement in segmentation performance, evaluated via Dice Score, Avg. HD, Recall, and Precision, is caused by the improved segmentation of organ boundaries and the regions surrounding them. Furthermore, significant improvements with a negligible increase in parameter count (0.0002% for the TSOL topology) reveal the benefit of regularizing existing encoder-decoder segmentation models using boundary information.
The reduction in Avg. HD for both datasets across all baselines demonstrates the advantage of informing the model about the vulnerable regions of the organ. The dramatic decrease in Avg. HD, ranging from 18% to 30%, shows that the model learned feature combinations that are expressive of the entire appearance of organs. We believe that training the models with auxiliary knowledge encourages learning more generalizable and discriminative features. Notably, the additional experiments that we conducted to precisely assess the improvements in the segmentation of regions in the vicinity of organ boundaries further verify the superiority of training the segmentation model with complementary boundary information (shown in Fig. 15c).
Since a fully convolutional architecture can be designed in several different configurations under a multi-task learning paradigm, we explore two network topologies based on the extent of parameter sharing between the tasks. Through our extensive comparison, we notice that an overly-shared multi-task network (TSOL) performs on par with the network designed to have an increased number of task-specific layers (TSD) (Table 1). This indicates that a larger number of parameters does not necessarily translate into higher performance. Most importantly, we also found that incorporating boundary information improved the multi-organ segmentation performance, regardless of the network topology. One of the critical challenges in designing 3D multi-task deep learning models is determining which layers should be shared while keeping the computational expense reasonable. In the future, we aim to investigate other network topologies for training the encoder-decoder model in a multi-task learning fashion. Another valuable extension of our work is to develop a mechanism or policy that can automatically determine the sharing pattern of network layers between the two tasks.
As reported in Tables 3 and 4, the improvement in the segmentation performance for organs that rarely occur in the dataset reveals the efficacy of boundary-regularized models in compensating for their rare presence. The qualitative results shown in Figures 14 and 15 verify the positive impact of making the model aware of organ boundaries during training. The ability of our model to simultaneously learn improved representations of multiple organs is indicated by the qualitative examples in Figures 14 and 15, where the boundary constraint has reduced the occurrence of over- and under-segmented organs.

Conclusion
In this paper, we leverage organ boundary information for improved 3D abdominal multi-organ segmentation by addressing the challenge of unclear boundaries in low-contrast CT scans. We demonstrate that boundary information can be seamlessly introduced into the training of 3D encoder-decoder models through different multi-task configurations. The experimental results show that the boundary-constrained multi-organ segmentations outperform those obtained from several FCN-based baseline models, including 3D UNet, 3D UNet++, and 3D Attention-UNet. Furthermore, we found that the multi-task topology that shows the maximum improvement is not fixed and varies depending on the baseline architecture. This insight shows that the optimal utilization of auxiliary information cannot always be harvested through a single deep multi-task design but instead requires the exploration of different multi-task topologies. Our findings also reveal that leveraging organ boundary features improves the segmentation of underweighted organs like the gallbladder, pancreas, and duodenum with a negligible parameter overhead. Additionally, the experimental results show that the boundary-constrained models improve the labelling of weak sub-parts of organs in the vicinity of boundaries. We believe the proposed 3D boundary-constrained models would be valuable for enhancing abdominal organ segmentation and utilizing those segmentations in relevant clinical applications.
Table S4: Value of λ used to balance the boundary loss for the BTCV dataset. The first column gives the model's name; the second column shows the λ value. The last column shows the standard deviation from the mean of the validation Dice scores generated using different λ values in the search grid.

Fig. 2. Multi-task topologies of the 3D boundary-constrained network. (a) Multi-task topology with shared encoder-decoder network and task-specific prediction layers, and (b) Multi-task topology with shared encoder and task-specific decoders.

Fig. 11. Mean Dice score computed by comparing predicted multi-organ segmentation masks with the ground-truth on the validation set: First two rows (a-c) and last two rows (d-f) correspond to the Dice score curves for the Pancreas-CT dataset and the BTCV dataset, respectively. Each figure compares the mean DC of the baseline (blue) and its counterpart boundary-constrained model (red) on the validation set, computed after each epoch. The baseline models are trained to predict the masks of the abdominal regions, whereas the boundary-constrained models are trained to predict the labels of organs and boundaries.

Fig. 11 shows the performance curves based on the mean Dice scores, computed by comparing the predicted multi-organ segmentation with the ground-truth masks on the validation set. Figures 11a-11c correspond to performance curves for the Pancreas-CT dataset and Figures 11d-11f for the BTCV dataset. It can be seen that the incorporation of boundary information has improved the mean Dice score as the training progresses.

Fig. 12. Trimap regions along boundary of organs shown on 2D abdominal images.(a) Original abdominal image, (b) Organ label-maps corresponding to 2D image, (c) Trimap region of 5 voxel width around boundaries of organs shown in gray color, and (d) Trimap region of 7 voxel width shown in gray color.

Fig. 14. Qualitative results for multi-organ segmentation with baseline and boundary-constrained models. Rows 1-2 correspond to results for the Pancreas-CT dataset, whereas Rows 3-4 show results for the BTCV dataset. Columns 1-2 show the original image and the corresponding ground-truth mask overlaid on the 2D image, i.e., spleen, left kidney, gallbladder, liver, stomach, and duodenum. Columns 3-4 illustrate the segmentation results of the baseline UNet and its UNet-MTL-TSOL counterpart. White boxes indicate the segmented regions improved by the incorporation of boundary information.
Organ occupancy ratio for Pancreas-CT dataset.

Table 2
Comparison of parameter-cost and computational time.

Table 3
Organ-wise mean Dice scores ± Std. Dev. obtained from baseline and best-performing boundary-constrained models: Bold shows the highest value corresponding to each baseline.