Cascaded MultiTask 3-D Fully Convolutional Networks for Pancreas Segmentation

Automatic pancreas segmentation is crucial to the diagnostic assessment of diabetes or pancreatic cancer. However, the relatively small size of the pancreas in the upper body, as well as large variations of its location and shape in the retroperitoneum, make the segmentation task challenging. To alleviate these challenges, in this article, we propose a cascaded multitask 3-D fully convolutional network (FCN) to automatically segment the pancreas. Our cascaded network is composed of two parts. The first part focuses on quickly locating the region of the pancreas, and the second part uses a multitask FCN with dense connections to refine the segmentation map for fine voxel-wise segmentation. In particular, our multitask FCN with dense connections is implemented to simultaneously complete the tasks of voxel-wise segmentation and skeleton extraction from the pancreas. These two tasks are complementary, that is, the extracted skeleton provides rich information about the shape and size of the pancreas in the retroperitoneum, which can boost the segmentation of the pancreas. The multitask FCN is also designed to share the low- and mid-level features across the tasks. A feature consistency module is further introduced to enhance the connection and fusion of different levels of feature maps. Evaluations on two pancreas datasets demonstrate the robustness of our proposed method in correctly segmenting the pancreas in various settings. Our experimental results outperform both baseline and state-of-the-art methods. Moreover, the ablation study shows that our proposed parts/modules are critical for effective multitask learning.


I. INTRODUCTION
PANCREATIC cancer, such as ductal adenocarcinoma, has a high mortality rate with a low five-year survival rate, and is one of the most challenging cancers to treat [3]. Patients are frequently examined by the early parenchyma phase abdominal CT [7]. In upper abdominal surgery, such as laparoscopic gastrectomy or pancreatectomy, the location of the pancreas is required for enabling safer surgical procedures [10], whereas manual delineation of the pancreas is time consuming and often irreproducible. Therefore, there is a pressing need for an efficient computer-aided segmentation method to help physicians diagnose and assess the progression of diabetes or pancreatic cancer, as done in other applications [11]-[14]. However, accurate segmentation of the pancreas is challenging due to the following two reasons: 1) the pancreas has especially large intersubject variability in its location, size, and shape (see Fig. 1) and 2) the pancreas has a thinner shape in the abdomen, compared with other abdominal organs, as can also be observed in Fig. 1. Moreover, as shown in Fig. 1, the intensities of voxels in the pancreatic region are very similar to those of the neighboring structures (i.e., the stomach wall, duodenum, and intestines).
Fig. 1. Two examples of the pancreas from the NIH dataset [1], shown in three views (raw CT image, and the axial, sagittal and coronal views from left to right, respectively). Pancreas regions are shown in red, from which we can observe: 1) large shape, appearance and location variations of the pancreas across cases; 2) thinner shape in the abdomen; and 3) similar intensities in pancreas and neighboring structures.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
With recent advances, deep-learning methods have obtained superior performance in the segmentation of medical images [15]-[20]. For accurate pancreas segmentation, several convolutional-neural-network-based (CNN) [21] methods have been developed. Generally, we can classify the existing deep-learning frameworks for this segmentation task into two kinds, that is, 1) the one-stage methods and 2) the two-stage methods [2], [4]-[6], [8], [9], [22]. The one-stage methods directly segment organ(s) in a whole image, while the two-stage methods first locate the organ(s) and then perform segmentation on the localized region(s). As a one-stage method, Farag et al. [2] used a CNN model with dropout [24] to conduct a classification of the pancreas and nonpancreas regions [25]. In another work, Cai et al. [9] added a convolutional long short-term memory (LSTM) [26] network to the output layer of the CNN to complete the segmentation on 2-D slices of the pancreas. These methods directly apply CNNs on the entire CT images. However, the pancreas is relatively small in size (i.e., less than 0.5% of the entire CT volume [4]). For such organs, deep-learning methods can be disrupted by the nontarget region, which often occupies a large fraction of the abdominal CT images. Therefore, the segmentation results are often not satisfactory. To overcome the above-mentioned challenge, the two-stage methods have been developed.
These methods focus only on the target region (i.e., the pancreas), which allows more accurate segmentation [5] in the second stage. For example, Roth et al. [6] presented a two-stage method to localize and segment the pancreas, respectively. Specifically, they employed holistically nested convolutional networks (HNNs) [27] on three views to do the task. A similar strategy was adopted in Zhou et al. [4] by applying the fixed-point models to shrink the input region. In their method, a 2-D fully convolutional network (FCN) model was used. Besides, Yu et al. [8] added a recurrent saliency transformation module into the coarse-to-fine model, which achieved the best performance among all existing methods in terms of dice ratio. However, all of these methods merge contexts of different views of 2-D slices of CT images for segmentation, which unavoidably misses some spatial information across slices. Recently, Zhu et al. [5] proposed a 3-D coarse-to-fine segmentation method by using 3-D U-net [28] with residual connections. Similarly, Roth et al. [22], [23] employed a 3-D U-net with concatenation and summation skip connections to segment the pancreas. Our method is also a two-stage framework, which can focus on small organ regions, and works in 3-D by using 3-D patches of CT volumes as the input to leverage spatial information along all three axes.
Existing methods for pancreas segmentation [2], [4]- [6], [8], [9], [22] mainly use standard segmentation approaches in the literature for medical (or natural) image segmentation, ignoring problem-specific challenges (i.e., varying locations and shapes of the pancreas across different subjects). We argue that to segment the detailed and fine structures, like pancreas, shape-specific cues can significantly improve the segmentation performance.
To this end, in this article, we propose a cascaded 3-D FCN, composed of two major cascaded stages (see Fig. 2 for an overview of our approach). In the first part of the cascade (i.e., C 1 in Fig. 2), we concentrate on fast locating the region of the pancreas, since pancreas is relatively small in size in the CT volume. Based on the obtained region, in the second part (C 2 ), we construct a novel multitask FCN with dense connections to adopt guidance from the organ skeleton to improve the accuracy and stability of the final segmentation, which consists of two branches with shared learned features. One branch of the deep multitask network is a regression network, aiming at describing the shapes of the pancreas, and another branch segments the pancreas. We use skeleton as a shape representation to more accurately and efficiently represent the pancreas shapes in the abdomen CT images. Extraction of object skeletons from images has been well studied and successfully applied to shape-based object matching and recognition [29]- [31]. The skeleton is a useful structure-based object descriptor, which can deliver significant information about the presence, shape, and size of the object [32]- [34], since the shapes of the pancreas are variable across different subjects and also thinner than other abdominal organs. Moreover, pancreas can be found as two separate parts in some axial views. In our method, we describe the pancreas with skeletons to capture its shape and preserve its geometric properties. Since extracting skeletons and segmenting pancreas are interrelated, the multitask [35]- [41] framework can be used to optimize both tasks and boost performance for pancreas segmentation. Besides, we also propose a feature consistency module to further enhance the connection and fusion of different levels of feature maps for improving the performance. 
To remove small false segments, we finally employ 3-D fully connected conditional random field (CRF) [42]- [45] as a post-processing step for pancreas segmentation.
For comparison, we evaluate our approach on the NIH pancreas segmentation dataset [1], which has been adopted by many previous methods. The average dice similarity coefficient (DSC) of our method reaches 86.4%, outperforming those previous methods. To check the contribution of each proposed module in our method, we also evaluate it on the NIH dataset. Besides the NIH dataset, we further evaluate our method on another in-house dataset, to show the robustness of our method across different datasets.
The contributions of this article can be summarized as follows.
1) We propose a cascaded shape-specific cues-guided FCN, composed of two major cascaded stages. The first stage focuses on fast locating the region of pancreas. The second stage employs a 3-D multitask dense-U-Net architecture to perform accurate segmentation on the located pancreas region. With this two-stage method, small pancreas can also be accurately segmented.
2) We adopt guidance from the pancreas skeleton to help the segmentation network to better learn the segmentation task. In particular, the estimated pancreas skeleton can provide coarse shape information, which can alleviate both issues of the low contrast in the boundary and the high geometric variability of pancreas in the CT images.
3) We propose a novel 3-D multitask framework for volumetric pancreas segmentation. Our proposed 3-D segmentation framework can leverage rich spatial information along all three axes for accurate segmentation.

II. METHOD
The overall framework of our proposed method is shown in Fig. 2, where a cascaded multitask 3-D FCN is proposed to first automatically localize the pancreas region(s) and then segment the located pancreas in detail. Since the raw CT image of the upper body contains a large region while the target pancreas is relatively small, we design a cascaded framework to first conduct an initial segmentation in a coarse level, and then segment pancreas in a fine level. The first stage of the cascaded framework, denoted as C 1 , is implemented with an FCN, which is designed to localize the pancreas from the raw CT image. The details of C 1 are introduced in Section II-A. Then, a multitask FCN with dense connections is utilized in the second stage of the cascaded framework, denoted as C 2 , to accurately segment pancreas based on the detected region(s) from C 1 . This multitask FCN consists of two interrelated steps, that is, pancreas skeleton extraction and segmentation. Here, the pancreas skeleton serves as supplementary guidance to help the segmentation network to better learn the segmentation task. The details of C 2 are introduced in Section II-B. Afterward, a 3-D fully connected CRF is employed as a post-processing step to achieve smoother predictions, as described in Section II-C.

A. Pancreas Localization using 3-D FCN
We design C 1 to localize pancreas in raw CT images, which will propose regions for C 2 . This can be considered as a coarse segmentation of pancreas and can be regarded as a binary-classification problem. In particular, FCN is devised to fast segment pancreas from the down-sampled CT image. Specifically, the original CT image is down-sampled to 1/4 of its original resolution in our case. After inference with the proposed FCN, we upsample the coarse segmentation result to the original resolution. Then, the pancreas region can be obtained from this upsampled segmentation result.
We use a variant of FCN, U-Net [28], as the base architecture of our network, which is illustrated in Fig. 3 (cascade C 1 ). The entire network contains a contracting path and an expanding path. The contracting path consists of three blocks, each containing one or two convolutional layer(s), followed by a max-pooling layer with a kernel size of 2 × 2 × 2. Each 3 × 3 × 3 convolutional layer [46] with strides and padding of one is followed by a rectified linear unit (ReLU) [47]. The expanding path includes the same number of blocks as the contracting path. Each block has one transposed convolutional layer followed by several convolutional layers. The transposed convolutional layer has a kernel size of 2 × 2 × 2 with strides of two. The output feature maps of the transposed convolutional layer are concatenated with the feature maps of the convolutional layer in the corresponding scale of the contracting path. Then, the output features are fed into the subsequent convolutional layers.
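The layer settings above fix how the spatial size of a patch changes through the network. The following minimal sketch only does that size arithmetic. It assumes the 16 × 64 × 64 input patch used in this article and that each of the three contracting blocks ends with one pooling layer; the function names are illustrative and not part of the original implementation.

```python
def conv_output_size(n, kernel=3, stride=1, padding=1):
    """Output length of a convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1


def transposed_conv_output_size(n, kernel=2, stride=2):
    """Output length of an unpadded transposed convolution along one axis."""
    return (n - 1) * stride + kernel


def contracting_shapes(patch_shape, num_blocks=3, pool=2):
    """Patch shape at the input and after each pooled block (assumed layout)."""
    shapes = [tuple(patch_shape)]
    for _ in range(num_blocks):
        # 3x3x3 convolutions with stride 1 and padding 1 keep the size;
        # the 2x2x2 max-pooling halves every axis.
        kept = tuple(conv_output_size(s) for s in shapes[-1])
        shapes.append(tuple(s // pool for s in kept))
    return shapes


shapes = contracting_shapes((16, 64, 64))
restored = shapes[-1]
for _ in range(3):
    # Each expanding block doubles every axis again.
    restored = tuple(transposed_conv_output_size(s) for s in restored)
```

Under these assumptions, the padded convolutions preserve the size at each scale, so the feature maps of the two paths can be concatenated without cropping.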
We employ patch-wise training, rather than entire-image training, due to a small number of training samples. Particularly, in this article, the images are cropped to 3-D patches with the size of 16 × 64 × 64, according to the effect of patch sizes analyzed in Section III. Then, the bounding box of pancreas is obtained by morphological operation on this coarse segmentation. The intermediate point of the bounding box is selected as the centroid to crop the region of size 128×224×224 from the raw CT images. The region size is ensured to cover the entire pancreas. This located region is finally fed into the next stage of the cascaded framework, C 2 .
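The cropping step can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `crop_bounds` is hypothetical, and shifting the region back inside the volume when the bounding box lies near a border is an assumption, since the text does not describe border handling.

```python
def crop_bounds(bbox_min, bbox_max, volume_shape, region=(128, 224, 224)):
    """Per-axis (start, stop) of a fixed-size region centred on a bounding box."""
    bounds = []
    for lo, hi, dim, size in zip(bbox_min, bbox_max, volume_shape, region):
        size = min(size, dim)                   # the region cannot exceed the volume
        center = (lo + hi) // 2                 # intermediate point of the bounding box
        start = center - size // 2
        start = max(0, min(start, dim - size))  # assumed: shift the region inside the volume
        bounds.append((start, start + size))
    return bounds


# Hypothetical bounding box inside a 240 x 512 x 512 volume.
bounds = crop_bounds((60, 200, 180), (110, 300, 330), (240, 512, 512))
border_bounds = crop_bounds((0, 0, 0), (10, 10, 10), (240, 512, 512))
```

Keeping the region size fixed means every case yields an input of the same shape for C 2 , regardless of the size of the coarse segmentation.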

B. Pancreas Segmentation Using 3-D MultiTask FCN
In the second stage of the cascaded framework, C 2 , we use a multitask FCN with dense connections for accurate pancreas segmentation, based on the localized region(s) from C 1 . As discussed earlier, due to variable shapes of the pancreas across different patients, the guidance from the estimated pancreas skeleton can help better segment pancreas; on the other hand, the segmentation result can also help estimate pancreas skeleton. The network architecture is outlined in Fig. 3 (cascade C 2 ).
1) Pancreas Skeleton: One of the main difficulties of pancreas segmentation is their variable shapes across different patients. To deal with this issue, we estimate the skeleton of pancreas to provide a reliable reference for the pancreas shape, thus helping the segmentation network to better capture morphological information of pancreas in the CT image. Note that the ground-truth skeleton is only needed in the training phase, which can be actually obtained by morphological operation, followed by smoothing with a Gaussian filter.
2) Network Architecture: We adopt a 3-D multitask dense-U-Net to conduct the pancreas segmentation. To make full use of features and also strengthen feature propagation, we employ dense connections in the network [48], [49]. Specifically, we first crop patches from the proposed region (generated by C 1 ). Then, we feed the cropped patches into the network. Since the two tasks of estimating the pancreas skeleton and segmenting the pancreas are interrelated, they are designed to share the entire encoder and a part of the decoder. In particular, the encoder part of the network includes five blocks of convolutional layers and pooling layers (see the left part of cascade C 2 in Fig. 3). Each block includes several convolutional layers with a kernel size of 3 × 3 × 3, followed by a pooling layer with the size of 2 × 2 × 2. The decoder consists of a corresponding number of blocks, each with one transposed convolutional layer, followed by several convolutional layer(s). The transposed convolutional layer has a kernel size of 2 × 2 × 2 with strides of two. After the decoder (see the right part of cascade C 2 in Fig. 3), each task has two extra convolutional layers to continue learning the task-specific features. Note that ReLU is adopted as the activation function for each convolutional layer.
Through the structure of U-net, we can make full use of context information from the coarse feature maps learned in low-level layers (in the encoder part) and fuse it with fine information from the dense feature maps learned in the high-level layers (in the decoder part). To enhance the fusion of these low-level and high-level features, we propose to use several modules of feature consistency (denoted by three green cubes in Fig. 3), which consist of additional convolutional layers, to generate more precise output after the concatenation.
In the learning process, the segmentation loss is defined by the cross-entropy loss ($L_{CE}$), and the regression loss by the Euclidean loss ($L_{REG}$). Hence, the final loss of the network is
$$L = \lambda_1 L_{CE} + \lambda_2 L_{REG} \qquad (1)$$
with $L_{CE}$ and $L_{REG}$ given in (2)-(4), where $\lambda_1$ and $\lambda_2$ are the weights of $L_{CE}$ and $L_{REG}$ in our experiments, and $m$ is the number of subjects. In (2), $k$ is the total number of classes, and $x$ and $y$ represent the input data and labels, respectively; $l$ is the layer index, and $\theta$ is the parameter. In (4), $p_i$ is the predicted value, and $g_i$ is the corresponding ground-truth value.
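A minimal sketch of how such a weighted two-task loss can be computed is given below. It is not the original implementation: the averaging over voxels, the factor 1/2 in the Euclidean term (the convention of Caffe's Euclidean loss layer), and the default weights are assumptions made for illustration.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-likelihood; probs[i][c] is the predicted probability
    of class c for voxel i, and labels[i] is its true class index."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def euclidean_loss(pred, target):
    """Half the mean squared difference between prediction and ground truth
    (normalisation assumed for this sketch)."""
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / (2 * len(pred))


def total_loss(probs, labels, pred, target, lam1=1.0, lam2=1.0):
    """Weighted sum of the segmentation and skeleton-regression losses."""
    return lam1 * cross_entropy(probs, labels) + lam2 * euclidean_loss(pred, target)


# Toy example: two voxels, two classes, and a two-voxel skeleton map.
probs = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
skeleton_pred = [0.0, 1.0]
skeleton_true = [0.0, 0.0]
loss = total_loss(probs, labels, skeleton_pred, skeleton_true, lam1=1.0, lam2=2.0)
```

Because both terms are computed from the same shared features, the gradient of the regression term also updates the layers used for segmentation, which is how the skeleton guidance reaches the segmentation branch.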

C. Post-Processing of Pancreas Segmentation by 3-D Fully Connected Conditional Random Field
Although the segmentation results of 3-D FCN are smooth, there are still some small isolated regions, caused by the independent and identically distributed inference of the network. We employ a 3-D fully connected CRF [42]- [45] as a postprocessing step, which is able to connect all pairs of individual voxels in the image to refine the boundaries between pancreas and background and also remove isolated false positive segmentation.
For an input image $I$ and the segmentation $S$, the Gibbs energy in a CRF model is given by (5), where $\omega^{(1)}$ and $\omega^{(2)}$ define the weights of $k^{(1)}$ and $k^{(2)}$. The kernel $k$ is a linear combination of Gaussian kernels (9), which is defined over an arbitrary feature space, with $f_i$ and $f_j$ being the feature vectors of a pair of voxels. There are two types of kernel, that is, the smoothness function $k^{(1)}$ and the appearance function $k^{(2)}$, which are defined in (10) and (11), respectively. Here, $p_{i,d}$ represents the voxel coordinates, $\sigma_{\alpha,d}$ controls the size and shape of the neighborhood within which voxels are encouraged to take the same label, and $\sigma_\gamma$ controls how strongly voxels with similar appearance are encouraged to take the same label.
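For illustration, the two kernels can be sketched in the form commonly used for fully connected CRFs with Gaussian pairwise potentials. This form is an assumption here rather than a transcription of (10) and (11); in particular, the parameter name `sigma_beta` (the spatial bandwidth of the appearance kernel) does not appear in the text above and is introduced only for this sketch.

```python
import math


def smoothness_kernel(p_i, p_j, sigma_alpha):
    """Gaussian of the spatial distance between two voxels (per-axis bandwidths)."""
    d2 = sum((a - b) ** 2 / (2 * s ** 2) for a, b, s in zip(p_i, p_j, sigma_alpha))
    return math.exp(-d2)


def appearance_kernel(p_i, p_j, intensity_i, intensity_j, sigma_beta, sigma_gamma):
    """Gaussian of spatial distance and intensity difference between two voxels."""
    d2 = sum((a - b) ** 2 / (2 * s ** 2) for a, b, s in zip(p_i, p_j, sigma_beta))
    return math.exp(-d2 - (intensity_i - intensity_j) ** 2 / (2 * sigma_gamma ** 2))


def pairwise_kernel(p_i, p_j, intensity_i, intensity_j,
                    w1, w2, sigma_alpha, sigma_beta, sigma_gamma):
    """Weighted combination of the smoothness and appearance kernels."""
    return (w1 * smoothness_kernel(p_i, p_j, sigma_alpha)
            + w2 * appearance_kernel(p_i, p_j, intensity_i, intensity_j,
                                     sigma_beta, sigma_gamma))


# Hypothetical voxel pair: neighbouring coordinates, similar intensities.
near = pairwise_kernel((10, 20, 30), (10, 21, 30), 0.40, 0.42,
                       w1=1.0, w2=1.0, sigma_alpha=(1, 3, 3),
                       sigma_beta=(1, 3, 3), sigma_gamma=0.1)
far = pairwise_kernel((10, 20, 30), (14, 40, 30), 0.40, 0.90,
                      w1=1.0, w2=1.0, sigma_alpha=(1, 3, 3),
                      sigma_beta=(1, 3, 3), sigma_gamma=0.1)
```

In this form, nearby voxels with similar intensities receive a large kernel value and are therefore pushed toward the same label, while distant or dissimilar voxels barely interact.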

A. Data Acquisition
We compare our proposed method with previous methods on the NIH pancreas dataset [1]. The NIH pancreas segmentation dataset includes 82 contrast-enhanced abdominal CT scans. The image size is 512 × 512 × (181-466) with a slice thickness of 1.5-2.5 mm, acquired on Philips and Siemens MDCT scanners (120 kVp tube voltage). We also applied our proposed method to our own dataset (denoted as the Fujian Medical University (FMU) dataset), which was collected from the First Affiliated Hospital of FMU, China and approved by the Institutional Review Board of FMU. All experiments were performed in compliance with the Declaration of Helsinki. Written informed consent was acquired from each patient or next of kin. This dataset has a total of 59 contrast-enhanced abdominal CT volumes, along with manual delineations of the pancreas by experienced physicians. The size of the CT volumes in our own dataset is 512 × 512 × (37-224) with a slice thickness of 2.5-5.0 mm, acquired on a Toshiba Aquilion ONE 320 CT scanner (120 kV, 193 mA). Different from the NIH healthy pancreas dataset, some subjects in our own dataset include benign/malignant pathological cysts, which impact the morphology of the pancreas [50], thus making this dataset extremely challenging for segmentation due to large variation. For our experiments, we split the 82 patients (from NIH) and the 59 patients (from FMU) into four folds each, of 20, 20, 21, and 21 patients and of 15, 15, 15, and 14 patients, respectively. In each round of the standard fourfold cross-validation (CV-4), we employ three folds of data as training cases and the remaining fold for testing. 10% of the training cases are randomly selected for validation.
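The splitting protocol can be sketched as below. This is an illustration under stated assumptions: the random assignment of cases to folds, the fixed seed, the rounding of the 10% validation share, and holding the validation cases out of the training set are choices made for this sketch and are not specified in the text.

```python
import random


def make_folds(case_ids, fold_sizes, seed=0):
    """Randomly partition the cases into folds of the given sizes."""
    ids = list(case_ids)
    if sum(fold_sizes) != len(ids):
        raise ValueError("fold sizes must add up to the number of cases")
    random.Random(seed).shuffle(ids)
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(ids[start:start + size])
        start += size
    return folds


def cv_rounds(folds, val_fraction=0.1, seed=0):
    """Yield (train, validation, test) case lists for each cross-validation round."""
    rng = random.Random(seed)
    for k, test in enumerate(folds):
        pool = [c for i, fold in enumerate(folds) if i != k for c in fold]
        rng.shuffle(pool)
        n_val = max(1, round(val_fraction * len(pool)))  # assumed rounding
        yield pool[n_val:], pool[:n_val], test


nih_folds = make_folds(range(82), [20, 20, 21, 21])
fmu_folds = make_folds(range(59), [15, 15, 15, 14])
nih_rounds = list(cv_rounds(nih_folds))
```

Every case appears in the test set of exactly one round, so the reported metrics cover all 82 (or 59) subjects.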

B. Evaluation Metrics
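To evaluate the segmentation performance, four metrics (DSC, Jaccard similarity coefficient, precision, and recall) are used, with their definitions given below. Let $V_s$ denote the voxel set of the automatic segmentation volume, and $V_g$ denote the voxel set of the ground-truth volume. Then
$$\mathrm{DSC} = \frac{2\,|V_s \cap V_g|}{|V_s| + |V_g|}, \qquad \mathrm{Jaccard} = \frac{|V_s \cap V_g|}{|V_s \cup V_g|}$$
$$\mathrm{Precision} = \frac{|V_s \cap V_g|}{|V_s|}, \qquad \mathrm{Recall} = \frac{|V_s \cap V_g|}{|V_g|}.$$
The DSC and Jaccard metrics measure the correct overlap between the automatic segmentation and the ground-truth segmentation. Precision (or positive predictive value) and recall (or sensitivity) measure the fractions of correctly segmented voxels among the automatically segmented voxels and among the ground-truth voxels, respectively.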
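The four metrics follow directly from set operations on the two voxel sets. The sketch below uses the standard definitions of these metrics; the function name and the toy voxel coordinates are illustrative, and both sets are assumed to be nonempty.

```python
def overlap_metrics(v_s, v_g):
    """DSC, Jaccard, precision, and recall for two nonempty voxel sets.

    v_s: voxel coordinates labelled as pancreas by the automatic segmentation.
    v_g: voxel coordinates labelled as pancreas in the ground truth.
    """
    v_s, v_g = set(v_s), set(v_g)
    inter = len(v_s & v_g)
    return {
        "dsc": 2 * inter / (len(v_s) + len(v_g)),
        "jaccard": inter / len(v_s | v_g),
        "precision": inter / len(v_s),
        "recall": inter / len(v_g),
    }


# Toy example: three predicted voxels, four ground-truth voxels, two shared.
predicted = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
ground_truth = {(0, 0, 1), (0, 1, 1), (1, 1, 1), (1, 1, 0)}
metrics = overlap_metrics(predicted, ground_truth)
```

For a single subject, the two overlap measures are tied by Jaccard = DSC / (2 - DSC), so they rank segmentations identically; they differ only once values are averaged over subjects.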

C. Parameters Setting
Our method was implemented based on the widely used open-source framework Caffe [51], customized to support 3-D operations for all necessary layers [17]. We train our network via the standard stochastic gradient descent (SGD) algorithm using a step learning rate. The learning rate is initialized at $10^{-2}$ and decayed over the training iterations by a factor of $10^{-1}$ until it reaches $10^{-6}$. We use a momentum of 0.9 to make a tradeoff between the last observed image and the newly observed image. The batch size is 20. All the network parameters are initialized by Xavier's [52] method.
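The step learning-rate policy described above can be sketched as a simple function of the iteration index. The number of iterations between two decays (`step_size`) is not given in the text, so the value used here is purely hypothetical.

```python
def step_learning_rate(iteration, base_lr=1e-2, gamma=0.1,
                       step_size=10000, min_lr=1e-6):
    """Learning rate after `iteration` SGD updates under a step policy.

    The rate is multiplied by `gamma` every `step_size` iterations and is
    never allowed to fall below `min_lr`. `step_size` is a hypothetical value.
    """
    lr = base_lr * gamma ** (iteration // step_size)
    return max(lr, min_lr)


schedule = [step_learning_rate(i) for i in (0, 9999, 10000, 25000, 40000, 10 ** 6)]
```

With these settings, the rate drops by one order of magnitude per step and stays at the floor from the fourth decay onward.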
We resample all images to a unified resolution (i.e., 1 × 1 × 1 mm$^3$) and then crop each image to remove nonabdomen regions by automatically selecting the largest connected area. The mean of the whole image is first subtracted from the input data of each image. Then, we normalize the intensities into the range of (−1, 1) by dividing by the maximum intensity value.
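One possible reading of this intensity normalization is sketched below on a flat list of voxel intensities. Dividing by the largest absolute value after mean subtraction is an assumption of this sketch; it guarantees that all values fall within [−1, 1], with the extreme voxel mapped to exactly ±1.

```python
def normalize_intensities(voxels):
    """Subtract the image mean, then scale by the largest absolute value."""
    mean = sum(voxels) / len(voxels)
    centered = [v - mean for v in voxels]
    scale = max(abs(v) for v in centered) or 1.0  # guard against a constant image
    return [v / scale for v in centered]


# Toy example with four hypothetical CT intensities.
normalized = normalize_intensities([-100.0, 0.0, 50.0, 250.0])
```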
In the training process, we randomly crop patches through the whole image in the cascade C 1 and the images of pancreas regions in the cascade C 2 . We employ the patch size of 16×64×64 as the input image size with the consideration of computational expenses. In the testing phase, we crop patches with a fixed step size of 4×16×16, as explained in detail in the following section.

D. Evaluation of Pancreas Localization on the NIH Dataset
Recall that the cascade C 1 conducts coarse segmentation of the pancreas (Fig. 4). The DSC value of pancreas in C 1 is 71.9%. Although this DSC value is not high, the proposed region can cover the whole pancreas (Fig. 5). Pancreas skeletons are also shown in Fig. 5. Moreover, the efficiency is significantly improved by using the down-sampled image, as the computational time of C 1 on an NVIDIA TITAN XP GPU is only 4 s.

E. Evaluation of Pancreas Segmentation on the NIH Dataset
We compare our method with six state-of-the-art methods for pancreas segmentation, as briefly introduced below.
1) Roth et al. [6] introduced HNNs on the three orthogonal axial, sagittal, and coronal views to do the localization and the segmentation of pancreas.
6) Yu et al. [8] added the recurrent saliency transformation module into their previous models (Zhou et al. [4]), which achieved the state-of-the-art performance in terms of DSC.
Table I compares the segmentation performance of our proposed method with six state-of-the-art methods, using mean DSC, Jaccard, precision, and recall (with standard deviation).
The four indices over 82 samples increase from 84.6% to 85.9%, 71.8% to 75.7%, 84.5% to 87.6%, and 82.8% to 85.2%, compared to the state-of-the-art methods. Three examples of the ground-truth segmentations and our segmentations are shown in Figs. 6 and 7. As can be observed from Figs. 6 and 7, the segmentations of our proposed method are highly similar to the manual delineations, in spite of the diverse shapes and locations of pancreas in CT images. We compared our results with previous methods [2], [4], [6], [8], [9] through t-test (Table II). The t-values with 81 degrees of freedom are 9.86, 4.14, 5.14, 1.77, and 3.11, respectively. The corresponding p-values are p < 0.001, p < 0.001, p < 0.001, p < 0.1, and p < 0.05, respectively. Therefore, our proposed method has statistically significant improvements (p < 0.001) compared with other methods [2], [4], [6]. The improvement over the method in [9] is also significant (p < 0.05). However, the improvements do not seem significant compared with the recent state-of-the-art recurrent saliency transformation network (RSTN) method proposed by Yu et al. [8]. To further verify the performance of our method, we compared our results with RSTN [8] through a paired-sample t-test. In RSTN, all the intensity values were first saturated into [−100, 240]. The FCN model pretrained on PascalVOC is then adapted to conduct the segmentation on Caffe. The coarse-stage segmentation ran 60 000 iterations with the learning rate of $10^{-5}$. The saliency transformation module is implemented by two 3 × 3 convolutional layers. The fine-scaled segmentation model ran 60 000 iterations with the learning rate of $10^{-5}$ with images cropped from the coarse-scaled segmentation mask. The mean and standard deviation of the differences in DSC in the paired-sample t-test are +2.1% and 5.0%, respectively. The t-value is 2.48 with 81 degrees of freedom (p < 0.05). Therefore, our proposed method achieves statistically significant improvements.
Fig. 8. Changes of values of four evaluation metrics with respect to three different patch sizes. For each metric, the first bar, second bar, and last bar correspond to the patch sizes of 16×32×32, 16×64×64, and 16×128×128, respectively. Leave-one-subject-out cross-validation is used for obtaining all these results.
Fig. 9. Changes of values of four evaluation metrics with respect to three different step sizes at the testing phase. For each metric, the first bar, second bar, and last bar correspond to the step sizes of 2 × 8 × 8, 4 × 16 × 16, and 8 × 32 × 32, respectively. Leave-one-subject-out cross-validation is used for obtaining all these results.

2) Ablation Study:
a) Evaluation on the impact of input patch size: Since different patch sizes change the receptive field of the network, which leads to different segmentation accuracies, we conduct experiments using three different input patch sizes, that is, 16×32×32, 16×64×64, and 16×128×128, for training the same network architecture shown in Fig. 3. As shown in Fig. 8, our method obtains the best results with the patch size of 16×64×64. Due to the limited context observed, the segmentation performance is the lowest with the patch size of 16×32×32. However, we found that the performance did not become noticeably better with the patch size of 16×128×128. This is because fewer input samples can be used to train the network when a large patch size is used.
b) Evaluation on the impact of step size: In the testing phase, we crop patches with a fixed step size. After these patches are fed into the trained model, all the predicted label patches from the same subject are combined into a single label image by averaging the label values of the overlapping image regions. To find a step size that balances the segmentation accuracy and the computational complexity, we extract patches from CT images with step sizes of 2×8×8, 4×16×16, and 8×32×32 for testing. The average times consumed for these three step sizes are 706.69, 53.62, and 7.89 s, respectively. A small step size increases the workload and costs more time.
The performance results are given in Fig. 9. The step size of 2×8×8 did not perform noticeably better than the step size of 4×16×16.
Fig. 11. Comparison between models with and without dense connections. For each metric, the first bar is the result of the model without dense connections, and the second bar is the result of the model with dense connections. Leave-one-subject-out cross validation is used to generate these results.
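The patch-wise testing procedure (cropping with a fixed step and averaging overlapping predictions) can be sketched as follows. This is a minimal pure-Python illustration, not the original implementation: `predict` stands for any trained model that returns a patch of label values, the extra window added at the far border is an assumption, and the step is assumed not to exceed the patch size so that every voxel is covered.

```python
from itertools import product


def window_starts(length, patch, step):
    """Start indices of windows of size `patch` moved by `step` along one axis."""
    starts = list(range(0, length - patch + 1, step))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # assumed extra window covering the border
    return starts


def fuse_patches(volume_shape, patch_shape, step, predict):
    """Average the predictions of overlapping patches into one label volume."""
    depth, height, width = volume_shape
    total = [[[0.0] * width for _ in range(height)] for _ in range(depth)]
    count = [[[0] * width for _ in range(height)] for _ in range(depth)]
    axes = [window_starts(n, p, s) for n, p, s in zip(volume_shape, patch_shape, step)]
    for z0, y0, x0 in product(*axes):
        patch = predict((z0, y0, x0))  # nested list with shape `patch_shape`
        for dz, dy, dx in product(*(range(p) for p in patch_shape)):
            total[z0 + dz][y0 + dy][x0 + dx] += patch[dz][dy][dx]
            count[z0 + dz][y0 + dy][x0 + dx] += 1
    return [[[total[z][y][x] / count[z][y][x] for x in range(width)]
             for y in range(height)] for z in range(depth)]


def constant_model(origin):
    """Stand-in for a trained network: predicts label value 1.0 everywhere."""
    return [[[1.0] * 4 for _ in range(4)] for _ in range(2)]


fused = fuse_patches((4, 8, 8), (2, 4, 4), (1, 2, 2), constant_model)
```

Halving the step along every axis multiplies the number of patches by roughly eight, which is in line with the sharp growth in testing time as the step size shrinks.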
c) Comparison with nonskeleton-guided model: To evaluate the contribution of skeleton in our network, we compare our proposed network with our downgraded network without using guidance from the pancreas skeleton (i.e., the single task model). The DSC (mean±std [max, min]) value of pancreas after segmentation is 83.0%±5.7% [91.15%, 59.32%], which is significantly lower than our proposed method. The p-value for the single task model and the proposed method is less than 0.05. Therefore, our proposed method with the guidance of skeletons improves the segmentation accuracy significantly.
In the conventional U-Net or FCN, the pixel labels equally contribute to the training of the network. However, in this situation, organs like the pancreas will not be completely segmented because of the noise in CT images. Therefore, to enhance the discriminative ability of the network for the pancreas, we provide the network with additional guidance to improve the significance of the region where the pancreas is located. This methodology teaches the network to focus more on the pancreas area. In the multitask learning strategy, the learning process is guided by the segmentation loss and the regression loss simultaneously, so that complementary information from the segmentation and regression tasks is better used. The regression task of the pancreas skeleton provides a strong reference for the organ shape, which alleviates the under-estimation caused by the large shape variation of the pancreas in CT images. From the visualization results shown in Fig. 10, we can observe that, for cases 2 and 3, with highly variable pancreas shapes, the single task model under-estimated the pancreas severely. However, in the proposed model, the extracted skeleton provides rich information about the shape and size of the pancreas, and the final results are clearly improved. On the other hand, for case 1 with a relatively regular shape, the single task model obtains a result similar to that of our proposed method, although still worse due to unclear boundaries of the pancreas. The results verify the effectiveness of using the skeleton guidance for organs with large variability in shape.
d) Comparison with the multitask model without dense connections: To investigate the impact of dense connections in the model, we conduct another experiment using our proposed model and the multitask model without dense connections. Fig. 11 shows the results of the four metrics for the two models. As confirmed in Fig. 11, dense connections are useful for training pancreas segmentation models.
e) Effectiveness of feature consistency: To demonstrate the effectiveness of using modules of feature consistency, we also run the model without modules of feature consistency. From Fig. 13, we can see that the results with the use of the modules are more accurate.
Fig. 13. Comparison between models with and without modules of feature consistency. For each metric, the first bar is the result of the model without modules of feature consistency, and the second bar is the result of the model with modules of feature consistency. Leave-one-subject-out cross validation is used to generate these results.
f) Evaluation on the effectiveness of the 3-D fully connected conditional random field: As described in Section II-C, we utilize a 3-D fully connected CRF to refine the segmentation results of our proposed model. The four indices over 82 samples increase from 85.9% to 86.4%, 75.7% to 76.2%, 87.6% to 88.3%, and 85.2% to 85.3%. Two examples of the ground-truth segmentations and our segmentations are shown in Fig. 12. As can be seen, the 3-D fully connected CRF can effectively remove isolated false positive segmentations.

F. Evaluation of Pancreas Segmentation on the FMU Dataset
We trained the networks for the FMU dataset with the same training settings as the networks for the NIH dataset. Three examples of the ground-truth segmentations and our segmentations are shown in Figs. 14 and 15. As can be observed, the segmentations generated by our proposed method are highly consistent with the ground-truth segmentations, in spite of the shape and size variation of pancreas in CT images. To further verify the effectiveness of our proposed 3-D multitask FCN, the RSTN is also used for segmenting the pancreas on the FMU dataset. We fine-tuned the RSTN for the NIH dataset to make the model converge on the FMU dataset. As can be seen in Table III, our proposed method can achieve better segmentation precision than the RSTN method, indicating potential feasibility in real clinical applications.

IV. CONCLUSION
In this article, we have proposed a cascaded multitask 3-D FCN to address the challenging pancreas segmentation problem in abdominal CT images. Since the pancreas is relatively small and may appear in different locations of the abdomen, we first quickly locate the pancreas in the raw CT image using a standard FCN architecture. Then, we apply a second stage of the cascade to only the localized region for final fine-grained segmentation in a multitask scheme. More importantly, we have proposed a skeleton-guided network to grasp the organ's morphological information, which is shown to be critical for accurate segmentation, especially in the challenging cases. The experimental results on the two challenging pancreas datasets, that is, the NIH dataset and the FMU dataset, indicate that our proposed method is more accurate than the state-of-the-art methods, and is also more robust across two different datasets. Finally, the ablation study also demonstrates that our proposed module does contribute to the performance gain.

She is currently an Associate Professor with Business School, Shandong Normal University. Her current research interests include image processing, medical image analysis, and membrane computing.
Dr. Xue won a National Visiting Scholar Program with the University of North Carolina from 2017 to 2018.
Kelei He received the Ph.D. degree in computer science and technology from Nanjing University, Nanjing, China.
He is currently an Assistant Professor with the Medical School of Nanjing University. His research interests include medical image analysis, computer vision, and deep learning.