Organ at Risk Segmentation in Head and Neck CT Images by Using a Two-Stage Segmentation Framework Based on 3D U-Net

Accurate segmentation of organs at risk (OARs) plays a critical role in the treatment planning of image-guided radiation therapy for head and neck cancer. This segmentation task is challenging for both humans and automatic algorithms because of the relatively large number of OARs to be segmented, the large variability in size and morphology across different OARs, and the low contrast between some OARs and the background. In this paper, we propose a two-stage segmentation framework based on 3D U-Net. In this framework, the segmentation of each OAR is decomposed into two sub-tasks: locating a bounding box of the OAR, and segmenting the OAR from the small volume within the bounding box. Each sub-task is fulfilled by a dedicated 3D U-Net. The decomposition makes each of the two sub-tasks much easier, so that each can be completed more accurately. We evaluated the proposed method and compared it to state-of-the-art methods using the MICCAI 2015 Challenge dataset. In terms of the boundary-based metric 95HD, the proposed method ranked first in eight of the nine OARs and second in the remaining one. In terms of the area-based metric DSC, the proposed method ranked first in six of the nine OARs and second in the other three, with a small difference from the first-ranked method.


Introduction
Head and neck (HaN) cancer is one of the most common cancers, with more than half a million new cases worldwide every year (Fitzmaurice et al., 2017). Currently, image guided radiation therapy (IGRT), including intensity-modulated radiation therapy (IMRT) and volumetric-modulated arc therapy (VMAT), is the state-of-the-art treatment option because of its ability to deliver highly conformal doses (Hansen et al., 2006; Verbakel et al., 2009; Zhao et al., 2011). The key to the success of IGRT is patient-specific treatment planning, in which medical images are used to design a radiation plan that concentrates the dose on the target volume while minimizing the dose delivered to the surrounding organs at risk (OARs). Therefore, it is essential to segment the OARs from the treatment planning image, which is usually a HaN computed tomography (CT) image.
In current clinical practice, OARs are usually delineated manually, but the complexity and variability of OAR morphology in head and neck CT images make manual delineation error-prone and very time consuming (Harari et al., 2010; La Macchia et al., 2012). It may take up to three hours to segment all OARs for treatment planning (Harari et al., 2010). Some treatment planning systems provide automatic segmentation methods, such as atlas-based methods (Sharp et al., 2014), but the results are not accurate enough, and intensive manual adjustment is still needed to make them usable for treatment planning. The time needed for manual adjustment can even be comparable to that of manual segmentation from scratch (La Macchia et al., 2012). Therefore, there is great demand for a fast, accurate and automatic OAR segmentation method that considerably reduces human labor in HaN treatment planning.
Medical image segmentation is an active research area, and a large number of methods have been proposed in the literature to segment different targets from medical images of different modalities. Some of these approaches have also been applied to OAR segmentation, but current results are far from satisfactory. A Head and Neck Auto Segmentation Challenge was held in conjunction with MICCAI in 2015 ("MICCAI 2015 Challenge" from here on), which provided segmentation results of nine OARs for evaluation (Raudaschl et al., 2017). This challenge established an unbiased benchmark for OAR segmentation in HaN CT images. Six teams participated in the challenge using different kinds of traditional segmentation approaches, including statistical shape models, active appearance models, multi-atlas based segmentation, and semi-automatic segmentation (Raudaschl et al., 2017), but there is still considerable room for improvement. The challenges of OAR segmentation in HaN CT images include: (i) the complexity and variability of the OARs are high, so it is difficult to incorporate prior information in shape models to help segment new images; (ii) the sizes of the OARs differ greatly, so segmentation algorithms may easily be biased toward large OARs and produce large errors in small OARs; (iii) the contrast of soft tissues is poor in CT images, which makes the segmentation of some OARs, such as the brainstem, very difficult.
In recent years, deep learning methods, especially Convolutional Neural Networks (CNNs), have demonstrated state-of-the-art performance in many image segmentation tasks, including the segmentation of medical images (Cai et al., 2016; Cha et al., 2016; Hu et al., 2017; Milletari et al., 2016, 2017; Zhu et al., 2017). Several groups (Ibragimov and Xing, 2017; Ren et al., 2018) have used deep learning methods for OAR segmentation, but the improvement over traditional methods is not significant, and the superiority of deep learning in the OAR segmentation task has not yet been fully demonstrated.
In this paper, we propose a two-stage strategy based on 3D U-Net (Çiçek et al., 2016) to automatically segment OARs from HaN CT images. In the first stage, we use a 3D U-Net to locate a 3D bounding box of the target OAR, and in the second stage, we use another 3D U-Net to segment the target OAR from the volume within the bounding box. The two-stage strategy separates localization and segmentation, which are the two conceptual parts of segmenting a target from a large volume, so that each 3D U-Net can be better trained for its own objective. We tested the proposed method on segmenting nine OARs in the MICCAI 2015 Challenge dataset, and the overall results are significantly better than those of the state-of-the-art methods.

Related work
Although the contrast between bone and soft tissue is relatively high in CT images, the characteristics of the HaN OAR segmentation task, including the large number of OARs to be segmented, the great variety in size and morphology among different OARs, and the low contrast between some OARs and their background, make simple segmentation methods such as thresholding, edge detection and region growing inapplicable. Many methods that have been used successfully in other medical image segmentation tasks have also been applied in this field, such as 3D level sets (Street et al., 2007) and atlas-based methods (Han et al., 2008), but the results are not satisfactory.
Several approaches have been developed to incorporate prior knowledge, which is often the gold-standard segmentation of some subjects, to help the segmentation of new subjects, and these approaches have also been used in OAR segmentation. For example, Mannion-haworth et al. (n.d.) build a statistical shape model of the OARs and deform the model to fit the image to achieve a segmentation. A multi-atlas approach is adopted in Han et al. (2008) and Peroni (2011), which registers previously segmented images to the target image and then fuses the labels of the segmented images to obtain a labeling of the target image. Another approach is to train a classifier with the previously segmented images and transform the segmentation of the target image into a classification problem. In the MICCAI 2015 Challenge, most teams adopted approaches that utilize prior knowledge, including statistical shape models, active appearance models, and multi-atlas based methods. This challenge provides a unified evaluation framework for research on OAR segmentation methods.
Along with its great success in general image processing tasks, deep learning has also been widely used in medical image segmentation (Cai et al., 2016; Cha et al., 2016; Hu et al., 2017; Milletari et al., 2016, 2017; Zhu et al., 2017), including OAR segmentation in HaN CT images. Ibragimov and Xing (2017) applied a 2D CNN to segment OARs in in-house HaN CT images but achieved only a slight improvement in the right submandibular gland and the right optic nerve, and the performance on the other OARs was similar to that of traditional methods. Ren et al. (2018) proposed interleaved 3D-CNNs for joint segmentation of small OARs but only reported results on three small OARs (chiasm, left and right optic nerves) of the MICCAI 2015 Challenge dataset.
Our contribution in this paper is a two-stage framework that decomposes OAR segmentation into two relatively simpler tasks and completes each task with a dedicated 3D U-Net (Çiçek et al., 2016). The first task is to locate the target OAR with a bounding box, and the second task is to segment the target OAR within the bounding box. The decomposition makes each task much simpler than segmenting an OAR directly from the whole volume and helps improve the overall performance. Experiments on the MICCAI 2015 Challenge data show that the proposed method achieved the highest DSC in six of the nine OARs and the second highest DSC in the other three. In addition, the proposed method achieved the smallest 95HD for eight of the nine OARs by a significant margin and the second smallest 95HD for the remaining OAR.

The Dataset of the MICCAI 2015 Challenge
In this study, we evaluated the proposed OAR segmentation framework and compared it to other methods using the PDDCA dataset, which is publicly available at http://www.imagenglab.com/newsite/pddca/. This dataset is provided by Dr. Gregory C Sharp, and it was used in the Head and Neck Auto-Segmentation Challenge 2015, a satellite event at the Medical Image Computing and Computer Assisted Intervention (MICCAI) 2015 conference. The current version (v 1.4.1) of the PDDCA dataset consists of 25 training images, 8 additional training images and 15 testing images. The original images come from the RTOG 0522 clinical trial (Milletari et al., 2016), which provides 111 treatment planning HaN CT images. The subset was chosen to ensure that the image quality is adequate and that the target OARs have minimal overlap with the tumors. Each image consists of a series of axial slices with 512×512 voxels per slice, and the number of slices varies from 76 to 263. The in-plane spacing is between 0.76mm×0.76mm and 1.27mm×1.27mm, and the inter-plane spacing is between 1.25mm and 3mm.
In this dataset, nine anatomical structures were used as the segmentation targets: brainstem, optic nerve left, optic nerve right, chiasm, parotid left, parotid right, mandible, submandibular left and submandibular right. All nine structures are important OARs in head and neck radiation treatment (Zhu et al., 2017), and they were manually segmented by experts to provide high quality and consistency. The masks of most of these structures are provided in all 33 training images, except that the submandibular left and the submandibular right are only segmented in 26 and 21 training images, respectively. The masks of all nine structures are provided in the 15 testing images, and they are used as the gold standard for evaluation.

Overview of the two-stage segmentation framework
The proposed two-stage segmentation framework and its training and testing flowcharts are illustrated in Fig. 1. The framework consists of two 3D U-Nets. The original images and the masks are first cropped into a volume with a consistent size of 384×384×224 voxels for further processing.
The first 3D U-Net is used to coarsely locate the target structure with a bounding box, and it is denoted as LocNet. The cropped images and masks are first down-sampled to 96×96×56 voxels and then used for training the LocNet. LocNet outputs a 0-1 classification of each voxel, indicating whether the voxel falls in the bounding box. A post-processing step calculates a bounding box of size (h/4)×(w/4)×(k/4) from the output of the LocNet, and the bounding box is then transferred back to the coordinate frame of the cropped volume. The bounding box is applied to the cropped volume to obtain a smaller volume of size h×w×k, which is denoted as the target volume. One LocNet is trained for each target structure, because each structure needs a specific bounding box size.
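The transfer of a bounding box from the down-sampled LocNet grid back to the cropped volume can be illustrated with a minimal sketch. It assumes that a box is stored as a start corner plus a size and that the transfer is a plain multiplication by the down-sampling factor of four; the helper names and the start corner are hypothetical, while the 144×144×112 box size is the mandible size quoted later in the paper.

```python
from typing import Tuple

# A box is (x0, y0, z0, size_x, size_y, size_z); this layout is an assumption.
Box = Tuple[int, int, int, int, int, int]


def upscale_box(box_lowres: Box, factor: int = 4) -> Box:
    """Map a box from the down-sampled LocNet grid to the cropped-volume grid."""
    return tuple(v * factor for v in box_lowres)


def crop_slices(box: Box):
    """Index slices that extract the target volume from the cropped volume."""
    x0, y0, z0, sx, sy, sz = box
    return (slice(x0, x0 + sx), slice(y0, y0 + sy), slice(z0, z0 + sz))


# Hypothetical start corner; a 36x36x28 box in the 96x96x56 LocNet output.
low_res_box = (30, 20, 10, 36, 36, 28)
full_res_box = upscale_box(low_res_box)
target_slices = crop_slices(full_res_box)
```

Here `full_res_box` has the size h×w×k = 144×144×112 and stays inside the 384×384×224 cropped volume.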
The second 3D U-Net is used to segment the target structure from the target volume obtained in the previous step, and we denote this network as SegNet. The target volume has a size of h×w×k, which is much smaller than the 384×384×224 sized cropped volume, and only one structure needs to be segmented from it. These two characteristics make the segmentation task of SegNet much easier. The output of the SegNet is a mask volume in which each voxel is 0 or 1, indicating background and target, respectively.
The LocNet and the SegNet are trained separately, and one LocNet and one SegNet are trained for each of the nine structures. In the following two subsections, we introduce the preprocessing needed to prepare the training and testing data for the two 3D U-Nets and their concrete training and testing procedures.

Interpolation and cropping the original images to the same size.
The original images have different in-plane and inter-plane resolutions. This increases the variance of the shape and size of each structure and potentially increases the difficulty of segmenting them. Therefore, we resampled all the images into isotropic volumes with the same spatial resolution of 1mm×1mm×1mm by using bi-cubic interpolation. After interpolation, the in-plane size of all the training and testing images is between 389×389 and 650×650 in voxel, and the slice number is between 226 and 416.
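The in-plane sizes quoted above follow directly from the original 512×512 slices and the range of in-plane spacings. The following minimal sketch reproduces that arithmetic; the function name is hypothetical, and rounding to the nearest integer is an assumption, since the exact rounding rule used in the resampling is not stated.

```python
def resampled_size(n_voxels: int, spacing_mm: float, target_mm: float = 1.0) -> int:
    """Voxel count along one axis after resampling to the target spacing."""
    return round(n_voxels * spacing_mm / target_mm)


# 512 voxels per row at the finest and coarsest in-plane spacing of the dataset.
smallest_in_plane = resampled_size(512, 0.76)
largest_in_plane = resampled_size(512, 1.27)
```

With these inputs the sketch gives 389 and 650 voxels, which matches the reported range of 389×389 to 650×650.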
Because 3D U-Net requires all training and testing inputs to have the same size, we need to crop the isotropic volumes after interpolation. Considering the sizes of the isotropic volumes in this dataset and the requirement that the size in each direction should be a multiple of eight, we crop them into volumes of size 384×384×224. We do not need to crop the training and testing images manually to put the target structures at the center of the cropped volume. Instead, we divide the nine target structures into two groups and adopt a consistent cropping strategy for each group. The first group consists of the brainstem, the optic chiasm and the optic nerves (both left and right), and the second group consists of the mandible, the parotid glands (both left and right) and the submandibular glands (both left and right). The X, Y, and Z axes of the coordinate frame of the original images correspond to the left-right, anterior-posterior, and superior-inferior directions of the human body, respectively. We place the 384×384×224 sized cropping window in the original image, which leaves margins on both sides of the cropping window along each axis. The number of voxels in the margins may differ between images because the image sizes differ. Nevertheless, for all the target structures, the ratio between the left and right margins along the X-axis is 0.5 to 0.5. The ratio between the anterior and the posterior margins along the Y-axis is 0.3 to 0.7 and 0.2 to 0.8 for the structures in the first and the second group, respectively. The ratio between the superior and the inferior margins along the Z-axis is 0.9 to 0.1 and 0.7 to 0.3 for the structures in the first and the second group, respectively. Please note that a target structure is not necessarily located at the center of the cropped volume, because a dedicated network will be used to locate it.
For the structures in each group, the cropping is done automatically on both the training and the testing images with the same parameters.
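The margin ratios above can be turned into the start index of the cropping window as in the following minimal sketch. It assumes that voxel indices increase from left to right, from anterior to posterior and from superior to inferior, so that the first-named margin lies on the low-index side, and that fractional margins are rounded to the nearest voxel; the names `MARGIN_FRACTION` and `crop_start` and the example volume size are hypothetical.

```python
CROP_SIZE = (384, 384, 224)

# Fraction of the total margin placed on the first-named side of each axis
# (left, anterior, superior), taken from the ratios given in the text.
MARGIN_FRACTION = {
    "group1": (0.5, 0.3, 0.9),  # brainstem, chiasm, optic nerves
    "group2": (0.5, 0.2, 0.7),  # mandible, parotid and submandibular glands
}


def crop_start(volume_size, group):
    """Start index (x, y, z) of the cropping window inside the volume."""
    starts = []
    for size, crop, fraction in zip(volume_size, CROP_SIZE, MARGIN_FRACTION[group]):
        margin = size - crop  # total margin shared by the two sides of this axis
        if margin < 0:
            raise ValueError("volume is smaller than the cropping window")
        starts.append(round(margin * fraction))
    return tuple(starts)


# Hypothetical isotropic volume of 500x500x300 voxels.
start_group1 = crop_start((500, 500, 300), "group1")
start_group2 = crop_start((500, 500, 300), "group2")
```

The same parameters are applied to every image, so no per-image manual placement is needed.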
Determining the size of the bounding box for each structure.
In the 384×384×224 sized cropped volume, we first locate a bounding box that encloses the target structure, and the volume data within the bounding box is called the target volume. We therefore need to determine the size of the bounding box for each structure before locating it. Because the target volume is the input of the SegNet, its size in each direction should also be a multiple of eight. In this study, we determined the size of the target volume for each structure by considering the structure's size in the training dataset, and the sizes are listed in Table 1.
Table 1. Size of the bounding box for each target structure.
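The paper does not state the exact rule used to derive the box sizes from the training data. One plausible rule, shown here only as a hedged sketch, is to take the largest extent of the structure over the training masks, add a safety margin on each side, and round up to the next multiple of eight. The function name, the margin of 8 voxels and the example extent are assumptions and are not taken from Table 1.

```python
import math


def box_size_for_extent(max_extent, margin=8, multiple=8):
    """Smallest box, with sides that are multiples of `multiple`, covering the
    largest observed extent plus `margin` voxels on each side."""
    return tuple(
        math.ceil((extent + 2 * margin) / multiple) * multiple
        for extent in max_extent
    )


# Hypothetical largest extent (in voxels) of one structure over the training set.
example_box = box_size_for_extent((120, 125, 90))
```

For this hypothetical extent the sketch returns a box whose three sides are all divisible by eight, as required for the SegNet input.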

Two-stage 3D U-Net segmentation framework
In this study, we concatenate two 3D U-Nets to segment a target structure: the first one locates a relatively small target volume that encloses the target structure, and the second one segments the target structure from the target volume. The first and the second networks are called LocNet and SegNet, respectively. As shown in Fig. 2, LocNet and SegNet have the same network structure, which consists of an analysis path and a synthesis path. In the analysis path, each layer contains two 3×3×3 convolutions, each followed by batch normalization (BN) and a rectified linear unit (ReLU), and then a 2×2×2 max pooling with a stride of two in each dimension. In the synthesis path, each layer consists of a 2×2×2 up-convolution with a stride of two in each dimension, followed by two 3×3×3 convolutions, each followed by BN and a ReLU. Shortcut connections from layers of equal resolution in the analysis path provide essential high-resolution features to the synthesis path. At the final layer, a 1×1×1 convolution reduces the number of output channels to produce a 0-1 classification. In total, each network has 17 convolutional layers.
The 384×384×224 sized cropped volume is first down-sampled to 96×96×56 and then fed into LocNet. The output of LocNet is a 96×96×56 sized binary volume, from which we locate the bounding box. Because the cropped volume is down-sampled by a factor of four, the bounding box to be located in the 96×96×56 output volume is also shrunk by a factor of four. For example, the bounding box of the mandible is 144×144×112 voxels, so we need to locate a bounding box of size 36×36×28 in the 96×96×56 output volume. It is very unlikely that all the voxels with value 1 fall exactly within a cuboid of the expected size, so we use a sliding-window technique to locate the bounding box. We slide a cuboid of the expected size through the output volume, and the location at which the cuboid encloses the maximum number of voxels with value 1 is taken as the location of the bounding box. When multiple locations attain the same maximum, their average location is used.
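The sliding-window search described above can be sketched as follows. This is an illustrative implementation, not the authors' code: it uses a 3D prefix-sum table so that each cuboid count costs a constant number of lookups, represents the binary volume as a nested list `mask[x][y][z]`, and rounds the averaged tie location to the nearest voxel. The function name `locate_box` and the toy volume are hypothetical.

```python
from itertools import product


def locate_box(mask, box):
    """Return (start_corner, count) of the cuboid of size `box` that encloses
    the most voxels with value 1; tied corners are averaged and rounded."""
    nx, ny, nz = len(mask), len(mask[0]), len(mask[0][0])
    bx, by, bz = box
    # Prefix sums: P[x][y][z] holds the sum of mask[:x][:y][:z].
    P = [[[0] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    for x, y, z in product(range(nx), range(ny), range(nz)):
        P[x + 1][y + 1][z + 1] = (
            mask[x][y][z]
            + P[x][y + 1][z + 1] + P[x + 1][y][z + 1] + P[x + 1][y + 1][z]
            - P[x][y][z + 1] - P[x][y + 1][z] - P[x + 1][y][z]
            + P[x][y][z]
        )
    best, corners = -1, []
    for x, y, z in product(range(nx - bx + 1), range(ny - by + 1), range(nz - bz + 1)):
        X, Y, Z = x + bx, y + by, z + bz
        # Inclusion-exclusion gives the number of ones inside the cuboid.
        count = (
            P[X][Y][Z] - P[x][Y][Z] - P[X][y][Z] - P[X][Y][z]
            + P[x][y][Z] + P[x][Y][z] + P[X][y][z] - P[x][y][z]
        )
        if count > best:
            best, corners = count, [(x, y, z)]
        elif count == best:
            corners.append((x, y, z))
    n = len(corners)
    return tuple(round(sum(c[i] for c in corners) / n) for i in range(3)), best


# Toy 6x6x6 output volume with a 2x2x2 block of ones.
toy = [[[0] * 6 for _ in range(6)] for _ in range(6)]
for x, y, z in product((2, 3), (3, 4), (1, 2)):
    toy[x][y][z] = 1

unique_location = locate_box(toy, (2, 2, 2))
tied_location = locate_box(toy, (4, 2, 2))
```

In the second call three window positions along X enclose all eight voxels, and the average of their start corners is returned.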

Experiments and results
We used all 33 training images in the dataset for training the LocNet and the SegNet and evaluated the segmentation framework using the 15 testing images. Four metrics were calculated to evaluate the performance of the proposed segmentation framework. We compared the proposed method to several state-of-the-art methods, including both traditional and artificial intelligence based approaches. Finally, we show the effectiveness of the locating network by comparing the proposed method to two conventional strategies used for segmenting 3D medical images with deep learning frameworks.

Evaluation metrics
We used four evaluation metrics in this study, as defined below.
(1) Dice Similarity Coefficient (DSC). DSC measures the degree of overlap between the segmentation result and the gold standard, and it is defined as follows.

$$\mathrm{DSC} = \frac{2|A \cap B|}{|A| + |B|}$$
where $A$ and $B$ represent the voxel set of the segmentation result and the voxel set of the gold standard, respectively.
(2) 95% Hausdorff Distance (95HD). Before giving 95HD, we first give the definition of the Hausdorff Distance, which is usually used to measure the deviation between the contours of two areas. Given two point sets $X$ and $Y$, and $d(x, y)$ measuring the Euclidean distance between two points $x \in X$ and $y \in Y$, the directed Hausdorff Distance can be defined as follows.
$$\vec{d}_{H}(X, Y) = \max_{x \in X} \min_{y \in Y} d(x, y)$$
The Hausdorff Distance $\vec{d}_{H}(X, Y)$ measures the largest distance from a point in $X$ to its nearest neighbor in $Y$, and this distance is sensitive to a large segmentation error in a very small region. To eliminate this sensitivity, an r% Hausdorff Distance can be calculated to measure the $r$th percentile of these distances, which is denoted as $\vec{d}_{H,r}(X, Y)$.
In this study we used the 95HD, that is, the Hausdorff Distance with $r = 95$, evaluated between the contour of the segmentation result and the contour of the gold standard.
(3) Positive Predictive Value (PPV). PPV is the proportion of the correctly segmented volume in the whole volume of the segmentation result.
$$\mathrm{PPV} = \frac{|A \cap B|}{|A|}$$
(4) Sensitivity (SEN). SEN is the proportion of the correctly segmented volume in the whole volume of the gold standard.
$$\mathrm{SEN} = \frac{|A \cap B|}{|B|}$$
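The four metrics can be sketched in a few lines of Python, writing A for the voxel set of the segmentation result and B for the voxel set of the gold standard. This is a minimal illustration under stated assumptions: voxels are stored as sets of integer coordinate tuples, the percentile uses the nearest-rank convention (the paper does not specify one), and the directed distance is computed on whatever point sets are passed in, whereas the paper applies it to contours. All function names are hypothetical.

```python
import math


def dsc(a, b):
    """Dice Similarity Coefficient of two voxel sets."""
    return 2 * len(a & b) / (len(a) + len(b))


def ppv(a, b):
    """Share of the segmentation result `a` that is correct."""
    return len(a & b) / len(a)


def sen(a, b):
    """Share of the gold standard `b` that is recovered."""
    return len(a & b) / len(b)


def directed_hd(xs, ys, r=100.0):
    """r-th percentile (nearest rank) of the distances from each point in
    `xs` to its nearest neighbour in `ys`; r=100 is the directed Hausdorff
    Distance."""
    distances = sorted(min(math.dist(x, y) for y in ys) for x in xs)
    rank = max(1, math.ceil(r / 100.0 * len(distances)))
    return distances[rank - 1]


# Toy example on a single row of voxels.
A = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}
B = {(1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)}
scores = (dsc(A, B), ppv(A, B), sen(A, B))
```

In the toy example the directed distance from B to A at r=100 is 2 voxels, because the voxel (4, 0, 0) is two voxels away from the nearest voxel of A, while lowering r to 75 discards that outlier.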

Experimental settings
The proposed networks were implemented in Python based on the Keras package (Chollet, 2015), and the experiments were done on a computer with a single GPU (NVIDIA GTX 1080 Ti) running the 64-bit Linux Ubuntu 14.04 LTS operating system.
We trained one LocNet and one SegNet for each of the nine OARs. The size of the training images for the LocNet was 96×96×56, and the size of the training images for the SegNet was determined by the size of the bounding box of each structure as listed in Table 1, except for the mandible. The size of the bounding box for the mandible is 144×144×112, but its target volume was further down-sampled to 144×144×56 because of the memory limit. The mini-batch size was 1. The networks were trained with the Adam optimizer using the recommended parameters, and the training was stopped after 200 iterations over the training images. In total, we trained 18 networks for the nine OARs in this dataset, which took approximately 42 hours. In the testing stage, segmenting one OAR in one image took approximately six seconds, of which only about two seconds were spent on the network processing of the image and about four seconds on the post-processing of the output of LocNet. For some structures, the output of the SegNet contained several small isolated regions that do not belong to the target structure. We adopted a simple post-processing step, in which we deleted isolated regions whose volume is less than 10% of the total segmentation result.
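The post-processing step that deletes small isolated regions can be sketched with a breadth-first connected-component search. This is an illustration rather than the authors' implementation: it assumes 6-connectivity (face neighbours only), which the paper does not specify, and it stores the foreground as a set of coordinate tuples. The function name is hypothetical.

```python
from collections import deque


def remove_small_regions(voxels, min_fraction=0.10):
    """Keep only the 6-connected components that hold at least `min_fraction`
    of all foreground voxels."""
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    remaining = set(voxels)
    components = []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:  # breadth-first flood fill from the seed
            x, y, z = queue.popleft()
            for dx, dy, dz in offsets:
                neighbour = (x + dx, y + dy, z + dz)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    total = len(voxels)
    kept = set()
    for component in components:
        if len(component) >= min_fraction * total:
            kept |= component
    return kept


# Toy segmentation: a 20-voxel rod plus one isolated false-positive voxel.
segmentation = {(x, 0, 0) for x in range(20)} | {(50, 50, 50)}
cleaned = remove_small_regions(segmentation)
```

The isolated voxel accounts for 1 of 21 foreground voxels, which is below the 10% threshold, so it is removed while the rod is kept.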

Segmentation results of the proposed method
We evaluated the performance of the method that uses the two-stage segmentation framework on the interpolated isotropic images. In addition, we tested the proposed segmentation framework on the original images without interpolation. Without interpolation, the original images were not cropped, and they were directly down-sampled to 128×128×64 as the input to the LocNet. The size of the bounding box of some structures is different from that used for the interpolated images, but the same size was used across all images. The processing after obtaining the target volume is the same as with the interpolated images. Under each of these two scenarios, DSC, 95HD, PPV and SEN were calculated for each OAR, and the results are listed in Table 2. In Table 2, the OARs are ordered by decreasing volume. Generally speaking, the proposed method achieved good segmentation accuracy in large OARs, and its performance tends to decrease as the volume of the OAR decreases when we consider the volume-based metrics, including DSC, PPV and SEN. This trend does not hold for the contour-based metric 95HD. Overall, the mean and the standard deviation of the 95HD of all the OARs are small, which means that the segmentation method finds the correct contour in most areas for every structure. The difference between the performance reflected by the volume-based and the contour-based metrics arises because similar levels of contour error result in large errors for small structures and small errors for large structures when computing volume-based metrics, since volume-based metrics use the true volume of the structure as a denominator.
For most OARs, the segmentation results with interpolation are better than those without interpolation, especially when the volume of the OAR is small. For the OARs whose accuracy decreased with interpolation, one possible reason is that the interpolated images have lower in-plane resolution than the original images. We interpolated the original images to 1mm resolution in each dimension because of the memory limit. Using a higher resolution for the interpolated images may further improve the accuracy of the segmentation, not only for large OARs but also for small OARs. For visual illustration, Fig. 4 shows the segmentation results of Subject 0522c0857 with and without interpolation. We can see that the mandible was segmented more accurately without interpolation, while the submandibular glands, the optic nerves and the chiasm were segmented better with interpolation. To visually illustrate the overall performance of the proposed method, we show the segmentation results of Subjects 0522c576, 0522c0667 and 0522c0857 with interpolation in Fig. 5. In addition, for each of the nine OARs, we chose one good segmentation result and one bad segmentation result and show the slices in Fig. 6.
Fig. 4. Segmentation results of Subject 0522c0857. The first and second rows show the segmentation results without and with interpolation, respectively. From left to right: the 85th, 92nd, 102nd, 92nd, 112th, 118th and 120th slice of the axial view. The gold standard results are depicted in green and our results are depicted in red.

Accuracy comparison against state-of-the-art methods
It is not trivial to compare different methods of OAR segmentation in HaN CT images, because of the differences in the datasets, OARs and evaluation metrics used in different studies. The MICCAI 2015 Challenge provides a unified evaluation framework, and we first compare the proposed method (with interpolation) to the four top-ranked methods in the challenge. Of these four methods, UC (Albrecht et al., n.d.) gives DSC for all nine OARs but gives no 95HD, IM (Arteaga et al., 2016) gives DSC and 95HD for three OARs, and UB (Mannion-haworth et al., n.d.) and VU (Chen and Dawant, 2016) give DSC and 95HD for all nine OARs. Table 3 and Table 4 list the DSC and the 95HD of our method and the four competing methods, respectively. From Table 4 we can see that our method outperforms the competing methods in terms of 95HD by a large margin in eight of the nine OARs. In terms of DSC, our method ranks first in six of the nine OARs and second in the other three.
Besides the above four methods, which can be compared directly, there are other studies that used different datasets or used the same dataset with a different training-testing grouping scheme. Ibragimov and Xing (2017) were the first to utilize a deep learning approach to segment OARs in HaN CT images, and they provided DSC for 13 OARs. Eight of these OARs are also used in the MICCAI 2015 Challenge (all except the brainstem). We cannot directly compare the result of our method to that of Ibragimov and Xing (2017), because they used a different set of data. Nevertheless, the DSC of our method is higher than that of Ibragimov and Xing (2017) for seven OARs (6.0% higher on average). In addition, the method in Ibragimov and Xing (2017) requires the doctor to determine a rough location for each OAR to be segmented. Wang et al. (2018) proposed a hierarchical vertex regression-based segmentation method, and the DSC of the brainstem, mandible and parotid were 0.9±0.04, 0.94±0.01 and 0.84±0.06, respectively. However, they evaluated their method by two-fold cross validation on the 33 training images and did not evaluate its performance on the 15 testing images. Segmentation results of other structures were not provided in Wang et al. (2018).
Ren et al. (2018) proposed interleaved 3D-CNNs to jointly segment the optic nerves and the chiasm. They also located a bounding box enclosing the target OAR and then performed segmentation in the small target volume. The localization is performed by an atlas-based method, which is slower than our LocNet. Their DSCs for optic nerve left, optic nerve right and chiasm were 0.72±0.08, 0.70±0.09, and 0.58±0.17, respectively. This method is designed to segment small targets, and it was not applied to the other OARs in the MICCAI 2015 Challenge dataset. In addition, they utilized a joint segmentation scheme, while our method segments each OAR separately.

Runtime comparison against state-of-the-art methods
Runtime comparison is difficult because the code of the competing methods is not available and we cannot run every method on the same computer. Nevertheless, we list in Table 5 the runtimes of VU, IM, UB, and Ibragimov and Xing (2017) for segmenting all nine OARs of one subject, as given in the original papers, which may give a rough idea of the runtime comparison. Segmenting all nine OARs with our method takes approximately 108 seconds on average.

Role of target localization
Because of the memory limit, it is usually difficult to load the whole 3D image volume into the GPU for training and testing. One straightforward solution is to down-sample the original images to a manageable size, but the accuracy is then expected to decrease. Another solution adopted in previous studies is the sliding-window strategy (Ronneberger et al., 2015; Yu et al., 2017), which crops the original images into small blocks and performs segmentation block by block. In this study, we propose to use a dedicated 3D U-Net to localize a small volume containing the target structure and then segment the target structure accurately from the small volume. In this experiment, we compared the proposed strategy to the down-sampling and the sliding-window strategies. The same segmentation network as SegNet was used for these two competing strategies.
In the down-sampling strategy, we down-sampled the training and the testing images to 96×96×56. In the sliding-window strategy, we cropped the original volume data into non-overlapping blocks of size 64×64×64. Of all the blocks, only a very small proportion contains the target structure, so we cannot use all the blocks for training. Therefore, we kept all the blocks containing the target structure and randomly chose the same number of blocks without the target structure for training the SegNet. In the testing stage, we slid a window of size 64×64×64 through the whole volume with some overlap between neighboring windows and adopted a max voting for each voxel to obtain the final segmentation result.
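Two ingredients of the sliding-window baseline, the placement of overlapping test windows along one axis and the balanced selection of training blocks, can be sketched as follows. The step size of 32 voxels, the rule of adding a final window flush with the end of the axis, the fixed random seed and both function names are assumptions made for illustration; the paper only states that neighboring windows overlap and that equal numbers of blocks with and without the target were used.

```python
import random


def window_starts(length, window=64, step=32):
    """Start indices of windows along one axis; a final window flush with the
    end of the axis is added when the regular grid does not reach it."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, step))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def balanced_blocks(blocks_with_target, blocks_without_target, seed=0):
    """All blocks containing the target plus an equally sized random sample of
    blocks that do not contain it."""
    rng = random.Random(seed)
    k = min(len(blocks_with_target), len(blocks_without_target))
    return list(blocks_with_target) + rng.sample(list(blocks_without_target), k)


# Window starts along the 224-voxel axis of the cropped volume.
z_starts = window_starts(224)

# Hypothetical block identifiers for one training image.
positives = ["p1", "p2"]
negatives = ["n%d" % i for i in range(10)]
training_blocks = balanced_blocks(positives, negatives)
```

The window size, the step and the ratio between positive and negative blocks are exactly the kind of parameters that need tuning in this strategy.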
The DSC and the 95HD of the two competing strategies and of the proposed method with interpolation are listed in Table 6. We can see that the proposed method significantly outperforms the down-sampling and sliding-window strategies. In particular, it is very difficult to distinguish small structures, such as the optic nerves and the chiasm, in the down-sampled images, so their segmentation accuracy is very low in the down-sampling strategy. For the sliding-window strategy, several parameters, such as the size of the window, the step size, and the ratio between the positive and the negative samples for training, may influence the final result. We tried several combinations of parameters and kept the best one, but we cannot guarantee that the reported accuracy is the best possible result. The 95HDs of the sliding-window strategy are very large for most of the OARs, because this strategy produces some false positive voxels far from the true target OAR. Some post-processing may improve the accuracy of these two strategies, but the improvement will be limited.
Table 7 lists the training and testing times of the three strategies. As expected, the down-sampling strategy is the fastest. Regarding training time, the proposed two-stage strategy needs to train two networks, so it takes longer than down-sampling. In the testing stage, the data needs to be processed by two networks, with additional processing between them to obtain the bounding box. Therefore, the testing time of the proposed method is longer than that of down-sampling. The training and testing times of the sliding-window strategy are much longer than those of the proposed method.

Discussion
In this work, we proposed a new framework for automatic segmentation of OARs in HaN CT images and evaluated its performance on the MICCAI 2015 Challenge dataset. Unlike previous methods based on deep neural networks, the proposed framework decomposes the segmentation into two simpler tasks, locating a bounding box and segmenting the small volume within the bounding box, and a 3D U-Net is trained for each task. Experiments on the MICCAI 2015 Challenge dataset show that the proposed method significantly outperformed state-of-the-art methods.
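The data flow of this decomposition can be illustrated with a short sketch. It is a simplified illustration under stated assumptions rather than the implementation used in this work: `locate` and `segment` are hypothetical callables standing in for the trained LocNet and SegNet, the `margin` parameter is an assumption, and any resampling between the two stages is omitted.

```python
import numpy as np


def bounding_box(mask, margin=0):
    """Return per-axis slices of the tightest box around the nonzero voxels.

    The box is expanded by `margin` voxels on every side (an assumed
    safety margin) and clipped to the volume.
    """
    coords = np.argwhere(mask)
    if coords.size == 0:
        raise ValueError("mask is empty; no structure was located")
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, mask.shape)
    return tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))


def two_stage_segment(volume, locate, segment, margin=8):
    """Locate the structure first, then segment only the cropped sub-volume."""
    coarse = locate(volume)              # stage 1: rough mask on the full volume
    box = bounding_box(coarse, margin)   # bounding box of the located structure
    fine = segment(volume[box])          # stage 2: segmentation inside the box
    result = np.zeros(volume.shape, dtype=bool)
    result[box] = fine                   # paste the result back into place
    return result, box
```

Because the second stage only sees the cropped sub-volume, its input size is bounded by the size of the structure plus the margin rather than by the size of the whole scan.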
Many traditional methods have been used for the segmentation of OARs in HaN CT images, but the results are relatively poor compared with other medical image segmentation tasks. The difficulty comes from the characteristics of the OAR segmentation task, such as the large variability of shape and size across different target structures and the poor contrast between some structures and their background. Deep neural networks have become the best choice for most image processing tasks and often outperform traditional methods by a large margin, including in many medical image segmentation applications (Cai et al., 2016; Cha et al., 2016; Hu et al., 2017; Milletari et al., 2016, 2017; Zhu et al., 2017). However, existing studies applying deep neural networks to OAR segmentation in HaN CT images have only achieved performance similar to that of traditional methods. One of the major obstacles to using deep neural networks in medical image segmentation has been the conflict between large, high-resolution images and limited memory space. Previously, this problem has been tackled with down-sampling or sliding-window strategies, but our experiments show that the performance of these two strategies is very poor. In a recent work, Ren et al. (2018) first used a multi-atlas based segmentation method to roughly locate the region of interest, and then segmented only the small volume within the region of interest with a CNN. They achieved high segmentation accuracy on three small structures and showed that decoupling the localization and segmentation tasks is helpful.
In this study, we used 3D U-Net for both the localization and the segmentation tasks. The decomposition makes each of the two tasks much easier, so that each deep neural network can be properly trained for its specific task. Experiments show that the trained LocNet can find the bounding box containing the target structure in all cases. Once the bounding box has been located accurately, training the SegNet to segment one structure with similar shape and appearance across subjects is much easier than training a network to segment the structure from the original images, which contain multiple structures with different shapes and appearances. We think that this strategy of decomposing a medical image segmentation task into two tasks, i.e., locating a bounding box and segmenting within the bounding box, may be used in other applications where multiple structures are to be segmented. Many other network structures could replace the 3D U-Net for one or both tasks. We did not attempt to test different network structures in this study, but trying newer network architectures is a potential direction for future research. In some of the outputs of SegNet, there are several small isolated regions that do not belong to the target structure. The simple post-processing adopted in this study only slightly improved the final results, and more sophisticated post-processing methods may further improve the accuracy. In addition, the number of subjects in the MICCAI 2015 Challenge dataset is not very large, which may limit the performance of the deep learning networks. In the future, it will be necessary to verify whether the segmentation accuracy of the method can be further improved by training on more data. Most importantly, it remains to be tested whether the method is suitable for clinical use and whether it can help improve the treatment planning workflow.
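To illustrate how small isolated regions in a SegNet output might be removed, the sketch below keeps only the largest connected foreground region. This is an assumed example of a simple post-processing step and is not claimed to be the exact procedure used in this study; the function name and the choice of 6-connectivity are likewise assumptions.

```python
from collections import deque

import numpy as np

# 6-connectivity (assumed): two voxels are neighbours when they share a face.
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def largest_component(mask):
    """Keep only the largest 6-connected foreground region of a 3D mask."""
    mask = np.asarray(mask, dtype=bool)
    visited = np.zeros(mask.shape, dtype=bool)
    best = []
    for seed in map(tuple, np.argwhere(mask)):
        if visited[seed]:
            continue
        # Breadth-first flood fill starting from an unvisited foreground voxel.
        visited[seed] = True
        component, queue = [], deque([seed])
        while queue:
            z, y, x = queue.popleft()
            component.append((z, y, x))
            for dz, dy, dx in NEIGHBOURS:
                n = (z + dz, y + dy, x + dx)
                inside = all(0 <= c < s for c, s in zip(n, mask.shape))
                if inside and mask[n] and not visited[n]:
                    visited[n] = True
                    queue.append(n)
        if len(component) > len(best):
            best = component
    cleaned = np.zeros(mask.shape, dtype=bool)
    if best:
        cleaned[tuple(np.array(best).T)] = True
    return cleaned
```

The pure-Python flood fill is written for clarity; on full-size volumes a library routine such as `scipy.ndimage.label` would be considerably faster.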

Conclusion
In this study, we proposed a two-stage segmentation framework based on 3D U-Net for automatic segmentation of OARs in HaN CT images. The framework decomposes the original segmentation task into two easier sub-tasks: locating a bounding box of the target structure and segmenting the target structure in a small volume within the bounding box. One 3D U-Net is trained for each task, and the decomposition allows the two tasks to be completed more accurately and quickly. Experiments on the MICCAI 2015 Challenge dataset show that the proposed method significantly outperforms the state-of-the-art methods.