Prior Attention Enhanced Convolutional Neural Network Based Automatic Segmentation of Organs at Risk for Head and Neck Cancer Radiotherapy

Aiming to automate the segmentation of organs at risk (OARs) in head and neck (H&N) cancer radiotherapy, we developed a novel Prior Attention enhanced convolutional neural Network (PANet) based Stepwise Refinement Segmentation Framework (SRSF) on full-size computed tomography (CT) images. The SRSF is built on a multiscale segmentation concept, in which OARs are segmented from coarse to fine. PANet is a pyramidal architecture incorporating inception blocks and prior attention. In this study, the PANet based SRSF is applied to OARs segmentation in H&N radiotherapy. 139 CT series, with contours of twenty-two OARs manually delineated by experienced oncologists, were collected from 139 H&N patients for training and evaluating the proposed method. The mean testing Dice similarity coefficients (DSC) on 39 CT series range from 76.1 ± 8.3% (left middle ear) to 91.9 ± 1.4% (right mandible) for large-volume OARs (mean volume > 1 cc), and from 63.4 ± 12.3% (chiasm) to 81.0 ± 14.1% (right lens) for small and challenging OARs (mean volume ≤ 1 cc). Furthermore, the proposed method also achieved superior segmentation over reference methods on the MICCAI 2015 H&N dataset, with mean DSCs of 95.6 ± 0.7%, 81.3 ± 4.0%, 77.6 ± 4.5%, 77.5 ± 4.6%, and 69.2 ± 7.6% on the mandible, left submandibular gland, left and right optical nerves, and chiasm, respectively. Accurate segmentation of OARs is obtained on both the self-collected testing data and the public testing dataset, which implies that the proposed method can serve as a practicable and efficient tool for automated OARs contouring in H&N cancer radiotherapy.


I. INTRODUCTION
Organs at risk (OARs) delineation in computed tomography (CT) images is a critical step in radiotherapy planning to achieve organ dose sparing and minimize radiation-induced toxicity [1], [2]. Manual delineation, which is usually adopted in current clinical practice, is time-consuming and subject to large inter- and intra-operator variability [3]. Moreover, the quality of OARs delineation directly influences the dose distribution in OARs, especially in head and neck (H&N) cancer radiotherapy, which involves many important OARs, such as the brain stem, optical nerves, and pituitary. A more robust and accurate automatic OARs segmentation method is therefore clinically desirable for H&N cancer radiotherapy [1].
Over the past several decades, many automatic OARs segmentation methods, such as the watershed algorithm [4], [5], active contour models [6], [7], and region-growing algorithms [8], [9], were developed for H&N cancer radiotherapy. The most widely studied and used traditional method is atlas-based automatic segmentation (ABAS), which is extensively adopted in commercial treatment planning systems to assist contour delineation. ABAS methods can be divided into two categories: single-atlas [10], [11] and multi-atlas based methods [12]-[15]. The single-atlas based method is sensitive to the selected atlas and may fail when there are large anatomical differences between the target image and the atlas [16], [17]. In contrast, multi-atlas based methods are less sensitive to the atlases, but also less efficient, as they involve more registration procedures, which may introduce more registration errors [15]. Moreover, due to the low soft-tissue contrast and inter-patient variation in CT images, ABAS methods tend to yield low segmentation accuracy. Thus, substantial manual modification is usually required to satisfy the clinical requirements of radiotherapy planning.
Recently, convolutional neural network (CNN) based deep learning methods have been considered the state-of-the-art approaches for medical image segmentation tasks. Many deep learning studies have been conducted on H&N OARs segmentation for radiotherapy [18]-[23]. Ibragimov and Xing [19] applied a CNN to the segmentation of thirteen OARs in CT images for H&N cancer radiotherapy and achieved higher accuracy on most OARs than conventional ABAS methods, but reported poor segmentation of low-contrast and small organs such as the optical nerves (ONs) and chiasm, with Dice similarity coefficients (DSC) of 63.9% and 37.4%, respectively. Liang et al. [20] proposed a two-stage (detection and segmentation) method for eighteen H&N OARs with DSCs from 68.9% (ONs) to 93.4% (eyes), which is superior to the results of a fully convolutional neural network (FCN). However, the segmentation accuracy was limited by using only 2D image information, with mean DSC < 70% for the ONs. Tong et al. [21] developed a shape representation model to constrain a 3D FCN for nine H&N OARs, which achieved mean DSCs from 58.5% (chiasm) to 93.7% (mandible). However, that study was conducted on downsampled CT images with a voxel size of 2 mm × 2 mm × 2 mm, which is not suitable for clinical usage [21]. Chen et al. [22] developed an ensemble UNet [24] based recursive segmentation framework for the brain stem, eyes, ONs, and chiasm on magnetic resonance images (MRI), which performed better than UNet even on small OARs, with mean DSCs of 80.1% and 71.1% for the ONs and chiasm. Yet the delineations on MRI still need to be propagated to CT via image registration for radiotherapy treatment planning, which introduces registration uncertainties. Zhu et al. [23] constructed a squeeze-and-excitation residual block based AnatomyNet for nine OARs on whole-volume CT images. The mean segmentation DSCs achieved by AnatomyNet range from 53.5% (chiasm) to 91.3% (mandible).
However, the above methods still perform poorly on low-contrast and small OARs because of blurred boundaries and limited image information. Furthermore, their compatibility with very large and very small OARs, such as the temporal lobes, pituitary, and chiasm, was not considered. Gao et al. [25] proposed FocusNet to balance the segmentation of large and small OARs. It achieved more accurate segmentation of small OARs by training OAR-specific models, which is time-consuming. Besides, FocusNet used prior information by simply concatenating feature maps from OARs localization, which may weaken the model's stability.
In this study, twenty-two OARs for H&N cancer radiotherapy are involved in the segmentation task, including four single organs: the brain stem, spinal cord, chiasm, and pituitary; and nine paired organs: the temporal lobes (TLs), eyes, optical nerves (ONs), lenses, middle ears (MEs), mastoids, mandibles, temporal mandibular joints (TMJs), and parotids. In the following, the left and right parts of a paired OAR are denoted OAR_l and OAR_r, respectively. To achieve fully automatic, accurate segmentation of large and small OARs in full-volume CT images for H&N cancer radiotherapy, we developed and evaluated a novel Stepwise Refinement Segmentation Framework (SRSF), whose core model is a novel Prior Attention enhanced Convolutional Neural Network (PANet). The PANet based SRSF (SRSF_PANet) exploits the inherently stable relative positions among OARs and achieves OARs segmentation from coarse to fine via three sequential segmentation steps: OAR-groups segmentation (OGS), large/easy OARs segmentation (LOS), and small/difficult OARs segmentation (SOS). To improve the segmentation accuracy of each step, a novel combination of prior attention and learnable spatial attention is applied to a justified inception block for more accurate and effective feature extraction in PANet.

II. METHODS AND MATERIALS

A. METHODS
In this study, a novel PANet based SRSF, SRSF_PANet, is developed for the segmentation of a large number of OARs in common large-volume CT images. The twenty-two OARs to be delineated are divided into three groups. Group A: brain stem and spinal cord; Group B: mastoids, mandibles, temporal mandibular joints, parotids, and middle ears; Group C: temporal lobes, eyes, and adjacent small OARs (mean volume ≤ 1 cc): lenses, optical nerves, chiasm, and pituitary. As illustrated in Figure 1, the SRSF includes three sequential segmentation steps: OGS, LOS, and SOS. First, the OGS model is trained on CT images downsampled to half resolution for OARs group segmentation. The label of each OARs group is obtained via Equation 1; that is, the OARs in each group are regarded as one individual target. Then, based on the rough OGS result, the corresponding regions of interest (ROIs) are localized and prior probability maps are predicted for LOS. Similarly, the small OARs (if any) in each group are also segmented as a whole structure for ROI localization and prior probability estimation. In LOS and SOS, each OAR, except for the small OARs group in LOS C, is treated as an individual target.
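The grouping and ROI-localization steps above can be sketched in a few lines of NumPy. This is an illustrative reading only: Equation 1 is taken as the union of the member OAR labels, and the ROI is a margin-expanded bounding box around the coarse group mask; the exact equation and the margin value are assumptions, not given in the text.

```python
import numpy as np

def merge_group_label(label_map, member_ids):
    """One reading of Equation 1: the group label is the union
    of its member OAR labels, treated as a single binary target."""
    return np.isin(label_map, member_ids).astype(np.uint8)

def locate_roi(group_mask, margin=8):
    """Bounding-box ROI around a coarse group mask, expanded by a
    margin (the margin used in the paper is not stated)."""
    coords = np.argwhere(group_mask > 0)
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + margin + 1, group_mask.shape)
    return tuple(slice(int(l), int(h)) for l, h in zip(lo, hi))

# toy example: OAR labels 1 and 2 form one hypothetical group
label = np.zeros((4, 16, 16), dtype=np.uint8)
label[1, 4:8, 4:8] = 1
label[2, 6:10, 6:10] = 2
group_a = merge_group_label(label, member_ids=[1, 2])
roi = locate_roi(group_a, margin=2)
```

The full-resolution image cropped with `roi` is then what the next (LOS) model sees, so most of the background is excluded before refinement.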
For OGS, a justified inception pyramidal network (IPNet) is constructed based on a classic pyramidal network, UNet [24]. Compared with UNet, the core feature extractor of IPNet is a justified inception block without the pooling path. As shown in Fig. 2, the justified inception block improves the receptive field and feature variety by using multiple convolution paths with different kernel sizes, while the pooling path is removed to avoid losing image features. A convolution block then follows to combine the feature maps extracted by the multiple kernels. The convolution block sequentially comprises a convolution layer with a kernel size of 3×3×3, a batch normalization (BN) layer [26], and a ReLU activation layer. With the justified inception block, the depth of the pyramidal network can also be reduced to avoid losing image features, especially for small targets.
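A minimal PyTorch sketch of such an inception block follows. The kernel sizes (1, 3, 5) are illustrative assumptions, since the paper does not list the kernel sizes used in the justified block; the combining 3×3×3 conv → BN → ReLU stage matches the description above.

```python
import torch
import torch.nn as nn

class InceptionBlock3D(nn.Module):
    """Inception-style block without the pooling path.
    Kernel sizes are assumed (1, 3, 5), not taken from the paper."""
    def __init__(self, in_ch, out_ch, kernels=(1, 3, 5)):
        super().__init__()
        # parallel convolution paths with different kernel sizes
        self.paths = nn.ModuleList(
            nn.Conv3d(in_ch, out_ch, k, padding=k // 2) for k in kernels
        )
        # combining convolution block: 3x3x3 conv -> BN -> ReLU
        self.combine = nn.Sequential(
            nn.Conv3d(out_ch * len(kernels), out_ch, 3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        feats = torch.cat([p(x) for p in self.paths], dim=1)
        return self.combine(feats)

x = torch.randn(1, 1, 8, 16, 16)   # (N, C, D, H, W) toy CT patch
y = InceptionBlock3D(1, 8)(x)
```

Because every path preserves spatial size (odd kernels, matching padding), the block can replace the pooled feature extractor at each pyramid level without shrinking small targets.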
To encourage the network to pay more attention to effective and informative spatial regions, a Prior Attention enhanced Inception (PAI) block is designed in PANet. As shown in Fig. 3, the PAI block first adjusts the feature maps at each spatial position via prior attention (PA) and learnable convolutional spatial attention. Finally, all the attention-refined feature maps are element-wise added to the unrefined feature maps to avoid the vanishing gradient problem. The prior attention map P_i for depth i in PANet is generated by average pooling. In this study, the probability maps predicted by IPNet in OGS and by PANet in LOS for group C are regarded as the prior information for the subsequent OARs segmentation, respectively.
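Since Fig. 3 is not reproduced here, the following PyTorch sketch is only one plausible reading of the PAI mechanism: the prior probability map is average-pooled to the feature resolution and used as hard attention, a convolution with a sigmoid produces the learnable soft spatial attention, and the refined maps are added back to the unrefined features residually. The exact layer layout is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PriorAttentionBlock(nn.Module):
    """Sketch of the prior + spatial attention idea (layout assumed).
    Hard attention: fixed prior map from the previous segmentation step.
    Soft attention: learned 1-channel sigmoid map from the features."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv3d(channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x, prior):
        # prior: (N, 1, D, H, W) probability map; average-pool it to this
        # depth's spatial size, as the paper describes for P_i
        p = F.adaptive_avg_pool3d(prior, x.shape[2:])
        refined = x * p + x * self.spatial(x)
        return x + refined  # residual add avoids vanishing gradients

x = torch.randn(2, 4, 8, 16, 16)
prior = torch.rand(2, 1, 16, 32, 32)   # coarser-step probability map
y = PriorAttentionBlock(4)(x, prior)
```

The soft map can up- or down-weight regions the hard prior got wrong, which is the "soft attention adjusts the hard prior" behavior discussed later in the paper.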
Considering that the surface distance (SD) is more sensitive to shape changes than the Dice coefficient, a combined loss of Dice loss [27] and SD loss [28] is employed for segmentation model training in this study:

Loss = Loss_Dice + α · Loss_SD,

where P_i^c and G_i^c represent the predicted SoftMax probability and the gold-standard label at voxel i of channel c, respectively, and D_i^c is the corresponding normalized distance to the surface of the gold standard. γ and α, the parameters that adjust the penalty on large surface errors and the weight of Loss_SD, are both set to 1, following [28]. The Adam optimizer [29] is chosen to minimize the loss function.
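The combined loss can be sketched as below. The soft Dice term is standard; the SD term is written here as a distance-weighted voxel error with exponent γ, which is our reading of "penalty of large surface error" — the exact form in [28] may differ, so treat this as an assumption.

```python
import torch

def dice_loss(p, g, eps=1e-6):
    """Soft Dice loss over a flattened probability/label pair."""
    inter = (p * g).sum()
    return 1.0 - (2.0 * inter + eps) / (p.sum() + g.sum() + eps)

def sd_loss(p, g, dist, gamma=1.0):
    """Surface-distance-weighted error: mistakes far from the
    gold-standard surface (large normalized distance D) are
    penalised more. Form assumed, not copied from [28]."""
    return ((p - g).abs() * dist.pow(gamma)).mean()

def combined_loss(p, g, dist, alpha=1.0, gamma=1.0):
    # alpha and gamma are both set to 1 in the paper, following [28]
    return dice_loss(p, g) + alpha * sd_loss(p, g, dist, gamma)

p = torch.tensor([0.9, 0.8, 0.1, 0.2])   # predicted probabilities
g = torch.tensor([1.0, 1.0, 0.0, 0.0])   # gold-standard labels
d = torch.tensor([0.0, 0.0, 0.5, 1.0])   # normalized surface distance D
loss = combined_loss(p, g, d)
```

Errors at d = 0 (on the surface) contribute nothing to the SD term, while a false positive far from the surface is penalized fully, which is what makes this term shape-sensitive.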
The proposed SRSF_PANet is implemented with the PyTorch deep learning library in Python 3.5. Model training and validation are performed on two GPU cards (NVIDIA GeForce GTX 1080) with 12 GB of memory. The hyperparameters of the models are listed in Table 1. The maximum number of training epochs is set to 50, with an early-stopping strategy (10 epochs without a decrease in validation loss) to avoid overfitting.
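The early-stopping rule described above is simple enough to state precisely; a minimal sketch (patience of 10 in the study, 3 here for the toy run):

```python
class EarlyStopper:
    """Stop when validation loss has not improved for `patience`
    consecutive epochs (10 in this study, within a 50-epoch budget)."""
    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop training

stop = EarlyStopper(patience=3)
losses = [1.0, 0.8, 0.9, 0.85, 0.95]  # validation loss per epoch
flags = [stop.step(l) for l in losses]
```

After the improvement at epoch 2 (loss 0.8), three non-improving epochs in a row trip the stop flag on the last epoch.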

B. MATERIALS
139 independent CT series from 139 nasopharyngeal cancer patients with manual contours were collected at the Sun Yat-Sen Cancer Center, China. All contours were manually delineated by an experienced oncologist and then reviewed and adjusted by another experienced oncologist; these contours are regarded as the gold standard in this study. The in-plane resolution of the CT images varies between 0.7 mm and 1.2 mm, and the slice thickness is 3 mm for all cases. The number of slices ranges from 90 to 172, with an average of 111, giving 15,443 slices in the self-collected dataset.
All CT images are clipped to the range [WL − WW/2, WL + WW/2] and then normalized to the range [−1, 1], where WW and WL denote the window width and window level, respectively. Of the 139 patients, 100 are randomly selected for training and the rest for testing. During training, 10% of the training set is randomly split off as internal validation data to avoid overfitting. Translation, rotation, and noise addition are applied to augment the training data; with this augmentation, 360 three-dimensional images are used for model training. To obtain rough prediction probabilities on the training data, five-fold cross-validation is employed in OGS and LOS C.
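The window-and-normalize preprocessing can be expressed directly. The WL/WW values per organ group are not given in the text, so the soft-tissue window used in this example (WL = 40 HU, WW = 400 HU) is an assumption for illustration only.

```python
import numpy as np

def window_normalize(ct, wl, ww):
    """Clip a CT volume to [WL - WW/2, WL + WW/2] and linearly
    rescale the result to [-1, 1], as described in the paper."""
    lo, hi = wl - ww / 2.0, wl + ww / 2.0
    ct = np.clip(ct.astype(np.float32), lo, hi)
    return 2.0 * (ct - lo) / (hi - lo) - 1.0

# toy HU values: air, window floor, window level, window ceiling, bone
hu = np.array([-1000.0, -160.0, 40.0, 240.0, 3000.0])
out = window_normalize(hu, wl=40, ww=400)
```

Everything below the window floor maps to −1 and everything above the ceiling to +1, so out-of-window intensities (air, dense bone) no longer dominate the input range.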
The Dice similarity coefficient (DSC) and the 95th-percentile Hausdorff distance (HD95) are used as evaluation metrics:

DSC(G, P) = 2|G ∩ P| / (|G| + |P|),

HD95(G, P) = max{h95(G, P), h95(P, G)},

where G and P denote the gold-standard and predicted segmentations, and h95(G, P) is the 95th percentile of the directed surface distances h(g, P) = min_{p∈P} ||g − p|| over all g ∈ G. DSC ranges from 0 to 1, corresponding to the worst and best segmentation, respectively; HD95 ranges from 0 to positive infinity. A higher DSC and a lower HD95 indicate more accurate segmentation. In this study, the DSC and HD95 are calculated on a three-dimensional basis for each patient. In addition, the volume cover ratio (VCR) is defined as VCR = V_in / V_gt, where V_in and V_gt are the target volume covered by the extracted ROI and the gold-standard volume of the target OAR, respectively. The VCR is used to evaluate ROI localization accuracy; a VCR of 100% means the localized ROI covers the whole target OAR. Because of the large Graphics Processing Unit (GPU) memory footprint, general hardware cannot directly support the segmentation of twenty-two organs on the original CT image. Thus, the proposed SRSF_PANet is compared with the UNet and IPNet based SRSFs (SRSF_UNet and SRSF_IPNet) in the evaluation study. The core models used in these three methods are listed in Table 2. The Kolmogorov-Smirnov test is employed to test for normal distributions (p > 0.05); the Wilcoxon rank-sum test and the paired t-test are then used for statistical significance analysis on non-normally and normally distributed data, respectively. The statistical analysis is implemented in SPSS 19.0. A significant difference is defined by p < 0.05.
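For reference, the three metrics can be implemented compactly as below. The HD95 here is a brute-force version suitable only for toy masks (real volumes would use SciPy or SimpleITK), and `vcr` takes the ROI as index slices, an assumption about how the ROI is represented.

```python
import numpy as np

def dsc(g, p):
    """Dice similarity coefficient between two binary masks."""
    g, p = g.astype(bool), p.astype(bool)
    return 2.0 * np.logical_and(g, p).sum() / (g.sum() + p.sum())

def hd95(g, p):
    """95th-percentile symmetric Hausdorff distance (brute force)."""
    gp = np.argwhere(g).astype(float)
    pp = np.argwhere(p).astype(float)
    d = np.linalg.norm(gp[:, None, :] - pp[None, :, :], axis=-1)
    return max(np.percentile(d.min(axis=1), 95),
               np.percentile(d.min(axis=0), 95))

def vcr(gt_mask, roi_slices):
    """Volume cover ratio: fraction of gold-standard volume in the ROI."""
    return gt_mask[roi_slices].sum() / gt_mask.sum()

g = np.zeros((8, 8), dtype=np.uint8); g[2:6, 2:6] = 1
p = np.zeros((8, 8), dtype=np.uint8); p[3:7, 3:7] = 1
score = dsc(g, p)                 # overlapping 4x4 squares
zero_hd = hd95(g, g)              # identical masks -> distance 0
cover = vcr(g, (slice(0, 8), slice(0, 8)))  # ROI covers everything
```

Note the directional asymmetry of the inner min/percentile: both directions are computed and the maximum taken, so HD95 is symmetric even though each directed distance is not.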
To compare the proposed method with other state-of-the-art methods, SRSF_PANet is also compared with five published methods [23], [25], [32]-[34] on the MICCAI 2015 H&N OARs segmentation dataset [32], denoted the MICCAI'15 dataset (http://www.imagenglab.com/newsite/pddca/). The MICCAI'15 dataset consists of 38 samples for training and 10 samples for testing. The five reference methods comprise the champion method of the MICCAI'15 challenge [32] and four other state-of-the-art deep learning based methods [23], [25], [33], [34]. To transfer the proposed SRSF_PANet to the MICCAI'15 segmentation task, the original output channels for the TMJs are replaced with the submandibular glands in the LOS of group B. All other settings of SRSF_PANet are the same as those used in the experiments on the self-collected dataset.

III. RESULTS
Table 3 lists the OARs group segmentation accuracy in OGS and LOS C. The mean DSCs of the large OARs groups in OGS are above 83%, and the corresponding HD95 values are below 4.8 mm with small variations. The mean DSC and HD95 of the small OARs in LOS C are 68.3 ± 4.7% and 6.3 ± 2.6 mm, respectively. Furthermore, a VCR of 100% shows that all the OARs are covered by the corresponding ROIs under the chosen size settings. These results indicate that the segmentation accuracy on the OARs groups and the ROI size settings are sufficient for the SRSF in this study. As shown in the segmentation example in Fig. 4, under- and over-segmentation are observed in the results obtained by SRSF_UNet, especially for the mandibles and TLs. Benefiting from the larger receptive field of the inception module, SRSF_IPNet achieves better performance than SRSF_UNet, but still cannot accurately segment very large organs and low-contrast organs, such as the TLs, mastoids, and ONs. In comparison, SRSF_PANet performs better than both SRSF_UNet and SRSF_IPNet, with the best agreement with the physician-delineated gold standard.
On small OARs, such as the ONs and chiasm, SRSF_PANet also achieves the best segmentation results. Table 4 lists the testing results of SRSF_UNet, SRSF_IPNet, and SRSF_PANet. For SRSF_UNet, the mean DSCs are above 70.0%, ranging from 72.8% (ME_r) to 89.1% (mandible_l) on large-volume OARs, and from 51.2% (chiasm) to 75.2% (ON_r) on small OARs. SRSF_UNet achieves good results on almost all OARs except for very large ones, such as the TLs and mandibles. With the improved receptive field, SRSF_IPNet achieves significantly better segmentation results on TL_l/r, parotid_l/r, ME_r, mastoid_l, TNJ_l/r, lens_l/r, ON_l/r, chiasm, and pituitary; results that are not significantly different on eye_l/r, ME_l, and mastoid_r; but significantly worse results on the spinal cord and TMJ_l/r. Compared with SRSF_UNet, SRSF_PANet achieves significantly better segmentation results on nineteen OARs and results that are not significantly different on ME_l and TMJ_l/r. Compared with SRSF_IPNet, SRSF_PANet also achieves significantly better segmentation results on seventeen OARs and results that are not significantly different on five OARs (mastoid_l, mandible_l, lens_l, ON_l, and pituitary). In the HD95 comparison, SRSF_IPNet performs significantly better than SRSF_UNet on ten of the twenty-two OARs, while SRSF_PANet performs significantly better than SRSF_UNet and SRSF_IPNet on thirteen and seven OARs, respectively. Overall, SRSF_IPNet improves the performance, while SRSF_PANet achieves the best results among the three methods. Fig. 5 depicts boxplots of the DSC and HD95 comparisons on the testing data, from which we can observe that: 1) among the three methods, SRSF_PANet achieves the best overall results on DSC and HD95; 2) compared with SRSF_UNet, SRSF_IPNet performs significantly better on eleven OARs but significantly worse on two OARs (ME_l and mandible_r); 3) the worst results achieved by SRSF_PANet on the chiasm and pituitary are worse than those of SRSF_UNet and SRSF_IPNet. In general, SRSF_PANet achieves more accurate segmentation of H&N OARs than SRSF_UNet and SRSF_IPNet.

Table 5 presents the segmentation comparison on the MICCAI'15 dataset between our proposed method and five state-of-the-art methods. The proposed method achieves comparable, and on the mandible slightly superior, segmentation accuracy on large-volume OARs. For small OARs, the most accurate segmentations are achieved by the proposed method, with mean DSCs of 77.6 ± 4.5%, 77.5 ± 4.6%, and 69.2 ± 7.6% on ON_l, ON_r, and the chiasm, respectively. It should be noted that Zhu et al. [23] used an additional training dataset in their study. The segmentation results on the public dataset demonstrate that: (1) the segmentation accuracy of SRSF_PANet is superior to that of the reference methods; and (2) both the SRSF and PANet can be easily transferred to different OARs segmentation scenarios.

IV. DISCUSSIONS
This study developed and validated a novel PANet based SRSF, SRSF_PANet, for the automatic segmentation of OARs in CT images for H&N cancer radiotherapy. The SRSF is proposed to alleviate the volume imbalance in multi-target segmentation, especially for tiny targets such as the optical nerves, chiasm, and pituitary in this study. Excluding more background regions is a direct and efficient approach to this issue; thus, we perform the multi-OARs segmentation stepwise via the SRSF. The primary step provides rough segmentation, OARs localization, and prior attention, each of which supports the subsequent segmentation refinement in a different way. Accordingly, the proposed SRSF is compatible with different base networks, such as PANet, IPNet, and UNet, for multi-target segmentation. Furthermore, we propose prior attention in PANet to exploit the confidence probability map predicted in the previous segmentation step. To improve the segmentation accuracy on small organs, a justified inception block, employed in both IPNet and PANet, extracts features at a larger scale with reduced pooling operations.
The quantitative and qualitative evaluation results (Table 4, Fig. 5, Table 5) on the 39 testing cases and the MICCAI'15 public dataset demonstrate the effectiveness of the proposed method. Moreover, compared with SRSF_UNet and SRSF_IPNet, the proposed SRSF_PANet achieves significantly better performance on most of the OARs. Besides, the mean time cost of segmenting all twenty-two OARs for a new case is about 30 s, which can effectively support clinical delineation work.
As the quantitative and qualitative evaluation results in Table 4 and Fig. 5 illustrate, we can observe the following. (1) The segmentation accuracies achieved by SRSF_UNet are inferior to those of SRSF_IPNet and SRSF_PANet, especially for very large OARs (TLs) and small OARs (ONs and chiasm). There are two reasons: the shallower network, and the larger receptive fields of IPNet and PANet. To avoid losing the features of small OARs, we reduce the pooling operations; the resulting shallower network weakens networks like UNet but does not affect IPNet and PANet, whose larger receptive fields, provided by the inception block, help them extract more useful global features for segmentation. In this way, IPNet and PANet strike a balance between pooling operations and global feature extraction. (2) Even with the same receptive field, SRSF_PANet still achieves superior segmentation performance over SRSF_IPNet, which benefits from the combination of prior attention and convolutional spatial attention. First, the proposed SRSF provides a practicable way to utilize the information obtained in previous segmentation steps: it serves not only for ROI localization, but also as prior attention in PANet. Thus, the prior attention from OGS and from LOS for group C provides additional global information that encodes the relationships among OARs. Second, the learnable convolutional spatial attention performs case-specific spatial adjustment of the feature maps. Furthermore, the learnable spatial attention is soft, so it can adjust the hard attention from the prior. Therefore, the proposed SRSF_PANet is sound in both theory and practice.

FIGURE 5. Quantitative comparisons of DSC and HD95 among SRSF_UNet, SRSF_IPNet, and SRSF_PANet. The boxes run from the 25th to the 75th percentile; the two ends of the whiskers represent the 10th and 90th percentiles; the horizontal line and cross symbol in each box represent the median and mean values, respectively. The '*' and '-' symbols above each group indicate whether a statistically significant difference exists between the two approaches.
Moreover, as shown in Table 5, Wang's method [34] achieved the best performance on the brain stem and mandible but performed poorly on the parotids. The reason is that they used a shape regression model constructed from shape correspondences detected across all atlases; however, the shape variability of the parotids is much larger than that of the brain stem and mandible, so the larger shape correspondence detection errors reduced the segmentation accuracy on the parotids. Besides, Zhu's model [23] achieved the best performance on the left parotid and right submandibular gland but performed particularly poorly on the brain stem and chiasm. When trained on an additional dataset, its segmentation accuracy improved on the brain stem and chiasm but decreased on the mandible and optical nerves. These results imply the instability of Zhu's method, which segments multiple OARs on whole CT images via a single model. In comparison, the proposed SRSF_PANet is more accurate and stable, benefiting from three innovations: the SRSF, the justified inception block with a larger receptive field, and the prior attention mechanism.
However, this study also has several limitations. 1) As illustrated in Fig. 5, the worst results achieved by SRSF_PANet on the chiasm and pituitary are worse than those of SRSF_UNet and SRSF_IPNet, although SRSF_PANet performed better in most cases. Considering that the optical chiasm and pituitary are very small and usually appear in only one or two CT slices, the prior information tends to misguide the subsequent segmentation. Thus, we believe the proposed model remains somewhat sensitive to the prior for small-target segmentation. 2) PANet training relies on the prior probability map from the previous segmentation step. To improve model stability, the prior probability maps of all training data were obtained via internal five-fold cross-validation; thus, the training procedure is more complex than that of a standard model, and the time cost for model training is about 70 hours. In future work, other fast conventional approaches may be employed to obtain the prior and avoid this disadvantage. 3) The inception block is not the only way to achieve a larger receptive field. For example, dilated convolution can achieve a similar multi-scale effect with fewer parameters; however, the gridding problem of dilated convolution is adverse to small-target segmentation. Thus, more feasible revised approaches, such as the receptive field block, which combines the ideas of the inception block and dilated convolution, are also worth applying to similar segmentation tasks in future work. 4) This study considers only the segmentation of H&N OARs in non-contrast CT images; we plan to apply the proposed method to more segmentation applications to assist radiotherapy treatment planning in future work. 5) The size of the evaluation dataset is limited. We plan to evaluate the proposed method on more clinical data from different anatomical sites to provide more clinical support.
In conclusion, an SRSF framework is developed for the automatic sequential segmentation of H&N OARs. Based on the SRSF, a novel PANet is proposed for more accurate segmentation by balancing the receptive field against pooling operations and by combining soft spatial attention with hard prior attention. The good evaluation results achieved by SRSF_PANet on both the independent and public testing datasets demonstrate that the proposed SRSF_PANet could be a practical tool for automatic OARs contouring in H&N cancer radiotherapy.

DONGYUN LIN CHANG received the master's degree from the Medical School, Southeast University. She is currently the Chief Technician of the Children's Hospital of Nanjing Medical University. Her research interests include tumor immunity and medical laboratory diagnosis.
YING SUN received the Ph.D. degree in imaging and nuclear medicine from Sun Yat-sen University, Guangzhou, China, in 2002. She is currently a Professor of radiation oncology and the Vice President of the Sun Yat-sen University Cancer Center, Guangzhou. Her main research interests include the individualized and precise treatment of nasopharyngeal carcinoma (NPC), artificial intelligence-assisted delineation of tumor targets and organs at risk for radiotherapy of NPC, big-data-driven risk stratification and individualized treatment of non-metastatic NPC, and translational research focused on developing prognostic and predictive markers in patients with NPC.
DONGMEI WU was a Postdoctoral Fellow with the Kimmel Cancer Center, Thomas Jefferson University Hospital. She is currently an Associate Professor with the Department of Radiation Oncology and the Department of Cancer Biology, Nanxishan Hospital of Guangxi Zhuang Autonomous Region. Her research interests include nasopharyngeal carcinoma radiotherapy and artificial intelligence application in radiotherapy.
YAO LU was a Postdoctoral Research Fellow and a Research Investigator with the Medical School, University of Michigan. He is currently a Professor with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. His research interests include inverse problems, medical image processing, and computer-aided diagnosis.

VOLUME 8, 2020