Reducing the Model Variance of a Rectal Cancer Segmentation Network

In preoperative imaging, the demarcation of rectal cancer on magnetic resonance images provides an important basis for cancer staging and treatment planning. Recently, deep learning has greatly advanced the state of the art in automatic segmentation. However, limited data availability in the medical field can cause large variance and consequent overfitting in medical image segmentation networks. In this study, we propose methods to reduce the model variance of a rectal cancer segmentation network by adding a rectum segmentation task and performing data augmentation; the geometric correlation between the rectum and rectal cancer motivated the former approach. Moreover, we propose a method to perform a bias-variance analysis within an arbitrary region-of-interest (ROI) of a segmentation network, which we applied to assess the efficacy of our approaches in reducing model variance. As a result, adding a rectum segmentation task reduced the model variance of the rectal cancer segmentation network within tumor regions by a factor of 0.90; data augmentation further reduced the variance by a factor of 0.89. These approaches also reduced the training duration by a factor of 0.96 and a further factor of 0.78, respectively. Our approaches will improve the quality of rectal cancer staging by increasing the accuracy of its automatic demarcation and by providing rectum boundary information, since rectal cancer staging requires the demarcation of both rectum and rectal cancer. Besides such clinical benefits, our method also enables segmentation networks to be assessed with bias-variance analysis within an arbitrary ROI, such as a cancerous region.


I. INTRODUCTION
Globally, colorectal cancer is the third most common cancer and the second leading cause of cancer mortality [1]. Specifically, colorectal cancer was the most commonly diagnosed cancer in Korea in 2017, with 27,837 new cases [2]. The global burden of colorectal cancer is rising rapidly and is expected to increase by 60% by 2030 [3].
Depending on the cancer site, colorectal cancer can be defined as colon cancer or rectal cancer [1]. The Union for International Cancer Control's TNM Classification of Malignant Tumors (8th edition) categorizes rectal cancer as a tumor starting in the rectum, i.e., the last 12 centimeters of the colon [4]. The T-categorization of rectal cancer, a widely used rectal cancer staging criterion, pathologically classifies its progression by the degree of tumor invasion into the rectal wall. In magnetic resonance (MR) images, the T-category is determined by the relative location of rectal cancer and the rectal wall [5]. Since current treatment guidelines for rectal cancer utilize the T-category to recommend clinical treatments, accurate segmentation of rectal cancer is crucial. However, in practice, radiologists manually locate rectal cancer using MR images. Manual localization is time-consuming, so a reliable automatic segmentation system is necessary [6].
In recent years, deep learning has improved the state-of-the-art methods in various fields related to computer vision [7]. Its wide applicability derives from its ability to find complex structures in high-dimensional data. The introduction of convolutional neural networks (CNNs) has enhanced the ability of deep learning to learn a complex representation of images. Its performance has been further improved by the incorporation of new network backbones and convolution blocks [8][9][10][11][12].
Deep learning has also proved its applicability in various medical image analysis tasks, including medical image segmentation [13]. For example, Ronneberger et al. have introduced the U-Net by implementing a VGG-Net-like encoder together with a mirrored decoder for cell segmentation [14]. This encoder-decoder approach implements a fully convolutional network which is known to improve the computational efficiency of patch-based segmentation methods [6,15]. Further, Milletari et al. have extended the U-Net to 3D images and introduced the Dice Similarity Coefficient (DSC)-based loss function for the segmentation of the prostate volume in magnetic resonance imaging (MRI) [16].
However, automation of medical image analysis remains challenging due to the inherent complexity of medical images and the extensive variation between patients [17]. Such complexity and large variability within data call for a model with a large capacity such as a deep neural network (DNN), able to discover intricate structures in the data. However, since high-capacity models can fit the intricate details of the data, they are usually less robust to data variation, and prone to overfitting, unless trained with many samples [18]. Unfortunately, in practice, relatively few annotated images are available in the medical field, so that overfitting can be a problem in building DNN models for medical image analysis.
There have been various attempts to moderate overfitting in deep learning, such as batch normalization, drop-out, data augmentation, image normalization, etc. [19][20][21]. In addition, multi-task learning (MTL) is known to reduce the risk of overfitting [18,22]. By adding a different task, the parameters of the model are optimized towards values that can explain the variation observed in both tasks, thus reducing the risk of overfitting for the original network. In the case of a DNN model, the shared portion of the MTL network can be constrained towards values with better generalization ability if the additional task provides information relevant to the original task. Therefore, adding another task will reduce the risk of overfitting, if the additional task is related to the original one.
The risk of overfitting can be assessed by bias-variance analysis since overfitting is caused by high variance [18]. Specifically, a bias-variance analysis decomposes the generalization error into model bias and model variance. The analysis evaluates the model variance by creating multiple models from a single learner by varying the learner training sets. By varying the training set, bias-variance analysis can assess if the model is robust to data variation. If the learner is not robust and cannot generalize the data well, varying the training set will cause highly variant models. As a result, the risk of overfitting can be reduced by lowering the model variance.
Although such model robustness can also be measured by the discrepancy between training accuracy and validation accuracy, selection bias in choosing the training and validation sets can be a problem. In fact, selection bias can be especially critical for medical data, due to their limited size. Also, measuring the discrepancy between training accuracy and validation accuracy cannot capture the model robustness within a specific Region-of-Interest (ROI). However, in medical image analysis, model performance is especially important in the regions adjacent to the positive (cancerous) area. Consequently, a method that can measure the model robustness within an arbitrary ROI can help in building models for medical image analysis.
Model variance is important to DNN-based medical image analysis models, but model bias cannot be ignored either. In fact, model bias contributes to the expected loss between ground truth and prediction through a trade-off relationship with model variance [18]. Although the mean squared loss is often used to derive the bias-variance decomposition, the unified theorem of bias-variance decomposition enables arbitrary loss functions to be decomposed into noise, bias, and variance, without loss of generality [23]. Despite the importance of model variance, we are not aware of any report that suggests a method to perform a bias-variance analysis of a segmentation model. Also, we have not come across any study reporting an automatic system for segmenting both rectal cancer and rectum at once using MR images, although cancer staging indeed requires the demarcation of both rectum and rectal cancer. However, for the automatic segmentation of rectal cancer, Trebeschi et al. proposed a patch-based CNN model using both T2-weighted and diffusion-weighted images from MRI [6]. As a validation method, all image data were equally divided between the training and validation sets, which might lead to selection bias. Also, the post-processing method ran the risk of removing valid tumor areas, except for the largest tumor region.
In this study, we propose a pixel-wise bias-variance decomposition method to measure the model variance of a segmentation network. This pixel-wise approach can not only measure the expected model bias and variance within an arbitrary ROI but can also visualize the bias and variance maps of a sample image. We also suggest two methods to reduce the model variance of a rectal cancer segmentation network: 1) a multi-region segmentation network (MRSN), obtained by adding a rectum segmentation task to the rectal cancer segmentation network; and 2) an augmentation method that resizes each mini-batch into a random scale. The efficacy of the two proposed methods in reducing the model variance is validated in Section III A by the suggested pixel-wise bias-variance decomposition. Section II D explains the proposed pixel-wise bias-variance decomposition in detail, whereas Sections II C and II B describe the MRSN and the suggested augmentation method, respectively. Note that in this study the term "model" denotes a learner whose parameters have been optimized using a training set, whereas "network" denotes a DNN learner.

A. DATA PREPARATION
The experiment was performed using MR images of 1,813 rectal cancer patients, obtained between September 2004 and June 2016 at the National Cancer Center of South Korea. Among these cases, 457 were selected after disregarding the cases with at least one of the following properties: preoperative chemo-radiation, incomplete pathologic stage information, disagreement between MR image and pathologic staging, pathologic stage T1 or T4, tumors located more than 13 cm or less than 3 cm from the anal verge, or the application of either clipping or stents. The whole study was conducted according to the principles of the Declaration of Helsinki, and the protocol was approved by the Institutional Review Board of our institution (NCC2017-0031).
Among approximately 30 image slices per patient, we selected one or two to create the dataset. The 907 selected images clearly reflected the T-category of the patient by showing the rectum with clear appearance of either T2 or T3 rectal cancer. Two gastrointestinal clinical specialists were involved not only in selecting the 907 image slices from the 457 cases, but also in the manual delineation of both rectum and rectal cancer. Specifically, one specialist drew the boundary, and the other specialist confirmed the outcome. These manual segmentation results were used as our ground truth.
For bias-variance analysis, we set apart 10% of the 907 images as a test set. Then, we used the remaining 90% to create nine training sets for which we performed 9-fold cross-validation. Nine-fold cross-validation was adopted only to create nine training sets, so we disregarded the nine validation sets thus created. With these nine different training sets, we created nine different models per network. Note that we did not create many training sets by a random sampling method such as bootstrapping since the training of many DNN models is time-consuming. Instead, all networks shared the same nine training sets and the single test set. This approach not only allowed for the fair comparison of learners by sharing the same nine training sets among different learners, but also allowed using all the available data efficiently.
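The data split described above can be sketched as follows. This is only an illustration using the Python standard library: the function name `make_training_sets`, the fixed seed, and the image-level shuffling are assumptions, and the text does not state whether the split was made per image or per patient.

```python
import random


def make_training_sets(n_images, n_folds=9, test_fraction=0.1, seed=0):
    """Hold out a test set, then build one training set per fold.

    Each training set is the non-test data minus one fold. The left-out
    folds (the would-be validation sets) are simply discarded.
    """
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # seed is an illustrative choice

    n_test = round(n_images * test_fraction)
    test_set = indices[:n_test]
    remaining = indices[n_test:]

    # Deal the remaining indices into n_folds nearly equal folds.
    folds = [remaining[k::n_folds] for k in range(n_folds)]

    training_sets = []
    for k in range(n_folds):
        held_out = set(folds[k])
        training_sets.append([i for i in remaining if i not in held_out])
    return training_sets, test_set


# Nine overlapping training sets and one shared test set for 907 images.
training_sets, test_set = make_training_sets(907)
```

Every network would then be trained once per training set, so that all learners see exactly the same nine training sets and the same test set.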

B. PREPROCESSING
We applied both image intensity range normalization and histogram equalization to improve image contrast and generalization [24]. As a normalization step, 90% of both the maximum and the minimum intensity value from the overall image slices of a patient were used to reduce the image depth from 12-bit to 8-bit. We also applied contrast-limited adaptive histogram equalization to enhance the contrast as well as to reduce the illumination effect [25][26][27]. As shown in Fig. 1, an image with a high-intensity artifact region, which decreases the overall image contrast, became more interpretable after preprocessing.
Motivated by Dao et al. [28], we also performed data augmentation to reduce the model variance of our proposed rectal cancer segmentation network, described in Section II C. In particular, we aimed to enhance the scale-robustness of our segmentation system since our raw MR images have heterogeneous scales (from 512×512 to 1056×1056) depending on the MR scanner and its settings. It should be noted that equalizing the pixel spacing of all images does not invalidate the need for scale-robustness; the anatomical structures in MR images can still differ in scale while being similar in shape even if the pixel spacing is equalized among all images. Above all, equalizing pixel spacing is infeasible because fixing the pixel spacing will make the size of images within a mini-batch heterogeneous. To enhance the scale-robustness of our segmentation network, we resized each mini-batch into a randomly chosen scale (ranging from 192×192 to 288×288); the fully convolutional nature of our network allows input images to have different sizes. Note that we did not crop images after random resizing to synchronize the size of all training images. Moreover, we neither created an image pyramid nor added a supplementary network for scale-robustness, either of which would have increased the computational cost and the implementation complexity [29]. Instead, we trained the network with images at heterogeneous scales and with their original fields of view uncropped. Given that medical images usually vary in scale due to the variability of the scanners and their settings, this augmentation method is expected, in general, to enhance the scale-robustness of other medical image analysis systems as well. The efficacy of this augmentation approach is evaluated in Sections III A and B.
Besides scale augmentation, we also performed the conventional random augmentation of the training images, i.e., adjusting the contrast, brightness, and sharpness, followed by rotation, flipping (left and right), and cropping (maximum 10% from the edge and preserving the square shape). It should be noted that neither the validation nor the test data were augmented, but just resized to the single scale of 256×256.
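A minimal sketch of the per-mini-batch random resizing follows, using plain Python lists as stand-ins for image arrays. The helper names, the nearest-neighbour interpolation, and the step of 32 pixels between candidate sizes are assumptions; the text only gives the range 192×192 to 288×288.

```python
import random


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of a 2D image given as a list of rows."""
    in_h, in_w = len(image), len(image[0])
    row_idx = [r * in_h // out_h for r in range(out_h)]
    col_idx = [c * in_w // out_w for c in range(out_w)]
    return [[image[r][c] for c in col_idx] for r in row_idx]


def random_scale_batch(images, masks, rng, min_size=192, max_size=288, step=32):
    """Resize all images and masks of ONE mini-batch to one random square size.

    The size is drawn once per mini-batch, so every sample in the batch
    shares the same shape and no cropping is needed afterwards.
    """
    size = rng.choice(range(min_size, max_size + 1, step))  # step is assumed
    resized_images = [resize_nearest(im, size, size) for im in images]
    resized_masks = [resize_nearest(m, size, size) for m in masks]
    return resized_images, resized_masks, size
```

Nearest-neighbour sampling keeps the masks binary; for the images themselves, an interpolating resize from an imaging library would normally be preferred.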

C. SEGMENTATION NETWORK ARCHITECTURE
We developed an encoder-decoder segmentation network to improve the computational efficiency of an existing rectal cancer segmentation method [6]. The first convolution layer of the network involves forty 3×3 filters. The number of filters per layer at the encoder is doubled after each convolution block as in the VGG-Net or U-Net neural networks [9,14]. The decoder is a mirrored version of the encoder, and the number of filters per layer is halved through convolution transpose. Appendix A describes how the convolution block, which is illustrated in Fig. 6-(c), is selected for our network.
As depicted in Fig. 2, we added the rectum segmentation task to reduce the model variance of the rectal cancer segmentation network. The geometric correlation between the rectum and rectal cancer, which can be noticed from Fig. 3, motivated us to adopt this MTL-based approach. Specifically, rectal cancer is mostly located inside the rectum since it grows from the rectum area, as can be seen in Fig. 3 [5]. Since our dataset only includes images that clearly reflect either T2- or T3-stage rectal cancer, rectal cancer is always found along the rectum wall [4]. Moreover, rectum and rectal cancer often share some portion of their boundaries, as can also be seen in Fig. 3. By adding a rectum segmentation task, our network yields the prediction for the rectum boundary as a by-product, which is clinically informative as well, especially in cancer staging.
To implement the additional segmentation task, we added a task-specific 1×1 convolution layer for rectum segmentation after the last convolution block, as shown in Fig. 2. Note that we did not use a softmax function, but calculated the probability of each class by a logistic sigmoid after its own task-specific convolution layer, because rectal cancer and rectum can overlap each other and thus are not mutually exclusive. In this paper, single-region segmentation network (SRSN) denotes the network without the additional task-specific layer, whereas multi-region segmentation network (MRSN) denotes the network with two parallel task-specific layers. In addition, MRSN-AUG denotes the MRSN with data augmentation based on image resizing, as described in Section II B. Consequently, the SRSN can segment only a single region, either rectum or rectal cancer, whereas both MRSN and MRSN-AUG segment both regions at once.
The efficacy of both MTL and image resizing-based augmentation in reducing model variance was evaluated by bias-variance analysis. The bias and variance of the SRSN, MRSN, and MRSN-AUG are compared in Section III A, whereas their segmentation performance based on DSC with 10-fold cross-validation is compared in Section III B.
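The role of the two parallel task-specific layers can be illustrated with a small sketch. A 1×1 convolution is a per-pixel linear map of the shared feature vector, so it is written here as a dot product; the function names, weights, and feature values are hypothetical and are not taken from the trained network.

```python
import math


def sigmoid(z):
    """Logistic sigmoid mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def two_head_probabilities(features, w_tumor, b_tumor, w_rectum, b_rectum):
    """Apply two parallel per-pixel linear maps, each followed by a sigmoid.

    features: list of per-pixel feature vectors from the shared last block.
    Returns one (p_tumor, p_rectum) pair per pixel. The two probabilities
    are independent and need not sum to one, unlike a softmax output.
    """
    out = []
    for f in features:
        z_tumor = sum(w * x for w, x in zip(w_tumor, f)) + b_tumor
        z_rectum = sum(w * x for w, x in zip(w_rectum, f)) + b_rectum
        out.append((sigmoid(z_tumor), sigmoid(z_rectum)))
    return out


# One pixel that lies in both the tumor and the rectum (made-up numbers).
probs = two_head_probabilities([[1.0, 0.0]], [2.0, 0.0], 0.0, [3.0, 0.0], 0.0)
```

With independent sigmoids, a pixel can be assigned to both regions at once, which a softmax over the two classes would not allow.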
All filter weights were initialized using the normal distribution sampling method suggested by He et al. [26], except for the transposed convolution filters, which were initialized using the uniform distribution sampling method suggested by Glorot and Bengio [30,31]. The Adam optimization algorithm was used to stochastically optimize the parameters with a mini-batch size of 20 [32]. Due to the preponderance of negative pixels, we implemented the DSC-based loss function suggested by Milletari et al. as our optimization objective function, which is built on the Dice coefficient
$$D = \frac{2 \sum_{i}^{N} p_i g_i}{\sum_{i}^{N} p_i^2 + \sum_{i}^{N} g_i^2},$$
where the sums run over the N pixels, the predicted binary segmentation pixel being indicated by $p_i \in P$ and the ground truth binary pixel by $g_i \in G$ [16].
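The Dice-based objective of Milletari et al. can be sketched in a few lines. The `1 - D` form of the loss and the small smoothing constant `eps` are assumptions made for this illustration; the text does not specify either.

```python
def dice_coefficient(pred, truth, eps=1e-7):
    """Dice coefficient 2*sum(p*g) / (sum(p^2) + sum(g^2)) over flat pixel lists.

    pred holds predicted probabilities (or binary values), truth holds the
    binary ground truth. eps avoids division by zero on empty masks.
    """
    intersection = sum(p * g for p, g in zip(pred, truth))
    denominator = sum(p * p for p in pred) + sum(g * g for g in truth)
    return (2.0 * intersection + eps) / (denominator + eps)


def dice_loss(pred, truth):
    """Loss to minimize: one minus the Dice coefficient (assumed form)."""
    return 1.0 - dice_coefficient(pred, truth)
```

Because the coefficient depends only on the overlap and the sizes of the two foreground sets, the large number of correctly predicted negative pixels does not dominate the objective.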

D. PIXEL-WISE BIAS-VARIANCE ANALYSIS FOR SEGMENTATION NETWORKS
We propose a method to quantify the bias, variance, and expected loss of a segmentation network within an arbitrary ROI, such as a cancerous region. This method allows us to confirm if both the additional rectum segmentation task and the augmentation method based on image resizing reduce the model variance of the rectal cancer segmentation network without increasing the model bias.
We measured bias and variance in accordance with the unified definition suggested by Domingos [23]. However, two additional conditions should be considered for our problem. First, we have a single ground truth per test sample image. Second, our prediction is a multi-dimensional vector since the segmentation network predicts an image. Considering these two additional conditions, we decided to perform a pixel-wise bias-variance analysis. Then we calculated the expected values of bias, variance, and expected loss within an arbitrary area by averaging. Our approach for generating the training sets and the test set is described in Section II A.
The main prediction for the test image $x$ at pixel $i$ for a loss function $L$ and a set of training sets $D$ becomes
$$y_m(x_i) = \arg\min_{y'} E_D\left[L\big(y(x_i), y'\big)\right],$$
where $y(x_i)$ is the prediction value for the test image $x$ at pixel $i$. We can specify our loss function as a zero-one loss since our problem is a pixel-wise classification problem. Considering that we have nine training sets, the main prediction at pixel $i$ becomes the mode among nine binary predictions at pixel $i$. Now, bias and variance can be defined using the main prediction.
The bias of a network for the test image $x$ at pixel $i$ is
$$B(x_i) = L\big(t(x_i), y_m(x_i)\big),$$
where $t(x_i)$ represents the ground truth of the test image $x$ at pixel $i$, and $y_m(x_i)$ is the main prediction, which is the mode, as stated above. The variance of the test image $x$ at pixel $i$ can be defined as
$$V(x_i) = E_D\left[L\big(y_m(x_i), y(x_i)\big)\right].$$
Now, we can compare the three different rectal cancer segmentation networks (i.e., SRSN, MRSN, and MRSN-AUG) in terms of their expected values of bias, variance, and zero-one loss within an arbitrary ROI. We measured the expected values over the entire image as well as over the positive (cancerous) region of the test images and reported the results in Section III A. We calculated the expected values within positive regions for two reasons. First, segmentation performance is more important in the positive than in the negative region. Second, the negative region contains an excess of non-body area, and segmentation models usually classify non-body regions correctly without difficulty. Consequently, including negative regions can excessively dilute the expected values (bias and variance); this makes it needlessly hard to prove a statistically significant difference between networks with these expected values. We performed statistical tests to objectively compare the distributions of the expected values per test sample from different networks.
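The pixel-wise decomposition under zero-one loss can be sketched as follows, using plain Python lists for binary maps. The function names `pixelwise_bias_variance` and `roi_mean` are hypothetical, and the toy example uses three models instead of nine to keep it readable.

```python
def pixelwise_bias_variance(predictions, truth):
    """Pixel-wise bias, variance, and expected zero-one loss.

    predictions: list of binary maps (one per trained model), each a list of rows.
    truth: binary ground-truth map of the same shape.
    An odd number of models guarantees that the mode is unique.
    """
    n_models = len(predictions)
    height, width = len(truth), len(truth[0])
    bias = [[0] * width for _ in range(height)]
    variance = [[0.0] * width for _ in range(height)]
    loss = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            votes = [p[r][c] for p in predictions]
            main = 1 if 2 * sum(votes) > n_models else 0  # mode of binary votes
            bias[r][c] = int(main != truth[r][c])
            variance[r][c] = sum(v != main for v in votes) / n_models
            loss[r][c] = sum(v != truth[r][c] for v in votes) / n_models
    return bias, variance, loss


def roi_mean(value_map, roi_mask):
    """Average a pixel-wise map over the pixels where roi_mask is nonzero."""
    values = [v for v_row, m_row in zip(value_map, roi_mask)
              for v, m in zip(v_row, m_row) if m]
    return sum(values) / len(values)


# Toy example: three models, one 1x3 test image.
preds = [[[1, 0, 1]], [[1, 0, 0]], [[1, 1, 0]]]
truth = [[1, 0, 1]]
bias, variance, loss = pixelwise_bias_variance(preds, truth)
tumor_variance = roi_mean(variance, truth)  # positive region used as the ROI
```

In this two-class, single-ground-truth setting, the expected loss at a pixel equals the variance where the main prediction is correct and one minus the variance where it is wrong, which is the behaviour described by Domingos for zero-one loss.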

E. PERFORMANCE EVALUATION WITH CROSS-VALIDATION
Along with bias-variance analysis, we tested if our approach to reduce the model variance would also demonstrate improvement when evaluated by the conventional evaluation scheme. Specifically, we measured DSC, sensitivity, and specificity, via 10-fold cross-validation of 907 images, to compare the performance of SRSN, MRSN, and MRSN-AUG. We discuss the results in Section III B. Moreover, we used the same evaluation method to compare the performance of different convolution blocks, as described in Appendix A. DSC is a widely used metric in medical image segmentation tasks due to its robustness to highly imbalanced classes [24]. The DSC between two sets $A$ and $B$ (e.g., prediction and ground truth) is defined as
$$\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}.$$
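For binary masks, the three evaluation metrics named above reduce to counts of true and false positives and negatives. The sketch below uses flat lists of 0/1 pixels; the function names are illustrative, and empty denominators are not handled.

```python
def confusion_counts(pred, truth):
    """Count true/false positives and negatives for flat binary masks."""
    tp = fp = fn = tn = 0
    for p, g in zip(pred, truth):
        if p and g:
            tp += 1
        elif p and not g:
            fp += 1
        elif g:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def dsc(pred, truth):
    """Dice similarity coefficient: 2|A∩B| / (|A| + |B|)."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    return 2 * tp / (2 * tp + fp + fn)


def sensitivity(pred, truth):
    """Fraction of ground-truth positive pixels that were predicted positive."""
    tp, _, fn, _ = confusion_counts(pred, truth)
    return tp / (tp + fn)


def specificity(pred, truth):
    """Fraction of ground-truth negative pixels that were predicted negative."""
    _, fp, _, tn = confusion_counts(pred, truth)
    return tn / (tn + fp)
```

Note that the DSC ignores true negatives, which is why it stays informative when the negative class dominates the image.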

A. PIXEL-WISE BIAS-VARIANCE ANALYSIS
We compared the model variance of three different rectal cancer segmentation networks in order to assess the efficacy of our proposed methods, i.e., the addition of a rectum segmentation task and the augmentation method based on image resizing. Considering the geometric correlation of the rectum to rectal cancer as well as the heterogeneous scale of our MR images, we assumed that both our methods would reduce the variance of the rectal cancer segmentation network. To this end, the test set was predicted by nine different models trained on nine different training sets, as described in Section II A. Fig. 4 shows an example of the nine different rectal cancer prediction maps generated by the three different networks, i.e., SRSN, MRSN, and MRSN-AUG. The ground truth for this test sample is provided in Fig. 5, column (d). From each network, nine prediction maps (P1-P9) were generated by varying the training set. Conversely, each training set yielded three prediction maps, one per network; for example, the three maps labeled P1 come from SRSN, MRSN, and MRSN-AUG trained on the same training set. Large variations among the nine prediction maps indicate that the network cannot generalize well upon variation of the training data, which can lead to overfitting.
Fig. 5 presents two test images overlaid with the corresponding pixel-wise bias, variance, and expected loss maps of the three different rectal cancer segmentation networks. Because most negative (noncancerous) regions show neither bias nor variance, we cropped the images according to the yellow box in column (d) for ease of visualization. Column (d) also illustrates the cropped MR image as well as the ground truth overlaid with the cropped MR image. The yellow arrows in columns (a) and (b) indicate distinct regions compared to the map below. For example, arrows on the variance map created by MRSN indicate the regions where major differences in variance between MRSN and MRSN-AUG occur.
The variance maps visualize the regional robustness of the rectal cancer segmentation networks. Moreover, combined with the bias maps, the variance maps create the expected loss map, thus visualizing how variance affects the expected loss. Fig. 5 suggests that bias and variance tend to occur at the boundary of the rectal cancer regions, which may call for a loss function that weighs the border region more heavily than other regions, as suggested by Ronneberger et al. [14]. Alternatively, the general segmentation loss and the boundary segmentation loss can be treated as separate tasks and be merged by adding their respective losses, as suggested by Wang et al. [33]. The yellow arrows suggest that both adding a rectum segmentation task and the augmentation method based on image resizing reduce the variance (or bias) of rectal cancer segmentation networks in the positive regions.
Furthermore, the expected values over the entire image as well as over the positive regions, which were described in Section II D, were used to assess the efficacy of our two proposed methods in reducing the model variance of rectal cancer segmentation networks. Table 1 shows that a significant difference (p < 0.05, paired t-test) in model variance within the positive region was observed between SRSN and MRSN as well as between MRSN and MRSN-AUG. Adding the rectum segmentation task decreased the variance of rectal cancer segmentation by a factor of 0.90, whereas the augmentation method further lowered the variance by a factor of 0.89, on average. Moreover, the augmentation based on image resizing significantly reduced the variance of rectal cancer segmentation networks over the entire image. Neither approach increased the bias. Instead, both approaches decreased the bias, although not by a statistically significant amount.
In the future, the bias-variance decomposition using DSC as a loss function will be an interesting topic of investigation since DSC is a widely used metric in medical image analysis. Given that our main focus was to confirm that rectum information could improve rectal cancer segmentation, we left the design of an elaborate task-specific layer for both rectum and rectal cancer segmentation to future research. It has to be noted that our network has a limitation in that it neglects the information from neighboring image slices. The investigation of a 3D segmentation network with stacked rectal MR images will be an interesting topic for future investigation. Although previous studies have also used 2D images in order to reduce the computational complexity, a 3D network can exploit information along the vertical axis [34,35]. Both the segmentation performance and the bias-variance map of the 2D network can also be compared to those of the 3D network.

B. PERFORMANCE EVALUATION WITH CROSS-VALIDATION
We asked whether our approaches to reducing model variance would also show improvement under conventional evaluation methods. The performance of SRSN, MRSN, and MRSN-AUG was compared using the method described in Section II E, and the results are reported in Table 2.
Significant differences in tumor DSC as well as in tumor sensitivity (p < 0.05, paired t-test) were observed between SRSN and MRSN. The augmentation method based on image resizing also improved the segmentation performance of MRSN in tumor DSC, rectum DSC, tumor sensitivity, and rectum specificity, with statistical significance.
Our approaches reduced the training duration as well. The training of the MRSN networks took less time than that of the rectal cancer and rectum SRSNs by an average factor of 0.96 and 0.81, respectively. The augmentation further reduced the MRSN training duration by a factor of 0.78 on average.

IV. CONCLUSION
Deep learning has improved the state-of-the-art in various computer vision-related tasks, including image segmentation. Although most deep learning-based models were trained on large datasets, medical datasets are usually more limited in size [36][37][38]. In particular, annotated data for medical image segmentation are especially scarce due to the difficulty of manual delineation. Consequently, deep learning-based medical segmentation models risk suffering from model variance, which can cause overfitting. Methods able to reduce and evaluate the variance of segmentation models are thus important.
In this study, we suggested methods to measure and reduce the model variance of a rectal cancer segmentation model. First, we proposed a method for the pixel-wise bias-variance analysis of segmentation networks. This method can visualize the map of bias, variance, and expected loss, and also quantify their expected values within an arbitrary ROI. Second, we exploited the geometric correlation between the rectum and rectal cancer to reduce the model variance of the deep learning-based rectal cancer segmentation network. Lastly, we performed data augmentation by resizing mini-batches of images to further reduce the model variance. Such an approach was motivated by the common scale heterogeneity of medical imaging datasets.
To demonstrate the efficacy of these two approaches in reducing variance without increasing bias, we applied the proposed pixel-wise bias-variance analysis method. Both approaches successfully reduced the model variance, especially within the positive region, without increasing the bias, and reduced the training duration as well. The efficacy of our approaches was also confirmed by using DSC via 10-fold cross-validation. Besides, our encoder-decoder segmentation network also improves on the computational efficiency of a previous rectal cancer segmentation method [6]. Clinically, our network can effectively assist radiologists, because the demarcation of both rectum and rectal cancer is required for rectal cancer staging. By reducing the model variance, our approach can improve the accuracy of rectal cancer staging as well. Other cancer segmentation networks may be inspired by our approach to lowering the variance by exploiting the geometric correlation between cancer and the organ from which cancer grows. In future research, we will develop a 3D rectal cancer segmentation network with stacked MR images and compare its performance with that of the 2D network.

A. CONVOLUTION BLOCK STUDY
This section describes the details of our network. Our encoder-decoder network (Fig. 2) involves seven convolution blocks for which we have selected the best block among three candidates as illustrated in Fig. 6. Specifically, block3 adopts two consecutive residual connections whereas block1 and block2 are conventional VGG-style convolutions without a skip connection and a conventional residual block, respectively [11]. For both block2 and block3, we implemented a pre-activation policy [11].
We compared the segmentation performance of three different blocks on MRSN using the method described in Section II E. As described in Table 3, block3 scored the highest DSC for the rectal cancer segmentation task and thus was selected as our convolution block. In addition, it also scored the best for the rectum segmentation task. Using block3, the network was trained faster than using the other two blocks (block1 was 1.46 times slower, and block2 was 1.04 times slower than block3).
Table 3 note: All quantities are expressed as mean ± standard deviation and were obtained through 10-fold cross-validation. Blocks 1, 2, and 3 refer to the convolution blocks described in Fig. 6.

B. BIAS-VARIANCE ANALYSIS WITHIN NON-CANCEROUS REGION
We focused on the positive regions to measure the expected values of bias, variance, and expected loss of the rectal cancer segmentation networks. However, bias and variance can also occur in negative areas, albeit rarely, and our method of exploiting the geometric correlation between rectum and rectal cancer can mitigate such problems. In Fig. 7, the SRSN model variance occurs at an organ with an appearance similar to that of the rectum. Such variance in negative regions is removed by adding a rectum segmentation task to the network. Information about the rectum location can benefit the rectal cancer segmentation network since rectal cancer rarely exists outside of the rectum, especially in datasets, like our own, that do not contain tumors in the T4 group.