Spatial Enhanced Rotation Aware Network for Breast Mass Segmentation in Digital Mammogram

Breast cancer is the most common cancer with the highest mortality risk among females worldwide, and breast mass is the most effective sign for cancer identification. Thus, accurate segmentation of breast masses is regarded as a key step toward reducing the death rate. Traditional segmentation methods require prior knowledge and manually set parameters, while recent studies prefer to construct neural networks based on feature reuse. However, breast masses can appear in different orientations and the spatial context is complex, which makes segmentation a challenging task. To address these concerns, we propose a Spatial Enhanced Rotation Aware Network (SERAN) for automatic breast mass segmentation. SERAN consists of two critical components: 1) a residual attention encoder with a spatial enhancement mechanism for effective feature extraction, and 2) a decoder constructed from multi-stream rotation aware blocks for feature fusion and prediction refinement. To optimize SERAN better and avoid misclassification in the background area, a regularization term named Inside-outside Loss (IOL) is used during training. Experimental results on a representative subset of the Digital Database for Screening Mammography (DDSM) show that SERAN outperforms state-of-the-art methods on most of the adopted evaluation metrics.

As is known to all, breast mass is one of the most effective signs for cancer identification [3]. Thus, breast mass segmentation on medical images is regarded as the first step of early diagnosis and the key step prior to the classification of benign and malignant masses. Traditional approaches to breast mass segmentation are manual, time-consuming and heavily dependent on the radiologist's experience. To reduce processing time and improve segmentation accuracy, computer-aided detection (CADe) technology has developed rapidly since the late 1980s [4], and digital mammography is the most reliable technique widely used in breast mass segmentation [5]. However, breast masses vary over a wide range in shape, size and texture, which makes segmentation a challenging task [6].

Various machine learning algorithms have been utilized to establish traditional CADe systems for disease detection and segmentation [7], [8], especially for breast mass segmentation [9]. Region growing and thresholding were the two most widely used methods. With respect to region growing, Mencattini et al. [10] introduced an effective region growing algorithm for breast mass segmentation on the Digital Database for Screening Mammography (DDSM) [11] dataset. They followed the typical processing flow consisting of artifact removal, contrast enhancement and segmentation. Moreover, an iterative post-processing step was designed to remove peninsulas on the boundary of the mass. A precision score was reported for this approach.

In patch-based CNN segmentation, an image patch only contains local context information about the patch itself and loses the global context of the whole image. Because of this, the performance of traditional CNNs is limited. To address these problems, Long et al. [21] proposed fully convolutional networks (FCN) for segmentation. FCN replaces all fully connected layers with convolution layers, so its input may be of arbitrary size. As the segmentation result should have the same size as the input, FCN upsamples the feature maps from high-level layers and introduces skip connections to fuse the outputs from different layers. The experimental results showed that the spatial information supplied by skip connections and gradual upsampling improves the performance of FCN; however, the segmentation details are still far from expectation. Inspired by FCN, many successful neural networks were proposed for image segmentation, such as U-Net [19], DeepLab [22] and PSPNet [23]. Among these works, the encoder-decoder structure of U-Net is the easiest to implement and extend. The encoder part of U-Net employs multi-stage convolutions connected by max-pooling layers to extract features and expand receptive fields, while the decoder part contains multi-stage convolutions connected by upsampling layers to fuse the features and expand the spatial size. A skip connection is applied at every decoder stage to concatenate the feature map from the upsampling layer with the feature maps from the corresponding encoder stage. Benefiting from small-stride upsampling, which expands the size gradually, and skip connections, which make up for spatial information, U-Net achieves great improvement in image segmentation.
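As an illustration of this encoder-decoder pattern, the following is a minimal PyTorch sketch of one U-Net-style decoder stage; the module and channel names are ours, not from the paper:

import torch
import torch.nn as nn

class UNetDecoderStage(nn.Module):
    """One U-Net-style decoder stage: upsample by 2, concatenate the
    skip feature map from the matching encoder stage, then fuse with
    two 3x3 convolutions."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # expand the spatial size gradually
        x = torch.cat([x, skip], dim=1)  # skip connection restores detail
        return self.fuse(x)

# Example: a 32x32 bottleneck map fused with a 64x64 encoder map.
stage = UNetDecoderStage(in_ch=256, skip_ch=128, out_ch=128)
out = stage(torch.randn(1, 256, 32, 32), torch.randn(1, 128, 64, 64))
print(out.shape)  # torch.Size([1, 128, 64, 64])

Each such stage doubles the spatial size, so chaining several stages undoes the corresponding max-pooling steps of the encoder.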

Afterwards, various U-Net based architectures were proposed to handle different application scenes [24]-[27], especially breast mass segmentation [28]-[30]. Li et al. [28] proposed an Attention Dense-U-Net for breast mass segmentation on a subset of DDSM. They used densely connected blocks to remould the encoder and applied Attention Gates [31] at the skip connection step for spatial information enhancement. An Area Under the Receiver Operating Characteristic Curve (AUC) of 0.8605 was reported. Hai et al. [29] remoulded both the encoder and decoder with densely connected blocks. Moreover, they held the opinion that multi-scale context is useful in segmentation tasks, so they used the atrous spatial pyramid pooling (ASPP) mechanism to capture multi-scale context. Their method achieved the best performance on a self-collected dataset of 380 mammograms. Different from the approaches mentioned above, Sun et al. [30] proposed an attention-guided dense-upsampling block to decrease the information loss in the upsampling operation. The original upsampling operation was replaced by a dense-upsampling block, and channel-wise attention was added for information enhancement. An average Dice coefficient of 81.8% was achieved on the CBIS-DDSM dataset.

To improve the encoder part of U-Net, most of the studies mentioned above held the opinion that feature reuse, which can be modeled by residual connections [17], is the most important factor. Unlike them, we think that spatial information is the most important factor.

The rest of this paper is organized as follows. In Section II, we describe the dataset and pre-processing as well as the architecture of the proposed SERAN. To verify the ability of the proposed method, extensive experiments are summarized in Section III. Finally, a conclusion is drawn in Section IV.

This paper aims to build a Spatial Enhanced Rotation Aware Network (SERAN) for breast mass segmentation in digital mammograms. The overview of SERAN is illustrated in Fig. 1. The materials used in this study and the details of SERAN are described in the following sections.

A. DATA PREPARATION
The DDSM database from the University of South Florida is used in this study. Approximately 2,500 cases are collected in DDSM, and each case contains two X-ray mammograms of each breast. The database can be downloaded online, and all the images are larger than 2000 × 3000 pixels. The location and type of the suspicious area in each image have been annotated by experienced radiologists as ground truth. The images are encoded in ".LJPEG" format and the corresponding ground truths are saved in ".OVERLAY" format. To transform the images and the ground truths to ".PNG" format, which can be easily used during programming, a public tool named DDSM-LJPEG-Converter is used.

In this study, 400 representative images are selected from DDSM by an experienced radiologist. The selection standard can be described as follows: 1) the mass is relatively clear in the image, and 2) the annotated area covers the mass but is not much bigger than it. The images are randomly divided into three non-overlapping parts in a ratio of 4:1:1, used as the training set, validation set and test set.

As shown in Fig. 3, the shapes and sizes of different breast masses vary over a wide range. Thus, a powerful encoder is required for effective feature extraction.

A typical encoder is made up of convolution layers for feature extraction and pooling layers for receptive field expansion. All the layers are connected in series, so the input of layer l is only the output of layer l-1. However, the ability of such an encoder may degrade because of gradient vanishing. To address this problem, He et al. [17] proposed ResNet with residual connections. With a residual connection, the input of convolution layer l is summed with the convolution result to form the output, which can be expressed as:

x_{l+1} = σ( F(x_l, ω_l) + H(x_l) )

where x_l and x_{l+1} are the inputs of layer l and layer l+1, respectively, F(·, ω) is a learning function and H(·) is a mapping function. When the number of channels of x_l equals the channel number of F(·, ω), H(·) is an identity mapping (shown in Fig. 4(a)); otherwise, a 1 × 1 convolution is used in H(·) to adjust the channel number (shown in Fig. 4(b)). σ refers to the ReLU [35] activation function.
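For illustration, a minimal PyTorch sketch of such a residual unit follows; the normalization placement is our assumption, since the paper only specifies the summation and the ReLU activation:

import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual unit: x_{l+1} = sigma(F(x_l, w_l) + H(x_l)). H is the
    identity when channel counts match (Fig. 4(a)); otherwise a 1x1
    convolution adjusts the channels (Fig. 4(b))."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.F = nn.Sequential(  # learning function F(., w)
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.H = (nn.Identity() if in_ch == out_ch
                  else nn.Conv2d(in_ch, out_ch, kernel_size=1))
        self.act = nn.ReLU(inplace=True)  # sigma

    def forward(self, x):
        return self.act(self.F(x) + self.H(x))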

Residual connections can avoid gradient vanishing and provide feature reuse for better convergence. For these reasons, we employ the residual learning structure as a basic module in the encoder part.

In addition, in our view, spatial information is the most important factor for boosting performance in segmentation tasks. Thus, it is helpful to explicitly model attention over spatial information.

Inspired by [32], a spatial attention block (SA-Block) is designed to enhance the spatial context information at each pixel position. As shown in Fig. 5, the SA-Block consists of a feature extraction branch and a mask attention branch. The feature extraction branch utilizes two 3 × 3 convolution layers to capture features from the complex image context, while the mask attention branch is implemented as a U-Net-like structure that generates a single-channel attention map. The attention map re-weights the output of the feature extraction branch at every pixel position. SA-Blocks are embedded at several depths of the encoder: the deeper mask attention branch is embedded into the shallow layers of the encoder and the shallower one is embedded into the deeper layers.
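Since the exact SA-Block formulation is not recoverable from this copy, the following PyTorch sketch is an assumption modeled on the residual attention of [32], i.e. (1 + A(x)) ⊗ F(x); the one-level pool/upsample pair stands in for the paper's U-Net-like mask branch:

import torch.nn as nn

class SABlock(nn.Module):
    """Sketch of the SA-Block: a feature extraction branch (two 3x3
    convolutions) and a mask attention branch that yields a single-channel
    spatial attention map. The residual modulation (1 + A(x)) * F(x)
    follows [32] and is our assumption."""

    def __init__(self, ch):
        super().__init__()
        self.feature = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.mask = nn.Sequential(      # stand-in for the U-Net-like branch
            nn.MaxPool2d(2),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(ch, 1, kernel_size=1),
            nn.Sigmoid(),               # single-channel attention map in [0, 1]
        )

    def forward(self, x):
        return (1 + self.mask(x)) * self.feature(x)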

Composed of residual convolution blocks and SA-Blocks, the encoder can focus on extracting features from suspicious regions (as shown in Fig. 7). Moreover, the responses of regions with features similar to the target within an image are restrained gradually. The architecture of the encoder is shown in the left part of Fig. 1.

In complex scenes, masses can appear in any orientation, which we summarize as rotation. In order to assist doctors in precise diagnosis, the neural network must be aware of mass rotation and predict robust results under different rotations. The goal of each decoder stage is to fuse the features and refine the prediction of mass location and contour.

FIGURE 8. Comparison between a 3 × 1 kernel (red) and a 3 × 3 kernel (yellow). The 3 × 1 kernel obtains the same result on a horizontally flipped input, but the 3 × 3 kernel gets a different result.

As is commonly accepted, the detail information lost in the pooling-upsampling structure can be supplied through skip connections. Thus, skip connections are used to connect each encoder stage with the corresponding decoder stage: the output of each SA-Block is treated as a part of the input of the corresponding MSRA-Block.
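The MSRA-Block itself is not fully recoverable from this copy. Based on Fig. 8 and the description in the conclusion (a 3 × 3 convolution combined in parallel with 1 × 3 and 3 × 1 asymmetric convolutions), a hedged PyTorch sketch might look as follows; fusing the streams by concatenation plus a 1 × 1 convolution is our assumption, and the paper may sum them instead:

import torch
import torch.nn as nn

class MSRABlock(nn.Module):
    """Sketch of a multi-stream rotation aware block: a 3x3 convolution
    and two asymmetric convolutions (1x3 and 3x1) run in parallel, and
    their outputs are fused."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.s3x3 = nn.Conv2d(in_ch, out_ch, (3, 3), padding=(1, 1))
        self.s1x3 = nn.Conv2d(in_ch, out_ch, (1, 3), padding=(0, 1))
        self.s3x1 = nn.Conv2d(in_ch, out_ch, (3, 1), padding=(1, 0))
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * out_ch, out_ch, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        streams = [self.s3x3(x), self.s1x3(x), self.s3x1(x)]
        return self.fuse(torch.cat(streams, dim=1))

The asymmetric 1 × 3 and 3 × 1 streams respond symmetrically to flips along their short axis (cf. Fig. 8), which is the intuition behind combining them for rotation robustness.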

The output of the last MSRA-Block is used for the final prediction. A 1 × 1 convolution is employed to fuse the features, reduce the number of channels and keep the spatial size. Two feature maps are generated, and a channel-wise softmax operation transforms the pixel values into probabilities: one feature map represents the background and the other predicts the mass.

To train a segmentation network, binary cross-entropy (BCE) loss has been the most common choice due to its smooth derivative and stable optimization. The formulation of BCE can be denoted as:

l_ce = -( y_true · log(y_predict) + (1 - y_true) · log(1 - y_predict) )

where l_ce is the cross-entropy loss of one pixel, y_predict is the predicted probability and y_true denotes the actual label. BCE is the average cross-entropy loss over all pixels.
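A minimal sketch of this prediction head and the per-pixel cross-entropy; channel counts and sizes are illustrative:

import torch
import torch.nn as nn

# 1x1 convolution fuses features into two maps (background, mass) while
# keeping the spatial size; 64 input channels are an assumption.
head = nn.Conv2d(64, 2, kernel_size=1)

features = torch.randn(4, 64, 256, 256)   # output of the last MSRA-Block
logits = head(features)
probs = torch.softmax(logits, dim=1)      # probs[:, 1] is the mass probability

# Per-pixel cross-entropy averaged over all pixels; CrossEntropyLoss
# applies log-softmax internally, so it takes the raw logits.
target = torch.randint(0, 2, (4, 256, 256))  # 0 = background, 1 = mass
loss = nn.CrossEntropyLoss()(logits, target)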

It is obvious that BCE regards segmentation as a pixel-wise classification task and treats every pixel equally. However, the pixels representing the breast mass make up only a small proportion of the entire image; in other words, heavy data imbalance arises when BCE is applied as the objective function. To address this problem, Dice loss was introduced to measure the similarity between the predicted area and the ground truth, which can be denoted as:

L_Dice = 1 - 2|A ∩ B| / (|A| + |B|)

where A is the predicted probability map and B is the ground truth. | · | calculates the summation of pixel values, and |A ∩ B| represents the intersection of the prediction and the ground truth, which is named True Positive (TP).
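A direct translation of this Dice loss over a soft probability map, with a small epsilon added for numerical stability:

import torch

def dice_loss(pred, target, eps=1e-6):
    """1 - 2|A ∩ B| / (|A| + |B|): pred (A) is the predicted probability
    map, target (B) the binary ground truth."""
    inter = (pred * target).sum()          # soft |A ∩ B|, i.e. TP
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)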

Dice loss offers only a hidden, linked constraint between the intersection area within the mass region and the area outside the mass region, the latter indicated by |A|. The change of TP is the main factor affecting Dice loss. In our point of view, explicitly constraining the two parts can be more efficient for model optimization. Thus, a novel regularization term, named Inside-outside Loss (IOL), is designed to explicitly constrain both the intersection area and the area outside the mass for better optimization. The goal of IOL is to maximize the probabilities inside the ground truth area and minimize the probabilities outside it.

For evaluation, TP is the number of pixels correctly predicted in the mass region and TN is the number of correctly predicted pixels in the background; FP is the number of background pixels wrongly predicted as mass, and FN is the number of mass pixels wrongly predicted as background. The confusion matrix is shown in Table 1.

Indicated by Sensitivity, IOU and Dice coefficient, the overall performance of SERAN is relatively more outstanding.
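These confusion-matrix counts determine all the metrics used in the comparisons; a small reference sketch:

import torch

def segmentation_metrics(pred, gt):
    """Pixel-wise metrics from the confusion matrix; pred and gt are
    binary {0, 1} tensors of the same shape."""
    tp = ((pred == 1) & (gt == 1)).sum().item()
    tn = ((pred == 0) & (gt == 0)).sum().item()
    fp = ((pred == 1) & (gt == 0)).sum().item()
    fn = ((pred == 0) & (gt == 1)).sum().item()
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "iou":         tp / (tp + fp + fn),
        "dice":        2 * tp / (2 * tp + fp + fn),
    }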

In order to analyze the effectiveness of the multi-stream rotation aware mechanism, we expand the test set using the data augmentation method mentioned in Section II.B. The masses in the expanded test set may appear in different orientations due to the mirror and rotation operations. The experimental results are summarized in Table 3. It is obvious that both SEN and SERAN achieve much better performance than U-Net thanks to the effective encoder we designed. Benefiting from the multi-stream rotation aware mechanism, SERAN wins 8 out of 10 comparisons in the face of mass rotation. Although the performance of SERAN declines to some extent compared with the original test set, it remains at a relatively high level.
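A sketch of such a mirror-and-rotation expansion; the exact augmentation set of Section II.B is an assumption:

import torch

def expand_test_image(image):
    """Mirror and 90-degree-rotation variants of one test image
    (a C x H x W tensor)."""
    variants = [image,
                torch.flip(image, dims=[-1]),   # horizontal mirror
                torch.flip(image, dims=[-2])]   # vertical mirror
    variants += [torch.rot90(image, k, dims=[-2, -1]) for k in (1, 2, 3)]
    return variants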

In this part, the effectiveness of IOL is investigated. Row 4 of Table 2 shows the results achieved by SERAN trained with Dice loss plus IOL.

Compared with the performance of SERAN trained without IOL (row 3 of Table 2), IOL obviously improves SERAN: it achieves comparable or better results in 8 out of 10 comparisons, and the Accuracy, IOU and Dice coefficient of SERAN are significantly improved. According to the visual results shown in Fig. 10, IOL avoids misclassification in the background area and optimizes the network for better prediction, which are the leading causes of the improvement.
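The overall objective, referred to as (7) below, combines Dice loss with a λ-weighted IOL term. Since the exact IOL formulation is not recoverable from this copy, the following sketch encodes only its stated goal, pushing predicted probabilities toward 1 inside the ground truth and toward 0 outside:

import torch

def dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def inside_outside_loss(pred, target, eps=1e-6):
    """Hedged reconstruction of IOL: drive the mean predicted probability
    toward 1 inside the ground-truth mass and toward 0 outside it."""
    inside = (pred * target).sum() / (target.sum() + eps)
    outside = (pred * (1 - target)).sum() / ((1 - target).sum() + eps)
    return (1 - inside) + outside

def total_loss(pred, target, lam=0.1):
    # Combined objective, cf. (7): Dice loss plus a lambda-weighted IOL term.
    return dice_loss(pred, target) + lam * inside_outside_loss(pred, target)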

To determine the weight of IOL, we test several values of λ: 1, 0.1, 0.01 and 0.001. The experimental results are shown in Table 4 and the visual results in Fig. 11. The loss function used in this work is the combination of Dice loss and IOL described in (7). According to (4)-(6), the change of TP is the main factor affecting the loss (see Fig. 11(d)). When λ gets larger, IOL affects the loss function strongly; thus, SERAN tends to predict a mass larger than the ground truth to obtain the highest TP when trained with λ = 1 (as shown in Fig. 11(a)).

Table 5 and Table 6 show the comparison between the five baseline methods and SERAN. The comparison of sensitivity shows that all the methods can predict masses at accurate locations. Indicated by IOU and Dice coefficient, the mass predicted by SERAN has the largest overlap with the ground truth. The Accuracy indicator shows that SERAN makes better predictions for both mass and background than the other methods.

The most competitive baseline performance is achieved by the method proposed by Sun et al. [30], which can be attributed to its dense-upsampling and channel attention; the two mechanisms are effective at retaining the most useful features. However, the channel attention is implemented in a fully connected style, which is the leading cause of its high time cost. The method proposed by Li et al. [28] employs spatial attention to guide the prediction made by the decoder and achieves competitive results on all metrics. The method proposed in [29] has the lowest complexity among all the neural networks and performs well; its multi-scale context is effective for extracting features under different receptive fields.

All the neural networks run faster than the traditional artificial intelligence methods. This may be attributed to the efficiency of the PyTorch framework and the end-to-end style of neural networks. The method proposed in [6] uses a complex processing flow to detect and identify masses at each scale, consisting of morphological filtering, simple linear iterative clustering segmentation, feature extraction and classification. The method proposed in [16] uses watershed segmentation twice to find all the regions of interest: a k-means clustering algorithm reduces the number of regions found by the first watershed segmentation and provides the marker for the second watershed segmentation.

In the real world, it is easy for specialists to discover

Moreover, the visual results of all the cases in the test set have been reviewed by an experienced radiologist. In summary, our work achieves great improvement in breast mass segmentation and can assist radiologists to a certain extent.

In this paper, a Spatial Enhanced Rotation Aware Network (SERAN) is developed for breast mass segmentation in digital mammograms. Two critical components are proposed for effective feature extraction and prediction refinement. An encoder with spatial attention enhancement under the residual learning paradigm is designed for effective feature extraction: spatial attention maps are explicitly modeled to adjust the focus at every encoder stage, and residual connections are utilized to avoid gradient vanishing and achieve better convergence. To boost the robustness of SERAN to masses displayed in different orientations, a decoder using the multi-stream rotation aware mechanism is designed for feature fusion and prediction refinement; a 3 × 3 convolution and two asymmetric convolutions with 1 × 3 and 3 × 1 kernels are combined in parallel to boost rotation robustness. To avoid misclassification in the background area and optimize SERAN for better prediction, a novel regularization term named Inside-outside Loss is applied during training. Compared with the state of the art, SERAN achieves significant performance improvement for breast mass segmentation: a sensitivity of 87.7%, an IOU of 73.95% and a Dice coefficient of 84.3% on a representative subset of the DDSM dataset. In future work, we will focus on adapting SERAN to different types of medical images, such as ultrasound, CT and MRI. Moreover, we will try to replace 2D convolutions with 3D convolutions to adapt to 3D scenes, such as 3D CT and 3D MRI. Besides, we will extend the current system to different application scenes, such as brain cancer segmentation and prostate cancer segmentation.