SCA-Net: A Spatial and Channel Attention Network for Medical Image Segmentation

Automatic medical image segmentation is a critical tool for medical image analysis and disease treatment. In recent years, convolutional neural networks (CNNs) have played an important role in this field, and U-Net is one of the most famous fully convolutional network architectures among many kinds of CNNs for medical segmentation tasks. However, the CNNs based on U-Net used for medical image segmentation rely only on simple concatenation operation of multiscale features. The spatial and channel context information is easily missed. To capture the spatial and channel context information and improve the segmentation performance, in this paper, a spatial and channel attention network (SCA-Net) is proposed. SCA-Net presents two novel blocks: a spatial attention block and a channel attention block. The spatial attention block (SAB) combines the multiscale information from high-level and low-level stages to learn more representative spatial features, and the channel attention block (CAB) redistributes the channel feature responses to strengthen the most critical channel information while restraining the irrelevant channels. Compared with other state-of-the-art networks, our proposed framework obtained better segmentation performance in each of the three public datasets. The average Dice score improved from 88.79% to 92.92% for skin lesion segmentation, 94.02% to 98.25% for thyroid gland segmentation and 87.98% to 91.37% for pancreas segmentation compared with U-Net. Additionally, the Bland–Altman analysis showed that our network had better agreement between automatic and manually calculated areas in each task.


I. INTRODUCTION
Medical image segmentation is an essential tool for current clinical applications, such as computer-aided diagnosis/detection (CAD) or therapy plan systems (TPSs) [1], [2]. Automating medical segmentation can increase speed and efficiency and greatly reduce tedious and time-consuming work for doctors. In brief, the main goal of medical image segmentation is to distinguish the target region of interest from the background effectively. However, it is a challenging task for several reasons. First, medical images are collected by different acquisition facilities and usually have low imaging quality, leading to incomplete or excessive segmentation. Second, segmentation targets often vary widely in shape and scale from patient to patient, making it difficult to achieve consistently good performance. Additionally, some targets of interest have a wide range of orientations and positions in medical images, such as the pancreas in magnetic resonance imaging (MRI) [3], [4], [5].
In recent years, deep learning has become the mainstream research method in many fields, and deep convolutional neural networks (CNNs) have attracted much attention in medical image segmentation because of their good performance. Compared with traditional medical image segmentation methods, CNNs extract features automatically and learn them directly from the available dataset. Many state-of-the-art works have achieved notable performance in medical image segmentation tasks. However, CNNs still have some problems. First, the weight-sharing design of CNNs between the same input feature layer and output feature layer easily weakens their ability to learn complex textures and shapes. At the same time, the increased number of channels causes redundant computation and memory consumption. Second, as CNNs grow deeper, the network becomes hard to train, and the risk of vanishing gradients increases. Third, successive pooling operations cause important local and global context information to be lost. To improve the segmentation performance of convolution-based networks, several ideas have been incorporated into CNNs and have advanced the medical image segmentation field. For instance, U-Net [6], one of the most popular architectures in medical image segmentation, employed a symmetrical U-shaped structure with skip connections to concatenate multiscale feature maps from low-level and high-level layers. [7] employed dilated convolution with multiple dilation rates to extract contextual feature maps. [8] applied a fully connected CRF to label similar pixels consistently and to model the spatial contextual relationships within object classes.
Although these methods improve the feature extraction ability, the descriptive information for spatial features and channel features, which is very useful for medical image segmentation, is still limited.
To learn more locally relevant features and ignore irrelevant details in the feature maps, several variants of attention mechanisms have been proposed and have achieved better performance in computer vision tasks [9]–[13]. Attention U-Net [13] employed the attention gate (AG), which fuses multistage contextual information from the encoder and the decoder. AGs learn to suppress irrelevant responses in the background while focusing on target regions. SE-Net [14] employed the SE block, a kind of channel attention mechanism. It recalibrates the channel feature maps, assigning more weight to important feature channels and restraining irrelevant ones. The semantic segmentation methods proposed in [15] and [16] utilized similar ideas to enhance segmentation performance. [17] and [18] introduced an attention mechanism into the deep adversarial learning framework to capture more contextual information. The results obtained by these works demonstrate the effectiveness of attention modules for segmentation tasks.
Inspired by previous CNN-based work on medical image segmentation, this paper introduces multiscale spatial and channel information to achieve better segmentation performance for medical images. Based on the encoder-decoder architecture and the attention mechanism, we propose a spatial and channel attention network (SCA-Net) for medical image segmentation tasks, which is shown in Fig. 1. In SCA-Net, two novel attention blocks are constructed to capture the spatialwise and channelwise relationships. One is the spatial attention block (SAB), and the other is the channel attention block (CAB). The two blocks are integrated into the decoder. The SAB learns to focus on the target spatial regions and ignore the irrelevant background by reassigning the weight of each pixel. The CAB emphasizes the relationships among different channels, redistributing the critical channel information and suppressing unrelated channel information.
In summary, the main contributions of our work are as follows:
1) We propose two attention blocks: a spatial attention block (SAB) and a channel attention block (CAB). The SAB is designed to recalibrate the features of spatial context information, and the CAB is designed to highlight the relevant channels and restrain the irrelevant channels.
2) The proposed blocks SAB and CAB are integrated in a novel network named SCA-Net. The ablation study shows that the proposed blocks can effectively capture the features for the targets of interest to be segmented.
3) Our proposed method was verified on three different medical image segmentation tasks. The experimental results show that SCA-Net has superior performance.

II. RELATED WORK
A. CNNs FOR IMAGE SEGMENTATION
Convolution is the core operation of CNNs. Without manually selected features or prior knowledge, CNNs can learn features from acquired datasets automatically. In recent research, CNNs have been widely applied in different tasks [19]–[21]. By deepening the CNN layers and using ReLU and dropout, AlexNet achieved the best classification results at that time [22].
By replacing the last fully connected layers of classification CNNs with convolution layers, fully convolutional network (FCN) architectures have made significant progress for natural semantic segmentation, such as DeepLab for semantic image segmentation [23]. Subsequently, SegNet [24] proposed the encoder and decoder architecture, which employed CNN as the base unit and achieved state-of-the-art performance for semantic image segmentation. However, the CNN performance is still limited by position-invariant convolutional kernels, without attending to spatial and channel information, which are very important for segmenting objects.

B. MULTI-SCALE INFORMATION FUSION
In computer vision tasks, rich contextual features extracted from multiscale information help the network achieve better segmentation performance. Many methods using multiscale information have been proposed and applied to 2D and 3D medical image segmentation. Similar to [24], the structure of U-Net [6] adopts a symmetrical encoder and decoder architecture with a skip connection to perform 2D medical image segmentation. To date, many models have been proposed based on U-Net, including U-Net++ [25], DoubleU-Net [26], and DUNet [27]. They have been successfully applied to different 2D medical image segmentation tasks. At the same time, 3D U-Net [28] and V-Net [29] were proposed for 3D medical image segmentation tasks.
To compensate for feature details lost during downsampling, dilated convolution [30] with different rates enlarges the receptive field to capture more contextual information [31]–[34]. For instance, CE-Net designed a context extractor module to learn contextual semantic information [31], which generated more representative feature maps. [33] learned local geometric details using a cascaded pyramid architecture that fused dilated convolutions with different dilation rates. However, the sampling positions of a dilated convolution are not contiguous, so for small targets the loss of local detail can outweigh the gain in receptive field. Moreover, less attention has been paid to the interrelationship between spatial and channel characteristics.
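The non-contiguous sampling of dilated convolution can be made concrete by listing the input positions that a single kernel touches. The following is a minimal sketch, assuming a one-dimensional kernel of size 3; the helper names `dilated_taps` and `effective_size` are illustrative and do not come from the cited works.

```python
def dilated_taps(kernel_size, dilation, center=0):
    """Input offsets sampled by one dilated 1D kernel centred at `center`."""
    half = kernel_size // 2
    return [center + dilation * (i - half) for i in range(kernel_size)]


def effective_size(kernel_size, dilation):
    """Span covered by the kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


for d in (1, 2, 4):
    taps = dilated_taps(3, d)
    span = effective_size(3, d)
    # Positions inside the span that the kernel never reads.
    skipped = span - len(taps)
    print(f"dilation={d}: taps={taps}, span={span}, skipped={skipped}")
```

As the dilation rate grows, the span widens while the number of sampled positions stays fixed, so a target only a few pixels wide may fall between the taps.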

C. ATTENTION MECHANISM
The attention mechanism has proven to be an efficient method to enhance CNN performance [35]. It mimics the biological observation process of paying attention to more detailed information about the desired target and suppressing useless information.
[36] was the first to propose an attention mechanism for processing natural language translation. [37] relied on self-attention to capture the dependencies of inputs for machine translation. Meanwhile, the attention mechanism has been used in the field of computer vision [38]–[40]. [38] and [39] used spatial attention for image classification and image captioning. [40] employed a dual attention mechanism to capture global features for semantic segmentation.
The attention mechanism has also been adopted in many image segmentation tasks for better performance. Generally, attention modules are plug-and-play in CNNs and help them focus on more effective features of the target using spatial regions and channel interrelationships. Based on U-Net [6], the attention gate (AG) [13] focuses on the salient shape and size of the target through multiscale information. SE-Net [41] employed the squeeze-and-excitation (SE) block to recalibrate relevant channel feature maps and suppress irrelevant features. CBAM [42] emphasized the meaningful features along the spatial and channel dimensions, enhancing the feature representation of key regions related to the target. [43] designed an autofocus attention layer for semantic segmentation. It employed multiple parallel attention branches with different receptive field scales to focus on the optimal scales. However, multiple branches increase the complexity of models and the difficulty of training. Inspired by previous methods, we hypothesize that the effective use of spatial information and channel-dependent features can improve the segmentation performance of our network.

III. METHOD
Based on previous works, we use the effective encoder-decoder architecture as our backbone. As illustrated in Fig. 1, the proposed architecture has three major components: the residual block, the spatial attention block (SAB) and the channel attention block (CAB). The encoder transforms the input image into multidimensional feature maps and extracts the segmentation information, and the decoder generates spatial feature maps by aggregating multiscale information and reweights the channels of the feature maps.
In the encoding stage, we use the residual block to retain more original information and extract the feature maps. In the decoding stage, the SAB redistributes the spatial pixel weights by aggregating pooled feature information from high-level and low-level stages. The CAB exploits the channel features, using global average pooling and global max pooling to capture more channel contextual information. It reweights every channel in relation to its neighbors to highlight more important channel information. The details of these modules are described below.

A. RESIDUAL BLOCK
With increasing network depth, a model generally gains representational power. However, greater depth also increases the risk of gradient degradation and explosion. To solve these problems, [44] proposed the residual learning network, which employed the residual connection to ease network training and keep more learnable features.
Inspired by the residual learning framework, we use two convolution blocks and one convolution block to generate the multidimensional feature maps, and a residual connection, which uses a convolution to adapt the number of channels, is employed to preserve the original feature information. With batch normalization (BN), a small batch size may degrade training and decrease network performance. Thus, we use group normalization [45] instead of BN in our entire network. The residual block reduces the risk of vanishing gradients and accelerates network convergence. The residual block used in this paper is shown in Fig. 2.
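The reason group normalization is insensitive to batch size can be illustrated with a short NumPy sketch. This is not the implementation used in the paper: the function name `group_norm`, the group count and the omission of the learnable scale and shift are all assumptions made for illustration. The point is that the mean and variance are computed per sample and per channel group, never across the batch.

```python
import numpy as np


def group_norm(x, num_groups, eps=1e-5):
    """Normalize x of shape (N, C, H, W) within channel groups of each sample.

    The learnable scale and shift of a full layer are omitted in this sketch.
    """
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    grouped = x.reshape(n, num_groups, -1)       # (N, G, C/G * H * W)
    mean = grouped.mean(axis=2, keepdims=True)   # one mean per sample and group
    var = grouped.var(axis=2, keepdims=True)
    normed = (grouped - mean) / np.sqrt(var + eps)
    return normed.reshape(n, c, h, w)


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 8, 4, 4))
y = group_norm(x, num_groups=4)

# Normalizing the first sample alone gives the same result as inside the batch.
y_single = group_norm(x[:1], num_groups=4)
print(np.allclose(y[:1], y_single))
```

Because no statistic depends on the other samples in the batch, the output for one image is identical whether the batch holds one image or sixteen.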

B. SPATIAL ATTENTION BLOCK
Previous works [31], [43] show that a deep convolutional network with atrous convolutional blocks and multikernel branches can effectively extract contextual features from images. However, using these blocks consumes considerable memory and increases the complexity of the model. To use multiscale contextual information, Attention-UNet [13] utilized the attention gate (AG) to capture spatial features from multiscale information. Motivated by these methods, we design the SAB to fuse adjacent high-level and low-level spatial feature maps from multiple stages. By extracting the relationships between pixels, the SAB can focus on meaningful spatial features and highlight prospective information.
The SAB is shown in Fig. 3. Let $X_l \in \mathbb{R}^{C \times H \times W}$ denote the low-level feature input from the encoder, where $C$ denotes the number of input channels and $H$ and $W$ indicate the height and width of the input, respectively. Let $X_h$ denote the input of high-level features, which are upsampled from the previous decoder layer. Before this upsampling, the low-level features have a higher spatial resolution than the high-level features. First, we concatenate $X_l$ and $X_h$ along the channel dimension, then feed the result into average-pooling and max-pooling functions, each of which produces a single-channel map of shape $1 \times H \times W$, and concatenate the two maps along the channel dimension. One convolution kernel with a single output channel is employed to fuse the spatial features. An activation function is then applied to obtain a spatialwise statistic $\alpha$ of size $1 \times H \times W$. To calibrate the spatial feature maps, the features are subsequently multiplied by $\alpha$. To reuse the input features, we employ a residual connection in which the features are compressed by a convolution with $C'$ output channels. The output combines the calibrated and the compressed features, where the calibration uses the elementwise product. The number of output channels $C'$ depends on the stage: it is 128, 64, 32 and 16 for the different stages.
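The spatial attention step can be sketched in a few lines of NumPy. This is only one plausible reading of the description, not the authors' implementation: the function name `spatial_attention`, the sigmoid activation, the reduction of the fusion convolution to two scalar weights (a 1x1 kernel) and the choice to apply the attention map to the concatenated features are all assumptions, and the residual connection with its channel-compressing convolution is left out.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def spatial_attention(low, high, w_avg, w_max, bias=0.0):
    """Weight every pixel of the concatenated features with one attention value.

    low, high: arrays of shape (C, H, W) with matching H and W.
    w_avg, w_max, bias: scalars of a hypothetical 1x1 fusion convolution.
    """
    fused = np.concatenate([low, high], axis=0)     # (2C, H, W)
    avg_map = fused.mean(axis=0, keepdims=True)     # (1, H, W), pooled over channels
    max_map = fused.max(axis=0, keepdims=True)      # (1, H, W), pooled over channels
    attn = sigmoid(w_avg * avg_map + w_max * max_map + bias)
    return attn * fused, attn                       # attn broadcasts over channels


rng = np.random.default_rng(0)
low = rng.normal(size=(16, 8, 8))    # encoder (low-level) features
high = rng.normal(size=(16, 8, 8))   # upsampled decoder (high-level) features
out, attn = spatial_attention(low, high, w_avg=1.0, w_max=1.0)
print(out.shape, attn.shape)
```

Every channel at a given pixel is scaled by the same value in (0, 1), so the block can only re-weight locations, which is what distinguishes it from the channel attention that follows.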

C. CHANNEL ATTENTION BLOCK
The spatial feature maps from SAB contain considerable spatial interpixel information, as shown in Fig. 3. However, the output from SAB still contains unutilized channel feature information. To exploit critical features and suppress useless ones, we use CAB to redistribute the channel feature responses and strengthen important channel features provided by SAB. The details of CAB are shown in Fig. 4. SE-Net [41] shows the effectiveness of the squeeze-and-excitation block, which models the interchannel relationship. However, it only uses global average-pooled information. In contrast to the SE block, we additionally use global max-pooled information, which preserves more channel contextual information. Taking the SAB output $X_s$ with the shape of $C \times H \times W$ as input, global average pooling and global max pooling are applied separately over the spatial dimensions of each channel to obtain global channel information with the shape of $C \times 1 \times 1$. Inspired by ECA-Net [35], CAB employs a one-dimensional convolution with a kernel size of $k$ to preliminarily capture the nonlinear cross-channel interaction. To decrease the parameters and complexity, the weights of the convolution kernels are shared. The obtained channel information is then fused, and the result is fed into the activation function to obtain the channel weights $\omega$ with the shape of $C \times 1 \times 1$. Finally, the output of our channel attention module is
$$Y = \omega \otimes X_s,$$
where $\otimes$ denotes channelwise multiplication. The shape of $Y$ is $C \times H \times W$.
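The channel attention step can be sketched in NumPy as follows. Several details are assumptions rather than facts from the paper: the function names, the kernel size of 3 and its values, fusion of the two branches by addition, the sigmoid activation, and the use of one shared kernel for both pooled branches.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conv1d_same(v, kernel):
    """1D cross-correlation over the channel axis with zero padding."""
    k = len(kernel)
    padded = np.pad(v, k // 2)  # zero padding on both ends (odd k assumed)
    return np.array([np.dot(padded[i:i + k], kernel) for i in range(len(v))])


def channel_attention(x, kernel):
    """x: (C, H, W); kernel: odd-length 1D weights shared by both pooled branches."""
    avg_desc = x.mean(axis=(1, 2))   # (C,), global average pooling
    max_desc = x.max(axis=(1, 2))    # (C,), global max pooling
    # Each channel weight depends on the channel and its neighbours.
    weights = sigmoid(conv1d_same(avg_desc, kernel) + conv1d_same(max_desc, kernel))
    return x * weights[:, None, None], weights


rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8, 8))
out, weights = channel_attention(x, kernel=np.array([0.25, 0.5, 0.25]))
print(out.shape, weights.shape)
```

The one-dimensional convolution keeps the parameter count at the kernel size, independent of the number of channels, which is the main saving over the fully connected layers of an SE block.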

D. LOSS FUNCTION
Our proposed framework is trained end to end. In our medical image segmentation tasks, the network must accurately predict the class of each pixel. In recent years, the cross-entropy loss function has been broadly used in the medical image segmentation field. However, segmentation targets often vary in scale and orientation within the region of interest, particularly the pancreas and skin lesions. Accordingly, we used the soft Dice loss function to alleviate this problem. The soft Dice loss is computed directly on the predicted probability maps rather than on a binary mask obtained by thresholding them. We used it in both training and validation. It is described as:
$$L_{dice} = 1 - \frac{2\sum_i g_i p_i}{\sum_i g_i + \sum_i p_i},$$
where $g_i$ denotes the ground truth point values and $p_i$ denotes the predicted probability point values.
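A minimal pure-Python sketch of a soft Dice loss is given below to show how it operates on probabilities without thresholding. The function name `soft_dice_loss` and the `smooth` constant are assumptions added for illustration; the exact variant used in the experiments (for example, squared terms in the denominator) may differ.

```python
def soft_dice_loss(probs, targets, smooth=1e-6):
    """1 - Dice computed directly on probabilities (no thresholding).

    probs: predicted foreground probabilities in [0, 1].
    targets: ground truth labels (0 or 1), same length as probs.
    smooth: small constant (an assumption) that avoids division by zero.
    """
    if len(probs) != len(targets):
        raise ValueError("probs and targets must have the same length")
    intersection = sum(p * g for p, g in zip(probs, targets))
    total = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (total + smooth)


labels = [1, 0, 1, 0]
perfect = soft_dice_loss([1.0, 0.0, 1.0, 0.0], labels)
uncertain = soft_dice_loss([0.6, 0.4, 0.6, 0.4], labels)
print(perfect, uncertain)
```

Both predictions would yield the same binary mask at a threshold of 0.5, yet the uncertain one is penalized, which gives the optimizer a gradient toward more confident predictions.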

IV. EXPERIMENTS
A. DATASET
To assess the effectiveness of the proposed method, we applied our network to three medical image segmentation tasks: skin lesion segmentation from ISIC 2018, thyroid gland segmentation, and pancreas segmentation. Each task has its own challenges, and samples from the three datasets are shown in Fig. 5.

B. IMPLEMENTATION DETAILS
In our experiments, the input images were resized to a uniform size and normalized by the mean value and standard deviation. Fivefold cross-validation was employed to assess the performance of the proposed model. The dataset was randomly split at ratios of 70%, 10% and 20% for training, validation and testing, respectively. To reduce the risk of overfitting, we randomly rotated the training images within a fixed angle range, which increased the number of training images.
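The 70%/10%/20% random split can be sketched with the standard library alone. The function name `split_indices`, the seed and the rule that the test subset absorbs any rounding remainder are assumptions; the cross-validation procedure is not covered by this sketch.

```python
import random


def split_indices(n_items, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle indices and cut them into train/validation/test subsets."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)   # seeded for reproducibility
    n_train = int(ratios[0] * n_items)
    n_val = int(ratios[1] * n_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]       # remainder absorbs rounding
    return train, val, test


# 2594 is the ISIC 2018 image count quoted in the text.
train, val, test = split_indices(2594)
print(len(train), len(val), len(test))
```

Splitting indices rather than files keeps each image and its ground truth mask paired, and fixing the seed makes the partition repeatable across the compared networks.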
Our framework was implemented on the PyTorch platform. The training batch size was 16, and the Adaptive Moment Estimation (Adam) optimizer was employed to train the network with a fixed initial learning rate and weight decay in all experiments. For each task, we trained the network for 300 epochs. All experiments were run on one NVIDIA Tesla P100 GPU with 16 GB of memory. The soft Dice loss function was used to train our network. During validation, we saved the model with the smallest validation loss, which was then evaluated on the test dataset.

C. EVALUATION METHODS
To quantitatively evaluate the segmentation performance of the networks, we used the following evaluation metrics:
$$Dice = \frac{2|A \cap B|}{|A| + |B|}$$
$$IoU = \frac{|A \cap B|}{|A \cup B|}$$
$$ASSD = \frac{\sum_{a \in S(A)} d(a, S(B)) + \sum_{b \in S(B)} d(b, S(A))}{|S(A)| + |S(B)|}$$
$$RAVD = \frac{|A| - |B|}{|B|}$$
where $A$ denotes the region of the predicted segmentation map and $B$ denotes the ground truth binary image. $S(A)$ denotes the set of segmentation boundary points, and $S(B)$ is defined as the set of ground truth boundary points. $d(v, S(A))$ denotes the shortest Euclidean distance between point $v$ and all points of $S(A)$. Additionally, the Bland–Altman plot, which is a commonly used method for analyzing the agreement of two measurement techniques in medical statistics, is applied to visualize the potential bias between the areas segmented automatically and manually.
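The four metrics can be illustrated on a toy example in which each mask is a set of (row, column) pixel coordinates. This sketch uses the common textbook definitions, and several choices are assumptions: the helper names, 4-connectivity for boundary extraction, the brute-force nearest-point search and the signed form of RAVD (chosen because negative values are reported in the results). Empty masks are not handled.

```python
import math


def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))


def iou(a, b):
    return len(a & b) / len(a | b)


def ravd(a, b):
    """Signed relative area difference; negative when the prediction is smaller."""
    return (len(a) - len(b)) / len(b)


def boundary(mask):
    """Pixels of `mask` with at least one 4-neighbour outside the mask."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {(r, c) for (r, c) in mask
            if any((r + dr, c + dc) not in mask for dr, dc in steps)}


def assd(a, b):
    """Average symmetric surface distance between the two boundaries."""
    sa, sb = boundary(a), boundary(b)

    def nearest(p, pts):
        return min(math.dist(p, q) for q in pts)

    total = sum(nearest(p, sb) for p in sa) + sum(nearest(q, sa) for q in sb)
    return total / (len(sa) + len(sb))


def square(top, left, size):
    return {(r, c) for r in range(top, top + size) for c in range(left, left + size)}


pred = square(0, 0, 4)    # 4 x 4 predicted region
truth = square(0, 1, 4)   # same square shifted one pixel to the right
print(dice(pred, truth), iou(pred, truth), ravd(pred, truth), assd(pred, truth))
```

In this example the two regions have equal area, so RAVD is zero even though Dice and IoU are well below one, which shows why an area-based metric is reported alongside the overlap and boundary metrics.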

D. ABLATION ANALYSIS
To verify the validity of SAB and CAB in our proposed SCA-Net, we evaluated the proposed modules by ablation analysis. The performance of each module was tested by segmenting skin lesions from the ISIC 2018 dataset. As our backbone, we replaced all convolutional layers in U-Net [6] with the residual block.
In the next ablation experiment, the residual block is used in the encoder path, and the decoder path integrates the SAB and CAB to extract the feature information from feature maps separately. The skip connection is used for concatenating features between the encoder and the decoder as implemented in the U-Net architecture [6].
The results of the quantitative comparison of these methods are shown in Table 1. U-Net with the residual block serves as the backbone. For the skin lesion segmentation task, adding SAB or CAB individually improves the performance of the backbone, and the full SCA-Net improves it further. Compared with the backbone, our proposed network improved the average Dice from 0.8944 to 0.9292. The visual segmentation results are shown in Fig. 6. They show that SAB highlights the target spatial region and that CAB pays attention to the edge information. Our SCA-Net achieves a better segmentation result.

E. SKIN LESION SEGMENTATION
To assess the performance of our proposed SCA-Net, we first evaluated it on the skin lesion segmentation dataset from ISIC 2018. The dataset contains 2594 images with their ground truth [46], [47]. Skin lesion boundaries vary in scale, shape and color, so automated segmentation methods must be robust to these variations [48].
We compare our method with other state-of-the-art networks. The comparison was made with seven existing networks: U-Net [6], ResUNet [50], U-Net++ [25], CE-Net [31], Attention-UNet [13], FCA-Net [17] and Singh et al. [18]. All of them were used with their original implementations, and the soft Dice loss function was used uniformly.
The detailed results are displayed in Table 2. We calculated the mean and standard deviation of the four assessed metrics in all experiments. Fig. 7 shows the visual segmentation results, in which our framework outperforms the other state-of-the-art methods in skin lesion segmentation. Our SCA-Net achieved a Dice score of 0.9292, an IoU score of 0.8730, an ASSD of 0.5079 and an RAVD of -0.0061 for skin lesion segmentation. U-Net [6] has 20.96 M trainable parameters, whereas our network has only 13.36 M; U-Net is therefore the more complex model, yet our network performs better. In the sample images, other state-of-the-art networks produce missegmentation due to color and hair interference. Fig. 8 depicts the Bland–Altman plots of the differences between the ground truth areas and the automatically segmented areas. Compared with Singh et al. [18], our proposed method has a lower average deviation, which indicates that our model is more robust.
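The quantities read from a Bland–Altman plot are the bias (mean difference between the two measurements) and the limits of agreement (bias plus or minus 1.96 standard deviations of the differences). The sketch below computes them for paired areas; the function name `bland_altman` and the area values are made up for illustration and are not results from the experiments.

```python
import statistics


def bland_altman(auto_areas, manual_areas):
    """Return bias and 95% limits of agreement for paired measurements."""
    if len(auto_areas) != len(manual_areas):
        raise ValueError("both lists must have the same length")
    diffs = [a - m for a, m in zip(auto_areas, manual_areas)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Made-up areas in pixels, for illustration only.
auto = [105.0, 98.0, 120.0, 87.0, 101.0]
manual = [100.0, 100.0, 118.0, 90.0, 100.0]
bias, lower, upper = bland_altman(auto, manual)
print(f"bias={bias:.2f}, limits=({lower:.2f}, {upper:.2f})")
```

A bias close to zero indicates no systematic over- or undersegmentation, and narrower limits indicate smaller dispersion between the automatic and manual areas.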

F. THYROID GLAND SEGMENTATION
The second evaluation task is thyroid gland segmentation [49]. The dataset consists of sixteen 3D volumes and their matching ground truth. To match our network input format, the 3D volumes and their corresponding ground truth were split into 4762 individual slices of a fixed shape. Slices that the sonographer had left unmarked in the ground truth were removed from the dataset. Finally, 3999 images and the corresponding ground truth were retained. The main challenge of this task is the diversity of thyroid tissue size and morphology in the thyroid ultrasound images. The complexity of the peripheral tissue also affects the segmentation performance.
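The slice preparation described above can be sketched with NumPy. The function name `labelled_slices`, the choice of the first axis as the slicing direction and the rule that an all-zero mask counts as unmarked are assumptions made for illustration.

```python
import numpy as np


def labelled_slices(volume, mask):
    """Split (D, H, W) arrays into 2D slices, keeping only annotated ones."""
    if volume.shape != mask.shape:
        raise ValueError("volume and mask must have the same shape")
    # True for every slice whose mask contains at least one labelled pixel.
    keep = mask.reshape(mask.shape[0], -1).any(axis=1)
    return volume[keep], mask[keep]


# Toy volume with five slices, two of which carry annotations.
volume = np.zeros((5, 4, 4), dtype=np.float32)
mask = np.zeros((5, 4, 4), dtype=bool)
mask[1, 1:3, 1:3] = True
mask[3, 0, 0] = True
images, labels = labelled_slices(volume, mask)
print(images.shape[0], "of", volume.shape[0], "slices kept")
```

Filtering the image and mask arrays with the same boolean index keeps every retained slice aligned with its ground truth.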
The results in Table 3 show that our network performs well and is more effective than other state-of-the-art networks. Our proposed network outperformed U-Net, with a Dice score of 0.9825, an IoU score of 0.9661, an ASSD of 0.0508 and an RAVD of 0.0021. U-Net can segment the general outline of thyroid glands. However, it has difficulty segmenting both blurred and prominent edges. ResUNet performs better than U-Net because the residual connection enhances the segmentation ability. CE-Net, U-Net++, FCA-Net and Singh et al. show slight oversegmentation or undersegmentation. Compared with other networks, our model can segment the details of the thyroid edge. We show some samples of segmentation results for visual comparison in Fig. 9. In Fig. 10, all automatic methods for the thyroid gland segmentation task are consistent with manual segmentation, and our proposed method performs better, with a lower average deviation and smaller dispersion.

G. PANCREAS SEGMENTATION
Pancreas segmentation is the last experimental task. This dataset comes from the National Institutes of Health Clinical Center and consists of 82 abdominal contrast-enhanced 3D CT scans from 53 male and 27 female subjects. The ground truth pancreas masks were manually segmented slice by slice by a medical student and inspected by an experienced radiologist. The anatomical structure of the pancreas is complex, and it is mainly located in the posterior peritoneum, with very high variability in shape and volume among different slices. It is surrounded by adjacent tissues that appear close to it in CT images, which blurs the segmentation boundaries. Together with the noise of the CT images themselves, partial volume effects and the influence of tissue motion, this makes pancreas segmentation a very challenging problem.
The results are displayed in Table 4. Our network performs better than other state-of-the-art networks. The model obtained the best Dice score of 0.9137, IoU score of 0.8530, ASSD of 0.3079 and RAVD of -0.0069. Samples of segmentation results for visual comparison are illustrated in Fig. 11, and the Bland–Altman plots of these methods are presented in Fig. 12. In the segmentation samples, we find that SCA-Net is slightly worse at complex boundary segmentation than CE-Net and Singh et al., which have more parameters and fit the segmentation boundaries better. However, they are more complex than our network, and our model excels at focusing on specific target areas. Although our SCA-Net has a slightly wider confidence interval than FCA-Net in Fig. 12, the difference is small, and our proposed network has a lower bias.

V. DISCUSSION
For medical image segmentation tasks, better segmentation results help clinicians make a sound preclinical diagnosis and assist them in clinical treatment. The variety of shapes, sizes and locations of targets such as skin lesions requires the network to be highly robust. Earlier CNN-based methods produced many channel feature maps and preserved important features by relying on simple concatenation operations. However, the relevant information between multiscale spatial and channel features is not utilized efficiently. Attention mechanisms select the relevant feature maps, which improves the performance of segmentation tasks. Thus, we designed a novel framework for medical image segmentation. The SAB connects the high- and low-level information from multiple stages to produce more representative contextual features. Additionally, the CAB redistributes the channel feature responses and strengthens important channel features.
To further verify the validity and robustness of the model, we conducted tests in three different medical image domains: RGB images, CT slices and ultrasound images. Compared with state-of-the-art networks, our SCA-Net shows a significant improvement on three representative datasets, which indicates that SCA-Net performs well across different medical image segmentation tasks. In the future, we are interested in applying our network to 3D data.
We also find that our SCA-Net outperforms other networks in the thyroid gland segmentation task, but the segmentation differences are not significant. We believe the reason is that the boundary and the shape of the thyroid gland vary little and that its location is similarly distributed across images. Our proposed network can discern the boundary effectively. In contrast to the skin lesion images, the ultrasound images are grayscale, which may make it easier to learn the characteristics of the thyroid gland. In the pancreas segmentation task, SCA-Net scored significantly higher than the other networks, showing more effective segmentation. Compared with other state-of-the-art methods, SCA-Net has fewer parameters and higher efficiency.

VI. CONCLUSION
Medical image segmentation tasks are crucial for clinical analysis and diagnosis. Due to the large variation in shape and texture of segmented targets, higher demands are placed on the robustness and performance of medical image segmentation networks. We introduced a spatial and channel attention network (SCA-Net) in this study, aiming to enhance the segmentation performance of medical image segmentation methods. Specifically, we design the SAB to consider the multiscale spatial information and the CAB to recalibrate the channel information. We train our SCA-Net, and the result demonstrates the superiority of our method in different tasks, including skin lesion segmentation, thyroid gland segmentation and pancreas segmentation. Our model can be used in a new application by fine-tuning using a new dataset and the manual ground truth.
In this paper, we conducted three experiments to verify the effectiveness of our network on 2D medical images. In possible future work, we will develop an extension to process 3D data.