Global Context and Enhanced Feature Guided Residual Refinement Network for 3D Cardiovascular Image Segmentation

As an important pre-processing step in clinical applications, automatic and accurate 3D cardiovascular image segmentation has attracted increasing attention. However, cardiovascular structures are highly diverse, the shapes of the blood pool and myocardium vary widely, and cardiac borders are ambiguous, all of which make the segmentation task very challenging. In this paper, we introduce a novel deep neural network, referred to as GCEFG-R2Net, that segments the blood pool and myocardium from three-dimensional cardiovascular images by fully exploiting the global context and the complementary information encoded in different feature extraction layers. To semantically locate the two kinds of regions, we design a global context pooling module that learns global context information from the deep features extracted by the last two layers. Instead of directly using or combining different levels of deep features, we develop an interactive feature aggregation strategy that enhances them by embedding a series of interactive feature aggregation modules. Using the enhanced features, a residual feature refining branch refines the side outputs in a top-down stream under the guidance of the global context features. Finally, the refined side outputs of different layers and the enhanced deep features are combined by a feature fusion module to generate the final segmentation result. Extensive experiments on two challenge datasets demonstrate that the proposed GCEFG-R2Net obtains appealing segmentation results for the blood pool and myocardium and performs better than other state-of-the-art methods.


I. INTRODUCTION
A large number of people around the world suffer from cardiovascular diseases each year. Therefore, timely diagnosis and treatment of cardiovascular disease is crucial [1]. Cardiovascular images give an intuitive and detailed visual presentation of the morphology of the blood pool and the surrounding myocardium. Segmenting the heart in cardiovascular images therefore plays a crucial role in cardiovascular disease diagnosis and treatment planning [2]-[4]. However, accomplishing this task manually is laborious, tedious and time-consuming, especially when medical resources are scarce. As a result, designing effective algorithms that accurately segment 3D cardiovascular images in an automatic manner is imperative.
The associate editor coordinating the review of this manuscript and approving it for publication was Vishal Srivastava.
In the past few decades, many methods have been proposed for segmenting the blood pool and the surrounding myocardium from cardiovascular images. In general, there are two main kinds of methods for this task. One prominent family focuses on multiple atlases and traditional deformable models [5]-[11], and the other is based on deep neural networks (DNNs) [12]-[16]. For the first kind, the high anatomical variation of different parts should be taken into consideration. For DNN-based methods, learning discriminative deep features is critical, and a sufficient amount of training data is also necessary to train an effective network. Although great success has been achieved by previously proposed methods, some challenging issues still significantly influence the performance of different models and hinder their practical application. For example, the diversity of cardiovascular structures is often very high (as shown in the red rectangles of Figure 1), the shapes of the blood pool and myocardium vary widely, and the cardiac borders are ambiguous (as shown in the yellow rectangles of Figure 1) since the contrast between cardiac and surrounding tissues is often very low. In order to boost the performance of existing 3D cardiovascular image segmentation methods, we propose a novel deep neural network (GCEFG-R2Net) which automatically segments the blood pool and myocardium from cardiovascular images more accurately by fully exploiting the global context and the complementary information encoded in different feature extraction layers.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To capture the heterogeneity of different parts of the cardiovascular image in a global manner, a global context pooling module is designed to learn image content context information from the deep features extracted by the last two layers. Instead of using the original deep features of different levels, we develop an interactive feature aggregation strategy to enhance them, which yields more effective multi-scale information. Then, a hierarchical residual feature refining branch uses the enhanced features to refine the side outputs in a top-down stream under the guidance of the global context features generated by the global context pooling module. Finally, the refined side outputs from each layer are fused by a feature fusion module to generate the final segmentation result. During the fusion process, the enhanced deep features are also leveraged to improve the final segmentation map. In a nutshell, the major contributions of this work are as follows:
• We propose a new deep neural network for blood pool and myocardium segmentation from 3D cardiovascular images;
• In order to capture the heterogeneity of different parts of the cardiovascular image, we design a context pooling module to learn the cardiovascular image content context information from deep features;
• An interactive feature aggregation strategy is introduced to enhance different levels of deep features, which aims to obtain more effective multi-scale information;
• A residual feature refining branch is designed for refining the segmentation result in a hierarchical and top-down manner. In addition, the learned global context features are used as guidance for fusing the side output of each layer to get the final result;
• Extensive experiments are conducted to validate the superiority of the proposed network when compared with other state-of-the-art methods.
The rest of this paper is arranged as follows. Related work on cardiovascular image segmentation is introduced in Section II. The proposed modules and the network construction are elaborated in Section III, the implementation details are given in Section IV, and the experimental results with analysis are shown in Section V. In Section VI, we draw the conclusion of this work.

II. RELATED WORK
Medical images provide abundant information to help disease diagnosis and treatment, and a large number of medical image segmentation methods have been designed during the past decades. In this work, we focus on automatic blood pool and myocardium segmentation from 3D cardiovascular images.
In earlier years, segmentation methods relied heavily on multiple atlases and traditional deformable models. In [2], an interactive method is developed to accurately segment the cardiac chambers and vessels. However, since this method needs to be performed interactively, it is very slow and laborious. By combining the active appearance model and the active shape model, Mitchell et al. [6] introduced a hybrid approach to separate the right ventricle and left ventricle automatically. Using a Markov random field, a mixed flow model is designed to segment the right ventricle as well as generate the target shape prior [11]. As a classic model for natural image segmentation, graph cut is also deployed for right ventricle segmentation [17] and myocardium segmentation [18]. Motivated by the prior knowledge from atlases and the 4D Markov random field, Lorenzo-Valdés et al. [19] utilized space-time background information for ventricle and myocardium segmentation from cardiac MR image volumes. Inspired by the shape discrepancy compensation principle, Liu et al. [20] found that the cross-constraint shape can be used to help segment the myocardium in delayed enhancement and T2-weighted images. In order to exploit multi-dimensional information [21], a 3D random field model [22] and multi-atlas as well as level sets [23] (Atlas) are also utilized. In [24], a random forest is used to learn context information, and the segmentation is obtained via an appearance prior. Although these traditional methods based on atlases and deformable models made remarkable progress in cardiovascular image segmentation, their results are still not satisfactory due to the limitations of the existing priors and extracted features.
Owing to their powerful feature representation learning capability, DNNs have greatly boosted performance in fields such as computer vision, natural language processing and biomedical data analysis [25], [26]. Therefore, DNN-based methods have also been put forward for cardiovascular image segmentation. In [27], a dilated CNN is used to demarcate the blood pool and myocardium, while 3D volumetric information is neglected. To tackle this issue, Xu et al. [28] combined a convolutional neural network (CNN) and a recurrent neural network (RNN) to detect and segment myocardial infarction areas, which considers the 3D volumetric structure information. In [29], Payer et al. proposed a pipeline of two fully convolutional networks for automatic multi-label whole heart segmentation from CT and MRI volumes (MLWHS), which learns from the relative positions among labels and focuses on anatomically feasible configurations. Considering that a fully convolutional neural network (FCN) can also obtain appealing performance for image segmentation, Qin et al. [30] proposed to use an FCN for ventricle and myocardium segmentation. In addition, the motion state of the heart can also be well estimated. In order to preserve the maximum information flow between different deep feature extraction layers, Yu et al. [15] added a densely-connected mechanism into their network, and extra auxiliary side paths are embedded to strengthen the gradient propagation as well as stabilize the learning process. Since there is plenty of complementary information contained in multiple views of 3D cardiac data, Zheng et al. [31] utilized asymmetrical 3D kernels and pooling to capture contextual information. In order to avoid the domain shift in biomedical image analysis, Dou et al. [32] proposed an unsupervised domain adaptation framework with adversarial learning for cross-modality biomedical image segmentation (UCMDA).
By combining hybrid pyramid pooling and dilated residual learning, Du et al. [16] proposed a multi-task framework for joint blood pool and myocardium segmentation. In [33], both up-sampling and down-sampling strategies are used for blood vessel and myocardium segmentation. Compared to traditional methods that rely on hand-crafted features or pre-defined prior models, methods based on deep learning now predominate in the field of medical image segmentation and obtain better performance. To this end, we also focus on deep learning and propose a deep neural network to segment the blood pool and myocardium from 3D cardiovascular images.

III. PROPOSED NETWORK
In this section, we illustrate the details of our proposed modules and the construction of the whole GCEFG-R2Net, which consists of four main components: a global context pooling module (GCPM), an interactive feature aggregation module (IFAM), a residual feature refining module (RFRM) and a deep feature guided feature fusion module (DFGFM). Figure 2 gives an overview of the proposed GCEFG-R2Net. In our network implementation, we use the 3D ResNeXt structure [34] as the feature extraction backbone and obtain five feature extraction layers. For simplicity, we denote the features extracted from the five layers as $F_1$, $F_2$, $F_3$, $F_4$ and $F_5$, respectively. Since the slices used in the experiments are often small, we first upsample all of the feature maps to the size of the original input slice in order to obtain an accurate final result. In the following sections, we elaborate each module of the proposed GCEFG-R2Net in detail.

A. GCPM
As can be seen from the example images in Figure 1, the spatial distribution of different parts of a slice is often scattered with varying shapes. Therefore, it is important to capture the global context information to help locate different parts in the whole slice. Considering that the higher layers of the backbone network contain abundant semantic and context information [35]-[37], we use $F_4$ and $F_5$ to learn the global context information, as shown in Figure 3. Firstly, we concatenate the upsampled $F_4$ and $F_5$. Then, a series of convolutions with a hybrid dilation rate is used to learn the global context features (GCF). The "hybrid dilation rate assembled convolution" can be regarded as a way to aggregate the input features locally. From Figure 3, we can see that the proposed GCPM is motivated by Atrous Spatial Pyramid Pooling (ASPP) [38]. However, our proposed GCPM differs significantly from ASPP in at least the following two aspects:
• Firstly, channel attention is embedded into the proposed GCPM to adaptively fuse multi-scale information.
• Secondly, feature channel selection and receptive field enlargement are performed simultaneously in our GCPM to exploit the global information for feature interaction.
In addition, the GCF learned by the GCPM is used as guidance for feature refining, which is introduced in a later section.
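To make the role of the dilation rate concrete, the following minimal sketch computes the spatial extent covered by a single dilated convolution kernel along one axis. The kernel size of 3 and the dilation rates (1, 2, 4) are hypothetical values chosen for illustration only; the dilation rates actually used in the GCPM are not stated here.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Extent (in voxels, along one axis) covered by a dilated kernel.

    A dilation rate d inserts d - 1 gaps between neighbouring kernel taps,
    so a kernel with k taps spans k + (k - 1) * (d - 1) voxels.
    """
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be positive")
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Hypothetical parallel branches, ASPP-style: same kernel, different dilation.
ASSUMED_KERNEL_SIZE = 3
ASSUMED_DILATIONS = (1, 2, 4)

branch_extents = [
    effective_kernel_size(ASSUMED_KERNEL_SIZE, d) for d in ASSUMED_DILATIONS
]
print(branch_extents)
```

Each branch keeps the same number of weights while covering a wider neighbourhood, which is why mixing several dilation rates gives multi-scale context at low cost.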

B. IFAM
In the feature extraction backbone network, different layers of features reflect different degrees of abstraction of the original image slice. The shallow layers extract most of the details, the top layers often contain sufficient semantic and global context information, and the middle layers often contain both semantic and detailed information. Exploiting this complementary information can enhance the representation capability of the features of each layer. Therefore, we design a series of interactive feature aggregation modules to aggregate the deep features in an interactive manner. Figure 4 gives a brief architecture of the IFAM corresponding to the i-th layer. As can be seen, except for the first and fifth layers, there are three input branches for each IFAM. Without loss of generality, the input of the IFAM corresponding to the i-th layer consists of the features $F_i$, $F_{i+1}$ and $F_{i-1}$. For each input, we apply an initial transformation that combines a convolution, a batch normalization and a ReLU operation, which reduces the channel number of the initial features. For feature aggregation, $F_i$, $F_{i+1}$ and $F_{i-1}$ interact with their corresponding layers of features. Finally, the three feature interaction branches are fused to obtain the enhanced features, i.e., $\hat{F}_i$, which are used for segmentation refinement.
For the first and fifth IFAMs, there are only two input branches, i.e., $F_5$ and $F_4$ for the first IFAM, and $F_1$ and $F_2$ for the fifth IFAM. The enhanced features $\hat{F}_5$ and $\hat{F}_1$ are learned in a similar way. In this manner, each IFAM performs feature crossing to mitigate the discrepancy between different layers of features. The common parts of consecutive feature layers are first extracted by element-wise multiplication, and the original features are then combined by element-wise addition to capture complementary information.
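The multiply-then-add interaction described above can be sketched on flat feature vectors as follows. This is only an illustration under stated assumptions: the function name `cross_enhance` is hypothetical, the convolution, batch normalization and ReLU steps are omitted, and fusing the branches by a plain sum is an assumption, since the exact fusion operator is not specified in the text.

```python
from typing import List, Sequence


def cross_enhance(center: Sequence[float],
                  neighbors: Sequence[Sequence[float]]) -> List[float]:
    """Enhance `center` features using one or two neighbouring layers.

    For every neighbour, the common part is taken as the element-wise
    product with `center`; the original `center` features are then added
    back element-wise. Summing the branches is an assumed fusion rule.
    """
    for nb in neighbors:
        if len(nb) != len(center):
            raise ValueError("all feature vectors must have the same length")
    enhanced = list(center)                      # start from the original features
    for nb in neighbors:
        for k, (c, n) in enumerate(zip(center, nb)):
            enhanced[k] += c * n                 # add the common part
    return enhanced


# A middle layer has two neighbours; the first and fifth IFAMs have only one.
f_i = [1.0, 2.0]
f_prev, f_next = [0.5, 0.0], [1.0, 1.0]
print(cross_enhance(f_i, [f_prev, f_next]))
print(cross_enhance(f_i, [f_next]))
```

Where a neighbour is zero the product vanishes and only the original feature survives, so the multiplication acts as a soft agreement mask between layers.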

C. RFRM
By applying a convolution operation to the enhanced feature maps $\hat{F}_5$, we obtain an initial segmentation result, i.e., $O_0$. However, the resolution of this initial segmentation map is very low, and some detailed information, such as the edges of different parts of the original image content, may be lost due to the successive pooling operations in the feature extraction layers. Therefore, we design an RFRM and embed it into the proposed network to refine the segmentation in a hierarchical manner. In each RFRM, residual feature learning is used for refining the stage-wise segmentation map. The motivation of the proposed RFRM lies in two points. Firstly, previous studies demonstrate that deep neural networks embedded with residual feature learning obtain better results in many computer vision problems than commonly used plain network blocks. Secondly, the gradient vanishing problem during training can be effectively alleviated and the training process can converge faster, which is especially valuable for medical image segmentation, where training samples are limited. Figure 5 presents the detailed structure of a specific RFRM. We embed four RFRMs in our GCEFG-R2Net. The t-th RFRM ($t = 1, 2, \cdots, 4$) has three inputs: the output of the (t − 1)-th RFRM, i.e., $O_{t-1}$, the enhanced feature maps $\hat{F}_{5-t}$, and the GCF learned by the GCPM. Mathematically, the residual $R_t$ learned by the t-th RFRM is formulated in Eq. (1), where $O_{t-1}$ denotes the output obtained from the (t − 1)-th RFRM, (·) represents a mapping process which consists of a set of convolution and ReLU operations, ⊗ represents element-wise multiplication of feature maps, and Cat is the channel-wise concatenation operation.
By adding $R_t$ to $O_{t-1}$, the output of the t-th RFRM is calculated as follows:

$$O_t = O_{t-1} + R_t \tag{2}$$

In addition, in order to improve the side output of each refining step during the training process, we add a supervision signal to each RFRM [39].
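The top-down refinement can be sketched as a simple loop in which every stage adds a learned residual to the previous output. The sketch below works on flat vectors; `refine_top_down`, `residual_fn` and `toy_residual` are hypothetical names, and the toy residual (feature times GCF) merely stands in for the learned convolutional mapping, whose exact form is not reproduced here.

```python
from typing import Callable, List, Sequence

Vector = List[float]
ResidualFn = Callable[[Sequence[float], Sequence[float], Sequence[float]], Vector]


def refine_top_down(o0: Sequence[float],
                    enhanced: Sequence[Sequence[float]],
                    gcf: Sequence[float],
                    residual_fn: ResidualFn) -> List[Vector]:
    """Run four refinement stages and return [O_0, O_1, ..., O_4].

    `enhanced` holds the enhanced features of layers 1..5 (index 0 = layer 1).
    Stage t uses the previous output, the enhanced features of layer 5 - t
    and the global context features, then applies O_t = O_{t-1} + R_t.
    """
    if len(enhanced) != 5:
        raise ValueError("expected enhanced features for five layers")
    outputs: List[Vector] = [list(o0)]
    for t in range(1, 5):
        features = enhanced[(5 - t) - 1]          # layer 5 - t, zero-based index
        previous = outputs[-1]
        residual = residual_fn(previous, features, gcf)
        outputs.append([p + r for p, r in zip(previous, residual)])
    return outputs


def toy_residual(previous, features, gcf):
    """Stand-in for the learned mapping: gate the features by the GCF."""
    return [f * g for f, g in zip(features, gcf)]


layers = [[1.0], [2.0], [3.0], [4.0], [5.0]]     # toy enhanced features
print(refine_top_down([0.0], layers, [0.5], toy_residual))
```

Because each stage only predicts a correction, a stage that learns a zero residual leaves the previous side output unchanged.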

D. DFGFM
Since different layers of features reflect different levels of abstraction of the original image slices, we develop a feature fusion module that fuses the segmentation maps generated by the different RFRMs to obtain the final segmentation result. In addition, considering that the information in the original images is also important for segmentation, we produce guiding deep features from the enhanced features to guide the final fusion process. The deep guided features DGF are obtained by Eq. (3), where the enhanced feature maps of the i-th layer are denoted as $\hat{F}_i$, $W$ and $b$ are the convolution parameters that need to be learned during the training process, and ReLU is the ReLU activation function [25]. Then, the final segmentation map $O$ is generated by Eq. (4), where $W$ and $b$ are also convolution parameters.

IV. IMPLEMENTATION DETAILS
In our experiments, we implement the proposed GCEFG-R2Net with the PyTorch framework and use 3D ResNeXt [34] as the backbone network for feature extraction. The mean square error (MSE) is used to compute the loss between the ground-truth $G$ and the outputs of the network. The final loss function combines two kinds of terms: $L_{mse}(O, G)$, the MSE between the final fused output and the ground-truth, and $L^{i}_{mse}(O_i, G)$, the MSE between the i-th layer-wise side output and the ground-truth, where $L_{mse}(\cdot, \cdot)$ is a function that computes the MSE between two segmentation maps. For a segmentation map $X$ and the ground-truth $G$ with $N$ voxels, the MSE is defined as follows:

$$L_{mse}(X, G) = \frac{1}{N} \sum_{n=1}^{N} (X_n - G_n)^2$$

Our network is trained in an end-to-end manner with the Adam algorithm, an initial learning rate of 0.001, the "poly" learning rate policy and a fixed batch size of 4, on a single Nvidia Titan V GPU with 12 GB of memory. The training data are also augmented to reduce over-fitting.
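As a rough illustration of the training objective and schedule, the sketch below computes a deeply supervised MSE loss and a "poly" learning rate on plain Python lists. Two points are assumptions rather than details taken from the text: the fused and side-output terms are summed with equal weights, and the poly exponent is set to the commonly used value 0.9. The function names are hypothetical.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean square error between two flattened segmentation maps."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(pred)


def total_loss(fused: Sequence[float],
               side_outputs: Sequence[Sequence[float]],
               target: Sequence[float]) -> float:
    """Fused-output MSE plus the MSE of every side output.

    Equal weighting of all terms is an assumption made for this sketch.
    """
    return mse(fused, target) + sum(mse(o, target) for o in side_outputs)


def poly_lr(base_lr: float, step: int, max_steps: int, power: float = 0.9) -> float:
    """'Poly' policy: base_lr * (1 - step / max_steps) ** power.

    power = 0.9 is a common default, assumed here.
    """
    if not 0 <= step <= max_steps:
        raise ValueError("step must lie in [0, max_steps]")
    return base_lr * (1.0 - step / max_steps) ** power


ground_truth = [0.0, 0.0]
print(total_loss([1.0, 0.0], [[0.0, 0.0], [1.0, 1.0]], ground_truth))
print(poly_lr(0.001, 0, 100), poly_lr(0.001, 50, 100), poly_lr(0.001, 100, 100))
```

The schedule starts at the initial learning rate of 0.001 and decays smoothly to zero at the last step.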

V. EXPERIMENTAL RESULTS
In this section, we report the segmentation results of the proposed network on two datasets including the 2016 HVSMR dataset [2] and the 2017 MM-WHS CT dataset [40]. In addition, we also compare our network with other state-of-the-art ones to validate its superiority.

A. DATASETS
The 2016 HVSMR dataset [2] aims to segment the myocardium and blood pool from cardiovascular MR images. It contains 3D cardiovascular MR images of patients, organized into 10 training sets and 10 testing sets. For different patients, the ...
The 2017 MM-WHS CT dataset [40] aims to evaluate algorithms that segment seven cardiac structures, i.e., the left/right ventricle blood cavity (LV/RV), left/right atrium blood cavity (LA/RA), myocardium of the left ventricle (LV-myo), ascending aorta (AO), and pulmonary artery (PA). Similar to [31], we randomly split the dataset into a training subset (16 subjects) and a testing subset (4 subjects), following the work in [32].
In Table 1, we present the details of the two datasets.

B. EXPERIMENTS ON THE 2016 HVSMR DATASET
Since the segmentation ground-truth of the testing sets is not available and the challenge submission system no longer works for online testing, we evaluate our network on the training sets with a leave-one-out setting. For each patient, the corresponding images are used for testing and the images of the other patients are used for training. Therefore, training and testing are repeated 10 times in total. Six indicators, namely the Dice coefficient (Dice), Jaccard coefficient (Jac), positive predictive value (PPV), sensitivity (Sens), specificity (Spec), and Hausdorff distance of boundaries (HD), are used to evaluate the performance of the different methods. Figure 6 shows the six indicators of the segmentation results for the 10 subjects in this dataset. As can be seen, the segmentation accuracy for the blood pool is much higher than that for the myocardium, which indicates that segmenting the myocardium is more difficult than segmenting the blood pool. This is also a challenging problem for previous segmentation methods, and is caused by the fact that the myocardium areas in medical images are often small, scattered and of varying shapes. As to the HD indicator, the score of the blood pool is much smaller than that of the myocardium. In most cases, our network obtains stable results for the blood pool across different patients.
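The five overlap-based indicators and the leave-one-out protocol can be sketched as follows, using their standard definitions on flattened binary masks. The function names are hypothetical, and the tiny masks at the end are made-up toy data rather than results from the dataset.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def confusion_counts(pred: Sequence[int], truth: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    tp = fp = fn = tn = 0
    for p, g in zip(pred, truth):
        if p and g:
            tp += 1
        elif p:
            fp += 1
        elif g:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def overlap_metrics(pred: Sequence[int], truth: Sequence[int]) -> Dict[str, float]:
    """Dice, Jaccard, PPV, sensitivity and specificity (NaN if undefined)."""
    tp, fp, fn, tn = confusion_counts(pred, truth)

    def ratio(num: float, den: float) -> float:
        return num / den if den else float("nan")

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "jaccard": ratio(tp, tp + fp + fn),
        "ppv": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


def leave_one_out(subjects: List[int]) -> Iterator[Tuple[List[int], int]]:
    """Yield (training subjects, held-out subject) for every subject."""
    for i, held_out in enumerate(subjects):
        yield subjects[:i] + subjects[i + 1:], held_out


print(overlap_metrics([1, 1, 0, 0], [1, 0, 1, 0]))   # toy masks
folds = list(leave_one_out(list(range(10))))          # 10 folds of 9 + 1
print(len(folds), folds[0])
```

With 10 subjects the generator produces exactly the 10 training and testing repetitions described above.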
In order to demonstrate that our proposed network obtains better segmentation results, we compare it with the traditional method Atlas [23] and a classical segmentation network, i.e., U-Net [13]. In addition, previous cardiac image segmentation networks including SSLLN [41], SDNet [42], HFA-Net [31] and DRHPPN [16] are also used for comparison. Table 2 reports the results of the different methods in terms of the different indicators, which demonstrate that our proposed network performs better than the other state-of-the-art methods.

C. EXPERIMENTS ON THE 2017 MM-WHS CT DATASET
For this dataset, the ground-truth segmentation maps for both training subjects and testing subjects are available. As a result, we train the network on the training subset and test it on the testing subset. Four indicators, namely the Dice coefficient (Dice), Jaccard coefficient (Jac), average surface distance (ADB) and Hausdorff distance of boundaries (HD), are used for evaluation. We compare the proposed GCEFG-R2Net with HFA-Net [31] as well as its baselines. In addition, we also compare the Dice indicator with those of [32] and [29], which can be obtained from [31]. The results of the different methods on this dataset are shown in Table 3, which also validates the efficacy of our proposed network.
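The two boundary-based indicators can be illustrated with a brute-force sketch on small sets of boundary points. The symmetric averaging used for the average surface distance is a common convention and is assumed here; the function names and the toy point sets are hypothetical, and real volumes would need boundary extraction and voxel spacing handling that this sketch omits.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _nearest_distance(p: Point, points: Sequence[Point]) -> float:
    """Euclidean distance from p to its closest point in `points`."""
    return min(math.dist(p, q) for q in points)


def hausdorff_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Hausdorff distance between two non-empty boundary point sets."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = max(_nearest_distance(p, b) for p in a)
    b_to_a = max(_nearest_distance(q, a) for q in b)
    return max(a_to_b, b_to_a)


def average_boundary_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Mean nearest-neighbour distance over both directions (assumed symmetric)."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    total = sum(_nearest_distance(p, b) for p in a)
    total += sum(_nearest_distance(q, a) for q in b)
    return total / (len(a) + len(b))


# Toy boundaries lying on one axis (made-up coordinates).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(hausdorff_distance(predicted, reference))
print(average_boundary_distance(predicted, reference))
```

The Hausdorff distance reports the single worst boundary mismatch, while the average distance summarizes typical boundary error, so the two are usually read together.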

D. ABLATION STUDIES
As mentioned in previous sections, there are two critical modules in our proposed network, i.e., the GCPM, which learns the global context information used to guide the feature refining, and the IFAM, which enhances layer-wise features in a cross-layer manner. In order to validate the influence of the two modules on the final results, we remove the GCPM and the IFAMs respectively from GCEFG-R2Net (denoted as noGCPM and noIFAM), perform experiments on the 2016 HVSMR dataset and report the results in Table 4. As can be seen, when the GCPM is removed, the results degrade significantly, which validates that the global context information is critical for the final segmentation. To give a more intuitive demonstration, we also show the visual segmentation results without the two modules in Figure 7. When the GCPM is removed, the segmentation refining process is similar to that of a traditional U-Net. Without the GCPM, the global context information cannot be well embedded for feature refining, which produces some missed regions in the final results, as shown by the row titled "noGCPM" in Figure 7. When the IFAM is removed, shallow features and deep features are not well aggregated, which induces incomplete segmentation results, especially for the detailed information, as shown by the row titled "noIFAM" in Figure 7.
In addition, in order to demonstrate the efficacy of different layers of our proposed network, we report the results of the different layers before the final DFGFM, which are denoted by L0, L1, L2, L3 and L4, respectively. The results are also shown in Table 4. As can be seen, by aggregating the segmentation results of the different layers, the DFGFM captures the complementary information of layer-wise features and side-output results to generate a better final segmentation map.

VI. CONCLUSION
In this paper, we introduce a deep neural network for segmenting the blood pool and myocardium from 3D cardiovascular images. In order to capture the global context information of the two kinds of regions, a global context pooling module is designed to learn the context information from the deep features extracted by the last two layers of the backbone network. Rather than directly using or combining different levels of deep features, we design an interactive feature aggregation strategy that enhances them by embedding a series of interactive feature aggregation modules. Extensive experiments as well as ablation analysis on two public datasets validate the efficacy of the proposed network, which obtains higher segmentation accuracy than the compared methods.
JINGJING LIU received the B.S. degree in clinical medicine and the master's degree in cardiosurgery from Tianjin Medical University, Tianjin, China, in 2012 and 2018, respectively. She is currently an Attending Doctor with the Department of Cardiac Surgery, Tianjin Chest Hospital, Tianjin. Her current research interests include congenital heart disease, structural heart disease, and translational medicine.
AO WEI received the B.S. degree in clinical medicine and the master's degree in cardiology from Tianjin Medical University, Tianjin, China, in 2011 and 2019, respectively. He is currently an Attending Doctor with the Department of Cardiology, Tianjin Chest Hospital, Tianjin. His current research interests include coronary atherosclerotic heart disease and arrhythmia.
ZHIGANG GUO received the B.S. degree in clinical medicine from Tianjin Medical University, Tianjin, China, in 1989, and the master's degree in cardiosurgery from the Peking Union Medical College Hospital, Beijing, China, in 2008. He is currently the President of Tianjin Chest Hospital, Tianjin, and also a Doctoral Supervisor of Tianjin Medical University and Tianjin University. His current research interests include coronary atherosclerotic heart disease, structural heart disease, and translational medicine.