Residual Convolutional Neural Network for Cardiac Image Segmentation and Heart Disease Diagnosis

Deep learning (DL) has been widely used in biomedical image segmentation and automatic disease diagnosis, leading to state-of-the-art performance. However, automated cardiac disease diagnosis relies heavily on cardiac segmentation maps from cardiac magnetic resonance (CMR), and most current DL segmentation methods, such as 2D convolution on planes and 3D convolution, are not fully applicable to CMR because they either lose spatial structure information or are hampered by the large gap between slices. To better exploit the spatial aspects of CMR data and improve cardiac segmentation accuracy, we propose a new DL segmentation structure, which consists of a residual convolution neural network for compressing the intra-slice information and a bidirectional-convolutional long short term memory (Bi-CLSTM) for leveraging the inter-slice contexts. Moreover, automatic disease diagnosis is conducted using the segmentation maps. Experimental results on the automatic cardiac diagnosis challenge (ACDC) show that our cardiac segmentation structure and disease diagnosis method achieve promising results and that the approach can be extended to computer-aided diagnosis.


I. INTRODUCTION
Cardiac magnetic resonance (CMR) is the gold standard method for the assessment of cardiac structure and function [1]. Extracting left and right ventricular ejection fractions (EF), stroke volumes (SV) and left ventricle mass from CMR is usually the primary step for cardiac disease diagnosis. However, most current heart segmentation methods process 3D data in a slice-by-slice fashion, and as a result they lose the spatial structure information in the original data. We therefore combine 2D and temporal information to compensate for the loss of spatial information in the 2D convolution network.
In this study, we propose a new DL segmentation structure, which is constituted by a residual convolution neural network for compressing the intra-slice information, and a Bi-CLSTM [2] for leveraging the inter-slice contexts.
Moreover, we conducted automatic disease diagnosis from the segmentation maps. Experimental results on the automatic cardiac diagnosis challenge show that our cardiac segmentation structure and disease diagnosis achieved promising results. In summary, our contributions are:
• The intra-slice information is compressed by a convolutional neural network. U-Net is used as a submodule of our improved framework [3]. To mitigate the vanishing/exploding gradient problem, we use residual connections in our network structure [4]. At the same time, we use dilated convolution instead of the max-pooling operation to improve the segmentation accuracy [5].
• Bi-CLSTM [2] is used to learn spatial information: it assembles the intra-slice contexts into a 3D context and extracts 3D context information from it. For a set of 3D cardiac data, adjacent frames are similar, so the temporal relationship in Bi-CLSTM is used to model the inter-slice contexts.
• Cardiac diseases can be automatically diagnosed from the segmentation maps. Our approach combines automatic cardiac segmentation and disease diagnosis in one complete framework, as shown in Fig. 1.

FIGURE 1. System overview of the proposed method for automated cardiac segmentation and diagnosis. The pipeline involves: (i) The region of interest (ROI) is extracted from the CMR data, and the 3D data is then processed into 2D slices. (ii) The 2D slices are fed into the residual convolution neural network (CNN) to exploit the intra-slice information (Fig. 2 shows the structure of the residual convolution network in detail). (iii) The results learned by the residual convolution neural network are input into two CLSTMs (which work in two opposite directions, also known as Bi-CLSTM) to learn spatial information. (iv) The outputs of the two CLSTMs are concatenated, and then a softmax function is applied to get the cardiac segmentation. (v) Finally, cardiac disease is automatically diagnosed from the segmentation maps.

The associate editor coordinating the review of this manuscript and approving it for publication was Inês Domingues.

II. RELATED WORK
Each frame in the CMR is a 3D structure. Current state-of-the-art methods for 3D biomedical image segmentation (here we mainly focus on DL segmentation methods, because they are relevant to our scheme and usually perform well) can be roughly divided into the following five categories: (I) 2D convolution on planes, (II) 3D convolution, (III) multi-planar schemes, (IV) fusion of 2D and 3D, and (V) combining 2D and temporal information.

A. 2D CONVOLUTION ON PLANES
U-Net needs relatively little training data and achieves good segmentation results [3], so it is considered the most prominent 2D network structure in the field of medical image segmentation. Baumgartner et al. [6] used a 2D U-Net to segment CMR data. They processed the 3D data in a slice-by-slice fashion with no correlation between the 2D slices, and as a result this method loses the spatial structure information in the original data. In general, 2D medical image segmentation methods [7], [8] have the problem of losing spatial information.

B. 3D CONVOLUTION
The 3D convolution network structure is an improvement on 2D convolution that fully leverages the spatial correlation along the z-direction. Yang et al. [9] used 3D convolution with residual connections for CMR image segmentation. However, due to the large inter-slice gap, the 3D convolutional network is not fully applicable to CMR images. Moreover, a 3D convolution network usually has a complicated structure that requires many training parameters and a long training time. With a small amount of training data, it tends to over-fit and generalize poorly.

C. MULTI-PLANAR SCHEMES
The multi-planar method slices the dataset along different axial directions and compensates for the spatial structure information missing from the 2D segmentation method by training on multiple axes. Prasoon et al. [10] integrated three 2D convolutional neural networks (on the xy, yz and zx planes of the 3D image) to segment the tibial cartilage, and this performs better than a method based on 3D multi-scale features. Similarly, Mortazi et al. [11] used a multi-planar deep convolutional neural network to segment the whole heart; this method automatically utilizes spatial information through an adaptive fusion strategy. However, the multi-planar method is not suitable for CMR data. Usually, the CMR slice thickness ranges from 5 mm to 10 mm, the inter-slice gap is 5 mm, and the spatial resolution varies from 1.34 to 1.68 mm²/pixel [12]. Slices in the xy plane therefore look very different from slices in the xz or yz plane, and slicing directly along different axes may cause unnecessary errors.

D. FUSION OF 2D AND 3D
At present, there are some 2D and 3D fusion network structures, whose segmentation accuracy relies heavily on the fusion strategy. Isensee et al. [13] used a 2D and a 3D model for segmentation prediction and extracted features from the averaged segmentation, which are fed into a classifier for cardiac disease diagnosis. This method of averaging the 2D and 3D models is simple but effective. Similarly, Li et al. [14] proposed a hybrid densely connected U-Net (H-DenseUNet) for liver and tumor segmentation, which fuses 2D and 3D network structures. These methods make full use of the advantages of 2D convolution and 3D convolution, but the segmentation accuracy is easily affected by the fusion strategy.

E. COMBINING 2D AND TEMPORAL INFORMATION
The combination of 2D and temporal information compensates for the spatial structure lost by the 2D convolution network by using a recurrent memory network such as long short-term memory (LSTM). Zhang et al. [15] proposed a multi-level convolutional long short-term memory (ConvLSTM) model for the segmentation of the left ventricle myocardium. LSTM is often applied to time series data; for example, [16] used LSTM to estimate regional wall thicknesses.
Tseng et al. [17] proposed a 3D biomedical segmentation method that jointly performs sequence learning over 2D slices and learning across multiple modalities. Similarly, Chen et al. [18] proposed a framework for 3D biomedical segmentation that uses a recurrent neural network to exploit the inter-slice contexts. Inspired by these works [17], [18], we propose a new DL segmentation structure, which consists of a residual convolution neural network and a bidirectional-convolutional LSTM. We believe this combination of 2D and temporal information has great potential.
In brief, there are several problems with the current methods of DL-based 3D biomedical segmentation. (i) The 2D convolution on planes method treats the 3D data in a slice-by-slice fashion, thus losing the spatial structure information in the original data. (ii) A 3D convolution network usually has a complicated structure with many training parameters, which requires a long training time. (iii) The multi-planar method performs well in some 3D biomedical segmentation tasks, but it is not suitable for CMR data because of the large inter-slice gap. (iv) The fusion network structure and the 2D convolution method with temporal information have achieved good performance in CMR data segmentation, and we believe that these two methods have great potential in the future.

III. METHOD
Fig. 1 illustrates the framework for automated cardiac segmentation and diagnosis. The pipeline involves: (i) The region of interest (ROI) extraction method is first applied as a data preprocessing step (the detailed ROI extraction method can be found in the work of Ioffe and Szegedy [20]), and the 3D data is then processed into 2D slices (note that the original data and the corresponding labels are 3D data). (ii) The 2D slices are fed into the residual convolution neural network to exploit the intra-slice information (Fig. 2 shows the structure of the residual convolution neural network in detail). (iii) The results learned by the residual convolution neural network are input into two CLSTMs (which work in two opposite directions, also known as Bi-CLSTM) to learn spatial information. (iv) The outputs of the two CLSTMs are concatenated to get the cardiac segmentation results. (v) Finally, cardiac disease is automatically diagnosed from the segmentation maps.
In Fig. 1, Bi-CLSTM is used to model the temporal relationship between adjacent frames. The input is an image sequence $\text{Image} = \{\text{Image}_t \mid t = 1, 2, \ldots, T\}$ across time frames $t$, and the output is the predicted label map sequence $\text{Label} = \{\text{Label}_t \mid t = 1, 2, \ldots, T\}$. The output of Bi-CLSTM is a pixel-wise feature map at each time point $t$. The time information between different slices is linked by the Bi-CLSTM. Compared to the standard LSTM, which analyses one-dimensional signals, Bi-CLSTM can solve image sequence prediction problems and learn spatial structure information in image sequences. Section III-A introduces our proposed residual convolution neural network, Section III-B presents the Bi-CLSTM, and Section III-C describes the cardiac disease prediction in this study.

FIGURE 2. (i) Left: the residual convolution neural network. Note that the residual network is used to build deeper network structures, and dilated convolution is used instead of the max-pooling operation. (ii) Middle: the normal residual connection, which uses a stack of 2 layers with a 3 × 3 convolution kernel and 256 feature channels. (iii) Right: a deeper residual connection, which uses a stack of 3 layers. The first layer uses a 1 × 1 convolution kernel for dimensionality reduction, the "BN" in the second layer represents batch normalization [19], and the third layer uses a 1 × 1 convolution to increase the dimensions again. Because the first and third layers reduce and then restore the dimensions, the second layer has smaller input and output dimensions.

A. RESIDUAL CONVOLUTION NEURAL NETWORK
To obtain expressive and task-aware features (e.g., texture, shapes), a specially tailored residual convolution neural network is designed for each cardiac slice, as shown on the left of Fig. 2. Compared to the current convolutional neural network structure, our residual convolution neural network adds deep residual learning and dilated convolution.
Here, U-Net is employed as a submodule of our improved residual convolution neural network [3]. U-Net consists of a contracting path and a symmetric expanding path, which capture context and enable precise localization. U-Net combines low-level and high-level features, which makes it suitable for medical image segmentation. Therefore, U-Net is chosen as the network submodule, and its contracting and expanding paths are retained in our framework. Moreover, in our network structure, all blocks are replaced by residual connections, and max-pooling is replaced by dilated convolution.

1) DEEP RESIDUAL LEARNING
Deeper network structures can learn richer features, and many visual tasks have benefited from deeper network structures [21], [22]. However, simply increasing the number of stacked layers causes gradient explosion/disappearance problems [23], which can give rise to network degradation. Normalized initialization is often used to alleviate this problem [24]. He et al. [4] proposed a deep residual learning method, which addresses the gradient explosion/disappearance problems of deep network structures and makes deeper network structures possible. With U-Net as a submodule, we apply residual connections to our network structure, which can learn more information on each cardiac slice by increasing the depth of the network.
The residual connection (shown in the middle of Fig. 2) is widely used in deep networks. However, a deeper network requires more parameters and computational resources. In this study, to design a more efficient model, the normal residual connection is replaced by the deeper residual connection (shown on the right side of Fig. 2). In the deeper residual connection, a 1 × 1 convolution kernel is used in the first layer for dimension reduction, and another 1 × 1 convolution is used in the last layer to increase the dimensions again. With this connection, the block keeps matching input and output dimensions while using fewer training parameters.
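As a rough illustration of why the deeper (bottleneck) connection needs fewer parameters, the following sketch counts convolution weights for the two blocks in Fig. 2. The 256 feature channels come from the figure description; the reduced width of 64 channels, the 3 × 3 kernel of the middle layer, and the omission of biases and batch-normalization parameters are assumptions made only for this example.

```python
def conv_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def normal_block(channels: int = 256) -> int:
    # Two stacked 3x3 convolutions that keep the channel count.
    return 2 * conv_weights(3, channels, channels)


def bottleneck_block(channels: int = 256, reduced: int = 64) -> int:
    # 1x1 reduce -> 3x3 at the reduced width -> 1x1 restore.
    # `reduced=64` is an assumed value, not taken from the paper.
    return (
        conv_weights(1, channels, reduced)
        + conv_weights(3, reduced, reduced)
        + conv_weights(1, reduced, channels)
    )


normal = normal_block()
bottleneck = bottleneck_block()
print(normal, bottleneck, round(normal / bottleneck, 1))
```

Under these assumptions the three-layer block is deeper than the two-layer block yet uses roughly one seventeenth of the weights, which is the trade-off the text describes.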

2) DILATED CONVOLUTION
The max-pooling operation increases the receptive field by reducing the image size, which results in a loss of resolution and, as a result, decreases the segmentation accuracy. Yu and Koltun [5] used dilated convolutions to systematically aggregate multi-scale contextual information without loss of resolution or coverage. Inspired by this, to improve the cardiac segmentation accuracy in this study, we replaced all the max-pooling operations with dilated convolutions, as shown on the left side of Fig. 2.
In brief, the details of our proposed residual convolution neural network structure are as follows: (i) U-Net is used as the submodule. To capture context and enable precise localization, the contracting path and the expanding path are retained in our residual convolution neural network. (ii) To increase the depth of the network, learn more features and reduce training parameters, all blocks (a convolutional layer and a ReLU unit) in the U-Net are replaced with the deeper residual connection (shown on the right side of Fig. 2). (iii) All max-pooling operations are replaced by dilated convolutions to increase the receptive field without loss of resolution. The residual convolution neural network structure is shown on the left side of Fig. 2.
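The effect of dilation on the receptive field can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions with square kernels; the dilation rates 1, 2, 4 and the function names are illustrative and are not taken from the paper.

```python
from typing import Sequence


def effective_kernel(kernel: int, dilation: int) -> int:
    """Span covered by a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


def receptive_field(kernel: int, dilations: Sequence[int]) -> int:
    """Receptive field (one axis) of stacked stride-1 dilated convolutions."""
    field = 1
    for d in dilations:
        # Each layer widens the field by its effective kernel span minus one.
        field += effective_kernel(kernel, d) - 1
    return field


plain = receptive_field(3, [1, 1, 1])    # three ordinary 3x3 layers
dilated = receptive_field(3, [1, 2, 4])  # same layer count, growing dilation
print(plain, dilated)
```

With the same number of layers and weights, the dilated stack sees a wider neighbourhood than the plain stack, and no downsampling is needed, so the feature map keeps its full resolution.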

B. BIDIRECTIONAL-CONVOLUTIONAL LONG SHORT TERM MEMORY: BI-CLSTM
Long short-term memory (LSTM) is a special recurrent neural network (RNN) [25]. LSTM and several variant units are widely used in processing sequential data [26]. LSTM performs better on longer sequences, and its definition is as follows.
where $i_t$, $f_t$, $c_t$, $o_t$ and $h_t$ are respectively the input gate (i), forget gate (f), memory cell (c), output gate (o) and hidden state (h) at time $t$. $W$ and $b$ denote the weight matrices and the biases, respectively. $\sigma(\cdot)$ and $\tanh(\cdot)$ are the logistic sigmoid and hyperbolic tangent functions, and $x_t$ is the input at time $t$. There are many variants of LSTM. Among them, Shi et al. [27] proposed a convolutional LSTM (CLSTM) network for the precipitation nowcasting problem, which replaces the vector multiplication by convolutional operators. The structure of CLSTM is shown in Fig. 3, where the left part enlarges its core computation unit. CLSTM can solve image sequence prediction problems and learn spatial structure information in image sequences. CLSTM can be formulated as follows.
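For reference, one widely used way of writing the two recurrences with the symbols defined above is sketched below. The LSTM is shown without peephole connections, and the CLSTM follows the general form of [27] with its peephole terms omitted. The subscripted weight names ($W_{xi}$, $W_{hi}$, and so on) are notation assumed for this sketch and may differ from the exact form of Eq. 1 and Eq. 2.

```latex
% LSTM without peephole connections (\circ is the element-wise product)
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}

% CLSTM: the matrix products are replaced by convolutions (*)
\begin{aligned}
i_t &= \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + b_f) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + b_o) \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
```

In the convolutional form, $x_t$, $h_t$ and $c_t$ are feature maps rather than vectors, so the gates keep the spatial layout of each slice.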
Different from the LSTM in Eq. 1, $*$ denotes convolution and $\circ$ denotes the element-wise product. Bi-CLSTM [2] consists of two CLSTMs running in opposite directions, which allows the network model to learn more features. The outputs of the two CLSTMs are concatenated, and then a softmax function is applied to get the segmentation result. The entire cardiac segmentation process is shown in Fig. 1. The pipeline involves: (i) Process the 3D data into 2D slices. (ii) Feed the 2D slices into the residual convolution neural network to exploit the intra-slice information. (iii) Assemble the intra-slice contexts into 3D contexts and input them into the Bi-CLSTM to learn spatial information. (iv) Concatenate the outputs of the Bi-CLSTM to get the cardiac segmentation. The left ventricle (LV), right ventricle (RV) and myocardium (Myo) at both the end diastolic (ED) and end systolic (ES) phases can be segmented from the CMR by our proposed segmentation method.

FIGURE 3. The structure of CLSTM.

VOLUME 8, 2020

C. CARDIAC DISEASE DIAGNOSIS

1) FEATURE EXTRACTION
Automatic cardiac disease diagnosis mainly uses patient characteristics and image features. The patient characteristics are patient weight (kg) and patient height (cm), and the image features (left ventricle, right ventricle and myocardium at both the end diastolic and end systolic phases) are obtained from our segmentation network. Combining the patient characteristics and image features yields features such as the ejection fraction (EF) and the cardiac volume index. EF is commonly applied as a clinically relevant metric to assess ventricular function [28], and it is calculated as follows:
$$\mathrm{EF} = \frac{\mathrm{EDV} - \mathrm{ESV}}{\mathrm{EDV}} \times 100\%$$
where EDV and ESV are the ventricular end diastolic volume and the ventricular end systolic volume, respectively. The ejection fraction reflects the ejection function of the ventricle and is an important indication for determining the type of heart failure. The left ventricular end diastolic volume index (LVEDVI) is calculated as follows:
$$\mathrm{LVEDVI} = \frac{\mathrm{LVEDV}}{\mathrm{BSA}}$$
Here, LVEDV is the left ventricular end diastolic volume, and BSA is the body surface area. The BSA can be calculated as $\mathrm{BSA} = 0.007184 \cdot \mathrm{weight}^{0.425} \cdot \mathrm{height}^{0.725}$ [29]. The extracted features are the basis for the cardiac disease diagnosis.
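A minimal sketch of these feature computations follows, assuming the standard definition of the ejection fraction, (EDV − ESV) / EDV, and the BSA formula quoted above with weight in kg and height in cm. The function names and the numeric inputs are illustrative only and are not taken from the ACDC data.

```python
def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Ejection fraction in percent from end diastolic / end systolic volumes."""
    return (edv_ml - esv_ml) / edv_ml * 100.0


def body_surface_area(weight_kg: float, height_cm: float) -> float:
    """Body surface area in m^2 using the formula quoted in the text [29]."""
    return 0.007184 * weight_kg ** 0.425 * height_cm ** 0.725


def volume_index(volume_ml: float, bsa_m2: float) -> float:
    """Volume normalized by body surface area, in mL/m^2."""
    return volume_ml / bsa_m2


# Illustrative values only; they are not taken from the ACDC data.
ef = ejection_fraction(edv_ml=120.0, esv_ml=50.0)
bsa = body_surface_area(weight_kg=70.0, height_cm=170.0)
lvedvi = volume_index(120.0, bsa)
print(f"EF = {ef:.1f} %, BSA = {bsa:.2f} m^2, LVEDVI = {lvedvi:.1f} mL/m^2")
```

In practice the volumes would come from counting segmented voxels and multiplying by the voxel volume, which is why segmentation errors propagate directly into these features.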

2) DIAGNOSTIC RULE
According to the medical report [12], (i) patients with previous myocardial infarction (MINF) have a left ventricular ejection fraction lower than 40%, (ii) patients with dilated cardiomyopathy (DCM) have a left ventricular ejection fraction lower than 40% and a diastolic left ventricular volume higher than 100 mL/m², (iii) patients with hypertrophic cardiomyopathy (HCM) have a left ventricular cardiac mass higher than 110 g/m², (iv) patients with an abnormal right ventricle (RV) have a right ventricular ejection fraction lower than 40% and a right ventricular cavity volume higher than 110 mL/m², and (v) all other cases are classified as normal (NOR). The cardiac disease diagnosis process is shown in Alg. 1 (note that the parameters in Alg. 1 are derived from the medical report [12]).
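The thresholds above can be turned into a small rule-based classifier. The sketch below is one possible reading of these rules and not a transcription of Alg. 1: the function name, the argument names and, in particular, the order in which the rules are tested are assumptions. DCM is tested before MINF because its condition is the stricter of the two.

```python
def diagnose(lv_ef, lv_edv_index, lv_mass_index, rv_ef, rv_volume_index):
    """Return one of 'DCM', 'MINF', 'HCM', 'RV', 'NOR'.

    EF values are in percent, volume indices in mL/m^2, mass index in g/m^2.
    The rule order is an assumption made for this sketch.
    """
    if lv_ef < 40 and lv_edv_index > 100:
        return "DCM"
    if lv_ef < 40:
        return "MINF"
    if lv_mass_index > 110:
        return "HCM"
    if rv_ef < 40 and rv_volume_index > 110:
        return "RV"
    return "NOR"


# Hypothetical patient: low LV ejection fraction with an enlarged LV.
print(diagnose(lv_ef=35, lv_edv_index=120, lv_mass_index=80,
               rv_ef=55, rv_volume_index=70))
```

Because MINF and DCM share the low-EF condition, a small error in the estimated LV volume index can move a patient from one class to the other.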

IV. EXPERIMENT AND ANALYSIS
We conducted experiments on the ACDC (automated cardiac diagnosis challenge) dataset to evaluate the proposed method. The ACDC took place during MICCAI 2017 (International Conference on Medical Image Computing and Computer Assisted Intervention) and remains open for new submissions. We compared the segmentation results with several state-of-the-art methods, and the segmentation and diagnosis results were evaluated on the dedicated online evaluation website. We conducted experiments on a desktop computer with an NVIDIA GeForce GTX 1080 Ti GPU, an Intel Core i7-4790 CPU @ 3.60 GHz and 32 GB RAM.

A. SEGMENTATION RESULTS
We performed experiments on the ACDC dataset, which consists of 150 patients (patients with one of four diseases or healthy individuals). In addition, each patient record has additional information: weight, height, ED, and ES. The ACDC dataset contains a training dataset and a testing dataset. (i) The training dataset is composed of 100 patients with segmentation labels and diagnostic results, and our proposed network model is trained on it. (ii) The testing dataset consists of 50 new patients without segmentation labels or diagnostic results; we performed experiments on the testing dataset and evaluated our results on the dedicated online evaluation website. The original input data size is not uniform, which is not conducive to subsequent processing. Therefore, the region of interest (ROI) is extracted from the original data. A Gaussian kernel-based circular Hough transform approach is used for left ventricle (LV) localization, and the original data is cropped according to the coordinates of the left ventricle center. After the ROI extraction, the 3D data is processed into 2D slices of size 128 × 128. The ROI extraction has the following three advantages: (i) Reducing interference from other tissues and organs. (ii) Unifying the data format: every slice has the same 128 × 128 size, which is convenient for network training. (iii) Saving computing resources.
The Dice coefficient and the Hausdorff distance are both used to evaluate the segmentation method in this study, so that the evaluation does not favor some methods over others. Table 2 shows the Dice coefficients for LV, RV and Myo segmentation at ED and ES. The Dice coefficient measures the accuracy of the segmentation results and is defined as
$$\mathrm{Dice} = \frac{2\,|V_{user} \cap V_{ref}|}{|V_{user}| + |V_{ref}|}$$
where $V_{user}$ is the segmented volume and $V_{ref}$ is the corresponding reference volume. A Dice coefficient close to 1 represents a better segmentation result. The Hausdorff distance measures the local maximum distance between the segmented surface $S_{user}$ and the corresponding reference surface $S_{ref}$ [33]. Table 3 shows the Hausdorff distance under different segmentation methods. A smaller Hausdorff distance represents a better segmentation result. Our method achieves a Dice coefficient of 0.95 for the LV at ED; the most difficult structure is the RV at ES, with a Dice coefficient of 0.86. Our approach achieves satisfactory results because the network can integrate contextual information along the z-direction.
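The two metrics can be illustrated on a toy example. The sketch below assumes the standard overlap definition of Dice, 2|A ∩ B| / (|A| + |B|), and a brute-force symmetric Hausdorff distance over point sets; it is not the evaluation code of the challenge website. The `hausdorff` function works on whatever point sets it is given, so surface voxels would have to be passed in to match the surface-based definition in the text.

```python
import math


def dice(user: set, ref: set) -> float:
    """Dice coefficient of two voxel sets (1.0 means perfect overlap)."""
    if not user and not ref:
        return 1.0
    return 2 * len(user & ref) / (len(user) + len(ref))


def hausdorff(user: set, ref: set) -> float:
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(a, b):
        # Largest distance from a point in `a` to its nearest point in `b`.
        return max(min(math.dist(p, q) for q in b) for p in a)
    return max(directed(user, ref), directed(ref, user))


# Toy 2D example: two 3x3 squares shifted by one pixel along x.
ref = {(x, y) for x in range(0, 3) for y in range(0, 3)}
user = {(x, y) for x in range(1, 4) for y in range(0, 3)}
print(dice(user, ref), hausdorff(user, ref))
```

The example shows why the two metrics complement each other: Dice summarizes overall overlap, while the Hausdorff distance is driven by the single worst boundary point.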
To demonstrate the effectiveness of the algorithm, we selected typical methods for comparison. (i) The U-Net method was selected as a representative 2D segmentation method. (ii) U-Net + CLSTM and (iii) U-Net + Bi-CLSTM were used to evaluate 2D segmentation combined with temporal information. (iv) Our residual convolution neural network alone was selected to verify the effect of the deep residual connection on 2D segmentation. (v) Our full method is a variant of the approach that fuses 2D and temporal information. Comparing the U-Net and U-Net + Bi-CLSTM methods, the method that fuses temporal information has a smaller Hausdorff distance, which means that $S_{user}$ is closer to $S_{ref}$. The proposed method shows a small improvement in the Dice coefficient (Table 2) but a larger improvement in the Hausdorff distance (Table 3), because the two evaluation metrics capture different aspects of segmentation quality. Comparing U-Net with our residual convolution neural network, the latter has better segmentation results. Table 3 shows that the Hausdorff distance of the U-Net + CLSTM method is the smallest except for the LV at ED, which shows that the temporal information can improve the contour accuracy of the segmentation. Our residual convolutional network has a deeper structure to extract richer semantic features, and the residual connections let the network learn different levels of semantic information. In addition, the dilated convolution increases the receptive field of the network and thereby improves segmentation accuracy. Moreover, Table 2 and Table 3 show that the proposed method performs well under both evaluation metrics (Dice coefficient and Hausdorff distance), which supports the claim that our method combines intra-slice and inter-slice information from the sequence of image slices and thereby improves the segmentation accuracy.

B. DIAGNOSIS RESULTS
The cardiac disease diagnosis rule is described in Section III-C. The diagnostic results were evaluated on the online evaluation website. Fig. 4 shows the confusion matrix of our results on the automated disease classification challenge, with an overall accuracy of 94%. Sensitivity was 100% for the NOR, DCM, HCM and RV classes and 70% for the MINF class. Three MINF cases were misclassified as DCM because the two diseases are visually similar and both are characterized by a low EF of the LV. Table 1 compares our method with other cardiac disease diagnosis methods, and our method achieves a promising result.
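The reported figures are mutually consistent, which a few lines of arithmetic can confirm. The sketch assumes that the 50 test patients are split evenly into 10 per class; this split is not stated in the text above and is an assumption of the example, as are the variable names.

```python
CLASSES = ["NOR", "MINF", "DCM", "HCM", "RV"]
PER_CLASS = 10  # assumed even split of the 50 test patients

# confusion[true][predicted]; only the three MINF -> DCM errors are off-diagonal.
confusion = {t: {p: 0 for p in CLASSES} for t in CLASSES}
for name in CLASSES:
    confusion[name][name] = PER_CLASS
confusion["MINF"]["MINF"] -= 3
confusion["MINF"]["DCM"] += 3

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[c][c] for c in CLASSES)
accuracy = correct / total
sensitivity = {c: confusion[c][c] / sum(confusion[c].values()) for c in CLASSES}
print(f"accuracy = {accuracy:.0%}, MINF sensitivity = {sensitivity['MINF']:.0%}")
```

Under this assumed split, three misclassified MINF cases account for both the 94% overall accuracy and the 70% MINF sensitivity.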

V. CONCLUSION
In this study, we proposed a new DL segmentation structure, which is based on a residual convolution neural network to compress the intra-slice information and a Bi-CLSTM to leverage the inter-slice contexts. Using this segmentation structure, the left ventricle, right ventricle, and myocardium were successfully segmented from the CMR dataset. In addition, we conducted automatic cardiac disease diagnosis from the segmentation maps. Our approach combines automatic cardiac segmentation and diagnosis, which has great potential for computer-aided diagnosis. We performed experiments on the public ACDC dataset, and both the automated segmentation results and the diagnostic results were satisfactory. In the future, we will delve deeper into segmentation methods that combine 2D and temporal information, because this approach has great potential in 3D biomedical segmentation.

XIAOYING HUANG is currently a student at the School of Information Science and Technology, Beijing Normal University, Beijing, China. Her research interest is in medical image processing.
QINGJUN WANG is currently the Deputy Director of medical imaging, a Medical Doctor, and a Deputy Chief Physician, with particular expertise in post-treatment evaluation of brain tumors, preoperative white matter fiber bundle and central functional area localization analysis, cardiac magnetic resonance imaging, and cardiac function analysis.