Convolutional-Neural-Network-Based Approach for Segmentation of Apical Four-Chamber View from Fetal Echocardiography

Obtaining an apical four-chamber (A4C) view from early fetal echocardiography is an extremely significant step in the early diagnosis and timely treatment of congenital heart diseases. The objective is to perform automated segmentation of cardiac structures, namely, the epicardium, left ventricle, left atrium, descending aorta, right atrium, right ventricle, and thorax, in ultrasound A4C views in one shot in order to assist clinicians in prenatal examination. However, such a segmentation task is often faced with the following challenges: 1) low imaging resolution; 2) incomplete tissue boundaries; 3) dependence on the overall contrast of the image. To address these issues, in this study, we propose a cascaded U-net, named CU-net, with structural similarity index measure (SSIM) loss. First, the CU-net with two branch supervisions helps gain clear tissue boundaries and alleviates the gradient vanishing problem caused by increasing network depth. Second, between-net connections in the CU-net transmit the prior information from the shallow layers to the deeper layers and yield more refined segmentation results. Third, the method leverages SSIM loss to preserve fine-grained structural information and obtain clear boundaries. Extensive experiments on a dataset of 1712 A4C views demonstrate that the proposed method achieves a high dice coefficient of 0.856, Hausdorff distance of 3.33, and pixel accuracy of 0.929, revealing its effectiveness and potential as a clinical tool.


I. INTRODUCTION
Congenital heart diseases (CHDs) are a series of deformities in the fetal heart structure or function that impair cardiac function and may result in severe physiological defects [1]- [3]. If CHDs are not treated in time, the morbidity and mortality rates of neonates are high [4]- [6]. Hence, early screening and diagnosis during pregnancy are crucial.
Fetal echocardiography is an elementary low-cost method that does not use radiation and is widely used to detect CHDs by reflecting real-time structures. An apical four-chamber (A4C) view is one of the most important ultrasonic views in fetal echocardiography [7]- [9], because many CHDs can be clearly identified in this view. In prenatal ultrasound examination of CHDs, the diagnostic anatomical structures of A4C views are the epicardium (EP), thorax, left ventricle (LV), left atrium (LA), descending aorta (DAO), right atrium (RA), and right ventricle (RV) [2]- [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Vishal Srivastava.
The interpretation of A4C views requires clinicians to have rich theoretical knowledge and clinical experience [10]- [13]. However, doctors may make incorrect decisions because of fatigue during long diagnostic sessions [14]- [16]. With the development of computer technology, computer-aided diagnosis plays an indispensable role in medical image enhancement, segmentation, and recognition [17]. Accurate segmentation of A4C views can provide pathological information and save clinicians considerable time in observation and measurement. For example, from the segmentation of LV and LA, the presence of left ventricular dysplasia can be detected. Accurate A4C segmentation not only helps imaging experts and clinicians avoid medical accidents but is also a key step in preventing future risks to pregnant women and fetuses. Convolutional neural network (CNN) approaches have achieved state-of-the-art performance in the field of medical image processing. Powered by CNNs, segmentation performance has improved considerably, for example in computed tomography (CT) and magnetic resonance imaging (MRI) [18]- [20]. In this work, we utilize a CNN for ultrasound image processing to automatically segment RA, RV, LA, LV, DAO, EP, and thorax from A4C views.
The following three main challenges exist in A4C view segmentation: (i) Ultrasound images have low resolution and noise, which result in large artifacts in the processed images, leading to great interference in the segmentation task, such as the blue arrows in Fig. 1. (ii) The boundaries of tissues are incomplete in echocardiography images. For example, because of echo dropout, the mitral valve and tricuspid valve in A4C views may be incomplete, as demonstrated by the red arrows in the first column of Fig. 1. Moreover, the openness of the interventricular septum and atrial septum may lead to incomplete boundaries. The cardiac tendinous chordae in ventricles of heart will blur the boundary, as demonstrated by the green arrows in Fig. 1. These phenomena will render segmentation of the ventricular-atrial boundary difficult. (iii) A4C view segmentation needs to consider the overall contrast of the whole image, rather than local or pixel features. The DAO and pulmonary vein are shown by the pink arrows in the third column of Fig. 1. To obtain accurate segmentation results, the segmentation algorithm must learn the global structural information of the whole image.
To overcome the aforementioned challenges, we propose cascaded U-nets (CU-net) with a structural similarity index measure (SSIM) loss function and achieve automated semantic segmentation of the seven structures of fetal heart simultaneously. The proposed CU-net comprises double U-nets within an integrated end-to-end framework. To obtain clear tissue boundaries and mitigate the problem of disappearance of gradients due to increased network depth, we add branch supervision during the training process. To reduce information loss in deeper layers, we design between-net connections to help transmit high-resolution information from shallow layers to the corresponding deeper layers, thus obtaining more refined segmentation results. Furthermore, we present an SSIM loss function to model the spatial information and help the optimization focus on boundaries.
In this study, we propose a novel image segmentation network, CU-net with the SSIM loss function, to achieve automated semantic segmentation. Our experimental results show that this method performs considerably better than other methods in terms of Dice score (DSC), Hausdorff distance (HF), and pixel accuracy (PA). We improved an end-to-end network, CU-net, by adding branch supervisions and between-net connections for accurate segmentation of the seven structures of the fetal heart. Branch supervisions utilize the strategy of coarse-to-fine segmentation. Between-net connections transmit the prior information from the shallow layers to the deeper layers and yield more refined segmentation results. Further, we applied the SSIM loss function to ultrasound fetal multi-tissue segmentation and found that it can introduce global information and help the optimization focus on tissue boundaries. This demonstrates the potential and effectiveness of the SSIM loss function in segmentation.

II. RELATED WORK
A. CNNs FOR MEDICAL IMAGE SEGMENTATION
CNNs have been widely utilized in medical image processing [5]. The U-net was proposed by Ronneberger et al. in 2015 to solve the segmentation problem in more complicated scenarios by benefitting from the accumulation of images and higher computational capacity [21].
Since then, the U-net has been widely used in medical image segmentation [6], [7]. Many improvements of the U-net have also been derived, such as the H-DenseUNet for liver segmentation [22] and the coarse-to-fine U-net for left atrium segmentation [23]. Furthermore, stacked U-nets comprising multiple U-nets have been proposed [8], [9], which increase the network depth and the number of trainable parameters, thereby improving the network performance.
Another improvement of the U-net in medical image segmentation is cascaded U-nets, proposed for brain tumor segmentation [38], prostate segmentation [39], and glioma segmentation [40]. For example, [39] proposed two dense U-nets for prostate MRI segmentation. However, excessive network depth may cause vanishing gradients in practical training. To address this problem, a skip connection between two U-nets (from the decoder layer of the first U-net to the encoder layer of the second one) was proposed [38], which has also been mentioned in [37]. In this study, we propose a new between-net connection and compare the two connections in Section V.

B. LOSS FUNCTION FOR CNNs
In most existing segmentation methods, the model is supervised entirely by local, pixel-level loss functions, such as cross-entropy loss and dice loss, without utilizing the global dependence and structural information of the output space. Therefore, the prediction results often do not conform to the shape priors of the target. The SSIM, originally proposed for image quality assessment [24], measures the similarity between two images. This index defines structural information as being independent of brightness and contrast from the perspective of image composition, reflecting the attributes of the object structure in the scene. Hang Zhao et al. were the first to use the SSIM in natural image reconstruction [25]. The authors argue that traditional mean squared error (MSE)-based losses cannot express how the human visual system perceives images. Therefore, because the SSIM combines the brightness, contrast, and structural information of an image, it is used as a loss function. Xuebin Qin et al. applied the SSIM loss function to two-class (binary) segmentation of natural images [26].

C. CARDIAC IMAGE SEGMENTATION
With the continuous development of deep learning, CNN models have shown significant advantages in computer vision and image processing problems. Automatic segmentation of cardiac images by CNNs has attracted increasing attention.
Accurate segmentation of cardiac chambers is crucial for the diagnosis and prognosis of cardiac diseases. In recent years, more studies have focused on the segmentation of the left ventricle to calculate clinical indicators of patients, such as left ventricular mass and ventricular volume [27]- [29], while some have also considered right ventricle segmentation [30], [31] to quantify clinical indicators such as ejection fraction. In [32], segmentation of two or four chambers of the heart was performed considering different views. However, all these studies [27]- [32] focus on MRI segmentation. Furthermore, in another study [33], left ventricle segmentation of patients was performed using three-dimensional ultrasound images. As for the segmentation of fetal cardiac structures, Li Yu et al. segmented only the left ventricle in fetal echocardiographic sequences [34]. In [37], we proposed a DW-net, comprising a dilated convolutional chain (DCC) and a W-net, for A4C segmentation with a dataset of 895 A4C views. This method has the potential to accurately segment complex multi-structure ultrasound images when the dataset is small.

III. METHOD
Our proposed method is capable of segmenting seven crucial anatomical structures in A4C views. The proposed method consists of a CU-net (see Section III-A), as illustrated in Fig. 2, and a novel loss function (see Section III-B).

A. DESIGN OF THE CU-NET
As Fig. 2 shows, our segmentation network is a novel end-to-end architecture, mainly composed of two cascaded U-nets, designed to take advantage of coarse-to-fine segmentation. The first-stage U-net performs a coarse segmentation and sends the extracted features to the second-stage U-net for further precise segmentation.
The cascaded structure multiplies the network depth and enhances the ability of the method to extract semantic features. However, a deep network may exacerbate the gradient vanishing problem, so that loss information may be lost in long-distance propagation. To address this issue, we add an auxiliary supervision to the first U-net. Each U-net has its own output loss: the first U-net has a coarse loss, while the second has a fine loss. In addition, we redesign the inter-network connection. In previous research [37], as mentioned in Section II, we proposed between-net connections (denoted BNC_DE) from the decoder layers of the first U-net to the encoder layers of the second one. In this study, to make better use of priors, we build between-net connections (denoted BNC_EE) from the encoder layers of the first U-net to the encoder layers of the second one. We do not connect from the first decoder because the BNC_EE enables prior information of shallow layers to be preserved and transferred to deeper layers to describe the details of the heart structure, thereby achieving more accurate segmentation.
Each U-net is a typical encoder-decoder neural network structure. Every layer of the encoder comprises three convolution operations, followed by instance normalization, a ReLU activation function, and a max-pooling operation of stride 2. In addition, each encoder layer has 20 convolution filters of size 3×3 and stride 2. The decoder is symmetrical to the encoder, and skip connection is used to connect the feature maps of the encoder with the feature maps of the decoder. The final outputs of both U-nets are produced by a softmax layer.

B. SSIM LOSS FUNCTION
The CU-net is an end-to-end architecture in which two U-nets are trained jointly to ensure the efficiency of data processing. Our training loss is defined as the weighted summation of the losses of both U-nets:
$$L = \alpha L_{coarse} + \beta L_{fine} \tag{1}$$
where $\alpha$ and $\beta$ are the weighting coefficients; $L_{coarse}$ is the loss between the output of the first U-net and the target; $L_{fine}$ is the loss between the output of the second U-net and the target.
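The weighted sum of the coarse and fine losses can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' TensorFlow implementation: the function names `soft_dice_loss` and `total_loss`, the smoothing constant `eps`, and the `(C, H, W)` array layout are assumptions made here for the example.

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss averaged over classes; inputs have shape (C, H, W)."""
    axes = (1, 2)
    intersection = np.sum(pred * target, axis=axes)
    denom = np.sum(pred, axis=axes) + np.sum(target, axis=axes)
    dice = (2.0 * intersection + eps) / (denom + eps)  # eps is an assumed smoothing term
    return float(1.0 - dice.mean())

def total_loss(coarse_pred, fine_pred, target,
               loss_fn=soft_dice_loss, alpha=1.0, beta=1.0):
    """Weighted sum of the first U-net (coarse) and second U-net (fine) losses."""
    coarse = loss_fn(coarse_pred, target)
    fine = loss_fn(fine_pred, target)
    return alpha * coarse + beta * fine
```

Because both branches are scored with the same loss function against the same target, their values share a scale, which is what makes equal weights a natural default.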
In [25], [26], the SSIM is used to capture the luminance, contrast, and structural information of images. Therefore, we integrate it into our training losses to learn the contrast and structural information of the object's appearance.
In previous studies [24], because of image reconstruction, the mean and variance of the whole map often change dramatically over its span. Hence, a sliding window with a step size of 1 is used to calculate the SSIM of each patch under the window, and the average value is then taken as the SSIM of the whole map. Let $S = \{S_{ij} : j = 1, \ldots, W^2\}$ and $G = \{G_{ij} : j = 1, \ldots, W^2\}$ be corresponding patches cropped from the segmentation result and the ground truth, respectively. Here $i$ is the index of the segmented category, and $W^2$ is the size of the patch. The SSIM of $S$ and $G$ is defined as follows:
$$\mathrm{SSIM}(S, G) = \frac{(2\mu_S \mu_G + C_1)(2\delta_{SG} + C_2)}{(\mu_S^2 + \mu_G^2 + C_1)(\delta_S^2 + \delta_G^2 + C_2)} \tag{2}$$
where $\mu_S$ and $\mu_G$ are the means of $S$ and $G$, $\delta_S$ and $\delta_G$ are the standard deviations of $S$ and $G$, $\delta_{SG}$ is their covariance, $C_1 = 0.01^2$, and $C_2 = 0.02^2$.
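Hence, SSIM loss is defined as follows:
$$L_{ssim} = 1 - \frac{1}{cN} \sum_{i=1}^{c} \sum_{n=1}^{N} \mathrm{SSIM}\left(S_i^{(n)}, G_i^{(n)}\right) \tag{3}$$
where $c$ is the number of segmented categories, $N$ is the number of patches, and $S_i^{(n)}$ and $G_i^{(n)}$ denote the $n$-th pair of patches of category $i$.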
When the SSIM is used as a measure of image reconstruction, the Gaussian filter is often used to calculate the mean and variance of the image. We use mean filtering to calculate the mean and variance of each patch in SSIM.
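The patch-wise computation with mean filtering can be illustrated with a small NumPy sketch. It assumes that every pixel of a patch is weighted equally (population mean, variance, and covariance), that the window slides with stride 1, and that the loss is one minus the SSIM averaged over categories and patches; the last point is an assumption about the normalization, and the function names are hypothetical. The explicit loops favor readability over speed.

```python
import numpy as np

C1 = 0.01 ** 2
C2 = 0.02 ** 2  # constant as stated in the text

def ssim_patch(s, g):
    """SSIM of two equally sized patches using uniform (mean-filter) statistics."""
    mu_s, mu_g = s.mean(), g.mean()
    var_s, var_g = s.var(), g.var()            # population variance
    cov = ((s - mu_s) * (g - mu_g)).mean()     # population covariance
    num = (2 * mu_s * mu_g + C1) * (2 * cov + C2)
    den = (mu_s ** 2 + mu_g ** 2 + C1) * (var_s + var_g + C2)
    return num / den

def ssim_loss(pred, target, w):
    """1 - mean SSIM over all categories and all w-by-w windows (stride 1).

    pred, target: arrays of shape (C, H, W) holding per-class maps.
    """
    c, h, wd = pred.shape
    scores = []
    for k in range(c):
        for y in range(h - w + 1):
            for x in range(wd - w + 1):
                scores.append(ssim_patch(pred[k, y:y + w, x:x + w],
                                         target[k, y:y + w, x:x + w]))
    return 1.0 - float(np.mean(scores))
```

When `w` equals the image size, each category contributes exactly one patch, so the statistics are computed over the whole map.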
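Analogous to (3), we may write the derivative of the SSIM with respect to a pixel $S_{ij}$ of the patch as
$$\frac{\partial\, \mathrm{SSIM}(S, G)}{\partial S_{ij}} = \frac{\partial l(S, G)}{\partial S_{ij}}\, cs(S, G) + l(S, G)\, \frac{\partial cs(S, G)}{\partial S_{ij}} \tag{4}$$
where $l(S, G) = \frac{2\mu_S \mu_G + C_1}{\mu_S^2 + \mu_G^2 + C_1}$ and $cs(S, G) = \frac{2\delta_{SG} + C_2}{\delta_S^2 + \delta_G^2 + C_2}$ are the two terms of the SSIM. With mean filtering over a patch of $W^2$ pixels, their derivatives are, respectively,
$$\frac{\partial l(S, G)}{\partial S_{ij}} = \frac{2}{W^2} \cdot \frac{\mu_G - \mu_S\, l(S, G)}{\mu_S^2 + \mu_G^2 + C_1} \tag{5}$$
and
$$\frac{\partial cs(S, G)}{\partial S_{ij}} = \frac{2}{W^2} \cdot \frac{(G_{ij} - \mu_G) - cs(S, G)\,(S_{ij} - \mu_S)}{\delta_S^2 + \delta_G^2 + C_2} \tag{6}$$
In SSIM loss, the mean, standard deviation, and covariance are used as the estimates of brightness, contrast, and structural similarity, respectively.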
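The derivatives of the two SSIM terms can be checked numerically. The sketch below uses closed forms derived under the assumption of mean filtering with population statistics, where every pixel of a patch carries weight 1/W², and compares the product-rule gradient against central finite differences. The helper names are hypothetical and the step size is an illustrative choice.

```python
import numpy as np

C1, C2 = 0.01 ** 2, 0.02 ** 2

def ssim_terms(s, g):
    """Return the luminance term l and the contrast-structure term cs."""
    mu_s, mu_g = s.mean(), g.mean()
    cov = ((s - mu_s) * (g - mu_g)).mean()
    l = (2 * mu_s * mu_g + C1) / (mu_s ** 2 + mu_g ** 2 + C1)
    cs = (2 * cov + C2) / (s.var() + g.var() + C2)
    return l, cs

def ssim_grad(s, g):
    """Analytic gradient of SSIM = l * cs with respect to every pixel of s."""
    n = s.size                                  # W^2 pixels, each weighted 1/n
    mu_s, mu_g = s.mean(), g.mean()
    l, cs = ssim_terms(s, g)
    dl = 2.0 * (mu_g - mu_s * l) / (n * (mu_s ** 2 + mu_g ** 2 + C1))
    dcs = 2.0 * ((g - mu_g) - cs * (s - mu_s)) / (n * (s.var() + g.var() + C2))
    return dl * cs + l * dcs                    # product rule

def numeric_grad(s, g, eps=1e-6):
    """Central finite-difference gradient, used only as a reference."""
    grad = np.zeros_like(s)
    for idx in np.ndindex(s.shape):
        sp, sm = s.copy(), s.copy()
        sp[idx] += eps
        sm[idx] -= eps
        lp, cp = ssim_terms(sp, g)
        lm, cm = ssim_terms(sm, g)
        grad[idx] = (lp * cp - lm * cm) / (2 * eps)
    return grad
```

Agreement between `ssim_grad` and `numeric_grad` on random patches is a quick sanity check before such a gradient is wired into a training loop; automatic differentiation frameworks compute it without the closed form.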
In our method, the two U-nets use the SSIM loss function as their loss function, which is defined as follows:
$$L = \alpha L_{ssim}^{coarse} + \beta L_{ssim}^{fine} \tag{7}$$

IV. EXPERIMENTS
A. DATASET
The dataset used in this research is provided by the echocardiography department of Beijing Anzhen Hospital, Capital Medical University, Beijing, China. This clinical center specializes in the detection and treatment of fetal congenital heart diseases. Because the hospital collects a large amount of data on various complicated cases from all over the country, the dataset is representative and universal. The dataset employed in the research comprises 1712 fetal A4C views. The segmentation ground truth was labeled by experienced doctors from the echocardiography department of the hospital according to clinical criteria. Each label contains seven structures: left atrium (LA), right atrium (RA), left ventricle (LV), right ventricle (RV), epicardium (EP), descending aorta (DAO), and thorax. We divided the data into a training set and a testing set in a 3:1 ratio: we randomly selected 1284 fetal A4C views as the training set and used the remaining 428 images as the testing set, with no overlap between the two sets. Each image is 256 × 256 pixels.
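A random, non-overlapping 3:1 split of 1712 images into 1284 training and 428 testing images can be reproduced as follows. This is only a sketch: the seed and the function name `split_indices` are hypothetical, and the way the authors actually drew their split is not described beyond being random.

```python
import numpy as np

def split_indices(n_total=1712, n_train=1284, seed=0):
    """Shuffle image indices once and cut them into disjoint train/test sets."""
    rng = np.random.default_rng(seed)      # seed is an illustrative assumption
    perm = rng.permutation(n_total)
    train_idx = np.sort(perm[:n_train])
    test_idx = np.sort(perm[n_train:])
    return train_idx, test_idx

train_idx, test_idx = split_indices()
```

Fixing the seed keeps the split identical across all compared networks, so differences in the metrics cannot be attributed to different test images.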

B. EVALUATION CRITERIA
We use three measures to evaluate our method: dice coefficient (DSC), Hausdorff distance (HF), and pixel-level accuracy (PA). The DSC of one class is defined as
$$\mathrm{DSC}_c = \frac{2\,|P_c \cap Q_c|}{|P_c| + |Q_c|} \tag{8}$$
where $c$ is the category of segmentation, $P_c$ is the automated segmentation map of class $c$, and $Q_c$ is the ground truth of class $c$. The range of DSC is [0,1], with the maximum and minimum values being 1 and 0, respectively.
The HF of one class is defined as
$$\mathrm{HF}_c = \max\{h(P_c, Q_c),\, h(Q_c, P_c)\} \tag{9}$$
where $P_c$ is the pixel set of the automated segmentation map of class $c$, $Q_c$ is the pixel set of the ground truth of class $c$, and $h(P_c, Q_c)$ and $h(Q_c, P_c)$ are given as
$$h(P_c, Q_c) = \max_{p \in P_c} \min_{q \in Q_c} \|p - q\| \tag{10}$$
and
$$h(Q_c, P_c) = \max_{q \in Q_c} \min_{p \in P_c} \|q - p\| \tag{11}$$
where $\|\cdot\|$ is the Euclidean distance between points $q$ and $p$. The smaller the value of HF, the more accurate the segmentation results.
The average PA is defined as
$$\mathrm{PA} = \frac{1}{c} \sum_{k=1}^{c} \frac{p_k}{q_k} \tag{12}$$
where $c$ is the number of classes, $p_k$ denotes the number of correctly classified pixels of class $k$, and $q_k$ denotes the number of all pixels of class $k$. The range of PA is [0,1], with the maximum and minimum values being 1 and 0, respectively.
In addition, we use a paired t-test to compare the performance of two methods. The results for the seven structure regions obtained from the two experiments are taken as the test samples. The null hypothesis is that there is no statistical difference between the results of the two experiments. A p-value of less than 0.05 indicates a significant difference between the two experiments, while a p-value of less than 0.01 indicates a highly significant difference.
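The three measures can be sketched in NumPy for small masks. The sketch makes several assumptions: masks are boolean arrays for one class, label maps hold integer class indices, the average PA is taken as the mean over classes of correctly classified pixels divided by ground-truth pixels, and the Hausdorff distance is computed by brute force over all pixel pairs. The brute-force approach is too memory-hungry for large regions and would need a faster method in practice. The function names are hypothetical.

```python
import numpy as np

def dice_coefficient(pred, truth):
    """DSC of one class; pred and truth are boolean masks of equal shape."""
    inter = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    return 2.0 * inter / total if total > 0 else 1.0

def directed_hausdorff(p_pts, q_pts):
    """max over p of the Euclidean distance to the nearest q."""
    d = np.linalg.norm(p_pts[:, None, :] - q_pts[None, :, :], axis=-1)
    return d.min(axis=1).max()

def hausdorff_distance(pred, truth):
    """Symmetric Hausdorff distance in pixels; both masks must be non-empty."""
    p = np.argwhere(pred)
    q = np.argwhere(truth)
    return max(directed_hausdorff(p, q), directed_hausdorff(q, p))

def mean_pixel_accuracy(pred_labels, true_labels, num_classes):
    """Mean over classes of (correct pixels of class k) / (all pixels of class k)."""
    accs = []
    for k in range(num_classes):
        mask = true_labels == k
        if mask.any():                      # skip classes absent from the ground truth
            accs.append((pred_labels[mask] == k).mean())
    return float(np.mean(accs))
```

As a small worked case, a 2 × 2 predicted square inside a 2 × 3 ground-truth rectangle shares 4 pixels out of 4 + 6, giving a DSC of 0.8, and the farthest unmatched ground-truth pixel lies 1 pixel away from the prediction.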

C. IMPLEMENTATION DETAILS
The experiment was implemented using Python 3.5 and the TensorFlow framework [36], and the hardware used was an NVIDIA Tesla K80 GPU.
Our method directly handles clinical A4C views without any data augmentation. We employed the Adam optimization strategy in the training process. The initial learning rate was 0.0004 with a weight decay of 0.1 per 1000 iterations. We trained 100 epochs. As for the loss function, because we used the same dice loss or SSIM loss function for both supervision branches, with the same order of magnitude, we set an equal weight of 1 for both α and β.

V. RESULTS
A. RESULTS OF SEGMENTATION
We performed the experiments using three different network structures, namely the FCN [35], U-net [21], and proposed method, CU-net, to segment seven structures in A4C views, which are EP, LV, LA, DAO, RA, RV, and thorax. We first trained the three methods with dice loss and named them FCN+l dice , U-net+l dice , and CU-net+l dice , respectively. Then, we trained these methods with SSIM loss and named them FCN+l ssim , U-net+l ssim , and CU-net+l ssim , respectively. We designed the patch size of SSIM loss to be 256, and the number of patches for each structure in the loss function was 1. The reason to adopt this value will be discussed at the end of this section. Table 1 presents the PA as well as the mean DSC and mean HF, which are the means of the corresponding values for seven structures. Tables 2 and 3 illustrate the DSC and HF values of seven structures, respectively. The CU-net with SSIM loss obtained an average DSC of 0.856±0.096, average HF of 3.311±0.805, and average PA of 0.929±0.037, indicating high performance in terms of all evaluation metrics.
As shown in Tables 1, 2, and 3 (2nd, 4th, and 6th rows), with the same dice loss, the CU-net has a superior segmentation performance. The CU-net significantly outperforms the FCN by 8.9% average DSC, 51.6% average HF, and 2.8% PA. In addition, the CU-net significantly outperforms the U-net by 1.1% average DSC, 11.2% average HF, and 0.5% PA.
From Tables 1-3 (comparing the 2nd row with the 3rd row, the 4th row with the 5th row, and the 6th row with the 7th row), we observe that incorporating SSIM loss into the neural network segmentation models significantly improves the results. The CU-net with SSIM loss outperforms the CU-net with dice loss by 1.1% average DSC, 7.9% average HF, and 0.6% PA.

B. VISUALIZATION RESULTS
To further understand the origin of the performance gain, we visualized the segmentation results of some subjects. Fig. 3 shows three example cases for comparing the segmentation performances of different networks and the two loss functions. In Fig. 3 (1), the A4C view has artifacts at the thorax, which lead to vanishing boundaries of the thorax and DAO. In addition, its atrial septum is open, which may influence the boundary between LA and LV. In Fig. 3 (2), the cardiac valves are open and the RV contains artifacts caused by the chordae tendineae. Further, in Fig. 3 (3), the boundaries of the four chambers are obscure. The result of the CU-net with SSIM loss shows a robust performance against the above challenges and exhibits better boundaries. Comparison of the 1st column of each group indicates that the CU-net is considerably better than the U-net and FCN. Comparison between the different loss functions indicates that although the performances of the same network with dice loss and SSIM loss are comparable to each other, SSIM loss results are relatively better.
VOLUME 8, 2020

C. ABLATION EXPERIMENT
In this part, we validate the effectiveness of each key component used in our model. The ablation study involved two parts: an ablation experiment on the structure of the CU-net and one on the loss function. All ablation experiments were conducted on the same dataset.
To prove the effectiveness of our CU-net, we report the quantitative comparison results of our model with other related architectures. The results are presented in Table 4. The U-net² is a cascaded network of two U-nets. The U-net + BNC_DE [37] [...] middle output supervision to the conjunction of two U-nets. Table 4 indicates that the two cascaded U-nets with BNC_EE and supervision achieve the best performance among these configurations. The training time and segmentation time for the four experiments listed in Table 4 are similar: the training time is approximately 9 h and the processing time for each A4C view is approximately 0.528 s.
For the analysis of the architecture, the results of supervision branch 1 and branch 2 of the CU-net trained with SSIM loss are shown in Fig. 4 (3rd and 4th columns, respectively). Fig. 4 (a) presents the images with artifacts generated in the RV and thorax. The results of branch 2 show comparatively clearer boundaries than those of branch 1. Similarly, in Fig. 4 (b), the mitral valve and tricuspid valve of the A4C view are open, which makes segmentation of the boundaries between LA and LV and between RA and RV difficult. Through visualization, we can observe that the boundary of branch 2 is better than that of branch 1; we then compared the DSC of the four chambers and epicardium. The DSC and HF values of RV, LV, LA, RA, and EP are presented in Fig. 5. We can observe that all the values of branch 2 are better than those of branch 1.
For loss function analysis, the CU-net with SSIM loss is trained for different values of W, which is the kernel width, and the observations are summarized in Fig. 6. The mean values of DSC and HF, which are useful evaluation metrics to measure segmentation accuracy, are considered. We can observe that the best performance is observed for W = 256. When W increases, the value of DSC increases, whereas that of HF decreases.

D. STATISTICAL ANALYSIS
To further explore whether our method is significantly different from other methods and whether it is effective for improving automated segmentation of A4C views, we conducted a paired sample t-test. The results of the statistical analysis are presented in Table 5.
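With seven structures per method, a paired t-test has 6 degrees of freedom, so the two-sided critical values are 2.447 at the 0.05 level and 3.707 at the 0.01 level. The sketch below computes the paired t statistic by hand and compares it against these thresholds. The per-structure DSC values are invented purely for illustration and are not results from this study; the function and variable names are hypothetical.

```python
import math

def paired_t_statistic(a, b):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n), n - 1

# Two-sided critical values of Student's t distribution for 6 degrees of freedom.
T_CRIT_DF6 = {0.05: 2.447, 0.01: 3.707}

# Hypothetical per-structure DSC values for two methods (illustration only).
method_a = [0.90, 0.85, 0.88, 0.80, 0.86, 0.84, 0.91]
method_b = [0.88, 0.84, 0.85, 0.79, 0.83, 0.83, 0.89]

t_stat, dof = paired_t_statistic(method_a, method_b)
significant = abs(t_stat) > T_CRIT_DF6[0.05]
highly_significant = abs(t_stat) > T_CRIT_DF6[0.01]
```

The pairing matters: because each difference is taken within the same structure, variation between structures (for example, a small DAO versus a large thorax) does not inflate the variance of the test.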
The statistical results show the effectiveness of our method, and the following three conclusions can be drawn. In the first part (2nd and 3rd rows), compared with the FCN and U-net, the CU-net shows significant differences in terms of the two indexes, HF and DSC.
In the second part (4th to 6th rows), the U-net² + BNC + Sup has a significant effect on improving the performance. Significant differences exist between the U-net² and U-net² + BNC and between the U-net² + BNC + Sup and U-net² + BNC in terms of DSC and HF. This indicates that the between-net connections and auxiliary supervision are effective.
In the third part (7th to 9th rows), significant differences are observed between each network trained with SSIM loss and the same network trained with dice loss, with the p-value being less than 0.05 for both DSC and HF, except for the DSC of the U-net trained with the two different losses.

VI. DISCUSSION
In this study, we introduced a cascaded U-net with the SSIM loss function for the segmentation of LA, LV, RA, RV, DAO, EP, and thorax from ultrasound A4C views for further extraction of useful clinical indicators. Segmentation of an ultrasound A4C view faces three difficulties: low imaging resolution, obscure tissue boundaries, and the need to learn global structural information. Herein, the proposed model for A4C segmentation focuses on two of these problems, namely, boundary segmentation and utilization of global information.
From Tables 1-3, the following two conclusions can be derived. First, the CU-net performs best compared with the FCN and U-net, which supports its design. Second, the SSIM loss function performs better than the dice loss function. The performances of the FCN, U-net, and CU-net are all improved under the constraints of SSIM loss, which demonstrates that SSIM is effective for improving the segmentation performance.
From the experimental results comparing the morphology obtained by the model trained with dice loss and SSIM loss, we observed that the FCN and U-net with SSIM loss are better than the FCN and U-net with dice loss. By constraining the global shape of the target and capturing the global information, a more rounded shape and better segmentation are achieved. Further, visualization of the results indicates that the boundaries of the three subjects are smoother and more accurate, thus confirming that SSIM loss is effective.  Furthermore, the segmentation obtained by the CU-net is more anatomically reasonable than that by the U-net and FCN. As we can see in the first column of each group in Fig. 3, the shape of the chamber segmented by SSIM resembles a circle, with a rough surface. In sum, these results suggest that the proposed CU-net with the SSIM loss function is effective.
To further understand whether the CU-net works in the way it has been designed, we demonstrate the intermediate results of the CU-net. Table 4 confirms that by leveraging the two-branch supervision and the between-net connections from the first encoder to the second encoder, the CU-net can obtain more refined segmentation results. This proves that the BNC_EE is more suitable than BNC_DE [37], [38] for fetal ultrasound multi-tissue segmentation.
In Figs. 4 and 5, we observe that the output of branch 1 is coarser than that of branch 2. The feature output by branch 1 is processed by branch 2, which results in a robust performance as more accurate boundaries are yielded compared to those in branch 1 results.
Our CU-net is more concise than other cascaded U-net methods, which rely on residual blocks [38] or dense blocks [39]. Compared with another fetal echocardiography segmentation method [37], our CU-net is less time-consuming: the method in [37] uses dilated convolutions and a highly complex network and therefore requires more segmentation time.
Further, we explored the effect of different kernel widths (W) in SSIM on the performance measured by the dice coefficient and HF. The experimental results confirm that the performance improves with increasing W probably because of the model's ability to grasp global information. Moreover, when W is 256, equal to the size of image, the results are optimum.
The results of the statistical analysis suggest that our method achieved more accurate segmentation compared with two existing methods and the performance improvement due to the use of between-net connections and auxiliary supervision is statistically significant. The results also prove that the CU-net with the SSIM loss significantly outperforms that with the dice loss in fetal ultrasound A4C view segmentation.
Even though the proposed method has been shown to provide good generalization capabilities across the segmentation of images, our work has the following limitation. There is further scope to improve the proposed method, particularly in terms of incorporating clinical prior knowledge into the A4C view segmentation; this will be explored in future.

VII. CONCLUSION
We proposed a novel end-to-end cascaded U-net model, that is, CU-net with SSIM loss, for accurate segmentation of seven structures in the A4C view. The proposed CU-net is a predict-refine architecture, which consists of two U-nets with branch supervisions and between-net connections. Combined with the SSIM loss function, this method can capture both global information and clear tissue boundaries. Experimental results on the A4C view dataset showed that our proposed method achieved 85.6% DSC, 3.311 HF, and 92.9% PA, and performed better than some mainstream methods. These results suggest that our method can assist in the early prenatal diagnosis of CHDs. This method can be adapted to semantic segmentation of other organs, and has the potential to be applied to segmentation problems in other views of fetal cardiac ultrasound, such as the left ventricular outflow tract view.