Toward Robust Cardiac Segmentation Using Graph Convolutional Networks

Fully automatic cardiac segmentation can be a fast and reproducible method to extract clinical measurements from an echocardiography examination. The U-Net architecture is the current state-of-the-art deep learning architecture for medical segmentation and can segment cardiac structures in real-time with average errors comparable to inter-observer variability. However, this architecture still generates large outliers that are often anatomically incorrect. This work uses the concept of graph convolutional neural networks that predict the contour points of the structures of interest instead of labeling each pixel. We propose a graph architecture that uses two convolutional rings based on cardiac anatomy. While this architecture does not improve performance on classical measures like Dice score and Hausdorff distance, it does eliminate anatomically incorrect segmentations. Additionally, we propose to use the inter-model agreement of the U-Net and the graph network as a predictor of both the input and segmentation quality in real-time. The results show that of the 100 high agreement samples, 93 were in distribution, while of the 100 low agreement samples, only 7 were in distribution. Finally, this work contributes an ablation study of the graph convolutional architecture on the publicly available CAMUS dataset and an evaluation of clinical measurements on the clinical HUNT4 dataset. Source code is available online: https://github.com/gillesvntnu/GCN_multistructure


I. INTRODUCTION
Cardiovascular diseases are the most common cause of death, accounting for one in three deaths in the United States [1]. Ultrasound imaging is the standard diagnostic tool for cardiology, as it is safe, inexpensive, real-time, and non-invasive. Accurate segmentation of the cardiac structures from ultrasound images is important for diagnosis, as it enables the extraction of standard clinical measurements such as left ventricular (LV) volume, ejection fraction (EF), and global longitudinal strain (GLS), which all provide important insights into the patient's cardiovascular health. In today's clinics, segmentation is mainly done manually or semi-automatically.
The associate editor coordinating the review of this manuscript and approving it for publication was Riccardo Carotenuto.
Clinical guidelines recommend that clinical measurements should be repeated over three cardiac cycles [2]. However, this is typically not performed since it takes too much time for the clinician. Automating segmentation would make repeating the measurements on multiple cardiac cycles trivial.
With exponential growth in computing power and access to large amounts of digitized data, deep learning methods have become the most common approach for automatic medical image segmentation, showing average performance comparable to inter-observer variability [3], [4], [5]. Deep learning segmentation approaches usually directly predict a label for each pixel in the image. The U-Net architecture [6] is the most commonly used architecture for pixel-wise segmentation. U-Net is a fully convolutional neural network (CNN) that efficiently combines high and low-level features by integrating skip connections in an encoder-decoder structure. The output is a multi-channel segmentation map, with each channel corresponding to a structure.
Whereas the pixel-wise approach of U-Net gives it excellent accuracy on average, it has two fundamental issues:
1) The U-Net has no anatomical constraints, which can result in outliers and anatomically incorrect segmentations, as shown in Fig. 1.
2) The U-Net has no way of reporting when it is uncertain about how to segment an image, for instance in out-of-distribution and low image quality cases.
This lack of robustness hinders clinical use. Since the segmentation accuracy (e.g. Dice) of LV segmentation is on average already very high with U-Net methods, the goal of this paper is not to create a new network that improves these segmentation metrics, but instead to investigate new methods for dealing with these two fundamental issues of anatomical incorrectness and detection of failing cases.

A. ROBUST U-NET-BASED SEGMENTATION
Several works have tried to address the fundamental outlier problem of U-Net using shape priors. These works encourage the U-Net to produce anatomically valid shapes, either with a specialized loss function expressing anatomical correctness [7], an atlas prior [8], closeness to a learned embedding [9], temporal consistency [10], or joint landmark detection [11]. However, Painchaud et al. [12] showed that methods using soft constraints on pixel labeling methods still produce anatomically incorrect outputs. Their work proposed a post-processing approach that refines the output of the U-Net to the closest anatomically valid shape in a large latent space [12]. The architecture consists of an autoencoder that is trained on ground truth segmentations and generates a latent space of anatomically correct segmentations. During inference, the output of the U-Net-based model is fed to the encoder, producing a latent representation. A nearest neighbor search is performed in the latent space, and the result is fed to the decoder to produce an anatomically corrected output. This method shows excellent results but is too slow for real-time applications. However, the work also introduces a real-time variant that works as a denoising autoencoder. In this work, we compare our results to this real-time version on the CAMUS dataset [4].

B. GRAPH NEURAL NETWORKS
Graph Convolutional Networks (GCN) address the problem of segmentation in a fundamentally different way. Instead of predicting a pixel-wise segmentation map, these models predict the contour of the segmentation as a graph. The contours are sampled into keypoints which form the nodes of a graph. Previous works have shown the effectiveness of this method for MRI [13] and X-ray [14] segmentation. In our previous work, we demonstrated the feasibility of GCNs for ultrasound segmentation [15] using the EchoNet dataset [16]. However, that work was limited to single-structure segmentation (LV endocardium) on a single view (apical four-chamber) and lacked a thorough evaluation.

C. CONTRIBUTIONS
In this work, we extend our work on the GCN approach for cardiac ultrasound segmentation with a focus on the two fundamental issues highlighted in the introduction: anatomical correctness and detection of failing cases.
1) Multi-structure segmentation: We explore methods for extending the graph to segment multiple structures, such as the LV epicardium and left atrium (LA), enabling the segmentation to be used for more measurements such as strain. We propose a clinically motivated design and show that this eliminates anatomically incorrect shapes in the segmentation output of the network. We perform an ablation study of the GCN architecture on the public CAMUS dataset and compare it with U-Net. Additionally, we evaluate the final models on the clinical HUNT4 dataset.
2) Segmentation quality predictor: We combine the outputs of the GCN and the U-Net and show that this can be used to detect out-of-distribution and unsuitable low image quality cases, resulting in bad segmentation output.
3) Open-source framework and demo: We make our code publicly available. We provide two parts: the full PyTorch framework to reproduce the results (https://github.com/gillesvntnu/GCN_multistructure) and the C++ code to run the real-time demo application described later in this article (https://github.com/gillesvntnu/GCN_UNET_agreement_demo).

A. U-NET BASELINE MODELS
We compare the GCN with two U-Net architectures: U-Net 1 [4], a fixed architecture optimised for speed, and nnU-Net V2 [17], [18], a self-adapting framework optimised for accuracy.
For U-Net 1, we use the same architecture and training procedure as in [4], but with additional augmentations as listed in Table 1. These are the same augmentations as in [15]. The nnU-Net is used out of the box with the default configuration, but without the final ensemble step [17], [18]. For both U-Nets, the input images are resized to 256 × 256 pixels as a pre-processing step.

B. GENERAL STRUCTURE OF THE GCN
The general structure of a GCN for cardiac segmentation [15] consists of a CNN encoder pre-trained on ImageNet [19] that acts as a feature extractor and a graph convolutional decoder that reconstructs the keypoints of the contours of the cardiac structures. Fig. 2 shows a general overview of the GCN architecture. The encoder transforms a grayscale ultrasound image of $\mathbb{N}^{H \times W}$ pixel values to a vector embedding of $\mathbb{R}^{X}$ values, where $X$ is the feature embedding size of the last layer of the encoder. A dense layer transforms these embeddings to a representation in the keypoint space of size $\mathbb{R}^{n \times C_1}$, where $n$ is the number of keypoints and $C_1$ is the number of output channels in the first layer of the GCN decoder. The decoder layers perform graph convolutions in the keypoint space and gradually decrease the number of channels until the final output with dimensions $\mathbb{R}^{n \times 2}$, representing the relative pixel coordinates for each keypoint in the image. The size of the intermediate channels of the keypoint embeddings $C_i$ is a tunable parameter. Every graph convolutional layer consists of a linear layer followed by an activation function. It takes the embeddings of adjacent keypoints as input to update the embedding of each keypoint. In the ring of LV keypoints $p_{1..n}$, the embeddings of keypoints $p_{i-w..i+w}$ are the input to update the embedding of keypoint $p_i$, where $w$ is the receptive field of the GCN decoder. The same weights are reused at each keypoint, creating an inductive bias of locality. Section II-D describes how to expand this architecture to segment multiple structures simultaneously.
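As a rough illustration of the ring convolution described above, the following standard-library Python sketch updates each keypoint from its neighbours $p_{i-w..i+w}$ with one shared linear layer. The function name `ring_graph_conv`, the list-based data layout, and the choice of ReLU as activation are assumptions made for this example; the authors' PyTorch implementation may be organised differently.

```python
def ring_graph_conv(embeddings, weights, bias, w):
    """One graph-convolution layer over a closed ring of keypoints.

    embeddings: n keypoint embeddings, each a list of c_in floats.
    weights:    c_out rows of (2*w + 1) * c_in floats, shared by every keypoint.
    bias:       c_out floats.
    w:          receptive field; keypoint i sees p_{i-w} .. p_{i+w}.
    """
    n = len(embeddings)
    out = []
    for i in range(n):
        # Gather the neighbouring embeddings; modulo indexing closes the ring.
        window = []
        for k in range(i - w, i + w + 1):
            window.extend(embeddings[k % n])
        # Shared linear layer followed by an activation (ReLU is an assumption).
        out.append([
            max(0.0, sum(a * b for a, b in zip(row, window)) + b0)
            for row, b0 in zip(weights, bias)
        ])
    return out


# Four keypoints with one channel each; all-ones weights sum each neighbourhood.
print(ring_graph_conv([[1.0], [2.0], [3.0], [4.0]], [[1.0, 1.0, 1.0]], [0.0], w=1))
```

Because the same `weights` are applied at every position, rotating the input ring simply rotates the output, which is the locality bias mentioned in the text.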

C. KEYPOINT EXTRACTION
The keypoints are extracted in a standardized way from the annotations that are available as pixel-wise labels of the left ventricle (LV), left atrium (LA), and myocardium (MYO). The following algorithm extracts anatomical landmarks from the annotations and uniformly samples the contours between these landmarks to fill in the remaining keypoints, as shown in Fig. 3. The anatomical landmarks A-G have a unique physical meaning/location, while the rest of the keypoints are sampled uniformly on the contour between these anatomical landmarks as described in the algorithm below:
1) Extract the annulus points, the corner points where the MYO meets the LA. These are points A and B in Fig. 3. Connecting these points gives the base line.
2) Extract the base points of the MYO by extending the base line and selecting the furthest point with the MYO label on the base line. These are points C and D in Fig. 3.
3) Extract the apexes of the LV, MYO, and LA, defined as the furthest points from the base line with the corresponding label. These are points E, F, and G respectively in Fig. 3.
4) Sample the contour of each structure to complete the keypoint extraction. Sample the endocardium contour line between E and A in Fig. 3 for $n$ equidistant points and do the same for the endocardium contour line between E and B, with $n$ a model parameter. The total number of endocardium points is then $2n + 3$, consisting of the sampled points together with A, B, and E. Follow the same procedure for the epicardium, using F, C, and D as corner points. This results in another $2n + 3$ keypoints for the epicardium, as the number of endo- and epicardium keypoints needs to be equal in our architecture. Finally, sample the LA border between G and A for $m$ equidistant points and do the same between G and B, with $m$ another model parameter. The total number of LA keypoints is then $2m + 1$, consisting of the sampled points together with G. We do not add A and B to the LA points because these points already belong to the endocardium keypoints.
In this work, we choose $n = 20$ and $m = 10$, resulting in a total of 107 keypoints. The right side of Fig. 3 shows the final set of keypoints.
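Step 4 and the keypoint bookkeeping can be sketched as follows. This is a minimal illustration that assumes each contour segment (for example from E to A) is already available as an ordered list of (x, y) points; the helper names `sample_equidistant` and `keypoint_counts` are hypothetical and not taken from the authors' code.

```python
import math


def sample_equidistant(contour, n):
    """Return n points evenly spaced by arc length strictly between the two
    endpoints of an ordered polyline (which needs at least two points)."""
    # Cumulative arc length at every vertex of the polyline.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(contour, contour[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    points, seg = [], 0
    for k in range(1, n + 1):
        target = total * k / (n + 1)
        # Advance to the segment that contains the target arc length.
        while cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = (target - cum[seg]) / seg_len if seg_len > 0 else 0.0
        (x0, y0), (x1, y1) = contour[seg], contour[seg + 1]
        points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


def keypoint_counts(n, m):
    """Keypoints per structure: sampled points plus the landmarks."""
    endo = 2 * n + 3  # two sampled sides plus A, B, E
    epi = 2 * n + 3   # two sampled sides plus C, D, F
    la = 2 * m + 1    # two sampled sides plus G (A and B already counted)
    return endo, epi, la, endo + epi + la


print(sample_equidistant([(0.0, 0.0), (4.0, 0.0)], 3))
print(keypoint_counts(20, 10))  # (43, 43, 21, 107)
```

With $n = 20$ and $m = 10$ this reproduces the 43 + 43 + 21 = 107 keypoints stated in the text.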

D. MULTI-STRUCTURE SEGMENTATION
The graph convolutions of the GCN are performed over the keypoints in the spatial neighborhood in the graph. Since the original approach only included the endocardium, the graph convolutions had to be adapted for the additional structures: the epicardium and the LA.
To include the LA keypoints, the ring of $n$ endocardium keypoints $p_{1..n}$ from the original GCN was extended to a ring of keypoints $p_{1..n+m}$, where $m$ is the number of LA keypoints. The endocardium and LA keypoints together form the keypoints in the inner ring. In our implementation, $n = 43$ and $m = 21$ keypoints are used; here $n$ and $m$ denote the keypoint totals per structure, i.e. $2 \cdot 20 + 3$ and $2 \cdot 10 + 1$ in the notation of Section II-C. For the epicardium keypoints, a second ring of keypoints $q_{1..n}$ was added, which is zero-padded to form a ring of $q_{1..n+m}$ outer keypoints. In the graph convolutional layers of the decoder, the embeddings of inner keypoints $p_{i-(w-1)/2..i+(w-1)/2}$ and outer keypoints $q_{i-(v-1)/2..i+(v-1)/2}$ are the input to produce the embedding of each inner ring keypoint $p_i$, where $w$ is the primary receptive field and $v$ is the secondary receptive field of the GCN decoder. The weights are reused at each keypoint, creating an inductive bias of locality. Fig. 4a visualizes the inner convolution schematically. For each epicardium keypoint $q_i$, the embeddings of the outer keypoints $q_{i-(w-1)/2..i+(w-1)/2}$ and inner keypoints $p_{i-(v-1)/2..i+(v-1)/2}$ are the input to the convolutional layer. The weights are also reused at each epicardium keypoint, but the weights of the inner ring differ from those of the outer ring. Fig. 4b visualizes the outer convolution schematically.
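The index bookkeeping of the two rings can be illustrated with a short sketch. It assumes odd receptive fields (so that $(w-1)/2$ is an integer) and assumes that the zero padding is appended at the end of the outer ring; both are assumptions of this example, and the helper names are hypothetical.

```python
def ring_window(i, size, width):
    """Indices i-(width-1)/2 .. i+(width-1)/2 on a closed ring of `size` nodes.
    Assumes an odd width."""
    half = (width - 1) // 2
    return [k % size for k in range(i - half, i + half + 1)]


def two_ring_inputs(i, size, w, v):
    """Indices feeding the update of inner keypoint p_i:
    (inner ring indices, outer ring indices)."""
    return ring_window(i, size, w), ring_window(i, size, v)


def pad_outer_ring(outer, size, channels):
    """Zero-pad the epicardium ring so that it has as many entries as the
    inner ring (padding position at the end is an assumption)."""
    return outer + [[0.0] * channels for _ in range(size - len(outer))]


size = 43 + 21  # inner ring: endocardium plus LA keypoints
print(two_ring_inputs(0, size, w=3, v=1))  # ([63, 0, 1], [0])
padded = pad_outer_ring([[1.0, 2.0]] * 43, size, channels=2)
print(len(padded))  # 64
```

For an outer keypoint $q_i$ the roles are swapped: the primary window `w` is taken on the outer ring and the secondary window `v` on the inner ring, with a separate set of weights.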

TABLE 1. Characteristics of the U-Net baselines. U-Net 1 and nnU-Net are the same as in [4] and [17] respectively. The "number of channels" column indicates the number of channels at the first, bottom, and last convolution of the U-Net. Table 7 shows the inference time and number of parameters for each network.

calculated using the two neighboring keypoints. Only the final layer of the graph decoder needs to be adjusted to the new output format. The displacement representation prevents the network from producing erroneous segmentations where the epicardium keypoints are inside the endocardium keypoints.

F. GCN TRAINING PROCEDURE
The training procedure of the GCN is the same as in [15]. It uses the Adam [20] optimizer with a learning rate of 1e-5. The model is trained for 5000 epochs and the model weights with the best validation accuracy are retained. The training data is resized to 256 × 256 pixels and augmented by using rotations, scaling, cropping, brightness adjustments, and mirroring. The loss function is the sum of the Euclidean distances of all predicted keypoints plus the mean absolute error of the displacements, if applicable.
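The loss described above can be written down in a few lines. The sketch below is a plain-Python illustration for a single sample; it assumes one scalar displacement per keypoint pair and an unweighted sum of the two terms, neither of which is specified in the text, and the function names are hypothetical.

```python
import math


def keypoint_loss(pred, target):
    """Sum of Euclidean distances between predicted and reference keypoints."""
    return sum(math.hypot(px - tx, py - ty)
               for (px, py), (tx, ty) in zip(pred, target))


def displacement_mae(pred_disp, target_disp):
    """Mean absolute error of the displacements (assumed scalar per keypoint)."""
    return sum(abs(p - t) for p, t in zip(pred_disp, target_disp)) / len(pred_disp)


def gcn_loss(pred_kp, target_kp, pred_disp=None, target_disp=None):
    """Keypoint term plus, if displacements are used, the displacement term."""
    loss = keypoint_loss(pred_kp, target_kp)
    if pred_disp is not None and target_disp is not None:
        loss += displacement_mae(pred_disp, target_disp)  # unweighted sum (assumption)
    return loss


print(gcn_loss([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)]))  # 5.0
print(gcn_loss([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)],
               [1.0, 3.0], [0.0, 1.0]))  # 6.5
```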

G. COMBINING U-NET AND GCN
1) GCN - U-NET CASCADE MODEL
We explore a cascade network of the GCN with displacement method followed by a U-Net. The idea is that the GCN with displacement method will produce an anatomically correct initial shape, which the U-Net can refine to a more accurate pixel-wise segmentation. In the first stage, the GCN with displacement method is trained as before. After training, the GCN performs inference on the training set, and the resulting keypoints are transformed into pixel-wise segmentation outputs. In the second stage, a U-Net is trained with the concatenation of the grayscale ultrasound image and the segmentation output of the GCN as input. The only change to the U-Net architecture is that the first layer takes an extra input channel. We use the same U-Net architectures as described in section II-A. For U-Net 1, the input segmentation channel is additionally augmented separately with rotation, scaling, and translation to prevent the U-Net from relying fully on the segmentation outputs of the GCN. For nnU-Net, the GCN segmentation serves as an extra input channel and the framework is used out of the box without custom augmentations. Fig. 5 shows a graphical representation of the cascade model.

2) GCN - U-NET ENSEMBLE
For estimating ejection fraction using the GCN - U-Net ensemble, both networks run in parallel and construct their segmentations, leading to two different ejection fraction estimations. The average of these two measurements is the output of the ensemble.

H. INTER-MODEL AGREEMENT AS A QUALITY PREDICTOR
The second goal of this work was to develop a method to detect when the segmentation fails. When the GCN and U-Net generate segmentations in parallel, the agreement of the two models, measured as the Dice score between the two segmentation outputs, contains additional information. The idea is that if the two methods, each with its own data representation, generate similar segmentations, there is a higher probability of the segmentations being correct. On the other hand, when one of the methods makes a large error, for example due to an out-of-distribution case, it is unlikely that the other method will make the same mistake because of their fundamental differences in data representation, as demonstrated in Fig. 1. Thus we propose to use the inter-model agreement of these two different segmentation architectures to detect failing segmentation cases.
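A minimal sketch of the agreement measure is given below. It assumes that both models' outputs have been converted to binary masks of equal size, stored as nested lists, and that empty masks count as full agreement; these conventions and the function names are choices made for this example. The default threshold of 0.85 is the value used for frame filtering in the clinical evaluation later in this article.

```python
def dice(mask_a, mask_b):
    """Dice similarity between two binary masks given as nested lists."""
    inter = size_a = size_b = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            size_a += 1 if a else 0
            size_b += 1 if b else 0
    if size_a + size_b == 0:
        return 1.0  # two empty masks agree trivially (convention of this sketch)
    return 2.0 * inter / (size_a + size_b)


def models_agree(mask_unet, mask_gcn, threshold=0.85):
    """Flag a frame as trustworthy when the inter-model Dice exceeds the threshold."""
    return dice(mask_unet, mask_gcn) > threshold


unet = [[1, 1], [0, 0]]
gcn = [[1, 0], [0, 0]]
print(dice(unet, gcn))          # 2 * 1 / (2 + 1)
print(models_agree(unet, gcn))  # False
```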

A. DATASETS
1) CAMUS
The CAMUS dataset is a publicly available dataset of 500 patients including apical 2 chamber (A2C) and apical 4 chamber (A4C) views obtained from a GE Vivid E95 ultrasound scanner, equalling 2000 image annotation pairs [4]. The annotations are available as pixel-wise labels of the left ventricle (LV), left atrium (LA), and myocardium (MYO), split into 10 folds for cross-validation.

2) HUNT4
The Helse Undersøkelsen i Nord-Trøndelag ultrasound dataset (HUNT4Echo) is a large-scale clinical dataset of 2462 patient examinations of LV-focused A2C and A4C recordings used for clinical measurements [5]. Each recording contains 3 cardiac cycles. A subset of 311 patient exams, the training set, contains single frame segmentation annotations in both ED and ES as pixel-wise labels of the LV, LA, and MYO. For 1913 patient exams, there is a reference value of the biplane LV volumes in end-diastole (ED) and end-systole (ES), obtained manually using the clinically approved EchoPAC software (GE HealthCare).

3) DIFFERENCES
Due to differences in annotation conventions and scanning approaches, the HUNT4 and CAMUS datasets cannot be used jointly. For instance, the myocardium is consistently annotated to be much thicker in CAMUS than in HUNT4. Although both datasets contain cardiac images of the same clinical views, the images in the HUNT4 dataset are consistently LV-focused, which is not the case for CAMUS. Whereas we consider the HUNT4 annotations to be more clinically correct, we have also used and included results for the CAMUS dataset in our evaluation since it is publicly available whereas HUNT4 is not.
33880 VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

B. ABLATION STUDY
The ablation study uses the first subgroup of the CAMUS dataset, meaning it tests on the first cross-validation split, validates on the second, and uses the remaining eight splits for training. The first experiment varies the decoder while keeping the CNN encoder fixed to MobileNet-v2 [21]. The second experiment varies the encoder.
Table 2 summarizes the results of the first part of the ablation study, which focuses on varying the complexity of the decoder. The channels are the embedding sizes of the decoder layers, corresponding to $C_{1..n}$ in Fig. 2. The secondary receptive field is the number of keypoints used from the second ring, as described in section II-D. The primary receptive field is kept constant at $w = n + m$, the total number of inner keypoints, as in the original GCN. When there are no channels, the decoder is replaced by a single dense layer, converting the feature embeddings from the encoder directly to keypoint coordinates. The difference in overall Dice score between the best and worst performing variation is not statistically significant, p = 0.77 and p = 0.58 with and without displacement method respectively, using the Wilcoxon signed-rank test [22]. The remainder of this work uses the GCN version with two decoder layers and a secondary receptive field of one.
The second part of the ablation study varies the CNN encoder. Table 3 compares the Dice scores of the network with MobileNet-v2 [21] and ResNet-50 [23] as encoder. The difference in overall Dice score is statistically significant (p < 0.05) for the version with displacement method and not significant (p = 0.41) for the version without displacement method. While slightly more accurate, the ResNet-50 encoder makes the model slower and an order of magnitude larger, as shown in Table 7. Therefore, the remainder of this work uses MobileNet-v2 as encoder.
As the purpose of this work was to create a more anatomically correct segmentation method, an experiment was conducted to measure the number of anatomically incorrect segmentations. For this purpose we use the publicly available CAMUS dataset and the same criteria for anatomical correctness as Painchaud et al. [12]. In addition, we calculate the Dice score and Hausdorff distance to demonstrate the trade-off between anatomical correctness and segmentation accuracy. We performed this experiment using our GCN architectures and other state-of-the-art (SOTA) cardiac segmentation methods for comparison, and the results are summarized in Table 4. As this study focuses on real-time applications, we compare them with the real-time version of the post-processing method proposed by Painchaud et al., the robust Variational AutoEncoder (rVAE) [12]. Furthermore, we compare to U-Net 1 [4] and nnU-Net [18]. To be able to compare to previous work, the Dice score is calculated in the same way as in the work of Painchaud et al. [12], where the myocardium is added to the LV lumen to create an "epicardium" region, which results in overall higher Dice scores. The Dice score and Hausdorff distance are calculated for the LV by itself (LV endo in [4]) and the LV and MYO together (LV epi in [4]). The final result is the average of these two values. With p < 0.05 using the Wilcoxon signed-rank test [22], the Dice score of nnU-Net is significantly higher than that of the GCN, regardless of the presence of the displacement method.
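For reference, the Hausdorff distance and the averaging over the two regions can be sketched as follows. This brute-force version operates on contour point sets and is only meant to illustrate the definition; how contours are extracted from the masks and which units are used are not covered here, and the function names are hypothetical.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def region_average(value_lv_endo, value_lv_epi):
    """Final score: mean of the LV endo and LV epi (LV plus MYO) values."""
    return (value_lv_endo + value_lv_epi) / 2


pred = [(0.0, 0.0), (1.0, 0.0)]
ref = [(0.0, 0.0), (4.0, 0.0)]
print(hausdorff(pred, ref))      # 3.0
print(region_average(3.0, 5.0))  # 4.0
```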
The anatomically incorrect cases for the GCN are cases where the model confuses the endo- and epicardium keypoints, as in Fig. 6. Using the displacement method prevents this from happening, resulting in no anatomically invalid segmentations. Fig. 7 shows the median and worst cases of the GCN with the displacement method and nnU-Net.

D. INTER-MODEL AGREEMENT AS QUALITY PREDICTOR
The goal of this experiment was to evaluate the proposed inter-model agreement as a segmentation quality predictor and to investigate whether the GCN has a better worst-case performance than nnU-Net. The inter-model agreement is measured as the Dice similarity between the segmentation masks of nnU-Net and the GCN with displacement method. For this experiment, we use the clinical dataset HUNT4 as it better represents the ultrasound images used in practice. Fig. 8 shows a histogram of the inter-model Dice scores on the HUNT4 evaluation set. For the experiment, the samples are split between cases with high and low inter-model agreement. The threshold for low inter-model Dice is 0.8, equalling ∼2% of cases. The threshold for high inter-model Dice is 0.9, equalling ∼92% of cases. A total of 200 samples were extracted by combining and randomizing 100 samples below the low threshold and 100 samples above the high threshold. Each sample contains 3 images: the input ultrasound image and the two segmentations produced by the two architectures in random order. A clinician manually classified each sample using the following criteria:
• Overall quality of ultrasound image: 'High', 'Medium', 'Low', or 'Unsuitable for measurements'. The last category is reserved for images of such low quality that they should not be used for any clinical measurement.
• LV-focused?: 'Yes' or 'No'. 'Yes' means the LV is central in the image and the depth is adjusted to the LV.
• Correct placement of segmentation?: 'Yes' or 'No'. 'Yes' means the segmentation is placed correctly.
Table 5 summarizes the results of the experiment. The first subtable (a) shows that when the inter-model agreement is high (> 0.9), the two networks both produce valid segmentations. The second subtable (b) shows that when the inter-model agreement is low (< 0.8), the GCN produces slightly more correct segmentations. If we regard samples that have suitable image quality, the correct view, and are LV-focused as in-distribution samples, we see that of the 100 high agreement samples, 93 were in distribution, while of the 100 low agreement samples, only 7 were in distribution. Furthermore, all 7 in-distribution samples with low agreement had low quality. Fig. 9 visualizes these results, showing the difference in input quality between samples with high and low inter-model agreement.

E. CLINICAL EVALUATION ON HUNT4
We use the clinical dataset HUNT4 to evaluate the accuracy of clinical measurements instead of CAMUS because the CAMUS dataset only contains volume measurements that were measured using the annotations and not using clinically approved software. The HUNT4 dataset also contains considerably more patients. We compare the accuracy of the U-Net and the GCN, which were both trained on the HUNT4 training set. The GCN used for this experiment is the GCN with displacement method.
The automatic estimation of EF mimics the procedure for manual LV volume calculations suggested by clinical guidelines [2]. The automatic procedure contains the following steps:
1) Detect ED and ES of each cycle using the timing neural network described by Fiorito et al. [24] for both the A2C and A4C recordings of the same patient. This means the view was labeled manually during acquisition and the timing is estimated using deep learning.
2) Segment the ED and ES frame of each cycle for both the A2C and A4C recordings of the same patient.
3) Combine the usable segmentations in the A2C recordings with the usable segmentations in the A4C recordings for biplane volume calculations using the modified Simpson method. A segmentation is unusable if the procedure described in section II-C fails to extract the LV apex or base points, whose positions are required for the modified Simpson method.
The EF is then calculated as $\mathrm{EF} = \frac{\text{ED volume} - \text{ES volume}}{\text{ED volume}}$. Measurements from all cardiac cycles are combined and averaged.
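The final EF computation can be sketched as below. The text does not specify whether the EF values or the volumes are averaged over cycles; this example assumes that one EF is computed per cycle from its biplane ED and ES volumes and that these values are then averaged. The function names and the example volumes are made up for illustration.

```python
def ejection_fraction(ed_volume, es_volume):
    """EF = (ED volume - ES volume) / ED volume, as a fraction."""
    return (ed_volume - es_volume) / ed_volume


def mean_ef(cycles):
    """Average EF over cardiac cycles given as (ED volume, ES volume) pairs.
    Averaging per-cycle EF values is an assumption of this sketch."""
    efs = [ejection_fraction(ed, es) for ed, es in cycles]
    return sum(efs) / len(efs)


def ensemble_ef(ef_unet, ef_gcn):
    """GCN - U-Net ensemble: the mean of the two models' EF estimates."""
    return (ef_unet + ef_gcn) / 2


# Hypothetical biplane volumes in ml for three cardiac cycles.
cycles = [(120.0, 48.0), (118.0, 50.0), (122.0, 49.0)]
print(round(100 * mean_ef(cycles), 1), "%")
print(ensemble_ef(0.58, 0.62))
```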
Our results include 1877 out of 1913 patients, equalling a feasibility of 98.1%. In 13 cases the timing network failed. In the remaining 23 cases, either U-Net 1 or nnU-Net failed to produce a usable segmentation for all cycles. Because the GCN predicts the LV apex and base points directly as one of the keypoints, it can produce an EF for every recording. However, for a fair comparison, we exclude these cases from our results. Fig. 10 shows the Bland-Altman plots of the automatic methods compared to the manual reference. Table 6 shows the Mean Absolute Error (MAE) of EF measurements between the automatic method and manual reference. The ensembles average the EF estimates of the two models. The difference in MAE between nnU-Net and all other methods is significant with p < 0.05 using the Wilcoxon signed-rank test [22].
Finally, we use the findings of the previous experiment and filter the ED and ES frames based on inter-model agreement. Fig. 11 shows the same Bland-Altman plots when only using the frames where the inter-model Dice between U-Net or nnU-Net and the GCN is above 0.85. In 137 out of 1877 cases, there was no ED or ES frame in all three cycles of the recording with Dice above 0.85, resulting in the patient being excluded. As a result, the filtered results only include 1740 out of 1913 patients, equalling a feasibility of 91%.

TABLE 5. Results of the experiment on using the inter-model agreement as a quality predictor. The GCN uses the displacement method in this experiment.

F. INFERENCE TIME
RTX 3070 Ti Laptop GPU. To measure the inference time, the model performs a warmup run on 1000 random dummy inputs. Afterward, the model performs 10 test runs on 100 random inputs. The final inference time is the average runtime of each of the test runs divided by 100. These values do not include the data loading time and pre- and post-processing steps.
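The timing protocol above can be sketched with a generic callable standing in for the network. The names `measure_inference_time`, `model`, and `make_input` are placeholders for this example. When timing a PyTorch model on a GPU, one would typically also synchronise the device (for example with `torch.cuda.synchronize()`) before reading the clock, which is omitted here to keep the sketch dependency-free.

```python
import time


def measure_inference_time(model, make_input, warmup=1000, runs=10, inputs_per_run=100):
    """Average per-input runtime in seconds, following the protocol in the text."""
    # Warmup on random dummy inputs; these calls are not timed.
    for _ in range(warmup):
        model(make_input())
    run_times = []
    for _ in range(runs):
        # Inputs are created outside the timed region, so data generation
        # does not count towards the inference time.
        batch = [make_input() for _ in range(inputs_per_run)]
        start = time.perf_counter()
        for x in batch:
            model(x)
        run_times.append(time.perf_counter() - start)
    # Mean runtime of the test runs, divided by the number of inputs per run.
    return sum(run_times) / len(run_times) / inputs_per_run


# Usage with a trivial stand-in for the network.
seconds = measure_inference_time(lambda x: sum(x), lambda: [0.0] * 256,
                                 warmup=10, runs=2, inputs_per_run=5)
print(f"{seconds * 1e3:.4f} ms per input")
```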

A. ANATOMICAL CORRECTNESS AND SEGMENTATION ACCURACY
In this work, we developed a multi-structure graph convolutional network (GCN) for cardiac ultrasound segmentation.

Comparison of input quality between samples with a high and a low inter-model agreement. The exact number of cases can be found in Table 5. Of the 100 high agreement samples, 93 were in distribution, while of the 100 low agreement samples, only 7 were in distribution.

While the GCN method can eliminate anatomically incorrect segmentations, it comes at the cost of slightly reducing the segmentation accuracy compared to U-Net. Whereas the GCN does not outperform U-Net in terms of accuracy for multi-structure segmentation, the keypoint representation has its own advantages. Both architectures have their unique data representation and solve the problem in a fundamentally different way. The pixel-wise approach of U-Net can give the most accurate segmentations, but can also produce anatomically incorrect results. On the other hand, the GCN with displacement method has lower accuracy but does not produce anatomically incorrect segmentations, as the cardiac shapes are embedded as a strong bias in the network. However, the GCN can still fail to place the anatomically correct shape correctly in the image. Fig. 7b shows an example where both the GCN and U-Net fail to generalize to samples with increased depth. From an abstract point of view, the GCN can be seen as a heavily regularised version of pixel-wise architectures. The displacement method regularizes the architecture even further. Finally, it is worth noting that nnU-Net is an order of magnitude larger and slower than the GCN.

B. ABLATION STUDY
The results of the ablation study show that reducing the depth of the decoder does not change the performance of the GCN significantly, indicating that the pre-trained CNN is the most important contributor. There are two possible explanations for this behavior. It could be that the architecture of the GCN decoder is not efficient for processing data in keypoint embedding space. On the other hand, it could also be that the decoder part is superfluous, as the shapes of the cardiac structures are simple enough not to require a shape encoding in the embeddings of the graph convolutional layers.

C. COMBINATION WITH U-NET
Combining GCN and U-Net in a cascade gives a mix of the characteristics of both architectures. For the GCN - nnU-Net cascade, there are no additional augmentations to the GCN output that serves as an extra input channel to the nnU-Net, as it is used out of the box. As a result, the nnU-Net cascade relies too much on the GCN output and only learns to make minor adjustments to the GCN intermediates. On the other hand, the GCN - U-Net 1 training procedure augments the GCN intermediates with additional rotation, scaling, and translation augmentations specifically for the cascade network, as described in subsection II-G1. This results in the U-Net correcting some cases of wrong placement by the GCN but also re-introduces anatomically incorrect outliers. Table 4 reflects these findings numerically.

D. INTER-MODEL AGREEMENT
Despite the excellent Dice and Hausdorff metrics of U-Net, this pixel-wise segmentation approach can fail completely in certain cases as shown in Fig. 1.In this work, we have proposed using the inter-model agreement as a simple and effective approach to detect these failing cases.
The distinct characteristics of U-Net and GCN make them appealing for joint use to quantify uncertainty.Although averaging their clinical estimates doesn't enhance accuracy, the agreement between the models offers valuable insight for identifying difficult cases and erroneous segmentations of U-Net.The leading approach to uncertainty quantification in deep learning is Bayesian Neural Networks [25], but their multiple passes during inference hinder real-time use.Deep ensembles are another option [25], but require multiple models to run on each input.Combining U-Net with GCN approximates deep ensembles practically.Specifically, U-Net and GCN are unlikely to make the same mistake since the GCN is designed to avoid the most common error of U-Net i.e. anatomically incorrect segmentations.
The results of the experiments using inter-model agreement indicate a clear distinction in input quality between high and low agreement cases. Fig. 9 shows significantly more high-quality images and fewer unsuitable cases with high inter-model agreement as compared to the cases with low inter-model agreement (p < 0.05). Furthermore, most of the cases with low inter-model agreement showed the wrong view. In only 15 out of 74 suitable low agreement cases was the view correct. The majority (55) of these wrong view cases are apical long axis (ALAX) views or views rotated towards ALAX. These images are in the dataset because the clinicians used these views in practice, even though they were instructed to use A4C and A2C views to measure EF. One of the reasons is that it is sometimes not possible to get a clear A2C or A4C view, so the clinician trades view correctness for image quality. However, the segmentation models are only trained on LV-focused A2C and A4C views, resulting in poor performance. Finally, only 7 out of 100 low agreement cases had suitable quality, the right view, and were LV-focused, as compared to 93 out of 100 high agreement cases. This demonstrates that the inter-model agreement estimator is an efficient method to detect out-of-distribution cases in the form of very low image quality or wrong views.
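As an illustration, the inter-model agreement can be sketched as a Dice score between the two segmentations of the same frame. The sketch below assumes that both outputs are available as integer label maps (the GCN contour having been rasterized beforehand), that the labels 1, 2, and 3 denote LV, MYO, and LA, and that the per-structure scores are averaged; these choices and the function names are assumptions for illustration, not a description of the released implementation. The default threshold of 0.85 is the one used for Fig. 11.

```python
import numpy as np


def dice_per_class(seg_a, seg_b, labels):
    """Dice overlap between two integer label maps, one score per label."""
    scores = []
    for label in labels:
        a = seg_a == label
        b = seg_b == label
        denom = a.sum() + b.sum()
        if denom == 0:
            # Neither model predicts this structure: count as full agreement.
            scores.append(1.0)
        else:
            scores.append(2.0 * np.logical_and(a, b).sum() / denom)
    return np.array(scores)


def inter_model_agreement(seg_gcn, seg_unet, labels=(1, 2, 3)):
    """Mean Dice over the given labels (assumed here: 1 = LV, 2 = MYO, 3 = LA)."""
    return float(dice_per_class(seg_gcn, seg_unet, labels).mean())


def is_low_agreement(seg_gcn, seg_unet, threshold=0.85):
    """Flag a frame whose inter-model Dice falls below the chosen threshold."""
    return inter_model_agreement(seg_gcn, seg_unet) < threshold
```

Because the score only compares the two model outputs, it requires no reference annotation and can be evaluated on every frame at inference time.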

E. CLINICAL EVALUATION
For accuracy on EF measurements, the nnU-Net exhibits the narrowest limits of agreement and the fewest outliers in comparison to both GCN and U-Net 1. However, when frames with low inter-model agreement are omitted, the limits of agreement for all three networks improve, as can be observed when comparing Figs. 10 and 11. This improvement can be attributed to the elimination of outliers stemming from problematic inputs or flawed segmentations. Thus, while the GCN does not improve EF accuracy directly, it can help discard cases where automatic EF is not appropriate, consequently narrowing the limits of agreement for the remaining cases. The negative mean in Figs. 10 and 11 shows that each of the deep learning methods underestimates the reference ejection fraction. Because there are many possible sources of error, both from the deep learning methods and from the reference annotations, the reason for the negative bias is not trivial. The article by Olaisen et al. [5] discusses this in more depth.
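The effect of discarding low agreement frames on the limits of agreement can be sketched as follows. The sketch uses the standard Bland-Altman definition (bias plus or minus 1.96 sample standard deviations of the differences) and assumes that one EF estimate and one inter-model Dice value are available per examination; the function names and the input layout are hypothetical.

```python
import numpy as np


def bland_altman(automatic, reference):
    """Return (bias, lower limit, upper limit) of agreement for paired measurements."""
    automatic = np.asarray(automatic, dtype=float)
    reference = np.asarray(reference, dtype=float)
    diff = automatic - reference
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def bland_altman_filtered(automatic, reference, agreement, threshold=0.85):
    """Same statistics, restricted to cases whose inter-model Dice exceeds the threshold."""
    keep = np.asarray(agreement, dtype=float) > threshold
    automatic = np.asarray(automatic, dtype=float)[keep]
    reference = np.asarray(reference, dtype=float)[keep]
    return bland_altman(automatic, reference)
```

A negative bias corresponds to the automatic method underestimating the reference EF. At least two cases must remain after filtering for the sample standard deviation to be defined.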

V. REAL-TIME DEMO APPLICATION
To demonstrate the potential of the inter-model agreement to detect out-of-distribution cases and faulty segmentations in real time, we created a demo application using the FAST framework [26]. It shows the segmentation output of the GCN and nnU-Net side by side, together with a status bar that visualizes the agreement between the models. Fig. 12 shows a screenshot of the application in action. We created a demo video [27] that shows the application in use while a clinician is operating a GE Vivid E95 scanner. The video demonstrates the effectiveness of the inter-model agreement as a method to detect out-of-distribution and low image quality cases. The video is available at https://doi.org/10.6084/m9.figshare.24230194.

VI. CONCLUSION
This paper proposes a multi-structure graph convolutional network (GCN) to reduce anatomically incorrect results produced by U-Nets in cardiac ultrasound segmentation. We propose the displacement method as a clinically motivated regularization that eliminates anatomically incorrect segmentations for multi-structure segmentation. The ablation study showed that changing the architecture of the GCN decoder does not affect performance significantly. This indicates that the encoder is the most important component of the network. Measuring the inter-model agreement between GCN and U-Net yields a simple and highly effective method for quantifying uncertainty and thereby detecting out-of-distribution and failing cases. Whereas the average performance of both U-Net and the GCN is within inter-observer variability, their worst-case behavior is still drastically worse than that of human annotators. These outliers occur mostly due to a lack of generalization capabilities, which is still a fundamental problem in cardiac segmentation. Low inter-model agreement can detect out-of-distribution cases and can thus be used as a warning signal to users of the automatic segmentation tool in practice.
GCNs are a relatively new concept for cardiac segmentation that shows promising initial results towards more robust cardiac segmentation. The GCN can be implemented as a robust independent model or as a supplementary model to validate the segmentation outcomes of pixel-wise techniques such as U-Net.

FIGURE 1. Though U-Nets can achieve a high average Dice accuracy on large datasets (> 0.94), they can still produce anatomically incorrect results, as shown here on the left, with multiple atria disconnected from the LV and an incoherent myocardium around the LV, none of which are anatomically plausible. The anatomically correct output of the proposed graph convolutional network (GCN) is shown on the right.

FIGURE 2. The architecture of the GCN. The CNN encoder transforms the input ultrasound image of width W and height H to an embedded vector of size X. A dense layer transforms this embedding to an embedding in keypoint space, with 107 keypoints and C_1 channels. The decoder consists of a sequence of graph convolutions over these keypoint embeddings. The final outputs are the 2D coordinates of the keypoints in the image.

FIGURE 3. Schematic diagram showing the preprocessing to transform pixel labels to keypoint positions. A and B are the base points of the LV. C and D are the base points of the MYO. E, F, and G are the apexes of the LV, MYO, and LA, respectively.

FIGURE 4. Schematic diagram showing the multi-structure convolution. q_1..n are zero-padded epicardium keypoints, and p_1..n are endocardium and left atrium keypoints. The highlighted points are used as input to update the embedding of the purple keypoint, with w the primary receptive field and v the secondary receptive field. For illustrative purposes, the diagram does not show the actual number of keypoints used in this work.

FIGURE 5. Graphical representation of the GCN - U-Net cascade. The cascade model concatenates the initial segmentation produced by the GCN with the original ultrasound input and feeds this to the U-Net to produce the final, refined segmentation. The segmentations shown are only illustrative.

FIGURE 6. Anatomically incorrect case for the GCN without the displacement method. The predicted keypoints of the endo- and epicardium are flipped and partly overlapping. The displacement method eliminates these errors.

FIGURE 7. Case analysis and comparison of the GCN with the displacement method and nnU-Net on CAMUS. The cases are selected based on the overall Dice score between the annotation and the GCN or U-Net segmentations.

FIGURE 8. Histogram of inter-model Dice scores between nnU-Net and GCN on the HUNT4 evaluation set. The upper and lower thresholds determine the set of images that are sampled for the experiment that evaluates inter-model agreement as a quality predictor.

FIGURE 9. Comparison of input quality between samples with a high and a low inter-model agreement. The exact number of cases can be found in Table 5. From the 100 high agreement samples, 93 were in distribution, while from the 100 low agreement samples, only 7 were in distribution.

FIGURE 10. Bland-Altman plots of the automatic EF measurements compared to the manual reference. All frames with a usable segmentation are used.

FIGURE 11. Bland-Altman plots of the automatic EF measurements compared to the manual reference when only using the frames for which the inter-model Dice agreement is more than 0.85.

FIGURE 12. Screenshot of the real-time demo application. The GCN and nnU-Net segmentations are shown on the left and right side, respectively. The color-coded status bar on top visualizes the agreement between the models.

TABLE 2. Ablation study GCN part one. The difference in Dice score between the best and worst performing variation is not statistically significant with or without the displacement method.

TABLE 3. Ablation study GCN part two. The difference in Dice score between the two decoders is statistically significant.

TABLE 4. Anatomical correctness and segmentation accuracy (Dice and Hausdorff) comparison with SOTA on all cross-validation splits of CAMUS.

Table 7 summarizes the inference times of the proposed architectures and baseline models on an NVIDIA GeForce

TABLE 6. Evaluation of automatic EF estimation on HUNT4. The mean absolute error of nnU-Net is statistically significantly lower than that of all the other models.

TABLE 7. Inference time comparison of the proposed architectures and baseline models on an NVIDIA GeForce RTX 3070 Ti Laptop GPU.