Deep Learning Technique for Congenital Heart Disease Detection Using Stacking-Based CNN-LSTM Models From Fetal Echocardiogram: A Pilot Study

Congenital heart defects (CHDs) are a leading cause of death in infants under 1 year of age. Prenatal intervention can reduce the postnatal risk for patients with serious CHDs, but current diagnosis is based on qualitative criteria, which can lead to variability in diagnosis between clinicians. The objective of this study is to detect morphological and temporal changes in cardiac ultrasound (US) videos of fetuses with hypoplastic left heart syndrome (HLHS) using deep learning models. A small cohort of 9 healthy and 13 HLHS patients was enrolled, and ultrasound videos at three gestational time points were collected. The videos were preprocessed and segmented into cardiac cycle videos, and five different deep learning CNN-LSTM models were trained (MobileNetv2, ResNet18, ResNet50, DenseNet121, and GoogleNet). The three top-performing models were used to develop a novel stacking CNN-LSTM model, which was trained using five-fold cross-validation to classify HLHS and healthy patients. The stacking CNN-LSTM model outperformed the other pre-trained CNN-LSTM models, with accuracy, precision, sensitivity, F1 score, and specificity of 90.5%, 92.5%, 92.5%, 92.5%, and 85%, respectively, for video-wise classification, and of 90.5%, 92.5%, 92.5%, 92.5%, and 85%, respectively, for subject-wise classification using ultrasound videos. This study demonstrates the potential of deep learning models to classify prenatal CHD patients from ultrasound videos, which can aid the objective assessment of the disease in a clinical setting.


I. INTRODUCTION
Congenital heart defects (CHDs) affect about 1% of all live births worldwide [1]. Generally, hypoplasia refers to a condition of delayed or stunted development in which an organ or a part of it remains below its normal size or remains immature [2]. Hypoplastic left heart syndrome (HLHS) is a group of cardiac malformations characterized by underdevelopment of both the aorta and the left heart, resulting in significantly impaired blood flow into the systemic circulation and inadequate support for the circulation by the left heart [3]. HLHS is a very severe form of CHD characterized by an insufficient and non-viable left ventricle (LV), caused by congenital abnormalities that compromise the LV's ability to perform its perfusion function [4]. The incidence of HLHS is estimated to be between 0.016% and 0.036% of all live births; it occurs in approximately 2 out of every 10,000 pregnancies [5], [6]. HLHS accounts for 1% to 3.8% of congenital cardiac malformations and 8% to 12% of heart defects in infants with critical heart disease, and it is responsible for 25% to 40% of all neonatal cardiac mortality [7], [8].
There is currently no definitive explanation for the etiology of HLHS. The higher incidence of HLHS in families with a disease history suggests a genetic contribution. In some children, isolated HLHS is known to have a genetic basis. These cases may be due to mutations in the GJA1 gene with autosomal recessive inheritance or the NKX2-5 gene with autosomal dominant inheritance [9], [10]. However, in the majority of cases, the disease is diagnosed without any genetic relevance. Clinically, disturbed hemodynamics have been shown to be a major contributor to the fetal development of the disease [11], [12], [13]. The growth of the left ventricle is hindered when blood flow is disturbed or when the foramen ovale is affected during fetal development. Patients with HLHS have a diminished foramen ovale [14]. HLHS is also associated with anatomical abnormalities of the atrial septum, in which the superior edge of the septum and/or the primum deviates posteriorly and leftward, resulting in obstruction of the atrial shunt [15]. Abnormal development of the cardiac valves or of the left ventricle itself may also be associated with HLHS [16], [17]. We have recently characterized the evolving hemodynamics in normal and HLHS-diagnosed human fetuses and demonstrated severe hemodynamic abnormalities in fetal HLHS hearts [18], [19]. Animal studies supported these observations: surgical interventions causing blood flow abnormalities resulted in ventricular hypoplasia in the developing embryo [20], [21], [22].
The main treatment for HLHS involves a series of surgical procedures aimed at establishing the right ventricle as the main pumping chamber of the heart after birth. These procedures, collectively known as surgical palliation for HLHS neonates, involve three steps: the Norwood Procedure, the Bi-directional Glenn Operation, and the Fontan Operation [23]. These procedures establish a new functional systemic circuit in patients with HLHS [24]. During the initial days of a newborn's life, the Norwood Procedure is conducted to establish the right ventricle as the primary pump for pulmonary and systemic circulation throughout the body. This is achieved through a connection made between the left and right atria via atrial septectomy. Subsequently, the narrowed outflow tract is reconstructed by creating a connection between the right ventricle and the aorta using tissue grafts from the distal main pulmonary artery. The final step in providing pulmonary blood flow is the aortopulmonary shunt, which connects the aorta with the main pulmonary artery [25]. After a six-month recovery period following the Norwood surgery, the bidirectional Glenn procedure is performed [26]. During this procedure, the shunt placed between the pulmonary arteries and the right pulmonary artery during the Norwood procedure is disconnected, and the right pulmonary artery is then connected to the superior vena cava (SVC) [26]. This allows blood from the upper part of the body to enter the pulmonary artery directly, bypassing the ventricles. The Fontan operation is the third and final surgical procedure and is typically performed 18 to 36 months after the Glenn procedure. During this procedure, a channel is created through or outside the heart to connect the vena cava to the pulmonary artery and direct blood flow to the pulmonary artery [26].
Recently, alternative surgical approaches have been proposed for treating HLHS in the fetus. One such approach is fetal valvuloplasty (FV), which is aimed at improving left heart hemodynamics, promoting growth, and maintaining biventricular circulation at birth [27]. FV may be performed to prevent the progression of severe mid-gestation US [28]. The FV procedure involves balloon dilation of the aorta to reduce fetal aortic stenosis in utero [28], [29], [30], [31], [32], [33], [34], [35]. In a pioneering study that applied this approach to 100 HLHS-diagnosed fetuses, 43% of live-born patients had biventricular circulation, demonstrating the feasibility of the approach [33].
The fetal diagnosis of HLHS is of utmost importance for therapy planning, as well as for the advancement of new approaches such as the fetal surgeries mentioned earlier. There are multiple tools available for the diagnosis of HLHS, including Computed Tomography Angiography (CTA), Cardiac Catheterization, Chest X-ray Radiography (CXR), Electrocardiography (ECG), and Echocardiography. However, all these techniques, except for Echocardiography, are difficult to apply to the fetus due to limitations such as invasiveness, radiation hazard, or noisy acquisition. Echocardiography, on the other hand, poses no danger to the fetus, as it does not involve radiation, and it can provide accurate images and real-time measurements. A B-mode scan can be used to evaluate heart anatomy and ventricular position, while M-mode and Doppler scans can be used for assessing valvular and vascular functionality [36]. The diagnosis of fetal HLHS through echocardiography relies heavily on qualitative criteria, which may lead to variations in diagnosis among clinicians. Despite this limitation, echocardiography remains a valuable tool for diagnosing fetal HLHS, as it allows for visualization of shunt flow, evaluation of the atrial and ventricular septum, assessment of vessel anatomy, and provision of functional information regarding the atrioventricular and outflow valves [37].
110376 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Recent studies by Masaaki et al. [38] introduced SONO, an architecture that employs a CNN to identify cardiac substructures and anomalies within fetal ultrasound videos. The technique involves a timeline visualization for detection likelihood and the computation of anomaly scores. The assessment focuses on cardiac structural anomalies (specifically Heart and Vessels), using area under the curve-receiver operating characteristic (AUC-ROC) analysis, and demonstrates competitive performance compared to established methods. Gangadhar et al. [39] investigated the feasibility of utilizing deep learning algorithms, specifically Artificial Neural Networks, to predict coronary artery disease at an early stage. The research aimed to enhance cardiac diagnosis and preventive measures through effective analysis of data patterns. Gonsalves et al. [40] explored the utilization of medical data for CHD prediction through Naive Bayes, Support Vector Machine, and Decision Tree ML methods, highlighting the potential of Naive Bayes probabilistic models in enhancing CHD detection.
Several computational approaches have been introduced in the medical field to advance diagnosis and therapy in severe clinical conditions. Computer-Assisted Diagnosis (CAD), for instance, has revolutionized medical image analysis, from oncology to cardiology. CAD medical image analysis has been attempted since 1965, when J.M. Prewitt and others published papers on the use of computerized image analysis of cell images [41], [42]. The introduction of Machine Learning (ML) has revolutionized the image analysis field in medicine. Deep neural networks such as Convolutional Neural Networks (CNNs) [43] and Long Short-Term Memory (LSTM) networks [44], as well as their hybrid CNN-LSTM model, have shown outstanding performance in different Computer Vision problems [45]. A CNN has the advantage of automatically extracting useful spatial features from an image [46], [47], while an LSTM is popular for extracting important temporal features [48] from a sequence of images, frames, or a video. Therefore, the combination of CNN and LSTM can offer spatial and temporal feature extraction on temporally varying image or video data. Architectures combining CNN and LSTM have been successfully used in Natural Language Processing (NLP) applications [49], Speech Recognition [50], Video Description [51], Action Recognition [52], and so on.
The importance of echocardiography in the diagnosis of cardiovascular diseases (CVDs) is evident, as it is the only imaging method that enables real-time imaging of the heart, thereby allowing for the immediate detection of various abnormalities [53]. Combining clinician interpretation and machine learning (ML) has the potential to improve the accuracy of echocardiography by reducing inter- and intra-operator variability. In addition, ML can provide predictive information that may be too subtle for humans to detect [54]. Limitations of echocardiography include the heavy reliance on the operator's experience and the qualitative interpretation of the heart's anatomical features [55]. These limitations can be addressed by integrating ML into echocardiography, which can introduce more automated and quantitative parameters [56], [57], [58], [59]. The datasets generated from echocardiography, particularly with the advancements in techniques such as 3D echocardiography, are often underutilized, and vast amounts of data remain uninterpreted. To bridge the gap between clinical and echocardiographic data, the introduction of ML and deep learning algorithms can be of great assistance [60], [61], [62], [63], [64], [65], [66].
Accurate fetal diagnosis of CHDs is a particularly important area that will benefit from adopting ML approaches to advance echocardiography-based diagnosis. As in the case of HLHS, it is challenging to obtain an accurate diagnosis with conventional echocardiography approaches. Here, the structure and function of the fetal heart can be assessed through a variety of Ultrasound (US) techniques, including conventional 2-D imaging, M-mode imaging, and tissue Doppler imaging, among others. However, it remains difficult to assess the fetal heart due to the involuntary movements of the fetus and its small size, in addition to some sonographers' lack of expertise in fetal echocardiography [67]. Despite the great potential, to date, there is no study on the application of ML for the advancement of fetal diagnosis of CHDs. In this study, we aim to develop a deep learning technique for the automatic diagnosis of HLHS from fetal B-mode echocardiography. The main contributions of the paper are:
The study included 13 subjects with HLHS and 9 healthy control subjects.

B. PATIENT SELECTION CRITERIA
Women in the control group were deemed eligible if they had a scheduled routine fetal ultrasound examination between weeks 16 and 18 of pregnancy. Women referred for determination of gestational age, growth discrepancy, a preceding miscarriage, inability to detect a fetal heartbeat, or other miscellaneous reasons were also eligible to be part of the control group, provided the fetus was determined to be normal. The patient selection diagram is presented in Figure 1, which illustrates the number of CHD subjects (Figure 1(A)) and the number of healthy subjects. Here, trimester refers to one of the three distinct periods into which a pregnancy is divided.

C. ACQUISITION OF ECHOCARDIOGRAPHY VIDEOS
All examinations were conducted by a single experienced fetal cardiologist with ample background in fetal echocardiogram examination, using the Voluson E10 (General Electric) Ultrasound System. The Ultrasound examinations were performed with the GE RAB6-D 4D convex probe, following the guidelines issued by the American Society of Echocardiography and the standards for the performance of fetal echocardiography. A detailed evaluation of all essential components of the fetal echocardiogram was obtained, including the four-chamber view, the diameters of the mitral and tricuspid valve annuli, and the lengths of the left and right ventricles. Flow patterns across the atrioventricular and semilunar valves were evaluated using color Doppler. Doppler indices were obtained by placing the sample volume

D. MACHINE LEARNING MODEL DEVELOPMENT
In this study, CNN-LSTM deep learning architectures with five different encoders (MobileNetv2, ResNet18, ResNet50, DenseNet121, and GoogleNet) were investigated to extract useful temporal and spatial features. These features were used to train a fully connected network (FCN)/multilayer perceptron (MLP) classifier. The predictions of the three best encoder-based CNN-LSTM models were then used to train a meta-classifier, a novel stacking model for the early and precise detection of CHD patients. The classification results are reported both video-wise and subject-wise. Figure 2 illustrates the schematic overview of the methodology.

E. DATASET DESCRIPTION
A total of 13 HLHS and 9 control subjects were included in the study, and multiple B-mode Ultrasound videos were available for each subject in the dataset. Echocardiography was conducted at three time points: 1) 16-19 weeks gestation, 2) first follow-up at 23-26 weeks gestation, and 3) second follow-up at 31-34 weeks gestation. Each video was segmented into short videos based on the cardiac cycle. Table 1 presents the number of available videos with corresponding time points and the number of segmented videos for each patient.

F. DATASET PREPROCESSING
Four steps were used to process the echocardiogram videos of healthy and CHD patients: 1) cleaning and cropping the videos, 2) improving the video quality, 3) segmenting the videos, and 4) subject-wise train-test splitting for five-fold cross-validation for CNN-LSTM model development, validation, and testing.

1) VIDEO CLEANING & CROPPING
All static information, such as text or color markings, was removed from the US videos. First, fixed text and color marking regions were detected in each frame using the Canny edge detection approach. The identified regions were then replaced with corresponding background content from the same frame or neighboring frames. The cleaned videos were then cropped by 10% from each side to remove unnecessary parts (black areas) of the ultrasound videos, which sharpens the visual focus while preserving the aspect ratio for uniformity. Finally, all frames of each RGB video were resized to 224 × 224, the input size of the model. Figure 2 shows a sample of raw ultrasound frames, frames with static information removed, and cropped frames.
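The cropping step can be illustrated with a minimal sketch. It assumes a frame stored as a plain list of pixel rows; the function name `crop_frame` and the toy frame are hypothetical and not taken from the study, and the resizing to 224 × 224 is not shown.

```python
def crop_frame(frame, fraction=0.10):
    """Remove `fraction` of the rows and columns from each side of a 2-D frame."""
    height, width = len(frame), len(frame[0])
    dy, dx = int(height * fraction), int(width * fraction)
    return [row[dx:width - dx] for row in frame[dy:height - dy]]


# Hypothetical 10 x 20 frame whose pixel value encodes its position.
frame = [[r * 20 + c for c in range(20)] for r in range(10)]
cropped = crop_frame(frame)
print(len(cropped), len(cropped[0]))  # 8 16
```

Because the same fraction is removed from every side, the 1:2 aspect ratio of the toy frame is preserved after cropping.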

2) VIDEO ENHANCEMENT
For each frame of the videos, gamma correction was applied to enhance the video quality. Typically, linear operations, such as scalar multiplication, addition, and subtraction, are performed on individual pixels in image normalization [1]. In gamma correction, by contrast, the pixels of the source image are subjected to a non-linear operation. Gamma correction alters each pixel value to improve the image, using the projection relationship between the pixel value and the gamma value according to an internal map. Let P denote the set of pixel values in the range [0, 255], and let x be the grayscale value of a pixel (x ∈ P) in Equations (1)-(4). Let x_m be the midpoint of the range [0, 255]. A linear map ϕ takes group P to a group of angle values, and a mapping h takes this group of angle values to the group of gamma values, where a ∈ [0, 1] denotes a weighting factor.
Based on this map h, the pixel values of group P can be related to the gamma values. The output for an arbitrary pixel value is calculated in relation to a given gamma number. Let γ(x) = h(x); the gamma correction function g(x) is shown in Equation (5), where g(x) represents the corrected output pixel value in grayscale.
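As a point of comparison, the sketch below applies conventional power-law gamma correction with a single fixed gamma. This is a simplification: the correction described above derives a pixel-dependent γ(x) from the map h, which is not reproduced here. The form 255 · (x / 255)^(1/γ) and the function names are assumptions for illustration.

```python
def gamma_correct(pixel, gamma):
    """Power-law correction of one 8-bit grayscale value with a fixed gamma.

    Assumes the common form g(x) = 255 * (x / 255) ** (1 / gamma),
    not the pixel-dependent gamma described in the text.
    """
    return round(255 * (pixel / 255) ** (1.0 / gamma))


def gamma_correct_frame(frame, gamma):
    """Apply the same fixed-gamma correction to every pixel of a 2-D frame."""
    return [[gamma_correct(p, gamma) for p in row] for row in frame]


# Hypothetical 2 x 3 grayscale frame.
frame = [[0, 64, 128], [192, 255, 32]]
brightened = gamma_correct_frame(frame, gamma=2.0)
print(brightened)
```

With this convention, a gamma above 1 brightens mid-range pixels while leaving 0 and 255 unchanged.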

3) VIDEO SEGMENTATION
The US videos were segmented into individual heart cycle video segments using expert medical annotation. To feed these segments into the deep learning model, a fixed number of frames per video was required. The videos were brought to an average frame rate of 33 frames per second (fps). However, some segments contained far fewer than 20 frames, while others contained far more. Therefore, each segment was pre-processed to a fixed length of 20 frames. The choice of 20 frames per segment was likely based on a combination of factors, such as the expected duration of a heart cycle, computational efficiency, and model requirements. Segments with fewer than 20 frames were extended to 20 frames by replicating the last frame, and segments with more than 20 frames were downsampled to 20 frames, ensuring a uniform length across all videos. The process of video upsampling using padded frames is illustrated in Figure 4.
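The length normalization can be sketched as follows. Padding by repeating the last frame follows the description above; the uniform index sampling used for longer segments is an assumption, since the text does not state how frames were selected during downsampling. The name `to_fixed_length` is hypothetical.

```python
def to_fixed_length(frames, target=20):
    """Return exactly `target` frames from a cardiac-cycle segment.

    Shorter segments are padded by repeating the last frame.
    Longer segments are downsampled by uniform index sampling,
    which is an assumed strategy, not one stated in the study.
    """
    n = len(frames)
    if n == 0:
        raise ValueError("segment has no frames")
    if n < target:
        return list(frames) + [frames[-1]] * (target - n)
    return [frames[(i * n) // target] for i in range(target)]


# Integers stand in for frames so the selected indices are visible.
short_segment = to_fixed_length(list(range(15)))
long_segment = to_fixed_length(list(range(40)))
print(short_segment[-6:])  # [14, 14, 14, 14, 14, 14]
print(long_segment[:5])    # [0, 2, 4, 6, 8]
```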

4) TRAIN-TEST FOLD CREATION AND AUGMENTATION
Five different CNN-LSTM models were investigated along with a novel stacking model.
Five-fold subject-wise cross-validation was used in this study, where 80% of the videos were used for training (10% of which were used for validation) and 20% for testing. To avoid overfitting, the training data classes were balanced, since the HLHS and healthy classes are not of equal size [2]. We used three popular image augmentation techniques (rotation, scaling, and translation) to balance the training set. Each image in the video was rotated by an angle of 5 to 10 degrees clockwise and counterclockwise. Each frame of the videos was scaled (magnified or reduced) by 2.5% to 10%. The images were translated horizontally and vertically by 5% to 10%. The weighted average of the five folds is reported for each performance metric. Table 2 shows the number of training, validation, and test samples used in this study.
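The key property of a subject-wise split is that all videos of one subject fall into the same fold, so no subject appears in both the training and the test set. A minimal sketch, assuming a round-robin assignment of subjects to folds (the study does not describe its assignment procedure, and class stratification and augmentation are not shown); the subject identifiers are hypothetical:

```python
def subject_wise_folds(video_subjects, n_folds=5):
    """Build (train_indices, test_indices) pairs with no subject overlap.

    `video_subjects` holds one subject id per video. Subjects are assigned
    to folds round-robin, which is an assumed strategy for illustration.
    """
    subjects = sorted(set(video_subjects))
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    folds = []
    for k in range(n_folds):
        test = [i for i, s in enumerate(video_subjects) if fold_of[s] == k]
        train = [i for i, s in enumerate(video_subjects) if fold_of[s] != k]
        folds.append((train, test))
    return folds


# Hypothetical cohort: 22 subjects with 3 videos each.
video_subjects = [f"S{n:02d}" for n in range(22) for _ in range(3)]
folds = subject_wise_folds(video_subjects)
for train, test in folds:
    print(len(train), len(test))
```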

G. DEVELOPMENT OF CLASSIFICATION MODEL
US videos are composed of spatial and temporal features. The spatial feature of a US video is the shape of the heart chamber, while the changes in the chamber shape during systolic and diastolic events constitute the temporal feature. In this study, CNN and LSTM layers were utilized for the extraction of spatial and temporal features, respectively. Deep CNNs have been widely employed for image classification due to their superior performance in comparison to other machine learning methods. These networks are capable of automatically extracting the spatial features of an image. Transfer learning has been successfully incorporated in many applications [3], [4], [5], [6], [7], especially where a large dataset is hard to find. It thus opens the opportunity of utilizing a smaller dataset and reduces the time required to develop a deep learning algorithm from scratch [8], [9]. In this study, we used five pre-trained deep CNN models that are predominantly used in the literature: ResNet18, ResNet50 [10], DenseNet121 [11], MobileNetV2 [12], and GoogleNet [11]. The feature vector obtained after the flattening layer of the CNN was fed to the LSTM layers and then to the fully connected network (FCN) or MLP for classification.
In Ultrasound videos, there is a degree of temporal connection between consecutive frames, which contains the information about systolic and diastolic events. Neural networks such as vanilla recurrent neural networks (RNNs) can model the temporal connections in video data. LSTM [13] layers are widely used in different Ultrasound video classification tasks [14], [15]. LSTM performs better than a vanilla RNN in this task, where information preserved from previous frames helps in understanding the present frame [16]. An LSTM layer has cell states and hidden states, and it adds or removes information from the cell state by means of regulating gates. Moreover, LSTM resolves the vanishing gradient problem of the vanilla RNN through its additive gradient mechanism [16]. Two LSTM layers with 256 hidden states and a 20% dropout rate were used in this study for temporal feature extraction. Figure 5 represents the architecture of the LSTM module in the CNN-LSTM.
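To make the gating and the additive cell-state update concrete, the following toy sketch runs one standard LSTM step for a single hidden unit with scalar input. It is not the study's configuration (two layers with 256 hidden states, trained in PyTorch); the weights and the input sequence are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM; `w` maps gate name -> (w_x, w_h, bias)."""
    def pre_activation(name):
        w_x, w_h, bias = w[name]
        return w_x * x + w_h * h_prev + bias

    forget_gate = sigmoid(pre_activation("forget"))    # how much old state to keep
    input_gate = sigmoid(pre_activation("input"))      # how much new content to add
    output_gate = sigmoid(pre_activation("output"))    # how much state to expose
    candidate = math.tanh(pre_activation("candidate"))

    # Additive update: the old cell state is scaled and summed with new content,
    # rather than being passed through a squashing non-linearity at every step.
    c = forget_gate * c_prev + input_gate * candidate
    h = output_gate * math.tanh(c)
    return h, c


# Hypothetical weights and a 20-step input sequence (one value per frame).
weights = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "output", "candidate")}
h, c = 0.0, 0.0
for x in [0.2] * 20:
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```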
In this study, the stacking approach was deployed with the three top-performing CNN-LSTM models (with three different pre-trained encoders) as base learners, and a logistic regression classifier was used as the meta-learner to identify the CHD patients. Consider a single dataset $A$, which consists of input vectors ($x_i$) and their classification scores ($y_i$). First, a set of base-level CNN-LSTM classifiers $M_1, \ldots, M_p$ is trained, and the predictions of these base learners are used to train the logistic regression-based meta-level classifier $M_f$, as illustrated in Figure 6.
We used five-fold cross-validation to generate a training set for the meta-level classifier. In each round, the base-level classifiers were trained on four folds, leaving one fold for testing. Each base-level classifier produces a probability value for each possible class. Thus, for an input $x$, a probability distribution is created from the predictions of a classifier $M$ in the base-level classifier set:

$$P_M(x) = \left(P_M(c_1 \mid x),\, P_M(c_2 \mid x),\, \ldots,\, P_M(c_n \mid x)\right) \tag{6}$$

where $(c_1, c_2, \ldots, c_n)$ is the set of $n$ possible class values, $m$ denotes the number of subjects, and $P_M(c_i \mid x)$ denotes the probability that example $x$ belongs to class $c_i$, as estimated (and predicted) by the classifier $M$ in Equation (6). The class $c_i$ with the highest class probability $P_{M_j}(c_i \mid x)$ is predicted by classifier $M_j$. The attributes of the meta-level classifier $M_f$ are thus the probabilities predicted for each possible class by each of the base-level classifiers, i.e., $P_{M_j}(c_i \mid x)$ for $i = 1, \ldots, n$ and $j = 1, \ldots, p$, where $n$ and $p$ denote the number of classes and the number of base learners, respectively. The pseudo-code for the stacking approach is shown in Algorithm 1.
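The meta-level step can be sketched in plain Python, assuming the out-of-fold probabilities of three base learners are already available. For a binary problem, one probability per base learner (here P(HLHS | x)) carries the full class distribution. The probability values, the gradient-descent settings, and the function names are hypothetical; this is not the study's Algorithm 1.

```python
import math


def predict_proba(x, w, b):
    """Logistic regression probability for one meta-feature vector."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n_feat, n = len(features[0]), len(features)
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(features, labels):
            error = predict_proba(x, w, b) - y
            for j in range(n_feat):
                grad_w[j] += error * x[j]
            grad_b += error
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical out-of-fold P(HLHS | x) from three base learners, one row per video.
meta_features = [
    [0.91, 0.85, 0.78], [0.80, 0.60, 0.95], [0.70, 0.88, 0.66],
    [0.15, 0.30, 0.22], [0.35, 0.10, 0.28], [0.05, 0.25, 0.40],
]
meta_labels = [1, 1, 1, 0, 0, 0]  # 1 = HLHS, 0 = healthy

w, b = train_logistic(meta_features, meta_labels)
print(predict_proba([0.9, 0.8, 0.7], w, b) > 0.5)
```

Training the meta-learner only on out-of-fold predictions keeps it from seeing base-learner outputs on data those learners were fitted to.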

H. DECISION FUNCTION
A decision function was used in this study to make the final classification decision. The decision was made using two different approaches: Ultrasound video-wise and subject-wise. For video-wise decisions, the prediction probability scores of the segmented 20-frame short videos were averaged to produce one final decision for the full video. Similarly, for subject-wise decisions, the average of the prediction probability scores of the different Ultrasound videos of the subject was used to produce the final decision. Equation (7) shows the final decision function:

$$\sigma = \frac{1}{n} \sum_{i=1}^{n} P_i(x) \tag{7}$$

where $\sigma$ is the decision function over $n$ segmented or full videos, and $P_i(x)$ is the probability score of the $i$-th segmented or full video.
The mean probability score of all segmented videos was used to make the decision for the full Ultrasound video. Similarly, the final decision for a subject was produced by taking the mean of the probability scores of all videos of that subject.
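The two-level averaging can be sketched as follows. The 0.5 decision threshold, the function name, and the probability values are assumptions for illustration; the study does not state the threshold it applied to the mean score.

```python
def mean_probability_decision(probabilities, threshold=0.5):
    """Average the probability scores and threshold the mean.

    Returns (mean_probability, label), where label 1 stands for HLHS.
    The 0.5 threshold is an assumption, not a value stated in the study.
    """
    if not probabilities:
        raise ValueError("at least one probability score is required")
    mean_p = sum(probabilities) / len(probabilities)
    return mean_p, int(mean_p >= threshold)


# Hypothetical P(HLHS) scores for the cardiac-cycle segments of two videos
# that belong to the same subject.
subject_videos = {
    "video_1": [0.9, 0.8, 0.7],
    "video_2": [0.4, 0.5],
}

# Video-wise decision: average over the segments of each video.
video_scores = {
    name: mean_probability_decision(scores)[0]
    for name, scores in subject_videos.items()
}

# Subject-wise decision: average over the videos of the subject.
subject_score, subject_label = mean_probability_decision(list(video_scores.values()))
print(video_scores, subject_score, subject_label)
```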

I. EXPERIMENTAL SETUP
This study was carried out with the PyTorch package and Python 3.7. Google Colab Pro was used to train all the models, with a 16 GB Tesla T4 GPU and 120 GB of high RAM. Table 3 shows the training settings that were employed in this experiment.

J. EVALUATION METRICS
Precision, sensitivity, specificity, accuracy, F1 score, and the receiver operating characteristic (ROC) curve with the area under the curve (AUC) were used to evaluate the performance of the different classifiers. Weighted metrics per class and overall accuracy were used, as the two classes had different numbers of instances. Equations (8)-(12) give the mathematical expressions of the five evaluation measures (overall accuracy, weighted precision, sensitivity or recall, F1 score, and specificity):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{9}$$

$$\text{Recall/Sensitivity} = \frac{TP}{TP + FN} \tag{10}$$

$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{11}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{12}$$

Here, true positive (TP), true negative (TN), false positive (FP), and false negative (FN) denote the number of HLHS videos or subjects identified as HLHS, the number of healthy videos or subjects identified as healthy, the number of healthy videos or subjects incorrectly identified as HLHS, and the number of HLHS videos or subjects incorrectly identified as healthy, respectively. We report the weighted sensitivity, specificity, precision, and F1 score, as well as the overall accuracy, each with a 95% confidence interval.
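The five measures follow directly from the four confusion-matrix counts. A minimal sketch of the plain (unweighted) binary definitions, with HLHS as the positive class; the counts are hypothetical and not taken from the study's confusion matrices, and the class weighting and confidence intervals reported in the study are not computed here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Unweighted binary metrics from confusion-matrix counts (HLHS = positive).

    Assumes every denominator is non-zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=8, tn=6, fp=2, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```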

III. RESULTS AND DISCUSSION
The heart is the first functional organ of the fetus. It therefore continues to function while it develops. Since blood flows constantly through a developing heart, it has been suggested that hemodynamic forces (i.e., forces exerted on cardiac tissue by flowing blood) are an important epigenetic factor governing cardiogenesis. Congenital heart defects (CHDs) form during the very complex events of heart development. These defects affect about 1% of newborn children and are the leading cause of death in infants under 1 year of age. CHDs can be detected prenatally via medical imaging, and echocardiography is the most widely used technique for this purpose. Real-time imaging via echocardiography enables the assessment of heart morphology (i.e., the size of the heart chambers, valves, etc.), whereas Doppler echocardiography enables the measurement of blood flow velocities through the heart (i.e., inflow through heart chambers, flow through heart valves, etc.) and hence the evaluation of heart function. For example, prenatal echo can detect one of the most serious types of CHDs, ventricular hypoplasia (underdevelopment), as early as the 18th week of gestation, with high accuracy. Echocardiography has revealed the presence of disturbed hemodynamics in hypoplastic fetal hearts, and the associated abnormal forces are thought to contribute to the development of this condition. However, this evaluation is highly subjective, and computer-aided diagnosis can help here significantly. This work used echocardiography videos of a small cohort of healthy and CHD (mainly HLHS) patients to develop a deep learning-based detection system that classifies HLHS and healthy subjects automatically and reliably. Figure 7 shows sample 4-chamber views of the heart of healthy and HLHS patients to illustrate the difference in the morphology of the heart chambers between the groups during the different gestational weeks.
This study investigated and compared five different deep learning CNN-LSTM architectures using five pre-trained models (MobileNetv2, ResNet18, ResNet50, DenseNet121, and GoogleNet), for the purpose of developing a novel stacking model to predict HLHS patients from Ultrasound videos. The results are reported using video-wise and subject-wise evaluations.

A. VIDEO-WISE CLASSIFICATION
As discussed above, this study analyzed different deep learning CNN-LSTM models and stacking models to classify HLHS or healthy patients using echocardiogram videos. Among the individual models, the MobileNetv2-LSTM architecture performed best, with accuracy, precision, sensitivity, F1 score, and specificity of 88.9%, 92.4%, 90.3%, 91.6%, and 86%, respectively. The stacking model was developed using the probability scores of the top-3 performing models (MobileNetv2-LSTM, ResNet18-LSTM, and GoogleNet-LSTM), which improved the result by ∼2%, with accuracy, precision, sensitivity, F1 score, and specificity of 89.5%, 92.5%, 91.8%, 92.5%, and 86.2%, respectively. Table 4 compares the different CNN-LSTM models and the stacking model for video-wise HLHS and healthy patient classification.
Figure 8 shows the receiver operating characteristic (ROC) curve with the area under the curve (AUC), also known as AUROC (area under the receiver operating characteristic), for video-wise HLHS classification using Ultrasound videos, which is one of the most important evaluation metrics for checking any classification model's performance. It is apparent from the ROC curves that the stacking CNN-LSTM model outperformed the other networks, with an AUC of 93.7%, whereas the AUC of the best performing CNN-LSTM (MobileNetv2-LSTM) model is 92.2%.
Figure 9 shows the confusion matrix for the best performing CNN-LSTM (MobileNetv2-LSTM) model and the stacking model for video-wise classification using echocardiogram videos. Figure 9(A

B. SUBJECT-WISE CLASSIFICATION
This study also investigated different deep learning CNN-LSTM models and stacking models for subject-wise CHD or healthy patient classification using echocardiogram videos. In subject-wise classification, the decision for each subject is based on the average of the predicted scores of all videos of that subject. Among the individual models, the MobileNetv2-LSTM architecture performed best, with accuracy, precision, sensitivity, F1 score, and specificity of 86.4%, 85.7%, 90.9%, 88.9%, and 80.1%, respectively. The stacking model was developed using the probability scores of the top-3 performing models (MobileNetv2-LSTM, ResNet18-LSTM, and GoogleNet-LSTM), which improved the result by ∼2%, with accuracy, precision, sensitivity, F1 score, and specificity of 91%, 86.7%, 97.9%, 92.9%, and 81.4%, respectively. Table 5 compares the different CNN-LSTM models and the stacking model for subject-wise classification.
Figure 10 shows the ROC curves and AUC for subject-wise classification of HLHS and healthy subjects using Ultrasound videos. It is evident from the ROC curves that the stacking CNN-LSTM model outperformed the other networks with 94.5% AUC, whereas the AUC of the best performing individual CNN-LSTM (MobileNetv2-LSTM) model is 88.4%. A significant improvement in AUC (∼6%) was thus observed when using the novel stacking technique in the final subject-wise classification.
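For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen positive (here, HLHS) sample receives a higher score than a randomly chosen negative (healthy) sample. The following sketch computes it by direct pairwise comparison, counting ties as one half. The function name `roc_auc` and the example labels and scores are hypothetical, and this is not necessarily how the values reported above were obtained.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: 1 for the positive class, 0 for the negative class.
    scores: predicted probability of the positive class.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: three positive and two negative samples.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.5, 0.2]
auc = roc_auc(labels, scores)  # 5 of the 6 positive-negative pairs are ordered correctly
```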
Figure 11 shows the confusion matrix for the best performing CNN-LSTM (MobileNetv2-LSTM) model and the stacking model for subject-wise classification of HLHS and healthy subjects using echocardiogram videos. This study indicates that the presented novel framework, developed using the stacking CNN-LSTM model, is capable of detecting CHD patients reliably. The performance of this model can be further enhanced by increasing the sample size in the training process. For this study, data from only 9 healthy and 13 CHD patients were available, and some of the Ultrasound videos were significantly corrupted by motion artifacts and could not be included in the analysis. Moreover, the number of videos at different time points for healthy patients (60) was half that for CHD patients (120). Because of the limited data from healthy subjects, the model may not have learned enough about the healthy cardiac cycle in Ultrasound videos, which might explain the misclassification of the two healthy patients by the algorithm. Otherwise, the model performed very well in detecting HLHS patients using Ultrasound videos. To the best of the authors' knowledge, this is the first study to reliably classify HLHS patients from Ultrasound videos using a deep learning technique. This study can be extended to a larger patient cohort with more longitudinal time points of Ultrasound videos to identify the time point (i.e., gestational week) at which the deep learning model can typically detect HLHS patients reliably. This would also allow us to identify the most useful temporally distinctive feature(s) in the cardiac cycle of the Ultrasound videos at different gestational weeks.
Even though the patient cohort was small, the number of video samples used to train and test the algorithm was quite high (as shown in Table 2). For each patient, echocardiography was performed at up to three different time points (different gestational weeks). For a specific patient and a specific time point, several echocardiography B-mode videos were collected from different orientations. In this way, the total number of full videos reached 180. These videos were then segmented into individual cardiac cycles. The segmented videos were not mere repetitions since, in most cases, the fetus was moving during imaging and the operator was moving the probe to obtain a better signal. Therefore, the segmented videos were treated as additional sample videos in this study. The number of video samples reached 2834 for the CHD class and 1114 for the healthy class, which we believe is sufficient to train a deep learning model. Future work will involve testing the algorithm in larger cohorts involving different types of CHDs that are more prevalent.

IV. CONCLUSION
Congenital heart defects (CHDs) affect 0.6-0.8% of the population and are the leading cause of death in infants under 1 year of age. Current treatment of serious CHDs involves a series of high-risk operations shortly after birth. Recently, prenatal intervention to treat these conditions has emerged as a potential therapy alternative, emphasizing the importance of diagnosing the condition in utero. Echocardiography is the gold standard for CHD diagnosis in utero. A severe CHD type is hypoplastic left heart syndrome (HLHS), responsible for 25% of all prenatal deaths. Recently, some pioneering works showed the potential of applying ML to fetal CHD diagnosis. The objective of this study was to detect the morphological and temporal changes in the cardiac Ultrasound videos of the fetus due to HLHS using ML approaches. For this purpose, we collected echocardiography videos at different stages of gestation from fetal HLHS patients. These videos were preprocessed and used to train five different deep learning CNN-LSTM models and a novel stacking CNN-LSTM model. Our results suggest that the stacking CNN-LSTM model, which is developed using MobileNetv2-LSTM, ResNet18-LSTM, and GoogleNet-LSTM models with different pre-trained encoders, is very effective for differentiating HLHS patients from healthy patients using Ultrasound videos. The model could distinguish the differences between healthy and HLHS human fetal hearts during the gestational development stages in terms of cross-sectional heart chamber dimensions and flow hemodynamics. This pioneering study has demonstrated that the deep learning framework is capable of distinguishing the unhealthy heart in the early gestational weeks using Ultrasound videos, which can help in applying potential prenatal therapy rather than postnatal therapy to increase the chance of patient survival.

FIGURE 2. Overview of the methodology.

TABLE 1. Summary of the echocardiogram videos for (A) CHD subjects and (B) Healthy subjects.

FIGURE 3. Sample raw, cleaned, and cropped frames of the echocardiogram videos.

FIGURE 4. Pre-processing of the Ultrasound video to up-sample to 20 frames by adding the last frame.
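The padding step named in this caption can be sketched as follows. The helper name `pad_to_length` and the list-of-frames representation are assumptions for illustration; the caption only states that videos are up-sampled to 20 frames by adding the last frame. Leaving longer clips unchanged is likewise an assumption, since the text does not say how they are handled.

```python
def pad_to_length(frames, target_length=20):
    """Repeat the last frame until the clip has target_length frames.

    frames: sequence of frames (any objects, e.g. image arrays).
    Clips that already have target_length frames or more are returned
    unchanged (assumed behavior, not stated in the source).
    """
    if not frames:
        raise ValueError("cannot pad an empty clip")
    frames = list(frames)
    missing = target_length - len(frames)
    if missing <= 0:
        return frames
    return frames + [frames[-1]] * missing


# Hypothetical 14-frame cardiac-cycle clip, with frame indices as stand-ins.
clip = list(range(14))
padded = pad_to_length(clip)
```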
Input: training data $A = \{x_i, y_i\}_{i=1}^{m}$
Output: a stacking classifier $M_f$
Step 1: learn base-level classifiers
    for $t = 1$ to $T$ do
        learn $h_t$ based on $A$
    end for
Step 2: construct new data set of predictions
    for $i = 1$ to $m$ do
        $A_h = \{x'_i, y_i\}$, where $x'_i = \{h_1(x_i), \ldots, h_T(x_i)\}$
    end for
Step 3: learn a meta-classifier
    learn $M_f$ based on $A_h$
    return $M_f$
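Step 3 of the stacking procedure above can be sketched in a few lines. The sketch assumes Steps 1 and 2 are already done, so each sample is represented by the probability scores of three base models. The choice of logistic regression as the meta-classifier, the plain gradient-descent fit, the function names, and all numeric values are assumptions for illustration; this section does not specify which meta-learner was used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_classifier(base_scores, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-classifier M_f on base-model scores.

    base_scores: list of x'_i = [h_1(x_i), ..., h_T(x_i)].
    labels: list of y_i in {0, 1}.
    """
    n_features = len(base_scores[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(labels)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(base_scores, labels):
            z = sum(w * v for w, v in zip(weights, x)) + bias
            error = sigmoid(z) - y  # gradient of the log-loss w.r.t. z
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def predict_proba(weights, bias, x):
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Hypothetical probability scores from three base models (T = 3).
base_scores = [
    [0.92, 0.85, 0.60],
    [0.80, 0.40, 0.75],
    [0.45, 0.70, 0.66],  # first base model alone would miss this sample
    [0.30, 0.20, 0.35],
    [0.55, 0.25, 0.10],  # first base model alone would flag this sample
    [0.10, 0.30, 0.45],
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_meta_classifier(base_scores, labels)
stacked = [predict_proba(weights, bias, x) for x in base_scores]
```

The point of the toy data is that a single base model thresholded at 0.5 makes mistakes that the meta-classifier can correct by weighing all three scores together.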

FIGURE 7. Sample snapshot for (A) healthy, and (B) HLHS patient's echocardiogram videos for two timepoints of each subject. LA is left atrium, LV is left ventricle, RA is right atrium, RV is right ventricle. Snapshots are for ventricular diastole in the cardiac cycle.

TABLE 4. Comparison of different CNN-LSTM performances for video-wise classification.

FIGURE 8. ROC curve for video-wise binary classification using different CNN-LSTM and stacking CNN-LSTM models.
Figure 9(A) shows the confusion matrix of the best performing CNN-LSTM (MobileNetv2-LSTM) model and Figure 9(B) shows the confusion matrix of the stacking CNN-LSTM model. The MobileNetv2-LSTM network correctly classifies 109 of 120 Ultrasound videos for CHD patients and 51 of 60 Ultrasound videos for healthy patients. The stacking CNN-LSTM model slightly improves the performance: 112 of 120 Ultrasound videos are classified correctly for CHD patients and 51 of 60 Ultrasound videos are correctly classified as healthy.
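The relation between these confusion-matrix counts and the evaluation metrics can be made explicit with a short worked example. The function name `binary_metrics` is hypothetical and CHD is taken as the positive class. Values computed from the pooled counts need not coincide exactly with the figures reported in Table 4, for instance if those were averaged over cross-validation folds (an assumption; the text does not say).

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # recall on the positive (CHD) class
    specificity = tn / (tn + fp)  # recall on the negative (healthy) class
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Video-wise counts quoted in the text (CHD = positive class).
mobilenet = binary_metrics(tp=109, fn=11, tn=51, fp=9)
stacking = binary_metrics(tp=112, fn=8, tn=51, fp=9)

for name, metrics in (("MobileNetv2-LSTM", mobilenet), ("Stacking", stacking)):
    summary = ", ".join(f"{k}={100 * v:.1f}%" for k, v in metrics.items())
    print(f"{name}: {summary}")
```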

FIGURE 9. Confusion matrix for video-wise classification using (A) the best performing CNN-LSTM model, and (B) the best performing stacking CNN-LSTM model.

FIGURE 10. ROC curve for subject-wise classification using different CNN-LSTM networks and stacking CNN-LSTM model.

FIGURE 11. Confusion matrix for subject-wise classification using (A) the best performing CNN-LSTM model, and (B) the best performing stacking CNN-LSTM model.

Figure 11(A) shows the confusion matrix of the best performing CNN-LSTM (MobileNetv2-LSTM) model and Figure 11(B) shows the confusion matrix of the stacking CNN-LSTM model. The MobileNetv2-LSTM network correctly identifies 12 of 13 subjects as CHD patients and 7 of 9 subjects as healthy. The stacking CNN-LSTM model improves the performance: all 13 CHD subjects are identified correctly, and 7 of 9 subjects are correctly classified as healthy.
TAWSIFUR RAHMAN received the B.Sc. (Eng.) degree from the Department of Electrical and Electronic Engineering, University of Chittagong, Bangladesh, and the M.Sc. degree from the Department of Biomedical Physics and Technology (BPT), University of Dhaka, Bangladesh. He is currently working as a Research Assistant with the Department of Electrical Engineering, Qatar University. He has published several journal articles on medical imaging. His current research interests include biomedical image and signal processing, machine learning, computer vision, and data science. He has expertise in designing and developing deep CNN models using the PyTorch and TensorFlow frameworks, implementing nerve stimulators for measuring conduction velocity in the human body, developing electrocardiogram (ECG) and electromyogram (EMG) circuits, developing Howland constant current sources and instrumentation amplifiers to measure tetrapolar bio-impedance, and detecting different stages of brain activity by analyzing various EEG waves. He has received an ICT Fellowship (2019-2020) from the ICT Ministry of Bangladesh for research entitled "Driver drowsiness detection from HRV and computer vision using machine learning." He and his team have recently won the COVID-19 Dataset Award for their contribution to the fight against COVID-19.

MAHMOUD KHATIB A. A. AL-RUWEIDI is currently pursuing the Graduate degree with the College of Pharmacy, Qatar University, Doha, Qatar.

TABLE 2. Details of the dataset used for training, validation, and testing.

TABLE 3. Details of training parameters of CNN-LSTM models.

TABLE 5. Comparison of different CNN-LSTM performances for subject-wise classification.