An HPC Pipeline for Calcium Quantification of Aortic Root From Contrast-Enhanced CCT Scans

Precise assessment of calcification lesions in the Aortic Root (AR) is relevant for the success of the Transcatheter Aortic Valve Implantation (TAVI) procedure. To this end, radiologists analyze the Cardiac Computed Tomography (CCT) scans of patients and detect the position and extent of the calcium deposits. In this contribution, we develop a computationally efficient High-Performance Computing (HPC) system to detect, segment, and quantify volumes of calcium in contrast-enhanced CCTs, embedding two 3D Convolutional Neural Networks (CNN) and an adaptive threshold filter in a three-step pipeline. The first step crops the images to a bounding box around the AR while keeping the original resolution, the second builds the segmentation, and the third detects and measures the volume of the calcium lesions. Our system is trained on high-resolution contrast-enhanced CCTs routinely planned for the TAVI and manually annotated by expert radiologists, and is evaluated on a test set of patients with different levels of calcification. The accuracy achieved in segmenting the AR is approximately 92% on the test set, while the average difference of calcium lesion volumes with respect to the radiologists' measurements is about 0.49 mm³. Running on a 4× NVIDIA V100 and an 8× NVIDIA A100 GPU system, we achieve an inference throughput of 17 and 70 CCT/sec respectively, and a linear scaling of computing performance. Our contribution provides an HPC system that is suitable for installation on hospital premises and can aid radiologists in assessing the calcification level of patients undergoing the TAVI, making this process automated, fast, and more reliable.


I. INTRODUCTION
Aortic Stenosis (AS) is the most frequent valvulopathy treated in the western world, especially in the elderly population, with a prevalence of 2-9% in patients older than 65 [1]. The traditional treatment of severe AS is Aortic Valve Replacement (AVR) by open-heart surgery, accounting for 60-70% of valve surgeries performed in the elderly [2]. However, at least 30% of patients are not advised to undergo AVR due to high surgical risks related to advanced age or the presence of various comorbidities [3], making the Transcatheter Aortic Valve Implantation (TAVI) more appropriate for them as it is less invasive [4]. The TAVI must be carefully planned to avoid issues such as paravalvular regurgitation, which is associated with a high mortality probability [5], [6] and occurs when the bioprosthesis does not adhere properly to the aortic annulus. Multiple factors contribute to this [7], among them the presence, position, and extension of calcification lesions within the aortic root.
The associate editor coordinating the review of this manuscript and approving it for publication was Ravibabu Mulaveesala.
A reliable method to assess the calcification level of the aortic root is the analysis of contrast-enhanced CCTs [8], [9]. To do this, expert radiologists routinely delimit the aortic root and the calcification lesions manually on post-contrast CCT scans, a time-consuming process prone to operator-dependent errors.
Automating this process is not trivial, and several issues have to be faced. Since calcium is also contained in other anatomical structures, especially bones, automatic analysis of CCT scans conceptually requires two steps: identification of the aortic root, and then of the calcium lesions within it. Moreover, in patients undergoing the TAVI the aortic root might be severely calcified, making it difficult to find its boundaries. For this reason, contrast-enhanced CCTs are usually performed to highlight the shapes of the tissues and other organ structures. However, using a contrast medium hampers the identification of the calcium lesions, since the luminance of voxels with contrast is similar to that of voxels with low and medium levels of calcium. Finally, while standard methods have already been developed for non-contrast CCTs, like the Agatston score commonly used to evaluate coronary artery calcifications [10], no standard method has yet been established for contrast-enhanced CCTs [7].
In this work, our aim is to develop a pipeline that automates and speeds up the segmentation of the aortic root and assesses the calcium lesions within it. Automatic aortic root segmentation has several applications, among them the evaluation of aortic valve degeneration and automatic geometrical measurements of the aortic root, such as the aortic valve or sinotubular junction dimensions. Here we focus on the quantification of the calcium, post-processing the segmentations with an adaptive filter. We do not aim to define a standard model for calcium assessment in contrast-enhanced CCT scans, which is out of the scope of our work; rather, we show how measurements made by expert radiologists can be replicated by an automatic system.
In detail, our contribution is threefold: i) we design and implement a pipeline to segment the aortic root in contrast-enhanced CCT scans, using two 3D U-Net CNNs running on HPC multi-GPU systems; ii) we evaluate the performance of the CNNs in terms of segmentation accuracy, computing throughput, and scalability; iii) we validate the accuracy of the segmentations by automatically assessing the aortic root calcification level with a post-processing adaptive threshold filter, and comparing the results with those obtained by expert radiologists on a set of patients undergoing the TAVI. We underline that the success of the latter point strongly depends on the precision of the aortic root segmentation. Moreover, we show how small-sized multi-GPU HPC systems, suitable to be hosted on hospital premises, can be used to routinely support the work of radiologists in assessing the aortic root calcification level quickly and reliably. Besides this, the throughput achieved in this work makes our approach a candidate for retrospective studies involving a large number of CCT images, and for application scenarios where frequent scheduled retraining of the CNN models is foreseen to include new clinical cases and evidence.
The remainder of the paper is organized as follows: in Section II we give an overview of the related works and provide motivations for our work; in Section III we describe our dataset and how the image slices of the CCTs have been manually annotated; in Section IV we give details about the methodologies used to build the segmentations through CNNs, and we describe the development of the adaptive threshold filter used to post-process them and identify the calcium lesions; in Section V we present the results in terms of both accuracy and computational performance; and finally in Section VI we draw our conclusions.

II. RELATED WORKS AND MOTIVATION
In the last decade, Deep Learning (DL) methods applied to diverse imaging techniques such as ultrasound [11], Magnetic Resonance Imaging (MRI) [12], and CT [13] have become the most widely used approach for cardiac image segmentation. In particular, the latter imaging technique is preferred since it generally leads to better segmentation accuracy, ascribed to a higher image quality [14]. A review of the literature [13] identified four main branches where DL methods like Fully Convolutional Networks (FCN), CNNs, and more articulated models are applied to CCT imaging: i) cardiac substructure segmentation [15], [16], [17]; ii) coronary artery segmentation [18], [19]; iii) aortic root segmentation [20], [21], [22]; iv) calcium and plaque segmentation in coronaries and aortic valve [23], [24], [25]. Despite the confidence in the potential of DL-assisted segmentation procedures in medical imaging, several challenges remain in improving both accuracy (even with image pre- and post-processing) and performance (i.e., images processed per second).

A. SEGMENTING THE AORTIC ROOT
The first goal of our work is to use DL to segment the aortic root structure, which falls in the scope of i) and eventually iv). Previous attempts on the same anatomical region of the heart relied on a two-step segmentation pipeline of neural networks [26], [27]. The first step extracts the ROI to identify the heart, and then feeds the second pipeline stage with images classified with a CNN [27]. Refined models use a localization network (Spatial Configuration Net) that produces a coarse detection of the aortic landmarks, and then apply a 3D CNN like U-Net for segmentation [26]. A similar approach, although applied to ultrasound images, has been proposed in [28].
Most of the studies in the literature deal with non-contrast CCT images and consider patients affected by a low, if not null, level of calcification of the aortic valve. The works in [20] and [29] consider contrast-enhanced 2D and 3D CCT images of patients eventually undergoing the TAVI procedure (with a moderate/high level of calcification of the aortic valve), which is the scenario we are targeting in this work. Concerning the use of specific CNN models, in [27] the authors have presented a CNN-based approach to detect the aortic root in contrast-enhanced CTs, measure the aortic annulus diameter, and select the valvular prosthesis before TAVI. Meanwhile, in [20] the authors have presented an accurate and fast method to choose the TAVI device size by measuring the aortic annulus perimeter and area from manually segmented annulus planes using two U-Net models.
However, using standard DL models on our image dataset, and in general on contrast-enhanced CCTs, would lead in some cases to the situation shown in Fig. 1a. Large calcified areas are not correctly picked up in the aortic root segmentation, leading to an incorrect evaluation of the anatomical structure and neglecting a large part of the calcium in the quantification. We address this challenge by applying a pre-processing filter for artifact removal, which proves beneficial in the subsequent segmentation steps and in the overall quantification of the calcium on the aortic valve (see Fig. 1b).

B. QUANTIFYING THE CALCIUM ON THE AORTIC VALVE
The exact quantification of the calcium volume on the aortic valve is non-trivial when contrast-enhanced CCTs are considered. Indeed, there is a significant inherent inter- and intra-patient opacity (luminescence) variability of the dye [7] that hampers the development of fully automated DL methods that exploit a constant luminescence threshold for calcium extraction. This variation is ascribed to many factors, such as the infusion rate of the contrast, the time elapsed between the injection and the start of the CCT exam, and the patient body habitus [7]. These factors cannot be ignored even when following the guidelines on contrast usage for CCT exams [30]. Different approaches to quantify the calcium on the aortic valve have been proposed in the literature, using either fixed threshold values or values relative to the contrast medium luminescence (i.e., blood pool attenuation).
In [7], the authors have measured the calcium volume score in CT angiography (CTA) to find the most accurate threshold to predict paravalvular regurgitation (PVR) after TAVI. They use two fixed thresholds (650 and 850 HU) and four thresholds related to the luminal attenuation (LA) in the aortic annulus (LA×1.25, LA×1.5, LA+50, LA+100) for calcium detection, showing that with LA threshold cutoffs it is possible to achieve a discrimination accuracy (measured with the Area Under Curve method) between mild and severe PVR of 0.81.
In [31], the authors have considered the contrast enhancement within the left ventricular outflow tract (LVOT), with a cutoff of 300 HU. The aortic calcium volume is measured with three fixed thresholds (450, 850, and probe+100 HU) and the results show that the best value (accuracy up to 70%) depends on the contrast in the LVOT. The developed method is not based on automated segmentation performed by DL methods, but is rather applied to manual segmentations of the aortic valve. The authors point out that a standard reference for calcium assessment on contrast-enhanced CT series is currently lacking.
In [32], the authors applied commercial software to manually select the calcium elements in the CCT scan and obtained remarkable accuracy in scoring the aortic valve calcification. This work shows a strong linear relationship between the blood pool attenuation and the threshold chosen in calcium volume quantification. Once again, this methodology is not included in a fully automated pipeline that performs this operation using an end-to-end approach with DL-assisted segmentation.
Current commercial solutions always require manual input in the segmentation and in the threshold choice, which introduces inter-operator variability [33]. Our work goes in the direction of providing an objective, fully repeatable, and systematic framework with DL methods, complemented with an adaptive filtering methodology to ease the calcium volume quantification when large luminescence variations between patients and within a single patient come into play.

C. UNCOMPROMISING ACCURACY WITH PERFORMANCE
The image processing time required by DL-assisted segmentation methods is usually an unexplored parameter in clinical studies. The accuracy of the segmentation appears to be the parameter of utmost importance in the development of fully automated methods, and in some cases it is traded against the overall system performance. Although some works report the achieved segmentation time and the required neural network training time [27], [33], [34], [35], [36], [37], [38], none of them, to the best of our knowledge, either investigates the scaling of such methods or studies the role of the precision used in the neural network data representation in the context of HPC. Optimizing the segmentation performance is relevant in two respects: i) it can enable large-volume studies considering many images from different sources; ii) it provides fast re-training of the neural networks to improve the segmentation accuracy and calcium quantification, even daily.
In Table 1, we report the characteristics of the studies that consider segmentation of the aorta for various applications using either 2D or 3D DL approaches based on CNNs. We do not include in this comparison the possible pre- and post-processing times of the images, since they are negligible compared to the overall inference time. Training time is nowadays in the range of several hours, and even days for large datasets, whereas the inference time usually stays within one minute.
Our work investigates the impact on the training and inference segmentation performance of different small-sized HPC systems based on two generations of NVIDIA GPUs, namely the V100 and the more recent A100. We analyze the scaling in terms of the number of GPUs adopted in the process and show how to achieve high accuracy and performance using different numerical precisions in the DL data representation.

III. DATASET FEATURES AND IMAGING PROTOCOL
Our dataset comprises 27392 image slices corresponding to a total of 107 annotated gated contrast-enhanced CCTs acquired with a 256-slice scanner (Revolution CT, General Electric, Chicago, IL, USA), with prospective gating, setting the slice thickness at 0.625 mm, 120 kV, automatic mA, rotation time of 0.28 seconds, DFOV at 25 cm, and detector coverage at 160 mm. All patients received contrast material (Omnipaque 350 mg/mL, GE Healthcare, Chicago, IL, USA) at a rate of 5 mL/s followed by a saline chaser. The image acquisition was triggered after a threshold of 80 HU was reached in a region of interest placed in the left ventricle (bolus-tracking technique). Each CCT scan has 256 slices of 512 × 512 pixels, with a spacing of 0.488 × 0.488 × 0.625 mm³, and the volume of the calcium regions ranges from 0 to 716 mm³. These regions are located at different places within the aortic root, including the valve cusps, the annulus, the sinotubular junction, and the coronary junctions.
CCT analysis was performed by two radiologists in consensus, with 2 and 7 years of experience in cardiovascular imaging respectively, using the Medical Imaging Interaction Toolkit (MITK), a free open-source software for processing medical images [39]. Following the same approach used in [40], they first worked separately and afterward reviewed the segmentations together. In case of disagreement, the consensus between the two is used as ground truth. The aortic root segmentation was built by drawing 3-4 contour planes along the axial, sagittal, and coronal anatomical axes, from the sinotubular junction to the aortic annulus. Then, a 3D interpolation was applied to reconstruct the entire volume. To segment the calcification regions, the slices were inspected one by one, setting an appropriate threshold value to select the calcification voxels, as described in Fig. 2. Due to the differences in voxel luminance, the reference threshold is specific for each image [7]; for example, in our dataset, the mean reference threshold is 667 ± 127 HU. Once the calcium regions had been segmented, the corresponding volume was measured using the Statistics tool of MITK.
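As an illustration of how a segmented calcium region maps to a volume, the following minimal sketch multiplies a voxel count by the voxel volume implied by the spacing reported above. It is not the MITK Statistics tool used by the radiologists; the function names are hypothetical and the voxel counts are made up for the example.

```python
# Voxel spacing reported for the dataset, in millimetres (x, y, slice).
SPACING_MM = (0.488, 0.488, 0.625)


def voxel_volume_mm3(spacing=SPACING_MM):
    """Volume of a single voxel in cubic millimetres."""
    dx, dy, dz = spacing
    return dx * dy * dz


def lesion_volume_mm3(n_voxels, spacing=SPACING_MM):
    """Volume of a region made of n_voxels voxels (hypothetical helper)."""
    return n_voxels * voxel_volume_mm3(spacing)


if __name__ == "__main__":
    print(f"one voxel: {voxel_volume_mm3():.5f} mm^3")
    print(f"1000 voxels: {lesion_volume_mm3(1000):.2f} mm^3")
```

With this spacing a single voxel is about 0.149 mm³, so a mislabeled voxel changes the measured volume only marginally; volume errors of tens of mm³ correspond to hundreds of voxels.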

IV. PIPELINE DESIGN FOR CALCIUM QUANTIFICATION
We propose a three-step pipeline as shown in Fig. 3. It is based on two CNN models to segment the aortic root, named U-Net1 and U-Net2, and on an adaptive threshold filter to detect the calcium regions. The entire process is complemented with image pre-processing, down-scaling, and cropping steps to increase the segmentation accuracy. The U-Net1 extracts the region of interest (ROI) around the aortic root (Fig. 3a) at low resolution, obtained by scaling the original images by a factor of 2 along the coronal and sagittal axes in order to fit the images into the memory available on the GPUs used for running our models. This step returns a bounding-box mask that is used to crop the original image to a volume of 224 × 256 × 256 voxels. Such a choice keeps the context around the aortic root as large as possible to help the second CNN detect it, while fitting both the size of the GPU memory and the requirements of the next CNN architecture. The U-Net2 (Fig. 3b) takes as input the cropped images at full resolution and returns a volume mask corresponding to the segmentation of the aortic root. The last step (Fig. 3c) finds the calcification lesions within the region of the image corresponding to the aortic root segmented by the U-Net2. The whole flowchart is depicted in Fig. 3d, where the green boxes represent the pre-processing steps performed at the input of U-Net1 and U-Net2, and the blue boxes the two U-Nets and the adaptive threshold filter.
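The cropping step between U-Net1 and U-Net2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: we assume axes ordered as (slice, row, column), a low-resolution mask that is down-scaled by 2 only in-plane, and a crop window centered on the bounding box and clamped to the image bounds. The helper names `bounding_box` and `crop_window` are hypothetical.

```python
def bounding_box(voxels):
    """Inclusive (min, max) corners of a collection of (z, y, x) mask voxels."""
    zs, ys, xs = zip(*voxels)
    return (min(zs), min(ys), min(xs)), (max(zs), max(ys), max(xs))


def crop_window(box, image_shape, crop_shape=(224, 256, 256), scale=(1, 2, 2)):
    """Map a low-resolution bounding box to a fixed-size full-resolution crop.

    `scale` is the assumed per-axis factor between the low-resolution mask
    and the original image. Returns (start, end) with `end` exclusive.
    """
    lo, hi = box
    starts = []
    for a_lo, a_hi, size, crop, s in zip(lo, hi, image_shape, crop_shape, scale):
        if crop > size:
            raise ValueError("crop is larger than the image along one axis")
        centre = s * (a_lo + a_hi + 1) // 2      # box centre, full resolution
        start = centre - crop // 2
        starts.append(max(0, min(start, size - crop)))  # clamp inside image
    ends = tuple(st + c for st, c in zip(starts, crop_shape))
    return tuple(starts), ends


if __name__ == "__main__":
    mask_voxels = [(100, 120, 130), (140, 150, 160)]  # made-up mask corners
    box = bounding_box(mask_voxels)
    print(crop_window(box, image_shape=(256, 512, 512)))
```

Clamping keeps the crop size constant even when the aortic root lies near the border of the scan, which matters because the second network expects a fixed input shape.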

A. AORTIC ROOT SEGMENTATION
U-Net1 and U-Net2 are CNNs based on the U-Net model, widely used in several studies for processing biomedical images [20], [41]. This neural network has been developed for image segmentation [42], with a U-shape architecture consisting of a contractive branch aiming to capture the image context, and an expansive branch to make precise localization of the segmentations. In particular, we have modified and adapted the 3D U-Net Medical CNN available at the NVIDIA DeepLearningExamples GitHub repository to our specific case [43]. This is an implementation of the model developed by [44] with improvements from [45], originally designed for the BRATS 2019 challenge to segment brain tumors.
The 3D U-Net architecture used in our work is configured with 5 down-sampling and 5 up-sampling steps, as shown in Fig. 4. Each step in the down-sampling path consists of two convolutional blocks, each made of a convolution, a normalization, and an activation layer. The first block performs a strided down-sampling that reduces all the spatial dimensions by a factor of two, while the second consolidates the features learned at that depth level. In the up-sampling path, each step is made of a transposed convolution and two convolutional blocks. The transposed convolution up-samples the spatial dimensions of the previous step by a factor of two, while the two convolutional blocks take as input the concatenation of the outputs of the transposed convolution and the skip connection. Skip connections are the outputs of the down-sampling blocks at the same depth level as the up-sampling block. They are particularly useful in image segmentation since they help to faithfully reconstruct the images using fine-grained details learned in the encoder part of the network. Lastly, the final output produced by the output block is processed by a softmax operation that is used for voxel classification. The model is implemented using TensorFlow [46] to leverage GPU acceleration and Horovod [47] for distributed learning on multi-GPU systems. Horovod is an open-source framework that provides tools to train deep learning models efficiently across multiple GPUs. By default, the model runs using 32-bit floating-point numerical precision, while automatic mixed precision (AMP) computation, combining 16- and 32-bit operations, can be enabled to reduce the processing time during both the training and inference steps. Moreover, Accelerated Linear Algebra (XLA) [48] can be enabled to further speed up the computation of mathematical operations. By default, the layers of the network coded using TensorFlow are processed independently. In contrast, using the XLA DL graph compiler, parts of the network are clustered into sub-graphs that can be optimized and compiled, providing performance benefits at the cost of some compilation overhead.
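To make the effect of the five stride-2 down-sampling steps concrete, the sketch below tracks only the spatial dimensions through the encoder. It assumes, as the description above suggests, that every step halves each spatial dimension exactly; channel counts are not modeled, and the function name is hypothetical.

```python
def encoder_shapes(input_shape, steps=5):
    """Spatial shape after each stride-2 down-sampling step.

    Raises ValueError when a dimension cannot be halved exactly, which
    would make the skip-connection concatenation ambiguous.
    """
    shapes = [tuple(input_shape)]
    for _ in range(steps):
        prev = shapes[-1]
        if any(d % 2 for d in prev):
            raise ValueError(f"shape {prev} is not divisible by 2")
        shapes.append(tuple(d // 2 for d in prev))
    return shapes


if __name__ == "__main__":
    for depth, shape in enumerate(encoder_shapes((224, 256, 256))):
        print(depth, shape)
```

Under this assumption every input dimension must be a multiple of 2^5 = 32; the 224 × 256 × 256 crop satisfies this and reaches 7 × 8 × 8 at the deepest level, which is one plausible reading of the architectural requirement on the crop size.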
U-Net1 and U-Net2 are both trained to minimize the loss function $L = (1 - \mathrm{Dice}) + \mathrm{CE}$ and validated by measuring the Dice score on the inferences of the test set. The CE is the Cross Entropy computed as $\mathrm{CE} = -\sum_{i=1}^{n} t_i \log(p_i)$, where $t_i$ is the truth label and $p_i$ is the probability value for the $i$-th class output by the CNN for each voxel of the image; it gives a measure of the difference between the predicted probability distribution and the true distribution of the data. The Dice score is another commonly used metric in semantic segmentation, returning a value between 0 and 1, with 1 meaning that the segmentations match the ground truths.
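A minimal pure-Python sketch of this combined loss is given below. It is an illustration, not the TensorFlow implementation used for training: we assume a soft Dice score averaged over classes and a cross entropy averaged over voxels, and the smoothing constants `eps` are our own choice.

```python
import math


def dice_score(probs, onehot, eps=1e-6):
    """Soft Dice averaged over classes (assumed reduction).

    probs and onehot are lists with one entry per voxel, each entry being
    a list of per-class values.
    """
    n_classes = len(probs[0])
    scores = []
    for c in range(n_classes):
        inter = sum(p[c] * t[c] for p, t in zip(probs, onehot))
        denom = sum(p[c] for p in probs) + sum(t[c] for t in onehot)
        scores.append((2.0 * inter + eps) / (denom + eps))
    return sum(scores) / n_classes


def cross_entropy(probs, onehot, eps=1e-12):
    """CE = -sum_i t_i * log(p_i), averaged over voxels."""
    total = 0.0
    for p, t in zip(probs, onehot):
        total -= sum(ti * math.log(max(pi, eps)) for pi, ti in zip(p, t))
    return total / len(probs)


def loss(probs, onehot):
    """L = (1 - Dice) + CE."""
    return (1.0 - dice_score(probs, onehot)) + cross_entropy(probs, onehot)


if __name__ == "__main__":
    truth = [[1, 0], [0, 1]]              # two voxels, two classes
    pred = [[0.6, 0.4], [0.3, 0.7]]       # made-up softmax outputs
    print(f"loss = {loss(pred, truth):.4f}")
```

The Dice term rewards overlap at the level of the whole region, while the CE term penalizes each voxel independently; summing them combines both views in a single objective.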
At the input of both CNNs, the voxel intensities are clipped to 85% of the maximum luminance value to reduce the range, Z-score normalized, and stored as floating-point numbers in the [0, 1] range, whereas labels are one-hot encoded for their later use in the Dice or pixel-wise cross-entropy loss computation. Moreover, at the input of the U-Net2 a custom filter is applied to lower the luminance of the voxels in the ROI detected by U-Net1 and remove artifacts due to severe calcifications that may prevent the network from finding the bounds of the aortic root.
TABLE 2. Accuracy of the filters in replicating the threshold reference value used by the radiologists to segment the calcium areas within the aortic root.
To increase the robustness of the network to different kinds of images, data augmentation techniques have been applied. In our case, at each training epoch, three different augmentations are randomly applied to the input images, each with a probability of 0.5: random crop size, random brightness shifting, and horizontal flip. In this way, at each epoch the input samples differ from those of the previous epoch, which helps to prevent over-fitting and artificially increases the number of samples processed by the CNNs.
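The idea of applying each augmentation independently with probability 0.5 can be sketched on a single 2D slice as follows. This is a simplified illustration, not the training code: it covers only the flip and the brightness shift, works on nested lists rather than 3D tensors, and the shift magnitude `max_shift` is an assumed parameter that the text does not specify.

```python
import random


def horizontal_flip(slice_2d):
    """Mirror every row of a 2D slice."""
    return [row[::-1] for row in slice_2d]


def brightness_shift(slice_2d, delta):
    """Add a constant offset to every pixel."""
    return [[v + delta for v in row] for row in slice_2d]


def augment(slice_2d, rng, p=0.5, max_shift=0.1):
    """Apply each augmentation independently with probability p."""
    out = [list(row) for row in slice_2d]
    if rng.random() < p:
        out = horizontal_flip(out)
    if rng.random() < p:
        out = brightness_shift(out, rng.uniform(-max_shift, max_shift))
    return out


if __name__ == "__main__":
    rng = random.Random(42)               # seeded for reproducibility
    image = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
    for epoch in range(3):
        print(epoch, augment(image, rng))
```

Because the random draws are repeated at every epoch, the same scan is presented to the network in different variants over the course of training.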

B. DETECTION OF THE CALCIFICATION REGIONS
To segment and quantify calcifications, we have opted to select the voxels within the aortic root whose luminance is above a specific threshold. We developed different filters based on both fixed and variable thresholds computed with different algorithms, and tested them on a sample dataset of 7680 image slices from 30 CCTs of randomly selected patients with different levels of calcification. In Table 2, we assess for each filter the accuracy in replicating the reference threshold value (R.Thr.) set by the radiologists to segment the calcification regions. In detail, we report: i) the coefficient of determination $R^2$, a statistical indicator of the goodness of fit that measures how well the reference thresholds are replicated by the filter, based on the proportion of the total variation of the reference values explained by the model underlying the filter algorithm. The $R^2$ range is $(-\infty, 1]$, with 1 meaning that the model perfectly fits the reference data; ii) the mean, standard deviation, and range of the absolute error of the thresholds predicted by the filter with respect to the R.Thr.
The filters F1, F2, and F3 use a fixed threshold set to 600, 700, and 800 HU, respectively. In this case, the accuracy is low since the $R^2$ value is below zero, meaning that they are not able to replicate the reference thresholds. Additionally, the mean errors (94.90, 107.97, and 162.83 HU, respectively) and the corresponding standard deviations (96.81, 66.25, and 86.41 HU) are large.
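The indicators reported in Table 2 can be computed as in the sketch below, which also shows why a fixed threshold cannot reach a positive $R^2$: a constant prediction never has a smaller residual sum of squares than the mean of the reference values. The numbers are made up for the illustration, and the use of the population standard deviation and of (min, max) for the range are our assumptions.

```python
import math


def r_squared(reference, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


def abs_error_stats(reference, predicted):
    """Mean, population standard deviation, and (min, max) of |error|."""
    errs = [abs(r - p) for r, p in zip(reference, predicted)]
    mean = sum(errs) / len(errs)
    std = math.sqrt(sum((e - mean) ** 2 for e in errs) / len(errs))
    return mean, std, (min(errs), max(errs))


if __name__ == "__main__":
    ref = [600.0, 700.0, 800.0]           # made-up reference thresholds (HU)
    for fixed in (700.0, 900.0):
        pred = [fixed] * len(ref)
        print(fixed, r_squared(ref, pred), abs_error_stats(ref, pred))
```

In this toy example the fixed threshold equal to the mean of the references gives exactly $R^2 = 0$, and any other constant gives a negative value.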
To design a filter with a variable threshold, we measured the luminance $L_m$ on a sample cube located within the aortic root of each patient. To this end, we first located the geometric center (GC) point of the aortic volume (see Fig. 5a), and then constructed around it a cube of 20 voxels along each spatial dimension, corresponding to an edge of approximately 10 mm in-plane at the 0.488 mm voxel spacing. The cube was then moved above the GC, near the sinotubular junction, to avoid contact with the valve leaflets, where calcification deposits may occur (see Fig. 5b). In this way, the luminance of the voxels within the cube is that of blood and contrast medium. Furthermore, since the values of $L_m$ and R.Thr. are strongly correlated (the Pearson correlation index measured with the Bivariate Correlation function is r = 0.96 with a P-value < 0.0001), we can expect improved results when using filters with variable thresholds computed as a function of $L_m$.
The threshold for filter F4 is set to the value of $L_m$, while the thresholds for filters F5 and F6 are $L_m \times 1.25$ and $L_m \times 1.50$ respectively, as described in [7]. However, also in this case, none of these filters is able to accurately replicate the reference thresholds, since the $R^2$ values are very low (0.52, 0.52, and −2.57, respectively), and the mean and standard deviation of the absolute errors are large. This is mainly due to the contrast medium not being distributed uniformly within the aortic root, which makes some spots brighter than the maximum luminosity sampled within the cube. For example, as shown in Fig. 5c, the value of $L_m$ alone (597 HU in this case) is not high enough to avoid the selection of voxels in the middle of the aortic root (yellow spots with luminance greater than 597 HU), where we do not expect to have any calcification, while increasing the value by a fixed factor might lead to underestimating the calcium voxel count.
To further improve the accuracy, we have developed the filter F7, based on adaptive thresholds computed as $L_m + k(1000 - L_m)$. This formula is an implicit form of a linear regression in which voxels within the region of the aortic root with a density above the upper bound of 1000 HU are assumed to be calcium. Accordingly, $L_m$ is increased by a corrective term that shrinks as $L_m$ approaches the upper limit of 1000 HU; otherwise, some areas of calcium would not be identified. This allows the exclusion of the spots of contrast voxels within the aortic root. Applying this filter to our sample dataset, we have found that a value of k = 0.2 is enough to exclude the selection of spurious voxels. As reported in Table 2, the indicator values for the filter F7 are significantly better than those of the others, with $R^2 = 0.89$ and a mean absolute error of 29.58 HU with a standard deviation of 25.81 HU, making it the most accurate in predicting the reference thresholds.
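As a worked example, the sketch below evaluates the F7 threshold next to the relative thresholds F5 and F6 for the blood-pool luminance of 597 HU quoted for Fig. 5c, and counts the voxels above the threshold. The function names and the sample voxel values are hypothetical; only the formula, k = 0.2, and the 1000 HU upper bound come from the text.

```python
def f7_threshold(l_m, k=0.2, upper=1000.0):
    """Adaptive threshold L_m + k * (upper - L_m).

    Equivalent to the linear form (1 - k) * L_m + k * upper,
    i.e. 0.8 * L_m + 200 HU for k = 0.2 and upper = 1000 HU.
    """
    return l_m + k * (upper - l_m)


def relative_threshold(l_m, factor):
    """Thresholds of the F5/F6 kind: L_m scaled by a fixed factor."""
    return l_m * factor


def count_calcium_voxels(luminances_hu, threshold_hu):
    """Number of voxels whose luminance exceeds the threshold."""
    return sum(1 for v in luminances_hu if v > threshold_hu)


if __name__ == "__main__":
    l_m = 597.0
    print("F5:", relative_threshold(l_m, 1.25))
    print("F6:", relative_threshold(l_m, 1.50))
    print("F7:", f7_threshold(l_m))
    sample = [500.0, 650.0, 700.0, 1200.0]    # made-up voxel values (HU)
    print("voxels above F7:", count_calcium_voxels(sample, f7_threshold(l_m)))
```

For $L_m$ = 597 HU the F7 threshold is about 677.6 HU, compared with 746.25 HU for F5 and 895.5 HU for F6; the correction added by F7 vanishes as $L_m$ approaches 1000 HU, whereas a multiplicative factor keeps growing with $L_m$.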

V. EXPERIMENTAL RESULTS
In this section, we discuss and assess the results achieved in terms of precision in segmenting the aortic root, quantification of the volume of calcium lesions, and computing performance.

A. AORTIC ROOT AND CALCIFICATIONS SEGMENTATION
Both U-Net1 and U-Net2 have been trained for 1250 epochs with a learning rate of $10^{-4}$ using the Adam optimizer. Using 18944 image slices of 74 CCTs for the training set and 8448 image slices of 33 CCTs for the test set, we have achieved for both CNNs a Dice score of ≈0.90. To validate the results and to verify that no bias is present in the dataset, we also performed a K-fold cross-validation with K = 5, splitting the dataset into K random splits and training each model K times, where in each round four splits are used for training and the remaining one for testing.
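The splitting scheme can be sketched as follows, working at the level of whole CCT scans so that slices of the same patient never appear in both training and test splits. Splitting per scan, the shuffling seed, and the function name are assumptions for the illustration; the text only specifies K = 5 random splits.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]     # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


if __name__ == "__main__":
    for round_id, (train, test) in enumerate(k_fold_splits(107)):
        print(round_id, len(train), len(test))
```

With 107 scans the folds contain 21 or 22 scans each, so every scan is used for testing exactly once across the five rounds.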
Applying the filter F7 to the test set, the coefficient of determination $R^2$ between the R.Thr. and the threshold calculated by the filter is 0.94, with a mean absolute error of 26.15 ± 21.81 HU and a range of 99 HU; regarding the volume, we get $R^2 = 0.97$, with a mean absolute error of 13.65 ± 25.35 mm³ and a range of 95.46 mm³. In Fig. 6 we show two Bland-Altman plots; Fig. 6a refers to the agreement of the thresholds measured in HU, for which the mean difference is 7.55 and the 95% confidence interval is [−58.55, 73.64], while Fig. 6b refers to the calcium volume measured in mm³, where the mean error is −0.49 and the 95% confidence interval is [−57.79, 56.81]. In both cases, the values predicted by F7 agree with the radiologists' measurements within the 95% confidence interval.
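The quantities shown in a Bland-Altman plot can be computed as in the following sketch. It assumes the conventional definition of the interval as the mean difference ± 1.96 sample standard deviations; whether the intervals above were obtained exactly this way is not stated in the text, and the input values are made up.

```python
import statistics


def bland_altman(predicted, reference):
    """Mean difference and 95% interval (mean +/- 1.96 * SD of differences)."""
    diffs = [p - r for p, r in zip(predicted, reference)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)          # sample standard deviation
    return mean_diff, mean_diff - 1.96 * sd, mean_diff + 1.96 * sd


if __name__ == "__main__":
    f7 = [10.0, 12.0, 14.0, 16.0]         # made-up predicted values
    radiologists = [9.0, 13.0, 13.0, 17.0]
    mean_diff, low, high = bland_altman(f7, radiologists)
    print(f"mean = {mean_diff:.2f}, interval = [{low:.2f}, {high:.2f}]")
```

A mean difference close to zero indicates the absence of a systematic bias, while the width of the interval reflects the spread of the individual disagreements.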

B. COMPUTING PERFORMANCE
We have run our application on two small-sized HPC systems, suitable to be installed in small data centers like those available today in several hospitals. The first system hosts two Intel Gold 6242 CPUs operating at 2.8 GHz, equipped with 394 GB of RAM and 4 NVIDIA V100-SXM2-32GB GPU accelerators. It is hosted in a standard 19-inch blade chassis approximately 482 mm wide, 44 mm tall, and 894 mm deep. The second is an NVIDIA DGX A100 based on dual AMD Rome 7742 128-core CPUs running at 2.25 GHz and 320 GB of memory, with 8 A100 GPUs. The DGX is hosted in a 19-inch box approximately 482 mm wide, 264 mm tall, and 897 mm deep. The A100 architecture has introduced a novel math mode dedicated to AI training, namely the TensorFloat-32 (TF32). In this case, the numerical range is the same as FP32 while the precision is that of FP16, resulting in a significant reduction in computation, memory, and memory bandwidth requirements without harmful impacts on prediction accuracy. This translates to an increase of operation throughput of up to 10× or more compared to the prior V100 generation of GPUs, and up to 5× higher performance for DL workloads [49]. By default, when running on A100 GPUs, the TF32 numerical precision is used unless AMP or FP32 is explicitly selected.
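The trade-off behind TF32 (FP32 range, FP16 precision) can be illustrated by emulating its 10-bit mantissa in software. The sketch below simply truncates the 13 least significant mantissa bits of an FP32 value; this is an assumption made for clarity, since the rounding actually applied by the hardware may differ, and the function name is hypothetical.

```python
import struct


def to_tf32(x):
    """Emulate TF32 storage: 8 exponent bits as FP32, 10 mantissa bits.

    The value is first rounded to FP32, then the 13 least significant
    of its 23 mantissa bits are cleared (truncation, not rounding).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    for value in (1.0 + 2.0 ** -10, 1.0 + 2.0 ** -11, 1.0e30):
        print(value, "->", to_tf32(value))
```

Increments smaller than 2^-10 relative to the value are lost, as in FP16, while very large magnitudes such as 1e30, which overflow FP16, are still representable.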
In the following, we report the computing performance for the U-Net2 only, without loss of generality, since similar results have also been measured for the U-Net1.
In Fig. 7 we report for both systems the training latency per processed CCT (256 slice images) in milliseconds, achieved using up to four and eight GPUs respectively. We run the models using floating-point numerical precision, FP32 on the V100 and TF32 on the A100 system, and mixed precision (AMP), with and without XLA enabled. For the V100 system, using the FP32-based model (see Fig. 7a) we measure approximately 550 ms per CCT on one GPU. The time decreases using multiple GPUs, reaching nearly 150 ms per CCT with 4 GPUs, corresponding to an improvement of about 70%. Enabling XLA further improves the processing time by an additional 20% regardless of the number of GPUs (see Fig. 7b). Using the AMP precision (see Fig. 7c), the time decreases by ≈40% compared to FP32, and by ≈30% compared to FP32 with XLA, with an average execution time of 306 ms using 1 GPU and 87 ms with 4 GPUs. Lastly, by combining AMP and XLA (see Fig. 7d), the processing time further decreases by ≈30%, reaching ≈60 ms per CCT using 4 GPUs and running 9× faster than the baseline (FP32-based model on 1 GPU). For the A100 system we observe a similar behavior, with a lower training time. Using AMP with XLA enabled, we measure on average ≈55 ms per CCT on 1 GPU, while using 8 GPUs the training time decreases by 85%, reaching approximately 8.65 ms per CCT.
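The relation between the latencies quoted above and the derived speedup, parallel efficiency, and throughput can be checked with a few lines of arithmetic. The helper names are hypothetical; the input numbers are the approximate latencies reported in the text.

```python
def speedup(latency_1gpu_ms, latency_ngpu_ms):
    """Ratio between the single-GPU and the multi-GPU latency."""
    return latency_1gpu_ms / latency_ngpu_ms


def parallel_efficiency(latency_1gpu_ms, latency_ngpu_ms, n_gpus):
    """Speedup divided by the number of GPUs (1.0 is ideal scaling)."""
    return speedup(latency_1gpu_ms, latency_ngpu_ms) / n_gpus


def throughput_cct_per_s(latency_ms):
    """CCTs processed per second for a given latency per CCT."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # V100, FP32: about 550 ms on 1 GPU and about 150 ms on 4 GPUs.
    print(f"speedup    = {speedup(550, 150):.2f}")
    print(f"efficiency = {parallel_efficiency(550, 150, 4):.2f}")
    # AMP + XLA: about 60 ms (4x V100) and about 8.65 ms (8x A100) per CCT.
    print(f"V100 throughput = {throughput_cct_per_s(60):.1f} CCT/s")
    print(f"A100 throughput = {throughput_cct_per_s(8.65):.1f} CCT/s")
```

These values give a speedup of about 3.7 on 4 V100 GPUs (roughly 92% efficiency) and throughputs of about 17 and 116 CCT/s, consistent with the approximate figures of ≈17 and ≈120 CCT/sec reported for training.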
Fig. 8 shows the throughput and the scalability achieved during the training process. The AMP model with XLA enabled performs significantly better than the others, reaching a throughput of ≈17 CCT/sec on the V100 system and ≈120 CCT/sec on the A100 system. The scaling achieved is close to the ideal curve on both systems. In particular, on the V100 system, the FP32-based model scales better, with and without XLA enabled, while the AMP models perform somewhat worse due to the additional operations that each GPU needs to perform to convert the computational graph to mixed precision. Additionally, the overheads of handling multi-GPU computation dominate the CCT processing time, with a negative impact on the scaling. Similar behavior is observed for the A100 system, which achieves a peak throughput of ≈120 CCT per second using all 8 GPUs.
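One common way to quantify how close a run is to the ideal curve is the parallel efficiency, i.e. the measured speedup divided by the number of GPUs. The sketch below applies this definition (our choice of metric, not necessarily the one plotted in Fig. 8) to the approximate V100 latencies per CCT reported for 1 and 4 GPUs.

```python
def scaling_efficiency(t1_ms: float, tn_ms: float, n_gpus: int) -> float:
    """Fraction of ideal linear speedup achieved on n_gpus (1.0 = ideal)."""
    return (t1_ms / tn_ms) / n_gpus

# Approximate V100 latencies per CCT on 1 and 4 GPUs (milliseconds).
fp32_eff = scaling_efficiency(550.0, 150.0, 4)  # FP32 without XLA
amp_eff = scaling_efficiency(306.0, 87.0, 4)    # AMP without XLA

print(f"FP32 efficiency on 4 GPUs: {fp32_eff:.2f}")
print(f"AMP efficiency on 4 GPUs:  {amp_eff:.2f}")
```

With these rounded inputs the FP32 model reaches an efficiency of about 0.92 and the AMP model about 0.88, consistent with the observation that the FP32-based model scales slightly better.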
In Fig. 9 we report the inference throughput running on one V100 and one A100 GPU. On both systems, moving from TF32/FP32 to AMP slightly increases the throughput, and the best performance is obtained using the AMP model with XLA enabled. Since inference can be easily parallelized by running multiple instances of the trained model on different GPUs with different input images, using all available GPUs we achieve ≈17 CCT per second on the V100 system and ≈70 CCT per second on the A100 system.
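The multi-instance scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not our production code: `run_instance` is a hypothetical placeholder for a model instance bound to one GPU, and the round-robin assignment of scans to instances is one possible choice.

```python
from concurrent.futures import ThreadPoolExecutor

def shard_round_robin(items, n_workers):
    """Split items into n_workers interleaved shards, one per model instance."""
    return [items[i::n_workers] for i in range(n_workers)]

def run_instance(gpu_id, shard):
    """Placeholder for one model instance bound to one GPU (hypothetical).

    A real worker would load the trained model on device gpu_id and run
    inference on each scan; here we only record which GPU handled which scan.
    """
    return [(scan, gpu_id) for scan in shard]

def parallel_inference(scans, n_gpus):
    """Run one independent instance per GPU, each on its own shard of scans."""
    shards = shard_round_robin(scans, n_gpus)
    with ThreadPoolExecutor(max_workers=n_gpus) as pool:
        results = pool.map(run_instance, range(n_gpus), shards)
    # Flatten the per-GPU result lists into a single list.
    return [pair for per_gpu in results for pair in per_gpu]

assignments = parallel_inference([f"cct_{i:03d}" for i in range(10)], n_gpus=4)
```

Because the instances share no state, the aggregate throughput is expected to grow with the number of GPUs as long as data loading keeps up.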
To assess the impact of the AMP and XLA optimizations on the accuracy of the segmentation, we have run the K-fold cross-validation using different numerical precisions, with and without XLA enabled. On the V100 system, using FP32 the Dice score achieved is 0.915 ± 0.04 without XLA and 0.92 ± 0.02 with XLA enabled, while using AMP we get 0.915 ± 0.04 and 0.926 ± 0.02, respectively. This shows that moving from FP32 to an AMP-based model, with or without XLA enabled, causes no loss in Dice score, while in terms of computing time we run approximately 3× faster with XLA, and 2.5× faster without, compared to the FP32 model. Similar behavior is measured on the A100 system.
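For reference, the Dice score between a predicted mask A and a ground-truth mask B is 2|A∩B| / (|A| + |B|). The following minimal sketch computes it on flattened binary masks in plain Python; the function name and the handling of two empty masks are assumptions made for illustration.

```python
def dice_score(pred, truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy example: 3 voxels predicted, 3 voxels annotated, 2 in common.
pred = [0, 1, 1, 1, 0, 0]
truth = [0, 1, 1, 0, 1, 0]
score = dice_score(pred, truth)
```

In the toy example the score is 2·2 / (3 + 3) ≈ 0.67; a score of 1 indicates perfect overlap and 0 indicates no overlap.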
Finally, in Fig. 10 we report the quality of service (QoS) plots obtained running 10000 inferences. On both systems the inference time is quite stable up to the 99.99th percentile of the inference latency distribution, showing that the GPU inference time is largely predictable and the impact of potential distribution outliers is negligible.
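A QoS curve of this kind is built from tail percentiles of the measured latencies. The sketch below shows one way to compute them with the nearest-rank method; the latency values are synthetic and purely illustrative, not our measurements.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of the data is less than or equal to it."""
    if not samples or not 0 < pct <= 100:
        raise ValueError("need non-empty samples and 0 < pct <= 100")
    ordered = sorted(samples)
    # round() guards against floating-point noise before taking the ceiling.
    rank = math.ceil(round(pct / 100.0 * len(ordered), 9))
    return ordered[rank - 1]

# Synthetic latencies (ms), purely illustrative: 9999 fast runs, 1 outlier.
latencies = [10.0] * 9999 + [50.0]
for pct in (50, 90, 99, 99.9, 99.99, 100):
    print(f"p{pct}: {percentile_nearest_rank(latencies, pct)} ms")
```

In this synthetic case the single outlier only appears at the 100th percentile, which is the kind of behavior described as stable up to 99.99%.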

VI. CONCLUSION
We have developed an HPC computing pipeline to automate the segmentation of the aortic root in contrast-enhanced CCTs, using CNNs and GPU accelerators, and to quantify calcification volumes within it. The CNNs used achieve a Dice score of ≈0.90, whereas the post-processing adaptive filter we have developed agrees with the radiologists' measurements of calcification volumes with R² = 0.97 and a mean error of ≈0.49 mm³. We have run our pipeline on two small-footprint multi-GPU HPC systems. In terms of computing performance, the scaling as a function of the number of GPUs is close to ideal, achieving an aggregate throughput of approximately 17 CCT/sec on the V100 system and approximately 120 CCT/sec on the DGX A100 system. These results show that the pipeline is suitable to be hosted in hospital data centers to support radiologists in assessing the aortic root calcification level in routine clinical activities to plan the TAVI.
In addition, having a computationally fast and accurate system for processing CCT images can be beneficial for many reasons. For example, retrospective studies that look for correlations between the calcification level of the aortic root and the success of the TAVI procedure require processing hundreds or thousands of CCT scans. It might also open the possibility of making (pseudo) real-time assessments of the calcium burden volume by integrating the prediction system in the CT scanner. Fast training is also beneficial, for example, to re-train the models quickly on larger datasets, including new cases in the train-set each time significant mispredictions are found. Finally, as a more general comment, reducing the time to solution is also beneficial in terms of energy to solution.
For future developments, we plan to validate our system on a larger dataset that includes CCTs from different scanners, and to use it in retrospective studies to find correlations between aortic root calcification and TAVI procedure failures, as well as patients' post-procedural prognosis.
GIADA MINGHINI received the master's degree in biotechnology for the environment and health from the University of Ferrara, Ferrara, Italy, in 2021. Since 2022, she has been a Researcher at the University of Ferrara and her activity focuses on the analysis of cardiovascular images to develop a high-performance automatic system to detect, segment, and quantify the volume of calcium in the aortic valve.

FIGURE 1. From left to right, Axial, Coronal, and Sagittal views of the segmentation of the aortic root. The images shown in (a) are extracted without applying the pre-processing filter for removing artifacts (see text), and show an incorrect segmentation of the anatomical structure, since the areas of the aortic root under the white spots are not included in the segmentation; images in (b) show the correct segmentation achieved by applying the pre-processing filter.

FIGURE 2. Steps to build the segmentation of calcium lesions: (a) the 3D Threshold tool of MITK highlights all pixels with a luminance value equal to or greater than a threshold; (b) by setting the appropriate threshold value, the operator highlights the calcium lesions within the aortic root; (c) the MITK Segmentation tool is applied to create and export the ground-truth mask.

FIGURE 3. Processing steps to segment the aortic root and calcifications: (a) The CCT image is scaled by a factor of 2 along the coronal and sagittal axes, and the ROI around the aortic root is extracted (yellow bounding box) by the U-Net1. (b) The aortic root is segmented with the U-Net2 within the ROI built by the U-Net1 on the image at full resolution. (c) The adaptive filter is applied on the aortic root mask to detect calcification areas (green regions). (d) The flowchart of our pipeline, where the green boxes identify the CCT processing steps and the blue ones the U-Nets and filter phases.

FIGURE 4. Architecture of the 3D-UNet Medical CNN. It consists of 5-step contracting and expansive paths.

FIGURE 5. (a) Position of the geometric center (red point). (b) Position of the cube to sample the value of L_m at the sino-tubular junction. (c) Example where the value L_m = 597 sampled within the cube is not high enough to avoid the selection of voxels in the middle of the aortic root, where calcifications are not possible (yellow regions); the luminance of calcification areas (green regions) is ≥ 704 HU.

FIGURE 6. Bland-Altman difference plots for calcium thresholds in HU (a), and calcium volume in mm³ (b). Limits of agreement are ±1.96 times the standard deviation.

FIGURE 7. Train latency per CCT for the different models as a function of the number of GPUs.

FIGURE 8. Average train throughput (a) and corresponding scaling (b) achieved with different precisions and optimizations as a function of the number of GPUs.

FIGURE 9. Throughput achieved for inference on one GPU with different precisions and optimizations enabled.

FIGURE 10. Quality of Service plots obtained running 10000 image inferences on the V100 (a) and A100 (b) systems.
ARMANDO UGO CAVALLO received the degree in medicine and surgery from the University of Salerno, in 2014, and the Ph.D. degree in medical biotechnology and translational medicine with a major focus on radiomics and artificial intelligence applications in cardiovascular and thoracic imaging. From 2015 to 2019, he was a Radiology Resident at the University of Rome "Tor Vergata." In 2018, he was a Research Fellow in cardiovascular imaging at Cleveland University Hospital/Case Western Reserve University, Cleveland, OH, USA. He has been a Staff Radiologist at Istituto Dermopatico dell'Immacolata, Rome, Italy, since 2022.

TABLE 1. Summary of the accuracy and performance characteristics in state-of-the-art applications for aorta segmentation.