Learning-Based Quality Control for Cardiac MR Images

The effectiveness of a cardiovascular magnetic resonance (CMR) scan depends on the ability of the operator to correctly tune the acquisition parameters to the subject being scanned and on the potential occurrence of imaging artefacts such as cardiac and respiratory motion. In the clinical practice, a quality control step is performed by visual assessment of the acquired images: however, this procedure is strongly operator-dependent, cumbersome and sometimes incompatible with the time constraints in clinical settings and large-scale studies. We propose a fast, fully-automated, learning-based quality control pipeline for CMR images, specifically for short-axis image stacks. Our pipeline performs three important quality checks: 1) heart coverage estimation, 2) inter-slice motion detection, 3) image contrast estimation in the cardiac region. The pipeline uses a hybrid decision forest method - integrating both regression and structured classification models - to extract landmarks as well as probabilistic segmentation maps from both long- and short-axis images as a basis to perform the quality checks. The technique was tested on up to 3000 cases from the UK Biobank as well as on 100 cases from the UK Digital Heart Project, and validated against manual annotations and visual inspections performed by expert interpreters. The results show the capability of the proposed pipeline to correctly detect incomplete or corrupted scans (e.g. on UK Biobank, sensitivity and specificity respectively 88% and 99% for heart coverage estimation, 85% and 95% for motion detection), allowing their exclusion from the analysed dataset or the triggering of a new acquisition.

Abstract-The effectiveness of a cardiovascular magnetic resonance (CMR) scan depends on the ability of the operator to correctly tune the acquisition parameters to the subject being scanned and on the potential occurrence of imaging artefacts such as cardiac and respiratory motion. In the clinical practice, a quality control step is performed by visual assessment of the acquired images: however, this procedure is strongly operatordependent, cumbersome and sometimes incompatible with the time constraints in clinical settings and large-scale studies. We propose a fast, fully-automated, learning-based quality control pipeline for CMR images, specifically for short-axis image stacks. Our pipeline performs three important quality checks: 1) heart coverage estimation, 2) inter-slice motion detection, 3) image contrast estimation in the cardiac region. The pipeline uses a hybrid decision forest method -integrating both regression and structured classification models -to extract landmarks as well as probabilistic segmentation maps from both long-and short-axis images as a basis to perform the quality checks. The technique was tested on up to 3000 cases from the UK Biobank as well as on 100 cases from the UK Digital Heart Project, and validated against manual annotations and visual inspections performed by expert interpreters. The results show the capability of the proposed pipeline to correctly detect incomplete or corrupted scans (e.g. on UK Biobank, sensitivity and specificity respectively 88% and 99% for heart coverage estimation, 85% and 95% for motion detection), allowing their exclusion from the analysed dataset or the triggering of a new acquisition.
Index Terms-Image quality assessment, Magnetic resonance imaging, Motion compensation and analysis, Heart

I. INTRODUCTION
C ARDIOVASCULAR magnetic resonance (CMR) imaging presents a wide variety of different applications for the anatomical and functional assessment of the heart. The success of a CMR acquisition relies, however, on the ability of the MR operator to correctly tune the acquisition parameters to the subject being scanned [1]. Moreover, CMR can be negatively affected by a long list of imaging artefacts (caused for instance by respiratory and cardiac motion, blood flow and magnetic field inhomogeneities) [2]. Therefore, a quality control step is required to assess the usability of the acquired G. Tarroni  This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Heart Coverage
Image Contrast Inter-slice Motion Optimal Non-optimal Fig. 1: Potential issues affecting CMR image acquisitions. In the first two columns, the superimposition of long-axis twochamber views (red) and short-axis stacks (gray) is shown, while, in the last one, short-axis slices are displayed.
images. In the clinical practice this step is performed by visual inspection, usually carried out by the same operator who set up the acquisition, thus leading to highly subjective results. In the last decades, several initiatives for the acquisition of open access large-scale population studies have been launched. For example, the UK Biobank (UKBB) is a population-based prospective study, established to allow detailed investigations of the genetic and non-genetic determinants of the diseases of middle and old age. Of the 500,000 subjects enrolled in the study, CMR will be collected from 100,000 of them [3]. At the time of submission the acquisition is ongoing, with close to 20,000 subjects already scanned. Together with this trend towards the implementation of large-scale multicentre imaging datasets, the need for fast and reliable quality control techniques for CMR images has become evident, as highlighted also by several studies aiming to define standardized criteria for this task [4]. In this scenario, quality control through visual inspection is not only subjective, but simply infeasible due to the very high throughput demanded by the acquisition pipeline. On the other hand, failure to correctly identify corrupted or unusable images could affect the results of automated analysis performed on the dataset, with undesirable effects. Consequently, the need for fully automated quality control pipelines for CMR images has arisen.
Many research efforts have been dedicated to the automated identification of quality metrics from MR images. Most of these efforts have focused on the automated estimation of noise levels [5], [6]. Still, many aspects related to the usability of the acquired images are inherently modality-specific. Several automated pipelines for quality control have been proposed for brain MR imaging [7]. However, to our knowledge, no comprehensive automated quality control pipelines have been proposed so far for CMR images, in particular for the shortaxis (SA) cine image stacks, which are the reference images for the structural and functional assessment of the heart. One crucial aspect of the acquisition of SA image stacks is that it requires the MR operator to identify the direction of the left ventricular (LV) long axis -the line going from the apex to the centre of the mitral valve -and to define a region of interest: the correct planning will generate a SA stack encompassing both those landmarks with slices perpendicular to the LV long axis. If this selection is incorrect, the acquired SA stack may include an insufficient number of SA slices to fully cover the LV (see first column of Fig. 1). As a consequence, any functional analysis performed on the stack (e.g. ventricular volumes estimation) may be compromised. Another important aspect involved in CMR acquisitions is that SA cine stacks are generated during multiple breath-holds (with usually 1-3 slices acquired per each breath-hold). Although the subjects are instructed to hold their breath at the same breath-holding position, in practice the heart location can vary considerably. If the differences between the breath-holding positions are too pronounced, the acquired image stack will be affected by inter-slice motion and thus will not correctly represent the cardiac shape, introducing potential errors in the following analyses and visualizations (see second column of Fig. 1). Finally, the contrast of the obtained CMR images is directly affected by the chosen acquisition parameters (as well as by potential artefacts). If the different structures of the heart are not properly contrasted, the assessment of the cardiac function can be hampered (see third column of Fig. 1).
In this paper, we present a fully-automated, learning-based quality control technique for CMR SA image stacks. Our approach uses a hybrid decision forest method to extract at once both landmark positions (LMs) and probabilistic segmentation maps (PSMs) from long-axis (LA) and SA images. LMs and PSMs are then used to perform three quality checks: 1) heart coverage estimation, 2) inter-slice motion detection, 3) image contrast estimation in the cardiac region. Our hybrid forest method is thus not intended as a novel technique for landmark detection and segmentation per se, but rather as an integral component of our pipeline. The extraction of LMs from multiple LA views and the probabilistic nature of PSMs allow to assess the reliability of the pipeline for each scan using dedicated sanity checks. The technique was tested on two datasets (up to 3000 cases from the UKBB study and 100 cases from the UK Digital Heart Project [8], UKDHP) and validated against manual annotations and visual inspections.

II. RELATED WORK
To the best of our knowledge, differently from brain MRI [7], no comprehensive quality control techniques for cardiac CMR images have been reported in the literature. One of the few studies in this direction has been recently presented by Alba et al. [9], who however focussed on assessing segmentation quality rather than image quality. On the other hand, automated heart coverage estimation alone has been the aim of several studies. Zhang et al. [10], [11] proposed to use convolutional neural networks (CNN) to perform slice classification in order to detect the presence or absence of the basal and apical slices. In their first work [10] they proposed a 2D CNN trained on UKBB data, while in their more recent one [11] they improved their previous results by using a generative adversarial network. Differently from these techniques, our approach to heart coverage estimation is based on the detection of landmarks: in our previous preliminary work [12], we proposed a decision forest method to detect the cardiac apex and the mitral valve on long-axis 2-chamber (LA 2CH) view images, and used the position of these landmarks with respect to the space encompassed by the acquired stack to estimate the coverage. The technique was applied to 3000 cases extracted from the UKBB, and was able to detect SA stacks with insufficient coverage with relatively high accuracy.
Motion detection and modeling in the thoracic area has been a highly investigated subject for more than a decade [13]. As far as inter-slice respiratory motion in CMR is concerned, most of the approaches reported in the last decade have focussed on motion correction rather than motion detection [14], [15], [16], [17]. All of the cited studies focused on the compensation of inter-slice motion and in the generation of a corrected SA stack by means of rigid in-plane registration. Unfortunately, however, respiration causes a complex roto-translation of the heart in all three dimensions [18]: while most translation happens in the cranio-caudal direction (thus approximately almost perpendicularly to the long axis of the LV), big differences in subsequent breath-holding positions can cause out-of-plane motion, which would lead to an inaccurate representation of the heart in the stack. As a consequence, it is important to estimate the amount of motion occurred during the acquisition of the stack in order to decide whether there are the grounds for the application of a motion correction technique or it is instead advisable to repeat the scan (or exclude it from subsequent analyses).
In the past, several research efforts have been made towards the correct quantification of signal-to-noise (SNR) or contrastto-noise (CNR) ratios in MR images [5]. However, modern acquisition techniques making use of parallel imaging produce images with spatially-varying noise distributions, rendering image-based estimators unreliable [19]. To overcome this limitation, more elaborate methods have been proposed exploiting information about coil sensitivity or reconstruction coefficients [20]. Unfortunately, these data are very often not available, making the estimation of noise, and consequently of SNR and CNR, practically unfeasible in most scenarios. At the same time, image contrast between two objects -simply defined as the difference between their signal intensity -has long been used to determine their visual differentiability in the acquired MR image [21]. In CMR imaging, images with poor contrast between the LV cavity and myocardium can potentially hinder the assessment of cardiac structure and function: consequently, contrast estimation in the cardiac region can provide a useful metric for quality control purposes, either triggering the use of contrast-enhancing techniques or a new acquisition.
In this paper, we present a fully-automated, learning-based quality control pipeline for CMR SA stacks. The proposed approach builds upon our previous work [12], which used a hybrid decision forest method [22] to extract LMs from LA 2CH view images in order to perform heart coverage estimation. With respect to our previous approach as well as to state-of-the-art techniques, the main contributions of the present work can be listed as follows: • We present the first comprehensive, fast, fully-automated quality control pipeline specifically designed for CMR SA image stacks. The checks incorporated in the pipeline are 1) heart coverage estimation, 2) inter-slice motion detection, 3) image contrast estimation in the cardiac region. To the best of our knowledge, motion detection and cardiac image contrast for the sake of quality control have not been investigated before. As for heart coverage estimation, we build on our previously published study [12] by extending LMs extraction to all long-axis views.
LMs are then combined together to substantially increase the robustness and the reliability of this quality check (for details please refer to the Discussion section); • We propose a different implementation of the previously published hybrid decision forest [22] (adopted in our previous work [12]) which allowed the joint extraction of LMs and probabilistic edge maps (PEMs). The new implementation (based on a novel mapping) allows instead the extraction of LMs and PSMs: PSMs are required to perform both inter-slice motion detection and cardiac image contrast estimation, and enable sanity checks to assess the reliability of the pipeline; • We validate this pipeline by applying it to up to 3000 cases extracted from the UKBB study and to 100 cases from the UKDHP, showing its accuracy and robustness in real world scenarios. The pipeline could be both applied retrospectively on large-scale datasets to improve the reliability of clinical studies or deployed prospectively at acquisition sites to allow almost real-time assessment of the acquired scans.

III. METHODS
The proposed quality control pipeline is summarized in Fig. 2. All of the three quality control steps are based on the information extracted by hybrid decision forest models from the acquired images. This section of the paper starts with a brief recap on the theory behind decision forests and is followed by the description of the implementation adopted in the proposed pipeline, which allows the joint extraction of LMs and PSMs. Finally, each specific quality control step is described in detail.

A. Hybrid Decision Forests
A decision tree consists of a combination of split and leaf nodes arranged in a binary tree structure [23]. Trees route a sample x ∈ X (in our case an image patch) by recursively branching left or right at each split node j until a leaf node k is reached, where the posterior distribution p(y|x) for the output variable y ∈ Y is stored. Each split node j is associated with a binary split function h(x, θ j ) ∈ {0, 1} defined by the set of parameters θ j : if h = 0 the node sends x to the left, otherwise to the right. Usually h is a decision stump, i.e. a single feature dimension n of x is compared with a threshold τ : A decision forest is an ensemble of T independent decision trees: during testing, given a sample patch x, the predictions of the different trees are combined into a single output by means of an ensemble model. During training, at each node the goal is to find the set of parameters θ j which maximizes a previously defined information gain I j , usually defined as j are respectively the training set (comprising of samples x and associated labels y) arriving at node j, leaving the node to the left and to the right. H(S) is the entropy of the training set, whose construction depends on the task at hand (e.g. classification, regression). Different types of nodes (maximizing different information gains) can be interleaved within a single tree structure (hence named "hybrid") in order to perform multiple tasks. As in previous approaches [24], [22], in the proposed technique structured classification nodes (aiming at the detection of an object close to the desired landmarks, in our case usually the LV cavity) and regression nodes (aiming at landmark localization) are combined (see Fig. 3). In particular, in the proposed framework, landmark localization is conditioned on the results of the detection of the cavity [24]. This not only leads to the extraction of two different types of information (PSMs and LMs) with only one model, but improves landmark localization by implicitly incorporating complementary information about cardiac position and shape.
Structured Classification and PSM Extraction: Structured classification extends the concept of classification by using structured labels for Y instead of integer labels. In our case, each label y ∈ Y (associated with the image patch x) consists of a segmentation of the LV cavity within x. To train a structured classification node it is necessary to find a way to cluster structured labels at each split node into two subgroups depending on a similarity measure. The solution to this problem was first proposed by Dollar et al. [25] and consists of two steps. First, Y is mapped to an intermediate space Z by means of the function Π : Y → Z where the distance between labels can be computed. Importantly, Π must be chosen so that similar labels y will be associated with vectors z close to each other with respect to the distance defined in Z. Then, PCA is applied to the vectors z to map the associated labels y into a binary set of labels c ∈ C = {0, 1}, which is achieved by applying a binary quantization to the principal component of each z vector. Finally, the Shannon entropy can be adopted [25]: with p(c) indicating the empirical distribution extracted from the training subset at each node. In our previous work [22], this approach has been adopted for structured labels Y consisting of edge maps (EMs) highlighting the contours of the myocardium. In the case of EMs, the mapping Π can simply encode for each pair of pixels whether they belong to the same segment in the label y or not: where j 1 and j 2 are indices spanning every pixel in y [25]. The resulting long binary vector z (which has a number of dimensions equal to the number of pixel pairs in y) can be used to compare this particular label to the other ones by simply computing the Euclidean distance in Z. However, the same choice for Π cannot be adopted for our task, which aims at using structured labels consisting of segmentation maps (SMs) of the LV cavity. For example, let's imagine two labels y 1 and y 2 , the former completely outside the LV cavity and the latter completely inside: using the mapping Π EM , we would obtain z 1 = z 2 , which contradicts the requirement by which only similar labels will be mapped close to each other in Z. Consequently, we implemented a different mapping: which encodes for each pair of pixels in y whether they are both equal to 0, whether they are both equal to 1 and then concatenates the two obtained binary vectors. This formulation ensures the proper computation of the distance between labels, and thus their clustering at each node based on their similarity. At the end of the training process, the labelŷ stored in each leaf node is the one whoseẑ is the medoid (i.e. that minimizes the sum of distances to all the other z at the same node). At testing time, each sample patch of the test image is sent down each tree of the forest, and the segmentation maps stored at each selected leaf node are averaged, producing a smooth segmentation map (PSM) of the LV cavity. The values in the PSM are actual probabilities (proportional to the certainty in LV cavity detection), and can be used to assess the reliability of the prediction. Of note, the introduced formulation for Π SM in Eq. 3 could be easily extended to multi-label PSM generation by concatenating additional binary vectors computed for each label c i and by performing a channel-based averaging operation at testing time.
Regression and Landmark Detection: To train regression nodes, it is necessary to associate with each sample patch x an additional label D = (d 1 , d 2 , . . . , d L ), where d l represents for each of the L landmarks the N -dimensional displacement vector from the patch centre to the landmark location. Instead of the Shannon entropy defined in Eq. 1, regression nodes are trained by minimizing the determinant of the covariance matrix |Λ(S)| defined by the landmark displacement vectors: Landmark positions are assumed to be uncorrelated, thus only the diagonal elements of Λ(S) are used in Eq. 4 [26]. The location predictions are stored at each leaf node k using a parametric model following a N · L-dimensional multivariate normal distribution with d l k and Σ l k mean and covariance matrices, respectively. At testing time, Hough vote maps are generated for each landmark by summing up the posterior distributions obtained from each tree for each patch (applying normalization factors) [24]. Assuming that pixels belonging to the LV are more informative for cardiac landmark detection than background ones, the PSM values for the LV cavity are used for each patch as weighting factor during the generation of the L Hough vote maps, effectively restricting voting rights only to pixels likely to belong to the LV cavity itself [22]. Finally, the location of a landmark is determined by identifying the pixel with the highest value on each Hough vote map.
Model Training: Each patch x is represented by several features: multi-resolution image intensity, histogram of gradients (HoG) and gradient magnitude. For a detailed description, please refer to [22]. The described hybrid random forest approach is used to build five different models (I-V) for our application (see Fig. 2): PSM estimation of LV cavity and LMs extraction for apex and mitral valve for LA images, PSM of LV cavity and LV myocardium for SA stacks. For LA 3CH and 4CH images (models II and III) only one mitral valve point is identified because in these images the LV outflow tract of the aorta can partially occlude one side of the mitral valve, making its localization inaccurate. Also, the training of the models using SA images (models IV and V) is performed by feeding the random forests with all the slices extracted from the SA image stacks: consequently, at testing time, the models are applied to each slice of the stack independently.

Algorithm 1: Heart Coverage Estimation
Input landmarks: Compute median landmarks: with z-components l a and l m , respectively Extract SA stack extension in the z direction: Apex: r a Base: r m Compute coverage CV:

B. Heart Coverage Estimation
Heart coverage is estimated exploiting the landmarks identified on LA 2CH, 3CH and 4CH images using the previously trained hybrid forest models. The rationale is that a properly scanned SA stack should encompass the whole portion of space between the apex and the mitral valve. As highlighted in Fig. 2, for a specific subject we identify three landmarks for the apex (one per each LA image: l 2CH with values in the coordinate systems of each respective LA image. Using the orientation matrix extracted from the DICOM headers of the acquired SA and LA images, it is possible to define the coordinates of these landmarks in the coordinate system of the SA stack itself. Two new "median" landmarks (l a and l m ) are then defined taking the medians of the coordinates of the landmarks for the apex and for the mitral valve, respectively, in the SA coordinate system. The extension in the z direction (i.e. along the LV long axis) of the SA stack can be easily computed from the slice thickness and slice number, which are stored in the DICOM header of the stack itself: the two extrema along this direction are defined r a and r m , respectively. Finally, the relative coverage can be computed by comparing the relative positions along the z direction of l a and l m (i.e. the space that is supposed to be covered by the SA stack) to the portion of space between r a and r m (i.e. the space that is actually covered). The steps for coverage estimation are listed in Algorithm 1, including the formula for the computation of the coverage (under the assumption that the apex is located at higher z compared to the mitral valve). Importantly, this technique can seamlessly be applied even if only one LA image is available. Also, while minor motion can occur between the acquisitions of LA images and of the SA stack, it is generally negligible in the z direction (the only one influencing coverage) [18] and thus registration procedures between these images were found to be unnecessary. Finally, a sanity check is performed to detect cases in which landmark detection failed: for each LA view, when either of the relative distances between the landmarks was greater or smaller than reference values by a certain threshold, the landmarks from that image were discarded, and the automated coverage estimation was performed only on the remaining landmarks (if available).

C. Inter-Slice Motion Detection
Inter-slice motion detection relies on the PSMs extracted from the acquired images. The rationale is that while LV cavity PSMs of motion-corrupted SA slices are misaligned, PSMs extracted from the LA images represent sections of the true shape of the LV cavity and can consequently be used as reference. Moreover, the amount of misalignment between the SA PSMs and LA PSMs can be used as an indicator of motion. To perform this assessment, the LA PSMs are initially rigidly registered (by 3D translation only, using sum of squared differences as dissimilarity metric) to the SA PSM stack to compensate for potential motion between different acquisitions. Then, for each slice of the SA PSM stack, the three registered LA PSMs are resampled and combined into a single image (referred to as combined LA PSM) containing the sections of the LA PSMs with respect to a specific slice (see Fig. 4). Finally, in-plane rigid registration (by translation only, using sum of squared differences as dissimilarity metric) is performed between each SA PSM slice and the associated combined LA PSM, and the magnitude of the translation T s used as a metric for motion (i.e. differences in breath-holding positions). Of note, this step is performed only on the slices which are effectively covering the LV, condition assessed using the LA LMs as in Algorithm 1. The probabilistic nature of PSMs allows the application of a sanity check performed to detect slices with a failed PSM estimation: SA PSM slices (whose values range between 0 and 1024) with a peak probability value below a user-defined threshold are considered unreliable, and thus their T s discarded. Also, this technique could be applied even if only two LA images were available. The steps for motion detection are listed in Algorithm 2.  Fig. 4: Motion detection technique. For each slice of the SA stack, the corresponding portion of space in each LA PSM is resampled and combined, producing the "asterisk-shaped" LA PSM comb image. In-plane rigid registration is then performed between each SA PSM and LA PSM comb, and the translation magnitude T s used as proxy for inter-slice motion for that slice. A color map, with the intensity of each slice proportional to the respective T s , can be also generated.

D. Cardiac Image Contrast Estimation
Cardiac image contrast is estimated using the LV cavity and LV myocardium PSMs extracted from the SA stack. The rationale is to transform the PSMs into hard segmentations (SMs) and to use them to estimate the difference between average pixel intensity in the LV cavity and in the LV myocardium. Each cavity PSM slice is thresholded selecting the N cav pixels with the highest probability values: this will maximize the probability of measuring the intensity in the actual cavity. The same happens to each myocardium PSM, thresholded selecting N myo pixels. To exclude potential spurious regions from the obtained segmentation, the average centroid for the cavity segmentation is computed among the different slices, and for each slice only the connected component closest to the average centroid is kept, both for the cavity and the myocardium segmentations. Of note, this step is performed taking into account the slice-by-slice rigid transformation estimated using Algorithm 2, which amounts to performing the average centroid computation and connected components analysis on a motion-compensated stack. Finally, in order to exclude potential papillary muscles from the cavity intensity computation, a Gaussian mixture model is fitted to the distribution of intensity values inside the cavity segmentation. Since only some slices show papillary muscles, both a twocomponent and a one-component models are used, and only one is selected based on the Akaike information criterion [27]. If the two-component model yields the best fit, since the cavity distribution is always higher than that of papillary muscles, the mean of the component with the highest mean is used as average intensity value for the cavity. For the myocardium, the mean intensity of the pixels masked by the segmentation is computed. Cardiac image contrast is finally defined as the difference between these two values. A double sanity check is performed leveraging the probabilistic nature of PSMs: if either the peak value of either the cavity or the myocardium PSM was below a user-defined threshold or the size of either of the final hard segmentations for the cavity or the myocardium was less than a defined number of pixels, the obtained contrast was deemed unreliable. The steps for cardiac image contrast estimation are also listed in Algorithm 3. with SA s the s-slice of the SA stack end E. Performance Evaluation Image Acquisition: To train and test the proposed quality control pipeline, images from two different datasets were used: the UKBB [3] and the UKDHP [8]. CMR imaging for the UKBB was performed using a 1.5T Siemens MAGNETOM Aera system equipped with a 18 channels anterior body surface coil (45 mT/m and 200 T/m/s gradient system). 2D cine balanced steady-state free precession (b-SSFP) SA image stacks were acquired with in-plane spatial resolution 1.8×1.8 mm, slice thickness 8 mm, slice gap 2 mm, image size 198×208 and average number of slices 10. 2D cine b-SSFP LA images were acquired with in-plane spatial resolution 1.8×1.8 mm, slice thickness 8 mm and image size 162×208. Further acquisition details can be found in [3]. CMR imaging for the UKDHP was performed on healthy volunteers using a 1.5T Philips Achieva system equipped with a 32 element cardiac phased-array coil (33 mT/m and 160 T/m/s gradient system). 2D cine balanced steady-state free precession (b-SSFP) SA image stacks were acquired with in-plane spatial resolution 1.2×1.2 mm, slice thickness 8 mm, slice gap 2 mm, image size 288×288 and average number of slices 12. 2D cine b-SSFP LA images were acquired with in-plane spatial resolution 1.5×1.5 mm, slice thickness 8 mm and image size 256×256. In both datasets, only end-diastolic frames were considered.
Experimental design: A series of experiments was conducted to assess the accuracy of each portion of the pipeline. First of all, the five hybrid random forest models were trained using a randomly-generated subset of 500 cases from the UKBB. For each LA image-based model, the 500 images were used together with manually-annotated landmarks and segmentations of the LV cavity. The segmentations were obtained with a CNN-based automated tool proven to reach human-level performance [28], and then visually checked for accuracy. Each training set was quadrupled in size through data augmentation applying random rescaling (following a normal distribution with µ = 1, σ = 0.1) and random rotation (µ = 0°, σ = 30°). For each of the two SA stack-based models, the slices extracted from the 500 stacks were used (for a total of 5165 images) together with segmentations of the LV cavity and of the LV myocardium, respectively (obtained using the same process described for LA images). Details regarding forest training include image patch size 48×48 px for LA models and 32×32 px for SA ones, segmentation label size 16×16 px, number of samples 4·10 6 , number of trees T = 8.
A first series of experiments was performed by evaluating the trained pipeline on a separate testing set consisting of 3000 cases randomly extracted from the UKBB. To evaluate the accuracy of the proposed heart coverage estimation technique, two experiments were conducted. First, for each of the three LA views, the positions of the landmarks were manually annotated on 100 randomly selected cases. The automatically detected LMs were compared to the manually identified ones by measuring the Euclidean distance between the two sets of points. Then, the 3000 SA stacks were visually inspected (sometimes using LA images as reference) to identify cases with insufficient coverage, defined as such when at least one full slice was missing. Automated heart coverage estimation was then performed on the same dataset. To instruct the previously described sanity check, the mean and standard deviation of the relative distances between manually annotated landmarks were computed on the 100 images (l 2 − l 1 = 89 ± 12 mm, l 3 − l 2 = 32 ± 5 mm, l 3 − l 1 = 87 ± 12 mm): then, for each LA view, when either of the relative distances between the automatically detected LMs was over 2 standard deviations greater or smaller than the respective mean distance value (thus covering roughly 95% of the measured variability), the LMs from that image were discarded and the automated coverage estimation was performed only on the remaining ones (if available). Finally, the accuracy of the technique was assessed against the performed visual inspection performing a standard binary classification test using a threshold for insufficient coverage optimized automatically with an ROC analysis. To evaluate the accuracy of the motion detection technique, two experiments were conducted. First, for each of the three LA views as well as for the SA stacks, the automatically extracted PSMs were compared to hard segmentations obtained using the previously-described CNN-based automated tool [28] on 1000 randomly selected cases. While this experiment was aimed at assessing the accuracy of the PSMs, it is worth noting that the PSMs are never directly thresholded for segmentation purposes in the pipeline, which on the contrary exploits their probabilistic nature. For the sake of this comparison, the PSMs were turned into hard segmentation by applying a global threshold and compared to the reference ones by computing the Dice coefficient (DSC). The global threshold was optimized automatically using an ROC analysis. Then, 1500 SA stacks were visually inspected (sometimes using LA images as reference) to identify cases with noticeable motion corruption. Automated motion detection was then performed on the same dataset. To implement the previously described sanity check, PSM slices with peak probability values below 600 were considered not reliable for motion detection, and thus their T s (i.e. the estimated translation magnitude) discarded; if less than 2 T s values were left, the motion detection analysis was not performed on the specific stack. Accuracy of the automated technique was assessed against visual inspection with a standard binary classification test using the following criterion: a stack was deemed motion-corrupted if either the average T s was above a first threshold T A or at least two T s were above a second threshold T B . This double criterion aimed at the detection of both stacks with a few, clearly misaligned slices as well as stacks with poor general alignment. Both T A and T B were optimized automatically using an ROC-like approach. To evaluate the accuracy of the cardiac image contrast estimation technique, 100 random slices from as many random SA stacks were manually annotated selecting regions of interests (ROIs) within the LV cavity and the LV myocardium. Cardiac image contrast was estimated both from the original images and from the images after contrast normalization using a randomly selected reference image stack. Automated contrast estimation was then performed on the same dataset, both before and after normalization, using N cav = 450 px and N myo = 200 px. To implement the previously described sanity check, contrast extracted from slices with PSMs (either for the cavity or for the myocardium) with peak values below 150 or with respective hard segmentations with a size of less than 32 mm 2 (i.e. 10 pixels) was deemed unreliable and excluded from the analysis. Automatically estimated and manually computed contrast values were compared using Pearson's correlation coefficient, linear regression and Bland-Altman analyses.
A second series of experiments was then performed by evaluating the pipeline trained on UKBB on a separate testing set consisting of 100 cases randomly extracted from the UKDHP. Since the scans in UKDHP were acquired with a different scanner and with different parameters from those used for UKBB, these experiments were aimed at assessing the generalization properties of the proposed pipeline. To harmonize the differences between training and testing datasets, the images in UKDHP were pre-processed through intensity normalization [29], spatial resampling and image reorientation. The 100 cases were then visually inspected and manually annotated following the same criteria described for the previous experiments to provide the ground truth for estimation coverage, motion detection and contrast estimation. Since the visual assessment for heart coverage estimation returned no sub-optimal cases, a procedure was implemented to simulate coverage issues and allow a more meaningful evaluation of the pipeline. Stacks were randomly picked following a uniform distribution (10% chances of being picked), and a number of slices were deleted (either from the top or the bottom of the stack with equal probability), with this number randomly selected from a normal distribution (µ = 1, σ = 2). Coverage was then visually re-assessed on the whole dataset. It is important to note that while this corruption procedure altered the properties of the dataset with respect to coverage, it did not affect the images on which the learning-based portion of the pipeline is applied (i.e. the LA images) but only the SA stacks, which influence the coverage estimation by means of their size and spatial orientation. The pipeline was applied with the same settings used for the previous dataset except for the threshold for the sanity check for contrast estimation relative to the peak PSM value, which was moved from 150 to 100 to account for the slightly lower overall response in the PSMs. The evaluation strategy for the three checks was the same as for the previous set of experiments.
For all the experiments, manual annotations and visual inspections used as ground truth were performed internally by G. T. (medical imaging researcher with 10 years of experience in cardiac imaging) and H. S. (experienced cardiologist), both blinded to the results of the automated analyses: more specifically, H. S. visually inspected the 3000 SA stacks from the UKBB dataset to identify cases with insufficient coverage, and G. T. performed all of the remaining assessments.

IV. RESULTS
The experiments were initially run on a single core of an Intel™Xeon CPU E5-1650 v3 @ 3.50GHz with 64 GB of memory to assess the speed of the current pipeline. Average time required to extract PSMs and LMs (when included in the model) was 1.3s per SA stack (of roughly 10 slices) and 0.85s per LA image. Average times required to perform the quality control checks were 0.26s per SA stack for coverage Fig. 5: Results for heart coverage estimation in two cases, one with sufficient (case #1) and one with insufficient coverage (case #2). In the first three columns, the results for landmark detection in the three LA views. In the last column, a mix view with the LA two-chamber view and the SA stack together with the median landmarks for the apex and the mitral valve.  estimation, 9s per SA stack for motion detection (in this case using parallelization on 6 cores to evaluate multiple slices from one stack at once) and 0.6s per slice for contrast estimation.
The localization errors for landmark detection on UKBB for the three LA views are reported in Table I and in Fig.  9 (Supplementary Material). Of note, the landmarks extracted from one image per LA view were identified as outliers and thus excluded from the reported results. Mean DSC values between thresholded PSMs (using a threshold of 450) and reference segmentations were respectively 0.90 ± 0.07 for the SA stacks, 0.94 ± 0.08 for LA 2CH, 0.94 ± 0.08 for LA 3CH, and 0.94 ± 0.07 for LA 4CH.
First are reported the results on UKBB. For accuracy assessment of heart coverage estimation, 3 of the 3000 cases were excluded from the analysis: one due to the lack of LA images, and two for failing the sanity check on all the LA images. The ROC analysis performed on the remaining 2997 images returned an optimal threshold of 90%. The results of the binary classification test are reported in Table  II. For accuracy assessment of motion detection, 3 of the 1500 cases were excluded from the analysis: one due to the lack of the SA stack and two for failing the sanity check. An ROC-like analysis was performed on the remaining 1497 images to select the thresholds T A and T B . The results of the binary classification test, obtained for T A = 3.4 mm and T B = 6 mm, are reported in Table III. For accuracy assessment of contrast estimation, 3 of the 100 images were excluded from the analysis for failing the sanity check. Results for Pearson's correlation coefficient, linear regression and Bland-Altman analyses between automatically and manually estimated contrast values are reported in    Then are reported the results on UKDHP. For accuracy assessment of heart coverage estimation, all cases passed the sanity check. The ROC analysis returned an optimal threshold of 92% coverage, and the results of the subsequent binary classification test are reported in Table V. For accuracy assessment of motion detection, 1 of the 100 cases was excluded from the analysis for failing the sanity check. The ROC-like analysis was performed on the remaining 99 images to select the thresholds T A and T B . The results of the binary classification test, obtained for T A = 3 mm and T B = 6 mm, are reported in Table VI. For accuracy assessment of contrast estimation, 9 of the 100 images were excluded from the analysis for failing the sanity check. Results for Pearson's correlation coefficient, linear regression and Bland-Altman analyses between automatically and manually estimated contrast values are reported in Table VII and in Fig. 11

V. DISCUSSION
The results obtained for the landmark localization experiment show that the average localization error is around 3.9 mm (roughly two pixels) and is thus small compared to the reconstructed slice thickness in both datasets (10 mm), suggesting the reliability of the landmark detection technique for the sake of heart coverage estimation. The proposed hybrid decision forest method is based upon a previous implementation for landmark detection [22] which consisted of a multistage approach devised to increase the robustness to large variations in distances and orientation of the landmarks. It is worth mentioning that initial experiments performed using this approach showed no measurable improvement with respect to the single-stage one (perhaps due to the size of the training set and to the consistency of the orientation of the images), which was thus preferred (Fig. 12, Supplementary Material). The high DSC values obtained for the PSMs suggest their reliability for both motion detection and contrast estimation. The fact that the PSMs of the SA stacks are slightly worse than those of the LA images (0.90 vs 0.94) is mainly due to a lower response of the model in the apical slices, where a different thresholding value would have been beneficial. However, this does not cause a direct problem on the proposed pipeline, which never thresholds PSMs for segmentation purposes and instead exploits their probabilistic nature.
The first set of experiments involving the whole pipeline was aimed at assessing its accuracy on UKBB. The binary classification test on coverage estimation performed on 2997 cases from UKBB indicates the high accuracy of the proposed technique, with sensitivity = 88% and specificity = 99%. The interpretation of these results is hindered by the strong class imbalance between cases with sufficient and insufficient coverage, and thus a more detailed analysis of the reported confusion matrix is required. By applying the proposed automated technique, it is possible to correctly detect 88% of the cases with insufficient coverage, and thus to lower the percentage of undetected wrongly imaged cases from 1.9% to 0.2%. This comes at the price of having to visually check an additional 0.5% of cases that actually featured a sufficient coverage. Notably, several of the 15 FP cases actually had a sub-optimal coverage, but not of the amount required to be considered as wrongly imaged following the criterion adopted during visual inspection. Compared to our previous work [12], the present approach makes use of three LA images instead of just one. The redundancy offered by exploiting all the available LA views allows a more robust and reliable estimation: this is suggested by the higher sensitivity and specificity achieved (88% vs 73% and 99% vs 98%, respectively, although a direct comparison is not completely fair since the UKBB subset used in [12] was different from the present one) and by the lower number of cases excluded due to failing the sanity check (down from 89 to 3). Of note, this check is able to indirectly detect and exclude LA images with high noise levels, wrong acquisition planning or wrong file naming that make the landmark localization unreliable, and in the present implementation only cases in which all the three LA views yielded bad landmark detection had to be excluded from the coverage assessment. Zhang et al. [10] addressed coverage estimation by performing fully-supervised CNNbased slice classification to detect stacks with missing basal (MBS) or apical slices (MAS). In their later work [11], they acknowledged the need for a large amount of labelled data during training to achieve good generalization: to mitigate this issue, the authors have increased the size of the training set using generative networks (reaching average accuracies of 93% for MAS and 89% for MBS on a dataset of 3400 cases from UKBB). The use of different subsets of data from the UKBB and the different validation strategies (detection of missing slices separately in the apical and in the basal  Fig. 6: Results for motion detection on UKBB in two cases, one with (case #1) and one without motion corruption (case #2). In the second column, the color maps of the translation magnitude for each slice are overlaid on top of the SA stacks.
region vs detection of overall non-optimal cases) make the comparison between the two approaches not straightforward.
The main advantage of the approach of Zhang et al. is that it can detect problematic scans using only the SA stack, while our pipeline relies on the presence of at least one of the LA views (which are, however, routinely acquired in most CMR protocols). On the other hand, we believe there is a clinical and practical advantage in measuring the relative coverage instead of performing binary classification: cases with only slightly sub-optimal coverage could still be included in the following analyses, especially when the lack of coverage is in the apical area. Moreover, while their approach completely relies on feature extraction from single slices and thus small image perturbations can potentially lead to misclassification, our approach is designed to exploit the redundancy offered by the multiple LA views for greater robustness. The reported results on UKBB for motion detection indicate that the proposed approach achieves sensitivity = 85% and specificity = 95% over 1497 cases. By applying the proposed automated technique, it is possible to lower the percentage of undetected motion-corrupted cases from 16.8% to 2.6%. This comes at the price of having to visually check 3.9% cases that were visually deemed motion-free. It is worth to note that the binary classification of stacks based on the visual assessment of motion is a difficult task in itself, limiting the measurable accuracy of any technique. A more thorough examination would require a slice-by-slice visual classification, which is however impractical for datasets of this size.
The accuracy of the contrast estimation technique on UKBB is indicated by very high correlation coefficients and regression lines near unity both for images before and after contrast normalization. Bland-Altman analyses show negligible biases and narrow limits of agreement with respect to the mean measured values, suggesting the high accuracy of the technique.
The second set of experiments was aimed at assessing the accuracy of the pipeline (trained on UKBB) on the UKDHP dataset, thus testing its generalization properties, and yielded encouraging results. Regarding heart coverage estimation, our Fig. 7: Results for contrast estimation on UKBB in two cases, one with high (case #1) and one with low contrast (case #2). The ROIs from which the mean intensities are estimated are shown in red and cyan.
technique was able to correctly identify all sub-optimal cases. Regarding motion detection, it returned slightly lower values for sensitivity and specificity than those obtained on UKBB: while this might be due to a lower accuracy of the extracted PSMs, we noted that motion in the UKDHP dataset is considerably less pronounced than on UKBB, so it is easier to misclassify borderline cases. Regarding contrast estimation, the technique showed again very high correlation coefficients and regression lines near unity. The increased difficulty in dealing with a testing dataset different from the training one can be seen in the slightly higher number of cases failing the sanity check (up from 3 to 9) and in bigger biases, still however negligible when compared to the mean measured values. In general, the small size of the UKDHP dataset should be taken in consideration when evaluating these results, especially for binary classification tests where the misclassification of a single case can have a very large influence on the accuracy figures. However, we believe the reported results show that the proposed approach generalizes well to previously unseen datasets, coping with differences in the acquisition protocols.
Our approach to quality control does not attempt to directly classify sub-optimal cases for two reasons. First, this allows the complete circumvention of any class-imbalance issues, since the only learning-based portions of our pipeline aim at the identification of structures that are present in every image. Second, the implemented pipeline does not constitute a "blackbox" approach: each quality check produces quantitative metrics with a clear meaning, which can be of great value in informing the MR operators on the type and the entity of the identified issues. Importantly, the proposed pipeline could be adopted also using different techniques for landmark detection and probabilistic segmentation. One major requirement for these alternative methods would be the generation of fuzzy segmentations maps providing a probabilistic representation of the target structures: this allows the assessment of their reliability for both motion detection and contrast estimation, otherwise unfeasible with standard, hard segmentations.
The main limitation affecting our approach is that no quality check is performed on the manual selection of the imaging planes for LA and SA images, which can be subject to error. However, countermeasures have been implemented to deal with this issue. Regarding coverage estimation, the redundancy offered by exploiting all the three LA views and the adoption of a sanity check helps to minimize the issue. Regarding motion detection, a slighty off-axis LA image still correctly represents the cardiac anatomy, and the initial 3D registration step will position it correctly with respect to the SA stack.

VI. CONCLUSION
In this paper, a fully-automated, learning-based pipeline for quality control of CMR images has been presented. The implemented quality checks are heart coverage estimation, interslice motion detection and cardiac image contrast estimation for short-axis image stacks. The pipeline uses hybrid random forests to extract probabilistic segmentation maps and identify landmarks on long-and short-axis images, and then leverages these information to perform the quality checks. It was tested on up to 3000 cases from the UKBB as well as on 100 cases from the UKDHP, and compared to the results of visual or manual analyses to evaluate its accuracy. The results suggest that the proposed approach is able to perform the quality checks with a high accuracy across different datasets. With the recent launch of several initiatives for the acquisition of largescale CMR datasets, there is a strong need for robust quality control tools in order to facilitate and ensure the reliability of the analyses performed as part of clinical studies. In addition, the low computational time required by the proposed pipeline makes it potentially deployable at the acquisition site, allowing the almost real-time assessment of the scan and the potential triggering of a new acquisition.