Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Segmentation: The M&Ms Challenge

The emergence of deep learning has considerably advanced the state-of-the-art in cardiac magnetic resonance (CMR) segmentation. Many techniques have been proposed over the last few years, bringing the accuracy of automated segmentation close to human performance. However, these models have been all too often trained and validated using cardiac imaging samples from single clinical centres or homogeneous imaging protocols. This has prevented the development and validation of models that are generalizable across different clinical centres, imaging conditions or scanner vendors. To promote further research and scientific benchmarking in the field of generalizable deep learning for cardiac segmentation, this paper presents the results of the Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Segmentation (M&Ms) Challenge, which was recently organized as part of the MICCAI 2020 Conference. A total of 14 teams submitted different solutions to the problem, combining various baseline models, data augmentation strategies, and domain adaptation techniques. The obtained results indicate the importance of intensity-driven data augmentation, as well as the need for further research to improve generalizability towards unseen scanner vendors or new imaging protocols. Furthermore, we present a new resource of 375 heterogeneous CMR datasets acquired by using four different scanner vendors in six hospitals and three different countries (Spain, Canada and Germany), which we provide as open-access for the community to enable future research in the field.


I. INTRODUCTION
ACCURATE segmentation of cardiovascular magnetic resonance (CMR) images is an important pre-requisite in clinical practice to reliably diagnose and assess a number of major cardiovascular diseases [1], [2]. Currently, the process typically requires the clinician to provide a significant amount of manual input and correction to accurately and consistently annotate the cardiac boundaries across all image slices and cardiac phases. The automation of such a tedious and time-consuming task has been pursued for a long time by using multiple approaches, such as statistical shape models [3] or cardiac atlases [4]. In the last few years, the advent of the deep learning paradigm has motivated the development of many neural network based techniques for improved CMR segmentation, as listed in a recent review [5]. However, most of these techniques have been trained and evaluated using cardiac imaging samples collected from single clinical centres using similar imaging protocols. While these works have advanced the state-of-the-art in deep learning based cardiac image segmentation, their high performances were reported on samples with relatively homogeneous imaging characteristics.
As an example, the CMR datasets from the Automated Cardiac Diagnosis Challenge (ACDC) [6] have been extensively used to build and test new implementations of deep neural networks for cardiac image segmentation. The top performing technique in the ACDC challenge, proposed by Isensee et al. [7], obtained a very high segmentation accuracy for both the left and right ventricles. However, the ACDC datasets were compiled from 150 subjects scanned at a single clinical centre using the same imaging protocol, which limits the ability of researchers to develop and test models that can generalize suitably across multiple centres and scanner vendors. Other researchers attempted to encode higher variability by building and testing their models based on much larger datasets obtained from the UK Biobank [8]. For instance, Bai et al. [9] implemented a fully convolutional network that achieved highly accurate results on this large dataset (over 4,875 cases), but the authors concluded that their model might not generalize well to other vendor or sequence datasets.
Some researchers proposed to improve CMR segmentation by training neural networks with images from multiple cohorts [10], [11], but these works do not include methods for addressing domain shifts between training and new unseen cohorts. Other works used data augmentation on models built from single cohorts such as the ACDC [12] or the UK Biobank [13], then tested their techniques on other existing public cohorts, including the Sunnybrook Cardiac Data [14], LV Segmentation Challenge Dataset (LVSC) [15] or RV Segmentation Challenge Dataset (RVSC) [16]. However, these studies are limited by the fact that these different CMR cohorts have been annotated with distinct standard operating procedures (SOPs), which makes it difficult to draw conclusions from the multi-cohort comparative results. Furthermore, such an approach requires a large training dataset from the single centre to model high variability across subjects. Another multi-centre and multi-vendor study conducted by Tao et al. [11] relied solely on private data, which makes it difficult to replicate the results and perform community-driven benchmarking. While these recent works confirmed the difficulties encountered by deep learning models to generalize beyond the training samples, they also support the need for well-defined heterogeneous public datasets that can be used by the community to improve model generalizability through scientific benchmarking.
In this context, the Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Segmentation (M&Ms) Challenge was proposed and organized as part of the Statistical Atlases and Computational Modelling of the Heart (STACOM) Workshop, held in conjunction with the MICCAI 2020 Conference. The M&Ms challenge was set up as part of the euCanSHare international project, which is aimed at developing interoperable data sharing and analytics solutions for multi-centre cardiovascular research data. Together with clinical collaborators from six different hospitals in Spain, Canada and Germany, a public CMR dataset was established from 375 participants, scanned with scanners from four different vendors (Siemens, Philips, General Electric (GE) and Canon) and annotated using a consistent contouring SOP across centres.
To our knowledge, this dataset is the most diverse resource of CMR studies, which is provided as open-access to promote further research and scientific benchmarking in the development and evaluation of future generalizable deep learning models in cardiac image segmentation. In this paper, we also present and discuss the results of the M&Ms challenge in detail, to which a total of 14 international teams submitted a range of solutions, including different strategies of transfer learning, domain adaptation and data augmentation, to accommodate the differences in scanner vendors and imaging protocols. The obtained results show the extent of the problem, the promise of the proposed solutions, as well as the need for further research to build fully generalizable tools that can be    I. In total, 375 studies were included in this challenge. The subjects considered for this multi-disease study were selected among groups of various cardiovascular diseases, such as hypertrophic cardiomyopathy, dilated cardiomyopathy, coronary heart disease, abnormal right ventricle, myocarditis and ischemic cardiomyopathy, as well as healthy volunteers (see Table II for more details on the distribution of these cases). The specific scanner manufacturers are: 1) Siemens (Siemens Healthineers, Germany), 2) Philips (Philips Healthcare, Netherlands), 3) General Electric (GE, GE Healthcare, USA) and 4) Canon (Canon Inc., Japan). These four manufacturers were coded as A, B, C and D during the challenge, respectively. The CMR images derived from these four vendors are illustrated in Fig. 1. More specific details on the studies are given in Table III.
Every CMR study was annotated manually by an expert clinician from the centre of origin, with experience ranging from 3 to more than 10 years. Following the clinical protocol, short-axis views were annotated at the end-diastolic (ED) and end-systolic (ES) phases, as they correspond to the phases used to compute the relevant clinical biomarkers for cardiac diagnosis and follow-up. Three main regions were considered: the left and right ventricle (LV and RV, respectively) cavities and the left ventricle myocardium (MYO). In order to reduce the inter-observer and inter-centre variability in the contours, in particular at the apical and basal regions, a detailed revision of the provided segmentations was performed by four researchers in pairs. They applied the same SOP across all CMR datasets to obtain the final ground truth. To generate consistent annotations for the research community, we chose to apply the SOP that was already used by the ACDC challenge, as follows: a) The LV and RV cavities must be completely covered, including the papillary muscles. b) No interpolation of the MYO boundaries must be performed at the basal region. c) The RV must have a larger surface at the ED time-frame compared to ES. d) The RV does not include the pulmonary artery.
Clinical delineations as well as later corrections were performed using CVI42 software (Circle Cardiovascular Imaging Inc., Calgary, Alberta, Canada). All studies were provided in DICOM format and contours were extracted in cvi42 workspace format (.cvi42ws). An in-house software was then used to extract the contours and transform the images into the NIFTI format, representing the final files delivered to the challenge participants.
Fig. 2. Four 2D UNet models [17] were trained with datasets from the four vendors separately (rows) and subsequently tested for segmentation performance on datasets from all vendors (columns). The heatmap shows the Dice similarity coefficient, with a color scale that goes from blue (good generalizability) to red (poor generalizability). The results are the average of 5 models cross-validated on subsets of 30 training subjects.

B. Model Training
The 375 CMR studies were divided into three sets, namely training, validation and testing, as detailed in Table IV. To decide on a particular subdivision, we first estimated the degree of generalizability of models trained from the four vendors, as shown in Figure 2. We thus decided to combine the datasets from vendor A, which generalize relatively well, with datasets from vendor B, which generalize poorly to new vendors, as training datasets. The participants received the 175 training cases on 1st May 2020, including 75 annotated CMRs from vendor A, 75 annotated CMRs from vendor B, 25 CMRs from vendor C without any annotations (only the raw images) and no datasets from vendor D, in order to test generalizability to different situations (e.g. image protocol included or not included in the training). Note that in the case of vendor A, the 75 CMRs were included from centre 1 but none from centre 6, to test generalizability across vendors but also across centres for the same vendor. Regarding vendor B, we included more training datasets from centre 2 (50 cases) than from centre 3 (25 cases) to assess the impact of imbalanced training data and fairness in multi-centre cardiac image segmentation. For optimizing the models, the participants were allowed to remotely validate against 40 additional CMRs, i.e. 10 from each of the four vendors. A maximum of 7 submissions were allowed per team during the validation process. Note that during training, the use of external datasets or pre-trained models was not allowed, to enable a fair comparison between the proposed solutions.
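The split described above can be cross-checked with a short Python sketch. The counts are those stated in the text (the test counts, 40 cases per vendor, are the ones reported with the challenge results); the dictionary layout and the names `SPLIT`, `LABELLED_TRAINING_VENDORS` and `split_totals` are illustrative assumptions and not part of the challenge material.

```python
# Case counts per split and vendor, as stated in the text.
SPLIT = {
    "training": {"A": 75, "B": 75, "C": 25, "D": 0},
    "validation": {"A": 10, "B": 10, "C": 10, "D": 10},
    "testing": {"A": 40, "B": 40, "C": 40, "D": 40},
}

# Only vendors A and B were released with annotations in the training set;
# vendor C was released as raw images and vendor D was withheld entirely.
LABELLED_TRAINING_VENDORS = {"A", "B"}


def split_totals(split):
    """Return the number of cases in each split."""
    return {name: sum(counts.values()) for name, counts in split.items()}


totals = split_totals(SPLIT)
labelled_training = sum(SPLIT["training"][v] for v in LABELLED_TRAINING_VENDORS)

print(totals)                # cases per split
print(sum(totals.values()))  # size of the full dataset
print(labelled_training)     # annotated training cases
```

The three splits add up to the 375 studies of the full dataset, of which 150 training cases carry annotations.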

C. Model Evaluation
The testing period for the challenge started on 8th June 2020 and concluded on 15th July 2020. The participants had to evaluate their models remotely to ensure the unseen datasets were totally hidden from the segmentation methods. As such, for example, the participants had no prior information on the images provided by vendor D. In order to evaluate the models, the participants were asked to build a Singularity image (https://sylabs.io) and share it with the organizers via a MEGA (https://mega.nz) folder provided by the organizers or via any other secure cloud storage service. This Singularity image can be executed on a machine with a similar architecture without the need to install the variety of libraries used. The necessary computing power was sponsored by NVIDIA, who provided the organizers with access to an NVIDIA V100 GPU card with 16GB of memory, as well as the Barcelona Supercomputing Center (BSC), who provided access to two K80 NVIDIA GPU cards.
In order to assess the quality of the automatically segmented masks P with respect to the ground truth G, four measures were proposed, namely:
(i) Dice similarity coefficient (DSC), which measures the degree of overlap between two volumes.
(ii) Jaccard index (JI), which also measures overlap but is more sensitive to results with average performance.
(iii) Average symmetric surface distance (ASSD), which measures the average distance between the surfaces of the two volumes.
(iv) Hausdorff distance (HD), which measures the largest disagreement between the volumes and is useful for identifying small outliers.
All these metrics were computed using the public library medpy. They were computed for the three target labels (LV, RV and MYO), resulting in a total of 12 measures. If a participant's prediction was missing for a specific subject, a value of zero was assumed for DSC and JI, and maximum values of 150 and 50 millimetres were assumed for HD and ASSD, respectively, based on the worst results obtained by the participating methods. Any surface distance above these thresholds was set to the corresponding maximum value.
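As a minimal illustration of the two overlap measures and of the penalty rule for missing or out-of-range values, the following Python sketch works on flattened binary masks. It is not the challenge's evaluation code, which relied on medpy (that library exposes comparable functions in `medpy.metric.binary`). The function names, the use of `None` to mark a missing prediction and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
from typing import Optional, Sequence

# Maximum surface distances (in mm) used as penalties, as stated in the text.
HD_MAX_MM = 150.0
ASSD_MAX_MM = 50.0


def dice(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice similarity coefficient of two flattened binary masks."""
    inter = sum(1 for p, g in zip(pred, truth) if p and g)
    total = sum(1 for p in pred if p) + sum(1 for g in truth if g)
    # Assumed convention: two empty masks count as a perfect match.
    return 2.0 * inter / total if total else 1.0


def jaccard(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Jaccard index (intersection over union) of two flattened binary masks."""
    inter = sum(1 for p, g in zip(pred, truth) if p and g)
    union = sum(1 for p, g in zip(pred, truth) if p or g)
    return inter / union if union else 1.0


def overlap_or_penalty(value: Optional[float]) -> float:
    """DSC or JI for a subject; a missing prediction (None) scores zero."""
    return 0.0 if value is None else value


def distance_or_penalty(value: Optional[float], max_mm: float) -> float:
    """HD or ASSD for a subject; missing or larger values are set to max_mm."""
    return max_mm if value is None else min(value, max_mm)


pred = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
d, j = dice(pred, truth), jaccard(pred, truth)
print(d, j)
```

For any pair of masks the two overlap measures are linked by JI = DSC / (2 − DSC), so JI is always the lower of the two unless the overlap is perfect or absent.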
To obtain the final ranking for each team, a weighted average was computed giving greater importance to the unlabelled and unseen scanner vendors. Therefore, if $v_A$ and $v_B$ are defined as the labelled vendors, $v_C$ as the unlabelled one and $v_D$ as the unseen one, the weighted sum for a metric $M$ combines the per-vendor values $M_{v_A}$, $M_{v_B}$, $M_{v_C}$ and $M_{v_D}$, with larger weights assigned to $v_C$ and $v_D$. Then, a min-max normalization was applied across participants for each measure, and a final average over the normalized metrics yielded the performance (P), ranging from 0 to 1, with 1 being the value that a team would obtain if it had the best results for every metric.
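The ranking procedure can be sketched in Python as follows. The vendor weights are not given in the text above, so `VENDOR_WEIGHTS` holds placeholder values chosen only so that vendors C and D count more than A and B. The inversion of distance-based metrics during normalization (so that a lower HD or ASSD maps to a higher score) is likewise an assumption inferred from the statement that the best result on every metric yields P = 1. All function names are hypothetical.

```python
from typing import Dict

# Placeholder weights (assumption): unlabelled (C) and unseen (D) vendors
# count more than the labelled ones (A, B). They sum to one.
VENDOR_WEIGHTS = {"A": 1 / 6, "B": 1 / 6, "C": 1 / 3, "D": 1 / 3}


def weighted_metric(per_vendor: Dict[str, float],
                    weights: Dict[str, float] = VENDOR_WEIGHTS) -> float:
    """Weighted sum of one metric over the four vendors."""
    return sum(weights[v] * per_vendor[v] for v in weights)


def min_max(values: Dict[str, float], higher_is_better: bool) -> Dict[str, float]:
    """Min-max normalize one metric across teams so that 1 is always best."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:  # all teams tied on this metric
        return {team: 1.0 for team in values}
    if higher_is_better:
        return {t: (v - lo) / (hi - lo) for t, v in values.items()}
    return {t: (hi - v) / (hi - lo) for t, v in values.items()}


def performance(scores: Dict[str, Dict[str, float]],
                higher_is_better: Dict[str, bool]) -> Dict[str, float]:
    """scores[metric][team] holds weighted metrics; returns P per team."""
    normalized = {m: min_max(vals, higher_is_better[m])
                  for m, vals in scores.items()}
    teams = list(next(iter(scores.values())))
    return {t: sum(normalized[m][t] for m in normalized) / len(normalized)
            for t in teams}


# Made-up numbers for two hypothetical teams and two of the twelve measures.
scores = {"DSC": {"T1": 0.9, "T2": 0.8}, "HD": {"T1": 10.0, "T2": 20.0}}
p = performance(scores, {"DSC": True, "HD": False})
print(p)
```

Because the normalization is relative to the participating teams, P expresses how a team compares with the others rather than an absolute level of accuracy.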

III. PARTICIPATING METHODS
In total, 80 teams registered to download the M&Ms training dataset, 16 submitted a solution for the final testing phase and 14 teams submitted their methodology as a paper to the STACOM Workshop (see Table V for details on these teams). All participants used deep learning as their segmentation approach. Table VI summarizes the main characteristics of the submitted techniques, including the backbone architectures and domain adaptation strategies, which are described in more detail in the following subsections. Furthermore, details on the hardware used during training, the time that each method took for training and inference, and the number of parameters for each model are presented in Table VII.

A. Backbone Architectures
There is a degree of variability in the backbone architectures used between the different participants, as shown in Table VI. Four teams used the nnUNet [33] (which includes UNet architectures in 2D and 3D as well as a cascaded UNet) as their baseline segmentation model (P1-P3 & P9). Four participants used a traditional UNet [17] (P6, P10, P13, P14), while other variants of UNets were adopted by the rest of the teams. In particular, UNets combined with residual connections were applied by three teams (P4, P8, P11), with P8 preferring a residual UNet with dilated convolutions (DRUNet) [34]. P5 proposed the use of an attention UNet [35],   [36] with an AdaIN [37] decoder.
As pre-processing techniques, all models that provided detailed information about this step performed either image normalization to a unit Gaussian distribution or pixel value rescaling to the range [0,1] (only P6 chose the range [0,255] instead). With regard to image resolution, images were resized based on target size or pixel resolution values in 10 out of 14 methods, while the other methods preferred to keep the original image resolution (P4, P7, P8, P11). In order to obtain square images, cropping and zero padding were used depending on the desired image size for each case. Additionally, some methods applied intensity clipping between varying ranges to get rid of bright artifacts (P5, P6, P11). Finally, P8 was the only method to also apply a non-local means denoising filter prior to the training process.
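The three intensity pre-processing operations mentioned above (normalization to zero mean and unit variance, rescaling to a fixed range and intensity clipping) can be sketched in plain Python on a flattened image. This is an illustrative sketch only: the function names are hypothetical, and the clipping bounds are left as parameters because the ranges used by the teams are not specified here.

```python
from statistics import fmean, pstdev
from typing import List


def normalize_gaussian(pixels: List[float]) -> List[float]:
    """Shift and scale intensities to zero mean and unit standard deviation."""
    mean, std = fmean(pixels), pstdev(pixels)
    if std == 0:  # constant image: nothing to scale
        return [0.0 for _ in pixels]
    return [(p - mean) / std for p in pixels]


def rescale(pixels: List[float], new_min: float = 0.0,
            new_max: float = 1.0) -> List[float]:
    """Linearly map intensities to [new_min, new_max] (e.g. [0,1] or [0,255])."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [new_min for _ in pixels]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (p - lo) * scale for p in pixels]


def clip(pixels: List[float], lower: float, upper: float) -> List[float]:
    """Clamp intensities to [lower, upper], e.g. to suppress bright artifacts."""
    return [min(max(p, lower), upper) for p in pixels]


image = [0.0, 5.0, 10.0]
print(rescale(image))              # range [0, 1]
print(rescale(image, 0.0, 255.0))  # range [0, 255], the choice attributed to P6
print(clip([-5.0, 50.0, 900.0], 0.0, 255.0))
```

When clipping is combined with rescaling, applying the clip first keeps a few very bright pixels from compressing the rest of the intensity range.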

B. Data Augmentation
All participants in the challenge (except P11) used some form of data augmentation to enhance their models. Specifically, two families of data augmentations were considered: (1) spatial transformations to increase sample size through rotation, flipping, scaling or deformation of the original images; (2) intensity-driven techniques, which maintain the spatial configuration of the anatomical structures but modify the image intensities. Additionally, some teams (P1-P3, P9, P13) applied test-time augmentation techniques, which consist of passing two or more transformed versions of the same inference image to the model to obtain several predictions. These predictions are then combined to obtain one final outcome, usually by averaging them. This method has been shown to improve the final performance in small-data scenarios, with a net improvement whose magnitude depends on the model architecture [38].
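Test-time augmentation as described above can be sketched as follows. The example is deliberately minimal and makes several assumptions: images are nested Python lists, the model is any callable `predict` returning a per-pixel probability map of the same size, and only horizontal and vertical flips are used as transformations. The teams' actual transformations and combination rules may differ.

```python
from typing import Callable, List

Image = List[List[float]]


def identity(img: Image) -> Image:
    return img


def hflip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img: Image) -> Image:
    """Mirror the image top to bottom."""
    return img[::-1]


# (forward, inverse) pairs; a flip is its own inverse.
TRANSFORMS = [(identity, identity), (hflip, hflip), (vflip, vflip)]


def predict_with_tta(predict: Callable[[Image], Image], img: Image) -> Image:
    """Average the predictions obtained from several transformed copies."""
    # Each prediction is mapped back to the original orientation before averaging.
    preds = [inverse(predict(forward(img))) for forward, inverse in TRANSFORMS]
    rows, cols = len(img), len(img[0])
    return [[sum(p[r][c] for p in preds) / len(preds) for c in range(cols)]
            for r in range(rows)]


def left_biased_model(img: Image) -> Image:
    """Toy stand-in for a network: it only 'sees' the left column."""
    return [[row[0]] + [0.0] * (len(row) - 1) for row in img]


averaged = predict_with_tta(left_biased_model, [[1.0, 1.0], [1.0, 1.0]])
print(averaged)
```

The toy model is not symmetric under flips, so averaging over the flipped copies spreads its response across both columns, which is the smoothing effect that test-time augmentation relies on.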

C. Domain Adaptation
Of all participants, only three teams (P4, P6, P10) implemented a method to explicitly address the differences in the image distributions between the unseen and trained vendors. At training, P4 constructed a classifier to distinguish between scanner vendors and used it to modify the training images (through error propagation) until the classifier could not distinguish between the domains. In other words, this method resulted in training images and a trained model that are less dependent on the specific vendors. P6 and P10 proposed to train two models simultaneously with shared features, one for segmentation and one for classification, such that the classification loss is high while the segmentation loss is low, generating features that are robust to vendor-specific variations as well as optimal for segmentation.

IV. RESULTS
As detailed in Table IV, a balanced dataset across the four vendors was prepared for evaluating the final submissions (40 CMRs per vendor, total 160 datasets). In this section, we analyze the obtained results per (1) team, (2) vendor and (3) clinical centre, and (4) show some qualitative results. For analysing the obtained results, we also implemented two baseline models to better appreciate the added value of the data augmentation and domain adaptation techniques used in this challenge:
B1: A 2D UNet without any data augmentation, as described in the original reference [17], trained with weighted cross entropy loss.
B2: The nnUNet pipeline, with a 2D UNet module and default parameters as given in [33] (the best fold according to the validation set was selected).
In particular, B2 differed from the pipelines of P1-P3 in that it only included one architecture type (2D UNet), with ±180 degree rotations, flipping, scaling, deformations, gamma transformations and test-time augmentation as data augmentation. In contrast, the P1, P2 and P3 methods included further augmentation techniques such as histogram matching, noise addition, brightness modification, contrast modification and pseudo-label generation by label propagation in time space.
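For reference, the weighted cross entropy loss mentioned for baseline B1 can be written in a few lines of Python. This is a generic sketch and not the baseline's implementation: the class weights are made-up values, and dividing by the sum of the weights (rather than by the number of pixels) is one common convention that is assumed here.

```python
import math
from typing import Sequence


def weighted_cross_entropy(probs: Sequence[Sequence[float]],
                           labels: Sequence[int],
                           class_weights: Sequence[float]) -> float:
    """Weighted cross entropy over pixels.

    probs[i][c] is the predicted probability of class c at pixel i,
    labels[i] is the true class index of pixel i.
    """
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total -= w * math.log(max(p[y], 1e-12))  # guard against log(0)
        weight_sum += w
    return total / weight_sum


# Four classes: background, LV, MYO, RV. Hypothetical weights that
# down-weight the dominant background class.
weights = [0.1, 1.0, 1.0, 1.0]
probs = [
    [0.7, 0.1, 0.1, 0.1],  # pixel predicted mostly as background
    [0.2, 0.6, 0.1, 0.1],  # pixel predicted mostly as LV
]
labels = [0, 1]
print(weighted_cross_entropy(probs, labels, weights))
```

With the background down-weighted, errors on the small cardiac structures contribute more to the loss than errors on the large background region.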

A. Analysis per Team
Fig. 4 displays the results of the challenge for all participants and according to two evaluation metrics (DSC and HD). It can be seen that the curves are flat for about half of the participating teams, which indicates comparable performances overall. Note that these methods (P1 to P7) are also the ones that performed better than the baseline methods, and we hypothesize that the other models (P8 to P14) suffered from some form of over-fitting (see also the shapes of the curves in Fig. 4). Team P1 provided the most consistent results across all metrics. However, the difference with respect to other teams was relatively small and in many cases not statistically significant, as presented in Table VIII. The three best performing teams, P1 to P3, used nnUNet as the baseline pipeline, as well as standard intensity-based data augmentation (e.g. blurring, noise addition, histogram matching), but no domain adaptation, showing a significant improvement with respect to the standard nnUNet implementation B2. For a similar performance, P5 used an Attention UNet as the backbone architecture and CycleGANs for data augmentation through image synthesis. P4 and P6 also obtained similar performances overall, but instead implemented domain adaptation methods and no image-driven data augmentation. Fig. 5 displays the average DSC for all participating teams, organised this time per pathology, showing better segmentation performance for healthy cases and dilated cardiomyopathy (DCM), followed by hypertrophic cardiomyopathy (HCM) and other pathologies. It can be seen that the performances of

B. Analysis per Vendor
Fig. 6 summarizes the segmentation results for all teams for each vendor separately (A, B, C & D). It can be seen that overall, the differences in the segmentation errors between the vendors are reduced with respect to the results obtained by the two baseline methods, as detailed in Table IX. Specifically, it can be seen that for the baseline methods there is a loss of accuracy of up to −6% in the segmentation of images from vendors C and D compared to A and B. However, this loss is reduced, for example, to −1.5% for P1 (e.g. from DSC = 0.92 for vendor A to 0.90 in vendors C and D, for the LV), −2.1% for P2 (e.g. from DSC = 0.87 in vendor B to 0.82 in vendor D, for the RV), and almost to 0% for P7. This indicates that while there is a need for further research to
Fig. 7. Boxplots with centre-wise results for DSC and HD when all participants' predictions are considered. The same color-coding as in Fig. 6 is used for scanner vendors.

C. Analysis per Centre
In the previous subsection, centres were combined in the analysis despite having different machines or scanning protocols. In doing so, possible variabilities between centres using the same scanner may be overstated, making it necessary to consider also Fig. 7, where the segmentation results are summarized according to the six clinical centres. Here too, it can be seen that there remains some degree of variation in the segmentation of the CMR images from the different centres. In more detail, there is a decrease in segmentation accuracy between centres 1 and 6 even though their images are from the same scanner vendor A. However, this difference can be explained by two facts: 1) the scanners in these two centres are different models and have different field strengths, as shown in Table III, and 2) all the 75 datasets included during training for vendor A were from centre 1 (Spain) and none from centre 6 (Canada). In this case, even though the images are from the same vendor, differences in scanner specifications resulted in a lack of generalizability. In contrast, images from both centres 2 and 3 were included in the training for vendor B, which resulted in comparable segmentation accuracies for these two centres. Finally, the datasets from centres 4 and 5 correspond to vendors C and D, respectively, which were not included in the training, which explains the loss of accuracy compared to centres 1, 2 and 3. In Fig. 8, the results are grouped for all centres according to their inclusion (or not) in the training. Clearly, the segmentation accuracy is the highest for centres that are part of the training together with their labels, followed by those with images but no labels, and finally the performance is the lowest and most variable for images from fully unseen centres. This result confirms the need for further developments to optimize the generalizability of deep learning solutions in future tools for cardiac image segmentation.
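The grouping used for this comparison can be made explicit with a small Python sketch. The centre-to-vendor mapping follows the text; the assignment of centres to the three groups (in particular placing centre 6 among the unseen centres, since none of its images were used for training) is our reading of the description above, and the DSC values in the usage example are made-up numbers, not challenge results.

```python
from statistics import fmean
from typing import Dict

# Centre -> vendor, as described in the text.
CENTRE_VENDOR = {1: "A", 2: "B", 3: "B", 4: "C", 5: "D", 6: "A"}

# Centre -> role during training (assumed reading of the grouping):
#   labelled   : images and annotations available for training
#   unlabelled : images available for training, no annotations
#   unseen     : no images from this centre used for training
CENTRE_GROUP = {
    1: "labelled", 2: "labelled", 3: "labelled",
    4: "unlabelled",
    5: "unseen", 6: "unseen",
}


def mean_by_group(dsc_per_centre: Dict[int, float]) -> Dict[str, float]:
    """Average a per-centre score within each training-inclusion group."""
    grouped: Dict[str, list] = {}
    for centre, dsc in dsc_per_centre.items():
        grouped.setdefault(CENTRE_GROUP[centre], []).append(dsc)
    return {group: fmean(values) for group, values in grouped.items()}


# Made-up per-centre DSC values, for illustration only.
example = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6, 5: 0.5, 6: 0.3}
print(mean_by_group(example))
```

Grouping by training role rather than by vendor separates centre 6 from centre 1, although both use vendor A scanners.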

D. Qualitative Results
Fig. 9 presents the effect of the slice position on the final segmentation DSC for the top three performing teams, quantifying the loss of accuracy, which is especially prominent in the apical and basal slices. To illustrate this, Fig. 10 provides some visual examples from team P1 to further show the added value of the implemented techniques, as well as their limitations when applied to unseen vendors. In the two examples above, the segmentation techniques accurately identified the cardiac boundaries even though these imaging protocols were not included in the training set. However, in the two examples below, despite the use of data augmentation and domain adaptation, the models were unsuccessful in the segmentation of these unseen cases and diverged more notably from the ground truth in basal slices. These examples illustrate the need for future work to further improve the generalizability of deep learning models in cardiac image segmentation.

V. DISCUSSION
In this paper, we presented a comprehensive analysis of a range of deep learning solutions for the automated segmentation of multi-centre, multi-vendor and multi-disease CMR datasets. Roughly speaking, the 14 participants in the challenge developed varying workflows combining a baseline neural network, intensity-based and/or spatial data augmentation, and in some cases a domain adaptation strategy. In addition to a relatively large sample of 175 cases for training, the participants were given a total of seven attempts for optimising the parameters and characteristics of their models during the validation process, to ensure an optimal design of the solutions.

A. Analysis of the Methods
The obtained results, first of all, indicate that data augmentation, though its primary purpose is to increase training size and reduce over-fitting, can perform well in addressing some of the differences in image appearance between vendors. In particular, by varying the parameters and types of intensity transformations (e.g. histogram matching, contrast modification, noise addition, image synthesis), one can generate new training images that enhance the generalizability of the models. As an example, one can look at the performance of the baseline models B1 and B2 and of augmented models such as P1, P2 and P3. While for the baseline models the results do not differ significantly for specific cases, such as at ES, P1-P3 used many more data augmentation types, such as histogram matching, noise addition, brightness modification and contrast modification, and obtained a more marked improvement (e.g. the DSC for the myocardium at ES increased from 0.84 for B1 to 0.86 for P1, and the DSC for the RV at ES increased from 0.81 for B1 to 0.84 for P3). This indicates the added value of more advanced image-driven data augmentation for multi-vendor image segmentation, and suggests that the domain shift between different scanners or protocols can potentially be addressed by using an exhaustive set of image transformations during training. However, the results also clearly show that the obtained segmentations remain generally more stable in trained vendors compared to unseen vendors, as intensity-driven data augmentation alone cannot enable a full coverage of the variety of imaging protocols that can exist across clinical centres.
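To make the notion of intensity-driven augmentation concrete, the sketch below applies a random gamma transformation, contrast change, brightness shift and additive Gaussian noise to a flattened image. It is an illustration under stated assumptions rather than any team's pipeline: the input is assumed to be already rescaled to [0, 1], and all parameter ranges are hypothetical.

```python
import random
from typing import List


def augment_intensity(pixels: List[float], rng: random.Random) -> List[float]:
    """Randomly perturb the intensities of an image rescaled to [0, 1].

    The spatial layout is untouched: only pixel values change.
    """
    gamma = rng.uniform(0.7, 1.5)        # gamma transformation (assumed range)
    contrast = rng.uniform(0.8, 1.2)     # contrast scaling around the mean
    brightness = rng.uniform(-0.1, 0.1)  # additive brightness shift
    noise_std = rng.uniform(0.0, 0.05)   # Gaussian noise level

    transformed = [p ** gamma for p in pixels]
    mean = sum(transformed) / len(transformed)
    out = []
    for v in transformed:
        v = (v - mean) * contrast + mean + brightness
        v += rng.gauss(0.0, noise_std)
        out.append(min(max(v, 0.0), 1.0))  # keep values inside [0, 1]
    return out


image = [0.0, 0.25, 0.5, 1.0]
print(augment_intensity(image, random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible, which helps when comparing training runs.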
As for domain adaptation, while it is theoretically suitable for multi-vendor image segmentation, as it can adapt on the spot to the imaging distribution of the unseen images, it did not result in better segmentations than when using exhaustive data augmentation alone. In fact, the first three techniques in the ranking did not use any domain adaptation, though it is important to reiterate that the first seven solutions obtained relatively similar results overall. It is worth noting that the choice of the baseline model may play a role, as again the first three techniques used the same model, namely the nnUNet. Finally, while the results indicate the potential of data augmentation and domain adaptation, they also show that there is still a loss in segmentation accuracy when segmenting unlabelled or unseen image samples compared to labelled ones. Note also that training and testing a model on two datasets from the same vendor does not guarantee good generalizability. This is particularly true if the two sets of images are from two different centres and scanner types, such as 1.5T (e.g. centre 1) and 3T (e.g. centre 6), as shown in Figure 7.
The results also show that advanced workflows integrating, for instance, data augmentation or generative adversarial networks, are not guaranteed to lead to robust segmentations. In fact, half of the submitted techniques had a lower performance than the two baselines implemented for comparison. This shows that over-fitting remains a challenge that requires special attention during the calibration and validation of complex deep learning solutions for cardiac image segmentation, in particular in the presence of highly heterogeneous data.
Lastly, the presented methods show a vast diversity in hardware performance, with training times ranging from 6 to 100 hours and inference times from tenths of seconds to almost half a minute. However, the training and inference times do not correlate well with the final accuracy, indicating an excessive use of computational power for some techniques. For example, the methods implemented by P1 and P2, despite using the same baseline model as P3, needed around half the time for training and obtained slightly better results (1.2% average improvement in DSC), while P4 used around one tenth of the computing time for a similar loss of accuracy with respect to P1 (1.6% average loss in DSC). Furthermore, clinical centres usually lack dedicated hardware for deep learning models, which increases the segmentation time even further. In this sense, a good equilibrium between accuracy and processing time needs to be attained, with methods such as P4 serving as a good example with a competitive performance and a prediction rate of around 3 images per second.
In summary, the main findings are: a) Exhaustive data augmentation considerably reduced the domain gap, although the results were still more stable within the domains used during training. b) Domain adaptation did not result in better performance when compared to nnUNet models trained with spatial and intensity-driven data augmentation. c) Complex workflows did not always lead to better results, resulting sometimes in an excessive use of computing resources.

B. Analysis of the Segmentation Results
Compared to other publicly available and annotated multi-structure (LV, MYO, RV) datasets in the field of CMR segmentation, M&Ms is the largest as well as the most diverse (375 cases from four vendors, six centres and three countries, vs. 150 cases for ACDC from one centre). However, given that ACDC is an established database, we chose to use its contouring SOP in this challenge to derive standardized annotations for the community, as well as to enable the combination of these datasets in future studies.
Note that our study, while it focuses on multi-scanner generalizable segmentation, confirms several of the results already obtained by the ACDC challenge and other previous works. Specifically: a) The segmentations at ED were more accurate than at ES for LV and RV cavities, but not for the myocardium, which becomes thicker and therefore easier to segment when the heart contracts. b) The segmentation accuracy according to the DSC was the highest for the LV blood pool, followed by the RV and MYO, in this order, but it was the lowest for the RV for the distance-based measures, given its shape complexity. c) The segmentation accuracy was at its maximum at the mid-ventricular slices, while the performance decreased for the apical and basal slices, where there is higher variability and complexity. On average, the best performing method in this challenge obtained 0.88 as DSC and 11 mm as HD versus the values 0.93 and 9 mm obtained in the ACDC challenge, respectively, with the greatest difference shown at ES. This gap can be easily explained by the single-centre nature of the ACDC studies in comparison to the multi-centre scenario in this work, although other effects such as the training size may play a role and should be assessed (150 vs. 100 studies, respectively).

C. Future Work
In addition to the results and analyses presented in this paper on multi-scanner cardiac image segmentation, we also provide the M&Ms dataset open-access for the community, which can be downloaded from the M&Ms website (www.ub.edu/mnms). It represents one of the most heterogeneous datasets ever compiled in cardiac image analysis, comprising CMRs from a variety of imaging protocols and cardiology units, and including a range of cardiovascular diseases as distinct as coronary heart disease, cardiomyopathies, abnormal right ventricle or myocarditis. We thus hope the dataset will be of high value for the community to address a number of research topics in the field, such as multi-scanner image registration, multi-structure segmentation, cardiac quantification, motion analysis and image synthesis.
It is important to note that a follow-up challenge is being organised on multi-centre, multi-vendor and multi-disease cardiac diagnosis. The diagnoses for the 375 cases are being gathered from the different hospitals in a legally compliant manner and the clinical information will be made available after the end of the next challenge, thus allowing the community to work on cardiac image analysis as well as on computer-aided diagnosis in a multi-centre setting.
Note that the participants had less than three months to implement, optimize and test their techniques, which did not allow them to go beyond the existing state-of-the-art techniques in data augmentation and domain adaptation. With more time at their disposal beyond the constraints of the challenge, we expect that researchers will find the M&Ms dataset a valuable resource to investigate, develop and test new theories and frameworks for addressing the difficulties posed by domain shift in cardiac image analysis.

D. Conclusion
The M&Ms challenge is the first study to evaluate a range of deep learning solutions for the automated segmentation of multi-centre, multi-vendor and multi-disease cardiac images. The results show the promise of existing data augmentation and domain adaptation methods, but also call for further research to develop highly generalizable solutions given the inherent heterogeneity in cardiac imaging between centres, vendors and protocols. More generally, there is a need for more research and development to realise the much-needed shift from single-centre image analysis towards multi-domain approaches that will enable wider translation and usability of future artificial intelligence tools in cardiac imaging and clinical cardiology.
Sergio Escalera is with the Departament de Matemàtiques i Informàtica, Universitat de Barcelona, 08007 Barcelona, Spain, and also with the Computer Vision Center, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain (e-mail: sergio.escalera.guerrero@gmail.com).

Fig. 3. The effect of data augmentation on a single CMR slice. In the top row, the original image and spatial augmentations are shown. In the bottom row, intensity-based augmentations are shown.
the 14 techniques relative to each other do not change when analysed per pathology.

Fig. 6. Boxplots with vendor-wise results for DSC and HD when all participants' predictions are considered. Vendors are presented in order: Siemens (A), Philips (B), GE (C) and Canon (D).

Fig. 8. Boxplots for DSC and HD results for centres that had labelled samples in the training set, unlabelled samples in the training set and no samples at all.

Fig. 9. Boxplots for DSC results for the top 3 performing methods depending on different cardiac structures (LV, MYO and RV) and different slice positions for both ED and ES. The apex and the base are defined as the last and first annotated slices, respectively. The middle slice is the slice located in between the apex and base slices. The remaining slices are defined based on their relative position with respect to the middle slice.

Fig. 10. Prediction examples for method P1 for vendors C (GE) and D (Canon). The top two rows show satisfactory results, while the two bottom rows present some errors in the final contours. Color correspondence: left ventricle endocardium (red), left ventricle epicardium (green) and right ventricle endocardium (yellow). Ground truth is drawn in white.

TABLE I. INFORMATION FROM CENTRES INCLUDED IN THIS WORK

Fig. 1. Visual appearance of a CMR short axis middle slice for anatomically similar subjects in the four different vendors considered.

TABLE II. DISTRIBUTION OF THE MOST FREQUENT PATHOLOGIES AND HEALTHY VOLUNTEERS BETWEEN CENTRES. THE ABBREVIATIONS CORRESPOND TO HYPERTROPHIC CARDIOMYOPATHY (HCM), DILATED CARDIOMYOPATHY (DCM), HYPERTENSIVE HEART DISEASE (HHD), ABNORMAL RIGHT VENTRICLE (ARV), ATHLETE HEART SYNDROME (AHS), ISCHEMIC HEART DISEASE (IHD), AND LEFT VENTRICLE NON-COMPACTION (LVNC)

TABLE III. AVERAGE SPECIFICATIONS FOR THE IMAGES ACQUIRED IN THE DIFFERENT CENTRES

translated reliably and deployed in routine clinical practice across the globe.

II. CHALLENGE FRAMEWORK
A. Data Preparation
A total of six clinical centres from Spain, Canada and Germany (numbered 1 to 6 in this work) contributed to this challenge by providing a different number of CMR studies from different scanner vendors, as detailed in Table

TABLE IV. NUMBER OF STUDIES FOR EACH STEP OF THE CHALLENGE PRESENTED BY CENTRE AND SCANNER VENDOR

Fig. 2. Degree of generalizability of models trained from the four vendors.

TABLE V. LIST AND DETAILS OF THE PARTICIPATING TEAMS IN THE CHALLENGE

MATCHING (HM), GAUSSIAN NOISE (GN), BRIGHTNESS (B), GAMMA (G), TEST TIME AUGMENTATION (TTA)

TABLE VII. TRAINING AND INFERENCE TIME, AND HARDWARE USED, FOR ALL PARTICIPATING METHODS. H, M, S AND MIL. STAND FOR HOURS, MINUTES, SECONDS AND MILLIONS, RESPECTIVELY

while P7 developed a modified UNet based on multi-gate and dilated inception blocks to extract multi-scale features. Lastly, one team (P12) proposed a modified Spatial Decomposition Network (SDN)

TABLE IX. DSC RESULTS STRATIFIED BY VENDOR AND HEART SUBSTRUCTURE. THE LAST TWO COLUMNS ARE THE AVERAGE DSC LOSS FOR VENDORS C AND D WITH RESPECT TO THE COMBINED AVERAGE DSC RESULTS FROM VENDORS A AND B