Deep Learning for Ultrasound Image Formation: CUBDL Evaluation Framework and Open Datasets

Deep learning for ultrasound image formation is rapidly garnering research support and attention, quickly rising as the latest frontier in ultrasound image formation, with much promise to balance both image quality and display speed. Despite this promise, one challenge with identifying optimal solutions is the absence of unified evaluation methods and datasets that are not specific to a single research group. This article introduces the largest known international database of ultrasound channel data and describes the associated evaluation methods that were initially developed for the challenge on ultrasound beamforming with deep learning (CUBDL), which was offered as a component of the 2020 IEEE International Ultrasonics Symposium. We summarize the challenge results and present qualitative and quantitative assessments using both the initially closed CUBDL evaluation test dataset (which was crowd-sourced from multiple groups around the world) and additional in vivo breast ultrasound data contributed after the challenge was completed. As an example quantitative assessment, single plane wave images from the CUBDL Task 1 dataset produced a mean generalized contrast-to-noise ratio (gCNR) of 0.67 and a mean lateral resolution of 0.42 mm when formed with delay-and-sum beamforming, compared with a mean gCNR as high as 0.81 and a mean lateral resolution as low as 0.32 mm when formed with networks submitted by the challenge winners. We also describe contributed CUBDL data that may be used for training of future networks. The compiled database includes a total of 576 image acquisition sequences. We additionally introduce a neural-network-based global sound speed estimator implementation that was necessary to fairly evaluate the results obtained with this international database. The integration of CUBDL evaluation methods, evaluation code, network weights from the challenge winners, and all datasets described herein are publicly available (visit https://cubdl.jhu.edu for details).


I. INTRODUCTION
SIGNIFICANT research has been dedicated recently to developing methods for deep learning in ultrasound imaging, as summarized in several recent review articles and special issue editorials [1]- [4]. The merger of deep learning and ultrasound image formation is promising because it has the potential to shed light on features that are not considered by algorithmic approaches that underlie the mathematical, model-based component of image formation, with multiple input-output and training options [5]- [7]. These data-driven deep learning approaches have the potential to be more robust than the traditional model-based beamforming methods, as they do not require parameter changes when switching to different scanners, they are able to generalize across different datasets, and they can infer from advanced beamforming methods in less time than that required to perform the otherwise computationally intensive calculations associated with advanced beamformers [8]- [11]. Despite the promising potential of deep learning approaches applied to ultrasound imaging, there has been a noticeable dearth of publicly available frameworks to evaluate new deep learning methods with the same reference data. Such open frameworks are useful for benchmarking and comparing methods against each other, as demonstrated in the fields of visual recognition [12] and computed tomography [13].
One outcome of the Challenge on Ultrasound Beamforming with Deep Learning (CUBDL), offered as a component of the 2020 IEEE International Ultrasonics Symposium, was the development of an open evaluation framework that may be used as a standard for deep learning image formation methods in ultrasound imaging. The overarching goal of this challenge was to explore the potential of deep learning to improve ultrasound image quality and to balance these improvements with implementation practicality (i.e., reasonable display frame rates) [14]. Comparisons were completed on the same datasets and with the same evaluation methods and framework.
This article summarizes and compares submissions to CUBDL using the evaluation framework developed for this challenge. We present results on both the initially closed CUBDL evaluation test dataset and additional in vivo breast ultrasound data contributed after the completion of the challenge. Components of this evaluation framework were publicly available while CUBDL was in progress, and the framework is now merged with the CUBDL test data and the additional datasets described herein. This framework merger and the associated datasets are publicly available at [15] and [16], respectively (visit https:// cubdl.jhu.edu for more details). The summary within this article and the accessibility of our data and code are provided in efforts to enable future benchmarking and evaluation of new methods.
The remainder of this article is organized as follows. Section II provides an overview of the CUBDL challenge, including dates associated with major milestones. Section III describes details of the open datasets, the top challenge submissions, the CUBDL evaluation process, scoring and ranking methods, and the post-CUBDL analyses implemented to ensure robustness and reproducibility of our shared code and challenge winner declaration. This section is organized to align with the sequential timeline of events and is effectively a combination of materials used and methods implemented by three categories of contributors to this article: 1) the CUBDL organizers who implemented dataset curation, evaluation, scoring, assessment, and rankings; 2) the top challenge participants who provided network details; and 3) A.W. who completed the independent post-CUBDL evaluations and curation of the post-CUBDL datasets. Section IV presents the results from the perspective of the CUBDL organizers, detailing our evaluation and analysis findings, with post-CUBDL contributions from A.W. Section V discusses results within the broader context of the current state of the field from the perspective of the CUBDL organizers. Finally, Section VI concludes the article.

II. CHALLENGE SUMMARY AND TIMELINE
Organization of this challenge initiated on October 13, 2019. The complete challenge setup, submission process, and discussion of the challenge organization are described in detail in the associated conference proceedings coauthored by the challenge organizers [20]. To briefly summarize details in [20] that are relevant to this article, the challenge was designed with no specific training data provided by the CUBDL organizers, because a review of current literature on the topic of deep learning for ultrasound beamforming reveals that there are multiple possibilities for both training approaches and training datasets. Therefore, the CUBDL organizers decided to keep the training open-ended, with the goals of encouraging a wide variety of submission approaches, enabling participants to innovate on the training methods, and challenging participants to produce a network that achieved specific tasks and met specified requirements. The publicly available PICMUS data [21] were additionally advertised and considered to serve as possible training data for participants who desired more structure. From November 22, 2019 to January 23, 2020, test data were crowd-sourced from multiple groups around the world, representing the largest known international database of ultrasound channel data.
The challenge opened to participants on January 30, 2020, and closed to participants on the submission deadline of June 23, 2020. Although three optional tasks were conceived by the CUBDL organizers [20] and advertised on the CUBDL website [14], all participants submitted networks to be evaluated for Task 1, which was beamforming with deep learning after a single plane wave transmission, with two optional subtasks. Task 1a was explicitly focused on creating a high-quality image from a single plane wave to match a higher quality image created from multiple plane waves. Three participants submitted to this subtask. Task 1b allowed more flexibility to create images with the highest possible image quality metrics, including metrics that exceeded those obtained with multiple plane wave transmissions (e.g., no speckle preservation required). One participant submitted to this subtask. Note that the dates that CUBDL was open to participants overlapped with the onset and peaks of the global COVID-19 pandemic [22], which likely affected participation rates.
Participants were instructed to train networks using their preferred machine learning frameworks (e.g., TensorFlow, PyTorch) and submit final model files to the IEEE DataPort website [16]. The CUBDL organizers then downloaded the submitted models and launched a Python script to perform evaluation [15] on the internationally crowd-sourced database of test data [16]. Evaluation metrics advertised since the launch of the CUBDL website [14] were pre-selected by the CUBDL organizers based on literature from multiple groups reporting beamforming with deep learning (e.g., [8], [9], [11], [23]) and based on common computer vision literature containing assessments of network complexity (e.g., [24]- [26]).
After evaluation, two participants (one submitting to Task 1a and the other submitting to Task 1b) produced networks that showed evidence of overfitting to the training data when tested on the closed CUBDL test dataset for Task 1 [19], [27]. These participants were contributed additional test data for network evaluation after the challenge was closed. These in vivo breast ultrasound data (initially reported by Li et al. [19]) consisted of 11 acquisition sequences, acquired from six patients, including two orthogonal scans (i.e., radial and anti-radial) of five patients and a radial scan of one patient. These data were acquired after informed consent and with approval from the Johns Hopkins Medicine Institutional Review Board (IRB). Specifically, raw plane wave radio frequency channel data were acquired with an Alpinion ECUBE-12R research ultrasound scanner connected to either an L8-17 or an L3-8 linear array ultrasound transducer, with 75 plane wave transmissions including angles ranging −16° to 16°, or with 73 plane wave transmissions including angles ranging −8° to 8°. The sampling frequency was 40 MHz. Additional details and acquisition parameters are reported in Table I and in the Appendix.

3) Additional Dataset Contributions:
A subset of contributed data not selected by the CUBDL organizers for evaluation of Task 1 are included in the open dataset [16] published with this article for potential future training of new networks. There were various reasons these data were not selected for Task 1, including a desire to maintain sufficient variability, to avoid dominance of data from one group, phantom, subject, or system setting, and to exclude data acquired with focused transmissions for the unattempted Task 3. These additional data consisted of 544 image acquisition sequences (including 529 acquired with plane wave transmissions and 15 acquired with focused transmissions), contributed from MYO, UFL, TSH, EUT, INS, OSL, and JHU. Image targets included simulated structures, experimental phantoms, and in vivo heart, breast, brachioradialis, and carotid. The in vivo data from JHU and OSL were acquired after informed consent and IRB approval, as noted in [30] and [31], respectively. The in vivo data from TSH were acquired after informed consent and ethics approval. The phantom data consisted of 12 different phantoms from six manufacturers including: 1) CIRS models 040, 049, 054GS, 050, 059, and ATS549; 2) GAMMEX models 403 and 410 SCG; 3) NPL Thermal Test Phantom; 4) CAE Blue Phantom Elastography Breast Model; 5) True Phantom Solutions Brain Phantom; and (6) Dansk Phantom Service Model 453. One of these phantom acquisitions (i.e., from MYO) included a layer of ex vivo porcine abdominal tissue to introduce acoustic clutter. These data were acquired with 4 ultrasound scanner models (i.e., Verasonics Vantage 256, Verasonics Vantage 128, Alpinion ECUBE12-R, and ULA-OP 256) and 11 ultrasound transducer models. The acquisition center frequencies ranged from 2.97 to 12.5 MHz. The sampling frequencies for this dataset ranged from 6.25 to 100 MHz, representing a larger range than that reported in Section III-A1. The ultrasound transducers consisted of linear and phased arrays. Additional details and acquisition parameters are reported in the Appendix.

4) Sound Speed Correction Applied to Phantom Data:
In ultrasound imaging, the sound speed is often assumed to be globally constant at 1540 m/s when computing focusing time delays. However, the true sound speed often differs from this assumption and depends on specific properties of the underlying tissue. For example, the sound speed of fat is lower than that of liver or muscle [32], [33], and the sound speeds of calibrated tissue-mimicking phantoms are known to vary with the ambient temperature [34]. Improper focusing can degrade image quality, lead to worsened metrics such as contrast and point target resolution, and cause incorrect positioning of image targets. As CUBDL evaluation heavily depended on these and other image quality parameters [20], the CUBDL organizers applied an initial sound speed correction to the contributed test data.
Sound speed correction is an active area of research with many proposed figures of merit, including speckle brightness [35], [36], coherence factor [37], among many others [38]- [41]. Considering this multitude of validated options, the CUBDL organizers selected speckle brightness maximization for its simplicity and its wide acceptance as a criterion of focusing quality [35]. More specifically, the correct global sound speed was selected as the one that maximized the average brightness in a homogeneous region of speckle [36].
PyTorch [42] provides a convenient differentiable framework to perform per-image optimization. Using the same PyTorch implementation of delay-and-sum (DAS) plane wave beamforming provided to CUBDL participants, the sound speed was adjusted via gradient ascent until speckle brightness was maximized using the Adam optimizer (initial step size: 10m/s, decay rate: 0.9, 30 steps). Despite the use of PyTorch, we emphasize that this sound speed correction did not involve any actual deep learning. Specifically, no parameters were trained, and no future predictions of sound speed were performed. PyTorch was simply used to execute a well-known and previously validated sound speed correction algorithm [36] on a per-image basis.
For each of the 49 phantom acquisition sequences acquired using plane wave transmissions (including 19 of the plane wave sequences described in Section III-A1 and 30 of the plane wave sequences described in Section III-A3), the PyTorch DAS beamformer and associated sound speed correction implementation processed the 75 plane-wave raw data in a pixel grid corresponding to a homogeneous region of speckle. Fig. 1(a) shows the sound speeds obtained throughout 30 iterations of speckle brightness maximization for each of the 49 phantom acquisition sequences. After 30 iterations, a wide range of optimal sound speeds were observed across these contributed datasets, ranging from 1453 to 1618 m/s. The corrected value is largely grouped by institution. The two sequences from MYO acquired with the clutter layer (i.e., MYO004 and MYO005) are most deviant in comparison to the remaining sequences from the same institution. The data submitted by INS, which included nine different phantoms, demonstrated the largest variation. The sound speed correction was sensitive to the chosen speckle region for some of these phantoms. , and MYO003. The sound speed included in the files submitted by data contributors for these four data sequences was 1540 m/s (although the sound speed values provided on data sheets from the associated phantom manufacturers were unlisted, 1540, 1580, and 1580 m/s, respectively). When using 1540 m/s as the sound speed to create images, lesion inclusions and point targets were visible, yet with degraded contrast and resolution, as seen in the top row of Fig. 1(b). However, when the same channel data were used to create images with the corrected sound speeds, this update yielded considerably sharper lesion boundaries, more visually separable point targets, and target repositioning to calibrated depths, as seen in the bottom row of Fig. 1(b). The specific sound speed values used to create the example images in Fig. 1(b) are reported in the top left corner of each image. This sound speed correction was a necessary step implemented by the CUBDL organizers to enable fair evaluations, and the corrected sound speed values were not shared with the participants while the challenge was open. The corrected sound speeds for the 49 phantoms are now available with the CUBDL evaluation code [15] and associated datasets [16]. During the evaluation process, the corrected sound speeds were provided as inputs to the networks submitted by participants and were used to calculate the image quality metrics reported throughout the article.

5) Summary of Compiled Open Datasets:
In summary, the complete CUBDL + post-CUBDL datasets released with this publication contain a total of 576 acquisition sequences with the following breakdown: 1) 49 experimental phantom data sequences acquired with plane wave transmissions; 2) 11 in vivo data sequences from the breast of six patients, each acquired with plane wave transmissions; 3) 500 in vivo data sequences from the brachioradialis of a healthy volunteer, each acquired with plane wave transmissions; 4) six experimental phantom data sequences acquired with focused transmissions; 5) eight in vivo data sequences comprising the carotid artery of a healthy volunteer, the heart of two healthy volunteers, and the breast of two patients, each acquired with focused transmissions; and 6) 2 Field II [43], [44] simulations.
This wide range of channel data was acquired with four ultrasound scanner models and 11 ultrasound transducer models. The acquisition center frequencies ranged from 2.97 to 12.5 MHz. The sampling frequencies ranged from 6.25 to 100 MHz. The ultrasound transducers consisted of linear and phased arrays. These data were provided by seven groups total: 1) MYO; 2) UFL; 3) EUT; 4) INS; 5) OSL; 6) TSH; and 7) JHU. More details about these data are available in Tables IV through X in the Appendix.
We additionally provide sound speeds for the 49 phantom acquisition sequences acquired using plane wave transmissions with the released datasets [16], including corrected sounds speeds from the procedure described in Section III-A4, sound speeds submitted by data contributors, and sound speeds reported in publicly available datasheets from phantom manufacturers for comparison.

1) Rothlübbers et al. [28]:
A fully convolutional network was submitted by Rothlübbers et al. [28], and the following summary includes details that are not available in [28]. The network had four layers. The network input was time-delayed, magnitudenormalized, complex-valued data from the 0° plane wave transmission angle. The network output was a real-valued (scalar) weighting factor for each reconstructed pixel. The network was designed to model the united sign coherence factor (USCF) [45] by computing pixel-wise weighting. The final pixel values were obtained by multiplying the unweighted sum absolute pixel values by the network output pixel weights, followed by log compression and a correction for the maximum value.
Batch normalization and ReLU activation followed each convolution. To account for memory limitations during training and inference, a patch-based approach was used, dividing input and target data into patches of size 200 × 200. Convolutions were applied in the channel domain only, resulting in individually processed pixels. This network used an Adam optimizer with a learning rate decay of 0.1 every five epochs, and it was trained for 15 epochs. The loss was computed as a linear combination of mean-squared error (MSE) and multiscale structural similarity (MS-SSIM) [46] loss on the log-compressed, normalized final images. The trained network weights are available with the evaluation code provided with this publication.
The training data consisted of 107 US raw datasets of a phantom (Model 054GS, CIRS, Norfolk, VA, USA), acquired with multiple angles using a 128-element linear array transducer (DiPhAS, Fraunhofer IBMT, Sankt Ingbert, Germany) operating at 4 MHz. High-quality target images were reconstructed using multi-angle USCF imaging [45], using data from seven plane wave angles. The reconstruction grid was chosen with an equidistant isotropic pixel spacing of a third of the wavelength and positioned such that artifact-prone areas such as, for example, near the transducer were excluded. The publicly available PICMUS dataset [21] was used to test the model prior to submitting it to the challenge.

2) Goudarzi et al. [29]:
The details in the following summary parallel the details provided in Section III-B1 if direct comparison between the two networks is desired, and this summary includes additional details that are not available in [29]. The submission by Goudarzi et al. [29] used the MobileNetV2 [47] architecture, which consists of depthwise separable convolution building blocks, linear bottlenecks between layers, and shortcut connections between the bottlenecks. The network input was a 2 × m × n matrix in which first the two channels were the real and imaginary parts of in-phase and quadrature (IQ) data, n is the number of channels, and m is the length of the window considered for temporal averaging to preserve the speckle statistics. The network output was a two-dimensional (2-D) vector containing real and imaginary parts of the beamformed data. The network was designed to estimate and apply an apodization window to the input IQ channel data for minimum variance beamforming [48]. The output IQ data were then envelope-detected and log-compressed to obtain the final B-mode ultrasound image. This network used an AdamW optimizer [49] with default parameters (i.e., β 1 = 0.9 and β 2 = 0.999). The loss was computed as the L1-norm between the network output and the IQ pixel values obtained using the minimum variance beamformer. The trained network weights are available with the evaluation code provided with this publication.
The training data for this network consisted of the publicly available plane wave and focused transmission datasets available in the Ultrasound Toolbox [21], [31], [50]. This toolbox contains replicates of data sequences OSL008-OSL009 [51], OSL011-OSL014 [31], and OSL015 [52] in a different file format. Plane wave datasets from this toolbox were acquired with a Verasonics Vantage 256 scanner and L11-4v probe (phantom and in vivo data), an Alpinion E-Cube12R scanner and L3-8 probe (phantom data), and Field II simulated data. Focused imaging datasets were acquired with a Verasonics Vantage 256 scanner connected to a P4-2v probe and an Alpinion E-Cube12R scanner connected to a L3-8 probe. Although up to 75 plane waves were available, images were reconstructed from data received after only a single 0° plane wave transmission as the ground-truth output images during training, because the objective when implementing this network was to mimic minimum variance beamforming [48] (which is a real mapping function on the channel data).

1) Local Image Quality Metric:
Contrast, contrast-to-noise ratio (CNR), generalized contrast-to-noise ratio (gCNR) [51], [53], and signal-to-noise ratio (SNR) within selected regions of interest (ROIs) were measured on the envelope of beamformed images from CUBDL Task 1 data as follows: where μ i , σ i , and f i represent the mean, standard deviation, and histogram, respectively, of ROI i. The gCNR metric was calculated with 256 bins. The SNR metric was used to measure speckle preservation with the ground truth derived from the 75 plane wave image. In addition to the above metrics, the axial and lateral full-width at half maximum (FWHM) of point targets was calculated to determine resolution. Details surrounding the specific ROIs chosen for each test case are available in in the Appendix and in our public evaluation code [15]. This code will further include routines for evaluating speckle pattern autocorrelations for resolution measurements.

2) Global Image Quality Metrics:
The more global ℓ 1 loss, ℓ 2 loss, peak signal-to-noise ratio (PSNR), and normalized cross correlation (ρ) metrics were computed from CUBDL Task 1 data for image-to-image comparisons between network-produced envelope images and the corresponding 75 (or 31 for sequence TSH002) plane wave envelope image where x and y denote the two images being compared, each containing N pixels. The image comparisons were performed using pixels within the dynamic range of −40 to 0 dB with respect to the maximum pixel value of the ground truth image to avoid overemphasis of small magnitude differences and to avoid penalizing networks that do not reproduce the acoustic clutter that resides at magnitudes less than −40 dB [54]. In addition, both the ℓ 1 and ℓ 2 losses were computed for images on a linear scale (i.e., before log compression, denoted as ℓ 1 and ℓ 2 , respectively) and on a log-compressed scale (i.e., after log compression, denoted as ℓ 1 -log and ℓ 2 -log, respectively). The linear scale is expected to be more sensitive to highamplitude variations (e.g., edges), whereas the log scale is more sensitive to low-amplitude variations (e.g., speckle). These metrics were computed on images that were normalized to minimize the achievable loss, as described in the Appendix of [11]. Overall, these global image quality metrics provide quantitative information that is not subject to the region selection requirement of the local metrics.
3) Network Complexity: Network complexity was determined by the number of learnable parameters in each submitted model. Although the CUBDL organizers initially intended to include network speed as an additional parameter to assess network complexity, there was greater variety in the "preprocessing" prior to entering the network than anticipated, which complicated fair comparisons of network speed. For example, the submission by Goudarzi et al. [29] needed the data to be reshaped and reformatted into 1-D kernels along the axial dimension, all of which was implemented outside of PyTorch, whereas the two submissions from the runners up [19], [27] needed a 2-D image interpolation in numpy. For this reason, the CUBDL organizers only considered network size, as each participant successfully addressed the proposed task which purposefully did not specify which component of the image formation process had to be learned.

D. Scoring and Ranking
The four CUBDL participants were evaluated using the test data described in Section III-A1 and received a rank based on each quantitative network performance described in Section III-C. These rankings were grouped into two categories: 1) image quality and 2) network complexity, considering that participants were challenged to balance both image quality and display frame rates. We averaged the ranks of the metrics obtained by each participant within these two groups. The average rank from each group was summed. This scoring system is represented mathematically as follows: Final Score = image quality metric rankings T IQ + Network complexity metric rankings T NC (9) where T IQ and T NC are the total numbers of image quality metric rankings and network complexity rankings, respectively.

E. Post-CUBDL Analyses
To improve the robustness of our evaluation methods and to ensure the robustness of our open-source code, two additional analyses were implemented. First, a two-sample, two-tailed t-test with a 5% significance level was performed to determine the statistical significance of differences among the top two submitted networks described in Section III-B when evaluating the global image-to-image metrics described in Section III-C2. Second, the test data were evaluated by an independent user (A.W.) after the challenge was closed to new submissions. This independent user provided feedback to increase compatibility of the evaluation framework and also used this framework to evaluate the submitted networks with the post-CUBDL data described in Section III-A2.

1) Baseline Evaluation:
To ensure that participant submissions from different deep learning frameworks (e.g., Tensor-Flow and PyTorch) were properly loaded into our evaluation code, we used a baseline evaluation on the publicly available PICMUS data [21], as shown in Fig. 2. All submissions were confirmed to produce images morphologically similar to the single and 75 plane wave images before proceeding with the evaluation [14].
2) Lesion Targets: Fig. 3  The quantitative results for lesion visibility were measured for a total of eight ROIs from the following seven sequences: UFL001, UFL005 (two independent ROIs from this sequence), OSL007, MYO001, MYO004, INS008, and INS021. The mean ± standard deviation of the contrast, CNR, and gCNR for the eight ROIs is reported in Table II. No participants achieved greater lesion visibility or detectability than that obtained with 75 plane wave transmissions. Goudarzi et al. [29] achieved the best lesion visibility with the least variance overall and also achieved the closest metrics to the 75 plane wave results, while in some cases exceeding the single plane wave result.
3) Speckle Targets: Fig. 4 shows examples from two speckle targets taken from dataset sequences MYO002 and INS016. Most participants achieved results that qualitatively resembled speckle texture. Goudarzi et al. [29] best achieved the goal of preserving speckle SNR from both single and 75 plane wave transmissions, when measured from six ROIs total from the following sequences: UFL004, OSL007, MYO002, EUT003, INS004, and INS016. The mean ± standard deviation of speckle SNR within these six ROIs is reported in Table II. Fig. 5 shows example results from the point target evaluation, taken from dataset sequence UFL004. There were a total of four ROIs from sequences MYO002, MYO003, UFL002, and UFL004, each containing 3-5 points that were evaluated. The mean ± standard deviation of the lateral and axial resolution of these point targets is reported in Table II as axial and lateral FWHM, respectively. It is remarkable that the networks submitted by Rothlübbers et al. [28] and Goudarzi et al. [29] achieved better resolution results when compared with standard DAS imaging with both single and 75 plane wave transmissions, considering that the network images were created from channel data acquired after a single plane wave transmission.

5) Image
Targets: Fig. 6 shows examples from three image targets. These example images were taken from data sequences OSL010, INS008, and TSH002. Network and single plane wave results were compared with the 75 plane wave images, using the global image-to-image metrics described in Section III-C2 and the 12 images created with data  sequences EUT006, INS006, INS008, INS015, INS019, MYO001, MYO002, MYO004, MYO005, OSL010, UFL002, and TSH002. The corresponding image-to-image comparison results are reported in Table III.
Summarizing the results in Table III, there were no statistically significant differences (p > 0.05) when comparing the mean ℓ 1 , ℓ 1 -log, ℓ 2 , and PSNR of the networks submitted by Goudarzi et al. [29] and Rothlübbers et al. [28]. However, there were statistically significant differences (p < 0.05) when comparing the mean ℓ 2 -log and ρ of the networks submitted by Goudarzi et al. [29] and Rothlübbers et al. [28]. The corresponding image quality metrics obtained when the single plane wave results were compared with the 75 plane wave results from the same datasets are also reported in Table III for additional comparison. With the exception of ℓ 1 , these single plane wave results produce the best mean values across the 12 image targets included in the CUBDL evaluation dataset (as indicated by the bold text in Table III), where lower values of ℓ 1 , ℓ 2 -log, ℓ 1 , ℓ 2 -log, and higher values of ρ and PSNR indicate better matches with the ground-truth 75 plane wave results. Note that for cases with identical mean ℓ 1 and ℓ 2 values in Table III, the corresponding values with the lower mean ℓ 1 -log and ℓ 2 -log, respectively, were bolded.

6) Rankings:
Based on the first additive term of (9), the images created with 75 plane wave transmissions achieved the highest rank in a majority of cases (i.e., 1), followed by images created with DAS beamforming after a single plane wave transmission, then the network submitted by Goudarzi et al. [29], and then the network submitted by Rothlübbers et al. [28]. Specifically, the average ranks for image quality [i.e., the first additive term of (9)] were 1.25, 2.58, 2.67, and 3.58, respectively, for the rank order provided in the preceding sentence. Based on the second additive term of (9), Rothlübbers et al. [28] submitted a network with significantly less network complexity (i.e., 3059 parameters) when compared with that of Goudarzi et al. [29] (i.e., 2226146 parameters), as reported in Table II. Therefore, Rothlübbers et al. [28] received a rank of 3 in the network complexity category, followed by Goudarzi et al. [29] receiving a rank of 4, considering that both single and 75 plane wave DAS implementations had 0 trainable parameters for this evaluation.
When combining the two additive terms in (9) to determine the winner, Rothlübbers et al. [28] and Goudarzi et al. [29] achieved final scores of 6.58 and 6.67, respectively, resulting in both being declared by the CUBDL organizers as the challenge winners. This declaration was made because Goudarzi et al. [29] achieved both qualitatively and quantitatively highquality images, and at the same time, it was remarkable that Rothlübbers et al. [28] produced a network with significantly low complexity (and generally acceptable image quality with the best lateral and axial resolution overall). in Section III-A2. These example images were taken from data sequences JHU028 and JHU030. Network and single plane wave results were compared with the 75 plane wave images, using the global image-to-image metrics described in Section III-C2 and the 11 images created with data sequences JHU024 through JHU034. The quantitative image-toimage comparison results for this dataset are reported in Table III.

B. Performance With Post-CUBDL In Vivo Data
The framework developed using the initial CUBDL evaluation test set was able to seamlessly incorporate the JHU post-CUBDL in vivo breast data, which was not included in the initial test set. Remarkably, the submitted networks also generalized to these new data from a different imaging environment. We observed qualitative differences between the in vivo breast images produced by the two submitted networks, and the results in Table III demonstrate that the ℓ 2 -log and ρ metrics consistently show statistically significant differences across both datasets (i.e., CUBDL and post-CUBDL data). Table III also shows that the network submitted by Goudarzi et al. [29] produced consistently better quantitative image-to-image comparisons than the network submitted by Rothlübbers et al. [28] across both datasets. In addition, the results obtained when testing this network [29] on the post-CUBDL in vivo dataset were consistently better than the results obtained with DAS beamforming of the single plane wave data (as indicated by bold in Table III). For this post-CUBDL dataset, the network submitted by Rothlübbers et al. [28] also produced better qualitative results in some cases (e.g., Fig. 7, sequence JHU028) than the corresponding single plane wave result.

A. Overview of Challenge Outcomes
The details described in this article serve the threefold purpose of: 1) summarizing, comparing, and drawing insights from the top CUBDL submissions; 2) providing a detailed description of the evaluation process for future benchmarking of deep neural networks designed to create ultrasound images; and 3) describing the totality of shared evaluation code and datasets for ease of future reproducibility and replication. Our major challenge outcomes include the largest known international database of raw ultrasound channel data, network descriptions and trained network weights from the CUBDL winners, a PyTorch DAS beamformer containing multiple components that can be converted to trainable parameters, a data sheet of phantom sound speeds containing optimal speeds identified by the CUBDL organizers using the PyTorch DAS beamformer, and evaluation code that integrates these major outcomes.
During the evaluation process, the first step to achieving the challenge objectives was to ensure that the image reconstruction process offered a fair comparison despite the wide range of test data from multiple groups. Hence, the development of the neural-networkbased PyTorch DAS beamformer and the associated sound speed correction (described in Section III-A4) were key contributions to enable fair comparisons. While these contributions are the first to implement a differential framework for sound speed correction, there are other solutions that implement DAS within a computational framework capable of automatic differentiation (e.g., PyTorch or TensorFlow) [55]- [57].
The results from these sound speed correction contributions (i.e., Fig. 1) support the importance of including a standardized international dataset for deep learning network development (e.g., training, testing, and evaluation). In particular, a dataset containing incorrect sound speeds and limited acquisition parameter variability may not generalize well to data acquired under different conditions. Sound speed differences within groups (e.g., INS) and between groups (e.g., MYO and UFL) that imaged the same phantom models can be attributed to a combination of ambient environment, the presence of an ex vivo tissue layer producing acoustic clutter, and phantom construction differences. Considering that many of the observed phantom variations occurred in the absence of tissue layers and the speckle brightness sound speed correction method was previously validated [36], environmental factors (e.g., ambient temperatures, phantom degradation) are considered to be the primary contributor to these observed differences.

B. Quality of Image Formation With Deep Learning
One of the underlining questions surrounding CUBDL Task 1 was an inquiry regarding the capability of deep learning to create high-quality images after a single plane wave transmission. The image quality summary in Table II indicates three insights and capabilities of deep neural networks implemented for beamforming. First, they are capable of preserving speckle SNR. Second, they may be capable of improving resolution. Third, the submitted networks (which created images after only a single 0° plane wave transmission) were capable of producing better qualitative, quantitative, and lesion detectability results than the single plane wave DAS result in some cases (e.g., Fig. 3 sequence MYO001, Figs. 5-7, Tables II and III). It remains to be determined whether the latter two capabilities can be attributed to the choice of target beamformer (i.e., USCF, minimum variance) during training, as opposed to the function-approximating nature of the deep learning approach itself. In addition, the submitted networks had difficulty obtaining better lesion detectability than the 75 plane wave DAS images. These observations are based on the presented image quality metrics and are supported by the image quality rankings, with associated limitations discussed in more detail in Section V-E. For example, although speckle SNR was quantitatively preserved in some cases (see Table II), it was not always qualitatively preserved (e.g., MYO002 in Fig. 4), when comparing both single plane wave results and network results with the 75 plane wave results.
We also observed that learning a single, intermediate step of the image formation pipeline (as pursued by Goudarzi et al. [29] and Rothlübbers et al. [28]) represents a more clearly defined transformation for the presented task, as opposed to attempts to learn the entire beamforming process (i.e., the approach taken by [19] and [27], which suffered from overfitting to the training data, as shown in the presentation of all challenge results [14]). To briefly summarize for the context of this article, overfitting manifested as an inability of the submitted networks to detect point targets in the unseen test data, recreate lesions from the unseen test data, or replicate the gradation from light to dark shown in Fig. 6 (sequence OSL010). Instead, the images looked more like the learned tissue texture or lesions in the training images in most cases. The training datasets curated by the respective authors of these two runner-up submissions contained similar data to each corresponding testing dataset, and the ability of the resulting networks to produce acceptable results on these similar data is evident in the articles describing the associated submissions [19], [27]. These two networks were also able to successfully reproduce the results shown in Fig. 2, as presented in [14], likely because the PICMUS data were included in the training sets of both networks. These observations highlight successes that can be achieved when the training and testing sets are similar, with a discussion of the overfitting implications available in Section V-D in the context of future recommendations.

C. Speed of Image Formation
It is promising that the improvements to image quality were obtained with network complexity levels that indicate a possibility for improved speed compared with the traditional implementations of advanced beamforming approaches. Although the CUBDL organizers did not specify which parts of the image formation task had to be learned, participants were aware that they would be evaluated based on the number of trainable parameters within submitted networks. Participants therefore had the freedom to decide whether they used a network to replace the delay step, the sum step, both steps, a post-processing filter step, and so on. This freedom somewhat complicated the network complexity evaluation, considering that a neural network has the potential to outperform deterministic DAS at the cost of adding parameters that need to be trained. If a network replaces both "delay" and "sum," it might be more complex than a network that replaces just the "sum," but it may also rely on fewer assumptions and thus would have greater potential to give better results. This tradeoff was considered to be captured sufficiently by counting the number of trainable parameters.
It is possible that an additional dimension of complexity could have been included by adding a metric for "number of processing components replaced". However, this addition is not trivial to implement, considering the wide range of possible steps outside of basic DAS beamforming. While different network architectures with the same number of parameters may have different inference speeds, oftentimes this difference in speed is due to nuanced hardware-and software-specific optimizations. It is also possible that we could have estimated the number of floating point operations (FLOPs), but there are multiple methods and implementations for estimation with no clear standard. These considerations point to the additional importance of learning from these CUBDL challenge results to produce updated guidelines if a similar undertaking is attempted in the future.

D. Future Recommendations
The observations summarized in Section V-B highlight the importance of a truly blind test set to detect and prevent possible overfitting, which is a secondary outcome of the CUBDL challenge (i.e., the newly available database [16] and evaluation code [15] being released with the publication of this article). Regarding the choice not to provide training data for CUBDL participants, the intent was to create a challenge that represented the current state of ultrasound imaging research. In particular, prior to the dataset contribution described in this article, there was no previously curated standard training set specifically catered to deep learning tasks for ultrasound image formation. Instead, the CUBDL organizers encouraged participants to make intellectual contributions on both the choice of architecture and training data, rather than to find a single optimal architecture for a given training dataset.
Now that the challenge is completed, we propose the use of the datasets described in Sections III-A1 and III-A2 as testing data (with caution to avoid overfitting if the dataset described in Section III-A3 is also used for training). Similarly, we propose the use of the top CUBDL challenge submissions as benchmarks for future work in this research domain. For example, related work on the topic of CUBDL Task 1 published after the close of the challenge [58]- [62] can potentially be reevaluated and compared using the CUBDL data [16] and code [15] released with this publication, with no alterations to network training weights and with attention to the overfitting challenges experienced by the runners-up, particularly when considering that some of these methods were similarly trained and tested with the PICMUS data. We are essentially challenging the community to achieve better results than the CUBDL submissions and to make more headway toward explaining why the results are better based on the open research questions articulated in this discussion of challenge results. We also welcome new contributions to the field of ultrasound beamforming with deep learning (and possibly beamforming without deep learning) based on new ideas conceived while reading this article and while working with our open source data and code. The shared datasets and code provide a major step toward making meaningful comparisons in future work.

E. Limitations and Additional Considerations
Three possible factors that may be viewed as limitations of administering CUBDL include the novelty of the sound speed correction approach, the subjectivity of decisions surrounding the ranking method (which is arguably a limitation of any challenge [63]), and the diversity of the datasets.
First, sound speed optimization is not widely used, and its implementation in a deep learning framework (i.e., PyTorch) may be considered novel and thus inappropriate to introduce in a challenge. The decision to include sound speed correction was threefold: 1) correction was critically needed to enable fair contrast and resolution evaluations in the challenge [see Fig. 1(b)]; 2) the sound speed correction algorithm itself was based on a well-established criterion [35] (see Section III-A4); and 3) the specific implementation of the algorithm (e.g., traditional brute-force search versus PyTorch optimization) does not affect the final result. Thus, the PyTorch-optimized sound speeds were included in the challenge. A potential limitation of this work is that further validation of the sound speed correction was considered out of scope and thus not performed. However, we note that this correction was intended solely as a coarse improvement over assuming 1540 m/s rather than as a perfect sound speed correction, which cannot be achieved by any assumed global sound speed due to the heterogeneity of real targets.
Second, CUBDL placed an equal weighting on the objectives of image quality and network complexity, which may be considered arbitrary and subjective. Indeed, establishing absolute rankings in any multi-objective task is non-trivial because it is possible to be optimal in one metric but sub-optimal in another, that is, to be "Pareto-optimal." In such a system (known as a partial order), the only way for one method to be definitively superior is to dominate all others in all categories. Although a worthy goal, such an optimum often does not exist and thus cannot be relied upon for challenge administration. Furthermore, it is important to recognize that each image quality metric alone measures a narrow aspect of image quality, such that optimality in that individual metric may not correspond to perceived image quality. For example, speckle SNR measures only first-order statistics and does not capture second-order statistics of speckle structures, making it a poor indicator of image quality alone; however, the (ℓ 1 and (ℓ 2 losses provide complementary spatial information. To reliably obtain a final rank (a total order) from multiple independent categories, one must necessarily apply subjective opinions as to the importance and weighting of each category.
Our chosen approach (detailed in Section III-D) effectively placed less weight on any individual image quality metric and greater weight on overall network complexity. The CUBDL organizers made a conscious decision to emphasize network complexity as a distinguishing factor in anticipation of numerous submissions with similar imaging performance, considering that ultrasound beamforming needs to be real-time for practical application and implementation. It is possible that a low-complexity network producing poor image quality would have achieved a good score with the chosen approach. In particular, according to the image quality rankings reported in Section IV-A6, the current approach favored single plane wave DAS over the submitted networks due to a combination of the better rank performance of the global metrics (summarized in Table III for Task 1 data) and the absence of trainable parameters with traditional DAS beamforming (despite the better performance of the networks over single plane wave DAS images with regard to local metrics, as summarized in Table II). Other weightings are equally subjective, thus equally valid, and may result in different final rankings. The CUBDL organizers are making all data and evaluation scripts available with the publication of this article, leaving room for others to implement alternative analyses if desired.
Finally, while the diversity of the test datasets is viewed as a strength with regard to assessing network generalizability (as discussed in Section V-A) and with regard to detecting possible overfitting to training data (as discussed in Section V-D), this diversity may have also been viewed as a barrier for entry into the challenge. In particular, variations in multiple parameters that are routinely modified to achieve a desired ultrasound image were advertised (e.g., variations in transducers, ultrasound systems, system operators, plane wave angles, center frequencies, sampling frequencies, bandwidths, transducer properties, image depths, and imaging targets), with example images provided in the CUBDL Data Guide [14] and replicated in [20]. Although the PICMUS data were advertised and available, the PICMUS data did not contain this wide range of varying parameters. Concerns regarding limited knowledge of either beamforming or deep learning may have also been a barrier to entry. However, the CUBDL participants reported experience levels with beamforming and deep learning that ranged from novice to expert [14], [20].

VI. CONCLUSION
This article summarizes the results of the CUBDL challenge, as well as the detailed evaluation process implemented by the CUBDL organizers and associated insights gained from the evaluation process and challenge results. Evaluation was further analyzed independently, and the associated evaluation code and datasets are newly released with this publication. The open datasets include the CUBDL test data, additional in vivo breast data included after the close of the challenge, and a subset of internationally crowd-sourced data that were not used for evaluation, but may potentially be used for future network training and comparison to the results presented within this article. The complete dataset includes 576 image acquisition sequences and represents the largest known open international database of ultrasound channel data. We are releasing this combination of CUBDL results, evaluation code, and open datatasets to our community in efforts to help standardize and accelerate research at the intersection of ultrasound beamforming and deep learning. We additionally propose the use of the top CUBDL challenge submissions as benchmarks for future work in this research domain, and we share the totality of released resources to enable meaningful comparisons of future methods. He is currently a Professor with the Department of Electrical Engineering and Computer Science, York University, Toronto, ON, Canada. He works in the area of statistical signal processing and communications. His current projects include distributed agent networks (autonomy and consensus in complex and contested environments), medical imaging (ultrasound elastography and brain-computer interfaces), data science (graph signal

APPENDIX
The Appendix includes seven tables describing contributed ultrasound channel data (see Tables IV-X), followed by the 18 ROIs used to calculate the local image quality metrics (see Fig. 8).        Images created with 75 plane wave transmissions showing the ROIs used to measure contrast, CNR, gCNR (where ROI 1 is indicated by the inner circle and ROI 2 is indicated by the concentric outer circles), speckle SNR, and point target FWHM. These images are displayed with 40-dB dynamic range.  Results from the baseline PICMUS dataset [21] of simulated anechoic lesions, displayed with 60-dB dynamic range. Example results from the lesion test set, taken from dataset sequences UFL001 (top) and MYO001 (bottom), displayed with 40-dB dynamic range. The ROIs used for image quality metrics are overlaid on the 75 plane wave images, which served as the ground truth. Example results from the speckle test set, taken from dataset sequences MYO002 (top) and INS016 (bottom), displayed with 40-dB dynamic range. Example results from the point target test set, taken from dataset sequence UFL004, displayed with 40-dB dynamic range. Example results from the image test set, taken from dataset sequences INS008, OSL010, and TSH002, from top to bottom, respectively, displayed with 60-dB dynamic range. The right column shows images created after 31 (TSH002) or 75 (INS008, OSL010) plane wave transmissions.