Reproducibility of the First Image of a Black Hole in the Galaxy M87 from the Event Horizon Telescope (EHT) Collaboration

This paper presents an interdisciplinary effort aiming to develop and share sustainable knowledge necessary to analyze, understand, and use published scientific results to advance reproducibility in multi-messenger astrophysics. Specifically, we target the breakthrough work associated with the generation of the first image of a black hole, the black hole at the center of the galaxy M87. The image was computed by the Event Horizon Telescope (EHT) Collaboration. Based on the artifacts made available by EHT, we deliver documentation, code, and a computational environment to reproduce the first image of a black hole. Our deliverables support new discovery in multi-messenger astrophysics by providing all the necessary tools for generalizing methods and findings from the EHT use case. Challenges encountered during the reproduction of the EHT results are reported. The result of our effort is an open-source, containerized software package that enables the public to reproduce the first image of a black hole in the galaxy M87.


Introduction
Developing reproducible analyses is a challenging aspect of scientific research. Few real-world studies have been performed to provide guidance on the necessary processes and products, especially in domains relying on scientific computing, where reproducibility is limited by the availability of data, software, platforms, and documentation. Consequently, despite a group's best efforts, other scientists attempting to reproduce an analysis may find that the necessary information is incomplete.
We present an interdisciplinary effort to develop and share sustainable knowledge necessary to understand, reproduce, and reuse the published scientific results of the Event Horizon Telescope (EHT) project's analysis of the black hole in the center of the M87 galaxy [1]. Unlike our previous reproduction of Advanced LIGO's observations [2], none of the authors of this paper was involved in the original EHT analysis. Thus, our work builds exclusively on the several papers describing the EHT project workflow [3], data [4], [5], and software [6] that are available online. Each EHT paper presents specific aspects of the scientific discovery, but a comprehensive approach, including the documentation, software, and environment needed to reproduce the published results of the EHT project, is still missing. To this end, this paper follows rigorous reproducibility directions and expands preliminary work presented in a poster [7].
As part of our contributions, we investigate the availability and integrity of the data used to recreate the images of the M87 black hole. We model the image processing workflow and study its limitations in terms of software availability, dependencies, configuration, portability, and documentation. We rebuild the workflow's software stack to reproduce the published images; we use the software stack for our analysis of discrepancies between original and reproduced results. We document each step in this process, starting from a systematic assessment of the availability of data, software, and documentation. We deliver a collection of fully documented containers for data validation and image reconstruction. Finally, we compile guidelines to increase the reproducibility of computational workflows in scientific projects.
Our work enhances the reproducibility and reach of scientific projects like the EHT project, and facilitates the engagement of the overall scientific community, including postdocs and students, regardless of the domain.

M87 Event Horizon Telescope (EHT)
The EHT project uses Very Long Baseline Interferometry (VLBI) to link together eight radio telescopes around the world to study the immediate environment of a black hole with angular resolution comparable to the size of the black hole itself. In April 2019, the EHT Collaboration published measurements of the properties of the central radio source in M87 [8], including the first direct image of a black hole. The results, which received worldwide attention, revealed for the first time a bright ring formed as light bends in the intense gravity around the black hole in the galaxy M87. The black hole is 6.5 billion times more massive than the Sun.
The EHT project provides links to their calibrated data [5] published in the CyVerse Data Commons, a publication describing their data processing and calibration [4], a link to the software used in their imaging workflow [6], and a publication describing the imaging workflow [3]. The EHT Collaboration released both data products and software, hosting them on third-party repositories. This is a common approach for many NSF-funded projects ranging in size from individual investigators to international collaborations.

Characterization of the EHT Workflow
The EHT workflow comprises three key components: data collection, data processing, and image building (see Figure 1).
Data Collection. Eight telescopes in the EHT network collect radio interferometry data on days and at times with permissible weather conditions at all sites, allowing the gathering of data from multiple angles and effectively turning the Earth into one single giant telescope. The EHT data used for the generation of the first M87 black hole images consists of spatiotemporal data of visibility amplitudes collected over five days in 2017 (i.e., April 5, 6, 7, 10, and 11). For each day, the collected raw data contains both high and low telescope frequencies.

Figure 1: High-level overview of the EHT project with its eight telescopes collecting radio interferometry data, its three workflow components including three pipelines for image building, and an image of the M87 black hole extracted from Figure 3 in [8].
Raw Data Processing. The raw data is first pieced together using the Earth's geometry and a clock/delay model to obtain a common time reference, and the pairwise correlation coefficients are computed. Then, the data is reduced to a manageable size for use in source imaging and model fitting: the data is fringe-fitted, calibrated a priori, and network calibrated. Fringe-fitting is performed using the EHT-HOPS Pipeline for Millimeter VLBI Data Reduction [9]. The data undergoes a priori calibration and network calibration in the post-processing stage of the EHT-HOPS pipeline to create .uvfits files [4]. The processed data is stored in the First M87 EHT Results [5] data repository in .csv, .txt, and .uvfits formats and is available to the community. We use this processed data for our analysis because, at the time of this reproducibility study, the raw data is not open-access and the processing scripts are not open-source.
Image Building. To reduce biases and increase trust in results, the EHT Collaboration uses three independently-designed pipelines to generate the black hole images: the Difmap M87 Stokes I Imaging Pipeline (DIFMAP) [10], the EHT-Imaging M87 Stokes I Imaging Pipeline (EHT-Imaging) [11], and the Sparse Modeling Imaging Library for Interferometry (SMILI) [12]. Each pipeline is based on different methods, algorithms, and software libraries but uses the same input data. While the code for each individual pipeline is available as open-source software, the repositories do not contain all of the scripts for image post-processing and generation. Providing documentation for scientific software is challenging, and we find that documentation for packaging, installing, and running the pipelines can be incomplete or unavailable for certain parts of the analysis. Table 1 lists the available, unavailable, and incomplete data, scripts, code, and documentation used by the EHT workflow and shared with the community before our reproducibility study. To succeed in our effort, we generated and made available the missing components.

Validating the Data Integrity
A key aspect of any work reproducing scientific results is the validation of data integrity: the data used for the generation of the original EHT images should match the data made available to the community. Data integrity is often considered secondary but can compromise any reproducibility effort, as previously demonstrated by the authors in [2]. Figure 1 in [3] characterizes the original data in terms of telescope baselines (i.e., u-v coverages). Scripts to compare the properties of the original data with the available data were not provided. We generated the missing Python scripts and integrated them into a Jupyter notebook using standard Python modules such as matplotlib, pandas, and numpy. Figure 2 shows the comparison between the properties of the original data used in Figure 1 in [3] (set of sub-figures in Fig. 2(a)) and the reproduced properties using the available data (set of sub-figures in Fig. 2(b)). The top left plots in the two sets represent the intra-site EHT interferometer baselines (short baselines). The top right plots represent the aggregate baseline coverage of the EHT array for all four days observed. The bottom plots show the short and long baseline coverage observed by each telescope set at high and low frequencies each day. Qualitatively, we can assess the integrity of the data that we input to the three pipelines (i.e., DIFMAP, EHT-Imaging, and SMILI). The only difference is the incomplete left plot in Fig. 2(b), due to the fact that our analysis is based on the available processed EHT data without the unavailable intra-ALMA data from the Atacama Large Millimeter/submillimeter Array (ALMA). This external dataset is not included in the EHT Data Products; based on communications with the EHT Collaboration, this data is not needed for the pipelines to reproduce the black hole images.

Table 1: Availability of data, scripts, code, and documentation before our reproducibility study. Available and incomplete components are linked to the paper presenting them; missing components are marked as unavailable.

Data: raw data (Unavailable); processed data (Available [5]).
Scripts: raw data processing (Unavailable); processed data validation (Unavailable); image post-processing (Unavailable).
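The baseline check in our notebook reduces to computing |(u, v)| for each visibility and separating short (intra-site) from long baselines before plotting. The sketch below illustrates the idea; the column names and the short-baseline cutoff are illustrative assumptions, not the schema of the EHT data release.

```python
import math

def baseline_lengths(rows):
    """Return the baseline length |(u, v)| for each visibility record."""
    return [math.hypot(float(r["u"]), float(r["v"])) for r in rows]

def split_coverage(rows, short_cutoff=1e8):
    """Separate intra-site (short) baselines from long ones.

    The cutoff is a hypothetical value in the units of the u, v columns.
    """
    lengths = baseline_lengths(rows)
    short = [b for b in lengths if b < short_cutoff]
    long_ = [b for b in lengths if b >= short_cutoff]
    return short, long_

# toy visibilities: one intra-site pair and two long baselines
rows = [{"u": "2.0e6", "v": "1.0e6"},
        {"u": "3.1e9", "v": "0.5e9"},
        {"u": "-2.2e9", "v": "4.0e9"}]
short, long_ = split_coverage(rows)
print(len(short), len(long_))  # → 1 2
```

In the actual notebook the two groups feed the separate short- and long-baseline panels of Figure 2.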

Rebuilding the EHT Software Stack
The three EHT pipelines that are part of image building can be modeled in terms of their functional modules (Figures 3(a), 4(a), and 5(a)). Each pipeline comprises a parameter definition module for users to establish workflow-specific behavior, as well as data preparation and data pre-calibration modules to pre-process the input files that are fed to the core of each pipeline. A module performing the image reconstruction cycles runs the image reconstruction algorithm; note that each pipeline uses a different number of cycles. The output of each pipeline includes a final image and statistics module that is used for qualitative and quantitative analysis of the reconstructed results, respectively. In SMILI, the first two modules are inverted, and an image evaluation module for data visualization is available at the end of the pipeline.
Although the three pipelines share similar high-level steps, each has its own set of auxiliary steps, dependencies, and implementation. Figures 3(b), 4(b), and 5(b) show the dependencies and software components of each pipeline in relation to its functional modules.

DIFMAP (Figure 3) is written in C and uses the CLEAN algorithm for image reconstruction, involving iterative deconvolution paired with a technique called "difference mapping." EHT's DIFMAP script takes a file containing observation data, a mask (set of cleaning windows) file that defines areas of interest for the algorithm to iterate upon, and five command-line arguments, all of which are provided in the EHT repository [6]. After loading the observation file, the script initializes values, reads the file specifying the mask, and begins the pre-calibration phase, which involves its first cleaning and phase self-calibration. Afterwards, the image undergoes twenty rounds of amplitude self-calibrations and cleanings; this is when image reconstruction occurs.

EHT-Imaging (Figure 4) uses the Regularized Maximum Likelihood (RML) method of image reconstruction and relies heavily on the eht-imaging Python module (EHTIM) to complete its processes. The EHTIM module defines numerous classes for loading, simulating, and manipulating VLBI data. By leveraging these classes, the EHT-Imaging workflow loads both the low- and high-band data files of a single day's observations into a data object and performs various data preparation and pre-calibration steps. The workflow then moves to an imaging cycle with four iterations, where each successive iteration relies directly on the image generated in the previous iteration. After four iterations, the final image is output. The pipeline also allows for optional outputs, including the final image and an image summary file containing various imaging parameters and data related to the imaging process.
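The iterative deconvolution at the heart of CLEAN can be illustrated with a toy one-dimensional example: repeatedly locate the brightest residual point and subtract a scaled copy of the beam centered there. This is a pedagogical sketch, not the DIFMAP implementation; the beam, gain, and threshold values are made up.

```python
def clean_1d(dirty, beam, gain=0.1, n_iter=200, threshold=1e-3):
    """Toy 1-D CLEAN loop: subtract scaled, shifted beams from the dirty signal.

    dirty: list of floats (dirty image); beam: list centered at len(beam)//2;
    gain: loop gain. Returns (clean components, final residual).
    """
    residual = list(dirty)
    components = [0.0] * len(dirty)
    half = len(beam) // 2
    for _ in range(n_iter):
        # locate the current peak of the residual
        peak = max(range(len(residual)), key=lambda i: abs(residual[i]))
        if abs(residual[peak]) < threshold:
            break
        frac = gain * residual[peak]
        components[peak] += frac
        # subtract the scaled beam centered on the peak
        for j, b in enumerate(beam):
            k = peak + j - half
            if 0 <= k < len(residual):
                residual[k] -= frac * b
    return components, residual

# a single point source of flux 2.0 at index 5, convolved with a simple beam
beam = [0.25, 0.5, 1.0, 0.5, 0.25]
dirty = [0.0] * 11
for j, b in enumerate(beam):
    dirty[5 + j - 2] += 2.0 * b
components, residual = clean_1d(dirty, beam)
print(round(sum(components), 3))  # total recovered flux, close to 2.0
```

In the real pipeline the analogous loop runs in two dimensions, restricted to the mask windows, and is interleaved with the self-calibration rounds described above.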
SMILI (Figure 5) is also written in Python and, like EHT-Imaging, uses RML. Prior to imaging, SMILI also uses the EHTIM module so that its data sets are pre-calibrated consistently with the other workflows. After the pre-calibration stage, the software generates data tables that are used for the final imaging process. Reconstruction of an image begins with a circular Gaussian, with successive iterations relying on the image generated in the previous iteration. There are four stages of iterations, with each stage performing three imaging cycles. Once completed, the software outputs the final image and packages the input, pre-calibrated, and self-calibrated data files for traceability.
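The staged iteration pattern shared by the pipelines, where each cycle starts from the image produced by the previous one, can be sketched schematically. The "refine" step below is a placeholder that nudges the current image toward a fixed target, standing in for an RML solve; it is not the SMILI algorithm, and all values are illustrative.

```python
def refine(image, target, step=0.5):
    """Placeholder refinement: move the image a fraction of the way to a target."""
    return [x + step * (t - x) for x, t in zip(image, target)]

def staged_imaging(initial, target, stages=4, cycles=3):
    """Run four stages of three cycles, each cycle starting from the last image."""
    image = list(initial)
    for _ in range(stages):
        for _ in range(cycles):
            image = refine(image, target)
    return image

initial = [1.0, 1.0, 1.0]   # stand-in for the circular Gaussian start
target = [0.0, 2.0, 0.0]    # stand-in for the data-constrained solution
final = staged_imaging(initial, target)
print([round(x, 4) for x in final])  # → [0.0002, 1.9998, 0.0002]
```

The point of the structure is that the result of every cycle is the initialization of the next, which is why the pipelines are sensitive to the starting image and iteration counts.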
Note that each pipeline has its own GitHub repository [10], [11], [12]. Compiling each pipeline's original code from the three EHT repositories resulted in several errors. For example, on a Power9 system we were missing dependencies and had to remove optimization compilation flags from the installation script to generate the executable code successfully. In general, none of the three pipeline codes includes a comprehensive list of the required software dependencies and libraries or their versions. We solved dependencies manually by editing problematic scripts; we used Spack, Anaconda, and Pip to install the latest stable version of each necessary library. Once the compilation completed successfully, we experienced runtime errors with EHT-Imaging and SMILI, which we solved by correcting syntax issues in parts of the Python code. We could not find documentation on how to transform the grayscale output of DIFMAP and SMILI into the colored and formatted images of Figure 11 in [3]. We solved this issue by utilizing the EHTIM module for post-processing of the grayscale output. In the process of rebuilding the EHT software stack, we documented the software packages used, their dependencies, the compilation requirements, and the execution processes for all three pipelines, completing the unavailable or incomplete components in Table 1.

Packaging and Distribution
To support the portability of the EHT workflow across different platforms, we created a collection of four Docker containers that allows users to reproduce two key results from the EHT project: the characterization of the available data (i.e., Figure 1 in [3]) and the final EHT images of the black hole (Figure 11 in [3]). The first container hosts the entire environment to reproduce the validation of the data integrity; it includes the data tarball from the EHT Data Products page along with our Bash, Python, and Docker scripts. We developed these scripts to automate the installation and configuration of the environment in an easily accessible and portable way. So that users can fully trust the validation of the data integrity, we incorporated a spare tarball within the container; users can run the md5sum program on it and compare the result with the md5sum of the data from the EHT Data Products page. If both checksums match, users know that the data on the Data Products page has not been modified in any way, and they can proceed with the validation by running the Python scripts to reproduce the images of the black hole. The other three containers are used to reproduce the final EHT images of the black hole. Each of the EHT pipelines is packaged into an independent container that automates its installation, dependency setup, environment configuration, and execution. The containers include our own scripts and auxiliary files for conducting the image post-processing steps, which are not available in the original EHT repository.
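The integrity check shipped in the data-validation container is essentially a checksum comparison. A minimal Python equivalent of the md5sum step is sketched below; the file names in the commented usage are hypothetical, not those of the EHT release.

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, streaming it in 1 MiB chunks
    so that large data tarballs do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# hypothetical usage: compare the bundled reference tarball against a
# freshly downloaded copy from the EHT Data Products page
# assert md5sum("reference.tgz") == md5sum("downloaded.tgz"), "data modified"
```

If the two digests differ, the downloaded data should not be fed to the pipelines until the discrepancy is explained.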
All four containers are publicly available on Docker Hub 1 . Additional documentation for deploying and using these containers is available on GitHub 2 , along with the scripts to generate the figures reproduced in this paper. These materials augment existing containers in the EHT Docker Hub 3 and the EHT repositories [6].

Reproducing EHT Images
We tested the containerized pipelines both on commodity hardware (a laptop with an Intel CPU) and on a Power9 cluster at the University of Tennessee, Knoxville. Figure 6 compares our results: Figure 6a shows the original images from Figure 11 in [3] and Figure 6b shows our reproduced images using the containerized pipelines. The two figures show that we can reproduce the M87 images for all three pipelines.
The images in Figure 6 provide a qualitative comparison. Both sets of images look visually similar in terms of shape and brightness, and the similarity is consistent across pipelines. To perform a quantitative analysis, we compare the "closure" quantities reported in Table 5 in [3] with those reported by our executions of the three pipelines. For each day and each pipeline, we compare both the χ²_CP (closure phase) and χ²_logCA (log closure amplitude) quantities computed across the top set of parameters [3]. For brevity, we only report the values with 0% systematic uncertainty. We observe consistency between the two sets of results, though without perfect agreement, for the EHT-Imaging and SMILI pipelines. We find a larger difference between the original and reproduced values for the DIFMAP pipeline; this is consistent with the discussion of the different time averaging used in DIFMAP.
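The closure-quantity comparison reduces to a per-data-point chi-squared between measured and model closure values. A minimal sketch follows, with made-up numbers rather than EHT values; the real computation also folds in the systematic uncertainty budget discussed in [3].

```python
def reduced_chi2(measured, model, sigma):
    """Chi-squared per data point for closure phases or log closure amplitudes."""
    n = len(measured)
    return sum(((m - mo) / s) ** 2
               for m, mo, s in zip(measured, model, sigma)) / n

# illustrative closure values (degrees), model predictions, and 1-sigma errors
measured = [10.0, -3.0, 7.5]
model = [9.0, -2.5, 8.0]
sigma = [1.0, 0.5, 1.0]
print(round(reduced_chi2(measured, model, sigma), 3))  # → 0.75
```

Comparing such values between the original Table 5 and our runs gives the quantitative agreement measure reported in Table 2.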

Lessons Learned and Guidelines
We compile lessons learned and guidelines to support the reproducibility of scientific projects based on our experience and observations reproducing the M87 black hole images from the EHT project.
Data Availability. The unavailability of the raw data made direct validation of the pipeline input data unfeasible. As a proxy for data validation, we reproduced Figure 1 in [3], as this figure captures properties of the data such as telescope frequency and baseline coverage. We were able to reproduce most of these properties except for the intra-site EHT interferometer baselines (short baselines), because the intra-ALMA data is not available. While this was not the case in this study, any incomplete or missing dataset may result in the user's inability to fully verify the data integrity, and can threaten the entire reproducibility process. Data size or ownership constraints can be an obstacle to making raw data available to the public. Under these circumstances, data integrity mechanisms such as hashes ensure the correctness of processed data when releasing the raw data is not feasible. We added a service that runs an MD5 integrity check on the pipeline input data as part of our EHT container set to facilitate data integrity validation.

Table 2: Closure quantity χ² values and statistics for top set images with 0% systematic uncertainty. We compute the difference δ between the Top Set values in Table 5 in [3] and our reproduced values. None of the values agree exactly, but our values are consistent with the spread reported in [3] for the EHT-Imaging and SMILI pipelines.
Software Availability. Several pieces of software were unavailable at different stages of the EHT workflow and for the three pipelines. The raw data and the corresponding software to process the data are not available, and neither are the scripts to run the data validation. We developed those scripts for data validation purposes. The code for running the three pipelines is completely available, but the image post-processing scripts are not, which forced us to experiment with different settings in order to obtain results comparable to the original for each pipeline. Finally, the plotting libraries used to reproduce the results in Figure 1 from [3] were insufficiently defined. Thus, we manually tuned our plotting scripts to obtain a suitable plotting configuration. The qualitative differences between the original and reproduced images can be the result of our manual tuning, which indicates that sharing just the core code of the three pipelines is not sufficient to reproduce the original images of the black hole in the galaxy M87. To support portability across platforms, we generated four containers that allow users to execute the data integrity validation and the original pipeline codes. We also enable execution of the end-to-end workflow by providing all auxiliary materials for the image post-processing, figure generation, and result analysis.
Documentation Availability. In general, there is insufficient documentation on how to package, install, and execute the EHT pipelines, as well as on how to perform both qualitative and quantitative analyses of the results. This hinders the overall reproducibility effort. For instance, documentation is key to reproducing Table 5 in [3]. Regarding the pipelines, there is insufficient information about the software dependencies and versions used, as well as file locations and their use. Documenting the whole EHT workflow beyond its image reconstruction components is crucial for the successful reproducibility of the results. Our documentation covering the configuration and use of the entire EHT workflow was instrumental to the success of our reproducibility study.
Software Packaging. The incomplete documentation resulted in installation, dependency, and portability challenges. We manually edited dependencies to allow installation and compilation, and had to override installation instructions that were resulting in unstable environments. We found that by containerizing the workflow we can hide these challenges from the end user, simplifying the installation and deployment of the EHT workflow.
Methods Description. Incomplete descriptions of the results analysis process (e.g., the data averaging time or the additional systematic error budget added to the uncertainties) complicate the reproduction of the χ² statistics in Table 5 in [3], as errors add up quickly. Conducting an adequate quantitative assessment of the final results becomes very challenging under these circumstances. Members of the EHT Collaboration highlighted in conversations with the authors of this paper that a qualitative comparison of the images is more interpretable.
Access to Final Results. The authors of [3] did not release the output fiducial images, and therefore we did not have a fixed reference for direct comparison between our reproduced images and the original ones. This, in addition to partial access to the data and the incomplete description of the methods used, prevented us from conducting a complete validation of the reproduced images.
Access to Distributed Knowledge. The EHT Collaboration made substantial investments to allow independent users to qualitatively and quantitatively reproduce their results and to ensure the robustness of the EHT project. Nonetheless, we found it challenging to reproduce the original results without direct knowledge of the methods and analyses, or direct collaboration with the authors of the studies. Our experience illustrates the general challenges that users external to a project face when gathering knowledge on data, code, and documentation originally generated by multiple teams in a distributed fashion. The effort of the EHT Collaboration to remove biases by designing and deploying three completely separate pipelines, while instrumental for the trustworthiness of the project results, is also an obstacle to the project's reproducibility.

Conclusions
In this paper we share our experience reproducing the black hole images from the EHT project, and report new guidance and practices for building reproducible scientific research. Our work complements the work of the EHT Collaboration with supplemental data, scripts, documentation, and a set of containers. Postdocs, graduate and undergraduate students, and even high school students can benefit from accessing our data and code, using our documentation to reproduce and learn about the EHT findings, and ultimately getting involved in STEM research. Our guidance and practices can be adopted more broadly by other scientific workflows. The EHT project continues to be a leader in reproducibility efforts and has provided comprehensive data products for its recent observations of Sgr A*.
Assessing the level of detail required to cover the vast knowledge developed in a project the size of EHT is a complex task. Finding the balance between the effort required from original research teams to enable reproducibility and that required from users attempting to reproduce the results is still an open question. Our experience with the EHT and LIGO projects reveals an important and recurring issue in reproducibility: challenges remain in disseminating findings in a way that allows reproduction of results without direct interaction with the original team that produced them.