Critique of: “A Parallel Framework for Constraint-Based Bayesian Network Learning via Markov Blanket Discovery” by the SCC Team from UC San Diego

Bayesian networks (BNs) have become popular in recent years for describing natural phenomena in situations where understanding causal linkages is important. To circumvent the inherent intractability of learning BNs, Srivastava et al. propose a Markov blanket discovery-based approach to learning in their paper titled “A Parallel Framework for Constraint-Based Bayesian Network Learning via Markov Blanket Discovery.” We reproduce both the strong and weak scaling experiments from the paper up to 128 cores, and verify the communication cost scaling for all three algorithms in the paper. We also introduce a methodological improvement to the weak scaling experiments showing that the paper's findings stem from the methodology rather than from the particular datasets used. Slight variations in performance were observed due to differences in datasets, core count, and job scheduling.


INTRODUCTION
In the Supercomputing Student Cluster Competition (SCC), teams of university students compete across a variety of events that challenge their ability to manage supercomputing infrastructure and execute popular HPC workloads. One such event is the Reproducibility Challenge, in which teams attempt to reproduce results from a Technical Program paper from the previous year. The SC20 paper by Srivastava et al. titled “A Parallel Framework for Constraint-Based Bayesian Network Learning via Markov Blanket Discovery” [1] introduces ramBLe, a software package that efficiently learns Bayesian networks (BNs) [2] using the discovery of Markov blankets (MBs) as an intermediate step. Bayesian networks have seen wide applicability in causal inference [3], medicine [4], and natural language processing [3]. The paper claims the method is scalable to datasets with up to tens of thousands of variables.
To achieve this scalability, ramBLe presents three parallel approaches to learning BNs that improve on the standard, but sequential, bnlearn package. Unlike bnlearn, the MB-based approaches can be divided into roughly four phases: grow, shrink, symmetry correction, and construction of the parent-child connections for each node from its Markov blanket (referred to as PC in [1]). For the parallel implementation, the authors developed three approaches: grow-shrink (GS), Incremental Association MB (IAMB), and Interleaved IAMB (Inter-IAMB).
To parallelize the grow phase, the algorithm examines all variable pairs in parallel; this holds for all three algorithms (GS, IAMB, Inter-IAMB). For the IAMB algorithms specifically, a candidate variable is selected for every variable by taking the maximum c-score over the pair. From the paper, the collective communication component of this process costs O((t + m) log p), which we should expect to see reflected in Fig. 2. Next, the shrink phase requires no communication, as it simply removes variables from each Markov blanket by looking through the MB sets. The following phase, symmetry correction, obtains symmetrically consistent MBs via a parallel sort. The complexity of this parallel sort dominates and gives a communication cost of O(t log² p + (m n² / p) log² p). Finally, the PC-from-MB construction phase (Srivastava et al., Section III.B) produces the skeleton of the BN from the MB sets. It incurs no collective communication cost and is not parallelized, since it is the fastest of the four phases.
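To make the grow and shrink phases concrete, the following is a minimal sequential sketch of grow-shrink MB discovery for a single target variable. It is an illustration only: the function and parameter names are ours, not ramBLe's, and `ci_test` is a caller-supplied stand-in for the statistical conditional-independence tests that ramBLe actually performs. The parallel implementation additionally distributes the targets and the independence tests across processes.

```python
def grow_shrink_mb(target, variables, data, ci_test):
    """Sequential sketch of grow-shrink Markov blanket discovery.

    ci_test(x, y, cond, data) should return True when x and y are
    conditionally independent given the set cond; it stands in for
    the statistical tests used by ramBLe.
    """
    mb = []
    # Grow phase: repeatedly add any variable that is still
    # dependent on the target given the current candidate blanket.
    changed = True
    while changed:
        changed = False
        for var in variables:
            if var != target and var not in mb and \
                    not ci_test(target, var, mb, data):
                mb.append(var)
                changed = True
    # Shrink phase: remove false positives that turn out to be
    # independent of the target given the rest of the blanket.
    for var in list(mb):
        rest = [v for v in mb if v != var]
        if ci_test(target, var, rest, data):
            mb.remove(var)
    return mb
```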
In this article, we reproduce most of the results shown in the original paper, indicating that the package does indeed work as intended. We also introduce an improvement to the experimental approach that makes the results slightly more robust to randomness. These results pave the way for the general adoption of ramBLe in a variety of use cases.

EXPERIMENTAL SETUP
All experiments were run on Oracle Cloud hardware provided by the Student Cluster Competition. The bare-metal cluster consisted of a scheduler node, four worker nodes, and a GPU node (not used). The scheduler node used an 8-core Intel(R) Xeon(R) Gold 6354 CPU @ 3.00 GHz, and each compute node used two 18-core Intel(R) Xeon(R) Gold 6354 CPUs @ 3.00 GHz (2 sockets per node). Each node ran Oracle Linux 7.9 with gcc 9.2.0, boost 1.77.0, mvapich2 2.3.6, spack 0.17.0, scons 4.2.0 (for compilation), and slurm 20.11.8 (for scheduling). Two datasets were provided by the Student Cluster Competition: data from patients who tested negative for COVID-19 (C2) and patients who tested positive for COVID-19 and exhibited severe symptoms (C3). The dataset shapes are (30,307 genes, 11,180 observations) and (30,604 genes, 12,909 observations), respectively.

DESCRIPTION OF EXPERIMENTAL RUN
To automate the large number of runs required for the experiments (2 datasets × 3 algorithms × 8 core counts), we used a Python script that submitted a series of Slurm jobs from the scheduler, as sketched below. Using the subprocess module, jobs were submitted starting with the lowest core count and going up to the highest. The justification for this strategy was that the lowest-core-count runs took the longest time in the paper, so those could run in parallel while the higher-core-count runs completed.
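The submission logic amounted to the following sketch. The dataset file names, algorithm identifiers, and ramBLe command-line flags shown here are illustrative placeholders rather than the exact values we used.

```python
import subprocess

CORE_COUNTS = [1, 2, 4, 8, 16, 32, 64, 128]  # lowest first
DATASETS = ["C2.csv", "C3.csv"]              # hypothetical file names
ALGORITHMS = ["gs", "iamb", "inter.iamb"]    # hypothetical identifiers

for cores in CORE_COUNTS:
    for dataset in DATASETS:
        for algo in ALGORITHMS:
            # Each run becomes its own Slurm job; submitting the
            # slow low-core runs first lets them execute while the
            # faster high-core runs wait in the queue.
            subprocess.run(
                ["sbatch",
                 f"--job-name=ramble-{algo}-{cores}",
                 f"--ntasks={cores}",
                 "--wrap",
                 f"mpirun -np {cores} ./ramble -f {dataset} -a {algo}"],
                check=True,
            )
```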
ramBLe was compiled with the TIMER=1 and TEST=0 options to log the time taken in each of the phases. Two values were parsed from the output of each run, the communication cost (mxx) and the time to get the network (end-to-end time until the network file was written), and then concatenated into one CSV file. The plots were created in a Jupyter Notebook with Matplotlib [5].
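A minimal sketch of this post-processing follows. The log directory and the regular expressions are hypothetical; the real lines emitted by the TIMER=1 build were matched analogously.

```python
import csv
import re
from pathlib import Path

rows = []
for log in Path("logs").glob("*.out"):  # hypothetical log directory
    text = log.read_text()
    # Illustrative patterns standing in for the actual timer lines.
    comm = re.search(r"mxx.*?([\d.]+)", text)
    total = re.search(r"[Nn]etwork.*?([\d.]+)", text)
    if comm and total:
        rows.append({"run": log.stem,
                     "comm_time": float(comm.group(1)),
                     "total_time": float(total.group(1))})

# Concatenate every run's measurements into one CSV for plotting.
with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["run", "comm_time", "total_time"])
    writer.writeheader()
    writer.writerows(rows)
```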
Compiling ramBLe was non-trivial, as we wanted to utilize Oracle's RoCE (RDMA over Converged Ethernet) technology to minimize latency and maximize performance. A specific version (20.11.8) of Slurm was required as a dependency of MVAPICH2; otherwise, MVAPICH2 is unable to detect and use the faster RoCE connection between the compute nodes. All the dependencies were managed with Spack.

STRONG SCALING
In the strong scaling experiments, we test each algorithm's speed as the number of available cores increases for a fixed problem size. Fig. 1 plots the strong scaling efficiency and speedup for both datasets on the three algorithms from the paper, based on the results shown in Table 1. C2 and C3 clearly resemble D2 and D3 from the paper, while having almost twice as many genes. Although we were only able to reproduce results up to 128 cores due to hardware constraints, the efficiency and speedup plots still match the original paper's very closely: we observe less-than-linear speedup for grow-shrink and close-to-linear speedup for IAMB and Inter-IAMB. Speedup and efficiency were calculated as

    S = T_1 / T_p,    E = S / p = T_1 / (p * T_p),

where T_1 is the serial execution time, T_p is the parallel execution time, and p is the number of cores.

For the GS algorithm, we achieve a speedup curve similar to the paper's for both C2 and C3. The only difference is that our multi-core runs were not as efficient as theirs. For example, on 8 cores the paper achieved around 7x speedup, whereas we measured 4.97x speedup on the C2 dataset. We hypothesize this is because grow-shrink has the highest communication overhead of the three algorithms (as detailed in the Introduction), compounded by multiple runs sharing the same node (for example, all of the one-core runs running on the same node).
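For reference, the calculation amounts to the following few lines. The timings in the example are illustrative, chosen only to reproduce the 4.97x figure quoted above; they are not measured values.

```python
def speedup_and_efficiency(t1, tp, p):
    """S = T_1 / T_p and E = S / p for one strong-scaling run."""
    s = t1 / tp
    return s, s / p

# Illustrative numbers matching the 8-core C2 grow-shrink comparison.
s, e = speedup_and_efficiency(t1=100.0, tp=100.0 / 4.97, p=8)
print(f"speedup = {s:.2f}, efficiency = {e:.2f}")  # 4.97, 0.62
```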
For the IAMB algorithm, we achieved a better speedup when reproducing the experimental results: nearly perfect linear speedup across the different core counts. This algorithm's efficiency is much better than GS's, which is also reflected in the paper. The only thing to note is that for C3 on 8 cores, we see an efficiency drop that is inconsistent with the general trend.
We observe a similar phenomenon for the Inter-IAMB algorithm. We achieved almost the same speedup as the paper's result, with a small difference in efficiency at 16 cores, which corresponds to the low point in the bottom-right graph of Fig. 1.
It is worth noting that a couple of the runs at higher core counts, specifically runs with the C3 dataset on 64 and 128 cores, did not complete because the program encountered OS/Slurm errors and all the processes were killed. Since the runs with the same parameters on the C2 dataset completed successfully, we believe this to be a memory issue: the combination of the larger C3 dataset, the IAMB algorithm, and a large core count exceeded the memory available across the four nodes in our cluster. This problem may be ameliorated by utilizing a larger cluster.

Fig. 2 shows the communication cost as a percentage of total runtime as a function of cores used. While the cost for IAMB and Inter-IAMB stays low as core counts increase, the cost for GS increases rapidly, for the same reason detailed in the Introduction. This verifies the behavior seen in the paper, although our communication cost for GS increases much more rapidly than the paper's. We hypothesize this, too, is because of the overhead incurred when multiple jobs run simultaneously.

WEAK SCALING
In weak scaling, we vary the problem size in proportion to the processing resources. Using the same core counts as the original paper (up to 128) and running on C2, Fig. 3 shows high weak scaling efficiency across all core counts tested, with a slight drop at the end and with the exception of the 16-core run for GS and the 8-core runs for the IAMB/Inter-IAMB algorithms. This difference could be due to the fact that we chose to randomly sample columns (genes) from the dataset when scaling the problem size commensurate to the number of cores used, whereas the original paper simply took the first n columns, where n = (p / c) * n_c, c is the maximum core count, n_c is the total number of genes in the dataset, and p is the number of cores in the current run. We think this randomness is more methodologically sound, as the genes appear to be independent. Sampling was done without replacement to minimize duplicates that could affect the construction of a BN (see the sketch below). The key finding, however, is that the weak scaling trends are similar to the original paper's results despite randomly sampling genes, which shows that the paper's findings do not depend on the specific genes selected and are indeed due to the parallel optimizations of the three algorithms.
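A minimal sketch of the sampling procedure, assuming NumPy, is shown below; the function name is ours, and n_total plays the role of n_c above.

```python
import numpy as np

def weak_scaling_columns(n_total, p, c, seed=None):
    """Choose the gene columns for a weak-scaling run on p cores.

    n_total: total number of genes in the dataset
    p:       cores in the current run
    c:       maximum core count in the study (128 here)

    The original paper takes the first n = (p / c) * n_total columns;
    we instead sample n columns uniformly, without replacement.
    """
    n = int(n_total * p / c)
    rng = np.random.default_rng(seed)
    return rng.choice(n_total, size=n, replace=False)

# e.g., the 16-core C2 run uses 30307 * 16 / 128 ≈ 3788 random genes
cols = weak_scaling_columns(n_total=30307, p=16, c=128)
```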

CONCLUSION
Overall, we were able to reproduce the scaling and efficiency results from the paper well. Our findings indicate that parallel Markov blanket discovery is a promising approach to learning Bayesian networks in a parallel manner. We were not able to test core counts above 128 due to hardware constraints, and we noticed some communication overhead likely caused by running many jobs simultaneously. For weak scaling, random selection of genes yielded results similar to the original paper's. Future efforts would include running the application on a larger system and benchmarking its scaling and performance there.
Arunav Gupta is a 4th-year student working toward the data science degree with UC San Diego. His research interests include robustness in machine learning and Markov chain Monte Carlo methods. Previously, he was with AWS, working on the SageMaker machine learning platform.
John Ge is currently working toward the computer science degree with UC San Diego. During the SCC21 competition, he worked on the Cardioid and ramBLe applications. His research interests lie in networking and cybersecurity.
John Li is a 4th-year undergraduate working toward the degree in mathematics and computer science with UCSD. His current research is on resilient Bayesian hyperparameter optimization on distributed systems. Upon graduation, he plans to work in industry as a software engineer on Azure blob storage.

Paul Rodriguez received the PhD degree in cognitive science and has worked in neural network modeling, dynamical systems simulations, time series analysis, and statistical methods for analysis and prediction in fMRI data. He is a computational data scientist. His current interests involve scalable machine learning and deep learning.
Mahidar Tatineni received the MS and PhD degrees in aerospace engineering from UCLA. He currently leads the User Services group with SDSC and has completed many optimization and parallelization projects on supercomputing resources, including Gordon.
Mary Thomas received the MS degree in computer science and physics, and the PhD degree in computational science. She is a member of the Data Enabled Scientific Computing (DESC) division. Her research interests include HPC computing and training; optical physics; coastal ocean modeling; and cyberinfrastructure and emerging technologies, including AI, Jupyter notebooks, and interactive and cloud computing.
Santosh Bhatt received the master's degree in computer science and applied mathematics. He is a master principal enterprise cloud architect and a team lead with Oracle Corporation. He works mainly with higher education institutions in resolving their complex computational problems. His expertise includes database management systems, high performance computing, database security, and cloud infrastructure (IaaS & PaaS).