SECTION I

## INTRODUCTION

THE exponential growth of data-intensive applications and the necessity for complex and massive data analysis have elevated modern large-scale parallel computing technology and demand. Future High-Performance Computing (HPC) systems will go through a rapid evolution of node architectures, as power and cooling constraints limit further increases in microprocessor clock speeds. Consequently, computer architects are significantly increasing on-chip parallelism to keep up with the demand for fast performance and high-volume data processing. As a result of this hardware paradigm shift, multiple cores on a chip are no longer cutting-edge technology: in the Top 500 supercomputer list published in March 2011, more than 99% of supercomputers use multi-core processors [33]. As hardware has evolved, software applications must adapt and gain the capability to effectively run multiple tasks simultaneously through parallel methods. It is therefore of critical importance to provide an accurate estimate of an application's performance on a massively parallel system, both for predicting the most effective design of a multi-core large-scale architecture and for optimizing and fine-tuning the software application to execute efficiently in such a highly concurrent environment.

A key element of the strategy as we move forward is the co-design of applications, architectures, and programming environments to navigate the increasingly daunting constraint space of feasible exascale system design. The complexity of designing large-scale computer systems has motivated the development and utilization of a large number of large-scale system simulators [14], [29], [34]. There is a pressing need to develop deep code analysis and system simulation platforms that insert application developers directly into the design process for HPC systems in the exascale era. It is of significant importance to build simulation platforms that accurately emulate the hardware architectures of the next decade and their design constraints. This will enable computer scientists to engage early in the design and utilization of effective programming models.

Three well-known approaches have been investigated for estimating large-scale performance. The most common approach is direct execution of the full application on the target system [25], [26], [36]. This simulation approach uses virtual time, unlike normal benchmarking, which uses real time. Here, performance is modeled by using a processor model and communication work in addition to simulated time for a modeled network. Another approach is tracing the program in order to collect information about how it communicates and executes [36]. The resulting trace file contains computation time and actual network traffic. Tracing provides high levels of evaluation accuracy, but cannot easily be scaled to a different number of processors. A third approach is to implement a model skeleton program, a simple curtailed version of the full application that provides enough information to simulate realistic activity [1], [31]. This approach has the advantage that the bulk of the computational complexity can be replaced by simple calls with statistical timing information. What makes this approach challenging is the necessity to develop a model skeleton program based on a complex scientific HPC application that often includes a large number of HPC computational methods and libraries, sophisticated communication and synchronization patterns, and architecture-specific optimizations. Moreover, it is difficult to analyze and predict the runtime statistics for domain-specific applications using heuristic algorithms. The skeleton application provides a powerful method for evaluating scalability and efficiency over various architectures of moderate or extreme scale. For example, by running skeleton applications, the Structural Simulation Toolkit's macroscale simulator (SST/macro) [14], [27] has been able to model application performance at levels of parallelism that are not obtainable on any known existing HPC system.

In this work we present the design and application of a discrete event simulation-based framework for analyzing the scalability and performance of a number of optimizations of mpiBLAST. mpiBLAST [9] is an open-source parallel implementation of the National Center for Biotechnology Information's (NCBI) Basic Local Alignment Search Tool (BLAST) [3]. BLAST is the most widely used genomic sequence alignment algorithm. Though a heuristic method is employed to improve computational efficiency, computation time remains prohibitive because of the rapid growth of sequence data. A parallel version of BLAST, mpiBLAST, uses a database segmentation approach. The design of mpiBLAST has been revised a number of times to better address the challenges of distributed result processing [18] and hierarchical architectures [32], and to include further dynamic load balancing and I/O optimizations [17], [19]. Our simulation-based framework allows programmers to better address the challenges of executing genomic sequence alignment algorithms on many-core architectures and at the same time gain important insights into the effectiveness of the mentioned mpiBLAST optimization techniques. This allows scientists, library developers, and hardware architecture designers to evaluate the scalability and performance of a data-intensive application on a wide variety of multi-core architectures, ranging from a regular cluster machine to a future many-core petascale supercomputer. Our approach can help in several ways; it allows developers to:

• enhance the evolution of the software application by performing further architecture-specific optimizations to meet the challenges of the communication and synchronization bottlenecks of the new multiprocessor architectures,
• adapt the hardware set-up to better facilitate the computational and communication patterns of the application,
• evaluate the effectiveness and associated trade-offs of any future co-design evolution of the application software and the hardware platform.

In this paper, we present the application of SST/macro, an event-driven macroscale simulator, for estimating and predicting the performance of large-scale parallel bioinformatics applications based on mpiBLAST. SST/macro was recently developed and released by the Scalable Computing R&D Department at Sandia National Labs and is a fully component-based, open-source project that is freely available to the research and academic community [27]. We demonstrate the use of SST/macro and its trace-driven simulation based on DUMPI [14], a custom-built MPI tracing library developed as a part of the SST/macro simulator. We also present a methodology for constructing SST/macro skeleton programs based on mpiBLAST.

The rest of this work is organized as follows: Section II introduces the event-driven SST/macro simulator that is at the core of our simulation framework, Section III discusses the mpiBLAST algorithm for parallel genome sequence matching and the possible optimizations of mpiBLAST, Section IV presents in detail the methodology for collecting DUMPI trace files and our approach for implementing mpiBLAST-based skeleton models as well as our experimental set-up and results, and Section V concludes this paper.

SECTION II

## EVENT-DRIVEN MACROSCALE SIMULATION

We use the macroscale component of the Structural Simulation Toolkit [2], [27], which permits coarse-grained study of data-intensive parallel applications, to support architecture simulation. SST/macro is a fully component-based profiling and architectural simulation tool that is freely available [2], [27]. SST/macro has a modular structure implemented in C++ [30] and allows flexible addition of new components and modifications. Its simulation capacity can play a crucial role in the effective design and implementation of large-scale data-intensive applications on future multi-core hardware platforms. Such platforms could include a wide variety of features, including a heterogeneous design of CPUs, GPUs, and possibly FPGAs.

Fig. 1 provides an overview of the design of the SST/macro simulator. The simulator makes use of lightweight application threads, allowing it to maintain simultaneous task counts ranging into the millions. SST/macro supports two execution modes: trace-driven simulation mode and skeleton model-driven execution.

Fig. 1. SST/macro simulation framework.

### A. Trace-Driven Simulation

In the trace-driven simulation execution, an application is executed and profiled in order to extract a wealth of information about its execution pattern. The trace-driven simulation can provide our infrastructure with the following information: average instruction mix, memory access patterns, communication mechanisms and bottlenecks, and the network utilization on a per-link basis. SST/macro supports the following two trace file formats, both of which record execution information by linking the target application with a library that uses the PMPI [22] interface to intercept MPI calls.

• Open Trace Format (OTF) [16]: OTF is a trace format designed for use with large-scale parallel platforms. OTF has three main targets: openness, flexibility, and performance.
• DUMPI [14], [15]: a custom trace format that we designed and distribute as a part of the SST/macro simulator. DUMPI's goal is to record more detailed information than OTF, including the full signature of all MPI-1 and MPI-2 calls. In addition, DUMPI trace files store information regarding return values of MPI requests, which allows error checking and MPI operation matching. DUMPI files also provide hardware performance counter information using the Performance Application Programming Interface (PAPI) [14], which allows information such as cache misses and floating point operations to be logged.

The main advantage of trace-driven simulation is accuracy, especially when the planned runtime system is known in detail. However, a main difficulty is that it requires execution of the actual application, which is often data-intensive and of high computational complexity. Moreover, trace-file simulation cannot predict performance on future hardware platforms, as the generated trace files are specific to the execution environment.

### B. Skeleton Application Simulation

Skeleton applications are simplified models of actual HPC programs with enough communication and computation information to simulate the application's behavior. One method of implementing a skeleton application is to replace portions of the code performing computations with system calls that instruct the simulator to account for the time implicitly. Since the performance models can be embedded in the skeleton application and real calculations are not performed, the simulator incurs significantly less computational cost than simulating the entire system. Skeleton application simulation can also evaluate efficiency and scalability at extremely different scales, which provides a powerful option for performance prediction of extreme-scale systems that do not yet exist. Though driving the simulator with a skeleton application is a powerful approach for evaluating the application's scalability and efficiency, it requires extensive effort for programmers to implement the skeleton models for a large-scale parallel program. The effort is justified by the difficulty of predicting computation time for complex applications such as mpiBLAST. mpiBLAST search time varies greatly even for the same database and query size, because the computation time depends on the number of positive matches found for a query. Match location can also affect the execution time.

Fig. 2 shows the implementation of an MPI ping-pong skeleton application in which pairwise ranks communicate with each other. As shown in Fig. 2, skeleton application implementations for SST/macro are very similar to the native MPI implementation with the exception of the syntax of the MPI calls. In addition to replacing the communication calls, we can replace computation parts with system calls such as compute(…), which reduces simulation time dramatically.
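The control flow of such a ping-pong skeleton can be sketched as follows. This is a Python sketch with a hypothetical `RecordingMPI` stand-in for the simulator's MPI interface; actual SST/macro skeletons are written in C++ against the simulator's own MPI calls.

```python
class RecordingMPI:
    """Hypothetical stand-in for the simulator's MPI interface; it only
    records the calls a skeleton would hand to the simulator."""
    def __init__(self):
        self.log = []

    def send(self, dst, nbytes):
        self.log.append(("send", dst, nbytes))

    def recv(self, src, nbytes):
        self.log.append(("recv", src, nbytes))

def ping_pong(mpi, rank, iterations, nbytes):
    """Core loop of a ping-pong skeleton: pairwise ranks 0<->1, 2<->3, ..."""
    partner = rank ^ 1
    for _ in range(iterations):
        if rank % 2 == 0:            # even rank sends first
            mpi.send(partner, nbytes)
            mpi.recv(partner, nbytes)
        else:                        # odd rank echoes the message back
            mpi.recv(partner, nbytes)
            mpi.send(partner, nbytes)
```

Because no payload is actually moved, the skeleton's cost is just the bookkeeping of the communication events, which is what lets the simulator scale to very large rank counts.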

Fig. 2. Core execution loop of the MPI ping-pong skeleton application.

### C. Communication Models

The purpose of SST/macro's communication component is to study the complex interaction of the various software components and the network. Recent growth of large-scale systems has made evaluation of communication loads across complex networks vital. SST/macro is capable of simulating and evaluating advanced network workloads with diverse topologies and routing schemes. A simple processor model is added to provide timings for processor workload and data movement within each node. The simulator currently supports torus, fat-tree, hypercube, Clos, and gamma topologies [8]. Moreover, general network frameworks can be evaluated with network parameters such as bandwidth and latency, and the modularity of the simulator makes defining new topologies and routing protocols easy. These components enable us to investigate interconnect design options such as choice and tuning of topologies (high-dimensional meshes, fit-trees as opposed to fat-trees); routing algorithms (wormhole vs. dispersive routing, oblivious vs. adaptive routing); and system parameter choices (e.g., router buffer sizes, bandwidths, and latencies). At the same time, we will be able to quantify the benefits of new algorithms and algorithmic paradigms such as alternatives to the infamous bulk-synchronous parallelism. In addition, we will study how to decompose the main problem into subproblems and how to map tasks to the processors.
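As a rough illustration of the kind of first-order cost model such a component evaluates, the sketch below estimates message time on a multi-dimensional torus as hop count times per-hop latency plus serialization time. The minimal-hop routing and the parameter values are illustrative assumptions, not SST/macro's internal model.

```python
def torus_hops(src, dst, dims):
    """Minimal hop count between two nodes of a multi-dimensional torus.
    src/dst are coordinate tuples; dims gives the size of each dimension."""
    hops = 0
    for s, d, k in zip(src, dst, dims):
        delta = abs(s - d)
        hops += min(delta, k - delta)   # the wrap-around link may be shorter
    return hops

def message_time(src, dst, dims, nbytes,
                 per_hop_latency=50e-9, bandwidth=2.5e9):
    """First-order message cost: per-hop latency plus serialization time."""
    return torus_hops(src, dst, dims) * per_hop_latency + nbytes / bandwidth
```

Sweeping such a model over topologies and bandwidth/latency parameters is, in miniature, the kind of interconnect design-space exploration described above.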

### D. Programming Models

SST/macro is designed to support a variety of programming and synchronization models. The most recent release of SST/macro [27] provides full support for Pthreads and the Message Passing Interface (MPI) [23]. MPI is the most common message passing library interface specification for distributed memory systems and is widely applied in a large number of scientific codes. In SST/macro, lightweight application threads perform MPI operations (Fig. 1). The simulator implements a complete MPI that skeleton applications can use to emulate node communication in a straightforward manner. SST/macro has been used to test the performance impact of proposed extensions to the MPI standard [7], [14] and optimizations to mpiBLAST [2], [3]. In the course of this project we will implement components for supporting additional programming styles such as the partitioned global address space (PGAS) programming model [24] and non-blocking synchronization [11]. Such extensions will address the needs of applications and algorithms that increasingly rely on fine-grained parallelism, such as lock-free synchronization [10], [11], and strong scaling while supporting fault resilience [21], to accommodate the massive growth of explicit on-chip parallelism and the constrained bandwidth anticipated in future chip architectures.

### E. SST/Macro Simulation Accuracy

We provide a brief overview of our simulator validation study [14]. Fig. 3(a) shows the result of our validation studies, demonstrating that SST/macro's predictions are always within 10% of the observed runtimes. Fig. 3(b) shows the runtime performance of our simulator, demonstrating that the simulator can easily simulate up to millions of processors, with its performance bounded by cache size. In this study we executed a trivial MPI ping-pong test [2]. The MPI ping-pong is a simple communication benchmark in which two ranks repeatedly exchange a message, commonly used to measure point-to-point latency and bandwidth. Fig. 3(c) shows how the simulator can be used to understand how architectural features impact application performance. A detailed discussion of these results is available in [14], [15].

Fig. 3. SST/macro simulator results showing (a) validation data with simulated runtime plotted against observed runtime, (b) real time required to simulate an MPI ping-pong round trip, and (c) use of the simulator to study the effects of machine topology and bandwidth on application runtime.
SECTION III

## mpiBLAST

This section lays out the core design and functionality of mpiBLAST. Furthermore, we discuss the I/O and computation scheduling optimizations proposed for mpiBLAST.

### A. The Fundamental Design of mpiBLAST

In bioinformatics, sequence alignment is an essential mechanism for discovering evolutionary relationships between sequences. One of the most widely used alignment search algorithms is BLAST (Basic Local Alignment Search Tool) [3], [4]. The BLAST algorithm searches for similarities between a set of query sequences and large databases of protein or nucleotide sequences. BLAST is a heuristic search method for finding locally optimal un-gapped alignments, or high-scoring pairs (HSPs). The algorithm first seeks words of length $W$ that score at least $T$ when aligned with the query and scored with a substitution matrix. Database words that score $T$ or greater are then extended in both directions in an attempt to find a locally optimal un-gapped alignment (HSP) with an $E$-value lower than the specified threshold. HSPs that meet these criteria are reported, provided they do not exceed the cutoff value specified for the number of descriptions and/or alignments to report.
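The seed-finding step described above can be sketched as follows. This Python sketch uses a simple +1-per-match identity scoring instead of a substitution matrix and omits the extension of seeds into HSPs, so it is only a schematic of the word-hit stage, not the BLAST algorithm itself.

```python
def seed_hits(query, db, W=3):
    """Return (query_pos, db_pos) seed pairs: exact length-W word matches.
    With +1-per-match identity scoring an exact word scores W, so any
    threshold T <= W admits exactly these hits; real BLAST instead scores
    words with a substitution matrix and extends each seed into an HSP."""
    index = {}
    for i in range(len(query) - W + 1):          # index the query's words
        index.setdefault(query[i:i + W], []).append(i)
    hits = []
    for j in range(len(db) - W + 1):             # slide over the database
        for i in index.get(db[j:j + W], []):
            hits.append((i, j))
    return hits
```

Because search time depends on how many seed hits survive extension, this stage is also the reason BLAST runtimes are hard to predict from input size alone, as noted in Section II-B.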

Today, the number of stored genomic sequences is increasing dramatically, which demands higher parallelization of sequence alignment tools. Moreover, next-generation sequencing, a new generation of non-Sanger-based sequencing technologies, has presented new challenges and opportunities in data intensive computing [28]. Many parallel approaches for BLAST have been investigated [5], [6], [18], [20], and mpiBLAST is an open-source, widely used parallel implementation of the NCBI BLAST toolkit.

The original design of mpiBLAST follows a database segmentation approach with a master/worker system. It works by initially dividing the database into multiple fragments. This pre-processing step is called ${\tt mpiformatdb}$. The master uses a greedy algorithm to assign and distribute the pre-partitioned database chunks to worker processors. Each worker then performs a BLAST search on its assigned database fragment in parallel. The master receives the results from each worker, merges them, and writes the output file. mpiBLAST achieves an effective speedup when the number of processors is small or moderate. However, mpiBLAST suffers from non-search overheads when the number of processors increases and the database size varies. Additionally, the centralized output processing design can greatly hamper the scalability of mpiBLAST.
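A greedy master-side assignment of this flavor can be sketched as follows: repeatedly give the largest remaining fragment to the least-loaded worker. The policy details are illustrative and not mpiBLAST's exact algorithm, which also accounts for which fragments a worker already caches.

```python
import heapq

def greedy_assign(fragment_sizes, n_workers):
    """Assign fragments (largest first) to the currently least-loaded worker.
    Returns {worker_id: [fragment_ids]}."""
    loads = [(0, w) for w in range(n_workers)]   # (total_bytes, worker_id)
    heapq.heapify(loads)
    assignment = {w: [] for w in range(n_workers)}
    for frag, size in sorted(enumerate(fragment_sizes),
                             key=lambda x: -x[1]):
        load, w = heapq.heappop(loads)           # least-loaded worker
        assignment[w].append(frag)
        heapq.heappush(loads, (load + size, w))
    return assignment
```

Largest-first greedy placement keeps per-worker search times roughly balanced, which matters because the slowest worker gates when the master can merge results.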

### B. Optimizations of mpiBLAST

#### 1) Hierarchical Architecture

mpiBLAST expands the original master/worker design into a hierarchical design, in which a supermaster process organizes all processes into equal-sized partitions. The supermaster assigns tasks to the different partitions and handles inter-partition load balancing. Each partition has one master processor responsible for coordinating both computation and I/O scheduling among the many workers in that partition. This hierarchical design has an advantage on massive-scale parallel machines, as it distributes the workload well across multiple partitions.

#### 2) Dynamic Load Balancing Design

It is difficult to estimate the execution time of BLAST because search time is extremely variable and thus unpredictable [13]. Therefore, a greedy scheduling algorithm that assigns fine-grained tasks to idle processes is necessary. To avoid load imbalance while reducing the scheduling overhead, mpiBLAST adopts a dynamic worker group management approach in which the masters dynamically maintain a window of outstanding tasks. Whenever a worker finishes its tasks, it requests further assignments from its master. With query prefetching, the master requests the next query segment when the total number of outstanding tasks in the window falls below a certain threshold.
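The window-plus-prefetch idea can be sketched as follows. Class and method names, and the refill policy of fetching one query segment at a time, are illustrative assumptions rather than mpiBLAST's actual implementation.

```python
class Master:
    """Sketch of a master holding a window of outstanding tasks, refilled
    by prefetching the next query segment when the window runs low."""

    def __init__(self, segments, threshold=2):
        self.segments = list(segments)   # each segment is a list of tasks
        self.window = []                 # outstanding, unassigned tasks
        self.threshold = threshold
        self.prefetches = 0              # how many segments were fetched
        self._refill()

    def _refill(self):
        # Prefetch: pull in segments while the window is below threshold.
        while len(self.window) < self.threshold and self.segments:
            self.window.extend(self.segments.pop(0))
            self.prefetches += 1

    def request_task(self):
        """Called by an idle worker; returns a task, or None when done."""
        task = self.window.pop(0) if self.window else None
        self._refill()
        return task
```

Because workers pull tasks only when idle, fast workers naturally take more work, and prefetching hides the latency of loading the next query segment behind ongoing searches.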

#### 3) Parallel I/O Strategy

Massive data I/O can lead to performance bottlenecks, especially for data-driven applications such as mpiBLAST. To deal with this challenge, mpiBLAST pre-distributes database fragments to workers before the search begins. Workers cache database fragments in memory instead of local storage, which is recommended on diskless platforms where there is no local storage attached to each processor. By default, mpiBLAST uses the master process to collect and write results within a partition, which may not be suitable for massively parallel sequence search. Asynchronous parallel output writing techniques optimize concurrent noncontiguous output access without inducing the synchronization overhead that results from traditional collective output techniques.

SECTION IV

## EXPERIMENTAL RESULTS

We chose to identify and use freely available datasets for our mpiBLAST-based simulation analysis. In our experimental set-up we ran mpiBLAST on the genome of the yellow fever mosquito, Aedes aegypti, which biologists have investigated because it spreads the dengue and yellow fever viruses. The genome database can be downloaded freely from the source in [35] and, at 1.4 GB, is a suitable size for testing on both our local machine and the cluster system. We used 1 MB query sequences randomly sampled from the Aedes aegypti transcriptome dataset because such query sequences match well with the genome's characteristics.

In our experiments, we relied on DUMPI to facilitate more detailed tracing of MPI calls than was available from other tracing tools. The results of a DUMPI profiling run consist of two file types: an ASCII metafile for the entire run and a binary trace file for each node. The metafile is a simple key/value ASCII file that is intended to be human-readable and to facilitate grouping related trace files together. Each trace file begins with a 64-bit magic number followed by 8 data records. To trace an application with DUMPI, the DUMPI libraries are linked to the application when it is executed on the system. Afterwards, several executables built from the DUMPI repository are used to analyze the DUMPI trace files.

In our experimental setup we traced and analyzed the mpiBLAST implementation described in Section III. The current open-source version of mpiBLAST has several options for parallel input/output of data. We simulated and tested three optimizations, described below:

• Optimization 1 (--use-parallel-write), enabling high-performance parallel output: by default, mpiBLAST uses the master process to collect and write results within a partition. This is the most portable output solution and should work on any file system. However, the parallel-write solution is highly recommended on platforms with fast network interconnects and high-throughput shared file systems.
• Optimization 2 (--use-virtual-frags), enabling workers to cache database fragments in memory instead of on local storage: this is recommended on diskless platforms where there is no local storage attached to each processor.
• Optimization 3 (--predistribute-db), pre-distributing database fragments to workers before the search begins: especially useful for reducing data input time when multiple database replicas need to be distributed to workers.

We traced a large number of mpiBLAST experimental executions with the SST/macro simulator to validate the simulator and predict the application's performance on a large-scale parallel machine. We executed our SST/macro simulation of the mpiBLAST application on two different platforms: a multi-core Linux machine and a distributed memory cluster system. The local machine consisted of a 2.66 GHz Intel Core(TM)2 Duo CPU and 2 GB of memory. The cluster system is composed of 113 nodes, where each node contains two 3.2 GHz Intel Xeon CPUs and 2 GB of memory.

### A. Validation of SST/Macro With mpiBLAST

SST/macro has been recently released [27] and has not yet been exposed to testing outside of its development environment at Sandia National Labs. For this reason, in this section we briefly summarize our efforts to validate the accuracy of the SST/macro simulator. The simulator was validated against mpiBLAST results using the default bandwidth (2.5 GB/s) and latency (1.3 $\mu{\rm s}$) on both testing environments. We used processor counts from 8 to 64, and traces were collected using the lightweight DUMPI library. Fig. 4 shows the simulated walltime versus the elapsed real time for the simulation driven by these DUMPI traces.

Fig. 4. Comparison of observed and simulated runtimes both on the local machine and the cluster system.

We applied the concept of K-L divergence [12] to evaluate the similarity between the observed and simulated runtimes across the tested systems. The K-L divergence is a non-commutative measure of the difference between two distributions $P$ and $Q$, with $P$ typically representing the "true" distribution and $Q$ an approximating distribution. We therefore set $P$ to the simulation results and $Q$ to the SST/macro DUMPI-driven runtimes, with varying total CPU counts. The K-L divergence is defined as $$D_{KL}(P\Vert Q)=\sum_{i}P(i)\log{P(i)\over Q(i)}\eqno{\hbox{(1)}}$$ where $Q(i)\ne 0$. A smaller K-L divergence signifies greater similarity between the two distributions.
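Equation (1) translates directly into code. A minimal implementation, assuming discrete distributions over a common support with $Q(i)\ne 0$ wherever $P(i)\ne 0$ (the example values below are illustrative, not the measured runtimes):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) per Eq. (1); terms with P(i) = 0 contribute zero."""
    return sum(pi * math.log(pi / qi)
               for pi, qi in zip(p, q) if pi > 0)
```

Note the non-commutativity: `kl_divergence(p, q)` generally differs from `kl_divergence(q, p)`, which is why the choice of which runtime series plays the role of $P$ matters.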

Table I shows the K-L divergence distances. We carefully analyzed the resulting K-L distances and found that the SST/macro simulated runtimes are very close to the observed wall-clock times on both the local machine and the cluster system.

TABLE I THE ABSOLUTE DISTANCE AND K-L DIVERGENCE

### B. Simulation of mpiBLAST Optimizations

To evaluate the mpiBLAST optimizations described in Section III, we ran SST/macro and collected the simulation DUMPI traces using 16, 32, and 64 processors on the cluster system. Fig. 5 shows the scalability and efficiency of each approach. The $y$-axis shows the total execution time in seconds for all processes of each approach, and the $x$-axis represents the number of processors used in our simulation runs. In our diagram we use the following notation: Optimization 1 enables parallel output, Optimization 2 uses virtual fragments, and Optimization 3 pre-distributes the database to workers before the search begins. In addition, we tested a version that includes all three mpiBLAST refinements, named Optimization 1+2+3, and a version that excludes all optimizations, named No Optimizations. The simulation results indicate that for our selected genome analysis, the sequence matching of the Aedes aegypti genome using query sequences of size 1 MB, Optimization 2 (the use of virtual fragments) provides the best scalability and efficiency. Optimization 2 leads to a speed-up of a factor of 2 or more compared to the No Optimizations configuration when executed on 64 nodes of our cluster system. This finding is not surprising given the rapid increase in the cost of accessing global memory as the number of participating compute nodes grows. Enabling parallel output writing (Optimization 1) also led to a performance increase of about a factor of 2 in our 64-node execution. Our tests indicated that in all execution scenarios the use of static work pre-distribution alone (Optimization 3) introduced significant overhead in our genome sequencing analysis and slowed down execution. However, when combined with Optimization 1 and Optimization 2, static work pre-distribution did not cause a performance loss and in certain scenarios even increased execution speed.
Enabling Optimization 2 also yielded faster execution in the tests we performed using 16 and 32 nodes; however, the observed speed-up was not as high as in the 64-node scenario. Intuitively, this result demonstrates that Optimization 2 provides excellent scalability but, due to the overhead of computing the fragments, the approach is effective only on systems with a higher degree of parallelism. The graph in Fig. 5 shows the same trend for Optimization 1, where enabling parallel output even degraded execution time in the 16-node scenario.

Fig. 5. Scalability of 5 different optimization strategies on a 113-node cluster system. Regular bars represent SST/macro DUMPI-driven simulation times with different optimizations and the core bars represent observed time.
SECTION V

## CONCLUSION

The application of hardware/software co-design has long been a feature of embedded system design. So far, however, hardware/software co-design techniques have found little application in the field of high-performance computing. The multi-core paradigm shift has left both software engineers and computer architects with a number of challenging dilemmas. Applying hardware/software co-design to HPC systems will allow for a bi-directional optimization of design parameters, where software specifications and behavior drive hardware design decisions, and hardware constraints are better understood and accounted for in the implementation of effective application software. Simulation tools provide the data and insights needed to estimate the performance impact on an HPC application when it is subjected to certain architectural constraints. In this work we demonstrated the application of a newly developed open-source macroscale simulator (SST/macro) to the evaluation and optimization of data-intensive genome sequence matching algorithms. We performed both trace-driven simulation and simulation based on application modeling. In our experimental set-up, we ran the mpiBLAST sequence matching algorithm using 1 MB query sequences against the genome of the yellow fever mosquito, Aedes aegypti. Using this data-intensive application as a canonical example, we validated the accuracy of SST/macro. In addition, the analysis of our performance data indicated that the use of dynamic data fragmentation leads to significant performance gains and high scalability on a distributed memory cluster system. The framework we have presented allows for the evaluation and optimization of the mpiBLAST application on a wide variety of platforms, ranging from a conventional workstation to a system with levels of parallelism that are not obtainable on existing supercomputers.
This simulation capability can play a crucial role in the effective design and implementation of large-scale data-intensive applications for future multi-core hardware platforms, which could include a wide variety of features, including a heterogeneous design of CPUs, GPUs, and even FPGAs. In our future work, we intend to develop and distribute a full-scale SST/macro model implementation of the entire mpiBLAST library and make it available as a part of the SST/macro simulation distribution.

### ACKNOWLEDGMENT

We express our gratitude to the members of the SST/macro team at Sandia National Laboratories, Livermore, CA: Gilbert Hendry, Joe Kenny, and Jackson Mayo. In addition, we thank Adrian Sandu from Virginia Tech and the anonymous referees from IEEE Access for providing helpful comments and suggestions.

## Footnotes

D. Dechev is with the Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, USA. Corresponding author: D. Dechev (dechev@eecs.ucf.edu).

T.-H. Ahn is with the Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
