Many phenomena are difficult or simply impossible to fully study experimentally due to a lack of reliable methods for measuring or controlling the experiment. For example, very large-scale phenomena such as supernovae cannot be tested experimentally, and very small-scale phenomena such as the interaction of particles at the quantum level cannot be precisely measured. In order to study such phenomena, scientists often use numerical simulations. These simulations can supplement existing partial results or lead to new insights which can guide scientists to specific experiments which would confirm the results. However, to obtain useful results, these simulations usually involve large and complicated calculations, which cannot be processed in a reasonable time on a serial computer. Therefore, a parallel supercomputer must be harnessed for such large-scale high-resolution modeling. For decades, significant investments have been made to advance areas of scientific research by creating larger and more powerful parallel supercomputers. In recent years, large-scale systems have gone from tera-scale to peta-scale, and are continuing toward exa-scale, enabling scientists to study complex problems which were previously intractable.
Many such systems are being operated by both the U.S. Department of Energy (DOE) and the National Science Foundation (NSF). DOE's "Scientific Discovery through Advanced Computing" (SciDAC) is researching ways to optimize and utilize large-scale systems, and maintains and funds several of them at different DOE sites, such as the Oak Ridge National Laboratory, Sandia National Laboratories, Lawrence Berkeley National Laboratory, and Lawrence Livermore National Laboratory [21]. NSF sponsors several systems of its own through its PetaApps [18] program, including those at University of Texas, Austin [26] and University of Illinois at Urbana-Champaign [16]. These systems involve many thousands of processors networked together to allow for large-scale computational simulations.
In order to fully utilize such large parallel systems, the algorithms and calculations within the simulation must be carefully parallelized. Most implementations of parallel scientific computing use a message passing paradigm, such as that used by the standard Message Passing Interface (MPI). While some tasks can be embarrassingly parallel, many of the operations necessary for these complex simulations are not trivial to parallelize. For instance, many of the matrix algorithms in First Principle Molecular Dynamics (FPMD) simulations take $O(n^3)$ operations to work with $O(n^2)$ data [13]. The result of this is that the more parallel this kind of algorithm is made, the more sparsely data is divided among the processors and hence the more communication is necessary between processes to exchange data. As the number of processors increases, it becomes more effective for optimization to analyze and reduce the communication than to use metrics such as numbers of operations.
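The scaling argument above can be made concrete with a rough estimate. The sketch below assumes a dense n x n matrix multiplication on a square two-dimensional process grid, in which each of the p processes performs about n^3/p operations and exchanges on the order of n^2/sqrt(p) matrix elements. This cost model and the function name `compute_to_comm_ratio` are illustrative assumptions, not taken from the FPMD codes.

```python
import math

def compute_to_comm_ratio(n, p):
    """Rough operations-per-communicated-element estimate for one process.

    Assumed cost model (not from the source): an n x n dense matrix
    multiplication on a sqrt(p) x sqrt(p) process grid, with about
    n^3 / p operations and about n^2 / sqrt(p) communicated elements
    per process. The ratio simplifies to n / sqrt(p).
    """
    ops_per_process = n ** 3 / p
    comm_per_process = n ** 2 / math.sqrt(p)
    return ops_per_process / comm_per_process

if __name__ == "__main__":
    n = 4096  # hypothetical matrix dimension
    for p in (64, 1024, 16384):
        print(p, compute_to_comm_ratio(n, p))
```

Under these assumptions, the ratio falls as the process count grows for a fixed problem size, which is one way to see why communication comes to dominate at scale.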
One common way to analyze the communications of such programs is through visualization. Several libraries and tools have been developed to capture MPI events and visualize the captured communication patterns. These tools, while effective at analyzing small systems, often do not scale well to large, massively parallel systems. For instance, one common visualization in most existing tools is a Gantt chart, which lines up the processes vertically and plots the MPI events versus time on the horizontal axis. This technique runs into problems once the number of processes exceeds the number of pixels available on the display. We propose an alternate visual analysis strategy for understanding MPI communications at extreme scales.
Once large-scale computation involves tens-of-thousands to millions of processes, it becomes less useful to consider every process individually; it makes more sense to consider groups of processes or groups of MPI calls before drilling down to individual processes or MPI events. At the highest level, we consider the system as a whole and see how the overall communications are impacting performance over time. Next, we consider the communications at the level of groups of processes by plotting related communications together regardless of the participating processes. This way, MPI calls can be represented at an abstract level regardless of the number of processes. Finally, individual calls and processes can be singled out from this view. We present a scalable approach to MPI visualization that does this by using a timeline overview in combination with focused views which are abstracted from individual processes. The focused view achieves this by directly mapping the MPI events in a temporal space regardless of process rank and using modulated opacity to show process density, as shown in Figure 1. We also show that with our visualization strategy it becomes possible to understand communication behaviors at a large scale and identify room for performance optimization.
1.1 Related work
Software visualization is a fairly broad field. Many visualizations focus on managing software development and repositories [24]. StarGate [17] is a tool that visualizes both the evolution of the software repository and the communication patterns of the developers involved. Other visualizations focus on visualizing the code itself and aid in the analysis of code dependencies in larger projects. Some use visualization to analyze and reverse engineer compiled binary code [25]. Using visualization to optimize performance has been approached in several ways by existing work. For example, TraceVis [19] visualizes the execution times of individual CPU instructions, and Bootchart [1] visualizes the performance of programs involved in the boot process of an operating system. Both of these examples use variants of Gantt charts to present the information. However, these tools focus on serial programming, where parallel issues such as communication delays do not come up.
The problem of characterizing communication has been studied by many researchers. Network monitoring tools such as EtherApe [4] and EZEL [28] show communication patterns well, but they focus on pure network activity and do not incorporate properties particular to distributed communication. The communications between parallel processes and data storage servers have also been researched through analysis of access patterns [20], [31], [32]. Communication between software modules, such as client-server relationships, has also been analyzed through graph-based visualizations [33]. These visual approaches are effective at analyzing network traffic, but because they focus solely on network information, the impact on computational efficiency in a massively parallel environment would be difficult to deduce.
One common set of visualization tools for MPI data is Jumpshot [2], [30] and its predecessors (Nupshot [10] and Upshot [7]). These tools use the MPI Parallel Environment (MPE) library to intercept the MPI calls in a parallel program. Then they visualize the collected trace with a Gantt chart by plotting process rank versus time, using color to represent the MPI calls. ParaGraph [6] is another, older program that visualizes MPI traces collected with the MPICL library; it also uses Gantt charts, among other views such as overall summaries and communication graphs. Vampir [15] is another visual tool which combines Gantt charts and summary views. The Tuning and Analysis Utilities (TAU) [23] suite is one of the more comprehensive toolsets. The logging facilities included with it allow for conversion to many of the formats used by other existing tools, such as Jumpshot or Vampir. Its own visualizations include Gantt charts, a communication matrix view, and a call graph, among others. Virtue [22] stands apart from the other related works listed here in that it is a real-time visualization. This allows the user to monitor the performance of an application while it is running and potentially tune it or interact with it. It also incorporates VR techniques such as a CAVE (Cave Automatic Virtual Environment) to provide a more immersive visualization than most other tools. For other parallel environments, GVU's PVaniM tool [27] and ATEMPT [11], [12] present some detailed views of communication events in a PVM (Parallel Virtual Machine) system.
Some software visualizations address the scalability issues of plots such as Gantt charts. The works of Jerding et al. [8], Moreta and Telea [14], and Cornelissen et al. [3] use plots similar to Gantt charts to profile program execution traces. However, these works maintain the strict ordering of the charts, and use sub-pixel techniques to handle the scalability and allow for visibility of both large trends and outliers. In contrast, our approach sacrifices the ordering to spatially separate large trends from individual outliers.
Our approach draws upon several existing visualization techniques. The timeline view consists of a stacked graph representation, and the detailed view is based on techniques such as scatterplots and arc diagrams [29]. In order to plot a large number of calls simultaneously, we also incorporate existing techniques such as high precision alpha blending and opacity scaling similar to the work by Johansson et al. [9].
2 A Scalable Approach
As the number of processes increases, keeping track of individual processes becomes less useful, and it becomes more helpful to consider the system as a whole or in part before looking at individual processes. However, it is still useful to be able to drill down into the details of the data, so we implemented an interactive focus+context visualization which presents a high level abstraction, a focused view, and details on demand. The high level view consists of a timeline view which shows the status of the entire system over the entire run by depicting what fraction of the system is performing what MPI calls over time. From the timeline, a range of time can be selected to focus on. The MPI call view plots MPI calls within this range directly with respect to time, using opacity to handle overplotting issues due to the scale of the data. From the MPI call view, the individual processes can be highlighted to provide specific details to the user.
2.1 Timeline View
The timeline view depicts a stacked graph of the overall process activity over time. Each stacked area of the graph is associated with an MPI function, and its height represents the fraction of the processes that were calling that function at that time. One result of this is that the height of the remaining empty space corresponds to the efficiency of the system as a whole, as that is the fraction of processes not involved in communication at that time. Figure 2 shows a small portion of the timeline view blown up for clarity, with the MPI functions colored according to the legend in Figure 3. The more common functions are given unique colors, while the less common functions are all grey. The timeline view is also used as an interface to select smaller time ranges to view in more detail, and the selected range is indicated by the semi-transparent box shown in Figure 2.
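The stacked-graph heights described above can be sketched as follows. This is a minimal illustration rather than our implementation: the event tuple layout `(rank, func, start, end)`, the sampling at bin centers, and the function names are all assumptions made for the example, and it presumes each process is inside at most one MPI call at any instant.

```python
from collections import defaultdict

def timeline_fractions(events, num_procs, t0, t1, num_bins):
    """Fraction of processes inside each MPI function, sampled per time bin.

    events: iterable of (rank, func, start, end) tuples (hypothetical layout).
    Returns (bin_centers, {func: [fraction per bin]}).
    """
    width = (t1 - t0) / num_bins
    centers = [t0 + (i + 0.5) * width for i in range(num_bins)]
    layers = defaultdict(lambda: [0.0] * num_bins)
    for _rank, func, start, end in events:
        for i, t in enumerate(centers):
            # A call contributes to a bin if it is active at the bin center.
            if start <= t < end:
                layers[func][i] += 1.0 / num_procs
    return centers, dict(layers)

def non_communicating(layers, num_bins):
    """Height of the empty space: fraction of processes not in any MPI call."""
    return [1.0 - sum(layer[i] for layer in layers.values())
            for i in range(num_bins)]

if __name__ == "__main__":
    demo = [(0, "MPI_Recv", 0.0, 2.0), (1, "MPI_Recv", 1.0, 3.0),
            (2, "MPI_Send", 0.0, 1.0), (3, "MPI_Send", 3.0, 4.0)]
    centers, layers = timeline_fractions(demo, 4, 0.0, 4.0, 4)
    print(layers)
    print(non_communicating(layers, 4))
```

Stacking the per-function lists in a fixed order gives the colored areas, and the remainder returned by `non_communicating` corresponds to the empty space above them.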
2.2 MPI Call View
The most direct representation of the MPI calls is to render each call from each process with respect to time. Gantt charts do this, but they restrict the y-axis to represent the MPI calls' originating processes. While we retain the use of the x-axis as time, we chose to use the y-axis to represent other properties of the MPI calls. In particular, we found it effective to use the y-axis to represent duration of the MPI calls, particularly on a log scale as the durations vary over several orders of magnitude. The advantage to using duration on the y-axis is that large delays due to communication will be prominently seen at the top of the plot. Since this and other y-axis mappings allow the MPI calls to overlap, we modulate the opacity of the calls, which makes the overall intensity of the visualization represent the density of the MPI calls. The color is mapped to the MPI function being called as in the timeline. For the representation of the calls themselves, we explored several options, including arches, lines, and individual points, examples of which are shown in Figure 4.
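As a minimal sketch of this mapping, the function below converts one MPI call into plot coordinates, with time on the x-axis and log-scaled duration on the y-axis. The function name, the parameter layout, and the choice to clamp durations into the displayed range are assumptions made for illustration only.

```python
import math

def call_geometry(start, end, t_range, d_range, width, height):
    """Map one MPI call to (x_start, x_end, y) in plot coordinates.

    t_range: (t_min, t_max) of the selected time window.
    d_range: (d_min, d_max) of displayed durations, with d_min > 0.
    Durations outside d_range are clamped (an assumption of this sketch).
    """
    t_min, t_max = t_range
    d_min, d_max = d_range
    duration = min(max(end - start, d_min), d_max)
    x_start = (start - t_min) / (t_max - t_min) * width
    x_end = (end - t_min) / (t_max - t_min) * width
    # Log scale: equal vertical steps correspond to equal duration ratios.
    y = ((math.log10(duration) - math.log10(d_min))
         / (math.log10(d_max) - math.log10(d_min)) * height)
    return x_start, x_end, y

if __name__ == "__main__":
    # A 10 ms call in a 10 s window, durations shown from 0.1 ms to 1 s.
    print(call_geometry(2.0, 2.01, (0.0, 10.0), (1e-4, 1.0), 1000, 400))
```

A line representation would draw from `(x_start, y)` to `(x_end, y)`, while a point representation would use only one of the two x values.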
Of the representations we use, the arch representation, shown in Figure 4(a), is the least scalable, but is probably also the most intuitive. The beginning point of each arch corresponds to the start time of the MPI call, and the end point of the arch corresponds to the end time. The y-position of the apex is proportional to the duration of the call on some scale. Figure 4(a) is on a linear scale, and depicts patterns that show dependency relationships such as when many processes are dependent on a previous synchronized MPI call or when one global communication is blocking. These are indicated by sets of MPI calls that either start or end nearly simultaneously.
The line representation, shown in Figure 4(b), is the most similar to traditional Gantt charts. Each line goes from the start of the MPI call to the end of the MPI call. As in the other representations, the y-position is proportional to the duration and, in this example, is on a log scale. This representation is more scalable than the arch representation as it produces less clutter on the screen. This comes at the cost of being able to readily see dependencies, as dependent MPI calls no longer touch. However, patterns of simultaneous starting and stopping of MPI calls are readily visible as vertically linear and logarithmic trends. Also, clear groupings of MPI calls can be seen, corresponding to the originating MPI functions in the code.
As the duration of the MPI calls is already being encoded in the height, it is redundant to show duration on the x-axis as well. So the final and most scalable representation of MPI calls we implemented uses simple points to plot the duration of the MPI calls versus either the start or end times of the MPI calls, as shown in Figure 4(c). Similar to the line representation, dependency information is not easily visible. However, vertical and logarithmic trends clearly delimit simultaneous function calls and returns. When plotting start times versus duration, the vertical trends show simultaneous start times and the log curves to the left show simultaneous end times, and when plotting end time versus duration it is the other way around, with the log curves going to the right.
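The shape of these trends follows directly from the axis mapping. As a short derivation, with the symbols $s$, $e$, $d$, $S$, and $T$ introduced here only for illustration: let a call have start time $s$, end time $e$, and duration $d = e - s$, and let the y-axis show $\log d$. For a group of calls that all end at the same time $T$, plotting against start time gives

$$y = \log d = \log (T - s), \qquad s < T$$

which is a logarithmic curve that rises to the left of $x = T$ and drops steeply as $s$ approaches $T$. Calls that all start at the same time $S$ share the same x-position and therefore form a vertical line at $x = S$. When plotting against end time instead, calls that start together at $S$ satisfy

$$y = \log d = \log (e - S), \qquad e > S$$

which is the mirrored curve rising to the right of $x = S$, while calls that end together at $T$ form the vertical line at $x = T$.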
From any of these representations, details of any MPI call can be determined by selecting it with the mouse, at which point all the calls from the selected call's process are highlighted, and details about the selected MPI call are presented to the user textually, as is demonstrated in all three examples in Figure 4. MPI functions can also be highlighted by selecting them from the color legend, at which point all calls to that function get highlighted in the call view.
2.3 Opacity Scaling
When plotting the MPI calls with our approach, many of them overlap, particularly when they start or end simultaneously. A simple way to resolve this overlap is to make the calls semitransparent and use alpha blending to combine them. However, this very quickly runs into limitations as the number of calls increases, as shown in Figure 5(a). First, the standard 8-bit alpha buffer only allows for a maximum overplotting of 256. And second, in order to show overlap of large numbers of MPI calls, the opacity has to be set so low that outliers are nearly invisible. In order to keep both the opacity of outliers high and the combined opacity of dense overlap from overflowing the alpha buffer, we utilize the opacity scaling techniques of [9]. In our implementation of this technique, we first render to a high precision density buffer D which keeps track of the total amount of overplot and then to a high precision color buffer C which blends the input color information with opacity inversely proportional to the density information to result in an average color that is fully opaque. We then combine these buffers with a transfer function to render the final pixels P to the screen. We implemented two such functions: a linear map, and a logarithmic map.
The linear map (shown in Figure 5(b)) is defined as:
$$P_{xy} = C_{xy} \times \left( o_{min} + (1 - o_{min}) \times \frac{D_{xy}}{D_{max}} \right)$$
And the logarithmic map (shown in Figure 5(c)) is defined as:
$$P_{xy} = C_{xy} \times \left( o_{min} + (1 - o_{min}) \times \frac{\log (D_{xy})}{\log (D_{max})} \right)$$
where $o_{min}$ is a user-defined minimum opacity level and $D_{max}$ is the maximum level of overplotting that occurred. By calculating the final opacity in this manner, we guarantee that any outliers will have at least opacity $o_{min}$, that no overplotting exceeds the maximum opacity and, in the case of the logarithmic map, that the system will be able to handle many orders of magnitude of overplotting.
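The two transfer functions can be sketched in a few lines. This is an illustrative CPU version rather than our rendering code: the sample layout, the function names, and the dictionary-based buffers are assumptions, the average color is computed directly as a sum divided by the density instead of through blending, and the handling of the degenerate case where $D_{max}$ is 1 (where the logarithmic map would divide by zero) is a choice made for this sketch.

```python
import math

def opacity_linear(d, d_max, o_min):
    """Linear map: o_min + (1 - o_min) * D / D_max."""
    return o_min + (1.0 - o_min) * d / d_max

def opacity_log(d, d_max, o_min):
    """Logarithmic map: o_min + (1 - o_min) * log(D) / log(D_max)."""
    if d_max <= 1:
        # Sketch-specific choice: no overplotting anywhere, so D equals D_max.
        return 1.0
    return o_min + (1.0 - o_min) * math.log(d) / math.log(d_max)

def render(samples, o_min=0.2, opacity=opacity_log):
    """samples: iterable of ((x, y), (r, g, b)) fragments, one per drawn call.

    Builds a density buffer D and an average color buffer C, then returns
    the final pixels P = C * opacity(D) as a dict keyed by pixel.
    """
    density = {}
    color_sum = {}
    for pixel, (r, g, b) in samples:
        density[pixel] = density.get(pixel, 0) + 1
        cr, cg, cb = color_sum.get(pixel, (0.0, 0.0, 0.0))
        color_sum[pixel] = (cr + r, cg + g, cb + b)
    d_max = max(density.values(), default=0)
    pixels = {}
    for pixel, d in density.items():
        a = opacity(d, d_max, o_min)
        r, g, b = color_sum[pixel]
        pixels[pixel] = (a * r / d, a * g / d, a * b / d)
    return pixels

if __name__ == "__main__":
    dense = [((0, 0), (1.0, 0.0, 0.0))] * 100   # 100 overlapping calls
    outlier = [((5, 5), (0.0, 0.0, 1.0))]       # a single isolated call
    print(render(dense + outlier))
```

In this example the isolated call is drawn at exactly the minimum opacity under the logarithmic map, while the densest pixel reaches full opacity.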
Many simulations are performed through the use of large linear algebra calculations. One of the most common tools used to run these calculations is the ScaLAPACK library, which utilizes MPI communications to perform distributed linear algebra calculations. The Qbox FPMD simulation codes, for instance, utilize ScaLAPACK functions intensively. In order to demonstrate our approach of visualizing large parallel MPI traces, we use it to analyze matrix operations which use the ScaLAPACK library and its underlying libraries.
We captured the MPI communication using the Multi-Processing Environment (MPE) library. This generates a standardized log file in either clog2 or slog2 format, which we can then visualize. The examples shown here were run on NERSC's Franklin, which is a Cray XT4 massively parallel processing system with 38,128 Opteron compute cores and a peak performance of 356 TFlop/s [5]. All tests were run with one process per processor, so that no extraneous context switching overhead would be incurred. While tracing adds overhead for writing the log file out at the end of the program, we found that the impact on performance of actual computation was negligible.
Common Matrix Operations. Figure 6 shows the results of visualizing a series of common matrix operations. The operations chosen are commonly used in scientific calculations. In this example, the operations were run on 256 processes. From the timeline in Figure 6(a), the first thing that is plainly visible is that the program went through several visually distinct stages, each of which corresponds to a different matrix operation. We visualize each section in more detail, then compare and contrast them.
The first operation performed was a matrix multiplication, shown in Figure 6(b). As indicated by the colorings, the matrix multiplication's communication pattern mostly consists of MPI_Send, MPI_Recv, and MPI_Reduce, with the MPI_Recv calls generally taking the longest. The calls are staggered and generally quite short, indicating that the algorithm is already well optimized. It ends with a single large MPI_Allreduce which resynchronizes the system.
Inversion is more complicated than multiplication, involving multiple steps. It starts with an LU decomposition (Figure 6(c)), then uses the resulting triangular matrices to calculate the actual inverse (Figure 6(d)). The LU decomposition consisted almost entirely of MPI_Recv calls, with the corresponding MPI_Send calls barely visible. Interestingly, there is a very cyclic pattern, alternating between short calls and long calls. The strong synchronicity of the communications in this section is also interesting, and it could indicate potential for optimization either through redistribution of the data or changing the communication methods to be more asynchronous. Figure 6(d) shows the completion of the matrix inversion and contains two sub-sections. These sections each start with large calls to MPI_Reduce, and there are many shorter calls to MPI_Bcast in the first half and MPI_Recv in the second half. While the calls to MPI_Bcast and MPI_Recv are quite numerous, they are generally short and staggered. The real expense here is the MPI_Reduce calls, which keep many processes idle for a long time. These MPI_Reduce calls form a very distinctive pattern where they start synchronously, but their ends follow a logarithmic trend. This pattern could be indicative of a network communication issue. For instance, this pattern could be induced by using a logical tree communication network when the underlying physical network topology is actually a torus.
The eigenproblem operation is an eigenvector/eigenvalue solver and is the single largest matrix operation in this case study, so in Figure 6(e), we only show a representative part of it. Most of the MPI calls here also group together into distinct clusters, and they are a mix of MPI_Reduce and MPI_Recv calls, with the MPI_Reduce calls taking slightly longer. However, they are all very short compared to the single large MPI_Bcast which starts at the top of Figure 6(e), gradually locks more processes as the program progresses, and does not finish until the middle of Figure 6(f), at which point some processes have been idle for more than half of the total computation time. This can be seen in its entirety at the top of Figure 1. Figure 6(f) depicts more calculations that were involved in the eigenproblem after the long operation finished, such as MPI_Allreduce calls, along with the MPI_Reduce and MPI_Recv calls, which are running very synchronously.
The sixth section, shown in Figure 6(g), is a Gram-Schmidt orthogonalization, which is composed of four individual matrix operations, one of which is trivially parallelizable and takes nearly no time to complete. The communications in the other three operations look much like the matrix multiplication, with the exception that there are gaps between the operations where the processes synchronized and that there are some MPI_Allreduce calls. However, there is one large cluster of calls to MPI_Send and MPI_Recv near the end which are substantially longer. We determined that this occurred in the middle of the pdtrsm() operation. If this were due to a straggling process or poor load balancing, the calls would end simultaneously. Since they do not, this could indicate a network bottleneck or other system interference.
Testing Scalability. While Figure 6 demonstrates our approach on a series of matrix operations on a moderate size system, larger systems should also be considered. In order to investigate the effects of scaling on the visualization, we focus on matrix multiplication, as it is a commonly performed operation. Figure 7 demonstrates the effects of scaling up a matrix multiplication from a modestly small set of processes (64) up to large numbers of processes (16,384). As the number of processes is scaled up, so is the size of the data it is working on. This keeps the communication effects from completely overwhelming the execution, and vice versa. The first observation that can be made from these timelines is that the proportion of time spent doing the actual calculation decreases as the scale goes up. That is, as more processes are used, it takes longer to finish the initialization process to set up the communication channels and distribute the initial data. By 4,096 processes (Figure 7(d)), it already takes more time to initialize the program than to calculate the result. However, this effect would be offset on the more complex programs or larger datasets used in actual simulations, as the shorter computation offered by larger systems marginalizes the cost of initialization. Another observation that can be made is that within the matrix multiplication itself, the more processes there are, the greater the proportion of them that are in the middle of some form of communication at any given time, and thus the lower the efficiency of the system. In particular, by the time we reach 16,384 processes, almost all the time is spent in communication rather than actual computation. Finally, while the communication patterns were fairly cyclic at smaller scales, variances in the communications add up at the larger scales, leading to acyclic patterns, as can be seen well in Figure 7(c).
To understand what goes on within the matrix operation at large scales, we then focus on it in the detail view.
Figure 8 shows the detail of the matrix multiplication on 16,384 processes shown near the end of Figure 7(d). At this scale, the MPI calls are quite dense, so we use the point representations of plotting either start or end times versus duration separately. Figure 8(b) shows the end points of the communication calls. The first major trend visible in this view is that there are two stages in the operation. While the second half is much the same as in smaller scales such as in Figure 6(b), the first half is quite different. It contains MPI calls that took much longer than at the smaller scale. Namely, it begins with MPI_Recv calls that take a long time followed by some MPI_Reduce calls which took longer than normal. It can be seen that there were still unfinished MPI_Comm_create calls, which would explain the perturbation of the matrix multiplication. Thus in this case, to improve the performance of the matrix multiplication itself it would help to optimize the initialization procedures. After that, the matrix multiplication is a fairly dense mix of MPI_Send, MPI_Recv, and MPI_Reduce calls with some of the MPI_Recv calls taking distinctly longer than the rest of the MPI calls.
In order to better understand the normal communication patterns at this scale, we zoom into a small region of the operation where the communication was fairly regular. Even at this scale, the number of communications and thus points on the screen is quite dense. However, some small trends and clusters of communication are visible. One point of interest is how the communication does split into several very distinct layers. The MPI_Send calls are still the shortest, as in Figure 6(b). Next is a layer of MPI_Recv calls which are also fairly short. Above that are the MPI_Reduce calls, which take longer to distribute data among all processes involved. Finally, there is the distinct layer of clusters of MPI_Recv calls above, which are clearly separated from the rest of the calls.
As massively parallel computer systems are constantly moving to larger scales, it is becoming ever more important to understand how to use these systems efficiently. Access to these systems is often limited, so scientists cannot usually afford to thoroughly analyze their codes during long term and computation intensive simulations. Our approach uses process independent visualization and focus+context techniques to offer more scalability than traditional parallel system visualizations. And by analyzing a common scientific computation library on a modern supercomputer, our results can aid in refining and optimizing the underlying library used by the scientists, which would allow for more efficient use of the limited access time the scientists are allotted on similar large-scale systems.
While the results we achieved were quite effective at the scale we were dealing with, further extension of this work to greater scales could prove challenging. For instance, we currently load the entire log file into memory before visualizing it. Very large log files would need out-of-core access. Support for more log formats would be very beneficial to this end. We support clog and clog2 formats, but extending to slog2 format would aid out-of-core visualization, as it was designed with that intent. Extending the work to include profiling a real simulation would be useful, but the resulting log file would likely be much larger than the ones shown here. This would necessitate not only out-of-core data access, but also a higher level interface than the current timeline, such as one that abstracts the data to the matrix operation level.
The data formats we use do not clearly identify the MPI call across processes. If we move to a data format that identifies the calls hierarchically from the function level, the MPI calls could be accurately clustered together, which would allow for a hierarchically based visualization. As our current approach only uses two views, it would be interesting to either add an intermediate level view or a more detailed view based on selections from the MPI call plot. Further understanding could also be achievable by taking into account the topology of the supercomputer itself, or by drilling down to the underlying network traffic. This would allow for detection of network bottlenecks, which our current system cannot explicitly show.
Acknowledgments
This work is supported in part by the National Science Foundation through grants CCF-0938114, CCF-0808896, OCI-0749227, OCI-0749217, CNS-0551727, and CCF-0811422, and the U.S. Department of Energy through the SciDAC program with Agreement No. DE-FC02-06ER25777. This research used resources of the National Energy Research Scientific Computing Center (NERSC) through the DOE SciDAC program.