
Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing

Date: 1996


Displaying Results 1 - 25 of 47
  • Parallel Hierarchical Molecular Structure Estimation

    Page(s): 1

    Determining the three-dimensional structure of biological molecules such as proteins and nucleic acids is an important element of molecular biology because of the intimate relation between form and function of these molecules. Individual sources of data about molecular structure are subject to varying degrees of uncertainty. We have previously examined the parallelization of a probabilistic algorithm for combining multiple sources of uncertain data to estimate the structure of molecules and predict a measure of the uncertainty in the estimated structure. In this paper we extend our work on two fronts. First we present a hierarchical decomposition of the original algorithm which reduces the sequential computational complexity tremendously. The hierarchical decomposition in turn reveals a new axis of parallelism not present in the "flat" organization of the problems, as well as new parallelization issues. We demonstrate good speedups on two cache-coherent shared-memory multiprocessors, the Stanford DASH and the SGI Challenge, with distributed and centralized memory organization, respectively. Our results point to several areas of further study to make both the hierarchical and the parallel aspects more flexible for general problems: automatic structure decomposition, processor load balancing across the hierarchy, and data locality management in conjunction with load balancing. We outline the directions we are investigating to incorporate these extensions.

  • A Data-Parallel Implementation of O(N) Hierarchical N-Body Methods

    Page(s): 2

    The O(N) hierarchical N-body algorithms and Massively Parallel Processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We present a data-parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10-25%, and the overall efficiency is about 35%. The evaluation of the potential field of a system of 100 million particles takes 3 minutes and 15 minutes on a 256-node CM-5E, giving expected four and seven digits of accuracy, respectively. The speed of the code scales linearly with the number of processors and number of particles.

  • The Design of a Portable Scientific Tool: A Case Study Using SnB

    Page(s): 3

    Developing and maintaining a large software package is a complex task. Decisions are made early in the design process that affect i) the ability of a user to effectively exploit the package and ii) the ability of a software engineer to maintain it. This case study discusses issues in software development and maintainability of a scientific package called SnB, which is used to determine molecular crystal structures. The design of the user interface is discussed along with software engineering concepts, including modular programming, data encapsulation, and internal code documentation. Issues concerning the integration of Fortran, a language that is still widely used in the scientific community, into a modern scientific application with a C-based user interface are also discussed. Scientific applications benefit from being available on a wide variety of platforms. Due to the demand, SnB is available on a variety of sequential and parallel platforms. Methods used in the design of SnB for such portability are presented, including POSIX compliance, automatic configuration scripts, and parallel programming techniques.

  • Runtime Performance of Parallel Array Assignment: An Empirical Study

    Page(s): 4

    Generating code for the array assignment statement of High Performance Fortran (HPF) in the presence of block-cyclic distributions of data arrays is considered difficult, and several algorithms have been published to solve this problem. We present a comprehensive study of the run-time performance of the code these algorithms generate. We classify these algorithms into several families, identify several issues of interest in the generated code, and present experimental performance data for the various algorithms. We demonstrate that the code generated for block-cyclic distributions runs almost as efficiently as that generated for block or cyclic distributions.
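    The index arithmetic that makes block-cyclic code generation hard can be sketched in a few lines. The Python function below is an illustrative model of the global-to-local mapping (the function name and signature are mine, not from HPF or the paper): blocks of size b are dealt round-robin to p processors, and block and cyclic distributions fall out as the special cases of a large b and b = 1.

```python
def owner_and_local(g, b, p):
    """Map global index g to (owning processor, local index) under a
    block-cyclic distribution with block size b over p processors."""
    block = g // b           # which block the element falls in
    proc = block % p         # blocks are dealt to processors round-robin
    local_block = block // p # how many earlier blocks this processor holds
    return proc, local_block * b + g % b

# Example: 12 elements, block size 2, 3 processors.
layout = [owner_and_local(g, 2, 3) for g in range(12)]
```

    Code generation for an array assignment must, in effect, invert this mapping over a strided section, which is why the block-cyclic case is harder than pure block (b large) or pure cyclic (b = 1).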

  • ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance

    Page(s): 5

    This paper outlines the content and performance of ScaLAPACK, a collection of mathematical software for linear algebra computations on distributed memory computers. The importance of developing standards for computational and message passing interfaces is discussed. We present the different components and building blocks of ScaLAPACK, and indicate the difficulties inherent in producing correct codes for networks of heterogeneous processors. Finally, this paper briefly describes future directions for the ScaLAPACK library and concludes by suggesting alternative approaches to mathematical libraries, explaining how ScaLAPACK could be integrated into efficient and user-friendly distributed systems.

  • Network Performance Modeling for PVM Clusters

    Page(s): 6

    The advantages of workstation clusters as a parallel computing platform include a superior price-performance ratio, availability, scalability, and ease of incremental growth. However, the performance of traditional LAN technologies such as Ethernet and FDDI rings is insufficient for many parallel applications. This paper describes APACHE (Automated Pvm Application CHaracterization Environment), an automated analysis system that uses an application-independent model for predicting the impact of ATM on the execution time of iterative parallel applications. APACHE has been used to predict the performance of several core applications that form the basis for many real scientific and engineering problems. We present a comparison of the performance predicted by APACHE with observed execution times to demonstrate the accuracy of our model. Finally, we present a method for a simple cost-benefit analysis that can be used to determine whether an investment in ATM equipment is justified for a particular workstation cluster environment.

  • Scalable Parallel Algorithms for Interactive Visualization of Curved Surfaces

    Page(s): 7

    We present efficient parallel algorithms for interactive display of higher order surfaces on current graphics systems. At each frame, these algorithms approximate the surface by polygons and rasterize them over the graphics pipeline. The time for polygon generation for each surface primitive varies between successive frames, and we address issues in distributing the load across processors for different environments. This includes algorithms to statically distribute the primitives to reduce dynamic load imbalance, as well as a distributed wait-free algorithm for machines on which re-distribution is efficient, e.g., shared-memory machines. These algorithms have been implemented on different graphics systems and applied to interactive display of trimmed spline models. In practice, we are able to obtain almost linear speed-ups (as a function of number of processors). Moreover, the distributed wait-free algorithm is 25-30% faster than the static and dynamic schemes.

  • STREN: A Highly Scalable Parallel Stereo Terrain Renderer for Planetary Mission Simulations

    Page(s): 8

    In this paper, we describe STREN, a parallel stereo renderer for fixed-location terrain rendering tasks required for the simulation of planetary exploration missions. The renderer is based on a novel spatial data representation, called the TANPO map. This data representation stores terrain data using a simple and compact structure and provides excellent locality for such rendering applications. Experimental results show that the renderer not only performs very well, but also scales perfectly to different numbers of processors. Example renderings use the red/blue stereo display method.

  • Education in High Performance Computing via the WWW: Designing and Using Technical Materials Effectively

    Page(s): 9

    The Cornell Theory Center (CTC), a national center for high-performance computing, has been designing and delivering education programs on high-performance computing in traditional workshops for over ten years. With the advent and growth of the World Wide Web, we have been able to expand our training efforts to a distance education format, including online lectures and exercises, communication with CTC consultants and other participants, and logins on CTC's world-class IBM RS/6000 SP. This description includes workshop design, technical content covered, design of the modules, and participants' responses.

  • Compiler-directed Shared-Memory Communication for Iterative Parallel Applications

    Page(s): 10

    Many scientific applications are iterative and specify repetitive communication patterns. This paper shows how a parallel-language compiler and a predictive cache-coherence protocol in a distributed shared memory system together can implement shared-memory communication efficiently for applications with unpredictable but repetitive communication patterns. The compiler uses static analysis to identify program points where potentially repetitive communication occurs. At runtime, the protocol builds a communication schedule in one iteration and uses the schedule to pre-send data in subsequent iterations. This paper contains measurements of three iterative applications (including adaptive programs with unstructured data accesses) that show that a predictive protocol increases the number of shared-data requests satisfied locally, thus reducing the remote data access latency and total execution time.

  • Transformations for Imperfectly Nested Loops

    Page(s): 12

    Loop transformations are critical for compiling high-performance code for modern computers. Existing work has focused on transformations for perfectly nested loops (that is, loops in which all assignment statements are contained within the innermost loop of a loop nest). In practice, most loop nests, such as those in matrix factorization codes, are imperfectly nested. In some programs, imperfectly nested loops can be converted into perfectly nested loops by loop distribution, but this is not always legal. In this paper, we present an approach to transforming imperfectly nested loops directly. Our approach is an extension of the linear loop transformation framework for perfectly nested loops, and it models permutation, reversal, skewing, scaling, alignment, distribution and jamming.
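    The loop distribution the abstract mentions can be made concrete with a small example. The hypothetical Python fragment below (function names are mine) converts an imperfect nest, where one statement sits outside the innermost loop, into two perfect nests; distribution is legal here because no iteration of the second nest reads an element before the first nest has initialized it.

```python
# Imperfectly nested: the initialization s[i] = 0 is not in the innermost loop.
def row_sums_imperfect(a):
    n = len(a)
    s = [0] * n
    for i in range(n):
        s[i] = 0               # statement outside the innermost loop
        for j in range(n):
            s[i] += a[i][j]
    return s

# After loop distribution: two perfectly nested loops, each of which can now
# be transformed independently (permuted, skewed, tiled, ...).
def row_sums_distributed(a):
    n = len(a)
    s = [0] * n
    for i in range(n):
        s[i] = 0
    for i in range(n):
        for j in range(n):
            s[i] += a[i][j]
    return s
```

    When such a dependence-preserving split is illegal, the paper's framework transforms the imperfect nest directly instead.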

  • Performance Analysis and Optimization on the UCLA Parallel Atmospheric General Circulation Model Code

    Page(s): 14

    An analysis is presented of the primary factors influencing the performance of a parallel implementation of the UCLA atmospheric general circulation model (AGCM) on distributed-memory, massively parallel computer systems. Several modifications to the original parallel AGCM code aimed at improving its numerical efficiency, load balance and single-node code performance are discussed. The impact of these optimization strategies on the performance on two state-of-the-art parallel computers, the Intel Paragon and the Cray T3D, is presented and analyzed. It is found that implementation of a load-balanced FFT algorithm results in a reduction in overall execution time of approximately 45% compared to the original convolution-based algorithm. Preliminary results of the application of a load-balancing scheme for the Physics part of the AGCM code suggest additional reductions in execution time of 10-15% can be achieved. Finally, several strategies for improving the single-node performance of the code are presented, and the results obtained thus far suggest reductions in execution time in the range of 25-35% are possible.

  • Climate Data Assimilation on a Massively Parallel Supercomputer

    Page(s): 15

    We have designed and implemented a set of highly efficient and highly scalable algorithms for an unstructured computational package, the PSAS data assimilation package, as demonstrated by detailed performance analysis of systematic runs on up to 512 nodes of an Intel Paragon. The preconditioned Conjugate Gradient solver achieves a sustained 18 Gflops performance. Consequently, we achieve an unprecedented 100-fold reduction in time to solution on the Intel Paragon over a single head of a Cray C90. This not only exceeds the daily performance requirement of the Data Assimilation Office at NASA's Goddard Space Flight Center, but also makes it possible to explore much larger and more challenging data assimilation problems which are unthinkable on a traditional computer platform such as the Cray C90.

  • Performance Analysis Using the MIPS R10000 Performance Counters

    Page(s): 16

    Tuning supercomputer application performance often requires analyzing the interaction of the application and the underlying architecture. In this paper, we describe support in the MIPS R10000 for non-intrusively monitoring a variety of processor events - support that is particularly useful for characterizing the dynamic behavior of multi-level memory hierarchies, hardware-based cache coherence, and speculative execution. We first explain how performance data is collected using an integrated set of hardware mechanisms, operating system abstractions, and performance tools. We then describe several examples drawn from scientific applications, which illustrate how the counters and profiling tools provide information that helps developers analyze and tune applications.

  • Profiling a Parallel Language Based on Fine-Grained Communication

    Page(s): 17

    Fine tuning the performance of large parallel programs is a very difficult task. A profiling tool can provide detailed insight into the utilization and communication of the different processors, which helps identify performance bottlenecks. In this paper we present a profiler for the fine-grained parallel programming language Split-C, which provides a simple global address space memory model. As our experience shows, it is much more challenging to profile programs that make use of efficient, low-overhead communication. We incorporated techniques which minimize profiling effects on the running program. We quantify the profiling overhead and present several Split-C applications which show that the profiler is useful in determining performance bottlenecks.

  • Modeling, Evaluation, and Testing of the Paradyn Instrumentation System

    Page(s): 18

    This paper presents a case study of modeling, evaluating, and testing the data collection services (called an instrumentation system) of the Paradyn parallel performance measurement tool using well-known performance evaluation and experiment design techniques. The overall objective of the study is to use modeling- and simulation-based evaluation to provide feedback to the tool developers to help them choose system configurations and task scheduling policies that can significantly reduce the data collection overheads. We develop and parameterize a resource occupancy model for the Paradyn instrumentation system (IS) for an IBM SP-2 platform. This model is parameterized with a measurement-based workload characterization and subsequently used to answer several "what if" questions regarding configuration options and two policies to schedule instrumentation system tasks: collect-and-forward (CF) and batch-and-forward (BF) policies. Simulation results indicate that the BF policy can significantly reduce the overheads. Based on this feedback, the BF policy was implemented in the Paradyn IS as an option to manage the data collection. Measurement-based testing results obtained from this enhanced version of the Paradyn IS are reported in this paper and indicate more than 60% reduction in the direct IS overheads when the BF policy is used.

  • An Analytical Model of the HINT Performance Metric

    Page(s): 19

    The HINT benchmark was developed to provide a broad-spectrum metric for computers and to measure performance over the full range of memory sizes and time scales. We have extended our understanding of why HINT performance curves look the way they do and can now predict the curves using an analytical model based on simple hardware specifications as input parameters. Conversely, by fitting the experimental curves with the analytical model, hardware specifications such as memory performance can be inferred to provide insight into the nature of a given computer system.

  • Communication Performance Models in Prism: A Spectral Element-Fourier Parallel Navier-Stokes Solver

    Page(s): 20

    In this paper we analyze communication patterns in the parallel three-dimensional Navier-Stokes solver Prism, and present performance results on the IBM SP2, the Cray T3D and the SGI Power Challenge XL. Prism is used for direct numerical simulation of turbulence in non-separable and multiply-connected domains. The numerical method used in the solver is based on mixed spectral element-Fourier expansions in (x-y) planes and the z-direction, respectively. Each Fourier mode (or group of modes) is computed on a separate processor, as the linear contributions (Helmholtz solves) are completely uncoupled in the incompressible Navier-Stokes equations; coupling is obtained via the nonlinear contributions (convective terms). The transfer of data between physical and Fourier space requires a series of complete exchange operations, which dominate the communication cost for small numbers of processors. As the number of processors increases, global reduction and gather operations become important while complete exchange becomes more latency dominated. Predictive models for these communication operations are proposed and tested against measurements. A relatively large variation in communication timings per iteration is observed in simulations and quantified in terms of specific operations. A number of improvements are proposed that could significantly reduce the communications overhead with increasing numbers of processors, and generic predictive maps are developed for the complete exchange operation, which remains the fundamental communication in Prism. Results presented in this paper are representative of a wider class of parallel spectral and finite element codes for computational mechanics which require similar communication operations.

  • Architecture and Application: The Performance of the NEC SX-4 on the NCAR Benchmark Suite

    Page(s): 22

    In November 1994, the NEC Corporation announced the SX-4 supercomputer. It is the third in the SX series of supercomputers and is upward compatible from the SX-3R vector processor with enhancements for scalar processing, short vector processing, and parallel processing. In this paper we describe the architecture of the SX-4 which has an 8.0 ns clock cycle and a peak performance of 2 Gflops per processor. We also describe the composition of the NCAR Benchmark Suite, designed to evaluate the computers for use on climate modeling applications. Additionally, we contrast this benchmark suite with other benchmarks. Finally, we detail the scalability and performance of the SX-4/32 relative to the NCAR Benchmark Suite.

  • Low-Latency Communication on the IBM RISC System/6000 SP

    Page(s): 24

    The IBM SP is one of the most powerful commercial MPPs, yet, in spite of its fast processors and high network bandwidth, the SP's communication latency is inferior to older machines such as the TMC CM-5 or Meiko CS-2. This paper investigates the use of Active Messages (AM) communication primitives as an alternative to the standard message passing in order to reduce communication overheads and to offer a good building block for higher layers of software. The first part of this paper describes an implementation of Active Messages (SP AM) which is layered directly on top of the SP's network adapter (TB2). With comparable bandwidth, SP AM's low overhead yields a round-trip latency that is 40% lower than IBM MPL's. The second part of the paper demonstrates the power of AM as a communication substrate by layering Split-C as well as MPI over it. Split-C benchmarks are used to compare the SP to other MPPs and show that low message overhead and high throughput compensate for SP's high network latency. The MPI implementation is based on the freely available MPICH version and achieves performance equivalent to IBM's MPI-F on the NAS benchmarks.

  • Compiled Communication for All-Optical TDM Networks

    Page(s): 25

    While all-optical networks offer large bandwidth for transferring data, the control mechanisms to dynamically establish all-optical paths incur large overhead. In this paper, we consider adapting all-optical multiplexed networks to multiprocessor or multicomputer environments by using compiled communication as an alternative to dynamic network control. Compiled communication eliminates the runtime overhead by managing network resources statically. Thus, it can employ complex off-line algorithms to improve resource utilization. We studied several off-line connection scheduling algorithms for minimizing the multiplexing degree required to satisfy communication requests. The performance of compiled communication is evaluated and compared with that of dynamically controlled communication for static communications in a number of application programs. Our results show that compiled communication outperforms dynamic communication to a large degree. Since most of the communication patterns in scientific applications are static, we conclude that compiled communication is an effective mechanism for all-optical networks in multiprocessor environments.

  • Increasing the Effective Bandwidth of Complex Memory Systems in Multivector Processors

    Page(s): 26

    In multivector processors, the cycles lost due to memory interferences between concurrent vector streams make the effective throughput lower than the peak throughput. Using the classical order, the vector stream references the memory modules using a temporal distribution that depends on the access patterns. In general, different access patterns determine different temporal distributions. These different temporal distributions could imply the presence of memory module conflicts even if the request rate of all the concurrent vector streams to every memory module is less than or equal to their service rate. In addition, in a memory system where several memory modules are connected to each bus (complex memory system), bus conflicts are added to the memory module conflicts. This paper proposes an access order, different from the classical order, to reference the vector stream elements. The proposed order imposes a temporal distribution to reference the memory modules that reduces the average memory access time in vector processors with complex memory systems. When the request rate of all the vector streams to every memory module is greater than the service rate, the proposed order reduces the number of lost cycles, and the effective throughput increases. Under other conditions, the effective throughput reaches the peak throughput.

  • A Parallel Cosmological Hydrodynamics Code

    Page(s): 27

    Understanding the formation by gravitational collapse of galaxies and the large-scale structure of the universe is a nonlinear, multi-scale, multi-component problem. This complex process involves dynamics of the gaseous baryons as well as of the gravitationally dominant dark matter. We discuss an implementation of a parallel, distributed memory, cosmological hydrodynamics code using MPI message passing; it is based on the Total-Variation-Diminishing (Harten 1983) serial code of Ryu et al. (1993). This parallel code follows the motion of gas and dark matter simultaneously, combining a mesh based Eulerian hydrodynamics code and a Particle-Mesh N-body code. A new, flexible matrix transpose algorithm is used to interchange distributed and local dimensions of the mesh. Timing results from runs on an IBM SP2 supercomputer are given.
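    The role such a transpose plays can be sketched as follows. This is an illustrative pure-Python stand-in (the function name and the list-of-lists "per-processor memories" are my own, not the paper's code): a matrix distributed by rows is redistributed so each processor instead owns a contiguous block of columns, with the loop over destination processors playing the part of the collective all-to-all exchange an MPI code would use.

```python
def swap_distributed_dim(rows_per_proc):
    """Redistribute a square matrix from row-distributed to column-distributed.

    rows_per_proc[src] is the list of full rows owned by processor src.
    Returns recv[dst]: every row, restricted to dst's block of columns.
    """
    p = len(rows_per_proc)
    n = sum(len(rows) for rows in rows_per_proc)  # total rows == total columns
    cols_per = n // p
    recv = [[] for _ in range(p)]
    for src in range(p):
        for dst in range(p):
            # Pack the column block destined for processor dst, then "send" it
            # (in the real code this pair of loops is a single MPI all-to-all).
            block = [row[dst * cols_per:(dst + 1) * cols_per]
                     for row in rows_per_proc[src]]
            recv[dst].extend(block)
    return recv

# 4x4 matrix split across 2 processors, 2 rows each.
rows = [[[1, 2, 3, 4], [5, 6, 7, 8]],
        [[9, 10, 11, 12], [13, 14, 15, 16]]]
cols = swap_distributed_dim(rows)
```

    After the exchange, operations along the formerly distributed dimension (e.g., the hydrodynamic sweeps or FFTs along that axis) become purely local.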

  • Parallel Preconditioners for Elliptic PDEs

    Page(s): 30

    Iterative schemes for solving sparse linear systems arising from elliptic PDEs are very suitable for efficient implementation on large scale multiprocessors. However, these methods rely heavily on effective preconditioners which must also be amenable to parallelization. In this paper, we present a novel method to obtain a preconditioned linear system which is solved using an iterative method. Each iteration comprises a matrix-vector product with k sparse matrices (k ≤ log n), and can be computed in O(n) operations where n is the number of unknowns. The numerical convergence properties of our preconditioner are superior to the commonly used incomplete factorization preconditioners. Moreover, unlike the incomplete factorization preconditioners, our algorithm affords a higher degree of concurrency and doesn't require triangular system solves, thereby achieving the dual objective of good preconditioning and efficient parallel implementation. We describe our scheme for certain linear systems with symmetric positive definite or symmetric indefinite matrices and present an efficient parallel implementation along with an analysis of the parallel complexity. Results of the parallel implementation of our algorithm are also presented.
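    For context, the skeleton shared by such methods is preconditioned conjugate gradients, where the preconditioner enters only through one extra application per iteration. The sketch below is an illustrative pure-Python model, and it substitutes a simple Jacobi (diagonal) preconditioner for the authors' construction; it is not their algorithm, only the surrounding iteration.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(A, b, tol=1e-10, max_iter=100):
    """Preconditioned CG for a dense SPD matrix A (lists of lists)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                                   # residual b - A x, with x = 0
    Minv = [1.0 / A[i][i] for i in range(n)]   # Jacobi preconditioner M^-1
    z = [mi * ri for mi, ri in zip(Minv, r)]   # preconditioned residual
    p = z[:]
    rz = dot(r, z)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            break
        z = [mi * ri for mi, ri in zip(Minv, r)]
        rz, rz_old = dot(r, z), rz
        p = [zi + (rz / rz_old) * pi for zi, pi in zip(z, p)]
    return x

# Small SPD system: 4x + y = 1, x + 3y = 2.
x = pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

    The paper's contribution corresponds to replacing the diagonal apply above with a product of k sparse matrices, which avoids the sequential triangular solves of incomplete-factorization preconditioners.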

  • Sparse LU Factorization with Partial Pivoting on Distributed Memory Machines

    Page(s): 31

    Sparse LU factorization with partial pivoting is important to many scientific applications, but the effective parallelization of this algorithm is still an open problem. The main difficulty is that partial pivoting operations make structures of L and U factors unpredictable beforehand. This paper presents a novel approach called S* for parallelizing this problem on distributed memory machines. S* incorporates static symbolic factorization to avoid run-time control overhead and uses nonsymmetric L/U supernode partitioning and amalgamation strategies to maximize the use of BLAS-3 routines. The irregular task parallelism embedded in sparse LU is exploited using the RAPID run-time system which optimizes asynchronous communication and task scheduling. The experimental results on the Cray T3D with a set of Harwell-Boeing nonsymmetric matrices are very encouraging and good scalability has been achieved. Even compared to a highly optimized sequential code, the parallel speedups are still impressive considering the current status of sparse LU research.
