Frontiers '95: The Fifth Symposium on the Frontiers of Massively Parallel Computation, 1995 (Proceedings)

Date: 6-9 Feb. 1995

Displaying results 1-25 of 62
  • Proceedings Frontiers '95. The Fifth Symposium on the Frontiers of Massively Parallel Computation

    Publication Year: 1995
    PDF (30 KB)
    Freely Available from IEEE
  • Visualizing distributed data structures

    Publication Year: 1995, Page(s): 480-487
    Cited by: Papers (1)
    PDF (552 KB)

    A new programming style for large-scale parallel programs, centered around distributed data structures, has emerged. Current parallel program visualization tools were designed for the old style and do not deal with distributed data structures. We show, with several examples of visualizations and animations developed for large-scale pC++ programs, that visualizing and animating distributed data structures is an important part of debugging and performance tuning for the new style of parallel programs. Our approach is based on a new methodology for recording execution behavior that uses I/O abstractions and compile-time source analysis and instrumentation.

  • Compiler support for out-of-core arrays on parallel machines

    Publication Year: 1995, Page(s): 110-118
    Cited by: Papers (10)
    PDF (620 KB)

    Many computational methods are currently limited by the size of physical memory, the latency of disk storage, and the difficulty of writing an efficient out-of-core version of the application. We are investigating a compiler-based approach to this problem. In general, our compiler techniques attempt to choreograph I/O for an application based on high-level programmer annotations similar to Fortran D's DECOMPOSITION, ALIGN, and DISTRIBUTE statements. The central problem is to generate “deferred routines” which delay computations until all the data they require have been read into main memory. We present results for two applications, LU factorization and red-black relaxation, on 1 to 32 nodes of an Intel Paragon after hand application of these compiler techniques.

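    As a loose illustration of the out-of-core idea only (not the paper's compiler-generated deferred routines; the file name and sizes are hypothetical), the sketch below streams an on-disk array through memory one tile at a time, so only a bounded slice is ever resident:

        # Out-of-core reduction sketch: process an on-disk array in
        # fixed-size tiles so only TILE elements are in memory at once.
        # Illustration of the general idea, not the paper's machinery.
        import numpy as np

        N, TILE = 1_000_000, 65_536
        # mode="w+" creates a zero-filled demo file so the sketch runs standalone
        data = np.memmap("big_array.dat", dtype=np.float64, mode="w+", shape=(N,))

        total = 0.0
        for start in range(0, N, TILE):
            tile = np.asarray(data[start:start + TILE])  # one tile in core
            total += tile.sum()
        print(total)
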
  • Aligning parallel arrays to reduce communication

    Publication Year: 1995, Page(s): 324-331
    Cited by: Papers (3)
    PDF (596 KB)

    Axis and stride alignment is an important optimization in compiling data-parallel programs for distributed-memory machines. We previously developed an optimal algorithm for aligning array expressions. Here, we examine alignment for more general program graphs. We show that optimal alignment is NP-complete in this setting, so we study heuristic methods. This paper makes two contributions. First, we show how local graph transformations can reduce the size of the problem significantly without changing the best solution. This allows more complex and effective heuristics to be used. Second, we give a heuristic that can explore the space of possible solutions in a number of ways. We show that some of these strategies can give better solutions than a simple greedy approach proposed earlier. Our algorithms have been implemented; we present experimental results showing their effect on the performance of some example programs running on the CM-5.

  • Automatic synchronisation elimination in synchronous FORALLs

    Publication Year: 1995, Page(s): 350-357
    Cited by: Papers (2)
    PDF (648 KB)

    This paper investigates a promising optimization technique that automatically eliminates redundant synchronization barriers in synchronous FORALLs. We present complete algorithms for the necessary program restructuring and subsequent code generation. Furthermore, we discuss the correctness, complexity, and performance of our restructuring algorithm before we finally evaluate its practical usefulness by quantitative experimentation. The experimental results are very encouraging. An implementation of the optimization algorithms in our Modula-2* compiler eliminated more than 50% of the originally present synchronization barriers in a set of seven parallel benchmarks. This barrier reduction improved the execution times of the generated programs by over 40% on a MasPar MP-1 with 16384 processors and by over 100% on a sequential workstation.

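    To make the targeted pattern concrete, here is a hedged sketch (mine, using Python threads rather than Modula-2*): a barrier between two FORALL phases is redundant when each processor's reads in the second phase depend only on its own writes in the first.

        # If iteration i of phase 2 reads only data written by iteration i
        # of phase 1 (no cross-processor dependence), the barrier between
        # the two phases can be eliminated. Illustrative sketch only.
        from threading import Barrier, Thread

        P = 4
        a = [0] * P
        b = [0] * P
        barrier = Barrier(P)

        def worker(i):
            a[i] = i * i        # phase 1: worker i writes only a[i]
            # barrier.wait()    # redundant: phase 2 reads only a[i]
            b[i] = a[i] + 1     # phase 2: purely local read
            barrier.wait()      # still required before any cross-worker phase

        threads = [Thread(target=worker, args=(i,)) for i in range(P)]
        for t in threads: t.start()
        for t in threads: t.join()
        print(b)
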
  • Efficient matrix operations in a reconfigurable array with spanning optical buses

    Publication Year: 1995, Page(s): 273-280
    Cited by: Papers (4)
    PDF (608 KB)

    A reconfigurable array with spanning optical buses (RASOB) is introduced. By taking advantage of the unique properties of optical signal transmission, the RASOB architecture provides flexible reconfiguration and strong connectivity with low hardware and control complexity. We discuss reconfiguration methods and communication capabilities of the architecture. In addition, we use parallel implementations of matrix transposition and matrix multiplication algorithms as examples to show how the architectural capabilities can be exploited in designing efficient parallel algorithms.

  • Exploitation of control parallelism in data parallel algorithms

    Publication Year: 1995, Page(s): 385-392
    PDF (540 KB)

    This paper considers the matrix decomposition A = LDL^T as a vehicle to explore the improvement in performance obtainable through the execution of multiple streams of control on SIMD architectures. Several methods for partitioning the SIMD array are considered. Architectural support for, and the feasibility of, using control parallelism in SIMD algorithms is briefly considered. Techniques for converting the extracted control parallelism into increased performance are illustrated via their application to the example algorithm. Analytical expressions for execution times are given in terms of the execution times of the constituent operations. Experimental results for the various partitioning schemes, based on execution traces, are also presented. Timings based on MasPar MP-2 operations and extrapolated from experimental data are used to compare the various control-parallel versions of the algorithm with the traditional SIMD counterpart.

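    For readers unfamiliar with the factorization itself, here is a small self-contained sketch (a sequential reference version, not the paper's SIMD formulation) computing A = LDL^T for a symmetric matrix:

        # Compute A = L D L^T: L is unit lower triangular, D is diagonal.
        import numpy as np

        def ldlt(A):
            n = A.shape[0]
            L = np.eye(n)
            d = np.zeros(n)
            for j in range(n):
                d[j] = A[j, j] - np.sum(L[j, :j] ** 2 * d[:j])
                for i in range(j + 1, n):
                    L[i, j] = (A[i, j] - np.sum(L[i, :j] * L[j, :j] * d[:j])) / d[j]
            return L, d

        A = np.array([[4.0, 2.0, 2.0],
                      [2.0, 5.0, 3.0],
                      [2.0, 3.0, 6.0]])
        L, d = ldlt(A)
        assert np.allclose(L @ np.diag(d) @ L.T, A)  # verify the factorization
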
  • On mapping data and computation for parallel sparse Cholesky factorization

    Publication Year: 1995, Page(s): 171-178
    PDF (656 KB)

    When performing the Cholesky factorization of a sparse matrix on a distributed-memory multiprocessor, the methods used for mapping the elements of the matrix and the operations constituting the factorization to the processors can have a significant impact on the communication overhead incurred. This paper explores how two techniques, one used when mapping dense Cholesky factorization and the other used when mapping sparse Cholesky factorization, can be integrated to achieve a communication-efficient parallel sparse Cholesky factorization. Two localizing techniques to further reduce the communication overhead are also described. The mapping strategies proposed here, as well as other previously proposed strategies, fit into the unifying framework developed in this paper. Communication statistics for sample sparse matrices are included.

  • Periodically regular chordal ring networks for massively parallel architectures

    Publication Year: 1995, Page(s): 315-322
    Cited by: Papers (8) | Patents (1)
    PDF (632 KB)

    Chordal rings have been proposed in the past as networks that combine the simple routing framework of rings with the lower diameter, wider bisection, and higher resilience of other architectures. Virtually all proposed chordal ring networks are node-symmetric; i.e., all nodes have the same in/out degree and interconnection pattern. Unfortunately, such regular chordal rings are not scalable. The periodically regular chordal ring network is proposed as a compromise for combining low node degree with small diameter. Discussion is centered on the basic structure, derivation of topological properties, routing algorithms, optimization of parameters, and comparison to competing architectures such as meshes and PEC networks.

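    As a concrete reference point (my sketch, not from the paper): a chordal ring on n nodes adds chord links at fixed skip distances to a basic ring; the code below builds one and measures its diameter by breadth-first search.

        # Build a chordal ring: ring links (i -> i±1 mod n) plus chords at
        # the given skip distances; compute the diameter by BFS from node 0
        # (the graph is node-symmetric, so one source suffices).
        from collections import deque

        def chordal_ring(n, skips):
            adj = {i: set() for i in range(n)}
            for i in range(n):
                for s in [1] + list(skips):
                    adj[i].add((i + s) % n)
                    adj[i].add((i - s) % n)
            return adj

        def diameter_from(adj, src=0):
            dist = {src: 0}
            q = deque([src])
            while q:
                u = q.popleft()
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        q.append(v)
            return max(dist.values())

        adj = chordal_ring(64, skips=[8])   # 64-node ring with skip-8 chords
        print(diameter_from(adj))           # far below the plain ring's 32
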
  • A scalable, visual interface for debugging with event-based behavioral abstraction

    Publication Year: 1995, Page(s): 472-479
    Cited by: Papers (4)
    PDF (568 KB)

    Event-based behavioral abstraction, in which models of intended program behavior are compared to actual program behavior, offers solutions to many of the debugging problems introduced by parallelism. Currently, however, its widespread application is limited by an inability to provide sufficient feedback on the mismatches between intended and actual behaviors, and an inability to provide output that scales for large or complex systems. The AVE/Ariadne debugging system was developed to address these limitations. Ariadne is a post-mortem debugger that combines a simple modeling language with functional queries to support thorough exploration of execution traces. AVE is a visual interface to Ariadne that provides scalable, visual feedback. AVE features hierarchical visualizations that reflect the structure of user-defined behavioral models, dynamic attribute calculation, and automatic partitioning of matched behaviors and attributes.

  • Automatic generation of efficient array redistribution routines for distributed memory multicomputers

    Publication Year: 1995, Page(s): 342-349
    Cited by: Papers (1) | Patents (1)
    PDF (560 KB)

    Appropriate data distribution has been found to be critical for obtaining good performance on distributed-memory multicomputers like the CM-5, Intel Paragon, and IBM SP-1. It has also been found that some programs need to change their distributions during execution for better performance (redistribution). This work focuses on automatically generating efficient routines for redistribution. We present a new mathematical representation for regular distributions called PITFALLS and then discuss algorithms for redistribution based on this representation. A significant contribution of this work is the ability to handle arbitrary source and target processor sets while performing redistribution; another is the ability to handle arbitrary dimensionality for the array being redistributed in a scalable manner. The results presented show low overheads for our redistribution algorithm as compared to naive runtime methods.

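    To ground the redistribution problem (an illustrative sketch, unrelated to the paper's PITFALLS representation): computing each element's owner under the source and target distributions yields the required communication sets.

        # For a 1-D array of N elements moving from a BLOCK to a CYCLIC
        # distribution over P processors, list which elements each source
        # processor must send to each target processor. This naive O(N)
        # enumeration is what representations like PITFALLS avoid.
        N, P = 16, 4
        block = N // P

        def owner_block(i):  return i // block   # BLOCK owner of element i
        def owner_cyclic(i): return i % P        # CYCLIC owner of element i

        sends = {}  # (src, dst) -> list of element indices
        for i in range(N):
            src, dst = owner_block(i), owner_cyclic(i)
            if src != dst:
                sends.setdefault((src, dst), []).append(i)

        for (src, dst), elems in sorted(sends.items()):
            print(f"P{src} -> P{dst}: {elems}")
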
  • A broadcast algorithm for all-port wormhole-routed torus networks

    Publication Year: 1995, Page(s): 529-536
    Cited by: Papers (4)
    PDF (612 KB)

    A new approach to broadcast in wormhole-routed two- and three-dimensional torus networks is proposed. The approach extends the concept of dominating sets from graph theory by accounting for the relative distance-insensitivity of the wormhole routing switching strategy and by taking advantage of an all-port communication architecture. The resulting broadcast operation is based on a tree structure that uses multiple levels of extended dominating nodes (EDN). Performance results are presented that confirm the advantage of this method over recursive doubling.

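    For context, the recursive-doubling baseline mentioned above doubles the set of informed nodes each step; a minimal sketch (mine, on a one-dimensional ring rather than a torus):

        # Recursive-doubling broadcast on n nodes: in step k, every node
        # that already holds the message forwards it to the node 2^k
        # positions away, so the informed set doubles each step
        # (ceil(log2 n) steps total). Baseline sketch only; the paper's
        # EDN-tree method needs fewer steps on all-port tori.
        def recursive_doubling(n, root=0):
            informed = {root}
            steps = 0
            while len(informed) < n:
                stride = 1 << steps
                informed |= {(u + stride) % n for u in informed}
                steps += 1
            return steps

        print(recursive_doubling(64))  # 6 steps for 64 nodes
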
  • Characteristics of the MasPar parallel I/O system

    Publication Year: 1995, Page(s): 265-272
    Cited by: Papers (1)
    PDF (520 KB)

    Input/output speed continues to present a performance challenge for high-performance computing systems, because technology improves processor speed, memory speed and capacity, and disk capacity at a much higher rate than mass storage latency. Developments in I/O architecture have attempted to reduce this performance gap. The MasPar I/O architecture includes many interesting features. This work presents an experimental study of the dynamic characteristics of the MasPar parallel I/O system. Performance measurements were collected and compared for the MasPar MP-1 and MP-2 testbeds at NASA GSFC. The results reveal strengths as well as areas for potential improvement and are helpful to software developers, systems managers, and system designers.

  • Algorithm for constructing fault-tolerant solutions of the circulant graph configuration

    Publication Year: 1995, Page(s): 514-520
    Cited by: Papers (1)
    PDF (552 KB)

    Recently, a general method was developed to design a k-fault-tolerant solution for any given circulant graph, where k is the number of faulty nodes to be tolerated. In this paper, a new algorithm is proposed which, unlike the earlier method, constructs a family of k-fault-tolerant solutions for any given circulant graph. These solutions can then be compared to select the one with the least cost. The algorithm is efficient to implement, as it requires only polynomial time to generate and search the solutions. The proposed method is also useful for other architectures, as demonstrated in the paper. We examine the application of the method to the problem of designing k-fault-tolerant extensions of two- and three-dimensional meshes, and show that the solutions obtained are very efficient.

  • Implementing multidisciplinary and multi-zonal applications using MPI

    Publication Year: 1995, Page(s): 496-503
    PDF (816 KB)

    Multidisciplinary and multi-zonal applications are codes where two or more distinct parallel programs, or copies of a single program, are utilized to model a single problem. To support such applications, a program can be divided into several single-program multiple-data (SPMD) applications, each of which solves the equations for a single physical discipline or grid zone. These applications are bound together to form a single multidisciplinary or multi-zonal program in which the constituent parts communicate via point-to-point message-passing routines. In this report it is shown that the new Message Passing Interface (MPI) standard is a viable portable library for implementing the message-passing portion of multidisciplinary applications. Further, with the addition of a portable loader, fully portable multidisciplinary application programs can be developed. Finally, the performance of MPI is compared to that of some native message-passing libraries. This comparison shows that MPI can be implemented to deliver performance commensurate with native message-passing libraries.

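    A minimal sketch of the coupling pattern described (using mpi4py, a Python MPI binding that postdates the paper; the two-way split and all names are illustrative, not the paper's code): split the world communicator into per-discipline groups, then exchange boundary data between the group leaders.

        # Run with e.g.: mpiexec -n 4 python coupled.py
        # Split MPI_COMM_WORLD into two "discipline" communicators; each
        # group computes independently (SPMD) and the two group leaders
        # exchange coupling data over the world communicator.
        from mpi4py import MPI

        world = MPI.COMM_WORLD
        rank = world.Get_rank()

        color = rank % 2                  # hypothetical split: two disciplines
        disc = world.Split(color=color, key=rank)

        local = disc.allreduce(rank)      # stand-in for a discipline solve

        if disc.Get_rank() == 0:          # leaders are world ranks 0 and 1
            peer = 1 - color              # world rank of the other leader
            remote = world.sendrecv(local, dest=peer, source=peer)
            print(f"discipline {color}: local={local}, coupled={remote}")
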
  • Runtime incremental parallel scheduling (RIPS) for large-scale parallel computers

    Publication Year: 1995, Page(s): 456-463
    Cited by: Papers (5)
    PDF (676 KB)

    Runtime incremental parallel scheduling (RIPS) is an alternative to the commonly used dynamic scheduling. In this strategy, the system's scheduling activity alternates with the underlying computation work. RIPS utilizes advanced parallel scheduling techniques to produce low-overhead, high-quality load balancing, and adapts to applications with nonuniform structure.

  • Work-efficient nested data-parallelism

    Publication Year: 1995, Page(s): 186-193
    Cited by: Papers (3) | Patents (4)
    PDF (600 KB)

    An apply-to-all construct is the key mechanism for expressing data-parallelism, but data-parallel programming languages like HPF and C* significantly restrict which operations can appear in the construct. Allowing arbitrary operations substantially simplifies the expression of irregular and nested data-parallel computations. The technique of flattening nested parallelism, introduced by Blelloch, compiles data-parallel programs with unrestricted apply-to-all constructs into vector operations, and has achieved notable success, particularly with irregular data-parallel programs. However, these programs must be carefully constructed so that flattening them does not lead to suboptimal work complexity due to unnecessary replication in index operations. We present new flattening transformations that generate programs with correct work complexity. Because these transformations may introduce concurrent reads in parallel indexing, we developed a randomized indexing that reduces concurrent reads while maintaining work-efficiency. Experimental results show that the new rules and implementations significantly reduce memory usage and improve performance.

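    To illustrate flattening in general (a generic sketch, not the paper's new transformations): a nested apply-to-all over a ragged collection becomes a flat data vector plus segment boundaries, on which segmented vector operations run in one pass.

        # Flattening a nested data-parallel sum: the ragged structure
        # [[1,2,3],[4],[5,6]] becomes one flat vector plus segment start
        # offsets; a single segmented reduction replaces the nested loop.
        import numpy as np

        nested = [[1, 2, 3], [4], [5, 6]]
        flat = np.concatenate([np.asarray(s) for s in nested])   # data vector
        starts = np.cumsum([0] + [len(s) for s in nested])[:-1]  # segment descriptor

        seg_sums = np.add.reduceat(flat, starts)                 # segmented +-reduce
        print(seg_sums)      # [ 6  4 11], same as [sum(s) for s in nested]
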
  • Parallel remapping algorithms for adaptive problems

    Publication Year: 1995, Page(s): 367-374
    Cited by: Papers (8)
    PDF (608 KB)

    We present fast parallel algorithms for remapping a class of irregular and adaptive problems on coarse-grained distributed-memory machines. We show that the remapping of these applications, using simple index-based mapping algorithms, can be reduced to sorting a nearly sorted list of integers or merging an unsorted list of integers with a sorted list of integers. By using the algorithms we have developed, the remapping of these problems can be achieved at a fraction of the cost of mapping from scratch. Results of experiments performed on the CM-5 are presented.

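    A toy sketch of why this reduction is cheap (mine, sequential, not the paper's parallel algorithms): merging a short unsorted batch into an already-sorted assignment costs far less than re-sorting everything.

        # After adaptive refinement, most mapping keys are unchanged
        # (already sorted); only the few new keys form a small unsorted
        # batch. Sorting the batch and merging is O(k log k + n) instead
        # of O(n log n) from scratch.
        import heapq

        sorted_keys = list(range(0, 1000, 2))   # existing sorted mapping keys
        new_batch = [501, 17, 903, 255]         # few keys from refinement

        merged = list(heapq.merge(sorted_keys, sorted(new_batch)))
        assert merged == sorted(sorted_keys + new_batch)
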
  • An object-oriented approach to nested data parallelism

    Publication Year: 1995, Page(s): 203-210
    Cited by: Patents (6)
    PDF (600 KB)

    This paper describes an implementation technique for integrating nested data parallelism into an object-oriented language. Data-parallel programming employs data aggregates called “collections” and expresses parallelism as operations performed over the elements of a collection. When the elements of a collection are also collections, there is the possibility of “nested data parallelism.” Few current programming languages support nested data parallelism, however. In an object-oriented framework, a collection is a single object, and its type defines the parallel operations that may be applied to it. Our goal is to design and build an object-oriented data-parallel programming environment supporting nested data parallelism. Our initial approach is built upon three fundamental additions to C++. We add new parallel base types by implementing them as classes, and a new parallel collection type called a “vector” that is implemented as a template. Only one new language feature is introduced: the foreach construct, which is the basis for exploiting elementwise parallelism over collections. The strength of the method lies in the compilation strategy, which translates nested data-parallel C++ into ordinary C++. Extracting the potential parallelism in nested foreach constructs is called “flattening” nested parallelism. We show how to flatten foreach constructs using a simple program transformation. Our prototype system produces vector code which has been run successfully on workstations, a CM-2, and a CM-5.

  • On the influence of partitioning schemes on the efficiency of overlapping domain decomposition methods

    Publication Year: 1995, Page(s): 375-384
    PDF (688 KB)

    One-level overlapping Schwarz domain decomposition preconditioners can be viewed as a generalization of block Jacobi preconditioning. The effect of the number of blocks and the amount of overlap between blocks on the convergence rate is well understood. This paper considers the related issue of the effect of the scheme used to partition the matrix into blocks on the convergence rate of the preconditioned iterative method. Numerical results for Laplace and linear elasticity problems in two and three dimensions are presented. The tentative conclusion is that using overlap tends to decrease the differences between the rates of convergence for different partitioning schemes.

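    As background on the preconditioner family discussed (a generic sketch, not the paper's overlapping Schwarz code): a one-level block Jacobi preconditioner applies the inverse of each diagonal block to the corresponding slice of the residual.

        # Block Jacobi preconditioning: partition the index set into
        # blocks and apply each diagonal block's inverse to its slice of
        # r. Overlapping Schwarz generalizes this by letting blocks
        # overlap. The block layout here is arbitrary.
        import numpy as np

        def block_jacobi_apply(A, r, block_size):
            n = A.shape[0]
            z = np.zeros_like(r)
            for s in range(0, n, block_size):
                e = min(s + block_size, n)
                z[s:e] = np.linalg.solve(A[s:e, s:e], r[s:e])  # local solve
            return z

        n = 8
        A = np.diag(np.full(n, 4.0)) + np.diag(np.full(n - 1, -1.0), 1) \
            + np.diag(np.full(n - 1, -1.0), -1)   # 1-D Laplacian-like matrix
        r = np.ones(n)
        print(block_jacobi_apply(A, r, block_size=4))
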
  • The DEC High Performance Fortran 90 compiler front end

    Publication Year: 1995, Page(s): 46-53
    Cited by: Papers (2) | Patents (1)
    PDF (584 KB)

    Digital has developed a compiler for full Fortran 90 and the High Performance Fortran extensions to Fortran 90. This compiler targets Digital's Alpha workstations, servers, shared-memory SMP servers, and distributed-memory AdvantageCluster and workstation farm systems. This paper gives an overview of the structure of the compiler's front end, the component responsible for lexical analysis, syntax analysis, and semantic analysis. It also presents, by means of an example, the compiler's high-level, platform-independent common intermediate representation.

  • Parallelization of two breadth-first search-based applications using different message-passing paradigms: an experimental evaluation

    Publication Year: 1995, Page(s): 12-19
    PDF (552 KB)

    We present experimental results for parallelizing two breadth-first search-based applications on the CM-5 by using two different message-passing paradigms, one based on send/receive and the other based on active messages. The parallelization of these applications requires fine-grained communication. Our results show that the active messages-based implementation gives significant improvement over the send/receive-based implementation. The improvements can primarily be attributed to the lower latency of the active messages implementation.

  • Design and analysis of product networks

    Publication Year: 1995, Page(s): 521-528
    Cited by: Papers (10)
    PDF (708 KB)

    In this paper a unified theory of Cartesian product networks is developed. Product networks (PNs) include meshes, tori, and hypercubes, among others. This paper studies the fundamental issues of topological properties, cost-performance ratio optimization, scalability, routing, embedding, and fault tolerance properties of PNs. In particular, the degree, diameter, average distance, connectivity, and node-symmetry of PNs are related to those of their constituent factor networks. Cost/performance analysis and comparison between different PNs, especially n-dimensional meshes/tori and n-dimensional r-ary hypercubes, are conducted, and the optimal trade-off between the number of dimensions and the size along each dimension is identified. Fast generic algorithms for point-to-point routing, broadcasting, and permuting on PNs are designed, making use of the corresponding algorithms of the factor networks. Finally, efficient embeddings on PNs are constructed for linear arrays, rings, meshes, tori, and trees.

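    A small sketch of the compositional flavor (standard graph facts used as a starting point, not the paper's full theory): degree and diameter of a Cartesian product follow directly from the factors.

        # For a Cartesian product network G x H (e.g., a mesh as a product
        # of paths): node degree adds and diameter adds across factors.
        def product_degree(deg_g, deg_h):       # degree of (u, v) in G x H
            return deg_g + deg_h

        def product_diameter(diam_g, diam_h):
            return diam_g + diam_h

        # Example: a 4 x 6 mesh is the product of a 4-path and a 6-path.
        print(product_degree(2, 2))     # 4: interior-node degree of the mesh
        print(product_diameter(3, 5))   # 8: mesh diameter = (4-1) + (6-1)
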
  • Optimizing irregular computations on SIMD machines: a case study

    Publication Year: 1995, Page(s): 222-230
    Cited by: Papers (1)
    PDF (676 KB)

    Data-parallel computations with regular structure, fixed data size, and predictable control patterns can be implemented efficiently on SIMD architectures. However, many large applications have irregular structure: either data sets that vary in size as the computation progresses, or control structures that select different subsets of the processors at each stage of the computation. In this paper we describe a stochastic biology simulation and some of the methods we used to improve its performance on the MasPar MP-1104. We present a simple model for evaluating the performance of a data-parallel application and use the model to improve the performance of the simulator.

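    A hedged sketch of the kind of simple model such studies use (the concrete formula here is mine, not the paper's): SIMD efficiency is the fraction of processor-cycles in which processors are actually enabled, so irregular control flow shows up directly as low utilization.

        # Simple SIMD utilization model: with P processors and a per-step
        # count of active (enabled) processors, efficiency is the mean
        # fraction of active processors.
        def simd_efficiency(active_per_step, P):
            steps = len(active_per_step)
            return sum(active_per_step) / (P * steps)

        # Hypothetical trace: a selection step leaves few processors enabled.
        print(simd_efficiency([1024, 1024, 128, 128, 1024], P=1024))  # 0.65
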
  • A high performance sparse Cholesky factorization algorithm for scalable parallel computers

    Publication Year: 1995, Page(s): 140-147
    PDF (796 KB)

    This paper presents a new parallel algorithm for sparse matrix factorization. This algorithm uses subforest-to-subcube mapping instead of the subtree-to-subcube mapping of another recently introduced scheme by A. Gupta and V. Kumar (1994). Asymptotically, both formulations are equally scalable on a wide range of architectures and a wide variety of problems. But the subtree-to-subcube mapping of the earlier formulation causes significant load imbalance among processors, limiting overall efficiency and speedup. The new mapping largely eliminates this load imbalance. Furthermore, the algorithm has a number of enhancements that substantially improve overall performance. This new algorithm achieves up to 20 GFlops on a 1024-processor Cray T3D for moderately large problems. To our knowledge, this is the highest performance ever obtained on an MPP for sparse Cholesky factorization.