IEEE Transactions on Parallel and Distributed Systems

Issue 11 • November 1996

  • A framework for resource-constrained rate-optimal software pipelining

    Page(s): 1133 - 1149

    The rapid advances in high-performance computer architecture and compilation techniques provide both challenges and opportunities to exploit the rich solution space of software pipelined loop schedules. In this paper, we develop a framework to construct a software pipelined loop schedule which runs on the given architecture (with a fixed number of processor resources) at the maximum possible iteration rate (i.e., rate-optimal) while minimizing the number of buffers, a close approximation to minimizing the number of registers. The main contributions of this paper are as follows. First, we demonstrate that such a problem can be described by a simple mathematical formulation with precise optimization objectives under a periodic linear scheduling framework. The mathematical formulation provides a clear picture which permits one to visualize the overall solution space (for rate-optimal schedules) under different sets of constraints. Second, we show that a precise mathematical formulation and its solution do make a significant performance difference. We evaluated the performance of our method against three leading contemporary heuristic methods. Experimental results show that the method described in this paper performed significantly better than these methods. The techniques proposed in this paper are useful in two different ways: 1) as a compiler option for generating faster schedules for performance-critical loops (if interested users are willing to trade longer compile time for faster runtime); 2) as a framework for compiler writers to evaluate and improve other heuristics-based approaches by providing quantitative information as to where and how much their heuristic methods could be further improved. (A sketch of the standard initiation-interval bounds for such schedules follows the listing below.)

  • Achieving full parallelism using multidimensional retiming

    Page(s): 1150 - 1163

    Most scientific and digital signal processing (DSP) applications are recursive or iterative. Transformation techniques are usually applied to obtain optimal execution rates in parallel and/or pipelined systems. The retiming technique is a common and valuable transformation tool for one-dimensional problems, where loops are represented by data flow graphs (DFGs). In this paper, uniform nested loops are modeled as multidimensional data flow graphs (MDFGs). Full parallelism of the loop body, i.e., all nodes in the MDFG executed in parallel, substantially decreases the overall computation time. It is well known that, for one-dimensional DFGs, retiming cannot always achieve full parallelism. Other existing optimization techniques for nested loops also cannot always achieve full parallelism. This paper shows an important and counter-intuitive result, which proves that we can always obtain full parallelism for MDFGs with more than one dimension. This result is obtained by transforming the MDFG into a new structure. The restructuring process is based on a multidimensional retiming technique. The theory and two algorithms to obtain full parallelism are presented in this paper. Examples of optimization of nested loops and digital signal processing designs are shown to demonstrate the effectiveness of the algorithms. (A sketch of the retiming transformation follows the listing below.)

  • Static and dynamic evaluation of data dependence analysis techniques

    Page(s): 1121 - 1132

    Data dependence analysis techniques are the main component of today's strategies for automatic detection of parallelism. Parallelism detection strategies are being incorporated in commercial compilers with increasing frequency because of the widespread use of processors capable of exploiting instruction-level parallelism and the growing importance of multiprocessors. An assessment of the accuracy of data dependence tests is therefore of great importance for compiler writers and researchers. The tests evaluated in this study include the generalized greatest common divisor test, three variants of Banerjee's test, and the Omega test. Their effectiveness was measured with respect to the Perfect Benchmarks and the linear algebra libraries EISPACK and LAPACK. Two methods were applied: one using only compile-time information for the analysis, and a second using information gathered during program execution. The results indicate that Banerjee's test is for all practical purposes as accurate as the more complex Omega test in detecting parallelism. However, the Omega test is quite effective in proving the existence of dependences, in contrast with Banerjee's test, which can only disprove, or break, dependences. (A sketch of the simplest test in this family, the GCD test, follows the listing below.)

  • Impact of memory contention on dynamic scheduling on NUMA multiprocessors

    Page(s): 1201 - 1214

    Self-scheduling is a method for task scheduling in parallel programs, in which each processor acquires a new block of tasks for execution whenever it becomes idle. To get the best performance, the block size must be chosen to balance the scheduling overhead against the load imbalance. To determine the best block size, a better understanding of the role of load imbalance in self-scheduling performance is needed. In this paper we study the effect of memory contention on task duration distributions and, hence, load balancing in self-scheduling on a Nonuniform Memory Access (NUMA) machine. Experimental studies on a BBN TC2000 are used to reveal the strengths and weaknesses of analytical performance models for predicting running time and optimal block size. The models are shown to be very accurate for small block sizes. However, the models fail when the block size is large, due to a previously unrecognized source of load imbalance. We extend the analytical models to address this failure. The implications for the construction of compilers and runtime systems are discussed. (A sketch of block self-scheduling follows the listing below.)

  • SPIFFI-a scalable parallel file system for the Intel Paragon

    Page(s): 1185 - 1200

    This paper presents the design and performance of SPIFFI, a scalable high-performance parallel file system intended for use by extremely I/O-intensive applications, including “Grand Challenge” scientific applications and multimedia systems. The paper contains experimental results from a SPIFFI prototype on a 64-node, 64-disk Intel Paragon. The results show that SPIFFI provides high performance and linear scaleup on real hardware. The paper also explains how shared file pointers (i.e., file pointers that are shared by multiple processes) can simplify the design of a parallel application. By sequentializing I/O accesses and by providing dynamic I/O load balancing, a shared file pointer may even improve an application's performance. The paper also presents the predictions of a SPIFFI simulator that we validated using the prototype. The simulator results show that SPIFFI continues to provide high performance even when it is scaled to configurations with as many as 128 disks or 256 compute nodes. (A sketch of the shared-file-pointer idea follows the listing below.)

  • Decomposition abstraction in parallel rule languages

    Page(s): 1164 - 1184

    Decomposition abstraction is the process of organizing and specifying decomposition strategies for the exploitation of parallelism available in an application. In this paper we develop and evaluate declarative primitives for rule-based programs that expand opportunities for parallel execution. These primitives make explicit the implicit relations among the data and, similarly, among the rules. The semantics of the primitives are presented in a general object-based framework such that they may be applied to most rule-based programming languages. We show how the additional information provided by the decomposition primitives can be incorporated into a semantic-based dependency analysis technique. The resulting analysis reveals parallelism at compile time that is very difficult, if not impossible, to discover by traditional syntactic analysis techniques. Simulation results demonstrate scalable and broadly available parallelism. (A sketch of data decomposition for parallel rule firing follows the listing below.)

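Illustrative Sketches

For "A framework for resource-constrained rate-optimal software pipelining", here is a minimal Python sketch of the two standard lower bounds on the initiation interval (II) of a software-pipelined loop: the resource bound ResMII and the recurrence bound RecMII, whose maximum is the iteration rate a rate-optimal schedule must reach. The toy dependence graph, latencies, and functional-unit count are invented for illustration; the paper's actual contribution, a periodic linear scheduling formulation that also minimizes buffers, is not reproduced here.

from math import ceil

# Toy loop body: operations and dependence edges (src, dst, latency, distance),
# where "distance" is the number of loop iterations the dependence spans.
ops = ["a", "b", "c"]
edges = [("a", "b", 2, 0), ("b", "c", 1, 0), ("c", "a", 1, 2)]
functional_units = 2                      # assumed resource constraint

# Resource bound: every iteration must issue len(ops) operations.
res_mii = ceil(len(ops) / functional_units)

# Recurrence bound: for every dependence cycle C,
#   II >= ceil(sum of latencies on C / sum of distances on C).
# A real graph takes the maximum over all cycles; this toy graph has a single
# cycle, a -> b -> c -> a, which happens to contain every edge.
cycle = edges
rec_mii = ceil(sum(lat for _, _, lat, _ in cycle) /
               sum(dist for _, _, _, dist in cycle))

mii = max(res_mii, rec_mii)
print(f"ResMII={res_mii}  RecMII={rec_mii}  MII={mii}")   # ResMII=2  RecMII=2  MII=2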
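
For "Achieving full parallelism using multidimensional retiming", a small Python sketch of the retiming transformation d_r(e) = d(e) + r(u) - r(v) on a toy two-dimensional data flow graph. The node names, delay vectors, retiming function r, and schedule vector s are all invented for illustration; the paper's algorithms construct r and s systematically and prove that full parallelism is always reachable when the graph has more than one dimension.

edges = {                 # (u, v): delay vector d(e) of a toy 2-D data flow graph
    ("A", "B"): (1, 0),
    ("B", "C"): (0, 0),   # zero delay: C must wait for B within the same iteration
    ("C", "A"): (0, 1),
}
r = {"A": (0, 0), "B": (0, 0), "C": (1, -1)}   # assumed multidimensional retiming
s = (1, 2)                                     # assumed linear schedule vector

def retimed_delay(edge, d):
    u, v = edge
    return tuple(di + rui - rvi for di, rui, rvi in zip(d, r[u], r[v]))

d_r = {e: retimed_delay(e, d) for e, d in edges.items()}
print(d_r)   # {('A', 'B'): (1, 0), ('B', 'C'): (-1, 1), ('C', 'A'): (1, 0)}

# Every retimed edge now carries a delay that is strictly positive with respect
# to the schedule vector, so no intra-iteration ordering remains and all three
# nodes of the loop body can execute in parallel.
assert all(sum(si * di for si, di in zip(s, d)) > 0 for d in d_r.values())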
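
For "Static and dynamic evaluation of data dependence analysis techniques", a Python sketch of the classical single-subscript GCD test, the simplest relative of the tests the paper evaluates (the generalized GCD, Banerjee, and Omega tests are more precise). The array accesses and coefficients are invented for illustration.

from math import gcd

def gcd_test(a, b, c, d):
    """For a write to A[a*i + b] and a read of A[c*i + d] in the same loop,
    a dependence needs integer solutions of a*i1 - c*i2 = d - b, which can
    exist only if gcd(a, c) divides d - b.  Returns True when a dependence
    cannot be ruled out and False when the test disproves it."""
    return (d - b) % gcd(a, c) == 0

print(gcd_test(2, 0, 2, 1))   # False: A[2*i] and A[2*i + 1] can never overlap
print(gcd_test(2, 0, 2, 2))   # True: the GCD test cannot disprove this dependence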
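
For "Impact of memory contention on dynamic scheduling on NUMA multiprocessors", a Python sketch of fixed-size-block self-scheduling, the mechanism whose block size the paper models: idle workers grab the next block of iterations from a shared counter, so small blocks reduce load imbalance at the price of more scheduling overhead and large blocks do the reverse. The iteration count, block size, worker count, and loop body are invented for illustration.

import threading

N = 10_000               # loop iterations
block_size = 64          # the tuning knob whose choice the paper analyzes
next_iter = 0
lock = threading.Lock()
results = [0] * N

def worker():
    global next_iter
    while True:
        with lock:                        # one shared counter = scheduling overhead
            start = next_iter
            next_iter += block_size
        if start >= N:
            return
        for i in range(start, min(start + block_size, N)):
            results[i] = i * i            # stand-in for the real loop body

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(results) == sum(i * i for i in range(N)))   # True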
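
For "SPIFFI-a scalable parallel file system for the Intel Paragon", a Python sketch of the shared-file-pointer idea: concurrent writers atomically reserve the next region of a file, so their records land in sequential, non-overlapping positions without application-level coordination. The record size, file name, and process count are invented; this is not SPIFFI's interface, and the striping of data across the Paragon's disks is not modeled.

import multiprocessing as mp

RECORD = 512   # bytes reserved per write (assumed)

def writer(shared_offset, path, payload):
    with shared_offset.get_lock():        # fetch-and-add on the shared file pointer
        offset = shared_offset.value
        shared_offset.value += RECORD
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(payload.ljust(RECORD, b"\0"))

if __name__ == "__main__":
    path = "shared.dat"
    nprocs = 4
    with open(path, "wb") as f:
        f.truncate(nprocs * RECORD)       # preallocate the file
    shared_offset = mp.Value("q", 0)      # the shared file pointer
    procs = [mp.Process(target=writer, args=(shared_offset, path, b"record %d" % i))
             for i in range(nprocs)]
    for p in procs: p.start()
    for p in procs: p.join()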
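
For "Decomposition abstraction in parallel rule languages", a Python sketch of the general idea behind a decomposition primitive: if a declaration states that a rule only relates facts sharing a given attribute, instantiations over different values of that attribute cannot interfere and may fire in parallel. The attribute, facts, and the partition_by and fire_rules helpers are invented; the paper defines its primitives in an object-based framework and combines them with a semantic dependency analysis rather than the simple grouping shown here.

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

facts = [{"region": r, "value": v} for r, v in
         [(0, 3), (0, 5), (1, 2), (1, 8), (2, 7)]]

def partition_by(attribute, facts):       # stand-in for a decomposition primitive
    parts = defaultdict(list)
    for fact in facts:
        parts[fact[attribute]].append(fact)
    return parts

def fire_rules(part):                     # stand-in for firing the rule instances
    return sum(f["value"] for f in part)

# The decomposition declares the partitions independent, so they can be
# processed concurrently without any further dependence checking.
parts = partition_by("region", facts)
with ThreadPoolExecutor() as pool:
    totals = dict(zip(parts.keys(), pool.map(fire_rules, parts.values())))
print(totals)   # {0: 8, 1: 10, 2: 7}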

Aims & Scope

IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers.


Meet Our Editors

Editor-in-Chief
David Bader
College of Computing
Georgia Institute of Technology