Parallel and Distributed Systems, IEEE Transactions on

Issue 8 • Date Aug 1995

  • A notation for deterministic cooperating processes

    Publication Year: 1995 , Page(s): 863 - 871
    Cited by:  Papers (1)

    This paper proposes extensions of sequential programming languages for parallel programming that have the following features: 1) Dynamic Structures: The process structure is dynamic. Processes and variables can be created and deleted. 2) Paradigm Integration: The programming notation supports shared memory and message passing models. 3) Determinism: Demonstrating that a program is deterministic (all executions with the same input produce the same output) is straightforward. Programs can be written so that compilers can verify that the programs are deterministic. Nondeterministic constructs can be introduced in a sequence of refinement steps to obtain greater efficiency if required. The ideas have been incorporated in an extension of Fortran, but the underlying sequential imperative language is not central to the ideas described here. A compiler for the Fortran extension, called Fortran M, is available by anonymous ftp from Argonne National Laboratory. Fortran M has been used for a variety of parallel applications.

  • Optimal hot spot allocation on meshes for large-scale data-parallel algorithms

    Publication Year: 1995 , Page(s): 788 - 802

    Hot spots are notorious for degrading the performance of a parallel algorithm. We attempt to minimize the hot-spot access time for a class of problems, namely, Large-Scale Data-Parallel (LSDP) algorithms, on a 2D mesh. An LSDP algorithm has rich data parallelism but no exclusive task and data partitioning. Our approach is to allocate the hot spots at the optimal locations such that the hot-spot access time is minimized. We have also designed scheduling algorithms that control hot-spot access sequences to achieve the minimal access time. Both uniform and nonuniform hot spots have been considered in this study. We have analytically derived the optimal allocations for wrapped-around and non-wrapped-around square meshes. The theoretical results have been verified by parallelizing the EM algorithm for 3D PET image reconstruction on the Intel iPSC/860.

  • Distributed load balancing for parallel main memory hash join

    Publication Year: 1995 , Page(s): 841 - 849
    Cited by:  Papers (3)

    Parallel joins have been widely studied during the past decade and a number of efficient algorithms have been presented. While it is known that the performance of these algorithms may suffer greatly in the presence of skewed input data, the work on load balancing schemes for parallel join has been limited. The main contribution of this paper is the development and analysis of a new distributed data structure and an effective load balancing scheme for parallel main memory hash join on the NUMA architecture. Multiprocessors based on this architecture are scalable in both size of main memory and number of processors, and provide very high memory bandwidth. The load balancing scheme is based on random probing to avoid the hot spot problems caused by probing sequentially. We have modeled this load balancing scheme both analytically and experimentally. The experiments were run on a BBN TC2000 multiprocessor system.

  • Executing algorithms with hypercube topology on torus multicomputers

    Publication Year: 1995 , Page(s): 803 - 814
    Cited by:  Papers (5)

    Many parallel algorithms use hypercubes as the communication topology among their processes. When such algorithms are executed on hypercube multicomputers, the communication cost is kept to a minimum since processes can be allocated to processors in such a way that only communication between neighbor processors is required. However, the scalability of hypercube multicomputers is constrained by the fact that the interconnection cost per node increases with the total number of nodes. From a scalability point of view, meshes and toruses are more interesting classes of interconnection topologies. This paper focuses on the execution of algorithms with hypercube communication topology on multicomputers with mesh or torus interconnection topologies. The proposed approach is based on looking at different embeddings of hypercube graphs onto mesh or torus graphs. The paper concentrates on toruses since an already known embedding, called the standard embedding, is optimal for meshes. In this paper, an embedding of hypercubes onto toruses of any given dimension is proposed. This novel embedding is called the xor embedding. The paper presents a set of performance figures for both the standard and the xor embeddings and shows that the latter outperforms the former for any torus. In addition, it is proven that for a one-dimensional torus (a ring) the xor embedding is optimal in the sense that it minimizes the execution time of a class of parallel algorithms with hypercube topology. This class of algorithms is frequently found in real applications, such as FFT and some classes of sorting algorithms.

  • A new approach for the verification of cache coherence protocols

    Publication Year: 1995 , Page(s): 773 - 787
    Cited by:  Papers (16)

    We introduce a cache protocol verification technique based on a symbolic state expansion procedure. A global Finite State Machine (FSM) model characterizing the protocol behavior is built, and protocol verification becomes equivalent to determining whether the global FSM may enter erroneous states. In order to reduce the complexity of the state expansion process, all the caches in the same state are grouped into an equivalence class and the number of caches in the class is symbolically represented by a repetition constructor. This symbolic representation is partly justified by the symmetry and homogeneity of cache-based systems. However, the key idea behind the representation is to exploit a unique property of cache coherence protocols: the fact that protocol correctness is not dependent on the exact number of cached copies. Rather, symbolic states only need to keep track of whether the caches have 0, 1, or multiple copies. The resulting symbolic state expansion process only takes a few steps and verifies the protocol for any system size. Therefore, it is more efficient and reliable than current approaches. The verification procedure is first applied to the verification of five existing protocols under the assumption of atomic protocol transitions. A simple snooping protocol on a split-transaction shared bus is also verified to illustrate the extension of our approach to protocols with nonatomic transitions.
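
The idea of tracking only whether 0, 1, or multiple caches are in a given state can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three MSI-style state names, the helper names, and the error condition are hypothetical and are not taken from the paper, whose repetition constructors and expansion rules are not reproduced in this listing.

```python
from collections import Counter

# Hypothetical MSI-style cache states, chosen only for this illustration.
STATES = ("Invalid", "Shared", "Modified")


def abstract_count(n: int) -> str:
    """Collapse an exact count to '0', '1', or '+' (two or more)."""
    if n == 0:
        return "0"
    return "1" if n == 1 else "+"


def symbolic_state(caches):
    """Map a list of per-cache states to a size-independent symbolic state."""
    counts = Counter(caches)  # missing states count as 0
    return tuple((s, abstract_count(counts[s])) for s in STATES)


def is_erroneous(sym) -> bool:
    """Assumed coherence check: at most one writer, and no writer next to readers."""
    level = dict(sym)
    if level["Modified"] == "+":
        return True
    return level["Modified"] != "0" and level["Shared"] != "0"


small = ["Shared"] * 2 + ["Invalid"] * 2
large = ["Shared"] * 50 + ["Invalid"] * 14

# Systems of different sizes collapse to the same symbolic state.
print(symbolic_state(small) == symbolic_state(large))
print(is_erroneous(symbolic_state(["Modified", "Shared", "Invalid"])))
```

Because the check in `is_erroneous` only ever asks whether a count is zero, one, or more than one, the abstraction loses nothing that this particular check needs, which is the property the abstract points to.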

  • Computing global combine operations in the multiport postal model

    Publication Year: 1995 , Page(s): 896 - 900
    Cited by:  Papers (5)

    Consider a message-passing system of n processors, in which each processor holds one piece of data initially. The goal is to compute an associative and commutative reduction function on the n pieces of data and to make the result known to all the n processors. This operation is frequently used in many message-passing systems and is typically referred to as global combine, census computation, or gossiping. This paper explores the problem of global combine in the multiport postal model. This model is characterized by three parameters: n, the number of processors; k, the number of ports per processor; and λ, the communication latency. In this model, in every round r, each processor can send k distinct messages to k other processors, and it can receive k messages that were sent from k other processors λ-1 rounds earlier. This paper provides an optimal algorithm for the global combine problem that requires the least number of communication rounds and minimizes the time spent by any processor in sending and receiving messages.
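
To make the global combine operation concrete, the following Python sketch simulates the familiar recursive-doubling exchange for the simplest setting, assuming a single port (k = 1), unit latency (λ = 1), and n a power of two. It is a baseline illustration only and is not the optimal algorithm of the paper; the function name `global_combine` is hypothetical.

```python
import operator


def global_combine(values, op=operator.add):
    """Simulate recursive doubling; `op` must be associative and commutative.

    Assumes k = 1 port, latency 1, and a power-of-two number of processors.
    Returns the value held by every processor and the number of rounds used.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes n is a power of two")
    held = list(values)
    rounds = 0
    step = 1
    while step < n:
        # In each round, processor i swaps its partial result with partner
        # i XOR step, and both combine what they hold.
        held = [op(held[i], held[i ^ step]) for i in range(n)]
        step *= 2
        rounds += 1
    return held, rounds


result, rounds = global_combine([3, 1, 4, 1, 5, 9, 2, 6])
print(result, rounds)
```

With eight processors the simulation finishes in three rounds and every processor ends up holding the same combined value, which is the defining requirement of the operation described above.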

  • Generalized multiway branch unit for VLIW microprocessors

    Publication Year: 1995 , Page(s): 850 - 862
    Cited by:  Papers (7)  |  Patents (2)

    VLIW processors use multiway branch instructions to achieve high-speed, parallel evaluation of control structures. This paper introduces a new multiway branch mechanism that allows constant-time branch-target resolution based on an arbitrary condition tree. The unique feature of this mechanism is its target selection unit, which yields a branch target based on a set of condition bit values and a condition tree description. A representation of condition trees that results in a compact target selection unit is described, and the logic diagram of a target selection unit that provides four-way branching is shown. Our experimental results on nontrivial integer benchmarks indicate that the proposed multiway branch unit can improve the performance of VLIW machines substantially (i.e., as much as a geometric mean of 35%), compared to conventional two-way branching.

  • Message complexity of the tree quorum algorithm

    Publication Year: 1995 , Page(s): 887 - 890
    Cited by:  Papers (1)

    The tree quorum algorithm (TQA) uses a tree structure to generate intersecting (tree) quorums for distributed mutual exclusion. This paper analyzes the number of messages required to acquire a quorum in TQA. Let i be the depth of the complete binary tree used in TQA, and let M_i be the number of messages required to acquire a quorum or to determine that no quorum is accessible. We discuss M_i as a function of i and p, where p (½<p<1) is the probability that each site is operational. Let C_i denote the average number of sites in the quorum that TQA finds. The analysis shows that, although both M_i and C_i increase without bound as i increases, M_i/C_i approaches (1+p)/p as i increases. Based on this result, an approximate closed form for M_i is derived.
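
For readers unfamiliar with tree quorums, here is a minimal Python sketch of quorum construction on a complete binary tree, following the rule as it is commonly described: an operational node joins the quorum together with a quorum from one child subtree, while a failed node must be replaced by quorums from both child subtrees. This rule, the heap-style node numbering, and the name `get_quorum` are assumptions of the sketch; the listing does not spell out the algorithm, and the message counting analyzed in the paper is not modeled.

```python
def get_quorum(node, size, up):
    """Return a set of sites forming a quorum for the subtree at `node`, or None.

    Nodes are numbered 1..size in heap order (children of v are 2v and 2v+1);
    `up` is the set of operational sites.
    """
    left, right = 2 * node, 2 * node + 1
    if left > size:  # leaf
        return {node} if node in up else None
    if node in up:
        # An operational node plus a quorum from either child subtree.
        for child in (left, right):
            sub = get_quorum(child, size, up)
            if sub is not None:
                return {node} | sub
        return None
    # A failed node is replaced by quorums from both child subtrees.
    from_left = get_quorum(left, size, up)
    from_right = get_quorum(right, size, up)
    if from_left is None or from_right is None:
        return None
    return from_left | from_right


SIZE = 7  # complete binary tree with three levels
print(get_quorum(1, SIZE, set(range(1, 8))))   # every site operational
print(get_quorum(1, SIZE, {2, 3, 4, 5, 6, 7}))  # root has failed
```

With every site operational the quorum is a single root-to-leaf path, and quorums grow as sites fail, which gives some intuition for why the number of sites contacted depends on the availability p.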

  • Comparative performance evaluation of hot spot contention between MIN-based and ring-based shared-memory architectures

    Publication Year: 1995 , Page(s): 872 - 886
    Cited by:  Papers (6)

    Hot spot contention on a network-based shared-memory architecture occurs when a large number of processors try to access a globally shared variable across the network. While multistage interconnection network (MIN) and hierarchical ring (HR) structures are two important bases on which to build large-scale shared-memory multiprocessors, the different interconnection networks and cache/memory systems of the two architectures respond very differently to network bottleneck situations. In this paper, we present a comparative performance evaluation of hot spot effects on the MIN-based and HR-based shared-memory architectures. Both nonblocking MIN-based and HR-based architectures are classified, and analytical models are described for understanding network differences and for evaluating hot spot performance on both architectures. The analytical comparisons indicate that HR-based architectures have the potential to handle various contentions caused by hot spots more efficiently than MIN-based architectures. Intensive performance measurements on hot spots have been conducted on the BBN TC2000 (MIN-based) and the KSR1 (HR-based) machines. Experiments were also conducted to examine the practical effect of hot spots on synchronization lock algorithms. The experimental results are consistent with the analytical models, and present practical observations and an evaluation of hot spots on the two types of architectures.

  • Loop transformation using nonunimodular matrices

    Publication Year: 1995 , Page(s): 832 - 840
    Cited by:  Papers (6)

    Linear transformations are widely used to vectorize and parallelize loops. Unimodular transformations form a subset of these transformations. When a unimodular transformation is used, the exact bounds of the transformed loop nest are easily computed and the steps of the loops are equal to 1. Unimodular loop transformations have been widely used since they permit the implementation of many useful loop transformations. Recently, nonunimodular transformations have been proposed to reduce communication requirements or to use the memory hierarchy efficiently. The methods used for unimodular transformations do not work in the case of nonunimodular transformations, since they do not produce the exact bounds of the transformed loop nest. In this paper, we present a method for nested loop transformation which gives the exact bounds for both unimodular and nonunimodular transformations. The basic idea is to use the Hermite Normal Form (HNF) of the transformation matrix.
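
The difficulty with nonunimodular transformations can be seen in a small Python sketch. The matrix `T`, the 4 x 4 iteration space, and the helper names are chosen purely for illustration and do not come from the paper; the sketch only demonstrates the problem (unit-step loops over the transformed bounds would visit points that correspond to no original iteration) and does not implement the HNF-based method.

```python
from itertools import product

# Illustrative transformation (u, v) = (i + j, i - j); its determinant is -2,
# so it is nonunimodular.
T = ((1, 1), (1, -1))


def det2(m):
    """Determinant of a 2 x 2 integer matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def apply(m, point):
    """Apply the 2 x 2 matrix m to an iteration point (i, j)."""
    i, j = point
    return (m[0][0] * i + m[0][1] * j, m[1][0] * i + m[1][1] * j)


N = 4  # original loop nest: 0 <= i < N, 0 <= j < N
image = {apply(T, p) for p in product(range(N), repeat=2)}

us = [u for u, _ in image]
vs = [v for _, v in image]
box_points = (max(us) - min(us) + 1) * (max(vs) - min(vs) + 1)

print("determinant:", det2(T))
print("transformed iterations:", len(image))
print("integer points in the bounding box:", box_points)
# Every transformed point satisfies (u + v) % 2 == 0, so the image is a
# sublattice of the integer grid rather than all of it.
print(all((u + v) % 2 == 0 for u, v in image))
```

The image is a proper sublattice of the integer grid, so a correct transformed loop nest needs non-unit steps or offsets in addition to bounds, which is the kind of information the abstract says is obtained from the Hermite Normal Form.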

  • Performance of barrier synchronization methods in a multiaccess network

    Publication Year: 1995 , Page(s): 890 - 895
    Cited by:  Papers (1)

    Barrier synchronization is a commonly used primitive in parallel processing. In this paper, we present different algorithms for barrier synchronization on the widely prevalent multiaccess bus network, and derive analytical performance metrics for each of the proposed schemes, which are then compared against simulation results.

  • Runtime support and compilation methods for user-specified irregular data distributions

    Publication Year: 1995 , Page(s): 815 - 831
    Cited by:  Papers (13)

    This paper describes two new ideas by which a High Performance Fortran compiler can deal with irregular computations effectively. The first mechanism invokes a user-specified mapping procedure via a set of proposed compiler directives. The directives allow use of program arrays to describe graph connectivity, spatial location of array elements, and computational load. The second mechanism is a conservative method for compiling irregular loops in which dependences arise only from reduction operations. This mechanism in many cases enables a compiler to recognize that it is possible to reuse previously computed information from inspectors (e.g., communication schedules, loop iteration partitions, and information that associates off-processor data copies with on-processor buffer locations). This paper also presents performance results for these mechanisms from a Fortran 90D compiler implementation.


Aims & Scope

IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers.

Meet Our Editors

Editor-in-Chief
David Bader
College of Computing
Georgia Institute of Technology