Parallel and Distributed Systems, IEEE Transactions on

Issue 1 • Date Jan 1992

  • Accuracy of memory reference traces of parallel computations in trace-driven simulation

    Page(s): 97 - 109

    For a given input, the global trace generated by a parallel program in a shared-memory multiprocessing environment may change as the memory architecture and management policies change. A method is proposed for ensuring that a correct global trace is generated in the new environment. This method involves a new characterization of a parallel program that identifies its address change points and address affecting points. An extension of traditional process traces, called the intrinsic trace of each process, is developed. The intrinsic traces maximize the decoupling of program execution from simulation by describing the address flow graph and path expressions of each process program. At each point where an address is issued, the trace-driven simulator uses the intrinsic traces and the sequence of loads and stores before the current cycle to determine the next address. The mapping between load and store sequences and the next addresses to issue sometimes requires partial program reexecution. Programs that do not require partial program reexecution are called graph-traceable.

  • Multiprocessor implementation of digital filtering algorithms using a parallel block processing method

    Page(s): 110 - 120

    An efficient real-time implementation of digital filtering algorithms using a multiprocessor system in a ring network is investigated. This method is based on a parallel block processing approach, where continuously supplied input data are divided into blocks, and the blocks are processed concurrently by being assigned to the processors in the system. This approach requires only a simple interconnection network and significantly reduces the number of communications among the processors, making the system easily expandable and highly efficient. In addition, various digital signal processing algorithms can be implemented on the same multiprocessor system. The data dependency of the blocks to be processed concurrently causes dependency problems between the processors. A systematic scheduling method has been developed by using a precedence graph for the analysis of the dependency relation. Methods for solving the dependency problems between the processors are also investigated. Implementation procedures and results for FIR, recursive, and adaptive filtering algorithms are illustrated.
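The block-processing idea can be illustrated with a minimal Python sketch for the FIR case. This is not the paper's scheduling method: the function names are hypothetical, a thread pool stands in for the ring of processors, and each block is assumed to read the few input samples that precede it.

```python
from concurrent.futures import ThreadPoolExecutor


def fir_direct(x, h):
    """Sequential reference: y[n] = sum_k h[k] * x[n-k], with x[n] = 0 for n < 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_block(x, h, start, size):
    """Compute the outputs y[start : start+size] for one block.

    A block only reads inputs, including the len(h)-1 samples before
    `start`, so FIR blocks do not depend on each other's outputs.
    """
    end = min(start + size, len(x))
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(start, end)]


def fir_parallel_blocks(x, h, block_size):
    """Split the input into blocks and filter them concurrently."""
    starts = range(0, len(x), block_size)
    with ThreadPoolExecutor() as pool:
        # map() yields results in block order, so concatenation is safe.
        parts = pool.map(lambda s: fir_block(x, h, s, block_size), starts)
        return [y for part in parts for y in part]


x = list(range(1, 11))
h = [1, 2, 3]
print(fir_parallel_blocks(x, h, block_size=4) == fir_direct(x, h))
```

A recursive filter computes each output from earlier outputs, so its blocks cannot be computed independently in this way; that is the kind of inter-block dependency the abstract refers to.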

  • A dynamic information-structure mutual exclusion algorithm for distributed systems

    Page(s): 121 - 125

    A dynamic information-structure mutual exclusion algorithm is presented for distributed systems whose information structure evolves with time as sites learn about the state of the system through messages. An interesting feature of the algorithm is that it adapts itself to heterogeneous or fluctuating traffic conditions to optimize performance (the number of messages exchanged). The performance of the algorithm is studied by simulation and compared to the performance of a well-known mutual exclusion algorithm. The impact of message loss and site failures on the algorithm is discussed, and methods to tolerate these failures are proposed.

  • Bused hypercubes and other pin-optimal networks

    Page(s): 14 - 24

    Pin minimization is an important issue for massively parallel architectures because the number of processing elements that can be placed on a chip, board, or chassis is often pin limited. A d-dimensional bused hypercube interconnection network is presented that allows nodes to simultaneously (in one clock tick) exchange data across any dimension using only d+1 ports per node rather than 2d. Despite this near two-to-one reduction, the network also allows nodes that are two dimensions apart to simultaneously exchange data; as a result, certain routings can be performed in nearly half the time. The network is shown to be a special case of a general construction in which any set of d permutations can be performed, in one clock tick, using only d+1 ports per node. A lower-bound technique is also presented and used to establish the optimality of the network, as well as that of several other new bused networks.

  • A VLSI constant geometry architecture for the fast Hartley and Fourier transforms

    Page(s): 58 - 70

    An application-specific architecture for the parallel calculation of the decimation-in-time, radix-2 fast Hartley (FHT) and Fourier (FFT) transforms is presented. A real sequence with N = 2^n data items is considered as input. The system calculates the FHT and the FFT in n and n+1 stages, respectively. The modular and regular parallel architecture is based on a constant geometry algorithm using butterflies of four data items and the perfect unshuffle permutation. With this permutation, the mapping of the algorithm in VLSI technology is simplified and the communications among processors are minimized. Organization of the processor memory based on first-in, first-out (FIFO) queues facilitates a systolic data flow and permits a direct implementation of the complex data movements and address sequences of the transforms. This is accomplished by means of simple multiplexing operations, using hardwired control. The total calculation time is (N log2 N)/(4Q) cycles for the FHT and N(1 + log2 N)/(4Q) cycles for the FFT, where Q is the number of processors (Q = 2^q, Q ≤ N/4).
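The cycle counts quoted in this abstract are easy to tabulate. The Python sketch below only evaluates the two formulas; the function name is hypothetical, and the bound Q ≤ N/4 that it enforces is this sketch's reading of the processor-count constraint, which is an assumption.

```python
def transform_cycles(n_points, n_procs):
    """Return (fht_cycles, fft_cycles) from the formulas in the abstract.

    FHT: (N * log2 N) / (4Q)        FFT: N * (1 + log2 N) / (4Q)
    """
    def is_power_of_two(v):
        return v >= 1 and v & (v - 1) == 0

    if not (is_power_of_two(n_points) and is_power_of_two(n_procs)):
        raise ValueError("N and Q must both be powers of two")
    if 4 * n_procs > n_points:
        # Assumed constraint: at most one processor per four data items.
        raise ValueError("this sketch assumes Q <= N/4")

    stages = n_points.bit_length() - 1          # log2 N for a power of two
    per_stage = n_points // (4 * n_procs)       # exact, since Q <= N/4
    return per_stage * stages, per_stage * (stages + 1)


print(transform_cycles(1024, 16))  # (160, 176)
```

Doubling Q halves both counts, and the FFT always costs one extra stage of N/(4Q) cycles over the FHT.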

  • Efficient processor assignment algorithms and loop transformations for executing nested parallel loops on multiprocessors

    Page(s): 71 - 82

    An important issue for the efficient use of multiprocessor systems is the assignment of parallel processors to nested parallel loops. It is desirable for a processor assignment algorithm to be fast and always generate an optimal processor assignment. The paper proposes two efficient algorithms to decide the optimal number of processors assigned to each individual loop. Efficient parallel counterparts of these two algorithms are also presented. These algorithms not only always generate an optimal processor assignment, but are also much faster than the existing optimal algorithm in the literature. The paper discusses improving the performance of parallel execution by transforming a nested parallel loop into a semantically equivalent one. Three loop transformations are investigated. It is observed that, in most cases, the parallel execution time is improved after applying these transformations.
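To make the processor assignment problem concrete, here is a brute-force Python sketch. It is not one of the paper's algorithms, and the cost model is an assumption made for illustration: loop level i has N_i iterations and receives p_i processors, the product of the p_i may not exceed the machine size, and the execution time is taken to be the product of ceil(N_i / p_i).

```python
from itertools import product
from math import ceil, prod


def best_assignment(iterations, total_procs):
    """Exhaustively search per-loop processor counts (assumed cost model).

    Returns (time, procs), where procs[i] is the number of processors
    given to loop level i and time = prod(ceil(N_i / p_i)).
    """
    best = None
    for procs in product(range(1, total_procs + 1), repeat=len(iterations)):
        if prod(procs) > total_procs:
            continue  # this assignment needs more processors than exist
        time = prod(ceil(n / p) for n, p in zip(iterations, procs))
        if best is None or time < best[0]:
            best = (time, procs)
    return best


# Two nested parallel loops of 4 and 6 iterations on 8 processors.
print(best_assignment((4, 6), 8))  # (3, (4, 2))
```

The search space grows exponentially with the nesting depth, which is why a fast algorithm that is still guaranteed to be optimal is of interest.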

  • A hybrid scheme for processing data structures in a dataflow environment

    Page(s): 83 - 96

    The asynchronous nature of the dataflow model of computation allows the exploitation of maximum inherent parallelism in many application programs. However, before the dataflow model of computation can become a viable alternative to the control flow model of computation, one has to find practical solutions to some problems, such as efficient handling of data structures. The paper introduces a new model for handling data structures in a dataflow environment. The proposed model combines the constant-time access capabilities of vectors with the flexibility inherent in the concept of pointers. This allows a careful balance between copying and sharing to optimize the storage and processing overhead incurred during operations on data structures. The model is compared by simulation to other data structure models proposed in the literature, and the results are good.

  • A processor-time-minimal systolic array for cubical mesh algorithms

    Page(s): 4 - 13

    Using a directed acyclic graph (DAG) model of algorithms, the paper focuses on time-minimal multiprocessor schedules that use as few processors as possible. Such a processor-time-minimal scheduling of an algorithm's DAG is first illustrated using a triangular-shaped 2-D directed mesh (representing, for example, an algorithm for solving a triangular system of linear equations). Then, algorithms represented by an n×n×n directed mesh are investigated. This cubical directed mesh is fundamental; it represents the standard algorithm for computing a matrix product as well as many other algorithms. Completion of the cubical mesh requires 3n-2 steps. It is shown that the number of processing elements needed to achieve this time bound is at least ⌈3n²/4⌉. A systolic array for the cubical directed mesh is then presented. It completes the mesh using the minimum number of steps and exactly ⌈3n²/4⌉ processing elements; it is processor-time-minimal. The systolic array's topology is that of a hexagonally shaped, cylindrically connected, 2-D directed mesh.
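The 3n-2 step count and the processor lower bound can be checked numerically for small n. The Python sketch below assumes that node (i, j, k) of the mesh depends on (i-1, j, k), (i, j-1, k), and (i, j, k-1); the function name is hypothetical. Under that assumption every node lies on a longest path, so a time-minimal schedule must run node (i, j, k) at step i + j + k, and the busiest step gives a lower bound on the number of processors.

```python
from collections import Counter
from itertools import product


def cubical_mesh_schedule(n):
    """Return (steps, peak) for the earliest-start schedule of an n x n x n mesh.

    steps: number of time steps needed to complete the mesh
    peak:  largest number of nodes that must execute in the same step
    """
    # The earliest (0-based) step of node (i, j, k) is i + j + k.
    per_step = Counter(i + j + k for i, j, k in product(range(n), repeat=3))
    return max(per_step) + 1, max(per_step.values())


for n in range(1, 7):
    steps, peak = cubical_mesh_schedule(n)
    bound = -(-3 * n * n // 4)  # integer ceiling of 3n^2 / 4
    print(n, steps, 3 * n - 2, peak, bound)
```

For each n printed, the second and third columns agree, as do the fourth and fifth. The sketch only checks the bound; it does not construct the systolic array that achieves it.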

  • Extended hypercube: a hierarchical interconnection network of hypercubes

    Page(s): 45 - 57

    A new interconnection topology, the extended hypercube, consisting of an interconnection network of k-cubes, is discussed. The extended hypercube is a hierarchical, expansive, recursive structure with a constant predefined building block. The extended hypercube retains the positive features of the k-cube at different levels of hierarchy and at the same time has some additional advantages, such as reduced diameter and constant degree of a node. The paper presents an introduction to the topology of the extended hypercube and analyzes its architectural potential in terms of message routing and executing a class of highly parallel algorithms. Topological properties and performance studies of the extended hypercube are presented.

  • Design and analysis of a scalable cache coherence scheme based on clocks and timestamps

    Page(s): 25 - 44

    A timestamp-based software-assisted cache coherence scheme that does not require any global communication to enforce the coherence of multiple private caches is proposed. It is intended for shared-memory multiprocessors. The scheme is based on a compile-time marking of references and a hardware-based local incoherence detection scheme. The possible incoherence of a cache entry is detected and the associated entry is implicitly invalidated by comparing a clock (related to program flow) and a timestamp (related to the time of update in the cache). Results of a performance comparison, based on a trace-driven simulation using actual traces, between the proposed timestamp-based scheme and other software-assisted schemes indicate that the proposed scheme performs significantly better than previous software-assisted schemes, especially when the processors are carefully scheduled so as to maximize the reuse of cache contents. This scheme requires neither a shared resource nor global communication and is, therefore, scalable up to a large number of processors.

Aims & Scope

IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers.

Meet Our Editors

Editor-in-Chief
David Bader
College of Computing
Georgia Institute of Technology