By Topic

Parallel and Distributed Systems, IEEE Transactions on

Issue 4 • Date Apr 1999

Filter Results

Displaying Results 1 - 6 of 6
  • Multiple multicast with minimized node contention on wormhole k-ary n-cube networks

    Page(s): 371 - 393
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (1864 KB)  

    This paper presents a new approach to minimize node contention while performing multiple multicast/broadcast on wormhole k-ary n-cube networks with overlapped destination sets. The existing multicast algorithms in the literature deliver poor performance under multiple multicast because these algorithms have been designed with only single multicast in mind. The new algorithms introduced in this paper do not use any global knowledge about the respective destination sets of the concurrent multicasts. Instead, only local information and a source-specific partitioning approach are used. For systems supporting unicast message-passing, a new SPUmesh (Source-Partitioned Umesh) algorithm is proposed and is shown to be superior than the conventional Umesh algorithm for multiple multicast. Two different algorithms, SQHL (Source-Quadrant Hierarchical Leader) and SCHL (Source-Centered Hierarchical Leader), are proposed for systems with multidestination message-passing and shown to be superior than the HL scheme. All of these algorithms perform 1) 5-10 times faster than the existing algorithms under multiple multicast and 2) as fast as existing algorithms under single multicast. Furthermore, the SCHL scheme demonstrates that the latency of multiple multicast can, in fact, be reduced as the degree of multicast increases beyond a certain number. Thus, these algorithms demonstrate significant potential to be used for designing fast and scalable collective communication libraries on current and future generation wormhole systems View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Minimizing the maximum delay for reaching consensus in quorum-based mutual exclusion schemes

    Page(s): 337 - 345
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (264 KB)  

    The use of quorums is a well-known approach to achieving mutual exclusion in distributed computing systems. This approach works based on a coterie, a special set of node groups where any pair of the node groups shares at least one common node. Each node group in a coterie is called a quorum. Mutual exclusion is ensured by imposing that a node gets consensus from all nodes in at least one of the quorums before it enters a critical section. In a quorum-based mutual exclusion scheme, the delay for reaching consensus depends critically on the coterie adopted and, thus, it is important to find a coterie with small delay. Fu (1997) introduced two related measures called max-delay and mean-delay. The former measure represents the largest delay among all nodes, while the latter is the arithmetic mean of the delays. She proposed polynomial-time algorithms for finding max-delay and mean-delay optimal coteries when the network topology is a tree or a ring. In this paper, we first propose a polynomial-time algorithm for finding max-delay optimal coteries and, then, modify the algorithm so as to reduce the mean-delay of generated coteries. Unlike the previous algorithms, the proposed algorithms can be applied to systems with arbitrary topology View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Tree-based parallel load-balancing methods for solution-adaptive finite element graphs on distributed memory multicomputers

    Page(s): 360 - 370
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (428 KB)  

    To solve the load imbalance problem of a solution-adaptive finite element application program on a distributed memory multicomputer, nodes of a refined finite element graph can be remapped to processors or load of a refined finite element graph can be redistributed based on the current load of each processor. For the former case, remapping can be performed by some fast mapping algorithms. For the latter case, a load-balancing algorithm can be applied to balance the computational load of each processor. In this paper, three tree-based parallel load-balancing methods, the MCSTLB method, the BTLB method, and the CBTLB method, were proposed to deal with the load imbalance problems of solution-adaptive finite element application programs. To evaluate the performance of the proposed methods, we have implemented those methods along with three mapping methods, the AE/ORB method, the AE/MC method, and the MLkP method, on an SP2 parallel machine. Three criteria, the execution time of mapping/load-balancing methods, the execution time of a solution-adaptive finite element application program under different mapping/load-balancing methods, and the speedups achieved by mapping/load-balancing methods for a solution-adaptive finite element application program, are used for the performance evaluation. The experimental results show that 1) if the initial mapping is performed by a mapping method and the same mapping method and load-balancing methods were used in each refinement to balance the load of processors, the execution time of an application program under a load-balancing method is always shorter than that of the mapping method, and 2) the execution time of an application program under the CBTLB method is shorter than that of the BTLB method and the MCSTLB method View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • On parallelizing the multiprocessor scheduling problem

    Page(s): 414 - 431
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (744 KB)  

    Existing heuristics for scheduling a node and edge weighted directed task graph to multiple processors can produce satisfactory solutions but incur high time complexities, which tend to exacerbate in more realistic environments with relaxed assumptions. Consequently, these heuristics do not scale well and cannot handle problems of moderate sizes. A natural approach to reducing complexity, while aiming for a similar or potentially better solution, is to parallelize the scheduling algorithm. This can be done by partitioning the task graphs and concurrently generating partial schedules for the partitioned parts, which are then concatenated to obtain the final schedule. The problem, however, is nontrivial as there exists dependencies among the nodes of a task graph which must be preserved for generating a valid schedule. Moreover, the time clock for scheduling is global for all the processors (that are executing the parallel scheduling algorithm), making the inherent parallelism invisible. In this paper, we introduce a parallel algorithm that is guided by a systematic partitioning of the task graph to perform scheduling using multiple processors. The algorithm schedules both the tasks and messages, and is suitable for graphs with arbitrary computation and communication costs, and is applicable to systems with arbitrary network topologies using homogeneous or heterogeneous processors. We have implemented the algorithm on the Intel Paragon and compared it with three closely related algorithms. The experimental results indicate that our algorithm yields higher quality solutions while using an order of magnitude smaller scheduling times. The algorithm also exhibits an interesting trade-off between the solution quality and speedup while scaling well with the problem size View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Control flow prediction schemes for wide-issue superscalar processors

    Page(s): 346 - 359
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (716 KB)  

    In order to achieve high performance, wide-issue superscalar processors have to fetch a large number of instructions per cycle. Conditional branches are the primary impediment to increasing the fetch bandwidth because they can potentially alter the flow of control and are very frequent. To overcome this problem, these processors need to predict the outcome of multiple branches in a cycle. This paper investigates two control flow prediction schemes that predict the effective outcome of multiple branches with the help of a single prediction. Instead of considering branches as the basic units of prediction, these schemes consider subgraphs of the control flow graph of the executed program as the basic units of prediction and predict the target of an entire subgraph at a time, thereby allowing the superscalar fetch mechanism to go past multiple branches in a cycle. The first control flow prediction scheme investigated considers sequential block-like subgraphs and the second scheme considers tree-like subgraphs to make the control flow predictions. Both schemes do a 1-out-of-4 prediction as opposed to the 1-out-of-2 prediction done by branch-level prediction schemes. These two schemes are evaluated using a MIPS ISA-based 12-way superscalar microarchitecture. An improvement in effective fetch size of approximately 25 percent and 50 percent, respectively, is observed over identical microprocessors that use branch-level prediction. No appreciable difference in the prediction accuracy was observed, although the control flow prediction schemes predicted 1-out-of-4, outcomes View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • A general interprocedural framework for placement of split-phase large latency operations

    Page(s): 394 - 413
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (860 KB)  

    Overlapping split-phase large latency operations with computations is a standard technique for improving performance on modern architectures. In this paper, we present a general interprocedural technique for overlapping such accesses with computation. We have developed an Interprocedural Balanced Code Placement (IBCP) framework, which performs analysis on arbitrary recursive procedures and arbitrary control flow and replaces synchronous operations with a balanced pair of asynchronous operations. We have evaluated this scheme in the context of overlapping I/O operations with computation. We demonstrate how this analysis is useful for applications which perform frequent and large accesses to disks, including applications which snapshot or checkpoint their computations or out-of-core applications View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.

Aims & Scope

IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
David Bader
College of Computing
Georgia Institute of Technology