
Parallel and Distributed Systems, IEEE Transactions on

Issue 3 • Date Mar 1999

  • Performance-based path determination for interprocessor communication in distributed computing systems

    Publication Year: 1999 , Page(s): 316 - 327
    Cited by:  Papers (3)

    The different types of messages used by a parallel application program executing in a distributed computing system can each have unique characteristics, so no single communication network can produce the lowest latency for all messages. For instance, short control messages may be sent with the lowest overhead on one type of network, such as Ethernet, while bulk data transfers may be better suited to a different type of network, such as Fibre Channel or HIPPI. This work investigates how to exploit multiple heterogeneous communication networks that interconnect the same set of processing nodes using a set of techniques we call performance-based path determination (PBPD). The performance-based path selection (PBPS) technique selects the best (lowest latency) network among several for each individual message to reduce the communication overhead of parallel programs. The performance-based path aggregation (PBPA) technique, on the other hand, aggregates multiple networks into a single virtual network to increase the available bandwidth. We test the PBPD techniques on a cluster of SGI multiprocessors interconnected with Ethernet, Fibre Channel, and HIPPI networks using a custom communication library built on top of the TCP/IP protocol layers. We find that PBPS can reduce communication overhead in applications compared to using either network alone, while aggregating networks into a single virtual network can reduce communication latency for bandwidth-limited applications. The performance of the PBPD techniques depends on the mix of message sizes in the application program and the relative overheads of the networks, as demonstrated in our analytical models.
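
    The per-message selection idea behind PBPS can be sketched roughly as follows. This is a minimal illustration, not the paper's library: it assumes a linear latency model (fixed overhead plus size divided by bandwidth), and the `Network` class, `select_network` function, and all numeric parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    name: str
    overhead_us: float      # assumed fixed per-message cost, in microseconds
    bandwidth_mb_s: float   # assumed sustained bandwidth; 1 MB/s = 1 byte/us

    def latency_us(self, size_bytes: int) -> float:
        # Assumed linear model: fixed overhead plus transfer time.
        return self.overhead_us + size_bytes / self.bandwidth_mb_s


def select_network(networks, size_bytes):
    """Return the network with the lowest modeled latency for this message."""
    return min(networks, key=lambda net: net.latency_us(size_bytes))


# Hypothetical parameters, chosen only so that a crossover point exists.
NETWORKS = [
    Network("ethernet", overhead_us=100.0, bandwidth_mb_s=1.0),
    Network("hippi", overhead_us=500.0, bandwidth_mb_s=50.0),
]

for size in (64, 1_000_000):
    print(size, select_network(NETWORKS, size).name)
```

    With these made-up numbers the short message is routed over the low-overhead network and the bulk transfer over the high-bandwidth one, which is the trade-off the abstract describes.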

  • An application-driven study of parallel system overheads and network bandwidth requirements

    Publication Year: 1999 , Page(s): 193 - 210
    Cited by:  Papers (4)

    Evaluating and analyzing the performance of a parallel application on an architecture to explain the disparity between projected and delivered performance is an important aspect of parallel systems research. However, conducting such a study is hard due to the vast design space of these systems. We study two important aspects related to the performance of parallel applications on shared memory parallel architectures. First, we quantify overheads observed during the execution of these applications on three different simulated architectures. We next use these results to synthesize the bandwidth requirements for the applications with respect to different network topologies. This study is performed using an execution-driven simulation tool called SPASM, which provides a way of isolating and quantifying the different parallel system overheads in a nonintrusive manner. The first exercise shows that in shared memory machines with private caches, as long as the applications are well-structured to exploit locality, the key determinant of performance is network contention. The second exercise quantifies the network bandwidth needed to minimize the effect of network contention. Specifically, it is shown that for the applications considered, as long as the problem sizes are increased commensurate with the system size, current network technologies supporting 200-300 MBytes/sec link bandwidth are sufficient to keep the network overheads (such as latency and contention) within acceptable bounds.

  • Tighter bounds on full access probability in fault-tolerant multistage interconnection networks

    Publication Year: 1999 , Page(s): 328 - 335
    Cited by:  Papers (3)

    This paper proposes a cut-based technique to compute bounds on the full access probability of an extra stage shuffle exchange network (ESEN) and a wrap-around inverse banyan network (WIBN). Note that the problem of finding an exact full access probability is known to be NP-hard. Our technique yields tighter bounds than existing techniques. For small multistage interconnection networks, the bounds deviate less from the exact value. We also observe that our proposed lower bound is conservative. Further, the lower bound is important because it indicates that a network is at least that reliable.

  • Safety and reliability driven task allocation in distributed systems

    Publication Year: 1999 , Page(s): 238 - 251
    Cited by:  Papers (45)

    Distributed computer systems are increasingly being employed for critical applications, such as aircraft control, industrial process control, and banking systems. Maximizing performance has been the conventional objective in the allocation of tasks for such systems. Inherently, distributed systems are more complex than centralized systems. The added complexity could increase the potential for system failures. Some work has been done in the past in allocating tasks to distributed systems, considering reliability as the objective function to be maximized. Reliability is defined to be the probability that none of the system components fails while processing. This, however, does not give any guarantees as to the behavior of the system when a failure occurs. A failure, not detected immediately, could lead to a catastrophe. Such systems are unsafe. In this paper, we describe a method to determine an allocation that introduces safety into a heterogeneous distributed system and at the same time attempts to maximize its reliability. First, we devise a new heuristic, based on the concept of clustering, to allocate tasks for maximizing reliability. We show that for task graphs with precedence constraints, our heuristic performs better than previously proposed heuristics. Next, by applying the concept of task-based fault tolerance, which we have previously proposed, we add extra assertion tasks to the system to make it safe. We present a new heuristic that does this in such a way that the decrease in reliability for the added safety is minimized. For the purpose of allocating the extra tasks, this heuristic performs as well as previously known methods and runs an order of magnitude faster. We present a number of simulation results to demonstrate the efficacy of our scheme.

  • Synthesizing efficient out-of-core programs for block recursive algorithms using block-cyclic data distributions

    Publication Year: 1999 , Page(s): 297 - 315

    In this paper, we present a framework for synthesizing I/O efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform (FFT) and block matrix transposition algorithms. Our framework uses an algebraic representation which is based on tensor products and other matrix operations. The programs are optimized for the striped Vitter and Shriver's two-level memory model in which data can be distributed using various cyclic(B) distributions in contrast to the normally used physical track distribution cyclic(B_d), where B_d is the physical disk block size. We first introduce tensor bases to capture the semantics of block-cyclic data distributions of out-of-core data and also data access patterns to out-of-core data. We then present program generation techniques for tensor products and matrix transposition. We accurately represent the number of parallel I/O operations required for the synthesized programs for tensor products and matrix transposition as a function of tensor bases and data distributions. We introduce an algorithm to determine the data distribution which optimizes the performance of the synthesized programs. Further, we formalize the procedure of synthesizing efficient out-of-core programs for tensor product formulas with various block-cyclic distributions as a dynamic programming problem. We demonstrate the effectiveness of our approach through several examples. We show that the choice of an appropriate data distribution can reduce the number of passes to access out-of-core data by as much as a factor of eight for a tensor product, and that the dynamic programming approach can substantially reduce the number of passes to access out-of-core data for the overall tensor product formulas.
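
    For readers unfamiliar with the cyclic(B) notation, the sketch below shows the usual block-cyclic mapping of a linear array onto disks: consecutive blocks of B elements are dealt out to the disks in round-robin order. It only illustrates the distribution itself, not the paper's tensor-basis framework; the function names and the example sizes are hypothetical.

```python
def cyclic_disk(index, block, disks):
    """Disk holding element `index` under a cyclic(block) distribution."""
    return (index // block) % disks


def layout(n_elements, block, disks):
    """List, per disk, the element indices it stores under cyclic(block)."""
    per_disk = [[] for _ in range(disks)]
    for i in range(n_elements):
        per_disk[cyclic_disk(i, block, disks)].append(i)
    return per_disk


# Hypothetical example: 8 elements over 2 disks with two block sizes.
print(layout(8, block=1, disks=2))
print(layout(8, block=2, disks=2))
```

    Changing B changes which elements can be fetched together in one parallel I/O operation, which is why the choice of distribution affects the number of passes over the data.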

  • Embedding and reconfiguration of spanning trees in faulty hypercubes

    Publication Year: 1999 , Page(s): 211 - 222
    Cited by:  Papers (8)

    The problem of tolerating faulty nodes in hypercubes has been studied by many researchers either by using spares or by reconfiguration. Algorithms for tolerating faulty nodes and links in hypercubes are presented. The algorithms are based on using general spanning trees (GST), complete unbalanced spanning trees (CUST), and balanced spanning trees (BST) for reconfiguring the hypercube to avoid faulty nodes and links. The algorithms contain two phases: the first phase involves the construction of the spanning tree and the second one is for reconfiguring the hypercube should a faulty node be detected. The reconfiguration process consists of two basic steps. First, the faulty node is disconnected from the spanning tree. Then, a new spanning tree is constructed by reconnecting the children of the faulty node to the spanning tree. One hundred percent single fault correction (avoidance) and almost 100 percent fault correction (avoidance) of double and triple faults are achieved by the proposed algorithms for hypercubes having a dimension of n⩾6. Simulation results for the algorithm under more than three faults are also presented. For any k faulty nodes (1⩽k⩽2n-1), the reconfiguration algorithm may be applied k times to avoid these k faulty nodes as long as no combination of any two of the faults results in a blocking situation. The proposed reconfiguration algorithms tolerate all possible single-link faults. The reconfiguration algorithms are extended to tolerate (k⩽n-1) multiple faults, causing blocking situation, with a backtracking
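
    The two reconfiguration steps described above (disconnect the faulty node, then reconnect its children) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's GST, CUST, or BST algorithms: it uses a plain breadth-first spanning tree, assumes the root is never faulty, and reattaches each orphaned child to any healthy hypercube neighbor that is still connected to the root. All function names are hypothetical.

```python
from collections import deque


def neighbors(node, n):
    """Hypercube neighbors differ from `node` in exactly one bit."""
    return [node ^ (1 << d) for d in range(n)]


def bfs_spanning_tree(n, root=0):
    """Build a spanning tree of the n-cube as a child -> parent map."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in neighbors(u, n):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent


def reaches_root(node, parent, root):
    """True if following parent links from `node` ends at `root`."""
    while node is not None and node in parent:
        if node == root:
            return True
        node = parent[node]
    return False


def reconfigure(parent, faulty, n, root=0):
    """Remove a faulty non-root node and reattach its children.

    Returns the children that could not be reattached (blocked).
    """
    if faulty == root:
        raise ValueError("this sketch does not handle a faulty root")
    orphans = [c for c, p in parent.items() if p == faulty]
    del parent[faulty]  # step 1: disconnect the faulty node
    progress = True
    while orphans and progress:  # step 2: reconnect its children
        progress = False
        for child in list(orphans):
            for cand in neighbors(child, n):
                # The new parent must be healthy and already tied to the
                # root, which also rules out the child's own subtree.
                if cand in parent and reaches_root(cand, parent, root):
                    parent[child] = cand
                    orphans.remove(child)
                    progress = True
                    break
    return orphans


tree = bfs_spanning_tree(4)
blocked = reconfigure(tree, faulty=1, n=4)
print(blocked, len(tree))
```

    A non-empty return value corresponds loosely to the blocking situation mentioned in the abstract, where a child has no usable neighbor; the sketch makes no attempt at the backtracking extension.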

  • Predictable threads for dynamic, hard real-time environments

    Publication Year: 1999 , Page(s): 281 - 296
    Cited by:  Papers (5)

    Next-generation, hard real-time systems will require new, flexible functionality and guaranteed, predictable performance. This paper describes the UMass Spring threads package, designed specifically for multiprocessing in dynamic, hard real-time environments. This package is unique because of its support for new thread semantics for real-time processing. Predictable creation and execution of threads is achieved because of an underlying predictable kernel, the UMass Spring kernel. Design decisions and lessons learned while implementing the threads package are presented. Measurements affirm the predictability of this implementation on a representative multiprocessor platform. The adoption of the threads package in the UMass Spring kernel results in additional performance improvements, which include reduced context switching overhead and reduced average-case memory access durations.

  • The mesh with hybrid buses: an efficient parallel architecture for digital geometry

    Publication Year: 1999 , Page(s): 266 - 280
    Cited by:  Papers (4)

    The first main contribution of this work is to propose an efficient VLSI architecture obtained by augmenting the Mesh with Multiple Broadcasting (MMB) with precharged 1-bit row and column buses. The new architecture, which we call Mesh with Hybrid Buses (MHB for short), is realizable in VLSI with no increase in the area or the wiring complexity of the MMB chip. Our second main contribution is to show that the MHB is extremely well-suited for solving an entire slew of digital geometry tasks. The MHB is not a reconfigurable architecture. Yet, quite remarkably, for a large number of fundamental digital geometry tasks, the MHB offers a level of performance previously attained only by reconfigurable architectures. Specifically, with a digital image pretiled onto an MHB of size √n × √n, one pixel per processor, we show that the following problems can be solved in O(log n) time: computing the convex hull of the image; computing the diameter and the width of the image; deciding whether a set of digital points is a digital line; computing the maximum distance between two images; deciding whether two images are linearly separable; and computing several moments and low-level descriptors of the image, including the perimeter, area, center, and median row of its convex hull. By contrast, the fastest possible algorithms for the problems above on the MMB run in Θ(n^{1/6}) time. Finally, we go on to show that, with minor changes, our algorithms can be implemented to run within cost-optimality on an MHB of size √n/log n × √n/log n.

  • Constructing a reliable test&set bit

    Publication Year: 1999 , Page(s): 252 - 265

    The problem of computing with faulty shared bits is addressed. The focus is on constructing a reliable test&set bit from a collection of test&set bits of which some may be faulty. Faults are modeled by allowing operations on the faulty bits to return a special distinguished value, signaling that the operation may not have taken place. Such faults are called omission faults. Some of the constructions are required to be gracefully degrading for omission. That is, if the bound on the number of component bits which fail is exceeded, the constructed bit may suffer faults, but only faults which are no more severe than those of the components; and the constructed bit behaves as intended if the number of component bits which fail does not exceed that bound. Several efficient constructions are presented, and bounds on the space required are given. Our constructions for omission faults also apply to other fault models.

  • Fault-free Hamiltonian cycles in faulty arrangement graphs

    Publication Year: 1999 , Page(s): 223 - 237
    Cited by:  Papers (56)

    The arrangement graph A_{n,k}, which is a generalization of the star graph (n-k=1), presents more flexibility than the star graph in adjusting the major design parameters: number of nodes, degree, and diameter. Previously, the arrangement graph has been proved Hamiltonian. In this paper, we further show that the arrangement graph remains Hamiltonian even if it is faulty. Let |F_e| and |F_v| denote the numbers of edge faults and vertex faults, respectively. We show that A_{n,k} is Hamiltonian when 1) (k=2 and n-k⩾4, or k⩾3 and n-k⩾4+[k/2]), and |F_e|⩽k(n-k)-2, or 2) k⩾2, n-k⩾2+[k/2], and |F_e|⩽k(n-k-3)-1, or 3) k⩾2, n-k⩾3, and |F_e|⩽k, or 4) n-k⩾3 and |F_v|⩽n-3, or 5) n-k⩾3 and |F_v|+|F_e|⩽k. In addition, for A_{n,k} with n-k=2, we construct a cycle of length at least 1) [n!/(n-k)!]-2 if |F_e|⩽k-1, or 2) [n!/(n-k)!]-|F_v|-2(k-1) if |F_v|⩽k-1, or 3) [n!/(n-k)!]-|F_v|-2(k-1) if |F_e|+|F_v|⩽k-1, where [n!/(n-k)!] is the number of nodes in A_{n,k}.
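
    The node count quoted above can be checked on a small instance. The sketch below assumes the standard definition of the arrangement graph, which the abstract does not spell out: nodes are the arrangements of k symbols chosen from 1..n, and two nodes are adjacent when they differ in exactly one position. Under that definition every node has degree k(n-k). The function name and the choice n=5, k=2 are for illustration only.

```python
from itertools import permutations
from math import factorial


def arrangement_graph(n, k):
    """Adjacency map of A_{n,k} under the assumed standard definition."""
    adj = {}
    for u in permutations(range(1, n + 1), k):
        used = set(u)
        nbrs = []
        for pos in range(k):
            for sym in range(1, n + 1):
                # Replace one position with a symbol not already present.
                if sym not in used:
                    nbrs.append(u[:pos] + (sym,) + u[pos + 1:])
        adj[u] = nbrs
    return adj


n, k = 5, 2
graph = arrangement_graph(n, k)
node_count = len(graph)
expected_nodes = factorial(n) // factorial(n - k)
degrees = {len(nbrs) for nbrs in graph.values()}
print(node_count, expected_nodes, degrees)
```

    Brute-force enumeration like this is only practical for small n and k, since the node count grows as n!/(n-k)!.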


Aims & Scope

IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers.

Meet Our Editors

Editor-in-Chief
David Bader
College of Computing
Georgia Institute of Technology