
IEEE Transactions on Parallel and Distributed Systems

Issue 12 • December 2001

  • Author index

    Publication Year: 2001, Page(s): 1332 - 1335
    Freely available from IEEE
  • Subject index

    Publication Year: 2001, Page(s): 1335 - 1341
    Freely available from IEEE
  • Compiler-assisted multiple instruction word retry for VLIW architectures

    Publication Year: 2001, Page(s): 1293 - 1304
    Cited by: Papers (1)

    Very Long Instruction Word (VLIW) architectures can enhance performance by exploiting fine-grained instruction-level parallelism. In this paper, we describe a compiler-assisted multiple instruction word retry scheme for VLIW architectures. A read buffer is used to resolve the more frequent on-path hazards, while the compiler resolves the remaining branch hazards. Performance evaluation is described for 11 benchmark programs based on the IBM VLIW research compiler, Chameleon. Experimental results indicate that, for a VLIW machine with P functional units to roll back N instruction words, a read buffer of 2NP entries with the compiler assist can be an effective approach, producing low runtime overhead and small code growth, for P = 4, 8, 12, and 16 and N ⩽ 3. (A brief illustrative sketch follows this entry.)

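    The following minimal Python sketch illustrates the general idea of rolling back a bounded number of instruction words using a small history buffer sized 2·N·P, the figure quoted in the abstract. The class, its fields, and the rollback policy are assumptions for illustration only; they are not the paper's read-buffer design or its compiler transformations.

    # Toy model of instruction-word rollback with a bounded history buffer.
    # Illustrative sketch only; the names and structure are assumptions.
    from collections import deque

    class ToyVLIW:
        def __init__(self, num_fus, rollback_depth, num_regs=32):
            self.P = num_fus                 # functional units per instruction word
            self.N = rollback_depth          # instruction words we must be able to undo
            self.regs = [0] * num_regs
            # One entry per register write: (word_index, register, old_value).
            # Capacity 2*N*P borrows the sizing quoted in the abstract; the
            # paper's read buffer itself works differently.
            self.history = deque(maxlen=2 * self.N * self.P)
            self.word_index = 0

        def execute_word(self, ops):
            """ops: up to P (dest_reg, value) register writes issued in one word."""
            assert len(ops) <= self.P
            for dest, value in ops:
                self.history.append((self.word_index, dest, self.regs[dest]))
                self.regs[dest] = value
            self.word_index += 1

        def rollback(self, n):
            """Undo the last n <= N instruction words by restoring saved old values."""
            assert n <= self.N
            limit = self.word_index - n
            while self.history and self.history[-1][0] >= limit:
                _, reg, old = self.history.pop()
                self.regs[reg] = old
            self.word_index = limit

    cpu = ToyVLIW(num_fus=4, rollback_depth=3)
    cpu.execute_word([(1, 10), (2, 20)])
    cpu.execute_word([(1, 11)])
    cpu.rollback(1)
    assert cpu.regs[1] == 10 and cpu.regs[2] == 20
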
  • Speculative Versioning Cache

    Publication Year: 2001, Page(s): 1305 - 1317
    Cited by: Papers (7) | Patents (41)

    Dependences among loads and stores whose addresses are unknown hinder the extraction of instruction-level parallelism during the execution of a sequential program. Such ambiguous memory dependences can be overcome by memory dependence speculation, which enables a load or store to be speculatively executed before the addresses of all preceding loads and stores are known. Furthermore, multiple speculative stores to a memory location create multiple speculative versions of the location. Program order among the speculative versions must be tracked to maintain sequential semantics. A previously proposed approach, the Address Resolution Buffer (ARB), uses a centralized buffer to support speculative versions. Our proposal, called the Speculative Versioning Cache (SVC), uses distributed caches to eliminate the latency and bandwidth problems of the ARB. The SVC conceptually unifies cache coherence and speculative versioning by using an organization similar to snooping bus-based coherent caches. Our evaluation for the Multiscalar architecture shows that hit latency is an important factor affecting performance and that private cache solutions trade off hit rate for hit latency. (A brief illustrative sketch follows this entry.)

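    The sketch below illustrates, in plain Python, what it means to keep multiple speculative versions of a memory location ordered by task (program) order, with commit and squash operations. It is an assumption-laden toy model of the concept named in the abstract, not the SVC's distributed, snooping-cache implementation.

    # Toy model of tracking multiple speculative versions of a memory location
    # in program (task) order. Illustrative only.
    import bisect

    class SpeculativeVersions:
        def __init__(self):
            # addr -> list of (task_id, value) sorted by task_id (program order)
            self.versions = {}
            self.committed = {}              # architectural (non-speculative) state

        def spec_store(self, task, addr, value):
            lst = self.versions.setdefault(addr, [])
            keys = [t for t, _ in lst]
            i = bisect.bisect_left(keys, task)
            if i < len(lst) and lst[i][0] == task:
                lst[i] = (task, value)       # a task overwrites its own version
            else:
                lst.insert(i, (task, value))

        def spec_load(self, task, addr):
            """Return the version created by the closest task at or before `task`."""
            best = None
            for t, v in self.versions.get(addr, []):
                if t <= task:
                    best = v
                else:
                    break
            return best if best is not None else self.committed.get(addr, 0)

        def commit(self, task):
            """Fold all versions up to and including `task` into committed state."""
            for addr, lst in self.versions.items():
                kept = []
                for t, v in lst:
                    if t <= task:
                        self.committed[addr] = v   # newest version at or before `task` wins
                    else:
                        kept.append((t, v))
                self.versions[addr] = kept

        def squash(self, task):
            """Misspeculation: discard versions created by `task` and all later tasks."""
            for addr, lst in self.versions.items():
                self.versions[addr] = [(t, v) for t, v in lst if t < task]

    mem = SpeculativeVersions()
    mem.spec_store(task=2, addr=0x40, value=7)
    mem.spec_store(task=1, addr=0x40, value=3)
    assert mem.spec_load(task=1, addr=0x40) == 3   # task 1 must not see task 2's store
    assert mem.spec_load(task=3, addr=0x40) == 7
    mem.squash(task=2)
    assert mem.spec_load(task=3, addr=0x40) == 3
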
  • Controlling aggregation in distributed object systems: a graph-based approach

    Publication Year: 2001, Page(s): 1236 - 1255
    Cited by: Papers (2) | Patents (2)

    The Distributed Object Kernel (DOK) is a federated database system providing a set of services which allow cooperative processing across different databases. The focus of this paper is the design of a DOK security service that enforces both local security policies, related to the security of local autonomous databases, and federated security policies, governing access to data aggregates composed of data from multiple distributed databases. We propose Global Access Control, an extended access control mechanism enabling a uniform expression of heterogeneous security information. Mappings from existing Mandatory and Discretionary Access Controls are described. To permit the control of data aggregation, that is, the derivation of unauthorized information from authorized data, our security framework provides a logic-based language, the Federated Logic Language (FELL), which can describe constraints on both single and multiple states of the federation. To enforce constraints, FELL statements are mapped to state transition graphs which model the different subcomputations required to check the aggregation constraints. Graph aggregation operations are proposed for building compound state transition graphs for complex constraints. To monitor aggregation constraints, two marking techniques, called the Linear Marking Technique and the Zigzag Marking Technique, are proposed. Finally, we describe a three-layer DOK logical secure architecture enabling the implementation of the different security agents. It comprises a Coordination layer, a Task layer, and a Database layer, each containing specialized agents that enforce a different part of the federated security policy. Coordination is performed by the DOK Manager, security enforcement is performed by a specialized Constraint Manager agent, and database functions are implemented by user and data agents. (A brief illustrative sketch follows this entry.)

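    The toy monitor below illustrates the notion of an aggregation constraint: a subject may read individual members of a sensitive aggregate but is denied the access that would complete it. The class, names, and policy are hypothetical illustrations; they stand in for, and greatly simplify, FELL constraints, state transition graphs, and the marking techniques described in the abstract.

    # Toy monitor for one aggregation constraint. The per-subject "marking"
    # records which members of the aggregate have already been read.
    class AggregationConstraint:
        def __init__(self, aggregate):
            self.aggregate = frozenset(aggregate)     # the sensitive combination
            self.marking = {}                         # subject -> members already read

        def allow_read(self, subject, item):
            if item not in self.aggregate:
                return True                           # not governed by this constraint
            seen = self.marking.setdefault(subject, set())
            if seen | {item} == self.aggregate:
                return False                          # completing the aggregate is denied
            seen.add(item)
            return True

    c = AggregationConstraint({"employee.name", "employee.salary"})
    assert c.allow_read("alice", "employee.name") is True
    assert c.allow_read("alice", "employee.salary") is False   # would complete the aggregate
    assert c.allow_read("bob", "employee.salary") is True
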
  • Hybrid algorithms for complete exchange in 2D meshes

    Publication Year: 2001, Page(s): 1201 - 1218
    Cited by: Papers (8)

    Parallel algorithms for several common problems, such as sorting and the FFT, involve a personalized exchange of data among all the processors. Past approaches to complete exchange fall into two broad classes: direct exchange and indirect message-combining algorithms. While combining approaches reduce the number of message startups, direct exchange minimizes the volume of data transmitted. This paper presents a family of hybrid algorithms for wormhole-routed 2D meshes that can effectively utilize the complementary strengths of these two approaches to complete exchange. The performance of hybrid algorithms using Cyclic Exchange and Scott's Direct Exchange is studied using analytical models, simulation, and implementation on a Cray T3D system. The results show that hybrids achieve lower completion times than either pure algorithm for a range of mesh sizes, data block sizes, and message startup costs. It is also demonstrated that barriers may be used to enhance performance by reducing message contention, whether or not the target system provides hardware support for barrier synchronization. The analytical models are shown to be useful in selecting the optimum hybrid for any given combination of system parameters (mesh size, message startup time, flit transfer time, and barrier cost) and the problem parameter (data block size). (A brief illustrative sketch follows this entry.)

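    The sketch below gives a simple analytical cost model for complete exchange that contrasts pure direct exchange with a two-phase combining schedule over a logical grid, showing the startup-versus-volume trade-off the abstract describes. The model and parameter names are assumptions; it ignores mesh contention and is not the paper's Cyclic Exchange, Scott's Direct Exchange, or its T3D-calibrated hybrids.

    # Toy cost model for complete exchange (all-to-all personalized communication).
    def direct_exchange_time(p, m, ts, tb):
        """p nodes, block size m bytes, startup cost ts, per-byte cost tb."""
        return (p - 1) * (ts + m * tb)

    def two_phase_time(rows, cols, m, ts, tb):
        """Complete exchange over a logical rows x cols grid (p = rows*cols).
        Phase 1: exchange along rows; each message carries `rows` combined blocks.
        Phase 2: exchange along columns; each message carries `cols` combined blocks."""
        phase1 = (cols - 1) * (ts + rows * m * tb)
        phase2 = (rows - 1) * (ts + cols * m * tb)
        return phase1 + phase2

    def best_schedule(p, m, ts, tb):
        candidates = {("direct", p): direct_exchange_time(p, m, ts, tb)}
        for rows in range(2, p + 1):
            if p % rows == 0:
                candidates[("two-phase", rows)] = two_phase_time(rows, p // rows, m, ts, tb)
        return min(candidates.items(), key=lambda kv: kv[1])

    # Large startup costs favour combining; large blocks favour direct exchange.
    print(best_schedule(p=64, m=64,    ts=100.0, tb=0.01))
    print(best_schedule(p=64, m=65536, ts=100.0, tb=0.01))
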
  • L2 vector median filters on arrays with reconfigurable optical buses

    Publication Year: 2001, Page(s): 1281 - 1292
    Cited by: Papers (3)

    In spite of their good filtering characteristics for vector-valued image processing, the usability of vector median filters is limited by their high computational complexity. Given an N × N image and a W × W window, the computational complexity of the vector median filter is O(W⁴N²). In this paper, we design three fast and efficient parallel algorithms for vector median filtering based on the 2-norm (L2) on arrays with reconfigurable optical buses (AROB). For 1 ⩽ p ⩽ W ⩽ q ⩽ N, our algorithms run in O(W⁴ log W/p⁴), O(W²N²/p⁴q² log W), and O(1) time using p⁴N²/log W, p⁴q²/log W, and W⁴N² log N processors, respectively. In the sense of the product of time and the number of processors used, the first two results are cost optimal and the last one is time optimal. (A brief illustrative sketch follows this entry.)

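    For background, the following sequential Python/NumPy sketch implements the brute-force L2 vector median filter whose O(W⁴N²) cost motivates the parallel algorithms; it says nothing about the AROB algorithms themselves, and the function name and window handling are assumptions.

    # Brute-force sequential L2 vector median filter.
    import numpy as np

    def vector_median_filter(img, w):
        """img: H x W_img x C array (e.g. RGB); w: odd window size."""
        h, width, channels = img.shape
        r = w // 2
        padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="edge").astype(float)
        out = np.empty_like(img)
        for i in range(h):
            for j in range(width):
                win = padded[i:i + w, j:j + w].reshape(-1, channels)   # w*w vectors
                # Sum of L2 distances from each window vector to all the others:
                d = np.linalg.norm(win[:, None, :] - win[None, :, :], axis=2).sum(axis=1)
                out[i, j] = win[np.argmin(d)]   # the vector median minimises that sum
        return out

    noisy = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
    filtered = vector_median_filter(noisy, w=3)
    print(filtered.shape)
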
  • A delay-optimal quorum-based mutual exclusion algorithm for distributed systems

    Publication Year: 2001, Page(s): 1256 - 1268
    Cited by: Papers (11)

    The performance of a mutual exclusion algorithm is measured by the number of messages exchanged per critical section execution and the delay between successive executions of the critical section. There is a trade-off between message complexity and synchronization delay in mutual exclusion algorithms. The Lamport algorithm (1978) and the Ricart-Agrawala algorithm (1981) both have a synchronization delay of T (T is the average message delay), but their message complexity is O(N). Maekawa's algorithm (1985) reduces the message complexity to O(√N); however, it increases the synchronization delay to 2T. After Maekawa's algorithm, many quorum-based mutual exclusion algorithms have been proposed to reduce the message complexity or to increase the resiliency to site and communication link failures. Since these algorithms are Maekawa-type algorithms, they also suffer from the long synchronization delay. We propose a delay-optimal quorum-based mutual exclusion algorithm which reduces the synchronization delay to T and still has a low message complexity of O(K) (K is the size of the quorum, which can be as low as log N). A correctness proof and a detailed performance analysis are provided. (A brief illustrative sketch follows this entry.)

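    As background for the quorum terminology, the sketch below builds Maekawa-style grid quorums of size O(√N) and checks that every pair of quorums intersects, which is the property that yields mutual exclusion. This is the standard textbook construction, not the paper's delay-optimal algorithm; tree-structured quorums can shrink K to roughly log N, as the abstract notes.

    # Grid (Maekawa-style) quorum construction: a site's quorum is its row plus
    # its column in a sqrt(N) x sqrt(N) arrangement of the N sites.
    import math

    def grid_quorum(site, n):
        """Quorum of `site` for n = k*k sites numbered 0..n-1."""
        k = math.isqrt(n)
        assert k * k == n, "sketch assumes a perfect-square number of sites"
        row, col = divmod(site, k)
        row_members = {row * k + c for c in range(k)}
        col_members = {r * k + col for r in range(k)}
        return row_members | col_members

    n = 16
    quorums = [grid_quorum(s, n) for s in range(n)]
    # Every pair of quorums intersects, which is what guarantees mutual exclusion:
    assert all(quorums[a] & quorums[b] for a in range(n) for b in range(n))
    print(sorted(quorums[5]), "size", len(quorums[5]))   # about 2*sqrt(n) - 1 sites
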
  • Compiler-directed collective I/O

    Publication Year: 2001, Page(s): 1318 - 1331
    Cited by: Papers (2)

    Current approaches to parallel I/O demand extensive user effort to obtain acceptable performance. This is due in part to difficulties in understanding the characteristics of a wide variety of I/O devices and in part to the inherent complexity of I/O software. While parallel I/O systems provide users with environments in which persistent data sets can be shared between parallel processors, the ultimate performance of I/O-intensive codes depends largely on the relation between the data access patterns exhibited by parallel processors and the storage patterns of data in files and on disks. In cases where access patterns and storage patterns match, we can exploit parallel I/O hardware by allowing each processor to perform independent parallel I/O. To keep performance acceptable when data access patterns and storage patterns do not match, several I/O optimization techniques have been developed in recent years. Collective I/O is one such optimization technique; it enables each processor to do I/O on behalf of other processors if doing so improves the overall performance. While it is generally accepted that collective I/O and its variants can bring impressive improvements in I/O performance, it is difficult for the programmer to use collective I/O in an optimal manner. We propose and evaluate a compiler-directed collective I/O approach which detects the opportunities for collective I/O and inserts the necessary I/O calls in the code automatically. An important characteristic of the approach is that, instead of applying collective I/O indiscriminately, it uses collective I/O selectively, only in cases where independent parallel I/O would not be possible or would lead to an excessive number of I/O calls. The approach involves compiler-directed access pattern and storage pattern detection schemes that work in a multiple-application environment. We have implemented the necessary algorithms in a source-to-source translator and within a stand-alone tool. Our experimental results on an SGI/Cray Origin 2000 multiprocessor demonstrate that our compiler-directed collective I/O scheme performs very well on different setups built using nine applications from several scientific benchmarks. We have also observed that the I/O performance of our approach is only 5.23 percent worse than that of an optimal scheme. (A brief illustrative sketch follows this entry.)

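    The toy function below mimics the selection rule described in the abstract: estimate how many I/O calls independent parallel I/O would need for a given access pattern against a row-major file layout, and choose collective I/O only when that count becomes excessive. The pattern names, the threshold, and the function names are hypothetical; this is not the paper's compiler analysis.

    # Toy selection between independent and collective parallel I/O.
    def independent_io_calls(rows, access_pattern):
        """Noncontiguous file chunks one processor must touch, for a 2D array
        stored row-major in the file."""
        if access_pattern == "row-block":      # each processor owns a band of full rows
            return 1                           # one contiguous file region per processor
        if access_pattern == "column-block":   # each processor owns a band of columns
            return rows                        # one small chunk per row -> `rows` seeks
        raise ValueError("unknown access pattern")

    def choose_io_strategy(rows, access_pattern, max_calls=16):
        calls = independent_io_calls(rows, access_pattern)
        return "independent" if calls <= max_calls else "collective"

    print(choose_io_strategy(4096, "row-block"))     # -> independent
    print(choose_io_strategy(4096, "column-block"))  # -> collective
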
  • Parallel processing of adaptive meshes with load balancing

    Publication Year: 2001, Page(s): 1269 - 1280
    Cited by: Papers (30)

    Many scientific applications involve grids that lack a uniform underlying structure. These applications are often also dynamic in nature, in that the grid structure changes significantly between successive phases of execution. In parallel computing environments, mesh adaptation of unstructured grids through selective refinement/coarsening has proven to be an effective approach. However, achieving load balance while minimizing interprocessor communication and redistribution costs is a difficult problem. Traditional dynamic load balancers are mostly inadequate because they lack a global view of system loads across processors. In this paper, we propose a novel and general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication topology, and we compare its performance with a successful global load-balancing environment, called PLUM, specifically created to handle adaptive unstructured applications. Our experimental results on an IBM SP2 demonstrate that the SBN-based load balancer achieves lower redistribution costs than PLUM by overlapping processing and data migration. (A brief illustrative sketch follows this entry.)

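    The sketch below shows a generic global rebalancing step: given per-processor loads, it computes a transfer plan that levels the load while keeping the migrated volume low. It is only an illustration of the load-balancing problem under an assumed integer-load model; it models neither the symmetric broadcast network topology nor PLUM.

    # Generic global load-balancing step over integer per-processor loads.
    def redistribution_plan(loads):
        """Return a list of (src, dst, amount) transfers that levels the loads."""
        n = len(loads)
        total = sum(loads)
        base, extra = divmod(total, n)
        # Processors 0..extra-1 end up with base+1 units, the rest with base.
        target = [base + 1 if i < extra else base for i in range(n)]
        surplus = [(i, loads[i] - target[i]) for i in range(n) if loads[i] > target[i]]
        deficit = [(i, target[i] - loads[i]) for i in range(n) if loads[i] < target[i]]
        plan = []
        si = di = 0
        while si < len(surplus) and di < len(deficit):
            s, s_amt = surplus[si]
            d, d_amt = deficit[di]
            moved = min(s_amt, d_amt)
            plan.append((s, d, moved))
            surplus[si] = (s, s_amt - moved)
            deficit[di] = (d, d_amt - moved)
            if surplus[si][1] == 0:
                si += 1
            if deficit[di][1] == 0:
                di += 1
        return plan

    loads = [12, 3, 9, 0, 16, 2, 7, 5]        # total 54 over 8 procs -> targets of 7 or 6
    for src, dst, amount in redistribution_plan(loads):
        print(f"move {amount} units from P{src} to P{dst}")
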
  • A general theory for deadlock-free adaptive routing using a mixed set of resources

    Publication Year: 2001, Page(s): 1219 - 1235
    Cited by: Papers (26) | Patents (2)

    This paper presents a theoretical framework for the design of deadlock-free fully adaptive routing algorithms for a general class of network topologies and switching techniques in a single, unified theory. A general theory is proposed that allows the design of deadlock avoidance-based as well as deadlock recovery-based wormhole and virtual cut-through adaptive routing algorithms that use a homogeneous or a heterogeneous (mixed) set of resources. The theory also allows channel queues to be allocated nonatomically, utilizing resources efficiently. A general methodology for the design of fully adaptive routing algorithms applicable to arbitrary network topologies is also proposed. The proposed theory and methodology allow the design of efficient network routers that require minimal resources for handling infrequent deadlocks. (A brief illustrative sketch follows this entry.)

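    The sketch below illustrates the classical reasoning such theories build on: construct the channel dependency graph induced by a routing function and check it for cycles (an acyclic graph implies deadlock freedom under the Dally-Seitz condition). The paper's framework is far more general (mixed resources, nonatomic queues, recovery); this example, whose channel names and graphs are assumptions, only shows the dependency-graph idea on a small ring.

    # Cycle check on a channel dependency graph.
    def has_cycle(edges):
        """edges: dict channel -> set of channels it may wait on (dependencies)."""
        WHITE, GREY, BLACK = 0, 1, 2
        color = {c: WHITE for c in edges}
        def dfs(c):
            color[c] = GREY
            for nxt in edges.get(c, ()):
                if color.get(nxt, WHITE) == GREY:
                    return True
                if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                    return True
            color[c] = BLACK
            return False
        return any(color[c] == WHITE and dfs(c) for c in list(edges))

    # Unidirectional 4-node ring, one channel per link: c_i goes from node i to i+1.
    ring = {f"c{i}": {f"c{(i + 1) % 4}"} for i in range(4)}
    print(has_cycle(ring))            # True -> minimal routing on this ring can deadlock

    # Dateline idea: split each link into 'lo' and 'hi' virtual channels; packets
    # use 'lo' until the wrap-around link, where they switch to 'hi', which never
    # wraps, so the dependency cycle is broken.
    dateline = {f"lo{i}": ({f"lo{i + 1}"} if i < 3 else {"hi0"}) for i in range(4)}
    dateline.update({f"hi{i}": ({f"hi{i + 1}"} if i < 3 else set()) for i in range(4)})
    print(has_cycle(dateline))        # False -> the dependency graph is now acyclic
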

Aims & Scope

IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
David Bader
College of Computing
Georgia Institute of Technology