Parallel and Distributed Systems, IEEE Transactions on

Issue 2 • Date Feb. 2012

  • [Front cover]

    Page(s): c1
    Freely Available from IEEE
  • [Cover 2]

    Page(s): c2
    Freely Available from IEEE
  • A Two-Dimensional Low-Diameter Scalable On-Chip Network for Interconnecting Thousands of Cores

    Page(s): 193 - 201
    Multimedia

    This paper introduces the Spidergon-Donut (SD) on-chip interconnection network for interconnecting 1,000 cores in future MPSoCs and CMPs. The SD network, which extends the Spidergon network into a second dimension, significantly reduces the network diameter to well below that of the popular 2D Mesh and Torus networks, at the cost of one extra node degree and roughly 25 percent more links. The paper presents a detailed construction of the SD network, a method to reshuffle the SD network's nodes for layout onto the 2D plane, and simple one-to-one and broadcast routing algorithms. The various configurations of the SD network are analyzed and compared, including detailed area and delay studies. The paper concludes that a hybrid version of the SD network, with smaller SD instances interconnected by a crossbar, is a feasible low-diameter topology for a thousand-core system.

  • Accelerating Matrix Operations with Improved Deeply Pipelined Vector Reduction

    Page(s): 202 - 210

    Many scientific or engineering applications involve matrix operations, in which reduction of vectors is a common operation. If the core operator of the reduction is deeply pipelined, which is usually the case, dependencies between the input data elements cause data hazards. To tackle this problem, we propose a new reduction method with low latency and high pipeline utilization. The performance of the proposed design is evaluated for both single data set and multiple data set scenarios. Further, QR decomposition is used to demonstrate how the proposed method can accelerate its execution. We implement the design on an FPGA and compare its results to other methods.

  • An Efficient Approach for Mobile Asset Tracking Using Contexts

    Page(s): 211 - 218

    Due to the heterogeneity involved in smart interconnected devices, cellular applications, and surrounding (GPS-aware) environments, there is a need to develop a realistic approach to track mobile assets. Current tracking systems are costly and inefficient over wireless data transmission systems where cost is based on the rate of data being sent. Our aim is to develop an efficient and improved geographical asset tracking solution and conserve valuable mobile resources by dynamically adapting the tracking scheme by means of context-aware personalized route learning techniques. We intend to perform this tracking by proactively monitoring the context information in a distributed, efficient, and scalable fashion. Context profiles, which indicate the characteristics of a route based on environmental conditions, are utilized to dynamically represent the values of the asset's properties. We designed and implemented an adaptive learning-based scheme that makes an optimized judgment of data transmission. This manuscript is complemented with theoretical and practical evaluations that prove that significant costs can be saved and operational efficiency can be achieved.

  • Autonomic Placement of Mixed Batch and Transactional Workloads

    Page(s): 219 - 231

    To reduce the cost of infrastructure and electrical energy, enterprise datacenters consolidate workloads on the same physical hardware. Often, these workloads comprise both transactional and long-running analytic computations. Such consolidation brings new performance management challenges due to the intrinsically different nature of a heterogeneous set of mixed workloads, ranging from scientific simulations to multitier transactional applications. These differences impose the need for new scheduling mechanisms to manage collocated heterogeneous sets of applications, such as running a web application and a batch job on the same physical server, with differentiated performance goals. In this paper, we present a technique that enables existing middleware to fairly manage mixed workloads: long-running jobs and transactional applications. Our technique permits collocation of the workload types on the same physical hardware, and leverages virtualization control mechanisms to perform online system reconfiguration. In our experiments, including simulations as well as a prototype system built on top of state-of-the-art commercial middleware, we demonstrate that our technique maximizes mixed workload performance while providing service differentiation based on high-level performance goals.

  • BloomCast: Efficient and Effective Full-Text Retrieval in Unstructured P2P Networks

    Page(s): 232 - 241
    Multimedia

    Efficient and effective full-text retrieval in unstructured peer-to-peer networks remains a challenge in the research community. First, it is difficult, if not impossible, for unstructured P2P systems to effectively locate items with guaranteed recall. Second, existing schemes to improve search success rate often rely on placing a large number of item replicas across the wide area network, incurring a large amount of communication and storage costs. In this paper, we propose BloomCast, an efficient and effective full-text retrieval scheme for unstructured P2P networks. By leveraging a hybrid P2P protocol, BloomCast replicates the items uniformly at random across the P2P networks, achieving a guaranteed recall at a communication cost of O(√N), where N is the size of the network. Furthermore, by casting Bloom Filters instead of the raw documents across the network, BloomCast significantly reduces the communication and storage costs for replication. We demonstrate the power of the BloomCast design through both mathematical proof and comprehensive simulations based on the query logs from a major commercial search engine and the NIST TREC WT10G data collection. Results show that BloomCast achieves an average query recall of 91 percent, which outperforms the existing WP algorithm by 18 percent, while BloomCast greatly reduces the search latency for query processing by 57 percent.
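
    The abstract's idea of shipping a Bloom filter in place of a raw document can be illustrated with a generic Bloom filter over a document's terms. This is only a minimal sketch and not BloomCast's actual encoding: the class name, the filter size, the number of hash functions, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter; sizes and hashing are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, term):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, term):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(term)
        )


def summarize(document_text, **kwargs):
    """Build a compact term summary that could be replicated instead of the text."""
    bloom = BloomFilter(**kwargs)
    for term in document_text.lower().split():
        bloom.add(term)
    return bloom


doc_filter = summarize(
    "efficient full-text retrieval in unstructured peer-to-peer networks"
)
print(doc_filter.might_contain("retrieval"))
print(len(doc_filter.bits), "bytes replicated instead of the raw document")
```

    A filter never reports a stored term as absent, but it can report an absent term as present, so a peer answering a query from such a summary would still need to confirm hits against the original document.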

  • Communication-Aware Globally-Coordinated On-Chip Networks

    Page(s): 242 - 254
    Multimedia

    With continued Moore's law scaling, multicore-based architectures are becoming the de facto design paradigm for achieving low-cost and performance/power-efficient processing systems through effective exploitation of available parallelism in software and hardware. A crucial subsystem within multicores is the on-chip interconnection network that orchestrates high-bandwidth, low-latency, and low-power communication of data. Much previous work has focused on improving the design of on-chip networks, but without fully taking into consideration the on-chip communication behavior of application workloads that can be exploited by the network design. A significant portion of this paper analyzes and models on-chip network traffic characteristics of representative application workloads. Building on this, the notion of globally coordinated on-chip networks is proposed, in which application communication behavior (captured by traffic profiling) is utilized in the design and configuration of on-chip networks so as to support prevailing traffic flows well, in a globally coordinated manner. This is applied to the design of a hybrid network consisting of a mesh augmented with configurable multidrop (bus-like) spanning channels that serve as express paths for traffic flows benefiting from them, according to the characterized traffic profile. Evaluations reveal that network latency and energy consumption for a 64-core system running OpenMP benchmarks can be improved on average by 15 and 27 percent, respectively, with globally coordinated on-chip networks.

  • Compression of View on Anonymous Networks—Folded View—

    Page(s): 255 - 262
    Multimedia

    A view is a labeled directed graph containing all information about the network that a party can learn by exchanging messages with its neighbors. Views can be used to solve distributed problems on an anonymous network (i.e., a network that does not guarantee that every party has a unique identifier). This paper presents an algorithm that constructs views in a compressed form on an anonymous n-party network of any topology in at most 2n rounds with O(n^6 log n) bit complexity, where the time complexity (i.e., the number of local computation steps per party) is O(n^6 log n). This is the first view-construction algorithm that runs in O(n) rounds with polynomial bit complexity. The paper also gives an algorithm that counts the number of nonisomorphic views in the network in O(n^6 log n) time complexity if a view is given in the compressed form. These algorithms imply that some well-studied problems, including the leader election problem, can deterministically be solved in O(n) rounds with polynomial bit and time complexity on an anonymous n-party network of any topology.

  • DDC: A Novel Scheme to Directly Decode the Collisions in UHF RFID Systems

    Page(s): 263 - 270
    Multimedia

    RFID has been gaining popularity due to its variety of applications, such as inventory control and localization. One important issue in RFID systems is tag identification. In RFID systems, each tag randomly selects a slot to send a Random Number (RN) packet to contend for identification. A collision happens when multiple tags select the same slot, which makes the RN packet undecodable and thus reduces the channel utilization. In this paper, we redesign the RN pattern to make the collided RNs decodable. By leveraging the collision slots, the system performance can be dramatically enhanced. This novel scheme is called DDC, which is able to directly decode the collisions without exact knowledge of collided RNs. In the DDC scheme, we modify the RN generator in the RFID tag and add a collision decoding scheme for the RFID reader. We implement DDC in a GNU Radio and USRP2-based testbed to verify its feasibility. Both theoretical analysis and testbed experiments show that DDC achieves a 40 percent tag read rate gain compared with the traditional RFID protocol.

  • Delegation-Based I/O Mechanism for High Performance Computing Systems

    Page(s): 271 - 279
    Multimedia

    Massively parallel applications often require periodic data checkpointing for program restart and post-run data analysis. Although high performance computing systems provide massive parallelism and computing power to fulfill the crucial requirements of the scientific applications, the I/O tasks of high-end applications do not scale. Strict data consistency semantics adopted from traditional file systems are inadequate for homogeneous parallel computing platforms. For high performance parallel applications, independent I/O is critical, particularly if checkpointing data are dynamically created or irregularly partitioned. In particular, parallel programs generating a large number of unrelated I/O accesses on large-scale systems often face serious I/O serializations introduced by lock contention and conflicts at the file system layer. As these applications may not be able to utilize the I/O optimizations requiring process synchronization, they pose a great challenge for parallel I/O architecture and software designs. We propose an I/O mechanism to bridge the gap between scientific applications and parallel storage systems. A static file domain partitioning method is developed to align the I/O requests and produce a client-server mapping that minimizes the file lock acquisition costs and eliminates the lock contention. Our performance evaluations of production application I/O kernels demonstrate scalable performance and achieve high I/O bandwidths.

  • Fat versus Thin Threading Approach on GPUs: Application to Stochastic Simulation of Chemical Reactions

    Page(s): 280 - 287
    Multimedia

    We explore two different threading approaches on a graphics processing unit (GPU) exploiting two different characteristics of the current GPU architecture. The fat thread approach tries to minimize data access time by relying on shared memory and registers potentially sacrificing parallelism. The thin thread approach maximizes parallelism and tries to hide access latencies. We apply these two approaches to the parallel stochastic simulation of chemical reaction systems using the stochastic simulation algorithm (SSA) by Gillespie [14]. In these cases, the proposed thin thread approach shows comparable performance while eliminating the limitation of the reaction system's size.

  • How Much to Share: A Repeated Game Model for Peer-to-Peer Streaming under Service Differentiation Incentives

    Page(s): 288 - 295
    Multimedia

    In this paper, we propose a service differentiation incentive for P2P streaming systems, based on peers' instant contributions. A repeated game model is also designed to analyze how much the peers should contribute in each round under this incentive. Simulations show that satisfactory streaming quality is achieved in the Nash Equilibrium state.

  • In Cloud, Can Scientific Communities Benefit from the Economies of Scale?

    Page(s): 296 - 303
    Multimedia

    The basic idea behind cloud computing is that resource providers offer elastic resources to end users. In this paper, we intend to answer one question that is key to the success of cloud computing: in the cloud, can small-to-medium scale scientific communities benefit from the economies of scale? Our research contributions are threefold. First, we propose an innovative public cloud usage model for small-to-medium scale scientific communities to utilize elastic resources on a public cloud site while maintaining their flexible system controls, i.e., to create, activate, suspend, resume, deactivate, and destroy their high-level management entities (service management layers) without knowing the details of management. Second, we design and implement an innovative system, DawningCloud, at the core of which are lightweight service management layers running on top of a common management service framework. The common management service framework of DawningCloud not only facilitates building lightweight service management layers for heterogeneous workloads, but also makes their management tasks simple. Third, we evaluate the systems comprehensively using both emulation and real experiments. We found that for four traces of two typical scientific workloads, High-Throughput Computing (HTC) and Many-Task Computing (MTC), DawningCloud reduces resource consumption by up to 59.5 and 72.6 percent for HTC and MTC service providers, respectively, and reduces the total resource consumption by up to 54 percent for the resource provider with respect to the previous two public cloud solutions. We conclude that small-to-medium scale scientific communities can indeed benefit from the economies of scale of public clouds with the support of the enabling system.

  • Interactivity-Constrained Server Provisioning in Large-Scale Distributed Virtual Environments

    Page(s): 304 - 312
    Multimedia

    Maintaining interactivity is one of the key challenges in distributed virtual environments (DVEs). In this paper, we consider a new problem, termed the interactivity-constrained server provisioning problem, whose goal is to minimize the number of distributed servers needed to achieve a prespecified level of interactivity. We identify and formulate two variants of this new problem and show that they are both NP-hard via reductions to the set covering problem. We then propose several computationally efficient approximation algorithms for solving the problem. The main algorithms exploit dependencies among distributed servers to make provisioning decisions. We conduct extensive experiments to evaluate the performance of the proposed algorithms. Specifically, we use both static Internet latency data available from prior measurements and topology generators, as well as the most recent, dynamic latency data collected via our own large-scale deployment of a DVE performance monitoring system over PlanetLab. The results show that the newly proposed algorithms that take into account interserver dependencies significantly outperform the well-established set covering algorithm for both problem variants.
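
    For readers unfamiliar with the set covering baseline mentioned above, the classic greedy heuristic can be sketched as follows. This is an assumption about what such a baseline typically looks like, not the paper's own algorithm: the function name, the server and client labels, and the "reach" sets (clients a server could serve within the interactivity bound) are all hypothetical.

```python
def greedy_set_cover(universe, candidate_sets):
    """Pick sets greedily, each time taking the one covering the most
    still-uncovered elements (the textbook set cover heuristic)."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Ties are broken by dictionary insertion order.
        best = max(
            candidate_sets,
            key=lambda name: len(candidate_sets[name] & uncovered),
        )
        gained = candidate_sets[best] & uncovered
        if not gained:
            raise ValueError("some elements cannot be covered by any set")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical instance: which clients each candidate server site can
# serve within the required interactivity bound.
clients = {"c1", "c2", "c3", "c4", "c5"}
reach = {
    "s1": {"c1", "c2", "c3"},
    "s2": {"c3", "c4"},
    "s3": {"c4", "c5"},
    "s4": {"c1"},
}
servers = greedy_set_cover(clients, reach)
print(servers)  # ['s1', 's3']
```

    The heuristic treats every server independently, which is the limitation the abstract points to when it says the proposed algorithms exploit interserver dependencies.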

  • Payments for Outsourced Computations

    Page(s): 313 - 320
    Multimedia

    With the recent advent of cloud computing, the concept of outsourcing computations, initiated by volunteer computing efforts, is being revamped. While the two paradigms differ in several dimensions, they also share challenges, stemming from the lack of trust between outsourcers and workers. In this work, we propose a unifying trust framework, where correct participation is financially rewarded: neither participant is trusted, yet outsourced computations are efficiently verified and validly remunerated. We propose three solutions for this problem, relying on an offline bank to generate and redeem payments; the bank is oblivious to interactions between outsourcers and workers. We propose several attacks that can be launched against our framework and study the effectiveness of our solutions. We implemented our most secure solution and our experiments show that it is efficient: the bank can perform hundreds of payment transactions per second and the overheads imposed on outsourcers and workers are negligible.

  • Real-World Sensor Network for Long-Term Volcano Monitoring: Design and Findings

    Page(s): 321 - 329
    Multimedia

    This paper presents the design, deployment, and evaluation of a real-world sensor network system on an active volcano, Mount St. Helens. In volcano monitoring, maintenance is extremely hard and system robustness is one of the biggest concerns. However, most system research to date has focused more on performance improvement and less on system robustness. In our system design, to address this challenge, automatic fault detection and recovery mechanisms were designed to autonomously roll the system back to the initial state if exceptions occur. To enable remote management, we designed a configurable sensing and flexible remote command and control mechanism with the support of a reliable dissemination protocol. To maximize data quality, we designed event detection algorithms to identify volcanic events and prioritize the data, and then deliver higher priority data with a higher delivery ratio using an adaptive data transmission protocol. Also, a lightweight adaptive linear predictive compression algorithm and a localized TDMA MAC protocol were designed to improve network throughput. With these techniques and other improvements in intelligence and robustness based on a previous trial deployment, we air-dropped 13 stations into the crater and around the flanks of Mount St. Helens in July 2009. During the deployment, the nodes autonomously discovered each other even while still in the air and immediately formed a smart mesh network for data delivery. We conducted rigorous system evaluations and made many interesting findings on data quality, radio connectivity, and network performance, as well as the influence of environmental factors.

  • Self-Protection in a Clustered Distributed System

    Page(s): 330 - 336
    Multimedia

    Self-protection refers to the ability of a system to detect illegal behaviors and to fight back against intrusions with countermeasures. This article presents the design, implementation, and evaluation of a self-protected system that targets clustered distributed applications. Our approach is based on structural knowledge of the cluster and of the distributed applications. This knowledge makes it possible to detect known and unknown attacks if an illegal communication channel is used. The current prototype is a self-protected JEE infrastructure (Java 2 Enterprise Edition) with firewall-based intrusion detection. Our prototype induces a low performance penalty for applications.

  • Semantic-Aware Metadata Organization Paradigm in Next-Generation File Systems

    Page(s): 337 - 344
    Multimedia

    Existing data storage systems based on the hierarchical directory-tree organization do not meet the scalability and functionality requirements for exponentially growing data sets and increasingly complex metadata queries in large-scale, Exabyte-level file systems with billions of files. This paper proposes a novel decentralized semantic-aware metadata organization, called SmartStore, which exploits semantics of files' metadata to judiciously aggregate correlated files into semantic-aware groups by using information retrieval tools. The key idea of SmartStore is to limit the search scope of a complex metadata query to a single or a minimal number of semantically correlated groups and avoid or alleviate brute-force search in the entire system. The decentralized design of SmartStore can improve system scalability and reduce query latency for complex queries (including range and top-k queries). Moreover, it is also conducive to constructing semantic-aware caching and to supporting conventional filename-based point queries. We have implemented a prototype of SmartStore, and extensive experiments based on real-world traces show that SmartStore significantly improves system scalability and reduces query latency over database approaches. To the best of our knowledge, this is the first study on the implementation of complex queries in large-scale file systems.

  • Sleep Scheduling for Critical Event Monitoring in Wireless Sensor Networks

    Page(s): 345 - 352
    Multimedia

    In this paper, we focus on critical event monitoring in wireless sensor networks (WSNs), where only a small number of packets need to be transmitted most of the time. When a critical event occurs, an alarm message should be broadcast to the entire network as soon as possible. To prolong the network lifetime, sleep scheduling methods are commonly employed in WSNs, resulting in significant broadcasting delay, especially in large-scale WSNs. We propose a novel sleep scheduling method to reduce the delay of alarm broadcasting from any sensor node in WSNs. Specifically, we design two predetermined traffic paths for the transmission of the alarm message, and level-by-level offset-based wake-up patterns according to the respective paths. When a critical event occurs, an alarm is quickly transmitted along one of the traffic paths to a center node, and then it is immediately broadcast by the center node along the other path without collision. Two major contributions are therefore that the broadcasting delay is independent of the density of nodes and that the energy consumption is ultra low. Specifically, the upper bound of the broadcasting delay is only 3D+2L, where D is the maximum hop count from any node to the center node, L is the length of the sleeping duty cycle, and the unit is the length of a time slot. Extensive simulations are conducted to evaluate the performance of the proposed method compared with existing works.
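
    To make the 3D+2L bound concrete, the short calculation below plugs in sample values. The function name, the chosen values of D and L, and the 10 ms slot length are assumptions for illustration only; none of them come from the paper.

```python
def alarm_broadcast_delay_bound(max_hops, duty_cycle_slots):
    """Upper bound 3D + 2L on alarm broadcasting delay, in time slots.

    max_hops         -- D, maximum hop count from any node to the center node
    duty_cycle_slots -- L, length of the sleeping duty cycle in slots
    """
    return 3 * max_hops + 2 * duty_cycle_slots


# Hypothetical network: D = 10 hops, L = 20 slots, 10 ms per slot.
bound_slots = alarm_broadcast_delay_bound(max_hops=10, duty_cycle_slots=20)
slot_ms = 10
print(bound_slots, "slots")           # 70 slots
print(bound_slots * slot_ms, "ms")    # 700 ms
```

    The bound depends only on D and L, which matches the abstract's claim that the delay is independent of node density.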

  • Supporting Overcommitted Virtual Machines through Hardware Spin Detection

    Page(s): 353 - 366

    Multiprocessor operating systems (OSs) pose several unique and conflicting challenges to System Virtual Machines (System VMs). For example, most existing system VMs resort to gang scheduling a guest OS's virtual processors (VCPUs) to avoid OS synchronization overhead. However, gang scheduling is infeasible for some application domains, and is inflexible in other domains. In an overcommitted environment, an individual guest OS has more VCPUs than available physical processors (PCPUs), precluding the use of gang scheduling. In such an environment, we demonstrate a more than two-fold increase in application runtime when transparently virtualizing a chip-multiprocessor's cores. To combat this problem, we propose a hardware technique to detect when a VCPU is wasting CPU cycles, and preempt that VCPU to run a different, more productive VCPU. Our technique can dramatically reduce cycles wasted on OS synchronization, without requiring any semantic information from the software. We then present a server consolidation case study to demonstrate the potential of more flexible scheduling policies enabled by our technique. We propose one such policy that logically partitions the CMP cores between guest VMs. This policy increases throughput by 10-25 percent for consolidated server workloads due to improved cache locality and core utilization.

  • The Significance of CMP Cache Sharing on Contemporary Multithreaded Applications

    Page(s): 367 - 374

    Cache sharing on modern Chip Multiprocessors (CMPs) reduces communication latency among corunning threads, and also causes interthread cache contention. Most previous studies on the influence of cache sharing have concentrated on the design or management of shared cache. The observed influence is often constrained by the reliance on simulators, the use of out-of-date benchmarks, or the limited coverage of deciding factors. This paper describes a systematic measurement of the influence with most of the potentially important factors covered. The measurement shows some surprising results. Contrary to the commonly perceived importance of cache sharing, neither positive nor negative effects from cache sharing are significant for most of the program executions in the PARSEC benchmark suite, regardless of the types of parallelism, input data sets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch between the software design (and compilation) of multithreaded applications and CMP architectures. By performing source code transformations on the programs in a cache-sharing-aware manner, we observe up to a 53 percent performance increase when the threads are placed on cores appropriately. This confirms the software-hardware mismatch as a main reason for the observed insignificance of the influence of cache sharing, and indicates the important role of cache-sharing-aware transformations (a topic only sporadically studied so far) in exerting the power of shared cache.

  • User-Level Implementations of Read-Copy Update

    Page(s): 375 - 382
    Multimedia

    Read-copy update (RCU) is a synchronization technique that often replaces reader-writer locking because RCU's read-side primitives are both wait-free and an order of magnitude faster than uncontended locking. Although RCU updates are relatively heavyweight, the importance of read-side performance is increasing as computing systems become more responsive to changes in their environments. RCU is heavily used in several kernel-level environments. Unfortunately, kernel-level implementations use facilities that are often unavailable to user applications. The few prior user-level RCU implementations either provided inefficient read-side primitives or restricted the application architecture. This paper fills this gap by describing efficient and flexible RCU implementations based on primitives commonly available to user-level applications. Finally, this paper compares these RCU implementations with each other and with standard locking, which enables choosing the best mechanism for a given workload. This work opens the door to widespread user-application use of RCU.

  • What's new in Transactions [advertisement]

    Page(s): 383
    Freely Available from IEEE
  • New issue alerts [advertisement]

    Page(s): 384
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers.

Meet Our Editors

Editor-in-Chief
David Bader
College of Computing
Georgia Institute of Technology