25th IEEE Symposium on Reliable Distributed Systems (SRDS '06)

2-4 October 2006

Results 1-25 of 48
  • 25th IEEE Symposium on Reliable Distributed Systems - Cover

    Publication Year: 2006, Page(s): c1
  • 25th IEEE Symposium on Reliable Distributed Systems - Title

    Publication Year: 2006, Page(s): i - iii
  • 25th IEEE Symposium on Reliable Distributed Systems - Copyright

    Publication Year: 2006, Page(s): iv
  • 25th IEEE Symposium on Reliable Distributed Systems - Table of contents

    Publication Year: 2006, Page(s): v - viii
  • Message from the Symposium Chair

    Publication Year: 2006, Page(s): ix
  • Message from the Technical Program Co-chairs

    Publication Year: 2006, Page(s): x
  • Conference Committees

    Publication Year: 2006, Page(s): xi - xii
  • External reviewers

    Publication Year: 2006, Page(s): xiii
  • Reducing the Availability Management Overheads of Federated Content Sharing Systems

    Publication Year: 2006, Page(s): 5 - 17
    Cited by: Papers (1)

    We consider the problem of ensuring high data availability in federated content sharing systems. Ideally, such a system would provide high data availability in a device-transparent manner, so that users are not faced with the time-consuming and error-prone task of managing data replicas across the constituent devices of the system. We propose a novel unified availability model and a decentralized replication algorithm to approximate this ideal. Our availability model addresses three different concerns: availability during connected operation (online), availability during disconnected operation (offline), and availability after permanent disconnection from the federated system (ownership). Our replication algorithm centers around the intuition that devices should selfishly use their local storage to ensure offline and ownership availability for their individual owners. Excess storage, however, is used communally to ensure high online availability for all shared content. Evaluation of an implementation shows that our algorithm rapidly reaches stable and communally desirable configurations when there is sufficient space. Consistent with the fact that devices in a federated system are owned by different users, however, as space becomes highly constrained, the system approaches a non-cooperative configuration where devices only hoard content to serve their individual owners' needs.

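    The selfish-then-communal storage split described above can be sketched in a few lines: a device first pins its owner's data (for offline and ownership availability), then spends any excess capacity on the communal replicas whose online availability is currently lowest. The function and field names below are illustrative assumptions, not the paper's algorithm.

        def allocate_storage(owner_items, communal_items, capacity):
            """Fill a device's storage: owner data first, then communal replicas.

            owner_items / communal_items: lists of (item_id, size, online_avail).
            Returns the set of item_ids the device chooses to host.
            """
            hosted, used = set(), 0
            # Selfish phase: pin the owner's data for offline/ownership availability.
            for item_id, size, _ in owner_items:
                if used + size <= capacity:
                    hosted.add(item_id)
                    used += size
            # Communal phase: spend excess space on the least-available shared content.
            for item_id, size, avail in sorted(communal_items, key=lambda x: x[2]):
                if item_id not in hosted and used + size <= capacity:
                    hosted.add(item_id)
                    used += size
            return hosted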
  • Topology Sensitive Replica Selection

    Publication Year: 2006, Page(s): 18 - 28
    Cited by: Papers (1)

    As the disks typically found in personal computers grow larger, protecting data by replicating it on a collection of "peer" systems rather than on dedicated high-performance storage systems can provide comparable reliability and availability guarantees but at reduced cost and complexity. In order to be adopted, peer-to-peer storage systems must be able to replicate data on hosts that are trusted, secure, and available. However, recent research has shown that the traditional model, where nodes are assumed to have identical levels of trust, to behave independently, and to have similar failure modes, is oversimplified. Thus, there is a need for a mechanism that automatically and efficiently selects replica nodes from a large number of available hosts with varying capabilities and trust levels. In this paper we present an algorithm to handle replica node selection, either for new replica groups or to replace failed replicas, in a peer-to-peer storage system. We show through simulation that our algorithm maintains the node inter-connection topology while minimizing the cost of recovery from a failed replica, measured by the number of nodes affected by the failure and the number of inter-node messages.

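    One way to make such a selection concrete (an illustrative sketch, not the paper's algorithm) is to grow the replica group greedily over topology neighbors of a high-scoring seed node, so that replicas stay close and a failure's recovery traffic stays local:

        def select_replicas(graph, candidates, group_size):
            """Greedily grow a replica group over topology neighbors.

            graph: dict node -> set of neighbor nodes.
            candidates: dict node -> score (e.g., trust * availability).
            Keeping replicas topologically close bounds how many nodes a
            recovery touches when one replica fails.
            """
            seed = max(candidates, key=candidates.get)
            group = {seed}
            while len(group) < group_size:
                frontier = {n for m in group for n in graph.get(m, ())
                            if n in candidates and n not in group}
                if not frontier:
                    break  # no connected candidates left
                group.add(max(frontier, key=candidates.get))
            return group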
  • Deleting Files in the Celeste Peer-to-Peer Storage System

    Publication Year: 2006, Page(s): 29 - 38
    Cited by: Papers (1)

    Celeste is a robust peer-to-peer object store built on top of a distributed hash table (DHT). Celeste is a working system, developed by Sun Microsystems Laboratories. During the development of Celeste, we faced the challenge of complete object deletion, and moreover, of deleting "files" composed of several different objects. This important problem is not solved by merely deleting meta-data, as there are scenarios in which all file contents must be deleted, e.g., due to a court order. Complete file deletion in a realistic peer-to-peer storage system has not been previously dealt with due to the intricacy of the problem: the system may experience high churn rates, nodes may crash or have intermittent connectivity, and the overlay network may become partitioned at times. We present an algorithm that eventually deletes all file content, data and meta-data, in the aforementioned complex scenarios. The algorithm is fully functional and has been successfully integrated into Celeste.

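    The abstract does not spell out the mechanism, but a standard building block for eventual deletion under churn is a tombstone (a durable deletion record) spread by anti-entropy until every replica has purged the object. The toy below assumes that design; it is not Celeste's actual protocol.

        import time

        class Replica:
            """Toy replica: deletion propagates as a tombstone via anti-entropy."""

            def __init__(self):
                self.objects = {}     # object_id -> data
                self.tombstones = {}  # object_id -> deletion timestamp

            def delete(self, object_id):
                self.tombstones[object_id] = time.time()
                self.objects.pop(object_id, None)

            def anti_entropy(self, peer):
                """Exchange tombstones so deletes survive churn and partitions."""
                for oid, ts in {**self.tombstones, **peer.tombstones}.items():
                    self.tombstones[oid] = ts
                    peer.tombstones[oid] = ts
                    self.objects.pop(oid, None)
                    peer.objects.pop(oid, None)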
  • Reliably Executing Tasks in the Presence of Untrusted Entities

    Publication Year: 2006, Page(s): 39 - 50
    Cited by: Papers (4)

    In this work we consider a distributed system formed by a master processor and a collection of n processors (workers) that can execute tasks; worker processors are untrusted and might act maliciously. The master assigns tasks to workers to be executed. Each task returns a binary value, and we want the master to accept only correct values with high probability. Furthermore, we assume that the service provided by the workers is not free; for each task that a worker is assigned, the master is charged with a work-unit. Therefore, considering a single task assigned to several workers, our goal is to have the master accept the correct value of the task with high probability, with the smallest possible amount of work (the number of workers to which the master assigns the task). We explore two ways of bounding the number of faulty processors: (a) a fixed bound f < n/2 on the maximum number of workers that may fail, and (b) a probability p < 1/2 of any processor being faulty (each processor is faulty with probability p, independently of the rest). Our work demonstrates that it is possible to obtain high probability of correct acceptance with low work. In particular, considering both mechanisms of bounding the number of malicious workers, we first show lower bounds on the minimum amount of (expected) work required so that any algorithm accepts the correct value with probability of success 1 - ε, where ε ≪ 1 (e.g., 1/n). Then we develop and analyze two algorithms, each using a different decision strategy, and show that both algorithms obtain the same probability of success 1 - ε, and in doing so, they require similar upper bounds on the (expected) work. Furthermore, under certain conditions, these upper bounds are asymptotically optimal with respect to our lower bounds.

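    For the probabilistic model (b), the simplest decision strategy is to assign the task to n workers and accept the majority value; the binomial tail then gives the probability of accepting a wrong value, so the minimum work for a target ε can be computed directly. This is a worked instance of the model, not the paper's (more refined) algorithms:

        from math import comb

        def majority_error(n, p):
            """P(majority of n independent workers is wrong), per-worker fault prob p."""
            return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                       for k in range(n // 2 + 1, n + 1))

        def workers_needed(p, eps):
            """Smallest odd n whose majority-vote error probability is <= eps."""
            n = 1
            while majority_error(n, p) > eps:
                n += 2
            return n

        # Example: p = 0.2, eps = 1e-3 -> a few dozen workers suffice.
        print(workers_needed(0.2, 1e-3))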
  • WRAPS: Denial-of-Service Defense through Web Referrals

    Publication Year: 2006, Page(s): 51 - 60
    Cited by: Papers (2)

    The Web is a complicated graph, with millions of Web sites interlinked together. In this paper, we propose to use this Web site graph structure to mitigate flooding attacks on a Web site, using a new Web referral architecture for privileged service ("WRAPS"). WRAPS allows a legitimate client to obtain a privilege URL through a click on a referral hyperlink from a Web site trusted by the target Web site. Using that URL, the client can get privileged access to the target Web site in a manner that is far less vulnerable to a DDoS flooding attack. WRAPS does not require changes to Web client software and is extremely lightweight for referrer Web sites, which eases its deployment. The massive scale of the Web site graph could deter attempts to isolate a Web site through blocking all referrers. We present the design of WRAPS and the implementation of a prototype system used to evaluate our proposal. Our empirical study demonstrates that WRAPS enables legitimate clients to connect to a Web site smoothly in spite of an intensive flooding attack, at the cost of small overheads on the Web site's ISP's edge routers.

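    A privilege URL of the kind described can be sketched as a capability token: the referrer binds the client's IP and an expiry time under a key shared with the target's edge routers, which can then verify requests cheaply. The key handling and token format below are assumptions for illustration, not WRAPS's actual scheme.

        import hmac, hashlib, time

        SECRET = b"shared-with-edge-router"  # assumption: target/ISP share a key

        def privilege_url(client_ip, ttl=600):
            """Referrer mints a privileged URL binding the client IP to an expiry."""
            expiry = int(time.time()) + ttl
            msg = f"{client_ip}|{expiry}".encode()
            token = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:32]
            return f"https://target.example/priv?e={expiry}&t={token}"

        def verify(client_ip, expiry, token):
            """Edge router drops traffic whose token fails this cheap check."""
            msg = f"{client_ip}|{expiry}".encode()
            good = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:32]
            return int(expiry) > time.time() and hmac.compare_digest(token, good)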
  • A Client-Transparent Approach to Defend Against Denial of Service Attacks

    Publication Year: 2006, Page(s): 61 - 70
    Cited by: Papers (2) | Patents (1)

    Denial of service (DoS) attacks attempt to consume a server's resources (network bandwidth, computing power, main memory, disk bandwidth, etc.) to near exhaustion so that there are no resources left to handle requests from legitimate clients. An effective solution to defend against DoS attacks is to filter DoS attack requests at the earliest point (say, the Web site's firewall), before they consume much of the server's resources. Most defenses against DoS attacks attempt to filter requests from inauthentic clients before they consume much of the server's resources. Client authentication using techniques like IPSec or SSL may often require changes to the client-side software and may additionally require superuser privileges at the client for deployment. Further, using digital signatures (as in SSL) makes verification very expensive, thereby making the verification process itself a viable DoS target for the adversary. In this paper, we propose a lightweight, client-transparent technique to defend against DoS attacks with two unique features. First, our technique can be implemented entirely using the JavaScript support provided by a standard client-side browser such as Mozilla Firefox or Microsoft Internet Explorer; client transparency follows from the fact that (i) no changes to client-side software are required, (ii) no client-side superuser privileges are required, and (iii) clients (human beings or automated clients) can browse a DoS-protected Web site in the same manner that they browse other Web sites. Second, although we operate at the client-side browser (HTTP layer), our technique enables fast IP-level packet filtering at the server's firewall and requires no changes to the application(s) hosted by the Web server. In this paper we present a detailed design of our technique along with a detailed security analysis. We also describe a concrete implementation of our proposal on the Linux kernel and present an evaluation using two applications: the bandwidth-intensive Apache HTTPD and the database-intensive TPC-W. Our experiments show that our approach incurs a low performance overhead and is resilient to DoS attacks.

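    One concrete way JavaScript at the client can enable IP-level filtering (an assumption in the spirit of the description above, not necessarily the paper's exact mechanism) is a time-windowed, per-client port derived from a server secret: the browser script computes which port to contact, and the firewall recomputes and checks it per packet.

        import hmac, hashlib, time

        SERVER_KEY = b"firewall-secret"   # assumption: known only to server/firewall
        WINDOW = 60                       # seconds per key rotation

        def client_port(client_ip, when=None):
            """Port the client's JavaScript should target in this time window."""
            window = int((when or time.time()) // WINDOW)
            digest = hmac.new(SERVER_KEY, f"{client_ip}|{window}".encode(),
                              hashlib.sha256).digest()
            return 1024 + int.from_bytes(digest[:2], "big") % 64000

        def firewall_accepts(client_ip, dst_port):
            """IP-level check: recompute the port for this and the previous window."""
            now = time.time()
            return dst_port in (client_port(client_ip, now),
                                client_port(client_ip, now - WINDOW))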
  • Proactive Resilience Revisited: The Delicate Balance Between Resisting Intrusions and Remaining Available

    Publication Year: 2006, Page(s): 71 - 82
    Cited by: Papers (3)

    In a recent paper, we presented proactive resilience as a new approach to proactive recovery, based on architectural hybridization. We showed that, with appropriate assumptions about fault rate, proactive resilience makes it possible to build distributed intrusion-tolerant systems guaranteed not to suffer more than the assumed number of faults during their lifetime. In this paper, we explore the impact of these assumptions in asynchronous systems, and derive conditions that should be met by practical systems in order to guarantee long-lived, i.e., available, intrusion-tolerant operation. Our conclusions are based on analytical and simulation results as implemented in Möbius, and we use the same modeling environment to show that our approach offers higher resilience in comparison with other proactive intrusion-tolerant system models.

  • Call Availability Prediction in a Telecommunication System: A Data Driven Empirical Approach

    Publication Year: 2006, Page(s): 83 - 95
    Cited by: Papers (11)

    Availability prediction in a telecommunication system plays a crucial role in its management, either by alerting the operator to potential failures or by proactively initiating preventive measures. In this paper, we apply linear (ARMA, multivariate, random walk) and nonlinear (radial and universal basis functions) regression techniques to recognize system failures and to predict the system's call availability up to 15 minutes in advance. Second, we introduce a novel nonlinear modeling technique for call availability prediction. We benchmark all five techniques against each other. The applied modeling methods are data driven rather than analytical and can handle large amounts of data. We apply the modeling techniques to real data of a commercial telecommunication platform. The data used for modeling includes: a) time-stamped event-based log files; and b) continuously measured system states. Results are given a) in terms of receiver operating characteristic analysis, summarized as area under the curve (AUC), for classification into failure and non-failure states, and b) as a cost-benefit analysis. Our findings suggest: a) a high degree of nonlinearity in the data; b) statistically significantly improved forecasting performance and cost-benefit ratio for the nonlinear modeling techniques; and c) that log file data does not improve model performance with any modeling technique.

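    For reference, the AUC metric the paper reports can be computed directly from predicted scores and labels via the Mann-Whitney rank form; the tiny example below is illustrative, not the paper's evaluation code.

        def auc(scores, labels):
            """Area under the ROC curve via the Mann-Whitney rank statistic.

            scores: predicted probability of failure; labels: 1 = failure, 0 = not.
            AUC = P(score of a random failure > score of a random non-failure).
            """
            pos = [s for s, y in zip(scores, labels) if y == 1]
            neg = [s for s, y in zip(scores, labels) if y == 0]
            wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
            return wins / (len(pos) * len(neg))

        print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0: perfect ranking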
  • FT-PPTC: An Efficient and Fault-Tolerant Commit Protocol for Mobile Environments

    Publication Year: 2006, Page(s): 96 - 105
    Cited by: Papers (5)

    Transactions are required not only for wired networks but also for the emerging wireless environments where mobile and fixed hosts participate side by side in the execution of the transaction. This heterogeneous environment is characterized by constraints in mobile host capabilities, network connectivity, and an increasing number of possible failure modes. Classical atomic commit protocols used in wired networks are therefore not directly suitable for this heterogeneous environment. Furthermore, the few commit protocols designed for mobile transactions either consider mobile hosts only as initiators, not as active participants, or show a high resource blocking time. We present the Fault-Tolerant Pre-Phase Transaction Commit (FT-PPTC) protocol for mobile environments. FT-PPTC decouples the commit of mobile participants from that of fixed participants. Consequently, the commit set can be reduced to a set of entities in the fixed network, so the commit can easily be supported by any traditional atomic commit protocol, such as the established 2PC protocol. We integrate fault tolerance as a key feature of FT-PPTC. Performance evaluations confirm the efficiency, scalability, and low resource blocking time of our approach.

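    Since FT-PPTC reduces the commit set to fixed-network entities and then relies on a traditional protocol such as 2PC, the classic 2PC coordinator loop is the relevant reference point. A minimal sketch, with an assumed participant interface of prepare()/commit()/abort():

        def two_phase_commit(participants):
            """Classic 2PC over the fixed-network commit set.

            Each participant must expose prepare() -> bool, commit(), abort().
            """
            # Phase 1: voting. Any "no" vote (or a crash/timeout, modeled here
            # as an exception) forces a global abort.
            try:
                ready = all(p.prepare() for p in participants)
            except Exception:
                ready = False
            # Phase 2: decision broadcast, including to already-prepared nodes.
            for p in participants:
                if ready:
                    p.commit()
                else:
                    p.abort()
            return ready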
  • Modeling Distributed Computing System Reliability with DRBD

    Publication Year: 2006, Page(s): 106 - 118
    Cited by: Papers (4)

    Nowadays, most of the devices and systems we commonly use are driven or managed by microchips and computers: cars, music players, phones, trains, planes, and so on. A consolidated trend of technology is to substitute mechanical parts with electronic ones, analog devices and controls with digital ones, and so on. In this context, features like security, availability, and reliability, usually summarized under the concept of dependability, are receiving greater attention. Dependability analysis, especially of critical parts such as computing systems or subsystems, is becoming more strategic: specific requirements and explicit or tighter constraints have to be satisfied. Despite this, there is a lack of suitable tools to properly model and analyze these aspects, with particular reference to reliability. To fill this gap, we propose the dynamic reliability block diagram (DRBD) modeling tool, derived from the reliability block diagram (RBD) formalism. DRBD makes it possible to model the dynamic reliability behavior of a system through dependence models, which represent dynamic behaviors such as redundancy, load sharing, and multiple, probabilistic, and common failure modes. In this paper, the expressiveness and other capabilities of DRBD are illustrated through the analysis of a complex distributed computing system taken as an example.

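    The static RBD semantics underlying DRBD reduce to two rules: series composition multiplies reliabilities, and parallel (redundant) composition multiplies failure probabilities. DRBD's contribution is the dynamic dependencies layered on top; the sketch below covers only the static part.

        from functools import reduce

        def series(*rs):
            """All blocks needed: reliabilities multiply."""
            return reduce(lambda a, b: a * b, rs, 1.0)

        def parallel(*rs):
            """Any block suffices (redundancy): failure probabilities multiply."""
            return 1.0 - reduce(lambda a, b: a * (1.0 - b), rs, 1.0)

        # Two redundant servers (r = 0.9 each) in series with one switch (r = 0.99):
        print(series(parallel(0.9, 0.9), 0.99))  # 0.9801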
  • DRIFT: Efficient Message Ordering in Ad Hoc Networks Using Virtual Flooding

    Publication Year: 2006, Page(s): 119 - 131

    We present DRIFT, a total order multicast algorithm for ad hoc networks with mobile or static nodes. Due to the ad hoc nature of the network, DRIFT uses flooding for message propagation. The key idea of DRIFT is virtual flooding: a way of using unrelated message streams to propagate message causality information in order to accelerate message delivery. We describe DRIFT in detail. We evaluate its performance in a simulator and in a wireless sensor network. In both cases our results demonstrate that the performance of DRIFT exceeds that of the simple total order multicast algorithm designed for wired networks on which it is based. In simulation at scale, for certain experiment settings, DRIFT achieved a speedup of several orders of magnitude.

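    Virtual flooding can be modeled minimally with vector clocks: every outgoing message, whatever stream it belongs to, carries the sender's clock, and receivers merge it, so causality information travels on unrelated traffic. This is a toy model of the idea, not DRIFT's wire format.

        class Node:
            """Vector-clock bookkeeping; stamps piggyback on *any* outgoing message."""

            def __init__(self, node_id, n_nodes):
                self.id = node_id
                self.clock = [0] * n_nodes

            def send(self, payload):
                self.clock[self.id] += 1
                return (payload, list(self.clock))   # stamp rides along for free

            def receive(self, message):
                payload, stamp = message
                # Merge: even unrelated streams advance our causal knowledge.
                self.clock = [max(a, b) for a, b in zip(self.clock, stamp)]
                self.clock[self.id] += 1
                return payload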
  • Generalised Repair for Overlay Networks

    Publication Year: 2006, Page(s): 132 - 142
    Cited by: Papers (2)

    We present and evaluate a generic approach to the repair of overlay networks which identifies general principles of overlay repair and embodies these as a reusable service. At the heart of our approach is an algorithm that discovers the extent of a failed section of any type of overlay, and assigns responsibility to carry out the repair. The repair strategy itself is 'pluggable' and can be tailored to the requirements of a specific overlay type or instance. Our approach is efficient in terms of the number of repair-related message exchanges it incurs; scalable in that it involves only nodes in the locality of the failed section of the overlay; and resilient in that it correctly handles cases in which multiple adjacent nodes fail simultaneously, and it tolerates new failures that occur while a repair is underway. The benefits of our approach are that: (i) it extracts and encapsulates best practice in repair for overlays; (ii) it simplifies the design and implementation of new overlays (because repair issues can be treated orthogonally to basic functionality); and (iii) it supports tailorable levels of dependability for overlays, including pluggable repair strategies.

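    A 'pluggable' repair strategy maps naturally onto a strategy interface: the generic service discovers the failed section and assigns responsibility, then delegates the actual rewiring to overlay-specific code. The interface below is one illustrative reading of that design, not the paper's API.

        from abc import ABC, abstractmethod

        class RepairStrategy(ABC):
            """Overlay-specific policy; the generic service supplies the mechanics."""

            @abstractmethod
            def repair(self, failed_nodes, boundary_nodes):
                """Rebuild links/state around failed_nodes; boundary_nodes survive."""

        class RingRepair(RepairStrategy):
            def repair(self, failed_nodes, boundary_nodes):
                # For a ring overlay: splice the two surviving neighbors together.
                left, right = boundary_nodes
                left.successor = right
                right.predecessor = left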
  • Decentralized Local Failure Detection in Dynamic Distributed Systems

    Publication Year: 2006, Page(s): 143 - 154
    Cited by: Papers (10)

    A failure detector is an important building block when constructing fault-tolerant distributed systems. In asynchronous distributed systems, failed processes are often indistinguishable from slow processes. A failure detector is an oracle that can intelligently suspect processes to have failed. Different classes of failure detectors have been proposed to solve different kinds of problems. Almost all of this work is focused on global failure detection, and moreover, on systems that do not contain mobile nodes or dynamic topologies. In this paper, we present ◊P^ml, a local failure detector that can tolerate mobility and topology changes. This means that ◊P^ml can distinguish between a failed process and a process that has moved away from its original location. We also establish an upper bound on the duration for which a process wrongly suspects a node that has moved away from its neighborhood. We support our theoretical results with experimental findings from an implementation of this algorithm for sensor networks.

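    The baseline mechanism such detectors refine is a heartbeat monitor with a silence bound; the paper's contribution is tolerating mobility on top of that (so a node that merely moved is eventually unsuspected). The sketch below is the textbook local detector, not ◊P^ml itself.

        import time

        class HeartbeatDetector:
            """Eventually-perfect-style local detector: suspect after a silence bound."""

            def __init__(self, timeout=2.0):
                self.timeout = timeout
                self.last_seen = {}   # neighbor id -> time of last heartbeat

            def heartbeat(self, node):
                self.last_seen[node] = time.time()

            def suspects(self):
                now = time.time()
                return {n for n, t in self.last_seen.items()
                        if now - t > self.timeout}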
  • Improvements and Reconsideration of Distributed Snapshot Protocols

    Publication Year: 2006, Page(s): 155 - 164

    Distributed snapshots are an important building block for distributed systems and, among other applications, are useful for constructing efficient checkpointing protocols. Beyond the overhead they impose, existing distributed snapshot protocols are not trivially applicable (if at all) in many of today's distributed systems, e.g., grid, mobile, and sensor systems. After presenting the shortcomings and the inapplicability of the most popular existing distributed snapshot protocols, this paper discusses directions for improving them. In addition, it presents a new and important improvement to the most popular distributed snapshot protocol, presented by Chandy and Lamport in 1985. Although the proposed improvement is simple and easy to implement, it has significant benefits in reducing the software and hardware overheads of distributed snapshots. The paper then presents proofs of the safety and progress of the new protocol. Lastly, it presents a performance analysis of the protocol using stochastic models.

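    For context, the Chandy-Lamport protocol referenced above snapshots a system by flooding markers: a process records its state on initiating or on its first marker, sends markers on all outgoing channels, and records each incoming channel until that channel delivers a marker. A compact sketch of the per-process bookkeeping:

        class SnapshotProcess:
            """Per-process bookkeeping for the Chandy-Lamport protocol."""

            def __init__(self, in_channels, send_marker_on_all_out):
                self.in_channels = in_channels
                self.send_markers = send_marker_on_all_out
                self.local_snapshot = None
                self.channel_state = {}   # in-channel -> recorded in-flight messages
                self.recording = {}       # in-channel -> messages seen so far

            def initiate(self, local_state):
                self._record(local_state)

            def on_receive(self, channel, msg, local_state):
                if msg == "MARKER":
                    if self.local_snapshot is None:
                        self._record(local_state)   # first marker: act as initiator
                    # A marker closes its channel; what was recorded is its state.
                    self.channel_state[channel] = self.recording.pop(channel, [])
                elif channel in self.recording:
                    self.recording[channel].append(msg)  # in flight across the cut

            def _record(self, local_state):
                self.local_snapshot = local_state
                self.recording = {c: [] for c in self.in_channels}
                self.send_markers()   # markers flood out on every outgoing channel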
  • Weakly-Persistent Causal Objects in Dynamic Distributed Systems

    Publication Year: 2006, Page(s): 165 - 174

    In the context of clients accessing a read/write shared object, persistency of a written value is a property stating that a value written into the object is always available unless overwritten by a successive write operation. This property can be easily guaranteed in a static distributed system provided that either a subset of the processes implementing the object does not crash, or processes can crash and then recover while being able to retrieve their last state. Unfortunately, enforcing this property in a potentially large-scale and dynamic distributed system (e.g., a P2P system) is far from trivial when considering the case in which processes implementing the object may fail or leave at any time without notifying any other process (i.e., the last state might not be retrievable). The paper introduces the notion of weak persistency, which guarantees persistency of values when the system becomes quiescent (arrivals and departures subside). An implementation of a weakly-persistent object ensuring causal consistency is provided along with its correctness proof. The interest of causal consistency lies in the fact that, contrary to atomic consistency, it can be maintained even during non-quiescent periods of the distributed system (i.e., when persistency is not guaranteed).

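    Causal consistency of such an object can be enforced with the standard causal-delivery test: apply a write from process j only when it is j's next write and everything it causally depends on has already been applied locally. The sketch shows that test (vector clocks assumed), not the paper's weak-persistence machinery.

        def can_apply(local_vc, msg_vc, sender):
            """Standard causal-delivery test for a write stamped msg_vc from sender.

            Deliverable iff it is the sender's next write (msg_vc[sender] ==
            local_vc[sender] + 1) and we have seen everything it depends on.
            """
            return (msg_vc[sender] == local_vc[sender] + 1 and
                    all(msg_vc[k] <= local_vc[k]
                        for k in range(len(local_vc)) if k != sender))

        # Example: a write stamped [1, 1, 0] from process 1 is deliverable at a
        # replica whose clock is [1, 0, 0], but not at one whose clock is [0, 0, 0].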
  • Non-Blocking Synchronous Checkpointing Based on Rollback-Dependency Trackability

    Publication Year: 2006, Page(s): 411
    Cited by: Papers (2)

    This article proposes an original approach that applies the rollback-dependency trackability (RDT) property to implement a new non-blocking synchronous checkpointing protocol, called RDT-NBS, that takes mutable checkpoints and efficiently supports concurrent initiators. Mutable checkpoints can be saved in non-stable storage and make it possible for non-blocking synchronous checkpointing protocols to save a minimal number of checkpoints in stable storage during the construction of a consistent global checkpoint. We prove that this minimality property does not hold in the presence of concurrent checkpointing initiations. Even so, RDT-NBS uses mutable checkpoints to reduce the use of stable storage while assuring the existence of a consistent global checkpoint in stable storage. We also present simulation results that compare RDT-NBS to quasi-synchronous RDT.

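    The cost saving behind mutable checkpoints can be modeled in a few lines: tentative checkpoints stay in volatile memory and are written to stable storage only if the protocol decides they belong to the consistent global checkpoint. The API below is an assumption for illustration, not RDT-NBS itself.

        import copy

        class MutableCheckpointer:
            """Toy model of mutable checkpoints."""

            def __init__(self, stable_write):
                self.stable_write = stable_write   # e.g., an fsync'd file append
                self.mutable = None                # latest tentative checkpoint

            def take_mutable(self, process_state):
                self.mutable = copy.deepcopy(process_state)   # cheap, non-stable

            def commit(self):
                if self.mutable is not None:
                    self.stable_write(self.mutable)   # stable-storage cost paid
                    self.mutable = None               # only for committed state

            def discard(self):
                self.mutable = None   # initiation aborted or superseded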
  • PLATO: Predictive Latency-Aware Total Ordering

    Publication Year: 2006, Page(s): 175 - 188
    Cited by: Papers (2)

    PLATO is a predictive total ordering protocol designed for low-latency multicast in datacenters. It predicts out-of-order arrival of multicast packets by observing their inter-arrival times, and delays packets before passing them up to the application only if it believes the packets to have arrived in the wrong order. We show through experimentation on real datacenter-style networks that the inter-arrival time of consecutive packet pairs is an excellent predictor of out-of-order delivery. We evaluate an implementation of PLATO on the Emulab testbed, and show that it drives down delivery latencies by more than a factor of 2 compared to the fixed-sequencer protocol.

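    PLATO's heuristic can be paraphrased as: a very small inter-arrival gap between consecutive packets means their wire order is untrustworthy, so reorder that pair by sender timestamp; otherwise deliver immediately. The sketch below simplifies by pairing each packet with its successor (PLATO itself holds packets only on predicted disorder), and the threshold is an assumed value.

        import time

        REORDER_GAP = 0.0005   # assumed: gaps under 0.5 ms look suspicious

        class ReorderGuard:
            """Reorder a packet pair only when a tiny inter-arrival gap
            predicts that the network may have swapped it."""

            def __init__(self, deliver):
                self.deliver = deliver     # callback taking packets in order
                self.held = None           # (sender_timestamp, packet)
                self.last_arrival = None

            def on_packet(self, packet, sender_timestamp):
                now = time.time()
                gap = None if self.last_arrival is None else now - self.last_arrival
                self.last_arrival = now
                current = (sender_timestamp, packet)
                if self.held is None:
                    self.held = current
                    return
                if gap is not None and gap < REORDER_GAP:
                    # Suspicious pair: emit the two in sender-timestamp order.
                    first, second = sorted([self.held, current],
                                           key=lambda x: x[0])
                    self.deliver(first[1])
                    self.held = second
                else:
                    self.deliver(self.held[1])
                    self.held = current
                # A real implementation also flushes self.held on a short timer
                # so the last packet of a burst is not held indefinitely.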