
Reliable Distributed Systems, 2004. Proceedings of the 23rd IEEE International Symposium on

Date 18-20 Oct. 2004


Displaying Results 1 - 25 of 45
  • An integrated architecture for dependable embedded systems

    Page(s): 160 - 161

    Summary form only given. A federated architecture is characterized in that every major function of an embedded system is allocated to a dedicated hardware unit. In a distributed control system this implies that adding a new function is tantamount to adding a new node. This has led to a large number of nodes and networks that must be interconnected in order to achieve some functional coordination. Adding fault-tolerance to a federated architecture, e.g., by the provision of triple modular redundancy (TMR), leads to a further significant increase in the number of nodes and networks. The major advantages of a dedicated architecture are the physical encapsulation of the nearly autonomous subsystems, their outstanding fault containment and their clear-cut complexity management (independent development) in case the subsystems are nearly autonomous. An integrated distributed architecture for mixed-criticality applications must be based on a core design that supports the safety requirements of the highest considered criticality class. This is of particular importance in safety-critical applications, where the physical structure of the integrated system is determined to a significant extent by the independence requirement of fault-containment regions. The central part of an integrated distributed architecture for time-critical systems must provide the following core services: deterministic and timely transport of messages; fault tolerant clock synchronization; strong fault isolation with respect to arbitrary node failures; and consistent diagnosis of failing nodes. Any architecture that provides these core services can be used as a base architecture for an integrated distributed embedded system architecture. An example of such an integrated architecture is the time-triggered architecture (TTA). In this contribution we describe the structure and the services of the TTA that has been developed during the last twenty years and is deployed in a number of safety-critical applications in the transport sector.
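
    The triple modular redundancy mentioned above can be illustrated with a minimal majority voter. This is only a sketch of the general TMR idea, not the TTA mechanism; the function name `tmr_vote` and the choice to raise an error when all three replicas disagree are assumptions made for illustration.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of three replicas.

    Hypothetical helper: raises ValueError when no two replicas agree,
    since a plain 2-out-of-3 voter cannot mask that case.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    raise ValueError("no majority among the three replicas")
```

    With one faulty replica the voter still returns the correct value, which is why TMR triples the node count of each protected function.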

  • Hardware support for high performance, intrusion- and fault-tolerant systems

    Page(s): 195 - 204

    The paper proposes a combined hardware/software approach for realizing high performance, intrusion- and fault-tolerant services. The approach is demonstrated for (yet not limited to) an attribute authority server, which provides a compelling application due to its stringent performance and security requirements. The key element of the proposed architecture is an FPGA-based, parallel crypto-engine providing (1) optimally dimensioned RSA Processors for efficient execution of computationally intensive RSA signatures and (2) a KeyStore facility used as tamper-resistant storage for preserving secret keys. To achieve linear speed-up (with the number of RSA Processors) and deadlock-free execution in spite of resource-sharing and scheduling/synchronization issues, we have resorted to a number of performance enhancing techniques (e.g., use of different clock domains, optimal balance between internal and external parallelism) and have formally modeled and mechanically proved our crypto-engine with the Spin model checker. At the software level, the architecture combines active replication and threshold cryptography, but in contrast with previous work, the code of our replicas is multithreaded so it can efficiently use an attached parallel crypto-engine to compute an attribute authority partial signature (as required by threshold cryptography). This results in replicated systems that exhibit nondeterministic behavior, which cannot be handled with conventional replication approaches. Our architecture is based on a preemptive deterministic scheduling algorithm to govern scheduling of replica threads and guarantee strong replica consistency.

  • Token-based atomic broadcast using unreliable failure detectors

    Page(s): 52 - 65

    Many atomic broadcast algorithms have been published in the last twenty years. Token-based algorithms represent a large class of these algorithms. Interestingly, all the token-based atomic broadcast algorithms rely on a group membership service, i.e., none of them uses unreliable failure detectors directly. The paper presents the first token-based atomic broadcast algorithm that uses an unreliable failure detector - the new failure detector denoted by ℛ - instead of a group membership service. The failure detector ℛ is compared with <>V and <>S. In order to make it easier to understand the atomic broadcast algorithm, the paper derives the atomic broadcast algorithm from a token-based consensus algorithm that also uses the failure detector ℛ.

  • Progress in real-time fault tolerance

    Page(s): 109 - 111

    This paper discusses progress in the field of real-time fault tolerance. In particular, it considers synchronous vs. asynchronous fault tolerance designs, maintaining replica consistency, alternative fault tolerance strategies, including checkpoint restoration, transactions, and consistent replay, and custom vs. generic fault tolerance.

  • Crash-resilient time-free eventual leadership

    Page(s): 208 - 217

    Leader-based protocols rest on a primitive able to provide the processes with the same unique leader. Such protocols are very common in distributed computing to solve synchronization or coordination problems. Unfortunately, providing such a primitive is far from being trivial in asynchronous distributed systems prone to process crashes. (It is even impossible in fault-prone purely asynchronous systems.) To circumvent this difficulty, several protocols have been proposed that build a leader facility on top of an asynchronous distributed system enriched with synchrony assumptions. This paper considers another approach to building a leader facility: namely, it considers a behavioral property on the flow of messages that are exchanged. This property has the noteworthy feature of not involving timing assumptions. Two protocols based on this time-free property that implement a leader primitive are described. The first one uses potentially unbounded counters, while the second one (which is a little more involved) requires only finite memory. These protocols rely on simple design principles that make them attractive, easy to understand and provably correct.

  • Panel statement: why progress in (composite) fault tolerant real-time systems has been slow (-er than expected... & what can we do about it?)

    Page(s): 112 - 113

    The pervasiveness of computers in our current IT driven society (transportation, e-commerce, e-transactions, communication, process control), also implies our growing dependency on their "correct" functionality. In many a case, the real value of the systems and also our usage of these systems comes, in part, based on the dependency (real or perceived) we are consequently willing to put into the provisioning of the services i.e., the implicit or explicit assurance of trust we put for sustained delivery of desired services. Some systems are considered as safety-critical (flight/reactor control etc), though others are accorded varied degrees of criticality. Nevertheless, our expectancy extends to obtaining the proper services when the system is fault-free and especially when it encounters perturbations (design or operational), e.g., electromagnetic interference or a lightning strike for an aircraft. Consequently, it is important to qualitatively and quantitatively associate some measures of trust in the system's ability to "actually" deliver us the desired services in the presence of faults. This is often termed as "dependability" measures for a system with a plethora of fault-tolerance (FT) strategies to help achieve desired levels of dependability. As before, dependability entails the sustained delivery of services, be they service-critical or cost-critical, regardless of the perturbations encountered during their operation. View full abstract»

  • A stability-oriented approach to improving BGP convergence

    Page(s): 90 - 99

    This paper shows that the elimination of fault-agnostic instability, the instability caused by fault-agnostic distributed control, substantially improves BGP convergence speed. To this end, we first classify BGP convergence instability into two categories: fault-agnostic instability and distribution-inherent instability; secondly, we prove the impossibility of eliminating all distribution-inherent instability in distributed routing protocols; thirdly, we design the grapevine border gateway protocol (G-BGP) to show that all fault-agnostic instability can be eliminated. G-BGP eliminates all fault-agnostic instability under different fault and routing policy scenarios by (i) piggybacking onto BGP UPDATE messages fine-grained information about faults to the nodes affected by the faults, (ii) quickly resolving the uncertainty between link and node failure as well as the uncertainty of whether a node has changed route, and (iii) rejecting obsolete fault information. We have evaluated G-BGP by both analysis and simulation. Analytically, we prove that, by eliminating fault-agnostic instability, G-BGP achieves optimal convergence speed in several scenarios where BGP convergence is severely delayed (e.g., when a node or a link fail-stops), and when the shortest-path-first policy is used, G-BGP asymptotically improves BGP convergence speed except in scenarios where BGP convergence speed is already optimal (e.g., when a node or a link joins). By simulating networks with up to 115 autonomous systems, we observe that G-BGP improves BGP convergence stability and speed by an order of magnitude. View full abstract»

  • Run-time monitoring for dependable systems: an approach and a case study

    Page(s): 41 - 49

    This paper describes a run-time monitoring system designed for systems with the same functionality that are installed in different places and use equivalent hardware configurations, but with slightly different implementations. These systems exhibit common characteristics. They are large software systems, they depend on hardware to execute their functions, and they are usually adjusted to meet new user needs. In this scenario it is unreasonable to assume that software testing will uncover all latent errors. Besides gathering information about a target program as it executes, the proposed run-time monitoring system provides information about the target operating system and the target hardware, in order to improve availability by reducing the time to diagnose failures and to provide a system with the reactive capability of reconfiguring and reinitializing after the occurrence of a failure. A case study for an automatic teller machine system is discussed as an application of the run-time monitoring system and the results from this application are presented.

  • Skewed checkpointing for tolerating multi-node failures

    Page(s): 116 - 125

    Large cluster systems have become widely utilized because they achieve a good performance/cost ratio especially in high performance computing. Although these cluster systems are distributed memory systems, coordinated checkpointing is a promising way to maintain high availability because the computing nodes are tightly connected to one another. However, as the number of computing nodes gets larger, the probability of multi-node failures increases. To tolerate multi-node failures, a large degree of redundancy is required in checkpointing, but this leads to performance degradation. Thus, we propose a new coordinated checkpointing called skewed checkpointing. In this method, checkpointing is skewed every time. Although each checkpointing itself contains only one degree of redundancy, this skewed checkpointing ensures log₂ N degrees of redundancy when the number of nodes is N. In this paper, we present the proposed method and an analysis of the performance overhead. Then, this method is applied to a cluster system and compared with other conventional checkpointing schemes. The results reveal the superiority of our method, especially for large cluster systems.

  • The mutable consensus protocol

    Page(s): 218 - 227

    In this paper we propose the mutable consensus protocol, a pragmatic and theoretically appealing approach to enhance the performance of distributed consensus. First, an apparently inefficient protocol is developed using the simple stubborn channel abstraction for unreliable message passing. Then, performance is improved by introducing judiciously chosen finite delays in the implementation of channels. Although this does not compromise correctness, which rests on an asynchronous system model, it makes it likely that the transmission of some messages is avoided and thus the message exchange pattern at the network level changes noticeably. By choosing different delays in the underlying stubborn channels, the mutable consensus protocol can actually be made to resemble several different protocols. Besides presenting the mutable consensus protocol and four different mutations, we evaluate in detail the particularly interesting permutation gossip mutation, which allows the protocol to scale gracefully to a large number of processes by balancing the number of messages to be handled by each process with the number of communication steps required to decide. The evaluation is performed using a realistic simulation model which accurately reproduces resource consumption in real systems. View full abstract»

  • Using program analysis to identify and compensate for nondeterminism in fault-tolerant, replicated systems

    Page(s): 251 - 263

    Fault-tolerant replicated applications are typically assumed to be deterministic, in order to ensure reproducible, consistent behavior and state across a distributed system. Real applications often contain nondeterministic features that cannot be eliminated. Through the novel application of program analysis to distributed CORBA applications, we decompose an application into its constituent structures, and discover the kinds of nondeterminism present within the application. We target the instances of nondeterminism that can be compensated for automatically, and highlight to the application programmer those instances of nondeterminism that need to be manually rectified. We demonstrate our approach by compensating for specific forms of nondeterminism and by quantifying the associated performance overheads. The resulting code growth is typically limited to one extra line for every instance of nondeterminism, and the runtime overhead is minimal, compared to a fault-tolerant application with no compensation for nondeterminism. View full abstract»

  • Proactive hot spot avoidance for Web server dependability

    Page(s): 309 - 318

    Flash crowds, which result from the sudden increase in popularity of some online content, are among the most important problems that plague today's Internet. Affected servers are overloaded with requests and quickly become "hot spots." They usually suffer from severe performance failures or stop providing service altogether, as there are scarcely any effective techniques to scalably deliver content under hot spot conditions to all requesting clients. In this paper, we propose and evaluate collaborative techniques to detect and proactively avoid the occurrence of hot spots. Using our mechanisms, groups of small- to medium-sized Web servers can team up to withstand unexpected surges of requests in a cost-effective manner. Once a Web server detects a sudden increase in request traffic, it replicates on-the-fly the affected content on other Web servers; subsequent requests are transparently redirected to the copies to offload the primary server. Each server acts both as a primary source for its own content, and as a secondary source for other servers' content in the event of a flash-crowd; scalability and dependability are therefore achieved in a peer-to-peer fashion, with each peer contributing to, and benefiting from, the service. Our proactive hot spot avoidance techniques are implemented as a module for the popular Apache Web server. We have conducted a comprehensive experimental evaluation, which demonstrates that our techniques are effective at dealing with flash crowds and scaling to very high request loads. View full abstract»

  • Model-based validation of an intrusion-tolerant information system

    Page(s): 184 - 194

    An increasing number of computer systems are designed to be distributed across both local and wide-area networks, performing a multitude of critical information-sharing and computational tasks. Malicious attacks on such systems are a growing concern, where attackers typically seek to degrade quality of service by intrusions that exploit vulnerabilities in networks, operating systems, and application software. Accordingly, designers are seeking improved techniques for validating such systems with respect to specified survivability requirements. In this regard, we describe a model-based validation effort that was undertaken as part of a unified approach to validating a networked intrusion-tolerant information system. Model-based results were used to guide the system's design as well as to determine whether a given survivability requirement was satisfied. View full abstract»

  • Performance comparison of a rotating coordinator and a leader based consensus algorithm

    Page(s): 4 - 17

    Protocols that solve agreement problems are essential building blocks for fault tolerant distributed systems. While many protocols have been published, little has been done to analyze their performance, especially the performance of their fault tolerance mechanisms. In this paper, we compare two well-known asynchronous consensus algorithms. In both algorithms, a leader process tries to impose a decision, and another leader retries if the leader fails to do so. The algorithms elect leaders differently: the Chandra-Toueg algorithm has a rotating leader, whereas processes in the Paxos algorithm elect leaders directly. We investigate the performance implications of this difference. In the system under study, processes send atomic broadcasts to each other. Consensus is used to decide the delivery order of messages. We evaluate the steady state latency in (1) runs with neither crashes nor suspicions, (2) runs with crashes and (3) runs with no crashes in which correct processes are wrongly suspected to have crashed, as well as the transient latency after (4) one crash and (5) multiple correlated crashes. The results show that the Paxos algorithm tolerates frequent wrong suspicions (3) and correlated crashes (5) better, while the performance is comparable in all other scenarios.

  • The φ accrual failure detector

    Page(s): 66 - 78

    The detection of failures is a fundamental issue for fault-tolerance in distributed systems. Recently, many people have come to realize that failure detection ought to be provided as some form of generic service, similar to IP address lookup or time synchronization. However, this has not been successful so far, one of the reasons being that classical failure detectors were not designed to satisfy several application requirements simultaneously. We present a novel abstraction, called accrual failure detectors, that emphasizes flexibility and expressiveness and can serve as a basic building block for implementing failure detectors in distributed systems. Instead of providing information of a binary nature (trust vs. suspect), accrual failure detectors output a suspicion level on a continuous scale. The principal merit of this approach is that it favors a nearly complete decoupling between application requirements and the monitoring of the environment. In this paper, we describe an implementation of such an accrual failure detector, which we call the φ failure detector. The particularity of the φ failure detector is that it dynamically adjusts the scale on which the suspicion level is expressed to current network conditions. We analyzed the behavior of our φ failure detector over an intercontinental communication link over a week. Our experimental results show that it performs as well as other known adaptive failure detection mechanisms, with improved flexibility.
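
    A continuous suspicion level of this kind can be sketched as follows. The abstract does not give a formula; the sketch assumes one common formulation, in which the suspicion level is the negative base-10 logarithm of the probability that a heartbeat arrives later than the time already elapsed, with heartbeat inter-arrival times modeled as normally distributed. The function name `phi`, the parameter names, and the numerical floors are all hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def phi(time_since_last: float, intervals: Sequence[float]) -> float:
    """Suspicion level for a monitored process (illustrative sketch).

    intervals: recently observed heartbeat inter-arrival times.
    Assumes a normal model of inter-arrival times (an assumption,
    not something stated in the abstract).
    """
    mu = mean(intervals)
    sigma = max(pstdev(intervals), 1e-6)  # floor avoids division by zero
    # Probability that the next heartbeat arrives later than time_since_last
    # (survival function of the normal distribution, via erfc).
    p_later = 0.5 * math.erfc((time_since_last - mu) / (sigma * math.sqrt(2)))
    p_later = max(p_later, 1e-300)  # floor avoids log10(0)
    return -math.log10(p_later)
```

    Because the mean and deviation come from recent observations, the scale adapts to network conditions, and each application can choose its own threshold on the returned value instead of receiving a fixed trust/suspect verdict.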

  • An efficient checkpointing protocol for the minimal characterization of operational rollback-dependency trackability

    Page(s): 126 - 135

    A checkpointing protocol that enforces rollback-dependency trackability (RDT) during the progress of a distributed computation must induce processes to take forced checkpoints to avoid the formation of nontrackable rollback dependencies. A protocol based on the minimal characterization of RDT tests only the smallest set of nontrackable dependencies. The literature indicated that this approach would require the processes to maintain and propagate O(n²) control information, where n is the number of processes in the computation. In this paper, we present a protocol that implements this approach using only O(n) control information.

  • A hoarding approach for supporting disconnected write operations in mobile environments

    Page(s): 276 - 288

    Caching is one technique that reduces costs and improves performance in mobile environments. It also increases availability during temporary, involuntary disconnections. However, our focus is on voluntary, client initiated disconnections, where hoarding can be used to predict data requirements. Existing hoarding approaches ignore conflicts arising out of write sharing and are thus unable to deal with them. However, since conflicts are detrimental to bandwidth utilisation, for scenarios with high write sharing, hoarding techniques need to provide support for sharing in a manner that reduces or avoids conflicts. We propose a hoarding approach for disconnected write operations that focuses on reducing the likelihood of conflicts, arising from write sharing, in a highly concurrent environment. Data that clients might need when disconnected is predicted based on the notion of semantic similarity. To avoid or reduce conflicts, data are first clustered based on their update probabilities. The hoard tree is then created based on the clusters and semantic similarity between data. Simulations show an increase in the cache hit-rate along with a reduction in the total number of conflicts.

  • Balancing the tradeoffs between data accessibility and query delay in ad hoc networks

    Page(s): 289 - 298

    In mobile ad hoc networks, nodes move freely and link/node failures are common. This leads to frequent network partitions, which may significantly degrade the performance of data access in ad hoc networks. When a network partition occurs, mobile nodes in one partition are not able to access data hosted by nodes in other partitions. In this paper, we deal with this problem by applying data replication techniques. Existing data replication solutions in both wired and wireless networks aim at either reducing the query delay or improving the data accessibility. As both metrics are important for mobile nodes, we propose schemes to balance the tradeoffs between data accessibility and query delay under different system settings and requirements. Simulation results show that the proposed schemes can achieve a balance between these two metrics and provide satisfying system performance.

  • Slow advances in fault-tolerant real-time distributed computing

    Page(s): 106 - 108

    Is fault-tolerant (FT) real-time computing a specialized branch of FT computing? The key issue in real-time (RT) computing is to economically produce systems that yield temporal behavior which is relatively easily analyzable and acceptable in given application environments. Fault-tolerant (FT) RT computing has been treated by the predominant segment of the FT computing research community as a highly specialized branch of FT computing. This author believes that the situation should be changed. It seems safe to say that FT techniques for which useful characterizations of temporal behavior have not been or cannot be developed, are at best immature, if not entirely useless. This means that FT RT computing is at the core of FT computing.

  • Nested objects in a Byzantine quorum-replicated system

    Page(s): 79 - 89

    Modern distributed, object-based systems support nested method invocations, whereby one object can invoke methods on another. In this paper we present a framework that supports nested method invocations among Byzantine fault-tolerant, replicated objects that are accessed via quorum systems. A challenge in this context is that client object replicas can induce unwanted method invocations on server object replicas, due either to redundant invocations by client replicas or Byzantine failures within the client replicas. At the core of our framework are a new quorum-based authorization technique and a novel method invocation protocol that ensure the linearizability and failure atomicity of nested method invocations despite Byzantine client and server replica failures. We detail the implementation of these techniques in a system called Fleet, and give preliminary performance results for them. View full abstract»

  • How to tolerate half less one Byzantine nodes in practical distributed systems

    Page(s): 174 - 183

    The application of dependability concepts and techniques to the design of secure distributed systems is raising a considerable amount of interest in both communities under the designation of intrusion tolerance. However, practical intrusion-tolerant replicated systems based on the state machine approach (SMA) can handle at most f Byzantine components out of a total of n = 3f + 1, which is the maximum resilience in asynchronous systems. This paper extends the normal asynchronous system with a special distributed oracle called TTCB. Using this extended system we manage to implement an intrusion-tolerant service based on the SMA with only 2f + 1 replicas. Although a few other papers in the literature present intrusion-tolerant services with this approach, this is the first time the number of replicas is reduced from 3f + 1 to 2f + 1. Another interesting characteristic of the described service is a low time complexity.
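
    The replica counts quoted above reduce to simple arithmetic, shown here as a small sketch. Only the bounds n = 3f + 1 and n = 2f + 1 come from the abstract; the function names and the boolean flag standing in for "the system is extended with the oracle" are hypothetical.

```python
def min_replicas(f: int, with_oracle: bool) -> int:
    """Replicas needed to tolerate f Byzantine replicas.

    with_oracle=False: plain asynchronous system, n = 3f + 1.
    with_oracle=True: system extended with a trusted oracle, n = 2f + 1.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1 if with_oracle else 3 * f + 1


def max_faults(n: int, with_oracle: bool) -> int:
    """Largest f tolerated by n replicas under the same two bounds."""
    if n < 1:
        raise ValueError("n must be positive")
    return (n - 1) // 2 if with_oracle else (n - 1) // 3
```

    For example, seven replicas tolerate two Byzantine replicas under the 3f + 1 bound but three under the 2f + 1 bound, which is the "half less one" of the title.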

  • Dependable pervasive systems


    Summary form only given. Present trends indicate that huge networked computer systems are likely to become pervasive, as information technology is embedded into virtually everything, and to be required to function essentially continuously. I believe that even today's (underused) "best practice" regarding the achievement of high dependability - reliability, availability, security, safety, etc. - from large networked computer systems will not suffice for future pervasive systems. I will give my perspective on the current state of research into the four basic dependability technologies: (i) fault prevention (to avoid the occurrence or introduction of faults), (ii) fault removal (through validation and verification), (iii) fault tolerance (so that failures do not necessarily occur even if faults remain), and (iv) fault forecasting (the means of assessing progress towards achieving adequate dependability). I will then argue that much further research is required on all four dependability technologies in order to cope with pervasive systems, identify some priorities, and discuss how this research could best be aimed at making system dependability into a "commodity" that industry can value and from which it can profit. View full abstract»

  • Self-managing federated services

    Page(s): 240 - 250

    We consider the problem of deploying and managing federated services that run on federated systems spanning multiple collaborative organizations. In particular, we present a peer-to-peer framework targeted to the construction of self-managing services that automatically adjust the number of service components and their placements in response to changes in the system or client loads. Our framework is completely decentralized, depending only on a modest amount of loosely synchronized global state. More specifically, our framework is comprised of a set of per-node monitoring agents and per-service-component management agents that periodically exchange information about the state of the system and of the service with each other using a gossiping protocol. Each management agent then periodically searches for configurations that are better than the current one according to an application model and explicit performance and availability targets. On finding a better configuration, an agent will enact the new configuration after a random delay to avoid possible collisions. We evaluate our framework by studying a prototype UDDI service. We show that while agents act autonomously, the service rapidly reaches a stable and appropriate configuration in response to system dynamics. View full abstract»
