Management policies can be used to specify requirements about the desired behaviour of distributed systems. Violations of policies (faults) can then be detected, isolated, located and corrected using a policy-driven fault management system. Other work in this area to date has focused on network-level faults. We believe that in a distributed system it is more appropriate to focus on faults at the application level. Furthermore, that earlier work has been largely domain-specific; a generic, structured approach to this problem is needed. Our work has focused on policy-driven fault management in distributed systems at the application level. In this paper, we define a generic architecture for policy-driven fault management and present a prototype system based on this architecture. We also discuss our experience to date in using and experimenting with the prototype system.
Published in:
Software Reliability Engineering, 1996. Proceedings., Seventh International Symposium on
Date of Conference: 30 Oct-2 Nov 1996