Large, distributed IT infrastructures that provide business-critical services have to protect themselves against internal and external threats and adapt to changing environmental parameters, such as workload. The most widely applied structural resilience mechanisms use some form of local static redundancy, deployed to each critical resource for failover. Recently, however, both large-scale interconnected distributed systems and virtualization have enabled on-line structural reconfiguration that exploits globally managed spare capacity as an on-demand failover resource. In this paper, we present system and service resilience as a control problem and briefly describe how classes of the widely used but vague notion of 'IT metrics' map to the concepts of generic control, with special emphasis on the control aspects of structural reconfiguration as a generic resilience mechanism. Most importantly, we introduce some initial metrics that aim to measure the self-healing capability of systems employing structural reconfiguration.
Date of Conference: 25-31 Aug. 2008