Supporting high availability by checkpointing and switching to a backup upon failure of the primary has a cost. Trade-off studies help system architects decide whether higher availability is worth the price of higher response time; the resulting decision guides the configuration of a fault-tolerant server for best performance. This paper provides a mathematical model, based on queuing theory, for computing the optimal checkpointing interval of a primary-backup replicated server, with system availability as the optimization criterion. The model identifies a checkpointing interval that is short enough to keep failover time low, yet long enough to devote most of the system resources to servicing client requests. The novelty of the work lies in the detailed modelling of service times, waiting times for earlier calls in the queue, and the priority of checkpointing calls over client calls within the queues. Studies of the model in Mathematica and simulation-based validation of a modelling assumption are included.
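The short-versus-long interval trade-off described above can be illustrated with a much simpler first-order availability model than the paper's queuing-theoretic one: this sketch uses Young's classic approximation, which is an assumption introduced here for illustration (the function names, the checkpoint cost `c`, and the mean time between failures `mtbf` are hypothetical, not from the paper).

```python
import math

def availability(tau, c, mtbf):
    """Approximate fraction of time spent on useful work for
    checkpoint interval tau (first-order model, an assumption here).

    c / tau        : fraction of time spent writing checkpoints
                     (short intervals waste resources on checkpointing)
    tau / (2*mtbf) : expected fraction of time redoing lost work after
                     a failure, since on average half an interval is lost
                     (long intervals mean long failover/recovery)
    """
    return 1.0 - c / tau - tau / (2.0 * mtbf)

def optimal_interval(c, mtbf):
    """Interval maximizing the approximation above: setting the
    derivative c/tau^2 - 1/(2*mtbf) to zero gives Young's formula."""
    return math.sqrt(2.0 * c * mtbf)

# Illustrative numbers: a 5 s checkpoint and a failure every 2 hours
# on average yield an optimal interval of roughly 268 s.
c, mtbf = 5.0, 7200.0
tau_star = optimal_interval(c, mtbf)
```

The optimum sits exactly where the two loss terms balance; the paper's contribution is to refine this kind of analysis with detailed per-call service times, queueing delays, and checkpoint priority.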
Date of Conference: 12-14 Dec. 2005