Checkpoint-recovery based Virtual Machine (VM) replication is an emerging approach to providing high availability for VM installations, especially because of its inherent ability to handle symmetric multiprocessing (SMP) virtual machines, i.e. VMs with multiple virtual CPUs (vCPUs). However, it comes at the price of significant performance degradation of the application executed in the VM, because of the large amount of state that needs to be synchronized between the primary and the backup machines. Previous research on improving VM replication performance focused primarily on decreasing the amount of data transferred over the network, while relying on a constant checkpoint frequency. Our goal is to investigate how and to what extent performance degradation can be mitigated by adjusting the checkpoint period dynamically. We provide a comprehensive analysis of various workloads from the perspective of VM replication, paying special attention to their behavior as the number of vCPUs in the system increases. We propose several heuristics for scheduling replication checkpoints in order to improve quality of service. Our algorithm adapts dynamically to the properties of the workload being executed in the VM, such as changes in the number of dirtied memory pages and in network and disk I/O operations, as well as to the network bandwidth available for replication. We evaluate our scheduling algorithm over two network architectures: Gigabit Ethernet and Infiniband, a high-performance interconnect fabric. We find that checkpoint scheduling has a great impact on the performance of replicated virtual machines, and show that replicated virtual machines with up to 16 vCPUs can attain performance close to that of native VM execution, not only over high-performance but also over commercial network architectures.
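To make the idea of a dynamically adjusted checkpoint period concrete, the following is a minimal illustrative sketch, not the heuristics proposed in the paper. It assumes a hypothetical function `next_checkpoint_period` that receives the number of dirtied pages, the measured replication bandwidth, and a count of pending I/O operations; the page size, the overhead target, and all thresholds are invented for illustration. It also makes the simplifying assumption that the dirty-page count observed in the last period predicts the next one.

```python
PAGE_SIZE = 4096  # bytes per memory page (assumed value)


def next_checkpoint_period(dirty_pages, bandwidth_bytes_per_s, pending_io,
                           target_overhead=0.2, io_period=0.02,
                           min_period=0.01, max_period=1.0):
    """Return the next checkpoint period in seconds (hypothetical heuristic).

    All parameter names and default values are assumptions for illustration.
    """
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")

    # Estimated time to ship the dirtied pages to the backup machine.
    transfer_s = dirty_pages * PAGE_SIZE / bandwidth_bytes_per_s

    # Pick a period long enough that the transfer takes at most
    # target_overhead of each period.
    period = transfer_s / target_overhead

    # Pending network or disk I/O is held back until the next checkpoint,
    # so checkpoint sooner to bound the added latency.
    if pending_io > 0:
        period = min(period, io_period)

    # Keep the period inside a sane range.
    return max(min_period, min(period, max_period))
```

In this sketch, a memory-heavy phase lengthens the period to amortize the transfer cost, while outstanding I/O shortens it; a real scheduler would need measured inputs from the hypervisor and would likely use a more careful model of how dirtying rate depends on the period.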
Date of Conference: 12-14 Dec. 2011