Skip to Main Content
Large-scale hosting infrastructures have become the fundamental platforms for many real-world systems such as cloud computing infrastructures, enterprise data centers, and massive data processing systems. However, it is a challenging task to achieve both scalability and high precision while monitoring a large number of intranode and internode attributes (e.g., CPU usage, free memory, free disk, internode network delay). In this paper, we present the design and implementation of a Resilient self-Compressive Monitoring (RCM) system for large-scale hosting infrastructures. RCM achieves scalable distributed monitoring by performing online data compression to reduce remote data collection cost. RCM provides failure resilience to achieve robust monitoring for dynamic distributed systems where host and network failures are common. We have conducted extensive experiments using a set of real monitoring data from NCSU's virtual computing lab (VCL), PlanetLab, a Google cluster, and real Internet traffic matrices. The experimental results show that RCM can achieve up to 200 percent higher compression ratio and several orders of magnitude less overhead than the existing approaches.