For large-scale parallel applications, lightweight online monitoring can enable a wide range of online adaptations, including load balancing, power management, and progress monitoring. The processing and monitoring overhead of centralized global tracing techniques makes them unsuitable for such tasks. Purely local tools, on the other hand, fail to provide the global information that many desirable online adaptations of large-scale applications require. In this paper, we describe Embedded Gossip (EG), a novel distributed online measurement method for large-scale applications. EG piggybacks performance information about application behavior on existing application messages and merges received information with previously known data in a fashion customized to the needs of a particular monitoring task. EG thus provides each process with both local and global views of application behavior at low overhead. To illustrate the capabilities of EG, we show that it disseminates global information in a timely fashion for several monitoring tasks, including critical path profiling, workload imbalance monitoring, and progress monitoring. This global information has many potential uses, including imbalance detection for load balancing and energy management tools, progress monitoring for batch schedulers, and other performance debugging and optimization techniques.
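The piggyback-and-merge idea described above can be illustrated with a small in-memory simulation. This is only a minimal sketch under stated assumptions, not the paper's implementation: the names `GossipProcess`, `simulate_ring`, and `imbalance` are hypothetical, the ring communication pattern stands in for whatever messages a real application already sends, and `max` is one assumed example of a task-specific merge function (here, for workload imbalance monitoring).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical task-specific merge: combines a known value with a received one.
Merge = Callable[[float, float], float]


@dataclass
class GossipProcess:
    rank: int
    merge: Merge
    view: Dict[int, float] = field(default_factory=dict)  # rank -> known load

    def record(self, load: float) -> None:
        """Store this process's own local measurement."""
        self.view[self.rank] = load

    def send(self, payload: object) -> dict:
        """Attach a copy of the current view to an outgoing application message."""
        return {"payload": payload, "gossip": dict(self.view)}

    def receive(self, message: dict) -> object:
        """Merge piggybacked data into the local view, then return the payload."""
        for rank, value in message["gossip"].items():
            if rank in self.view:
                self.view[rank] = self.merge(self.view[rank], value)
            else:
                self.view[rank] = value
        return message["payload"]


def simulate_ring(loads: List[float], rounds: int) -> List[GossipProcess]:
    """Assumed pattern: each round, every process sends one message to its right neighbor."""
    procs = [GossipProcess(rank=i, merge=max) for i in range(len(loads))]
    for proc, load in zip(procs, loads):
        proc.record(load)
    for _ in range(rounds):
        # Build all messages first so each round uses the views from its start.
        messages = [proc.send(payload=None) for proc in procs]
        for i, message in enumerate(messages):
            procs[(i + 1) % len(procs)].receive(message)
    return procs


def imbalance(view: Dict[int, float]) -> float:
    """Ratio of the maximum known load to the mean known load."""
    values = list(view.values())
    return max(values) / (sum(values) / len(values))


if __name__ == "__main__":
    for proc in simulate_ring([1.0, 4.0, 2.0, 3.0], rounds=3):
        print(proc.rank, sorted(proc.view.items()), imbalance(proc.view))
```

In this sketch no extra messages are sent: the view travels only on messages the simulated application sends anyway, and each process's view grows toward the global picture as communication proceeds. Swapping `max` for a different merge function is how the sketch would be adapted to another monitoring task.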