Skip to Main Content
Understanding and tuning the performance of large-scale long-running applications is difficult, with both standard trace-based and statistical methods having substantial shortcomings that limit their usefulness. This paper describes a new performance monitoring approach called Embedded Gossip (EG) designed to enable lightweight online performance monitoring and tuning. EG works by piggybacking performance information on existing messages and performing information correlation online, giving each process in a parallel application a weakly consistent global view of the behavior of the entire application. To demonstrate the viability of EG, this paper presents the design and experimental evaluation of two different online monitoring systems and an online global adaptation system driven by Embedded Gossiping. In addition, we present a metric system for evaluating the suitability of an application to EG-based monitoring and adaptation, a general architecture for implementing EG-based monitoring systems, and a modified global commit algorithm appropriate for use in EG-based global adaptation systems. Together, these results demonstrate that EG is an efficient low-overhead approach for addressing a wide range of parallel performance monitoring tasks and that results from these systems can effectively drive online global adaptation.