Skip to Main Content
Scalability of parallel applications on clusters and multi-clusters is often limited by communication performance. Message tracing can provide data for understanding bottlenecks, and for performance tuning. However, it requires collecting, storing, analyzing, and transferring potentially gigabytes of data. We have designed the EventSpace system for low overhead and high performance runtime collective communication trace analysis. EventSpace separates the perturbation and performance requirements of data collection, analysis, gathering sand visualization. Data collection overhead is low since the minimum amount of data is recorded and stored temporarily in main memory. The recorded data is either discarded or analyzed on demand using available cluster resources. Analysis is distributed for high performance, and coscheduled with the computation and communication system threads for low perturbation. Gathering of analyzed data is done using extensible collective communication operations, which can be tuned to trade off between performance and monitoring overhead. EventSpace was used to do run-time monitoring and analysis of collective communication micro-benchmarks run on clusters, multi-clusters, and multi-clusters with emulated WAN links. Performance data was collected, analyzed and gathered with 0-3% monitoring overhead.