As researchers push toward exascale computing, one of the emerging challenges is system resilience. Recent reports suggest that, unlike fault-tolerant systems, which correct errors, resilient systems will need to continue to make progress on an application despite faults. A first step in developing a resilient system is robust, scalable system monitoring. The work described here presents a novel, minimally invasive system monitor that operates over a separate network. We analytically characterize its performance for an arbitrary set of nodes and demonstrate a working implementation of the design. We argue that the hardware approach is inherently superior to the ad hoc software techniques currently employed in practice.
Published in:
High-Performance Reconfigurable Computing Technology and Applications (HPRCTA), 2010 Fourth International Workshop on
Date of Conference: 14 Nov. 2010