System-level monitoring of floating-point performance to improve effective system utilization

7 Author(s)
Davide Del Vento ; Computational and Information Systems Laboratory, National Center for Atmospheric Research, Boulder, CO 80307-300, USA ; David L. Hart ; Thomas Engel ; Rory Kelly

NCAR's Bluefire supercomputer is instrumented with a set of low-overhead processes that continually monitor the floating-point counters of its 3,840 batch-compute cores. We extract performance numbers for each batch job by correlating the data from the corresponding nodes. Guided by experience and heuristics for good performance, we use this data, in part, to identify poorly performing jobs and then work with the users to improve their jobs' efficiency. Often, the solution involves simple steps such as spawning an adequate number of processes or threads, binding the processes or threads to cores, using large memory pages, or using adequate compiler optimization. These efforts typically result in performance improvements and a wall-clock runtime reduction of 10% to 20%. With more involved changes to codes and scripts, some users have obtained performance improvements of 40% to 90%. We discuss our instrumentation, some successful cases, and its general applicability to other systems.
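
The correlation step described in the abstract (matching per-node counter samples to the batch jobs that ran on those nodes, then flagging jobs with a low floating-point rate) could look roughly like the following Python sketch. It is an illustration only, not the authors' instrumentation: the `Sample` and `Job` records, the function names, the per-node peak rate, and the `min_fraction` threshold are all hypothetical, and the abstract does not state how samples are stored or what cutoff is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One counter reading (hypothetical record layout)."""
    node: str      # node the reading was taken on
    time: float    # timestamp of the reading, in seconds
    gflops: float  # node-level floating-point rate for that interval


@dataclass(frozen=True)
class Job:
    """One batch job as reported by the scheduler (hypothetical layout)."""
    job_id: str
    nodes: frozenset  # names of the nodes the job ran on
    start: float      # job start time, in seconds
    end: float        # job end time, in seconds


def job_mean_gflops(job, samples):
    """Average the per-node rate over samples taken on the job's nodes
    while the job was running. Returns None if no sample matches."""
    rates = [
        s.gflops
        for s in samples
        if s.node in job.nodes and job.start <= s.time <= job.end
    ]
    return sum(rates) / len(rates) if rates else None


def flag_low_performers(jobs, samples, node_peak_gflops, min_fraction):
    """Return the ids of jobs whose mean per-node rate is below
    min_fraction of the node's peak rate (both values are assumptions)."""
    flagged = []
    for job in jobs:
        mean = job_mean_gflops(job, samples)
        if mean is not None and mean / node_peak_gflops < min_fraction:
            flagged.append(job.job_id)
    return flagged
```

A list of flagged job ids from a function like `flag_low_performers` would only be a starting point for the follow-up with users that the abstract describes; jobs with no matching samples are skipped rather than flagged.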

Published in:

2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

Date of Conference:

12-18 Nov. 2011