Computation-at-risk: assessing job portfolio management risk on clusters

2 Author(s)
Kleban, S.D. ; Sandia Nat. Labs., Albuquerque, NM, USA ; Clearwater, S.H.

Summary form only given. In this paper we introduce the concept of computation-at-risk (CaR): a methodology, a procedure, and a quantity for measuring the computational risk and reward that result from running a particular portfolio of jobs on a cluster under a specific queue policy. Modeled after value-at-risk (VaR) from the financial community, CaR introduces the new element of computational risk into the management of a computational cluster. Specifically, administrators of clusters and other large-scale computing systems must deal with a wide range of job sizes, often spanning up to eight orders of magnitude in the number of cycles. Such a job portfolio carries implicit risks and rewards to performance, both for certain types of jobs and for the facility overall. In this paper we quantify the risk and reward in terms of makespan and expansion factor. We assess the risk/reward profile for two categories of job portfolios, one with respect to queue settings and the other in terms of job sizes. These assessments provide a means for evaluating which queue policies or job sizes have the best risk/reward characteristics in terms of performance. We found that looser constraints on queue policy, in the form of run-time limits, were beneficial from a risk/reward and CaR perspective. This information can be used by administrators to modify queue policy and by users to tailor the size of their jobs.
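
As a rough illustration of the quantities named in the abstract, the Python sketch below computes makespan and expansion factor for a small job portfolio and then takes a VaR-style quantile of the expansion factors. The abstract does not give the paper's formulas, so everything here is an assumption: expansion factor is taken as (wait time + run time) / run time, makespan as the span from earliest submission to latest completion, and the hypothetical `computation_at_risk` helper as a nearest-rank quantile at a chosen confidence level. The `Job` record and all function names are illustrative only.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    submit: float  # time the job entered the queue (seconds)
    start: float   # time the job began running (seconds)
    end: float     # time the job finished (seconds)


def expansion_factor(job: Job) -> float:
    """Assumed definition: (wait + run) / run. 1.0 means no queueing delay.

    Requires a positive run time (end > start).
    """
    run = job.end - job.start
    wait = job.start - job.submit
    return (wait + run) / run


def makespan(jobs: List[Job]) -> float:
    """Assumed definition: earliest submission to latest completion."""
    return max(j.end for j in jobs) - min(j.submit for j in jobs)


def quantile(values: List[float], level: float) -> float:
    """Nearest-rank quantile: smallest value with at least `level` of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(level * len(ordered)))
    return ordered[rank - 1]


def computation_at_risk(jobs: List[Job], level: float = 0.95) -> float:
    """Hypothetical CaR: expansion factor not exceeded by a fraction `level` of jobs."""
    return quantile([expansion_factor(j) for j in jobs], level)
```

Under these assumptions, two queue policies could be compared by running the same portfolio through each and comparing the resulting makespan and `computation_at_risk` values; a lower quantile at the same level would indicate less tail risk in job slowdown.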

Published in:

Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International

Date of Conference:

26-30 April 2004