Summary form only given. In this paper we introduce the concept of computation-at-risk (CaR): a methodology, a procedure, and a quantity for measuring the computational risk and reward that result from running a particular portfolio of jobs on a cluster under a specific queue policy. Modeled after value-at-risk (VaR) from the financial community, CaR introduces the new element of computational risk into the management of a computational cluster. Specifically, administrators of clusters and other large-scale computing systems must deal with a wide range of job sizes, often spanning up to eight orders of magnitude in the number of cycles. Such a job portfolio carries implicit risks and rewards in performance, both for certain types of jobs and for the facility overall. In this paper we quantify the risk and reward in terms of makespan and expansion factor. We assess the risk/reward profile for two categories of job portfolios, one defined by queue settings and the other by job sizes. These assessments provide a means of evaluating which queue policies or job sizes have the best risk/reward characteristics in terms of performance. We found that looser constraints on queue policy, in the form of run-time limits, were beneficial from a risk/reward and CaR perspective. This information can be used by administrators to modify queue policy and by users to tailor the size of their jobs.
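The two metrics named in the abstract, and a VaR-style risk figure built on them, can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so everything here is an assumption: expansion factor is taken as the common (wait time + run time) / run time, makespan as the span from first submission to last completion, and the risk figure as a nearest-rank quantile of the per-job expansion factors, by analogy with VaR. The function names, the job tuples, and the confidence level are hypothetical.

```python
import math

def expansion_factor(wait, run):
    # Assumed common definition: (wait + run) / run; 1.0 means no queueing delay.
    return (wait + run) / run

def makespan(jobs):
    # jobs: list of (submit_time, start_time, run_time) tuples (hypothetical layout).
    first_submit = min(submit for submit, _, _ in jobs)
    last_finish = max(start + run for _, start, run in jobs)
    return last_finish - first_submit

def quantile_at_risk(samples, confidence=0.95):
    # Nearest-rank quantile: the value that a fraction `confidence` of the
    # samples does not exceed. This VaR-style reading is an assumption,
    # not the paper's stated formula.
    ordered = sorted(samples)
    rank = max(math.ceil(confidence * len(ordered)), 1)
    return ordered[rank - 1]

# Made-up portfolio: one long job followed by three shorter ones.
jobs = [(0, 0, 100), (0, 100, 10), (5, 110, 1), (20, 111, 50)]
factors = [expansion_factor(start - submit, run) for submit, start, run in jobs]

portfolio_makespan = makespan(jobs)
risk_95 = quantile_at_risk(factors, confidence=0.95)
median_factor = quantile_at_risk(factors, confidence=0.5)
```

In this made-up portfolio the shortest job waits behind the longest one and ends up with by far the largest expansion factor, which is the kind of per-job risk that a quantile-based figure is meant to expose when comparing queue policies or job sizes.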