Computation-at-risk: employing the grid for computational risk management

Authors:
Kleban, S.D. ; Sandia Nat. Labs., Albuquerque, NM, USA ; Clearwater, S.H.

This work expands upon our earlier work on the concept of computation-at-risk (CaR). CaR refers to the risk that certain computations may not be completed in a timely manner. We examine a number of CaR distributions on several large clusters. The important contribution of this work is to show that CaR-reducing strategies exist and that, by employing them, a facility can significantly reduce the risk of inefficient resource utilization. Grids are shown to be one means of employing a CaR-reducing strategy. For example, we show that a CaR-reducing strategy applied to a common queue can have a dramatic effect on the wait times for jobs on a grid of clusters. We also define a CaR Sharpe rule, a decision rule for determining the best machine in a grid on which to place a new job.
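
The abstract does not state the CaR Sharpe rule itself, so the following is only an illustrative sketch and not the authors' definition. It assumes, by analogy with value-at-risk and the Sharpe ratio in finance, that a CaR figure can be summarized as a high quantile of observed wait times and that a machine can be scored as (benchmark wait minus mean wait) divided by the spread of its wait times. The function names, the pooled-mean benchmark, and the sample data are all hypothetical.

```python
from statistics import mean, pstdev


def car_quantile(wait_times, level=0.95):
    """Empirical quantile of observed wait times.

    Assumption: a value-at-risk style summary, not the paper's formula.
    """
    ordered = sorted(wait_times)
    index = min(len(ordered) - 1, int(level * len(ordered)))
    return ordered[index]


def sharpe_style_score(wait_times, benchmark):
    """Hypothetical Sharpe-like score: higher means shorter, steadier waits."""
    spread = pstdev(wait_times)
    if spread == 0:
        spread = 1e-9  # avoid division by zero for perfectly steady machines
    return (benchmark - mean(wait_times)) / spread


def pick_machine(history):
    """Return the machine name with the highest Sharpe-style score.

    `history` maps a machine name to a list of past wait times.
    The benchmark is assumed to be the mean wait over all machines pooled.
    """
    pooled = [w for waits in history.values() for w in waits]
    benchmark = mean(pooled)
    return max(
        history,
        key=lambda name: sharpe_style_score(history[name], benchmark),
    )


if __name__ == "__main__":
    # Made-up wait times (arbitrary units) for three hypothetical clusters.
    history = {
        "cluster_a": [10, 12, 11, 13],
        "cluster_b": [5, 40, 6, 45],
        "cluster_c": [20, 21, 19, 20],
    }
    print(pick_machine(history))
```

Under these assumptions a machine with a low but erratic wait time can score worse than one with a slightly higher but steady wait time, which is the kind of risk-adjusted trade-off a Sharpe-style rule is meant to capture.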

Published in:

Cluster Computing, 2004 IEEE International Conference on

Date of Conference:

20-23 Sept. 2004