This extended abstract describes the keynote presentation "Stochastically Robust Resource Management in Heterogeneous Parallel Computing Systems," to be given by H. J. Siegel. What does it mean for a computer system to be "robust"? How can robustness be described? How does one determine if a claim of robustness is true? How can one decide which of two systems is more robust? People often state that their system software component, piece of hardware, application code, or technique is "robust" without ever defining the term. Without a definition, a claim of robustness cannot be verified and robustness cannot be quantified, so if two people each claim to have a robust computing system, there is no way to decide which is the more robust. These are the types of issues we address in this keynote presentation. We study robustness in the context of resource allocation in heterogeneous parallel and distributed computing systems, but the robustness concepts presented have broad applicability. In heterogeneous parallel and distributed computing environments, a collection of different machines is interconnected and provides a variety of computational capabilities. These capabilities can be used to execute a workload composed of different types of applications, each of which may consist of multiple tasks with diverse computational requirements. The execution time of a task may vary from one machine to the next, and a machine A that is faster than a machine B for task 1 will not necessarily be faster for task 2. Furthermore, there can be inter-task data dependencies, and tasks must share the computing and communication resources of the system. A critical research problem is how to allocate resources to tasks so as to optimize some performance objective.
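The allocation problem described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than part of the keynote: the execution-time table `exec_time`, the task and machine names, the choice of makespan (the finishing time of the busiest machine) as the performance objective, and the simplification that tasks are independent, so inter-task data dependencies and communication costs are ignored.

```python
from itertools import product

# Hypothetical execution times in seconds: exec_time[task][machine].
exec_time = {
    "task1": {"A": 2.0, "B": 5.0},  # machine A is faster for task 1
    "task2": {"A": 6.0, "B": 3.0},  # machine B is faster for task 2
    "task3": {"A": 4.0, "B": 4.5},
}
machines = ["A", "B"]


def makespan(allocation):
    """Return the finishing time of the busiest machine for a task -> machine mapping."""
    load = {}
    for task, machine in allocation.items():
        load[machine] = load.get(machine, 0.0) + exec_time[task][machine]
    return max(load.values())


def best_allocation():
    """Exhaustively search every mapping; only feasible for very small examples."""
    tasks = sorted(exec_time)
    candidates = (
        dict(zip(tasks, choice))
        for choice in product(machines, repeat=len(tasks))
    )
    return min(candidates, key=makespan)
```

Exhaustive search grows exponentially with the number of tasks, which is one reason practical systems rely on heuristics for this kind of mapping problem.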
However, systems frequently suffer degraded performance due to uncertainties, such as unexpected machine failures, changes in system workload, or inaccurate estimates of system parameters. It is important for system performance to be robust against such uncertainties. To accomplish this, we present a stochastic model for deriving the robustness of a resource allocation. This model assumes that stochastic (experiential) information is available about the values of those parameters whose actual values are uncertain. The robustness of a resource allocation is quantified as the probability that a user-specified level of system performance can be met. We show how to use this stochastic robustness model to compare different resource allocations. We will also demonstrate how the model can be incorporated into resource management heuristics that produce robust allocations while optimizing some user-specified performance criterion. Robust resource allocation heuristics for an example environment will be discussed and compared. The stochastic robustness analysis approach can be applied to a variety of computing and communication system environments, including parallel, distributed, cluster, grid, Internet, cloud, embedded, multicore, content distribution network, wireless network, and sensor network environments. Furthermore, the robustness model is generally applicable to design problems throughout various scientific and engineering fields.
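The robustness definition above, the probability that a user-specified performance level is met, can be sketched numerically. The following is a minimal illustration under stated assumptions, not the method presented in the keynote: it estimates the probability by Monte Carlo sampling, treats "performance level" as a makespan deadline, assumes independent tasks, and uses a hypothetical table of mean execution times `MEAN` with an assumed uniform distribution of plus or minus 50% around each mean.

```python
import random

# Hypothetical mean execution times in seconds for (task, machine) pairs.
MEAN = {
    ("task1", "A"): 2.0, ("task1", "B"): 5.0,
    ("task2", "A"): 6.0, ("task2", "B"): 3.0,
    ("task3", "A"): 4.0, ("task3", "B"): 4.5,
}


def sample_time(rng, task, machine):
    """Draw one execution time; assumed uniform within +/-50% of the mean."""
    mean = MEAN[(task, machine)]
    return rng.uniform(0.5 * mean, 1.5 * mean)


def estimate_robustness(allocation, deadline, trials=10_000, seed=0):
    """Estimate P(makespan <= deadline) for one task -> machine allocation."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    met = 0
    for _ in range(trials):
        load = {}
        for task, machine in allocation.items():
            load[machine] = load.get(machine, 0.0) + sample_time(rng, task, machine)
        if max(load.values()) <= deadline:
            met += 1
    return met / trials


# Compare two candidate allocations against the same deadline.
spread = {"task1": "A", "task2": "B", "task3": "A"}
packed = {"task1": "A", "task2": "A", "task3": "A"}
r_spread = estimate_robustness(spread, deadline=8.0)
r_packed = estimate_robustness(packed, deadline=8.0)
```

Under these assumptions the allocation with the larger estimate is the more robust of the two with respect to the 8-second deadline, which gives the quantitative comparison that an undefined claim of robustness cannot provide.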