This paper presents a general framework for optimal task reallocation in heterogeneous distributed-computing systems and offers a rigorous analytical model for the stochastic execution time of a workload. The model accounts for the heterogeneity and stochastic nature of the tasks' service and transfer times and of the servers' failure times, and it accommodates an arbitrary task-reallocation policy. The stochastic service, transfer, and failure times are assumed to have general, age-dependent (non-exponential) distributions, resulting in a tandem distributed queuing system with non-Markovian dynamics. Auxiliary age variables are introduced in the analysis to capture the memory associated with the non-Markovian stochastic times, thereby enabling a regenerative, age-dependent analytical characterization of the statistics of the execution time of a workload. The model is used to devise task-reallocation policies that optimize three metrics: the average execution time of a workload, the quality of service in executing a workload by a prescribed deadline, and the reliability in executing a workload. The implications of non-exponential event times for these metrics are also studied. Key results are verified experimentally on a distributed-computing testbed.
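The three metrics can be illustrated with a small Monte Carlo sketch. It is not the paper's analytical, age-dependent model: it assumes a hypothetical two-server system, Weibull service, transfer, and failure times (shape parameters other than 1 make them non-exponential), a single reallocation of `n_move` tasks at time zero, and an average execution time computed only over runs that complete. All function names and parameter values are illustrative assumptions.

```python
import math
import random


def sample_execution_time(m0, m1, n_move, rng):
    """Draw one execution time; math.inf means a server failed with work pending."""
    # Assumed Weibull(scale, shape) laws; none of these values come from the paper.
    def service0():
        return rng.weibullvariate(1.0, 2.0)  # server 0 is the faster server

    def service1():
        return rng.weibullvariate(2.0, 2.0)  # server 1 is the slower server

    fail0 = rng.weibullvariate(200.0, 1.5)   # failure time of server 0
    fail1 = rng.weibullvariate(100.0, 1.5)   # failure time of server 1

    # Server 0 keeps m0 - n_move tasks and sends n_move tasks to server 1.
    t0 = sum(service0() for _ in range(m0 - n_move))

    # Server 1 serves its own tasks first, then the transferred batch,
    # which cannot start before the batch has arrived.
    t1 = sum(service1() for _ in range(m1))
    if n_move > 0:
        arrival = rng.weibullvariate(3.0, 2.0)  # random transfer time
        t1 = max(t1, arrival) + sum(service1() for _ in range(n_move))

    # Simplification: a failure before a server's finish time loses the workload.
    if fail0 < t0 or fail1 < t1:
        return math.inf
    return max(t0, t1)


def estimate_metrics(m0, m1, n_move, deadline, runs=5000, seed=0):
    """Estimate (mean execution time, quality of service, reliability)."""
    rng = random.Random(seed)
    samples = [sample_execution_time(m0, m1, n_move, rng) for _ in range(runs)]
    done = [t for t in samples if math.isfinite(t)]
    reliability = len(done) / runs                     # P(workload completes)
    qos = sum(t <= deadline for t in samples) / runs   # P(completes by deadline)
    mean_time = sum(done) / len(done) if done else math.inf
    return mean_time, qos, reliability


if __name__ == "__main__":
    # Compare candidate policies: how many tasks should server 0 hand over?
    for n_move in range(0, 31, 5):
        mean_time, qos, rel = estimate_metrics(30, 10, n_move, deadline=40.0)
        print(f"n_move={n_move:2d} mean={mean_time:6.2f} qos={qos:.3f} rel={rel:.3f}")
```

Sweeping `n_move` and picking the value that minimizes the mean time, or maximizes the quality of service or the reliability, mimics the policy search in a brute-force way. The three objectives need not agree on the same policy.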