Many-task computing (MTC) is a practical paradigm for developing loosely coupled, complex scientific applications. In this paradigm, computation on a large dataset is decomposed into tasks that are expected to execute in parallel on dynamically allocated computing resources. These tasks pass data via files, and each one executes an existing program on one dataset element. Task scheduling is a key issue in enabling MTC on parallel platforms such as large-scale clusters, Grids, and Clouds. Current solutions mainly focus on maximizing the number of parallel computing resources utilized. This paper proposes a configurable MTC model that aims to minimize an MTC computation's turnaround time with as few resources as possible. The primary strategy is to use application-specific expertise to coalesce tasks into task-sequences, and to assign tasks at the granularity of task-sequences. Based on this model, a self-optimizing task partitioning algorithm has been devised for scheduling tasks in MTC. It separates task assignment from resource allocation, and trades off among maximizing utilized resources, balancing workload, and reducing computation-scheduling overhead. The algorithm has been implemented in Harmonia, a software platform developed by Peking University for enabling MTC on large-scale distributed platforms. Both the configurable MTC model and the self-optimizing task partitioning algorithm were evaluated with a genome alternative splicing application, and the experimental results demonstrate the model's practicability.
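The idea of coalescing tasks into task-sequences can be illustrated with a minimal sketch. This is not the algorithm implemented in Harmonia: the function names, the uniform per-task cost, the fixed per-sequence scheduling overhead, and the "rounds" cost model are all assumptions introduced here only to show why sequence length is a tradeoff between parallelism and scheduling overhead.

```python
import math


def coalesce(tasks, k):
    """Group consecutive tasks into task-sequences of at most k tasks."""
    return [tasks[i:i + k] for i in range(0, len(tasks), k)]


def estimated_turnaround(n_tasks, k, workers, task_cost, overhead):
    """Hypothetical cost model (an assumption, not taken from the paper).

    Every task-sequence pays one scheduling overhead plus k task costs,
    and sequences are processed in waves of `workers` at a time.
    """
    n_seqs = math.ceil(n_tasks / k)
    rounds = math.ceil(n_seqs / workers)
    return rounds * (overhead + k * task_cost)


def choose_sequence_length(n_tasks, workers, task_cost, overhead):
    """Pick the sequence length with the lowest estimated turnaround.

    Ties are broken in favor of the shorter sequence length.
    """
    return min(
        range(1, n_tasks + 1),
        key=lambda k: (
            estimated_turnaround(n_tasks, k, workers, task_cost, overhead),
            k,
        ),
    )


if __name__ == "__main__":
    # Illustrative numbers only: 100 tasks, 4 workers, unit task cost,
    # and a scheduling overhead of 5 time units per task-sequence.
    k = choose_sequence_length(100, workers=4, task_cost=1.0, overhead=5.0)
    sequences = coalesce(list(range(100)), k)
    print(k, len(sequences))
```

Under this toy model, very short sequences pay the scheduling overhead many times, while very long sequences leave workers idle; the chosen length sits between the two. A real scheduler would also need to account for non-uniform task costs and for the goal of using as few resources as possible, which this sketch does not model.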