A job proxy is an abstraction for provisioning CPU resources. This paper proposes an adaptive algorithm for allocating job proxies to distributed host clusters, with the objective of improving the throughput of large-scale job ensembles. Specifically, the paper proposes a decision metric for selecting which pending job proxies to migrate between host clusters, and a self-synchronizing Paxos-style distributed consensus algorithm for performing the migration of the selected job proxies. The algorithm is further described in the context of a concrete application, the MyCluster system, which implements a framework for submitting, managing, and adapting job proxies across distributed high-performance computing (HPC) host clusters. To date, the system has been used to provision many hundreds of thousands of CPUs for computational experiments requiring high throughput on HPC infrastructures such as the NSF TeraGrid. Experimental evaluation of the proposed algorithm shows significant improvement in user job throughput: an average of 8% in simulation and 15% in a real-world experiment.