System support for many task computing
Van Hensbergen, E.
Minnich, R.
IBM Res., Austin, TX
This paper appears in: Workshop on Many-Task Computing on Grids and Supercomputers, 2008 (MTAGS 2008)
Publication Date: 17 Nov. 2008
On page(s): 1-8
Location: Austin, TX
ISBN: 978-1-4244-2872-4
INSPEC Accession Number: 10471338
Digital Object Identifier: 10.1109/MTAGS.2008.4777907
Current Version Published: 2009-02-06
Abstract
The popularity of large-scale systems such as Blue Gene has extended their reach beyond HPC into the realm of commercial computing. There is a desire in both communities to broaden the scope of these machines from tightly coupled scientific applications running on MPI frameworks to more general-purpose workloads. Our approach addresses issues of scale by leveraging the huge number of nodes to distribute operating system services and components across the machine, tightly coupling the operating system and the interconnects to take maximum advantage of the unique capabilities of the HPC system. We plan to provision nodes to provide workload execution, aggregation, and system services, and to dynamically re-provision nodes as necessary to accommodate changes, failure, and redundancy. By incorporating aggregation as a first-class system construct, we will provide dynamic hierarchical organization and management of all system resources. In this paper, we describe the design principles of our approach, using file systems, workload distribution, and system monitoring as illustrative examples. Our end goal is to provide a cohesive distributed system that can broaden the class of applications for large-scale systems and make them more approachable for a larger class of developers and end users.