Easy and reliable cluster management: the self-management experience of Fire Phoenix


Abstract:

High-performance clusters are rapidly becoming an important computing platform for both scientific and business applications. To meet the new demands and challenges, cluster system software has inevitably grown complex. Even for experienced administrators, managing a cluster system is an exhausting job. This paper introduces Fire Phoenix, scalable and self-managing cluster system software that supports both scientific and commercial applications. With its self-configuring and self-healing features, much of the machine configuration and error recovery can be done automatically. Our design has been proven effective in the operation of the Dawning 4000A supercomputer, which is the biggest cluster system in China.
Date of Conference: 25-29 April 2006
Date Added to IEEE Xplore: 26 June 2006
Print ISBN: 1-4244-0054-6
Print ISSN: 1530-2075
Conference Location: Rhodes Island

1. Introduction

Within the last decade, clustering has become a mainstream technology thanks to its low cost and high scalability. No longer limited to traditional scientific computing, clusters are increasingly adopted in business domains including databases, large web sites, and digital libraries. Compared with scientific computing, these business domains have different requirements, such as scalability, reliability, and support for heterogeneous platforms. To meet these challenges, cluster system software should provide a flexible component framework, high-availability services, and dynamic configuration so that it can keep up with the growing and varying demands of business applications. At the same time, it should hide this complexity and make cluster management easy and reliable, so as to avoid operation errors and reduce the total cost of ownership (TCO).
