Grid system software is inherently complex and hard to build and maintain. In this paper, we propose a self-managing building block, the grid unit, which facilitates the construction of grid systems with higher availability and lower management overhead. We present an agent organization as an autonomic management framework and propose a self-recovering protocol that removes most of the difficult tasks from the system administrator's routine. The system has been deployed since 2004 on Dawning 4000A, the biggest node of the China grid system. We have conducted extensive experiments to evaluate the grid unit, and the collected log data show that the availability of a grid parallel process management service built on the grid unit reaches 99.997%.
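To put the reported availability figure in operational terms, the following minimal sketch converts an availability percentage into expected downtime. It assumes availability means the fraction of time the service is usable and that the period is a 365-day year; the function name `downtime_minutes` is hypothetical and not taken from the paper.

```python
def downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the expected unavailable minutes over a period.

    Assumption: availability is the fraction of wall-clock time the
    service is usable, expressed as a percentage.
    """
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * 24 * 60


if __name__ == "__main__":
    # 99.997% is the availability figure reported in the abstract.
    yearly = downtime_minutes(99.997)
    print(f"{yearly:.1f} minutes of downtime per 365-day year")
```

Under these assumptions, 99.997% availability corresponds to roughly 16 minutes of downtime per year.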