On Wednesday, July 29th, IEEE Xplore will undergo scheduled maintenance from 7:00-9:00 AM ET (11:00-13:00 UTC). During this time there may be intermittent impact on performance. We apologize for any inconvenience.
This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being efficient, fault-tolerant, highly available, and dynamic as a system internally, and in supporting fault-tolerance and dynamicity for its application programs as well. Starfish achieves these goals by combining group communication technology with checkpoint/restart, and uses a novel architecture that is both flexible and portable and keeps group communication outside the critical data path, for maximum performance