Grid environments pose significant challenges because of the diverse failures encountered during job execution. Computational grids are the main execution platform for long-running jobs, which require a long commitment of grid resources, so fault tolerance in such an environment cannot be ignored. Most grid middleware has either ignored failure issues or developed ad hoc solutions. Most existing fault tolerance techniques are application dependent and cause a cognitive problem. This paper examines existing fault detection and tolerance techniques in various middleware. We propose a fault-tolerant layered grid architecture with a cross-layered design. In our approach, the Hybrid Particle Swarm Optimization (HPSO) algorithm and the Anycast technique are used in conjunction with the Globus middleware. We adopt a proactive and reactive fault management strategy for centralized and distributed environments. The proposed strategy helps identify the root cause of failures and resolve the cognitive problem. Our strategy minimizes computation and communication, thus achieving higher reliability. Anycast confines the effect of Denial of Service/Distributed Denial of Service (DoS/DDoS) attacks to the area nearest the source of the attack, thus achieving better security. Significant performance improvement is achieved by applying Anycast before HPSO. Selecting more reliable nodes results in less checkpointing overhead.