Fault tolerance is an important issue in Grid computing, where many heterogeneous machines are used. In this paper we present a flexible failure handling framework that extends a previously proposed service-oriented architecture for Distributed Data Mining and addresses the requirements for fault tolerance in the Grid. The framework allows users to recover from failures whenever a crash occurs on a Grid node involved in the computation. The implemented framework has been evaluated in a real Grid setting to assess its effectiveness and performance.
Published in:
Parallel, Distributed and Network-Based Processing (PDP), 2011 19th Euromicro International Conference on
Date of Conference: 9-11 Feb. 2011