Skip to Main Content
Grids have the potential to revolutionize computing by providing ubiquitous, on demand access to computational services and resources. However, grid systems are extremely large, complex and prone to failures. A survey we have conducted reveals that fault diagnosis is still a major problem for grid users. When a failure appears at the user screen, it becomes very difficult for the user to identify whether the problem is in his application, somewhere in the grid middleware, or even lower in the fabric that comprises the grid. To overcome this problem, we argue that current grid platforms must be augmented with a collaborative diagnosis mechanism. We propose for such mechanism to use automated tests to identify the root cause of a failure and propose the appropriate fix. We also present a Java-based implementation of the proposed mechanism, which provides a simple and flexible framework that eases the development and maintenance of the automated tests.