This paper addresses the problem of effectively reproducing the results of previous scientific workflow executions. We first identify the information that must be recorded in a workflow execution log, and from that information we determine the data flow among experiments. To record a workflow execution log in a relational database management system (RDBMS) when the workflow design is unavailable, we propose a generic relational storage schema. We then design techniques that automatically discover, by posing appropriate SQL queries, the minimal set of experiments that must be performed to reproduce a scientific result. Although such SQL queries can be evaluated by an off-the-shelf database system, we investigate the unique characteristics of workflow log data and optimization techniques for evaluating these queries efficiently.
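As a rough illustration of the idea of storing an execution log relationally and querying it for the experiments behind a result, the following minimal sketch uses Python's built-in `sqlite3` module. The two-table schema (`experiment`, `exp_io`), the column names, the sample rows, and the helper `experiments_to_rerun` are all hypothetical and are not taken from the paper. The query simply walks the data flow upstream with a recursive common table expression, under the assumption that no intermediate data has been retained; the paper's actual schema, its notion of minimality, and its optimizations may differ.

```python
import sqlite3

# Hypothetical log schema: one row per experiment, one row per data item
# an experiment consumed ('in') or produced ('out').
SCHEMA = """
CREATE TABLE experiment (
    exp_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL
);
CREATE TABLE exp_io (
    exp_id    INTEGER NOT NULL REFERENCES experiment(exp_id),
    data_id   TEXT NOT NULL,
    direction TEXT NOT NULL CHECK (direction IN ('in', 'out'))
);
"""

# Start from the experiment that produced the requested result, then
# repeatedly add every experiment that produced one of the inputs of an
# experiment already in the set. UNION removes duplicates.
UPSTREAM_SQL = """
WITH RECURSIVE needed(exp_id) AS (
    SELECT exp_id FROM exp_io
    WHERE data_id = ? AND direction = 'out'
    UNION
    SELECT producer.exp_id
    FROM needed
    JOIN exp_io AS consumed
      ON consumed.exp_id = needed.exp_id AND consumed.direction = 'in'
    JOIN exp_io AS producer
      ON producer.data_id = consumed.data_id AND producer.direction = 'out'
)
SELECT e.exp_id, e.name
FROM needed
JOIN experiment AS e ON e.exp_id = needed.exp_id
ORDER BY e.exp_id;
"""


def build_sample_log() -> sqlite3.Connection:
    """Create an in-memory log with a three-step chain and one unrelated run."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO experiment VALUES (?, ?)",
        [(1, "align"), (2, "filter"), (3, "plot"), (4, "unrelated")],
    )
    conn.executemany(
        "INSERT INTO exp_io VALUES (?, ?, ?)",
        [
            (1, "raw", "in"), (1, "aligned", "out"),
            (2, "aligned", "in"), (2, "filtered", "out"),
            (3, "filtered", "in"), (3, "figure", "out"),
            (4, "other_raw", "in"), (4, "other", "out"),
        ],
    )
    return conn


def experiments_to_rerun(conn: sqlite3.Connection, result_data_id: str):
    """Return (exp_id, name) for every experiment upstream of the result."""
    return conn.execute(UPSTREAM_SQL, (result_data_id,)).fetchall()


if __name__ == "__main__":
    log = build_sample_log()
    print(experiments_to_rerun(log, "figure"))
```

In this sample log, asking for `figure` selects the three chained experiments and leaves the unrelated run out. An index on `exp_io(data_id, direction)` would be one obvious starting point for speeding up the recursive join, though that is a general database practice rather than a technique attributed to the paper.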
Published in:
Services, 2007 IEEE Congress on
Date of Conference: 9-13 July 2007