Improving Parallelism in Data-Intensive Workflows with Distributed Databases | IEEE Conference Publication | IEEE Xplore

Improving Parallelism in Data-Intensive Workflows with Distributed Databases


Abstract:

The efficient execution of data-intensive workflows relies on strategies to enable parallel data processing, such as partitioning and replicating data across distributed ...Show More

Abstract:

The efficient execution of data-intensive workflows relies on strategies to enable parallel data processing, such as partitioning and replicating data across distributed resources. The maximum degree of parallelism a workflow can reach during its execution is usually defined at design time. However, designing workflow models capable to provide an efficient use of distributed computing platforms is not a simple task and requires specialized expertise. Furthermore, since Workflow Management Systems see workflow activities as black-boxes, they are not able to automatically explore data parallelism in the workflow execution. To address this problem, in this work we propose a novel method to automatically improve data parallelism in workflows based on annotations that characterize how activities access and consume data. For an annotated workflow model, the method defines a model transformation and a database setup (including data sharding, replication, and indexing) to support data parallelism in a distributed environment. To evaluate this approach, we implemented and tested two workflows that process up to 20.5 million data objects from real-world datasets. We executed each model in 21 different scenarios in a cluster on a public cloud, using a centralized relational database and a distributed NoSQL database. The automatic parallelization created by the proposed method reduced the execution times of these workflows up to 66.6%, without increasing the monetary costs of their execution.
Date of Conference: 02-07 July 2018
Date Added to IEEE Xplore: 06 September 2018
ISBN Information:
Electronic ISSN: 2474-2473
Conference Location: San Francisco, CA, USA

Contact IEEE to Subscribe

References

References is not available for this document.