Improving Parallelism in Data-Intensive Workflows with Distributed Databases | IEEE Conference Publication | IEEE Xplore

Abstract:

The efficient execution of data-intensive workflows relies on strategies that enable parallel data processing, such as partitioning and replicating data across distributed resources. The maximum degree of parallelism a workflow can reach during execution is usually fixed at design time. However, designing workflow models that make efficient use of distributed computing platforms is not a simple task and requires specialized expertise. Furthermore, because Workflow Management Systems treat workflow activities as black boxes, they cannot automatically exploit data parallelism during workflow execution. To address this problem, we propose a novel method that automatically improves data parallelism in workflows based on annotations that characterize how activities access and consume data. For an annotated workflow model, the method defines a model transformation and a database setup (including data sharding, replication, and indexing) to support data parallelism in a distributed environment. To evaluate this approach, we implemented and tested two workflows that process up to 20.5 million data objects from real-world datasets. We executed each model in 21 different scenarios on a cluster in a public cloud, using a centralized relational database and a distributed NoSQL database. The automatic parallelization created by the proposed method reduced the execution times of these workflows by up to 66.6%, without increasing the monetary cost of their execution.
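To make the idea of annotation-driven data parallelism concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical `Annotation` record that states which attribute to shard on and whether an activity processes each data object independently. The names `Annotation`, `shard`, `run_activity`, and `double_values` are illustrative only, and in-memory lists with a thread pool stand in for the distributed database and the cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from zlib import crc32


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation describing how an activity accesses its data."""
    shard_key: str     # attribute used to partition the data objects
    independent: bool  # True if each object can be processed on its own


def shard(objects, key, n_shards):
    """Assign each object to a shard using a stable hash of its key attribute."""
    shards = [[] for _ in range(n_shards)]
    for obj in objects:
        index = crc32(str(obj[key]).encode("utf-8")) % n_shards
        shards[index].append(obj)
    return shards


def run_activity(activity, objects, annotation, n_shards=4):
    """Run a black-box activity, in parallel over shards if the annotation allows."""
    if not annotation.independent:
        # Without independence we keep the original, sequential execution.
        return activity(objects)
    shards = shard(objects, annotation.shard_key, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partial_results = pool.map(activity, shards)
    # Merge the per-shard outputs into a single result list.
    return [item for part in partial_results for item in part]


def double_values(batch):
    """Example black-box activity that treats every object independently."""
    return [{"id": o["id"], "value": o["value"] * 2} for o in batch]


if __name__ == "__main__":
    data = [{"id": i, "value": i} for i in range(10)]
    result = run_activity(double_values, data, Annotation("id", True))
    print(len(result))
```

In this sketch the activity itself stays unchanged. Only the annotation decides whether the runner may split the input, which mirrors the black-box constraint described in the abstract. The order of the merged results follows the shard layout, not the input order.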
Date of Conference: 02-07 July 2018
Date Added to IEEE Xplore: 06 September 2018
ISBN Information:
Electronic ISSN: 2474-2473
Conference Location: San Francisco, CA, USA