Abstract:
In recent years, distributed processing frameworks such as Apache Spark have been widely used to run big data applications. Predicting an application's execution time is an important goal, since it helps the end user determine how many processing resources to reserve. While some previous works examine the problem of profiling Spark applications, they mainly focus on specific application types (e.g., machine learning applications) and rely on the existence of a large number of previous execution runs. In this work, we aim to overcome these limitations by minimizing the number of past execution runs needed for the profiling phase. Furthermore, we identify patterns of continuous identical dataset transformations across different applications to cope with the limited availability of historical data. We propose an online profiling framework, called Dione, that estimates the running times of new applications even when no historical data is available. Finally, in a detailed experimental evaluation using practical workloads on our local cluster, we show that our approach accurately predicts the execution times of Spark applications while requiring 30% less training time and monetary cost than current state-of-the-art techniques.
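The core idea of reusing profiling data across applications that share an identical leading run of dataset transformations can be illustrated with a minimal sketch. This is not the paper's actual implementation: the stage signatures, the per-stage timing reuse, and the `fallback_stage_time` parameter are all simplifying assumptions made for illustration.

```python
# Hedged sketch: match the longest shared prefix of two applications'
# transformation lineages, reuse measured stage times for that prefix,
# and fall back to a rough default for the unmatched suffix.

def common_prefix_len(lineage_a, lineage_b):
    """Length of the longest identical leading run of transformations."""
    n = 0
    for a, b in zip(lineage_a, lineage_b):
        if a != b:
            break
        n += 1
    return n

def estimate_runtime(new_app, profiled_app, profiled_stage_times,
                     fallback_stage_time):
    """Sum measured times for the shared prefix; estimate the rest with a
    per-stage default (a hypothetical stand-in for a learned model)."""
    k = common_prefix_len(new_app, profiled_app)
    reused = sum(profiled_stage_times[:k])
    remaining = (len(new_app) - k) * fallback_stage_time
    return reused + remaining

# Example: two Spark-like pipelines sharing a textFile -> map -> filter prefix.
old_app = ["textFile", "map", "filter", "reduceByKey"]
old_times = [4.0, 2.0, 1.0, 3.0]  # seconds, from a past profiled run
new_app = ["textFile", "map", "filter", "groupByKey", "count"]

print(estimate_runtime(new_app, old_app, old_times, fallback_stage_time=2.5))
# 3 shared stages reuse 4+2+1 = 7.0 s; 2 new stages add 2 * 2.5 = 5.0 s -> 12.0
```

In practice, a framework like the one the abstract describes would derive these lineage signatures from the Spark DAG rather than from hand-written strings.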
Date of Conference: 11-14 December 2017
Date Added to IEEE Xplore: 15 January 2018
ISBN Information: