d-Simplexed: Adaptive Delaunay Triangulation for Performance Modeling and Prediction on Big Data Analytics

Abstract—Big Data processing systems (e.g., Spark) have a number of resource configuration parameters, such as memory size, CPU allocation, and the number of running nodes. Regular users and even expert administrators struggle to understand the relation between different parameter configurations and the overall performance of the system. In this paper, we address this challenge by proposing a performance prediction framework, called d-Simplexed, to build performance models over varied configurable parameters on Spark. We take inspiration from the field of Computational Geometry to construct a d-dimensional mesh using Delaunay Triangulation over a selected set of features. From this mesh, we predict execution time for various feature configurations. To minimize the time and resources spent in building a bootstrap model with a large number of configuration values, we propose an adaptive sampling technique that allows us to collect as few training points as required. Our evaluation on a cluster of computers using WordCount, PageRank, Kmeans, and Join workloads from the HiBench benchmarking suite shows that we can achieve less than a 5% prediction error by sampling less than 1% of the data.


INTRODUCTION
Numerous Big Data frameworks have been introduced to address the problem of organizing large-scale fault-tolerant computation in a clustered environment (the cloud). The general-purpose frameworks which have emerged, such as Spark [40], Hadoop [38], and MapReduce [9], are capable of handling diverse Big Data analytics workloads. In a cloud environment where resources are billed down to the second, the costs of job inefficiencies rapidly become visible, and it is becoming increasingly important for Big Data frameworks to run jobs efficiently [43].
These modern Big Data frameworks are complex systems. They have many resource tuning knobs, such as memory, CPU/GPU allocation, the number of running nodes, and other I/O considerations. With so many tuning knobs, the end users (often non-technical people) submitting the jobs are left clueless about the impact of each parameter on the performance of the job and the overall performance of the cluster. Furthermore, the choice of configuration parameters is highly dependent on the type of job, the amount of resources available, the input data size, etc.
Motivation and Challenges. Our work models the performance topography of the whole feature space. Fig. 1 depicts the execution time of a Spark job (Kmeans clustering) with three varied parameters: input data size, the number of vcores (virtual CPU cores), and the amount of memory. In contrast, tuning only finds one parameter value towards the (locally) optimal performance of a specific metric, and tuning for another goal often requires retraining. Moreover, the best parameters for runtime are not necessarily cost-effective [35]. Unlike tuning, performance modeling can determine the best resource provisioning (e.g., [7], [32]) towards different goals, e.g., shortest runtime [30], lowest resource consumption [15], highest throughput [1], etc. It is therefore meaningful to build a topological model with a few important parameters to represent the performance surface for further decisions, such as the best monetary cost (determining the optimal number of machines [35]) in Amazon "pay-as-you-go" services, or fewer resources with a deadline guarantee in a contested cluster [7].
However, we are faced with challenges in building an accurate prediction model and in obtaining enough samples for training it. There exist many excellent works on performance modeling and tuning for different platforms, such as Hadoop (e.g., [17], [30]), Spark (e.g., [16], [35]), and databases (e.g., [1], [11]). Table 1 compares the most relevant state-of-the-art approaches. Alvaro [16], OtterTune [1], and CDBTune [42] use several regressors to tune a set of parameters. However, they train their models to maximize one objective, i.e., to predict the (locally) optimal performance, and therefore do not perform well on the whole topology of the feature space. In contrast, Ernest [35] is designed to predict any unknown parameters in the topology, but it is capable of predicting only the vcore parameter.
Yet these regression models either face deteriorating accuracy due to overfitting and require many samples for model building (Gaussian Processes [1] and multi-layer Neural Networks [42]), or are simple but achieve unsatisfactory accuracy (Linear Regression [16], Ernest [35]). We apply a novel, accurate model and develop an adaptive sampling technique to mitigate the training cost, as follows.
Contribution 1 in Section 4 - Delaunay Triangulation Model. We introduce a novel black-box method for modeling and predicting the performance of Big Data applications. We build a bootstrap performance surface model, which accurately predicts any unknown parameters. We take inspiration from the field of Computational Geometry to construct a (d+1)-dimensional mesh (e.g., [13]) over a selected set of d features with Delaunay Triangulation (DT) [10], which has wide applications in the fields of computational geometry (e.g., curved surface modeling [6]) and computer graphics (e.g., path planning in automated driving [2]).
In particular, DT partitions the d-dimensional feature space into a set of interconnected d-simplexes (a simplex generalizes the triangle to arbitrary dimensions). Such a piece-wise model helps to avoid overfitting. Fig. 2a illustrates an example mesh of 2-simplexes constructed from memory and CPU core features. We first determine the simplex to which an unknown feature configuration belongs. Then we predict the job execution time by calculating a hyperplane in (d+1)-dimensional space, bringing in the runtime dimension as shown in Fig. 2b. For Fig. 1, DT constructs a mesh of 3-simplexes (tetrahedrons) over the three features (i.e., vcore, memory, and data size) and predicts the runtime of a Kmeans job.
Contribution 2 in Section 5 - Adaptive Sampling. The premise for the above performance prediction is building the mesh model. However, training on the whole topography is excessive with a large parameter space, and randomly picking samples does not guarantee the desired accuracy. Determining both the right fraction and the appropriate representatives of the samples for building a model is not trivial. To address this challenge, we integrate an adaptive sampling framework into d-Simplexed. The sampling method bootstraps with Latin Hypercube Sampling (LHS) (e.g., [25]), which spreads widely in each feature dimension and uses few points to represent the whole range of feature values. Based on experiments, we develop an approach to estimate the utility of new candidate samples and select those samples which significantly improve the model.
Contribution 3 in Section 6 - Comprehensive Experiments. The Delaunay Triangulation (DT) model and the adaptive sampling algorithms (together named d-Simplexed) are implemented and comprehensively evaluated on the Spark platform through benchmark and synthetic workloads. The empirical experiments show that d-Simplexed outperforms state-of-the-art methods, such as Decision Tree Regression (DTR) by Alvaro [28], the cost-based method Ernest [35], the Gaussian Process (GP) in OtterTune [29], and the Multilayer Neural Network (NN) in CDBTune [42]. Also, the proposed adaptive sampling method enables us to use fewer samples to train the model and outperforms the baseline sampling techniques, i.e., random and grid sampling. d-Simplexed exploits a few samples to achieve very low prediction error, and the error continues decreasing as the samples increase. For example, we achieve 1.58 percent prediction error by sampling 1 percent of the data points for the Kmeans workload.
The rest of the paper is structured as follows. In Section 2, we provide background on Spark and the mathematical primitives required to understand our method. In Section 3, we present the problem formally. Sections 4 and 5 present our contributions in modeling, prediction, and sampling for d-Simplexed. In Section 6, we analyze the empirical performance results. We survey related work in Section 7 and conclude the paper in Section 8.

PRELIMINARIES
In this section, we lay the groundwork required to understand our method presented hereafter, including Spark preliminaries and computational geometry primitives.

Spark Preliminaries
Spark [40] is a general-purpose cluster computing framework. It was originally built to handle iterative Big Data algorithms (e.g., Machine Learning workloads). It uses a distributed memory abstraction called Resilient Distributed Datasets (RDDs) [41] that enables parallel computation on data and makes such data exceptionally resilient to faults.
At runtime, Spark applications are split into tasks and assigned to running executors. The executors remain alive for the duration of the job, using multiple CPU threads to run received tasks in parallel. The size and number of executors, as well as their available execution threads, are the key performance indicators for any given job. A summary of the relevant tuning parameters controlling them is presented in Table 2. Spark runs on top of a cluster resource manager such as Mesos [18] or YARN [34]. When submitting jobs to the resource manager, users may specify the configuration parameters from Table 2, and the resource manager will manage the allocation.

Delaunay Triangulation Primitives
In the following, we introduce some basic terms and concepts on Delaunay Triangulation (DT) from Computational Geometry.
Convex Region. A convex region [26] is a region such that, for every pair of points in the region, every point on the straight line segment that joins the pair of points is also within the region. We show an example in Fig. 3a, illustrating the convex and non-convex regions.
Convex Set. A convex set [26] represents the points inside a convex region. In convex geometry, a convex set is a subset of an affine space that is closed under convex combinations [5].
Convex Hull. It is the fundamental construction of Computational Geometry. The convex hull [3] of a set of points is the smallest convex set that contains the points.
Hyperplane. It is the generalization of a plane in three dimensions to higher dimensions, i.e., any d-dimensional subspace of R^{d+1}.
Simplex. It is the generalization of a triangle to other dimensions. In d dimensions, the concept of a triangle becomes a d-simplex. For example, in Fig. 3b, a 2D triangle is a 2-simplex, a 3D tetrahedron is a 3-simplex, and so on. Furthermore, each simplex is constructed of facets, which form the boundary (i.e., min and max values) of the surface. The number of facets in a simplex is a function of the number of edges.
In our context, we construct the model by plotting d feature configurations of Spark into a d-dimensional space, using DT to partition the feature space into a set of interconnected d-simplexes, thereby forming hyperplanes in R^{d+1} over the feature set. Unlike previous approaches (e.g., OtterTune [1]), we build the DT performance model over the entire feature space instead of searching for local optima (Section 4).

PROBLEM STATEMENT
At a high level, we want to predict the runtime of a Big Data analytics job with varied parameter configurations, given a set of historical (or sampled) data points. Intuitively, the research problem of this paper is: how to develop a performance model based on the sampled data to produce accurate runtime predictions for given parameter configurations? More formally, given a collection of n samples S = {⟨F_i, T_i⟩ | 1 ≤ i ≤ n} from a configuration space F, where each configuration F_i = {f_1, ..., f_d} includes d features (parameters) and T_i is the corresponding runtime, a prediction model PM(·) returns the estimated running time T̂ for a configuration F. To evaluate the model in our experiments, we use a metric called the Mean Absolute Percentage Error (MAPE), which is the average of the percent errors between the estimated runtimes (T̂) and the actual runtimes (T):
MAPE = (100% / l) × Σ_{j=1}^{l} |T̂_j − T_j| / T_j,   (2)
where l is the number of specific feature configurations to be tested.
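The MAPE metric is straightforward to compute; a minimal sketch (the function name is ours, not the paper's):

```python
def mape(estimated, actual):
    """Mean Absolute Percentage Error over l test configurations."""
    assert len(estimated) == len(actual) and len(actual) > 0
    return 100.0 * sum(abs(e - a) / a
                       for e, a in zip(estimated, actual)) / len(actual)
```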
TABLE 1: Comparison of the most relevant state-of-the-art approaches

Approach      | Model                  | Parameters         | Goal     | Notes
Ernest [35]   | Cost function          | Vcore              | Modeling | Works for only one parameter
Alvaro [16]   | Regression models      | Parallelism        | Tuning   | Hard to pick a model; (local) optimum finding
OtterTune [1] | Gaussian process       | Sets of parameters | Tuning   | Possible overfitting; (local) optimum finding
CDBTune [42]  | Neural networks        | Sets of parameters | Tuning   | Possible overfitting; (local) optimum finding
d-Simplexed   | Delaunay triangulation | Sets of parameters | Modeling | Applies to multiple parameters, with adaptive sampling

Fig. 2. (a) An example Delaunay Triangulation constructed from two-dimensional samples for the features memory and vcore, and (b) runtime prediction using the feature configurations for a Spark workload.

3. Features can be not only system parameters (e.g., vcore and memory) but also job or input data information (e.g., data size).

In general, this problem can be converted to a general prediction problem over historical data (i.e., training data), which can be solved by statistical machine learning methods such as Gaussian Processes (GP) [29] and Neural Networks [42]. However, as we will show in our evaluation in Section 6, these methods require massive training data for good prediction, and it is very costly to construct an accurate performance model for our problem with them. In contrast, we develop a novel method for accurate runtime prediction by building an effective estimation model which requires much less training data. For our problem, we propose a framework, called d-Simplexed, which uses the Delaunay Triangulation (DT) model to make predictions for a given parameter configuration and heuristic adaptive sampling to reduce the number of samples needed for training. Fig. 4 shows our framework at a high level, and Algorithm 1 briefly describes it. The main components are adaptive sampling, DT modeling, and performance prediction. We present the DT model and its prediction in the following section, then introduce the adaptive sampling that accelerates the modeling in Section 5.

DELAUNAY TRIANGULATION
The central task of the paper is to model the performance (e.g., the Kmeans performance mesh in Fig. 1 and the synthetic workload performance mesh in Fig. 11) of a job with a set of parameters. There exist many methods for creating polygon meshes. In particular, a Delaunay triangulation [10] is a triangulation DT(P) (where P is a set of discrete points in a plane) such that no point in P is inside the circumcircle of any triangle in DT(P). Delaunay triangulation maximizes the minimum angle over all the angles of the triangles in the triangulation. The reasons for choosing DT are three-fold: 1) It prefers to form equilateral triangles as much as possible. Avoiding long and skinny triangles leads to a better interpolation (prediction of runtime in our case) of values, because the vertices of a long and skinny triangle tend to be spread out far from each other [8].
2) It can easily be extended into higher dimensions, meaning that the model is not limited by the number of features in our context. 3) It is well researched: many algorithms and tools have been developed for the creation and traversal of triangulations (e.g., Flip [8] and QuickHull [3]). The division of a 2-dimensional or d-dimensional shape into a collection of discrete planes has wide applications in the fields of computational geometry and especially computer graphics (e.g., path planning in automated driving [2]). We introduce the following main steps to build and use the DT model for our problem of performance modeling and prediction:

Modeling
Triangulation. With d features selected, we generate a DT mesh in R^d space. In two or three dimensions, DT forms familiar meshes over the feature configuration space, thereby connecting unknown points and enabling runtime prediction. Fig. 2 shows an example of a Delaunay Triangulation over a 2-dimensional ⟨memory, vcore⟩ data set. In higher dimensions, the idea works the same way, except that it becomes more complex to visualize. Other approaches (e.g., cost-based models [31], [35]) attempt to fit a smooth function over a similar space. In contrast, we split the space into regions using DT, instead of trying to fit a single function to the whole data set. To efficiently employ DT, we apply the Quickhull [3] algorithm. There are many variations of the Quickhull algorithm, and the one we implemented is the most generic. Quickhull runs in O(n log n) with a worst-case complexity of O(n^2). The worst case occurs when points have unfavorable (highly symmetric) distributions. In our framework, we do not consider this a valid cause for concern, as our sampling method in Section 5 ensures that we pick well-distributed, asymmetric points, thereby avoiding the high-symmetry pitfalls.
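For illustration, the triangulation step can be reproduced with SciPy's qhull-backed Delaunay routine (a sketch over made-up ⟨memory, vcore⟩ samples, not the authors' code):

```python
import numpy as np
from scipy.spatial import Delaunay

# Hypothetical <memory (GB), vcore> sample configurations
points = np.array([[1.0, 4.0], [1.0, 16.0], [8.0, 4.0], [8.0, 16.0], [4.0, 10.0]])

# SciPy wraps qhull's Quickhull implementation
tri = Delaunay(points)

# Each row of tri.simplices holds the vertex indices of one 2-simplex;
# one interior point splits the rectangle into 4 triangles here
num_simplexes = len(tri.simplices)

# Locate the simplex containing a new (unseen) configuration
simplex_id = int(tri.find_simplex(np.array([[3.0, 8.0]]))[0])
```

`find_simplex` returns -1 for points outside the convex hull, which matches the paper's concern about "outer-hull" configurations.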
Projection. Following that, we use the d features together with runtime to find a multivariate linear function representing a hyperplane in R^{d+1}, which is responsible for the predictions of points contained in each region (d-simplex). There is no requirement for the fitting function to be continuous; defining it piece-wise makes it easier and more accurate. DT partitions the feature space into a series of simplexes, and we construct a hyperplane for each simplex by bringing in the runtime dimension. This process is best explained by stepping through two and three dimensions, then extrapolating into higher dimensions. In two dimensions, e.g., ⟨f1, f2⟩, DT generates 2-simplexes (triangles), each naturally containing three points. For each point ⟨f1, f2⟩, we add the runtime as the third dimension (i.e., ⟨f1, f2, runtime(f1, f2)⟩), then compute the hyperplane containing these three points. In three dimensions, e.g., ⟨f1, f2, f3⟩, DT creates 3-simplexes (tetrahedrons), each containing four points, as shown in Fig. 3b. Next, we add the runtime and then compute the hyperplane containing each tetrahedron's points. As we travel into higher dimensions, the process remains the same. In other words, in a model constructed with d features, we must lift it into R^{d+1} by including runtime in a separate step. We illustrate the modeling in Example 1.
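The projection step for a single 2-simplex amounts to solving a 3x3 linear system (a sketch; the vertex coordinates and runtimes here are invented):

```python
import numpy as np

# Hypothetical 2-simplex: three <f1, f2> configurations and measured runtimes
verts = np.array([[1.0, 4.0], [1.0, 16.0], [8.0, 4.0]])
runtimes = np.array([120.0, 60.0, 45.0])

# Solve b0 + b1*f1 + b2*f2 = runtime for the plane through the 3 lifted points
A = np.column_stack([np.ones(3), verts])
b = np.linalg.solve(A, runtimes)

def plane_predict(f1, f2):
    """Runtime predicted by this simplex's hyperplane."""
    return b[0] + b[1] * f1 + b[2] * f2
```

The system is always solvable because Delaunay simplexes are non-degenerate (their vertices are affinely independent).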

Example 1.
To generate a DT mesh for a set of points in R^d space, we utilize the relationship between DT in R^d and a parabola in R^{d+1}. Fig. 5 depicts the simplex and hyperplane relation in 2 dimensions (x, y) via the parabola (x, y, x^2 + y^2). We first compute the convex hull of the point set in R^{2+1} space (by lifting the points onto the parabola), then project the convex hull back into R^2 space (i.e., simplexes), leaving the DT in R^2.
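The lifting construction in Example 1 can be checked numerically: the downward-facing facets of the convex hull of the lifted points coincide with the Delaunay simplexes (a sketch using SciPy; random points stand in for sampled configurations):

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay

rng = np.random.default_rng(0)
pts = rng.random((12, 2))

# Lift each (x, y) onto the paraboloid (x, y, x^2 + y^2)
lifted = np.column_stack([pts, (pts ** 2).sum(axis=1)])

# Facets of the 3-D hull whose outward normal points downward
# (negative z-component) project back to the Delaunay simplexes in R^2
hull = ConvexHull(lifted)
lower_facets = {frozenset(s) for s, eq in zip(hull.simplices, hull.equations)
                if eq[2] < 0}

dt_simplices = {frozenset(s) for s in Delaunay(pts).simplices}
```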

Prediction
Once we have a new configuration F = ⟨f1, f2, ..., fd⟩, we can predict its runtime based upon the features in the model. A formal procedure can be found in Algorithm 2. Given the computed DT model and the new configuration F, the first step is to determine which simplex the given point belongs to. We then calculate the hyperplane for prediction. The generic scalar form of a hyperplane in d+1 dimensions is b_0 + b_1·x_1 + ... + b_{d+1}·x_{d+1} = 0. From the known d+1 configurations of the simplex and their runtimes, we can compute the hyperplane's parameters {b_0, ..., b_{d+1}}. We define h(·) to return the runtime (i.e., x_{d+1}) when given new configuration values:
h(x_1, ..., x_d) = -(b_0 + b_1·x_1 + ... + b_d·x_d) / b_{d+1}.   (3)
Once this simplex's hyperplane is obtained, we simply plug the known configuration values into Equation (3) to get the estimated runtime at that specific configuration. Note that this prediction algorithm relies heavily on the assumption that any potential prediction values will lie inside the convex hull of the model. This is precisely why our approach picks 2^d boundary samples at the extremes of the feature configuration space. This step ensures that any future predictions will fall inside the convex hull, and thus have a simplex for prediction. Points lying outside the convex hull ("outer-hull" points) are difficult to predict because their values must be extrapolated from a hyperplane inside the hull. By selecting boundary samples at the beginning, we can avoid this situation, as described in the following section.
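Putting the simplex lookup and the hyperplane fit together, the prediction procedure can be sketched as follows (our own helper, assuming in-hull queries; equivalent in spirit to SciPy's `LinearNDInterpolator`):

```python
import numpy as np
from scipy.spatial import Delaunay

def predict_runtime(configs, runtimes, query):
    """Sketch of the prediction step: locate the enclosing simplex,
    fit its hyperplane, and evaluate it at the query configuration."""
    tri = Delaunay(configs)
    s = int(tri.find_simplex(query.reshape(1, -1))[0])
    if s < 0:
        raise ValueError("query lies outside the convex hull")
    idx = tri.simplices[s]                         # the d+1 simplex vertices
    A = np.column_stack([np.ones(len(idx)), configs[idx]])
    b = np.linalg.solve(A, runtimes[idx])          # hyperplane coefficients
    return float(b[0] + b[1:] @ query)
```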
Result of Algorithm 2: the predicted runtime T̂ for configuration F. As samples are added to the model, DT splits the surface into multiple regions and models the surface locally and linearly, avoiding overfitting, with an upper error bound. That is, when every hyperplane covers a monotonic surface, the prediction performance is bounded as follows:

Lemma 1. If each hyperplane predicts a monotonic surface, the MAPE of each hyperplane's prediction is bounded by (T_max - T_min) / T_min, where T_max and T_min are the maximum and minimum performance on the split surface (or hyperplane).
The proof idea is to calculate the maximum error between an arbitrary monotonic function and the linear function that shares its endpoints T_min and T_max.
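The idea behind Lemma 1 can be sanity-checked numerically (a sketch with an assumed monotonic quadratic runtime curve standing in for a monotonic surface):

```python
import numpy as np

# A monotonic "true" runtime curve over one feature dimension (assumed,
# for illustration): T_min = 10 at x = 1, T_max = 59 at x = 8
x = np.linspace(1.0, 8.0, 200)
t_true = 10.0 + (x - 1.0) ** 2

# Linear hyperplane prediction through the two endpoints
t_lin = np.interp(x, [x[0], x[-1]], [t_true[0], t_true[-1]])

# Both curves stay within [T_min, T_max], so the absolute gap never
# exceeds T_max - T_min, and the MAPE stays under (T_max - T_min)/T_min
mape = 100.0 * np.mean(np.abs(t_lin - t_true) / t_true)
bound = 100.0 * (t_true.max() - t_true.min()) / t_true.min()
```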

ADAPTIVE SAMPLING
As mentioned in the earlier section, DT may perform poorly when the feature space has an unfavorable or highly symmetric distribution. Moreover, even finding an optimal parameter setting for the system is NP-hard [1], [42]. To remedy this, we rely on heuristic sampling. The baseline sampling techniques in this situation are random sampling and grid sampling. Random sampling selects points at random from the configuration space without replacement. Grid sampling (e.g., sampling for database parameters in [33]) meanwhile divides the configuration space into uniform grids and selects points which lie at an equal distance from each other. Samples gathered from these baseline sampling techniques may produce a sub-optimal model. The reason is that they may over-sample in regions where there is little change in the runtime and under-sample in regions where the runtime is highly dynamic. Consequently, in this paper, we introduce a novel feedback-driven sampling technique that improves the model's performance by selecting more samples from the places where the topography of the model is turbulent, i.e., it avoids model overfitting by improving the errors addressed in Lemma 1. To keep each hyperplane predicting a monotonic surface, we heuristically discover sparse areas with fewer known configurations, while to lower the error bound, we heuristically choose areas with large performance changes (defined in Equation (6)).

4. Delaunay triangulation via parabola projection is detailed in http://www.cs.wustl.edu/~pless/546/lectures/L13.html
In particular, we split our sampling technique into two phases. First, we select seed samples to initialize the model. Second, we iteratively add samples by a utility-driven method to improve the model in each step. The sample with the highest utility, i.e., the highest potential to improve the model's prediction accuracy, is picked to update the model.
Initial Samples. We go through the following steps to generate the seed samples: 1) Use Latin Hypercube Sampling (LHS) [20] to select m feature configurations. In the LHS technique, samples are chosen in such a way that the complete range of parameter values is fully represented. 2) However, LHS may generate bad spreads where all samples lie along the diagonal. Therefore, we maximize the minimum distance between any pair of samples: among candidate sample sets X, we select the set X* such that
X* = argmax_X min_{x_i, x_j ∈ X, x_i ≠ x_j} Dist(x_i, x_j),   (5)
where Dist is a typical distance metric, e.g., Euclidean distance. 3) Combine the LHS samples with 2^d boundary samples (taken from the minimum and maximum points of each feature axis) to gather 2^d + m seed samples for the initial model. These seed samples are excluded from adaptive sampling. The reason for including boundary points is to avoid outer-hull point prediction, which normally performs unsatisfactorily.

Adaptive Sampling. The adaptive sampling technique selects new points to improve the model accuracy. We heuristically search the areas with the greatest runtime changes and with more unknown configurations: we need more samples from areas where small changes in feature values may result in significant changes in runtime. Such a heuristic helps accelerate the convergence of the model on a turbulent surface, i.e., one whose performance changes rapidly and non-monotonically. We introduce a utility metric that computes the distance between the predicted point and its hyperplane. Intuitively, a higher utility value indicates a larger distance to the points of its prediction hyperplane and thus a higher potential improvement to the model. Given candidate samples X from the LHS domain D_LHS and n samples S with d features, we obtain the predicted runtime T̂^(i) for sample X^(i) from its hyperplane S' ⊆ S with h samples ⟨F^(i)_k, T(F^(i)_k)⟩, 1 ≤ k ≤ h. The utility U(X^(i)) of sample X^(i) is defined as
U(X^(i)) = (1/h) Σ_{k=1}^{h} |T̂^(i) - T(F^(i)_k)|,   (6)
where h is the number of points constructing its hyperplane.
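The seed-sample generation (LHS, a maximin spread, plus the 2^d corners) can be sketched as follows (the function names and the simple multi-restart maximin heuristic are our assumptions):

```python
import itertools
import numpy as np

def lhs_maximin(bounds, m, tries=50, seed=0):
    """Latin Hypercube Sampling with a maximin spread (a sketch).

    bounds: list of (lo, hi) per feature; m: number of samples.
    Among `tries` LHS draws, keeps the set whose closest pair of
    points is farthest apart (Equation (5)'s maximin criterion).
    """
    rng = np.random.default_rng(seed)
    best, best_score = None, -1.0
    for _ in range(tries):
        # One point per stratum in every dimension (the Latin property)
        unit = np.column_stack([(rng.permutation(m) + rng.random(m)) / m
                                for _ in bounds])
        X = np.array([[lo + u * (hi - lo) for u, (lo, hi) in zip(row, bounds)]
                      for row in unit])
        # Score a candidate set by its smallest pairwise distance
        dmat = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        np.fill_diagonal(dmat, np.inf)
        if dmat.min() > best_score:
            best, best_score = X, dmat.min()
    return best

def boundary_corners(bounds):
    """The 2^d corner configurations of the feature space."""
    return np.array(list(itertools.product(*bounds)))
```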
The iterative sampling technique proceeds as follows until a stopping condition has been met.

Algorithm 3. ADAPTIVE SAMPLING
Input: sample set S; initial prediction model. Our adaptive sampling is described in Algorithm 3. The algorithm adds one sample in each iteration. Line 2 draws m new sample points D_LHS using Latin Hypercube Sampling across the entire feature space in every iteration. Lines 5-8 find the sample with the highest utility, using the current model to compute the utility U(·) of each X^(i) in D_LHS by Equation (6). Next, we rank all the candidate samples by utility and pick the one with the largest U(X^(i)) as the next sample to add to the model. The features of this new sample F^(n+1) are obtained as
F^(n+1) = argmax_{X^(i) ∈ D_LHS} U(X^(i)).   (7)
Line 9 updates the model, and Line 10 removes the added sample from the candidate set. We define an explicit threshold on the prediction error as the stopping condition; in our empirical experiments, we continue selecting sampling points into the model until a target average error has been reached (e.g., MAPE ≤ 5 percent). We provide an example to illustrate picking the next point by utility. After adding a point, we update the model as shown in sub-figure (b). At that moment, we compute utilities of 1.25 and 2.005 for the configuration values 2.5 and 5, respectively. We choose the configuration value 5 as our next sample, since it carries the largest uncertainty among the points between 3 and 7 (exploration).
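The loop of Algorithm 3 can be sketched generically (the model/measure/validate hooks are our assumptions, not the paper's API; `measure` stands for running the actual Spark job once):

```python
import numpy as np

def adaptive_sampling(model, draw_candidates, measure, validate,
                      target_mape=5.0, max_iters=100):
    """Utility-driven sampling loop (a sketch of Algorithm 3)."""
    for _ in range(max_iters):
        if validate(model) <= target_mape:   # stopping condition on MAPE
            break
        cands = draw_candidates()            # fresh LHS draw each iteration
        utils = [model.utility(x) for x in cands]
        best = cands[int(np.argmax(utils))]  # highest-utility candidate
        model.add(best, measure(best))       # run the job once, update mesh
    return model
```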
It is noteworthy that the above sampling technique balances the conflicting tasks of exploration (understanding the global surface of the model) and exploitation (focusing on regions where the performance of adjacent points changes fast) that arise in model building, which is non-trivial to achieve.

EMPIRICAL EVALUATION
In this section, we present the empirical results and evaluate our approach systematically. Experimental results show the superiority of the d-Simplexed model and adaptive sampling over the state-of-the-art models and baseline sampling techniques. The detailed evaluations follow.
Compared Samplers. We implemented adaptive sampling and compare it with two baseline sampling techniques: random and grid sampling. Adaptive sampling supplies the next best samples for d-Simplexed as well as for the compared models.
Evaluated Workloads. Our evaluation consists of two suites of workloads. The first suite, i.e., WordCount, PageRank, Kmeans clustering, Bayesian classification, and SQL Join (described in Table 3), is from the HiBench benchmark [19]. The second suite, i.e., SingleWave, MultiWave, and Kmeans4d (described in Section 6.2.4), comprises synthetic turbulent surfaces that simulate more complex workloads for Spark or any other system.

Experiment Setting
On the software side, we used Apache Spark v2.1.0 on top of Hadoop v2.8.1. With Hadoop, we used HDFS as our distributed file system, and YARN as our resource manager. This is a typical open-source Spark software stack. From the hardware side, our experiments were conducted on a high performance computing cluster. The cluster consists of over 10 Dell PowerEdge M610 servers, each having 32 GB of RAM, 2 Intel Xeon E5540 2.53 GHz CPUs and 4 cores per CPU, making 16 vcores available with hyperthreading. We construct various experiment specific feature spaces from the available resources in this cluster for our experiments. We present runtimes as the average of three experimental runs. For each experiment, we randomly took 10 percent of corresponding feature space samples as the test data for computing accuracy.

Model Implementation
d-Simplexed is implemented as a Python project organized as a set of modules. We implement Delaunay Triangulation using a Python wrapper for the qhull library (http://qhull.org/), which internally uses the Quickhull algorithm [3] to determine the convex hull and the resulting Delaunay Triangulation for a set of points. The qhull library also contains useful abstractions for the components of the triangulation, such as determining whether a simplex contains a point. Latin Hypercube Sampling is developed in Python from scratch.
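For readers prototyping the same stack, SciPy wraps qhull as well; the combined mesh-plus-hyperplane pipeline behaves like linear barycentric interpolation (a sketch with invented numbers, not the authors' module):

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

# Hypothetical <memory, vcore> samples with measured runtimes (seconds)
pts = np.array([[1.0, 4.0], [1.0, 16.0], [8.0, 4.0], [8.0, 16.0], [4.0, 10.0]])
rt = np.array([100.0, 40.0, 40.0, 25.0, 50.0])

# Builds the qhull Delaunay mesh and does per-simplex linear prediction
model = LinearNDInterpolator(pts, rt)
at_vertex = float(model(1.0, 16.0))    # a training point reproduces its runtime
```

Outside the convex hull, `LinearNDInterpolator` returns NaN, mirroring the paper's need for the 2^d boundary samples.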
We implemented LR, DTR, and GP with scikit-learn (https://scikit-learn.org/) using their default model settings. Ernest is trained with the curve-fitting function from sklearn. We implemented the Multilayer Neural Network (NN) (with three hidden layers) [42] with Keras (https://keras.io) using the TensorFlow [12] backend.

Overview of Workload Evaluation
In this part, we show a high-level comparison of the different prediction models across workloads. Among all the compared methods, DTR performs satisfactorily, as DTR has a piece-wise regression function similar to d-Simplexed. Also, GP and NN outperform LR in the case of the Kmeans, Bayesian, and synthetic workloads (i.e., SingleWave & MultiWave), since the performance of these workloads changes abruptly (usually non-linearly) with the feature values. Unsurprisingly, GP and NN are more suitable than LR for fitting these non-linear curves. In contrast, LR outperforms GP in the case of the WordCount and Join workloads, as the performance of these two workloads is flatter, i.e., it changes only slightly as the parameters change.
We next show the detailed evaluation concerning various prediction models, features, sampling techniques, and complex workloads.

Model Evaluation
The first set of experiments compares d-Simplexed with adaptive sampling (Algorithm 3) against other state-of-the-art methods, namely Multivariate Linear Regression (LR), Gaussian Process (GP), Ernest, Decision Tree Regression (DTR), and the multi-layer Neural Network (NN).
We first consider input data size and vcore as the features, since Ernest only tolerates these two features. We compare all the models with variable data sizes and numbers of vcores (20-50 vcores). The unit size of each step is 1 GB of data and 1 vcore. Fig. 8 shows the experimental results for the Kmeans workload. d-Simplexed is the best among these six models in terms of Mean Absolute Percentage Error (MAPE, Equation (2)) and convergence, because of its property of local and piece-wise modeling and prediction. Ernest is better than LR because Ernest contains not only linear but also logarithmic terms, which better capture the performance properties. GP and NN fluctuate as the samples increase, which is typical behavior of global over-fitting. Since Ernest does not support other features, we exclude it from the remaining cases.
We now select memory and vcore as the features, since these two parameters are common yet challenging decisions to make in the cloud environment (e.g., Amazon on-demand pricing). Table 3 shows the detailed evaluations of the HiBench workloads against the state-of-the-art methods in terms of MAPE; 1 percent of the training samples are used in this evaluation. d-Simplexed is the best of the methods concerning MAPE. It achieves less than 5 percent MAPE for the Kmeans, WordCount, PageRank, and Join workloads, and 15.768 percent MAPE for the Bayesian workload. Fig. 9 depicts the more detailed MAPEs of the methods per workload as the sample size increases. Specifically, LR performs poorly for the (a) Kmeans and (d) Bayesian workloads since a linear model has difficulty capturing non-linear behavior. In contrast, the GP and NN models work relatively better than LR for these two workloads. However, they have the problem of over-fitting the data as more feature points are loaded into the model, and they perform unsatisfactorily for the (b) WordCount and (c) PageRank workloads. DTR splits the configuration space piece-wise, resulting in satisfactory MAPE. Nevertheless, DTR uses an average value to represent the local model instead of a linear regressor as in d-Simplexed, and therefore poorly captures the surface of the local configuration space. Unfortunately, we found that d-Simplexed sometimes does not yield the best results in the beginning phase. For example, GP and DTR outperform d-Simplexed in Fig. 9d with fewer samples (≤ 40). This is because non-linear regressors regress better than the few linear surfaces d-Simplexed has at that stage.
As the samples increase, d-Simplexed outperforms them, for two reasons. First, when new sample points are added, d-Simplexed fits them locally: their placement only affects a few adjacent simplexes, rather than the entire model (as with LR, GP, and NN). Second, d-Simplexed uses a utility function (Equation (6)) to pick the next point judiciously, which continuously improves the model as the sample size increases.
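For reference, MAPE as used throughout this evaluation (Equation (2)) is the mean of the absolute relative errors, expressed as a percentage; a minimal implementation:

```python
import numpy as np

def mape(actual, predicted):
    """Mean Absolute Percentage Error: 100 * mean(|y - yhat| / |y|)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs(actual - predicted) / np.abs(actual))

# Each prediction here is off by exactly 5 percent.
print(round(mape([100, 200, 400], [95, 210, 380]), 6))  # → 5.0
```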

Sampling Evaluation
The second set of experiments compares the performance of Delaunay Triangulation (DT) with random and grid sampling. For each sampling approach, we initially select 4 boundary points and 4 seed samples drawn with LHS, giving 8 samples from which to create the initial model.
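The initialization step above might be sketched as follows, assuming a rectangular 2-D configuration space whose 4 corners serve as the boundary points; the feature ranges here are illustrative:

```python
import numpy as np
from scipy.stats import qmc

# Illustrative 2-D configuration space: memory (GB) x vcores.
lower, upper = [1.0, 20.0], [50.0, 50.0]

# 4 boundary points: the corners of the rectangular space.
corners = np.array([[lower[0], lower[1]], [lower[0], upper[1]],
                    [upper[0], lower[1]], [upper[0], upper[1]]])

# 4 interior seeds drawn with Latin Hypercube Sampling (LHS).
sampler = qmc.LatinHypercube(d=2, seed=0)
seeds = qmc.scale(sampler.random(n=4), lower, upper)

initial = np.vstack([corners, seeds])  # 8 points for the bootstrap model
print(initial.shape)  # → (8, 2)
```

Including the corners guarantees the Delaunay mesh covers the whole configuration space, so every later query falls inside some simplex.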
For each sampling technique, the model is then iteratively improved by adding new samples one by one and evaluating the MAPE at each step. Table 3 shows the detailed evaluation of d-Simplexed against DT with random sampling (DT-RD) and DT with grid sampling (DT-GD) in terms of MAPE; d-Simplexed achieves the best MAPE of the three. Fig. 10 plots the individual steps for the three sampling techniques on the Kmeans, PageRank, Bayesian, and Join workloads. It shows that, beyond a certain point, adaptive sampling achieves the best MAPE of the three techniques for the same number of samples, meaning the adaptive approach builds a more accurate performance model. In the Bayesian workload in particular, grid sampling exceeds 60 percent MAPE even with more than 200 samples, because the performance surface of Bayesian changes rapidly in the small-value region while grid sampling keeps sampling uniformly over the global space.
The performance of random sampling is unsatisfactory, especially when the model has few sampling points: with few points in the model, there is a high probability that the next randomly selected point falls in a region with little performance change, i.e., one contributing little to predicting the critical turning points of performance. Grid sampling is better than random sampling and performs well in the beginning phase, as it distributes the chance of picking points evenly. In contrast, adaptive sampling selects points that continuously improve the accuracy of the model and avoids regions with little change in performance (by Equation (6)). This behavior is especially pronounced when adaptive sampling zooms in on a critical region of the model with a steep performance change, while grid and random sampling follow static strategies that pick points uniformly or randomly.
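As an illustration of the adaptive idea (the exact form of the paper's utility function, Equation (6), is not reproduced here), the sketch below scores each simplex by the runtime spread across its vertices weighted by its longest edge, so flat regions score low and steep, under-sampled regions score high:

```python
import numpy as np
from itertools import combinations
from scipy.spatial import Delaunay

def next_sample(points, values):
    """Pick the centroid of the most 'promising' Delaunay simplex.

    The utility here is an illustrative stand-in for the paper's
    Equation (6): runtime spread across a simplex's vertices
    (performance change) times its longest edge (sampling gap).
    """
    tri = Delaunay(points)
    best_utility, best_point = -1.0, None
    for simplex in tri.simplices:
        verts, vals = points[simplex], values[simplex]
        spread = vals.max() - vals.min()
        longest = max(np.linalg.norm(a - b)
                      for a, b in combinations(verts, 2))
        utility = spread * longest
        if utility > best_utility:
            best_utility, best_point = utility, verts.mean(axis=0)
    return best_point

pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
vals = np.array([1.0, 1.1, 5.0, 1.2])  # steep change near (1, 0)
print(next_sample(pts, vals))
```

In this toy case the chosen centroid lies in the large, sparsely sampled simplex adjacent to the steep vertex, which is exactly the zoom-in behavior described above.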

More Evaluation Results
The third set of experiments challenges d-Simplexed with more complex synthetic workloads and with more input features, as described in detail below.
Synthetic Workload. In this set of experiments, we analyze the performance of the models on a synthetic workload. The purpose is to simulate more complex or arbitrary workloads from Spark or other systems. Specifically, we create a synthetic 120 × 120 = 14,400-point ⟨f1, f2⟩ surface to demonstrate the model's flexibility under a hypothetical runtime condition in which the runtimes exhibit massive turbulence, as shown in Fig. 11. Some workloads (e.g., those with dependent parameters [1], [11]) may exhibit such behavior, and it is an interesting surface with which to challenge our model.
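A surface of this kind might be generated as follows; the exact wave shape is an assumption, chosen only to produce the steep, turbulent behavior described above, not the paper's actual surface:

```python
import numpy as np

# Hypothetical "single wave" runtime surface on a 120 x 120 grid of
# two features f1 and f2.
f1 = np.linspace(0.0, 1.0, 120)
f2 = np.linspace(0.0, 1.0, 120)
F1, F2 = np.meshgrid(f1, f2)

# A smooth base cost plus a sharp sinusoidal wave, giving the surface
# steep turbulent regions that stress global regressors (LR, GP, NN).
runtimes = 10.0 + 5.0 * F1 + 3.0 * F2 + 8.0 * np.sin(6.0 * np.pi * (F1 + F2))

print(runtimes.size)  # → 14400 ground-truth points
```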
We evaluated d-Simplexed on the synthetic surfaces against the state-of-the-art models and present the results in Fig. 12. LR performs unsatisfactorily, as its linear fit is not suited to the turbulent non-linear surface. GP and NN are much better than LR; nevertheless, they still incur large errors, especially on the more complex multi-wave workload, because they treat the surface as a whole and cannot adapt to local variability. Worse still, with additional samples, GP and NN may over-fit the samples, thus deteriorating the models. On the contrary, DTR and d-Simplexed keep discovering unknown regions and continuously improve their models as they split the configuration space piecewise, avoiding over-fitting. Thanks to the utility function in adaptive sampling and a better local regressor, d-Simplexed outperforms DTR. Given the complexity of the surfaces, we ultimately achieve less than 10 and 20 percent MAPE with 2.36 percent of the samples for the synthetic data with a single wave and multiple waves, respectively.
More Input Features. We additionally investigate the performance of d-Simplexed with 3 input features. We set up a 3-parameter space (i.e., vcore (21-50), memory (21-50 GB), and data size (11-40 GB)) with 30 × 30 × 30 configurations. Fig. 13 shows that d-Simplexed outperforms the state-of-the-art methods in this higher-dimensional setting too.
It is noteworthy that the number of required experiments grows exponentially [42] as the number of features increases. For example, with two feature parameters (e.g., CPU and memory) of 30 values each, we need 900 (= 30 × 30) ground truths to verify the model; adding one more dimension, e.g., data size with a granularity of 30, requires 27,000 (= 30 × 30 × 30) ground truths. Since running all 27,000 experiments of the workload would be a nightmare, we run only the experiments we need, i.e., 400 sample runs and 320 (0.5 percent of all) test results. It is worth noting that only 238 (about 0.80 percent of all) samples are required for DT to achieve less than 5 percent MAPE.
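The exponential growth of the full-grid ground truth can be checked with a few lines:

```python
# Full-grid ground-truth counts grow exponentially in the number of
# features (30 candidate values per feature, matching the setup above).
values_per_feature = 30
for num_features in (2, 3, 4):
    print(num_features, values_per_feature ** num_features)
# → 2 900
#   3 27000
#   4 810000
```

This is why adaptive sampling matters: the few hundred samples d-Simplexed needs stay roughly flat while the exhaustive grid explodes.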
Training Overhead. Fig. 13b shows the training overhead of all the models when running the Kmeans workload of Fig. 13a. LR and DTR require relatively little time for training and prediction due to their inherently simpler models. GP and NN need significantly more time to train their model parameters. Although d-Simplexed requires more training and prediction time than LR and DTR, it yields better accuracy and still requires less than 1 second for both training the model and making a prediction from 400 samples.

Evaluation Summary
We highlight our main findings from the experiments as follows:
- d-Simplexed achieves less than 5 percent MAPE on the Kmeans, WordCount, PageRank, and Join workloads while sampling less than 1 percent of the configuration space.
- Adaptive sampling builds a more accurate model than random or grid sampling for the same number of samples, because it zooms in on the regions with steep performance changes.
- d-Simplexed remains the most accurate method with three input features and on complex synthetic surfaces, while its training and prediction overhead stays under 1 second for 400 samples.

RELATED WORK

Big Data Performance Prediction. Prior work proposed a non-parametric prediction model based on Locally Weighted Linear Regression, built on a large set of historical Hadoop job traces. AROMA [23] depends on historical traces: it applies clustering to identify jobs with similar behavior, and then pattern matching to find the optimal resources for a job. ALOJA-ML [4] is another machine learning-based framework for predicting the execution time of Hadoop machine learning jobs; it maintains a large collection of job execution times and resource configurations. Ernest [35] constructs prediction models by executing jobs with a fraction of the input data. The method proposed in this paper relies on a small amount of historical job execution traces to derive prediction models with Delaunay Triangulation, and it is also applicable to the Hadoop and MapReduce frameworks.
Database Performance Prediction. Database systems have many mature tools and guidelines for physical design tuning (e.g., index selection, materialized view generation). However, they do not provide a performance model over varied configuration parameters, because they depend on the query optimizer's cost models, which do not capture the effects of many important parameters. A few works have addressed prediction models with varied configuration parameters in modern database systems. For example, MDN [22] (Mixture Density Network) is a black-box model that feeds historical traces of Hive queries, with features and performance, into a neural network that can predict the execution time of a new job. ParaTimer [27] predicts the progress of running parallel database queries expressed as Pig scripts, which together resemble a DAG of MapReduce jobs running in parallel.
Database Sampling Techniques. Traditional database sampling deals with drawing samples from a large dataset, mainly to estimate the cardinality of query results (or intermediate results). In contrast, our sampling strategy in Section 5 draws samples from a response surface that is never fully materialized. Another related work, iTuned [11], relies on a Gaussian Process (GP) to decide the next sampling points; while d-Simplexed shares its goal of adaptive sampling to build an accurate performance model, the application scenarios and algorithms differ.

CONCLUSIONS AND FUTURE WORK
In this paper, we studied the problem of predicting the performance of Big Data analytics platforms from a small amount of historical traces. We demonstrated how a geometric interpolation method, Delaunay Triangulation, can be used to model feature configurations for predicting the runtime of analytics jobs. We proposed a sampling strategy that selects data points judiciously for faster and more accurate model construction. Finally, we performed empirical experiments to demonstrate the superiority of our method over state-of-the-art approaches.
Exciting follow-up research can center on parameter tuning algorithms built on the prediction model proposed in this paper. In addition, we intend to extend the d-Simplexed framework to handle various large-scale machine learning platforms; consider, for instance, the problem of building an accurate performance model for TensorFlow [12] or SystemML [14].
Mohammad A. Hoque received the MSc degree in computer science and engineering, and the PhD degree from Aalto University, Finland in 2010 and 2013. He is a postdoctoral researcher with the University of Helsinki. His research interests include energy efficient mobile computing, data analysis, distributed computing, and resource-aware scheduling.
Jiaheng Lu is an associate professor with the University of Helsinki, Finland. His main research interests include big data management and database systems. He has published more than eighty journal and conference papers, as well as several books on XML, Hadoop, and NoSQL databases. His book on Hadoop was one of the top-10 best-selling books in the category of computer software in China in 2013. He has frequently served as a PC member for conferences including SIGMOD, VLDB, ICDE, EDBT, and CIKM.
Sasu Tarkoma is a professor of computer science with the University of Helsinki, and head of the Department of Computer Science. He has authored four textbooks and has published more than 160 scientific articles. His research interests include Internet technology, distributed systems, data analytics, and mobile and ubiquitous computing. He has seven granted US Patents. His research has received several Best Paper awards and mentions, for example at IEEE PerCom, ACM CCR, and ACM OSR.