Expectile Regression on Distributed Large-Scale Data

Large-scale data presents great challenges to data analysis due to limited computer storage capacity and heterogeneous data structures. In this article, we propose a distributed expectile regression model to address these challenges by designing a surrogate loss function and developing an Iterative Local Alternating Direction Method of Multipliers (IL-ADMM) algorithm to compute the proposed estimator. To achieve good performance after only a few rounds of communication, the proposed method needs to solve an M-estimation problem only on the master machine, while the worker machines only compute gradients based on their local data. Moreover, we establish the consistency and asymptotic normality of the proposed estimator, and illustrate its efficiency through numerical simulations and an empirical analysis of the superconductor data.


I. INTRODUCTION
In recent years, machine learning techniques based on large-scale data have been widely used in inspection systems, classification, and the Internet of Things, as well as in social networks and smart cities [1]-[3]. For large-scale data, the problem is not only the size of the data but also the fact that the data are usually stored in multiple locations. This means that statistical analysis involving the entire dataset must deal with data communication between different storage sites, which may slow down the computation. Hence, statistical procedures that account for both computation and data storage have attracted considerable research interest.
Several illuminating methods have been proposed for statistical inference on such large-scale data, for example Mcdonald et al. [4], Boyd et al. [5], Duchi et al. [6], Dekel et al. [7], Zhang et al. [8], and Lee et al. [9]. Among these methods, divide-and-conquer (DC) is widely used. The main idea of DC is to first compute local estimates on each machine and then average them; only one round of communication between machines is required, which reduces the communication cost significantly. However, as discussed in Jordan et al. [10], these averaging-based one-shot methods may cause efficiency loss. For example, the number of machines across which the data are divided must not be too large if the minimax convergence rate is to be achieved [11].
The above-mentioned studies on DC mainly focus on least squares linear regression. Ordinary least squares gives the same weight to negative and positive residuals and estimates the expected value of a response variable as a linear function of the covariates. Quantile regressions [12]-[14], based on the asymmetric l_1-norm, improve on this by giving different weights to the positive and negative parts of the residual. As an alternative to quantile regression, Newey and Powell [15] criticized the use of regression quantiles in linear models and proposed the following asymmetric quadratic loss function based on the l_2-norm:

φ_τ(λ) = τ λ², λ ≥ 0; φ_τ(λ) = (1 − τ) λ², λ < 0, (1)

with the τ-expectile of a random variable ξ defined as the minimizer μ_τ(ξ) = arg min_θ E[φ_τ(ξ − θ) − φ_τ(ξ)]. Since the classical article of Newey and Powell [15], there have been numerous and extensive studies on expectiles and the related statistical inference [16]-[22]. Moreover, several popular expectile regression packages have been developed, such as expectreg [23] and ER-Boost [21], but these packages can only handle medium-sized samples.
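As a concrete illustration (our own sketch, not part of the original development), the loss (1) and the τ-expectile it defines can be computed in a few lines of Python; the sample expectile here solves the first-order condition τ E(ξ − θ)_+ = (1 − τ) E(θ − ξ)_+ by bisection, and the function names are ours:

```python
import numpy as np

def expectile_loss(lam, tau):
    """Asymmetric quadratic loss (1): tau*lam^2 for lam >= 0, (1-tau)*lam^2 otherwise."""
    lam = np.asarray(lam, dtype=float)
    w = np.where(lam >= 0, tau, 1.0 - tau)
    return w * lam ** 2

def sample_expectile(xs, tau):
    """Sample tau-expectile: the minimizer of the empirical expected loss,
    found by bisection on the first-order condition
    tau * mean((x - theta)_+) = (1 - tau) * mean((theta - x)_+)."""
    xs = np.asarray(xs, dtype=float)
    lo, hi = xs.min(), xs.max()
    for _ in range(200):
        theta = 0.5 * (lo + hi)
        g = (tau * np.clip(xs - theta, 0, None).mean()
             - (1 - tau) * np.clip(theta - xs, 0, None).mean())
        if g > 0:          # theta is still too small
            lo = theta
        else:
            hi = theta
    return theta
```

At τ = 0.5 the loss is symmetric and the sample expectile reduces to the sample mean, which provides a quick sanity check.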
The first advantage of the asymmetric least squares method over quantiles lies in the convenience of computing sample expectiles, which requires only gradient-based or iteratively re-weighted least squares methods. The second advantage is that statistical inference for expectiles is more efficient than for quantiles, since expectiles fully exploit the distance and location information in the data, whereas empirical quantiles use only the information on whether observations lie below or above the predictor [24]. In addition, unlike sample quantiles, sample expectiles form a class of smooth curves as functions of the level τ, and are considerably more robust for heavy-tailed distributions [25]-[26].
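The iteratively re-weighted least squares route mentioned above can be sketched as follows (an illustrative reconstruction for a single sample rather than a full regression fit; the function name is ours). Each iteration re-weights observations above the current value by τ and those below by 1 − τ, then takes the weighted mean:

```python
import numpy as np

def expectile_irls(xs, tau, tol=1e-10, max_iter=100):
    """Sample tau-expectile by iteratively re-weighted least squares:
    theta <- sum(w_i x_i) / sum(w_i), with w_i = tau if x_i > theta else 1 - tau."""
    xs = np.asarray(xs, dtype=float)
    theta = xs.mean()  # start from the 0.5-expectile, i.e. the mean
    for _ in range(max_iter):
        w = np.where(xs > theta, tau, 1.0 - tau)
        theta_new = np.sum(w * xs) / np.sum(w)
        if abs(theta_new - theta) < tol:
            break
        theta = theta_new
    return theta
```

Because the weights change only when an observation crosses the current value, the iteration typically stabilizes after very few steps.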
Distributed large-scale data pose several interesting challenges. Jordan et al. [10] proposed the communication-efficient surrogate likelihood (CSL) procedure to solve distributed statistical learning problems. In particular, CSL only needs to exchange the gradients of the local data, so it can effectively reduce the communication cost for large data. Moreover, CSL supports statistical inference in the same way as the global likelihood function based on all samples, and the CSL estimator can achieve the same convergence rate as the oracle method based on the whole data. Note that the theoretical results in Jordan et al. [10] require an important assumption that the Hessian matrix of the population risk function is invertible. In practice, this condition may often fail to hold, which complicates the theoretical analysis and limits the applicability of CSL. Here, we relax the Hessian assumption of Jordan et al. [10] and establish the asymptotic properties of the proposed estimator via M-estimation theory and the asymptotics of minimizers of convex processes, which enhances both the theoretical and the practical value of the method.
Another challenge arises from the numerical implementation of the proposed estimator. To this end, we adopt the Alternating Direction Method of Multipliers (ADMM) algorithm [5], [27]-[28], which is well suited to distributed convex optimization and large-scale statistical inference problems. Combining the idea of CSL with the ADMM algorithm, we develop an Iterative Local Alternating Direction Method of Multipliers (IL-ADMM) algorithm for the distributed estimator in the linear expectile regression model. Compared to existing approaches, our proposed method can not only solve expectile regression with distributed data directly, but also effectively reduce the storage and transmission costs of distributed computing. Most importantly, its efficiency is comparable to that of the oracle method.
The layout of this article is as follows. In Section 2, we develop a method to fit the linear expectile model for distributed large-scale data, and we present the IL-ADMM algorithm for computing the proposed estimator in Section 3. We study the consistency and asymptotic normality of the estimator in Section 4. We then carry out numerical simulations to evaluate the finite-sample performance of the proposed estimator in Section 5, and illustrate a real data example in Section 6. Finally, we give some concluding remarks; the proofs are collected in the Appendix.

II. EXPECTILE REGRESSION WITH DISTRIBUTED DATA
Let y be a response variable and x an observable p × 1 vector of covariates. Here, the sample size N of the observed data {x_i, y_i}_{i=1}^N is so large that the data cannot fit in the memory of a single machine. We assume that the data are stored in a distributed way across K machines, with each machine storing a sub-sample of n = N/K observations. Our main objective is to estimate the p-dimensional expectile coefficient vector β_0(τ) for a given τ ∈ (0, 1) in the following latent linear expectile regression model:

y_i = x_i^T β_0(τ) + ε_i(τ), i = 1, . . . , N, (2)

where ε_i(τ) is a random error whose τth conditional expectile given x_i equals 0. In what follows, we drop τ from the parameter and the error term for notational simplicity when there is no ambiguity.
Define the local and global loss functions as

L_k(β) = (1/n) Σ_{i∈[n]} φ_τ(y_ki − x_ki^T β), k = 1, . . . , K, and L_N(β) = K^{-1} Σ_{k=1}^K L_k(β), (3)

where φ_τ(·) is defined in (1). Note that L_k(β) is the loss function evaluated at β using the local data stored on the kth machine M_k. Without loss of generality, we call the first machine, which stores the local dataset {x_1i, y_1i}_{i∈[n]}, the master machine, and call the remaining machines worker machines.
Expectile regression solves the following optimization problem:

β̂ = arg min_{β∈B} L_N(β), (4)

where B denotes a compact subset of R^p that contains β_0.
Here the global empirical loss minimizer β̂ is the best estimator of β, and Newey and Powell [15] established the asymptotic behavior of this estimator. The above global model can achieve the optimal statistical error; however, the expensive communication makes it impractical. Inspired by the CSL methodology, we consider a surrogate loss function that approximates the global loss function:

L̃(β) = L_1(β) − ⟨∇L_1(β̃_0), β⟩ + ⟨∇L_N(β̃_0), β⟩, (5)

where β̃_0 is any initial estimator of β, ⟨·, ·⟩ denotes the inner product, and ∇ denotes the partial derivative with respect to β. For example, one may take β̃_0 = arg min_β L_1(β). Note that the construction of the surrogate loss function uses only the local data on the master machine and the gradient vectors computed from the data on the worker machines, so the method greatly reduces the cost of data transmission and communication. The master machine first computes an initial estimator β̃_0, which is then broadcast to all other machines; each worker machine computes the gradient ∇L_k(β̃_0) and sends it back to the master machine. This constitutes one round of communication.
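The construction of the surrogate gradient can be sketched as follows (an illustrative reduction to one communication round; the function names and the flat NumPy interface are ours, not the paper's). Each worker computes its local gradient at the broadcast point, and the master combines its own local gradient with the average of all local gradients:

```python
import numpy as np

def expectile_grad(X, y, beta, tau):
    """Gradient of the local expectile loss L_k at beta:
    -(2/n) X^T [w * (y - X beta)], with w_i = tau if the residual >= 0 else 1 - tau."""
    r = y - X @ beta
    w = np.where(r >= 0, tau, 1.0 - tau)
    return -2.0 * X.T @ (w * r) / len(y)

def surrogate_grad(X1, y1, beta, beta0, grads_at_beta0, tau):
    """Gradient of the surrogate loss (5):
    grad L_1(beta) - grad L_1(beta0) + grad L_N(beta0),
    where grad L_N(beta0) is the average of the K local gradients sent to the master."""
    grad_N = np.mean(grads_at_beta0, axis=0)
    return (expectile_grad(X1, y1, beta, tau)
            - expectile_grad(X1, y1, beta0, tau)
            + grad_N)
```

At β = β̃_0 the two local-gradient terms cancel and the surrogate gradient equals the global gradient ∇L_N(β̃_0), which is exactly the first-order correction the CSL construction is designed to achieve.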
In the distributed learning setting, the proposed estimator minimizes the surrogate loss:

β̃ = arg min_{β∈B} L̃(β). (6)

III. IL-ADMM ALGORITHM
We now present an IL-ADMM algorithm to solve for the proposed estimator in (6). The key principle of the IL-ADMM algorithm is to construct a surrogate loss function as in (5), and then transfer the optimization problem from the original objective function (3) to the surrogate function.
To construct the surrogate loss function at iteration m, we replace β̃_0 in (5) with the current iterate β^(m) and rewrite the surrogate loss function as

L̃^(m)(β) = L_1(β) − ⟨∇L_1(β^(m)) − ∇L_N(β^(m)), β⟩, (7)

and design the optimization problem

β^(m+1) = arg min_{β∈B} L̃^(m)(β). (8)

Note that the Hessian matrix ∇²L_1(β) is not necessarily invertible at β^(m); therefore, the solution of (8) cannot be obtained by the Newton-Raphson algorithm as in [10].
To address this problem, we consider the following ADMM algorithm as in [2].
By convexity, we can transform optimization problem (8) into the equivalent constrained problem

min_{β, z} (1/n) Σ_{i=1}^n φ_τ(z_i) − ⟨∇L_1(β^(m)) − ∇L_N(β^(m)), β⟩ subject to z_i = y_1i − x_1i^T β, i = 1, . . . , n. (9)

Define the augmented Lagrangian function of (9) as

L_σ(β, z, γ) = (1/n) Σ_{i=1}^n φ_τ(z_i) − ⟨∇L_1(β^(m)) − ∇L_N(β^(m)), β⟩ + ⟨γ, z − y_1 + X_1 β⟩ + (σ/2) ∥z − y_1 + X_1 β∥_2², (10)

where σ > 0 is a fixed constant, γ ∈ R^n is the Lagrange multiplier, X_1 = (x_11, . . . , x_1n)^T, and y_1 = (y_11, . . . , y_1n)^T.
The iterations of the local ADMM algorithm, with (β^(m), z^(m), γ^(m)) denoting the mth iterate for m ≥ 0, are:

β Step: β^(m+1) = arg min_β L_σ(β, z^(m), γ^(m));
z Step: z^(m+1) = arg min_z L_σ(β^(m+1), z, γ^(m));
γ Step: γ^(m+1) = γ^(m) + σ(z^(m+1) − y_1 + X_1 β^(m+1)). (11)

Note that the β Step is a quadratic minimization and can be reformulated as the linear system

σ X_1^T X_1 β^(m+1) = σ X_1^T (y_1 − z^(m) − γ^(m)/σ) + ∇L_1(β^(m)) − ∇L_N(β^(m)). (12)

The update of z^(m+1) can be carried out component-wise.
In order to solve the resulting univariate minimization problems, we define the proximal operator Prox_{φ_τ} of φ_τ as

Prox_{φ_τ, α}(u) = arg min_v { φ_τ(v) + (α/2)(v − u)² }, (13)

for fixed α > 0; the solution of (13) can be found in Remark 1. Applying the proximal mapping formula to the z Step, we obtain, for 1 ≤ i ≤ n,

z_i^(m+1) = Prox_{φ_τ, nσ}( y_1i − x_1i^T β^(m+1) − γ_i^(m)/σ ). (14)

Therefore, for fixed α > 0, we have

Prox_{φ_τ, α}(u) = αu / (2τ + α) if u ≥ 0, and αu / (2(1 − τ) + α) if u < 0. (15)

Now, we summarize the above iterative procedure on the master machine in Algorithm 1:

Algorithm 1 (IL-ADMM, outer iteration t):
1: Transmit the current iterate β^(t) to the worker machines;
2: Compute the local gradient ∇L_k(β^(t)) at machine M_k, k = 1, 2, . . . , K;
3: Transmit the local gradient ∇L_k(β^(t)) to the master machine M_1;
4: Update β^(t+1) on the master machine by running the local ADMM iterations (11).
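The inner ADMM loop on the master machine can be sketched in Python as follows (our own illustrative reconstruction under the splitting z = y_1 − X_1 β; the function names, default parameters, and the assumption that X_1^T X_1 is invertible are ours). The closed-form proximal mapping of φ_τ follows from setting the derivative of the prox objective to zero on each sign branch:

```python
import numpy as np

def prox_expectile(u, tau, alpha):
    """Proximal mapping of phi_tau: argmin_v phi_tau(v) + (alpha/2)(v - u)^2.
    Closed form: alpha*u/(2*tau + alpha) if u >= 0, else alpha*u/(2*(1-tau) + alpha)."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0,
                    alpha * u / (2 * tau + alpha),
                    alpha * u / (2 * (1 - tau) + alpha))

def il_admm_inner(X1, y1, lin, tau, sigma=1.0, n_inner=200):
    """Solve min_beta (1/n) sum_i phi_tau(y1_i - x1_i^T beta) + <lin, beta> by ADMM,
    splitting z = y1 - X1 beta. Here `lin` stands for the surrogate's linear
    correction grad L_N(beta^(m)) - grad L_1(beta^(m)) (zero for a purely local fit)."""
    n, p = X1.shape
    beta = np.zeros(p)
    z = y1.copy()
    gamma = np.zeros(n)
    A = sigma * X1.T @ X1  # fixed matrix for the beta-step linear system
    for _ in range(n_inner):
        # beta-step: quadratic minimization -> linear solve
        rhs = sigma * X1.T @ (y1 - z - gamma / sigma) - lin
        beta = np.linalg.solve(A, rhs)
        # z-step: component-wise proximal mapping of phi_tau (scaled by 1/n)
        v = y1 - X1 @ beta - gamma / sigma
        z = prox_expectile(v, tau, n * sigma)
        # dual update
        gamma = gamma + sigma * (z - y1 + X1 @ beta)
    return beta
```

With noiseless data (y_1 = X_1 β*) and no linear correction, the residuals can be driven to zero for any τ, so the iteration should recover β*; this serves as a basic correctness check of the three-step loop.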

IV. ASYMPTOTIC ANALYSIS
To present the asymptotic results, we first introduce some notation. Let ϕ_τ(λ) = 2|τ − I(λ < 0)|λ denote the first-order derivative of φ_τ(λ), evaluated in particular at the errors ε_1i. Note that ∥·∥_2 refers to the l_2-norm in Euclidean space, →d represents convergence in distribution, and →P represents convergence in probability.
Moreover, we present some important conditions. (C4) The errors ε_ki, i ∈ [n], k ∈ [K], are independent and identically distributed, and satisfy μ_τ(ε_ki | x_ki) = 0 with E(ε²_ki | x_ki) < ∞. (C5) There exists a positive definite matrix Σ such that lim_{n→∞} n^{-1} Σ_{i=1}^n E(x_ki x_ki^T) = Σ. Condition (C1) is not very strong, but it is commonly used in extremum estimation; Condition (C2) guarantees the identifiability of the problem; Conditions (C3)-(C5) are standard conditions for expectile regression, as in Newey and Powell [15]. Condition (C5) is a technical requirement also used in Xu et al. [29].
The asymptotic properties of the proposed estimator β are summarized in the following theorems, with the proofs given in the Appendix.
Therefore, the limiting distribution of β̃ is the same as that of the global expectile regression estimator β̂ in (4) obtained by analyzing all the data together [15].

V. NUMERICAL ANALYSIS
In this section, we examine the finite-sample performance of our proposed method and compare it with existing methods via numerical studies. The methods compared are: • Proposed: the proposed IL-ADMM algorithm.
• Oracle: the master machine collects all data from the different machines and implements the ADMM algorithm with 0 as the initial estimator to solve the global loss minimization problem.
• OneShot: each machine conducts the ADMM algorithm with 0 as the initial value to compute a local estimator from the portion of the data stored locally, and the master machine averages the local estimators to produce the aggregate estimator β_OneShot = K^{-1} Σ_{k=1}^K β̂_k.

Recall the linear expectile regression model (2), where x_i is generated from a multivariate normal distribution with mean zero and covariance matrix with (i, j)-element Σ_ij = 0.5^|i−j|. We consider different distributions for the random error ε_i: (I) homogeneous errors ε_i ∼ N(0, 1) and ε_i ∼ t(3); and (II) a heterogeneous error whose scale depends on x_i2, the second component of x_i. The scalar response y_i is generated from (2).
We set the sample size N = 10000 and the covariate dimension p = 80, and randomly partition the data across K = 20, 40, and 80 machines, respectively. The local sample size n on each machine is therefore 500, 250, and 125, respectively.
Choosing τ = 0.3, 0.5, and 0.7, we compare the performance of the three methods at the different expectile levels.
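The simulation design above can be sketched as follows (an illustrative reconstruction; the function name is ours, and only the homogeneous ε_i ∼ N(0, 1) case is shown since the heterogeneous specification is design-specific):

```python
import numpy as np

def simulate_expectile_data(N, p, beta0, rho=0.5, seed=0):
    """Generate (X, y) from model (2): rows of X are N(0, Sigma) with
    Sigma_ij = rho^|i-j|, and y = X beta0 + eps with homogeneous N(0,1) errors."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(Sigma)        # Sigma = L L^T
    X = rng.normal(size=(N, p)) @ L.T    # rows ~ N(0, Sigma)
    y = X @ beta0 + rng.normal(size=N)   # homogeneous error case (I)
    return X, y
```

Adjacent covariates then have correlation 0.5, which matches the AR(1)-type covariance structure of the design.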
In each configuration, the results presented below are obtained from 100 independently generated datasets. Tables 1 and 2 summarize the average (AER) and the standard deviation (SD) of the estimation error, defined as ∥β̂ − β_0∥²_2. Figures 1 and 2 show boxplots of the estimation errors of the three methods for different numbers of machines under homogeneous and heterogeneous errors, respectively. Figures 3 and 4 illustrate the estimation error of the proposed IL-ADMM algorithm over different rounds of communication, together with the estimation errors of the Oracle and OneShot methods. Summing up all the cases considered here, we make the following observations:
• The Oracle approach delivers the best estimation and statistical performance. Our proposed approach obtains an estimator that competes with the Oracle approach, whereas the OneShot approach is comparatively inferior. Moreover, Figures 1 and 2 show that the estimation error of the OneShot method increases significantly when dealing with heterogeneous rather than homogeneous data, while the estimation error of the proposed method varies little and remains close to that of the Oracle method.
• Since the total sample size N is fixed, the local sample size n decreases as the number of machines K increases from 20 to 40 to 80. The AER and SD of the Oracle method remain fixed, while the OneShot and proposed estimators show an increase in AER and SD as the number of machines grows. Nevertheless, our proposed method still competes with the Oracle method.
• The number of communication rounds is fixed for the Oracle and OneShot methods, so their estimation errors appear as horizontal lines. Even when the number of machines grows, within a few rounds of communication the estimation error of our proposed method decreases until it is truly competitive with the Oracle method.
• Note that the OneShot method requires solving K = N/n optimization problems, which can be computationally expensive. In contrast, our method only needs to compute the gradients of the local data and solve one optimization problem on the master machine in each round of communication. As in Jordan et al. [10] and Zhang et al. [8], the communication complexity of our method is O(pK), and the computation complexity of the OneShot method is O(n log(nK)).

VI. REAL DATA ANALYSIS
To illustrate the effectiveness of our proposed approach, we also apply the three methods to a large-scale real dataset. The dataset comes from the Superconducting Material Database maintained by Japan's National Institute for Materials Science (NIMS) and is also available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data). The goal is to predict the critical temperature from features extracted from each superconductor's chemical formula. Readers may refer to Hamidieh [30] for a detailed description of this dataset. Li et al. [31] proposed a distributed screening framework for this dataset.
The dataset contains 81 covariates extracted from 21263 superconductors, along with the associated critical temperatures. Since the covariates are linked to the critical temperature through a non-linear relationship, we take the log critical temperature, denoted by Y, as the response. Given the non-Gaussian empirical distribution of Y, the expectile regression approach may be more attractive for analyzing these data.
To evaluate the performance of the Proposed method relative to the Oracle and OneShot methods, we use cross-validation for the three methods at the three expectile levels 0.3, 0.5, and 0.7. We randomly partition the sample 100 times, each time selecting 16000 samples as the training set D_train and keeping the remaining 5263 samples as the test set D_test. We distribute the training data across K = 20, 40, and 80 machines, giving local sample sizes n of 800, 400, and 200, respectively. Following Wang et al. [14], the prediction error is defined as (1/5263) Σ_{i∈D_test} φ_τ(y_i − ŷ_i). We use the three methods to compute the average (APE) and the standard deviation (SD) of the prediction errors. The results are summarized in Table 3. Boxplots of the prediction error versus the number of machines for the three methods are plotted in Figure 5. The prediction errors versus the rounds of communication for the superconductor data under the three machine configurations and three methods are shown in Figure 6.
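The prediction error above can be computed in a few lines (an illustrative sketch; the helper name is ours):

```python
import numpy as np

def expectile_pred_error(y_true, y_pred, tau):
    """Average asymmetric quadratic loss phi_tau(y_i - yhat_i) over a test set."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    w = np.where(r >= 0, tau, 1.0 - tau)
    return float(np.mean(w * r ** 2))
```

At τ = 0.5 this reduces to half the mean squared prediction error, so the three expectile levels can be compared on a common scale.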
Table 3 and Figure 5 show that our method is significantly better than the OneShot method. In both the mean and the standard deviation of the prediction error, our method is very close to the Oracle method. As the number of machines increases, our results change little, while the error of the OneShot method increases significantly. From Figure 6, we can see that when the number of machines is 20 or 40, our method converges rapidly and competes with the Oracle method within 50 rounds of communication. When the number of machines reaches 80, our method is still comparable to the Oracle method after only about 100 rounds of communication. All of this shows that our method is efficient and robust when dealing with large-scale real data.

VII. DISCUSSION
For large-scale datasets, partitioning the data across multiple machines is often the only practical way to overcome the limitations of storage and computer memory. In this article, we extend the CSL method of Jordan et al. [10] to the distributed expectile regression model. Since the target loss function does not satisfy the smoothness assumption of the standard CSL framework, neither the theoretical results nor the computational method can be applied directly. We propose an efficient approach that uses a surrogate loss function to fit the expectile regression model for distributed large-scale data, and we develop an IL-ADMM algorithm to compute the proposed estimator.
The estimation error improves within a few rounds of communication. We establish the asymptotic properties of the proposed estimator. The theoretical results show that, with appropriate initial values of the regression parameters, our proposed method matches the ideal Oracle method. Extensive simulation studies and a real data example demonstrate that our approach performs better than the OneShot averaging approach, and that it remains comparable to the Oracle method even under heterogeneous errors.
In practical implementations of the ADMM algorithm, strategies much more flexible than the z Step can be employed. Future work also includes further developments of the proposed IL-ADMM algorithm. For example, to improve communication efficiency, one could consider other versions of the ADMM algorithm, as in [32]-[33]. Here we assumed that the dataset is split across K machines, that is, the sample size N is large. When the dataset is instead of ultra-high dimensionality, it would be interesting to investigate how to split the data along the p direction and how to implement the corresponding ADMM algorithm, as in [34]-[35].
Moreover, expectile regression is applicable in engineering, especially in communication networks. For example, the base stations of 5G networks are smaller, picocells and femtocells are widely used, and micro data centers near the access network edge are a popular trend. Large-scale data distributed across micro data centers are often unbalanced, and the estimation targets based on such data are typically heavy-tailed, so expectile regression can be more robust than quantile regression in this setting. Network performance prediction is another example. Although deep learning and statistical algorithms are used to predict network performance, the error range varies dramatically due to the heterogeneity of networks. Here, expectile regression, as a generalized form of mean regression, can be considered a more robust alternative.

APPENDIX: PROOFS OF THE THEOREMS
Proof of Theorem 1: Based on the surrogate loss function (5), direct computation yields the estimating function ψ_β = x_1i ϕ_τ(y_1i − x_1i^T β), where P_n denotes the empirical distribution and ϕ_τ(λ) is the first-order derivative of φ_τ(λ). Define Ψ(β) = Eψ_β. According to condition (C4), we have Ψ(β_0) = E[x_1i ϕ_τ(y_1i − x_1i^T β_0)] = 0, which means that β_0 is a stationary point of Ψ(β). By condition (C1) and Theorem 5.9 of Van der Vaart [36], we only need to verify two conditions: (a) uniform convergence of the empirical criterion over B, and (b) separation of the minimizer, where distances are measured in the Euclidean metric and the separation constant is any positive number. The first condition (a) is guaranteed by Theorem 2 of Jennrich [37], so it suffices to verify the three conditions of that theorem. First, the condition on the parameter space B is satisfied by condition (C1). Second, ψ_β is continuous with probability one at each β by condition (C4). Third, we check the domination condition of Theorem 2 of Jennrich [37]. Under conditions (C1) and (C4), for k = 1, . . . , K, the relevant quantities are bounded by a positive constant C, where ∥·∥_F is the Frobenius norm of a matrix. Combining the properties of the trace of a matrix with conditions (C1) and (C3), we get E∥x_ki x_ki^T∥_F = E∥x_ki∥²_2 < ∞; together with conditions (C3) and (C4), this yields a function h with ∥ψ_β∥ ≤ h(x_1i) and Eh(x_1i) < ∞. In addition, inequality (A.16) implies ∥ψ_β∥_2 ≤ H(x) with EH(x) < ∞. This verifies the domination condition of Theorem 2 of Jennrich [37]. Hence, the first condition (a) is satisfied.
Proof of Theorem 2: Let V be a symmetric positive definite matrix, U a random vector, and A_n(t) a convex objective function with minimum point a_n. Knight [38] verified that if A_n(t) converges in distribution to A(t) = −t^T U + (1/2) t^T V t for each t, then a_n →d V^{-1} U. Given the above consistency result, we now derive (A.17) and then the asymptotic normality of β̃. Define δ_n = √n(β̃ − β_0), so that β̃ = β_0 + δ_n/√n. A direct expansion of the surrogate objective involves the score terms x_ki ϕ_τ(ε_ki) and the design terms x_ki x_ki^T. Denote Z = n^{-1/2} Σ_{i=1}^n Z_i. Since Z is a sum of independent and identically distributed zero-mean random vectors, condition (C5) and the Lindeberg-Feller central limit theorem yield the limiting normal distribution of Z, as stated in (A.21). Then, based on the fact that N = nK together with (A.21), as N → ∞ the asymptotic distribution of β̃ follows, which completes the proof of Theorem 2.