A Survey of Distributed and Parallel Extreme Learning Machine for Big Data

Extreme learning machine (ELM) is characterized by good generalization performance, fast training speed and little human intervention. With the explosion of data generated on the Internet, learning algorithms in a single-machine environment cannot meet the huge memory consumption of matrix computation, so the implementation of distributed ELM algorithms has gradually become a research focus. In view of the research significance and implementation value of distributed ELM, this paper first introduces the research background of ELM and improved ELM. Second, it elaborates the implementation methods of distributed ELM along two directions: ensemble and matrix operation. Finally, it summarizes the development status of distributed ELM and discusses future research directions.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Kathiravan Srinivasan.
With the rapid development of the Internet and the Internet of Things, the amount of data has increased rapidly in recent years, and we have entered the big data era. Volume, velocity, variety and value are the main features of big data [1]. As big data continues to develop, the complexity of data keeps growing. In 2017, IBM added veracity to the 4V features of big data to emphasize that meaningful data must be true and accurate. Further features have since been added: vitality, emphasizing the vitality of the whole data system; visualization, emphasizing the explicit display of data; and validity, emphasizing the validity of data collection and application. The knowledge hidden in big data is valuable for decision-making in all fields. Machine learning has become one of the most popular methods of knowledge discovery and a research hotspot in the field of big data. How to conduct data mining and machine learning on large amounts of data has become an important issue in the era of big data [2]-[4]. As the number of hidden layers increases, traditional training methods for neural networks suffer from problems such as slow convergence and long training time. To avoid these problems, neural networks with random weights were proposed, in which the weights between the input layer and the hidden layer are randomly selected and the weights between the hidden layer and the output layer are obtained analytically [5]-[7]. Extreme learning machine (ELM) is one kind of neural network with random weights, used to train single-hidden layer feedforward networks (SLFNs). Compared with traditional SLFNs, ELM has the following remarkable characteristics. First, ELM is characterized by its fast learning speed.
Second, traditional SLFNs may face problems such as local minima, inappropriate learning rates and overfitting, and need optimization methods to avoid them. ELM is simpler than traditional SLFNs and does not involve these problems. For example, the study of Radial Basis Function (RBF) neural networks in [8] investigated the implicit assumptions made when employing a feed-forward layered network model to analyze complex data, and ELM can be extended to the RBF network case, which allows the centers and impact widths of RBF kernels to be randomly generated and the output weights to be calculated analytically instead of iteratively tuned. Finally, compared with the back propagation algorithm in SLFNs, ELM has better generalization performance [9]. In the original ELM, the nodes in the hidden layer are generated randomly, independently of the training data. However, recent research has shown that setting the number of hidden layer nodes arbitrarily may lead to underfitting or overfitting of ELM. The statistical characteristics of data sets and the way of generating random parameters have a significant impact on the performance of ELM [10]-[12]. Recently, there have been many studies of ELM based on parameter optimization, such as ELM based on Principal Component Analysis (PCA-ELM) [13], ELM based on Particle Swarm Optimization (PSO-ELM) [14] and ELM based on Genetic Algorithm (GA-ELM) [15]. Compared with traditional neural network learning methods, ELM learns significantly more efficiently. ELM thus provides good generalization performance [16] together with extremely fast learning speed, and has been widely used in text classification [17], image recognition [18], sensory recognition (visual, taste, and smell) [19]-[21], industry [22], [23] and bioinformatics [24], [25].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In recent years, many variants of ELM have been developed to improve its efficiency, such as Multilayer Probability ELM (MP-ELM) [26], ELM combined with sparse representation classification (ELM-SRC) [27] and Residual compensation ELM (RC-ELM) [28]. However, ELM is a memory-resident algorithm; that is, all pre-processed data must be loaded into computer memory in advance. As the scale of training data keeps growing, the memory limits of a traditional serial or single-machine environment and the huge amount of ELM matrix computation prevent traditional ELM from remaining efficient. Therefore, for large-scale data sets, parallel algorithms and the distributed processing of ELM will become one of the key points of future research.
ELM has been developed for nearly 15 years; a great deal of research has been carried out on it at home and abroad, and remarkable achievements have been made. Huang et al. [16] mainly introduced the variants of ELM, such as the batch learning mode of ELM, fully complex ELM, online sequential ELM and incremental ELM. They also proposed directions for the future study of ELM, such as mining hidden nodes and evaluating the generalization performance of ELM. Ding et al. [29] reviewed the research status of ELM in terms of algorithm, theory and application, including the models and concrete applications of ELM. They pointed out future research directions, such as further improving the generalization performance of the ELM model structure and algorithm, combining online learning and genetic algorithms with ELM, and making better use of ELM to deal with all kinds of round and multiple classification problems. Huang et al. [30] summarized ELM in terms of interpolation theory, universal approximation ability and generalization ability, introduced its applications in biomedical engineering, computer vision, system identification and so on, and finally gave an outlook on the future of ELM. Cao et al. [31] summarized the research progress of ELM in recent years and its applications in big data processing such as graphics processing, video processing and medical signal processing. They concluded that, due to the randomness of the network parameters and the untuned learning strategy, the computational complexity is greatly reduced, which benefits the application of ELM and its variants in intelligent high-dimensional big data processing. They raised three open problems of ELM in high-dimensional big data processing: 1) How to balance the performance and processing time? 2) How to select the optimal number of hidden neurons for a specific application?
3) Most of the research results are realized by computer simulation in the laboratory, whereas real-world devices for different applications always face various challenges. In recent years, there have also been many studies on ELM variants that solve different problems, such as discriminative ELM [32], cross-domain ELM [33], evolutionary cost-sensitive ELM [34], and adaptive ELM [35].
The above surveys introduced ELM from three aspects: its variants, optimization methods and applications in different fields. However, none of them gives a detailed elaboration and comparison of the distributed methods of ELM. In view of the theoretical and practical significance of distributed ELM, this paper summarizes the ELM algorithms on distributed platforms for big data, and divides distributed ELM algorithms into the ensemble method and the matrix decomposition method according to their implementation modes. Section II summarizes the research status of ELM, OS-ELM and ELM with kernel. Section III describes two distributed ELM implementation methods: ensemble and matrix operation. Section IV analyzes the future research trend of distributed ELM and makes a summary.

II. BACKGROUND
In this section, we describe the background, including ELM, OS-ELM and ELM with kernel.

A. ELM
Research on distributed ELM mainly focuses on ELM and the online sequential ELM (OS-ELM). Therefore, this section mainly introduces the research status of classic ELM and OS-ELM.

1) CLASSIC ELM
ELM [9], [16], [36] was initially proposed for single hidden layer feedforward neural networks (SLFNs) [5], [6], and then extended to generalized SLFNs, where the hidden layer does not need to use the same neurons. ELM first randomly assigns the input weights and hidden layer biases, and then determines the output weights of the SLFN. Compared with classical neural network algorithms, ELM has a great advantage in efficiency and generalization performance, and it has been applied to a wide range of problems in different fields.
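Consider $N$ arbitrary distinct samples $(x_j, t_j)$, where $x_j = [x_{j1}, x_{j2}, \cdots, x_{jn}]^T \in R^n$ and $t_j = [t_{j1}, t_{j2}, \cdots, t_{jm}]^T \in R^m$. The standard SLFN with $L$ hidden nodes and activation function $g(x)$ is modeled as

$$\sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i) = o_j, \quad j = 1, \cdots, N, \tag{1}$$

where $w_i = [w_{i1}, w_{i2}, \cdots, w_{in}]^T$ is the weight vector connecting the $i$th hidden node with the input nodes; $\beta_i = [\beta_{i1}, \beta_{i2}, \cdots, \beta_{im}]^T$ is the weight vector connecting the $i$th hidden node with the output nodes; $b_i$ is the threshold of the $i$th hidden node; and $o_j = [o_{j1}, o_{j2}, \cdots, o_{jm}]^T$ is the $j$th output vector of the SLFN. That a standard SLFN with $L$ hidden nodes and activation function $g(x)$ can approximate the $N$ samples with zero error means that $\sum_{j=1}^{N} \|o_j - t_j\| = 0$ and, according to Equation 1, Equation 2 follows:

$$\sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, \cdots, N. \tag{2}$$

The above equation can be succinctly expressed as

$$H\beta = T,$$

where $\beta = [\beta_1, \cdots, \beta_L]^T$, $T = [t_1, \cdots, t_N]^T$, and $H$ is

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L}.$$

$H$ is called the hidden layer output matrix of the neural network, and the $i$th column of $H$ is the output of the $i$th hidden node with respect to the inputs $x_1, x_2, \cdots, x_N$. The smallest norm least-squares solution of the above linear system is

$$\hat{\beta} = H^{\dagger} T,$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of matrix $H$. The output function of ELM can be modeled as

$$f(x) = h(x)\beta = h(x) H^{\dagger} T,$$

where $h(x)$ is the feature mapping function.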
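As a minimal sketch of the training procedure described above (not code from the surveyed papers), the following NumPy example draws random input weights and biases, forms the hidden layer output matrix H, and solves for the output weights with the Moore-Penrose pseudoinverse. The tanh activation, the toy regression data, and the names `elm_train` and `elm_predict` are illustrative assumptions.

```python
import numpy as np

def elm_train(X, T, L, rng):
    """Fit a basic ELM and return (W, b, beta)."""
    W = rng.standard_normal((X.shape[1], L))  # random input weights, one column per hidden node
    b = rng.standard_normal(L)                # random hidden biases
    H = np.tanh(X @ W + b)                    # hidden layer output matrix, shape (N, L)
    beta = np.linalg.pinv(H) @ T              # minimum-norm least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Output function f(x) = h(x) beta, evaluated for each row of X."""
    return np.tanh(X @ W + b) @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
T = np.sin(3.0 * X[:, :1]) + X[:, 1:] ** 2    # toy regression target, shape (200, 1)
W, b, beta = elm_train(X, T, L=40, rng=rng)
train_mse = float(np.mean((elm_predict(X, W, b, beta) - T) ** 2))
print(f"training MSE: {train_mse:.2e}")
```

Only `beta` is learned; `W` and `b` stay at their random values, which is why training reduces to a single linear solve.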

2) OS-ELM
Liang et al. [37] developed an online sequential learning algorithm for single hidden layer feedforward networks (SLFNs) with additive or radial basis function (RBF) hidden nodes in a unified framework. This algorithm is called online sequential ELM (OS-ELM). OS-ELM can learn data one-by-one or chunk-by-chunk, with a fixed or varying chunk size. OS-ELM consists of two phases, namely the initialization phase and the sequential learning phase. In the initialization phase, given the activation function $g(x)$, the matrix $H_0$ is filled up for use in the learning phase. The number of samples required to fill up $H_0$ should be at least equal to the number of hidden nodes. Following the initialization phase, the learning phase commences either on a one-by-one or chunk-by-chunk (with fixed or varying size) basis as desired. Once a chunk of data has been used, it is discarded and not used any more. Finally, we get the output weight $\beta$.
Initialization Phase: Initialize the learning using a small chunk of initial training data $\{(x_i, t_i)\}_{i=1}^{N_0}$ with $N_0 \geq L$.
a. Assign random input weights $a_i$ and bias $b_i$ (for additive hidden nodes) or center $a_i$ and impact factor $b_i$ (for RBF hidden nodes), $i = 1, \ldots, L$.
b. Calculate the initial hidden layer output matrix $H_0$.
c. Estimate the initial output weight $\beta^{(0)} = P_0 H_0^T T_0$, where $P_0 = (H_0^T H_0)^{-1}$ and $T_0 = [t_1, \cdots, t_{N_0}]^T$.
d. Set $k = 0$.
Sequential Learning Phase: Present the $(k+1)$th chunk of new observations, where $N_{k+1}$ denotes the number of observations in the $(k+1)$th chunk.
a. Calculate the partial hidden layer output matrix $H_{k+1}$ for the $(k+1)$th chunk of data.
b. Set $T_{k+1}$ to the target matrix of the $(k+1)$th chunk.
c. Calculate the output weight $\beta^{(k+1)}$, following the recursive least-squares update of [37]:

$$P_{k+1} = P_k - P_k H_{k+1}^T \left(I + H_{k+1} P_k H_{k+1}^T\right)^{-1} H_{k+1} P_k,$$

$$\beta^{(k+1)} = \beta^{(k)} + P_{k+1} H_{k+1}^T \left(T_{k+1} - H_{k+1} \beta^{(k)}\right).$$

d. Set $k = k + 1$. Go to Sequential Learning Phase.
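The two phases can be illustrated with a short NumPy sketch. It assumes the standard recursive least-squares form of OS-ELM, with $P_0 = (H_0^T H_0)^{-1}$ and a rank update per chunk, together with a tanh activation and toy data; the function names `os_elm_init` and `os_elm_update` are hypothetical. The chunk boundaries are chosen to mix chunk-by-chunk and one-by-one learning.

```python
import numpy as np

def os_elm_init(H0, T0):
    """Initialization phase: P_0 = (H_0^T H_0)^-1 and beta_0 = P_0 H_0^T T_0."""
    P = np.linalg.inv(H0.T @ H0)
    return P, P @ H0.T @ T0

def os_elm_update(P, beta, H, T):
    """Sequential phase: fold one chunk (hidden outputs H, targets T) into P and beta."""
    K = np.linalg.inv(np.eye(H.shape[0]) + H @ P @ H.T)
    P = P - P @ H.T @ K @ H @ P
    beta = beta + P @ H.T @ (T - H @ beta)
    return P, beta

rng = np.random.default_rng(1)
n, L = 3, 5
A = rng.standard_normal((n, L))   # random input weights a_i
b = rng.standard_normal(L)        # random biases b_i

def hidden(X):
    return np.tanh(X @ A + b)

X = rng.uniform(-1.0, 1.0, size=(300, n))
T = np.sin(X.sum(axis=1, keepdims=True))

# Initial chunk of 20 samples (at least L), then chunks of size 30, 1, 69 and 180.
bounds = [0, 20, 50, 51, 120, 300]
P, beta = os_elm_init(hidden(X[:20]), T[:20])
for lo, hi in zip(bounds[1:-1], bounds[2:]):
    P, beta = os_elm_update(P, beta, hidden(X[lo:hi]), T[lo:hi])

# Batch ELM solution on all data, for comparison.
beta_batch = np.linalg.pinv(hidden(X)) @ T
```

Each chunk is touched once and can then be discarded; only the L x L matrix `P` and the weights `beta` are carried forward, and the result agrees with the batch solution up to rounding.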

B. KERNEL ELM
Huang et al. [38] combined the learning principle of the support vector machine with ELM, introduced the kernel function into ELM, and proposed the kernel ELM. Compared with the Extreme SVMs proposed by Liu et al. [39] and Benoit et al. [40], the ELM constructed with this method has fewer constraints and better learning ability. Kernel ELM does not need the feature mapping function $h(x)$, the input weight vector $w$, the bias $b$ or the number of hidden nodes $L$ of ELM. When the mapping is unknown, a kernel function $K$ is constructed to represent $HH^T$:

$$\Omega_{ELM} = HH^T, \quad \Omega_{ELM\,i,j} = h(x_i) \cdot h(x_j) = K(x_i, x_j).$$

So the connection weight matrix $\beta$ between the hidden layer and the output layer can be expressed as

$$\beta = H^T \left(\frac{I}{C} + HH^T\right)^{-1} T,$$

where $C$ is the regularization parameter and

$$T = [t_1, \cdots, t_N]^T,$$

where the expected output vector of the $m$ output nodes is $t_i = [0, \cdots, 0, 1, 0, \cdots, 0]^T$. The classification formula of kernel ELM is expressed as

$$f(x) = h(x)\beta = \begin{bmatrix} K(x, x_1) \\ \vdots \\ K(x, x_N) \end{bmatrix}^T \left(\frac{I}{C} + \Omega_{ELM}\right)^{-1} T.$$
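A small NumPy sketch may make the kernel formulation concrete. It assumes the usual regularized form, in which the N x N kernel matrix replaces $HH^T$ and a linear system with $I/C$ added to the kernel matrix is solved; the Gaussian (RBF) kernel, its width `gamma`, the toy two-class data, and the names `kelm_fit` and `kelm_predict` are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel K(a, b) = exp(-gamma * ||a - b||^2) for all row pairs."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kelm_fit(X, T, C, gamma):
    """Solve (I/C + Omega) alpha = T, where Omega is the N x N kernel matrix."""
    omega = rbf_kernel(X, X, gamma)
    return np.linalg.solve(np.eye(len(X)) / C + omega, T)

def kelm_predict(X_new, X_train, alpha, gamma):
    """f(x) = [K(x, x_1), ..., K(x, x_N)] alpha; the predicted class is the argmax."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2.0, 0.5, (50, 2)), rng.normal(2.0, 0.5, (50, 2))])
y = np.repeat([0, 1], 50)
T = np.eye(2)[y]                                  # one-hot expected outputs t_i
alpha = kelm_fit(X, T, C=10.0, gamma=0.5)
pred = kelm_predict(X, X, alpha, gamma=0.5).argmax(axis=1)
train_acc = float((pred == y).mean())
```

No hidden layer parameters appear anywhere, but the N x N kernel matrix must be stored, which is the memory cost that motivates distributed kernel ELM.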

III. DISTRIBUTED ELM
An ensemble consists of a group of separately trained classifiers, and is usually more accurate than any single classifier in it [41]. Combining multiple ELMs into one model parallelizes ELM and speeds up the computation. Although ensemble ELM reduces the computational cost of repeated learning on the same data block, the learners in the ensemble cannot learn from all the data. In ELM computation, the most expensive part is the matrix multiplication required for the Moore-Penrose generalized inverse [42], and this matrix multiplication is decomposable. Therefore, matrix operation methods were proposed to solve the problem that ensemble learning cannot learn from all the data. Existing research mainly applies ensemble methods and ELM matrix operation optimization to process ELM in a distributed way.

A. ELM ENSEMBLE
There are two main ways to parallelize ELM using ensemble methods: (1) Results ensemble. Decompose the problem (data set) into sub-problems (sub-data sets), train an ELM for each sub-problem (sub-data set), and finally gather the trained results.
(2) Parameters ensemble. Divide ELM into multiple sub-ELMs through different partitioning methods, train the ELM sub-models in parallel, and finally combine all the trained sub-models through some algorithm. Results ensemble gathers the trained results of multiple models, so the combined results have higher accuracy. Parameters ensemble gathers the parameters into a single model for calculation, which is more complicated than the calculation on any single model in results ensemble. However, since results ensemble computes on multiple models, the overall computational complexity of the two approaches is similar. At the same time, parameters ensemble performs better in memory optimization.

1) RESULTS ENSEMBLE
As shown in Fig. 1, results ensemble is the aggregation of the results obtained after training multiple ELMs. Results ensemble is mainly used to train models in different environments such as GPU [43], network [44], [45], and the MapReduce framework [46]-[48]. The results ensemble method has also been applied to kernel ELM [49]. In 2011, Heeswijk et al. [43] proposed combining multiple ELMs into an ensemble model. They parallelized model training and model structure selection across multiple GPU and CPU cores, so that multiple ELM models are built simultaneously and the learning speed of ELM is improved. To address the problem of classification in P2P networks, Sun et al. [44] applied OS-ELM to the P2P network, trained a classifier on each peer node and generated an ensemble classifier. This method not only improved the computation speed, but also solved the problem of low classification accuracy in traditional P2P ensemble classifiers, where each local classifier only learns part of the data. Wang et al. [45] applied the M³-network [50] to the ELM ensemble, and proposed a parallel ensemble ELM (M³-ELM) based on the M³-network. M³-ELM first decomposes the classification problem into smaller sub-problems, then trains one ELM for each sub-problem, and finally ensembles these ELMs with the M³-network. Compared with common ELM, M³-ELM increased the training speed by 1.6-4.6 times and reduced the training error by 0.37-19.51%. MapReduce is a computing framework for parallel processing of big data, and ELM ensembles based on the MapReduce framework are widely used. ELM-MapReduce [46] adopts the ELM learning method to build an ELM ensemble classifier on MapReduce, as shown in Fig. 2. ELM is widely used for gesture recognition.
However, when the data set includes multiple objects, the classic ELM may produce large errors. Instead of building one ELM for all gestures, Liang et al. [47] built a separate ELM network for each gesture and combined the results after training. Noise is frequent in large-scale data. To eliminate its impact, Huang et al. [48] first proposed an ensemble OS-ELM framework that integrates three ensemble methods: bagging, subspace partitioning, and cross validation. They then designed a parallel ensemble algorithm of OS-ELM based on MapReduce to analyze large-scale data effectively.
In order to build an ensemble classifier of kernel ELM, Li et al. [49] proposed a parallel one-class ELM algorithm (P-ELM) based on the Bayesian method. P-ELM divides the training data set into k components according to class attributes, then uses the divided training data sets to train k kernel-based one-class ELM classifiers, and finally uses the Bayesian method to compare the output function values of the kernel-based one-class ELM classifiers.
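The general pattern of results ensemble can be sketched as follows: the data set is split into disjoint sub-data sets, one ELM is trained per subset (each could run on a separate worker), and the outputs are aggregated at prediction time. This is an illustrative sketch rather than any of the cited systems; averaging output scores is only one possible aggregation rule, and the tanh activation, toy data and function names are assumptions.

```python
import numpy as np

def train_elm(X, T, L, rng):
    """Train one sub-model on its own sub-data set."""
    W = rng.standard_normal((X.shape[1], L))
    b = rng.standard_normal(L)
    beta = np.linalg.pinv(np.tanh(X @ W + b)) @ T
    return W, b, beta

def elm_scores(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

def ensemble_predict(models, X):
    """Results ensemble: average the output scores of all sub-models, then take the argmax."""
    return np.mean([elm_scores(m, X) for m in models], axis=0).argmax(axis=1)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2.0, 0.5, (200, 2)), rng.normal(2.0, 0.5, (200, 2))])
y = np.repeat([0, 1], 200)
parts = np.array_split(rng.permutation(len(X)), 4)      # four disjoint sub-data sets
models = [train_elm(X[p], np.eye(2)[y[p]], L=10, rng=rng) for p in parts]
accuracy = float((ensemble_predict(models, X) == y).mean())
```

Each sub-model sees only a quarter of the data, which illustrates the limitation noted earlier that no single learner in the ensemble learns from all the data.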

2) PARAMETERS ENSEMBLE
Research on parameters ensemble mainly trains an ELM sub-model on each node through the MapReduce framework, and then collects all the sub-models to form the final model through some algorithm. It is used to handle both classification and regression [51], to overcome inefficiency and lack of memory [52], [53], and to improve the scalability of data processing [54].
Chen et al. [51] proposed MR-ELM to solve both classification and regression problems. MR-ELM trains an ELM sub-model on each Hadoop node using local sample blocks, collects the trained hidden nodes, and forms the final ELM model. For regression problems, they used the least squares method to calculate the weight of each group of hidden nodes. For classification problems, they simply merged the hidden node groups. Different from MR-ELM, the model proposed by Wu et al. [52] adopts the generalized inverse method to calculate the weights on each Hadoop node, and obtains the combined weights with the Moore-Penrose generalized inverse operator to combine all ELM sub-models. Catak [53] constructed the AdaBoosted-ELM classifier, which combines ELM and AdaBoost to improve the classifier's prediction ability. AdaBoosted-ELM creates data blocks from the training data set using the MapReduce paradigm, and uses each subset of the training data set to find an ensemble of ELMs that acts as a single global classifier function. Budiman et al. [54] integrated the CNN architecture with ELM, where the CNN serves as an unsupervised convolutional feature learner and the ELM as a supervised classifier, to improve the scalability of big data processing. In the MapReduce computation, the Map process acts as the CNN-ELM classifier and learns independently on different training data partitions. The Reduce process integrates all the weights of CNN-ELM (kernel weights of the CNN and output weights of the ELM) by averaging.
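The weight-averaging style of parameters ensemble can be sketched in a few lines. The sketch assumes, for simplicity, that all workers share the same random hidden layer, so that only the output weights differ between sub-models; this is an illustrative simplification and not the method of any cited paper. The names `map_fit` and `reduce_average` are hypothetical.

```python
import numpy as np

def hidden(X, W, b):
    return np.tanh(X @ W + b)

def map_fit(Xk, Tk, W, b):
    """Map: fit the output weights of one sub-model on its local partition."""
    return np.linalg.pinv(hidden(Xk, W, b)) @ Tk

def reduce_average(betas):
    """Reduce: merge the sub-models into one model by averaging their output weights."""
    return np.mean(betas, axis=0)

rng = np.random.default_rng(4)
n, L = 2, 20
W = rng.standard_normal((n, L))   # hidden layer shared by all workers (assumption)
b = rng.standard_normal(L)
X = rng.uniform(-1.0, 1.0, size=(600, n))
T = np.sin(3.0 * X[:, :1]) + X[:, 1:] ** 2
parts = np.array_split(rng.permutation(len(X)), 3)
betas = [map_fit(X[p], T[p], W, b) for p in parts]
beta_avg = reduce_average(betas)
mse_avg = float(np.mean((hidden(X, W, b) @ beta_avg - T) ** 2))
```

With a shared hidden layer, averaging the output weights gives the same predictions as averaging the sub-model outputs, but only a single L x m weight matrix has to be kept, which is consistent with the memory advantage attributed to parameters ensemble.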

B. ELM MATRIX OPERATION OPTIMIZATION
Because the matrix multiplication required for the Moore-Penrose generalized inverse is the most expensive part of ELM, it is very difficult to compute on a single machine; however, this matrix multiplication is decomposable. Therefore, most research has focused on the decomposition of the matrix multiplication operation [42], [55]-[57], and on double classifier algorithms that combine the basic matrix decomposition method with other methods [58], [59]. Subsequent studies addressed some defects of the basic ELM matrix decomposition: the deficiencies of the MapReduce framework [60]-[62], the restriction to supervised ELM [63], [64], and unbalanced data [65], [66].

Algorithm 1 (fragment)
    ...
    for i = 1 → L do
        for j = 1 → L do
            context.write(triple(U, i, j), h[i]h[j])
        for j = 1 → m do
            context.write(triple(V, i, ...
    ...
    for all sum s ∈ [s1, s2, ···] do
        ω = ω + s
    context.write(triple p, sum ω)
    end for
    end function

... increasing the processing time for ELM. Xin et al. [42] found that the matrices U and V can share the calculation of the elements h_ij of the matrix H, and that the partial sums of u_ij and v_ij can be calculated independently. Therefore, they proposed ELM* to calculate the matrices U and V in one MapReduce process, as shown in Algorithm 1. ELM* combines the two MapReduce jobs of PELM and uses only one MapReduce job to obtain the final ELM result. ELM* not only reduces the transmission cost of a large number of intermediate results, but also improves the processing efficiency. However, the learning ability of ELM* is weak when the model or the data is updated. To make up for the deficiency of ELM* in updating the number of hidden nodes, Xin et al. [56] proposed an adaptive distributed ELM (A-ELM*). A-ELM* first computes the intermediate matrix multiplication for the updated subset of hidden nodes, then updates the matrix multiplication by modifying the old matrix multiplication with the intermediate one, and finally uses the updated matrix multiplication to obtain a new output weight vector. To make up for the inability of ELM* to handle updated large-scale training data sets, Xin et al. [57] proposed an elastic ELM based on MapReduce, named E²LM. E²LM first calculates the intermediate matrix multiplication of the updated training subset, and then uses the same method as A-ELM* to update the matrix multiplication and obtain the new output weight vector. Since the number of updated hidden nodes is much smaller than the total number of hidden nodes and the updated training data set is much smaller than the whole training data set, the calculation time of A-ELM* and E²LM is much smaller than that of ELM*.
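The decomposability that these methods rely on can be shown with a short NumPy sketch. It assumes that U and V denote the products $H^T H$ and $H^T T$ and that the output weights are obtained from a regularized solve with a parameter `C`; both are assumptions made for illustration, as are the tanh activation and the function names. Plain Python lists stand in for the Map and Reduce stages, so this is not a MapReduce program.

```python
import numpy as np

def map_partial(Xk, Tk, W, b):
    """Map: partial sums of U = H^T H and V = H^T T for one data partition."""
    Hk = np.tanh(Xk @ W + b)
    return Hk.T @ Hk, Hk.T @ Tk

def reduce_sum(partials):
    """Reduce: add up the partial sums from all partitions."""
    U = sum(u for u, _ in partials)
    V = sum(v for _, v in partials)
    return U, V

def solve_beta(U, V, C):
    """Small centralized step: beta = (I/C + U)^-1 V, where U is only L x L."""
    return np.linalg.solve(np.eye(U.shape[0]) / C + U, V)

rng = np.random.default_rng(5)
n, L, C = 4, 16, 100.0
W = rng.standard_normal((n, L))
b = rng.standard_normal(L)
X = rng.uniform(-1.0, 1.0, size=(1000, n))
T = np.sin(X.sum(axis=1, keepdims=True))

partials = [map_partial(Xk, Tk, W, b)
            for Xk, Tk in zip(np.array_split(X, 8), np.array_split(T, 8))]
U, V = reduce_sum(partials)
beta = solve_beta(U, V, C)
```

Because U and V are plain sums over samples, a newly arrived training subset only contributes its own partial products, which are added to the stored U and V before the small L x L system is solved again.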
The matrix decomposition method of ELM* also performs well in improving centralized recommendation algorithms for large-scale recommendation [74] and in WiFi-based fingerprint indoor positioning systems [75].
Many data sets not only contain a large number of records but also have a high-dimensional feature space, so it is often necessary to reduce the dimension of the feature space. Nonlinear principal component analysis (NLPCA) is a dimension reduction method that takes into account the nonlinear relationships between features. Tejasviram et al. [58] proposed using an Auto Associative ELM (AAELM) to perform NLPCA: after training, the outputs of the AAELM hidden nodes are extracted and treated as the nonlinear principal components (NLPCs). They implemented AAELM by matrix decomposition on MapReduce. The decision tree (DT) [76] is a promising parallel classification algorithm with the advantages of simple implementation, few parameters and little computation. However, many parallel DT algorithms ignore the over-segmentation problem, which may lead to redundancy and over-fitting. To solve this problem, Wang et al. [59] proposed a hybrid DT induction method, ELM-Tree. When all available segmentation gain ratios are less than a threshold, an ELM is embedded as a leaf node. Since the information gain and gain ratio of different cut points are calculated independently, they can be computed in parallel. Together with the parallel calculation of the ELM output matrix, this parallelizes ELM-Tree.
Although distributed processing based on MapReduce has been widely used, it generates many Map and Reduce tasks. Intermediate results generated in the Map phase are written to disk; in the Reduce phase, these intermediate results are read back from disk and the output is written to the Hadoop distributed file system (HDFS). This process greatly increases the communication cost and reduces the learning speed and efficiency. In contrast to Hadoop, Spark operations are based on Resilient Distributed Datasets (RDDs), which can be cached in memory across nodes and reused in multiple MapReduce-like parallel operations. Therefore, variables that are used repeatedly and intermediate variables can be cached in memory rather than on disk, reducing communication costs and I/O overhead. Oneto et al. [60], [61] realized emotion recognition and polarity detection with ELM on Spark's in-memory technology, and addressed the problem of selecting the ELM hyperparameters with the best generalization performance. Duan et al. [62] proposed an ELM based on the Spark framework (SELM). By partitioning the data reasonably to balance the workload among nodes, its three algorithms (hidden layer output matrix calculation, matrix Û decomposition, and matrix V decomposition) perform most of the computations locally, while keeping the intermediate results in distributed memory and caching the diagonal matrix as broadcast variables, thus greatly reducing costs. Compared with PELM [55], ELM* [42], and improved ELM* [56], [57], SELM achieves the highest speedup while keeping the same accuracy as traditional ELM.

2) IMPROVED ELM
Currently, distributed ELM only supports supervised learning on labeled training data sets, and does not support the processing of partially labeled or unlabeled training data. To parallelize semi-supervised ELM (SS-ELM), Chen [63] proposed parallel approximation SS-ELM (PASS-ELM). PASS-ELM is based on the approximate adjacent similarity matrix (AASM) algorithm, uses the Locality-Sensitive Hashing (LSH) algorithm to calculate the approximate neighborhood similarity matrix, and adopts the Laplace acceleration method for distributed processing of ELM. With a different emphasis from PASS-ELM, the U-ELM proposed by Wang et al. [64] adopts the matrix decomposition method to parallelize ELM, which complements the Laplace acceleration method adopted by PASS-ELM, and extends distributed ELM not only to semi-supervised learning but also to unsupervised learning.
ELM and its variants have been widely used in many big data learning applications, where raw data with an imbalanced class distribution is common [77], [78]. Zong et al. [79] proposed a weighted ELM (W-ELM) to deal with the imbalance problem. Unlike traditional ELM, which treats all training data equally, W-ELM uses different penalty coefficients to weight the training errors of different inputs. Wang et al. [65] proposed a distributed W-ELM (DW-ELM), which improved the efficiency of learning from large amounts of unbalanced data. DW-ELM first uses two MapReduce jobs to calculate the matrix multiplications in parallel, and then obtains the corresponding output weight vector through centralized calculation. The experiments show that DW-ELM processes large-scale data effectively and quickly across the tested parameter settings. After that, Wang et al. [66] proposed an improved DW-ELM (IDW-ELM). DW-ELM uses two MapReduce jobs to complete the calculation of the U and V matrices, while IDW-ELM uses only one MapReduce job for the same calculation, so the transmission time of IDW-ELM is far less than that of DW-ELM.
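The weighted products used in this setting decompose over data partitions in the same way as the unweighted ones. The sketch below assumes that U and V denote $H^T W H$ and $H^T W T$ with a diagonal weight matrix $W$, and that each sample is weighted by the reciprocal of its class size, which is one common weighting choice; the global class counts are assumed to be known before the Map stage. All names are illustrative, and `A` denotes the random input weights to avoid a clash with the weight matrix.

```python
import numpy as np

def map_weighted_partial(Xk, Tk, wk, A, b):
    """Map: partial sums of U = H^T W H and V = H^T W T, with diag(W) passed as the vector wk."""
    Hk = np.tanh(Xk @ A + b)
    HwT = Hk.T * wk                 # scales column i of Hk^T by the weight of sample i
    return HwT @ Hk, HwT @ Tk

rng = np.random.default_rng(6)
n, L, C = 2, 12, 10.0
A = rng.standard_normal((n, L))
b = rng.standard_normal(L)
X = np.vstack([rng.normal(0.0, 1.0, (180, n)), rng.normal(2.0, 1.0, (20, n))])
y = np.repeat([0, 1], [180, 20])    # imbalanced classes: 180 versus 20 samples
T = np.eye(2)[y]
w = 1.0 / np.bincount(y)[y]         # per-sample weight = 1 / (size of the sample's class)

chunks = np.array_split(np.arange(len(X)), 5)
partials = [map_weighted_partial(X[c], T[c], w[c], A, b) for c in chunks]
U = sum(u for u, _ in partials)     # Reduce: sum of partial U
V = sum(v for _, v in partials)     # Reduce: sum of partial V
beta = np.linalg.solve(np.eye(L) / C + U, V)
```

Both products come out of the same pass over each partition, which mirrors the idea of computing U and V in a single job rather than two.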
A traditional ELM assumes that all training data is prepared prior to the training process; however, in some tasks, the training data arrives sequentially. In order to extend ELM to online sequential data, Liang et al. [37] proposed the online sequential ELM (OS-ELM), which can learn data one by one or block by block, with a fixed or varying block size. Ai et al. [67] proposed a distributed collaborative ELM based on message exchange between adjacent nodes, named DC-ELM. DC-ELM reformulates the centralized ELM training problem into a form that is separable across nodes with consensus constraints, and then uses distributed optimization tools to solve the equivalent problem. Although DC-ELM does not parallelize OS-ELM, it uses an online sequential method to process ELM in a distributed way. Wang et al. [68] proposed a parallel OS-ELM (POS-ELM) based on MapReduce by analyzing the dependencies in the OS-ELM matrix calculation. The effectiveness of POS-ELM is equal to that of OS-ELM and ELM, and in large-scale learning POS-ELM is more efficient. POS-ELM supports training a single OS-ELM model in parallel, but does not support training multiple OS-ELM models effectively. Therefore, to train multiple models accurately and effectively, Huang et al. [69], [70] proposed batch parallel OS-ELM (BPOS-ELM), which estimates the Map and Reduce execution times from historical statistics and generates an execution plan. BPOS-ELM then launches one MapReduce job to train multiple OS-ELM models according to the generated execution plan.
ELM provides a unified learning platform with a widely used type of feature mapping. Within this unified framework, kernel ELM uses a kernel rather than a random feature map. However, with the exponential growth of training data in large-scale learning applications, centralized kernel ELM suffers from the large memory consumption of matrix computation, so it is very important to process kernel ELM in a distributed way. Bi et al. [71] parallelized kernel ELM on MapReduce, realizing the matrix decomposition with the orthogonal projection method. Karthick et al. [72] adopted Spark ITFS technology to select features through dimensionality reduction, and then classified on each node to parallelize the kernel ELM. Pandeeswari et al. [73] first proposed the kernel OS-ELM based on MapReduce, and proposed the online sequential ELM method with kernel (OS-ELM-Ker) based on a sparsity criterion, considering the parallelization of OS-ELM and kernel ELM simultaneously.
There are also distributed approaches that take into account both ensemble and matrix operations. For example, Han et al. [80] proposed a weighted ensemble ELM (WE-DELM) based on matrix operation, combining matrix decomposition with ensemble operation; Wang et al. [81] proposed two models, data parallel regularization ELM (DPR-ELM) and model parallel regularization ELM (MPR-ELM), which use matrix operation and ensemble, respectively.

C. OTHERS
In addition to ensemble and matrix operations, there are other ways to implement distributed ELM, for example, iteration acceleration [82], acceleration packages in MATLAB [83], GPU acceleration [84], [85], and distributed online sequential learning [86]. Different from He et al. [55] and Xin et al. [42], Kokkinos and Margaritis [82] worked with the incremental version of ELM. Incremental ELM does not use direct matrix-matrix multiplications; instead, neurons are added one by one, each using one transmission of data, which lends itself to direct parallelization. Compared with the classical training method, which calculates the generalized inverse of the regression matrix to solve for the output weights, incremental ELM has a lower computational cost. Rizk et al. [83] used MATLAB's parallel tools to distribute the feature space transition to multiple workers, and applied the clustering algorithm on a single worker to achieve parallelization. Graphics processing units (GPUs) have become parallel processing tools due to their high computing power and low cost, especially in the field of high-performance computing. Phusomsai et al. [84] used histogram of gradient features extracted from tumor shape images, and fed them into ELM as a classifier. They then studied the feasibility of parallelization and implemented it on the GPU, accelerating the traditional ELM by 3 times and 7 times in the classification stage. Chen et al. [85] were the first to combine the in-memory cluster computing platform Flink with GPUs to parallelize Hierarchical ELM (H-ELM), integrating the advantages of in-memory cluster computing and GPUs. Vanli et al. [86] introduced an ELM algorithm based on the gradient of the distribution formula. This algorithm provides a guaranteed upper bound on the SLFN performance of each agent, and proves that each independent SLFN can asymptotically achieve the performance of the optimal centralized batch SLFN.
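The node-by-node idea behind incremental ELM can be sketched as follows. The sketch uses the standard incremental update for a single output, in which the weight of each new random node is the projection coefficient of the current residual onto that node's output vector; it may differ in detail from the variant used in [82]. The tanh activation, toy data and the name `incremental_elm` are assumptions.

```python
import numpy as np

def incremental_elm(X, t, max_nodes, rng):
    """Add random hidden nodes one at a time, using only vector operations per node."""
    e = t.copy()                                  # current residual error vector
    nodes = []
    residual_norms = [float(np.linalg.norm(e))]
    for _ in range(max_nodes):
        w = rng.standard_normal(X.shape[1])       # random input weights of the new node
        b = rng.standard_normal()                 # random bias of the new node
        h = np.tanh(X @ w + b)                    # output of the new node on all samples
        beta = float(h @ e) / float(h @ h)        # output weight: projection of e onto h
        e = e - beta * h                          # update the residual
        nodes.append((w, b, beta))
        residual_norms.append(float(np.linalg.norm(e)))
    return nodes, residual_norms

rng = np.random.default_rng(7)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
t = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2
nodes, residual_norms = incremental_elm(X, t, max_nodes=50, rng=rng)
```

Each step needs only two inner products over the samples, and inner products are sums that can be split across data partitions; no generalized inverse is formed. The residual norm never increases from one node to the next.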
There are also methods for parallelizing ELM. Although they do not implement distributed processing of ELM, they combine parallelization methods with ELM in order to optimize it. Some random hidden nodes may play an important role in the network output. Yang et al. [87] applied a parallel algorithm to ELM and proposed an incremental ELM based on parallel chaos search (PC-ELM), which is used to discover such hidden nodes. Ahmad and Janahiraman [88] proposed a parallel ELM based on particle swarm optimization (PIPSO-ELM) for modeling and prediction of surface roughness and power consumption in manufacturing. PIPSO-ELM is divided into two separate algorithm blocks, one for surface roughness and one for power consumption, and the two basic ELM-based performance models are then combined with the input weights and hidden biases selected by PSO. To improve the performance of ELM on regression problems, existing research proposes applying a double-parallel structure to ELM. He et al. [89] applied a data-attribute-space-oriented double parallel (DASODP) structure to ELM (DASODP-ELM). The double-parallel structure enables the output layer of DASODP-ELM to receive not only information from neurons in the hidden layer, but also direct information from neurons in the input layer. Compared with ELM, DASODP-ELM can achieve better performance with fewer parameters.
Random vector functional-link (RVFL) networks can be regarded as single hidden layer feedforward neural networks whose output is a linear combination of nonlinear expansions of the original input. ELM is proposed precisely for single hidden layer feedforward neural networks. Therefore, the distributed learning algorithms proposed for RVFL may also be applied to ELM. Scardapane et al. [90] proposed a distributed learning approach for RVFL networks in which the training data are distributed across a network with a decentralized information structure. They proposed two algorithms, based on the decentralized average consensus (DAC) and alternating direction method of multipliers (ADMM) strategies. These algorithms work in a completely distributed manner and do not need the coordination of a central agent in the learning process. Scardapane et al. [91] investigated the problem of music classification when the training data are distributed throughout a network of interconnected agents and are available as a sequential stream. In this setting, the goal is for all the nodes, after receiving any block of training data and without relying on a master node, to agree on a single classifier in a decentralized fashion. For a special class of neural networks, Scardapane et al. [92] proposed an RVFL algorithm based on the alternating direction method of multipliers. The algorithm allows learning an RVFL network from multiple distributed data sources while limiting communication to a single operation that computes a distributed average.
Field-programmable gate arrays (FPGAs) offer flexible acceleration for many workloads and have been used to accelerate large-scale tasks, providing significant performance improvements and substantial power savings, which demonstrates their potential for efficient large-scale computation [93]. Yeung et al. [94] proposed an implementation of a MapReduce library that supports parallel FPGAs and graphics processing units (GPUs) and provides up to 100 times performance improvement. Choi and So [95] proposed the design and implementation of a k-means clustering algorithm for an FPGA-accelerated computer cluster. They implemented a MapReduce programming model in which both the map and reduce functions are executed on multiple FPGAs autonomously from the CPU, and developed a hardware/software framework to manage gateware execution on multiple FPGAs across the cluster. However, this performance improvement comes at a significant cost because of the long development cycle required to leverage FPGA resources. Ghasemi and Chow [96] incorporated FPGA acceleration into Spark. They provided easy access to FPGA resources for ordinary application developers while retaining the functionality and user interfaces of currently popular distributed platforms such as Spark. Applying FPGAs in distributed platforms can therefore further accelerate distributed ELM processing.
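The decentralized average consensus (DAC) strategy mentioned above for RVFL networks can be illustrated with a short sketch. It is not the algorithm of [90]; it only shows the consensus step, under the assumption that each agent holds a locally trained output weight vector and repeatedly mixes it with those of its neighbors until all agents agree on the network-wide average. The function name `average_consensus`, the max-degree step size, and the ring topology in the example are illustrative assumptions.

```python
def average_consensus(values, neighbors, iterations=100):
    """Run synchronous average consensus on an undirected, connected graph.

    values[i] is the local vector of agent i (for example its local output
    weights); neighbors[i] lists the agents it can communicate with.
    No central coordinator is used: every update reads only neighbor states.
    """
    n = len(values)
    max_degree = max(len(nb) for nb in neighbors)
    # Step size below 1 / max_degree keeps the iteration stable.
    eps = 1.0 / (max_degree + 1)
    state = [list(v) for v in values]
    for _ in range(iterations):
        new_state = []
        for i in range(n):
            updated = [
                x_ik + eps * sum(state[j][k] - x_ik for j in neighbors[i])
                for k, x_ik in enumerate(state[i])
            ]
            new_state.append(updated)
        state = new_state
    return state


if __name__ == "__main__":
    # Four agents on a ring, each holding a two-dimensional local estimate.
    local_weights = [[1.0, 0.0], [2.0, 4.0], [3.0, 8.0], [6.0, 0.0]]
    ring = [[1, 3], [0, 2], [1, 3], [0, 2]]
    for agent, w in enumerate(average_consensus(local_weights, ring)):
        print(agent, w)
```

Because the mixing weights are symmetric, the sum of the agents' vectors is preserved at every iteration, so the common limit is the average of the initial local vectors.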

IV. CONCLUSION
Distributed ELM, as one of the hot research directions, has attracted the attention of a large number of researchers. Table 1 compares the existing distributed ELM methods. As can be seen from the table, ELM is mainly parallelized either by ensemble or by decomposing the matrix multiplications in the Moore-Penrose generalized inverse, which is the most expensive part of the ELM computation. Owing to the limitations of ensemble algorithms, most existing distributed ELM methods adopt matrix operation. Moreover, implementing distributed ELM on big data platforms such as MapReduce has become the mainstream. OS-ELM and kernel ELM, as important variants of ELM, have received less attention in current research. In the future, more attention can be paid to such important variants. According to the investigation, the test time of classical ELM on the Iris dataset is 4653 s, while that of distributed ELM is 487 s, and the accuracy is basically the same. Therefore, distributed ELM increases the computing speed without reducing accuracy; it only increases the hardware cost and network communication.
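The matrix decomposition referred to above can be written out explicitly. The notation here is assumed for illustration rather than taken from a specific cited method: $H \in \mathbb{R}^{N \times L}$ is the hidden layer output matrix, $T$ the target matrix, $C$ a regularization parameter, and the $N$ training samples are split row-wise into $m$ partitions $(H_k, T_k)$.

```latex
\beta = \left( \frac{I}{C} + H^{\top} H \right)^{-1} H^{\top} T,
\qquad
H^{\top} H = \sum_{k=1}^{m} H_k^{\top} H_k,
\qquad
H^{\top} T = \sum_{k=1}^{m} H_k^{\top} T_k .
```

Each partial product $H_k^{\top} H_k$ is only $L \times L$, regardless of how many samples the partition holds, so the partitions can be processed independently (for example in map tasks) and summed (in a reduce task) before a single small linear system is solved.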
Although distributed ELM has been successfully applied to classification, regression and other problems, and some research has been done on improved ELM variants, there are still many problems to be solved: (1) Limitation of hardware. At present, the main limitation of distributed ELM is the hardware. Computing large amounts of data using distributed ELM requires the support of an excellent hardware configuration. In the future, we will consider applying more advanced hardware devices such as FPGAs to distributed ELM to improve computing efficiency.
(2) Does not apply well to specific problems. Distributed ELM is applied to the processing and analysis of big data, but because application scenarios and data characteristics differ, the related problems cannot always be solved well. With the development of ELM research, many variants of ELM have been proposed to solve problems in different scenarios. In the future, we will consider distributed processing for the new ELM variants so that they can be better applied to different problems in big data.
(3) For the new distributed environment. Existing research on distributed ELM mainly focuses on the MapReduce framework. With the development of distributed computing frameworks, the mainstream distributed platforms have gradually evolved from MapReduce and Hadoop to Spark, Flink and so on. Although there have been studies on distributed ELM on Spark and Flink, these studies do not take full advantage of these frameworks. There are many directions in which to conduct distributed processing of ELM on these platforms.