Ensemble Deep Random Vector Functional Link Neural Network for Regression

Inspired by the ensemble strategy of machine learning, the deep random vector functional link (dRVFL) network and the ensemble dRVFL (edRVFL) have shown state-of-the-art results on different datasets. Our present work first fills the gap of dRVFL and edRVFL research in the field of regression. We test and evaluate the performance of the dRVFLs on regression problems. Subsequently, we propose a novel regularization method [boosted factor (BF)], two dRVFL variants [edRVFL with skip connections (edRVFL-SC) and edRVFL with random skip connections (edRVFL-RSC)], and one strategy [ensemble skip connection edRVFL (esc-edRVFL)], which show significant improvement over the original dRVFL. The BF is a newly introduced hyperparameter that scales the values of the activated hidden neurons to accommodate the diversity of the data; it is also able to filter the neurons. edRVFL-SC and edRVFL-RSC are edRVFL variants with skip connections. In edRVFL-SC, we apply dense skip connections to the edRVFL, inspired by the residual architectures in deep learning. However, due to the specificity of randomized networks, simple skip connections are likely to lead to the reuse of useless features. To address this problem, we propose a random skip connection-based edRVFL, which can keep the diversity in the latent space. esc-edRVFL is an ensemble scheme that utilizes several edRVFL-RSC models trained on different folds of the training dataset. The esc-edRVFL is identified as the best-performing algorithm through a comprehensive evaluation on 31 UCI datasets.

I. INTRODUCTION

The earliest neural network models were inspired by the structure of the human brain's neurons [1]. Since then, the neural network field has undergone several booms and winters. In 2012, Alex Krizhevsky and his team won the ImageNet competition with their state-of-the-art AlexNet model [2]. AlexNet achieved an unprecedented performance for neural networks and revitalized the field after a long winter period; its architecture has become the inspiration for different types of models, and a lot of progress has since been made in various fields, including computer vision, natural language processing, and so on.
Most neural network models are trained by applying gradient descent to minimize a user-defined loss function and using the backpropagation algorithm to update the weights of the hidden layers [3]. There is no doubt that the backpropagation mechanism is one of the most important factors contributing to the boom of the current artificial intelligence industry.
However, the backpropagation mechanism has its shortcomings. For instance, the training of such models is time-consuming. More importantly, gradient descent-based models often suffer from convergence issues, where the training process may get stuck at a local minimum, resulting in a suboptimal solution [4], [5].
Randomized neural networks, on the other hand, can address the issues mentioned above. Instead of updating the weights through iterations, the weights of randomized neural networks are randomly initialized and kept fixed throughout the training [4], which means that the hidden weights of such networks are data independent. The randomized neural network was first proposed in the 1990s and has gained attention due to its simplicity and performance, together with its universal approximation capability. The random vector functional link (RVFL) network, introduced by Pao et al. in 1994 [6], has gained substantial traction due to its simplicity and superior performance in different domains. RVFL is proven to be a universal approximator for continuous functions on bounded finite-dimensional sets, with a closed-form solution, in [7] and [8]. The weights and biases between the enhancement (i.e., hidden) nodes and input nodes in RVFL are randomly generated. The direct link between the input data and the output layer propagates the original data directly to the output layer; it acts as a form of regularization for the randomization [9] and helps to keep the complexity of the model low [10]. Due to its simplicity and efficiency, RVFL is currently used in a variety of applications, such as short-term load forecasting [11], rolling bearing fault diagnosis [12], and unsupervised learning [13]. In the context of regression, this research structurally and algorithmically improves the deep RVFL (dRVFL) and ensemble deep RVFL (edRVFL) proposed for pattern classification in [10] and [14]. The main contributions of this article can be summarized as follows.
1) The boosted factor (BF) is introduced to cover the latent space more comprehensively and to reduce the feature-diminishing effect occurring in the ensemble dRVFL (edRVFL).
2) edRVFL with skip connections (edRVFL-SC), which skips a predefined number of hidden layers, is proposed to improve the predictive power of edRVFL by improving the utilization of latent features and decreasing the corruption of high-dimensional information.
3) edRVFL with random skip connections (edRVFL-RSC), with randomly wired skip connections between hidden layers, is designed to obtain diverse network structures. There are numerous ways to wire the skip connections between hidden layers, and it is not practically feasible to exhaustively search the entire search space manually.
4) esc-edRVFL is an ensemble scheme that can keep the diversity of edRVFL-RSC while avoiding the duplication of useless features. The pipeline is shown in Fig. 1.

II. RELATED WORK

A. RVFL Variants
Recurrent RVFL (R-RVFL) with higher nonlinearity than the vanilla RVFL was proposed in [15]. This improvement is achieved by adding recurrent feedbacks (outer feedbacks and inner feedbacks) to the network. The feedback connections act as a form of dynamic memory and provide the network with higher modeling capability.
The orthogonal polynomial expanded RVFL neural network (OPE-RVFL) is a variant of the RVFL introduced by Vuković et al. in [16]. OPE-RVFL is formulated based on the fact that orthogonal polynomials are a powerful and efficient tool for approximating nonlinear functions. The architecture of OPE-RVFL can be divided into two parts: the first part is a nonlinear transformation, while the second part is the same as the vanilla RVFL. The input to OPE-RVFL first undergoes a nonlinear transformation before being fed into the second part for training. In [16], four orthogonal polynomials (Chebyshev, Legendre, Laguerre, and Hermite) are used for the nonlinear transformation to obtain improved regression performance.
In order to cope with data volatility, an exponentially expanded robust RVFL network (EE-RRVFLNN) was proposed in [17]. The robustness of this algorithm is ensured by a maximum likelihood estimator using Huber's cost function. The architecture of EE-RRVFLNN consists of a direct link and exponentially expanded mapping features that help to cope with positive dynamic volatility.

B. Other Randomized Neural Networks
In addition to RVFL, there are many other varieties of randomized neural networks that have received extensive attention from researchers.
1) Extreme Learning Machine: The extreme learning machine (ELM) can be viewed as a simplified version of the RVFL without bias and without direct connections from the input to the output. ELM was developed in 2004 [18], and there have been several publications comparing the original RVFL and the original ELM.
The hierarchical ELM (HELM) is a multilayer randomized neural network based on ELM, introduced by Tang et al. [19]. HELM can be divided into two components: 1) feature encoding and 2) a classifier, both of which are based on ELM. A multilayer HELM is obtained by stacking the outputs of multiple autoencoders together before feeding them to a one-class or multiclass classifier. The autoencoder part of HELM learns from the input data and produces a lower-dimensional representation of it; it acts as a feature extraction tool that helps to identify and extract the features that best represent the input data. dRVFL [20] was shown to outperform HELM.
2) Stochastic Configuration Networks: Stochastic configuration networks (SCNs) [21] are a type of randomized neural network generated incrementally by stochastic configuration. Instead of fixing the number of hidden neurons throughout the whole training phase, an SCN adds neurons one by one through stochastic configuration and updates the solution between the hidden neurons and the final output accordingly. Recent research [22] indicates that SCN algorithms can benefit from hyperparameter optimization techniques, such as Bayesian optimization, which can serve as a better alternative to the stochastic configuration algorithm.
3) Broad Learning System: The broad learning system (BLS) [23] is another RVFL-based model with dense connections. RVFL's output layer connects to both the raw input features and the hidden neurons, whereas BLS first maps the input data to a so-called feature layer, which is comparable to a hidden layer. The output layer then connects to both the feature layer and the subsequent hidden layers.

4) Reservoir Computing and Echo State Network: With a strong theoretical foundation, reservoir computing (RC) emerged as a viable alternative to gradient descent for training RNNs. RC refers generically to a class of RNNs, including liquid state machines (LSMs) [24], the echo state network (ESN) [25], [26], and other RNNs that use backpropagation-decorrelation strategies [27]. In the reservoir, the echo state is constituted by randomly linked neurons [28], which plays an important role in RC and ensures that the initial state's influence vanishes after a brief transient. This property is analogous to short-term memory [29], and it enables RC to perform well in a variety of sequential tasks. However, as a subtype of RNNs, the vast majority of RC networks and their varieties are suited solely to sequential tasks, such as time-series classification or forecasting [30]. The deep ESN (DeepESN) [31] has also recently received attention due to the structured state-space organization with multitimescale dynamics in deep RNNs, which is intrinsic to the compositional nature of recursive neural modules.
The concept of ESN was introduced and evolved in [28] and [32], and it is dedicated to processing sequential tasks. The key idea of ESN is to randomly generate a sparse matrix as the reservoir, which can be regarded as a weighted directed graph. The input features at any time step are mapped to a latent space by an input weight matrix, which is randomly generated from a uniform distribution. Then, the latent features are fed to the reservoir. The output is given by the matrix multiplication between the desired output weight and the concatenation of the input features, the reservoir features, and the output from the last time step.
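As an illustration of the reservoir update just described, the minimal NumPy sketch below builds a small random sparse reservoir and performs one state transition; the sizes, sparsity level, and spectral radius are illustrative assumptions rather than values from any cited work.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 3, 100                                       # illustrative input and reservoir sizes
W_in = rng.uniform(-1, 1, (n_res, n_in))                   # random input weights, kept fixed
W_res = rng.uniform(-1, 1, (n_res, n_res))                 # random reservoir weights
W_res[rng.random((n_res, n_res)) > 0.1] = 0.0              # keep the reservoir sparse (~10% connections)
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))    # spectral radius < 1 for the echo state property

def esn_step(x_t, state):
    # One reservoir update: the new state depends on the current input and the previous state.
    return np.tanh(W_in @ x_t + W_res @ state)

# Only the readout is trained (e.g., ridge regression on the concatenated input and state), as in RVFL.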

III. PRELIMINARY
The vanilla RVFL consists of only one hidden layer. If the training data X fed to the network has M instances and N features per instance, the weights W randomly generated for the hidden layer have dimension N × Z, where Z is the number of neurons in the hidden layer. The product of the input data with the generated weights becomes the input to a chosen nonlinear activation function g in the hidden layer neurons, and the outputs of the hidden neurons are collectively denoted as

H = g(XW)   (1)

The output layer receives the data D, which is the concatenation of the input data and the outputs of the enhancement nodes H

D = [X H]   (2)

The final output is obtained by multiplying the data matrix D with the output layer weights β

Ŷ = Dβ   (3)

Since there is a closed-form solution, the value of β can be computed using matrix inversion with regularization (ridge regression) or without it (Moore–Penrose pseudoinverse). When using the Moore–Penrose pseudoinverse, the value of β is simply β = D⁺Y. For the regularized least squares method (ridge regression), the value of β is given by

β = (DᵀD + λI)⁻¹DᵀY   (4)

β = Dᵀ(DDᵀ + λI)⁻¹Y   (5)

where Y is the ground truth of the training dataset and λ is the regularization parameter. We can choose either (4) or (5) depending on the dimension of the input dataset. By multiplying β with [X; H], we obtain the final predictions of the model, and by comparing the predictions with the ground truth we can evaluate its performance.

Regularization, such as the L1 norm or L2 norm, may be added to the output layer to prevent overfitting and improve the generalization of the model. The constraint imposed on the network parameters of RVFL reduces the complexity of the error function, turns the learning process into a quadratic optimization problem, and thus promises very fast optimization [33]. In principle, the global minimum of the cost function can be found within a single step of training if such a minimum exists and is well defined. Subsequently, Igelnik and Pao proved that as long as the function to be approximated by RVFL is Lipschitz continuous, the algorithm exhibits a rate of approximation error convergence similar to that of an MLP, which is O(1/n) [7]. This shows that RVFL is an efficient universal approximator with a closed-form solution.
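To make these steps concrete, a minimal NumPy sketch of the vanilla RVFL described above is given below; the sigmoid activation, the number of enhancement nodes, and the regularization value are illustrative assumptions, not settings from this article.

import numpy as np

def train_rvfl(X, Y, Z=100, lam=1e-2, seed=0):
    """Vanilla RVFL sketch: random hidden layer + direct link, closed-form output weights."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.uniform(-1, 1, (N, Z))            # random hidden weights, kept fixed
    b = rng.uniform(-1, 1, (1, Z))            # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # sigmoid enhancement nodes
    D = np.hstack([X, H])                     # direct link: concatenate input and H
    # Ridge regression, primal form as in (4); the dual form (5) is preferable when features > samples.
    beta = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ Y)
    return W, b, beta

def predict_rvfl(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.hstack([X, H]) @ beta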
Deep RVFL (dRVFL) and ensemble deep RVFL (edRVFL) [10] are two recently proposed deep variants of the RVFL. By modifying (1) and (2), the following equations, which represent the computations that occur in dRVFL, are obtained. For ease of notation, the bias term is absorbed into the input features and weight matrices

H^(1) = g(XW^(1))   (6)

H^(l) = g(H^(l−1)W^(l)), l = 2, . . . , L   (7)

where W^(1) ∈ R^(d×N) and W^(l) ∈ R^(N×N) are the weights between the input and the first hidden layer and between the intermediate layers, respectively. The weights of the hidden layers are assigned randomly and kept fixed. The final features D fed to the output layer of an L-layer dRVFL are defined as

D = [H^(1) H^(2) · · · H^(L) X]   (8)

where L is the number of layers. Then, the output of the dRVFL can be expressed as Ŷ = Dβ.

The hidden layers in dRVFL generate randomized internal representations in a cascading manner. The output of the previous layer is nonlinearly transformed and fed into the next layer for subsequent feature extraction. As the input data propagates down the hidden layers, different degrees of feature transformation are carried out at each level. dRVFL differentiates itself from the algorithms discussed in the previous section in that, instead of taking only the features from the final hidden layer, dRVFL takes into consideration the features generated at all levels. The features extracted at all levels, together with the original input data (direct link), are concatenated to generate the final outcome. The presence of direct links proves to be beneficial in improving the performance [16], [34], [35], as direct links act as a regularization mechanism that moderates the effect of randomization.
The main distinction between edRVFL and dRVFL is that edRVFL makes L independent sets of predictions [using (4) or (5)], with each set of predictions made from the features extracted at one of the L hidden layers, instead of making only one set of predictions as the final output. The L different predictions are ensembled together, and the median of the L predictions is taken as the final output of the regression. The direct link is also included in edRVFL for the same reason mentioned above. Additionally, some recent research utilizes model performance enhancement techniques, including weighting and pruning [20].
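A compact sketch of the edRVFL computation described above is given below: it cascades random hidden layers, solves one ridge readout per layer using the same closed form as (4), and takes the median of the per-layer predictions (a dRVFL would instead concatenate all layer features into D and solve a single readout). The activation, layer sizes, and regularization value are illustrative assumptions.

import numpy as np

def ridge(D, Y, lam):
    return np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ Y)

def train_edrvfl(X, Y, L=5, N=100, lam=1e-2, seed=0):
    """edRVFL sketch: cascade of random layers, one closed-form readout per layer."""
    rng = np.random.default_rng(seed)
    weights, betas, H_prev = [], [], X
    for l in range(L):
        W = rng.uniform(-1, 1, (H_prev.shape[1] + 1, N))   # bias absorbed into the weight matrix
        A = np.hstack([H_prev, np.ones((X.shape[0], 1))])
        H = 1.0 / (1.0 + np.exp(-(A @ W)))                 # randomized hidden features
        D = np.hstack([X, H])                              # direct link at every layer
        betas.append(ridge(D, Y, lam))
        weights.append(W)
        H_prev = H
    return weights, betas

def predict_edrvfl(X, weights, betas):
    preds, H_prev = [], X
    for W, beta in zip(weights, betas):
        A = np.hstack([H_prev, np.ones((X.shape[0], 1))])
        H = 1.0 / (1.0 + np.exp(-(A @ W)))
        preds.append(np.hstack([X, H]) @ beta)
        H_prev = H
    return np.median(np.stack(preds), axis=0)              # ensemble the per-layer predictions by median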

IV. METHODS
In this section, we propose three enhanced variants of the edRVFL networks.

A. Boosted Factor for edRVFL
We propose the addition of a new hyperparameter to the dRVFL algorithms to improve their performance. By adding a new hyperparameter e_b to (6) and modifying (7), we obtain (9). We denote the new hyperparameter as the "Deep Boosting" coefficient, or DB, and the variants with DB added are denoted as RVFL with BF from this point onward.

The main objective of this work is twofold: first, we want the different layers of the dRVFLs to have different priorities; second, we prefer to give more importance to the front layers, which are less perturbed by the randomly generated weights. In addition, we also need to take into account the range of values of the randomly generated weights. During the initialization phase, all the random weights take values in the range [0, 1]; however, such weights cannot accommodate the diversity of the input data, so we scale the activated hidden features by the factor e_b in (9), where A = [X; 1] with 1 for bias. The output weights β are then solved independently using (4) or (5). It is worth noting that the BF is applicable to both dRVFL and edRVFL.
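Since (9) is not reproduced here, the following sketch shows one plausible reading of the boosted factor, namely scaling the activated hidden features of a layer by the DB coefficient; the exact placement of e_b in (9) should be taken from the original equations, and the function name and values below are illustrative.

import numpy as np

def boosted_hidden(A, W, b_factor, activation=np.tanh):
    """Hypothetical BF step: scale the activated hidden features by the boosted factor.

    b_factor (the DB coefficient, searched roughly in [0.05, 3.0] in this article) is a
    hyperparameter; whether it multiplies the activations, as assumed here, or enters
    equation (9) differently should be checked against the original equations.
    """
    return b_factor * activation(A @ W)

# Example: boosting the first-layer features of a dRVFL/edRVFL.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
A = np.hstack([X, np.ones((8, 1))])        # A = [X; 1] with 1 for bias
W1 = rng.uniform(0, 1, (A.shape[1], 16))   # random weights drawn from [0, 1]
H1 = boosted_hidden(A, W1, b_factor=1.5)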

B. edRVFL With Skip Connections
Inspired by the concept of the deep residual learning framework [36] and the highway network in [37], we propose to add skip connections between different hidden layers in edRVFL to improve its performance. The purpose of these skip connections is not only to avoid gradient explosion or vanishing, as in backpropagation-based neural networks, but more importantly, to bring in the view of identity mapping. In the RVFL network, the direct link is a manifestation of identity mapping; however, this direct link only maps the original input to the different hidden layers. In order to allow the information from a prior hidden layer to be transferred to posterior layers without corruption, we use skip connections between different hidden layers. We first try edRVFL with skip connections that "skip" only a single layer (a skip connection between H_{i−3} and H_{i−1}, where i ≥ 4). For the sake of simplicity, we denote edRVFL with skip connections that skip a single layer as edRVFL-SC. The framework of edRVFL-SC is illustrated in Fig. 2 and the pipeline can be found in Fig. 3. The computations of the different hidden layers in this improved version of edRVFL (edRVFL-SC) are given by (10), and DB can simply be fused with edRVFL-SC as in (11). We shall refer to the edRVFL with skip connections and BF as edRVFL-SC in the remainder of this article; the details can be found in Algorithms 1 and 2.
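The fragment below illustrates one way the skip connection described above could be wired; concatenating the skipped features with the normal layer input is an assumption, as the exact fusion used in (10) and (11) is not reproduced here.

import numpy as np

def sc_layer_input(H, j):
    """Features fed into hidden layer j of edRVFL-SC (illustrative).

    H maps layer index -> hidden features. The skip connection between H_{i-3} and
    H_{i-1} (i >= 4) means that layer j = i-1 >= 3 also receives the features of
    layer j-2; simple concatenation is assumed here, while the paper's equations
    (10)-(11) may fuse the two feature blocks differently.
    """
    if j >= 3:
        return np.hstack([H[j - 1], H[j - 2]])   # normal input plus the skip connection
    return H[j - 1]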

C. edRVFL With Random Skip Connections
Algorithm 1: Pseudo-Code of edRVFL-SC Training
Input: Training data (X), ground truth (Y), maximum allowed layers (L_max)
    A_1 = [X 1] with 1 for bias, L = 1
    while L ≤ L_max do
        Generate random weight (W_L) with bias included.
        Append W_L to weightArray.
        Multiply A_L with W_L to obtain the inputs to the activation functions.
        Apply the activation functions on the inputs to obtain H_L by equation (11).
        Append H_L to H_store.
        Obtain D_L by concatenating H_L with A_L.
        Solve for β using D_L and Y by either equation (4) or (5).
        Append β to β_computed.
        Increase L by 1.

Algorithm 2: Pseudo-Code of edRVFL-SC Testing
Input: Test data, weightArray, β_computed, ground truth (Y)
    A_test_1 = [test data 1] with 1 for bias, L = 1
    while L ≤ L_max do
        Multiply A_test_L with weightArray[L] to obtain the inputs to the activation functions.
        Apply the activation functions on the inputs to obtain the output as H_L.
        Append H_L to H_store.
        Compute D_test_L by concatenating H_L with A_test_L.
        Increase L by 1.
    L = 1
    while L ≤ L_max do
        Compute the prediction by multiplying D_test_L and β_computed_L.
        Append the prediction to predictionList.
        Increase L by 1.
    Compute the final prediction by aggregating the L predictions in predictionList using the median.
    Compute the test RMSE using the final prediction and the ground truth Y.
    return test RMSE

Due to the attributes of the hidden layers in the randomized neural network, the ability of feature reuse brought by the dense skip connections in [38] is not effectively enhanced. Since the neuron weights in a random neural network are randomly generated, simply doing a high-dimensional mapping of these features to a deeper hidden layer will result in invalid information being used multiple times. Thus, we use a random skip connection strategy to replace the dense skip connection strategy, and we term such a network architecture with BF as edRVFL-RSC.
Specifically, edRVFL-RSC "skips" a random number of layers (a skip connection between H_{i−1−k} and H_{i−1}, where i ≥ 4 and k is randomly generated for each layer such that k > 1 and i − k ≥ 2). The same idea can be extended to many-to-one skip connections, where the current hidden layer is connected to multiple previous hidden layers.
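A small sketch of how the random wiring could be drawn under these constraints is given below; the indexing convention (destination layer j = i − 1) and the uniform choice of k are assumptions made for illustration.

import numpy as np

def random_skip_sources(L_max, rng):
    """For each hidden layer j >= 3, pick a random earlier layer j - k as the skip source.

    Following the text (with destination layer j = i - 1): k is drawn per layer with
    k > 1 and j - k >= 1, so each skip bypasses at least one intermediate layer.
    """
    sources = {}
    for j in range(3, L_max + 1):
        k = rng.integers(2, j)      # k in {2, ..., j-1}
        sources[j] = j - k          # layer whose features are carried into layer j
    return sources

rng = np.random.default_rng(0)
print(random_skip_sources(8, rng))  # e.g. {3: 1, 4: 2, ...}; a different wiring per model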

D. Ensemble Skip Connection edRVFL
Ensemble skip connection edRVFL (esc-edRVFL) is an ensemble of multiple edRVFL-RSC models with skip connections to yield a final prediction. It serves the following purposes: 1) the random skip connection strategy can effectively avoid multiple reuses of useless features and 2) the increased network diversity brought by the ensemble can compensate for the potential loss of information caused by random skip connections.
During the training and validation stage, we train N different edRVFL-RSC models, where N is typically taken as the number of cross-validation folds. In other words, each edRVFL-RSC is trained and tuned using a different data partition. The architectures of all the edRVFL-RSC models are different, as the skip connections in each of them are randomly generated. The edRVFL-RSC models with the top K performances are chosen for the testing phase.
In the testing phase, test data is fed into the K chosen models to generate K different predictions. The K predictions are then aggregated together using the median to obtain the final prediction. Finally, the test root mean square error (RMSE) is computed using the final prediction and the ground truth of the test data.
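The overall esc-edRVFL procedure described in this subsection can be sketched as follows; train_rsc and predict_rsc stand for any edRVFL-RSC implementation, and the values of N and K are placeholders.

import numpy as np
from sklearn.model_selection import KFold

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def esc_edrvfl(X, Y, X_test, train_rsc, predict_rsc, N=5, K=3, seed=0):
    """esc-edRVFL sketch: N edRVFL-RSC models on different folds, top K kept, median-aggregated."""
    models = []
    for fold, (tr, va) in enumerate(KFold(n_splits=N, shuffle=True, random_state=seed).split(X)):
        model = train_rsc(X[tr], Y[tr], seed=fold)          # different partition and random wiring per model
        val_rmse = rmse(Y[va], predict_rsc(X[va], model))   # validate on that model's own fold
        models.append((val_rmse, model))
    models.sort(key=lambda t: t[0])                         # keep the K best models by validation RMSE
    preds = [predict_rsc(X_test, m) for _, m in models[:K]]
    return np.median(np.stack(preds), axis=0)               # final prediction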

E. Generic to Randomized Neural Networks
Our method is a variation of edRVFL, and the ensemble strategy in edRVFL is generally applicable to other types of random networks, such as the recently published edBLS [39], which is an ensemble-based version of the BLS. Our method may be used in lieu of the deep-ensemble scheme in edBLS and improve the performance of the original model by including the boosting factor and skip connections.

V. EXPERIMENTAL SETUP
The experiment is divided into two phases. In the first phase, the dRVFL and edRVFL proposed in [10], as well as the three newly proposed variants (edRVFL-SC, edRVFL-RSC, and esc-edRVFL), are compared against three different variants of randomized neural networks (OPE-RVFL, RVFLNN, and ELM), as done in [16], and a backpropagation-based multilayer perceptron (MLP) with 15 layers. In the second phase of the experiment, the top two performers among the newly proposed randomized neural network algorithms (in the first phase) are further compared against several state-of-the-art randomization methods. The details of the datasets are provided in the Appendix in the supplementary material. It is worth noting that all the datasets used in our experiments come from different real-world fields [40]: housing price, abalone quality, automobile type and price, noise level, etc.

1) Phase 1: The maximum number of hidden layers of the network is capped at L_max = 35. The C parameter is varied as C = 2^k with k ∈ [0, 20] (λ in (4) and (5) is equal to 1/C).
Four different activation functions are tested: ReLU, Tansig, Sigmoid, and Tribas. The value of the DB coefficient is varied in the range of 0.05 to 3.0. Layer normalization is applied to the hidden layers guided by the validation RMSEs: it is added to the model to normalize the input to each of the layers if and only if the validation RMSE shows that it helps to train a better model and yields more accurate predictions than a non-normalized model. Equation (12) shows the formula of the normalization being used. The importance of normalization to a model will be further discussed in Section VII.

As for the MLP, the parameters available for tuning are the batch size, the fraction rate of the dropout layer, and the arguments of the Adam optimizer. The batch size for a dataset with more than 200 instances is set to 32 instances per batch; otherwise, it is set to ten instances per batch. The fraction rate of the dropout layer is fixed at 0.2. Finally, the learning rate and the exponential decay rates for the two moment estimates are set to 0.001, 0.9, and 0.999, respectively.

Five-fold cross-validation is performed on 70% of the data during the training stage. The trained model is then used to make predictions on the reserved 30% test data and to compute the test RMSE. This process is repeated for 100 independent trials, and the mean of the 100 root mean square errors is computed and tabulated in Table III. The RMSE is defined as

RMSE = √((1/n) Σ (yᵢ − ŷᵢ)²)

where n is the number of test instances, yᵢ is the ground truth, and ŷᵢ is the prediction. A lower RMSE implies that the predictions made by a model are closer to the ground truth. Table I shows the equations of the activation functions used in this experiment.
2) Phase 2: Our focus in this part is to compare our best three algorithms from phase 1 against ten other recent algorithms. Following the experimental design in [41], the input attributes of each dataset are rescaled into the interval [−1, 1], while the target value is rescaled into the interval [0, 1]. The experiment is performed over ten runs of five-fold cross-validation (50 trials in total). For each run, one partition is used for testing, another one for validation, and the rest are used for training the model. In this phase, the number of neurons in each hidden layer is chosen from [32, 512], while the maximum number of layers is set to L_max = 50. The remaining hyperparameters are varied according to the ranges stated in Table II.

3) Parameter Sensitivity Analysis: In this phase, we perform ablation experiments and parameter sensitivity analysis. We first conduct a targeted study on the deep boosting methods, especially on the influence of BF, and then we consider the impact of other hyperparameters. The experiments include: 1) the improvement brought by the deep boosting strategy to different networks, including edRVFL-SC and edRVFL-RSC; 2) the relationship between BF and the number of neurons; and 3) the effect of different numbers of layers on the final result.
The datasets are selected from the datasets used in Phase 2, which are UCI datasets. The search ranges of the hyperparameters are the same as in Phase 1, shown in Table II, while we use grid search to sweep all possible choices in this phase.
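A minimal sketch of the grid search used to sweep the hyperparameter ranges quoted above is shown below; the grid values follow the text where stated, and train_eval is a placeholder that returns the validation RMSE for a given configuration.

import itertools
import numpy as np

# Illustrative search grid following the ranges quoted in the text; the full grid (Table II)
# is given in the supplementary material.
grid = {
    "C": [2.0 ** k for k in range(0, 21)],                  # lambda = 1 / C
    "activation": ["relu", "tansig", "sigmoid", "tribas"],
    "bf": np.arange(0.05, 3.05, 0.05),                      # DB coefficient range
}

def grid_search(train_eval):
    """Sweep all combinations; train_eval(params) -> validation RMSE (placeholder callable)."""
    best_params, best_rmse = None, np.inf
    for C, act, bf in itertools.product(grid["C"], grid["activation"], grid["bf"]):
        params = {"lam": 1.0 / C, "activation": act, "bf": bf}
        val_rmse = train_eval(params)
        if val_rmse < best_rmse:
            best_params, best_rmse = params, val_rmse
    return best_params, best_rmse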

VI. RESULTS

A. Phase 1
In this phase, we mainly compare the differences between the RVFL series of methods and the basic ELM method. The dRVFLs are compared against the three different randomized neural networks stated in [16] and a backpropagation-based MLP. The test RMSEs of the algorithms on the 29 UCI datasets are tabulated in Table III. The values in Table III are the average RMSEs of 100 tests. Parametric and nonparametric statistical tests are adopted to conduct pairwise comparisons to identify the best algorithm and activation function. From Table III, it can be clearly seen that the dRVFLs outperform the randomized neural networks in [16] in all 29 datasets. The dRVFLs perform better than OPE-RVFL, RVFL, and ELM by a substantial margin. Using the two-tailed sign test, we have more than 99.99% confidence that dRVFL and dRVFL with BF perform better than the algorithms in [16], while the confidence is 100% for edRVFL, edRVFL-SC, edRVFL-RSC, and esc-edRVFL. Generally, the edRVFL and its variants perform better than the dRVFL.
The null hypothesis of the two-tailed sign test is that the performances of the paired algorithms are equal, and the value of N is 29. Similarly, the null hypothesis of the Wilcoxon signed-rank test is that the median difference between pairs of algorithms is zero. We will reject the null hypothesis at 0.05 significance level (reject if p < 0.05). Detailed discussion on dRVFL with BF, edRVFL-SC, edRVFL-RSC, and esc-edRVFL will be presented in the next section.
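The two pairwise tests described above can be reproduced, for example, with SciPy as sketched below; the per-dataset RMSE arrays and the handling of ties are illustrative choices.

import numpy as np
from scipy.stats import wilcoxon, binomtest

def pairwise_tests(rmse_a, rmse_b, alpha=0.05):
    """Two-tailed sign test and Wilcoxon signed-rank test on per-dataset RMSEs of two algorithms."""
    diff = np.asarray(rmse_a) - np.asarray(rmse_b)
    wins_b = int(np.sum(diff > 0))                 # datasets where algorithm B has the lower RMSE
    n = int(np.sum(diff != 0))                     # ties are dropped in the sign test
    sign_p = binomtest(wins_b, n, 0.5, alternative="two-sided").pvalue
    wil_stat, wil_p = wilcoxon(rmse_a, rmse_b)
    return {"sign_p": sign_p,
            "wilcoxon_p": wil_p,
            "reject_null": {"sign": sign_p < alpha, "wilcoxon": wil_p < alpha}}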

Algorithm 3: Pseudo-Code of esc-edRVFL
Input: Training data denoted as X
    Initialize N models on X.
    for each model do
        Train an edRVFL-RSC network.
        Compute the validation RMSE.
    return the K models with the lowest validation RMSEs (K ≤ N)
Input: K models, test data, ground truth
    for each model do
        Feed in the test data and generate a prediction.
    Aggregate the K different sets of predictions together using the median.
    Compute the test RMSE with the aggregated predicted values and the ground truth.
    return test RMSE

We compare the difference in performance between dRVFLs and the backpropagation-based MLP. Table III shows that dRVFLs have better performance than MLP on most of the datasets. dRVFL, edRVFL, dRVFL with BF, edRVFL-SC, and edRVFL-RSC outperform the MLP in 18 out of 29 datasets, while esc-edRVFL outperforms MLP in 22 out of 29 datasets. Comparisons are conducted based on RMSE values rounded to six decimal places. This implies that dRVFLs have a good learning capability that is on par with or better than MLP. Using both the two-tailed sign test and the Wilcoxon signed-rank test, we confirm that esc-edRVFL is significantly better than MLP at a significance level of 0.05; the details of the Wilcoxon signed-rank test are provided in Table IV.

Algorithm 3 in Section IV-D shows that the esc-edRVFL is validated on N different validation sets for N-fold cross-validation. Using N different validation sets for N different models during the training and validation stage allows the N models to learn independently and encourages the models to form characteristics that are distinct from each other. The diversity of the N trained models is beneficial to the final ensemble model, as the final model can generalize better and capture the data variations encountered in the testing phase, which in turn results in better performance. Table V shows that esc-edRVFL with N different validation sets performs better than esc-edRVFL with a single validation set.
Another noteworthy observation concerns the pattern of the optimal hyperparameters of the dRVFLs. dRVFL and its variants generally require a lower value of the ridge regression parameter (C) than edRVFL. We also identified that Sigmoid is the best activation function for all six dRVFL algorithms. The following relationship shows the comparison between activation functions (> stands for "is better than" or "is more preferable" in this context): Sigmoid > Tansig > ReLU > Tribas.

B. Phase 2
In this phase, we compare the proposed methods against some other randomized neural networks. First, we compare the differences between our proposed methods and the ELM series. The results of the comparison are tabulated in Table VI. Table VII shows that esc-edRVFL, edRVFL-SC, and edRVFL-RSC are significantly better than all the ELM-based algorithms. We also explore the performance differences between our proposed approaches and the SCN-based methods. Table VIII shows that our method is superior to both the basic SCN [21] and its recent variant, bidirectional SCN [51]. We also compare with the fuzzy broad learning system (Fuzzy BLS) [52] approach in Table IX and experimentally demonstrate that our algorithm is significantly better.
Furthermore, we compared the performance of two classical machine learning methods on these datasets: support vector regression (SVR) and K-neighbors regression (KNR). As can be seen, the randomized networks generally outperform the classical machine learning algorithms.
We also conduct a nonparametric statistical comparison of the algorithms using the Friedman test, and the details are shown in the Appendix in the supplementary material.

Regarding the effect of depth, we can observe that the performance of the model gets progressively better as the depth increases in the early stages; however, after the network reaches 30 layers, the performance of the network stops improving or decreases. The reason for this is that the random weights of dRVFL introduce an excessive amount of noise into the information propagated to the deeper layers, making it difficult for the excessively deep parts to effectively capture feature information, thereby degrading the overall network performance.

VII. DISCUSSION
As mentioned earlier, all the weights in a randomized neural network are randomly generated and kept constant throughout the whole training process. This has led to a lot of debate and doubt about the learning capabilities of such networks. The work by Giryes et al. managed to shed some light on this long debate. Data fed into any neural network at the input stage can be adequately viewed as points with different angles in a multidimensional space. Giryes et al. [53] showed that, for a classification problem, the random weights in each layer of a neural network distort the Euclidean distances between different classes of input data proportionally to the angle of the input points. Thus, the smaller the angle of the input, the greater the shrinkage of the distance, and vice versa. Larger separation among classes makes the data easier to classify. The model is able to learn useful patterns from the data by prioritizing their distorted angles.
However, excessive distortion may even reduce the network's performance. Due to the nature of RVFL, the randomly generated weights may contain much redundant and invalid information, and a proper neuron elimination approach may mitigate excessive distortion effects, such as selecting neurons according to the importance of each neuron. In some classification tasks, ranking neurons according to the magnitude of their weights and sequentially discarding low-weight neurons can effectively improve the accuracy of the network.
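A simple illustration of magnitude-based neuron pruning of this kind is sketched below; it mirrors the idea described here rather than the exact procedure of any cited work, and the keep ratio is an illustrative choice.

import numpy as np

def prune_neurons(H, beta, keep_ratio=0.8):
    """Illustrative magnitude-based pruning of enhancement neurons.

    Neurons are ranked by the magnitude of their output weights in beta and the
    lowest-ranked ones are discarded; the readout should then be re-solved on the
    reduced feature matrix.
    """
    importance = np.abs(beta).sum(axis=1) if beta.ndim > 1 else np.abs(beta)
    n_keep = max(1, int(keep_ratio * H.shape[1]))
    keep = np.argsort(importance)[-n_keep:]        # indices of the most important neurons
    return H[:, keep], keep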
At the same depth of randomized networks (equal L), the ensemble strategy has a significant advantage in terms of both time complexity and space complexity, which means that edRVFL is generally faster than dRVFL in terms of training cost. In the dRVFL, as we concatenate all intermediate outputs to calculate the final output weight β, the weight matrix is huge, with β having d + LN rows, and the algorithm to obtain the required matrix inverse needs either O((d + LN)³) time or O((d + LN)²) memory, which contradicts the simple and low-consumption concept of the RVFL series. In the edRVFL, each sublayer requires an output weight β_l, but this weight matrix only has d + N rows, and the computation of all β_l requires only O(L(d + N)³) time. The testing stage does not involve matrix inversion, but only matrix multiplication, which is why testing is always much faster than training. Our results show that the training time of RVFLs is much shorter than that of MLP. This is expected, as the design and internal computation of dRVFLs are generally much simpler than those of MLP. Unlike MLP, dRVFLs do not use backpropagation and gradient descent to fine-tune the weights of the hidden layers. This explains why dRVFLs are much faster than the MLP in both the training and testing phases.
The addition of the newly proposed deep boosting coefficient is inspired by the notion that the features extracted in the lower layers are diminished gradually as they are passed on to the higher layers. To counteract this diminishing effect, we introduced the deep boosting coefficient (e_b, b > 0) to "boost" the features extracted in the previous layers. This boosting technique has successfully improved the overall performance of the dRVFL variants.
The result presented in the previous section shows that the addition of skip connections between different hidden layers improves the performance of edRVFL by a large margin in most of the datasets. The skip connections provide extra paths for more information to flow from the lower layers to the upper layers. In a deep neural network, different types of information will be extracted at different levels of layers. The skip connections coming from the lower layers to the upper layers are beneficial to the model as the model can now generate more diverse features leading to a better decision-making capability.
Versatility and extendability are also features of our approach. For ensemble strategies and skip connection methods, they can be applied to any type of randomized neural network with direct links to improve performance [10], [54]. Although the datasets we used are drawn from open-source UCI repository [40], the raw signals are collected from different fields in the real world. Furthermore, our approach can also be applied to other practical areas to solve the general regression problems, such as tidal turbine vibration [55], lateral force [56], and driver workload prediction [57].

VIII. CONCLUSION
In this article, we presented a detailed evaluation of the deep RVFL (dRVFL) and ensemble deep RVFL (edRVFL), which were proposed recently in [10], on regression problems. In addition, we introduced four novel dRVFL variants (dRVFL with BF, edRVFL-SC, edRVFL-RSC, and esc-edRVFL), including one deep boosting method. Through a comprehensive evaluation, we concluded that the dRVFLs perform better than the other randomized neural networks in most circumstances. Furthermore, we compared the dRVFLs with MLPs and found that the performance of the dRVFLs is on par with or better than that of MLP. The newly proposed variants show significant improvement when compared with the original dRVFL variants. The findings of this article can serve as a guideline for designing a proper dRVFL network for solving regression problems. Different ensemble approaches will be investigated in the future to produce more accurate final predictions, and the introduction of additional neural networks to process the outputs of the subnetworks, in addition to the usage of the median, will also be investigated. While this work only focuses on regression tasks, our methods may also be migrated to classification tasks: by substituting multiple output neurons for the single output neuron, the model could handle classification tasks. The exploration of the method for classification tasks and the application of our algorithms to other practical problems are promising future directions.
ACKNOWLEDGMENT

Open Access funding provided by the Qatar National Library.