
Experimental Study on 164 Algorithms Available in Software Tools for Solving Standard Non-Linear Regression Problems



Abstract:

In the specialized literature, researchers can find a large number of proposals for solving regression problems that come from different research areas. However, researchers tend to use only proposals from the area in which they are experts. This paper analyses the performance of a large number of the regression algorithms available in some of the best-known and most widely used software tools, in order to help non-expert users from other areas properly solve their own regression problems and to help specialized researchers develop well-founded future proposals by properly comparing and identifying algorithms that will enable them to focus on significant further developments. To sum up, we have analyzed 164 algorithms that come from 14 main different families available in 6 software tools (Neural Networks, Support Vector Machines, Regression Trees, Rule-Based Methods, Stacking, Random Forests, Model Trees, Generalized Linear Models, Nearest Neighbor methods, Partial Least Squares and Principal Component Regression, Multivariate Adaptive Regression Splines, Bagging, Boosting, and other methods) over 52 datasets. A new measure has also been proposed to show the goodness of each algorithm with respect to the others. Finally, a statistical analysis using non-parametric tests has been carried out over all the algorithms and over the best 30 algorithms, both with and without bagging. Results show that the algorithms from the Random Forest, Model Tree and Support Vector Machine families obtain the best positions in the rankings produced by the statistical tests when bagging is not considered. In addition, the use of bagging techniques significantly improves the performance of the algorithms without an excessive increase in computational time.
Published in: IEEE Access ( Volume: 7)
Page(s): 108916 - 108939
Date of Publication: 05 August 2019
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Regression is one of the most classic statistical techniques for predictive data mining [1]. Regression consists of designing, from available training data, a model that predicts the value of a continuous output variable given new values of a set of input variables. Nowadays, a large number of proposals have been published for solving regression problems such as financial forecasting [2], marketing [3] or drug response modeling [4], among others.

In the specialized literature, researchers can find proposals that come from different areas of research. When new approaches are published in any of these areas, researchers usually tend to use the same category of algorithms historically applied in the area of research in which they are experts, probably due to their partial knowledge of the available algorithms. In addition, this problem is made worse because only a few researchers make the software and/or source code associated with their proposals public, and sometimes authors provide vague or even ambiguous descriptions in the specialized literature. This issue, along with the high complexity of some proposals, makes the widespread use of these algorithms difficult. With the aim of tackling these drawbacks, a great effort has been made by the data mining research community, and a large number of regression algorithms have been included in well-known and widely used software tools, such as Matlab [5], R [6] and Weka [7], [8], among others.

The main objective of this paper is to analyze the performance of a large number of regression algorithms in order to help both non-expert users and specialized researchers. In this sense, non-expert users from other areas could properly solve their own regression problems, and specialized researchers could develop well-founded future proposals by properly comparing and identifying algorithms that will enable them to focus on significant further developments. To accomplish this, we have analyzed 164 regression algorithms that come from 14 different families (Neural Networks, Support Vector Machines, Regression Trees, Rule-Based Methods, Stacking, Random Forests, Model Trees, Generalized Linear Models, Nearest Neighbor methods, Partial Least Squares and Principal Component Regression, Multivariate Adaptive Regression Splines, Bagging, Boosting, and Other Methods) and that are available in the software tools Java Statistical Analysis Tool (JSAT) [9], KEEL [10], Matlab [5], R [6], Scikit-learn [11] and Weka [7], [8]. Moreover, a new measure is also presented to assess the goodness of an algorithm with respect to the rest of the analyzed algorithms on each dataset.

Notice that we intend to focus on algorithms that are correctly implemented and publicly available (hence we consider reference algorithms such as those included in the mentioned software tools). These algorithms have already been tested by many users, or even by the authors themselves, and corrected by experts when problems appeared, so we can consider their implementations reliable. Also, it is not intended to cover Big Data problems or problems with strong hardware requirements, which are not usually available to all users, but standard regression problems that, even without such requirements, are still difficult to solve. Of course, there are more recent algorithms in the specialized literature [12]–[14], but they are still not included in the said software tools, so for non-expert users (without programming abilities) it is somewhat difficult to implement and apply them. Even though we recommend their consideration when possible, we will focus here only on those available in the mentioned software tools. The underlying idea is therefore to provide guidance on the algorithms available in some of the most well-known and widely used software tools, so that any non-specialized researcher, student, company, etc., that needs to use these algorithms can know which algorithms are available and what can be expected from their application. Likewise, we would also like to ease further well-founded comparisons and studies from the specialized research community.

In order to assess the performance of these algorithms, an experimental study has been performed collecting 52 real-world datasets, with a number of variables within the interval [2, 60] and a number of examples within the interval [43, 45730]. Moreover, a new absolute metric, together with different quality categories based on its domain, is proposed in this contribution for assessing the goodness of regression algorithms over different datasets. We have also developed a two-part study. First, we have studied the performance of all the algorithms over the 52 datasets. Second, we have compared the performance of the 30 best algorithms when bagging is applied against the 30 best algorithms without considering bagging, in order to analyze the influence of bagging on those algorithms for which the software tools allow us to apply it. In both studies, we have used non-parametric statistical tests for multiple comparison [15], [16] over the average performance values obtained on the 52 datasets. Additionally, we analyze the variety of data and the different algorithms’ behavior in relation to the curse of dimensionality, as well as the tuning of the algorithmic parameters by nested cross-validation when using the “train” function from caret in R, in order to provide some further insights into the behavior of the most promising approaches.

Please take into account that we do not try to discard any algorithm based on the results obtained, but only to find possible potentialities. A model that explains a certain situation well may fail in another situation. The “No Free Lunch” theorem states that “there is no one model that works best for every problem” [17], and of course this is still true after our particular study.

Finally, a web page associated with this paper (http://www4.ujaen.es/~mgacto/regression/study/), which contains complementary material to this study, has also been developed. It includes the datasets collected and used in this study (the 5-fold cross-validation partitions) together with the generated results (errors and times) per algorithm and dataset (164 × 52), which can be found in a downloadable spreadsheet. Furthermore, it also includes the complete results by types of datasets on the 164 algorithms for the curse of dimensionality study. These public materials will ease further well-founded comparisons and studies from the specialized research community.

This paper is organized as follows. The next section describes the set-up of the experimental study considered in this paper and proposes a new absolute metric and quality categories for assessing regression algorithms goodness over different datasets. Section III analyzes and discusses the obtained results. Finally, in Section IV we draw some conclusions.

SECTION II.

Experimental Setup

Several experiments have been performed to evaluate the performance of the analyzed algorithms. In the following, we first show the datasets used in the experimental study; second, we introduce a new quality measure proposed for this kind of study; third, we present the widely known and used software tools that include the public regression algorithms analyzed in this contribution; fourth, we give a brief description of the studied algorithms and their configurations; and finally, we describe the statistical analysis performed in this study.

A. Datasets

The experiments have been carried out over 52 real-world datasets available in well-known public repositories, with a number of variables within the interval [2, 60] and a number of examples within the interval [43, 45730]. These datasets have been downloaded from the following repositories: the UCI Machine Learning Repository [18] (https://archive.ics.uci.edu/ml/datasets.html?format=&task=reg&att=&area=&numAtt=&numIns=&type=&sort=nameUp&view=table), the KEEL-dataset repository [19] (http://sci2s.ugr.es/keel/category.php?cat=reg), the Dataset Collections of Weka [7], [8] (http://www.cs.waikato.ac.nz/ml/Weka/datasets.html), the Delve Datasets [20], [21] (http://www.cs.toronto.edu/~delve/data/datasets.html), the Luis Torgo Repository [22] (http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html), and the Journal of Statistics Education Data Archive [23] (http://ww2.amstat.org/publications/jse/jse_data_archive.htm).

We have included all the available standard regression datasets from these repositories. To the best of our knowledge, no previous study has covered this many standard regression datasets, since it is quite difficult to find them publicly available (28 being the highest number considered in a particular comparison known to us to date [24]). Table 1 summarizes the main characteristics of the datasets, where Name is the short name, Var is the number of input variables, and Examples is the number of examples.

TABLE 1. Datasets Used for the Experimental Study

In all the experiments, we adopted a 5-fold cross-validation model, i.e., we randomly split each dataset into 5 folds, each containing 20% of the examples, where four folds have been used for training and one for testing. These datasets and their 5-fold cross-validation partitions are available on the complementary material web page associated with this paper (http://www4.ujaen.es/~mgacto/regression/study/). Finally, for each of the five partitions, we executed three trials of the algorithms (with the same 3 seeds for all), of course, only for non-deterministic approaches.
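For illustration, the following minimal sketch reproduces this evaluation protocol in Python with scikit-learn. This is our own illustration, not the authors’ original scripts: the estimator is a placeholder, and re-seeding is only applied when the method exposes a random_state parameter (i.e., is potentially non-deterministic).

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def average_test_mse(estimator, X, y, seeds=(1, 2, 3)):
    """5-fold CV; non-deterministic estimators are averaged over three seeds."""
    kf = KFold(n_splits=5, shuffle=True, random_state=0)  # fixed partitions
    fold_mses = []
    for train_idx, test_idx in kf.split(X):
        trial_mses = []
        for seed in seeds:
            model = clone(estimator)
            # Re-seeding only matters for stochastic methods.
            if "random_state" in model.get_params():
                model.set_params(random_state=seed)
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            trial_mses.append(np.mean((pred - y[test_idx]) ** 2))
        fold_mses.append(np.mean(trial_mses))
    return float(np.mean(fold_mses))  # average test MSE over the 5 folds
```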

B. Quality Measures Considered: RegM Proposal

To evaluate each algorithm we have used the well-known Mean Square Error (MSE):
\begin{equation*} MSE = \frac{1}{N} \sum_{l=1}^{N} \left(alg(x^{l})-y^{l}\right)^{2},\tag{1}\end{equation*}
where $N$ is the number of examples of the dataset, $alg(x^{l})$ is the output of the model generated by the algorithm when the $l$-th example is considered, and $y^{l}$ is the known desired output.
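For reference, Eq. (1) translates directly into a few lines of numpy (a helper of ours, not code from the paper):

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean Square Error, Eq. (1): average squared deviation from the target."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    return float(np.mean((y_pred - y_true) ** 2))
```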

In regression problems, the average MSEs obtained by an algorithm may not represent its real performance magnitude when compared to another algorithm, since the domain of the output variable is different for each dataset. Therefore, the MSE is not an absolute value: it depends on the range of each dataset's outputs (the estimated continuous values). Normalizing the MSE could be a solution, but it is very difficult to know the minimum and maximum possible MSE values for each given dataset. However, from the results in this paper we have not only good estimations of the minimum and maximum possible MSE, but also a distribution of 164 values per dataset, which becomes a well-supported sample representation of what we can expect from regression algorithms.

Based on this distribution for each of the addressed datasets, we have also proposed a new absolute measure (RegM) in order to obtain a comparable average goodness value for the results obtained by an algorithm on different datasets (as in classification, where we can easily compute the average correct classification percentage, from 0 to 100%). To this end, we take the median of the 164 MSE values as the MSE expected from a normally (at or above 50% of the studied algorithms), or at least reasonably, well-performing algorithm (i.e., one that is neither particularly good nor bad). We fix this median value as the 50% performance score (on a scale from 0 to 100%).

Since we also have the best and worst MSEs, we could now define intervals for normalization. However, while using the range from the median MSE to the best MSE seems appropriate (these values are determined by well-performing algorithms), using the range from the median to the worst MSE would be somewhat like throwing dice, since any badly performing algorithm can obtain unexpectedly high errors. In this sense, we have defined as the appropriate normalization interval twice the difference between the best MSE and the median MSE obtained for the dataset. That is, we fix the measurable loss of performance as equal to the possible improvement. See the top of Figure 1 for a graphical representation of this interval definition. Taking this into account, the interval is defined as:
\begin{equation*} Interval = 2\left(MSE^{d}_{Median}-MSE^{d}_{Best}\right)\tag{2}\end{equation*}

where $MSE^{d}_{Best}$ is the best MSE obtained by the analyzed algorithms on dataset $d$, and $MSE^{d}_{Median}$ is the median of the MSEs obtained by the analyzed set of algorithms on dataset $d$ (concretely, 164 algorithms in this study). Thus, the new measure RegM for the algorithm $alg$ is defined as:
\begin{align*} RegM_{alg}^{d}=&\max \left({0,\; 1 - \frac {MSE^{d}_{alg}-MSE^{d}_{Best}}{Interval}}\right) \tag{3}\\ RegM_{alg}=&\frac {100}{N_{Dat}} \sum _{d=1}^{N_{Dat}} RegM_{alg}^{d}\tag{4}\end{align*}
where $MSE^{d}_{alg}$ is the MSE obtained by the algorithm $alg$ on dataset $d$, and $N_{Dat}$ is the number of datasets (52 datasets in this study).

FIGURE 1. Normalization interval, performance categories, associated symbols and labels, and RegM domains for dataset d.

This measure takes values in the interval [0, 100]. We define four qualitative categories: values in [75, 100] represent algorithms with a very good performance; values in [50, 75) represent algorithms with a good performance; values in [0, 50) represent algorithms with a moderate performance; and values that would fall below 0 (before clamping at 0 in Eq. (3)) represent a bad performance. Figure 1 shows the definition of the performance categories and their interpretation (symbols and labels), together with the associated RegM value domains for a given dataset d. From this study, these domains can be taken as reference values for each of the 52 datasets, thus making it easier to test a new proposal.
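To make the metric concrete, here is a minimal sketch of how RegM can be computed from a matrix of per-dataset MSEs (our own numpy helpers, not code from the paper; it assumes the median MSE is strictly greater than the best one). In principle, the downloadable spreadsheet of per-algorithm, per-dataset errors on the companion web page provides exactly such a matrix.

```python
import numpy as np

def regm_on_dataset(mse_all, mse_alg):
    """RegM of one algorithm on one dataset, Eqs. (2)-(3).

    mse_all: MSEs of all analyzed algorithms on this dataset (164 here);
    mse_alg: MSE of the algorithm being scored. Assumes median > best.
    """
    best, median = np.min(mse_all), np.median(mse_all)
    interval = 2.0 * (median - best)                      # Eq. (2)
    return max(0.0, 1.0 - (mse_alg - best) / interval)    # Eq. (3)

def regm(mse_matrix, alg):
    """Average RegM (0-100) of algorithm `alg`, Eq. (4).

    mse_matrix has shape (n_algorithms, n_datasets).
    """
    scores = [regm_on_dataset(mse_matrix[:, d], mse_matrix[alg, d])
              for d in range(mse_matrix.shape[1])]
    return 100.0 * float(np.mean(scores))
```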

C. Software Used for the Experiments

In this paper we have considered 6 public software tools that include the algorithms analyzed in the experimental study. A brief description of these software tools follows:

  • Java Statistical Analysis Tool (JSAT) [9] is a “library for quickly getting started with Machine Learning problems” written in Java. The library has no external dependencies, and almost all of the algorithms are independently implemented using an Object-Oriented framework. JSAT is suitable for small and medium size problems and it is made available for use under the GPL 3. JSAT version 0.0.9 has been employed.

  • Knowledge Extraction based on Evolutionary Learning (KEEL) [10] is a “Java software tool that can be used for a large number of different knowledge data discovery tasks. KEEL provides a simple GUI based on data flows to design experiments with different datasets and computational intelligence algorithms (paying special attention to evolutionary algorithms). It contains a wide variety of classical knowledge extraction algorithms, preprocessing techniques, computational intelligence based learning algorithms, hybrid models, statistical methodologies for contrasting experiments and so forth”. KEEL is open source for use under the GPL 3. We have used the current KEEL version with date of creation 2018-04-09.

  • MATLAB (MATrix LABoratory) [5] is a commercial multi-paradigm numerical computing environment developed by MathWorks. This environment allows “matrix manipulations, functions and data plotting, implementation of algorithms (including machine learning methods)”. In this paper, we have used version R2016a. Moreover, we have used several toolboxes implemented by Gints Jekabsons. The toolbox codes are open source regression software for Matlab/Octave and are licensed under the GNU GPL license. These toolboxes are: ARESLab [25] version 1.13.0, M5PrimeLab [26] version 1.7.0, and PRIM [27] version 2.2.

  • Scikit-learn [11] is a library for Machine Learning in Python. “It features a rich number of supervised and unsupervised learning algorithms and builds on NumPy, SciPy, and matplotlib”. Scikit-learn is open source software issued under the BSD 3 license. We have used Scikit-learn version 0.18.1 included in Anaconda 3-4.3.1 (Python distribution 3.6).

  • R [6] is a “free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS”. Many useful R functions come in packages and free libraries of code written by R's active user community. The software can be redistributed and/or modified under the terms of the GNU GPL as published by the Free Software Foundation. It is available at https://cran.r-project.org/. We have considered R version 3.5. Moreover, when possible we use the train function from the caret package in R to set up a grid of tuning parameters for regression routines, to fit each model and to calculate a resampling-based performance measure. This allows learning the best parameter values for an algorithm, for example, the correct value of the k parameter in the k-Nearest Neighbors (kNN) method.

  • Weka [7], [8] is a data mining software tool “including a collection of machine learning algorithms for data mining tasks” developed by the University of Waikato. “The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization”. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License. We have used Weka version 3.9.1.

D. Algorithms and Parameters Considered in the Experiments

In the experiments, 164 algorithms for regression problems available in the software tools JSAT, KEEL, Matlab, R, Scikit-learn, and Weka have been used. Notice that although a few additional algorithms are available in these tools, they have not been included in the experiments since they were not able to run on all the considered datasets due to scalability problems.

These 164 algorithms (actually different ones or some particular implementations of the same ones) have been grouped into 14 families by unifying the tools' own categorizations: Neural Networks (NNET): 21 algorithms; Support Vector Machines (SVM): 16 algorithms; Regression Trees (RT): 17 algorithms; Rule-Based Methods (RL): 9 algorithms; Stacking (STA): 2 algorithms; Random Forests (RF): 10 algorithms; Model Trees (MT): 8 algorithms; Generalized Linear Models (GLM): 29 algorithms; Nearest Neighbor methods (NN): 7 algorithms; Partial Least Squares and Principal Component Regression (PLSR): 4 algorithms; Multivariate Adaptive Regression Splines (MARS): 4 algorithms; Bagging (BAG): 18 algorithms; Boosting (BST): 5 algorithms; and Other Methods (OM): 14 algorithms. In order to facilitate the analysis, we have used the same terminology used by the authors in [28], where a similar study was performed for classification techniques.

A brief description of these algorithms follows (sorted by category and alphabetically). Each algorithm is identified by its name in the software tool in which it is available, followed by the short form of the software tool: JSAT: J; Weka: W; KEEL: K; Matlab: M; Scikit-learn: P; R: R; R using caret (with implicit use of the caret train function): T.

Regarding parameters, we have considered the standard ones recommended by the authors (those included in each tool as default recommended parameters), except in R for those algorithms that are tunable with train. In some cases, the default parameter values proposed by the different software tools for the same algorithm differ. In these cases, we have determined the best parameter value experimentally and set it the same for all the cases. For example, the total number of trees in the Random Forest based algorithms is not 500 by default in all the software tools but only in some of them. Setting all of them to 500 improved their results systematically, without significant improvements beyond this value, so for a fair comparison among the different software tools we fixed it to 500 in all cases. The same applies to the bagging application, where in some cases 25 bags are recommended and in others 50. We fixed it to 50, with no significant changes beyond this value.

Moreover, as previously said, when possible we use the train function from the caret package in R to set up a grid of tuning parameters for regression routines, to fit each model and to calculate a resampling-based performance measure. It involves a nested cross-validation, where only training data is used for the inner cross-validation and the selection of the best parameters, and test data is used later in the final application of these selected parameters. This is applicable to some methods that are external to caret. In these cases, we have included both versions, without algorithmic parameter tuning (as recommended in their own packages) and with algorithmic parameter tuning by train (in order to also check this possibility). However, it must be said that the use of the train function imposes high computational restrictions, due to the time required for tuning the parameters, which slows down the overall operation of the algorithm; in some cases it is impossible to perform this type of parameter tuning since the computational time needed goes beyond what is reasonable (see Sections III-D and III-F, where a computational time analysis and the effects of the tuning of algorithmic parameters are briefly studied).
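The same nested scheme can be sketched outside R. The following Python fragment (our own illustration with scikit-learn, not code from the study; X and y are assumed to be already loaded) mirrors what caret's train does for, e.g., the knn-T configuration, where k is tuned over 3:2:25 on the training folds only:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

outer = KFold(n_splits=5, shuffle=True, random_state=0)
test_mses = []
for train_idx, test_idx in outer.split(X):
    # Inner CV uses ONLY the training fold, mirroring caret's train();
    # the grid tunes k over 3:2:25 as in the knn-T configuration.
    search = GridSearchCV(KNeighborsRegressor(),
                          param_grid={"n_neighbors": list(range(3, 26, 2))},
                          scoring="neg_mean_squared_error", cv=5)
    search.fit(X[train_idx], y[train_idx])  # selects the best k on training data
    pred = search.predict(X[test_idx])      # refit model applied once to test
    test_mses.append(np.mean((pred - y[test_idx]) ** 2))
print(np.mean(test_mses))
```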

Therefore, even though this type of automatic nested cross-validation could be an unbiased way to select the best algorithmic parameters (rather than hand trial and error, where one could check the test error), in the analyzed software tools it is only directly available for a reduced set of algorithms (in order to ease the work of non-expert users, at whom this contribution is particularly aimed, who are not usually able to implement the corresponding scripts or to modify the available implementations). Moreover, the much higher computational cost over 164 × 52 × 5 combinations (algorithms, datasets and folds, and even seeds for non-deterministic methods) makes it impossible to apply to the whole study, i.e., the application of 42,640 nested cross-validations over different combinations of algorithmic parameters.

On the other hand, the brief study on train parameter tuning in Section III-F shows that it only brings slight improvements in general, and that sometimes the test errors (generalization ability) even worsen significantly. This is the problem when we are prevented from checking test errors while tuning or developing a method on a given single problem: we do not know whether it is overfitting or not.

For all these reasons, standard parameters are in fact recommended for this type of study [15], [16]. In our case, we apply the standard ones, which also represents the real situation in which non-expert users need to apply data mining techniques to the problems they face in their respective areas, in order to show a general (of course not perfect) estimation of how they perform, which is one of the main objectives of this contribution.

1) Neural Networks (NNET): 21 Algorithms

  1. avNNet-T from the caret package, creates a committee of multi-layer perceptrons (MLPs) from the nnet package (the number of MLPs is given by parameter repeat) trained with different random weight initializations. The tunable parameters are the #hidden neurons (size) in {1, 3, 5} and the weight decay (values {0, 0.1, 10^{-4} }). This low number of hidden neurons is to reduce the computational cost of the ensemble.

  2. BackPropagationNet-J is an implementation of a feed forward neural network trained by back propagation. NNets are powerful classifiers and regressors, but can suffer from slow training time and overfitting.

  3. elm-M is an extreme learning machine [29] implemented in Matlab using the code freely available on the ELM web page (http://www3.ntu.edu.sg/home/egbhuang/elm_codes.html), using a sigmoidal activation function and 20 hidden neurons.

  4. elmNN-R [29], [30] trains a generic single-hidden-layer feedforward neural network using the ELM algorithm, from the elmNN package.

  5. EnsembleR-K [31] is an ensemble neural network for regression problems. The method employs an ensemble construction based on the use of nonlinear projections to achieve both accuracy and diversity of individual regressors. It also uses the philosophy of boosting for difficult instances.

  6. iRProp+-K [32] is a regression model by means of product unit neural networks or multilayer perceptrons trained with the iRProp+ algorithm.

  7. MLP-BP-K [33] is an MLP for regression problems, with back-propagation as learning technique. The networks apply a sigmoid function as an activation function.

  8. mlp-R creates an MLP and trains it with backpropagation, using the RSNNS package.

  9. MLPRegressor-P produces an MLP regressor which optimizes the squared loss using the stochastic gradient-based optimizer proposed by Kingma [34].

  10. mlpWeightDecay-T trains MLP networks using caret to access the RSNNS package with #hidden neurons and the weight decay parameter tuning.

  11. MultilayerPerceptron-W is an MLP network with sigmoid hidden neurons, unthresholded linear output neurons, learning rate 0.3, momentum 0.2, 500 training epochs, and #hidden neurons equal to (#inputs)/2.

  12. newff-M creates a feed-forward backpropagation network implemented in Matlab, with the number of hidden neurons in 3:3:30. It uses Matlab v. 7.9.0.529 (R2009b) with Neural Network Toolbox v. 6.0.3.

  13. NNEP-K [33] consists of obtaining the neural network architecture and simultaneously estimating the weights of the model coefficients with an algorithm of evolutionary computation.

  14. nnet-R [35] fits a single-hidden-layer neural network, possibly with skip-layer connections, using the nnet package, considering 10 hidden neurons.

  15. nnet-T uses caret as interface to function nnet in the nnet package, training an MLP network. The tunable parameters are the #hidden neurons (size) with 1:2:9 and the weight decay values {0,0.1,0.01,0.001,0.0001}.

  16. pcaNNet-T trains the MLP using caret and the nnet package, running principal component analysis (PCA) on the data set beforehand. The tunable parameters are the size with 1:2:9 and the weight decay values {0, 0.1, 0.01, 0.001, 0.0001}.

  17. rbf-R creates a radial basis function (RBF) network in the RSNNS package considering default values. The number of hidden neurons takes values 5 or 3 (for smaller datasets) depending on the datasets.

  18. rbfDDA-R [36] incrementally creates an RBF network from scratch with dynamic decay adjustment (DDA), in the RSNNS package.

  19. RBFNet-J produces an RBF neural network which uses K-means to select the RBF centers, using 25 clusters (hidden neurons).

  20. RBFNR-K [36] builds a RBF neural network composed of one hidden layer and one output layer. This hidden layer contains neurons, each one being activated when the input to the network falls close to a point that is considered the center of that neuron. The final result of the network is provided by the neurons of the output layer that perform a weighted sum using the outputs coming from hidden neurons.

  21. RBFRegressor-W implements a normalized gaussian RBF network. It uses the k-means clustering algorithm to provide the basis functions.

2) Support Vector Machines (SVM): 16 Algorithms

  1. DCD-J implements Dual Coordinate Descent (DCD) [37], [38] training algorithms for a Linear L1 or L2 SVM for binary classification and regression (in our case, we use the default L1), without the shrinkage optimization.

  2. DCDs-J creates a linear SVM trained by DCD.

  3. EPSILON-SVR-K builds regression models by means of EPSILON-SVM [39] in libSVM [40] library.

  4. fitrsvm-M fits an SVM regression model on a low- through moderate-dimensional predictor dataset.

  5. ksvmEpsilon-R uses the function ksvm [41] in the kernlab package with epsilon regression.

  6. ksvmNu-R uses the function ksvm [41] (kernlab package) with Nu regression.

  7. LibLINEAR-R [42] creates linear predictive models based on the LIBLINEAR C/C++ library, in the LibLinear package, with type 11. We tested three values for the type parameter:

    • 11 L2-regularized L2-loss support vector regression (primal)

    • 12 L2-regularized L2-loss support vector regression (dual)

    • 13 L2-regularized L1-loss support vector regression (dual)

    The best result was obtained with type 11, which we finally use in this contribution.

  8. LinearSVR-P is a scalable linear SVM for regression implemented using liblinear [40].

  9. NuSVR-K creates a regression model by means of NU-SVM [39] based on the libSVM [40] library.

  10. NuSVR-P is an SVM for regression implemented with libsvm with a parameter to control the number of support vectors.

  11. SMOreg-W [43] is an SVM for regression. The parameters are learned using RegSMOImproved with C=1 and polynomial kernel. RegSMOImproved learns SVM using Sequential Minimal Optimization (SMO) with adaption of the stopping criterion.

  12. svm-R creates an SVM with the kernel used in training and predicting by radial basis, using the library LibSVM [40] in the e1071 package.

  13. svmLinear-R [40] uses the function SVM (e1071 package) with linear kernel.

  14. svmPoly-R [40] uses the e1071 package to create a SVM with polynomial kernels.

  15. svmSigmoid-R [40] trains an SVM with sigmoid kernel in e1071 package.

  16. SVR-P implements an epsilon-support vector regression based on libsvm [40] library.

3) Regression Trees (RT): 17 Algorithms

  1. ctree-T uses the function ctree [44], [45] in the party package, which creates conditional inference trees by recursive partitioning for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework. The threshold in the association measure is given by the parameter mincriterion, tuned with the values 0.1:0.11:0.99 (10 values).

  2. ctree2-T uses the function ctree tuning the maximum tree depth with values up to 10.

  3. DecisionStump-W is a one-node regression tree which develops classification or regression based on just one input using entropy.

  4. DecisionTree-J is a generic implementation, allowing the ability to mimic the behavior of many tree algorithms such as C4.5 [46] (for classification) and CART [47] (regression).

  5. DecisionTreeReg-P [47], [48] is a simple decision tree regressor. It creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

  6. ExtraTreesReg-P [49] implements a meta estimator that fits a number of randomized regression trees (extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. We consider 500 as the number of trees in the forest.

  7. fitrtree-M [47], [50], [51] fits a binary regression decision tree.

  8. quantregForest-R [52] infers conditional quantile functions from data based on previously obtained quantile regression forests. Included in quantregForest package.

  9. RandomSubSpace-W [53] trains multiple REPTrees regressors selecting random subsets of inputs (random subspaces) to obtain a decision tree based classifier. Each REPTree is learnt using information gain/variance and error-based pruning with backfitting. Each subspace includes the 50% of the inputs. The minimum variance for splitting is 0.001, with at least 2 patterns per leaf.

  10. RandomTree-J is a regression tree that chooses a random subset of features at each iteration to consider.

  11. RandomTree-W is a non-pruned tree where each leaf tests log_{2} (#inputs+1) randomly chosen inputs, with at least 2 instances per leaf, unlimited tree depth and without backfitting.

  12. REPTree-W learns a fast pruned regression tree using information variance and Reduced Error Pruning (REP). It uses at least 2 training patterns per leaf, 3 folds for reduced error pruning and unbounded tree depth. The minimum proportion of the variance on all the data for splitting is 0.001.

  13. rpart-R [47] implements CART method using the function rpart in the rpart package, which develops recursive partitioning.

  14. rpart-T uses the same previous function tuning the complexity parameter (threshold on the accuracy increasing achieved by a tentative split in order to be accepted) with 6 values from 0.1 to 0.11.

  15. rpart1SE-T trains CART using caret and the rpart package with no tuning parameters.

  16. rpart2-T uses the function rpart by tuning the tree depth with values up to 10.

  17. tree-R [35], [47] grows a tree by binary recursive partitioning using the response in the specified formula and choosing splits from the terms of the right-hand-side.

4) Rule-Based Methods (RL): 9 Algorithms

  1. ConjunctiveRule-W uses a single rule whose antecedent is the AND of several antecedents and whose consequent is the mean for a numeric value. If a test instance is not covered by this rule, it is predicted using the value of the training data not covered by the rule. This learner selects an antecedent by computing the information gain of each antecedent and prunes the generated rule using REP.

  2. DecisionTable-W [54] is a simple decision table majority regressor which uses BestFirst as search method.

  3. GFS-GPG-K [55] is a fuzzy learning based on genetic programming grammar operators.

  4. GFS-GSP-K [56] implements a symbolic fuzzy learning based on genetic programming grammar operators and simulated annealing.

  5. GFS-SAP-Sym-K [56] is a symbolic fuzzy-valued data learning based on genetic programming grammar operators and simulated annealing.

  6. GFS-SP-K [55] produces fuzzy rule learning grammar-GP based operators and simulated annealing-based algorithm.

  7. PRIM-M implements the Patient Rule Induction Method (PRIM) [57] included in the PRIM toolbox [27]. This method is for finding “interesting” regions (bump hunting) in high-dimensional data. The regions are described by hyper-rectangles (boxes) containing simple decision rules.

  8. WM-K [58] implements the fuzzy rule learning Wang-Mendel algorithm for generating fuzzy rules by learning from examples.

  9. ZeroR-W predicts the mean for all the test patterns. Obviously, this regressor gives low accuracies, but it serves to give a lower limit on the accuracy.

5) Stacking (STA): 2 Algorithms

  1. Stacking-J is a stacking ensemble [59]. Stacking learns several base classifiers and a top level classifier learns to predict the target based on the outputs of all the ensemble models. A linear model is used, which translates to learning a weighted vote of the regressor outputs.

  2. Stacking-W is a stacking ensemble [59] using ZeroR as meta and base regressors.

6) Random Forests (RF): 10 Algorithms

  1. cforest-R is a version of random forest and bagging ensemble of conditional inference trees (ctrees) aggregated by averaging observation weights extracted from each ctree. The parameter mtry takes the value 1 with 500 trees. It uses the caret package to access the party package (no algorithmic parameter tuning is performed).

  2. RandomForest-J [60] creates a collection of random trees with 500 trees.

  3. randomForest-R creates a random forest [60] ensemble using the randomForest function in the randomForest package, with parameters ntree = 500 (number of trees in the forest) and mtry=#inputs.

  4. RandomForest-W implements a forest of RandomTree base classifiers with 500 trees (except in 2dplanes and casp datasets with 300 trees since the method causes memory problems), the number of randomly chosen attributes as log_{2}(\#inputs) + 1 and unlimited depth trees.

  5. RandomForestRegressor-P implements a RF algorithm for regression problem with 500 trees.

  6. ranger-R is a fast implementation of RF [60] or recursive partitioning, particularly suited for high dimensional data, with 500 trees. It is included in the package ranger.

  7. Rborist-R is a rapid decision tree construction and evaluation, with 500 trees, provided by Rborist package. The method includes accelerated implementation of the random forest algorithm and it is tuned for multicore and GPU hardware.

  8. rf-T creates a random forest using the caret interface to the function randomForest in the randomForest package, with ntree = 500 and tuning the parameter mtry with values 2:3:8.

  9. rfsrc-R is the random forests for survival, regression and classification, with 500 trees, included in randomForestSRC package.

  10. RRF-R [61], [62] implements regularized random forest algorithm, with 500 trees, included in the package RRF. It is based on the randomForest R package.

7) Model Trees (MT): 8 Algorithms

  1. Cubist-R is a rule-based model that is an extension of Quinlan's classic M5 model tree [63]. A tree is grown where the terminal leaves contain linear regression models.

  2. M5-K implements M5 model tree.

  3. M5P-R [64] implements M5 prime model tree using M5P function in the package RWeka.

  4. M5P-W builds the M5 prime tree regression method.

  5. M5Prime-M is the M5 prime regression method. This method is included in M5PrimeLab [26] toolbox.

  6. M5Rules-K implements M5 model rules.

  7. M5Rules-R creates a M5 model rules using M5Rules function in the package RWeka.

  8. M5Rules-W builds the same M5 model rules.

8) Generalized Linear Models (GLM): 29 Algorithms

  1. bam-mgcv-R [65], [66] is a version of generalized additive models for very large datasets included in mgcv package.

  2. BayesianRidge-P builds a Bayesian ridge regression model and optimizes the regularization parameters.

  3. ElasticNet-P is a linear regression model trained with L1 and L2 prior as regularizer. It is useful when there are multiple features which are correlated with one another.

  4. fitrlinear-lsr-M [38] fits a linear regression model to high-dimensional data, using least-squares regression as the learner.

  5. fitrlinear-svm-M [37], [38], [67] fits linear regression models to high-dimensional data. The method includes regularized support vector machines (SVM) and minimizes the objective function using techniques that reduce computing time (e.g., stochastic gradient descent).

  6. fitlm-M creates a simple linear regression model.

  7. fitlm-Robust-M produces a linear robust regression model to reduce outlier effects.

  8. gam-R is used to fit generalized additive models [68], [69], specified by giving a symbolic description of the additive predictor and a description of the error distribution. It uses the backfitting algorithm to combine different smoothing or fitting methods. The model is included in gam package.

  9. gam-mgcv-R [70], [71] fits generalized additive models with integrated smoothness estimation in mgcv package.

  10. glm-R [72] uses the function glm in the stats package with gaussian family.

  11. glmnet-R trains a GLM via penalized maximum likelihood, with Lasso or elasticnet regularization parameter [73] (using glmnet function in the glmnet package).

  12. glmStepAIC-R performs generalized linear regression with stepwise selection by the Akaike information criterion [74] using the function stepAIC in the MASS package, which is executed by train from caret (no algorithmic parameter tuning is performed).

  13. HuberRegressor-P [75], [76] is a linear regression model that is robust to outliers.

  14. lars-R fits Least Angle Regression (LARS) [77] and is included in the package lars. With the “lasso” option, it computes the complete lasso solution simultaneously for all values of the shrinkage parameter in the same computational cost as a least squares fit.

  15. Lars-P performs LARS [77] algorithm for high-dimensional data.

  16. Lasso-P is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer parameter values, effectively reducing the number of variables upon which the given solution is dependent.

  17. LassoLars-P is a lasso model implemented using the LARS algorithm.

  18. LinearRegression-P produces an ordinary least squares linear regression.

  19. LinearRegression-R builds suitable linear regression models, using the Akaike criterion for model selection. It is included in RWeka package.

  20. LinearRegression-W learns a simple linear regression model. It picks the attribute that results in the lowest squared error.

  21. lm-R [78], [79] is used to fit linear models in stats package.

  22. nnls-R trains an algorithm for non-negative least squares executed from caret with no tuning of algorithmic parameters.

  23. PassiveAggressive-J is a version of the passive aggressive algorithm [80] for regression. It is a type of online algorithm that performs the minimal update necessary to correct for a mistake.

  24. PassiveAggressiveRegressor-P is a passive-aggressive regressor algorithm [80] for large-scale learning.

  25. randomGLM-R is a random generalized linear model predictor included in the randomGLM package.

  26. SGDRegressor-P is a linear model fitted by minimizing a regularized empirical loss with stochastic gradient descent.

  27. SimpleLinearRegression-W implements a simple linear regression model.

  28. stepwiseglm-M [72], [81], [82] creates generalized linear regression model by stepwise regression.

  29. TheilSenRegressor-P [83] implements a robust multivariate regression model. The algorithm uses a generalization of the median in multiple dimensions and is robust to multivariate outliers.

9) Nearest Neighbor Methods (NN): 7 Algorithms

  1. IBk-W [84] is a k-Nearest Neighbors regressor with linear neighbor search and euclidean distance, considering 7 as the number of nearest neighbors.

  2. IB1-W is a simple 1-NN regressor.

  3. kknn-R uses the function kknn in the kknn package considering 7 as the number of neighbors.

  4. knn-T trains the function knn in the caret package, tuning the number of neighbors with 12 values in the range 3:2:25.

  5. KMeans-P is the k-means clustering method, considering 8 centroids.

  6. KNeighborsRegressor-P implements a regression algorithm based on k-nearest neighbors, considering 5 neighbors by default.

  7. NearestNeighbour-J is a nearest neighbor algorithm with 7 as the number of neighbors.

10) Partial Least Squares and Principal Component Regression (PLSR): 4 Algorithms

  1. kernelpls-R [85] performs partial least squares regression with the function plsr (in the pls package) and method=kernelpls.

  2. pcr-R performs principal components regression using pls package.

  3. pls-R uses the function mvr in the pls package to perform partial least squares regression, which is executed by train from caret (no algorithmic parameter tuning is performed).

  4. simpls-R considers the same function plsr using the SIMPLS [86] method in the pls package.

11) Multivariate Adaptive Regression Splines (Mars): 4 Algorithms

  1. earth-R builds MARS [87], [88], in earth package.

  2. gcvEarth-R uses the function earth in the earth package. It builds an additive MARS model without interaction terms using the fast MARS [48] method.

  3. mars-M is the MARS method included in the ARESLab [25] toolbox.

  4. mars-R fits a MARS [87] model using the function mars in the mda package.

12) Bagging (Bag): 18 Algorithms

This is one of the cases where the software tools use different numbers of bags: 10, 20 or 50. In [89], the authors tested different numbers of bags, indicating that 25 or 50 bags were necessary or sufficient to obtain good results within reasonable execution times. In our experiments, 50 bags have been used for all the methods.

In the case of the ensemble-based methods (RF-based ones among others, for example ExtraTreesReg-P), applying bagging makes no sense. We tried it over a good number of datasets and it did not significantly improve the results but greatly increased the execution time. This is the case, among others, when bagging is applied to randomForest-R, which makes its computational cost extremely high without significant improvement. Therefore, such combinations have not been considered as cases of study.
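As an illustration of the general pattern (our own sketch with scikit-learn, not code from the study; X_train, y_train and X_test are assumed to be already defined), a 50-bag ensemble over a single base regressor looks as follows:

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# 50 bootstrap replicates of a single regression tree, matching the
# 50-bag setting used for all bagged methods in this study.
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                          random_state=1)
bagged.fit(X_train, y_train)     # each tree sees a bootstrap sample
y_pred = bagged.predict(X_test)  # prediction = average of the 50 trees
```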

  1. bagEarth-R is a bagging ensemble of MARS method included in the earth package with 50 bagging iterations.

  2. bagEarthGCV-R is a bagged MARS method from the earth package, using GCV pruning with 50 bagging iterations.

  3. bagging-DecisionStump-W uses DecisionStump base regressor with 50 bagging iterations.

  4. bagging-DecisionTable-W uses DecisionTable with BestFirst and forward search, leave-one-out validation and RMSE as measure used to evaluate the performance, with 50 bagging iterations.

  5. bagging-DecisionTree-J is an ensemble technique for reducing variance that uses DecisionTree base regressors with 50 bagging iterations.

  6. bagging-DecisionTree-P is an ensemble meta-estimator that fits decision tree base regressors each on random subsets of the original dataset and then aggregate their individual predictions (the number of estimators in the ensemble is 50) to form a final prediction.

  7. bagging-IBk-W uses IBk base classifiers, which develop kNN regressor tuning K using cross-validation with linear neighbor search and Euclidean distance, with 50 bagging iterations.

  8. bagging-MultilayerPerceptron-W is a bagging with 50 iterations using the same configuration as the single MultilayerPerceptron-W method.

  9. bagging-M5P-R uses M5P base regressor with 50 bagging iterations.

  10. bagging-M5P-W applies bagging with 50 iterations to the same M5P base regressor.

  11. bagging-M5Rules-R uses M5Rules base regressor with 50 bagging iterations.

  12. bagging-M5Rules-W builds a bagging with 50 iterations using M5Rules method as base regressor.

  13. bagging-RandomTree-J is a bagging ensemble that uses RandomDecisionTree base regressor with 50 bagging iterations.

  14. bagging-RandomTree-W applies bagging with RandomTree base regressor without backfitting, with unlimited tree depth, considering [log_{2}(\#inputs)+1] as the number of random inputs, and 2 as the number of instances per leaf, with 50 bagging iterations.

  15. bagging-REPTree-W uses REPTree with 2 instances per leaf, minimum class variance 0.001, 3-fold for reduced error pruning and unlimited tree depth, with 50 bagging iterations.

  16. bagging-Rpart-R [89] is a bagging ensemble with 50 bagging iterations of decision trees (rpart method) using the function bagging (in the ipred package).

  17. treebag-R trains a bagging ensemble of linear discriminant analysis with option bagControl=ldaBag and 50 bagging iterations.

  18. treeBagger-M creates a bag of regression trees with 50 trees. TreeBagger grows the decision trees in the ensemble using bootstrap samples of the data. Also, the method selects a random subset of predictors to use at each decision split as in the random forest algorithm.

13) Boosting (BST): 5 Algorithms

  1. bstls-R uses a gradient boosting for optimizing loss functions with component wise linear models as base learners, with the function bst (from the bst package), learner=ls and number of boosting iterations equals 50.

  2. bsttree-R fits a boosting for regression using the tree regression models, with the function bst (from the bst package), learner=tree and the same number of iterations of the bstls method.

  3. glmboost-R is the gradient boosting for optimizing arbitrary loss functions where component-wise linear models are utilized as base-learners. It is included in mboost package and uses 100 as number of boosting iterations.

  4. fitensembleBst-M is a regression tree ensemble using LSBoost and 100 learning cycles. LSBoost is the gradient boosting strategy applied for least squares from Friedman [90].

  5. GradientBoostingRegressor-P is a Gradient Boosting for regression. It builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function. It uses 100 as the number of boosting stages to perform.

14) Other Methods (OM): 14 Algorithms

  1. AdditiveRegression-W [91] is a method that helps to improve the performance of the regression, where each iteration fits a model to the residuals left by the regressor in the previous iteration. The prediction is obtained by adding the predictions of all the regressors. This avoids overfitting but increases the learning time.

  2. AttributeSelectedClassifier-W uses M5P trees to classify patterns reduced by attribute selection. The CfsSubsetEval method [92] selects the best group of attributes weighting their individual predictive ability and their degree of redundancy, preferring groups with high correlation within outputs. The BestFirst forward search method is used, stopping the search when five non-improving nodes are found.

  3. foba-R is a greedy variable selection for ridge regression using a forward greedy, backward greedy and the Adaptive Forward-Backward Greedy (FoBa) [93] methods. This method is included in foba package.

  4. KernelRLS-J [94] implements Kernel Recursive Least Squares (RLS) for online regression learning. This is a kernelization of the RLS algorithm, and it uses projection for bounded learning.

  5. KStar-W [95] is an instance-based regressor which uses entropy based similarity to assign a test instance to the output of its nearest training instances.

  6. LWL-J [96] is a Local Weighted Learning (LWL) that builds a local model for every query, and uses that local model to make predictions.

  7. LWL-W [97] is an ensemble of Decision-Stump base regressors. Each training instance is weighted with a linear weighting kernel, using the Euclidean distance for a linear search of the nearest neighbor.

  8. MultiScheme-W selects a regressor among several ZeroR regressors using cross validation on the training set.

  9. ppr-R builds a projection pursuit regression model [98]. It is included in stats library.

  10. RandomCommittee-W is an ensemble of RandomTrees (each one built using a different seed) whose output is the average of the base regressor outputs.

  11. relaxo-R builds relaxed lasso solutions [99] included in the relaxo package.

  12. Ridge-P performs linear least squares with l2 regularization. The model solves a regression model where the loss function is the linear least squares function and regularization is given by the l2-norm.

  13. RidgeRegression-J creates a simple batch implementation of ridge regression.

  14. spikeslab-R fits a rescaled spike and slab model [100], [101] using a continuous bimodal prior in the spikeslab package. A generalized elastic net estimator is used for variable selection and estimation. It can be used for prediction and variable selection in low and high-dimensional linear regression models.

E. Statistical Analysis

In order to assess whether significant differences exist among the results, we have adopted statistical analysis [15], [16], concretely non-parametric tests. Following the recommendations made in [15] and [16], a set of simple, safe and robust non-parametric tests for statistical comparisons of regressors has been considered. We have employed Friedman's test [102] in order to rank the studied algorithms and to find out whether at least one significant difference exists among any of the mean values, and we have then proceeded with the post-hoc Holm's test [103] in order to find the concrete pairwise comparisons that produce differences.

Notice that Holm's test has only been applied to the results obtained by the best 30 algorithms in Friedman's ranking, since the total number of algorithms is too high to compute this test. A detailed description of these tests and explanations of the use of non-parametric tests for data mining and Computational Intelligence can be found at http://sci2s.ugr.es/sicidm/.
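For readers who wish to reproduce this kind of analysis, the following sketch (our own, assuming scipy/numpy and an errors matrix of shape n_datasets × n_algorithms; the function name is ours) computes the Friedman test and Holm-adjusted p-values for a control algorithm against the rest, in the spirit of the methodology recommended in [15], [16]:

```python
import numpy as np
from scipy import stats

def friedman_holm(errors, control=0):
    """Friedman test plus Holm post-hoc of a control algorithm vs. the rest.

    errors: (n_datasets, n_algorithms) matrix of test MSEs;
    control: column index of the best-ranked (control) algorithm.
    """
    n, k = errors.shape
    ranks = np.apply_along_axis(stats.rankdata, 1, errors)  # rank 1 = best per dataset
    avg_ranks = ranks.mean(axis=0)

    _, p_friedman = stats.friedmanchisquare(*errors.T)      # global test

    # z-statistics of the control against each other algorithm.
    se = np.sqrt(k * (k + 1) / (6.0 * n))
    others = [j for j in range(k) if j != control]
    pvals = [2 * stats.norm.sf(abs(avg_ranks[j] - avg_ranks[control]) / se)
             for j in others]

    # Holm step-down adjustment of the pairwise p-values.
    order = np.argsort(pvals)
    adjusted, running = np.empty(len(pvals)), 0.0
    for pos, idx in enumerate(order):
        running = max(running, (len(pvals) - pos) * pvals[idx])
        adjusted[idx] = min(1.0, running)
    return p_friedman, avg_ranks, dict(zip(others, adjusted))
```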

SECTION III.

Results and Discussion

In order to evaluate the performance of the analyzed algorithms, several analyses have been performed in this paper, which are organized in this section as follows:

  • In Subsection III-A, we present the rankings and average RegM results and we analyze the performance of the 164 algorithms studied.

  • In Subsection III-B, we analyze the best 30 algorithms in rank without considering the algorithms that make use of bagging.

  • In Subsection III-C, we analyze the best 30 algorithms in rank by considering the algorithms that make use of bagging.

  • In Subsection III-D, we analyze the scalability of the studied algorithms.

  • In Subsection III-E, we analyze the variety of data and the different algorithms' behavior in relation to the curse of dimensionality.

  • In Subsection III-F, we analyze the tuning of the algorithmic parameters by “train”.

  • In Subsection III-G, we analyze the results obtained grouped by algorithm family.

A. Analysis of the 164 Algorithms Available in the Studied Software Tools

Several executions have been carried out on different datasets in order to analyze the performance of the 164 algorithms (see Subsections III-A, II-D and III-G). Tables 2, 3 and 4 summarize the average RegM results obtained by each algorithm (sorted by ranking), where Rank represents the Friedman ranking for the average error (MSE) obtained over the test data; RegM represents the values obtained for the new measure proposed in this paper (see Subsection II-B); Win is the number of datasets in which the algorithm obtains the best MSE over the test data; ✶ represents the number of datasets in which the algorithm has obtained a value in [75, 100] for the measure RegM; ★ represents the number of datasets in which the algorithm has obtained a value in [50, 75); ♦ represents the number of datasets in which the algorithm has obtained a value in [0, 50); and ▼ represents the number of datasets in which the algorithm would obtain a value below 0 for RegM. Finally, AvTime is the average computational cost in seconds.

TABLE 2 Results Obtained by the Studied Methods. (I/III)
TABLE 3 Results Obtained by the Studied Methods. (II/III)
TABLE 4 Results Obtained by the Studied Methods. (III/III)

All these values have been computed over the particular MSE results, which can be found in a downloadable spreadsheet at the web page associated with this paper (http://www4.ujaen.es/~mgacto/regression/study/). It includes the generated results (errors and times) per algorithm and dataset (164 × 52).

The following facts can be highlighted from the results presented in Tables 2, 3 and 4:

  • The bagging methodology has been applied to several algorithms belonging to different families (MT, MARS, NNET, RT, among others); a sketch of this combination is shown after this list. From the results shown in the tables, it can be seen that applying bagging to simple algorithms yields considerably improved results with a reasonable computational cost. Individual regression methods tend to overfit, but bootstrap-aggregated (bagged) regressors combine the results of many regressors, reducing the effects of overfitting and improving the accuracy. Nevertheless, we can still find some bagging-based algorithms remaining in the last ranking positions. The best one is bagging-M5P-R, with the best RegM value (85.52) and with zero results within the moderate/bad result interval zones, represented by ♦ and ▼, respectively. There is also another algorithm with zero results within the moderate/bad zones, bagging-REPTree-W (82.65 RegM). In this sense, both of them could be considered quite robust algorithms without any registered bad result.

  • The algorithms obtaining the best values for the Friedman ranking and the new measure RegM are the M5 and M5Rules algorithms available in the software tools R and Weka when 50 bagging iterations are applied (e.g., Bagging-M5P-R). Notice that both of them belong to the MT family, to which 50 bagging iterations have been applied. Close to the results obtained by these algorithms, we can also find several algorithms of the RF family.

  • Analyzing the values obtained for the RegM measure, we can see that these values present a coherent correlation with Friedman’s ranking: the measure decreases as the ranking value increases.

  • Finally, from our point of view and taking into account the RegM distribution of results into the four quality intervals, we recommend as highly promising those algorithms whose average RegM values are around or over 60. Moreover, as a particular singularity, we can find the elm-M algorithm, which gets the best results in 12 datasets and very bad results (according to RegM) in 35 datasets, which is why it is ranked among the last algorithms. Nevertheless, we think elm-M is also a promising one to consider for tackling real problems.
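The bagging combination mentioned in the first item can be sketched in R through the RWeka interface to Weka’s Bagging meta-learner. This is a rough illustration under assumed option names (W for the base learner, I for the number of bags), not the exact experimental setup of the study:

```r
# Sketch: bagging the M5 model tree (roughly the bagging-M5P configuration,
# with 50 bags as in the study). mtcars is just illustrative data.
library(RWeka)

fit <- Bagging(mpg ~ ., data = mtcars,
               control = Weka_control(
                 W = "weka.classifiers.trees.M5P",  # base regressor: M5 model tree
                 I = 50))                           # number of bagging iterations
pred <- predict(fit, mtcars)
mean((mtcars$mpg - pred)^2)  # resubstitution MSE, for illustration only
```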

In the following subsections, we perform a statistical analysis of the 30 best algorithms according to Friedman’s ranking, with and without bagging.

B. Analysis of the Best 30 Algorithms Without Bagging Consideration

We have only analyzed the 30 best algorithms according to Friedman’s ranking without considering the BAG family algorithms, in order to study the algorithms without this additional methodology. Table 5 shows the 30 best algorithms according to Friedman’s ranking (recalculated for only these 30 algorithms) and the adjusted p-value (APV_{Holm}) obtained by Holm’s test when we compare the best ranked algorithm (ExtraTreesReg-P) with the remaining algorithms. As a summary: 7 RFs, 6 MTs, 4 MARSs, 3 SVMs, 3 OMs, 2 RTs, 2 BSTs, 1 GLM, 1 NNET and 1 NN. Notice that no algorithms from the RL, STA, and PLSR families have been included among the 30 best algorithms. Taking into account the results shown in Table 5, we can highlight:

  • The equality hypothesis with respect to the first algorithm is not rejected for the remaining first 18 algorithms at a significance level of 0.05. Among them, we can find algorithms from 7 different families (7 algorithms from RF, 6 from MT, 1 from RT, 1 from BST, 2 from SVM, 1 from MARS and 1 from OM), but most of them belong to the RF and MT families, which shows the potential of tree-based algorithms. SVM also appears twice, which shows a significantly good performance taking into account that there are 146 algorithms without bagging.

  • The best Friedman’s ranking is obtained by the single tree-based algorithm ExtraTreesReg-P. Moreover, we can see that the following 4 top-ranked algorithms belong to the RF family, which are ensemble algorithms with multiple trees in the forest (500 trees in this study).

  • Finally, the single-model MT family algorithms also seem to compete with the RF ones. This is quite interesting since they are much simpler and therefore should be easier to understand/interpret. It also shows their potential to be considered as base algorithms for bagging or for new RF proposals.

TABLE 5 The Best 30 Algorithms Without Bagging (Including Recalculated Friedman’s Test and Holm’s Adjusted P-Value)

C. Analysis of the Best 30 Algorithms With Bagging

We now focus on the 30 best algorithms of the 164 analyzed in this study according to Friedman’s ranking (see Table 2). This makes it possible to test the equality hypothesis (with the first ranked algorithm as reference) according to Holm’s test when we compare the results obtained by all the algorithms, including bagging. Notice that we have not applied bagging to the ensemble algorithms (the RF family and ExtraTreesReg-P) because these algorithms already perform an internal bagging-like process (see Section II-D - Bagging, for an extended explanation).

Table 6 shows the results obtained by Friedman’s test recalculated for these 30 algorithms (this type of table was described in the previous subsection). As a summary: 11 BAGs, 7 RFs, 6 MTs, 2 SVMs, 1 RT, 1 BST, 1 MARS and 1 OM. Notice that the best Friedman’s ranking is obtained by M5P, but in this case with its implementation available in Weka. In addition, this implementation also got the best value for the measure RegM (see Table 2). Analyzing the results presented in Table 6, we can highlight the following facts:

  • The bagging methodology considerably improves the precision of the algorithms, with 11 of the best 30 algorithms coming from the BAG family (out of a total of 18 BAG algorithms).

  • Holm’s test with bagging-M5P-W as the reference algorithm rejects the equality hypothesis at a significance level of 0.05 for the last 10 algorithms in the table, which include several implementations of the M5 algorithm. It should be noted that the implementation of M5 without bagging available in Weka is not even among Friedman’s top 30 ranked algorithms. This shows how the use of bagging can significantly improve an algorithm’s performance without a high computational cost (see Tables 2, 3 and 4).

  • The remaining algorithms (those with p-values over 0.05) are distributed by family as follows: 8 algorithms from BAG, 7 from the RF family (out of a total of 10 RF algorithms), 2 from MT and 1 from BST. Even though there are no SVM, MARS or OM algorithms in the non-rejected set, we should take into account that there are no implementations of these algorithms in combination with bagging. This points to an open opportunity for including these types of combinations as part of the studied software tools.

TABLE 6 The Best 30 Algorithms With Bagging (Including Recalculated Friedman’s Test and Holm’s Adjusted P-Value)

D. Analysis of Scalability

In this section, we include an analysis of the computational cost of the first ranked algorithms and the average times by family. Figure 2 plots the time in seconds of the best 30 algorithms sorted by their rankings. The methods using the “train” function from caret to tune algorithm parameters need significant extra time to obtain the results, thus slowing down the overall operation of the algorithms. In this figure, we can see that the algorithms from the BAG family (12 among the 30 best) have a reasonable time, except for the Bagging-M5Rules-R, bagging-MultilayerPerceptron-W and bagEarthGCV-R algorithms. The algorithms of the RF family are slower, except for those versions specifically designed to be fast, such as the ranger-R and Rborist-R methods, the Weka versions and the Python versions. In general, the algorithms implemented in Python are quite fast.
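The extra cost introduced by tuning can be checked directly. The following is a small sketch (not the study’s benchmarking code) that times a fixed-parameter random forest against a caret-“train” tuned one on illustrative data:

```r
# Sketch: measuring the overhead of parameter tuning via caret's "train".
# Absolute times are machine dependent; only the relative gap matters here.
library(randomForest)
library(caret)

t_fixed <- system.time(
  randomForest(mpg ~ ., data = mtcars, ntree = 500)   # standard parameters
)
t_tuned <- system.time(
  train(mpg ~ ., data = mtcars, method = "rf", ntree = 500,
        trControl = trainControl(method = "cv", number = 5))  # tunes mtry by CV
)
c(fixed = t_fixed[["elapsed"]], tuned = t_tuned[["elapsed"]])  # seconds
```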

FIGURE 2. Average computational cost of the 30 best ranked algorithms.

The average times for each family are displayed in Figure 3. In this figure, we have excluded the algorithms executed with “train” (-T), since executing the algorithms while searching for a parameter adjustment would penalize the times of their own families. This figure shows that the RL, NNET and OM families are quite slow. The algorithms from the BAG family are not the fastest ones, but they are quite competitive in average times and they get pretty good precision results. The best results in time, with an acceptable and very competitive precision, are obtained by the algorithms of the MT family (as we saw in the previous section, they obtain acceptable results, even equivalent to the best algorithm). Therefore, applying bagging to MT methods allows obtaining not only good results but also good run times.

FIGURE 3. Average time for each family.

E. Analysis on the Curse of Dimensionality

In this paper we have considered a wide variety of data, particularly regarding the number of attributes and the number of examples. This section analyzes the algorithms in different situations, such as datasets with a large number of attributes, datasets with a small number of attributes, or the combined effect depending on the number of examples. In order to do so, we have divided the datasets into two different groups depending on their dimensionality (i.e., number of attributes). The first one includes the datasets with the higher dimensionality, the High Dimensional (HD) group, with >= 9 variables. The second one includes the datasets with the lower dimensionality, the Low Dimensional (LD) group, with < 9 variables. Thus, we can compute all the measures again separately by group (Friedman’s ranking, RegM, and number of results in the quality intervals) in order to check for differences with the general results and to contrast both groups.

On the other hand, there are some data complexity measures that were proposed or used in the classification framework, most of them based on the existence of classes. Even so, some of them can be directly used for regression. That is the case when, in discussions on the curse of dimensionality, the number of patterns is compared to the number of variables, which seems more informative than considering the number of patterns only. In [104], [105], the authors introduced a very simple index, denoted T2, defined as the average number of patterns per variable. This measure also represents an interesting characteristic for the datasets in the regression framework. In this section, we also apply this measure, as in the previous case, for analyzing the dimensionality. Thus, we have again divided the datasets into a group with High T2 (HT2) values (good distributions with T2 >= 250) and a group with Low T2 (LT2) values (bad distributions with T2 < 250).
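Both groupings are straightforward to compute. The following sketch derives the HD/LD and HT2/LT2 groups for a hypothetical list of data frames in which the last column is assumed to be the output variable:

```r
# Sketch: HD/LD and HT2/LT2 grouping (hypothetical illustrative datasets;
# in each data frame the last column is assumed to be the output variable).
set.seed(1)
datasets <- list(a = data.frame(matrix(rnorm(3000), ncol = 3)),
                 b = data.frame(matrix(rnorm(1200), ncol = 12)))

n_vars <- sapply(datasets, function(d) ncol(d) - 1)  # input attributes
t2     <- sapply(datasets, nrow) / n_vars            # T2: patterns per variable

data.frame(
  dim = ifelse(n_vars >= 9, "HD", "LD"),   # dimensionality grouping
  t2  = ifelse(t2 >= 250, "HT2", "LT2"))   # data density grouping
```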

The complete results for the four groups (HD, LD, HT2 and LT2), sorted by ranking over the 164 algorithms, are available in the complementary material web page associated with this paper (http://www4.ujaen.es/~mgacto/regression/study/). For the sake of simplicity, and since it is not possible to show and analyze all 164 algorithms in the mentioned four groups in the manuscript, we show and analyze here only the best 20 algorithms for each group.

Table 7 shows these results, where PosHD is the position sorted by Friedman’s ranking obtained on the HD group of datasets (>= 9 variables), PosLD is the position sorted by ranking obtained on the LD group of datasets (< 9 variables), PosGlobal is the position sorted by ranking obtained when all datasets are considered, and the remaining columns were previously explained in Section III-A for Tables 2, 3 and 4. Analogously, Table 8 contains the same columns but for the corresponding division on T2, the High T2 (HT2) group of datasets (T2 >= 250) and the LT2 group of datasets (T2 < 250), respectively.

TABLE 7 The Best 20 Algorithms for the HD Datasets and for LD Datasets (Including Recalculated Performance Metrics, Where More Than 10 Position Differences are Boldfaced)
TABLE 8 The Best 20 Algorithms for the HT2 Datasets and for LT2 Datasets (Including Recalculated Performance Metrics, Where More Than 10 Position Differences are Boldfaced)

From the results shown in these tables we can highlight the following facts:

  • There is a group of algorithms, the bagged versions of M5P, that reach a good behaviour in both the HD and LD groups of datasets, and also show the same behaviour for T2. They rank in the best positions across all the considered datasets. Indeed, this aspect was already shown in Table 2, where their average RegM scores are quite high but, most importantly, their individual RegM results are mainly located in the very good and good quality ranges, represented by ✶ and ★, respectively. For this reason, the use of these algorithms as a first approximation to a given real problem could be considered a starting recommendation for non-expert users whenever there are no special restrictions such as interpretability, real-time computing, etc. Then, if possible (depending on their programming abilities and data mining knowledge), they could try to improve the obtained results with more recent techniques.

  • Despite including feature selection as part of the learning process itself, the base algorithms of these combinations (versions of M5P) suffer more in the LD group than in the HD group, so it seems that bagging improves behavior in small problems, where overlearning may occur, without affecting behavior in the HD group too much. This aspect is even more prominent when we focus on T2, where the base versions do not appear in the top 20 of the LT2 group (datasets with low data density), showing that bagging really solves the problem of low data density (tendency to overlearning). As a conclusion or recommendation, these base algorithms are also good alternatives for solving problems with high T2 values, as they allow obtaining simpler models (just one tree, or even a set of rules). In any case, they also maintain their individual RegM results, even in the LT2 group, mainly located in the very good and good quality ranges, represented by ✶ and ★, respectively.

  • The family of RF algorithms seems to behave in the opposite way to the M5P versions. Focusing on T2, it seems that, although they work well in both types of problems, they are generally better in problems with poor data density/distribution than in those with good ones. They also provide better behavior in the HD group than in the LD group, or vice versa, depending on the RF version, so they are actually more dependent on the distribution of the data than on the dimensionality itself (although logically the two are related).

F. Tuning of the Algorithmic Parameters by “Train”

As we explained before, the setting/tuning of algorithmic parameters is an important aspect on which we had to take a decision, based on the reasons discussed in Section II-D. Even though it is not the objective of this contribution, for the reasons previously stated, in this section we show the performance of those algorithms for which “train” (from caret in R) has been applied with the recommended nested cross-validation for tuning some of their most relevant algorithmic parameters (based on caret recommendations). Of course, this is not a definitive demonstration of how parametric tuning influences all the algorithms in general, but it provides a glimpse of whether repetitive trends can be found across some different algorithms. Moreover, it could lead to some recommendations for users/researchers performing parametric tuning in the real applications/problems they need to solve (especially when they are non-expert users).

In Table 9, we show the average results obtained by the available methods with fixed standard parameters (without parametric tuning) and by their recommended versions with parametric tuning. It includes the same columns that were previously explained in Section III-A for Tables 2, 3 and 4, plus the global positions obtained in those tables (positions from 1 to 164).

TABLE 9 Available Methods With Standard Parameters (Without Parametric Tuning, Boldfaced) and Their Recommended Versions With Parametric Tuning

Taking into account the values of the different metrics shown in Table 9, we can stress the following facts:

  • The highest differences in algorithm positions with respect to their tuned versions are found for randomForest-R and its tuned version rf-T (positions 6 and 22, respectively); for mlp-R and mlpWeightDecay-T (positions 158 and 146, respectively); and for rpart-R and rpart2-T (positions 88 and 96, respectively). Contrary to what was expected, two of them (those with the best ranking positions) even obtain worse results (in their test errors, of course). Checking the RegM intervals, we can observe that some datasets change their quality classification, moving to a worse category and showing the overfitting effects on some datasets/problems. However, even in these cases, these are not highly relevant changes over a total of 164 algorithms.

  • The remaining changes seem less relevant, involving no more than four positions and yielding only slight improvements in general. Moreover, the frequencies in the RegM quality intervals are quite similar, involving only small changes in the shown distributions.

  • In general, it seems that significant changes in performance could depend more on the design of the algorithms themselves than on the subsequent tuning of their parameters.

Finally, we recommend that non-expert users employ the standard parameters in principle, unless they have previous knowledge of the corresponding parameter effects, in which case some alternative combinations could be explored. Without expert knowledge, given the probably modest improvements and the high risk of overfitting, it would be better to try different types of algorithms than to repeatedly adjust only one or two of them. In any case, in order to provide a good assessment/estimation of the real system error, please avoid checking the test errors before fixing all the tentative combinations of parameters. Once they are fixed, compute the test errors without repeating the process (keeping the test data hidden during the whole learning process is the only way to properly estimate the generalization ability of the obtained models).

The same holds for expert users, who are also expected to be able to apply a nested cross-validation (and/or even consider some fixed combinations of parameters). Again, this should always be fixed and performed before checking the test errors, without repeating the process if the test errors are not as good as expected, in order to avoid possible overfitting in the final real system application. A minimal sketch of this protocol is shown below.
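As a rough illustration of this protocol (not the study’s code), the following sketch tunes a random forest by cross-validation on the training data only and touches the test set exactly once, loosely mirroring the randomForest-R vs. rf-T comparison of Table 9 on illustrative data:

```r
# Sketch of the recommended protocol: tune on training folds only, then
# evaluate once on the held-out test set (illustrative data and split).
library(caret)
library(randomForest)

set.seed(1)
idx <- sample(nrow(mtcars), 24)
tr  <- mtcars[idx, ]
te  <- mtcars[-idx, ]          # kept hidden until the very end

# Standard parameters (randomForest-R style) vs. CV-tuned mtry (rf-T style).
fixed <- randomForest(mpg ~ ., data = tr, ntree = 500)
tuned <- train(mpg ~ ., data = tr, method = "rf", ntree = 500,
               tuneLength = 3,
               trControl = trainControl(method = "cv", number = 5))

# Single final check of the test errors; the process is not repeated.
mse <- function(m) mean((te$mpg - predict(m, te))^2)
c(fixed = mse(fixed), tuned = mse(tuned))
```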

G. Discussion by Algorithm Family

In this section, we discuss the results by regressor family (see Tables 2, 3 and 4). Figures 4.a) and 4.b) show, respectively, the minimum, maximum and average ranking values by family considering all the algorithms (even those with extremely bad results), and the same values considering only the 3 best ranked algorithms of each family (since we think that the most competitive ones could better represent each family’s potential). From now on, we will follow the order in Figure 4.b) to briefly analyze the algorithm families.

FIGURE 4. a) Friedman’s rank average and range for the regressors by family; b) Friedman’s average and range for only the 3 best ranked regressors by family.

The most accurate family is BAG. Checking Figure 4.a), we can see that it gets almost the best average, while its best algorithm gets the first ranking position. Checking Figure 4.b), we can infer that the BAG global average (see also Figure 4.a)) is affected by a few bad algorithms, but most of the BAG family is in the best ranking positions. In fact, the best regressor is bagging-M5P-R (an ensemble technique for reducing variance that uses M5 base regressors with 50 bags in R), followed by bagging-M5P-W (bagging of the M5 model tree in Weka), bagging-M5Rules-W (bagging of M5 model rules in Weka), bagging-M5Rules-R (bagging of M5 model rules in R) and bagging-MultilayerPerceptron-W (bagging of MLP networks in Weka), with ranks of about 19.17–26.16. These first five best algorithms all belong to the BAG family. Bagging could be considered a meta or additive family, since it is applied over existing implementations from the remaining families, so it also depends on the base algorithms of the other families.

The following family of regressors is RF. The best three from this family are randomForest-R (RF in R), RandomForest-W (RF in Weka) and RRF-R (Regularized RF in R), with ranks of about 27.27–28.22. Among the 10 best ranked algorithms we can find several algorithms from the RF family, all of them with a fairly good ranking. RFs are ensemble algorithms that perform an internal bagging-like process. This is one of the main reasons why they are highly competitive with respect to the BAG family.

Two of the best ranked families that are not directly based on ensembles are MT and SVM. MT-based regressors work relatively well, with M5-R (the M5 method in R) still being equivalent to the best ranked method (see Table 6). The following methods from the MT family are M5Rules-R, M5Rules-W, cubist-R (an advanced version of the M5 method in R), M5-K and M5Rules-K, i.e., different versions of the classic M5. In fact, it should be noted that the best ranked method out of the total of 164 algorithms is a bagging of the old and classic M5 algorithm. The SVM family is also quite competitive, the best method being svm-R, a simple implementation of an SVM using the LibSVM library [40] (see the sketch below). Many of the methods from the SVM family are based on this library. They could not be used in combination with bagging in their current implementations in the studied software tools, which seems a potential open avenue for data mining software developers.
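For reference, an SVM regressor on top of LibSVM can be obtained in R through the e1071 package, which is assumed here to be the wrapper behind svm-R:

```r
# Sketch: an SVM regressor in R built on LibSVM via the e1071 package
# (eps-regression is the default for a numeric response).
library(e1071)

fit  <- svm(mpg ~ ., data = mtcars)
pred <- predict(fit, mtcars)
mean((mtcars$mpg - pred)^2)   # resubstitution MSE, for illustration only
```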

The methods from the MARS family obtain quite acceptable results, with mars-M being the one with the best performance. MARS methods get results over the median value (i.e., the very good ✶ and good ★ categories) in almost all the datasets, and only in a small percentage of datasets do the results fall below the median.

The OM family includes very different algorithms, and its rankings are not too good on the global average but acceptable on the 3-best-ranked average. The ppr-R method (projection pursuit regression model in R) achieves good results (rank 42.53) within the OM family and is among the 30 best algorithms. The next method of the family is RandomCommittee-W, with a ranking very close to that of ppr-R.

The RT family presents intermediate results in general. ExtraTreesReg-P is the algorithm with the best ranking of the RT family, with a rank value of 28.57, surprisingly getting the best ranked position when bagging is not considered. The remaining algorithms of the RT family have considerably worse rankings, higher than 58. The second best algorithm of the family is ctree-T, followed by REPTree-W. Both REPTree-W and ExtraTreesReg-P are quite fast.

Next, we analyze the results of the group of intermediate-low ranked families. GradientBoostingRegressor-P (Gradient Boosting in Python) is the first algorithm of the BST family; the rest of the algorithms of this family, glmboost-R and bsttree-R, have worse rankings, higher than 75. The GLM family is the one with the highest number of algorithms (29), the best one being the stepwiseglm-M method. In this family there is a great variety of methods: generalized linear models, least angle regression, the passive aggressive algorithm, etc.

The algorithms of the NNET family are quite competitive, even though they present some overfitting. They obtain very good results in some datasets (those that do not suffer from overfitting), whereas in other datasets (those that do suffer from overfitting) they obtain results worse than the acceptable values. It can be said that, depending on the problem, these algorithms may be the best or the worst option. An example is the elm-M algorithm, which gets the best results in 12 datasets and very bad results in 35 datasets. RBFRegressor-W is the best ranked method from the NNET family, with an acceptable ranking of 47.39.

Finally, the NN methods are classic methods, most of them applicable to both regression and classification problems. They are quite fast in general (there is no learning stage), but their results are not excessively good. The best algorithm is knn-T, with k tuned by caret’s “train”. The RL methods do not obtain the best results, with the classical DecisionTable-W method achieving the best results compared to the evolutionary fuzzy rule learning methods (GFS-SP-K, GFS-GSP-K and GFS-SAP-Sym-K). However, these methods were particularly designed for explainability/interpretability purposes, so they mainly try to obtain clear and simple models (when possible). The M5Rules versions can also be considered part of the RL family. The PLSR family does not seem competitive at all; the pls-T method achieves the best ranking value of the family, with a high rank (91.33). The STA family is the last one, probably because only two algorithms of this family have been included (this family is underrepresented in the studied software tools).

SECTION IV.

Conclusions

In this paper, we have performed an experimental study with 164 algorithms for regression problems that come from 14 different families (BAG, RF, MT, BST, RT, SVM, MARS, OM, GLM, NNET, NN, PLSR, RL, and STA) and that are available in several software tools (JSAT, KEEL, Matlab, R, Scikit-learn, and Weka), over 52 datasets. A statistical study with non-parametric tests has been carried out on all the algorithms and on the best 30 algorithms, both including the algorithms of the BAG family and without them. Moreover, the new measure RegM, based on MSE, has also been proposed to show the performance of an algorithm with respect to an interval of the MSE, which allows us to represent when an algorithm has a suitable performance. The objective of this study is to analyze the performance of a large number of regression algorithms in order to help non-expert users from other areas to properly solve their own regression problems, and to help specialized researchers develop well-founded future proposals by properly comparing and identifying algorithms that will enable them to focus on significant further developments. In this context, a key aspect is the parameter setting. We have highlighted the importance of this aspect and detailed why, for this type of study, the standard parameters recommended by the authors are considered despite the variety of datasets. Notice that applying the standard parameters represents the real situation in which non-expert users need to apply data mining techniques to the problems they face in their own areas.

The results obtained over the 52 datasets collected show how the implementation of the M5 algorithm combined with bagging, which is available in the software tools R and Weka, obtains the best Friedman’s ranking and the best value for the measure RegM when we analyze the 164 algorithms. On the other hand, the implementation of the ensemble-based algorithms ExtraTreesReg available in Scikit-learn and Random Forest available in R, Weka and Scikit-learn, present the best performance when we only analyze the best 30 algorithms according to Friedman’s ranking without considering the algorithms of the BAG family. Notice that the ensemble algorithms perform an internal bagging-like process. This highlights the potential of tree-based methods to solve regression problems.

Results analyzed by families show that the algorithms from RF, MT and SVM get the best positions in the rankings obtained by the statistical tests when bagging is not considered. Finally, from the results obtained in the analyses with and without bagging we can see how the use of bagging can significantly improve the algorithm’s performance without a high computational cost in general, so that BAG becomes the most accurate family.
