Multi-Objective Evolutionary Simultaneous Feature Selection and Outlier Detection for Regression

When investigating the causes of contamination in specific contexts, such as in underground water wells, multivariate regression is commonly used to establish possible links between the chemical-physical values of the samples and the levels of contaminant. Two issues often arise from such a statistical analysis: selecting the best predicting variables and detecting the instances that can be suspected to be outliers. In this paper, we propose a comprehensive, integrated, and general optimization model that solves these two problems simultaneously in such a way that outliers can be detected in reference to the specific variables that are selected for the regression, and we implement such an optimization model with a well-known evolutionary algorithm. We test our proposal on data extracted from a project whose aim is to establish the causes of the contamination of underwater water wells in a very specific area of northeastern Italy. The results show that our variable selection and outlier detection algorithm allows the synthesis of very reliable, interpretable, and clean regression models.


I. INTRODUCTION
Statistical and machine learning methods are widely employed in geology and, in general, the study of natural resources (see [1]- [3], among many others), and among the possible models, regression (specifically, linear regression) is one of the most common approaches (see, e.g., [4]- [6]). In particular, in studying problems that involve the contamination of natural resources, such as underground water, linear regression is a key tool that allows us to establish possible explainable relationships between physical-chemical values and contaminant(s) level(s). Linear, and in some cases polynomial, regression is often preferred over other models (nonlinear regression, neural networks, and so on) The associate editor coordinating the review of this manuscript and approving it for publication was Zhan Bu .
because it allows an easier interpretation of the results, which in turn leads to more plausible geological models of the events. Regression can be univariate, when there is only one independent variable, or multivariate; additionally, it can be linear, when a linear function is searched for, or nonlinear. In certain types of applications, not only is the regressed function interesting but also the precise subset of independent variables that actually play a role is of interest. In the typical interpretation of the output of a linear regression algorithm, for example, those independent variables that are associated with very small constants are considered to be less important; however, one cannot simply identify the importance of a variable only in terms of its coefficient, and furthermore, when the regressed function is not explicit, such as, for example, in random forest, there are no coefficients at all. As a matter of fact, the most general technique for solving VOLUME 9, 2021 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ such a problem is feature selection [7], while in the classical literature, feature selection is considered to be a technique to reduce data dimensionality and speed up further analysis steps; in real cases, it can also be used to determine exactly which independent variables have a role in a certain learning task. Associated with feature selection, there are two other common techniques in data analysis, namely, instance selection [8] and outlier detection [9]. The former can be defined as the selection of a smaller subset of the original set of instances that behaves as or better than the original set in a learning task, while the latter is the task of identifying possible instances whose values could be the product of some instrumental or human mistake and that should not be considered. Instance selection differs from outlier detection in the sense that while the former aims to find a small group, possibly not unique, of representative instances, which can be used instead of the original problem, the latter aims to highlight the nonrepresentative instances caused by measurement error or inherent data variability, which might or might not exist. Outliers are commonly, and informally, defined as observations that deviate so much from other observations as to increase suspicion that they were generated by other mechanisms (see, e.g., [10]) and, as such, are not always easy to detect and interpret. For example, while detecting unusual symptoms or test results in a medical context, outliers are typical examples that could indicate potential health problems of a patient, and the entire data mining process is concentrated on their detection; however, if those data are in fact anomalous, they could contaminate the results and reduce the accuracy of the response [11]. Typically in real-life data, outliers are more frequently obtained with human intervention (such as physical-chemical analysis), while automatically generated data are less prone to containing mistakes. Moreover, the sample size and the statistical technique used also determine the importance of their detection: the smaller the sample size is, and the less robust the statistical techniques are, and the higher the influence of the outliers (for example, the sample mean gives a moderately robust estimation of the population central tendency, and thus, one outlier among a large sample will have a limited impact). In statistics, outliers are also classified as vertical (when they outlay in the values of independent variables) and horizontal (when they outlay in the value of the dependent variable). Feature selection methods that do not incorporate dependencies between attributes are called univariate methods, while those methods that evaluate the collective prediction power of subsets of attributes are called multivariate. Moreover, depending on the underlying evaluation strategy, feature selection methods are classified as single or subset attribute evaluation; single evaluation methods can be univariate or multivariate, whereas subset evaluation methods are always multivariate. Feature selection methods are also categorized into filter, wrapper and embedded models: filters are algorithms that perform the selection of features using a statistical measure that classifies their significance without making use of any machine learning algorithm; wrapper methods evaluate attributes driven by the performances of an associated learning algorithm, and embedded models perform the two operations (selecting variables and building a classifier) at the same time -while these tend to produce very good subsets efficiently, they are usually complex and difficult to implement. Outlier detection methods can follow a similar taxonomy, although the classical literature separates supervised from unsupervised methods (the latter being far more common) and classifies unsupervised outlier detection methods into density-and distance-based methods. In algorithmic terms, existing methods are almost universal filters (although there are some embedded methods), because they are focused on finding outliers independently from the learning task.
In this work, we are interested in wrapper multivariate methods, which can be seen as multi-objective optimization models. Following the classical multi-objective optimization model for feature selection, our contribution can be summarized as follows: • First, we define first a multi-objective optimization model for outlier detection.
• Second, we propose a multi-objective optimization model for simultaneous feature selection and outlier detection.
• Third, we implement an evolutionary algorithm to solve the proposed models.
• Fourth, we test our solution on both artificial and real data, comparing the results against several state-ofthe-art techniques for outlier detection. In particular, we test our algorithm in the case of linear regression on data that contain the result of the chemical-physical analysis of groundwater samples collected from a very specific area of northeastern Italy, where traces of pollution due to the abnormal presence of mercury (Hg) were found. The problem consists of finding which chemical-physical elements are linked to the pollutant and in which way. Our data set is part of a larger project whose aim is to find the origins and causes of pollution in the affected area; linking chemical-physical characteristics to the presence of pollutants can be useful in precisely identifying contaminated underground water bodies and can offer some insights into possible causes. This paper is organized as follows. First, in Section II, we give the necessary background on feature selection, outlier detection, and multi-objective optimization. In Section III, we define the optimization model for outlier detection and for simultaneous feature selection and outlier detection. In Section IV, we describe a solution to the optimization models via evolutionary algorithms, we propose an implementation, and we test our methodology on artificial data sets to evaluate its performances in a controlled environment. Then, in Section V, we test our implementation on real data, and we design a set of experiments devoted to establishing which strategy for feature selection and outlier detection works best. The remainder of the paper presents our conclusions.

A. FEATURE SELECTION
Feature selection (FS) is defined as the process of eliminating features from the database that are irrelevant to the task to be performed [7]. Feature selection facilitates data understanding, reduces the storage requirements, and lowers the processing time, which allows the model learning to become an easier process. Feature selection methods that do not incorporate dependencies between attributes are called univariate methods; in multivariate FS, on the other hand, the assessment is performed for subsets of features rather than single features. In many cases, the dataset will be better characterized by the sole features that have the highest discriminative power than by the whole set of features. However, when an individual feature in a high-dimensional feature space presents a small correlation with the target class, it could still be useful in combination with other features; in this regard, multivariate FS tends to present better results than univariate FS [12]. Feature selection algorithms are also categorized into filter, wrapper and embedded models. Filters are algorithms that perform the selection of features using an evaluation measure that classifies their ability to differentiate classes without making use of any machine learning algorithm. Wrapper methods select variables driven by the performances of an associated learning algorithm. Finally, embedded models perform the two operations (selecting variables and building a classifier) at the same time; in the particular case of regression, a relatively well-known embedded algorithm is the so-called lasso regression method [13]. Wrapper methods for feature selection are more common in the literature; often, they are implemented by defining the selection as a search problem and solved using metaheuristics such as evolutionary computation (see, e.g., [14]- [16]). Wrapper schemata are more common in supervised classification rather than regression; however, some effort in this direction has been done (see, e.g., [17]). Other strategies for feature selection include adaptive boosting [18]. Feature selection methods based on subset evaluation consist of four steps, usually called subset generation, subset evaluation, stopping criterion, and result validation. Subset generation is commonly implemented as a heuristic search algorithm in which candidate subsets are prepared for evaluation. Obviously, the search space for candidate subsets has cardinality 2 N , where N is the number of features. Examples of subset generation mechanisms include [15], [19]- [27]. During the phase of subset evaluation, the goodness of a subset produced by a given subset generation procedure is measured, and examples include [28]- [33]. The stopping criterion is established when the feature selection process must finish; it can be defined as a control procedure that ensures that no further addition or deletion of features produces a better subset, or it can be as simple as a counter of iterations. Finally, during result validation, the validity of the selected subset is tested.

B. OUTLIER DETECTION
Outlier detection (OD) is defined in [34] as the process of eliminating samples from the database that do not comply with the general behavior of the data model. Such samples, called outliers, differ significantly from the remaining set of data, and their existence can be either caused by measurement error or the result of inherent data variability. While OD methods could be classified following a taxonomy similar to the one used for FS methods, in the literature, OD is first separated into supervised and unsupervised methods. Supervised OD defines outlier detection as a classification process; it requires a previous labeling of those instances that are known to be outliers, and then, it focuses on training a classifier to recognize further outliers [35]- [39]. When referring to generic outlier detection, unsupervised OD, which is much more common than supervised OD, is further separated into density-(see, e.g., [40] and many other variants of the LOF algorithm) and distance-based methods (see, e.g., [41], [42]). In our terminology, all of these methods can be classified as filters (both unsupervised and supervised, both univariate and multivariate), because the process of detecting outliers is not linked to any successive learning task or algorithm. Univariate, unsupervised filters for OD are also the most commonly available methods; for example, the open-source suite Weka offers the filter InterquartileRange, which allows one to detect outliers and extreme values by looking at the values of any attribute under a normal distribution hypothesis. Embedded models for OD in regression have also been attempted [43]- [45]. Finally, robust regression (see, e.g., [46]) is a portfolio of regression techniques that are designed to respond better than classic regression techniques when the usual hypothesis on which they are based fails, for example, in the presence of outliers. Therefore, they should be listed as outlier detection techniques. Robust methods are based on either substituting the least square principle with other principles, such as least absolute deviation, m-estimation, or least trimmed squares, or replacing the normal distribution with a t-distribution, or on using unit weights. Robust regression is not as popular as classical regression; the algorithms are not very widely present in learning and statistical suits, and they have often been criticized for being computationally demanding. Robust regression has also been approached via evolutionary algorithms [47].

C. COMBINED MODELS
A combination of FS and OD has been attempted in the literature using several different schemata. One of the early approaches is [48]. A more recent diagnostic method based on dummy variables and evaluation-by-plotting, which is specific for linear regression, was presented in [49], and a two-step method specific for linear regression was presented in [50]. On the other hand, OD is similar to the (more common) instance selection (IS), which consists of discovering a subset of instances such that a model built with only that VOLUME 9, 2021 subset has similar, or better, performances than a model built from the entire set of instances [8], and IS has been combined with FS in several experiments whose structure is similar to the model used in this paper. Outlier detection is similar to IS in the sense that in both cases, instances are selected from the data set; however, while IS is focused on finding a set of representative instances, the aim of OD is discovering, should they exist, unrepresentative instances that should be eliminated. In the literature, IS is often combined with FS [8], [51]- [53]. Combined models for FS and IS range from sequential combinations (FS+IS versus IS+FS) to simultaneous models, although the latter are not usually defined as optimization models.

D. MULTI-OBJECTIVE OPTIMIZATION
A multi-objective optimization problem (see, e.g., [54]) can be formally defined as the optimization problem of simultaneously minimizing (or maximizing) a set of k arbitrary functions: wherex is a vector of decision variables. A multi-objective optimization problem can be continuous, in which case we look for real values, or combinatorial, where we look for objects from a countably (in)finite set, typically integers, permutations, or graphs. Maximization and minimization problems can be reduced to each other, which means that it is sufficient to consider only one type. A set F of solutions for a multi-objective problem is nondominated (or Pareto optimal) if and only if for eachx ∈ F, there exists noȳ ∈ F such that (i) In other words, a solutionx dominates a solutionȳ if and only ifx is better thanȳ in at least one objective, and it is not worse thanȳ in the remaining objectives. We say thatx is nondominated if and only if there is no other solution that dominates it. The set of nondominated solutions from F is called the Pareto front. Optimization problems can be approached in several ways; among them, multi-objective evolutionary computation is a popular choice (see, e.g., [55]- [57]). Within the evolutionary computation paradigm, there are many choices, that range from ant colony optimization, to bee colony optimization, to differential evolution, to evolutionary algorithms, among many others. Some very recent contributions, especially applied to the FS problem, include [58]- [60]. Among the several evolutionary algorithms offered in the recent literature, however, the non-dominated sorted genetic algorithm (NSGA-II) [61], the second version of the NSGA algorithm, is a very well-known one, easily available as open source code. NSGA-II is an elitist Pareto-based multi-objective evolutionary algorithm that uses a strategy with binary tournament selection and a rank-crowding better function, where the rank of an individual is defined as its non-domination level in the whole population. While experience indicates that there is no unique evolutionary algorithm that outperforms all others in solving every optimization problem, NSGA-II has been tested in a very wide range of problems showing remarkable performances, and it is available in open source libraries.

E. MULTI-OBJECTIVE EVOLUTIONARY FEATURE SELECTION AND OUTLIER DETECTION
Genetic algorithms have always be considered a powerful tool for feature selection [15] and have been proposed by numerous authors as a search strategy in filter, wrapper, and embedded models [62]- [64] as well as feature weighting algorithms and subset selection algorithms [16]. The first evolutionary approach that involves multi-objective optimization for feature selection was proposed in [65] with three criteria: accuracy, number of features, and number of instances. In this approach, the three criteria are aggregated into a single criterion, and then, a single-objective algorithm is used. A formulation of feature selection as a multi-objective optimization problem was presented in [56]. The wrapper approach proposed in [66] accounts for the misclassification rate of the classifier, the difference in the error rate among classes, and the size of the subset using a multi-objective evolutionary algorithm where a niche-based fitness punishing technique is proposed to preserve the diversity of the population. A wrapper approach was proposed in [67], which minimizes both the error rate and the size of a decision tree. Another wrapper method was proposed in [68] to maximize the cross-validation accuracy on the training set, maximize the classification accuracy on the testing set, and minimize the cardinality of the feature subsets using support vector machines applied to protein fold recognition. In [69], two wrapper methods with three and two objectives, respectively, applied to cancer diagnosis were compared. The three-objective version optimizes the sensitivity, specificity and number of genes, while the two-objective version optimizes the accuracy and number of genes. NSGA-II is used as the search strategy, and a support vector machine is used for the classification task. A filter local search embedded multi-objective memetic algorithm was presented in [70], which is a synergy of NSGA-II and a filter method for the identification of relevant features in a multiclass problem. The filter approach proposed in [71] includes measures of consistency, dependency, distance and information, and it is based, again, on NSGA-II. An NSGA-II wrapper approach was also proposed in [72] for named entity recognition. Finally, a modification of the dominance relation is introduced in [73] to treat an arbitrarily large number of objectives and is used in a combination of NSGA-II, logistic regression, and naïve Bayes with Laplace correction as a classification algorithm. Very recent examples of multi-objective feature selection systems can be found in [55], [74], [75]. Finally, there have been a few (simple) attempts to use evolutionary computation to detect outliers; they include, for example, [76]. A generic formalization of a wrapper methodology for feature selection applied to a data set with n + 1 features and posed as a two-objective optimization problem can be devised from the literature. It simply involves adapting (1) to simultaneously optimize two functions: wherex, which takes the values in {0, 1}, represents the set of chosen features (1 means that the feature is selected, and 0 means that it is discarded), and C(x) represents its cardinality: and f (x) are any measures of the goodness of the learning algorithm used when applied to the data set obtained by selecting only those features indicated byx. If we choose to minimize such a function, then possible choices include defining it as the opposite of the accuracy, the correlation coefficient, or the mean squared error, among many others, depending on the specific classification/regression problem to be solved. Observe that applyingx entails transforming the original data set by selecting specific attributes while discarding the others; such a concept, as we shall see, can be easily generalized.

F. EVOLUTIONARY OUTLIER DETECTION
Outlier detection has been approached with evolutionary algorithms in the literature. In [77], for example, OD is solved by harmony search and differential evolution, where a single objective is defined using the sparsity coefficient. Additionally, in [78], the anomaly detection problem is dealt with as a classification problem. Voronoi diagrams are used for the task of classification, and they are generated by many-objective evolutionary computation where objectives are progressively added. They correspond to the classical performance metrics for classification and for Voronoi diagrams (accuracy, recall, number of cells, total empty volume, and so on). Finally, in [79], a multiobjective evolutionary algorithm is proposed to determine the best penalty factors of a sparse group lasso constraint imposed on the learning objective of Adabost, which is combined with an autoencoder for image outlier detection.

G. AN OVERVIEW OF OUR CONTRIBUTION
In this paper, we first define the OD problem as a multi-objective optimization problem. This definition is rendered with a wrapper methodology that at the expense of being linked to a specific learning model (in our case, regression) can detect outliers in a very effective and easyto-implement way; as a matter of fact, we use well-known mechanisms whose implementations are widely available. Moreover, because our methodology is multivariate, we evaluate outliers in groups instead of one-by-one. Furthermore, a wrapper OD method has the advantage of being able to spot outliers in reference to a particular learning model, and those outliers could be overlooked by a blind method. Finally, being multi-objective, our methodology has the advantage of minimizing the number of outliers, which is consistent with considering the original data to be reliable up to a certain point, as opposed to simply eliminating all instances that do not allow us to fit a model. Second, we define the simultaneous FS and OD problem as a multi-objective optimization problem, building on the previous definition and adding a third objective. This proposal has the same advantages as the previous proposal, except that it allows the detection of outliers in specific reference to the selected attribute. Third, we propose and run a testing methodology to evaluate the effectiveness of not only the proposed method against popular outlier detection models but also the several possible variations of our schema that emerge from defining our selection models against each other.

III. SIMULTANEOUS FEATURE SELECTION AND OUTLIER DETECTION FOR REGRESSION A. OUTLIER DETECTION AS A MULTI-OBJECTIVE OPTIMIZATION PROBLEM
Inspired by the optimization model for multivariate wrapper feature selection, which, as we recalled in the previous section, can be defined as the problem of minimizing the cardinality of the selected features while optimizing, at the same time, the performances of the associated learning algorithm, we see that outlier detection can be solved in a very similar way. For a data set with m instances, letȳ = (y 1 , . . . , y m ) be a vector of decision variables in {0, 1}. We interpretȳ as the set of selected outliers, in other words, the subset of instances that are not used by the learning algorithm on which the wrapper is built (thus, unlike FS, for 1 ≤ t ≤ m, we have thatȳ(t) = 1 represents not selecting the instance t, whilē y(t) = 0 represents selecting it). Then, we can define the outlier detection problem by adapting (1) as follows: where the choices for f range, as before, among the possible measures of performance of the learning algorithm. For example, if we had a data set with m = 7, applyinḡ y = (0, 0, 0, 1, 0, 1, 1) encompasses a transformation that consists of eliminating instances 4, 6, and 7. This approach corresponds to evaluating the hypothesis that such instances are, in fact, outliers and that being the case, the regression algorithm will behave better. Observe that minimizing the number of selected outliers is consistent with assuming that if there are outliers in the data set at all, they cannot be too many. While (2) and (4) appear identical, the (fundamental) difference lies in how the decision variables are interpreted, in other words, how the data set is transformed at each iteration of the wrapper. Observe that, for example, in the case of linear regression, the objectives are opposite to each other. In particular, observe that there are trivial solutions to a perfect regression, for example, with two points. This circumstance means that if we were not to minimize the number of outliers while maximizing the correlation index, the system would return irrelevant solutions with subsets of points that lie on the same straight line. Moreover, this system is designed for data sets that are supposed to be relatively clean, because they are the results of systematic lab analyses. They could contain outliers, but not too many; minimizing the number of outliers and setting a predetermined maximum number, therefore, makes sense. Finally, this approach is insensitive to the type of outliers, and it can detect both vertical and horizontal outliers alike.

B. SIMULTANEOUS FEATURE SELECTION AND OUTLIER DETECTION
In some problems, the best way to search for a function that fits a data set is performing feature selection and outlier detection simultaneously; as a matter of fact, an instance could be an outlier only under the point of view of certain features, and thus, it should be considered as such only if those features are selected. We can express the process of simultaneous feature selection and outlier detection as a multi-objective optimization problem, as follows. For a data set with n + 1 features and m instances, letx,ȳ be two vectors of decision variables, wherex (with range n) is defined as in (2), and y (with range m) is defined as in (4). The pairx,ȳ entails a transformation of the data set that corresponds to selecting those features set to 1 inx and discarding those instances set to 1 inȳ. Then, a three-objective optimization problem can be defined by adapting (1) as follows: where f (x,ȳ) measures, as expected, the performances of the learning algorithm on (i) the instances survived after the transformation entailed byȳ, and (ii) the features chosen byx only. Thus, continuing with the previous example and further assuming that the original dataset has 5 features (excluding the class), havingx = (0, 1, 1, 0, 1),ȳ = (0, 0, 0, 1, 0, 1, 1) implies selecting features 2, 3, and 5 while disregarding the instances 4, 6, and 7 (see Fig. 1).

IV. MULTI-OBJECTIVE EVOLUTIONARY OPTIMIZATION A. IMPLEMENTATION
As seen in the previous sections, we have defined FS, OD, and simultaneous feature selection and outlier detection as optimization problems. We choose to approach such optimization problems via an evolutionary algorithm and, in particular, using the well-known algorithm NSGA-II [61], which is available in open source from the suite jMetal [80]. As a black box regression algorithm, we used the class LinearRegression from the open source learning suite Weka [81], with the following default parameters: no embedded feature selection method, elimination of collinear features, and ridge penalty 10 −8 , run in 10-fold cross-validation mode; therefore, in our experiments, we shall be looking for linear regression models. We use a fixed-length representation, where each individual solution consists of a bit set, which is different for each strategy. In simple FS, each individual is of the following type: where, for each 1 ≤ t ≤ n, x t = 1 (resp., x t = 0) is interpreted as the t-th attribute being selected (resp., discarded), while in simple OD it is of the type: y = (y 1 , y 2 , . . . , y m ), where, for each 1 ≤ t ≤ m, y t = 1 (resp., y t = 0) is interpreted as the t-th instance being discarded (resp., selected). Clearly, the same representation is used in the sequential strategies. In the simultaneous strategy, the representation is the natural generalization of the strategy used in FS and OD: While the outlier detection model is very general, as a rule of thumb, we consider any solution that has too many outliers unacceptable. This choice has effects on both the initial population and the generations; indeed, the initial population is generated randomly, but while individuals (or parts of individuals) used for feature selection are generated in the range of all possible solutions, the individuals (or parts of individuals) used for outlier detection are generated in such a way that the cardinality of each individual is less than or equal to a parameter that we control. When considering that the NSGA-II does not allow the management of external constraints, unacceptable solutions that emerge during the generations are artificially given the worst possible values in each objective and are therefore discarded. Lett be any of x,ȳ, andz. In terms of objectives, minimizing the cardinality of the individuals is straightforward, and we do so by using the function C(t) defined in the previous section. To optimize the performance of the learning algorithm used in the wrapper, we define where ρ() measures the correlation between the stochastic variable obtained by the observations and the linear variable obtained by LinearRegression on the data set after the transformation entailed byz, as explained in the previous section. The correlation varies between −1 (perfect negative correlation) and 1 (perfect positive correlation), with 0 being the value that represents no correlation at all. Defined in this way, f ought to be minimized. Summarizing, the fitness of individual is computed using the correlation of the linear regression with the selected data (i.e., selected features and/or selected instances after outlier elimination), in 10-fold cross-validation mode, and the number of selected features/outliers. In terms of parameterization, we proceed as follows. In every experiment, the population size is 100. After an initial preliminary test, we found that 10000 total evaluations (100 generations) were sufficient to show convergent behavior. In each experiment, the sequential strategy consists of a first phase of feature selection (or outlier detection) followed by a phase of outlier detection (or feature selection), each of which should be treated as an experiment on its own. Therefore, each of these phases has been granted 10000 evaluations, which means that the last population in a sequential experiment is the result, in fact, of 20000 evaluations. For this reason, in simple strategy experiments, as well as in simultaneous strategy experiments, the number of evaluations has been set to 20000. The maximum percentage of detected outliers is set to 10%. We used the standard crossover and (uniform) mutation, with probabilities (set after an initial explorative analysis) of 0.7 and 0.3, respectively. Finally, in terms of an a posteriori decision, we choose to select, in each experiment, the individual with the best value of f (z). The seeds for folder extraction are always fixed from 1 to 10, to ensure the repeatability of the experiments.

B. TESTING WITH SYNTHETIC DATA
Artificial data sets are generally not a reliable playground for algorithm evaluation. Outlier detection, however, is an exception to this rule: as a matter of fact, knowing the nature and identity of true outliers helps us to establish whether an outlier detection algorithm is at least partially correct.
Here, we consider two artificial datasets with two different purposes; for each dataset, we run an experiment as explained above, and we show the results only in terms of the correlation coefficient, outliers, and selected features. In the first experiment, inspired by [82], we use the artificial wood dataset distributed within the R package [83], which encompasses 20 instances with five features and one dependent variable. Such a data set is declared to have 4 outliers (instances 4, 6, 8, and 19); in other words, 25% of the data are outliers. We executed 10 independent runs of our algorithm in outlier detection mode (later called OD mode) only, to evaluate whether outliers were detected and how often. The results, shown in Tab. 1, indicate that our method finds all outliers in 5 out of 10 executions and, as expected, the correlation coefficient is the highest when the outliers are found. In the second experiment, we created an artificial data set to better control its content. Our data set contains 105 instances described by 4 attributes X 1 , X 2 , X 3 , X 4 and one dependent variable, which is computed by the function: The first 100 instances are computed exactly; the remaining 5 have random values, and they are outliers. Attribute X 4 does not play any role. We executed 10 independent runs of our algorithm in outlier detection and feature selection mode (later called FSOD mode) only, to evaluate whether outliers were detected and how often and which features are selected. The results, shown in Tab. 2, once again indicate that our method works as expected; some outliers are found in every execution, all outliers are found in 2 out of 10 executions, and the correct features are selected in all executions. As in the previous case, by applying our a posteriori decision policy (in other words, selecting the execution with the highest correlation coefficient), the resulting data set ends up without outliers.

V. A COMPREHENSIVE EXPERIMENT ON REAL DATA A. STRATEGIES
Feature selection and outlier detection, as we have explained, can be performed individually, sequentially, or simultaneously, and it is interesting to establish which strategy performs better. To accomplish this goal, we designed a set of experiments to be performed on the same data set, which was organized as follows: • Simple strategy. This strategy consists of performing feature selection or outlier detection alone. By abusing the notation, we refer to the former as the strategy FS and the latter as the strategy OD.
• Sequential strategy. This strategy consists of performing feature selection first to obtain a reduced data set, which in turn is used as the starting data set for the outlier detection phase. We use the symbol FS+OD to denote VOLUME 9, 2021 such a strategy. The symmetric solution, which consists of outlier detection first followed by feature selection, is denoted by OD+FS.
• Simultaneous strategy. This strategy consists of simply applying the simultaneous optimization model, and it is denoted by FSOD.

B. TEST SETTINGS
Multiple executions (10 in our case) are required to ensure that good results do not occur by chance. As we have already mentioned, for each strategy and each execution, every evaluation is performed in 10-fold cross-validation mode with a fixed seed, and the random choices that occur within the computations (initial population, mutations, and so on) are governed by a seed that changes with the execution, from 1 to 10. In this way, the final results are always comparable to one another. For the original data set, the Weka built-in linear regression algorithm LinearRegression with default parameters (as specified in the previous section), in 10-fold cross-validation mode with seeds from 1 to 10, obtained ten correlation coefficients, which we call OR i (1 ≤ i ≤ 10). Moreover, we applied the Weka deterministic filter InterquartileRange to the original data set and tested the resulting data set with the linear regression algorithm in 10-fold cross-validation mode, obtaining ten correlation coefficients, which we call IR i (1 ≤ i ≤ 10); we used the following default parametrization: all columns were analyzed except the target, factor for outlier detection set to 3, and factor for extreme value detection set to 6. Similarly, we applied the standard robust regression algorithm (RR i , 1 ≤ i ≤ 10) and the standard lasso regression algorithm (LR i , 1 ≤ i ≤ 10), which were both offered in the R package [83]. We used the following, default, parametrization: for robust regression, we used Huber weights and default parameters, and for Lasso regression we used α = 1, standardize set to True, normalize set to False, maximum 1000 iterations, negative coefficients allowed. Then, in the simple and simultaneous strategies experiments, we performed 10 executions per strategy (again, with seeds from 1 to 10); out of each final population, we extracted the best individual in terms of the second objective (for us, the correlation coefficient) and named FS i (resp., OD i , FSOD i ): to each one of these individuals, we applied the linear regression algorithm, evaluated in cross-validation mode with the corresponding seed. Finally, for the sequential FS+OD strategy, we proceeded as follows: in each execution, after the initial 100 generations of the FS phase, we obtained a final population from which the best (correlation-wise) individual has been extracted; to the latter, we applied 100 further generations to obtain a new final population from which, again, we extracted the best individual, denoted, here, by FS+OD i , and whose correlation coefficient has been computed by running the linear regression algorithm and evaluating it in cross-validation mode with its corresponding seed. The correlation coefficients for the individuals OD+FS i have been obtained in a symmetric way.

C. RESULTS EVALUATION
To evaluate the results, we designed the following tests: • Interstrategy test, designed to evaluate which strategy performs better. To accomplish this goal, we considered the set of 10 results of each strategy (in terms of the correlation coefficient) as an independent sample, and we applied an independent group t-test to verify the statistical reliability of the results; • Post-strategy test, designed to evaluate the results from a mathematical point of view. To accomplish this goal, we compare the quality of the regression model obtained with the original data with that of the model obtained with the best individual from the simultaneous strategy via a residual versus fitted test and a residual versus leverage test.

D. DATA AND RESULTS
To test our system, we considered a data set of physical-chemical samples of underground water in a very specific area in northeastern Italy. Such samples were collected as part of an ongoing investigation commissioned by the local Regional Agency for Environment and Prevention to the University of Ferrara, with the purpose of exploring the causes of a sudden, unexpected spike of Hg (Mercury) in the underground water. Such data are being used to perform several physical-chemical studies. Among these, we are inter-    427 instances. The box plot shown in Fig. 2 refers to the single-attribute outliers that the original problem presents; as seen, it would be quite difficult to pick individual instances as outliers based on this information. First, let us consider Tab. 3, which contains the ten executions in cross-validation mode of the linear regression algorithm, with seeds from  1 to 10. Use the following indicators: cc (the correlation coefficient), rmse (the root mean squared error), rae (the relative absolute error), and rrse (the root relative squared error). It can be immediately observed that the correlation coefficients that emerge from this initial experiment are not very high; this finding could be due to several factors,  including the presence of noise attributes and/or outliers in the instances. Eliminating the outliers that are found by running the Weka built-in tool (as explained in the previous section) does not appear to improve the results, as can be observed in Tab. 4; this observation is not surprising: this method is quite primitive and does not consider the features that are actually used in the subsequent regression step (in fact, it is unrelated to the learning algorithm). The presence of features that are, in fact, noise with respect to the problem is witnessed by Tab. 7, which contains the results of the experiments with the pure FS strategy. When we run the OD experiments, it can be seen that the correlation coefficients increase significantly. The last column of Tab. 8 indicates the rate of outliers that have been selected and eliminated in each best individual of each experiment; as can be seen, especially in some cases, the number is very low, which indicates that a correct choice of (suspected) outliers can truly improve the learning phase, even in the presence of noise features (recall that, in this experiment, there is no feature selection). When we run the simultaneous FSOD experiment, we obtain the results in Tab. 9, which are, as expected, the best results, proving that outliers should be detected relative to the features that are being used in the learning phase. While a simultaneous, integrated, and automated system is certainly preferable over a semiautomated system, such as a sequential strategy, for practical reasons, one can still wonder if selecting features and then detecting outliers, or the other way around ,   TABLE 10. Evaluation of linear regression on each of the 10 data sets obtained by selecting the best individual from 10 executions; the seed was varied, and the sequential feature selection and outlier detection model (FS+OD) was used.  gives results that are comparable to those obtained with the simultaneous strategy. The results of two experiments that correspond to the two possible sequential strategies, shown in Tab. 10 and Tab. 11, respectively, prove that while both of them are comparable to the simultaneous strategy in terms of improvement over the original problem, the simultaneous strategy still presents a slight advantage over both sequential schemata. Tab. 12 summarizes the average results.
The inter-strategy test shows interesting values. After setting a confidence level of 0.01, as shown in Tab. 14, the differences in the average scores between the simultaneous strategy and pure feature selection, standard outlier detection, and the original data appear to be statistically significant. From Tab. 15, on the other hand, it can be seen that also the improvement of each of our strategies that includes outlier detection over pure feature selection is statistically significant; the improvement of the simultaneous strategy over the  FS+OD strategy is also significant, but at a 0.05 confidence level. Thus, the simultaneous strategy produces better results than any of the other strategies that we have tested, and such improvements have a very high level of reliability in most cases. In Fig. 3 and Fig. 4, we display the results of the tests on the residuals of two models: the original model and the model obtained from the best individual of the simultaneous strategy. In both cases, it is possible to see how the residuals appear clearly closer to the fitting line and more normally distributed in the model obtained by the simultaneous strategy compared with the original strategy, which indicates the superior quality of the former over the latter.
After having established that FSOD produces the best absolute results and the most statistically reliable findings, it is important to compare its results with those that can be obtained with standard techniques. Applying the filter (as in Tab 4) produces clearly worse results in the same conditions and only slightly better than those that can be obtained from the original data set. Applying robust regression (as in Tab. 5), which, let us recall, should be resilient to the   presence of outliers and does not perform feature selection, produces slightly better results than those that can be obtained on both the original data set, the filtered data set, and the data sets obtained after feature selection only but decidedly worse than those obtained by our outlier detection method and our simultaneous method. Finally, lasso regression (shown in Tab. 6) does produce, in some execution, good results; however, the difference between the worst and the best case is so high that the method cannot be considered to be reliable, and on average, the correlation coefficient is still inferior to the average result obtained by our simultaneous technique. Moreover, lasso regression has a feature selection embedded algorithm, and in general, it does not show the selected features or the detected outliers.

E. INTERPRETATION
From the results obtained by FSOD, we can now extract the most interpretable individuals for the problem at hand. Recall that the ultimate aim of analyzing the water samples was to identify, if it exists, the physical-chemical signature of the contamination, in other words, the subset of variables that allow the extraction of a linear regression model to predict the level of contamination whose reliability, after having excluded potential outliers, is sufficiently high to allow a geochemical interpretation. Therefore, from each of the ten executions of the FSOD method, we extract the best individual (in terms of the correlation coefficient) with fewer than six variables, and we study such individuals to identify patterns. Tab. 13 shows such individuals. We can see that Ca, HCO3, Fe, and Na are the variables that apparently present the highest correlation with the contaminant.

VI. CONCLUSION, LIMITATIONS, AND FUTURE WORK
In this paper, we have approached two common problems associated with multivariate regression: selecting the best variables that predict the independent variable(s) and detecting the instances that can be suspected to be outliers. While these problems have been widely debated in the literature, they are often considered separately and with different heterogeneous methods. We have proposed a comprehensive, integrated optimization model that solves them simultaneously in such a way that outliers can be detected in reference to the specific variables that are selected for the regression. We implemented such an optimization model with a well-known evolutionary algorithm, and we tested it on real data taken from a problem of natural resource exploitation. The results show that our algorithm, when used to detect outliers without feature selection, already produces better models than those produced by standard outlier detection methods and those produced by standard feature selection methods. Additionally, they show that simultaneous feature selection and outlier detection, as we expected, produces the best results, comparable with those produced by attacking the two problems sequentially (in particular, first outlier detection and then feature selection), yet it is still superior (in terms of the correlation coefficient of the extracted linear model) to the latter. Since it is based on an evolutionary algorithm, one of the main limitations of our approach is the computation time, which is much higher than, for example, embedded outlier detection methods. Moreover, our method can be improved by including instance selection for a better result: the idea is that while outliers are discarded, the most representative instances are selected, and both choices depend on the selected features. Finally, our approach does not take into account different possibilities to solve the underlying multi-objective problem: in the future, we plan to explore not only modern evolutionary algorithms, such as binary differential evolution with self-learning or self-adaptive particle swarm optimization, but also non-evolutionary techniques.
FERNANDO JIMÉNEZ received the M.D. degree in computer science from the University of Granada, Spain, and the Ph.D. degree in computer science from the University of Murcia, Spain. He is currently an Associate Professor with the University of Murcia. His research interests include evolutionary computation, multi-objective constrained optimization, soft computing, evolutionary fuzzy systems, and data mining.
ESTRELLA LUCENA-SÁNCHEZ is currently pursuing the Ph.D. degree with the University of Modena and Reggio Emilia, Italy, and with the University of Ferrara, Italy. She specializes in intelligent data mining and knowledge extraction from biological, chemical, and physical data, with classical and new methods and techniques.
GRACIA SÁNCHEZ received the M.D. degree in computer science from the Polytechnic University of Valencia, Spain, and the Ph.D. degree in computer science from the University of Murcia, Spain. She is currently an Associate Professor with the University of Murcia. Her research interests include multi-objective constrained optimization, soft computing, evolutionary fuzzy systems, and data mining.
GUIDO SCIAVICCO received the M.D. and Ph.D. degrees in computer science from the University of Udine, Italy. He is currently an Associate Professor with the University of Ferrara, Italy. He has been researching in temporal, spatial, and modal logic for over ten years, especially in satisfiability and model-checking problems. Recently, he has focused on studying temporal data mining problems with applications to real-world cases.