Revisiting the Dissimilarity Representation in the Context of Regression

In machine learning, a natural way to represent an instance is by means of a feature vector. However, several studies have shown that this representation may not accurately characterize an object. For classification problems, the dissimilarity paradigm has been proposed as an alternative to the standard feature-based approach. Encoding each object by its pairwise dissimilarities has been demonstrated to improve data quality because it mitigates complexities such as class overlap, small disjuncts, and small sample size. However, its suitability and performance when applied to regression problems have not been fully explored. This study redefines the dissimilarity representation for regression. To this end, we carried out an extensive experimental evaluation on 34 data sets using two linear regression models. The results show that the dissimilarity approach decreases the errors of both the traditional linear regression and the linear model with elastic net regularization, and it also reduces the complexity of most regression data sets.


I. INTRODUCTION
An underlying step in machine learning and pattern recognition is the characterization of objects, where an ideally good representation ensures the building of accurate learning algorithms [1]. Three approaches have emerged to represent a real-world object [2], [3]: the structural or syntactical approach, which uses a symbolic data structure; the statistical approach, based on a feature representation; and the approach based on class models.
The statistical approach assumes that an object is characterized by an n-dimensional vector x = [x_1, ..., x_n]^T ∈ R^n, where each x_i is a numeric attribute (feature) whose values are obtained through observation or as samples of the data (e.g., pixels of an image) [3], [4]. However, this representation may not capture the internal structure of objects that have an intrinsic and detectable organization [5]–[7]. In classification problems, it is often difficult to obtain an appropriate feature-based characterization of objects, which may lead to a high-dimensional representation with class overlap, or to a representation that mixes continuous and categorical features [5], [8].
Pȩkalska and Duin [9] proposed the dissimilarity representation, in which objects are characterized by their difference or dissimilarity to other objects from a representation set. A straightforward method of constructing the new representation is a mapping process that converts a feature vector into a dissimilarity vector using some distance measure. Several studies have demonstrated that this alternative representation offers practical advantages over the feature representation: i) a simple linear prediction model can be used [10], ii) it yields good separability between classes [11], iii) all dimensions in the dissimilarity space are equally relevant [11], and iv) the small disjunct problem is reduced [12].
The dissimilarity representation has extensively been applied to a variety of classification problems. For example, Bruno et al. [13] proposed a particular form of dissimilarity space for multimodal information, enabling fast and efficient interactive content-based retrieval of video data. Porro-Muñoz et al. [14] concluded that the use of the dissimilarity representation for the classification of chemical spectral data, which is characterized by changes in the shape of the spectra of different classes, outperformed the results achieved on the feature space.
Theodorakopoulos et al. [15] developed a method for pose-based human action recognition in the dissimilarity space. The problem of corporate bankruptcy prediction was tackled using four linear classifiers designed on the dissimilarity space, showing that their performance was considerably better than that of the models applied to the feature space [10]. Orozco-Alzate et al. [1] investigated the suitability of a dissimilarity representation based on dynamic time warping for distinguishing among different seismic volcanic patterns. Classification of time series has also been carried out using the dissimilarity representation [16].
Martins et al. [17] introduced a framework based on dissimilarity vectors and dynamic classifier selection to identify microscopic images of forest species. A two-stage model consisting of a feature selection algorithm and the dissimilarity-based representation for the classification of microarray gene expression data was proposed by García and Sánchez [18], who reported that the dissimilarity representation appears to be less sensitive to the number of genes than the feature-based representation. The dissimilarity representation has also been combined with multiple classifier systems for text categorization [19].
To the best of our knowledge, far less research attention has been paid to the applicability of the dissimilarity representation to regression tasks. For instance, Jaramillo-Garzon et al. [20] modeled time-frequency representations by means of support vector regression and calculated the distance between regressions through dissimilarity measures based on dot products for the classification of phonocardiographic recordings. Silva-Mata et al. [21] combined the dissimilarity representation with the classical Partial Least Squares regression model for the recognition of substances and their chemical-physical properties in biochemical data. Despite these few works, we argue that the dissimilarity representation has not yet been deeply studied in the framework of regression problems.
This paper offers a large-scale experimental analysis with 34 benchmark regression data sets that compares the performance of two linear models trained on feature and dissimilarity vectors. We intend to shed light on the suitability of the dissimilarity representation by addressing the following questions: 1) How does the representation set size affect the predictive performance of regression models? 2) Does the dissimilarity representation reduce the complexity of a regression problem? 3) Do the dissimilarity-based linear regression models perform significantly better than the feature-based ones?
We address these issues by evaluating a dissimilarity representation constructed by means of the Euclidean distance and a random selection strategy used to form the representation set. To capture the difficulty of a regression problem, we compute a data complexity measure [22] in order to check whether or not the regression problem becomes simpler under the dissimilarity representation than under the traditional feature representation.
The rest of the paper is organized as follows. Section II introduces the basis of the dissimilarity representation, whereas Section III describes the process of adapting the dissimilarity representation to regression problems. Next, Section IV provides the experimental set-up. In Section V, the experimental results are presented and discussed. Finally, Section VI summarizes the main conclusions and outlines possible future directions to extend this work.

II. THE DISSIMILARITY REPRESENTATION
The construction of the dissimilarity representation from the feature representation is based on measuring pairwise dissimilarities between an object and a set of prototypes or representative objects of each class, R = {p_1, ..., p_M}. This set R can be the complete set of objects T = {x_1, ..., x_N}, a subset of T (R ⊆ T), or even a set of generated prototypes [23]. The most straightforward method to select M prototypes from T is random selection, either by ensuring that R contains prototypes of each class or by a global selection in which R may not contain examples from all classes. Several works have shown that an appropriate, intelligent selection strategy improves performance thanks to a better transformation of the feature space [24].
For the dissimilarity representation, we need a suitable dissimilarity measure d(·, ·) computed or derived from the objects; this measure must be non-negative (d(x_i, x_j) > 0 if x_i is distinct from x_j) and obey the reflexivity condition (d(x_i, x_i) = 0), but it may be non-metric. Common dissimilarity measures include, among others, the Chi-square distance, Euclidean distance, Kolmogorov-Smirnov distance, cosine distance, Pearson correlation coefficient, Minkowski distance, and Spearman correlation.
A dissimilarity representation is defined as a data-dependent mapping function D(·, R) from T to the dissimilarity space [9], [11]. This implies that each object x_i ∈ T is represented by an M-dimensional real-valued vector D(x_i, R) = [d(x_i, p_1), ..., d(x_i, p_M)], that is, each dimension corresponds to the dissimilarity computed between x_i and a prototype p_j ∈ R (j = 1, ..., M). Then, the dissimilarities between all objects in T and the prototypes in R are collected in a matrix D(T, R) of size N × M [25]:

D(T, R) = \begin{bmatrix} d(x_1, p_1) & d(x_1, p_2) & \cdots & d(x_1, p_M) \\ d(x_2, p_1) & d(x_2, p_2) & \cdots & d(x_2, p_M) \\ \vdots & \vdots & \ddots & \vdots \\ d(x_N, p_1) & d(x_N, p_2) & \cdots & d(x_N, p_M) \end{bmatrix}

A representation space can now be built from this matrix. The dimensionality of the dissimilarity space is equal to M (the cardinality of R), so each dimension corresponds to the dissimilarities to one of the prototypes in the representation set. Note that the mapping process generates new variables to represent the data, thus changing the meaning of the original attributes.
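To make the mapping concrete, the following R sketch (our own illustration, not the paper's code; the function and variable names are ours) builds D(T, R) from a feature matrix X and a prototype matrix P using the Euclidean distance:

```r
# Dissimilarity mapping: each row of X (N objects x n features) is described by its
# Euclidean distance to every prototype in P (M prototypes x n features),
# yielding the N x M matrix D(T, R).
dissimilarity_matrix <- function(X, P) {
  X <- as.matrix(X)
  P <- as.matrix(P)
  D <- matrix(0, nrow = nrow(X), ncol = nrow(P))
  for (j in seq_len(nrow(P))) {
    # Column j holds the distances from all objects to prototype p_j
    D[, j] <- sqrt(rowSums(sweep(X, 2, P[j, ])^2))
  }
  D
}
```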

III. DISSIMILARITY REPRESENTATION FOR REGRESSION
A regression problem involves a pair of measurements (x, z), where x is the independent variable and z ∈ R is the dependent variable. The aim of regression is to find a function f(·) that predicts z from a training set T = {(x_1, z_1), ..., (x_N, z_N)} of N observations. As in the mapping process used to convert a feature vector into a dissimilarity vector in classification tasks, in regression the mapping is carried out for both the learning and testing phases, as shown in the flowchart of Fig. 1.

Fig. 1. The mapping process into a dissimilarity representation. Dotted lines stand for the process of converting a test data set into a dissimilarity regression data set, whereas solid lines correspond to the steps for building the dissimilarity regression training set.
The first step in the construction of a dissimilarity matrix is to select a representation set from the training set. In this paper, this is done by a random selection method (Algorithm 1): the representation set is built by taking M random samples without replacement from T. In classification tasks, several instance selection methods exist that focus on extracting the most significant samples.

Algorithm 1: Random selection of the representation set
Input: Regression training set T; representation set size M. Output: Representation set R of M samples drawn from T without replacement.

Once R has been selected, the dissimilarity matrix is constructed using some dissimilarity measure, as shown in Algorithm 2; the Euclidean distance is used in our experiments. It must be remarked that the resulting dissimilarity set D also contains the target values taken from T. Remember that this mapping process must be performed on both the training and the testing data sets. These dissimilarity sets are then passed to the regression model for learning and for predicting the dependent variable of a new instance x.
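A minimal R sketch of the random selection performed by Algorithm 1 (the function name and the optional seed argument are our own additions):

```r
# Algorithm 1 (sketch): random selection of the representation set.
# X_train is the feature matrix of the training set; M is the desired set size.
select_representation_set <- function(X_train, M, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)      # optional, for reproducibility
  idx <- sample(nrow(X_train), size = M)  # M indices drawn without replacement
  X_train[idx, , drop = FALSE]
}
```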

Algorithm 2: Mapping process to construct a dissimilarity matrix
Input: Training set T; representation set R; dissimilarity measure d(·, ·). Output: Dissimilarity data set D containing the target values of T.

The computational complexity of the proposed algorithm depends on the cost of the mapping process that constructs the dissimilarity matrix D. Thus, for the training stage, both the time complexity and the space complexity are O(N · M), where N is the training set size and M is the representation set size. As the testing stage makes use of the representation set, the time and space complexities of mapping a test example are O(M).
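Putting the two steps together, the following R sketch (our own reconstruction, assuming X_train, X_test, z_train, and z_test hold the feature matrices and target vectors, and that M = 50 is an arbitrary example size) reuses the helper functions sketched earlier and mirrors the O(N · M) mapping cost discussed above:

```r
# Algorithm 2 (sketch): map training and test sets into the dissimilarity space,
# keeping the original targets.
R_set   <- select_representation_set(X_train, M = 50)
D_train <- dissimilarity_matrix(X_train, R_set)  # N x M: O(N * M) time and space
D_test  <- dissimilarity_matrix(X_test,  R_set)  # one test example costs O(M)

# Dissimilarity regression sets: the new variables are distances to the prototypes,
# while the dependent variable z is carried over unchanged.
train_diss <- data.frame(D_train, z = z_train)
test_diss  <- data.frame(D_test,  z = z_test)
```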

IV. EXPERIMENTAL SET-UP
Taking into account that the ultimate goal of this work is to investigate the benefits of the dissimilarity representation over the feature representation in the context of regression, we performed a systematic experimental study using two linear regression models and a pool of gold-standard data sets. In addition, a Wilcoxon paired signed-rank test was employed to support the statistical validity of the results. The dissimilarity mapping process was implemented with the mlr3 library [26] and is available at https://github.com/JAIR-VG/dissreg-tools.

A. DATA SETS
Experiments were carried out on 34 small- and medium-sized benchmark regression data sets that are commonly used in papers on regression [27]–[29]. Table 1 summarizes the main characteristics of these data sets, which were taken from the following sources: 1) the Torgo repository (https://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html), 2) the Weka data set repository (https://waikato.github.io/weka-wiki/datasets/), 3) the Energy efficiency data set used in [30], 4) the Extrusion diameter data set used in [31], and 5) the Residential building data set used in [32]. For each data set, the input variables were normalized to the range [0, 1]. The quality of the linear regression models was estimated using 5-fold cross-validation. The resulting training and testing data sets were transformed by Euclidean distances using a representation set. The training and testing of both regression models were carried out on the original data set (feature representation) and on the transformed data set (dissimilarity representation). The performance results reported in this paper are values averaged over the five trials.
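As an illustration of this preprocessing, a plain R sketch of the min-max normalization and a 5-fold split might look as follows (variable names are ours; the actual experiments relied on the mlr3 resampling machinery):

```r
# Min-max normalization of every input variable to [0, 1]
min_max <- function(X) apply(X, 2, function(v) (v - min(v)) / (max(v) - min(v)))

X_norm <- min_max(X)                              # X: feature matrix of one data set
folds  <- sample(rep(1:5, length.out = nrow(X)))  # random fold labels for 5-fold CV
test_idx  <- which(folds == 1)                    # e.g., fold 1 as the test partition
train_idx <- which(folds != 1)
```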

B. REGRESSION MODELS
The two regression models evaluated in our experiments were the generalized linear model with elastic net regularization (GLM) and a linear regressor (LR). In a linear model, the response variable z_i is modeled by a linear function of the explanatory variables x_j, j = 1, ..., n, plus an error term ε_i (typically assumed to be ε_i ~ N(0, σ²)):

z_i = \beta_0 + \sum_{j=1}^{n} \beta_j x_{ji} + \varepsilon_i

where β_j are the regression coefficients and x_ji are the regression variables.
In contrast with linear regression, where the output is assumed to follow a Gaussian distribution, the generalized linear model [33] is a flexible generalization of the linear model in which the response variable y_i does not need to be normally distributed but may follow some distribution from the exponential family (Poisson, multinomial, Bernoulli, chi-squared, gamma, and many others). Furthermore, homogeneity of variance does not need to be satisfied, and the errors must be independent but not necessarily normally distributed.
A GLM is made up of a linear predictor η_i = β_0 + β_1 x_{1i} + ... + β_n x_{ni}, a smooth and invertible linearizing link function g(μ_i) = η_i that describes how the mean E(y_i) = μ_i depends on the linear predictor, and a variance function var(y_i) = φ V(μ_i) that characterizes the conditional distribution of the response variable y_i, that is, how the variance depends on the mean μ_i and on a dispersion parameter φ.
The LR and the GLM with a Gaussian distribution were taken from the mlr3 framework [26] implemented in the R environment [34], using the default parameter values so that the results were not affected by a parameter fine-tuning step.
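For reference, a minimal sketch of how such an evaluation could be set up with mlr3 (this is our own illustration, assuming the mlr3 and mlr3learners packages and a data frame df whose target column is named "z"; the authors' actual scripts are in the repository cited above):

```r
library(mlr3)          # core task/resampling machinery
library(mlr3learners)  # provides regr.lm and regr.glmnet

task <- as_task_regr(df, target = "z")       # feature- or dissimilarity-based data
cv5  <- rsmp("cv", folds = 5)

for (id in c("regr.lm", "regr.glmnet")) {    # LR and GLM with default parameters
  rr <- resample(task, lrn(id), cv5)
  print(rr$aggregate(msrs(c("regr.rmse", "regr.mae"))))
}
```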

C. EVALUATION CRITERIA
We adopted two performance metrics commonly used in regression problems [29]. Both compute the numeric difference between the prediction of the model (ẑ_i) and the true value (z_i) [35]. The first is the Root Mean Squared Error (RMSE):

RMSE = \sqrt{\frac{1}{N_{test}} \sum_{i=1}^{N_{test}} (z_i - \hat{z}_i)^2}

where N_test is the number of test samples. The second metric is the Mean Absolute Error (MAE):

MAE = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} |z_i - \hat{z}_i|
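Both metrics are straightforward to compute; for instance, in R (the vector names are illustrative):

```r
# RMSE and MAE from a vector of true values z and predictions z_hat
rmse <- function(z, z_hat) sqrt(mean((z - z_hat)^2))
mae  <- function(z, z_hat) mean(abs(z - z_hat))
```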

D. STATISTICAL TESTS
The Wilcoxon paired signed-rank test was used to check for statistically significant differences between each pair of models. This statistic ranks the differences in the performances of two algorithms over the data sets, ignoring the signs, and compares the ranks for the positive and the negative differences. Let d_i be the difference between the performance scores of the two models on the i-th of L data sets. The differences are ranked according to their absolute values. Let R^+ be the sum of ranks for the data sets on which the first model outperforms the second, and R^- the sum of ranks for the opposite. Ranks of d_i = 0 are split evenly between the two sums; if there is an odd number of them, one is ignored:

R^+ = \sum_{d_i > 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i), \qquad R^- = \sum_{d_i < 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i)

Let Z be the smaller of the two sums, Z = min(R^+, R^-). If Z is less than or equal to the critical value of the Wilcoxon distribution for L degrees of freedom, the null hypothesis that both models perform equally well can be rejected.
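In practice, the test can be applied directly to the per-data-set scores of the two representations, e.g. with R's built-in function:

```r
# Paired Wilcoxon signed-rank test on the per-data-set errors of the two models
# (vector names are illustrative; exact = FALSE avoids warnings with ties/zeros).
wilcox.test(err_feature, err_dissimilarity, paired = TRUE, exact = FALSE)
```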

E. DATA COMPLEXITY IN REGRESSION
Data complexity analysis was proposed in classification tasks as an approach to describing the intrinsic data characteristics [36]. The ultimate aim is to quantify some difficulties such as class ambiguity, boundary complexity, sample sparsity, and feature space dimensionality [36].
Lorena et al. [22] proposed several complexity measures to estimate regression complexity. In this paper, we employ a feature correlation measure that captures the relationship of the feature values with the outputs: the maximum feature correlation (C_1), where higher values indicate simpler problems and lower values the opposite. C_1 takes the maximum correlation value over all feature dimensions and can be computed as

C_1 = \max_{j=1,\ldots,n} |\rho(x_j, z)|

where ρ is the Spearman correlation, x_j the j-th feature, z the dependent (output) variable, and n the dimensionality.
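A direct R sketch of C_1 under this definition (our own implementation, not the authors' code):

```r
# C1 (sketch): maximum absolute Spearman correlation between any feature column of X
# and the output vector z.
c1_measure <- function(X, z) {
  max(abs(apply(X, 2, function(xj) cor(xj, z, method = "spearman"))))
}
```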

V. RESULTS
The general objective of the experiments can be divided into a series of more specific purposes. First, we analyzed the effect of selecting different representation set sizes on the performance of the regression models. Second, we statistically checked whether or not the dissimilarity representation outperforms the feature one when applied to regression. Finally, we compared the complexity of the dissimilaritybased data sets against that of the feature-based data sets.

A. INFLUENCE OF THE REPRESENTATION SET SIZE
In this experiment, we evaluated how different representation set sizes affect the performance of the regression models. We omitted the small-sized databases (Diabetes-numeric, Basketball, Pollution, Pyrimidines, and Triazines) and, for the remaining 29 data sets, we randomly selected a number of representative objects ranging from 2 to 150. The upper bound was set to 150 because the random selection process cannot guarantee the optimal number of prototypes, whereas previous studies observed that selecting more than 150 objects did not produce significant differences [18]. When comparing and contrasting two or more data set series, it is important to represent them on comparable scales. Thus, we defined a relative error difference [29], [37], computed for each data set as

relative error difference = \frac{D - F}{F}

where F and D are the RMSE (or MAE) of the feature-based regressor and of the dissimilarity-based regressor, respectively. This score can be viewed as an indicator of the improvement or deterioration of the dissimilarity-based model with respect to the feature-based one. Fig. 2 shows the relative error difference of the 29 selected data sets achieved with the dissimilarity mapping process for all set sizes (2, ..., 150), where the red/blue lines correspond to the RMSE and MAE values averaged across all data sets. The x-axis represents the number of objects selected to construct the representation set, and the y-axis is the relative error difference in terms of RMSE (Fig. 2-a, Fig. 2-b) and MAE (Fig. 2-c, Fig. 2-d). Note that negative values indicate that the model using the dissimilarity representation was better than that based on the feature representation.

As can be observed from these plots, there is a general tendency for the error to decrease as the representation set size increases. Also, the relative error difference indicates that the regression models performed better with most data sets transformed into the dissimilarity representation than with the original feature-based data sets. However, it was not possible to determine the optimal size of the representation set. Although the dissimilarity mapping process did not yield an error reduction on some databases, we believe that the use of an intelligent prototype selection strategy instead of the random method could lead to the expected good behavior. Table 2 reports the RMSE and MAE values of both regression models. As the dissimilarity experiments were performed with different representation set sizes, for clarity and conciseness only the best RMSE and MAE values for each pair of data set and regression model are reported. As can be observed, the best performances were mostly obtained by training the regressors on the dissimilarity-based data sets. In addition, for each performance measure, Table 3 summarizes how many times the regression models built on the dissimilarity representation were better/same/worse than the regressors based on the feature representation.

B. PERFORMANCE EVALUATION
To determine whether or not the dissimilarity representation was better than the traditional approach, we ran a Wilcoxon signed-rank test on the RMSE and MAE results to detect statistically significant differences. Table 4 shows the ranks and the p-values when comparing one representation against the other. Considering a significance level α = 0.05, we highlight in bold the winning models whose associated p-value was lower than α. As can be seen, for all comparisons, the best algorithms were those trained with the data mapped into a dissimilarity space.

C_1 was computed for each data set in the feature representation as well as in the dissimilarity representations constructed from several representation set sizes. For the dissimilarity representation, for the sake of simplicity and clarity, Table 5 summarizes the maximum C_1 values obtained over all dissimilarity data sets. The results show that, for some data sets, the mapping process converted the data set into a simpler problem. However, this behavior was not observed for all data sets, despite the fact that some of them achieved better results when using the dissimilarity representation.

VI. CONCLUSIONS AND FURTHER EXTENSIONS
A good object representation influences the performance of supervised machine learning methods. The dissimilarity representation has been used as an alternative to the feature representation in classification problems, showing that dissimilarity-based classifiers may improve accuracy. In addition, this representation offers important advantages regarding the reduction of some intrinsic data difficulties, which allows the use of simple linear models. In this sense, the dissimilarity mapping process can also be used in the context of regression. Therefore, we performed an extensive experimental study on 34 benchmark regression data sets, where each one was transformed into a dissimilarity matrix using the Euclidean distance and a representation set. The independent variables in the new space correspond to the dissimilarities between pairs of objects. From the experimental results, it is possible to draw some concluding remarks that support our findings: 1) Through random selection of representation sets of various sizes, it has been shown that mapping a feature sample into a dissimilarity space improves the performance of the linear regression models. On the other hand, it seems that in some cases the use of a larger representation set can be beneficial.
2) The Wilcoxon signed-rank test with α = 0.05 validates our claim that the linear regression models built on the dissimilarity representation perform better than those based on the traditional feature representation. 3) Using a data complexity measure, it has been observed that the problems yield higher correlation values when the data sets are mapped into the dissimilarity representation, that is, the problems become simpler.
The main criticism that can be made of the present work is the lack of a theoretical analysis. However, the experimental results have demonstrated the potential benefits of using the dissimilarity representation in the context of regression problems, thus opening some avenues for further research. One of them is the design of systematic methods for the selection of representative objects specifically focused on regression tasks. Although it has been claimed that the Euclidean distance is suitable for the dissimilarity transformation, other metrics should also be explored; in this sense, we believe that the adoption of distance metrics for high-dimensional problems could be an interesting direction in which to extend the present work. Another point to investigate in the future is the behavior of the dissimilarity representation with non-linear regression models such as XGBoost, K-nearest neighbors, support vector regression, and random forests. In addition, living in the Big Data era, where data are growing exponentially, we would like to extend the present work by exploring the performance of the dissimilarity-based regression method on massive data sets, which bring a series of special computational challenges.
Finally, our proposal in its present form cannot be applied to real-world applications in a continuous learning system capable of incrementally storing and discarding streaming data. Thus, the design of dissimilarity-based regression models with the power of continuous learning and adaptation during real-time operation under changes in the environment constitutes an interesting open line for further research.