A Comparison Framework of Machine Learning Algorithms for Mixed-Type Variables Datasets: A Case Study on Tire-Performances Prediction

Many engineering applications in the automotive, aeronautic, rubber, mechanics, and manufacturing industries collect multiple datasets measuring physical relations between input variables and performances for modeling purposes. The challenge is that such data is often high-dimensional, non-linear, and contains mixed variables, i.e., numerical and categorical features, requiring specific algorithms and encoding schemes to perform regression tasks efficiently. Moreover, defining an appropriate similarity criterion for mixed-type data is a non-trivial task, especially when it is meant to be used in regression problems. This paper discusses the use of different machine learning algorithms for regression problems involving mixed-type variables across multiple datasets. We use tire-related datasets as a case study to perform a rigorous, statistically founded comparison of machine learning algorithms combined with encoding schemes for handling mixed variables in the prediction of tire performances. Friedman's statistic and Nemenyi post-hoc tests are used to test the significance of performance differences between techniques and encoding strategies. Our contributions come as a series of recommendations for efficiently handling mixed-type variables while achieving high performance on regression tasks over multiple datasets. Furthermore, we provide a flexible and efficient similarity function between tires, useful for tire comparison, prediction, and retrieval tasks.


I. INTRODUCTION
Machine learning (ML) in engineering applications has grown in popularity during the last decades [1]-[5]. Many industrial applications use ML tools to build regression models for product design, performance optimization, variable design, fault detection, quality assessment, and others. For instance, in the rubber industry, [6] uses non-linear least squares to estimate the tire-road friction coefficient for tire design. In automotive design, [7] employs support vector regression in structural optimization for vehicle crashworthiness design. More recently, [8] performed thermodynamic compressor performance modeling for engine design with neural networks and non-linear support vector regression. A common characteristic of engineering data is its tabular structure, where rows represent data examples that are described by a mixture of numerical and categorical variables, i.e., mixed-type variables. For instance, in car crashworthiness design [9], vehicle structures must be designed to absorb as much crash energy as possible through structural deformation and to attenuate the impact force when an impact occurs. The design variables are thickness related (measured in mm) and steel hardness types. For instance, the thicknesses of the B-pillar inner, the reinforcement, the floor side, and the door beltline are all numerical variables. However, the material design variables associated with steel hardness, i.e., mild, medium, or high strength steel, are categorical variables, as are material types such as iron, aluminium, plastic steel, glass, rubber, and copper [9]. Besides, engineering problems often involve multiple inter-related datasets describing complementary performance measurements of the system of interest (see our case study). Therefore, assessing the overall performance of an ML algorithm in regression tasks across multiple engineering datasets is challenging, in particular when the data contains mixed-type features.
The associate editor coordinating the review of this manuscript and approving it for publication was Seifedine Kadry.
From the data modelling perspective, dealing with mixed variables may be problematic for many ML algorithms. As categorical variables do not have an explicit ordering or format, operating with them requires special treatment to transform them into numbers that algorithms can process. Many strategies to handle mixed-type data have been proposed in the literature, including encoding schemes for categorical features such as 'dummy' encoding or specialized metrics for clustering [10], [11]. However, it is unclear from the literature which encoding schemes are more appropriate for a given class of regression algorithms, i.e., linear, tree-based, or kernel-based models. Conversely, deciding which algorithm is the most appropriate for a regression task under a fixed encoding strategy is critical, especially when multiple datasets are considered. Hence, in this paper, we propose a comparison framework to systematically evaluate and compare regression algorithms and encoding schemes for handling mixed-type variables datasets. Figure 1 shows a general overview of our approach.
FIGURE 1. General overview of our approach. 1) Having multiple tabular-like datasets with mixed variables, 2) we train different ML algorithms for regression 3) with different categorical encoding schemes (if necessary). 4) We evaluate the performance of all methods across multiple datasets and apply 5) statistical tests on the performance rankings in order to assess the overall performance of ML methods and encoding schemes with mixed-type variables.
Our methodology is presented in a real-life case study with engineering data from the tire industry. Typically, tire-related data is high-dimensional, non-linear, and contains mixed variables, i.e., numerical and categorical features. For instance, automobile tires have the ISO code molded into their sidewall, providing a generic description of the tire (see Figure 2). This code specifies the dimensions of the tire, i.e., tire width, aspect ratio, and wheel diameter, and limitations such as the load-bearing ability and maximum speed. A tire is built with different rubber types and composites describing its physical composition. Many of these variables are numerical, and others, such as speed rating, carcass type, or belt type, are categorical. Thus, adequately handling such mixed data is critical for the accuracy of ML algorithms in the tire-performance prediction task.
Typically, tire engineers rely on physical or mechanical models to understand the relationship between variables and outcomes [12]. Because understanding the modeling process is critical in tire design, we experimented with simple shallow models that may provide clearer explanations of the underlying predictions. The regression techniques we employed in this work are grouped in four categories: 1) linear methods: linear regression (LR); 2) kernel-based methods: support vector regression with linear and Gaussian kernels (SVR linear, SVR rbf), and with a special kernel capable of dealing with mixed-type variables, the clinical kernel (SVR clinical); 3) non-parametric methods: K-nearest neighbors (KNN) and kernelized KNN (KNN clinical); and 4) ensemble methods: Random Forest (RF regressor) and Gradient Boosting (GBoosting). We are particularly interested in the effect of encoding schemes for handling categorical variables, such as one-hot (dummy), binary, hashing, and backward difference encodings, in each group of algorithms mentioned above.
All techniques are evaluated under a nested cross-validation scheme with hyperparameter optimization, using the mean square error (MSE) and the coefficient of determination (R2-score) as evaluation metrics for the regression task. Subsequently, we perform statistical inference on the reported R2-scores to test the significance of the differences between 1) regression algorithms for each encoding strategy and 2) encoding schemes for a given class of algorithms, across multiple datasets. We follow [13] to assess the differences between ML algorithms combined with encoding schemes, applying the non-parametric Friedman test and the Nemenyi post-hoc test. The effectiveness of this approach has been demonstrated in applications in science [14] and engineering [15] when comparing ML algorithms in classification tasks. Finally, the results are visualized with significance diagrams [13] and p-value tables for pairwise comparisons.
VOLUME 8, 2020
It is worth mentioning that our purpose is not to perform an exhaustive investigation of ML methods with numerous encoding schemes. Rather, we present our contribution as a methodological approach providing recommendations for efficiently handling mixed-type data, i.e., categorical and numerical variables, while achieving high performance on regression tasks associated with multiple datasets. Because of the nature of the tire data we used, we consider only categorical variables, ignoring any ordering between levels. However, the inclusion of ordinal variables is straightforward. Finally, although the conclusions are drawn in terms of tire data, we emphasize that our approach can naturally be applied to any engineering dataset.
The paper is organized as follows. First, we review some related work, followed by the presentation of the case study in section III. In section IV we introduce the ML algorithms, encoding schemes, and statistical tests, as well as the evaluation methodology. The experiments and results are presented in section V and the final conclusions in section VII.

II. RELATED WORK
A large number of ML methods for regression problems have been proposed and used in diverse engineering applications [6]-[8], [16], [17], to name some of them. Here, we focus on existing work that compares different approaches to regression problems for engineering applications. For instance, [18] compares four popular regression algorithms to establish the relationship between tire tread composites and filler system. More recently, the authors of [16] leverage numerical tire-size features to predict force and moment performances using different regression algorithms. Conversely, the authors of [19] make a comparative study of categorical variable encodings for predicting vehicle properties using neural networks. A similarity measure to handle categorical variables is introduced in [20] and validated in KNN regression problems over twelve datasets. In [21], the authors propose a hybrid decision tree algorithm for mixed categorical and numerical regression analysis. Their method is compared against five popular regression algorithms and, similar to us, they perform a statistical analysis of the performances of such methods over multiple datasets, but they use only a single dummy variable encoding. More recent deep learning approaches for regression problems have proven to be very efficient in diverse applications such as industrial surface defect detection [22], sustainable smart manufacturing in industry 4.0 [23], and short- and long-term electricity load forecasting [24]. Unlike the mentioned approaches, we present a rigorous, statistically based framework to compare and recommend regression algorithms, considering many strategies to appropriately handle mixed variables over multiple datasets.

III. CASE STUDY

A. TIRE DATASETS
Our proposed methodology is presented in a case study from the tire industry. For this study, we collected several tire-related datasets from the automobile industry, containing tire-size features (see Figure 2) with performances tested under different conditions. Table 1 shows some statistics of the datasets we used in our study. Each dataset (row) contains tire measurements for specific engineering target performances. Columns show the number of numerical and categorical features per dataset, and the associated performances to predict. For a complete description of tire tests in the automotive industry, we refer the reader to [25].
In the following, we provide a brief description of the datasets.
• D1: This dataset contains measurements of the stretching force of a tire bead with different rim diameters, also known as the bead compression test.
• D2: Consists of several measurements related to the rolling resistance force. This is a fundamental force acting in opposition to the motion when the tire rolls on a surface. Here we estimated two performances, denoted as D2_p0 and D2_p1.
• D3: Was used before in [16], and contains relevant force and moment (F&M) measurements. Tire F&M are fundamental to characterize the tire performance characteristics that strongly influence the dynamics of the vehicle. That is, force and moment characteristics have to be designed such that the vehicle can easily be kept under driver control under diverse driving conditions. Three tire performances are estimated on this dataset, denoted as D3_p0, D3_p1 and D3_p2.
• D4: Consists of data related to the plunger test, where a steel plunger is forced perpendicular to the tread of a mounted tire until the tire ruptures or the plunger is stopped by reaching the rim. Here, seven performances are estimated, denoted as D4_p0, D4_p1, D4_p2, D4_p3, D4_p4, D4_p5 and D4_p6.
• D5: This dataset has measurements of the high-speed test, where the tire is exposed to different speed scenarios to determine the maximum speed it can sustain.
• D6: Contains information on the contact patch, i.e., the part of the tire that touches the road surface. It is within this interaction area that tire forces and moments arise and that wear occurs.
• D7: This dataset contains measurements of the standard bead unseat test, which determines the conditions under which a tire will stay on the rim.
• D8: Contains regulatory measurements related to the noise emitted by the tire under diverse speed conditions and contact surfaces.
• D9: Contains diverse measurements of the tire wet grip, associated with the ability of a tire to adhere to the road in wet conditions.

B. DATA PREPROCESSING
Once the datasets were collected by tire engineers, data cleaning and preprocessing were carried out to remove inconsistencies and improve the quality of the data. To do so, we applied a series of transformations necessary to make the data modelling more efficient [26]. First, tire features were handcrafted and selected beforehand by expert tire engineers, respecting physical constraints between variables and tire performances. Second, anomalous observations were removed [27] by centering the data around zero, re-scaling it, and discarding points whose distance to the origin was larger than three times the standard deviation of the data, i.e., a z-score criterion. Third, all datasets were standardized by removing the mean of each feature and scaling by its standard deviation. For certain kernel matrices, such as the clinical kernel, we centered the matrix so that it has zero mean.
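The outlier removal and standardization steps can be illustrated with a minimal, dependency-free sketch. It assumes the z-score rule is applied feature by feature and that a row is discarded as soon as one of its numerical features exceeds the threshold; the function names (`zscores`, `drop_outliers`, `standardize`) and the toy values are hypothetical and are not taken from the tire datasets.

```python
from statistics import mean, pstdev

def zscores(column):
    """Center a numeric column at zero and scale it to unit standard deviation."""
    mu, sd = mean(column), pstdev(column)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in column]

def drop_outliers(rows, threshold=3.0):
    """Keep the rows whose z-score stays within `threshold` on every column."""
    cols = [zscores(list(c)) for c in zip(*rows)]
    keep = [i for i in range(len(rows))
            if all(abs(col[i]) <= threshold for col in cols)]
    return [rows[i] for i in keep]

def standardize(rows):
    """Return the rows with every column standardized (zero mean, unit std)."""
    cols = [zscores(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

# Toy data (assumption): 20 regular rows plus one anomalous first feature.
rows = [[200.0 + i, 55.0 + (i % 3)] for i in range(20)] + [[900.0, 56.0]]
clean = drop_outliers(rows)      # the anomalous row is discarded
scaled = standardize(clean)      # features recomputed on the cleaned data
```

Standardizing after the anomalous rows are removed keeps extreme values from inflating the mean and standard deviation used for scaling.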

C. TASK DESCRIPTION
We focus on building regression models on tire-related measurements to predict tire performances across the datasets listed in Table 1. These datasets include a mix of numerical and categorical variables. We use well-known regression algorithms from the non-parametric, ensemble, and kernel-based families. For the methods that do not directly handle categorical variables, we preprocess the datasets with different encoding schemes so that the considered algorithms can operate on such data. Subsequently, we perform a rigorous statistical analysis of the performances of the methods across datasets, to assess the significance of the differences between machine learning methods and encoding mechanisms in tire-performance prediction.

IV. METHODS
In this section, we present our approach with the algorithms, the encoding schemes, and the statistical approaches we use in our analysis.

A. REGRESSION ALGORITHMS
The proposed methodology is applied to regression problems for the prediction of tire performances. We have considered a wide range of linear and non-linear regression models grouped in categories according to their nature: parametric (linear regression), non-parametric (K-nearest neighbors), ensemble methods (Random Forest, Gradient Boosting), and kernel-based methods (support vector regression). A brief description of the techniques is given below.
1) Linear regression [28]: LR fits a linear model with real-valued coefficients that minimizes the sum of squared residuals between the observed targets in the data and the values predicted by the linear approximation.
2) K-nearest neighbors [29]: This non-parametric algorithm is one of the simplest algorithms in ML. It uses feature similarity to predict the target variable of new points: a new point is assigned the average value of its K closest points in the training set. The Euclidean distance is used by default, but it can be replaced by more general distances.
3) Support vector regression [30]: The SVR algorithm is a variant of the popular support vector machines for classification. In its basic form, SVR aims to fit a hyperplane subject to all residuals being smaller than a non-negative coefficient that determines the fitting accuracy. This model can be extended to a non-linear formulation whose basic principle is to map non-linear data to a higher-dimensional space where the problem becomes linear. This transformation is done through a kernel function, which has the property of being a dot product of feature mappings from the input space to a Euclidean space. The problem is modeled as a quadratic optimization problem, and its solution provides real coefficients characterizing the so-called support vectors, i.e., training examples with associated non-null coefficients. A typical non-linear kernel is the radial basis function (Rbf) [30]:
$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$$
for any $x, y \in \mathbb{R}^p$ and $\sigma$ a scaling parameter.
4) The clinical kernel [31]: This kernel has been proposed in the context of analyzing clinical data for patients. Its main advantage is that it allows modelling numerical and categorical variables in a compact formulation without using any categorical encoding scheme. That is, for two data points $x, y \in \mathbb{R}^p$ with $p$ features, the clinical kernel [31] is defined as:
$$k(x, y) = \frac{1}{p} \sum_{f=1}^{p} K_f(x_f, y_f)$$
where $K_f$ is the kernel of the feature $f$. Furthermore:
$$K_f(x_f, y_f) = \begin{cases} \dfrac{(\max_f - \min_f) - |x_f - y_f|}{\max_f - \min_f} & \text{if } f \text{ is numerical} \\ \delta(x_f = y_f) & \text{if } f \text{ is categorical} \end{cases}$$
where $\delta(z) = 1$ when $z$ is true and 0 otherwise, and $x_f$ and $y_f$ are the $f$-th features of $x$ and $y$. The terms $\max_f$ and $\min_f$ represent the maximum and minimum values of feature $f$, respectively.
The choice of an appropriate kernel function may affect the performance of ML algorithms in regression [32]. For this reason, we experimented with different linear and non-linear kernels within the SVR formulation, in particular the linear and Rbf kernels with different encoding schemes. In addition, SVR clinical and KNN clinical are the formulations of SVR and KNN that use the clinical kernel to handle mixed variables in the regression tasks.
5) Random Forest [33]: RF is a popular algorithm based on aggregation principles for classification and regression problems. RF operates by constructing and training an ensemble of decision trees, each over a subset of randomly selected features. The outputs of the decision trees are combined to estimate a target value using an aggregation mechanism such as averaging the individual responses.
6) Gradient Boosting [34]: GBoosting is an ensemble technique for classification and regression that aggregates the estimations of individual models (typically decision trees), improving the prediction from inaccurate intermediate estimations in a sequential manner.
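As an illustration of the clinical kernel of item 4, the following minimal sketch averages one sub-kernel per feature: an equality indicator for categorical features and a range-normalized absolute difference for numerical ones. The function names, the feature ranges, and the example tires are hypothetical and only serve to make the formula concrete; this is not the authors' implementation.

```python
def clinical_kernel(x, y, is_categorical, ranges):
    """Average of per-feature kernels for two points with p mixed-type features.

    is_categorical[f] tells whether feature f is categorical;
    ranges[f] is max_f - min_f for numerical features (ignored otherwise).
    """
    total = 0.0
    for xf, yf, categorical, rng in zip(x, y, is_categorical, ranges):
        if categorical:
            total += 1.0 if xf == yf else 0.0      # delta(x_f = y_f)
        else:
            total += (rng - abs(xf - yf)) / rng    # 1 when equal, 0 at full range
    return total / len(x)

def gram_matrix(points, is_categorical, ranges):
    """Symmetric kernel matrix between all pairs of points."""
    return [[clinical_kernel(a, b, is_categorical, ranges) for b in points]
            for a in points]

# Hypothetical tires: (width in mm, aspect ratio, speed rating).
is_categorical = [False, False, True]
ranges = [100.0, 40.0, None]     # assumed feature ranges, e.g. 155..255 and 35..75
tire_a = (205.0, 55.0, "V")
tire_b = (225.0, 45.0, "V")
tire_c = (205.0, 55.0, "H")

k_ab = clinical_kernel(tire_a, tire_b, is_categorical, ranges)
k_ac = clinical_kernel(tire_a, tire_c, is_categorical, ranges)
K = gram_matrix([tire_a, tire_b, tire_c], is_categorical, ranges)
```

A matrix built this way could, for instance, be passed to a kernel method that accepts precomputed kernels, such as scikit-learn's `SVR(kernel="precomputed")`; whether the original experiments proceeded this way is not stated in the text.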

B. ENCODING SCHEMES
Using categorical variables in regression problems is not trivial, as there is no explicit notion of ordering or semantics between their values or with the response variable. There are two main approaches in the literature to deal with this difficulty. The most common one transforms categories into numerical values by applying encoding schemes that preserve the semantics between the levels of a category. The second one designs specific regression algorithms that handle mixed data internally, e.g., the clinical kernel. In this work we cover both alternatives, focusing on the following encoding schemes:
1) One-hot (dummy) encoding [35]: This is one of the most popular encoding methods. For a categorical variable of cardinality C, it creates C − 1 new binary features, with a value of 1 for the actual level and 0 otherwise. It works well with linear models but is not suitable for variables with large C.
2) Binary encoding [35]: Similar to one-hot, but category levels are treated as positive integers and subsequently converted to binary digits, each binary digit getting one column. For a variable of cardinality C, it adds p new binary columns such that $p = \min\{i : C \leq 2^i, i \geq 0\}$ and $C \geq p + 1$.
3) Hashing encoding [36]: This method uses hash functions to map the levels of a categorical variable to numbers, which are themselves encoded as binary strings of a given dimension. As the levels are not memorized, it can deal gracefully with new levels and therefore scales to categories with large cardinality.
4) Backward difference [37]: This is one of the so-called contrast encoding methods. For a categorical variable with C levels, it creates C − 1 new variables of inter-level differences. That is, for each level, it compares the mean of the dependent variable at that level with the mean at the preceding level. It can be used with ordinal and categorical values.
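To make the dimensionality of the encodings above concrete, the sketch below implements dummy, binary, and hashing encodings with the standard library only. It is a simplified illustration under stated assumptions: levels are indexed from 0 in sorted order, the first level is the dummy-coding reference, and the hashing variant sets a single bucket chosen by a SHA-256 digest. Library implementations may differ in these details, and all function names are hypothetical.

```python
import hashlib

def one_hot(values):
    """Dummy coding: C - 1 binary columns; the first sorted level is the reference."""
    levels = sorted(set(values))
    return [[1 if v == lvl else 0 for lvl in levels[1:]] for v in values]

def binary_encode(values):
    """Write each level index in base 2 using p = min{i : C <= 2**i} columns."""
    levels = sorted(set(values))
    index = {lvl: i for i, lvl in enumerate(levels)}
    p = (len(levels) - 1).bit_length()
    return [[(index[v] >> b) & 1 for b in reversed(range(p))] for v in values]

def hashing_encode(values, n_components=8):
    """Hash each level into one of n_components buckets; no level list is stored."""
    rows = []
    for v in values:
        digest = hashlib.sha256(str(v).encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_components
        rows.append([1 if j == bucket else 0 for j in range(n_components)])
    return rows

# Hypothetical speed-rating column with C = 5 levels (H, T, V, W, Y).
speed = ["H", "V", "W", "H", "Y", "T"]
dummies = one_hot(speed)         # 4 columns
bits = binary_encode(speed)      # 3 columns
hashed = hashing_encode(speed)   # 8 columns, independent of C
```

With C = 5 levels, dummy coding adds four columns while binary coding adds three, and the gap widens quickly as C grows, which is the property that matters for dimension-sensitive models.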

C. MODEL COMPARISON AND STATISTICAL TESTS
Beyond the performances achieved by the different algorithms, we use hypothesis testing techniques to provide statistical support to our analysis. Concretely, we use non-parametric methods, given that the strong assumptions required by parametric methods may not be satisfied in our datasets. Here we use the Friedman-Nemenyi test for comparing multiple algorithms across datasets [13]. The Friedman test assesses whether the outcomes of the different regression models are statistically different. The Nemenyi post-hoc test assesses significant differences between individual algorithms. The Friedman test is a non-parametric test based on the average ranked performances ($R_j$) of the regression models on each dataset. The Friedman statistic is calculated as
$$Q = \frac{12D}{K(K+1)} \left[ \sum_{j=1}^{K} R_j^2 - \frac{K(K+1)^2}{4} \right]$$
where $D$ denotes the number of datasets, $K$ the number of regression algorithms, and $R_j = \frac{1}{D} \sum_{i=1}^{D} r_i^j$ the average rank of algorithm $j$, where $r_i^j$ denotes the rank of the $j$-th algorithm on the $i$-th dataset. Under the null hypothesis that all algorithms perform equally, the $Q$ statistic is approximately distributed as a chi-square distribution $\chi^2_{K-1}$ with $K - 1$ degrees of freedom. Therefore, we can reject the null hypothesis and conclude that some algorithms perform better than others when $Q$ is large enough, i.e., when for a significance level $\alpha$
$$P(\chi^2_{K-1} \geq Q) < \alpha.$$
If the null hypothesis is rejected, we can proceed with a post-hoc test. The post-hoc Nemenyi test [13] is applied to report any significant difference between individual algorithms. This test states that the performances of two algorithms are significantly different if their average ranks differ by at least the critical difference (CD):
$$CD = q_{\alpha,K} \sqrt{\frac{K(K+1)}{6D}}$$
where the critical values $q_{\alpha,K}$ are based on the Studentized range statistic divided by $\sqrt{2}$. Finally, the results of the Friedman-Nemenyi test can be visualized with the diagrams proposed by Demšar [13]. These diagrams show the mean ranked performances of the algorithms along with the critical difference, such that the lower the rank, the better the method. Horizontal lines connect algorithms that are not significantly different.
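The ranking, the Friedman statistic, and the Nemenyi critical difference can be sketched in a few lines. The score matrix below is a made-up toy example, not a result from this study. The critical value q = 3.031 for K = 8 at the 0.05 level is taken from the table in [13], and D = 18 is an assumption obtained by counting the target performances listed in the dataset description.

```python
from math import sqrt

def average_ranks(scores):
    """scores[i][j] is the R2-score of algorithm j on dataset i.

    Rank 1 goes to the best (highest) score; tied scores share the mean rank.
    """
    n_datasets, n_algos = len(scores), len(scores[0])
    totals = [0.0] * n_algos
    for row in scores:
        order = sorted(range(n_algos), key=lambda j: -row[j])
        pos = 0
        while pos < n_algos:
            end = pos
            while end + 1 < n_algos and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1          # mean of ranks pos+1 .. end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / n_datasets for t in totals]

def friedman_q(avg_ranks, n_datasets):
    """Friedman statistic Q computed from the average ranks R_j."""
    k = len(avg_ranks)
    spread = sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    return 12 * n_datasets / (k * (k + 1)) * spread

def nemenyi_cd(q_alpha, n_algos, n_datasets):
    """Critical difference between average ranks for the Nemenyi test."""
    return q_alpha * sqrt(n_algos * (n_algos + 1) / (6 * n_datasets))

# Toy R2-scores (assumption): 4 datasets (rows) x 3 algorithms (columns).
scores = [[0.90, 0.80, 0.60],
          [0.85, 0.88, 0.50],
          [0.92, 0.81, 0.70],
          [0.95, 0.90, 0.65]]
ranks = average_ranks(scores)
q_stat = friedman_q(ranks, n_datasets=len(scores))

# Setting of the first experiment: K = 8 algorithms, assumed D = 18 targets.
cd = nemenyi_cd(q_alpha=3.031, n_algos=8, n_datasets=18)
```

Under these assumptions `cd` evaluates to about 2.475, in line with the critical difference reported for the comparisons of eight algorithms.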

D. EVALUATION METHODOLOGY
We evaluate the performance of the regression techniques using a nested cross-validation (CV) approach. This method is commonly used to reduce the bias in the generalization error induced by the random splitting of the dataset. The inner CV is used to select the optimal model and the outer CV to estimate the generalization error of the method. In the outer 5-fold CV loop, 80% of the data (four folds) is provided as a training set to the inner 5-fold CV, and the remaining 20%, i.e., the hold-out fold, is used as a testing set to evaluate the performance of the model. Within the inner 5-fold CV, we tune the hyperparameters of the ML models, i.e., ε, σ and C for SVR, the number of trees for Random Forest and GBoosting, and k for KNN, selecting the best model using the inner hold-out fold as validation set within a grid-search hyperparameter optimization scheme [38], [39]. That is, grid search exhaustively generates combinations of parameters from a grid of predefined values in order to train the ML algorithms and select the best model. Finally, we report the average and standard deviation of the mean square error (MSE) and the coefficient of determination (R2-score) over the outer CV folds.
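The nested scheme can be summarized with a small, dependency-free sketch. It is not the authors' scikit-learn code: as a stand-in, it tunes only the number of neighbors k of a one-dimensional KNN regressor, uses interleaved (non-shuffled) folds, and reports the MSE only. All function names and the toy data are hypothetical.

```python
from statistics import mean

def folds(indices, n_folds):
    """Yield (train, test) index lists for n_folds interleaved folds."""
    for f in range(n_folds):
        test = indices[f::n_folds]
        held_out = set(test)
        yield [i for i in indices if i not in held_out], test

def knn_predict(X, y, train, query, k):
    """Average target of the k training points closest to X[query]."""
    nearest = sorted(train, key=lambda i: abs(X[i] - X[query]))[:k]
    return mean(y[i] for i in nearest)

def mse(X, y, train, test, k):
    """Mean square error on `test` of a KNN model fitted on `train`."""
    return mean((knn_predict(X, y, train, q, k) - y[q]) ** 2 for q in test)

def nested_cv(X, y, grid, n_outer=5, n_inner=5):
    outer_scores, chosen = [], []
    for train, test in folds(list(range(len(X))), n_outer):
        # Inner loop: grid search for k using only the outer training folds.
        best_k = min(grid, key=lambda k: mean(
            mse(X, y, tr, va, k) for tr, va in folds(train, n_inner)))
        chosen.append(best_k)
        # Outer loop: refit on all training folds, score on the hold-out fold.
        outer_scores.append(mse(X, y, train, test, best_k))
    return mean(outer_scores), chosen

# Toy regression problem (assumption): one feature, noise-free linear target.
X = [i / 10 for i in range(50)]
y = [2 * x + 1 for x in X]
score, chosen = nested_cv(X, y, grid=[1, 3, 5])
```

The hold-out fold of the outer loop never takes part in the choice of k, which is what keeps the reported error from being optimistically biased by the hyperparameter search.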

V. EXPERIMENTS AND RESULTS
In this section, we present our comparison methodology and results. That is, following the scheme of Figure 1, for each dataset of Table 1, we transform the categorical variables by applying the encoding schemes of section IV-B. Subsequently, we perform the regression tasks by applying the algorithms introduced in section IV-A, following the evaluation setting of section IV-D. For the algorithms that can handle mixed variables, we perform regression without any encoding. Finally, we provide the overall evaluation of our models, applying the statistical tests presented in section IV-C. Such tests were performed on the R2-scores only, but similar conclusions are reached when they are applied to the MSE. In this paper, the Friedman test is evaluated at a significance level of 0.01, followed by the post-hoc Nemenyi test at a significance level of 0.05 [13]. Our approach was coded in Python 3.7 with sklearn 0.22.2, and run on a standard laptop with an Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz and 16 GB of RAM.

A. BEST ALGORITHM FOR ENCODING STRATEGY
The first part of our experimental setting aims to answer the question: for a given encoding scheme, which machine learning algorithm is the most appropriate to predict tire performances? As the strategy to manage mixed variables is critical, we complete our analysis for each encoding scheme and summarize the overall outcome in the discussion section. Table 2 shows the mean and standard deviation of the R2-score for the regression tasks.

1) RESULTS WITH ONE-HOT ENCODING
The critical value of $\chi^2_7$ is 18.47, which is lower than Friedman's statistic Q = 67.259. As a consequence, we reject the null hypothesis that all algorithms perform equivalently, and the post-hoc Nemenyi test is applied. The significance diagram in Fig. 3 shows the average performance ranks of the algorithms along with the Nemenyi critical difference (CD = 2.474).
From Figures 3 and 4, we can see that:
• GBoosting and Random Forest are significantly better than KNN, LR and SVR linear.
• SVR clinical and SVR Rbf are equivalent and significantly better than SVR linear.
• SVR linear is significantly worse than KNN clinical, SVR clinical, SVR Rbf, Random Forest and GBoosting.

2) RESULTS WITH BINARY ENCODING
Table 3 shows the mean and standard deviation of the R2-score for the regression task across datasets. The Friedman statistic is Q = 71.093, which is greater than the critical value $\chi^2_7$ = 18.47. Thus, the null hypothesis is rejected at a significance level of 0.01, and we proceed with the Nemenyi test. From the significance diagrams of Figures 5 and 6, we can state that:
• GBoosting, Random Forest and SVR Rbf are significantly better than KNN, LR and SVR linear.
• SVR clinical is significantly better than SVR linear.
• SVR linear is significantly worse than KNN clinical, SVR clinical, SVR Rbf, Random Forest and GBoosting.

3) RESULTS WITH HASHING ENCODING
Table 4 shows the mean and standard deviation of the R2-score for the regression task across datasets. The critical value of $\chi^2_7$ is 18.47, which is lower than Friedman's statistic Q = 70.074. We reject the null hypothesis and conclude that the performances of the algorithms are not equivalent. The critical difference, according to the Nemenyi test, is CD = 2.474. The significance diagram in Fig. 7 shows the following results:
• Random Forest is significantly better than KNN, LR, SVR clinical and SVR linear.
• SVR Rbf is significantly better than LR and SVR linear.
• SVR linear is significantly worse than KNN clinical, SVR clinical, SVR Rbf, Random Forest and GBoosting.

4) RESULTS WITH BACKWARD DIFFERENCE ENCODING
Table 5 shows the mean and standard deviation of the R2-score for the regression task across datasets. Here, the Friedman statistic Q = 70.033 is greater than the critical value $\chi^2_7$ = 18.47. Thus, we reject the null hypothesis at a significance level of 0.01, and we conclude that the algorithms perform differently. The critical difference for the Nemenyi test is CD = 2.4747. Thus, from the significance diagram of Figure 9, we can conclude that:
• GBoosting and Random Forest are significantly better than KNN, LR and SVR linear.
• SVR Rbf is significantly better than KNN and SVR linear.
• SVR linear is significantly worse than KNN clinical, SVR clinical, SVR Rbf, Random Forest and GBoosting.

5) DISCUSSION OF ENCODING SCHEMES
In this experimental setting, we have shown the effect of different strategies for handling mixed variables on the accuracy of the regression algorithms. Overall, the experiments reveal that ensemble methods for regression, i.e., Random Forest and GBoosting, perform significantly better than non-parametric (KNN) and linear models, i.e., linear regression (LR) and SVR linear, under all encoding schemes. This can be explained by the fact that tire-related data is high-dimensional and non-linear, i.e., tire performances have complex physical dependencies on tire features and compounds, so non-linear ensemble models outperform simpler linear models. Besides, the Random Forest model makes its predictions with a collection of trees, each built on a random subset of input features, which reduces the search space that each tree optimizes over. Hence, ensemble methods can better handle the curse of dimensionality introduced by the different encoding schemes, which improves their generalization capability. On the other hand, there is not enough evidence to claim that ensemble methods are better than non-linear kernels, i.e., SVR Rbf and SVR clinical. Besides, SVR linear is clearly the worst-performing method, no matter the encoding strategy used. Finally, SVR Rbf and SVR clinical are equivalent methods under one-hot encoding and, unlike ensemble methods, they provide a meaningful way to compare tires for further mining purposes. This is discussed in more detail in section VI.

B. BEST ENCODING FOR CLASSES OF ALGORITHMS
In the second part of our experiments, we address the problem of choosing the best encoding for classes (groups) of algorithms, in particular for the best-performing methods from the previous section. We consider ensemble, kernel-based, and non-parametric models as the groups under analysis.

1) ENSEMBLE METHODS
Here we investigate the best encoding scheme for the Random Forest and GBoosting algorithms.¹ We apply the Friedman test independently to the Random Forest and GBoosting performances from Tables 2, 3, 4 and 5 to test the hypothesis that the selected algorithm has equivalent performances across encoding strategies. Starting with Random Forest, the critical value of $\chi^2_3$ for a significance level of 0.01 is 11.34, which is greater than Friedman's statistic Q = 5.267, meaning that we fail to reject the null hypothesis. Similarly, Friedman's statistic for GBoosting is Q = 1.80, so we fail to reject the hypothesis that all encoding methods perform equally. As a consequence, we find no preferred encoding strategy for regression with ensemble methods on this tire-performance data.

2) KERNEL-BASED REGRESSION
Here we investigate the effect of encoding schemes on the SVR Rbf and SVR linear methods. As before, we perform the Friedman test independently on the performances from Tables 2, 3, 4 and 5, to test the hypothesis that the performance of the given algorithm is equivalent across different encoding schemes.
The critical value of $\chi^2_3$ at a significance level of 0.01 is 11.34, which is greater than Friedman's statistic Q = 2.799. Thus, we fail to reject the null hypothesis that the SVR linear performances are equivalent across encodings. On the other hand, Q = 20.847 for SVR Rbf (including SVR clinical), which is greater than the critical value of $\chi^2_4$, 13.276, at a significance level of 0.01. The null hypothesis is rejected, and the Nemenyi critical difference CD = 1.43 is shown in the significance diagrams of Figures 11 and 12. Here we observe that the considered encoding schemes are equivalent under SVR linear models, which is expected because this kernel can handle high-dimensional data better than other models, in part due to the excessive regularization that occurs [40]. On the other hand, the binary encoding is significantly better than the backward difference, clinical, and one-hot encodings for SVR Rbf. Indeed, the Gaussian kernel is more sensitive to the curse of dimensionality, and performs better with encoding schemes of lower dimension.
¹ Although tree-based methods are known to natively handle categorical variables without any transformation, most implementations require an encoding preprocessing step.

3) K-NEAREST NEIGHBOR REGRESSION
Here we investigate the effect of the considered encodings on KNN regression (including KNN clinical). As before, we perform the Friedman test on the performances taken from Tables 2, 3, 4, and 5 to test the hypothesis that KNN performs similarly across different encoding strategies. The critical value of $\chi^2_4$ at a significance level of 0.01 is 13.276, which is lower than Friedman's statistic Q = 19.520. Thus, we reject the null hypothesis, and after applying the Nemenyi post-hoc test, we obtain a critical difference of CD = 1.43, which is shown in the significance diagrams, Figures 13 and 14. KNN is well known to be sensitive to high dimensions. Consistent with this, the diagrams show that KNN clinical is significantly better than backward difference and one-hot encoding.
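For reference, the Nemenyi critical difference is commonly computed as CD = q_alpha * sqrt(k(k+1) / (6N)), where k is the number of compared methods and N the number of datasets. The sketch below assumes a post-hoc significance level of 0.05 and a placeholder N = 10; neither value is stated in this section, so the result is not meant to reproduce CD = 1.43.

```python
import math

# Two-tailed Nemenyi critical values q_alpha at alpha = 0.05 (assumed level),
# indexed by the number of compared methods k. These are the standard
# studentized-range quantiles divided by sqrt(2).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}


def nemenyi_cd(k, n_datasets, q_table=Q_ALPHA_005):
    """Critical difference between mean ranks for k methods over n_datasets."""
    return q_table[k] * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


def significantly_different(rank_a, rank_b, cd):
    """Two methods differ if their mean ranks are at least cd apart."""
    return abs(rank_a - rank_b) >= cd


# Hypothetical setting: 5 encodings compared over 10 datasets.
cd = nemenyi_cd(k=5, n_datasets=10)
print(f"CD = {cd:.3f}")
print(significantly_different(1.5, 3.6, cd))  # mean ranks are made-up values
```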

VI. A TIRE-SIMILARITY FUNCTION
As a side benefit, the previous analysis allows us to define a similarity metric for tire comparison. Similarity is an important concept in data mining for searching, comparing, and retrieving objects from a database. Designing a similarity measure for tire-based data is not a trivial task, both because of the mixed-type nature of its variables and because there is no general rule for what a 'good' similarity should be. Roughly speaking, a similarity function $f : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ is a real-valued function that takes two input tires x and y and assigns a high value when they are 'close' and a lower value when x and y are far apart in a Euclidean space. As we showed in Section III, this notion is implicit in the definition of kernels. Although the intuition of kernels as measures of similarity is not always obvious [41], there are cases where the two notions coincide. In addition, we hypothesize that a good similarity function for tires should be good enough to predict tire performances. Thus, supported by our results from the previous section, we consider the Gaussian (Rbf) and the clinical kernel functions as relevant similarity measures for tires.
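A minimal sketch of the two similarity functions follows. It assumes the usual form of the clinical kernel, namely an average of per-variable similarities, where a numerical variable contributes (range - |a - b|) / range and a categorical variable contributes 1 on a match and 0 otherwise. The tire attributes, ranges, and the gamma value are hypothetical.

```python
import math


def rbf_similarity(x, y, gamma=0.5):
    """Gaussian (Rbf) kernel on numerically encoded tires: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def clinical_similarity(x, y, ranges):
    """Clinical kernel on raw mixed-type tires.

    ranges[i] is the (max - min) range of numerical variable i over the dataset,
    or None when variable i is categorical.
    """
    total = 0.0
    for a, b, r in zip(x, y, ranges):
        if r is None:
            total += 1.0 if a == b else 0.0  # categorical: exact match
        else:
            total += (r - abs(a - b)) / r    # numerical: scaled closeness
    return total / len(ranges)


# Hypothetical tires described by (section width in mm, season).
tire_a = (205, "summer")
tire_b = (255, "winter")
variable_ranges = (100, None)  # width spans 100 mm in this toy dataset

print(clinical_similarity(tire_a, tire_a, variable_ranges))
print(clinical_similarity(tire_a, tire_b, variable_ranges))
```

Note that the clinical kernel works directly on the mixed-type description, whereas the Gaussian kernel needs the categorical variables to be encoded numerically first.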

QUERYING A REFERENCE TIRE
We illustrate the proposed tire similarity qualitatively in a common industrial application of the query-by-example principle. That is, given an existing reference tire from the dataset, we retrieve the nearest tires according to the similarity function. As a proof of concept, we provide a visualization of dataset D3 from Table 1, i.e., the force and moments (F&M) data, by embedding the tires in a three-dimensional space that preserves the clinical-kernel similarity. The pairwise dissimilarity matrix is projected to 3D coordinates by means of the multidimensional scaling (MDS) algorithm [42], so that the embedded tires preserve the original clinical-kernel distances. The above figure shows the embedded tire space, the reference tire (in red), and its ten closest tires retrieved from the F&M dataset with the clinical kernel used previously in the regression tasks.
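The query-by-example step can be sketched as follows, starting from a precomputed kernel (similarity) matrix. The matrix values are hypothetical, and the conversion from similarities to dissimilarities uses the standard kernel-induced distance d(x, y) = sqrt(k(x, x) + k(y, y) - 2 k(x, y)), which is an assumption on our part since the text does not specify how the dissimilarity matrix is derived.

```python
import math


def top_k_similar(kernel, ref, k):
    """Indices of the k tires most similar to tire `ref`, excluding itself."""
    others = [i for i in range(len(kernel)) if i != ref]
    others.sort(key=lambda i: kernel[ref][i], reverse=True)
    return others[:k]


def kernel_distance_matrix(kernel):
    """Pairwise kernel-induced distances, usable as input to an MDS routine."""
    n = len(kernel)
    return [
        [
            # max(0, ...) guards against tiny negative values from rounding
            math.sqrt(max(0.0, kernel[i][i] + kernel[j][j] - 2 * kernel[i][j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical similarity matrix for 4 tires.
K = [
    [1.0, 0.9, 0.2, 0.6],
    [0.9, 1.0, 0.3, 0.5],
    [0.2, 0.3, 1.0, 0.1],
    [0.6, 0.5, 0.1, 1.0],
]

neighbors = top_k_similar(K, ref=0, k=2)
D = kernel_distance_matrix(K)
print(neighbors)
```

The resulting matrix D is symmetric with a zero diagonal, which is the form expected by MDS implementations that accept precomputed dissimilarities.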

VII. CONCLUSION
In this work, we proposed a comparison methodology to select the most appropriate ML regression algorithm when dealing with mixed-type variables across multiple datasets. Our approach was presented through a case study from the tire industry on tire-performance prediction. We have shown that the non-parametric Friedman and Nemenyi statistical tests allow us to decide on the appropriateness of encoding strategies for certain classes of regression algorithms. In particular, we showed that ensemble methods, i.e., Random Forest and Gradient Boosting, perform significantly better than linear models and KNN, and that their performance is equivalent across categorical encodings. In contrast, kernel methods for regression are sensitive to encoding schemes. SVR linear is significantly the worst of the considered methods with any encoding, and SVR Rbf is marginally comparable to ensemble methods when used with binary encoding. Besides, kernels have the benefit of being usable for both regression and similarity tasks in data mining applications.

APPENDIX MEAN SQUARE ERROR TABLES
See Tables 6-9.

LEONARDO GUTIÉRREZ-GÓMEZ received the B.Eng. and B.Math. degrees from the Escuela Colombiana de Ingenieria Julio Garavito, Bogotá, Colombia, the M.Sc. degree in industrial and applied mathematics from the Université de Grenoble 1, France, and the Ph.D. degree from the Université Catholique de Louvain, Belgium, with a dissertation in machine learning on complex networks: dynamical fingerprints, embeddings, and feature engineering. He worked with ST Microelectronics and the Grenoble Computer Science Laboratory (LIG) doing applied research in deep learning, information retrieval, and data mining on graphs. He is currently a Postdoctoral Researcher with the Luxembourg Institute of Science and Technology. His research interests include machine learning in industrial applications, network science, and data mining on graphs.
FRANK PETRY received the Ph.D. and diploma degrees in physics from the University of Heidelberg, in 1993 and 1995, respectively. With the grant support of the Max-Planck Institute for nuclear physics, he continued his research in particle physics. After two years as a Lecturer in applied computer science and algebra, he started at the Product Evaluation Department, Goodyear Technical Center Luxembourg, in 1997. He developed new tire test and modeling methods and published several articles on tire contact mechanics and tire traction. He is currently a Senior Research and Development Associate with the Virtual Capability and Tire/Vehicle Mechanics Department, Goodyear Innovation Center Luxembourg, Luxembourg.
DJAMEL KHADRAOUI received the master's degree in computer and network security from the École de Mines de Saint-Etienne, France, and the Ph.D. degree in computer vision and robotics from the Université Blaise Pascal. He is currently the Head of the Trusted Service Systems Research Unit, Luxembourg Institute of Science and Technology. His research interests include the design, the security, and the optimisation of service systems enabled by data-intensive infrastructures engineering and aligned with the creation of business impact. He has managed multiple research projects related to enterprise modeling, value creation, and reliable infrastructures; data-intensive systems with a focus on the processes, the science, and the platforms of big data; security, privacy, and resilient critical infrastructures with a focus on data privacy and security, cybersecurity, and information security management; and operations and supply chain optimisation with a focus on data analytics and optimisation, as well as operations optimisation.