CBRG: A Novel Algorithm for Handling Missing Data Using Bayesian Ridge Regression and Feature Selection Based on Gain Ratio

Existing imputation methods may lead to biased predictions and may inflate or deflate statistical influence, which results in improper estimations. The performance of several missing value imputation approaches depends on the size of the dataset and the number of missing values within it. In this work, the authors propose a novel algorithm for handling missing data and compare it against some common imputation approaches. The proposed algorithm imputes missing values in cumulative order, relying on gain ratio (GR) feature selection to select the candidate feature to be imputed and on the Bayesian Ridge Regression (BRR) technique to build the predictive model. Each imputed feature is then used to handle the missing values in the next selected candidate feature. The proposed algorithm was implemented on eight different datasets after generating different proportions of missing values under each missingness mechanism. The imputation performance was measured in terms of imputation time, mean absolute error (MAE), coefficient of determination ($R^{2}$), and root-mean-square error (RMSE). The results show the efficiency of the proposed algorithm when imputing any dataset with any amount of missing data from any missingness mechanism.


I. INTRODUCTION
Data preparation is regarded as the most significant and time-consuming task, and it strongly influences the success of the research. The best way to deal with incomplete instances is to avoid missing data in the first place, yet even skilled researchers face missing values that arise for unknown reasons. Decisions can be made by the researcher at the data collection stage about what data to gather and how to screen data collection. The distribution and scale of the variables in the data and the reason for missingness are two important issues when choosing the best approach for handling missing data [1]. Many imputation algorithms may fail to impute all missing values in a dataset; others give poor performance or consume a long imputation time. The proposed algorithm presented in this paper handles these defects by using the most effective features for handling missing values in a cumulative order.

A. FEATURE SELECTION
Feature selection lies in discovering the best subset of possible features from a large set of features. Deleting features that have a large amount of missing values (e.g., >50%) is considered a simple solution. However, deleting a feature may result in losing analytical power and the ability to observe statistically significant differences, and it is frequently a cause of bias that adversely affects the outcomes. Feature selection also requires taking the missing data mechanism into consideration [2].

B. MISSINGNESS MECHANISMS
Handling missing data requires detecting the missingness mechanism (i.e., the cause of the missing values in a dataset). This paper deals with the three missingness mechanisms [2]-[5]. Let $M = (M_{ij})$ denote the missing value indicator matrix and $Y = (y_{ij})$ the complete data, with $Y_{obs}$ and $Y_{mis}$ its observed and missing parts:
• Missing completely at random (MCAR) [6]: the missingness does not depend on the data at all, i.e., $f(M \mid Y, \phi) = f(M \mid \phi)$ for all $Y$, where $\phi$ denotes unknown parameters.
• Missing at random (MAR): the missingness depends only on the observed data, i.e., $f(M \mid Y, \phi) = f(M \mid Y_{obs}, \phi)$.
• Missing not at random (MNAR): the missingness depends on both missing and observed data [7].
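As an illustration only (not code from the paper), an MCAR mask can be simulated by drawing each entry of $M$ independently of the data; the NumPy sketch below masks roughly 20% of a complete matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 4))      # complete data Y = (y_ij)
M = rng.random(Y.shape) < 0.2      # MCAR: M_ij depends on nothing in Y
Y_mcar = Y.copy()
Y_mcar[M] = np.nan                 # observed data with ~20% missing entries
```

Under MAR, the mask would instead be drawn with probabilities conditioned on observed columns; under MNAR, on the masked values themselves.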

C. HANDLING MISSING DATA
The best approach for handling missing data is to avoid it through careful data gathering and follow-up, along with resolving missing data after the fact (for example, by observing missing data or re-contacting study participants). However, it is generally difficult to avoid missing values entirely; therefore, statistical approaches for handling missing values are needed. Since missing data problems are exceptionally varied, statisticians cannot form a universal set of procedures to handle all cases; instead, they run simulations to select the best approach [8]. The methods for handling missing values have to be tailored to the proportion of missing values, the size of the dataset, and the reasons for missingness [2]. The simplest way to deal with missingness is to delete instances that contain missing values; in general, deletion leads to accurate estimations only for MCAR [9]. The other method, which overcomes the disadvantages of deletion, is imputation. Imputation benefits from the complete instances available in the dataset to estimate the missing values [4], and it is regularly implemented before or after feature selection [2]. The two common model-based imputation families are likelihood and regression. Likelihood methods require parameter estimation in the presence of missing values, i.e., models and their parameters are assessed using maximum a posteriori or maximum likelihood procedures. Imputing missing values using a regression model relies on predicting the unobserved values within a feature from the observed values [10]. The BRR technique is used within the proposed algorithm: a regression model with regularizing parameters that satisfies [11]:

$y \sim N(\mu, \alpha), \quad \mu = \beta X$ (1)
$\alpha \sim \text{Gamma}(\alpha_1, \alpha_2)$ (2)
$\lambda \sim \text{Gamma}(\lambda_1, \lambda_2)$ (3)

where $y$ is the target feature, distributed as a normal distribution characterized by variance $\alpha$ and mean $\mu = \beta X$. $\beta = \{\beta_0, \beta_1, \beta_2, \ldots, \beta_p\}$ and $X = \{x_1, x_2, \ldots, x_p\}$ represent the unknown parameters and the independent features, respectively, and $p$ denotes the number of independent features. $\lambda$ and $\alpha$ are regularizing parameters distributed as gamma distributions; both are jointly assessed when the model is fitted by maximizing the log marginal likelihood. $\alpha_1$, $\alpha_2$, $\lambda_1$, and $\lambda_2$ are the hyper-parameters of the gamma prior distributions.

The structure of this paper is as follows: Section 2 presents the literature review on handling missing data. Sections 3 and 4 present the proposed algorithm and explain the experimental implementation, respectively. Results and discussion are described in Section 5. Finally, Section 6 concludes the paper.
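For reference, this BRR model is available off the shelf; the sketch below (a minimal scikit-learn example on synthetic data, assumed rather than taken from the paper) fits BayesianRidge with the non-informative $10^{-6}$ gamma hyper-parameters described above:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # independent features x_1..x_p
beta = np.array([1.5, -2.0, 0.5])                # "true" coefficients for the demo
y = X @ beta + rng.normal(scale=0.1, size=200)   # target with Gaussian noise

# alpha_1, alpha_2, lambda_1, lambda_2 are the gamma prior hyper-parameters;
# 1e-6 is the non-informative default used in the paper
model = BayesianRidge(alpha_1=1e-6, alpha_2=1e-6, lambda_1=1e-6, lambda_2=1e-6)
model.fit(X, y)
y_hat, y_std = model.predict(X[:5], return_std=True)  # posterior mean and std
```

The posterior standard deviation returned alongside the prediction is what distinguishes BRR from plain ridge regression: the noise precision and the coefficient prior are estimated jointly from the data rather than tuned by hand.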

II. LITERATURE REVIEW
Deletion is the most straightforward way to handle missing values. Deleting instances that contain one or more missing values among their feature values is known as list-wise deletion or complete case analysis [4], [12]. Deleting a feature that holds more than a predefined proportion of missing values (e.g., 50%) is known as specific deletion [4], [8], [12], [13]. In addition, deletion can be pair-wise: a statistical procedure skips a feature in an instance where it is missing, but can still use that instance when analyzing other features that are not missing. When each feature in the dataset contains missing data across different instances, the deletion approach may discard a large part of the dataset, or even the whole dataset [14].
In imputation methods, an estimated value is imputed in place of the missing value [12], [13], [15]. The imputed value can be the median, mode, mean, or any predefined value of the feature that holds the missing value [16]-[18], or can be acquired from case substitution. Imputed values can also be estimated using KNN (K Nearest Neighbors) [3], cold-deck imputation [19], expectation-maximization imputation [20]-[22], hot-deck imputation [23], etc. In techniques containing prediction models, a model is built from the available information within the dataset and then used to predict the missing values [24]. Imputation methods are appropriate when: i) the missing values are of MCAR or MAR type; ii) the feature that contains missing values has statistical influence on the target feature; iii) deleting instances would shrink the dataset enough to affect building the predictive model; iv) no instance has missing values across many features [14]. MCAR missingness can be handled using maximum likelihood methods or list-wise deletion, whereas there are no general approaches for handling missing values of the MNAR type [4], [25].
Imputation can be single or multiple. In single imputation, one particular value is imputed in place of the missing value [19]. In multiple imputation, m complete datasets are generated by imputing the missing values m times, and the final imputed dataset is the analysis average of these m datasets [26]. Although multiple imputation requires more resources [27], it has advantages over other methods such as single imputation, maximum likelihood techniques, and deletion [28]. Inverse Probability Weighting (IPW) approaches are another way to handle missing values: IPW weights detected instances by the inverse of their observation probability so that they represent the entire data, including the missing values; however, the performance of imputation methods is better [4], [29]. Singular value decomposition (SVD), partial least squares (PLS), and ordinary least squares (OLS) are also good choices for multiple imputation [4]. For datasets containing ordinal and binary features, Multivariate Normal Imputation (MVNI) and Fully Conditional Specification (FCS) generate similar and less biased results. MVNI grants an easy model specification, but some may find its normality assumption unrealistic; FCS tends to have difficult model requirements, as it demands a separate regression model for every variable whose missing values are to be imputed [30]. KNN is used by FINNIM, a practical nonparametric iterative multiple imputation approach, to estimate missing values [31]. Successive regression trees can serve as the conditional model of multiple imputation; they identify complex relations and need little tuning by the user [32].
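As a concrete analogue of FCS/chained-equations imputation (an illustrative stand-in, not an implementation from the reviewed papers), scikit-learn's IterativeImputer models each incomplete feature on the others in round-robin fashion:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the API)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan   # knock out ~10% of entries

# Each feature with missing values is regressed on the remaining features,
# iterating until the imputations stabilize (the chained-equations idea)
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imp = imputer.fit_transform(X)
```

Running the same fit_transform with several random seeds and averaging the results would mimic the m-dataset multiple-imputation scheme described above.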
Manipulating missing values with the predictive mean matching (PMM) approach uses an instance randomly drawn from the set of observed instances whose predictive mean is close to the predictive mean of the missing value [4]. Like PMM, the local residual draw (LRD) approach uses the predictive mean, but adds a random draw from the residuals of observed instances with predictive means close to that of the missing value [33]. Reinforcement programming (RP), which is based on reinforcement learning, outperforms mean-per-category imputation, zero imputation, and the genetic algorithm (GA) in terms of the sum of squared errors and computational time [34]. Cumulative linear regression, which relies on the linear regression technique to handle missing data, works well with both large and small datasets [3]. Donor-based imputation handles missing data in a recipient variable with the aid of observed values from other variables, relying on the similarities of the observation values within the donor variable; it works well when the amount of missing values is large [6].

III. PROPOSED ALGORITHM
This section exhibits the proposed algorithm in detail. The following procedural steps elaborate the proposed algorithm described in Fig. 1:
Step 1: The proposed algorithm takes a dataset D that holds missing values as input and partitions D into two sets: the first set X^(comp) contains all complete features, and the second set X^(mis) contains all incomplete features. The authors assume that the output feature contains no missing values, so X^(comp) consists of all complete features plus the output feature y.
Step 2: The proposed algorithm, called Cumulative Bayesian Ridge with Gain ratio (CBRG), uses GR feature selection. The candidate feature to be imputed must offer the highest GR with y (Algorithm 1). The GR criterion defined by Quinlan is given by (4) [35]:

$GR(A_k) = \frac{IG(A_k)}{IV(A_k)}, \quad IG(A_k) = I(X) - E(A_k), \quad IV(A_k) = -\sum_{j=1}^{n} \frac{|X_j|}{|X|} \log_2 \frac{|X_j|}{|X|}$ (4)

where IG is the information gain, X is the set of cases, I specifies the entropy, E is the expected information of the feature A_k, n is the number of possible values of feature A_k (with X_j the subset of cases taking the j-th value), and IV is the intrinsic value.
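For a discrete feature, the GR criterion in (4) can be computed directly; the sketch below is an illustrative implementation (continuous features would need discretization first, which the paper does not detail):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """I(X): Shannon entropy of a label sequence, in bits."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(feature, target):
    """GR(A_k) = IG(A_k) / IV(A_k), following Quinlan's definition in (4)."""
    ig = entropy(target)               # start from I(X), subtract E(A_k) below
    iv = 0.0
    n = len(feature)
    for value in set(feature):
        idx = [i for i, f in enumerate(feature) if f == value]
        w = len(idx) / n               # |X_j| / |X|
        ig -= w * entropy([target[i] for i in idx])
        iv -= w * np.log2(w)
    return ig / iv if iv > 0 else 0.0
```

For example, a feature that splits the target perfectly into pure halves scores 1.0, while a constant feature scores 0.0.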
Step 3: After choosing the candidate feature X^(mis)_g, the model is fitted using the cumulative formula given by (5), with X^(comp) as the independent features and the candidate feature as the dependent feature:

$X_g^{(mis)} = \beta_{g0} + \sum_{i=1}^{c+g-1} \beta_{gi} x_i$ (5)

where g = 1, 2, . . . , m; m is the number of features containing missing values; and c is the number of complete independent features, so each previously imputed feature joins the predictors of the next. The four hyper-parameters are selected to be non-informative, by default α_1g = α_2g = λ_1g = λ_2g = 10^-6. After imputing the missing values within X^(mis)_g, the chosen feature is removed from X^(mis) and appended, together with y, to X^(comp); a new X^(mis)_g is then chosen from X^(mis) and the model is fitted again with the cumulative formula.

Algorithm 1 CBRG
1: Split D into X^(comp) and X^(mis).
2: From X^(mis), select the X^(mis)_g with the highest GR with y.
3: Fit the BRR model on X^(comp) as independent features and X^(mis)_g as the dependent feature using the cumulative formula (5).
4: Impute the missing values of X^(mis)_g; move it from X^(mis) to X^(comp).
5: Repeat from line 2 until X^(mis) is empty.

TABLE 1. Specifications of datasets. The first, second, and third columns present the dataset name, number of instances, and number of features, respectively; the fourth column presents the missingness ratio (MR).
Step 4: Repeat from step 2 until X (mis) is empty, at that point return X (comp) as the imputed dataset.
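Steps 1-4 can be sketched as follows. This is a hedged reconstruction, not the authors' code: the `rank` callable stands in for the GR score of Step 2 (the usage example substitutes absolute correlation for brevity), and scikit-learn's BayesianRidge plays the role of the BRR model:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import BayesianRidge

def cbrg_impute(df, target, rank):
    """Impute features one at a time in cumulative order; each completed
    feature joins the predictor set used for the next candidate."""
    complete = [c for c in df.columns if df[c].notna().all()]   # X^(comp), includes y
    missing = [c for c in df.columns if c not in complete]      # X^(mis)
    out = df.copy()
    while missing:
        # Step 2: candidate with the highest score against the target
        cand = max(missing, key=lambda c: rank(out[c], out[target]))
        obs = out[cand].notna()
        model = BayesianRidge()  # non-informative gamma priors by default
        model.fit(out.loc[obs, complete], out.loc[obs, cand])
        out.loc[~obs, cand] = model.predict(out.loc[~obs, complete])
        # Step 3: the newly imputed feature becomes a predictor (cumulative)
        complete.append(cand)
        missing.remove(cand)
    return out  # Step 4: X^(comp) now holds every feature, fully imputed

# Illustrative run with |corr| as a stand-in for the gain ratio
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "y"])
df.loc[:9, "a"] = np.nan
imputed = cbrg_impute(df, "y", lambda col, y: abs(col.corr(y)))
```

The cumulative effect comes from appending each finished candidate to `complete`: the second candidate is predicted from the original complete features plus the first imputed one, and so on.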

IV. EXPERIMENTAL IMPLEMENTATION

A. DATASETS
The datasets were gathered from several database repositories and differ from each other in size and type. The authors used eight different datasets that are commonly studied in the literature (Table 1). The BNG_heart_statlog and Poker Hand datasets are considered massive, so the analysis was implemented on randomly sampled sub-datasets of 10000, 15000, 20000, and 50000 instances from each of them [4].
In every dataset, the proportions of missing values were generated from each missingness mechanism using the ampute function from the R environment [44].

B. PERFORMANCE EVALUATION
The performance evaluation was calculated using RMSE, MAE, $R^2$, and the imputation time in seconds [3]:

$RMSE = \sqrt{\frac{1}{n}\sum_{l=1}^{n}(\hat{y}_l - y_l)^2}$ (6)
$MAE = \frac{1}{n}\sum_{l=1}^{n}|\hat{y}_l - y_l|$ (7)
$R^2 = 1 - \frac{\sum_{l=1}^{n}(\hat{y}_l - y_l)^2}{\sum_{l=1}^{n}(y_l - \bar{y})^2}$ (8)

where $\hat{y}_l$ and $y_l$ are the predicted and real values of the l-th instance, respectively, $\bar{y}$ is the mean of the observed data, and n is the number of instances.

CBRG was compared against six algorithms, presented briefly in Table 2. Multivariate Imputation by Chained Equations (MICE) is one of the best methods for handling missing data problems. MICE assumes that the missing data are of MAR type and handles missing data using one imputation model per feature; by default, linear regression and logistic regression are used to impute continuous and categorical missing values, respectively. In MICE, predictive models are applied in an iterative series, and the iterations stop when convergence occurs. The LeastSquares method uses the least squares methodology, a standard method in regression analysis: it minimizes the sum of squared residuals (i.e., the differences between the imputed values predicted by the model and the observed values). When the data are of MCAR type, the least squares coefficients are stable (i.e., unbiased as the sample size increases) but not fully efficient; assessing models using weighted least squares leads to better results. The Norm method fits a Gaussian distribution using the sample mean and variance of the observed data, and missing data are filled with random samples from this distribution. The Stochastic method also relies on the least squares methodology but adds a random draw from the regression's error distribution to each prediction; although this method can be used directly, such behavior is not preferred. The Fast KNN method first fills the missing data with an initial impute function (mean imputation) and then uses the resulting dataset to build a KDTree for finding nearest neighbours; the 'k' nearest neighbours are combined as a weighted average.
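The three metrics in (6)-(8) map directly onto standard library calls; the sketch below (illustrative values, not results from the paper) computes them with scikit-learn and NumPy:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # real values y_l
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # predicted values y_hat_l

mae = mean_absolute_error(y_true, y_pred)                   # (7)
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))   # (6)
r2 = r2_score(y_true, y_pred)                               # (8)
```

In an imputation benchmark these would be evaluated on the artificially masked entries, comparing the imputed values against the held-out originals.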
The experiments were implemented on a computer with the following specifications: 4 GB memory, Intel Core i5-2400 (3.10 GHz) processor, 500 GB HDD, Windows 10 OS, and the R (version 3.5.2) and Python (version 3.7) programming languages.

V. RESULTS AND DISCUSSION
This section is subdivided into four subsections. The first subsection presents the accuracy analysis, the second explains the error analysis, the third presents the imputation time, and the fourth presents the limitations of the proposed algorithm.

A. ACCURACY ANALYSIS
This subsection shows that CBRG achieves better accuracy in most cases; the accuracy analysis is discussed in detail below. Accuracy is understood as how well the model predicts unknown data. Fig. 2 shows the improvement percentage of $R^2$, defined by (8), for CBRG versus the compared algorithms. Against Stochastic, Norm, Fast KNN, and Expectation Maximization Imputation (EMI), CBRG improves $R^2$ on all datasets under all missingness mechanisms, except that under MNAR, CBRG is worse than Stochastic on the California dataset. Compared with LeastSquares under MAR, the $R^2$ of CBRG is better on the profit estimation of companies and diamonds datasets, equal on all generated samples from the BNG_heart_statlog and Poker Hand datasets, and worse on the rest. Under MCAR, the $R^2$ of CBRG is equal to LeastSquares on all generated samples from BNG_heart_statlog and Poker Hand, and worse on the rest. Under MNAR, the $R^2$ of CBRG is better than LeastSquares on the profit estimation of companies dataset; equal on graduate admissions, diamonds, and all generated samples from BNG_heart_statlog and Poker Hand; and worse on the rest. Compared with Multivariate Imputation by Chained Equations (MICE) under MAR, the $R^2$ of CBRG is equal to MICE on diamonds and all generated samples from BNG_heart_statlog and Poker Hand, and worse on the rest.
Under MCAR, the $R^2$ of CBRG is equal to MICE on all generated samples from BNG_heart_statlog and Poker Hand, and worse on the rest. Under MNAR, the $R^2$ of CBRG is equal to MICE on the diamonds, BNG (10000), Poker (10000), Poker (15000), Poker (20000), and Poker (50000) datasets, and worse on the rest.
Selecting large samples from a dataset drives the sample distribution close to a normal distribution. Since CBRG depends on the BRR technique, which in turn assumes that the independent features are normally distributed, CBRG achieves good accuracy on the generated sample datasets.

B. ERROR ANALYSIS
This subsection shows that CBRG produces lower errors in most cases; the error analysis is discussed in detail below. The error analysis was implemented by calculating MAE, given by (7), and RMSE, defined by (6). Fig. 3 and Fig. 4 show the improvement percentage of MAE and RMSE, respectively, for the proposed algorithm versus the compared algorithms. Compared with Norm, CBRG yields lower error on all datasets under all missingness mechanisms. Compared with Stochastic, the MAE of CBRG is better on all datasets except diamonds under all missingness mechanisms; under MAR, the RMSE of CBRG is worse than Stochastic on the diamonds dataset, and under MNAR on the diamonds and diabetes datasets. Under MCAR, the RMSE of CBRG is better than LeastSquares on the graduate admissions, profit estimation of companies, Poker (10000), and Poker (50000) datasets, and worse on the rest. Compared with MICE under MAR, the MAE of CBRG is better on the graduate admissions, profit estimation of companies, Poker (10000), Poker (15000), and Poker (20000) datasets, and worse on the rest. Under MCAR, the MAE of CBRG is better than MICE on the graduate admissions, profit estimation of companies, and Poker (10000) datasets, and worse on the rest. Under MAR, the RMSE of CBRG is better than MICE on the graduate admissions, profit estimation of companies, BNG (10000), Poker (10000), Poker (15000), and Poker (20000) datasets, and worse on the rest. Under MCAR, the RMSE of CBRG is better than MICE on the graduate admissions, Poker (10000), and Poker (50000) datasets, and worse on the rest.
In MNAR, RMSE of CBRG is better than MICE when implemented on graduate admissions, profit estimation of companies, Poker (10000), Poker (15000), Poker (20000) and Poker (50000) datasets, and worse when implemented on the rest of the datasets.

C. IMPUTATION TIME
This subsection shows that CBRG achieves better imputation time in most cases; the imputation time analysis is discussed in detail below. Fig. 5 shows the improvement percentage of imputation time for the proposed algorithm versus the compared algorithms. Compared with LeastSquares and Stochastic, under all missingness mechanisms, the imputation time of CBRG is better on the graduate admissions, diabetes, BNG (10000), BNG (15000), BNG (20000), BNG (50000), Poker (10000), Poker (15000), and Poker (50000) datasets, and worse on the rest. Compared with EMI under MAR, the imputation time of CBRG is better on all datasets except graduate admissions, diabetes, profit estimation of companies, and California; under MCAR and MNAR, it is better on all datasets except diabetes, profit estimation of companies, and California. Compared with MICE, under all missingness mechanisms, the imputation time of CBRG is better on all datasets except profit estimation of companies and California. Compared with Norm under MAR, the imputation time of CBRG is worse on all datasets except graduate admissions. Compared with Fast KNN under MAR, the imputation time of CBRG is better on all datasets except graduate admissions, profit estimation of companies, red & white wine, California, and diamonds; under MCAR and MNAR, it is better on all datasets except profit estimation of companies, red & white wine, California, and diamonds.
Choosing the candidate feature to be imputed by CBRG depends on GR feature selection, and calculating the GR requires computing the entropy, which is computationally expensive. Nevertheless, CBRG shows a good imputation time on the samples generated from the Poker Hand and BNG_heart_statlog datasets, and on very small datasets; the proposed algorithm consumes a long imputation time when imputing large datasets. MICE is not efficient in imputation time, while Norm offers the best imputation time on all datasets. Fast KNN and EMI offer good imputation times on small datasets but consume long imputation times on large ones.

D. LIMITATIONS OF THE PROPOSED METHOD
The proposed method deals only with numerical features, not nominal features. Imputing large datasets leads the proposed method to consume more time than the other stated methods. The proposed method also assumes that the independent and dependent features follow a Gaussian distribution, so it is affected by the distribution of the features, especially when the features are skewed or contain noise and outliers.

VI. CONCLUSION
It is essential to handle missing data, as it occurs in almost all real-world data, and handling incomplete instances is very significant for observational analyses with several predictors. In this work, the authors studied a set of previously published approaches for handling missing data and reviewed their implementation on different datasets with different proportions of missing values generated from the three missingness mechanisms. In addition, a new algorithm, CBRG, was proposed that works in cumulative order to impute all missing values, with the candidate attribute selected by GR feature selection. The proposed algorithm shows good accuracy when compared with Stochastic, Fast KNN, Norm, MICE, LeastSquares, and EMI; CBRG, LeastSquares, and MICE present the best performance among the mentioned methods. The proposed algorithm shows an acceptable running time and is considered fast, though not the fastest, because of the time consumed in calculating the GR. The results also reveal that the proposed algorithm works well with any missingness mechanism and any missing data percentage. When the data features are highly correlated, CBRG shows high imputation accuracy with low error, as observed on the profit estimation of companies dataset.
In future research, it is recommended to implement the proposed imputation algorithm on additional datasets; additional statistical measures (such as the T-value and P-value) will be considered when choosing the candidate feature. A promising future direction is to employ algorithms that handle optimization problems with mixed features, such as the GSA-GA algorithm [48]. In addition, the proposed algorithm is promising for imputing incomplete medical datasets, such as DNA microarray data, cardiovascular disease data, pulmonary embolism data, food composition data, and other medical data.

SAFWAT HAMAD received the bachelor's degree, in 2000, the M.Sc. degree in modeling, simulation, and visualization, and the joint Ph.D. degree in high-performance computing from the Computer Science and Engineering Department, University of Connecticut, USA, and the Faculty of Computer and Information Sciences (FCIS), Ain Shams University (ASU), Cairo, Egypt, in 2008. He was a teaching assistant for several undergraduate courses. He has been the Chair of the Scientific Computing Department, FCIS, ASU, since 2017. He is currently an Associate Professor. His research interests include image and video processing, computational biology, machine learning, encryption, and security.
HIROFUMI AMANO (Member, IEEE) received the B.E. degree in electronics and the M.E. and D.Eng. degrees in computer science and communication engineering from Kyushu University, Fukuoka, Japan, in 1986, 1988, and 1991, respectively.
From 1991 to 1994, he was a Research Associate with the Department of Computer Science and Communication Engineering, Kyushu University. In 1994, he was promoted to an Associate Professor with the Computer Center, Kyushu University. Since 2007, he has been an Associate Professor with the Research Institute for Information Technology, Kyushu University. He is currently working with the Graduate School of Information Science and Electrical Engineering, Kyushu University. His research interests include parallel processing, distributed processing, grid computing, and cloud computing. He is also a member of the Information Processing Society of Japan, and the Institute of Electronics, Information and Communication Engineers, Japan.