Missing Value Imputation Designs and Methods of Nature-Inspired Metaheuristic Techniques: A Systematic Review

Missing values are highly undesirable in real-world datasets. The missing values should be estimated and treated during the preprocessing stage. With the expansion of nature-inspired metaheuristic techniques, interest in missing value imputation (MVI) has increased. The main goal of this review is to identify and examine the existing research on missing value imputation (MVI) in terms of nature-inspired metaheuristic approaches, dataset designs, missingness mechanisms, and missing rates, as well as the most used evaluation metrics, between 2011 and 2021. This study ultimately gives insight into how an MVI plan can be incorporated into the experimental design. Using the systematic literature review (SLR) guidelines designed by Kitchenham, this study utilizes renowned scientific databases to retrieve and analyze all relevant articles during the search process. A total of 48 related articles from 2011 to 2021 were selected to assess the review questions. This review indicated that the synthetic missing dataset is the most popular baseline test dataset for evaluating the effectiveness of an imputation strategy. The study revealed that missing at random (MAR) is the most commonly proposed missing mechanism in the datasets. This review also indicated that hybridizations of metaheuristics with clustering or neural networks are popular among researchers. The superior performance of the hybrid approaches is significantly attributed to the power of optimized learning in MVI models. In addition, perspectives, challenges, and opportunities in MVI are also addressed. The outcome of this review serves as a toolkit for researchers to develop effective MVI models.


I. INTRODUCTION
Data quality in machine learning has been intensively studied over the past decades. One of the data quality issues is missing values. Missing values can be defined as portions of the data that are either incomplete or absent in the dataset. The presence of missing values in the dataset diminishes data quality, reduces the power of data analysis, and induces bias in data science applications. Hence, dealing with incomplete information is critical for most data mining and machine learning techniques [1].
Numerous studies have been successfully conducted to address the issue of missing values. Little and Rubin [2] classified missing values into three mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In the case of MCAR, the probability of a value being missing is independent of both the observed and unobserved data. In the case of MAR, the probability of a value being missing is not related to the missing value itself; instead, it is related to parts of the observed data. In the MNAR case, missingness depends on the missing variable itself, in which the incomplete values are associated with unmeasured events. Under MCAR, any missing value estimation technique can be applied because the missingness introduces no bias into the data. Furthermore, the missing value pattern explains how the data is missing in different ways. A univariate missing value pattern occurs when only one variable is missing. Data is missing monotonically if the missing values follow a pattern. On the other hand, data is missing arbitrarily if there is no clear pattern.
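The three mechanisms are easiest to see as rules for generating a missingness mask. The sketch below (plain Python; the variable names, probabilities, and thresholds are illustrative assumptions, not taken from any reviewed study) simulates all three:

```python
import random

random.seed(0)
n = 1000
age = [random.uniform(20, 80) for _ in range(n)]    # fully observed covariate
income = [random.gauss(50, 10) for _ in range(n)]   # variable that will lose values

# MCAR: a flat 20% chance of being missing, independent of all data.
mcar = [None if random.random() < 0.20 else x for x in income]

# MAR: missingness depends only on the *observed* covariate (age),
# e.g. older respondents are assumed less likely to report income.
mar = [None if random.random() < (0.35 if a > 50 else 0.05) else x
       for a, x in zip(age, income)]

# MNAR: missingness depends on the unobserved value itself,
# e.g. high incomes are assumed more likely to be withheld.
mnar = [None if random.random() < (0.50 if x > 60 else 0.05) else x
        for x in income]

print(sum(v is None for v in mcar) / n)  # close to 0.20
```

Note that the MCAR rule ignores the data entirely, the MAR rule consults only the observed covariate, and the MNAR rule consults the value being deleted, which is why MNAR cannot be diagnosed from the observed data alone.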
Moreover, the percentage of missing values impacts data quality. However, the existing literature does not have a standard cutoff for the acceptable proportion of missing values in a dataset for quality data analysis. For example, Bormann [3] suggested 10% missing precipitation values of the calendar days as the threshold for removing whole winter observations from the analysis. In contrast, Tatar et al. [4] excluded features with more than 50% missing values from the prediction of low salinity waterflooding, while mean imputation was applied to features below the missing threshold.
Equipment failure is a major cause of high missing rates. Eliminating observations with high missing rates diminishes the representativeness of the samples. However, missing rates can be higher than 50% in real-world scenarios. Therefore, missing value imputation (MVI) is used to address the problem of missing values. MVI is a procedure that fills in missing values with substitutes [5]. Over the past decades, various machine learning techniques have been proposed to deal with incomplete datasets in different problem domains, such as medicine [6], hydrology [7], [8], and transportation [9].
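The review later notes that synthetic missing datasets, where the deleted values are known, are the usual test bed for MVI methods. A minimal sketch of that evaluation loop may help (plain Python; the data, the 20% missing rate, and mean substitution as the baseline imputer are illustrative assumptions):

```python
import math
import random

random.seed(1)

# A complete toy series serves as ground truth; values are synthetic.
truth = [random.gauss(100, 15) for _ in range(200)]

# Build a synthetic missing dataset by deleting 20% of values at random,
# so that the "right answers" for the gaps are known.
observed = [None if random.random() < 0.20 else v for v in truth]

# Baseline MVI: substitute each gap with the mean of the observed values.
seen = [v for v in observed if v is not None]
mean = sum(seen) / len(seen)
imputed = [mean if v is None else v for v in observed]

# Because the true values are known, imputation error (here RMSE) can be
# measured directly on the artificially deleted entries.
gaps = [i for i, v in enumerate(observed) if v is None]
rmse = math.sqrt(sum((imputed[i] - truth[i]) ** 2 for i in gaps) / len(gaps))
print(round(rmse, 2))  # for mean filling, roughly the data's std. dev. (~15)
```

Any of the imputers surveyed below can be dropped into the middle step and scored with the same RMSE (or MAE, accuracy, etc.) on the artificially deleted entries.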
Consequently, a number of reviews [10]-[12] discuss recent machine learning-based imputation techniques for solving incomplete dataset problems. Nevertheless, MVI with nature-inspired metaheuristic techniques has received limited attention in the literature. Therefore, the purpose of this review is to examine recent MVI designs of metaheuristic techniques used for handling and optimizing missing value imputation. This SLR follows the guidelines established by Kitchenham and Charters [13], thereby providing significant insights for researchers working in the MVI domain.
The contributions of this review are: 1) A comprehensive systematic literature review on the existing MVI designs for metaheuristic approaches is presented, together with experimental design, dataset design, missingness mechanisms, missing rates, and evaluation metrics.
2) A guide to address, manage, and report MVI studies is introduced. This SLR serves as a toolkit for researchers to develop solutions to the challenges of implementing effective missing value imputation.
This research is organized as follows: Section II presents the SLR methodologies, whereas Section III summarizes the SLR findings. Section IV discusses the research trends and potential opportunities in MVI. Section V highlights the challenges, and finally, the conclusion is presented in Section VI.

II. RESEARCH AND REVIEW METHOD
This section describes the systematic approach for reviewing recent articles on metaheuristic-based MVI techniques by adopting Kitchenham's SLR standards. The retrieved articles are inspected, analyzed, and evaluated according to the research questions and review protocols. Each phase of this SLR is explained in the following sections.

A. PLANNING THE REVIEW
This section outlines the review plan needed to undertake the SLR, which includes formulating research questions in accordance with the review's primary objective, defining a search strategy, and designing a comprehensive review protocol.

1) RESEARCH QUESTIONS
This review aims to study the existing literature on metaheuristic designs and methods for optimizing and solving missing value problems. The following Research Questions (RQs) for this literature are formulated to accomplish this aim, as indicated in Table 1.
In the past ten years, several novel imputation techniques have been proposed. This SLR aims to identify the differences among the methods to enrich the understanding of MVI methods, which can be taken as the basis for planning and developing a new imputation model. RQ1 is formulated to provide an overview of state-of-the-art metaheuristic techniques used to handle and optimize missing value imputation. Meanwhile, RQ2 is defined to explore the experimental designs of imputation and understand what factors affect the MVI design. RQ3 is outlined to understand what metrics are commonly used when evaluating the missing value imputation method.

2) SEARCH STRATEGY
The search strategy begins with selecting relevant databases (IEEExplore, ScienceDirect, Scopus, and other electronic databases) to track scientific papers that address the research topics published in linked journals, conferences, and book chapters. The search string used to retrieve articles from the scientific databases is described as follows: String: ("metaheuristic" OR "optimization" OR "evolutionary") AND ("imputation") AND (YEAR > 2010 AND YEAR < 2022)

3) INCLUSION AND EXCLUSION CRITERIA
A list of inclusion and exclusion criteria was constructed in this literature, as shown in Table 2. The inclusion and exclusion criteria are used as one of the review protocols to narrow the relevant studies to the most pertinent ones during the article review process.

4) QUALITY ASSESSMENT CRITERIA
Another review protocol is the quality assessment criteria. The quality assessment criteria are crucial to determining the quality of the selected articles. The quality assessment criteria, constructed based on Kitchenham and Charters [13], Genc-Nayebi et al. [14], and Yang et al. [15], are presented in Table 3. The quality assessment is based on the responses "Yes," "No," and "Partially applicable," abbreviated as "Y," "N," and "P," respectively.

B. CONDUCTING THE REVIEW
The article selection was carried out by applying the mentioned search string. Initially, our search string found 758 publications from different databases between 2011 and 2021. The search results were then narrowed down by manually reviewing all the articles' titles and abstracts, resulting in a total of 644 articles. Next, the potential articles were filtered according to the RQs, which yielded 181 articles. Further filtering was applied by removing irrelevant studies according to the detailed inclusion and exclusion criteria shown in Table 2. Additionally, the quality assessment was conducted, and we chose articles that responded affirmatively to the nine quality assessment criteria listed in Table 3. The findings indicated that most selected articles satisfied all the quality assessment criteria. After the final selection, a total of 48 articles fulfilled all the inclusion and quality assessment criteria used in this review. The article selection processes are summarized and illustrated in Figure 1.

III. RESEARCH FINDINGS
This section presents and discusses the findings from the literature review conducted in response to the RQs identified in Section II. The section is divided into three subsections, addressing in turn the state-of-the-art metaheuristic techniques, the experimental designs, and the evaluation metrics. Figure 2 indicates the trend of publications over ten years. The graph illustrates the popularity of metaheuristic techniques in MVI research over time. As can be seen, studies on metaheuristic-based MVI have experienced continuous growth since 2011 and show an emerging trend in MVI research. The growth is apparently due to the explosion of data science research involving high-quality data, which has raised researchers' awareness of the importance of imputation. Next, we summarize the metaheuristic techniques employed in handling MVI and highlight their primary benefits. We have categorized the metaheuristic techniques into three categories: the first is the single objective approach, followed by the multi-objective and hybrid approaches as the second and third categories. The taxonomy of metaheuristic approaches in handling and optimizing MVI is shown in Figure 3.
From the literature, the genetic algorithm (GA) has become one of the most widely used metaheuristic approaches in MVI tasks. Figueroa García et al. [16] used GA imputation to estimate missing values by minimizing an error function derived from the covariance matrix and means vector, while Lobato et al. [17] improved GA imputation for incomplete multi-attribute datasets. Recently, Awawdeh et al. [18] performed imputation and feature selection simultaneously: GA was used to determine the most optimal features, while mean and mode imputation were used to fill missing numeric and categorical features, respectively. The advantage of this approach is that it is more tolerant of bias under the MAR and MNAR missingness types. In another study, Sivapragasam et al. [19] utilized mathematical models in genetic programming (GP) to reconstruct missing time series rainfall data. In [20], PSO imputation was proposed to infill missing gene expressions. The advantage of this approach is that it is simple and easy to implement. However, the performance of the PSO imputation cannot be generalized, as it was only compared with conventional imputers such as K-nearest neighbor (KNN) and row averaging imputation at missing rates of 5%, 8%, and 10%.
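The single-objective formulation above reduces imputation to function minimization: the missing entries become the decision variables, and the metaheuristic searches for values that minimize an error function over the completed dataset. A minimal elitist-GA sketch of that idea follows (the toy series, the roughness objective, and all GA settings are illustrative assumptions; the reviewed studies each define their own error functions, e.g. the covariance/means error in [16]):

```python
import math
import random

random.seed(2)

# Toy time series with artificially deleted points; gaps are marked None.
truth = [math.sin(i / 4.0) for i in range(40)]
gaps = [5, 6, 17, 28, 33]
series = [None if i in gaps else v for i, v in enumerate(truth)]

lo = min(v for v in series if v is not None)
hi = max(v for v in series if v is not None)

def cost(genes):
    # Error function to minimize: roughness (sum of squared first
    # differences) of the completed series, standing in here for the
    # statistical error functions used in the reviewed GA imputers.
    filled = list(series)
    for pos, g in zip(gaps, genes):
        filled[pos] = g
    return sum((filled[i + 1] - filled[i]) ** 2 for i in range(len(filled) - 1))

# Minimal elitist GA: each chromosome holds one candidate value per gap.
pop = [[random.uniform(lo, hi) for _ in gaps] for _ in range(30)]
for _ in range(300):
    elite = sorted(pop, key=cost)[:10]            # selection
    children = []
    while len(children) < 20:
        a, b = random.sample(elite, 2)            # uniform crossover
        child = [x if random.random() < 0.5 else y for x, y in zip(a, b)]
        child = [g + random.gauss(0, 0.1) if random.random() < 0.4 else g
                 for g in child]                  # Gaussian mutation
        children.append(child)
    pop = elite + children

best = min(pop, key=cost)
err = max(abs(g - truth[p]) for g, p in zip(best, gaps))
print(round(err, 3))  # gap between evolved fill-ins and the deleted values
```

The same skeleton applies to the PSO, ACO, and WOA imputers surveyed below; only the candidate-update rule and the error function change.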
For multi-objective metaheuristic approaches, Lobato et al. [21] analyzed incomplete instances and modeling task information using a multi-objective GA (MOGA-II) based on the non-dominated sorting genetic algorithm II (NSGA-II) to infill mixed-attribute datasets. Both objective functions, root mean square error (RMSE) and classification accuracy, significantly improved the imputation performance for incomplete numeric and nominal features. On the other hand, recent work by Khorshidi et al. [22] proposed two objective functions, a cluster validity function and a correlation function, to enhance the existing NSGA-II. The advantages of this approach are that it is robust and able to handle online imputation and classification simultaneously for the MAR missingness type. The multi-objective particle swarm optimization (MOPSO) approach proposed in [23] determined the optimal imputation algorithm based on the MCAR, MAR, and MNAR missingness mechanisms, with a fitness function adapted according to sensitivity and specificity. The proposed MOPSO improved imputation accuracy by 16.52% compared with the deletion, mean, expectation-maximization, multivariate imputation by chained equations (MICE), and missForest imputation approaches. However, the shortcomings of this approach are that it is slow and that the imputation model is more dependent on variables than on records.
Several new methods that combine metaheuristics with other techniques, such as Bayesian, clustering, probabilistic, and neural network approaches, have been proposed to improve imputation accuracy. Indeed, most studies adopted hybrid approaches to address missing value issues. As for the Bayesian category, several studies [24]-[27] have explored the idea of infilling missing values using a combination of metaheuristic and Bayesian algorithms. The advantage of incorporating Bayesian fitness is that it improves the optimality of the solution. In [28], Nekouie and Moattar improved imputation performance using Bayesian networks, tensor factorization, and chaotic PSO. The approach reduced error by 4% compared with the tensor method for missing numerical values and class imbalance problems.
On the other hand, some researchers combined probabilistic and metaheuristic approaches to estimate missing values [29]-[33]. KNN imputation was used to infill missing values based on neighbors' data and optimized by GA [29] and PSO [30]. Recently, Nagarajan and Dhinesh Babu [31] proposed a feature weighting approach that combined an improved local search with the whale optimization algorithm (WOA). The advantage of this approach is that the hybrid learned different values of k for the nearest neighbors of different testing values by examining the correlation matrix between the training and testing datasets. Moreover, the WOA avoided local optima and converged to a better solution in the final iterations. The findings indicated that missing values were predicted more precisely, improving classification performance on electronic health records. However, this approach is inefficient for large datasets with high-dimensional features. Meanwhile, Krishna and Ravi [32] utilized a covariance matrix to reduce the error function of PSO. The approach achieved better classification accuracy than the hybrid of K-means and the multilayer perceptron (MLP), while producing comparable results for regression tasks. For the time series problem, a combination of inverse distance weighting (IDW), tolerance rough set (TR), and PSO [33] was proposed to determine the optimal influence factor value for each recognized data point in the neighboring group, thereby reducing the error rate of the imputed time series data.
As for the clustering category, several researchers employed clustering methods with soft computing. In [34], Veroneze et al. proposed a combination of bi-clustering and ant colony optimization (ACO) to deal with missing data problems. The introduction of a bi-clustering strategy and optimal parameter selection in this approach enhanced the imputation quality for missing gene expression datasets; however, long execution times increased the computational cost of this approach.
The works of [35] and [36] specified the use of Fuzzy C-means (FCM) with GA, generating a matrix-based data structure and optimizing it through a GA parameter optimization process to improve the accuracy of missing value estimation. Meanwhile, Aydilek and Arslan [37] demonstrated that combining an optimized clustering process with support vector training improved imputation performance. However, proportions of missing data higher than 25% were not considered in the study. Then, Khotimah and Pramudita [38] implemented self-organizing map (SOM) imputation with GA. The selection of SOM weights using a GA with elite chromosomes determined the shortest distance between the data and the cluster centroid, resulting in a more accurate solution for incomplete data estimation.
In FCM imputation with the PSO method [39]-[42], the missing values can be estimated from the observed data with different optimized weights to improve data quality. Recent work by Hu et al. [43] represented missing values in hybrid numeric and granular forms. It used information granularities to construct granular fuzzy models (GFM), while PSO optimized the allocation of information granularities. The advantage of this approach is that the established granular models improved numerical value prediction accuracy by extracting the essential target information from incomplete data. On the other hand, Gautam and Ravi [44] implemented data imputation via a two-stage learning strategy: the first stage was based on local learning in the particle swarm optimization-evolving clustering method (PSO-ECM), and the second on global approximation in an auto-associative extreme learning machine (AAELM). Another approach is the ELM+PSO+FCM proposed by Sun et al. [45], which resulted in effective imputation of byproduct gas flow data. These studies [43]-[45] demonstrated a positive impact on MVI accuracy, but the imputation results were only examined at missing rates under 50%.
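At the heart of these FCM-based hybrids, a missing entry is replaced by a membership-weighted combination of cluster centers. A minimal sketch of that fill step follows (the centers, the record, and the fuzzifier are illustrative assumptions; in the reviewed work FCM learns the centers and PSO tunes the clustering parameters and weights):

```python
# Cluster centres over three features; in the reviewed hybrids these are
# learned by FCM and the surrounding parameters are tuned by PSO.
centres = [[1.0, 2.0, 3.0], [8.0, 9.0, 10.0]]
row = [1.2, 2.1, None]   # incomplete record to be filled
m = 2.0                  # FCM fuzzifier

j = row.index(None)
# Distance to each centre computed over the observed features only
# (assumes the record does not coincide with a centre, so no zero distance).
dists = [sum((a - b) ** 2 for a, b in zip(c, row) if b is not None) ** 0.5
         for c in centres]
# Standard FCM membership: u_i proportional to 1 / d_i^(2/(m-1)),
# normalized so the memberships sum to one.
inv = [1.0 / d ** (2.0 / (m - 1.0)) for d in dists]
u = [x / sum(inv) for x in inv]
# Fill the gap with the membership-weighted average of centre values.
estimate = sum(ui * c[j] for ui, c in zip(u, centres))
print(round(estimate, 3))  # dominated by the nearby first centre, near 3.0
```

Because the record lies close to the first centre, its membership dominates and the fill is pulled toward that centre's value for the missing feature, which is the intuition behind using PSO to sharpen the weights.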
To provide greater accuracy in predicting numerical and nominal missing values, the recent work in [46] extended the existing PSO imputation approach by incorporating ontology and K-means, where ontology eliminated irrelevant data and K-means accelerated PSO convergence. In addition to PSO imputation, a fruit fly optimization algorithm (FOA) was proposed in [47] for solving missing time series values. First, SOM was used to cluster the time series and obtain a similarity matrix for the incomplete series. Then, the approach employed a cross-validation procedure and the FOA strategy to determine the optimal parameters of the least-squares support vector machine (LSSVM) for building an optimal imputation model. In addition, Tran et al. [48] proposed an approach for classifying data with missing values that integrated imputation, clustering, and feature selection. The proposed clustering minimized the number of instances used by imputation, whereas differential evolution (DE) extracted relevant features of the training data. However, removing instances may result in data loss, and performing feature selection after initial imputation can be time-consuming, particularly when dealing with high-dimensional data.
Duma et al. [50] proposed a hybrid multi-layered artificial immune system and GA to fill in missing values for insurance datasets. In [49], the authors demonstrated that using random forest (RF) and GA-selected predictors to estimate missing forest inventory variables with data from target and auxiliary stands significantly reduced model bias. Other than that, the proposed hybrid GA and asexual reproduction optimization (ARO) approach outperformed the mean and original GA imputation approaches by incorporating ARO imputation and GA optimization [51].
TABLE 4. Relative comparison of metaheuristic techniques for handling and optimizing MVI.

Single objective
- GA [16]: GA imputation that estimates missing values by minimizing an error function derived from the covariance matrix and means vector.
- GA [17]: Improved GA imputation for incomplete multi-attribute datasets. Improved the classification accuracy for mixed variable types.
- GA [18]: Handles missing value imputation and feature selection simultaneously. Can minimize bias when handling MAR and MNAR missing data types.
- GP [19]: GP incorporated with mathematical models such as sin, cos, exp, and log to predict missing monthly rainfall data. Able to handle the nonlinear relationship of rainfall data.
- PSO [20]: PSO-based imputation for missing gene expressions. Simple and easy to implement.

Multi-objective
- MOGA-II [21]: Multi-objective GA based on NSGA-II that can handle mixed-attribute datasets and incorporates information from incomplete instances and modeling tasks. Significantly improved imputation performance, with a higher statistical ranking than the compared methods on both objective functions studied (RMSE and classifier accuracy).
- NSGA-II [22]: Multi-objective optimization model with two objective functions (cluster validity function and correlation function) for imputation and model selection. Concurrently performs online imputation and classification; robust and works well in various situations.
- MOPSO [23]: Determines the optimal imputation algorithm based on the missing data type. Improved imputation accuracy by 16.52% over the compared methods.

Hybrid: Bayesian
- ACO+Bayesian [24]: ACO hybridized with Bayesian principles for imputing missing values under the MAR mechanism. Performed better in estimating discrete and continuous missing values in large datasets under MAR than multiple imputation, expectation maximization, and kernel imputations.
- ABC+Bayesian [25]: An average of mean, distance, and random imputation was used to estimate the missing value, with Bayesian optimization integrated into the model. Bayesian optimization employed posterior and prior probability values to evaluate the fitness function, and the approach successfully solved discrete value imputation problems.
- Max-min ACO+Bayesian [26], [27]: Hybridization of Bayesian min-max and the ACO algorithm, in which the incorporated Bayesian fitness improved the optimality of the solution. Outperformed the competitive imputation models at missing rates ranging from 5% to 50%.
- Bayesian+tensor+chaotic PSO [28]: Bayesian networks estimated initial missing values, CRAPSO generated samples to deal with tensor data insufficiency, and a modified tensor factorization approach estimated the final missing values. Outperformed the compared methods in the presence of missing numerical values and class imbalance.

Hybrid: probabilistic
- GA+KNN [29]: KNN imputation optimized by a genetic algorithm. Can identify the optimal value of k and weight each attribute in the dataset.
- GMSA+MPSO+WKNN [30]: WKNN selected neighbors' data for missing data estimation, while GMSA-MPSO optimized the feature weights. Showed better estimation accuracy for sensor monitoring manufacturing systems than the compared techniques.
- Local search+WOA [31]: Hybridization of an improved local search and WOA with a feature-weighted nearest neighbor imputation approach for missing health records. Improved classification performance on the imputed health datasets.
- PSO+covariance matrix [32]: Reduced the error function derived from the covariance matrix and its determinant. Better classification accuracy compared with regression.
- IDW+TR+PSO [33]: TR employed the rough set concept to determine the neighborhood set for each unknown data point, followed by PSO to find the optimal influence factor value for each known data point in the neighborhood set. Significantly reduced the error rate of the imputed time series compared with KNN, expectation maximization, and traditional IDW.

Hybrid: clustering
- ACO+clustering [34]: Nearest neighbor (Euclidean distance) pre-imputation, after which the pre-imputed dataset was replaced by new estimates using bi-clustering and optimal parameter selection strategies in ACO. Achieved higher imputation quality than KNN and SVD for the MCAR and MAR missing mechanisms, despite a higher computational cost.
- FCM+GA [35], [36]: FCM imputation combined with GA optimization, using a matrix-based data structure and a GA parameter optimization process to improve missing data estimation. Superior to the competitive imputation models.
- FCM+SVR+GA [37]: Fuzzy C-means clustering combined with SVR and GA to handle low proportions of missing data. The optimized clustering process combined with support vector training improved imputation performance significantly.
- GA+SOM [38]: Clustering-based imputation in which the model's weights were updated via a chromosome elite search strategy in GA. The elite search strategy was more effective and efficient than non-elite search in GA.
- FCM+PSO [39]-[42]: Hybrid optimization of PSO and FCM employing a fuzzy clustering approach to impute missing values. Improved traditional clustering imputation by incorporating PSO to find the most optimal filling values.
- GFM+PSO [43]: Granular fuzzy models constructed from information granularities, with PSO optimizing the allocation of information granularities. The established granular models enhanced the prediction accuracy for numerical values.
- PSO-ECM+AAELM [44]: Two-stage learning, with local learning in PSO-ECM followed by global approximation in AAELM. The optimal parameter selection of ECM by PSO contributed significantly to the good performance of PSO-ECM and PSO-ECM+AAELM, and the approach improved local learning, global optimization, and global learning of the proposed models.
- ELM+PSO+FCM [45]: Prefilled missing values were estimated using the membership matrix and related cluster centers by linear interpolation and FCM; the clustering size and weighting factor parameters were optimized by an iterative PSO to enhance the accuracy of FCM; imputation was further enhanced in ELM by minimizing the Euclidean distance between the estimated values and the missing values. Improved the model's accuracy by imputing missing values in the byproduct gas flow dataset.
- PSO+K-means+ontology [46]: Incorporated ontology and K-means into PSO imputation, in which ontology removed irrelevant data and K-means improved PSO convergence speed. Significantly reduced errors in predicting missing nominal and numerical data.
- SOM+FOA+LSSVM [47]: Optimization techniques combined with the clustering method to provide sufficient information and an optimal solution. Higher imputation accuracy for dealing with missing spatial-temporal values.
- DE+clustering [48]: A hybrid of DE with clustering and feature selection for classification with missing values. By incorporating clustering and feature selection into imputation, achieved higher accuracy at a lower computational time.

Random forest
- GA+RF [49]: Utilized target and auxiliary stand (off-site sample) data for imputing missing forest inventory variables. The use of GA-selected predictors and auxiliary reference stands from the target dataset contributed to a reduction in model bias.

MAIS
- MAIS+GA [50]: A hybrid multi-layered artificial immune system and GA for partial missing value imputation. Enhanced accuracy and robustness in the presence of different missing rates.

ARO
- GA+ARO [51]: ARO imputed missing values for each feature; the best ARO chromosome initialized GA, which iteratively optimized the solution. Better accuracy than the compared methods.

Tree vector
- GP+tree vector: An improved version of GP in which a mixed tree-vector representation performed instance selection, while GP performed symbolic regression on missing data. Improved imputation performance for a medium number of instances, but less significant for datasets with relatively small or large numbers of instances.

SVM
- Levy flight PSO+SVM [53]: SVM imputation optimized by a levy flight PSO. Performed well for filling missing creatinine values.
- PSO+LSSVM [54]: The LSSVM model imputed missing data by combining previous monitoring data from a node with current monitoring data from a neighboring node, with the imputation parameters optimized by PSO. Higher imputation quality for missing dose rate and sensor nodes in a wireless sensor network.

Wrapper
- GP+wrapper [55]: Proposed to enhance the symbolic regression performance of missing value estimation. Achieved the highest number of better cases among the competitive models at a missing rate of 50%, followed by 30% and 10%.
- GP+wrapper [56]: Incorporated a noise sensitivity measure and wrapper into GP imputation, yielding an improved GP-based imputation regression predictor.

Neural network
- GSO+MLP [57]: A three-layer feed-forward neural network whose weights and thresholds were optimized by GSO during missing traffic flow data imputation. Can accurately predict missing time series data.
- GA+MLP, SA+MLP, PSO+MLP, RF+MLP [58]: Prediction and classification comparison of several MLP-based auto-associative neural networks with GA, SA, PSO, and RF. For prediction, GA+MLP, SA+MLP, and PSO+MLP performed better than RF+MLP; for classification, RF+MLP outperformed the others.
- SC-FITNET [59]: The sine cosine algorithm was used to optimize a neural network for imputing missing rainfall data. Effectively imputed time series data at different missing rates and outperformed the LSTM approach.
- SC-FDO+MLP [60]: A hybrid sine cosine and fitness dependent optimizer for missing rainfall imputation. The modified pace-updating position, random weight factor, and conversion parameter strategies significantly enhanced imputation accuracy for high and low proportions of missingness.
- DL-CS [61]: A deep autoencoder network was trained, and the CS algorithm optimized the objective function in the trained network to approximate the missing values. Effectively imputed large datasets of handwritten digits.
- DL-BAT [62]: A deep autoencoder network replicated the input data, and the bat algorithm estimated the missing data. Effectively dealt with high-dimensional datasets of handwritten digits.
- DL-GSA [63]: A deep autoencoder and gravitational search algorithm estimated missing handwritten digits. Outperformed DL-CS in terms of accuracy and efficiency.

In [53], Ismail et al. incorporated levy flight into PSO to improve its global exploration and help it escape from local optima. The results indicated that support vector machine (SVM) imputation optimized by levy flight PSO achieved lower error for filling the incomplete creatinine data than KNN, naïve Bayes, and decision tree imputation. Gao et al. also presented a variant of SVM-based imputation that employed LSSVM optimized by PSO to estimate incomplete values for dose rate and sensor rate data; the results revealed that the PSO+LSSVM approach achieved better accuracy than the LSSVM model [54]. Furthermore, Al-Helali et al. [55], [56] proposed wrapper-specific GP methods to improve imputation accuracy and symbolic regression performance.
The research in [57] implemented a hybrid GSO and neural network system for missing time series data imputation, and the results demonstrated that the approach could accurately predict incomplete traffic flow data for urban arterial streets. The authors of [59] proposed a sine cosine algorithm to optimize a function-fitting neural network to impute incomplete rainfall data; a significant advantage of the method is that it outperformed the long short-term memory (LSTM) method in imputing time-series data at various missing rates. Recent work [60] extended the sine cosine algorithm by proposing a novel hybrid sine cosine and fitness dependent optimizer (SC-FDO) to approximate missing rainfall data. The modified pace-updating position, random weight factor, and conversion parameter strategies significantly improved the balance between exploration and exploitation in SC-FDO. The findings revealed that the SC-FDO-based MLP trainer yielded higher imputation accuracy at both low and high missing rates than the sine cosine algorithm (SCA)- and fitness dependent optimizer (FDO)-based MLP trainers.
On the other hand, Leke et al. [58] investigated the use of hybrid MLP-based auto-associative neural networks with GA, simulated annealing (SA), PSO, and RF in the prediction and classification of missing values. The GA+MLP, SA+MLP, and PSO+MLP algorithms outperformed the RF+MLP algorithm in prediction. However, the RF+MLP algorithm outperformed the GA+MLP, SA+MLP, and PSO+MLP algorithms for classification problems.
In addition, Leke et al. explored missing values in high-dimensional datasets with the aid of deep learning (DL) and swarm intelligence approaches such as the cuckoo search algorithm (CS), firefly algorithm (FA), and bat algorithm. The essential advantage of the hybrid models (DL-CS [61] and DL-Bat [62]) is that both yielded more accurate estimates than the hybrid MLP models in [58] and than DL-FA. One shortcoming of deep neural networks is that they are time-consuming; as a result, DL-CS and DL-Bat had higher computational times than the hybrid MLP approaches. The work in [63] further improved the imputer models of [61], [62] by proposing a hybrid DL and gravitational search algorithm (DL-GSA). The DL-GSA [63] outperformed DL-CS [61] and DL-Bat [62] with higher accuracy and shorter computational time. A relative comparison of metaheuristic techniques for dealing with MVI is presented in Table 4.
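To make the common pattern behind these hybrids concrete, the following Python sketch shows the skeleton many of the reviewed methods share: candidate values for a missing entry are scored by a fitness function (here, the squared distance of the completed record to its nearest fully observed neighbor) and refined by a search procedure. This is an illustrative toy using simple random search, not a reimplementation of any cited method; the data, search bounds, and fitness function are assumptions for demonstration only.

```python
import random

def fitness(row, complete_rows):
    """Fitness of a candidate completed row: squared distance to its
    nearest fully observed neighbour (lower is better)."""
    return min(sum((a - b) ** 2 for a, b in zip(row, donor))
               for donor in complete_rows)

def impute_random_search(row, missing_idx, complete_rows,
                         lo=0.0, hi=10.0, iters=500, seed=42):
    """Fill row[missing_idx] by sampling candidate values at random and
    keeping the one whose completed row best matches the complete data."""
    rng = random.Random(seed)
    best_val, best_fit = None, float("inf")
    for _ in range(iters):
        cand = rng.uniform(lo, hi)
        trial = list(row)
        trial[missing_idx] = cand          # complete the record with the candidate
        f = fitness(trial, complete_rows)  # score the completed record
        if f < best_fit:
            best_val, best_fit = cand, f
    return best_val

# Toy data: the incomplete row is missing its second feature (None).
complete = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.1], [5.0, 6.0, 7.0]]
incomplete = [1.05, None, 3.05]
est = impute_random_search(incomplete, 1, complete)  # should land near 2.0-2.1
```

The reviewed hybrids replace the random search with GA, PSO, GSA, and similar optimizers, and replace the nearest-neighbor fitness with, for example, a trained autoencoder's reconstruction error.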

B. EXPERIMENTAL DESIGNS
This subsection addresses RQ2, which identifies the experimental designs used for imputation. There are three aspects to consider when dealing with missing data: dataset characteristics, missing mechanisms, and missing rates.

1) DATASET CHARACTERISTICS
As shown in TABLE 5, 88% of the datasets are publicly available, while 16 datasets are real-world datasets from industry or agency sources. The findings revealed that the UCI Machine Learning Repository was the most often used source over the last ten years, followed by OpenML and Keel. Of all the UCI datasets used, the iris, forest fires, Pima Indian, and wine datasets are the most common. However, these popular datasets are small-scale, containing fewer than 15 feature dimensions and fewer than 800 instances.

2) MISSING MECHANISMS
From the findings, missingness can be grouped into two categories: real missing and synthetic missing datasets. A real missing dataset contains the original missing values and does not include any synthetic or artificial missingness. A synthetic missing dataset contains artificial missing ratios that have been inserted into the dataset according to the missing mechanisms. Nearly 79.2% (38/48) of the studies in the last decade used synthetic datasets to evaluate imputation performance, while only 8.3% (4/48) used real missing datasets, and six studies did not clarify the missingness. Among the synthetic missing datasets, the MAR missing mechanism is the most popular, with 13 studies accounting for 27.1% (13/48), followed by MCAR (20.8%, 10/48), MCAR+MAR (12.5%, 6/48), and MCAR+MAR+MNAR (10.4%, 5/48). The least attention is paid to the MNAR missing mechanism, as dealing with MNAR is complex and challenging [67]. The distribution of studies based on the missing mechanism is depicted in Figure 5.
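The distinction between the mechanisms is easiest to see in how synthetic missingness is injected. The minimal sketch below removes values completely at random (MCAR) versus conditionally on another observed feature (MAR); it is an illustration rather than a standard routine, and the column choices, threshold, and toy data are arbitrary assumptions.

```python
import random

def inject_mcar(data, col, rate, seed=0):
    """MCAR: every value in `col` is removed with the same probability,
    independent of any observed or unobserved values."""
    rng = random.Random(seed)
    return [[None if (j == col and rng.random() < rate) else v
             for j, v in enumerate(row)] for row in data]

def inject_mar(data, col, cond_col, threshold, rate, seed=0):
    """MAR: values in `col` are removed with probability `rate` only when
    the *observed* value in `cond_col` exceeds `threshold`."""
    rng = random.Random(seed)
    out = []
    for row in data:
        row = list(row)
        if row[cond_col] > threshold and rng.random() < rate:
            row[col] = None
        out.append(row)
    return out

# Toy dataset: column 0 is age, column 1 a measurement to corrupt.
data = [[age, 120 + age % 40] for age in range(20, 70)]
mcar = inject_mcar(data, col=1, rate=0.2)
mar = inject_mar(data, col=1, cond_col=0, threshold=50, rate=0.5)
```

Under MNAR, by contrast, the removal probability would depend on the (unobserved) value being removed itself, which is why MNAR is the hardest mechanism to simulate and to correct for.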

3) MISSING RATES
The missing rates used in the experiments can be divided into three categories: missing rates <= 30%, missing rates between 30% and 50%, and missing rates > 50%. Figure 6 shows the distribution of studies based on the missing rates. According to the findings, the <= 30% category is the most frequently used in the studies (45.8%), followed by the 30%-50% category (25%). However, nearly 14.6% of the studies did not reveal the missing rates used in their experiments. The works in [24], [25], for example, used the Framingham heart dataset with real missing values, but the authors did not disclose the proportion of missing values in the dataset. The missing rates greater than 50% category received the least attention, accounting for 14.6% (7/48) of the studies. The detailed metaheuristic techniques for dealing with high missing rates are presented in TABLE 6. The techniques include ACO clustering for imputing a gene expression database [34], GA imputation for infilling a missing multi-attribute dataset [17], the MOGA-II proposal for estimating missing data patterns in classification [21], data imputation of spatio-temporal underground water [47], DE clustering and feature selection with incomplete data [48], the GP+tree vector imputer model for instance selection and symbolic regression on incomplete data [52], and the SC-FDO-based MLP trainer for missing rainfall time series imputation [60]. Among these techniques, six out of seven studies employed small-scale datasets of less than 10,000 instances. In general, the proposed approaches produced comparable results for the MVI tasks. Moreover, most high missing rate studies investigated MVI assuming that the missing data mechanism is MCAR or MAR.
On the other hand, the work in [60] utilized a large-scale dataset (over 10,000 instances) to fill in gaps in missing rainfall data. To sum up, MVI studies need to be addressed from the three aspects illustrated in Figure 7. Therefore, future researchers need to identify the study's dataset characteristics, missing mechanisms, and missing rates.
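For completeness, the missing rate of a dataset and its bucket under the three-way grouping used in this review can be computed as follows (a minimal sketch that treats None as the missing marker):

```python
def missing_rate(data):
    """Fraction of cells that are missing (None) in a 2-D dataset."""
    cells = [v for row in data for v in row]
    return sum(v is None for v in cells) / len(cells)

def rate_category(rate):
    """Bucket a missing rate into the three categories used in this review."""
    if rate <= 0.30:
        return "<= 30%"
    if rate <= 0.50:
        return "30% - 50%"
    return "> 50%"

# Toy dataset with 3 missing cells out of 12, i.e. a 25% missing rate.
data = [[1, None, 3], [4, 5, None], [7, 8, 9], [None, 11, 12]]
r = missing_rate(data)
```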

C. EVALUATION METRICS
To answer RQ3, this subsection identifies the most often used metrics for evaluating the MVI's performances.
As illustrated in Figure 8, the nine most frequently used metrics for evaluating the performance of MVI were the root mean square error (RMSE), accuracy, correlation coefficient (R), mean square error (MSE), mean absolute error (MAE), error, mean absolute percentage error (MAPE), relative accuracy (RA), and specificity. 70.3% of the selected studies used these metrics; metrics used only rarely were categorized as 'Others'. The RMSE is the most frequently used metric for evaluating imputation performance, mainly to measure the differences between the predicted and actual variables. For example, the works in [19], [20], [33], [36], [45], [46], [59], [60], to name a few, implemented this metric to determine how concentrated the predicted time series variables would be around the line of the actual variables. This metric is widely reported in the time series imputation literature, such as for missing rainfall, groundwater level, traffic volume, byproduct gas flow, and radiation dose rate data. Nagarajan and Dhinesh Babu [31] also used this metric to measure imputation performance on missing health datasets.
The R metric assesses the linear correlation between predicted and actual values; a higher R-value implies better imputation performance. The works of [19], [36], [54], [58]-[60] used R to assess the correlation and association of the predicted and actual values for infilling missing values in river basin, weather, traffic volume, and forest fire datasets. MSE is another metric for assessing the mean squared difference between predicted and actual values. For example, Garg et al. [63] compared their proposed DL-GSA imputation with the works in [61] and [62] in terms of R and MSE; the results revealed that DL-GSA produced stronger correlations and lower MSE than the works in [61] and [62]. Some researchers also adopted MAE to measure the proposed imputation methods in terms of the average magnitude of the errors for continuous variables [53], [58]-[60].
Nevertheless, a shortcoming of the MAE metric is that it does not consider the direction of the error. Despite this, Willmott [68] suggested that comparisons of average model performance error should use MAE because it is a natural measure of average error magnitude. In some instances, MAPE is essential to assess the prediction accuracy of imputation models. Zhang [57] used MAPE to evaluate imputation results on missing spatiotemporal data. Concerning MAPE, the PSO-ECM+AAELM imputer [44] outperformed the PSO+covariance matrix imputer [32] on all 12 datasets: autompg, body fat, Boston housing, forest fires, iris, Pima Indian, Spanish, spectf, Turkish, UK bankruptcy, UK credit, and wine.
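The metrics discussed above have straightforward definitions. The following self-contained Python sketch computes RMSE, MAE, MAPE, and R for a toy pair of actual and imputed series; the example values are illustrative only.

```python
import math

def rmse(actual, pred):
    """Root mean square error between actual and imputed values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def mae(actual, pred):
    """Mean absolute error: average magnitude of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    """Mean absolute percentage error; undefined when an actual value is zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def pearson_r(actual, pred):
    """Pearson correlation coefficient between actual and imputed values."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(pred) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, pred))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (sa * sp)

# Toy example: each imputed value misses the actual value by 0.5.
actual = [2.0, 4.0, 6.0, 8.0]
imputed = [2.5, 3.5, 6.5, 7.5]
```

MSE is simply the square of RMSE, so it is omitted for brevity.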

IV. DISCUSSION
This section discusses the research trends and potential opportunities in the metaheuristic approach for handling and optimizing MVI.

A. THE MVI APPROACHES
In reference to the RQs, which attempt to identify the existing metaheuristic techniques used for handling and optimizing MVI, most techniques used to handle missing values were hybrids of metaheuristics with clustering or neural networks. Each hybrid has characteristics that make it a good fit for a particular problem. For example, the hybrids of deep autoencoders and metaheuristics provide good results in imputing high-dimensional handwritten digits. In particular, the DL-GSA [63] imputer model was faster and more accurate than DL-CS [61] and DL-BAT [62]. However, the computational times of the hybrid MLP and metaheuristic approaches (GA+MLP, SA+MLP, and PSO+MLP) [58] were relatively shorter than those of DL-GSA, DL-BAT, and DL-CS. On the other hand, the work in [59] indicated that the hybrid of a function-fitting neural network and a metaheuristic (SC-FITNET) yielded more accurate estimates than the LSTM imputer model for missing rainfall data when the correlation coefficient, mean absolute error, and root mean square error were taken into account. Therefore, selecting the imputer model best suited to the incomplete dataset is essential. Additionally, the hybridization of state-of-the-art metaheuristics and neural networks could be of interest to researchers, providing new studies.

B. FINE TUNING HYPERPARAMETER
Typically, researchers perform a series of studies to fine-tune the parameters of imputer models, which requires considerable effort. For instance, metaheuristic parameters [24], [26] such as the population size and iteration count require fine-tuning; parameters in neural network models [61]-[63] include the number of hidden layers and the number of neurons per hidden layer; and parameters in clustering [39] such as the fuzzification parameter, the number of clusters, and the number of nearest neighbors all require fine-tuning.
Consequently, several studies [69]-[71] have investigated automatic parameter tuning methods to optimize algorithm performance. However, there is no universally accepted guideline for selecting the optimal set of parameters to achieve the best performance. Therefore, future research could consider a semi-automatic or automatic parameter tuning approach for a given context and domain.
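As a hedged illustration of one such semi-automatic approach, the sketch below applies plain random search over a small hyperparameter grid. The parameter names (`pop_size`, `iters`) and the placeholder objective are assumptions standing in for an actual imputer's validation error; any tuning method (grid search, Bayesian optimization, racing) could replace the sampler.

```python
import random

def random_search(evaluate, space, trials=20, seed=1):
    """Sample hyperparameter settings at random and keep the best one.
    `space` maps each parameter name to a list of candidate values."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(params)  # e.g. imputation RMSE on a validation split
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Placeholder objective: stands in for "train imputer, return validation RMSE".
def fake_rmse(params):
    return abs(params["pop_size"] - 40) / 100 + abs(params["iters"] - 200) / 1000

space = {"pop_size": [10, 20, 40, 80], "iters": [50, 100, 200, 400]}
best, score = random_search(fake_rmse, space, trials=30)
```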

C. THE DATASETS
The most often used databases cover various domains; however, they are not large-scale, such as the iris, forest fires, Pima Indian, and wine datasets. In contrast, large-scale datasets (over 10,000 instances) were used in the works of [26], [27] (discrete and continuous data types) and [59], [60] (continuous data type). Handwriting datasets with high dimensionality and scale were also used in [61]-[63].
Dataset scale (the number of instances) influences imputation performance. Data resources with few instances may cause imputed values to be underestimated or overestimated. Therefore, researchers must expand the size of the databases, as small-scale datasets can lead to biases and a lack of generalization. At the same time, training on a large-scale, high-dimensional dataset is difficult due to computational complexity; neural network and deep learning algorithms, for example, need large amounts of data to improve accuracy. Hence, dimensionality reduction approaches can help reduce computational costs and improve imputation accuracy.
On the other hand, imputation models [52] built on a relatively small number of instances (<300) or a large number of instances (>8191) were ineffective and inaccurate. For this reason, researchers need to comprehend the requirements of both the problem and solution domains before proposing an imputer model. Furthermore, less attention has been paid to real-world datasets from industries or agencies. Therefore, real-world datasets from industries or agencies with larger scales (over 10,000 instances) and higher dimensions might be areas worth exploring for future researchers.

D. THE MISSING MECHANISMS
The approaches for handling incomplete data are associated with the missing mechanisms. MAR and MCAR are the two mechanisms most frequently used for evaluating imputation performance, while the MNAR missing mechanism receives the least attention.
Domain-based imputation approaches are developed to deal with the problem of incomplete data. In medical datasets, it is unlikely that a feature is missing for all patients; in real-life cases, a feature may be missing for certain patients [50], [53], [72]. The occurrence of the missingness pattern depends on the observed values of other features in the dataset. For example, the salary feature is likely to be missing for professional patients, and the number-of-cigarettes feature for young patients. In this case, the MAR and MNAR missing mechanisms are appropriate for evaluating imputation performance on incomplete medical datasets. In weather datasets, by contrast, a particular feature such as rainfall [59] may be missing for all days on which a hardware failure occurs at a specific gauging station, and the missing rainfall feature of one gauging station does not influence the other gauging stations. For this reason, the MCAR missing mechanism is appropriate for evaluating incomplete rainfall datasets. Therefore, domain-based imputation approaches and the missing mechanism for a given context should be investigated further to improve the adaptability and accuracy of imputation models.
E. THE MISSING RATES
The findings also indicated that synthetic datasets with missing rates of less than 30% are the most frequently used for experimentation (45.8% of the studies), while only 14.6% of the studies considered missing rates greater than 50%. However, missing rates can exceed 50% in real-world problems. Therefore, this SLR suggests designing MVI methods that can deal with both low and high missingness, for example, missing rates of 10%-90%. These findings agree with other work [10] that imputation studies with larger missing rates would be more practical.

F. THREATS TO VALIDITY
Four potential threats to validity should be considered to support the findings of this SLR: construct, internal, external, and conclusion validity. To achieve maximum construct validity, we conducted this literature review following Kitchenham's guidelines [13] and performed analyses in response to the research questions, quality assessment, and inclusion and exclusion criteria. However, the relevance of the various terms associated with missingness could constrain our findings. We attempt to maximize internal validity by applying all missing-related terms associated with imputation techniques and datasets, as described in Table 4 and TABLE 5. Additionally, we seek to maximize internal validity by employing an exhaustive manual and automated search strategy to ensure the paper selection was unbiased. Further, external validity considers whether our findings can be generalized to other studies. In this study, we emphasize MVI designs and methods of metaheuristic techniques exclusively, leaving other paradigms for future research.
Finally, data extraction was carried out to ensure the conclusion's validity by adhering to the review protocols, including the research questions, quality assessment, inclusion criteria, search strategy, and study selection [15]. Other review protocols could be used, which may increase or decrease research bias and lead to different findings.

V. CHALLENGES IN IMPLEMENTING MISSING VALUE IMPUTATION DESIGNS AND METHODS
There will be challenges with any new research method, especially in identifying the appropriate approaches for a wide range of research questions and experimental designs. Careful planning and consideration are required to reduce the impact of missing values and improve data quality. The following section discusses some roadblocks to implementing MVI and tentative guidelines.
A. IMPUTATION PERFORMANCES AND COMPUTATIONAL COST
One of the significant MVI challenges is the expensive computational time, especially with large-scale and high-dimensional datasets. Data normalization, feature selection, or feature extraction can be employed to reduce the computational cost. For example, [48] demonstrated that feature selection significantly reduced the computational time of imputation while improving imputation and classification accuracy.

B. UNPLANNED MISSING VALUE
In some studies, data with missing values were removed [73]-[75]. The works in [3], [76], [77] also removed missing time-series data from experiments. However, it is important to note how the authors dealt with record continuity, because accurate forecasting relies on continuous time-series records. Furthermore, Hussain et al. [78] reported that the large number of missing entries made it challenging to accurately impute electric power consumption data: only 60.11% of the consumers, those with fewer than 200 null entries, were considered for MVI, whereas 39.89% of the customer records were removed from the experiment. However, removing missing values from observations reduces sample representativeness. The effects of unplanned missing values can induce biases in parameter estimates and uncertainty, which can be mitigated by adopting an effective MVI procedure and design plan.

C. OPTIMAL MISSING VALUE IMPUTATION APPROACHES
This study also revealed that there is no definitive answer to date on which method is best for all types of missingness. The adoption of MVI approaches depends on many factors: data characteristics, missingness mechanisms, the proportion of missing values, dependent and independent variables, dataset volume, computational time, and domain applications. Consequently, the existing reports of MVI studies are of great worth in assisting future researchers to develop an effective MVI strategy. However, 14.6% of the studies did not report missing rates, whereas 20.8% (10/48) did not clarify the missingness mechanism used. This information is a valuable factor when planning the experimental design of MVI. Therefore, an overview of the recommended guidelines for addressing, managing, and reporting MVI studies is outlined in Figure 9.
The MVI strategic planning process begins with the collection of incomplete datasets. It is crucial to identify the three main aspects of an incomplete dataset: its characteristics, missing mechanisms, and missing rates. The next step is the selection of the MVI approach. A clear justification of the chosen strategy, its potential impact, and its computational cost is crucial to the success of the MVI method; without a clear direction, the MVI strategy may stall or even fail. Data normalization, feature selection, or feature extraction could be considered to improve the performance of the MVI approach.
Researchers can then use complete or incomplete training datasets to construct optimal imputer models. The incomplete dataset can be a real missing or synthetic missing dataset. The training and testing dataset design, variables with missing data, missing rates, missing mechanisms, and dataset characteristics should be thoroughly reported. Researchers should train the imputer models on one dataset and test them on another to verify the robustness of the proposed models. A set of performance metrics is used to measure the effectiveness and efficiency of the MVI method; the commonly used metrics are RMSE, accuracy, R, MSE, and MAE. Statistical analyses such as the Wilcoxon signed-rank test [21], [32], the Wilcoxon rank-sum test [37], and the Friedman test [21] can be performed to assess the significance of the proposed MVI approach. Finally, we suggest that researchers report in detail the three factors affecting MVI (dataset characteristics, missing mechanisms, and missing rates), the training and testing procedures, the measurement metrics, and the findings of the studies. Additionally, the reporting could be coupled with a discussion of the impact and challenges of the MVI, which will increase overall confidence in the study.
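As an illustration of the significance testing mentioned above, the following minimal sketch computes the Wilcoxon signed-rank statistic with its large-sample normal approximation. It assumes no zero differences and no tied absolute differences; real analyses would typically use a library implementation such as scipy.stats.wilcoxon, and the paired error values below are invented for demonstration.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Wilcoxon signed-rank statistic with a normal approximation.
    Minimal sketch: assumes no zero differences and no tied |differences|."""
    diffs = [a - b for a, b in zip(x, y)]
    ranked = sorted(diffs, key=abs)  # rank differences by absolute magnitude
    w_plus = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
    w_minus = sum(rank for rank, d in enumerate(ranked, start=1) if d < 0)
    w = min(w_plus, w_minus)
    n = len(diffs)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd  # compare |z| against a normal critical value
    return w, z

# Paired errors of two imputers on 6 datasets (imputer A consistently lower).
a = [0.10, 0.12, 0.08, 0.15, 0.11, 0.09]
b = [0.14, 0.18, 0.13, 0.16, 0.19, 0.12]
w, z = wilcoxon_signed_rank(a, b)  # w == 0: every difference favours A
```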
Planned MVI procedures and strategies can raise statistical power and improve model convergence compared to a complete case analysis [79]. Preparing for missing values before starting an experiment can also help avoid the problems of nonrandom missing data, which lead to significant bias and invalid statistical inferences [2], [80]. Furthermore, researchers can use the planned MVI design in conjunction with missing data procedures to increase the quality and scope of the study while lowering research costs.

VI. CONCLUSION
In recent years, MVI for incomplete datasets has grown in popularity as a means to improve data quality and statistical power and to reduce bias in data science applications. In this study, we conducted an SLR to examine the existing metaheuristic techniques used for handling and optimizing missing value imputation over the last ten years. This study concentrated on three major scientific databases: IEEExplore, ScienceDirect, and Scopus. The findings of this SLR revealed that hybridizations of metaheuristics with clustering or neural networks are the most used MVI approaches, and the review indicates that hybrid metaheuristics are a promising field of study for solving various imputation problems. Additionally, we discovered that the synthetic missing dataset is the most frequently used incomplete dataset for evaluation, and RMSE is the most used metric for evaluating the performance of proposed MVI methods. When handling missing data, the three aspects to consider are the dataset characteristics, missing mechanisms, and missing rates. This review also addresses MVI perspectives, challenges, and opportunities. Optimal imputer approaches driven by domain knowledge should be investigated further. However, designing a planned MVI design and method to expand the quality and scope of a study remains a significant challenge. Therefore, this literature provides an overview of recommended guidelines for planning MVI designs and methods, which serve as a toolkit for developing an effective MVI strategy.