Introduction
Missing information is one of the most common issues in real-world data. It impacts analysis and may lead to biased inferences. Missing data occur in many applications, such as medical diagnosis [1], [2], [3], business [4], traffic monitoring [5], speech recognition [6], and telecommunication and computer networks [7]. Missing values arise for a variety of reasons, such as sample losses during data preparation, non-response (e.g., a person refusing to answer a survey question such as smoking (Yes/No)), errors during data collection, or human error. Multiple methods have been proposed to handle missing values, the simplest being to remove records that contain them. However, this reduces the number of samples and may discard important information. Another technique is to impute missing values using mean, mode, or hot-deck [8] imputation; however, such methods disregard the relationships between the attributes. Furthermore, more sophisticated imputation techniques are required when more than 5% of the data are missing [9]. Model-based techniques, such as the expectation-maximization (EM) algorithm, which iteratively finds maximum likelihood estimates, provide better missing data imputation. However, a major limitation of the EM algorithm is that it does not provide standard errors as an automatic part of the process [8].
Machine learning (ML)-based algorithms are used to predict missing values based on the information available in the dataset [1], [10], [11]. ML algorithms such as multi-layer perceptrons, K-nearest neighbors, and support vector machines use the information in the available complete data to estimate the missing values with high precision. More recently, deep learning-based algorithms have been proposed to automatically learn complex inter-variable associations and estimate more realistic values [12]. One such algorithm is the denoising autoencoder (DAE) [13], which is designed to reconstruct an original input from noisy data. The DAE consists of an encoder and a decoder, where the dataset with missing entries serves as the input to the encoder and the missing entries serve as the noise. The decoder reconstructs the complete dataset by predicting the missing values. Details of the DAE can be found in Section III. However, the DAE requires complete data, and initial guesses are needed before imputation [14], [15]; for example, Lall et al. [15] used 0 as the initialization.
Tabular data can be grouped into three categories: numerical data, categorical (nominal) data, and mixed data. An example of a mixed dataset is shown in Table 1, in which age, weight, height, and Body Mass Index (BMI) are numerical, while smoking, education, and the diagnosis of diabetes are categorical (nominal). Table 1 also shows that some of the entries are missing (they remain empty with no response). Mixed datasets are difficult to handle because of the problem of measuring the similarity between two points with both categorical and numerical attributes [16], [17]. Imputation of numerical values is easier, and multiple algorithms have been proposed with great success [4], [14], [19], [20], [21], [22], [23], [24], [25]. However, these algorithms cannot handle most real datasets, such as genomic, industrial, and equipment maintenance data, because these are often mixed [26]. Imputing mixed datasets is challenging in many ways, including measuring the relationships between categorical and numerical instances, building hybrid estimators for numerical and categorical data, and handling the attributes of purely categorical or numerical datasets [26], [27]. Furthermore, it is difficult to define generalized similarity measures between features in mixed datasets [16], [28], [29]. Missingness in data occurs for various reasons, which can be categorized into three missingness mechanisms [5], [18]. The first is missing at random (MAR), in which the missingness depends on the observed data. In the missing not at random (MNAR) mechanism, the missingness depends on the missing attributes themselves. In the missing completely at random (MCAR) mechanism, the missingness does not depend on any missing or observed attributes, i.e., it cannot be explained by the data.
Deep learning-based algorithms usually require sufficient training data, which is one of the primary challenges for such models. Although a large amount of data is generated in everyday life, such as in hospitals, retail, education, transportation, and energy, privacy and copyright issues allow only a small proportion of these datasets to be publicly available [30]. In addition, publicly available datasets usually contain a limited number of samples with missing values. Furthermore, obtaining a large amount of data is difficult because it relies on humans as the gold standard and may require verification with multiple sources, which is expensive, time-consuming, and subjective [31]. Alternatively, data oversampling can be used, where artificial samples are created from the original training data to enlarge the training dataset. Adding synthetic samples to the data increases generalization and avoids overfitting [32]. Data augmentation is common for images [31], [32], [33], [34], [35], [36]; however, it has received less attention for tabular data [37], [38]. For images, artificial samples can be created by rotating, flipping, zooming, shearing, and adding noise. Generative adversarial networks (GANs), introduced by Goodfellow et al. [39], are a more sophisticated method to create realistic synthetic samples. A GAN consists of two neural networks: the generator and the discriminator. The generator synthesizes artificial samples with the required variations from the input data distribution, while the discriminator differentiates between samples created by the generator and those drawn from the input data. Multiple modified versions of the GAN have been introduced to synthesize new image and tabular data. This paper uses two versions (TGAN [40] and CTGAN [41]) to oversample tabular data.
The major contributions of this paper are summarized as follows:
We propose a novel approach that augments the training data with synthetic samples, generated by two different GANs, to improve the imputation performance on tabular data.
We provide extensive experiments on four mixed datasets, with each dataset divided into training-testing sets of 80-20, 60-40, 40-60, and 20-80 and with increasing missing ratios of 10%, 30%, 50%, and 70%.
We provide a detailed performance evaluation for three missing data imputation methods using mixed datasets.
The rest of the paper is organized as follows. Section II provides an overview of related works. The methodology of the proposed work is described in Section III. Section IV describes the experiments and presents the results. Section V provides the discussion. Finally, the conclusion is presented in Section VI.
Related Works
This section provides a brief overview of studies on missing data imputation. We first describe studies that used ML and deep learning models for missing data imputation with numerical or mixed datasets. Imputation of missing values is commonly used to reduce bias due to missingness and aids statistical and ML-based analyses [42]. The simplest imputation methods use the mean for numerical values and the mode for categorical data [43]; however, these methods ignore the relationships with the other variables and underestimate the variance [44]. ML-based techniques create a prediction model from the observed data to determine the missing values [25]. Missing value imputation for numerical datasets is common [4], [14], [19], [20], [21], [22], [23], [24], [25] and includes nearest neighbor [22], local least squares (LLS) [45], singular value decomposition (SVD) [22], and ensemble [26] approaches. Deep learning-based DAEs [14] are also used for numerical data imputation.
Imputing missing values in a mixed dataset is challenging and has received little attention. Some of the work on missing data imputation for mixed datasets is briefly described as follows. MissForest [46] uses an RF classifier for missing value imputation, while MICE [47] is used for mixed data imputation (see further details in the methodology section). Alsaber et al. [46] applied MissForest imputation to air quality data obtained from the Kuwait Environmental Public Authority from 2012–2018. Introducing missingness ratios of 5%, 10%, 20%, and 40% into the dataset, the authors showed that MissForest gave better prediction accuracy than other methods such as KNN, MICE, Amelia, and missCompare. Aleryani et al. [47] used bagging and stacking ensembles of MICE and EMB for data imputation. The evaluation was performed on 20 datasets and showed that the ensemble methods outperform individual imputation techniques.
Awawdeh et al. [48] proposed a genetic algorithm (GA)-based method for simultaneous missing data imputation and feature selection. The method divided the dataset into observed and missing data. The observed data were used to train the model, and the subset of features containing missing values was imputed with an initial value (mean or mode) and used as a test set to evaluate the trained model. The process of imputing the test set was repeated until the GA-based stopping criterion was met and the optimal subset of the imputed dataset was obtained. The authors evaluated their method on 12 different datasets and compared it with other imputation techniques.
Several studies have used deep learning-based algorithms for missing data imputation on both mixed and numerical data. Lall et al. [15] proposed a deep learning-based multiple imputation method using DAEs for mixed datasets. Missing values in the dataset were treated as noise, and the DAE reconstructed them. The authors first replaced the missing entries with 0s. The DAE was then trained with loss functions for the categorical and numerical variables to minimize the reconstruction errors. Finally, the imputed dataset was obtained by predicting the missing values (which were initially replaced with 0). The authors used simulated data, the Adult dataset [49], and a Cooperative Congressional Election Study dataset [50], and showed that the proposed method achieves better efficiency and accuracy than other works. Similar work was done by Gondara and Wang [12], who proposed a DAE for missing value prediction using 15 publicly available datasets. The authors compared their method with MICE imputation and showed that the DAE was better.
Deep learning-based GANs have also received attention for missing data imputation, such as the work by Yoon et al., who proposed Generative Adversarial Imputation Nets (GAIN). GAIN uses the GAN architecture with an additional hint vector provided to the discriminator, which ensures that the generator creates synthetic data following the original input data distribution. The authors showed that GAIN achieves better performance than other imputation methods on four datasets. Although GAIN achieved promising performance, it ignores category information, which identifies the relationships between samples. To address this, Wang et al. proposed the pseudo-label conditional GAIN (PC-GAIN) [51] to consider category information and improve the imputation. The authors first imputed the samples with low missing rates using a GAN, followed by clustering to obtain pseudo-labels. A classifier was then pre-trained on the imputed dataset and the corresponding pseudo-labels. Finally, the full dataset was imputed using a GAN whose generator used the pre-trained classifier. The results showed that the proposed method achieves better performance than other imputation techniques. Awan et al. [42] proposed conditional GAIN (CGAIN), which considers the class-specific distributions for missing data imputation, especially in imbalanced datasets, and showed that it achieves better performance than other algorithms. Although GAN-based methods have been used for missing data imputation, no work has studied the effect of the amount of training data on imputation performance. Therefore, we propose a GAN-based approach that increases the amount of training data by adding synthetic samples to improve missing value predictions on the testing data. The details of the proposed methodology are described below.
Proposed Methodology
This section presents the proposed methodology. We first formulate the mixed data imputation problem and propose an algorithm to handle the challenge of imputing missing values in a mixed dataset. We then explain the algorithm, starting with the oversampling of training data using GANs. Finally, we briefly describe the imputation techniques for mixed datasets.
Imputing missing values is important to reduce bias due to missingness [52]. Consider a mixed dataset $X$ with $n$ samples and $d$ numerical and categorical attributes, in which a subset of the entries is missing.
Moreover, many algorithms are available to impute numerical values, but mixed (numerical and categorical) data imputation is challenging due to the complex joint data distribution [53]. In real applications, datasets are mixed and contain missing values for various reasons. Therefore, the objective of this study is to impute mixed datasets using model-based approaches. The proposed methodology is presented in Algorithm 1. First, the dataset is divided into training and testing sets; the GANs are trained on the training set to generate synthetic samples, which are combined with the original training data before the imputation models are trained and applied to the testing set.
Proposed methodology: the dataset is divided into training and testing sets, the GANs generate synthetic samples from the training set, and the combined data are used to train the imputation models.
Algorithm 1: Mixed Data Missing Data Imputation Using GANs
Input: the mixed dataset $X$ containing missing values
Output: the imputed testing set $\hat{X}_{test}$
1) Divide $X$ into a training set $X_{train}$ and a testing set $X_{test}$.
2) Use the GANs to generate synthetic samples $X_{syn}$ from $X_{train}$.
3) Use an oversampling ratio (25%, 50%, 75%, or 100% of $|X_{train}|$) to set the size of $X_{syn}$.
4) Combine $X_{train}$ and $X_{syn}$ to form the oversampled training set.
5) Train the imputation algorithms on the oversampled training set.
6) Impute the missing values in $X_{test}$.
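For concreteness, the steps above can be sketched in Python as follows. This is a minimal illustration, not the exact implementation used in our experiments: the `gan` object is assumed to expose `fit`/`sample` (as the open-source ctgan package does), and the `imputer` is assumed to follow the scikit-learn `fit`/`transform` convention.

```python
# Minimal sketch of Algorithm 1 (illustrative; object interfaces are assumptions).
import pandas as pd
from sklearn.model_selection import train_test_split

def impute_with_oversampling(X, gan, imputer, oversample_ratio=0.5, test_size=0.2):
    # Step 1: divide the dataset into training and testing sets.
    X_train, X_test = train_test_split(X, test_size=test_size, random_state=0)

    # Step 2: fit the GAN on the training data.
    gan.fit(X_train)

    # Step 3: generate synthetic samples at the chosen oversampling ratio.
    X_syn = gan.sample(int(len(X_train) * oversample_ratio))

    # Step 4: combine the original and synthetic training data.
    X_aug = pd.concat([X_train, X_syn], ignore_index=True)

    # Steps 5-6: train the imputation model on the oversampled training
    # set and impute the missing entries in the testing set.
    imputer.fit(X_aug)
    return imputer.transform(X_test)
```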
A. GANs for Tabular Data
We adopted two different versions of the GANs (tabular [40] and conditional [41]) for mixed data synthesis. Each GAN is briefly described below.
1) Tabular GAN
Generating tabular data is more challenging than generating images because tables contain various types of data, such as categorical, numerical, text, time, and cross-table references [40]. To generate synthetic data using TGAN, consider a table $T$ containing numerical and categorical columns. Each numerical column is preprocessed with a Gaussian mixture model (GMM) to perform mode-specific normalization, representing each value as a normalized scalar together with a vector of mode probabilities, while each categorical column is converted into a one-hot vector with added noise.
After preprocessing, the combined vectors for the categorical and numerical columns are produced by the generator as output and fed to the discriminator as input. The generator in TGAN is a long short-term memory (LSTM) network, while the discriminator is a multi-layer perceptron (MLP). One disadvantage of TGAN is that it requires a relatively long time to train [31], [62].
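As an illustration of the mode-specific normalization, the following sketch approximates it with scikit-learn's GaussianMixture for a single numerical column; the constants (number of modes, clipping range) follow common practice, and the TGAN authors' implementation differs in details.

```python
# Approximate sketch of GMM-based mode-specific normalization (one column).
import numpy as np
from sklearn.mixture import GaussianMixture

def mode_specific_normalize(column, n_modes=5):
    values = np.asarray(column, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_modes, random_state=0).fit(values)
    probs = gmm.predict_proba(values)          # probability of each mode
    modes = probs.argmax(axis=1)               # most likely mode per value
    means = gmm.means_.ravel()[modes]
    stds = np.sqrt(gmm.covariances_.ravel()[modes])
    # Represent each value as a clipped scalar relative to its mode,
    # plus the vector of mode probabilities.
    scalars = np.clip((values.ravel() - means) / (4 * stds), -0.99, 0.99)
    return scalars, probs
```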
2) Conditional Tabular GAN
The CTGAN builds on PacGAN [63] and uses a conditional generator and a training-by-sampling approach for synthetic tabular data. Unlike TGAN, which uses a GMM, the authors adopt a variational Gaussian mixture model to automatically estimate the number of modes for each numerical variable. Furthermore, a conditional generator is introduced to handle category imbalance in the categorical variables, so that the categories are sampled evenly during training. The CTGAN introduces three key elements: the conditional vector, the generator loss, and the training-by-sampling method. First, a conditional vector is introduced to select an entry from a categorical column. For example, let $D_1=\{1,2,3\}$ and $D_2=\{1,2\}$ be two categorical columns; the condition $D_2=1$ is expressed as the concatenated mask $[0,0,0] \oplus [1,0]$. The generator loss adds a penalty when the generated sample does not match the given condition, and training-by-sampling draws conditions according to the frequency of each category so that rare categories are adequately represented.
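The conditional vector from this example can be built as in the following sketch (the helper function is ours, for illustration only):

```python
# Sketch: conditional vector for columns D1 = {1,2,3} and D2 = {1,2}.
import numpy as np

def conditional_vector(column_sizes, column, category):
    # One zero mask per categorical column, concatenated; a single 1 marks
    # the selected (column, category) pair.
    masks = [np.zeros(size) for size in column_sizes]
    masks[column][category] = 1.0
    return np.concatenate(masks)

print(conditional_vector([3, 2], column=1, category=0))
# -> [0. 0. 0. 1. 0.], i.e., the condition D2 = 1
```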
B. Data Imputation Algorithms
To evaluate our proposed methodology, we select three missing data imputation algorithms as baselines: MICE [65], MissForest [66], and DAEs [15]. We propose that augmenting the training data with GAN-generated samples improves the imputation performance of these baseline algorithms. The baseline methods are briefly described below.
1) Multivariate Imputation by Chained Equations (MICE)
The MICE algorithm uses multiple imputations to fill missing values in mixed datasets using four general steps [65]. First, the missing values are imputed with initial values (e.g., the mean) as temporary placeholders. Second, the placeholders for one variable are set back to missing while the placeholders for the other variables remain filled. Third, that variable is regressed on the other variables, e.g., via linear regression, with the variable containing missing values as the dependent variable. Fourth, the fitted regression model is used to predict its missing values. These steps are repeated iteratively for each variable with missing values. Details of the MICE algorithm can be obtained from [65].
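As an illustration, scikit-learn's IterativeImputer implements the same chained-equations idea (with the caveat that it returns a single imputation per run, whereas full MICE produces multiple imputations):

```python
# MICE-style imputation sketch with scikit-learn's IterativeImputer.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[25.0, np.nan, 70.0],
              [np.nan, 1.70, 65.0],
              [40.0, 1.80, np.nan]])

# Mean placeholders first; each variable is then regressed on the others
# in round-robin fashion for max_iter sweeps.
imputer = IterativeImputer(initial_strategy="mean", max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```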
2) MissForest
MissForest is based on the random forest algorithm for mixed data imputation. It first imputes the missing values with the mean (or mode) as an initial guess. The variables are then sorted by increasing number of missing values. The missing values of each variable are predicted by training an RF model on the observed data, and the imputation procedure is repeated iteratively until the stopping criterion is met [66].
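A MissForest-style imputer for numerical columns can be approximated by plugging a random forest into IterativeImputer, as sketched below; dedicated implementations such as missingpy's MissForest additionally handle categorical variables.

```python
# MissForest-style sketch: random forest inside the chained-equation loop.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    initial_strategy="mean",       # initial guess, as in MissForest
    imputation_order="ascending",  # variables with fewest missing values first
    max_iter=10,
    random_state=0,
)
X = np.array([[25.0, np.nan], [30.0, 1.7], [np.nan, 1.8], [35.0, 1.6]])
print(rf_imputer.fit_transform(X))
```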
3) Denoising Auto-Encoders (DAE)
Autoencoders are neural networks that learn to reproduce a copy of the input from an encoding. The input is first encoded into a lower-dimensional latent representation, which is then decoded back to an output with the same dimension and distribution as the input. To avoid learning the identity mapping, the DAE variant corrupts the input with noise and trains the network to reconstruct the clean input from the corrupted version. For imputation, the missing entries act as the noise: they are filled with an initial guess (e.g., 0), and the trained DAE predicts their values [15].
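A minimal PyTorch sketch of a DAE for tabular imputation is given below. For brevity, only the numerical (MSE) reconstruction loss is shown; the method of [15] adds a cross-entropy term for categorical variables, and the zero corruption here mimics the zero-filled initial guess.

```python
# Minimal DAE sketch for tabular data (numerical loss only).
import torch
import torch.nn as nn

class DAE(nn.Module):
    def __init__(self, n_features, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_dae(model, x_complete, corrupt_rate=0.2, epochs=200, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        # Zero out a random subset of entries: the "noise" imitating missingness.
        keep = (torch.rand_like(x_complete) > corrupt_rate).float()
        loss = loss_fn(model(x_complete * keep), x_complete)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```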
Experiments and Results
The experimental setup is represented in Figure 2. We conduct a series of experiments on four publicly available mixed datasets from the social, financial, and health domains. Each dataset is divided into training and testing sets at multiple ratios (80-20, 60-40, 40-60, and 20-80), artificial missingness is injected into the testing sets, and the imputation models are trained with and without GAN-based oversampling.
A. Datasets
Four real-world datasets from the social, financial, and health domains are used to evaluate the performance of the proposed method. The datasets are publicly available in the UCI Machine Learning Repository [30]. The statistics of the datasets are shown in Table 2.
1) Adult
The Adult dataset is a mixed dataset with categorical and numerical variables obtained from the 1994 United States Census, containing 48,842 individuals from 42 countries with 14 attributes. The prediction task is to determine whether a person earns more than $50,000 per year.
2) German Credit
The German credit dataset classifies whether a person is a good or bad credit risk based on 20 mixed attributes related to banking and financial details, comprising 13 categorical and 7 numerical variables. The 1000 instances belong to two classes: good credit means the customer is likely to repay the loan, while bad credit means the customer is not. Not approving a loan for a good customer results in a loss of business, while approving a loan for a bad customer results in financial losses for the bank.
3) Australian Credit
The Australian credit dataset provides information related to credit card approval based on 14 attributes. There is a total of 690 credit card applications (instances) with 14 attributes consisting of 8 categorical and 6 numerical.
4) Heart Cleveland
Heart Cleveland is a small medical domain dataset that contains 302 instances. There are 13 different attributes with 7 categorical and 6 numerical.
B. Training and Testing Division
To illustrate the effect of the proposed method on the considered data, each dataset is randomly divided into training and testing sets at ratios of 80-20, 60-40, 40-60, and 20-80. Five different random splits were obtained for each division to assess the generality of the proposed method.
C. Missingness
The datasets themselves contain only a small amount of missingness. Therefore, we inject artificial missingness using the MCAR mechanism, as in [14], [20], [47], [51], [67]. Missing data imputation in each testing set is evaluated at missing rates of 10%, 30%, 50%, and 70%.
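MCAR injection amounts to masking each cell independently with probability equal to the target missing rate, as in this short sketch:

```python
# Sketch of MCAR missingness injection into a pandas DataFrame.
import numpy as np
import pandas as pd

def inject_mcar(df: pd.DataFrame, missing_rate: float, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    # Each cell is masked independently of all values (the MCAR assumption).
    mask = rng.random(df.shape) < missing_rate
    return df.mask(mask)  # masked cells become NaN

# e.g., a 30% missing-rate test set: test_30 = inject_mcar(X_test, 0.30)
```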
D. GANs for Data Synthesis
We train CTGAN and TGAN on the training set of each division and use them to generate synthetic samples, which are added to the training data at oversampling ratios of 25%, 50%, 75%, and 100% of the training-set size.
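For reference, the sketch below shows how such oversampling can be done with the open-source ctgan package; the toy columns are hypothetical, and the API reflects the package's current interface rather than necessarily the version used in our experiments. TGAN usage is analogous.

```python
# Sketch of CTGAN-based oversampling (hypothetical toy data).
import pandas as pd
from ctgan import CTGAN

train_df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "bmi": [22.1, 27.5, 30.2, 26.8, 24.3, 28.9],
    "smoking": ["yes", "no", "no", "yes", "no", "yes"],
})

gan = CTGAN(epochs=10)  # far more epochs in practice
gan.fit(train_df, discrete_columns=["smoking"])

# Synthetic samples at oversampling ratios of 25-100% of the training size.
synthetic_sets = {r: gan.sample(int(len(train_df) * r))
                  for r in (0.25, 0.50, 0.75, 1.00)}
```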
E. Performance Metrics
The root mean squared error (RMSE) is used to evaluate the imputed values against the actual values. The RMSE for the numerical and categorical variables is defined as
\begin{align*} RMSE_{num} &= \sqrt{\frac{1}{\left| N_{num} \right|} \sum_{(i,j) \in N_{num}} \left( \hat{x}_{i,j} - x_{i,j} \right)^{2}} \\ RMSE_{cat} &= \sqrt{\frac{1}{\left| N_{cat} \right|} \sum_{(i,j) \in N_{cat}} 1_{\{\hat{x}_{i,j} \ne x_{i,j}\}}} \end{align*}
where $N_{num}$ and $N_{cat}$ are the sets of missing numerical and categorical entries, respectively, $x_{i,j}$ is the actual value, $\hat{x}_{i,j}$ is the imputed value, and $1_{\{\cdot\}}$ is the indicator function.
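Both metrics are computed over the injected-missing positions only, as in the following sketch:

```python
# Sketch of the two RMSE metrics over the missing positions only.
import numpy as np

def rmse_num(x_true, x_imp, miss_mask):
    diff = (x_imp - x_true)[miss_mask]
    return np.sqrt(np.mean(diff ** 2))

def rmse_cat(x_true, x_imp, miss_mask):
    # Square root of the mismatch rate, mirroring the indicator formulation.
    return np.sqrt(np.mean(x_imp[miss_mask] != x_true[miss_mask]))
```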
F. Results
To evaluate the performance of the proposed work, we report results on four datasets for multiple training-testing divisions with various missing ratios. Each experiment was repeated five times on a reshuffled dataset, and the averaged results are presented in Tables 3–16. Only the best results are presented in this paper; the detailed experimental results are provided in supplementary Tables S1–S50.
The experiments performed on the German dataset with the 80-20 training-testing division are presented in Tables S1–S12. For MICE imputation at 10% missingness, the best (smallest) RMSE for the numerical variables was achieved by CTGAN with 25% oversampled data; for the categorical variables, the best RMSE was achieved by TGAN with 50% oversampled data, followed by CTGAN with 25%. Similarly, TGAN achieved better performance for the MissForest categorical variables, while the baseline was best for the MissForest numerical variables. With DAE imputation, TGAN with a 100% oversampling ratio achieved the best performance for the numerical variables; for the categorical variables, TGAN with 25% was best, with TGAN at 100% achieving comparable performance. For the 30% missingness presented in Table S2, the baseline was best for the numerical variables with MICE and MissForest, while CTGAN was better for the categorical variables with MICE, and TGAN with MissForest. For DAE imputation, both RMSEs (numerical and categorical) were best with TGAN. For 50% missingness, CTGAN performed better for MICE and MissForest in both variable types, except for the MissForest categorical variables, for which TGAN performed better; for DAE imputation, TGAN achieved the better performance for both categorical and numerical variables, with RMSEs of 0.316 and 363.077, respectively. Similarly, at 70% missingness for the 80-20 division (Table 3), the proposed method outperformed the baselines with both GANs: with MissForest imputation, CTGAN achieved the smallest numerical RMSE, while TGAN was better for the remaining imputations. Overall, the GAN-based approach is better than the baseline methods.
The results obtained from the German dataset with the 60-40 training-testing division are shown in Table 4. For all missingness ratios, the proposed method performed better than the baselines; only at 30% missingness was the numerical RMSE of the MissForest baseline slightly better than the GAN-based method. Table 4 shows that TGAN achieved the best overall performance compared to all baselines, with CTGAN better only for numerical imputation using MICE. For the 50% missingness (Table S6), CTGAN was better for MICE imputation and for numerical imputation using DAE.
For the 20-80 data division, the results for 70% missingness are shown in Table 6. For MICE, CTGAN performed well for both numerical and categorical imputation. For DAE, TGAN achieved a lower RMSE for both categorical and numerical imputation; for MissForest, TGAN was better for categorical imputation, while the baseline achieved a better numerical RMSE. Similar performance was achieved at 50%, where only the numerical imputation with MissForest favored the baseline. At 10% and 30%, the baseline was better for numerical imputation with MICE and MissForest (see Tables S10–S12).
The experiments performed on the Australian credit dataset, shown below, illustrate that the proposed method improves performance over the baselines. For the 80-20 training-testing division, Table 7 shows that TGAN achieved the best overall results at 50% missingness compared to all baseline methods. For DAE imputation, CTGAN performed slightly better than TGAN, with a numerical RMSE of 473.41 compared to 474.07. For MICE imputation, the baseline achieved 566.69 and 0.317 for the numerical and categorical variables, respectively, while TGAN achieved 496.62 and 0.313. Similarly, for MissForest, the baseline achieved 498.84 and 0.275, while TGAN achieved 393.48 and 0.27. Finally, for DAE imputation, CTGAN achieved a numerical RMSE of 473.41 compared to 476.28 for the baseline, while TGAN achieved a categorical RMSE of 0.290 compared to 0.291 for the baseline. The results for the 10%, 30%, and 70% missingness ratios, presented in Tables S13, S14, and S15, respectively, show that incorporating GANs with the baseline methods improves the imputations.
The results obtained from the 60-40 data division of the Australian credit data show that at 70% missingness, the proposed method achieved better imputation than the baseline; for 10–50% missingness, the proposed method also imputed missing values relatively better. Table 8 shows that at 70% missingness, TGAN with an oversampling ratio of 75% achieved the best performance compared to the baseline and CTGAN. The results in Tables S16–S18 show that for 10–30% missingness, both CTGAN and TGAN outperformed the baseline methods. For 50% missingness, TGAN outperformed the baselines (except for the categorical RMSE with MissForest, where the baseline was slightly better). The results from the 40-60 and 20-80 data divisions of the Australian credit dataset, presented in Tables S19–S26, show that the proposed method did not perform as well.
The experiments performed on the heart dataset show that using GANs improves the overall prediction of missing values. For the 80-20 data division, the 10–50% missingness results in Tables S27–S29 show that imputation with the aid of TGAN improves the prediction of missing values; GAN-based imputation performs either better than or comparably to the baselines. The best results, obtained at 70% missingness, are shown in Table 9, where TGAN outperformed CTGAN and the baselines; only the MissForest baseline was slightly better, with an RMSE of 0.452 compared to 0.453 with TGAN. Better performance was achieved when the training data was enlarged with 100% TGAN-generated samples. Similar performance was achieved for the 60-40 data division: at the 70% missingness shown in Table 10, incorporating synthetic TGAN-generated data improved the imputation of missing values compared to the baseline, with only the baseline categorical RMSE remaining better than the proposed method. For the 60-40 division, the 10–50% missingness ratios also show that the proposed method achieved better performance, as depicted in Tables S30–S32.
For the 40-60 data division, the results at 10% missingness are shown in Table 11, where the proposed method better reconstructs the missing values, achieving better results for both categorical and numerical variables than the baselines. The results in Tables S33–S35 show that the proposed method achieved better or similar performance compared to the baselines. The 20-80 data division shows similar behavior, with the GANs improving the prediction performance over the baselines. For instance, at 10% missingness (Table 12), DAE imputation has baseline categorical and numerical RMSEs of 0.166 and 16.33, respectively, while TGAN achieved 0.158 and 14.57. Similarly, the results for the 30–70% missingness ratios in Tables S36–S38 suggest that incorporating GANs reduces the reconstruction error.
We also performed experiments on the large Adult dataset, which contains over 48,000 samples. The best results for the 80-20 data division at 70% missingness are presented in Table 13, which shows that CTGAN achieved the best numerical imputation performance when used with MissForest imputation, and CTGAN also performed better for categorical imputation. The results in Tables S39–S41 for 10–50% missingness also indicate that GAN-based imputation achieves a lower RMSE than the baseline. Similar performance was achieved for the 60-40 data division, in which the MissForest baseline was improved with CTGAN, the MICE baseline was improved using TGAN for numerical and CTGAN for categorical imputation, and TGAN aided categorical imputation when DAE was used as the baseline. The results for 70% missingness are shown in Table 14, and those for 10–50% missingness in Tables S42–S44, indicating that the proposed method improves the imputation performance over the baselines.
The results at 10% missingness for the 40-60 data division of the Adult dataset are presented in Table 15, where oversampling using GANs achieved better performance than the baselines. CTGAN achieved a better numerical RMSE for MissForest and a better categorical RMSE for MICE, while TGAN achieved a better categorical RMSE for MissForest and DAE imputation; for numerical imputation with MICE, the baseline performed best. Similarly, for 30–70% missingness, the proposed method achieved relatively better performance than the baseline, as shown in Tables S45–S47. Finally, for the 20-80 data division at 50% missingness (Table 16), TGAN achieved the best numerical imputation results compared to the baseline and CTGAN. For categorical imputation, TGAN was better only for DAE, CTGAN was better for MICE, and the RMSEs were similar to the baseline for MissForest. Similarly, the other missing ratios of 10%, 30%, and 70%, presented in Tables S48–S50, show that the proposed method achieved better imputation performance than the baselines.
Discussion
In real-life scenarios, most datasets consist of mixed data with frequent missing entries. In addition, the publicly available datasets are usually small, and training an ML model with a limited amount of data does not achieve promising results. Moreover, missing value imputation in mixed datasets is challenging and has received little attention. This work proposes a method to improve the prediction of missing values by training on oversampled data. The oversampled training data are obtained using two types of GANs for mixed data synthesis: TGAN and CTGAN. Through extensive experiments, we have shown that our proposed method works on various mixed datasets with different imputation algorithms.
We used four mixed datasets that are publicly available in the UCI repository [30]. Each dataset was divided into training-testing sets {80-20, 60-40, 40-60, and 20-80}. In each test set, missingness was introduced at ratios of 10–70%. Each experiment was performed five times, and the average results are presented. For the German credit dataset, the overall best performance was achieved when the missingness was 70% (see Tables 3–6), followed by 50% (see Tables S3, S6, and S9); for 10–30% missingness, the proposed method performed better than or similarly to the baselines. Similar imputation results were achieved when the proposed method was applied to the Australian credit dataset with 70% missingness (see Table 8), where the proposed method performed best. For the 80-20 and 60-40 data divisions, the proposed method performed better than or similarly to the baselines at 10–50% missingness; however, for the 40-60 and 20-80 data divisions, our algorithm did not perform as well. For the heart dataset, the proposed method achieved the best performance at all missingness ratios relative to the baselines (Tables 9–12 and S27–S38). Finally, for the Adult dataset, our proposed method outperformed the baselines in all data divisions at nearly all missingness ratios.
Through this series of experiments, we found that our method was better than the baselines at nearly all missingness ratios. The proposed method works best at 70% missingness, which suggests it is particularly applicable when the data contain a substantial number of missing entries. We also compared the baseline methods with each other; all methods used default parameters without hyperparameter optimization, and the presented results reflect this. For the German credit dataset, the baseline MICE was better for numerical imputation at 10–30% missingness, while above 50% missingness the baseline DAE reconstructed the numerical values well; categorical imputation using MissForest was better than MICE and DAE. For the Australian credit dataset, categorical imputation using MissForest was better across all missingness ratios and data divisions, and MissForest also achieved better numerical imputation for the 80-20 and 60-40 data divisions, while DAE was better for the 40-60 and 20-80 divisions.
For the heart dataset, the baseline MICE was best for numerical imputation. For categorical imputation, the DAE was best for the 80-20 and 60-40 divisions at 70% missingness, while for the 40-60 and 20-80 divisions, categorical imputation using MissForest was better than DAE and MICE. For the Adult dataset, the 80-20 and 60-40 data divisions showed that the DAE achieved better performance for both categorical and numerical imputation at 70% missingness. For the 40-60 division at 10% missingness, numerical imputation was still best with the DAE, but MICE was better for categorical imputation. For the 20-80 data division at 50% missingness, the best RMSE was achieved by MissForest imputation.
In some cases, the baseline DAE imputation was better than the baseline MissForest; however, when GAN-generated data were incorporated, MissForest became better. For instance, in the 60-40 division of the Australian dataset at a 70% missingness ratio, the baseline numerical RMSE was 993.38 for MissForest and 790.32 for DAE, showing that the baseline DAE imputation was better than MissForest. However, with our method, the RMSEs were reduced to 677.16 and 788.42 for MissForest and DAE, respectively. Similarly, for the heart 80-20 division at 10% missingness, the MissForest baseline was best for numerical imputation and the DAE baseline for categorical imputation. After applying the proposed method, the lowest numerical RMSE was 5.645 using MICE imputation, compared to 6.000 for the MissForest baseline, and MICE with TGAN achieved a lower categorical RMSE of 0.1383 compared to 0.1554 for the DAE baseline.
Comparing the imputation results of the two GANs, the German credit dataset showed that categorical imputation was better when TGAN-generated data were added to the original training data with the baseline imputation algorithms. For numerical imputation, TGAN gave the best RMSE at 70% missingness; thus, TGAN was overall better than CTGAN on the German credit dataset. The comparison of the two GANs on the Australian credit dataset also shows that TGAN performed better than CTGAN: for the 80-20 data division, CTGAN performed well only at 10% missingness, while for the 60-40 division, CTGAN was better for numerical imputation at 30% and 50% missingness and for both categorical and numerical imputation at 70% missingness. For the other data divisions, where the baseline was best, TGAN still outperformed CTGAN. For the heart dataset, TGAN gave better missing value predictions than CTGAN for all data divisions and missingness ratios. For the Adult dataset, CTGAN was better than TGAN.
We compared the effect of oversampling on data imputation for each baseline method. Figure 3 summarizes all 64 experiments using MICE imputation. TGAN-based imputation of numerical data (TGAN_num) was best in 42 experiments, followed by the MICE baseline (MICE_num) in 15 and CTGAN (CTGAN_num) in 11. For categorical imputation with MICE, CTGAN-based oversampling (CTGAN_cat) performed best in 30 experiments, followed by TGAN (TGAN_cat) in 16. The corresponding comparison for MissForest (MissF_num and MissF_cat) is presented in Figure 4, which shows that TGAN was best for categorical imputation with MissForest in 32 experiments, followed by the MissForest baseline in 28. CTGAN with MissForest did not improve on the baseline, being merely comparable to it; however, the MissForest baseline did not perform well at 70% missingness on any of the four datasets. Finally, the comparison of the DAE baseline and GAN-based oversampling is presented in Figure 5: when TGAN oversampling was followed by DAE reconstruction, TGAN (TGAN_cat) was best in 49 experiments for categorical imputation, while the baseline was better in only 12; for numerical imputation, TGAN was best in 37 experiments and the baseline in only 14. This strongly suggests that the proposed method improves the imputation performance for datasets with missing values.
Comparison of MICE-based imputation with the MICE baseline and GAN-based imputation using MICE.
Comparison of MissForest (MissF)-based imputation with the MissF baseline and GAN-based imputation using MissF.
Comparison of DAE-based imputation with the DAE baseline and GAN-based imputation using DAE.
Figure 6 presents the best result obtained in each experimental setup, i.e., the best result in each table. For example, Table 3 shows the best results obtained by TGAN for both categorical (TGAN_cat) and numerical (TGAN_num) imputation. Figure 6 shows that TGAN was best in most experiments for both categorical (best in 38 experiments) and numerical (best in 31 experiments) imputation. Among the baseline methods, MissForest imputation was best in 5 and 19 experiments for numerical and categorical imputation, respectively, MICE in 6 and 0, and DAE in 12 and 1; thus, MissForest was the strongest baseline. One reason for this is that all experiments used default hyperparameters, and MissForest does not require parameter tuning, as described in [66]. Although we did not explicitly record the running times of each experiment, we found MissForest to be computationally expensive compared to MICE and DAE; its computational complexity is also evident from several studies [mention]. We also found that CTGAN performed better when the training dataset was larger, while for smaller datasets its performance was worse than TGAN's. For the Adult dataset, CTGAN was best in 7 experiments for categorical imputation and 5 for numerical imputation, while TGAN was best in 4 experiments (for the 20-80 division only) for numerical imputation and 6 for categorical imputation. The DAE was better than the other baselines on the Adult dataset, achieving the best reconstruction error in 7 experiments for numerical imputation, while the MissForest baseline was best in 6 experiments for categorical imputation. Thus, the overall performance of CTGAN was best on the Adult dataset. The results of these detailed experiments strongly support using the proposed method for better imputation of missing values. We used two recently proposed GANs and three imputation models; however, Algorithm 1 can easily be generalized to incorporate other GANs for data synthesis as well as other baseline imputation models.
Best results in each experimental setup, where TGAN was best for both categorical and numerical imputations.
Conclusion
Many algorithms have been proposed for numerical data imputation; however, algorithms to impute mixed datasets are scarce. In addition, currently available datasets contain limited amounts of data, while model-based imputation algorithms require sufficiently large complete training datasets. Therefore, this study proposed a method using GANs (TGAN and CTGAN) to synthesize samples from the training data for better model training. We then used three different imputation models, MICE, MissForest, and DAE, to impute missing values. Several experiments with different missingness ratios showed that our proposed method improves the prediction of missing values compared to the baseline methods. Although this study provides promising improvements over the baselines, there are some limitations that we aim to address in future work. We introduced random missingness into the data using the MCAR mechanism [14], [18], [51], [67]; future work will consider different missingness mechanisms, such as MAR and MNAR [18]. This study used only mixed datasets because they are difficult to handle and contain both numerical and categorical data; to further demonstrate the effectiveness of our model, future work will use purely categorical and purely numerical datasets. We also aim to generalize Algorithm 1 to multiple GANs and baseline imputation models. Each model in this study was used with default parameters and no hyperparameter optimization, which will be addressed in the future. Finally, we aim to use category information [51] to further improve the imputation performance.
Appendix
The supplementary materials are provided with the paper.