Skip to Main Content
Missing value imputation is an actual yet challenging issue confronted by machine learning and data mining. Existing missing value imputation is a procedure that replaces the missing values in a dataset by some plausible values. The plausible values are generally generated from the dataset using a deterministic, or random method. In this paper we propose a new and efficient missing value imputation based on data clustering, called CRI (clustering-based random imputation). In our approach, we fill up the missing values of an instance with those plausible values that are generated from the data similar to this instance using a kernel-based random method. Specifically, we first divide the dataset (exclude instances with missing values) into clusters. And then each of those instances with missing-values is assigned to a cluster most similar to it. Finally, missing values of an instance A are thus patched up with those plausible values that are generated using a kernel-based method to those instances from A's cluster. Our experiments (some of them are with the decision tree induction system C 5.0) have proved the effectiveness of our proposed method in missing value imputation task.