A Label Propagation Approach for Missing Data Imputation
The methodology was divided into four stages: (A) data collection, complete datasets selected for experiments; (B) missing data generation (amputation), introducing synth...

Abstract:

Missing data is a common challenge in real-world datasets and can arise for various reasons. This has led to the classification of missing data mechanisms as missing completely at random, missing at random, or missing not at random. Currently, the literature offers various algorithms for imputing missing data, each with advantages tailored to specific mechanisms and levels of missingness. This paper introduces Label Propagation for Missing Data Imputation (LPMD), a novel approach to missing data imputation based on the well-established label propagation algorithm. The method combines, weighs, and propagates known feature values to impute missing data. Experiments on benchmark datasets highlight its effectiveness across various missing data scenarios, demonstrating more stable results than baseline methods under different missingness mechanisms and levels. The algorithms were evaluated on processing time, imputation quality (measured by mean absolute error), and impact on classification performance. A variant of the algorithm (LPMD2) generally achieved the fastest processing time compared to five other imputation algorithms from the literature, with speed-ups ranging from 0.7 to 23 times. The mean absolute error of the values imputed by LPMD, measured against their original counterparts, was also stable across missing data mechanisms and rates of missing values. In real applications, missingness can follow different and unknown mechanisms, so an imputation algorithm that behaves stably across mechanisms is advantageous. The ML models trained on the imputed datasets also performed comparably to the baselines.
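As a rough illustration of the pipeline the abstract describes (amputation, propagation-based imputation, and scoring by mean absolute error), the following Python sketch may help. It is not the authors' LPMD or LPMD2: the function names `ampute_mcar` and `propagate_impute`, the RBF row-similarity graph, the `sigma` and `n_iter` parameters, and the synthetic data are all assumptions made for this example. It only shows the general idea of clamping known values and repeatedly replacing missing ones with a similarity-weighted average of neighbouring rows.

```python
import numpy as np

def ampute_mcar(X, rate, rng):
    """Hide entries completely at random (MCAR); return the NaN-filled copy and the mask."""
    mask = rng.random(X.shape) < rate
    X_miss = X.copy()
    X_miss[mask] = np.nan
    return X_miss, mask

def propagate_impute(X_miss, sigma=1.0, n_iter=50):
    """Illustrative propagation-style imputation (an assumption, not the paper's LPMD)."""
    mask = np.isnan(X_miss)
    # Start from column means so that row similarities can be computed.
    col_means = np.nanmean(X_miss, axis=0)
    X = np.where(mask, col_means, X_miss)
    # Row-to-row RBF similarities, computed once from the initial fill.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                        # a row does not vote for itself
    W /= W.sum(axis=1, keepdims=True) + 1e-12       # row-normalise the weights
    for _ in range(n_iter):
        X_new = W @ X                               # weighted average of the other rows
        X = np.where(mask, X_new, X_miss)           # clamp the known values
    return X

# Hypothetical synthetic data: four noisy copies of one latent variable.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X_true = latent + 0.1 * rng.normal(size=(200, 4))

X_miss, mask = ampute_mcar(X_true, rate=0.2, rng=rng)
X_imp = propagate_impute(X_miss)

# Mean absolute error over the entries that were hidden.
mae = np.abs(X_imp[mask] - X_true[mask]).mean()
print(f"MAE on imputed entries: {mae:.3f}")
```

Only the MCAR mechanism is simulated here; the missing-at-random and missing-not-at-random mechanisms mentioned in the abstract would need a mask that depends on observed or unobserved values, respectively.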
Published in: IEEE Access (Volume: 13)
Page(s): 65925 - 65938
Date of Publication: 10 April 2025
Electronic ISSN: 2169-3536
