Machine Learning Analysis for Data Incompleteness (MADI): Analyzing the Data Completeness of Patient Records Using a Random Variable Approach to Predict the Incompleteness of Electronic Health Records

The purpose of this article is to propose a methodology involving various methods that can be used to predict the data incompleteness of a dataset. Here the investigators have presented data incompleteness as both continuous and discrete random variables. In addition, the investigators used transfer entropy to advance the science associated with the analysis of data incompleteness of electronic health records. The underlying methodology has been named “Machine Learning Analysis for Data Incompleteness” (MADI), with the intention of developing a possible solution to data incompleteness in electronic health records. MADI advances the analysis of data incompleteness with the use of the Kolmogorov-Smirnov goodness-of-fit test, the Mielke distribution, and beta distributions for a holistic analysis. Alongside the methodology presented, the investigators explored stochastic gradient descent, generalized additive models, and support vector machines for comparison. Overall, the investigators have presented a complete set of methods and algorithms to help predict data incompleteness in a medical setting and provided suggestions for practical applications in the prediction of data incompleteness.


I. INTRODUCTION
In the world of medical informatics, the critical importance of the data completeness of electronic health records is becoming more and more clear. As the digital transformation of medical facilities across the world expands and provides greater computational abilities to health professionals, patient data is being examined more closely than ever before. Whether the data need exists for research, hospital demographics, or even for the patients themselves, the need for completeness of each entry in a medical records system increases. In the past, Nasir et al. [1] introduced the idea of using an algorithmic approach towards solving this problem. However, given the advances in artificial intelligence and machine learning, there is a critical need for a more advanced approach that can be effectively used to predict data incompleteness in electronic health records. For this purpose, the investigators involved in the project delineated in this article describe the use of advanced machine learning methods to predict this data incompleteness. Specifically, the core objectives of the work delineated in this article are as follows:
• advance the application of probability distributions to improve the prediction of data incompleteness in electronic health records, and
• advance the application of transfer entropy and ontologies in predicting the same.
To this end, the investigators present a methodology termed Machine Learning Analysis for Data Incompleteness (MADI).

The associate editor coordinating the review of this manuscript and approving it for publication was Adnan Kavak.

II. BACKGROUND
As data warehouses within medical and hospital systems grow, the representation and scope of the data within electronic medical systems have grown as well. Given the increase in scope and patient population representation, data availability and completeness become much more critical issues within the electronic health system used. When a patient presents to a doctor or provider, data is entered to record the visit and the patient's situation. However, in many cases the data is entered hastily, and as a result some of the data fields are left incomplete or blank [2]. This is a major issue, since completeness of clinical or medical data is essential when working in medical research or even when trying to reference past patient data.

III. METHODS
Based on the argument presented on data completeness, the investigators provide a comprehensive study analyzing the data incompleteness of electronic health records. Here the investigators use a Probability Density Function (PDF) as a representation of a patient record's incompleteness, where the data incompleteness is treated as a random variable [3]. In probability and statistics, a random variable is described informally as a variable whose values depend on the outcomes of a random phenomenon [4].
To propose our algorithm, we define the following variables:
• Completeness Parameter Variable (CPV): $x_{wz}$ is the measure of completeness for the data field located in the $w$-th row and $z$-th column. It is measured in a binary manner: 1 represents a complete data field and 0 represents an incomplete field [5].
• Completeness Scoring Variable (CSV): $CSV_z$ is the completeness measure of column $z$, and its value is between 0 and 1:
$$CSV_z = \frac{1}{r} \sum_{w=1}^{r} x_{wz} \qquad (1)$$
where $r$ is the number of rows and $1 \le z \le c$, with $c$ the number of columns.
• $DIM_z$: defines the Data Incompleteness Measure of column $z$ (or the Incompleteness Ratio of column $z$):
$$DIM_z = 1 - CSV_z \qquad (2)$$
According to Algorithm 1, the investigators compute DIM for each column of the experimental dataset. This is followed by the generation of the histogram of the entire dataset [6], based on the incompleteness ratio of each column [7].
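As a minimal sketch of these definitions, the Python snippet below computes the binary completeness indicator, the per-column completeness score, and the per-column incompleteness ratio with pandas, and then bins the incompleteness ratios with NumPy. The function name, the toy records, and the choice of four bins are illustrative assumptions, not the investigators' implementation.

```python
import numpy as np
import pandas as pd


def data_incompleteness_measure(df: pd.DataFrame) -> pd.Series:
    """Return the incompleteness ratio DIM_z for every column z of df."""
    cpv = df.notna().astype(int)  # x_wz: 1 = complete field, 0 = incomplete field
    csv = cpv.mean(axis=0)        # CSV_z: share of complete fields in column z
    return 1.0 - csv              # DIM_z = 1 - CSV_z


# Hypothetical toy records, used only to exercise the function.
records = pd.DataFrame({
    "age": [34, np.nan, 51, 47],
    "blood_pressure": [np.nan, np.nan, 120, 135],
    "diagnosis": ["flu", "asthma", None, "flu"],
    "patient_id": [1, 2, 3, 4],
})

dim = data_incompleteness_measure(records)

# Histogram of the per-column incompleteness ratios (bin count is an assumption).
counts, bin_edges = np.histogram(dim.to_numpy(), bins=4, range=(0.0, 1.0))
print(dim)
print(counts, bin_edges)
```

The histogram is computed rather than drawn so that the sketch has no plotting dependency; the same `counts` and `bin_edges` could be passed to any plotting library.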

A. DISTRIBUTION FITTING
Algorithm 1: Plotting the Histogram for the Experimental Dataset
procedure Plot the Histogram()
    Initialize the histogram bins
    Plot the corresponding histogram

With regard to predicting the incompleteness of each dataset presented, one of the most critical tasks in data pre-processing is distribution fitting. To begin the process of distribution fitting, we have to understand the parameters of each dataset we are working with, and provide a general idea of how each model will work, given those parameters [8].
Using the SciPy package, we can call a Maximum Likelihood Estimator (MLE) for parameter estimation on each of the datasets used.
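As a hedged illustration of this step: every continuous distribution in `scipy.stats` exposes a `fit` method that returns maximum likelihood estimates of its parameters. The sketch below fits a beta distribution to synthetic incompleteness ratios. The synthetic data, the seed, and the decision to fix the location and scale to the unit interval are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for observed incompleteness ratios in (0, 1).
rng = np.random.default_rng(0)
dim_values = rng.beta(2.0, 5.0, size=500)

# Maximum likelihood fit of the two beta shape parameters.
# floc/fscale pin the support to [0, 1], which suits ratio-valued data.
a_hat, b_hat, loc, scale = stats.beta.fit(dim_values, floc=0, fscale=1)

print(f"a = {a_hat:.3f}, b = {b_hat:.3f}, loc = {loc}, scale = {scale}")
```

Leaving `loc` and `scale` free is also possible, at the cost of two extra estimated parameters and a less stable fit on small samples.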
The Kolmogorov-Smirnov test [9] is used to determine whether a sample comes from a specific distribution. It is based on the empirical cumulative distribution function (ECDF) [10]. Given $N_{samples}$ ordered data samples $Y_1, Y_2, \ldots, Y_{N_{samples}}$, this test is defined by [11]:
i) $H_0$: the data fit a specified distribution,
ii) $H_a$: the data do not fit the specified distribution,
iii) Test statistic $D_{KS}$:
$$D_{KS} = \max_{1 \le i \le N_{samples}} \left( F(Y_i) - \frac{i-1}{N_{samples}},\; \frac{i}{N_{samples}} - F(Y_i) \right) \qquad (3)$$
where $F$ is the cumulative distribution function of the specified distribution.

Distribution fitting is not the only step required in data engineering, but rather one of two steps. The second critical step in finding a best-fit distribution is testing the proposed model. The method we used in our experimentation was the Kolmogorov-Smirnov test, as seen previously in Section III-A1. The Kolmogorov-Smirnov test has been widely used in a variety of statistical models, as seen in studies such as [12] and [13]. As a result, we can use the Kolmogorov-Smirnov test as a method for finding the best-fitted distribution. As for the assumptions of the model, the data is assumed to be standardized, and each distribution (in this case, the 88 distributions known and available within the SciPy library) must be applied to the dataset.
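To make the test statistic concrete, the sketch below computes the standard one-sample Kolmogorov-Smirnov statistic by hand, as the largest gap between the ECDF of the ordered samples and the reference CDF, and compares it with `scipy.stats.kstest`. The synthetic samples and the beta(2, 5) reference distribution are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats


def ks_statistic(samples, cdf):
    """One-sample KS statistic: largest gap between the ECDF and a reference CDF."""
    y = np.sort(np.asarray(samples, dtype=float))
    n = y.size
    i = np.arange(1, n + 1)
    f = cdf(y)
    d_minus = np.max(f - (i - 1) / n)  # CDF above the ECDF just before each step
    d_plus = np.max(i / n - f)         # ECDF above the CDF at each step
    return max(d_minus, d_plus)


rng = np.random.default_rng(1)
samples = rng.beta(2.0, 5.0, size=300)  # synthetic data
reference = stats.beta(2.0, 5.0)        # hypothesized distribution

d_manual = ks_statistic(samples, reference.cdf)
result = stats.kstest(samples, reference.cdf)
print(d_manual, result.statistic, result.pvalue)
```

A small statistic (and a large p-value) gives no evidence against the hypothesized distribution; it does not prove that the data were generated by it.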
Algorithm 2 represents our method for finding the best distribution to fit to the resulting histogram.
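Since the pseudo-code of Algorithm 2 is not reproduced here, the following is only a sketch of one plausible reading of it: fit each candidate distribution by maximum likelihood, score each fit with the Kolmogorov-Smirnov statistic, and rank the candidates. The short candidate list, the helper name `rank_distributions`, and the synthetic data are assumptions; the article applies the full set of SciPy distributions.

```python
import warnings

import numpy as np
from scipy import stats


def rank_distributions(samples, candidate_names):
    """Fit each candidate by MLE and rank the fits by their KS statistic."""
    ranked = []
    for name in candidate_names:
        dist = getattr(stats, name)
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")  # some fits emit numerical warnings
                params = dist.fit(samples)
                result = stats.kstest(samples, name, args=params)
        except Exception:
            continue  # skip candidates whose fit does not converge
        if np.isfinite(result.statistic):
            ranked.append((result.statistic, name, params, result.pvalue))
    ranked.sort(key=lambda entry: entry[0])  # smallest KS statistic first
    return ranked


rng = np.random.default_rng(2)
dim_values = rng.beta(2.0, 5.0, size=400)  # synthetic incompleteness ratios
candidates = ["norm", "expon", "uniform", "gamma", "beta"]  # hypothetical subset

ranking = rank_distributions(dim_values, candidates)
for d_ks, name, params, p_value in ranking:
    print(f"{name:8s} D = {d_ks:.4f}  p = {p_value:.4f}")
```

Because the parameters are estimated from the same data that are then tested, the p-values printed here are optimistic; the ranking by the statistic is the more useful output of such a sketch.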

B. COMPLETENESS GRAPHING
Aside from fitting the data with a proper distribution, measures of incompleteness are much easier to grasp and utilize when viewed in a more tangible format. To begin data post-processing and representation, it is critical to utilize all data available, along with the chosen distribution seen in Section III-A1. We utilized the MissingNo library [14], [15] to produce high-quality depictions of the data represented throughout experimentation; it served as one of our main visualization tools for converting the data from logistic and applied regression models into visual representations of how the dataset looks. Algorithm 3 presents our pseudo-code for creating the visualizations following distribution fitting.
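One quantity commonly shown in a missing-data heatmap is the nullity correlation, which measures how strongly the absence of one column tracks the absence of another. The sketch below computes it with pandas alone so that it has no plotting dependency; the helper name and the toy table are assumptions, and this is not presented as the exact computation behind the figures in this article.

```python
import numpy as np
import pandas as pd


def nullity_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Correlation between the missingness indicators of the columns of df."""
    missing = df.isna()
    # Columns that are always complete or always missing carry no variation.
    varying = missing.loc[:, missing.nunique() > 1]
    return varying.astype(float).corr()


# Hypothetical toy table.
table = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],  # missing exactly when "a" is present
    "c": ["x", None, "y", None, "z", "w"],            # missing exactly when "a" is missing
    "d": [1, 2, 3, 4, 5, 6],                          # always complete
})

corr = nullity_correlation(table)
print(corr)
```

In this reading, a value near 1 means two columns tend to be missing together, and a value near −1 means one tends to be missing when the other is present.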
However, another important aspect of designing the ontology and improving the representation of information entropy within the dataset is the design and calculation of intersections within the entropic points of the data [16]. As seen in previous works, information entropy presents itself in different contexts, and can be represented using comparison or coordination. Using the available modules within SciPy and MissingNo, the different levels of entropy in the ontological map can be represented, giving greater context to how the incompleteness of the data connects at different points within the structure of the data itself [17]. Another key aspect of measuring the entropy of the datasets is the information content of a node within the hierarchical structure created [18]. Here, a probabilistic generalization is created for each data point within the set, and the length of each node is generated based on the probability of the given node [column]. As the number of nodes in the hierarchical structure increases, the overall information representation increases in tandem [19].
Algorithm 4 presents our method for producing the intersections found in our ontological representation of the mixed data-type dataset, as seen in 6, where the final fraction will be raised to a measured coefficient power.
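The text does not give the entropy formulas used for the nodes, so the sketch below falls back on the standard Shannon definitions as an assumption: the information content of an event with probability p is −log2(p), and the entropy of a column's missingness indicator is the binary entropy of its missing fraction. It is not a reconstruction of Algorithm 4; the function names and toy records are hypothetical.

```python
import numpy as np
import pandas as pd


def information_content(p):
    """Self-information in bits of an event with probability p > 0."""
    return -np.log2(p)


def missingness_entropy(df: pd.DataFrame) -> pd.Series:
    """Binary Shannon entropy (bits) of each column's missing/present indicator."""
    p_missing = df.isna().mean().to_numpy(dtype=float)
    entropy = np.zeros_like(p_missing)
    mixed = (p_missing > 0) & (p_missing < 1)  # entropy is 0 when p is 0 or 1
    p = p_missing[mixed]
    entropy[mixed] = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return pd.Series(entropy, index=df.columns)


# Hypothetical toy records.
records = pd.DataFrame({
    "age": [34, np.nan, 51, 47],
    "blood_pressure": [np.nan, np.nan, 120, 135],
    "patient_id": [1, 2, 3, 4],
})

node_entropy = missingness_entropy(records)
print(node_entropy)
print(information_content(0.25))
```

Under this definition a column that is half missing carries the most uncertainty (1 bit), while a fully complete or fully empty column carries none.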

IV. EXPERIMENTATION AND RESULTS
In this section, the authors illustrate the results obtained after implementing Algorithm 1, Algorithm 2, Algorithm 3, and Algorithm 4 using SciPy, NumPy, and MissingNo, which have been discussed in Section III and in Section III-B1. To evaluate our proposed algorithms, we used three different types of datasets: integer-based, mixed-data, and string-based datasets [20]. As experimentation continued, one of the most surprising findings concerned how differently the measurement of data incompleteness behaved across the models tested.
When it comes to traditional and probabilistic statistical methods, the distribution and representation of the data held true in the majority of the models tested. Among the most useful models were the Kolmogorov-Smirnov test, the Mielke representation, logistic regression, and other non-parametric representations of the distribution.
As seen in Algorithm 1, Algorithm 2, Algorithm 3, and Algorithm 4, the use of seemingly simple statistical methods can help reduce the computation needed when determining how complete the data columns are and re-applying the findings in other contexts. Starting with Algorithm 1, the whole of the input data is used to output a histogram, and a data incompleteness measure is applied to each column of the input medical data to provide an idea of how the incompleteness is distributed throughout the dataset.
In Algorithm 2, the histogram produced in Algorithm 1 is then used to form a line or match of best fit, given the present output data. A maximum likelihood estimator is used, along with distribution fitting methods, to output a best-fit model. After the algorithm procedure is finished, the Kolmogorov-Smirnov test is then used to provide a tangible measure of how good the fit actually is. In Algorithm 3, the data represented in Algorithm 1 and Algorithm 2 is then used to re-introduce the findings in a different context. Using the incompleteness and best-fit measures, the data can be reapplied and visualized in a series of different plots. The experimentation led to a series of plots being seen as ideal, which includes a heatmap, an ontology, and a bar graph. The data is presented in easy-to-read formats, which provide a much better view of how the dataset is distributed.
Finally, Algorithm 4 shows how the ontology is developed. Given the data from Algorithm 1, the algorithm can weight each column and classify each into its own separate block. Entropy weights and distances are determined, which allows the data to be shown in the context of information entropy.

V. ALTERNATIVE METHODS TESTED
In this section, the investigators present the results of experimentation when applying the proposed algorithm under different contexts. These include Stochastic Gradient Descent, Generalized Additive Models, and Support Vector Machines. While these may not have been used in the final proposed solution, the results do provide an interesting look into how data incompleteness can be approached with modern machine learning techniques.

A. STOCHASTIC GRADIENT DESCENT
VOLUME 9, 2021

FIGURE 5. Heatmap of data incompleteness: For a binary measure, 1 represents complete data (in a correlative analysis), and −1 represents always incomplete data.

With regard to stochastic gradient descent, past research suggests that optimization of objective functions may improve upon total measures of data incompleteness and improve upon the general output of data features, following data engineering. However, in practice, a few issues arose when attempting to implement stochastic gradient descent or even regular gradient descent models. The first issue seen in experimentation concerned the objective function and its optimization. While there is a general ''fit'' for the data, it cannot be seen given that the data is missing. Past researchers have suggested filling in the missing data to allow the objective function to be optimized; however, given the goal of finding a measure of incompleteness, this is an incompatible suggestion. Another issue is seen in how data presents itself when incomplete. When data is incomplete, measures of optimization and general improvements will not affect the overall fit of the data, and therefore should not be used when the goal is to view the data as raw as possible (or as it was originally entered).
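The incompatibility between imputation and incompleteness measurement can be seen in a few lines. In the sketch below, filling missing values with column means (one common imputation choice, assumed here for illustration) leaves nothing for an incompleteness ratio to measure. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical records with missing entries.
records = pd.DataFrame({
    "heart_rate": [72.0, np.nan, 88.0, np.nan],
    "weight_kg": [80.5, 62.0, np.nan, 71.0],
})

# Incompleteness ratio per column on the raw data.
dim_raw = records.isna().mean()

# Mean imputation makes the table usable by an optimizer...
imputed = records.fillna(records.mean())

# ...but erases the very signal being measured.
dim_imputed = imputed.isna().mean()

print(dim_raw)
print(dim_imputed)
```

If imputation is needed for a downstream model, computing and storing the incompleteness ratios before imputing keeps the two goals separate.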

B. GAMS, SVMS, AND UNSUPERVISED MACHINE LEARNING MODELS
The next models attempted during testing were Generalized Additive Models, Support Vector Machines, and some unsupervised methods. One of the main questions raised was how several methods could produce the same result, and the answer shown in testing was unfortunately disappointing. To begin with Generalized Additive Models, past studies have used the term non-ignorability and suggested a sensitivity analysis when utilizing the models on incomplete data. With regard to Support Vector Machines, suggestions pointed towards adjusting classifiers and preparing for ''missing'' bias. In the final testing rounds, unsupervised models pointed towards trying out similarity scores and preparing the study using a priori methods.
These methods were grouped together in one section because of their common end result: each displayed over-fitting of the datasets and an overall lack of results. Similar issues occurred through each iteration of testing: the incompleteness of the dataset would cause the model to behave inappropriately, or the resulting output from each of the models showed filtering when it came to measuring and testing the missing data. As a result, none of these methods proved to be a viable basis for a solution.
However, past research does suggest that further testing could lead to an improved model, especially when it comes to unsupervised learning. Reapplying the methods seen in noisy dataset research, frequency-inverse analysis methods seem to have the greatest viability with regard to measuring the incompleteness [21] of a medical record system. The biggest hurdle for future research seems to come from model training and large-scale data warehousing, as often seen in the medical networks of a typical hospital or medical complex.

VI. DISCUSSION
Over time, the importance of and overall need for data completeness measurements within the medical and allied health professions have increased, providing a clear look into how critical systems can fail without such parameters in mind. Solutions and methods addressing the lack of algorithmic approaches to data completeness have slowly trickled in from different parts of academia, but there are very few papers or available methods giving a robust and finite solution in a medical setting. Given the need for and lack of such an application, adding data completeness operations to medical records systems would allow medical warehouses to gain insight into where their physicians and practitioners are missing information, and give a better spread of the clinical data available not only to the practicing offices, but also to the patient populations being served.
Throughout the experimentation and overall research process, the proposed approach has centered on finding a complete solution to the issues seen in data availability and completeness in medical records systems. Utilizing novel algorithmic and statistical methods, we wanted to see how several regular, robust, and noisy methods could address the problem, along with some optimization and regression techniques [22]. Using a combined method involving several of these techniques, our algorithmic system focuses on a statistical calculation over the columns of a given dataset (leading to completeness scoring and a plotted histogram for visualization), followed by the application of several different statistical spread methods to present a best-fit representation of the overall completeness. The fitting of the spread is then verified using a Kolmogorov-Smirnov test [23].
Alongside the proposed solution, the investigators tested several other methods to see their effects on overall data completeness and whether or not they worked better than the proposed algorithmic solution. Examples include stochastic gradient descent, generalized additive models, and support vector machines. The main issues we ran into were over-fitting and the inability of these methods to work as an algorithmic method across the several medical datasets on which they were tested. As a result, further experimentation was abandoned in favor of a statistical and optimization-focused method.

A. COMPARISON WITH OTHER RELATED WORK
With regard to previous works, the researchers sought to explore novel and computational methods to apply data incompleteness analysis to an applied health dataset.
Some of the studies done in the past relied heavily on supervised machine learning methods [24], [25], or used differing statistical methodologies [26]-[29]. Our study had two clear goals:
1) Provide a practical utilization of machine learning analysis with regard to electronic health record data incompleteness.
2) Propose implementation and software development patterns for usage in large-scale medical networks.
In an effort to provide the best solution moving forward, the main solution revolved around optimization and implementation in a warehouse system. Given these guidelines, the comparison with past works reveals that there is some room for improvement in medical records systems. In the majority of currently available algorithms and modules, the solutions typically involve smoothing or data collection models, which can have issues in the long run [30]. Overall, the uniqueness of the work delineated in this article lies in its integration of information entropy, probability distributions, and ontologies to address the problem of data incompleteness.

B. LIMITATIONS
Current limitations in the proposed methods appear when datasets are not large enough to be supported by the available supervised learning methods or by the algorithmic approach over the four algorithms. This occurs in several different use cases, such as when the data cannot be converted into an ordinal format, or when the dataset is too small to be split and calculated through regression or statistical analysis.
The best example of this comes in the usage of a smaller module dataset within a medical EHR system. When working in modules and certain units, you may be limited to a small set of time-series data, for example when you want to pull medical records from a certain unit within a defined time period. When this occurs, the column and row lengths will limit the algorithmic calculations of how much incompleteness can be transmitted and presented as a method of entropy. A further consequence is that, even though the measure within the data space remains binary, the overall representation and spread of the data can become very uneven and unclear over shorter datasets.
Another key limitation of this research is that the investigators have not taken into consideration whether data are missing at random, missing completely at random, or not missing at random [31], [32]. Accounting for these factors would involve questions such as how data was gathered in the publicly available datasets used for experimentation. Here the investigators must accept in all honesty that they do not have information on these factors and are therefore unable to make suitable comments on this matter.
The last limitation seen within the algorithmic method is in how the system works. Through the series of algorithms, the spread and overall coverage of the data measurements are limited to the completeness and relevant information of the columns in a given dataset. Other information related to the dataset will either have to be analyzed separately or reconciled as a different algorithm.

VII. CONCLUSION
To summarize, the investigators have described an experimentation that provides the scientific community a new horizon on the application of probability distributions, transfer entropy, and ontologies to the problem of analyzing data incompleteness, thereby advancing the science of medical informatics. Specifically, the core contributions of MADI are as follows:
• Advancing the science of transfer entropy applied to the problem of data incompleteness.
• Advancing the application of ontologies in analyzing electronic health records.
• Advancing the application of probability distributions to advance machine learning applied to data incompleteness of electronic health records.
Furthermore, the article also presents some insights into Support Vector Machines, stochastic gradient descent, and generalized additive models with respect to this problem. In the future, the investigators plan to advance MADI with more advanced machine learning approaches [33].

VARADRAJ P. GURUPUR (Senior Member, IEEE) is currently working as an Associate Professor with the Department of Health Management and Informatics, University of Central Florida. He has more than seven years of teaching experience. He has served as a teacher in two different countries. He has worked in the healthcare industry for several years. Based on this work experience and academic training, he is involved in discovering innovative solutions to difficult problems associated with electronic health records. His research interests include software engineering decision support systems for healthcare and education. He was a recipient of two international awards, one national award, and several regional and institutional awards.

MUHAMMED SHELLEH (Member, IEEE) received the bachelor's degree in health and biomedical sciences. He is currently a student pursuing his graduate degree in Computer Science with the Department of Computer Science, University of Central Florida.