SNOMED CT-Based Standardized e-Clinical Pathways for Enabling Big Data Analytics in Healthcare



I. INTRODUCTION
Healthcare processes produce a large amount of data that has great potential for healthcare administrators, health policymakers, and big data researchers. However, a great portion of such data is not properly captured: it is missing in electronic format and hidden inside paperwork and forms. Electronic Health Record (EHR) systems were expected to store that vast amount of data in digital format. However, that target has never been achieved, even though EHRs have acted as the central component of Health Information Systems (HIS) for decades. An important study on missing clinical and behavioral health data in a large EHR system revealed that EHRs inadequately capture various healthcare data such as diagnosis, visits, specialty care, hospitalizations, and medications [1]. The study concluded that missing data undermine many central functions of EHR and that ''missing clinical information raises concerns about medical errors and research integrity'' [1]. The authors stressed that ''given the fragmentation of health care and poor EHR interoperability, information exchange and usability, priorities for further investment in health IT will need thoughtful reconsideration'' [1]. This is not the only study regarding the vast amount of missing healthcare data in HIS. In [2], the authors presented multiple cases showing how missing data in HIS are likely to result in medication errors and other patient harms. Missing data are also an obstacle to big data research in healthcare. In [3], the authors indicate that missing patient data are prevalent in EHRs and are an impediment to utilizing machine learning for predictive and classification tasks in healthcare. For example, Length of Stay (LOS) prediction methods found in the machine learning literature operate without considering rehabilitation nursing interventions. This degrades the prediction accuracy of rehabilitation LOS. Although documented on paper, rehabilitation data are rarely captured electronically in patient records [4].
A major source of missing data in healthcare institutions is paper-based forms and unstructured data. In [5], the authors list medical data written in an unstructured text format as one of the major sources of missing data in HIS. This is because EHRs are not designed to capture non-standardized data. In support of this analysis, and upon analyzing the literature, we found that an important reason for missing data in HIS is that a primary source of healthcare data is still paper-based and has not yet been fully automated. By this primary data source, we mean Clinical Pathways (CPs). CPs have been defined as the optimal sequencing and timing of medical interventions by doctors, nurses, and other caregivers for a particular procedure or diagnosis, developed to minimize delays and resource utilization and to improve the quality of healthcare [6]-[8]. CPs first appeared in healthcare in the mid-1980s in the USA and have since spread all over the world [9]. The concept itself was not new because it has its roots in management theories that were proposed to improve the quality of business processes, such as the Critical Path Method (CPM), the Program Evaluation and Review Technique (PERT), and Business Process Reengineering (BPR). These successful management theories had not been applied in healthcare; thus, the concept of CP was an initiative to adopt effective management concepts in hospitals [10]-[12].
The associate editor coordinating the review of this manuscript and approving it for publication was Wai Keung Fung.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Despite the fact that CPs are becoming globally popular in hospitals as main components for patients' treatment and follow-up, CPs are still circulated in hospitals as paper-based documents and charts/tables. This forms a great barrier between CPs and their integration with today's automated hospitals. There have been several studies to automate CPs [13]- [19]. However, analyzing the relevant literature reveals that the common theme in all studies attempting to computerize CPs is that the main emphasis behind the computerization process was on how electronic CPs can support EHR systems. Therefore, the automation was not complete in the sense that CPs themselves were left with their unstructured nature (i.e., without standardization), and the resulting e-CPs were only partially automated and hidden behind EHRs.
The proper automation of CPs requires converting them to full digital entities that can work smoothly with all existing HIS, not only EHRs. This is crucial to reducing missing data in healthcare because CPs contain instructions on all interventions and procedures done on patients. Most missing data in healthcare (i.e., data not recorded digitally) exist due to the fact that CPs are not digitized.
The objective of this research is to study the effect of a new CP computerization framework on big data analytics in healthcare. The basic concepts of the framework are only briefly summarized here, since its full details will be addressed in a separate article dedicated to the framework itself. The new SNOMED CT-based automation framework digitizes CPs and makes e-CP systems central components of HIS. It enables a detailed statistical analysis of CP interventions and maximizes data extraction from CPs. This contributes to decreasing data missingness and to providing richer ''CP-based'' datasets for data analytics, which is the focus of this research. We therefore hypothesize that the framework enables big data analytics and has positive effects on machine learning applications in healthcare. As an illustrative numerical example, we show the contribution of this work to improving the prediction accuracy of machine learning algorithms through experiments that use stroke CP data to predict the LOS of stroke patients in an acute rehabilitation facility. The experiments were applied to the CP data of real stroke patients in collaboration with the Regional Stroke Unit at Thunder Bay Regional Health Sciences Centre (TBRHSC), Ontario, Canada. We also present an example of time variation analysis of CP interventions.
The rest of the paper is organized as follows. Section 2 discusses CP automation and integration with HIS. Section 3 addresses the effect of SNOMED CT based e-clinical pathways on big data analytics in healthcare. In Section 4, we describe the experimental environment and discuss the experimental results. Finally, conclusions are drawn, and future research work is suggested in Section 5.

II. CP AUTOMATION AND INTEGRATION WITH HIS
As described above, e-CPs found in the literature have direct communication only with EHRs. They are not designed to be independent, fully-functioning systems. We view this as improper automation and integration of e-CPs since, in actual life, CPs are designed to be central healthcare components that produce data for all types of HIS commonly used in healthcare. This analysis of CP automation reveals two fundamental research challenges in the path of achieving complete automation of CPs and their proper integration with existing HIS. These challenges are summarized below.
• CP Automation: CPs are sometimes expressed in ambiguous local textual instructions. This not only makes them difficult to understand by other medical staff members but also makes them difficult to automate (with limited transferable electronic data). This usually creates a digital barrier or ''digital divide'' between CPs and HIS. Fig. 1 illustrates how paper-based or partially automated CPs create a digital divide between CPs and other HIS. e-CP research so far has ignored the presence of this digital divide, and most efforts were directed towards ''programmatically'' linking basic CP data with EHR systems while leaving CPs ''as is,'' digitally invisible and far away from the digital age. This situation keeps CPs semantically non-operable with today's IT infrastructure.
• e-CP Integration with HIS: CP automation to produce computerized e-CPs forms only one aspect of CP inclusion in HIS. Another equally important consideration is: with what other HIS should e-CPs be able to communicate? As mentioned in the introduction, an analysis of the literature so far shows that automated CPs were positioned as side components created to support EHR systems. This positioning undervalues the real potential of CPs as the main healthcare plans for treatment and follow-up.
The following sub-sections give a summary of our CP automation framework with regard to the two considerations addressed above. We also show the importance of this research through a scenario related to detailed time variation analysis of CP interventions.

A. CP AUTOMATION
CPs are populated with data that are only partially transferred to other HIS. A key factor that is impeding the transfer of full CP data is that CPs are prepared in hospitals without attention to standardizing their medical terms. After a thorough review of CP research found in the literature and discussions with our domain experts at TBRHSC, it was clear that most CPs are currently developed using ambiguous local medical terms and abbreviations [20]- [29]. This situation makes CPs prone to human errors and forms a challenge to exchanging them across medical institutions. This also causes the loss of valuable CP data because existing HIS use standardized terminology systems in their encoding of medical terms. A solution for this, in our framework, is to encode CP data using internationally recognized medical reference terminology. For this objective, we selected the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) [30] to be the terminology system for encoding CP data because SNOMED CT is the most comprehensive and largest medical terminology in the world [31]. SNOMED CT encoding is realized by representing each CP data element by its equivalent SNOMED CT term and SNOMED CT ID (SCTID). For example, the ambiguous instruction ''Complete Hx and PE'' that is an element of a CP for open appendectomy [20], is converted in our framework into ''Complete History and Physical Examination'', where History and Physical Examination is a standard SNOMED CT term whose SCTID is 53807006. Table 1 gives more examples from a  stroke CP. In addition to SNOMED CT encoding, rather than modeling CPs using traditional information systems, we modeled CPs using an ontology-based approach. The CP ontology is designed in consultation with domain experts and ontology research available in the literature [16], [17]. Fig. 2 shows the basic classes and relations in the main CP ontology that we developed using the Protege ontology editor and knowledge management system [32]. 
The ontology is integrated with a Java-based prototype CP management system that we developed based on our proposed framework. Disease-specific ontologies (e.g., ischemic stroke ontology) are then instantiated from the main CP ontology such that CP individuals (i.e., CP data) are SNOMED CT standardized. Fig. 3 shows part of the stroke ontology encoded with the proposed ontology-based modeling with SNOMED CT-based standardization.
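The SNOMED CT encoding step described above can be pictured as a lookup from local CP wording to a standardized term and SCTID. The following Python fragment is only a minimal sketch under stated assumptions: the `CodedElement` type, the `LOCAL_TO_SNOMED` table, and the substring matching rule are hypothetical illustrations, not the Java prototype or the ontology used in this work. The only SCTIDs used are the two quoted in this paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodedElement:
    local_text: str   # wording as it appears on the paper-based CP
    snomed_term: str  # standardized SNOMED CT term
    sctid: str        # SNOMED CT identifier, kept as a string


# Hypothetical lookup table: local abbreviation -> (SNOMED CT term, SCTID).
# Only the two codes quoted in this paper are listed.
LOCAL_TO_SNOMED = {
    "hx and pe": ("History and Physical Examination", "53807006"),
    "screening for dysphagia": ("Screening for dysphagia", "431765005"),
}


def encode_element(local_text: str) -> Optional[CodedElement]:
    """Return the coded form of a CP instruction, or None if no mapping is known."""
    lowered = local_text.lower()
    for local_term, (term, sctid) in LOCAL_TO_SNOMED.items():
        if local_term in lowered:
            return CodedElement(local_text, term, sctid)
    return None


coded = encode_element("Complete Hx and PE")
unmapped = encode_element("Local note without a known mapping")
print(coded)
print(unmapped)
```

Returning `None` for unmapped text, rather than guessing a code, keeps unstandardized instructions visible so that a domain expert can review them.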
Some advantages of the proposed modeling include [33]:
• Ontology, as a knowledge engineering modeling tool, allows for sharing and reuse of domain knowledge.
• An ontology defines the semantics of CP domain knowledge, which provides a shared understanding that can be communicated between heterogeneous applications in a machine-understandable way, thus facilitating semantic interoperability among e-CPs and HIS.
• Using SNOMED CT to represent CP data makes the integration of e-CPs with existing HIS smooth. This maximizes data extraction from CPs, which reduces missing data and improves data analytics.
• Standardized CPs convert patients' CP traces (i.e., the path of each patient in the CP from admission to discharge) into path sequences of well-defined interventions with their start times and end times.
• e-CPs with standardized data facilitate the retrieval of their contents for quality control. For example, the data of two CPs for the same disease used at two different hospitals can be retrieved digitally and compared by comparing their SNOMED CT terms and codes. This is a difficult task to realize with today's unstandardized, local text-based CPs. This also enables an e-CP system to act as a fully independent system, having its own CP functions (e.g., comparing CPs, exporting CP data to other systems, saving CP traces in a database of traces and applying data analytics on them).
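The comparison of two CPs mentioned in the last bullet reduces to set operations once every intervention carries an SCTID. The sketch below assumes each CP is available as a dictionary from SCTID to term; the function name and the two hospital pathways are hypothetical and serve only to illustrate the idea.

```python
def compare_pathways(cp_a: dict, cp_b: dict) -> dict:
    """Compare two CPs given as {sctid: term} dictionaries."""
    codes_a, codes_b = set(cp_a), set(cp_b)
    return {
        "shared": sorted(codes_a & codes_b),      # interventions in both CPs
        "only_in_a": sorted(codes_a - codes_b),   # present only in the first CP
        "only_in_b": sorted(codes_b - codes_a),   # present only in the second CP
    }


# Hypothetical pathways for the same disease at two hospitals.
hospital_a = {
    "53807006": "History and Physical Examination",
    "431765005": "Screening for dysphagia",
}
hospital_b = {
    "53807006": "History and Physical Examination",
}

diff = compare_pathways(hospital_a, hospital_b)
print(diff)
```

With free-text CPs, the same comparison would require manual reading or fuzzy text matching, which is the difficulty the bullet points out.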

B. e-CLINICAL PATHWAYS INTEGRATION WITH HIS
In order to address the proper communication level between e-CP systems and HIS, we first briefly consider the major subsystems of HIS and the data they contain [34]. Then we consider e-CPs and analyze the relationship between CP data and HIS.

1) ELECTRONIC HEALTH/MEDICAL RECORD SYSTEMS (EHR/EMR)
An EMR is a digital version of patients' paper charts. It is used by single-practice clinics and small hospitals for their local records. Typically, an EMR contains the medical history of the patients, diagnoses, and treatments. An EHR can be viewed as a ''large-scale'' EMR that stores more data and facilitates the sharing of health records across different institutions. Modern systems are capable of playing the roles of both EMR and EHR since they offer options to keep patients' data local inside the institution or to share the data with larger systems [4].

2) LABORATORY INFORMATION SYSTEMS (LIS)
LIS are software systems with features that support modern laboratory operations and informatics. The main functions of LIS include recording, managing, and storing clinical laboratory data for patients. LIS have traditionally been most adept at sending laboratory test orders to lab instruments, tracking orders, and recording lab test results. In addition, LIS support the operations of public health institutions and their labs by managing and reporting critical data concerning immunology and infection [35].

3) RADIOLOGY INFORMATION SYSTEMS (RIS)
RIS are the core systems for the e-management of imaging departments and are critical to the efficient workflow of radiology practices. The main functions of RIS include scheduling of patients, managing resources of radiology departments, image performance tracking, and distribution of results. A central component of RIS is the radiology PACS (Picture Archiving and Communication System) which provides storage and easy access to medical images from various sources (e.g., computed tomography (CT), medical ultrasound, X-ray, magnetic resonance imaging (MRI), computed radiography (CR), etc.) [34].

4) PHARMACY INFORMATION SYSTEMS (PIS)
PIS provide functions to maintain the organization and supply of drugs. A PIS can be a separate system for pharmacy usage, or it can be coordinated with inpatient hospital order entry systems. PIS are used to increase patient safety, report drug usage, and track costs. Outpatient PIS have a strong emphasis on medication labeling, drug warnings, and instructions for administration. The effective and safe dispensing of pharmaceutical drugs is the most important function of PIS. During the dispensing process, PIS prompt pharmacists to verify that the medication they have filled is for the correct patient, contains the right quantity and dosage, and displays accurate information on the prescription label [4].

5) e-CLINICAL PATHWAYS SYSTEMS (e-CPS)
The concept of applying CPs in hospitals was a novel initiative to adopt successful management practices in healthcare. Therefore, since their introduction to healthcare institutions, the main objective of CPs has been to coordinate and ''manage'' healthcare processes as central components. CPs contain all the interventions required to treat the patients; thus, within CPs lies the very heart of medical planning and treatment, including cost and quality factors in healthcare. The considerations above suggest that CPs were designed to produce all the types of healthcare data described above (e.g., EHR data, LIS data, etc.). Fig. 4 makes this point clear by illustrating how CPs generate data for all types of HIS discussed above. Two CPs, for Diabetes Mellitus and Carotid Artery Disease, are illustrated. As shown by the arrows in the figure, both CPs include order instructions that result in data that need to be transferred to all types of HIS. Thus, computerized CP management systems should be designed and positioned such that they are ''centralized'' (i.e., positioned at the centre of HIS) and allowed to communicate with all types of HIS, not only with EHRs, as is the common theme in CP systems found in the literature. Besides using ontological modeling and SNOMED CT-based standardization, this positioning and high communication level of e-CPs can be achieved by equipping CP management systems with Health Level 7 (HL7) messaging functionality to communicate with existing HIS (Fig. 5). HL7 consists of a set of international standards for the transfer of clinical data between software applications [36]. This is achieved through standard, machine-readable HL7 messages. The generation of standard HL7 messages can be automated by application programs written in high-level languages, for example with the Java-based HL7 Application Programming Interface (API) toolkit [37]. Fig. 6 shows an illustration of an HL7 observation result message that communicates the result of human immunodeficiency virus status using SNOMED CT encoding.
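To make the idea of an HL7 observation message with SNOMED CT encoding more concrete, the fragment below assembles a simplified, pipe-delimited HL7 v2-style message as plain strings. It is a rough sketch under stated assumptions, not the message of Fig. 6 and not output of the Java toolkit cited above: the application and facility names, the control ID, and the observation value are hypothetical, and segments that a conformant observation result message would normally carry (such as patient identification and the observation request) are omitted. The SCTID is the dysphagia screening code quoted later in this paper, and `SCT` is used as the coding-system label for SNOMED CT.

```python
from datetime import datetime

FIELD_SEP = "|"
SEGMENT_SEP = "\r"  # HL7 v2 segments are separated by carriage returns


def build_segment(fields):
    """Join the fields of one segment with the HL7 field separator."""
    return FIELD_SEP.join(fields)


def build_observation_message(sctid, term, value, control_id, timestamp):
    """Build a simplified message with a header (MSH) and one observation (OBX)."""
    ts = timestamp.strftime("%Y%m%d%H%M%S")
    msh = build_segment([
        "MSH", "^~\\&",
        "ECP_SYSTEM", "HOSPITAL",   # hypothetical sending application and facility
        "EHR_SYSTEM", "HOSPITAL",   # hypothetical receiving application and facility
        ts, "", "ORU^R01", control_id, "P", "2.5",
    ])
    obx = build_segment([
        "OBX", "1", "ST",
        f"{sctid}^{term}^SCT",      # observation identifier coded with SNOMED CT
        "", value,                  # empty sub-ID, then the observation value
        "", "", "", "", "",         # units, range, flags, probability, nature: unused
        "F",                        # result status: final
    ])
    return SEGMENT_SEP.join([msh, obx])


message = build_observation_message(
    sctid="431765005",
    term="Screening for dysphagia",
    value="Completed",              # hypothetical result text
    control_id="MSG0001",           # hypothetical message control ID
    timestamp=datetime(2020, 1, 6, 9, 30, 0),
)
print(message.replace(SEGMENT_SEP, "\n"))
```

In practice, a dedicated HL7 library would be preferable to string concatenation because it validates segment structure and escapes reserved characters.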

C. TIME VARIATION ANALYSIS OF CP INTERVENTIONS
Fully automated and digitized CPs open new opportunities for applying data and statistical analyses to CPs. This helps improve CP performance and optimize hospital resources. Examples can cover a wide range of applications. We present here an example scenario related to time variation analysis of CP interventions. In today's busy medical work environments, time management in hospitals is important to control costs and save lives. Wasted time may deprive other patients of the required healthcare service due to a lack of medical staff members, which may cause the death of some patients. e-CP systems in which the start time and end time of each intervention are recorded can provide great help in discovering inefficient practices related to time management. For this objective, we developed CP time analytics. Fig. 7 illustrates the graph of a sample output related to Screening for Dysphagia for ischemic stroke patients (SCTID 431765005). In the ischemic stroke CP, one of the medical interventions is ''Screening for dysphagia'' because stroke often causes a swallowing disorder called dysphagia [38]. The average time for screening is around 30 minutes. A long screening time (e.g., 57 minutes, as shown in the figure) keeps the testing room occupied longer and deprives other patients of having this test on time. In this particular case, possible solutions for controlling the time include preparing the test room before the arrival of patients, analyzing the long-running cases for better control, or providing more training for nurses on efficient practices in dysphagia screening. Time variation analysis of CP interventions for all CPs in a hospital helps healthcare managers improve healthcare provision and optimize hospital resources. Such detailed CP analysis is not possible with today's unstructured paper-based CPs.
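The kind of time variation analysis described above can be sketched in a few lines once each intervention has a recorded start and end time. The example below is illustrative only: the five screening records are invented (they are not the data behind Fig. 7), and flagging durations above 1.5 times the mean is an assumed rule chosen for the sketch, not a threshold taken from this work.

```python
from datetime import datetime, timedelta
from statistics import mean


def flag_long_interventions(records, factor=1.5):
    """records: list of (patient_id, start, end) for one intervention type.

    Returns the mean duration in minutes and the records whose duration
    exceeds factor * mean (an assumed, adjustable rule).
    """
    durations = [
        (patient_id, (end - start).total_seconds() / 60.0)
        for patient_id, start, end in records
    ]
    average = mean([minutes for _, minutes in durations])
    flagged = [(pid, minutes) for pid, minutes in durations
               if minutes > factor * average]
    return average, flagged


# Hypothetical dysphagia screening durations in minutes.
start = datetime(2020, 1, 6, 9, 0)
records = [
    (f"P{i}", start, start + timedelta(minutes=m))
    for i, m in enumerate([28, 30, 32, 57, 25], start=1)
]

average, flagged = flag_long_interventions(records)
print(f"mean duration: {average:.1f} min")
print("flagged:", flagged)
```

A robust statistic such as the median, or a per-ward baseline, could replace the mean when a few very long cases would otherwise distort the reference value.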

III. EFFECT OF SNOMED CT BASED e-CPs ON BIG DATA ANALYTICS IN HEALTHCARE
Data analytics algorithms prefer rich datasets. They work better when data are as complete as possible. Missing data have always been a challenge to obtaining good classification and prediction results from machine learning algorithms. This is more critical in the field of healthcare data analytics because patients' outcomes are sensitive to data collected in hospitals. CPs are intended to be one of the most important sources of patients' data. Since the framework addressed in this research aims to generate computerized CPs that are fully encoded with SNOMED CT terms, it contributes to reducing missing data and supplying rich CP-based datasets. This is achieved by making all CP data digitally visible and communicable with existing HIS (through SNOMED CT encoding, ontology-based modeling, and HL7 messages). We illustrate this contribution with machine learning experiments from the domain of hospital Length of Stay (LOS) prediction. LOS refers to the number of days that an inpatient stays in a hospital. LOS has long been a crucial metric of hospital efficiency and quality of care. The uncertainty of LOS increases costs and makes it difficult for hospitals to optimize their scheduling process [39]. The clinical and financial consequences of long LOS have made LOS one of the most observed measures in healthcare systems [40]. LOS predictions that are related to rehabilitation CPs (i.e., CPs applied to patients whose hospital stays are in rehabilitation facilities) suffer from the fact that many rehabilitation interventions are not stored in EHRs. As a result, EHR-based datasets yield less accurate LOS predictions. For example, stroke patients are initially treated in hospital emergency departments and stroke units; after that, they move to acute rehabilitation facilities and follow the rehabilitation CP during their hospitalization.
The challenge with rehabilitation CPs is that they contain many nursing care interventions. We refer to such interventions as ''soft'' interventions in this paper. By soft interventions, we mean interventions such as assisting with toileting. Although such interventions are specified on CPs, actually performed on patients, and documented on paper, they are rarely recorded in EHRs compared with what we term ''hard'' CP interventions, such as X-rays and surgical procedures [4]. As noticed by nurses and domain experts, in stroke patients, soft interventions have an effect on LOS prediction because patients who need more nursing care tend to stay longer.
Although terminology systems (like SNOMED CT and ICD-10) have advanced in recent years and now include standardized terms and codes for nursing care tasks, current CPs in use at hospitals still present most soft interventions as unstructured text with local terms. This is a major reason for soft interventions being missed in EHR-based datasets (i.e., datasets obtained from EHRs without capturing all data on CPs). The framework outlined in this research contributes to recording soft interventions by means of standardized e-CPs that can transfer all their data to HIS. This results in CP-based datasets that are richer in data. To illustrate this experimentally, we performed machine learning experiments on the prediction of rehabilitation LOS using a stroke CP and a CP-based dataset of stroke patients from TBRHSC, Ontario, Canada. The objective is to compare LOS prediction results between the CP-based dataset that includes nursing services and the same dataset without nursing services (EHR-based dataset) using the same machine learning algorithm on both datasets. Dataset preparation and experiments were performed with the help of stroke domain experts from the Regional Stroke Unit at TBRHSC, who assisted in the study with their knowledge of stroke patients' rehabilitation. The dataset contains rehabilitation admission data for 500 patients who had a stroke secondary to Carotid Artery Disease (CAD).
CAD is a chronic vascular disease characterized by the formation of plaques in the wall of the carotid artery, causing stenosis and impairing the flow of blood to the brain. In the case of plaque rupture, a clot of blood may form and detach, then move with blood to smaller brain vessels, potentially leading to an ischemic stroke [41]. In the dataset, each patient record contains several characteristics such as demographic data (age, gender, and ethnicity), disease history (sleep apnea, atrial fibrillation, diabetes, and hypertension), length of stay, medical history and habits (e.g., previous carotid artery intervention, alcohol consumption, smoking, type of stenosis, speech and language disorder), and the CP-based required nursing care interventions (e.g., assisting with toileting, brushing teeth, combing of hair, assistance with shaving). The LOS values were between four and nine days. The median LOS was five days. The objective of the data mining model is to predict short versus long LOS. We used the median LOS as the threshold dividing long vs. short LOS. Thus, patient records with LOS less than or equal to five days were labeled as short LOS while records greater than five days were labeled as long LOS. The initial dataset was imbalanced due to having more short than long LOS. This was the reason for considering the median (rather than the average) LOS as the threshold dividing short vs. long LOS since the median is the preferred measure for the central tendency for skewed data [42]. Furthermore, to avoid the drawbacks of dealing with an imbalanced classification problem, we used under-sampling techniques to train the classification model on balanced classes. Our LOS prediction example is a binary classification problem that is non-linear in nature, and this makes decision tree-based methods very suitable for this problem because they are successful in dealing with non-linear classification [43]. 
Furthermore, many researchers have reported that decision tree methods are successful in LOS prediction [44]- [47]. In general, decision tree algorithms use entropy-based methods to form tree nodes. This is done by selecting the most informative attributes based on two measures: entropy and information gain, as follows.
• Entropy (H) measures the impurity of a category or class (X), as shown in equation (1).
$$H(X) = -\sum_{x \in X} P(x) \log_2 P(x) \tag{1}$$
where P(x) is the probability of label x in X [43].
• Information gain measures the purity of an attribute based on the conditional entropy determined by equation (2) below.
$$H_{Y|X} = \sum_{x \in X} P(x)\, H(Y \mid X = x) \tag{2}$$
where $H_{Y|X}$ is the conditional entropy for each attribute (X) relative to the base entropy H(Y), which is the entropy of the output variable (LOS in our case). The information gain of an attribute X is defined as the difference between the base entropy and the conditional entropy of the attribute, as shown in equation (3).
$$IG(X) = H(Y) - H_{Y|X} \tag{3}$$
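The entropy and information gain measures defined above can be computed directly from label counts. The following Python sketch uses the standard base-2 definitions; the function names and the four-patient toy data are illustrative assumptions and are not taken from the C5.0 implementation used in the experiments.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def conditional_entropy(attribute_values, labels):
    """Entropy of the labels after grouping the records by one attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(group) / total * entropy(group) for group in groups.values())


def information_gain(attribute_values, labels):
    """Base entropy minus the conditional entropy for the attribute."""
    return entropy(labels) - conditional_entropy(attribute_values, labels)


# Toy data: LOS class for four hypothetical patients and two candidate attributes.
los_class = ["long", "long", "short", "short"]
needs_toileting_help = ["yes", "yes", "no", "no"]   # separates the classes perfectly
smoker = ["yes", "no", "yes", "no"]                 # carries no information here

gain_informative = information_gain(needs_toileting_help, los_class)
gain_uninformative = information_gain(smoker, los_class)
print(gain_informative, gain_uninformative)
```

A tree builder would evaluate every candidate attribute this way at each node and split on the one with the highest gain.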
Information gain compares the degree of purity of the upper node (parent node) before a split with the degree of purity of the lower node (child node) after a split. At every split, an attribute (or predictor) with the highest information gain is considered as the most informative attribute and is chosen for the split [43]. Among the most commonly used decision tree learning algorithms are ID3 (Iterative Dichotomiser 3), C4.5, and C5.0. ID3 algorithm has the drawback that it may construct a complex and deep tree that causes overfitting leading to poor prediction results. The C4.5 algorithm is an improved ID3 algorithm that addresses the overfitting problem in ID3 by using the technique of pruning to simplify the decision tree. Pruning is done by removing the tree nodes and branches that do not provide additional information [43]. C5.0 algorithm offers a number of improvements over C4.5, including faster processing and more efficient memory usage [48]. Another commonly used decision tree algorithm is CART (Classification And Regression Trees); however, preliminary experiments on our datasets showed that C5.0 has a slightly better overall performance than CART. Based on the above analysis, we adopted the C5.0 algorithm in our simulation experiments. As will be detailed below in the results section, experimental evaluation demonstrates that LOS prediction with fully automated CPs that result in rich datasets (CP-based datasets) outperforms traditional LOS predictions based on EHR-based datasets.
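The dataset preparation described earlier in this section, labeling records as short or long LOS around the median and then under-sampling the larger class, can be sketched as follows. This is an assumption-laden illustration in Python rather than the R code used for the experiments: the nine LOS values are invented, and random under-sampling with a fixed seed is one simple variant of the under-sampling techniques mentioned above.

```python
import random
from statistics import median


def label_by_median(los_days):
    """Label each stay 'short' (<= median) or 'long' (> median)."""
    threshold = median(los_days)
    labels = ["short" if days <= threshold else "long" for days in los_days]
    return threshold, labels


def undersample(records, labels, seed=0):
    """Randomly drop records from larger classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for record, label in zip(records, labels):
        by_class.setdefault(label, []).append(record)
    smallest = min(len(group) for group in by_class.values())
    balanced = [
        (record, label)
        for label, group in by_class.items()
        for record in rng.sample(group, smallest)
    ]
    rng.shuffle(balanced)
    return balanced


# Hypothetical LOS values (days) within the four-to-nine-day range reported above.
los_days = [4, 4, 5, 5, 5, 5, 6, 7, 9]
threshold, labels = label_by_median(los_days)
balanced = undersample(los_days, labels)
balanced_labels = [label for _, label in balanced]
print(threshold, labels.count("short"), labels.count("long"), len(balanced))
```

Under-sampling would normally be applied to the training portion only, so that the test set keeps the original class distribution.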

IV. RESULTS AND DISCUSSION
We implemented the experiments using the ''RStudio'' integrated development environment for the R programming language [49], [50]. To compare the results with and without full CP data, we performed identical processing and experiments on two datasets. The first dataset does not include the CP nursing services (EHR-based dataset), while the second dataset includes the CP nursing services (CP-based dataset). The datasets were split into training/testing sets. In order not to generalize the results from a single split, we conducted experiments with 70:30 and 80:20 training/testing split ratios. Furthermore, we diversified the performance metrics by including multiple common metrics: the area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, specificity, and precision. The last four are defined in equations (4), (5), (6), and (7), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$
$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{5}$$
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{6}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{7}$$
Figs. 8 and 9 show the experimental results. As shown in the figures, the performance of the prediction model that includes the CP nursing services is better than that of the model without the nursing services in terms of the considered metrics. The results show better AUROC for the CP-based dataset (≈ 88% and 93%) compared to the EHR-based dataset (≈ 78% and 84%) for split ratios 70:30 and 80:20, respectively. The most commonly reported measure of a classifier is the accuracy because it evaluates the overall efficiency. The results show better accuracy for the CP-based dataset (≈ 85% and 92%) compared to the EHR-based dataset (≈ 77% and 85%) for split ratios 70:30 and 80:20, respectively. The results show better performance for the CP-based dataset in terms of sensitivity and equal performance with the EHR-based dataset in terms of specificity. Sensitivity assesses the effectiveness of the classifier on the positive/minority class. In our experiments, this is the class of patients with long LOS. Thus, the CP-based dataset yields better long LOS prediction performance.
Specificity, on the other hand, measures the effectiveness of predicting negative cases (short LOS in our experiments). Since fewer nursing services are related to short LOS, it is reasonable that both datasets show equal specificity, i.e., equal prediction performance on patients with fewer nursing services. Precision is also called positive predictive value. The obtained results show that the CP-based dataset gives better precision under both training/testing split ratios.
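The confusion-matrix metrics discussed above (accuracy, sensitivity, specificity, and precision) follow from four counts. The sketch below computes them in Python with long LOS as the positive class, as in the experiments; the ten true and predicted labels are invented for illustration and do not reproduce the reported results.

```python
def confusion_counts(y_true, y_pred, positive="long"):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and precision from the four counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall on the positive (long LOS) class
        "specificity": tn / (tn + fp),   # recall on the negative (short LOS) class
        "precision": tp / (tp + fp),     # positive predictive value
    }


# Hypothetical test-set labels: 4 long and 6 short stays.
y_true = ["long"] * 4 + ["short"] * 6
y_pred = ["long", "long", "long", "short",
          "short", "short", "short", "short", "short", "long"]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
metrics = classification_metrics(tp, tn, fp, fn)
print((tp, tn, fp, fn), metrics)
```

The functions assume that each denominator is non-zero; a production version would need to handle a test set with no positive or no negative cases.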
The metrics mentioned above are the most used performance measures for such classification problems. However, since our datasets are slightly imbalanced, we decided to investigate additional metrics that account for imbalanced datasets. This helps in generalizing the results by considering performance metrics that combine the previous metrics. Therefore, we considered the Balanced Accuracy and the Geometric Mean (G-mean), which are common imbalance-oriented performance metrics [51], as shown in equations (8) and (9).
$$\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2} \tag{8}$$
$$\text{G-mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}} \tag{9}$$
Figs. 10 and 11 show the experimental results for the imbalance-oriented metrics. The balanced accuracy is the average of the sensitivity and the specificity; it measures the average accuracy obtained on the majority and minority classes. ''This quantity reduces to the traditional accuracy if a classifier performs equally well on either classes. Conversely, if the high value of the traditional accuracy is due to the classifier taking advantage of the distribution of the majority class, then the balanced accuracy will decrease compared to the accuracy'' [51].
Our results show that the traditional accuracy and the balanced accuracy have close values for both datasets, with the CP-based dataset showing better performance. This indicates good performance of both classifiers on the majority and minority classes, with the CP-based dataset yielding improved performance. G-mean is a metric suitable for imbalanced datasets because it measures the balance between classification performances on the majority and minority classes [51]. Our results show higher G-mean values for the CP-based dataset (≈ 86% and 92%) than for the EHR-based dataset (≈ 78% and 85%) for the 70:30 and 80:20 split ratios, respectively.
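The two imbalance-oriented metrics can be illustrated with a short Python sketch. The sensitivity and specificity values below are invented for illustration; the second case shows the behaviour described in the quoted passage, where a classifier that always predicts the majority class reaches high traditional accuracy but low balanced accuracy and G-mean.

```python
from math import sqrt


def balanced_accuracy(sensitivity, specificity):
    """Average of the per-class recalls."""
    return (sensitivity + specificity) / 2


def g_mean(sensitivity, specificity):
    """Geometric mean of the per-class recalls."""
    return sqrt(sensitivity * specificity)


# Case 1: a reasonably balanced classifier (hypothetical values).
ba_balanced = balanced_accuracy(0.80, 0.90)
gm_balanced = g_mean(0.80, 0.90)

# Case 2: 90 short and 10 long stays, classifier always predicts "short".
# Traditional accuracy is 90/100, but no long stay is ever detected.
accuracy_majority = 90 / 100
ba_majority = balanced_accuracy(0.0, 1.0)
gm_majority = g_mean(0.0, 1.0)

print(ba_balanced, gm_balanced)
print(accuracy_majority, ba_majority, gm_majority)
```

Because the geometric mean is zero whenever one class is never predicted correctly, G-mean penalizes a majority-only classifier more strongly than balanced accuracy does.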
As shown in the analysis above, the experimental results support the hypothesis that complete CP standardization and automation reduce data missingness in healthcare, resulting in rich datasets and contributing to improved performance of data analytics algorithms. It is worth mentioning that factors other than nursing services also affect stroke patients' LOS (e.g., comorbidity, diabetes, etc.). However, such data are common to both datasets in our experiments; thus, the only differentiating data are the nursing services available on the clinical pathway.

V. CONCLUSION
CPs are crucial components of healthcare systems and deserve further study regarding their automation and complete integration with HIS. In this research, we presented a summary of a CP automation framework based on three major components: SNOMED CT standardization of all CP data, ontology-based modeling, and HL7 communication. Our model enables detailed statistical analyses of CP data that help in optimizing hospital resources and providing better healthcare services for patients. Regarding big data analytics in healthcare, our research hypothesis is that the problem of missing data in HIS can be addressed by fully digitizing CPs and integrating them with HIS. This helps in generating rich datasets and improving the performance of data analytics algorithms in healthcare. To test this hypothesis, we standardized a hospital stroke CP and converted it into an e-CP that was integrated with a prototype e-CP system. Data from 500 real stroke patients were collected and simulated based on the standardized stroke e-CP in collaboration with the Regional Stroke Unit at TBRHSC, Ontario, Canada. Next, we conducted machine learning experiments on the resulting datasets. The experiments addressed the prediction of rehabilitation LOS.
The experimental results show that LOS prediction with the dataset that includes the required CP nursing services outperforms LOS prediction using the dataset that does not include the nursing services. Five common performance metrics were evaluated to reach this conclusion: AUROC, accuracy, sensitivity, specificity, and precision. Furthermore, since the datasets were imbalanced, we supported our analysis by considering two additional combined performance measures that account for the imbalanced nature of the LOS problem, namely, the balanced accuracy and G-mean. The experimental results using the imbalance-oriented performance metrics confirmed that LOS prediction using the CP-based dataset outperforms the prediction using the EHR-based dataset. The results can be explained by the fact that patients who are more severely affected by the stroke require more nursing services in stroke rehabilitation and thus show longer LOS. Therefore, datasets that include the full healthcare data available on CPs can guide machine learning algorithms more accurately towards improved predictive results.
The complete digitization and automation of e-CPs is a key contribution towards capturing all CP data. This allows detailed statistical analysis of CP contents and patients' treatments, and our experiments indicate that it is important for big data analytics in healthcare.
Currently, only EHR-based datasets are available for researchers. Obtaining CP-based datasets is a challenge since most CPs in use today are paper-based, and CP-based datasets can only be obtained by collaborating directly with cooperating hospitals. One of our future research directions is to collaborate with multiple hospitals to obtain more and larger CP-based datasets.
We consider this research a starting initiative towards the complete automation of CPs, and we hope it encourages more research in this area to achieve better CP automation, improved healthcare outcomes, enhanced patient satisfaction, and a healthier society.