Early-Stage Risk Prediction of Non-Communicable Disease Using Machine Learning in Health CPS

Cyber-Physical Systems (CPS) embed computation and communication capabilities at their core to regulate physical processes and to mediate seamlessly between the cyber and physical worlds for various control and monitoring tasks. Health CPS, a variant of CPS in the healthcare sector, acts as a health monitoring system that dynamically captures, processes, and analyzes health sensor data through integrated Internet of Things (IoT)-enabled cyber-physical processes. These systems can support patients who suffer from non-communicable diseases (NCDs) or who are at risk of developing them. Identifying the risk of NCDs, such as heart disease and diabetes, requires embedding artificial intelligence (AI) techniques into the core of health CPS. Recently, there has been growing interest in incorporating machine learning into CPS, which can facilitate the classification, detection, monitoring, and prediction of several NCDs. However, little published work focuses on early-stage risk prediction of these diseases. In this work, we propose a novel machine learning-based health CPS framework that addresses the challenge of effectively processing wearable IoT sensor data for early risk prediction of diabetes, as an example of an NCD. In the experiment, a verified diabetes dataset is used for training, while testing is performed on artificially generated sensor data. Experiments with several machine learning algorithms show the effectiveness of the proposed approach: the Random Tree algorithm achieves the highest precision, requires a minimum time of 0.01s to construct a model, and obtains 94% accuracy in predicting the probability of diabetes at an early stage.

NCDs are non-transferable and non-contagious [16], yet they are more life-threatening than contagious diseases. According to the World Health Organization (WHO), NCDs are responsible for 71% of all deaths globally. However, the risk factors and determinants of these diseases, commonly known as epidemiological factors, are modifiable and controllable [17]. For example, obesity is an epidemiological factor that can cause NCDs such as diabetes, stroke, hypertension, and kidney disease [18]. Therefore, the incidence of NCDs can be minimized by controlling these factors. Epidemiological factors of NCDs generally stem from physical inactivity, alcohol consumption, diet, and other conditions. Hence, pre-screening and preventive measures are key to responding to NCDs [19]. The value of health transformation, the empowerment of wearable sensors, and ML must be broadly acknowledged in the fight against losses due to NCDs [20], [21]. However, few research efforts in the MCPS or HCPS domain have addressed early-stage risk prediction of NCDs, which is important for improving people's health through precautionary measures.
Among the many healthcare applications supported by MCPS and HCPS, patient monitoring from various perspectives has been the dominant one [12], [22], [23]. These include remote patient observation [24], [25], activity monitoring [2], home health monitoring [26], heart health monitoring for cardiovascular disease [27], stroke detection [28], epilepsy detection [29], and obesity monitoring [30], to list a few. While such systems provide broad patient monitoring in a sensor-rich smart environment [24], [31], they are often used for disease classification and real-time alerting as a way of avoiding NCDs, without any emphasis on early prediction of such diseases.
The contribution of this paper is twofold: first, we propose a closed-loop ML-powered HCPS for early-stage risk prediction of NCDs, considering diabetes as an example; and second, we incorporate the concept of a verified training dataset and a dynamic test dataset, which paves the way for applying ML to real-time data from wearable sensors. To support these contributions, we used different types of ML classification algorithms, including Random Forest (RF), Decision Tree (DT), Naive Bayes (NB), Bayesian Net (BN), Multi-layer Perceptron (MLP), Support Vector Machine (SVM-PolyKernel and SVM-RBFKernel), Logistic Regression (LR), Random Tree (RT), AdaBoost, Bagging, and K-Nearest Neighbor (KNN). These algorithms have been suggested for diabetes prediction in existing work [32]-[35]. In our proposed work, a real dataset of early-stage diabetes predictability is used for training, whereas testing is carried out using an externally supplied test dataset, which is dynamically generated from sensors in a simulated environment. We conducted several experiments demonstrating that the proposed framework provides an effective mechanism for ML-based early prediction of NCDs.
The rest of the paper is organized as follows. Section II comments on the related work, while Section III provides the details of the proposed method. Section IV presents our experimental details along with a comparison and discussion of the results. Finally, Section V concludes the paper and highlights future work directions.

II. RELATED WORK
This section comments on existing work relevant to AI-based approaches such as deep learning (DL), ML-based approaches for smart health monitoring, AI-IoT convergence for healthcare, the healthcare Internet of Things [36], CPS for smart healthcare, ML-based CPS, MCPS or HCPS for NCD risk prediction, and, more specifically, ML for predicting diabetes risk with or without an HCPS context [26], [37].
The authors of a recent survey [7] highlight the importance of incorporating intelligence into MCPS. The study reveals that emerging health applications increasingly need to include machine intelligence to provide innovative and smart services. The authors further describe the conversion of raw physiological inputs into functions and how those are used in ML, analyze the suitable ML algorithms, and describe how decisions are made and propagated to the user. In [22], the authors introduce a detailed taxonomy for CPS in healthcare based on a comparative review of components and procedures. The taxonomy includes information about HCPS applications, architecture, sensing approaches, data handling, computation, communication, security, and control, which can be consulted when developing HCPS applications.
The authors of a smart healthcare framework [11] highlight the importance of incorporating Gaussian mixture model-based classification for voice pathology detection, the output of which is used by physicians for possible action. The findings of this study demonstrate how cloud and big data can improve the efficiency of healthcare systems and provide smart healthcare solutions for the population. However, this work does not describe any disease prediction mechanism other than voice pathology detection. QoS issues have been studied in the context of remote healthcare in [25]. The work discusses the resolution of QoS challenges in an urban healthcare big data system [38]. While it addresses the problems of healthcare and physical CPS systems, it does not explain how IoT sensor data can be analyzed intelligently for NCD prediction.
The author in [23] proposes a CPS that incorporates localization information in the sensing, analysis, and sharing of patient data for continuous health monitoring; however, the work gives no indication of risk prediction for any particular disease. In the area of general healthcare monitoring, the work in [24] shows a CPS implementation that monitors blood pressure (BP), blood glucose (BG), body temperature (BT), and heart beat rate (HR) based on embedded and cloud-based technology. This approach interconnects the communication, computation, and control aspects of CPS for continuous patient monitoring and actuates remote treatment when necessary.
The authors in [28] propose a CPS architecture for the timely detection of stroke, a common NCD, to minimize the risk to patients. The system analyzes electroencephalography (EEG) data, connects to a physician when it identifies a stroke occurrence, and sends an alert message to the concerned personnel. However, it does not focus on early prediction of stroke.
The processes of data collection, analysis, and visualization in CPS for cardiovascular disease have been demonstrated in [27]. The authors emphasize constant tracking of patients' heart function using a smartphone and a web-based interface. They equip the CPS with cardiac signal processing capabilities based on ML and a big data platform. Although the work does not elaborate on a specific prediction mechanism, it highlights the promise of machine and deep learning-based CPS to support the identification and prediction of NCDs.
The work in [39] proposes a model to predict, monitor, and control the risk of coronary heart disease in a CPS context. The authors use an ANFIS fuzzy inference system to identify the different levels of risk. They define more than 800 rules to determine the risk level, and considering additional attributes would require even more rules, thereby increasing the overhead. In contrast to this work, our proposed method introduces a verified training dataset against which an ML classifier is built to predict the early risk level of diabetes from wearable sensor data.
As a summary of the above work, Table 1 gives a comparative overview of key information from the reviewed papers. It is evident from Table 1 that existing work focuses on diverse aspects of smart healthcare, for example, the state of the art of CPS in smart healthcare, the architectural challenges of CPS, general health monitoring via HCPS, and the detection and monitoring of NCDs such as stroke and cardiovascular disease. The table also references some general ML-based research [32]-[35] that focuses on the diagnosis and prediction of diabetes from different datasets, although not in an HCPS context. We include that work to compare diabetes prediction accuracy.
As HCPS is currently at an early stage of adoption, existing work offers only limited experimental analysis. Although a recent work [39] does provide a detailed analysis for early prediction and monitoring of heart disease, more is needed to generalize a framework for the early prediction of other NCDs. The proposed work aims to contribute to this by providing an ML-enabled HCPS framework for the early-stage risk prediction of NCDs and demonstrates its objectives with the early prediction of diabetes.

III. PROPOSED METHOD
A. OVERVIEW
This research proposes an ML-powered HCPS for the prediction of diabetes. Unlike the traditional ML approach, which follows a longer training process with extensive pre-processing, the proposed approach omits or minimizes the pre-processing stage by introducing an epidemiological knowledge base. This includes a verified training dataset approved by medical practitioners and rules to extract health data from raw sensor data. For the testing phase, the proposal prescribes subsequent stages for obtaining a dynamic test dataset, which is produced from a combination of sensory and non-sensory data to fit the training data structure.
It should be noted that involving medical practitioners in creating the knowledge base does not in itself introduce delay in the training process. Rather, the verified training dataset makes the end-user application robust and reliable, while allowing low-computational ML algorithms to process raw sensory data in an IoT-embedded HCPS environment. The low-computational property of this approach results from applying a verified training dataset to train the classifier and using a dynamic test dataset for evaluation. The details of the system processes relevant to the training and testing phases appear in the following sections.

B. PROCESSES FOR TRAINING PHASE
1) TRAINING DATASET GENERATION
The core of generating a training dataset in HCPS is to define an epidemiology library [19] of disease risk factors from real patient data. The patient data can be collected from a direct pre-screening questionnaire or via other means, which are approved and overseen by healthcare practitioners, who also verify the class label of the data. The refined data are stored in a knowledge base. Practitioner-approved data have the potential to increase the level of acceptance of NCD risk prediction. To predict multiple NCDs through a single system, one potential solution is to construct the epidemiology library from electronic healthcare records (EHR), which are used in many tele-healthcare systems.

2) KNOWLEDGE BASE
A knowledge base includes verified datasets, ontologies, and rules to label data [19]. Examples are a risk ontology, a symptom and disease ontology, medical rules to determine attribute values, and other information that can be used repeatedly to serve data queries. The use of an epidemiological knowledge base can accelerate the performance of the classification system. The proposed system uses the knowledge base in both the training and the testing phases. In the training phase, the verified dataset from the knowledge base is used to train the classifier with several ML classification algorithms, while in the testing phase the knowledge base is used to assign rules and data labels and to extract features for predicting NCDs from sensor data.

3) TRAINED CLASSIFIER BUILDING
The verified training dataset is used to train the classifier. Several popular ML algorithms [41] are used for NCD prediction. The outcome of the training phase is the trained classifiers. These classifiers are used to evaluate sensory data in the testing phase.

C. PROCESSES FOR TESTING PHASE
The proposed method introduces several processes for the HCPS testing phase. The goal is to generate a dynamic test dataset from the raw sensor data and classify it to predict diabetes. A similar process can be followed for predicting other NCDs, depending on the type of data the system records and the classifiers it trains. The processes in this phase are depicted in Figure 1, and the data flow is marked with numbers to demonstrate a closed-loop model. The details of the processes are given as follows.

1) DETECTION OF HEALTH MONITORING SENSORS
The wearable network includes sensors of diverse types targeting different goals, such as health monitoring, disorder prediction, safety monitoring, home rehabilitation, activity monitoring, treatment assessment, and so on. Information about these sensors includes sensor type, id, record type, manufacturer, and service information. This information can be extracted using Python command lines or dedicated tools. To detect whether a sensor is a health monitoring sensor, information about each wearable sensor in a specific network is checked. If the model includes n wearable sensors, then $W = \{w_1, w_2, w_3, \ldots, w_n\}$ is the list of sensors in an environment. For each wearable sensor w, an information list is extracted as info_list = $\{i_1, i_2, \ldots, i_m\}$. An instantiation of info_list can be seen in Figure 2.
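The detection step can be illustrated with a minimal Python sketch. The field names (`sensor_name`, `id`, `sensor_data`), the contents of `HEALTH_SENSORS`, and the example sensors are assumptions made for illustration only; they are not taken from Figure 2.

```python
# Hypothetical predefined list of health monitoring sensor names.
HEALTH_SENSORS = {"heart_rate", "blood_pressure", "step_counter"}


def detect_health_sensors(wearables):
    """Return the info lists of sensors whose name is in the predefined list."""
    detected = []
    for info in wearables:
        # flag is 1 when the sensor is recognised as a health monitoring sensor.
        flag = 1 if info["sensor_name"] in HEALTH_SENSORS else 0
        if flag == 1:
            detected.append(info)
    return detected


# W: the wearable sensors in the environment, one info_list (here a dict) each.
W = [
    {"sensor_name": "heart_rate", "id": 1, "sensor_data": "bpm:60"},
    {"sensor_name": "fall_detector", "id": 2, "sensor_data": "fall:0"},
    {"sensor_name": "step_counter", "id": 3, "sensor_data": "step_count:350"},
]

health_sensor_info = detect_health_sensors(W)
print([s["sensor_name"] for s in health_sensor_info])
```

A set is used for the predefined list so that each membership check takes constant time regardless of how many sensor names it holds.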

2) SELECTION OF BIOMEDICAL CORRELATED DATA
The sensor readings are used to select biomedical correlated variables. In particular, the correlation between the sensor readings and the biomedical variables is assessed. Possible examples of biomedical correlated variables are beats per minute (BPM), sweat, step count, etc. To conduct this step, a pre-defined list containing the biomedical variables is checked. For example, if the list contains {bpm, step count, sweat, sleeping time} and the reading from a wearable sensor w is {bpm:60}, then the variable bpm is selected as a biomedical correlated variable.
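As a rough sketch of this selection step, the snippet below splits a reading at the ':' delimiter and checks the variable name against the pre-defined list. The function name and the assumption that every reading carries a single numeric value are illustrative; a multi-valued reading such as a blood pressure pair would need extra parsing.

```python
# Pre-defined list of biomedical variables, taken from the example in the text.
BIOMEDICAL_VARIABLES = {"bpm", "step count", "sweat", "sleeping time"}


def select_correlated_variable(sensor_data):
    """Split 'name:value' and return (name, value) if name is a listed variable."""
    idx = sensor_data.find(":")  # index of the delimiter, -1 if absent
    if idx == -1:
        return None
    name = sensor_data[:idx].strip()
    value = sensor_data[idx + 1:].strip()
    if name in BIOMEDICAL_VARIABLES:
        return name, float(value)
    return None


print(select_correlated_variable("bpm:60"))  # ('bpm', 60.0)
```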

3) DATA LABELING
Data labeling is performed by applying rules from the epidemiological knowledge base to the sensor data variables. The epidemiological knowledge base includes the features of NCDs and data from general healthcare records (marked as 3 and 4 in Figure 1). These records allow users to select epidemiological factors through the user interface in real time. For example, the knowledge base includes epidemiological factors (e.g., age, sudden weight loss, and palpitation) with rules. These epidemiological factors are the filtered features. This kind of knowledge-driven feature selection is a low-computational approach to feature selection.

[Algorithm 1 (fragment, steps 14-17): for i = 0 to health_sensor_info.length(), split each sensor data entry, which is in "data name: value" format; find() is a Python function that returns the index of a character in a string, and ':' is the delimiter separating the name before it from the value after it.]

At this stage, the core contribution of this research is introduced. The knowledge base, including the epidemiological factors of different NCDs, is applied to the sensor data variables. This assigns data labels dynamically to the regular sensor data. For example, if the weekly step count of a user is less than 2000, the value of the obesity feature will be yes.
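The step-count example can be sketched as follows. The threshold of 2000 weekly steps comes from the text; the function name, the daily granularity of the readings, and the sample values are assumptions for illustration.

```python
def label_obesity(daily_step_counts, weekly_threshold=2000):
    """Label the obesity feature 'yes' when the weekly step total is below the threshold."""
    weekly_steps = sum(daily_step_counts)
    return "yes" if weekly_steps < weekly_threshold else "no"


# Illustrative daily step counts for one week (not measured data).
week = [250, 300, 0, 410, 120, 200, 180]
print(sum(week), label_obesity(week))  # 1460 yes
```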

4) DYNAMIC TEST DATASET GENERATION
The dynamic test dataset requires data from smartphone apps as well as the labeled sensory data. The apps provide different information (e.g., age, gender) and other derived data, such as genital thrush from the frequency of drinking time. The non-sensory data and the dynamically labeled data, along with the filtered features, generate a dataset as the weekly record of a user. It should be noted that non-sensory features such as age and gender are used without any modification, as they match the feature format of the verified training dataset.
The process of generating test data dynamically is given in Algorithm 1. The defined class TESTING has two attributes: feature_name, which contains a feature name of the training dataset, and feature_value, which contains the value of that feature. The other defined class, RULE, has five attributes, which represent a rule in the knowledge base. An example instantiation of an object of this class is: r = RULE(match_param_name = 'bpm', operator = '>', match_param_value = 80, decision_param_name = 'Irritability', decision_param_value = 1).
The algorithm takes the health sensor information list as input and initializes the list T for output. In Python, a list is used as an array; therefore, we use the term list, and T is the list of TESTING instances. The contents of the sensor information are presented in Figure 2. The sensor data of each sensor are then split into two parts at a delimiter. For example, the sensor data 'bpm:60' are split so that bpm is the biomedical correlated variable and 60 is its value.
Then the biomedical correlated variable is checked against the match_param_name in the list of type RULE by the Compare function. If the rule is satisfied, then decision_param_name is stored as the feature name of a TESTING instance and decision_param_value is stored as the feature value. Finally, the list T is returned as the dynamically created test data record.
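The description above can be approximated by the following Python sketch. It is not the authors' Algorithm 1: the class and attribute names follow the text, but the `OPERATORS` table, the lower-case `compare` helper, the function `generate_test_record`, and the sample readings are assumptions for illustration.

```python
import operator as op
from dataclasses import dataclass


@dataclass
class TESTING:
    feature_name: str
    feature_value: int


@dataclass
class RULE:
    match_param_name: str
    operator: str
    match_param_value: float
    decision_param_name: str
    decision_param_value: int


# Assumed mapping from operator symbols to comparison functions.
OPERATORS = {">": op.gt, "<": op.lt, ">=": op.ge, "<=": op.le, "==": op.eq}


def compare(rule, name, value):
    """True when the variable name matches the rule and the condition holds."""
    if rule.match_param_name != name:
        return False
    return OPERATORS[rule.operator](value, rule.match_param_value)


def generate_test_record(health_sensor_info, rules):
    """Build the list T of TESTING instances from 'name:value' sensor readings."""
    T = []
    for sensor_data in health_sensor_info:
        idx = sensor_data.find(":")  # ':' separates the name from the value
        name, value = sensor_data[:idx], float(sensor_data[idx + 1:])
        for rule in rules:
            if compare(rule, name, value):
                T.append(TESTING(rule.decision_param_name, rule.decision_param_value))
    return T


rules = [RULE(match_param_name="bpm", operator=">", match_param_value=80,
              decision_param_name="Irritability", decision_param_value=1)]
T = generate_test_record(["bpm:95", "step_count:350"], rules)
print(T)
```

In this sketch a reading that satisfies no rule contributes nothing to T; how unmatched features are filled in the actual system is not specified here.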

5) EVALUATION
Like any ML approach, the proposed model includes the necessary evaluation process. To evaluate the performance of the classifiers, we considered widely accepted ML evaluation measures against the dynamic test dataset. Finally, an evaluated version of the ML services is provided to the end-user application. The following parameters are used to measure early-stage disease risk prediction. The results of the evaluation with these parameters are provided in a subsequent section.
-True Positive (TP) = NCD Risk identified correctly for those who are at risk.
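For illustration, the standard measures reported later (TP rate, FP rate, precision, recall, and F-measure) can be derived from the four confusion-matrix counts as sketched below. The definitions are the usual textbook ones for the positive (at-risk) class; the function name and the counts are assumptions chosen for the example and are not results from the experiment.

```python
def classification_measures(tp, fp, tn, fn):
    """Compute common measures for the positive (at-risk) class."""
    tp_rate = tp / (tp + fn)    # recall: at-risk cases correctly identified
    fp_rate = fp / (fp + tn)    # not-at-risk cases wrongly flagged
    precision = tp / (tp + fp)  # flagged cases that are truly at risk
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": tp_rate,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Illustrative counts only.
m = classification_measures(tp=45, fp=5, tn=40, fn=10)
print({k: round(v, 3) for k, v in m.items()})
```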

IV. EXPERIMENTAL ANALYSIS
A. EXPERIMENTAL SETUP
Following the method of this research, the experimental setup is divided into two phases. In the first phase, the training procedure is performed using a verified diabetes dataset [42], [43]. This dataset was collected with ethical approval and informed consent from real patients of a diabetic hospital. All data were collected from patients' prescriptions, in which a medical officer identified a patient as potentially diabetic. More specifically, patients who were recommended for a clinical test are classified as positive. The purpose of this dataset is to predict the likelihood of diabetes at an early stage from common signs and symptoms, so that the potential loss of life from diabetes can be minimized.
In the second phase, the test data are produced from a simulation and a prototype to evaluate the performance of different classification algorithms. For the simulated network of wearable sensors, a sensor network was constructed by modifying examples of the Cooja simulator on the Contiki operating system [44]. The sensors were modified using the Python programming language. Finally, the results are compared with existing work on diabetes risk prediction using ML.

B. EXPERIMENT FOR THE TRAINING PHASE
At this stage, a verified training dataset [42] is used as ground truth for training. There are 17 attributes in total, including one class attribute. Detailed information about the training dataset is provided in Table 2. The 16 attributes excluding the class attribute are taken from [43] and appear in Table 3. The distribution of the positive (clinical diabetes test prescribed) and negative (clinical diabetes test not prescribed) classes in the training dataset is depicted in Figure 3. It can be observed that the dataset includes class variation for all 16 attributes.
Following the attribute distribution, the classifier was trained on the training dataset with 11 classification algorithms: RF, DT, NB, BN, MLP, SVM (PolyKernel and RBFKernel), LR, RT, AdaBoost, Bagging, and KNN. The time to build the model for each classification algorithm is shown in Figure 4. It can be observed from Figure 4 that the minimum time taken to build a classifier is 0.01s, for NB, SVM (PolyKernel), LR, RT, AdaBoost, and Bagging. The maximum time, 1.82s, was required by the KNN algorithm. The time to build the model was recorded using the whole training dataset. The classifier build time was expected to be minimal owing to the use of the verified training dataset, although it may change in a different setup. However, the performance of the classification algorithms cannot be judged from training time alone; a detailed evaluation on the test dataset is therefore conducted and presented in the next section.
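How a model build time can be measured over the whole training set is sketched below. The count-based Naive Bayes trainer is only a stand-in so that the example is self-contained; the toy rows, the labels, and both function names are assumptions, and the timings it prints are unrelated to those in Figure 4.

```python
import time
from collections import Counter, defaultdict


def build_naive_bayes(X, y):
    """Count-based training for categorical features (stand-in classifier)."""
    class_counts = Counter(y)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            feature_counts[(j, label)][value] += 1
    return class_counts, feature_counts


def time_build(build_fn, X, y):
    """Return (model, seconds) for one call of build_fn on the full training set."""
    start = time.perf_counter()
    model = build_fn(X, y)
    return model, time.perf_counter() - start


# Toy rows of yes/no symptom attributes with a class label (illustrative only).
X = [["yes", "no"], ["no", "no"], ["yes", "yes"], ["no", "yes"]]
y = ["positive", "negative", "positive", "negative"]

model, seconds = time_build(build_naive_bayes, X, y)
print(f"build time: {seconds:.4f}s")
```

A single timed call is noisy on small datasets; repeating the call and taking the median would give a steadier figure.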

C. EXPERIMENT FOR THE TESTING PHASE
The testing phase includes several processes as explained before. Here we show the experimental aspects of the outcomes of these processes.

1) IDENTIFYING HEALTH MONITORING SENSOR FROM SENSOR PAYLOAD
The sensor information is updated by the payloads. There are multiple wearable sensors in the network for different purposes. Unlike basic sensors such as a gyroscope or accelerometer, the modified sensors are able to provide more meaningful information. Table 4 provides detailed information about the wearable sensors simulated for this work. The sensor_name represents the title of the sensor, which reflects its purpose. The timestamps represent the time of capturing the sensor records. The sensor_data provides the sensor readings. The sensors in Table 4 can be further described as follows.
-The heart rate sensor provides the bpm record.
-The eating sensor provides information about the number of food intake times.
-The skin rub sensor provides information about the number of times the skin is rubbed by a person.
-The blood pressure (bp) sensor provides information about systolic and diastolic pressure (systolic, diastolic) in mmHg.
In practice, smartwatches and fitness bands have embedded sensors that provide more meaningful information [45]. These sensors are often built from basic sensors, which collect regular sensing data such as pressure, temperature, movement, and location, and transform these data into more meaningful information such as water drinking time, food intake, and sleep time.
The sensor_name is matched with a pre-defined list of health sensors and when a match is found the flag is set to 1 in the program to represent it as a health sensor. Based on the information in Table 4, it can be seen that there are six sensors found as health sensors in the wearable sensor network as shown in Table 5.

2) OBTAINING BIOMEDICAL CORRELATED VARIABLE FROM SENSOR PAYLOAD
The sensor reading is matched with another list to identify the correlated variables. In Table 6, the identified variables from the sensor information can be seen, which are bpm, eat_count, step_count, drink_count, skin_rub_count and bp.

3) LABELING DATA
After extracting the variables from the sensors, rules from the epidemiological knowledge base are applied. The rules can be determined by domain experts or from existing studies, such as [46]. A sample of the rules for the biomedical correlated variables is tabulated in Table 6. For our simulation, these rules have been used to label the data of different attributes.

4) GENERATING DYNAMIC TEST DATASET
At this stage, the dynamic dataset for testing is produced using one week of data from the sensor network, the details of which are given in Table 7. More specifically, the features (attributes) used for predicting diabetes at an early stage in the diabetes dataset are selected as the filtered features. In the test dataset, some of the data are obtained from the sensor network and some from health applications. A health application was prototyped to collect user responses such as age, gender, and symptoms. Table 8 shows the source of the field value for each filtered feature.
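Combining the two sources into one weekly record can be sketched as follows. The four feature names are a small subset chosen for illustration, and the function name, the sample values, and the use of None for a feature with no source are assumptions; the actual mapping of features to sources is the one in Table 8.

```python
def build_weekly_record(app_data, labeled_sensor_features, filtered_features):
    """Merge app-provided and sensor-derived values in the training feature order."""
    merged = {**app_data, **labeled_sensor_features}
    # A feature with no value from either source for the week is left as None.
    return {name: merged.get(name) for name in filtered_features}


# Illustrative subset of the filtered features.
filtered_features = ["Age", "Gender", "Obesity", "Irritability"]
app_data = {"Age": 45, "Gender": "Male"}          # from the health application
labeled = {"Obesity": "yes", "Irritability": 1}   # from labeled sensor data

record = build_weekly_record(app_data, labeled, filtered_features)
print(record)
```

Iterating over `filtered_features` keeps the record in the same attribute order as the training dataset, which is what lets a trained classifier consume it directly.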
As per the dynamically generated test dataset, the data distribution in the dynamic test sets is shown in Figure 5, which matches the attribute list of Figure 3 and shows the variation in terms of class distribution.

5) EVALUATION
A thorough evaluation has been conducted on the dynamically generated test dataset to assess the classification algorithms for early diabetes prediction. To evaluate the performance in detail, several performance measures are considered, as discussed in Section III-C5. First, Figure 6 shows the correctly and incorrectly classified instances for each algorithm. It can be observed that RF, DT, MLP, RT, and KNN classified 91%-94% of the data correctly. The lowest rate of correct classification is 81%, by the NB and BN algorithms.
For more details, the confusion matrix for each algorithm is given in Table 9. The confusion matrix represents [...] properties a, 9, 10, 12. Here, a is the default value calculated as a = (attributes + classes)/2. In our case, a = (16 + 2)/2 = 9. This setup yielded a very small error per epoch of 0.00049. The first layer in this figure contains the input, and no computation is performed in this layer. The hidden layer includes computations to predict two classes.
The tree from the DT classification is depicted in Figure 8. The root of the tree is polydipsia, which then branches to polyuria and afterwards to reach the class attribute.
The comparison of the Kappa statistic and RMSE is provided in Figure 9. These two statistical measures are widely considered for ML performance evaluation. The Kappa statistic values of RF, RT, AdaBoost, and KNN are the closest to 1, which indicates the efficiency of these classification algorithms for this problem. On the other hand, the RMSE values of KNN, RT, MLP, and RF are the lowest, which further indicates the efficiency of those algorithms for the target prediction task.
To obtain a more detailed view of classifier performance, other accuracy measures, namely TP rate, FP rate, Precision, Recall, ROC area, and F-measure, are illustrated in Figure 10. The best values of these measures are approximately 0.93, 0.12, 0.94, 0.93, 0.93, and 0.93, respectively, achieved by multiple algorithms such as RF and RT. SVM performs worst among these algorithms. SVM (RBFKernel) has the highest FP rate at 0.23, followed by NB and BN at 0.22.
Based on the analysis above, it is evident that RT performed best in terms of model build time during training and accuracy measures during testing. RT takes 0.01s to build the classifier and exhibits a precision of 0.94.
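For reference, both statistics can be computed as in the sketch below, which uses the standard definitions of Cohen's kappa for a two-class confusion matrix and of RMSE between predicted probabilities and 0/1 class labels. The function names and the input values are assumptions for illustration, not experimental results.

```python
import math


def cohen_kappa(tp, fp, tn, fn):
    """Cohen's kappa: agreement beyond chance for a binary confusion matrix."""
    n = tp + fp + tn + fn
    observed = (tp + tn) / n
    # Chance agreement from the predicted and actual class marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)


def rmse(predicted_probs, actual_labels):
    """Root mean squared error between predicted probabilities and 0/1 labels."""
    squared = [(p - a) ** 2 for p, a in zip(predicted_probs, actual_labels)]
    return math.sqrt(sum(squared) / len(squared))


kappa = cohen_kappa(tp=45, fp=5, tn=40, fn=10)
error = rmse([0.9, 0.2, 0.6, 0.1], [1, 0, 1, 0])
print(round(kappa, 3), round(error, 3))
```

A kappa close to 1 means the classifier agrees with the true labels far more often than chance would predict, while a lower RMSE means the predicted probabilities sit closer to the true labels.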

D. COMPARISON WITH EXISTING WORK
Existing work mostly considers clinical datasets for diabetes prediction, not for early-stage risk prediction of diabetes. Different studies consider different datasets with diverse sets of attributes. Nevertheless, we use the common context of diabetes and ML to compare them with our work, which considers a dataset and corresponding attributes for early-stage diabetes risk prediction. A comparison of the proposed work with existing work is outlined in Table 10.
The best accuracy for each algorithm is highlighted in Table 10. It is evident from the table that in most cases our proposed work provides the best accuracy. The work in [35] comes next, providing the best accuracy for three algorithms, DT, NB, and SVM (PolyKernel), followed by [34] for the LR algorithm. The '-' sign in a table cell indicates that the corresponding algorithm is not used by the cited work. It can be observed from the table that each existing work individually used 3-4 classification techniques, whereas we analyzed our data with 11 classification techniques that have been popularly used for diabetes prediction in the literature. Therefore, the proposed work compares favorably in performance with the referenced work.

E. SUMMARY OF RESULTS AND DISCUSSION
Overall, it is evident from the experimental results that although multiple algorithms built the model in the minimum time of 0.01s, their accuracy varied significantly. For instance, the model is built in 0.01s by Bagging, AdaBoost, RT, LR, SVM (PolyKernel), and NB. However, the accuracy obtained by RF and RT is nearly 10% higher, at 94%, than that of SVM at 85.14%. RF provides the highest accuracy at 94.02% and a ROC area of 0.97, but the time required by RF to build the model is twice that of RT. Although both RF and RT obtain the same values for the TP rate, FP rate, Precision, Recall, F-measure, and ROC area measures, the shorter model build time makes RT the best algorithm for this experiment. This diversity of results provides interesting insights, such as: a) although several algorithms provide almost similar accuracy, the classification algorithms may require different training times; and b) for the prediction of NCDs, performance should be evaluated in both the training and the testing phases.
Overall, this research opens a new scope for early-stage NCD prediction with modified wearable sensors. The epidemiological knowledge base strengthens the approach by providing knowledge rules and an NCD dataset. Dynamic labeling can address the data labeling problem for classification in HCPS, one of the major problems in the HCPS research field. Moreover, this work points to the need for developing more advanced healthcare sensors, which can transform health monitoring systems into complete healthcare systems. An NCD risk prediction closed-loop system can predict the risk of developing a life-threatening disease (e.g., diabetes, thyroid disease, and stroke) at an early stage, so that the public health of any community can be improved significantly, which can extend the life span of individuals as well.

V. CONCLUSION
In this work, one of the least explored fields of healthcare, the early-stage risk prediction of NCDs through wearable technology in HCPS, has been studied. The use of a training dataset verified by medical practitioners has reduced the massive pre-processing stage of ML. In addition, a novel approach to dynamic test dataset generation from the raw data of IoT sensors has been introduced. The multistage conversion of heterogeneous IoT sensor data into a meaningful dataset opens a new door to predicting the risk of NCDs from low-level sensor data in HCPS. This has enabled the ML classification algorithms RF and RT to perform with 94% accuracy or more. Owing to the use of refined training data, the classifier build time with the training data becomes very low, at 0.01s. Also, the comparison of accuracy with existing work demonstrates that the proposed framework performs best for most of the classification algorithms considered. This work considers diabetes as an example of an NCD to demonstrate the novelty of the proposed mechanism. However, other NCDs such as stroke or thyroid disease can also be predicted with a proper epidemiological dataset, which can shape the further extension of this work.