Next-Day Prediction of Hypoglycaemic Episodes Based on the Use of a Mobile App for Diabetes Self-Management

Hypoglycaemia is one of the most common complications in diabetes, which can be life threatening if not managed appropriately. So far, research on hypoglycaemia prediction has been scarce, focusing on small cohorts linked to specific geographical regions, thus limiting the generalizability of the findings. In this paper, we developed and validated different machine learning models for next-day hypoglycaemia prediction in type 2 diabetes. We used a large international cohort comprising 669 participants, who had been regular users (for over a couple of years) of a mobile app for diabetes self-management and used common portable commercial devices for measuring their blood glucose and blood pressure levels, collecting in total 96121 observations (from which we extracted a balanced dataset of 2998 observations). Random Forests (RF), Support Vector Machines, Adaptive Boosting and Feed-Forward Artificial Neural Networks were employed to train predictive models based on 10-day temporal sequences with blood glucose and blood pressure measurements towards estimating next day hypoglycaemic episodes. We used a leave-one-subject-out (LOSO) approach for model validation, and found that RF achieved the best accuracy (0.814) and F1-score (0.812) with sensitivity (0.805) and specificity (0.824) for next-day hypoglycaemia prediction. The results of this study provide an expedient and reliable app-based approach to accurately predict hypoglycaemia in day-to-day life, thereby facilitating patient and care provider awareness and potentially preventing other serious complications.


I. INTRODUCTION
Diabetes is one of the leading causes of mortality and disability in the world [1]. The global diabetes prevalence in 20-79 year olds in 2021 was estimated to be 10.5% (536.6 million people), and it is expected to rise to 12.2% (783.2 million) by 2045 [2]. The aging population, along with other major risk factors such as obesity, has driven the higher prevalence of diabetes, leading to reduced life expectancy and costly complications [3], [4], [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Gyorgy Eigner.
Assisting people to make changes towards a healthier lifestyle and self-manage diabetes is an important strategy to maintain a good quality of life and avoid possible complications [6]. In particular, low blood sugar level, i.e., hypoglycaemia, is one of the most common barriers to achieving tight glycaemic control in Type 1 Diabetes Mellitus (T1DM) and Type 2 Diabetes Mellitus (T2DM), and it can be life-threatening if not treated quickly [7]. Research has shown that people with diabetes who had an episode of severe hypoglycaemia during the last 5 years had a 3.4-fold higher risk of death than those with mild or no hypoglycaemia [8]. The economic impact of hypoglycaemia, due to increased blood glucose monitoring, hospitalisations, medical contact, and absence from work, is enormous [9]. In this context, appropriate management of hypoglycaemia is of vital importance.
Mobile health (mHealth) has recently shown benefits [10] in delivering efficient medical care for chronic diseases anytime, anywhere [11], [12]. Mobile devices such as smartphones and portable medical devices with sensing and communication capabilities could be used by participants and their care providers to monitor patient health status continually, and thus reduce the probability of potential complications by providing adjusted therapeutic plans and improving patient self-management or remote medical management [13], [14].
Machine learning methods, which harness data generated through mobile and sensing devices, have already shown their capability to predict disease exacerbations and health status deterioration, e.g. exacerbation in chronic obstructive pulmonary disease [15], anxiety [16] and cardiovascular risk [17]. Therefore, machine learning could be a vehicle to facilitate hypoglycaemia management, by acquiring knowledge derived from patients' data and predicting the occurrence of hypoglycaemic episodes. Prediction of hypoglycaemic episodes could be useful for both patients and their care providers, because it may improve glycaemic control, reduce the possible fear or anxiety over facing hypoglycaemic episodes, and facilitate adherence to treatment [18], [19], [20], [21].
Related research on hypoglycaemia prediction based on mHealth data has been rather limited. Bertachi et al. [22] used the OhioT1DM dataset to predict hypoglycaemic episodes, based on Continuous Glucose Monitoring (CGM) data received from 6 individuals with T1DM for a limited period of 8 weeks. The same dataset (with the addition of 6 participants from the 2020 version of the OhioT1DM dataset) was also explored by Deng et al. [23], in order to test different neural network architectures. Along similar lines, Marcus et al. [24] used CGM data from 11 participants with T1DM for 50 days to test kernel methods for prediction of hypoglycaemia. Sudharsan et al. [25] built prediction models based on Random Forests (RF), Support Vector Machines (SVM), k-nearest neighbor, and naïve Bayes, harnessing 56K self-monitored blood glucose samples from a clinical trial with 163 T2DM participants over one year in Maryland, US. Other research studies have focused on hypoglycaemia prediction without considering mHealth data, instead acquiring data from sources such as electronic health records and health insurer databases [26], [27]. Furthermore, a recent review [28] has called for continued research into the development of accurate machine learning models to predict hypoglycaemia, considering also the lack of focus on T2DM and the inaccuracy of CGM in the hypoglycaemic range.
In this paper, our main objective is to develop and validate different machine learning models for next-day hypoglycaemia prediction in T2DM based on the use of a mobile app for diabetes self-management and regular consumer portable devices for measuring blood glucose and blood pressure. The ultimate aim of our work is to provide the means for accurate prediction of hypoglycaemia and the prevention of its complications, through the use of machine learning models within mHealth services provided for participants and their care providers in the real world.

II. DATA
A. MOBILE APP FOR DIABETES SELF-MANAGEMENT
The mobile app 'forDiabetes' (https://fordiabetes.app/) was developed in order to improve self-management and remote medical management of diabetes. The mobile app includes several functions such as a diary for recording measurements (blood glucose, blood pressure, meals, physical activity, medication, HbA1c, etc.), goal setting and editing, measurement graphs, and exchange of data with care providers. The mobile app also allows the automated recording of blood glucose measurements taken with popular consumer glucometers such as the Contour Next ONE, Contour Plus ONE, GlucoMen areo, and Beurer GL50. The mobile app is GDPR-compliant and has been available on Android and iOS (in both free and commercial versions) since May 2018. In addition to English, it has been translated into 10 languages, and it had received more than 40K downloads from users around the world by May 2022. Figure 1 presents the user interface of the forDiabetes mobile app.
VOLUME 12, 2024. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

B. DATA DESCRIPTION
The dataset was obtained from the database of forDiabetes and includes data from 2013 (the majority of data is from 2018) up to November of 2020. The dataset comprises:
• 808276 glucose records from 11165 different users
• 25025 hypoglycemic incidents (<70 mg/dL) from 1438 different users
• 180373 manual medication logs
• 5506 records of glycated haemoglobin (HbA1c)
• 65708 records of blood pressure (systolic and diastolic)
• 27160 records of weight
• 29735 meal records (carbohydrates and glycemic index)
• 113509 records of physical activity (type and duration, calories burned, distance)
From these users, only 998 provided glucose records over ten-day consecutive intervals. The mean number of ten-day consecutive intervals per patient with T2DM was 318 ± 371, with a range from 22 (patient with minimum data points) to 3958 (patient with maximum data points). The data was extracted from the database of the forDiabetes app as a set of text files in the Comma Separated Values (CSV) format, where each file corresponded to a database table.
In Figure 2 we present the geographical distribution of the mean of glucose records, and in Figure 3 we present an overview of the weighted hypoglycemic episodes per country.

A. DATA PRE-PROCESSING
We retained only the data of participants with at least one ten-day interval of consecutive glucose data, corresponding to 998 participants with 317549 glucose records and 33287 blood pressure records. The selection of the ten-day interval was based on the work by Sudharsan et al. [25]. The devices supported for recording glucose measurements were the following:

B. FEATURE EXTRACTION
Subsequently, we created temporal sequences of ten-day data for participants with T2DM using a sliding window technique, starting from the last glucose measurement of each user and moving to the first in steps of -1. For each glucose measurement selected as the current one, we took the previous 10 glucose measurements of the same meal type from the previous 10 days (that is, we selected only the observations measured at the same period of the day as the current observation, e.g. either all before lunch or all after lunch). The choice of using only 10 measurements for each prediction was inspired by other work [25], and we extended that approach by matching measurements of the same meal type. The 10 measurements of each window are not combined in any way; instead, each one is used ''raw'', i.e. as a feature that is input into the subsequent statistical learners. The current glucose measurement, in each iteration, is not added to the data but is instead used to calculate the hypoglycemic episode status for each observation. As this process is iterative, each participant contributes many time windows of ten-day data. The timestamps of the measurements are also added to the dataset. For example, suppose the current glucose measurement is 60 mg/dL and is taken before a meal. In this iteration, the previous 10 glucose measurements that were taken before a meal are added to the dataset, and the value of 1 is added as the episode status (as 60 < 70 mg/dL). Moreover, for each glucose record its source_id (1 <= source_id <= 10) is also added to the dataset. Source_id records the device used to take the measurement, where 1 denotes the user manually entering the measurement into the mobile app, 2 Apple Health, 3 Google Fit, 4 Fitbit, 5 Contour Next ONE, etc. (see Figure 5).
In addition, in the same (ten-day) time window for each iteration, the last 10 measurements of systolic blood pressure, as well as the last 10 measurements of diastolic blood pressure, are added along with their respective timestamps. If there are missing measurements, a value of −1 is added for each missing blood pressure measurement (but not for glucose: if the respective time window contains fewer than 10 previous glucose measurements, the window is dropped, so the method requires 10 glucose measurements for each 10-day window). For each observation a one-hot encoding vector of length 5 is added, encoding the meal type of the glucose measurements kept for each iteration. The five classes are: fasting, before meal, after meal, bedtime, other. Finally, for each observation we add the participant's age, weight and a glycated haemoglobin (HbA1c) measurement (the nearest previous measurement in the time window) along with the timestamp of the measurement (or 0 and −1 respectively if no such record exists), and we calculate the following ratio for each time window:
$$\frac{\text{sum\_of\_after\_medication\_glucose\_measurements}}{\text{number\_of\_glucose\_measurements}}$$
The full pipeline for the final inclusion of 96121 observations from 669 participants is presented in Figure 4. Before any further processing, we need to decide on a strategy to handle missing data. Only the 'age' variable was missing, in 15 participants (5157 samples); otherwise the design matrix has complete entries. Given that preliminary analysis (see the section 'Statistical analysis') did not reveal age to be statistically significantly associated with the outcome, we decided not to include it in the model and to retain the data from the 15 participants.
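The windowing step described above can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: the record layout (a tuple of day, meal type and glucose value) and the function name `build_windows` are assumptions, and blood pressure, timestamps, source_id and the other features are left out.

```python
from datetime import date, timedelta

HYPO_THRESHOLD_MG_DL = 70  # threshold stated in the text (<70 mg/dL)
WINDOW_SIZE = 10           # ten same-meal-type measurements per window
WINDOW_DAYS = 10           # look-back period in days


def build_windows(records):
    """Build (features, label) pairs from one participant's glucose records.

    `records` is a list of (day, meal_type, glucose_mg_dl) tuples; this
    layout is an assumption made for the sketch.
    """
    records = sorted(records, key=lambda r: r[0])
    windows = []
    # Walk from the last measurement back to the first, as in the text.
    for i in range(len(records) - 1, -1, -1):
        day, meal_type, glucose = records[i]
        earliest = day - timedelta(days=WINDOW_DAYS)
        # Earlier measurements of the same meal type within the look-back period.
        history = [
            g for (d, m, g) in records[:i]
            if m == meal_type and earliest <= d < day
        ]
        if len(history) < WINDOW_SIZE:
            continue  # windows with fewer than 10 glucose values are dropped
        features = history[-WINDOW_SIZE:]  # used "raw", one feature per value
        # The current measurement only defines the label; it is not a feature.
        label = 1 if glucose < HYPO_THRESHOLD_MG_DL else 0
        windows.append((features, label))
    return windows
```

With eleven consecutive daily before-meal readings, only the last one has a full ten-value history, so a single window is produced.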
Figure 5 presents the sources for the measurements that are retained in the final dataset, after the application of the preprocessing pipeline. The final number of the participants is 669 and their average age is 57 ± 11 years. No information was stored about the gender of the participants. Table 1 presents the demographics of the 669 participants whose countries had at least 10 participants included in the study. They are from 86 countries in total. The countries in Table 1 are sorted according to the number of participants in descending order. However, 24 participants had not defined their country in the app. We note these participants as Undefined in the table.

C. STATISTICAL ANALYSIS
We started data exploration using standard data visualization (probability density plots and scatter plots) and computed statistical correlations between each of the features and the binary outcome, to determine whether these relationships are statistically significant (at the p = 0.05 level) and to assess their statistical strength. We used the empirical rule of thumb in medical applications that correlation coefficients with a magnitude above 0.3 are deemed statistically strong [29].
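As a rough illustration of the univariate screening described above, the sketch below computes a correlation coefficient between one feature and the binary outcome and applies the 0.3 rule of thumb. The text does not state which correlation coefficient was used; Pearson's coefficient is assumed here purely for illustration, and the function names are hypothetical.

```python
import math

STRONG_THRESHOLD = 0.3  # rule of thumb quoted in the text


def pearson(x, y):
    """Pearson correlation between a feature and the (0/1) outcome."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def is_strong(r):
    """True when the magnitude of the coefficient exceeds the 0.3 threshold."""
    return abs(r) > STRONG_THRESHOLD
```

Significance testing at the p = 0.05 level is not covered by this sketch.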

D. FEATURE SELECTION
Extracting a large number of features (70 in this study, once age has been excluded) may be detrimental to the performance of statistical models and make findings challenging to interpret. According to Hastie et al. [30], advanced statistical learning algorithms are, in practice, typically fairly robust to the inclusion of potentially noisy or irrelevant features. However, identifying a smaller feature set always facilitates insight into the application by focusing on the key features contributing towards estimating the outcome [31]. Therefore, although in this study we do not have a very high-dimensional dataset, we nevertheless aimed to develop a parsimonious generalizable model with a succinct feature set. We used the new feature selection algorithm called relevance, redundancy and complementarity trade-off (RRCT), which was recently demonstrated to be extremely competitive across domains in 12 datasets when benchmarked against 20 state-of-the-art feature selection algorithms [32]. In brief, RRCT inherently accounts for the key elements towards identifying a robust information-rich compact feature subset, i.e. relevance (quantifying the statistical relationship of features with the outcome), redundancy (quantifying the statistical relationship between pairs of features in the selected subset), and complementarity (or conditional relevance, quantifying the conditional added value of joint feature sets over and above their univariate statistical association with the outcome). The features were selected using the strategy we have developed and explained in detail in previous work. The underlying concept is a voting strategy that aggregates the feature sets selected when the algorithm is presented with perturbed versions of the dataset, to ensure that the selection generalizes robustly [32], [33], [34].
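The voting idea can be illustrated with the sketch below. It is not the authors' procedure: bootstrap resampling is assumed as the perturbation, a simple majority count is assumed as the vote, and `mean_gap_selector` is a deliberately simple stand-in that is not RRCT. Any selector with the same call signature could be plugged in.

```python
import random
from collections import Counter


def mean_gap_selector(rows, labels, k):
    """Stand-in selector (NOT RRCT): rank features by the gap between class means."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        if not pos or not neg:
            scores.append(0.0)  # a resample with one class carries no signal
            continue
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(n_features), key=lambda j: (-scores[j], j))[:k]


def select_by_voting(rows, labels, select_fn, k, n_repeats=20, seed=0):
    """Aggregate the subsets chosen on perturbed (bootstrap) datasets by vote count."""
    rng = random.Random(seed)
    votes = Counter()
    n = len(rows)
    for _ in range(n_repeats):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        chosen = select_fn([rows[i] for i in idx], [labels[i] for i in idx], k)
        votes.update(chosen)
    # Keep the k most frequently selected features; ties are broken by index.
    return sorted(votes, key=lambda f: (-votes[f], f))[:k]
```

Features that are selected consistently across resamples accumulate votes, whereas features picked up by chance in a single resample do not.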

E. STATISTICAL MAPPING
We used state-of-the-art statistical mapping algorithms to develop a functional supervised learning model that maps the selected feature set from the preceding step onto the binary outcome. Specifically, we used: (1) RF [35], (2) Support Vector Machines (SVM), (3) Adaptive Boosting (AdaBoost), (4) XGBoost, (5) Feed-Forward Artificial Neural Network (ANN). We chose these methods as they are commonly used off-the-shelf classifiers that have been shown to be accurate in diverse supervised learning problems. Similarly to our previous studies, we explored different approaches towards optimizing the statistical learners' hyperparameters [36]. For the RF we explored optimizing performance using Breiman's recommendation, with half and twice the default recommended number of features over which to select features for the trees, and explored the use of 500 and 1000 trees. We used the 'Statistics and Machine Learning Toolbox' for MATLAB for RF, the scikit-learn implementation (for Python 3) for SVMs and AdaBoost, xgboost (for Python 3) for XGBoost, and tensorflow 2 for ANNs. We applied Z-score standardization to the features for all models except RF and the SVM. For the SVM we linearly rescaled the features to the [0, 1] range and used a Gaussian radial basis function kernel. We clarify that the scaling parameters were derived from the training subset only and then applied to both the training and the testing subsets. The regularization parameter C and the kernel coefficient γ were determined using a grid search where $C = [10^{-2}, 10^{-1}, \ldots, 10^{2}]$ and $\gamma = [10^{-1}, \ldots, 10^{1}]$. For AdaBoost, the learning rate hyper-parameter was explored in the range 0.01 to 0.5 in steps of 0.05 using 1000 trees with a maximum depth of 2. For ANNs we used a grid search over 2-layer to 4-layer networks with a first-layer node count of $[2^{5}, \ldots, 2^{9}]$, halving the node count in each subsequent layer. We used batch sizes of 5, 20 and 50 and tested 3 dropout configurations (0.2; 0.3; and a decrementing dropout ending with 0.2 on the final layer and increasing by 0.1 in each previous layer). We trained for 300 epochs with early stopping (on validation loss with patience = 7) using the Adam optimizer.
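The two SVM-related details above, the hyperparameter grid and the train-only feature scaling, can be sketched in plain Python as follows. The intermediate grid points are assumed to be successive powers of ten, and the helper names are hypothetical. The study itself used scikit-learn, which this dependency-free sketch does not attempt to reproduce.

```python
from itertools import product

# Grids quoted in the text: C from 10^-2 to 10^2, gamma from 10^-1 to 10^1.
# Spacing by powers of ten is an assumption.
C_GRID = [10.0 ** e for e in range(-2, 3)]
GAMMA_GRID = [10.0 ** e for e in range(-1, 2)]

# Every (C, gamma) pair that a grid search would evaluate.
param_grid = list(product(C_GRID, GAMMA_GRID))


def fit_min_max(train_rows):
    """Return per-feature (min, max) bounds computed from the training rows only."""
    return [(min(col), max(col)) for col in zip(*train_rows)]


def apply_min_max(rows, bounds):
    """Rescale rows with training-derived bounds.

    Test values outside the training range map outside [0, 1]; that is the
    expected consequence of not using any test-set information.
    """
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, bounds)]
        for row in rows
    ]
```

Fitting the bounds inside each training fold, rather than once on the full dataset, is what keeps information from the held-out data out of the scaling step.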

F. MODEL VALIDATION AND GENERALIZATION
The dataset in this study is highly unbalanced (94622/1499, total: 96121 samples), indicating that most observations did not involve a hypoglycemic episode, i.e. > 98% of samples are in the dominant class. Problems where a class dominates at that level are known to be particularly challenging for statistical learners, and hence we need to decide on a strategy towards the development and evaluation of the model. Given that a very large number of samples is available, for computational efficiency and practicality we created a balanced dataset comprising all 1499 samples from the non-dominant class and 1499 randomly selected samples from the dominant class.
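The balancing step amounts to random undersampling of the dominant class. A minimal sketch follows; the function name and the fixed seed are illustrative assumptions, not details taken from the study.

```python
import random


def balance_by_undersampling(samples, labels, seed=0):
    """Keep every minority-class sample plus an equal-sized random draw
    (without replacement) from the majority class."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]
```

Applied to the counts reported here, this would keep all 1499 episode samples and draw 1499 of the 94622 non-episode samples, giving 2998 observations.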
We clarify that all analysis was carried out on this balanced dataset, from the statistical analysis exploring associations to selecting a feature subset and statistical mapping. For the evaluation of the model performance we used the leave-one-subject-out (LOSO) approach because this is how we envisage this tool would likely be used in practice: we wanted to evaluate how well we might expect the model to generalize to new unseen people. We report different performance measures to assess model generalization, including confusion matrices, balanced accuracy, sensitivity, specificity, and F1-measure.
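A LOSO split holds out all observations of one participant at a time and trains on everyone else. The generator below is a minimal sketch of that idea; the name `loso_splits` and the use of a flat list of subject identifiers are assumptions made for illustration.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    `subject_ids[i]` is the participant that observation i belongs to.
    """
    for subject in sorted(set(subject_ids)):
        test_idx = [i for i, s in enumerate(subject_ids) if s == subject]
        train_idx = [i for i, s in enumerate(subject_ids) if s != subject]
        yield subject, train_idx, test_idx
```

Because no participant appears on both sides of a split, the resulting score estimates performance on people the model has never seen, which is the usage scenario described above.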

IV. RESULTS
We start our exploration by assessing statistical strength using correlation coefficients. We found that correlations (between the features used in the study) were generally relatively weak (univariately, no feature is statistically strongly associated with the binary outcome). Nevertheless, some correlations had a magnitude of about 0.2 (see Figure 6), which inspires confidence that the features, when considered jointly in a statistical learning model, may lead to good predictions.
Figure 6 presents the correlation coefficients for all features, whereas Table 2 presents the selected feature subset in descending order of importance. The five most important features were glucose_-4, systolic_-3, meal_relatedHot_vec_1, meal_relatedHot_vec_3, glucose_-1. Figure 7 presents the balanced accuracy using LOSO as a function of the number of features presented into RF. We note that performance is very stable (in terms of using 5-30 features explored herein) and is optimized with 25 features, reaching 0.814 balanced accuracy. Figure 8 presents the confusion matrix of the RF model comprising the 25 features. As can be observed from the confusion matrix, the misclassification rate between the two classes is similar.
Table 3 presents the balanced accuracy, sensitivity, specificity and F1-score for each model. The highest scoring model (in terms of balanced accuracy and F1-measure) is the one based on RF comprising the 25 features selected using RRCT, with a balanced accuracy of 0.814, a sensitivity of 0.805 and a specificity of 0.824. Finally, Figure 9 presents the RF importance scores to obtain an overall impression of the actual contribution of each of the selected features towards estimating the two classes. As can be observed from the figure, some features were more important than others in estimating the binary response. The feature importance for RF could be grouped in 3 clusters, with cluster #1 containing all features with RF importance > 100, cluster #2 with feature #12 having an RF importance of around 80, and cluster #3 containing all features with RF importance scores of <= 20. For convenience we present these results using boxplots, which also show the variability in the internal RF importance weights arising from the different LOSO repetitions.
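The reported measures all follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name is hypothetical, and the paper's own confusion matrix counts are not reproduced here.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard measures from confusion-matrix counts (positive = hypoglycaemic episode)."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
    }
```

As a consistency check on the reported figures, averaging the sensitivity and specificity gives (0.805 + 0.824) / 2 = 0.8145, in line with the reported balanced accuracy of 0.814 up to rounding.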

V. DISCUSSION
We presented a machine learning approach for next-day prediction of hypoglycaemia in daily life. Our primary finding is that prediction of hypoglycaemic episodes for T2DM based on mHealth data, such as blood glucose records and blood pressure measurements captured by widely available mobile devices for self-monitoring, is accurate.
The predictive outcomes relied on an international cohort which used a mobile app for diabetes self-management in the real world, in contrast with previous single-centre studies restricted to small geographic areas [25], [37], [38]. Furthermore, other works have focused mostly on CGM or data derived from electronic health records [28], and not data captured from ordinary mHealth devices used in everyday life. To the authors' knowledge, this is the first study to assess next-day prediction of hypoglycaemia in an international cohort.
We compared the performance of different machine learning models in this study. RF proved to be empirically superior to SVMs, Adaptive Boosting and Feed-Forward Artificial Neural Networks. We do not have a theoretical justification for this finding; however, we note that in our experience RF has often worked well in complicated practical settings, thus providing further evidence to support the notion of it being the best off-the-shelf classifier, as indicated by Hastie et al. [30].
The highest scoring model, based on RF with the 25 features selected with RRCT, reached a balanced accuracy of 0.814 and F1-score of 0.812 (with 0.805 sensitivity and 0.824 specificity). From the original features we selected features that had a more direct impact on hypoglycaemic incidents. Detailed information on medication, meals and physical activity (see the section 'Data description') was not utilised, because we found it to be of variable quality, depending on how consistently participants entered that information. The glucose records table from the database of 'forDiabetes' contained columns which recorded the measurement's proximity to medication, physical activity and meals, i.e., whether the measurement was recorded after taking medication, having a meal or exercising. Two of these features were used in the original 71, that is, the ratio of glucose measurements that were taken after medication to the glucose measurements of each time window, and the one-hot vector encoding the meal type of the glucose measurements of each time window (i.e., whether the measurements were taken after a meal or before a meal). The column containing information correlating the measurement with medication use was not included in the original 71 features as it was highly imbalanced (after-medication samples were around 10% of the total samples).
In contrast with previous studies, our study relies on a large international cohort of individuals who have downloaded and used the mobile app in their daily lives. The dataset we used was imbalanced with regard to hypoglycemic incidents, as there were only 1499 cases of hypoglycemic incidents out of 96121 observations (indicating that most observations did not involve a hypoglycemic episode, i.e. > 98% of samples in the dominant class). This created the need for a balanced dataset. We repeated the analysis twice, each time using a different randomly selected subset of samples from the dominant class and following the methodology described. We found that the reported out-of-sample performance was very similar, which inspires confidence that the developed model will likely generalize well to new unseen data. However, additional studies in the real world are required to confirm our findings and accumulate robust evidence.

VI. CONCLUSION AND FUTURE WORK
We demonstrated that accurate and practical next-day hypoglycaemia prediction is feasible using real-world data from a custom-built diabetes-specific smartphone application. Therefore, the current work has enormous potential to enable day-to-day glycaemic control by participants with diabetes in the community, empower individuals to monitor potential problems, and facilitate the optimization of type 2 diabetes therapeutic management by care providers.
In the future, we plan to expand the work presented by including a more diverse dataset covering various ethnicities, age groups, and co-morbid conditions, to improve the robustness and universality of the findings and to address potential biases in machine learning models in healthcare. Moreover, we plan to compare other feature selection methods against RRCT, as well as conduct a more thorough analysis of the model inaccuracies to understand the limitations.

FIGURE 1. Screens to view recordings (blood glucose, blood pressure, meal, medication, physical activity) in forDiabetes mobile app.

FIGURE 5. Devices used for recording glucose measurements for the selected data.

FIGURE 6. Correlation coefficients for all 70 features used in the study.

FIGURE 7. Errorbar depicting balanced accuracy as a function of the number of features presented into RF.

FIGURE 8. Confusion matrix along with probabilities of class estimates with the RF model comprising the 25 features selected using RRCT (C indicates control, D indicates diabetes).

FIGURE 9. Random forest feature importance using the selected feature subset of 25 features. Higher value indicates that the corresponding feature contributes more towards the estimation of the binary response (detecting a hypoglycemic episode).

TABLE 2. Selected features using RRCT in descending order of importance.