Evaluation of Offline Reinforcement Learning for Blood Glucose Level Control in Type 1 Diabetes

Patients with Type 1 diabetes must closely monitor their blood glucose levels and inject insulin to control them. Automated glucose control methods that remove the need for human intervention have been proposed, and reinforcement learning has been used recently as an effective control method in simulation environments. However, its real-world application would require trial and error interaction with patients. As an alternative, offline reinforcement learning does not require interaction with humans and initial studies suggest promising results can be obtained with offline datasets, similar to classical machine learning algorithms. However, its application to glucose control has not yet been evaluated. In this study, we evaluated two offline reinforcement learning algorithms for blood glucose control and discussed their potential and shortcomings. We also evaluated the influence on training and performance of the method that generates the training datasets, as well as the influence of the type of trajectories used (single-method or mixed trajectories), the quality of the trajectories, and the size of the datasets. Our results show that one of the offline reinforcement learning algorithms evaluated, Trajectory Transformer, is able to perform at the same level as commonly used baselines such as PID and Proximal Policy Optimization.


I. INTRODUCTION
Type 1 diabetes (T1D) is an autoimmune disorder involving the destruction of the insulin-producing β cells of the pancreatic islets of Langerhans, which results in insulin deficiency. Without enough insulin, glucose cannot enter the cells to be transformed into energy. People with T1D need to monitor their blood glucose (BG) levels regularly and take insulin to keep them within a normal range. Higher (hyperglycemia) or lower (hypoglycemia) blood glucose levels can cause serious health problems such as blindness, kidney failure, or heart attack. There are several insulin delivery methods, both manual and automated. The usual insulin delivery method to manage glucose levels is the basal-bolus (BB) regime, which involves taking insulin before meals and at bedtime. A continuous glucose monitor (CGM) is a device that measures glucose levels in real time. A CGM typically consists of a small sensor that is inserted under the skin, a transmitter that sends the data to a receiver or smartphone, and an application or other interface that displays the glucose levels. Even combined with a CGM, the disadvantage of BB is the need for manual injection several times per day, which is burdensome, especially for children when they are at school [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Hong-Mei Zhang.
As a solution, several methods for automated glucose control have been developed. Methods that completely remove the need for human intervention are usually called closed-loop control or artificial pancreas (AP) systems. In addition to a CGM, those systems include an insulin pump and a method to regulate the injections, that is, a control algorithm. The control algorithms usually employed are proportional-integral-derivative (PID) controllers [2] and model predictive controllers (MPC) [3]. Both algorithms are effective and widely used [4]. In particular, the PID controller is the most used in both commercial products and research because of its simplicity and robustness [5]. However, these methods are sensitive to external factors such as food intake, exercise, and illness, which affect the control effectiveness [6], [7].
Recently, machine learning (ML), including reinforcement learning (RL), has gained attention in diverse domains such as finance, robotics, computer vision, and language recognition. ML predictive models can be applied to time series data to understand changes in glycemic state and determine the amount of insulin to deliver. Reinforcement learning is a branch of ML that lets an agent learn by interacting with an environment, which in our case is the simulation of a virtual patient [8]. RL is being applied in diverse domains, including robotic rehabilitation [9], aircraft maintenance [10], and electric vehicle battery lifetime prediction [11]. The RL agent gathers rewards from the outcomes of its actions and uses them to learn to make better decisions. Thus, RL algorithms can use physiological data gathered from CGM systems to train the agent. However, this RL process, called online RL, requires extensive trial-and-error interaction with the environment, the real patient in this case, which is clearly not safe at the moment. Therefore, online RL has so far been used successfully to automatically control BG [12], [13], but only in in silico tests, and there is no clear way of bringing it to clinical trials because of the high risk involved when working with real patients.
In contrast, offline RL [14], a recent approach, could solve that problem. Offline RL requires only previously collected data to make an agent learn a policy for a particular environment. These data can come from real measurements taken from patients. Thus, this approach does not involve actual interaction with the environment (the patient) during the training phase. Offline RL methods have been used in various applications such as marketing [15], web user interfaces [16], sport strategy planning [17], healthcare [18], [19], and T1D blood glucose control [4], [20]. Offline RL is particularly suitable for time series data, such as blood glucose data, due to its ability to learn from historical sequences and capture the temporal dependencies and patterns present in the data.
One key advantage of offline RL for blood glucose control is its ability to handle non-stationary environments. Blood glucose levels can vary significantly over time, and offline RL algorithms can adapt to these changes by learning from the entire historical sequence. Offline RL can leverage recurrent neural networks (RNNs) or transformers. These models can capture long-term dependencies and accurately represent the sequential nature of blood glucose measurements, leading to more accurate predictions and decision-making. This enables the agent to capture the dynamics of the underlying system and make appropriate decisions even in the face of changing blood glucose patterns.
The other key advantage is that offline RL avoids exploration through interaction with the environment. Since offline RL algorithms learn from a pre-collected dataset, which may have undergone extensive safety checks, the risk of dangerous or harmful actions during the learning process is reduced. This is particularly important in the context of blood glucose control, as patient safety is paramount.
On the other hand, offline RL poses some challenges, including distribution mismatch, biased behavior, sample complexity, off-policy evaluation, and practical deployment. Addressing these challenges requires the development of robust algorithms, novel techniques for policy evaluation, and careful consideration of safety and deployment. Indeed, the suitability of offline RL for BG control has only begun to be discussed in the literature [4], which suggests that certain offline RL algorithms may be a feasible alternative to online RL, although this has not been clearly established yet. Therefore, our first contribution in this paper is to show that some additional offline algorithms can actually perform at the level of online ones for BG control.
Moreover, in addition to the algorithms used, the importance of the implementation details is recognized for both online RL and offline RL [12]: online RL requires designing or selecting the state space, the reward function, and other factors, while offline RL, in addition to those choices, requires careful selection of dataset trajectories, as we will discuss in the paper. That is, while for simple environments the states are clearly defined, for most real problems, including BG control, this is actually a design decision. For example, as discussed in [12], one can use as input state just the last BG sample, a sequence of past BG samples, or a combination of past BG samples and injected insulin doses [13]. The design of the reward function is also a crucial step.
Therefore, a second contribution of our work is to explore and discuss part of this available design space. Our findings and lessons learned will be valuable for other researchers, enabling them to focus on other key aspects, which should save testing time, especially considering that training offline RL agents is a highly time-consuming and resource-intensive task.
In summary, the contributions of this work are:
• An evaluation of offline RL as a method for effective blood glucose (BG) control.
• An assessment of the potential and shortcomings of offline RL algorithms for data-driven BG control.
• A comparison of their performance against online RL and PID baselines.
• An exploration of several factors influencing the learning ability of offline RL agents, including the dataset size and its quality.
• Extensive evaluation of different dataset types, sizes and selection approaches.
• The identification of the importance of careful data selection for training offline RL agents.
To facilitate results repeatability, the trained agents, as well as the baseline data and the datasets generated for training for this paper, are available on the open science framework repository [21].
In the remainder of this paper, we first review glycemic control methods and related works. Afterwards, we describe our experimental setup and data generation, and the results of our tests. Next, we discuss our findings and potential next steps. Finally, we provide concluding remarks.

II. BACKGROUND AND RELATED WORK

A. T1D SIMULATION AND MODELS
For safety reasons, biomedical experiments with machine learning algorithms are first carried out and evaluated in silico through computer simulation. Currently, there are several T1D simulators available, in both free and paid versions, for instance, AIDA [22], the Type 1 Diabetes Virtual Patient Population (T1D-VPP) [23], and the UVA/PADOVA simulator [24]. AIDA is free software that simulates human plasma insulin and blood glucose for education and research purposes. T1D-VPP involves single-hormone (SH) and dual-hormone (DH) mathematical models, which generate a virtual population of T1D patients and model the effect of exercise on the glucoregulatory system.
The UVA/PADOVA simulator was originally developed in 2007. It is the first in silico T1D model approved by the United States Food and Drug Administration (FDA) [24]. The simulator provides virtual patients in three age groups: adults, adolescents, and children, with 10 patients per group in the free version. In this paper, we use SimGlucose, an open-source Python implementation of the UVA/PADOVA simulator [25], previously used in similar studies [8], [13], [26], [27], which can be seamlessly integrated with multiple machine-learning libraries.

B. METHODS FOR GLYCEMIC REGULATION
T1D typically develops in children or young adults and requires lifelong treatment with insulin injections. Several insulin regimes are used to control blood sugar. The traditional ones involve one or two injections per day, but patients must keep their food intake constant across the three daily meals. Multiple daily injection therapy, or basal-bolus (BB), offers more flexibility in diet and dosage, but patients still need to control carbohydrate intake and insulin injections [28]. Automatic insulin pumps with integrated continuous glucose monitors (CGMs) have been developed to alleviate the burden of glycemic control and deliver optimal insulin according to current blood glucose levels, allowing patients to live independently without having to worry about delivering insulin. A system that does not require any human intervention is usually called a closed-loop controller or Artificial Pancreas (AP). Currently, most of the commercially available insulin pumps use a PID (proportional-integral-derivative) algorithm to control blood sugar levels. A PID controller is a control system that uses feedback to adjust a system's output in order to achieve the desired outcome. In the context of blood glucose control, a PID controller is used to regulate the release of insulin in order to maintain a stable blood glucose level [29]. The proportional component of the PID controller adjusts the output based on the current error between the desired and actual blood glucose levels, the integral component considers the accumulated error over time, and the derivative component anticipates future errors based on the current rate of change. By combining and tuning these three components, PID controllers can regulate blood glucose levels, but they usually have difficulty adapting to disturbances in food intake and need to be customized to individual patients [13], [30].
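As an illustration of the three PID components described above, the following is a minimal sketch of a textbook discrete PID controller in Python. It is not the controller used in any commercial pump nor the PID-IF baseline of this paper: the class name, the gains, the sign convention (positive error when BG is above the target) and the clamping of negative doses to zero are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDController:
    """Textbook discrete PID; gains and sign convention are illustrative."""
    kp: float
    ki: float
    kd: float
    target: float                      # desired BG level in mg/dL
    dt: float = 1.0                    # sampling interval (arbitrary units)
    integral: float = 0.0
    prev_error: Optional[float] = None

    def step(self, bg: float) -> float:
        # Positive error means BG is above the target, so more insulin is needed.
        error = bg - self.target
        # Integral term: accumulated error over time.
        self.integral += error * self.dt
        # Derivative term: current rate of change of the error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        dose = self.kp * error + self.ki * self.integral + self.kd * derivative
        # An insulin dose cannot be negative (assumption of this sketch).
        return max(dose, 0.0)
```

For instance, `PIDController(kp=0.01, ki=0.0, kd=0.0, target=112.517).step(150.0)` returns a small positive dose, while any BG reading below the target yields zero.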
ML has recently been gaining momentum in AP research. In principle, ML algorithms can be used in the field of blood glucose control to develop systems that automatically regulate blood glucose levels according to individual needs. As with other data-driven methods, the idea is to collect labeled data from CGMs and other devices and train an ML model. Through the training process, ML algorithms would ideally identify patterns and trends in order to learn how to predict blood glucose (BG) levels and adjust insulin levels accordingly. At this point, there are several alternatives. The first one is to use the ML model just to predict the expected BG level ahead of time and then use some other method to decide the insulin dose required to keep BG at the desired level. However, the human response to insulin is highly non-linear, and it is also difficult to predict the response to the insulin injection. Therefore, another alternative is to also learn that response with ML methods. For this purpose, Reinforcement Learning (RL) could be used, since the ML agent directly learns the appropriate action (insulin dose) to take given a certain input state (the patient's BG history). This is what traditional RL (also called online RL) does, by letting the agent interact with the environment and receive a reward as a result of this interaction. By maximizing the cumulative rewards, the agent effectively learns how to adjust insulin levels. That is, by learning through trial and error, the agent could potentially develop effective strategies for maintaining healthy blood glucose levels over time. This method has successfully been used to automatically control BG levels in in silico trials, outperforming PID-based methods [12], [13].
However, the main drawback of this approach is that it is not clear at all how to apply it to real patients, that is, how to transfer the learning from the in silico environment to real patients. Although data (BG level, physical activity, etc.) can be automatically collected from real patients using electronic devices, RL agents still need to experiment with the patient response in order to learn.
To solve this issue, a more recent approach, called offline reinforcement learning, has emerged. In offline reinforcement learning, the agent is not able to receive any feedback from its environment during the learning process, and must instead learn only from previously collected data [14]. This means that the agent must learn to make decisions based on the information that is available (no exploration). Note that the main difference with other ML methods is that with offline RL the actions and rewards are also given as input data. For example, a typical supervised ML algorithm uses collected BG levels (as well as other context data) to train and is able to predict the next BG level, given a certain input BG history. In contrast, to train an offline RL agent we need to use BG levels, actions taken and observed rewards, and, once trained, it is able to predict the required action, given a certain BG history as input.
The advantage is that offline RL is useful in situations where it is not possible or practical to experiment with the environment, such as when working with historical data or in safety-critical environments. As a drawback, note that, although it removes the need to interact with the environment to learn, it still leaves open the question of how to collect the required states, actions and rewards for training, which is not obvious in many practical situations. In this paper, since we can collect those data from simulations, we temporarily set this question aside and focus on evaluating how effective offline RL is for BG control. Let us finally remark that the value of offline RL lies in its ability to generalize effectively, that is, to apply the appropriate action to an input not previously seen in the training dataset. In other contexts, ML has proved to be very effective at generalizing [31], but to the best of our knowledge, the generalization performance of offline RL for BG control has only begun to be discussed in the literature [4]. Our goal in this paper is to evaluate it and discuss factors that may have an influence on the learning and prediction performance.
In particular, we evaluate the following offline RL algorithms: Decision Transformer (DT) [32] and Trajectory Transformer (TT) [33]. Both of them approach offline RL as a sequence modeling problem, that is, the agent is trained with sequences of observations, actions and rewards (trajectories) and its goal is to generate sequences that result in high rewards. We summarize their features:
Decision Transformer (DT) [32]:
• Uses the return-to-go ($\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$) instead of rewards
• The input of DT is a subset of the trajectory τ consisting of the K most recent time steps
Trajectory Transformer (TT) [33]:
• Uses discretized states and actions as input, along with a scalar reward
• Augments the trajectory with the return-to-go as in DT and employs a beam search algorithm for planning [34]
Both DT and TT use a transformer network as the architecture for action prediction. The transformer is a type of deep learning model designed to process sequential data, introduced by Vaswani et al. in 2017 [35]. The transformer architecture is based on the idea of using self-attention mechanisms to process input data, rather than using traditional convolutional or recurrent layers. This allows the model to capture long-range dependencies in the data and to process the input sequence in parallel, which makes it faster and more efficient than many other types of models.
A key aspect determining the performance of offline RL algorithms is the quality of the datasets used for training. In fact, their performance is usually validated separately according to the quality of the trajectories included in the dataset. For instance, the quality of the dataset can range from randomly generated trajectories (random dataset) to trajectories generated by the best-performing algorithm (expert dataset), or a mixture of them [32], [33].
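To make the return-to-go and the K-step input of DT more concrete, the following Python sketch computes the return-to-go of every step of a trajectory (the sum of the rewards from that step to the end) and extracts the K most recent time steps. The function names and the plain-list representation are assumptions made for illustration; they do not reproduce the code of the DT or TT repositories.

```python
def returns_to_go(rewards):
    """Return-to-go at step t: sum of the rewards from t to the end of the trajectory."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards so each entry is a suffix sum.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def context_window(rtg, states, actions, t, k):
    """Last k (return-to-go, state, action) triples up to and including step t."""
    start = max(0, t - k + 1)
    return list(zip(rtg[start:t + 1], states[start:t + 1], actions[start:t + 1]))
```

With rewards `[1.0, 0.0, 2.0]`, the returns-to-go are `[3.0, 2.0, 2.0]`: the first entry is the total return of the trajectory, and later entries shrink as rewards are collected.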

C. RELATED WORK
Most of the current commercially available control algorithms for AP systems are based on PID or Model Predictive Control (MPC) [6], [7], [29], [36], [37], [38], [39]. PID and MPC controllers usually require users to announce their meal intake and exercise activity, and so they work as a hybrid closed-loop system [29]. PID and MPC are used in current FDA-approved products such as MiniMed systems, Control IQ, and Dexcom [5].
The main drawback of PID controllers is that they usually do not handle variability in food intake well [13], [36]. Several improvements of the basic PID control have been put forward, such as insulin feedback (IF), which increases its efficiency [38], [39]. MPC controllers use a mathematical model of the patient's physiology to predict future blood glucose levels and optimize a sequence of control actions over a specific time horizon. The process includes modeling the patient's dynamics, predicting glucose trajectories, formulating an optimization problem to minimize a cost function while satisfying constraints, implementing the first control action, and repeating the process in a receding horizon manner. MPC offers benefits such as dynamic adaptation and incorporation of safety constraints, but challenges include model accuracy, patient-specific parameters, and computational requirements. Di Ferdinando et al. [6] and Borri et al. [7] model the endogenous insulin delivery rate (IDR) with a nonlinear differential difference equation (DDE). These models are usually applied to T2DM patients, since IDR cannot be neglected for them. Finally, overnight hypoglycemia is handled by PID and MPC-based commercial products with Predictive Low Glucose Suspend (PLGS) technology [29], which predicts glucose concentration trends and suspends insulin delivery before hypoglycemia occurs.
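The receding horizon procedure described above can be sketched in a few lines of Python. This is only a toy example under strong assumptions: the one-step linear model in `predict`, its `drift` and `gain` values, the small set of candidate doses, the quadratic cost and the exhaustive search are all hypothetical, and bear no relation to the physiological models or optimizers used in real MPC products.

```python
from itertools import product


def predict(bg, dose, drift=5.0, gain=20.0):
    """Hypothetical one-step linear model: BG drifts upwards and insulin lowers it."""
    return bg + drift - gain * dose


def mpc_step(bg, target=112.517, horizon=3,
             candidates=(0.0, 0.5, 1.0, 2.0), bg_min=70.0):
    """Return the first dose of the cheapest feasible dose sequence."""
    best_cost, best_first = float("inf"), 0.0
    # Exhaustive search over all dose sequences of length `horizon`.
    for sequence in product(candidates, repeat=horizon):
        x, cost, feasible = bg, 0.0, True
        for dose in sequence:
            x = predict(x, dose)
            if x < bg_min:            # safety constraint on the predicted BG
                feasible = False
                break
            cost += (x - target) ** 2  # quadratic tracking cost
        if feasible and cost < best_cost:
            best_cost, best_first = cost, sequence[0]
    return best_first


# Receding horizon: re-plan at every step and apply only the first action.
bg = 180.0
trace = []
for _ in range(5):
    dose = mpc_step(bg)
    bg = predict(bg, dose)
    trace.append(bg)
```

In this toy setting the controller gives the largest candidate dose while BG is far above the target and gives none when BG is close to the lower bound, since any sequence whose prediction falls below `bg_min` is discarded.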
ML has been used as a tool for the prediction of diabetes [40], [41], [42], [43], but also for glycemic control in an insulin pump, and such techniques are growing rapidly within the artificial pancreas research community. Most ML experiments are done in silico, through computer simulation. As CGM data are time series, non-linear autoregressive neural networks are used for BG prediction in [23], while [44], [45], [46], [47], [48] use recurrent neural networks (RNN) and long short-term memory (LSTM).
VOLUME 11, 2023
For BG control, RL has been increasingly tested, using multiple RL methods such as double score strategy [49], Q-learning [50], [51], [52], Deep Q-network (DQN) [53], Deep Deterministic Policy Gradient (DDPG) [54] and its improvement Twin Delayed DDPG (TD3) [55], Soft Actor-Critic [13], [56] and Proximal Policy Optimization [12]. These RL methods are called online RL, since the agent interacts with the environment to collect data. As an example, our previous work [12] shows a simple RL implementation strategy that outperforms PID with IF for BG control in in silico tests. The recent work of Yu et al. [57] uses a meta-RL framework called active RL with personalized embeddings (ARLPE). By learning a general meta-policy and then fine-tuning it to the particular patient, they obtain very promising results. However, their approach has only been tested on the adult and adolescent cohorts, excluding the most difficult group to train, the children. In addition, the question remains of how to actually do the fine-tuning for real patients, which cannot be done on the simulator. That is, in fact, the main problem of online RL. A potential alternative to alleviate this problem is to use a model-based RL approach, such as the recent one in [58], where a hypothesized insulin dose is simulated on a BG predictor before actually being injected into the patient. Its performance is good for simulations of up to 12 hours and two meals but decreases in more realistic scenarios.
In summary, in spite of recent advances, online RL is not yet suitable for safety-critical environments, where interaction with the environment (the real patient) is not possible. Therefore, researchers have recently paid more attention to offline RL. Offline RL is similar to online RL, but the offline RL agent does not need to interact with the environment or receive any new information from it during the learning process [14]. This means that the agent instead learns from previously collected data, which is safer and more useful for medical and healthcare research. Only a few works have evaluated the use of offline RL for BG control, such as [20], which uses Simulation-Augmented Batch RL (SABR), and [4], which applies and compares three offline RL techniques: Batch Constrained Deep Q-learning (BCQ), Conservative Q-learning (CQL) and Twin Delayed DDPG with Behavioural Cloning (TD3-BC). The work of Fox demonstrates how offline RL can reduce risks over two months and two years of evaluation. The work of Emerson et al. shows that TD3-BC outperformed PID across all patients. This is the work most similar to ours, but there are significant differences. First, we evaluate more recent offline RL algorithms (DT and TT), which have shown better results than the ones used by Emerson et al. Second, their work only evaluates 9 patients, 3 from each of the three age groups available in SimGlucose, while we evaluate the whole virtual patient population of 30 patients. Finally, their training dataset only contains $10^5$ samples generated by PID for each patient, while our datasets contain 1 million samples per patient and have been generated with PID-IF and our previous online RL implementation. As we said, for offline RL it is key to evaluate the influence of the training dataset, so we have extensively evaluated this aspect by trying different dataset sizes, using those two types of datasets, mixing them, and selecting the best subset of trajectories.

III. MATERIAL AND METHODS
In this section we describe our evaluation of offline RL as a method for automatic BG control. We evaluate two offline RL algorithms, Decision Transformer [32] (DT) and Trajectory Transformer [33] (TT). Each of the algorithms has been trained with two different sets of datasets, one generated by our previous online RL BG controller, PPO-RNN [12], and another one generated by a PID-IF controller [12], [39]. We also use those methods as baselines for comparison. In the remainder of the paper, the combinations are referred to as Decis-PPO, Traj-PPO, Decis-PID-IF and Traj-PID-IF, respectively. In addition, a dataset that mixes trajectories from both methods (PPO-RNN and PID-IF) is also used to evaluate both algorithms. As the metric to determine whether the glycemic control algorithm works appropriately, we use the percentage of time in euglycemia, also called Time in Range (TIR). Both terms refer to the time spent in the target glycemic range between 70 and 180 mg/dL. Lower (hypoglycemia) and higher (hyperglycemia) levels may cause short-term and long-term complications in T1D. Most people with diabetes should aim for a TIR of at least 70 percent of readings [4].
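As a concrete reading of the TIR metric, the sketch below computes the fractions of hypoglycemia, euglycemia (TIR) and hyperglycemia from a list of BG samples in mg/dL. It assumes equally spaced samples, so that the fraction of samples equals the fraction of time, and it treats the 70 and 180 mg/dL bounds as inclusive; both are assumptions of this example, as is the function name.

```python
def glycemic_fractions(bg_samples, low=70.0, high=180.0):
    """Fractions of samples in hypoglycemia, euglycemia (TIR) and hyperglycemia."""
    n = len(bg_samples)
    if n == 0:
        raise ValueError("at least one BG sample is required")
    hypo = sum(1 for bg in bg_samples if bg < low)
    in_range = sum(1 for bg in bg_samples if low <= bg <= high)
    hyper = sum(1 for bg in bg_samples if bg > high)
    return {"hypo": hypo / n, "tir": in_range / n, "hyper": hyper / n}
```

For example, the four samples `[65, 90, 120, 200]` give a TIR of 0.5, with one quarter of the time below range and one quarter above it.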
We first describe the baselines and the experimental setup and then discuss our evaluation results. Our general goal is to determine whether offline RL is a feasible method for automated BG control and how the quality and size of the datasets influence the learning process.

A. BASELINES

1) PROXIMAL POLICY OPTIMIZATION (PPO-RNN)
In a previous paper [12] we proposed and evaluated an RL controller based on the PPO algorithm [59]. One key finding of our previous work was that we were able to successfully train the agents if we selected a proper observation frequency for each type of patient, different from the default 3-minute CGM sampling. That is, instead of using the default frequency of the CGM sensor, observations were made every 45, 30 and 15 minutes for adults, adolescents and children, respectively. In addition, a simple reward function, shown in eq. (1), was used.
With this implementation strategy, we showed that the PPO agent outperforms other control methods and is able to keep over 73% of time in euglycemia across all groups.

2) PROPORTIONAL INTEGRAL DERIVATIVE WITH INSULIN FEEDBACK (PID-IF)
In our previous paper [12] we also tested a PID control that aims to keep the BG level at a target point of 112.517 mg/dL, which is the zero-risk point in Clarke's Risk Index. Note that PID-IF includes insulin feedback [38], [39]. Insulin feedback is an adjustment of insulin delivery that adapts to metabolism changes due to life activities and has been shown to improve the performance of PID controls.
Formally, the trajectories used as input for both offline RL agents use $s^i_t = b_t$, $a^i_t = u_t$ and $r^i_t = rw_t$, where $b_t$, $u_t$, and $rw_t$ are, respectively, the BG sample, the insulin units and the reward from eq. (1) at time t. In addition, offline RL uses additional input consisting of a terminal flag indicating whether the patient's BG is below 10 or above 1,000 mg/dL, and a timeout flag indicating whether the patient survived the full episode length. Thus, each sample in the dataset corresponds to an agent trajectory and consists of five vectors with the mentioned data.
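The five-vector structure described above could be represented as in the following Python sketch. The dictionary keys, the function name and the convention of setting the timeout flag only on the last step of a complete, failure-free episode are assumptions made for illustration; the reward values are passed in as given, since the reward function of eq. (1) is not reproduced here.

```python
def make_dataset(bg_trace, insulin_trace, reward_trace, episode_steps,
                 bg_low=10.0, bg_high=1000.0):
    """Pack one episode into five parallel vectors (hypothetical layout)."""
    n = len(bg_trace)
    # Terminal flag: catastrophic failure, BG below 10 or above 1,000 mg/dL.
    terminals = [bg < bg_low or bg > bg_high for bg in bg_trace]
    # Timeout flag: the episode reached its full length without any failure.
    timeouts = [False] * n
    if n > 0 and n == episode_steps and not any(terminals):
        timeouts[-1] = True
    return {
        "observations": list(bg_trace),   # s_t = b_t
        "actions": list(insulin_trace),   # a_t = u_t
        "rewards": list(reward_trace),    # r_t = rw_t
        "terminals": terminals,
        "timeouts": timeouts,
    }
```

An episode that ends with a BG reading of 9 mg/dL would therefore carry a `True` terminal flag on its last step and no timeout flag.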
For our previous work we implemented the PID-IF control for the default observation frequency (OF) of 3 minutes for all patients. However, in this paper we want to combine PID-IF trajectories with PPO-RNN trajectories for training the offline RL agents. Since the PPO-RNN agents use a different observation frequency for each age group, as discussed, we have to adapt the PID parameters, that is, the proportional, derivative and integral constants $K_p$, $K_d$, and $K_i$, to those particular frequencies. To find suitable PID parameters for each observation interval and the corresponding insulin response of each age group, we employed the optimization framework Optuna [60]. After evaluating various optimization methods, including the Tree-structured Parzen Estimator (TPE) [61], we determined that TPE exhibited the best performance. Subsequently, we conducted 1000 trials using TPE to identify the PID parameters that resulted in the highest euglycemia percentage within a 10-day episode length. The optimal PID parameters for each patient are provided in Table 1.
In summary, in this work both baselines, PPO-RNN and PID-IF, use the same observation frequency: 15, 30, and 45 minutes for children, adolescents, and adults, respectively. Finally, meals were randomly generated by the Harris-Benedict algorithm [13] and used throughout data generation for training and evaluation.
The baseline data as well as the datasets generated for training for this paper are available on the open science framework repository [62], in CSV format.

B. EXPERIMENTAL SETUP
We used the open-source implementations of TT and DT, available at [32], [33]. For training and evaluation, we used SimGlucose, a Python framework based on the UVA/PADOVA simulator, with 30 virtual patients divided into three groups: adults, adolescents and children, with 10 subjects each [25]. The parameters of the patients were obtained from the academic edition of the commercial UVA/PADOVA simulator version 2008, according to the developer [63]. This simulator is based on the OpenAI Gym standard [64], which is compatible with RL algorithms and easy to adapt to various kinds of research. It also provides different types of CGM sensors, insulin pumps, and a random meal scheduler with noise. SimGlucose has been previously used in similar studies [8], [13], [26], [27]. We trained DT and TT with the datasets generated by the baselines described previously.

1) DATA GATHERING
Initially, we generated three groups of datasets for training the offline RL agents. Each dataset contains five features: observation, action, reward, terminal, and timeout. An observation is the current CGM state, an action is an amount of delivered insulin, and the reward is generated according to the reward function in eq. (1), described in [12]. A terminal is True when the patient's BG is under 10 or over 1,000 mg/dL, which is considered a catastrophic failure, and a timeout is True when a patient survived for 10 days, that is, there was no catastrophic failure in the 10 days. In the first stage, we used the datasets generated from the baselines, PPO and PID-IF. The size of each dataset is one million samples per patient, so we generated 30 million samples in total. The second stage considers a combination of PPO and PID-IF datasets, since we hypothesize that if we combine data from multiple sources, the agents may learn better. Thus, we sorted the datasets by the highest rewards and then mixed them as follows: a first one with 80% of samples from PPO and 20% from PID-IF, and a second one with 50% from PPO and 50% from PID-IF. A new mixed dataset was generated for each patient. Finally, to test the influence of the dataset size on the learning process, in the final stage we generated new datasets from the sorted baseline ones by reducing the number of samples to one hundred thousand and ten thousand. In total, there are three groups of datasets for each patient: two baseline datasets, two combined datasets, and two reduced datasets, as shown in 1.
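The sort-and-mix procedure above might look as follows in Python. This is a sketch under assumptions: it ranks whole trajectories by their total reward and takes the top ones from each source, whereas the text does not state whether sorting was done per sample or per trajectory; the function names and the `rewards` key are hypothetical.

```python
def top_by_return(trajectories, n):
    """Keep the n trajectories with the highest total reward."""
    ranked = sorted(trajectories, key=lambda tr: sum(tr["rewards"]), reverse=True)
    return ranked[:n]


def mix(ppo, pid_if, ppo_share, size):
    """Build a mixed dataset of `size` trajectories with a given PPO share."""
    n_ppo = round(size * ppo_share)
    # Best PPO trajectories first, then the best PID-IF ones.
    return top_by_return(ppo, n_ppo) + top_by_return(pid_if, size - n_ppo)
```

Under these assumptions, `mix(ppo, pid_if, 0.8, size)` and `mix(ppo, pid_if, 0.5, size)` would give the two combined datasets, and `top_by_return(dataset, n)` with a smaller `n` would give the reduced ones.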

2) TRAINING
We trained the offline RL agents for each patient and dataset with the original hyperparameters from their code repositories [32], [33]. Hyperparameter tuning was not considered in this work for two reasons. First, at this point we are concerned with whether offline RL is a feasible method for BG regulation and with the general factors that may influence the training process independently of the particular algorithm used. Second, adding a hyperparameter optimization process at this stage was infeasible due to the time and resources needed to test all the combinations of algorithms and datasets considered in this work. Once we have identified a promising algorithm, we intend to perform a thorough hyperparameter tuning on it, using advanced methods such as the one in [65].

3) EVALUATION
We evaluated all the offline RL agent and dataset combinations, as well as the baselines, using 20 simulation replications with different seeds per patient. Each replication is run for 10 days of simulation time, so each episode is 10 days long. The observation frequency is 45 minutes, 30 minutes, and 15 minutes for adults, adolescents, and children, respectively. The termination condition due to catastrophic failure (BG level under 10 or above 1,000 mg/dL) is identical to the one used in the training process. The metrics used for evaluation and comparison between DT and TT with different datasets are TIR (the fraction of time in euglycemia), the hyperglycemia and hypoglycemia fractions, and Clarke's risk index.
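The following sketch shows how such metrics might be computed from a BG trace. It rests on two assumptions that are not stated in this section: the euglycemic range is taken as 70-180 mg/dL, and the risk index uses the commonly published symmetrized form f(BG) = 1.509 (ln(BG)^1.084 - 5.381), with the low and high components averaged over all readings. The implementation used in the study may differ in both respects.

```python
import math
from typing import Dict, Sequence, Tuple


def glycemic_fractions(bg: Sequence[float],
                       low: float = 70.0,
                       high: float = 180.0) -> Dict[str, float]:
    """Fraction of readings below, inside, and above [low, high] mg/dL.

    The 70-180 mg/dL defaults are an assumption, not taken from the text.
    """
    n = len(bg)
    hypo = sum(1 for g in bg if g < low) / n
    hyper = sum(1 for g in bg if g > high) / n
    return {"hypo": hypo, "eu": 1.0 - hypo - hyper, "hyper": hyper}


def risk_index(bg: Sequence[float]) -> Tuple[float, float, float]:
    """Return (LBGI, HBGI, RI) with RI = LBGI + HBGI.

    Assumes every reading is above 1 mg/dL so that ln(BG) is positive.
    """
    low_risk = high_risk = 0.0
    for g in bg:
        f = 1.509 * (math.log(g) ** 1.084 - 5.381)
        r = 10.0 * f * f
        if f < 0:
            low_risk += r   # reading on the hypoglycemic side
        else:
            high_risk += r  # reading on the hyperglycemic side
    n = len(bg)
    return low_risk / n, high_risk / n, (low_risk + high_risk) / n
```

With this transform the risk is close to zero near 112.5 mg/dL and grows toward both extremes, which is why a single index can summarize both hypoglycemic and hyperglycemic exposure.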

IV. RESULTS
In the following sections we compare the different alternatives. A paired t-test was performed on the results for pairs of alternatives, both between the two offline RL algorithms and against the PPO and PID alternatives. In all cases, significant differences were found, with a p-value below 0.05, except for the combination of datasets.
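As an illustration of the test, the sketch below computes the paired t statistic from two lists of per-replication results. How the results are paired (for example, by patient and seed) is an assumption, and the function name is hypothetical. The p-value is then read from a Student t distribution with n - 1 degrees of freedom; `scipy.stats.ttest_rel` returns both the statistic and the p-value directly.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float],
                       b: Sequence[float]) -> Tuple[float, int]:
    """Return the paired t statistic and its degrees of freedom.

    a[i] and b[i] must come from the same unit, e.g. the same
    patient and seed evaluated with two different methods.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1
```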

A. EPISODE LENGTH
Our first test is to determine whether offline RL agents are able to avoid catastrophic failures. We simulate each virtual patient for a fixed duration of 10 days in order to compare the performance of different methods. Although blood glucose control is a continuous task that lasts indefinitely, this limited episode length allows us to assess and compare the percentages of euglycemia achieved by the different methods within a standardized timeframe.
Our results in Fig. 2 show that the offline RL Trajectory and Decision Transformers cannot outperform PID-IF and the online RL PPO-RNN and, unlike the baselines, cannot reach ten days, which means that the BG level reaches a value outside the 10-1,000 mg/dL range. Traj-PPO achieves the longest average episode length among the offline methods, over 4,000 steps (8 days of simulated time) in every age group. There are notable differences between groups and methods, without a clear trend. In the following sections we look at the fraction of time spent in each state during the episode and discuss reasons for this behavior.

B. RISK INDEX AND GLYCEMIC STATES
We now compare the glycemic states, that is, the fraction of time spent in each BG range. In Fig. 3a, we show all methods for all age groups. Traj-PPO achieves the highest median euglycemia of the offline RL methods. Its median and 75th percentile slightly outperform the PID-IF baseline. In contrast, when trained with PID-IF trajectories (Traj-PID-IF), TT performs poorly. The Decision Transformer performs poorly with all the datasets tested. The results clearly show that offline RL cannot properly learn how to control BG from PID-IF trajectories. In fact, Decis-PID-IF has the highest hyperglycemia fraction, while Traj-PID-IF has the highest hypoglycemia fraction. Fig. 3b shows the glycemic state by age group. Traj-PPO performs well across all age groups, and its median hyperglycemia is even better than that of the original online PPO in all groups. However, its hypoglycemia median and 75th percentile are high and have a broad range, meaning that Traj-PPO carries a high risk of low blood glucose, a serious concern in modern AP products. Decis-PPO, in turn, shows unacceptably high ranges for both hypoglycemia in adults and hyperglycemia in adolescents and children.
The risk index, evaluated in Fig. 4, provides a more summarized view of the relative danger of hyperglycemic and hypoglycemic states. It shows that the riskiest method with respect to hyperglycemia is Decis-PID-IF, while hypoglycemia is more frequent in adults, adolescents, and children when using Decis-PPO, Traj-PPO, and Traj-PID-IF, respectively. The information of the risk index is complemented by the percentage of time spent in severe hyperglycemia (>250 mg/dL) or severe hypoglycemia (<50 mg/dL), shown in Fig. 5. It can be seen that Traj-PPO spends slightly less time in severe hyperglycemia than the other methods, but more time in severe hypoglycemia, which is adequately captured by the risk indexes.
In summary, Traj-PPO provides a level of performance similar to online PPO and PID-IF, but it has serious issues with hypoglycemia, that is, it tends to inject too much insulin. We come back to this matter in the following sections.

C. COMBINATION OF PPO AND PID-IF DATASETS
We compare Decis-PPO, Decis-PID-IF, Traj-PPO, and Traj-PID-IF with agents trained on the combined PPO and PID-IF datasets at two different ratios: eight to two (PP82) and five to five (PP55). In Table 2, we show the percentage variation of the average episode length. We can see that the use of mixed datasets does not improve TT. On the contrary, it worsens its performance for all glycemic states. For DT, the mixed dataset slightly increases performance for children and adolescents compared to Decis-PID-IF, and very clearly for adults. In Fig. 6, the global euglycemia of all methods is at about the same level, around 40%. However, DT performed well in avoiding hypoglycemia with both datasets. TT has the same high and low glycemic risks. In terms of RI, Fig. 7 shows that all DT and TT cases with mixed datasets fall in the range 20-40, and they are outperformed only by the previous Decis-PID-IF.
In Table 3 we show the average daily dose of insulin injected by each method. As expected, there is a direct correlation between the daily dose and the time spent in each glycemic state shown in Fig. 3a. Moreover, in Table 4 we show the percentage of catastrophic events of each method that are due to hyperglycemia or hypoglycemia.
From these data, we see that the average insulin dose of Traj-PPO is higher than that of PPO and that all the catastrophic events of Traj-PPO are due to hypoglycemia, while in Decis-PPO they are practically balanced. When mixing the datasets, the proportion of catastrophic events due to hyperglycemia increases for all the methods.
With these tests, our aim is to determine whether the offline agents improve their performance when trained with a more widely distributed dataset, that is, a dataset with a potentially wider range of states and actions. Our results show that transformers cannot generalize adequately. We conclude that more care has to be taken in selecting the trajectories for the datasets. For instance, when ordering the trajectories we only look at the highest rewards, and the average BG level of those trajectories is not taken into account. Traj-PPO only has catastrophic events due to hypoglycemia because it tends to keep patients at a low BG level. Under our reward function, such trajectories may obtain a reward that is high but equal to that of other trajectories that keep the patient at a higher BG level, which would be preferable. Such considerations have to be taken into account when creating the training datasets.

D. DATASET SIZE
The dataset size is important because one cannot realistically expect to collect samples from patients for years, so we want to test how much the dataset can be reduced while still obtaining good enough results. Interestingly, Table 5 shows that both DT and TT have longer episode lengths on average with the 100k dataset than with the 1M dataset. This is because we sorted the trajectories and used only the best ones. The average euglycemia percentage is at almost the same level as with the 1M dataset: the difference in euglycemia is 0.47% for TT and 1.8% for DT. While hyperglycemia decreases for TT with the 100k dataset relative to the 1M dataset, for DT it increases by almost 10%. As a result, TT globally improves its performance with 100k samples and has a better RI than with 1M, because it is able to reduce severe hypoglycemia, as shown in Fig. 5, whereas DT with 100k samples slightly worsens an already poor performance. Clearly, DT and TT are less effective when the amount of data is reduced to 10k samples. Both methods show a decrease of more than 10% in TIR and a significant increase in RI. The computational time is also affected: a dataset comprising 1M samples required approximately 26 hours of training time per patient. For a dataset of 100k samples, this value was reduced to approximately 21 hours, and it decreased further to around 19 hours for a dataset of 10k samples.

E. HYPERPARAMETER TUNING FOR DECISION TRANSFORMER
So far, we have shown that DT systematically performs worse than its alternatives. We checked whether the poor performance of DT is due to an incorrect selection of hyperparameters. DT takes a subset of the trajectory τ as input, specifically the K most recent time steps. Each time step consists of three items: the return-to-go, the state, and the action. The default value in the algorithm implementation is K = 20. Consequently, we conducted experiments by setting K to different values, namely 10, 50, and 100, and training two agents with different performance, using a PPO dataset comprising 100k samples. Table 6 illustrates the impact of different K values on performance. As can be seen, the performance remains quite consistent, particularly when comparing K = 10 and K = 20.
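The role of K can be illustrated with a minimal sketch of how the DT input might be assembled from a logged trajectory. The function names are hypothetical and the sketch omits tokenization, padding, and batching. Returns-to-go are computed here as undiscounted sums of the remaining logged rewards.

```python
from typing import List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Undiscounted sum of the rewards from each time step to the end."""
    rtg: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


def context_window(rewards: Sequence[float],
                   states: Sequence[float],
                   actions: Sequence[float],
                   k: int = 20) -> List[Tuple[float, float, float]]:
    """Return the k most recent (return-to-go, state, action) triples.

    k = 20 mirrors the default mentioned in the text; a trajectory
    shorter than k is returned whole.
    """
    steps = list(zip(returns_to_go(rewards), states, actions))
    return steps[-k:]
```

A larger K lets the model condition on a longer history at a higher computational cost, which is the trade-off explored in Table 6.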

V. DISCUSSION
In our previous paper [12] and in similar works [13], PID and PPO agents performed considerably well for BG control in the T1D simulator, so our hypothesis was that offline RL trained with these datasets should have comparable performance. Our results show that at least Traj-PPO has a performance similar to that of online PPO in most of the metrics, which is promising, since the main goal of this work is to determine whether offline RL can be a realistic alternative for data-driven BG control before attempting clinical trials with real patient data. Our results also agree quantitatively with the work of [4], which shows a similar level of performance, although tested with fewer patients and different algorithms. Our evaluation also shows that training offline RL is not straightforward: neither all the algorithms tested nor all the datasets used were equally effective for learning. This suggests that a better understanding of the influence of different data aspects, together with careful planning and design of the data gathering, is still necessary before collecting real-patient data for further tests, which is a complex and time-consuming task.
More research is needed to correct some of the observed deficiencies of offline RL methods, most importantly the inability to achieve the full episode length without catastrophic failures. Unlike the baselines, the offline agents cannot reach the full 10 days on average, even though the best one, Traj-PPO, reaches almost 9 days globally.
In Table 4, the catastrophic events of Traj-PPO are 100% due to hypoglycemia, while no catastrophic hyperglycemia occurred. Thus, additional research is needed to ensure that Traj-PPO is able to avoid hypoglycemia and thereby achieve the full episode length and a higher TIR. A direct next step is to further improve the quality of the training dataset to avoid hypoglycemic trajectories, as discussed below.
From our results, it is also clear that DT is not able to deliver good performance in this task, showing unacceptably high hyperglycemia levels in some groups. A simple reason may be that we have not optimized the DT hyperparameters, in particular the minibatch sequence length, to which DT is sensitive in several tasks. However, the preliminary tests conducted to check whether that was the case seem to rule it out. Deeper adaptations may be needed, such as pretraining or architectural changes, which have been shown to improve the basic DT performance [67], [68], [69], [70]. We leave the improvement of DT behavior as future work. Training with the PID-IF dataset did not yield satisfactory results for any of the algorithms. It seems that PID-IF generates too many out-of-distribution samples, that is, actions that move the state to previously unseen states, which degrades the performance [14].
We sorted the data by reward and episode length, and then combined the sorted PID and PPO datasets to determine whether this improves the learning process of offline RL. Unfortunately, crude mixing alone, even with sorted trajectories, is not enough to improve the performance. It was partially effective with DT, slightly improving an already quite poor performance. This suggests that mixing may have potential, but our results also imply that it is the quality of the datasets that actually brings the improvements. In fact, the importance of having good trajectories is clear: if the dataset size is reduced but only the best trajectories are kept, the performance can even improve. The average TIR with the 100k-sample dataset is similar to that with the 1M-sample datasets. The episode length increases because the trajectories are sorted and only the best ones are kept, as can be seen in the results obtained from combining datasets and from reducing the dataset size. However, offline RL algorithms cannot learn from datasets when the size is reduced to 10k samples. We have found a good trade-off with a dataset size of 100k samples, which also agrees with the work of [4] and [20]. We may further improve the results by filtering the datasets appropriately, that is, keeping the best trajectories and removing those with undesirable characteristics, for instance, trajectories that result in high hypoglycemic and hyperglycemic fractions, even if they have a good accumulated reward. This can be done by shrinking the target range for TIR, for example, to 90-100 mg/dL. Alternatively, we can redesign the reward function to penalize hypoglycemia and high hyperglycemia more heavily.
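The filtering idea can be sketched as follows. This step was not applied in the present study, so the function names, the 70-180 mg/dL range, and the 5% and 25% limits are all hypothetical placeholders that would need clinical justification.

```python
from typing import List, Sequence, Tuple


def out_of_range_fractions(bg: Sequence[float],
                           low: float = 70.0,
                           high: float = 180.0) -> Tuple[float, float]:
    """Fractions of readings below low and above high (mg/dL)."""
    n = len(bg)
    hypo = sum(1 for g in bg if g < low) / n
    hyper = sum(1 for g in bg if g > high) / n
    return hypo, hyper


def filter_trajectories(bg_traces: Sequence[Sequence[float]],
                        max_hypo: float = 0.05,
                        max_hyper: float = 0.25) -> List[int]:
    """Return indices of trajectories whose hypo- and hyperglycemic
    fractions stay within the given limits, regardless of reward."""
    kept = []
    for index, bg in enumerate(bg_traces):
        hypo, hyper = out_of_range_fractions(bg)
        if hypo <= max_hypo and hyper <= max_hyper:
            kept.append(index)
    return kept
```

Applying such a filter before the reward-based sorting would discard trajectories that score well on reward but keep the patient at a low BG level for long periods.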
Although offline RL with the Transformer architecture does not clearly outperform the baselines, the main advantage of offline RL is that it does not require interaction with the environment, in contrast to online RL, which needs to interact with the patient to collect data for training. Offline RL therefore emerges as a safer and promising alternative for RL, and as a practical approach to automated and customized glycemic control.
The next phase of research is to further optimize current methods, adapting the algorithms so that they learn to control blood sugar at normal levels more effectively. However, a potential solution in which the patient or caregiver simply collects the patient's own CGM data over time and converts it into a customized training dataset for offline RL still leaves multiple open questions. In particular, for the best-performing model, we used datasets generated in a simulated environment by an optimal agent that was itself previously trained in a simulated environment. For real patients, however, generating the dataset would require collecting their CGM readings and insulin doses, delivered according to the insulin regime the patients use, which is assumed not to be optimal in the first place. Since exploration is not possible in offline RL [14], we can only expect marginal improvements over the patient's actual insulin regime. A potential avenue, tested in this paper, is to generate the training datasets by mixing trajectories from different sources, for instance, from real patient data and from an optimized agent trained in a simulated environment customized to the patient class. Our results with mixed datasets in this paper have not been satisfactory, so mixing deserves further attention.
Only when these and other issues have been clarified can we expect to conduct clinical trials with healthcare professionals to collect datasets, and to test and evaluate them in real patients.

VI. CONCLUSION
In this paper we have carried out a thorough evaluation of two recent offline RL algorithms for automated BG control of T1D patients. We have evaluated the influence on training and performance of the method that generated the datasets, as well as the influence of the type of trajectories used (single-method or mixed trajectories), the quality of the trajectories, and the size of the datasets, and compared them with typically used baselines: PID and online RL methods.
Our results show that the Trajectory Transformer, trained with data from a previously trained optimal PPO agent, performs at the level of the baselines, which supports the view that offline RL can be a realistic alternative for data-driven BG control. However, we have also shown several shortcomings of the tested methods, and we have discussed potential avenues for improvement and next steps.

FIGURE 1. Schema of data generation from PPO and PID-IF algorithms.

FIGURE 2. Fraction of completed 10-day evaluation reached for each method and group.

FIGURE 3. Comparative fraction of time spent in glycemic states. (a) Global glycemic state. (b) Glycemic state by group.

FIGURE 4. Comparative fraction by global risk index.

FIGURE 6. Comparative fraction of time spent of mixed datasets in global glycemic state.

FIGURE 7. Comparative fraction of global risk index of mixed dataset.

TABLE 2. Increase/reduction of completed 10-day episodes of mixed datasets reached for each method and group.

TABLE 3. Average daily insulin dose.

TABLE 4. Type of catastrophic events by methods (boldface highlights the higher risk of each algorithm).

TABLE 5. Evaluation of influence of dataset size (boldface highlights the best performance of each algorithm).

TABLE 6. Hyperparameter K tuning for Decision Transformer (boldface highlights the best performance of each algorithm).