Errors-in-variables Modeling of Personalized Treatment-Response Trajectories

Estimating the effect of a treatment on a given outcome, conditioned on a vector of covariates, is central in many applications. However, learning the impact of a treatment on a continuous temporal response, when the covariates suffer extensively from measurement error and even the timing of the treatments is uncertain, has not been addressed. We introduce a novel data-driven method that can estimate treatment-response trajectories in this challenging scenario. We model personalized treatment-response curves as a combination of parametric response functions, hierarchically sharing information across individuals, and a sparse Gaussian process for the baseline trend. Importantly, our model considers measurement error not only in treatment covariates, but also in treatment times, a problem which arises in practice for example when treatment information is based on self-reporting. In a challenging and timely problem of estimating the impact of diet on continuous blood glucose measurements, our model leads to significant improvements in estimation accuracy and prediction.


Introduction
Increasing popularity of electronic health records (EHRs) and smart healthcare services has led to accumulation of large quantities of heterogeneous data, with potential to considerably improve the efficiency of clinical practice and health services [34]. This highlights the importance of novel machine learning techniques for EHR data, which can be integrated with mobile apps to provide personalized guidance for purposes ranging from early diagnosis to support for lifestyle change [12,27]. The latter is specifically relevant to reduce the cost of chronic diseases in the face of the aging population. For instance, the annual economic cost of diabetes in the U.S. is approximately $250 billion [1].
Inferring relationships between correlated variables is essential in many fields. An important question is to estimate a patient's response to a given treatment, comparing the patient's data from before and after the treatment. A traditional solution is to use randomized controlled trials, which, however may be infeasible due to the cost or ethical considerations. One possibility is to use mechanistic models, specifically tailored for the problem, and use data to learn about their unknown parameters. However, these models require substantial expert knowledge and are not applicable if the underlying mechanism is unknown. On the other hand, data-driven methods, trained on observational EHR data, provide a promising alternative.
Estimating the impact of a treatment is particularly challenging when the response is a continuous curve consisting, for example, of a time-series of measurements of a biological marker. In such cases, the outcome is typically modeled by a Gaussian process [35], but also neural networks have been considered [21]. The treatment may either be a continuous dose function [42], or a discrete event in time [45,41]. The latter approach is often relevant in practice when treatments are recorded as discrete events, even if their true duration is not exactly zero. Treatment data are usually sparse, and hence it is essential to share relevant information in a probabilistic model. A latent trajectory model of [39] uses additive components to explain variation on population and individual levels. Conditional random fields can be incorporated to further capture correlations between different treatment types [40], and multivariate response curves can be modeled by learning latent structure [42] shared across the outcomes.
Despite increased recent attention, there are still crucial issues in treatment-response estimation that have not been addressed when the response is continuous. Most importantly, the treatments are consistently assumed to be exactly measured and known, while in reality the treatment input may be severely perturbed by numerous factors. This problem dramatically escalates for user-reported treatment data, which potentially results in complete discredit of the findings [20]. The erroneous effect is two-fold, i.e., in addition to measurement error in covariates, the actual time of the treatment might be known only approximately. Another issue arises from the competing relationship between a counterfactual trend, i.e., the evolution of the outcome assuming no treatment, and the treatment response. When modeled and trained jointly, it easily happens that a flexible trend completely overrides the treatment response, and therefore these two components are often trained separately in practice.
To address the mentioned shortcomings, we introduce errors-in-variables (EIV) framework for modeling of continuous treatment-response trajectories. The EIV models account for measurement errors not only in the output variable, as common regression, but also in the inputs [13,14]. They are closely related to latent-variable models in machine learning [28,4], and based on modeling the unobserved true values from which noisy observations are obtained. Our contributions can be summarized as follows: -We formulate an EIV model for personalized treatment-response trajectories, where a treatment comprises a vector of noisy covariates and treatment times are uncertain. -We introduce an interpretable hierarchical prior on the treatment effects that efficiently shares information between individuals, and allows training the full model jointly, appropriately balancing between the trend and the responses. -In a challenging topic, representative of the current technological mega-trend on self-monitoring data from wearable devices, we show our method can meaningfully estimate the personalized impact of diet on continuous blood glucose measurements.
The code and data used in the analyses are available at [link added upon acceptance, reviewers can view the material in Supplement] and allow fully reproducing our results.

Related work
Treatment response: Besides machine learning, the problem of treatment response estimation has been studied in various fields, including informatics for medicine and social sciences, where the data-driven approach can bring advantages compared to the experimental trials [30]. For example, individual-level treatment response prediction has been studied for schizophrenia [5] and depression [33]. An empirical comparison of classifiers for treatment-response prediction for chemoradiotherapy appears in [9]. Topics studied in social sciences include the effect of a discrete treatment, years of education, on an individuals' income [6], and allowing a response to depend on social interactions and treatments for other individuals [24].
Mechanistic models: In contrast to the data-driven approaches used in machine learning, mechanistic models use substantial knowledge of a specific problem to characterize the system with differential equations, and inference is done for example using filtering algorithms. Similar to our application, [3] and [2] study blood glucose dynamics, affected by nutrition and other factors. Another example is a computational model of the physiological mechanisms for type-2 diabetes, aiming to quantify factors useful for prevention of the disease [37].
EIV models for treatment response: In contrast to the vast body of research concerning data-driven methods for the prediction of a treatment response, very little is known about their performance when measurement error is present. Authors of [46] study a method for inferring causal directions using EIV models, but they do not focus on the response conditioned on specific treatments. Regression on various student covariates, incorporating EIV, has been used to predict standardized test scores [22]. In [26], the effect of measurement error on a binary treatment response is analyzed, underlining the devastating impact of ignoring such errors. A model for measurement errors has been used to quantify uncertainty in order to increase the confidence in detecting genuine treatment changes for liver metastases [31].
None of these works address the problem of estimating the impact of a multivariate vector of covariates on a continuous response with measurement error in covariates and uncertainty in treatment timing, the topic of this paper.

Methods
In this section, we first review EIV models on a general level. Then we describe three essential components of our model for personalized treatment-response trajectories: a hierarchical prior on parametric treatment-responses functions, a Gaussian process model for the trend, and measurement error models. Throughout the section, we present the model in generic terms, but also outline the specific model that we use in Section 4.2 to estimate the impact of diet, recorded as nutrient contents of different meals, on continuous blood glucose measurements.
Our model is fully Bayesian, yielding uncertainty estimates for all parameters, essential in scientific applications. Inference is done using Markov chain Monte Carlo (MCMC) with the state-of-the-art No-U-Turn (NUTS) sampler [17] implemented in software PyMC3 [36]. Implementation details are discussed below and in the supplementary, and can be viewed in full in the published code.

Errors-in-variables models
EIV models, a.k.a. measurement error models, are regression or classification models that, in contrast to most existing models, account for errors not only in the output variable but also in inputs [7,38,15,10]. Though commonly neglected, input mismeasurement may be extremely harmful. For example in simple linear regression it leads to biased estimates that can not be corrected for even with an infinite sample, while, on the other hand, unbiased homoscedastic error in the output variable only induces additional variability [7]. A graphical model for a general EIV model is presented in Figure 1a, where X * and Y * represent the true values of the inputs and the output, and X and Y are the corresponding noisy observations. The most important type of mismeasurement is classical error, which corresponds to independence of an error term from the true value. Except for the simplest case of linear regression [29], EIV modeling almost always requires auxiliary information or data to correct the mismeasurement bias in estimation. For problems that have an analytical solution, the bias can be corrected by a multiplication or addition of external terms [18], e.g., an estimated reliability ratio [7]. For nonlinear models, auxiliary data, e.g., instrumental variables or repeated measurements, can be exploited to help correct bias, e.g., by estimating the density function of the true variable using a deconvolution technique [38]. However, without additional data, Bayesian EIV modeling is currently the most powerful and flexible approach, as it allows incorporating additional information in the form of distributional assumptions [15]. In this work, we adopt the Bayesian approach.
Mathematically, the measurement error mechanism is defined as the distribution of the noisy observed input, X, given the true unobserved input, X * . The joint distribution of the model factorizes accordingly as: where P(X|X * , θ M ) and P(Y|Y * , θ N ) are called error or measurement models, P(Y * |X * , θ R ) is a response or outcome model, P(X * |θ E ) is an exposure model, and are the corresponding parameters. Bayes theorem can be used to infer the unknown parameters and unobserved true values of the variables.
If the exposure model is noninformative and the measurement model is symmetric, i.e., , then the Bayesian modeling of classical error is equivalent to another class of mismeasurement techniques know as Berkson error modeling [7].
A well-known difficulty with EIV models is that they are often nonidentifiable [15], i.e. there are more than one set of values for the unknowns leading to the same model. This can be understood intuitively by noticing that the model stays the same if we multiply the linear regression coefficients by a constant factor and at the same time divide the estimated true values of inputs by the same factor. Therefore, to achieve identifiability, some crucial information about measurement model has to be assumed or estimated, e.g., the variance of a classical additive error in simple linear regression [7]. The Bayesian paradigm offers a unique solution to the nonidentifiability of the EIV models, as long as mismeasurement is modest and the prior is sufficiently good [16].

Model for treatment-response trajectories
Notation: A graph of our model for treatment-response trajectories is presented in Figure  1b. We assume there are N patients, and a trajectory consisting of a time series of length G n of the outcome (e.g. blood glucose) is observed for each individual: These measurements have been taken at times τ n = (τ n1 , . . . , τ nG n ) T , n = 1, . . . , N.
Furthermore, each patient has M n observed treatments (e.g. meals eaten), indexed by m ∈ 1, . . . , M n , where each treatment is characterized by P covariates: x nm = (x nm1 , . . . , x nmP ) T , for all m, n, and the corresponding recorded treatment times are t n = (t n1 , . . . , t nM n ) T , for all n.
Here, x nm and t nm are assumed to be noisy observations of the treatment covariates and timings, and their true unobserved values are denoted by x * nm and t * nm , respectively. Outcome model: We model the observed outcome trajectory of individual n, y n , as where T n ∈ R G n is a counterfactual trend (i.e. it describes the evolution of the outcome had the treatment not been taken), R nm ∈ R G n is the additive response to the mth treatment, and e = (e 1 , . . . , e G n ) T is the vector of errors with e i ∼ N(0, σ 2 y ). We note that the sum of the trend and the responses can be viewed as a trajectory for a 'clean' outcome (omitted from Figure 1b), of which a version y n corrupted by Gaussian noise is observed. Additive response functions can be seen as a continuous extension of scalar average treatment effect (ATE) which is defined as the expected difference of outcomes before and after treatment.
Response function: Response functions specify how treatments affect the outcome over time, and they should be specified to suit the application at hand, balancing flexibility, interpretability, etc. For example, if interpretability is not needed and the amount of data is large, non-parametric functions that learn the shape of the response are attractive. On the other hand, parametric functions are suitable when data are scarce, and they are often interpretable, which is valuable in itself but also helps specifying prior knowledge to improve accuracy. In the application of learning the impact of meals on blood glucose (Section 4.2), we model the treatment response using a bell-shaped parametric function where a lag vector ∆ nm = τ n − t * nm represents the time since a specific treatment. The shape of this response is shown in Figure 2a and it is determined by two parameters h nm and l nm with straightforward interpretations: h nm is the height of the response, and l nm is the length-scale which is proportional to the 'width' or 'duration' of the response. The main challenge in our application is scarceness and noisiness of data, with only 13 individuals and on average 10 meals per patient. We also tried a more flexible threeparameter response used in [41], which allows a skewed response (see Figure 2a), but this model suffered from convergence problems, for which reason we selected the simpler alternative.
In applications it is often of interest to measure how the response depends on treatment covariates, and therefore we allow these parameters to depend on the covariates: In Equation (4), the coefficient vectors β h n , β l n ∈ R P represent the personalized impact of each of the P covariates on the height or width of the response for the nth individual. To share information across individuals, we introduce a Bayesian hierarchical prior, see [11], and assume that the personalized height and length-scale coefficients, β h n and β l n , are drawn from common distributions: A hyperprior is further placed on the mean parameters of these distributions: The hierarchical prior introduces shrinkage and facilitates estimation of the personalized coefficients with limited data. Further details are given in the Supplementary material. Counterfactual trend: A counterfactual trend represents the outcome assuming no treatment has been taken. It has to be sufficiently flexible to handle any variation in the outcome that is not accounted for by the treatments. In this paper, we model the trend T n (t) for individual n using a Gaussian Process (GP) [35]: where θ T n are parameters associated with the kernel function k(x, x |θ T n ). GPs are nonparametric regression models with well-known closed-form formulas for posterior estimation, which they inherit from the Normal distribution by assuming all training and test data follow a joint Normal distribution. For example, if S n = y n − m R nm is the residual of the outcome after subtracting the impact of the treatment responses, then T n (t)|S n ∼ N(µ * , Σ * ), where µ * = k(τ n , t) T K(τ n , τ n ) −1 S n , and Σ * = k(t, t) − k(τ n , t) T K(τ n , τ n ) −1 k(τ n , t).
We refer the reader to [35] for more details about GPs. As the kernel, we use the sum of Squared Exponential (SE) and constant kernels, where the former equips the GP with desired smoothness, and the latter enables meaningful extrapolation to regions where no input points have been observed. To speed up computation, we use a sparse GP [35] instead of a full GP, which samples a small set of inducing points uniformly from τ n to achieve a low-rank approximation of K(τ n , τ n ) and its inverse. A detailed prior specification is provided in the Supplementary material.
Measurement models: Measurement models describe error in observations. With self-reported data both covariates and the timing of a treatment may be uncertain. To account for the uncertainty in treatment timing, we assume: t nm ∼ N(t * nm + d n , (σ t n ) 2 ), for all n, m.
In words, the observed time t nm is obtained from the true time t * nm by shifting it with a bias term d n , and adding Gaussian noise. The bias term d n represents reporting habits of different individuals. For example, in the blood glucose application in Section 4.2, some individuals may systematically report their meal after eating, while others may do this before eating.
Different models are possible for treatment covariates, depending on the assumptions and data available [15]. Here we assume a simple perturbation on the amount of treatment: x ), for all n, m.
The coefficient δ nm represents the error in the mth treatment of the nth individual. Intuition in the blood glucose application is that users are able to report correctly what they have eaten, but not how much. While the model (5) captures our understanding of the type of mismeasurement expected in our data, more complicated models could also be justified, but they would require stronger additional assumptions to resolve the nonidentifiability of the EIV models. The model in (5) is identifiable and can be trained with relatively little data, as we demonstate in Section 4. Estimating t * nm is straightforward as it only shifts the response, but does not change its shape. However, estimating x * nm is more complicated, and requires assuming that the counterfactual trend is sufficiently regularized. Otherwise the trend could easily compensate for the perturbation. We solve this by encouraging a large length-scale for the squared exponential kernel in the prior. Further details, e.g., prior distributions for x * nm , t * nm , and d n , are provided in the Supplementary material.

A note on causality
We briefly review results related to estimation of causal effects from observational data on treatment-response trajectories [32,25,41], to enable a user of our method to judge to what extent the effects found may or may not be interpreted causally. The causal effect of an action A (e.g. a treatment) on Y is defined as P(Y = y|do(A = a)), where the do(·) operator represents a manipulation of A to value a. The key assumption is that there are no unmeasured confounders (NUC), such as Z 2 in Figure 2b. Without Z 2 , the causal effect of A on Y can be estimated from observational data using the adjustment formula: Time-varying treatments (Figure 2c) further face an issue of treatment-confounder feedback [25], which means that hidden confounders do not have to affect A directly to create a spurious correlation between an action and future observations. A generalized adjustment formula, g-formula, can still be used to calculate P(Ȳ ≥t |do(A t−1 , A t )), see [8].
A useful result, applicable with our model, is to use the model to estimate P(Ȳ ≥t |Ā ≤t ,Ȳ <t ) from observational data. Then, assuming NUC, the following holds [25]: P(Ȳ ≥t |do(A t ),Ā <t ,Ȳ <t ) = P(Ȳ ≥t |Ā ≤t ,Ȳ <t ).  In words, conditionally on the history of treatments and the outcome (and relevant observed confounders not shown in the formula), the causal impact of the most recent treatment on future outcomes can be estimated from observational data. This short-term effect can be used, e.g., to select between alternative treatments available at a certain point in time, when the relevant history of the individual is known. In Section 4.2 we analyse the impact of diet on blood glucose. Based on domain knowledge, we know that diet is a prominent cause of changes in blood glucose. Furthermore, in our data we often see a rapid increase and decrease in blood glucose after a meal. Therefore, it seems plausible that meals affect blood glucose causally. However, in general, the causal assumptions can not be verified from observational data, and it is possible that some confounder affects both glucose and diet, but the effect of any such confounder is expected to be small compared to the impact of diet. Hence, while interpreting our results causally seems reasonable, we can not make assertions of this. More generally, with emergence of modern wearable self-monitoring devices, it will be possible to measure all relevant factors that could affect blood glucose much more comprehensively, and the NUC assumption is reasonable. Our model is straightforward to extend to such data.

Experiments
In this section, we first examine identifiability and accuracy of our model using simulated data, and then use it to analyze a real-world dataset comprising diet and continuous blood glucose measurements. Throughout, we compare four models, in an increasing order of complexity (later models include the previous as special cases):

Identifiability and accuracy of the models on simulated datasets
As a simple experiment, we first study the identifiability of the EIV model when there is measurement error in the covariates. We simulate artificial data using a toy model specified as the sum of a linear trend and the parametric treatment response from Equation (3). The dimension of treatment covariates is here set to 2, and each input is perturbed with an additive term drawn from N(1, 0.2 2 ). We analyze the data using the EIV model that assumes measurement error, and a model that disregards the noise in the covariates. Results and further details for this simple setup are presented in the Supplementary material, and they show that the EIV model recovers all true inputs and effect sizes with high accuracy, while the model that neglects the noise leads to biased coefficient estimates with wide confidence intervals.
To study the accuracy and identifiability of our method in a more realistic simulated setup, we first fit our model to the real-world data from Subsection 4.2, and use the fitted model to simulate responses and trends for two individuals. We perturb half of the inputs according to Equation (5) and let the model use the perturbed inputs and try to recover the true inputs and parameters. The performance of all models depends on the relative contributions of the trend and responses, and we scale up the response with a factor of 5, which facilitates a meaningful comparison. Results for one individual are shown in Figure 3, and for the other in the Supplementary material. We see that the direction of each non-zero perturbation is estimated correctly (left panel), and this is true also for the other individual (Supplementary). On the other hand, if there is no perturbation, the model may even then estimate non-zero perturbations, introducing additional noise. This reflects the trade-off between flexibility and overfitting, and highlights the importance of carefully validating the model to suit the amount and complexity of data. We also see that the regression coefficients are estimated accurately by the EIV model (center), and that the benefit from using EIV becomes more significant when the size of the perturbation increases (right). However, a too loose EIV prior (large SD) may actually harm the performance by introducing additional noise, when the true perturbation is small.

Experiments on real-world glucose data
The data contain blood glucose measurements and dietary records. These anonymized data were provided by the Obesity Research Unit at the University of Helsinki, Finland, and they are available for 13 non-diabetic individuals across three days. Diabetic individuals were excluded because their metabolism differs extensively from healthy individuals, and detailed modeling of that is beyond the scope of this work. The realvalued blood glucose measurements were collected by a portable continuous glucose monitoring system approximately every fifteen minutes. The dietary records consist of user-reported contents and times of all meals eaten during the 3-day study period. Each meal has been processed into amounts of five nutrients: starch, sugar, fiber, fat, and protein. The goal of the analysis is to learn how these nutrients influence blood glucose. Both the exact amounts of food eaten and the exact meal times are uncertain, as they are based on values estimated and reported by users, which motivates the use of EIV models for these data. A visualization of the data (and results) for one individual is shown in Figure 4, and for all other individuals in the Supplement. Some markers may be missing due to device errors or when a user has removed the device. Metrics: The models are trained using data from the first two days, and the third day is used for testing. The performance of treatment-response estimation is quantified using five metrics M i , i ∈ {1, . . . , 5}. M 1 is the proportion of variance explained by the trend: M 2 indicates how much more is explained when also the treatment responses are included: In detail, a large M 1 means that the outcome is mostly explained by the trend, and a small M 2 represents an inactive treatment response. These metrics are computed in regions of non-zero treatment response. Metrics M 3 and M 4 are simply the mean squared errors in the training and test data. They are calculated for all individuals for whom M 2 indicates that the response has been properly learned. Thus one patient, shown in Figure 5, with M 2 ≈ 0.05 for the baseline model M hier is excluded from MSE calculations (other patients have M 2 > 0.3). Because M 4 measures pointwise error, it may be misleadingly low when the response shape is correct if its location is inaccurate. Metric M 5 is insensitive to the inaccuracy of location, and it measures the absolute error in variance between predicted response and outcome: Because our interest is in estimation of the treatment response, and not the trend, we calculate M 4 and M 5 in windows from one hour before to three hours after each meal. We use the Mann-Whitney U-test [23] to test if other models are better than M hier in terms of test error M 4 . The reason for using M hier as the baseline is the main argument of this article that EIV modeling is beneficial when estimating treatmentresponse trajectories, and M hier is otherwise the same as M hier+time and M hier+time+cov except that it the does not include the EIV components. We also compare the models using an information criterion for predictive accuracy. The state-of-the-art criterion is leave-one-out cross-validation (LOO) [43], which is used here. Results: Result are shown in Table 1. We see that all models outperform the nonhierarchical baseline M ind by a large margin. Furthermore, taking treatment time inaccuracy into account in M hier+time improves significantly over the non-EIV model M hier . In fact, estimation of the response fails completely for some individuals without time EIV; the results with and without time uncertainty modeling for one such case are shown in Figure 5. On the other hand, taking uncertainty in covariates into account does not notably improve accuracy, which is likely caused by a combination of increased flexibility and limited amount of data. Overall, models with EIV component outperform the model without EIV in all metrics. Interpretability of personalized treatment response is also of great interest; for instance, understanding how an individual's glucose level changes if she eats one more unit of sugar. The overall goal of glucose monitoring is to keep the glucose level in a given range, and both the amount of excess and the duration of the hyperglycemic state are clinically important. Hence, a sensible parameter to consider is the impact of different nutrients on the area of the response curve. Though this is not a parameter of our model, it is straightforward to derive the personalized increase in response area due to one unit increase of a specific nutrient ∆A np (n ∈ 1, . . . , N, p ∈ {1...P}), using coefficients for height and width, which are modeled explicitly (see Supplement).
Overall, starch and sugar have the strongest positive impact on glucose ( Figure  6a), consistent with the understanding that carbohydrates increase blood glucose [44]. Protein, on the other hand, has a negative impact, which has been observed before and might represent a complex short-term interaction between nutrients [19]. An advantage of our model is that we get personalized coefficients for each individual, as shown for starch in Figure 6b, and for the other nutrients in the Supplement. Finally, posterior uncertainty of personalized starch coefficients is shown in Figure 6c. Importantly, models with EIV have much narrower confidence intervals, meaning that they are estimated more accurately, thanks to increased flexibility that allows fitting the complex data.

Conclusion
We presented a hierarchical model with EIV components to estimate personalized treatment-response trajectories when the covariates and timing of a treatment are imprecise. Our model demonstrates superior performance in both simulated and real-world data on various metrics, and allows extracting interpretable and meaningful estimates of the personalized impacts of treatment covariates, valuable in applications. Future directions include extensions and identifiability of EIV modelling, and extending the model to include interactions between covariates and other unmeasured confounders, such as physical activity, for causal completeness.