Assessment of Prediction Uncertainty Quantification Methods in Systems Biology

Biological processes are often modelled using ordinary differential equations. The unknown parameters of these models are estimated by optimizing the fit of model simulation and experimental data. The resulting parameter estimates inevitably possess some degree of uncertainty. In practical applications it is important to quantify these parameter uncertainties as well as the resulting prediction uncertainty, which are uncertainties of potentially time-dependent model characteristics. Unfortunately, estimating prediction uncertainties accurately is nontrivial, due to the nonlinear dependence of model characteristics on parameters. While a number of numerical approaches have been proposed for this task, their strengths and weaknesses have not been systematically assessed yet. To fill this knowledge gap, we apply four state of the art methods for uncertainty quantification to four case studies of different computational complexities. This reveals the trade-offs between their applicability and their statistical interpretability. Our results provide guidelines for choosing the most appropriate technique for a given problem and applying it successfully.

Abstract-Biological processes are often modelled using ordinary differential equations. The unknown parameters of these models are estimated by optimizing the fit of model simulation and experimental data. The resulting parameter estimates inevitably possess some degree of uncertainty. In practical applications it is important to quantify these parameter uncertainties as well as the resulting prediction uncertainty, which are uncertainties of potentially time-dependent model characteristics. Unfortunately, estimating prediction uncertainties accurately is nontrivial, due to the nonlinear dependence of model characteristics on parameters. While a number of numerical approaches have been proposed for this task, their strengths and weaknesses have not been systematically assessed yet. To fill this knowledge gap, we apply four state of the art methods for uncertainty quantification to four case studies of different computational complexities. This reveals the trade-offs between their applicability and their statistical interpretability. Our results provide guidelines for choosing the most appropriate technique for a given problem and applying it successfully.
Index Terms-Computational methods, dynamic models, nonlinear systems, observability, prediction error methods, state estimation, uncertainty Ç

INTRODUCTION
T HE dynamics of many biological systems can be described by nonlinear ordinary differential equations (ODEs). As these models usually have a number of unknown parameters, it is necessary to calibrate them using experimental data [1]. Most models are partially observed, i.e., only a subset of their state variables -or functions thereof -are measured. In practice, these measurements are noisy, which complicates the model calibration task. Once parameter estimates are available, the dynamic behavior of the biological systems can be simulated by integrating the ODEs. This calculation yields the time courses of the model states, which are usually concentrations or other measures of abundance of biochemical species.
Calibrated models are frequently employed in prediction tasks [2], [3]. A prediction can in principle be any quantity or result derived from model simulations, including state variables. As parameter uncertainties can result in prediction uncertainties, a comprehensive uncertainty quantification is essential. Fig. 1 provides a graphical illustration of the problem.
Uncertainty quantification is related to the concepts of observability and identifiability. Broadly speaking, a model is observable (respectively, identifiable) if it is possible to infer its state trajectories (respectively, parameters) from knowledge about the time course of its observables. Structural and practical observability (identifiability) analysis methods are available for the assessment of this characteristic [4]. The former take into account only the model equations, and inform about the theoretical possibility of determining the unknowns. Therefore, these methods can reveal deficiencies of the model structure, such as symmetries. The latter also take into account the data available for model calibration. Hence, these methods inform about the confidence/credibility of parameter estimates and predictions. In this manuscript, we consider practical observability problems arising from practical identifiability issues, respectively prediction uncertainties resulting from parameter uncertainties.
Uncertainty quantification for a specific combination of model and datasets is a non-trivial task. The extent to which the uncertainty is propagated from parameters to predictions depends not only on the practical identifiability, but also on the sensitivity of the model to its parameters. Key challenges are nonlinearity and dimensionality of models considered in the field of computational biology, as well as the computational complexity required for simulation. Several numerical uncertainty quantification approaches have been proposed and a general overview is provided, e.g., in [5]. Some methods for the assessment of prediction uncertainty are discussed in [6], [7], [8]. Kaltenbach et al. [6] explicitly mention lack of scalability as one of the open issues.
While various approaches are available and applied, a systematic assessment of strengths and weaknesses of the state of the art methods is missing. In a previous study we analysed three methods [9]: a local method based on the Fisher Information Matrix (FIM) [10], a Bayesian method based on Markov chain Monte Carlo sampling (SAM) [11], and an ensemble modelling approach (ENS) based on optimization results [12]. We applied these methods to two case studies with a small number of parameters. In the present work we address two important limitations of the aforementioned paper: (1) we consider prediction profile likelihood (PPL) [13], and (2) we enlarge the set of case studies to allow for a deeper evaluation of the performance of the methods, revealing the trade-offs between computational cost and statistical rigor. We are particularly interested in identifying methods which provide reliable estimates of the prediction uncertainty and scale well with model size (both in number of parameters and states). Based on these requirements, several types of existing methods were ruled out, including basic Monte Carlo sampling methods [6] and methods based on polynomial chaos expansions [14], which follow a rigorous approach with a sound mathematical basis, but currently can only be applied to small problems. We remark that in this work we consider the prediction uncertainty that results from propagation of parametric uncertainty. We do not take into account the possible uncertainty about the model structure. However, the methods considered here can also account for this type of uncertainty, by encoding alternative topologies as parameterized relationships in the ODEs.
The article is structured as follows: In Section 2, we describe the four selected methods, the metrics used for their evaluation, and the implementation details. In Section 3, we apply the methods to four case studies of increasing complexity, and evaluate their performance. In Section 4, we discuss the most relevant methodological aspects in view of the results. Finally, we present the conclusions of the comparison and provide guidelines for the application of the methods in Section 5.

METHODS
Modeling Framework and Notation.We consider ODE models, in which xðtÞ 2 R nx is the vector of state variables at time t, yðtÞ 2 R n y is the vector of observables at time t, and u 2 R n u is the vector of unknown parameters. The vector field f : R n x Â R n u Â R 7 ! R n x and the mappings g : R n x Â R n u Â R 7 ! R n y and x 0 : R n u 7 ! R n x are possibly nonlinear. The calibration of ODE models requires the estimation of u from measurements of yðtÞ at n t times, t i ¼ t 1 ; t 2 ; . . .; t n t . The number of measurements is n t Â n y . In the application examples, the measurement noise follows a normal distribution, k;i $ N ð0; s 2 k;i Þ; where k ¼ 1; . . .; n y and s k ðt i Þ is the standard deviation. Thus, a noise-corrupted measurement of the k th observable isỹ k;i ¼ y k ðt i Þ þ k;i . We denote the set of all measurement data as D. The maximum likelihood estimate of the vector of unknown parameters for a given dataset D can be found by minimizing the negative log-likelihood function: For the maximum a posterior estimate, the negative log-posterior is minimized, J nlp ¼ J nll À J np , in which the J np denotes the negative logarithm of the prior evaluated at u. The search space for u is usually constrained, e.g., by lower and upper bounds, yielding u L u u U . Predictions of the models are denoted by in which h : R nx Â R n u Â R 7 ! R nz represents a possibly nonlinear mapping. In this study, we focus on the assessment of the uncertainties in the time-dependent state variable, meaning that z ¼ x. Yet, the considered analysis approaches are more flexible, which is why we provide the equations for generic functions h. In Sections 2.1, 2.2, 2.3, and 2.4, we describe the four considered methods for the quantification of prediction uncertainties (Fig. 2). For each method, we provide a definition of its prediction, x p j ðt i Þ, and of its uncertainty estimates, e p j ðt i Þ. Here, j indexes for the state variables (j ¼ 1; . . .; n x ) and i indexes the time points (i ¼ 1; . . .; n t ) (which do not necessarily have to be aligned with the time points of the observations). In Section 2.5, we define the metrics used to quantify the performance of the methods. In Section 2.6, we provide descriptions on the implementation.

Approximation Approach Based on Fisher
Information Matrix (FIM) For many decades parameters uncertainties were predominately analyzed using methods based on asymptotic statistics. These methods are based on the assessment of the variability of the parameter estimatesû, given different replicates of the measurements. For globally identifiable models, the variability of the point estimates is described by the Fisher information matrix (FIM) in which @y k ðt i Þ @u denotes the sensitivity of the k-th observable with respect to the parameters u, evaluated at the measurement time point t i and the parameter estimatesû. As the sensitivity is the first order derivative, it provides information about the effect of small changes.
The Cram er-Rao theorem [15] states that, ifû is an unbiased estimate of u (i.e., EðûÞ ¼ u), the inverse of the FIM provides a lower bound of the covariance matrix in which u is the true parameter vector. The parameter covariance matrix informs about the individual and pairwise variability of parameter estimates along different realizations of the experimental data. The parameter covariance matrix (and hence the FIM) can be used to estimate the uncertainty in predictions. To this end, the first order Taylor series expansion of the mapping h is used to propagate variability from the parameters Fig. 2. Illustration of the employed uncertainty analysis methods. All methods seek to estimate the uncertainty in the time courses of state variables that results from uncertainties in the parameter estimations. Each method quantifies uncertainty in a different way, as defined by Equations (8), (12), (15) and (18). Briefly, FIM approximates the prediction uncertainty as the standard deviation calculated from the square root of the prediction covariance matrix (8). SAM considers the credibility region as the distance between the 0.5th-and the 99.5th-percentile of the prediction samples, which are obtained by integration with the parameter samples drawn from the posterior distribution (12). PPL also approximates the confidence region as the width of the 99th percentile, and obtains the upper and lower levels by solving an optimization problem constrained by the prediction values (15). ENS adopts a similar approach, but builds the confidence region with those vectors explored during parameter estimation that yield an objective function below a certain threshold (18).
to the predictions [16]. This yields the prediction covariance matrix: Cov½zðtÞ ¼ @h @x @x @u þ @h @u CovðûÞ @h @x @x @u þ @h @u T ; (6) with all derivatives being evaluated at time t for parameter u and corresponding state xðtÞ.
The covariance matrices of u and zðtÞ can only be approximated using the inverse of the FIM if the FIM is invertible. If a single parameter is locally non-identifiable, this is not the case. In principle, this would preclude the application of this approach to unidentifiable models. A solution is to approximate the inverse with the Moore-Penrose pseudoinverse, as e.g., in [17].
The assessment of the predictions using the FIM yields for z j at time t, the point estimate z p j ðtÞ ¼ h j ðxðt;ûÞ;û; tÞ; which is the evaluation of model simulations with the optimal parameter vectorû. The uncertainty of the prediction as measured by the standard deviation is This approximate assessment is here denoted as FIM-based approach, but was in previous studies also referred to as Linear Covariance Analysis (LCA) (see [10]).

Bayesian Approach: Sampling the Posterior Predictive Distribution (SAM)
In systems biology, measurements are often scarce and the application of asymptotic approaches arguable. Therefore, a broad spectrum of Frequentist and Bayesian uncertainty quantification methods have been introduced. In Bayesian statistics, the uncertainty of parameters is studied using the posterior distribution pðujDÞ ¼ pðDjuÞpðuÞ pðDÞ ; (9) in which pðDjuÞ denotes the likelihood of the data D given the parameters u, pðuÞ denotes the prior distribution of u, and pðDÞ denotes the marginal probability. The posterior pðujDÞ encodes the available information about the parameters. Hence, it also describes the uncertainty about the parameters u taking into account the available experimental data, D, and the prior belief, pðuÞ. Likewise, one can define the posterior predictive distribution pðzjDÞ, which is obtained by integrating over the latent variables, simply speaking: The posterior distributions are usually not available in closed-from. In most cases their properties are assessed using sampling procedures such as Markov chain Monte Carlo methods (MCMC). Since these procedures are computationally expensive, it is crucial to use an efficient sampling technique. The adaptive parallel tempering algorithm combines the sampling from tempered posterior distributions with a local adaptation to improve sampling efficiency. The algorithm provides n s samples from the posterior distribution for the parameters, fu ðsÞ g n s s¼1 , which can be used to quantify parameter uncertainties. The corresponding samples from the posterior predictive distribution are obtained by simulating the model for the sampled parameters, fz ðsÞ ¼ hðxðu ðsÞ ; tÞ; u ðsÞ ; tÞg n s s¼1 . These samples are then used to calculate the mean predictions and their associated uncertainties. The computationally demanding step is the sampling of the parameter posterior distribution. Once it has been computed, the calculation of the parameter and prediction uncertainties is efficient.
The mean prediction for z j at time point t i is yet in many applications the prediction obtained for the maximum a posterior estimate might be preferred. To quantify the prediction uncertainty, the sample-based approximation of the marginal distribution is considered, common choices are the highest posterior density interval and the equal-tailed interval. Here, we choose the latter for a credibility level of 99%, meaning that the prediction uncertainty is the distance between the 0.5th-and the 99.5th-percentile of the samples of the prediction, z 0:5 j ðt i Þ and z 99:5 j ðt i Þ, which yields e p j ðt i Þ ¼ z 99:5 j ðt i Þ À z 0:5 j ðt i Þ:

Frequentist Approach: Prediction Profile Likelihood (PPL)
In contrast to sampling-based methods used in Bayesian statistics, frequentist approaches for uncertainty quantification often use profile likelihoods. Profile likelihoods provide a maximum projection of the likelihood on the parameter of interest The profile likelihood value PL u l ðcÞ is the maximum value of the likelihood function attainable for u l ¼ c. The definition of statistical significance levels, e.g., based on the likelihood ratio test (relating to the x 2 -distribution), yields the parameter confidence intervals. Following this concept, Kreutz et al. [13] introduced the concept of prediction profile likelihoods (PPLs) Conceptually, the PPL provides the highest possible likelihood value for a specific value of the (parameter dependent) prediction. This translates to a optimization problem with an equality constraint for the value of the prediction. Yet, to avoid problems related to nonlinear equality constraints, a reformulation that relies on artificial data points has been presented [13]. Additionally, integration based techniques that are more efficient for the analysis of uncertainties of state trajectories have also been proposed [18]. For PPLs to provide information about the prediction uncertainties, it is necessary to define a statistical threshold. We choose a confidence level of a ¼ 0:01 related to the likelihood ratio test, which yields the uncertainty measure in which z min j ðt i Þ and z max j ðt i Þ are the minimal and maximal values of c for which PPL z j ðt i Þ ðcÞ=pðDjûÞ > expðÀD a =2Þ, respectively. The threshold parameter D a is the a percentile of the x 2 -distribution. As in the FIM-based method, the PPL state prediction is the model simulation with the optimal parameter vectorû z p j ðt i Þ ¼ h j ðxðt;ûÞ;û; tÞ:

Ensemble Modelling Approach (ENS)
The frequentist and Bayesian methods have more recently been complemented by ensemble modelling approaches. These approaches exploit that parameter optimization has already explored the parameter space [12]. The parameter vectors encountered during the optimization process which meet a certain quality criteria, e.g., reasonable log-likelihood values compared to the optimal point, are considered as an ensemble.
In practice, a diverse ensemble of parameter vectors is obtained by performing several optimizations with different random seeds and initial points. From the set of parameter vectors found during the optimizations, those with an objective function value smaller than a given threshold are included in the ensemble. Statistical interpretability is ensured by using a threshold according to the likelihood ratio test (as for parameter and prediction profile likelihoods). We use a confidence level of a ¼ 0:01. The resulting ensemble provides an envelope for parameter and model predictions. As the envelope is not pushed towards the boundaries (as done in profile calculation), the resulting parameter and prediction envelopes provide inner approximations compared to profile likelihood-based approaches.
The ensemble prediction is the average prediction of the predictions by the ensemble with z m j ðt i Þ denoting the prediction z j at time t i for the m th model parameterisation in the ensemble, and n m is the number of parameter vectors in the ensemble. While in principle the full range of the ensemble-based envelop could be used, previous works filter extreme results. Similar to the Bayesian method, the width of the 99th-percentile interval of the ensemble (12) was employed e p j ðt i Þ ¼ z 99:5 j ðt i Þ À z 0:5 j ðt i Þ: with z per j ðt i Þ denoting the per-th percentile of the ensemble simulations.

Performance Metrics
In this work we perform a comparative study based on published parameter problems. To facilitate a comprehensive assessment we used synthetic data with the same characteristics as the published datasets for the considered problems. Furthermore, we used as predictions of interest the complete set of state variables.
To assess the performance of the methods we used the three metrics:

Computational Cost
We use the CPU time of all the calculations performed to solve a particular problem. We set a computational budget, i.e., a maximum time for each problem per method, of 1350 hours (somewhat less than two months).

Agreement Between Predictions and True States
We assessed the agreement of changes in the predicted value x p j ðt i Þ and true value x j ðt i Þ of the state variable x j at time t i . Therefore, we subtracted from both their averages over all time points, x p j and x j . To obtain a dimensionless quantity between -1 and 1, we normalize with the respective variability. This yields the performance metric for which a value of 1 indicates that the prediction and true values are perfectly aligned up to an offset and a scaling constant. The time grid used for the predictions was defined so as to resemble the original experimental data points, whenever available. The overall value of the performance metric for a model, r x , is the average of the performance metrics of the state variables, r x ¼ 1 nx P nx j¼1 r x j . The equation of the performance metric is identical to the equation of the Pearson's correlation coefficient, yet, a statistical interpretation is potentially problematic.

Agreement Between Uncertainty Estimates and Observed Error
We assessed the agreement of changes in the predicted uncertainty e p j ðt i Þ and the actual error e j ðt i Þ ¼ jx p j ðt i Þ À x j ðt i Þj for state variable x j at time t i . Therefore, we subtracted from both their averages over all time points, e p j and e j . To obtain a dimensionless quantity between -1 and 1, we normalize with the respective variability. This yields the performance metric r e j ¼ P n t i¼1 ðe p j ðt i Þ À e p j Þðe j ðt i Þ À e j Þ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi P n t i¼1 ðe p j ðt i Þ À e p j Þ 2 P n t i¼1 ðe j ðt i Þ À e j Þ 2 q : As before, a value of 1 indicates a good agreement of predicted uncertainties and actually errors, meaning that in the case of large predicted uncertainties also the error is large. The overall performance metric for a model, r e is the average of the performance metrics of the state variables, r e ¼ 1 n x P n x j¼1 r e j .

Implementation and Availability
Parameter optimization tasks were performed using the MATLAB version of the enhanced scatter search (eSS) method included in the MEIGO optimization toolbox [19], combined with the MATLAB version of AMICI [20] for model simulation. The initial parameters were generated using latin hypercube uniform sampling. For the PPL method we used an optimization-based approach [21], and for SAM we used an adaptive parallel tempering algorithm [11]. We used the implementation of both methods available in the MATLAB toolbox Data2Dynamics [22]. For numerical integration, Data2Dynamics and AMICI rely on the SUNDIALS solver CVODES [23]. For initialization of the algorithms, we used the optimal parameter vector found by the eSS method.
We implemented in-house MATLAB scripts for the FIM and ENS analyses, as well as for the evaluation of the MCMC samples to obtain the prediction posterior samples. In the FIM analyses of the unidentifiable models (all but the a-pinene) we used the Moore-Penrose pseudoinverse as a replacement of the inverse of the FIM. When computing the pseudoinverse of a matrix A, singular values smaller than a threshold given by max(size(A))*eps(norm(A)) were treated as zero, where eps is Matlab's floating point accuracy. The complete implementations of the case studies analysed in this work (including the MATLAB code, computational results and the respective version of the used toolboxes) are available on ZENODO (doi:10.5281/ zenodo.5995941).
FIM and ENS results were obtained in a multi-core PC running Windows 7 64-bit with 16 GB RAM and 12 cores, Intel Xeon 2.30 GHz with MATLAB version R2017b. SAM and PPL results were obtained in a multi-core HPC cluster running Fedora 25 64-bit in a computing node using up to 16 cores allocating 300 MB of memory per core with MAT-LAB version R2017b. Both environments have similar -but not identical -computing power. To allow a fair comparison, we have scaled their CPU times according to the result of the LINPACK 100 benchmark, 1 which is a common measure of computational performance. To this end we used the C version of the LINPACK benchmark. Executing this test on both computers yielded 2301.46 MFLOPS in the Windows system and 3922.43 in the HPC cluster, i.e., a ratio of 1.70. Thus, we divided the CPU times of the Windows system by 1.70 and reported them in Fig. 3 A.

CASE STUDIES AND RESULTS
To assess the performance of the four uncertainty analysis methods described in the preceding section, we applied them to four case studies of increasing complexity. They possess in the order of 10 1 , 10 2 , 10 3 , and 10 4 predictions; these numbers represent number of state variables times number of time points.
The main characteristics of the four case studies are shown in Table 1. In the following, we describe the application of the methods to each case study.

Isomerization of a-Pinene
As a first case study and a sanity check of the different methods, we considered a fully observed model for isomerization of a-pinene [24]. It describes the thermal isomerization of Since the results were obtained using two different computing environments, the CPU times reported for FIM and ENS have been scaled to ensure a fair comparison, as explained in Section 2.6. For the FIM method, the computation time corresponds to the optimization used to obtain the optimum. For the ENS method it includes all the optimization runs used to obtain the parameter vectors in the ensemble. For SAM it includes the sampling time, which was performed with adaptive parallel tempering. We set a maximum CPU time per problem of 1350 hours (slightly less than two months); calculations with PPL hit this limit for all but the smallest model: for EGF and JAK-STAT, PPL completed 35% and 2% of the calculations in the allowed time, respectively. For the largest case study (BM1) PPL was deemed as not applicable (therefore shown in black). (B-C): Agreement between predictions and true states (19) and between prediction uncertainty quantified by each state and actual prediction error (20), for all methods and case studies. The prediction uncertainties are those of the 95% percentile. Results for 99% and 68% were also calculated but are not shown here, since they do not vary for the FIM method and the differences are relatively small for SAM and ENS, showing that the metric is robust with respect to the choice of the confidence level. The values shown are the mean AE the standard deviation for each state. For all but the smallest model (a-pinene) the PPL method only produced results for a fraction of the predictions before exceeding the computation time limit. The percentage of predictions that could be calculated is shown with an asterisk. For the most computationally expensive model (BM1) this method was not applicable, which is noted as N/A. a-pinene to dipentene and allo-ocimene, which in turn yields pyronene and a dimer. There are thus five state variables in the model, all of which can be measured at nine time points, including the initial conditions. Assuming first order kinetics, the model has five rate constants that are the unknown parameters.
All parameters of this model were practically identifiable and the FIM could be inverted. For the ENS method we built an ensemble with 4000 parameter vectors through optimization. We found that this number could be obtained with a low computational cost, and adding more vectors did not alter the results. For the SAM method we created a Monte Carlo chain with 100 000 samples. Identifiability led to low dispersion in the parameter values included in the ensemble and in the SAM samples, as seen in Fig. 4.
All approaches were applicable and computation times were in the order of minutes for all methods (Fig. 3 A). Furthermore, all methods achieved good agreement of predicted and true state (Fig. 3 B) as well as predicted uncertainty and error (Fig. 3 C). The prediction uncertainties were relatively small (Fig. 5).

EGF Signaling
As a second case study, we considered a model [25] that describes the nerve growth factor (NGF)-induced differentiation of neuronal cells. It models the effect of two growth factors, NGF and the mitogenic epidermal growth factor (EGF), in rat pheochromocytoma (PC12) cells. NGF an EGF phosphorylate extracellular regulated kinase (Erk) through different signaling pathways. The resulting model has 28 states, six of which are measured at twelve time points, and 48 unknown parameters. Thus, this model was larger than the model for isomerization of a-pinene, and only partially observed. This hampered parameter identifiability, which led to a large uncertainty in the values of the parameters. Indeed, for some parameters a wide range of values allowed for a good fit to the data (Fig. 4). The results for the ENS method are based in 2 400 parameter vectors and the results for the SAM method on 38 000 samples.
For the experimental designs used in [25], the considered model for EGF signalling was locally practically non-identifiable. Due to this, the FIM is not invertible; hence, as mentioned in Section 2.1, we calculated the FIM-based uncertainties using the Moore-Penrose pseudoinverse. Non-identifiability, in turn, led to a decrease in the accuracy of the predictions. That said, all methods were able to obtain good predictions (Fig. 3 B); however, accuracy of the estimate of the prediction errors decreased for most methods (Fig. 3 C). The performance degradation was particularly notable in FIM, for which the agreement decreased roughly from 0.8 to 0.5. While the decrease of PPL was not as pronounced, this method only calculated uncertainty estimates for approximately 35% of the predictions before reaching the computation time limit. Fig. 6 shows the results of the different methods for this case study for a representative subset of 10 of the 28 model state variables.   6. Results of the different approaches for 10 representative states of the EGF signaling pathway example. The solid black lines are the predictions of the state trajectories, and the dashed red lines the true states. The grey areas show the percentiles of the predictions calculated with each method (dark grey: 68:27%, light grey: 95:45%, lighter grey: 99%). Note that, for this case study, PPL only produced results for a subset of the states within the allowed computation time.

JAK/STAT Signalling
As a third case study, we considered a model for JAK2/ STAT5 signaling [26]. The purpose of the model is to elucidate the role of two transcriptional feedback regulators in erythropoiesis. The response to erythropoietin stimulation first activates receptor and JAK2 phosphorylation, and then phosphorylates the latent transcription factor STAT5. This model has 25 states and a relatively large number of outputs (20), but only a few of them are direct measurements of state variables, while most of them are functions of a subset of the states. Therefore, they do not provide as much information as it might seem at first sight. This model shared many characteristics with the model for EGF signalling, but the number of experimental conditions was substantially higher, resulting also in an ten-fold larger number of predictions. As before, the FIM was not invertible and the Moore-Penrose pseudoinverse was applied for FIM-based uncertainty quantification. Furthermore, the PPL calculation finished within the considered time constraints only for 2% of predictions.
The computations times for FIM, SAM and ENS methods were comparable (Fig. 3 A). Furthermore, all methods achieved a good agreement between predicted and true state variables (r x > 0:9), but lower than for the two preceding case studies (Fig. 3 B). The agreement between predicted uncertainty and actual error were r e > 0:75 except for PPL, which falls to 0.372. Fig. 7 depicts the predictions and uncertainties estimated by all methods for a representative subset of states.

Insulin Signaling (BM1)
As a fourth case study, we considered a model for insulin signaling in mice [27]. It also considers the interaction of insulin signaling with oxidative stress, and it includes transcriptional feedback through the FOXO transcription factor, which controls long-term adaptation. Thus, the model consists of several interconnected modules, with over one hundred states and almost four hundred parameters. However, only five states are measured. This was by far the largest and most computationally demanding model. Indeed, the PPL method -which already struggled with the two previous problems -did not complete any calculations within the allowed computation time. The performances of the other three methods were comparable to their performances on the two previous problems, albeit with a higher computational cost. The FIM-based method is computationally cheap, but it has a number of limitations. First, it is strictly local, being calculated from a single parameter vector. In the presence of non-identifiability, the true vector can be very different from the estimated (optimal) one, affecting the results. Second, the confidence intervals estimated from the FIM are always symmetric, which might violate constraints (e.g., positivity bounds). Third, this method relies on a linearisation, and can be overly optimistic if nonlinearities are present. For the finite sample case, it is expected to give inaccurate results in the presence of strong nonlinearities. Our computational results have confirmed this theoretical expectation. Finally, if the model has non-identifiability issues, the FIM cannot be inverted and a pseudoinverse has to be used [17]. This was the case for three of the four case studies analysed here, and it is a very common scenario in systems biology models. However, to calculate the pseudoinverse it is necessary to specify a threshold, which may affect the results.

Bayesian Approach: Sampling the Posterior Predictive Distribution (SAM)
The Bayesian approach enables the assessment of uncertainties in a comprehensive manner. Yet, the posterior distribution needs to be approximated at first. This step tends to be computationally challenging, especially for increasingly complex models where MCMC methods tend to suffer from convergence issues. For the model of JAK/STAT signalling we found that the predictions for the state variables were "shifted" when compared to the ground truth (Fig. 7). This probably implies that the MCMC chain did not properly sample from the posterior distribution. Yet, this was not the case for the other models. Therefore, one should make sure that the samples are converged to the posterior distribution before further analyses.

Frequentist Approach: Prediction Profile Likelihood (PPL)
Unlike the Bayesian approach, the use of PPLs require individual calculations for each model prediction. This renders the methods computationally demanding if a large number of predictions needs to be assessed. This has been reflected in the EGF, JAK-STAT and BM1 model, as only a fraction of the predictions were covered in the computation time limit considered. In this regard, it must be noted that the high performance computing infrastructure used in the present study had a time limit of 48 hours for each job. Possibly, longer run times for individual jobs could allow more calculations to finish. We considered the use of advanced integration-based PPL calculation implemented in Data2Dynamics, but did not succeed. Hence, while providing stringent statistical guarantees, the use of PPLs is challenging.

Ensemble Modelling Approach (ENS)
As the ENS method exploits the results of parameter optimization, it is applicable even for high-dimensional models.

CONCLUSION
In this paper we have compared four different approaches for uncertainty quantification in dynamic biological models: FIM, SAM, ENS, and PPL. These four methods estimate the uncertainty of the time-dependent state variables. We found that several factors should be taken into account when choosing a method for a specific problem.
In regard to accuracy of the uncertainty estimates, the four methods showed good agreement for the a-pinene model. This is the simplest of the case studies considered, since it is the smallest one, linear, and fully observed. For the larger and more complex (nonlinear) models, more differences appeared, as expected. For those three case studies, SAM and ENS yielded more accurate estimations of the uncertainty of the predictions than FIM. Yet, the confidence intervals often did not cover the true trajectory. This might be due to conceptual limitations or technical problems (e.g., convergence of the MCMC sampler for SAM). PPL calculations did not outperform FIM due to computation time limitations.
In regard to statistical interpretability, Bayesian and Frequentist approaches have arguably the most rigorous foundations. ENS is arguably the technique with less theoretical justification, although, since we have used it with an uncertainty metric defined in the same way as that of SAM, it could be regarded as a low-cost approximation of a Bayesian approach. ENS may also be considered as an inner approximation of PPL, which provides a lower bound on the uncertainty estimates, since by construction its envelopes are narrower than those obtained with a working PPL method.
Another key consideration is computational cost. The FIMbased method is the cheapest one, since it only requires one successful optimization in order to find an optimal parameter vector. The most expensive one is the PPL approach, which can become very expensive -and even inapplicablefor large models. The computational costs of ENS and SAM are in the same order of magnitude (although ENS is generally cheaper than SAM) and they are intermediate between FIM and PPL.
Parallelization is a way of reducing the wall clock time of the computations. In this regard, it should be noted that PPL is easily parallelizable, while other approaches such as FIM and ENS are not. Yet, the most computationally demanding step of these methods is the parameter optimization, which can be performed with different techniques. In principle, parallelizable strategies such as multistart optimization could be used to this end [28]; however, in the present work this step was performed with a metaheuristic optimization method that is less amenable to parallelization. SAM methods are generally more difficult to parallelize, but there are also approaches to exploit computation resources [29].
Our results suggest a trade-off between computational scalability, on the one hand, and accuracy and statistical rigor on the other. At one end of the trade-off there is the FIM-based method, which should be chosen only if the other approaches are computationally too expensive for the problem under consideration. At the other end there is the PPL method, whose computational cost hampers its application to high-dimensional problems. ENS and SAM lie between both extremes; while ENS has a lower computational cost, SAM provides a clearer statistical interpretation.
In this studies we did not assess the flexibility of the different uncertainty analysis approaches. While the four methods are generally applicable to every nonlinear ODE model, a difference may exist if not only the parameter values but also the model structure is uncertain. While in the present work we have not considered this possibility, such uncertainty can easily be taken into account in the ENS and the SAM framework by building an ensemble of models with different structures. For PPL and the FIM-based approach one could encode the existence of different possible structures using additional "on/off" parameters, turning the parameter estimation to a mixed-integer optimization problem. In general the topic is related to model averaging.
We note that in this study a specific set of state-of-the-art optimization, profile calculation and sampling methods was used. This selection influences the results. Yet, we made an effort to select efficient and robust approaches within the respective classes, based on previous benchmarking studies (e.g., [28], [30]) and own experiences. We expect that the qualitative findings are robust to the choice of the methods as well as models. Accordingly, we expect this assessment to be of broad relevance.