Physics Enhanced Data-Driven Models With Variational Gaussian Processes

Centuries of development in the natural sciences and mathematical modeling provide valuable domain-expert knowledge that has yet to be fully exploited in the development of machine learning models. When modeling complex physical systems, both domain knowledge and data provide necessary information about the system. In this paper, we present a data-driven model that takes advantage of partial domain knowledge in order to improve generalization and interpretability. The presented approach, which we call EVGP (Explicit Variational Gaussian Process), has the following advantages: 1) it uses available domain knowledge to improve the assumptions (inductive bias) of the model, 2) it scales to large datasets, 3) it is more interpretable. We show how the EVGP model can be used to learn system dynamics using basic Newtonian mechanics as prior knowledge, and we demonstrate that adding prior domain knowledge to data-driven models allows them to outperform purely data-driven models.


I. INTRODUCTION
For centuries, scientists and engineers have worked on creating mathematical abstractions of real-world systems. This principled modeling approach provides a powerful toolbox for deriving white-box models that we can use to understand and analyze physical systems. However, as the complexity of physical systems grows, deriving detailed principled models becomes an expensive and tedious task that requires highly experienced scientists and engineers. Moreover, incorrect assumptions lead to inaccurate models that are unable to represent the real system.
Data-driven black-box models provide an appealing alternative modeling approach that requires little to no domain knowledge. These models are fit to data extracted from the real system, minimizing the problems caused by incorrect assumptions. However, using data-driven models while completely ignoring domain knowledge may lead to models that do not generalize well and are hard to understand. Completely black-box approaches ignore the structure of the problem, wasting resources [1] and making the model less explainable [2].
Gray-box models combine domain knowledge and data as both provide important and complementary information about the system. Domain knowledge can be used to construct a set of basic assumptions about the system, giving the data-driven model a baseline to build upon. Data can be used to fill the gaps in knowledge and model complex relations that were not considered by the domain expert.
In this paper, we explore an approach for embedding domain knowledge into a data-driven model in order to improve generalization and interpretability. The presented gray-box model, which we call EVGP (Explicit Variational Gaussian Process), is a scalable approximation of a sparse Gaussian process (GP) that uses domain knowledge to define the prior distribution of the GP. In this paper, the domain knowledge is physics-based; however, the EVGP can be applied to any domain.
The work in this paper rests on three cornerstones (Fig. 1): 1) Gaussian processes are used for learning complex non-linear behavior from data while providing uncertainty estimations, 2) partial domain knowledge is used as a prior in order to improve the inductive bias, 3) variational inference provides scalability to large datasets. Inductive bias refers to the assumptions made by the model when making predictions on novel data. The presented approach provides uncertainty estimations, which are fundamental to avoid the risk associated with overconfidence in unexplored areas [3] and to warn the user of possible incorrect estimations [4].
The work in this paper is highly applicable when: 1) modeling physical systems with uncertainty estimations, 2) partial domain knowledge of the system is available, 3) large quantities of data are available. The aim is to help the engineer take advantage of available knowledge without requiring the derivation of complex and detailed models. Instead, an engineer only has to provide simple, partially formed models, and the EVGP takes care of filling the gaps in knowledge. We show how the EVGP model can be used to learn system dynamics using basic physics laws as prior knowledge. We specifically demonstrate the approach on multi-body physics systems. We chose the pendulum as the base for our discussion to demonstrate how concepts from a simple system can be used as building blocks for priors of 2-DOF systems up to a 7-DOF robot arm.
The contributions of this paper are:
- We present a stochastic data-driven model enhanced with domain knowledge that improves model accuracy while using less data than fully data-driven approaches.
- The presented approach uses very simple priors, demonstrating that accuracy improves even when heavily simplified physics is used as domain knowledge.
- Scalability to large datasets is achieved by using the sparse variational framework.
- Interpretability is exemplified using the posterior distribution of the trainable parameters, i.e., the value of the model parameters after training.

Our objective is to evaluate the benefits of including partially defined (imperfect) models as domain knowledge in data-driven models. Hence, our experiments compare the performance of the EVGP approach with that of fully data-driven (black-box) models.
The rest of the paper is organized as follows: section II presents the proposed EVGP approach; section III presents a set of priors derived from simplified Newtonian dynamics for the EVGP model; section IV presents the experiments, where we compare the EVGP with fully data-driven models; section V presents related work; section VI discusses extensions of the presented approach: application to other domains, explicit handling of noise in dynamic-system modeling, and handling of non-continuous systems; section VII concludes the paper.

II. EVGP APPROACH - EXPLICIT VARIATIONAL GP
The novel EVGP approach presented in this paper is designed to solve regression problems under uncertainty. Figure 2 offers a visual representation of the approach.
Given a dataset D = {(x^(i), y^(i))} composed of input/output pairs of samples (x^(i), y^(i)), we would like to obtain a predictive distribution p(y|x, D) that estimates the value of the output y for a given input x. The EVGP model approximates p(y|x, D) using variational inference. The EVGP is defined as a distribution p(y|x, ω), where ω is a set of parameters with prior distribution p(ω).
In the following sections we describe in detail: A) the EVGP model definition, B) the variational loss function used to train the model, C) the predictive distribution that approximates p(y|x, D). Appendix A provides a quick-reference description of the acronyms and symbols used.

A. EVGP MODEL DEFINITION
The EVGP model takes the following form:

y = g(x) + ε_y,   g(x) = h(x)^T β + f(x),   (1)

where f(x) ∼ GP(0, k(·, ·)) is a Gaussian process with kernel k, ε_y ∼ N(0, Σ_y) is the observation noise, and g(x) is the denoised prediction for the input x. Figure 2 offers a visual representation of the model. The following are the main components of the EVGP model:
- Domain knowledge is embedded in the explicit function h(x)^T β, parameterized by β. The function h(x) describes a set of features which are explicitly stated by the domain expert (hence the name of our method). β is modeled using a normal distribution with a prior that is also extracted from domain knowledge. In this paper, h(x)^T β is derived from partially defined Newtonian mechanics.
- The Gaussian process f(x) adds the ability to learn complex non-linear relations, increasing the capacity of the entire model by supplementing the function h(x)^T β.

Given a dataset D, the exact predictive distribution p(y|x, D) for the model in Eq. (1) is described in [5]. For the rest of the paper, we refer to this exact distribution as the EGP. Computing the EGP predictive distribution has a large computational cost and does not scale well to large datasets.
To alleviate this problem, sparse approximation methods [6] use a small set of m inducing points represented by a tuple (f_m, X_m), where f_m is a vector of size m and X_m is a matrix of size m × d, with d the input dimension. The inducing points (f_m, X_m) are treated as trainable parameters, which makes it possible to approximate the predictive distribution without requiring the entire dataset.
In order to construct a sparse approximation of the model in Eq. (1), we use a set of m inducing points (f_m, X_m) as parameters that will be learned from data. Given (f_m, X_m) and a set of test points (g, X), the prior distribution of the model can be expressed as follows:

[g; f_m] ∼ N( [H_x β; H_m β], [K_xx, K_xm; K_mx, K_mm] ),

where X denotes the data matrix, in which each row represents an individual sample. The rows of H_x contain the function h(·) applied to the real samples X; the rows of H_m contain the function h(·) applied to the inducing (learned) points X_m. Using the conditional rule for multivariate Gaussian distributions, we obtain the definition of the denoised sparse EVGP model:

p(g | f_m, β) = N( H_x β + K_xm K_mm^{-1} (f_m − H_m β),  K_xx − K_xm K_mm^{-1} K_mx ),   (2)

where ω = {f_m, β} are the parameters of our model. Here, K_xm, K_mm, K_mx, and K_xx are matrices containing the Gaussian-process kernel k(·, ·) evaluated between (X, X_m), (X_m, X_m), (X_m, X), and (X, X), respectively. Equation (2) defines our scalable EVGP model. In the following sections, we place prior distributions over the parameters ω and perform approximate Bayesian inference. Note that Eq. (2) is also conditioned on X_m; we do not indicate this explicitly in order to improve readability.
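The sparse conditional in Eq. (2) can be sketched in a few lines of NumPy. This is an illustrative implementation under our own naming (`rbf_kernel`, `sparse_conditional`, and the feature map `h` are assumptions, not the authors' code); it computes the mean and covariance of g given the inducing values f_m and the explicit-feature weights β:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def sparse_conditional(X, X_m, f_m, beta, h, jitter=1e-8):
    """Mean and covariance of g | f_m, beta as in Eq. (2)."""
    H_x, H_m = h(X), h(X_m)                      # explicit features
    K_xx = rbf_kernel(X, X)
    K_xm = rbf_kernel(X, X_m)
    K_mm = rbf_kernel(X_m, X_m) + jitter * np.eye(len(X_m))
    Q = K_xm @ np.linalg.inv(K_mm)               # K_xm K_mm^{-1}
    mean = H_x @ beta + Q @ (f_m - H_m @ beta)
    cov = K_xx - Q @ K_xm.T
    return mean, cov
```

A quick sanity check: when the test points coincide with the inducing points, the conditional collapses onto f_m with (near-)zero covariance.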

B. VARIATIONAL LOSS
In this section we present the loss function used to fit our model. We follow a variational Bayesian approach (see the supplemental material for a brief overview). Given a training dataset D and a prior distribution p(ω), we wish to approximate the posterior distribution p(ω|D). The posterior of ω is approximated with a variational distribution p_φ(ω) ≈ p(ω|D) parameterized by φ. For the EVGP parameters ω = {f_m, β}, we use the following variational posterior distributions:

p_φ(f_m) = N(a, A),   p_φ(β) = N(b, B),

where (a, b) are the mean values of the respective normal distributions and (A, B) are the covariance matrices. The prior distributions for ω are also defined as multivariate normal distributions:

p(f_m) = N(H_m μ_β, K_mm),   p(β) = N(μ_β, Σ_β).

These prior distributions represent our prior knowledge, i.e., our knowledge before looking at the data. The matrix K_mm is computed using the GP kernel on the inducing points X_m, while μ_β and Σ_β are provided by the expert. Given the training dataset D = (y, X), the parameters φ of p_φ(ω) are learned by minimizing the negative Evidence Lower Bound (ELBO). For the EVGP, the negative ELBO takes the following form:

−L_ELBO = −Σ_{i=1}^{|D|} E_{p_φ(ω)}[ log N( y^{(i)} | g(x^{(i)}), Σ_y ) ] + L_KL.   (3)

The term L_KL is the KL-divergence between the posterior and prior distributions of the parameters:

L_KL = KL( p_φ(f_m) ‖ p(f_m) ) + KL( p_φ(β) ‖ p(β) ).

A detailed derivation of the variational loss in Eq. (3) is presented in the supplemental material. The negative ELBO (Eq. 3) serves as our loss function to learn the parameters φ given the training dataset (y, X). In order to scale to very large datasets, the ELBO is optimized using mini-batches. In our case, the parameters of the variational approximation are φ = {a, A, b, B, X_m}.
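The KL term decomposes into Gaussian-to-Gaussian divergences, which have a closed form. A minimal sketch (our own helper function, not the authors' code):

```python
import numpy as np

def kl_mvn(mu_q, S_q, mu_p, S_p):
    """Closed-form KL( N(mu_q, S_q) || N(mu_p, S_p) ) between multivariate normals."""
    k = len(mu_q)
    S_p_inv = np.linalg.inv(S_p)
    diff = mu_p - mu_q
    return 0.5 * (np.trace(S_p_inv @ S_q)
                  + diff @ S_p_inv @ diff
                  - k
                  + np.log(np.linalg.det(S_p) / np.linalg.det(S_q)))
```

L_KL is then the sum of `kl_mvn` applied to the (a, A) posterior against the f_m prior and the (b, B) posterior against (μ_β, Σ_β).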

C. PREDICTIVE DISTRIBUTION
After learning the parameters φ, we can provide estimations using the approximated variational distribution p_φ(ω). Given a set of test points X̂, the estimated denoised predictive distribution is computed as an expectation of Eq. (2) w.r.t. p_φ(ω):

p(ĝ | x̂, D) ≈ N( μ_ĝ, Σ_ĝ ),
μ_ĝ = H_x b + K_xm K_mm^{-1} (a − H_m b),
Σ_ĝ = K_xx − K_xm K_mm^{-1} (K_mm − A) K_mm^{-1} K_mx + H_x B H_x^T.   (4)

Note that ĝ is simply a denoised version of ŷ. Eq. (4) approximates p(ĝ|x̂, D) using the learned distribution p_φ(ω). The result is equivalent to [7] with the addition of H_x b to the mean and H_x B H_x^T to the covariance. These additional terms include the information provided by the prior function H_x with the parameters b and B that were learned from data.
The approximated predictive distribution with observation noise is the following:

p(ŷ | x̂, D) ≈ N( μ_ĝ, Σ_ĝ + Σ_y ).

In the next section, we show how Eq. (4) can be used to model system dynamics and predict the next state of a physical system given the control input and current state.
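The predictive moments of Eq. (4) follow directly by taking the expectation of the sparse conditional (Eq. 2) under the variational posteriors q(f_m) = N(a, A) and q(β) = N(b, B). A NumPy sketch under our own naming (the kernel, jitter, and feature map are assumptions for illustration):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def evgp_predict(X_test, X_m, a, A, b, B, h, jitter=1e-8):
    """Approximate predictive moments of Eq. (4): the expectation of the
    sparse conditional under q(f_m) = N(a, A) and q(beta) = N(b, B)."""
    H_x, H_m = h(X_test), h(X_m)
    K_xx = rbf_kernel(X_test, X_test)
    K_xm = rbf_kernel(X_test, X_m)
    K_mm = rbf_kernel(X_m, X_m) + jitter * np.eye(len(X_m))
    Q = K_xm @ np.linalg.inv(K_mm)
    mean = H_x @ b + Q @ (a - H_m @ b)
    cov = K_xx - Q @ (K_mm - A) @ Q.T + H_x @ B @ H_x.T
    return mean, cov
```

With zero posterior covariances (A = B = 0) and test points at the inducing locations, the prediction collapses onto the inducing mean a, which is a useful check when implementing the model.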

III. EMBEDDING PHYSICS-BASED KNOWLEDGE
In this paper, we apply the EVGP model to learn the dynamics of a physical system. The state z[t] of the physical system can be modeled using the following recurrent version of the EVGP:

y[t] = z[t+1] = g( z[t] ⊕ u[t] ),   (5)

where u[t] is the control input and y[t] is the measured output of the system at time t. The symbol ⊕ denotes concatenation, and z[t] ⊕ u[t] is the input to the EVGP model. For example, in the case of a mechatronic system: u[t] contains the forces applied by the actuators (e.g., electric motors); z[t] contains the positions and velocities of the joints; y[t] is the output from the sensors. We assume independent EVGP models for each output y[t] in Eq. (5). The function g(·) in Eq. (5) is modeled using the EVGP model from Eq. (1) and Eq. (4). The recurrent EVGP primarily handles model uncertainty, thanks to the variational approximation used in the EVGP; we do not consider observation or process noise. In the following sections we show how simple Newtonian mechanics can be used to define useful priors h(x)^T β for the EVGP model. Figure 3(a) shows a simple example of a single rigid-body link. The simplest model that we can use for this system comes from Newton's second law, u = J q̈_1, where u is the torque applied to the system, J is the moment of inertia, and q_1 is the angle of the pendulum. Using the Euler discretization method, we obtain the following state-space representation, which serves as the prior h(x)^T β for our EVGP model:

q_1[t+1] = q_1[t] + Δt q̇_1[t],
q̇_1[t+1] = q̇_1[t] + (Δt / J) u[t].   (7)

A. PRIORS DERIVED FROM SIMPLE NEWTONIAN DYNAMICS
We refer to this prior as IF (inertia + force). Δt is the discretization period and z[t] = [q_1[t], q̇_1[t]]^T is the state of the system. The IF prior in Eq. (7) does not include gravitational effects. Gravity pulls the link toward the downward position with a force proportional to sin q_1. Hence, a prior that considers gravitational forces can be constructed by including sin(q_1[t]):

q̇_1[t+1] = q̇_1[t] + (Δt / J) u[t] + γ sin(q_1[t]).   (8)

We call this prior IFG (inertia + force + gravity). We purposely did not define J and γ: one of the advantages of the presented approach is that the algorithm can learn these parameters from data if they are not available. If the user does not know the values of J and γ, a prior with a large standard deviation can be provided for these parameters (large Σ_β). Although parameters like J and γ are easy to compute for a simple pendulum, for more complex systems they may be hard and tedious to obtain. Our objective is to take advantage of domain knowledge and allow the model to fill in the gaps in knowledge.
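As a concrete sketch of the IF and IFG priors, the Euler-discretized updates can be written as one-step functions (the function names and the Δt value are our own; Δt = 0.03 s matches the sampling period reported in the experimental section):

```python
import numpy as np

DT = 0.03  # discretization period, as in the experiments

def if_step(z, u, J):
    """IF prior (Eq. 7): Euler discretization of u = J * q1_ddot, no gravity."""
    q1, dq1 = z
    return np.array([q1 + DT * dq1,
                     dq1 + DT * u / J])

def ifg_step(z, u, J, gamma):
    """IFG prior (Eq. 8): IF plus the gravity feature gamma * sin(q1)."""
    q1, dq1 = z
    return np.array([q1 + DT * dq1,
                     dq1 + DT * u / J + gamma * np.sin(q1)])
```

In the EVGP these right-hand sides are not fixed equations: each coefficient (1, Δt, Δt/J, γ) is an entry of β with a prior mean and variance, so J and γ can be left vague and learned from data.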
For the rest of the paper, priors derived from Eq. (7) are referred to as IF priors, while priors based on Eq. (8) are referred to as IFG priors.
In the experiments (section IV) we compare the performance for both priors in order to illustrate how performance can be progressively improved with more detailed priors.

B. SIMPLIFIED PRIORS FOR ACROBOT, CARTPOLE AND IIWA
In addition to the pendulum, we consider the Acrobot, Cartpole, and IIWA systems in our analysis. For these systems, we consider much simpler priors than the exact dynamic models derived from multi-body dynamics. We use the same principles shown in the previous section to obtain simple priors for these systems. These rules are summarized as follows:
- For the IFG prior, gravity pulls the links toward the downward position proportionally to the sine of the angle w.r.t. the horizontal plane. Gravity has no effect when the links are completely down/up.

The objective with these priors is to demonstrate how extremely simplified priors extracted from simple physics can be used to improve the performance of data-driven models. Figure 3(c) shows a diagram of the Acrobot system. A simple prior for this system can be constructed by placing the prior in Eq. (7) on each of the links of the Acrobot, with the gravity features sin_1 = sin(q_1[t]) and sin_12 = sin(q_1[t] + q_2[t]). In this case, the input u[t] only drives the second link. The IF and IFG priors for the Cartpole (Figure 3(b)) are constructed in the same way. For the IIWA system, we constructed the priors following the same rules as before, with an approach that closely resembles the one used for the Acrobot. The priors were constructed by placing the prior in Eq. (7) on each of the IIWA links. This results in a matrix β with a diagonal structure similar to that of the matrices in Eq. (10). To compute the IFG factors for h(·), we derived an equation that can be used for a general 3D link (see Fig. 4). The equation uses forward kinematics to evaluate the contribution of the torque applied by gravity (v_g). In this expression, × denotes the cross product, · denotes the dot product, and v_x and v_z are unit vectors. The matrix R_i is the rotation matrix of the frame attached to link i; it is computed using classic forward kinematics, so its value depends on the value of q. The equation computes the torque applied by gravity about the axis of rotation v_z.
The result is separated into monomials, each added as a feature in h. We only use unit vectors, as we assume that information such as the position of the center of gravity will be captured by the posterior of β, i.e., the learned parameters. These priors are extremely simple, as they consider neither friction nor Coriolis/centrifugal forces. However, they provide important information about the mechanics of the system.

IV. EXPERIMENTS
In order to evaluate the performance of the presented model, we performed experiments on a set of simulated systems: Pendulum, Cartpole, Acrobot, and IIWA. We also performed qualitative tests on a toy dataset to visualize the behavior of the EVGP model.
We used Drake [8] to simulate the Pendulum, Cartpole, Acrobot, and IIWA systems and obtain the control/sensor data used to train and evaluate the EVGP models. We used the squared-exponential function for the covariance kernels; the reason for this choice is that all the experiments involve continuous systems. The EVGP model was implemented using TensorFlow, and the minimization of the negative ELBO loss was done using the ADAM optimizer [9]. The experiments were run on a computer with a single GPU (Nvidia Quadro P5000) and an Intel(R) Xeon(R) CPU (E3-1505M at 3.00 GHz).

A. EXPERIMENTS ON TOY DATASET
The toy dataset is intended to illustrate the behavior of the EVGP model and visualize the qualitative differences between several GP models. We use a modified version of the toy dataset used in [10], [7]. The dataset is modified by adding a global linear trend to the data (see Figure 5). We use this modification to illustrate how the EVGP is able to model both global and local trends. Global trends describe the overall shape of the function over the entire domain, while local trends describe deviations from the global behavior in specific regions of the domain. Figure 5 shows the distribution learned using different versions of a Gaussian process. In this case, the prior knowledge that we provide to the EVGP is a simple linear function h(x, β) = xβ_1 + β_2. Figure 5(d) shows how we can use the prior to control the global shape of the function: the EGP and EVGP models use the prior knowledge to fit the global (linear) behavior of the data while using the kernels to model the local non-linear behavior.

B. LEARNING SYSTEM DYNAMICS
We evaluated the performance of the EVGP model in learning system dynamics using data obtained from simulations of the Pendulum, Cartpole, Acrobot and IIWA systems. Concretely, we evaluated the accuracy of the EVGP model with IF and IFG priors in predicting the next state of the system given the current control inputs and state.
We show the advantages of the EVGP regarding generalization and interpretability as follows:
- For generalization, we evaluate all metrics on a testing dataset that has not been used during training. We show that the EVGP achieves lower errors on testing data than other models while using smaller training datasets, i.e., the EVGP generalizes better.
- For interpretability, we illustrate how to interpret the trained model by visualizing the parameter b from the posterior variational distribution.

Data: to evaluate the difference in generalization, we sampled two different datasets for each system: one for training and one for testing. The experimental procedure is described in Algorithm 1. We used the Pendulum, Cartpole, Acrobot, and IIWA models provided in the examples of the Drake library [8]. We did not include process noise or observation noise in the simulated systems. For the Pendulum, Cartpole, and Acrobot, the datasets were sampled by simulating the system using random initial states z[0] ∼ α U(−1, 1) and random control inputs u[t] ∼ η N(0, 1), drawn from uniform and normal distributions, respectively. Several simulations were performed to obtain the trajectories of states and control inputs, which were sampled with a period of Δt = 0.03 s. Table 1 shows the values of the scales (α, η) that were used to sample the trajectories. In Table 1, N refers to the number of sampled trajectories and |D| refers to the total number of samples. These values were chosen to cover at least the range (−π, π) for the angles of the systems.
For the IIWA, the datasets were obtained by simulating the robot executing a random reference trajectory. The IIWA was controlled by the inverse-dynamics controller provided by the Drake library, which allowed us to keep the state of the robot inside a reasonable operating region. The datasets were obtained by simulating the robot starting from a random position q[0] ∼ α U(−1, 1). A reference trajectory was generated by adding a random increment to the previous position: q[t+1] = q[t] + η U(−1, 1). This trajectory was used as the reference for the controller. The IIWA was simulated using several random reference trajectories, while data of inputs and states was sampled with a period of Δt = 0.006 s.
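The reference-trajectory sampling described above can be sketched as follows (the function name, dimensions, and scale values are illustrative defaults of ours, not the exact experimental settings):

```python
import numpy as np

def sample_reference(rng, n_joints=7, steps=200, alpha=1.0, eta=0.05):
    """Random reference trajectory: q[0] ~ alpha*U(-1,1), then
    q[t+1] = q[t] + eta*U(-1,1) for each subsequent step."""
    q = alpha * rng.uniform(-1.0, 1.0, n_joints)
    trajectory = [q]
    for _ in range(steps - 1):
        q = q + eta * rng.uniform(-1.0, 1.0, n_joints)
        trajectory.append(q)
    return np.stack(trajectory)
```

The resulting trajectory is fed to the inverse-dynamics controller as a reference, and the simulated states and inputs are then sub-sampled to form the dataset.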
Baseline: we compare the EVGP model with a standard VGP model [11], a residual VGP (RES-VGP) [12], [11], and a residual deep Bayesian neural network (RES-DBNN) [13], [14]. The VGP model is based on [11] and uses a zero-mean prior. The residual VGP and DBNN models assume the system can be approximated as z[t+1] = z[t] + g_r(z[t] ⊕ u[t]), where g_r represents either the VGP or the DBNN model. Approximating residuals is a common approach used to simplify the work of GP and DBNN models [12], [14], [15]. The RES-DBNN model is trained using the local reparameterization trick presented in [13], with normal variational distributions for the weights.

Algorithm 1 Experimental Procedure
Input: SYSTEM to sample data from. MODEL to be tested.

For this set of experiments, we did not consider the exact GP and EGP models, given the large number of samples in the training datasets. Table 2 shows the number of inducing points (m) used for the VGP and EVGP models. The table also shows the number of hidden units for the DBNN model, where [15, 15] means a two-layer network with 15 units in each layer. We used the LeakyReLU activation function. Table 3 shows the space and time complexity of the models considered in this paper. Space complexity shows how the memory requirements grow asymptotically as we increase the size of the models; time complexity shows how the execution time grows asymptotically as we increase the size of the models. In this table, m is the number of inducing points, o is the number of outputs, L is the number of hidden layers, and n is the number of hidden units in each layer. To simplify the analysis, we assume all hidden layers of the DBNN have the same number of hidden units. The table shows that the time complexity of the VGP and EVGP models is governed by the matrix inversion K_mm^{-1}, which is O(m^3). Because we assume completely independent VGP and EVGP models for each output of the system, their complexity also depends on the number of outputs o. The time complexity of the DBNN model is governed by the matrix-vector products between the weight matrices and the hidden activation vectors, which are O(n^2). For space complexity, the VGP and EVGP must store m inducing points for each of the o outputs, as we assume independent models for each output. The DBNN must store the weight matrices (n^2) for each of the L hidden layers. All models have constant space complexity w.r.t. the training dataset size |D|. Furthermore, all models have linear time complexity w.r.t. |D| if we assume that training requires visiting each sample in D at least once.
Metrics: for comparison, we used three metrics: 1) prediction error (Error), 2) predicted standard deviation (STD), 3) containing ratios (CR). The prediction error is computed as the average difference between the dataset values y^(i) and the model's expected output E[ŷ^(i)]; STD is computed as the average magnitude of the predicted standard deviation:

Error = (1/|D|) Σ_{i=1}^{|D|} ‖ y^(i) − E[ŷ^(i)] ‖,   STD = (1/|D|) Σ_{i=1}^{|D|} ‖ σ_ŷ^(i) ‖,   (11)

where |D| is the number of samples in the respective dataset. The expected output E[ŷ^(i)] for the EVGP model is equal to μ_ĝ in Eq. (4). For the DBNN model, the expectation is estimated using Monte Carlo sampling. The containing ratios (CR) are the percentage of dataset values covered by the estimated distribution ŷ. We consider containing ratios for one, two, and three standard deviations (CR-1, CR-2, and CR-3, respectively). We expect the best model to have Error and STD close to zero, while the best containing ratios are those closest to the (68-95-99.7) rule of normal distributions.
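The three metrics can be computed in a few lines (a sketch with our own function name; `std_pred` holds the per-sample predicted standard deviations):

```python
import numpy as np

def evaluate(y, mean_pred, std_pred):
    """Error and STD magnitude (Eq. 11) plus containing ratios CR-1..CR-3."""
    error = np.mean(np.linalg.norm(y - mean_pred, axis=-1))
    std = np.mean(np.linalg.norm(std_pred, axis=-1))
    # CR-k: fraction of values inside k predicted standard deviations
    crs = [np.mean(np.abs(y - mean_pred) <= k * std_pred) for k in (1, 2, 3)]
    return error, std, crs
```

For a well-calibrated model the CR values should be close to 68%, 95%, and 99.7%, respectively.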
Results: Table 4 shows the prediction error and CR scores obtained on the testing dataset. EVGP-IF and EVGP-IFG refer to the use of an IF or IFG prior, respectively. We observe a considerable improvement in the testing error and CR scores when using EVGP models: they provided the lowest error and the best CR scores. We also see a progressive improvement in the testing error when using more detailed priors. The IFG prior provided lower prediction errors than the IF prior. The EVGP-IFG model provided the estimations with the lowest prediction error, with low STD and the best CR scores. Table 4 also shows that the EVGP model achieved the best performance using the fewest parameters. Figure 6 shows a comparison of the prediction error on the test dataset as we increase the number of training samples. For this experiment, we kept the testing dataset fixed while samples were progressively aggregated into the training dataset. The figure shows the mean, max, and min values obtained over four independent runs. Figure 6(a) shows a comparison that includes all models. As expected, the prediction error is reduced as we increase the size of the training dataset. The figure shows that the EVGP provides the most accurate predictions while requiring fewer training samples. Figure 6(a) also shows how the performance of VGP and RES-VGP plateaus, struggling to take advantage of larger datasets. Although the RES-DBNN performs poorly with small training datasets, its high capacity allows it to take advantage of large datasets and improve accuracy, reducing the performance gap w.r.t. the EVGP models as more data becomes available. Thanks to the lower computational time-cost of the RES-DBNN (see Table 3), this model can use a larger set of parameters without incurring excessive training times. Figure 6(b) shows a scaled version that only considers the EVGP model with different priors.
This figure shows that the IFG prior provides more accurate predictions than the IF prior. In the case of the pendulum, the IFG prior provides a highly accurate model of the system, requiring only a small number of training samples. Figure 6(b) also shows that, as training data is aggregated, the accuracy gap between the IF and IFG priors is reduced. In the case of the IIWA, the performance of IF and IFG is very close. The IIWA proved to be the most challenging system: even though the DBNN provided CR-3 values above 89% for the Pendulum, Cartpole, and Acrobot, it struggled with the IIWA in terms of containing ratios. The advantage of IFG over IF for the IIWA is mostly observed in the containing ratios in Table 4. Figure 7 shows a visualization of the EVGP state estimations of the IIWA for multiple time-steps ahead. The multi-step estimations are performed using a standard sequential Monte Carlo method, where particles are used to unroll the EVGP model and predict the state trajectory 45 time-steps ahead [14], [15]. The figure shows that the EVGP is able to provide long-term estimations that follow the dynamics of the physical system. We use the standard deviation to visualize the uncertainty, and the figure shows how the uncertainty propagates over time. Table 5 shows the single-step and multi-step estimation errors evaluated on the testing dataset. The multi-step error is the average of the error in Eq. (11) computed across the estimated state trajectory for 45 time-steps ahead; the single-step error is equivalent to Eq. (11). The table shows a comparison of the errors between the EVGP and different RES-DBNN models
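The multi-step rollout can be sketched as a simple particle propagation (an illustrative skeleton of ours; `one_step` stands in for the trained model's per-particle predictive moments and is an assumption, not the authors' API):

```python
import numpy as np

def rollout(one_step, z0, u_seq, n_particles=100, rng=None):
    """Unroll a one-step stochastic model with particles: each particle
    samples its next state from the predictive distribution, and the
    per-step mean and standard deviation summarize the trajectory."""
    rng = rng or np.random.default_rng(0)
    Z = np.tile(z0, (n_particles, 1))
    means, stds = [], []
    for u in u_seq:
        mu, sigma = one_step(Z, u)              # predictive moments per particle
        Z = mu + sigma * rng.standard_normal(Z.shape)
        means.append(Z.mean(axis=0))
        stds.append(Z.std(axis=0))
    return np.stack(means), np.stack(stds)
```

With a stochastic `one_step`, the per-step standard deviation grows along the horizon, which is the uncertainty propagation visualized in Figure 7.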

TABLE 5 Activation Functions and Error Comparison on IIWA
with different activation functions. The table shows that the EVGP provides the lowest single-step and multi-step errors among all models. It also shows how small single-step errors are accentuated in the multi-step error. Finally, the table shows that the LeakyReLU activation function provides the DBNN model with the lowest error, which is why we chose it for all other experiments.
The priors that we use are extremely simple: they ignore friction and Coriolis/centrifugal effects. Nonetheless, we observe a considerable performance improvement after providing our data-driven model with basic information through the IF and IFG priors. Although we did not include observation or process noise in the training or testing data, we found that including the uncertainty term Σ_y in the EVGP (Eq. (5)) was essential to obtain stable training, i.e., the error steadily decreasing during training while avoiding NaN results.
Understanding the learned model: one of the advantages of incorporating domain knowledge is that the learned model is easy to interpret for the domain expert. For example, in the case of the Acrobot, the value of the parameter β can be visualized to understand and debug what the model has learned. Figure 8 shows the value of b learned with the IF (Fig. 8(a)) and IFG (Fig. 8(b)) priors.
We observe in Figure 8 that the learned parameters follow a structure similar to the one given by the prior (see Eq. (10)). In our experiments, we did not enforce the sparse structure of the priors, i.e., parameters that are zero in the prior are allowed to take non-zero values in the posterior. Figure 8(a) shows that, when using the IF prior, the EVGP model compensates for the missing gravity information by assigning negative values to the (q̇_1, q_1) and (q̇_2, q_2) entries. The reason for this behavior is that q_1 ≈ sin q_1 for small q_1; however, this approximation is no longer valid for large q_1. When using IFG priors (Fig. 8(b)), we observe that the model no longer assigns negative values to these entries. The reason is that IFG provides the values of sin(q_1) and sin(q_1 + q_2), which help to model the effect of gravity more accurately. Figure 9(a) shows the visualization of the mean variational posterior b for the IIWA. Similar to the Acrobot, the IIWA posterior has a familiar diagonal structure that resembles the priors in Eq. (10). This observation matches our expectations, as we expect the posterior to resemble the prior, which was constructed by placing the pendulum prior on each of the IIWA links. The figure also shows that the lower rows, corresponding to q̇, have larger values than those corresponding to q. This is also expected, as most of the inputs should only directly affect the velocity (according to Eq. (7)); the position can then be estimated by discrete integration of the velocities, an observation that is embedded in the priors and evidenced by the diagonal structure and low values of the top rows in Figure 9(a). Figure 9(c) shows the same posterior with the off-diagonal elements constrained to be zero. Figure 9(c) is an example of how an engineer can manipulate the model by visually inspecting the learned posterior.
In our methodology, constraining elements of b is straightforward, as b is a trainable parameter whose individual elements can be reset to predefined values at each training step. In practice, we did not see any accuracy improvement from constraining zero values in the posterior, but this exemplifies the options available to the engineer thanks to the interpretability provided by the relationship between the prior h(x)^T β and the posterior mean b.
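The constraining mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prior matrix, mask, learning rate, and the stand-in gradient are all illustrative choices.

```python
import numpy as np

# Sketch: pin selected entries of the variational posterior mean `b` to
# their prior values after each gradient step. `mask` marks entries the
# engineer wants fixed (e.g., structural zeros from the prior).

def apply_constraints(b, prior_b, mask):
    """Reset the masked entries of b to their prior values."""
    b = b.copy()
    b[mask] = prior_b[mask]
    return b

rng = np.random.default_rng(0)
prior_b = np.diag([1.0, 1.0, 0.5])   # sparse prior structure (illustrative)
mask = (prior_b == 0.0)              # pin the off-diagonal zeros
b = prior_b.copy()

for _ in range(10):
    grad = rng.normal(size=b.shape)  # stand-in for the actual ELBO gradient
    b -= 0.1 * grad                  # unconstrained gradient step
    b = apply_constraints(b, prior_b, mask)

# After training, the masked entries are untouched while the rest moved.
```

The same idea extends to any subset of β known with high precision: the unconstrained entries are still learned from data, while the constrained ones keep the values dictated by domain knowledge.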

V. RELATED WORK
Incorporating prior scientific knowledge into machine-learning algorithms is an ongoing effort that has recently gained increased attention. In [16], LEAP nets are introduced for system identification applied to power systems; the architecture is based on ResNets and uses a set of discrete structural parameters that encode the network topology. Hamiltonian neural networks [17] embed Hamiltonian mechanics in the architecture of the neural network to ensure that the model follows energy conservation laws. DGNets [18] use a more general formulation for energy-based modeling than Hamiltonian networks and extend the methodology to discrete time. Convolutional neural networks (CNNs) have been used in modeling and simulation applications such as forecasting of sea surface temperature [19] and efficient simulation of the Navier-Stokes equations [20]. Hybrid modeling approaches that combine data-based and physics-based methods are also starting to gain attention in modeling cyber-physical systems [21]. Hybrid models and physics-based loss functions have been introduced in physics-informed neural networks [22], [23] in order to ensure that the trained neural network satisfies energy conservation. In [24], first-principles models are combined with machine learning to create deterministic models of process engineering systems: the machine learning model is trained separately to predict the mismatch between a first-principles model and the target solution, no training is performed on the first-principles model, and the solutions of the two models are added to provide the final prediction. In contrast, for the EVGP we model domain knowledge (extracted from simplified first principles) using a prior probability distribution. This allows us to encode uncertainty in the provided domain knowledge, which in our case is important as we consider domain knowledge to be imperfect.
When we train the EVGP, we use variational inference to obtain a posterior over the parameters given in the prior distribution; the mean of this posterior is represented by b. Data is used to simultaneously tune the parameters in the domain-knowledge function h(x)^T β and train the data-driven GP.
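To build intuition for how data moves the posterior mean b away from the prior mean, consider the linear part of the model in isolation. This is an illustrative sketch, not the paper's objective: with a Gaussian prior β ~ N(b0, B0) and a Gaussian likelihood y = Hβ + ε, the optimal Gaussian variational posterior N(b, B) coincides with the exact conjugate posterior, so it can be computed in closed form. All constants below are made up for the example.

```python
import numpy as np

def posterior(H, y, b0, B0, noise_var):
    """Closed-form Gaussian posterior for a Bayesian linear model.

    B = (B0^-1 + H^T H / s^2)^-1
    b = B (B0^-1 b0 + H^T y / s^2)
    """
    B = np.linalg.inv(np.linalg.inv(B0) + H.T @ H / noise_var)
    b = B @ (np.linalg.inv(B0) @ b0 + H.T @ y / noise_var)
    return b, B

rng = np.random.default_rng(1)
true_beta = np.array([2.0, -1.0])          # ground truth (illustrative)
H = rng.normal(size=(50, 2))               # features h(x) stacked row-wise
y = H @ true_beta + 0.1 * rng.normal(size=50)

b0 = np.zeros(2)                           # prior mean from domain knowledge
B0 = np.eye(2)                             # prior covariance
b, B = posterior(H, y, b0, B0, noise_var=0.01)
# b moves from b0 toward true_beta; B quantifies the remaining uncertainty.
```

In the full EVGP, this update is not available in closed form because of the GP component, which is why the paper resorts to variational inference; the sketch only shows the role played by b and B.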
Gaussian Processes (GP) have been used as a general purpose non-parametric model for system identification and control under uncertainty [12], [25]. Previous work has explored using GPs to include partial model information [26]. In [27], the EGP model described in [5] is used in combination with a simplified physics model in a thermal building modeling application. Our work is based on the GP model with explicit features (EGP) presented in [5]. Variations of this model are commonly used in calibration of computer models [28], [29]. To the best of our knowledge, we are the first to apply variational inference to the EGP model in order to embed simplified physics models and improve scalability.
Despite the advantages of GP models for modeling complex non-linear relations with uncertainty, GPs are computationally expensive, and a large body of research has focused on improving their computational requirements. Sparse GP approximation methods are some of the most common approaches for reducing GP computation cost [6]. Bayesian approximation techniques such as variational inference provide a rich toolset for dealing with large quantities of data and highly complex models [30], [7]. Variational approximations of sparse GPs have been explored in [7], [11]. In [3], a variational GP model is presented for nonlinear state-space models. In [15], [14], Deep Bayesian Neural Networks (DBNNs) are proposed as an alternative to GPs in order to improve scalability in reinforcement learning problems. Given the popularity of GP models and variational inference, there is an increased interest in developing automated variational techniques for these types of models [31], [32].

VI. DISCUSSION
In this paper, we wanted to demonstrate how simple concepts from physics can be scaled and embedded into machine learning models. For this reason, we chose the pendulum as a representative domain problem, demonstrating how it can be used as a building block for 2-DOF systems and expanded to a full 7-DOF manipulator. In a similar manner, the EVGP can be used in other areas such as thermal, electromagnetic, and fluid systems, which are usually modeled using differential equations. The EVGP as used in Eq. (5) can approximate discretized versions of these systems. The case study presented in Section III-B uses an Euler discretization of the differential equation to build the function h(x)^T β, as shown in Eq. (7). The presented approach can be extended to other areas by extracting priors from discretized differential equations of the respective area of interest and introducing them in the function h(x)^T β of the EVGP. If detailed domain knowledge is available, it can be directly incorporated in h(x)^T β. Furthermore, if some parameter values in β are known with high precision (for example, some of them are known to be zero), it is possible to constrain these values in the posterior mean b to ensure that they are not changed by training.
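The prior-extraction recipe above can be sketched for a single pendulum link. This is a toy illustration in the spirit of Eq. (7), not the paper's exact feature set: the constants, the feature ordering, and the variable names are illustrative assumptions. The continuous model q̈ = -(g/l) sin(q) + u/(m l²) is Euler-discretized, and the resulting linear-in-features step defines h(x) and the prior mean of β.

```python
import numpy as np

# Illustrative Euler discretization of a pendulum link:
#   q[t+1]  = q[t]  + ts * qd[t]
#   qd[t+1] = qd[t] + ts * (-(g/l) * sin(q[t]) + u[t] / (m * l**2))
# Both updates are linear in the features h(x) = [q, qd, sin(q), u].

ts, g, l, m = 0.01, 9.81, 1.0, 1.0  # step size and physical constants (toy values)

def h(x):
    """Feature function for one link; x = (q, qd, u)."""
    q, qd, u = x
    return np.array([q, qd, np.sin(q), u])

# Prior mean of beta: one row of coefficients per output dimension.
beta_prior = np.array([
    [1.0, ts,  0.0,          0.0],             # row for q[t+1]
    [0.0, 1.0, -ts * g / l,  ts / (m * l**2)], # row for qd[t+1]
])

x = (0.3, 0.0, 0.0)            # small initial angle, at rest, no control input
z_next = beta_prior @ h(x)     # one Euler step of the prior dynamics
# Position is unchanged after one step; velocity decreases (gravity pulls back).
```

Stacking one such block per link is how the pendulum prior scales from the 2-DOF Acrobot to the 7-DOF manipulator; the zero entries of beta_prior produce the sparse diagonal structure discussed in Section IV.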
For our analysis of system dynamics, we did not consider measurement or process noise. The presented approach focuses primarily on handling model uncertainty, i.e., uncertainty that arises from the approximate nature of the learning algorithm [12], [14], [15]. We achieve this by using Gaussian processes and the variational inference framework. Although the EVGP is able to handle observation noise when used in regression problems such as in Fig. 5, the recurrent structure used for system dynamics (Eq. 5) makes the approach sensitive to observation noise. This limitation of recurrent GP formulations has been studied in the existing literature [33]-[35], but an in-depth analysis is out of the scope of this paper. These approaches can be incorporated into the recurrent formulation of the EVGP to explicitly handle observation noise. For example, the direct method from [35] can be incorporated into the EVGP by using moment-matching to estimate denoised states given a sequence of previous noisy measurements. The denoised states can then be introduced as inputs to the EVGP and the entire model trained using back-propagation.
In this paper, all discussed examples were continuous systems. Hence, we used the squared exponential kernel in all experiments as it encodes smoothness assumptions. The presented approach can be extended to non-continuous systems in several ways: 1) discontinuous functions can be added to the domain-knowledge function h(x); 2) a kernel that is adapted to model discontinuities can be introduced instead of the squared exponential [36]; 3) a transformation of the input space can be learned to find a representation that encodes the discontinuity [37].
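For reference, the smoothness-encoding kernel mentioned above is straightforward to write down. The sketch below is a minimal, generic implementation of the squared exponential kernel; the lengthscale and variance hyperparameters are placeholder values, not the ones used in the experiments.

```python
import numpy as np

def se_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared exponential (RBF) kernel:
    k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2)).
    Encodes the assumption that nearby inputs produce similar outputs.
    """
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

X = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
K = se_kernel(X, X)
# K is symmetric, positive semi-definite, with `variance` on the diagonal.
```

Options 2) and 3) above amount to replacing se_kernel with a discontinuity-aware kernel, or applying a learned input transformation before calling it.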

VII. CONCLUSION
In this paper, we presented the EVGP model, a stochastic data-driven model that is enhanced with domain knowledge extracted from simplified physics. The EVGP is a variational approximation of a Gaussian Process which uses domain knowledge to define the mean function of the prior. The priors provide a rough but simple approximation of the mechanics, informing the EVGP of important structure of the real system. We compared the performance of the EVGP model against purely data-driven approaches and demonstrated improved accuracy and interpretability after incorporating priors derived from simplified Newtonian mechanics. We showed that as we include more detailed priors, the performance of the algorithm increases. We also showed that the difference between physics-enhanced and purely data-driven models becomes smaller as we increase the amount of training data, while for smaller training datasets the EVGP clearly outperforms the purely data-driven approaches. Finally, we illustrated how visualizing the learned parameters can be useful to gain insights into the learned model and hence improve understandability.

NOMENCLATURE
φ = (a, A, b, B, X_m)  Set of variational parameters of the EVGP.
p_φ(f_m) = N(a, A)  Variational distribution of the parameter f_m, defined as a normal distribution with mean a and covariance matrix A.
p_φ(β) = N(b, B)  Variational distribution of the parameter β, defined as a normal distribution with mean b and covariance matrix B.

EVGP for physical system modeling:
t  Time (discrete).
t_s  Sampling interval in seconds.
y[t]  Output (vector) of the EVGP at time t, composed of the outputs of independent EVGP models.
z[t]  State at time t.
u[t]  Control input at time t.
x[t] = z[t] ⊕ u[t]  Input of the EVGP at time t.
J, γ_i  Parameters used to define the structure and values of β. They are used only to illustrate how to embed domain knowledge in h(x)^T β, but they ultimately become part of β.
q_i  Position of link i.
R_i  Rotation matrix of the frame attached to link i.
v_x, v_y, v_z  Unit vectors along the x, y, and z axes, respectively.
v_g  Unit vector in the direction of gravity.