Model error correction in data assimilation by integrating neural networks

In this paper, we propose a new methodology that integrates Neural Networks (NNs) into Data Assimilation (DA). Focusing on structural model uncertainty, we propose a framework for integrating NNs with physical models through DA algorithms, to improve both the assimilation process and the forecasting results. The NNs are iteratively trained as observational data are updated. The main DA models used here are the Kalman filter and the variational approaches. The effectiveness of the proposed algorithm is validated by numerical examples and by a sensitivity study.


Introduction and Motivation
Data Assimilation (DA) is an uncertainty quantification technique by which measurements (see Eq. (1), Algorithm 1) and model predictions (see Eq. (2), Algorithm 1) are combined to obtain an accurate representation of the state of the modeled system [1]. DA was first proposed in atmospheric models and has been widely applied in climatology, geophysics, aerodynamics, and economic models in the past decades. Recently, DA has also been applied in numerical simulations of geophysical applications [2] and in the medical and biological sciences [3, 4] to improve the reliability of the numerical simulations. Today, there are many DA algorithms. Those which have gained acceptance as powerful methods in the last ten years include the variational DA approach and the Kalman Filter. These approaches assume that the two sources of information, forecasts and observations, have errors that are adequately described by stationary error covariance matrices.

Algorithm 1: A(DA)
Input: A time step $t_k$; an assimilation window $A$; a numerical discretization of a dynamic system $F_k: \mathbb{R}^{n_x} \to \mathbb{R}^{n_x}$, where $x_k \in \mathbb{R}^{n_x}$ is a state estimate vector, with an initial condition $x_0$; $M$ vectors of noisy measurements $y_j \in \mathbb{R}^{n_y}$ in the assimilation window $[t_k, t_{k+A}]$ such that $y_j = H_j x_j + \varepsilon_j$ (Eq. (1)), where $H_j: \mathbb{R}^{n_x} \to \mathbb{R}^{n_y}$ is the first-order linearization of the observation function.
1 Model integration: compute $x_k = F_k(x_{k-1}) + \xi_k$ (Eq. (2)), where $F_{0 \to k} = F_{k-1 \to k} \circ F_{k-2 \to k-1} \circ \cdots \circ F_{0 \to 1}$.
2 Define the covariance matrices $R_k$ and $B_k$.
3 Assimilate $x_k$ with $y_j$ (for $j = 1, \ldots, A$), i.e., compute $x_k^{DA} = \operatorname{argmin}_x J(x)$ (Eq. (3)).

The solution obtained by minimizing the cost function in Eq. (3) is the result of the Kalman Filter [5]; if the minimum is instead computed by an iterative variational scheme, the method is called the variational approach. In recent years, variational approaches and Kalman Filters have become more sophisticated, to better fit their application requirements and to circumvent their implementation issues. Nevertheless, these approaches cannot fully overcome their unrealistic assumptions, particularly linearity, normality, and zero error covariances. The first step in error analysis (also called sensitivity analysis) is to understand the errors that arise at the different stages of the solution process, namely, the uncertainty in the mathematical model, in the model's solution, and in the measurements. These are the errors intrinsic to the DA inverse problem. Moreover, there are the approximation errors introduced by linearization, discretization, and model reduction. These errors occur when infinite-dimensional equations are replaced by a finite-dimensional system, or when simpler approximations to the equations are developed. Finally, there are rounding errors introduced by working in finite-precision arithmetic during the implementation of the algorithm.
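For a linear observation operator, the minimizer of a cost function of the form in Eq. (3) has a closed form. The following is a minimal numpy sketch of one analysis step: the matrix names `B`, `R`, `H` follow Algorithm 1, and the gain expression is the standard Kalman gain, shown as an illustration rather than the paper's exact implementation.

```python
import numpy as np

def analysis_step(xb, y, H, B, R):
    """One linear analysis step: the minimizer of
    J(x) = (x - xb)^T B^{-1} (x - xb) + (y - Hx)^T R^{-1} (y - Hx)
    is xb + K (y - H xb), with Kalman gain K = B H^T (H B H^T + R)^{-1}."""
    d = y - H @ xb                    # innovation (misfit between data and background)
    S = H @ B @ H.T + R               # innovation covariance
    K = B @ H.T @ np.linalg.inv(S)    # gain weighting background vs. observation errors
    return xb + K @ d

# tiny example: 2-state system, direct observation of the first state
xb = np.array([1.0, 0.0])                 # background (model) state
H = np.array([[1.0, 0.0]])                # linear observation operator
B, R = np.eye(2), np.array([[1.0]])       # background / observation error covariances
xa = analysis_step(xb, np.array([3.0]), H, B, R)
```

With equal background and observation variances, the analysis lands halfway between background and observation in the observed component, while the unobserved component is unchanged.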
Machine learning algorithms are capable of assisting or replacing the aforementioned traditional methods in assimilating data and making forecasts, without the assumptions of the conventional methods. As NNs can approximate any linear or nonlinear function, they have been integrated with DA as a supplement in various applications.
In this paper, we consider integrating NNs into conventional DA (hereafter named "DA+NN"). In recent years, deep learning has shown great advantages in approximating functions with unknown models and strong nonlinearity. Here, we use NNs to characterize the structural model uncertainty. The NN is implemented in an End-to-End (E2E) manner and its parameters are iteratively updated with incoming observations by applying the DA methods.
This paper is structured as follows. In Section 2, we discuss the related work on machine learning and DA. We propose a new methodology in Section 3, where we integrate NNs into DA. In Section 4, we give a brief mathematical analysis of our method. In Section 5, numerical examples are given to implement and verify the proposed algorithm. Based on these discussions, we give an outlook on this field. Finally, conclusions are drawn and future work is discussed.

Related Work and Contribution of the Present Study
Sensitivity of the four-dimensional variational (4D-Var) DA model has been studied in Ref. [6], where adjoint modeling is used to obtain first- and second-order derivative information and a reduced-order approach is formulated to alleviate the computational cost associated with the sensitivity estimation. This method makes rerunning less expensive; however, the parameters must still be selected a priori, and, consequently, important sensitivities may be missed [7]. It is proved that Algorithm 1 computes a solution whose error $\delta_k$ satisfies $\delta_k \le C_{A(DA)}\,\xi_k$ (Eq. (4)), where $C_{A(DA)}$ is a parameter that depends on the condition numbers of the DA numerical model and the numerical forecasting model implemented in Algorithm 1 [8], and $\xi_k$ is the error in the numerical forecasting model in Eq. (2). Approaches to reduce the error propagation have previously been proposed [6, 8], and studies based on adjoint analysis are provided [7]. In this paper, we study the problem of reducing $\xi_k$ by estimating the errors using machine learning. Babovic et al. [9-11] applied neural networks for error correction in forecasting. In these works, a neural network is trained on the errors of the past several steps, and the network compensates the forecasting result from a deterministic model. Similar approaches have been implemented in hydraulic applications [12-15] and financial applications [16]. While the machine learning capability of approximating any nonlinear or complex system is promising, it is a black-box approach, which lacks the physical meaning of the actual system structure and its parameters, as well as their impacts on the system.
To resolve this problem, we introduce in Section 3 a neural network model trained on DA results. This introduces a novelty in the sensitivity analysis techniques used to determine which NN inputs are significant. In Ref. [17], a series of seven different sensitivity analysis methods was reviewed. As we show in the next section, our approach can be related to the weights method [18]. The difference between our method and the classical weights method is that, in our approach, the connection weights among the nodes are provided by the error covariance matrices computed in the DA process.

Data Assimilation with Neural Networks
The overall process of DDA is described in Fig. 1 and the algorithm description of DDA is given in Algorithm 2. As depicted in Eq. (2), the error $\xi_k$ causes the error in DA. To compensate this error from model uncertainty, we introduce an NN $G_\omega$ (as depicted in Fig. 2) and replace the model $F$ by the composition $G_\omega \circ F$. The training target is $G_\omega(F(x_{k-1}^{DA})) \approx x_k^{DA}$ (Eq. (5)). While the NN capability of approximating complex systems is promising, it is often a black-box approach, which lacks the physical meaning of the actual system structure and its parameters, as well as their impacts on the system. To face this problem, in Eq. (5), we use the values of $F(x_{k-1}^{DA})$ and $x_k^{DA}$ as input and output of the network $G_\omega$, respectively; the real value of the state $x$ is unobtainable, so we train $G_\omega$ on the DA results. $G_{\omega,i}$ is defined as the network trained in the $i$-th stage. Let $M_i$ be the model implemented in the $i$-th stage: $M_i = G_{\omega,i} \circ M_{i-1}$, with $M_0 = F$. We denote the results of DA and DA+NN by $x^{DA}$ and $x^{DA+NN}$, respectively. The loss function in Eq. (10) is defined as in Theorem 1 described in Section 4, and the network is updated by the gradient of the loss function with respect to its parameters.

The Fully-Connected (FC) network is the earliest and most basic NN structure: the output of each neuron in one layer becomes the input of each neuron in the next layer, with corresponding weight values. There is no cycle or loop inside an FC network. FC NNs are widely used for function fitting, especially for continuous-valued functions.
The two-layer FC network contains a hidden layer and an output layer. Let $w_1$ and $b_1$ be the weight and bias of the hidden layer, and $w_2$ and $b_2$ the weight and bias of the output layer. Each neuron in the hidden layer has one activation function (tanh or ReLU), and the output layer has no activation function. The input-output relation of this neural network is $y = w_2\,\sigma(w_1 x + b_1) + b_2$, where $\sigma$ denotes the activation function. In training, in the framework of data assimilation with a neural network as previously described, the training set is the DA result over a period of time. Taking the first cycle ($i = 0$) as an example, the training set is the consecutive time series $x_0^{DA}, x_1^{DA}, x_2^{DA}, \ldots, x_M^{DA}$, and the model is generated from this set of data, giving a total of $M$ training pairs. Since this network is a feedforward neural network and cannot handle time series, we consider the $M$ training pairs to be independent of each other.
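The two-layer network described above can be sketched directly. The following is a minimal numpy implementation of the forward pass $y = w_2\tanh(w_1 x + b_1) + b_2$ with one MSE gradient step, assuming a tanh hidden layer; the layer sizes and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def init(n_in, n_hidden, n_out):
    """w1, b1: hidden layer; w2, b2: linear output layer, as in the text."""
    return {"w1": rng.normal(0.0, 0.1, (n_hidden, n_in)), "b1": np.zeros(n_hidden),
            "w2": rng.normal(0.0, 0.1, (n_out, n_hidden)), "b2": np.zeros(n_out)}

def forward(p, x):
    h = np.tanh(p["w1"] @ x + p["b1"])     # tanh activation on the hidden layer
    return p["w2"] @ h + p["b2"]           # no activation on the output layer

def mse_step(p, x, target, lr=0.05):
    """One gradient step on the MSE loss ||forward(p, x) - target||^2."""
    h = np.tanh(p["w1"] @ x + p["b1"])
    err = (p["w2"] @ h + p["b2"]) - target
    g_w2, g_b2 = 2 * np.outer(err, h), 2 * err
    dh = (p["w2"].T @ (2 * err)) * (1 - h**2)     # backprop through tanh
    g_w1, g_b1 = np.outer(dh, x), dh
    for k, g in (("w1", g_w1), ("b1", g_b1), ("w2", g_w2), ("b2", g_b2)):
        p[k] -= lr * g
    return float(err @ err)
```

Repeatedly calling `mse_step` on the training pairs decreases the MSE, which is the iterative update described in the text.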
In the execution, taking stage $i = 1$ as an example, we have already trained the feedforward NN $G_{\omega,1}$ as described in the previous paragraph. Therefore, during the DA of period $i = 1$, we modify the model to the neural-network-augmented model $M_1 = G_{\omega,1} \circ M_0$. We then use the model $M_1$ to update the system whenever a DA step needs a model update.
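The staged composition $M_i = G_{\omega,i} \circ M_{i-1}$ can be sketched with plain callables; `make_model` is a hypothetical helper, and the stand-in functions below are illustrative, not trained networks.

```python
# Staged model update M_i = G_{w,i} o M_{i-1}, with M_0 = F:
# each stage's trained correction network is applied after the
# previously composed model.

def make_model(F, networks):
    """Compose the base model F with corrections G_{w,1}, ..., G_{w,i}."""
    def M(x):
        x = F(x)
        for G in networks:      # apply stage corrections in training order
            x = G(x)
        return x
    return M

# illustrative stand-ins (not trained networks)
F = lambda x: 2 * x
G1 = lambda x: x + 1
M1 = make_model(F, [G1])        # M_1 = G_{w,1} o M_0
```

Each new stage appends its correction network without retraining the earlier ones, which matches the iterative update of the model across assimilation periods.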

Mathematical Analysis
We now prove the improvement in the DA results introduced by the proposed algorithm via Theorems 1 and 2.
Theorem 1 The loss function for backpropagation, defined as the Mean Square Error (MSE) over the DA training set, satisfies Eq. (9), where $d_k = y_k - H_k x_k$ is the so-called misfit between observed and background data, $D_k = \mathrm{diag}\{M_k H_k, \ldots, M_{k+A} H_{k+A}\}$, and $A_k = D_k B_k D_k^T + R_k$ is the Hessian of the DA system in Eq. (3).
Proof: The loss function for back-propagation, defined by the MSE, is as in Eq. (10). Let $D_k = \mathrm{diag}\{M_k H_k, \ldots, M_{k+A} H_{k+A}\}$; the solution of the DA system in Eq. (3) is given in Ref. [1], and Eqs. (10) and (11) give Eq. (9).
We estimate the error $\xi_k$ in Eq. (2) by the deep NN. Even if some error sources cannot be ignored (for example, the round-off error), the proposed method, which exploits the benefits of both NN and DA methods, allows us to reduce $\xi_k$. Let $\hat{x}_k$ be the solution of the dynamic system in Eq. (2). This introduces the possibility to reduce the error in the dynamic system through the deep NN's impact on the solution of the DA. The following result holds.
Theorem 2 Let $\delta_k$ be the error in the DA solution as defined in Eq. (4), and let $\hat{\delta}_k$ be the error in the solution of the proposed methodology; then $\hat{\delta}_k \le \delta_k$.
Proof: Let $C_{A(DA)}$ and $C_{A(DA+NN)}$ be the error propagation parameters which depend on the numerical methods implemented in the DA and DA+NN algorithms, respectively, as defined in Eq. (4). It can be proved that $C_{A(DA+NN)} \le C_{A(DA)}$ (Eq. (14)). In fact, $C_{A(DA)} \simeq \mu(A_{DA})$, where $A_{DA}$ is the Hessian of the DA system in Eq. (3), and $C_{A(DA+NN)} \simeq \mu(A_{DA+NN})$, where $A_{DA+NN}$ is the Hessian of the DA+NN system. As the condition numbers of both Hessian matrices $A_{DA}$ and $A_{DA+NN}$ are proportional to the condition number of the background error covariance matrix $B_k$ (as in Ref. [19]), Eq. (14) holds. From Eqs. (12) and (14) it follows that $\hat{\delta}_k \le \delta_k$.

Numerical Examples
In this section, we address two different applications of DA+NN, with known and unknown system model inaccuracy. Our interest is the effect of modeling uncertainty in real applications. Uncertainty is pervasive in the modeling of natural processes such as turbulence in fluid dynamics or aircraft design, and the mis-modeling of chemical or biological processes has not been clearly investigated. These processes can be interpreted as general dynamic systems. In such situations, DA alone is not sufficient to reduce the modeling uncertainty and approach the true model. The DA+NN algorithm we have proposed is applicable to this problem.

Double-integrator point-mass system
We first use a double-integrator particle system, which is widely used as the physical representation of a simplified motion system. The system can be understood as the motion of a particle under the influence of a controlled acceleration; the affected state includes position and velocity. Due to the Proportion-Integral-Derivative (PID) controller in the feedback control system, the position, as a controlled state, eventually converges to a relatively stable value. The output of the controller is the acceleration acting on the system, or equivalently a force linearly related to the acceleration. A double-integrator system is a mathematical abstraction of Newton's Second Law and a simplified model of many controlled systems. It can be represented as the continuous model $\dot{x} = Ax + Bu + w$, $y = Cx + v$ (Eq. (16)), where the state $x = [x_1, x_2]^T$ is a two-dimensional vector containing position $x_1$ and velocity $x_2$, and $u = \dot{x}_2$ is the controlled input. The coefficient matrices $A$, $B$, and $C$ are time-invariant system and observation matrices. $v$ is the observation noise, a two-dimensional Gaussian, and $w$ is the system disturbance, which includes two aspects: random system noise and structural model uncertainty.
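The discrete forms of the system and input matrices can be obtained by exact zero-order-hold discretization of the double integrator. The following is a minimal numpy sketch, assuming the standard double-integrator $A$ and $B$ of Eq. (16) and the 100 Hz sample rate used in the simulation; the names are illustrative.

```python
import numpy as np

dt = 0.01                       # 100 Hz sample frequency, as in the simulation
A = np.array([[0.0, 1.0],       # double integrator: position' = velocity
              [0.0, 0.0]])      #                    velocity' = u
B = np.array([[0.0], [1.0]])

# Exact zero-order-hold discretization; since A^2 = 0, the matrix
# exponential truncates: F = I + A*dt, and G integrates e^{A s} B over [0, dt].
F = np.eye(2) + A * dt                    # [[1, dt], [0, 1]]
G = np.array([[dt**2 / 2], [dt]])

def step(x, u, w=np.zeros(2)):
    """Discrete update x_{k+1} = F x_k + G u_k + w_k (w: disturbance)."""
    return F @ x + (G * u).ravel() + w
```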
To implement DA on this system, model (16) is discretized to the discrete form $F$, $G$, and $H$ as in Eqs. (2) and (1). We introduce a disturbance in system (16) so that $\xi_k = \xi_{smu}(x_k) + \xi_{iid,k}$ (17), in which $\xi_{smu}(\cdot)$ is the structural model uncertainty and $\xi_{iid}$ is the i.i.d. random disturbance. We test two kinds of model uncertainty $\xi_{smu}(\cdot)$, representing two typical structural error compositions: parallel and serial. Note that the error $\xi_{smu}(\cdot)$ is added to the real system, which is continuous.
The error from parallel composition is given in Eq. (18), where $M = [m_1, m_2]$ is an amplitude vector which can be adjusted according to the reality. The error from serial composition is $\xi_{smu}(x_k) = (M - I)Ax_k$ (19), which describes the condition that the real system dynamics matrix is $MA$ rather than $A$. For the model simulation, we design a cascade controller to keep the system from diverging. The tracking signal is set as a step function. We obtain 1000 samples from a 10-second simulation with a sample frequency of 100 Hz. The simulation is implemented in Simulink, Matlab 2016a.
First, we run a Kalman filter based on the discretized model $F$, $G$, and $H$. The DA+NN procedure is given in Algorithm 2: at each stage $i$, train a neural network $G_{\omega,i+1}$ with the dataset $D_i$, put the trained network into the model, and count up $i$ for the next stage; the output is $x_k^{DA+NN}$. The training set is built from the time series $\{F(x_{k-1}^{DA}), x_k^{DA}\}$. For the dynamic system in Eq. (16), we first run the system and record the observation $y$ and the control input $u$. A corresponding Kalman filter runs along with the system updates and outputs a sequence of predictions $x^{DA}$. In the first stage (the left area divided by the vertical dash line in Fig. 3), there is an error between the black line and the blue line. The network $G_{\omega,1}$ is trained on the dataset of these $K$ steps.
After that, the system and the Kalman filter continue to update from the $(K+1)$-th step to the $2K$-th step. Now the system-update step in the Kalman filter integrates the trained NN $G_{\omega,1}$. Figure 3a shows the result of these numerical examples; the two vertical dashed lines mark the points at which the NN is trained. We can see that after the first training, the Kalman filter with NN outperforms the standard one, and after the second training, its performance is further improved. Figure 3b shows the result and comparison of this algorithm. After training twice, the result of DA+NN gets close to the real value. However, biased estimation still exists, and overfitting is also likely to occur in this case.
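The NN-augmented update described above can be sketched as one Kalman-filter cycle whose prediction passes through a trained correction network. This is a minimal numpy sketch: the names (`kf_step`, `correct`) are illustrative, and keeping the standard linear covariance propagation alongside the NN-corrected prediction is a simplification.

```python
import numpy as np

def kf_step(x, P, u, y, F, G, H, Q, R, correct=None):
    """One Kalman-filter cycle whose prediction may pass through a trained
    correction network (`correct` stands in for G_{w,1}; None = standard KF).
    Covariance propagation is kept linear here, a simplification."""
    xp = F @ x + (G * u).ravel()            # model prediction
    if correct is not None:
        xp = correct(xp)                    # NN-corrected prediction
    Pp = F @ P @ F.T + Q                    # predicted covariance
    S = H @ Pp @ H.T + R                    # innovation covariance
    K = Pp @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_new = xp + K @ (y - H @ xp)           # analysis update
    P_new = (np.eye(len(x)) - K @ H) @ Pp
    return x_new, P_new
```

With `correct=None` this reduces to the standard filter of the first stage; after each training, the current composed network is passed in as `correct`.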

Lorenz system
To describe atmospheric behavior in a simple mathematical model, Lorenz [20] developed a simplified model for atmospheric convection. It is a popular test case for DA algorithms. The model is a system of three ordinary differential equations (the Lorenz equations), notable for exhibiting chaotic behavior for certain parameter values and initial conditions. The Lorenz equations are given by the nonlinear system $dp/dt = \sigma(q - p)$, $dq/dt = \rho p - q - pr$, $dr/dt = pq - \beta r$, where $p$, $q$, and $r$ are coordinates, and $\sigma$, $\beta$, and $\rho$ are parameters (in this test case, $\sigma = 10$, $\beta = 8/3$, $\rho = 28$). This Lorenz system can be discretized by the second-order Runge-Kutta method as in Ref. [21].
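The second-order Runge-Kutta (Heun) discretization referenced here can be sketched as follows, using the stated parameter values; the function names are illustrative.

```python
import numpy as np

# Lorenz-63 parameters used in this test case
SIGMA, BETA, RHO = 10.0, 8.0 / 3.0, 28.0

def lorenz_rhs(x):
    """Right-hand side of the Lorenz equations in state x = [p, q, r]."""
    p, q, r = x
    return np.array([SIGMA * (q - p),
                     RHO * p - q - p * r,
                     p * q - BETA * r])

def rk2_step(x, dt):
    """One Heun (second-order Runge-Kutta) step: average the slope at x
    and at the Euler predictor x + dt*k1."""
    k1 = lorenz_rhs(x)
    k2 = lorenz_rhs(x + dt * k1)
    return x + dt / 2.0 * (k1 + k2)
```

Iterating `rk2_step` gives the discrete map $F_k(\cdot)$ used in the experiments below.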
Given a state $x_k = [p_k, q_k, r_k]^T$ and the discrete function $F_k(\cdot)$, we define a Lorenz system with structural model uncertainty as $x_k = F_k(x_{k-1}) + \xi_{smu,k}$ (23), in which the structural model uncertainty is $\xi_{smu,k} = M x_{k-1}$ (in this case $M = 0.01$), and $F_k(\cdot)$ is the second-order Runge-Kutta (Heun) discretization of the Lorenz equations with time step $\Delta t$:
$p_{k+1} = p_k + \frac{\Delta t}{2}\left[\sigma(q_k - p_k) + \sigma\big((q_k + \Delta t(\rho p_k - q_k - p_k r_k)) - (p_k + \Delta t\,\sigma(q_k - p_k))\big)\right]$,
$q_{k+1} = q_k + \frac{\Delta t}{2}\left[\rho p_k - q_k - p_k r_k + \rho(p_k + \Delta t\,\sigma(q_k - p_k)) - (q_k + \Delta t(\rho p_k - q_k - p_k r_k)) - (p_k + \Delta t\,\sigma(q_k - p_k))(r_k + \Delta t(p_k q_k - \beta r_k))\right]$,
$r_{k+1} = r_k + \frac{\Delta t}{2}\left[p_k q_k - \beta r_k + (p_k + \Delta t\,\sigma(q_k - p_k))(q_k + \Delta t(\rho p_k - q_k - p_k r_k)) - \beta(r_k + \Delta t(p_k q_k - \beta r_k))\right]$.
For the dynamic system in Eq. (22), we first run the system and record the true state $x$. Then, the observation $y$ is generated by adding Gaussian noise to the true value $x$. After that, a 4D-Var algorithm is applied to generate a sequence of DA results $x^{DA}$ of length $K$. The training set is built from the time series $\{F_k(x_{k-1}^{DA}), x_k^{DA}\}$, and the NN $G_\omega$ is trained. The forecasting results based on the DA model and the DA with NN model are then compared. For DA, the system update model is $x_k = F_k(x_{k-1})$ (24). For DA+NN, the system update model is $x_k = G_\omega(F_k(x_{k-1}))$ (25). We compare the forecasting results in Fig. 4. The training set, generated for $k \in [0, 1000]$, is not plotted in full. The forecasting begins at $k = 1000$. The forecasting errors of the DA model are large in all three axes: its trajectories cannot track the real state, while the trajectories of the DA+NN model track the real values very well. Figure 4b is a 3D plot of the forecasting part of this Lorenz system. It shows that the DA model fails to forecast the right-hand part of the butterfly. In this test, the DA+NN model outperforms the DA model in forecasting the trajectory of the right wing of the butterfly.
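The two forecasting modes of Eqs. (24) and (25) differ only in whether the learned correction wraps the model. The following is a minimal sketch with scalar stand-ins for $F$ and $G_\omega$; the multiplicative error factor mirrors the $M = 0.01$ structural uncertainty, and all names are illustrative.

```python
def forecast(model, x0, n_steps):
    """Roll a system-update model forward n_steps times.
    `model` is F alone (Eq. (24)) or the composition G_w o F (Eq. (25))."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(model(xs[-1]))
    return xs

# illustrative stand-ins: a model with a multiplicative structural error,
# and a correction that undoes it (what G_w is trained to approximate)
F_true = lambda x: 0.9 * x
F_wrong = lambda x: 0.9 * 1.01 * x          # structural error of relative size 0.01
G_w = lambda x: x / 1.01                    # idealized learned correction
```

Rolling `F_wrong` forward compounds the structural error at every step, while wrapping it in the correction keeps the trajectory near the true one; the same mechanism explains the diverging DA trajectories in Fig. 4.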

Conclusion and Future Work
In this article, we discussed a new methodology that integrates NNs into data assimilation and validated the algorithm with numerical examples of model uncertainty.
So far, the effectiveness of this algorithm remains to be verified in specific domains with huge state spaces and more complex internal and external mechanisms.
Future work should include the implementation of DA+NN under more general conditions. For example, we will investigate the combined case of parametric model inaccuracy and structural model uncertainty. Also, modern deep learning tools (e.g., CNNs, RNNs) can be introduced into data assimilation to improve adaptability and performance in more divergent conditions.

Rossella Arcucci received the master's degree (cum laude) in mathematics from the University of Naples Federico II (UNINA), Italy, in 2008, and the PhD degree in computational and computer science from the same university in 2012. She is a Research Associate at the Data Science Institute (DSI), Imperial College London (ICL), UK. Her areas of expertise are numerical analysis, scientific computing, and the development of methods, algorithms, and software for scientific applications on high-performance architectures, including parallel and distributed computing. She works on numerical and parallel techniques for accurate and efficient data assimilation, exploiting the power of machine learning models. She has also achieved computing efficiency by designing models specifically for massively parallel computers and graphics processing units. During her postdoc at UNINA, she coordinated the H2020 project "iNnovative Approaches for Scalable Data Assimilation in oCeanography" (NASDAC) as PI until September 2017, when she joined the DSI.
She received the acknowledgement of Marie Sklodowska-Curie Fellow from the European Commission Research Executive Agency in Brussels on 27 November 2017.
Yi-ke Guo received a first-class honours degree in computing science from Tsinghua University, China, in 1985 and the PhD degree in computational logic from Imperial College in 1993 under the supervision of Professor John Darlington. He founded InforSense, a software company for life science and health care data analysis, and served as CEO for several years before the company's merger with IDBS, a global advanced R&D software provider, in 2009. He is the founding director of the DSI at ICL and leads the Discovery Science Group in the department. Professor Guo also holds the position of CTO of the tranSMART Foundation, a global open source community using and developing data sharing and analytics technology for translational medicine. His research area is data analysis and e-science. The projects he has contributed to have been internationally recognised, including winning the "Most Innovative Data Intensive Application Award" at the Supercomputing 2002 conference for Discovery Net, and the Bio-IT World "Best Practices Award" for U-BIOPRED in 2014. He is a Senior Member of the IEEE, a Fellow of the British Computer Society, and a Fellow of the Royal Academy of Engineering.