On the Identification of Noise Covariances and Adaptive Kalman Filtering: A New Look at a 50 Year-Old Problem

The Kalman filter requires knowledge of the noise statistics; however, the noise covariances are generally unknown. Although this problem has a long history, reliable algorithms for their estimation are scant, and necessary and sufficient conditions for identifiability of the covariances are in dispute. We address both of these issues in this paper. We first present the necessary and sufficient conditions for estimating the unknown noise covariances; these conditions are related to the rank of a matrix involving the auto- and cross-covariances of a weighted sum of innovations, where the weights are the coefficients of the minimal polynomial of the closed-loop system transition matrix of a stable, but not necessarily optimal, Kalman filter. We then present an optimization criterion and a novel six-step approach, based on successive approximation coupled with a gradient algorithm with adaptive step sizes, to estimate the steady-state Kalman filter gain, the unknown noise covariance matrices, and the state prediction (and updated) error covariance matrix. Our approach enforces the structural assumptions on the unknown noise covariances and ensures symmetry and positive definiteness of the estimated covariance matrices. We provide several approaches to estimate the unknown measurement noise covariance R via post-fit residuals, an approach not previously exploited in the literature. Validation on five different test cases from the literature demonstrates that the proposed method significantly outperforms previous state-of-the-art methods. It also offers a number of novel machine learning motivated approaches, such as sequential (one sample at a time) and mini-batch-based methods, to speed up the computations.


I. INTRODUCTION
The Kalman filter (KF) [23] is the optimal state estimator for linear dynamic systems driven by Gaussian white noise with measurements corrupted by Gaussian white noise.¹ In the classical design of a Kalman filter, the noise covariance matrices are assumed known and they, along with the system dynamics, determine the filter's achievable accuracy. However, in many practical situations, including noisy feature data in machine learning, the noise statistics are often unknown or only partially known.
We were motivated by the following learning problem: Given a vector time series and a library of models of system dynamics for the data (e.g., a Wiener process, a white noise acceleration model, also called nearly constant velocity model, or a white noise jerk model, also called nearly constant acceleration model), find a suitable process noise and measurement noise model and the best system dynamics for the time series. The problem we consider in this paper is limited to finding a suitable process noise and measurement noise covariance for a given dynamic model.

A. PREVIOUS WORK
The approaches for estimating the noise covariance matrices for a Kalman filter can be broadly classified into four general categories: Bayesian inference, maximum likelihood estimation, covariance-matching, and correlation methods. The first two categories pose the noise covariance estimation problem as a parameter estimation problem.
In the Bayesian inference approach [19], the covariance estimation problem is solved by obtaining the posterior probability density function (pdf) of the unknown parameters (in this case, the noise covariance matrix elements) from their prior pdf and the observed measurements by applying Bayes' formula recursively. In 2013, Matisko and Havlena [32] proposed a new Bayesian method to estimate the unknown covariance matrices. They first use a Monte Carlo method to generate a grid of possible unknown covariance matrix pairs (Q, R), with higher density near the highest prior probability. Then, they compute the likelihood and posterior probability after performing state estimation for each pair using a Kalman filter. In general, the Bayesian approach suffers from the curse of dimensionality and is computationally intractable, because it involves numerical integration or Monte Carlo simulations over a very large parameter space.
In maximum likelihood estimation [25], [48], the noise statistics are obtained by maximizing the probability density function of the measurement residuals generated by the filter, which is the likelihood of the filter parameters [2]. These filter-based maximum likelihood methods require nonlinear programming based optimization and are computationally intractable. Shumway and Stoffer [47] utilize the expectation maximization (EM) algorithm [12], which requires the smoothed estimates of the system state. This approach starts with the smoothed estimation of the system state given an estimate of the initial state and noise covariance matrices. Then, the unknown parameters are estimated via maximum likelihood estimation using the smoothed state estimates obtained from the expectation step. Later, Ghahramani and Hinton [17] present an extension of [47] that can account for an unknown observation matrix in linear dynamic systems. They then go on to use forward and backward recursions to estimate the noise covariance matrices. This process is repeated until the estimated parameters converge. In addition to computational complexity, this method suffers from convergence to a local optimum.
The basic idea of the covariance-matching techniques [38] is that the sample covariance of the innovations should be consistent with its theoretical value. In [38], the unknown noise covariances are estimated from the sample covariance computed from the innovation sequences accumulated over the entire historical data (or in a moving time window). In this method, if the estimated innovation covariance value is much larger than the theoretical value, then the process noise covariance is increased. The convergence has never been proved for this method.
With regard to correlation methods, Heffes [18] derived an expression for the covariance of the state error and of the innovations of any suboptimal filter as a function of the noise covariances. This expression serves as a fundamental building block in the correlation methods. The first innovation-based technique to estimate the optimal Kalman filter gain and the unknown noise covariance matrices via the correlations of the innovations of an arbitrary initial stabilizing filter was introduced by Mehra [33]. Another procedure for identifying the optimal Kalman filter gain and the noise covariance matrices is due to Carew and Bélanger [7]. Their strategy calculates the Kalman filter gain based on the estimation error, defined as the discrepancy between the optimal state estimates obtained with the optimal Kalman filter gain and the state estimates obtained with an arbitrary suboptimal Kalman filter gain. There is a question as to whether the correlation method is sensitive to the selection of the initial Kalman filter gain. Mehra suggested repeating the noise covariance estimation steps with the gain obtained from the first attempt to improve the estimate. However, Carew and Bélanger [7] claim that, if the optimal Kalman filter gain is used as the initial condition, the approximations in Mehra's approach are such that the correctness of the optimal gain will not be confirmed.

II. PLANT AND MEASUREMENT MODEL FOR THE KALMAN FILTER
The notation used in the remainder of this paper is listed in Table 1. Consider the discrete-time linear dynamic system (i.e., a Gauss-Markov system)

x(k + 1) = F x(k) + Γv(k) (1)

z(k) = Hx(k) + w(k) (2)

where x(k) is the n_x-dimensional state vector, F is the state transition matrix of the system, H is the n_z × n_x measurement matrix, and Γ is the n_x × n_v noise gain matrix. The sequences v(k), k = 0, 1, …, and w(k), k = 0, 1, …, are zero-mean white Gaussian noises with covariance matrices Q and R, respectively. The two noise sequences and the initial state are assumed to be mutually independent, and the matrices Q and R are assumed to be positive definite. Note that even if Q is positive definite, ΓQΓ′ need not be; it can be positive semi-definite. We assume that the system is observable and that (F, ΓQ^{1/2}) is controllable.²

Given the updated estimate x̂(k | k), the Kalman filter [2], [23] estimates the state at the next time instant k + 1 as

x̂(k + 1 | k) = F x̂(k | k) (3)

ν(k + 1) = z(k + 1) − H x̂(k + 1 | k) (4)

x̂(k + 1 | k + 1) = x̂(k + 1 | k) + W(k + 1) ν(k + 1) (5)

P(k + 1 | k) = F P(k | k) F′ + ΓQΓ′ (6)

S(k + 1) = H P(k + 1 | k) H′ + R (7)

W(k + 1) = P(k + 1 | k) H′ S(k + 1)^{-1} (8)

P(k + 1 | k + 1) = P(k + 1 | k) − W(k + 1) S(k + 1) W(k + 1)′ (9)

where x̂(k + 1 | k) is the one-step extrapolated estimate of the state at time k + 1 based on the measurements up to time k, W(k), k = 1, …, N, is the sequence of Kalman filter gains, ν(k), k = 1, …, N, is the innovation sequence, P(k + 1 | k) is the state prediction covariance, S(k + 1) is the measurement prediction (or innovation) covariance, and P(k + 1 | k + 1) is the updated state covariance.

The six-step approach in this paper is designed specifically for a steady-state Kalman filter. The steady-state state prediction covariance matrix P satisfies the algebraic Riccati equation

P = F[P − PH′(HPH′ + R)^{-1}HP]F′ + ΓQΓ′ (10)

The steady-state updated state covariance, denoted P̄, can also be computed via another algebraic Riccati equation (see Appendix A),

P̄ = F P̄ F′ − P̄H′(R − HP̄H′)^{-1}HP̄ + ΓQΓ′ (11)

Evidently,

P̄ = P − W S W′ (12)

P̄ = (I_{n_x} − WH) P (I_{n_x} − WH)′ + WRW′ (13)

where (13) is known as the Joseph form; W and S are the steady-state optimal gain and the steady-state innovation covariance, respectively, given by

W = PH′S^{-1} (14)

S = E[ν(k)ν(k)′] = HPH′ + R (15)

Note that (I_{n_x} − WH) is invertible, but need not be stable (i.e., its eigenvalues need not all be inside the unit circle).
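When Q and R are known, the steady-state quantities above can be computed directly. The following minimal sketch (ours, not the paper's code; it assumes NumPy and SciPy's discrete-time algebraic Riccati solver) evaluates (10), (12)-(15) and cross-checks the Joseph form (13).

```python
# A minimal sketch: steady-state Kalman filter quantities for the model
# (1)-(2) with known Q and R; names follow Section II.
import numpy as np
from scipy.linalg import solve_discrete_are

def steady_state_kf(F, H, Gamma, Q, R):
    """Return steady-state P, S, W and the updated covariance Pbar."""
    P = solve_discrete_are(F.T, H.T, Gamma @ Q @ Gamma.T, R)   # prediction ARE (10)
    S = H @ P @ H.T + R                                        # innovation covariance (15)
    W = P @ H.T @ np.linalg.inv(S)                             # steady-state gain (14)
    Pbar = P - W @ S @ W.T                                     # updated covariance (12)
    # Joseph form (13), numerically safer; should agree with (12)
    I = np.eye(F.shape[0])
    Pbar_joseph = (I - W @ H) @ P @ (I - W @ H).T + W @ R @ W.T
    assert np.allclose(Pbar, Pbar_joseph)
    return P, S, W, Pbar
```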

III. IDENTIFIABILITY OF Q AND R
One major issue in the previous literature involves the conditions necessary to estimate the unknown covariance matrices. Mehra [33] claimed that the system must be observable and controllable; however, Odelson [42] provided a counter-example wherein the system was observable and controllable, but the full Q matrix was not estimable. Following the ideas in [49], we prove that the necessary and sufficient condition (detailed in Appendix B) for estimating the unknown covariance matrices is directly related to the minimal polynomial of the stable closed-loop filter matrix

F̄ = F(I_{n_x} − WH) (16)

and to a transformation of the innovations based on the coefficients of the minimal polynomial. Let us define x̃(k + 1 | k) as the prediction error between the state x(k + 1) and its predicted value x̂(k + 1 | k), that is,

x̃(k + 1 | k) = x(k + 1) − x̂(k + 1 | k) (17)

We can rewrite x̂(k + 1 | k) in terms of x̂(k | k − 1), that is,

x̂(k + 1 | k) = F x̂(k | k − 1) + FW ν(k) (18)

Then, substituting (18) into (17) and using (1), we have

x̃(k + 1 | k) = F̄ x̃(k | k − 1) + Γv(k) − FW w(k) (19)

where F̄ is defined in (16). We can also write ν(k) in terms of x̃, that is,

ν(k) = H x̃(k | k − 1) + w(k) (20)

Let us define the m-th order minimal polynomial of F̄ as

F̄^m + a_1 F̄^{m−1} + ⋯ + a_m I_{n_x} = 0 (21)

Then the innovations ν(k) can be written as in (22), shown at the bottom of the next page.

Note that we apply the minimal polynomial of F̄ to ensure that the combined innovation sequence in (22) is stationary. Let us define ξ(k) as in (23)-(27), shown at the bottom of the next page; ξ(k) is the sum of two moving average processes, driven by the process noise and the measurement noise, with coefficient matrices ℬ_l and G_l, respectively. Denoting L_j = E[ξ(k)ξ(k − j)′], for j = 0, 1, 2, …, m, yields the covariance expressions culminating in (31). We know that Q = [q_ij] is an n_v × n_v positive semi-definite and symmetric matrix, and R = [r_ij] is an n_z × n_z positive definite and symmetric matrix. Utilizing the symmetry of Q and R, and letting b_{i,l} and g_{i,l} denote the l-th columns of ℬ_i and G_i, respectively, we can rewrite (31) as (32)-(34), shown at the bottom of the next page.
From (34), we can form the noise covariance identifiability matrix ℐ of dimension (m + 1)n_z² × ½[n_v(n_v + 1) + n_z(n_z + 1)], as in Algorithm 1. Algorithm 1 uses the vec(A) function to convert a matrix A into a column vector: for a p × n matrix A,

vec(A) ≜ [a_11, …, a_p1, a_12, …, a_p2, …, a_1n, …, a_pn]′ (35)

Using (34) and collecting the terms corresponding to each q_lp, p = l, l + 1, …, n_v, and each r_lp, p = l, l + 1, …, n_z, into the corresponding columns of ℐ, we obtain the linear identifiability condition (36) that must be satisfied by Q and R. The linearity of (36) implies that unique estimability is equivalent to a full column rank condition on ℐ. Since R is always estimable, because G_m (recall that m is the order of the minimal polynomial) is invertible,³ the number of unknowns in Q that can be estimated is at most the rank of ℐ minus the number of unknowns in R.
³See Appendix C for a detailed proof.

That is,

n_Q ≤ rank(ℐ) − n_R (37)

where n_R is the number of unknowns in R and n_Q is the number of unknowns in Q. To illustrate the necessity and sufficiency of this condition, consider an example system from [42] with the measurement equation

z(k) = [0 1 0; 0 0 1] x(k) + w(k) (39)

and with Q a full 3 × 3 positive definite symmetric matrix and R a full 2 × 2 positive definite symmetric matrix. Since the rank of ℐ is not affected by W (the observability condition is independent of the filter gain matrix), one can examine the rank of ℐ for W = 0 for convenience. In this case, the minimal polynomial coefficients follow from the state transition matrix, and among the resulting ℬ and G matrices we have, for example,

G_2 = [0.81 0; 0 0.81] (43)

Here, ℐ is a 12 × 9 matrix with a rank of 8. Since there are 9 unknown variables (6 in Q and 3 in R), the covariance matrix elements are not identifiable.
If Q is diagonal, as is typically the case, then the covariance matrix elements are identifiable, because there are then only 6 unknown variables (the full R matrix and the three diagonal elements of Q).
Another example to illustrate the necessity and sufficiency of this condition is a system with Q a 2 × 2 positive definite diagonal matrix and R a scalar. Similarly, we examine the rank of ℐ for W = 0 and obtain the minimal polynomial coefficients; the corresponding G matrices are

G_0 = 1, G_1 = −0.3, G_2 = 0.02 (48)

Here,

ℐ = [1.04 0 1.09; −0.2 0 −0.31; 0 0 0.02] (49)

has a rank of 2. Since there are 3 unknown variables (2 in Q and 1 in R), the covariance matrix elements are not identifiable.
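The resulting rank test is easy to state in code. In the minimal sketch below (ours, not the paper's; the construction of ℐ via Algorithm 1 is not reproduced), Q and R are jointly identifiable iff ℐ has full column rank, i.e., rank(ℐ) ≥ n_Q + n_R; the example matrix is (49).

```python
# A minimal sketch of the identifiability rank test in (36)-(37).
import numpy as np

def vec(A):
    """Column-stacking vectorization, as in (35)."""
    return A.flatten(order='F')

def identifiable(I_mat, n_Q, n_R):
    """True iff all n_Q + n_R unknown covariance entries are identifiable."""
    return np.linalg.matrix_rank(I_mat) >= n_Q + n_R

# The second example, (49): rank 2 < 3 unknowns, hence not identifiable.
I_mat = np.array([[1.04, 0.0, 1.09],
                  [-0.2, 0.0, -0.31],
                  [0.0,  0.0,  0.02]])
print(identifiable(I_mat, n_Q=2, n_R=1))   # False
```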
Note that the minimal polynomial can be used to estimate the unknown covariances R and Q via quadratic programming techniques. Furthermore, it can be used to estimate the optimal gain W, as in [49] and Appendix D; however, reliable and accurate estimation of the parameters of vector moving average processes is still an unresolved problem [16], [24], [31], [45].

IV. APPROACHES TO OBTAIN FILTER PARAMETERS
There are two competing approaches for the estimation of the filter parameters W, R, Q, and P. The first approach is to estimate the noise covariance matrices first and subsequently the Kalman filter gain W and the predicted state covariance P are computed given the estimated noise covariance matrices [32], [48]. This method has an underlying problem in that it involves the sum of two moving average processes. Additionally, the autoregressive moving average (ARMA) approach, pioneered in the econometric literature, does not extend naturally to sums of moving average processes and we have found the resulting algorithms [16], [24], [31], [45] to have erratic computational behavior.
The second approach is to estimate the Kalman filter gain W from the measured data first [7], [33]. Given the optimal W, we can compute R, Q and P (this is the approach taken in this paper). The R, Q and P estimates proposed in this paper are valid as long as an optimal gain W is provided. There are many ways to obtain the optimal Kalman filter gain W; the techniques described in this paper (Section V and Appendix D) are by no means all-inclusive, and several other methods may be suitable for a given problem. For example, the optimal gain W can be obtained from the suboptimal Kalman filter residuals [8], by solving the minimal polynomial problem [49], by utilizing the least squares method on the observable form [6], or by utilizing a second Kalman filter to track the error in the estimated residual of the first Kalman filter [44], to name a few.

V. ESTIMATION OF W
This section includes the discussion of two different approaches to estimate the optimal Kalman filter gain W, namely, the minimal polynomial approach and the successive approximation, coupled with an adaptive gradient descent scheme, on a criterion based on innovation correlations. The derivation of the minimal polynomial approach is detailed in Appendix D. This approach assumes the system to be purely driven by the optimal innovation. In doing so, the estimation of the optimal Kalman gain can be achieved via a vector auto-regressive model approximation of a vector moving average process. However, from limited testing on examples chosen in this paper, this approach was found to be numerically unstable, only performing well on systems with no eigenvalues close to unity. In fact, the vector auto-regressive model has various numerical problems and an accurate and reliable algorithm to obtain the solution still remains to be developed [24]. Therefore, we omit this approach from the paper and focus on minimization of the innovation correlations using a successive approximation and adaptive gradient descent method.
In the sequel, we describe in detail the approach of our paper using the correlation-based criterion. If the Kalman filter gain W is not optimal, the innovation sequence {ν(k)}, k = 1, …, N, is correlated. We can use the innovation sequence of any stable suboptimal Kalman filter and compute M sample covariance matrices, as in [33],

Ĉ(i) = (1/N) Σ_{k=i+1}^{N} ν(k)ν(k − i)′, i = 0, 1, …, M − 1 (50)

We know that the optimal Kalman filter gain W makes the autocorrelation function C(i), i = 0, 1, 2, …, M − 1, vanish for all i ≠ 0. The correlation matrix for i ≥ 1 is given in (51), as in [33], where F̄ is as in (16). We define the objective function J to be minimized as in (52), where diag(C) is the Hadamard product of an identity matrix, of the same dimension as C, with C,

diag(C) = I ⊙ C (53)

This objective function is selected to minimize the sum of the C(i), i > 0, normalized by the corresponding diagonal elements of C(0). The optimal J tends to 0 as the sample size N tends to ∞, because the time averages coincide with the ensemble averages given infinite data. Substituting (51) into (52) and utilizing the cyclic property of the trace, we obtain (54). For ill-conditioned systems, a regularization term λ_W tr(WW′) can be added to convexify the objective function. Taking the gradient⁴ of (54) with respect to W, we obtain (60), in which Z is given by a Lyapunov equation and X is obtained by rewriting (51), using the pseudoinverse A† = (A′A)^{-1}A′, which exists since we assume the system to be completely observable and M ≥ n_x. The gradient direction can be used to obtain the optimal Kalman filter gain W iteratively through the bold driver method of [3], [29], [53]. Details of this application can be found in Section VIII-C2.
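As a concrete illustration, the following minimal sketch (not the paper's code) computes the sample covariances Ĉ(i) from an innovation record and scores the whiteness of the innovations. Since (52) is not reproduced above, the normalization below (squared lagged covariances weighted by diag(C(0))^{-1}) is our assumption based on the verbal description.

```python
# A minimal sketch: sample innovation covariances and an assumed form of the
# whiteness objective J. `nu` is an (N, n_z) array of innovations from a
# stable (possibly suboptimal) filter; the exact objective (52) is paraphrased.
import numpy as np

def sample_covariances(nu, M):
    """Return [C(0), ..., C(M-1)] with C(i) = (1/N) sum_k nu(k) nu(k-i)'."""
    N, n_z = nu.shape
    C = []
    for i in range(M):
        Ci = np.zeros((n_z, n_z))
        for k in range(i, N):
            Ci += np.outer(nu[k], nu[k - i])
        C.append(Ci / N)
    return C

def whiteness_objective(C):
    """Assumed stand-in for (52): C(i), i > 0, normalized by diag(C(0))."""
    D_inv = np.diag(1.0 / np.diag(C[0]))          # diag(C(0))^{-1}
    return sum(np.trace(D_inv @ Ci @ D_inv @ Ci.T) for Ci in C[1:])
```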

VI. ESTIMATION OF R

A. GENERAL R
Given the steady-state optimal gain W and the innovation covariance S, whose estimation is explained later in Section VIII, let μ(k), k = 1, …, N, be the sequence of post-fit residuals of the Kalman filter, that is,

μ(k) = z(k) − H x̂(k | k) (65)

μ(k) = (I_{n_z} − HW) ν(k) (66)

Note that (I_{n_z} − HW) is invertible (rank n_z), because (I_{n_z} − HW) = RS^{-1} (proven below) and R is assumed to be positive definite.
Proposition 1: Given the optimal steady-state Kalman filter gain W, the post-fit residual sequence μ(k), and the innovation sequence ν(k), the joint covariance of these two sequences is

E{[ν(k)′ μ(k)′]′ [ν(k)′ μ(k)′]} = [S R; R G] (67)

Proof: On the right-hand side of (67), the (1,1) block is simply the definition of the innovation covariance matrix in (15). Using (66), the (1,2) block in (67) is given by

E[ν(k)μ(k)′] = S(I_{n_z} − HW)′

Using (7) and (8), this simplifies to

E[ν(k)μ(k)′] = S − HPH′ = R

The (2,2) block of (67) is obtained as follows:

G ≜ E[μ(k)μ(k)′] = (I_{n_z} − HW) S (I_{n_z} − HW)′ (71)

which, given (14), simplifies to

G = R − HP̄H′ (74)

Note that, by using the Schur determinant identity [5], [51], the determinant of (67) is det(S) det(G − RS^{-1}R) = 0, where the relationship G = R − HP̄H′ = RS^{-1}R is proved in (74) and in Proposition 2 below; the determinant vanishes because μ(k) is a deterministic linear function of ν(k).
Proposition 2: Given the optimal steady-state Kalman filter gain W and the corresponding post-fit residual μ(k) and innovation ν(k) sequences, the covariance matrix R can be computed in the five ways R1-R5.

Proof: Substituting (14) into (81) yields (82). We also know from (15) that

S − R = HPH′ (83)

By substituting (83) into (82), we can write G, defined in (71), as

G = RS^{-1}R (85)

Note that (85) is a continuous-time algebraic Riccati equation.⁵ Therefore, we can estimate R by solving the continuous-time Riccati equation, as in [1], or by Kleinman's method [27].
Some additional methods to solve the continuous-time algebraic Riccati equation can be found in [30]. We can also interpret (78) in terms of a Linear Quadratic Regulator (LQR) optimal control problem, where we can obtain R as the solution of the continuous-time algebraic Riccati equation associated with the optimal gain in the LQR problem. The computation of R is also related to the simultaneous diagonalization problem 6 in linear algebra [51]. Note that, in the scalar case, R is the geometric mean of the variance of the post-fit residual and the innovation, as in the (1,2) block of (67).
Note that R1-R5 are theoretically the same; however, they are numerically different. We recommend R3, since it ensures the positive definiteness of R.
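To make R3 concrete: since R3 amounts to solving the quadratic relation G = RS^{-1}R of (85) for a symmetric positive definite R, one closed-form route (equivalent in spirit to Appendices F and G) is the matrix geometric mean of S and G. The sketch below is ours, not the paper's code.

```python
# A minimal sketch of one way to realize R3: the unique SPD solution of
# R @ inv(S) @ R = G via the matrix geometric mean of S and G.
import numpy as np

def sqrtm_spd(A):
    """Unique SPD square root via eigendecomposition."""
    lam, U = np.linalg.eigh(A)
    return U @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ U.T

def estimate_R3(S, G):
    S_half = sqrtm_spd(S)
    S_half_inv = np.linalg.inv(S_half)
    middle = sqrtm_spd(S_half_inv @ G @ S_half_inv)
    R = S_half @ middle @ S_half
    return 0.5 * (R + R.T)            # enforce exact symmetry

# sanity check: R @ inv(S) @ R should reproduce G
# R = estimate_R3(S, G); assert np.allclose(R @ np.linalg.inv(S) @ R, G)
```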

B. DIAGONAL R
When R is diagonal, we solve a least squares problem in the Frobenius norm ‖·‖_F. The positive definite R can be estimated from R3, given in Proposition 2; the solution is then simply the diagonal elements of the R estimated via R3. This can also be interpreted as a masking operation that imposes structural constraints on R, as discussed in the context of the estimation of Q in Section VII.

C. USE OF SMOOTHED STATE ESTIMATE WITH ONE-STEP-LAG POST-FIT RESIDUALS
Note that R can also be estimated using one-step-lag smoothing on the post-fit residuals. Let us define the one-step-lag smoothed residual s(k) as in [37], with the closed-loop matrix F̄ defined as in (99). From (65), we can also write s(k) as a one-step moving average process. Therefore, for the optimal Kalman filter gain W, we can write the covariance E[s(k)s(k)′] as in (102)-(103). A similar expression can be derived for E[s(k)μ(k)′]; with the optimal Kalman filter gain W, combined with (14), we obtain (107)-(110), shown at the bottom of the next page.
Note that E[s(k)s(k)′] can be used in a manner similar to the algorithm in Section V to obtain the optimal Kalman filter gain W. More investigation is needed into this approach.

VII. ESTIMATION OF Q, P AND P̄
In this section, we discuss a method to estimate the process noise covariance Q and the state prediction (updated) covariance P (P̄). Unlike the case of a Wiener process, and that of a process with H = I, where both Q and P can be estimated separately and without iteration, as shown in Section IX-A3, Q and P (P̄) are coupled in the general case, requiring multiple iterations for the estimation to converge. The relationship between the steady-state state prediction covariance matrix P, the steady-state updated state covariance matrix P̄, and the process noise covariance matrix Q is

P = F P̄ F′ + ΓQΓ′ (111)

Similarly, the steady-state updated state covariance matrix can be written as in (112)-(115), where F̄ is defined as in (99) and (115) is derived utilizing (14) and a standard identity from [2]. We also define a third covariance form; given P and S, or P̄ and S, or this third form and S, we can compute ΓQΓ′ in three ways, Q1-Q3, all derived from (6).
In this paper, we utilize the updated state covariance matrix to estimate Q and P̄ iteratively. Let t = 0, 1, … and ℓ = 0, 1, … denote the (two-loop) iteration indices, and let us assume the initial estimate ΓQ^(0)Γ′ = WSW′ (this is the Wiener process solution for the estimation of Q, as shown in Section IX). We initialize P̄ by solving the Lyapunov equation (starting with t = 0 and ℓ = 0) for P̄^(0), and we compute P̄^(ℓ+1) utilizing (116) until the value converges. Given the converged P̄, let us denote the resulting quantity by D^(t+1). Then, we can update Q^(t+1) from (120). A mask matrix A can shape Q to enforce structural constraints (e.g., a diagonal covariance); the mask matrix comprises binary elements, with a 1 in the desired positions and 0 elsewhere (for example, an identity matrix enforces diagonality). Then Q is structured by the Hadamard product

Q^(t+1) ← A ⊙ Q^(t+1) (123)

We subsequently set ℓ = 0 and recompute P̄ using the Q^(t+1) from (123), and this process repeats until the estimate of Q converges. For ill-conditioned systems, a tuning (regularization) parameter λ_Q can be used in (125). After the estimate of Q converges, we can estimate P using either (111), (112) or (113).
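The structural-constraint step in (123) can be made explicit in code; the eigenvalue clipping at the end is our addition, in line with the paper's requirement that the estimates remain symmetric and positive (semi-)definite.

```python
# A minimal sketch of the structural-constraint step (123): apply the binary
# mask A via the Hadamard product, then symmetrize and project onto the
# positive semi-definite cone.
import numpy as np

def enforce_structure(Q, A):
    Q = A * Q                          # Hadamard product with the mask A (123)
    Q = 0.5 * (Q + Q.T)                # enforce symmetry
    lam, U = np.linalg.eigh(Q)
    return U @ np.diag(np.clip(lam, 0.0, None)) @ U.T   # PSD projection

# e.g., for a diagonal Q constraint: A = np.eye(n_v)
```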

VIII. SIX-STEP SOLUTION APPROACH TO ESTIMATE W, S, P (P̄), Q AND R
Given the methods to obtain estimates of R and Q in Sections VI and VII, we summarize our method into a six-step solution approach to obtain the optimal steady-state W, S, P (P̄), Q, and R.

A. STEP 1
Start with iteration r = 0 and initialize with a gain W^(0) that stabilizes the system, as in [28]. We then execute the Kalman filter of (3)-(5), with the gain held fixed at W^(r), for samples k = 1, 2, …, N.

B. STEP 2
In this step, given the innovations generated in Step 1, we compute the M sample covariance matrices Ĉ(i), i = 0, 1, …, M − 1, and evaluate the objective function J in (52).

C. STEP 3
In this step, we check whether any of the termination conditions given below are met. If none of the termination conditions are met, we update the Kalman filter gain via the proposed method, detailed later in Section VIII-C2.
1) TERMINATION CONDITIONS-There are five conditions that result in algorithm termination, each yielding a Kalman filter gain W used for the R, Q and P estimates in the later steps. Condition 1: The change in the Kalman filter gain between consecutive iterations is within a specified threshold ζ_W.
Condition 2: The gradient of Kalman filter gain (60) is within a specified threshold ζ Δ .
Condition 3: The objective function value in (52) is within a specified threshold ζ J from zero.
Condition 4: The objective function value stops improving, given a specified "patience" (number of epochs, detailed in Section VIII-C2) for the adaptive gradient method.

Condition 5: The maximum number of iterations, n_L, is reached.

a: CONDITION 1: We compute the relative change in the gain between consecutive iterations,

δ_W = ‖(W^(r+1) − W^(r)) ./ (W^(r) + ϵ_W)‖

where ./ indicates element-wise division, ‖·‖ is a matrix norm (in this paper, the Euclidean norm), and ϵ_W is a very small value to protect against zeros in the denominator. When δ_W is less than a specified threshold ζ_W, the Kalman filter gain is assumed to have converged and we terminate the algorithm; otherwise, we update the Kalman filter gain W for the next iteration.
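A minimal sketch of the Condition 1 test (ours, not the paper's code); ϵ_W guards the denominator as above, and NumPy's default matrix norm is the Frobenius (Euclidean) norm used in this paper.

```python
# A minimal sketch of the relative-change convergence test for the gain.
import numpy as np

def gain_converged(W_new, W_old, zeta_W, eps_W=1e-12):
    delta_W = np.linalg.norm((W_new - W_old) / (W_old + eps_W))
    return delta_W < zeta_W
```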

b: CONDITION 2:
We also examine the gradient of the objective with respect to the Kalman filter gain, ∇_W J, for convergence. We assume the Kalman filter gain to have converged when the Euclidean norm of ∇_W J is less than a sufficiently small threshold ζ_Δ.

c: CONDITION 3: Similar to W, we can compute the change in the objective function J from iteration r to r + 1. The Kalman filter gain is assumed to have converged when J^(r) is less than a specified threshold ζ_J; otherwise, we update the Kalman filter gain for the next iteration.

d: CONDITION 4:
The fourth termination condition is related to the step size of the proposed approximation method. We adapt the bold driver method of [3], [29], [53], which uses a "patience" parameter to indicate that the objective function value J^(r) has stopped improving (detailed in Section VIII-C2). The algorithm is terminated with the Kalman filter gain corresponding to the minimum J^(r).

e: CONDITION 5:
This condition is implemented to ensure that the algorithm terminates within a reasonable number of iterations, denoted by n L . Typically, the number of iterations required to reach the optimal Kalman filter gain W increases proportionally with n x .
2) KALMAN FILTER GAIN UPDATE-When any of the above conditions is met, we terminate the algorithm. Otherwise, we update the Kalman filter gain W for the next iteration r + 1 via the gradient direction in (60), that is,

W^(r+1) = W^(r) − α^(r) ∇_W J|_{W = W^(r)}

where α^(r) is the step size of the proposed method. The initial step size α^(0) is set as in (136), where c is a positive constant used to update the Kalman filter gain in the first iteration, N_s is a hyperparameter on the number of observations, and β is a positive constant that adapts the initial step size to the number of observations. Note that (136) is selected to automatically tune the initial step size: when only a small subset of the samples has been observed, we want a smaller step size to prevent large steps that could result in unstable gains. If a line search is used instead, this initialization is not necessary. The use of stochastic-approximation-type step sizes would enable one to extend the estimation method to on-line settings and to the extended Kalman filter.
Subsequently, α^(r) is computed using the bold driver method of [3], [29], [53]. That is, after each iteration, we compare J^(r) to its previous value J^(r−1) and increase or decrease the step size accordingly, subject to a maximum step size c̄ determined by a positive constant c_max between 0 and 1.
Once we update the Kalman filter gain W, we go back to Step 1 by setting r = r + 1 and repeat the same process until any of the five termination conditions are met.
Note that each time J^(r) ≤ J^(r−1), we save the corresponding Kalman filter gain W^(r) and objective value J^(r), and we halve the step size each time J^(r) > J^(r−1), in the hope of observing a decrease in J^(r). If the value of J^(r) has increased consecutively for a specified number of iterations (i.e., the "patience" factor), we select the best Kalman filter gain as

W = W^(r*), r* = arg min_r J^(r)

We then terminate the iteration and move on to Step 4 after repeating Steps 1 and 2 with the corresponding W. Note that adaptive stochastic gradient descent methods can also be applied to compute the optimal Kalman filter gain W, as in [21], [26], [40], [50], [54]. A sketch of this logic follows.
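In the sketch below (ours, not the paper's code), the growth factor 1.1 and the cap alpha_max are our assumptions, while the halving rule, the patience counter, and the retention of the best gain follow the text.

```python
# A minimal sketch of the bold driver step-size rule and "patience" logic.
import numpy as np

def bold_driver_step(alpha, J_curr, J_prev, alpha_max, grow=1.1):
    """Return the next step size under the bold driver rule."""
    if J_curr <= J_prev:
        return min(grow * alpha, alpha_max)   # objective improved: enlarge step
    return 0.5 * alpha                        # objective worsened: halve step

class BestGainTracker:
    """Keep the gain with the lowest objective; stop after `patience` misses."""
    def __init__(self, patience):
        self.patience, self.misses = patience, 0
        self.best_J, self.best_W = np.inf, None

    def update(self, W, J):
        if J <= self.best_J:
            self.best_J, self.best_W, self.misses = J, W.copy(), 0
        else:
            self.misses += 1
        return self.misses >= self.patience   # True => terminate (Condition 4)
```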

D. STEP 4
Once we obtain the optimal steady-state Kalman filter gain W and the corresponding innovation covariance S, we can compute the unknown R, as in Section VI.

E. STEP 5
Given the covariance matrix R, computed in Step 4, we can compute the covariance matrix Q and steady-state state prediction covariance matrix P, as detailed in Section VII.

F. STEP 6
We implement a successive approximation as follows: an outer loop reinitializes the filter with the R and Q obtained from Step 5 and then reinvokes Steps 1-5. We keep track of the best J^(r) among the outer-loop iterations, and the Kalman filter gain associated with the minimum J^(r) is selected as the optimal Kalman filter gain. The algorithm terminates when the difference between the best J^(r) values of successive outer-loop iterations is less than ζ_J, or when the maximum number of outer-loop iterations is reached. A schematic of this flow is sketched below.
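In the schematic below (orchestration only, not the paper's code), every helper is an injected callable standing in for the corresponding procedure: estimate_gain for Steps 1-3 (Section V), estimate_R for Step 4 (Section VI), estimate_Q_P for Step 5 (Section VII), and riccati_gain for re-solving (10), (14) and (15) with the latest Q and R.

```python
# A schematic of the six-step flow of Section VIII; helpers are placeholders.
import numpy as np

def six_step(z, F, H, Gamma, W0, n_outer, zeta_J,
             estimate_gain, estimate_R, estimate_Q_P, riccati_gain):
    W, best_J = W0, np.inf
    S = R = Q = Pbar = None
    for _ in range(n_outer):                            # Step 6: outer loop
        W, S, J = estimate_gain(z, F, H, W)             # Steps 1-3 (Section V)
        R = estimate_R(S, W, H)                         # Step 4 (Section VI)
        Q, Pbar = estimate_Q_P(F, H, Gamma, W, S, R)    # Step 5 (Section VII)
        if abs(best_J - J) < zeta_J:                    # outer-loop convergence
            break
        best_J = min(best_J, J)
        W = riccati_gain(F, H, Gamma, Q, R)             # reinitialize with new Q, R
    return W, S, R, Q, Pbar
```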

IX. SPECIAL CASES: WIENER PROCESS AND H = I_{n_x}
In this section, we consider two special cases. The first is when the state transition matrix F and the measurement matrix H are both identity matrices, I_{n_x} and I_{n_z}, with n_x = n_z; this considerably simplifies our method for estimating R and Q. The second special case is when only the measurement matrix H is the identity matrix, while the state transition matrix F remains general. Note that either case can be further specialized to that of perfect measurements, that is, H = I_{n_x} with no measurement noise, and thus R = 0.

A. CASE 1: WIENER PROCESS
For a Wiener process, we have F = I n x and H = I n x .

1) KALMAN FILTER GAIN UPDATE FOR A WIENER PROCESS-To get the optimal Kalman filter gain, we run the filter for k = 1, 2, …, N, define the transformed innovation sequence ξ(k), and form L_0 = E[ξ(k)ξ(k)′] and L_1 = E[ξ(k)ξ(k − 1)′]. Note that both L_0 and L_1 can be computed from samples. Additionally, we can obtain the optimal W from L_1 as in (148). Substituting the W of (148) into (146), we can write the relationship between L_0 and L_1 as in (150). Note that (150) is in a form related to the discrete algebraic Riccati equation and has a positive definite solution [15].

2) ESTIMATION OF R FOR A WIENER PROCESS-Proposition 3: For a Wiener process, where the state transition matrix F and the measurement matrix H are both identity matrices, I_{n_x} and I_{n_z}, respectively, with n_x = n_z, and given the optimal steady-state Kalman filter gain W and the concomitant post-fit residual sequence μ(k) and innovation sequence ν(k), the covariance matrix R can be computed in the ways SR1-SR5; in particular, we can compute R as in (162), and symmetrizing (162) proves SR5.

Proposition 4: For a Wiener process, where the state transition matrix F and the measurement matrix H are both identity matrices, I_{n_x} and I_{n_z}, respectively, and given the optimal steady-state Kalman filter gain W and the corresponding innovation sequence ν(k), the steady-state state prediction covariance and the process noise covariance can be computed as

P = W S (164)

Q = W S W′ (165)

3) ESTIMATION OF P AND Q FOR A WIENER PROCESS-Proof: Given the relationship in (8) and knowing that H = I_{n_z} for a Wiener process, we have W = PS^{-1}, which proves (164), i.e., P = WS. For a Wiener process, we can rewrite the Riccati equation (10) as

P = P − P(P + R)^{-1}P + Q (166)

Using the relationships (15) and (164) in (166) yields

P = P − WSW′ + Q (167)

Thus, for a Wiener process, Q can be estimated as Q = WSW′, and hence (165) is proven. Note that (165) is used as Q^(0) in the general case for iteratively computing Q. Also note that when R = 0 (i.e., the perfect measurement case), we have W = I_{n_x} and Q = S = P.
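The Wiener-process estimates can thus be written in closed form once W and S are available; a minimal sketch (ours, not the paper's code):

```python
# A minimal sketch of the Wiener-process special case (F = H = I): given the
# optimal gain W and innovation covariance S, the remaining quantities follow
# in closed form from (164), (165) and (15).
import numpy as np

def wiener_estimates(W, S):
    P = W @ S            # prediction covariance, from W = P S^{-1} (164)
    Q = W @ S @ W.T      # process noise covariance (165)
    R = S - P            # measurement noise covariance, from S = P + R (15)
    return R, Q, P
```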

B. CASE 2: H = I n x
In the second case, only H is the identity matrix, but F is not necessarily so.

1) KALMAN FILTER GAIN UPDATE FOR THE H = I_{n_x} CASE-To get the optimal Kalman filter gain, we run the filter for k = 1, 2, …, N and let ξ(k) be the transformed innovation sequence, where now

F̄ = F(I_{n_x} − W)

We can write L_0 = E{ξ(k)ξ(k)′} as in (177), and, based on (177), L_1 = E{ξ(k)ξ(k − 1)′} can be computed as in (179). From the right-hand side of (179), we can find S by solving (180); upon calculating S, we can find the optimal Kalman filter gain W as in (181), and we can calculate R from R3 in (85). G can be obtained by running the filter with the optimal Kalman filter gain. Note that we can also write ξ(k) in the alternative form (183), for which L_0 takes the form (185). Equating (179) and (185), we can compute ΓQΓ′ as in (187). Note that when R = 0 (i.e., the perfect measurement case), W, P and G take correspondingly simplified forms.

X. NUMERICAL EXAMPLES
In this section, we illustrate our method on the following five cases:

1. A second-order kinematic system (a white noise acceleration, or nearly constant velocity, model), varying the lags M in the correlations.

2. The system of Neethling [39].

3. A five-state system from [33] and [4] with diagonal Q and R.

4. An unobservable (but detectable) system from [42].

5. An ill-conditioned system from [42].
For each test case, we examine the condition number of the system's observability and controllability matrices, as well as that of the matrix ℐ. The condition number of a matrix A is computed as

κ(A) = ‖A‖ ‖A†‖ (190)

where A† is the pseudoinverse of A and ‖·‖ is the Euclidean norm. The rank of the matrix ℐ is also examined for each test case. For each test case result, we compute the 95% probability interval (PI) via the highest probability interval method⁷ and denote by r̲ and r̄ the corresponding lower and upper limits, respectively. We also provide the mean and the root mean squared error (RMSE) of each distribution. The averaged normalized innovation squared (NIS) is also provided to measure the consistency of the Kalman filter, where n_MC denotes the number of Monte Carlo (MC) runs used in the averaging. The elements of a matrix A are denoted a_ij, representing the element in the i-th row and j-th column of A.
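A sketch of the averaged-NIS consistency check used below; the chi-square bounds follow the standard construction for n_MC independent runs (our formulation, with NumPy/SciPy assumed).

```python
# A minimal sketch: averaged NIS over Monte Carlo runs and its 95% bounds.
import numpy as np
from scipy.stats import chi2

def averaged_nis(nu_runs, S):
    """nu_runs: (n_MC, N, n_z) innovations; returns the per-step averaged NIS."""
    S_inv = np.linalg.inv(S)
    nis = np.einsum('mki,ij,mkj->mk', nu_runs, S_inv, nu_runs)
    return nis.mean(axis=0)            # one value per time step k

def nis_bounds(n_MC, n_z, conf=0.95):
    """95% probability region for the averaged NIS of a consistent filter."""
    lo = chi2.ppf((1 - conf) / 2, df=n_MC * n_z) / n_MC
    hi = chi2.ppf(1 - (1 - conf) / 2, df=n_MC * n_z) / n_MC
    return lo, hi
```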

A. CASE 1
We simulated a second-order kinematic system (white noise acceleration model) with sampling period T = 0.1, where δ_kj denotes the Kronecker delta function. The process and the measurement noises are zero mean, with the corresponding variances given in (194) and (195), respectively. The system has a condition number of 20.1 for its observability matrix and 20.2 for its controllability matrix. The noise covariance identifiability matrix ℐ, given the initial Kalman filter gain in (197), is

ℐ = [5·10^{-5} 6; 2.5·10^{-5} −4; 0 1] (196)

which has rank 2; since there are 2 unknown variables to estimate, Q and R are identifiable. However, the condition number of ℐ is 1.5·10^5, so the least squares problem using the minimal polynomial approach is ill-conditioned. In the box plots of the estimates, values beyond the whisker length are marked as outliers (indicated by the "+" symbols). The accuracy of the estimates of both R and Q increases with M. Table 2 shows the mean values of the estimates of both R and Q. The smallest error in the median of the estimates of R and the smallest variability in the estimates of Q are obtained with M ≥ 100.

2) ESTIMATION OF W AND P-Given M = 100, for 100 MC runs with the initial Kalman filter gain as in (197), we found that R1-R5 produce the same estimates of R. The true values of R all lie within the 95% PI associated with the distribution of the estimates. Fig. 3 shows the Q versus R plot of each estimate, with the true values marked by "+" symbols. The reason the estimated Q varies so much is that its value is very small compared to the measurement noise. Fig. 4 shows the averaged NIS and its 95% probability region, indicating that the filter is consistent.

B. CASE 2
We simulated the system described in Neethling [39]. The system's condition numbers for its observability and controllability matrices are 2.18 and 2.56, respectively. Here, given the initial Kalman filter gain,

ℐ = [1.25 1.8; 0.5 −1.12; 0 0.4] (202)

and the rank is 2. The number of unknown variables is 2; therefore, the system noise variances are estimable. The condition number of ℐ is 2.3, and indeed the minimal polynomial approach works well for this problem. We simulated 100 Monte Carlo runs with N = 1000, n_L = 100, N_s = 1000, and an initial suboptimal Kalman filter gain. Table 4 shows the estimated noise variances. Similar to the Case 1 results, the mean values of the estimated parameters are very close to their corresponding true values. As seen in Table 4, the true values lie within the 95% PI associated with the distribution of the estimates for each of Q, R, W and P_ii. Fig. 5 shows the Q and R estimates for each MC run. As shown in Fig. 6, the Kalman filter is consistent.

C. CASE 3
In Case 3, we test on the five-state example in [33]. The system matrices are as given in (204)-(207).

1) MINIMUM NUMBER OF OBSERVATION SAMPLES NEEDED FOR MEHRA'S AND BÉLANGER'S METHODS TO CONVERGE-Both Mehra's [33] and Bélanger's [4] methods to update the Kalman filter gain W can be unstable unless a large number of data samples is observed. This is due to the fact that the time average converges slowly to the ensemble average. We conducted 100 MC simulations with 10,000 data samples in each run for the five-state system described in (204)-(207). We then varied the number of observed samples from 100 to 10,000 and updated the Kalman filter gain using both the Mehra [33] and Bélanger [4] methods. We measured the percentage of unstable Kalman filter gains by checking, for each run over the 100 MC runs, whether any of the eigenvalues of F̄ lie outside the unit circle. The results are shown in Figs. 7 and 8. We only display up to 5,000 samples for both methods because each approach terminated with a stable gain once the total number of observed samples exceeded 5,000. The minimum number of samples required to obtain a stable gain from these methods was about 4,500. Our proposed method always results in a stable Kalman filter gain; hence, it is not included in this comparison.

2) ESTIMATION OF R, Q AND P-We test and compare the proposed method with Mehra's and Bélanger's for the estimation of R, Q and P, using the methodology described in Sections VI-B and VII combined with the converged Kalman filter gain from Table 5. The results are shown in Table 6; Mehra's method results in the true value of P_33 falling outside the 95% PI. In comparison to Bélanger's method, the proposed method is vastly more accurate, with lower RMSE (2 to 9 times smaller) for all of R, Q, and P, while Mehra's method obtained a result that is less accurate than Bélanger's, as expected from the Kalman filter gain results. The reason r_1 is so difficult to estimate is that S_1 is dominated by the state uncertainty (S_1 = 65, r_1 = 1), i.e., the measurement noise is "buried" in a much larger innovation. In the case of r_2 = 1, one has S_2 = 2.45, i.e., r_2 is "visible" in the innovations.

3) VARYING THE NUMBER OF SAMPLES OBSERVED-
In this section, we vary the number of samples observed, N = 1000, 2500, 5000, 10000 using our six-step approach.
The results are detailed in Tables 7 and 8. As expected, the accuracy increases with an increase in N. The estimation is greatly degraded for N < 5000. Fig.9 illustrates that the Kalman filter is consistent.

D. CASE 4
For Case 4, we simulate the unobservable (but detectable) system in [42]. With the initial Kalman filter gain, the rank of the noise covariance identifiability matrix ℐ is 2, and we have a total of 2 unknown variables. The condition number of ℐ is 23.4. We simulated 100 MC runs with N = 1000 observed samples in each run. We set n_L = 100, N_s = 1000, and λ_Q = 0.1. Table 9 shows the estimated parameters with the initial Kalman filter gain obtained by solving the Riccati equation with R^(0) = 0.2 and Q^(0) = 0.4. Note that the system is not fully observable, i.e., the condition number of the observability matrix is infinite, while that of the controllability matrix is 25.8. In Table 9, the true values lie within the 95% PI associated with each distribution. Fig. 10 shows a wide variation of the Q and R estimates; however, the NIS in Fig. 11 shows that the Kalman filter is consistent.

E. CASE 5
In Case 5, we simulate the system from [42]. The condition numbers for the observability and controllability matrices are 362 and 561, respectively; hence, this is an ill-conditioned case. With the initial Kalman filter gain, the rank of the noise covariance identifiability matrix ℐ is 2, and we have a total of 2 unknown variables, indicating that both Q and R are identifiable (albeit, due to the high condition number, not very well relative to the other systems tested). The condition number of ℐ is 36.4. We simulated 200 MC runs with N = 1000 observed samples in each run. We set M = 15 to be consistent with the setup in [42]. We also set the maximum number of iterations n_L = 100, N_s = 1000, and the regularization term from (127) to λ_Q = 0.3. Table 10 shows the estimated parameters with the initial Kalman filter gain obtained by solving the Riccati equation with R^(0) = 0.1 and Q^(0) = 0.5; the true values stay within the 95% PI. Fig. 12 shows the scatter plot of the R and Q estimates for each MC run. The plot is similar to the estimates in [42]; however, the upper bound on the Q estimates is smaller than that of [42] (about 0.2), and [42] does not provide detailed results like those presented in Table 10. Fig. 13 shows that the Kalman filter is consistent.

XI. CONCLUSION AND FUTURE WORK
In this paper, we derived necessary and sufficient conditions for the identifiability of the process and measurement noise covariances of a Gauss-Markov system. We also provided a novel six-step successive approximation method, coupled with an adaptive gradient method, to estimate the steady-state Kalman filter gain W, the unknown noise covariance matrices R and Q, and the state prediction (or updated) covariance matrix P (or P̄) when Q and R are identifiable. Moreover, we developed a novel iterative approach to obtain positive definite Q, R and P, while ensuring that the structural assumptions on Q and R are enforced (e.g., diagonality of Q and R, if appropriate, as well as symmetry and positive definiteness). We provided several approaches to estimate the unknown noise covariance R via post-fit residuals. We examined previous methods from the literature and heretofore undiscussed assumptions of these methods that result in largely inaccurate or unstable estimates of the unknown parameters. The proposed method significantly outperformed the previous ones, given the same system assumptions.
We validated the proposed method on five different test cases and were able to obtain parameter estimates where the truth stays within the 95% probability interval based on the estimates.
In the future, we plan to pursue a number of research avenues, including 1) estimating Q and R using one-step lag smoothed residuals; 2) exploring vector moving average estimation algorithms using the minimal polynomial approach and/or truncating the effects of state; 3) replacing the batch innovation covariance estimates by their individual or mini-batch estimates, as is done in machine learning, to enable real-time estimation; 4) investigating accelerated gradient methods (e.g., Adam [26], AdaGrad [13], RMSProp [50], conjugate gradient, memoryless quasi-Newton, and trust region methods [5]); 5) automatic model selection from a library of models; and 6) extension to nonlinear dynamic models.

APPENDIX B IDENTIFIABILITY OF UNKNOWN COVARIANCES
Proposition 5: The necessary and sufficient condition for estimating the unknown covariance matrices of a system is directly related to the minimal polynomial of its stable closed-loop filter matrix F̄ and a transformation of the innovations based on the coefficients of the minimal polynomial.
Proof: To prove necessity, let us assume that Q and R are uniquely estimable, but ℐ does not have full column rank. This implies that the nullspace of ℐ, N(ℐ), contains nonzero vectors. Consequently, the columns of ℐ are linearly dependent and there are infinitely many Q and R estimates that satisfy the linear equations, which contradicts our original assumption. Now, to prove sufficiency, let us assume that ℐ has full column rank. This implies that N(ℐ) contains only the zero vector, and thus a unique solution exists for the entries of Q and R concatenated into a column vector in (36). Therefore, Q and R must be estimable.

APPENDIX C PROOF OF ESTIMABILITY OF R
Without loss of generality, let us assume that a_m ≠ 0 and the closed-loop transition matrix F̄ is invertible. Note that W should be such that F̄ does not correspond to a deadbeat observer (which assumes no noise) or an observer with zero eigenvalues for F̄. Since R is assumed to be positive definite, F̄ is always invertible [22], [43], [52]. When the Kalman filter gain W = 0, it is evident that

G_m = a_m I_{n_z}

which is invertible since a_m ≠ 0, and R is clearly identifiable. When W is not zero, using (21) in (29) shows that G_m remains invertible. Thus, R is estimable.

APPENDIX D PROCEDURE TO OBTAIN W USING MINIMAL POLYNOMIAL
Let W_s be the suboptimal Kalman filter gain and e(k | k − 1) the difference between the state predictions of the optimal and the suboptimal filters, with the closed-loop matrix of the suboptimal filter F̄_s = F(I_{n_x} − W_s H). We can write the suboptimal innovation ν_s(k) in terms of e(k | k − 1) as

ν_s(k) = He(k | k − 1) + ν(k)

Note that we can write ξ(k) as a vector auto-regressive process of infinite order (which can be truncated to order ℳ), as in (239). Taking the z-transform of (239) gives (240); also note the relationship between (240) and (238). By equating coefficients, we can truncate the infinite vector auto-regressive model at ℳ ≫ m, for i = 1, 2, …, ℳ. Then, we obtain the estimates of {Y_i}, i = 1, …, ℳ, by solving a least squares problem; recalling (237), the required covariances are obtained from the sample covariance matrices {C_l}. Alternately, W can be computed via {V_l}. To compute V_l, recalling (237), we have the relationship

V̄_l = V_l − a_l I_{n_z} + C_l W_s = C_l W, l = 1, 2, …, m

Applying the vec(·) function, which converts W into a column vector as in (35), together with the Kronecker product ⊗, we can obtain the optimal Kalman filter gain W by solving the resulting least squares problem; a unique solution exists if C_a has full column rank.

APPENDIX F CHOLESKY DECOMPOSITION AND EIGEN DECOMPOSITION
To solve for R using R3, we first perform a Cholesky decomposition of S^{-1}, that is, S^{-1} = LL′. Then, substituting into (85), G = RS^{-1}R becomes

(L′RL)(L′RL) = L′GL (267)

Let us perform an eigendecomposition of the left-hand side quantity in (267), L′GL = UΛU′, so that

L′RL = UΛ^{1/2}U′

and therefore

R = (L′)^{-1} UΛ^{1/2}U′ L^{-1}
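A minimal sketch of this route, reconstructed from the steps named above (ours, not the paper's code):

```python
# A minimal sketch of the Appendix F route for R3: with S^{-1} = L L' from a
# Cholesky factorization, R S^{-1} R = G becomes (L' R L)^2 = L' G L, whose
# SPD square root is obtained by eigendecomposition.
import numpy as np

def solve_R3_cholesky(S, G):
    L = np.linalg.cholesky(np.linalg.inv(S))           # S^{-1} = L L'
    M = L.T @ G @ L                                    # M = (L' R L)^2
    lam, U = np.linalg.eigh(M)                         # eigendecomposition of M
    M_half = U @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ U.T
    L_inv = np.linalg.inv(L)
    R = L_inv.T @ M_half @ L_inv                       # undo the congruence
    return 0.5 * (R + R.T)                             # enforce exact symmetry
```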

APPENDIX G SIMULTANEOUS DIAGONALIZATION
To solve for R using R3, we first perform an eigendecomposition of S^{-1}, that is, S^{-1} = VΛV′. Noting that T = Λ^{1/2}V′ satisfies S^{-1} = T′T, the relation G = RS^{-1}R becomes (TRT′)(TRT′) = TGT′, and R can be computed as

R = T^{-1}(TGT′)^{1/2}(T′)^{-1}