MSCEqF: A Multi State Constraint Equivariant Filter for Vision-aided Inertial Navigation

This letter revisits the problem of visual-inertial navigation systems (VINS) and presents a novel filter design we dub the multi state constraint equivariant filter (MSCEqF, in analogy to the well-known MSCKF). We define a symmetry group and corresponding group action that allow the design of an equivariant filter for the problem of visual-inertial odometry (VIO) including IMU bias, and camera intrinsic and extrinsic calibration states. In contrast to state-of-the-art invariant extended Kalman filter (IEKF) approaches that simply tack IMU bias and other states onto the $\mathbf{SE}_2(3)$ group, our filter builds upon a symmetry that properly includes all the states in the group structure. Thus, we achieve improved behavior, particularly when linearization points deviate largely from the truth (i.e., on transients upon state disturbances). Our approach is inherently consistent even during convergence phases from significant errors without the need for error uncertainty adaptation, observability constraints, or other consistency-enforcing techniques. This leads to greatly improved estimator behavior for significant errors and unexpected state changes during, e.g., long-duration missions. We evaluate our approach with a multitude of different experiments using three different prominent real-world datasets.


I. INTRODUCTION AND RELATED WORK
In the past years, VINS have shown remarkable success in estimating the position and orientation of robots by relying only on low-cost and lightweight IMUs and cameras.
Popular algorithms for VINS include visual-inertial odometry (VIO) and visual-inertial simultaneous localization and mapping (VI-SLAM). VIO focuses only on the local surroundings and is, therefore, computationally simpler, less accurate, and it suffers from accumulated drift. VINS algorithms can also suffer from inconsistencies [1]. The classical extended Kalman filter (EKF)-SLAM algorithm suffers from overconfidence due to spurious information gain along the unobservable directions [2]. Different solutions have been proposed in the literature to overcome the problems caused by inconsistencies. By manipulating the linearization point and enforcing the correct number of unobservable directions for the linearized system, Huang et al. introduced the first estimate Jacobian (FEJ) [3], and Hesch et al. the observability constraint (OC) [1], as techniques aiming at solving the inconsistency issue at the cost of sub-optimal linearization points. More recently, in [4], Barrau and Bonnabel introduced the IEKF and showed that exploiting the natural symmetry of group affine systems leads to algorithms that are inherently consistent [5]. Although the IEKF theory does not apply to inertial navigation systems (INS) when IMU biases are explicitly considered, many authors [6,7,8,9,10,11,12] have exploited the Imperfect-IEKF framework [13] to design VINS algorithms.

Manuscript received: July, 2, 2023; Revised October, 13, 2023; Accepted November, 13, 2023. This paper was recommended for publication by Editor Pascal Vasseur upon evaluation of the Associate Editor and Reviewers' comments. This work was supported by the European Union's Horizon 2020 research and innovation program under grant agreement 871260 (BugWright2), and by the Army Research Office under Cooperative Agreement Number W911NF-21-2-0245. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute preprints for Government purposes notwithstanding any copyright notation herein.
1 Alessandro Fornasier, Eren Allak and Stephan Weiss are with the Control of Networked Systems Group, University of Klagenfurt, Austria. {name.surname}@ieee.org
2 Pieter van Goor and Robert Mahony are with the System Theory and Robotics Lab, Australian National University, Australia. {name.surname}@anu.edu.au
In very recent research, van Goor et al. introduced the EqF [14,15] as a general filter design for systems on homogeneous spaces, and proposed a symmetry for fixed landmark measurements in the context of VI-SLAM [16,17,18,19,20]. Later, Fornasier et al. proposed a novel symmetry for INS that couples navigation states and IMU bias and developed an EqF design for INS [21,22] that proved superior to the state of the art in terms of robustness to wrong initialization, transient behavior, and consistency properties. In a very recent research study [23], the same authors analyzed the theoretical properties of different symmetry groups when employed in designing filters for inertial navigation systems, and provided a discussion of the relative strengths and weaknesses of different filter algorithms.
For vision-aided INS, however, the lack of robustness against unexpected disturbances and the requirement for sophisticated tuning for a given environment and setup remain important limitations. Real-world deployments are typically constrained to precise tuning and highly engineered codebases, where the core VIO algorithm is encompassed by numerous modules responsible for tasks such as initialization, failure detection, algorithm reset, and more. A people's visual-inertial odometry, that is, an algorithm whose operation requires minimal knowledge, little to no tuning, and yet still functions in many different real-world scenarios, would enable a whole new tranche of real-world applications without the requirement of having highly trained engineers available. The present letter builds upon the recent results in [21,22,23] and is a step towards enabling this goal.
This perspective shifts the evaluation of algorithm performance from measures such as root mean square error (RMSE), accuracy, and precision, to measures such as the likelihood of failure for poor initial conditions or poor calibration. We acknowledge that state-of-the-art VINS approaches have reached a plateau in the former metrics, but there is still considerable room for improvement in the latter metrics. Furthermore, this letter does not claim completeness in comparative evaluations. Rather, we present our novel findings enabling a multi state constraint equivariant filter (MSCEqF) as a step towards the people's VIO, and compare it against OpenVINS [24], the best available open-source MSCKF [25]. We regard an extensive comparison covering all suitable approaches as work that goes beyond the scope of this letter.
Apart from the different evaluation metrics, this work differentiates itself from the state of the art by extending insights on symmetries and EqF design for fixed landmark VINS [19,20], and for INS including IMU bias in the symmetry [21,22,23], to the idea of a multi state constraint but equivariant VINS. To the best of our knowledge, the resulting algorithm is the first equivariant multi state constraint filter for VIO. Our approach, dubbed MSCEqF, leverages a semi-direct product symmetry group, yielding improved linearized error dynamics when compared to other filter types [23]. Hence, the MSCEqF demonstrates consistency naturally without artificial changes of linearization points and very high robustness to poor extrinsic calibration. It not only handles significant absolute (calibration) errors but also addresses the concept of dealing with "you don't know what you don't know", such as errors exceeding the prior covariance (e.g., sudden changes of calibration states due to a disturbance during the operational phase of the robotic platform, where the state has converged already and the covariance has shrunk).
To summarize, with this work, we make the following contributions:
(i): We introduce the MSCEqF, a novel multi state constraint visual-inertial navigation system based on the equivariant filter framework, with camera and IMU self-calibration capabilities.
(ii): We demonstrate that the proposed MSCEqF achieves state-of-the-art accuracy, with superior robustness to significant absolute errors, as well as errors exceeding the prior covariance.
Our experiments show that the MSCEqF can be directly deployed in real-world scenarios with little tuning and no additional health-check modules. Furthermore, we show that the proposed MSCEqF is a naturally consistent filter without the need for FEJ, OC, or other heuristic techniques. We implemented our framework as a stand-alone C++ library, and we made it source-available to the community$^{1}$. Wrappers for the standard middleware (e.g., ROS1, ROS2, etc.) will be provided such that the code is available for direct use and comparison against other approaches. We derived the filter matrices in analytical form without resorting to numerical differentiation, leading to code with higher portability and lower computational complexity, appropriate for compute-limited hardware, such as nano-drones, augmented reality devices, etc.

II. MATHEMATICAL PRELIMINARIES AND NOTATION

A. Vector and matrix notation
Vectors describing physical quantities expressed in frame of reference {A} are denoted by ${}^{A}v$. Rotation matrices encoding the orientation of a frame of reference {B} with respect to a reference {A} are denoted by ${}^{A}R_{B}$; in particular, ${}^{A}v = {}^{A}R_{B}\,{}^{B}v$. $I_n \in \mathbb{R}^{n \times n}$ denotes the $n$-dimensional identity matrix, and $0_{n \times m} \in \mathbb{R}^{n \times m}$ denotes the zero matrix with $n$ rows and $m$ columns.

B. Lie theory
A Lie group $\mathbf{G}$ is a smooth manifold endowed with a smooth group structure. For any $X, Y \in \mathbf{G}$, the group multiplication is denoted $XY$, the group inverse $X^{-1}$, and the identity element $I$.
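Given a Lie group $\mathbf{G}$, $\mathcal{G}$ denotes the $\mathbf{G}$-torsor [26], that is, the underlying manifold of $\mathbf{G}$ without the group structure.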
For a given Lie group $\mathbf{G}$, the Lie algebra $\mathfrak{g}$ is a vector space corresponding to the tangent space at the identity of the group, together with a bilinear non-associative map $[\cdot, \cdot] : \mathfrak{g} \times \mathfrak{g} \to \mathfrak{g}$ called the Lie bracket. The Lie algebra $\mathfrak{g}$ is isomorphic to a vector space $\mathbb{R}^n$ of dimension $n = \dim(\mathfrak{g})$.
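Define the wedge map and its inverse, the vee map, as linear isomorphisms between the vector space and the Lie algebra
$$(\cdot)^{\wedge} : \mathbb{R}^n \to \mathfrak{g}, \qquad (\cdot)^{\vee} : \mathfrak{g} \to \mathbb{R}^n.$$
For any $X, Y \in \mathbf{G}$, define the left and right translations
$$L_X : \mathbf{G} \to \mathbf{G}, \quad L_X(Y) = XY, \qquad R_X : \mathbf{G} \to \mathbf{G}, \quad R_X(Y) = YX.$$
The Lie group ('big') Adjoint matrix is defined by
$$\mathrm{Ad}^{\vee}_{X}\, u = \left(\mathrm{d}L_X\, \mathrm{d}R_{X^{-1}} [u^{\wedge}]\right)^{\vee}$$
for every $X \in \mathbf{G}$ and $u^{\wedge} \in \mathfrak{g}$, where $\mathrm{d}L_X$ and $\mathrm{d}R_X$ denote the differentials of the left and right translation, respectively. The Lie algebra ('little') adjoint matrix is defined by
$$\mathrm{ad}^{\vee}_{u}\, v = [u^{\wedge}, v^{\wedge}]^{\vee}$$
for every $u, v \in \mathbb{R}^n$.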
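As an illustration of the maps above, the following minimal sketch instantiates the wedge, vee, Adjoint and adjoint maps on $\mathbf{SO}(3)$, where they reduce to the familiar skew-symmetric matrix, rotation, and cross product. The function names (`wedge`, `vee`, `Ad`, `ad`, `rot_z`) are illustrative choices and are not taken from the authors' library; the sketch assumes a matrix Lie group, for which $\mathrm{d}L_X \mathrm{d}R_{X^{-1}}[U] = X U X^{-1}$.

```python
import numpy as np

def wedge(u):
    """Map a vector in R^3 to the corresponding element of so(3)."""
    x, y, z = u
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def vee(U):
    """Inverse of wedge: map a skew-symmetric matrix back to R^3."""
    return np.array([U[2, 1], U[0, 2], U[1, 0]])

def rot_z(angle):
    """Rotation matrix about the z axis (an element of SO(3))."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def Ad(X, u):
    """'Big' Adjoint on SO(3): (X u^wedge X^{-1})^vee, with X^{-1} = X^T."""
    return vee(X @ wedge(u) @ X.T)

def ad(u, v):
    """'Little' adjoint: the Lie bracket [u^wedge, v^wedge]^vee."""
    U, V = wedge(u), wedge(v)
    return vee(U @ V - V @ U)

u = np.array([0.1, -0.2, 0.3])
v = np.array([0.5, 0.4, -0.1])
X = rot_z(0.7)

# On SO(3) the Adjoint is the rotation itself and the adjoint is the cross product.
print(np.allclose(Ad(X, u), X @ u))
print(np.allclose(ad(u, v), np.cross(u, v)))
```

The same pattern carries over to the larger groups used later, with only `wedge` and `vee` changing shape.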

C. Important matrix Lie groups
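The special orthogonal group $\mathbf{SO}(3)$, special Euclidean group $\mathbf{SE}(3)$, extended special Euclidean group $\mathbf{SE}_2(3)$, and their respective Lie algebras are defined, in matrix form, by
$$\mathbf{SO}(3) = \left\{ R \in \mathbb{R}^{3 \times 3} \mid R^\top R = I_3,\ \det(R) = 1 \right\}, \qquad \mathfrak{so}(3) = \left\{ \omega^{\wedge} \in \mathbb{R}^{3 \times 3} \mid \omega^{\wedge} = -(\omega^{\wedge})^\top \right\},$$
$$\mathbf{SE}(3) = \left\{ \begin{bmatrix} R & p \\ 0_{1 \times 3} & 1 \end{bmatrix} \in \mathbb{R}^{4 \times 4} \mid R \in \mathbf{SO}(3),\ p \in \mathbb{R}^3 \right\}, \qquad \mathfrak{se}(3) = \left\{ \begin{bmatrix} \omega^{\wedge} & v \\ 0_{1 \times 3} & 0 \end{bmatrix} \in \mathbb{R}^{4 \times 4} \mid \omega^{\wedge} \in \mathfrak{so}(3),\ v \in \mathbb{R}^3 \right\},$$
$$\mathbf{SE}_2(3) = \left\{ \begin{bmatrix} R & v & p \\ 0_{2 \times 3} & \multicolumn{2}{c}{I_2} \end{bmatrix} \in \mathbb{R}^{5 \times 5} \mid R \in \mathbf{SO}(3),\ v, p \in \mathbb{R}^3 \right\}, \qquad \mathfrak{se}_2(3) = \left\{ \begin{bmatrix} \omega^{\wedge} & a & b \\ 0_{2 \times 3} & \multicolumn{2}{c}{0_{2 \times 2}} \end{bmatrix} \in \mathbb{R}^{5 \times 5} \mid \omega^{\wedge} \in \mathfrak{so}(3),\ a, b \in \mathbb{R}^3 \right\}.$$
The Semi-direct Bias group $\mathbf{G}_{SD} := \mathbf{SE}_2(3) \ltimes \mathfrak{se}(3)$, introduced in [23], is a group structure on the tangent bundle $\mathbf{G}^{\ltimes}_{\mathfrak{g}} := \mathbf{G} \ltimes \mathfrak{g}$ given by the semi-direct product of a group $\mathbf{G}$ with a Lie subalgebra $\mathfrak{g}$.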
For a detailed introduction to equivariant filters for inertial navigation systems, semi-direct product groups, and the theoretical properties this work is built upon, we refer the reader to our previous works [21,22,23]. Moreover, [23] discusses the advantages of semi-direct product symmetries for filter design and compares them to classical solutions such as the MEKF and the IEKF.

E. Intrinsics group IN
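In this work, we recognize that elements of the camera intrinsics matrix [27] form a Lie group. Thus, we introduce the intrinsics group $\mathbf{IN}$ as the matrix Lie group defined by
$$\mathbf{IN} := \left\{ \begin{bmatrix} a & 0 & x \\ 0 & b & y \\ 0 & 0 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 3} \mid a, b > 0,\ x, y \in \mathbb{R} \right\}.$$
This matrix representation is associated with the standard camera intrinsics matrix, well known in computer vision. A typical element of $\mathbf{IN}$ may be written as
$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \in \mathbf{IN}.$$
To the authors' understanding, exploiting the group structure of the $\mathbf{IN}$ group in equivariant or invariant VINS design is novel to this work.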
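The group property claimed above can be checked numerically. The sketch below assumes the standard zero-skew pinhole intrinsics matrix with focal lengths and principal point as its four parameters; the helper names (`intrinsics`, `has_intrinsics_form`) and the numeric values are hypothetical and only serve the illustration.

```python
import numpy as np

def intrinsics(fx, fy, cx, cy):
    """Zero-skew pinhole intrinsics matrix (assumed parametrization)."""
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])

def has_intrinsics_form(M, tol=1e-12):
    """Check that M has positive diagonal scales, zero skew and a unit last row."""
    zeros_ok = all(abs(M[i, j]) < tol for i, j in [(0, 1), (1, 0), (2, 0), (2, 1)])
    return bool(zeros_ok and abs(M[2, 2] - 1.0) < tol and M[0, 0] > 0 and M[1, 1] > 0)

K1 = intrinsics(458.0, 457.0, 367.0, 248.0)   # illustrative values only
K2 = intrinsics(1.1, 0.9, -5.0, 3.0)

product = K1 @ K2              # closure under matrix multiplication
inverse = np.linalg.inv(K1)    # closure under inversion

print(has_intrinsics_form(product), has_intrinsics_form(inverse))
```

Both the product and the inverse keep the same sparsity pattern, which is what makes the set a matrix Lie group of dimension four.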

F. Useful maps
For all $v = (x, y, z) \in \mathbb{R}^3$, define the maps

A. System definition
Consider a mobile platform equipped with a camera observing global visual features ${}^{G}p_f$, and an IMU providing biased acceleration and angular velocity measurements, denoted by ${}^{I}w = ({}^{I}\omega, {}^{I}a)$. Define ${}^{G}T_I = ({}^{G}R_I, {}^{G}v_I, {}^{G}p_I)$ to be the extended pose of the system, where ${}^{G}R_I$ corresponds to the rigid body orientation, whereas ${}^{G}p_I$ and ${}^{G}v_I$ denote the IMU position and velocity with respect to the global frame, respectively. Define ${}^{G}P_I = ({}^{G}R_I, {}^{G}p_I)$. Define ${}^{I}b = ({}^{I}b_{\omega}, {}^{I}b_{a})$ to be the gyroscope and accelerometer biases, respectively. Let $g$ denote the magnitude of the acceleration due to gravity, and let ${}^{G}e_3$ denote the direction of gravity in the global frame. Finally, define ${}^{I}S_C$ to be the camera extrinsic calibration, and $K$ to be the camera intrinsic calibration.
For the sake of readability, from now on, we suppress all the subscripts and superscripts that are not strictly required.
Define the matrices $W$, $B$, $D$, $G$ to be

Finally, the visual-inertial navigation system is written

where $\tau$, $\mu$, $\zeta$ are used to model the deterministic dynamics of the bias and calibration states and are zero when these states are modeled as constants, as they are in our formulation. Define $\xi_I = (T, b) \in \mathbf{SE}_2(3) \times \mathbb{R}^6$ to be the inertial navigation state. Define $\xi_S = (S, K) \in \mathbf{SE}(3) \times \mathbf{IN}(3)$ to be the camera calibration state. Then the full system state is defined as $\xi = (\xi_I, \xi_S)$, and the system's input is an element of $\mathbb{R}^{18}$. Note that in this work, visual features are not considered as part of the state since the dependency of the measurement on the features is removed through nullspace projection.
Without loss of generality, let us consider the case of a single feature p f .The camera measurement is modeled as the measurement of the bearing of the feature p f seen from the camera.
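To make the bearing measurement concrete, the sketch below computes the unit bearing of a global feature seen from the camera. It is a simplified illustration under stated assumptions: the camera pose is taken as the product of the IMU pose $P$ and the extrinsics $S$, both written as $4 \times 4$ homogeneous matrices, and the intrinsics $K$ are left out. The function names `se3` and `bearing` are hypothetical.

```python
import numpy as np

def se3(R, p):
    """Build a 4x4 homogeneous transform from a rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T

def bearing(P, S, p_f):
    """Unit bearing of the global feature p_f seen from the camera with pose P S."""
    cam_pose = P @ S                                        # camera pose in the global frame
    p_cam = np.linalg.inv(cam_pose) @ np.append(p_f, 1.0)   # feature in the camera frame
    return p_cam[:3] / np.linalg.norm(p_cam[:3])

P = se3(np.eye(3), [1.0, 0.0, 0.0])   # IMU one metre along x, no rotation
S = np.eye(4)                         # camera coincident with the IMU (assumption)
p_f = np.array([1.0, 0.0, 3.0])       # feature three metres in front of the camera

print(bearing(P, S, p_f))
```

Only the direction of the feature enters the measurement, so scaling the feature's depth in the camera frame leaves the bearing unchanged.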

B. Symmetry of the visual-inertial navigation system
The symmetry for the inertial navigation state $\xi_I$ is given by the Semi-Direct symmetry group $\mathbf{G}_{SD} := \mathbf{SE}_2(3) \ltimes \mathfrak{se}(3)$, the symmetry for the extrinsic calibration state is given by the special Euclidean group $\mathbf{SE}(3)$, and the symmetry for the intrinsic calibration state is given by the intrinsics group $\mathbf{IN}$. The complete symmetry for the visual-inertial navigation system is thus defined to be the product group $\mathbf{G} := \mathbf{G}_{SD} \times \mathbf{SE}(3) \times \mathbf{IN}$.

Then, $\phi$ is a transitive right group action of $\mathbf{G}$ on $\mathcal{M}$.

C. Lifted system
The implementation of the equivariant filter (EqF) requires a lift $\Lambda : \mathcal{M} \times \mathbb{L} \to \mathfrak{g}$ to define a lifted system on the symmetry group $\mathbf{G}$ that projects down to the original system dynamics via the proposed group action $\phi$. The transitivity of $\phi$ guarantees the existence of such a lift [28], and the following theorem provides an explicit form for a lift of the system studied in this paper.

where

Then $\Lambda$ is a lift for the system in Equ. (1) with respect to the symmetry group $\mathbf{G}$.
The existence of the lift allows the construction of a lifted system on the symmetry group [28]. Let $X \in \mathbf{G}$ be the state of the lifted system, and let $\mathring{\xi} = (\mathring{T}, \mathring{b}, \mathring{S}, \mathring{K}) \in \mathcal{M}$ be an arbitrarily chosen element of the original state in Equ. (1), called the origin. Then the lifted system is defined

A. Filter state definition
Define $\hat{X} = (((\hat{D}, \hat{\delta}), \hat{E}, \hat{L}), \hat{E}_1, \cdots, \hat{E}_k) \in \mathbf{G} \times \mathbf{SE}(3)^k$ to be the filter's state evolving on the symmetry group. Similarly to the original formulation [25], we maintain a sliding window of $k$ past $\hat{E}$ elements in the state of the filter, corresponding to the different times a camera measurement was collected.

B. Error dynamics and state transition matrix
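Let $e = \phi_{\hat{X}^{-1}}(\xi)$ denote the equivariant error. Normal coordinates [15] of the state space $\mathcal{M}$ in a neighborhood of the origin $\mathring{\xi}$ are $\varepsilon = \vartheta(e) := \log(\phi_{\mathring{\xi}}^{-1}(e))^{\vee} \in \mathbb{R}^{25}$, where $\log : \mathbf{G} \to \mathfrak{g}$ is the logarithm of the symmetry group.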
Recall the derivation of the linearized error dynamics in [15]
$$\dot{\varepsilon} \approx A_t^0\, \varepsilon.$$
The state matrix $A_t^0$ is given by

where

The discrete-time state transition matrix is defined by $\Phi = \exp(A_t^0 \Delta T)$ for time steps $\Delta T$.
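The discretization step $\Phi = \exp(A_t^0 \Delta T)$ can be sketched as follows. The state matrix used here is a toy two-state (position, velocity) example and not the $25 \times 25$ matrix of the filter; the truncated Taylor series `expm_taylor` is a simple stand-in for a closed-form expression or a library routine such as `scipy.linalg.expm`, and all names are illustrative.

```python
import numpy as np

def expm_taylor(M, terms=20):
    """Matrix exponential via a truncated Taylor series (adequate for small ||M||)."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k      # accumulates M^k / k!
        result = result + term
    return result

# Toy continuous-time state matrix: d/dt [position, velocity] = [velocity, 0].
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
dT = 0.005                       # 200 Hz IMU rate, an assumed value

Phi = expm_taylor(A * dT)        # discrete-time state transition matrix
eps0 = np.array([0.1, -0.2])     # some initial linearized error
eps1 = Phi @ eps0                # error propagated over one time step

print(Phi)
```

For this nilpotent example the series terminates after the linear term, so `Phi` equals the identity plus `A * dT` exactly.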

C. Multi state constraint
Consider the measurement model in Equ. (2). Applying the action of the symmetry group to the state space in Equ. (3) yields

Recall the equivariant error $e = \phi_{\hat{X}^{-1}}(\xi)$

where $\varsigma(\cdot)$ represents the chosen feature parametrization. The true feature can then be written as $p_f = \varsigma^{-1}(\varsigma(\hat{p}_f) + \tilde{y})$. Therefore, the measurement model in Equ. (2) can be linearized at $\varepsilon = 0$ and $\tilde{y} = 0$ as follows:

Let us derive $C_t$ and $C^f_t$ for the anchored inverse depth parametrization [29,25] of the feature. Note that the matrix $C^f_t$ can be computed for any desired parametrization. Let A P A S be the pose of the anchor, defined as the pose of the camera where the feature $p_f$ has been first seen. Define the feature in the anchor frame as $a_f = ($A P A S$)^{-1} * p_f$, with $a_f = (a_{fx}, a_{fy}, a_{fz}) \in \mathbb{R}^3$. The anchored inverse depth parametrization is written
$$\varsigma(a_f) = \frac{1}{a_{fz}} \begin{bmatrix} a_{fx} & a_{fy} & 1 \end{bmatrix}^\top.$$
Then the matrix $C^f_t$ is written

where we have used $\hat{\xi} := \phi_{\hat{X}}(\mathring{\xi})$ to map between the estimated state in the homogeneous space $\hat{\xi}$ and the estimated state in the symmetry group $\hat{X}$. Therefore

According to [15], the $C_t$ matrix is defined by

where $\varepsilon_E$ and $\varepsilon^{A}_{E}$ represent, respectively, the error in normal coordinates for the element $E$ of the symmetry group corresponding to the most recent pose and to the anchor pose, whereas $\varepsilon_L$ represents the error in normal coordinates that is related to the camera intrinsics.
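The feature parametrization $\varsigma(\cdot)$ and the relation $p_f = \varsigma^{-1}(\varsigma(\hat{p}_f) + \tilde{y})$ can be illustrated with a short sketch. It assumes the common MSCKF-style anchored inverse depth form $(x/z,\ y/z,\ 1/z)$ for a feature expressed in the anchor frame; the function names are hypothetical.

```python
import numpy as np

def inverse_depth(a_f):
    """Anchored inverse depth (assumed form): (x/z, y/z, 1/z) in the anchor frame."""
    x, y, z = a_f
    return np.array([x / z, y / z, 1.0 / z])

def inverse_depth_inv(s):
    """Inverse map: recover the Euclidean feature from (alpha, beta, rho)."""
    alpha, beta, rho = s
    return np.array([alpha / rho, beta / rho, 1.0 / rho])

a_f_hat = np.array([0.4, -0.2, 2.0])        # estimated feature in the anchor frame
y_tilde = np.array([0.01, -0.02, 0.05])     # feature error in the parametrization

# True feature written through the parametrization: sigma^{-1}(sigma(a_hat) + y_tilde).
a_f_true = inverse_depth_inv(inverse_depth(a_f_hat) + y_tilde)

print(inverse_depth(a_f_hat), a_f_true)
```

Expressing the feature error additively in inverse depth coordinates tends to behave better for distant features than an additive error in Euclidean coordinates, which is the usual motivation for this parametrization.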
To compute the matrix $C_t$ in Equ. (12), an estimate of the feature position in the anchor frame is required. To this end, when a feature has been seen from multiple views, a linear-nonlinear least squares problem can be solved [25,24].
Finally, to remove the dependency on the features, and hence perform a filter update, we employ nullspace marginalization of the matrix $C^f_t$ in Equ. (8), according to the original formulation [25].
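The nullspace marginalization step can be sketched as follows, in the spirit of the original MSCKF formulation. Given a stacked linearized residual $r \approx C \varepsilon + C^f \tilde{y} + n$, projecting onto the left nullspace of $C^f$ removes the feature error. The sketch assumes $C^f$ has full column rank and uses random matrices purely as placeholders; `nullspace_project` is a hypothetical helper name.

```python
import numpy as np

def nullspace_project(r, C, Cf):
    """Project residual and state Jacobian onto the left nullspace of Cf."""
    # Full QR: the columns of Q beyond the first Cf.shape[1] span the left nullspace of Cf.
    Q, _ = np.linalg.qr(Cf, mode="complete")
    N = Q[:, Cf.shape[1]:]
    return N.T @ r, N.T @ C, N

rng = np.random.default_rng(0)
m, n_state, n_feat = 8, 5, 3                # 4 views x 2 rows, toy state size, 3D feature
C = rng.standard_normal((m, n_state))       # placeholder state Jacobian
Cf = rng.standard_normal((m, n_feat))       # placeholder feature Jacobian
r = rng.standard_normal(m)                  # placeholder residual

r0, C0, N = nullspace_project(r, C, Cf)

print(r0.shape, C0.shape)
```

Because the columns of `N` are orthonormal, isotropic measurement noise keeps the same covariance after projection, so the projected residual can be used directly in a standard filter update.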

V. EXPERIMENTS
In this letter, we perform a series of experiments to evaluate the accuracy, consistency, and, more importantly, robustness of the proposed MSCEqF. We perform many experiments on real-world data to evaluate robustness to expected and unexpected errors in the camera extrinsic calibration. In all these experiments, we limit our comparison to filter-based MSCKF algorithms for VIO, and in particular, to the one we believe represents the state of the art, that is OpenVINS [24]. For a fair comparison, we turned off OpenVINS's persistent features (SLAM features), and only compare against its pure MSCKF part. Furthermore, in all the experiments, OpenVINS's MSCKF parameters were specifically tuned, for each dataset, according to the authors' suggested parameters. In contrast, the proposed MSCEqF shares the same tuning parameters across all the experiments and datasets.

A. Robustness
Robustness is an important property of a modern filter-based visual-inertial odometry algorithm. It is the ability to function with significant yet known errors, as well as the ability to deal with unknown unknowns. In simpler terms, it refers to how well an algorithm performs under non-ideal conditions, such as imperfect tuning parameters, poor calibration, or unexpected changes in the sensor's extrinsic parameters during field operations.
To assess the robustness of the proposed MSCEqF and the MSCKF, we ran a series of experiments using widely known datasets for evaluating VIO algorithms, specifically the Euroc dataset [30], the TUM-VI dataset [31], and the UZH-FPV dataset [32]. For each dataset, we selected two sequences and ran each estimator 6 × 6 × 6 = 216 times per sequence (for a total number of runs of 2592). In these experiments, we intentionally initialized the filters with incorrect camera extrinsic parameters, introducing errors in six steps ranging from (15°, 0.05 m) to (90°, 0.3 m). For each error step, we ran the estimators with six different priors (initial covariance) accounting for initial calibration errors in the range of the six error steps. For each pair (prior, error), we ran each estimator six times. Finally, for each individual run, we classified an estimator as converged or diverged based on a position error threshold.
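The run counts quoted above follow from the structure of the experiment grid, which the sketch below enumerates. The evenly spaced error steps are an assumption (only the first and last steps are stated in the text), and the variable names are illustrative.

```python
from itertools import product

datasets = ["Euroc", "TUM-VI", "UZH-FPV"]
sequences_per_dataset = 2
estimators = ["MSCEqF", "MSCKF"]

# Six error steps from (15 deg, 0.05 m) to (90 deg, 0.3 m); even spacing is assumed.
error_steps = [(15 * (i + 1), round(0.05 * (i + 1), 2)) for i in range(6)]
prior_steps = list(error_steps)   # priors cover the same range as the errors
repetitions = range(6)            # six runs per (prior, error) pair

grid = list(product(prior_steps, error_steps, repetitions))
runs_per_sequence = len(grid)
total_runs = runs_per_sequence * sequences_per_dataset * len(datasets) * len(estimators)

print(runs_per_sequence, total_runs)
```

The total therefore counts both estimators over all six sequences.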
Based on the results of the experiment in Fig. 1, we derive the following noteworthy observations. In absolute terms, there seems to be an upper limit of absolute error that, no matter the prior, makes the estimators diverge. Although this limit highly depends on the dataset, for each of the tested sequences, the proposed MSCEqF possesses a higher error limit, and hence improved robustness to known absolute error. In relative terms, the proposed MSCEqF seems to deal better with unknown errors since the line at which the estimator fails is straight and does not bend towards the left side as it appears to happen for the MSCKF. Encouraged by these results, we ran an additional experiment on the V1_01_easy sequence of the Euroc dataset, introducing new, smaller priors and errors to effectively evaluate whether the estimators are able to manage errors that are smaller in absolute terms but outside the prior covariance. Fig. 2 clearly shows that the MSCEqF is indeed a more robust filter, able to deal with unexpected errors. Finally, Fig. 3 shows the convergence of the camera extrinsic parameters for both filters evaluated on the Euroc V1_01_easy sequence, with an initial error of (30°, 0.1 m) and an initial covariance to match the error. The error plots clearly show that the proposed MSCEqF not only is a more robust filter, but also converges faster.
Quantifying robustness in robotics, however, remains an ongoing challenge. In the presented evaluation, we have chosen the camera extrinsic calibration as a state subjected to error. Even though static and dynamic initialization approaches exist [33,34] for such a problem, in our formulation, extrinsic parameters are treated as regular state variables, and our proposed algorithm showcases inherent robustness by successfully attaining reliable estimation, for both expected and unexpected errors, eliminating the need for any auxiliary module. This characteristic sets our algorithm apart from conventional VIO algorithms, emphasizing its superior robustness.

B. Accuracy
Our next experiment focuses on the classical and widely used metric for evaluating the performance of visual-inertial odometry algorithms [35], namely the RMSE of the absolute trajectory error (ATE). For this experiment, we ran the proposed MSCEqF and OpenVINS's MSCKF on all Euroc sequences [30]. The results presented in Tab. I demonstrate that the proposed MSCEqF achieves state-of-the-art accuracy comparable to the MSCKF. It should be noted that in our evaluation, we aligned each estimate with the ground truth using the initial state rather than finding the optimal alignment that minimizes the error throughout the entire trajectory.
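The position part of the metric, with alignment on the initial state, can be sketched as below. This is a simplified, translation-only illustration: the alignment used in the evaluation presumably also involves the initial orientation, which is omitted here, and the function names and the toy trajectory are hypothetical.

```python
import numpy as np

def align_by_initial_state(est_pos, gt_pos):
    """Shift the estimate so that its first position matches the ground truth."""
    return est_pos + (gt_pos[0] - est_pos[0])

def position_ate_rmse(est_pos, gt_pos):
    """RMSE of the position error after aligning on the initial state only."""
    err = align_by_initial_state(est_pos, gt_pos) - gt_pos
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))

gt = np.array([[0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0],
               [2.0, 0.0, 0.0]])
# Estimate with a constant offset plus one metre of vertical drift at the last pose.
est = np.array([[1.0, 1.0, 1.0],
                [2.0, 1.0, 1.0],
                [3.0, 1.0, 2.0]])

print(position_ate_rmse(est, gt))
```

Aligning on the first state only means that accumulated drift shows up fully in the metric, whereas an optimal alignment over the whole trajectory would absorb part of it.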

C. Consistency
An estimator is said to be consistent if the estimated covariance of the error reflects its real distribution; in other words, an estimator is consistent if the error is unbiased and within

4), to reference frame transformations [20,10]. This ensures that the filter does not gain spurious information along the unobservable directions.

Then the action of the symmetry group on the state space $\phi$ and the lift $\Lambda$ are respectively compatible and invariant with respect to change of reference, that is

as required.

where we have used the fact that H

It is straightforward to see that $R_H\, g e_3 = g e_3$ since $R_H$ is a rotation about the $e_3$ axis. This completes the proof.
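The last step of the proof, $R_H\, g e_3 = g e_3$ for a rotation about $e_3$, is easy to verify numerically. The sketch below also shows that a rotation about a different axis does not preserve the gravity vector. The value of $g$ and the angles are arbitrary illustrative choices.

```python
import numpy as np

def rot_e3(yaw):
    """Rotation about the vertical axis e3."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_e1(roll):
    """Rotation about the horizontal axis e1, used as a counterexample."""
    c, s = np.cos(roll), np.sin(roll)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

g = 9.81                               # assumed gravity magnitude
e3 = np.array([0.0, 0.0, 1.0])
gravity = g * e3

yaw_invariant = bool(np.allclose(rot_e3(np.deg2rad(37.0)) @ gravity, gravity))
tilt_invariant = bool(np.allclose(rot_e1(np.deg2rad(5.0)) @ gravity, gravity))

print(yaw_invariant, tilt_invariant)
```

Changes of reference that rotate about the vertical axis or translate the origin are the ones that leave gravity unchanged, which is why the theorem is stated for this family of transformations.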
In this final experiment, we employed the pose (orientation and position) average normalized estimation error squared (ANEES) as a metric to analyze the consistency of the proposed MSCEqF.In particular, we used the VINSEval framework [36] to generate a photorealistic synthetic dataset of 25 runs of the same trajectory, with the same noise statistics but different noise realizations.
The ANEES for the MSCEqF was computed according to the following formula
$$\mathrm{ANEES} = \frac{1}{nM} \sum_{i=1}^{M} \varepsilon_i^\top \Sigma_i^{-1} \varepsilon_i,$$
where $M$ is the number of runs, $n = \dim(\varepsilon)$ is the dimension of the error $\varepsilon$, and $\Sigma$ is the covariance of the error. The error $\varepsilon$ is the pose component of the equivariant error defined in Sec. IV-B. The resulting ANEES shown in Fig. 4 fluctuates around a computed average of 1.0 and is neither increasing nor decreasing over time. This average is very similar to that of FEJ estimators [24,37], but it is achieved without requiring artificial modification of the linearization points.
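The ANEES at a single time step can be computed as in the following sketch, which uses the standard definition with one error vector and one estimated covariance per run. The toy inputs are chosen so that the result is exactly one; the function name `anees` is illustrative.

```python
import numpy as np

def anees(errors, covariances):
    """Average normalized estimation error squared over M runs at one time step."""
    M = len(errors)
    n = errors[0].shape[0]
    # NEES of each run: e^T Sigma^{-1} e, computed without forming the inverse.
    nees = [e @ np.linalg.solve(S, e) for e, S in zip(errors, covariances)]
    return float(sum(nees) / (n * M))

# Two toy runs of a 6-dimensional pose error whose size matches the covariance.
errors = [np.ones(6), 2.0 * np.ones(6)]
covariances = [np.eye(6), 4.0 * np.eye(6)]

print(anees(errors, covariances))
```

Values persistently above one indicate an overconfident filter, values below one a conservative one.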

VI. CONCLUSION
This letter presented the multi state constraint equivariant filter (MSCEqF), a novel equivariant filter formulation for the VIO problem, capable of camera intrinsic and extrinsic self-calibration. With our approach, we address the need for a VIO algorithm that achieves state-of-the-art accuracy and consistency while minimizing the need for sophisticated tuning and remaining robust against expected and unexpected errors. Through the presented experiments, we have demonstrated that the proposed MSCEqF successfully tackles these requirements. It exhibits robustness against both high absolute errors and unexpected errors that exceed the prior covariance. Furthermore, the MSCEqF has been shown to be a naturally consistent estimator, achieving accuracy comparable to a state-of-the-art MSCKF algorithm but without the need for additional health-check or consistency-enforcing modules and heuristics. Future work includes the extension of the proposed MSCEqF with a polar symmetry for explicit SLAM features [20].

, $a, b \in \mathbb{R}^3$. Define the subgroups $B = \chi(D) \in \mathbf{SE}(3)$ and $C = \Theta(D) \in \mathbf{SE}(3)$. Finally, define $E \in \mathbf{SE}(3)$ and $L \in \mathbf{IN}$. Lemma 3.1. Define

Figure 1. Results of the experiment evaluating the robustness of the proposed MSCEqF and OpenVINS's MSCKF. In these grid plots, the x-axis is the prior standard deviation the estimators are set with. The y-axis is how many σ-levels the introduced error corresponds to. Labeled diagonal dashed lines represent iso-error lines (lines along which the error is constant). The bottom part of each grid represents expected errors, that is, errors falling within 1/6σ to 1/2σ, whereas the top part of each grid represents unexpected errors, that is, errors falling within 2σ to 6σ. According to the colorbar, the color of each cell shows the number of failures.

Figure 2. Grid plot showing the robustness of the proposed MSCEqF compared to OpenVINS's MSCKF for unexpected errors, that is, the ability to deal with "you don't know what you don't know". The x-axis is the prior standard deviation the estimators are set with. The y-axis is how many σ-levels the introduced error corresponds to. Diagonal dashed lines represent iso-error lines. The blue bold dashed line is the limit at which each estimator fails. According to the colorbar, the color of each cell represents the number of failures.

Theorem 5.1. Define $H := (R_H, 0, p_H) \in \mathbf{SE}_2(3)$, where $R_H \in \mathbf{SE}_{e_3}(3)$ represents an anticlockwise rotation about the vertical axis $e_3$, and $p_H$ represents a translation. Define the right group action $\alpha : \mathbf{SE}_2(3) \times \mathcal{M} \to \mathcal{M}$ such that $\alpha(H, \xi) := (H^{-1}T, b, S, K)$ represents a change of reference from {G} to {H} that leaves the direction of gravity unchanged.

Figure 3. Absolute errors of camera extrinsic parameters for the proposed MSCEqF and OpenVINS's MSCKF. The plots show the convergence performance of the filters evaluated on the Euroc V1_01_easy sequence, for 6 runs, with an initial error of (30°, 0.1 m).

Figure 4. Pose (orientation and position) ANEES of the proposed MSCEqF for 25 runs on a custom dataset generated with the VINSEval framework.

ε = log SE(3) ( P−1 P P−1 P) ∨ is the pose component of the equivariant error defined in Sec. IV-B.

Table I
ATTITUDE (A), AND POSITION (P) ABSOLUTE TRAJECTORY ERROR (ATE) RMSE ON EUROC DATASET