Plane-Aided Visual-Inertial Odometry for 6-DOF Pose Estimation of a Robotic Navigation Aid

The classic visual-inertial odometry (VIO) method estimates the 6-DOF pose of a moving camera by fusing the camera's ego-motion estimated by visual odometry (VO) with the motion measured by an inertial measurement unit (IMU). The VIO attempts to update the estimates of the IMU's biases at each step by using the VO's output to improve the accuracy of the IMU measurements. This approach works only if an accurate VO output can be identified and used. However, there is no reliable method for evaluating the accuracy of the VO online. In this paper, a new VIO method is introduced for pose estimation of a robotic navigation aid (RNA) that uses a 3D time-of-flight camera for assistive navigation. The method, called plane-aided visual-inertial odometry (PAVIO), extracts planes from the 3D point cloud of the current camera view and tracks them onto the next camera view by using the IMU's measurements. The covariance matrix of each tracked plane's parameters is computed and used to perform a plane consistency check, based on a chi-square test, that evaluates the accuracy of the VO's output. PAVIO accepts a VO output only if it is accurate. The accepted VO outputs, the information of the extracted planes, and the IMU's measurements over time are used to create a factor graph. By optimizing the graph, the method improves the accuracy in estimating the IMU biases and reduces the camera's pose error. Experimental results with the RNA validate the effectiveness of the proposed method. PAVIO can be used to estimate the 6-DOF pose of any 3D-camera-based visual-inertial navigation system.


I. INTRODUCTION
Visual impairment reduces a person's independent mobility and severely deteriorates the quality of life. According to the World Health Organization, there are about 285 million people with visual impairment, of whom 39 million are blind. Because age-related diseases are the leading causes of vision loss, the visually impaired community is growing as the population continues to age. Therefore, there is a pressing need to develop new mobility tools that may help the visually impaired move around independently. In the literature, a number of Robotic Navigation Aids (RNAs) [1]-[8] have been introduced to guide the blind in indoor environments. Among these RNAs, vision-based systems are becoming more and more popular because the cameras used in these RNAs, such as monocular cameras [2], [9], stereo cameras [3], [4], [10], RGB-D cameras [5], [6], [11] and 3D time-of-flight (TOF) cameras [7], [8], can provide better navigational information, including 6-DOF device pose (position and orientation) and object-level information. The pose information can be used to build a 3D map of the environment, locate a blind traveler in the environment, and guide the traveler to the destination. The problem of camera pose estimation is also known as visual simultaneous localization and mapping (SLAM). The state-of-the-art visual SLAM algorithms [12]-[15] have demonstrated their effectiveness in existing research. However, they are effective only in a feature-rich environment. Their performance may be compromised when the operating environment is feature-sparse. To address this problem, two approaches have been taken in the literature. The first approach is to use the geometric features (e.g., planes, lines) of the operating environment to limit the pose estimation error because such features are ubiquitous in a man-made environment.

(The associate editor coordinating the review of this manuscript and approving it for publication was Heng Wang.)
Plane features have been used in EKF-SLAM [16] and pose-graph SLAM [17]-[19] to reduce the pose error accumulated by visual odometry (VO). In [7] and [20], the floor plane and wall lines were extracted from the point cloud data of a 3D TOF camera and used to reduce the pose error of an RNA for wayfinding in an indoor environment. However, the available geometric features in some cases may be inadequate for a full correction of the 6-DOF pose. For instance, there are often only one or two planes observable when using the RNA [7], [20] in an indoor environment. In this case, only part of the 6-DOF pose estimation error can be corrected because at least three intersecting planes are needed for a full pose error correction. The second approach is to use an additional sensor, usually an inertial measurement unit (IMU) or a gyro, and perform sensor fusion to reduce the pose estimation error. In the robotics community, the combination of a camera and an IMU has become standard equipment for SLAM. The camera-IMU-based SLAM method is termed visual-inertial odometry (VIO). The state-of-the-art VIO [21]-[25] uses VO to periodically update the estimates of the IMU's biases and integrates the IMU's measurements for pose estimation when the VO malfunctions. A more efficient way to incorporate the data is to fuse the IMU's measurements with the visual observations by incremental smoothing. However, VO can fail or produce a large pose error in a feature-sparse environment. In either case, the estimates of the IMU biases can become inaccurate, resulting in a large pose estimation error. Although a VO failure may be detected, it is difficult to assess VO accuracy online. To address this issue, we propose a new method, called Plane-Aided VIO (PAVIO), for pose estimation of an RNA that uses a 3D TOF camera and an IMU for wayfinding.
The method extracts plane features from the camera's point cloud data and tracks these features across the camera's data frames to associate plane features between the frames. It then uses the plane association information to reduce pose estimation error by minimizing the residual of a factor graph that integrates the VO-estimated pose changes (PCs), the IMU-estimated PCs, and the information of the associated planes. The main contributions of this paper include:
• We propose a plane-aided VIO method that improves pose estimation accuracy by tightly coupling the observed planes with the VO-computed pose change estimates and the IMU's measurements in a graph optimization framework.
• We propose a plane-consistency check (PCC) method to detect and reject inaccurate pose change estimates by the VO. The PCC prevents the proposed VIO method from using inaccurate VO data and therefore its use results in more accurate pose estimation.
• We implemented the proposed plane-aided VIO method on the visual-inertial navigation system of an RNA prototype for assistive navigation of the visually impaired and carried out experiments in real-world environments. The results validate its practicality for the wayfinding application.

II. OVERVIEW OF THE RNA
The RNA prototype that we have fabricated is depicted in Fig. 1. It is a computer-vision-enhanced white cane that allows a blind traveler to ''see'' the surrounding environment. It provides the user with the desired travel direction in an intuitive way and offers a friendly human-device interface. The computer vision system uses a 3D TOF camera (SwissRanger SR4000 of Mesa Imaging) for 3D perception, an IMU (VN-100 of VectorNav Technologies, LLC) for motion measurement, and an UP Board computer for computation. The SR4000 illuminates the environment with modulated infrared light and measures range up to 5 meters (accuracy: ∼1 cm, resolution: 176 × 144) by using phase-shift measurement. The camera produces range, intensity and confidence data at a rate of up to 54 frames per second. The SR4000 has a much better range measurement accuracy for distant objects and better data completeness than a stereo camera. This may result in better pose estimation performance. The IMU measures the angular velocity and acceleration at 200 Hz. All software runs on the UP Board. The system provides both pose estimation and obstacle detection functions. The results are used to guide the blind traveler for wayfinding and obstacle avoidance. The RNA has an active rolling tip (ART) that can steer itself into the desired travel direction. A speech interface using a Bluetooth headset is used for human-device interaction. The ART (the bottom image in Fig. 1) is an assembly of a servo motor, an encoder, a gear head, and an electromagnetic clutch that is connected to a ball-bearing-supported rolling tip by a flexible coupling. A servo controller and custom microcontroller boards are used to control the motor and the clutch, respectively. Both controller boards are connected to the UP Board's USB ports.

III. NAVIGATION SYSTEM AND NOTATION
In this paper, a new VIO method is developed to estimate the IMU's pose change by incorporating the IMU's measurements, the pose changes estimated by the SR4000-based VO [26], and geometric features (i.e., planes) extracted from the camera's point cloud data. The VO estimates the pose change between two keyframes. The first data frame of the camera is taken as the first keyframe. A subsequent keyframe is defined as one with a substantial translation (>0.1 m) or rotation (>2°) from the previous keyframe.
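The keyframe selection rule can be sketched as follows (a minimal numpy illustration of the 0.1 m / 2° thresholds; the function name is ours, not part of the RNA software):

```python
import numpy as np

def is_new_keyframe(T_prev, T_curr, trans_thresh=0.1, rot_thresh_deg=2.0):
    """Decide whether the current pose warrants a new keyframe.

    T_prev, T_curr are 4x4 homogeneous transforms of the camera in the
    world frame; thresholds follow the text (0.1 m translation, 2 deg
    rotation from the previous keyframe).
    """
    # Relative motion from the previous keyframe to the current frame.
    T_rel = np.linalg.inv(T_prev) @ T_curr
    translation = np.linalg.norm(T_rel[:3, 3])
    # Rotation angle recovered from the trace of the relative rotation.
    cos_angle = np.clip((np.trace(T_rel[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_angle))
    return translation > trans_thresh or angle_deg > rot_thresh_deg
```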
The ith keyframe contains intensity image i (called image i for simplicity) and depth image i. There are multiple IMU measurements between two keyframes. These inertial measurements are integrated to produce another pose change estimate (PCE). The two PCEs, together with the plane features extracted from the point cloud data, are used to create a factor graph [27] for estimating the IMU's pose in the world coordinate system. The use of the plane features helps to reduce the pose estimation error. The proposed method is detailed in sections III and IV. The coordinate systems of the IMU and the camera, denoted X_s Y_s Z_s and X_c Y_c Z_c respectively, are defined in Fig. 1. The world coordinate system X_w Y_w Z_w is defined as the IMU's coordinate system at the first keyframe. Throughout this paper, a matrix is represented by an uppercase letter and a vector by a bold lowercase letter. The right superscript of a letter represents the coordinate system in which the variable is expressed, whereas a left superscript indicates the type of the variable. For example, the camera's transformation matrix from keyframe i to keyframe j is denoted T_ij^c. Similarly, the transformation matrix from X_w Y_w Z_w to X_s Y_s Z_s at keyframe i is denoted T_i^ws. We use Riemannian geometry notation to describe the pose of a rigid body. Let the IMU's pose at keyframe i be denoted ξ_i^w = [(ω_i^w)^T, (p_i^w)^T]^T. We use ξ_ij^s = [(ω_ij^s)^T, (p_ij^s)^T]^T to denote the IMU's pose change from keyframe i to keyframe j and T_ij^s to denote the corresponding transformation matrix. The relation between ω_i^w and the rotation matrix R_i^ws ∈ SO(3) is determined by the exponential/logarithm map [25]. Specifically, R_i^ws = exp((ω_i^w)^∧), where (ω_i^w)^∧ ∈ so(3) is the corresponding Lie algebra element of ω_i^w; the hat operator (·)^∧ maps a vector to a skew-symmetric matrix in so(3), and the vee operator (·)^∨ denotes the inverse operation.
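The hat/vee operators and the exponential/logarithm map between so(3) and SO(3) can be sketched as follows (a standard numpy rendering via Rodrigues' formula; function names are ours):

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix in so(3) (the ^ operator)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def vee(W):
    """Inverse of hat: recover the 3-vector from a skew-symmetric matrix."""
    return np.array([W[2, 1], W[0, 2], W[1, 0]])

def exp_so3(w):
    """Exponential map so(3) -> SO(3) (Rodrigues' formula)."""
    theta = np.linalg.norm(w)
    if theta < 1e-10:
        return np.eye(3) + hat(w)           # first-order approximation
    W = hat(w / theta)
    return np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * (W @ W)

def log_so3(R):
    """Logarithm map SO(3) -> so(3), returned as a rotation vector."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-10:
        return vee(R - np.eye(3))
    return vee(theta / (2.0 * np.sin(theta)) * (R - R.T))
```

The two maps invert each other for rotation angles below π, which is all that arises between consecutive keyframes.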
The IMU's pose ξ_j^w at keyframe j can be computed from T_i^ws and the pose change T_ij^s. v_i^w ∈ R^3 is the IMU's linear velocity, and ^g b_i ∈ R^3 and ^a b_i ∈ R^3 are the biases of the gyroscope and the accelerometer, respectively. At keyframe i, K planes are detected and described in X_s Y_s Z_s as l_ik^s = [(n_ik^s)^T, d_ik^s]^T (k = 1, . . . , K), where n_ik^s and d_ik^s are the plane's normal vector and the distance from the origin of X_s Y_s Z_s to the plane, respectively. The representation of the plane in X_w Y_w Z_w is given by l_ik^w = g(T_i^ws, l_ik^s), where g(·) rotates the plane's normal into the target frame and offsets the plane's distance by the projection of the translation onto the normal.
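The world-frame representation of a plane follows from a rotation of its normal and an offset of its distance. A minimal numpy sketch, assuming the convention that a sensor-frame point maps to the world frame as x_w = R x_s + p (the function name is ours):

```python
import numpy as np

def plane_to_world(R, p, n_s, d_s):
    """Transform plane parameters (n_s, d_s) from the sensor frame to the
    world frame, where (R, p) map a sensor-frame point x_s to the world
    frame as x_w = R @ x_s + p.

    Substituting x_s = R.T @ (x_w - p) into n_s . x_s = d_s yields the
    world-frame normal and distance below.
    """
    n_w = R @ n_s              # rotate the normal
    d_w = d_s + n_w @ p        # offset by the translation along the normal
    return n_w, d_w
```

For example, a plane one meter in front of a sensor that itself sits two meters along the world Z axis ends up three meters from the world origin.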

IV. FACTOR GRAPH FORMULATION
In this paper, we use a factor graph to model the SLAM problem. A factor graph [27], [28] is a bipartite graph G = (Θ, F, E) consisting of nodes and edges. There are two types of nodes: variable nodes and factor nodes. A variable node represents the random variables to be estimated, and a factor node encodes a measurement model defined by a probability distribution function (PDF) of these variables. We denote the set of variables up to M keyframes by Θ: a variable node θ_j ∈ Θ represents the variables to be estimated; a factor node f_i ∈ F represents the measurements; and an edge ε_ij ∈ E represents the connection/relation between a factor node and a variable node. Graph G defines a function given by F(Θ) = ∏_i f_i(Θ_i), where each factor f_i is a function of Θ_i and Θ_i = {θ_j} is the set of variables involved in factor f_i. The dependence relationships between the nodes are encoded by the edges ε_ij. Using a probabilistic representation for f_i and assuming Gaussian measurement models, f_i(Θ_i) ∝ exp(−(1/2) e_i^T Σ_i^{−1} e_i) with e_i = h_i(Θ_i) − z_i. Here, h_i(·) is a measurement prediction function and z_i is the actual measurement.
Σ_i is the covariance matrix for e_i. The optimal value Θ* that maximizes F(Θ) is given by Θ* = argmax_Θ F(Θ) = argmin_Θ Σ_i e_i^T Σ_i^{−1} e_i. This nonlinear least-squares (LS) problem cannot be solved directly; it requires a numerical solution that starts with an initial estimate Θ̃ = {θ̃_i | i = 1, . . . , M} and successively approximates the optimal solution. The Levenberg-Marquardt (LM) algorithm [29] is employed to solve the problem. The LM method iterates the following steps until the optimum is found: 1) approximate the residual e_i around θ̃_i by e_i(θ̃_i + δθ_i) ≈ e_i(θ̃_i) + J_i δθ_i, where J_i is the Jacobian of e_i(θ_i) computed at θ̃_i; 2) linearize F(Θ) by using the approximated e_i and find the step δΘ* that minimizes F(Θ); 3) update Θ̃ by Θ̃ = Θ̃ + δΘ*. The iterative process terminates when δΘ* approaches 0, and the resulting Θ̃ is taken as the optimal solution.
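The three LM steps can be sketched as follows (a generic toy implementation applied to a curve-fitting example, not the graph optimizer used in the paper; all names are ours):

```python
import numpy as np

def levenberg_marquardt(residual_fn, jacobian_fn, theta0, max_iters=100, tol=1e-10):
    """Minimal Levenberg-Marquardt loop for a sum-of-squares cost.

    Implements the three steps in the text: linearize e about the current
    estimate, solve the damped normal equations for the step, and update
    the estimate, stopping when the step approaches zero.
    """
    theta = np.asarray(theta0, dtype=float)
    lam = 1e-3  # damping: large -> gradient descent, small -> Gauss-Newton
    for _ in range(max_iters):
        e = residual_fn(theta)
        J = jacobian_fn(theta)
        step = np.linalg.solve(J.T @ J + lam * np.eye(len(theta)), -J.T @ e)
        if np.linalg.norm(step) < tol:
            break
        if np.sum(residual_fn(theta + step) ** 2) < np.sum(e ** 2):
            theta, lam = theta + step, lam * 0.5   # accept step
        else:
            lam *= 10.0                            # reject step, damp harder
    return theta

# Toy example: recover (a, b) of y = a * exp(b * x) from noise-free samples.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)
res = lambda th: th[0] * np.exp(th[1] * x) - y
jac = lambda th: np.stack([np.exp(th[1] * x), th[0] * x * np.exp(th[1] * x)], axis=1)
theta_hat = levenberg_marquardt(res, jac, [1.0, 0.0])
```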
In the graph, the pose, velocity and IMU bias for a keyframe are represented by a pose node, a velocity node and a bias node, respectively. The prior factor node determines the initial pose. An IMU factor node provides a constraint between two consecutive states by connecting the neighboring state nodes. Assume that the VO-estimated PC between ξ_2^w and ξ_3^w is rejected by the PCC method and that two planes observed at keyframe 2 are tracked onto keyframe 3. Accordingly, two VO factor nodes (one between ξ_1^w and ξ_2^w and the other between ξ_3^w and ξ_4^w) and two plane nodes (l_21^w and l_22^w) between ξ_2^w and ξ_3^w are added to the graph.
There are four types of variable nodes and four types of factor nodes. The four variable nodes are related to the variables for the pose, velocity, IMU bias, and planes. For keyframe i with K planes observed, these variable nodes are denoted by ξ_i^w, v_i^w, b_i^s, and l_ik^w (k = 1, . . . , K), respectively. Our goal is to estimate the values of these variables that minimize the cost function of the graph. The prior factor node reflects the prior knowledge of the device's initial pose, and it fixes the first pose node in the graph. Each subsequent pose node is added by using the IMU preintegration as the PCE. The other three factor nodes represent the measurements obtained from IMU preintegration, VO, and plane fitting. An IMU preintegration factor constrains the estimates of the pose change, velocity change and IMU bias drift; therefore, it connects the two neighboring pose nodes, velocity nodes, and IMU bias nodes. A VO factor constrains the PCE between two pose nodes and therefore connects them. A factor graph example that consists of four keyframes (i = 1, . . . , 4) and two planes (l_21^w and l_22^w) observed at keyframe 2 is shown in Fig. 2. In this paper, we propose to evaluate the accuracy of the VO-computed PCE by checking the consistency of the plane observations (see section V). If a PCE fails the consistency check, no VO factor node is created. Instead, a plane factor is added to the graph to constrain the PCE. A plane factor involves a plane node and a pose node at which the plane is observed. By connecting each variable node to one or more factor nodes, the graph structure encodes a joint probability density function F(Θ) of these variables. The LS problem (6) is given by

Θ* = argmin_Θ { (^o r)^2 + Σ_ij (^c r_ij)^2 + Σ_i (^s r_{i,i+1})^2 + Σ_ik (^l r_ik)^2 },

where (^o r)^2, (^c r_ij)^2, (^s r_{i,i+1})^2, and (^l r_ik)^2 are the squared Mahalanobis distances related to the factors of the prior, VO, IMU and plane measurements, respectively.
The computation of (^o r)^2 for the prior factor and (^s r_{i,i+1})^2 for the IMU factors F_I follows Forster's VIO method [25]. The error functions and covariance matrices related to the VO factors F_O and the plane factors F_L are detailed as follows.
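The joint cost encoded by the graph, a sum of squared Mahalanobis distances over the prior, VO, IMU, and plane factors, can be sketched with a generic factor container (a toy illustration, not the paper's implementation; class and function names are ours):

```python
import numpy as np

class Factor:
    """One measurement term of the graph: residual e = h(theta) - z with
    covariance Sigma, contributing e^T Sigma^{-1} e to the total cost."""

    def __init__(self, var_ids, h, z, Sigma):
        self.var_ids = var_ids                  # variable nodes this factor connects
        self.h = h                              # measurement prediction function
        self.z = np.asarray(z, dtype=float)     # actual measurement
        self.Sigma_inv = np.linalg.inv(Sigma)   # information matrix

    def cost(self, variables):
        e = self.h(*[variables[i] for i in self.var_ids]) - self.z
        return float(e @ self.Sigma_inv @ e)    # squared Mahalanobis distance

def total_cost(factors, variables):
    """Negative log-likelihood of the graph (up to a constant): the sum of
    squared Mahalanobis distances over all prior/VO/IMU/plane factors."""
    return sum(f.cost(variables) for f in factors)
```

Minimizing `total_cost` over the variable values is exactly the LS problem (6); a production system would delegate this to a factor graph library rather than the sketch above.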

A. VISUAL ODOMETRY FACTOR F O
The VO estimates the PC between keyframes i and j by extracting visual features from image i and tracking them onto image j. SIFT features [30] are used in this work. Note that we use the keypoints only, to avoid the computation of the descriptors and keep the computational cost low. Given the IMU-measured PC, T̃_ij^c, a space correspondence search is employed to find the matched features between the two images, and the RANSAC method [26] is used to identify the inliers, from which the PCE is computed. Finally, we follow the method in [31] and employ Bundle Adjustment (BA) to refine the PCE T_ij^c and compute its covariance Σ_ij^c. From our experiments, we observed that the PCE produced by BA could be inaccurate even when it was computed from a good number of inliers. Fig. 3 shows a case with 22 inliers. The point cloud data of the two keyframes are registered by using the estimated PC. The misalignment of the wall surfaces (as well as the ground surfaces) indicates an inaccurate PCE. In this work, we propose to use the associated planes between the two keyframes to evaluate the accuracy of the BA-estimated PC and determine the acceptance/rejection of the VO-computed PCE for graph construction. The proposed VO method is illustrated in Algorithm 1. The method computes [T_ij^c, Σ_ij^c] (the PCE and its covariance matrix) and L (the associated plane pairs) between keyframes i and j. If it returns 0 (i.e., the method successfully produces an accurate PCE), then [T_ij^c, Σ_ij^c] is used to construct the VO factor in the graph and L is not used to create plane factor nodes. Otherwise, L is used to add plane nodes and construct plane factors in the graph, but no VO factor is created. The PlaneDetection function extracts planes from keyframe j (after subtracting the point cloud data of the tracked planes) by using the clustering approach in [32]. It is noted that the algorithm returns 1 (failure) if the inlier number N is smaller than 12.
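The correspondence-plus-RANSAC stage of Algorithm 1, including the N ≥ 12 inlier rule, can be sketched generically as follows (a numpy illustration using a Kabsch/Umeyama fit on matched 3D points, not the authors' SIFT/BA implementation; all names and thresholds other than the minimum of 12 inliers are ours):

```python
import numpy as np

def rigid_transform(P, Q):
    """Least-squares rigid transform (R, t) aligning Nx3 points P to Q,
    i.e. Q ~ R @ P + t (Kabsch/Umeyama)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                  # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflections
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t

def ransac_pose(P, Q, iters=200, inlier_thresh=0.02, min_inliers=12, seed=0):
    """RANSAC over minimal 3-point samples; returns (R, t, inlier_mask) or
    None if fewer than min_inliers support the best model (the N >= 12 rule)."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(iters):
        idx = rng.choice(len(P), size=3, replace=False)
        R, t = rigid_transform(P[idx], Q[idx])
        err = np.linalg.norm((P @ R.T + t) - Q, axis=1)
        mask = err < inlier_thresh
        if best is None or mask.sum() > best.sum():
            best = mask
    if best is None or best.sum() < min_inliers:
        return None                            # too few inliers: VO failure
    R, t = rigid_transform(P[best], Q[best])   # refit on all inliers
    return R, t, best
```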
This is because the SR4000-based VO requires at least 12 matched points to compute an accurate PCE [16]. The rest of the algorithm is self-explanatory. R and t represent the rotation matrix and the translation vector, respectively. The squared Mahalanobis distance for the visual odometry factor can be computed as (^c r_ij)^2 = e_ij^T (Σ_ij^c)^{−1} e_ij, where e_ij is the residual of the VO-estimated pose change. For each keyframe, the clustering approach in [33] was used to extract planes from the SR4000's depth data. The k-th plane detected at keyframe i is denoted by l_ik^c = [(n_ik^c)^T, d_ik^c]^T.
The covariance matrix Σ_ik^c of l_ik^c can be computed by forward-propagating the covariances of the points on the plane [20]. This method requires the numerical computation of the Jacobians of the plane's parameters (n_ik^c, d_ik^c) with respect to each point of the plane and is thus time-consuming, or even infeasible, in the case of a large number of points. To tackle this issue, we generate a linear plane function based on the inverse depth parameterization [34] and then analytically compute Σ_ik^c from that function by using the theory of linear regression [35]. The plane equation

n_x x + n_y y + n_z z = d   (8)

is nonlinear in terms of the plane parameters [n^T, d]^T = [n_x, n_y, n_z, d]^T, where n_z = sqrt(1 − n_x^2 − n_y^2). Dividing (8) by dz, it can be written in terms of the normalized coordinates u = x/z and v = y/z and the inverse depth ρ = 1/z as

β_x u + β_y v + β_z = ρ,   (9)

where β = [β_x, β_y, β_z]^T = n/d. Stacking (9) over the M points of the plane, we have

ρ = Aβ,   (10)

where the m-th row of A is [u_m, v_m, 1] and ρ = [ρ_1, . . . , ρ_M]^T. Equation (10) is linear in β. The plane-fitting problem is to minimize the squared residual |Aβ − ρ|^2. According to linear regression theory [35], the optimal estimate β̂ and its covariance matrix C(β̂) are given by β̂ = (A^T A)^{−1} A^T ρ and C(β̂) = σ^2 (A^T A)^{−1}, where σ^2 is the variance of the plane-fitting error. An unbiased estimate of σ^2 is σ̂^2 = |Aβ̂ − ρ|^2 / (M − 3), where 3 is the dimensionality of the plane parameter β. The plane's parameters can then be computed by d = 1/|β̂| and n = β̂/|β̂|. The plane's covariance matrix is computed by forward propagation and is given by C(l_ik^c) = J C(β̂) J^T, where J is the Jacobian of (n, d) with respect to β̂. Therefore, the covariance matrix of the k-th plane at keyframe i is given by Σ_ik^c = C(l_ik^c). Since the factor graph optimization is performed in the variables' tangent spaces, we transform Σ_ik^c to its tangent space by using the basis matrix B_{n_ik^c}; the computation of B_{n_ik^c} is given in Appendix A. The actual measurement of the plane's parameters and its covariance matrix in X_s Y_s Z_s can be computed by l_ik^s = g(T^cs, l_ik^c) and S_ik^s = J_sc S_ik^c J_sc^T, where J_sc is the Jacobian of the frame transformation g(·). The predicted measurement of the k-th plane at keyframe i in X_s Y_s Z_s is given by l̃_ik^s = g(T̃_i^ws, l̃_ik^w), where T̃_i^ws is the transformation matrix from X_w Y_w Z_w to X_s Y_s Z_s and l̃_ik^w is the estimated k-th plane in X_w Y_w Z_w.
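The linear plane fit and the analytic covariance propagation can be sketched as follows (a numpy illustration of the inverse-depth linear regression described in the text; the explicit Jacobian of (n, d) with respect to β is our derivation, and the function name is ours):

```python
import numpy as np

def fit_plane_inverse_depth(points):
    """Fit the plane n.x = d to Nx3 points via the linear inverse-depth
    parameterization beta = n/d:
        beta_x*(x/z) + beta_y*(y/z) + beta_z = 1/z,
    an ordinary linear regression in the normalized coordinates (u, v)."""
    P = np.asarray(points, dtype=float)
    u, v, rho = P[:, 0] / P[:, 2], P[:, 1] / P[:, 2], 1.0 / P[:, 2]
    A = np.column_stack([u, v, np.ones(len(P))])   # regression design matrix
    AtA_inv = np.linalg.inv(A.T @ A)
    beta = AtA_inv @ A.T @ rho                     # least-squares estimate
    resid = A @ beta - rho
    sigma2 = resid @ resid / (len(P) - 3)          # unbiased variance estimate
    C_beta = sigma2 * AtA_inv                      # covariance of beta
    norm_b = np.linalg.norm(beta)
    n, d = beta / norm_b, 1.0 / norm_b             # n = beta/|beta|, d = 1/|beta|
    # Forward-propagate C(beta) to the plane parameters (n, d):
    # dn/dbeta = (I - n n^T)/|beta|, dd/dbeta = -beta^T/|beta|^3.
    J = np.vstack([(np.eye(3) - np.outer(n, n)) / norm_b,
                   -beta[None, :] / norm_b**3])
    C_plane = J @ C_beta @ J.T                     # 4x4 covariance of (n, d)
    return n, d, C_plane
```

Because the model is linear in β, the covariance comes out of the regression in closed form, with no per-point numerical Jacobians.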
In fact, l̃_ik^w inherits the plane parameters from the previous step (i.e., l̃_ik^w = l_{i−1,k}^w). Using the geodesic distance [36], the residual vector of the plane factor is defined by

e_ik = [ρ_ik^T, d_ik^s − d̃_ik^s]^T,

where ρ_ik = B_{n_ik^s}^T (θ/sin θ)(ñ_ik^s − n_ik^s cos θ) with θ = cos^{−1}((n_ik^s)^T ñ_ik^s). Then, the Mahalanobis distance for the plane factor can be computed as (^l r_ik)^2 = e_ik^T (S_ik^s)^{−1} e_ik.
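The geodesic residual of a plane factor can be sketched as follows (a numpy illustration; the tangent-basis helper follows Appendix A, with the auxiliary axis chosen for numerical robustness, and all function names are ours):

```python
import numpy as np

def tangent_basis(n):
    """3x2 basis B_n of the tangent plane of the unit sphere at n
    (Appendix A); the auxiliary axis is picked least aligned with n so
    that the cross products stay well-conditioned."""
    a = np.zeros(3)
    a[np.argmin(np.abs(n))] = 1.0
    b1 = np.cross(n, a)
    b1 /= np.linalg.norm(b1)
    b2 = np.cross(n, b1)
    b2 /= np.linalg.norm(b2)
    return np.stack([b1, b2], axis=1)

def plane_residual(n_pred, d_pred, n_meas, d_meas):
    """3-vector plane residual: geodesic error of the normal expressed in
    the tangent basis at n_meas, stacked with the distance error."""
    c = float(np.clip(n_meas @ n_pred, -1.0, 1.0))
    theta = np.arccos(c)                       # angle between the normals
    if theta < 1e-9:
        rho = np.zeros(2)                      # normals already aligned
    else:
        B = tangent_basis(n_meas)
        rho = B.T @ ((theta / np.sin(theta)) * (n_pred - n_meas * c))
    return np.concatenate([rho, [d_pred - d_meas]])
```

The norm of the first two components equals the geodesic angle θ, so a perfectly predicted plane yields a zero residual.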

V. PLANE ASSOCIATION AND CONSISTENCY CHECK

A. PLANE ASSOCIATION
Assuming T̃_ij^c (the camera's PC measured by the IMU) is accurate, l_ik^c (the k-th plane observed at keyframe i) can be tracked onto keyframe j as l̃_jk^c = [(ñ_jk^c)^T, d̃_jk^c]^T = g(T̃_ij^c, l_ik^c) to speed up the search for its associated plane l_jk^c at keyframe j. The distance between each data point of keyframe j and l̃_jk^c is computed. Data points with a distance below 2σ_ik are treated as candidate data points of the plane l_jk^c (σ_ik is the plane-fitting error for l_ik^c) because plane l_jk^c is in the neighborhood of l̃_jk^c. In this work, a RANSAC plane-fitting process is used to extract the plane l_jk^c from the candidate data points. After the RANSAC process, l_jk^c is extended by adding the data points satisfying the distance criterion. Finally, the plane's parameters are recomputed by following the process in section III.B. If the number of data points of l_jk^c is above a threshold, ⟨l_ik^c, l_jk^c⟩ is taken as a pair of associated/tracked planes. The covariance matrix of the predicted plane is computed by forward propagation:

S̃_jk^c = J_l̃ξ Σ_ξ J_l̃ξ^T + J_l̃k S_ik^c J_l̃k^T,   (18)

where Σ_ξ is the covariance matrix of the IMU-measured PC, and J_l̃ξ and J_l̃k are the Jacobians of ñ_jk^c with respect to ξ_ij^c and n_ik^c, respectively. The computations of J_l̃ξ and J_l̃k are given in Appendix B. The consistency between the predicted plane l̃_jk^c and the associated plane l_jk^c is evaluated based on the error vector defined in equation (17). When θ (the angle between the normal vectors ñ_jk^c and n_jk^c) is small, the difference between them can be approximated by η_k ≈ B_{n_jk^c}^T ñ_jk^c, and the vector of the distance error for the plane pair ⟨l_ik^c, l_jk^c⟩ is computed by

e_k = [η_k^T, d̃_jk^c − d_jk^c]^T.   (19)

The covariance matrix of e_k is computed by

C(e_k) = J_el̃ S̃_jk^c J_el̃^T + J_el S_jk^c J_el^T,   (20)

where J_el̃ and J_el are the Jacobians of e_k with respect to the parameters of l̃_jk^c and l_jk^c, respectively. The computations of J_el̃ and J_el are given in Appendix B. Then, for the k-th tracked plane pair, the Mahalanobis distance is computed by r_k^2 = e_k^T C(e_k)^{−1} e_k. If K tracked plane pairs are found between keyframes i and j, the PCC is evaluated by using the maximum value of r_k^2, i.e., E_r = max(r_k^2), k = 1, . . . , K.
If E_r < χ²_{3,0.9} = 6.25, the VO's PCE is accepted; otherwise, it is rejected.
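The PCC decision can be sketched as follows (a numpy illustration; the function name is ours, and the 6.25 threshold is the chi-square 90th percentile with 3 degrees of freedom used in the text):

```python
import numpy as np

def plane_consistency_check(residuals, covariances, threshold=6.25):
    """Accept the VO pose-change estimate only if the worst tracked plane
    pair is still consistent: E_r = max_k e_k^T C_k^{-1} e_k must stay
    below the chi-square threshold (chi2_{3,0.9} = 6.25)."""
    r2 = [float(e @ np.linalg.inv(C) @ e)      # squared Mahalanobis distance
          for e, C in zip(residuals, covariances)]
    E_r = max(r2)
    return E_r < threshold, E_r
```

Taking the maximum over the plane pairs makes the test conservative: a single inconsistent plane is enough to reject the VO output.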

VI. EXPERIMENTS
We used the RNA to collect data to validate the efficacy of the proposed SLAM method. Seven datasets were obtained from the human subject experiments we carried out in indoor environments with lobbies, hallways and/or stairways in two buildings on campus. These datasets were acquired from the sensors by using the onboard computer (UP Board) and processed offline by a desktop computer. When collecting the data, the subject swung the RNA and walked at a normal walking speed (average speed: ∼0.6 m/s) to imitate the way a blind person uses a white cane. In each of the experiments, the subject walked along a looped path and returned to the starting point. We use the Endpoint Position Error Norm (EPEN) of the trajectory as the performance metric. The EPENs (both the absolute value and the value as a percentage of the path length) produced by PAVIO for the seven datasets are tabulated in Table 1, where the results from the state-of-the-art VIO method [25] (using the inlier thresholds N = 12, N = 20, and N = 30) and VINS-Fusion [37] (a representative visual-depth camera based SLAM method) are compared. To demonstrate the significance of the proposed PCC, we also ran PAVIO with the PCC function disabled on the same datasets. The use of the percentage EPEN allows us to compute the statistical results (the mean and standard deviation of EPEN over the seven datasets) of each method and compare their overall performances (the last row of Table 1).
Since VINS-Fusion diverged for datasets 3-7, only its results on datasets 1 and 2 are included. From Table 1, it can be observed that: (1) PAVIO outperforms the other methods on each dataset; (2) PAVIO has a much better overall performance (mean EPEN: 2.47%) in pose estimation than the VIO method (mean EPEN: 3.75%) and VINS-Fusion (mean EPEN: 8.1%); (3) the use of the PCC greatly improves the pose estimation accuracy (reducing the EPEN by more than 24.9% on average); (4) PAVIO has better repeatability in pose estimation (standard deviation of EPEN: 1.03%) than VIO (standard deviation: 1.74%), meaning that it is more robust to variations in the operating environment. As shown in Table 1, the VIO's performance may be improved by adjusting the value of N. However, PAVIO still exhibits a better overall performance (in terms of pose estimation accuracy and robustness) than VIO. The results in Table 1 show that the VIO's performance can be improved by decreasing the inlier threshold (N value). A smaller N increases the total number of VO outputs and thus the number of VO factor nodes. This may result in a better VIO-estimated pose at the beginning. However, it decreases the overall accuracy of the VO-computed PCEs and may deteriorate the VIO-computed pose at some points. It is not possible to find an appropriate N a priori because the VO's accuracy is affected not only by N but also by the accuracy of the visual features' depth data (the larger the depth, the lower the accuracy). The PAVIO method addresses this problem by using a relatively small threshold (N = 12) and employing the PCC to evaluate the VO-computed PCE online and reject the inaccurate PCE values. VINS-Fusion has the worst pose estimation accuracy because its visual feature tracking method and outlier removal method are the least accurate. Fig. 4 shows the trajectories generated by these methods over the point cloud map built by using the PAVIO-estimated poses for each experiment.
It can be seen that the trajectories estimated by PAVIO are more accurate.

VII. CONCLUSION
We have presented a new visual-inertial odometry (VIO) method for pose estimation of a robotic navigation aid (RNA) that uses a 3D time-of-flight camera and an IMU for assistive navigation. The method, called plane-aided visual-inertial odometry (PAVIO), extracts planes from the 3D point cloud of the camera's current frame and tracks them onto the next camera frame by using the IMU's inertial measurements. The tracking result is used to evaluate the pose change estimate (PCE) of the visual odometry (VO) and accept the PCE only if it is accurate. The accepted VO outputs, the information of the extracted planes, and the IMU measurements over time are used to create a factor graph. By optimizing the graph, the method improves the accuracy in estimating the IMU biases and reduces the pose estimation error of the RNA. Experimental results with the RNA demonstrate that the proposed method outperforms the state-of-the-art VIO method in pose estimation in terms of both accuracy and repeatability. The proposed method can be used to produce a more accurate 3D map of the operating environment for object/obstacle detection and to locate a visually impaired traveler in the environment for assistive navigation. It can also be used for 6-DOF pose estimation in any 3D-camera-based visual-inertial navigation system.

APPENDIX A: COMPUTATION OF B_n
The set of unit normal vectors forms a smooth manifold: the 2-dimensional sphere S². A normal vector n = [n_x, n_y, n_z]^T is a unit vector and represents a point on the surface of S² ⊂ R³. Each point in S² corresponds to the normal vector of a plane. The tangent space T_n(S²) at the point corresponding to normal vector n is composed of the 3D vectors φ: T_n(S²) = {φ ∈ R³ | n^T φ = 0}. It is a hyperplane in R³ and lies in the null space of the normal vector n. A basis that spans the tangent space T_n(S²) is denoted (b_1, b_2). It is computed as follows: b_1 = b̃_1/|b̃_1| and b_2 = b̃_2/|b̃_2|, where b̃_1 = n × a, b̃_2 = n × b_1, and vector a is set to [1, 0, 0]^T, [0, 1, 0]^T or [0, 0, 1]^T depending on the elements of n. The matrix B_n = [b_1 | b_2] is, therefore, the basis matrix for the normal vector n.
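A direct numpy rendering of this construction (the function name is ours; we pick the auxiliary axis a least aligned with n so that the cross product n × a stays well-conditioned):

```python
import numpy as np

def basis_B_n(n):
    """Compute B_n = [b1 | b2], a basis of the tangent space T_n(S^2).

    Follows Appendix A: b1 = n x a and b2 = n x b1, both normalized. The
    axis a is chosen as the coordinate axis least aligned with n, which
    keeps the cross product away from degeneracy.
    """
    n = np.asarray(n, dtype=float)
    a = np.zeros(3)
    a[np.argmin(np.abs(n))] = 1.0     # auxiliary axis least aligned with n
    b1 = np.cross(n, a)
    b1 /= np.linalg.norm(b1)
    b2 = np.cross(n, b1)
    b2 /= np.linalg.norm(b2)
    return np.column_stack([b1, b2])
```

By construction B_n^T n = 0 and B_n^T B_n = I, so B_n^T maps small perturbations of the normal to minimal 2-DOF tangent-space coordinates, which is how the plane covariances are expressed in the factor graph.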