VISUAL motion registration is a key technology for many applications since the sensors are inexpensive and provide high information bandwidth. We are interested in using it to construct maps and maintain precise position estimates for mobile robot platforms indoors and outdoors, in extended environments with loops of more than 10 km, and in the absence of global signals such as GPS. This is a classic simultaneous localization and mapping (SLAM) problem. In a typical application, we gather images at modest frame rates, and extract hundreds of features in each frame for estimating frame-to-frame motion. Over the course of even 100 m, moving at 1 m/s, we can have a thousand images and half a million features. The best estimate of the frame poses and feature positions, even for this short trajectory, is a large nonlinear optimization problem. In previous research using laser rangefinders, one approach to this problem was to perform frame-to-frame matching of the laser scans and keep only the constraints among the frames, rather than attempting to directly estimate the position of each scan reading (feature) [18], [28], [30].

Using frames, instead of features, reduces the nonlinear system by a large factor, but still poses a problem as frames accumulate over extended trajectories. To efficiently solve large systems, we reduce the size of the system by keeping only a selected subset of the frames, the *skeleton.* Most importantly, and contrary to almost all current research in SLAM, the skeleton system consists of nonlinear constraints. This property helps it to maintain accuracy even under severe reductions, up to several orders of magnitude smaller than the original system. Fig. 1 shows an example from an urban round-about scene. The original system has 700 stereo frames and over 100K 3-D features. A skeleton graph at 5 m intervals eliminates the features in favor of a small number of frame–frame links, and reduces the number of frames by almost an order of magnitude. The full nonlinear skeleton can be solved in 35 ms.

In this paper, we present frameSLAM, a complete visual mapping method that uses the skeleton graph as its map representation. Core techniques implemented in the system include:

- precise, real-time visual odometry for incremental motion estimation;

- nonlinear least-squares estimation for local registration and loop closure, resulting in accurate maps;

- constant space per area map representation. The skeleton graph is used for data association as well as map estimation. Continuous traversal of an area does not cause the map to inflate; and

- experimental validation. We perform small- and large-scale experiments, in indoor, urban, and rough-terrain settings, to validate the method and to show real-time behavior.

### A. Related Work

There has been a lot of recent work in visual SLAM, most of which concentrates on global registration of 3-D features (undelayed SLAM). One approach, corresponding to classical extended Kalman filter (EKF) SLAM, is to use a large EKF containing all of the features [7], [8], [36]; another, corresponding to fastSLAM, is to use a set of particles representing trajectories and to attach a small EKF to each feature [10], [29]. EKF SLAM has obvious problems in representing larger environments, because the size of the EKF filter grows beyond real-time computation. Some recent works have split the large filter into submaps [34], which can then deal with environments on the order of 100 m, with some indications of real-time behavior.

A scale problem also afflicts fastSLAM approaches: it is unclear how many particles are necessary for a given environment, and computational complexity grows with particle set size. Additionally, the 6 DOF nature of visual SLAM makes it difficult for fastSLAM approaches to close loops. For these reasons, none of these approaches has been tried in the type of large-scale, rough-terrain geometry that we present here.

A further problem with feature-based systems is that they lose the power of consensus geometry to give precise estimates of motion and to reject false positives. The visual odometry (VO) backbone that underlies frameSLAM is capable of errors of less than 1% over many hundreds of meters [26], which has not yet been matched by global feature-based systems.

In delayed SLAM, camera frames are kept as part of the system. Several systems rely on this framework [13], [21], [22]. The incremental smoothing and mapping (iSAM) filter approach [22] uses an information filter for the whole system, including frames and features. A nice factorization method allows fast incremental update of the filter. While this approach works for modest-size maps (∼1000 features), other techniques must be used for the large numbers of features found in visual SLAM.

The delayed filter of [13] and [21] keeps *only* the frames—visual feature matching between frames induces constraints on the frames. These constraints are maintained as a large, sparse information filter, and used to reconstruct underwater imagery over scales of 200–300 m. It differs from our work in using a large linear filter instead of a reduced skeleton of nonlinear constraints: the incremental cost of update grows linearly with map size, and is not proposed as a real-time approach.

Since these techniques rely on linear systems, they could encounter problems when closing large loops, where the linearization would have to change significantly.

A very recent paper by Steder *et al.* [37] and an earlier work by Unnikrishnan and Kelly [41] take a very similar approach to skeleton systems. They also keep a constraint network of relative pose information between frames, based on stereo VO, and solve it using nonlinear least-squares (NLSQ) methods. However, their motion is basically restricted to 4 DOF, and the matching takes place on near-planar surfaces with downward-looking cameras, rather than the significantly more difficult forward-looking case.

Another related research area is place recognition for long-range loop closure. Recently, several new systems have emerged that promise real-time recovery of candidate place matches over very large databases [5], [32].

Finally, there has been recent work on delayed-filter frame constraint systems based on an efficient solution of the *nonlinear* maximum likelihood (ML) problem, using relaxation methods to solve the corresponding NLSQ problem. Multilevel relaxation [15] uses a pyramid of reduced systems to generate a full solution to an approximate problem, then passes that solution down to finer levels to limit oscillations and increase convergence rates. Stochastic gradient descent [33] is another relaxation method where oscillation is limited by using a learning rate parameter. Convergence is sensitive to the parameterization of the constraints among frames; a particularly efficient version of this method uses a tree parameterization [17].

The cornerstone of these methods is an *incremental* approach to solving the NLSQ problem. As new data are added to the system, the NLSQ problem is expanded, and a small amount of time is spent pushing it toward the optimal solution. In closing large loops, the system error spikes, but is brought down with each succeeding input. One problem is that these methods deal with 2-D datasets; recently, Grisetti *et al.* [16] have shown how to extend their method to 3-D rotation angles. We exploit these methods to solve the NLSQ problem incrementally; the skeleton system reduces the problem sufficiently to allow real-time operation.

It is important to note that the skeleton system is more general than the relaxation methods for NLSQ, since the latter are restricted to 3-D constraints between frames. In contrast, skeleton systems deal with arbitrary constraints, including those among multiple frames, or between frames and features. In this paper, we will develop the general formulation of skeleton systems, and then, apply them to the particular case of visual SLAM using stereo, where the reduction to frame–frame constraints allows the use of incremental NLSQ solutions.

## II. Skeleton Systems

We are interested in localizing and mapping using just stereo cameras over large distances. There are three tasks to be addressed.

*Local registration:* The system must keep track of its position locally, and build a registered local map of the environment.

*Long-range tracking:* The system must compute reasonable long-range trajectories with low error.

*Global registration:* The system must be able to recognize previously visited areas, and reuse local map information.

As the backbone for our system, we use *VO* to determine incremental motion of stereo cameras. The principle of VO is to simultaneously determine the pose of camera frames and position of world features by matching the image projections of the features (see Fig. 2), a well-known technique in computational vision. Our research in this area has yielded algorithms that are accurate to within several meters over many kilometers, when aided by an inertial measurement unit (IMU) [1], [2], [26].

In this paper, we concentrate on solving the first and third tasks. VO provides very good incremental pose results, but like any odometry technique, it will drift because of the composition of errors. To stay registered in a local area, or to close a large loop, requires recognition of previous frames, and the ability to integrate current and previous frames into a consistent global structure. Our approach is to consider the large system consisting of all camera frames and features as a Bayes net, and to produce reduced versions—skeletons—that are easily solvable but still accurate.

### A. Skeleton Frames

Let us consider the problem of closing a large loop, which is at the heart of the SLAM technique. This loop could contain thousands of frames and millions of features; for example, one of our outdoor sets has 65K stereo pairs, each of which generates 1000 point features or more. We cannot consider closing this loop in a reasonable time; eventually, we have to reduce the size of the system when dealing with large-scale consistency. At the same time, we want to be able to keep locally dense information for precise navigation. Recognition of this problem has led to the development of *submaps*, small local maps that are strung together to form a larger system [3], [34]. Our system of reductions has a similar spirit, but with the following key differences.

- No explicit submaps need to be built; instead, *skeleton frames* form a reduced graph structure on top of the system.

- The skeleton structure can scale to allow optimization of any portion of the system.

- The skeleton frames support constraints that do not fully specify a transformation between frames. Such constraints arise naturally in the case of bearing-only landmarks, where the distance to the landmark is unknown.

Since there are many more features than frames, we want to reduce feature constraints to frame constraints, and after this, to reduce further the number of frames, while still maintaining fidelity to the original system. The general idea for computing a skeleton system is to convert a subset of the constraints into an appropriate single constraint. Consider Fig. 3, which shows a Bayes net representation of image constraints. When a feature *q*_{j} is seen from a camera frame *c*_{i}, it generates a measurement *z*_{ij}, indicated by the arrows from the variables. Now take the subsystem of two measurements *z*_{00} and *z*_{10}, circled in Fig. 3(a). These generate a multivariate Gaussian probability density function (PDF) over *c*_{0}, *c*_{1}, and *q*_{0}. We can reduce this PDF by marginalizing *q*_{0}, leaving a PDF over *c*_{0} and *c*_{1} alone. This PDF corresponds to a *synthetic constraint* between *c*_{0} and *c*_{1}, which is represented in Fig. 3(b) by the first three nodes. In a similar manner, a PDF involving *c*_{0}–*c*_{3} can be reduced, via marginalization, to a PDF over just *c*_{0} and *c*_{3}, as in Fig. 3(c).

It is clear that the last system is much simpler than the first. As with any reduction, there is a tradeoff between simplicity and accuracy. The reduced system will be close to the original system, as long as the frame variables are not moved too far from their original positions. When they are, the marginalization that produced the reduction may no longer be a good approximation to the original PDF. For this reason, the form of the new constraint is very important. If it is tied to the global position of the frames, then it will become unusable if the variables are moved from their original global position, say, in closing a loop. But, if the constraint uses relative positions, then the frames can be moved anywhere, as long as their relative positions are not displaced too much. The key technique introduced in this paper is the derivation of reduced relative constraints that are an accurate approximation of the original system.

A reduced system can be much easier to solve than the original one. The original system in Fig. 1 has over 100K (vector) variables, and our programs run out of space trying to solve it, while its reduced form has just 133 vector variables, and can be solved in 35 ms.

## III. Nonlinear Least-Squares Estimation

The most natural way to solve large estimation problems is NLSQ. There are several reasons why NLSQ is a convenient and efficient method. First, it offers an easy way to express image constraints and their uncertainty, and directly relates them to image measurements. Second, NLSQ has a natural probabilistic interpretation in terms of Gaussian multinormal distributions, and thus, the Bayes net introduced in the previous section can be interpreted and solved using NLSQ methods. This connection also points the way to reduction methods, via the theoretically sound process of marginalizing out variables. Finally, by staying within a nonlinear system, it is possible to avoid problems of premature linearization, which are especially important in loop closure.

These properties of NLSQ have, of course, been exploited in previous work, especially in structure-from-motion theory of computer vision (sparse bundle adjustment (SBA) [40]) from which this research draws inspiration. In this section, we describe the basics of the measurement model, NLSQ minimization, variable reduction by marginalization, and the “lifting” of linear to nonlinear constraints.

### A. Frames and Features

Referring to Fig. 2, there are two types of system variables: camera frames *c*_{i} and 3-D features *q*_{j}. Features are parameterized by their 3-D position; frames by their position and Euler roll, pitch, and yaw angles:

$$q_j = [x_j, y_j, z_j]^{\top} \qquad (1)$$

$$c_i = [x_i, y_i, z_i, \phi_i, \psi_i, \theta_i]^{\top}. \qquad (2)$$

Features project onto a camera frame via the projection equations. Each camera frame *c*_{i} describes a point transformation from world to camera coordinates as a rotation and translation. If *q*_{w} is a point in world coordinates, then the corresponding point in camera frame *i* is *q*_{i} = *R*_{i} *q*_{w} + *t*_{i}, with *R*_{i} ≡ *R*(φ_{i}, ψ_{i}, θ_{i})^{⊺} and *t*_{i} ≡ −*R*_{i}[*x*_{i}, *y*_{i}, *z*_{i}]^{⊺}. We abbreviate the 3 × 4 matrix [*R*_{i} | *t*_{i}] as *T*_{i}.

The projection [*u*, *v*]^{⊺} onto the image is given by

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \propto K\, T_i \begin{bmatrix} q_w \\ 1 \end{bmatrix} \qquad (3)$$

where *K* is the 3 × 3 camera intrinsic matrix [19], and *T*_{i}[*q*_{w}; 1] = *q*_{i} is in the coordinate system of *c*_{i}.

For stereo, (3) describes projection onto the reference camera, which, by convention, we take to be the left one. A similar equation for [*u*′, *v*′]^{⊺} holds for the right camera, with *t*′_{i} = *t*_{i} + [*B*, 0, 0]^{⊺} as the translation part of the projection matrix.^{1} We will label the image projection [*u*, *v*, *u*′, *v*′]^{⊺} from *c*_{i} and *q*_{j} as *z*_{ij}.
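As a concrete sketch of (3) and the stereo convention just described, the following projects a world point into both images of a rectified pair. The helper `project_stereo` and its argument layout are illustrative assumptions, not code from the paper:

```python
import numpy as np

def project_stereo(K, R, t, B, q_w):
    """Stereo projection z_ij = [u, v, u', v'] of a world point q_w, per (3).

    K: 3x3 intrinsic matrix; (R, t): world-to-camera rotation and translation
    of the reference (left) camera; B: stereo baseline. Hypothetical helper.
    """
    q_c = R @ q_w + t                      # q_i = R_i q_w + t_i, camera coords
    u, v, w = K @ q_c                      # homogeneous left projection
    # Right camera: translation t' = t + [B, 0, 0]^T (see footnote 1)
    ur, vr, wr = K @ (q_c + np.array([B, 0.0, 0.0]))
    return np.array([u / w, v / w, ur / wr, vr / wr])
```

For a point directly in front of an axis-aligned camera, the left and right projections differ only by the disparity induced by the baseline.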

### B. Measurement Model

For a given frame *c*_{i} and feature *q*_{j}, the expected projection is given by

$$z_{ij} = h(x_{ij}) + v_{ij} \qquad (4)$$

where *v*_{ij} is Gaussian noise with covariance *W*^{−1}_{ij}, and for convenience, we let *x*_{ij} stand for *c*_{i}, *q*_{j}. Here, the measurement function *h* is the projection function of (3). Typically, *W*^{−1} is diagonal, with a standard deviation of a pixel.

For an actual measurement z̄_{ij}, the induced error or cost is

$$\Delta z(x_{ij}) = \bar z_{ij} - h(x_{ij}) \qquad (5)$$

and the weighted square cost is

$$\Delta z(x_{ij})^{\top} W_{ij}\, \Delta z(x_{ij}). \qquad (6)$$

We will refer to the weighted square cost as a *constraint.* Note that the PDF associated with a constraint is

$$p(z \mid x_{ij}) \propto \exp\left(-{\textstyle{1 \over 2}}\, \Delta z(x_{ij})^{\top} W_{ij}\, \Delta z(x_{ij})\right). \qquad (7)$$

Although only frame-feature constraints are presented, there is nothing to prevent other types of constraints from being introduced. For example, regular odometry would induce a constraint Δ *z*(*c*_{i},*c*_{j}) between two frames.
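Because a constraint is just a weighted squared error, any measurement with a predicted value and an inverse covariance fits the same mold. A minimal sketch, with `constraint_cost` as a hypothetical helper name:

```python
import numpy as np

def constraint_cost(z_bar, h_x, W):
    """Weighted square cost (6) of one constraint.

    z_bar: actual measurement; h_x: predicted measurement h(x);
    W: inverse measurement covariance. exp(-cost / 2) is then the
    unnormalized Gaussian PDF associated with the constraint, as in (7).
    """
    dz = z_bar - h_x                       # error, eq. (5)
    return float(dz @ W @ dz)
```

The same call works whether `h_x` comes from the stereo projection of a feature or from an odometry prediction between two frames.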

### C. Nonlinear Least Squares

The optimization problem is to find the best set of model parameters *x*—camera frames and feature positions—to explain vectors of observations z̄—feature projections on the image plane. The NLSQ method estimates the parameters by finding the minimum of the sum of the constraints [sum of squared errors (SSE)]

$$f(x) = \sum_{ij} \Delta z(x_{ij})^{\top} W_{ij}\, \Delta z(x_{ij}). \qquad (8)$$

A more convenient form of *f* eliminates the sum by concatenating each error term into a larger vector. Let Δ*z*(*x*) ≡ [Δ*z*(*x*_{00})^{⊺}, …, Δ*z*(*x*_{mn})^{⊺}]^{⊺} be the full vector of measurements and *W* ≡ diag(*W*_{00}, …, *W*_{mn}) the block-diagonal matrix of all the weights. Then, the SSE equation (8) is equivalent to the matrix equation

$$f(x) = \Delta z(x)^{\top} W \Delta z(x). \qquad (9)$$

Assuming that the measurements are independent under *x*, the matrix form can be interpreted as a multinormal PDF *p*(z̄ | *x*) ∝ exp(−½ *f*(*x*)), and by Bayes' rule, *p*(*x* | z̄) ∝ *p*(z̄ | *x*) *p*(*x*). To maximize the likelihood *p*(*x* | z̄), we minimize the cost function (9) [40].

Since (9) is nonlinear, finding the minimum typically involves reduction to a linear problem in the vicinity of an initial solution. After expanding via Taylor series and differentiating, we get the incremental equation

$$H\, \delta x = -g \qquad (10)$$

where *g* is the gradient and *H* is the Hessian of *f* with respect to *x.* Finally, after getting rid of some second-order terms, we can write (10) in the *Gauss–Newton* form

$$J^{\top} W J\, \delta x = J^{\top} W \Delta z \qquad (11)$$

with *J* the Jacobian ∂*h*/∂*x* and the Hessian *H* approximated by *J*^{⊺}*WJ.*

In the nonlinear case, one starts with an initial value *x*_{0}, and iterates the linear solution until convergence. In the vicinity of a solution, convergence is quadratic. Under the ML interpretation, at convergence *H* is the inverse of the covariance of *x*, that is, the information or precision matrix. We will write *J*^{⊺}*WJ* as Λ to emphasize its role as the information matrix. In the delayed-state SLAM formulation [13], Λ serves as the system filter in a noniterated, incremental version of the SSE problem.
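One Gauss–Newton pass can be sketched as below, with dense matrices and generic `h`/`jac` callables; with Δz = z̄ − h(x) as in (5), each step solves JᵀWJ δx = JᵀW Δz. This is a generic sketch, not the paper's sparse implementation:

```python
import numpy as np

def gauss_newton(h, jac, z_bar, W, x0, iters=10):
    """Minimize the cost (9) by iterating the Gauss-Newton update (10)-(11)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        dz = z_bar - h(x)                  # residual, eq. (5)
        J = jac(x)
        H = J.T @ W @ J                    # approximate Hessian (information matrix)
        x = x + np.linalg.solve(H, J.T @ W @ dz)
    return x
```

For a linear measurement function the loop converges in a single iteration; with the projection function (3) it behaves like a small bundle adjustment.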

### D. Nonlinear Reduction

At this point, we have the machinery to accomplish the reduction shown in Fig. 3, eliminating variables from the constraint net. Consider all the constraints involving the first two frames (Δ*z*(*c*_{i}, *q*_{j}) for *i* = 0, 1). Fig. 4 diagrams the process of reducing this to a single nonlinear constraint Δ*z*(*c*_{0}, *x*_{1}). On the left side, the NLSQ process induces a PDF over the variables *x*, with an ML value x̄. Marginalization of all variables but *x*_{1} gives a PDF over just *x*_{1}, which corresponds to a linear constraint with mean x̄_{1} and information matrix Λ̄_{1}. The “lifting” process relativizes this to a nonlinear constraint. Mathematical justification for several of the steps is given in the Appendix.

The following algorithm specifies the process in more detail.

In step 1, we constrain the system to have *c*_{0} as the origin. Normally, the constraints leave the choice of origin free, but we want all variables relative to *c*_{0}'s frame.

The full system (minus *c*_{0}, which is fixed) is then solved in step 2, generating estimated values x̄ for all variables, as well as the information matrix Λ̄. These two represent a Gaussian multivariate distribution over the variables *x*_{1,n}. Next, in step 3, the distribution is marginalized, getting rid of all variables except *x*_{1}. The reduced matrix Λ̄_{1} represents a PDF for *x*_{1} that summarizes the influence of the other variables.

Step 4 is the most difficult to understand. The information matrix is derived under the condition that *c*_{0} is the origin. But, we need a constraint that will hold when intermixed with other constraints, where *c*_{0} may be nonzero. The final step lifts the result from a fixed *c*_{0} to *any* pose for *c*_{0}. Here is how it works. Steps 2 and 3 produce a mean x̄_{1} and information matrix Λ̄_{1} that together define a PDF for *x*_{1}. This PDF is equivalent to a *synthetic observation* on *x*_{1}, with the linear measurement function *h*(*x*_{1}) = *x*_{1}. Now replace *h*(*x*_{1}) with the relativized function *h*(*c*_{0}, *x*_{1}) = *T*_{0} *x*_{1}, where *T*_{0} transforms *x*_{1} into *c*_{0}'s coordinates. When *c*_{0} = 0, this gives exactly the same PDF as the original *h*(*x*_{1}). And, for any *c*_{0} ≠ 0, we can show that the constraint Δ*z*(*c*_{0}, *x*_{1}) produces the same PDF as the original constraints (see Appendix I).
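In equation form, the lifted constraint of step 4 can be sketched as follows; here x̄₁ and Λ̄₁ denote the mean and information matrix produced by steps 2 and 3, and T₀ maps world coordinates into c₀'s frame:

```latex
% Synthetic observation on x_1, lifted to an arbitrary pose of c_0:
\Delta z(c_0, x_1) = \bar{x}_1 - T_0\, x_1 , \qquad
\hbox{cost} = \Delta z(c_0, x_1)^{\top}\, \bar{\Lambda}_1\, \Delta z(c_0, x_1) .
```

Setting c₀ = 0 makes T₀ the identity, recovering the synthetic observation h(x₁) = x₁.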

What is interesting about Δ*z*(*c*_{0}, *x*_{1}) is its nonlinear nature. It represents a spring connecting *c*_{0} and *x*_{1}, with precisely the right properties to accurately substitute for the larger nonlinear system. The accuracy is affected by how closely the reduced system satisfies two assumptions.

- The displacement between *c*_{0} and *x*_{1} is close to x̄_{1}.

- None of the variables *x*_{2,n} participate in other constraints in the system.

These assumptions are not always fully met, especially the second one. Nonetheless, we will show in experiments with large outdoor systems that even very drastic variable reductions, such as those in Fig. 1, give accurate results.

### E. Marginalization

In step 3, we require the reduction of an information matrix to extract a marginal distribution between two of the vector variables. Just deleting the unwanted variable rows and columns would give the *conditional distribution* *p*(*c*_{0}, *x*_{1} | *x*_{2,n}). This distribution significantly overestimates the confidence in the connection, compared to the marginal, since the uncertainty of the auxiliary variables is disregarded [12]. The correct way to marginalize is to convert the information matrix to its covariance form by inversion, delete all but the entries for *c*_{0} and *x*_{1}, and then, invert back to the information matrix. A shortcut to this procedure is the *Schur complement* [14], [24], [35], which is also useful in solving sparse versions of (9). We start with a block version of this equation, partitioning the variables into frames *c* and features *q*:
$$\begin{bmatrix} H_{11} & H_{12} \\ H_{12}^{\top} & H_{22} \end{bmatrix} \begin{bmatrix} \delta c \\ \delta q \end{bmatrix} = \begin{bmatrix} -J_c^{\top} W_c\, \Delta z(c) \\ -J_q^{\top} W_q\, \Delta z(q) \end{bmatrix}. \qquad (12)$$

Now, we define a reduced form of this equation:

$$\bar H_{11}\, \delta c = -\bar g \qquad (13)$$

with

$$\bar H_{11} \equiv H_{11} - H_{12} H_{22}^{-1} H_{12}^{\top} \qquad (14)$$

$$-\bar g \equiv -J_c^{\top} W_c\, \Delta z(c) + H_{12} H_{22}^{-1} J_q^{\top} W_q\, \Delta z(q). \qquad (15)$$

Here, the matrix equation (13) involves only the variables *c.* There are two cases where this reduction is useful.

*Constraint reduction:* The reduced Hessian *H̄*_{11} is the information matrix for the variables *c*, with variables *q* marginalized out. Equation (14) gives a direct way to compute the marginalized Hessian in step 3 of constraint reduction.

*Visual odometry:* For the typical situation of many features and few frames, the reduced equation offers an enormous savings in computing NLSQ, with the caveat that *H*_{22} must be easily invertible. Fortunately, *H*_{22} is block-diagonal, since features only have constraints with frames, and thus, the Jacobian *J* in *J*^{⊺}*WJ* affects just *H*_{12}.
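Both cases rest on (14) and (15). A dense numpy sketch of the Schur-complement reduction follows (`marginalize` is a hypothetical helper; a real implementation exploits the block-diagonal structure of *H*_{22} rather than inverting it densely):

```python
import numpy as np

def marginalize(H, b, keep, drop):
    """Marginalize variables `drop` out of H x = b via the Schur complement.

    Returns (H_bar, b_bar) over the `keep` indices, per eqs. (14)-(15).
    """
    H11 = H[np.ix_(keep, keep)]
    H12 = H[np.ix_(keep, drop)]
    H22 = H[np.ix_(drop, drop)]
    H22_inv = np.linalg.inv(H22)           # cheap when H22 is (block-)diagonal
    H_bar = H11 - H12 @ H22_inv @ H12.T    # eq. (14)
    b_bar = b[keep] - H12 @ H22_inv @ b[drop]   # eq. (15)
    return H_bar, b_bar
```

The reduced Hessian equals the inverse of the kept block of the full covariance, which is exactly the invert–delete–invert procedure described above, and solving the reduced system reproduces the full solution restricted to the kept variables.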

## IV. Data Association and Landmarks

The raw material for constraints comes from data association between image features. We have implemented a method for matching features across two camera frames that serves a dual purpose. First, it enables incremental estimation of camera motion for tracking trajectories (VO). Second, on returning to an area, we match the current frame against previous frames that serve as landmarks for the area. These landmark frames are simply the skeleton system that is constructed as the robot explores an area—a reduced set of frames, connected to each other by constraints. Note that landmark frames are *not* the same as feature landmarks normally used in nondelayed EKF treatments of visual SLAM (VSLAM). Features are only represented implicitly by their projections onto camera frames. Global and local registration is done purely by matching images between frames and generating frame–frame constraints from the match.

### A. Matching Frames

Consider the problem of matching stereo frames that are close spatially. Our goal is to precisely determine the motion between the two frames, based on image feature matches. Even for incremental frames, rapid motion can cause the same feature to appear at widely differing places in two successive images; the problem is even worse for wide-baseline matching. The images of Fig. 5 show some difficult examples from urban and off-road terrain. Note the significant shift in feature positions.

One important aspect of matching is using image features that are stable across changes of scale and rotation. While scale-invariant feature transform (SIFT) [27] and speeded up robust features (SURF) [20] are the features of choice, they are not suitable for real-time implementations (15 Hz or greater). Instead, we use a novel multiscale center-surround feature called CenSure [26]. In previous research, we have shown that this operator has the requisite stability properties, but is just slightly more expensive than the Harris operator.

In any image feature matching scheme, there will be false matches. In the worst cases of long-baseline matching, sometimes only 10% of the matches are good ones. We use the following robust matching algorithm to find the best estimate, taking advantage of the geometric constraints imposed by rigid motion.

Three matched points give a motion hypothesis between the frames [19]. The hypothesis is checked for inliers by projecting all feature points onto the two frames (3). Features that are within 2 pixels of their projected position are considered to be inliers—note that they must project correctly in all four images. The best hypothesis (most inliers) is chosen and optimized using the NLSQ technique of Section III-E. Some examples are shown in Fig. 5 with matched and inlier points indicated.

This algorithm is used for both incremental tracking and wide-baseline matching, with different search parameters for finding matching features. Note that we do not assume any motion model or other external aids in matching.
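The consensus loop can be sketched as follows. This is a simplified, hypothetical rendering: it forms each hypothesis from three matched points by SVD-based rigid alignment, scores inliers by 3-D distance (the paper scores by the 2-pixel reprojection test in all four images), and re-fits the winner on its inliers in place of the full NLSQ refinement:

```python
import numpy as np

def align_rigid(p, q):
    """Rigid motion (R, t) with q ~= R p + t from matched 3-D points (Kabsch)."""
    pc, qc = p.mean(axis=0), q.mean(axis=0)
    U, _, Vt = np.linalg.svd((q - qc).T @ (p - pc))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt
    return R, qc - R @ pc

def consensus_match(p, q, trials=300, tol=0.05, rng=None):
    """Best 3-point motion hypothesis by inlier count (RANSAC-style)."""
    rng = np.random.default_rng(rng)
    best_inl = np.zeros(len(p), dtype=bool)
    for _ in range(trials):
        idx = rng.choice(len(p), 3, replace=False)
        R, t = align_rigid(p[idx], q[idx])
        inl = np.linalg.norm(q - (p @ R.T + t), axis=1) < tol
        if inl.sum() > best_inl.sum():
            best_inl = inl
    if best_inl.sum() < 3:
        return None, best_inl
    # refine the winning hypothesis on all of its inliers
    R, t = align_rigid(p[best_inl], q[best_inl])
    inl = np.linalg.norm(q - (p @ R.T + t), axis=1) < tol
    return (R, t), inl
```

Even with a handful of gross outlier correspondences, the geometric consistency of the inlier set pins down the motion hypothesis.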

### B. Visual Odometry

Consensus matching is the input to a VO process for estimating incremental camera motion, and for computing reduced skeleton frames. To make incremental motion estimation more precise, we incorporate several additional methods.

*Key frames:* If the estimated distance between two frames is small, and the number of inliers is high, we discard the new frame. The remaining frames are called key frames. Typical distances for key frames are about 0.1–0.5 m, depending on the environment.

*Incremental bundle adjustment:* A sliding window of about 12 key frames (and their features) is input to the NLSQ optimization. Using a small window significantly increases the accuracy of the estimation [11], [26], [31], [39].

*IMU data:* If an IMU is available, it can decrease the angular drift of VO, especially tilt and roll, which are referenced to gravity normal [26], [38]. We show results both with and without IMU assistance.

To construct the skeleton, we start with the sliding bundle adjustment window, which can be computed very efficiently using the marginalization technique of Section III-E [11], [31]. The sliding window contains at least one frame *c*_{i} that is marked as a skeleton frame. When a new key frame *c*_{k} is added to the window, the oldest frame in the window is dropped, and the Hessian is recomputed. The marginalization operation eliminates the feature variables from the Hessian, leaving just constraints between the camera frames. If the most recent frame is a sufficient distance from the most recent skeleton frame in the window, it is added as a skeleton frame by further reducing the Hessian to contain just *c*_{i} and *c*_{k}, and then lifting the constraint.

IMU measurements can easily be added to the NLSQ estimation. The following equations describe IMU measurements of gravity normal and yaw angle increments:

$$g_i = h_g(c_i) \qquad (16)$$

$$\Delta \psi_{i-1,i} = h_{\Delta\psi}(c_{i-1}, c_i). \qquad (17)$$

The function *h*_{g}(*c*) returns the deviation of the frame *c* in pitch and roll from gravity normal. *h*_{Δψ}(*c*_{i−1}, *c*_{i}) is just the yaw angle difference between the two frames. Using accelerometer data acting as an inclinometer, with a very high noise level to account for unknown accelerations, is sufficient for (16) to constrain roll and pitch angles and completely avoid drift. For yaw angles, only a good IMU can increase the accuracy of VO estimates. In the experiments section, we show results both with and without IMU aiding.

### C. Wide Baseline Matching

To register a new frame against a landmark frame, we use the same consensus-matching technique as for VO. This procedure has several advantages.

*Sensitivity over scale:* The CenSure features are scale independent, and so, are stable when there has been significant viewpoint change.

*Efficiency:* The CenSure features and stereo are already computed for VO, so wide baseline matching just involves steps 3–5 of the consensus match algorithm. This can be done at over 30 Hz.

*Specificity:* The geometric consistency check almost guarantees that there will be no false positives, even using a very low inlier threshold.

Fig. 5 shows an example of wide baseline matching in the upper urban scene. The distance between the two frames is about 5 m, and there are distractors such as cars and pedestrians. The buildings are very self-similar, so the geometric consistency check is very important in weeding out bad matches. In this scene, there are about 800 features per image, and only 100 inliers for the best estimate match.

To analyze sensitivity and selectivity, we computed the inlier score for every possible cross-frame match in the 700 frame urban sequence shown in Fig. 1. Fig. 6 shows the results, by number of inliers. Along the diagonal, matching occurs for several frames, to an average of 10 m along straight stretches. The only off-diagonal matching occurs at the loop closure. The lower scores on closure reflect the sideways offset of the vehicle from its original trajectory. Consensus matching produced essentially perfect results for this dataset, giving no false positives, and correctly identifying loop closures.

We also looked at off-road images over very large trajectories(Section V-B). The challenging part of these datasets is the large amount of similarly textured areas: grass, trees, etc., with no distinguishing man-made objects (see Fig. 5, bottom). Nevertheless, specificity was perfect, with no false positives across tens of thousands of images. Sensitivity was not as good; it was sometimes difficult to match scenes when the viewpoints were separated by 10 m or more. Partly this results from our use of normalized cross correlation to obtain matches; partly it is the difficulty of finding the correct match in similarly textured areas.

While we did not observe any false positives in any dataset we considered, we did not do any rigorous experiments to verify this performance. Geometric consistency is such a strong constraint that we suspect that it is almost impossible to find situations in which it will produce false positives, given a threshold of at least 30 inliers.

### D. Place Recognition

We implement a simple but effective scheme for recognizing places that have already been visited, using the consensus match just presented. This scheme is not intended to do “kidnapped” recognition, matching an image against a large database of places [5], [32]. Instead, it functions when the system has a good idea of where it is relative to the landmark frames that have been accumulated. In a local area, for example, the system always stays well registered, and has to search only a small number of frames for matches. Over larger loops, the method is linear in the size of the area it must search.

The main problem that arises is which landmark frames should serve as candidates to match to the current frame. Ideally, we would use the relative covariance estimates computed from the Bayes net as a gate, that is, Δ *x*^{⊺} *W* Δ *x* < *d*, where Δ *x* is the distance between the current frame *c*_{n} and a candidate landmark *c*_{i} [9]. However, computing relative covariance involves marginalizing all variables in the system Hessian, and is too expensive to be done online [12]. Instead, we use the skeleton to provide an approximate covariance that is conservative. In the skeleton net, we find the shortest path from *c*_{n} to *c*_{i}, and then compose the incremental covariances along the path to get an estimate of the relative covariance. As in [3], an efficient search of the net (*O*(*n* log *n*)) can be done using Dijkstra's method [37]. In practice, this method needs several milliseconds on the largest graphs we have attempted; for very large spaces, a hierarchical skeleton would be appropriate to keep computational costs reasonable.
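A sketch of this search, under our own assumptions (a dictionary-of-edges graph, a trace-ordered priority queue, and first-order covariance composition by addition), might look like:

```python
# Sketch (not the authors' implementation) of the conservative covariance
# estimate: Dijkstra over the skeleton graph, composing incremental
# covariances along the shortest path, followed by a Mahalanobis gate.
import heapq
import numpy as np

def relative_covariance(graph, start):
    """graph[u] = list of (v, cov_uv); returns composed covariance to each node.

    Edge 'length' is the trace of the composed covariance, so the search
    prefers the most certain path -- a conservative approximation.
    """
    best = {start: np.zeros((3, 3))}
    heap = [(0.0, start)]
    while heap:
        cost, u = heapq.heappop(heap)
        if cost > np.trace(best[u]):
            continue                      # stale queue entry
        for v, cov in graph[u]:
            cand = best[u] + cov          # first-order composition
            if v not in best or np.trace(cand) < np.trace(best[v]):
                best[v] = cand
                heapq.heappush(heap, (np.trace(cand), v))
    return best

# Three-frame chain 0 -- 1 -- 2, each link with 0.1 m^2 isotropic covariance.
link = 0.1 * np.eye(3)
graph = {0: [(1, link)], 1: [(0, link), (2, link)], 2: [(1, link)]}
covs = relative_covariance(graph, 0)

dx = np.array([0.5, 0.0, 0.0])            # displacement to candidate frame
W = np.linalg.inv(covs[2])                # information matrix of the estimate
print(dx @ W @ dx < 9.0)                  # True: candidate passes the gate
```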

One property we would like the skeleton to have is that the space it consumes be proportional to the area visited. Under continual movement in an area, the skeleton will otherwise continue to grow unless frames are removed. Our method here is to marginalize out any frame that is within a distance *d* and angle *a* of an existing skeleton frame. As an example, we show an indoor dataset consisting of 8490 keyframes in a small 12 m × 12 m area (see Fig. 7). The skeleton graph was produced using a landmark distance of 2 m and an angle of 10°, reducing the graph to 272 frames with 23 cross-frame links. The final skeleton graph is solved by NLSQ in 65 ms.
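The pruning rule might be implemented along these lines; the planar `(x, y, theta)` pose, the `should_keep` helper, and the 2 m / 10° defaults mirror the example above but are otherwise our own simplification:

```python
# Sketch of the skeleton-pruning rule: a new frame is marginalized out if
# some existing skeleton frame is within distance d and angle a of it, so
# skeleton size tracks the area visited rather than the time spent.
import math

def should_keep(pose, skeleton, d=2.0, a=math.radians(10)):
    """pose = (x, y, theta); keep only if no skeleton frame is nearby."""
    x, y, th = pose
    for sx, sy, sth in skeleton:
        dist = math.hypot(x - sx, y - sy)
        dang = abs((th - sth + math.pi) % (2 * math.pi) - math.pi)
        if dist < d and dang < a:
            return False          # redundant: marginalize this frame out
    return True

skeleton = [(0.0, 0.0, 0.0)]
print(should_keep((0.5, 0.0, 0.05), skeleton))   # False: within 2 m and 10 deg
print(should_keep((3.0, 0.0, 0.0), skeleton))    # True: far enough away
```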

### E. Real-Time Implementation

Our VO system has been implemented and runs at 15 Hz on 512 × 384 resolution stereo images using a 2 GHz processor, and is in habitual use in demo runs on an outdoor robot [1], [2], [26]. At this point, we have implemented the rest of the frameSLAM system only for processing logs, for which we report timing results here; we are transitioning the system to an outdoor robot, and will report full system results soon.

The main strategy for adding registration matches is to use a dual-core machine: run VO on one core, and wide-baseline matching and skeleton computations on the other. Wide-baseline matching is done on an "anytime" basis: we match the current keyframe against a set of candidates that pass the Mahalanobis gate, until the next keyframe comes in, at which point we start matching that one instead. Whenever a good match is found, the system performs an NLSQ update, either on the whole skeleton or on a smaller area around the current keyframe, depending on the situation.
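Schematically, and ignoring the dual-core threading, the anytime budget could look like the following sketch; `anytime_match`, the 50 ms budget, and the toy matcher are illustrative assumptions:

```python
# Simplified single-threaded stand-in for the "anytime" matching loop:
# try gated candidates until one matches or the per-frame budget expires.
import time

def anytime_match(keyframe, candidates, match_fn, budget_s=0.05):
    """Try candidates in order until one matches or the time budget runs out."""
    deadline = time.monotonic() + budget_s
    for cand in candidates:
        if time.monotonic() >= deadline:
            break                     # next keyframe is due; give up for now
        if match_fn(keyframe, cand):
            return cand               # good match: trigger an NLSQ update
    return None

# Toy matcher: frames "match" when their ids agree.
found = anytime_match(7, [3, 5, 7, 9], lambda k, c: k == c)
print(found)   # 7
```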

Of course, this strategy does not address the significant problems of frame storage and retrieval for very large systems, as done by recent place-recognition algorithms [5], [32]. It may also miss some matches, since it does not explore all possible candidates. But, for skeletons of less than 10K frames, where the frames can be kept in RAM, it works well.

For efficient calculation of the NLSQ updates, we use a preconditioned conjugate gradient algorithm (PCG) that has been shown to have good computational properties—in many cases, the complexity increases linearly with the size of the skeleton [23]. For the large outdoor dataset, Fig. 8 plots the PCG time against the size of the skeleton. Note that these are for a very large loop closure of a combined 10 km trajectory—typically only a small local area needs to be optimized.
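For reference, a generic Jacobi-preconditioned CG iteration of the kind referred to here is sketched below; this is textbook PCG, not the implementation of [23]:

```python
# Minimal preconditioned conjugate gradient solver (Jacobi preconditioner)
# for a symmetric positive-definite system A x = b, of the type arising
# in the NLSQ updates.
import numpy as np

def pcg(A, b, tol=1e-8, max_iter=100):
    """Solve A x = b for symmetric positive-definite A."""
    M_inv = 1.0 / np.diag(A)          # Jacobi preconditioner
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = pcg(A, b)
print(np.allclose(A @ x, b))   # True
```

The appeal of PCG here is that each iteration needs only matrix–vector products, so sparse skeleton Hessians never have to be factored.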

## Experiments

It is important to test VSLAM systems on data gathered from real-world platforms. It is especially important to test under realistic conditions: narrow field of view (FOV) cameras, full 3-D motion, and fast movement, as these present the hardest challenge for matching and motion estimation. For this research, we used three widely different sources of stereo frames.

- An indoor sequence consisting of 22K frames in a small area, moving slowly (courtesy of Sim [10]). The stereo system had a wide FOV and narrow baseline, and the motion was purely planar.
- An outdoor automobile sequence, the Versailles Rond dataset (courtesy of Comport [4]). This dataset has 700 frames with fast motion, 1 m baseline, narrow FOV, covering about 400 m.
- Two outdoor rough-terrain sequences of about 5 km each from the Crusher project [6]. Baseline is 0.5 m, narrow FOV, and fast, full 3-D motion with lots of bouncing on rough terrain. These datasets offer a unique opportunity, for two reasons. First, they are autonomous runs through the same waypoints; they overlap and cross each other throughout, and end up at the same place for a large loop closure. Second, the datasets are instrumented with both IMU and real-time kinematic (RTK) GPS data, so our frameSLAM results can be computed for both aided and unaided VO, and compared to ground truth.

A certain number of frames, about 2%, cannot be matched by VO in this dataset. We fill in these values with IMU data.

### A. Planar Datasets

The indoor and Versailles Rond datasets were used throughout the paper to illustrate various aspects of the frameSLAM system. Because there is no ground truth, they offer only anecdotal evidence of performance: the indoor dataset illustrates that the skeleton does not grow with continued traversal of a region, and the Versailles Rond dataset shows loop closure over a significant distance. From the plots of Fig. 9, the improvement in fidelity to the environment is apparent. One possible measure of the improvement is the planarity of the trajectories: we fit a plane to all trajectory points, and then measure the RMS error perpendicular to that plane. The following table lists the relevant statistics for the two runs. The timings are for NLSQ optimization of the entire skeleton.
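The planarity measure can be reconstructed as a short sketch; the SVD-based plane fit and synthetic points are our own illustration:

```python
# Fit a plane to trajectory points by SVD, then report the RMS error of
# the points perpendicular to that plane (the planarity measure above).
import numpy as np

def rms_plane_error(points):
    """points: (N, 3) array; returns RMS distance to the best-fit plane."""
    centered = points - points.mean(axis=0)
    # The right singular vector with smallest singular value is the normal.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    return float(np.sqrt(np.mean((centered @ normal) ** 2)))

# Toy check: points on z = 0 with +/-0.1 m perturbation in z.
rng = np.random.default_rng(1)
pts = np.column_stack([rng.uniform(0, 100, 500),
                       rng.uniform(0, 100, 500),
                       rng.choice([-0.1, 0.1], 500)])
print(rms_plane_error(pts))   # close to 0.1
```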

### B. Crusher Datasets

The Crusher data come from two autonomous 5 km runs, which overlap significantly and form a loop. There are 20K keyframes in the first run and 22K in the second. Over 20 million features are found and matched in the keyframe images, and roughly three times that in the total image set. Fig. 10 (top) gives an idea of the results from raw VO on the two runs. There is significant deviation in all dimensions by the end of the loop (circled in red). With a skeleton of frames at 5 m intervals, there were a total of 1978 reduced frames and 169 wide-baseline matches between the runs using consensus matching. These are shown as red links in the top plot.

The middle plot shows the result of applying frameSLAM to a 5 m skeleton. Here, the red trajectory is ground truth for the blue run, and it matches the two runs at the beginning and end of the trajectory (circled on the left). The two runs are now consistent with each other, but still differ from ground truth at the far right end of the trajectory. This is to be expected: the frameSLAM result will only be as good as the underlying odometry when exploring new areas.

If VO is aided by an IMU (Section IV-B), global error is reduced dramatically. The bottom plot shows the frameSLAM result using the aided VO—note that the blue run virtually overlays the RTK GPS ground truth trajectory.

How well does the frameSLAM system reduce errors from open-loop VO? We should not expect any large improvement in long-distance drift at the far point of the trajectories, since SLAM provides no global input that would correct such drift. But we should expect dramatic gains in relative error, that is, between frames that are globally close, since SLAM enforces consistency when it finds correspondences. To show this, we compared the relative pose of every frame pair to ground truth, and plotted the results as a function of distance between the frames. Fig. 11 shows the results for both raw and aided VO. For raw VO (top plot), the open-loop errors are very high, because of the large drift at the end of the trajectories (see Fig. 10, top). With the cross-links enforcing local consistency, frameSLAM gives much smaller errors for short distances, and degrades with distance, a function of yaw angle drift. Note that radical reductions in the size of the skeleton, from 1/4 to 1/400 of the original keyframes, have negligible effect, demonstrating the accuracy of the reduced system.
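The evaluation just described could be sketched as follows; the 2-D positions, `relative_error_by_distance` helper, and 10 m bins are illustrative assumptions:

```python
# Compare the estimated relative displacement of every frame pair against
# ground truth, binning the error by true inter-frame distance.
import numpy as np

def relative_error_by_distance(est, gt, bin_size=10.0):
    """est, gt: (N, 2) planar positions. Returns {bin_start: mean error}."""
    bins = {}
    n = len(est)
    for i in range(n):
        for j in range(i + 1, n):
            d_gt = np.linalg.norm(gt[j] - gt[i])
            d_est = np.linalg.norm(est[j] - est[i])
            key = bin_size * int(d_gt // bin_size)
            bins.setdefault(key, []).append(abs(d_est - d_gt))
    return {k: float(np.mean(v)) for k, v in sorted(bins.items())}

# Toy trajectory with 1% scale drift in the estimate: relative error
# grows linearly with the distance between the frames.
gt = np.column_stack([np.arange(0, 50, 5.0), np.zeros(10)])
est = 1.01 * gt
print(relative_error_by_distance(est, gt))
```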

A similar story exists for IMU-aided VO. Here, the errors are much smaller because of the smaller drift of VO. But, the same gains in accuracy occur for small frame distances, and again, there is almost no effect from severe skeleton reductions until after 300 m.

## Conclusion

We have described frameSLAM, a system for visual SLAM that is capable of precise, real-time estimation of motion, and that is also able to keep track of local registration and global consistency. The key component is a skeleton system of visual frames that act both as landmarks for registration and as a network of constraints for enforcing consistency. FrameSLAM has been validated through testing in a variety of different environments, including large-scale, challenging off-road datasets of 10 km.

We are currently porting the system to two live robot platforms [6], [25], with the intent of providing completely autonomous off-road navigation using just stereo vision. The VO part of the system has already been proven over a year of testing, but cannot eliminate the long-term drift that accrues over a run. With the implementation of the skeleton graph, we expect to be able to assess the viability of the anytime strategy for global registration presented in Section IV-E.

## Appendix: Nonlinear Constraint “Lifting”

Let $c_0$, $x_1$, and $\mathbf{q}$ be a set of variables with measurement cost function

$$\Delta z^{T} W_i\, \Delta z \tag{18}$$

and measurement vector $\hat z$, with $\Delta z = \hat z - z(c_0, x_1, \mathbf{q})$. For $c_0$ fixed at the origin, let $\hat\Lambda_1$ be the Hessian of the reduced form of (18), according to step 3 of the constraint reduction algorithm. We want to show that the cost function

$$\Delta z'^{T} \hat\Lambda_1\, \Delta z' \tag{19}$$

has approximately the same value at the ML estimate $\hat x_1$, where $z'(c_0, x_1) = T_0\, x_1$ and $\Delta z' = \hat x_1 - z'$. To do this, we show that the likelihood distributions are approximately the same.

The cost function (18) has the joint normal distribution

$$P(\hat z \mid x_1, \mathbf{q}) \propto \exp\left(-\tfrac{1}{2}\, \Delta z^{T} W_i\, \Delta z\right). \tag{20}$$

We want to find the distribution (and covariance) for the variable $x_1$. Let $\mathbf{x} = (x_1, \mathbf{q})$, and let $f(\mathbf{x})$ be the measurement function, so that $\Delta z = \hat z - f(\mathbf{x})$. Expanding $f(\mathbf{x} + \delta\mathbf{x})$ in a Taylor series, the cost function becomes

$$\begin{aligned}
&(\hat z - f(\mathbf{x}))^{T}\, W\, (\hat z - f(\mathbf{x})) &&(21)\\
\simeq\; &(\hat z - f(\mathbf{x}) - J\,\delta\mathbf{x})^{T}\, W\, (\hat z - f(\mathbf{x}) - J\,\delta\mathbf{x}) &&(22)\\
=\; &\delta x_1^{T}\, \hat\Lambda_1\, \delta x_1 - 2\, \Delta z^{T} W J\, \delta\mathbf{x} + \text{const} &&(23)
\end{aligned}$$

where we have used the Schur complement to reduce the first term of the third line. As $\Delta z$ vanishes at the ML estimate $\hat{\mathbf{x}}$, the last form is quadratic in $x_1$, and so is a joint normal distribution over $x_1$. By inspection, the covariance is $\hat\Lambda_1^{-1}$. Hence, the ML distribution is

$$P(x_1 \mid \hat z) \propto \exp\left(-\tfrac{1}{2}\, (\hat x_1 - x_1)^{T}\, \hat\Lambda_1\, (\hat x_1 - x_1)\right). \tag{24}$$

The cost function for this PDF is (19) for $c_0$ fixed at the origin, as required. When $c_0$ is not at the origin, the cost function (18) can be converted to an equivalent function by transforming all variables into $c_0$'s coordinate system. Its value stays the same because the measurements are localized to the positions of $c_0$ and $x_1$; any global measurement, for example a GPS reading, would block the equivalence. Thus, for arbitrary $c_0$, (20) and (24) are approximately equal just when $x_1$ is given in $c_0$'s coordinate system. Since (24) is produced by the cost function (19), we have the approximate equivalence of the two cost functions.
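As a numerical sanity check of this argument, one can verify directly that the Schur-complement reduction preserves the quadratic cost; the dimensions and random data below are arbitrary:

```python
# Verify that marginalizing q from a quadratic cost via the Schur complement
# yields a reduced Hessian whose quadratic form in x1 matches the full cost
# minimized over q.
import numpy as np

rng = np.random.default_rng(2)
J = rng.normal(size=(6, 5))          # Jacobian over (x1 in R^2, q in R^3)
H = J.T @ J + 0.1 * np.eye(5)        # full Hessian, positive definite
H11, H1q = H[:2, :2], H[:2, 2:]
Hq1, Hqq = H[2:, :2], H[2:, 2:]

# Reduced Hessian (step 3 of the constraint reduction): Schur complement.
Lam1 = H11 - H1q @ np.linalg.inv(Hqq) @ Hq1

# Minimize the full quadratic [x1 q] H [x1 q]^T over q for fixed x1 ...
x1 = rng.normal(size=2)
q_star = -np.linalg.solve(Hqq, Hq1 @ x1)
full = np.concatenate([x1, q_star])
cost_full = full @ H @ full
# ... and compare with the reduced quadratic in x1 alone.
cost_reduced = x1 @ Lam1 @ x1
print(np.allclose(cost_full, cost_reduced))   # True
```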