IEEE Xplore At-A-Glance
  • Abstract

FrameSLAM: From Bundle Adjustment to Real-Time Visual Mapping

Many successful indoor mapping techniques employ frame-to-frame matching of laser scans to produce detailed local maps as well as the closing of large loops. In this paper, we propose a framework for applying the same techniques to visual imagery. We match visual frames with large numbers of point features, using classic bundle adjustment techniques from computational vision, but we keep only relative frame pose information (a skeleton). The skeleton is a reduced nonlinear system that is a faithful approximation of the larger system and can be used to solve large loop closures quickly, as well as forming a backbone for data association and local registration. We illustrate the workings of the system with large outdoor datasets (10 km), showing large-scale loop closure and precise localization in real time.



VISUAL motion registration is a key technology for many applications since the sensors are inexpensive and provide high information bandwidth. We are interested in using it to construct maps and maintain precise position estimates for mobile robot platforms indoors and outdoors, in extended environments with loops of more than 10 km, and in the absence of global signals such as GPS. This is a classic simultaneous localization and mapping (SLAM) problem. In a typical application, we gather images at modest frame rates, and extract hundreds of features in each frame for estimating frame-to-frame motion. Over the course of even 100 m, moving at 1 m/s, we can have a thousand images and half a million features. The best estimate of the frame poses and feature positions, even for this short trajectory, is a large nonlinear optimization problem. In previous research using laser rangefinders, one approach to this problem was to perform frame-to-frame matching of the laser scans and keep only the constraints among the frames, rather than attempting to directly estimate the position of each scan reading (feature)[18], [28], [30].

Using frames, instead of features, reduces the nonlinear system by a large factor, but still poses a problem as frames accumulate over extended trajectories. To efficiently solve large systems, we reduce the size of the system by keeping only a selected subset of the frames, the skeleton. Most importantly, and contrary to almost all current research in SLAM, the skeleton system consists of nonlinear constraints. This property helps it to maintain accuracy even under severe reductions, up to several orders of magnitude smaller than the original system. Fig. 1 shows an example from an urban round-about scene. The original system has 700 stereo frames and over 100K 3-D features. A skeleton graph at 5 m intervals eliminates the features in favor of a small number of frame–frame links, and reduces the number of frames by almost an order of magnitude. The full nonlinear skeleton can be solved in 35 ms.

Figure 1
Fig. 1. Skeleton reduction of a 100-m urban scene. Full Bayes graph is 700 frames (blue dots) and ∼100K features (cyan crosses). Frame-feature links are in yellow. Only one in 200 features are shown for display. Original frames are in blue (see inset for a closeup). The 133 reduced frames and their links are in red. The reduced graph is solved in 35 ms.

In this paper, we present frameSLAM, a complete visual mapping method that uses the skeleton graph as its map representation. Core techniques implemented in the system include:

  1. precise, real-time visual odometry for incremental motion estimation;

  2. nonlinear least-squares estimation for local registration and loop closure, resulting in accurate maps;

  3. constant space per area map representation. The skeleton graph is used for data association as well as map estimation. Continuous traversal of an area does not cause the map to inflate; and

  4. experimental validation. We perform small- and large-scale experiments, in indoor, urban, and rough-terrain settings, to validate the method and to show real-time behavior.

A. Related Work

There has been a lot of recent work in visual SLAM, most of which concentrates on global registration of 3-D features (undelayed SLAM). One approach, corresponding to classical extended Kalman filter (EKF) SLAM, is to use a large EKF containing all of the features [7], [8], [36]; another, corresponding to fastSLAM, is to use a set of particles representing trajectories and to attach a small EKF to each feature [10], [29]. EKF SLAM has obvious problems in representing larger environments, because the size of the EKF filter grows beyond real-time computation. Some recent works have split the large filter into submaps [34], which can then deal with environments on the order of 100 m, with some indications of real-time behavior.

A scale problem also afflicts fastSLAM approaches: it is unclear how many particles are necessary for a given environment, and computational complexity grows with particle set size. Additionally, the 6 DOF nature of visual SLAM makes it difficult for fastSLAM approaches to close loops. For these reasons, none of these approaches has been tried in the type of large-scale, rough-terrain geometry that we present here.

A further problem with feature-based systems is that they lose the power of consensus geometry to give precise estimates of motion and to reject false positives. The visual odometry (VO) backbone that underlies frameSLAM is capable of errors of less than 1% over many hundreds of meters [26], which has not yet been matched by global feature-based systems.

In delayed SLAM, camera frames are kept as part of the system. Several systems rely on this framework [13], [21], [22]. The incremental smoothing and mapping (iSAM) filter approach [22] uses an information filter for the whole system, including frames and features. A nice factorization method allows fast incremental update of the filter. While this approach works for modest-size maps (∼1000 features), other techniques must be used for the large numbers of features found in visual SLAM.

The delayed filter of [13] and [21] keeps only the frames—visual feature matching between frames induces constraints on the frames. These constraints are maintained as a large, sparse information filter, and used to reconstruct underwater imagery over scales of 200–300 m. It differs from our work in using a large linear filter instead of a reduced skeleton of nonlinear constraints: the incremental cost of update grows linearly with map size, and is not proposed as a real-time approach.

Since these techniques rely on linear systems, they could encounter problems when closing large loops, where the linearization would have to change significantly.

A very recent paper by Steder et al. [37] and an earlier work by Unnikrishnan and Kelly [41] have a very similar approach to skeleton systems. They also keep a constraint network of relative pose information between frames, based on stereo VO, and solve it using nonlinear least-square (NLSQ) methods. However, their motion is basically restricted to 4 DOF, and the matching takes place on near-planar surfaces with downward-looking cameras, rather than the significantly more difficult forward-looking case.

Another related research area is place recognition for long-range loop closure. Recently, several new systems have emerged that promise real-time recovery of candidate place matches over very large databases [5], [32].

Finally, there has been recent work on delayed-filter frame constraint systems based on an efficient solution of the nonlinear maximum likelihood (ML) problem, using relaxation methods to solve the corresponding NLSQs problem. Multilevel relaxation [15] uses a pyramid of reduced systems to generate a full solution to an approximate problem, then passes that solution down to finer levels to limit oscillations and increase convergence rates. Stochastic gradient descent [33] is another relaxation method where oscillation is limited by using a learning rate parameter. Convergence is sensitive to the parameterization of the constraints among frames; a particularly efficient version of this method uses a tree parameterization [17].

The cornerstone of these methods is an incremental approach to solving the NLSQ problem. As new data are added to the system, the NLSQ problem is expanded and some small amount of time is spent, pushing it toward the optimal solution. In closing large loops, the system error spikes, but is brought down with each succeeding input. One problem is that they deal with 2-D datasets; recently, Grisetti et al. [16] have shown how to extend their method to 3-D rotation angles. We exploit these methods to solve the NLSQ problem incrementally; the skeleton system reduces the problem sufficiently to allow real-time operation.

It is important to note that the skeleton system is more general than the relaxation methods for NLSQ, since the latter are restricted to3-D constraints between frames. In contrast, skeleton systems deal with arbitrary constraints, including those among multiple frames, or between frames and features. In this paper, we will develop the general formulation of skeleton systems, and then, apply them to the particular case of visual slam using stereo, where the reduction to frame–frame constraints allows the use of incremental NLSQ solutions.


Skeleton Systems

We are interested in localizing and mapping using just stereo cameras over large distances. There are three tasks to be addressed.

  1. Local registration: The system must keep track of its position locally, and build a registered local map of the environment.

  2. Long-range tracking: The system must compute reasonable long-range trajectories with low error.

  3. Global registration: The system must be able to recognize previously visited areas, and reuse local map information.

As the backbone for our system, we use VO to determine incremental motion of stereo cameras. The principle of VO is to simultaneously determine the pose of camera frames and position of world features by matching the image projections of the features (see Fig. 2), a well-known technique in computational vision. Our research in this area has yielded algorithms that are accurate to within several meters over many kilometers, when aided by an inertial measurement unit (IMU) [1], [2], [26].

Figure 2
Fig. 2. Stereo frames and 3-D points. VO estimates the pose of the frames and the positions of the 3-D points at the same time, using image projections.

In this paper, we concentrate on solving the first and third tasks. VO provides very good incremental pose results, but like any odometry technique, it will drift because of the composition of errors. To stay registered in a local area, or to close a large loop, requires recognition of previous frames, and the ability to integrate current and previous frames into a consistent global structure. Our approach is to consider the large system consisting of all camera frames and features as a Bayes net, and to produce reduced versions—skeletons—that are easily solvable but still accurate.

A. Skeleton Frames

Let us consider the problem of closing a large loop, which is at the heart of the SLAM technique. This loop could contain thousands of frames and millions of features; for example, one of our outdoor sets has 65K stereo pairs, each of which generates 1000 point features or more. We cannot consider closing this loop in a reasonable time;eventually, we have to reduce the size of the system when dealing with large-scale consistency. At the same time, we want to be able to keep locally dense information for precise navigation. Recognition of this problem has led to the development of submaps, small local maps that are strung together to form a larger system [3], [34]. Our system of reductions has a similar spirit, but with the following key differences.

  1. No explicit submaps need to be built; instead, skeleton frames form a reduced graph structure on the top of the system.

  2. The skeleton structure can scale to allow optimization of any portion of the system.

  3. The skeleton frames support constraints that do not fully specify a transformation between frames. Such constraints arise naturally in the case of bearing-only landmarks, where the distance to the landmark is unknown.

Since there are many more features than frames, we want to reduce feature constraints to frame constraints; and after this, to reduce further the number of frames, while still maintaining fidelity to the original system. The general idea for computing a skeleton system is to convert a subset of the constraints into an appropriate single constraint. Consider Fig. 3, which shows a Bayes net representation of image constraints. When a feature qj is seen from a camera frame ci, it generates a measurement zij, indicated by the arrows from the variables. Now take the subsystem of two measurements z00 and z10, circled in Fig. 3(a). These generate a multivariate Gaussian probability density function (PDF), Formula. We can reduce this PDF by marginalizing q0, leaving Formula. This PDF corresponds to a synthetic constraint between c0 and c1, which is represented in Fig. 3(b) by the first three nodes. In a similar manner, a PDF involving c0c3 can be reduced, via marginalization, to a PDF over just c0 and c3, as in Fig. 3(c).

Figure 3
Fig. 3. Skeleton reduction as a Bayes net. System variables are in black, measurements are in red: each measurement represents one constraint. The initial net (a) contains camera frames ci and features qi that give rise to image points. (b) Most of the features are marginalized, yielding measurements between the frames. (c) Some of the intermediate frames have also been marginalized.

It is clear that the last system is much simpler than the first. As with any reduction, there is a tradeoff between simplicity and accuracy. The reduced system will be close to the original system, as long as the frame variables are not moved too far from their original positions. When they are, the marginalization that produced the reduction may no longer be a good approximation to the original PDF.For this reason, the form of the new constraint is very important. If it is tied to the global position of the frames, then it will become unusable if the variables are moved from their original global position, say, in closing a loop. But, if the constraint uses relative positions, then the frames can be moved anywhere, as long as their relative positions are not displaced too much. The key technique introduced in this paper is the derivation of reduced relative constraints that are an accurate approximation of the original system.

A reduced system can be much easier to solve than the original one. The original system in Fig. 1 has over100K (vector) variables, and our programs run out of space trying to solve it, while its reduced form has just 133 vector variables, and can be solved in 35 ms.


Nonlinear Least-Squares Estimation

The most natural way to solve large estimation problems is NLSQ. There are several reasons why NLSQ is a convenient and efficient method. First, it offers an easy way to express image constraints and their uncertainty, and directly relates them to image measurements. Second, NLSQ has a natural probabilistic interpretation in terms of Gaussian multinormal distributions, and thus, the Bayes net introduced in the previous section can be interpreted and solved using NLSQ methods. This connection also points the way to reduction methods, via the theoretically sound process of marginalizing out variables. Finally, by staying within a nonlinear system, it is possible to avoid problems of premature linearization, which are especially important in loop closure.

These properties of NLSQ have, of course, been exploited in previous work, especially in structure-from-motion theory of computer vision (sparse bundle adjustment (SBA) [40]) from which this research draws inspiration. In this section, we describe the basics of the measurement model, NLSQ minimization, variable reduction by marginalization, and the “lifting” of linear to nonlinear constraints.

A. Frames and Features

Referring to Fig. 2, there are two types of system variables: camera frames ci and 3-D features qj.Features are parameterized by their 3-D position; frames by their position and Euler roll, pitch, and yaw angles:Formula TeX Source $$\eqalignno{q_j &= [x_j, y_j, z_j]^{\ssr T}&\hbox{(1)}\cr c_i &= [x_i, y_i, z_i, \phi_i, \psi_i, \theta_i]^{\ssr T} .&\hbox{(2)}}$$Features project onto a camera frame via the projection equations. Each camera frame ci describes a point transformation from world to camera coordinates as a rotation and translation. If qw is a point in world coordinates, then the corresponding point in camera frame i is qi = Ri qw + ti, with RiRiii)and tiRi[xi,yi,zi]. We abbreviate the 3 × 4 matrixTi as Formula.

The projection [u,v] onto the image is given byFormula TeX Source $$\left[\matrix{ u \cr v \cr 1 }\right] \propto K T_i \left[\matrix{ q_i \cr 1 }\right]\eqno{\hbox{(3)}}$$where K is the 3×3 camera intrinsic matrix [19], andqi is in the coordinate system of ci.

For stereo, (3) describes projection onto the reference camera, which, by convention, we take to be the left one. A similar equation for [u′,v′] holds for the right camera, with ti = ti + [B,0,0] as the translation part of the projection matrix.1 We will label the image projection [u,v,u′,v′] from ci and qj aszij.

B. Measurement Model

For a given frame ci and feature qj, the expected projection is given byFormula TeX Source $$z_{ij} = h(x_{ij}) + v_{ij}\eqno{\hbox{(4)}}$$where vij is Gaussian noise with covariance W−1ij, and for convenience, we let xij stand for ci,qj. Here, the measurement function h is the projection function of (3). Typically, W−1 is diagonal, with a standard deviation of a pixel.

For an actual measurement Formula, the induced error or cost isFormula TeX Source $$\Delta z(x_{ij}) = \bar z_{ij} - h(x_{ij})\eqno{\hbox{(5)}}$$and the weighted square cost isFormula TeX Source $$\Delta z(x_{ij})^{\ssr T} W_{ij} \Delta z(x_{ij}).\eqno{\hbox{(6)}}$$We will refer to the weighted square cost as a constraint. Note that the PDF associated with a constraint isFormula TeX Source $$p(z\vert x_{ij}) \propto \exp\left({1\over 2} \Delta z(x_{ij})^{\ssr T} W_{ij} \Delta z(x_{ij})\right).\eqno{\hbox{(7)}}$$

Although only frame-feature constraints are presented, there is nothing to prevent other types of constraints from being introduced. For example, regular odometry would induce a constraint Δ z(ci,cj) between two frames.

C. Nonlinear Least Squares

The optimization problem is to find the best set of model parametersx—camera frames and feature positions—to explain vectors of observations Formula—feature projections on the image plane. The NLSQs method estimates the parameters by finding the minimum of the sum of the constraints [sum of squared errors (SSE)]Formula TeX Source $$f(x) = \sum_{ij} \Delta z(x_{ij})^{\ssr T} W_{ij}\Delta z(x_{ij}).\eqno{\hbox{(8)}}$$A more convenient form of f eliminates the sum by concatenating each error term into a larger vector. Let Δ z(x) ≡ [Δ z(x00),…,Δ z(xmn)] be the full vector of measurements and W ≡diag(W00,…,Wmn) the block-diagonal matrix of all the weights. Then, the SSE equation (8) is equivalent to the matrix equationFormula TeX Source $$f(x) = \Delta z(x)^{\ssr T} W \Delta z(x).\eqno{\hbox{(9)}}$$Assuming that the measurements are independent under x, the matrix form can be interpreted as a multinormal PDF Formula, and by Bayes' rule, Formula. To maximize the likelihood Formula, we minimize the cost function (9) [40].

Since (9) is nonlinear, finding the minimum typically involves reduction to a linear problem in the vicinity of an initial solution. After expanding via Taylor series and differentiating, we get the incremental equationFormula TeX Source $$H \delta{\bf x} = -g\eqno{\hbox{(10)}}$$where g is the gradient and H is the Hessian of f with respect to x. Finally, after getting rid of some second-order terms, we can write (10) in the Gauss–Newton formFormula TeX Source $$J^\top W J \delta x = -J^\top W \Delta z\eqno{\hbox{(11)}}$$with J the Jacobian ∂ h/∂ x and the Hessian H approximated by JWJ.

In the nonlinear case, one starts with an initial value x0, and iterates the linear solution until convergence. In the vicinity of a solution, convergence is quadratic. Under the ML interpretation, at convergence H is the inverse of the covariance ofx, that is, the information or precision matrix. We will write J WJ as Λ to emphasize its role as the information matrix. In the delayed-state SLAM formulation [13],Λ serves as the system filter in a noniterated, incremental version of the SSE problem.

D. Nonlinear Reduction

At this point, we have the machinery to accomplish the reduction shown in Fig. 3, eliminating variables from the constraint net. Consider all the constraints involving the first two frames (Δ z(ci,qj) for i = 0,1). Fig. 4diagrams the process of reducing this to a single nonlinear constraint Δ z(c0,x1). On the left side, the NSLQ process induces a PDF over the variables x, with an ML value of Formula. Marginalization of all variables but x1 gives a PDF over just x1, which corresponds to the linear constraint Formula. The “lifting” process relativizes this to a nonlinear constraint. Mathematical justification for several of the steps is given in Appendix.

Figure 4
Fig. 4. Reduction process diagram. Nonlinear measurements constraints induce a Gaussian PDF over z, which is converted to a PDF over the system variables x. Reduction gives a smaller PDF over just x1, which corresponds to the linear constraint Formula. Lifting relativizes this to Formula.

The following algorithm specifies the process in more detail.

In step 1, we constrain the system to have c0 as the origin. Normally, the constraints leave the choice of origin free, but we want all variables relative to c0's frame.

The full system (minus c0, which is fixed) is then solved in step2, generating estimated values Formula for all variables, as well as the information matrix Formula. These two represent a Gaussian multivariate distribution over the variables x1,n. Next, in step3, the distribution is marginalized, getting rid of all variables except x1. The reduced matrix Formula represents a PDF forx1 that summarizes the influence of the other variables.

Step 4 is the most difficult to understand. The information matrixFormula is derived under the condition that c0 is the origin. But, we need a constraint that will hold when intermixed with other constraints, where c0 may be nonzero. The final step lifts the result from a fixed c0 to any pose for c0. Here is how it works. Steps 2 and 3 produce a mean Formula and information matrixFormula such that Formula is a PDF for x1. This PDF is equivalent to a synthetic observation on x1, with the linear measurement function h(x1) = x1. Now replace h(x1) with the relativized function h(c0,x1) = T0 x1, where T0 transformsx1 into c0's coordinates. When c0 = 0, this gives exactly the same PDF as the original h(x1). And, for any c0 ≠ 0, we can show that the constraint Δ z(c0,x1) produces the same PDF as the constraints Δ z(xi,…) (see Appendix I).

What is interesting about Δ z(c0,x1) is its nonlinear nature. It represents a spring connecting c0 and x1, with precisely the right properties to accurately substitute for the larger nonlinear system Δ z}(c0, x1, …). The accuracy is affected by how closely the reduced system follows two assumptions that were made as follows.

  1. The displacement between c0 and x1 is close to Formula.

  2. None of the variables x2,n participate in other constraints in the system.

These assumptions are not always fully met, especially the second one. Nonetheless, we will show in experiments with large outdoor systems that even very drastic variable reductions, such as those in Fig. 1, give accurate results.

E. Marginalization

In step 3, we require the reduction of an information matrixFormula to extract a marginal distribution between two of the vector variables. Just deleting the unwanted variable rows and columns would give the conditional distribution Formula. This distribution significantly overestimates the confidence in the connection, compared to the marginal, since the uncertainty of the auxiliary variables is disregarded [12]. The correct way to marginalize is to convert Formula to its covariance form by inversion, delete all but the entries for c0 and x1, and then, invert back to the information matrix. A shortcut to this procedure is the Schur complement [14], [24], [35], which is also useful in solving sparse versions of (9). We start with a block version of this equation, partitioning the variables into frames c and features q: Formula TeX Source $$\left[\matrix{H_{11} & H_{12}\cr H_{12}^{\ssr T} & H_{22} }\right]\left[\matrix{ \delta c\cr \delta q }\right] =\left[\matrix{ -J_c^{\ssr T} W_c {\Delta z}(c)\cr -J_q^{\ssr T} W_q {\Delta z}(q) }\right].\eqno{\hbox{(12)}}$$Now, we define a reduced form of this equation:Formula TeX Source $$\bar H_{11} \delta c = -g\eqno{\hbox{(13)}}$$withFormula TeX Source $$\eqalignno{\bar H_{11} & \equiv H_{11} - H_{12} H_{22}^{-1} H_{12}^{\ssr T}&\hbox{(14)}\cr-g & \equiv -J_c^{\ssr T} W_c {\Delta z}(c) - H_{12} H_{22}^{-1}J_q^{\ssr T} W_q {\Delta z}(q) .&\hbox{(15)}}$$Here, the matrix equation (13) involves only the variables c. There are two cases where this reduction is useful.

  1. Constraint reduction: The reduced Hessian Formula is the information matrix for the variables c, with variables q marginalized out. Equation (15) gives a direct way to compute the marginalized Hessian in step 3 of constraint reduction.

  2. Visual odometry: For the typical situation of many features and few frames, the reduced equation offers an enormous savings in computing NLSQ, with the caveat: H22 must be easily invertible. Fortunately, for features the matrix H22 is diagonal since features only have constraints with frames, and thus, the Jacobian J in J WJ affects just H12.


Data Association and Landmarks

The raw material for constraints comes from data association between image features. We have implemented a method for matching features across two camera frames that serves a dual purpose. First, it enables incremental estimation of camera motion for tracking trajectories (VO). Second, on returning to an area, we match the current frame against previous frames that serve as landmarks for the area. These landmark frames are simply the skeleton system that is constructed as the robot explores an area—a reduced set of frames, connected to each other by constraints. Note that landmark frames are not the same as feature landmarks normally used in nondelayed EKF treatments of visual SLAM (VSLAM). Features are only represented implicitly by their projections onto camera frames. Global and local registration is done purely by matching images between frames and generating frame–frame constraints from the match.

A. Matching Frames

Consider the problem of matching stereo frames that are close spatially. Our goal is to precisely determine the motion between the two frames, based on image feature matches. Even for incremental frames, rapid motion can cause the same feature to appear at widely differing places in two successive images; the problem is even worse for wide-baseline matching. The images of Fig. 5 show some difficult examples from urban and off-road terrain. Note the significant shift in feature positions.

Figure 5
Fig. 5. Matching between (top) two frames in an urban scene and (bottom) rough terrain (bottom). Objects that are too close, such as the bush in the bottom right image, cannot be found by stereo, and, therefore, have no features. Inlier matches for the best consensus estimate are in cyan; other features found that are not part of the consensus are in magenta. The upper pair is about 5 m between frames, and there are moving objects. The lower pair has significant rotation and translation.

One important aspect of matching is using image features that are stable across changes of scale and rotation. While scale-invariant feature transform (SIFT)[27] and speeded up robust features (SURF) [20] are the features of choice, they are not suitable for real-time implementations (15 Hz or greater). Instead, we use a novel multiscale center-surround feature called CenSure [26]. In previous research, we have shown that this operator has the requisite stability properties, but is just slightly more expensive than the Harris operator.

In any image feature matching scheme, there will be false matches. In the worst cases of long-baseline matching, sometimes only 10% of the matches are good ones. We use the following robust matching algorithm to find the best estimate, taking advantage of the geometric constraints imposed by rigid motion.

Three matched points give a motion hypothesis between the frames [19]. The hypothesis is checked for inliers by projecting all feature points onto the two frames (3). Features that are within 2 pixels of their projected position are considered to be inliers—note that they must project correctly in all four images. The best hypothesis(most inliers) is chosen and optimized using the NLSQ technique of Section III-E. Some examples are shown in Fig. 5 with matched and inlier points indicated.

This algorithm is used for both incremental tracking and wide-baseline matching, with different search parameters for finding matching features. Note that we do not assume any motion model or other external aids in matching.

B. Visual Odometry

Consensus matching is the input to a VO process for estimating incremental camera motion, and for computing reduced skeleton frames. To make incremental motion estimation more precise, we incorporate several additional methods.

  1. Key frames: If the estimated distance between two frames is small, and the number of inliers is high, we discard the new frame. The remaining frames are called key frames. Typical distances for key frames are about 0.1–0.5 m, depending on the environment.

  2. Incremental bundle adjustment: A sliding window of about 12keyframes (and their features) is input to the NLSQ optimization. Using a small window significantly increases the accuracy of the estimation [11], [26], [31], [39].

  3. IMU data: If an IMU is available, it can decrease the angular drift of VO, especially tilt and roll, which are referenced to gravity normal [26], [38]. We show results both with and without IMU assistance.

To construct the skeleton, we start with the sliding bundle adjustment window, which can be computed very efficiently using the marginalization technique of Section III-E [11], [31]. The sliding window contains at least one frame ci that is marked as a skeleton frame. When a new key frame ck is added to the window, the oldest frame in the window is dropped, and the Hessian Formula is recomputed. The marginalization operation eliminates the feature variables from the Hessian, leaving just constraints between the camera frames. If the most recent frame is a sufficient distance from the most recent skeleton frame in the window, it is added to the skeleton frame by further reducing Formula to contain just ci and ck, and then lifting the constraint.

IMU measurements can easily be added to the NLSQ estimation. The following equations describe IMU measurements of gravity normal and yaw angle increments:Formula TeX Source $$\eqalignno{g_i &= h_g(c_i)&\hbox{(16)}\cr\Delta \psi_{i-1, i} &= h_{\Delta \psi}(c_{i-1}, c_{i}).&\hbox{(17)}}$$The function hg(c) returns the deviation of the frame c in pitch and roll from gravity normal.hΔψ(ci−1,ci) is just the yaw angle difference between the two frames. Using accelerometer data acting as an inclinometer, with a very high noise level to account for unknown accelerations, is sufficient for (16) to constrain roll and pitch angles and completely avoid drift. For yaw angles, only a good IMU can increase the accuracy VO estimates. In the experiments section, we show results both with and without IMU aiding.

C. Wide Baseline Matching

To register a new frame against a landmark frame, we use the same consensus-matching technique as for VO. This procedure has several advantages.

  1. Sensitivity over scale: The CenSure features are scale independent, and so, are stable when there has been significant viewpoint change.

  2. Efficiency: The CenSure features and stereo are already computed for VO, so wide baseline matching just involves steps 3–5 of the consensus match algorithm. This can be done at over30 Hz.

  3. Specificity: The geometric consistency check almost guarantees that there will be no false positives, even using a very low inlier threshold.

Fig. 5 shows an example of wide baseline matching in the upper urban scene. The distance between the two frames is about 5 m, and there are distractors such as cars and pedestrians. The buildings are very self-similar, so the geometric consistency check is very important in weeding out bad matches. In this scene, there are about 800 features per image, and only 100 inliers for the best estimate match.

To analyze sensitivity and selectivity, we computed the inlier score for every possible cross-frame match in the 700 frame urban sequence shown in Fig. 1. Fig. 6 shows the results, by number of inliers. Along the diagonal, matching occurs for several frames, to an average of 10 m along straight stretches. The only off-diagonal matching occurs at the loop closure. The lower scores on closure reflect the sideways offset of the vehicle from its original trajectory. Consensus matching produced essentially perfect results for this dataset, giving no false positives, and correctly identifying loop closures.

Figure 6
Fig. 6. Matching statistics on a 700 image urban scene. The number of inliers in matches between frames is color coded; inliers counts below 30 were eliminated. Only the upper right triangle is used. Note the (a) longer matching areas where the vehicle goes along a straight stretch and the (b) very long matches at the end as the car slows down along a straightway. (c) Off-diagonal is the set of matched frames closing the loop.

We also looked at off-road images over very large trajectories(Section V-B). The challenging part of these datasets is the large amount of similarly textured areas: grass, trees, etc., with no distinguishing man-made objects (see Fig. 5, bottom). Nevertheless, specificity was perfect, with no false positives across tens of thousands of images. Sensitivity was not as good; it was sometimes difficult to match scenes when the viewpoints were separated by 10 m or more. Partly this results from our use of normalized cross correlation to obtain matches; partly it is the difficulty of finding the correct match in similarly textured areas.

While we did not observe any false positives in any dataset we considered, we did not do any rigorous experiments to verify this performance. Geometric consistency is such a strong constraint that we suspect that it is almost impossible to find situations in which it will produce false positives, given a threshold of at least 30 inliers.

D. Place Recognition

We implement a simple but effective scheme for recognizing places that have already been visited, using the consensus match just presented. This scheme is not intended to do “kidnapped” recognition, matching an image against a large database of places [5], [32]. Instead, it functions when the system has a good idea of where it is relative to the landmark frames that have been accumulated. In a local area, for example, the system always stays well registered, and has to search only a small number of frames for matches. Over larger loops, the method is linear in the size of the area it must search.

The main problem that arises is which landmark frames should serve as candidates to match to the current frame. Ideally, we would use the relative covariance estimates computed from the Bayes net as a gate, that is, Δ x W Δ x < d, where Δ x is the distance between the current frame cn and a candidate landmarkci [9]. However, computing relative covariance involves marginalizing all variables in the system Hessian, and is too expensive to be done online [12]. Instead, we use the skeleton to provide an approximate covariance that is conservative. In the skeleton net, we find the shortest path fromcn to ci, and then, compose the incremental covariances along the path to get an estimate of the relative covariance. As in [3], an efficient search of the net (O(n log n)) can be done using Dijkstra's method [37]. In practice, this method needs several milliseconds on the largest graphs we have attempted; for very large spaces, a hierarchical skeleton would be appropriate to keep computational costs reasonable.

One property we would like to observe is that the space consumed by the skeleton should be proportional to the area visited. On continual movement in an area, the skeleton will continue to grow unless frames are removed. Our method here is to marginalize out any frames that are within a distance d and angle a of an existing skeleton frame. As an example, we show an indoor dataset consisting of 8490 keyframes in a small 12 m × 12 m area (see Fig. 7). The skeleton graph was produced using a landmark distance of 2 m and 10°, reducing the graph to 272 frames with 23 cross-frame links. The final skeleton graph is solved by NLSQ in 65 ms.

Figure 7
Fig. 7. Skeleton graph for an indoor dataset (dataset courtesy of R. Sim). Distances are in meters. Dataset has 8490 key frames (a small set of these is shown in blue). The skeleton graph is shown in red; some of the longer range links are from landmark frame matching, and some are from reductions that eliminate frames. Skeletons nodes that are close are at different orientations. NLSQ computation for the skeleton graph takes 65 ms.

E. Real-Time Implementation

Our VO system has been implemented and runs at 15 Hz on 512 × 384 resolution stereo images using a 2 GHz processor, and is in habitual use in demo runs on an outdoor robot [1], [2], [26]. At this point, we have implemented the rest of the frameSLAM system only in processing logs for which we report timing results; we are transitioning the system to an outdoor robot, and will report system results soon.

The main strategy for adding registration matches is to use a dual-core machine: run VO on one core, and wide-baseline matching and skeleton computations on another. Wide-baseline matching is on an“anytime” basis: we match the current keyframe against a set of candidates that pass the Mahalanobis gate, until the next keyframe comes in, when we start matching that. Whenever a good match is found, the system performs an NLSQ update, either on the whole skeleton or a smaller area around the current keyframe, depending on the situation.

Of course, this strategy does not address the significant problems of frame storage and retrieval for very large systems, as done by recent place-recognition algorithms [5], [32]. It may also miss some matches, since it does not explore all possible candidates. But, for skeletons of less than 10K frames, where the frames can be kept in RAM, it works well.

For efficient calculation of the NLSQ updates, we use a preconditioned conjugate gradient algorithm (PCG) that has been shown to have good computational properties—in many cases, the complexity increases linearly with the size of the skeleton [23]. For the large outdoor dataset, Fig. 8 plots the PCG time against the size of the skeleton. Note that these are for a very large loop closure of a combined 10 km trajectory—typically only a small local area needs to be optimized.

Figure 8
Fig. 8. Large-scale timings for a skeleton graph. The trajectory of the graph was ∼10 km. The NLSQ time is linear in the size of the skeleton for this graph.


It is important to test VSLAM systems on data gathered from real-world platforms. It is especially important to test under realistic conditions: narrow field of view (FOV) cameras, full 3-D motion, and fast movement, as these present the hardest challenge for matching and motion estimation. For this research, we used three widely different sources of stereo frames.

  1. An indoor sequence consisting of 22K frames in a small area, moving slowly (courtesy of Sim [10]). The stereo system had a wide FOV, narrow baseline, and was purely planar motion.

  2. An outdoor automobile sequence, the Versailles Rond dataset(courtesy of Comport [4]). This dataset has 700frames with fast motion, 1 m baseline, narrow FOV, covering about 400m.

  3. Two outdoor rough-terrain sequences of about 5 km each from the Crusher project [6]. Baseline is 0.5 m, narrow FOV, and fast, full 3-D motion with lots of bouncing on rough terrain. These datasets offer a unique opportunity, for two reasons. First, they are autonomous runs through the same way points; they overlap and cross each other throughout, and end up at the same place for a large loop closure. Second, the dataset is instrumented with both IMU and real-time kinematic (RTK)GPS data, and our frameSLAM results can be computed for both aided and unaided VO, and compared to ground truth.

    A certain number of frames, about 2%, cannot be matched for VO in this dataset. We fill in these values with IMU data.

A. Planar Datasets

The indoor and Versailles Rond datasets were used throughout the paper to illustrate various aspects of the frameSLAM system. Because there is no ground truth, they offer just anecdotal evidence of performance: the indoor dataset illustrates that the skeleton does not grow with continued traversal of a region; the Versailles Rond dataset shows loop closure over a significant distance. From the plots of Fig. 9, the improvement in fidelity to the environment is apparent. One possible measure of the improvement is the planarity of the trajectories. We fit a plane to all points, then measure the rms error perpendicular to the plane. The following table lists the relevant statistics for the two runs. The timings are for NLSQ optimization of the entire skeleton.

Figure 9
Fig. 9. (Top) Indoor dataset showing raw VO (green crosses are frames and cyan dots are features) versus frameSLAM result (red trajectory and blue features). Note that the frameSLAM features correspond to a rectangular set of rooms, while the VO results are skewed after the loop. (Bottom) Versailles Rond urban dataset. Blue is raw VO, red is frameSLAM result (see Fig. 1 for cross-frame links). Note that the Z offset of the loop has been corrected.

B. Crusher Datasets

The Crusher data come from two autonomous 5 km runs, which overlap significantly and form a loop. There are 20K keyframes in the first run, and 22K in the second. Over 20 million features are found and matched in the keyframe images, and roughly three times that in the total image set. Fig. 10 (top)gives an idea of the results from raw VO on the two runs. There is significant deviation in all dimensions by the end of the loop(circled in red). With a skeleton of frames at 5 m intervals, there were a total of 1978 reduced frames and 169 wide-baseline matches between the runs using consensus matching. These are shown as red links in the top plot.

Figure 10
Fig. 10. XYZ plot of two Crusher trajectories (blue and green) of about 5 km each. (Top) Raw VO, with cross-matched frames with red links. The start and finish of both runs is at the left, circled in red; the runs are offset vertically by 20 m at the begninning to display the links. Note the loop closure between the end of the blue run and the beginning of the green run. (Middle) FrameSLAM-corrected system for a 5-m skeleton. The ground truth for the blue run is in red. The relative positions of the green and blue runs have been corrected and the loop closed. (Bottom) Excellent result for IMU-aided VO.
Figure 11
Fig. 11. RMS error as a function of distance. For every pair of frames, the error in their relative pose is plotted as a function of the distance between the frames. (Top) Unaided VO. The blue line shows poor open-loop VO performance, even for short distances; frameSLAM (red lines) gives excellent results for these distances. Skeleton reduction factor has negligible influence. (Bottom) IMU-aided VO.

The middle plot shows the result of applying frameSLAM to a 5 m skeleton. Here, the red trajectory is ground truth for the blue run, and it matches the two runs at the beginning and end of the trajectory(circled on the left). The two runs are now consistent with each other, but still differ from ground truth at the far right end of the trajectory. This is to be expected: the frameSLAM result will only be as good as the underlying odometry when exploring new areas.

If VO is aided by an IMU (Section IV-B), global error is reduced dramatically. The bottom plot shows the frameSLAM result using the aided VO—note that the blue run virtually overlays the RTK GPS ground truth trajectory.

How well does the frameSLAM system reduce errors from open-loop VO? We should not expect any large improvement in long-distance drift at the far point of trajectories since SLAM does not provide any global input that would correct such drift. But, we should expect dramatic gains in relative error, that is, between frames that are globally close since SLAM enforces consistency when it finds correspondences. To show this, we compared relative pose of every frame pair to ground truth, and plotted the results as a function of distance between the frames. Fig. 11 shows the results for both raw and aided VO.For raw VO (top plot), the open-loop errors are very high, because of the large drift at the end of the trajectories (see Fig. 10, top). With the cross-links enforcing local consistency, frameSLAM gives much smaller errors for short distances, and degrades with distance, a function of yaw angle drift. Note that radical reductions in the size of the skeleton, from 1/4 to 1/400 of the original keyframes, have negligible effect, proving the accuracy of the reduced system.

A similar story exists for IMU-aided VO. Here, the errors are much smaller because of the smaller drift of VO. But, the same gains in accuracy occur for small frame distances, and again, there is almost no effect from severe skeleton reductions until after 300 m.



We have described frameSLAM, a system for visual SLAM that is capable of precise, real-time estimation of motion, and also, is able to keep track of local registration and global consistency. The key component is a skeleton system of visual frames that act both as landmarks for registration, and as a network of constraints for enforcing consistency. FrameSLAM has been validated through testing in a variety of different environments, including large-scale, challenging off-road datasets of 10 km.

We are currently porting the system to two live robot platforms [6], [25], with the intent of providing completely autonomous off-road navigation using just stereo vision. The VO part of the system has already been proven over a year of testing, but cannot eliminate the long-term drift that accrues over a run. With the implementation of the skeleton graph, we expect to be able to assess the viability of the anytime strategy for global registration presented in Section IV-E.


Nonlinear Constraint “Lifting”

Let c0, x1, and q be a set of variables with measurement cost functionFormula TeX Source $${\Delta z}^{\ssr T} W_i {\Delta z}\eqno{\hbox{(18)}}$$and measurement vector Formula. For c0 fixed at the origin, letFormula be the Hessian of the reduced form of (18), according to step 3 of the constraint reduction algorithm. We want to show that the cost functionFormula TeX Source $${{\Delta z}'}^{\ssr T} \hat\Lambda_i {\Delta z}'\eqno{\hbox{(19)}}$$has approximately the same value at the ML estimate Formula, wherez′(c0,x1) = T0 x1 and Formula. To do this, we show that the likelihood distributions are approximately the same.

The cost function (18) has the joint normal distributionFormula TeX Source $$P(\hat z\vert x_1, {\bf q}) \propto \exp\left(-{1\over 2} {\Delta z}^{\ssr T} W_i {\Delta z}\right).\eqno{\hbox{(20)}}$$We want to find the distribution (and covariance) for the variablex1. Let x = x1,q, and f(x) the cost function (18). Expanding f(xx) in a Taylor series, the cost function becomesFormula TeX Source $$\eqalignno{& (\hat z - f({\bf x}))^{\ssr T} W (\hat z - f({\bf x}))&\hbox{(21)}\cr& \simeq (\hat z - f({\bf x}) - J\delta x)^{\ssr T} W (\hat z - f({\bf x})-J\delta x)&\hbox{(22)}\cr& = \delta x_1^{\ssr T} \hat\Lambda_1 \delta x_1 - 2 {\Delta z} W J \delta x +\hbox{const}&\hbox{(23)}}$$where we have used the Schur equality to reduce the first term of the third line. As Δ z vanishes at Formula, the last form is quadratic in x1, and so, is a joint normal distribution over x1.From inspection, the covariance is Formula. Hence, the ML distribution isFormula TeX Source $$P(x_1\vert \hat z) \propto \exp\left(-{1\over 2} (\hat x_1 - x_1)^{\ssr T}\hat\Lambda_1 (\hat x_1 - x_1)\right).\eqno{\hbox{(24)}}$$The cost function for this PDF is (19) for c0 fixed at the origin, as required. When c0 is not the origin, the cost function (18) can be converted to an equivalent function by transforming all variables to c0's coordinate system. The value stays the same because the measurements are localized to the positions of c0 and x1—any global measurement, for example, a GPS reading would block the equivalence. Thus, for arbitrary c0, (20) and (24) are approximately equal just whenx1 is given in c0's coordinate system. Since (24) is produced by the cost function (19), we have the approximate equivalence of the two cost functions.


Manuscript received December 16, 2007; revised June 24, 2008. First published October 10, 2008; current version published nulldate. This paper was recommended for publication by Associate Editor J. Neira and Editor L. Parker upon evaluation of the reviewers comments. This work was supported by the United States Air Force under Contract FA8650-04-C-7136.

K. Konolige is with Willow Garage, Menlo Park, 94025 CA USA, and also with Stanford University, Stanford, CA 94305 USA (e-mail:

M. Agrawal is with SRI International, Menlo Park, CA 94025-3493 USA (e-mail:

Color versions of one or more of the figures in this paper are available online at

1. We assume a standard calibration for the stereo pair, where the internal parameters are equal, and the right camera is displaced along the camera frame X axis by an amount B.


1. Real-time localization in outdoor environments using stereo vision and inexpensive GPS

M. Agrawal, K. Konolige

Proc. Intl. Conf. Pattern Recog. (ICPR), 2006-08, 1063–1068

2. Rough terrain visual odometry

M. Agrawal, K. Konolige

presented at the Int. Conf. Adv. Robot. (ICAR), Jeju, Korea 2007-08

3. An atlas framework for scalable mapping

M. Bosse, P. Newman, J. Leonard, M. Soika, W. Feiten, S. Teller

Proc. Int. Conf. Robot. Autom. (ICRA), 2003 pp. 1899–1906

4. Accurate quadrifocal tracking for robust 3D visual odometry

A. Comport, E. Malis, P. Rives

Proc. Int. Conf. Robot. Autom. (ICRA), Roma, 2007, pp. 40–45

5. Probabilistic appearance based navigation and loop closing

M. Cummins, P. M. Newman

Roma, Italy
Proc. Intl. Conf. Robot. Autom. (ICRA), 2007, 2042–2048

6. DARPA Crusher Project

[Online]. Available at: http://www.rec.ri.cmu. edu/projects/autonomous/index.htm

7. Real-time simultaneous localisation and mapping with a single camera

A. Davison

Proc. Int. Conf. Comput. Vis. (ICCV), 2003, 1403–1410

8. Monoslam: Real-time single camera slam

A. J. Davison, I. D. Reid, N. D. Molton, O. Stasse

IEEE Trans. Pattern Anal. Mach. Intell., Vol. 29, issue (6) pp. 1052–1067, 2007-06

9. A solution to the simultaneous localization and map building (SLAM) problem

M. Dissanayake, P. Newman, S. Clark, H. Durrant-Whyte, M. Csorba

IEEE Trans. Robot. Autom., Vol. 17, issue (3) pp. 229–241, 2001-06

10. SigmaSLAM: Stereo vision slam using the Rao-BlackWellised particle filter and a novel mixture proposal distribution

P. Elinas, R. Sim, J. J. Little

Int. Conf. Robot. Autom. (ICRA), presented at the, Rome, Italy 2007

11. Bundle adjustment rules

C. Engels, H. Stewnius, D. Nister

Photogrammetric Comput. Vision, presented at the, Bonn, Germany, 2006-09

12. Exactly sparse delayed-state filters for view-based SLAM

R. M. Eustice, H. Singh, J. J. Leonard

IEEE Trans. Robot., Vol. 22, issue (6) pp. 1100–1114, 2006-12

13. Visually mapping the RMS titanic: Conservative covariance estimates for SLAM information filters

R. M. Eustice, H. Singh, J. J. Leonard, M. R. Walter

Intl. J. Robot. Res., Vol. 25, issue (12) pp. 1223–1242, 2006

14. A proof for the approximate sparsity of slam information matrices

U. Frese

Proc. Int. Conf. Robot. Autom. (ICRA), 2005, 329–335

15. A multilevel relaxation algorithm for simultaneous localisation and mapping

U. Frese, P. Larsson, T. Duckett

IEEE Trans. Robot., Vol. 21, issue (2) pp. 1–12, 2005-04

16. Efficient estimation of accurate maximum likelihood maps in 3D

G. Grisetti, S. Grzonka, C. Stachniss, P. Pfaff, W. Burgard

Proc. Int. Conf. Intell. Robots Syst. (IROS), 2007, 3472–3478

17. A tree parameterization for efficiently computing maximum likelihood maps using gradient descent

G. Grisetti, C. Stachniss, S. Grzonka, W. Burgard

Robot.: Sci. Sys. (RSS), presented at the, Atlanta, GA 2007

18. Incremental mapping of large cyclic environments

J. Gutmann, K. Konolige

Proc. IEEE Int. Symp. Comput. Intell. Robot. Autom. (CIRA), Monterey, CA 1999-11, pp. 318–325

19. Multiple View Geometry in Computer Vision

R. Hartley, A. Zisserman

Cambridge, U.K.
Multiple View Geometry in Computer Vision, Cambridge Univ. Press 2000

20. Surf: Speeded up robust features

T. T. H. Bay, L. V. Gool

Eur. Conf. Comput. Vision, presented at the, Graz, Austria 2006-05

21. Outdoor delayed-state visually augmented odometry

V. S. Ila, J. Andrade, A. Sanfeliu

IFAC Symp. Intell. Auton. Vehicles, presented at the, Toulouse, France 2007

22. iSAM: Fast incremental smoothing and mapping with efficient data association

M. Kaess, A. Ranganathan, F. Dellaert

Proc. Int. Conf. Robot. Autom. (ICRA), Rome, 2007, pp. 1670–1677

23. Large-scale map-making

K. Konolige

Proc. Nat. Conf. AI (AAAI), 2004, 457–463

24. Frame-frame matching for realtime consistent visual mapping

K. Konolige, M. Agrawal

Proc. Int. Conf. Robot. Autom. (ICRA), Roma, Italy, 2007, pp. 2803–2810

25. Outdoor mapping and navigation using stereo vision

K. Konolige, M. Agrawal, R. C. Bolles, C. Cowan, M. Fischler, B. Gerkey

Int. Symp. Exp. Robot. (ISER), presented at the, Sao Paulo, Brazil, 2007

26. Large scale visual odometry for rough terrain

K. Konolige, M. Agrawal, J. Solá

Int. Symp. Res. Robot. (ISRR), presented at the, Menlo Park, CA 2007-11

27. Distinctive image features from scale-invariant keypoints

D. G. Lowe

Int. J. Comput. Vision, Vol. 60, issue (2) pp. 91–110, 2004

28. Globally consistent range scan alignment for environment mapping

F. Lu, E. Milios

Auton. Robots, Vol. 4 pp. 333–349, 1997

29. Gamma-SLAM: Stereo visual SLAM in unstructured environments using variance grid maps

T. K. Marks, A. Howard, M. Bajracharya, G. W. Cottrell, L. Matthies

Int. Conf. Robot. Autom., presented at the, Rome, Italy 2007

30. Large-scale robotic 3-D mapping of urban structures

M. Montemerlo, S. Thrun

Int. Symp. Exp. Robot. (ISER), presented at the, Singapore 2004

31. Real time localization and 3D reconstruction

E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, P. Sayd

Proc. Comput. Vis. Pattern Recog. (CVPR), 2006-06, vol. 1, 363–370

32. Scalable recognition with a vocabulary tree

D. Nister, H. Stewenius

Proc. Comput. Vis. Pattern Recog. (CVPR)'06, pp. 2161–2168

33. Fast iterative alignment of pose graphs with poor estimates

E. Olson, J. Leonard, S. Teller

Proc. Int. Conf. Robot. Autom. (ICRA), 2006 pp. 2262–2269

34. Divide and conquer: EKF SLAM in o(n)

L. Paz, J. Tardós, J. Neira

IEEE Trans. Robot., Vol. 24, issue (5) 2008-10

35. Constant time sliding window filter slam as a basis for metric visual perception

G. Sibley, G. S. Sukhatme, L. Matthies

Int. Conf. Robot. Autom. Workshop, presented at the, Rome, Italy 2007

36. Undelayed initialization in bearing only slam

J. Solá, M. Devy, A. Monin, T. Lemaire

Proc. Int. Conf. Robot. Autom. (ICRA), 2005, 2499–2504

37. Learning maps in 3D using attitude and noisy vision sensors

B. Steder, G. Grisetti, C. Stachniss, S. Grzonka, A. Rottmann, W. Burgard

IEEE Int. Conf. Intell. Robots Syst. (IROS), San Diego, CA, 2007 pp. 644–649

38. Motion estimation from image and inertial measurements

D. Strelow, S. Singh

Int. J. Robot. Res., Vol. 23, issue (12) pp. 1157–1195, 2004

39. Visual odometry using sparse bundle adjustment on an autonomous outdoor vehicle

N. Sunderhauf, K. Konolige, S. Lacroix, P. Protzel

New York
Tagungsband Autonome Mobile Systeme, Springer-Verlag 2005

40. Bundle adjustment—A modern synthesis

B. Triggs, P. F. McLauchlan, R. I. Hartley, A. W. Fitzibbon

New York
Vision Algorithms: Theory and Practice, (Lecture Notes Computer Science), Springer-Verlag, 2000 pp. 298–375

41. A constrained optimization approach to globally consistent mapping

R. Unnikrishnan, A. Kelly

Proc. Int. Conf. Robot. Syst. (IROS), 2002, 564–569


Kurt Konolige

received the Ph.D. degree in computer science from Stanford University, Stanford, CA, in 1984.

Kurt Konolige He is currently a Senior Researcher at Willow Garage, Menlo Park CA. He is also a Consulting Professor of computer science at Stanford University, where he teaches a course in mobile robotics. He is the Co-Developer of the Pioneer and AmigoBot robot line and the Saphira robot control architecture. His current research interests include real-time perception, navigation for mobile robots, visual perception, mapping, and navigation using probabilistic techniques.

Dr. Konolige is a Fellow of Association for the Advancement of Artificial Intelligence (AAAI).

Motilal Agrawal

received the B.Tech. degree in computer science from the Indian Institute of Technology, Delhi, India, in 1998 and the Ph.D. degree in computer vision from the University of Maryland, Baltimore, in 2002.

Motilal Agrawal In January 2003, he joined the Artificial Intelligence Center, SRI International, Menlo Park, CA, where he is currently a Senior Computer Scientist with the Perception Program. He has organized two workshops at the Computer Vision and Pattern Recognition (CVPR) and Intelligent Robots Systems (IROS) Conferences. His current research interests include real-time perception and navigation for mobile robots and novel algorithms for real-time visual localization. He is the author or coauthor of several papers published in prominent computer vision and robotics conferences.

Cited By

Information-Based Compact Pose SLAM

Robotics, IEEE Transactions on, vol. 26, issues (1), p. 78–93, 2010


IEEE Keywords

No Keywords Available

More Keywords

No Keywords Available


No Corrections




5,584 KB


22,186 KB

Indexed by Inspec

© Copyright 2011 IEEE – All Rights Reserved